portfolio/platform-projects/hpc-slurm-ai-cluster/docs/troubleshooting-cases.md
2026-06-04 19:41:05 +00:00
# Troubleshooting Cases
## `IDLE+NOT_RESPONDING` after node maintenance
Symptoms: `sinfo` shows `idle*` or `scontrol show node` shows `IDLE+NOT_RESPONDING`.
Actions:
```bash
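# slurmd runs on the affected compute node; restart MUNGE authentication first.
systemctl restart munge
systemctl restart slurmd
# slurmctld runs on the controller host.
systemctl restart slurmctld
# Clear leftover node state. Not every update applies to every state, so failures are tolerated.
scontrol update NodeName=<node> State=RESUME || true
scontrol update NodeName=<node> State=UNDRAIN || true
scontrol update NodeName=<node> State=IDLE || true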
```
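To confirm whether the actions helped, the state string can be checked in a script. The helper below is a minimal sketch, not part of the original playbook: the function name `is_not_responding` is hypothetical, and it only inspects a state string passed as an argument (a trailing `*` from `sinfo`, or the `NOT_RESPONDING` flag from `scontrol show node`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) when a Slurm node state string
# marks the node as not responding.
is_not_responding() {
  local state="$1"
  case "$state" in
    # sinfo appends '*' to the state; scontrol adds the NOT_RESPONDING flag.
    *\*|*NOT_RESPONDING*) return 0 ;;
    *) return 1 ;;
  esac
}
```

One possible use, assuming a node named `node01`: `is_not_responding "$(sinfo -h -n node01 -o '%t')" && echo "still not responding"`.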
## Missing GPU TRES
Symptoms: `sacctmgr` fails with `no TRES known by type gres/gpu`.
Fix: add `gres/gpu` to `AccountingStorageTRES` in `slurm.conf` (`AccountingStorageTRES=...,gres/gpu`), restart or reconfigure Slurm, run a GPU job, then verify with `sacctmgr show tres`.
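The verification step can be scripted. This is a rough sketch under assumptions: the function name `has_gpu_tres` is hypothetical, and it expects pipe-delimited `Type|Name|ID` rows on stdin, which is what `sacctmgr -n -P show tres` is assumed to print.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read pipe-delimited TRES rows (Type|Name|ID) on stdin
# and succeed only if a row with Type "gres" and Name "gpu" is present.
has_gpu_tres() {
  awk -F'|' '$1 == "gres" && $2 == "gpu" { found = 1 } END { exit !found }'
}
```

For example: `sacctmgr -n -P show tres | has_gpu_tres || echo "gres/gpu is not registered yet"`.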
## SlurmDBD objects already exist
Symptoms: `sacctmgr` returns `Nothing new added` or `Already existing`.
Fix: make Ansible tasks idempotent: attempt the change, tolerate known existing-object messages, then normalize state with `modify`.
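The same attempt-then-tolerate pattern can be sketched in shell, which is roughly what an Ansible `command` task with a custom failure condition would express. The wrapper name `run_idempotent` is hypothetical; the two tolerated messages are the ones listed in the symptoms above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, echo its output, and treat known
# "object already exists" messages as success instead of failure.
run_idempotent() {
  local out rc
  # Capture stdout and stderr together, plus the exit status.
  out="$("$@" 2>&1)" && rc=0 || rc=$?
  printf '%s\n' "$out"
  if [ "$rc" -eq 0 ]; then
    return 0
  fi
  case "$out" in
    *"Nothing new added"*|*"Already existing"*) return 0 ;;
  esac
  # Any other failure is passed through unchanged.
  return "$rc"
}
```

A possible call, with `research` as a placeholder account name: `run_idempotent sacctmgr -i add account research`, followed by a `sacctmgr -i modify account ...` command to normalize the state.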