# Troubleshooting Cases
## `IDLE+NOT_RESPONDING` after node maintenance
Symptoms: `sinfo` shows `idle*` (the trailing `*` marks a node that is not responding) or `scontrol show node` shows `IDLE+NOT_RESPONDING`.

Actions:
```bash
# munge and slurmd run on the affected compute node.
systemctl restart munge
systemctl restart slurmd
# slurmctld runs on the controller host.
systemctl restart slurmctld
# Clear the node state; `|| true` lets the sequence continue if an update is rejected.
scontrol update NodeName=<node> State=RESUME || true
scontrol update NodeName=<node> State=UNDRAIN || true
scontrol update NodeName=<node> State=IDLE || true
```
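
To confirm that the node actually recovered, the `State=` field of `scontrol show node` can be checked for the `NOT_RESPONDING` flag. The following is a minimal sketch, not part of the original procedure; the helper names `node_state` and `is_not_responding` and the node name `node01` are hypothetical.

```bash
#!/usr/bin/env bash

# Extract the State= field from `scontrol show node <node>` output read on stdin.
node_state() {
  sed -n 's/^[[:space:]]*State=\([^[:space:]]*\).*/\1/p' | head -n 1
}

# Return 0 if the state string still carries the NOT_RESPONDING flag.
is_not_responding() {
  case "$1" in
    *NOT_RESPONDING*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage on a live cluster (node01 is a placeholder):
#   state=$(scontrol show node node01 | node_state)
#   is_not_responding "$state" && echo "node01 is still not responding"
```

Parsing the text output keeps the check independent of the Slurm version; if the flag persists after the restarts, the state updates alone are unlikely to help.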
## Missing GPU TRES
Symptoms: `sacctmgr` fails with `no TRES known by type gres/gpu`.

Fix: add `gres/gpu` to `AccountingStorageTRES` in `slurm.conf` (`AccountingStorageTRES=...,gres/gpu`), restart or reconfigure Slurm, run a GPU job, and verify with `sacctmgr show tres`.
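
Before restarting anything, it can help to check whether the configuration already lists the GPU TRES. This is a rough sketch under assumptions: the function name `has_gpu_tres` is hypothetical, and the path `/etc/slurm/slurm.conf` in the usage comment is a common default that may differ on a given cluster.

```bash
#!/usr/bin/env bash

# Return 0 if the given slurm.conf lists gres/gpu in AccountingStorageTRES.
# Commented-out lines are skipped because the key must start the line.
has_gpu_tres() {
  local conf="$1"
  grep -i '^[[:space:]]*AccountingStorageTRES=' "$conf" \
    | grep -qi 'gres/gpu'
}

# Usage (path is an assumption):
#   has_gpu_tres /etc/slurm/slurm.conf || echo "gres/gpu missing from AccountingStorageTRES"
```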
## SlurmDBD objects already exist
Symptoms: `sacctmgr` returns `Nothing new added` or `Already existing`.

Fix: make the Ansible tasks idempotent. Attempt the change, tolerate the known existing-object messages, then normalize state with `sacctmgr modify`.
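
The same attempt, tolerate, normalize pattern can be sketched in shell. This is an illustration only, not the repository's actual task: the helpers `is_already_exists` and `add_tolerant` and the account name `research` are hypothetical, and the tolerated messages are limited to the two listed under Symptoms. An Ansible task could presumably apply the same test to the command output through `failed_when` and `changed_when`.

```bash
#!/usr/bin/env bash

# Return 0 if the sacctmgr output in $1 reports that the object already exists.
is_already_exists() {
  case "$1" in
    *"Nothing new added"*|*"Already existing"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Run the given command; treat known existing-object messages as success.
# Any other failure keeps the command's original exit code.
add_tolerant() {
  local out rc
  out=$("$@" 2>&1); rc=$?
  printf '%s\n' "$out"
  if [ "$rc" -eq 0 ] || is_already_exists "$out"; then
    return 0
  fi
  return "$rc"
}

# Usage (account name and description are placeholders):
#   add_tolerant sacctmgr -i add account research Description="Research"
#   sacctmgr -i modify account where name=research set Description="Research"
```

Matching on specific messages rather than ignoring every non-zero exit keeps real failures, such as an unreachable SlurmDBD, visible.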