Troubleshooting Cases
IDLE+NOT_RESPONDING after node maintenance
Symptoms: sinfo shows idle* or scontrol show node shows IDLE+NOT_RESPONDING.
Actions:
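# On the affected compute node: restart authentication, then the node daemon
systemctl restart munge
systemctl restart slurmd
# On the controller host: restart the controller daemon
systemctl restart slurmctld
# Clear leftover node state; "|| true" ignores updates that do not apply
scontrol update NodeName=<node> State=RESUME || true
scontrol update NodeName=<node> State=UNDRAIN || true
scontrol update NodeName=<node> State=IDLE || true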
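After the actions above, it helps to confirm that the node has actually dropped the NOT_RESPONDING flag. The following is a minimal sketch, not part of the original procedure: the helper names node_state and node_not_responding are hypothetical, and it assumes scontrol show node prints a whitespace-separated State=... field.

```shell
# Sketch only: helper names are hypothetical, not part of Slurm.

# Read `scontrol show node` output on stdin and print the value of the
# State= field (for example IDLE+NOT_RESPONDING).
node_state() {
    awk '{
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^State=/) {      # only fields that start with State=
                print substr($i, 7)    # drop the 6-character "State=" prefix
                exit
            }
        }
    }'
}

# Succeed (exit 0) if the given node still reports NOT_RESPONDING.
node_not_responding() {
    case "$(scontrol show node "$1" | node_state)" in
        *NOT_RESPONDING*) return 0 ;;
        *) return 1 ;;
    esac
}
```

Usage: node_not_responding <node> && echo "still not responding". If the flag persists, checking the slurmd and munge service status on the node is a reasonable next step.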
Missing GPU TRES
Symptoms: sacctmgr fails with no TRES known by type gres/gpu.
Fix: add gres/gpu to the AccountingStorageTRES setting in slurm.conf (AccountingStorageTRES=...,gres/gpu), restart or reconfigure Slurm, run a GPU job, then verify with sacctmgr show tres.
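The verification step can be scripted. This is a minimal sketch under assumptions: the helper name has_gpu_tres is hypothetical, and it assumes that sacctmgr -nP show tres format=type,name prints one type|name pair per line, so a registered GPU TRES appears as gres|gpu.

```shell
# Sketch only: has_gpu_tres is a hypothetical helper name.

# Succeed (exit 0) if the accounting database knows the gres/gpu TRES.
# -n suppresses the header, -P prints pipe-delimited fields.
has_gpu_tres() {
    sacctmgr -nP show tres format=type,name | grep -qxF 'gres|gpu'
}
```

Usage: has_gpu_tres || echo "gres/gpu is not registered yet". Typed GPUs, if configured, would show up as separate entries and are not matched by this exact-line check.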
SlurmDBD objects already exist
Symptoms: sacctmgr returns Nothing new added or Already existing.
Fix: make the Ansible tasks idempotent. Attempt the change, treat the known existing-object messages as success, then normalize state with sacctmgr modify.
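The attempt, tolerate, normalize pattern can be sketched as a shell wrapper of the kind an Ansible command or shell task might call. This is an illustration under assumptions, not the original playbook: the helper names tolerate_existing and ensure_account are hypothetical, and the account and description are example values.

```shell
# Sketch only: helper names and example values are hypothetical.

# Run a command and echo its output. Treat the known sacctmgr
# "object already exists" messages as success, whatever the exit code.
tolerate_existing() {
    local out rc
    out="$("$@" 2>&1)"
    rc=$?
    printf '%s\n' "$out"
    [ "$rc" -eq 0 ] && return 0
    case "$out" in
        *"Nothing new added"*|*"Already existing"*) return 0 ;;
    esac
    return "$rc"
}

# Create the account if it is missing, then normalize its description
# with modify so repeated runs converge on the same state.
ensure_account() {
    local name="$1" desc="$2"
    tolerate_existing sacctmgr -i add account name="$name" description="$desc" || return $?
    sacctmgr -i modify account where name="$name" set description="$desc"
}
```

Usage: ensure_account research "Research group". Any other failure from the add step is still reported, so real errors such as an unreachable slurmdbd are not masked.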