portfolio/platform-projects/hpc-slurm-ai-cluster/docs/troubleshooting-cases.md
Mateusz Suski d300d490f5
Add Slurm AI/HPC cluster platform project
2026-06-04 19:42:45 +00:00

Troubleshooting Cases

IDLE+NOT_RESPONDING after node maintenance

Symptoms: sinfo shows the node as idle* (the trailing asterisk marks a node that is not responding), or scontrol show node reports IDLE+NOT_RESPONDING.

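Actions: restart munge and slurmd on the affected node and slurmctld on the controller, then clear any leftover node state. Each scontrol call ends in || true so the sequence continues when a state change does not apply.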

systemctl restart munge
systemctl restart slurmd
systemctl restart slurmctld
scontrol update NodeName=<node> State=RESUME || true
scontrol update NodeName=<node> State=UNDRAIN || true
scontrol update NodeName=<node> State=IDLE || true
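
To confirm that the recovery worked, the node state can be read back from scontrol. The following is a minimal sketch rather than part of the original runbook: the helper names node_state and node_is_responding and the node name node01 are hypothetical, and the parsing assumes the usual scontrol show node layout in which the state appears as a State=... field.

```shell
#!/bin/sh
# node_state (hypothetical helper): read `scontrol show node <node>` output
# on stdin and print the value of its State= field.
node_state() {
  # Require start-of-line or whitespace before "State=" so that other
  # fields whose names merely end in "State" are not matched by mistake.
  grep -oE '(^|[[:space:]])State=[^[:space:]]+' | head -n 1 | sed 's/.*State=//'
}

# node_is_responding (hypothetical helper): succeed only when a state was
# found and it carries neither the NOT_RESPONDING flag nor a trailing "*".
node_is_responding() {
  case "$1" in
    "") return 1 ;;
    *NOT_RESPONDING*|*\*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage on a live cluster (node name is an example):
#   state=$(scontrol show node node01 | node_state)
#   node_is_responding "$state" || echo "node01 still not responding: $state"
```

Keeping the parsing in a function that reads stdin means it can be exercised against captured output without access to a running cluster.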

Missing GPU TRES

Symptoms: sacctmgr fails with no TRES known by type gres/gpu.

Fix: add gres/gpu to the AccountingStorageTRES list in slurm.conf (AccountingStorageTRES=...,gres/gpu), restart or reconfigure Slurm, then run a GPU job and verify that the TRES is listed by sacctmgr show tres.
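
The verification step can be scripted. This is a sketch under stated assumptions: has_gpu_tres is a hypothetical helper, and it expects the pipe-delimited Type|Name|ID rows that sacctmgr prints when called with -n (no header) and -P (parsable output).

```shell
#!/bin/sh
# has_gpu_tres (hypothetical helper): read `sacctmgr -n -P show tres`
# output on stdin and succeed only if a gres/gpu row is present.
has_gpu_tres() {
  awk -F'|' '$1 == "gres" && $2 == "gpu" { found = 1 } END { exit found ? 0 : 1 }'
}

# Usage on a live cluster:
#   sacctmgr -n -P show tres | has_gpu_tres || echo "gres/gpu is not registered yet"
```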

SlurmDBD objects already exist

Symptoms: sacctmgr returns Nothing new added or Already existing.

Fix: make the Ansible tasks idempotent. Attempt the change, tolerate the known existing-object messages, then normalize state with sacctmgr modify.
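
The attempt, tolerate, normalize pattern can be sketched in plain shell before it is wrapped in an Ansible task. This is an illustration, not the project's actual task: sacctmgr_add_ok is a hypothetical helper, the account name and description are examples, and the tolerated strings are the two messages listed under Symptoms (their exact wording may differ between Slurm versions).

```shell
#!/bin/sh
# sacctmgr_add_ok (hypothetical helper): decide whether an `sacctmgr add`
# attempt should count as success.
#   $1 = exit status of the sacctmgr call
#   $2 = its combined stdout/stderr
sacctmgr_add_ok() {
  [ "$1" -eq 0 ] && return 0
  case "$2" in
    # Known "object already exists" messages from the symptoms above.
    *"Nothing new added"*|*"Already existing"*) return 0 ;;
  esac
  return 1
}

# Usage on a live cluster (account name and description are examples):
#   out=$(sacctmgr -i add account name=research Description="Research" 2>&1); rc=$?
#   sacctmgr_add_ok "$rc" "$out" || { echo "$out" >&2; exit 1; }
#   sacctmgr -i modify account where name=research set Description="Research"
```

Any other failure still aborts the run, so real errors such as an unreachable slurmdbd are not masked by the tolerance rule.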