# Troubleshooting Cases

## `IDLE+NOT_RESPONDING` after node maintenance

Symptoms: `sinfo` shows `idle*` (the `*` suffix marks a node that is not responding to the controller) or `scontrol show node` shows `IDLE+NOT_RESPONDING`.

Actions:

```bash
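# On the affected node (where slurmd runs): restart authentication, then slurmd.
systemctl restart munge
systemctl restart slurmd
# On the controller host (where slurmctld runs): restart the controller daemon.
systemctl restart slurmctld
# Try each state reset in turn; "|| true" ignores updates that do not apply
# to the node's current state.
scontrol update NodeName=<node> State=RESUME || true
scontrol update NodeName=<node> State=UNDRAIN || true
scontrol update NodeName=<node> State=IDLE || true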
```
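
To confirm which nodes are still flagged before and after the restart, the symptom check can be scripted. The following is a minimal sketch, not part of Slurm: `is_not_responding` and `list_not_responding` are hypothetical helper names, and the input is assumed to be one `<node> <state>` pair per line.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): succeed when a node state string
# carries the not-responding marker in either its sinfo or scontrol form.
is_not_responding() {
    case "$1" in
        *"*"*|*NOT_RESPONDING*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical filter: read "<node> <state>" pairs on stdin and print the
# names of the nodes that are flagged as not responding.
list_not_responding() {
    local node state
    while read -r node state; do
        if is_not_responding "$state"; then
            printf '%s\n' "$node"
        fi
    done
}
```

One way to feed it is `sinfo -h -N -o '%N %t' | list_not_responding | sort -u`. The `sort -u` is there because a node that belongs to several partitions may be listed more than once.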

## Missing GPU TRES

Symptoms: `sacctmgr` fails with `no TRES known by type gres/gpu`.

Fix: add `gres/gpu` to `AccountingStorageTRES` in `slurm.conf` (`AccountingStorageTRES=...,gres/gpu`), restart or reconfigure Slurm, run a GPU job, and verify that `gres/gpu` appears in `sacctmgr show tres`.
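
The verification step can be automated with a small check. This is a sketch under assumptions: `has_gpu_tres` is a hypothetical helper, and it expects pipe-delimited rows in the order `Type|Name|ID`, which is what `sacctmgr -n -P show tres` is assumed to print.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read pipe-delimited "Type|Name|ID" rows on stdin and
# succeed only if a row with Type "gres" and Name "gpu" is present.
has_gpu_tres() {
    awk -F'|' '$1 == "gres" && $2 == "gpu" { found = 1 } END { exit !found }'
}
```

Usage: `sacctmgr -n -P show tres | has_gpu_tres && echo "gres/gpu is tracked"`. The match is exact, so typed entries (a name such as `gpu:<model>`) would need their own comparison.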

## SlurmDBD objects already exist

Symptoms: `sacctmgr` returns `Nothing new added` or `Already existing`.

Fix: make the Ansible tasks idempotent. Attempt the change, tolerate the known existing-object messages (`Nothing new added`, `Already existing`), then normalize state with `sacctmgr modify`.
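
The tolerate-then-normalize pattern can be sketched in shell as follows. `run_tolerant` is a hypothetical wrapper, not an existing Slurm or Ansible feature. It assumes the two messages listed above are the only ones that should be treated as success.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run a command, echo its output, and treat the known
# "already exists" messages as success. Any other failure keeps its exit code.
run_tolerant() {
    local output rc=0
    output="$("$@" 2>&1)" || rc=$?
    printf '%s\n' "$output"
    if [ "$rc" -eq 0 ]; then
        return 0
    fi
    case "$output" in
        *"Nothing new added"*|*"Already existing"*) return 0 ;;
    esac
    return "$rc"
}
```

A task could then call `run_tolerant sacctmgr -i add account <account>` followed by the matching `sacctmgr -i modify` command. In an Ansible task, the same message check would typically live in a `failed_when` condition on the registered command output.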