Document Slurm AI/HPC cluster project

Add Slurm AI/HPC cluster platform project
2026-06-04 19:54:43 +00:00 · 2026-06-04 19:42:45 +00:00
3 changed files with 54 additions and 1 deletions
@@ -73,7 +73,7 @@ playbooks/health/         Health checks, repair, and auto-remediation
 playbooks/tests/          CPU, GPU, cgroup, accounting, and reporting validation jobs
 playbooks/backup/         Slurm and Munge state backup helpers
 templates/                Slurm, cgroup, GRES, and SlurmDBD templates
-docs/                     Operational runbook
+docs/                     Runbook, interview notes, and troubleshooting cases
 prompts/                  Documentation prompts used to expand this project
 ```

@@ -188,6 +188,7 @@ This is more than a toy lab because it includes operational controls around the
 - Rolling upgrade playbooks include canary validation before broader worker upgrades.
 - Health and repair playbooks document remediation paths for common node states.
 - Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
+- Troubleshooting cases document real lab failure modes without exposing private infrastructure details.

 ## Tested capabilities

@@ -231,3 +232,5 @@ This project demonstrates practical understanding of:
 ## Deeper docs

 - [Runbook](docs/runbook.md)
+- [Interview cheatsheet](docs/interview-cheatsheet.md)
+- [Troubleshooting cases](docs/troubleshooting-cases.md)
@@ -0,0 +1,22 @@
+# Interview Cheatsheet: Slurm AI/HPC Lab
+
+## One-minute summary
+
+I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.
+
+## Topics I can discuss
+
+- How Slurm schedules CPU and GPU workloads.
+- Difference between GRES scheduling and cgroup device enforcement.
+- Why Munge key consistency matters.
+- How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
+- How QOS, account associations, fairshare and multifactor priority work.
+- Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
+
+## Real troubleshooting examples
+
+- `IDLE+NOT_RESPONDING` after node reprovisioning.
+- Accounting delay where `sacct` temporarily showed `PENDING` while job output existed.
+- Missing `gres/gpu` TRES before QOS GPU limits could be configured.
+- `sacctmgr` idempotency issues such as `Nothing new added`.
+- Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.
@@ -0,0 +1,28 @@
+# Troubleshooting Cases
+
+## `IDLE+NOT_RESPONDING` after node maintenance
+
+Symptoms: `sinfo` shows `idle*` or `scontrol show node` shows `IDLE+NOT_RESPONDING`.
+
+Actions:
+
+```bash
+systemctl restart munge
+systemctl restart slurmd
+systemctl restart slurmctld
+scontrol update NodeName=<node> State=RESUME || true
+scontrol update NodeName=<node> State=UNDRAIN || true
+scontrol update NodeName=<node> State=IDLE || true
+```
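Review note: the action block above restarts services and tries several state updates unconditionally. A minimal sketch of gating that on the reported node state is shown below. The helper names `node_not_responding` and `node_state_from_scontrol` are hypothetical and not part of this commit, and the parser assumes `scontrol show node` prints a whitespace-separated `State=...` field.

```bash
#!/usr/bin/env bash

# Hypothetical helper: succeed when a Slurm node state string marks the node
# as not responding, e.g. "idle*" (sinfo) or "IDLE+NOT_RESPONDING" (scontrol).
node_not_responding() {
  local state="$1"
  case "$state" in
    *NOT_RESPONDING*|*\*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: read `scontrol show node <node>` text on stdin and
# print the value of the first field that starts with "State=".
node_state_from_scontrol() {
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^State=/) { print substr($i, 7); exit } }'
}
```

One possible use is `state="$(scontrol show node <node> | node_state_from_scontrol)"` followed by `node_not_responding "$state" && ...`, so the restarts run only when the symptom is actually present.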
+
+## Missing GPU TRES
+
+Symptoms: `sacctmgr` fails with `no TRES known by type gres/gpu`.
+
+Fix: add `AccountingStorageTRES=...,gres/gpu`, restart/reconfigure Slurm, run a GPU job and verify with `sacctmgr show tres`.
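Review note: the verification step could be scripted roughly as follows. This is a sketch, not part of the commit. It assumes the TRES list is supplied as pipe-separated `type|name` lines, for example from `sacctmgr -n -P show tres format=type,name`; that exact invocation is an assumption and should be checked against the installed Slurm version. The function name `has_gpu_tres` is hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical check: read "type|name" lines on stdin and succeed only if a
# line with type "gres" and name "gpu" is present.
has_gpu_tres() {
  awk -F'|' '$1 == "gres" && $2 == "gpu" { found = 1 } END { exit !found }'
}
```

A playbook task could pipe the TRES listing into `has_gpu_tres` and skip the QOS GPU limit configuration until the check succeeds.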
+
+## SlurmDBD objects already exist
+
+Symptoms: `sacctmgr` returns `Nothing new added` or `Already existing`.
+
+Fix: make the Ansible tasks idempotent by attempting the change, tolerating the known existing-object messages, and then normalizing state with `modify`.
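Review note: the tolerate-then-normalize pattern described in the fix can be sketched in shell as below. The wrapper name `run_tolerant` is hypothetical and this is not the playbook's actual implementation. It matches only the two messages quoted in the symptoms, and it accepts them whether or not the command's exit status is non-zero.

```bash
#!/usr/bin/env bash

# Hypothetical wrapper: run a command, echo its combined output, and treat
# the known "object already exists" messages as success.
run_tolerant() {
  local out rc
  out="$("$@" 2>&1)"
  rc=$?
  printf '%s\n' "$out"
  if [ "$rc" -eq 0 ]; then
    return 0
  fi
  case "$out" in
    *"Nothing new added"*|*"Already existing"*) return 0 ;;
  esac
  return "$rc"
}

# Example call (account name is illustrative only):
#   run_tolerant sacctmgr -i add account name=research
```

Any other failure still propagates its original exit code, so real errors are not masked. A follow-up `modify` call can then bring the object to the desired state regardless of whether it was just created.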
Author	SHA1	Message	Date
Mateusz Suski	83877fb598	Document Slurm AI/HPC cluster project	2026-06-04 19:54:43 +00:00
Mateusz Suski	d300d490f5	Add Slurm AI/HPC cluster platform project	2026-06-04 19:42:45 +00:00