Add Slurm AI/HPC cluster platform project

2026-06-04 19:41:05 +00:00
parent e2624a7533
commit d300d490f5
49 changed files with 4777 additions and 0 deletions
@@ -0,0 +1,22 @@
+# Interview Cheatsheet: Slurm AI/HPC Lab
+
+## One-minute summary
+
+I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes MUNGE authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.
+
+## Topics I can discuss
+
+- How Slurm schedules CPU and GPU workloads.
+- The difference between GRES scheduling (which GPUs a job is allocated) and cgroup device enforcement (which devices the job's processes can actually access).
+- Why MUNGE key consistency across all nodes matters.
+- How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
+- How QOS, account associations, fairshare and multifactor priority work.
+- Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
+
+## Real troubleshooting examples
+
+- `IDLE+NOT_RESPONDING` after node reprovisioning.
+- An accounting delay where `sacct` temporarily showed `PENDING` even though the job's output already existed.
+- A missing `gres/gpu` TRES, which had to be tracked before QOS GPU limits could be configured.
+- `sacctmgr` idempotency issues, such as `Nothing new added` when re-adding an entity that already exists.
+- Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.