# Interview Cheatsheet: Slurm AI/HPC Lab

## One-minute summary

I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.

## Topics I can discuss

- How Slurm schedules CPU and GPU workloads.
- Difference between GRES scheduling and cgroup device enforcement.
- Why Munge key consistency matters.
- How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
- How QOS, account associations, fairshare and multifactor priority work.
- Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.

## Real troubleshooting examples

- `IDLE+NOT_RESPONDING` after node reprovisioning.
- Accounting delay where `sacct` temporarily showed `PENDING` while job output existed.
- Missing `gres/gpu` TRES before QOS GPU limits could be configured.
- `sacctmgr` idempotency issues such as `Nothing new added`.
- Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.
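
A compound state such as `IDLE+NOT_RESPONDING` is a base state plus one or more flags joined by `+`. The sketch below shows one way a health check might split that string and decide whether a node needs a closer look. The function names and the "needs attention" policy are hypothetical illustrations, not part of Slurm or of the lab's actual remediation code.

```python
def parse_node_state(state: str) -> tuple[str, set[str]]:
    """Split a compound state like 'IDLE+NOT_RESPONDING' into (base, flags)."""
    base, *flags = state.strip().upper().split("+")
    return base, set(flags)


def needs_attention(state: str) -> bool:
    """Hypothetical policy: flag nodes that are down, unknown or not responding.

    A deliberate DRAIN is not flagged here, since draining is usually an
    operator decision rather than a fault.
    """
    base, flags = parse_node_state(state)
    return base in {"DOWN", "UNKNOWN"} or "NOT_RESPONDING" in flags


if __name__ == "__main__":
    for example in ("IDLE", "IDLE+NOT_RESPONDING", "MIXED+DRAIN", "DOWN"):
        print(example, parse_node_state(example), needs_attention(example))
```

Keeping the base state and the flags separate makes it easier to explain why a node can look schedulable (`IDLE`) while still failing to answer the controller.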
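
For the `sacctmgr` idempotency example, one approach is to classify the command result by its message rather than by exit code alone, so that automation can report "unchanged" instead of "failed" when the object already exists. This is a minimal sketch: `classify_sacctmgr_add` is a hypothetical helper, and it assumes the `Nothing new added` text appears in the captured output. The exit code that accompanies that message may differ between Slurm versions, which is why the message is checked first.

```python
def classify_sacctmgr_add(returncode: int, output: str) -> str:
    """Map a captured `sacctmgr add ...` result to changed/unchanged/failed.

    Assumption: `output` is the combined stdout and stderr of the command.
    """
    if "nothing new added" in output.lower():
        # The object already exists, so the run made no change.
        return "unchanged"
    if returncode == 0:
        return "changed"
    return "failed"


if __name__ == "__main__":
    print(classify_sacctmgr_add(1, " Nothing new added."))
    print(classify_sacctmgr_add(0, " Adding Account(s)\n  research"))
    print(classify_sacctmgr_add(1, "sacctmgr: error: some other problem"))
```

In an Ansible task, the same idea can be expressed through `changed_when` and `failed_when` conditions on the registered command output.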