Interview Cheatsheet: Slurm AI/HPC Lab
One-minute summary
I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.
Topics I can discuss
- How Slurm schedules CPU and GPU workloads.
- Difference between GRES scheduling and cgroup device enforcement.
- Why Munge key consistency matters.
- How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
- How QOS, account associations, fairshare and multifactor priority work.
- Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
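To make the multifactor priority topic concrete, here is a minimal Python sketch of the weighted-sum idea: each factor is normalized to the range 0.0 to 1.0, multiplied by its weight, and the results are added up. The `PriorityWeights` class, the `multifactor_priority` function and all weight values are hypothetical and chosen for illustration; on a real cluster the weights come from the `PriorityWeight*` settings in `slurm.conf`, and the real plugin includes further terms (association, site and TRES factors) that this sketch omits.

```python
from dataclasses import dataclass


@dataclass
class PriorityWeights:
    # Hypothetical values for illustration only; a real cluster sets
    # these through the PriorityWeight* options in slurm.conf.
    age: int = 1000
    fairshare: int = 10000
    job_size: int = 500
    partition: int = 1000
    qos: int = 2000


def multifactor_priority(weights, age, fairshare, job_size, partition, qos, nice=0):
    """Simplified weighted sum of normalized factors, minus the nice value.

    Every factor is expected in [0.0, 1.0]. Association, site and TRES
    terms are deliberately left out of this sketch.
    """
    factors = {
        "age": age,
        "fairshare": fairshare,
        "job_size": job_size,
        "partition": partition,
        "qos": qos,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} factor must be within [0.0, 1.0], got {value}")

    total = (
        weights.age * age
        + weights.fairshare * fairshare
        + weights.job_size * job_size
        + weights.partition * partition
        + weights.qos * qos
    )
    # Truncate to an integer in this sketch, then apply the nice adjustment.
    return int(total) - nice


if __name__ == "__main__":
    w = PriorityWeights()
    # A job that has waited a while, from an account that has used a lot
    # of its share, submitted under a mid-priority QOS.
    print(multifactor_priority(w, age=0.5, fairshare=0.25, job_size=0.0,
                               partition=1.0, qos=0.5))
```

With these example weights the fairshare term dominates, which is the usual way to express that recent usage should matter more than queue wait time.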
Real troubleshooting examples
- `IDLE+NOT_RESPONDING` after node reprovisioning.
- Accounting delay where `sacct` temporarily showed `PENDING` while job output existed.
- Missing `gres/gpu` TRES before QOS GPU limits could be configured.
- `sacctmgr` idempotency issues such as `Nothing new added.`
- Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.
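Two of the cases above lend themselves to small automation helpers. The following Python sketch shows one possible approach, not the lab's actual code: `split_node_state` breaks a compound state string such as `IDLE+NOT_RESPONDING` into a base state and flags, and `sacctmgr_result` treats the `Nothing new added.` message as "unchanged" rather than as a failure. All function names are hypothetical, and the set of states flagged by `needs_attention` is an assumption made for this example.

```python
def split_node_state(state):
    """Split a compound node state into (base_state, set_of_flags).

    Example input: 'IDLE+NOT_RESPONDING'.
    """
    base, *flags = state.strip().upper().split("+")
    return base, set(flags)


def needs_attention(state):
    """Return True if the node looks unhealthy.

    Assumption for this sketch: a DOWN base state, or a NOT_RESPONDING
    or DRAIN flag, means an operator or remediation job should look.
    """
    base, flags = split_node_state(state)
    return base == "DOWN" or bool(flags & {"NOT_RESPONDING", "DRAIN"})


def sacctmgr_result(returncode, output):
    """Classify a 'sacctmgr add ...' call as changed, unchanged or failed.

    The message is checked before the exit code, so the sketch does not
    depend on which exit code accompanies 'Nothing new added.'.
    """
    if "Nothing new added" in output:
        return "unchanged"
    return "changed" if returncode == 0 else "failed"


if __name__ == "__main__":
    print(split_node_state("IDLE+NOT_RESPONDING"))
    print(needs_attention("IDLE"))
    print(sacctmgr_result(1, " Nothing new added.\n"))
```

In an Ansible task, the same idea maps to deriving `changed_when` and `failed_when` from the command output instead of relying on the return code alone.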