Interview Cheatsheet: Slurm AI/HPC Lab

One-minute summary

I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.

Topics I can discuss

How Slurm schedules CPU and GPU workloads.
The difference between GRES scheduling (which GPUs a job is allocated) and cgroup device enforcement (which devices the job can actually access).
Why Munge key consistency matters.
How slurmdbd, sacct, sacctmgr and sreport fit together.
How QOS, account associations, fairshare and multifactor priority work.
Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
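To make the multifactor priority topic concrete, here is a minimal sketch of how the weighted factors combine. It is a simplification, not the scheduler's actual code: it assumes each factor is already normalized to the range 0.0 to 1.0, it omits the site, association and TRES terms, and the weight values and the names PriorityWeights and job_priority are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class PriorityWeights:
    # Illustrative values only (assumption); a real cluster sets the
    # corresponding PriorityWeight* options in slurm.conf.
    age: int = 1000
    fairshare: int = 10000
    job_size: int = 500
    partition: int = 1000
    qos: int = 2000


def job_priority(weights, age, fairshare, job_size, partition, qos, nice=0):
    """Simplified multifactor priority: weighted sum of factors, minus nice.

    Each factor is expected to be a float in [0.0, 1.0].
    """
    factors = {
        "age": age,
        "fairshare": fairshare,
        "job_size": job_size,
        "partition": partition,
        "qos": qos,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} factor must be in [0.0, 1.0], got {value}")

    total = (
        weights.age * age
        + weights.fairshare * fairshare
        + weights.job_size * job_size
        + weights.partition * partition
        + weights.qos * qos
    )
    # A positive nice value lowers the job's priority.
    return int(total) - nice


if __name__ == "__main__":
    w = PriorityWeights()
    print(job_priority(w, age=0.5, fairshare=0.25, job_size=0.0, partition=1.0, qos=0.5))
```

With these example weights, fairshare dominates: a user with little recent usage outranks a job that has simply waited longer, which is the usual reason for setting the fairshare weight well above the age weight.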

Real troubleshooting examples

IDLE+NOT_RESPONDING after node reprovisioning.
Accounting delay where sacct temporarily showed PENDING while job output existed.
Missing gres/gpu TRES before QOS GPU limits could be configured.
sacctmgr idempotency issues, such as the "Nothing new added" response on repeated runs.
Slurm version differences around state transitions such as RESUME, UNDRAIN and IDLE.
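The sacctmgr idempotency issue can be handled by treating "Nothing new added" as an unchanged result instead of a failure. The sketch below is one possible approach, not the lab's actual automation: the function names classify_sacctmgr_add and add_account are hypothetical, and because the exit code that accompanies this message may differ between Slurm versions, the check keys on the message text and not on the return code.

```python
import subprocess


def classify_sacctmgr_add(returncode: int, output: str) -> str:
    """Map the result of a `sacctmgr add ...` call to a simple status.

    Returns 'unchanged' if the entity already existed, 'changed' if the
    command succeeded, and 'failed' otherwise.
    """
    # Check the message first: the exit code seen alongside it is treated
    # as unreliable here (assumption, may vary by Slurm version).
    if "Nothing new added" in output:
        return "unchanged"
    if returncode == 0:
        return "changed"
    return "failed"


def add_account(name: str) -> str:
    """Hypothetical helper: add an account without the interactive prompt."""
    proc = subprocess.run(
        ["sacctmgr", "-i", "add", "account", name],
        capture_output=True,
        text=True,
    )
    return classify_sacctmgr_add(proc.returncode, proc.stdout + proc.stderr)
```

In an Ansible task the same idea is usually expressed with changed_when and failed_when conditions on the registered command output, so that repeated playbook runs report no change.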
