portfolio/platform-projects/hpc-slurm-ai-cluster/docs/interview-cheatsheet.md
Mateusz Suski d300d490f5
Add Slurm AI/HPC cluster platform project
2026-06-04 19:42:45 +00:00


Interview Cheatsheet: Slurm AI/HPC Lab

One-minute summary

I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.

Topics I can discuss

  • How Slurm schedules CPU and GPU workloads.
  • The difference between GRES scheduling (which GPUs a job is allocated) and cgroup device enforcement (which devices the job can actually access).
  • Why Munge key consistency matters.
  • How slurmdbd, sacct, sacctmgr and sreport fit together.
  • How QOS, account associations, fairshare and multifactor priority work.
  • Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
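As a talking aid for the multifactor priority topic, here is a minimal Python sketch of the weighted sum that the priority/multifactor plugin is built around. It is a simplification, not Slurm's implementation: it omits the site, association and TRES terms, the weight values are made up for illustration, and the names `Weights` and `job_priority` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Illustrative stand-ins for the PriorityWeight* settings (values are assumptions)."""
    age: int = 1000
    fairshare: int = 10000
    job_size: int = 500
    partition: int = 1000
    qos: int = 2000


def job_priority(weights: Weights, factors: dict, nice: int = 0) -> int:
    """Weighted sum of normalized factors minus the nice value.

    Each factor is expected to be a float in [0.0, 1.0]; missing factors
    count as 0.0. Site, association and TRES terms are deliberately left out.
    """
    total = 0.0
    for name in ("age", "fairshare", "job_size", "partition", "qos"):
        factor = factors.get(name, 0.0)
        if not 0.0 <= factor <= 1.0:
            raise ValueError(f"factor {name!r} must be within [0, 1], got {factor}")
        total += getattr(weights, name) * factor
    return int(total) - nice


if __name__ == "__main__":
    example = {"age": 0.5, "fairshare": 0.25, "partition": 1.0, "qos": 0.5}
    print(job_priority(Weights(), example))
```

The point to make in an interview is that the weights decide which factor dominates: with the fairshare weight set ten times higher than the age weight, a user with low recent usage outranks a job that has merely waited longer.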

Real troubleshooting examples

  • IDLE+NOT_RESPONDING after node reprovisioning.
  • An accounting delay in which sacct still showed a job as PENDING although its output file already existed.
  • Missing gres/gpu TRES before QOS GPU limits could be configured.
  • sacctmgr idempotency issues, such as the "Nothing new added" response when the object already exists.
  • Slurm version differences around state transitions such as RESUME, UNDRAIN and IDLE.
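Two of the cases above lend themselves to small helpers in automation. The Python sketch below is illustrative only: the function names are hypothetical, the sample strings are written by hand rather than captured from the lab, and treating "Nothing new added" as "unchanged" regardless of exit code is an assumption that should be checked against the Slurm version in use.

```python
def split_node_state(state: str) -> tuple[str, frozenset[str]]:
    """Split a compound node state such as 'IDLE+NOT_RESPONDING'.

    Returns the base state and the set of additional flags.
    """
    base, *flags = state.strip().upper().split("+")
    return base, frozenset(flags)


def is_not_responding(state: str) -> bool:
    """True when the compound state carries the NOT_RESPONDING flag."""
    _, flags = split_node_state(state)
    return "NOT_RESPONDING" in flags


def classify_sacctmgr(returncode: int, output: str) -> str:
    """Map a sacctmgr invocation to 'changed', 'unchanged' or 'failed'.

    Assumption: the 'Nothing new added' message means the object already
    existed, so the call is reported as 'unchanged' whatever the exit code.
    """
    if "Nothing new added" in output:
        return "unchanged"
    return "changed" if returncode == 0 else "failed"


if __name__ == "__main__":
    print(split_node_state("IDLE+NOT_RESPONDING"))
    print(classify_sacctmgr(0, " Nothing new added.\n"))
```

In an Ansible role, the same classification would typically be expressed through `changed_when` and `failed_when` on the task that runs sacctmgr, so that repeated runs report no change.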