portfolio/platform-projects/hpc-slurm-ai-cluster/docs/interview-cheatsheet.md
Mateusz Suski d300d490f5
Add Slurm AI/HPC cluster platform project
2026-06-04 19:42:45 +00:00


# Interview Cheatsheet: Slurm AI/HPC Lab
## One-minute summary
I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.
## Topics I can discuss
- How Slurm schedules CPU and GPU workloads.
- The difference between GRES scheduling (which GPUs a job is allocated) and cgroup device enforcement (which devices the job's processes can actually access).
- Why Munge key consistency matters.
- How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
- How QOS, account associations, fairshare and multifactor priority work.
- Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
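
To make the multifactor priority topic concrete, here is a minimal Python sketch of the weighted sum that the priority plugin computes. It is a simplification, not the plugin's actual code: it leaves out the site, association and TRES terms, every weight value is hypothetical (in a real cluster they would come from the `PriorityWeight*` settings in `slurm.conf`), and clamping the result at zero is a choice made for this sketch only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorityWeights:
    # Hypothetical weights for illustration; not taken from the lab's slurm.conf.
    age: int = 1000
    fairshare: int = 10000
    job_size: int = 500
    partition: int = 1000
    qos: int = 2000


def job_priority(weights, *, age, fairshare, job_size, partition, qos, nice=0):
    """Return a simplified multifactor priority.

    Each factor is a float in [0.0, 1.0]. The result is the weighted sum of
    the factors minus the nice value, clamped at zero (sketch assumption).
    """
    factors = {
        "age": age,
        "fairshare": fairshare,
        "job_size": job_size,
        "partition": partition,
        "qos": qos,
    }
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} factor must be within [0.0, 1.0], got {value}")
    total = sum(getattr(weights, name) * value for name, value in factors.items())
    return max(0, int(total) - nice)


if __name__ == "__main__":
    w = PriorityWeights()
    # A job that has waited a while, from an account with moderate recent usage.
    print(job_priority(w, age=0.5, fairshare=0.25, job_size=0.0, partition=1.0, qos=0.5))
```

With these example weights the fairshare term dominates, which is the usual talking point: the relative size of the weights, not the raw factors, decides which policy wins.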
## Real troubleshooting examples
- `IDLE+NOT_RESPONDING` after node reprovisioning.
- Accounting delay where `sacct` temporarily showed a job as `PENDING` even though its output file already existed.
- Missing `gres/gpu` TRES before QOS GPU limits could be configured.
- `sacctmgr` idempotency issues such as `Nothing new added`.
- Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.
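
Two of the examples above lend themselves to small triage helpers. The Python sketch below is illustrative only: the function names are hypothetical, the first helper assumes node details were captured from `scontrol show node` and contain a `State=` field such as `State=IDLE+NOT_RESPONDING`, and the second assumes that `sacctmgr` reports a no-op with the text `Nothing new added`, as quoted above. Exact output can differ between Slurm versions, so both should be treated as a starting point.

```python
import re


def node_state_flags(scontrol_output: str) -> tuple[str, frozenset[str]]:
    """Split a node's State= value into its base state and extra flags.

    Example: 'State=IDLE+NOT_RESPONDING' -> ('IDLE', {'NOT_RESPONDING'}).
    """
    match = re.search(r"\bState=(\S+)", scontrol_output)
    if match is None:
        raise ValueError("no State= field found in the supplied output")
    base, *flags = match.group(1).split("+")
    return base, frozenset(flags)


def sacctmgr_changed(output: str) -> bool:
    """Report whether a sacctmgr add command appears to have changed anything.

    Assumes a no-op is signalled by the text 'Nothing new added'; useful as
    the basis for a changed/unchanged decision in automation.
    """
    return "nothing new added" not in output.lower()


if __name__ == "__main__":
    sample = "NodeName=gpu01 Arch=x86_64\n   State=IDLE+NOT_RESPONDING ThreadsPerCore=1"
    base, flags = node_state_flags(sample)
    if "NOT_RESPONDING" in flags:
        print(f"node is {base} but slurmd is not answering the controller")
    print(sacctmgr_changed(" Nothing new added."))
```

The node name `gpu01` and the sample text are made up for the demonstration. In an Ansible task, the same string check could back a `changed_when` condition so that a repeated `sacctmgr add` is reported as unchanged.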