Ansible Slurm AI/HPC Lab
Ansible automation for a small Slurm AI/HPC lab with CPU nodes, a GPU node, Munge, cgroups, GRES, SlurmDBD accounting, QOS/fairshare, node lifecycle workflows, rolling OS upgrades and health remediation.
This repository is sanitized for publication. Replace the example inventory values under inventories/lab/ with your own hostnames, IP addresses and users before running it.
What this lab covers
- Slurm controller and worker configuration
- Munge key distribution
- GPU GRES configuration
- cgroup CPU/GPU/device enforcement
- SlurmDBD + MariaDB accounting
- sacct, sreport, sacctmgr validation
- QOS, limits, fairshare and priority/multifactor
- Node provisioning and decommissioning
- Rolling OS upgrades with canary validation
- Health checks and node auto-remediation
Repository layout
inventories/lab/       Example inventory and group variables
templates/             Slurm, cgroup, gres and slurmdbd templates
playbooks/bootstrap/   Initial SSH, sudo and /etc/hosts setup
playbooks/core/        Munge, Slurm config and safe restart workflows
playbooks/accounting/  SlurmDBD, backup/restore-check and accounting validation
playbooks/qos/         QOS, fairshare and priority configuration
playbooks/lifecycle/   Provisioning and decommissioning nodes
playbooks/upgrade/     Rolling OS upgrade and canary workflow
playbooks/health/      Health checks and auto-remediation
playbooks/tests/       CPU/GPU/cgroup/accounting validation jobs
playbooks/backup/      Slurm config backup helpers
docs/                  Operational runbook
prompts/codex/         Prompts for generating or expanding documentation
Quick start
- Edit inventories/lab/inventory.yml.
- Edit inventories/lab/group_vars/slurm_cluster.yml.
- Create and encrypt a vault file for database credentials:
cp inventories/lab/group_vars/vault.example.yml inventories/lab/group_vars/vault.yml
ansible-vault encrypt inventories/lab/group_vars/vault.yml
- Run syntax checks:
find playbooks -name '*.yml' -print0 | xargs -0 -n1 ansible-playbook --syntax-check
- Run the bootstrap/core workflows in the order described in docs/runbook.md.
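Before running any playbook, it can be worth confirming that the vault file created in the steps above is actually encrypted. The following is a minimal sketch, assuming the standard Ansible Vault header ($ANSIBLE_VAULT;) on the first line of an encrypted file. The helper name is_vault_encrypted is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repository).
# Returns 0 if the first line of the file carries the Ansible Vault header,
# 1 if the file exists but looks like plaintext, 2 if the file is missing.
is_vault_encrypted() {
  local file="$1"
  [ -f "$file" ] || return 2
  head -n 1 "$file" | grep -q '^\$ANSIBLE_VAULT;'
}

# Check the vault path used in the quick start.
vault_file="inventories/lab/group_vars/vault.yml"
if is_vault_encrypted "$vault_file"; then
  echo "encrypted: $vault_file"
else
  echo "NOT encrypted (or missing): $vault_file" >&2
fi
```

The check only inspects the header line. It does not verify that the vault password is correct or that the contents decrypt cleanly.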
Security notes
Do not commit real inventories, backup archives, SQL dumps, Munge keys, private SSH keys or Ansible Vault files. This repository intentionally excludes generated backup artifacts.
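One way to back up the rule above is a small check over staged paths, for example from a Git pre-commit hook. The sketch below rests on assumptions: the function names is_sensitive_path and check_staged are hypothetical, and the file-name patterns (vault.yml, munge.key, *.sql, *.tar.gz, id_rsa, id_ed25519, *.pem) are illustrative guesses rather than patterns taken from this repository. Adjust them to your own naming.

```shell
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if a path looks like something that must not
# be committed. The patterns are assumptions; adapt them to your layout.
is_sensitive_path() {
  case "$1" in
    vault.yml|*/vault.yml)      return 0 ;;  # real vault file (vault.example.yml is allowed)
    *munge.key)                 return 0 ;;  # Munge key
    *.sql|*.sql.gz)             return 0 ;;  # SQL dumps
    *.tar|*.tar.gz|*.tgz)       return 0 ;;  # backup archives
    *id_rsa|*id_ed25519|*.pem)  return 0 ;;  # private SSH keys
    *)                          return 1 ;;
  esac
}

# Reads one path per line on stdin and returns 1 if any path is sensitive.
check_staged() {
  local found=0 path
  while IFS= read -r path; do
    if is_sensitive_path "$path"; then
      echo "refusing to commit: $path" >&2
      found=1
    fi
  done
  return "$found"
}

# Example use inside a pre-commit hook:
#   git diff --cached --name-only | check_staged || exit 1
```

Name matching of this kind cannot tell a real inventory from the sanitized example one, so it complements manual review of inventories/lab/ before publishing and does not replace it.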