# Ansible Slurm AI/HPC Lab

Ansible automation for a small Slurm AI/HPC lab with CPU nodes, a GPU node, Munge, cgroups, GRES, SlurmDBD accounting, QOS/fairshare, node lifecycle workflows, rolling OS upgrades and health remediation.

This repository is sanitized for publication. Replace the example inventory values under `inventories/lab/` with your own hostnames, IP addresses and users before running it.

## What this lab covers

- Slurm controller and worker configuration
- Munge key distribution
- GPU GRES configuration
- cgroup CPU/GPU/device enforcement
- SlurmDBD + MariaDB accounting
- `sacct`, `sreport`, `sacctmgr` validation
- QOS, limits, fairshare and priority/multifactor
- Node provisioning and decommissioning
- Rolling OS upgrades with canary validation
- Health checks and node auto-remediation

## Repository layout

```text
inventories/lab/       Example inventory and group variables
templates/             Slurm, cgroup, gres and slurmdbd templates
playbooks/bootstrap/   Initial SSH, sudo and /etc/hosts setup
playbooks/core/        Munge, Slurm config and safe restart workflows
playbooks/accounting/  SlurmDBD, backup/restore-check and accounting validation
playbooks/qos/         QOS, fairshare and priority configuration
playbooks/lifecycle/   Provisioning and decommissioning nodes
playbooks/upgrade/     Rolling OS upgrade and canary workflow
playbooks/health/      Health checks and auto-remediation
playbooks/tests/       CPU/GPU/cgroup/accounting validation jobs
playbooks/backup/      Slurm config backup helpers
docs/                  Operational runbook
prompts/codex/         Prompts for generating or expanding documentation
```

## Quick start

1. Edit `inventories/lab/inventory.yml`.
2. Edit `inventories/lab/group_vars/slurm_cluster.yml`.
3. Create and encrypt a vault file for database credentials:

   ```bash
   cp inventories/lab/group_vars/vault.example.yml inventories/lab/group_vars/vault.yml
   ansible-vault encrypt inventories/lab/group_vars/vault.yml
   ```

4. Run syntax checks:

   ```bash
   find playbooks -name '*.yml' -print0 | xargs -0 -n1 ansible-playbook --syntax-check
   ```

5. Run the bootstrap/core workflows in the order described in `docs/runbook.md`.

## Security notes

Do not commit real inventories, backup archives, SQL dumps, Munge keys, private SSH keys or Ansible Vault files. This repository intentionally excludes generated backup artifacts.
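
One way to act on the security notes is to check staged files before each commit. The sketch below is a hypothetical helper, not something shipped in this repository: the function name `flag_sensitive_paths` and the filename patterns are assumptions chosen for illustration, so adjust them to the artifacts your own workflows produce. Real inventories cannot be recognized by name alone and still need a manual review.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): read file paths from
# stdin, one per line, and print those that look like sensitive artifacts.
# The patterns are illustrative assumptions, not an exhaustive list.
flag_sensitive_paths() {
  local path
  while IFS= read -r path; do
    case "$path" in
      # Vault file, Munge key, SQL dumps, backup archives, private SSH keys.
      *vault.yml|*munge.key|*.sql|*.sql.gz|*.tar.gz|*id_rsa|*id_ed25519)
        printf '%s\n' "$path"
        ;;
    esac
  done
}
```

From the repository root, `git diff --cached --name-only | flag_sensitive_paths` lists any staged path that matches. The example template `vault.example.yml` is not flagged, because it does not end in `vault.yml`.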