portfolio/platform-projects/hpc-slurm-ai-cluster/README.md
Mateusz Suski d300d490f5
Add Slurm AI/HPC cluster platform project
2026-06-04 19:42:45 +00:00

Ansible Slurm AI/HPC Lab

Ansible automation for a small Slurm AI/HPC lab with CPU nodes, a GPU node, Munge, cgroups, GRES, SlurmDBD accounting, QOS/fairshare, node lifecycle workflows, rolling OS upgrades and health remediation.

This repository is sanitized for publication. Replace the example inventory values under inventories/lab/ with your own hostnames, IP addresses and users before running it.

What this lab covers

  • Slurm controller and worker configuration
  • Munge key distribution
  • GPU GRES configuration
  • cgroup CPU/GPU/device enforcement
  • SlurmDBD + MariaDB accounting
  • Accounting validation with sacct, sreport and sacctmgr
  • QOS, limits, fairshare and priority/multifactor
  • Node provisioning and decommissioning
  • Rolling OS upgrades with canary validation
  • Health checks and node auto-remediation

Repository layout

inventories/lab/          Example inventory and group variables
templates/                Slurm, cgroup, gres and slurmdbd templates
playbooks/bootstrap/      Initial SSH, sudo and /etc/hosts setup
playbooks/core/           Munge, Slurm config and safe restart workflows
playbooks/accounting/     SlurmDBD, backup/restore-check and accounting validation
playbooks/qos/            QOS, fairshare and priority configuration
playbooks/lifecycle/      Node provisioning and decommissioning
playbooks/upgrade/        Rolling OS upgrade and canary workflow
playbooks/health/         Health checks and auto-remediation
playbooks/tests/          CPU/GPU/cgroup/accounting validation jobs
playbooks/backup/         Slurm config backup helpers
docs/                     Runbooks and interview notes
prompts/codex/            Prompts for generating or expanding documentation
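
After cloning or restructuring the repository, it can help to confirm that the directories listed above are all present before running anything. The following is a minimal sketch of such a pre-flight check; the function name missing_layout_dirs is hypothetical and not part of the repository, and the directory list is simply copied from the layout above.

```shell
# Sketch only: print every directory from the layout above that is missing
# under the given root (defaults to the current directory).
# "missing_layout_dirs" is an illustrative name, not a script in this repo.
missing_layout_dirs() {
  root="${1:-.}"
  for dir in \
    inventories/lab templates \
    playbooks/bootstrap playbooks/core playbooks/accounting playbooks/qos \
    playbooks/lifecycle playbooks/upgrade playbooks/health playbooks/tests \
    playbooks/backup docs prompts/codex
  do
    # A missing directory is reported by its path relative to the root.
    [ -d "$root/$dir" ] || printf '%s\n' "$dir"
  done
}

# Usage (from the repository root): no output means the layout is complete.
#   missing_layout_dirs .
```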

Quick start

  1. Edit inventories/lab/inventory.yml.
  2. Edit inventories/lab/group_vars/slurm_cluster.yml.
  3. Create and encrypt a vault file for database credentials:
     cp inventories/lab/group_vars/vault.example.yml inventories/lab/group_vars/vault.yml
     ansible-vault encrypt inventories/lab/group_vars/vault.yml
  4. Run syntax checks:
     find playbooks -name '*.yml' -print0 | xargs -0 -n1 ansible-playbook --syntax-check
  5. Run the bootstrap/core workflows in the order described in docs/runbook.md.
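
Before running any playbook, it may be worth confirming that the vault file from the vault step really was encrypted. Files encrypted by ansible-vault start with a header line beginning with $ANSIBLE_VAULT;, so a rough check only needs to look at the first line. This is a sketch; the function name is_vault_encrypted is an assumption for illustration and is not shipped with the repository.

```shell
# Sketch only: succeed if the file exists and its first line carries the
# Ansible Vault header, fail otherwise.
# "is_vault_encrypted" is an illustrative name, not a script in this repo.
is_vault_encrypted() {
  [ -f "$1" ] && head -n 1 "$1" | grep -q '^\$ANSIBLE_VAULT;'
}

# Usage:
#   is_vault_encrypted inventories/lab/group_vars/vault.yml \
#     || echo "vault.yml is missing or not encrypted" >&2
```

The check only inspects the header; it does not verify that the file decrypts with your vault password.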

Security notes

Do not commit real inventories, backup archives, SQL dumps, Munge keys, private SSH keys or Ansible Vault files. This repository intentionally excludes generated backup artifacts.
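
One way to act on this note is to scan the working tree for file names that typically hold such material before committing. The sketch below is only a rough heuristic: the function name list_sensitive_files and the file name patterns (munge.key, *.sql, vault.yml and so on) are assumptions chosen for illustration, and real inventories cannot be detected by name at all.

```shell
# Sketch only: list files whose names suggest secrets or backup artifacts.
# The patterns are illustrative guesses; adjust them to your own naming.
# "list_sensitive_files" is an illustrative name, not a script in this repo.
list_sensitive_files() {
  find "${1:-.}" -type f \( \
      -name 'munge.key' \
      -o -name '*.sql' -o -name '*.sql.gz' \
      -o -name '*.tar.gz' \
      -o -name 'id_rsa' -o -name 'id_ed25519' \
      -o -name 'vault.yml' \
    \) ! -path '*/.git/*' | sort
}

# Usage (from the repository root): review anything printed before committing.
#   list_sensitive_files .
```

The example vault file vault.example.yml is deliberately not matched, since it is meant to be published.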