# Ansible Slurm AI/HPC Lab
Ansible automation for a small Slurm AI/HPC lab with CPU nodes, a GPU node, Munge, cgroups, GRES, SlurmDBD accounting, QOS/fairshare, node lifecycle workflows, rolling OS upgrades and health remediation.

This repository is sanitized for publication. Replace the example inventory values under `inventories/lab/` with your own hostnames, IP addresses and users before running it.
## What this lab covers
- Slurm controller and worker configuration
- Munge key distribution
- GPU GRES configuration
- cgroup CPU/GPU/device enforcement
- SlurmDBD + MariaDB accounting
- `sacct`, `sreport`, `sacctmgr` validation
- QOS, limits, fairshare and priority/multifactor
- Node provisioning and decommissioning
- Rolling OS upgrades with canary validation
- Health checks and node auto-remediation
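
As a rough illustration of the `sacct`, `sreport` and `sacctmgr` validation item above, the sketch below runs one read-only query with each tool. It is not taken from the playbooks in this repository: the function names `have_cmd` and `validate_accounting`, the one-hour window and the chosen output columns are all assumptions made for the example.

```bash
# Hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical smoke test for Slurm accounting. It only reads data.
validate_accounting() {
  for tool in sacctmgr sacct sreport; do
    if ! have_cmd "$tool"; then
      # Not a Slurm host (or Slurm is not installed yet): skip quietly.
      echo "skipped: $tool not found on this host"
      return 0
    fi
  done

  # Clusters registered with SlurmDBD.
  sacctmgr show cluster

  # Allocations for all users during the last hour (the window is an assumption).
  sacct -a -X --starttime=now-1hours --format=JobID,JobName,State,Elapsed

  # Cluster utilization report for sreport's default time window.
  sreport cluster utilization
}
```

Calling `validate_accounting` on the controller after a test job has finished should show that job in the `sacct` output if accounting is wired up; on a host without the Slurm client tools it prints a skip message and returns.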
## Repository layout
```text
inventories/lab/ Example inventory and group variables
templates/ Slurm, cgroup, gres and slurmdbd templates
playbooks/bootstrap/ Initial SSH, sudo and /etc/hosts setup
playbooks/core/ Munge, Slurm config and safe restart workflows
playbooks/accounting/ SlurmDBD, backup/restore-check and accounting validation
playbooks/qos/ QOS, fairshare and priority configuration
playbooks/lifecycle/ Provisioning and decommissioning nodes
playbooks/upgrade/ Rolling OS upgrade and canary workflow
playbooks/health/ Health checks and auto-remediation
playbooks/tests/ CPU/GPU/cgroup/accounting validation jobs
playbooks/backup/ Slurm config backup helpers
docs/ Runbooks and interview notes
prompts/codex/ Prompts for generating or expanding documentation
```
## Quick start
1. Edit `inventories/lab/inventory.yml`.
2. Edit `inventories/lab/group_vars/slurm_cluster.yml`.
3. Create and encrypt a vault file for database credentials:
```bash
cp inventories/lab/group_vars/vault.example.yml inventories/lab/group_vars/vault.yml
ansible-vault encrypt inventories/lab/group_vars/vault.yml
```
4. Run syntax checks:
```bash
find playbooks -name '*.yml' -print0 | xargs -0 -n1 ansible-playbook --syntax-check
```
5. Run the bootstrap/core workflows in the order described in `docs/runbook.md`.
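
The syntax check in step 4 can also be written as a loop that lists every failing playbook and prints a summary. This is a minimal sketch, not part of the repository: `syntax_check_all` is a hypothetical name, the checker command is passed in as arguments, and file names containing newlines are not handled.

```bash
# Hypothetical helper: run a checker command against every *.yml file under a directory.
# Usage: syntax_check_all <dir> <checker> [checker-args...]
syntax_check_all() {
  dir=$1
  shift
  failed=0
  list=$(mktemp)

  # Collect the playbooks first so the loop does not run in a subshell.
  find "$dir" -name '*.yml' | sort > "$list"

  while IFS= read -r playbook; do
    # </dev/null keeps the checker from consuming the file list on stdin.
    if "$@" "$playbook" >/dev/null 2>&1 </dev/null; then
      echo "ok   $playbook"
    else
      echo "FAIL $playbook"
      failed=$((failed + 1))
    fi
  done < "$list"

  rm -f "$list"
  echo "$failed playbook(s) failed"
  [ "$failed" -eq 0 ]
}
```

For this repository the call would be `syntax_check_all playbooks ansible-playbook --syntax-check`. The function returns a non-zero status if any playbook fails, so it can gate a CI step.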
## Security notes
Do not commit real inventories, backup archives, SQL dumps, Munge keys, private SSH keys or Ansible Vault files. This repository intentionally excludes generated backup artifacts.
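
One way to back these rules with a check is a small guard that inspects staged paths before a commit. The sketch below is an assumption-laden example and not something this repository ships: the function names and the file-name patterns are hypothetical and should be adapted to your own layout. The only external fact it relies on is that Ansible Vault files start with a `$ANSIBLE_VAULT;` header line.

```bash
# Hypothetical patterns for files that should never be committed.
is_sensitive_path() {
  case "$1" in
    *munge.key|*.sql|*.sql.gz|*id_rsa|*id_ed25519|*.pem) return 0 ;;
    vault.yml|*/vault.yml) return 0 ;;
    *) return 1 ;;
  esac
}

# Local sanity check: succeeds if the file's first line is an Ansible Vault header.
is_vault_encrypted() {
  [ -f "$1" ] || return 1
  first_line=
  IFS= read -r first_line < "$1" || true
  case "$first_line" in
    '$ANSIBLE_VAULT;'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Reads newline-separated paths on stdin, prints offenders, fails if any were found.
check_staged_paths() {
  status=0
  while IFS= read -r file_path; do
    if is_sensitive_path "$file_path"; then
      echo "refusing to commit: $file_path"
      status=1
    fi
  done
  return "$status"
}
```

In a Git pre-commit hook this could be driven with `git diff --cached --name-only | check_staged_paths`. `is_vault_encrypted inventories/lab/group_vars/vault.yml` is a quick way to confirm that the encryption step in the quick start actually ran.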