# Ansible Slurm AI/HPC Lab

Ansible automation for a small Slurm AI/HPC lab with CPU nodes, a GPU node, Munge, cgroups, GRES, SlurmDBD accounting, QOS/fairshare, node lifecycle workflows, rolling OS upgrades and health remediation.

This repository is sanitized for publication. Replace the example inventory values under `inventories/lab/` with your own hostnames, IP addresses and users before running it.

## What this lab covers

- Slurm controller and worker configuration
- Munge key distribution
- GPU GRES configuration
- cgroup CPU/GPU/device enforcement
- SlurmDBD + MariaDB accounting
- `sacct`, `sreport`, `sacctmgr` validation
- QOS, limits, fairshare and priority/multifactor
- Node provisioning and decommissioning
- Rolling OS upgrades with canary validation
- Health checks and node auto-remediation
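
Once the cluster is up, several of the items above can be spot-checked by hand with the standard Slurm client tools. The following is a minimal sketch, assuming those tools are on the `PATH` of the controller; the function name `validate_cluster` is hypothetical and is not one of this repository's playbooks.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of this repository: spot-check GRES,
# QOS and accounting from a node with the Slurm client tools installed.

validate_cluster() {
  # Skip quietly on machines without Slurm (for example a laptop).
  if ! command -v sinfo >/dev/null 2>&1; then
    echo "Slurm client tools not found; skipping checks."
    return 0
  fi

  # Node names, states and generic resources (GPUs appear in the GRES column).
  sinfo -N -o '%N %T %G'

  # Clusters registered with SlurmDBD.
  sacctmgr -n show cluster format=Cluster

  # QOS definitions with their priority and wall-time limit.
  sacctmgr show qos format=Name,Priority,MaxWall

  # Jobs recorded by accounting (sacct defaults to jobs since midnight),
  # one line per allocation.
  sacct -X --format=JobID,JobName,Partition,State,Elapsed,AllocTRES

  # Cluster utilization report (sreport defaults to the previous day).
  sreport cluster utilization
}
```

Calling `validate_cluster` on the controller prints each report in turn; an empty `sacct` listing usually just means no jobs have run yet today.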

## Repository layout

```text
inventories/lab/       Example inventory and group variables
templates/             Slurm, cgroup, gres and slurmdbd templates
playbooks/bootstrap/   Initial SSH, sudo and /etc/hosts setup
playbooks/core/        Munge, Slurm config and safe restart workflows
playbooks/accounting/  SlurmDBD, backup/restore-check and accounting validation
playbooks/qos/         QOS, fairshare and priority configuration
playbooks/lifecycle/   Provisioning and decommissioning nodes
playbooks/upgrade/     Rolling OS upgrade and canary workflow
playbooks/health/      Health checks and auto-remediation
playbooks/tests/       CPU/GPU/cgroup/accounting validation jobs
playbooks/backup/      Slurm config backup helpers
docs/                  Operational runbook
prompts/codex/         Prompts for generating or expanding documentation
```

## Quick start

1. Edit `inventories/lab/inventory.yml`.
2. Edit `inventories/lab/group_vars/slurm_cluster.yml`.
3. Create and encrypt a vault file for database credentials:

```bash
cp inventories/lab/group_vars/vault.example.yml inventories/lab/group_vars/vault.yml
ansible-vault encrypt inventories/lab/group_vars/vault.yml
```

4. Run syntax checks:

```bash
find playbooks -name '*.yml' -print0 | xargs -0 -n1 ansible-playbook --syntax-check
```

5. Run the bootstrap/core workflows in the order described in `docs/runbook.md`.
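
The syntax-check one-liner in step 4 runs every playbook, but failures can be hard to spot in the combined output. As an alternative, here is a minimal sketch of a loop that collects the failing files and lists them at the end. The function name `check_playbooks` and the `SYNTAX_CHECK_CMD` override are assumptions made for illustration, not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper, not part of this repository: syntax-check each
# playbook under a directory and list the failures at the end.
# SYNTAX_CHECK_CMD is an assumed override so the loop can be tried without Ansible.

check_playbooks() {
  local dir="${1:-playbooks}"
  local -a cmd failed=()
  local file

  # Split the command string into words, e.g. "ansible-playbook --syntax-check".
  read -r -a cmd <<< "${SYNTAX_CHECK_CMD:-ansible-playbook --syntax-check}"

  if [[ ! -d "$dir" ]]; then
    echo "no such directory: $dir" >&2
    return 2
  fi

  # NUL-delimited file names keep paths with spaces intact.
  while IFS= read -r -d '' file; do
    # </dev/null stops the checker from consuming the file list on stdin.
    "${cmd[@]}" "$file" </dev/null || failed+=("$file")
  done < <(find "$dir" -name '*.yml' -print0)

  if (( ${#failed[@]} > 0 )); then
    printf 'FAILED: %s\n' "${failed[@]}"
    return 1
  fi
  echo "All playbooks passed the syntax check."
}
```

Running `check_playbooks playbooks` from the repository root returns a non-zero status if any file fails, which makes it usable as a CI step.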

## Security notes

Do not commit real inventories, backup archives, SQL dumps, Munge keys, private SSH keys or Ansible Vault files. This repository intentionally excludes generated backup artifacts.
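
One way to enforce the rule above is to check staged paths before committing. The following is a minimal sketch; the function names and the file-name patterns are illustrative assumptions and should be adapted to the names your backups and keys actually use. Real inventories cannot be recognized by name, so they still need manual care.

```bash
#!/usr/bin/env bash
# Hypothetical guard, not part of this repository: flag paths that look like
# secrets or generated artifacts. The patterns are illustrative assumptions.

is_sensitive_path() {
  case "$1" in
    *.sql|*.sql.gz)             return 0 ;;  # SQL dumps
    *.tar|*.tar.gz|*.tgz)       return 0 ;;  # backup archives
    *munge.key)                 return 0 ;;  # Munge keys
    *id_rsa|*id_ed25519|*.pem)  return 0 ;;  # private SSH keys
    *vault.yml)                 return 0 ;;  # Ansible Vault files
    *)                          return 1 ;;
  esac
}

# Print every sensitive path among the arguments; fail if any were found.
check_paths() {
  local path found=0
  for path in "$@"; do
    if is_sensitive_path "$path"; then
      echo "refusing to commit: $path"
      found=1
    fi
  done
  return "$found"
}
```

In a Git pre-commit hook, `check_paths $(git diff --cached --name-only)` would reject a staged `vault.yml` while letting `vault.example.yml` through; file names containing spaces would need more careful handling than this unquoted expansion.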