# Ansible Slurm AI/HPC Lab

Ansible automation for a small Slurm AI/HPC lab with CPU nodes, a GPU node, Munge, cgroups, GRES, SlurmDBD accounting, QOS/fairshare, node lifecycle workflows, rolling OS upgrades and health remediation.

This repository is sanitized for publication. Replace the example inventory values under `inventories/lab/` with your own hostnames, IP addresses and users before running it.

## What this lab covers

- Slurm controller and worker configuration
- Munge key distribution
- GPU GRES configuration
- cgroup CPU/GPU/device enforcement
- SlurmDBD + MariaDB accounting
- `sacct`, `sreport`, `sacctmgr` validation
- QOS, limits, fairshare and priority/multifactor
- Node provisioning and decommissioning
- Rolling OS upgrades with canary validation
- Health checks and node auto-remediation

## Repository layout

```text
inventories/lab/        Example inventory and group variables
templates/              Slurm, cgroup, gres and slurmdbd templates
playbooks/bootstrap/    Initial SSH, sudo and /etc/hosts setup
playbooks/core/         Munge, Slurm config and safe restart workflows
playbooks/accounting/   SlurmDBD, backup/restore-check and accounting validation
playbooks/qos/          QOS, fairshare and priority configuration
playbooks/lifecycle/    Provisioning and decommissioning nodes
playbooks/upgrade/      Rolling OS upgrade and canary workflow
playbooks/health/       Health checks and auto-remediation
playbooks/tests/        CPU/GPU/cgroup/accounting validation jobs
playbooks/backup/       Slurm config backup helpers
docs/                   Runbooks and interview notes
prompts/codex/          Prompts for generating or expanding documentation
```

## Quick start

1. Edit `inventories/lab/inventory.yml`.
2. Edit `inventories/lab/group_vars/slurm_cluster.yml`.
3. Create and encrypt a vault file for database credentials:

```bash
cp inventories/lab/group_vars/vault.example.yml inventories/lab/group_vars/vault.yml
ansible-vault encrypt inventories/lab/group_vars/vault.yml
```

4. Run syntax checks:

```bash
find playbooks -name '*.yml' -print0 | xargs -0 -n1 ansible-playbook --syntax-check
```

5. Run the bootstrap/core workflows in the order described in `docs/runbook.md`.

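
Before running any playbook, it may be worth confirming that step 3 actually left the vault file encrypted. The sketch below assumes the standard Ansible Vault file format, where the first line starts with `$ANSIBLE_VAULT;`. The function name `is_vault_encrypted` is hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): succeeds only if the
# first line of the given file starts with the Ansible Vault header.
is_vault_encrypted() {
  local file="$1" first_line
  # A missing or unreadable file counts as "not encrypted".
  [ -r "$file" ] || return 1
  # Read the first line; tolerate a file that lacks a trailing newline.
  IFS= read -r first_line < "$file" || [ -n "$first_line" ] || return 1
  case "$first_line" in
    '$ANSIBLE_VAULT;'*) return 0 ;;
    *) return 1 ;;
  esac
}
```

One possible use is `is_vault_encrypted inventories/lab/group_vars/vault.yml || echo "vault.yml is not encrypted" >&2`, run just before the workflows in step 5.
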
## Security notes

Do not commit real inventories, backup archives, SQL dumps, Munge keys, private SSH keys or Ansible Vault files. This repository intentionally excludes generated backup artifacts.

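
The rule above could be enforced mechanically, for example from a Git pre-commit hook. The following is a minimal sketch rather than anything shipped in this repository: the function names `is_sensitive_path` and `report_sensitive_paths` are hypothetical, and the file-name patterns are illustrative assumptions that will need adjusting to your own naming. It cannot recognise a real inventory by name, so that part of the rule still needs manual review.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): succeeds if a path looks
# like one of the artifact types the security notes say not to commit.
# The patterns below are illustrative assumptions, not an exhaustive list.
is_sensitive_path() {
  local name="${1##*/}"   # strip directories, keep only the file name
  case "$name" in
    munge.key|vault.yml|id_rsa|id_ed25519|id_ecdsa) return 0 ;;
    *.sql|*.sql.gz|*.tar.gz|*.tgz|*.pem) return 0 ;;
    *) return 1 ;;
  esac
}

# Read paths from stdin, one per line, and print each sensitive-looking one.
# Returns 1 if any were found, 0 otherwise.
report_sensitive_paths() {
  local path found=0
  while IFS= read -r path; do
    if is_sensitive_path "$path"; then
      printf 'refusing: %s\n' "$path"
      found=1
    fi
  done
  return "$found"
}
```

In a pre-commit hook, the staged file list could be fed in with `git diff --cached --name-only | report_sensitive_paths`, letting the non-zero exit status abort the commit. The example file `vault.example.yml` does not match any pattern, so it remains committable.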