Add Slurm AI/HPC cluster platform project
lint / shell-yaml-ansible (push) Failing after 47s

This commit is contained in:
Mateusz Suski
2026-06-04 19:41:05 +00:00
parent e2624a7533
commit d300d490f5
49 changed files with 4777 additions and 0 deletions
@@ -0,0 +1,62 @@
# Slurm AI/HPC Lab Runbook
## Standard deployment order
```bash
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
ansible-playbook playbooks/core/manage-munge.yml
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml
ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
ansible-playbook playbooks/health/check-slurm-health.yml
```
## Node lifecycle
Provision a node:
```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
```
Decommission a node:
```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
```
Repair a node:
```bash
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
```
## Rolling OS upgrade
```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```
If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.