2026-06-04 19:41:05 +00:00
# Slurm AI/HPC Lab Runbook
## Standard deployment order
```bash
# Bootstrap: Ansible access, hosts, SSH mesh, sudoers
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml

# Core: Munge, then Slurm config (dry run with --check first, then apply)
ansible-playbook playbooks/core/manage-munge.yml
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml

# Tests
ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml

# Accounting (slurmdbd)
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml

# QoS
ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml

# Health
ansible-playbook playbooks/health/check-slurm-health.yml
```
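Because the order matters, it can help to stop at the first failing playbook rather than continue into later stages. The following is a minimal sketch, not part of the repository: `run_in_order` and the `RUNNER` override are hypothetical names introduced here, and the function assumes it is called from the repository root.

```bash
# Sketch only: run playbooks in the given order and stop at the first failure.
# RUNNER is a hypothetical override (e.g. RUNNER=echo) for rehearsing the order.
RUNNER="${RUNNER:-ansible-playbook}"

run_in_order() {
  local playbook
  for playbook in "$@"; do
    echo ">>> ${playbook}"
    # Abort on the first failure so later playbooks are never attempted.
    "$RUNNER" "$playbook" || { echo "FAILED: ${playbook}" >&2; return 1; }
  done
}
```

For example, `run_in_order playbooks/core/manage-munge.yml playbooks/core/restart-slurm-safe.yml` would run the two playbooks in sequence. Setting `RUNNER=echo` beforehand prints each playbook path instead of running it.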
## Node lifecycle
Provision a node:
```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
```
Decommission a node:
```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
```
Repair a node:
```bash
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
```
Run health remediation for nodes that can be recovered by the automated workflow:
```bash
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
```
Back up Slurm and Munge state before planned lifecycle work:
```bash
ansible-playbook playbooks/backup/backup-slurm-state.yml
ansible-playbook playbooks/backup/fetch-slurm-backups.yml
```
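One way to make the backup a hard precondition is to chain it in front of the lifecycle playbook, so the lifecycle step only starts if both backup playbooks succeed. This is a sketch under assumptions: `backup_then` and the `RUNNER` override are hypothetical helpers, not something the repository is known to provide.

```bash
# Sketch only: run both backup playbooks, then the requested lifecycle command.
# RUNNER is a hypothetical override (e.g. RUNNER=echo) for a dry rehearsal.
RUNNER="${RUNNER:-ansible-playbook}"

backup_then() {
  # Each step runs only if the previous one exited successfully.
  "$RUNNER" playbooks/backup/backup-slurm-state.yml &&
  "$RUNNER" playbooks/backup/fetch-slurm-backups.yml &&
  "$RUNNER" "$@"
}
```

For example, `backup_then playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"` would decommission the node only after the backups have been taken and fetched.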
## Rolling OS upgrade
```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```
If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
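
The presence check can be scripted so an automated run skips the controller step cleanly instead of failing on a missing file. A minimal sketch follows. The `upgrade_controller` function, the `RUNNER` override, and the exit code `2` are assumptions made for illustration.

```bash
# Sketch only: run the controller upgrade playbook if it exists, else skip.
# RUNNER is a hypothetical override (e.g. RUNNER=echo) for a dry rehearsal.
RUNNER="${RUNNER:-ansible-playbook}"

upgrade_controller() {
  local playbook="${1:-playbooks/upgrade/upgrade-slurm-controller.yml}"
  if [ -f "$playbook" ]; then
    "$RUNNER" "$playbook"
  else
    # Distinct exit code so callers can tell "skipped" from "failed".
    echo "SKIP: ${playbook} not found; upgrade the controller manually" >&2
    return 2
  fi
}
```

Called with no arguments from the repository root, `upgrade_controller` checks the default path used in the command list above.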