portfolio/platform-projects/hpc-slurm-ai-cluster/docs/runbook.md
Mateusz Suski 83877fb598
Document Slurm AI/HPC cluster project
2026-06-04 19:54:43 +00:00

Slurm AI/HPC Lab Runbook

Standard deployment order

ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml

ansible-playbook playbooks/core/manage-munge.yml
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml

ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml

ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml

ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml

ansible-playbook playbooks/health/check-slurm-health.yml
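
The playbooks above are meant to be run top to bottom. A minimal sketch of a wrapper that runs a list of playbooks in sequence and stops at the first failure is shown here. The names `run_in_order` and `RUNNER` are illustrative assumptions, not part of the repository; setting `RUNNER=echo` prints each playbook path instead of executing it.

```shell
# Sketch only: run playbooks one after another, stopping at the first failure.
# RUNNER and run_in_order are hypothetical names, not part of the repository.
RUNNER="${RUNNER:-ansible-playbook}"

run_in_order() {
    for playbook in "$@"; do
        echo ">>> ${playbook}"
        if ! "${RUNNER}" "${playbook}"; then
            # Report which playbook failed and skip everything after it.
            echo "FAILED: ${playbook}" >&2
            return 1
        fi
    done
}

# Example (not executed here):
#   run_in_order \
#       playbooks/core/manage-munge.yml \
#       playbooks/core/restart-slurm-safe.yml
```

Playbooks that need extra flags, such as the `--check --diff` preview of `manage-slurm-config.yml`, are probably best run by hand so the diff can be reviewed before the real run.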

Node lifecycle

Provision a node:

ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
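
After provisioning, it can be useful to confirm that Slurm sees the node in a schedulable state. The sketch below assumes `sinfo` is available on the host where it runs; the function names `node_state` and `node_is_schedulable` are hypothetical, and the list of states treated as healthy is an assumption that may need adjusting for this cluster.

```shell
# Sketch only: query the state Slurm reports for a single node.
# node_state and node_is_schedulable are hypothetical helper names.

# Print the extended state of one node, e.g. "idle", "mixed", "drained".
node_state() {
    # -h: no header, -N: node-oriented output, -n: restrict to this node,
    # -o '%T': extended node state. sort -u collapses duplicate lines
    # when the node belongs to more than one partition.
    sinfo -h -N -n "$1" -o '%T' | sort -u | head -n 1
}

# Succeed only if the node is in a state where it can run jobs.
# States with a suffix such as "idle*" (not responding) do not match.
node_is_schedulable() {
    case "$(node_state "$1")" in
        idle|mixed|allocated) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (not executed here):
#   node_is_schedulable slurm-c02 || echo "slurm-c02 is not ready" >&2
```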

Decommission a node:

ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
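
Before decommissioning, an operator may want to see whether jobs are still allocated to the node. The following sketch assumes `squeue` is available and that the decommission playbook does not already perform this check; `jobs_on_node` and `node_is_empty` are hypothetical helper names.

```shell
# Sketch only: list job IDs currently allocated to a node.
# jobs_on_node and node_is_empty are hypothetical helper names.
jobs_on_node() {
    # -h: no header, -w: only jobs allocated to this node, -o '%i': job ID only.
    squeue -h -w "$1" -o '%i'
}

# Succeed only if no jobs are allocated to the node.
node_is_empty() {
    [ -z "$(jobs_on_node "$1")" ]
}

# Example (not executed here):
#   node_is_empty slurm-c02 || echo "jobs still running on slurm-c02" >&2
```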

Repair a node:

ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02

Run health remediation for nodes that can be recovered by the automated workflow:

ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
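
To see which nodes are candidates for remediation before running the playbook, a small helper along these lines may help. It assumes `sinfo` is available; `unhealthy_nodes` is a hypothetical name, and the state filter is an assumption about what this cluster treats as unhealthy.

```shell
# Sketch only: list nodes that Slurm reports as down, drained/draining, or failed.
# unhealthy_nodes is a hypothetical helper name.
unhealthy_nodes() {
    # -h: no header, -N: one line per node, -t: filter by node state,
    # -o '%N %T': node name followed by its extended state.
    # sort -u removes duplicates for nodes that sit in several partitions.
    sinfo -h -N -t down,drain,fail -o '%N %T' | sort -u
}

# Example (not executed here):
#   unhealthy_nodes
```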

Back up Slurm and Munge state before planned lifecycle work:

ansible-playbook playbooks/backup/backup-slurm-state.yml
ansible-playbook playbooks/backup/fetch-slurm-backups.yml
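
Before starting lifecycle work, it is worth confirming that a fresh backup was actually fetched. The sketch below checks a directory for any file modified within a time window. The runbook does not say where the fetch playbook stores its output, so the directory passed in (for example `./backups`) is an assumption, as is the helper name `has_recent_backup`.

```shell
# Sketch only: succeed if a directory holds at least one file modified recently.
# has_recent_backup is a hypothetical helper; the backup directory is an assumption.
has_recent_backup() {
    dir="$1"
    max_age_min="${2:-60}"
    [ -d "$dir" ] || return 1
    # -mmin -N matches files modified less than N minutes ago.
    [ -n "$(find "$dir" -type f -mmin "-${max_age_min}" | head -n 1)" ]
}

# Example (not executed here):
#   has_recent_backup ./backups 60 || echo "no backup from the last hour" >&2
```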

Rolling OS upgrade

ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml

If playbooks/upgrade/upgrade-slurm-controller.yml is not present, either create it from the documented controller upgrade workflow or continue to upgrade the controller manually.
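
The upgrade sequence can be chained so that each step gates the next and the controller step is skipped when its playbook is missing. This is a sketch under stated assumptions: `RUNNER`, `CONTROLLER_PLAYBOOK`, and `rolling_os_upgrade` are hypothetical names, and `RUNNER=echo` prints the commands instead of running them.

```shell
# Sketch only: chain the upgrade playbooks so a failure stops the sequence.
# RUNNER, CONTROLLER_PLAYBOOK and rolling_os_upgrade are hypothetical names.
RUNNER="${RUNNER:-ansible-playbook}"
CONTROLLER_PLAYBOOK="${CONTROLLER_PLAYBOOK:-playbooks/upgrade/upgrade-slurm-controller.yml}"

rolling_os_upgrade() {
    canary="$1"
    if [ -z "$canary" ]; then
        echo "usage: rolling_os_upgrade <canary_node>" >&2
        return 64
    fi

    # 1. Upgrade the canary node first; stop if it fails.
    "$RUNNER" playbooks/upgrade/canary-slurm-node-upgrade.yml \
        -e "canary_node=${canary}" || return 1

    # 2. Upgrade the remaining workers, passing the same flags as the runbook.
    "$RUNNER" playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
        -e "canary_node=${canary}" -e skip_canary=true || return 1

    # 3. Upgrade the controller only if its playbook exists.
    if [ -f "$CONTROLLER_PLAYBOOK" ]; then
        "$RUNNER" "$CONTROLLER_PLAYBOOK" || return 1
    else
        echo "NOTE: ${CONTROLLER_PLAYBOOK} not found; upgrade the controller manually." >&2
    fi

    # 4. Validate the cluster after the upgrade.
    "$RUNNER" playbooks/upgrade/validate-after-os-upgrade.yml
}

# Example (not executed here):
#   rolling_os_upgrade slurm-c02
```

When the controller playbook is missing, the sketch still runs the validation step, so a manual controller upgrade would need to happen separately.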