# Slurm AI/HPC Lab Runbook

## Standard deployment order

```bash
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
ansible-playbook playbooks/core/manage-munge.yml
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml
ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
ansible-playbook playbooks/health/check-slurm-health.yml
```

## Node lifecycle

Provision a node:

```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
```

Decommission a node:

```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
```

Repair a node:

```bash
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
```

Run health remediation for nodes that can be recovered by the automated workflow:

```bash
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
```

Back up Slurm and Munge state before planned lifecycle work:

```bash
ansible-playbook playbooks/backup/backup-slurm-state.yml
ansible-playbook playbooks/backup/fetch-slurm-backups.yml
```

## Rolling OS upgrade

```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```

If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
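
The fallback for a missing `upgrade-slurm-controller.yml` could be scripted as a small guard around the controller step. The following is a minimal sketch, not part of the repository: the function names `controller_upgrade_mode` and `run_controller_upgrade` are hypothetical, and it assumes the repository root is passed as the first argument (defaulting to the current directory).

```bash
#!/usr/bin/env bash
# Sketch only: both function names are illustrative assumptions,
# not helpers shipped with this repository.

# Print "automated" if the controller upgrade playbook exists under the
# given repository root (default: current directory), otherwise "manual".
controller_upgrade_mode() {
  local root="${1:-.}"
  if [ -f "${root}/playbooks/upgrade/upgrade-slurm-controller.yml" ]; then
    echo "automated"
  else
    echo "manual"
  fi
}

# Run the controller upgrade playbook when it is present; otherwise
# print a reminder on stderr and return non-zero so a wrapper script stops.
run_controller_upgrade() {
  local root="${1:-.}"
  if [ "$(controller_upgrade_mode "$root")" = "automated" ]; then
    (cd "$root" && ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml)
  else
    echo "upgrade-slurm-controller.yml not found; upgrade the controller manually." >&2
    return 1
  fi
}
```

With the functions sourced, calling `run_controller_upgrade .` from the repository root would either run the playbook or stop with a reminder, so the validation step is not run against a controller that was never upgraded.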