# Slurm AI/HPC Lab Runbook

## Standard deployment order

```bash
# Bootstrap (the first playbook prompts for the SSH and become passwords)
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml

# Core: MUNGE, then Slurm config (dry run with --check --diff before applying)
ansible-playbook playbooks/core/manage-munge.yml
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml

# Tests
ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml

# Accounting (slurmdbd)
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml

# QoS
ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml

# Health check
ansible-playbook playbooks/health/check-slurm-health.yml
```
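
Because the order matters, it can help to stop at the first failing playbook instead of pasting the whole list into a terminal. The following is a minimal sketch, not part of the lab repository: `run_in_order` and the `ANSIBLE_PLAYBOOK` override are hypothetical names introduced here, and the function passes no extra flags, so steps that need `--ask-pass`, `--check`, or `--diff` still have to be run by hand.

```bash
#!/usr/bin/env bash
# Sketch only: run_in_order and ANSIBLE_PLAYBOOK are illustrative names,
# not something the lab repository is known to provide.

# Allow the command to be overridden (for example, for a dry test harness).
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# Run each playbook given as an argument, in order.
# Stop and return 1 as soon as one of them fails.
run_in_order() {
  local playbook
  for playbook in "$@"; do
    echo "==> ${playbook}"
    if ! "${ANSIBLE_PLAYBOOK}" "${playbook}"; then
      echo "FAILED: ${playbook}; later playbooks were not run" >&2
      return 1
    fi
  done
}

# Example (not executed here):
#   run_in_order playbooks/core/manage-munge.yml playbooks/core/restart-slurm-safe.yml
```

Returning a non-zero status rather than calling `exit` keeps the function safe to source into an interactive shell.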

## Node lifecycle

Provision a node:

```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
```

Decommission a node:

```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
```

Repair a node:

```bash
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
```
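
After any of the lifecycle playbooks above, it may be worth confirming what state Slurm itself reports for the node. The sketch below is an assumption about how one might do that with `sinfo`; the helper names `node_state` and `node_is_schedulable` are hypothetical, and treating only `idle`, `mixed`, and `allocated` as schedulable is a simplification chosen for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and the list of "schedulable" states are
# assumptions made for this example, not taken from the lab playbooks.

# Allow the command to be overridden (for example, on a host without Slurm).
SINFO="${SINFO:-sinfo}"

# Print the state Slurm reports for one node.
# -h drops the header, -N lists per node, -n selects the node, %T is the state.
# A node in several partitions is listed once per partition, so deduplicate.
node_state() {
  "${SINFO}" -h -N -n "$1" -o '%T' | sort -u | head -n 1
}

# Succeed only if the node is in a state that can accept or is running jobs.
# States with flag suffixes (such as "idle*") are deliberately not matched.
node_is_schedulable() {
  case "$(node_state "$1")" in
    idle|mixed|allocated) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (not executed here):
#   node_is_schedulable slurm-c02 && echo "slurm-c02 is back in service"
```

After a decommission, the same check would be expected to fail, since a drained or down node should not match any of the listed states.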

## Rolling OS upgrade

```bash
# Workers: canary node first, then the rolling upgrade
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
# Controller, then post-upgrade validation
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```

If `playbooks/upgrade/upgrade-slurm-controller.yml` does not exist, either create it from the documented controller upgrade workflow or upgrade the controller manually.
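
The presence check described above can be made explicit so that a missing controller playbook is reported rather than surfacing as an Ansible error. This is a minimal sketch under stated assumptions: `upgrade_controller` and the `ANSIBLE_PLAYBOOK` override are hypothetical names, and the return code 2 for the manual case is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Sketch only: upgrade_controller and ANSIBLE_PLAYBOOK are illustrative names.

CONTROLLER_PLAYBOOK="playbooks/upgrade/upgrade-slurm-controller.yml"
ANSIBLE_PLAYBOOK="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

# Run the controller upgrade playbook if it exists.
# Otherwise print a reminder and return 2 so callers can tell
# "handle manually" apart from a failed playbook run.
upgrade_controller() {
  local playbook="${1:-$CONTROLLER_PLAYBOOK}"
  if [ -f "${playbook}" ]; then
    "${ANSIBLE_PLAYBOOK}" "${playbook}"
  else
    echo "SKIP: ${playbook} not found; upgrade the controller manually" >&2
    return 2
  fi
}

# Example (not executed here):
#   upgrade_controller
```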