Document Slurm AI/HPC cluster project

Mateusz Suski
2026-06-04 19:54:43 +00:00
parent d300d490f5
commit 83877fb598
5 changed files with 239 additions and 40 deletions
@@ -50,6 +50,19 @@ Repair a node:
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
```
Run health remediation for nodes that can be recovered by the automated workflow:
```bash
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
```
Back up Slurm and Munge state before planned lifecycle work:
```bash
ansible-playbook playbooks/backup/backup-slurm-state.yml
ansible-playbook playbooks/backup/fetch-slurm-backups.yml
```
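The two backup commands above can be chained so that the fetch step runs only when the backup itself succeeded. The following is a minimal sketch, not part of the project: the function name `backup_slurm_state` and the `ANSIBLE_PLAYBOOK` override variable are hypothetical, introduced only so the wrapper can be dry-run without touching the cluster.

```bash
#!/usr/bin/env bash
# Sketch only: backup_slurm_state and ANSIBLE_PLAYBOOK are hypothetical
# names for illustration; the playbook paths are the ones documented above.

backup_slurm_state() {
  # Allow the runner to be swapped out (e.g. ANSIBLE_PLAYBOOK=echo for a dry run).
  local runner="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

  # If the backup playbook fails, stop: there is nothing valid to fetch.
  "$runner" playbooks/backup/backup-slurm-state.yml || {
    echo "backup failed; skipping fetch" >&2
    return 1
  }

  # Fetch the backups only after a successful backup run.
  "$runner" playbooks/backup/fetch-slurm-backups.yml
}
```

Calling `backup_slurm_state` before lifecycle work and continuing only on a zero exit status keeps a failed backup from going unnoticed. Running it with `ANSIBLE_PLAYBOOK=echo` prints the playbook paths instead of executing them.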
## Rolling OS upgrade
```bash