Compare commits
2 Commits
| Author | SHA1 | Date | |
|---|---|---|---|
| | 1843796e92 | | |
| | cd6830334b | | |
@@ -73,7 +73,7 @@ playbooks/health/ Health checks, repair, and auto-remediation
 playbooks/tests/ CPU, GPU, cgroup, accounting, and reporting validation jobs
 playbooks/backup/ Slurm and Munge state backup helpers
 templates/ Slurm, cgroup, GRES, and SlurmDBD templates
-docs/ Runbook, interview notes, and troubleshooting cases
+docs/ Operational runbook
 prompts/ Documentation prompts used to expand this project
 ```
 
@@ -188,7 +188,6 @@ This is more than a toy lab because it includes operational controls around the
 - Rolling upgrade playbooks include canary validation before broader worker upgrades.
 - Health and repair playbooks document remediation paths for common node states.
 - Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
-- Troubleshooting cases document real lab failure modes without exposing private infrastructure details.
 
 ## Tested capabilities
 
@@ -232,5 +231,3 @@ This project demonstrates practical understanding of:
 ## Deeper docs
 
 - [Runbook](docs/runbook.md)
-- [Interview cheatsheet](docs/interview-cheatsheet.md)
-- [Troubleshooting cases](docs/troubleshooting-cases.md)
@@ -1,22 +0,0 @@
-# Interview Cheatsheet: Slurm AI/HPC Lab
-
-## One-minute summary
-
-I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.
-
-## Topics I can discuss
-
-- How Slurm schedules CPU and GPU workloads.
-- Difference between GRES scheduling and cgroup device enforcement.
-- Why Munge key consistency matters.
-- How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
-- How QOS, account associations, fairshare and multifactor priority work.
-- Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
-
-## Real troubleshooting examples
-
-- `IDLE+NOT_RESPONDING` after node reprovisioning.
-- Accounting delay where `sacct` temporarily showed `PENDING` while job output existed.
-- Missing `gres/gpu` TRES before QOS GPU limits could be configured.
-- `sacctmgr` idempotency issues such as `Nothing new added`.
-- Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.
@@ -1,28 +0,0 @@
-# Troubleshooting Cases
-
-## `IDLE+NOT_RESPONDING` after node maintenance
-
-Symptoms: `sinfo` shows `idle*` or `scontrol show node` shows `IDLE+NOT_RESPONDING`.
-
-Actions:
-
-```bash
-systemctl restart munge
-systemctl restart slurmd
-systemctl restart slurmctld
-scontrol update NodeName=<node> State=RESUME || true
-scontrol update NodeName=<node> State=UNDRAIN || true
-scontrol update NodeName=<node> State=IDLE || true
-```
-
-## Missing GPU TRES
-
-Symptoms: `sacctmgr` fails with `no TRES known by type gres/gpu`.
-
-Fix: add `AccountingStorageTRES=...,gres/gpu`, restart/reconfigure Slurm, run a GPU job and verify with `sacctmgr show tres`.
-
-## SlurmDBD objects already exist
-
-Symptoms: `sacctmgr` returns `Nothing new added` or `Already existing`.
-
-Fix: make Ansible tasks idempotent: attempt the change, tolerate known existing-object messages, then normalize state with `modify`.
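Note on the removed "SlurmDBD objects already exist" case: its fix is to attempt the change, tolerate known existing-object messages, then normalize with `modify`. The "tolerate" step can be sketched in shell as below. This is a minimal sketch, not the repository's Ansible tasks. The helper names `sacctmgr_output_is_benign` and `run_tolerating_existing` are hypothetical. The only message strings matched are the two quoted in the removed file. Whether `sacctmgr` exits non-zero for them is an assumption that may vary by Slurm version.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of Slurm or Ansible.

# Return 0 when the captured output contains one of the known
# "object already exists" messages quoted in the troubleshooting case.
sacctmgr_output_is_benign() {
    local output="$1"
    case "$output" in
        *"Nothing new added"*|*"Already existing"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Run the given command, echo its combined output, and report success
# if it exited 0 or only complained that the object already exists.
run_tolerating_existing() {
    local output status=0
    output="$("$@" 2>&1)" || status=$?
    printf '%s\n' "$output"
    if [ "$status" -eq 0 ] || sacctmgr_output_is_benign "$output"; then
        return 0
    fi
    return "$status"
}
```

A caller would wrap the `sacctmgr` add command for the object in `run_tolerating_existing`, then run the matching `modify` command to set the desired limits regardless of whether the object was just created. Any other error still fails the step with the original exit status.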