# Slurm AI/HPC Cluster Automation Lab

## Executive summary

This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.

The goal is not to present a certified production platform but to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.

## What this project demonstrates

- Slurm controller and worker node management.
- Munge authentication across the cluster.
- GPU node integration through Slurm GRES.
- cgroup CPU, memory, and GPU device enforcement.
- SlurmDBD with MariaDB-backed accounting.
- `sacct`, `sreport`, and `sacctmgr` workflows.
- QOS, fairshare, and multifactor priority configuration.
- Node provisioning and decommissioning workflows.
- Rolling OS upgrades with canary validation.
- Health checks and auto-remediation.
- Backup and restore-check workflow for the accounting database.
- Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.

## Architecture overview

```mermaid
flowchart LR
    operator[Ansible control node]
    munge[Munge authentication]
    controller[Slurm controller<br/>slurmctld]
    db[MariaDB + SlurmDBD<br/>accounting]
    shared[Shared filesystem<br/>site dependency]
    cpu_part[CPU partition]
    gpu_part[GPU partition]
    cpu_nodes[CPU compute nodes<br/>slurmd]
    gpu_node[GPU node<br/>slurmd + GRES]
    jobs[User jobs<br/>sbatch / srun]

    operator -->|bootstrap and configure| controller
    operator -->|configure workers| cpu_nodes
    operator -->|configure GPU worker| gpu_node
    operator -->|deploy key and service| munge

    munge --> controller
    munge --> cpu_nodes
    munge --> gpu_node

    controller -->|accounting RPC| db
    jobs -->|submit to Slurm| controller
    controller -->|schedule CPU jobs| cpu_part
    controller -->|schedule GPU jobs| gpu_part
    cpu_part --> cpu_nodes
    gpu_part --> gpu_node

    cpu_nodes --- shared
    gpu_node --- shared
    controller --- shared
```

The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.

## Repository layout

```text
inventories/lab/          Sanitized lab inventory and group variables
playbooks/bootstrap/      Initial SSH, sudo, operator user, and host setup
playbooks/core/           Munge, Slurm config, and safe restart workflows
playbooks/accounting/     SlurmDBD, MariaDB, backup, restore-check, and reporting validation
playbooks/qos/            QOS, fairshare, and priority configuration
playbooks/lifecycle/      Node provisioning, inspection, and decommissioning
playbooks/upgrade/        Canary and rolling OS upgrade workflows
playbooks/health/         Health checks, repair, and auto-remediation
playbooks/tests/          CPU, GPU, cgroup, accounting, and reporting validation jobs
playbooks/backup/         Slurm and Munge state backup helpers
templates/                Slurm, cgroup, GRES, and SlurmDBD templates
docs/                     Runbook, interview notes, and troubleshooting cases
prompts/                  Documentation prompts used to expand this project
```

## Main operational workflows

Run commands from `platform-projects/hpc-slurm-ai-cluster/`. Review inventory and variables before running any playbook.

### Bootstrap access

```bash
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
```

### Deploy Munge

```bash
ansible-playbook playbooks/core/manage-munge.yml
```

### Deploy Slurm config

```bash
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml
```
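
The sequence above is a review gate: a dry run with `--check --diff`, then the real apply, then a controlled restart. A minimal sketch of how that gate could be scripted follows. The function name `deploy_with_review` and the `ANSIBLE_PLAYBOOK_BIN` and `APPLY` variables are hypothetical and not part of this repository; only `--check` and `--diff` are standard `ansible-playbook` options.

```bash
# Hypothetical helper, not part of this repository.
# ANSIBLE_PLAYBOOK_BIN is an assumed override so the wrapper can be tried without Ansible.
deploy_with_review() {
  local playbook="$1"
  local runner="${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}"

  # Dry run first: report what would change without applying it.
  "$runner" "$playbook" --check --diff || {
    echo "check run failed; not applying $playbook" >&2
    return 1
  }

  # Apply only when explicitly confirmed, so the diff review stays a manual gate.
  if [ "${APPLY:-no}" != "yes" ]; then
    echo "review the diff above, then re-run with APPLY=yes" >&2
    return 0
  fi

  "$runner" "$playbook" --diff
}
```

For example, `deploy_with_review playbooks/core/manage-slurm-config.yml` would only print the dry-run diff, and the same call prefixed with `APPLY=yes` would apply the change afterwards. Check mode can skip `command` and `shell` tasks, so the dry run may not preview every change.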

### Validate CPU jobs

```bash
ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
```

### Validate GPU jobs

```bash
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
```
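
Slurm sets `CUDA_VISIBLE_DEVICES` for jobs that were allocated GPUs through GRES, which gives a quick way to see what a job was granted. The sketch below is a manual illustration of that idea, not the logic of the test playbooks; the function name `visible_gpu_count` is hypothetical.

```bash
# Hypothetical helper, not part of this repository.
# Prints how many GPU indices the current environment exposes.
visible_gpu_count() {
  local devices="${CUDA_VISIBLE_DEVICES:-}"

  # Unset, empty, or the NoDevFiles marker used by some Slurm versions: no GPU granted.
  if [ -z "$devices" ] || [ "$devices" = "NoDevFiles" ]; then
    echo 0
    return 0
  fi

  # Entries are comma-separated, for example "0,1".
  printf '%s\n' "$devices" | awk -F',' '{ print NF }'
}
```

Inside a job script submitted with `sbatch --gres=gpu:1`, the function would be expected to print `1`; in a job submitted without `--gres`, `0`. The count only reflects the environment. Whether the device files are actually blocked depends on `ConstrainDevices=yes` in `cgroup.conf`.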

### Enable accounting

```bash
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
ansible-playbook playbooks/tests/test-sreport-usage.yml
```
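
Once SlurmDBD is recording jobs, `sacct` output can be filtered to confirm that validation jobs finished cleanly. A small sketch, assuming pipe-delimited input from `sacct -X -n -P -o JobID,State,ExitCode`; the function name `failed_job_ids` is hypothetical and not part of the playbooks.

```bash
# Hypothetical helper, not part of this repository.
# Reads "JobID|State|ExitCode" lines on stdin and prints the IDs of jobs
# that are neither finished successfully nor still pending or running.
failed_job_ids() {
  awk -F'|' '
    NF >= 2 && $2 != "COMPLETED" && $2 != "RUNNING" && $2 != "PENDING" { print $1 }
  '
}

# Example (requires a working accounting setup):
#   sacct -X -n -P -o JobID,State,ExitCode -S today | failed_job_ids
```

An empty result suggests that every job in the queried window completed or is still in progress. States such as `FAILED`, `TIMEOUT`, or `CANCELLED by <uid>` are reported by job ID.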

### Configure QOS and fairshare

```bash
ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
```

### Provision a node

```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=<node>
ansible-playbook playbooks/tests/test-specific-node.yml -e target_node=<node>
```

### Decommission a node

```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml \
  -e target_node=<node> \
  -e "decom_reason=planned maintenance"
```
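
A node should only be removed once Slurm reports it as fully drained. In `sinfo` output, `draining` means the node still runs jobs but accepts no new ones, while `drained` means no jobs remain. What the decommission playbook checks is not shown here; a manual equivalent might look like this sketch, where `is_safe_to_remove` is a hypothetical helper.

```bash
# Hypothetical helper, not part of this repository.
# Takes a node state string as printed by: sinfo -h -n <node> -o '%T'
is_safe_to_remove() {
  case "$1" in
    # No jobs remain on the node (a trailing * marks a non-responding node).
    drained|'drained*'|down|'down*') return 0 ;;
    # draining, allocated, mixed, idle, and others: jobs may still run or start.
    *) return 1 ;;
  esac
}

# Manual drain, then poll until the node is empty:
#   scontrol update NodeName=<node> State=DRAIN Reason="planned maintenance"
#   until is_safe_to_remove "$(sinfo -h -n <node> -o '%T' | head -n 1)"; do sleep 30; done
```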

### Rolling OS upgrade

```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=<node>
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
  -e canary_node=<node> \
  -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```
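
The canary step exists so that a bad upgrade fails on one node before it reaches the rest of the workers; `skip_canary=true` in the second command presumably tells the rolling playbook that the canary was already handled. The internals of the upgrade playbooks are not reproduced here. The sketch below only illustrates the general one-node-at-a-time, stop-on-first-failure pattern, and `rolling_apply` and the per-node step it calls are hypothetical names.

```bash
# Hypothetical helper, not part of this repository.
# Usage: rolling_apply <step-command> <node>...
# Runs the step on each node in order and stops at the first failure,
# leaving the remaining nodes untouched.
rolling_apply() {
  local step="$1"
  shift

  local node
  for node in "$@"; do
    echo "upgrading $node"
    if ! "$step" "$node"; then
      echo "stopping rollout: $node failed" >&2
      return 1
    fi
  done
}
```

Here the step would be a site-specific function that drains, upgrades, validates, and resumes a single node. Passing the canary as the first node makes it the gate for the remaining workers.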

### Health check and auto-remediation

```bash
ansible-playbook playbooks/health/check-slurm-health.yml
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=<node>
```
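
A health check usually starts from the node states Slurm already reports. The sketch below filters `sinfo` output for states that typically need attention. It illustrates the idea only and is not the logic of `check-slurm-health.yml`; the function name `unhealthy_nodes` is hypothetical.

```bash
# Hypothetical helper, not part of this repository.
# Reads "<node> <state>" lines on stdin and prints nodes that are down,
# drained or draining, failing, or not responding (state ends with *).
unhealthy_nodes() {
  awk '
    NF >= 2 {
      state = $2
      if (state ~ /^(down|drain|fail)/ || state ~ /\*$/) print $1, state
    }
  ' | sort -u   # sinfo -N lists a node once per partition, so drop duplicates
}

# Example (requires a running cluster):
#   sinfo -h -N -o '%N %T' | unhealthy_nodes
```

Once the underlying cause is fixed, a node can be returned to service with `scontrol update NodeName=<node> State=RESUME`.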

### Accounting backup and restore-check

```bash
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
```

## Operational maturity

This is more than a toy lab because it includes operational controls around the cluster, not only a static `slurm.conf` example.

- Ansible workflows are designed to be repeatable and readable.
- Configuration deployment supports check and diff review before applying changes.
- Validation jobs verify CPU scheduling, GPU scheduling, cgroup behavior, accounting, and reporting.
- SlurmDBD and MariaDB accounting are configured with `sacct`, `sreport`, and `sacctmgr` validation.
- QOS, fairshare, priority, and association workflows show resource governance.
- Node lifecycle playbooks drain, decommission, reprovision, resume, and validate nodes.
- Rolling upgrade playbooks include canary validation before broader worker upgrades.
- Health and repair playbooks document remediation paths for common node states.
- Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
- Troubleshooting cases document real lab failure modes without exposing private infrastructure details.

## Tested capabilities

- [x] CPU job scheduling.
- [x] GPU job scheduling.
- [x] GPU denial when no GRES is requested.
- [x] CPU cgroup enforcement.
- [x] SlurmDBD accounting setup.
- [x] `sacct` job history visibility.
- [x] `sreport` usage reporting.
- [x] QOS creation and validation.
- [x] Fairshare and priority visibility.
- [x] Node decommission and reprovision workflow.
- [x] Rolling upgrade canary workflow.
- [x] Node health check and auto-remediation workflow.

These checks represent sanitized lab validation, not a claim of production certification.

## Safety and sanitization

This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.

Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files. Real credentials, including SlurmDBD database passwords, belong in Ansible Vault or another approved secret store that is kept out of this public repository.

Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.

## Why this matters for AI/HPC infrastructure roles

AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.

This project demonstrates practical understanding of:

- Linux systems operations.
- Slurm cluster operations.
- GPU infrastructure and GRES scheduling.
- Job scheduling and resource isolation.
- Accounting, reporting, QOS, fairshare, and priority policy.
- Automation and repeatability with Ansible.
- Troubleshooting and operational ownership.

## Deeper docs

- [Runbook](docs/runbook.md)
- [Interview cheatsheet](docs/interview-cheatsheet.md)
- [Troubleshooting cases](docs/troubleshooting-cases.md)