This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.
The goal is not to present a certified production platform. The goal is to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.
- QOS, fairshare, and multifactor priority configuration.
- Node provisioning and decommissioning workflows.
- Rolling OS upgrades with canary validation.
- Health checks and auto-remediation.
- Backup and restore-check workflow for the accounting database.
- Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.
## Architecture overview
```mermaid
flowchart LR
operator[Ansible control node]
munge[Munge authentication]
controller[Slurm controller<br/>slurmctld]
db[MariaDB + SlurmDBD<br/>accounting]
shared[Shared filesystem<br/>site dependency]
cpu_part[CPU partition]
gpu_part[GPU partition]
cpu_nodes[CPU compute nodes<br/>slurmd]
gpu_node[GPU node<br/>slurmd + GRES]
jobs[User jobs<br/>sbatch / srun]
operator -->|bootstrap and configure| controller
operator -->|configure workers| cpu_nodes
operator -->|configure GPU worker| gpu_node
operator -->|deploy key and service| munge
munge --> controller
munge --> cpu_nodes
munge --> gpu_node
controller -->|accounting RPC| db
jobs -->|submit to Slurm| controller
controller -->|schedule CPU jobs| cpu_part
controller -->|schedule GPU jobs| gpu_part
cpu_part --> cpu_nodes
gpu_part --> gpu_node
cpu_nodes --- shared
gpu_node --- shared
controller --- shared
```
The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.
- Rolling upgrade playbooks include canary validation before broader worker upgrades.
- Health and repair playbooks document remediation paths for common node states.
- Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
- Troubleshooting cases document real lab failure modes without exposing private infrastructure details.
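As an illustration of the kind of check a health playbook might wrap, here is a minimal Python sketch that classifies node states from `sinfo -N -h -o "%N|%T"` output. It is not the repository's implementation: the function names, the `UNHEALTHY_PREFIXES` policy, and the sample node names are all assumptions made for this example.

```python
# Sketch only: classify nodes from `sinfo -N -h -o "%N|%T"` output.
# UNHEALTHY_PREFIXES is a hypothetical policy, not taken from the playbooks.
UNHEALTHY_PREFIXES = ("down", "drain", "fail")

# Made-up sample in the "node|state" shape produced by the format string above.
SAMPLE = """\
cpu01|idle
cpu02|drained
gpu01|down*
cpu03|mixed
"""


def parse_sinfo(text):
    """Return {node: state} from 'node|state' lines, skipping blank lines."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, state = line.partition("|")
        # A node listed in several partitions appears more than once;
        # the dict keeps a single entry per node.
        nodes[name] = state
    return nodes


def needs_attention(nodes):
    """Return sorted node names that look unhealthy.

    A node is flagged when its state starts with an unhealthy prefix
    or ends with '*', which sinfo uses for nodes that are not responding.
    """
    flagged = []
    for name, state in nodes.items():
        if state.startswith(UNHEALTHY_PREFIXES) or state.endswith("*"):
            flagged.append(name)
    return sorted(flagged)


if __name__ == "__main__":
    print(needs_attention(parse_sinfo(SAMPLE)))
```

Keeping the parsing separate from the policy makes it straightforward to test the classification against captured output before any remediation step is allowed to act on it.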
## Tested capabilities
- [x] CPU job scheduling.
- [x] GPU job scheduling.
- [x] GPU denial when no GRES is requested.
- [x] CPU cgroup enforcement.
- [x] SlurmDBD accounting setup.
- [x] `sacct` job history visibility.
- [x] `sreport` usage reporting.
- [x] QOS creation and validation.
- [x] Fairshare and priority visibility.
- [x] Node decommission and reprovision workflow.
- [x] Rolling upgrade canary workflow.
- [x] Node health check and auto-remediation workflow.
These checks represent sanitized lab validation, not a claim of production certification.
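One way to turn the accounting checks above into an explicit pass/fail result is to parse `sacct --parsable2 --format=JobID,JobName,Partition,State,ExitCode` output and list any validation job that did not complete. The sketch below is an assumption about how such a check could look, not the project's own tooling; the function names and the sample rows are invented for illustration.

```python
# Sketch only: find validation jobs that did not reach COMPLETED.
# Input is the text printed by `sacct --parsable2 --format=...`,
# which is a header line followed by '|'-delimited rows.

# Made-up sample rows; real job IDs and names will differ.
SAMPLE = """\
JobID|JobName|Partition|State|ExitCode
101|cpu-check|cpu|COMPLETED|0:0
101.batch|batch||COMPLETED|0:0
102|gpu-check|gpu|FAILED|1:0
"""


def parse_sacct(text):
    """Return a list of dicts keyed by the header fields."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def unsuccessful(rows):
    """Return JobIDs of top-level jobs whose State is not COMPLETED.

    Job steps such as '101.batch' contain a dot and are skipped so that
    each job is reported once.
    """
    return [
        row["JobID"]
        for row in rows
        if "." not in row["JobID"] and row["State"] != "COMPLETED"
    ]


if __name__ == "__main__":
    print(unsuccessful(parse_sacct(SAMPLE)))
```

An empty result would indicate that every listed validation job completed; anything else gives a short list of job IDs to investigate.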
## Safety and sanitization
This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.
Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files. Keep real credentials, including SlurmDBD database passwords, in Ansible Vault or another approved secret store that lives outside this repository.
Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.
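The rules above could be backed by a simple pre-commit style check. The following Python sketch flags file paths whose names match patterns for sensitive artifacts. The pattern list and the example paths are hypothetical and would need to be adapted to the names a given lab actually uses; this is not a check that ships with the repository.

```python
# Sketch only: flag paths that look like secrets or generated artifacts.
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Hypothetical patterns covering keys, dumps, and vault files.
FORBIDDEN = (
    "munge.key",
    "id_rsa",
    "id_ed25519",
    "*.pem",
    "*.sql",
    "*.sql.gz",
    "vault*.yml",
)


def flag_paths(paths):
    """Return the paths whose file name matches any forbidden pattern."""
    flagged = []
    for path in paths:
        name = PurePosixPath(path).name
        # fnmatchcase keeps matching case-sensitive on every platform.
        if any(fnmatchcase(name, pattern) for pattern in FORBIDDEN):
            flagged.append(path)
    return flagged


if __name__ == "__main__":
    # Made-up example paths for illustration.
    candidates = [
        "inventory/example.ini",
        "files/munge.key",
        "backups/slurm_acct.sql.gz",
        "group_vars/all/vault.yml",
        "README.md",
    ]
    print(flag_paths(candidates))
```

The list of staged files from `git diff --cached --name-only` is one possible input. A name-based check like this only catches obvious cases and does not replace reviewing file contents.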
## Why this matters for AI/HPC infrastructure roles
AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.
This project demonstrates practical understanding of:
- Linux systems operations.
- Slurm cluster operations.
- GPU infrastructure and GRES scheduling.
- Job scheduling and resource isolation.
- Accounting, reporting, QOS, fairshare, and priority policy.