From 1843796e922e5c55df645a0ca43c60a96ff9efd3 Mon Sep 17 00:00:00 2001
From: Mateusz Suski
Date: Thu, 4 Jun 2026 19:54:43 +0000
Subject: [PATCH] Document Slurm AI/HPC cluster project

---
 CHANGELOG.md                                  |   1 +
 README.md                                     |   2 +
 platform-projects/README.md                   |  10 +-
 .../hpc-slurm-ai-cluster/README.md            | 248 +++++++++++++++---
 .../hpc-slurm-ai-cluster/docs/runbook.md      |  13 +
 5 files changed, 235 insertions(+), 39 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 5cb8eec..affeda1 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -36,6 +36,7 @@
 - IBM AIX 7 role and playbook.
 - Shared sanitized Ansible inventory defaults for Linux and AIX examples.
 - Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
+- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
 
 ### Changed
 
diff --git a/README.md b/README.md
index 39639bb..8426f95 100644
--- a/README.md
+++ b/README.md
@@ -42,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
 - [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
 - [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
 - [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
+- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
 
 ## Planned Areas
 
@@ -106,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
 - Veritas VxVM/VCS operational awareness.
 - GPFS / IBM Spectrum Scale operational awareness.
 - Ansible role organization for selected hardening controls.
+- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
 - Clear documentation of what was tested and what still needs a real system.
diff --git a/platform-projects/README.md b/platform-projects/README.md
index 06d1911..1a5509e 100644
--- a/platform-projects/README.md
+++ b/platform-projects/README.md
@@ -1,8 +1,14 @@
 # platform-projects
 
-This directory is reserved for larger infrastructure platform topics and future case studies. The current implemented project is [infra-run](../infra-run/).
+This directory contains larger infrastructure platform topics and case studies. Most subdirectories are planning areas unless their own README says otherwise.
 
-Current subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
+## Implemented platform projects
+
+- [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
+
+## Planning areas
+
+These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
 
 - `monitoring-zabbix`
 - `elk-log-analysis`
diff --git a/platform-projects/hpc-slurm-ai-cluster/README.md b/platform-projects/hpc-slurm-ai-cluster/README.md
index b1fd65f..f71e36a 100644
--- a/platform-projects/hpc-slurm-ai-cluster/README.md
+++ b/platform-projects/hpc-slurm-ai-cluster/README.md
@@ -1,59 +1,233 @@
-# Ansible Slurm AI/HPC Lab
+# Slurm AI/HPC Cluster Automation Lab
 
-Ansible automation for a small Slurm AI/HPC lab with CPU nodes, a GPU node, Munge, cgroups, GRES, SlurmDBD accounting, QOS/fairshare, node lifecycle workflows, rolling OS upgrades and health remediation.
+## Executive summary
 
-This repository is sanitized for publication. Replace the example inventory values under `inventories/lab/` with your own hostnames, IP addresses and users before running it.
+This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.
 
-## What this lab covers
+The goal is not to present a certified production platform. The goal is to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.
 
-- Slurm controller and worker configuration
-- Munge key distribution
-- GPU GRES configuration
-- cgroup CPU/GPU/device enforcement
-- SlurmDBD + MariaDB accounting
-- `sacct`, `sreport`, `sacctmgr` validation
-- QOS, limits, fairshare and priority/multifactor
-- Node provisioning and decommissioning
-- Rolling OS upgrades with canary validation
-- Health checks and node auto-remediation
+## What this project demonstrates
+
+- Slurm controller and worker node management.
+- Munge authentication across the cluster.
+- GPU node integration through Slurm GRES.
+- cgroup CPU, memory, and GPU device enforcement.
+- SlurmDBD with MariaDB-backed accounting.
+- `sacct`, `sreport`, and `sacctmgr` workflows.
+- QOS, fairshare, and multifactor priority configuration.
+- Node provisioning and decommissioning workflows.
+- Rolling OS upgrades with canary validation.
+- Health checks and auto-remediation.
+- Backup and restore-check workflow for the accounting database.
+- Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.
+
+## Architecture overview
+
+```mermaid
+flowchart LR
+    operator[Ansible control node]
+    munge[Munge authentication]
+    controller[Slurm controller<br/>slurmctld]
+    db[MariaDB + SlurmDBD<br/>accounting]
+    shared[Shared filesystem<br/>site dependency]
+    cpu_part[CPU partition]
+    gpu_part[GPU partition]
+    cpu_nodes[CPU compute nodes<br/>slurmd]
+    gpu_node[GPU node<br/>slurmd + GRES]
+    jobs[User jobs<br/>sbatch / srun]
+
+    operator -->|bootstrap and configure| controller
+    operator -->|configure workers| cpu_nodes
+    operator -->|configure GPU worker| gpu_node
+    operator -->|deploy key and service| munge
+
+    munge --> controller
+    munge --> cpu_nodes
+    munge --> gpu_node
+
+    controller -->|accounting RPC| db
+    jobs -->|submit to Slurm| controller
+    controller -->|schedule CPU jobs| cpu_part
+    controller -->|schedule GPU jobs| gpu_part
+    cpu_part --> cpu_nodes
+    gpu_part --> gpu_node
+
+    cpu_nodes --- shared
+    gpu_node --- shared
+    controller --- shared
+```
+
+The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.
 
 ## Repository layout
 
 ```text
-inventories/lab/        Example inventory and group variables
-templates/              Slurm, cgroup, gres and slurmdbd templates
-playbooks/bootstrap/    Initial SSH, sudo and /etc/hosts setup
-playbooks/core/         Munge, Slurm config and safe restart workflows
-playbooks/accounting/   SlurmDBD, backup/restore-check and accounting validation
-playbooks/qos/          QOS, fairshare and priority configuration
-playbooks/lifecycle/    Provisioning and decommissioning nodes
-playbooks/upgrade/      Rolling OS upgrade and canary workflow
-playbooks/health/       Health checks and auto-remediation
-playbooks/tests/        CPU/GPU/cgroup/accounting validation jobs
-playbooks/backup/       Slurm config backup helpers
+inventories/lab/        Sanitized lab inventory and group variables
+playbooks/bootstrap/    Initial SSH, sudo, operator user, and host setup
+playbooks/core/         Munge, Slurm config, and safe restart workflows
+playbooks/accounting/   SlurmDBD, MariaDB, backup, restore-check, and reporting validation
+playbooks/qos/          QOS, fairshare, and priority configuration
+playbooks/lifecycle/    Node provisioning, inspection, and decommissioning
+playbooks/upgrade/      Canary and rolling OS upgrade workflows
+playbooks/health/       Health checks, repair, and auto-remediation
+playbooks/tests/        CPU, GPU, cgroup, accounting, and reporting validation jobs
+playbooks/backup/       Slurm and Munge state backup helpers
+templates/              Slurm, cgroup, GRES, and SlurmDBD templates
 docs/                   Operational runbook
-prompts/codex/          Prompts for generating or expanding documentation
+prompts/                Documentation prompts used to expand this project
 ```
 
-## Quick start
+## Main operational workflows
 
-1. Edit `inventories/lab/inventory.yml`.
-2. Edit `inventories/lab/group_vars/slurm_cluster.yml`.
-3. Create and encrypt a vault file for database credentials:
+Run commands from `platform-projects/hpc-slurm-ai-cluster/`. Review inventory and variables before running any playbook.
+
+### Bootstrap access
 
 ```bash
-cp inventories/lab/group_vars/vault.example.yml inventories/lab/group_vars/vault.yml
-ansible-vault encrypt inventories/lab/group_vars/vault.yml
+ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
+ansible-playbook playbooks/bootstrap/slurm-hosts.yml
+ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
+ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
 ```
 
-4. Run syntax checks:
+### Deploy Munge
 
 ```bash
-find playbooks -name '*.yml' -print0 | xargs -0 -n1 ansible-playbook --syntax-check
+ansible-playbook playbooks/core/manage-munge.yml
 ```
 
-5. Run the bootstrap/core workflows in the order described in `docs/runbook.md`.
+### Deploy Slurm config
 
-## Security notes
+```bash
+ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
+ansible-playbook playbooks/core/manage-slurm-config.yml --diff
+ansible-playbook playbooks/core/restart-slurm-safe.yml
+```
 
-Do not commit real inventories, backup archives, SQL dumps, Munge keys, private SSH keys or Ansible Vault files. This repository intentionally excludes generated backup artifacts.
+### Validate CPU jobs
+
+```bash
+ansible-playbook playbooks/tests/validate-slurm-operator.yml
+ansible-playbook playbooks/tests/test-cpu-job.yml
+```
+
+### Validate GPU jobs
+
+```bash
+ansible-playbook playbooks/tests/test-gpu-job.yml
+ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
+```
+
+### Enable accounting
+
+```bash
+ansible-playbook playbooks/accounting/setup-slurmdbd.yml
+ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
+ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
+ansible-playbook playbooks/tests/test-sreport-usage.yml
+```
+
+### Configure QOS and fairshare
+
+```bash
+ansible-playbook playbooks/qos/configure-slurm-qos.yml
+ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
+```
+
+### Provision a node
+
+```bash
+ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=
+ansible-playbook playbooks/tests/test-specific-node.yml -e target_node=
+```
+
+### Decommission a node
+
+```bash
+ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml \
+  -e target_node= \
+  -e "decom_reason=planned maintenance"
+```
+
+### Rolling OS upgrade
+
+```bash
+ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=
+ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
+  -e canary_node= \
+  -e skip_canary=true
+ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
+ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
+```
+
+### Health check and auto-remediation
+
+```bash
+ansible-playbook playbooks/health/check-slurm-health.yml
+ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
+ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=
+```
+
+### Accounting backup and restore-check
+
+```bash
+ansible-playbook playbooks/accounting/backup-slurmdbd.yml
+ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
+```
+
+## Operational maturity
+
+This is more than a toy lab because it includes operational controls around the cluster, not only a static `slurm.conf` example.
+
+- Ansible workflows are designed to be repeatable and readable.
+- Configuration deployment supports check and diff review before applying changes.
+- Validation jobs prove CPU scheduling, GPU scheduling, cgroup behavior, accounting, and reporting.
+- SlurmDBD and MariaDB accounting are configured with `sacct`, `sreport`, and `sacctmgr` validation.
+- QOS, fairshare, priority, and association workflows show resource governance.
+- Node lifecycle playbooks drain, decommission, reprovision, resume, and validate nodes.
+- Rolling upgrade playbooks include canary validation before broader worker upgrades.
+- Health and repair playbooks document remediation paths for common node states.
+- Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
+
+## Tested capabilities
+
+- [x] CPU job scheduling.
+- [x] GPU job scheduling.
+- [x] GPU denial when no GRES is requested.
+- [x] CPU cgroup enforcement.
+- [x] SlurmDBD accounting setup.
+- [x] `sacct` job history visibility.
+- [x] `sreport` usage reporting.
+- [x] QOS creation and validation.
+- [x] Fairshare and priority visibility.
+- [x] Node decommission and reprovision workflow.
+- [x] Rolling upgrade canary workflow.
+- [x] Node health check and auto-remediation workflow.
+
+These checks represent sanitized lab validation, not a claim of production certification.
+
+## Safety and sanitization
+
+This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.
+
+Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files. Real credentials, including SlurmDBD database passwords, belong in Ansible Vault or another approved secret store.
+
+Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.
+
+## Why this matters for AI/HPC infrastructure roles
+
+AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.
+
+This project demonstrates practical understanding of:
+
+- Linux systems operations.
+- Slurm cluster operations.
+- GPU infrastructure and GRES scheduling.
+- Job scheduling and resource isolation.
+- Accounting, reporting, QOS, fairshare, and priority policy.
+- Automation and repeatability with Ansible.
+- Troubleshooting and operational ownership.
+
+## Deeper docs
+
+- [Runbook](docs/runbook.md)
diff --git a/platform-projects/hpc-slurm-ai-cluster/docs/runbook.md b/platform-projects/hpc-slurm-ai-cluster/docs/runbook.md
index d6763af..110be7d 100644
--- a/platform-projects/hpc-slurm-ai-cluster/docs/runbook.md
+++ b/platform-projects/hpc-slurm-ai-cluster/docs/runbook.md
@@ -50,6 +50,19 @@ Repair a node:
 ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
 ```
 
+Run health remediation for nodes that can be recovered by the automated workflow:
+
+```bash
+ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
+```
+
+Back up Slurm and Munge state before planned lifecycle work:
+
+```bash
+ansible-playbook playbooks/backup/backup-slurm-state.yml
+ansible-playbook playbooks/backup/fetch-slurm-backups.yml
+```
+
 ## Rolling OS upgrade
 
 ```bash