Document Slurm AI/HPC cluster project

2026-06-04 19:54:43 +00:00
parent cd6830334b
commit 1843796e92
5 changed files with 235 additions and 39 deletions
@@ -42,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
 - [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
 - [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
 - [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
+- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.

 ## Planned Areas

@@ -106,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
 - Veritas VxVM/VCS operational awareness.
 - GPFS / IBM Spectrum Scale operational awareness.
 - Ansible role organization for selected hardening controls.
+- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
 - Clear documentation of what was tested and what still needs a real system.