Document Slurm AI/HPC cluster project
lint / shell-yaml-ansible (push) Failing after 16s

Mateusz Suski
2026-06-04 19:54:43 +00:00
parent d300d490f5
commit 83877fb598
5 changed files with 239 additions and 40 deletions
@@ -42,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
## Planned Areas
@@ -106,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
- Veritas VxVM/VCS operational awareness.
- GPFS / IBM Spectrum Scale operational awareness.
- Ansible role organization for selected hardening controls.
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
- Clear documentation of what was tested and what still needs a real system.