Document Slurm AI/HPC cluster project
lint / shell-yaml-ansible (push) Failing after 17s

Mateusz Suski
2026-06-04 19:54:43 +00:00
parent cd6830334b
commit 1843796e92
5 changed files with 235 additions and 39 deletions
+8 -2
@@ -1,8 +1,14 @@
 # platform-projects
 
-This directory is reserved for larger infrastructure platform topics and future case studies. The current implemented project is [infra-run](../infra-run/).
+This directory contains larger infrastructure platform topics and case studies. Most subdirectories are planning areas unless their own README says otherwise.
 
-Current subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
+## Implemented platform projects
+
+- [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
+
+## Planning areas
+
+These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
 
 - `monitoring-zabbix`
 - `elk-log-analysis`