Document Slurm AI/HPC cluster project

Mateusz Suski
2026-06-04 19:54:43 +00:00
parent d300d490f5
commit 83877fb598
5 changed files with 239 additions and 40 deletions
+8 -2
@@ -1,8 +1,14 @@
 # platform-projects
 
-This directory is reserved for larger infrastructure platform topics and future case studies. The current implemented project is [infra-run](../infra-run/).
+This directory contains larger infrastructure platform topics and case studies. Most subdirectories are planning areas unless their own README says otherwise.
 
-Current subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
+## Implemented platform projects
+
+- [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
+
+## Planning areas
+
+These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
 
 - `monitoring-zabbix`
 - `elk-log-analysis`