platform-projects
This directory contains larger infrastructure platform topics and case studies. Most subdirectories are planning areas unless their own README says otherwise.
Implemented platform projects
- hpc-slurm-ai-cluster - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
Planning areas
These subdirectories are intentionally light. Treat each one as a planning area unless its own README says otherwise:
- monitoring-zabbix
- elk-log-analysis
- storage
- clustering
- virtualization
Planned platform topics are tracked in ROADMAP.md. Keep future additions operational: scope, topology, validation, limitations, and runbook links should matter more than diagrams or buzzwords.
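One way to keep additions operational is to check a draft README against the five items listed above before promoting a planning area. The snippet below is a minimal sketch, not part of this repository: the function name `missing_topics` and the keyword spellings in `REQUIRED_TOPICS` are assumptions, and the check is a plain substring match rather than a real heading parser.

```python
from typing import List

# Topics the guidance above asks each platform write-up to cover.
# The exact keyword spellings are an assumption; adjust them to the
# headings your READMEs actually use.
REQUIRED_TOPICS = ("scope", "topology", "validation", "limitations", "runbook")


def missing_topics(readme_text: str) -> List[str]:
    """Return the required topics that never appear in the README text.

    Hypothetical helper: a case-insensitive substring match, so
    "Runbooks" counts as covering "runbook".
    """
    lowered = readme_text.lower()
    return [topic for topic in REQUIRED_TOPICS if topic not in lowered]


# Example: a draft that covers only three of the five topics.
sample = "Scope: one cluster. Topology: 2 login nodes. Validation: smoke tests."
print(missing_topics(sample))  # ['limitations', 'runbook']
```

A substring match is deliberately loose. It flags obvious gaps but cannot tell whether a section has real content, so it complements review rather than replacing it.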
For Codex-driven changes, use AGENTS.md and the templates under docs/codex.