@@ -42,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
## Planned Areas
@@ -106,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
- Veritas VxVM/VCS operational awareness.
- GPFS / IBM Spectrum Scale operational awareness.
- Ansible role organization for selected hardening controls.
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
- Clear documentation of what was tested and what still needs a real system.