# Linux/Unix Infrastructure Engineering Portfolio

This repository contains sanitized automation examples drawn from Linux/Unix operations and infrastructure workflows. The focus is on incident response, troubleshooting, pre-checks, dry-run behavior, controlled execution, post-checks, and readable operational evidence.

It is a technical portfolio, not a production toolkit. The examples show how operational work is structured: understand the current state, make changes only with explicit controls, verify the result, and leave enough evidence for review.

## What This Repo Is

- Practical Linux/Unix operations examples.
- Safe Bash and Ansible patterns for lab and review.
- Runbook-driven examples for incident response, storage operations, hardening, and observability.
- A place for platform and lab topics to grow without pretending unfinished areas are complete.

## What This Repo Is Not

- It is not a compliance benchmark implementation.
- It is not a drop-in change automation framework.
- It is not proof that these exact scripts ran in any production environment.
- It does not replace change review, peer review, backups, monitoring, or platform-specific runbooks.

## Repository Layout

- [infra-run](./infra-run/) - core operational tooling and automation.
- [platform-projects](./platform-projects/) - larger platform topics and case-study areas.
- [labs](./labs/) - experimental/lab environments and notes.
- [docs/codex](./docs/codex/) - guidance for future Codex-driven changes.
- [scripts](./scripts/) - lightweight repository validation helpers.

## Usable Now

- [infra-run](./infra-run/) - the main implemented project in this repository.
- [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
- [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
- [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
- [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
- [Incident log summary](./infra-run/scripts/python/incident-log-summary/) - read-only Python helper for local incident log pattern summaries.
- [Log diff checker](./infra-run/scripts/python/log-diff-checker/) - read-only Python helper for before/after change log comparison.
- [Auth log audit](./infra-run/scripts/python/auth-log-audit/) - read-only Python helper for local authentication log review.
- [JVM log analyzer](./infra-run/scripts/python/jvm-log-analyzer/) - read-only Python helper for local JVM and Java application log review.
- [Journal analyzer](./infra-run/scripts/python/journal-analyzer/) - read-only Python helper for exported `journalctl` text review.
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.

## Planned Areas

Apart from the Slurm AI/HPC cluster lab listed under Usable Now, the `labs` and `platform-projects` trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in [ROADMAP.md](./ROADMAP.md).

## Documentation

### Production Operations

- [infra-run/docs/operations-cheatsheet.md](./infra-run/docs/operations-cheatsheet.md) - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.

### Platform Engineering

- [platform-projects/docs/platform-cheatsheet.md](./platform-projects/docs/platform-cheatsheet.md) - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.

### Labs & Experiments

- [labs/docs/lab-cheatsheet.md](./labs/docs/lab-cheatsheet.md) - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.

### Codex and Review Guidance

- [AGENTS.md](./AGENTS.md) - repository rules for automated and assisted changes.
- [docs/codex/README.md](./docs/codex/README.md) - Codex workflow and expected final response format.
- [docs/codex/review-checklist.md](./docs/codex/review-checklist.md) - safety, Bash, Ansible, docs, and validation review checklist.
- [docs/codex/task-template.md](./docs/codex/task-template.md) - reusable scoped task templates.

## Safety-First Usage

Read scripts and playbooks before running them. Operational examples are sanitized and may need adaptation for a real system.

- Prefer read-only commands first.
- Use dry-run/check mode before execution.
- Treat `--execute` as a change-control boundary.
- Confirm backups, monitoring, application impact, and rollback steps before live use.
- Do not run platform-specific storage commands without a matching Veritas, GPFS, or AIX lab.
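
The dry-run-first rule above can be sketched in Bash roughly as follows. This is an illustration only, not code taken from the scripts in this repository: `parse_args`, `run_cmd`, and the `EXECUTE` variable are hypothetical names chosen for the example.

```bash
#!/usr/bin/env bash
# Sketch of a dry-run-first wrapper (hypothetical names, not repository code).
set -euo pipefail

EXECUTE=0   # default is dry-run: commands are printed, never run

# parse_args: only an explicit --execute switches to change mode.
parse_args() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --execute) EXECUTE=1 ;;
      *) echo "unknown argument: $arg" >&2; return 2 ;;
    esac
  done
}

# run_cmd: log every command; run it only when --execute was given.
run_cmd() {
  if [[ "$EXECUTE" -eq 1 ]]; then
    echo "[EXECUTE] $*"
    "$@"
  else
    echo "[DRY-RUN] $*"
  fi
}

parse_args "$@"
# Harmless stand-in for a real change command.
run_cmd echo "example change"
```

Run without arguments, the sketch only prints what it would do. Keeping the default non-destructive means a forgotten flag results in a preview rather than a change, and the logged lines can be attached to a ticket as evidence.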

## Validation

Basic local validation:

```bash
./scripts/validate-repo.sh
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```

The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Python checks use `python3 -m py_compile` and do not require external Python tooling. Set `STRICT=1` to fail when optional tools are missing.
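
One way the optional-tool behavior described above could be implemented is sketched below. This is an assumption about the general pattern, not the actual contents of the `scripts/` helpers: the function name `check_optional_tool` and the message formats are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: optional tools are skipped by default and required under STRICT=1.
# Hypothetical helper, not the repository's real validation code.
set -euo pipefail

STRICT="${STRICT:-0}"

# check_optional_tool: succeed if the tool exists, or if it is missing and
# STRICT is not 1; fail only when the tool is missing under STRICT=1.
check_optional_tool() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    echo "OK: $tool found"
    return 0
  fi
  if [[ "$STRICT" == "1" ]]; then
    echo "FAIL: $tool missing and STRICT=1" >&2
    return 1
  fi
  echo "SKIP: $tool not installed"
  return 0
}

failures=0
for tool in shellcheck yamllint ansible-lint markdownlint; do
  check_optional_tool "$tool" || failures=$((failures + 1))
done
echo "strict-mode failures: $failures"
```

A real helper would normally exit non-zero when the failure count is above zero, so that `STRICT=1` can gate a CI job while a plain local run stays usable on a workstation with few linters installed.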

Some scripts depend on platform tools such as `vxdisk`, `hagrp`, `mmcrnsd`, and `mmlscluster`. Those commands are not expected to exist on a normal workstation, so functional testing against Veritas or GPFS requires a real lab environment.

See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATIONS.md](./infra-run/KNOWN_LIMITATIONS.md) for the current validation status.

## Operational Areas Demonstrated

- Linux operations triage and reporting.
- Local operational log analysis with read-only Python helpers.
- Disk pressure and deleted-file incident analysis.
- Dry-run-first Bash automation.
- Controlled storage change workflow design.
- Veritas VxVM/VCS operational awareness.
- GPFS / IBM Spectrum Scale operational awareness.
- Ansible role organization for selected hardening controls.
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
- Clear documentation of what was tested and what still needs a real system.