Linux/Unix Infrastructure Engineering Portfolio
This repository contains sanitized infrastructure automation examples based on Linux/Unix operations and infrastructure workflows. The focus is on incident response, troubleshooting, pre-checks, dry-run behavior, controlled execution, post-checks, and readable operational evidence.
It is a technical portfolio, not a production toolkit. The examples show how operational work is structured: understand the current state, make changes only with explicit controls, verify the result, and leave enough evidence for review.
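The workflow described above (read the current state, change only under an explicit control, verify, keep evidence) can be sketched roughly as follows. This is a minimal illustration, not code from this repository: `controlled_change`, `read_state`, and `apply_change` are hypothetical names, and the evidence layout is an assumption.

```python
import json
from datetime import datetime, timezone


def controlled_change(read_state, apply_change, execute=False):
    """Run pre-check, optional change, and post-check; return an evidence record.

    read_state and apply_change are hypothetical callables supplied by the caller.
    """
    evidence = {
        "started_utc": datetime.now(timezone.utc).isoformat(),
        "mode": "execute" if execute else "dry-run",
    }
    evidence["before"] = read_state()       # pre-check: read-only
    if execute:                             # explicit control: nothing changes by default
        apply_change()
    evidence["after"] = read_state()        # post-check: read-only
    evidence["changed"] = evidence["before"] != evidence["after"]
    return evidence


# Usage with an in-memory stand-in for a real system.
state = {"fs_size_gb": 10}
record = controlled_change(lambda: dict(state),
                           lambda: state.update(fs_size_gb=20))
print(json.dumps(record, indent=2))  # reviewable evidence for a ticket
```

Returning the record instead of printing inside the function keeps the evidence easy to attach to a ticket or compare across runs.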
What This Repo Is
- Practical Linux/Unix operations examples.
- Safe Bash and Ansible patterns for lab and review.
- Runbook-driven examples for incident response, storage operations, hardening, and observability.
- A place for platform and lab topics to grow without pretending unfinished areas are complete.
What This Repo Is Not
- It is not a compliance benchmark implementation.
- It is not a drop-in change automation framework.
- It is not proof that these exact scripts ran in any production environment.
- It does not replace change review, peer review, backups, monitoring, or platform-specific runbooks.
Repository Layout
- infra-run - core operational tooling and automation.
- platform-projects - larger platform topics and case-study areas.
- labs - experimental/lab environments and notes.
- docs/codex - guidance for future Codex-driven changes.
- scripts - lightweight repository validation helpers.
Usable Now
- infra-run - the main implemented project in this repository.
- Linux healthcheck scripts - host, disk, service, network, and report helpers.
- Bash incident checks - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
- Disk full workflow - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
- Veritas examples - dry-run-first VxVM/VCS storage expansion workflow examples.
- GPFS examples - dry-run-first IBM Spectrum Scale expansion workflow examples.
- Incident log summary - read-only Python helper for local incident log pattern summaries.
- Log diff checker - read-only Python helper for before/after change log comparison.
- Auth log audit - read-only Python helper for local authentication log review.
- JVM log analyzer - read-only Python helper for local JVM and Java application log review.
- Journal analyzer - read-only Python helper for exported journalctl text review.
- Known error matcher - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- Python operational log analysis tools - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- Ansible hardening examples - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
- Slurm AI/HPC cluster automation lab - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
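As an illustration of the known-error matching idea listed above, the sketch below matches log lines against a JSON catalog using only the standard library. The catalog schema (`id`, `pattern`, `runbook`), the entry IDs, and the runbook paths are assumptions made for this example; the actual helper in infra-run may be structured differently.

```python
import json
import re

# Hypothetical catalog: field names, IDs, and runbook paths are illustrative only.
CATALOG_JSON = r"""
[
  {"id": "KE-001", "pattern": "No space left on device",
   "runbook": "runbooks/disk-full.md"},
  {"id": "KE-002", "pattern": "Out of memory: Killed process \\d+",
   "runbook": "runbooks/oom.md"}
]
"""


def match_known_errors(lines, catalog):
    """Return one hit per (line, catalog entry) whose regex pattern matches."""
    compiled = [(entry, re.compile(entry["pattern"])) for entry in catalog]
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for entry, regex in compiled:
            if regex.search(line):
                hits.append({"line": lineno,
                             "id": entry["id"],
                             "runbook": entry["runbook"]})
    return hits


catalog = json.loads(CATALOG_JSON)
sample = [
    "kernel: Out of memory: Killed process 4242 (java)",
    "sshd: session opened for user ops",
    "rsyslogd: write failed: No space left on device",
]
for hit in match_known_errors(sample, catalog):
    print(f"line {hit['line']}: {hit['id']} -> {hit['runbook']}")
```

The function only reads its inputs, which is consistent with the read-only behavior described for the helpers in this list.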
Planned Areas
The labs and platform-projects trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in ROADMAP.md.
Documentation
Production Operations
- infra-run/docs/operations-cheatsheet.md - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.
Platform Engineering
- platform-projects/docs/platform-cheatsheet.md - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.
Labs & Experiments
- labs/docs/lab-cheatsheet.md - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.
Codex and Review Guidance
- AGENTS.md - repository rules for automated and assisted changes.
- docs/codex/README.md - Codex workflow and expected final response format.
- docs/codex/review-checklist.md - safety, Bash, Ansible, docs, and validation review checklist.
- docs/codex/task-template.md - reusable scoped task templates.
Safety-First Usage
Read scripts and playbooks before running them. Operational examples are sanitized and may need adaptation for a real system.
- Prefer read-only commands first.
- Use dry-run/check mode before execution.
- Treat --execute as a change-control boundary.
- Confirm backups, monitoring, application impact, and rollback steps before live use.
- Do not run platform-specific storage commands without a matching Veritas, GPFS, or AIX lab.
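The --execute boundary mentioned in the list above could look roughly like this on the command line. It is a sketch under stated assumptions: the `run` function, its `targets` argument, and the `apply_action` callback are hypothetical, and the repository's Bash scripts may implement the same idea differently.

```python
import argparse


def run(argv, apply_action):
    """Parse arguments and apply changes only when --execute is given.

    apply_action is a hypothetical callback that performs the real change.
    """
    parser = argparse.ArgumentParser(description="dry-run-first sketch")
    parser.add_argument("targets", nargs="+", help="items the change applies to")
    parser.add_argument("--execute", action="store_true",
                        help="apply changes; without this flag nothing is modified")
    args = parser.parse_args(argv)

    report = []
    for target in args.targets:
        if args.execute:
            apply_action(target)
            report.append(f"[EXECUTE] applied: {target}")
        else:
            report.append(f"[DRY-RUN] would apply: {target}")
    return report


# Default invocation is a dry run: the callback is never called.
applied = []
for line in run(["/var/log/app/old.log"], applied.append):
    print(line)
```

Defaulting to dry-run means a mistyped or forgotten flag results in no change, so the risky path always requires a deliberate choice.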
Validation
Basic local validation:
./scripts/validate-repo.sh
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
The validation helpers run required lightweight checks and use optional tools such as shellcheck, yamllint, ansible-playbook, ansible-lint, and markdownlint when available. Python checks use python3 -m py_compile and do not require external Python tooling. Set STRICT=1 to fail when optional tools are missing.
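The optional-tool behavior described above can be sketched as follows. This is an illustration of the idea in Python, not the repository's validation scripts: `check_optional_tools` is a hypothetical name, and the SKIP/FAIL wording is an assumption.

```python
import os
import shutil

# Tool names taken from the paragraph above.
OPTIONAL_TOOLS = ["shellcheck", "yamllint", "ansible-playbook",
                  "ansible-lint", "markdownlint"]


def check_optional_tools(tools, strict, which=shutil.which):
    """Return (exit_code, messages); missing tools fail only in strict mode."""
    messages = []
    missing = [tool for tool in tools if which(tool) is None]
    for tool in missing:
        level = "FAIL" if strict else "SKIP"
        messages.append(f"{level}: {tool} not found on PATH")
    exit_code = 1 if (strict and missing) else 0
    return exit_code, messages


# STRICT=1 in the environment turns missing optional tools into a failure.
strict_mode = os.environ.get("STRICT") == "1"
code, notes = check_optional_tools(OPTIONAL_TOOLS, strict_mode)
for note in notes:
    print(note)
print(f"exit code would be {code}")
```

Passing the lookup function as a parameter keeps the check testable on machines where none of the tools are installed.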
Some scripts depend on platform tools such as vxdisk, hagrp, mmcrnsd, and mmlscluster. Those commands are not expected to exist on a normal workstation, so functional testing against Veritas or GPFS requires a real lab environment.
See infra-run/TESTED.md and infra-run/KNOWN_LIMITATIONS.md for the current validation status.
Operational Areas Demonstrated
- Linux operations triage and reporting.
- Local operational log analysis with read-only Python helpers.
- Disk pressure and deleted-file incident analysis.
- Dry-run-first Bash automation.
- Controlled storage change workflow design.
- Veritas VxVM/VCS operational awareness.
- GPFS / IBM Spectrum Scale operational awareness.
- Ansible role organization for selected hardening controls.
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
- Clear documentation of what was tested and what still needs a real system.