Compare commits

22 Commits

Author SHA1 Message Date
Mateusz Suski 4e739c5c99 Add Linux fresh setup toolkit 2026-06-06 00:23:11 +00:00 [lint / shell-yaml-ansible (push): Failing after 16s]
Mateusz Suski 8cb92de06f Add AI lab maintenance toolkit 2026-06-06 00:10:44 +00:00 [lint / shell-yaml-ansible (push): Failing after 17s]
Mateusz Suski 1843796e92 Document Slurm AI/HPC cluster project 2026-06-05 15:39:24 +00:00 [lint / shell-yaml-ansible (push): Failing after 17s]
Mateusz Suski cd6830334b Add Slurm AI/HPC cluster platform project 2026-06-05 15:38:56 +00:00
mateusz e2624a7533 PDF CV file upload 2026-05-14 21:23:49 +02:00 [lint / shell-yaml-ansible (push): Failing after 16s]
Mateusz Suski 6475f76787 Add L2 incident triage report wrapper 2026-05-12 20:00:42 +00:00 [lint / shell-yaml-ansible (push): Failing after 17s]
Mateusz Suski e851568c8c Add standalone Bash incident check scripts 2026-05-11 18:49:00 +00:00 [lint / shell-yaml-ansible (push): Failing after 16s]
Mateusz Suski 8a7b7c5abc Clean up Python log analysis documentation 2026-05-11 17:10:10 +00:00 [lint / shell-yaml-ansible (push): Failing after 20s]
Mateusz Suski 1636f46f81 Add known error matcher tool 2026-05-11 17:06:46 +00:00
Mateusz Suski 5fc96348c5 Add journal analyzer tool 2026-05-11 17:06:05 +00:00
Mateusz Suski 89b7fabb96 Add JVM log analyzer tool 2026-05-11 17:05:27 +00:00
Mateusz Suski 2da5e8b46c Add authentication log audit tool 2026-05-11 17:04:48 +00:00
Mateusz Suski 452ff4fac1 Add log diff checker tool 2026-05-11 17:04:10 +00:00
Mateusz Suski 5dde403ce3 Add incident log summary tool 2026-05-11 17:03:31 +00:00
Mateusz Suski 61483c233f Add Python tooling validation foundation 2026-05-11 17:02:35 +00:00
Mateusz Suski a527022518 Add Codex repository guidance and validation 2026-05-10 11:11:03 +00:00 [lint / shell-yaml-ansible (push): Failing after 17s]
Mateusz Suski 0d3905b8a1 Add operational cheatsheets across repository 2026-05-09 09:41:55 +00:00 [lint / shell-yaml-ansible (push): Failing after 17s]
Mateusz Suski ca5a876d03 Improve infra-run portfolio credibility 2026-05-08 21:18:22 +00:00 [lint / shell-yaml-ansible (push): Failing after 21s]
Mateusz Suski deb12a0b4f Update docs for Ansible hardening roles 2026-05-06 09:25:43 +00:00
Mateusz Suski 02a51f72f9 Add IBM AIX 7 CIS-inspired hardening playbook 2026-05-06 09:21:15 +00:00
Mateusz Suski 2fd9c0b5ef Add Debian 13 and Ubuntu 26.04 CIS-inspired hardening playbook 2026-05-06 08:56:45 +00:00
Mateusz Suski 75a11f7650 Add RHEL 9 CIS-inspired hardening playbook 2026-05-06 08:45:33 +00:00
282 changed files with 21509 additions and 476 deletions
+39
@@ -0,0 +1,39 @@
---
name: lint
on:
  pull_request:
  push:
    branches:
      - main
jobs:
  shell-yaml-ansible:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4
      - name: Install lint tools
        run: |
          sudo apt-get update
          sudo apt-get install -y shellcheck yamllint python3-pip
          python3 -m pip install --user ansible-lint
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"
      - name: ShellCheck Bash scripts
        run: |
          find infra-run/scripts/bash -name '*.sh' -print0 | xargs -0 shellcheck -x \
            -P infra-run/scripts/bash/disk-full \
            -P infra-run/scripts/bash/gpfs \
            -P infra-run/scripts/bash/veritas
      - name: Python syntax checks
        run: bash scripts/check-python.sh
      - name: yamllint
        run: yamllint .
      - name: ansible-lint
        continue-on-error: true
        run: cd infra-run/ansible && ansible-lint playbooks roles
+8
@@ -0,0 +1,8 @@
---
extends: default
rules:
  line-length:
    max: 140
  truthy:
    allowed-values: ["true", "false", "on"]
+126
@@ -0,0 +1,126 @@
# AGENTS.md
Guidance for Codex and other automated agents working in this repository.
## Purpose
This repository is a Linux/Unix infrastructure engineering portfolio. It shows practical operational work: incident response, troubleshooting, safe Bash tooling, Ansible hardening examples, storage workflows, runbooks, and platform/lab notes.
Treat it like internal operations tooling maintained by an infrastructure engineer. Preserve operational realism and avoid generic tutorial or template filler.
## Layout
- `infra-run/` - core operational tooling, Ansible, Bash scripts, runbooks, examples, and operations docs.
- `platform-projects/` - larger platform topics such as monitoring, storage, clustering, virtualization, and observability.
- `labs/` - experimental/lab environments for Kubernetes, Terraform, networking, CI/CD, Docker, and related work.
- `docs/codex/` - Codex workflow guidance, task templates, review checklist, and planning template.
- `scripts/` - repository validation helpers.
## Inspect First
Before editing, inspect the affected tree and nearby README files. Prefer:
```bash
rg --files
git status --short
sed -n '1,220p' <file>
```
Check existing style before introducing new structure. Keep changes small and reviewable.
## Validation
Run the broad repo check when practical:
```bash
./scripts/validate-repo.sh
```
Focused checks:
```bash
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```
Optional strict mode fails when optional tools are missing:
```bash
STRICT=1 ./scripts/validate-repo.sh
```
Also run targeted checks for changed files, such as `bash -n`, `ansible-playbook --syntax-check`, or link checks when relevant.
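A minimal sketch of how strict mode could treat a missing optional tool. The function name `run_optional` and the message wording are hypothetical; the real logic in `scripts/validate-repo.sh` is not shown here and may differ.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run a check only when its tool is installed.
# With STRICT=1 a missing optional tool becomes a failure (exit code 2,
# following the "missing dependency" convention).

# run_optional <tool> <command...>
run_optional() {
  local tool="$1"
  shift
  if ! command -v "$tool" >/dev/null 2>&1; then
    if [ "${STRICT:-0}" = "1" ]; then
      echo "CRITICAL: optional tool missing in strict mode: $tool"
      return 2
    fi
    echo "WARNING: skipped, optional tool not installed: $tool"
    return 0
  fi
  # Tool is present: run the requested command and pass its status through.
  "$@"
}
```

For example, `run_optional shellcheck shellcheck -x path/to/script.sh` would lint the script when `shellcheck` exists and only warn otherwise, unless `STRICT=1` is set.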
## Bash Standards
- Use `#!/usr/bin/env bash`.
- Use `set -o errexit`, `set -o nounset`, and `set -o pipefail`.
- Validate input before using it.
- Handle missing commands clearly.
- Default to read-only or dry-run behavior.
- Require explicit `--execute` plus confirmation for destructive operations.
- Use clear `OK`, `WARNING`, and `CRITICAL` output.
- Exit codes: `0` OK, `1` operational issue, `2` invalid input or missing dependency.
- Keep scripts readable; separate discovery, pre-check, change, post-check, and reporting when it helps.
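The standards above can be sketched as a small skeleton. This is an illustration only: `report`, `require_cmd`, and `classify_usage` are hypothetical names, the thresholds are placeholder defaults, and treating `WARNING` as exit code `1` is an assumption of this sketch.

```bash
#!/usr/bin/env bash
# Hypothetical skeleton; names are illustrative, not taken from the repository.
set -o errexit
set -o nounset
set -o pipefail

# Exit codes per the convention above.
EXIT_OK=0
EXIT_ISSUE=1
EXIT_USAGE=2

# Print a status line with a consistent OK/WARNING/CRITICAL prefix.
report() {
  local level="$1"
  shift
  printf '%s: %s\n' "$level" "$*"
}

# Return 2 with a clear message when a required command is missing.
require_cmd() {
  local cmd="${1:-}"
  if ! command -v "$cmd" >/dev/null 2>&1; then
    report CRITICAL "required command not found: $cmd"
    return "$EXIT_USAGE"
  fi
}

# Classify a usage percentage against warning/critical thresholds.
classify_usage() {
  local used="${1:-}" warn="${2:-80}" crit="${3:-90}"
  # Validate input before using it in a numeric comparison.
  case "$used" in
    ''|*[!0-9]*)
      report CRITICAL "usage must be a non-negative integer: '$used'"
      return "$EXIT_USAGE"
      ;;
  esac
  if [ "$used" -ge "$crit" ]; then
    report CRITICAL "usage ${used}% >= ${crit}%"
    return "$EXIT_ISSUE"
  elif [ "$used" -ge "$warn" ]; then
    # Assumption: a warning counts as an operational issue.
    report WARNING "usage ${used}% >= ${warn}%"
    return "$EXIT_ISSUE"
  fi
  report OK "usage ${used}%"
  return "$EXIT_OK"
}
```

Keeping classification in a function that only reads its arguments makes the check read-only by construction and easy to exercise without a real filesystem.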
## Python Standards
- Use Python for parsing, reporting, and structured operational tooling where it adds value over Bash.
- Keep Python tools read-only by default.
- Prefer the Python standard library.
- Avoid frameworks and unnecessary abstractions.
- Use clear operational output and meaningful exit codes.
- Keep tools small, focused, and easy to validate.
## Ansible Standards
- Keep playbooks short and roles simple.
- Prefer modules over `shell` or `command`.
- Use `shell` or `command` only when the module set cannot express the operation, and document why if risk is not obvious.
- Preserve check-mode and diff-mode friendliness where possible.
- Use handlers, tags, defaults, and validation tasks when they clarify operations.
- Keep inventory under `inventory/hosts.yml`, `group_vars/`, and `host_vars/`.
- Do not present selected hardening examples as complete compliance certification.
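One way to keep check mode and diff mode the default is a thin wrapper that only drops `--check --diff` when the operator asks for execution. The sketch below is hypothetical: `build_playbook_cmd` and `playbooks/example.yml` are illustrative names, and it only prints the command line rather than running it.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build an ansible-playbook command that defaults to
# review mode (--check --diff) and requires --execute for a real run.

# build_playbook_cmd <playbook> [--execute]
build_playbook_cmd() {
  local playbook="${1:-}" mode="${2:-}"
  if [ -z "$playbook" ]; then
    echo "CRITICAL: playbook path is required" >&2
    return 2
  fi
  local -a cmd=(ansible-playbook -i inventory/hosts.yml "$playbook")
  if [ "$mode" != "--execute" ]; then
    # Review mode: report what would change without applying it.
    cmd+=(--check --diff)
  fi
  printf '%s\n' "${cmd[*]}"
}
```

Calling `build_playbook_cmd playbooks/example.yml` prints the review-mode command; adding `--execute` as the second argument prints the command without `--check --diff`.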
## Documentation Standards
- Explain what exists, what is planned, and what is intentionally not supported.
- Prefer runbook style: scope, pre-checks, execution guardrails, rollback thinking, post-checks, and evidence.
- Avoid marketing language, fake enterprise wording, and tutorial bloat.
- Update README files and `CHANGELOG.md` when adding meaningful behavior or structure.
## Safety Rules
- Do not run destructive commands.
- Do not rename large directories unless the benefit is clear and low-risk.
- Do not hide validation failures.
- Do not claim live production validation for sanitized examples.
- Do not add secrets, real hostnames, customer identifiers, or private infrastructure details.
- Do not turn placeholders into fake completed projects.
## PR and Review Expectations
- State the operational risk of the change.
- Include commands run and whether tools were missing.
- Review scripts for dry-run behavior, input validation, dependency handling, and rollback path.
- Review Ansible for idempotency, check-mode behavior, inventory targeting, tags, handlers, and module choice.
- Keep diffs focused.
## Definition of Done
- The change preserves the repository intent.
- Relevant docs are updated.
- Changed Bash scripts pass `bash -n`.
- Available validation helpers were run.
- Missing optional tools are reported.
- Any remaining risk or follow-up is documented.
## Do Not
- Do not add an "ultimate DevOps template" structure.
- Do not replace working simple Bash with unnecessary abstractions.
- Do not make examples appear production-certified.
- Do not add destructive behavior without `--execute`, confirmation, and clear rollback notes.
- Do not delete useful content unless it is clearly duplicate, broken, or misleading.
+52
@@ -1,5 +1,57 @@
# Changelog
## [Unreleased]
### Added
- Linux Fresh Setup Toolkit under `labs/linux/setup` for day-0 Ubuntu lab host bootstrap automation.
- AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation.
- Python tooling validation for operational scripts.
- `incident-log-summary` for general incident log summarization.
- `log-diff-checker` for pre-change and post-change log comparison.
- `auth-log-audit` for Linux authentication log review.
- `jvm-log-analyzer` for JVM application log summaries.
- `journal-analyzer` for exported `journalctl` log review.
- `known-error-matcher` with JSON-based known error patterns.
- Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics.
- `incident_triage_report.sh` for L2 Markdown incident handover reports built from existing Bash incident checks.
- Repository-level Codex guidance:
- `AGENTS.md`
- `docs/codex/README.md`
- `docs/codex/review-checklist.md`
- `docs/codex/task-template.md`
- `docs/codex/plans-template.md`
- Lightweight validation helpers:
- `scripts/validate-repo.sh`
- `scripts/check-bash.sh`
- `scripts/check-ansible.sh`
- `scripts/check-docs.sh`
- Cross-repository operational documentation structure:
- `infra-run/docs/operations-cheatsheet.md`
- `platform-projects/docs/platform-cheatsheet.md`
- `labs/docs/lab-cheatsheet.md`
- Production-oriented Linux/Unix operations reference with incident workflows, storage and networking checks, SSL/TLS notes, AIX commands, automation safety patterns, Ansible operational usage, and observability quick-reference.
- SELinux operational coverage for mode checks, context inspection, AVC audit review, persistent relabel workflow, booleans, and SELinux-specific incident response.
- Selected baseline Ansible hardening automation:
- RHEL 9 role and playbook.
- Debian 13 / Ubuntu 26.04 role and playbook.
- IBM AIX 7 role and playbook.
- Shared sanitized Ansible inventory defaults for Linux and AIX examples.
- Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
### Changed
- Updated root, `infra-run`, Bash, Ansible, platform, and lab README guidance for safety-first usage, validation, and future Codex-driven work.
- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.
- Updated Python tooling documentation and repository roadmap.
- Integrated Python syntax validation into repository validation workflow and CI.
### Notes
- Hardening content covers selected baseline controls and is intended for portfolio/lab use; live use requires environment-specific review and validation.
## [Initial Version]
### Added
Binary file not shown.
+102 -75
@@ -1,84 +1,111 @@
-# Portfolio
-This repository demonstrates real-world Linux infrastructure and operations experience through sanitized scripts, runbooks, and project structure. It focuses on production operations, incident response, troubleshooting, automation, and enterprise infrastructure patterns.
-## Repository Diagram
-```mermaid
-flowchart TD
-A["portfolio"] --> B["infra-run"]
-A --> C["platform-projects"]
-A --> D["labs"]
-B --> B1["ansible"]
-B --> B2["docs"]
-B --> B3["runbooks"]
-B --> B4["scripts"]
-B4 --> B41["bash"]
-B4 --> B42["python"]
-C --> C1["storage"]
-C --> C2["clustering"]
-C --> C3["monitoring-zabbix"]
-C --> C4["virtualization"]
-C --> C5["elk-log-analysis"]
-D --> D1["docker"]
-D --> D2["kubernetes"]
-D --> D3["terraform"]
-D --> D4["networking"]
-D --> D5["ci-cd"]
# Linux/Unix Infrastructure Engineering Portfolio
This repository contains sanitized infrastructure automation examples based on Linux/Unix operations and infrastructure workflows. The focus is on incident response, troubleshooting, pre-checks, dry-run behavior, controlled execution, post-checks, and readable operational evidence.
It is a technical portfolio, not a production toolkit. The examples show how operational work is structured: understand the current state, make changes only with explicit controls, verify the result, and leave enough evidence for review.
## What This Repo Is
- Practical Linux/Unix operations examples.
- Safe Bash and Ansible patterns for lab and review.
- Runbook-driven examples for incident response, storage operations, hardening, and observability.
- A place for platform and lab topics to grow without pretending unfinished areas are complete.
## What This Repo Is Not
- It is not a compliance benchmark implementation.
- It is not a drop-in change automation framework.
- It is not proof that these exact scripts ran in any production environment.
- It does not replace change review, peer review, backups, monitoring, or platform-specific runbooks.
## Repository Layout
- [infra-run](./infra-run/) - core operational tooling and automation.
- [platform-projects](./platform-projects/) - larger platform topics and case-study areas.
- [labs](./labs/) - experimental/lab environments and notes.
- [docs/codex](./docs/codex/) - guidance for future Codex-driven changes.
- [scripts](./scripts/) - lightweight repository validation helpers.
## Usable Now
- [infra-run](./infra-run/) - the main implemented project in this repository.
- [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
- [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
- [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
- [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
- [Incident log summary](./infra-run/scripts/python/incident-log-summary/) - read-only Python helper for local incident log pattern summaries.
- [Log diff checker](./infra-run/scripts/python/log-diff-checker/) - read-only Python helper for before/after change log comparison.
- [Auth log audit](./infra-run/scripts/python/auth-log-audit/) - read-only Python helper for local authentication log review.
- [JVM log analyzer](./infra-run/scripts/python/jvm-log-analyzer/) - read-only Python helper for local JVM and Java application log review.
- [Journal analyzer](./infra-run/scripts/python/journal-analyzer/) - read-only Python helper for exported `journalctl` text review.
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
## Planned Areas
The `labs` and `platform-projects` trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in [ROADMAP.md](./ROADMAP.md).
## Documentation
### Production Operations
- [infra-run/docs/operations-cheatsheet.md](./infra-run/docs/operations-cheatsheet.md) - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.
### Platform Engineering
- [platform-projects/docs/platform-cheatsheet.md](./platform-projects/docs/platform-cheatsheet.md) - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.
### Labs & Experiments
- [labs/docs/lab-cheatsheet.md](./labs/docs/lab-cheatsheet.md) - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.
### Codex and Review Guidance
- [AGENTS.md](./AGENTS.md) - repository rules for automated and assisted changes.
- [docs/codex/README.md](./docs/codex/README.md) - Codex workflow and expected final response format.
- [docs/codex/review-checklist.md](./docs/codex/review-checklist.md) - safety, Bash, Ansible, docs, and validation review checklist.
- [docs/codex/task-template.md](./docs/codex/task-template.md) - reusable scoped task templates.
## Safety-First Usage
Read scripts and playbooks before running them. Operational examples are sanitized and may need adaptation for a real system.
- Prefer read-only commands first.
- Use dry-run/check mode before execution.
- Treat `--execute` as a change-control boundary.
- Confirm backups, monitoring, application impact, and rollback steps before live use.
- Do not run platform-specific storage commands without a matching Veritas, GPFS, or AIX lab.
## Validation
Basic local validation:
```bash
./scripts/validate-repo.sh
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```
-## Core Project
-### infra-run
-`infra-run` is the core operational project in this repository. It contains Linux operations automation, incident response tooling, Bash-based operational scripts, and runbook-style workflows for pre-checks, controlled changes, troubleshooting, and post-change validation.
-## Toolkits
-### Linux Operations Toolkit
-[infra-run/scripts/bash/os-healthcheck/](./infra-run/scripts/bash/os-healthcheck/)
-General Linux operations scripts for host health checks, disk usage checks, service validation, system reporting, and first-pass OS-level diagnostics. The toolkit is written for practical operations checks on RHEL, Oracle Linux, and Ubuntu-style systems.
-### Disk Full Incident Toolkit
-[infra-run/scripts/bash/disk-full/](./infra-run/scripts/bash/disk-full/)
-Production-style disk full incident workflow covering filesystem usage, inode pressure, large file discovery, deleted open files, top directory analysis, log cleanup review, and safe cleanup suggestions. The scenario reflects common incidents involving logs, temporary files, deleted files held open by processes, and inode exhaustion.
-### Network Troubleshooting
-[infra-run/scripts/bash/os-healthcheck/](./infra-run/scripts/bash/os-healthcheck/)
-OS-level network diagnostics for interfaces, routes, DNS resolution, gateway reachability, listening sockets, and optional remote connectivity checks. The script is designed for first-pass troubleshooting during Linux operations incidents.
-### Veritas Storage Toolkit
-[infra-run/scripts/bash/veritas/](./infra-run/scripts/bash/veritas/)
-Veritas VxVM and VCS storage expansion workflow covering new LUN detection, VxVM disk initialization, diskgroup extension, volume and filesystem resize, and VCS service group freeze/unfreeze handling. The approach is cluster-safe, dry-run by default, and organized around pre-check, change, and post-check steps.
-### GPFS Storage Toolkit
-[infra-run/scripts/bash/gpfs/](./infra-run/scripts/bash/gpfs/)
-GPFS / IBM Spectrum Scale filesystem expansion workflow covering cluster validation, candidate disk discovery, NSD stanza planning, NSD creation, filesystem expansion, optional rebalance, post-checks, and change reporting.
-## Repository Structure
-- `infra-run` - core operational automation, scripts, runbooks, and infrastructure operations examples.
-- `platform-projects` - larger infrastructure topics including storage, clustering, monitoring, virtualization, and log analysis.
-- `labs` - experimentation and lab work for Kubernetes, Terraform, Docker, networking, and CI/CD.
-## Design Principles
-- Safety first, with dry-run behavior by default.
-- Pre-check, change, and post-check workflow.
-- Real-world scenarios, not tutorials.
-- Minimal but practical tooling.
-## Notes
-- Scripts are simplified and sanitized for portfolio use.
-- Examples are based on real production operations patterns.
The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Python checks use `python3 -m py_compile` and do not require external Python tooling. Set `STRICT=1` to fail when optional tools are missing.
Some scripts depend on platform tools such as `vxdisk`, `hagrp`, `mmcrnsd`, and `mmlscluster`. Those commands are not expected to exist on a normal workstation, so functional testing against Veritas or GPFS requires a real lab environment.
See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATIONS.md](./infra-run/KNOWN_LIMITATIONS.md) for the current validation status.
## Operational Areas Demonstrated
- Linux operations triage and reporting.
- Local operational log analysis with read-only Python helpers.
- Disk pressure and deleted-file incident analysis.
- Dry-run-first Bash automation.
- Controlled storage change workflow design.
- Veritas VxVM/VCS operational awareness.
- GPFS / IBM Spectrum Scale operational awareness.
- Ansible role organization for selected hardening controls.
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
- Clear documentation of what was tested and what still needs a real system.
+38
@@ -0,0 +1,38 @@
# Roadmap
This file keeps future portfolio ideas in one place so empty folders do not look like finished work.
## Planned Lab Areas
- Docker: image build notes, container troubleshooting, and small service examples.
- Kubernetes: workload inspection, basic operations checks, and failure scenario notes.
- Terraform: small infrastructure-as-code examples with clear plan/apply separation.
- Networking: DNS, routing, firewall, and connectivity troubleshooting labs.
- CI/CD: validation pipelines for shell, YAML, and Ansible examples.
## Planned Platform Case Studies
- Storage: expansion planning, filesystem checks, and SAN handoff documentation.
- Clustering: service group checks, failover review, and operational checklists.
- Monitoring: Zabbix-oriented alert review and host onboarding notes.
- Virtualization: VM lifecycle and platform operations examples.
- Log analysis: optional ELK-style search case study under `platform-projects`, separate from current local Python helpers.
## Implemented Portfolio Additions
- Standalone Bash incident checks under `infra-run/scripts/bash/incident-checks/` for common Linux incident triage and ticket evidence.
- Python operational log analysis suite under `infra-run/scripts/python/`:
- `incident-log-summary`
- `log-diff-checker`
- `auth-log-audit`
- `jvm-log-analyzer`
- `journal-analyzer`
- `known-error-matcher`
## Future Python Tooling Ideas
- Real-world sample report examples using sanitized evidence.
- Integration examples that combine log summaries with change evidence collection.
- A shared Python helper library only if the standalone tools begin duplicating enough stable behavior to justify it.
Planned sections remain future work unless listed as implemented.
+55
@@ -0,0 +1,55 @@
# Codex Workflow
This directory keeps future Codex sessions consistent when working in this infrastructure portfolio.
## How To Start
1. Read [AGENTS.md](../../AGENTS.md).
2. Inspect the affected tree and nearby README files.
3. Check `git status --short` so existing user work is preserved.
4. Decide whether a plan is needed before editing.
5. Make small, reviewable changes.
6. Run focused validation plus `./scripts/validate-repo.sh` when practical.
## When To Plan First
Plan before editing when a task touches more than one subsystem, changes operational behavior, adds or modifies destructive actions, changes Ansible targeting, or updates repository conventions.
For small typo fixes, narrow README updates, or obvious syntax fixes, inspect first and then make the change directly.
Use [plans-template.md](./plans-template.md) for larger changes.
## Scoped Tasks
Good tasks name the operational goal, affected directories, constraints, validation commands, and what "done" means. Use [task-template.md](./task-template.md) for reusable prompts.
Keep scope tied to real operations:
- Bash tool: discovery, pre-check, dry-run, execute, post-check, report.
- Ansible change: inventory target, role/playbook scope, check mode, idempotency, validation.
- Runbook: incident signal, triage, decision points, rollback, evidence.
- Lab/platform project: status, prerequisites, validation, limitations.
## Validation
Prefer the repository helpers:
```bash
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-docs.sh
./scripts/validate-repo.sh
```
If optional tools are missing, report that clearly and continue with available checks. Do not claim skipped checks passed.
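To keep skipped checks from being read as passes, a helper can count them separately and print all three totals. This is a sketch under assumptions: `run_check`, `summary`, and the `SKIPPED` label are hypothetical and not taken from the repository helpers.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: track passed, failed, and skipped checks separately
# so a missing tool is never reported as a passing check.

passed=0
failed=0
skipped=0

# run_check <label> <tool> <command...>
run_check() {
  local label="$1" tool="$2"
  shift 2
  if ! command -v "$tool" >/dev/null 2>&1; then
    echo "SKIPPED: $label ($tool not installed)"
    skipped=$((skipped + 1))
    return 0
  fi
  if "$@" >/dev/null 2>&1; then
    echo "OK: $label"
    passed=$((passed + 1))
  else
    echo "CRITICAL: $label"
    failed=$((failed + 1))
  fi
}

# Print the totals; the return status is non-zero when any check failed.
summary() {
  echo "passed=$passed failed=$failed skipped=$skipped"
  [ "$failed" -eq 0 ]
}
```

Because the counters live in the calling shell, `run_check` should be called directly rather than inside `$(...)`, which would run it in a subshell and lose the counts.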
## Final Response Format
End with:
1. Summary of what changed.
2. Files created or modified.
3. Validation commands run and results.
4. Skipped checks and why.
5. Risks or follow-ups.
6. Whether the repo is ready for future Codex-driven work.
+35
@@ -0,0 +1,35 @@
# Implementation Plan Template
Use this for changes that touch multiple files, alter operational behavior, or add new repository conventions.
## Goal
State the operational or maintenance outcome.
## Current State
Summarize the directories and conventions inspected.
## Scope
List files or directories expected to change.
## Non-Goals
Name what will not be redesigned, renamed, deleted, or claimed as complete.
## Plan
1. Inspect relevant scripts, playbooks, docs, and examples.
2. Make the smallest structural or documentation changes needed.
3. Update validation or runbook guidance.
4. Run focused checks.
5. Summarize residual risk and follow-ups.
## Validation
List commands to run, including fallback behavior for missing tools.
## Risks
Call out destructive operations, platform assumptions, missing lab environments, or checks that require real systems.
+52
@@ -0,0 +1,52 @@
# Review Checklist
Use this checklist for repository reviews and pull requests.
## Safety
- Destructive actions default to dry-run or read-only.
- Real changes require explicit `--execute` and operator confirmation.
- Inputs are validated before use.
- Paths, service names, disks, volumes, and inventory targets are constrained.
- Rollback or recovery thinking is documented where the operation can change state.
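The first four safety items can be illustrated with one guarded action. The sketch below is hypothetical: `valid_service_name` and `restart_service` are illustrative names, the character allowlist is an assumption rather than a complete unit-name grammar, and the `DRY-RUN` prefix is not part of the repository's documented output levels.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: constrain a service name, default to dry-run,
# and require both --execute and operator confirmation before changing state.

# Accept only characters from an assumed allowlist for unit names.
valid_service_name() {
  case "${1:-}" in
    ''|*[!A-Za-z0-9_.@-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# restart_service <name> [--execute]
restart_service() {
  local name="${1:-}" mode="${2:-}"
  if ! valid_service_name "$name"; then
    echo "CRITICAL: invalid service name: '$name'"
    return 2
  fi
  if [ "$mode" != "--execute" ]; then
    # Default path: show the planned command and change nothing.
    echo "DRY-RUN: would run: systemctl restart $name"
    return 0
  fi
  local answer
  read -r -p "Restart $name? Type 'yes' to continue: " answer
  if [ "$answer" != "yes" ]; then
    echo "WARNING: operator declined, no change made"
    return 0
  fi
  systemctl restart "$name"
}
```

Validating before the dry-run branch means a bad name is rejected even when nothing would be executed, so the dry-run output is itself trustworthy evidence.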
## Bash
- Uses `#!/usr/bin/env bash`.
- Uses `set -o errexit`, `set -o nounset`, and `set -o pipefail`.
- Missing commands return a clear warning or invalid-input/dependency exit.
- Output uses `OK`, `WARNING`, and `CRITICAL` consistently.
- Exit codes follow repo convention: `0` OK, `1` operational issue, `2` invalid input or missing dependency.
- Help output exists for scripts that accept arguments.
## Ansible
- Target hosts are explicit and appropriate for the role.
- Modules are preferred over `shell` or `command`.
- Check mode and diff mode are considered.
- Tasks are idempotent or clearly documented when a check is inherently read-only or platform-specific.
- Handlers, tags, defaults, and validation tasks are used where useful.
- Inventory, vars, and role defaults do not contain secrets or real environment data.
## Documentation
- README files explain current state without overstating completeness.
- Runbooks include scope, pre-checks, execution controls, post-checks, and evidence.
- Docs avoid tutorial filler and fake enterprise complexity.
- Important limitations are linked or documented.
- `CHANGELOG.md` is updated for meaningful repo changes.
## Operational Realism
- The change reflects RHEL/Oracle Linux, Debian/Ubuntu, AIX, Veritas, GPFS, Zabbix, ELK, Docker, Kubernetes/K3s, Terraform, VMware, or Proxmox operations accurately.
- Examples remain sanitized.
- Placeholder projects are identified as placeholders.
- There is no unnecessary abstraction or invented complexity.
## Validation
- Changed Bash scripts pass `bash -n`.
- `shellcheck` was run if available, or its absence was reported.
- Ansible syntax/lint checks were run if available and relevant.
- YAML/Markdown sanity checks were run if available.
- Failures and skipped checks are visible in the final summary.
+276
@@ -0,0 +1,276 @@
# Task Templates
Copy the relevant section into a future Codex request and fill in the blanks.
## Operational Bash Tool
### Goal
Build or improve a Bash tool for:
### Context
Affected platform, incident, or operational workflow:
### Constraints
- Default to dry-run/read-only.
- Require `--execute` for changes.
- Use `OK`, `WARNING`, and `CRITICAL`.
- Exit `0` OK, `1` operational issue, `2` invalid input or missing dependency.
### Files/directories to inspect
- `infra-run/scripts/bash/`
- Relevant runbook or README:
### Implementation steps
1. Inspect neighboring scripts and shared helpers.
2. Add or adjust usage/help output.
3. Add discovery, pre-check, guarded change, post-check, and reporting sections where useful.
4. Update README or runbook notes.
### Validation commands
```bash
bash -n <script>
./scripts/check-bash.sh
```
### Done when
The tool is readable, safe by default, validates inputs, reports clearly, and has updated docs.
## Ansible Playbook/Role
### Goal
Add or improve Ansible automation for:
### Context
Target OS and inventory group:
### Constraints
- Preserve check-mode friendliness.
- Prefer modules over shell/command.
- Keep playbooks short.
- Keep role defaults sanitized.
### Files/directories to inspect
- `infra-run/ansible/README.md`
- `infra-run/ansible/inventory/`
- `infra-run/ansible/playbooks/`
- `infra-run/ansible/roles/`
### Implementation steps
1. Inspect existing role/playbook patterns.
2. Add defaults, tasks, handlers, and tags only where needed.
3. Add validation or post-check tasks for operational evidence.
4. Update role/playbook README.
### Validation commands
```bash
./scripts/check-ansible.sh
cd infra-run/ansible && ansible-playbook --syntax-check -i inventory/hosts.yml playbooks/<playbook>.yml
```
### Done when
The playbook targets the right hosts, is idempotent where practical, supports review with `--check --diff`, and docs explain limitations.
## Runbook
### Goal
Create or improve a runbook for:
### Context
Incident signal, platform, and affected service:
### Constraints
- Include pre-checks, decision points, rollback, post-checks, and evidence.
- Avoid pretending lab notes are production-certified.
### Files/directories to inspect
- `infra-run/runbooks/`
- `infra-run/docs/`
- Related scripts/examples:
### Implementation steps
1. Define scope and assumptions.
2. Add triage steps and command examples.
3. Add safe execution gates.
4. Add validation and handoff notes.
### Validation commands
```bash
./scripts/check-docs.sh
```
### Done when
An operator can follow the runbook without guessing the risk, inputs, or success criteria.
## Lab Scenario
### Goal
Add or improve a lab scenario for:
### Context
Technology and local environment:
### Constraints
- Mark lab-only behavior clearly.
- Keep prerequisites and cleanup explicit.
### Files/directories to inspect
- `labs/`
- `labs/docs/lab-cheatsheet.md`
### Implementation steps
1. Document prerequisites and topology.
2. Add setup, validation, failure injection if relevant, and cleanup.
3. Link related scripts or runbooks.
### Validation commands
```bash
./scripts/check-docs.sh
```
### Done when
The lab is reproducible enough to review and does not imply production readiness.
## Platform Project
### Goal
Add or improve a platform project for:
### Context
Monitoring, storage, clustering, virtualization, observability, or related topic:
### Constraints
- Keep status honest: planned, partial, lab-tested, or complete.
- Prefer operational notes over marketing language.
### Files/directories to inspect
- `platform-projects/`
- `platform-projects/docs/platform-cheatsheet.md`
### Implementation steps
1. Identify scope and current maturity.
2. Add design notes, operational workflows, and validation.
3. Link runbooks, examples, and known limitations.
### Validation commands
```bash
./scripts/check-docs.sh
```
### Done when
The project explains what exists, how to validate it, and what remains unproven.
## Documentation Cleanup
### Goal
Clean up documentation for:
### Context
Current confusion, duplication, or missing links:
### Constraints
- Preserve useful operational detail.
- Avoid tutorial-style filler.
### Files/directories to inspect
- Root `README.md`
- Section README files
- Related docs/runbooks:
### Implementation steps
1. Remove duplication where it hurts navigation.
2. Add links to canonical docs.
3. Make limitations explicit.
4. Update changelog if meaningful.
### Validation commands
```bash
./scripts/check-docs.sh
```
### Done when
Readers can find the right tool, runbook, or validation command quickly.
## Repository Review
### Goal
Review repository quality for:
### Context
Areas of concern:
### Constraints
- Findings first, ordered by severity.
- Include file/line references where possible.
- Do not rewrite unrelated content.
### Files/directories to inspect
- `AGENTS.md`
- `README.md`
- `infra-run/`
- `platform-projects/`
- `labs/`
- `scripts/`
### Implementation steps
1. Inspect structure and conventions.
2. Review safety, validation, docs, and maintainability.
3. Patch only low-risk issues if requested.
4. Report risks and follow-ups.
### Validation commands
```bash
./scripts/validate-repo.sh
git diff --stat
```
### Done when
The review identifies practical risks and leaves a clear next action list.
+9
@@ -0,0 +1,9 @@
# Known Limitations
- Veritas scripts require manual review before real use. VxVM and VCS behavior varies by version, cluster design, naming convention, and operational policy.
- GPFS commands require a real cluster and must be adapted to the site layout, NSD naming standard, failure groups, storage pools, and maintenance process.
- The AIX Ansible role is a portfolio example unless tested on a real AIX LPAR with the target OpenSSH, sudo, audit, and OS levels.
- SSH hardening must be validated against the full `sshd` configuration, not only a managed drop-in file.
- The hardening examples cover selected controls only. They are not a full CIS benchmark implementation or compliance attestation.
- Scripts do not replace formal change procedures, peer review, backups, monitoring checks, or rollback planning.
- Sample outputs are fake and sanitized. They should be used for documentation review, not operational decisions.
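One way to review the full effective `sshd` configuration, rather than a single drop-in file, is to capture the output of `sshd -T` (OpenSSH extended test mode, which prints the resolved settings as lowercase keyword/value lines) and compare it with the expected values. The sketch below assumes that output was already saved as text; the function names and the expected values in the usage note are hypothetical.

```python
def parse_effective_sshd(text):
    """Parse `sshd -T` style output into a {keyword: [values]} mapping."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keyword, _, value = line.partition(" ")
        # Some keywords can appear more than once, so values are kept as a list.
        settings.setdefault(keyword.lower(), []).append(value.strip())
    return settings


def find_mismatches(settings, expected):
    """Return (keyword, wanted, actual_values) for settings that differ."""
    mismatches = []
    for keyword, wanted in expected.items():
        actual = settings.get(keyword.lower(), [])
        if wanted not in actual:
            mismatches.append((keyword, wanted, actual))
    return mismatches
```

For example, `find_mismatches(parse_effective_sshd(output), {"PermitRootLogin": "no"})` returns an empty list only when the resolved configuration, including any `Include` files, yields that value.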
+93 -19
@@ -1,27 +1,101 @@
 # infra-run
-`infra-run` is the operational core of this repository. It groups automation, scripts, runbooks, and supporting documentation for Linux infrastructure work, incident response, and controlled change execution.
-## Diagram
-```mermaid
-flowchart TD
-A["infra-run"] --> B["ansible"]
-A --> C["docs"]
-A --> D["runbooks"]
-A --> E["scripts"]
-E --> E1["bash"]
-E --> E2["python"]
-```
-## Scope
-- `ansible` - placeholder structure for infrastructure automation and testing.
-- `docs` - supporting technical notes and written documentation.
-- `runbooks` - procedural operational guides.
-- `scripts` - executable tooling for operations and diagnostics.
-## Notes
-- This folder reflects the structure of a production-oriented operations repository.
-- Current implementation is strongest in the Bash tooling under `scripts/bash`.
+`infra-run` is a sanitized infrastructure operations project. It contains Bash, Ansible, Python, and documentation examples based on Linux administration, incident response, storage operations, hardening, prechecks, postchecks, and controlled change workflows.
+The goal is to show operational judgment, not to ship a universal automation product.
+## Current Contents
+### Bash Operational Scripts
+- [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
+- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, JVM diagnostics, and an L2 Markdown triage report wrapper.
+- [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
+- [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
+- [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.
+### Python Log And Reporting Tools
+- [scripts/python](./scripts/python/) - read-only Python operational helpers using the standard library only.
+- [scripts/python/incident-log-summary](./scripts/python/incident-log-summary/) - read-only Python log summary helper for incident pattern review.
+- [scripts/python/log-diff-checker](./scripts/python/log-diff-checker/) - read-only Python before/after log comparison helper for change review.
+- [scripts/python/auth-log-audit](./scripts/python/auth-log-audit/) - read-only Python authentication log audit helper for SSH, sudo, su, and PAM review.
+- [scripts/python/jvm-log-analyzer](./scripts/python/jvm-log-analyzer/) - read-only Python JVM and Java application log analyzer for exception, stack trace, HTTP 5xx, database, and TLS review.
+- [scripts/python/journal-analyzer](./scripts/python/journal-analyzer/) - read-only Python exported journal analyzer for failed units, restart patterns, OOM events, and service warnings.
+- [scripts/python/known-error-matcher](./scripts/python/known-error-matcher/) - read-only Python matcher for local logs and JSON known-error catalogs with runbook references.
+### Ansible Automation
+- [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.
+### Runbooks And Documentation
+- [examples](./examples/) - sanitized sample command outputs and incident notes.
+## Documentation
+- [docs/operations-cheatsheet.md](./docs/operations-cheatsheet.md) - production operations quick reference covering Linux/Unix triage, text processing, incident workflows, networking, storage, AIX, SSL/TLS, automation safety, Ansible execution, observability, and operational habits.
+## What This Is
+- A portfolio project for Linux and infrastructure operations roles.
+- A set of readable examples showing precheck, dry-run, execution guardrails, postcheck, and reporting patterns.
+- A place to demonstrate Bash, Ansible, storage workflow, and troubleshooting habits with sanitized inputs.
+## What This Is Not
+- Not intended for direct live use.
+- Not a complete CIS benchmark implementation.
+- Not a replacement for site-specific change procedures.
+- Not tested against live Veritas, GPFS, or AIX systems in this repository.
+- Not safe to run blindly on servers without review.
+## Currently Usable
+- Bash syntax can be checked locally.
+- Shell scripts can be reviewed and partially exercised on a Linux workstation when platform commands are available or mocked.
+- Disk-full read-only scripts can be run against local paths for basic behavior checks.
+- Python log analysis examples can be run against sanitized sample logs under each tool directory.
+- Ansible YAML and role structure can be linted locally.
+## Running Safely
+- Start with the relevant README or runbook before executing a script.
+- Prefer read-only discovery scripts before remediation scripts.
+- Use dry-run mode unless a script explicitly documents safe local behavior.
+- Only use `--execute` after reviewing inputs, affected systems, rollback options, and post-checks.
+- For Ansible, start with `--check --diff` against a lab inventory.
+## Lab-Safe Examples
+- Veritas and GPFS scripts default to dry-run behavior where they plan destructive or platform-changing operations.
+- Ansible hardening roles are examples of selected controls and need adaptation before use.
+- Sample outputs under [examples](./examples/) are fake and sanitized.
+## Tested
+See [TESTED.md](./TESTED.md) for current validation status.
+Short version:
+- Shell scripts were reviewed for dry-run behavior and obvious quoting issues.
+- YAML and Ansible files are intended for local linting.
+- Veritas, GPFS, and AIX behavior was not validated against real systems here.
+## Basic Validation
+From the repository root:
+```bash
+./scripts/validate-repo.sh
+```
+Focused checks are available in `scripts/check-bash.sh`, `scripts/check-ansible.sh`, `scripts/check-python.sh`, and `scripts/check-docs.sh`. If `ansible-lint` reports collection-related issues, install the collections listed in [ansible/collections/requirements.yml](./ansible/collections/requirements.yml) and rerun it. Treat lint as a starting point; platform testing still requires actual target systems.
+## Supporting Notes
+- [SOURCE.md](./SOURCE.md) explains why this project exists and what experience shaped it.
+- [TESTED.md](./TESTED.md) lists what was checked locally and what was not.
+- [KNOWN_LIMITATIONS.md](./KNOWN_LIMITATIONS.md) documents technical limits and operational cautions.
+- [ROADMAP.md](./ROADMAP.md) tracks planned additions without presenting them as completed work.
+- [../AGENTS.md](../AGENTS.md) and [../docs/codex](../docs/codex/) document repository working rules and review expectations.
+31
@@ -0,0 +1,31 @@
# infra-run Roadmap
This file tracks planned `infra-run` additions without presenting them as completed work.
## Candidate Additions
- More sample reports for disk pressure, service failures, and network incidents.
- A small Python parser for converting script output into a markdown change note.
- Additional Ansible molecule or container-based syntax checks where platform support is realistic.
- Standalone runbooks that reference the existing Bash workflows.
- Shared known-error pattern catalog review.
- Additional links between Python findings and existing runbooks.
- Change evidence collector for pre-check and post-check notes.
- Report examples suitable for incident and change tickets.
- Optional wrapper command only after the standalone Python tools stabilize.
## Implemented Additions
- `infra-run/scripts/bash/incident-checks/` - standalone read-only Bash checks for CPU, memory/OOM, service restart loops, failed SSH logins, TLS certificate expiry, DNS connectivity, time sync drift, read-only filesystems, inode pressure, and JVM process diagnostics.
- `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files.
- `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review.
- `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review.
- `infra-run/scripts/python/jvm-log-analyzer/` - read-only JVM and Java application log analyzer for exceptions, stack traces, HTTP 5xx entries, database issues, TLS failures, and JVM failure symptoms.
- `infra-run/scripts/python/journal-analyzer/` - read-only exported `journalctl` text analyzer for summarizing failed units, dependency issues, restart patterns, OOM findings, disk/filesystem symptoms, and related service warnings.
- `infra-run/scripts/python/known-error-matcher/` - read-only known-error matcher for local logs and JSON pattern catalogs with severity, category, samples, and runbook references.
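The known-error matching idea can be sketched as follows. This is not the repository's implementation: the catalog schema (`id`, `pattern`, `severity`, `category`, `runbook`) and both function names are assumptions made for illustration, using only the standard library.

```python
import json
import re


def load_catalog(catalog_json):
    """Load a hypothetical JSON catalog (a list of entries) and compile patterns."""
    entries = json.loads(catalog_json)
    for entry in entries:
        entry["regex"] = re.compile(entry["pattern"])
    return entries


def match_lines(lines, catalog, max_samples=3):
    """Count catalog matches per entry and keep a few sample lines for each."""
    findings = {}
    for line in lines:
        for entry in catalog:
            if entry["regex"].search(line):
                hit = findings.setdefault(entry["id"], {
                    "severity": entry["severity"],
                    "category": entry["category"],
                    "runbook": entry["runbook"],
                    "count": 0,
                    "samples": [],
                })
                hit["count"] += 1
                if len(hit["samples"]) < max_samples:
                    hit["samples"].append(line.rstrip("\n"))
    return findings
```

Capping the samples keeps a report readable when one pattern matches thousands of lines, while `count` still shows the real volume.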
## Not Planned
- A full compliance benchmark implementation.
- Automated production changes without review gates.
- Vendor-specific storage actions that cannot be tested in a lab.
+37
@@ -0,0 +1,37 @@
# Source And Intent
`infra-run` exists to present infrastructure operations work in a form that can be reviewed without exposing employer systems, hostnames, storage identifiers, tickets, or internal procedures.
The project is inspired by professional Linux and infrastructure operations work: prechecks before changes, postchecks after changes, disk-pressure incidents, SSH and sudo hardening, storage expansion planning, cluster awareness, and the need to leave clear notes for other engineers.
## What Is Realistic
- The workflow shape: precheck, dry-run, execute only with explicit approval, postcheck, and report.
- The operational topics: Linux health checks, disk-full triage, Veritas VxVM/VCS concepts, GPFS / IBM Spectrum Scale concepts, and selected OS hardening controls.
- The caution around storage, clustering, SSH, sudo, audit, and filesystem changes.
## What Is Simplified
- Commands are written as examples and do not cover every vendor, OS release, package layout, or site standard.
- The Veritas and GPFS scripts model common workflow steps but cannot validate a real cluster from this repository.
- The Ansible roles apply selected baseline controls; they are not full compliance implementations.
- Reporting examples use sanitized sample data.
## What Was Sanitized
- Hostnames, IP addresses, disk names, WWNs, ticket numbers, application names, company names, and environment-specific values.
- Exact production procedures and internal approval paths.
- Any data that could identify a real system or organization.
## Production Caution
Do not run these scripts blindly on production systems. Review every command, adapt variables and paths, test in a lab, confirm backups and rollback plans, and follow the local change process.
This project does not claim that the exact scripts were used in production.
## Roles This Supports
- Linux System Administrator
- Infrastructure Engineer
- SRE / DevOps Operations Engineer
- Linux Platform Engineer
+44
@@ -0,0 +1,44 @@
# Tested
This file documents the validation status for `infra-run`.
## Tested Locally
- Repository structure and documentation links were reviewed.
- Bash scripts were reviewed for dry-run defaults, quoting, and obvious unsafe cleanup behavior.
- Disk-full examples use fake data and can be read without access to production systems.
## Syntax Checked
Recommended local checks:
```bash
find infra-run/scripts/bash -name '*.sh' -print0 | xargs -0 shellcheck -x -P infra-run/scripts/bash/disk-full -P infra-run/scripts/bash/gpfs -P infra-run/scripts/bash/veritas
yamllint .
cd infra-run/ansible && ansible-lint playbooks roles
```
The GitHub Actions workflow runs shell and YAML validation. `ansible-lint` is non-blocking because role behavior depends on platform facts, installed collections, and target OS support.
## Not Tested Against Real Systems
- Veritas VxVM/VCS commands were not tested against a live Veritas cluster here.
- GPFS / IBM Spectrum Scale commands were not tested against a live GPFS cluster here.
- AIX hardening tasks were not tested against a real AIX LPAR here.
- SSH hardening was not validated across every possible `sshd_config` layout.
## Known Limitations
- Destructive storage operations are dry-run by default where applicable, but dry-run output is not a substitute for peer review.
- Some scripts require vendor commands that are not available on a normal Linux workstation.
- Ansible examples are selected baseline controls, not full hardening benchmarks.
- Local linting does not prove production safety.
## Suggested Validation Steps
1. Run `shellcheck` against all Bash scripts.
2. Run `yamllint` against repository YAML.
3. Run `cd infra-run/ansible && ansible-lint playbooks roles` and review any non-blocking warnings.
4. Run disk-full read-only scripts on disposable local paths.
5. For Veritas or GPFS, test only in a lab with fake volumes/disks or a controlled training environment.
6. Validate SSH changes on a disposable host using the full effective `sshd` configuration.
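Steps 1 to 3 depend on tools that may not be installed on every workstation. A small pre-flight check, sketched here with a hypothetical `plan_checks` helper, can report which linters are available before any of them run. The run/skip policy is an assumption, not something the repository defines.

```python
import shutil

# Tool names come from the steps above; everything else is illustrative.
REQUIRED_TOOLS = ("shellcheck", "yamllint", "ansible-lint")


def plan_checks(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return (tool, decision) pairs without executing anything."""
    plan = []
    for tool in tools:
        decision = "run" if which(tool) else "skip: not installed"
        plan.append((tool, decision))
    return plan
```

Reporting skipped tools explicitly avoids the situation where a missing linter is silently read as a clean result.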
+16 -7
@@ -1,6 +1,6 @@
 # infra-run/ansible
-This directory reserves the Ansible automation area for future infrastructure-as-code content. It is organized around the standard separation of inventory, roles, playbooks, collections, and tests.
+This directory contains Ansible automation for infrastructure operations and OS hardening. It is organized around the standard separation of inventory, roles, playbooks, collections, and tests.
 ## Diagram
@@ -17,13 +17,22 @@ flowchart TD
 ## Scope
-- `collections` - vendored or custom Ansible collections.
-- `inventory` - environment inventory definitions and variables.
-- `playbooks` - executable playbooks for repeatable operations.
-- `roles` - reusable automation roles.
+- `collections` - collection requirements for supported automation targets.
+- `inventory` - sanitized Linux and AIX inventory examples with shared defaults.
+- `playbooks` - executable selected baseline hardening playbooks.
+- `roles` - reusable hardening roles for supported operating systems.
 - `tests` - validation and test harnesses for Ansible content.
+## Hardening Coverage
+- `cis-rhel9-hardening` - RHEL 9 baseline tasks for packages, services, SSH, sudo, sysctl, auditing, logging, filesystem controls, and validation.
+- `cis-debian-ubuntu-hardening` - Debian 13 and Ubuntu 26.04 baseline tasks for apt packages, services, SSH, sudo, sysctl, auditing, logging, filesystem controls, and validation.
+- `cis-aix7-hardening` - IBM AIX 7 baseline tasks for SSH, sudo, audit, logging, cron, users, password policy, network settings, filesystem controls, services, and validation.
 ## Notes
-- The directory layout is already prepared for growth even where content is still placeholder-only.
-- This keeps the repository ready for automation expansion alongside the existing script toolkits.
+- Roles are selected baseline examples intended for portfolio and lab use, not a drop-in compliance certification.
+- Defaults are sanitized and configurable through inventory or `--extra-vars`.
+- Run platform-specific playbooks against appropriate test hosts before adapting them to managed environments.
+- Prefer `--check --diff` for review runs before applying changes.
+- Validate from the repository root with `./scripts/check-ansible.sh`.
+9
@@ -0,0 +1,9 @@
[defaults]
inventory = inventory/hosts.yml
roles_path = roles
host_key_checking = False
retry_files_enabled = False
stdout_callback = yaml
[privilege_escalation]
become = True
@@ -0,0 +1,4 @@
---
collections:
- name: ansible.posix
- name: community.general
+9 -2
@@ -16,8 +16,15 @@ flowchart TD
 - `group_vars` - variables applied at group or environment level.
 - `host_vars` - variables tailored to individual nodes.
+- `hosts.yml` - sanitized example groups for Linux and AIX hardening targets.
+## Current Inventory Shape
+- `linux` - local example host for Linux hardening playbooks.
+- `aix` - empty sanitized group ready for AIX host definitions.
+- `group_vars/all.yml` - shared hardening defaults such as NTP servers, SSH behavior, audit/logging toggles, sysctl hardening, and optional mount management.
 ## Notes
-- The structure is present even though the repository currently keeps this area sanitized and mostly empty.
-- This is the natural companion to future playbooks and roles under `infra-run/ansible`.
+- Inventory values are intentionally sanitized.
+- Override defaults per host, per group, or per run before applying any hardening playbook.
@@ -0,0 +1,18 @@
---
timezone: UTC
cis_ntp_servers:
- 0.rhel.pool.ntp.org
- 1.rhel.pool.ntp.org
- 2.rhel.pool.ntp.org
- 3.rhel.pool.ntp.org
# Operational defaults. Override per run with --extra-vars or inventory when needed.
cis_disable_root_login: true
cis_disable_password_auth: false
cis_install_auditd: true
cis_enable_chrony: true
cis_enable_rsyslog: true
cis_remove_legacy_packages: true
cis_enable_sysctl_hardening: true
cis_manage_mount_options: false
+8
@@ -0,0 +1,8 @@
---
linux:
  hosts:
    localhost:
      ansible_connection: local
aix:
  hosts: {}
+5 -3
@@ -1,6 +1,6 @@
 # infra-run/ansible/playbooks
-This directory is intended for executable Ansible playbooks that coordinate roles, inventories, and operational tasks. In the current portfolio state it acts as a prepared entry point for future automation runs.
+This directory contains executable Ansible playbooks that coordinate roles, inventories, and operational hardening tasks.
 ## Diagram
@@ -14,5 +14,7 @@ flowchart TD
 ## Notes
-- Playbooks belong here when the repository expands beyond script-first operations.
-- The directory currently contains only placeholder content.
+- `cis-rhel9-hardening.yml` applies the RHEL 9 selected baseline hardening role to Linux inventory targets.
+- `cis-debian-ubuntu-hardening.yml` applies the Debian 13 / Ubuntu 26.04 selected baseline hardening role to Linux inventory targets.
+- `cis-aix7-hardening.yml` applies the IBM AIX 7 selected baseline hardening role to AIX inventory targets.
+- Use the sanitized inventory under `../inventory/` as a starting point and override defaults per environment.
@@ -0,0 +1,21 @@
---
- name: Apply selected baseline IBM AIX 7 hardening controls
  hosts: aix
  become: true
  gather_facts: true
  roles:
    - role: cis-aix7-hardening
      tags:
        - cis
        - aix7
        - hardening
  post_tasks:
    - name: Show AIX hardening validation summary
      ansible.builtin.debug:
        var: cis_aix_validation_summary
      when: cis_aix_validation_summary is defined
      tags:
        - always
        - postcheck
@@ -0,0 +1,20 @@
---
- name: Apply selected baseline Debian and Ubuntu hardening controls
  hosts: linux
  become: true
  gather_facts: true
  roles:
    - role: cis-debian-ubuntu-hardening
      tags:
        - cis
        - hardening
  post_tasks:
    - name: Show validation summary
      ansible.builtin.debug:
        var: cis_validation_summary
      when: cis_validation_summary is defined
      tags:
        - always
        - postcheck
@@ -0,0 +1,20 @@
---
- name: Apply selected baseline RHEL 9 hardening controls
  hosts: linux
  become: true
  gather_facts: true
  roles:
    - role: cis-rhel9-hardening
      tags:
        - cis
        - hardening
  post_tasks:
    - name: Show validation summary
      ansible.builtin.debug:
        var: cis_validation_summary
      when: cis_validation_summary is defined
      tags:
        - always
        - postcheck
+12 -3
@@ -1,6 +1,6 @@
 # infra-run/ansible/roles
-This folder is reserved for reusable Ansible roles. Roles make it possible to organize configuration logic into predictable, testable units that can be shared across playbooks.
+This folder contains reusable Ansible roles. Roles organize configuration logic into predictable, testable units that can be shared across playbooks.
 ## Diagram
@@ -10,9 +10,18 @@ flowchart TD
 A --> C["monitoring"]
 A --> D["storage"]
 A --> E["security"]
+E --> E1["cis-rhel9-hardening"]
+E --> E2["cis-debian-ubuntu-hardening"]
+E --> E3["cis-aix7-hardening"]
-```
+```
+## Current Roles
+- `cis-rhel9-hardening` - RHEL 9 baseline example with package, service, SSH, sudo, sysctl, audit, logging, filesystem, and validation tasks.
+- `cis-debian-ubuntu-hardening` - Debian 13 and Ubuntu 26.04 baseline example with apt, service, SSH, sudo, sysctl, audit, logging, filesystem, and validation tasks.
+- `cis-aix7-hardening` - IBM AIX 7 baseline example with SSH, sudo, audit, logging, cron, user, password, network, filesystem, service, and validation tasks.
 ## Notes
-- The role layout is not yet populated, but the structure is in place for future automation modules.
-- Keeping a README here documents intent even before role code exists.
+- Each role includes defaults, task includes, handlers where needed, and role-specific README guidance.
+- The hardening content is sanitized for portfolio use and should be reviewed against site policy before live use.
@@ -0,0 +1,67 @@
# cis-aix7-hardening
Operational IBM AIX 7.x hardening role inspired by CIS Benchmark 1.2.0 and common Unix security practices.
Reference: https://www.cisecurity.org/benchmark/aix
This role is intended for infrastructure and security operations teams that manage AIX estates. It favors readable, conservative controls over broad benchmark coverage.
## Supported OS
- IBM AIX 7.x
## Implemented Areas
- Platform prechecks for AIX 7.x, SRC, SSH, audit tooling, required commands, disk safety, and baseline security state.
- SSH daemon hardening in `/etc/ssh/sshd_config` with validation through `sshd -t`.
- Account and password controls through AIX-native `lssec`, `chsec`, and `pwdadm`.
- Network tunable validation and optional hardening through `no`, with optional `nfso` support.
- SRC-aware service checks and safe inetd legacy service disablement.
- Filesystem review for JFS2, world-writable directories, and invalid owners or groups.
- Syslog and audit validation, with audit enablement disabled by default.
- Cron and at permission hardening under `/var/adm/cron`.
- Sudo defaults with validation through `visudo -cf` when sudo is present.
- Postcheck reporting for SSH, services, network values, and password policy.
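The account and password controls can be pictured as a translation from role variables to targeted `chsec` calls. The sketch below builds argument lists only and runs nothing. The variable names match the role defaults, but the mapping to AIX attributes in `/etc/security/user`, the `default` stanza, and the helper name are assumptions that would need checking against the role's actual tasks.

```python
# Hypothetical mapping from role variables to AIX user attributes.
ATTRIBUTE_MAP = {
    "cis_password_minlen": "minlen",
    "cis_password_histsize": "histsize",
    "cis_password_maxage_weeks": "maxage",
    "cis_password_minage_weeks": "minage",
    "cis_password_minalpha": "minalpha",
    "cis_password_minother": "minother",
    "cis_password_maxrepeats": "maxrepeats",
    "cis_login_retries": "loginretries",
}


def build_chsec_commands(defaults, stanza="default", path="/etc/security/user"):
    """Return one `chsec` argv list per configured attribute (nothing is executed)."""
    commands = []
    for variable, attribute in ATTRIBUTE_MAP.items():
        if variable in defaults:
            commands.append([
                "chsec", "-f", path, "-s", stanza,
                "-a", f"{attribute}={defaults[variable]}",
            ])
    return commands
```

Building one command per attribute mirrors the targeted-update approach: each change can be reviewed, logged, and reverted on its own instead of replacing the whole security file.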
## AIX Operational Notes
AIX is not Linux. This role does not assume systemd, sysctl, Linux package managers, or Linux service paths. Service operations use SRC commands such as `lssrc`, `startsrc`, `stopsrc`, and `refresh`.
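Because SRC reports state as text rather than through a unit file or a systemd API, a check has to parse `lssrc` output. A minimal sketch, assuming the usual column layout (`Subsystem  Group  PID  Status`) and only the `active` and `inoperative` states; the function name is hypothetical and other states would need to be added after testing on a real LPAR.

```python
KNOWN_STATES = {"active", "inoperative"}  # assumption: other SRC states are ignored here


def parse_lssrc(output):
    """Map subsystem name to status from `lssrc -s <name>` or `lssrc -a` text."""
    statuses = {}
    for line in output.splitlines():
        fields = line.split()
        # Header and error lines do not end in a known state, so they are skipped.
        if len(fields) >= 2 and fields[-1] in KNOWN_STATES:
            statuses[fields[0]] = fields[-1]
    return statuses
```

Taking the status from the last column matters because an inoperative subsystem has no PID, so the number of columns differs between rows.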
AIX estates vary heavily between sites. Filesystem layout, OpenSSH source, sudo packaging, audit classes, NFS tuning, and security policy ownership should be validated before managed rollout.
## Safety Philosophy
- Defaults are conservative.
- Audit enablement is opt-in with `cis_enable_audit`.
- Filesystem mount option management is opt-in with `cis_manage_mount_options`.
- SSH password authentication is not disabled by default.
- Native AIX security files are updated with targeted `chsec` calls instead of wholesale replacement.
- Check mode is supported where practical, though some command-based tasks still run read-only probes (with `check_mode: false`) to collect validation data.
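The opt-in audit behavior reduces to a three-part condition, mirrored here from the `when:` clause of the role's audit task. The function name is hypothetical; the inputs correspond to `cis_enable_audit`, the `stat` result for the audit config file, and the return code and output of `audit query`.

```python
def should_start_audit(enable_audit, config_exists, query_rc, query_stdout=""):
    """Return True only when audit is opted in, configured, and currently off."""
    audit_is_off = query_rc != 0 or "auditing off" in (query_stdout or "").lower()
    return bool(enable_audit) and bool(config_exists) and audit_is_off
```

With the default `cis_enable_audit: false`, the function returns `False` whatever the probe reports, which is the validation-only path.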
## Check Mode Examples
```bash
ansible-playbook playbooks/cis-aix7-hardening.yml --check
```
```bash
ansible-playbook playbooks/cis-aix7-hardening.yml --check --tags precheck,ssh,postcheck
```
## Tag Examples
```bash
ansible-playbook playbooks/cis-aix7-hardening.yml --tags precheck
```
```bash
ansible-playbook playbooks/cis-aix7-hardening.yml --tags ssh,password_policy,network
```
```bash
ansible-playbook playbooks/cis-aix7-hardening.yml --tags audit -e cis_enable_audit=true
```
## Important Warning
This is not a full compliance certification implementation and does not implement the entire CIS AIX benchmark. It is a practical baseline example that should be reviewed by infrastructure, security, and application owners before managed enforcement.
@@ -0,0 +1,98 @@
---
cis_benchmark_version: "1.2.0"
cis_disable_root_login: true
cis_disable_password_auth: false
cis_enable_network_hardening: true
cis_enable_password_policy: true
cis_enable_audit: false
cis_manage_mount_options: false
cis_ssh_max_auth_tries: 4
cis_ssh_login_grace_time: 60
cis_ssh_client_alive_interval: 300
cis_ssh_client_alive_count_max: 3
cis_ssh_config_path: /etc/ssh/sshd_config
cis_sshd_test_command: sshd -t
cis_min_root_free_mb: 1024
cis_password_minlen: 14
cis_password_histsize: 10
cis_password_maxage_weeks: 12
cis_password_minalpha: 1
cis_password_minother: 1
cis_password_maxrepeats: 2
cis_password_minage_weeks: 1
cis_login_retries: 5
cis_login_lockout: 30
cis_required_commands:
  - lsattr
  - chdev
  - lssrc
  - chsec
  - lssec
  - pwdadm
  - "no"
  - audit
  - cron
cis_ssh_candidate_paths:
  - /usr/sbin/sshd
  - /usr/bin/sshd
  - /opt/freeware/sbin/sshd
  - /opt/freeware/bin/sshd
cis_network_no_settings:
  ipforwarding: "0"
  ipsendredirects: "0"
  ipignoreredirects: "1"
  ipsrcrouteforward: "0"
  clean_partial_conns: "1"
  tcp_pmtu_discover: "0"
cis_network_nfso_settings: {}
cis_legacy_inetd_services:
  - telnet
  - shell
  - login
  - exec
  - comsat
  - talk
  - ntalk
  - tftp
  - uucp
  - finger
cis_src_subsystems:
  - sshd
  - inetd
  - syslogd
  - audit
cis_mount_option_targets:
  - path: /tmp
    options:
      - nosuid
  - path: /var/tmp
    options:
      - nosuid
cis_manage_sudo: true
cis_sudoers_path: /etc/sudoers
cis_sudo_logfile: /var/log/sudo.log
cis_sudo_use_pty: true
cis_cron_allow_path: /var/adm/cron/cron.allow
cis_cron_deny_path: /var/adm/cron/cron.deny
cis_at_allow_path: /var/adm/cron/at.allow
cis_at_deny_path: /var/adm/cron/at.deny
cis_cron_directories:
- /var/adm/cron
- /var/spool/cron
- /var/spool/cron/crontabs
cis_syslog_config_path: /etc/syslog.conf
cis_audit_config_path: /etc/security/audit/config
@@ -0,0 +1,44 @@
---
- name: Validate sshd configuration
  ansible.builtin.command: "{{ cis_sshd_test_command }}"
  changed_when: false
  listen: validate sshd
- name: Restart sshd using SRC
  # No pipeline is used here, so `set -o pipefail` is not needed
  # (and ksh88, the usual AIX /bin/ksh, may reject that option).
  ansible.builtin.shell: |
    if lssrc -s sshd >/dev/null 2>&1; then
      stopsrc -s sshd >/dev/null 2>&1 || true
      startsrc -s sshd
    fi
  args:
    executable: /bin/ksh
  changed_when: true
  listen: restart sshd
- name: Refresh inetd
  ansible.builtin.command: refresh -s inetd
  changed_when: true
  failed_when: false
  listen: refresh inetd
- name: Refresh syslog
  ansible.builtin.command: refresh -s syslogd
  changed_when: true
  failed_when: false
  listen: refresh syslog
- name: Restart audit subsystem
  # Falls back to `audit start` when audit is not registered as an SRC subsystem.
  ansible.builtin.shell: |
    if lssrc -s audit >/dev/null 2>&1; then
      stopsrc -s audit >/dev/null 2>&1 || true
      startsrc -s audit
    else
      audit start
    fi
  args:
    executable: /bin/ksh
  changed_when: true
  when: cis_enable_audit | bool
  listen: restart audit
@@ -0,0 +1,32 @@
---
- name: Validate AIX audit configuration file
  ansible.builtin.stat:
    path: "{{ cis_audit_config_path }}"
  register: cis_aix_audit_config
- name: Collect AIX audit query status
  ansible.builtin.command: audit query
  changed_when: false
  failed_when: false
  check_mode: false
  register: cis_aix_audit_status
- name: Enable AIX audit subsystem when explicitly configured
  ansible.builtin.command: audit start
  changed_when: true
  when:
    - cis_enable_audit | bool
    - cis_aix_audit_config.stat.exists
    - cis_aix_audit_status.rc != 0 or 'auditing off' in (cis_aix_audit_status.stdout | default('') | lower)
  notify: restart audit
- name: Report audit status
  ansible.builtin.debug:
    msg:
      - >-
        {{ 'OK: AIX audit configuration file exists.'
        if cis_aix_audit_config.stat.exists else 'WARNING: AIX audit configuration file was not found.' }}
      - >-
        {{ 'OK: Audit enablement is explicitly allowed by cis_enable_audit.'
        if cis_enable_audit | bool else 'WARNING: Audit enablement is disabled by default; validation only was performed.' }}
      - "OK: audit query rc={{ cis_aix_audit_status.rc }} output={{ cis_aix_audit_status.stdout | default('') }}"
@@ -0,0 +1,49 @@
---
- name: Ensure cron and at control files exist with safe ownership
  ansible.builtin.file:
    path: "{{ item }}"
    state: touch
    owner: root
    group: cron
    mode: "0600"
    modification_time: preserve
    access_time: preserve
  loop:
    - "{{ cis_cron_allow_path }}"
    - "{{ cis_at_allow_path }}"
- name: Ensure deny files are not world readable when present
  ansible.builtin.file:
    path: "{{ item }}"
    owner: root
    group: cron
    mode: "0600"
  loop:
    - "{{ cis_cron_deny_path }}"
    - "{{ cis_at_deny_path }}"
  failed_when: false
- name: Secure cron directories when present
  ansible.builtin.file:
    path: "{{ item }}"
    state: directory
    owner: root
    group: cron
    mode: "0750"
  loop: "{{ cis_cron_directories }}"
  failed_when: false
- name: Validate cron SRC state
  ansible.builtin.command: lssrc -s cron
  changed_when: false
  failed_when: false
  check_mode: false
  register: cis_aix_cron_state
- name: Report cron and at hardening status
  ansible.builtin.debug:
    msg:
      - "OK: cron.allow and at.allow ownership and permissions are managed."
      - >-
        {{ 'OK: cron SRC subsystem exists.'
        if cis_aix_cron_state.rc == 0 else 'WARNING: cron SRC subsystem was not found.' }}
@@ -0,0 +1,60 @@
---
- name: Build mounted filesystem list from gathered facts
ansible.builtin.set_fact:
cis_aix_mount_points: "{{ ansible_mounts | map(attribute='mount') | list }}"
- name: Validate JFS2 filesystems
ansible.builtin.shell: |
set -o pipefail
lsfs -q | awk '/vfs[[:space:]]*=[[:space:]]*jfs2/{print prev} {prev=$0}'
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_jfs2_filesystems
- name: Review configured mount option targets
ansible.builtin.debug:
msg: >-
OK: Mount option management is disabled by default.
Review target {{ item.path }} for options {{ item.options | join(', ') }} before managed rollout.
loop: "{{ cis_mount_option_targets }}"
when: not cis_manage_mount_options | bool
- name: Apply configured mount options only when explicitly enabled
ansible.builtin.command: "chfs -a options={{ item.options | join(',') }} {{ item.path }}"
changed_when: true
loop: "{{ cis_mount_option_targets }}"
when:
- cis_manage_mount_options | bool
- item.path in cis_aix_mount_points
- name: Identify world-writable directories on local filesystems
ansible.builtin.shell: |
set -o pipefail
find / -xdev -type d -perm -0002 -print 2>/dev/null | head -200
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_world_writable_dirs
- name: Identify files without valid owner or group on local filesystems
ansible.builtin.shell: |
set -o pipefail
find / -xdev \( -nouser -o -nogroup \) -print 2>/dev/null | head -200
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_unowned_files
- name: Report filesystem review findings
ansible.builtin.debug:
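msg:
- "OK: JFS2 filesystem review completed."
- >-
{{ 'WARNING: World-writable directories found: ' ~ (cis_aix_world_writable_dirs.stdout_lines | default([]) | join(', '))
if cis_aix_world_writable_dirs.stdout_lines | default([]) | length > 0
else 'OK: No world-writable directories were reported.' }}
- >-
{{ 'WARNING: Files without valid owner/group found: ' ~ (cis_aix_unowned_files.stdout_lines | default([]) | join(', '))
if cis_aix_unowned_files.stdout_lines | default([]) | length > 0
else 'OK: No files without a valid owner or group were reported.' }}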
@@ -0,0 +1,40 @@
---
- name: Collect syslog SRC state
ansible.builtin.command: lssrc -s syslogd
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_syslog_state
- name: Ensure syslog configuration exists
ansible.builtin.stat:
path: "{{ cis_syslog_config_path }}"
register: cis_aix_syslog_config
- name: Start syslogd when installed but inactive
ansible.builtin.command: startsrc -s syslogd
changed_when: true
when:
- cis_aix_syslog_state.rc == 0
- "'active' not in cis_aix_syslog_state.stdout"
- name: Validate syslog configuration has active entries
ansible.builtin.shell: "awk 'NF && $1 !~ /^#/ {found=1} END {exit found ? 0 : 1}' {{ cis_syslog_config_path }}"
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_syslog_has_rules
when: cis_aix_syslog_config.stat.exists
- name: Report logging status
ansible.builtin.debug:
msg:
- >-
{{ 'OK: syslogd SRC subsystem exists.'
if cis_aix_syslog_state.rc == 0 else 'WARNING: syslogd SRC subsystem was not found.' }}
- >-
{{ 'OK: syslog configuration has active rules.'
if cis_aix_syslog_has_rules.rc | default(1) == 0
else 'WARNING: syslog configuration has no active rules or could not be validated.' }}
@@ -0,0 +1,65 @@
---
- name: Run AIX platform safety prechecks
ansible.builtin.import_tasks: precheck.yml
tags:
- always
- precheck
- name: Harden AIX SSH daemon configuration
ansible.builtin.import_tasks: ssh.yml
tags:
- ssh
- name: Apply AIX user account controls
ansible.builtin.import_tasks: users.yml
tags:
- users
- name: Apply AIX password policy controls
ansible.builtin.import_tasks: password_policy.yml
when: cis_enable_password_policy | bool
tags:
- password_policy
- name: Apply AIX network hardening controls
ansible.builtin.import_tasks: network.yml
when: cis_enable_network_hardening | bool
tags:
- network
- name: Manage AIX baseline services
ansible.builtin.import_tasks: services.yml
tags:
- services
- name: Review AIX filesystem controls
ansible.builtin.import_tasks: filesystem.yml
tags:
- filesystem
- name: Validate AIX logging controls
ansible.builtin.import_tasks: logging.yml
tags:
- logging
- name: Validate AIX audit controls
ansible.builtin.import_tasks: audit.yml
tags:
- audit
- name: Harden AIX cron and at controls
ansible.builtin.import_tasks: cron.yml
tags:
- cron
- name: Harden sudo configuration
ansible.builtin.import_tasks: sudo.yml
when: cis_manage_sudo | bool
tags:
- sudo
- name: Run AIX validation postchecks
ansible.builtin.import_tasks: postcheck.yml
tags:
- always
- postcheck
@@ -0,0 +1,65 @@
---
- name: Collect current AIX network tunables
ansible.builtin.command: no -a
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_no_current
- name: Query configured AIX network tunables
ansible.builtin.command: "no -o {{ item.key }}"
changed_when: false
failed_when: false
check_mode: false
loop: "{{ cis_network_no_settings | dict2items }}"
register: cis_aix_no_query
- name: Apply configured AIX network tunables
ansible.builtin.command: "no -p -o {{ item.item.key }}={{ item.item.value }}"
changed_when: true
loop: "{{ cis_aix_no_query.results }}"
when:
- item.rc == 0
- item.stdout is not search('=\\s*' ~ (item.item.value | string) ~ '\\b')
- name: Warn about unsupported AIX network tunables
ansible.builtin.debug:
msg: "WARNING: AIX network tunable {{ item.item.key }} is not supported on this host."
loop: "{{ cis_aix_no_query.results }}"
when: item.rc != 0
- name: Check nfso availability
ansible.builtin.shell: "command -v nfso >/dev/null 2>&1 || whence nfso >/dev/null 2>&1"
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_nfso_available
- name: Query configured AIX NFS tunables
ansible.builtin.command: "nfso -o {{ item.key }}"
changed_when: false
failed_when: false
check_mode: false
loop: "{{ cis_network_nfso_settings | dict2items }}"
register: cis_aix_nfso_query
when:
- cis_aix_nfso_available.rc == 0
- cis_network_nfso_settings | length > 0
- name: Apply configured AIX NFS tunables
ansible.builtin.command: "nfso -p -o {{ item.item.key }}={{ item.item.value }}"
changed_when: true
loop: "{{ cis_aix_nfso_query.results | default([]) }}"
when:
- item.rc | default(1) == 0
- item.stdout is not search('=\\s*' ~ (item.item.value | string) ~ '\\b')
- name: Report network hardening status
ansible.builtin.debug:
msg:
- "OK: AIX network tunables were validated before changes."
- >-
{{ 'OK: nfso is available for optional NFS network tunables.'
if cis_aix_nfso_available.rc == 0 else 'WARNING: nfso was not found; NFS tunables were skipped.' }}
@@ -0,0 +1,66 @@
---
- name: Collect current default password policy
ansible.builtin.command: lssec -f /etc/security/user -s default -a minlen histsize maxage minage minalpha minother maxrepeats loginretries
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_password_policy_current
- name: Collect current default login policy
ansible.builtin.command: lssec -f /etc/security/login.cfg -s usw -a logindisable logininterval loginreenable
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_login_policy_current
- name: Manage default password security attributes
ansible.builtin.command: "chsec -f /etc/security/user -s default -a {{ item.key }}={{ item.value }}"
changed_when: true
loop:
- key: minlen
value: "{{ cis_password_minlen }}"
- key: histsize
value: "{{ cis_password_histsize }}"
- key: maxage
value: "{{ cis_password_maxage_weeks }}"
- key: minage
value: "{{ cis_password_minage_weeks }}"
- key: minalpha
value: "{{ cis_password_minalpha }}"
- key: minother
value: "{{ cis_password_minother }}"
- key: maxrepeats
value: "{{ cis_password_maxrepeats }}"
- key: loginretries
value: "{{ cis_login_retries }}"
when: >-
(item.key ~ '=' ~ (item.value | string))
not in (cis_aix_password_policy_current.stdout | default(''))
- name: Manage login lockout interval
ansible.builtin.command: "chsec -f /etc/security/login.cfg -s usw -a loginreenable={{ cis_login_lockout }}"
changed_when: true
when: >-
('loginreenable=' ~ (cis_login_lockout | string))
not in (cis_aix_login_policy_current.stdout | default(''))
- name: Collect updated default password policy
ansible.builtin.command: lssec -f /etc/security/user -s default -a minlen histsize maxage minage minalpha minother maxrepeats loginretries
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_password_policy_updated
- name: Validate password database state
ansible.builtin.command: pwdadm -q root
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_pwdadm_root
- name: Report password policy status
ansible.builtin.debug:
msg:
- "OK: Password policy managed through AIX chsec defaults, without replacing security files."
- "OK: Current default policy: {{ cis_aix_password_policy_updated.stdout | default('unavailable') }}"
- "OK: pwdadm root status: {{ cis_aix_pwdadm_root.stdout | default('unavailable') }}"
@@ -0,0 +1,58 @@
---
- name: Validate sshd configuration after hardening
ansible.builtin.command: "{{ cis_sshd_test_command }}"
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_post_sshd
- name: Show selected AIX network security values
ansible.builtin.command: "no -o {{ item.key }}"
changed_when: false
failed_when: false
check_mode: false
loop: "{{ cis_network_no_settings | dict2items }}"
register: cis_aix_post_network
- name: Show key SRC service states
ansible.builtin.command: "lssrc -s {{ item }}"
changed_when: false
failed_when: false
check_mode: false
loop:
- sshd
- syslogd
- audit
register: cis_aix_post_services
- name: Show password policy summary
ansible.builtin.command: lssec -f /etc/security/user -s default -a minlen histsize maxage minage minalpha minother maxrepeats loginretries
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_post_password
- name: Build AIX hardening validation summary
ansible.builtin.set_fact:
cis_aix_validation_summary:
oslevel: "{{ cis_aix_oslevel.stdout | default('unavailable') }}"
sshd_config_valid: "{{ cis_aix_post_sshd.rc == 0 }}"
sshd_validation_output: "{{ cis_aix_post_sshd.stderr | default(cis_aix_post_sshd.stdout | default(''), true) }}"
network_values: "{{ cis_aix_post_network.results | map(attribute='stdout') | list }}"
service_states: "{{ cis_aix_post_services.results | map(attribute='stdout') | list }}"
password_policy: "{{ cis_aix_post_password.stdout | default('unavailable') }}"
recommendations:
- "Validate SSH access from a second privileged session before enforcing passwordless-only access."
- "Review audit classes and events with security operations before setting cis_enable_audit=true."
- "Keep cis_manage_mount_options=false until filesystem owners approve remount or chfs behavior."
- name: Print AIX operational postcheck recommendations
ansible.builtin.debug:
msg:
- >-
{{ 'OK: sshd configuration validates.'
if cis_aix_post_sshd.rc == 0 else 'CRITICAL: sshd validation failed; review SSH config before restarting sessions.' }}
- "OK: Service states: {{ cis_aix_validation_summary.service_states }}"
- "OK: Password policy summary: {{ cis_aix_validation_summary.password_policy }}"
- "WARNING: This role is a selected baseline and is not a complete compliance certification implementation."
- "{{ cis_aix_validation_summary.recommendations }}"
@@ -0,0 +1,147 @@
---
- name: Determine root filesystem free space
ansible.builtin.set_fact:
cis_aix_root_mount: "{{ ansible_mounts | selectattr('mount', 'equalto', '/') | list | first | default({}) }}"
- name: Calculate root filesystem free space in MB
ansible.builtin.set_fact:
cis_aix_root_free_mb: "{{ ((cis_aix_root_mount.size_available | default(0) | int) / 1024 / 1024) | round(0, 'floor') | int }}"
- name: Collect AIX maintenance level
ansible.builtin.command: oslevel -s
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_oslevel
- name: Check required AIX commands
ansible.builtin.shell: "command -v {{ item | quote }} >/dev/null 2>&1 || whence {{ item | quote }} >/dev/null 2>&1"
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
loop: "{{ cis_required_commands }}"
register: cis_aix_required_command_checks
- name: Build missing required command list
ansible.builtin.set_fact:
cis_aix_missing_required_commands: >-
{{
cis_aix_required_command_checks.results
| selectattr('rc', 'ne', 0)
| map(attribute='item')
| list
}}
- name: Locate sshd binary
ansible.builtin.stat:
path: "{{ item }}"
loop: "{{ cis_ssh_candidate_paths }}"
register: cis_aix_sshd_path_checks
- name: Store detected sshd binary
ansible.builtin.set_fact:
cis_aix_sshd_path: >-
{{
(
cis_aix_sshd_path_checks.results
| selectattr('stat.exists')
| map(attribute='item')
| list
| first
)
| default('')
}}
- name: Validate SRC subsystem availability
ansible.builtin.command: lssrc -a
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_src_summary
- name: Validate audit subsystem availability
ansible.builtin.command: audit query
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_audit_query
- name: Collect LPAR summary when available
ansible.builtin.shell: "command -v lparstat >/dev/null 2>&1 && lparstat -i || true"
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_lparstat
- name: Collect current network tunable summary
ansible.builtin.command: no -a
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_network_summary
- name: Collect default AIX user security summary
ansible.builtin.command: lssec -f /etc/security/user -s default -a ALL
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_security_user_summary
- name: Report AIX precheck status
ansible.builtin.debug:
msg:
- >-
OK: Facts gathered for {{ ansible_distribution | default(ansible_system | default('unknown')) }}
{{ ansible_distribution_version | default(ansible_kernel | default('unknown')) }}.
- "OK: oslevel -s reports {{ cis_aix_oslevel.stdout | default('unavailable') }}."
- "OK: Root filesystem free space is {{ cis_aix_root_free_mb }} MB."
- >-
{{ 'OK: sshd binary detected at ' ~ cis_aix_sshd_path
if cis_aix_sshd_path | length > 0 else 'CRITICAL: sshd binary was not found in expected AIX paths.' }}
- >-
{{ 'OK: SRC subsystem commands are functional.'
if cis_aix_src_summary.rc == 0 else 'CRITICAL: lssrc failed; SRC is unavailable or not usable.' }}
- >-
{{ 'OK: AIX audit subsystem responded to audit query.'
if cis_aix_audit_query.rc == 0 else 'WARNING: audit query did not complete; audit may be disabled or unconfigured.' }}
- >-
{{ 'OK: Required commands are present.'
if cis_aix_missing_required_commands | length == 0
else 'CRITICAL: Missing required commands: ' ~ (cis_aix_missing_required_commands | join(', ')) }}
- name: Fail when operating system is unsupported
ansible.builtin.assert:
that:
- ansible_system | default(ansible_distribution | default('')) == 'AIX'
- ansible_distribution_version | default('') is match('^7\\.')
fail_msg: >-
CRITICAL: This role supports IBM AIX 7.x only.
Detected {{ ansible_distribution | default(ansible_system | default('unknown')) }}
{{ ansible_distribution_version | default('unknown') }}.
success_msg: "OK: Supported IBM AIX 7.x platform detected."
- name: Fail when root filesystem free space is below safety threshold
ansible.builtin.assert:
that:
- cis_aix_root_free_mb | int >= cis_min_root_free_mb | int
fail_msg: >-
CRITICAL: Root filesystem has {{ cis_aix_root_free_mb }} MB free.
Minimum required free space is {{ cis_min_root_free_mb }} MB.
success_msg: "OK: Root filesystem free space meets the safety threshold."
- name: Fail when critical AIX commands are missing
ansible.builtin.assert:
that:
- cis_aix_missing_required_commands | length == 0
- cis_aix_src_summary.rc == 0
- cis_aix_sshd_path | length > 0
fail_msg: >-
CRITICAL: Required AIX hardening prerequisites are missing.
Missing commands={{ cis_aix_missing_required_commands | join(', ') | default('none', true) }},
SRC rc={{ cis_aix_src_summary.rc }},
sshd={{ cis_aix_sshd_path | default('not found', true) }}.
success_msg: "OK: Critical AIX hardening prerequisites are available."
@@ -0,0 +1,51 @@
---
- name: Collect SRC subsystem states
ansible.builtin.command: "lssrc -s {{ item }}"
changed_when: false
failed_when: false
check_mode: false
loop: "{{ cis_src_subsystems }}"
register: cis_aix_src_service_states
- name: Validate inetd configuration exists
ansible.builtin.stat:
path: /etc/inetd.conf
register: cis_aix_inetd_config
- name: Read inetd configuration
ansible.builtin.slurp:
src: /etc/inetd.conf
register: cis_aix_inetd_conf_content
when: cis_aix_inetd_config.stat.exists
- name: Disable insecure inetd services when present
ansible.builtin.lineinfile:
path: /etc/inetd.conf
regexp: '^(?!#)({{ item }}\s+.*)$'
line: '# \1 # disabled by cis-aix7-hardening'
backrefs: true
backup: true
loop: "{{ cis_legacy_inetd_services }}"
when: cis_aix_inetd_config.stat.exists
notify: refresh inetd
- name: Report inetd configuration status
ansible.builtin.debug:
msg:
- >-
{{ 'OK: /etc/inetd.conf exists and legacy entries were reviewed.'
if cis_aix_inetd_config.stat.exists else 'WARNING: /etc/inetd.conf was not found; inetd review skipped.' }}
- "OK: SRC states collected for {{ cis_src_subsystems | join(', ') }}."
- name: Stop active legacy SRC subsystems when present
ansible.builtin.command: "stopsrc -s {{ item }}"
changed_when: true
failed_when: false
loop:
- routed
- gated
- named
when: >-
cis_aix_src_summary.stdout is defined
and cis_aix_src_summary.stdout is search('(?m)^\\s*' ~ item ~ '\\s.*\\bactive\\s*$')
@@ -0,0 +1,42 @@
---
- name: Ensure sshd configuration exists
ansible.builtin.stat:
path: "{{ cis_ssh_config_path }}"
register: cis_aix_sshd_config
- name: Fail when sshd configuration is missing
ansible.builtin.assert:
that:
- cis_aix_sshd_config.stat.exists
fail_msg: "CRITICAL: {{ cis_ssh_config_path }} was not found; refusing to manage SSH hardening."
success_msg: "OK: {{ cis_ssh_config_path }} exists."
- name: Set sshd validation command from detected binary
ansible.builtin.set_fact:
cis_sshd_test_command: "{{ cis_aix_sshd_path }} -t"
when: cis_aix_sshd_path is defined and cis_aix_sshd_path | length > 0
- name: Apply managed AIX sshd hardening block
ansible.builtin.blockinfile:
path: "{{ cis_ssh_config_path }}"
marker: "# {mark} ANSIBLE MANAGED BLOCK cis-aix7-hardening"
owner: root
group: system
mode: "0600"
backup: true
validate: "{{ cis_sshd_test_command }} -f %s"
block: |
PermitRootLogin {{ 'no' if cis_disable_root_login | bool else 'prohibit-password' }}
PermitEmptyPasswords no
PasswordAuthentication {{ 'no' if cis_disable_password_auth | bool else 'yes' }}
MaxAuthTries {{ cis_ssh_max_auth_tries }}
LoginGraceTime {{ cis_ssh_login_grace_time }}
ClientAliveInterval {{ cis_ssh_client_alive_interval }}
ClientAliveCountMax {{ cis_ssh_client_alive_count_max }}
notify:
- validate sshd
- restart sshd
- name: Validate effective sshd configuration
ansible.builtin.command: "{{ cis_sshd_test_command }}"
changed_when: false
@@ -0,0 +1,50 @@
---
- name: Check sudoers file availability
ansible.builtin.stat:
path: "{{ cis_sudoers_path }}"
register: cis_aix_sudoers
- name: Check visudo availability
ansible.builtin.shell: "command -v visudo >/dev/null 2>&1 || whence visudo >/dev/null 2>&1"
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_visudo_available
- name: Manage sudo use_pty default when supported
ansible.builtin.lineinfile:
path: "{{ cis_sudoers_path }}"
regexp: '^Defaults\s+use_pty\b'
line: "Defaults use_pty"
validate: "visudo -cf %s"
when:
- cis_sudo_use_pty | bool
- cis_aix_sudoers.stat.exists
- cis_aix_visudo_available.rc == 0
- name: Manage sudo logfile default
ansible.builtin.lineinfile:
path: "{{ cis_sudoers_path }}"
regexp: '^Defaults\s+logfile='
line: 'Defaults logfile="{{ cis_sudo_logfile }}"'
validate: "visudo -cf %s"
when:
- cis_aix_sudoers.stat.exists
- cis_aix_visudo_available.rc == 0
- name: Validate sudoers syntax
ansible.builtin.command: "visudo -cf {{ cis_sudoers_path }}"
changed_when: false
when:
- cis_aix_sudoers.stat.exists
- cis_aix_visudo_available.rc == 0
- name: Report sudo hardening status
ansible.builtin.debug:
msg:
- >-
{{ 'OK: sudoers exists and visudo validation is available.'
if cis_aix_sudoers.stat.exists and cis_aix_visudo_available.rc == 0
else 'WARNING: sudo or visudo was not found; sudo controls were skipped.' }}
@@ -0,0 +1,51 @@
---
- name: Collect root account security attributes
ansible.builtin.command: lssec -f /etc/security/user -s root -a account_locked login rlogin su sugroups
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_root_security
- name: Collect accounts with administrative UID
ansible.builtin.shell: "awk -F: '$3 == 0 {print $1}' /etc/passwd"
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_uid_zero_accounts
- name: Report administrative account review
ansible.builtin.debug:
msg:
- >-
{{ 'OK: Only root has UID 0.'
if cis_aix_uid_zero_accounts.stdout_lines | default([]) | length == 1
else 'WARNING: Multiple UID 0 accounts detected: ' ~ (cis_aix_uid_zero_accounts.stdout_lines | default([]) | join(', ')) }}
- "OK: Root security attributes: {{ cis_aix_root_security.stdout | default('unavailable') }}"
- name: Ensure root remote login is disabled when requested
ansible.builtin.command: chsec -f /etc/security/user -s root -a rlogin=false
changed_when: true
when:
- cis_disable_root_login | bool
- "'rlogin=false' not in (cis_aix_root_security.stdout | default(''))"
- name: Collect locked or administratively disabled accounts
ansible.builtin.shell: |
set -o pipefail
awk -F: '{print $1}' /etc/passwd | while read user; do
lsuser -a account_locked "$user" 2>/dev/null
done
args:
executable: /bin/ksh
changed_when: false
failed_when: false
check_mode: false
register: cis_aix_account_lock_summary
- name: Report account lock summary
ansible.builtin.debug:
msg:
- "OK: Collected account lock status for local users."
- "{{ cis_aix_account_lock_summary.stdout_lines | default([]) }}"
@@ -0,0 +1,90 @@
# Debian And Ubuntu Baseline Hardening Role
This role applies a small, conservative set of baseline hardening controls to Debian and Ubuntu servers. It is written to be readable and is meant as a starting point for managed environments, which should still review it locally before rollout.
## Supported OS
- Debian 13 Trixie
- Ubuntu Server 26.04 LTS
Unsupported distributions and versions fail during precheck before hardening tasks run.
## Implemented Areas
- SSH daemon hardening through a managed drop-in and final `sshd -t` validation
- Legacy network package removal
- Installation of `sudo`, plus optional installation and enablement of `auditd`, `chrony`, and `rsyslog`
- Kernel network sysctl hardening
- Basic audit rule examples, disabled by default
- Sudo `use_pty` and optional sudo logfile configuration
- Logging service checks without replacing existing logging configuration
- Filesystem mount option recommendations, disabled by default
## Safety Philosophy
The defaults are intended to be operationally safe:
- Check mode is supported.
- SSH password authentication remains enabled by default.
- Filesystem mount option management is disabled by default.
- Audit rules are not written unless explicitly enabled.
- Services are enabled only when the matching feature is enabled and the service exists.
- Existing logging configuration is not replaced.
This role does not implement the full CIS benchmark and is not a compliance certification implementation.
## Usage
Run in check mode first:
```bash
ansible-playbook playbooks/cis-debian-ubuntu-hardening.yml --check --diff
```
Apply the full baseline:
```bash
ansible-playbook playbooks/cis-debian-ubuntu-hardening.yml
```
Run only selected areas:
```bash
ansible-playbook playbooks/cis-debian-ubuntu-hardening.yml --tags precheck,ssh,postcheck
ansible-playbook playbooks/cis-debian-ubuntu-hardening.yml --tags packages,services
ansible-playbook playbooks/cis-debian-ubuntu-hardening.yml --tags sudo,logging
```
## Key Variables
```yaml
cis_disable_root_login: true
cis_disable_password_auth: false
cis_install_auditd: true
cis_enable_chrony: true
cis_enable_rsyslog: true
cis_remove_legacy_packages: true
cis_enable_sysctl_hardening: true
cis_manage_mount_options: false
cis_manage_audit_rules: false
cis_ssh_max_auth_tries: 4
cis_ssh_login_grace_time: 60
cis_ssh_client_alive_interval: 300
cis_ssh_client_alive_count_max: 3
cis_sudo_use_pty: true
cis_sudo_logfile: /var/log/sudo.log
```
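As a hedged illustration, these defaults could be overridden per inventory group. The sketch below assumes a hypothetical `group_vars/bastion.yml` file and a hypothetical `bastion` group; only the variable names come from this role, and the values are examples rather than recommendations.
```yaml
# group_vars/bastion.yml (hypothetical path and group name)
# Tighten SSH on hosts where key-based access has already been verified.
cis_disable_password_auth: true
cis_ssh_max_auth_tries: 3
cis_ssh_login_grace_time: 30
```
Keep `cis_disable_password_auth` at `false` until key-based login has been confirmed from a second session.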
Enable audit rules only after reviewing the examples:
```yaml
cis_manage_audit_rules: true
```
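A sketch of a reviewed rule list, assuming a local watch on `sshd_config` is wanted; that last rule is an illustration and is not part of the role defaults. A variable override replaces the default `cis_audit_rules` list rather than merging into it, so the default entries are repeated here.
```yaml
cis_manage_audit_rules: true
cis_audit_rules:
  - "-w /etc/passwd -p wa -k identity"
  - "-w /etc/shadow -p wa -k identity"
  - "-w /etc/group -p wa -k identity"
  - "-w /etc/gshadow -p wa -k identity"
  - "-w /etc/sudoers -p wa -k scope"
  - "-w /etc/sudoers.d/ -p wa -k scope"
  # Illustrative addition (assumption): track changes to the SSH daemon configuration.
  - "-w /etc/ssh/sshd_config -p wa -k sshd_config"
```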
Enable mount option persistence only after reviewing each filesystem target:
```yaml
cis_manage_mount_options: true
```
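A minimal sketch of a narrowed target list, assuming only `/tmp` has been approved by its owners and that `noexec` is deliberately left out for now; the choice of target and options is an example, not a recommendation.
```yaml
cis_manage_mount_options: true
cis_mount_option_targets:
  # Only /tmp is managed in this example; other defaults are dropped on purpose.
  - path: /tmp
    options:
      - nodev
      - nosuid
```
The filesystem tasks persist options only for targets that are currently mounted and do not remount them, so the change applies at the next mount or reboot.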
@@ -0,0 +1,90 @@
---
cis_disable_root_login: true
cis_disable_password_auth: false
cis_install_auditd: true
cis_enable_chrony: true
cis_enable_rsyslog: true
cis_remove_legacy_packages: true
cis_enable_sysctl_hardening: true
cis_manage_mount_options: false
cis_manage_audit_rules: false
cis_ssh_max_auth_tries: 4
cis_ssh_login_grace_time: 60
cis_ssh_client_alive_interval: 300
cis_ssh_client_alive_count_max: 3
cis_sudo_use_pty: true
cis_sudo_logfile: /var/log/sudo.log
cis_min_root_free_mb: 1024
cis_supported_debian_major_version: "13"
cis_supported_ubuntu_version: "26.04"
cis_ssh_service_name: ssh
cis_ssh_dropin_path: /etc/ssh/sshd_config.d/50-cis-debian-ubuntu-hardening.conf
cis_ssh_main_config_path: /etc/ssh/sshd_config
cis_hardening_packages:
- chrony
- rsyslog
- sudo
cis_audit_packages:
- auditd
- audispd-plugins
cis_legacy_packages:
- telnet
- rsh-client
- rsh-server
- talk
- talkd
- nis
cis_sysctl_settings:
net.ipv4.ip_forward: 0
net.ipv4.conf.all.send_redirects: 0
net.ipv4.conf.default.send_redirects: 0
net.ipv4.conf.all.accept_source_route: 0
net.ipv4.conf.default.accept_source_route: 0
net.ipv4.conf.all.accept_redirects: 0
net.ipv4.conf.default.accept_redirects: 0
net.ipv4.tcp_syncookies: 1
cis_sysctl_config_file: /etc/sysctl.d/60-cis-debian-ubuntu-hardening.conf
cis_audit_rules_path: /etc/audit/rules.d/50-cis-debian-ubuntu-hardening.rules
cis_audit_rules:
- "-w /etc/passwd -p wa -k identity"
- "-w /etc/shadow -p wa -k identity"
- "-w /etc/group -p wa -k identity"
- "-w /etc/gshadow -p wa -k identity"
- "-w /etc/sudoers -p wa -k scope"
- "-w /etc/sudoers.d/ -p wa -k scope"
cis_sudoers_dropin_path: /etc/sudoers.d/50-cis-debian-ubuntu-hardening
cis_mount_option_targets:
- path: /tmp
options:
- nodev
- nosuid
- noexec
- path: /var/tmp
options:
- nodev
- nosuid
- noexec
- path: /home
options:
- nodev
cis_container_virtualization_types:
- container
- docker
- lxc
- podman
- containerd
- systemd-nspawn
@@ -0,0 +1,30 @@
---
- name: Validate ssh configuration
ansible.builtin.command: sshd -t
changed_when: false
listen: validate ssh
- name: Restart ssh service safely
ansible.builtin.service:
name: "{{ cis_ssh_service_name }}"
state: restarted
listen: restart ssh
- name: Restart auditd
ansible.builtin.service:
name: auditd
state: restarted
use: service
listen: restart auditd
- name: Restart rsyslog
ansible.builtin.service:
name: rsyslog
state: restarted
listen: restart rsyslog
- name: Restart chrony
ansible.builtin.service:
name: chrony
state: restarted
listen: restart chrony
@@ -0,0 +1,39 @@
---
- name: Ensure audit rules directory exists
ansible.builtin.file:
path: /etc/audit/rules.d
state: directory
owner: root
group: root
mode: "0750"
- name: Report audit rules management mode
ansible.builtin.debug:
msg: >-
{{ 'OK: Baseline audit rule management is enabled.'
if cis_manage_audit_rules | bool
else 'WARNING: Audit rules are not managed because cis_manage_audit_rules is false.' }}
- name: Install baseline audit rules when explicitly enabled
ansible.builtin.lineinfile:
path: "{{ cis_audit_rules_path }}"
line: "{{ item }}"
create: true
owner: root
group: root
mode: "0640"
loop: "{{ cis_audit_rules }}"
loop_control:
label: "{{ item }}"
when: cis_manage_audit_rules | bool
notify: restart auditd
- name: Ensure auditd is enabled and running
ansible.builtin.systemd:
name: auditd
enabled: true
state: started
when:
- cis_install_auditd | bool
- "'auditd.service' in ansible_facts.services"
- not cis_container_detected | default(false) | bool
@@ -0,0 +1,36 @@
---
- name: Gather current mount facts
ansible.builtin.set_fact:
cis_current_mount_paths: "{{ ansible_mounts | map(attribute='mount') | list }}"
- name: Report filesystem mount option mode
ansible.builtin.debug:
msg: >-
{{ 'OK: Mount option management is enabled for configured targets.'
if cis_manage_mount_options | bool
else 'WARNING: Mount option management is disabled. No production filesystems will be remounted.' }}
- name: Show configured mount option recommendations
ansible.builtin.debug:
msg: "Review {{ item.path }} for options: {{ item.options | join(',') }}"
loop: "{{ cis_mount_option_targets }}"
loop_control:
label: "{{ item.path }}"
when: not cis_manage_mount_options | bool
- name: Persist configured mount options without remounting
ansible.posix.mount:
path: "{{ item.path }}"
src: "{{ cis_mount_fact.device }}"
fstype: "{{ cis_mount_fact.fstype }}"
state: present
opts: "{{ ((cis_mount_fact.options | default('defaults')).split(',') + item.options) | unique | join(',') }}"
loop: "{{ cis_mount_option_targets }}"
loop_control:
label: "{{ item.path }}"
vars:
cis_mount_fact: "{{ ansible_mounts | selectattr('mount', 'equalto', item.path) | list | first | default({}) }}"
when:
- cis_manage_mount_options | bool
- item.path in cis_current_mount_paths
register: cis_mount_option_results
@@ -0,0 +1,28 @@
---
- name: Ensure rsyslog is installed
ansible.builtin.apt:
name: rsyslog
state: present
update_cache: true
cache_valid_time: 3600
when: cis_enable_rsyslog | bool
- name: Ensure rsyslog is enabled and running
ansible.builtin.systemd:
name: rsyslog
enabled: true
state: started
when:
- cis_enable_rsyslog | bool
- not cis_container_detected | default(false) | bool
- name: Validate journald configuration file presence
ansible.builtin.stat:
path: /etc/systemd/journald.conf
register: cis_journald_conf
- name: Report journald configuration status
ansible.builtin.debug:
msg: >-
{{ 'OK: /etc/systemd/journald.conf is present.'
if cis_journald_conf.stat.exists else 'WARNING: /etc/systemd/journald.conf was not found.' }}
@@ -0,0 +1,54 @@
---
- name: Run platform safety prechecks
ansible.builtin.import_tasks: precheck.yml
tags:
- always
- precheck
- name: Manage packages
ansible.builtin.import_tasks: packages.yml
tags:
- packages
- name: Harden SSH daemon configuration
ansible.builtin.import_tasks: ssh.yml
tags:
- ssh
- name: Apply kernel network hardening
ansible.builtin.import_tasks: sysctl.yml
when: cis_enable_sysctl_hardening | bool
tags:
- sysctl
- name: Manage baseline services
ansible.builtin.import_tasks: services.yml
tags:
- services
- name: Configure Linux audit controls
ansible.builtin.import_tasks: audit.yml
when: cis_install_auditd | bool
tags:
- audit
- name: Configure sudo controls
ansible.builtin.import_tasks: sudo.yml
tags:
- sudo
- name: Configure logging controls
ansible.builtin.import_tasks: logging.yml
tags:
- logging
- name: Review filesystem mount options
ansible.builtin.import_tasks: filesystem.yml
tags:
- filesystem
- name: Run validation postchecks
ansible.builtin.import_tasks: postcheck.yml
tags:
- always
- postcheck
@@ -0,0 +1,48 @@
---
- name: Remove legacy network packages
ansible.builtin.apt:
name: "{{ cis_legacy_packages }}"
state: absent
purge: false
when: cis_remove_legacy_packages | bool
- name: Build enabled hardening package list
ansible.builtin.set_fact:
cis_enabled_hardening_packages: >-
{{
['sudo']
+ (['chrony'] if cis_enable_chrony | bool else [])
+ (['rsyslog'] if cis_enable_rsyslog | bool else [])
}}
- name: Install baseline hardening packages
ansible.builtin.apt:
name: "{{ cis_enabled_hardening_packages }}"
state: present
update_cache: true
cache_valid_time: 3600
- name: Install auditd when enabled
ansible.builtin.apt:
name: auditd
state: present
update_cache: true
cache_valid_time: 3600
when: cis_install_auditd | bool
- name: Install audispd plugins when available
ansible.builtin.apt:
name: audispd-plugins
state: present
update_cache: true
cache_valid_time: 3600
register: cis_audispd_plugins_install
ignore_errors: true
when: cis_install_auditd | bool
- name: Report audispd plugins availability
ansible.builtin.debug:
msg: "WARNING: audispd-plugins was not installed; package may be unavailable for this release."
when:
- cis_install_auditd | bool
- cis_audispd_plugins_install is failed
@@ -0,0 +1,105 @@
---
- name: Validate ssh effective configuration syntax
ansible.builtin.command: sshd -t
register: cis_sshd_validate
changed_when: false
check_mode: false
- name: Read sysctl values for validation
ansible.builtin.command: "sysctl -n {{ item.key }}"
loop: "{{ cis_sysctl_settings | dict2items }}"
loop_control:
label: "{{ item.key }}"
register: cis_sysctl_validation
changed_when: false
failed_when: false
check_mode: false
when:
- cis_enable_sysctl_hardening | bool
- not cis_container_detected | default(false) | bool
- name: Gather installed package facts
ansible.builtin.package_facts:
manager: auto
- name: Gather final service facts
ansible.builtin.service_facts:
- name: Build service state summary
ansible.builtin.set_fact:
cis_service_state_summary:
ssh: "{{ ansible_facts.services['ssh.service'].state | default('not-found') }}"
chrony: "{{ ansible_facts.services['chrony.service'].state | default('not-found') }}"
auditd: "{{ ansible_facts.services['auditd.service'].state | default('not-found') }}"
rsyslog: "{{ ansible_facts.services['rsyslog.service'].state | default('not-found') }}"
- name: Build package validation summary
ansible.builtin.set_fact:
cis_package_validation_summary:
legacy_absent: "{{ cis_legacy_packages | difference(ansible_facts.packages.keys() | list) }}"
hardening_present: >-
{{ (cis_enabled_hardening_packages | default(cis_hardening_packages))
| intersect(ansible_facts.packages.keys() | list) }}
audit_present: "{{ cis_audit_packages | intersect(ansible_facts.packages.keys() | list) }}"
- name: Build sysctl validation summary
ansible.builtin.set_fact:
cis_sysctl_validation_summary: >-
{{ cis_sysctl_validation_summary | default({})
| combine({item.item.key: item.stdout | default('unreadable', true)}) }}
loop: "{{ cis_sysctl_validation.results | default([]) }}"
loop_control:
label: "{{ item.item.key }}"
when:
- cis_enable_sysctl_hardening | bool
- not cis_container_detected | default(false) | bool
- name: Build mount option change summary
ansible.builtin.set_fact:
cis_mount_option_summary: >-
{{
cis_mount_option_results.results
| default([])
| selectattr('changed', 'defined')
| selectattr('changed')
| map(attribute='item.path')
| list
}}
- name: Publish validation summary
ansible.builtin.set_fact:
cis_validation_summary:
benchmark: "selected controls for Debian 13 Trixie and Ubuntu Server 26.04 LTS"
sshd_config: "{{ 'OK' if cis_sshd_validate.rc == 0 else 'CRITICAL' }}"
services: "{{ cis_service_state_summary }}"
packages: "{{ cis_package_validation_summary }}"
sysctl: "{{ cis_sysctl_validation_summary | default({}) }}"
mount_option_updates: "{{ cis_mount_option_summary | default([]) }}"
audit_rules_managed: "{{ cis_manage_audit_rules | bool }}"
applied_controls:
- ssh
- packages
- sysctl
- services
- audit
- sudo
- logging
- filesystem
- name: Show service states
ansible.builtin.debug:
var: cis_service_state_summary
- name: Show package validation
ansible.builtin.debug:
var: cis_package_validation_summary
- name: Show changed mount options
ansible.builtin.debug:
msg: >-
{{ cis_mount_option_summary | default([]) if cis_mount_option_summary | default([]) | length > 0
else 'OK: No mount option changes were applied.' }}
- name: Show applied control summary
ansible.builtin.debug:
var: cis_validation_summary
@@ -0,0 +1,73 @@
---
- name: Determine root filesystem free space
ansible.builtin.set_fact:
cis_root_mount: "{{ ansible_mounts | selectattr('mount', 'equalto', '/') | list | first | default({}) }}"
- name: Calculate root filesystem free space in MB
ansible.builtin.set_fact:
cis_root_free_mb: "{{ ((cis_root_mount.size_available | default(0) | int) / 1024 / 1024) | round(0, 'floor') | int }}"
- name: Detect containerized runtime
ansible.builtin.set_fact:
cis_container_detected: >-
{{
ansible_virtualization_type | default('') in cis_container_virtualization_types
or ansible_env.container | default('') | length > 0
}}
- name: Check for apt
ansible.builtin.stat:
path: /usr/bin/apt-get
register: cis_apt_check
- name: Report platform precheck status
ansible.builtin.debug:
msg:
- "OK: Facts gathered for {{ ansible_distribution }} {{ ansible_distribution_version }}."
- "OK: Root filesystem free space is {{ cis_root_free_mb }} MB."
- >-
{{ 'OK: apt package manager detected.'
if cis_apt_check.stat.exists else 'CRITICAL: apt package manager was not found.' }}
- >-
{{ 'OK: systemd service manager detected.'
if ansible_service_mgr == 'systemd' else 'CRITICAL: systemd service manager is required.' }}
- >-
{{ 'WARNING: Containerized environment detected; service and kernel controls may be limited.'
if cis_container_detected else 'OK: No containerized runtime detected from Ansible facts.' }}
- name: Fail when operating system is unsupported
ansible.builtin.assert:
that:
- >-
(ansible_distribution == 'Debian'
and ansible_distribution_major_version == cis_supported_debian_major_version)
or
(ansible_distribution == 'Ubuntu'
and ansible_distribution_version is version(cis_supported_ubuntu_version, '=='))
fail_msg: >-
CRITICAL: This role supports only Debian 13 / Trixie and Ubuntu Server 26.04 LTS.
Detected {{ ansible_distribution }} {{ ansible_distribution_version }}.
success_msg: "OK: Supported Debian/Ubuntu platform detected."
- name: Fail when systemd is unavailable
ansible.builtin.assert:
that:
- ansible_service_mgr == 'systemd'
fail_msg: "CRITICAL: systemd is required for this operational hardening role."
success_msg: "OK: systemd is available."
- name: Fail when apt is unavailable
ansible.builtin.assert:
that:
- cis_apt_check.stat.exists
fail_msg: "CRITICAL: apt-get is required for this Debian/Ubuntu hardening role."
success_msg: "OK: apt-get is available."
- name: Fail when root filesystem free space is below safety threshold
ansible.builtin.assert:
that:
- cis_root_free_mb | int >= cis_min_root_free_mb | int
fail_msg: >-
CRITICAL: Root filesystem has {{ cis_root_free_mb }} MB free.
Minimum required free space is {{ cis_min_root_free_mb }} MB.
success_msg: "OK: Root filesystem free space meets the safety threshold."
@@ -0,0 +1,30 @@
---
- name: Gather service facts
ansible.builtin.service_facts:
- name: Enable chrony service when present and enabled
ansible.builtin.systemd:
name: chrony
enabled: true
state: started
when:
- cis_enable_chrony | bool
- "'chrony.service' in ansible_facts.services"
- name: Enable rsyslog service when present and enabled
ansible.builtin.systemd:
name: rsyslog
enabled: true
state: started
when:
- cis_enable_rsyslog | bool
- "'rsyslog.service' in ansible_facts.services"
- name: Enable auditd service when present and enabled
ansible.builtin.systemd:
name: auditd
enabled: true
state: started
when:
- cis_install_auditd | bool
- "'auditd.service' in ansible_facts.services"
@@ -0,0 +1,92 @@
---
- name: Ensure sshd drop-in directory exists
ansible.builtin.file:
path: "{{ cis_ssh_dropin_path | dirname }}"
state: directory
owner: root
group: root
mode: "0755"
- name: Ensure sshd hardening drop-in exists
ansible.builtin.file:
path: "{{ cis_ssh_dropin_path }}"
state: touch
owner: root
group: root
mode: "0644"
modification_time: preserve
access_time: preserve
- name: Ensure sshd drop-in directory is included
ansible.builtin.lineinfile:
path: "{{ cis_ssh_main_config_path }}"
regexp: '^Include\s+/etc/ssh/sshd_config\.d/\*\.conf'
line: "Include /etc/ssh/sshd_config.d/*.conf"
insertbefore: BOF
validate: sshd -t -f %s
notify:
- validate ssh
- restart ssh
- name: Configure SSH root login
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^PermitRootLogin\s+'
line: "PermitRootLogin {{ 'no' if cis_disable_root_login | bool else 'prohibit-password' }}"
notify:
- validate ssh
- restart ssh
- name: Configure SSH empty password restriction
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^PermitEmptyPasswords\s+'
line: "PermitEmptyPasswords no"
notify:
- validate ssh
- restart ssh
- name: Configure SSH password authentication
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^PasswordAuthentication\s+'
line: "PasswordAuthentication {{ 'no' if cis_disable_password_auth | bool else 'yes' }}"
notify:
- validate ssh
- restart ssh
- name: Configure SSH MaxAuthTries
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^MaxAuthTries\s+'
line: "MaxAuthTries {{ cis_ssh_max_auth_tries }}"
notify:
- validate ssh
- restart ssh
- name: Configure SSH LoginGraceTime
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^LoginGraceTime\s+'
line: "LoginGraceTime {{ cis_ssh_login_grace_time }}"
notify:
- validate ssh
- restart ssh
- name: Configure SSH ClientAliveInterval
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^ClientAliveInterval\s+'
line: "ClientAliveInterval {{ cis_ssh_client_alive_interval }}"
notify:
- validate ssh
- restart ssh
- name: Configure SSH ClientAliveCountMax
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^ClientAliveCountMax\s+'
line: "ClientAliveCountMax {{ cis_ssh_client_alive_count_max }}"
notify:
- validate ssh
- restart ssh
@@ -0,0 +1,23 @@
---
- name: Build sudo hardening directives
ansible.builtin.set_fact:
cis_sudo_directives: >-
{{
([{'regexp': '^Defaults\s+use_pty', 'line': 'Defaults use_pty'}]
if cis_sudo_use_pty | bool else [])
+ [{'regexp': '^Defaults\s+logfile=', 'line': 'Defaults logfile="' ~ cis_sudo_logfile ~ '"'}]
}}
- name: Configure sudo hardening drop-in
ansible.builtin.lineinfile:
path: "{{ cis_sudoers_dropin_path }}"
regexp: "{{ item.regexp }}"
line: "{{ item.line }}"
create: true
owner: root
group: root
mode: "0440"
validate: /usr/sbin/visudo -cf %s
loop: "{{ cis_sudo_directives }}"
loop_control:
label: "{{ item.line }}"
@@ -0,0 +1,17 @@
---
- name: Apply selected sysctl settings
ansible.posix.sysctl:
name: "{{ item.key }}"
value: "{{ item.value }}"
sysctl_file: "{{ cis_sysctl_config_file }}"
state: present
reload: true
loop: "{{ cis_sysctl_settings | dict2items }}"
loop_control:
label: "{{ item.key }}"
when: not cis_container_detected | default(false) | bool
- name: Report skipped sysctl hardening inside containers
ansible.builtin.debug:
msg: "WARNING: Sysctl hardening skipped because a containerized environment was detected."
when: cis_container_detected | default(false) | bool
@@ -0,0 +1,83 @@
# RHEL 9 Baseline Hardening Role
This role provides a practical baseline hardening example for RHEL 9 and Oracle Linux 9 systems. It is inspired by controls from the CIS Red Hat Enterprise Linux 9 Benchmark, version 2.0.0, but it is intentionally scoped to the common operational controls that infrastructure and security operations teams frequently automate.
This is not a full compliance certification implementation.
## Supported Platforms
- Red Hat Enterprise Linux 9
- Oracle Linux 9
The role fails safely on unsupported operating systems or unsupported major versions.
## Implemented Controls
- SSH daemon hardening for root login, empty passwords, password authentication, retry limits, login grace time, and client keepalive behavior.
- Removal of selected legacy network packages such as telnet, rsh-server, and ypbind.
- Optional installation and enablement of chrony, auditd, and rsyslog.
- Selected IPv4 network sysctl settings.
- Service enablement for chronyd, auditd, and rsyslog.
- Safe disabling of known legacy services when they are present.
- Basic audit backlog and audit rule examples.
- Sudo defaults for `use_pty` and a configurable sudo logfile.
- Rsyslog service validation and journald configuration presence checks.
- Optional filesystem mount option persistence for selected paths.
## Safety Philosophy
The defaults are conservative. The role supports Ansible check mode and avoids destructive live-system behavior by default. Filesystem mount option management is disabled unless `cis_manage_mount_options` is explicitly enabled, and even then the role persists configured options without remounting live filesystems.
Review variables before adapting this role to managed hosts.
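The option merge that runs when `cis_manage_mount_options` is enabled can be previewed by hand before anything is written to `/etc/fstab`. The sketch below is a hypothetical helper (`merge_mount_opts` is not part of the role): it appends each required option to an existing comma-separated option string only when that option is missing, which roughly mirrors the role's union of current and configured options.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not shipped with the role.
# Usage: merge_mount_opts CURRENT_OPTS REQUIRED_OPT...
merge_mount_opts() {
  local merged=$1 opt
  shift
  for opt in "$@"; do
    case ",$merged," in
      *",$opt,"*) ;;                              # already present, keep as-is
      *) merged="${merged:+$merged,}$opt" ;;      # append the missing option
    esac
  done
  printf '%s\n' "$merged"
}

# Preview with a literal option string:
merge_mount_opts 'rw,relatime,nodev' nodev nosuid noexec

# Preview against a live mount (read-only), assuming /tmp is its own mount point:
#   merge_mount_opts "$(findmnt -no OPTIONS /tmp)" nodev nosuid noexec
```

Unlike the role's Jinja expression, this sketch does not remove duplicates that already exist in the current option string.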
## Common Variables
```yaml
cis_disable_root_login: true
cis_disable_password_auth: false
cis_install_auditd: true
cis_enable_chrony: true
cis_enable_rsyslog: true
cis_remove_legacy_packages: true
cis_enable_sysctl_hardening: true
cis_manage_mount_options: false
```
## Check Mode
Run a full safety preview:
```bash
ansible-playbook playbooks/cis-rhel9-hardening.yml --check --diff
```
Run only SSH controls in check mode:
```bash
ansible-playbook playbooks/cis-rhel9-hardening.yml --check --diff --tags ssh
```
## Tags
Useful tags include:
- `precheck`
- `packages`
- `ssh`
- `sysctl`
- `services`
- `audit`
- `sudo`
- `logging`
- `filesystem`
- `postcheck`
Example:
```bash
ansible-playbook playbooks/cis-rhel9-hardening.yml --tags precheck,ssh,postcheck
```
## Rollout Notes
This role is a hardening starting point for internal infrastructure teams. It should be reviewed against local access patterns, break-glass procedures, compliance requirements, monitoring expectations, and host build standards before rollout.
@@ -0,0 +1,80 @@
---
cis_benchmark_version: "2.0.0"
cis_disable_root_login: true
cis_disable_password_auth: false
cis_install_auditd: true
cis_enable_chrony: true
cis_enable_rsyslog: true
cis_remove_legacy_packages: true
cis_enable_sysctl_hardening: true
cis_manage_mount_options: false
cis_ssh_max_auth_tries: 4
cis_ssh_login_grace_time: 60
cis_ssh_client_alive_interval: 300
cis_ssh_client_alive_count_max: 3
cis_ssh_dropin_path: /etc/ssh/sshd_config.d/50-cis-rhel9-hardening.conf
cis_min_root_free_mb: 1024
cis_legacy_packages:
- telnet
- rsh-server
- ypbind
cis_legacy_services:
- telnet.socket
- rsh.socket
- rexec.socket
- rlogin.socket
- ypbind.service
cis_sysctl_settings:
net.ipv4.ip_forward: 0
net.ipv4.conf.all.send_redirects: 0
net.ipv4.conf.default.send_redirects: 0
net.ipv4.conf.all.accept_source_route: 0
net.ipv4.conf.default.accept_source_route: 0
net.ipv4.conf.all.accept_redirects: 0
net.ipv4.conf.default.accept_redirects: 0
net.ipv4.tcp_syncookies: 1
cis_sysctl_config_file: /etc/sysctl.d/60-cis-rhel9-hardening.conf
cis_audit_rules_path: /etc/audit/rules.d/50-cis-rhel9-hardening.rules
cis_audit_backlog_limit: 8192
cis_audit_rules:
- "-w /etc/passwd -p wa -k identity"
- "-w /etc/shadow -p wa -k identity"
- "-w /etc/group -p wa -k identity"
- "-w /etc/gshadow -p wa -k identity"
- "-w /etc/sudoers -p wa -k scope"
- "-w /etc/sudoers.d/ -p wa -k scope"
- "-a always,exit -F arch=b64 -S adjtimex,settimeofday,clock_settime -k time-change"
cis_sudoers_dropin_path: /etc/sudoers.d/50-cis-rhel9-hardening
cis_sudo_logfile: /var/log/sudo.log
cis_mount_option_targets:
- path: /tmp
options:
- nodev
- nosuid
- noexec
- path: /var/tmp
options:
- nodev
- nosuid
- noexec
- path: /home
options:
- nodev
cis_container_virtualization_types:
- container
- docker
- lxc
- podman
- containerd
- systemd-nspawn
@@ -0,0 +1,24 @@
---
- name: Validate sshd configuration
ansible.builtin.command: sshd -t
changed_when: false
listen: validate sshd
- name: Reload sshd
ansible.builtin.service:
name: sshd
state: reloaded
listen: reload sshd
- name: Restart auditd
ansible.builtin.service:
name: auditd
state: restarted
use: service
listen: restart auditd
- name: Restart rsyslog
ansible.builtin.service:
name: rsyslog
state: restarted
listen: restart rsyslog
@@ -0,0 +1,38 @@
---
- name: Ensure audit rules directory exists
ansible.builtin.file:
path: /etc/audit/rules.d
state: directory
owner: root
group: root
mode: "0750"
- name: Configure audit backlog limit
ansible.builtin.lineinfile:
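# /etc/audit/audit.rules is regenerated from rules.d by augenrules, so persist the limit in the drop-in.
path: "{{ cis_audit_rules_path }}"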
regexp: '^-b\s+'
line: "-b {{ cis_audit_backlog_limit }}"
create: true
owner: root
group: root
mode: "0640"
notify: restart auditd
- name: Install baseline audit rules
ansible.builtin.lineinfile:
path: "{{ cis_audit_rules_path }}"
line: "{{ item }}"
create: true
owner: root
group: root
mode: "0640"
loop: "{{ cis_audit_rules }}"
loop_control:
label: "{{ item }}"
notify: restart auditd
- name: Ensure auditd is enabled and running
ansible.builtin.systemd:
name: auditd
enabled: true
state: started
@@ -0,0 +1,36 @@
---
- name: Gather current mount facts
ansible.builtin.set_fact:
cis_current_mount_paths: "{{ ansible_mounts | map(attribute='mount') | list }}"
- name: Report filesystem mount option mode
ansible.builtin.debug:
msg: >-
{{ 'OK: Mount option management is enabled for configured targets.'
if cis_manage_mount_options | bool
else 'WARNING: Mount option management is disabled. No production filesystems will be remounted.' }}
- name: Show configured mount option recommendations
ansible.builtin.debug:
msg: "Review {{ item.path }} for options: {{ item.options | join(',') }}"
loop: "{{ cis_mount_option_targets }}"
loop_control:
label: "{{ item.path }}"
when: not cis_manage_mount_options | bool
- name: Persist configured mount options without remounting
ansible.posix.mount:
path: "{{ item.path }}"
src: "{{ cis_mount_fact.device }}"
fstype: "{{ cis_mount_fact.fstype }}"
state: present
opts: "{{ ((cis_mount_fact.options | default('defaults')).split(',') + item.options) | unique | join(',') }}"
loop: "{{ cis_mount_option_targets }}"
loop_control:
label: "{{ item.path }}"
vars:
cis_mount_fact: "{{ ansible_mounts | selectattr('mount', 'equalto', item.path) | list | first | default({}) }}"
when:
- cis_manage_mount_options | bool
- item.path in cis_current_mount_paths
register: cis_mount_option_results
@@ -0,0 +1,24 @@
---
- name: Ensure rsyslog is installed
ansible.builtin.package:
name: rsyslog
state: present
when: cis_enable_rsyslog | bool
- name: Ensure rsyslog is enabled and running
ansible.builtin.systemd:
name: rsyslog
enabled: true
state: started
when: cis_enable_rsyslog | bool
- name: Validate journald configuration file presence
ansible.builtin.stat:
path: /etc/systemd/journald.conf
register: cis_journald_conf
- name: Report journald configuration status
ansible.builtin.debug:
msg: >-
{{ 'OK: /etc/systemd/journald.conf is present.'
if cis_journald_conf.stat.exists else 'WARNING: /etc/systemd/journald.conf was not found.' }}
@@ -0,0 +1,54 @@
---
- name: Run platform safety prechecks
ansible.builtin.import_tasks: precheck.yml
tags:
- always
- precheck
- name: Manage packages
ansible.builtin.import_tasks: packages.yml
tags:
- packages
- name: Harden SSH daemon configuration
ansible.builtin.import_tasks: ssh.yml
tags:
- ssh
- name: Apply kernel network hardening
ansible.builtin.import_tasks: sysctl.yml
when: cis_enable_sysctl_hardening | bool
tags:
- sysctl
- name: Manage baseline services
ansible.builtin.import_tasks: services.yml
tags:
- services
- name: Configure Linux audit controls
ansible.builtin.import_tasks: audit.yml
when: cis_install_auditd | bool
tags:
- audit
- name: Configure sudo controls
ansible.builtin.import_tasks: sudo.yml
tags:
- sudo
- name: Configure logging controls
ansible.builtin.import_tasks: logging.yml
tags:
- logging
- name: Review filesystem mount options
ansible.builtin.import_tasks: filesystem.yml
tags:
- filesystem
- name: Run validation postchecks
ansible.builtin.import_tasks: postcheck.yml
tags:
- always
- postcheck
@@ -0,0 +1,24 @@
---
- name: Remove legacy network packages
ansible.builtin.package:
name: "{{ cis_legacy_packages }}"
state: absent
when: cis_remove_legacy_packages | bool
- name: Install chrony when enabled
ansible.builtin.package:
name: chrony
state: present
when: cis_enable_chrony | bool
- name: Install auditd when enabled
ansible.builtin.package:
name: audit
state: present
when: cis_install_auditd | bool
- name: Install rsyslog when enabled
ansible.builtin.package:
name: rsyslog
state: present
when: cis_enable_rsyslog | bool
@@ -0,0 +1,81 @@
---
- name: Validate sshd effective configuration syntax
ansible.builtin.command: sshd -t
register: cis_sshd_validate
changed_when: false
check_mode: false
- name: Read sysctl values for validation
ansible.builtin.command: "sysctl -n {{ item.key }}"
loop: "{{ cis_sysctl_settings | dict2items }}"
loop_control:
label: "{{ item.key }}"
register: cis_sysctl_validation
changed_when: false
failed_when: false
check_mode: false
when: cis_enable_sysctl_hardening | bool
- name: Gather final service facts
ansible.builtin.service_facts:
- name: Build service state summary
ansible.builtin.set_fact:
cis_service_state_summary:
chronyd: "{{ ansible_facts.services['chronyd.service'].state | default('not-found') }}"
auditd: "{{ ansible_facts.services['auditd.service'].state | default('not-found') }}"
rsyslog: "{{ ansible_facts.services['rsyslog.service'].state | default('not-found') }}"
- name: Build sysctl validation summary
ansible.builtin.set_fact:
cis_sysctl_validation_summary: >-
{{ cis_sysctl_validation_summary | default({})
| combine({item.item.key: item.stdout | default('unreadable', true)}) }}
loop: "{{ cis_sysctl_validation.results | default([]) }}"
loop_control:
label: "{{ item.item.key }}"
when: cis_enable_sysctl_hardening | bool
- name: Build mount option change summary
ansible.builtin.set_fact:
cis_mount_option_summary: >-
{{
cis_mount_option_results.results
| default([])
| selectattr('changed', 'defined')
| selectattr('changed')
| map(attribute='item.path')
| list
}}
- name: Publish validation summary
ansible.builtin.set_fact:
cis_validation_summary:
benchmark: "CIS RHEL 9 Benchmark {{ cis_benchmark_version }} inspired controls"
sshd_config: "{{ 'OK' if cis_sshd_validate.rc == 0 else 'CRITICAL' }}"
services: "{{ cis_service_state_summary }}"
sysctl: "{{ cis_sysctl_validation_summary | default({}) }}"
mount_option_updates: "{{ cis_mount_option_summary | default([]) }}"
applied_controls:
- ssh
- packages
- sysctl
- services
- audit
- sudo
- logging
- filesystem
- name: Show service states
ansible.builtin.debug:
var: cis_service_state_summary
- name: Show changed mount options
ansible.builtin.debug:
msg: >-
{{ cis_mount_option_summary | default([]) if cis_mount_option_summary | default([]) | length > 0
else 'OK: No mount option changes were applied.' }}
- name: Show applied control summary
ansible.builtin.debug:
var: cis_validation_summary
@@ -0,0 +1,54 @@
---
- name: Determine root filesystem free space
ansible.builtin.set_fact:
cis_root_mount: "{{ ansible_mounts | selectattr('mount', 'equalto', '/') | list | first | default({}) }}"
- name: Calculate root filesystem free space in MB
ansible.builtin.set_fact:
cis_root_free_mb: "{{ ((cis_root_mount.size_available | default(0) | int) / 1024 / 1024) | round(0, 'floor') | int }}"
- name: Detect containerized runtime
ansible.builtin.set_fact:
cis_container_detected: >-
{{
ansible_virtualization_type | default('') in cis_container_virtualization_types
or ansible_env.container | default('') | length > 0
}}
- name: Report platform precheck status
ansible.builtin.debug:
msg:
- "OK: Facts gathered for {{ ansible_distribution }} {{ ansible_distribution_version }}."
- "OK: Root filesystem free space is {{ cis_root_free_mb }} MB."
- >-
{{ 'WARNING: Containerized environment detected; service and kernel controls may be limited.'
if cis_container_detected else 'OK: No containerized runtime detected from Ansible facts.' }}
- >-
{{ 'OK: systemd service manager detected.'
if ansible_service_mgr == 'systemd' else 'CRITICAL: systemd service manager is required.' }}
- name: Fail when operating system is unsupported
ansible.builtin.assert:
that:
- ansible_distribution in cis_supported_distributions
- ansible_distribution_major_version == cis_supported_major_version
fail_msg: >-
CRITICAL: This role supports only RHEL 9 / Oracle Linux 9 compatible systems.
Detected {{ ansible_distribution }} {{ ansible_distribution_version }}.
success_msg: "OK: Supported RHEL 9 compatible platform detected."
- name: Fail when systemd is unavailable
ansible.builtin.assert:
that:
- ansible_service_mgr == 'systemd'
fail_msg: "CRITICAL: systemd is required for this operational hardening role."
success_msg: "OK: systemd is available."
- name: Fail when root filesystem free space is below safety threshold
ansible.builtin.assert:
that:
- cis_root_free_mb | int >= cis_min_root_free_mb | int
fail_msg: >-
CRITICAL: Root filesystem has {{ cis_root_free_mb }} MB free.
Minimum required free space is {{ cis_min_root_free_mb }} MB.
success_msg: "OK: Root filesystem free space meets the safety threshold."
@@ -0,0 +1,36 @@
---
- name: Enable chronyd service
ansible.builtin.systemd:
name: chronyd
enabled: true
state: started
when: cis_enable_chrony | bool
- name: Enable rsyslog service
ansible.builtin.systemd:
name: rsyslog
enabled: true
state: started
when: cis_enable_rsyslog | bool
- name: Enable auditd service
ansible.builtin.systemd:
name: auditd
enabled: true
state: started
when: cis_install_auditd | bool
- name: Gather service facts
ansible.builtin.service_facts:
- name: Disable unnecessary legacy services when present
ansible.builtin.systemd:
name: "{{ item }}"
enabled: false
state: stopped
loop: "{{ cis_legacy_services }}"
loop_control:
label: "{{ item }}"
when:
- cis_remove_legacy_packages | bool
- item in ansible_facts.services
@@ -0,0 +1,81 @@
---
- name: Ensure sshd drop-in directory exists
ansible.builtin.file:
path: "{{ cis_ssh_dropin_path | dirname }}"
state: directory
owner: root
group: root
mode: "0755"
- name: Ensure sshd hardening drop-in exists
ansible.builtin.file:
path: "{{ cis_ssh_dropin_path }}"
state: touch
owner: root
group: root
mode: "0644"
modification_time: preserve
access_time: preserve
- name: Configure SSH root login
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^PermitRootLogin\s+'
line: "PermitRootLogin {{ 'no' if cis_disable_root_login | bool else 'prohibit-password' }}"
notify:
- validate sshd
- reload sshd
- name: Configure SSH empty password restriction
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^PermitEmptyPasswords\s+'
line: "PermitEmptyPasswords no"
notify:
- validate sshd
- reload sshd
- name: Configure SSH password authentication
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^PasswordAuthentication\s+'
line: "PasswordAuthentication {{ 'no' if cis_disable_password_auth | bool else 'yes' }}"
notify:
- validate sshd
- reload sshd
- name: Configure SSH MaxAuthTries
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^MaxAuthTries\s+'
line: "MaxAuthTries {{ cis_ssh_max_auth_tries }}"
notify:
- validate sshd
- reload sshd
- name: Configure SSH LoginGraceTime
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^LoginGraceTime\s+'
line: "LoginGraceTime {{ cis_ssh_login_grace_time }}"
notify:
- validate sshd
- reload sshd
- name: Configure SSH ClientAliveInterval
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^ClientAliveInterval\s+'
line: "ClientAliveInterval {{ cis_ssh_client_alive_interval }}"
notify:
- validate sshd
- reload sshd
- name: Configure SSH ClientAliveCountMax
ansible.builtin.lineinfile:
path: "{{ cis_ssh_dropin_path }}"
regexp: '^ClientAliveCountMax\s+'
line: "ClientAliveCountMax {{ cis_ssh_client_alive_count_max }}"
notify:
- validate sshd
- reload sshd
@@ -0,0 +1,18 @@
---
- name: Configure sudo hardening drop-in
ansible.builtin.lineinfile:
path: "{{ cis_sudoers_dropin_path }}"
regexp: "{{ item.regexp }}"
line: "{{ item.line }}"
create: true
owner: root
group: root
mode: "0440"
validate: /usr/sbin/visudo -cf %s
loop:
- regexp: '^Defaults\s+use_pty'
line: "Defaults use_pty"
- regexp: '^Defaults\s+logfile='
line: 'Defaults logfile="{{ cis_sudo_logfile }}"'
loop_control:
label: "{{ item.line }}"
@@ -0,0 +1,11 @@
---
- name: Apply selected sysctl settings
ansible.posix.sysctl:
name: "{{ item.key }}"
value: "{{ item.value }}"
sysctl_file: "{{ cis_sysctl_config_file }}"
state: present
reload: true
loop: "{{ cis_sysctl_settings | dict2items }}"
loop_control:
label: "{{ item.key }}"
@@ -0,0 +1,6 @@
---
cis_supported_distributions:
- RedHat
- OracleLinux
cis_supported_major_version: "9"
+7 -14
@@ -1,17 +1,10 @@
-# infra-run/docs
+# docs
-This directory is intended for supporting technical documentation tied to the operational tooling in `infra-run`. It is the natural home for implementation notes, architecture writeups, and operational reference material.
+Planned area for longer technical notes.
-## Diagram
-```mermaid
-flowchart TD
-A["docs"] --> B["Architecture notes"]
-A --> C["Operational references"]
-A --> D["Change preparation notes"]
-```
-## Notes
-- The folder currently contains only a placeholder file.
-- It complements `runbooks` by focusing on reference material rather than step-by-step execution flows.
+Current documentation lives in the project README files plus:
+- [SOURCE.md](../SOURCE.md)
+- [TESTED.md](../TESTED.md)
+- [KNOWN_LIMITATIONS.md](../KNOWN_LIMITATIONS.md)
+- [ROADMAP.md](../ROADMAP.md)
+857
@@ -0,0 +1,857 @@
# Production Operations Cheatsheet
Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.
## Linux / Unix Daily Operations
### Uptime and Host State
Check host age, kernel, clock, and recent reboot history before touching anything:
```bash
uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20
```
Pre-check pattern:
```bash
date -u
uptime
df -h
free -m
systemctl --failed
```
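To make "record pre-change state" repeatable, the pre-check commands can be written to a timestamped file. This is a minimal sketch under stated assumptions: `capture_state`, the `prechange-` file prefix, and the example output directory are all hypothetical names, and a failing or missing command is logged rather than treated as fatal.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run each command and append its output to one file.
# Usage: capture_state OUTPUT_DIR 'command one' 'command two' ...
capture_state() {
  local outdir=$1 stamp outfile cmd
  shift
  stamp=$(date -u +%Y%m%dT%H%M%SZ)
  outfile="$outdir/prechange-$stamp.txt"
  mkdir -p "$outdir"
  for cmd in "$@"; do
    {
      printf '### %s\n' "$cmd"
      # Record a non-zero exit status instead of aborting the capture.
      bash -c "$cmd" 2>&1 || printf '(exit status %d)\n' "$?"
      printf '\n'
    } >> "$outfile"
  done
  printf '%s\n' "$outfile"   # print the path so it can be attached to the change record
}

# Example (the directory name is an assumption):
#   capture_state /var/tmp/change-evidence 'date -u' uptime 'df -h' 'free -m' 'systemctl --failed'
```

Running the same call after the change gives a second file that can be compared with `diff`.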
### Process Management
```bash
ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid> # DANGEROUS: last resort only
```
Validation:
```bash
ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager
```
### systemctl
```bash
systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service> # impact: confirm the service interruption policy first
```
### journalctl
```bash
journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage
```
### Service Troubleshooting Flow
1. Confirm service state and recent restart count.
2. Read the last 100-200 journal lines.
3. Validate config syntax before restart if the daemon supports it.
4. Check dependent ports, mounts, credentials, and name resolution.
5. Restart only after cause is understood or rollback exists.
Example:
```bash
systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep -E ':(80|443)[[:space:]]'
curl -kI https://127.0.0.1/
```
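Step 3 of the flow (validate config syntax before restart) can be enforced with a small guard. The sketch below assumes a hypothetical function name, `validate_then_run`, and takes both the validation command and the follow-up action as strings. The action runs only when validation exits with status 0.

```bash
#!/usr/bin/env bash
# Hypothetical guard: run ACTION_CMD only if VALIDATE_CMD succeeds.
# Usage: validate_then_run 'VALIDATE_CMD' 'ACTION_CMD'
validate_then_run() {
  local validate_cmd=$1 action_cmd=$2
  if bash -c "$validate_cmd"; then
    bash -c "$action_cmd"
  else
    echo "validation failed; skipped: $action_cmd" >&2
    return 1
  fi
}

# Example for the nginx case above (not executed here):
#   validate_then_run 'nginx -t' 'systemctl restart nginx'
```

The guard only checks syntax. It does not replace the dependency checks in step 4 or the rollback requirement in step 5.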
### CPU and Memory Diagnostics
```bash
uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5
```
Quick interpretation:
- high `%wa`: storage path or NFS issue
- high run queue with low CPU idle: CPU contention
- swap growth plus page scans: memory pressure
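The interpretation rules above can be applied mechanically to `vmstat` output. The sketch below is illustrative only: the function name `interpret_vmstat` and the thresholds (average `wa` above 20, idle below 10, any swap-in or swap-out) are assumptions, not values from this cheatsheet. It expects the standard procps column order, where `si`, `so`, `id`, and `wa` are columns 7, 8, 15, and 16.

```bash
# Hypothetical helper: reads `vmstat 1 5` style output on stdin.
# Argument 1 is the CPU count used to judge the run queue (default 1).
interpret_vmstat() {
  local cpus="${1:-1}"
  awk -v cpus="$cpus" '
    $1 ~ /^[0-9]+$/ {                 # data rows only; both header lines are skipped
      n++; r += $1; si += $7; so += $8; id += $15; wa += $16
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      if (wa / n > 20)                 print "high iowait: check storage path or NFS"
      if (r / n > cpus && id / n < 10) print "CPU contention: run queue exceeds CPU count"
      if (si + so > 0)                 print "swap activity: possible memory pressure"
    }'
}
```

Typical use is `vmstat 1 5 | interpret_vmstat "$(nproc)"`. The first `vmstat` row is an average since boot, so a longer sample gives a more honest picture.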
### Disk Usage
```bash
df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1
```
### Inode Exhaustion
```bash
df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l
```
### Mounts
```bash
mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a # can expose bad fstab entries; use in change window
```
### Permissions
```bash
namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file
```
### SELinux
State and mode:
```bash
getenforce
sestatus
cat /etc/selinux/config
```
Check file, process, and port context:
```bash
ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http
```
Audit and denial review:
```bash
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log
```
Typical flow:
1. Confirm whether SELinux is `Enforcing`, `Permissive`, or `Disabled`.
2. Identify the failing path, process domain, and target context.
3. Read AVC denials before changing labels or booleans.
4. Prefer persistent policy-aligned fixes over `chcon`.
5. Restore default labels and retest service path.
Modify and restore context:
```bash
chcon -t httpd_sys_content_t /srv/app/index.html # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt
```
Booleans and validation:
```bash
getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z
```
Notes:
- prefer `semanage fcontext` plus `restorecon` for persistent fixes
- use `chcon` only as a short-lived diagnostic or emergency workaround
- avoid generating local policy modules from `audit2allow` until root cause is understood
- after context changes, validate service startup, AVC silence, and application path access
### Archives
```bash
tar tf backup.tar | head
tar czf logs-$(date +%F).tgz /var/log/app
tar xzf bundle.tgz -C /restore/path
gzip -t file.gz
```
### File Operations
```bash
cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/ # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file
```
## Text Processing & Regex
### Core Tools
```bash
grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status --no-pager < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
```
### Regex Reference
```text
IPv4 \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level \b(?:ERROR|WARN|INFO)\b
Failed SSH Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail ^(changed|fatal|failed):\s+\[[^]]+\]
```
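The IPv4 pattern above checks shape only, so a string such as `999.1.1.1` also matches. Where that matters, a range check per octet can follow the regex. A minimal sketch, assuming a hypothetical function name `is_ipv4`:

```bash
# Hypothetical helper: succeed only for dotted quads whose octets are 0-255.
is_ipv4() {
  local ip="$1" octet
  [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  local IFS=.                           # split the address on dots
  for octet in $ip; do
    (( 10#$octet <= 255 )) || return 1  # 10# forces base 10, so "008" is not read as octal
  done
}
```

For example, `grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | while read -r ip; do is_ipv4 "$ip" && echo "$ip"; done` keeps only plausible addresses.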
### Log Parsing Examples
IP extraction:
```bash
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
```
Timestamp filter:
```bash
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
```
UUID extraction:
```bash
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
```
ERROR/WARN/INFO parsing:
```bash
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
```
Failed SSH login parsing:
```bash
grep 'Failed password' /var/log/secure \
| awk '{print $(NF-5),$(NF-3)}' \
| sort | uniq -c | sort -nr | head
```
Extract fields from logs:
```bash
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
```
Filter Ansible output:
```bash
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log
```
## Incident Response
### Disk Full
Workflow:
```bash
df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage
```
Typical branches:
- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
- inode full: remove file storms, spool buildup, temp-file leaks
- deleted open files: restart offender only after sizing impact
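For the deleted-open-files branch, sizing the impact means knowing how much space each holder would release. The sketch below totals `lsof +L1` output per process; it assumes the usual column layout (`COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME`), and the function name `deleted_open_bytes` is hypothetical.

```bash
# Hypothetical helper: reads `lsof +L1` output on stdin and prints GiB held per process.
deleted_open_bytes() {
  awk 'NR > 1 { bytes[$1 " pid=" $2] += $7 }   # $7 is SIZE/OFF; offsets like 0t123 count as 0
       END    { for (k in bytes) printf "%s %.1f GiB\n", k, bytes[k] / (1024 * 1024 * 1024) }' \
    | LC_ALL=C sort -k3,3nr
}
```

Run it as `lsof +L1 | deleted_open_bytes`. The largest holder appears first, which helps decide whether a restart is worth the service impact.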
Post-check:
```bash
df -hT
df -ih
systemctl --failed
```
### High CPU
```bash
uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
```
Flow:
1. Confirm sustained load, not a short spike.
2. Separate user CPU vs system CPU vs I/O wait.
3. Identify hot process and hot threads.
4. Correlate with deploys, cron, backups, or JVM GC.
5. Throttle, stop, or fail over only with service impact understood.
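Step 1 (sustained load versus a short spike) can be judged by comparing the three load averages against the CPU count. A minimal sketch, assuming a hypothetical function `load_ratio` that takes the contents of `/proc/loadavg` and a CPU count:

```bash
# Hypothetical helper: express 1m/5m/15m load as a percentage of the CPU count.
load_ratio() {
  local loadavg="$1" cpus="$2"
  (( cpus > 0 )) || return 2            # avoid dividing by zero
  awk -v cpus="$cpus" '{
    printf "1m=%d%% 5m=%d%% 15m=%d%%\n", $1 * 100 / cpus, $2 * 100 / cpus, $3 * 100 / cpus
  }' <<<"$loadavg"
}
```

Use it as `load_ratio "$(cat /proc/loadavg)" "$(nproc)"`. A 1-minute value far above the 15-minute value suggests a spike; all three near or above 100% suggests sustained saturation.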
### Memory Pressure
```bash
free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'
```
Flow:
1. Check swap growth and page scan rates.
2. Identify top RSS owners.
3. Check kernel logs for OOM.
4. Validate cache vs real process growth.
5. Restart leaking service only after capturing evidence.
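For step 4, `MemAvailable` in `/proc/meminfo` already discounts reclaimable cache, so it separates cache growth from real process growth better than the `used` column alone. A minimal sketch, assuming a hypothetical function name `mem_used_pct` and a kernel that exposes `MemAvailable`:

```bash
# Hypothetical helper: reads /proc/meminfo format on stdin and prints
# the percentage of memory that is not available to new workloads.
mem_used_pct() {
  awk '/^MemTotal:/     { total = $2 }
       /^MemAvailable:/ { avail = $2 }
       END {
         if (total <= 0) exit 1
         printf "%d\n", (total - avail) * 100 / total
       }'
}
```

Run it as `mem_used_pct < /proc/meminfo`. A high value here, combined with rising RSS for one process, points at real growth rather than page cache.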
### Failed Service
```bash
systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
```
Flow:
1. Validate config.
2. Validate credentials, ports, mounts, permissions.
3. Confirm dependency availability.
4. Restart and recheck logs immediately.
### SELinux Denials
Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.
Triage:
```bash
getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <service>
ls -lZ /path/to/app /path/to/app/*
```
Flow:
1. Confirm the failure is current and reproducible.
2. Identify the denied process domain, target path, and requested access from AVC logs.
3. Validate expected default context with `matchpathcon`.
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
5. Apply the smallest persistent fix, then retest in `Enforcing`.
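Step 2 is easier when repeated denials are collapsed into one line per process domain, target context, and access. The sketch below assumes the common raw AVC record layout (`denied { perm } ... comm="..." ... scontext=... tcontext=... tclass=...`); the function name `summarize_avc` is hypothetical, and records in another layout are skipped.

```bash
# Hypothetical helper: reads AVC records on stdin and prints
# "count comm scontext -> tcontext [tclass: permissions]", most frequent first.
summarize_avc() {
  sed -nE 's/.*denied[[:space:]]+\{ ([^}]*) \}.* comm="([^"]+)".* scontext=([^ ]+) tcontext=([^ ]+) tclass=([^ ]+).*/\2 \3 -> \4 [\5: \1]/p' \
    | sort | uniq -c | sort -nr
}
```

Typical use is `ausearch -m AVC -ts recent --raw | summarize_avc`. The output is investigation material only; it does not decide which fix is correct.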
Common fixes:
```bash
matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on
```
Validation:
```bash
getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health
```
Operational notes:
- do not leave systems in `Permissive` as the fix
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
- treat `audit2allow` output as investigation material, not automatic remediation
- if policy changes are unavoidable, document exact AVC evidence and rollback path
### SSL Issues
```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
curl -vkI https://host/
```
Check for:
- expired certificate
- missing SAN
- incomplete chain
- hostname mismatch
- TLS version or cipher mismatch
### DNS Issues
```bash
dig +short app.example.com
dig @<resolver> app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status
```
Flow:
1. Compare resolver result with authoritative result.
2. Check TTL and stale cache.
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
4. Test from affected host and unaffected host.
### Network Issues
```bash
ip addr
ip route
ss -tulpen
tcpdump -ni any host <peer> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host
```
Flow:
1. Interface/link state.
2. Route and source IP selection.
3. Listening socket on target.
4. Firewall and security controls.
5. Packet capture if app logs are inconclusive.
### JVM / Tomcat Issues
```bash
ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out
```
Focus:
- stuck threads
- full GC loops
- heap exhaustion
- connector bind failures
- slow backend dependency
### Certificate Expiration
```bash
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -enddate
openssl x509 -checkend 2592000 -noout -in cert.pem
```
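`-checkend` gives a yes/no answer; for reports it is often clearer to state the number of days left. A minimal sketch, assuming GNU `date` (for `-d`) and a hypothetical function name `cert_days_left` that takes the `notAfter=` line printed by `openssl x509 -enddate`:

```bash
# Hypothetical helper: print whole days until the certificate's notAfter date.
# Argument 2 (current epoch seconds) is optional and exists to make testing easy.
cert_days_left() {
  local not_after="${1#notAfter=}" now="${2:-$(date +%s)}" expiry
  expiry="$(date -d "$not_after" +%s)" || return 2   # GNU date parses the OpenSSL format
  echo $(( (expiry - now) / 86400 ))
}
```

For a local file: `cert_days_left "$(openssl x509 -noout -enddate -in cert.pem)"`. A negative result means the certificate has already expired.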
### Suspicious Login Attempts
```bash
last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent
```
Workflow:
1. Identify source IPs and usernames.
2. Validate whether attempts are expected from bastions/scanners.
3. Check successful logins from same sources.
4. Review sudo usage and persistence changes.
5. Preserve logs before cleanup or rotation.
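Step 3 amounts to intersecting two sets: sources with failed attempts and sources with accepted logins. A minimal sketch, assuming sshd-style log lines with IPv4 sources; the function names `src_ips` and `ips_failed_then_accepted` are hypothetical.

```bash
# Hypothetical helpers: list source IPs that appear in both failed and accepted logins.
src_ips() {   # args: fixed-string pattern, log file
  grep -F -- "$1" "$2" | grep -oE 'from ([0-9]{1,3}\.){3}[0-9]{1,3}' | cut -d' ' -f2 | sort -u
}

ips_failed_then_accepted() {
  local log="$1"
  comm -12 <(src_ips 'Failed password' "$log") <(src_ips 'Accepted ' "$log")
}
```

Run it as `ips_failed_then_accepted /var/log/secure`. Any address it prints deserves a closer look at the matching `Accepted` lines and at later sudo activity.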
## Networking Operations
```bash
ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null
```
## Storage Operations
### Block and Filesystem Discovery
```bash
lsblk -f
blkid
findmnt
cat /proc/partitions
multipath -ll
```
### LVM
```bash
pvs
vgs
lvs -a -o +devices
pvdisplay /dev/sdX
vgdisplay <vg>
lvdisplay /dev/<vg>/<lv>
```
Growth example:
```bash
pvcreate /dev/mapper/mpatha # impact: writes LVM metadata; confirm the device is unused first
vgextend vgdata /dev/mapper/mpatha # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp
```
### XFS
```bash
xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv # no-modify check; run against an unmounted filesystem
xfs_growfs /mountpoint
```
### ext4
```bash
tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv
resize2fs /dev/mapper/vg-lv
```
### Multipath
```bash
multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
```
### NFS
```bash
showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server
```
### iSCSI
```bash
iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <target-ip>
```
### Mount Troubleshooting
```bash
findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager
```
Check:
- device path stable
- UUID correct
- filesystem type correct
- multipath settled
- network and RPC available for NFS
### Filesystem Validation
```bash
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test
```
### Migration Validation Example
```bash
findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile
```
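A single `sha256sum` on one key file is a spot check. To cover every file, a checksum manifest can be built on the source and verified on the destination. A minimal sketch, assuming GNU coreutils and findutils; the function name `verify_copy` is hypothetical.

```bash
# Hypothetical helper: verify that every file under src exists under dst with the same SHA-256.
# It does not report extra files that exist only in dst, and an empty src fails verification.
verify_copy() {
  local src="$1" dst="$2" manifest rc
  manifest="$(mktemp)" || return 2
  ( cd "$src" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum ) > "$manifest" \
    || { rm -f "$manifest"; return 2; }
  ( cd "$dst" && sha256sum --quiet -c "$manifest" )   # prints only mismatches
  rc=$?
  rm -f "$manifest"
  return "$rc"
}
```

Run it as `verify_copy /olddata /data` once the copy has finished and writes to the source have stopped; otherwise mismatches are expected.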
## AIX Operations
```bash
oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P <pid>
netstat -rn
```
## SSL/TLS Operations
### OpenSSL Checks
```bash
openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check
openssl verify -CAfile chain.pem cert.pem
```
### Expiration Validation
```bash
openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem
```
### keytool Basics
```bash
keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
```
### Chain Validation
```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
```
## Automation Operations
### Bash Safety Patterns
```bash
set -euo pipefail
IFS=$'\n\t'
trap 'echo "line ${LINENO}: command failed" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT
```
Safe loop examples:
```bash
while IFS= read -r host; do
  ssh -n "$host" uptime   # -n stops ssh from consuming the rest of hostlist.txt
done < hostlist.txt
find /var/log -type f -name '*.gz' -print0 \
  | while IFS= read -r -d '' file; do
      gzip -t "$file" || echo "corrupt archive: $file" >&2
    done
```
Operational scripting patterns:
- default to read-only mode
- require explicit `--execute` for changes
- log actions with timestamps
- validate dependencies with `command -v`
- use temp files with `mktemp`
- guard destructive paths and empty variables
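Several of the patterns above fit in a small skeleton: dry-run by default, explicit `--execute`, timestamped logging, dependency checks, and a path guard. This is a sketch under stated assumptions, not the layout used by the repository scripts; every function name (`log`, `require_cmds`, `run`, `purge_dir`, `parse_args`) and the `/var/tmp` allow-list are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

EXECUTE=false                      # read-only unless the caller passes --execute

log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }

require_cmds() {                   # fail early when a dependency is missing
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd"; return 1; }
  done
}

run() {                            # log the command in dry-run mode, execute it otherwise
  if [[ "$EXECUTE" == true ]]; then
    log "EXEC: $*"
    "$@"
  else
    log "DRY-RUN: $*"
  fi
}

purge_dir() {                      # guard against empty variables and unexpected paths
  local target="${1:-}"
  if [[ -z "$target" || "$target" == *..* || "$target" != /var/tmp/?* ]]; then
    log "refusing unsafe path: '${target}'"
    return 1
  fi
  run rm -rf -- "$target"
}

parse_args() {
  local arg
  for arg in "$@"; do
    if [[ "$arg" == "--execute" ]]; then EXECUTE=true; fi
  done
}
```

A script built on this would call `parse_args "$@"`, then `require_cmds rm date`, then its actions through `run`. Without `--execute` it only logs what it would do, which makes a review pass cheap.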
## Ansible Operations
### Execution
```bash
ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
```
### Safe Rollout Workflow
1. Validate inventory and variable targeting.
2. Run syntax-check.
3. Run `--check --diff` on a single host.
4. Execute against one host or one tier.
5. Validate service health, logs, and config.
6. Expand rollout only after post-check passes.
Rollback mindset:
- keep before/after config copies
- know which tasks restart services
- define manual backout if package/config changes fail
- avoid broad `--limit` mistakes by reviewing resolved host list first
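Reviewing the resolved host list can be turned into a guard that refuses to continue when a play targets more hosts than expected. The sketch below parses `ansible-playbook --list-hosts` output and assumes each play prints a `hosts (N):` line; the function name `check_host_count` and the limit value are hypothetical.

```bash
# Hypothetical helper: reads `ansible-playbook ... --list-hosts` output on stdin.
# Exits non-zero when any play targets more hosts than the given limit.
check_host_count() {
  local max="$1"
  awk -v max="$max" '
    /^[ \t]*hosts \([0-9]+\):/ {
      n = $2; gsub(/[^0-9]/, "", n)        # "(12):" -> "12"
      if (n + 0 > worst) worst = n + 0
    }
    END {
      printf "largest play targets %d host(s); limit is %d\n", worst, max
      exit (worst > max) ? 1 : 0
    }'
}
```

For example: `ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01 --list-hosts | check_host_count 1`. Run the real playbook only when this passes.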
## Monitoring & Observability
### Zabbix Checks
```bash
systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
```
### ELK Log Workflows
```bash
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'
```
### Grafana Checks
```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50
```
### Health Endpoints and Alert Validation
```bash
curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head
```
False positive validation:
1. Compare alert timestamp with deploy/change window.
2. Confirm on-host evidence, not only dashboard data.
3. Check collector lag, scrape failures, and stale metrics.
4. Validate from a second source before escalating.
## Operational Habits
### Pre-checks
- capture time, hostname, and operator
- capture current config and service state
- check recent alerts, maintenance windows, and dependencies
- confirm backup or rollback path exists
### Post-checks
- validate service state
- validate logs for fresh errors
- validate client path, ports, and name resolution
- compare metrics before/after
### Rollback Thinking
- define exact backout trigger before change
- prefer reversible steps
- keep config backups with timestamps
- avoid bundling unrelated changes
### Change Validation
```bash
systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager
```
### Operational Communication
- state scope, risk, and expected impact before action
- record start and stop times in UTC
- document what changed, what was checked, and remaining risk
- escalate with evidence, not assumptions
### Evidence Collection During Incidents
```bash
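evidence_dir="/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)" # fixed path; a glob could match older incident directories
mkdir -p "$evidence_dir"
journalctl -b --no-pager > "$evidence_dir/journal.txt"
ss -tulpen > "$evidence_dir/sockets.txt"
df -hT > "$evidence_dir/df.txt"
free -m > "$evidence_dir/free.txt"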
```
+12
@@ -0,0 +1,12 @@
# examples
Sanitized sample outputs for documentation and review.
These files use fake hostnames, reserved example domains, reserved IP address ranges, and invented storage names. They are useful for reading the workflow without exposing real system details.
## Included
- `disk-full/` - sample filesystem usage, deleted open files, and a short after-action report.
- `incident-triage/` - sample L2 incident triage report for repeatable handoff and ticket evidence.
- `veritas/` - sample VxVM disk and VCS service group output.
- `gpfs/` - sample GPFS cluster and NSD output.
@@ -0,0 +1,4 @@
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vgapp-lvlog 80G 76G 4.0G 95% /var/log/app
/dev/mapper/vgapp-lvdata 200G 121G 79G 61% /srv/app
/dev/sda2 40G 19G 21G 48% /
@@ -0,0 +1,4 @@
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
appworker 1842 appsvc 12w REG 253,7 8589934592 0 9911 /var/log/app/app.log.1 (deleted)
java 2210 appsvc 45w REG 253,7 2147483648 0 9919 /var/log/app/gc.log.2 (deleted)
rsyslogd 712 root 7w REG 253,7 524288000 0 9924 /var/log/app/messages.old (deleted)
@@ -0,0 +1,13 @@
Disk Full Review - Sanitized Example
Host: host-app-01.example.invalid
Filesystem: /var/log/app
Before: 95% used
After: 72% used
Actions reviewed:
- Confirmed largest files under /var/log/app.
- Identified deleted files still held by appworker and java processes.
- Confirmed no symlinks were removed during rotated log cleanup.
- Recommended application owner restart during approved window to release deleted files.
No real hostnames, tickets, or application names are included in this sample.
@@ -0,0 +1,11 @@
GPFS cluster information
========================
GPFS cluster name: gpfs-lab.example.invalid
GPFS cluster id: 1234567890123456789
GPFS UID domain: gpfs-lab.example.invalid
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Node Daemon node name IP address Admin node name Designation
1 gpfs-node-a.example.invalid 192.0.2.11 gpfs-node-a.example.invalid quorum-manager
2 gpfs-node-b.example.invalid 192.0.2.12 gpfs-node-b.example.invalid quorum-manager
@@ -0,0 +1,5 @@
File system Disk name NSD servers
-------------------------------------------------------------------
fs_data nsd_data_01 gpfs-node-a.example.invalid,gpfs-node-b.example.invalid
fs_data nsd_data_02 gpfs-node-a.example.invalid,gpfs-node-b.example.invalid
fs_data nsd_data_03 gpfs-node-a.example.invalid,gpfs-node-b.example.invalid
@@ -0,0 +1,131 @@
# L2 Incident Triage Report
- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago
## Executed Checks
| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
## Summary
- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected
## Raw Evidence
### CPU saturation
Script: `check_high_cpu.sh`
Command: `./check_high_cpu.sh`
Status: OK, exit: 0
```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
Load average:
1m=0.42 5m=0.38 15m=0.31
Top CPU processes:
PID PPID USER %CPU %MEM COMMAND ARGS
1450 1 app 7.2 2.1 nginx nginx: worker process
Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```
### Memory and OOM
Script: `check_high_memory_oom.sh`
Command: `./check_high_memory_oom.sh --since "30 minutes ago"`
Status: WARNING, exit: 1
```text
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583
OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```
### Service restart loop
Script: `check_service_restart_loop.sh`
Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`
Status: OK, exit: 0
```text
OK: Service nginx state=active substate=running restarts=0
Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```
### Skipped or limited checks
```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
```
## L2 Handover Checklist
- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified
## Escalation Notes
- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.
## Recommended Next Steps
- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
@@ -0,0 +1,3 @@
#Group Attribute System Value
app_sg01 State node-a.example.invalid |ONLINE|
app_sg01 State node-b.example.invalid |OFFLINE|
@@ -0,0 +1,5 @@
DEVICE TYPE DISK GROUP STATUS
san_lun_001 auto:none - - online invalid
san_lun_002 auto:none - - online invalid
san_lun_010 auto:cdsdisk dgapp01_01 dgapp01 online
san_lun_011 auto:cdsdisk dgapp01_02 dgapp01 online
+3 -16
@@ -1,18 +1,5 @@
-# infra-run/runbooks
+# runbooks
-This directory is reserved for runbook-style procedures that describe how to perform controlled operational work. It sits alongside the executable scripts and captures the human workflow around them.
+Planned area for standalone runbooks.
-## Diagram
-```mermaid
-flowchart TD
-A["runbooks"] --> B["Pre-check"]
-A --> C["Change execution"]
-A --> D["Post-check"]
-A --> E["Rollback or escalation"]
-```
-## Notes
-- The directory is currently a placeholder.
-- It is intended to hold narrative procedures that complement the script-based toolkits.
+Current runnable workflow notes live with the Bash toolkits under [scripts/bash](../scripts/bash/).
+8 -6
@@ -1,6 +1,6 @@
# infra-run/scripts # infra-run/scripts
This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from future Python-based utilities while keeping both under one automation entry point. This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from Python-based analysis utilities while keeping both under one automation entry point.
## Diagram ## Diagram
@@ -9,15 +9,17 @@ flowchart TD
A["scripts"] --> B["bash"] A["scripts"] --> B["bash"]
A --> C["python"] A --> C["python"]
B --> D["Operational toolkits"] B --> D["Operational toolkits"]
C --> E["Future helper utilities"] C --> E["Analysis helper utilities"]
``` ```
## Scope ## Scope
- `bash` - current implementation area with production-style operations toolkits. - [bash](./bash/) - operational toolkits for host health checks, disk-full triage, Veritas examples, and GPFS examples.
- `python` - reserved space for future supporting utilities. - [python](./python/) - read-only tools for local log parsing, reporting, and structured operational analysis.
## Notes ## Notes
- The repository currently emphasizes Bash because it maps directly to day-to-day Linux operations. - Bash remains the right default for direct host checks and operational wrappers.
- The structure leaves room for higher-level helpers without mixing concerns. - Python is used where parsing, report generation, comparison, or JSON output is clearer than shell.
- Bash tooling should remain safe by default, readable, and validated with `../../scripts/check-bash.sh` from the repository root.
- Python tooling should remain read-only by default, standard-library based, and validated with `../../scripts/check-python.sh` from the repository root.
+23 -6
@@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. T
```mermaid ```mermaid
flowchart TD flowchart TD
A["bash"] --> B["os-healthcheck"] A["bash"] --> B["os-healthcheck"]
A --> C["disk-full"] A --> C["incident-checks"]
A --> D["veritas"] A --> D["disk-full"]
A --> E["gpfs"] A --> E["veritas"]
A --> F["gpfs"]
B --> B1["Host diagnostics"] B --> B1["Host diagnostics"]
C --> C1["Incident workflow"] C --> C1["Standalone triage checks"]
D --> D1["VxVM and VCS change flow"] D --> D1["Incident workflow"]
E --> E1["Spectrum Scale expansion flow"] E --> E1["VxVM and VCS change flow"]
F --> F1["Spectrum Scale expansion flow"]
``` ```
## Scripts ## Scripts
@@ -23,6 +25,7 @@ flowchart TD
- `os-healthcheck/service_check.sh` - critical service status check. - `os-healthcheck/service_check.sh` - critical service status check.
- `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`. - `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
- `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics. - `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
- `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics.
## Usage ## Usage
@@ -37,8 +40,22 @@ cd infra-run/scripts/bash/os-healthcheck
./system_report.sh ./system_report.sh
./network_troubleshoot.sh ./network_troubleshoot.sh
./network_troubleshoot.sh google.com ./network_troubleshoot.sh google.com
cd ../incident-checks
./check_high_cpu.sh
./check_high_memory_oom.sh --since "24 hours ago"
./check_service_restart_loop.sh --service sshd
./check_certificate_expiry.sh --host example.com
``` ```
## Standards
- Scripts use Bash and should keep `#!/usr/bin/env bash` plus strict mode.
- Read-only checks should report missing tools without hiding the problem.
- Change-capable scripts must default to dry-run behavior and require explicit `--execute`.
- Output should use `OK`, `WARNING`, and `CRITICAL` where practical.
- Validate changed scripts with `./scripts/check-bash.sh` from the repository root.
## Exit Codes ## Exit Codes
`disk_check.sh`: `disk_check.sh`:
+1 -3
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}" TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}"
DRY_RUN="${DRY_RUN:-true}" DRY_RUN="${DRY_RUN:-true}"
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
+1 -3
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}" TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}"
DRY_RUN="${DRY_RUN:-true}" DRY_RUN="${DRY_RUN:-true}"
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh
@@ -1,7 +1,5 @@
#!/usr/bin/env bash #!/usr/bin/env bash
set -o errexit set -euo pipefail
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh # shellcheck source=00_env.sh

Some files were not shown because too many files have changed in this diff.