Compare commits: deb12a0b4f..main
18 Commits
| SHA1 |
|---|
| 4e739c5c99 |
| 8cb92de06f |
| 1843796e92 |
| cd6830334b |
| e2624a7533 |
| 6475f76787 |
| e851568c8c |
| 8a7b7c5abc |
| 1636f46f81 |
| 5fc96348c5 |
| 89b7fabb96 |
| 2da5e8b46c |
| 452ff4fac1 |
| 5dde403ce3 |
| 61483c233f |
| a527022518 |
| 0d3905b8a1 |
| ca5a876d03 |
@@ -0,0 +1,39 @@
---
name: lint

on:
  pull_request:
  push:
    branches:
      - main

jobs:
  shell-yaml-ansible:
    runs-on: ubuntu-latest
    steps:
      - name: Check out repository
        uses: actions/checkout@v4

      - name: Install lint tools
        run: |
          sudo apt-get update
          sudo apt-get install -y shellcheck yamllint python3-pip
          python3 -m pip install --user ansible-lint
          echo "$HOME/.local/bin" >> "$GITHUB_PATH"

      - name: ShellCheck Bash scripts
        run: |
          find infra-run/scripts/bash -name '*.sh' -print0 | xargs -0 shellcheck -x \
            -P infra-run/scripts/bash/disk-full \
            -P infra-run/scripts/bash/gpfs \
            -P infra-run/scripts/bash/veritas

      - name: Python syntax checks
        run: bash scripts/check-python.sh

      - name: yamllint
        run: yamllint .

      - name: ansible-lint
        continue-on-error: true
        run: cd infra-run/ansible && ansible-lint playbooks roles
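
The ShellCheck step in the workflow above pipes `find` into `xargs`. GNU `xargs` still runs its command once when the input is empty, so a tree with no `*.sh` files would call `shellcheck` with only its option arguments. A minimal local sketch that handles that case explicitly could look like the following. The function names and the `SCRIPTS` array are hypothetical and are not part of the workflow or of `scripts/check-bash.sh`.

```bash
# Hypothetical local helper, not part of the workflow above.
# Collect *.sh files under a directory into the global SCRIPTS array.
# NUL-delimited reads keep paths with spaces or newlines intact.
collect_shell_scripts() {
  local dir="$1" path
  SCRIPTS=()
  while IFS= read -r -d '' path; do
    SCRIPTS+=("$path")
  done < <(find "$dir" -type f -name '*.sh' -print0)
}

# Lint every script under a directory, reporting clearly when there is
# nothing to lint or when shellcheck itself is missing.
lint_shell_scripts() {
  local dir="$1"
  collect_shell_scripts "$dir"
  if (( ${#SCRIPTS[@]} == 0 )); then
    printf 'WARNING: no *.sh files under %s, nothing to lint\n' "$dir"
    return 0
  fi
  if ! command -v shellcheck >/dev/null 2>&1; then
    printf 'WARNING: shellcheck not found, %d script(s) not linted\n' "${#SCRIPTS[@]}"
    return 2
  fi
  shellcheck -x "${SCRIPTS[@]}"
}
```

Calling `lint_shell_scripts infra-run/scripts/bash` would approximate the CI step locally, minus the `-P` source paths, which would need to be added for scripts that source helpers from sibling directories.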
@@ -0,0 +1,8 @@
---
extends: default

rules:
  line-length:
    max: 140
  truthy:
    allowed-values: ["true", "false", "on"]
@@ -0,0 +1,126 @@
# AGENTS.md

Guidance for Codex and other automated agents working in this repository.

## Purpose

This repository is a Linux/Unix infrastructure engineering portfolio. It shows practical operational work: incident response, troubleshooting, safe Bash tooling, Ansible hardening examples, storage workflows, runbooks, and platform/lab notes.

Treat it like internal operations tooling maintained by an infrastructure engineer. Preserve operational realism and avoid generic tutorial or template filler.

## Layout

- `infra-run/` - core operational tooling, Ansible, Bash scripts, runbooks, examples, and operations docs.
- `platform-projects/` - larger platform topics such as monitoring, storage, clustering, virtualization, and observability.
- `labs/` - experimental/lab environments for Kubernetes, Terraform, networking, CI/CD, Docker, and related work.
- `docs/codex/` - Codex workflow guidance, task templates, review checklist, and planning template.
- `scripts/` - repository validation helpers.

## Inspect First

Before editing, inspect the affected tree and nearby README files. Prefer:

```bash
rg --files
git status --short
sed -n '1,220p' <file>
```

Check existing style before introducing new structure. Keep changes small and reviewable.

## Validation

Run the broad repo check when practical:

```bash
./scripts/validate-repo.sh
```

Focused checks:

```bash
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```

Optional strict mode fails when optional tools are missing:

```bash
STRICT=1 ./scripts/validate-repo.sh
```

Also run targeted checks for changed files, such as `bash -n`, `ansible-playbook --syntax-check`, or link checks when relevant.
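
The strict-mode behavior described above can be sketched as a small gate around each optional tool. This is an illustration only: the function names are hypothetical, the real `scripts/validate-repo.sh` is not shown in this diff and may work differently, and returning `2` for a missing tool under `STRICT=1` is an assumption borrowed from the exit-code convention in this file.

```bash
# Hypothetical helpers; the real validation scripts may differ.
# optional_tool_status TOOL
#   0 = tool found
#   1 = tool missing, tolerated (normal mode)
#   2 = tool missing under STRICT=1
optional_tool_status() {
  local tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    return 0
  fi
  if [[ "${STRICT:-0}" == "1" ]]; then
    printf 'CRITICAL: %s is required in strict mode but was not found\n' "$tool" >&2
    return 2
  fi
  printf 'WARNING: %s not found, check skipped\n' "$tool" >&2
  return 1
}

# run_optional_check TOOL CMD...
# Runs CMD only when TOOL exists; a tolerated skip is not a failure.
run_optional_check() {
  local tool="$1" rc=0
  shift
  optional_tool_status "$tool" || rc=$?
  case "$rc" in
    0) "$@" ;;          # tool present: run the real check
    1) return 0 ;;      # normal mode: skip without failing
    *) return "$rc" ;;  # strict mode: surface the missing dependency
  esac
}
```

With this shape, `run_optional_check yamllint yamllint .` is skipped with a warning on a workstation without `yamllint`, and fails when the same command runs with `STRICT=1`.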

## Bash Standards

- Use `#!/usr/bin/env bash`.
- Use `set -o errexit`, `set -o nounset`, and `set -o pipefail`.
- Validate input before using it.
- Handle missing commands clearly.
- Default to read-only or dry-run behavior.
- Require explicit `--execute` plus confirmation for destructive operations.
- Use clear `OK`, `WARNING`, and `CRITICAL` output.
- Exit codes: `0` OK, `1` operational issue, `2` invalid input or missing dependency.
- Keep scripts readable; separate discovery, pre-check, change, post-check, and reporting when it helps.
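
As a minimal sketch of several of the standards above in one place (strict mode, input validation, status output, and the exit-code convention), a read-only check might be shaped like this. The function name, the default thresholds, and the choice to return `1` for both `WARNING` and `CRITICAL` are assumptions made for illustration, not taken from the repository's scripts.

```bash
#!/usr/bin/env bash
# Illustrative skeleton only; name and thresholds are assumptions.
set -o errexit
set -o nounset
set -o pipefail

# check_usage_pct PCT [WARN] [CRIT]
# Prints one OK/WARNING/CRITICAL line and returns:
#   0 = OK, 1 = operational issue, 2 = invalid input.
check_usage_pct() {
  local pct="${1:-}" warn="${2:-80}" crit="${3:-90}"
  local value
  # Validate every input before using it in arithmetic.
  for value in "$pct" "$warn" "$crit"; do
    if [[ ! "$value" =~ ^[0-9]+$ ]] || (( 10#$value > 100 )); then
      printf 'CRITICAL: expected an integer 0-100, got "%s"\n' "$value" >&2
      return 2
    fi
  done
  # 10# forces base 10 so values such as 08 are not read as octal.
  if (( 10#$pct >= 10#$crit )); then
    printf 'CRITICAL: usage %s%% >= %s%%\n' "$pct" "$crit"
    return 1
  elif (( 10#$pct >= 10#$warn )); then
    printf 'WARNING: usage %s%% >= %s%%\n' "$pct" "$warn"
    return 1
  fi
  printf 'OK: usage %s%% is below %s%%\n' "$pct" "$warn"
}
```

Because `errexit` is active, callers that want to inspect the status rather than abort should capture it explicitly, for example `rc=0; check_usage_pct "$pct" || rc=$?`.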

## Python Standards

- Use Python for parsing, reporting, and structured operational tooling where it adds value over Bash.
- Keep Python tools read-only by default.
- Prefer the Python standard library.
- Avoid frameworks and unnecessary abstractions.
- Use clear operational output and meaningful exit codes.
- Keep tools small, focused, and easy to validate.

## Ansible Standards

- Keep playbooks short and roles simple.
- Prefer modules over `shell` or `command`.
- Use `shell` or `command` only when the module set cannot express the operation, and document why if risk is not obvious.
- Preserve check-mode and diff-mode friendliness where possible.
- Use handlers, tags, defaults, and validation tasks when they clarify operations.
- Keep inventory under `inventory/hosts.yml`, `group_vars/`, and `host_vars/`.
- Do not present selected hardening examples as complete compliance certification.
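
One way to exercise the check-mode and diff-mode expectations above before any real run is a two-step wrapper: a syntax check, then a `--check --diff` pass. The sketch below uses only standard `ansible-playbook` options. The function name is hypothetical, and any playbook path passed to it is the caller's choice rather than a path confirmed by this repository.

```bash
# Hypothetical wrapper: syntax check, then a check-mode run with diff output.
# Neither step is intended to change the target; tasks that do not support
# check mode are typically skipped by Ansible rather than executed.
ansible_dry_run() {
  if (( $# != 2 )); then
    printf 'CRITICAL: usage: ansible_dry_run INVENTORY PLAYBOOK\n' >&2
    return 2
  fi
  local inventory="$1" playbook="$2"
  if ! command -v ansible-playbook >/dev/null 2>&1; then
    printf 'WARNING: ansible-playbook not found, check skipped\n' >&2
    return 2
  fi
  ansible-playbook -i "$inventory" --syntax-check "$playbook" || return 1
  ansible-playbook -i "$inventory" --check --diff "$playbook" || return 1
  printf 'OK: %s passed syntax check and check-mode run\n' "$playbook"
}
```

Run from the Ansible directory, a call such as `ansible_dry_run inventory/hosts.yml <playbook>` keeps the inventory target explicit, which also makes the command easy to paste into review evidence.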

## Documentation Standards

- Explain what exists, what is planned, and what is intentionally not supported.
- Prefer runbook style: scope, pre-checks, execution guardrails, rollback thinking, post-checks, and evidence.
- Avoid marketing language, fake enterprise wording, and tutorial bloat.
- Update README files and `CHANGELOG.md` when adding meaningful behavior or structure.

## Safety Rules

- Do not run destructive commands.
- Do not rename large directories unless the benefit is clear and low-risk.
- Do not hide validation failures.
- Do not claim live production validation for sanitized examples.
- Do not add secrets, real hostnames, customer identifiers, or private infrastructure details.
- Do not turn placeholders into fake completed projects.

## PR and Review Expectations

- State the operational risk of the change.
- Include commands run and whether tools were missing.
- Review scripts for dry-run behavior, input validation, dependency handling, and rollback path.
- Review Ansible for idempotency, check-mode behavior, inventory targeting, tags, handlers, and module choice.
- Keep diffs focused.

## Definition of Done

- The change preserves the repository intent.
- Relevant docs are updated.
- Changed Bash scripts pass `bash -n`.
- Available validation helpers were run.
- Missing optional tools are reported.
- Any remaining risk or follow-up is documented.
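
The `bash -n` item above is easy to apply to several files at once. A minimal sketch, assuming a hypothetical helper name and the `OK`/`CRITICAL` output style used elsewhere in this file:

```bash
# Hypothetical helper: parse-check each script given as an argument.
# bash -n reads the file without executing it, so this is read-only.
syntax_check_scripts() {
  local script failed=0
  if (( $# == 0 )); then
    printf 'WARNING: no scripts given, nothing checked\n'
    return 0
  fi
  for script in "$@"; do
    if [[ ! -f "$script" ]]; then
      printf 'CRITICAL: %s is not a file\n' "$script"
      failed=1
    elif bash -n "$script"; then
      printf 'OK: %s\n' "$script"
    else
      # bash has already printed the parse error on stderr.
      printf 'CRITICAL: %s has a syntax error\n' "$script"
      failed=1
    fi
  done
  return "$failed"
}
```

A caller could feed it the changed files from `git diff --name-only --diff-filter=d HEAD -- '*.sh'`. Note that `bash -n` only catches parse errors; it does not replace `shellcheck` or a functional test.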

## Do Not

- Do not add an "ultimate DevOps template" structure.
- Do not replace working simple Bash with unnecessary abstractions.
- Do not make examples appear production-certified.
- Do not add destructive behavior without `--execute`, confirmation, and clear rollback notes.
- Do not delete useful content unless it is clearly duplicate, broken, or misleading.
+35
-2
@@ -4,20 +4,53 @@
|
|||||||
|
|
||||||
### Added
|
### Added
|
||||||
|
|
||||||
- CIS-inspired Ansible hardening automation:
|
- Added Linux Fresh Setup Toolkit under `labs/linux/setup` for day-0 Ubuntu lab host bootstrap automation.
|
||||||
|
- Added AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation.
|
||||||
|
- Python tooling validation for operational scripts.
|
||||||
|
- `incident-log-summary` for general incident log summarization.
|
||||||
|
- `log-diff-checker` for pre-change and post-change log comparison.
|
||||||
|
- `auth-log-audit` for Linux authentication log review.
|
||||||
|
- `jvm-log-analyzer` for JVM application log summaries.
|
||||||
|
- `journal-analyzer` for exported `journalctl` log review.
|
||||||
|
- `known-error-matcher` with JSON-based known error patterns.
|
||||||
|
- Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics.
|
||||||
|
- `incident_triage_report.sh` for L2 Markdown incident handover reports built from existing Bash incident checks.
|
||||||
|
- Repository-level Codex guidance:
|
||||||
|
- `AGENTS.md`
|
||||||
|
- `docs/codex/README.md`
|
||||||
|
- `docs/codex/review-checklist.md`
|
||||||
|
- `docs/codex/task-template.md`
|
||||||
|
- `docs/codex/plans-template.md`
|
||||||
|
- Lightweight validation helpers:
|
||||||
|
- `scripts/validate-repo.sh`
|
||||||
|
- `scripts/check-bash.sh`
|
||||||
|
- `scripts/check-ansible.sh`
|
||||||
|
- `scripts/check-docs.sh`
|
||||||
|
- Cross-repository operational documentation structure:
|
||||||
|
- `infra-run/docs/operations-cheatsheet.md`
|
||||||
|
- `platform-projects/docs/platform-cheatsheet.md`
|
||||||
|
- `labs/docs/lab-cheatsheet.md`
|
||||||
|
- Production-oriented Linux/Unix operations reference with incident workflows, storage and networking checks, SSL/TLS notes, AIX commands, automation safety patterns, Ansible operational usage, and observability quick-reference.
|
||||||
|
- SELinux operational coverage for mode checks, context inspection, AVC audit review, persistent relabel workflow, booleans, and SELinux-specific incident response.
|
||||||
|
- Selected baseline Ansible hardening automation:
|
||||||
- RHEL 9 role and playbook.
|
- RHEL 9 role and playbook.
|
||||||
- Debian 13 / Ubuntu 26.04 role and playbook.
|
- Debian 13 / Ubuntu 26.04 role and playbook.
|
||||||
- IBM AIX 7 role and playbook.
|
- IBM AIX 7 role and playbook.
|
||||||
- Shared sanitized Ansible inventory defaults for Linux and AIX examples.
|
- Shared sanitized Ansible inventory defaults for Linux and AIX examples.
|
||||||
- Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
|
- Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
|
||||||
|
- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
|
||||||
|
|
||||||
### Changed
|
### Changed
|
||||||
|
|
||||||
|
- Updated root, `infra-run`, Bash, Ansible, platform, and lab README guidance for safety-first usage, validation, and future Codex-driven work.
|
||||||
|
- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
|
||||||
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.
|
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.
|
||||||
|
- Updated Python tooling documentation and repository roadmap.
|
||||||
|
- Integrated Python syntax validation into repository validation workflow and CI.
|
||||||
|
|
||||||
### Notes
|
### Notes
|
||||||
|
|
||||||
- Hardening content is CIS-inspired and intended for portfolio/lab use; production use requires environment-specific review and validation.
|
- Hardening content covers selected baseline controls and is intended for portfolio/lab use; live use requires environment-specific review and validation.
|
||||||
|
|
||||||
## [Initial Version]
|
## [Initial Version]
|
||||||
|
|
||||||
|
|||||||
Binary file not shown.
@@ -1,92 +1,111 @@
|
|||||||
# Portfolio
|
# Linux/Unix Infrastructure Engineering Portfolio
|
||||||
|
|
||||||
This repository demonstrates real-world Linux infrastructure and operations experience through sanitized scripts, runbooks, and project structure. It focuses on production operations, incident response, troubleshooting, automation, and enterprise infrastructure patterns.
|
This repository contains sanitized infrastructure automation examples based on Linux/Unix operations and infrastructure workflows. The focus is on incident response, troubleshooting, pre-checks, dry-run behavior, controlled execution, post-checks, and readable operational evidence.
|
||||||
|
|
||||||
## Repository Diagram
|
It is a technical portfolio, not a production toolkit. The examples show how operational work is structured: understand the current state, make changes only with explicit controls, verify the result, and leave enough evidence for review.
|
||||||
|
|
||||||
```mermaid
|
## What This Repo Is
|
||||||
flowchart TD
|
|
||||||
A["portfolio"] --> B["infra-run"]
|
- Practical Linux/Unix operations examples.
|
||||||
A --> C["platform-projects"]
|
- Safe Bash and Ansible patterns for lab and review.
|
||||||
A --> D["labs"]
|
- Runbook-driven examples for incident response, storage operations, hardening, and observability.
|
||||||
B --> B1["ansible"]
|
- A place for platform and lab topics to grow without pretending unfinished areas are complete.
|
||||||
B --> B2["docs"]
|
|
||||||
B --> B3["runbooks"]
|
## What This Repo Is Not
|
||||||
B --> B4["scripts"]
|
|
||||||
B1 --> B11["hardening roles"]
|
- It is not a compliance benchmark implementation.
|
||||||
B4 --> B41["bash"]
|
- It is not a drop-in change automation framework.
|
||||||
B4 --> B42["python"]
|
- It is not proof that these exact scripts ran in any production environment.
|
||||||
C --> C1["storage"]
|
- It does not replace change review, peer review, backups, monitoring, or platform-specific runbooks.
|
||||||
C --> C2["clustering"]
|
|
||||||
C --> C3["monitoring-zabbix"]
|
## Repository Layout
|
||||||
C --> C4["virtualization"]
|
|
||||||
C --> C5["elk-log-analysis"]
|
- [infra-run](./infra-run/) - core operational tooling and automation.
|
||||||
D --> D1["docker"]
|
- [platform-projects](./platform-projects/) - larger platform topics and case-study areas.
|
||||||
D --> D2["kubernetes"]
|
- [labs](./labs/) - experimental/lab environments and notes.
|
||||||
D --> D3["terraform"]
|
- [docs/codex](./docs/codex/) - guidance for future Codex-driven changes.
|
||||||
D --> D4["networking"]
|
- [scripts](./scripts/) - lightweight repository validation helpers.
|
||||||
D --> D5["ci-cd"]
|
|
||||||
|
## Usable Now
|
||||||
|
|
||||||
|
- [infra-run](./infra-run/) - the main implemented project in this repository.
|
||||||
|
- [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
|
||||||
|
- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
|
||||||
|
- [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
|
||||||
|
- [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
|
||||||
|
- [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
|
||||||
|
- [Incident log summary](./infra-run/scripts/python/incident-log-summary/) - read-only Python helper for local incident log pattern summaries.
|
||||||
|
- [Log diff checker](./infra-run/scripts/python/log-diff-checker/) - read-only Python helper for before/after change log comparison.
|
||||||
|
- [Auth log audit](./infra-run/scripts/python/auth-log-audit/) - read-only Python helper for local authentication log review.
|
||||||
|
- [JVM log analyzer](./infra-run/scripts/python/jvm-log-analyzer/) - read-only Python helper for local JVM and Java application log review.
|
||||||
|
- [Journal analyzer](./infra-run/scripts/python/journal-analyzer/) - read-only Python helper for exported `journalctl` text review.
|
||||||
|
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
|
||||||
|
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
|
||||||
|
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
|
||||||
|
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
|
||||||
|
|
||||||
|
## Planned Areas
|
||||||
|
|
||||||
|
The `labs` and `platform-projects` trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in [ROADMAP.md](./ROADMAP.md).
|
||||||
|
|
||||||
|
## Documentation
|
||||||
|
|
||||||
|
### Production Operations
|
||||||
|
|
||||||
|
- [infra-run/docs/operations-cheatsheet.md](./infra-run/docs/operations-cheatsheet.md) - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.
|
||||||
|
|
||||||
|
### Platform Engineering
|
||||||
|
|
||||||
|
- [platform-projects/docs/platform-cheatsheet.md](./platform-projects/docs/platform-cheatsheet.md) - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.
|
||||||
|
|
||||||
|
### Labs & Experiments
|
||||||
|
|
||||||
|
- [labs/docs/lab-cheatsheet.md](./labs/docs/lab-cheatsheet.md) - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.
|
||||||
|
|
||||||
|
### Codex and Review Guidance
|
||||||
|
|
||||||
|
- [AGENTS.md](./AGENTS.md) - repository rules for automated and assisted changes.
|
||||||
|
- [docs/codex/README.md](./docs/codex/README.md) - Codex workflow and expected final response format.
|
||||||
|
- [docs/codex/review-checklist.md](./docs/codex/review-checklist.md) - safety, Bash, Ansible, docs, and validation review checklist.
|
||||||
|
- [docs/codex/task-template.md](./docs/codex/task-template.md) - reusable scoped task templates.
|
||||||
|
|
||||||
|
## Safety-First Usage
|
||||||
|
|
||||||
|
Read scripts and playbooks before running them. Operational examples are sanitized and may need adaptation for a real system.
|
||||||
|
|
||||||
|
- Prefer read-only commands first.
|
||||||
|
- Use dry-run/check mode before execution.
|
||||||
|
- Treat `--execute` as a change-control boundary.
|
||||||
|
- Confirm backups, monitoring, application impact, and rollback steps before live use.
|
||||||
|
- Do not run platform-specific storage commands without a matching Veritas, GPFS, or AIX lab.
|
||||||
|
|
||||||
|
## Validation
|
||||||
|
|
||||||
|
Basic local validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./scripts/validate-repo.sh
|
||||||
|
./scripts/check-bash.sh
|
||||||
|
./scripts/check-ansible.sh
|
||||||
|
./scripts/check-python.sh
|
||||||
|
./scripts/check-docs.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
## Core Project
|
The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Python checks use `python3 -m py_compile` and do not require external Python tooling. Set `STRICT=1` to fail when optional tools are missing.
|
||||||
|
|
||||||
### infra-run
|
Some scripts depend on platform tools such as `vxdisk`, `hagrp`, `mmcrnsd`, and `mmlscluster`. Those commands are not expected to exist on a normal workstation, so functional testing against Veritas or GPFS requires a real lab environment.
|
||||||
|
|
||||||
`infra-run` is the core operational project in this repository. It contains Linux operations automation, incident response tooling, Bash-based operational scripts, and runbook-style workflows for pre-checks, controlled changes, troubleshooting, and post-change validation.
|
See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATIONS.md](./infra-run/KNOWN_LIMITATIONS.md) for the current validation status.
|
||||||
|
|
||||||
## Toolkits
|
## Operational Areas Demonstrated
|
||||||
|
|
||||||
### Linux Operations Toolkit
|
- Linux operations triage and reporting.
|
||||||
|
- Local operational log analysis with read-only Python helpers.
|
||||||
[infra-run/scripts/bash/os-healthcheck/](./infra-run/scripts/bash/os-healthcheck/)
|
- Disk pressure and deleted-file incident analysis.
|
||||||
|
- Dry-run-first Bash automation.
|
||||||
General Linux operations scripts for host health checks, disk usage checks, service validation, system reporting, and first-pass OS-level diagnostics. The toolkit is written for practical operations checks on RHEL, Oracle Linux, and Ubuntu-style systems.
|
- Controlled storage change workflow design.
|
||||||
|
- Veritas VxVM/VCS operational awareness.
|
||||||
### Disk Full Incident Toolkit
|
- GPFS / IBM Spectrum Scale operational awareness.
|
||||||
|
- Ansible role organization for selected hardening controls.
|
||||||
[infra-run/scripts/bash/disk-full/](./infra-run/scripts/bash/disk-full/)
|
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
|
||||||
|
- Clear documentation of what was tested and what still needs a real system.
|
||||||
Production-style disk full incident workflow covering filesystem usage, inode pressure, large file discovery, deleted open files, top directory analysis, log cleanup review, and safe cleanup suggestions. The scenario reflects common incidents involving logs, temporary files, deleted files held open by processes, and inode exhaustion.
|
|
||||||
|
|
||||||
### Network Troubleshooting
|
|
||||||
|
|
||||||
[infra-run/scripts/bash/os-healthcheck/](./infra-run/scripts/bash/os-healthcheck/)
|
|
||||||
|
|
||||||
OS-level network diagnostics for interfaces, routes, DNS resolution, gateway reachability, listening sockets, and optional remote connectivity checks. The script is designed for first-pass troubleshooting during Linux operations incidents.
|
|
||||||
|
|
||||||
### Veritas Storage Toolkit
|
|
||||||
|
|
||||||
[infra-run/scripts/bash/veritas/](./infra-run/scripts/bash/veritas/)
|
|
||||||
|
|
||||||
Veritas VxVM and VCS storage expansion workflow covering new LUN detection, VxVM disk initialization, diskgroup extension, volume and filesystem resize, and VCS service group freeze/unfreeze handling. The approach is cluster-safe, dry-run by default, and organized around pre-check, change, and post-check steps.
|
|
||||||
|
|
||||||
### GPFS Storage Toolkit
|
|
||||||
|
|
||||||
[infra-run/scripts/bash/gpfs/](./infra-run/scripts/bash/gpfs/)
|
|
||||||
|
|
||||||
GPFS / IBM Spectrum Scale filesystem expansion workflow covering cluster validation, candidate disk discovery, NSD stanza planning, NSD creation, filesystem expansion, optional rebalance, post-checks, and change reporting.
|
|
||||||
|
|
||||||
### Ansible Hardening Toolkit
|
|
||||||
|
|
||||||
[infra-run/ansible/](./infra-run/ansible/)
|
|
||||||
|
|
||||||
CIS-inspired Ansible automation for repeatable operating system hardening across RHEL 9, Debian 13 / Ubuntu 26.04, and IBM AIX 7 targets. The roles are organized around pre-checks, configurable safeguards, SSH and sudo policy, auditing, logging, services, filesystem controls, platform-specific system settings, handlers, and post-change validation.
|
|
||||||
|
|
||||||
## Repository Structure
|
|
||||||
|
|
||||||
- `infra-run` - core operational automation, scripts, runbooks, and infrastructure operations examples.
|
|
||||||
- `platform-projects` - larger infrastructure topics including storage, clustering, monitoring, virtualization, and log analysis.
|
|
||||||
- `labs` - experimentation and lab work for Kubernetes, Terraform, Docker, networking, and CI/CD.
|
|
||||||
|
|
||||||
## Design Principles
|
|
||||||
|
|
||||||
- Safety first, with dry-run behavior by default.
|
|
||||||
- Pre-check, change, and post-check workflow.
|
|
||||||
- Real-world scenarios, not tutorials.
|
|
||||||
- Minimal but practical tooling.
|
|
||||||
- Configurable automation with sanitized defaults and explicit overrides.
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- Scripts are simplified and sanitized for portfolio use.
|
|
||||||
- Examples are based on real production operations patterns.
|
|
||||||
|
|||||||
+38
@@ -0,0 +1,38 @@
# Roadmap

This file keeps future portfolio ideas in one place so empty folders do not look like finished work.

## Planned Lab Areas

- Docker: image build notes, container troubleshooting, and small service examples.
- Kubernetes: workload inspection, basic operations checks, and failure scenario notes.
- Terraform: small infrastructure-as-code examples with clear plan/apply separation.
- Networking: DNS, routing, firewall, and connectivity troubleshooting labs.
- CI/CD: validation pipelines for shell, YAML, and Ansible examples.

## Planned Platform Case Studies

- Storage: expansion planning, filesystem checks, and SAN handoff documentation.
- Clustering: service group checks, failover review, and operational checklists.
- Monitoring: Zabbix-oriented alert review and host onboarding notes.
- Virtualization: VM lifecycle and platform operations examples.
- Log analysis: optional ELK-style search case study under `platform-projects`, separate from current local Python helpers.

## Implemented Portfolio Additions

- Standalone Bash incident checks under `infra-run/scripts/bash/incident-checks/` for common Linux incident triage and ticket evidence.
- Python operational log analysis suite under `infra-run/scripts/python/`:
  - `incident-log-summary`
  - `log-diff-checker`
  - `auth-log-audit`
  - `jvm-log-analyzer`
  - `journal-analyzer`
  - `known-error-matcher`

## Future Python Tooling Ideas

- Real-world sample report examples using sanitized evidence.
- Integration examples that combine log summaries with change evidence collection.
- A shared Python helper library only if the standalone tools begin duplicating enough stable behavior to justify it.

Planned sections remain future work unless listed as implemented.
@@ -0,0 +1,55 @@
# Codex Workflow

This directory keeps future Codex sessions consistent when working in this infrastructure portfolio.

## How To Start

1. Read [AGENTS.md](../../AGENTS.md).
2. Inspect the affected tree and nearby README files.
3. Check `git status --short` so existing user work is preserved.
4. Decide whether a plan is needed before editing.
5. Make small, reviewable changes.
6. Run focused validation plus `./scripts/validate-repo.sh` when practical.

## When To Plan First

Plan before editing when a task touches more than one subsystem, changes operational behavior, adds or modifies destructive actions, changes Ansible targeting, or updates repository conventions.

For small typo fixes, narrow README updates, or obvious syntax fixes, inspect first and then make the change directly.

Use [plans-template.md](./plans-template.md) for larger changes.

## Scoped Tasks

Good tasks name the operational goal, affected directories, constraints, validation commands, and what "done" means. Use [task-template.md](./task-template.md) for reusable prompts.

Keep scope tied to real operations:

- Bash tool: discovery, pre-check, dry-run, execute, post-check, report.
- Ansible change: inventory target, role/playbook scope, check mode, idempotency, validation.
- Runbook: incident signal, triage, decision points, rollback, evidence.
- Lab/platform project: status, prerequisites, validation, limitations.

## Validation

Prefer the repository helpers:

```bash
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-docs.sh
./scripts/validate-repo.sh
```

If optional tools are missing, report that clearly and continue with available checks. Do not claim skipped checks passed.
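
To keep skipped checks from being reported as passes, a validation run can track three separate outcomes. The following is a minimal sketch under stated assumptions: the array names and both functions are hypothetical and are not taken from the repository helpers.

```bash
# Hypothetical result tracking; not the repository's actual implementation.
PASSED=()
FAILED=()
SKIPPED=()

# record_check NAME TOOL CMD...
# Runs CMD only when TOOL is installed; otherwise records a skip.
record_check() {
  local name="$1" tool="$2"
  shift 2
  if ! command -v "$tool" >/dev/null 2>&1; then
    SKIPPED+=("$name (missing: $tool)")
    return 0
  fi
  if "$@" >/dev/null 2>&1; then
    PASSED+=("$name")
  else
    FAILED+=("$name")
  fi
}

# Print counts, list every skip by name, and fail if any check failed.
print_summary() {
  local item
  printf 'passed=%d failed=%d skipped=%d\n' \
    "${#PASSED[@]}" "${#FAILED[@]}" "${#SKIPPED[@]}"
  for item in "${SKIPPED[@]}"; do
    printf 'SKIPPED: %s\n' "$item"
  done
  (( ${#FAILED[@]} == 0 ))
}
```

The summary line maps directly onto items 3 and 4 of the final response format: what ran with which result, and what was skipped and why. Command output is discarded here for brevity; a real helper would more likely keep it as evidence.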

## Final Response Format

End with:

1. Summary of what changed.
2. Files created or modified.
3. Validation commands run and results.
4. Skipped checks and why.
5. Risks or follow-ups.
6. Whether the repo is ready for future Codex-driven work.
@@ -0,0 +1,35 @@
# Implementation Plan Template

Use this for changes that touch multiple files, alter operational behavior, or add new repository conventions.

## Goal

State the operational or maintenance outcome.

## Current State

Summarize the directories and conventions inspected.

## Scope

List files or directories expected to change.

## Non-Goals

Name what will not be redesigned, renamed, deleted, or claimed as complete.

## Plan

1. Inspect relevant scripts, playbooks, docs, and examples.
2. Make the smallest structural or documentation changes needed.
3. Update validation or runbook guidance.
4. Run focused checks.
5. Summarize residual risk and follow-ups.

## Validation

List commands to run, including fallback behavior for missing tools.

## Risks

Call out destructive operations, platform assumptions, missing lab environments, or checks that require real systems.
@@ -0,0 +1,52 @@
# Review Checklist

Use this checklist for repository reviews and pull requests.

## Safety

- Destructive actions default to dry-run or read-only.
- Real changes require explicit `--execute` and operator confirmation.
- Inputs are validated before use.
- Paths, service names, disks, volumes, and inventory targets are constrained.
- Rollback or recovery thinking is documented where the operation can change state.
|
||||||
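
The dry-run default can be sketched as a small guard around every state-changing command. This is a minimal illustration, not the repository's implementation: `parse_args`, `run_guarded`, and the `EXECUTE` variable are hypothetical names, and the interactive operator confirmation step is left out to keep the sketch non-interactive.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a dry-run guard; names are illustrative.
set -o errexit
set -o nounset
set -o pipefail

EXECUTE=0

# parse_args sets EXECUTE=1 only when --execute is passed explicitly.
# Any other argument is rejected with exit status 2 (invalid input).
parse_args() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --execute) EXECUTE=1 ;;
      *) echo "CRITICAL: unknown argument: ${arg}" >&2; return 2 ;;
    esac
  done
}

# run_guarded prints the command in dry-run mode and runs it only
# when --execute was given.
run_guarded() {
  if [ "$EXECUTE" -eq 1 ]; then
    "$@"
  else
    echo "DRY-RUN: would run: $*"
  fi
}
```

Routing every change through one function like `run_guarded` keeps the read-only path the default and makes the destructive path easy to find in review.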

## Bash

- Uses `#!/usr/bin/env bash`.
- Uses `set -o errexit`, `set -o nounset`, and `set -o pipefail`.
- Missing commands return a clear warning or invalid-input/dependency exit.
- Output uses `OK`, `WARNING`, and `CRITICAL` consistently.
- Exit codes follow repo convention: `0` OK, `1` operational issue, `2` invalid input or missing dependency.
- Help output exists for scripts that accept arguments.
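
A script that follows these items might start from a skeleton like the one below. It is a sketch under assumptions: the helper names `report`, `require_command`, and `check_usage`, and the `EXIT_*` constants, are hypothetical and only illustrate the status words and exit codes listed above.

```bash
#!/usr/bin/env bash
# Hypothetical skeleton following the conventions in this checklist.
set -o errexit
set -o nounset
set -o pipefail

readonly EXIT_OK=0      # everything healthy
readonly EXIT_ISSUE=1   # operational issue found
readonly EXIT_USAGE=2   # invalid input or missing dependency

# report LEVEL MESSAGE prints one status line.
# LEVEL is expected to be OK, WARNING, or CRITICAL.
report() {
  printf '%s: %s\n' "$1" "$2"
}

# require_command returns EXIT_USAGE when a dependency is missing.
require_command() {
  if ! command -v "$1" >/dev/null 2>&1; then
    report "CRITICAL" "required command not found: $1" >&2
    return "$EXIT_USAGE"
  fi
}

# check_usage USED LIMIT reports CRITICAL and returns EXIT_ISSUE
# when USED (an integer percentage) exceeds LIMIT.
check_usage() {
  local used="$1" limit="$2"
  if [ "$used" -gt "$limit" ]; then
    report "CRITICAL" "usage ${used}% exceeds ${limit}%"
    return "$EXIT_ISSUE"
  fi
  report "OK" "usage ${used}% within ${limit}%"
}
```

Keeping the exit codes in named constants makes it harder for a later edit to return `1` where `2` was intended.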

## Ansible

- Target hosts are explicit and appropriate for the role.
- Modules are preferred over `shell` or `command`.
- Check mode and diff mode are considered.
- Tasks are idempotent or clearly documented when a check is inherently read-only or platform-specific.
- Handlers, tags, defaults, and validation tasks are used where useful.
- Inventory, vars, and role defaults do not contain secrets or real environment data.

## Documentation

- README files explain current state without overstating completeness.
- Runbooks include scope, pre-checks, execution controls, post-checks, and evidence.
- Docs avoid tutorial filler and fake enterprise complexity.
- Important limitations are linked or documented.
- `CHANGELOG.md` is updated for meaningful repo changes.

## Operational Realism

- The change reflects RHEL/Oracle Linux, Debian/Ubuntu, AIX, Veritas, GPFS, Zabbix, ELK, Docker, Kubernetes/K3s, Terraform, VMware, or Proxmox operations accurately.
- Examples remain sanitized.
- Placeholder projects are identified as placeholders.
- There is no unnecessary abstraction or invented complexity.

## Validation

- Changed Bash scripts pass `bash -n`.
- `shellcheck` was run if available, or its absence was reported.
- Ansible syntax/lint checks were run if available and relevant.
- YAML/Markdown sanity checks were run if available.
- Failures and skipped checks are visible in the final summary.
@@ -0,0 +1,276 @@
# Task Templates

Copy the relevant section into a future Codex request and fill in the blanks.

## Operational Bash Tool

### Goal

Build or improve a Bash tool for:

### Context

Affected platform, incident, or operational workflow:

### Constraints

- Default to dry-run/read-only.
- Require `--execute` for changes.
- Use `OK`, `WARNING`, and `CRITICAL`.
- Exit `0` OK, `1` operational issue, `2` invalid input or missing dependency.

### Files/directories to inspect

- `infra-run/scripts/bash/`
- Relevant runbook or README:

### Implementation steps

1. Inspect neighboring scripts and shared helpers.
2. Add or adjust usage/help output.
3. Add discovery, pre-check, guarded change, post-check, and reporting sections where useful.
4. Update README or runbook notes.

### Validation commands

```bash
bash -n <script>
./scripts/check-bash.sh
```

### Done when

The tool is readable, safe by default, validates inputs, reports clearly, and has updated docs.

## Ansible Playbook/Role

### Goal

Add or improve Ansible automation for:

### Context

Target OS and inventory group:

### Constraints

- Preserve check-mode friendliness.
- Prefer modules over shell/command.
- Keep playbooks short.
- Keep role defaults sanitized.

### Files/directories to inspect

- `infra-run/ansible/README.md`
- `infra-run/ansible/inventory/`
- `infra-run/ansible/playbooks/`
- `infra-run/ansible/roles/`

### Implementation steps

1. Inspect existing role/playbook patterns.
2. Add defaults, tasks, handlers, and tags only where needed.
3. Add validation or post-check tasks for operational evidence.
4. Update role/playbook README.

### Validation commands

```bash
./scripts/check-ansible.sh
cd infra-run/ansible && ansible-playbook --syntax-check -i inventory/hosts.yml playbooks/<playbook>.yml
```

### Done when

The playbook targets the right hosts, is idempotent where practical, supports review with `--check --diff`, and docs explain limitations.
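
A review run with `--check --diff` can be sketched as follows. The helper `review_command` is hypothetical and only prints the command instead of running it; the inventory path comes from the validation commands above, and `cis-rhel9-hardening.yml` is used purely as an example playbook name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (not run) a review-mode ansible-playbook command.
set -o errexit
set -o nounset
set -o pipefail

# review_command PLAYBOOK [INVENTORY]
# --check asks modules to report what would change without applying it;
# --diff shows before/after differences where modules support it.
review_command() {
  local playbook="$1"
  local inventory="${2:-inventory/hosts.yml}"
  printf 'ansible-playbook --check --diff -i %s playbooks/%s\n' "$inventory" "$playbook"
}

review_command "cis-rhel9-hardening.yml"
# prints: ansible-playbook --check --diff -i inventory/hosts.yml playbooks/cis-rhel9-hardening.yml
```

The printed command is meant to be run from `infra-run/ansible` against a lab inventory. Tasks built on `shell` or `command` are often skipped in check mode, so a clean review run does not cover them.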

## Runbook

### Goal

Create or improve a runbook for:

### Context

Incident signal, platform, and affected service:

### Constraints

- Include pre-checks, decision points, rollback, post-checks, and evidence.
- Avoid pretending lab notes are production-certified.

### Files/directories to inspect

- `infra-run/runbooks/`
- `infra-run/docs/`
- Related scripts/examples:

### Implementation steps

1. Define scope and assumptions.
2. Add triage steps and command examples.
3. Add safe execution gates.
4. Add validation and handoff notes.

### Validation commands

```bash
./scripts/check-docs.sh
```

### Done when

An operator can follow the runbook without guessing the risk, inputs, or success criteria.

## Lab Scenario

### Goal

Add or improve a lab scenario for:

### Context

Technology and local environment:

### Constraints

- Mark lab-only behavior clearly.
- Keep prerequisites and cleanup explicit.

### Files/directories to inspect

- `labs/`
- `labs/docs/lab-cheatsheet.md`

### Implementation steps

1. Document prerequisites and topology.
2. Add setup, validation, failure injection if relevant, and cleanup.
3. Link related scripts or runbooks.

### Validation commands

```bash
./scripts/check-docs.sh
```

### Done when

The lab is reproducible enough to review and does not imply production readiness.

## Platform Project

### Goal

Add or improve a platform project for:

### Context

Monitoring, storage, clustering, virtualization, observability, or related topic:

### Constraints

- Keep status honest: planned, partial, lab-tested, or complete.
- Prefer operational notes over marketing language.

### Files/directories to inspect

- `platform-projects/`
- `platform-projects/docs/platform-cheatsheet.md`

### Implementation steps

1. Identify scope and current maturity.
2. Add design notes, operational workflows, and validation.
3. Link runbooks, examples, and known limitations.

### Validation commands

```bash
./scripts/check-docs.sh
```

### Done when

The project explains what exists, how to validate it, and what remains unproven.

## Documentation Cleanup

### Goal

Clean up documentation for:

### Context

Current confusion, duplication, or missing links:

### Constraints

- Preserve useful operational detail.
- Avoid tutorial-style filler.

### Files/directories to inspect

- Root `README.md`
- Section README files
- Related docs/runbooks:

### Implementation steps

1. Remove duplication where it hurts navigation.
2. Add links to canonical docs.
3. Make limitations explicit.
4. Update changelog if meaningful.

### Validation commands

```bash
./scripts/check-docs.sh
```

### Done when

Readers can find the right tool, runbook, or validation command quickly.

## Repository Review

### Goal

Review repository quality for:

### Context

Areas of concern:

### Constraints

- Findings first, ordered by severity.
- Include file/line references where possible.
- Do not rewrite unrelated content.

### Files/directories to inspect

- `AGENTS.md`
- `README.md`
- `infra-run/`
- `platform-projects/`
- `labs/`
- `scripts/`

### Implementation steps

1. Inspect structure and conventions.
2. Review safety, validation, docs, and maintainability.
3. Patch only low-risk issues if requested.
4. Report risks and follow-ups.

### Validation commands

```bash
./scripts/validate-repo.sh
git diff --stat
```

### Done when

The review identifies practical risks and leaves a clear next action list.
@@ -0,0 +1,9 @@
# Known Limitations

- Veritas scripts require manual review before real use. VxVM and VCS behavior varies by version, cluster design, naming convention, and operational policy.
- GPFS commands require a real cluster and must be adapted to the site layout, NSD naming standard, failure groups, storage pools, and maintenance process.
- The AIX Ansible role is a portfolio example unless tested on a real AIX LPAR with the target OpenSSH, sudo, audit, and OS levels.
- SSH hardening must be validated against the full `sshd` configuration, not only a managed drop-in file.
- The hardening examples cover selected controls only. They are not a full CIS benchmark implementation or compliance attestation.
- Scripts do not replace formal change procedures, peer review, backups, monitoring checks, or rollback planning.
- Sample outputs are fake and sanitized. They should be used for documentation review, not operational decisions.
+93
-26
@@ -1,34 +1,101 @@
|
|||||||
# infra-run
|
# infra-run
|
||||||
|
|
||||||
`infra-run` is the operational core of this repository. It groups automation, scripts, runbooks, and supporting documentation for Linux infrastructure work, incident response, and controlled change execution.
|
`infra-run` is a sanitized infrastructure operations project. It contains Bash, Ansible, Python, and documentation examples based on Linux administration, incident response, storage operations, hardening, prechecks, postchecks, and controlled change workflows.
|
||||||
|
|
||||||
## Diagram
|
The goal is to show operational judgment, not to ship a universal automation product.
|
||||||
|
|
||||||
```mermaid
|
## Current Contents
|
||||||
flowchart TD
|
|
||||||
A["infra-run"] --> B["ansible"]
|
### Bash Operational Scripts
|
||||||
A --> C["docs"]
|
|
||||||
A --> D["runbooks"]
|
- [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
|
||||||
A --> E["scripts"]
|
- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, JVM diagnostics, and an L2 Markdown triage report wrapper.
|
||||||
E --> E1["bash"]
|
- [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
|
||||||
E --> E2["python"]
|
- [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
|
||||||
|
- [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.
|
||||||
|
|
||||||
|
### Python Log And Reporting Tools
|
||||||
|
|
||||||
|
- [scripts/python](./scripts/python/) - read-only Python operational helpers using the standard library only.
|
||||||
|
- [scripts/python/incident-log-summary](./scripts/python/incident-log-summary/) - read-only Python log summary helper for incident pattern review.
|
||||||
|
- [scripts/python/log-diff-checker](./scripts/python/log-diff-checker/) - read-only Python before/after log comparison helper for change review.
|
||||||
|
- [scripts/python/auth-log-audit](./scripts/python/auth-log-audit/) - read-only Python authentication log audit helper for SSH, sudo, su, and PAM review.
|
||||||
|
- [scripts/python/jvm-log-analyzer](./scripts/python/jvm-log-analyzer/) - read-only Python JVM and Java application log analyzer for exception, stack trace, HTTP 5xx, database, and TLS review.
|
||||||
|
- [scripts/python/journal-analyzer](./scripts/python/journal-analyzer/) - read-only Python exported journal analyzer for failed units, restart patterns, OOM events, and service warnings.
|
||||||
|
- [scripts/python/known-error-matcher](./scripts/python/known-error-matcher/) - read-only Python matcher for local logs and JSON known-error catalogs with runbook references.
|
||||||
|
|
||||||
|
### Ansible Automation
|
||||||
|
|
||||||
|
- [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.
|
||||||
|
|
||||||
|
### Runbooks And Documentation
|
||||||
|
|
||||||
|
- [examples](./examples/) - sanitized sample command outputs and incident notes.
|
||||||
|
|
||||||
|
## Documentation
|
||||||
|
|
||||||
|
- [docs/operations-cheatsheet.md](./docs/operations-cheatsheet.md) - production operations quick reference covering Linux/Unix triage, text processing, incident workflows, networking, storage, AIX, SSL/TLS, automation safety, Ansible execution, observability, and operational habits.
|
||||||
|
|
||||||
|
## What This Is
|
||||||
|
|
||||||
|
- A portfolio project for Linux and infrastructure operations roles.
|
||||||
|
- A set of readable examples showing precheck, dry-run, execution guardrails, postcheck, and reporting patterns.
|
||||||
|
- A place to demonstrate Bash, Ansible, storage workflow, and troubleshooting habits with sanitized inputs.
|
||||||
|
|
||||||
|
## What This Is Not
|
||||||
|
|
||||||
|
- Not intended for direct live use.
|
||||||
|
- Not a complete CIS benchmark implementation.
|
||||||
|
- Not a replacement for site-specific change procedures.
|
||||||
|
- Not tested against live Veritas, GPFS, or AIX systems in this repository.
|
||||||
|
- Not safe to run blindly on servers without review.
|
||||||
|
|
||||||
|
## Currently Usable
|
||||||
|
|
||||||
|
- Bash syntax can be checked locally.
|
||||||
|
- Shell scripts can be reviewed and partially exercised on a Linux workstation when platform commands are available or mocked.
|
||||||
|
- Disk-full read-only scripts can be run against local paths for basic behavior checks.
|
||||||
|
- Python log analysis examples can be run against sanitized sample logs under each tool directory.
|
||||||
|
- Ansible YAML and role structure can be linted locally.
|
||||||
|
|
||||||
|
## Running Safely
|
||||||
|
|
||||||
|
- Start with the relevant README or runbook before executing a script.
|
||||||
|
- Prefer read-only discovery scripts before remediation scripts.
|
||||||
|
- Use dry-run mode unless a script explicitly documents safe local behavior.
|
||||||
|
- Only use `--execute` after reviewing inputs, affected systems, rollback options, and post-checks.
|
||||||
|
- For Ansible, start with `--check --diff` against a lab inventory.
|
||||||
|
|
||||||
|
## Lab-Safe Examples
|
||||||
|
|
||||||
|
- Veritas and GPFS scripts default to dry-run behavior where they plan destructive or platform-changing operations.
|
||||||
|
- Ansible hardening roles are examples of selected controls and need adaptation before use.
|
||||||
|
- Sample outputs under [examples](./examples/) are fake and sanitized.
|
||||||
|
|
||||||
|
## Tested
|
||||||
|
|
||||||
|
See [TESTED.md](./TESTED.md) for current validation status.
|
||||||
|
|
||||||
|
Short version:
|
||||||
|
|
||||||
|
- Shell scripts were reviewed for dry-run behavior and obvious quoting issues.
|
||||||
|
- YAML and Ansible files are intended for local linting.
|
||||||
|
- Veritas, GPFS, and AIX behavior was not validated against real systems here.
|
||||||
|
|
||||||
|
## Basic Validation
|
||||||
|
|
||||||
|
From the repository root:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./scripts/validate-repo.sh
|
||||||
```
|
```
|
||||||
|
|
||||||
## Scope
|
Focused checks are available in `scripts/check-bash.sh`, `scripts/check-ansible.sh`, `scripts/check-python.sh`, and `scripts/check-docs.sh`. If `ansible-lint` reports collection-related issues, install the collections listed in [ansible/collections/requirements.yml](./ansible/collections/requirements.yml) and rerun it. Treat lint as a starting point; platform testing still requires actual target systems.
|
||||||
|
|
||||||
- `ansible` - infrastructure automation with CIS-inspired hardening roles and playbooks.
|
## Supporting Notes
|
||||||
- `docs` - supporting technical notes and written documentation.
|
|
||||||
- `runbooks` - procedural operational guides.
|
|
||||||
- `scripts` - executable tooling for operations and diagnostics.
|
|
||||||
|
|
||||||
## Current Automation
|
- [SOURCE.md](./SOURCE.md) explains why this project exists and what experience shaped it.
|
||||||
|
- [TESTED.md](./TESTED.md) lists what was checked locally and what was not.
|
||||||
- RHEL 9 CIS-inspired hardening role and playbook.
|
- [KNOWN_LIMITATIONS.md](./KNOWN_LIMITATIONS.md) documents technical limits and operational cautions.
|
||||||
- Debian 13 / Ubuntu 26.04 CIS-inspired hardening role and playbook.
|
- [ROADMAP.md](./ROADMAP.md) tracks planned additions without presenting them as completed work.
|
||||||
- IBM AIX 7 CIS-inspired hardening role and playbook.
|
- [../AGENTS.md](../AGENTS.md) and [../docs/codex](../docs/codex/) document repository working rules and review expectations.
|
||||||
- Shared sanitized inventory defaults for Linux and AIX examples.
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- This folder reflects the structure of a production-oriented operations repository.
|
|
||||||
- Current implementation includes Bash operational toolkits and Ansible hardening automation.
|
|
||||||
|
|||||||
@@ -0,0 +1,31 @@
# infra-run Roadmap

This file tracks planned `infra-run` additions without presenting them as completed work.

## Candidate Additions

- More sample reports for disk pressure, service failures, and network incidents.
- A small Python parser for converting script output into a markdown change note.
- Additional Ansible molecule or container-based syntax checks where platform support is realistic.
- Standalone runbooks that reference the existing Bash workflows.
- Shared known-error pattern catalog review.
- Additional links between Python findings and existing runbooks.
- Change evidence collector for pre-check and post-check notes.
- Report examples suitable for incident and change tickets.
- Optional wrapper command only after the standalone Python tools stabilize.

## Implemented Additions

- `infra-run/scripts/bash/incident-checks/` - standalone read-only Bash checks for CPU, memory/OOM, service restart loops, failed SSH logins, TLS certificate expiry, DNS connectivity, time sync drift, read-only filesystems, inode pressure, and JVM process diagnostics.
- `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files.
- `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review.
- `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review.
- `infra-run/scripts/python/jvm-log-analyzer/` - read-only JVM and Java application log analyzer for exceptions, stack traces, HTTP 5xx entries, database issues, TLS failures, and JVM failure symptoms.
- `infra-run/scripts/python/journal-analyzer/` - read-only exported `journalctl` text analyzer for summarizing failed units, dependency issues, restart patterns, OOM findings, disk/filesystem symptoms, and related service warnings.
- `infra-run/scripts/python/known-error-matcher/` - read-only known-error matcher for local logs and JSON pattern catalogs with severity, category, samples, and runbook references.

## Not Planned

- A full compliance benchmark implementation.
- Automated production changes without review gates.
- Vendor-specific storage actions that cannot be tested in a lab.
@@ -0,0 +1,37 @@
# Source And Intent

`infra-run` exists to present infrastructure operations work in a form that can be reviewed without exposing employer systems, hostnames, storage identifiers, tickets, or internal procedures.

The project is inspired by professional Linux and infrastructure operations work: prechecks before changes, postchecks after changes, disk-pressure incidents, SSH and sudo hardening, storage expansion planning, cluster awareness, and the need to leave clear notes for other engineers.

## What Is Realistic

- The workflow shape: precheck, dry-run, execute only with explicit approval, postcheck, and report.
- The operational topics: Linux health checks, disk-full triage, Veritas VxVM/VCS concepts, GPFS / IBM Spectrum Scale concepts, and selected OS hardening controls.
- The caution around storage, clustering, SSH, sudo, audit, and filesystem changes.

## What Is Simplified

- Commands are written as examples and do not cover every vendor, OS release, package layout, or site standard.
- The Veritas and GPFS scripts model common workflow steps but cannot validate a real cluster from this repository.
- The Ansible roles apply selected baseline controls; they are not full compliance implementations.
- Reporting examples use sanitized sample data.

## What Was Sanitized

- Hostnames, IP addresses, disk names, WWNs, ticket numbers, application names, company names, and environment-specific values.
- Exact production procedures and internal approval paths.
- Any data that could identify a real system or organization.

## Production Caution

Do not run these scripts blindly on production systems. Review every command, adapt variables and paths, test in a lab, confirm backups and rollback plans, and follow the local change process.

This project does not claim that the exact scripts were used in production.

## Roles This Supports

- Linux System Administrator
- Infrastructure Engineer
- SRE / DevOps Operations Engineer
- Linux Platform Engineer
@@ -0,0 +1,44 @@
# Tested

This file documents the validation status for `infra-run`.

## Tested Locally

- Repository structure and documentation links were reviewed.
- Bash scripts were reviewed for dry-run defaults, quoting, and obvious unsafe cleanup behavior.
- Disk-full examples use fake data and can be read without access to production systems.

## Syntax Checked

Recommended local checks:

```bash
find infra-run/scripts/bash -name '*.sh' -print0 | xargs -0 shellcheck -x -P infra-run/scripts/bash/disk-full -P infra-run/scripts/bash/gpfs -P infra-run/scripts/bash/veritas
yamllint .
cd infra-run/ansible && ansible-lint playbooks roles
```

The GitHub Actions workflow runs shell and YAML validation. `ansible-lint` is non-blocking because role behavior depends on platform facts, installed collections, and target OS support.

## Not Tested Against Real Systems

- Veritas VxVM/VCS commands were not tested against a live Veritas cluster here.
- GPFS / IBM Spectrum Scale commands were not tested against a live GPFS cluster here.
- AIX hardening tasks were not tested against a real AIX LPAR here.
- SSH hardening was not validated across every possible `sshd_config` layout.

## Known Limitations

- Destructive storage operations are dry-run by default where applicable, but dry-run output is not a substitute for peer review.
- Some scripts require vendor commands that are not available on a normal Linux workstation.
- Ansible examples are selected baseline controls, not full hardening benchmarks.
- Local linting does not prove production safety.

## Suggested Validation Steps

1. Run `shellcheck` against all Bash scripts.
2. Run `yamllint` against repository YAML.
3. Run `cd infra-run/ansible && ansible-lint playbooks roles` and review any non-blocking warnings.
4. Run disk-full read-only scripts on disposable local paths.
5. For Veritas or GPFS, test only in a lab with fake volumes/disks or a controlled training environment.
6. Validate SSH changes on a disposable host using the full effective `sshd` configuration.
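
For the last step, OpenSSH's `sshd -T` prints the effective configuration as lowercase `keyword value` lines, which makes it possible to check a setting after all included files are merged. The helper below is a hypothetical sketch (the name `effective_value` is not part of this repository) that reads such output on stdin.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: read one setting from `sshd -T` style output.
set -o errexit
set -o nounset
set -o pipefail

# effective_value KEY reads "keyword value" lines on stdin and prints the
# value of the first line whose keyword matches KEY (case-insensitive).
# Prints nothing when the keyword is absent.
effective_value() {
  local key
  key="$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
  awk -v key="$key" '
    tolower($1) == key && !seen {
      seen = 1
      $1 = ""            # drop the keyword, keep the rest of the line
      sub(/^ +/, "")     # trim the separator left behind
      print
    }'
}

# On a disposable host, typically as root:
#   sshd -T | effective_value PermitRootLogin
```

Settings inside `Match` blocks depend on the connection, so checking those would likely need `sshd -T -C` with a connection specification; that case is not covered by this sketch.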
@@ -19,7 +19,7 @@ flowchart TD
|
|||||||
|
|
||||||
- `collections` - collection requirements for supported automation targets.
|
- `collections` - collection requirements for supported automation targets.
|
||||||
- `inventory` - sanitized Linux and AIX inventory examples with shared defaults.
|
- `inventory` - sanitized Linux and AIX inventory examples with shared defaults.
|
||||||
- `playbooks` - executable CIS-inspired hardening playbooks.
|
- `playbooks` - executable selected baseline hardening playbooks.
|
||||||
- `roles` - reusable hardening roles for supported operating systems.
|
- `roles` - reusable hardening roles for supported operating systems.
|
||||||
- `tests` - validation and test harnesses for Ansible content.
|
- `tests` - validation and test harnesses for Ansible content.
|
||||||
|
|
||||||
@@ -31,6 +31,8 @@ flowchart TD
|
|||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- Roles are CIS-inspired examples intended for portfolio and lab use, not a drop-in compliance certification.
|
- Roles are selected baseline examples intended for portfolio and lab use, not a drop-in compliance certification.
|
||||||
- Defaults are sanitized and configurable through inventory or `--extra-vars`.
|
- Defaults are sanitized and configurable through inventory or `--extra-vars`.
|
||||||
- Run platform-specific playbooks against appropriate test hosts before adapting them to production environments.
|
- Run platform-specific playbooks against appropriate test hosts before adapting them to managed environments.
|
||||||
|
- Prefer `--check --diff` for review runs before applying changes.
|
||||||
|
- Validate from the repository root with `./scripts/check-ansible.sh`.
|
||||||
|
|||||||
@@ -14,7 +14,7 @@ flowchart TD
|
|||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- `cis-rhel9-hardening.yml` applies the RHEL 9 CIS-inspired hardening role to Linux inventory targets.
|
- `cis-rhel9-hardening.yml` applies the RHEL 9 selected baseline hardening role to Linux inventory targets.
|
||||||
- `cis-debian-ubuntu-hardening.yml` applies the Debian 13 / Ubuntu 26.04 CIS-inspired hardening role to Linux inventory targets.
|
- `cis-debian-ubuntu-hardening.yml` applies the Debian 13 / Ubuntu 26.04 selected baseline hardening role to Linux inventory targets.
|
||||||
- `cis-aix7-hardening.yml` applies the IBM AIX 7 CIS-inspired hardening role to AIX inventory targets.
|
- `cis-aix7-hardening.yml` applies the IBM AIX 7 selected baseline hardening role to AIX inventory targets.
|
||||||
- Use the sanitized inventory under `../inventory/` as a starting point and override defaults per environment.
|
- Use the sanitized inventory under `../inventory/` as a starting point and override defaults per environment.
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
---
|
---
|
||||||
- name: Apply CIS-inspired IBM AIX 7 hardening controls
|
- name: Apply selected baseline IBM AIX 7 hardening controls
|
||||||
hosts: aix
|
hosts: aix
|
||||||
become: true
|
become: true
|
||||||
gather_facts: true
|
gather_facts: true
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
---
|
---
|
||||||
- name: Apply CIS-inspired Debian and Ubuntu hardening controls
|
- name: Apply selected baseline Debian and Ubuntu hardening controls
|
||||||
hosts: linux
|
hosts: linux
|
||||||
become: true
|
become: true
|
||||||
gather_facts: true
|
gather_facts: true
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
---
|
---
|
||||||
- name: Apply CIS-inspired RHEL 9 hardening controls
|
- name: Apply selected baseline RHEL 9 hardening controls
|
||||||
hosts: linux
|
hosts: linux
|
||||||
become: true
|
become: true
|
||||||
gather_facts: true
|
gather_facts: true
|
||||||
|
|||||||
@@ -17,11 +17,11 @@ flowchart TD
|
|||||||
|
|
||||||
## Current Roles
|
## Current Roles
|
||||||
|
|
||||||
- `cis-rhel9-hardening` - CIS-inspired RHEL 9 baseline with package, service, SSH, sudo, sysctl, audit, logging, filesystem, and validation tasks.
|
- `cis-rhel9-hardening` - RHEL 9 baseline example with package, service, SSH, sudo, sysctl, audit, logging, filesystem, and validation tasks.
|
||||||
- `cis-debian-ubuntu-hardening` - CIS-inspired Debian 13 and Ubuntu 26.04 baseline with apt, service, SSH, sudo, sysctl, audit, logging, filesystem, and validation tasks.
|
- `cis-debian-ubuntu-hardening` - Debian 13 and Ubuntu 26.04 baseline example with apt, service, SSH, sudo, sysctl, audit, logging, filesystem, and validation tasks.
|
||||||
- `cis-aix7-hardening` - CIS-inspired IBM AIX 7 baseline with SSH, sudo, audit, logging, cron, user, password, network, filesystem, service, and validation tasks.
|
- `cis-aix7-hardening` - IBM AIX 7 baseline example with SSH, sudo, audit, logging, cron, user, password, network, filesystem, service, and validation tasks.
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- Each role includes defaults, task includes, handlers where needed, and role-specific README guidance.
|
- Each role includes defaults, task includes, handlers where needed, and role-specific README guidance.
|
||||||
- The hardening content is sanitized for portfolio use and should be reviewed against site policy before production use.
|
- The hardening content is sanitized for portfolio use and should be reviewed against site policy before live use.
|
||||||
|
|||||||
@@ -1,10 +1,10 @@
|
|||||||
# cis-aix7-hardening
|
# cis-aix7-hardening
|
||||||
|
|
||||||
Operational IBM AIX 7.x hardening role inspired by CIS Benchmark 1.2.0 and common enterprise Unix security practices.
|
Operational IBM AIX 7.x hardening role inspired by CIS Benchmark 1.2.0 and common Unix security practices.
|
||||||
|
|
||||||
Reference: https://www.cisecurity.org/benchmark/aix
|
Reference: https://www.cisecurity.org/benchmark/aix
|
||||||
|
|
||||||
This role is intended for infrastructure and security operations teams that manage production AIX estates. It favors readable, conservative controls over broad benchmark coverage.
|
This role is intended for infrastructure and security operations teams that manage AIX estates. It favors readable, conservative controls over broad benchmark coverage.
|
||||||
|
|
||||||
## Supported OS
|
## Supported OS
|
||||||
|
|
||||||
@@ -27,7 +27,7 @@ This role is intended for infrastructure and security operations teams that mana
|
|||||||
|
|
||||||
AIX is not Linux. This role does not assume systemd, sysctl, Linux package managers, or Linux service paths. Service operations use SRC commands such as `lssrc`, `startsrc`, `stopsrc`, and `refresh`.
|
AIX is not Linux. This role does not assume systemd, sysctl, Linux package managers, or Linux service paths. Service operations use SRC commands such as `lssrc`, `startsrc`, `stopsrc`, and `refresh`.
|
||||||
|
|
||||||
AIX environments vary heavily between enterprises. Filesystem layout, OpenSSH source, sudo packaging, audit classes, NFS tuning, and security policy ownership should be validated before production rollout.
|
AIX configurations vary heavily between environments. Filesystem layout, OpenSSH source, sudo packaging, audit classes, NFS tuning, and security policy ownership should be validated before managed rollout.
|
||||||
|
|
||||||
## Safety Philosophy
|
## Safety Philosophy
|
||||||
|
|
||||||
@@ -64,4 +64,4 @@ ansible-playbook playbooks/cis-aix7-hardening.yml --tags audit -e cis_enable_aud
|
|||||||
|
|
||||||
## Important Warning
|
## Important Warning
|
||||||
|
|
||||||
This is not a full CIS certification implementation and does not implement the entire CIS AIX benchmark. It is a practical CIS-inspired baseline that should be reviewed by infrastructure, security, and application owners before production enforcement.
|
This is not a full compliance certification implementation and does not implement the entire CIS AIX benchmark. It is a practical baseline example that should be reviewed by infrastructure, security, and application owners before managed enforcement.
|
||||||
|
|||||||
@@ -18,7 +18,7 @@
|
|||||||
ansible.builtin.debug:
|
ansible.builtin.debug:
|
||||||
msg: >-
|
msg: >-
|
||||||
OK: Mount option management is disabled by default.
|
OK: Mount option management is disabled by default.
|
||||||
Review target {{ item.path }} for options {{ item.options | join(', ') }} before production rollout.
|
Review target {{ item.path }} for options {{ item.options | join(', ') }} before managed rollout.
|
||||||
loop: "{{ cis_mount_option_targets }}"
|
loop: "{{ cis_mount_option_targets }}"
|
||||||
when: not cis_manage_mount_options | bool
|
when: not cis_manage_mount_options | bool
|
||||||
|
|
||||||
|
|||||||
@@ -54,5 +54,5 @@
|
|||||||
if cis_aix_post_sshd.rc == 0 else 'CRITICAL: sshd validation failed; review SSH config before restarting sessions.' }}
|
if cis_aix_post_sshd.rc == 0 else 'CRITICAL: sshd validation failed; review SSH config before restarting sessions.' }}
|
||||||
- "OK: Service states: {{ cis_aix_validation_summary.service_states }}"
|
- "OK: Service states: {{ cis_aix_validation_summary.service_states }}"
|
||||||
- "OK: Password policy summary: {{ cis_aix_validation_summary.password_policy }}"
|
- "OK: Password policy summary: {{ cis_aix_validation_summary.password_policy }}"
|
||||||
- "WARNING: This role is CIS-inspired and does not represent a complete CIS certification implementation."
|
- "WARNING: This role is a selected baseline and does not represent a complete compliance certification implementation."
|
||||||
- "{{ cis_aix_validation_summary.recommendations }}"
|
- "{{ cis_aix_validation_summary.recommendations }}"
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
# CIS-Inspired Debian and Ubuntu Hardening
|
# Debian And Ubuntu Baseline Hardening Role
|
||||||
|
|
||||||
This role applies a small, practical set of CIS-inspired operational hardening controls for Debian and Ubuntu servers. It is intentionally readable, conservative, and suitable as a baseline for production environments that still need local review.
|
This role applies a small, practical set of selected baseline operational hardening controls for Debian and Ubuntu servers. It is intentionally readable, conservative, and suitable as a baseline for managed environments that still need local review.
|
||||||
|
|
||||||
## Supported OS
|
## Supported OS
|
||||||
|
|
||||||
@@ -11,7 +11,7 @@ Unsupported distributions and versions fail during precheck before hardening tas
|
|||||||
|
|
||||||
## Implemented Areas
|
## Implemented Areas
|
||||||
|
|
||||||
- SSH daemon hardening with a validated drop-in configuration
|
- SSH daemon hardening through a managed drop-in and final `sshd -t` validation
|
||||||
- Legacy network package removal
|
- Legacy network package removal
|
||||||
- Optional installation and enablement of `auditd`, `chrony`, `rsyslog`, and `sudo`
|
- Optional installation and enablement of `auditd`, `chrony`, `rsyslog`, and `sudo`
|
||||||
- Kernel network sysctl hardening
|
- Kernel network sysctl hardening
|
||||||
@@ -31,7 +31,7 @@ The defaults are intended to be operationally safe:
|
|||||||
- Services are enabled only when the matching feature is enabled and the service exists.
|
- Services are enabled only when the matching feature is enabled and the service exists.
|
||||||
- Existing logging configuration is not replaced.
|
- Existing logging configuration is not replaced.
|
||||||
|
|
||||||
This role does not implement the full CIS benchmark and is not a CIS certification implementation.
|
This role does not implement the full CIS benchmark and is not a compliance certification implementation.
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
|
|||||||
@@ -37,12 +37,16 @@
|
|||||||
ansible.builtin.set_fact:
|
ansible.builtin.set_fact:
|
||||||
cis_package_validation_summary:
|
cis_package_validation_summary:
|
||||||
legacy_absent: "{{ cis_legacy_packages | difference(ansible_facts.packages.keys() | list) }}"
|
legacy_absent: "{{ cis_legacy_packages | difference(ansible_facts.packages.keys() | list) }}"
|
||||||
hardening_present: "{{ (cis_enabled_hardening_packages | default(cis_hardening_packages)) | intersect(ansible_facts.packages.keys() | list) }}"
|
hardening_present: >-
|
||||||
|
{{ (cis_enabled_hardening_packages | default(cis_hardening_packages))
|
||||||
|
| intersect(ansible_facts.packages.keys() | list) }}
|
||||||
audit_present: "{{ cis_audit_packages | intersect(ansible_facts.packages.keys() | list) }}"
|
audit_present: "{{ cis_audit_packages | intersect(ansible_facts.packages.keys() | list) }}"
|
||||||
|
|
||||||
- name: Build sysctl validation summary
|
- name: Build sysctl validation summary
|
||||||
ansible.builtin.set_fact:
|
ansible.builtin.set_fact:
|
||||||
cis_sysctl_validation_summary: "{{ cis_sysctl_validation_summary | default({}) | combine({item.item.key: item.stdout | default('unreadable')}) }}"
|
cis_sysctl_validation_summary: >-
|
||||||
|
{{ cis_sysctl_validation_summary | default({})
|
||||||
|
| combine({item.item.key: item.stdout | default('unreadable')}) }}
|
||||||
loop: "{{ cis_sysctl_validation.results | default([]) }}"
|
loop: "{{ cis_sysctl_validation.results | default([]) }}"
|
||||||
loop_control:
|
loop_control:
|
||||||
label: "{{ item.item.key }}"
|
label: "{{ item.item.key }}"
|
||||||
@@ -65,7 +69,7 @@
|
|||||||
- name: Publish validation summary
|
- name: Publish validation summary
|
||||||
ansible.builtin.set_fact:
|
ansible.builtin.set_fact:
|
||||||
cis_validation_summary:
|
cis_validation_summary:
|
||||||
benchmark: "CIS-inspired controls for Debian 13 Trixie and Ubuntu Server 26.04 LTS"
|
benchmark: "selected controls for Debian 13 Trixie and Ubuntu Server 26.04 LTS"
|
||||||
sshd_config: "{{ 'OK' if cis_sshd_validate.rc == 0 else 'CRITICAL' }}"
|
sshd_config: "{{ 'OK' if cis_sshd_validate.rc == 0 else 'CRITICAL' }}"
|
||||||
services: "{{ cis_service_state_summary }}"
|
services: "{{ cis_service_state_summary }}"
|
||||||
packages: "{{ cis_package_validation_summary }}"
|
packages: "{{ cis_package_validation_summary }}"
|
||||||
|
|||||||
@@ -33,7 +33,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^PermitRootLogin\s+'
|
regexp: '^PermitRootLogin\s+'
|
||||||
line: "PermitRootLogin {{ 'no' if cis_disable_root_login | bool else 'prohibit-password' }}"
|
line: "PermitRootLogin {{ 'no' if cis_disable_root_login | bool else 'prohibit-password' }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate ssh
|
- validate ssh
|
||||||
- restart ssh
|
- restart ssh
|
||||||
@@ -43,7 +42,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^PermitEmptyPasswords\s+'
|
regexp: '^PermitEmptyPasswords\s+'
|
||||||
line: "PermitEmptyPasswords no"
|
line: "PermitEmptyPasswords no"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate ssh
|
- validate ssh
|
||||||
- restart ssh
|
- restart ssh
|
||||||
@@ -53,7 +51,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^PasswordAuthentication\s+'
|
regexp: '^PasswordAuthentication\s+'
|
||||||
line: "PasswordAuthentication {{ 'no' if cis_disable_password_auth | bool else 'yes' }}"
|
line: "PasswordAuthentication {{ 'no' if cis_disable_password_auth | bool else 'yes' }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate ssh
|
- validate ssh
|
||||||
- restart ssh
|
- restart ssh
|
||||||
@@ -63,7 +60,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^MaxAuthTries\s+'
|
regexp: '^MaxAuthTries\s+'
|
||||||
line: "MaxAuthTries {{ cis_ssh_max_auth_tries }}"
|
line: "MaxAuthTries {{ cis_ssh_max_auth_tries }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate ssh
|
- validate ssh
|
||||||
- restart ssh
|
- restart ssh
|
||||||
@@ -73,7 +69,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^LoginGraceTime\s+'
|
regexp: '^LoginGraceTime\s+'
|
||||||
line: "LoginGraceTime {{ cis_ssh_login_grace_time }}"
|
line: "LoginGraceTime {{ cis_ssh_login_grace_time }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate ssh
|
- validate ssh
|
||||||
- restart ssh
|
- restart ssh
|
||||||
@@ -83,7 +78,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^ClientAliveInterval\s+'
|
regexp: '^ClientAliveInterval\s+'
|
||||||
line: "ClientAliveInterval {{ cis_ssh_client_alive_interval }}"
|
line: "ClientAliveInterval {{ cis_ssh_client_alive_interval }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate ssh
|
- validate ssh
|
||||||
- restart ssh
|
- restart ssh
|
||||||
@@ -93,7 +87,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^ClientAliveCountMax\s+'
|
regexp: '^ClientAliveCountMax\s+'
|
||||||
line: "ClientAliveCountMax {{ cis_ssh_client_alive_count_max }}"
|
line: "ClientAliveCountMax {{ cis_ssh_client_alive_count_max }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate ssh
|
- validate ssh
|
||||||
- restart ssh
|
- restart ssh
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
---
|
---
|
||||||
- name: Apply CIS-inspired sysctl settings
|
- name: Apply selected sysctl settings
|
||||||
ansible.posix.sysctl:
|
ansible.posix.sysctl:
|
||||||
name: "{{ item.key }}"
|
name: "{{ item.key }}"
|
||||||
value: "{{ item.value }}"
|
value: "{{ item.value }}"
|
||||||
|
|||||||
@@ -1,8 +1,8 @@
|
|||||||
# CIS-Inspired RHEL 9 Hardening Role
|
# RHEL 9 Baseline Hardening Role
|
||||||
|
|
||||||
This role provides a practical, production-style hardening baseline for RHEL 9 and Oracle Linux 9 systems. It is inspired by CIS Benchmark controls for Red Hat Enterprise Linux 9 version 2.0.0, but it is intentionally scoped to common operational controls that infrastructure and security operations teams frequently automate.
|
This role provides a practical baseline hardening example for RHEL 9 and Oracle Linux 9 systems. It is inspired by hardening benchmark controls for Red Hat Enterprise Linux 9 version 2.0.0, but it is intentionally scoped to common operational controls that infrastructure and security operations teams frequently automate.
|
||||||
|
|
||||||
This is not a full CIS certification implementation.
|
This is not a full compliance certification implementation.
|
||||||
|
|
||||||
## Supported Platforms
|
## Supported Platforms
|
||||||
|
|
||||||
@@ -16,7 +16,7 @@ The role fails safely on unsupported operating systems or unsupported major vers
|
|||||||
- SSH daemon hardening for root login, empty passwords, password authentication, retry limits, login grace time, and client keepalive behavior.
|
- SSH daemon hardening for root login, empty passwords, password authentication, retry limits, login grace time, and client keepalive behavior.
|
||||||
- Removal of selected legacy network packages such as telnet, rsh-server, and ypbind.
|
- Removal of selected legacy network packages such as telnet, rsh-server, and ypbind.
|
||||||
- Optional installation and enablement of chrony, auditd, and rsyslog.
|
- Optional installation and enablement of chrony, auditd, and rsyslog.
|
||||||
- CIS-inspired IPv4 network sysctl settings.
|
- Selected IPv4 network sysctl settings.
|
||||||
- Service enablement for chronyd, auditd, and rsyslog.
|
- Service enablement for chronyd, auditd, and rsyslog.
|
||||||
- Safe disabling of known legacy services when they are present.
|
- Safe disabling of known legacy services when they are present.
|
||||||
- Basic audit backlog and audit rule examples.
|
- Basic audit backlog and audit rule examples.
|
||||||
@@ -26,9 +26,9 @@ The role fails safely on unsupported operating systems or unsupported major vers
|
|||||||
|
|
||||||
## Safety Philosophy
|
## Safety Philosophy
|
||||||
|
|
||||||
The defaults are conservative. The role supports Ansible check mode and avoids destructive production behavior by default. Filesystem mount option management is disabled unless `cis_manage_mount_options` is explicitly enabled, and even then the role persists configured options without remounting live filesystems.
|
The defaults are conservative. The role supports Ansible check mode and avoids destructive live-system behavior by default. Filesystem mount option management is disabled unless `cis_manage_mount_options` is explicitly enabled, and even then the role persists configured options without remounting live filesystems.
|
||||||
|
|
||||||
Review variables before using this role in production.
|
Review variables before adapting this role to managed hosts.
|
||||||
|
|
||||||
## Common Variables
|
## Common Variables
|
||||||
|
|
||||||
@@ -78,6 +78,6 @@ Example:
|
|||||||
ansible-playbook playbooks/cis-rhel9-hardening.yml --tags precheck,ssh,postcheck
|
ansible-playbook playbooks/cis-rhel9-hardening.yml --tags precheck,ssh,postcheck
|
||||||
```
|
```
|
||||||
|
|
||||||
## Production Rollout Notes
|
## Rollout Notes
|
||||||
|
|
||||||
This role is a hardening starting point for internal infrastructure teams. It should be reviewed against local access patterns, break-glass procedures, compliance requirements, monitoring expectations, and host build standards before rollout.
|
This role is a hardening starting point for internal infrastructure teams. It should be reviewed against local access patterns, break-glass procedures, compliance requirements, monitoring expectations, and host build standards before rollout.
|
||||||
|
|||||||
@@ -28,7 +28,9 @@
|
|||||||
|
|
||||||
- name: Build sysctl validation summary
|
- name: Build sysctl validation summary
|
||||||
ansible.builtin.set_fact:
|
ansible.builtin.set_fact:
|
||||||
cis_sysctl_validation_summary: "{{ cis_sysctl_validation_summary | default({}) | combine({item.item.key: item.stdout | default('unreadable')}) }}"
|
cis_sysctl_validation_summary: >-
|
||||||
|
{{ cis_sysctl_validation_summary | default({})
|
||||||
|
| combine({item.item.key: item.stdout | default('unreadable')}) }}
|
||||||
loop: "{{ cis_sysctl_validation.results | default([]) }}"
|
loop: "{{ cis_sysctl_validation.results | default([]) }}"
|
||||||
loop_control:
|
loop_control:
|
||||||
label: "{{ item.item.key }}"
|
label: "{{ item.item.key }}"
|
||||||
|
|||||||
@@ -22,7 +22,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^PermitRootLogin\s+'
|
regexp: '^PermitRootLogin\s+'
|
||||||
line: "PermitRootLogin {{ 'no' if cis_disable_root_login | bool else 'prohibit-password' }}"
|
line: "PermitRootLogin {{ 'no' if cis_disable_root_login | bool else 'prohibit-password' }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate sshd
|
- validate sshd
|
||||||
- reload sshd
|
- reload sshd
|
||||||
@@ -32,7 +31,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^PermitEmptyPasswords\s+'
|
regexp: '^PermitEmptyPasswords\s+'
|
||||||
line: "PermitEmptyPasswords no"
|
line: "PermitEmptyPasswords no"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate sshd
|
- validate sshd
|
||||||
- reload sshd
|
- reload sshd
|
||||||
@@ -42,7 +40,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^PasswordAuthentication\s+'
|
regexp: '^PasswordAuthentication\s+'
|
||||||
line: "PasswordAuthentication {{ 'no' if cis_disable_password_auth | bool else 'yes' }}"
|
line: "PasswordAuthentication {{ 'no' if cis_disable_password_auth | bool else 'yes' }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate sshd
|
- validate sshd
|
||||||
- reload sshd
|
- reload sshd
|
||||||
@@ -52,7 +49,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^MaxAuthTries\s+'
|
regexp: '^MaxAuthTries\s+'
|
||||||
line: "MaxAuthTries {{ cis_ssh_max_auth_tries }}"
|
line: "MaxAuthTries {{ cis_ssh_max_auth_tries }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate sshd
|
- validate sshd
|
||||||
- reload sshd
|
- reload sshd
|
||||||
@@ -62,7 +58,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^LoginGraceTime\s+'
|
regexp: '^LoginGraceTime\s+'
|
||||||
line: "LoginGraceTime {{ cis_ssh_login_grace_time }}"
|
line: "LoginGraceTime {{ cis_ssh_login_grace_time }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate sshd
|
- validate sshd
|
||||||
- reload sshd
|
- reload sshd
|
||||||
@@ -72,7 +67,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^ClientAliveInterval\s+'
|
regexp: '^ClientAliveInterval\s+'
|
||||||
line: "ClientAliveInterval {{ cis_ssh_client_alive_interval }}"
|
line: "ClientAliveInterval {{ cis_ssh_client_alive_interval }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate sshd
|
- validate sshd
|
||||||
- reload sshd
|
- reload sshd
|
||||||
@@ -82,7 +76,6 @@
|
|||||||
path: "{{ cis_ssh_dropin_path }}"
|
path: "{{ cis_ssh_dropin_path }}"
|
||||||
regexp: '^ClientAliveCountMax\s+'
|
regexp: '^ClientAliveCountMax\s+'
|
||||||
line: "ClientAliveCountMax {{ cis_ssh_client_alive_count_max }}"
|
line: "ClientAliveCountMax {{ cis_ssh_client_alive_count_max }}"
|
||||||
validate: sshd -t -f %s
|
|
||||||
notify:
|
notify:
|
||||||
- validate sshd
|
- validate sshd
|
||||||
- reload sshd
|
- reload sshd
|
||||||
|
|||||||
@@ -1,5 +1,5 @@
|
|||||||
---
|
---
|
||||||
- name: Apply CIS-inspired sysctl settings
|
- name: Apply selected sysctl settings
|
||||||
ansible.posix.sysctl:
|
ansible.posix.sysctl:
|
||||||
name: "{{ item.key }}"
|
name: "{{ item.key }}"
|
||||||
value: "{{ item.value }}"
|
value: "{{ item.value }}"
|
||||||
|
|||||||
@@ -1,17 +1,10 @@
|
|||||||
# infra-run/docs
|
# docs
|
||||||
|
|
||||||
This directory is intended for supporting technical documentation tied to the operational tooling in `infra-run`. It is the natural home for implementation notes, architecture writeups, and operational reference material.
|
Planned area for longer technical notes.
|
||||||
|
|
||||||
## Diagram
|
Current documentation lives in the project README files plus:
|
||||||
|
|
||||||
```mermaid
|
- [SOURCE.md](../SOURCE.md)
|
||||||
flowchart TD
|
- [TESTED.md](../TESTED.md)
|
||||||
A["docs"] --> B["Architecture notes"]
|
- [KNOWN_LIMITATIONS.md](../KNOWN_LIMITATIONS.md)
|
||||||
A --> C["Operational references"]
|
- [ROADMAP.md](../ROADMAP.md)
|
||||||
A --> D["Change preparation notes"]
|
|
||||||
```
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- The folder currently contains only a placeholder file.
|
|
||||||
- It complements `runbooks` by focusing on reference material rather than step-by-step execution flows.
|
|
||||||
|
|||||||
@@ -0,0 +1,857 @@
|
|||||||
|
# Production Operations Cheatsheet
|
||||||
|
|
||||||
|
Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.
|
||||||
|
|
||||||
|
## Linux / Unix Daily Operations
|
||||||
|
|
||||||
|
### Uptime and Host State
|
||||||
|
|
||||||
|
Check host age, kernel, clock, and recent reboot history before touching anything:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uptime
|
||||||
|
uname -r
|
||||||
|
hostnamectl
|
||||||
|
timedatectl
|
||||||
|
who -b
|
||||||
|
last -x | head -20
|
||||||
|
```
|
||||||
|
|
||||||
|
Pre-check pattern:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
date -u
|
||||||
|
uptime
|
||||||
|
df -h
|
||||||
|
free -m
|
||||||
|
systemctl --failed
|
||||||
|
```
|
||||||
|
|
||||||
|
### Process Management
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ps -ef | head
|
||||||
|
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
|
||||||
|
pgrep -a java
|
||||||
|
pstree -ap | less
|
||||||
|
pidof sshd
|
||||||
|
renice +5 -p <pid>
|
||||||
|
kill -TERM <pid>
|
||||||
|
kill -9 <pid> # DANGEROUS: last resort only
|
||||||
|
```
|
||||||
|
|
||||||
|
Validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ps -p <pid> -o pid,stat,etime,cmd
|
||||||
|
journalctl -u <service> -n 50 --no-pager
|
||||||
|
```
|
||||||
|
|
||||||
|
### systemctl
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status <service> --no-pager -l
|
||||||
|
systemctl is-active <service>
|
||||||
|
systemctl is-enabled <service>
|
||||||
|
systemctl list-units --type=service --state=running
|
||||||
|
systemctl list-units --failed
|
||||||
|
systemctl daemon-reload
|
||||||
|
systemctl restart <service> # impact: confirm service interruption policy first
|
||||||
|
```
|
||||||
|
|
||||||
|
### journalctl
|
||||||
|
|
||||||
|
```bash
|
||||||
|
journalctl -u <service> -n 100 --no-pager
|
||||||
|
journalctl -u <service> --since '30 min ago'
|
||||||
|
journalctl -p err -S today
|
||||||
|
journalctl -k -b
|
||||||
|
journalctl --disk-usage
|
||||||
|
```
|
||||||
|
|
||||||
|
### Service Troubleshooting Flow
|
||||||
|
|
||||||
|
1. Confirm service state and recent restart count.
|
||||||
|
2. Read the last 100-200 journal lines.
|
||||||
|
3. Validate config syntax before restart if the daemon supports it.
|
||||||
|
4. Check dependent ports, mounts, credentials, and name resolution.
|
||||||
|
5. Restart only after cause is understood or rollback exists.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status nginx --no-pager -l
|
||||||
|
journalctl -u nginx -n 100 --no-pager
|
||||||
|
nginx -t
|
||||||
|
ss -ltnp | grep ':80\|:443'
|
||||||
|
curl -kI https://127.0.0.1/
|
||||||
|
```
|
||||||
|
|
||||||
|
### CPU and Memory Diagnostics
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uptime
|
||||||
|
top -H -b -n 1 | head -40
|
||||||
|
pidstat 1 5
|
||||||
|
pidstat -ru -p ALL 1 3
|
||||||
|
vmstat 1 5
|
||||||
|
iostat -xz 1 5
|
||||||
|
free -m
|
||||||
|
sar -q 1 5
|
||||||
|
```
|
||||||
|
|
||||||
|
Quick interpretation:
|
||||||
|
|
||||||
|
- high `%wa`: storage path or NFS issue
|
||||||
|
- high run queue with low CPU idle: CPU contention
|
||||||
|
- swap growth plus page scans: memory pressure
|
||||||
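The rules of thumb above can be sketched as a small filter over one `vmstat` data row. This is an illustration only: the function name `classify_vmstat_row`, the thresholds, and the use of swap-in/swap-out activity as a stand-in for "swap growth plus page scans" are assumptions, and the column positions assume the default procps `vmstat` layout (`r b swpd free buff cache si so bi bo in cs us sy id wa st`).

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify one vmstat data row using the rules of thumb above.
# Thresholds (wa >= 20, id <= 10, any si/so activity) are illustrative assumptions.
# $1 = number of CPUs, $2 = one data row from default procps vmstat output.
classify_vmstat_row() {
  local ncpu="$1" row="$2"
  awk -v ncpu="$ncpu" '{
    r = $1; si = $7; so = $8; id = $15; wa = $16
    if (wa >= 20)             print "io-wait: check storage path or NFS"
    if (r > ncpu && id <= 10) print "cpu-contention: run queue exceeds CPU count"
    if (si + so > 0)          print "memory-pressure: swap activity present"
  }' <<<"$row"
}
```

One way to use it is to pass the last sample of a short run, for example `classify_vmstat_row "$(nproc)" "$(vmstat 1 5 | tail -1)"`. An empty result means none of the three patterns matched at these thresholds.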
|
|
||||||
|
### Disk Usage
|
||||||
|
|
||||||
|
```bash
|
||||||
|
df -hT
|
||||||
|
du -xhd1 /var | sort -h
|
||||||
|
find /var/log -type f -size +500M -ls | sort -k7,7n
|
||||||
|
lsof +L1
|
||||||
|
```
|
||||||
|
|
||||||
|
### Inode Exhaustion
|
||||||
|
|
||||||
|
```bash
|
||||||
|
df -ih
|
||||||
|
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
|
||||||
|
find /tmp -xdev -type f | wc -l
|
||||||
|
```
|
||||||
|
|
||||||
|
### Mounts
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mount | column -t
|
||||||
|
findmnt
|
||||||
|
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
|
||||||
|
cat /etc/fstab
|
||||||
|
mount -a # can expose bad fstab entries; use in change window
|
||||||
|
```
|
||||||
|
|
||||||
|
### Permissions
|
||||||
|
|
||||||
|
```bash
|
||||||
|
namei -l /path/to/file
|
||||||
|
stat /path/to/file
|
||||||
|
getfacl /path/to/file
|
||||||
|
chmod 640 /path/to/file
|
||||||
|
chown root:app /path/to/file
|
||||||
|
```
|
||||||
|
|
||||||
|
### SELinux
|
||||||
|
|
||||||
|
State and mode:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
getenforce
|
||||||
|
sestatus
|
||||||
|
cat /etc/selinux/config
|
||||||
|
```
|
||||||
|
|
||||||
|
Check file, process, and port context:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ls -Zd /var/www/html
|
||||||
|
ls -lZ /var/www/html/index.html
|
||||||
|
ps -eZ | grep nginx
|
||||||
|
id -Z
|
||||||
|
semanage port -l | grep http
|
||||||
|
```
|
||||||
|
|
||||||
|
Audit and denial review:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
|
||||||
|
ausearch -m AVC -ts today | audit2why
|
||||||
|
journalctl -t setroubleshoot --since '1 hour ago'
|
||||||
|
sealert -a /var/log/audit/audit.log
|
||||||
|
```
|
||||||
|
|
||||||
|
Typical flow:
|
||||||
|
|
||||||
|
1. Confirm SELinux mode is `Enforcing` or `Permissive`.
|
||||||
|
2. Identify the failing path, process domain, and target context.
|
||||||
|
3. Read AVC denials before changing labels or booleans.
|
||||||
|
4. Prefer persistent policy-aligned fixes over `chcon`.
|
||||||
|
5. Restore default labels and retest service path.
|
||||||
|
|
||||||
|
Modify and restore context:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
chcon -t httpd_sys_content_t /srv/app/index.html # temporary until relabel/restore
|
||||||
|
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads # temporary until relabel/restore
|
||||||
|
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
|
||||||
|
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
|
||||||
|
restorecon -Rv /srv/app
|
||||||
|
matchpathcon /srv/app/uploads/file.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
Booleans and validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
getsebool -a | grep httpd
|
||||||
|
getsebool httpd_can_network_connect
|
||||||
|
setsebool -P httpd_can_network_connect on
|
||||||
|
runcon -t httpd_t -- id -Z
|
||||||
|
```
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
|
||||||
|
- prefer `semanage fcontext` plus `restorecon` for persistent fixes
|
||||||
|
- use `chcon` only as a short-lived diagnostic or emergency workaround
|
||||||
|
- avoid generating local policy modules from `audit2allow` until root cause is understood
|
||||||
|
- after context changes, validate service startup, AVC silence, and application path access
|
||||||
|
|
||||||
|
### Archives
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tar tf backup.tar | head
|
||||||
|
tar czf logs-$(date +%F).tgz /var/log/app
|
||||||
|
tar xzf bundle.tgz -C /restore/path
|
||||||
|
gzip -t file.gz
|
||||||
|
```
|
||||||
|
|
||||||
|
### File Operations
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cp -a source/ target/
|
||||||
|
rsync -aHAXvn /src/ /dst/
|
||||||
|
rsync -aHAX --delete --info=progress2 /src/ /dst/ # impact: verify source/destination twice
|
||||||
|
mv file file.$(date +%F-%H%M%S).bak
|
||||||
|
sha256sum file
|
||||||
|
```
|
||||||
|
|
||||||
|
## Text Processing & Regex
|
||||||
|
|
||||||
|
### Core Tools
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -n 'ERROR' app.log
|
||||||
|
grep -E 'ERROR|WARN' app.log
|
||||||
|
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
|
||||||
|
awk '{print $1,$4,$5}' access.log
|
||||||
|
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
|
||||||
|
sed -n '1,20p' file
|
||||||
|
sed -E 's/[[:space:]]+/ /g' file
|
||||||
|
cut -d: -f1,7 /etc/passwd
|
||||||
|
sort file | uniq -c | sort -nr
|
||||||
|
xargs -r -n1 systemctl status < service-list.txt
|
||||||
|
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
|
||||||
|
```
|
||||||
|
|
||||||
|
### Regex Reference
|
||||||
|
|
||||||
|
```text
|
||||||
|
IPv4 \b(?:\d{1,3}\.){3}\d{1,3}\b
|
||||||
|
ISO timestamp \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
|
||||||
|
UUID \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
|
||||||
|
Log level \b(?:ERROR|WARN|INFO)\b
|
||||||
|
Failed SSH Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
|
||||||
|
Ansible changed/fail ^(changed|fatal|failed):\s+\[[^]]+\]
|
||||||
|
```
|
||||||
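The IPv4 pattern above only checks the shape of an address, so a string such as `999.1.1.1` still matches. Where that matters, a second pass can check each octet. The sketch below is one way to do that in bash; the function name `is_valid_ipv4` is hypothetical and not part of this repository.

```bash
#!/usr/bin/env bash
# Hypothetical helper: return 0 only for dotted-quad strings whose octets are all 0-255.
is_valid_ipv4() {
  local ip="$1" octet
  # Shape check first: exactly four groups of 1-3 digits.
  [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  # Split on dots and range-check each octet; 10# forces base 10 so "08" is not read as octal.
  local IFS=.
  for octet in $ip; do
    (( 10#$octet <= 255 )) || return 1
  done
}
```

It can be placed after a `grep -oP` extraction to drop false positives, for example `while read -r ip; do is_valid_ipv4 "$ip" && echo "$ip"; done`.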
|
|
||||||
|
### Log Parsing Examples
|
||||||
|
|
||||||
|
IP extraction:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
|
||||||
|
```
|
||||||
|
|
||||||
|
Timestamp filter:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
|
||||||
|
```
|
||||||
|
|
||||||
|
UUID extraction:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
|
||||||
|
```
|
||||||
|
|
||||||
|
ERROR/WARN/INFO parsing:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
|
||||||
|
```
|
||||||
|
|
||||||
|
Failed SSH login parsing:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep 'Failed password' /var/log/secure \
|
||||||
|
| awk '{print $(NF-5),$(NF-3)}' \
|
||||||
|
| sort | uniq -c | sort -nr | head
|
||||||
|
```
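
Counting fields from the end of the line depends on the exact sshd message layout. As an alternative, the user and source address can be captured by pattern. This is a sketch under the assumption that lines follow the usual OpenSSH form `Failed password for [invalid user ]<user> from <ip> port <port> ssh2`; `count_failed_ssh` is a hypothetical helper name.

```bash
# Hypothetical helper: count failed password attempts per "user ip" pair.
# Reads log lines on stdin; lines that do not match the pattern are ignored.
count_failed_ssh() {
  sed -nE 's/.*Failed password for (invalid user )?([^ ]+) from ([^ ]+) port .*/\2 \3/p' \
    | sort | uniq -c | sort -nr
}

# Usage: count_failed_ssh < /var/log/secure | head
```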
|
||||||
|
|
||||||
|
Extract fields from logs:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
|
||||||
|
```
|
||||||
|
|
||||||
|
Filter Ansible output:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
|
||||||
|
grep -E '^fatal:|^failed:' ansible.log
|
||||||
|
```
|
||||||
|
|
||||||
|
## Incident Response
|
||||||
|
|
||||||
|
### Disk Full
|
||||||
|
|
||||||
|
Workflow:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
df -hT
|
||||||
|
df -ih
|
||||||
|
findmnt
|
||||||
|
du -xhd1 /var | sort -h
|
||||||
|
find /var -xdev -type f -size +1G -ls | sort -k7,7n
|
||||||
|
lsof +L1
|
||||||
|
journalctl --disk-usage
|
||||||
|
```
|
||||||
|
|
||||||
|
Typical branches:
|
||||||
|
|
||||||
|
- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
|
||||||
|
- inode full: remove file storms, spool buildup, temp-file leaks
|
||||||
|
- deleted open files: restart offender only after sizing impact
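
To decide which branch applies, it can help to flag only the filesystems over a threshold. The sketch below assumes POSIX-style `df -P` column order (usage percentage in column 5, mount point in column 6) and mount points without spaces; `flag_full_filesystems` is a hypothetical helper name.

```bash
# Hypothetical helper: read `df -P` style output on stdin and print
# "<mount point> <use%>" for entries at or above the threshold (default 90).
flag_full_filesystems() {
  local threshold="${1:-90}"
  awk -v t="$threshold" 'NR > 1 {
    pct = $5
    sub(/%/, "", pct)          # strip the trailing percent sign
    if (pct + 0 >= t) print $6, pct "%"
  }'
}

# Usage: df -P | flag_full_filesystems 90
```

With GNU `df`, the inode view (`df -Pi`) should use the same column positions, so the same filter can likely be applied to the inode-full branch.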
|
||||||
|
|
||||||
|
Post-check:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
df -hT
|
||||||
|
df -ih
|
||||||
|
systemctl --failed
|
||||||
|
```
|
||||||
|
|
||||||
|
### High CPU
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uptime
|
||||||
|
mpstat -P ALL 1 5
|
||||||
|
pidstat -u -p ALL 1 5
|
||||||
|
top -H -b -n 1 | head -40
|
||||||
|
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Confirm sustained load, not a short spike.
|
||||||
|
2. Separate user CPU vs system CPU vs I/O wait.
|
||||||
|
3. Identify hot process and hot threads.
|
||||||
|
4. Correlate with deploys, cron, backups, or JVM GC.
|
||||||
|
5. Throttle, stop, or fail over only with service impact understood.
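
Step 1 can be approximated by expressing the load averages as a percentage of the CPU count: a high 1-minute figure with a low 15-minute figure points to a spike rather than sustained load. This is a rough sketch; `load_pct` is a hypothetical helper name, and load average is only a proxy for CPU saturation on Linux.

```bash
# Hypothetical helper: load average as a whole-number percentage of CPU count.
load_pct() {
  local load="$1" cpus="$2"
  awk -v l="$load" -v c="$cpus" 'BEGIN { printf "%d\n", (l / c) * 100 }'
}

# Usage on a Linux host (assumes /proc/loadavg and coreutils nproc):
# read -r one five fifteen _ < /proc/loadavg
# echo "1m=$(load_pct "$one" "$(nproc)")% 15m=$(load_pct "$fifteen" "$(nproc)")%"
```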
|
||||||
|
|
||||||
|
### Memory Pressure
|
||||||
|
|
||||||
|
```bash
|
||||||
|
free -m
|
||||||
|
vmstat 1 5
|
||||||
|
sar -r 1 5
|
||||||
|
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
|
||||||
|
dmesg -T | grep -Ei 'oom|killed process'
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Check swap growth and page scan rates.
|
||||||
|
2. Identify top RSS owners.
|
||||||
|
3. Check kernel logs for OOM.
|
||||||
|
4. Validate cache vs real process growth.
|
||||||
|
5. Restart leaking service only after capturing evidence.
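
For step 4, comparing `used` with `available` helps separate reclaimable cache from real process growth: high usage with plenty of memory still available usually means cache. The sketch below assumes the `free -m` layout from recent procps-ng releases, where `available` is the seventh column of the `Mem:` row; `mem_pressure_summary` is a hypothetical helper name.

```bash
# Hypothetical helper: summarize `free -m` output read from stdin.
mem_pressure_summary() {
  awk '
    /^Mem:/            { printf "mem_used=%d%% mem_available=%d%%\n", $3 * 100 / $2, $7 * 100 / $2 }
    /^Swap:/ && $2 > 0 { printf "swap_used=%d%%\n", $3 * 100 / $2 }
  '
}

# Usage: free -m | mem_pressure_summary
```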
|
||||||
|
|
||||||
|
### Failed Service
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status <service> --no-pager -l
|
||||||
|
journalctl -u <service> -b --no-pager | tail -100
|
||||||
|
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Validate config.
|
||||||
|
2. Validate credentials, ports, mounts, permissions.
|
||||||
|
3. Confirm dependency availability.
|
||||||
|
4. Restart and recheck logs immediately.
|
||||||
|
|
||||||
|
### SELinux Denials
|
||||||
|
|
||||||
|
Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.
|
||||||
|
|
||||||
|
Triage:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
getenforce
|
||||||
|
sestatus
|
||||||
|
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
|
||||||
|
ausearch -m AVC -ts recent | audit2why
|
||||||
|
journalctl -t setroubleshoot --since '30 min ago'
|
||||||
|
systemctl status <service> --no-pager -l
|
||||||
|
ps -eZ | grep <service>
|
||||||
|
ls -lZ /path/to/app /path/to/app/*
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Confirm the failure is current and reproducible.
|
||||||
|
2. Identify the denied process domain, target path, and requested access from AVC logs.
|
||||||
|
3. Validate expected default context with `matchpathcon`.
|
||||||
|
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
|
||||||
|
5. Apply the smallest persistent fix, then retest in `Enforcing`.
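
Step 2 can be sped up by reducing raw AVC records to the source domain, target type, object class, and requested permission. The sketch below assumes the raw audit record layout (`avc: denied { ... } ... scontext=... tcontext=... tclass=...`); `summarize_avc` is a hypothetical helper name, and `audit2why` remains the better tool for explanations.

```bash
# Hypothetical helper: summarize AVC denials read from stdin as
# "<source domain> -> <target type> [<class>] { <permissions> }", with counts.
summarize_avc() {
  sed -nE 's/.*denied +\{ ([^}]*) \}.*scontext=[^:]*:[^:]*:([^:]*):.*tcontext=[^:]*:[^:]*:([^:]*):.*tclass=([^ ]*).*/\2 -> \3 [\4] { \1 }/p' \
    | sort | uniq -c | sort -nr
}

# Usage: ausearch -m AVC -ts recent | summarize_avc
```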
|
||||||
|
|
||||||
|
Common fixes:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
matchpathcon /srv/app/config.yml
|
||||||
|
restorecon -Rv /srv/app
|
||||||
|
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
|
||||||
|
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
|
||||||
|
semanage port -l | grep http
|
||||||
|
getsebool -a | grep httpd
|
||||||
|
setsebool -P httpd_can_network_connect on
|
||||||
|
```
|
||||||
|
|
||||||
|
Validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
getenforce
|
||||||
|
systemctl restart <service>
|
||||||
|
systemctl status <service> --no-pager -l
|
||||||
|
ausearch -m AVC -ts recent
|
||||||
|
curl -fsS http://127.0.0.1:<port>/health
|
||||||
|
```
|
||||||
|
|
||||||
|
Operational notes:
|
||||||
|
|
||||||
|
- do not leave systems in `Permissive` as the fix
|
||||||
|
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
|
||||||
|
- treat `audit2allow` output as investigation material, not automatic remediation
|
||||||
|
- if policy changes are unavoidable, document exact AVC evidence and rollback path
|
||||||
|
|
||||||
|
### SSL Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl s_client -connect host:443 -servername host -showcerts </dev/null
|
||||||
|
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
|
||||||
|
curl -vkI https://host/
|
||||||
|
```
|
||||||
|
|
||||||
|
Check for:
|
||||||
|
|
||||||
|
- expired certificate
|
||||||
|
- missing SAN
|
||||||
|
- incomplete chain
|
||||||
|
- hostname mismatch
|
||||||
|
- TLS version or cipher mismatch
|
||||||
|
|
||||||
|
### DNS Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
dig +short app.example.com
|
||||||
|
dig @<resolver> app.example.com
|
||||||
|
dig +trace app.example.com
|
||||||
|
getent hosts app.example.com
|
||||||
|
resolvectl status
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Compare resolver result with authoritative result.
|
||||||
|
2. Check TTL and stale cache.
|
||||||
|
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
|
||||||
|
4. Test from affected host and unaffected host.
|
||||||
|
|
||||||
|
### Network Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ip addr
|
||||||
|
ip route
|
||||||
|
ss -tulpen
|
||||||
|
tcpdump -ni any host <peer> and port <port>
|
||||||
|
curl -sv http://host:port/health
|
||||||
|
mtr -rwzc 20 host
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Interface/link state.
|
||||||
|
2. Route and source IP selection.
|
||||||
|
3. Listening socket on target.
|
||||||
|
4. Firewall and security controls.
|
||||||
|
5. Packet capture if app logs are inconclusive.
|
||||||
|
|
||||||
|
### JVM / Tomcat Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ps -ef | grep -i tomcat
|
||||||
|
jcmd <pid> VM.flags
|
||||||
|
jstat -gcutil <pid> 1000 10
|
||||||
|
jstack <pid> | head -100
|
||||||
|
ss -ltnp | grep java
|
||||||
|
tail -100 /opt/tomcat/logs/catalina.out
|
||||||
|
```
|
||||||
|
|
||||||
|
Focus:
|
||||||
|
|
||||||
|
- stuck threads
|
||||||
|
- full GC loops
|
||||||
|
- heap exhaustion
|
||||||
|
- connector bind failures
|
||||||
|
- slow backend dependency
|
||||||
|
|
||||||
|
### Certificate Expiration
|
||||||
|
|
||||||
|
```bash
|
||||||
|
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
|
||||||
|
| openssl x509 -noout -enddate
|
||||||
|
|
||||||
|
openssl x509 -checkend 2592000 -noout -in cert.pem
|
||||||
|
```
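
`-checkend` takes seconds, so `2592000` corresponds to 30 days; the command exits non-zero when the certificate expires within that window. To report remaining days instead of a pass/fail result, the arithmetic can be sketched as follows. `days_until` is a hypothetical helper name, and the usage lines assume GNU `date` for parsing the `notAfter=` value.

```bash
# Hypothetical helper: whole days between two epoch timestamps
# (negative when the end time is already in the past).
days_until() {
  local now_epoch="$1" end_epoch="$2"
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Usage (assumes GNU date):
# end=$(openssl x509 -enddate -noout -in cert.pem | cut -d= -f2)
# days_until "$(date +%s)" "$(date -d "$end" +%s)"
```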
|
||||||
|
|
||||||
|
### Suspicious Login Attempts
|
||||||
|
|
||||||
|
```bash
|
||||||
|
last -ai | head -30
|
||||||
|
lastb -ai | head -30
|
||||||
|
grep 'Failed password' /var/log/secure | tail -50
|
||||||
|
grep 'Accepted ' /var/log/secure | tail -50
|
||||||
|
ausearch -m USER_LOGIN -ts recent
|
||||||
|
```
|
||||||
|
|
||||||
|
Workflow:
|
||||||
|
|
||||||
|
1. Identify source IPs and usernames.
|
||||||
|
2. Validate whether attempts are expected from bastions/scanners.
|
||||||
|
3. Check successful logins from same sources.
|
||||||
|
4. Review sudo usage and persistence changes.
|
||||||
|
5. Preserve logs before cleanup or rotation.
|
||||||
|
|
||||||
|
## Networking Operations
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ip -br addr
|
||||||
|
ip route get 8.8.8.8
|
||||||
|
ss -ltnp
|
||||||
|
ss -tn state established '( sport = :443 or dport = :443 )'
|
||||||
|
tcpdump -ni eth0 port 53
|
||||||
|
dig +short mx example.com
|
||||||
|
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
|
||||||
|
mtr -rwzc 10 host
|
||||||
|
traceroute -T -p 443 host
|
||||||
|
openssl s_client -connect host:443 -servername host </dev/null
|
||||||
|
```
|
||||||
|
|
||||||
|
## Storage Operations
|
||||||
|
|
||||||
|
### Block and Filesystem Discovery
|
||||||
|
|
||||||
|
```bash
|
||||||
|
lsblk -f
|
||||||
|
blkid
|
||||||
|
findmnt
|
||||||
|
cat /proc/partitions
|
||||||
|
multipath -ll
|
||||||
|
```
|
||||||
|
|
||||||
|
### LVM
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pvs
|
||||||
|
vgs
|
||||||
|
lvs -a -o +devices
|
||||||
|
pvdisplay /dev/sdX
|
||||||
|
vgdisplay <vg>
|
||||||
|
lvdisplay /dev/<vg>/<lv>
|
||||||
|
```
|
||||||
|
|
||||||
|
Growth example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pvcreate /dev/mapper/mpatha # impact: write metadata
|
||||||
|
vgextend vgdata /dev/mapper/mpatha # impact: changes VG layout
|
||||||
|
lvextend -L +100G -r /dev/vgdata/lvapp
|
||||||
|
```
|
||||||
|
|
||||||
|
### XFS
|
||||||
|
|
||||||
|
```bash
|
||||||
|
xfs_info /mountpoint
|
||||||
|
xfs_repair -n /dev/mapper/vg-lv
|
||||||
|
xfs_growfs /mountpoint
|
||||||
|
```
|
||||||
|
|
||||||
|
### ext4
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tune2fs -l /dev/mapper/vg-lv | head -40
|
||||||
|
e2fsck -fn /dev/mapper/vg-lv
|
||||||
|
resize2fs /dev/mapper/vg-lv
|
||||||
|
```
|
||||||
|
|
||||||
|
### Multipath
|
||||||
|
|
||||||
|
```bash
|
||||||
|
multipath -ll
|
||||||
|
lsblk -S
|
||||||
|
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
|
||||||
|
```
|
||||||
|
|
||||||
|
### NFS
|
||||||
|
|
||||||
|
```bash
|
||||||
|
showmount -e nfs-server
|
||||||
|
nfsstat -m
|
||||||
|
mount | grep nfs
|
||||||
|
rpcinfo -p nfs-server
|
||||||
|
```
|
||||||
|
|
||||||
|
### iSCSI
|
||||||
|
|
||||||
|
```bash
|
||||||
|
iscsiadm -m session
|
||||||
|
iscsiadm -m node
|
||||||
|
iscsiadm -m discovery -t sendtargets -p <target-ip>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Mount Troubleshooting
|
||||||
|
|
||||||
|
```bash
|
||||||
|
findmnt /mountpoint
|
||||||
|
mount -v /mountpoint
|
||||||
|
dmesg -T | tail -50
|
||||||
|
journalctl -k -n 100 --no-pager
|
||||||
|
```
|
||||||
|
|
||||||
|
Check:
|
||||||
|
|
||||||
|
- device path stable
|
||||||
|
- UUID correct
|
||||||
|
- filesystem type correct
|
||||||
|
- multipath settled
|
||||||
|
- network and RPC available for NFS
|
||||||
|
|
||||||
|
### Filesystem Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
|
||||||
|
df -hT /data
|
||||||
|
touch /data/.write-test && rm -f /data/.write-test
|
||||||
|
```
|
||||||
|
|
||||||
|
### Migration Validation Example
|
||||||
|
|
||||||
|
```bash
|
||||||
|
findmnt /data
|
||||||
|
df -hT /data
|
||||||
|
rsync -aHAXvn /olddata/ /data/
|
||||||
|
rsync -aHAXc --delete --dry-run /olddata/ /data/
|
||||||
|
sha256sum /olddata/keyfile /data/keyfile
|
||||||
|
```
|
||||||
|
|
||||||
|
## AIX Operations
|
||||||
|
|
||||||
|
```bash
|
||||||
|
oslevel -s
|
||||||
|
errpt | head
|
||||||
|
errpt -a | more
|
||||||
|
topas
|
||||||
|
lsvg -o
|
||||||
|
lsvg rootvg
|
||||||
|
lslpp -L | grep -i openssl
|
||||||
|
svmon -G
|
||||||
|
svmon -P <pid>
|
||||||
|
netstat -rn
|
||||||
|
```
|
||||||
|
|
||||||
|
## SSL/TLS Operations
|
||||||
|
|
||||||
|
### OpenSSL Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl version -a
|
||||||
|
openssl x509 -in cert.pem -noout -text | less
|
||||||
|
openssl rsa -in key.pem -check -noout
|
||||||
|
openssl verify -CAfile chain.pem cert.pem
|
||||||
|
```
|
||||||
|
|
||||||
|
### Expiration Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl x509 -enddate -noout -in cert.pem
|
||||||
|
openssl x509 -checkend 604800 -noout -in cert.pem
|
||||||
|
```
|
||||||
|
|
||||||
|
### keytool Basics
|
||||||
|
|
||||||
|
```bash
|
||||||
|
keytool -list -v -keystore keystore.jks
|
||||||
|
keytool -list -cacerts | grep -i <alias>
|
||||||
|
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
|
||||||
|
```
|
||||||
|
|
||||||
|
### Chain Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl s_client -connect host:443 -servername host -showcerts </dev/null
|
||||||
|
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
|
||||||
|
```
|
||||||
|
|
||||||
|
## Automation Operations
|
||||||
|
|
||||||
|
### Bash Safety Patterns
|
||||||
|
|
||||||
|
```bash
|
||||||
|
set -euo pipefail
|
||||||
|
IFS=$'\n\t'
|
||||||
|
trap 'echo "line ${LINENO}: command failed" >&2' ERR
|
||||||
|
trap 'rm -f "${tmpfile:-}"' EXIT
|
||||||
|
```
|
||||||
|
|
||||||
|
Safe loop examples:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
while IFS= read -r host; do
|
||||||
|
  ssh -n "$host" uptime   # -n stops ssh from consuming the rest of hostlist.txt on stdin
|
||||||
|
done < hostlist.txt
|
||||||
|
|
||||||
|
find /var/log -type f -name '*.gz' -print0 \
|
||||||
|
| while IFS= read -r -d '' file; do
|
||||||
|
gzip -t "$file"
|
||||||
|
done
|
||||||
|
```
|
||||||
|
|
||||||
|
Operational scripting patterns:
|
||||||
|
|
||||||
|
- default to read-only mode
|
||||||
|
- require explicit `--execute` for changes
|
||||||
|
- log actions with timestamps
|
||||||
|
- validate dependencies with `command -v`
|
||||||
|
- use temp files with `mktemp`
|
||||||
|
- guard destructive paths and empty variables
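
The patterns above can be combined into a small script skeleton. This is a minimal sketch, not the layout used by the scripts in this repository; the helper names `log`, `require_cmd`, `run`, and `guard_path` are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

EXECUTE=false                      # read-only unless --execute is passed
tmpfile="$(mktemp)"                # scratch file, removed on exit
trap 'rm -f "$tmpfile"' EXIT

# Timestamped (UTC) log line on stdout.
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"; }

# Fail early when a dependency is missing.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { log "CRITICAL: missing dependency: $1"; return 1; }
}

# Run a command only in execute mode; otherwise just report it.
run() {
  if [[ "$EXECUTE" == true ]]; then
    log "EXEC: $*"
    "$@"
  else
    log "DRY-RUN: $*"
  fi
}

# Refuse empty or root-level paths before anything destructive.
guard_path() {
  local path="${1:-}"
  [[ -n "$path" && "$path" != "/" ]] || { log "CRITICAL: refusing unsafe path: '${path}'"; return 1; }
}

for arg in "$@"; do
  if [[ "$arg" == "--execute" ]]; then EXECUTE=true; fi
done

# Usage: guard_path "$target" && run rm -rf -- "$target"/cache
```

Keeping the mode switch inside `run` means every change-capable command goes through one place, which makes the dry-run output a faithful preview of what `--execute` would do.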
|
||||||
|
|
||||||
|
## Ansible Operations
|
||||||
|
|
||||||
|
### Execution
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-inventory -i inventory/hosts.yml --graph
|
||||||
|
ansible-inventory -i inventory/hosts.yml --list | jq '.'
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Safe Rollout Workflow
|
||||||
|
|
||||||
|
1. Validate inventory and variable targeting.
|
||||||
|
2. Run syntax-check.
|
||||||
|
3. Run `--check --diff` on a single host.
|
||||||
|
4. Execute against one host or one tier.
|
||||||
|
5. Validate service health, logs, and config.
|
||||||
|
6. Expand rollout only after post-check passes.
|
||||||
|
|
||||||
|
Rollback mindset:
|
||||||
|
|
||||||
|
- keep before/after config copies
|
||||||
|
- know which tasks restart services
|
||||||
|
- define manual backout if package/config changes fail
|
||||||
|
- avoid broad `--limit` mistakes by reviewing resolved host list first
|
||||||
|
|
||||||
|
## Monitoring & Observability
|
||||||
|
|
||||||
|
### Zabbix Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status zabbix-agent2 --no-pager
|
||||||
|
zabbix_agent2 -t 'vfs.fs.size[/,free]'
|
||||||
|
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
|
||||||
|
```
|
||||||
|
|
||||||
|
### ELK Log Workflows
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
|
||||||
|
journalctl -u filebeat -n 100 --no-pager
|
||||||
|
curl -s 'http://localhost:9200/_cluster/health?pretty'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Grafana Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
|
||||||
|
grep -i 'error' /var/log/grafana/grafana.log | tail -50
|
||||||
|
```
|
||||||
|
|
||||||
|
### Health Endpoints and Alert Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -fsS http://app:8080/health
|
||||||
|
curl -fsS http://app:8080/metrics | head
|
||||||
|
```
|
||||||
|
|
||||||
|
False positive validation:
|
||||||
|
|
||||||
|
1. Compare alert timestamp with deploy/change window.
|
||||||
|
2. Confirm on-host evidence, not only dashboard data.
|
||||||
|
3. Check collector lag, scrape failures, and stale metrics.
|
||||||
|
4. Validate from a second source before escalating.
|
||||||
|
|
||||||
|
## Operational Habits
|
||||||
|
|
||||||
|
### Pre-checks
|
||||||
|
|
||||||
|
- capture time, hostname, and operator
|
||||||
|
- capture current config and service state
|
||||||
|
- check recent alerts, maintenance windows, and dependencies
|
||||||
|
- confirm backup or rollback path exists
|
||||||
|
|
||||||
|
### Post-checks
|
||||||
|
|
||||||
|
- validate service state
|
||||||
|
- validate logs for fresh errors
|
||||||
|
- validate client path, ports, and name resolution
|
||||||
|
- compare metrics before/after
|
||||||
|
|
||||||
|
### Rollback Thinking
|
||||||
|
|
||||||
|
- define exact backout trigger before change
|
||||||
|
- prefer reversible steps
|
||||||
|
- keep config backups with timestamps
|
||||||
|
- avoid bundling unrelated changes
|
||||||
|
|
||||||
|
### Change Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl is-active <service>
|
||||||
|
curl -fsS http://127.0.0.1:<port>/health
|
||||||
|
ss -ltnp | grep :<port>
|
||||||
|
journalctl -u <service> -S '5 min ago' --no-pager
|
||||||
|
```
|
||||||
|
|
||||||
|
### Operational Communication
|
||||||
|
|
||||||
|
- state scope, risk, and expected impact before action
|
||||||
|
- record start and stop times in UTC
|
||||||
|
- document what changed, what was checked, and remaining risk
|
||||||
|
- escalate with evidence, not assumptions
|
||||||
|
|
||||||
|
### Evidence Collection During Incidents
|
||||||
|
|
||||||
|
```bash
|
||||||
|
dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ); mkdir -p "$dir"
|
||||||
|
journalctl -b > "$dir/journal.txt"
|
||||||
|
ss -tulpen > "$dir/sockets.txt"
|
||||||
|
df -hT > "$dir/df.txt"
|
||||||
|
free -m > "$dir/free.txt"
|
||||||
|
```
|
||||||
@@ -0,0 +1,12 @@
|
|||||||
|
# examples
|
||||||
|
|
||||||
|
Sanitized sample outputs for documentation and review.
|
||||||
|
|
||||||
|
These files use fake hostnames, reserved example domains, reserved IP address ranges, and invented storage names. They are useful for reading the workflow without exposing real system details.
|
||||||
|
|
||||||
|
## Included
|
||||||
|
|
||||||
|
- `disk-full/` - sample filesystem usage, deleted open files, and a short after-action report.
|
||||||
|
- `incident-triage/` - sample L2 incident triage report for repeatable handoff and ticket evidence.
|
||||||
|
- `veritas/` - sample VxVM disk and VCS service group output.
|
||||||
|
- `gpfs/` - sample GPFS cluster and NSD output.
|
||||||
@@ -0,0 +1,4 @@
|
|||||||
|
Filesystem Size Used Avail Use% Mounted on
|
||||||
|
/dev/mapper/vgapp-lvlog 80G 76G 4.0G 95% /var/log/app
|
||||||
|
/dev/mapper/vgapp-lvdata 200G 121G 79G 61% /srv/app
|
||||||
|
/dev/sda2 40G 19G 21G 48% /
|
||||||
@@ -0,0 +1,4 @@
|
|||||||
|
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
|
||||||
|
appworker 1842 appsvc 12w REG 253,7 8589934592 0 9911 /var/log/app/app.log.1 (deleted)
|
||||||
|
java 2210 appsvc 45w REG 253,7 2147483648 0 9919 /var/log/app/gc.log.2 (deleted)
|
||||||
|
rsyslogd 712 root 7w REG 253,7 524288000 0 9924 /var/log/app/messages.old (deleted)
|
||||||
@@ -0,0 +1,13 @@
|
|||||||
|
Disk Full Review - Sanitized Example
|
||||||
|
|
||||||
|
Host: host-app-01.example.invalid
|
||||||
|
Filesystem: /var/log/app
|
||||||
|
Before: 95% used
|
||||||
|
After: 72% used
|
||||||
|
Actions reviewed:
|
||||||
|
- Confirmed largest files under /var/log/app.
|
||||||
|
- Identified deleted files still held by appworker and java processes.
|
||||||
|
- Confirmed no symlinks were removed during rotated log cleanup.
|
||||||
|
- Recommended application owner restart during approved window to release deleted files.
|
||||||
|
|
||||||
|
No real hostnames, tickets, or application names are included in this sample.
|
||||||
@@ -0,0 +1,11 @@
|
|||||||
|
GPFS cluster information
|
||||||
|
========================
|
||||||
|
GPFS cluster name: gpfs-lab.example.invalid
|
||||||
|
GPFS cluster id: 1234567890123456789
|
||||||
|
GPFS UID domain: gpfs-lab.example.invalid
|
||||||
|
Remote shell command: /usr/bin/ssh
|
||||||
|
Remote file copy command: /usr/bin/scp
|
||||||
|
|
||||||
|
Node Daemon node name IP address Admin node name Designation
|
||||||
|
1 gpfs-node-a.example.invalid 192.0.2.11 gpfs-node-a.example.invalid quorum-manager
|
||||||
|
2 gpfs-node-b.example.invalid 192.0.2.12 gpfs-node-b.example.invalid quorum-manager
|
||||||
@@ -0,0 +1,5 @@
|
|||||||
|
File system Disk name NSD servers
|
||||||
|
-------------------------------------------------------------------
|
||||||
|
fs_data nsd_data_01 gpfs-node-a.example.invalid,gpfs-node-b.example.invalid
|
||||||
|
fs_data nsd_data_02 gpfs-node-a.example.invalid,gpfs-node-b.example.invalid
|
||||||
|
fs_data nsd_data_03 gpfs-node-a.example.invalid,gpfs-node-b.example.invalid
|
||||||
@@ -0,0 +1,131 @@
|
|||||||
|
# L2 Incident Triage Report
|
||||||
|
|
||||||
|
- Generated: 2026-05-12T19:30:00Z
|
||||||
|
- Local hostname: app01.example.internal
|
||||||
|
- Current user: triage
|
||||||
|
- Incident type: all
|
||||||
|
- Service: nginx
|
||||||
|
- Host: app.example.com
|
||||||
|
- Port: 443
|
||||||
|
- PID: not provided
|
||||||
|
- Process match: not provided
|
||||||
|
- Since: 30 minutes ago
|
||||||
|
|
||||||
|
## Executed Checks
|
||||||
|
|
||||||
|
| Check | Script | Status | Exit | Command |
|
||||||
|
| --- | --- | --- | --- | --- |
|
||||||
|
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
|
||||||
|
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
|
||||||
|
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
|
||||||
|
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
|
||||||
|
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
|
||||||
|
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
|
||||||
|
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
|
||||||
|
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
|
||||||
|
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
|
||||||
|
|
||||||
|
## Summary
|
||||||
|
|
||||||
|
- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
|
||||||
|
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
|
||||||
|
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
|
||||||
|
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
|
||||||
|
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
|
||||||
|
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
|
||||||
|
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
|
||||||
|
- Inode usage: OK: Highest inode usage is 42%
|
||||||
|
- JVM threads and heap: WARNING: No Java processes detected
|
||||||
|
|
||||||
|
## Raw Evidence
|
||||||
|
|
||||||
|
### CPU saturation
|
||||||
|
|
||||||
|
Script: `check_high_cpu.sh`
|
||||||
|
|
||||||
|
Command: `./check_high_cpu.sh`
|
||||||
|
|
||||||
|
Status: OK, exit: 0
|
||||||
|
|
||||||
|
```text
|
||||||
|
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
|
||||||
|
|
||||||
|
Load average:
|
||||||
|
1m=0.42 5m=0.38 15m=0.31
|
||||||
|
|
||||||
|
Top CPU processes:
|
||||||
|
PID PPID USER %CPU %MEM COMMAND ARGS
|
||||||
|
1450 1 app 7.2 2.1 nginx nginx: worker process
|
||||||
|
|
||||||
|
Recommended next steps:
|
||||||
|
- Check process ownership and whether the top process is expected
|
||||||
|
- Review logs for the top CPU-consuming process
|
||||||
|
```
|
||||||
|
|
||||||
|
### Memory and OOM
|
||||||
|
|
||||||
|
Script: `check_high_memory_oom.sh`
|
||||||
|
|
||||||
|
Command: `./check_high_memory_oom.sh --since "30 minutes ago"`
|
||||||
|
|
||||||
|
Status: WARNING, exit: 1
|
||||||
|
|
||||||
|
```text
|
||||||
|
WARNING: Memory usage is 84% and swap usage is 12%
|
||||||
|
|
||||||
|
Memory summary:
|
||||||
|
Mem: 15800 13272 1110 210 1418 1840
|
||||||
|
Swap: 4095 512 3583
|
||||||
|
|
||||||
|
OOM events since 30 minutes ago:
|
||||||
|
OK: no OOM evidence found in available sources
|
||||||
|
```
|
||||||
|
|
||||||
|
### Service restart loop
|
||||||
|
|
||||||
|
Script: `check_service_restart_loop.sh`
|
||||||
|
|
||||||
|
Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`
|
||||||
|
|
||||||
|
Status: OK, exit: 0
|
||||||
|
|
||||||
|
```text
|
||||||
|
OK: Service nginx state=active substate=running restarts=0
|
||||||
|
|
||||||
|
Systemd properties:
|
||||||
|
Id=nginx.service
|
||||||
|
ActiveState=active
|
||||||
|
SubState=running
|
||||||
|
NRestarts=0
|
||||||
|
```
|
||||||
|
|
||||||
|
### Skipped or limited checks
|
||||||
|
|
||||||
|
```text
|
||||||
|
JVM threads and heap returned WARNING because no Java process was detected.
|
||||||
|
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
|
||||||
|
```
|
||||||
|
|
||||||
|
## L2 Handover Checklist
|
||||||
|
|
||||||
|
- [ ] Business impact confirmed
|
||||||
|
- [ ] Affected host/service identified
|
||||||
|
- [ ] Monitoring alert attached
|
||||||
|
- [ ] Recent changes checked
|
||||||
|
- [ ] Logs attached
|
||||||
|
- [ ] Service owner identified
|
||||||
|
- [ ] Escalation target identified
|
||||||
|
|
||||||
|
## Escalation Notes
|
||||||
|
|
||||||
|
- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
|
||||||
|
- Include the alert, timeline, commands run, and the raw evidence above.
|
||||||
|
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
|
||||||
|
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.
|
||||||
|
|
||||||
|
## Recommended Next Steps
|
||||||
|
|
||||||
|
- Confirm the symptom against monitoring and user reports.
|
||||||
|
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
|
||||||
|
- Attach this report to the incident ticket before handoff.
|
||||||
|
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
|
||||||
@@ -0,0 +1,3 @@
|
|||||||
|
#Group Attribute System Value
|
||||||
|
app_sg01 State node-a.example.invalid |ONLINE|
|
||||||
|
app_sg01 State node-b.example.invalid |OFFLINE|
|
||||||
@@ -0,0 +1,5 @@
|
|||||||
|
DEVICE TYPE DISK GROUP STATUS
|
||||||
|
san_lun_001 auto:none - - online invalid
|
||||||
|
san_lun_002 auto:none - - online invalid
|
||||||
|
san_lun_010 auto:cdsdisk dgapp01_01 dgapp01 online
|
||||||
|
san_lun_011 auto:cdsdisk dgapp01_02 dgapp01 online
|
||||||
@@ -1,18 +1,5 @@
|
|||||||
# infra-run/runbooks
|
# runbooks
|
||||||
|
|
||||||
This directory is reserved for runbook-style procedures that describe how to perform controlled operational work. It sits alongside the executable scripts and captures the human workflow around them.
|
Planned area for standalone runbooks.
|
||||||
|
|
||||||
## Diagram
|
Current runnable workflow notes live with the Bash toolkits under [scripts/bash](../scripts/bash/).
|
||||||
|
|
||||||
```mermaid
|
|
||||||
flowchart TD
|
|
||||||
A["runbooks"] --> B["Pre-check"]
|
|
||||||
A --> C["Change execution"]
|
|
||||||
A --> D["Post-check"]
|
|
||||||
A --> E["Rollback or escalation"]
|
|
||||||
```
|
|
||||||
|
|
||||||
## Notes
|
|
||||||
|
|
||||||
- The directory is currently a placeholder.
|
|
||||||
- It is intended to hold narrative procedures that complement the script-based toolkits.
|
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
# infra-run/scripts
|
# infra-run/scripts
|
||||||
|
|
||||||
This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from future Python-based utilities while keeping both under one automation entry point.
|
This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from Python-based analysis utilities while keeping both under one automation entry point.
|
||||||
|
|
||||||
## Diagram
|
## Diagram
|
||||||
|
|
||||||
@@ -9,15 +9,17 @@ flowchart TD
|
|||||||
A["scripts"] --> B["bash"]
|
A["scripts"] --> B["bash"]
|
||||||
A --> C["python"]
|
A --> C["python"]
|
||||||
B --> D["Operational toolkits"]
|
B --> D["Operational toolkits"]
|
||||||
C --> E["Future helper utilities"]
|
C --> E["Analysis helper utilities"]
|
||||||
```
|
```
|
||||||
|
|
||||||
## Scope
|
## Scope
|
||||||
|
|
||||||
- `bash` - current implementation area with production-style operations toolkits.
|
- [bash](./bash/) - operational toolkits for host health checks, disk-full triage, Veritas examples, and GPFS examples.
|
||||||
- `python` - reserved space for future supporting utilities.
|
- [python](./python/) - read-only tools for local log parsing, reporting, and structured operational analysis.
|
||||||
|
|
||||||
## Notes
|
## Notes
|
||||||
|
|
||||||
- The repository currently emphasizes Bash because it maps directly to day-to-day Linux operations.
|
- Bash remains the right default for direct host checks and operational wrappers.
|
||||||
- The structure leaves room for higher-level helpers without mixing concerns.
|
- Python is used where parsing, report generation, comparison, or JSON output is clearer than shell.
|
||||||
|
- Bash tooling should remain safe by default, readable, and validated with `./scripts/check-bash.sh` from the repository root.
|
||||||
|
- Python tooling should remain read-only by default, standard-library based, and validated with `./scripts/check-python.sh` from the repository root.
|
||||||
|
|||||||
@@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. T
|
|||||||
```mermaid
|
```mermaid
|
||||||
flowchart TD
|
flowchart TD
|
||||||
A["bash"] --> B["os-healthcheck"]
|
A["bash"] --> B["os-healthcheck"]
|
||||||
A --> C["disk-full"]
|
A --> C["incident-checks"]
|
||||||
A --> D["veritas"]
|
A --> D["disk-full"]
|
||||||
A --> E["gpfs"]
|
A --> E["veritas"]
|
||||||
|
A --> F["gpfs"]
|
||||||
B --> B1["Host diagnostics"]
|
B --> B1["Host diagnostics"]
|
||||||
C --> C1["Incident workflow"]
|
C --> C1["Standalone triage checks"]
|
||||||
D --> D1["VxVM and VCS change flow"]
|
D --> D1["Incident workflow"]
|
||||||
E --> E1["Spectrum Scale expansion flow"]
|
E --> E1["VxVM and VCS change flow"]
|
||||||
|
F --> F1["Spectrum Scale expansion flow"]
|
||||||
```
|
```
|
||||||
|
|
||||||
## Scripts
|
## Scripts
|
||||||
@@ -23,6 +25,7 @@ flowchart TD
|
|||||||
- `os-healthcheck/service_check.sh` - critical service status check.
|
- `os-healthcheck/service_check.sh` - critical service status check.
|
||||||
- `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
|
- `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
|
||||||
- `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
|
- `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
|
||||||
|
- `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics.
|
||||||
|
|
||||||
## Usage
|
## Usage
|
||||||
|
|
||||||
@@ -37,8 +40,22 @@ cd infra-run/scripts/bash/os-healthcheck
|
|||||||
./system_report.sh
|
./system_report.sh
|
||||||
./network_troubleshoot.sh
|
./network_troubleshoot.sh
|
||||||
./network_troubleshoot.sh google.com
|
./network_troubleshoot.sh google.com
|
||||||
|
|
||||||
|
cd ../incident-checks
|
||||||
|
./check_high_cpu.sh
|
||||||
|
./check_high_memory_oom.sh --since "24 hours ago"
|
||||||
|
./check_service_restart_loop.sh --service sshd
|
||||||
|
./check_certificate_expiry.sh --host example.com
|
||||||
```
|
```
|
||||||
|
|
||||||
|
## Standards
|
||||||
|
|
||||||
|
- Scripts use Bash and should keep `#!/usr/bin/env bash` plus strict mode.
|
||||||
|
- Read-only checks should report missing tools without hiding the problem.
|
||||||
|
- Change-capable scripts must default to dry-run behavior and require explicit `--execute`.
|
||||||
|
- Output should use `OK`, `WARNING`, and `CRITICAL` where practical.
|
||||||
|
- Validate changed scripts with `./scripts/check-bash.sh` from the repository root.
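The standards above could be combined into a script skeleton along the following lines. This is a minimal sketch, not the repository's implementation: `report`, `parse_args`, and `maybe_run` are hypothetical names chosen for illustration, and returning `2` for an unknown option is an assumption.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper names; the repository's own helpers may differ.
DRY_RUN="true"   # safe default: nothing changes unless --execute is passed

report() {
  # report LEVEL MESSAGE prints "LEVEL: MESSAGE"
  printf '%s: %s\n' "$1" "$2"
}

parse_args() {
  while (($# > 0)); do
    case "$1" in
      --execute) DRY_RUN="false"; shift ;;
      *) report CRITICAL "unknown option: $1"; return 2 ;;
    esac
  done
}

maybe_run() {
  # Print the planned command in dry-run mode; run it only after --execute.
  if [[ "$DRY_RUN" == "true" ]]; then
    report OK "DRY-RUN: $*"
  else
    report WARNING "EXECUTE: $*"
    "$@"
  fi
}
```

A script built this way would call `parse_args "$@"` once and then wrap each change-capable command in `maybe_run`, so a plain invocation only prints the plan.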
|
||||||
|
|
||||||
## Exit Codes
|
## Exit Codes
|
||||||
|
|
||||||
`disk_check.sh`:
|
`disk_check.sh`:
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}"
|
TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}"
|
||||||
DRY_RUN="${DRY_RUN:-true}"
|
DRY_RUN="${DRY_RUN:-true}"
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}"
|
TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}"
|
||||||
DRY_RUN="${DRY_RUN:-true}"
|
DRY_RUN="${DRY_RUN:-true}"
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
@@ -43,8 +41,13 @@ fi
|
|||||||
|
|
||||||
warning "Adding NSDs must be coordinated with storage, GPFS, application, and change-management teams."
|
warning "Adding NSDs must be coordinated with storage, GPFS, application, and change-management teams."
|
||||||
section "Planned GPFS changes"
|
section "Planned GPFS changes"
|
||||||
ok "DRY-RUN: mmcrnsd -F $NSD_STANZA"
|
if [[ "$DRY_RUN" == "true" ]]; then
|
||||||
ok "DRY-RUN: mmadddisk $FILESYSTEM -F $NSD_STANZA"
|
ok "DRY-RUN: mmcrnsd -F $NSD_STANZA"
|
||||||
|
ok "DRY-RUN: mmadddisk $FILESYSTEM -F $NSD_STANZA"
|
||||||
|
else
|
||||||
|
warning "EXECUTE: mmcrnsd -F $NSD_STANZA"
|
||||||
|
warning "EXECUTE: mmadddisk $FILESYSTEM -F $NSD_STANZA"
|
||||||
|
fi
|
||||||
|
|
||||||
confirm_execute "create NSDs and add disks to $FILESYSTEM"
|
confirm_execute "create NSDs and add disks to $FILESYSTEM"
|
||||||
|
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
@@ -48,7 +46,11 @@ if [[ "$BACKGROUND" == "true" ]]; then
|
|||||||
mmrestripefs "$FILESYSTEM" -b 2>&1 | tee -a "$LOG_FILE" &
|
mmrestripefs "$FILESYSTEM" -b 2>&1 | tee -a "$LOG_FILE" &
|
||||||
fi
|
fi
|
||||||
else
|
else
|
||||||
|
if [[ "$DRY_RUN" == "true" ]]; then
|
||||||
ok "DRY-RUN: mmrestripefs $FILESYSTEM -b"
|
ok "DRY-RUN: mmrestripefs $FILESYSTEM -b"
|
||||||
|
else
|
||||||
|
warning "EXECUTE: mmrestripefs $FILESYSTEM -b"
|
||||||
|
fi
|
||||||
confirm_execute "restripe for $FILESYSTEM"
|
confirm_execute "restripe for $FILESYSTEM"
|
||||||
if [[ "$DRY_RUN" == "false" ]]; then
|
if [[ "$DRY_RUN" == "false" ]]; then
|
||||||
run_cmd mmrestripefs "$FILESYSTEM" -b
|
run_cmd mmrestripefs "$FILESYSTEM" -b
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -1,6 +1,6 @@
|
|||||||
# GPFS / IBM Spectrum Scale Filesystem Expansion Toolkit
|
# GPFS / IBM Spectrum Scale Filesystem Expansion Toolkit
|
||||||
|
|
||||||
Safe, sanitized Bash examples for planning and executing a GPFS / IBM Spectrum Scale filesystem expansion. The scripts are written as portfolio-grade operational tooling for a Linux Infrastructure Engineer: conservative defaults, clear validation, dry-run behavior, and explicit operator confirmation before changes.
|
Safe, sanitized Bash examples for planning and executing a GPFS / IBM Spectrum Scale filesystem expansion. The scripts are written as readable operational examples for a Linux Infrastructure Engineer: conservative defaults, clear validation, dry-run behavior, and explicit operator confirmation before changes.
|
||||||
|
|
||||||
These scripts are examples. Exact GPFS commands, flags, quorum practices, failure-group design, and storage naming standards vary by Spectrum Scale version and site policy.
|
These scripts are examples. Exact GPFS commands, flags, quorum practices, failure-group design, and storage naming standards vary by Spectrum Scale version and site policy.
|
||||||
|
|
||||||
|
|||||||
@@ -1,7 +1,5 @@
|
|||||||
#!/usr/bin/env bash
|
#!/usr/bin/env bash
|
||||||
set -o errexit
|
set -euo pipefail
|
||||||
set -o nounset
|
|
||||||
set -o pipefail
|
|
||||||
|
|
||||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
# shellcheck source=00_env.sh
|
# shellcheck source=00_env.sh
|
||||||
|
|||||||
@@ -0,0 +1,124 @@
|
|||||||
|
# Bash Incident Checks
|
||||||
|
|
||||||
|
Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident and run without repository context, with their output pasted into an incident or change ticket as evidence.
|
||||||
|
|
||||||
|
They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.
|
||||||
|
|
||||||
|
## Scripts
|
||||||
|
|
||||||
|
- `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes.
|
||||||
|
- `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence.
|
||||||
|
- `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines.
|
||||||
|
- `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs.
|
||||||
|
- `check_certificate_expiry.sh` - remote or local TLS certificate expiry check.
|
||||||
|
- `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints.
|
||||||
|
- `check_ntp_time_drift.sh` - time sync status and offset evidence when available.
|
||||||
|
- `check_filesystem_readonly.sh` - read-only filesystem detection.
|
||||||
|
- `check_inode_usage.sh` - inode pressure and top affected mount points.
|
||||||
|
- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.
|
||||||
|
- `incident_triage_report.sh` - wrapper that runs selected checks and writes a single Markdown L2 handover report.
|
||||||
|
|
||||||
|
## Usage Examples
|
||||||
|
|
||||||
|
```bash
|
||||||
|
./check_high_cpu.sh
|
||||||
|
./check_high_cpu.sh --warning 70 --critical 90 --top 15
|
||||||
|
|
||||||
|
./check_high_memory_oom.sh
|
||||||
|
./check_high_memory_oom.sh --since "6 hours ago" --top 5
|
||||||
|
|
||||||
|
./check_service_restart_loop.sh --service nginx
|
||||||
|
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"
|
||||||
|
|
||||||
|
./check_failed_ssh_logins.sh
|
||||||
|
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25
|
||||||
|
|
||||||
|
./check_certificate_expiry.sh --host example.com
|
||||||
|
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
|
||||||
|
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt
|
||||||
|
|
||||||
|
./check_dns_connectivity.sh --host example.com
|
||||||
|
./check_dns_connectivity.sh --host db.example.internal --port 5432
|
||||||
|
|
||||||
|
./check_ntp_time_drift.sh
|
||||||
|
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000
|
||||||
|
|
||||||
|
./check_filesystem_readonly.sh
|
||||||
|
./check_filesystem_readonly.sh --include-system
|
||||||
|
|
||||||
|
./check_inode_usage.sh
|
||||||
|
./check_inode_usage.sh --warning 75 --critical 90
|
||||||
|
|
||||||
|
./check_jvm_threads_heap.sh
|
||||||
|
./check_jvm_threads_heap.sh --pid 1234
|
||||||
|
./check_jvm_threads_heap.sh --match app-name
|
||||||
|
|
||||||
|
./incident_triage_report.sh --type cpu
|
||||||
|
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
|
||||||
|
./incident_triage_report.sh --type network --host app.example.com --port 443
|
||||||
|
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
|
||||||
|
```
|
||||||
|
|
||||||
|
## L2 Triage Report Wrapper
|
||||||
|
|
||||||
|
`incident_triage_report.sh` collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.
|
||||||
|
|
||||||
|
Supported report types are `cpu`, `memory`, `service`, `network`, `auth`, `cert`, `filesystem`, `jvm`, and `all`.
|
||||||
|
|
||||||
|
The wrapper is read-only apart from writing the requested `--output` file. It does not require root and skips a check safely when the underlying script is missing or not executable, or when required context such as `--service` or `--host` is not supplied.
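The skip conditions described above can be sketched as a small pre-flight function. This is an illustration under assumptions, not the wrapper's actual code: `skip_reason` is a hypothetical name, and the message wording is invented for the example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# skip_reason SCRIPT [OPTION_NAME OPTION_VALUE]
# Prints why a check should be skipped; prints nothing when it can run.
# Hypothetical helper, not taken from incident_triage_report.sh.
skip_reason() {
  local script="$1" name="${2:-}" value="${3:-}"
  if [[ ! -e "$script" ]]; then
    printf 'script is missing\n'
  elif [[ ! -x "$script" ]]; then
    printf 'script is not executable\n'
  elif [[ -n "$name" && -z "$value" ]]; then
    printf '%s was not provided\n' "$name"
  fi
}
```

A caller could run the check only when `skip_reason` prints nothing and otherwise record the reason in the report, which keeps a missing script from aborting the whole run.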
|
||||||
|
|
||||||
|
## Exit Codes
|
||||||
|
|
||||||
|
- `0` - OK.
|
||||||
|
- `1` - WARNING or operational issue detected.
|
||||||
|
- `2` - invalid input or missing required dependency.
|
||||||
|
- `3` - CRITICAL issue detected.
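A caller that runs under `errexit` needs to capture these codes rather than abort on the first non-zero status. The sketch below shows one way to do that. `describe_exit_code` and `run_check` are hypothetical names, and the label strings are assumptions based on the list above.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Map an incident-check exit code to the meaning documented above.
describe_exit_code() {
  case "$1" in
    0) printf 'OK\n' ;;
    1) printf 'WARNING\n' ;;
    2) printf 'INVALID_INPUT_OR_MISSING_DEPENDENCY\n' ;;
    3) printf 'CRITICAL\n' ;;
    *) printf 'UNKNOWN\n' ;;
  esac
}

# Run a check without letting errexit abort the caller on a non-zero status.
run_check() {
  local rc=0
  "$@" || rc=$?
  printf '%s -> exit %s (%s)\n' "$1" "$rc" "$(describe_exit_code "$rc")"
}
```

For example, `run_check ./check_high_cpu.sh --warning 70 --critical 90` would print the check's own output followed by a one-line summary of its exit code.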
|
||||||
|
|
||||||
|
## Supported Platforms
|
||||||
|
|
||||||
|
These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.
|
||||||
|
|
||||||
|
Some data sources vary by distribution:
|
||||||
|
|
||||||
|
- RHEL-like systems often use `/var/log/secure` and `/var/log/messages`.
|
||||||
|
- Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`.
|
||||||
|
- systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available.
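Because log locations differ between distributions, a check can probe a list of candidate paths and use the first one it can read. The following is a minimal sketch of that idea. `pick_first_readable` is a hypothetical helper, not a function from these scripts.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print the first readable path among the arguments.
# Returns 1 when none of the candidates is readable.
pick_first_readable() {
  local candidate
  for candidate in "$@"; do
    if [[ -r "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Example call (paths taken from the list above):
#   auth_log="$(pick_first_readable /var/log/auth.log /var/log/secure || true)"
```

The `|| true` in the example call keeps `errexit` from aborting when neither file is readable, so the script can report the missing source itself.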
|
||||||
|
|
||||||
|
## Safety Notes
|
||||||
|
|
||||||
|
- Scripts are read-only.
|
||||||
|
- Scripts do not restart services, kill processes, remount filesystems, or change time services. The only persistent file written is the report that `incident_triage_report.sh` saves to the requested `--output` path.
|
||||||
|
- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
|
||||||
|
- Treat output as triage evidence, not as complete root-cause analysis.
|
||||||
|
|
||||||
|
## Dependency Notes
|
||||||
|
|
||||||
|
Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`.
|
||||||
|
|
||||||
|
Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files.
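A runtime dependency check of this kind can be sketched with `command -v`. The helper name `missing_commands` is hypothetical, and treating required and optional tools differently is left to the caller.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print each command name from the arguments that is not available on PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: fail on missing required tools, only warn about optional ones.
#   required_missing="$(missing_commands awk grep openssl)"
#   [[ -z "$required_missing" ]] || { printf 'CRITICAL: missing: %s\n' "$required_missing"; exit 2; }
#   optional_missing="$(missing_commands journalctl chronyc)"
#   [[ -z "$optional_missing" ]] || printf 'WARNING: optional tools missing: %s\n' "$optional_missing"
```

Reporting the missing names keeps the limitation visible in the ticket evidence.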
|
||||||
|
|
||||||
|
## Copy-To-Server Example
|
||||||
|
|
||||||
|
```bash
|
||||||
|
scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
|
||||||
|
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
|
||||||
|
```
|
||||||
|
|
||||||
|
Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
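One way to keep the evidence, the command line, and the exit code together is to capture them into a single file on the operator's workstation. This is a sketch under assumptions: `capture_evidence` is a hypothetical helper and the file layout is invented for the example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# capture_evidence OUTFILE COMMAND [ARGS...]
# Writes the command line, a UTC timestamp, the combined output,
# and the exit code of the command to OUTFILE.
capture_evidence() {
  local outfile="$1"
  shift
  local rc=0
  {
    printf 'Command: %s\n' "$*"
    printf 'Captured: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    "$@" 2>&1 || rc=$?
    printf 'Exit code: %s\n' "$rc"
  } > "$outfile"
}
```

The same function can wrap the `ssh` call from the example above, so the saved file shows which command was run, when, and how it exited.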
|
||||||
|
|
||||||
|
## Sample Outputs
|
||||||
|
|
||||||
|
Sanitized examples are available in [examples](./examples/):
|
||||||
|
|
||||||
|
- `high-cpu.sample.txt`
|
||||||
|
- `high-memory-oom.sample.txt`
|
||||||
|
- `service-restart-loop.sample.txt`
|
||||||
|
- `failed-ssh-logins.sample.txt`
|
||||||
|
- `certificate-expiry.sample.txt`
|
||||||
|
- `dns-connectivity.sample.txt`
|
||||||
|
- `ntp-time-drift.sample.txt`
|
||||||
|
- `filesystem-readonly.sample.txt`
|
||||||
|
- `inode-usage.sample.txt`
|
||||||
|
- `jvm-threads-heap.sample.txt`
|
||||||
|
|
||||||
|
A sanitized report sample is available at [../../../examples/incident-triage/l2-incident-triage-report.sample.md](../../../examples/incident-triage/l2-incident-triage-report.sample.md).
|
||||||
@@ -0,0 +1,134 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
host_name=""
|
||||||
|
port=443
|
||||||
|
cert_file=""
|
||||||
|
warning_days=30
|
||||||
|
critical_days=7
|
||||||
|
servername=""
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
cat <<'USAGE'
|
||||||
|
Usage: check_certificate_expiry.sh (--host HOST [--port PORT] | --file CERT_FILE) [--servername SNI_NAME] [--warning-days DAYS] [--critical-days DAYS] [--help]
|
||||||
|
|
||||||
|
Check TLS certificate expiry for a remote endpoint or local certificate file.
|
||||||
|
USAGE
|
||||||
|
}
|
||||||
|
|
||||||
|
is_number() {
|
||||||
|
[[ "$1" =~ ^[0-9]+$ ]]
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
|
||||||
|
--port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
|
||||||
|
--file) [[ $# -ge 2 ]] || { printf 'CRITICAL: --file requires a value\n'; exit 2; }; cert_file="$2"; shift 2 ;;
|
||||||
|
--servername) [[ $# -ge 2 ]] || { printf 'CRITICAL: --servername requires a value\n'; exit 2; }; servername="$2"; shift 2 ;;
|
||||||
|
--warning-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-days requires a value\n'; exit 2; }; warning_days="$2"; shift 2 ;;
|
||||||
|
--critical-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-days requires a value\n'; exit 2; }; critical_days="$2"; shift 2 ;;
|
||||||
|
--help|-h) usage; exit 0 ;;
|
||||||
|
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
if ! command -v openssl >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: required command not found: openssl\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
for value in "$port" "$warning_days" "$critical_days"; do
|
||||||
|
if ! is_number "$value"; then
|
||||||
|
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
if ((critical_days >= warning_days)); then
|
||||||
|
printf 'CRITICAL: --critical-days must be lower than --warning-days\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if [[ -n "$host_name" && -n "$cert_file" ]]; then
|
||||||
|
printf 'CRITICAL: use either --host or --file, not both\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if [[ -z "$host_name" && -z "$cert_file" ]]; then
|
||||||
|
printf 'CRITICAL: either --host or --file is required\n'
|
||||||
|
usage
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if [[ -n "$cert_file" && ! -r "$cert_file" ]]; then
|
||||||
|
printf 'CRITICAL: certificate file is not readable: %s\n' "$cert_file"
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if [[ -z "$servername" ]]; then
|
||||||
|
servername="$host_name"
|
||||||
|
fi
|
||||||
|
|
||||||
|
tmp_cert="$(mktemp)"
|
||||||
|
trap 'rm -f "$tmp_cert"' EXIT
|
||||||
|
|
||||||
|
if [[ -n "$host_name" ]]; then
|
||||||
|
if ! openssl s_client -connect "${host_name}:${port}" -servername "$servername" -showcerts </dev/null 2>/dev/null \
|
||||||
|
| openssl x509 -outform PEM > "$tmp_cert" 2>/dev/null; then
|
||||||
|
printf 'CRITICAL: unable to retrieve certificate from %s:%s\n' "$host_name" "$port"
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
cp "$cert_file" "$tmp_cert"
|
||||||
|
fi
|
||||||
|
|
||||||
|
subject="$(openssl x509 -in "$tmp_cert" -noout -subject 2>/dev/null | sed 's/^subject=//')"
|
||||||
|
issuer="$(openssl x509 -in "$tmp_cert" -noout -issuer 2>/dev/null | sed 's/^issuer=//')"
|
||||||
|
not_before="$(openssl x509 -in "$tmp_cert" -noout -startdate 2>/dev/null | sed 's/^notBefore=//')"
|
||||||
|
not_after="$(openssl x509 -in "$tmp_cert" -noout -enddate 2>/dev/null | sed 's/^notAfter=//')"
|
||||||
|
san_text="$(openssl x509 -in "$tmp_cert" -noout -ext subjectAltName 2>/dev/null | sed '1d' | sed 's/^ *//')"
|
||||||
|
|
||||||
|
expiry_epoch="$(date -d "$not_after" +%s 2>/dev/null || printf '')"
|
||||||
|
now_epoch="$(date +%s)"
|
||||||
|
if [[ -z "$expiry_epoch" ]]; then
|
||||||
|
printf 'CRITICAL: unable to parse certificate expiry date: %s\n' "$not_after"
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
seconds_left=$((expiry_epoch - now_epoch))
|
||||||
|
days_left=$(( (seconds_left - (seconds_left < 0 ? 86399 : 0)) / 86400 ))
|
||||||
|
|
||||||
|
status="OK"
|
||||||
|
exit_code=0
|
||||||
|
if ((days_left < critical_days)); then
|
||||||
|
status="CRITICAL"
|
||||||
|
exit_code=3
|
||||||
|
elif ((days_left < warning_days)); then
|
||||||
|
status="WARNING"
|
||||||
|
exit_code=1
|
||||||
|
fi
|
||||||
|
|
||||||
|
target="$cert_file"
|
||||||
|
if [[ -n "$host_name" ]]; then
|
||||||
|
target="${host_name}:${port}"
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '%s: Certificate for %s expires in %s day(s)\n\n' "$status" "$target" "$days_left"
|
||||||
|
|
||||||
|
printf 'Certificate details:\n'
|
||||||
|
printf 'Subject: %s\n' "$subject"
|
||||||
|
printf 'Issuer: %s\n' "$issuer"
|
||||||
|
printf 'notBefore: %s\n' "$not_before"
|
||||||
|
printf 'notAfter: %s\n' "$not_after"
|
||||||
|
printf 'SAN/CN: %s\n' "${san_text:-$subject}"
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf 'Evidence:\n'
|
||||||
|
printf 'Target: %s\n' "$target"
|
||||||
|
printf 'SNI: %s\n' "${servername:-not used}"
|
||||||
|
printf 'Thresholds: warning=%s days critical=%s days\n\n' "$warning_days" "$critical_days"
|
||||||
|
|
||||||
|
printf 'Recommended next steps:\n'
|
||||||
|
printf -- '- Renew certificate before the operational threshold is breached\n'
|
||||||
|
printf -- '- Check the full chain and intermediate certificates\n'
|
||||||
|
printf -- '- Check the load balancer, ingress, or reverse proxy serving this certificate\n'
|
||||||
|
printf -- '- Verify monitoring threshold and alert ownership\n'
|
||||||
|
printf -- '- Attach this output to the incident or change ticket\n'
|
||||||
|
|
||||||
|
exit "$exit_code"
|
||||||
@@ -0,0 +1,161 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
host_name=""
|
||||||
|
port=""
|
||||||
|
count=3
|
||||||
|
timeout_seconds=3
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
cat <<'USAGE'
|
||||||
|
Usage: check_dns_connectivity.sh --host HOST [--port PORT] [--count COUNT] [--timeout SECONDS] [--help]
|
||||||
|
|
||||||
|
Check DNS resolution, ping, optional TCP connectivity, and local route hints.
|
||||||
|
USAGE
|
||||||
|
}
|
||||||
|
|
||||||
|
is_number() {
|
||||||
|
[[ "$1" =~ ^[0-9]+$ ]]
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
|
||||||
|
--port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
|
||||||
|
--count) [[ $# -ge 2 ]] || { printf 'CRITICAL: --count requires a value\n'; exit 2; }; count="$2"; shift 2 ;;
|
||||||
|
--timeout) [[ $# -ge 2 ]] || { printf 'CRITICAL: --timeout requires a value\n'; exit 2; }; timeout_seconds="$2"; shift 2 ;;
|
||||||
|
--help|-h) usage; exit 0 ;;
|
||||||
|
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ -z "$host_name" ]]; then
|
||||||
|
printf 'CRITICAL: --host is required\n'
|
||||||
|
usage
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
for value in "$count" "$timeout_seconds"; do
|
||||||
|
if ! is_number "$value"; then
|
||||||
|
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
if [[ -n "$port" ]] && ! is_number "$port"; then
|
||||||
|
printf 'CRITICAL: --port must be numeric\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if ! command -v getent >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: required command not found: getent\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
dns_ok=0
|
||||||
|
ping_ok=0
|
||||||
|
tcp_ok=0
|
||||||
|
tcp_checked=0
|
||||||
|
tcp_note=""
|
||||||
|
ping_output="$(mktemp)"
|
||||||
|
trap 'rm -f "$ping_output"' EXIT
|
||||||
|
|
||||||
|
dns_output="$(getent hosts "$host_name" 2>/dev/null || true)"
|
||||||
|
if [[ -n "$dns_output" ]]; then
|
||||||
|
dns_ok=1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if command -v ping >/dev/null 2>&1; then
|
||||||
|
if ping -c "$count" -W "$timeout_seconds" "$host_name" > "$ping_output" 2>&1; then
|
||||||
|
ping_ok=1
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
printf 'WARNING: ping command not available; ICMP check skipped\n' > "$ping_output"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ -n "$port" ]]; then
|
||||||
|
tcp_checked=1
|
||||||
|
if command -v timeout >/dev/null 2>&1; then
|
||||||
|
if timeout "$timeout_seconds" bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
|
||||||
|
tcp_ok=1
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
tcp_note="WARNING: timeout command not available; TCP /dev/tcp check used without external timeout"
|
||||||
|
if bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
|
||||||
|
tcp_ok=1
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
status="OK"
|
||||||
|
exit_code=0
|
||||||
|
if ((dns_ok == 0)); then
|
||||||
|
status="CRITICAL"
|
||||||
|
exit_code=3
|
||||||
|
elif ((tcp_checked == 1 && tcp_ok == 0)); then
|
||||||
|
status="CRITICAL"
|
||||||
|
exit_code=3
|
||||||
|
elif command -v ping >/dev/null 2>&1 && ((ping_ok == 0)); then
|
||||||
|
status="WARNING"
|
||||||
|
exit_code=1
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '%s: DNS=%s ping=%s' "$status" "$([[ "$dns_ok" == 1 ]] && printf OK || printf FAILED)" "$([[ "$ping_ok" == 1 ]] && printf OK || printf UNKNOWN_OR_FAILED)"
|
||||||
|
if ((tcp_checked == 1)); then
|
||||||
|
printf ' tcp_%s=%s' "$port" "$([[ "$tcp_ok" == 1 ]] && printf OK || printf FAILED)"
|
||||||
|
fi
|
||||||
|
printf '\n\n'
|
||||||
|
|
||||||
|
printf 'DNS result:\n'
|
||||||
|
if [[ -n "$dns_output" ]]; then
|
||||||
|
printf '%s\n' "$dns_output"
|
||||||
|
else
|
||||||
|
printf 'CRITICAL: getent hosts returned no records for %s\n' "$host_name"
|
||||||
|
fi
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf 'Ping result:\n'
|
||||||
|
if [[ -s "$ping_output" ]]; then
|
||||||
|
cat "$ping_output"
|
||||||
|
else
|
||||||
|
printf 'WARNING: ping result unavailable or ping command missing\n'
|
||||||
|
fi
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
if ((tcp_checked == 1)); then
|
||||||
|
printf 'TCP port result:\n'
|
||||||
|
if ((tcp_ok == 1)); then
|
||||||
|
printf 'OK: TCP connection to %s:%s succeeded\n' "$host_name" "$port"
|
||||||
|
else
|
||||||
|
printf 'CRITICAL: TCP connection to %s:%s failed or timed out\n' "$host_name" "$port"
|
||||||
|
fi
|
||||||
|
if [[ -n "$tcp_note" ]]; then
|
||||||
|
printf '%s\n' "$tcp_note"
|
||||||
|
fi
|
||||||
|
printf '\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf 'Local network hints:\n'
|
||||||
|
if command -v ip >/dev/null 2>&1; then
|
||||||
|
ip route show default 2>/dev/null || printf 'WARNING: unable to read default route\n'
|
||||||
|
elif command -v ss >/dev/null 2>&1; then
|
||||||
|
ss -tuln 2>/dev/null | head -n 20 || printf 'WARNING: unable to read socket summary\n'
|
||||||
|
else
|
||||||
|
printf 'WARNING: ip and ss are unavailable; local network hints skipped\n'
|
||||||
|
fi
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf 'Evidence:\n'
|
||||||
|
printf 'Host: %s count=%s timeout=%ss port=%s\n' "$host_name" "$count" "$timeout_seconds" "${port:-not checked}"
|
||||||
|
if [[ -n "$tcp_note" ]]; then
|
||||||
|
printf '%s\n' "$tcp_note"
|
||||||
|
fi
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf 'Recommended next steps:\n'
|
||||||
|
printf -- '- Verify the DNS record and resolver path\n'
|
||||||
|
printf -- '- Check firewall, routing, security group, or proxy policy\n'
|
||||||
|
printf -- '- Compare results from another host or network segment\n'
|
||||||
|
printf -- '- Check application endpoint health after network reachability is confirmed\n'
|
||||||
|
printf -- '- Attach this output to the incident ticket\n'
|
||||||
|
|
||||||
|
exit "$exit_code"
|
||||||
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

since_value="1 hour ago"
warning_count=20
critical_count=50
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_failed_ssh_logins.sh [--since TEXT] [--warning COUNT] [--critical COUNT] [--top N] [--help]

Detect failed SSH login bursts from journal or readable authentication logs.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_count" "$critical_count" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

tmp_log="$(mktemp)"
trap 'rm -f "$tmp_log"' EXIT
log_source="journalctl"

if command -v journalctl >/dev/null 2>&1; then
  journalctl --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' > "$tmp_log" || true
else
  log_source="log file fallback"
fi

if [[ ! -s "$tmp_log" ]]; then
  for log_file in /var/log/auth.log /var/log/secure /var/log/messages; do
    if [[ -r "$log_file" ]]; then
      grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' "$log_file" >> "$tmp_log" || true
      log_source="$log_file"
    fi
  done
fi

attempts="$(wc -l < "$tmp_log" | awk '{print $1}')"

status="OK"
exit_code=0
if ((attempts >= critical_count)); then
  status="CRITICAL"
  exit_code=3
elif ((attempts >= warning_count)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: Found %s failed SSH login attempt(s) for the requested window\n\n' "$status" "$attempts"

printf 'Top source IPs:\n'
if [[ -s "$tmp_log" ]]; then
  grep -Eo 'from ([0-9]{1,3}\.){3}[0-9]{1,3}|rhost=([0-9]{1,3}\.){3}[0-9]{1,3}' "$tmp_log" \
    | sed -E 's/^(from |rhost=)//' \
    | sort | uniq -c | sort -rn | head -n "$top_count" || true
else
  printf 'OK: no failed SSH attempts found in available sources\n'
fi
printf '\n'

printf 'Top attempted users:\n'
if [[ -s "$tmp_log" ]]; then
  sed -nE 's/.*Invalid user ([^ ]+).*/\1/p; s/.*Failed password for invalid user ([^ ]+).*/\1/p; s/.*Failed password for ([^ ]+).*/\1/p; s/.*user=([^ ]+).*/\1/p' "$tmp_log" \
    | sort | uniq -c | sort -rn | head -n "$top_count" || true
else
  printf 'OK: no attempted users extracted\n'
fi
printf '\n'

printf 'Sample recent lines:\n'
if [[ -s "$tmp_log" ]]; then
  tail -n "$top_count" "$tmp_log"
else
  printf 'OK: no sample lines available\n'
fi
printf '\n\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s critical=%s since="%s"\n' "$warning_count" "$critical_count" "$since_value"
printf 'Log source: %s\n' "$log_source"
if [[ "$log_source" != "journalctl" ]]; then
  printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; authentication log visibility may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Verify source IPs against expected scanners, admins, or automation\n'
printf -- '- Check firewall, fail2ban, or security tooling state\n'
printf -- '- Confirm whether the attempts are expected for this host\n'
printf -- '- Review successful logins too, not only failures\n'
printf -- '- Attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

include_system=0

usage() {
  cat <<'USAGE'
Usage: check_filesystem_readonly.sh [--include-system] [--help]

Detect filesystems mounted read-only. This check only reads mount state and does not remount anything.
USAGE
}

while (($# > 0)); do
  case "$1" in
    --include-system) include_system=1; shift ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

tmp_mounts="$(mktemp)"
trap 'rm -f "$tmp_mounts"' EXIT

if command -v findmnt >/dev/null 2>&1; then
  findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > "$tmp_mounts" 2>/dev/null || true
elif command -v mount >/dev/null 2>&1; then
  mount | awk '{ source=$1; target=$3; type=$5; opts=$6; gsub(/[()]/, "", opts); print target, source, type, opts }' > "$tmp_mounts"
else
  printf 'CRITICAL: findmnt or mount is required\n'
  exit 2
fi

tmp_ro="$(mktemp)"
trap 'rm -f "$tmp_mounts" "$tmp_ro"' EXIT

awk -v include_system="$include_system" '
  function system_fs(type, target) {
    return type ~ /^(proc|sysfs|tmpfs|devtmpfs|devpts|securityfs|cgroup|cgroup2|pstore|bpf|tracefs|debugfs|configfs|fusectl|mqueue|hugetlbfs|overlay|squashfs|autofs)$/ || target ~ /^\/(proc|sys|dev|run)(\/|$)/
  }
  {
    target=$1; source=$2; type=$3; opts=$4
    if (opts ~ /(^|,)ro(,|$)/) {
      if (include_system == 1 || ! system_fs(type, target)) {
        print target "\t" source "\t" type "\t" opts
      }
    }
  }
' "$tmp_mounts" > "$tmp_ro"

readonly_count="$(wc -l < "$tmp_ro" | awk '{print $1}')"
status="OK"
exit_code=0
if ((readonly_count > 0)); then
  status="CRITICAL"
  exit_code=3
fi

printf '%s: Found %s read-only filesystem(s)\n\n' "$status" "$readonly_count"

printf 'Read-only filesystems:\n'
if [[ -s "$tmp_ro" ]]; then
  printf 'MOUNT_POINT\tSOURCE\tFSTYPE\tOPTIONS\n'
  cat "$tmp_ro"
else
  printf 'OK: no read-only filesystems found with current filters\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'include_system=%s\n' "$include_system"
printf 'Collector: '
if command -v findmnt >/dev/null 2>&1; then
  printf 'findmnt\n'
else
  printf 'mount fallback\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check dmesg or journal logs for I/O errors and filesystem remount events\n'
printf -- '- Check storage path, multipath, SAN, cloud volume, or underlying disk health\n'
printf -- '- Check filesystem health with the platform-approved procedure\n'
printf -- '- Do not remount read-write before understanding the cause\n'
printf -- '- Attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,151 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=75
critical_threshold=90
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_high_cpu.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]

Detect high CPU load and show top CPU-consuming processes.

Exit codes:
0 OK
1 WARNING / operational issue detected
2 invalid input / missing required dependency
3 CRITICAL issue detected
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --warning)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }
      warning_threshold="$2"
      shift 2
      ;;
    --critical)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }
      critical_threshold="$2"
      shift 2
      ;;
    --top)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }
      top_count="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1"
      usage
      exit 2
      ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done

if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd ps
require_cmd awk
require_cmd head

cpu_count=1
if command -v getconf >/dev/null 2>&1; then
  cpu_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || printf '1')"
elif [[ -r /proc/cpuinfo ]]; then
  cpu_count="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || printf '1')"
fi
[[ "$cpu_count" =~ ^[0-9]+$ ]] || cpu_count=1
((cpu_count > 0)) || cpu_count=1

load_1m="unavailable"
load_5m="unavailable"
load_15m="unavailable"
load_per_cpu_pct=0
if [[ -r /proc/loadavg ]]; then
  read -r load_1m load_5m load_15m _ < /proc/loadavg
elif command -v uptime >/dev/null 2>&1; then
  load_line="$(uptime 2>/dev/null || true)"
  load_1m="$(printf '%s\n' "$load_line" | sed -n 's/.*load average[s]*: *\([^, ]*\).*/\1/p')"
fi
# Compute the percentage for whichever source supplied a numeric 1-minute load.
if [[ "$load_1m" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
  load_per_cpu_pct="$(awk -v load_avg="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"
else
  load_1m="unavailable"
fi

status="OK"
exit_code=0
if ((load_per_cpu_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
elif ((load_per_cpu_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: 1-minute load is %s across %s CPU(s) (%s%% of CPU count)\n\n' "$status" "$load_1m" "$cpu_count" "$load_per_cpu_pct"

printf 'Load average:\n'
printf '1m=%s 5m=%s 15m=%s\n\n' "$load_1m" "$load_5m" "$load_15m"

printf 'CPU count:\n'
printf '%s\n\n' "$cpu_count"

printf 'Top CPU processes:\n'
ps -eo pid,ppid,user,pcpu,pmem,comm,args --sort=-pcpu | head -n "$((top_count + 1))" || true
printf '\n'

printf 'Evidence:\n'
if command -v uptime >/dev/null 2>&1; then
  uptime || true
else
  printf 'WARNING: uptime command not available; used /proc/loadavg where possible\n'
fi
if ((load_per_cpu_pct >= 100)); then
  printf 'WARNING: load is higher than online CPU count; runnable task saturation is possible\n'
else
  printf 'OK: load is not above online CPU count at collection time\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; process ownership details are usually available, but some command lines may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check process ownership and whether the top process is expected\n'
printf -- '- Check recent deployments, cron jobs, batch jobs, or maintenance activity\n'
printf -- '- Review logs for the top CPU-consuming process\n'
printf -- '- Compare with longer trend data from monitoring before taking action\n'
printf -- '- Attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,138 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=80
critical_threshold=90
since_value="24 hours ago"
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_high_memory_oom.sh [--warning PERCENT] [--critical PERCENT] [--since TEXT] [--top N] [--help]

Detect high memory or swap usage and show recent OOM killer evidence.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd free
require_cmd ps
require_cmd awk
require_cmd head

read -r mem_total mem_used swap_total swap_used < <(free -m | awk '
  /^Mem:/ { mt=$2; mu=$3 }
  /^Swap:/ { st=$2; su=$3 }
  END { printf "%d %d %d %d\n", mt, mu, st, su }
')

mem_pct=0
swap_pct=0
if ((mem_total > 0)); then
  mem_pct=$((mem_used * 100 / mem_total))
fi
if ((swap_total > 0)); then
  swap_pct=$((swap_used * 100 / swap_total))
fi

status="OK"
exit_code=0
if ((mem_pct >= critical_threshold || swap_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
elif ((mem_pct >= warning_threshold || swap_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: Memory usage is %s%% and swap usage is %s%%\n\n' "$status" "$mem_pct" "$swap_pct"

printf 'Memory summary:\n'
free -m
printf '\n'

printf 'Top memory processes:\n'
printf 'PID RSS_MB COMMAND\n'
ps -eo pid=,rss=,comm= --sort=-rss | head -n "$top_count" | awk '{ printf "%-7s %-8d %s\n", $1, int($2 / 1024), $3 }' || true
printf '\n'

printf 'OOM events since %s:\n' "$since_value"
oom_found=0
oom_source="journalctl"
if command -v journalctl >/dev/null 2>&1; then
  if journalctl --since "$since_value" -k --no-pager 2>/dev/null | grep -Ei 'out of memory|oom-killer|killed process' | tail -n 20; then
    oom_found=1
  fi
else
  printf 'WARNING: journalctl not available; checking readable log files\n'
  oom_source="log file fallback"
fi
if ((oom_found == 0)); then
  for log_file in /var/log/messages /var/log/syslog /var/log/kern.log; do
    if [[ -r "$log_file" ]]; then
      if grep -Ei 'out of memory|oom-killer|killed process' "$log_file" | tail -n 20; then
        oom_found=1
        oom_source="$log_file"
        break
      fi
    fi
  done
fi
if ((oom_found == 0)); then
  printf 'OK: no OOM evidence found in available sources\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%% since="%s"\n' "$warning_threshold" "$critical_threshold" "$since_value"
printf 'OOM evidence source: %s\n' "$oom_source"
if [[ "$oom_source" != "journalctl" ]]; then
  printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; kernel logs or process details may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check application memory trend\n'
printf -- '- Review JVM heap settings if the process is Java\n'
printf -- '- Verify swap pressure and paging activity\n'
printf -- '- Confirm whether OOM events align with application impact\n'
printf -- '- Attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=80
critical_threshold=90
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_inode_usage.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]

Detect inode exhaustion using df -i.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi
if ! command -v df >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: df\n'
  exit 2
fi

tmp_df="$(mktemp)"
tmp_alerts="$(mktemp)"
trap 'rm -f "$tmp_df" "$tmp_alerts"' EXIT

df -Pi > "$tmp_df"
awk -v warn="$warning_threshold" '
  NR > 1 {
    pct=$5
    gsub(/%/, "", pct)
    if (pct + 0 >= warn + 0) {
      print $0
    }
  }
' "$tmp_df" > "$tmp_alerts"

max_pct="$(awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); pct += 0; if (pct > max) max=pct } END { printf "%d", max }' "$tmp_df")"
status="OK"
exit_code=0
if ((max_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
elif ((max_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: Highest inode usage is %s%%\n\n' "$status" "$max_pct"

printf 'Filesystems above threshold:\n'
if [[ -s "$tmp_alerts" ]]; then
  cat "$tmp_alerts"
else
  printf 'OK: no filesystems above warning threshold\n'
fi
printf '\n'

printf 'Inode usage table:\n'
cat "$tmp_df"
printf '\n'

printf 'Top affected mount points:\n'
awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); print pct, $6, $1, $2, $3, $4 }' "$tmp_df" \
  | sort -rn | head -n "$top_count" \
  | awk '{ printf "%s%% %s %s inodes=%s used=%s free=%s\n", $1, $2, $3, $4, $5, $6 }'
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%%\n\n' "$warning_threshold" "$critical_threshold"

printf 'Recommended next steps:\n'
printf -- '- Find directories with many small files under affected mount points\n'
printf -- '- Check logs, cache, spool, session, and temporary directories\n'
printf -- '- Avoid deleting blindly; confirm ownership and application impact first\n'
printf -- '- Confirm whether inode exhaustion is causing write or deploy failures\n'
printf -- '- Attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

target_pid=""
match_string=""
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_jvm_threads_heap.sh [--pid PID | --match STRING] [--top N] [--help]

Provide lightweight JVM process diagnostics. Does not create heap dumps or modify processes.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --pid) [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }; target_pid="$2"; shift 2 ;;
    --match) [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }; match_string="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -n "$target_pid" && -n "$match_string" ]]; then
  printf 'CRITICAL: use either --pid or --match, not both\n'
  exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
  printf 'CRITICAL: --pid must be numeric\n'
  exit 2
fi
if ! is_number "$top_count"; then
  printf 'CRITICAL: --top must be numeric\n'
  exit 2
fi
if ! command -v ps >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: ps\n'
  exit 2
fi

tmp_java="$(mktemp)"
trap 'rm -f "$tmp_java"' EXIT

ps -eo pid=,user=,rss=,pcpu=,comm=,args= \
  | awk 'tolower($0) ~ /java/ && $1 != "" { print }' > "$tmp_java"

if [[ -z "$target_pid" && -n "$match_string" ]]; then
  target_pid="$(grep -F "$match_string" "$tmp_java" | awk 'NR == 1 { print $1 }' || true)"
fi

if [[ -z "$target_pid" ]]; then
  detected_count="$(wc -l < "$tmp_java" | awk '{print $1}')"
  if ((detected_count == 0)); then
    printf 'WARNING: No Java processes detected\n\n'
  else
    printf 'WARNING: Detected %s Java process(es) but no target selected; rerun with --pid PID for heap detail\n\n' "$detected_count"
  fi
  printf 'Detected JVM processes:\n'
  printf 'PID USER RSS_MB CPU COMMAND\n'
  awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' "$tmp_java" | head -n "$top_count"
  printf '\nRecommended next steps:\n'
  printf -- '- Select a JVM process with --pid for focused diagnostics\n'
  printf -- '- Review GC logs and application logs for the selected process\n'
  printf -- '- Check heap sizing and thread count trend\n'
  printf -- '- Capture jstack only if approved by operational process\n'
  exit 1
fi

if ! ps -p "$target_pid" >/dev/null 2>&1; then
  printf 'CRITICAL: process does not exist or is not visible: %s\n' "$target_pid"
  exit 2
fi

proc_line="$(ps -p "$target_pid" -o pid=,user=,rss=,pcpu=,comm=,args=)"
if ! printf '%s\n' "$proc_line" | grep -qi 'java'; then
  printf 'WARNING: PID %s does not appear to be a Java process from ps output\n\n' "$target_pid"
  status="WARNING"
  exit_code=1
else
  status="OK"
  exit_code=0
fi

thread_count="unavailable"
if [[ -r "/proc/${target_pid}/status" ]]; then
  thread_count="$(awk '/^Threads:/ { print $2 }' "/proc/${target_pid}/status")"
fi

printf '%s: JVM diagnostics collected for PID %s\n\n' "$status" "$target_pid"

printf 'Detected JVM process:\n'
printf 'PID USER RSS_MB CPU COMMAND\n'
printf '%s\n' "$proc_line" | awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }'
printf 'Thread count: %s\n\n' "$thread_count"

printf 'Heap and JVM evidence:\n'
if command -v jcmd >/dev/null 2>&1; then
  printf '\n[jcmd VM.flags]\n'
  jcmd "$target_pid" VM.flags 2>/dev/null || printf 'WARNING: jcmd VM.flags failed; permissions may be limited\n'
  printf '\n[jcmd GC.heap_info]\n'
  jcmd "$target_pid" GC.heap_info 2>/dev/null || printf 'WARNING: jcmd GC.heap_info failed; permissions may be limited\n'
  printf '\n[jcmd Thread.print summary]\n'
  jcmd "$target_pid" Thread.print 2>/dev/null | awk '/java.lang.Thread.State/ { state[$0]++ } END { for (item in state) print state[item], item }' | sort -rn | head -n "$top_count" || printf 'WARNING: jcmd Thread.print failed; permissions may be limited\n'
elif command -v jstat >/dev/null 2>&1; then
  printf '\n[jstat -gc]\n'
  jstat -gc "$target_pid" 1 1 2>/dev/null || printf 'WARNING: jstat failed; permissions may be limited\n'
else
  printf 'WARNING: jcmd and jstat are unavailable; heap details skipped\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'PID=%s thread_count=%s top=%s\n' "$target_pid" "$thread_count" "$top_count"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; JVM attach and /proc details may be limited by process ownership\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Review GC logs and recent application errors\n'
printf -- '- Check JVM heap sizing against container or host memory limits\n'
printf -- '- Check thread count trend in monitoring before concluding a leak\n'
printf -- '- Capture jstack only if approved by operational process\n'
printf -- '- Attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,121 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_offset_ms=500
critical_offset_ms=5000

usage() {
  cat <<'USAGE'
Usage: check_ntp_time_drift.sh [--warning-offset MS] [--critical-offset MS] [--help]

Check time synchronization status and offset evidence when available.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --warning-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-offset requires a value\n'; exit 2; }; warning_offset_ms="$2"; shift 2 ;;
    --critical-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-offset requires a value\n'; exit 2; }; critical_offset_ms="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_offset_ms" "$critical_offset_ms"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_offset_ms >= critical_offset_ms)); then
  printf 'CRITICAL: --warning-offset must be lower than --critical-offset\n'
  exit 2
fi

system_time="$(date '+%Y-%m-%d %H:%M:%S %Z %z')"
timezone="$(date '+%Z %z')"
sync_status="unknown"
detected_tool="none"
offset_ms=""

timedate_output=""
if command -v timedatectl >/dev/null 2>&1; then
  detected_tool="timedatectl"
  timedate_output="$(timedatectl 2>/dev/null || true)"
  sync_status="$(printf '%s\n' "$timedate_output" | awk -F: '/System clock synchronized|NTP synchronized/ { gsub(/^ +/, "", $2); print $2; exit }')"
  [[ -n "$sync_status" ]] || sync_status="unknown"
fi

chronyc_output=""
if command -v chronyc >/dev/null 2>&1; then
  detected_tool="chronyc"
  chronyc_output="$(chronyc tracking 2>/dev/null || true)"
  raw_offset="$(printf '%s\n' "$chronyc_output" | awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }')"
  if [[ -n "$raw_offset" ]]; then
    offset_ms="$(awk -v seconds="$raw_offset" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d", seconds * 1000 }')"
  fi
elif command -v ntpq >/dev/null 2>&1; then
  detected_tool="ntpq"
fi

status="OK"
exit_code=0
if [[ "$sync_status" =~ ^(no|false)$ ]]; then
  status="WARNING"
  exit_code=1
fi
if [[ -n "$offset_ms" ]]; then
  if ((offset_ms >= critical_offset_ms)); then
    status="CRITICAL"
    exit_code=3
  elif ((offset_ms >= warning_offset_ms)); then
    status="WARNING"
    exit_code=1
  fi
elif [[ "$detected_tool" == "none" ]]; then
  status="WARNING"
  exit_code=1
fi

printf '%s: Time sync status=%s offset_ms=%s\n\n' "$status" "$sync_status" "${offset_ms:-unavailable}"

printf 'Time status:\n'
printf 'System time: %s\n' "$system_time"
printf 'Timezone: %s\n' "$timezone"
printf 'Detected tool: %s\n' "$detected_tool"
printf 'NTP synchronized: %s\n' "$sync_status"
printf 'Offset ms: %s\n\n' "${offset_ms:-unavailable}"

printf 'Tool evidence:\n'
if [[ -n "$chronyc_output" ]]; then
  printf '%s\n' "$chronyc_output"
elif command -v ntpq >/dev/null 2>&1; then
  ntpq -p 2>/dev/null || printf 'WARNING: ntpq command failed\n'
elif [[ -n "$timedate_output" ]]; then
  printf '%s\n' "$timedate_output"
else
  printf 'WARNING: timedatectl, chronyc, and ntpq are unavailable or returned no data\n'
fi
printf '\n'

printf 'Evidence:\n'
|
||||||
|
printf 'Thresholds: warning=%sms critical=%sms\n' "$warning_offset_ms" "$critical_offset_ms"
|
||||||
|
if [[ -z "$offset_ms" ]]; then
|
||||||
|
printf 'WARNING: offset unavailable; status is based on available synchronization indicators only\n'
|
||||||
|
fi
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf 'Recommended next steps:\n'
|
||||||
|
printf -- '- Verify chrony or ntpd service status and configuration\n'
|
||||||
|
printf -- '- Check NTP sources and reachability\n'
|
||||||
|
printf -- '- Check virtualization host time if this is a VM\n'
|
||||||
|
printf -- '- Avoid restarting time services blindly in production\n'
|
||||||
|
printf -- '- Attach this output to incident ticket\n'
|
||||||
|
|
||||||
|
exit "$exit_code"
|
||||||
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

service_name=""
since_value="1 hour ago"
warning_count=3
critical_count=10

usage() {
  cat <<'USAGE'
Usage: check_service_restart_loop.sh --service SERVICE_NAME [--since TEXT] [--warning COUNT] [--critical COUNT] [--help]

Detect restart-loop evidence for a systemd service. Read-only.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --service) [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }; service_name="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -z "$service_name" ]]; then
  printf 'CRITICAL: --service is required\n'
  usage
  exit 2
fi
for value in "$warning_count" "$critical_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd systemctl

active_state="$(systemctl show "$service_name" --property=ActiveState --value 2>/dev/null || printf 'unknown')"
sub_state="$(systemctl show "$service_name" --property=SubState --value 2>/dev/null || printf 'unknown')"
n_restarts="$(systemctl show "$service_name" --property=NRestarts --value 2>/dev/null || printf '')"
restart_count="${n_restarts:-0}"
if ! is_number "$restart_count"; then
  restart_count=0
fi

status="OK"
exit_code=0
if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then
  status="CRITICAL"
  exit_code=3
elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then
  status="WARNING"
  exit_code=1
fi

printf '%s: Service %s state=%s substate=%s restarts=%s\n\n' "$status" "$service_name" "$active_state" "$sub_state" "$restart_count"

printf 'Service state:\n'
systemctl status "$service_name" --no-pager --lines=8 2>/dev/null || printf 'WARNING: unable to read service status for %s\n' "$service_name"
printf '\n'

printf 'Systemd properties:\n'
systemctl show "$service_name" --property=Id,Names,LoadState,ActiveState,SubState,Result,ExecMainStatus,NRestarts,Restart,RestartUSec --no-pager 2>/dev/null || true
printf '\n'

printf 'Recent start/stop/failure log lines since %s:\n' "$since_value"
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u "$service_name" --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'start|stop|fail|restart|exit|status|main process' \
    | tail -n 40 || printf 'OK: no matching journal lines found\n'
else
  printf 'WARNING: journalctl not available; service logs unavailable from this script\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s restarts critical=%s restarts since="%s"\n' "$warning_count" "$critical_count" "$since_value"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; journal visibility may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Inspect the unit file and drop-in overrides\n'
printf -- '- Review application logs around the restart timestamps\n'
printf -- '- Check dependencies such as network, storage, database, or secrets\n'
printf -- '- Verify recent configuration or package changes\n'
printf -- '- Do not restart blindly; attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,20 @@
WARNING: Certificate for app.example.com:443 expires in 18 day(s)

Certificate details:
Subject: CN = app.example.com
Issuer: C = US, O = Example CA, CN = Example Intermediate CA
notBefore: Apr 11 00:00:00 2026 GMT
notAfter: May 29 23:59:59 2026 GMT
SAN/CN: DNS:app.example.com, DNS:api.example.com

Evidence:
Target: app.example.com:443
SNI: app.example.com
Thresholds: warning=30 days critical=7 days

Recommended next steps:
- Renew certificate before the operational threshold is breached
- Check the full chain and intermediate certificates
- Check the load balancer, ingress, or reverse proxy serving this certificate
- Verify monitoring threshold and alert ownership
- Attach this output to incident or change ticket
@@ -0,0 +1,23 @@
OK: DNS=OK ping=OK tcp_443=OK

DNS result:
93.184.216.34 example.com

Ping result:
3 packets transmitted, 3 received, 0% packet loss, time 2002ms

TCP port result:
OK: TCP connection to example.com:443 succeeded

Local network hints:
default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15

Evidence:
Host: example.com count=3 timeout=3s port=443

Recommended next steps:
- Verify the DNS record and resolver path
- Check firewall, routing, security group, or proxy policy
- Compare results from another host or network segment
- Check application endpoint health after network reachability is confirmed
- Attach this output to incident ticket
@@ -0,0 +1,26 @@
CRITICAL: Found 73 failed SSH login attempt(s) for requested window

Top source IPs:
52 203.0.113.44
12 198.51.100.20
9 192.0.2.10

Top attempted users:
31 admin
24 oracle
18 root

Sample recent lines:
May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20

Evidence:
Thresholds: warning=20 critical=50 since="1 hour ago"
Log source: journalctl

Recommended next steps:
- Verify source IPs against expected scanners, admins, or automation
- Check firewall, fail2ban, or security tooling state
- Confirm whether the attempts are expected for this host
- Review successful logins too, not only failures
- Attach this output to incident ticket
@@ -0,0 +1,16 @@
CRITICAL: Found 1 read-only filesystem(s)

Read-only filesystems:
MOUNT_POINT  SOURCE                       FSTYPE  OPTIONS
/data        /dev/mapper/vg_data-lv_data  xfs     ro,relatime,seclabel,attr2,inode64

Evidence:
include_system=0
Collector: findmnt

Recommended next steps:
- Check dmesg or journal logs for I/O errors and filesystem remount events
- Check storage path, multipath, SAN, cloud volume, or underlying disk health
- Check filesystem health with the platform-approved procedure
- Do not remount read-write before understanding the cause
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)

Load average:
1m=7.82 5m=6.91 15m=5.40

CPU count:
8

Top CPU processes:
PID   PPID  USER  %CPU  %MEM  COMMAND       COMMAND
2314  1     app   245   12.1  java          java -jar order-api.jar
991   1     root  38    0.4   backup-agent  backup-agent --scan

Evidence:
WARNING: load is close to online CPU count; runnable task saturation is possible

Recommended next steps:
- Check process ownership and whether the top process is expected
- Check recent deployments, cron jobs, batch jobs, or maintenance activity
- Review logs for the top CPU-consuming process
- Compare with longer trend data from monitoring before taking action
- Attach this output to the incident ticket
@@ -0,0 +1,25 @@
WARNING: Memory usage is 84% and swap usage is 12%

Memory summary:
              total        used        free      shared  buff/cache   available
Mem:          15934       13386         512         121        2036        2101
Swap:          4095         512        3583

Top memory processes:
PID   RSS_MB  COMMAND
1234  2048    java
987   812     postgres

OOM events since 24 hours ago:
2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)

Evidence:
Thresholds: warning=80% critical=90% since="24 hours ago"
OOM evidence source: journalctl

Recommended next steps:
- Check application memory trend
- Review JVM heap settings if process is Java
- Verify swap pressure and paging activity
- Confirm whether OOM events align with application impact
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: Highest inode usage is 87%

Filesystems above threshold:
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var

Inode usage table:
Filesystem                    Inodes    IUsed   IFree  IUse%  Mounted on
/dev/mapper/vg_root-lv_root   524288    91300  432988    18%  /
/dev/mapper/vg_var-lv_var    1310720  1140326  170394    87%  /var

Top affected mount points:
87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394

Evidence:
Thresholds: warning=80% critical=90%

Recommended next steps:
- Find directories with many small files under affected mount points
- Check logs, cache, spool, session, and temporary directories
- Avoid deleting blindly; confirm ownership and application impact first
- Confirm whether inode exhaustion is causing write or deploy failures
- Attach this output to incident ticket
@@ -0,0 +1,30 @@
OK: JVM diagnostics collected for PID 1234

Detected JVM process:
PID   USER  RSS_MB  CPU   COMMAND
1234  app   2048    42.1  java -Xms2g -Xmx2g -jar order-api.jar
Thread count: 188

Heap and JVM evidence:

[jcmd VM.flags]
1234:
-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648

[jcmd GC.heap_info]
garbage-first heap total 2097152K, used 1521000K

[jcmd Thread.print summary]
102 java.lang.Thread.State: WAITING
53 java.lang.Thread.State: RUNNABLE
33 java.lang.Thread.State: TIMED_WAITING

Evidence:
PID=1234 thread_count=188 top=10

Recommended next steps:
- Review GC logs and recent application errors
- Check JVM heap sizing against container or host memory limits
- Check thread count trend in monitoring before concluding a leak
- Capture jstack only if approved by operational process
- Attach this output to incident ticket
@@ -0,0 +1,23 @@
WARNING: Time sync status=yes offset_ms=812

Time status:
System time: 2026-05-11 10:18:01 UTC +0000
Timezone: UTC +0000
Detected tool: chronyc
NTP synchronized: yes
Offset ms: 812

Tool evidence:
Reference ID : 203.0.113.10
System time : 0.812345 seconds fast of NTP time
Last offset : +0.812345 seconds

Evidence:
Thresholds: warning=500ms critical=5000ms

Recommended next steps:
- Verify chrony or ntpd service status and configuration
- Check NTP sources and reachability
- Check virtualization host time if this is a VM
- Avoid restarting time services blindly in production
- Attach this output to incident ticket
@@ -0,0 +1,27 @@
CRITICAL: Service app.service state=failed substate=failed restarts=12

Service state:
app.service - Example application
Loaded: loaded (/etc/systemd/system/app.service; enabled)
Active: failed (Result: exit-code)

Systemd properties:
Id=app.service
ActiveState=failed
SubState=failed
Result=exit-code
NRestarts=12

Recent start/stop/failure log lines since 1 hour ago:
May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.

Evidence:
Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"

Recommended next steps:
- Inspect the unit file and drop-in overrides
- Review application logs around the restart timestamps
- Check dependencies such as network, storage, database, or secrets
- Verify recent configuration or package changes
- Do not restart blindly; attach this output to the incident ticket
@@ -0,0 +1,385 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

incident_type=""
service_name=""
host_name=""
port=""
target_pid=""
match_string=""
output_file=""
since_value="1 hour ago"

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

usage() {
  cat <<'USAGE'
Usage: incident_triage_report.sh --type TYPE [options]

Run selected read-only incident checks and produce a Markdown triage report.

Incident types:
  cpu
  memory
  service
  network
  auth
  cert
  filesystem
  jvm
  all

Options:
  --type TYPE               Incident type to collect
  --service SERVICE_NAME    systemd service name for service checks
  --host HOSTNAME_OR_FQDN   host for DNS, network, or certificate checks
  --port PORT               TCP or TLS port for host checks
  --pid PID                 JVM process ID
  --match PROCESS_MATCH     JVM process match string
  --output FILE             write Markdown report to FILE
  --since VALUE             time window for log-based checks
  --help                    show this help

Examples:
  ./incident_triage_report.sh --type cpu
  ./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
  ./incident_triage_report.sh --type network --host app.example.com --port 443
  ./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

valid_type() {
  case "$1" in
    cpu|memory|service|network|auth|cert|filesystem|jvm|all) return 0 ;;
    *) return 1 ;;
  esac
}

while (($# > 0)); do
  case "$1" in
    --type)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --type requires a value\n'; exit 2; }
      incident_type="$2"
      shift 2
      ;;
    --service)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }
      service_name="$2"
      shift 2
      ;;
    --host)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }
      host_name="$2"
      shift 2
      ;;
    --port)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }
      port="$2"
      shift 2
      ;;
    --pid)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }
      target_pid="$2"
      shift 2
      ;;
    --match)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }
      match_string="$2"
      shift 2
      ;;
    --output)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --output requires a value\n'; exit 2; }
      output_file="$2"
      shift 2
      ;;
    --since)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }
      since_value="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1"
      usage
      exit 2
      ;;
  esac
done

if [[ -z "$incident_type" ]]; then
  printf 'CRITICAL: --type is required\n'
  usage
  exit 2
fi
if ! valid_type "$incident_type"; then
  printf 'CRITICAL: unsupported incident type: %s\n' "$incident_type"
  usage
  exit 2
fi
if [[ -n "$port" ]] && ! is_number "$port"; then
  printf 'CRITICAL: --port must be numeric\n'
  exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
  printf 'CRITICAL: --pid must be numeric\n'
  exit 2
fi
if [[ -n "$target_pid" && -n "$match_string" ]]; then
  printf 'CRITICAL: use either --pid or --match for JVM checks, not both\n'
  exit 2
fi

tmp_dir="$(mktemp -d)"
trap 'rm -rf "$tmp_dir"' EXIT

report_file="$tmp_dir/report.md"

check_labels=()
check_names=()
check_commands=()
check_statuses=()
check_exit_codes=()
check_summaries=()
check_outputs=()

status_from_exit() {
  case "$1" in
    0) printf 'OK' ;;
    1) printf 'WARNING' ;;
    2) printf 'INVALID' ;;
    3) printf 'CRITICAL' ;;
    *) printf 'ERROR' ;;
  esac
}

render_command() {
  local item
  for item in "$@"; do
    printf '%q ' "$item"
  done | sed 's/[[:space:]]*$//'
}

append_skipped_check() {
  local label="$1"
  local name="$2"
  local reason="$3"
  local output_path="$tmp_dir/check_${#check_labels[@]}.txt"

  printf 'SKIPPED: %s\n' "$reason" > "$output_path"

  check_labels+=("$label")
  check_names+=("$name")
  check_commands+=("not run")
  check_statuses+=("SKIPPED")
  check_exit_codes+=("-")
  check_summaries+=("$reason")
  check_outputs+=("$output_path")
}

run_check() {
  local label="$1"
  local script_name="$2"
  shift 2

  local script_path="${script_dir}/${script_name}"
  local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
  local command_text
  local exit_code
  local status
  local summary

  command_text="$(render_command "$script_path" "$@")"

  if [[ ! -e "$script_path" ]]; then
    append_skipped_check "$label" "$script_name" "missing script: $script_name"
    return
  fi
  if [[ ! -x "$script_path" ]]; then
    append_skipped_check "$label" "$script_name" "script is not executable: $script_name"
    return
  fi

  set +e
  "$script_path" "$@" > "$output_path" 2>&1
  exit_code=$?
  set -e

  status="$(status_from_exit "$exit_code")"
  summary="$(sed -n '1p' "$output_path")"
  if [[ -z "$summary" ]]; then
    summary="no output captured"
  fi

  check_labels+=("$label")
  check_names+=("$script_name")
  check_commands+=("$command_text")
  check_statuses+=("$status")
  check_exit_codes+=("$exit_code")
  check_summaries+=("$summary")
  check_outputs+=("$output_path")
}

run_cpu_checks() {
  run_check "CPU saturation" "check_high_cpu.sh"
}

run_memory_checks() {
  run_check "Memory and OOM" "check_high_memory_oom.sh" --since "$since_value"
}

run_service_checks() {
  if [[ -z "$service_name" ]]; then
    append_skipped_check "Service restart loop" "check_service_restart_loop.sh" "requires --service SERVICE_NAME"
    return
  fi
  run_check "Service restart loop" "check_service_restart_loop.sh" --service "$service_name" --since "$since_value"
}

run_network_checks() {
  local args=(--host "$host_name")
  if [[ -z "$host_name" ]]; then
    append_skipped_check "DNS and connectivity" "check_dns_connectivity.sh" "requires --host HOSTNAME_OR_FQDN"
    return
  fi
  if [[ -n "$port" ]]; then
    args+=(--port "$port")
  fi
  run_check "DNS and connectivity" "check_dns_connectivity.sh" "${args[@]}"
}

run_auth_checks() {
  run_check "Failed SSH logins" "check_failed_ssh_logins.sh" --since "$since_value"
}

run_cert_checks() {
  local args=(--host "$host_name")
  if [[ -z "$host_name" ]]; then
    append_skipped_check "Certificate expiry" "check_certificate_expiry.sh" "requires --host HOSTNAME_OR_FQDN"
    return
  fi
  if [[ -n "$port" ]]; then
    args+=(--port "$port")
  fi
  run_check "Certificate expiry" "check_certificate_expiry.sh" "${args[@]}"
}

run_filesystem_checks() {
  run_check "Read-only filesystems" "check_filesystem_readonly.sh"
  run_check "Inode usage" "check_inode_usage.sh"
}

run_jvm_checks() {
  local args=()
  if [[ -n "$target_pid" ]]; then
    args+=(--pid "$target_pid")
  elif [[ -n "$match_string" ]]; then
    args+=(--match "$match_string")
  fi
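  # With nounset, Bash releases before 4.4 treat an empty array expansion as unbound, so guard it.
  run_check "JVM threads and heap" "check_jvm_threads_heap.sh" ${args[@]+"${args[@]}"}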
}

case "$incident_type" in
  cpu) run_cpu_checks ;;
  memory) run_memory_checks ;;
  service) run_service_checks ;;
  network) run_network_checks ;;
  auth) run_auth_checks ;;
  cert) run_cert_checks ;;
  filesystem) run_filesystem_checks ;;
  jvm) run_jvm_checks ;;
  all)
    run_cpu_checks
    run_memory_checks
    run_service_checks
    run_network_checks
    run_auth_checks
    run_cert_checks
    run_filesystem_checks
    run_jvm_checks
    ;;
esac

generated_at="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
local_hostname="$(hostname 2>/dev/null || printf 'unknown')"
current_user="$(id -un 2>/dev/null || printf 'unknown')"

{
  printf '# L2 Incident Triage Report\n\n'
  printf -- '- Generated: %s\n' "$generated_at"
  printf -- '- Local hostname: %s\n' "$local_hostname"
  printf -- '- Current user: %s\n' "$current_user"
  printf -- '- Incident type: %s\n' "$incident_type"
  printf -- '- Service: %s\n' "${service_name:-not provided}"
  printf -- '- Host: %s\n' "${host_name:-not provided}"
  printf -- '- Port: %s\n' "${port:-not provided}"
  printf -- '- PID: %s\n' "${target_pid:-not provided}"
  printf -- '- Process match: %s\n' "${match_string:-not provided}"
  printf -- '- Since: %s\n\n' "$since_value"

  printf '## Executed Checks\n\n'
  printf '| Check | Script | Status | Exit | Command |\n'
  printf '| --- | --- | --- | --- | --- |\n'
  for index in "${!check_labels[@]}"; do
    printf "| %s | \`%s\` | %s | %s | \`%s\` |\n" \
      "${check_labels[$index]}" \
      "${check_names[$index]}" \
      "${check_statuses[$index]}" \
      "${check_exit_codes[$index]}" \
      "${check_commands[$index]}"
  done
  printf '\n'

  printf '## Summary\n\n'
  for index in "${!check_labels[@]}"; do
    printf -- '- %s: %s\n' "${check_labels[$index]}" "${check_summaries[$index]}"
  done
  printf '\n'

  printf '## Raw Evidence\n\n'
  for index in "${!check_labels[@]}"; do
    printf '### %s\n\n' "${check_labels[$index]}"
    printf "Script: \`%s\`\n\n" "${check_names[$index]}"
    printf "Command: \`%s\`\n\n" "${check_commands[$index]}"
    printf 'Status: %s, exit: %s\n\n' "${check_statuses[$index]}" "${check_exit_codes[$index]}"
    printf '```text\n'
    cat "${check_outputs[$index]}"
    printf '\n```\n\n'
  done

  printf '## L2 Handover Checklist\n\n'
  printf -- '- [ ] Business impact confirmed\n'
  printf -- '- [ ] Affected host/service identified\n'
  printf -- '- [ ] Monitoring alert attached\n'
  printf -- '- [ ] Recent changes checked\n'
  printf -- '- [ ] Logs attached\n'
  printf -- '- [ ] Service owner identified\n'
  printf -- '- [ ] Escalation target identified\n\n'

  printf '## Escalation Notes\n\n'
  printf -- '- Escalate when impact is active, spreading, customer-facing, or outside L2 access.\n'
  printf -- '- Include the alert, timeline, commands run, and the raw evidence above.\n'
  printf -- '- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.\n'
  printf -- '- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.\n\n'

  printf '## Recommended Next Steps\n\n'
  printf -- '- Confirm the symptom against monitoring and user reports.\n'
  printf -- '- Compare this point-in-time evidence with recent deploys, config changes, and host events.\n'
  printf -- '- Attach this report to the incident ticket before handoff.\n'
  printf -- '- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.\n'
} > "$report_file"

if [[ -n "$output_file" ]]; then
  cp "$report_file" "$output_file"
  printf 'OK: wrote L2 incident triage report to %s\n' "$output_file"
else
  cat "$report_file"
fi
@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 threshold="${1:-80}"
 
@@ -13,7 +11,7 @@ fi
 status=0
 warning_threshold=$(( threshold > 5 ? threshold - 5 : threshold ))
 
-while read -r filesystem size used avail use_percent mountpoint; do
+while read -r filesystem _size _used avail use_percent mountpoint; do
   usage="${use_percent%\%}"
 
   if (( usage >= threshold )); then

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 section() {
   printf '\n== %s ==\n' "$1"

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 target="${1:-}"
 status=0

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 services=("$@")
 

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 host="$(hostname)"
 timestamp="$(date '+%Y-%m-%d_%H%M%S')"

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 DRY_RUN=true
 TIMESTAMP="$(date +%Y%m%d_%H%M%S)"
@@ -14,7 +12,6 @@ MOUNTPOINT=""
 SIZE=""
 DISKS=""
 
-SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 
 log() {
   local level="${1:-INFO}"

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00_env.sh

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00_env.sh

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00_env.sh

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00_env.sh

@@ -1,7 +1,5 @@
 #!/usr/bin/env bash
-set -o errexit
-set -o nounset
-set -o pipefail
+set -euo pipefail
 
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00_env.sh

Some files were not shown because too many files have changed in this diff.