Compare commits

...

15 Commits

| Author | SHA1 | Message | lint / shell-yaml-ansible (push) | Date |
| --- | --- | --- | --- | --- |
| Mateusz Suski | 4e739c5c99 | Add Linux fresh setup toolkit | Failing after 16s | 2026-06-06 00:23:11 +00:00 |
| Mateusz Suski | 8cb92de06f | Add AI lab maintenance toolkit | Failing after 17s | 2026-06-06 00:10:44 +00:00 |
| Mateusz Suski | 1843796e92 | Document Slurm AI/HPC cluster project | Failing after 17s | 2026-06-05 15:39:24 +00:00 |
| Mateusz Suski | cd6830334b | Add Slurm AI/HPC cluster platform project | | 2026-06-05 15:38:56 +00:00 |
| mateusz | e2624a7533 | PDF CV file upload | Failing after 16s | 2026-05-14 21:23:49 +02:00 |
| Mateusz Suski | 6475f76787 | Add L2 incident triage report wrapper | Failing after 17s | 2026-05-12 20:00:42 +00:00 |
| Mateusz Suski | e851568c8c | Add standalone Bash incident check scripts | Failing after 16s | 2026-05-11 18:49:00 +00:00 |
| Mateusz Suski | 8a7b7c5abc | Clean up Python log analysis documentation | Failing after 20s | 2026-05-11 17:10:10 +00:00 |
| Mateusz Suski | 1636f46f81 | Add known error matcher tool | | 2026-05-11 17:06:46 +00:00 |
| Mateusz Suski | 5fc96348c5 | Add journal analyzer tool | | 2026-05-11 17:06:05 +00:00 |
| Mateusz Suski | 89b7fabb96 | Add JVM log analyzer tool | | 2026-05-11 17:05:27 +00:00 |
| Mateusz Suski | 2da5e8b46c | Add authentication log audit tool | | 2026-05-11 17:04:48 +00:00 |
| Mateusz Suski | 452ff4fac1 | Add log diff checker tool | | 2026-05-11 17:04:10 +00:00 |
| Mateusz Suski | 5dde403ce3 | Add incident log summary tool | | 2026-05-11 17:03:31 +00:00 |
| Mateusz Suski | 61483c233f | Add Python tooling validation foundation | | 2026-05-11 17:02:35 +00:00 |
157 changed files with 16261 additions and 22 deletions
+3
@@ -28,6 +28,9 @@ jobs:
-P infra-run/scripts/bash/gpfs \
-P infra-run/scripts/bash/veritas
- name: Python syntax checks
run: bash scripts/check-python.sh
- name: yamllint
run: yamllint .
+10
@@ -41,6 +41,7 @@ Focused checks:
```bash
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```
@@ -64,6 +65,15 @@ Also run targeted checks for changed files, such as `bash -n`, `ansible-playbook
- Exit codes: `0` OK, `1` operational issue, `2` invalid input or missing dependency.
- Keep scripts readable; separate discovery, pre-check, change, post-check, and reporting when it helps.
## Python Standards
- Use Python for parsing, reporting, and structured operational tooling where it adds value over Bash.
- Keep Python tools read-only by default.
- Prefer the Python standard library.
- Avoid frameworks and unnecessary abstractions.
- Use clear operational output and meaningful exit codes.
- Keep tools small, focused, and easy to validate.
## Ansible Standards
- Keep playbooks short and roles simple.
+14
@@ -4,6 +4,17 @@
### Added
- Added Linux Fresh Setup Toolkit under `labs/linux/setup` for day-0 Ubuntu lab host bootstrap automation.
- Added AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation.
- Python tooling validation for operational scripts.
- `incident-log-summary` for general incident log summarization.
- `log-diff-checker` for pre-change and post-change log comparison.
- `auth-log-audit` for Linux authentication log review.
- `jvm-log-analyzer` for JVM application log summaries.
- `journal-analyzer` for exported `journalctl` log review.
- `known-error-matcher` with JSON-based known error patterns.
- Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics.
- `incident_triage_report.sh` for L2 Markdown incident handover reports built from existing Bash incident checks.
- Repository-level Codex guidance:
- `AGENTS.md`
- `docs/codex/README.md`
@@ -27,12 +38,15 @@
- IBM AIX 7 role and playbook.
- Shared sanitized Ansible inventory defaults for Linux and AIX examples.
- Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
### Changed
- Updated root, `infra-run`, Bash, Ansible, platform, and lab README guidance for safety-first usage, validation, and future Codex-driven work.
- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.
- Updated Python tooling documentation and repository roadmap.
- Integrated Python syntax validation into repository validation workflow and CI.
### Notes
Binary file not shown.
+13 -1
@@ -30,10 +30,19 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
- [infra-run](./infra-run/) - the main implemented project in this repository.
- [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
- [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
- [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
- [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
- [Incident log summary](./infra-run/scripts/python/incident-log-summary/) - read-only Python helper for local incident log pattern summaries.
- [Log diff checker](./infra-run/scripts/python/log-diff-checker/) - read-only Python helper for before/after change log comparison.
- [Auth log audit](./infra-run/scripts/python/auth-log-audit/) - read-only Python helper for local authentication log review.
- [JVM log analyzer](./infra-run/scripts/python/jvm-log-analyzer/) - read-only Python helper for local JVM and Java application log review.
- [Journal analyzer](./infra-run/scripts/python/journal-analyzer/) - read-only Python helper for exported `journalctl` text review.
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
## Planned Areas
@@ -78,10 +87,11 @@ Basic local validation:
./scripts/validate-repo.sh
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```
The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Set `STRICT=1` to fail when optional tools are missing.
The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Python checks use `python3 -m py_compile` and do not require external Python tooling. Set `STRICT=1` to fail when optional tools are missing.
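As a rough illustration of the `python3 -m py_compile` approach described above, a syntax check could be sketched as follows. The function name `check_python_syntax` and its directory argument are assumptions made for this example; this is not the contents of `scripts/check-python.sh`.

```bash
#!/usr/bin/env bash
# Sketch only: NOT the repository's scripts/check-python.sh.
# check_python_syntax is a hypothetical helper name.

# Byte-compile every *.py file under the given directory using the standard
# library py_compile module. find returns non-zero if any py_compile
# invocation fails, so the function's exit status reflects syntax errors.
check_python_syntax() {
  find "$1" -type f -name '*.py' -exec python3 -m py_compile {} +
}

# Example (path taken from the README above):
#   check_python_syntax infra-run/scripts/python
```

Note that `py_compile` writes bytecode into `__pycache__` directories next to the checked files, so a real helper may want to clean those up or run against a temporary copy.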
Some scripts depend on platform tools such as `vxdisk`, `hagrp`, `mmcrnsd`, and `mmlscluster`. Those commands are not expected to exist on a normal workstation, so functional testing against Veritas or GPFS requires a real lab environment.
@@ -90,10 +100,12 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
## Operational Areas Demonstrated
- Linux operations triage and reporting.
- Local operational log analysis with read-only Python helpers.
- Disk pressure and deleted-file incident analysis.
- Dry-run-first Bash automation.
- Controlled storage change workflow design.
- Veritas VxVM/VCS operational awareness.
- GPFS / IBM Spectrum Scale operational awareness.
- Ansible role organization for selected hardening controls.
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
- Clear documentation of what was tested and what still needs a real system.
+19 -2
@@ -16,6 +16,23 @@ This file keeps future portfolio ideas in one place so empty folders do not look
- Clustering: service group checks, failover review, and operational checklists.
- Monitoring: Zabbix-oriented alert review and host onboarding notes.
- Virtualization: VM lifecycle and platform operations examples.
- Log analysis: ELK-style search examples for incident review.
- Log analysis: optional ELK-style search case study under `platform-projects`, separate from current local Python helpers.
Nothing in this roadmap should be read as completed implementation.
## Implemented Portfolio Additions
- Standalone Bash incident checks under `infra-run/scripts/bash/incident-checks/` for common Linux incident triage and ticket evidence.
- Python operational log analysis suite under `infra-run/scripts/python/`:
- `incident-log-summary`
- `log-diff-checker`
- `auth-log-audit`
- `jvm-log-analyzer`
- `journal-analyzer`
- `known-error-matcher`
## Future Python Tooling Ideas
- Real-world sample report examples using sanitized evidence.
- Integration examples that combine log summaries with change evidence collection.
- A shared Python helper library only if the standalone tools begin duplicating enough stable behavior to justify it.
Planned sections remain future work unless listed as implemented.
+22 -2
@@ -1,16 +1,35 @@
# infra-run
`infra-run` is a sanitized infrastructure operations project. It contains Bash and Ansible examples based on Linux administration, incident response, storage operations, hardening, prechecks, postchecks, and controlled change workflows.
`infra-run` is a sanitized infrastructure operations project. It contains Bash, Ansible, Python, and documentation examples based on Linux administration, incident response, storage operations, hardening, prechecks, postchecks, and controlled change workflows.
The goal is to show operational judgment, not to ship a universal automation product.
## Current Contents
### Bash Operational Scripts
- [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, JVM diagnostics, and an L2 Markdown triage report wrapper.
- [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
- [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
- [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.
### Python Log And Reporting Tools
- [scripts/python](./scripts/python/) - read-only Python operational helpers using the standard library only.
- [scripts/python/incident-log-summary](./scripts/python/incident-log-summary/) - read-only Python log summary helper for incident pattern review.
- [scripts/python/log-diff-checker](./scripts/python/log-diff-checker/) - read-only Python before/after log comparison helper for change review.
- [scripts/python/auth-log-audit](./scripts/python/auth-log-audit/) - read-only Python authentication log audit helper for SSH, sudo, su, and PAM review.
- [scripts/python/jvm-log-analyzer](./scripts/python/jvm-log-analyzer/) - read-only Python JVM and Java application log analyzer for exception, stack trace, HTTP 5xx, database, and TLS review.
- [scripts/python/journal-analyzer](./scripts/python/journal-analyzer/) - read-only Python exported journal analyzer for failed units, restart patterns, OOM events, and service warnings.
- [scripts/python/known-error-matcher](./scripts/python/known-error-matcher/) - read-only Python matcher for local logs and JSON known-error catalogs with runbook references.
### Ansible Automation
- [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.
### Runbooks And Documentation
- [examples](./examples/) - sanitized sample command outputs and incident notes.
## Documentation
@@ -36,6 +55,7 @@ The goal is to show operational judgment, not to ship a universal automation pro
- Bash syntax can be checked locally.
- Shell scripts can be reviewed and partially exercised on a Linux workstation when platform commands are available or mocked.
- Disk-full read-only scripts can be run against local paths for basic behavior checks.
- Python log analysis examples can be run against sanitized sample logs under each tool directory.
- Ansible YAML and role structure can be linted locally.
## Running Safely
@@ -70,7 +90,7 @@ From the repository root:
./scripts/validate-repo.sh
```
Focused checks are available in `scripts/check-bash.sh`, `scripts/check-ansible.sh`, and `scripts/check-docs.sh`. If `ansible-lint` reports collection-related issues, install the collections listed in [ansible/collections/requirements.yml](./ansible/collections/requirements.yml) and rerun it. Treat lint as a starting point; platform testing still requires actual target systems.
Focused checks are available in `scripts/check-bash.sh`, `scripts/check-ansible.sh`, `scripts/check-python.sh`, and `scripts/check-docs.sh`. If `ansible-lint` reports collection-related issues, install the collections listed in [ansible/collections/requirements.yml](./ansible/collections/requirements.yml) and rerun it. Treat lint as a starting point; platform testing still requires actual target systems.
## Supporting Notes
+15
@@ -8,6 +8,21 @@ This file tracks planned `infra-run` additions without presenting them as comple
- A small Python parser for converting script output into a markdown change note.
- Additional Ansible molecule or container-based syntax checks where platform support is realistic.
- Standalone runbooks that reference the existing Bash workflows.
- Shared known-error pattern catalog review.
- Additional links between Python findings and existing runbooks.
- Change evidence collector for pre-check and post-check notes.
- Report examples suitable for incident and change tickets.
- Optional wrapper command only after the standalone Python tools stabilize.
## Implemented Additions
- `infra-run/scripts/bash/incident-checks/` - standalone read-only Bash checks for CPU, memory/OOM, service restart loops, failed SSH logins, TLS certificate expiry, DNS connectivity, time sync drift, read-only filesystems, inode pressure, and JVM process diagnostics.
- `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files.
- `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review.
- `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review.
- `infra-run/scripts/python/jvm-log-analyzer/` - read-only JVM and Java application log analyzer for exceptions, stack traces, HTTP 5xx entries, database issues, TLS failures, and JVM failure symptoms.
- `infra-run/scripts/python/journal-analyzer/` - read-only exported `journalctl` text analyzer for summarizing failed units, dependency issues, restart patterns, OOM findings, disk/filesystem symptoms, and related service warnings.
- `infra-run/scripts/python/known-error-matcher/` - read-only known-error matcher for local logs and JSON pattern catalogs with severity, category, samples, and runbook references.
## Not Planned
+1
@@ -7,5 +7,6 @@ These files use fake hostnames, reserved example domains, reserved IP address ra
## Included
- `disk-full/` - sample filesystem usage, deleted open files, and a short after-action report.
- `incident-triage/` - sample L2 incident triage report for repeatable handoff and ticket evidence.
- `veritas/` - sample VxVM disk and VCS service group output.
- `gpfs/` - sample GPFS cluster and NSD output.
@@ -0,0 +1,131 @@
# L2 Incident Triage Report
- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago
## Executed Checks
| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
## Summary
- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected
## Raw Evidence
### CPU saturation
Script: `check_high_cpu.sh`
Command: `./check_high_cpu.sh`
Status: OK, exit: 0
```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
Load average:
1m=0.42 5m=0.38 15m=0.31
Top CPU processes:
PID PPID USER %CPU %MEM COMMAND ARGS
1450 1 app 7.2 2.1 nginx nginx: worker process
Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```
### Memory and OOM
Script: `check_high_memory_oom.sh`
Command: `./check_high_memory_oom.sh --since "30 minutes ago"`
Status: WARNING, exit: 1
```text
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583
OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```
### Service restart loop
Script: `check_service_restart_loop.sh`
Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`
Status: OK, exit: 0
```text
OK: Service nginx state=active substate=running restarts=0
Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```
### Skipped or limited checks
```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
```
## L2 Handover Checklist
- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified
## Escalation Notes
- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.
## Recommended Next Steps
- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
+7 -6
@@ -1,6 +1,6 @@
# infra-run/scripts
This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from future Python-based utilities while keeping both under one automation entry point.
This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from Python-based analysis utilities while keeping both under one automation entry point.
## Diagram
@@ -9,16 +9,17 @@ flowchart TD
A["scripts"] --> B["bash"]
A --> C["python"]
B --> D["Operational toolkits"]
C --> E["Future helper utilities"]
C --> E["Analysis helper utilities"]
```
## Scope
- `bash` - current implementation area with operations toolkits.
- `python` - reserved space for future supporting utilities.
- [bash](./bash/) - operational toolkits for host health checks, disk-full triage, Veritas examples, and GPFS examples.
- [python](./python/) - read-only tools for local log parsing, reporting, and structured operational analysis.
## Notes
- The repository currently emphasizes Bash because it maps directly to day-to-day Linux operations.
- The structure leaves room for higher-level helpers without mixing concerns.
- Bash remains the right default for direct host checks and operational wrappers.
- Python is used where parsing, report generation, comparison, or JSON output is clearer than shell.
- Bash tooling should remain safe by default, readable, and validated with `../../scripts/check-bash.sh` from the repository root.
- Python tooling should remain read-only by default, standard-library based, and validated with `../../scripts/check-python.sh` from the repository root.
+15 -6
@@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. T
```mermaid
flowchart TD
A["bash"] --> B["os-healthcheck"]
A --> C["disk-full"]
A --> D["veritas"]
A --> E["gpfs"]
A --> C["incident-checks"]
A --> D["disk-full"]
A --> E["veritas"]
A --> F["gpfs"]
B --> B1["Host diagnostics"]
C --> C1["Incident workflow"]
D --> D1["VxVM and VCS change flow"]
E --> E1["Spectrum Scale expansion flow"]
C --> C1["Standalone triage checks"]
D --> D1["Incident workflow"]
E --> E1["VxVM and VCS change flow"]
F --> F1["Spectrum Scale expansion flow"]
```
## Scripts
@@ -23,6 +25,7 @@ flowchart TD
- `os-healthcheck/service_check.sh` - critical service status check.
- `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
- `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
- `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics.
## Usage
@@ -37,6 +40,12 @@ cd infra-run/scripts/bash/os-healthcheck
./system_report.sh
./network_troubleshoot.sh
./network_troubleshoot.sh google.com
cd ../incident-checks
./check_high_cpu.sh
./check_high_memory_oom.sh --since "24 hours ago"
./check_service_restart_loop.sh --service sshd
./check_certificate_expiry.sh --host example.com
```
## Standards
@@ -0,0 +1,124 @@
# Bash Incident Checks
Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident, run without repository context, and pasted into an incident or change ticket as evidence.
They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.
## Scripts
- `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes.
- `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence.
- `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines.
- `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs.
- `check_certificate_expiry.sh` - remote or local TLS certificate expiry check.
- `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints.
- `check_ntp_time_drift.sh` - time sync status and offset evidence when available.
- `check_filesystem_readonly.sh` - read-only filesystem detection.
- `check_inode_usage.sh` - inode pressure and top affected mount points.
- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.
- `incident_triage_report.sh` - wrapper that runs selected checks and writes a single Markdown L2 handover report.
## Usage Examples
```bash
./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15
./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5
./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"
./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25
./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt
./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432
./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000
./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system
./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90
./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name
./incident_triage_report.sh --type cpu
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
./incident_triage_report.sh --type network --host app.example.com --port 443
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
```
## L2 Triage Report Wrapper
`incident_triage_report.sh` collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.
Supported report types are `cpu`, `memory`, `service`, `network`, `auth`, `cert`, `filesystem`, `jvm`, and `all`.
The wrapper is read-only apart from writing the requested `--output` file. It does not require root and skips checks safely when an underlying script is missing, not executable, or missing required context such as `--service` or `--host`.
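The skip behavior described above might look roughly like the sketch below. The helper name `run_check_or_skip` and the status line format are assumptions for illustration and are not taken from `incident_triage_report.sh` itself.

```bash
#!/usr/bin/env bash
# Sketch only: NOT the actual incident_triage_report.sh logic.
# run_check_or_skip and its output format are hypothetical.

# Run one check script if it is usable; otherwise say why it was skipped.
# Always returns 0 so the surrounding report can continue with other checks.
run_check_or_skip() {
  local script="$1"
  shift
  local rc=0

  if [ ! -f "$script" ]; then
    printf 'SKIPPED: %s (script missing)\n' "$script"
    return 0
  fi
  if [ ! -x "$script" ]; then
    printf 'SKIPPED: %s (not executable)\n' "$script"
    return 0
  fi

  # Capture the exit code instead of letting a non-zero status abort the caller.
  "$script" "$@" || rc=$?
  printf 'RAN: %s exit=%s\n' "$script" "$rc"
}

# Example:
#   run_check_or_skip ./check_high_cpu.sh
#   run_check_or_skip ./check_service_restart_loop.sh --service nginx
```

Returning `0` from the helper keeps one unusable check from ending the whole report; the recorded `exit=` value preserves the underlying check's result for the summary.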
## Exit Codes
- `0` - OK.
- `1` - WARNING or operational issue detected.
- `2` - invalid input or missing required dependency.
- `3` - CRITICAL issue detected.
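A caller that consumes these exit codes could translate them into labels along the lines of the sketch below. The function name `status_from_exit` and the labels `INVALID` and `UNEXPECTED` are assumptions for this example; only the numeric meanings come from the list above.

```bash
#!/usr/bin/env bash
# Sketch only: a hypothetical helper, not part of the incident-checks scripts.

# Map a check's exit code to a short status label.
# 0-3 follow the Exit Codes list; anything else is reported as UNEXPECTED.
status_from_exit() {
  case "$1" in
    0) printf 'OK\n' ;;
    1) printf 'WARNING\n' ;;
    2) printf 'INVALID\n' ;;   # invalid input or missing required dependency
    3) printf 'CRITICAL\n' ;;
    *) printf 'UNEXPECTED\n' ;;
  esac
}

# Example:
#   rc=0
#   ./check_inode_usage.sh --warning 75 --critical 90 || rc=$?
#   printf 'inode check: %s (exit %s)\n' "$(status_from_exit "$rc")" "$rc"
```

Capturing the code with `|| rc=$?` keeps a non-zero result from terminating a caller that runs under `set -o errexit`.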
## Supported Platforms
These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.
Some data sources vary by distribution:
- RHEL-like systems often use `/var/log/secure` and `/var/log/messages`.
- Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`.
- systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available.
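One way to handle these distribution differences is to probe candidate paths in order and use the first readable one. The sketch below assumes a hypothetical helper named `first_readable`; the actual scripts may select their sources differently.

```bash
#!/usr/bin/env bash
# Sketch only: first_readable is a hypothetical helper, not taken from the scripts.

# Print the first readable path from the argument list.
# Returns 1 when none of the candidates can be read.
first_readable() {
  local candidate
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Example: prefer the RHEL-like path, then the Debian/Ubuntu path.
#   if auth_log="$(first_readable /var/log/secure /var/log/auth.log)"; then
#     printf 'Using auth log: %s\n' "$auth_log"
#   else
#     printf 'No readable auth log; try journalctl if available\n'
#   fi
```

Without root, neither file may be readable, which is why the example falls through to a journal hint rather than treating the missing file as a failure.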
## Safety Notes
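- The individual check scripts are read-only.
- Check scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files. The only file written is the report wrapper's requested `--output` file.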
- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
- Treat output as triage evidence, not as complete root-cause analysis.
## Dependency Notes
Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`.
Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files.
## Copy-To-Server Example
```bash
scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
```
Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
## Sample Outputs
Sanitized examples are available in [examples](./examples/):
- `high-cpu.sample.txt`
- `high-memory-oom.sample.txt`
- `service-restart-loop.sample.txt`
- `failed-ssh-logins.sample.txt`
- `certificate-expiry.sample.txt`
- `dns-connectivity.sample.txt`
- `ntp-time-drift.sample.txt`
- `filesystem-readonly.sample.txt`
- `inode-usage.sample.txt`
- `jvm-threads-heap.sample.txt`
A sanitized report sample is available at [../../../examples/incident-triage/l2-incident-triage-report.sample.md](../../../examples/incident-triage/l2-incident-triage-report.sample.md).
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
host_name=""
port=443
cert_file=""
warning_days=30
critical_days=7
servername=""
usage() {
cat <<'USAGE'
Usage: check_certificate_expiry.sh (--host HOST [--port PORT] | --file CERT_FILE) [--servername SNI_NAME] [--warning-days DAYS] [--critical-days DAYS] [--help]
Check TLS certificate expiry for a remote endpoint or local certificate file.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
--port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
--file) [[ $# -ge 2 ]] || { printf 'CRITICAL: --file requires a value\n'; exit 2; }; cert_file="$2"; shift 2 ;;
--servername) [[ $# -ge 2 ]] || { printf 'CRITICAL: --servername requires a value\n'; exit 2; }; servername="$2"; shift 2 ;;
--warning-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-days requires a value\n'; exit 2; }; warning_days="$2"; shift 2 ;;
--critical-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-days requires a value\n'; exit 2; }; critical_days="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
if ! command -v openssl >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: openssl\n'
exit 2
fi
for value in "$port" "$warning_days" "$critical_days"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((critical_days >= warning_days)); then
printf 'CRITICAL: --critical-days must be lower than --warning-days\n'
exit 2
fi
if [[ -n "$host_name" && -n "$cert_file" ]]; then
printf 'CRITICAL: use either --host or --file, not both\n'
exit 2
fi
if [[ -z "$host_name" && -z "$cert_file" ]]; then
printf 'CRITICAL: either --host or --file is required\n'
usage
exit 2
fi
if [[ -n "$cert_file" && ! -r "$cert_file" ]]; then
printf 'CRITICAL: certificate file is not readable: %s\n' "$cert_file"
exit 2
fi
if [[ -z "$servername" ]]; then
servername="$host_name"
fi
tmp_cert="$(mktemp)"
trap 'rm -f "$tmp_cert"' EXIT
if [[ -n "$host_name" ]]; then
if ! openssl s_client -connect "${host_name}:${port}" -servername "$servername" -showcerts </dev/null 2>/dev/null \
| openssl x509 -outform PEM > "$tmp_cert" 2>/dev/null; then
printf 'CRITICAL: unable to retrieve certificate from %s:%s\n' "$host_name" "$port"
exit 2
fi
else
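cp "$cert_file" "$tmp_cert"
# Fail with a clear message instead of a silent errexit abort when the file is not a parsable certificate.
if ! openssl x509 -in "$tmp_cert" -noout >/dev/null 2>&1; then
printf 'CRITICAL: unable to parse certificate file: %s\n' "$cert_file"
exit 2
fi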
fi
subject="$(openssl x509 -in "$tmp_cert" -noout -subject 2>/dev/null | sed 's/^subject=//')"
issuer="$(openssl x509 -in "$tmp_cert" -noout -issuer 2>/dev/null | sed 's/^issuer=//')"
not_before="$(openssl x509 -in "$tmp_cert" -noout -startdate 2>/dev/null | sed 's/^notBefore=//')"
not_after="$(openssl x509 -in "$tmp_cert" -noout -enddate 2>/dev/null | sed 's/^notAfter=//')"
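# A missing SAN extension, or an openssl build without -ext, must not abort the script under errexit and pipefail.
san_text="$(openssl x509 -in "$tmp_cert" -noout -ext subjectAltName 2>/dev/null | sed '1d' | sed 's/^ *//' || true)"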
expiry_epoch="$(date -d "$not_after" +%s 2>/dev/null || printf '')"
now_epoch="$(date +%s)"
if [[ -z "$expiry_epoch" ]]; then
printf 'CRITICAL: unable to parse certificate expiry date: %s\n' "$not_after"
exit 2
fi
seconds_left=$((expiry_epoch - now_epoch))
days_left=$((seconds_left / 86400))
status="OK"
exit_code=0
if ((days_left < critical_days)); then
status="CRITICAL"
exit_code=3
elif ((days_left < warning_days)); then
status="WARNING"
exit_code=1
fi
target="$cert_file"
if [[ -n "$host_name" ]]; then
target="${host_name}:${port}"
fi
printf '%s: Certificate for %s expires in %s day(s)\n\n' "$status" "$target" "$days_left"
printf 'Certificate details:\n'
printf 'Subject: %s\n' "$subject"
printf 'Issuer: %s\n' "$issuer"
printf 'notBefore: %s\n' "$not_before"
printf 'notAfter: %s\n' "$not_after"
printf 'SAN/CN: %s\n' "${san_text:-$subject}"
printf '\n'
printf 'Evidence:\n'
printf 'Target: %s\n' "$target"
printf 'SNI: %s\n' "${servername:-not used}"
printf 'Thresholds: warning=%s days critical=%s days\n\n' "$warning_days" "$critical_days"
printf 'Recommended next steps:\n'
printf -- '- Renew certificate before the operational threshold is breached\n'
printf -- '- Check the full chain and intermediate certificates\n'
printf -- '- Check the load balancer, ingress, or reverse proxy serving this certificate\n'
printf -- '- Verify monitoring threshold and alert ownership\n'
printf -- '- Attach this output to incident or change ticket\n'
exit "$exit_code"
@@ -0,0 +1,161 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
host_name=""
port=""
count=3
timeout_seconds=3
usage() {
cat <<'USAGE'
Usage: check_dns_connectivity.sh --host HOST [--port PORT] [--count COUNT] [--timeout SECONDS] [--help]
Check DNS resolution, ping, optional TCP connectivity, and local route hints.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
--port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
--count) [[ $# -ge 2 ]] || { printf 'CRITICAL: --count requires a value\n'; exit 2; }; count="$2"; shift 2 ;;
--timeout) [[ $# -ge 2 ]] || { printf 'CRITICAL: --timeout requires a value\n'; exit 2; }; timeout_seconds="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
if [[ -z "$host_name" ]]; then
printf 'CRITICAL: --host is required\n'
usage
exit 2
fi
for value in "$count" "$timeout_seconds"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if [[ -n "$port" ]] && ! is_number "$port"; then
printf 'CRITICAL: --port must be numeric\n'
exit 2
fi
if ! command -v getent >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: getent\n'
exit 2
fi
dns_ok=0
ping_ok=0
tcp_ok=0
tcp_checked=0
tcp_note=""
ping_output="$(mktemp)"
trap 'rm -f "$ping_output"' EXIT
dns_output="$(getent hosts "$host_name" 2>/dev/null || true)"
if [[ -n "$dns_output" ]]; then
dns_ok=1
fi
if command -v ping >/dev/null 2>&1; then
if ping -c "$count" -W "$timeout_seconds" "$host_name" > "$ping_output" 2>&1; then
ping_ok=1
fi
else
printf 'WARNING: ping command not available; ICMP check skipped\n' > "$ping_output"
fi
if [[ -n "$port" ]]; then
tcp_checked=1
if command -v timeout >/dev/null 2>&1; then
# Pass host and port as positional parameters so their values are never parsed as shell code.
if timeout "$timeout_seconds" bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
tcp_ok=1
fi
else
tcp_note="WARNING: timeout command not available; TCP /dev/tcp check used without external timeout"
if bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
tcp_ok=1
fi
fi
fi
status="OK"
exit_code=0
if ((dns_ok == 0)); then
status="CRITICAL"
exit_code=3
elif ((tcp_checked == 1 && tcp_ok == 0)); then
status="CRITICAL"
exit_code=3
elif command -v ping >/dev/null 2>&1 && ((ping_ok == 0)); then
status="WARNING"
exit_code=1
fi
printf '%s: DNS=%s ping=%s' "$status" "$([[ "$dns_ok" == 1 ]] && printf OK || printf FAILED)" "$([[ "$ping_ok" == 1 ]] && printf OK || printf UNKNOWN_OR_FAILED)"
if ((tcp_checked == 1)); then
printf ' tcp_%s=%s' "$port" "$([[ "$tcp_ok" == 1 ]] && printf OK || printf FAILED)"
fi
printf '\n\n'
printf 'DNS result:\n'
if [[ -n "$dns_output" ]]; then
printf '%s\n' "$dns_output"
else
printf 'CRITICAL: getent hosts returned no records for %s\n' "$host_name"
fi
printf '\n'
printf 'Ping result:\n'
if [[ -s "$ping_output" ]]; then
cat "$ping_output"
else
printf 'WARNING: ping result unavailable or ping command missing\n'
fi
printf '\n'
if ((tcp_checked == 1)); then
printf 'TCP port result:\n'
if ((tcp_ok == 1)); then
printf 'OK: TCP connection to %s:%s succeeded\n' "$host_name" "$port"
else
printf 'CRITICAL: TCP connection to %s:%s failed or timed out\n' "$host_name" "$port"
fi
if [[ -n "$tcp_note" ]]; then
printf '%s\n' "$tcp_note"
fi
printf '\n'
fi
printf 'Local network hints:\n'
if command -v ip >/dev/null 2>&1; then
ip route show default 2>/dev/null || printf 'WARNING: unable to read default route\n'
elif command -v ss >/dev/null 2>&1; then
ss -tuln 2>/dev/null | head -n 20 || printf 'WARNING: unable to read socket summary\n'
else
printf 'WARNING: ip and ss are unavailable; local network hints skipped\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Host: %s count=%s timeout=%ss port=%s\n' "$host_name" "$count" "$timeout_seconds" "${port:-not checked}"
if [[ -n "$tcp_note" ]]; then
printf '%s\n' "$tcp_note"
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Verify the DNS record and resolver path\n'
printf -- '- Check firewall, routing, security group, or proxy policy\n'
printf -- '- Compare results from another host or network segment\n'
printf -- '- Check application endpoint health after network reachability is confirmed\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
since_value="1 hour ago"
warning_count=20
critical_count=50
top_count=10
usage() {
cat <<'USAGE'
Usage: check_failed_ssh_logins.sh [--since TEXT] [--warning COUNT] [--critical COUNT] [--top N] [--help]
Detect failed SSH login bursts from journal or readable authentication logs.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
for value in "$warning_count" "$critical_count" "$top_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_count >= critical_count)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
tmp_log="$(mktemp)"
trap 'rm -f "$tmp_log"' EXIT
log_source="journalctl"
if command -v journalctl >/dev/null 2>&1; then
journalctl --since "$since_value" --no-pager 2>/dev/null \
| grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' > "$tmp_log" || true
else
log_source="log file fallback"
fi
if [[ ! -s "$tmp_log" ]]; then
for log_file in /var/log/auth.log /var/log/secure /var/log/messages; do
if [[ -r "$log_file" ]]; then
grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' "$log_file" >> "$tmp_log" || true
log_source="$log_file"
fi
done
fi
attempts="$(wc -l < "$tmp_log" | awk '{print $1}')"
status="OK"
exit_code=0
if ((attempts >= critical_count)); then
status="CRITICAL"
exit_code=3
elif ((attempts >= warning_count)); then
status="WARNING"
exit_code=1
fi
printf '%s: Found %s failed SSH login attempt(s) for requested window\n\n' "$status" "$attempts"
printf 'Top source IPs:\n'
if [[ -s "$tmp_log" ]]; then
grep -Eo 'from ([0-9]{1,3}\.){3}[0-9]{1,3}|rhost=([0-9]{1,3}\.){3}[0-9]{1,3}' "$tmp_log" \
| sed -E 's/^(from |rhost=)//' \
| sort | uniq -c | sort -rn | head -n "$top_count" || true
else
printf 'OK: no failed SSH attempts found in available sources\n'
fi
printf '\n'
printf 'Top attempted users:\n'
if [[ -s "$tmp_log" ]]; then
sed -nE 's/.*Invalid user ([^ ]+).*/\1/p; s/.*Failed password for invalid user ([^ ]+).*/\1/p; s/.*Failed password for ([^ ]+).*/\1/p; s/.*user=([^ ]+).*/\1/p' "$tmp_log" \
| sort | uniq -c | sort -rn | head -n "$top_count" || true
else
printf 'OK: no attempted users extracted\n'
fi
printf '\n'
printf 'Sample recent lines:\n'
if [[ -s "$tmp_log" ]]; then
tail -n "$top_count" "$tmp_log"
else
printf 'OK: no sample lines available\n'
fi
printf '\n\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%s critical=%s since="%s"\n' "$warning_count" "$critical_count" "$since_value"
printf 'Log source: %s\n' "$log_source"
if [[ "$log_source" != "journalctl" ]]; then
printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; authentication log visibility may be limited\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Verify source IPs against expected scanners, admins, or automation\n'
printf -- '- Check firewall, fail2ban, or security tooling state\n'
printf -- '- Confirm whether the attempts are expected for this host\n'
printf -- '- Review successful logins too, not only failures\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
include_system=0
usage() {
cat <<'USAGE'
Usage: check_filesystem_readonly.sh [--include-system] [--help]
Detect filesystems that are mounted read-only. This script makes no changes.
USAGE
}
while (($# > 0)); do
case "$1" in
--include-system) include_system=1; shift ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
tmp_mounts="$(mktemp)"
trap 'rm -f "$tmp_mounts"' EXIT
if command -v findmnt >/dev/null 2>&1; then
findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > "$tmp_mounts" 2>/dev/null || true
elif command -v mount >/dev/null 2>&1; then
mount | awk '{ source=$1; target=$3; type=$5; opts=$6; gsub(/[()]/, "", opts); print target, source, type, opts }' > "$tmp_mounts"
else
printf 'CRITICAL: findmnt or mount is required\n'
exit 2
fi
tmp_ro="$(mktemp)"
trap 'rm -f "$tmp_mounts" "$tmp_ro"' EXIT
awk -v include_system="$include_system" '
function system_fs(type, target) {
return type ~ /^(proc|sysfs|tmpfs|devtmpfs|devpts|securityfs|cgroup|cgroup2|pstore|bpf|tracefs|debugfs|configfs|fusectl|mqueue|hugetlbfs|overlay|squashfs|autofs)$/ || target ~ /^\/(proc|sys|dev|run)(\/|$)/
}
{
target=$1; source=$2; type=$3; opts=$4
if (opts ~ /(^|,)ro(,|$)/) {
if (include_system == 1 || ! system_fs(type, target)) {
print target "\t" source "\t" type "\t" opts
}
}
}
' "$tmp_mounts" > "$tmp_ro"
readonly_count="$(wc -l < "$tmp_ro" | awk '{print $1}')"
status="OK"
exit_code=0
if ((readonly_count > 0)); then
status="CRITICAL"
exit_code=3
fi
printf '%s: Found %s read-only filesystem(s)\n\n' "$status" "$readonly_count"
printf 'Read-only filesystems:\n'
if [[ -s "$tmp_ro" ]]; then
printf 'MOUNT_POINT\tSOURCE\tFSTYPE\tOPTIONS\n'
cat "$tmp_ro"
else
printf 'OK: no read-only filesystems found with current filters\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'include_system=%s\n' "$include_system"
printf 'Collector: '
if command -v findmnt >/dev/null 2>&1; then
printf 'findmnt\n'
else
printf 'mount fallback\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Check dmesg or journal logs for I/O errors and filesystem remount events\n'
printf -- '- Check storage path, multipath, SAN, cloud volume, or underlying disk health\n'
printf -- '- Check filesystem health with the platform-approved procedure\n'
printf -- '- Do not remount read-write before understanding the cause\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,146 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
warning_threshold=75
critical_threshold=90
top_count=10
usage() {
cat <<'USAGE'
Usage: check_high_cpu.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]
Detect high CPU load and show top CPU-consuming processes.
Exit codes:
0 OK
1 WARNING / operational issue detected
2 invalid input / missing required dependency
3 CRITICAL issue detected
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
require_cmd() {
if ! command -v "$1" >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: %s\n' "$1"
exit 2
fi
}
while (($# > 0)); do
case "$1" in
--warning)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }
warning_threshold="$2"
shift 2
;;
--critical)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }
critical_threshold="$2"
shift 2
;;
--top)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }
top_count="$2"
shift 2
;;
--help|-h)
usage
exit 0
;;
*)
printf 'CRITICAL: unknown option: %s\n' "$1"
usage
exit 2
;;
esac
done
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_threshold >= critical_threshold)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
require_cmd ps
require_cmd awk
require_cmd head
cpu_count=1
if command -v getconf >/dev/null 2>&1; then
cpu_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || printf '1')"
elif [[ -r /proc/cpuinfo ]]; then
cpu_count="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || printf '1')"
fi
[[ "$cpu_count" =~ ^[0-9]+$ ]] || cpu_count=1
((cpu_count > 0)) || cpu_count=1
load_1m="unavailable"
load_5m="unavailable"
load_15m="unavailable"
load_per_cpu_pct=0
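if [[ -r /proc/loadavg ]]; then
read -r load_1m load_5m load_15m _ < /proc/loadavg
elif command -v uptime >/dev/null 2>&1; then
load_line="$(uptime 2>/dev/null || true)"
load_1m="$(printf '%s\n' "$load_line" | sed -n 's/.*load average[s]*: *\([^,]*\).*/\1/p')"
[[ -n "$load_1m" ]] || load_1m="unavailable"
fi
# Compute the percentage for whichever source produced a numeric 1-minute load,
# so the uptime fallback is evaluated against the thresholds too.
if [[ "$load_1m" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
load_per_cpu_pct="$(awk -v load_avg="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"
fi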
status="OK"
exit_code=0
if ((load_per_cpu_pct >= critical_threshold)); then
status="CRITICAL"
exit_code=3
elif ((load_per_cpu_pct >= warning_threshold)); then
status="WARNING"
exit_code=1
fi
printf '%s: 1-minute load is %s across %s CPU(s) (%s%% of CPU count)\n\n' "$status" "$load_1m" "$cpu_count" "$load_per_cpu_pct"
printf 'Load average:\n'
printf '1m=%s 5m=%s 15m=%s\n\n' "$load_1m" "$load_5m" "$load_15m"
printf 'CPU count:\n'
printf '%s\n\n' "$cpu_count"
printf 'Top CPU processes:\n'
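# head may close the pipe early; ignore the resulting SIGPIPE status so errexit and pipefail do not abort the report.
ps -eo pid,ppid,user,pcpu,pmem,comm,args --sort=-pcpu | head -n "$((top_count + 1))" || true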
printf '\n'
printf 'Evidence:\n'
if command -v uptime >/dev/null 2>&1; then
uptime || true
else
printf 'WARNING: uptime command not available; used /proc/loadavg where possible\n'
fi
if ((load_per_cpu_pct >= 100)); then
printf 'WARNING: load is higher than online CPU count; runnable task saturation is possible\n'
else
printf 'OK: load is not above online CPU count at collection time\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; process ownership details are usually available, but some command lines may be limited\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Check process ownership and whether the top process is expected\n'
printf -- '- Check recent deployments, cron jobs, batch jobs, or maintenance activity\n'
printf -- '- Review logs for the top CPU-consuming process\n'
printf -- '- Compare with longer trend data from monitoring before taking action\n'
printf -- '- Attach this output to the incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,138 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
warning_threshold=80
critical_threshold=90
since_value="24 hours ago"
top_count=10
usage() {
cat <<'USAGE'
Usage: check_high_memory_oom.sh [--warning PERCENT] [--critical PERCENT] [--since TEXT] [--top N] [--help]
Detect high memory or swap usage and show recent OOM killer evidence.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
require_cmd() {
if ! command -v "$1" >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: %s\n' "$1"
exit 2
fi
}
while (($# > 0)); do
case "$1" in
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
--since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_threshold >= critical_threshold)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
require_cmd free
require_cmd ps
require_cmd awk
require_cmd head
read -r mem_total mem_used swap_total swap_used < <(free -m | awk '
/^Mem:/ { mt=$2; mu=$3 }
/^Swap:/ { st=$2; su=$3 }
END { printf "%d %d %d %d\n", mt, mu, st, su }
')
mem_pct=0
swap_pct=0
if ((mem_total > 0)); then
mem_pct=$((mem_used * 100 / mem_total))
fi
if ((swap_total > 0)); then
swap_pct=$((swap_used * 100 / swap_total))
fi
status="OK"
exit_code=0
if ((mem_pct >= critical_threshold || swap_pct >= critical_threshold)); then
status="CRITICAL"
exit_code=3
elif ((mem_pct >= warning_threshold || swap_pct >= warning_threshold)); then
status="WARNING"
exit_code=1
fi
printf '%s: Memory usage is %s%% and swap usage is %s%%\n\n' "$status" "$mem_pct" "$swap_pct"
printf 'Memory summary:\n'
free -m
printf '\n'
printf 'Top memory processes:\n'
printf 'PID RSS_MB COMMAND\n'
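# head may close the pipe early; ignore the resulting SIGPIPE status so errexit and pipefail do not abort the report.
ps -eo pid=,rss=,comm= --sort=-rss | head -n "$top_count" | awk '{ printf "%-7s %-8d %s\n", $1, int($2 / 1024), $3 }' || true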
printf '\n'
printf 'OOM events since %s:\n' "$since_value"
oom_found=0
oom_source="journalctl"
if command -v journalctl >/dev/null 2>&1; then
if journalctl --since "$since_value" -k --no-pager 2>/dev/null | grep -Ei 'out of memory|oom-killer|killed process' | tail -n 20; then
oom_found=1
fi
else
printf 'WARNING: journalctl not available; checking readable log files\n'
oom_source="log file fallback"
fi
if ((oom_found == 0)); then
for log_file in /var/log/messages /var/log/syslog /var/log/kern.log; do
if [[ -r "$log_file" ]]; then
if grep -Ei 'out of memory|oom-killer|killed process' "$log_file" | tail -n 20; then
oom_found=1
oom_source="$log_file"
break
fi
fi
done
fi
if ((oom_found == 0)); then
printf 'OK: no OOM evidence found in available sources\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%% since="%s"\n' "$warning_threshold" "$critical_threshold" "$since_value"
printf 'OOM evidence source: %s\n' "$oom_source"
if [[ "$oom_source" != "journalctl" ]]; then
printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; kernel logs or process details may be limited\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Check application memory trend\n'
printf -- '- Review JVM heap settings if process is Java\n'
printf -- '- Verify swap pressure and paging activity\n'
printf -- '- Confirm whether OOM events align with application impact\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
warning_threshold=80
critical_threshold=90
top_count=10
usage() {
cat <<'USAGE'
Usage: check_inode_usage.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]
Detect inode exhaustion using df -i.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_threshold >= critical_threshold)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
if ! command -v df >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: df\n'
exit 2
fi
tmp_df="$(mktemp)"
tmp_alerts="$(mktemp)"
trap 'rm -f "$tmp_df" "$tmp_alerts"' EXIT
df -Pi > "$tmp_df"
awk -v warn="$warning_threshold" '
NR > 1 {
pct=$5
gsub(/%/, "", pct)
# gsub leaves pct as a string; add 0 so the comparison is numeric, not lexical.
pct += 0
if (pct >= warn) {
print $0
}
}
' "$tmp_df" > "$tmp_alerts"
max_pct="$(awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); pct += 0; if (pct > max) max=pct } END { printf "%d", max }' "$tmp_df")"
status="OK"
exit_code=0
if ((max_pct >= critical_threshold)); then
status="CRITICAL"
exit_code=3
elif ((max_pct >= warning_threshold)); then
status="WARNING"
exit_code=1
fi
printf '%s: Highest inode usage is %s%%\n\n' "$status" "$max_pct"
printf 'Filesystems above threshold:\n'
if [[ -s "$tmp_alerts" ]]; then
cat "$tmp_alerts"
else
printf 'OK: no filesystems above warning threshold\n'
fi
printf '\n'
printf 'Inode usage table:\n'
cat "$tmp_df"
printf '\n'
printf 'Top affected mount points:\n'
awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); print pct, $6, $1, $2, $3, $4 }' "$tmp_df" \
| sort -rn | head -n "$top_count" \
| awk '{ printf "%s%% %s %s inodes=%s used=%s free=%s\n", $1, $2, $3, $4, $5, $6 }'
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%%\n\n' "$warning_threshold" "$critical_threshold"
printf 'Recommended next steps:\n'
printf -- '- Find directories with many small files under affected mount points\n'
printf -- '- Check logs, cache, spool, session, and temporary directories\n'
printf -- '- Avoid deleting blindly; confirm ownership and application impact first\n'
printf -- '- Confirm whether inode exhaustion is causing write or deploy failures\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
target_pid=""
match_string=""
top_count=10
usage() {
cat <<'USAGE'
Usage: check_jvm_threads_heap.sh [--pid PID | --match STRING] [--top N] [--help]
Provide lightweight JVM process diagnostics. Does not create heap dumps or modify processes.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--pid) [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }; target_pid="$2"; shift 2 ;;
--match) [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }; match_string="$2"; shift 2 ;;
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
if [[ -n "$target_pid" && -n "$match_string" ]]; then
printf 'CRITICAL: use either --pid or --match, not both\n'
exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
printf 'CRITICAL: --pid must be numeric\n'
exit 2
fi
if ! is_number "$top_count"; then
printf 'CRITICAL: --top must be numeric\n'
exit 2
fi
if ! command -v ps >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: ps\n'
exit 2
fi
tmp_java="$(mktemp)"
trap 'rm -f "$tmp_java"' EXIT
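# The [j]ava pattern keeps this awk process from matching its own command line; self excludes this script.
ps -eo pid=,user=,rss=,pcpu=,comm=,args= \
| awk -v self="$$" 'tolower($0) ~ /[j]ava/ && $1 != "" && $1 != self { print }' > "$tmp_java"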
if [[ -z "$target_pid" && -n "$match_string" ]]; then
target_pid="$(grep -F -- "$match_string" "$tmp_java" | awk 'NR == 1 { print $1 }' || true)"
fi
if [[ -z "$target_pid" ]]; then
detected_count="$(wc -l < "$tmp_java" | awk '{print $1}')"
if ((detected_count == 0)); then
printf 'WARNING: No Java processes detected\n\n'
else
printf 'WARNING: Detected %s Java process(es) but no PID selected; rerun with --pid PID for heap detail\n\n' "$detected_count"
fi
printf 'Detected JVM processes:\n'
printf 'PID USER RSS_MB CPU COMMAND\n'
awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' "$tmp_java" | head -n "$top_count" || true
printf '\nRecommended next steps:\n'
printf -- '- Select a JVM process with --pid for focused diagnostics\n'
printf -- '- Review GC logs and application logs for the selected process\n'
printf -- '- Check heap sizing and thread count trend\n'
printf -- '- Capture jstack only if approved by operational process\n'
exit 1
fi
if ! ps -p "$target_pid" >/dev/null 2>&1; then
printf 'CRITICAL: process does not exist or is not visible: %s\n' "$target_pid"
exit 2
fi
proc_line="$(ps -p "$target_pid" -o pid=,user=,rss=,pcpu=,comm=,args=)"
if ! printf '%s\n' "$proc_line" | grep -qi 'java'; then
printf 'WARNING: PID %s does not appear to be a Java process from ps output\n\n' "$target_pid"
status="WARNING"
exit_code=1
else
status="OK"
exit_code=0
fi
thread_count="unavailable"
if [[ -r "/proc/${target_pid}/status" ]]; then
thread_count="$(awk '/^Threads:/ { print $2 }' "/proc/${target_pid}/status")"
fi
printf '%s: JVM diagnostics collected for PID %s\n\n' "$status" "$target_pid"
printf 'Detected JVM process:\n'
printf 'PID USER RSS_MB CPU COMMAND\n'
printf '%s\n' "$proc_line" | awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }'
printf 'Thread count: %s\n\n' "$thread_count"
printf 'Heap and JVM evidence:\n'
if command -v jcmd >/dev/null 2>&1; then
printf '\n[jcmd VM.flags]\n'
jcmd "$target_pid" VM.flags 2>/dev/null || printf 'WARNING: jcmd VM.flags failed; permissions may be limited\n'
printf '\n[jcmd GC.heap_info]\n'
jcmd "$target_pid" GC.heap_info 2>/dev/null || printf 'WARNING: jcmd GC.heap_info failed; permissions may be limited\n'
printf '\n[jcmd Thread.print summary]\n'
jcmd "$target_pid" Thread.print 2>/dev/null | awk '/java.lang.Thread.State/ { state[$0]++ } END { for (item in state) print state[item], item }' | sort -rn | head -n "$top_count" || printf 'WARNING: jcmd Thread.print failed; permissions may be limited\n'
elif command -v jstat >/dev/null 2>&1; then
printf '\n[jstat -gc]\n'
jstat -gc "$target_pid" 1 1 2>/dev/null || printf 'WARNING: jstat failed; permissions may be limited\n'
else
printf 'WARNING: jcmd and jstat are unavailable; heap details skipped\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'PID=%s thread_count=%s top=%s\n' "$target_pid" "$thread_count" "$top_count"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; JVM attach and /proc details may be limited by process ownership\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Review GC logs and recent application errors\n'
printf -- '- Check JVM heap sizing against container or host memory limits\n'
printf -- '- Check thread count trend in monitoring before concluding a leak\n'
printf -- '- Capture jstack only if approved by operational process\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,121 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
warning_offset_ms=500
critical_offset_ms=5000
usage() {
cat <<'USAGE'
Usage: check_ntp_time_drift.sh [--warning-offset MS] [--critical-offset MS] [--help]
Check time synchronization status and offset evidence when available.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--warning-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-offset requires a value\n'; exit 2; }; warning_offset_ms="$2"; shift 2 ;;
--critical-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-offset requires a value\n'; exit 2; }; critical_offset_ms="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
for value in "$warning_offset_ms" "$critical_offset_ms"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_offset_ms >= critical_offset_ms)); then
printf 'CRITICAL: --warning-offset must be lower than --critical-offset\n'
exit 2
fi
system_time="$(date '+%Y-%m-%d %H:%M:%S %Z %z')"
timezone="$(date '+%Z %z')"
sync_status="unknown"
detected_tool="none"
offset_ms=""
timedate_output=""
if command -v timedatectl >/dev/null 2>&1; then
detected_tool="timedatectl"
timedate_output="$(timedatectl 2>/dev/null || true)"
sync_status="$(printf '%s\n' "$timedate_output" | awk -F: '/System clock synchronized|NTP synchronized/ { gsub(/^ +/, "", $2); print $2; exit }')"
[[ -n "$sync_status" ]] || sync_status="unknown"
fi
chronyc_output=""
if command -v chronyc >/dev/null 2>&1; then
detected_tool="chronyc"
chronyc_output="$(chronyc tracking 2>/dev/null || true)"
raw_offset="$(printf '%s\n' "$chronyc_output" | awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }')"
if [[ -n "$raw_offset" ]]; then
offset_ms="$(awk -v seconds="$raw_offset" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d", seconds * 1000 }')"
fi
elif command -v ntpq >/dev/null 2>&1; then
detected_tool="ntpq"
fi
status="OK"
exit_code=0
if [[ "$sync_status" =~ ^(no|false)$ ]]; then
status="WARNING"
exit_code=1
fi
if [[ -n "$offset_ms" ]]; then
if ((offset_ms >= critical_offset_ms)); then
status="CRITICAL"
exit_code=3
elif ((offset_ms >= warning_offset_ms)); then
status="WARNING"
exit_code=1
fi
elif [[ "$detected_tool" == "none" ]]; then
status="WARNING"
exit_code=1
fi
printf '%s: Time sync status=%s offset_ms=%s\n\n' "$status" "$sync_status" "${offset_ms:-unavailable}"
printf 'Time status:\n'
printf 'System time: %s\n' "$system_time"
printf 'Timezone: %s\n' "$timezone"
printf 'Detected tool: %s\n' "$detected_tool"
printf 'NTP synchronized: %s\n' "$sync_status"
printf 'Offset ms: %s\n\n' "${offset_ms:-unavailable}"
printf 'Tool evidence:\n'
if [[ -n "$chronyc_output" ]]; then
printf '%s\n' "$chronyc_output"
elif command -v ntpq >/dev/null 2>&1; then
ntpq -p 2>/dev/null || printf 'WARNING: ntpq command failed\n'
elif [[ -n "$timedate_output" ]]; then
printf '%s\n' "$timedate_output"
else
printf 'WARNING: timedatectl, chronyc, and ntpq are unavailable or returned no data\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%sms critical=%sms\n' "$warning_offset_ms" "$critical_offset_ms"
if [[ -z "$offset_ms" ]]; then
printf 'WARNING: offset unavailable; status is based on available synchronization indicators only\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Verify chrony or ntpd service status and configuration\n'
printf -- '- Check NTP sources and reachability\n'
printf -- '- Check virtualization host time if this is a VM\n'
printf -- '- Avoid restarting time services blindly in production\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
service_name=""
since_value="1 hour ago"
warning_count=3
critical_count=10
usage() {
cat <<'USAGE'
Usage: check_service_restart_loop.sh --service SERVICE_NAME [--since TEXT] [--warning COUNT] [--critical COUNT] [--help]
Detect restart-loop evidence for a systemd service. Read-only.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
require_cmd() {
if ! command -v "$1" >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: %s\n' "$1"
exit 2
fi
}
while (($# > 0)); do
case "$1" in
--service) [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }; service_name="$2"; shift 2 ;;
--since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
if [[ -z "$service_name" ]]; then
printf 'CRITICAL: --service is required\n'
usage
exit 2
fi
for value in "$warning_count" "$critical_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_count >= critical_count)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
require_cmd systemctl
active_state="$(systemctl show "$service_name" --property=ActiveState --value 2>/dev/null || printf 'unknown')"
sub_state="$(systemctl show "$service_name" --property=SubState --value 2>/dev/null || printf 'unknown')"
n_restarts="$(systemctl show "$service_name" --property=NRestarts --value 2>/dev/null || printf '')"
restart_count="${n_restarts:-0}"
if ! is_number "$restart_count"; then
restart_count=0
fi
status="OK"
exit_code=0
if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then
status="CRITICAL"
exit_code=3
elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then
status="WARNING"
exit_code=1
fi
printf '%s: Service %s state=%s substate=%s restarts=%s\n\n' "$status" "$service_name" "$active_state" "$sub_state" "$restart_count"
printf 'Service state:\n'
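status_rc=0
systemctl status "$service_name" --no-pager --lines=8 2>/dev/null || status_rc=$?
# systemctl status exits 3 for inactive or failed units even though it prints their state,
# so only other non-zero codes are reported as a read failure.
if ((status_rc != 0 && status_rc != 3)); then
printf 'WARNING: unable to read service status for %s\n' "$service_name"
fi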
printf '\n'
printf 'Systemd properties:\n'
systemctl show "$service_name" --property=Id,Names,LoadState,ActiveState,SubState,Result,ExecMainStatus,NRestarts,Restart,RestartUSec --no-pager 2>/dev/null || true
printf '\n'
printf 'Recent start/stop/failure log lines since %s:\n' "$since_value"
if command -v journalctl >/dev/null 2>&1; then
journalctl -u "$service_name" --since "$since_value" --no-pager 2>/dev/null \
| grep -Ei 'start|stop|fail|restart|exit|status|main process' \
| tail -n 40 || printf 'OK: no matching journal lines found\n'
else
printf 'WARNING: journalctl not available; service logs unavailable from this script\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%s restarts critical=%s restarts since="%s"\n' "$warning_count" "$critical_count" "$since_value"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; journal visibility may be limited\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Inspect the unit file and drop-in overrides\n'
printf -- '- Review application logs around the restart timestamps\n'
printf -- '- Check dependencies such as network, storage, database, or secrets\n'
printf -- '- Verify recent configuration or package changes\n'
printf -- '- Do not restart blindly; attach this output to the incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,20 @@
WARNING: Certificate for app.example.com:443 expires in 18 day(s)
Certificate details:
Subject: CN = app.example.com
Issuer: C = US, O = Example CA, CN = Example Intermediate CA
notBefore: Apr 11 00:00:00 2026 GMT
notAfter: May 29 23:59:59 2026 GMT
SAN/CN: DNS:app.example.com, DNS:api.example.com
Evidence:
Target: app.example.com:443
SNI: app.example.com
Thresholds: warning=30 days critical=7 days
Recommended next steps:
- Renew certificate before the operational threshold is breached
- Check the full chain and intermediate certificates
- Check the load balancer, ingress, or reverse proxy serving this certificate
- Verify monitoring threshold and alert ownership
- Attach this output to incident or change ticket
@@ -0,0 +1,23 @@
OK: DNS=OK ping=OK tcp_443=OK
DNS result:
93.184.216.34 example.com
Ping result:
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
TCP port result:
OK: TCP connection to example.com:443 succeeded
Local network hints:
default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15
Evidence:
Host: example.com count=3 timeout=3s port=443
Recommended next steps:
- Verify the DNS record and resolver path
- Check firewall, routing, security group, or proxy policy
- Compare results from another host or network segment
- Check application endpoint health after network reachability is confirmed
- Attach this output to incident ticket
@@ -0,0 +1,26 @@
CRITICAL: Found 73 failed SSH login attempt(s) for requested window
Top source IPs:
52 203.0.113.44
12 198.51.100.20
9 192.0.2.10
Top attempted users:
31 admin
24 oracle
18 root
Sample recent lines:
May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20
Evidence:
Thresholds: warning=20 critical=50 since="1 hour ago"
Log source: journalctl
Recommended next steps:
- Verify source IPs against expected scanners, admins, or automation
- Check firewall, fail2ban, or security tooling state
- Confirm whether the attempts are expected for this host
- Review successful logins too, not only failures
- Attach this output to incident ticket
@@ -0,0 +1,16 @@
CRITICAL: Found 1 read-only filesystem(s)
Read-only filesystems:
MOUNT_POINT SOURCE FSTYPE OPTIONS
/data /dev/mapper/vg_data-lv_data xfs ro,relatime,seclabel,attr2,inode64
Evidence:
include_system=0
Collector: findmnt
Recommended next steps:
- Check dmesg or journal logs for I/O errors and filesystem remount events
- Check storage path, multipath, SAN, cloud volume, or underlying disk health
- Check filesystem health with the platform-approved procedure
- Do not remount read-write before understanding the cause
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)
Load average:
1m=7.82 5m=6.91 15m=5.40
CPU count:
8
Top CPU processes:
PID PPID USER %CPU %MEM COMMAND COMMAND
2314 1 app 245 12.1 java java -jar order-api.jar
991 1 root 38 0.4 backup-agent backup-agent --scan
Evidence:
WARNING: load is close to online CPU count; runnable task saturation is possible
Recommended next steps:
- Check process ownership and whether the top process is expected
- Check recent deployments, cron jobs, batch jobs, or maintenance activity
- Review logs for the top CPU-consuming process
- Compare with longer trend data from monitoring before taking action
- Attach this output to the incident ticket
@@ -0,0 +1,25 @@
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
total used free shared buff/cache available
Mem: 15934 13386 512 121 2036 2101
Swap: 4095 512 3583
Top memory processes:
PID RSS_MB COMMAND
1234 2048 java
987 812 postgres
OOM events since 24 hours ago:
2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)
Evidence:
Thresholds: warning=80% critical=90% since="24 hours ago"
OOM evidence source: journalctl
Recommended next steps:
- Check application memory trend
- Review JVM heap settings if process is Java
- Verify swap pressure and paging activity
- Confirm whether OOM events align with application impact
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: Highest inode usage is 87%
Filesystems above threshold:
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var
Inode usage table:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root 524288 91300 432988 18% /
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var
Top affected mount points:
87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394
Evidence:
Thresholds: warning=80% critical=90%
Recommended next steps:
- Find directories with many small files under affected mount points
- Check logs, cache, spool, session, and temporary directories
- Avoid deleting blindly; confirm ownership and application impact first
- Confirm whether inode exhaustion is causing write or deploy failures
- Attach this output to incident ticket
@@ -0,0 +1,30 @@
OK: JVM diagnostics collected for PID 1234
Detected JVM process:
PID USER RSS_MB CPU COMMAND
1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar
Thread count: 188
Heap and JVM evidence:
[jcmd VM.flags]
1234:
-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648
[jcmd GC.heap_info]
garbage-first heap total 2097152K, used 1521000K
[jcmd Thread.print summary]
102 java.lang.Thread.State: WAITING
53 java.lang.Thread.State: RUNNABLE
33 java.lang.Thread.State: TIMED_WAITING
Evidence:
PID=1234 thread_count=188 top=10
Recommended next steps:
- Review GC logs and recent application errors
- Check JVM heap sizing against container or host memory limits
- Check thread count trend in monitoring before concluding a leak
- Capture jstack only if approved by operational process
- Attach this output to incident ticket
@@ -0,0 +1,23 @@
WARNING: Time sync status=yes offset_ms=812
Time status:
System time: 2026-05-11 10:18:01 UTC +0000
Timezone: UTC +0000
Detected tool: chronyc
NTP synchronized: yes
Offset ms: 812
Tool evidence:
Reference ID : 203.0.113.10
System time : 0.812345 seconds fast of NTP time
Last offset : +0.812345 seconds
Evidence:
Thresholds: warning=500ms critical=5000ms
Recommended next steps:
- Verify chrony or ntpd service status and configuration
- Check NTP sources and reachability
- Check virtualization host time if this is a VM
- Avoid restarting time services blindly in production
- Attach this output to incident ticket
@@ -0,0 +1,27 @@
CRITICAL: Service app.service state=failed substate=failed restarts=12
Service state:
app.service - Example application
Loaded: loaded (/etc/systemd/system/app.service; enabled)
Active: failed (Result: exit-code)
Systemd properties:
Id=app.service
ActiveState=failed
SubState=failed
Result=exit-code
NRestarts=12
Recent start/stop/failure log lines since 1 hour ago:
May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.
Evidence:
Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"
Recommended next steps:
- Inspect the unit file and drop-in overrides
- Review application logs around the restart timestamps
- Check dependencies such as network, storage, database, or secrets
- Verify recent configuration or package changes
- Do not restart blindly; attach this output to the incident ticket
@@ -0,0 +1,385 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
incident_type=""
service_name=""
host_name=""
port=""
target_pid=""
match_string=""
output_file=""
since_value="1 hour ago"
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
usage() {
cat <<'USAGE'
Usage: incident_triage_report.sh --type TYPE [options]
Run selected read-only incident checks and produce a Markdown triage report.
Incident types:
cpu
memory
service
network
auth
cert
filesystem
jvm
all
Options:
--type TYPE Incident type to collect
--service SERVICE_NAME systemd service name for service checks
--host HOSTNAME_OR_FQDN host for DNS, network, or certificate checks
--port PORT TCP or TLS port for host checks
--pid PID JVM process ID
--match PROCESS_MATCH JVM process match string
--output FILE write Markdown report to FILE
--since VALUE time window for log-based checks
--help show this help
Examples:
./incident_triage_report.sh --type cpu
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
./incident_triage_report.sh --type network --host app.example.com --port 443
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
valid_type() {
case "$1" in
cpu|memory|service|network|auth|cert|filesystem|jvm|all) return 0 ;;
*) return 1 ;;
esac
}
while (($# > 0)); do
case "$1" in
--type)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --type requires a value\n'; exit 2; }
incident_type="$2"
shift 2
;;
--service)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }
service_name="$2"
shift 2
;;
--host)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }
host_name="$2"
shift 2
;;
--port)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }
port="$2"
shift 2
;;
--pid)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }
target_pid="$2"
shift 2
;;
--match)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }
match_string="$2"
shift 2
;;
--output)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --output requires a value\n'; exit 2; }
output_file="$2"
shift 2
;;
--since)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }
since_value="$2"
shift 2
;;
--help|-h)
usage
exit 0
;;
*)
printf 'CRITICAL: unknown option: %s\n' "$1"
usage
exit 2
;;
esac
done
if [[ -z "$incident_type" ]]; then
printf 'CRITICAL: --type is required\n'
usage
exit 2
fi
if ! valid_type "$incident_type"; then
printf 'CRITICAL: unsupported incident type: %s\n' "$incident_type"
usage
exit 2
fi
if [[ -n "$port" ]] && ! is_number "$port"; then
printf 'CRITICAL: --port must be numeric\n'
exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
printf 'CRITICAL: --pid must be numeric\n'
exit 2
fi
if [[ -n "$target_pid" && -n "$match_string" ]]; then
printf 'CRITICAL: use either --pid or --match for JVM checks, not both\n'
exit 2
fi
tmp_dir="$(mktemp -d)"
trap 'rm -rf "$tmp_dir"' EXIT
report_file="$tmp_dir/report.md"
check_labels=()
check_names=()
check_commands=()
check_statuses=()
check_exit_codes=()
check_summaries=()
check_outputs=()
status_from_exit() {
case "$1" in
0) printf 'OK' ;;
1) printf 'WARNING' ;;
2) printf 'INVALID' ;;
3) printf 'CRITICAL' ;;
*) printf 'ERROR' ;;
esac
}
render_command() {
local item
for item in "$@"; do
printf '%q ' "$item"
done | sed 's/[[:space:]]*$//'
}
append_skipped_check() {
local label="$1"
local name="$2"
local reason="$3"
local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
printf 'SKIPPED: %s\n' "$reason" > "$output_path"
check_labels+=("$label")
check_names+=("$name")
check_commands+=("not run")
check_statuses+=("SKIPPED")
check_exit_codes+=("-")
check_summaries+=("$reason")
check_outputs+=("$output_path")
}
run_check() {
local label="$1"
local script_name="$2"
shift 2
local script_path="${script_dir}/${script_name}"
local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
local command_text
local exit_code
local status
local summary
command_text="$(render_command "$script_path" "$@")"
if [[ ! -e "$script_path" ]]; then
append_skipped_check "$label" "$script_name" "missing script: $script_name"
return
fi
if [[ ! -x "$script_path" ]]; then
append_skipped_check "$label" "$script_name" "script is not executable: $script_name"
return
fi
set +e
"$script_path" "$@" > "$output_path" 2>&1
exit_code=$?
set -e
status="$(status_from_exit "$exit_code")"
summary="$(sed -n '1p' "$output_path")"
if [[ -z "$summary" ]]; then
summary="no output captured"
fi
check_labels+=("$label")
check_names+=("$script_name")
check_commands+=("$command_text")
check_statuses+=("$status")
check_exit_codes+=("$exit_code")
check_summaries+=("$summary")
check_outputs+=("$output_path")
}
run_cpu_checks() {
run_check "CPU saturation" "check_high_cpu.sh"
}
run_memory_checks() {
run_check "Memory and OOM" "check_high_memory_oom.sh" --since "$since_value"
}
run_service_checks() {
if [[ -z "$service_name" ]]; then
append_skipped_check "Service restart loop" "check_service_restart_loop.sh" "requires --service SERVICE_NAME"
return
fi
run_check "Service restart loop" "check_service_restart_loop.sh" --service "$service_name" --since "$since_value"
}
run_network_checks() {
local args=(--host "$host_name")
if [[ -z "$host_name" ]]; then
append_skipped_check "DNS and connectivity" "check_dns_connectivity.sh" "requires --host HOSTNAME_OR_FQDN"
return
fi
if [[ -n "$port" ]]; then
args+=(--port "$port")
fi
run_check "DNS and connectivity" "check_dns_connectivity.sh" "${args[@]}"
}
run_auth_checks() {
run_check "Failed SSH logins" "check_failed_ssh_logins.sh" --since "$since_value"
}
run_cert_checks() {
local args=(--host "$host_name")
if [[ -z "$host_name" ]]; then
append_skipped_check "Certificate expiry" "check_certificate_expiry.sh" "requires --host HOSTNAME_OR_FQDN"
return
fi
if [[ -n "$port" ]]; then
args+=(--port "$port")
fi
run_check "Certificate expiry" "check_certificate_expiry.sh" "${args[@]}"
}
run_filesystem_checks() {
run_check "Read-only filesystems" "check_filesystem_readonly.sh"
run_check "Inode usage" "check_inode_usage.sh"
}
run_jvm_checks() {
local args=()
if [[ -n "$target_pid" ]]; then
args+=(--pid "$target_pid")
elif [[ -n "$match_string" ]]; then
args+=(--match "$match_string")
fi
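# Guarded expansion: with nounset, bash older than 4.4 treats an empty array expansion as unbound.
run_check "JVM threads and heap" "check_jvm_threads_heap.sh" ${args[@]+"${args[@]}"}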
}
case "$incident_type" in
cpu) run_cpu_checks ;;
memory) run_memory_checks ;;
service) run_service_checks ;;
network) run_network_checks ;;
auth) run_auth_checks ;;
cert) run_cert_checks ;;
filesystem) run_filesystem_checks ;;
jvm) run_jvm_checks ;;
all)
run_cpu_checks
run_memory_checks
run_service_checks
run_network_checks
run_auth_checks
run_cert_checks
run_filesystem_checks
run_jvm_checks
;;
esac
generated_at="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
local_hostname="$(hostname 2>/dev/null || printf 'unknown')"
current_user="$(id -un 2>/dev/null || printf 'unknown')"
{
printf '# L2 Incident Triage Report\n\n'
printf -- '- Generated: %s\n' "$generated_at"
printf -- '- Local hostname: %s\n' "$local_hostname"
printf -- '- Current user: %s\n' "$current_user"
printf -- '- Incident type: %s\n' "$incident_type"
printf -- '- Service: %s\n' "${service_name:-not provided}"
printf -- '- Host: %s\n' "${host_name:-not provided}"
printf -- '- Port: %s\n' "${port:-not provided}"
printf -- '- PID: %s\n' "${target_pid:-not provided}"
printf -- '- Process match: %s\n' "${match_string:-not provided}"
printf -- '- Since: %s\n\n' "$since_value"
printf '## Executed Checks\n\n'
printf '| Check | Script | Status | Exit | Command |\n'
printf '| --- | --- | --- | --- | --- |\n'
for index in "${!check_labels[@]}"; do
printf "| %s | \`%s\` | %s | %s | \`%s\` |\n" \
"${check_labels[$index]}" \
"${check_names[$index]}" \
"${check_statuses[$index]}" \
"${check_exit_codes[$index]}" \
"${check_commands[$index]}"
done
printf '\n'
printf '## Summary\n\n'
for index in "${!check_labels[@]}"; do
printf -- '- %s: %s\n' "${check_labels[$index]}" "${check_summaries[$index]}"
done
printf '\n'
printf '## Raw Evidence\n\n'
for index in "${!check_labels[@]}"; do
printf '### %s\n\n' "${check_labels[$index]}"
printf "Script: \`%s\`\n\n" "${check_names[$index]}"
printf "Command: \`%s\`\n\n" "${check_commands[$index]}"
printf 'Status: %s, exit: %s\n\n' "${check_statuses[$index]}" "${check_exit_codes[$index]}"
printf '```text\n'
cat "${check_outputs[$index]}"
printf '\n```\n\n'
done
printf '## L2 Handover Checklist\n\n'
printf -- '- [ ] Business impact confirmed\n'
printf -- '- [ ] Affected host/service identified\n'
printf -- '- [ ] Monitoring alert attached\n'
printf -- '- [ ] Recent changes checked\n'
printf -- '- [ ] Logs attached\n'
printf -- '- [ ] Service owner identified\n'
printf -- '- [ ] Escalation target identified\n\n'
printf '## Escalation Notes\n\n'
printf -- '- Escalate when impact is active, spreading, customer-facing, or outside L2 access.\n'
printf -- '- Include the alert, timeline, commands run, and the raw evidence above.\n'
printf -- '- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.\n'
printf -- '- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.\n\n'
printf '## Recommended Next Steps\n\n'
printf -- '- Confirm the symptom against monitoring and user reports.\n'
printf -- '- Compare this point-in-time evidence with recent deploys, config changes, and host events.\n'
printf -- '- Attach this report to the incident ticket before handoff.\n'
printf -- '- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.\n'
} > "$report_file"
if [[ -n "$output_file" ]]; then
cp "$report_file" "$output_file"
printf 'OK: wrote L2 incident triage report to %s\n' "$output_file"
else
cat "$report_file"
fi
+67 -3
@@ -1,5 +1,69 @@
# Python Operational Tools
This directory contains small Python utilities that support operational analysis in `infra-run`.
Python is used here only when it adds practical value over Bash: parsing structured or noisy input, producing repeatable reports, comparing evidence, or emitting machine-readable output for later automation. Shell remains the default choice for direct host checks and simple command wrappers.
## Tools
| Tool | Path | Purpose | Typical use | Example command |
| --- | --- | --- | --- | --- |
| incident-log-summary | [incident-log-summary](./incident-log-summary/) | Summarize configured incident patterns from one local log file. | First-pass incident notes from system or application logs. | `python3 incident_log_summary.py --file examples/system-messages.log` |
| log-diff-checker | [log-diff-checker](./log-diff-checker/) | Compare configured patterns before and after a change. | Post-change review for new, increased, decreased, resolved, or unchanged log symptoms. | `python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log` |
| auth-log-audit | [auth-log-audit](./auth-log-audit/) | Summarize SSH, sudo, su, and PAM findings from local authentication logs. | Authentication incident review or access-control evidence gathering. | `python3 auth_log_audit.py --file examples/sample-auth.log` |
| jvm-log-analyzer | [jvm-log-analyzer](./jvm-log-analyzer/) | Summarize JVM exceptions, stack traces, HTTP 5xx entries, database issues, and TLS symptoms. | Java application support, restart review, or incident handoff evidence. | `python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log` |
| journal-analyzer | [journal-analyzer](./journal-analyzer/) | Summarize exported `journalctl` text for failed units, restart loops, OOM events, and service warnings. | Linux service incident review or patching/change evidence. | `python3 journal_analyzer.py --file examples/sample-journal.log` |
| known-error-matcher | [known-error-matcher](./known-error-matcher/) | Match local logs against a JSON known-error catalog. | Connect known symptoms to severity, category, samples, and runbook references. | `python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json` |
## Expected Use Cases
- Log parsing for incident review.
- Markdown or text report generation from collected evidence.
- Change evidence helpers for pre-check and post-check notes.
- Incident summary builders from sanitized inputs.
- Structured output for automation, such as JSON where useful.
## Standards
- Use the Python standard library only unless a later tool clearly justifies another dependency.
- Keep tools read-only by default.
- Do not perform destructive actions.
- Use `argparse` for command-line interfaces.
- Produce predictable text output suitable for terminal review and change notes.
- Support text, Markdown, and JSON output where useful for terminal review, tickets, or local automation.
- Use an `OK`, `WARNING`, `CRITICAL`, and `UNKNOWN` status model for findings.
- Handle malformed input, permission problems, and runtime errors defensively.
- Return meaningful exit codes.
- Keep each tool small, focused, and easy to review.
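
As a rough illustration of the standards above, a read-only `argparse` skeleton might look like the following. This is a minimal sketch: the program name `example_tool` and the exact flag set are assumptions chosen to mirror the tools in the table, not code taken from any of them.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a small read-only CLI; every flag here is an illustrative assumption."""
    parser = argparse.ArgumentParser(
        prog="example_tool",  # hypothetical tool name
        description="Summarize findings from one local log file (read-only).",
    )
    parser.add_argument("--file", required=True, help="local log file to analyze")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="report format",
    )
    parser.add_argument("--output", help="optional path for the rendered report")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args(["--file", "app.log", "--format", "json"])
print(args.file, args.format, args.output)
```

When `argparse` rejects an argument it exits with status `2` on its own, which lines up with the "invalid input" exit code described below.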
## Exit Codes
- `0` - OK, no findings, or successful validation.
- `1` - Operational findings detected.
- `2` - Invalid input, missing dependency, permission issue, or runtime error.
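
One way to keep this contract in a single place is a small status-to-exit-code helper. The sketch below is an assumption about how such a mapping could be written; in particular, treating `UNKNOWN` as exit code `2` is a guess, since the list above does not say where `UNKNOWN` lands.

```python
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_ERROR = 2


def exit_code_for(status: str) -> int:
    """Map a report status to a process exit code (assumed mapping)."""
    if status == "OK":
        return EXIT_OK
    if status in ("WARNING", "CRITICAL"):
        # Both severities collapse into "operational findings detected".
        return EXIT_FINDINGS
    # Assumption: anything else, including UNKNOWN, is reported as an error.
    return EXIT_ERROR
```

Note that this differs from the Bash check scripts in this change set, which reserve exit code `3` for `CRITICAL`.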
## Validation
From the repository root:
```bash
bash scripts/check-python.sh
bash scripts/validate-repo.sh
```
The checks use `python3 -m py_compile` and do not require external Python dependencies.
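
For readers who want the same syntax check from Python, the standard-library `py_compile` module can be called directly. This is a hedged sketch, not the contents of `scripts/check-python.sh`; the helper name `syntax_errors` is hypothetical.

```python
import py_compile
from pathlib import Path


def syntax_errors(root: str) -> list[str]:
    """Return one message per Python file under root that fails to compile."""
    failures = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append(f"{path}: {exc.msg}")
    return failures
```

Like `python3 -m py_compile`, this writes bytecode into `__pycache__` next to files that compile cleanly.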
## Expected Tool Structure
Future tools should use a small self-contained layout:
```text
tool-name/
tool_name.py
README.md
examples/
sample-input.log
sample-report.md
```
Do not add package metadata, framework scaffolding, or external dependency files unless a future tool has a specific operational reason.
@@ -0,0 +1,190 @@
# auth-log-audit
`auth-log-audit` is a read-only Python CLI for reviewing local Linux authentication logs. It summarizes suspicious SSH, sudo, su, and PAM authentication patterns that may require operator review during incident response, hardening checks, or access-control evidence gathering.
The tool analyzes collected log files only. It does not modify logs, query remote systems, or prove compromise.
## When To Use
- During incident response when `/var/log/auth.log`, `/var/log/secure`, or an exported authentication log needs a quick first-pass summary.
- During Linux hardening or access review when repeated failures, invalid users, root login attempts, or sudo failures need to be surfaced.
- Before attaching authentication evidence to an incident, security, problem, or compliance review ticket.
- When JSON output is useful for local automation or repeatable reporting.
## What It Does
- Reads one local authentication log supplied with `--file`.
- Detects common SSH, sudo, su, and PAM authentication events.
- Extracts usernames, source IPs, authentication methods, services, timestamps, and sample raw lines where practical.
- Aggregates failed login counts by source IP and username.
- Flags suspicious source IPs and usernames when failed attempts meet the configured threshold.
- Produces text, Markdown, or JSON output.
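
The extraction and aggregation steps above can be sketched with a regular expression and `collections.Counter`. The pattern below is an assumption that covers only the common OpenSSH `Failed password` line shape; it is not the tool's actual pattern set, and the helper names are hypothetical.

```python
import re
from collections import Counter

# Assumed pattern for common OpenSSH "Failed password" lines; real logs vary.
FAILED_PASSWORD = re.compile(
    r"Failed password for (?:invalid user )?(?P<user>\S+) "
    r"from (?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
)


def count_failures(lines):
    """Count failed password attempts per source IP and per username."""
    by_ip, by_user = Counter(), Counter()
    for line in lines:
        match = FAILED_PASSWORD.search(line)
        if match:
            by_ip[match.group("ip")] += 1
            by_user[match.group("user")] += 1
    return by_ip, by_user


def at_or_above(counts, threshold):
    """Return entries whose count meets the threshold, highest first."""
    return [(key, n) for key, n in counts.most_common() if n >= threshold]
```

Keeping the counters separate from the threshold filter means the same totals can feed both the "top" summaries and the suspicious-entry lists.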
## What It Does Not Do
- It does not detect breaches or prove compromise.
- It does not read remote systems or live journal streams.
- It does not modify logs, accounts, SSH configuration, sudoers, or host state.
- It does not query SIEM, SOC tooling, ELK, Zabbix, identity providers, or ticketing systems.
- It does not replace host-specific incident response, access review, or forensic procedures.
- It does not classify every vendor-specific authentication message.
## Supported Input Types
- Debian/Ubuntu-style `/var/log/auth.log`.
- RHEL/Oracle Linux-style `/var/log/secure`.
- Exported authentication logs with similar syslog-style lines.
- UTF-8 text input is expected. Invalid byte sequences are replaced during read so review can continue.
Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
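
A minimal sketch of this input handling, assuming a hypothetical `InputError` exception that the CLI would translate into exit code `2`:

```python
from pathlib import Path


class InputError(Exception):
    """Raised for inputs the CLI should reject (hypothetical helper type)."""


def read_log_lines(path_text: str) -> list[str]:
    """Validate the path, then read it as UTF-8 with undecodable bytes replaced."""
    path = Path(path_text)
    if not path.is_file():
        raise InputError(f"not a regular file: {path}")
    try:
        # errors="replace" substitutes U+FFFD for invalid byte sequences.
        text = path.read_text(encoding="utf-8", errors="replace")
    except OSError as exc:
        raise InputError(f"cannot read {path}: {exc}") from exc
    if not text.strip():
        raise InputError(f"input file is empty: {path}")
    return text.splitlines()
```

Replacing invalid bytes rather than failing keeps a partially corrupted log reviewable, at the cost of a few substituted characters in sample lines.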
## Supported Event Categories
SSH-related:
- Failed SSH password login.
- Failed SSH publickey login.
- Successful SSH login.
- Invalid user attempts.
- Root login attempts.
- Refused or disallowed user attempts.
- Disconnects after failed authentication where detectable.
- Too many authentication failures where detectable.
sudo and su-related:
- sudo command usage.
- sudo authentication failure.
- su session opened.
- su authentication failure.
Generic authentication:
- authentication failure.
- `pam_unix` authentication failure.
- Account locked messages where detectable.
- User not known to the underlying authentication module.
## Timestamp Handling
The scanner attempts to parse:
- `May 11 10:15:30`
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and first seen / last seen values are reported as `UNKNOWN` when no parseable event timestamps are found. Syslog timestamps without a year use the current local year internally while preserving the original timestamp shape in text and Markdown output.
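
The best-effort parsing described above could look roughly like this. It is a sketch under two assumptions: the function name `parse_timestamp` is hypothetical, and `%b` relies on English month abbreviations, which holds under the default C locale.

```python
from datetime import datetime


def parse_timestamp(text, now=None):
    """Return a datetime for the supported shapes, or None if unparseable."""
    now = now or datetime.now()
    candidates = (
        (text, "%Y-%m-%d %H:%M:%S"),
        (text, "%Y-%m-%dT%H:%M:%S"),
        # Syslog shape has no year; prepend the current local year before parsing.
        (f"{now.year} {text}", "%Y %b %d %H:%M:%S"),
    )
    for value, fmt in candidates:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    return None
```

Prepending the year before calling `strptime` avoids parsing a day of month without a year, and returning `None` lets the caller keep analyzing the line while reporting first seen and last seen as `UNKNOWN`.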
## Suspicious Activity Model
Default threshold:
```text
--threshold-failed 5
```
The report classifies findings conservatively:
- `OK` - no suspicious findings.
- `WARNING` - repeated failed logins, invalid users, root login attempts below the threshold, or sudo authentication failures.
- `CRITICAL` - root login attempts above the threshold, high-volume brute-force indicators, or multiple suspicious source IPs above the threshold.
This status is a triage signal. It identifies suspicious authentication patterns that require review; it does not confirm a breach.
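
To make the status model concrete, here is a simplified sketch of a classifier over plain counts. The rules are assumptions drawn from the list above: "multiple" is read as two or more source IPs, counts that equal the threshold are treated as meeting it, and the "high-volume brute-force" rule is left out because its definition is not given here.

```python
def classify(failed_by_ip, root_attempts, invalid_users, sudo_failures, threshold=5):
    """Return OK, WARNING, or CRITICAL from simple counts (assumed rules)."""
    suspicious_ips = [ip for ip, n in failed_by_ip.items() if n >= threshold]
    # Assumption: "multiple" means two or more source IPs at or above the threshold.
    if root_attempts >= threshold or len(suspicious_ips) >= 2:
        return "CRITICAL"
    if suspicious_ips or root_attempts or invalid_users or sudo_failures:
        return "WARNING"
    return "OK"
```

With the counts from the example report further down (one source IP at 7 failures, 2 root login attempts, 1 invalid user, 1 sudo failure), this sketch yields `WARNING`.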
## Usage
```bash
cd infra-run/scripts/python/auth-log-audit
python3 auth_log_audit.py --file examples/sample-auth.log
python3 auth_log_audit.py --file examples/sample-secure.log
python3 auth_log_audit.py --file examples/sample-auth.log --format markdown
python3 auth_log_audit.py --file examples/sample-auth.log --format markdown --output auth-report.md
python3 auth_log_audit.py --file examples/sample-auth.log --format json
python3 auth_log_audit.py --file examples/sample-auth.log --top 10
python3 auth_log_audit.py --file examples/sample-auth.log --threshold-failed 5
python3 auth_log_audit.py --file examples/sample-auth.log --ignore-users monitoring,backup,ansible
```
Ignored users are excluded from suspicious username threshold findings. Their events are still counted in totals and can still appear in top-user summaries so operational context is not silently hidden.
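
The ignore behaviour can be sketched as a filter applied only at the threshold step, so the underlying counts stay complete. The helper names below are hypothetical and the comma-splitting is an assumption about how `--ignore-users` is parsed.

```python
from collections import Counter


def parse_ignore_list(value):
    """Split a comma-separated --ignore-users value, dropping empty items."""
    return [name.strip() for name in value.split(",") if name.strip()]


def suspicious_users(failed_by_user, threshold, ignored=()):
    """Flag usernames at or above the threshold, skipping ignored accounts."""
    skip = set(ignored)
    return {
        user: n
        for user, n in failed_by_user.items()
        if n >= threshold and user not in skip
    }


failed = Counter({"backup": 9, "admin": 6, "alice": 1})
flagged = suspicious_users(failed, threshold=5, ignored=parse_ignore_list("monitoring,backup"))
# Totals and top-user summaries still come from the unfiltered counter.
print(flagged)
print(failed.most_common(1))
```

Here `backup` is dropped from the flagged set but still leads the top-user summary, which matches the intent of not hiding operational context.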
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident or security ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file.
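
A guard of this kind can be sketched with `pathlib`. The function name `safe_output_path` is hypothetical, and comparing resolved paths is one plausible approach rather than a confirmed description of the tool's check.

```python
from pathlib import Path


def safe_output_path(input_file: str, output_file: str) -> Path:
    """Return the resolved output path, refusing to overwrite the input log."""
    source = Path(input_file).resolve()
    target = Path(output_file).resolve()
    # resolve() follows symlinks and normalizes "." and "..", so aliases are caught.
    if source == target:
        raise ValueError(f"output path resolves to the input file: {target}")
    return target
```

Comparing resolved paths does not detect hard links to the same file; a stricter variant could add `os.path.samefile` when the output path already exists.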
## Exit Codes
- `0` - OK, no suspicious findings.
- `1` - Suspicious findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
## Example Text Output
```text
Auth Log Audit
==============
Overall status: WARNING
First seen: May 11 09:58:12
Last seen: May 11 10:07:48
Top Source IPs by Failed Attempts
---------------------------------
- 203.0.113.50: 7
- 198.51.100.23: 1
Suspicious Source IPs
---------------------
- 203.0.113.50: 7
Operational Summary
-------------------
Overall status: WARNING
Total lines scanned: 15
Authentication events detected: 15
Failed logins: 8
Successful logins: 1
Invalid user attempts: 1
Root login attempts: 2
Sudo usage events: 1
Sudo authentication failures: 1
su events: 1
Suspicious source IPs: 1
Suspicious usernames: 0
Threshold used: 5
Ignored users: None
```
## Markdown Workflow
Generate a Markdown report from a collected authentication log and attach it to the incident or security ticket as supporting evidence:
```bash
python3 auth_log_audit.py \
  --file examples/sample-auth.log \
  --format markdown \
  --output auth-report.md
```
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be reviewed with host access history, SSH configuration, sudo policy, user ownership, and any relevant monitoring evidence.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single line may produce more than one event when PAM and service messages overlap.
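- Syslog timestamps without a year are normalized internally with the current local year, so a log that spans a year boundary can produce misleading first-seen and last-seen values.
- Source IP extraction is IPv4-only. The SSH login patterns require an IPv4 source address, so most events from IPv6 sources are not matched.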
- The tool compares counts, not rates, authentication windows, geolocation, or identity context.
- Large log files are read into memory; collect scoped extracts for very large incidents.
- Vendor-specific PAM modules or SSH daemon formats may need future patterns.
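The year-normalization limitation can be shown in isolation. This sketch uses the same `strptime` format string as `parse_line_timestamp`; the helper name `parse_syslog` and the fixed year are illustrative.

```python
from datetime import datetime


def parse_syslog(raw: str, year: int) -> datetime:
    """Attach an assumed year to a syslog-style timestamp and parse it."""
    return datetime.strptime(f"{year} {raw}", "%Y %b %d %H:%M:%S")


# Two events twenty seconds apart across New Year, both parsed with the same year.
before_midnight = parse_syslog("Dec 31 23:59:50", 2026)
after_midnight = parse_syslog("Jan 1 00:00:10", 2026)

# The later event now sorts almost a year earlier than the one that preceded it.
print(after_midnight < before_midnight)
```

For logs that cross a year boundary, collecting a scoped extract from one side of the boundary avoids the mis-ordering.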
## Safety Notes
- The tool only reads the input log and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not prove compromise or determine root cause automatically.
@@ -0,0 +1,734 @@
#!/usr/bin/env python3
"""Summarize suspicious authentication activity in local Linux auth logs."""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
ISO_TIMESTAMP_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
SERVICE_RE = re.compile(r"\s([A-Za-z0-9_.-]+)(?:\[\d+\])?:\s")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EVENT_PATTERNS = [
    {
        "event_type": "failed_ssh_password",
        "category": "failed_login",
        "method": "password",
        "regex": re.compile(
            r"sshd(?:\[\d+\])?: Failed password for (?:(invalid user) )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
        ),
    },
    {
        "event_type": "failed_ssh_publickey",
        "category": "failed_login",
        "method": "publickey",
        "regex": re.compile(
            r"sshd(?:\[\d+\])?: Failed publickey for (?:(invalid user) )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
        ),
    },
    {
        "event_type": "successful_ssh_login",
        "category": "successful_login",
        "method": None,
        "regex": re.compile(
            r"sshd(?:\[\d+\])?: Accepted (\S+) for (\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
        ),
    },
    {
        "event_type": "invalid_user_attempt",
        "category": "invalid_user",
        "method": None,
        "regex": re.compile(
            r"sshd(?:\[\d+\])?: Invalid user (\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
        ),
    },
    {
        "event_type": "refused_user_attempt",
        "category": "refused_user",
        "method": None,
        "regex": re.compile(
            r"sshd(?:\[\d+\])?: (?:User|Connection closed by invalid user) (\S+).*?from ((?:\d{1,3}\.){3}\d{1,3})"
        ),
    },
    {
        "event_type": "disconnect_after_failed_auth",
        "category": "disconnect_after_failed_auth",
        "method": None,
        "regex": re.compile(
            r"sshd(?:\[\d+\])?: Disconnected from (?:authenticating user \S+ |invalid user \S+ )?((?:\d{1,3}\.){3}\d{1,3}).*(?:preauth|Too many authentication failures)"
        ),
    },
    {
        "event_type": "too_many_auth_failures",
        "category": "failed_login",
        "method": None,
        "regex": re.compile(
            r"sshd(?:\[\d+\])?: .*(?:Too many authentication failures|maximum authentication attempts exceeded).*"
        ),
    },
    {
        "event_type": "sudo_command",
        "category": "sudo_usage",
        "method": None,
        "regex": re.compile(r"sudo(?:\[\d+\])?:\s+(\S+)\s+:\s+TTY=.*COMMAND=(.+)$"),
    },
    {
        "event_type": "sudo_auth_failure",
        "category": "sudo_failure",
        "method": None,
        "regex": re.compile(r"sudo(?:\[\d+\])?: pam_unix\(sudo:auth\): authentication failure;.*"),
    },
    {
        "event_type": "su_session_opened",
        "category": "su_event",
        "method": None,
        "regex": re.compile(r"su(?:\[\d+\])?: pam_unix\(su(?:-l)?:session\): session opened for user (\S+)"),
    },
    {
        "event_type": "su_auth_failure",
        "category": "su_event",
        "method": None,
        "regex": re.compile(r"su(?:\[\d+\])?: pam_unix\(su(?:-l)?:auth\): authentication failure;.*"),
    },
    {
        "event_type": "pam_unix_auth_failure",
        "category": "generic_auth_failure",
        "method": None,
        "regex": re.compile(r"pam_unix\([^)]*:auth\): authentication failure;.*"),
    },
    {
        "event_type": "user_unknown",
        "category": "generic_auth_failure",
        "method": None,
        "regex": re.compile(r"user (?:unknown|not known to the underlying authentication module)"),
    },
    {
        "event_type": "account_locked",
        "category": "generic_auth_failure",
        "method": None,
        "regex": re.compile(r"(?:account locked|authentication failure;.*account locked)", re.IGNORECASE),
    },
]

FAILED_CATEGORIES = {"failed_login", "generic_auth_failure"}
SAMPLE_CATEGORIES = [
    "failed_login",
    "invalid_user",
    "root_login_attempt",
    "sudo_failure",
    "suspicious_source_ip",
]
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Analyze local Linux authentication logs for suspicious patterns."
    )
    parser.add_argument("--file", required=True, help="Local auth.log or secure file to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of top IPs, usernames, and event types to display. Default: 10.",
    )
    parser.add_argument(
        "--threshold-failed",
        type=positive_int,
        default=5,
        help="Failed attempt threshold for suspicious IPs and usernames. Default: 5.",
    )
    parser.add_argument(
        "--ignore-users",
        default="",
        help="Comma-separated usernames excluded from suspicious username thresholds.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding category. Default: 3.",
    )
    return parser


def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def parse_ignore_users(value: str) -> list[str]:
    if not value.strip():
        return []
    users = []
    for item in value.split(","):
        user = item.strip()
        if user:
            users.append(user)
    return sorted(set(users))
def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()


def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        try:
            return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S"), raw
        except ValueError:
            return None, UNKNOWN
    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        normalized = f"{syslog_year} {raw}"
        try:
            parsed = datetime.strptime(normalized, "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw
    return None, UNKNOWN


def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def extract_service(line: str) -> str:
    match = SERVICE_RE.search(line)
    if match:
        return match.group(1)
    return UNKNOWN


def extract_ip(line: str) -> str:
    match = IP_RE.search(line)
    if match:
        return match.group(0)
    return UNKNOWN


def extract_user_from_key_values(line: str) -> str:
    for pattern in (
        r"\buser=([A-Za-z0-9_.@-]+)",
        r"\bruser=([A-Za-z0-9_.@-]+)",
        r"\bUSER=([A-Za-z0-9_.@-]+)",
    ):
        match = re.search(pattern, line)
        if match and match.group(1):
            return match.group(1)
    return UNKNOWN
def event_from_match(line: str, pattern: dict[str, Any], match: re.Match[str]) -> dict[str, Any]:
    event_type = pattern["event_type"]
    username = UNKNOWN
    source_ip = extract_ip(line)
    method = pattern["method"] or UNKNOWN
    if event_type in ("failed_ssh_password", "failed_ssh_publickey"):
        username = match.group(2)
        source_ip = match.group(3)
    elif event_type == "successful_ssh_login":
        method = match.group(1)
        username = match.group(2)
        source_ip = match.group(3)
    elif event_type in ("invalid_user_attempt", "refused_user_attempt"):
        username = match.group(1)
        source_ip = match.group(2)
    elif event_type == "sudo_command":
        username = match.group(1)
    elif event_type == "su_session_opened":
        # Newer PAM versions log "root(uid=0)"; keep only the name before "(".
        username = match.group(1).split("(", 1)[0]
    elif event_type in ("sudo_auth_failure", "su_auth_failure", "pam_unix_auth_failure"):
        username = extract_user_from_key_values(line)
    if username == "root" and event_type in (
        "failed_ssh_password",
        "failed_ssh_publickey",
        "successful_ssh_login",
        "invalid_user_attempt",
        "refused_user_attempt",
    ):
        event_type = "root_login_attempt"
    return {
        "event_type": event_type,
        "category": pattern["category"],
        "username": username or UNKNOWN,
        "source_ip": source_ip or UNKNOWN,
        "method": method,
        "service": extract_service(line),
        "raw": line,
    }


def detect_events(line: str) -> list[dict[str, Any]]:
    events = []
    for pattern in EVENT_PATTERNS:
        match = pattern["regex"].search(line)
        if match:
            events.append(event_from_match(line, pattern, match))
    # A sudo or su failure also matches the generic pam_unix pattern; keep the specific event only.
    if any(event["event_type"] in ("sudo_auth_failure", "su_auth_failure") for event in events):
        events = [
            event for event in events if event["event_type"] != "pam_unix_auth_failure"
        ]
    if "authentication failure" in line and not events:
        events.append(
            {
                "event_type": "authentication_failure",
                "category": "generic_auth_failure",
                "username": extract_user_from_key_values(line),
                "source_ip": extract_ip(line),
                "method": UNKNOWN,
                "service": extract_service(line),
                "raw": line,
            }
        )
    return dedupe_events(events)


def dedupe_events(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    deduped = []
    seen = set()
    for event in events:
        key = (event["event_type"], event["username"], event["source_ip"], event["raw"])
        if key in seen:
            continue
        seen.add(key)
        deduped.append(event)
    return deduped
def append_sample(samples: dict[str, list[str]], category: str, line: str, max_samples: int) -> None:
    if max_samples == 0:
        return
    if len(samples[category]) < max_samples:
        samples[category].append(line)


def update_seen(
    first_seen: tuple[datetime, str] | None,
    last_seen: tuple[datetime, str] | None,
    parsed_at: datetime | None,
    rendered_at: str,
) -> tuple[tuple[datetime, str] | None, tuple[datetime, str] | None]:
    if parsed_at is None:
        return first_seen, last_seen
    if first_seen is None or parsed_at < first_seen[0]:
        first_seen = (parsed_at, rendered_at)
    if last_seen is None or parsed_at > last_seen[0]:
        last_seen = (parsed_at, rendered_at)
    return first_seen, last_seen


def analyze_log(
    lines: list[str],
    threshold_failed: int,
    ignore_users: list[str],
    top: int,
    max_samples: int,
) -> dict[str, Any]:
    syslog_year = datetime.now().year
    events = []
    samples: dict[str, list[str]] = defaultdict(list)
    event_type_counts: Counter[str] = Counter()
    failed_by_ip: Counter[str] = Counter()
    failed_by_user: Counter[str] = Counter()
    success_by_ip: Counter[str] = Counter()
    success_by_user: Counter[str] = Counter()
    first_seen: tuple[datetime, str] | None = None
    last_seen: tuple[datetime, str] | None = None
    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        line_events = detect_events(line)
        if not line_events:
            continue
        first_seen, last_seen = update_seen(first_seen, last_seen, parsed_at, rendered_at)
        for event in line_events:
            event["timestamp"] = rendered_at
            events.append(event)
            event_type_counts[event["event_type"]] += 1
            category = event["category"]
            username = event["username"]
            source_ip = event["source_ip"]
            if event["event_type"] == "root_login_attempt":
                append_sample(samples, "root_login_attempt", line, max_samples)
                # A successful root login is still flagged as a root attempt,
                # but it must not be counted as a failed login.
                if category != "successful_login":
                    category = "failed_login"
            if category in FAILED_CATEGORIES:
                if source_ip != UNKNOWN:
                    failed_by_ip[source_ip] += 1
                if username != UNKNOWN:
                    failed_by_user[username] += 1
                append_sample(samples, "failed_login", line, max_samples)
            if category == "successful_login":
                if source_ip != UNKNOWN:
                    success_by_ip[source_ip] += 1
                if username != UNKNOWN:
                    success_by_user[username] += 1
            if category == "invalid_user":
                append_sample(samples, "invalid_user", line, max_samples)
            if category == "sudo_failure":
                append_sample(samples, "sudo_failure", line, max_samples)
    suspicious_ips = {
        ip: count for ip, count in failed_by_ip.items() if count >= threshold_failed
    }
    suspicious_users = {
        user: count
        for user, count in failed_by_user.items()
        if count >= threshold_failed and user not in ignore_users
    }
    for event in events:
        if event["source_ip"] in suspicious_ips:
            append_sample(samples, "suspicious_source_ip", event["raw"], max_samples)
    summary = build_summary(
        lines=lines,
        events=events,
        failed_by_ip=failed_by_ip,
        failed_by_user=failed_by_user,
        suspicious_ips=suspicious_ips,
        suspicious_users=suspicious_users,
        event_type_counts=event_type_counts,
        threshold_failed=threshold_failed,
        ignore_users=ignore_users,
        first_seen=first_seen,
        last_seen=last_seen,
    )
    return {
        "summary": summary,
        "top_source_ips_by_failed_attempts": top_items(failed_by_ip, top),
        "top_usernames_by_failed_attempts": top_items(failed_by_user, top),
        "top_source_ips_by_successful_logins": top_items(success_by_ip, top),
        "top_usernames_by_successful_logins": top_items(success_by_user, top),
        "top_event_types": top_items(event_type_counts, top),
        "suspicious_source_ips": sorted_count_items(suspicious_ips),
        "suspicious_usernames": sorted_count_items(suspicious_users),
        "samples": {category: samples.get(category, []) for category in SAMPLE_CATEGORIES},
    }
def build_summary(
    lines: list[str],
    events: list[dict[str, Any]],
    failed_by_ip: Counter[str],
    failed_by_user: Counter[str],
    suspicious_ips: dict[str, int],
    suspicious_users: dict[str, int],
    event_type_counts: Counter[str],
    threshold_failed: int,
    ignore_users: list[str],
    first_seen: tuple[datetime, str] | None,
    last_seen: tuple[datetime, str] | None,
) -> dict[str, Any]:
    root_attempts = event_type_counts["root_login_attempt"]
    sudo_failures = event_type_counts["sudo_auth_failure"]
    invalid_users = event_type_counts["invalid_user_attempt"]
    # High volume means at least twice the configured threshold.
    high_volume_ips = sum(1 for count in suspicious_ips.values() if count >= threshold_failed * 2)
    high_volume_users = sum(1 for count in suspicious_users.values() if count >= threshold_failed * 2)
    if (
        root_attempts >= threshold_failed
        or high_volume_ips > 0
        or high_volume_users > 0
        or len(suspicious_ips) >= 2
    ):
        status = "CRITICAL"
    elif suspicious_ips or suspicious_users or invalid_users > 0 or sudo_failures > 0 or root_attempts > 0:
        status = "WARNING"
    else:
        status = "OK"
    return {
        "overall_status": status,
        "first_seen": render_seen(first_seen),
        "last_seen": render_seen(last_seen),
        "total_lines_scanned": len(lines),
        "authentication_events_detected": len(events),
        # Only failed events with a known source IP are counted here.
        "failed_login_count": sum(failed_by_ip.values()),
        # Count by category so that accepted root logins, which are relabelled
        # as root_login_attempt, are still included.
        "successful_login_count": sum(
            1 for event in events if event["category"] == "successful_login"
        ),
        "invalid_user_count": invalid_users,
        "root_login_attempt_count": root_attempts,
        "sudo_command_count": event_type_counts["sudo_command"],
        "sudo_failure_count": sudo_failures,
        "su_event_count": event_type_counts["su_session_opened"] + event_type_counts["su_auth_failure"],
        "suspicious_source_ip_count": len(suspicious_ips),
        "suspicious_username_count": len(suspicious_users),
        "threshold_failed": threshold_failed,
        "ignored_users": ignore_users,
    }


def top_items(counter: Counter[str], limit: int) -> list[dict[str, Any]]:
    return [{"value": value, "count": count} for value, count in counter.most_common(limit)]


def sorted_count_items(items: dict[str, int]) -> list[dict[str, Any]]:
    return [
        {"value": value, "count": count}
        for value, count in sorted(items.items(), key=lambda item: (-item[1], item[0]))
    ]
def render_text(report: dict[str, Any]) -> str:
    summary = report["summary"]
    lines = [
        "Auth Log Audit",
        "==============",
        "",
        f"Overall status: {summary['overall_status']}",
        f"First seen: {summary['first_seen']}",
        f"Last seen: {summary['last_seen']}",
        "",
    ]
    lines.extend(render_text_table("Top Source IPs by Failed Attempts", report["top_source_ips_by_failed_attempts"]))
    lines.extend(render_text_table("Top Usernames by Failed Attempts", report["top_usernames_by_failed_attempts"]))
    lines.extend(render_text_table("Top Source IPs by Successful Logins", report["top_source_ips_by_successful_logins"]))
    lines.extend(render_text_table("Top Usernames by Successful Logins", report["top_usernames_by_successful_logins"]))
    lines.extend(render_text_table("Suspicious Source IPs", report["suspicious_source_ips"]))
    lines.extend(render_text_table("Suspicious Usernames", report["suspicious_usernames"]))
    lines.extend(render_text_table("Top Event Types", report["top_event_types"]))
    lines.extend(render_text_samples(report["samples"]))
    lines.extend(render_text_summary(summary))
    return "\n".join(lines) + "\n"


def render_text_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [title, "-" * len(title)]
    if not rows:
        lines.append("No entries detected.")
    else:
        for item in rows:
            lines.append(f"- {item['value']}: {item['count']}")
    lines.append("")
    return lines


def render_text_samples(samples: dict[str, list[str]]) -> list[str]:
    lines = ["Sample Log Lines", "----------------"]
    for category in SAMPLE_CATEGORIES:
        lines.append(f"{category}:")
        if samples.get(category):
            lines.extend(f" - {sample}" for sample in samples[category])
        else:
            lines.append(" - No samples retained")
    lines.append("")
    return lines


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    ignored = ", ".join(summary["ignored_users"]) if summary["ignored_users"] else "None"
    return [
        "Operational Summary",
        "-------------------",
        f"Overall status: {summary['overall_status']}",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Authentication events detected: {summary['authentication_events_detected']}",
        f"Failed logins: {summary['failed_login_count']}",
        f"Successful logins: {summary['successful_login_count']}",
        f"Invalid user attempts: {summary['invalid_user_count']}",
        f"Root login attempts: {summary['root_login_attempt_count']}",
        f"Sudo usage events: {summary['sudo_command_count']}",
        f"Sudo authentication failures: {summary['sudo_failure_count']}",
        f"su events: {summary['su_event_count']}",
        f"Suspicious source IPs: {summary['suspicious_source_ip_count']}",
        f"Suspicious usernames: {summary['suspicious_username_count']}",
        f"Threshold used: {summary['threshold_failed']}",
        f"Ignored users: {ignored}",
    ]
def render_markdown(report: dict[str, Any]) -> str:
    summary = report["summary"]
    lines = [
        "# Auth Log Audit",
        "",
        f"- Overall status: {summary['overall_status']}",
        f"- First seen: {summary['first_seen']}",
        f"- Last seen: {summary['last_seen']}",
        "",
    ]
    lines.extend(render_markdown_table("Top Source IPs by Failed Attempts", report["top_source_ips_by_failed_attempts"]))
    lines.extend(render_markdown_table("Top Usernames by Failed Attempts", report["top_usernames_by_failed_attempts"]))
    lines.extend(render_markdown_table("Top Source IPs by Successful Logins", report["top_source_ips_by_successful_logins"]))
    lines.extend(render_markdown_table("Top Usernames by Successful Logins", report["top_usernames_by_successful_logins"]))
    lines.extend(render_markdown_table("Suspicious Source IPs", report["suspicious_source_ips"]))
    lines.extend(render_markdown_table("Suspicious Usernames", report["suspicious_usernames"]))
    lines.extend(render_markdown_table("Top Event Types", report["top_event_types"]))
    lines.extend(render_markdown_samples(report["samples"]))
    ignored = ", ".join(summary["ignored_users"]) if summary["ignored_users"] else "None"
    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Overall status: {summary['overall_status']}",
            f"- Total lines scanned: {summary['total_lines_scanned']}",
            f"- Authentication events detected: {summary['authentication_events_detected']}",
            f"- Failed logins: {summary['failed_login_count']}",
            f"- Successful logins: {summary['successful_login_count']}",
            f"- Invalid user attempts: {summary['invalid_user_count']}",
            f"- Root login attempts: {summary['root_login_attempt_count']}",
            f"- Sudo usage events: {summary['sudo_command_count']}",
            f"- Sudo authentication failures: {summary['sudo_failure_count']}",
            f"- su events: {summary['su_event_count']}",
            f"- Suspicious source IPs: {summary['suspicious_source_ip_count']}",
            f"- Suspicious usernames: {summary['suspicious_username_count']}",
            f"- Threshold used: {summary['threshold_failed']}",
            f"- Ignored users: {ignored}",
            "",
        ]
    )
    return "\n".join(lines)


def render_markdown_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [f"## {title}", ""]
    if not rows:
        lines.extend(["No entries detected.", ""])
        return lines
    lines.extend(["| Value | Count |", "| --- | ---: |"])
    lines.extend(f"| {item['value']} | {item['count']} |" for item in rows)
    lines.append("")
    return lines


def render_markdown_samples(samples: dict[str, list[str]]) -> list[str]:
    lines = ["## Sample Log Lines", ""]
    for category in SAMPLE_CATEGORIES:
        lines.extend([f"### {category}", ""])
        if samples.get(category):
            lines.append("```text")
            lines.extend(samples[category])
            lines.append("```")
        else:
            lines.append("_No samples retained._")
        lines.append("")
    return lines


def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"
def write_report(input_path: Path, output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return
    path = Path(output_path)
    try:
        # Refuse to overwrite the log that is being analyzed.
        if path.resolve() == input_path.resolve():
            raise OSError("output path must not be the same as input file")
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    input_path = Path(args.file)
    ignore_users = parse_ignore_users(args.ignore_users)
    try:
        lines = read_log_file(input_path)
        report = analyze_log(
            lines=lines,
            threshold_failed=args.threshold_failed,
            ignore_users=ignore_users,
            top=args.top,
            max_samples=args.max_samples,
        )
        if args.format == "text":
            content = render_text(report)
        elif args.format == "markdown":
            content = render_markdown(report)
        else:
            content = render_json(report)
        write_report(input_path, args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID
    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,112 @@
# Auth Log Audit
- Overall status: WARNING
- First seen: May 11 09:58:12
- Last seen: May 11 10:07:48
## Top Source IPs by Failed Attempts
| Value | Count |
| --- | ---: |
| 203.0.113.50 | 7 |
| 198.51.100.23 | 1 |
## Top Usernames by Failed Attempts
| Value | Count |
| --- | ---: |
| appuser | 3 |
| root | 2 |
| admin | 1 |
| backup | 1 |
## Top Source IPs by Successful Logins
| Value | Count |
| --- | ---: |
| 10.20.30.15 | 1 |
## Top Usernames by Successful Logins
| Value | Count |
| --- | ---: |
| deploy | 1 |
## Suspicious Source IPs
| Value | Count |
| --- | ---: |
| 203.0.113.50 | 7 |
## Suspicious Usernames
No entries detected.
## Top Event Types
| Value | Count |
| --- | ---: |
| failed_ssh_password | 4 |
| root_login_attempt | 2 |
| successful_ssh_login | 1 |
| sudo_command | 1 |
| invalid_user_attempt | 1 |
| disconnect_after_failed_auth | 1 |
| failed_ssh_publickey | 1 |
| sudo_auth_failure | 1 |
| su_session_opened | 1 |
| refused_user_attempt | 1 |
## Sample Log Lines
### failed_login
```text
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```
### invalid_user
```text
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
```
### root_login_attempt
```text
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```
### sudo_failure
```text
May 11 10:04:20 web01 sudo: pam_unix(sudo:auth): authentication failure; logname=deploy uid=1001 euid=0 tty=/dev/pts/0 ruser=deploy rhost= user=deploy
```
### suspicious_source_ip
```text
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```
## Operational Summary
- Overall status: WARNING
- Total lines scanned: 15
- Authentication events detected: 15
- Failed logins: 8
- Successful logins: 1
- Invalid user attempts: 1
- Root login attempts: 2
- Sudo usage events: 1
- Sudo authentication failures: 1
- su events: 1
- Suspicious source IPs: 1
- Suspicious usernames: 0
- Threshold used: 5
- Ignored users: None
@@ -0,0 +1,15 @@
May 11 09:58:12 web01 sshd[1201]: Accepted publickey for deploy from 10.20.30.15 port 52214 ssh2: ED25519 SHA256:samplekey
May 11 10:00:01 web01 sudo: deploy : TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/usr/bin/systemctl status nginx
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:11 web01 sshd[1224]: Disconnected from authenticating user root 203.0.113.50 port 45012 [preauth]
May 11 10:03:10 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:14 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:18 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:41 web01 sshd[1238]: Failed publickey for backup from 198.51.100.23 port 50222 ssh2
May 11 10:04:20 web01 sudo: pam_unix(sudo:auth): authentication failure; logname=deploy uid=1001 euid=0 tty=/dev/pts/0 ruser=deploy rhost= user=deploy
May 11 10:05:02 web01 su[1244]: pam_unix(su:session): session opened for user root by deploy(uid=1001)
May 11 10:06:31 web01 sshd[1250]: User testuser from 192.0.2.77 not allowed because not listed in AllowUsers
May 11 10:07:48 web01 sshd[1254]: error: maximum authentication attempts exceeded for invalid user oracle from 203.0.113.50 port 45200 ssh2 [preauth]
@@ -0,0 +1,14 @@
May 11 09:52:44 db01 sshd[2110]: Accepted publickey for admin from 10.40.10.25 port 60124 ssh2: RSA SHA256:samplekey
May 11 09:55:10 db01 sudo[2120]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/systemctl restart auditd
May 11 09:55:10 db01 sudo[2120]: pam_unix(sudo:session): session opened for user root(uid=0) by admin(uid=1000)
May 11 10:00:01 db01 sshd[2130]: Failed password for invalid user postgres from 198.51.100.90 port 42101 ssh2
May 11 10:00:03 db01 sshd[2130]: Invalid user postgres from 198.51.100.90 port 42101
May 11 10:00:09 db01 sshd[2132]: Failed password for root from 198.51.100.90 port 42105 ssh2
May 11 10:00:13 db01 sshd[2132]: Failed password for root from 198.51.100.90 port 42105 ssh2
May 11 10:00:20 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:00:25 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:00:31 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:01:12 db01 su[2142]: pam_unix(su:auth): authentication failure; logname=admin uid=1000 euid=0 tty=pts/1 ruser=admin rhost= user=root
May 11 10:01:45 db01 sshd[2149]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.77 user=monitoring
May 11 10:02:03 db01 sshd[2154]: error: PAM: User not known to the underlying authentication module for illegal user deploy from 203.0.113.77
May 11 10:02:36 db01 sshd[2159]: Disconnecting authenticating user oracle 198.51.100.90 port 42111: Too many authentication failures [preauth]
@@ -0,0 +1,159 @@
# incident-log-summary
`incident-log-summary` is a read-only Python CLI for quick incident log review. It scans a local Linux system log or application log and groups configured operational patterns by severity, count, timestamps, and sample lines.
The tool is meant for first-pass triage and incident notes. It does not replace full log search, alert correlation, service-specific runbooks, or review by an operator who understands the affected platform.
## When To Use
- During incident response when a collected log file needs a fast pattern summary.
- Before attaching evidence to an incident, problem, or change ticket.
- When comparing whether a log contains obvious storage, memory, service, TLS, HTTP, or connectivity failures.
- When JSON output is useful for later local automation.
## What It Does Not Do
- It does not read remote systems.
- It does not modify logs or system state.
- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
- It does not prove root cause.
- It does not classify every possible vendor or application error.
- It does not treat sanitized examples as production validation.
## Supported Input
- One local text log file provided with `--file`.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
## Supported Patterns
Critical patterns:
- `CRITICAL`
- `FATAL`
- `panic`
- `kernel panic`
- `no space left on device`
- `out of memory`
- `killed process`
- `read-only file system`
- `segmentation fault`
- `segfault`
- `certificate expired`
- `TLS handshake failed`
- `SSLHandshakeException`
- `database unavailable`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`
Warning patterns:
- `ERROR`
- `failed`
- `failure`
- `timeout`
- `connection refused`
- `connection reset`
- `permission denied`
- `authentication failed`
- `denied`
- `unavailable`
- `service restart`
- `retrying`
By default, matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across all configured patterns.
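The effect of `--ignore-case` can be illustrated with a small matcher. This is a sketch only: the implementation of `incident_log_summary.py` is not shown here, so the function name `count_matches` and the choice of literal substring matching through `re.escape` are assumptions.

```python
from __future__ import annotations

import re


def count_matches(lines: list[str], patterns: list[str], ignore_case: bool = False) -> dict[str, int]:
    """Count how many lines contain each literal pattern (assumed matching model)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape treats every pattern as plain text rather than as a regular expression.
    compiled = {pattern: re.compile(re.escape(pattern), flags) for pattern in patterns}
    counts = {pattern: 0 for pattern in patterns}
    for line in lines:
        for pattern, regex in compiled.items():
            if regex.search(line):
                counts[pattern] += 1
    return counts


log_lines = [
    "kernel: Out of memory: Killed process 1234 (java)",
    "app: ERROR connection refused by upstream",
]
print(count_matches(log_lines, ["out of memory", "ERROR"]))
print(count_matches(log_lines, ["out of memory", "ERROR"], ignore_case=True))
```

In the case-sensitive run the kernel line is missed because the log writes `Out of memory` with a capital letter, which is the kind of gap `--ignore-case` is meant to close.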
## Timestamp Handling
The scanner attempts to parse:
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `May 11 10:15:30`
Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and date filtering keeps those lines by default so potentially important findings are not silently discarded.
Syslog-style timestamps do not include a year. For filtering, the tool uses the year from `--since` when present, otherwise the current local year.
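The timestamp rules above can be sketched as follows. The regular expressions are borrowed from `auth_log_audit.py`; the function names, the inclusive bounds, and the exact filtering behaviour are assumptions, since the source of `incident_log_summary.py` is not shown here.

```python
from __future__ import annotations

import re
from datetime import datetime

ISO_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def parse_timestamp(line: str, since: datetime | None = None) -> datetime | None:
    """Best-effort parse; syslog lines borrow the year from `since` when it is given."""
    iso = ISO_RE.search(line)
    if iso:
        try:
            return datetime.strptime(f"{iso.group(1)} {iso.group(2)}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            return None
    syslog = SYSLOG_RE.search(line)
    if syslog:
        year = since.year if since else datetime.now().year
        try:
            return datetime.strptime(f"{year} {syslog.group(1)}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None
    return None


def keep_line(line: str, since: datetime | None = None, until: datetime | None = None) -> bool:
    """Apply the date filter; lines without a parseable timestamp are kept."""
    parsed = parse_timestamp(line, since)
    if parsed is None:
        return True
    if since is not None and parsed < since:
        return False
    if until is not None and parsed > until:
        return False
    return True


window_start = datetime(2025, 5, 11, 10, 0, 0)
print(parse_timestamp("May 11 10:15:30 ops-node-01 kernel: message", window_start))
print(keep_line("stack trace continuation without timestamp", window_start))
```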
## Usage
```bash
cd infra-run/scripts/python/incident-log-summary
python3 incident_log_summary.py --file examples/system-messages.log
python3 incident_log_summary.py --file examples/app-error.log --format markdown --output incident-report.md
python3 incident_log_summary.py --file examples/app-error.log --format json
python3 incident_log_summary.py --file examples/app-error.log --top 20
python3 incident_log_summary.py --file examples/app-error.log --ignore-case
python3 incident_log_summary.py --file examples/app-error.log --since "2026-05-11 10:00:00"
python3 incident_log_summary.py --file examples/app-error.log --until "2026-05-11 12:00:00"
```
`--top <n>` limits the number of finding groups after sorting by severity and occurrence count. `--max-samples <n>` sets how many sample lines are kept per finding group (default: `3`; `0` keeps none).
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident or change ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a file. Without `--output`, the report is printed to stdout.
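As a hedged example of local automation, the sketch below loads a report shaped like the tool's JSON output and lists the critical patterns. The key names follow the script's report structure; the embedded values and the `critical_patterns` helper are illustrative only.

```python
import json

# Illustrative report shaped like the tool's JSON output (values are made up).
REPORT_TEXT = """
{
  "findings": [
    {"pattern": "HTTP 503", "severity": "CRITICAL", "occurrences": 1,
     "first_seen": "2026-05-11 10:13:58", "last_seen": "2026-05-11 10:13:58",
     "samples": []},
    {"pattern": "timeout", "severity": "WARNING", "occurrences": 1,
     "first_seen": "2026-05-11 10:05:44", "last_seen": "2026-05-11 10:05:44",
     "samples": []}
  ],
  "summary": {"critical_finding_groups": 1, "overall_status": "CRITICAL",
              "total_findings": 2, "total_lines_scanned": 8,
              "warning_finding_groups": 1},
  "total_lines_scanned": 8
}
"""


def critical_patterns(report: dict) -> list:
    """Return the pattern names of all CRITICAL finding groups."""
    return [item["pattern"] for item in report["findings"] if item["severity"] == "CRITICAL"]


report = json.loads(REPORT_TEXT)
print(report["summary"]["overall_status"])
print(critical_patterns(report))
```

In practice the text would come from a file written with `--format json --output <path>` rather than an embedded string.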
## Exit Codes
- `0` - OK, no findings.
- `1` - Operational findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
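A wrapper can branch on these exit codes. The sketch below is an assumption-laden illustration: `classify_exit` is a hypothetical helper, and a stand-in child process that exits with `1` is used so the example runs without the real script.

```python
import subprocess
import sys

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {0: "ok", 1: "findings", 2: "invalid"}


def classify_exit(returncode: int) -> str:
    """Map a process return code to a short label."""
    return EXIT_MEANINGS.get(returncode, "unexpected")


# Stand-in for: python3 incident_log_summary.py --file <log>
result = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])
print(classify_exit(result.returncode))
```

Treating any code outside `0`, `1`, and `2` as unexpected keeps the wrapper from silently reading a crash as a clean result.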
## Example Text Output
```text
Incident Log Summary
====================
[CRITICAL] no space left on device
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Samples:
- May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages
Operational Summary
-------------------
Total lines scanned: 7
Total findings: 7
Critical finding groups: 3
Warning finding groups: 4
Overall status: CRITICAL
```
## Markdown Workflow
Generate a Markdown report from the collected log and attach it to the incident or change ticket as supporting evidence:
```bash
python3 incident_log_summary.py \
--file examples/app-error.log \
--format markdown \
--output incident-report.md
```
Review the report before attaching it. The output is evidence for triage; it is not a final root cause statement.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single line can match multiple patterns, such as `ERROR`, `HTTP 503`, and `unavailable`.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
- Syslog timestamps without a year are normalized with an inferred year.
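- Date filters are best-effort because lines without parseable timestamps are retained.
- `Total lines scanned` counts every line in the file, including lines excluded by `--since` or `--until`.
- When `--top` is used, the Operational Summary counts only the finding groups that remain after the limit is applied.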
- Large log files are read into memory; collect a scoped file or time-windowed extract for very large incidents.
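The retention rule behind the best-effort date filter can be sketched as follows. This is a minimal illustration that assumes each line's timestamp has already been parsed to a `datetime` or to `None`; the `in_window` name is illustrative.

```python
from datetime import datetime
from typing import Optional


def in_window(
    parsed_at: Optional[datetime],
    since: Optional[datetime],
    until: Optional[datetime],
) -> bool:
    """Keep lines inside the window, and always keep lines without a timestamp."""
    if parsed_at is None:
        return True  # unparseable timestamp: retained so findings are not dropped
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True


since = datetime(2026, 5, 11, 10, 0, 0)
until = datetime(2026, 5, 11, 12, 0, 0)
print(in_window(datetime(2026, 5, 11, 9, 48, 12), since, until))  # before the window
print(in_window(None, since, until))                              # no timestamp, kept
```

Because both bounds are inclusive and untimed lines pass through, a filtered report can still contain findings from outside the requested window.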
## Safety Notes
- The tool only reads the input log and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
@@ -0,0 +1,8 @@
2026-05-11 09:48:12 app01 api[4150]: INFO request_id=7f3a status=200 path=/health
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
2026-05-11 12:10:01 app01 api[4150]: INFO request_id=9001 status=200 path=/health
@@ -0,0 +1,144 @@
# Incident Log Summary
## CRITICAL: certificate expired
- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33
Sample log lines:
```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```
## CRITICAL: CRITICAL
- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33
Sample log lines:
```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```
## CRITICAL: database unavailable
- Occurrences: 1
- First seen: 2026-05-11 10:03:19
- Last seen: 2026-05-11 10:03:19
Sample log lines:
```text
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
```
## CRITICAL: HTTP 500
- Occurrences: 1
- First seen: 2026-05-11 10:01:03
- Last seen: 2026-05-11 10:01:03
Sample log lines:
```text
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
```
## CRITICAL: HTTP 503
- Occurrences: 1
- First seen: 2026-05-11 10:13:58
- Last seen: 2026-05-11 10:13:58
Sample log lines:
```text
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
```
## CRITICAL: TLS handshake failed
- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33
Sample log lines:
```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```
## WARNING: ERROR
- Occurrences: 4
- First seen: 2026-05-11 10:01:03
- Last seen: 2026-05-11 10:13:58
Sample log lines:
```text
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
```
## WARNING: unavailable
- Occurrences: 2
- First seen: 2026-05-11 10:03:19
- Last seen: 2026-05-11 10:13:58
Sample log lines:
```text
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
```
## WARNING: connection refused
- Occurrences: 1
- First seen: 2026-05-11 10:07:02
- Last seen: 2026-05-11 10:07:02
Sample log lines:
```text
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
```
## WARNING: failed
- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33
Sample log lines:
```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```
## WARNING: timeout
- Occurrences: 1
- First seen: 2026-05-11 10:05:44
- Last seen: 2026-05-11 10:05:44
Sample log lines:
```text
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
```
## Operational Summary
- Total lines scanned: 8
- Total findings: 15
- Critical finding groups: 6
- Warning finding groups: 5
- Overall status: CRITICAL
@@ -0,0 +1,7 @@
May 11 09:57:01 ops-node-01 systemd[1]: Started Session 443 of user svc_backup.
May 11 10:02:14 ops-node-01 systemd[1]: failed to start nightly-report.service: Unit entered failed state.
May 11 10:04:22 ops-node-01 sudo[18442]: svc_backup : command not allowed ; permission denied
May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages
May 11 10:21:45 ops-node-01 kernel: out of memory: killed process 2517 (java) total-vm:2048000kB
May 11 10:22:03 ops-node-01 systemd[1]: service restart scheduled for app-worker.service
May 11 10:30:31 ops-node-01 sshd[19210]: Accepted publickey for admin from 192.0.2.15 port 52210 ssh2
@@ -0,0 +1,448 @@
#!/usr/bin/env python3
"""Summarize incident-oriented patterns in local log files."""

from __future__ import annotations

import argparse
import json
import re
import sys
from datetime import datetime
from pathlib import Path
from typing import Any

EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}

CRITICAL_PATTERNS = [
    "CRITICAL",
    "FATAL",
    "panic",
    "kernel panic",
    "no space left on device",
    "out of memory",
    "killed process",
    "read-only file system",
    "segmentation fault",
    "segfault",
    "certificate expired",
    "TLS handshake failed",
    "SSLHandshakeException",
    "database unavailable",
    "HTTP 500",
    "HTTP 502",
    "HTTP 503",
    "HTTP 504",
]

WARNING_PATTERNS = [
    "ERROR",
    "failed",
    "failure",
    "timeout",
    "connection refused",
    "connection reset",
    "permission denied",
    "authentication failed",
    "denied",
    "unavailable",
    "service restart",
    "retrying",
]

ISO_TIMESTAMP_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Summarize suspicious and critical patterns in a local log file."
    )
    parser.add_argument("--file", required=True, help="Local log file to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        help="Limit finding groups after severity and count sorting.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match all configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--since",
        type=parse_filter_timestamp,
        help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--until",
        type=parse_filter_timestamp,
        help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding group. Default: 3.",
    )
    return parser
def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def parse_filter_timestamp(value: str) -> datetime:
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise argparse.ArgumentTypeError(
        'expected timestamp format "YYYY-MM-DD HH:MM:SS"'
    )
def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    flags = re.IGNORECASE if ignore_case else 0
    pattern_defs: list[dict[str, str]] = []
    pattern_defs.extend(
        {"pattern": pattern, "severity": "CRITICAL"} for pattern in CRITICAL_PATTERNS
    )
    pattern_defs.extend(
        {"pattern": pattern, "severity": "WARNING"} for pattern in WARNING_PATTERNS
    )

    compiled = []
    for item in pattern_defs:
        compiled.append(
            {
                "pattern": item["pattern"],
                "severity": item["severity"],
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    return compiled
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str | None]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        try:
            return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S"), raw
        except ValueError:
            return None, None

    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        normalized = f"{syslog_year} {raw}"
        try:
            parsed = datetime.strptime(normalized, "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, None
        return parsed, parsed.strftime("%Y-%m-%d %H:%M:%S")

    return None, None


def line_in_time_window(
    parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True
def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()
def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    since: datetime | None,
    until: datetime | None,
    max_samples: int,
) -> dict[str, Any]:
    syslog_year = since.year if since is not None else datetime.now().year
    groups: dict[str, dict[str, Any]] = {}

    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if not line_in_time_window(parsed_at, since, until):
            continue

        for item in patterns:
            if not item["regex"].search(line):
                continue

            key = f"{item['severity']}::{item['pattern']}"
            group = groups.setdefault(
                key,
                {
                    "pattern": item["pattern"],
                    "severity": item["severity"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1
            if parsed_at is not None:
                if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
                    group["first_seen"] = (parsed_at, rendered_at)
                if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
                    group["last_seen"] = (parsed_at, rendered_at)
            if len(group["samples"]) < max_samples:
                group["samples"].append(line)

    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["pattern"].lower(),
        ),
    )

    rendered_findings = []
    for group in findings:
        rendered_findings.append(
            {
                "pattern": group["pattern"],
                "severity": group["severity"],
                "occurrences": group["occurrences"],
                "first_seen": render_seen(group["first_seen"]),
                "last_seen": render_seen(group["last_seen"]),
                "samples": group["samples"],
            }
        )

    return {
        "total_lines_scanned": len(lines),
        "findings": rendered_findings,
    }
def render_seen(value: tuple[datetime, str | None] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def apply_top_limit(report: dict[str, Any], top: int | None) -> dict[str, Any]:
    if top is None:
        return report
    limited = dict(report)
    limited["findings"] = report["findings"][:top]
    return limited


def add_summary(report: dict[str, Any]) -> dict[str, Any]:
    findings = report["findings"]
    critical_groups = sum(1 for item in findings if item["severity"] == "CRITICAL")
    warning_groups = sum(1 for item in findings if item["severity"] == "WARNING")
    total_findings = sum(item["occurrences"] for item in findings)

    if critical_groups > 0:
        status = "CRITICAL"
    elif warning_groups > 0:
        status = "WARNING"
    else:
        status = "OK"

    enriched = dict(report)
    enriched["summary"] = {
        "total_lines_scanned": report["total_lines_scanned"],
        "total_findings": total_findings,
        "critical_finding_groups": critical_groups,
        "warning_finding_groups": warning_groups,
        "overall_status": status,
    }
    return enriched
def render_text(report: dict[str, Any]) -> str:
    lines = ["Incident Log Summary", "====================", ""]
    if not report["findings"]:
        lines.append("No configured incident patterns were detected.")
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['pattern']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f" - {sample}" for sample in finding["samples"])
            else:
                lines.append(" - No samples retained")
            lines.append("")
    lines.extend(render_text_summary(report["summary"]))
    return "\n".join(lines) + "\n"


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    return [
        "Operational Summary",
        "-------------------",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Total findings: {summary['total_findings']}",
        f"Critical finding groups: {summary['critical_finding_groups']}",
        f"Warning finding groups: {summary['warning_finding_groups']}",
        f"Overall status: {summary['overall_status']}",
    ]
def render_markdown(report: dict[str, Any]) -> str:
    lines = ["# Incident Log Summary", ""]
    if not report["findings"]:
        lines.extend(["No configured incident patterns were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"## {finding['severity']}: {finding['pattern']}",
                    "",
                    f"- Occurrences: {finding['occurrences']}",
                    f"- First seen: {finding['first_seen']}",
                    f"- Last seen: {finding['last_seen']}",
                    "",
                    "Sample log lines:",
                    "",
                ]
            )
            if finding["samples"]:
                lines.append("```text")
                lines.extend(finding["samples"])
                lines.append("```")
            else:
                lines.append("_No samples retained._")
            lines.append("")

    summary = report["summary"]
    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Total lines scanned: {summary['total_lines_scanned']}",
            f"- Total findings: {summary['total_findings']}",
            f"- Critical finding groups: {summary['critical_finding_groups']}",
            f"- Warning finding groups: {summary['warning_finding_groups']}",
            f"- Overall status: {summary['overall_status']}",
            "",
        ]
    )
    return "\n".join(lines)


def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"
def write_report(output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return
    path = Path(output_path)
    try:
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()

    if args.since is not None and args.until is not None and args.since > args.until:
        parser.error("--since must be earlier than or equal to --until")

    try:
        lines = read_log_file(Path(args.file))
        report = analyze_log(
            lines=lines,
            patterns=compile_patterns(args.ignore_case),
            since=args.since,
            until=args.until,
            max_samples=args.max_samples,
        )
        report = add_summary(apply_top_limit(report, args.top))
        if args.format == "text":
            content = render_text(report)
        elif args.format == "markdown":
            content = render_markdown(report)
        else:
            content = render_json(report)
        write_report(args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID

    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,215 @@
# journal-analyzer
`journal-analyzer` is a read-only Python CLI for reviewing exported `journalctl` text logs. It summarizes systemd, service, and system-level journal findings that require operator review during Linux incident response, post-patching validation, restart troubleshooting, and change evidence collection.
The tool analyzes exported journal text only. It does not call `journalctl` directly, does not modify host state, and does not claim root cause.
## Purpose
- Summarize which units failed and which services appear repeatedly affected.
- Surface dependency failures, restart loops, timeout patterns, OOM symptoms, disk/filesystem errors, TLS/certificate issues, authentication events, and network-related warnings.
- Produce predictable text, Markdown, or JSON output that can be attached to an incident or change ticket.
## When To Use
- After exporting a scoped `journalctl` window during incident response.
- After package patching or service restarts when failed units or degraded services need review.
- During Linux service troubleshooting when repeated restart or dependency messages need a quick grouped summary.
- Before attaching journal evidence to an incident, problem, or change record.
## What It Does Not Do
- It does not call `journalctl` directly in v1.
- It does not modify the input log, systemd state, service state, or host configuration.
- It does not read remote systems or live journal streams.
- It does not query SIEM, ELK, Zabbix, APM, or ticketing systems.
- It does not prove root cause or a service defect.
- It does not classify every vendor-specific journal message.
## Supported Input Type
- One exported local `journalctl` text file supplied with `--file`.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
Example export commands:
```bash
journalctl --since "1 hour ago" > journal.log
journalctl -u nginx --since today > nginx-journal.log
journalctl -p warning..alert --since "24 hours ago" > warnings.log
journalctl --no-pager --since "2026-05-11 10:00:00" > journal.log
```
## Supported Event Categories
Critical-oriented categories:
- Failed unit or failed start findings.
- Dependency failures.
- Kernel panic and panic findings.
- OOM killer and killed process findings.
- Disk and filesystem issues such as `no space left on device`, read-only filesystem, filesystem errors, and I/O errors.
- Service or application crash patterns such as `segfault`.
- TLS and certificate failures.
- Emergency mode findings.
Warning-oriented categories:
- Restart and repeated start request findings.
- Timeout and timed out findings.
- Connection refused and connection reset findings.
- Permission denied and denied findings.
- Authentication failure findings.
- Availability, degraded, failed, and warning findings that still require review.
Matching is pattern-based. Default matching already tolerates common case variants (for example, `Failed to start` is reported under the `failed to start` pattern); use `--ignore-case` to force case-insensitive matching. The tool is intended for first-pass operational review, not for proving causality.
## Timestamp Support
The analyzer attempts to parse common journal and syslog timestamp formats:
- `May 11 10:15:30`
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `2026-05-11 10:15:30.123456`
- `2026-05-11 10:15:30,123`
If a timestamp cannot be parsed:
- the line is still analyzed
- first seen / last seen remain `UNKNOWN` where needed
- time-window filters keep the line by default rather than silently discarding it
Syslog-style timestamps without a year use the current local year internally unless `--since` provides a year context.
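One possible way to parse the listed formats, including fractional seconds, is sketched below. The `journal_analyzer.py` source is not shown here, so the regular expressions and the `parse_timestamp` helper are assumptions for illustration, not the tool's actual implementation.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed patterns: ISO-like date and time with an optional ".ffffff" or ",fff" fraction.
ISO_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})(?:[.,](\d{1,6}))?")
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def parse_timestamp(line: str, default_year: int) -> Optional[datetime]:
    """Return a datetime for the first recognized timestamp, or None."""
    try:
        iso = ISO_RE.search(line)
        if iso:
            parsed = datetime.strptime(f"{iso.group(1)} {iso.group(2)}", "%Y-%m-%d %H:%M:%S")
            fraction = iso.group(3)
            if fraction:
                # Right-pad so "123" means 123 milliseconds (123000 microseconds).
                parsed = parsed.replace(microsecond=int(fraction.ljust(6, "0")))
            return parsed
        syslog = SYSLOG_RE.search(line)
        if syslog:
            return datetime.strptime(f"{default_year} {syslog.group(1)}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # matched the shape but not a real date or time
    return None


print(parse_timestamp("2026-05-11 10:15:30,123 web01 app[1]: started", 2026))
print(parse_timestamp("May 11 10:15:30 web01 systemd[1]: started", 2026))
print(parse_timestamp("no timestamp on this line", 2026))
```

Returning `None` instead of raising matches the behavior described above: the line is still analyzed and its first seen / last seen values stay `UNKNOWN`.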
## Service Filtering
Use `--service SERVICE_NAME` to keep findings for a specific service, unit, or process name. Partial matches are allowed.
Examples:
```bash
python3 journal_analyzer.py --file examples/sample-journal.log --service nginx
python3 journal_analyzer.py --file examples/sample-journal.log --service sshd
```
`--service nginx` matches practical variants such as `nginx`, `nginx.service`, and lines where the raw journal text includes `nginx`.
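The partial-match behavior might look like the sketch below. It assumes a case-insensitive substring check against the unit, the process name, and the raw line; the `service_matches` helper and its parameters are hypothetical, since the analyzer's source is not shown here.

```python
def service_matches(service: str, unit: str, process: str, raw_line: str) -> bool:
    """Return True when the filter text appears in the unit, process, or raw line."""
    needle = service.lower()
    return any(needle in field.lower() for field in (unit, process, raw_line))


line = "May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service."
print(service_matches("nginx", "nginx.service", "systemd", line))
# Partial matching is broad: "ssh" also selects sshd entries.
print(service_matches("ssh", "UNKNOWN", "sshd", "May 11 10:15:03 web01 sshd[2284]: auth failure"))
```

If a filter returns more than expected, passing the full unit name (for example `nginx.service`) narrows a substring match like this one.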
## Severity Filtering
Use `--severity warning` or `--severity critical` to limit the displayed findings.
Examples:
```bash
python3 journal_analyzer.py --file examples/sample-journal.log --severity critical
python3 journal_analyzer.py --file examples/sample-journal.log --severity warning
```
## Severity Model
Overall status is conservative:
- `OK` - no journal findings detected.
- `WARNING` - warning-level findings exist but no critical findings exist.
- `CRITICAL` - one or more critical findings exist.
Critical status is driven by failed units, dependency failures, OOM events, kernel panic findings, disk full or read-only filesystem symptoms, emergency mode, TLS/certificate failures, and I/O or filesystem errors.
Warning status is driven by restart-related findings, timeout patterns, connection issues, permission denied events, authentication failures, degraded messages, and generic warning/failure entries that still require review.
The report summarizes exported journal findings that require review. It does not claim root cause.
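The conservative status rule reduces to "the worst severity seen wins". A minimal sketch, assuming findings carry a `CRITICAL` or `WARNING` severity label (the `overall_status` helper name is illustrative):

```python
from typing import Iterable


def overall_status(severities: Iterable[str]) -> str:
    """Return CRITICAL if any critical finding exists, else WARNING, else OK."""
    seen = set(severities)
    if "CRITICAL" in seen:
        return "CRITICAL"
    if "WARNING" in seen:
        return "WARNING"
    return "OK"


print(overall_status(["WARNING", "CRITICAL", "WARNING"]))
print(overall_status(["WARNING"]))
print(overall_status([]))
```

A single critical finding is enough to mark the whole report `CRITICAL`, regardless of how many warning findings accompany it.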
## Usage
```bash
cd infra-run/scripts/python/journal-analyzer
python3 journal_analyzer.py --file examples/sample-journal.log
python3 journal_analyzer.py --file examples/sample-journal.log --format markdown
python3 journal_analyzer.py --file examples/sample-journal.log --format markdown --output journal-report.md
python3 journal_analyzer.py --file examples/sample-journal.log --format json
python3 journal_analyzer.py --file examples/sample-journal.log --service sshd
python3 journal_analyzer.py --file examples/sample-journal.log --service nginx
python3 journal_analyzer.py --file examples/sample-journal.log --severity critical
python3 journal_analyzer.py --file examples/sample-journal.log --top 10
python3 journal_analyzer.py --file examples/sample-journal.log --since "2026-05-11 10:00:00"
python3 journal_analyzer.py --file examples/sample-journal.log --until "2026-05-11 12:00:00"
python3 journal_analyzer.py --file examples/sample-journal.log --ignore-case
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident or change ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the report to a separate file. Without `--output`, the report is printed to stdout.
## Exit Codes
- `0` - OK, no journal findings.
- `1` - Journal findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
## Example Text Output
```text
Journal Analyzer
================
Overall status: CRITICAL
Journal findings require review; logs alone do not prove root cause.
[CRITICAL] nginx.service - failed_unit
Pattern: failed to start
Occurrences: 1
Unit: nginx.service
Process: systemd
PID: 1
First seen: May 11 10:16:11
Last seen: May 11 10:16:11
Samples:
- May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 17
Total findings: 18
Critical finding groups: 11
Warning finding groups: 7
Affected services/units count: 9
```
## Markdown Workflow
Generate a Markdown report from an exported journal and attach it to the incident or change ticket as supporting evidence:
```bash
python3 journal_analyzer.py \
--file examples/sample-journal.log \
--format markdown \
--output journal-report.md
```
Review the report before attaching it. Use it as a concise summary of exported journal findings, then correlate it with service status, monitoring, recent changes, package history, and runbook-specific post-checks.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single line can match more than one finding when it contains more than one meaningful symptom, such as a TLS failure plus certificate expiry.
- Default matching already tolerates common case variants; use `--ignore-case` to force case-insensitive matching.
- Unit, process, and PID extraction are best-effort and may return `UNKNOWN`.
- Time filtering is best-effort because lines without parseable timestamps are retained.
- Large log files are read into memory; use scoped journal exports for very large review windows.
- The tool does not inspect structured journal fields because v1 works on exported text logs.
## Safety Notes
- The tool only reads the input journal export and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require root privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, or unsanitized production details in portfolio examples.
- Treat operational findings as triage evidence that requires review; the tool does not determine root cause automatically.
@@ -0,0 +1,143 @@
# Journal Analyzer Report
- Overall status: `CRITICAL`
- Journal findings require review; logs alone do not prove root cause.
## Finding Groups
### [CRITICAL] backup-agent - tls_certificate
- Pattern: `certificate expired`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `backup-agent`
- PID: `777`
- First seen: `2026-05-11 10:18:10`
- Last seen: `2026-05-11 10:18:10`
- Samples:
- `2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection`
### [CRITICAL] backup-agent - tls_certificate
- Pattern: `TLS handshake failed`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `backup-agent`
- PID: `777`
- First seen: `2026-05-11 10:18:10`
- Last seen: `2026-05-11 10:18:10`
- Samples:
- `2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection`
### [CRITICAL] dockerd - disk_filesystem
- Pattern: `no space left on device`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `dockerd`
- PID: `1347`
- First seen: `2026-05-11 10:17:33`
- Last seen: `2026-05-11 10:17:33`
- Samples:
- `2026-05-11 10:17:33 web01 dockerd[1347]: Error response from daemon: write /var/lib/docker/tmp/GetImageBlob123456: no space left on device`
### [CRITICAL] java - oom
- Pattern: `Out of memory`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `java`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:02`
- Last seen: `2026-05-11 10:17:02`
- Samples:
- `2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB`
### [CRITICAL] java - oom
- Pattern: `killed process`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `java`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:02`
- Last seen: `2026-05-11 10:17:02`
- Samples:
- `2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB`
### [CRITICAL] kernel - disk_filesystem
- Pattern: `read-only file system`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `kernel`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:54`
- Last seen: `2026-05-11 10:17:54`
- Samples:
- `2026-05-11 10:17:54 web01 kernel: EXT4-fs error (device sda2): Remounting read-only file system`
### [CRITICAL] kernel - oom
- Pattern: `invoked oom-killer`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `kernel`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:01`
- Last seen: `2026-05-11 10:17:01`
- Samples:
- `2026-05-11 10:17:01 web01 kernel: invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0`
### [CRITICAL] nginx.service - dependency_failure
- Pattern: `dependency failed`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:08`
- Last seen: `May 11 10:16:08`
- Samples:
- `May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service.`
### [CRITICAL] nginx.service - failed_unit
- Pattern: `failed to start`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:11`
- Last seen: `May 11 10:16:11`
- Samples:
- `May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.`
### [CRITICAL] nginx.service - failed_unit
- Pattern: `entered failed state`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:12`
- Last seen: `May 11 10:16:12`
- Samples:
- `May 11 10:16:12 web01 systemd[1]: nginx.service: Unit entered failed state.`
## Operational Summary
- Overall status: `CRITICAL`
- Total lines scanned: `17`
- Total findings: `18`
- Critical finding groups: `11`
- Warning finding groups: `7`
- Affected services/units count: `9`
- Top affected services/units: nginx.service (5), sshd.service (3), kernel (2), java (2), backup-agent (2), sshd (1), dockerd (1), NetworkManager (1), systemd (1)
- Top finding categories: restart (3), oom (3), failed_unit (2), disk_filesystem (2), tls_certificate (2), authentication (1), timeout (1), dependency_failure (1), generic_failure (1), network (1)
- Failed unit findings: nginx.service (3)
- Restart findings: `3`
- OOM findings: `3`
- Filesystem/disk findings: `2`
- Timestamp coverage: parsed=`17`, unknown=`0`
- Filters used: service=`None`, severity=`None`, since=`None`, until=`None`
@@ -0,0 +1,17 @@
May 11 10:14:01 web01 systemd[1]: Starting nginx.service - A high performance web server and a reverse proxy server...
May 11 10:14:02 web01 systemd[1]: Started ssh.service - OpenBSD Secure Shell server.
May 11 10:15:03 web01 sshd[2284]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=198.51.100.23 user=deploy
May 11 10:15:22 web01 systemd[1]: sshd.service: Scheduled restart job, restart counter is at 3.
May 11 10:15:23 web01 systemd[1]: sshd.service: Service restart completed after watchdog timeout warning
May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service.
May 11 10:16:09 web01 systemd[1]: nginx.service: Job nginx.service/start failed with result 'dependency'.
May 11 10:16:10 web01 systemd[1]: nginx.service: Start request repeated too quickly.
May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
May 11 10:16:12 web01 systemd[1]: nginx.service: Unit entered failed state.
2026-05-11 10:17:01 web01 kernel: invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB
2026-05-11 10:17:33 web01 dockerd[1347]: Error response from daemon: write /var/lib/docker/tmp/GetImageBlob123456: no space left on device
2026-05-11 10:17:54 web01 kernel: EXT4-fs error (device sda2): Remounting read-only file system
2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection
2026-05-11 10:18:28 web01 NetworkManager[691]: Connection activation failed: Connection refused while reaching upstream gateway
2026-05-11 10:18:42 web01 systemd[1]: Emergency mode is enabled. System cannot continue normal boot.
@@ -0,0 +1,895 @@
#!/usr/bin/env python3
"""Analyze exported journalctl text logs for operational findings."""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
CRITICAL_PATTERNS = [
{
"name": "failed to start",
"pattern": "failed to start",
"category": "failed_unit",
"service_hint": "systemd",
},
{
"name": "entered failed state",
"pattern": "entered failed state",
"category": "failed_unit",
"service_hint": "systemd",
},
{
"name": "dependency failed",
"pattern": "dependency failed",
"category": "dependency_failure",
"service_hint": "systemd",
},
{
"name": "job failed",
"pattern": "job failed",
"category": "failed_unit",
"service_hint": "systemd",
},
{
"name": "unit failed",
"pattern": "unit failed",
"category": "failed_unit",
"service_hint": "systemd",
},
{
"name": "kernel panic",
"pattern": "kernel panic",
"category": "kernel_panic",
"service_hint": "kernel",
},
{
"name": "panic",
"pattern": "panic",
"category": "kernel_panic",
"service_hint": "kernel",
},
{
"name": "Out of memory",
"pattern": "Out of memory",
"category": "oom",
"service_hint": "kernel",
},
{
"name": "invoked oom-killer",
"pattern": "invoked oom-killer",
"category": "oom",
"service_hint": "kernel",
},
{
"name": "killed process",
"pattern": "killed process",
"category": "oom",
"service_hint": "kernel",
},
{
"name": "no space left on device",
"pattern": "no space left on device",
"category": "disk_filesystem",
"service_hint": "storage",
},
{
"name": "read-only file system",
"pattern": "read-only file system",
"category": "disk_filesystem",
"service_hint": "storage",
},
{
"name": "segmentation fault",
"pattern": "segmentation fault",
"category": "crash",
"service_hint": "application",
},
{
"name": "segfault",
"pattern": "segfault",
"category": "crash",
"service_hint": "application",
},
{
"name": "certificate expired",
"pattern": "certificate expired",
"category": "tls_certificate",
"service_hint": "tls",
},
{
"name": "TLS handshake failed",
"pattern": "TLS handshake failed",
"category": "tls_certificate",
"service_hint": "tls",
},
{
"name": "emergency mode",
"pattern": "emergency mode",
"category": "system_recovery",
"service_hint": "systemd",
},
{
"name": "filesystem error",
"pattern": "filesystem error",
"category": "disk_filesystem",
"service_hint": "storage",
},
{
"name": "I/O error",
"pattern": "I/O error",
"category": "disk_filesystem",
"service_hint": "storage",
},
]
WARNING_PATTERNS = [
{
"name": "service restart",
"pattern": "service restart",
"category": "restart",
"service_hint": "systemd",
},
{
"name": "scheduled restart job",
"pattern": "scheduled restart job",
"category": "restart",
"service_hint": "systemd",
},
{
"name": "start request repeated too quickly",
"pattern": "start request repeated too quickly",
"category": "restart",
"service_hint": "systemd",
},
{
"name": "timeout",
"pattern": "timeout",
"category": "timeout",
"service_hint": "application",
},
{
"name": "timed out",
"pattern": "timed out",
"category": "timeout",
"service_hint": "application",
},
{
"name": "connection refused",
"pattern": "connection refused",
"category": "network",
"service_hint": "network",
},
{
"name": "connection reset",
"pattern": "connection reset",
"category": "network",
"service_hint": "network",
},
{
"name": "permission denied",
"pattern": "permission denied",
"category": "permission",
"service_hint": "security",
},
{
"name": "authentication failure",
"pattern": "authentication failure",
"category": "authentication",
"service_hint": "security",
},
{
"name": "denied",
"pattern": "denied",
"category": "permission",
"service_hint": "security",
},
{
"name": "unavailable",
"pattern": "unavailable",
"category": "availability",
"service_hint": "application",
},
{
"name": "degraded",
"pattern": "degraded",
"category": "degraded",
"service_hint": "systemd",
},
{
"name": "failed",
"pattern": "failed",
"category": "generic_failure",
"service_hint": "application",
},
{
"name": "warning",
"pattern": "warning",
"category": "warning",
"service_hint": "application",
},
]
ISO_TIMESTAMP_RE = re.compile(
r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
UNIT_RE = re.compile(r"\b([A-Za-z0-9_.@:-]+\.service)\b")
ANY_UNIT_RE = re.compile(
r"\b([A-Za-z0-9_.@:-]+\.(?:service|socket|mount|target|timer|path|slice|scope|device))\b"
)
PREFIX_RE = re.compile(
r"^(?:[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+)?"
r"(?:\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[,.]\d{1,6})?\s+)?"
r"(?:(?P<host>[A-Za-z0-9_.:-]+)\s+)?"
r"(?P<proc>[A-Za-z0-9_.@/-]+)(?:\[(?P<pid>\d+)\])?:"
)
KILLED_PROCESS_RE = re.compile(r"Killed process \d+ \(([^)]+)\)")
SYSTEMD_FAILED_START_RE = re.compile(r"Failed to start\s+(.+?)\.")
SYSTEMD_TRIGGER_RE = re.compile(r"Triggered By:\s*([A-Za-z0-9_.@:-]+\.(?:service|socket|mount|target|timer|path|slice|scope|device))")
PID_RE = re.compile(r"\bpid[ =](\d+)\b", re.IGNORECASE)
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(
description="Analyze exported journalctl text logs for systemd and service findings."
)
parser.add_argument("--file", required=True, help="Exported journal log file to analyze.")
parser.add_argument(
"--format",
choices=("text", "markdown", "json"),
default="text",
help="Report format. Default: text.",
)
parser.add_argument("--output", help="Write report to this path instead of stdout.")
parser.add_argument(
"--service",
help="Filter findings to a service, unit, or process name. Partial matching is allowed.",
)
parser.add_argument(
"--severity",
choices=("warning", "critical"),
help="Show only warning or critical findings.",
)
parser.add_argument(
"--top",
type=positive_int,
default=10,
help="Number of top groups, services, and categories to display. Default: 10.",
)
parser.add_argument(
"--max-samples",
type=non_negative_int,
default=3,
help="Maximum sample lines per finding group. Default: 3.",
)
parser.add_argument(
"--ignore-case",
action="store_true",
help="Match configured patterns case-insensitively.",
)
parser.add_argument(
"--since",
type=parse_filter_timestamp,
help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
)
parser.add_argument(
"--until",
type=parse_filter_timestamp,
help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
)
return parser
def positive_int(value: str) -> int:
try:
number = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError("must be a positive integer") from exc
if number <= 0:
raise argparse.ArgumentTypeError("must be a positive integer")
return number
def non_negative_int(value: str) -> int:
try:
number = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
if number < 0:
raise argparse.ArgumentTypeError("must be zero or a positive integer")
return number
def parse_filter_timestamp(value: str) -> datetime:
for fmt in (
"%Y-%m-%d %H:%M:%S",
"%Y-%m-%dT%H:%M:%S",
"%Y-%m-%d %H:%M:%S.%f",
"%Y-%m-%d %H:%M:%S,%f",
):
try:
return datetime.strptime(value, fmt)
except ValueError:
continue
raise argparse.ArgumentTypeError(
'expected timestamp format "YYYY-MM-DD HH:MM:SS"'
)
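def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    # NOTE: matching is case-insensitive regardless of ignore_case. The
    # configured patterns are mostly lowercase while journal text is often
    # capitalized ("Failed to start"), so re.IGNORECASE is always set and
    # --ignore-case currently has no additional effect.
    flags = re.IGNORECASE
    if ignore_case:
        flags |= re.IGNORECASE
    compiled = []
    # Patterns are literal strings, so escape them before compiling.
    for severity, items in (("CRITICAL", CRITICAL_PATTERNS), ("WARNING", WARNING_PATTERNS)):
        for item in items:
            compiled.append(
                {
                    **item,
                    "severity": severity,
                    "regex": re.compile(re.escape(item["pattern"]), flags),
                }
            )
    return compiled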
def read_log_file(path: Path) -> list[str]:
if not path.exists():
raise OSError(f"file does not exist: {path}")
if not path.is_file():
raise OSError(f"path is not a regular file: {path}")
try:
text = path.read_text(encoding="utf-8", errors="replace")
except PermissionError as exc:
raise OSError(f"file is not readable: {path}") from exc
except OSError as exc:
raise OSError(f"unable to read file {path}: {exc}") from exc
if text == "":
raise ValueError(f"file is empty: {path}")
return text.splitlines()
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
iso_match = ISO_TIMESTAMP_RE.search(line)
if iso_match:
fraction = iso_match.group(3) or ""
raw = f"{iso_match.group(1)} {iso_match.group(2)}"
parse_value = raw
fmt = "%Y-%m-%d %H:%M:%S"
if fraction:
parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
fmt = "%Y-%m-%d %H:%M:%S.%f"
try:
return datetime.strptime(parse_value, fmt), raw + fraction
except ValueError:
return None, UNKNOWN
syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
if syslog_match:
raw = syslog_match.group(1)
try:
parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
except ValueError:
return None, UNKNOWN
return parsed, raw
return None, UNKNOWN
def line_in_time_window(
parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
if parsed_at is None:
return True
if since is not None and parsed_at < since:
return False
if until is not None and parsed_at > until:
return False
return True
def render_seen(value: tuple[datetime, str] | None) -> str:
if value is None:
return UNKNOWN
return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
if parsed_at is None:
return
if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
group["first_seen"] = (parsed_at, rendered_at)
if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
group["last_seen"] = (parsed_at, rendered_at)
def append_limited(items: list[str], value: str, limit: int) -> None:
if limit == 0:
return
if value in items:
return
if len(items) < limit:
items.append(value)
def normalize_service_name(value: str) -> str:
stripped = value.strip()
if not stripped:
return UNKNOWN
return stripped
def extract_service_info(line: str, pattern_item: dict[str, Any]) -> dict[str, str]:
unit_match = UNIT_RE.search(line)
any_unit_match = ANY_UNIT_RE.search(line)
prefix_match = PREFIX_RE.search(line)
killed_match = KILLED_PROCESS_RE.search(line)
triggered_match = SYSTEMD_TRIGGER_RE.search(line)
pid_match = PID_RE.search(line)
unit = UNKNOWN
process = UNKNOWN
pid = UNKNOWN
if unit_match:
unit = unit_match.group(1)
elif any_unit_match:
unit = any_unit_match.group(1)
if prefix_match:
process = prefix_match.group("proc") or UNKNOWN
pid = prefix_match.group("pid") or UNKNOWN
if killed_match:
process = normalize_service_name(killed_match.group(1))
if pid == UNKNOWN and pid_match:
pid = pid_match.group(1)
if unit == UNKNOWN and process == "systemd":
failed_start_match = SYSTEMD_FAILED_START_RE.search(line)
if failed_start_match:
unit = normalize_service_name(
failed_start_match.group(1).strip().replace(" ", "-")
)
if not unit.endswith(".service"):
unit = f"{unit}.service"
if unit == UNKNOWN and triggered_match:
unit = triggered_match.group(1)
service = UNKNOWN
if unit != UNKNOWN:
service = unit
elif process != UNKNOWN:
service = process
elif pattern_item.get("service_hint"):
service = pattern_item["service_hint"]
return {
"service": service,
"unit": unit,
"process": process,
"pid": pid,
}
def service_filter_matches(service_filter: str | None, service_info: dict[str, str], line: str) -> bool:
if not service_filter:
return True
needle = service_filter.lower()
candidates = [line.lower()]
for key in ("service", "unit", "process"):
value = service_info.get(key, UNKNOWN)
if value != UNKNOWN:
candidates.append(value.lower())
return any(needle in candidate for candidate in candidates)
def severity_filter_matches(selected: str | None, severity: str) -> bool:
if selected is None:
return True
return selected.upper() == severity
def detect_failed_unit(line: str, service_info: dict[str, str], category: str) -> str | None:
if category not in {"failed_unit", "dependency_failure"}:
return None
if service_info["unit"] != UNKNOWN:
return service_info["unit"]
match = ANY_UNIT_RE.search(line)
if match:
return match.group(1)
return None
def analyze_log(
lines: list[str],
patterns: list[dict[str, Any]],
since: datetime | None,
until: datetime | None,
service_filter: str | None,
severity_filter: str | None,
top: int,
max_samples: int,
) -> dict[str, Any]:
syslog_year = since.year if since is not None else datetime.now().year
groups: dict[str, dict[str, Any]] = {}
total_lines_scanned = 0
parsed_timestamps = 0
unknown_timestamps = 0
top_services = Counter()
top_categories = Counter()
failed_units = Counter()
restart_findings = 0
oom_findings = 0
filesystem_findings = 0
for line in lines:
parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
total_lines_scanned += 1
if parsed_at is not None:
parsed_timestamps += 1
else:
unknown_timestamps += 1
if not line_in_time_window(parsed_at, since, until):
continue
matched_items = [item for item in patterns if item["regex"].search(line)]
if matched_items:
has_specific_match = any(
item["name"] not in {"failed", "warning"} for item in matched_items
)
if has_specific_match:
matched_items = [
item for item in matched_items if item["name"] not in {"failed", "warning"}
]
for item in matched_items:
if not severity_filter_matches(severity_filter, item["severity"]):
continue
service_info = extract_service_info(line, item)
if not service_filter_matches(service_filter, service_info, line):
continue
key = (
f"{service_info['service']}::{item['name']}::{item['category']}::{item['severity']}"
)
group = groups.setdefault(
key,
{
"service": service_info["service"],
"unit": service_info["unit"],
"process": service_info["process"],
"pid": service_info["pid"],
"category": item["category"],
"pattern": item["name"],
"severity": item["severity"],
"occurrences": 0,
"first_seen": None,
"last_seen": None,
"samples": [],
},
)
group["occurrences"] += 1
update_seen(group, parsed_at, rendered_at)
append_limited(group["samples"], line, max_samples)
top_services[group["service"]] += 1
top_categories[group["category"]] += 1
failed_unit = detect_failed_unit(line, service_info, item["category"])
if failed_unit:
failed_units[failed_unit] += 1
if item["category"] == "restart":
restart_findings += 1
if item["category"] == "oom":
oom_findings += 1
if item["category"] == "disk_filesystem":
filesystem_findings += 1
findings = sorted(
groups.values(),
key=lambda item: (
SEVERITY_ORDER[item["severity"]],
-item["occurrences"],
item["service"].lower(),
item["category"].lower(),
),
)
rendered_findings = []
for group in findings:
rendered_findings.append(
{
"service": group["service"],
"unit": group["unit"],
"process": group["process"],
"pid": group["pid"],
"category": group["category"],
"pattern": group["pattern"],
"severity": group["severity"],
"occurrences": group["occurrences"],
"first_seen": render_seen(group["first_seen"]),
"last_seen": render_seen(group["last_seen"]),
"samples": group["samples"],
}
)
critical_groups = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
warning_groups = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
overall_status = "OK"
if critical_groups > 0:
overall_status = "CRITICAL"
elif warning_groups > 0:
overall_status = "WARNING"
displayed_findings = rendered_findings[:top]
return {
"overall_status": overall_status,
"total_lines_scanned": total_lines_scanned,
"total_findings": sum(item["occurrences"] for item in rendered_findings),
"critical_finding_groups": critical_groups,
"warning_finding_groups": warning_groups,
"affected_services_count": len([name for name in top_services if name != UNKNOWN]),
"top_affected_services": [
{"service": name, "count": count}
for name, count in top_services.most_common(top)
],
"top_categories": [
{"category": name, "count": count}
for name, count in top_categories.most_common(top)
],
"failed_units": [
{"unit": name, "count": count} for name, count in failed_units.most_common(top)
],
"restart_findings": restart_findings,
"oom_findings": oom_findings,
"filesystem_disk_findings": filesystem_findings,
"timestamp_coverage": {
"parsed_timestamps_count": parsed_timestamps,
"unknown_timestamps_count": unknown_timestamps,
},
"filters_used": {
"service": service_filter or None,
"severity": severity_filter or None,
"since": since.strftime("%Y-%m-%d %H:%M:%S") if since else None,
"until": until.strftime("%Y-%m-%d %H:%M:%S") if until else None,
},
"finding_groups": displayed_findings,
"finding_groups_total": len(rendered_findings),
}
def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
if not items:
return "None"
return ", ".join(f"{item[key]} ({item['count']})" for item in items)
def render_text(report: dict[str, Any]) -> str:
lines = [
"Journal Analyzer",
"================",
"",
f"Overall status: {report['overall_status']}",
"Journal findings require review; logs alone do not prove root cause.",
"",
]
if report["finding_groups"]:
for finding in report["finding_groups"]:
lines.extend(
[
f"[{finding['severity']}] {finding['service']} - {finding['category']}",
f"Pattern: {finding['pattern']}",
f"Occurrences: {finding['occurrences']}",
f"Unit: {finding['unit']}",
f"Process: {finding['process']}",
f"PID: {finding['pid']}",
f"First seen: {finding['first_seen']}",
f"Last seen: {finding['last_seen']}",
"Samples:",
]
)
if finding["samples"]:
for sample in finding["samples"]:
lines.append(f" - {sample}")
else:
lines.append(" - None")
lines.append("")
else:
lines.extend(["No journal findings detected for the selected filters.", ""])
lines.extend(
[
"Operational Summary",
"-------------------",
f"Overall status: {report['overall_status']}",
f"Total lines scanned: {report['total_lines_scanned']}",
f"Total findings: {report['total_findings']}",
f"Critical finding groups: {report['critical_finding_groups']}",
f"Warning finding groups: {report['warning_finding_groups']}",
f"Affected services/units count: {report['affected_services_count']}",
"Top affected services/units: "
+ render_top_pairs(report["top_affected_services"], "service"),
"Top finding categories: "
+ render_top_pairs(report["top_categories"], "category"),
"Failed unit findings: "
+ render_top_pairs(report["failed_units"], "unit"),
f"Restart findings: {report['restart_findings']}",
f"OOM findings: {report['oom_findings']}",
f"Filesystem/disk findings: {report['filesystem_disk_findings']}",
"Timestamp coverage: "
f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
"Filters used: "
f"service={report['filters_used']['service'] or 'None'}, "
f"severity={report['filters_used']['severity'] or 'None'}, "
f"since={report['filters_used']['since'] or 'None'}, "
f"until={report['filters_used']['until'] or 'None'}",
]
)
return "\n".join(lines)
def render_markdown(report: dict[str, Any]) -> str:
lines = [
"# Journal Analyzer Report",
"",
f"- Overall status: `{report['overall_status']}`",
"- Journal findings require review; logs alone do not prove root cause.",
"",
]
if report["finding_groups"]:
lines.append("## Finding Groups")
lines.append("")
for finding in report["finding_groups"]:
lines.extend(
[
f"### [{finding['severity']}] {finding['service']} - {finding['category']}",
"",
f"- Pattern: `{finding['pattern']}`",
f"- Occurrences: `{finding['occurrences']}`",
f"- Unit: `{finding['unit']}`",
f"- Process: `{finding['process']}`",
f"- PID: `{finding['pid']}`",
f"- First seen: `{finding['first_seen']}`",
f"- Last seen: `{finding['last_seen']}`",
"- Samples:",
]
)
if finding["samples"]:
for sample in finding["samples"]:
lines.append(f" - `{sample}`")
else:
lines.append(" - `None`")
lines.append("")
else:
lines.extend(["## Finding Groups", "", "No journal findings detected for the selected filters.", ""])
lines.extend(
[
"## Operational Summary",
"",
f"- Overall status: `{report['overall_status']}`",
f"- Total lines scanned: `{report['total_lines_scanned']}`",
f"- Total findings: `{report['total_findings']}`",
f"- Critical finding groups: `{report['critical_finding_groups']}`",
f"- Warning finding groups: `{report['warning_finding_groups']}`",
f"- Affected services/units count: `{report['affected_services_count']}`",
"- Top affected services/units: "
+ (render_top_pairs(report["top_affected_services"], "service") or "None"),
"- Top finding categories: "
+ (render_top_pairs(report["top_categories"], "category") or "None"),
"- Failed unit findings: "
+ (render_top_pairs(report["failed_units"], "unit") or "None"),
f"- Restart findings: `{report['restart_findings']}`",
f"- OOM findings: `{report['oom_findings']}`",
f"- Filesystem/disk findings: `{report['filesystem_disk_findings']}`",
"- Timestamp coverage: "
f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
"- Filters used: "
f"service=`{report['filters_used']['service'] or 'None'}`, "
f"severity=`{report['filters_used']['severity'] or 'None'}`, "
f"since=`{report['filters_used']['since'] or 'None'}`, "
f"until=`{report['filters_used']['until'] or 'None'}`",
]
)
return "\n".join(lines)
def render_json(report: dict[str, Any]) -> str:
return json.dumps(report, indent=2)
def write_output(text: str, output_path: str | None, input_path: Path) -> None:
    if output_path is None:
        print(text)
        return
    destination = Path(output_path)
    # Only the path comparison is allowed to fail quietly. The overwrite guard
    # is raised outside the try block so it is not swallowed by "except OSError".
    try:
        same_file = destination.exists() and destination.resolve() == input_path.resolve()
    except OSError:
        same_file = False
    if same_file:
        raise OSError("output path must not overwrite the input log file")
    try:
        destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write report to {destination}: {exc}") from exc
def determine_exit_code(report: dict[str, Any]) -> int:
if report["total_findings"] > 0:
return EXIT_FINDINGS
return EXIT_OK
def main() -> int:
parser = build_parser()
args = parser.parse_args()
try:
input_path = Path(args.file)
lines = read_log_file(input_path)
patterns = compile_patterns(args.ignore_case)
report = analyze_log(
lines=lines,
patterns=patterns,
since=args.since,
until=args.until,
service_filter=args.service,
severity_filter=args.severity.upper() if args.severity else None,
top=args.top,
max_samples=args.max_samples,
)
if args.format == "text":
rendered = render_text(report)
elif args.format == "markdown":
rendered = render_markdown(report)
else:
rendered = render_json(report)
write_output(rendered, args.output, input_path)
return determine_exit_code(report)
except (OSError, ValueError) as exc:
print(f"ERROR: {exc}", file=sys.stderr)
return EXIT_INVALID
except Exception as exc: # pragma: no cover - defensive operational fallback
print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr)
return EXIT_INVALID
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,218 @@
# jvm-log-analyzer
`jvm-log-analyzer` is a read-only Python CLI for reviewing local JVM and Java application logs. It summarizes common Java exceptions, stack trace fragments, JVM failure symptoms, database issues, network/TLS problems, HTTP 5xx entries, and repeated application warning/error patterns that require operator review.
The tool is intended for Linux infrastructure, SRE, and application support workflows where a collected log file needs a quick first-pass operational summary. It does not modify logs or system state.
## When To Use
- During incident response when a JVM application log needs a fast exception and symptom summary.
- During application support handoff when stack traces, HTTP 5xx entries, or database failures need to be attached as evidence.
- After a restart, deployment, certificate change, database incident, or capacity event when local log extracts are available.
- When predictable text, Markdown, or JSON output is useful for local review.
## What It Does
- Reads one local JVM or Java application log supplied with `--file`.
- Detects configured critical and warning JVM/application patterns.
- Extracts timestamps, log levels, thread names, logger/class names, exception types, raw samples, and short stack trace fragments where practical.
- Aggregates top finding groups, exception types, and operational symptoms.
- Produces text, Markdown, or JSON output.
## What It Does Not Do
- It does not read remote systems or live journal streams.
- It does not modify logs, services, application files, JVM flags, certificates, or database state.
- It does not query APM, ELK, SIEM, Zabbix, ticketing systems, or application APIs.
- It does not find root cause automatically.
- It does not prove an application defect.
- It does not classify every vendor-specific Java framework or application message.
## Supported Input Types
- Java / JVM application logs.
- Spring Boot style logs.
- Tomcat-style application logs.
- Generic application logs containing Java exceptions and stack traces.
UTF-8 text input is expected. Invalid byte sequences are replaced during read so review can continue. Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
## Supported JVM/Application Patterns
Critical patterns:
- `OutOfMemoryError`
- `Java heap space`
- `GC overhead limit exceeded`
- `StackOverflowError`
- `NoClassDefFoundError`
- `ClassNotFoundException`
- `ExceptionInInitializerError`
- `SSLHandshakeException`
- `CertificateExpiredException`
- `SQLException`
- `SQLRecoverableException`
- `CommunicationsException`
- `database unavailable`
- `connection pool exhausted`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`
- `FATAL`
Warning patterns:
- `NullPointerException`
- `IllegalArgumentException`
- `IllegalStateException`
- `SocketTimeoutException`
- `ConnectException`
- `TimeoutException`
- `connection refused`
- `connection reset`
- `Broken pipe`
- `WARN`
- `ERROR`
- `retrying`
- `slow query`
- `deadlock detected`
By default matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across configured patterns.
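
The matching rule above can be sketched as follows. This is a minimal illustration, assuming the configured patterns are treated as literal strings; `compile_literal_patterns`, `matching_patterns`, and the three-pattern subset are hypothetical names chosen for the example, not the tool's actual internals.

```python
import re

# Illustrative subset of the configured patterns listed above.
PATTERNS = ["OutOfMemoryError", "connection refused", "Broken pipe"]


def compile_literal_patterns(patterns, ignore_case=False):
    """Compile literal strings into regexes (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps characters such as "." or "(" from acting as regex syntax.
    return {p: re.compile(re.escape(p), flags) for p in patterns}


def matching_patterns(line, compiled):
    """Return the configured patterns found in a single log line."""
    return [name for name, regex in compiled.items() if regex.search(line)]


line = "2026-05-11 10:15:30 WARN upstream: Connection Refused by peer"
print(matching_patterns(line, compile_literal_patterns(PATTERNS)))
# prints []
print(matching_patterns(line, compile_literal_patterns(PATTERNS, ignore_case=True)))
# prints ['connection refused']
```

The first call misses the capitalized variant, which is the situation `--ignore-case` is meant to cover.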
## Stack Trace Handling
The scanner uses practical heuristics to detect multiline Java stack traces. A trace starts at lines such as:
- Fully qualified Java exception lines, such as `java.lang.NullPointerException`.
- `Exception in thread "main"`.
- `Caused by:`.
- Application exceptions ending in `Exception` or `Error`.
Subsequent lines are grouped into the same trace when they look like Java stack frames:
- Lines starting with whitespace followed by `at `.
- Lines starting with `Caused by:`.
- Lines containing `... N more`.
Stack traces are associated with the detected exception type where possible. Text and Markdown output include only short sample lines by default. Use `--include-stacktraces` to include capped multiline stack trace fragments.
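
The grouping heuristics described above might look roughly like the sketch below. It assumes a trace header begins at the start of a line and that each fragment is capped at a fixed number of frames; the regexes, `group_stack_traces`, and `max_frames` are illustrative assumptions rather than the tool's real implementation.

```python
import re

# Header: optional 'Exception in thread "..."' or 'Caused by:' prefix, then a
# (possibly qualified) class name ending in Exception or Error.
TRACE_START_RE = re.compile(
    r'^(?:Exception in thread "[^"]+"\s+|Caused by:\s+)?'
    r"(?P<type>(?:[A-Za-z_$][\w$]*\.)*[A-Za-z_$][\w$]*(?:Exception|Error))\b"
)
# Continuation: "    at pkg.Class.method(...)", "Caused by: ...", "... N more".
FRAME_RE = re.compile(r"^(?:\s+at\s|Caused by:|\s*\.\.\. \d+ more)")


def group_stack_traces(lines, max_frames=5):
    """Return (exception_type, fragment_lines) pairs with capped fragments."""
    traces = []
    current = None
    for line in lines:
        if current is not None and FRAME_RE.match(line):
            # Keep the header plus at most max_frames continuation lines.
            if len(current[1]) < max_frames + 1:
                current[1].append(line)
            continue
        start = TRACE_START_RE.match(line)
        if start:
            current = (start.group("type"), [line])
            traces.append(current)
        else:
            current = None  # any other line ends the open trace
    return traces


sample = [
    'Exception in thread "main" java.lang.OutOfMemoryError: Java heap space',
    "\tat com.example.Cache.load(Cache.java:42)",
    "2026-05-11 10:15:30 INFO recovered",
    "java.lang.IllegalStateException: pool closed",
    "\tat com.example.Pool.get(Pool.java:7)",
    "Caused by: java.sql.SQLException: timeout",
    "\t... 3 more",
]
for exc_type, fragment in group_stack_traces(sample):
    print(exc_type, len(fragment))
```

A line-based approach like this stays predictable but, as noted, is not a complete Java parser: exceptions that appear after a timestamp prefix on the same line would need an additional rule.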
## Timestamp Handling
The scanner attempts to parse:
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `2026-05-11 10:15:30,123`
- `2026-05-11 10:15:30.123`
- `May 11 10:15:30`
Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed. When `--since` or `--until` is used, lines without parseable timestamps are retained by default so potentially important findings are not silently discarded.
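
A minimal sketch of best-effort parsing and the retention rule, under stated assumptions: syslog-style timestamps carry no year, so the example takes an `assumed_year` parameter, and the helper names `parse_timestamp` and `in_time_window` are hypothetical.

```python
import re
from datetime import datetime

ISO_RE = re.compile(r"(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})(?:[,.](\d{1,6}))?")
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")


def parse_timestamp(line, assumed_year=2026):
    """Return a datetime for the supported formats, or None if unparseable."""
    iso = ISO_RE.search(line)
    if iso:
        # Normalize "," or "." fractions to six-digit microseconds.
        micro = (iso.group(3) or "0").ljust(6, "0")
        try:
            return datetime.strptime(
                f"{iso.group(1)} {iso.group(2)}.{micro}", "%Y-%m-%d %H:%M:%S.%f"
            )
        except ValueError:
            return None
    syslog = SYSLOG_RE.search(line)
    if syslog:
        try:
            return datetime.strptime(
                f"{assumed_year} {syslog.group(1)}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            return None
    return None


def in_time_window(parsed_at, since=None, until=None):
    """Lines without a parseable timestamp are retained, not dropped."""
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True


since = datetime(2026, 5, 11, 10, 0, 0)
for line in (
    "2026-05-11 09:58:01 INFO starting",
    "2026-05-11 10:15:30,123 ERROR failed",
    "\tat com.example.Main.main(Main.java:10)",
):
    print(in_time_window(parse_timestamp(line), since=since), line.strip())
```

Keeping unparseable lines means stack frames, which rarely carry their own timestamp, still reach the pattern matcher when `--since` or `--until` is set.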
## Severity Model
Overall status is conservative:
- `OK` - no JVM/application findings.
- `WARNING` - warning-level findings exist but no critical findings exist.
- `CRITICAL` - one or more critical findings exist.
Critical status is driven by JVM memory failures, fatal JVM symptoms, selected class loading errors, TLS/certificate failures, database unavailable or pool exhaustion symptoms, and HTTP 5xx volume at or above the configured threshold.
Warning status is driven by non-fatal exceptions, `WARN`/`ERROR` entries, timeout/retry patterns, connection refused/reset symptoms, slow query findings, and deadlock patterns.
Although `HTTP 500` through `HTTP 504` are listed with the critical patterns, HTTP 5xx findings are treated as warnings until their total reaches `--http-critical-threshold`, which defaults to `5`. The report summarizes findings that require review; it does not claim root cause.
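
The severity model can be illustrated with a short sketch. The finding shape (`pattern`, `severity`, `occurrences`) and the function name `overall_status` are assumptions made for the example; only the threshold rule and the three status values come from the description above.

```python
def overall_status(findings, http_critical_threshold=5):
    """Derive OK / WARNING / CRITICAL from finding groups (hypothetical shape)."""
    http_total = sum(
        f["occurrences"] for f in findings if f["pattern"].startswith("HTTP 5")
    )
    http_is_critical = http_total >= http_critical_threshold

    severities = set()
    for finding in findings:
        if finding["pattern"].startswith("HTTP 5"):
            # HTTP 5xx findings stay at warning level below the threshold.
            severities.add("CRITICAL" if http_is_critical else "WARNING")
        else:
            severities.add(finding["severity"])

    if "CRITICAL" in severities:
        return "CRITICAL"
    if "WARNING" in severities:
        return "WARNING"
    return "OK"


http_only = [
    {"pattern": "HTTP 503", "severity": "CRITICAL", "occurrences": 2},
    {"pattern": "HTTP 500", "severity": "CRITICAL", "occurrences": 1},
]
print(overall_status(http_only))                             # prints WARNING
print(overall_status(http_only, http_critical_threshold=2))  # prints CRITICAL
```

Three HTTP 5xx lines stay below the default threshold of `5`, so the status is `WARNING`; lowering the threshold to `2` escalates the same input.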
## Usage
```bash
cd infra-run/scripts/python/jvm-log-analyzer
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format markdown
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format markdown --output jvm-report.md
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format json
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --top 10
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --max-samples 5
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --include-stacktraces
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --since "2026-05-11 10:00:00"
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --until "2026-05-11 12:00:00"
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --http-critical-threshold 2
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident or application support ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file.
## Exit Codes
- `0` - OK, no JVM/application findings.
- `1` - JVM/application findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
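
For local automation, the exit codes can be mapped to actions in a small wrapper. The following is a sketch only: `classify_exit_code` and `run_analyzer` are hypothetical helpers, and the script path is assumed to be relative to the current directory.

```python
import subprocess
import sys

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {0: "ok", 1: "findings", 2: "invalid"}


def classify_exit_code(code):
    """Translate an analyzer exit code into a short label."""
    return EXIT_MEANINGS.get(code, "unexpected")


def run_analyzer(log_path, script="jvm_log_analyzer.py"):
    """Run the analyzer with JSON output and return (label, stdout, stderr)."""
    completed = subprocess.run(
        [sys.executable, script, "--file", log_path, "--format", "json"],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means findings, not a crash
    )
    return classify_exit_code(completed.returncode), completed.stdout, completed.stderr


print(classify_exit_code(1))  # prints findings
```

Passing `check=False` matters here because exit code `1` signals findings rather than a failed run; only `2` indicates that the report itself could not be produced.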
## Example Text Output
```text
JVM Log Analyzer
================
Overall status: CRITICAL
Findings require review; logs alone do not prove root cause.
[CRITICAL] OutOfMemoryError
Occurrences: 1
Symptom: jvm_memory
First seen: UNKNOWN
Last seen: UNKNOWN
Stack traces linked: 1
Samples:
- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 32
Total findings: 29
Total stack traces detected: 4
Critical finding groups: 9
Warning finding groups: 11
HTTP 5xx count: 3
Parsed timestamps count: 13
Unknown timestamps count: 19
```
## Markdown Workflow
Generate a Markdown report from a collected JVM application log and attach it to the incident or application support ticket as supporting evidence:
```bash
python3 jvm_log_analyzer.py \
--file examples/sample-jvm-app.log \
--format markdown \
--include-stacktraces \
--output jvm-report.md
```
Review the report before attaching it. Cross-check a `WARNING` or `CRITICAL` result against application health checks, JVM memory telemetry, database status, certificate state, and recent deployments, and involve the relevant application owner.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single log line can match multiple findings, such as `ERROR`, `HTTP 503`, and a Java exception.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
- Stack trace grouping is practical, not a complete Java parser.
- Timestamp parsing is best-effort; unparseable lines are retained during time filtering.
- HTTP 5xx counts are raw log counts, not request rates or customer impact.
- Large log files are read into memory; collect scoped extracts for very large incidents.
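Two of the limitations above, multiple matches per line and case-sensitive defaults, can be reproduced with a small sketch. The pattern list here is a shortened, illustrative subset and `matching_patterns` is a hypothetical helper, not part of the tool.

```python
import re

# Shortened illustrative subset of literal patterns (assumption).
PATTERNS = ["ERROR", "HTTP 503", "SocketTimeoutException"]


def matching_patterns(line: str, ignore_case: bool = False) -> list[str]:
    """Return every literal pattern found in the line, in list order."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps each pattern a plain substring match.
    return [p for p in PATTERNS if re.search(re.escape(p), line, flags)]


if __name__ == "__main__":
    line = "2026-05-11 10:18:03 ERROR handler : HTTP 503 GET /api/inventory"
    print(matching_patterns(line))                                  # one line, two findings
    print(matching_patterns("error: http 503"))                     # lowercase is missed
    print(matching_patterns("error: http 503", ignore_case=True))   # as with --ignore-case
```

Because each match is counted separately, the total findings figure can exceed the number of problem lines in the log.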
## Safety Notes
- The tool only reads the input log and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, tokens, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
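The guarantee that a report never overwrites the input log depends on comparing resolved paths. The following is a minimal sketch of such a guard, offered as an assumption about how the check could work rather than as the tool's actual implementation; `is_same_file` is a hypothetical name.

```python
from pathlib import Path


def is_same_file(input_path: str, output_path: str) -> bool:
    """Return True when both paths resolve to the same filesystem location."""
    # resolve() makes paths absolute and follows symlinks, so aliases
    # such as "./app.log" and "app.log" compare equal.
    return Path(input_path).resolve() == Path(output_path).resolve()


if __name__ == "__main__":
    print(is_same_file("app.log", "./app.log"))
    print(is_same_file("app.log", "jvm-report.md"))
```

A caller would refuse to write and exit with the invalid-input code when the function returns `True`.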
@@ -0,0 +1,32 @@
2026-05-11 09:58:01 INFO inventory-api[2214] --- [main] com.example.InventoryApplication : Starting InventoryApplication v2.8.4
2026-05-11 09:58:07 INFO inventory-api[2214] --- [main] com.example.InventoryApplication : Started InventoryApplication in 6.2 seconds
2026-05-11 10:02:14 WARN inventory-api[2214] --- [order-worker-2] com.example.retry.PaymentClient : upstream timeout, retrying payment authorization attempt=2
2026-05-11 10:05:31 ERROR inventory-api[2214] --- [http-nio-8080-exec-7] com.example.orders.OrderController : request failed while loading order id=4812
java.lang.NullPointerException: Cannot invoke "Customer.getStatus()" because "customer" is null
at com.example.orders.OrderService.validateCustomer(OrderService.java:144)
at com.example.orders.OrderService.submit(OrderService.java:92)
at com.example.orders.OrderController.create(OrderController.java:61)
Caused by: java.lang.IllegalStateException: customer lookup returned empty result
at com.example.customers.CustomerRepository.findRequired(CustomerRepository.java:38)
... 3 more
2026-05-11 10:08:42 WARN inventory-api[2214] --- [http-nio-8080-exec-2] com.example.integration.ShippingClient : java.net.SocketTimeoutException: Read timed out calling shipping endpoint
2026-05-11 10:09:13 ERROR inventory-api[2214] --- [pool-4-thread-1] com.example.integration.TaxClient : java.net.ConnectException: connection refused connecting to tax-service:8443
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
2026-05-11 10:13:02 ERROR inventory-api[2214] --- [http-nio-8080-exec-4] com.example.db.InventoryRepository : database unavailable during checkout commit
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:743)
at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:666)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
... 2 more
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
at sun.security.provider.certpath.BasicChecker.verifyTimestamp(BasicChecker.java:194)
2026-05-11 10:18:01 ERROR inventory-api[2214] --- [http-nio-8080-exec-8] com.example.web.ErrorHandler : HTTP 500 POST /api/orders requestId=req-1001
2026-05-11 10:18:03 ERROR inventory-api[2214] --- [http-nio-8080-exec-9] com.example.web.ErrorHandler : HTTP 503 GET /api/inventory requestId=req-1002
2026-05-11 10:18:06 ERROR inventory-api[2214] --- [http-nio-8080-exec-3] com.example.web.ErrorHandler : HTTP 503 GET /api/inventory requestId=req-1003
2026-05-11 10:21:27 FATAL inventory-api[2214] --- [main] org.apache.catalina.core.StandardService : JVM failure detected, stopping service
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
at com.example.cache.ReportCache.loadAll(ReportCache.java:87)
@@ -0,0 +1,215 @@
# JVM Log Analyzer
- Overall status: CRITICAL
- Finding language is a triage summary; logs alone do not prove root cause.
## CRITICAL: CertificateExpiredException
- Occurrences: 1
- Symptom: tls_certificate
- First seen: 2026-05-11 10:16:40
- Last seen: 2026-05-11 10:16:40
- Stack traces linked: 0
Sample log lines:
```text
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
```
## CRITICAL: CommunicationsException
- Occurrences: 1
- Symptom: database
- First seen: 2026-05-11 10:13:02
- Last seen: 2026-05-11 10:13:02
- Stack traces linked: 0
Sample log lines:
```text
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
```
## CRITICAL: connection pool exhausted
- Occurrences: 1
- Symptom: database
- First seen: 2026-05-11 10:12:55
- Last seen: 2026-05-11 10:12:55
- Stack traces linked: 0
Sample log lines:
```text
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
```
## CRITICAL: database unavailable
- Occurrences: 1
- Symptom: database
- First seen: 2026-05-11 10:13:02
- Last seen: 2026-05-11 10:13:02
- Stack traces linked: 0
Sample log lines:
```text
2026-05-11 10:13:02 ERROR inventory-api[2214] --- [http-nio-8080-exec-4] com.example.db.InventoryRepository : database unavailable during checkout commit
```
## CRITICAL: FATAL
- Occurrences: 1
- Symptom: fatal
- First seen: 2026-05-11 10:21:27
- Last seen: 2026-05-11 10:21:27
- Stack traces linked: 0
Sample log lines:
```text
2026-05-11 10:21:27 FATAL inventory-api[2214] --- [main] org.apache.catalina.core.StandardService : JVM failure detected, stopping service
```
## CRITICAL: Java heap space
- Occurrences: 1
- Symptom: jvm_memory
- First seen: 2026-05-11 10:21:27
- Last seen: 2026-05-11 10:21:27
- Stack traces linked: 0
Sample log lines:
```text
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
```
## CRITICAL: OutOfMemoryError
- Occurrences: 1
- Symptom: jvm_memory
- First seen: 2026-05-11 10:21:27
- Last seen: 2026-05-11 10:21:27
- Stack traces linked: 1
Sample log lines:
```text
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
```
Stack trace samples:
```text
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
at com.example.cache.ReportCache.loadAll(ReportCache.java:87)
```
## CRITICAL: SQLRecoverableException
- Occurrences: 1
- Symptom: database
- First seen: 2026-05-11 10:13:02
- Last seen: 2026-05-11 10:13:02
- Stack traces linked: 1
Sample log lines:
```text
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
```
Stack trace samples:
```text
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:743)
at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:666)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
... 2 more
```
## CRITICAL: SSLHandshakeException
- Occurrences: 1
- Symptom: tls_certificate
- First seen: 2026-05-11 10:16:40
- Last seen: 2026-05-11 10:16:40
- Stack traces linked: 1
Sample log lines:
```text
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
```
Stack trace samples:
```text
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
at sun.security.provider.certpath.BasicChecker.verifyTimestamp(BasicChecker.java:194)
```
## WARNING: ERROR
- Occurrences: 8
- Symptom: log_level
- First seen: 2026-05-11 10:05:31
- Last seen: 2026-05-11 10:18:06
- Stack traces linked: 0
Sample log lines:
```text
2026-05-11 10:05:31 ERROR inventory-api[2214] --- [http-nio-8080-exec-7] com.example.orders.OrderController : request failed while loading order id=4812
2026-05-11 10:09:13 ERROR inventory-api[2214] --- [pool-4-thread-1] com.example.integration.TaxClient : java.net.ConnectException: connection refused connecting to tax-service:8443
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
```
## Top Exception Types
| Value | Count |
| --- | ---: |
| NullPointerException | 1 |
| IllegalStateException | 1 |
| SocketTimeoutException | 1 |
| ConnectException | 1 |
| SQLRecoverableException | 1 |
| CommunicationsException | 1 |
| SSLHandshakeException | 1 |
| CertificateExpiredException | 1 |
| OutOfMemoryError | 1 |
## Top Operational Symptoms
| Value | Count |
| --- | ---: |
| log_level | 10 |
| database | 4 |
| http_5xx | 3 |
| application_exception | 2 |
| network_timeout | 2 |
| network_connectivity | 2 |
| tls_certificate | 2 |
| jvm_memory | 2 |
| retry | 1 |
| fatal | 1 |
## Operational Summary
- Overall status: CRITICAL
- Total lines scanned: 32
- Total findings: 29
- Total stack traces detected: 4
- Critical finding groups: 9
- Warning finding groups: 11
- HTTP 5xx count: 3
- Parsed timestamps count: 13
- Unknown timestamps count: 19
@@ -0,0 +1,837 @@
#!/usr/bin/env python3
"""Analyze JVM and Java application logs for operational findings."""

from __future__ import annotations

import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any

EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}

CRITICAL_PATTERNS = [
    {"name": "OutOfMemoryError", "pattern": "OutOfMemoryError", "symptom": "jvm_memory"},
    {"name": "Java heap space", "pattern": "Java heap space", "symptom": "jvm_memory"},
    {"name": "GC overhead limit exceeded", "pattern": "GC overhead limit exceeded", "symptom": "jvm_memory"},
    {"name": "StackOverflowError", "pattern": "StackOverflowError", "symptom": "jvm_stack"},
    {"name": "NoClassDefFoundError", "pattern": "NoClassDefFoundError", "symptom": "class_loading"},
    {"name": "ClassNotFoundException", "pattern": "ClassNotFoundException", "symptom": "class_loading"},
    {"name": "ExceptionInInitializerError", "pattern": "ExceptionInInitializerError", "symptom": "class_loading"},
    {"name": "SSLHandshakeException", "pattern": "SSLHandshakeException", "symptom": "tls_certificate"},
    {"name": "CertificateExpiredException", "pattern": "CertificateExpiredException", "symptom": "tls_certificate"},
    {"name": "SQLException", "pattern": "SQLException", "symptom": "database"},
    {"name": "SQLRecoverableException", "pattern": "SQLRecoverableException", "symptom": "database"},
    {"name": "CommunicationsException", "pattern": "CommunicationsException", "symptom": "database"},
    {"name": "database unavailable", "pattern": "database unavailable", "symptom": "database"},
    {"name": "connection pool exhausted", "pattern": "connection pool exhausted", "symptom": "database"},
    {"name": "FATAL", "pattern": "FATAL", "symptom": "fatal"},
]

WARNING_PATTERNS = [
    {"name": "NullPointerException", "pattern": "NullPointerException", "symptom": "application_exception"},
    {"name": "IllegalArgumentException", "pattern": "IllegalArgumentException", "symptom": "application_exception"},
    {"name": "IllegalStateException", "pattern": "IllegalStateException", "symptom": "application_exception"},
    {"name": "SocketTimeoutException", "pattern": "SocketTimeoutException", "symptom": "network_timeout"},
    {"name": "ConnectException", "pattern": "ConnectException", "symptom": "network_connectivity"},
    {"name": "TimeoutException", "pattern": "TimeoutException", "symptom": "network_timeout"},
    {"name": "connection refused", "pattern": "connection refused", "symptom": "network_connectivity"},
    {"name": "connection reset", "pattern": "connection reset", "symptom": "network_connectivity"},
    {"name": "Broken pipe", "pattern": "Broken pipe", "symptom": "network_connectivity"},
    {"name": "WARN", "pattern": "WARN", "symptom": "log_level"},
    {"name": "ERROR", "pattern": "ERROR", "symptom": "log_level"},
    {"name": "retrying", "pattern": "retrying", "symptom": "retry"},
    {"name": "slow query", "pattern": "slow query", "symptom": "database"},
    {"name": "deadlock detected", "pattern": "deadlock detected", "symptom": "database"},
]

HTTP_PATTERNS = [
    {"name": "HTTP 500", "pattern": "HTTP 500", "symptom": "http_5xx"},
    {"name": "HTTP 502", "pattern": "HTTP 502", "symptom": "http_5xx"},
    {"name": "HTTP 503", "pattern": "HTTP 503", "symptom": "http_5xx"},
    {"name": "HTTP 504", "pattern": "HTTP 504", "symptom": "http_5xx"},
]

ISO_TIMESTAMP_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
LEVEL_RE = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")
SPRING_LOGGER_RE = re.compile(r"\s---\s+\[[^\]]+\]\s+([A-Za-z0-9_.$-]+)\s*:")
GENERIC_LOGGER_RE = re.compile(
    r"\b(?:TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b\s+(?:\d+\s+)?([A-Za-z0-9_.$-]+)\s*:"
)
THREAD_RE = re.compile(r"\[([^\]]+)\]")
SPRING_THREAD_RE = re.compile(r"\s---\s+\[([^\]]+)\]")
EXCEPTION_RE = re.compile(
    r"\b((?:[A-Za-z_$][\w$]*\.)+[A-Za-z_$][\w$]*(?:Exception|Error)|[A-Za-z_$][\w$]*(?:Exception|Error))\b"
)
STACK_FRAME_RE = re.compile(r"^\s+at\s+")
CAUSED_BY_RE = re.compile(r"^\s*Caused by:\s+")
MORE_RE = re.compile(r"^\s*\.\.\.\s+\d+\s+more\b")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Analyze local JVM and Java application logs for operational findings."
    )
    parser.add_argument("--file", required=True, help="Local JVM or Java application log to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of top finding groups, exception types, and symptoms to display. Default: 10.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding group. Default: 3.",
    )
    parser.add_argument(
        "--include-stacktraces",
        action="store_true",
        help="Include short multiline stack trace samples in text and Markdown reports.",
    )
    parser.add_argument(
        "--max-stack-lines",
        type=positive_int,
        default=12,
        help="Maximum lines retained per stack trace sample. Default: 12.",
    )
    parser.add_argument(
        "--http-critical-threshold",
        type=positive_int,
        default=5,
        help="HTTP 5xx count that raises HTTP findings to CRITICAL. Default: 5.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--since",
        type=parse_filter_timestamp,
        help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--until",
        type=parse_filter_timestamp,
        help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
    )
    return parser
def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def parse_filter_timestamp(value: str) -> datetime:
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise argparse.ArgumentTypeError('expected timestamp format "YYYY-MM-DD HH:MM:SS"')
def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    flags = re.IGNORECASE if ignore_case else 0
    compiled = []
    for item in CRITICAL_PATTERNS:
        compiled.append({**item, "severity": "CRITICAL", "kind": "pattern", "regex": re.compile(re.escape(item["pattern"]), flags)})
    for item in WARNING_PATTERNS:
        compiled.append({**item, "severity": "WARNING", "kind": "pattern", "regex": re.compile(re.escape(item["pattern"]), flags)})
    for item in HTTP_PATTERNS:
        compiled.append({**item, "severity": "WARNING", "kind": "http_5xx", "regex": re.compile(re.escape(item["pattern"]), flags)})
    return compiled


def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        fraction = iso_match.group(3) or ""
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        parse_value = raw
        fmt = "%Y-%m-%d %H:%M:%S"
        if fraction:
            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
            fmt = "%Y-%m-%d %H:%M:%S.%f"
        try:
            return datetime.strptime(parse_value, fmt), raw + fraction
        except ValueError:
            return None, UNKNOWN
    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        try:
            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw
    return None, UNKNOWN


def line_in_time_window(
    parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True


def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
def extract_level(line: str) -> str:
    match = LEVEL_RE.search(line)
    if match:
        return match.group(1)
    return UNKNOWN


def extract_thread(line: str) -> str:
    for regex in (SPRING_THREAD_RE, THREAD_RE):
        match = regex.search(line)
        if match:
            return match.group(1)
    return UNKNOWN


def extract_logger(line: str) -> str:
    for regex in (SPRING_LOGGER_RE, GENERIC_LOGGER_RE):
        match = regex.search(line)
        if match:
            return match.group(1)
    return UNKNOWN


def normalize_exception_type(value: str) -> str:
    return value.split(".")[-1]


def extract_exception_type(line: str) -> str:
    match = EXCEPTION_RE.search(line)
    if match:
        return normalize_exception_type(match.group(1))
    return UNKNOWN


def is_stack_start(line: str) -> bool:
    return (
        "Exception in thread" in line
        or CAUSED_BY_RE.search(line) is not None
        or EXCEPTION_RE.search(line) is not None
    )


def is_stack_continuation(line: str) -> bool:
    return (
        STACK_FRAME_RE.search(line) is not None
        or CAUSED_BY_RE.search(line) is not None
        or MORE_RE.search(line) is not None
    )
def update_seen(
    group: dict[str, Any], parsed_at: datetime | None, rendered_at: str
) -> None:
    if parsed_at is None:
        return
    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
        group["first_seen"] = (parsed_at, rendered_at)
    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
        group["last_seen"] = (parsed_at, rendered_at)


def append_limited(items: list[Any], value: Any, limit: int) -> None:
    if limit == 0:
        return
    if value in items:
        return
    if len(items) < limit:
        items.append(value)


def finding_key(severity: str, name: str) -> str:
    return f"{severity}::{name}"


def ensure_group(
    groups: dict[str, dict[str, Any]],
    name: str,
    severity: str,
    symptom: str,
    kind: str,
) -> dict[str, Any]:
    key = finding_key(severity, name)
    return groups.setdefault(
        key,
        {
            "name": name,
            "severity": severity,
            "symptom": symptom,
            "kind": kind,
            "occurrences": 0,
            "stack_trace_count": 0,
            "first_seen": None,
            "last_seen": None,
            "samples": [],
            "stack_trace_samples": [],
            "fields": [],
        },
    )
def add_finding(
    groups: dict[str, dict[str, Any]],
    name: str,
    severity: str,
    symptom: str,
    kind: str,
    line: str,
    parsed_at: datetime | None,
    rendered_at: str,
    max_samples: int,
) -> dict[str, Any]:
    group = ensure_group(groups, name, severity, symptom, kind)
    group["occurrences"] += 1
    update_seen(group, parsed_at, rendered_at)
    append_limited(group["samples"], line, max_samples)
    append_limited(
        group["fields"],
        {
            "timestamp": rendered_at,
            "log_level": extract_level(line),
            "logger": extract_logger(line),
            "thread": extract_thread(line),
            "exception_type": extract_exception_type(line),
            "raw": line,
        },
        max_samples,
    )
    return group


def record_stack_trace(
    groups: dict[str, dict[str, Any]],
    stack: dict[str, Any],
    max_samples: int,
    max_stack_lines: int,
) -> None:
    exception_type = stack["exception_type"] if stack["exception_type"] != UNKNOWN else "Java stack trace"
    severity = severity_for_exception(exception_type)
    group = ensure_group(groups, exception_type, severity, "stack_trace", "stack_trace")
    group["stack_trace_count"] += 1
    update_seen(group, stack["parsed_at"], stack["rendered_at"])
    append_limited(group["samples"], stack["lines"][0], max_samples)
    append_limited(group["stack_trace_samples"], stack["lines"][:max_stack_lines], max_samples)


def severity_for_exception(exception_type: str) -> str:
    critical = {item["name"] for item in CRITICAL_PATTERNS}
    if exception_type in critical or exception_type in {"OutOfMemoryError", "StackOverflowError"}:
        return "CRITICAL"
    return "WARNING"
def detect_stack_traces(
    included: list[dict[str, Any]],
    groups: dict[str, dict[str, Any]],
    max_samples: int,
    max_stack_lines: int,
) -> int:
    stack: dict[str, Any] | None = None
    stack_count = 0
    for item in included:
        line = item["line"]
        if stack is None:
            if is_stack_start(line):
                stack = {
                    "lines": [line],
                    "exception_type": extract_exception_type(line),
                    "parsed_at": item["parsed_at"],
                    "rendered_at": item["rendered_at"],
                }
            continue
        if is_stack_continuation(line):
            stack["lines"].append(line)
            if stack["exception_type"] == UNKNOWN:
                stack["exception_type"] = extract_exception_type(line)
            continue
        if len(stack["lines"]) > 1:
            record_stack_trace(groups, stack, max_samples, max_stack_lines)
            stack_count += 1
        stack = None
        if is_stack_start(line):
            stack = {
                "lines": [line],
                "exception_type": extract_exception_type(line),
                "parsed_at": item["parsed_at"],
                "rendered_at": item["rendered_at"],
            }
    if stack is not None and len(stack["lines"]) > 1:
        record_stack_trace(groups, stack, max_samples, max_stack_lines)
        stack_count += 1
    return stack_count
def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    since: datetime | None,
    until: datetime | None,
    top: int,
    max_samples: int,
    max_stack_lines: int,
    http_critical_threshold: int,
) -> dict[str, Any]:
    syslog_year = since.year if since is not None else datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    exception_counts: Counter[str] = Counter()
    symptom_counts: Counter[str] = Counter()
    parsed_timestamps = 0
    unknown_timestamps = 0
    included: list[dict[str, Any]] = []
    http_5xx_count = 0
    context_parsed_at: datetime | None = None
    context_rendered_at = UNKNOWN
    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if parsed_at is None:
            unknown_timestamps += 1
        else:
            parsed_timestamps += 1
            context_parsed_at = parsed_at
            context_rendered_at = rendered_at
        if not line_in_time_window(parsed_at, since, until):
            continue
        # Stack trace frames often omit timestamps; keep nearby log context for first/last seen.
        effective_parsed_at = parsed_at
        effective_rendered_at = rendered_at
        if parsed_at is None and (is_stack_start(line) or is_stack_continuation(line)):
            effective_parsed_at = context_parsed_at
            effective_rendered_at = context_rendered_at
        included.append(
            {
                "line": line,
                "parsed_at": effective_parsed_at,
                "rendered_at": effective_rendered_at,
            }
        )
        matched_names = set()
        for item in patterns:
            if not item["regex"].search(line):
                continue
            severity = item["severity"]
            if item["kind"] == "http_5xx":
                http_5xx_count += 1
            add_finding(
                groups=groups,
                name=item["name"],
                severity=severity,
                symptom=item["symptom"],
                kind=item["kind"],
                line=line,
                parsed_at=effective_parsed_at,
                rendered_at=effective_rendered_at,
                max_samples=max_samples,
            )
            symptom_counts[item["symptom"]] += 1
            matched_names.add(item["name"])
        exception_type = extract_exception_type(line)
        if exception_type != UNKNOWN:
            exception_counts[exception_type] += 1
            if exception_type not in matched_names:
                severity = severity_for_exception(exception_type)
                add_finding(
                    groups=groups,
                    name=exception_type,
                    severity=severity,
                    symptom="application_exception",
                    kind="exception",
                    line=line,
                    parsed_at=effective_parsed_at,
                    rendered_at=effective_rendered_at,
                    max_samples=max_samples,
                )
                symptom_counts["application_exception"] += 1
    stack_trace_count = detect_stack_traces(included, groups, max_samples, max_stack_lines)
    promote_http_5xx(groups, http_5xx_count, http_critical_threshold)
    findings = sorted(
        (render_group(group) for group in groups.values()),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["name"].lower(),
        ),
    )
    summary = build_summary(
        total_lines=len(lines),
        findings=findings,
        stack_trace_count=stack_trace_count,
        http_5xx_count=http_5xx_count,
        parsed_timestamps=parsed_timestamps,
        unknown_timestamps=unknown_timestamps,
    )
    return {
        "summary": summary,
        "findings": findings[:top],
        "top_exception_types": top_items(exception_counts, top),
        "top_operational_symptoms": top_items(symptom_counts, top),
    }
def promote_http_5xx(
    groups: dict[str, dict[str, Any]], http_5xx_count: int, threshold: int
) -> None:
    if http_5xx_count < threshold:
        return
    http_names = {item["name"] for item in HTTP_PATTERNS}
    for old_key, group in list(groups.items()):
        if group["name"] not in http_names or group["severity"] == "CRITICAL":
            continue
        group["severity"] = "CRITICAL"
        new_key = finding_key("CRITICAL", group["name"])
        groups[new_key] = group
        del groups[old_key]


def render_group(group: dict[str, Any]) -> dict[str, Any]:
    return {
        "name": group["name"],
        "severity": group["severity"],
        "symptom": group["symptom"],
        "kind": group["kind"],
        "occurrences": group["occurrences"],
        "stack_trace_count": group["stack_trace_count"],
        "first_seen": render_seen(group["first_seen"]),
        "last_seen": render_seen(group["last_seen"]),
        "samples": group["samples"],
        "stack_trace_samples": group["stack_trace_samples"],
        "fields": group["fields"],
    }


def build_summary(
    total_lines: int,
    findings: list[dict[str, Any]],
    stack_trace_count: int,
    http_5xx_count: int,
    parsed_timestamps: int,
    unknown_timestamps: int,
) -> dict[str, Any]:
    critical_groups = sum(1 for item in findings if item["severity"] == "CRITICAL")
    warning_groups = sum(1 for item in findings if item["severity"] == "WARNING")
    total_findings = sum(item["occurrences"] for item in findings)
    if critical_groups > 0:
        status = "CRITICAL"
    elif warning_groups > 0:
        status = "WARNING"
    else:
        status = "OK"
    return {
        "overall_status": status,
        "total_lines_scanned": total_lines,
        "total_findings": total_findings,
        "total_stack_traces_detected": stack_trace_count,
        "critical_finding_groups": critical_groups,
        "warning_finding_groups": warning_groups,
        "http_5xx_count": http_5xx_count,
        "timestamp_coverage": {
            "parsed_timestamps_count": parsed_timestamps,
            "unknown_timestamps_count": unknown_timestamps,
        },
    }


def top_items(counter: Counter[str], limit: int) -> list[dict[str, Any]]:
    return [{"value": value, "count": count} for value, count in counter.most_common(limit)]
def render_text(report: dict[str, Any], include_stacktraces: bool) -> str:
    lines = ["JVM Log Analyzer", "================", ""]
    summary = report["summary"]
    lines.extend(
        [
            f"Overall status: {summary['overall_status']}",
            "Findings require review; logs alone do not prove root cause.",
            "",
        ]
    )
    if not report["findings"]:
        lines.extend(["No configured JVM/application findings were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['name']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"Symptom: {finding['symptom']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    f"Stack traces linked: {finding['stack_trace_count']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f" - {sample}" for sample in finding["samples"])
            else:
                lines.append(" - No samples retained")
            if include_stacktraces and finding["stack_trace_samples"]:
                lines.append("Stack trace samples:")
                for stack in finding["stack_trace_samples"]:
                    lines.append(" ---")
                    lines.extend(f" {entry}" for entry in stack)
            lines.append("")
    lines.extend(render_text_table("Top Exception Types", report["top_exception_types"]))
    lines.extend(render_text_table("Top Operational Symptoms", report["top_operational_symptoms"]))
    lines.extend(render_text_summary(summary))
    return "\n".join(lines) + "\n"


def render_text_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [title, "-" * len(title)]
    if not rows:
        lines.append("No entries detected.")
    else:
        lines.extend(f"- {item['value']}: {item['count']}" for item in rows)
    lines.append("")
    return lines


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    coverage = summary["timestamp_coverage"]
    return [
        "Operational Summary",
        "-------------------",
        f"Overall status: {summary['overall_status']}",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Total findings: {summary['total_findings']}",
        f"Total stack traces detected: {summary['total_stack_traces_detected']}",
        f"Critical finding groups: {summary['critical_finding_groups']}",
        f"Warning finding groups: {summary['warning_finding_groups']}",
        f"HTTP 5xx count: {summary['http_5xx_count']}",
        f"Parsed timestamps count: {coverage['parsed_timestamps_count']}",
        f"Unknown timestamps count: {coverage['unknown_timestamps_count']}",
    ]
def render_markdown(report: dict[str, Any], include_stacktraces: bool) -> str:
summary = report["summary"]
lines = [
"# JVM Log Analyzer",
"",
f"- Overall status: {summary['overall_status']}",
"- Finding language is a triage summary; logs alone do not prove root cause.",
"",
]
if not report["findings"]:
lines.extend(["No configured JVM/application findings were detected.", ""])
else:
for finding in report["findings"]:
lines.extend(
[
f"## {finding['severity']}: {finding['name']}",
"",
f"- Occurrences: {finding['occurrences']}",
f"- Symptom: {finding['symptom']}",
f"- First seen: {finding['first_seen']}",
f"- Last seen: {finding['last_seen']}",
f"- Stack traces linked: {finding['stack_trace_count']}",
"",
"Sample log lines:",
"",
]
)
if finding["samples"]:
lines.append("```text")
lines.extend(finding["samples"])
lines.append("```")
else:
lines.append("_No samples retained._")
lines.append("")
if include_stacktraces and finding["stack_trace_samples"]:
lines.extend(["Stack trace samples:", ""])
for stack in finding["stack_trace_samples"]:
lines.append("```text")
lines.extend(stack)
lines.append("```")
lines.append("")
lines.extend(render_markdown_table("Top Exception Types", report["top_exception_types"]))
lines.extend(render_markdown_table("Top Operational Symptoms", report["top_operational_symptoms"]))
lines.extend(render_markdown_summary(summary))
return "\n".join(lines)
def render_markdown_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [f"## {title}", ""]
    if not rows:
        lines.extend(["No entries detected.", ""])
        return lines
    lines.extend(["| Value | Count |", "| --- | ---: |"])
    lines.extend(f"| {item['value']} | {item['count']} |" for item in rows)
    lines.append("")
    return lines
def render_markdown_summary(summary: dict[str, Any]) -> list[str]:
coverage = summary["timestamp_coverage"]
return [
"## Operational Summary",
"",
f"- Overall status: {summary['overall_status']}",
f"- Total lines scanned: {summary['total_lines_scanned']}",
f"- Total findings: {summary['total_findings']}",
f"- Total stack traces detected: {summary['total_stack_traces_detected']}",
f"- Critical finding groups: {summary['critical_finding_groups']}",
f"- Warning finding groups: {summary['warning_finding_groups']}",
f"- HTTP 5xx count: {summary['http_5xx_count']}",
f"- Parsed timestamps count: {coverage['parsed_timestamps_count']}",
f"- Unknown timestamps count: {coverage['unknown_timestamps_count']}",
"",
]
def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"


def write_report(input_path: Path, output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return
    path = Path(output_path)
    try:
        if path.resolve() == input_path.resolve():
            raise OSError("output path must not be the same as input file")
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc
def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    input_path = Path(args.file)
    if args.since is not None and args.until is not None and args.since > args.until:
        parser.error("--since must be earlier than or equal to --until")
    try:
        lines = read_log_file(input_path)
        report = analyze_log(
            lines=lines,
            patterns=compile_patterns(args.ignore_case),
            since=args.since,
            until=args.until,
            top=args.top,
            max_samples=args.max_samples,
            max_stack_lines=args.max_stack_lines,
            http_critical_threshold=args.http_critical_threshold,
        )
        if args.format == "text":
            content = render_text(report, args.include_stacktraces)
        elif args.format == "markdown":
            content = render_markdown(report, args.include_stacktraces)
        else:
            content = render_json(report)
        write_report(input_path, args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID
    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,199 @@
# known-error-matcher
`known-error-matcher` is a read-only Python CLI for scanning local log files against a JSON catalog of known operational error patterns. It connects matched log symptoms with severity, category, sample lines, and runbook references so an infrastructure engineer can decide what needs review next.
The tool matches known operational error patterns that require review. It does not prove an incident, identify root cause automatically, or replace service-specific runbooks.
## Purpose
- Identify which cataloged operational problems are visible in a collected log.
- Count how often each known error pattern appears.
- Surface warning and critical matches conservatively.
- Point operators toward relevant runbooks or supporting local tools.
- Produce predictable text, Markdown, or JSON output for incident notes.
## When To Use
- During incident response when a collected application, system, or journal extract needs quick known-error matching.
- Before attaching log evidence to an incident, problem, or change ticket.
- When teams maintain a small local catalog of operational patterns and runbook links.
- When JSON output is useful for later local automation.
## What It Does Not Do
- It does not read remote systems or live streams.
- It does not modify logs, services, applications, accounts, or host state.
- It does not query ELK, SIEM, APM, Zabbix, ticketing systems, or external services.
- It does not find root cause automatically.
- It does not prove an incident or confirm customer impact.
- It does not classify every vendor-specific log message.
## Pattern Catalog Format
Patterns are defined in JSON because the Python standard library can parse JSON without third-party dependencies.
```json
{
  "patterns": [
    {
      "id": "disk_full",
      "name": "Disk full",
      "severity": "CRITICAL",
      "regex": "No space left on device|disk full",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem or application failed because free space was exhausted."
    }
  ]
}
```
Required fields per pattern:
- `id` - stable non-empty identifier.
- `name` - human-readable finding name.
- `severity` - `WARNING` or `CRITICAL`.
- `regex` - Python regular expression used for matching.
Optional fields:
- `category` - operational grouping such as `storage`, `network`, `security`, `application`, or `systemd`. Missing values are reported as `UNKNOWN`.
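- `runbook` - repository path to review when the pattern matches. Missing values are rendered as `None` in text and Markdown reports and as an empty string in JSON output.
- `description` - short operator-facing explanation. Missing values are rendered as `None` in text and Markdown reports and as an empty string in JSON output.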
The catalog is validated before scanning starts. Invalid JSON, missing required fields, duplicate IDs, invalid severity values, and invalid regexes fail with exit code `2`.
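
The validation rules above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the tool's own `load_pattern_catalog` code: the function name `validate_catalog` is hypothetical, and it returns a list of error strings instead of exiting with code `2`.

```python
import json
import re

VALID_SEVERITIES = {"WARNING", "CRITICAL"}
REQUIRED_FIELDS = ("id", "name", "severity", "regex")


def validate_catalog(text: str) -> list[str]:
    """Return validation errors for a catalog document; an empty list means valid."""
    try:
        catalog = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(catalog, dict) or not isinstance(catalog.get("patterns"), list):
        return ['a top-level object with a "patterns" list is required']

    errors: list[str] = []
    seen_ids: set[str] = set()
    for index, item in enumerate(catalog["patterns"], start=1):
        if not isinstance(item, dict):
            errors.append(f"pattern #{index}: must be an object")
            continue
        # Every required field must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                errors.append(f"pattern #{index}: {field} is required")
        pattern_id = item.get("id")
        if isinstance(pattern_id, str) and pattern_id.strip():
            if pattern_id in seen_ids:
                errors.append(f"pattern #{index}: duplicate id {pattern_id}")
            seen_ids.add(pattern_id)
        severity = item.get("severity")
        if isinstance(severity, str) and severity.strip():
            if severity.strip().upper() not in VALID_SEVERITIES:
                errors.append(f"pattern #{index}: severity must be WARNING or CRITICAL")
        regex = item.get("regex")
        if isinstance(regex, str) and regex.strip():
            try:
                re.compile(regex)
            except re.error as exc:
                errors.append(f"pattern #{index}: invalid regex: {exc}")
    return errors
```

Collecting all errors before failing, rather than stopping at the first one, lets a catalog maintainer fix several entries in one pass.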
## Adding A Known Error Pattern
Add a new object under `patterns` in `patterns.json`:
```json
{
  "id": "example_dependency_failure",
  "name": "Example dependency failure",
  "severity": "WARNING",
  "regex": "dependency request failed|upstream dependency unavailable",
  "category": "application",
  "runbook": "infra-run/runbooks/incidents/dependency-failure.md",
  "description": "Application logged a dependency failure that requires review."
}
```
Use a stable `id`, choose the lowest severity that still reflects operational risk, and keep the regex specific enough to avoid noisy generic matches. Prefer a runbook path that already exists; otherwise use a plausible future path under `infra-run/runbooks/incidents/` or leave it empty.
## Severity Model
Overall status is conservative:
- `OK` - no known error patterns matched.
- `WARNING` - one or more warning patterns matched and no critical patterns matched.
- `CRITICAL` - one or more critical patterns matched.
The status means known error patterns require review. It is not a final root-cause statement.
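
The precedence described above (any critical match wins, then any warning match, otherwise `OK`) can be expressed as a small function. This is a minimal sketch; the name `overall_status` and the list-of-severities input are assumptions made for illustration, not the tool's internal interface.

```python
def overall_status(matched_severities: list[str]) -> str:
    """Derive the overall status from the severities of all matched patterns."""
    if "CRITICAL" in matched_severities:
        return "CRITICAL"
    if "WARNING" in matched_severities:
        return "WARNING"
    return "OK"
```

Note that the number of warning matches never escalates the status: many warnings still produce `WARNING`, and a single critical match produces `CRITICAL`.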
## Category Filtering
Use `--category CATEGORY` to include only matches where the pattern category exactly matches the provided value. The comparison is case-sensitive, so `Storage` does not select `storage`.
Examples:
```bash
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json --category storage
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --category application
```
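
The exact-match rule can be illustrated with a short sketch. The helper names and the `legacy_rule` entry are hypothetical; `disk_full` and `timeout` are pattern IDs that appear elsewhere in this README.

```python
from __future__ import annotations


def category_matches(selected: str | None, category: str) -> bool:
    """No filter matches everything; otherwise compare exactly (case-sensitive)."""
    return selected is None or selected == category


# Illustrative catalog entries only; "legacy_rule" stands in for a pattern
# that omitted its category and is therefore reported as UNKNOWN.
patterns = [
    {"id": "disk_full", "category": "storage"},
    {"id": "timeout", "category": "network"},
    {"id": "legacy_rule", "category": "UNKNOWN"},
]


def select_ids(selected: str | None) -> list[str]:
    """Return the IDs of patterns that pass the category filter."""
    return [p["id"] for p in patterns if category_matches(selected, p["category"])]
```

Under these assumptions, `select_ids("storage")` keeps only `disk_full`, while `select_ids("Storage")` keeps nothing because the comparison does not fold case.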
## Usage
```bash
cd infra-run/scripts/python/known-error-matcher
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown --output known-error-report.md
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --ignore-case
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --severity critical
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --top 10
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --max-samples 5
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident, problem, or change ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file or pattern catalog file.
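
For local automation, the JSON report can be post-processed with the standard library. The sketch below assumes the field names used by the report dictionary in `known_error_matcher.py` (`findings`, `id`, `severity`, `runbook`); the function name `critical_runbooks` is hypothetical.

```python
import json


def critical_runbooks(report_json: str) -> list[tuple[str, str]]:
    """Return (pattern id, runbook) pairs for CRITICAL findings in a JSON report."""
    report = json.loads(report_json)
    return [
        # An empty runbook string means the catalog entry did not define one.
        (finding["id"], finding["runbook"] or "None")
        for finding in report["findings"]
        if finding["severity"] == "CRITICAL"
    ]
```

Because the `findings` list is limited by `--top`, a script that needs every matched pattern should pass a sufficiently large `--top` value when generating the report.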
## Exit Codes
- `0` - OK, no known error matches.
- `1` - Known error matches detected.
- `2` - Invalid input, unreadable file, invalid JSON, invalid pattern catalog, invalid regex, bad argument, output write failure, or runtime error.
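
A wrapper script can branch on these exit codes. The following is a minimal sketch: `describe_exit_code` and `run_and_describe` are hypothetical helper names, and the wrapper assumes it is started from the tool's directory so that `known_error_matcher.py` resolves.

```python
import subprocess
import sys

EXIT_MEANINGS = {
    0: "OK: no known error matches",
    1: "known error matches detected; review the report",
    2: "invalid input or runtime error; the report may be missing",
}


def describe_exit_code(code: int) -> str:
    """Map a matcher exit code to a short operator-facing message."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_and_describe(args: list[str]) -> str:
    """Run the matcher with the given arguments and describe its exit code."""
    completed = subprocess.run(
        [sys.executable, "known_error_matcher.py", *args],
        check=False,  # exit code 1 is an expected result, not a failure to raise on
    )
    return describe_exit_code(completed.returncode)
```

Exit code `1` signals findings rather than a failed run, so wrappers should not treat every non-zero code as a tooling error.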
## Example Text Output
```text
Known Error Matcher
===================
Overall status: CRITICAL
Known error pattern matches require operator review; logs alone do not prove root cause.
[CRITICAL] database_unavailable - Database unavailable
Category: application
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Runbook: infra-run/scripts/python/jvm-log-analyzer/README.md
Description: Application logged unavailable database or database connectivity symptoms.
Samples:
- 2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 9
Known error matches: 7
Matched known error patterns: 7
Critical matched patterns: 5
Warning matched patterns: 2
Top categories: application (3), network (2), application_jvm (2)
Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
Timestamp coverage: parsed=9, unknown=0
Filters used: severity=None, category=None
Pattern catalog path: patterns.json
```
## Markdown Workflow
Generate a Markdown report from the collected log and attach it to the incident or problem ticket as supporting evidence:
```bash
python3 known_error_matcher.py \
--file examples/sample-app.log \
--patterns patterns.json \
--format markdown \
--output known-error-report.md
```
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be correlated with service health, monitoring, recent changes, dependency status, and the referenced runbook.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single log line can match multiple known error patterns.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
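- Timestamp parsing is best-effort; unparseable timestamps are reported as `UNKNOWN`. Syslog-style timestamps carry no year, so the current year is assumed when ordering first-seen and last-seen values.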
- Counts are raw log-line matches, not request rates, incident duration, or customer impact.
- `--top` limits the displayed findings and the top-category and top-known-error lists. The summary counts still reflect all matched patterns after filters.
- Large log files are read into memory; use scoped extracts for very large incidents.
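
Two of the limitations above, multiple matches per line and case-sensitive default matching, can be demonstrated with a short sketch. The regexes here are illustrative assumptions and are not copied from the shipped `patterns.json`.

```python
import re

# Illustrative patterns only; the real catalog may use different expressions.
catalog = {
    "http_503": r"HTTP 503",
    "connection_refused": r"connection refused",
    "timeout": r"timed out",
}


def matching_ids(text: str, ignore_case: bool = False) -> list[str]:
    """Return every catalog ID whose regex matches the line."""
    flags = re.IGNORECASE if ignore_case else 0
    return [pid for pid, rx in catalog.items() if re.search(rx, text, flags)]


line = "ERROR HTTP 503 upstream connection refused"
variant = "error http 503 Connection Refused"
```

With these assumptions, `matching_ids(line)` returns two IDs for one log line, `matching_ids(variant)` returns none, and `matching_ids(variant, ignore_case=True)` recovers both, which mirrors the effect of `--ignore-case`.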
## Safety Notes
- The tool only reads the input log and pattern catalog and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, tokens, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
@@ -0,0 +1,9 @@
2026-05-11 10:15:30 app01 checkout-api[1842]: INFO request_id=a1 path=/checkout status=200 duration_ms=42
2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted
2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443
2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms
2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed
2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"
2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space
2026-05-11 10:17:03 app01 checkout-api[1842]: INFO healthcheck completed status=degraded
@@ -0,0 +1,97 @@
# Known Error Matcher Report
- Overall status: `CRITICAL`
- Known error pattern matches require operator review; logs alone do not prove root cause.
## Matched Known Errors
### [CRITICAL] database_unavailable - Database unavailable
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:07`
- Last seen: `2026-05-11 10:16:07`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Application logged unavailable database or database connectivity symptoms.
- Samples:
- `2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool`
### [CRITICAL] http_500 - HTTP 500
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:02`
- Last seen: `2026-05-11 10:16:02`
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
- Description: Application or proxy logged HTTP 500 responses.
- Samples:
- `2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted`
### [CRITICAL] http_503 - HTTP 503
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:31.456`
- Last seen: `2026-05-11 10:16:31.456`
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
- Description: Application or proxy logged HTTP 503 service unavailable responses.
- Samples:
- `2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"`
### [CRITICAL] java_out_of_memory - Java OutOfMemoryError
- Category: `application_jvm`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:40`
- Last seen: `2026-05-11 10:16:40`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Java process logged memory exhaustion symptoms.
- Samples:
- `2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space`
### [CRITICAL] ssl_handshake_exception - SSLHandshakeException
- Category: `application_jvm`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:22`
- Last seen: `2026-05-11 10:16:22`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Java TLS handshake exception was logged.
- Samples:
- `2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed`
### [WARNING] connection_refused - Connection refused
- Category: `network`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:11`
- Last seen: `2026-05-11 10:16:11`
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
- Description: Client connection attempts were refused by the destination service or host.
- Samples:
- `2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443`
### [WARNING] timeout - Timeout
- Category: `network`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:15,123`
- Last seen: `2026-05-11 10:16:15,123`
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
- Description: Operation timed out and may require network, service, or dependency review.
- Samples:
- `2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms`
## Operational Summary
- Overall status: `CRITICAL`
- Total lines scanned: `9`
- Known error matches: `7`
- Matched known error patterns: `7`
- Critical matched patterns: `5`
- Warning matched patterns: `2`
- Top categories: application (3), network (2), application_jvm (2)
- Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
- Timestamp coverage: parsed=`9`, unknown=`0`
- Filters used: severity=`None`, category=`None`
- Pattern catalog path: `patterns.json`
@@ -0,0 +1,10 @@
May 11 10:15:30 web01 kernel: EXT4-fs warning: No space left on device while writing /var/log/messages
May 11 10:15:35 web01 kernel: EXT4-fs error (device dm-0): Remounting filesystem read-only
May 11 10:15:41 web01 kernel: nginx invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE)
May 11 10:15:42 web01 kernel: Out of memory: Killed process 2281 (java) total-vm:2097152kB
May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
May 11 10:16:12 web01 systemd[1]: Dependency failed for webapp.service - Local web application.
May 11 10:16:13 web01 systemd[1]: nginx.service: Start request repeated too quickly.
May 11 10:16:25 web01 sshd[3371]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.50 user=deploy
May 11 10:16:31 web01 sudo: deploy : command not allowed ; TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/bin/systemctl restart webapp
May 11 10:16:32 web01 sudo: deploy : permission denied while opening /etc/sudoers.d/webapp
@@ -0,0 +1,562 @@
#!/usr/bin/env python3
"""Match local logs against a JSON catalog of known operational error patterns."""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
VALID_SEVERITIES = {"WARNING", "CRITICAL"}
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
ISO_TIMESTAMP_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Scan a local log file for known operational error patterns."
    )
    parser.add_argument("--file", required=True, help="Local log file to scan.")
    parser.add_argument("--patterns", required=True, help="JSON known error pattern catalog.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--severity",
        choices=("warning", "critical"),
        help="Only include findings with this severity.",
    )
    parser.add_argument(
        "--category",
        help="Only include findings from this exact category.",
    )
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of matched known errors and summary entries to display. Default: 10.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per matched known error. Default: 3.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Compile catalog regex patterns case-insensitively.",
    )
    return parser
def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number
def read_text_file(path: Path, label: str) -> str:
    if not path.exists():
        raise OSError(f"{label} does not exist: {path}")
    if not path.is_file():
        raise OSError(f"{label} is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"{label} is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read {label} {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"{label} is empty: {path}")
    return text
def load_pattern_catalog(path: Path, ignore_case: bool) -> list[dict[str, Any]]:
    text = read_text_file(path, "pattern catalog")
    try:
        catalog = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in pattern catalog {path}: {exc}") from exc
    errors: list[str] = []
    if not isinstance(catalog, dict):
        raise ValueError("invalid pattern catalog: top-level JSON value must be an object")
    if "patterns" not in catalog:
        raise ValueError('invalid pattern catalog: missing top-level "patterns" field')
    if not isinstance(catalog["patterns"], list):
        raise ValueError('invalid pattern catalog: "patterns" must be a list')
    seen_ids: set[str] = set()
    compiled_patterns: list[dict[str, Any]] = []
    flags = re.IGNORECASE if ignore_case else 0
    for index, item in enumerate(catalog["patterns"], start=1):
        if not isinstance(item, dict):
            errors.append(f"pattern #{index}: must be an object")
            continue
        pattern_id = normalize_required_text(item, "id")
        name = normalize_required_text(item, "name")
        severity = normalize_required_text(item, "severity").upper()
        regex_text = normalize_required_text(item, "regex")
        if not pattern_id:
            errors.append(f"pattern #{index}: id is required and must be non-empty")
        elif pattern_id in seen_ids:
            errors.append(f"pattern #{index}: duplicate id {pattern_id}")
        else:
            seen_ids.add(pattern_id)
        if not name:
            errors.append(f"pattern {pattern_id or f'#{index}'}: name is required and must be non-empty")
        if severity not in VALID_SEVERITIES:
            errors.append(
                f"pattern {pattern_id or f'#{index}'}: severity must be WARNING or CRITICAL"
            )
        if not regex_text:
            errors.append(f"pattern {pattern_id or f'#{index}'}: regex is required and must be non-empty")
        compiled_regex = None
        if regex_text:
            try:
                compiled_regex = re.compile(regex_text, flags)
            except re.error as exc:
                errors.append(f"pattern {pattern_id or f'#{index}'}: invalid regex: {exc}")
        if pattern_id and name and severity in VALID_SEVERITIES and regex_text and compiled_regex:
            compiled_patterns.append(
                {
                    "id": pattern_id,
                    "name": name,
                    "severity": severity,
                    "regex_text": regex_text,
                    "regex": compiled_regex,
                    "category": normalize_optional_text(item, "category", UNKNOWN),
                    "runbook": normalize_optional_text(item, "runbook", ""),
                    "description": normalize_optional_text(item, "description", ""),
                }
            )
    if errors:
        raise ValueError("invalid pattern catalog:\n- " + "\n- ".join(errors))
    if not compiled_patterns:
        raise ValueError("invalid pattern catalog: no patterns configured")
    return compiled_patterns
def normalize_required_text(item: dict[str, Any], field: str) -> str:
    value = item.get(field)
    if not isinstance(value, str):
        return ""
    return value.strip()


def normalize_optional_text(item: dict[str, Any], field: str, default: str) -> str:
    value = item.get(field, default)
    if not isinstance(value, str):
        return default
    value = value.strip()
    return value if value else default
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        fraction = iso_match.group(3) or ""
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        parse_value = raw
        fmt = "%Y-%m-%d %H:%M:%S"
        if fraction:
            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
            fmt = "%Y-%m-%d %H:%M:%S.%f"
        try:
            return datetime.strptime(parse_value, fmt), raw + fraction
        except ValueError:
            return None, UNKNOWN
    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        try:
            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw
    return None, UNKNOWN
def severity_filter_matches(selected: str | None, severity: str) -> bool:
    if selected is None:
        return True
    return selected.upper() == severity


def category_filter_matches(selected: str | None, category: str) -> bool:
    if selected is None:
        return True
    return selected == category


def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
    if parsed_at is None:
        return
    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
        group["first_seen"] = (parsed_at, rendered_at)
    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
        group["last_seen"] = (parsed_at, rendered_at)


def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def append_limited(items: list[str], value: str, limit: int) -> None:
    if limit == 0:
        return
    if value in items:
        return
    if len(items) < limit:
        items.append(value)
def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    severity_filter: str | None,
    category_filter: str | None,
    top: int,
    max_samples: int,
    pattern_catalog_path: Path,
) -> dict[str, Any]:
    syslog_year = datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    top_categories = Counter()
    total_lines_scanned = 0
    parsed_timestamps = 0
    unknown_timestamps = 0
    for line in lines:
        total_lines_scanned += 1
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if parsed_at is None:
            unknown_timestamps += 1
        else:
            parsed_timestamps += 1
        for pattern in patterns:
            if not severity_filter_matches(severity_filter, pattern["severity"]):
                continue
            if not category_filter_matches(category_filter, pattern["category"]):
                continue
            if not pattern["regex"].search(line):
                continue
            group = groups.setdefault(
                pattern["id"],
                {
                    "id": pattern["id"],
                    "name": pattern["name"],
                    "severity": pattern["severity"],
                    "category": pattern["category"],
                    "runbook": pattern["runbook"],
                    "description": pattern["description"],
                    "regex": pattern["regex_text"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1
            update_seen(group, parsed_at, rendered_at)
            append_limited(group["samples"], line, max_samples)
            top_categories[pattern["category"]] += 1
    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["id"],
        ),
    )
    rendered_findings = [
        {
            **finding,
            "first_seen": render_seen(finding["first_seen"]),
            "last_seen": render_seen(finding["last_seen"]),
        }
        for finding in findings
    ]
    critical_patterns = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
    warning_patterns = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
    total_matches = sum(item["occurrences"] for item in rendered_findings)
    overall_status = "OK"
    if critical_patterns > 0:
        overall_status = "CRITICAL"
    elif warning_patterns > 0:
        overall_status = "WARNING"
    return {
        "overall_status": overall_status,
        "total_lines_scanned": total_lines_scanned,
        "total_known_error_matches": total_matches,
        "matched_pattern_count": len(rendered_findings),
        "critical_matched_pattern_count": critical_patterns,
        "warning_matched_pattern_count": warning_patterns,
        "top_categories": [
            {"category": name, "count": count}
            for name, count in top_categories.most_common(top)
        ],
        "top_known_errors": [
            {"id": item["id"], "name": item["name"], "severity": item["severity"], "count": item["occurrences"]}
            for item in rendered_findings[:top]
        ],
        "timestamp_coverage": {
            "parsed_timestamps_count": parsed_timestamps,
            "unknown_timestamps_count": unknown_timestamps,
        },
        "filters_used": {
            "severity": severity_filter.lower() if severity_filter else None,
            "category": category_filter,
        },
        "pattern_catalog_path": str(pattern_catalog_path),
        "findings": rendered_findings[:top],
        "findings_total": len(rendered_findings),
    }
def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
    if not items:
        return "None"
    return ", ".join(f"{item[key]} ({item['count']})" for item in items)
def render_text(report: dict[str, Any]) -> str:
    lines = [
        "Known Error Matcher",
        "===================",
        "",
        f"Overall status: {report['overall_status']}",
        "Known error pattern matches require operator review; logs alone do not prove root cause.",
        "",
    ]
    if report["findings"]:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['id']} - {finding['name']}",
                    f"Category: {finding['category']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    f"Runbook: {finding['runbook'] or 'None'}",
                    f"Description: {finding['description'] or 'None'}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                for sample in finding["samples"]:
                    lines.append(f" - {sample}")
            else:
                lines.append(" - None")
            lines.append("")
    else:
        lines.extend(["No known error patterns matched for the selected filters.", ""])
    lines.extend(render_summary_lines(report, markdown=False))
    return "\n".join(lines)
def render_summary_lines(report: dict[str, Any], markdown: bool) -> list[str]:
    if markdown:
        return [
            "## Operational Summary",
            "",
            f"- Overall status: `{report['overall_status']}`",
            f"- Total lines scanned: `{report['total_lines_scanned']}`",
            f"- Known error matches: `{report['total_known_error_matches']}`",
            f"- Matched known error patterns: `{report['matched_pattern_count']}`",
            f"- Critical matched patterns: `{report['critical_matched_pattern_count']}`",
            f"- Warning matched patterns: `{report['warning_matched_pattern_count']}`",
            "- Top categories: " + render_top_pairs(report["top_categories"], "category"),
            "- Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
            "- Timestamp coverage: "
            f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
            f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
            "- Filters used: "
            f"severity=`{report['filters_used']['severity'] or 'None'}`, "
            f"category=`{report['filters_used']['category'] or 'None'}`",
            f"- Pattern catalog path: `{report['pattern_catalog_path']}`",
        ]
    return [
        "Operational Summary",
        "-------------------",
        f"Overall status: {report['overall_status']}",
        f"Total lines scanned: {report['total_lines_scanned']}",
        f"Known error matches: {report['total_known_error_matches']}",
        f"Matched known error patterns: {report['matched_pattern_count']}",
        f"Critical matched patterns: {report['critical_matched_pattern_count']}",
        f"Warning matched patterns: {report['warning_matched_pattern_count']}",
        "Top categories: " + render_top_pairs(report["top_categories"], "category"),
        "Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
        "Timestamp coverage: "
        f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
        f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
        "Filters used: "
        f"severity={report['filters_used']['severity'] or 'None'}, "
        f"category={report['filters_used']['category'] or 'None'}",
        f"Pattern catalog path: {report['pattern_catalog_path']}",
    ]
def render_markdown(report: dict[str, Any]) -> str:
    lines = [
        "# Known Error Matcher Report",
        "",
        f"- Overall status: `{report['overall_status']}`",
        "- Known error pattern matches require operator review; logs alone do not prove root cause.",
        "",
    ]
    if report["findings"]:
        lines.extend(["## Matched Known Errors", ""])
        for finding in report["findings"]:
            lines.extend(
                [
                    f"### [{finding['severity']}] {finding['id']} - {finding['name']}",
                    "",
                    f"- Category: `{finding['category']}`",
                    f"- Occurrences: `{finding['occurrences']}`",
                    f"- First seen: `{finding['first_seen']}`",
                    f"- Last seen: `{finding['last_seen']}`",
                    f"- Runbook: `{finding['runbook'] or 'None'}`",
                    f"- Description: {finding['description'] or 'None'}",
                    "- Samples:",
                ]
            )
            if finding["samples"]:
                for sample in finding["samples"]:
                    lines.append(f" - `{sample}`")
            else:
                lines.append(" - `None`")
            lines.append("")
    else:
        lines.extend(["## Matched Known Errors", "", "No known error patterns matched for the selected filters.", ""])
    lines.extend(render_summary_lines(report, markdown=True))
    return "\n".join(lines)
def render_json(report: dict[str, Any]) -> str:
return json.dumps(report, indent=2)
def write_output(text: str, output_path: str | None, protected_inputs: list[Path]) -> None:
if output_path is None:
print(text)
return
destination = Path(output_path)
try:
destination_resolved = destination.resolve()
for input_path in protected_inputs:
if input_path.resolve() == destination_resolved:
raise OSError("output path must not overwrite an input file")
except FileNotFoundError as exc:
raise OSError(f"unable to resolve output path {destination}: {exc}") from exc
try:
destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
except OSError as exc:
raise OSError(f"unable to write report to {destination}: {exc}") from exc
def determine_exit_code(report: dict[str, Any]) -> int:
if report["total_known_error_matches"] > 0:
return EXIT_FINDINGS
return EXIT_OK
def main() -> int:
parser = build_parser()
args = parser.parse_args()
try:
log_path = Path(args.file)
pattern_path = Path(args.patterns)
log_text = read_text_file(log_path, "log file")
lines = log_text.splitlines()
patterns = load_pattern_catalog(pattern_path, args.ignore_case)
severity_filter = args.severity.upper() if args.severity else None
report = analyze_log(
lines=lines,
patterns=patterns,
severity_filter=severity_filter,
category_filter=args.category,
top=args.top,
max_samples=args.max_samples,
pattern_catalog_path=pattern_path,
)
if args.format == "text":
rendered = render_text(report)
elif args.format == "markdown":
rendered = render_markdown(report)
else:
rendered = render_json(report)
write_output(rendered, args.output, [log_path, pattern_path])
return determine_exit_code(report)
except (OSError, ValueError) as exc:
print(f"ERROR: {exc}", file=sys.stderr)
return EXIT_INVALID
except Exception as exc: # pragma: no cover - defensive operational fallback
print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr)
return EXIT_INVALID
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,220 @@
{
"patterns": [
{
"id": "disk_full",
"name": "Disk full",
"severity": "CRITICAL",
"regex": "No space left on device|disk full|filesystem full",
"category": "storage",
"runbook": "infra-run/scripts/bash/disk-full/README.md",
"description": "Filesystem or application failed because free space was exhausted."
},
{
"id": "inode_exhaustion",
"name": "Inode exhaustion",
"severity": "CRITICAL",
"regex": "No space left on device.*inode|inode.*exhaust|free inodes.*0",
"category": "storage",
"runbook": "infra-run/scripts/bash/disk-full/README.md",
"description": "Filesystem may have free blocks but too few available inodes."
},
{
"id": "read_only_filesystem",
"name": "Read-only filesystem",
"severity": "CRITICAL",
"regex": "read-only file system|read-only filesystem|Remounting filesystem read-only",
"category": "storage",
"runbook": "infra-run/runbooks/incidents/read-only-filesystem.md",
"description": "Filesystem writes failed because the mount was read-only or remounted read-only."
},
{
"id": "io_error",
"name": "I/O error",
"severity": "CRITICAL",
"regex": "\\bI/O error\\b|Buffer I/O error|blk_update_request.*I/O error",
"category": "storage",
"runbook": "infra-run/runbooks/incidents/storage-io-error.md",
"description": "Kernel or application reported storage I/O errors that require device and filesystem review."
},
{
"id": "out_of_memory",
"name": "Out of memory",
"severity": "CRITICAL",
"regex": "\\bout of memory\\b|Cannot allocate memory",
"category": "memory",
"runbook": "infra-run/runbooks/incidents/memory-pressure.md",
"description": "Process or host reported memory exhaustion symptoms."
},
{
"id": "oom_killer",
"name": "OOM killer invoked",
"severity": "CRITICAL",
"regex": "oom-killer|Killed process \\d+|Out of memory: Killed process",
"category": "memory",
"runbook": "infra-run/runbooks/incidents/oom-killer.md",
"description": "Kernel OOM killer activity was logged and affected processes should be reviewed."
},
{
"id": "segmentation_fault",
"name": "Segmentation fault",
"severity": "CRITICAL",
"regex": "segmentation fault|segfault",
"category": "process",
"runbook": "infra-run/runbooks/incidents/process-crash.md",
"description": "A process crash pattern was logged."
},
{
"id": "connection_refused",
"name": "Connection refused",
"severity": "WARNING",
"regex": "connection refused|ConnectException: Connection refused",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Client connection attempts were refused by the destination service or host."
},
{
"id": "connection_reset",
"name": "Connection reset",
"severity": "WARNING",
"regex": "connection reset|Connection reset by peer",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Established network connections were reset and require endpoint review."
},
{
"id": "timeout",
"name": "Timeout",
"severity": "WARNING",
"regex": "\\btimeout\\b|timed out|TimeoutException|SocketTimeoutException",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Operation timed out and may require network, service, or dependency review."
},
{
"id": "dns_resolution_failure",
"name": "DNS resolution failure",
"severity": "WARNING",
"regex": "Temporary failure in name resolution|Name or service not known|NXDOMAIN|UnknownHostException|could not resolve host",
"category": "network",
"runbook": "infra-run/runbooks/incidents/dns-resolution.md",
"description": "Name resolution failed for a host or service dependency."
},
{
"id": "certificate_expired",
"name": "Certificate expired",
"severity": "CRITICAL",
"regex": "certificate expired|CertificateExpiredException|certificate has expired|notAfter",
"category": "tls",
"runbook": "infra-run/runbooks/incidents/certificate-expired.md",
"description": "TLS certificate expiry was logged and certificate state should be reviewed."
},
{
"id": "tls_handshake_failed",
"name": "TLS handshake failed",
"severity": "WARNING",
"regex": "TLS handshake failed|SSL handshake failed|handshake_failure",
"category": "tls",
"runbook": "infra-run/runbooks/incidents/tls-handshake.md",
"description": "TLS handshake failed and may require certificate, protocol, or trust-store review."
},
{
"id": "authentication_failure",
"name": "Authentication failure",
"severity": "WARNING",
"regex": "authentication failure|Failed password|authentication failed",
"category": "security",
"runbook": "infra-run/scripts/python/auth-log-audit/README.md",
"description": "Authentication failures were logged and may require access review."
},
{
"id": "permission_denied",
"name": "Permission denied",
"severity": "WARNING",
"regex": "permission denied|access denied|denied by policy",
"category": "security",
"runbook": "infra-run/runbooks/incidents/permission-denied.md",
"description": "Access or permission denial was logged."
},
{
"id": "invalid_user",
"name": "Invalid user",
"severity": "WARNING",
"regex": "Invalid user|invalid user|user unknown|User not known",
"category": "security",
"runbook": "infra-run/scripts/python/auth-log-audit/README.md",
"description": "Log contains attempts involving invalid or unknown users."
},
{
"id": "java_out_of_memory",
"name": "Java OutOfMemoryError",
"severity": "CRITICAL",
"regex": "OutOfMemoryError|Java heap space|GC overhead limit exceeded",
"category": "application_jvm",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Java process logged memory exhaustion symptoms."
},
{
"id": "ssl_handshake_exception",
"name": "SSLHandshakeException",
"severity": "CRITICAL",
"regex": "SSLHandshakeException|javax\\.net\\.ssl\\.SSLHandshakeException",
"category": "application_jvm",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Java TLS handshake exception was logged."
},
{
"id": "database_unavailable",
"name": "Database unavailable",
"severity": "CRITICAL",
"regex": "database unavailable|database is unavailable|SQLRecoverableException|CommunicationsException|connection pool exhausted",
"category": "application",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Application logged unavailable database or database connectivity symptoms."
},
{
"id": "http_500",
"name": "HTTP 500",
"severity": "CRITICAL",
"regex": "\\bHTTP\\s+500\\b|\\bstatus=500\\b|\\s500\\s",
"category": "application",
"runbook": "infra-run/runbooks/incidents/http-5xx.md",
"description": "Application or proxy logged HTTP 500 responses."
},
{
"id": "http_503",
"name": "HTTP 503",
"severity": "CRITICAL",
"regex": "\\bHTTP\\s+503\\b|\\bstatus=503\\b|\\s503\\s|Service Unavailable",
"category": "application",
"runbook": "infra-run/runbooks/incidents/http-5xx.md",
"description": "Application or proxy logged HTTP 503 service unavailable responses."
},
{
"id": "service_failed",
"name": "Systemd service failed",
"severity": "CRITICAL",
"regex": "Failed to start .*\\.service|entered failed state|Unit .*\\.service failed|Main process exited.*status=",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd logged a failed service or failed service start."
},
{
"id": "dependency_failed",
"name": "Systemd dependency failed",
"severity": "CRITICAL",
"regex": "Dependency failed for|dependency failed",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd logged a unit dependency failure."
},
{
"id": "start_request_repeated",
"name": "Start request repeated too quickly",
"severity": "WARNING",
"regex": "Start request repeated too quickly|start request repeated too quickly",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd throttled service restarts after repeated start failures."
}
]
}
@@ -0,0 +1,164 @@
# log-diff-checker
`log-diff-checker` is a read-only Python CLI for comparing configured operational log patterns before and after a change. It is intended to help an infrastructure engineer decide whether a patch, deployment, configuration change, or service restart introduced new log risk or reduced existing noise.
The tool compares local pre-change and post-change log extracts. It does not modify input logs or system state.
## When To Use
- After a planned change when pre-check and post-check log extracts are available.
- During change validation when the question is whether errors increased, disappeared, or stayed flat.
- Before attaching log evidence to a change, incident, or problem ticket.
- When predictable text, Markdown, or JSON output is useful for local review.
## What It Does
- Reads two local text log files supplied with `--before` and `--after`.
- Scans both files for configured critical and warning patterns.
- Compares before and after counts for each detected pattern.
- Classifies patterns as `NEW`, `INCREASED`, `DECREASED`, `RESOLVED`, or `UNCHANGED`.
- Sets an overall status of `OK`, `WARNING`, or `CRITICAL`.
- Includes sample log lines from the side that best explains the change.
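The classification step listed above can be sketched as a comparison of two per-pattern counts. This is a minimal illustration that follows the status labels named in this README; the function name `classify_change` is chosen for the example and is not a documented interface.

```python
def classify_change(before_count: int, after_count: int) -> str:
    """Return the status label for one pattern's before/after counts."""
    if before_count == 0 and after_count > 0:
        return "NEW"
    if before_count > 0 and after_count == 0:
        return "RESOLVED"
    if after_count > before_count:
        return "INCREASED"
    if after_count < before_count:
        return "DECREASED"
    return "UNCHANGED"


# Walk through one example per status label.
for before, after in [(0, 1), (1, 2), (2, 1), (1, 0), (2, 2)]:
    print(before, after, classify_change(before, after))
```

Checking the zero cases first keeps `NEW` and `RESOLVED` distinct from a plain increase or decrease.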
## What It Does Not Do
- It does not read remote systems.
- It does not modify logs, services, or host state.
- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
- It does not prove root cause or change safety.
- It does not replace service-specific post-change checks.
- It does not classify every possible vendor or application error.
## Supported Input
- Two local text log files:
  - `--before` for the pre-change log extract.
  - `--after` for the post-change log extract.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
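The lenient read described above can be illustrated with the standard library alone. The sketch below assumes a helper named `read_lines_lenient` (hypothetical) and a temporary sample file; it shows an invalid byte being replaced with U+FFFD so the rest of the line stays reviewable.

```python
import tempfile
from pathlib import Path


def read_lines_lenient(path: Path) -> list[str]:
    # errors="replace" turns undecodable bytes into U+FFFD instead of raising.
    return path.read_text(encoding="utf-8", errors="replace").splitlines()


with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.log"
    # 0xFF is never valid in UTF-8, so it stands in for a corrupted byte.
    sample.write_bytes(b"ok line\nbad byte \xff here\n")
    lines = read_lines_lenient(sample)
    print(lines)
```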
## Supported Patterns
Critical patterns:
- `CRITICAL`
- `FATAL`
- `panic`
- `kernel panic`
- `no space left on device`
- `out of memory`
- `killed process`
- `read-only file system`
- `segmentation fault`
- `segfault`
- `certificate expired`
- `TLS handshake failed`
- `SSLHandshakeException`
- `database unavailable`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`
Warning patterns:
- `ERROR`
- `failed`
- `failure`
- `timeout`
- `connection refused`
- `connection reset`
- `permission denied`
- `authentication failed`
- `denied`
- `unavailable`
- `service restart`
- `retrying`
Patterns are matched as literal substrings, and matching is case-sensitive by default. Use `--ignore-case` for case-insensitive matching across all configured patterns.
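The effect of `--ignore-case` can be sketched with Python's `re` module. This is an illustration under the assumption that each pattern is escaped and compiled as a literal; `compile_literal` is a name chosen for the example.

```python
from __future__ import annotations

import re


def compile_literal(pattern: str, ignore_case: bool = False) -> re.Pattern[str]:
    # re.escape makes the pattern match as plain text, not as a regex.
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(re.escape(pattern), flags)


line = "2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING Timeout contacting cache01"
strict = compile_literal("timeout")
relaxed = compile_literal("timeout", ignore_case=True)

print(strict.search(line) is not None)   # capitalised "Timeout" is missed
print(relaxed.search(line) is not None)  # matched once case is ignored
```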
## Usage
```bash
cd infra-run/scripts/python/log-diff-checker
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format markdown
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format markdown --output change-log-diff.md
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format json
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --ignore-case
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --top 20
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --max-samples 5
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - change or incident ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to either input log file.
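The overwrite guard can be sketched by comparing resolved paths, so that an indirect spelling of an input path is still caught. The helper name `output_collides` and the temporary files are assumptions made for this example.

```python
import tempfile
from pathlib import Path


def output_collides(output: Path, inputs: list[Path]) -> bool:
    # Resolve both sides so "sub/../pre-change.log" equals "pre-change.log".
    resolved_inputs = {path.resolve() for path in inputs}
    return output.resolve() in resolved_inputs


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    (base / "sub").mkdir()
    before = base / "pre-change.log"
    after = base / "post-change.log"
    before.write_text("line\n", encoding="utf-8")
    after.write_text("line\n", encoding="utf-8")

    same_file = output_collides(base / "sub" / ".." / "pre-change.log", [before, after])
    separate = output_collides(base / "change-log-diff.md", [before, after])
    print(same_file, separate)
```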
## Exit Codes
- `0` - OK, no new or increased findings.
- `1` - New or increased findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
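A wrapper script might translate these exit codes into a review hint. The sketch below is one possible mapping; `describe_exit_code` and its wording are hypothetical and not part of the tool.

```python
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2


def describe_exit_code(code: int) -> str:
    """Translate a documented exit code into a short review hint."""
    if code == EXIT_OK:
        return "no new or increased findings"
    if code == EXIT_FINDINGS:
        return "new or increased findings; review before closing the change"
    if code == EXIT_INVALID:
        return "invalid input or runtime failure; the comparison did not complete"
    return f"undocumented exit code {code}"


for code in (0, 1, 2):
    print(code, describe_exit_code(code))
```

Treating `2` separately from `1` matters in automation: a failed comparison is not evidence of a clean change.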
## Example Text Output
```text
Log Diff Checker
================
[CRITICAL] CRITICAL - NEW
Before count: 0
After count: 1
Delta: +1
Sample source: after
Samples:
- 2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
Operational Summary
-------------------
Total lines scanned before: 7
Total lines scanned after: 8
Total unique patterns compared: 9
New findings count: 3
Increased findings count: 3
Decreased findings count: 0
Resolved findings count: 2
Unchanged findings count: 1
Overall status: CRITICAL
```
## Markdown Workflow
Generate a Markdown report from collected pre-change and post-change logs, review it, and attach it to the change ticket as supporting evidence:
```bash
python3 log_diff_checker.py \
--before examples/pre-change.log \
--after examples/post-change.log \
--format markdown \
--output change-log-diff.md
```
Treat the report as the log-based view of the change. A `CRITICAL` or `WARNING` result should be reviewed alongside service health checks, monitoring, rollback criteria, and the relevant application owner.
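How a `CRITICAL` or `WARNING` result arises can be sketched as follows. The example assumes that only `NEW` and `INCREASED` findings count as regressions and that the highest regressed severity wins; the function name and the finding dictionaries are illustrative.

```python
from __future__ import annotations


def overall_status(findings: list[dict[str, str]]) -> str:
    # Collect the severities of findings that got worse after the change.
    regressed = {
        finding["severity"]
        for finding in findings
        if finding["status"] in ("NEW", "INCREASED")
    }
    if "CRITICAL" in regressed:
        return "CRITICAL"
    if "WARNING" in regressed:
        return "WARNING"
    return "OK"


sample_findings = [
    {"severity": "CRITICAL", "status": "NEW"},
    {"severity": "WARNING", "status": "INCREASED"},
    {"severity": "WARNING", "status": "RESOLVED"},
]
print(overall_status(sample_findings))
```

Under this rule a resolved or unchanged critical pattern does not raise the status on its own.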
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single line can match multiple patterns, such as `CRITICAL`, `database unavailable`, and `unavailable`.
- Case-sensitive default matching can miss case variants such as `Failed`, `Timeout`, or `error` unless `--ignore-case` is used.
- The tool compares counts, not rates, time windows, or request volume.
- Large log files are read into memory; collect scoped extracts for very large incidents.
- `--top` limits displayed findings only. The operational summary still reflects all compared patterns.
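The overlap between patterns noted above can be shown with a small count. This sketch uses plain substring checks on one sample line and three of the configured patterns; it is an illustration, not the tool's scanning code.

```python
from collections import Counter

patterns = ["CRITICAL", "database unavailable", "unavailable"]
line = (
    "2026-05-11 10:14:31 app01 inventory-api[2294]: "
    "CRITICAL database unavailable while opening checkout connection"
)

# Every pattern found in the line is counted, so one line yields three hits.
counts = Counter(pattern for pattern in patterns if pattern in line)
print(dict(counts))
```

When reading the summary, keep in mind that the number of findings can exceed the number of distinct log lines involved.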
## Safety Notes
- The tool only reads the input logs and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
@@ -0,0 +1,8 @@
2026-05-11 10:10:01 app01 systemd[1]: Started inventory-api.service after package update.
2026-05-11 10:10:15 app01 inventory-api[2294]: INFO readiness check passed
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
2026-05-11 10:15:00 app01 inventory-api[2294]: INFO background reconciliation completed
@@ -0,0 +1,7 @@
2026-05-11 09:55:01 app01 systemd[1]: Started inventory-api.service.
2026-05-11 09:56:12 app01 inventory-api[1842]: INFO readiness check passed
2026-05-11 09:57:20 app01 inventory-api[1842]: WARNING timeout contacting cache01, retrying
2026-05-11 09:58:04 app01 inventory-api[1842]: ERROR failed to refresh optional pricing cache
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
2026-05-11 10:00:00 app01 systemd[1]: Stopping inventory-api.service for planned restart.
2026-05-11 10:00:03 app01 systemd[1]: Started inventory-api.service.
@@ -0,0 +1,134 @@
# Log Diff Checker
## CRITICAL: CRITICAL (NEW)
- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```
## CRITICAL: database unavailable (NEW)
- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```
## WARNING: unavailable (NEW)
- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```
## WARNING: failed (INCREASED)
- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
```
## WARNING: retrying (INCREASED)
- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
```
## WARNING: timeout (INCREASED)
- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
```
## WARNING: denied (RESOLVED)
- Before count: 1
- After count: 0
- Delta: -1
- Sample source: before
Sample log lines:
```text
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
```
## WARNING: permission denied (RESOLVED)
- Before count: 1
- After count: 0
- Delta: -1
- Sample source: before
Sample log lines:
```text
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
```
## WARNING: ERROR (UNCHANGED)
- Before count: 2
- After count: 2
- Delta: +0
- Sample source: after
Sample log lines:
```text
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
```
## Operational Summary
- Total lines scanned before: 7
- Total lines scanned after: 8
- Total unique patterns compared: 9
- New findings count: 3
- Increased findings count: 3
- Decreased findings count: 0
- Resolved findings count: 2
- Unchanged findings count: 1
- Overall status: CRITICAL
@@ -0,0 +1,462 @@
#!/usr/bin/env python3
"""Compare incident-oriented log patterns before and after a change."""

from __future__ import annotations

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Any

EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2

STATUS_ORDER = {
    "NEW": 0,
    "INCREASED": 1,
    "DECREASED": 2,
    "RESOLVED": 3,
    "UNCHANGED": 4,
}
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}

CRITICAL_PATTERNS = [
    "CRITICAL",
    "FATAL",
    "panic",
    "kernel panic",
    "no space left on device",
    "out of memory",
    "killed process",
    "read-only file system",
    "segmentation fault",
    "segfault",
    "certificate expired",
    "TLS handshake failed",
    "SSLHandshakeException",
    "database unavailable",
    "HTTP 500",
    "HTTP 502",
    "HTTP 503",
    "HTTP 504",
]

WARNING_PATTERNS = [
    "ERROR",
    "failed",
    "failure",
    "timeout",
    "connection refused",
    "connection reset",
    "permission denied",
    "authentication failed",
    "denied",
    "unavailable",
    "service restart",
    "retrying",
]
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Compare configured operational log patterns before and after a change."
    )
    parser.add_argument("--before", required=True, help="Pre-change local log file.")
    parser.add_argument("--after", required=True, help="Post-change local log file.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        help="Limit displayed findings after operational importance sorting.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match all configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding. Default: 3.",
    )
    return parser


def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number
def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    flags = re.IGNORECASE if ignore_case else 0
    pattern_defs: list[dict[str, str]] = []
    pattern_defs.extend(
        {"pattern": pattern, "severity": "CRITICAL"} for pattern in CRITICAL_PATTERNS
    )
    pattern_defs.extend(
        {"pattern": pattern, "severity": "WARNING"} for pattern in WARNING_PATTERNS
    )
    compiled = []
    for item in pattern_defs:
        compiled.append(
            {
                "pattern": item["pattern"],
                "severity": item["severity"],
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    return compiled


def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()


def scan_log(
    lines: list[str], patterns: list[dict[str, Any]], max_samples: int
) -> dict[str, dict[str, Any]]:
    groups: dict[str, dict[str, Any]] = {}
    for line in lines:
        for item in patterns:
            if not item["regex"].search(line):
                continue
            key = f"{item['severity']}::{item['pattern']}"
            group = groups.setdefault(
                key,
                {
                    "pattern": item["pattern"],
                    "severity": item["severity"],
                    "count": 0,
                    "samples": [],
                },
            )
            group["count"] += 1
            if len(group["samples"]) < max_samples:
                group["samples"].append(line)
    return groups
def classify_status(before_count: int, after_count: int) -> str:
    if before_count == 0 and after_count > 0:
        return "NEW"
    if before_count > 0 and after_count == 0:
        return "RESOLVED"
    if after_count > before_count:
        return "INCREASED"
    if after_count < before_count:
        return "DECREASED"
    return "UNCHANGED"


def sample_source_for(status: str) -> str:
    if status in ("NEW", "INCREASED"):
        return "after"
    if status in ("DECREASED", "RESOLVED"):
        return "before"
    return "after"


def compare_logs(
    before_lines: list[str],
    after_lines: list[str],
    patterns: list[dict[str, Any]],
    max_samples: int,
    top: int | None,
) -> dict[str, Any]:
    before_groups = scan_log(before_lines, patterns, max_samples)
    after_groups = scan_log(after_lines, patterns, max_samples)
    compared_keys = sorted(set(before_groups) | set(after_groups))
    findings = []
    for key in compared_keys:
        before_group = before_groups.get(key)
        after_group = after_groups.get(key)
        reference = before_group or after_group
        if reference is None:
            continue
        before_count = before_group["count"] if before_group is not None else 0
        after_count = after_group["count"] if after_group is not None else 0
        status = classify_status(before_count, after_count)
        source = sample_source_for(status)
        sample_group = after_group if source == "after" else before_group
        findings.append(
            {
                "pattern": reference["pattern"],
                "severity": reference["severity"],
                "before_count": before_count,
                "after_count": after_count,
                "delta": after_count - before_count,
                "status": status,
                "sample_source": source,
                "samples": sample_group["samples"] if sample_group is not None else [],
            }
        )
    sorted_findings = sorted(findings, key=finding_sort_key)
    summary = build_summary(
        before_lines=before_lines,
        after_lines=after_lines,
        findings=sorted_findings,
    )
    displayed_findings = sorted_findings if top is None else sorted_findings[:top]
    return {
        "findings": displayed_findings,
        "summary": summary,
    }


def finding_sort_key(finding: dict[str, Any]) -> tuple[int, int, int, int, str]:
    return (
        STATUS_ORDER[finding["status"]],
        SEVERITY_ORDER[finding["severity"]],
        -abs(finding["delta"]),
        -finding["after_count"],
        finding["pattern"].lower(),
    )
def build_summary(
    before_lines: list[str], after_lines: list[str], findings: list[dict[str, Any]]
) -> dict[str, Any]:
    status_counts = {
        "NEW": 0,
        "INCREASED": 0,
        "DECREASED": 0,
        "RESOLVED": 0,
        "UNCHANGED": 0,
    }
    for finding in findings:
        status_counts[finding["status"]] += 1
    critical_regressions = any(
        finding["severity"] == "CRITICAL"
        and finding["status"] in ("NEW", "INCREASED")
        for finding in findings
    )
    warning_regressions = any(
        finding["severity"] == "WARNING"
        and finding["status"] in ("NEW", "INCREASED")
        for finding in findings
    )
    if critical_regressions:
        overall_status = "CRITICAL"
    elif warning_regressions:
        overall_status = "WARNING"
    else:
        overall_status = "OK"
    return {
        "total_lines_scanned_before": len(before_lines),
        "total_lines_scanned_after": len(after_lines),
        "total_unique_patterns_compared": len(findings),
        "new_findings_count": status_counts["NEW"],
        "increased_findings_count": status_counts["INCREASED"],
        "decreased_findings_count": status_counts["DECREASED"],
        "resolved_findings_count": status_counts["RESOLVED"],
        "unchanged_findings_count": status_counts["UNCHANGED"],
        "overall_status": overall_status,
    }
def render_text(report: dict[str, Any]) -> str:
    lines = ["Log Diff Checker", "================", ""]
    if not report["findings"]:
        lines.append("No configured operational patterns were detected in either log.")
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['pattern']} - {finding['status']}",
                    f"Before count: {finding['before_count']}",
                    f"After count: {finding['after_count']}",
                    f"Delta: {finding['delta']:+d}",
                    f"Sample source: {finding['sample_source']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f" - {sample}" for sample in finding["samples"])
            else:
                lines.append(" - No samples retained")
            lines.append("")
    lines.extend(render_text_summary(report["summary"]))
    return "\n".join(lines) + "\n"


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    return [
        "Operational Summary",
        "-------------------",
        f"Total lines scanned before: {summary['total_lines_scanned_before']}",
        f"Total lines scanned after: {summary['total_lines_scanned_after']}",
        f"Total unique patterns compared: {summary['total_unique_patterns_compared']}",
        f"New findings count: {summary['new_findings_count']}",
        f"Increased findings count: {summary['increased_findings_count']}",
        f"Decreased findings count: {summary['decreased_findings_count']}",
        f"Resolved findings count: {summary['resolved_findings_count']}",
        f"Unchanged findings count: {summary['unchanged_findings_count']}",
        f"Overall status: {summary['overall_status']}",
    ]


def render_markdown(report: dict[str, Any]) -> str:
    lines = ["# Log Diff Checker", ""]
    if not report["findings"]:
        lines.extend(["No configured operational patterns were detected in either log.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"## {finding['severity']}: {finding['pattern']} ({finding['status']})",
                    "",
                    f"- Before count: {finding['before_count']}",
                    f"- After count: {finding['after_count']}",
                    f"- Delta: {finding['delta']:+d}",
                    f"- Sample source: {finding['sample_source']}",
                    "",
                    "Sample log lines:",
                    "",
                ]
            )
            if finding["samples"]:
                lines.append("```text")
                lines.extend(finding["samples"])
                lines.append("```")
            else:
                lines.append("_No samples retained._")
            lines.append("")
    summary = report["summary"]
    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Total lines scanned before: {summary['total_lines_scanned_before']}",
            f"- Total lines scanned after: {summary['total_lines_scanned_after']}",
            f"- Total unique patterns compared: {summary['total_unique_patterns_compared']}",
            f"- New findings count: {summary['new_findings_count']}",
            f"- Increased findings count: {summary['increased_findings_count']}",
            f"- Decreased findings count: {summary['decreased_findings_count']}",
            f"- Resolved findings count: {summary['resolved_findings_count']}",
            f"- Unchanged findings count: {summary['unchanged_findings_count']}",
            f"- Overall status: {summary['overall_status']}",
            "",
        ]
    )
    return "\n".join(lines)


def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"
def write_report(
    output_path: str | None, content: str, input_paths: tuple[Path, Path]
) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return
    path = Path(output_path)
    try:
        output_resolved = path.resolve()
        input_resolved = {input_path.resolve() for input_path in input_paths}
    except OSError as exc:
        raise OSError(f"unable to validate output path {path}: {exc}") from exc
    if output_resolved in input_resolved:
        raise OSError("output path must not overwrite an input log file")
    try:
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    before_path = Path(args.before)
    after_path = Path(args.after)
    try:
        before_lines = read_log_file(before_path)
        after_lines = read_log_file(after_path)
        report = compare_logs(
            before_lines=before_lines,
            after_lines=after_lines,
            patterns=compile_patterns(args.ignore_case),
            max_samples=args.max_samples,
            top=args.top,
        )
        if args.format == "text":
            content = render_text(report)
        elif args.format == "markdown":
            content = render_markdown(report)
        else:
            content = render_json(report)
        write_report(args.output, content, (before_path, after_path))
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID
    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
+5
View File
@@ -10,6 +10,11 @@ Current subdirectories are planning areas unless their own README documents a ru
- `ci-cd`
- `docker`
## Linux operations labs
- [Linux Fresh Setup Toolkit](./linux/setup/) - Bootstrap automation for fresh Ubuntu lab hosts, including shell profile, Cockpit, Docker, libvirt/KVM, NVIDIA diagnostics, tuning and safe baseline defaults.
- [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety and systemd timers.
Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready.
Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/).
+308
View File
@@ -0,0 +1,308 @@
# AI Lab Maintenance Toolkit
## Executive summary
The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
configuration backup, and non-destructive VM inventory into a small toolkit
that is readable enough for review and guarded enough for homelab use.
This is a portfolio and lab implementation, not evidence of production
certification. Review package policy, backup coverage, maintenance windows, and
application impact before deploying it to another host.
## Problem solved
AI lab hosts accumulate operating system packages, kernel packages, container
images, build cache, journals, and configuration changes while also carrying
stateful workloads. Manual maintenance is easy to defer and risky to perform
without evidence. This project provides scheduled, logged tasks with explicit
safety boundaries and separate read-only audit commands.
## What this demonstrates
- Bash strict mode, input validation, dependency checks, and operational exit
codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.
## Architecture and directory layout
```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
```
The installer deploys scripts to `/usr/local/sbin` and units to
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
through an additional framework.
## Maintenance tasks
| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |
## Safety model
Change-capable scripts default to dry-run behavior. Manual execution requires
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
use `--execute --non-interactive`; installing and enabling those reviewed unit
files is the explicit authorization for scheduled maintenance.
Exit codes follow the repository convention:
- `0`: completed successfully or an optional component was absent.
- `1`: an operational check or maintenance action failed.
- `2`: invalid input, missing required dependency, or insufficient privilege.
The scripts do not bypass APT or Docker locks, delete VM resources, manually
select kernel names for removal, or hide command failures.
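
The gating rules above can be reduced to a small decision function. This is an
illustrative sketch rather than the toolkit's code: `resolve_mode` and the mode
names it prints are hypothetical, and only the flag names and exit code `2`
come from this document.

```bash
#!/usr/bin/env bash
# Illustrative sketch: resolve_mode and its output strings are hypothetical.
resolve_mode() {
  local execute=false non_interactive=false arg
  for arg in "$@"; do
    case "$arg" in
      --execute) execute=true ;;
      --non-interactive) non_interactive=true ;;
      *) printf 'CRITICAL: unknown argument: %s\n' "$arg" >&2; return 2 ;;
    esac
  done
  # --non-interactive without --execute is invalid input (exit code 2).
  if [[ "$non_interactive" == true && "$execute" != true ]]; then
    printf 'CRITICAL: --non-interactive requires --execute\n' >&2
    return 2
  fi
  if [[ "$execute" != true ]]; then
    printf 'dry-run\n'                      # default: change nothing
  elif [[ "$non_interactive" == true ]]; then
    printf 'execute-unattended\n'           # reviewed automation, no prompt
  else
    printf 'execute-after-confirmation\n'   # operator must type EXECUTE
  fi
}
```

Calling `resolve_mode` with no arguments selects the dry-run branch, which is
why an accidental manual invocation cannot change the host.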
## Installation
Review every script and unit first. Installation changes package state,
journald settings, Docker daemon settings when Docker exists, and enabled timer
state.
```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```
The installer:
1. Installs the documented Ubuntu utilities.
2. Deploys scripts and systemd units with fixed permissions.
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
4. Restarts `systemd-journald`.
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
with `jq`, and attempts a Docker restart.
6. Enables all five timers.
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.
The installer is intended for Ubuntu 26.04. It is not run automatically by
repository validation.
## Manual commands
Read-only reports:
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```
Preview maintenance:
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```
Apply reviewed maintenance interactively:
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```
`--non-interactive` is reserved for reviewed automation and is rejected unless
`--execute` is also present.
## Systemd timers
| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |
All timers use `Persistent=true`, so a missed event runs after the host becomes
available. Inspect timer and service evidence with:
```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```
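
Before editing a schedule, an `OnCalendar=` expression can be checked with
`systemd-analyze calendar`, which prints the normalized form and the next
elapse time. A minimal sketch, assuming `systemd-analyze` is on the `PATH`;
the `check_calendar` wrapper is a hypothetical helper, not part of the toolkit.

```bash
# Hypothetical helper: validate one OnCalendar= expression.
check_calendar() {
  local expression="$1"
  if ! command -v systemd-analyze >/dev/null 2>&1; then
    printf 'INFO: systemd-analyze is not available; skipped %s\n' "$expression"
    return 0
  fi
  # Exits non-zero when the expression cannot be parsed.
  systemd-analyze calendar "$expression"
}
```

For example, `check_calendar 'Sun *-*-* 04:00:00'` checks the expression used
by the APT cleanup timer.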
## Logs
Scheduled and manual maintenance writes to:
```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```
systemd also records service output in the journal. Logrotate is installed as a
dependency, but this lab does not create a custom rotation policy for these
small maintenance logs.
## Docker policy
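Docker cleanup runs `docker system prune -af`, then
`docker builder prune -af --filter until=168h` for build cache older than
seven days. Because `docker system prune` also prunes build cache, the age
filter on the second command should not be read as a guarantee that newer
cache survives. Neither command is passed `--volumes`. Named and anonymous
volumes remain outside this automated policy and require application-aware
review.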
The installer configures the `json-file` driver with a maximum size of `50m`
and five files. Existing valid JSON is backed up and merged. Invalid JSON
causes installation to stop rather than overwrite operator configuration.
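
The merge behavior can be illustrated with `jq` object addition, where keys on
the right-hand side win and unrelated keys are preserved. A minimal sketch,
assuming `jq` is installed; `merge_log_limits` is a hypothetical wrapper name,
and the limits are the ones stated in this section.

```bash
# Hypothetical helper: read a daemon.json document on stdin and print the
# merged result on stdout. Requires jq.
merge_log_limits() {
  jq '. + {
    "log-driver": "json-file",
    "log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
  }'
}
```

For example, `printf '{"debug": true}' | merge_log_limits` keeps `debug` and
adds the log settings. Existing `log-opts` keys other than `max-size` and
`max-file` are kept as well.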
## Kernel policy
Kernel removal is delegated to `apt autoremove --purge`; package names are not
constructed or purged with regular expressions. Before execution, the script
logs the APT simulation and refuses cleanup unless at least two installed
versioned kernel image packages remain after simulated removals.
This protects a fallback kernel while preserving Ubuntu dependency policy.
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
Secure Boot state, and the simulated removal set before manual execution.
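
The fallback-kernel check can be sketched as a pure function over the installed
image list and the APT simulation text. This is an illustration under stated
assumptions: `count_remaining_images` is a hypothetical name and the package
names in the usage note are made-up sample data. It matches both `Remv` and
`Purg` lines, the two removal markers an APT simulation can print.

```bash
# Hypothetical helper.
# $1: newline-separated installed versioned kernel image package names.
# $2: text of an APT simulation (for example from `apt -s autoremove --purge`).
# Prints how many installed images would remain after the simulated removals.
count_remaining_images() {
  local installed="$1" simulation="$2" removed image remaining=0
  removed="$(awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {
    sub(/:.*/, "", $2); print $2
  }' <<<"$simulation" | sort -u)"
  while IFS= read -r image; do
    [[ -z "$image" ]] && continue
    # Count the image only when it is absent from the removal list.
    if ! grep -Fxq -- "$image" <<<"$removed"; then
      remaining=$((remaining + 1))
    fi
  done <<<"$installed"
  printf '%d\n' "$remaining"
}
```

With three sample images installed and one of them on a `Purg` line, the
function prints `2`, which is the minimum the policy accepts.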
## Backup policy
Backups are written to `/srv/backups/ailab-config` as
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
deleted only after a new archive is created.
The backup covers `/etc`, `/root/.bashrc`, `/root/.bashrc.d`,
`/opt/ailab-maintenance`, and libvirt configuration under
`/var/lib/libvirt/qemu`; any of these paths that is absent is reported and
skipped. It does not include `/var/lib/docker`, WebODM data, Ollama models, VM
disk images, or other large application datasets. Because `/etc` is archived
in full, the libvirt, Docker, and systemd configuration beneath it is already
covered without separate source entries.
This is a local configuration backup, not a disaster-recovery design. A real
deployment should copy archives to independently protected storage and test
restoration.
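
A first step toward restore testing is confirming that an archive is readable
at all. A minimal sketch, assuming GNU `tar` and `gzip`; `verify_archive` is a
hypothetical helper and is not part of the toolkit.

```bash
# Hypothetical helper: check gzip integrity and that tar can list the archive.
verify_archive() {
  local archive="$1" listing
  gzip -t "$archive" || return 1
  listing="$(tar -tzf "$archive")" || return 1
  printf 'OK: %s lists %d entries\n' "$archive" "$(grep -c . <<<"$listing")"
}
```

A listing check does not prove that the contents are sufficient for recovery;
extracting into a scratch directory and comparing selected files is still
needed.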
## Journald policy
The installer applies:
```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```
These settings bound journal growth while retaining useful troubleshooting
evidence. Capacity and retention should be adjusted to the host's disk size
and incident-response requirements.
## Disk watch policy
The disk check parses `df -P` output with `tmpfs` and `devtmpfs` filesystems
excluded, defaults to an 85 percent threshold, and returns `1` when any checked
filesystem meets or exceeds the threshold. The threshold must be an integer
from 1 to 100. Override it for a manual or unit invocation with:
```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```
The script reports every checked filesystem as `OK` or `WARNING`; it does not
delete data or attempt remediation.
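
The per-filesystem decision can be sketched as a function that classifies one
data line of `df -P` output. This is an illustration, not the script itself:
`classify_df_line` is a hypothetical name, and the threshold is assumed to be
validated as an integer before the call.

```bash
# Hypothetical helper.
# $1: threshold percentage (already validated as an integer).
# $2: one data line from `df -P`.
classify_df_line() {
  local threshold="$1" line="$2"
  local filesystem blocks used available use_percent mountpoint usage
  # The last variable receives the rest of the line, so mount points that
  # contain spaces stay intact.
  read -r filesystem blocks used available use_percent mountpoint <<<"$line"
  usage="${use_percent%\%}"
  if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
    printf 'WARNING: unable to parse usage for %s\n' "$filesystem"
    return 1
  fi
  # 10# forces base 10 so a value such as 08 is not read as invalid octal.
  if ((10#$usage >= 10#$threshold)); then
    printf 'WARNING: %s is %s used\n' "$mountpoint" "$use_percent"
    return 1
  fi
  printf 'OK: %s is %s used\n' "$mountpoint" "$use_percent"
}
```

A filesystem exactly at the threshold is classified as `WARNING`, matching the
"meets or exceeds" rule.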
## Example operational workflows
### Weekly maintenance review
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```
Review the kernel simulation, Docker usage, failed units, backup freshness, and
disk warnings before approving manual changes.
### Disk pressure investigation
```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```
Use the evidence to identify ownership. Do not treat Docker pruning or file
deletion as a substitute for application-specific retention policy.
### Post-maintenance evidence
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
| sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```
## Interview talking points
- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application
data lifecycle decisions.
- How optional dependency handling keeps one health command useful across
container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.
## Future improvements
- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on
timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
commands.
- Add package-lock detection with bounded retry policy if recurring contention
is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
dedicated read-only audit.
+103
View File
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf"
DOCKER_CONFIG="/etc/docker/daemon.json"
packages=(
logrotate
needrestart
smartmontools
nvme-cli
sysstat
iotop
ncdu
duf
jq
lsof
psmisc
tar
gzip
)
timers=(
ailab-apt-cleanup.timer
ailab-kernel-cleanup.timer
ailab-docker-cleanup.timer
ailab-config-backup.timer
ailab-disk-watch.timer
)
if ((EUID != 0)); then
printf 'CRITICAL: install.sh must run as root\n' >&2
exit 2
fi
for command_name in apt-get install systemctl; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
printf 'Installing maintenance dependencies...\n'
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
printf 'Installing scripts and systemd units...\n'
for script in "$SCRIPT_DIR"/scripts/*.sh; do
install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")"
done
for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do
install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")"
done
install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")"
tmp_journald="$(mktemp)"
trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT
cat >"$tmp_journald" <<'EOF'
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
EOF
install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN"
systemctl restart systemd-journald
if command -v docker >/dev/null 2>&1; then
printf 'Configuring Docker log rotation limits...\n'
install -d -m 0755 /etc/docker
tmp_docker="$(mktemp)"
if [[ -f "$DOCKER_CONFIG" ]]; then
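# Require a JSON object: an empty or non-object file would otherwise pass
# validation and produce an empty or failed merge.
if ! jq -e 'type == "object"' "$DOCKER_CONFIG" >/dev/null 2>&1; then
printf 'CRITICAL: %s is not a valid JSON object; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2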
exit 1
fi
backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak"
install -m 0644 "$DOCKER_CONFIG" "$backup"
jq '. + {
"log-driver": "json-file",
"log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
}' "$DOCKER_CONFIG" >"$tmp_docker"
else
jq -n '{
"log-driver": "json-file",
"log-opts": {"max-size": "50m", "max-file": "5"}
}' >"$tmp_docker"
fi
jq empty "$tmp_docker"
install -m 0644 "$tmp_docker" "$DOCKER_CONFIG"
systemctl restart docker || true
else
printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n'
fi
systemctl daemon-reload
systemctl enable --now "${timers[@]}"
printf '\nEnabled AI Lab timers:\n'
systemctl list-timers --all --no-pager | grep 'ailab-' || true
/usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt
printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n'
+66
View File
@@ -0,0 +1,66 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-apt-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
if ! command -v apt >/dev/null 2>&1; then
printf 'CRITICAL: apt is required\n' >&2
exit 2
fi
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)"
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n'
printf 'INFO: simulated autoremove follows\n'
LC_ALL=C apt -s autoremove --purge
printf 'INFO: rerun with --execute and confirm to apply changes\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
apt update
apt autoremove --purge -y
apt autoclean -y
if command -v needrestart >/dev/null 2>&1; then
needrestart -b || true
else
printf 'WARNING: needrestart is not installed\n'
fi
printf 'OK: APT cleanup completed\n'
@@ -0,0 +1,90 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-config-backup.log"
BACKUP_DIR="/srv/backups/ailab-config"
RETENTION_DAYS=30
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
for command_name in tar gzip find; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
exec > >(tee -a "$LOG_FILE") 2>&1
timestamp="$(date '+%Y%m%d-%H%M%S')"
archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz"
candidate_paths=(
/etc
/root/.bashrc
/root/.bashrc.d
/opt/ailab-maintenance
/var/lib/libvirt/qemu
)
source_paths=()
printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)"
for path in "${candidate_paths[@]}"; do
if [[ -e "$path" ]]; then
source_paths+=("${path#/}")
printf 'OK: include %s\n' "$path"
else
printf 'INFO: optional path is absent: %s\n' "$path"
fi
done
if ((${#source_paths[@]} == 0)); then
printf 'CRITICAL: no backup source paths are present\n'
exit 1
fi
printf 'Backup destination: %s\n' "$archive"
printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS"
printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n'
printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n'
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'Type EXECUTE to create the archive and apply retention: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
install -d -m 0750 "$BACKUP_DIR"
tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- "${source_paths[@]}"
find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete
printf 'OK: configuration backup created: %s\n' "$archive"
+38
View File
@@ -0,0 +1,38 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-disk-watch.log"
threshold="${AILAB_DISK_THRESHOLD:-85}"
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2
exit 2
fi
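if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((10#$threshold < 1 || 10#$threshold > 100)); then
printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2
exit 2
fi
# Normalize to base 10 so values such as 08 or 09 are not read as invalid octal.
threshold=$((10#$threshold))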
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold"
status=0
while read -r filesystem _blocks _used available use_percent mountpoint; do
usage="${use_percent%\%}"
if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint"
status=1
elif ((usage >= threshold)); then
printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \
"$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available"
status=1
else
printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
fi
done < <(df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {print $1, $2, $3, $4, $5, $6}')
exit "$status"
@@ -0,0 +1,70 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-docker-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)"
if ! command -v docker >/dev/null 2>&1; then
printf 'INFO: Docker is not installed; nothing to do\n'
exit 0
fi
if command -v systemctl >/dev/null 2>&1 && ! systemctl is-active --quiet docker; then
printf 'INFO: docker.service is inactive; nothing to do\n'
exit 0
fi
printf '\nDocker disk usage before cleanup:\n'
docker system df
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; would run docker system prune -af\n'
printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n'
printf 'INFO: Docker volumes are never included in this cleanup\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: this removes unused containers, networks, images, and old build cache, but not volumes.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
docker system prune -af
docker builder prune -af --filter "until=168h"
printf '\nDocker disk usage after cleanup:\n'
docker system df
printf 'OK: Docker cleanup completed; volumes were not pruned\n'
+111
View File
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
run_optional() {
local description="$1"
shift
if "$@"; then
return 0
fi
printf 'WARNING: %s failed\n' "$description"
return 0
}
section "Host identity"
if command -v hostnamectl >/dev/null 2>&1; then
run_optional "hostnamectl" hostnamectl
else
run_optional "hostname" hostname
fi
run_optional "kernel information" uname -a
run_optional "uptime" uptime
section "Memory"
if command -v free >/dev/null 2>&1; then
run_optional "memory report" free -h
else
printf 'WARNING: free is not available\n'
fi
section "Filesystems"
if command -v df >/dev/null 2>&1; then
run_optional "filesystem report" df -hT
printf '\nKey mountpoints present:\n'
for mountpoint in / /boot /var /srv /opt /home; do
if findmnt -rn --mountpoint "$mountpoint" >/dev/null 2>&1; then
run_optional "filesystem report for $mountpoint" df -hT "$mountpoint"
fi
done
else
printf 'WARNING: df is not available\n'
fi
section "Journal usage"
if command -v journalctl >/dev/null 2>&1; then
run_optional "journal disk usage" journalctl --disk-usage
else
printf 'WARNING: journalctl is not available\n'
fi
section "Docker"
if command -v docker >/dev/null 2>&1; then
if command -v systemctl >/dev/null 2>&1; then
run_optional "Docker service state" systemctl is-active docker
fi
run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
run_optional "Docker disk usage" docker system df
else
printf 'INFO: Docker is not installed\n'
fi
section "Libvirt"
if command -v virsh >/dev/null 2>&1; then
if command -v systemctl >/dev/null 2>&1; then
run_optional "libvirtd service state" systemctl is-active libvirtd
fi
run_optional "libvirt guest list" virsh list --all
else
printf 'INFO: virsh is not installed\n'
fi
section "NVIDIA"
if command -v nvidia-smi >/dev/null 2>&1; then
run_optional "NVIDIA status" nvidia-smi
else
printf 'INFO: nvidia-smi is not installed\n'
fi
section "Failed systemd units"
if command -v systemctl >/dev/null 2>&1; then
run_optional "failed systemd unit report" systemctl --failed --no-pager
else
printf 'WARNING: systemctl is not available\n'
fi
section "SMART quick health"
if command -v smartctl >/dev/null 2>&1; then
shopt -s nullglob
devices=(/dev/sd? /dev/nvme?n?)
shopt -u nullglob
if ((${#devices[@]} == 0)); then
printf 'INFO: no matching SATA/SCSI or NVMe devices found\n'
else
for device in "${devices[@]}"; do
printf '\n-- %s --\n' "$device"
run_optional "SMART health check for $device" smartctl -H "$device"
done
fi
else
printf 'INFO: smartctl is not installed\n'
fi
exit 0
@@ -0,0 +1,117 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
# APT autoremove respects package dependencies and kernel protection rules. That
# is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO.
LOG_FILE="/var/log/ailab-kernel-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
kernel_packages() {
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \
'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \
| awk '$1 ~ /^ii/ {print $2}' \
| sort -u || true
}
versioned_kernel_images() {
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \
| awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \
| sort -u || true
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
for command_name in apt dpkg-query uname; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)"
printf 'Running kernel: %s\n' "$(uname -r)"
printf '\nInstalled kernel-related packages before cleanup:\n'
kernel_packages
simulation="$(LC_ALL=C apt -s autoremove --purge)"
printf '\nAPT autoremove simulation:\n%s\n' "$simulation"
mapfile -t installed_images < <(versioned_kernel_images)
# APT prints "Purg" instead of "Remv" when --purge is in effect, so match both.
mapfile -t removed_images < <(
awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u
)
remaining_images=0
for image in "${installed_images[@]}"; do
remove_image=false
for removed in "${removed_images[@]}"; do
if [[ "$image" == "$removed" ]]; then
remove_image=true
break
fi
done
if [[ "$remove_image" != true ]]; then
remaining_images=$((remaining_images + 1))
fi
done
printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \
"${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images"
if ((${#installed_images[@]} < 2 || remaining_images < 2)); then
printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n'
exit 1
fi
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; no packages were removed\n'
printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: APT will remove the packages shown in the simulation above.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
apt autoremove --purge -y
apt autoclean -y
if command -v update-grub >/dev/null 2>&1; then
update-grub || true
else
printf 'WARNING: update-grub is not installed\n'
fi
printf '\nInstalled kernel-related packages after cleanup:\n'
kernel_packages
printf 'OK: kernel cleanup completed with APT-managed package selection\n'
+42
View File
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
if ! command -v virsh >/dev/null 2>&1; then
printf 'INFO: virsh is not installed; VM audit skipped\n'
exit 0
fi
section "Virtual machines"
virsh list --all || printf 'WARNING: unable to list virtual machines\n'
section "Storage pools"
virsh pool-list --all || printf 'WARNING: unable to list storage pools\n'
mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true)
for pool in "${pools[@]}"; do
section "Volumes in pool: $pool"
virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool"
done
section "Possible VM disk and installation images"
search_roots=()
for path in /var/lib/libvirt /srv /opt; do
[[ -d "$path" ]] && search_roots+=("$path")
done
if ((${#search_roots[@]} == 0)); then
printf 'INFO: no configured search roots are present\n'
else
find "${search_roots[@]}" -xdev -type f \
\( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \
-printf '%12s bytes %p\n' 2>/dev/null \
| sort -nr || true
fi
printf '\nINFO: audit complete; no files or libvirt resources were modified\n'
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe APT cleanup
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab APT cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:00:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab configuration backup
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab configuration backup daily
[Timer]
OnCalendar=*-*-* 03:30:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab disk usage check
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-disk-watch.sh
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab disk usage check hourly
[Timer]
OnCalendar=hourly
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe Docker cleanup
Requires=docker.service
After=docker.service
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab Docker cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:40:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe kernel cleanup
After=network-online.target ailab-apt-cleanup.service
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab kernel cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:20:00
Persistent=true
[Install]
WantedBy=timers.target
+276
View File
@@ -0,0 +1,276 @@
# Linux Fresh Setup Toolkit
## Executive summary
The Linux Fresh Setup Toolkit is day-0 bootstrap automation for a clean Ubuntu
lab server or workstation. It prepares a host for routine administration,
Cockpit, Docker workloads, libvirt/KVM virtual machines, optional NVIDIA
diagnostics, bounded logging, practical kernel tuning, and a conservative
security baseline.
The scripts are modular and safe to rerun. Optional components remain optional,
UFW is not enabled without a specific flag, and an NVIDIA driver is never
installed without an explicit version. This is a portfolio and homelab
implementation, not a production-certified build standard.
## Scope and non-goals
The toolkit supports Ubuntu 24.04 and newer and assumes a systemd-based host
with APT package management. It is suitable for a host such as `ailab` that may
run WebODM, Open WebUI, Homepage, NVIDIA workloads, or test virtual machines.
It does not:
- Deploy applications, containers, or virtual machines.
- Configure GPU passthrough, VFIO bindings, bridges, or Windows guests.
- Select an NVIDIA driver automatically.
- Define a complete firewall policy or compliance baseline.
- Replace backup, monitoring, patching, or ongoing maintenance processes.
- Claim live validation against every future Ubuntu release.
## Why this is separate from ailab-maintenance
This project establishes a fresh host. The sibling
[AI Lab Maintenance Toolkit](../ailab-maintenance/) handles day-2 health
checks, scheduled cleanup, configuration backup, disk monitoring, and VM
inventory after a host is operating.
Keeping bootstrap and maintenance separate makes the change boundary clear:
this toolkit installs platform capabilities and baseline configuration, while
the maintenance toolkit manages recurring operational tasks.
## Directory layout
```text
setup/
├── README.md
├── install.sh
├── scripts/
│ ├── 00-preflight.sh
│ ├── 00-platform-guard.inc
│ ├── 01-base-packages.sh
│ ├── 02-shell-profile.sh
│ ├── 03-cockpit.sh
│ ├── 04-docker.sh
│ ├── 05-libvirt.sh
│ ├── 06-nvidia-tools.sh
│ ├── 07-tuning.sh
│ ├── 08-security-baseline.sh
│ └── 99-postcheck.sh
├── files/
│ ├── bashrc.d/ailab.sh
│ ├── docker/daemon.json
│ ├── sysctl/99-ailab.conf
│ └── systemd/journald-ailab-limits.conf
└── docs/
├── fresh-install-checklist.md
├── cockpit.md
├── docker.md
├── libvirt.md
├── nvidia.md
└── bash-shell.md
```
`00-platform-guard.inc` is an internal sourced helper used by mutating
component scripts; it is not an executable profile.
## Supported profiles and flags
| Flag | Result |
| --- | --- |
| `--base` | Install operational CLI, diagnostic, storage, and network packages |
| `--shell` | Install the root AI lab Bash profile |
| `--cockpit` | Install and enable Cockpit |
| `--docker` | Install Docker and bounded JSON-file logging |
| `--libvirt` | Install and enable libvirt/KVM |
| `--nvidia-tools` | Install NVIDIA and OpenCL diagnostics without a driver |
| `--install-nvidia-driver VERSION` | Install diagnostics and the named Ubuntu driver package |
| `--tuning` | Apply journald, sysctl, sensor, and sysstat settings |
| `--security` | Install and enable fail2ban; install but do not enable UFW |
| `--enable-ufw` | Run security setup and explicitly enable UFW |
| `--all` | Run every standard profile without UFW enablement or driver installation |
`--install-nvidia-driver` implies `--nvidia-tools`. `--enable-ufw` implies
`--security`. With no flags, the installer prints help and makes no changes.
## Installation examples
Review the scripts and current host access path before execution:
```bash
cd labs/linux/setup
./install.sh
sudo ./install.sh --base --shell
sudo ./install.sh --cockpit --docker --libvirt
sudo ./install.sh --all
```
Explicit high-impact options can be combined with `--all`:
```bash
sudo ./install.sh --all --enable-ufw
sudo ./install.sh --all --install-nvidia-driver 550
```
The installer runs the read-only preflight once before the selected profiles.
Because `install.sh` sets `errexit`, the postcheck runs only after every
selected profile step succeeds.
## Fresh host workflow
1. Patch the base Ubuntu installation and confirm console or out-of-band access.
2. Review [the fresh install checklist](docs/fresh-install-checklist.md).
3. Run `sudo ./install.sh --base --shell`.
4. Add only the platform profiles needed by the host.
5. Review service state, listening ports, storage, networking, and warnings in
the postcheck.
6. Reboot if a driver or kernel-related package requires it.
7. Capture host-specific configuration and backup requirements separately.
## AI lab workflow
A general AI lab host can start with:
```bash
sudo ./install.sh --base --shell --cockpit --docker --nvidia-tools --tuning --security
```
This installs GPU diagnostics but leaves driver choice to the operator. Add
libvirt only when the host will run VMs. Enable UFW only after confirming SSH,
Cockpit, application, bridge, and VM networking requirements.
## Safety model
- Mutating profiles require root and refuse non-Ubuntu systems or Ubuntu older
than 24.04.
- Component profiles install their own direct prerequisites.
- Existing managed configuration is changed only when content differs.
- Changed root shell, Docker, journald, and sysctl files receive timestamped
backups.
- Existing valid Docker JSON is merged so unrelated settings survive.
- Invalid Docker JSON stops configuration rather than being overwritten.
- UFW and NVIDIA driver installation require explicit flags.
- Package and service failures are not hidden.
- Postcheck warnings flag optional or inactive components without failing a
diagnostic run that otherwise completed.
APT installation and service restarts are real system changes. Test first on a
disposable host and maintain a console path when changing remote access policy.
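The "change only when content differs" and "timestamped backup" rules above can be sketched as a single helper. This is a minimal illustration, not the toolkit's actual code: `install_if_changed` is a hypothetical name, and the `.YYYYMMDD-HHMMSS.bak` suffix follows the naming used by the shell profile script.

```bash
# Hypothetical helper: install SRC as DEST only when the content differs,
# keeping a timestamped backup of any file it is about to replace.
install_if_changed() {
  local src="$1" dest="$2"
  if [[ -f "$dest" ]] && cmp -s "$src" "$dest"; then
    # Identical content: no backup, no write, no service restart needed.
    printf 'unchanged: %s\n' "$dest"
    return 0
  fi
  if [[ -f "$dest" ]]; then
    cp -a -- "$dest" "${dest}.$(date '+%Y%m%d-%H%M%S').bak"
  fi
  install -m 0644 "$src" "$dest"
  printf 'updated: %s\n' "$dest"
}
```

A caller can restart the owning service only when the helper prints `updated:`, which avoids restart and backup churn on reruns.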
## Bash shell profile
The shell profile is installed as `/root/.bashrc.d/ailab.sh`, and one exact
source line is maintained in `/root/.bashrc`. It adds concise helpers for
systemd, journals, Docker, libvirt, NVIDIA, ports, archives, and disk usage.
See [Bash shell profile](docs/bash-shell.md) for command details and cautions.
## Cockpit setup
Cockpit provides browser-based host, storage, network, package, VM, metrics,
and support-report views. The installer enables `cockpit.socket` and reports
`https://HOSTNAME:9090`. `cockpit-files` is optional because it is not
available in every enabled Ubuntu repository.
See [Cockpit setup](docs/cockpit.md).
## Docker setup
The profile prefers Ubuntu's `docker.io` package and configures Docker's
official repository only when `docker.io` is unavailable. The daemon uses the
`json-file` log driver and keeps at most five 50 MB log files per container,
roughly 250 MB in total.
The toolkit configures log retention only. It does not prune data, deploy
Compose applications, or configure an NVIDIA container runtime.
See [Docker setup](docs/docker.md).
## libvirt/KVM setup
The libvirt profile installs QEMU, OVMF, software TPM support, virt-install,
virt-manager, bridge utilities, and libvirt clients and services. It enables
`libvirtd` and prints existing guests and networks.
See [libvirt/KVM setup](docs/libvirt.md).
## NVIDIA tooling
The default NVIDIA profile installs `nvtop`, `clinfo`, and PCI diagnostics.
It reports detected NVIDIA devices, `nvidia-smi`, and DKMS state when those
commands exist.
Driver installation requires a numeric version that maps to an available
Ubuntu package, for example `nvidia-driver-550`. Secure Boot enrollment,
driver suitability, CUDA, container runtime support, and passthrough remain
operator decisions.
See [NVIDIA tooling](docs/nvidia.md).
## Tuning
The tuning profile bounds persistent journal use, raises inotify limits for
development and container workloads, reduces swappiness, enables sysstat, and
runs automatic sensor detection when available.
Review these values against available memory, storage, monitoring retention,
and workload behavior before deployment beyond a lab.
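One way to review the applied values is to compare a sysctl drop-in file with the live kernel settings. The sketch below is an assumption-laden illustration: `sysctl_key_to_path` and `check_sysctl_file` are hypothetical helpers, and the parser expects plain `key=value` lines without spaces around `=`, as in `files/sysctl/99-ailab.conf`.

```bash
# Hypothetical helper: map a sysctl key to its /proc/sys path.
sysctl_key_to_path() {
  printf '/proc/sys/%s\n' "${1//.//}"
}

# Hypothetical helper: compare each key=value line in FILE with the live value.
# Returns non-zero when any value differs or cannot be read.
check_sysctl_file() {
  local file="$1" key want have status=0
  while IFS='=' read -r key want; do
    # Skip blank lines and comments.
    if [[ -z "$key" || "$key" == \#* ]]; then
      continue
    fi
    have="$(cat "$(sysctl_key_to_path "$key")" 2>/dev/null || printf 'unreadable')"
    if [[ "$have" == "$want" ]]; then
      printf 'OK: %s=%s\n' "$key" "$have"
    else
      printf 'WARNING: %s is %s, expected %s\n' "$key" "$have" "$want"
      status=1
    fi
  done <"$file"
  return "$status"
}
```

For example, `check_sysctl_file files/sysctl/99-ailab.conf` reports whether swappiness and the inotify limits match the managed file. Journal disk use can be reviewed separately with `journalctl --disk-usage`.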
## Security baseline
The security profile installs UFW and fail2ban and enables fail2ban. It leaves
UFW disabled unless `--enable-ufw` is present. Explicit UFW enablement permits
OpenSSH and TCP port 9090 before activation.
This is a minimal access-preservation baseline, not a complete host firewall or
hardening standard. Application and VM networking may require additional
reviewed rules.
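The ordering matters: access rules are added first and the firewall is enabled last. The following is a sketch of that sequence only. `08-security-baseline.sh` is not reproduced here, so the function name and the optional preview argument are assumptions, while the `ufw` subcommands themselves are standard.

```bash
# Hypothetical sketch: allow the access paths first, enable UFW last.
# Pass "echo" as the first argument to preview the commands without running them.
ufw_enable_sequence() {
  local run="${1:-}"
  # Each step must succeed before the next one runs, so UFW is never
  # enabled without the SSH and Cockpit rules in place.
  $run ufw allow OpenSSH &&
    $run ufw allow 9090/tcp &&
    $run ufw --force enable
}
```

Running `ufw_enable_sequence echo` prints the three commands for review. Running it without arguments as root applies them.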
## Postcheck
The final script reports:
- Failed systemd units.
- Cockpit, Docker, libvirt, and fail2ban status when installed.
- Running Docker containers and defined virtual machines.
- NVIDIA runtime state.
- Filesystem usage and listening ports.
Warnings require operator review, but the absence of an optional component does
not cause the postcheck itself to fail.
## Troubleshooting
Run individual read-only checks after correcting a failed profile:
```bash
sudo ./scripts/00-preflight.sh
sudo ./scripts/99-postcheck.sh
systemctl --failed
journalctl -u docker -u libvirtd -u cockpit.socket -u fail2ban
```
Common failure areas are unavailable APT repositories, unsupported package
names on a future Ubuntu release, invalid pre-existing Docker JSON, Secure Boot
module signing, disabled CPU virtualization, and remote firewall assumptions.
To roll back a managed configuration, compare the current file with its
timestamped `.bak` copy, restore the reviewed backup, and restart or reload the
owning service. Package removal is intentionally not automated because it may
affect workloads and dependencies.
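Finding the right backup to compare can be sketched as follows. `latest_backup` is a hypothetical helper that assumes the `.YYYYMMDD-HHMMSS.bak` suffix used by the toolkit's backup functions.

```bash
# Hypothetical helper: print the newest timestamped backup of a file, if any.
latest_backup() {
  local path="$1" newest="" candidate
  for candidate in "$path".*.bak; do
    # An unmatched glob stays literal, so skip names that do not exist.
    [[ -e "$candidate" ]] || continue
    # Timestamps are YYYYMMDD-HHMMSS, so lexical order matches age order.
    if [[ "$candidate" > "$newest" ]]; then
      newest="$candidate"
    fi
  done
  [[ -n "$newest" ]] || return 1
  printf '%s\n' "$newest"
}
```

A review step might then run `diff -u "$(latest_backup /etc/docker/daemon.json)" /etc/docker/daemon.json` before deciding whether to restore.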
## Interview talking points
- Why day-0 bootstrap and day-2 maintenance have separate ownership.
- How explicit flags protect firewall and GPU driver decisions.
- Why Docker JSON is validated, backed up, and merged.
- How idempotent content checks prevent backup and restart churn.
- Why preflight and postcheck evidence surround mutating profiles.
- Which virtualization, Secure Boot, IOMMU, and GPU decisions remain manual.
## Future improvements
- Add automated tests using disposable Ubuntu VMs.
- Add a documented NVIDIA Container Toolkit profile.
- Add optional non-root administrative user and group membership management.
- Add bridge and VFIO planning checks without applying passthrough changes.
- Add package compatibility matrices after validating future Ubuntu releases.
- Export postcheck results in a structured format for evidence collection.
@@ -0,0 +1,53 @@
# Bash Shell Profile
## Installation
The shell profile is installed for root:
```text
/root/.bashrc.d/ailab.sh
```
The installer maintains one exact source line in `/root/.bashrc` and backs up
changed files. Start a new Bash session or run:
```bash
source /root/.bashrc
```
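The "one exact source line" behavior can be illustrated with a small generic helper. This is a simplified sketch in the spirit of `02-shell-profile.sh`, not a copy of it: `ensure_line_once` is a hypothetical name, and it omits the backup step the installer performs.

```bash
# Hypothetical helper: make LINE appear exactly once in FILE.
ensure_line_once() {
  local file="$1" line="$2" count tmp
  touch "$file"
  # -F fixed string, -x whole-line match, -c count of matching lines.
  count="$(grep -Fxc -- "$line" "$file" || true)"
  if [[ "$count" == "1" ]]; then
    return 0
  fi
  tmp="$(mktemp)"
  # Drop every existing copy, then append a single one.
  grep -Fvx -- "$line" "$file" >"$tmp" || true
  printf '%s\n' "$line" >>"$tmp"
  cat "$tmp" >"$file"
  rm -f "$tmp"
}
```

Because the helper returns early when the count is already one, rerunning it leaves the file untouched.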
## Aliases
| Alias | Purpose |
| --- | --- |
| `ll`, `la` | Detailed and hidden-file directory listings |
| `ports` | Listening TCP/UDP sockets and processes |
| `dus`, `dufh` | Directory and filesystem usage |
| `failed`, `jerr`, `timers` | systemd failure, journal error, and timer views |
| `dps`, `ddf`, `dcu` | Docker containers, disk use, and Compose startup |
| `vms` | All libvirt guests |
| `gpu`, `gpuloop` | NVIDIA status once or refreshed every two seconds |
| `now` | Current timestamp and timezone |
`dcu` runs `docker compose up -d` in the current directory and therefore may
create or start resources. Review the Compose project before using it.
## Functions
- `svc_status SERVICE`
- `svc_logs SERVICE [LINES]`
- `docker_logs CONTAINER [LINES]`
- `docker_restart CONTAINER`
- `vm_autostart VM`
- `vm_no_autostart VM`
- `path_backup PATH`
- `extract ARCHIVE`
Functions validate argument counts, and Docker, libvirt, and NVIDIA helpers
report missing commands clearly. `path_backup` creates a timestamped adjacent
copy and can consume substantial space for large paths.
## Rollback
Review timestamped backups under `/root`, restore the desired `.bashrc` or
profile copy, and start a new shell. Avoid restoring a backup without checking
for unrelated shell changes made after bootstrap.
@@ -0,0 +1,41 @@
# Cockpit
## Purpose
The Cockpit profile installs browser-based host administration modules for
system state, storage, networking, packages, virtual machines, metrics, and
support reports. It enables the socket-activated service.
## Installation and validation
```bash
sudo ./install.sh --cockpit
systemctl status cockpit.socket
ss -ltnp | grep ':9090'
```
Connect to `https://HOSTNAME:9090`. A browser warning is expected when the
default host certificate is not trusted.
`cockpit-files` is installed when available and skipped with a warning
otherwise.
## Access and firewall
The Cockpit profile does not change UFW. Explicit toolkit UFW enablement allows
TCP 9090, but upstream firewalls and network ACLs remain external concerns.
Use normal Linux accounts and review which users may administer the host.
## Troubleshooting and rollback
```bash
journalctl -u cockpit.socket -u cockpit.service
systemctl restart cockpit.socket
apt-cache policy cockpit cockpit-machines cockpit-files
```
To disable remote access without removing packages:
```bash
sudo systemctl disable --now cockpit.socket
```
@@ -0,0 +1,56 @@
# Docker
## Package policy
The profile prefers Ubuntu's `docker.io` package. If that package is
unavailable after an APT refresh, it configures Docker's official Ubuntu
repository and installs Docker Engine, containerd, Buildx, and Compose plugins.
This fallback requires network access to `download.docker.com`.
## Daemon configuration
The managed settings are:
```json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5"
}
}
```
Existing valid `/etc/docker/daemon.json` content is preserved and merged with
these log settings. A changed file is backed up with a timestamp. Invalid JSON
causes the profile to stop rather than overwrite operator configuration.
Log limits apply only to containers created after the daemon restarts with this
configuration. Existing containers keep their original logging configuration
until they are recreated.
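The validate-then-merge behavior described above can be sketched with `jq`. This is an illustration under assumptions, not the profile's actual code: `merge_daemon_json` is a hypothetical helper, and it expects the existing file to contain a JSON object.

```bash
# Hypothetical helper: print EXISTING merged with MANAGED, or fail on invalid JSON.
merge_daemon_json() {
  local existing="$1" managed="$2"
  # jq exits non-zero on a parse error, so invalid operator JSON stops the merge.
  jq empty "$existing" || return 1
  # '*' merges objects recursively; keys from MANAGED win on conflict,
  # while unrelated keys in EXISTING survive.
  jq -s '.[0] * .[1]' "$existing" "$managed"
}
```

Writing the merged output to a temporary file and comparing it with the current `daemon.json` before replacing it keeps the operation idempotent.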
## Validation
```bash
docker version
docker compose version
docker info
docker ps
docker system df
jq . /etc/docker/daemon.json
```
## Troubleshooting and rollback
```bash
systemctl status docker
journalctl -u docker
jq empty /etc/docker/daemon.json
```
To restore a previous daemon configuration, review a timestamped backup,
replace the current file, validate it with `jq empty`, and restart Docker.
Do not restore blindly when workloads depend on newer daemon settings.
The profile does not configure Docker data roots, prune objects, deploy
applications, or install the NVIDIA Container Toolkit.
@@ -0,0 +1,47 @@
# Fresh Install Checklist
## Before bootstrap
- Confirm Ubuntu 24.04 or newer and record the release and kernel.
- Apply firmware settings for virtualization, IOMMU, or Secure Boot as needed.
- Confirm console or out-of-band access before firewall work.
- Record interfaces, addresses, routes, DNS, storage, and intended mountpoints.
- Patch the base system and reboot if required.
- Decide whether the host needs Docker, libvirt, Cockpit, or NVIDIA support.
- Review application ports and VM networking before enabling UFW.
- Confirm backups exist for any pre-existing host configuration.
## Bootstrap
Start with the least capability required:
```bash
sudo ./install.sh --base --shell
```
Add reviewed platform profiles:
```bash
sudo ./install.sh --cockpit --docker --libvirt --nvidia-tools --tuning --security
```
Do not select `--enable-ufw` until remote access and application rules are
understood. Do not install an NVIDIA driver until hardware, kernel, Secure Boot,
and workload compatibility are known.
## Post-bootstrap evidence
- Review all installer warnings.
- Run `systemctl --failed`.
- Confirm expected services with `systemctl status`.
- Review `ss -tulpn`, `df -hT`, `ip -brief address`, and `ip route`.
- Confirm Docker with `docker version` and `docker compose version`.
- Confirm libvirt with `virsh list --all` and `virsh net-list --all`.
- Confirm GPU state with `lspci -nn | grep -i nvidia` and `nvidia-smi`.
- Reboot after driver installation and repeat the postcheck.
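
The evidence commands above can be captured in one report for handover. The following is a minimal sketch: `collect_evidence` is a hypothetical helper, the command list is a subset of the checklist, and missing tools are recorded rather than treated as errors.

```bash
# Hypothetical helper: write the output of each evidence command to a report file.
collect_evidence() {
  local report="$1" cmd
  local -a checks=(
    'systemctl --failed --no-pager'
    'ss -tulpn'
    'df -hT'
    'ip -brief address'
    'ip route'
    'docker version'
    'virsh list --all'
    'nvidia-smi'
  )
  : >"$report"
  for cmd in "${checks[@]}"; do
    printf '\n== %s ==\n' "$cmd" >>"$report"
    if command -v "${cmd%% *}" >/dev/null 2>&1; then
      # Word splitting is intentional: each entry is a fixed, trusted command line.
      $cmd >>"$report" 2>&1 || printf 'WARNING: command failed\n' >>"$report"
    else
      printf 'INFO: %s is not installed\n' "${cmd%% *}" >>"$report"
    fi
  done
}
```

For example, `collect_evidence "/root/postcheck-$(date '+%Y%m%d').txt"` produces a dated report that can be attached to the handover notes.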
## Handover
Document host-specific storage, network, firewall, backup, application, GPU,
and VM decisions. Install the separate `ailab-maintenance` toolkit only after
reviewing its scheduled day-2 behavior.
@@ -0,0 +1,54 @@
# libvirt and KVM
## Purpose
The libvirt profile installs QEMU/KVM administration, UEFI firmware, software
TPM support, VM creation tools, bridge utilities, and the libvirt daemon. This
supports later Linux or Windows 11 VM work without defining guests.
## Firmware pre-checks
Confirm CPU virtualization is enabled:
```bash
lscpu | grep -E 'Virtualization|Hypervisor'
grep -Eom1 '(vmx|svm)' /proc/cpuinfo
```
IOMMU and GPU passthrough require separate firmware, kernel command-line,
device isolation, driver binding, and recovery planning. This toolkit reports
hints but does not apply those changes.
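The CPU flag check can also be wrapped in a function that names the detected technology. This is a sketch: `virt_flag` is a hypothetical helper, and the optional path argument exists only so it can be exercised against a sample file instead of `/proc/cpuinfo`.

```bash
# Hypothetical helper: report which hardware virtualization flag the CPU exposes.
virt_flag() {
  local cpuinfo="${1:-/proc/cpuinfo}" flag
  # -w matches whole words only, so flags such as svm_lock are not mistaken for svm.
  flag="$(grep -Eowm1 'vmx|svm' "$cpuinfo" | head -n 1 || true)"
  case "$flag" in
    vmx) printf 'Intel VT-x (vmx)\n' ;;
    svm) printf 'AMD-V (svm)\n' ;;
    *)
      printf 'No vmx or svm flag found in %s\n' "$cpuinfo" >&2
      return 1
      ;;
  esac
}
```

A non-zero return usually means virtualization is disabled in firmware or the host is itself a guest without nested virtualization.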
## Validation
```bash
systemctl status libvirtd
virsh list --all
virsh net-list --all
virsh pool-list --all
```
Use `virt-host-validate` when available for a broader host capability report.
Desktop use of `virt-manager` requires a graphical environment or remote
display strategy.
## Networking and Windows 11
The default libvirt NAT network is distinct from host bridge networking. Review
DHCP, DNS, forwarding, and firewall behavior before changing it.
Windows 11 typically needs UEFI and a TPM device. The installed OVMF and swtpm
packages provide those building blocks, but guest creation and licensing remain
manual.
## Troubleshooting
```bash
journalctl -u libvirtd
virsh net-info default
virsh pool-list --all
lsmod | grep kvm
```
Disabling `libvirtd` does not remove VM disks or definitions. Package removal
and VM data deletion are intentionally outside this toolkit.
@@ -0,0 +1,52 @@
# NVIDIA Tooling
## Diagnostic-only default
The normal NVIDIA profile installs `nvtop`, `clinfo`, and PCI utilities. It
does not install or select a driver:
```bash
sudo ./install.sh --nvidia-tools
```
Review hardware and current module state:
```bash
lspci -nn | grep -i nvidia
nvidia-smi
dkms status
mokutil --sb-state
```
## Explicit driver installation
Install only a reviewed Ubuntu driver package version:
```bash
sudo ./install.sh --install-nvidia-driver 550
```
The numeric value maps directly to `nvidia-driver-VERSION`. The profile refuses
an unavailable package. Reboot after installation, then validate `nvidia-smi`,
kernel logs, DKMS state, and application behavior.
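The version-to-package mapping and the availability refusal can be sketched as two small functions. `06-nvidia-tools.sh` is not reproduced here, so both helper names and the `apt-cache policy` based check are assumptions about one reasonable approach, not the script's confirmed logic.

```bash
# Hypothetical helper: map a digits-only VERSION to an Ubuntu driver package name.
driver_package_name() {
  local version="${1:-}"
  if [[ ! "$version" =~ ^[0-9]+$ ]]; then
    printf 'VERSION must contain digits only\n' >&2
    return 2
  fi
  printf 'nvidia-driver-%s\n' "$version"
}

# Hypothetical helper: succeed only when APT reports an install candidate.
driver_package_available() {
  local candidate
  candidate="$(apt-cache policy "$1" 2>/dev/null | awk '/Candidate:/ {print $2}')"
  [[ -n "$candidate" && "$candidate" != "(none)" ]]
}
```

Checking `driver_package_available "$(driver_package_name 550)"` before installation gives an early, readable failure instead of an APT error partway through a run.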
## Selection considerations
- GPU generation and supported driver branch.
- Ubuntu release, kernel, and HWE stack.
- Secure Boot module enrollment.
- CUDA or application compatibility.
- Docker NVIDIA Container Toolkit requirements.
- Whether the device will be bound to VFIO instead of the host driver.
## Troubleshooting
```bash
journalctl -k | grep -Ei 'nvidia|nouveau|NVRM'
lsmod | grep -E 'nvidia|nouveau'
dkms status
apt-cache policy 'nvidia-driver-*'
```
Driver rollback is environment-specific and is not automated. Preserve console
access and a known-good kernel before changing GPU or Secure Boot configuration.
@@ -0,0 +1,133 @@
#!/usr/bin/env bash
# AI lab operational shell helpers. This file is intended to be sourced.
alias ll='ls -alF'
alias la='ls -A'
alias ports='ss -tulpn'
alias dus='du -xhd1 2>/dev/null | sort -h'
alias dufh='df -hT'
alias failed='systemctl --failed --no-pager'
alias jerr='journalctl -p err -b --no-pager'
alias timers='systemctl list-timers --all --no-pager'
# if/else is used instead of "A && B || C" so that a failing command is not
# misreported as a missing one.
alias dps='if command -v docker >/dev/null 2>&1; then docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}"; else printf "Docker is not installed\n"; fi'
alias ddf='if command -v docker >/dev/null 2>&1; then docker system df; else printf "Docker is not installed\n"; fi'
alias dcu='if command -v docker >/dev/null 2>&1; then docker compose up -d; else printf "Docker Compose is not available\n"; fi'
alias vms='if command -v virsh >/dev/null 2>&1; then virsh list --all; else printf "virsh is not installed\n"; fi'
alias gpu='if command -v nvidia-smi >/dev/null 2>&1; then nvidia-smi; else printf "nvidia-smi is not installed\n"; fi'
alias gpuloop='if command -v nvidia-smi >/dev/null 2>&1; then watch -n 2 nvidia-smi; else printf "nvidia-smi is not installed\n"; fi'
alias now='date "+%Y-%m-%d %H:%M:%S %Z"'
svc_status() {
if (($# != 1)); then
printf 'Usage: svc_status SERVICE\n' >&2
return 2
fi
systemctl status "$1" --no-pager
}
svc_logs() {
if (($# < 1 || $# > 2)); then
printf 'Usage: svc_logs SERVICE [LINES]\n' >&2
return 2
fi
local lines="${2:-100}"
[[ "$lines" =~ ^[0-9]+$ ]] || {
printf 'LINES must be numeric\n' >&2
return 2
}
journalctl -u "$1" -n "$lines" --no-pager
}
docker_logs() {
if (($# < 1 || $# > 2)); then
printf 'Usage: docker_logs CONTAINER [LINES]\n' >&2
return 2
fi
command -v docker >/dev/null 2>&1 || {
printf 'Docker is not installed\n' >&2
return 1
}
local lines="${2:-100}"
[[ "$lines" =~ ^[0-9]+$ ]] || {
printf 'LINES must be numeric\n' >&2
return 2
}
docker logs --tail "$lines" "$1"
}
docker_restart() {
if (($# != 1)); then
printf 'Usage: docker_restart CONTAINER\n' >&2
return 2
fi
command -v docker >/dev/null 2>&1 || {
printf 'Docker is not installed\n' >&2
return 1
}
docker restart "$1"
}
vm_autostart() {
if (($# != 1)); then
printf 'Usage: vm_autostart VM\n' >&2
return 2
fi
command -v virsh >/dev/null 2>&1 || {
printf 'virsh is not installed\n' >&2
return 1
}
virsh autostart "$1"
}
vm_no_autostart() {
if (($# != 1)); then
printf 'Usage: vm_no_autostart VM\n' >&2
return 2
fi
command -v virsh >/dev/null 2>&1 || {
printf 'virsh is not installed\n' >&2
return 1
}
virsh autostart --disable "$1"
}
path_backup() {
if (($# != 1)); then
printf 'Usage: path_backup PATH\n' >&2
return 2
fi
if [[ ! -e "$1" ]]; then
printf 'Path does not exist: %s\n' "$1" >&2
return 1
fi
local destination
destination="${1%/}.$(date '+%Y%m%d-%H%M%S').bak"
cp -a -- "$1" "$destination"
printf 'Backup created: %s\n' "$destination"
}
extract() {
if (($# != 1)); then
printf 'Usage: extract ARCHIVE\n' >&2
return 2
fi
if [[ ! -f "$1" ]]; then
printf 'Archive does not exist: %s\n' "$1" >&2
return 1
fi
case "$1" in
*.tar.bz2|*.tbz2) tar xjf "$1" ;;
*.tar.gz|*.tgz) tar xzf "$1" ;;
*.tar.xz|*.txz) tar xJf "$1" ;;
*.tar) tar xf "$1" ;;
*.bz2) bunzip2 "$1" ;;
*.gz) gunzip "$1" ;;
*.zip) unzip "$1" ;;
*.7z) 7z x "$1" ;;
*.rar) unrar x "$1" ;;
*)
printf 'Unsupported archive type: %s\n' "$1" >&2
return 2
;;
esac
}
@@ -0,0 +1,7 @@
{
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5"
}
}
@@ -0,0 +1,3 @@
fs.inotify.max_user_watches=1048576
fs.inotify.max_user_instances=1024
vm.swappiness=10
@@ -0,0 +1,5 @@
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
@@ -0,0 +1,182 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
run_base=0
run_shell=0
run_cockpit=0
run_docker=0
run_libvirt=0
run_nvidia=0
run_tuning=0
run_security=0
enable_ufw=0
nvidia_driver_version=""
usage() {
cat <<'EOF'
Usage: sudo ./install.sh [OPTIONS]
Day-0 bootstrap automation for Ubuntu 24.04 or newer.
Profiles:
--base Install baseline operational packages
--shell Install the root shell profile
--cockpit Install and enable Cockpit
--docker Install and configure Docker
--libvirt Install and enable libvirt/KVM
--nvidia-tools Install NVIDIA diagnostic tools only
--install-nvidia-driver VERSION
Install diagnostic tools and the explicit driver
--tuning Install journald and sysctl tuning
--security Install fail2ban and UFW; do not enable UFW
--enable-ufw Run security profile and explicitly enable UFW
--all Run every profile without enabling UFW or
installing an NVIDIA driver
-h, --help Show this help
Examples:
sudo ./install.sh --base --shell
sudo ./install.sh --all
sudo ./install.sh --all --enable-ufw
sudo ./install.sh --nvidia-tools --install-nvidia-driver 550
EOF
}
require_supported_ubuntu() {
if [[ ! -r /etc/os-release ]]; then
printf 'CRITICAL: /etc/os-release is unavailable; refusing system changes\n' >&2
exit 2
fi
# shellcheck disable=SC1091
source /etc/os-release
if [[ "${ID:-}" != "ubuntu" ]]; then
printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
exit 2
fi
if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
"${VERSION_ID:-unknown}" >&2
exit 2
fi
}
if (($# == 0)); then
usage
exit 0
fi
while (($# > 0)); do
case "$1" in
--base)
run_base=1
;;
--shell)
run_shell=1
;;
--cockpit)
run_cockpit=1
;;
--docker)
run_docker=1
;;
--libvirt)
run_libvirt=1
;;
--nvidia-tools)
run_nvidia=1
;;
--install-nvidia-driver)
if (($# < 2)); then
printf 'CRITICAL: --install-nvidia-driver requires a VERSION\n' >&2
exit 2
fi
nvidia_driver_version="$2"
if [[ ! "$nvidia_driver_version" =~ ^[0-9]+$ ]]; then
printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
exit 2
fi
run_nvidia=1
shift
;;
--tuning)
run_tuning=1
;;
--security)
run_security=1
;;
--enable-ufw)
enable_ufw=1
run_security=1
;;
--all)
run_base=1
run_shell=1
run_cockpit=1
run_docker=1
run_libvirt=1
run_nvidia=1
run_tuning=1
run_security=1
;;
-h|--help)
usage
exit 0
;;
*)
printf 'CRITICAL: unknown option: %s\n\n' "$1" >&2
usage >&2
exit 2
;;
esac
shift
done
if ((EUID != 0)); then
printf 'CRITICAL: install.sh must run as root for selected profiles\n' >&2
exit 2
fi
for required_command in bash dpkg; do
if ! command -v "$required_command" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$required_command" >&2
exit 2
fi
done
require_supported_ubuntu
printf 'INFO: running read-only preflight\n'
"$SCRIPT_DIR/scripts/00-preflight.sh"
((run_base == 0)) || "$SCRIPT_DIR/scripts/01-base-packages.sh"
((run_shell == 0)) || "$SCRIPT_DIR/scripts/02-shell-profile.sh"
((run_cockpit == 0)) || "$SCRIPT_DIR/scripts/03-cockpit.sh"
((run_docker == 0)) || "$SCRIPT_DIR/scripts/04-docker.sh"
((run_libvirt == 0)) || "$SCRIPT_DIR/scripts/05-libvirt.sh"
if ((run_nvidia == 1)); then
if [[ -n "$nvidia_driver_version" ]]; then
"$SCRIPT_DIR/scripts/06-nvidia-tools.sh" --install-driver "$nvidia_driver_version"
else
"$SCRIPT_DIR/scripts/06-nvidia-tools.sh"
fi
fi
((run_tuning == 0)) || "$SCRIPT_DIR/scripts/07-tuning.sh"
if ((run_security == 1)); then
if ((enable_ufw == 1)); then
"$SCRIPT_DIR/scripts/08-security-baseline.sh" --enable-ufw
else
"$SCRIPT_DIR/scripts/08-security-baseline.sh"
fi
fi
printf '\nINFO: running post-install checks\n'
"$SCRIPT_DIR/scripts/99-postcheck.sh"
printf '\nOK: selected Linux setup profiles completed\n'
@@ -0,0 +1,20 @@
# shellcheck shell=bash
require_supported_ubuntu() {
if [[ ! -r /etc/os-release ]] || ! command -v dpkg >/dev/null 2>&1; then
printf 'CRITICAL: Ubuntu release detection requires /etc/os-release and dpkg\n' >&2
exit 2
fi
# shellcheck disable=SC1091
source /etc/os-release
if [[ "${ID:-}" != "ubuntu" ]]; then
printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
exit 2
fi
if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
"${VERSION_ID:-unknown}" >&2
exit 2
fi
}
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
run_optional() {
local description="$1"
shift
if "$@"; then
return 0
fi
printf 'WARNING: %s failed\n' "$description"
return 0
}
section "Operating system"
if [[ -r /etc/os-release ]]; then
run_optional "OS release report" cat /etc/os-release
else
printf 'WARNING: /etc/os-release is unavailable\n'
fi
run_optional "kernel report" uname -a
section "Host"
run_optional "hostname report" hostname
run_optional "uptime report" uptime
section "CPU and virtualization"
if command -v lscpu >/dev/null 2>&1; then
run_optional "CPU report" lscpu
printf '\nVirtualization flags:\n'
lscpu | grep -E 'Virtualization|Hypervisor vendor' || \
printf 'INFO: no virtualization summary reported by lscpu\n'
else
printf 'WARNING: lscpu is unavailable\n'
fi
if grep -Eqm1 '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' /proc/cpuinfo; then
printf 'OK: CPU virtualization flags detected\n'
else
printf 'WARNING: CPU virtualization flags were not detected\n'
fi
section "Memory"
if command -v free >/dev/null 2>&1; then
run_optional "memory report" free -h
else
run_optional "memory report" cat /proc/meminfo
fi
section "Disks"
if command -v lsblk >/dev/null 2>&1; then
run_optional "block device report" lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,MODEL
else
printf 'WARNING: lsblk is unavailable\n'
fi
run_optional "filesystem report" df -hT
section "Network"
if command -v ip >/dev/null 2>&1; then
run_optional "network interface report" ip -brief address
run_optional "route report" ip route
else
printf 'WARNING: ip is unavailable\n'
fi
section "Firmware and Secure Boot"
if [[ -d /sys/firmware/efi ]]; then
printf 'OK: boot mode is UEFI\n'
else
printf 'INFO: boot mode appears to be legacy BIOS\n'
fi
if command -v mokutil >/dev/null 2>&1; then
run_optional "Secure Boot report" mokutil --sb-state
else
printf 'INFO: mokutil is unavailable; Secure Boot state not queried\n'
fi
section "IOMMU"
if [[ -r /proc/cmdline ]]; then
printf 'Kernel command line:\n'
cat /proc/cmdline
if grep -Eq '(^|[[:space:]])(intel_iommu=on|amd_iommu=on|iommu=)' /proc/cmdline; then
printf 'OK: IOMMU-related kernel arguments detected\n'
else
printf 'INFO: no explicit IOMMU kernel argument detected\n'
fi
fi
if command -v dmesg >/dev/null 2>&1; then
dmesg 2>/dev/null | grep -Ei 'DMAR|IOMMU|AMD-Vi' | tail -n 30 || \
printf 'INFO: no readable IOMMU hints found in dmesg\n'
fi
section "NVIDIA hardware"
if command -v lspci >/dev/null 2>&1; then
lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
else
printf 'INFO: lspci is unavailable\n'
fi
section "Existing platform components"
for command_name in docker virsh cockpit-bridge; do
if command -v "$command_name" >/dev/null 2>&1; then
printf 'OK: %s is installed at %s\n' "$command_name" "$(command -v "$command_name")"
else
printf 'INFO: %s is not installed\n' "$command_name"
fi
done
if command -v systemctl >/dev/null 2>&1; then
for unit in docker.service libvirtd.service cockpit.socket; do
if systemctl cat "$unit" >/dev/null 2>&1; then
state="$(systemctl is-active "$unit" 2>/dev/null || true)"
printf 'INFO: %-20s state=%s\n' "$unit" "${state:-unknown}"
else
printf 'INFO: %s is not installed\n' "$unit"
fi
done
fi
printf '\nOK: preflight completed without modifying the host\n'
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
packages=(
curl wget git vim nano tmux byobu htop btop glances
jq unzip zip rsync tree ncdu duf
lsof strace tcpdump nmap dnsutils net-tools iperf3 ethtool
smartmontools nvme-cli lm-sensors pciutils usbutils hwinfo
sysstat iotop iftop nload
ca-certificates gnupg software-properties-common apt-transport-https
needrestart unattended-upgrades logrotate
)
if ((EUID != 0)); then
printf 'CRITICAL: base package setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
printf 'CRITICAL: apt-get is required\n' >&2
exit 2
fi
printf 'INFO: refreshing APT metadata\n'
apt-get update
printf 'INFO: installing baseline operational packages\n'
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
if command -v systemctl >/dev/null 2>&1; then
systemctl enable --now sysstat
else
printf 'WARNING: systemctl is unavailable; sysstat was not enabled\n'
fi
printf 'OK: baseline operational packages are installed\n'
@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
SOURCE_FILE="$SCRIPT_DIR/../files/bashrc.d/ailab.sh"
PROFILE_DIR="/root/.bashrc.d"
PROFILE_FILE="$PROFILE_DIR/ailab.sh"
BASHRC="/root/.bashrc"
SOURCE_LINE='[[ -f /root/.bashrc.d/ailab.sh ]] && source /root/.bashrc.d/ailab.sh'
backup_file() {
local path="$1"
local backup
backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
install -m 0644 "$path" "$backup"
printf 'INFO: backed up %s to %s\n' "$path" "$backup"
}
if ((EUID != 0)); then
printf 'CRITICAL: shell profile setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
if [[ ! -r "$SOURCE_FILE" ]]; then
printf 'CRITICAL: shell profile source is missing: %s\n' "$SOURCE_FILE" >&2
exit 2
fi
install -d -m 0755 "$PROFILE_DIR"
if [[ ! -f "$PROFILE_FILE" ]] || ! cmp -s "$SOURCE_FILE" "$PROFILE_FILE"; then
if [[ -f "$PROFILE_FILE" ]]; then
backup_file "$PROFILE_FILE"
fi
install -m 0644 "$SOURCE_FILE" "$PROFILE_FILE"
printf 'OK: installed %s\n' "$PROFILE_FILE"
else
printf 'OK: shell profile is already current\n'
fi
if [[ ! -f "$BASHRC" ]]; then
install -m 0644 /dev/null "$BASHRC"
fi
source_count="$(grep -Fxc "$SOURCE_LINE" "$BASHRC" || true)"
if [[ "$source_count" != "1" ]]; then
tmp_bashrc="$(mktemp)"
trap 'rm -f "$tmp_bashrc"' EXIT
grep -Fvx "$SOURCE_LINE" "$BASHRC" >"$tmp_bashrc" || true
printf '\n%s\n' "$SOURCE_LINE" >>"$tmp_bashrc"
backup_file "$BASHRC"
install -m 0644 "$tmp_bashrc" "$BASHRC"
printf 'OK: configured %s to source the AI lab profile exactly once\n' "$BASHRC"
else
printf 'OK: %s already sources the AI lab profile exactly once\n' "$BASHRC"
fi

Some files were not shown because too many files have changed in this diff.