Compare commits
15 Commits
a527022518..main
| Author | SHA1 | Date |
|---|---|---|
| | 4e739c5c99 | |
| | 8cb92de06f | |
| | 1843796e92 | |
| | cd6830334b | |
| | e2624a7533 | |
| | 6475f76787 | |
| | e851568c8c | |
| | 8a7b7c5abc | |
| | 1636f46f81 | |
| | 5fc96348c5 | |
| | 89b7fabb96 | |
| | 2da5e8b46c | |
| | 452ff4fac1 | |
| | 5dde403ce3 | |
| | 61483c233f | |
@@ -28,6 +28,9 @@ jobs:
          -P infra-run/scripts/bash/gpfs \
          -P infra-run/scripts/bash/veritas

      - name: Python syntax checks
        run: bash scripts/check-python.sh

      - name: yamllint
        run: yamllint .

@@ -41,6 +41,7 @@ Focused checks:
```bash
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```

@@ -64,6 +65,15 @@ Also run targeted checks for changed files, such as `bash -n`, `ansible-playbook
- Exit codes: `0` OK, `1` operational issue, `2` invalid input or missing dependency.
- Keep scripts readable; separate discovery, pre-check, change, post-check, and reporting when it helps.
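
A minimal sketch of that phase separation, assuming a hypothetical `run_workflow` entry point and a `DRY_RUN` variable (neither name is taken from the repository scripts). Each phase is its own function, the change phase only prints what it would do unless `DRY_RUN=0`, and the return codes follow the `0`/`1`/`2` convention above:

```bash
#!/usr/bin/env bash
# Illustrative skeleton only: function names, messages, and DRY_RUN are assumptions.

discover() { printf 'discovery: target=%s\n' "$1"; }

pre_check() {
  if [[ ! -d "$1" ]]; then
    printf 'pre-check: %s is not a directory\n' "$1"
    return 1
  fi
  printf 'pre-check: OK\n'
}

apply_change() {
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    printf 'change: DRY-RUN, would modify %s\n' "$1"
  else
    printf 'change: modifying %s\n' "$1"   # a real change command would go here
  fi
}

post_check() { printf 'post-check: OK\n'; }
report()     { printf 'report: exit code %s\n' "$1"; }

run_workflow() {
  local target="${1:-}" rc=0
  if [[ -z "$target" ]]; then
    printf 'invalid input: target is required\n'
    return 2                     # invalid input
  fi
  discover "$target"
  pre_check "$target" || rc=1    # operational issue
  if (( rc == 0 )); then
    apply_change "$target"
    post_check || rc=1
  fi
  report "$rc"
  return "$rc"
}
```

Calling `run_workflow /some/dir` walks the phases in order; the reporting phase still runs when the pre-check fails, so the output can be pasted into a ticket either way.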

## Python Standards

- Use Python for parsing, reporting, and structured operational tooling where it adds value over Bash.
- Keep Python tools read-only by default.
- Prefer the Python standard library.
- Avoid frameworks and unnecessary abstractions.
- Use clear operational output and meaningful exit codes.
- Keep tools small, focused, and easy to validate.

## Ansible Standards

- Keep playbooks short and roles simple.

@@ -4,6 +4,17 @@

### Added

- Added Linux Fresh Setup Toolkit under `labs/linux/setup` for day-0 Ubuntu lab host bootstrap automation.
- Added AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation.
- Python tooling validation for operational scripts.
- `incident-log-summary` for general incident log summarization.
- `log-diff-checker` for pre-change and post-change log comparison.
- `auth-log-audit` for Linux authentication log review.
- `jvm-log-analyzer` for JVM application log summaries.
- `journal-analyzer` for exported `journalctl` log review.
- `known-error-matcher` with JSON-based known error patterns.
- Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics.
- `incident_triage_report.sh` for L2 Markdown incident handover reports built from existing Bash incident checks.
- Repository-level Codex guidance:
  - `AGENTS.md`
  - `docs/codex/README.md`
@@ -27,12 +38,15 @@
- IBM AIX 7 role and playbook.
- Shared sanitized Ansible inventory defaults for Linux and AIX examples.
- Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.

### Changed

- Updated root, `infra-run`, Bash, Ansible, platform, and lab README guidance for safety-first usage, validation, and future Codex-driven work.
- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.
- Updated Python tooling documentation and repository roadmap.
- Integrated Python syntax validation into repository validation workflow and CI.

### Notes

Binary file not shown.
@@ -30,10 +30,19 @@ It is a technical portfolio, not a production toolkit. The examples show how ope

- [infra-run](./infra-run/) - the main implemented project in this repository.
- [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
- [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
- [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
- [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
- [Incident log summary](./infra-run/scripts/python/incident-log-summary/) - read-only Python helper for local incident log pattern summaries.
- [Log diff checker](./infra-run/scripts/python/log-diff-checker/) - read-only Python helper for before/after change log comparison.
- [Auth log audit](./infra-run/scripts/python/auth-log-audit/) - read-only Python helper for local authentication log review.
- [JVM log analyzer](./infra-run/scripts/python/jvm-log-analyzer/) - read-only Python helper for local JVM and Java application log review.
- [Journal analyzer](./infra-run/scripts/python/journal-analyzer/) - read-only Python helper for exported `journalctl` text review.
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.

## Planned Areas

@@ -78,10 +87,11 @@ Basic local validation:
./scripts/validate-repo.sh
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```

-The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Set `STRICT=1` to fail when optional tools are missing.
+The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Python checks use `python3 -m py_compile` and do not require external Python tooling. Set `STRICT=1` to fail when optional tools are missing.

Some scripts depend on platform tools such as `vxdisk`, `hagrp`, `mmcrnsd`, and `mmlscluster`. Those commands are not expected to exist on a normal workstation, so functional testing against Veritas or GPFS requires a real lab environment.

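The optional-tool behaviour described above (skip when a tool is missing, fail when `STRICT=1`) could look roughly like the sketch below. `run_optional` and `check_python_file` are hypothetical helper names rather than the repository's actual functions, and returning `2` for a missing tool is an assumption based on the "missing dependency" exit code used elsewhere in the repository.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and messages are assumptions, not the real check scripts.

# Run an optional tool if it is installed; otherwise skip, or fail when STRICT=1.
run_optional() {
  local tool="$1"
  shift
  if ! command -v "$tool" >/dev/null 2>&1; then
    if [[ "${STRICT:-0}" == "1" ]]; then
      printf 'FAIL: optional tool %s is missing and STRICT=1\n' "$tool"
      return 2
    fi
    printf 'SKIP: %s not installed\n' "$tool"
    return 0
  fi
  "$tool" "$@"
}

# Syntax-check one Python file with the standard library only.
# Note: py_compile writes a .pyc file under __pycache__ next to the source.
check_python_file() {
  python3 -m py_compile "$1"
}
```

For example, `run_optional shellcheck script.sh` runs `shellcheck` when it is installed and prints a `SKIP` line otherwise, while `STRICT=1 run_optional shellcheck script.sh` turns the missing tool into a failure.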
@@ -90,10 +100,12 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
## Operational Areas Demonstrated

- Linux operations triage and reporting.
- Local operational log analysis with read-only Python helpers.
- Disk pressure and deleted-file incident analysis.
- Dry-run-first Bash automation.
- Controlled storage change workflow design.
- Veritas VxVM/VCS operational awareness.
- GPFS / IBM Spectrum Scale operational awareness.
- Ansible role organization for selected hardening controls.
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
- Clear documentation of what was tested and what still needs a real system.

+19
-2
@@ -16,6 +16,23 @@ This file keeps future portfolio ideas in one place so empty folders do not look
- Clustering: service group checks, failover review, and operational checklists.
- Monitoring: Zabbix-oriented alert review and host onboarding notes.
- Virtualization: VM lifecycle and platform operations examples.
-- Log analysis: ELK-style search examples for incident review.
+- Log analysis: optional ELK-style search case study under `platform-projects`, separate from current local Python helpers.

-Nothing in this roadmap should be read as completed implementation.
## Implemented Portfolio Additions

- Standalone Bash incident checks under `infra-run/scripts/bash/incident-checks/` for common Linux incident triage and ticket evidence.
- Python operational log analysis suite under `infra-run/scripts/python/`:
  - `incident-log-summary`
  - `log-diff-checker`
  - `auth-log-audit`
  - `jvm-log-analyzer`
  - `journal-analyzer`
  - `known-error-matcher`

## Future Python Tooling Ideas

- Real-world sample report examples using sanitized evidence.
- Integration examples that combine log summaries with change evidence collection.
- A shared Python helper library only if the standalone tools begin duplicating enough stable behavior to justify it.

+Planned sections remain future work unless listed as implemented.

+22
-2
@@ -1,16 +1,35 @@
# infra-run

-`infra-run` is a sanitized infrastructure operations project. It contains Bash and Ansible examples based on Linux administration, incident response, storage operations, hardening, prechecks, postchecks, and controlled change workflows.
+`infra-run` is a sanitized infrastructure operations project. It contains Bash, Ansible, Python, and documentation examples based on Linux administration, incident response, storage operations, hardening, prechecks, postchecks, and controlled change workflows.

The goal is to show operational judgment, not to ship a universal automation product.

## Current Contents

### Bash Operational Scripts

- [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, JVM diagnostics, and an L2 Markdown triage report wrapper.
- [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
- [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
- [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.

### Python Log And Reporting Tools

- [scripts/python](./scripts/python/) - read-only Python operational helpers using the standard library only.
- [scripts/python/incident-log-summary](./scripts/python/incident-log-summary/) - read-only Python log summary helper for incident pattern review.
- [scripts/python/log-diff-checker](./scripts/python/log-diff-checker/) - read-only Python before/after log comparison helper for change review.
- [scripts/python/auth-log-audit](./scripts/python/auth-log-audit/) - read-only Python authentication log audit helper for SSH, sudo, su, and PAM review.
- [scripts/python/jvm-log-analyzer](./scripts/python/jvm-log-analyzer/) - read-only Python JVM and Java application log analyzer for exception, stack trace, HTTP 5xx, database, and TLS review.
- [scripts/python/journal-analyzer](./scripts/python/journal-analyzer/) - read-only Python exported journal analyzer for failed units, restart patterns, OOM events, and service warnings.
- [scripts/python/known-error-matcher](./scripts/python/known-error-matcher/) - read-only Python matcher for local logs and JSON known-error catalogs with runbook references.

### Ansible Automation

- [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.

### Runbooks And Documentation

- [examples](./examples/) - sanitized sample command outputs and incident notes.

## Documentation
@@ -36,6 +55,7 @@ The goal is to show operational judgment, not to ship a universal automation pro
- Bash syntax can be checked locally.
- Shell scripts can be reviewed and partially exercised on a Linux workstation when platform commands are available or mocked.
- Disk-full read-only scripts can be run against local paths for basic behavior checks.
- Python log analysis examples can be run against sanitized sample logs under each tool directory.
- Ansible YAML and role structure can be linted locally.

## Running Safely
@@ -70,7 +90,7 @@ From the repository root:
./scripts/validate-repo.sh
```

-Focused checks are available in `scripts/check-bash.sh`, `scripts/check-ansible.sh`, and `scripts/check-docs.sh`. If `ansible-lint` reports collection-related issues, install the collections listed in [ansible/collections/requirements.yml](./ansible/collections/requirements.yml) and rerun it. Treat lint as a starting point; platform testing still requires actual target systems.
+Focused checks are available in `scripts/check-bash.sh`, `scripts/check-ansible.sh`, `scripts/check-python.sh`, and `scripts/check-docs.sh`. If `ansible-lint` reports collection-related issues, install the collections listed in [ansible/collections/requirements.yml](./ansible/collections/requirements.yml) and rerun it. Treat lint as a starting point; platform testing still requires actual target systems.

## Supporting Notes

@@ -8,6 +8,21 @@ This file tracks planned `infra-run` additions without presenting them as comple
- A small Python parser for converting script output into a markdown change note.
- Additional Ansible molecule or container-based syntax checks where platform support is realistic.
- Standalone runbooks that reference the existing Bash workflows.
- Shared known-error pattern catalog review.
- Additional links between Python findings and existing runbooks.
- Change evidence collector for pre-check and post-check notes.
- Report examples suitable for incident and change tickets.
- Optional wrapper command only after the standalone Python tools stabilize.

## Implemented Additions

- `infra-run/scripts/bash/incident-checks/` - standalone read-only Bash checks for CPU, memory/OOM, service restart loops, failed SSH logins, TLS certificate expiry, DNS connectivity, time sync drift, read-only filesystems, inode pressure, and JVM process diagnostics.
- `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files.
- `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review.
- `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review.
- `infra-run/scripts/python/jvm-log-analyzer/` - read-only JVM and Java application log analyzer for exceptions, stack traces, HTTP 5xx entries, database issues, TLS failures, and JVM failure symptoms.
- `infra-run/scripts/python/journal-analyzer/` - read-only exported `journalctl` text analyzer for summarizing failed units, dependency issues, restart patterns, OOM findings, disk/filesystem symptoms, and related service warnings.
- `infra-run/scripts/python/known-error-matcher/` - read-only known-error matcher for local logs and JSON pattern catalogs with severity, category, samples, and runbook references.

## Not Planned

@@ -7,5 +7,6 @@ These files use fake hostnames, reserved example domains, reserved IP address ra
## Included

- `disk-full/` - sample filesystem usage, deleted open files, and a short after-action report.
- `incident-triage/` - sample L2 incident triage report for repeatable handoff and ticket evidence.
- `veritas/` - sample VxVM disk and VCS service group output.
- `gpfs/` - sample GPFS cluster and NSD output.

@@ -0,0 +1,131 @@
# L2 Incident Triage Report

- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago

## Executed Checks

| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |

## Summary

- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected

## Raw Evidence

### CPU saturation

Script: `check_high_cpu.sh`

Command: `./check_high_cpu.sh`

Status: OK, exit: 0

```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)

Load average:
1m=0.42 5m=0.38 15m=0.31

Top CPU processes:
PID PPID USER %CPU %MEM COMMAND ARGS
1450 1 app 7.2 2.1 nginx nginx: worker process

Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```

### Memory and OOM

Script: `check_high_memory_oom.sh`

Command: `./check_high_memory_oom.sh --since "30 minutes ago"`

Status: WARNING, exit: 1

```text
WARNING: Memory usage is 84% and swap usage is 12%

Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583

OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```

### Service restart loop

Script: `check_service_restart_loop.sh`

Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`

Status: OK, exit: 0

```text
OK: Service nginx state=active substate=running restarts=0

Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```

### Skipped or limited checks

```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
```

## L2 Handover Checklist

- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified

## Escalation Notes

- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.

## Recommended Next Steps

- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
@@ -1,6 +1,6 @@
# infra-run/scripts

-This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from future Python-based utilities while keeping both under one automation entry point.
+This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from Python-based analysis utilities while keeping both under one automation entry point.

## Diagram

@@ -9,16 +9,17 @@ flowchart TD
A["scripts"] --> B["bash"]
A --> C["python"]
B --> D["Operational toolkits"]
-C --> E["Future helper utilities"]
+C --> E["Analysis helper utilities"]
```

## Scope

-- `bash` - current implementation area with operations toolkits.
-- `python` - reserved space for future supporting utilities.
+- [bash](./bash/) - operational toolkits for host health checks, disk-full triage, Veritas examples, and GPFS examples.
+- [python](./python/) - read-only tools for local log parsing, reporting, and structured operational analysis.

## Notes

- The repository currently emphasizes Bash because it maps directly to day-to-day Linux operations.
- The structure leaves room for higher-level helpers without mixing concerns.
- Bash remains the right default for direct host checks and operational wrappers.
- Python is used where parsing, report generation, comparison, or JSON output is clearer than shell.
- Bash tooling should remain safe by default, readable, and validated with `../../scripts/check-bash.sh` from the repository root.
- Python tooling should remain read-only by default, standard-library based, and validated with `../../scripts/check-python.sh` from the repository root.

@@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. T
```mermaid
flowchart TD
A["bash"] --> B["os-healthcheck"]
-A --> C["disk-full"]
-A --> D["veritas"]
-A --> E["gpfs"]
+A --> C["incident-checks"]
+A --> D["disk-full"]
+A --> E["veritas"]
+A --> F["gpfs"]
B --> B1["Host diagnostics"]
-C --> C1["Incident workflow"]
-D --> D1["VxVM and VCS change flow"]
-E --> E1["Spectrum Scale expansion flow"]
+C --> C1["Standalone triage checks"]
+D --> D1["Incident workflow"]
+E --> E1["VxVM and VCS change flow"]
+F --> F1["Spectrum Scale expansion flow"]
```

## Scripts
@@ -23,6 +25,7 @@ flowchart TD
- `os-healthcheck/service_check.sh` - critical service status check.
- `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
- `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
- `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics.

## Usage

@@ -37,6 +40,12 @@ cd infra-run/scripts/bash/os-healthcheck
./system_report.sh
./network_troubleshoot.sh
./network_troubleshoot.sh google.com

cd ../incident-checks
./check_high_cpu.sh
./check_high_memory_oom.sh --since "24 hours ago"
./check_service_restart_loop.sh --service sshd
./check_certificate_expiry.sh --host example.com
```

## Standards

@@ -0,0 +1,124 @@
# Bash Incident Checks

Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident, run without repository context, and pasted into an incident or change ticket as evidence.

They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.

## Scripts

- `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes.
- `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence.
- `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines.
- `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs.
- `check_certificate_expiry.sh` - remote or local TLS certificate expiry check.
- `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints.
- `check_ntp_time_drift.sh` - time sync status and offset evidence when available.
- `check_filesystem_readonly.sh` - read-only filesystem detection.
- `check_inode_usage.sh` - inode pressure and top affected mount points.
- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.
- `incident_triage_report.sh` - wrapper that runs selected checks and writes a single Markdown L2 handover report.

## Usage Examples

```bash
./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15

./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5

./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"

./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25

./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt

./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432

./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000

./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system

./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90

./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name

./incident_triage_report.sh --type cpu
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
./incident_triage_report.sh --type network --host app.example.com --port 443
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
```

## L2 Triage Report Wrapper

`incident_triage_report.sh` collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.

Supported report types are `cpu`, `memory`, `service`, `network`, `auth`, `cert`, `filesystem`, `jvm`, and `all`.

The wrapper is read-only apart from writing the requested `--output` file. It does not require root and skips checks safely when an underlying script is missing, not executable, or missing required context such as `--service` or `--host`.
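
The skip behaviour can be pictured with a small sketch. `run_or_skip` is a hypothetical helper written for illustration; the real wrapper's function names, messages, and report format may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run one check script, or record why it was skipped.
# It always returns 0 so that one missing check does not abort the whole report.
run_or_skip() {
  local script="$1"
  shift
  if [[ ! -f "$script" ]]; then
    printf 'SKIPPED: %s not found\n' "$script"
    return 0
  fi
  if [[ ! -x "$script" ]]; then
    printf 'SKIPPED: %s is not executable\n' "$script"
    return 0
  fi
  local rc=0
  "$script" "$@" || rc=$?        # capture the check's exit code without aborting
  printf 'exit: %s\n' "$rc"
}
```

A wrapper built this way could call `run_or_skip ./check_high_cpu.sh` for each selected check and append the output to the report, so a missing script shows up as a `SKIPPED` line instead of a failure.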

## Exit Codes

- `0` - OK.
- `1` - WARNING or operational issue detected.
- `2` - invalid input or missing required dependency.
- `3` - CRITICAL issue detected.
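
As a rough illustration of how a threshold check could map onto these codes, the sketch below classifies a value and translates the status into an exit code. Both function names are hypothetical and are not taken from the scripts themselves.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compare an integer value against warning/critical thresholds.
classify_threshold() {
  local value="$1" warning="$2" critical="$3"
  if (( value >= critical )); then
    printf 'CRITICAL\n'
  elif (( value >= warning )); then
    printf 'WARNING\n'
  else
    printf 'OK\n'
  fi
}

# Hypothetical helper: translate a status word into the documented exit code.
status_to_exit_code() {
  case "$1" in
    OK)       printf '0\n' ;;
    WARNING)  printf '1\n' ;;
    CRITICAL) printf '3\n' ;;
    *)        printf '2\n' ;;   # anything else is treated as invalid input
  esac
}
```

A check could then end with `exit "$(status_to_exit_code "$(classify_threshold "$usage" 75 90)")"`, where `usage` is whatever integer percentage the check measured.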

## Supported Platforms

These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.

Some data sources vary by distribution:

- RHEL-like systems often use `/var/log/secure` and `/var/log/messages`.
- Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`.
- systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available.
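
One way to cope with these differences is to probe a list of candidate paths and use the first one that is readable. The helper below is a hypothetical sketch of that idea, not the selection logic the scripts actually use.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first readable path from the arguments.
# Returns 1 when none of the candidates can be read.
first_readable() {
  local candidate
  for candidate in "$@"; do
    if [[ -r "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}
```

For authentication logs this could be used as `auth_log="$(first_readable /var/log/secure /var/log/auth.log)"`, falling back to `journalctl` or a clear "no readable source" message when the call returns non-zero.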

## Safety Notes

- Scripts are read-only.
- Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files.
- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
- Treat output as triage evidence, not as complete root-cause analysis.

## Dependency Notes

Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`.

Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files.
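
A runtime dependency check of this kind might look like the sketch below. `missing_commands` and `require_commands` are hypothetical names, and the message format is an assumption; only the exit code `2` for a missing required dependency comes from this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the space-separated names of commands not found in PATH.
missing_commands() {
  local cmd missing=()
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing+=("$cmd")
  done
  printf '%s\n' "${missing[*]}"
}

# Hypothetical helper: return 2 when any required command is missing.
require_commands() {
  local missing
  missing="$(missing_commands "$@")"
  if [[ -n "$missing" ]]; then
    printf 'CRITICAL: missing required command(s): %s\n' "$missing"
    return 2
  fi
}
```

A script could call `require_commands awk df openssl || exit 2` near the top, and use `missing_commands journalctl chronyc` to report optional tools that were unavailable without failing.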
|
||||
|
||||
## Copy-To-Server Example

```bash
scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
```

Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.

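To make the output easy to attach to a ticket, the run can be captured on the responder's workstation together with its exit code. This is a minimal sketch: `capture_evidence` is a hypothetical helper and the header fields are an assumed layout, not a format the toolkit defines.

```bash
# Hypothetical helper: run a command, save its combined output plus a short
# header and the exit code to a local file, and return the same exit code.
capture_evidence() {
  local out_file="$1"
  shift
  local rc=0
  {
    printf 'command: %s\n' "$*"
    printf 'collected_at: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    "$@" 2>&1 || rc=$?
    printf 'exit_code: %s\n' "$rc"
  } > "$out_file"
  return "$rc"
}
```

For example, `capture_evidence ./oom-evidence.txt ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'` would write the evidence file locally, so nothing extra is written on the server.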
## Sample Outputs

Sanitized examples are available in [examples](./examples/):

- `high-cpu.sample.txt`
- `high-memory-oom.sample.txt`
- `service-restart-loop.sample.txt`
- `failed-ssh-logins.sample.txt`
- `certificate-expiry.sample.txt`
- `dns-connectivity.sample.txt`
- `ntp-time-drift.sample.txt`
- `filesystem-readonly.sample.txt`
- `inode-usage.sample.txt`
- `jvm-threads-heap.sample.txt`

A sanitized report sample is available at [../../../examples/incident-triage/l2-incident-triage-report.sample.md](../../../examples/incident-triage/l2-incident-triage-report.sample.md).

@@ -0,0 +1,134 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
host_name=""
|
||||
port=443
|
||||
cert_file=""
|
||||
warning_days=30
|
||||
critical_days=7
|
||||
servername=""
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: check_certificate_expiry.sh (--host HOST [--port PORT] | --file CERT_FILE) [--servername SNI_NAME] [--warning-days DAYS] [--critical-days DAYS] [--help]
|
||||
|
||||
Check TLS certificate expiry for a remote endpoint or local certificate file.
|
||||
USAGE
|
||||
}
|
||||
|
||||
is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
|
||||
--port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
|
||||
--file) [[ $# -ge 2 ]] || { printf 'CRITICAL: --file requires a value\n'; exit 2; }; cert_file="$2"; shift 2 ;;
|
||||
--servername) [[ $# -ge 2 ]] || { printf 'CRITICAL: --servername requires a value\n'; exit 2; }; servername="$2"; shift 2 ;;
|
||||
--warning-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-days requires a value\n'; exit 2; }; warning_days="$2"; shift 2 ;;
|
||||
--critical-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-days requires a value\n'; exit 2; }; critical_days="$2"; shift 2 ;;
|
||||
--help|-h) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if ! command -v openssl >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command not found: openssl\n'
|
||||
exit 2
|
||||
fi
|
||||
for value in "$port" "$warning_days" "$critical_days"; do
|
||||
if ! is_number "$value"; then
|
||||
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
if ((critical_days >= warning_days)); then
|
||||
printf 'CRITICAL: --critical-days must be lower than --warning-days\n'
|
||||
exit 2
|
||||
fi
|
||||
if [[ -n "$host_name" && -n "$cert_file" ]]; then
|
||||
printf 'CRITICAL: use either --host or --file, not both\n'
|
||||
exit 2
|
||||
fi
|
||||
if [[ -z "$host_name" && -z "$cert_file" ]]; then
|
||||
printf 'CRITICAL: either --host or --file is required\n'
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
if [[ -n "$cert_file" && ! -r "$cert_file" ]]; then
|
||||
printf 'CRITICAL: certificate file is not readable: %s\n' "$cert_file"
|
||||
exit 2
|
||||
fi
|
||||
if [[ -z "$servername" ]]; then
|
||||
servername="$host_name"
|
||||
fi
|
||||
|
||||
tmp_cert="$(mktemp)"
|
||||
trap 'rm -f "$tmp_cert"' EXIT
|
||||
|
||||
if [[ -n "$host_name" ]]; then
|
||||
if ! openssl s_client -connect "${host_name}:${port}" -servername "$servername" -showcerts </dev/null 2>/dev/null \
|
||||
| openssl x509 -outform PEM > "$tmp_cert" 2>/dev/null; then
|
||||
printf 'CRITICAL: unable to retrieve certificate from %s:%s\n' "$host_name" "$port"
|
||||
exit 2
|
||||
fi
|
||||
else
|
||||
cp "$cert_file" "$tmp_cert"
|
||||
fi
|
||||
|
||||
subject="$(openssl x509 -in "$tmp_cert" -noout -subject 2>/dev/null | sed 's/^subject=//')"
|
||||
issuer="$(openssl x509 -in "$tmp_cert" -noout -issuer 2>/dev/null | sed 's/^issuer=//')"
|
||||
not_before="$(openssl x509 -in "$tmp_cert" -noout -startdate 2>/dev/null | sed 's/^notBefore=//')"
|
||||
not_after="$(openssl x509 -in "$tmp_cert" -noout -enddate 2>/dev/null | sed 's/^notAfter=//')"
|
||||
san_text="$(openssl x509 -in "$tmp_cert" -noout -ext subjectAltName 2>/dev/null | sed '1d' | sed 's/^ *//')"
|
||||
|
||||
expiry_epoch="$(date -d "$not_after" +%s 2>/dev/null || printf '')"
|
||||
now_epoch="$(date +%s)"
|
||||
if [[ -z "$expiry_epoch" ]]; then
|
||||
printf 'CRITICAL: unable to parse certificate expiry date: %s\n' "$not_after"
|
||||
exit 2
|
||||
fi
|
||||
seconds_left=$((expiry_epoch - now_epoch))
|
||||
days_left=$((seconds_left / 86400))
|
||||
|
||||
status="OK"
|
||||
exit_code=0
|
||||
if ((days_left < critical_days)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
elif ((days_left < warning_days)); then
|
||||
status="WARNING"
|
||||
exit_code=1
|
||||
fi
|
||||
|
||||
target="$cert_file"
|
||||
if [[ -n "$host_name" ]]; then
|
||||
target="${host_name}:${port}"
|
||||
fi
|
||||
|
||||
printf '%s: Certificate for %s expires in %s day(s)\n\n' "$status" "$target" "$days_left"
|
||||
|
||||
printf 'Certificate details:\n'
|
||||
printf 'Subject: %s\n' "$subject"
|
||||
printf 'Issuer: %s\n' "$issuer"
|
||||
printf 'notBefore: %s\n' "$not_before"
|
||||
printf 'notAfter: %s\n' "$not_after"
|
||||
printf 'SAN/CN: %s\n' "${san_text:-$subject}"
|
||||
printf '\n'
|
||||
|
||||
printf 'Evidence:\n'
|
||||
printf 'Target: %s\n' "$target"
|
||||
printf 'SNI: %s\n' "${servername:-not used}"
|
||||
printf 'Thresholds: warning=%s days critical=%s days\n\n' "$warning_days" "$critical_days"
|
||||
|
||||
printf 'Recommended next steps:\n'
|
||||
printf -- '- Renew certificate before the operational threshold is breached\n'
|
||||
printf -- '- Check the full chain and intermediate certificates\n'
|
||||
printf -- '- Check the load balancer, ingress, or reverse proxy serving this certificate\n'
|
||||
printf -- '- Verify monitoring threshold and alert ownership\n'
|
||||
printf -- '- Attach this output to incident or change ticket\n'
|
||||
|
||||
exit "$exit_code"
|
||||
@@ -0,0 +1,161 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
host_name=""
|
||||
port=""
|
||||
count=3
|
||||
timeout_seconds=3
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: check_dns_connectivity.sh --host HOST [--port PORT] [--count COUNT] [--timeout SECONDS] [--help]
|
||||
|
||||
Check DNS resolution, ping, optional TCP connectivity, and local route hints.
|
||||
USAGE
|
||||
}
|
||||
|
||||
is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
|
||||
--port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
|
||||
--count) [[ $# -ge 2 ]] || { printf 'CRITICAL: --count requires a value\n'; exit 2; }; count="$2"; shift 2 ;;
|
||||
--timeout) [[ $# -ge 2 ]] || { printf 'CRITICAL: --timeout requires a value\n'; exit 2; }; timeout_seconds="$2"; shift 2 ;;
|
||||
--help|-h) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$host_name" ]]; then
|
||||
printf 'CRITICAL: --host is required\n'
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
for value in "$count" "$timeout_seconds"; do
|
||||
if ! is_number "$value"; then
|
||||
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
if [[ -n "$port" ]] && ! is_number "$port"; then
|
||||
printf 'CRITICAL: --port must be numeric\n'
|
||||
exit 2
|
||||
fi
|
||||
if ! command -v getent >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command not found: getent\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
dns_ok=0
|
||||
ping_ok=0
|
||||
tcp_ok=0
|
||||
tcp_checked=0
|
||||
tcp_note=""
|
||||
ping_output="$(mktemp)"
|
||||
trap 'rm -f "$ping_output"' EXIT
|
||||
|
||||
dns_output="$(getent hosts "$host_name" 2>/dev/null || true)"
|
||||
if [[ -n "$dns_output" ]]; then
|
||||
dns_ok=1
|
||||
fi
|
||||
|
||||
if command -v ping >/dev/null 2>&1; then
|
||||
if ping -c "$count" -W "$timeout_seconds" "$host_name" > "$ping_output" 2>&1; then
|
||||
ping_ok=1
|
||||
fi
|
||||
else
|
||||
printf 'WARNING: ping command not available; ICMP check skipped\n' > "$ping_output"
|
||||
fi
|
||||
|
||||
if [[ -n "$port" ]]; then
|
||||
tcp_checked=1
|
||||
if command -v timeout >/dev/null 2>&1; then
|
||||
if timeout "$timeout_seconds" bash -c ':</dev/tcp/$1/$2' _ "$host_name" "$port" >/dev/null 2>&1; then
|
||||
tcp_ok=1
|
||||
fi
|
||||
else
|
||||
tcp_note="WARNING: timeout command not available; TCP /dev/tcp check used without external timeout"
|
||||
if bash -c ':</dev/tcp/$1/$2' _ "$host_name" "$port" >/dev/null 2>&1; then
|
||||
tcp_ok=1
|
||||
fi
|
||||
fi
|
||||
fi
|
||||
|
||||
status="OK"
|
||||
exit_code=0
|
||||
if ((dns_ok == 0)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
elif ((tcp_checked == 1 && tcp_ok == 0)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
elif command -v ping >/dev/null 2>&1 && ((ping_ok == 0)); then
|
||||
status="WARNING"
|
||||
exit_code=1
|
||||
fi
|
||||
|
||||
printf '%s: DNS=%s ping=%s' "$status" "$([[ "$dns_ok" == 1 ]] && printf OK || printf FAILED)" "$([[ "$ping_ok" == 1 ]] && printf OK || printf UNKNOWN_OR_FAILED)"
|
||||
if ((tcp_checked == 1)); then
|
||||
printf ' tcp_%s=%s' "$port" "$([[ "$tcp_ok" == 1 ]] && printf OK || printf FAILED)"
|
||||
fi
|
||||
printf '\n\n'
|
||||
|
||||
printf 'DNS result:\n'
|
||||
if [[ -n "$dns_output" ]]; then
|
||||
printf '%s\n' "$dns_output"
|
||||
else
|
||||
printf 'CRITICAL: getent hosts returned no records for %s\n' "$host_name"
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Ping result:\n'
|
||||
if [[ -s "$ping_output" ]]; then
|
||||
cat "$ping_output"
|
||||
else
|
||||
printf 'WARNING: ping result unavailable or ping command missing\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
if ((tcp_checked == 1)); then
|
||||
printf 'TCP port result:\n'
|
||||
if ((tcp_ok == 1)); then
|
||||
printf 'OK: TCP connection to %s:%s succeeded\n' "$host_name" "$port"
|
||||
else
|
||||
printf 'CRITICAL: TCP connection to %s:%s failed or timed out\n' "$host_name" "$port"
|
||||
fi
|
||||
if [[ -n "$tcp_note" ]]; then
|
||||
printf '%s\n' "$tcp_note"
|
||||
fi
|
||||
printf '\n'
|
||||
fi
|
||||
|
||||
printf 'Local network hints:\n'
|
||||
if command -v ip >/dev/null 2>&1; then
|
||||
ip route show default 2>/dev/null || printf 'WARNING: unable to read default route\n'
|
||||
elif command -v ss >/dev/null 2>&1; then
|
||||
ss -tuln 2>/dev/null | head -n 20 || printf 'WARNING: unable to read socket summary\n'
|
||||
else
|
||||
printf 'WARNING: ip and ss are unavailable; local network hints skipped\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Evidence:\n'
|
||||
printf 'Host: %s count=%s timeout=%ss port=%s\n' "$host_name" "$count" "$timeout_seconds" "${port:-not checked}"
|
||||
if [[ -n "$tcp_note" ]]; then
|
||||
printf '%s\n' "$tcp_note"
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Recommended next steps:\n'
|
||||
printf -- '- Verify the DNS record and resolver path\n'
|
||||
printf -- '- Check firewall, routing, security group, or proxy policy\n'
|
||||
printf -- '- Compare results from another host or network segment\n'
|
||||
printf -- '- Check application endpoint health after network reachability is confirmed\n'
|
||||
printf -- '- Attach this output to incident ticket\n'
|
||||
|
||||
exit "$exit_code"
|
||||
@@ -0,0 +1,124 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
since_value="1 hour ago"
|
||||
warning_count=20
|
||||
critical_count=50
|
||||
top_count=10
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: check_failed_ssh_logins.sh [--since TEXT] [--warning COUNT] [--critical COUNT] [--top N] [--help]
|
||||
|
||||
Detect failed SSH login bursts from journal or readable authentication logs.
|
||||
USAGE
|
||||
}
|
||||
|
||||
is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
|
||||
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
|
||||
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
|
||||
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
|
||||
--help|-h) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
for value in "$warning_count" "$critical_count" "$top_count"; do
|
||||
if ! is_number "$value"; then
|
||||
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
if ((warning_count >= critical_count)); then
|
||||
printf 'CRITICAL: --warning must be lower than --critical\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
tmp_log="$(mktemp)"
|
||||
trap 'rm -f "$tmp_log"' EXIT
|
||||
log_source="journalctl"
|
||||
|
||||
if command -v journalctl >/dev/null 2>&1; then
|
||||
journalctl --since "$since_value" --no-pager 2>/dev/null \
|
||||
| grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' > "$tmp_log" || true
|
||||
else
|
||||
log_source="log file fallback"
|
||||
fi
|
||||
|
||||
if [[ ! -s "$tmp_log" ]]; then
|
||||
for log_file in /var/log/auth.log /var/log/secure /var/log/messages; do
|
||||
if [[ -r "$log_file" ]]; then
|
||||
grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' "$log_file" >> "$tmp_log" || true
|
||||
log_source="$log_file"
|
||||
fi
|
||||
done
|
||||
fi
|
||||
|
||||
attempts="$(wc -l < "$tmp_log" | awk '{print $1}')"
|
||||
|
||||
status="OK"
|
||||
exit_code=0
|
||||
if ((attempts >= critical_count)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
elif ((attempts >= warning_count)); then
|
||||
status="WARNING"
|
||||
exit_code=1
|
||||
fi
|
||||
|
||||
printf '%s: Found %s failed SSH login attempt(s) for requested window\n\n' "$status" "$attempts"
|
||||
|
||||
printf 'Top source IPs:\n'
|
||||
if [[ -s "$tmp_log" ]]; then
|
||||
grep -Eo 'from ([0-9]{1,3}\.){3}[0-9]{1,3}|rhost=([0-9]{1,3}\.){3}[0-9]{1,3}' "$tmp_log" \
|
||||
| sed -E 's/^(from |rhost=)//' \
|
||||
| sort | uniq -c | sort -rn | head -n "$top_count" || true
|
||||
else
|
||||
printf 'OK: no failed SSH attempts found in available sources\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Top attempted users:\n'
|
||||
if [[ -s "$tmp_log" ]]; then
|
||||
sed -nE 's/.*Invalid user ([^ ]+).*/\1/p; s/.*Failed password for invalid user ([^ ]+).*/\1/p; s/.*Failed password for ([^ ]+).*/\1/p; s/.*user=([^ ]+).*/\1/p' "$tmp_log" \
|
||||
| sort | uniq -c | sort -rn | head -n "$top_count" || true
|
||||
else
|
||||
printf 'OK: no attempted users extracted\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Sample recent lines:\n'
|
||||
if [[ -s "$tmp_log" ]]; then
|
||||
tail -n "$top_count" "$tmp_log"
|
||||
else
|
||||
printf 'OK: no sample lines available\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Evidence:\n'
|
||||
printf 'Thresholds: warning=%s critical=%s since="%s"\n' "$warning_count" "$critical_count" "$since_value"
|
||||
printf 'Log source: %s\n' "$log_source"
|
||||
if [[ "$log_source" != "journalctl" ]]; then
|
||||
printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
|
||||
fi
|
||||
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
|
||||
printf 'WARNING: running without root; authentication log visibility may be limited\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Recommended next steps:\n'
|
||||
printf -- '- Verify source IPs against expected scanners, admins, or automation\n'
|
||||
printf -- '- Check firewall, fail2ban, or security tooling state\n'
|
||||
printf -- '- Confirm whether the attempts are expected for this host\n'
|
||||
printf -- '- Review successful logins too, not only failures\n'
|
||||
printf -- '- Attach this output to incident ticket\n'
|
||||
|
||||
exit "$exit_code"
|
||||
@@ -0,0 +1,89 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
include_system=0
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: check_filesystem_readonly.sh [--include-system] [--help]
|
||||
|
||||
Detect filesystems mounted read-only. This script itself makes no changes.
|
||||
USAGE
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--include-system) include_system=1; shift ;;
|
||||
--help|-h) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
tmp_mounts="$(mktemp)"
|
||||
trap 'rm -f "$tmp_mounts"' EXIT
|
||||
|
||||
if command -v findmnt >/dev/null 2>&1; then
|
||||
findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > "$tmp_mounts" 2>/dev/null || true
|
||||
elif command -v mount >/dev/null 2>&1; then
|
||||
mount | awk '{ source=$1; target=$3; type=$5; opts=$6; gsub(/[()]/, "", opts); print target, source, type, opts }' > "$tmp_mounts"
|
||||
else
|
||||
printf 'CRITICAL: findmnt or mount is required\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
tmp_ro="$(mktemp)"
|
||||
trap 'rm -f "$tmp_mounts" "$tmp_ro"' EXIT
|
||||
|
||||
awk -v include_system="$include_system" '
|
||||
function system_fs(type, target) {
|
||||
return type ~ /^(proc|sysfs|tmpfs|devtmpfs|devpts|securityfs|cgroup|cgroup2|pstore|bpf|tracefs|debugfs|configfs|fusectl|mqueue|hugetlbfs|overlay|squashfs|autofs)$/ || target ~ /^\/(proc|sys|dev|run)(\/|$)/
|
||||
}
|
||||
{
|
||||
target=$1; source=$2; type=$3; opts=$4
|
||||
if (opts ~ /(^|,)ro(,|$)/) {
|
||||
if (include_system == 1 || ! system_fs(type, target)) {
|
||||
print target "\t" source "\t" type "\t" opts
|
||||
}
|
||||
}
|
||||
}
|
||||
' "$tmp_mounts" > "$tmp_ro"
|
||||
|
||||
readonly_count="$(wc -l < "$tmp_ro" | awk '{print $1}')"
|
||||
status="OK"
|
||||
exit_code=0
|
||||
if ((readonly_count > 0)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
fi
|
||||
|
||||
printf '%s: Found %s read-only filesystem(s)\n\n' "$status" "$readonly_count"
|
||||
|
||||
printf 'Read-only filesystems:\n'
|
||||
if [[ -s "$tmp_ro" ]]; then
|
||||
printf 'MOUNT_POINT\tSOURCE\tFSTYPE\tOPTIONS\n'
|
||||
cat "$tmp_ro"
|
||||
else
|
||||
printf 'OK: no read-only filesystems found with current filters\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Evidence:\n'
|
||||
printf 'include_system=%s\n' "$include_system"
|
||||
printf 'Collector: '
|
||||
if command -v findmnt >/dev/null 2>&1; then
|
||||
printf 'findmnt\n'
|
||||
else
|
||||
printf 'mount fallback\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Recommended next steps:\n'
|
||||
printf -- '- Check dmesg or journal logs for I/O errors and filesystem remount events\n'
|
||||
printf -- '- Check storage path, multipath, SAN, cloud volume, or underlying disk health\n'
|
||||
printf -- '- Check filesystem health with the platform-approved procedure\n'
|
||||
printf -- '- Do not remount read-write before understanding the cause\n'
|
||||
printf -- '- Attach this output to incident ticket\n'
|
||||
|
||||
exit "$exit_code"
|
||||
@@ -0,0 +1,146 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
warning_threshold=75
|
||||
critical_threshold=90
|
||||
top_count=10
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: check_high_cpu.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]
|
||||
|
||||
Detect high CPU load and show top CPU-consuming processes.
|
||||
|
||||
Exit codes:
|
||||
0 OK
|
||||
1 WARNING / operational issue detected
|
||||
2 invalid input / missing required dependency
|
||||
3 CRITICAL issue detected
|
||||
USAGE
|
||||
}
|
||||
|
||||
is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
require_cmd() {
|
||||
if ! command -v "$1" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command not found: %s\n' "$1"
|
||||
exit 2
|
||||
fi
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--warning)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }
|
||||
warning_threshold="$2"
|
||||
shift 2
|
||||
;;
|
||||
--critical)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }
|
||||
critical_threshold="$2"
|
||||
shift 2
|
||||
;;
|
||||
--top)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }
|
||||
top_count="$2"
|
||||
shift 2
|
||||
;;
|
||||
--help|-h)
|
||||
usage
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
printf 'CRITICAL: unknown option: %s\n' "$1"
|
||||
usage
|
||||
exit 2
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
|
||||
if ! is_number "$value"; then
|
||||
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
|
||||
if ((warning_threshold >= critical_threshold)); then
|
||||
printf 'CRITICAL: --warning must be lower than --critical\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
require_cmd ps
|
||||
require_cmd awk
|
||||
require_cmd head
|
||||
|
||||
cpu_count=1
|
||||
if command -v getconf >/dev/null 2>&1; then
|
||||
cpu_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || printf '1')"
|
||||
elif [[ -r /proc/cpuinfo ]]; then
|
||||
cpu_count="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || printf '1')"
|
||||
fi
|
||||
[[ "$cpu_count" =~ ^[0-9]+$ ]] || cpu_count=1
|
||||
((cpu_count > 0)) || cpu_count=1
|
||||
|
||||
load_1m="unavailable"
|
||||
load_5m="unavailable"
|
||||
load_15m="unavailable"
|
||||
load_per_cpu_pct=0
|
||||
if [[ -r /proc/loadavg ]]; then
|
||||
read -r load_1m load_5m load_15m _ < /proc/loadavg
|
||||
load_per_cpu_pct="$(awk -v load_avg="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"
|
||||
elif command -v uptime >/dev/null 2>&1; then
|
||||
load_line="$(uptime 2>/dev/null || true)"
|
||||
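load_1m="$(printf '%s\n' "$load_line" | sed -n 's/.*load average[s]*: *\([^,]*\).*/\1/p')"
# Derive the per-CPU percentage here as well; otherwise this fallback always reports 0% and OK.
if [[ "$load_1m" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
load_per_cpu_pct="$(awk -v load_avg="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"
else
load_1m="unavailable"
fi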
|
||||
fi
|
||||
|
||||
status="OK"
|
||||
exit_code=0
|
||||
if ((load_per_cpu_pct >= critical_threshold)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
elif ((load_per_cpu_pct >= warning_threshold)); then
|
||||
status="WARNING"
|
||||
exit_code=1
|
||||
fi
|
||||
|
||||
printf '%s: 1-minute load is %s across %s CPU(s) (%s%% of CPU count)\n\n' "$status" "$load_1m" "$cpu_count" "$load_per_cpu_pct"
|
||||
|
||||
printf 'Load average:\n'
|
||||
printf '1m=%s 5m=%s 15m=%s\n\n' "$load_1m" "$load_5m" "$load_15m"
|
||||
|
||||
printf 'CPU count:\n'
|
||||
printf '%s\n\n' "$cpu_count"
|
||||
|
||||
printf 'Top CPU processes:\n'
|
||||
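# head may close the pipe early; with errexit and pipefail, tolerate the resulting SIGPIPE from ps.
ps -eo pid,ppid,user,pcpu,pmem,comm,args --sort=-pcpu | head -n "$((top_count + 1))" || true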
|
||||
printf '\n'
|
||||
|
||||
printf 'Evidence:\n'
|
||||
if command -v uptime >/dev/null 2>&1; then
|
||||
uptime || true
|
||||
else
|
||||
printf 'WARNING: uptime command not available; used /proc/loadavg where possible\n'
|
||||
fi
|
||||
if ((load_per_cpu_pct >= 100)); then
|
||||
printf 'WARNING: load is higher than online CPU count; runnable task saturation is possible\n'
|
||||
else
|
||||
printf 'OK: load is not above online CPU count at collection time\n'
|
||||
fi
|
||||
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
|
||||
printf 'WARNING: running without root; process ownership details are usually available, but some command lines may be limited\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Recommended next steps:\n'
|
||||
printf -- '- Check process ownership and whether the top process is expected\n'
|
||||
printf -- '- Check recent deployments, cron jobs, batch jobs, or maintenance activity\n'
|
||||
printf -- '- Review logs for the top CPU-consuming process\n'
|
||||
printf -- '- Compare with longer trend data from monitoring before taking action\n'
|
||||
printf -- '- Attach this output to the incident ticket\n'
|
||||
|
||||
exit "$exit_code"
|
||||
@@ -0,0 +1,138 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
warning_threshold=80
|
||||
critical_threshold=90
|
||||
since_value="24 hours ago"
|
||||
top_count=10
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: check_high_memory_oom.sh [--warning PERCENT] [--critical PERCENT] [--since TEXT] [--top N] [--help]
|
||||
|
||||
Detect high memory or swap usage and show recent OOM killer evidence.
|
||||
USAGE
|
||||
}
|
||||
|
||||
is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
require_cmd() {
|
||||
if ! command -v "$1" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command not found: %s\n' "$1"
|
||||
exit 2
|
||||
fi
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
|
||||
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
|
||||
--since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
|
||||
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
|
||||
--help|-h) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
|
||||
if ! is_number "$value"; then
|
||||
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
if ((warning_threshold >= critical_threshold)); then
|
||||
printf 'CRITICAL: --warning must be lower than --critical\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
require_cmd free
|
||||
require_cmd ps
|
||||
require_cmd awk
|
||||
require_cmd head
|
||||
|
||||
read -r mem_total mem_used swap_total swap_used < <(free -m | awk '
|
||||
/^Mem:/ { mt=$2; mu=$3 }
|
||||
/^Swap:/ { st=$2; su=$3 }
|
||||
END { printf "%d %d %d %d\n", mt, mu, st, su }
|
||||
')
|
||||
|
||||
mem_pct=0
|
||||
swap_pct=0
|
||||
if ((mem_total > 0)); then
|
||||
mem_pct=$((mem_used * 100 / mem_total))
|
||||
fi
|
||||
if ((swap_total > 0)); then
|
||||
swap_pct=$((swap_used * 100 / swap_total))
|
||||
fi
|
||||
|
||||
status="OK"
|
||||
exit_code=0
|
||||
if ((mem_pct >= critical_threshold || swap_pct >= critical_threshold)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
elif ((mem_pct >= warning_threshold || swap_pct >= warning_threshold)); then
|
||||
status="WARNING"
|
||||
exit_code=1
|
||||
fi
|
||||
|
||||
printf '%s: Memory usage is %s%% and swap usage is %s%%\n\n' "$status" "$mem_pct" "$swap_pct"
|
||||
|
||||
printf 'Memory summary:\n'
|
||||
free -m
|
||||
printf '\n'
|
||||
|
||||
printf 'Top memory processes:\n'
|
||||
printf 'PID RSS_MB COMMAND\n'
|
||||
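# head may close the pipe early; with errexit and pipefail, tolerate the resulting SIGPIPE from ps.
ps -eo pid=,rss=,comm= --sort=-rss | head -n "$top_count" | awk '{ printf "%-7s %-8d %s\n", $1, int($2 / 1024), $3 }' || true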
|
||||
printf '\n'
|
||||
|
||||
printf 'OOM events since %s:\n' "$since_value"
|
||||
oom_found=0
|
||||
oom_source="journalctl"
|
||||
if command -v journalctl >/dev/null 2>&1; then
|
||||
if journalctl --since "$since_value" -k --no-pager 2>/dev/null | grep -Ei 'out of memory|oom-killer|killed process' | tail -n 20; then
|
||||
oom_found=1
|
||||
fi
|
||||
else
|
||||
printf 'WARNING: journalctl not available; checking readable log files\n'
|
||||
oom_source="log file fallback"
|
||||
fi
|
||||
if ((oom_found == 0)); then
|
||||
for log_file in /var/log/messages /var/log/syslog /var/log/kern.log; do
|
||||
if [[ -r "$log_file" ]]; then
|
||||
if grep -Ei 'out of memory|oom-killer|killed process' "$log_file" | tail -n 20; then
|
||||
oom_found=1
|
||||
oom_source="$log_file"
|
||||
break
|
||||
fi
|
||||
fi
|
||||
done
|
||||
fi
|
||||
if ((oom_found == 0)); then
|
||||
printf 'OK: no OOM evidence found in available sources\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Evidence:\n'
|
||||
printf 'Thresholds: warning=%s%% critical=%s%% since="%s"\n' "$warning_threshold" "$critical_threshold" "$since_value"
|
||||
printf 'OOM evidence source: %s\n' "$oom_source"
|
||||
if [[ "$oom_source" != "journalctl" ]]; then
|
||||
printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
|
||||
fi
|
||||
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
|
||||
printf 'WARNING: running without root; kernel logs or process details may be limited\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Recommended next steps:\n'
|
||||
printf -- '- Check application memory trend\n'
|
||||
printf -- '- Review JVM heap settings if process is Java\n'
|
||||
printf -- '- Verify swap pressure and paging activity\n'
|
||||
printf -- '- Confirm whether OOM events align with application impact\n'
|
||||
printf -- '- Attach this output to incident ticket\n'
|
||||
|
||||
exit "$exit_code"
|
||||
@@ -0,0 +1,103 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
warning_threshold=80
|
||||
critical_threshold=90
|
||||
top_count=10
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: check_inode_usage.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]
|
||||
|
||||
Detect inode exhaustion using df -i.
|
||||
USAGE
|
||||
}
|
||||
|
||||
is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
|
||||
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
|
||||
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
|
||||
--help|-h) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
|
||||
if ! is_number "$value"; then
|
||||
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
if ((warning_threshold >= critical_threshold)); then
|
||||
printf 'CRITICAL: --warning must be lower than --critical\n'
|
||||
exit 2
|
||||
fi
|
||||
if ! command -v df >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command not found: df\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
tmp_df="$(mktemp)"
|
||||
tmp_alerts="$(mktemp)"
|
||||
trap 'rm -f "$tmp_df" "$tmp_alerts"' EXIT
|
||||
|
||||
df -Pi > "$tmp_df"
|
||||
awk -v warn="$warning_threshold" '
|
||||
NR > 1 {
|
||||
pct=$5
|
||||
gsub(/%/, "", pct)
|
||||
if (pct + 0 >= warn + 0) {  # force numeric comparison; gsub leaves pct as a string
|
||||
print $0
|
||||
}
|
||||
}
|
||||
' "$tmp_df" > "$tmp_alerts"
|
||||
|
||||
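# Force numeric comparison: after gsub, pct is a string, so "9" would otherwise sort above "10".
max_pct="$(awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); pct += 0; if (pct > max) max=pct } END { printf "%d", max }' "$tmp_df")"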
|
||||
status="OK"
|
||||
exit_code=0
|
||||
if ((max_pct >= critical_threshold)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
elif ((max_pct >= warning_threshold)); then
|
||||
status="WARNING"
|
||||
exit_code=1
|
||||
fi
|
||||
|
||||
printf '%s: Highest inode usage is %s%%\n\n' "$status" "$max_pct"
|
||||
|
||||
printf 'Filesystems above threshold:\n'
|
||||
if [[ -s "$tmp_alerts" ]]; then
|
||||
cat "$tmp_alerts"
|
||||
else
|
||||
printf 'OK: no filesystems above warning threshold\n'
|
||||
fi
|
||||
printf '\n'
|
||||
|
||||
printf 'Inode usage table:\n'
|
||||
cat "$tmp_df"
|
||||
printf '\n'
|
||||
|
||||
printf 'Top affected mount points:\n'
|
||||
awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); print pct, $6, $1, $2, $3, $4 }' "$tmp_df" \
|
||||
| sort -rn | head -n "$top_count" \
|
||||
| awk '{ printf "%s%% %s %s inodes=%s used=%s free=%s\n", $1, $2, $3, $4, $5, $6 }'
|
||||
printf '\n'
|
||||
|
||||
printf 'Evidence:\n'
|
||||
printf 'Thresholds: warning=%s%% critical=%s%%\n\n' "$warning_threshold" "$critical_threshold"
|
||||
|
||||
printf 'Recommended next steps:\n'
|
||||
printf -- '- Find directories with many small files under affected mount points\n'
|
||||
printf -- '- Check logs, cache, spool, session, and temporary directories\n'
|
||||
printf -- '- Avoid deleting blindly; confirm ownership and application impact first\n'
|
||||
printf -- '- Confirm whether inode exhaustion is causing write or deploy failures\n'
|
||||
printf -- '- Attach this output to incident ticket\n'
|
||||
|
||||
exit "$exit_code"
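Review note: the threshold logic in this check only works if awk treats the `IUse%` value as a number once the `%` sign is stripped. The sketch below isolates that parsing step against made-up `df -Pi`-style rows. The helper name `max_inode_pct` and the sample data are illustrative assumptions, not part of the script.

```shell
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: read `df -Pi`-style text on stdin and print the
# highest IUse% value as an integer.
max_inode_pct() {
  awk 'NR > 1 {
    pct = $5
    gsub(/%/, "", pct)
    pct += 0              # force a number; a "-" placeholder becomes 0
    if (pct > max) max = pct
  } END { printf "%d\n", max }'
}

# Made-up sample rows chosen so that string ordering and numeric ordering disagree.
sample_df='Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 524288 47185 477103 9% /
/dev/sdb1 1310720 1140326 170394 87% /var
/dev/sdc1 1000 1000 0 100% /data'

printf '%s\n' "$sample_df" | max_inode_pct
```

With these rows the helper prints `100`. A plain string comparison would rank `9` above `87` and `100`, which is why the coercion matters.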
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

target_pid=""
match_string=""
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_jvm_threads_heap.sh [--pid PID | --match STRING] [--top N] [--help]

Provide lightweight JVM process diagnostics. Does not create heap dumps or modify processes.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --pid) [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }; target_pid="$2"; shift 2 ;;
    --match) [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }; match_string="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -n "$target_pid" && -n "$match_string" ]]; then
  printf 'CRITICAL: use either --pid or --match, not both\n'
  exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
  printf 'CRITICAL: --pid must be numeric\n'
  exit 2
fi
if ! is_number "$top_count"; then
  printf 'CRITICAL: --top must be numeric\n'
  exit 2
fi
if ! command -v ps >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: ps\n'
  exit 2
fi

tmp_java="$(mktemp)"
trap 'rm -f "$tmp_java"' EXIT

ps -eo pid=,user=,rss=,pcpu=,comm=,args= \
  | awk 'tolower($0) ~ /java/ && $1 != "" { print }' > "$tmp_java"

if [[ -z "$target_pid" && -n "$match_string" ]]; then
  target_pid="$(grep -F "$match_string" "$tmp_java" | awk 'NR == 1 { print $1 }' || true)"
fi

if [[ -z "$target_pid" ]]; then
  detected_count="$(wc -l < "$tmp_java" | awk '{print $1}')"
  if ((detected_count == 0)); then
    printf 'WARNING: No Java processes detected\n\n'
  else
    printf 'OK: Detected %s Java process(es); rerun with --pid PID for heap detail\n\n' "$detected_count"
  fi
  printf 'Detected JVM processes:\n'
  printf 'PID USER RSS_MB CPU COMMAND\n'
  awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' "$tmp_java" | head -n "$top_count"
  printf '\nRecommended next steps:\n'
  printf -- '- Select a JVM process with --pid for focused diagnostics\n'
  printf -- '- Review GC logs and application logs for the selected process\n'
  printf -- '- Check heap sizing and thread count trend\n'
  printf -- '- Capture jstack only if approved by operational process\n'
  exit 1
fi

if ! ps -p "$target_pid" >/dev/null 2>&1; then
  printf 'CRITICAL: process does not exist or is not visible: %s\n' "$target_pid"
  exit 2
fi

proc_line="$(ps -p "$target_pid" -o pid=,user=,rss=,pcpu=,comm=,args=)"
if ! printf '%s\n' "$proc_line" | grep -qi 'java'; then
  printf 'WARNING: PID %s does not appear to be a Java process from ps output\n\n' "$target_pid"
  status="WARNING"
  exit_code=1
else
  status="OK"
  exit_code=0
fi

thread_count="unavailable"
if [[ -r "/proc/${target_pid}/status" ]]; then
  thread_count="$(awk '/^Threads:/ { print $2 }' "/proc/${target_pid}/status")"
fi

printf '%s: JVM diagnostics collected for PID %s\n\n' "$status" "$target_pid"

printf 'Detected JVM process:\n'
printf 'PID USER RSS_MB CPU COMMAND\n'
printf '%s\n' "$proc_line" | awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }'
printf 'Thread count: %s\n\n' "$thread_count"

printf 'Heap and JVM evidence:\n'
if command -v jcmd >/dev/null 2>&1; then
  printf '\n[jcmd VM.flags]\n'
  jcmd "$target_pid" VM.flags 2>/dev/null || printf 'WARNING: jcmd VM.flags failed; permissions may be limited\n'
  printf '\n[jcmd GC.heap_info]\n'
  jcmd "$target_pid" GC.heap_info 2>/dev/null || printf 'WARNING: jcmd GC.heap_info failed; permissions may be limited\n'
  printf '\n[jcmd Thread.print summary]\n'
  jcmd "$target_pid" Thread.print 2>/dev/null | awk '/java.lang.Thread.State/ { state[$0]++ } END { for (item in state) print state[item], item }' | sort -rn | head -n "$top_count" || printf 'WARNING: jcmd Thread.print failed; permissions may be limited\n'
elif command -v jstat >/dev/null 2>&1; then
  printf '\n[jstat -gc]\n'
  jstat -gc "$target_pid" 1 1 2>/dev/null || printf 'WARNING: jstat failed; permissions may be limited\n'
else
  printf 'WARNING: jcmd and jstat are unavailable; heap details skipped\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'PID=%s thread_count=%s top=%s\n' "$target_pid" "$thread_count" "$top_count"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; JVM attach and /proc details may be limited by process ownership\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Review GC logs and recent application errors\n'
printf -- '- Check JVM heap sizing against container or host memory limits\n'
printf -- '- Check thread count trend in monitoring before concluding a leak\n'
printf -- '- Capture jstack only if approved by operational process\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
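Review note: the thread count in this script comes from the `Threads:` field of `/proc/PID/status`, which is Linux-specific. The sketch below exercises that lookup on its own, using the current shell as the target process. The helper name `thread_count_for_pid` is a hypothetical wrapper and is not defined in the script.

```shell
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: print the kernel-reported thread count for a PID, or
# "unavailable" when /proc/PID/status cannot be read (non-Linux host,
# exited process, or restrictive permissions).
thread_count_for_pid() {
  local pid="$1"
  local status_file="/proc/${pid}/status"
  local count=""
  if [[ -r "$status_file" ]]; then
    count="$(awk '/^Threads:/ { print $2 }' "$status_file")"
  fi
  printf '%s\n' "${count:-unavailable}"
}

# Demonstrate against the current shell, which is guaranteed to exist.
printf 'PID=%s thread_count=%s\n' "$$" "$(thread_count_for_pid "$$")"
```

Falling back to `unavailable` keeps the report readable on hosts without `/proc`, which matches how the script initialises `thread_count`.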
@@ -0,0 +1,121 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_offset_ms=500
critical_offset_ms=5000

usage() {
  cat <<'USAGE'
Usage: check_ntp_time_drift.sh [--warning-offset MS] [--critical-offset MS] [--help]

Check time synchronization status and offset evidence when available.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --warning-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-offset requires a value\n'; exit 2; }; warning_offset_ms="$2"; shift 2 ;;
    --critical-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-offset requires a value\n'; exit 2; }; critical_offset_ms="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_offset_ms" "$critical_offset_ms"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_offset_ms >= critical_offset_ms)); then
  printf 'CRITICAL: --warning-offset must be lower than --critical-offset\n'
  exit 2
fi

system_time="$(date '+%Y-%m-%d %H:%M:%S %Z %z')"
timezone="$(date '+%Z %z')"
sync_status="unknown"
detected_tool="none"
offset_ms=""

timedate_output=""
if command -v timedatectl >/dev/null 2>&1; then
  detected_tool="timedatectl"
  timedate_output="$(timedatectl 2>/dev/null || true)"
  sync_status="$(printf '%s\n' "$timedate_output" | awk -F: '/System clock synchronized|NTP synchronized/ { gsub(/^ +/, "", $2); print $2; exit }')"
  [[ -n "$sync_status" ]] || sync_status="unknown"
fi

chronyc_output=""
if command -v chronyc >/dev/null 2>&1; then
  detected_tool="chronyc"
  chronyc_output="$(chronyc tracking 2>/dev/null || true)"
  raw_offset="$(printf '%s\n' "$chronyc_output" | awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }')"
  if [[ -n "$raw_offset" ]]; then
    offset_ms="$(awk -v seconds="$raw_offset" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d", seconds * 1000 }')"
  fi
elif command -v ntpq >/dev/null 2>&1; then
  detected_tool="ntpq"
fi

status="OK"
exit_code=0
if [[ "$sync_status" =~ ^(no|false)$ ]]; then
  status="WARNING"
  exit_code=1
fi
if [[ -n "$offset_ms" ]]; then
  if ((offset_ms >= critical_offset_ms)); then
    status="CRITICAL"
    exit_code=3
  elif ((offset_ms >= warning_offset_ms)); then
    status="WARNING"
    exit_code=1
  fi
elif [[ "$detected_tool" == "none" ]]; then
  status="WARNING"
  exit_code=1
fi

printf '%s: Time sync status=%s offset_ms=%s\n\n' "$status" "$sync_status" "${offset_ms:-unavailable}"

printf 'Time status:\n'
printf 'System time: %s\n' "$system_time"
printf 'Timezone: %s\n' "$timezone"
printf 'Detected tool: %s\n' "$detected_tool"
printf 'NTP synchronized: %s\n' "$sync_status"
printf 'Offset ms: %s\n\n' "${offset_ms:-unavailable}"

printf 'Tool evidence:\n'
if [[ -n "$chronyc_output" ]]; then
  printf '%s\n' "$chronyc_output"
elif command -v ntpq >/dev/null 2>&1; then
  ntpq -p 2>/dev/null || printf 'WARNING: ntpq command failed\n'
elif [[ -n "$timedate_output" ]]; then
  printf '%s\n' "$timedate_output"
else
  printf 'WARNING: timedatectl, chronyc, and ntpq are unavailable or returned no data\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%sms critical=%sms\n' "$warning_offset_ms" "$critical_offset_ms"
if [[ -z "$offset_ms" ]]; then
  printf 'WARNING: offset unavailable; status is based on available synchronization indicators only\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Verify chrony or ntpd service status and configuration\n'
printf -- '- Check NTP sources and reachability\n'
printf -- '- Check virtualization host time if this is a VM\n'
printf -- '- Avoid restarting time services blindly in production\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
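Review note: the offset handling in this script has two steps. It pulls the first seconds value out of `chronyc tracking` output, then converts it to absolute whole milliseconds. The sketch below runs both steps against made-up tracking text so the parsing can be checked without a chrony daemon. The helper names `extract_offset_seconds` and `abs_offset_ms` and the sample text are illustrative assumptions; the awk expressions mirror the ones in the script.

```shell
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: print the first offset value (in seconds) found in
# `chronyc tracking`-style text on stdin.
extract_offset_seconds() {
  awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }'
}

# Hypothetical helper: convert a signed seconds value into absolute,
# truncated milliseconds.
abs_offset_ms() {
  awk -v seconds="$1" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d\n", seconds * 1000 }'
}

# Made-up sample shaped like chronyc tracking output.
sample_tracking='Reference ID    : 203.0.113.10
System time     : 0.812345 seconds fast of NTP time
Last offset     : +0.812345 seconds'

raw_offset="$(printf '%s\n' "$sample_tracking" | extract_offset_seconds)"
printf 'raw=%s ms=%s\n' "$raw_offset" "$(abs_offset_ms "$raw_offset")"
```

For the sample text this prints `raw=0.812345 ms=812`. Because the first matching line wins, the `System time` value is used and its fast/slow direction is discarded, which is acceptable here since only the magnitude is compared with the thresholds.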
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

service_name=""
since_value="1 hour ago"
warning_count=3
critical_count=10

usage() {
  cat <<'USAGE'
Usage: check_service_restart_loop.sh --service SERVICE_NAME [--since TEXT] [--warning COUNT] [--critical COUNT] [--help]

Detect restart-loop evidence for a systemd service. Read-only.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --service) [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }; service_name="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -z "$service_name" ]]; then
  printf 'CRITICAL: --service is required\n'
  usage
  exit 2
fi
for value in "$warning_count" "$critical_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd systemctl

active_state="$(systemctl show "$service_name" --property=ActiveState --value 2>/dev/null || printf 'unknown')"
sub_state="$(systemctl show "$service_name" --property=SubState --value 2>/dev/null || printf 'unknown')"
n_restarts="$(systemctl show "$service_name" --property=NRestarts --value 2>/dev/null || printf '')"
restart_count="${n_restarts:-0}"
if ! is_number "$restart_count"; then
  restart_count=0
fi

status="OK"
exit_code=0
if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then
  status="CRITICAL"
  exit_code=3
elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then
  status="WARNING"
  exit_code=1
fi

printf '%s: Service %s state=%s substate=%s restarts=%s\n\n' "$status" "$service_name" "$active_state" "$sub_state" "$restart_count"

printf 'Service state:\n'
systemctl status "$service_name" --no-pager --lines=8 2>/dev/null || printf 'WARNING: unable to read service status for %s\n' "$service_name"
printf '\n'

printf 'Systemd properties:\n'
systemctl show "$service_name" --property=Id,Names,LoadState,ActiveState,SubState,Result,ExecMainStatus,NRestarts,Restart,RestartUSec --no-pager 2>/dev/null || true
printf '\n'

printf 'Recent start/stop/failure log lines since %s:\n' "$since_value"
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u "$service_name" --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'start|stop|fail|restart|exit|status|main process' \
    | tail -n 40 || printf 'OK: no matching journal lines found\n'
else
  printf 'WARNING: journalctl not available; service logs unavailable from this script\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s restarts critical=%s restarts since="%s"\n' "$warning_count" "$critical_count" "$since_value"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; journal visibility may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Inspect the unit file and drop-in overrides\n'
printf -- '- Review application logs around the restart timestamps\n'
printf -- '- Check dependencies such as network, storage, database, or secrets\n'
printf -- '- Verify recent configuration or package changes\n'
printf -- '- Do not restart blindly; attach this output to the incident ticket\n'

exit "$exit_code"
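Review note: the status decision in this script combines the unit's `ActiveState` with the restart counter, and `failed` wins regardless of the count. The sketch below restates that decision as a pure function so the edge cases can be checked without systemd. The function name `classify_service` and the sample inputs are illustrative assumptions; the default thresholds follow the script's defaults of 3 and 10.

```shell
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: print "STATUS EXIT_CODE" for a given ActiveState and
# restart count, using the same ordering of conditions as the script.
classify_service() {
  local active_state="$1"
  local restart_count="$2"
  local warning_count="${3:-3}"
  local critical_count="${4:-10}"

  if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then
    printf 'CRITICAL 3\n'
  elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then
    printf 'WARNING 1\n'
  else
    printf 'OK 0\n'
  fi
}

# Made-up state/count pairs covering each branch.
for sample in "active 0" "active 4" "activating 0" "failed 0" "active 12"; do
  read -r state count <<< "$sample"
  printf '%s restarts=%s -> %s\n' "$state" "$count" "$(classify_service "$state" "$count")"
done
```

Note that any state other than `active`, such as `activating` or `inactive`, is reported as WARNING even with zero restarts, so a deliberately stopped unit will also show up as a warning.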
@@ -0,0 +1,20 @@
WARNING: Certificate for app.example.com:443 expires in 18 day(s)

Certificate details:
Subject: CN = app.example.com
Issuer: C = US, O = Example CA, CN = Example Intermediate CA
notBefore: Apr 11 00:00:00 2026 GMT
notAfter: May 29 23:59:59 2026 GMT
SAN/CN: DNS:app.example.com, DNS:api.example.com

Evidence:
Target: app.example.com:443
SNI: app.example.com
Thresholds: warning=30 days critical=7 days

Recommended next steps:
- Renew certificate before the operational threshold is breached
- Check the full chain and intermediate certificates
- Check the load balancer, ingress, or reverse proxy serving this certificate
- Verify monitoring threshold and alert ownership
- Attach this output to incident or change ticket
@@ -0,0 +1,23 @@
OK: DNS=OK ping=OK tcp_443=OK

DNS result:
93.184.216.34 example.com

Ping result:
3 packets transmitted, 3 received, 0% packet loss, time 2002ms

TCP port result:
OK: TCP connection to example.com:443 succeeded

Local network hints:
default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15

Evidence:
Host: example.com count=3 timeout=3s port=443

Recommended next steps:
- Verify the DNS record and resolver path
- Check firewall, routing, security group, or proxy policy
- Compare results from another host or network segment
- Check application endpoint health after network reachability is confirmed
- Attach this output to incident ticket
@@ -0,0 +1,26 @@
CRITICAL: Found 73 failed SSH login attempt(s) for requested window

Top source IPs:
52 203.0.113.44
12 198.51.100.20
9 192.0.2.10

Top attempted users:
31 admin
24 oracle
18 root

Sample recent lines:
May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20

Evidence:
Thresholds: warning=20 critical=50 since="1 hour ago"
Log source: journalctl

Recommended next steps:
- Verify source IPs against expected scanners, admins, or automation
- Check firewall, fail2ban, or security tooling state
- Confirm whether the attempts are expected for this host
- Review successful logins too, not only failures
- Attach this output to incident ticket
@@ -0,0 +1,16 @@
CRITICAL: Found 1 read-only filesystem(s)

Read-only filesystems:
MOUNT_POINT SOURCE FSTYPE OPTIONS
/data /dev/mapper/vg_data-lv_data xfs ro,relatime,seclabel,attr2,inode64

Evidence:
include_system=0
Collector: findmnt

Recommended next steps:
- Check dmesg or journal logs for I/O errors and filesystem remount events
- Check storage path, multipath, SAN, cloud volume, or underlying disk health
- Check filesystem health with the platform-approved procedure
- Do not remount read-write before understanding the cause
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)

Load average:
1m=7.82 5m=6.91 15m=5.40

CPU count:
8

Top CPU processes:
PID PPID USER %CPU %MEM COMMAND COMMAND
2314 1 app 245 12.1 java java -jar order-api.jar
991 1 root 38 0.4 backup-agent backup-agent --scan

Evidence:
WARNING: load is close to online CPU count; runnable task saturation is possible

Recommended next steps:
- Check process ownership and whether the top process is expected
- Check recent deployments, cron jobs, batch jobs, or maintenance activity
- Review logs for the top CPU-consuming process
- Compare with longer trend data from monitoring before taking action
- Attach this output to the incident ticket
@@ -0,0 +1,25 @@
WARNING: Memory usage is 84% and swap usage is 12%

Memory summary:
total used free shared buff/cache available
Mem: 15934 13386 512 121 2036 2101
Swap: 4095 512 3583

Top memory processes:
PID RSS_MB COMMAND
1234 2048 java
987 812 postgres

OOM events since 24 hours ago:
2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)

Evidence:
Thresholds: warning=80% critical=90% since="24 hours ago"
OOM evidence source: journalctl

Recommended next steps:
- Check application memory trend
- Review JVM heap settings if process is Java
- Verify swap pressure and paging activity
- Confirm whether OOM events align with application impact
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: Highest inode usage is 87%

Filesystems above threshold:
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var

Inode usage table:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root 524288 91300 432988 18% /
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var

Top affected mount points:
87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394

Evidence:
Thresholds: warning=80% critical=90%

Recommended next steps:
- Find directories with many small files under affected mount points
- Check logs, cache, spool, session, and temporary directories
- Avoid deleting blindly; confirm ownership and application impact first
- Confirm whether inode exhaustion is causing write or deploy failures
- Attach this output to incident ticket
@@ -0,0 +1,30 @@
OK: JVM diagnostics collected for PID 1234

Detected JVM process:
PID USER RSS_MB CPU COMMAND
1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar
Thread count: 188

Heap and JVM evidence:

[jcmd VM.flags]
1234:
-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648

[jcmd GC.heap_info]
garbage-first heap total 2097152K, used 1521000K

[jcmd Thread.print summary]
102 java.lang.Thread.State: WAITING
53 java.lang.Thread.State: RUNNABLE
33 java.lang.Thread.State: TIMED_WAITING

Evidence:
PID=1234 thread_count=188 top=10

Recommended next steps:
- Review GC logs and recent application errors
- Check JVM heap sizing against container or host memory limits
- Check thread count trend in monitoring before concluding a leak
- Capture jstack only if approved by operational process
- Attach this output to incident ticket
@@ -0,0 +1,23 @@
WARNING: Time sync status=yes offset_ms=812

Time status:
System time: 2026-05-11 10:18:01 UTC +0000
Timezone: UTC +0000
Detected tool: chronyc
NTP synchronized: yes
Offset ms: 812

Tool evidence:
Reference ID : 203.0.113.10
System time : 0.812345 seconds fast of NTP time
Last offset : +0.812345 seconds

Evidence:
Thresholds: warning=500ms critical=5000ms

Recommended next steps:
- Verify chrony or ntpd service status and configuration
- Check NTP sources and reachability
- Check virtualization host time if this is a VM
- Avoid restarting time services blindly in production
- Attach this output to incident ticket
@@ -0,0 +1,27 @@
CRITICAL: Service app.service state=failed substate=failed restarts=12

Service state:
app.service - Example application
Loaded: loaded (/etc/systemd/system/app.service; enabled)
Active: failed (Result: exit-code)

Systemd properties:
Id=app.service
ActiveState=failed
SubState=failed
Result=exit-code
NRestarts=12

Recent start/stop/failure log lines since 1 hour ago:
May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.

Evidence:
Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"

Recommended next steps:
- Inspect the unit file and drop-in overrides
- Review application logs around the restart timestamps
- Check dependencies such as network, storage, database, or secrets
- Verify recent configuration or package changes
- Do not restart blindly; attach this output to the incident ticket
|
||||
@@ -0,0 +1,385 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
incident_type=""
|
||||
service_name=""
|
||||
host_name=""
|
||||
port=""
|
||||
target_pid=""
|
||||
match_string=""
|
||||
output_file=""
|
||||
since_value="1 hour ago"
|
||||
|
||||
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: incident_triage_report.sh --type TYPE [options]
|
||||
|
||||
Run selected read-only incident checks and produce a Markdown triage report.
|
||||
|
||||
Incident types:
|
||||
cpu
|
||||
memory
|
||||
service
|
||||
network
|
||||
auth
|
||||
cert
|
||||
filesystem
|
||||
jvm
|
||||
all
|
||||
|
||||
Options:
|
||||
--type TYPE Incident type to collect
|
||||
--service SERVICE_NAME systemd service name for service checks
|
||||
--host HOSTNAME_OR_FQDN host for DNS, network, or certificate checks
|
||||
--port PORT TCP or TLS port for host checks
|
||||
--pid PID JVM process ID
|
||||
--match PROCESS_MATCH JVM process match string
|
||||
--output FILE write Markdown report to FILE
|
||||
--since VALUE time window for log-based checks
|
||||
--help show this help
|
||||
|
||||
Examples:
|
||||
./incident_triage_report.sh --type cpu
|
||||
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
|
||||
./incident_triage_report.sh --type network --host app.example.com --port 443
|
||||
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
|
||||
USAGE
|
||||
}
|
||||
|
||||
is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
valid_type() {
|
||||
case "$1" in
|
||||
cpu|memory|service|network|auth|cert|filesystem|jvm|all) return 0 ;;
|
||||
*) return 1 ;;
|
||||
esac
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--type)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --type requires a value\n'; exit 2; }
|
||||
incident_type="$2"
|
||||
shift 2
|
||||
;;
|
||||
--service)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }
|
||||
service_name="$2"
|
||||
shift 2
|
||||
;;
|
||||
--host)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }
|
||||
host_name="$2"
|
||||
shift 2
|
||||
;;
|
||||
--port)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }
|
||||
port="$2"
|
||||
shift 2
|
||||
;;
|
||||
--pid)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }
|
||||
target_pid="$2"
|
||||
shift 2
|
||||
;;
|
||||
--match)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }
|
||||
match_string="$2"
|
||||
shift 2
|
||||
;;
|
||||
--output)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --output requires a value\n'; exit 2; }
|
||||
output_file="$2"
|
||||
shift 2
|
||||
;;
|
||||
--since)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }
|
||||
since_value="$2"
|
||||
shift 2
|
||||
;;
|
||||
--help|-h)
|
||||
usage
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
printf 'CRITICAL: unknown option: %s\n' "$1"
|
||||
usage
|
||||
exit 2
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$incident_type" ]]; then
|
||||
printf 'CRITICAL: --type is required\n'
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
if ! valid_type "$incident_type"; then
|
||||
printf 'CRITICAL: unsupported incident type: %s\n' "$incident_type"
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
if [[ -n "$port" ]] && ! is_number "$port"; then
|
||||
printf 'CRITICAL: --port must be numeric\n'
|
||||
exit 2
|
||||
fi
|
||||
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
|
||||
printf 'CRITICAL: --pid must be numeric\n'
|
||||
exit 2
|
||||
fi
|
||||
if [[ -n "$target_pid" && -n "$match_string" ]]; then
|
||||
printf 'CRITICAL: use either --pid or --match for JVM checks, not both\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
tmp_dir="$(mktemp -d)"
|
||||
trap 'rm -rf "$tmp_dir"' EXIT
|
||||
|
||||
report_file="$tmp_dir/report.md"
|
||||
|
||||
check_labels=()
|
||||
check_names=()
|
||||
check_commands=()
|
||||
check_statuses=()
|
||||
check_exit_codes=()
|
||||
check_summaries=()
|
||||
check_outputs=()
|
||||
|
||||
status_from_exit() {
|
||||
case "$1" in
|
||||
0) printf 'OK' ;;
|
||||
1) printf 'WARNING' ;;
|
||||
2) printf 'INVALID' ;;
|
||||
3) printf 'CRITICAL' ;;
|
||||
*) printf 'ERROR' ;;
|
||||
esac
|
||||
}
|
||||
|
||||
render_command() {
|
||||
local item
|
||||
for item in "$@"; do
|
||||
printf '%q ' "$item"
|
||||
done | sed 's/[[:space:]]*$//'
|
||||
}
|
||||
|
||||
append_skipped_check() {
|
||||
local label="$1"
|
||||
local name="$2"
|
||||
local reason="$3"
|
||||
local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
|
||||
|
||||
printf 'SKIPPED: %s\n' "$reason" > "$output_path"
|
||||
|
||||
check_labels+=("$label")
|
||||
check_names+=("$name")
|
||||
check_commands+=("not run")
|
||||
check_statuses+=("SKIPPED")
|
||||
check_exit_codes+=("-")
|
||||
check_summaries+=("$reason")
|
||||
check_outputs+=("$output_path")
|
||||
}
|
||||
|
||||
run_check() {
|
||||
local label="$1"
|
||||
local script_name="$2"
|
||||
shift 2
|
||||
|
||||
local script_path="${script_dir}/${script_name}"
|
||||
local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
|
||||
local command_text
|
||||
local exit_code
|
||||
local status
|
||||
local summary
|
||||
|
||||
command_text="$(render_command "$script_path" "$@")"
|
||||
|
||||
if [[ ! -e "$script_path" ]]; then
|
||||
append_skipped_check "$label" "$script_name" "missing script: $script_name"
|
||||
return
|
||||
fi
|
||||
if [[ ! -x "$script_path" ]]; then
|
||||
append_skipped_check "$label" "$script_name" "script is not executable: $script_name"
|
||||
return
|
||||
fi
|
||||
|
||||
set +e
|
||||
"$script_path" "$@" > "$output_path" 2>&1
|
||||
exit_code=$?
|
||||
set -e
|
||||
|
||||
status="$(status_from_exit "$exit_code")"
|
||||
summary="$(sed -n '1p' "$output_path")"
|
||||
if [[ -z "$summary" ]]; then
|
||||
summary="no output captured"
|
||||
fi
|
||||
|
||||
check_labels+=("$label")
|
||||
check_names+=("$script_name")
|
||||
check_commands+=("$command_text")
|
||||
check_statuses+=("$status")
|
||||
check_exit_codes+=("$exit_code")
|
||||
check_summaries+=("$summary")
|
||||
check_outputs+=("$output_path")
|
||||
}
|
||||
|
||||
run_cpu_checks() {
|
||||
run_check "CPU saturation" "check_high_cpu.sh"
|
||||
}
|
||||
|
||||
run_memory_checks() {
|
||||
run_check "Memory and OOM" "check_high_memory_oom.sh" --since "$since_value"
|
||||
}
|
||||
|
||||
run_service_checks() {
|
||||
if [[ -z "$service_name" ]]; then
|
||||
append_skipped_check "Service restart loop" "check_service_restart_loop.sh" "requires --service SERVICE_NAME"
|
||||
return
|
||||
fi
|
||||
run_check "Service restart loop" "check_service_restart_loop.sh" --service "$service_name" --since "$since_value"
|
||||
}
|
||||
|
||||
run_network_checks() {
|
||||
local args=(--host "$host_name")
|
||||
if [[ -z "$host_name" ]]; then
|
||||
append_skipped_check "DNS and connectivity" "check_dns_connectivity.sh" "requires --host HOSTNAME_OR_FQDN"
|
||||
return
|
||||
fi
|
||||
if [[ -n "$port" ]]; then
|
||||
args+=(--port "$port")
|
||||
fi
|
||||
run_check "DNS and connectivity" "check_dns_connectivity.sh" "${args[@]}"
|
||||
}
|
||||
|
||||
run_auth_checks() {
|
||||
run_check "Failed SSH logins" "check_failed_ssh_logins.sh" --since "$since_value"
|
||||
}
|
||||
|
||||
run_cert_checks() {
|
||||
local args=(--host "$host_name")
|
||||
if [[ -z "$host_name" ]]; then
|
||||
append_skipped_check "Certificate expiry" "check_certificate_expiry.sh" "requires --host HOSTNAME_OR_FQDN"
|
||||
return
|
||||
fi
|
||||
if [[ -n "$port" ]]; then
|
||||
args+=(--port "$port")
|
||||
fi
|
||||
run_check "Certificate expiry" "check_certificate_expiry.sh" "${args[@]}"
|
||||
}
|
||||
|
||||
run_filesystem_checks() {
|
||||
run_check "Read-only filesystems" "check_filesystem_readonly.sh"
|
||||
run_check "Inode usage" "check_inode_usage.sh"
|
||||
}
|
||||
|
||||
run_jvm_checks() {
|
||||
local args=()
|
||||
if [[ -n "$target_pid" ]]; then
|
||||
args+=(--pid "$target_pid")
|
||||
elif [[ -n "$match_string" ]]; then
|
||||
args+=(--match "$match_string")
|
||||
fi
|
||||
run_check "JVM threads and heap" "check_jvm_threads_heap.sh" "${args[@]}"
|
||||
}
|
||||
|
||||
case "$incident_type" in
  cpu) run_cpu_checks ;;
  memory) run_memory_checks ;;
  service) run_service_checks ;;
  network) run_network_checks ;;
  auth) run_auth_checks ;;
  cert) run_cert_checks ;;
  filesystem) run_filesystem_checks ;;
  jvm) run_jvm_checks ;;
  all)
    run_cpu_checks
    run_memory_checks
    run_service_checks
    run_network_checks
    run_auth_checks
    run_cert_checks
    run_filesystem_checks
    run_jvm_checks
    ;;
esac
|
||||
|
||||
generated_at="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
|
||||
local_hostname="$(hostname 2>/dev/null || printf 'unknown')"
|
||||
current_user="$(id -un 2>/dev/null || printf 'unknown')"
|
||||
|
||||
{
|
||||
printf '# L2 Incident Triage Report\n\n'
|
||||
printf -- '- Generated: %s\n' "$generated_at"
|
||||
printf -- '- Local hostname: %s\n' "$local_hostname"
|
||||
printf -- '- Current user: %s\n' "$current_user"
|
||||
printf -- '- Incident type: %s\n' "$incident_type"
|
||||
printf -- '- Service: %s\n' "${service_name:-not provided}"
|
||||
printf -- '- Host: %s\n' "${host_name:-not provided}"
|
||||
printf -- '- Port: %s\n' "${port:-not provided}"
|
||||
printf -- '- PID: %s\n' "${target_pid:-not provided}"
|
||||
printf -- '- Process match: %s\n' "${match_string:-not provided}"
|
||||
printf -- '- Since: %s\n\n' "$since_value"
|
||||
|
||||
printf '## Executed Checks\n\n'
|
||||
printf '| Check | Script | Status | Exit | Command |\n'
|
||||
printf '| --- | --- | --- | --- | --- |\n'
|
||||
for index in "${!check_labels[@]}"; do
|
||||
printf "| %s | \`%s\` | %s | %s | \`%s\` |\n" \
|
||||
"${check_labels[$index]}" \
|
||||
"${check_names[$index]}" \
|
||||
"${check_statuses[$index]}" \
|
||||
"${check_exit_codes[$index]}" \
|
||||
"${check_commands[$index]}"
|
||||
done
|
||||
printf '\n'
|
||||
|
||||
printf '## Summary\n\n'
|
||||
for index in "${!check_labels[@]}"; do
|
||||
printf -- '- %s: %s\n' "${check_labels[$index]}" "${check_summaries[$index]}"
|
||||
done
|
||||
printf '\n'
|
||||
|
||||
printf '## Raw Evidence\n\n'
|
||||
for index in "${!check_labels[@]}"; do
|
||||
printf '### %s\n\n' "${check_labels[$index]}"
|
||||
printf "Script: \`%s\`\n\n" "${check_names[$index]}"
|
||||
printf "Command: \`%s\`\n\n" "${check_commands[$index]}"
|
||||
printf 'Status: %s, exit: %s\n\n' "${check_statuses[$index]}" "${check_exit_codes[$index]}"
|
||||
printf '```text\n'
|
||||
cat "${check_outputs[$index]}"
|
||||
printf '\n```\n\n'
|
||||
done
|
||||
|
||||
printf '## L2 Handover Checklist\n\n'
|
||||
printf -- '- [ ] Business impact confirmed\n'
|
||||
printf -- '- [ ] Affected host/service identified\n'
|
||||
printf -- '- [ ] Monitoring alert attached\n'
|
||||
printf -- '- [ ] Recent changes checked\n'
|
||||
printf -- '- [ ] Logs attached\n'
|
||||
printf -- '- [ ] Service owner identified\n'
|
||||
printf -- '- [ ] Escalation target identified\n\n'
|
||||
|
||||
printf '## Escalation Notes\n\n'
|
||||
printf -- '- Escalate when impact is active, spreading, customer-facing, or outside L2 access.\n'
|
||||
printf -- '- Include the alert, timeline, commands run, and the raw evidence above.\n'
|
||||
printf -- '- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.\n'
|
||||
printf -- '- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.\n\n'
|
||||
|
||||
printf '## Recommended Next Steps\n\n'
|
||||
printf -- '- Confirm the symptom against monitoring and user reports.\n'
|
||||
printf -- '- Compare this point-in-time evidence with recent deploys, config changes, and host events.\n'
|
||||
printf -- '- Attach this report to the incident ticket before handoff.\n'
|
||||
printf -- '- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.\n'
|
||||
} > "$report_file"
|
||||
|
||||
if [[ -n "$output_file" ]]; then
|
||||
cp "$report_file" "$output_file"
|
||||
printf 'OK: wrote L2 incident triage report to %s\n' "$output_file"
|
||||
else
|
||||
cat "$report_file"
|
||||
fi
|
||||
@@ -1,5 +1,69 @@
|
||||
-# python
+# Python Operational Tools

-Planned area for small Python helpers.
+This directory contains small Python utilities that support operational analysis in `infra-run`.

-No Python tooling is implemented in `infra-run` yet.
+Python is used here only when it adds practical value over Bash: parsing structured or noisy input, producing repeatable reports, comparing evidence, or emitting machine-readable output for later automation. Shell remains the default choice for direct host checks and simple command wrappers.
|
||||
|
||||
## Tools
|
||||
|
||||
| Tool | Path | Purpose | Typical use | Example command |
| --- | --- | --- | --- | --- |
| incident-log-summary | [incident-log-summary](./incident-log-summary/) | Summarize configured incident patterns from one local log file. | First-pass incident notes from system or application logs. | `python3 incident_log_summary.py --file examples/system-messages.log` |
| log-diff-checker | [log-diff-checker](./log-diff-checker/) | Compare configured patterns before and after a change. | Post-change review for new, increased, decreased, resolved, or unchanged log symptoms. | `python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log` |
| auth-log-audit | [auth-log-audit](./auth-log-audit/) | Summarize SSH, sudo, su, and PAM findings from local authentication logs. | Authentication incident review or access-control evidence gathering. | `python3 auth_log_audit.py --file examples/sample-auth.log` |
| jvm-log-analyzer | [jvm-log-analyzer](./jvm-log-analyzer/) | Summarize JVM exceptions, stack traces, HTTP 5xx entries, database issues, and TLS symptoms. | Java application support, restart review, or incident handoff evidence. | `python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log` |
| journal-analyzer | [journal-analyzer](./journal-analyzer/) | Summarize exported `journalctl` text for failed units, restart loops, OOM events, and service warnings. | Linux service incident review or patching/change evidence. | `python3 journal_analyzer.py --file examples/sample-journal.log` |
| known-error-matcher | [known-error-matcher](./known-error-matcher/) | Match local logs against a JSON known-error catalog. | Connect known symptoms to severity, category, samples, and runbook references. | `python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json` |
|
||||
|
||||
## Expected Use Cases
|
||||
|
||||
- Log parsing for incident review.
|
||||
- Markdown or text report generation from collected evidence.
|
||||
- Change evidence helpers for pre-check and post-check notes.
|
||||
- Incident summary builders from sanitized inputs.
|
||||
- Structured output for automation, such as JSON where useful.
|
||||
|
||||
## Standards
|
||||
|
||||
- Use the Python standard library only unless a later tool clearly justifies another dependency.
|
||||
- Keep tools read-only by default.
|
||||
- Do not perform destructive actions.
|
||||
- Use `argparse` for command-line interfaces.
|
||||
- Produce predictable text output suitable for terminal review and change notes.
|
||||
- Support text, Markdown, and JSON output where useful for terminal review, tickets, or local automation.
|
||||
- Report findings with a four-level status model: `OK`, `WARNING`, `CRITICAL`, or `UNKNOWN`.
|
||||
- Handle malformed input, permission problems, and runtime errors defensively.
|
||||
- Return meaningful exit codes.
|
||||
- Keep each tool small, focused, and easy to review.
|
||||
|
||||
## Exit Codes
|
||||
|
||||
- `0` - OK, no findings, or successful validation.
|
||||
- `1` - Operational findings detected.
|
||||
- `2` - Invalid input, missing dependency, permission issue, or runtime error.
|
||||
|
||||
## Validation
|
||||
|
||||
From the repository root:
|
||||
|
||||
```bash
bash scripts/check-python.sh
bash scripts/validate-repo.sh
```
|
||||
|
||||
The checks use `python3 -m py_compile` and do not require external Python dependencies.
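
The same kind of syntax check can be sketched from Python itself with the standard-library `py_compile` module. This is only an illustration: the contents of `scripts/check-python.sh` are not shown here, and the helper name `find_syntax_errors` is hypothetical.

```python
import py_compile
import tempfile
from pathlib import Path


def find_syntax_errors(root: Path) -> list[str]:
    """Byte-compile every *.py file under root and collect failures."""
    errors = []
    # Send bytecode to a throwaway directory so the source tree stays untouched.
    with tempfile.TemporaryDirectory() as cache_dir:
        for index, path in enumerate(sorted(root.rglob("*.py"))):
            cfile = str(Path(cache_dir) / f"{index}.pyc")
            try:
                # doraise=True raises PyCompileError instead of printing to stderr.
                py_compile.compile(str(path), cfile=cfile, doraise=True)
            except py_compile.PyCompileError as exc:
                errors.append(f"{path}: {exc.msg}")
    return errors
```

A caller could exit with `2` when the returned list is non-empty, in line with the exit codes listed above.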
|
||||
|
||||
## Expected Tool Structure
|
||||
|
||||
Future tools should use a small self-contained layout:
|
||||
|
||||
```text
tool-name/
  tool_name.py
  README.md
  examples/
    sample-input.log
    sample-report.md
```
|
||||
|
||||
Do not add package metadata, framework scaffolding, or external dependency files unless a future tool has a specific operational reason.
|
||||
|
||||
@@ -0,0 +1,190 @@
|
||||
# auth-log-audit
|
||||
|
||||
`auth-log-audit` is a read-only Python CLI for reviewing local Linux authentication logs. It summarizes suspicious SSH, sudo, su, and PAM authentication patterns that may require operator review during incident response, hardening checks, or access-control evidence gathering.
|
||||
|
||||
The tool analyzes collected log files only. It does not modify logs, query remote systems, or prove compromise.
|
||||
|
||||
## When To Use
|
||||
|
||||
- During incident response when `/var/log/auth.log`, `/var/log/secure`, or an exported authentication log needs a quick first-pass summary.
|
||||
- During Linux hardening or access review when repeated failures, invalid users, root login attempts, or sudo failures need to be surfaced.
|
||||
- Before attaching authentication evidence to an incident, security, problem, or compliance review ticket.
|
||||
- When JSON output is useful for local automation or repeatable reporting.
|
||||
|
||||
## What It Does
|
||||
|
||||
- Reads one local authentication log supplied with `--file`.
|
||||
- Detects common SSH, sudo, su, and PAM authentication events.
|
||||
- Extracts usernames, source IPs, authentication methods, services, timestamps, and sample raw lines where practical.
|
||||
- Aggregates failed login counts by source IP and username.
|
||||
- Flags suspicious source IPs and usernames when failed attempts meet the configured threshold.
|
||||
- Produces text, Markdown, or JSON output.
|
||||
|
||||
## What It Does Not Do
|
||||
|
||||
- It does not detect breaches or prove compromise.
|
||||
- It does not read remote systems or live journal streams.
|
||||
- It does not modify logs, accounts, SSH configuration, sudoers, or host state.
|
||||
- It does not query SIEM, SOC tooling, ELK, Zabbix, identity providers, or ticketing systems.
|
||||
- It does not replace host-specific incident response, access review, or forensic procedures.
|
||||
- It does not classify every vendor-specific authentication message.
|
||||
|
||||
## Supported Input Types
|
||||
|
||||
- Debian/Ubuntu-style `/var/log/auth.log`.
|
||||
- RHEL/Oracle Linux-style `/var/log/secure`.
|
||||
- Exported authentication logs with similar syslog-style lines.
|
||||
- UTF-8 text input is expected. Invalid byte sequences are replaced during read so review can continue.
|
||||
|
||||
Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
|
||||
|
||||
## Supported Event Categories
|
||||
|
||||
SSH-related:
|
||||
|
||||
- Failed SSH password login.
|
||||
- Failed SSH publickey login.
|
||||
- Successful SSH login.
|
||||
- Invalid user attempts.
|
||||
- Root login attempts.
|
||||
- Refused or disallowed user attempts.
|
||||
- Disconnects after failed authentication where detectable.
|
||||
- Too many authentication failures where detectable.
|
||||
|
||||
sudo and su-related:
|
||||
|
||||
- sudo command usage.
|
||||
- sudo authentication failure.
|
||||
- su session opened.
|
||||
- su authentication failure.
|
||||
|
||||
Generic authentication:
|
||||
|
||||
- authentication failure.
|
||||
- `pam_unix` authentication failure.
|
||||
- Account locked messages where detectable.
|
||||
- User not known to the underlying authentication module.
|
||||
|
||||
## Timestamp Handling
|
||||
|
||||
The scanner attempts to parse:
|
||||
|
||||
- `May 11 10:15:30`
|
||||
- `2026-05-11 10:15:30`
|
||||
- `2026-05-11T10:15:30`
|
||||
|
||||
Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and first seen / last seen values are reported as `UNKNOWN` when no parseable event timestamps are found. Syslog timestamps without a year are assigned the current local year internally, while text and Markdown output preserve the original timestamp shape. As a result, logs that span a year boundary or were collected in a previous year can produce misleading first seen / last seen ordering.
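
A minimal sketch of the year-less syslog case, assuming the caller supplies the year to attach. The function name `parse_syslog_timestamp` is illustrative and is not part of the tool's interface.

```python
import re
from datetime import datetime

# Matches a leading syslog timestamp such as "May 11 10:15:30".
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def parse_syslog_timestamp(line: str, assumed_year: int):
    """Return (parsed datetime or None, original timestamp text or "UNKNOWN")."""
    match = SYSLOG_RE.search(line)
    if not match:
        return None, "UNKNOWN"
    raw = match.group(1)
    try:
        # The year is an assumption; the log line itself does not carry one.
        parsed = datetime.strptime(f"{assumed_year} {raw}", "%Y %b %d %H:%M:%S")
    except ValueError:
        # For example, "Feb 29" combined with a non-leap assumed year.
        return None, "UNKNOWN"
    return parsed, raw
```

Keeping the raw text next to the parsed value is what lets a report show the timestamp in its original shape while still sorting events by the parsed value.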
|
||||
|
||||
## Suspicious Activity Model
|
||||
|
||||
Default threshold:
|
||||
|
||||
```text
--threshold-failed 5
```
|
||||
|
||||
The report classifies findings conservatively:
|
||||
|
||||
- `OK` - no suspicious findings.
|
||||
- `WARNING` - repeated failed logins, invalid users, root login attempts below the threshold, or sudo authentication failures.
|
||||
- `CRITICAL` - root login attempts above threshold, high-volume brute-force indicators, or multiple suspicious source IPs above threshold.
|
||||
|
||||
This status is a triage signal. It identifies suspicious authentication patterns that require review; it does not confirm a breach.
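
One way to read the model above is sketched below. The exact rules inside `auth_log_audit.py` are not reproduced here, so the function name, its parameters, and the comparison operators are assumptions chosen to match the wording of the list. The undefined "high-volume brute-force indicators" rule is left out.

```python
def classify_status(
    failed_by_ip: dict[str, int],
    root_attempts: int,
    invalid_users: int,
    sudo_failures: int,
    threshold: int,
) -> str:
    """Map aggregate counts to OK, WARNING, or CRITICAL (illustrative rules)."""
    # Assumption: an IP is suspicious once its failures meet the threshold.
    suspicious_ips = [ip for ip, count in failed_by_ip.items() if count >= threshold]

    # Assumption: "above threshold" means strictly greater than the threshold.
    if root_attempts > threshold or len(suspicious_ips) > 1:
        return "CRITICAL"
    if suspicious_ips or root_attempts or invalid_users or sudo_failures:
        return "WARNING"
    return "OK"
```

With one source IP at 7 failures, 2 root login attempts, and the default threshold of 5, this sketch returns `WARNING`.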
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
cd infra-run/scripts/python/auth-log-audit

python3 auth_log_audit.py --file examples/sample-auth.log
python3 auth_log_audit.py --file examples/sample-secure.log
python3 auth_log_audit.py --file examples/sample-auth.log --format markdown
python3 auth_log_audit.py --file examples/sample-auth.log --format markdown --output auth-report.md
python3 auth_log_audit.py --file examples/sample-auth.log --format json
python3 auth_log_audit.py --file examples/sample-auth.log --top 10
python3 auth_log_audit.py --file examples/sample-auth.log --threshold-failed 5
python3 auth_log_audit.py --file examples/sample-auth.log --ignore-users monitoring,backup,ansible
```
|
||||
|
||||
Ignored users are excluded from suspicious username threshold findings. Their events are still counted in totals and can still appear in top-user summaries so operational context is not silently hidden.
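
The distinction between "excluded from threshold findings" and "still counted" can be sketched as follows. The helper name `suspicious_usernames` is hypothetical, and the `>=` comparison is an assumption.

```python
from collections import Counter


def suspicious_usernames(
    failed_by_user: Counter, threshold: int, ignore_users: list[str]
) -> dict[str, int]:
    """Return users at or above the threshold, skipping ignored names."""
    ignored = set(ignore_users)
    return {
        user: count
        for user, count in failed_by_user.items()
        if count >= threshold and user not in ignored
    }


# Example counts are made up for illustration.
failed_by_user = Counter({"backup": 9, "admin": 6, "alice": 1})
flagged = suspicious_usernames(failed_by_user, threshold=5, ignore_users=["backup"])
```

Only the flagged result is filtered. The counter itself is left unchanged, so an ignored user such as `backup` can still appear in totals and top-user summaries.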
|
||||
|
||||
## Output Formats
|
||||
|
||||
- `text` - default terminal-oriented report.
|
||||
- `markdown` - incident or security ticket attachment format.
|
||||
- `json` - structured output for local automation.
|
||||
|
||||
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file.
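
A guard of this kind can be sketched with `pathlib`. This is an assumption about how such a check might look, not the tool's actual code, and `output_clobbers_input` is a hypothetical name.

```python
from pathlib import Path


def output_clobbers_input(input_path: Path, output_path: Path) -> bool:
    """True when both paths resolve to the same file on disk."""
    # resolve() follows symlinks and collapses ".." segments, so different
    # spellings of the same location compare equal.
    return input_path.resolve() == output_path.resolve()
```

Comparing resolved paths rather than raw strings catches cases such as `./auth.log` versus `sub/../auth.log`, where the text differs but the target file is the same.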
|
||||
|
||||
## Exit Codes
|
||||
|
||||
- `0` - OK, no suspicious findings.
|
||||
- `1` - Suspicious findings detected.
|
||||
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
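
The mapping from overall status to exit code might look like the sketch below. Treating any unrecognized status as `2` is an assumption made for this example, and `exit_code_for_status` is a hypothetical helper name.

```python
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2


def exit_code_for_status(status: str) -> int:
    """Translate a report status into a process exit code."""
    if status == "OK":
        return EXIT_OK
    if status in ("WARNING", "CRITICAL"):
        return EXIT_FINDINGS
    # Assumption: anything else is treated like an invalid or failed run.
    return EXIT_INVALID
```

A wrapper script can then branch on `$?` without parsing the report text.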
|
||||
|
||||
## Example Text Output
|
||||
|
||||
```text
Auth Log Audit
==============

Overall status: WARNING
First seen: May 11 09:58:12
Last seen: May 11 10:07:48

Top Source IPs by Failed Attempts
---------------------------------
- 203.0.113.50: 7
- 198.51.100.23: 1

Suspicious Source IPs
---------------------
- 203.0.113.50: 7

Operational Summary
-------------------
Overall status: WARNING
Total lines scanned: 15
Authentication events detected: 15
Failed logins: 8
Successful logins: 1
Invalid user attempts: 1
Root login attempts: 2
Sudo usage events: 1
Sudo authentication failures: 1
Suspicious source IPs: 1
Suspicious usernames: 0
Threshold used: 5
Ignored users: None
```
|
||||
|
||||
## Markdown Workflow
|
||||
|
||||
Generate a Markdown report from a collected authentication log and attach it to the incident or security ticket as supporting evidence:
|
||||
|
||||
```bash
python3 auth_log_audit.py \
  --file examples/sample-auth.log \
  --format markdown \
  --output auth-report.md
```
|
||||
|
||||
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be reviewed with host access history, SSH configuration, sudo policy, user ownership, and any relevant monitoring evidence.
|
||||
|
||||
## Operational Limitations
|
||||
|
||||
- Pattern matching is intentionally simple and predictable.
|
||||
- A single line may produce more than one event when PAM and service messages overlap.
|
||||
- Syslog timestamps without a year are normalized internally with the current local year.
|
||||
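- Source IP extraction is IPv4-only. The patterns match dotted-quad addresses, so events from IPv6 sources are not attributed to a source address, and SSH patterns that require an address do not match them.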
|
||||
- The tool compares counts, not rates, authentication windows, geolocation, or identity context.
|
||||
- Large log files are read into memory; collect scoped extracts for very large incidents.
|
||||
- Vendor-specific PAM modules or SSH daemon formats may need future patterns.
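
For the large-file limitation above, a scoped extract can be produced before running the tool. The sketch below is a separate illustrative helper, not part of `auth-log-audit`. It assumes the relevant lines can be selected by their leading timestamp text.

```python
from typing import Iterable, Iterator


def scoped_extract(lines: Iterable[str], prefixes: tuple[str, ...]) -> Iterator[str]:
    """Yield only lines that start with one of the given timestamp prefixes."""
    for line in lines:
        # str.startswith accepts a tuple and matches any of its entries.
        if line.startswith(prefixes):
            yield line


# Example lines are made up for illustration.
sample = [
    "May 11 09:58:12 host sshd[1]: Failed password for root from 203.0.113.50",
    "May 11 10:07:48 host sshd[2]: Accepted publickey for alice from 198.51.100.23",
    "May 12 00:00:01 host CRON[3]: session opened",
]
window = list(scoped_extract(sample, ("May 11 10:",)))
```

Passing an open file handle as `lines` keeps memory use flat, because the file is iterated one line at a time instead of being read whole.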
|
||||
|
||||
## Safety Notes
|
||||
|
||||
- The tool only reads the input log and optionally writes a separate report.
|
||||
- The implementation uses the Python standard library only and does not require package installation.
|
||||
- It does not require elevated privileges unless the chosen log path requires them.
|
||||
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
|
||||
- Treat operational findings as prompts that require review; the tool does not prove compromise or determine root cause automatically.
|
||||
@@ -0,0 +1,734 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Summarize suspicious authentication activity in local Linux auth logs."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter, defaultdict
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
EXIT_OK = 0
|
||||
EXIT_FINDINGS = 1
|
||||
EXIT_INVALID = 2
|
||||
|
||||
UNKNOWN = "UNKNOWN"
|
||||
|
||||
ISO_TIMESTAMP_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
|
||||
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
|
||||
SERVICE_RE = re.compile(r"\s([A-Za-z0-9_.-]+)(?:\[\d+\])?:\s")
|
||||
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
|
||||
|
||||
|
||||
EVENT_PATTERNS = [
|
||||
{
|
||||
"event_type": "failed_ssh_password",
|
||||
"category": "failed_login",
|
||||
"method": "password",
|
||||
"regex": re.compile(
|
||||
r"sshd(?:\[\d+\])?: Failed password for (?:(invalid user) )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
|
||||
),
|
||||
},
|
||||
{
|
||||
"event_type": "failed_ssh_publickey",
|
||||
"category": "failed_login",
|
||||
"method": "publickey",
|
||||
"regex": re.compile(
|
||||
r"sshd(?:\[\d+\])?: Failed publickey for (?:(invalid user) )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
|
||||
),
|
||||
},
|
||||
{
|
||||
"event_type": "successful_ssh_login",
|
||||
"category": "successful_login",
|
||||
"method": None,
|
||||
"regex": re.compile(
|
||||
r"sshd(?:\[\d+\])?: Accepted (\S+) for (\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
|
||||
),
|
||||
},
|
||||
{
|
||||
"event_type": "invalid_user_attempt",
|
||||
"category": "invalid_user",
|
||||
"method": None,
|
||||
"regex": re.compile(
|
||||
r"sshd(?:\[\d+\])?: Invalid user (\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
|
||||
),
|
||||
},
|
||||
{
|
||||
"event_type": "refused_user_attempt",
|
||||
"category": "refused_user",
|
||||
"method": None,
|
||||
"regex": re.compile(
|
||||
r"sshd(?:\[\d+\])?: (?:User|Connection closed by invalid user) (\S+).*?from ((?:\d{1,3}\.){3}\d{1,3})"
|
||||
),
|
||||
},
|
||||
{
|
||||
"event_type": "disconnect_after_failed_auth",
|
||||
"category": "disconnect_after_failed_auth",
|
||||
"method": None,
|
||||
"regex": re.compile(
|
||||
r"sshd(?:\[\d+\])?: Disconnected from (?:authenticating user \S+ |invalid user \S+ )?((?:\d{1,3}\.){3}\d{1,3}).*(?:preauth|Too many authentication failures)"
|
||||
),
|
||||
},
|
||||
{
|
||||
"event_type": "too_many_auth_failures",
|
||||
"category": "failed_login",
|
||||
"method": None,
|
||||
"regex": re.compile(
|
||||
r"sshd(?:\[\d+\])?: .*(?:Too many authentication failures|maximum authentication attempts exceeded).*"
|
||||
),
|
||||
},
|
||||
{
|
||||
"event_type": "sudo_command",
|
||||
"category": "sudo_usage",
|
||||
"method": None,
|
||||
"regex": re.compile(r"sudo(?:\[\d+\])?:\s+(\S+)\s+:\s+TTY=.*COMMAND=(.+)$"),
|
||||
},
|
||||
{
|
||||
"event_type": "sudo_auth_failure",
|
||||
"category": "sudo_failure",
|
||||
"method": None,
|
||||
"regex": re.compile(r"sudo(?:\[\d+\])?: pam_unix\(sudo:auth\): authentication failure;.*"),
|
||||
},
|
||||
{
|
||||
"event_type": "su_session_opened",
|
||||
"category": "su_event",
|
||||
"method": None,
|
||||
"regex": re.compile(r"su(?:\[\d+\])?: pam_unix\(su(?:-l)?:session\): session opened for user (\S+)"),
|
||||
},
|
||||
{
|
||||
"event_type": "su_auth_failure",
|
||||
"category": "su_event",
|
||||
"method": None,
|
||||
"regex": re.compile(r"su(?:\[\d+\])?: pam_unix\(su(?:-l)?:auth\): authentication failure;.*"),
|
||||
},
|
||||
{
|
||||
"event_type": "pam_unix_auth_failure",
|
||||
"category": "generic_auth_failure",
|
||||
"method": None,
|
||||
"regex": re.compile(r"pam_unix\([^)]*:auth\): authentication failure;.*"),
|
||||
},
|
||||
{
|
||||
"event_type": "user_unknown",
|
||||
"category": "generic_auth_failure",
|
||||
"method": None,
|
||||
"regex": re.compile(r"user (?:unknown|not known to the underlying authentication module)"),
|
||||
},
|
||||
{
|
||||
"event_type": "account_locked",
|
||||
"category": "generic_auth_failure",
|
||||
"method": None,
|
||||
"regex": re.compile(r"(?:account locked|authentication failure;.*account locked)", re.IGNORECASE),
|
||||
},
|
||||
]
|
||||
|
||||
FAILED_CATEGORIES = {"failed_login", "generic_auth_failure"}
|
||||
SAMPLE_CATEGORIES = [
|
||||
"failed_login",
|
||||
"invalid_user",
|
||||
"root_login_attempt",
|
||||
"sudo_failure",
|
||||
"suspicious_source_ip",
|
||||
]
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Analyze local Linux authentication logs for suspicious patterns."
|
||||
)
|
||||
parser.add_argument("--file", required=True, help="Local auth.log or secure file to analyze.")
|
||||
parser.add_argument(
|
||||
"--format",
|
||||
choices=("text", "markdown", "json"),
|
||||
default="text",
|
||||
help="Report format. Default: text.",
|
||||
)
|
||||
parser.add_argument("--output", help="Write report to this path instead of stdout.")
|
||||
parser.add_argument(
|
||||
"--top",
|
||||
type=positive_int,
|
||||
default=10,
|
||||
help="Number of top IPs, usernames, and event types to display. Default: 10.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--threshold-failed",
|
||||
type=positive_int,
|
||||
default=5,
|
||||
help="Failed attempt threshold for suspicious IPs and usernames. Default: 5.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--ignore-users",
|
||||
default="",
|
||||
help="Comma-separated usernames excluded from suspicious username thresholds.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max-samples",
|
||||
type=non_negative_int,
|
||||
default=3,
|
||||
help="Maximum sample lines per finding category. Default: 3.",
|
||||
)
|
||||
return parser
|
||||
|
||||
|
||||
def positive_int(value: str) -> int:
|
||||
try:
|
||||
number = int(value)
|
||||
except ValueError as exc:
|
||||
raise argparse.ArgumentTypeError("must be a positive integer") from exc
|
||||
if number <= 0:
|
||||
raise argparse.ArgumentTypeError("must be a positive integer")
|
||||
return number
|
||||
|
||||
|
||||
def non_negative_int(value: str) -> int:
|
||||
try:
|
||||
number = int(value)
|
||||
except ValueError as exc:
|
||||
raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
|
||||
if number < 0:
|
||||
raise argparse.ArgumentTypeError("must be zero or a positive integer")
|
||||
return number
|
||||
|
||||
|
||||
def parse_ignore_users(value: str) -> list[str]:
|
||||
if not value.strip():
|
||||
return []
|
||||
users = []
|
||||
for item in value.split(","):
|
||||
user = item.strip()
|
||||
if user:
|
||||
users.append(user)
|
||||
return sorted(set(users))
|
||||
|
||||
|
||||
def read_log_file(path: Path) -> list[str]:
|
||||
if not path.exists():
|
||||
raise OSError(f"file does not exist: {path}")
|
||||
if not path.is_file():
|
||||
raise OSError(f"path is not a regular file: {path}")
|
||||
try:
|
||||
text = path.read_text(encoding="utf-8", errors="replace")
|
||||
except PermissionError as exc:
|
||||
raise OSError(f"file is not readable: {path}") from exc
|
||||
except OSError as exc:
|
||||
raise OSError(f"unable to read file {path}: {exc}") from exc
|
||||
if text == "":
|
||||
raise ValueError(f"file is empty: {path}")
|
||||
return text.splitlines()
|
||||
|
||||
|
||||
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        try:
            return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S"), raw
        except ValueError:
            return None, UNKNOWN

    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        normalized = f"{syslog_year} {raw}"
        try:
            parsed = datetime.strptime(normalized, "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw

    return None, UNKNOWN
|
||||
|
||||
|
||||
def render_seen(value: tuple[datetime, str] | None) -> str:
|
||||
if value is None:
|
||||
return UNKNOWN
|
||||
return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
|
||||
|
||||
|
||||
def extract_service(line: str) -> str:
|
||||
match = SERVICE_RE.search(line)
|
||||
if match:
|
||||
return match.group(1)
|
||||
return UNKNOWN
|
||||
|
||||
|
||||
def extract_ip(line: str) -> str:
|
||||
match = IP_RE.search(line)
|
||||
if match:
|
||||
return match.group(0)
|
||||
return UNKNOWN
|
||||
|
||||
|
||||
def extract_user_from_key_values(line: str) -> str:
|
||||
for pattern in (
|
||||
r"\buser=([A-Za-z0-9_.@-]+)",
|
||||
r"\bruser=([A-Za-z0-9_.@-]+)",
|
||||
r"\bUSER=([A-Za-z0-9_.@-]+)",
|
||||
):
|
||||
match = re.search(pattern, line)
|
||||
if match and match.group(1):
|
||||
return match.group(1)
|
||||
return UNKNOWN
|
||||
|
||||
|
||||
def event_from_match(line: str, pattern: dict[str, Any], match: re.Match[str]) -> dict[str, Any]:
|
||||
event_type = pattern["event_type"]
|
||||
username = UNKNOWN
|
||||
source_ip = extract_ip(line)
|
||||
method = pattern["method"] or UNKNOWN
|
||||
|
||||
if event_type in ("failed_ssh_password", "failed_ssh_publickey"):
|
||||
username = match.group(2)
|
||||
source_ip = match.group(3)
|
||||
elif event_type == "successful_ssh_login":
|
||||
method = match.group(1)
|
||||
username = match.group(2)
|
||||
source_ip = match.group(3)
|
||||
elif event_type in ("invalid_user_attempt", "refused_user_attempt"):
|
||||
username = match.group(1)
|
||||
source_ip = match.group(2)
|
||||
elif event_type == "sudo_command":
|
||||
username = match.group(1)
|
||||
elif event_type == "su_session_opened":
|
||||
username = match.group(1).rstrip(")")
|
||||
elif event_type in ("sudo_auth_failure", "su_auth_failure", "pam_unix_auth_failure"):
|
||||
username = extract_user_from_key_values(line)
|
||||
|
||||
if username == "root" and event_type in (
|
||||
"failed_ssh_password",
|
||||
"failed_ssh_publickey",
|
||||
"successful_ssh_login",
|
||||
"invalid_user_attempt",
|
||||
"refused_user_attempt",
|
||||
):
|
||||
event_type = "root_login_attempt"
|
||||
|
||||
return {
|
||||
"event_type": event_type,
|
||||
"category": pattern["category"],
|
||||
"username": username or UNKNOWN,
|
||||
"source_ip": source_ip or UNKNOWN,
|
||||
"method": method,
|
||||
"service": extract_service(line),
|
||||
"raw": line,
|
||||
}
|
||||
|
||||
|
||||
def detect_events(line: str) -> list[dict[str, Any]]:
|
||||
events = []
|
||||
for pattern in EVENT_PATTERNS:
|
||||
match = pattern["regex"].search(line)
|
||||
if match:
|
||||
events.append(event_from_match(line, pattern, match))
|
||||
|
||||
if any(event["event_type"] in ("sudo_auth_failure", "su_auth_failure") for event in events):
|
||||
events = [
|
||||
event for event in events if event["event_type"] != "pam_unix_auth_failure"
|
||||
]
|
||||
|
||||
if "authentication failure" in line and not events:
|
||||
events.append(
|
||||
{
|
||||
"event_type": "authentication_failure",
|
||||
"category": "generic_auth_failure",
|
||||
"username": extract_user_from_key_values(line),
|
||||
"source_ip": extract_ip(line),
|
||||
"method": UNKNOWN,
|
||||
"service": extract_service(line),
|
||||
"raw": line,
|
||||
}
|
||||
)
|
||||
return dedupe_events(events)
|
||||
|
||||
|
||||
def dedupe_events(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    deduped = []
    seen = set()
    for event in events:
        key = (event["event_type"], event["username"], event["source_ip"], event["raw"])
        if key in seen:
            continue
        seen.add(key)
        deduped.append(event)
    return deduped


def append_sample(samples: dict[str, list[str]], category: str, line: str, max_samples: int) -> None:
    if max_samples == 0:
        return
    if len(samples[category]) < max_samples:
        samples[category].append(line)


def update_seen(
    first_seen: tuple[datetime, str] | None,
    last_seen: tuple[datetime, str] | None,
    parsed_at: datetime | None,
    rendered_at: str,
) -> tuple[tuple[datetime, str] | None, tuple[datetime, str] | None]:
    if parsed_at is None:
        return first_seen, last_seen
    if first_seen is None or parsed_at < first_seen[0]:
        first_seen = (parsed_at, rendered_at)
    if last_seen is None or parsed_at > last_seen[0]:
        last_seen = (parsed_at, rendered_at)
    return first_seen, last_seen


def analyze_log(
    lines: list[str],
    threshold_failed: int,
    ignore_users: list[str],
    top: int,
    max_samples: int,
) -> dict[str, Any]:
    syslog_year = datetime.now().year
    events = []
    samples: dict[str, list[str]] = defaultdict(list)
    event_type_counts: Counter[str] = Counter()
    failed_by_ip: Counter[str] = Counter()
    failed_by_user: Counter[str] = Counter()
    success_by_ip: Counter[str] = Counter()
    success_by_user: Counter[str] = Counter()
    first_seen: tuple[datetime, str] | None = None
    last_seen: tuple[datetime, str] | None = None

    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        line_events = detect_events(line)
        if not line_events:
            continue

        first_seen, last_seen = update_seen(first_seen, last_seen, parsed_at, rendered_at)
        for event in line_events:
            event["timestamp"] = rendered_at
            events.append(event)
            event_type_counts[event["event_type"]] += 1

            category = event["category"]
            username = event["username"]
            source_ip = event["source_ip"]

            if event["event_type"] == "root_login_attempt":
                append_sample(samples, "root_login_attempt", line, max_samples)
                category = "failed_login"

            if category in FAILED_CATEGORIES:
                if source_ip != UNKNOWN:
                    failed_by_ip[source_ip] += 1
                if username != UNKNOWN:
                    failed_by_user[username] += 1
                append_sample(samples, "failed_login", line, max_samples)

            if category == "successful_login":
                if source_ip != UNKNOWN:
                    success_by_ip[source_ip] += 1
                if username != UNKNOWN:
                    success_by_user[username] += 1

            if category == "invalid_user":
                append_sample(samples, "invalid_user", line, max_samples)
            if category == "sudo_failure":
                append_sample(samples, "sudo_failure", line, max_samples)

    suspicious_ips = {
        ip: count for ip, count in failed_by_ip.items() if count >= threshold_failed
    }
    suspicious_users = {
        user: count
        for user, count in failed_by_user.items()
        if count >= threshold_failed and user not in ignore_users
    }

    for event in events:
        if event["source_ip"] in suspicious_ips:
            append_sample(samples, "suspicious_source_ip", event["raw"], max_samples)

    summary = build_summary(
        lines=lines,
        events=events,
        failed_by_ip=failed_by_ip,
        failed_by_user=failed_by_user,
        suspicious_ips=suspicious_ips,
        suspicious_users=suspicious_users,
        event_type_counts=event_type_counts,
        threshold_failed=threshold_failed,
        ignore_users=ignore_users,
        first_seen=first_seen,
        last_seen=last_seen,
    )

    return {
        "summary": summary,
        "top_source_ips_by_failed_attempts": top_items(failed_by_ip, top),
        "top_usernames_by_failed_attempts": top_items(failed_by_user, top),
        "top_source_ips_by_successful_logins": top_items(success_by_ip, top),
        "top_usernames_by_successful_logins": top_items(success_by_user, top),
        "top_event_types": top_items(event_type_counts, top),
        "suspicious_source_ips": sorted_count_items(suspicious_ips),
        "suspicious_usernames": sorted_count_items(suspicious_users),
        "samples": {category: samples.get(category, []) for category in SAMPLE_CATEGORIES},
    }


def build_summary(
    lines: list[str],
    events: list[dict[str, Any]],
    failed_by_ip: Counter[str],
    failed_by_user: Counter[str],
    suspicious_ips: dict[str, int],
    suspicious_users: dict[str, int],
    event_type_counts: Counter[str],
    threshold_failed: int,
    ignore_users: list[str],
    first_seen: tuple[datetime, str] | None,
    last_seen: tuple[datetime, str] | None,
) -> dict[str, Any]:
    root_attempts = event_type_counts["root_login_attempt"]
    sudo_failures = event_type_counts["sudo_auth_failure"]
    invalid_users = event_type_counts["invalid_user_attempt"]
    high_volume_ips = sum(1 for count in suspicious_ips.values() if count >= threshold_failed * 2)
    high_volume_users = sum(1 for count in suspicious_users.values() if count >= threshold_failed * 2)

    if (
        root_attempts >= threshold_failed
        or high_volume_ips > 0
        or high_volume_users > 0
        or len(suspicious_ips) >= 2
    ):
        status = "CRITICAL"
    elif suspicious_ips or suspicious_users or invalid_users > 0 or sudo_failures > 0 or root_attempts > 0:
        status = "WARNING"
    else:
        status = "OK"

    return {
        "overall_status": status,
        "first_seen": render_seen(first_seen),
        "last_seen": render_seen(last_seen),
        "total_lines_scanned": len(lines),
        "authentication_events_detected": len(events),
        "failed_login_count": sum(failed_by_ip.values()),
        "successful_login_count": event_type_counts["successful_ssh_login"],
        "invalid_user_count": invalid_users,
        "root_login_attempt_count": root_attempts,
        "sudo_command_count": event_type_counts["sudo_command"],
        "sudo_failure_count": sudo_failures,
        "su_event_count": event_type_counts["su_session_opened"] + event_type_counts["su_auth_failure"],
        "suspicious_source_ip_count": len(suspicious_ips),
        "suspicious_username_count": len(suspicious_users),
        "threshold_failed": threshold_failed,
        "ignored_users": ignore_users,
    }


def top_items(counter: Counter[str], limit: int) -> list[dict[str, Any]]:
    return [{"value": value, "count": count} for value, count in counter.most_common(limit)]


def sorted_count_items(items: dict[str, int]) -> list[dict[str, Any]]:
    return [
        {"value": value, "count": count}
        for value, count in sorted(items.items(), key=lambda item: (-item[1], item[0]))
    ]


def render_text(report: dict[str, Any]) -> str:
    summary = report["summary"]
    lines = [
        "Auth Log Audit",
        "==============",
        "",
        f"Overall status: {summary['overall_status']}",
        f"First seen: {summary['first_seen']}",
        f"Last seen: {summary['last_seen']}",
        "",
    ]

    lines.extend(render_text_table("Top Source IPs by Failed Attempts", report["top_source_ips_by_failed_attempts"]))
    lines.extend(render_text_table("Top Usernames by Failed Attempts", report["top_usernames_by_failed_attempts"]))
    lines.extend(render_text_table("Top Source IPs by Successful Logins", report["top_source_ips_by_successful_logins"]))
    lines.extend(render_text_table("Top Usernames by Successful Logins", report["top_usernames_by_successful_logins"]))
    lines.extend(render_text_table("Suspicious Source IPs", report["suspicious_source_ips"]))
    lines.extend(render_text_table("Suspicious Usernames", report["suspicious_usernames"]))
    lines.extend(render_text_table("Top Event Types", report["top_event_types"]))
    lines.extend(render_text_samples(report["samples"]))
    lines.extend(render_text_summary(summary))
    return "\n".join(lines) + "\n"


def render_text_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [title, "-" * len(title)]
    if not rows:
        lines.append("No entries detected.")
    else:
        for item in rows:
            lines.append(f"- {item['value']}: {item['count']}")
    lines.append("")
    return lines


def render_text_samples(samples: dict[str, list[str]]) -> list[str]:
    lines = ["Sample Log Lines", "----------------"]
    for category in SAMPLE_CATEGORIES:
        lines.append(f"{category}:")
        if samples.get(category):
            lines.extend(f" - {sample}" for sample in samples[category])
        else:
            lines.append(" - No samples retained")
        lines.append("")
    return lines


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    ignored = ", ".join(summary["ignored_users"]) if summary["ignored_users"] else "None"
    return [
        "Operational Summary",
        "-------------------",
        f"Overall status: {summary['overall_status']}",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Authentication events detected: {summary['authentication_events_detected']}",
        f"Failed logins: {summary['failed_login_count']}",
        f"Successful logins: {summary['successful_login_count']}",
        f"Invalid user attempts: {summary['invalid_user_count']}",
        f"Root login attempts: {summary['root_login_attempt_count']}",
        f"Sudo usage events: {summary['sudo_command_count']}",
        f"Sudo authentication failures: {summary['sudo_failure_count']}",
        f"su events: {summary['su_event_count']}",
        f"Suspicious source IPs: {summary['suspicious_source_ip_count']}",
        f"Suspicious usernames: {summary['suspicious_username_count']}",
        f"Threshold used: {summary['threshold_failed']}",
        f"Ignored users: {ignored}",
    ]


def render_markdown(report: dict[str, Any]) -> str:
    summary = report["summary"]
    lines = [
        "# Auth Log Audit",
        "",
        f"- Overall status: {summary['overall_status']}",
        f"- First seen: {summary['first_seen']}",
        f"- Last seen: {summary['last_seen']}",
        "",
    ]

    lines.extend(render_markdown_table("Top Source IPs by Failed Attempts", report["top_source_ips_by_failed_attempts"]))
    lines.extend(render_markdown_table("Top Usernames by Failed Attempts", report["top_usernames_by_failed_attempts"]))
    lines.extend(render_markdown_table("Top Source IPs by Successful Logins", report["top_source_ips_by_successful_logins"]))
    lines.extend(render_markdown_table("Top Usernames by Successful Logins", report["top_usernames_by_successful_logins"]))
    lines.extend(render_markdown_table("Suspicious Source IPs", report["suspicious_source_ips"]))
    lines.extend(render_markdown_table("Suspicious Usernames", report["suspicious_usernames"]))
    lines.extend(render_markdown_table("Top Event Types", report["top_event_types"]))
    lines.extend(render_markdown_samples(report["samples"]))

    ignored = ", ".join(summary["ignored_users"]) if summary["ignored_users"] else "None"
    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Overall status: {summary['overall_status']}",
            f"- Total lines scanned: {summary['total_lines_scanned']}",
            f"- Authentication events detected: {summary['authentication_events_detected']}",
            f"- Failed logins: {summary['failed_login_count']}",
            f"- Successful logins: {summary['successful_login_count']}",
            f"- Invalid user attempts: {summary['invalid_user_count']}",
            f"- Root login attempts: {summary['root_login_attempt_count']}",
            f"- Sudo usage events: {summary['sudo_command_count']}",
            f"- Sudo authentication failures: {summary['sudo_failure_count']}",
            f"- su events: {summary['su_event_count']}",
            f"- Suspicious source IPs: {summary['suspicious_source_ip_count']}",
            f"- Suspicious usernames: {summary['suspicious_username_count']}",
            f"- Threshold used: {summary['threshold_failed']}",
            f"- Ignored users: {ignored}",
            "",
        ]
    )
    return "\n".join(lines)


def render_markdown_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [f"## {title}", ""]
    if not rows:
        lines.extend(["No entries detected.", ""])
        return lines
    lines.extend(["| Value | Count |", "| --- | ---: |"])
    lines.extend(f"| {item['value']} | {item['count']} |" for item in rows)
    lines.append("")
    return lines


def render_markdown_samples(samples: dict[str, list[str]]) -> list[str]:
    lines = ["## Sample Log Lines", ""]
    for category in SAMPLE_CATEGORIES:
        lines.extend([f"### {category}", ""])
        if samples.get(category):
            lines.append("```text")
            lines.extend(samples[category])
            lines.append("```")
        else:
            lines.append("_No samples retained._")
        lines.append("")
    return lines


def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"


def write_report(input_path: Path, output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return

    path = Path(output_path)
    try:
        if path.resolve() == input_path.resolve():
            raise OSError("output path must not be the same as input file")
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    input_path = Path(args.file)
    ignore_users = parse_ignore_users(args.ignore_users)

    try:
        lines = read_log_file(input_path)
        report = analyze_log(
            lines=lines,
            threshold_failed=args.threshold_failed,
            ignore_users=ignore_users,
            top=args.top,
            max_samples=args.max_samples,
        )

        if args.format == "text":
            content = render_text(report)
        elif args.format == "markdown":
            content = render_markdown(report)
        else:
            content = render_json(report)

        write_report(input_path, args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID

    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,112 @@
# Auth Log Audit

- Overall status: WARNING
- First seen: May 11 09:58:12
- Last seen: May 11 10:07:48

## Top Source IPs by Failed Attempts

| Value | Count |
| --- | ---: |
| 203.0.113.50 | 7 |
| 198.51.100.23 | 1 |

## Top Usernames by Failed Attempts

| Value | Count |
| --- | ---: |
| appuser | 3 |
| root | 2 |
| admin | 1 |
| backup | 1 |

## Top Source IPs by Successful Logins

| Value | Count |
| --- | ---: |
| 10.20.30.15 | 1 |

## Top Usernames by Successful Logins

| Value | Count |
| --- | ---: |
| deploy | 1 |

## Suspicious Source IPs

| Value | Count |
| --- | ---: |
| 203.0.113.50 | 7 |

## Suspicious Usernames

No entries detected.

## Top Event Types

| Value | Count |
| --- | ---: |
| failed_ssh_password | 4 |
| root_login_attempt | 2 |
| successful_ssh_login | 1 |
| sudo_command | 1 |
| invalid_user_attempt | 1 |
| disconnect_after_failed_auth | 1 |
| failed_ssh_publickey | 1 |
| sudo_auth_failure | 1 |
| su_session_opened | 1 |
| refused_user_attempt | 1 |

## Sample Log Lines

### failed_login

```text
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```

### invalid_user

```text
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
```

### root_login_attempt

```text
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```

### sudo_failure

```text
May 11 10:04:20 web01 sudo: pam_unix(sudo:auth): authentication failure; logname=deploy uid=1001 euid=0 tty=/dev/pts/0 ruser=deploy rhost= user=deploy
```

### suspicious_source_ip

```text
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```

## Operational Summary

- Overall status: WARNING
- Total lines scanned: 15
- Authentication events detected: 15
- Failed logins: 8
- Successful logins: 1
- Invalid user attempts: 1
- Root login attempts: 2
- Sudo usage events: 1
- Sudo authentication failures: 1
- su events: 1
- Suspicious source IPs: 1
- Suspicious usernames: 0
- Threshold used: 5
- Ignored users: None
@@ -0,0 +1,15 @@
May 11 09:58:12 web01 sshd[1201]: Accepted publickey for deploy from 10.20.30.15 port 52214 ssh2: ED25519 SHA256:samplekey
May 11 10:00:01 web01 sudo: deploy : TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/usr/bin/systemctl status nginx
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:11 web01 sshd[1224]: Disconnected from authenticating user root 203.0.113.50 port 45012 [preauth]
May 11 10:03:10 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:14 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:18 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:41 web01 sshd[1238]: Failed publickey for backup from 198.51.100.23 port 50222 ssh2
May 11 10:04:20 web01 sudo: pam_unix(sudo:auth): authentication failure; logname=deploy uid=1001 euid=0 tty=/dev/pts/0 ruser=deploy rhost= user=deploy
May 11 10:05:02 web01 su[1244]: pam_unix(su:session): session opened for user root by deploy(uid=1001)
May 11 10:06:31 web01 sshd[1250]: User testuser from 192.0.2.77 not allowed because not listed in AllowUsers
May 11 10:07:48 web01 sshd[1254]: error: maximum authentication attempts exceeded for invalid user oracle from 203.0.113.50 port 45200 ssh2 [preauth]
@@ -0,0 +1,14 @@
May 11 09:52:44 db01 sshd[2110]: Accepted publickey for admin from 10.40.10.25 port 60124 ssh2: RSA SHA256:samplekey
May 11 09:55:10 db01 sudo[2120]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/systemctl restart auditd
May 11 09:55:10 db01 sudo[2120]: pam_unix(sudo:session): session opened for user root(uid=0) by admin(uid=1000)
May 11 10:00:01 db01 sshd[2130]: Failed password for invalid user postgres from 198.51.100.90 port 42101 ssh2
May 11 10:00:03 db01 sshd[2130]: Invalid user postgres from 198.51.100.90 port 42101
May 11 10:00:09 db01 sshd[2132]: Failed password for root from 198.51.100.90 port 42105 ssh2
May 11 10:00:13 db01 sshd[2132]: Failed password for root from 198.51.100.90 port 42105 ssh2
May 11 10:00:20 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:00:25 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:00:31 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:01:12 db01 su[2142]: pam_unix(su:auth): authentication failure; logname=admin uid=1000 euid=0 tty=pts/1 ruser=admin rhost= user=root
May 11 10:01:45 db01 sshd[2149]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.77 user=monitoring
May 11 10:02:03 db01 sshd[2154]: error: PAM: User not known to the underlying authentication module for illegal user deploy from 203.0.113.77
May 11 10:02:36 db01 sshd[2159]: Disconnecting authenticating user oracle 198.51.100.90 port 42111: Too many authentication failures [preauth]
@@ -0,0 +1,159 @@
# incident-log-summary

`incident-log-summary` is a read-only Python CLI for quick incident log review. It scans a local Linux system log or application log and groups configured operational patterns by severity, count, timestamps, and sample lines.

The tool is meant for first-pass triage and incident notes. It does not replace full log search, alert correlation, service-specific runbooks, or review by an operator who understands the affected platform.

## When To Use

- During incident response when a collected log file needs a fast pattern summary.
- Before attaching evidence to an incident, problem, or change ticket.
- When comparing whether a log contains obvious storage, memory, service, TLS, HTTP, or connectivity failures.
- When JSON output is useful for later local automation.

## What It Does Not Do

- It does not read remote systems.
- It does not modify logs or system state.
- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
- It does not prove root cause.
- It does not classify every possible vendor or application error.
- It does not treat sanitized examples as production validation.

## Supported Input

- One local text log file provided with `--file`.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.

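The input rules above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual reader: the helper name `read_log_lines_sketch` and the error messages are assumptions, and the real script may validate paths differently.

```python
from pathlib import Path


def read_log_lines_sketch(path: Path) -> list:
    """Hypothetical helper: validate the path, then read it tolerantly."""
    if not path.exists():
        raise ValueError(f"input file not found: {path}")
    if not path.is_file():
        raise ValueError(f"input path is not a regular file: {path}")
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
    # so one bad byte does not stop the review. An unreadable file still
    # raises OSError from read_text().
    text = path.read_text(encoding="utf-8", errors="replace")
    if not text.strip():
        raise ValueError(f"input file is empty: {path}")
    return text.splitlines()
```

A caller would be expected to catch `ValueError` and `OSError` and map both to exit code `2`.
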
## Supported Patterns

Critical patterns:

- `CRITICAL`
- `FATAL`
- `panic`
- `kernel panic`
- `no space left on device`
- `out of memory`
- `killed process`
- `read-only file system`
- `segmentation fault`
- `segfault`
- `certificate expired`
- `TLS handshake failed`
- `SSLHandshakeException`
- `database unavailable`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`

Warning patterns:

- `ERROR`
- `failed`
- `failure`
- `timeout`
- `connection refused`
- `connection reset`
- `permission denied`
- `authentication failed`
- `denied`
- `unavailable`
- `service restart`
- `retrying`

By default, matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across all configured patterns.

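The effect of the case-sensitivity switch can be illustrated with a short sketch. It assumes literal substring-style matching through `re` and uses a small, hand-picked subset of the documented patterns; the names `PATTERNS` and `match_patterns` are hypothetical and the real script may match differently.

```python
import re

# Assumed subset of the documented patterns, for illustration only.
PATTERNS = ["ERROR", "HTTP 503", "unavailable", "failed"]


def match_patterns(line: str, ignore_case: bool = False) -> list:
    """Return every pattern found in the line, in PATTERNS order."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape treats each pattern as literal text, not as a regex.
    return [p for p in PATTERNS if re.search(re.escape(p), line, flags)]


line = "systemd[1]: Failed to start app.service: error"
print(match_patterns(line))                    # case-sensitive: nothing matches
print(match_patterns(line, ignore_case=True))  # matches ERROR and failed
```

A line such as `ERROR HTTP 503 upstream unavailable` matches three patterns at once under this sketch, which is why one log line can contribute to several finding groups.
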
## Timestamp Handling

The scanner attempts to parse:

- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `May 11 10:15:30`

Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and date filtering keeps those lines by default so potentially important findings are not silently discarded.

Syslog-style timestamps do not include a year. For filtering, the tool uses the year from `--since` when present, otherwise the current local year.

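A best-effort parser for the three formats above might look like the sketch below. The regular expressions, the function name `parse_timestamp_sketch`, and the `default_year` parameter are assumptions made for illustration; they are not taken from the tool. Note that `%b` depends on the process locale, so month names such as `May` are assumed to be English.

```python
import re
from datetime import datetime
from typing import Optional

ISO_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})")
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})")


def parse_timestamp_sketch(line: str, default_year: int) -> Optional[datetime]:
    """Return a datetime for the line's leading timestamp, or None."""
    try:
        iso = ISO_RE.match(line)
        if iso:
            stamp = f"{iso.group(1)} {iso.group(2)}"
            return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        syslog = SYSLOG_RE.match(line)
        if syslog:
            month, day, clock = syslog.groups()
            # Syslog lines carry no year, so the caller supplies one.
            stamp = f"{default_year} {month} {day} {clock}"
            return datetime.strptime(stamp, "%Y %b %d %H:%M:%S")
    except ValueError:
        # Looks like a timestamp but is not a valid date or time.
        return None
    return None
```

Returning `None` rather than raising lets the caller keep the line in the analysis, which matches the retention behavior described above.
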
## Usage

```bash
cd infra-run/scripts/python/incident-log-summary

python3 incident_log_summary.py --file examples/system-messages.log
python3 incident_log_summary.py --file examples/app-error.log --format markdown --output incident-report.md
python3 incident_log_summary.py --file examples/app-error.log --format json
python3 incident_log_summary.py --file examples/app-error.log --top 20
python3 incident_log_summary.py --file examples/app-error.log --ignore-case
python3 incident_log_summary.py --file examples/app-error.log --since "2026-05-11 10:00:00"
python3 incident_log_summary.py --file examples/app-error.log --until "2026-05-11 12:00:00"
```

## Output Formats

- `text` - default terminal-oriented report.
- `markdown` - incident or change ticket attachment format.
- `json` - structured output for local automation.

Use `--output <path>` to write the rendered report to a file. Without `--output`, the report is printed to stdout.

## Exit Codes

- `0` - OK, no findings.
- `1` - Operational findings detected.
- `2` - Invalid input, unreadable file, bad argument, or runtime error.

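Local automation can branch on these exit codes. The wrapper below is a rough sketch under stated assumptions: the script is reachable at the relative path `incident_log_summary.py`, and the helper names `interpret_exit_code` and `run_summary` are hypothetical. Only the documented `--file` and `--format` options are used.

```python
import subprocess
import sys

# Meanings copied from the exit code list above.
EXIT_MEANINGS = {
    0: "OK, no findings",
    1: "operational findings detected",
    2: "invalid input, unreadable file, bad argument, or runtime error",
}


def interpret_exit_code(code: int) -> str:
    """Translate a documented exit code into a short description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_summary(log_path: str, script: str = "incident_log_summary.py") -> int:
    """Run the CLI (assumed script path) and return its exit code."""
    result = subprocess.run(
        [sys.executable, script, "--file", log_path, "--format", "json"],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means findings, not a crash
    )
    return result.returncode
```

Passing `check=False` matters here because exit code `1` is a normal outcome that signals findings, so it should not be raised as an exception.
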
## Example Text Output

```text
Incident Log Summary
====================

[CRITICAL] no space left on device
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Samples:
- May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages

Operational Summary
-------------------
Total lines scanned: 7
Total findings: 7
Critical finding groups: 3
Warning finding groups: 4
Overall status: CRITICAL
```

## Markdown Workflow

Generate a markdown report from the collected log and attach it to the incident or change ticket as supporting evidence:

```bash
python3 incident_log_summary.py \
  --file examples/app-error.log \
  --format markdown \
  --output incident-report.md
```

Review the report before attaching it. The output is evidence for triage; it is not a final root cause statement.

## Operational Limitations

- Pattern matching is intentionally simple and predictable.
- A single line can match multiple patterns, such as `ERROR`, `HTTP 503`, and `unavailable`.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
- Syslog timestamps without a year are normalized with an inferred year.
- Date filters are best-effort because lines without parseable timestamps are retained.
- Large log files are read into memory; collect a scoped file or time-windowed extract for very large incidents.

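For very large logs, a time-windowed extract can be produced before running the tool. The generator below is one possible sketch, not part of the tool: `extract_window` is a hypothetical name, it only understands the `YYYY-MM-DD HH:MM:SS` and `YYYY-MM-DDTHH:MM:SS` prefixes, and it keeps lines it cannot parse (including syslog-style lines) so nothing is silently dropped.

```python
from datetime import datetime
from typing import Iterable, Iterator


def extract_window(lines: Iterable[str], since: datetime, until: datetime) -> Iterator[str]:
    """Yield lines inside [since, until], streaming one line at a time."""
    for line in lines:
        stamp = line[:19].replace("T", " ")
        try:
            parsed = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        except ValueError:
            # Unparseable prefix: retain the line rather than guess.
            yield line
            continue
        if since <= parsed <= until:
            yield line
```

Because a file object is itself an iterable of lines, passing an open file to `extract_window` and writing the yielded lines to a new file avoids loading the whole log into memory.
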
## Safety Notes

- The tool only reads the input log and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
@@ -0,0 +1,8 @@
2026-05-11 09:48:12 app01 api[4150]: INFO request_id=7f3a status=200 path=/health
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
2026-05-11 12:10:01 app01 api[4150]: INFO request_id=9001 status=200 path=/health
@@ -0,0 +1,144 @@
# Incident Log Summary

## CRITICAL: certificate expired

- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33

Sample log lines:

```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```

## CRITICAL: CRITICAL

- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33

Sample log lines:

```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```

## CRITICAL: database unavailable

- Occurrences: 1
- First seen: 2026-05-11 10:03:19
- Last seen: 2026-05-11 10:03:19

Sample log lines:

```text
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
```

## CRITICAL: HTTP 500

- Occurrences: 1
- First seen: 2026-05-11 10:01:03
- Last seen: 2026-05-11 10:01:03

Sample log lines:

```text
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
```

## CRITICAL: HTTP 503

- Occurrences: 1
- First seen: 2026-05-11 10:13:58
- Last seen: 2026-05-11 10:13:58

Sample log lines:

```text
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
```

## CRITICAL: TLS handshake failed

- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33

Sample log lines:

```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```

## WARNING: ERROR

- Occurrences: 4
- First seen: 2026-05-11 10:01:03
- Last seen: 2026-05-11 10:13:58

Sample log lines:

```text
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
```

## WARNING: unavailable

- Occurrences: 2
- First seen: 2026-05-11 10:03:19
- Last seen: 2026-05-11 10:13:58

Sample log lines:

```text
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
```

## WARNING: connection refused

- Occurrences: 1
- First seen: 2026-05-11 10:07:02
- Last seen: 2026-05-11 10:07:02

Sample log lines:

```text
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
```

## WARNING: failed

- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33

Sample log lines:

```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```

## WARNING: timeout

- Occurrences: 1
- First seen: 2026-05-11 10:05:44
- Last seen: 2026-05-11 10:05:44

Sample log lines:

```text
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
```

## Operational Summary

- Total lines scanned: 8
- Total findings: 15
- Critical finding groups: 6
- Warning finding groups: 5
- Overall status: CRITICAL
@@ -0,0 +1,7 @@
May 11 09:57:01 ops-node-01 systemd[1]: Started Session 443 of user svc_backup.
May 11 10:02:14 ops-node-01 systemd[1]: failed to start nightly-report.service: Unit entered failed state.
May 11 10:04:22 ops-node-01 sudo[18442]: svc_backup : command not allowed ; permission denied
May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages
May 11 10:21:45 ops-node-01 kernel: out of memory: killed process 2517 (java) total-vm:2048000kB
May 11 10:22:03 ops-node-01 systemd[1]: service restart scheduled for app-worker.service
May 11 10:30:31 ops-node-01 sshd[19210]: Accepted publickey for admin from 192.0.2.15 port 52210 ssh2
@@ -0,0 +1,448 @@
#!/usr/bin/env python3
"""Summarize incident-oriented patterns in local log files."""

from __future__ import annotations

import argparse
import json
import re
import sys
from datetime import datetime
from pathlib import Path
from typing import Any


EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2

UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}

CRITICAL_PATTERNS = [
    "CRITICAL",
    "FATAL",
    "panic",
    "kernel panic",
    "no space left on device",
    "out of memory",
    "killed process",
    "read-only file system",
    "segmentation fault",
    "segfault",
    "certificate expired",
    "TLS handshake failed",
    "SSLHandshakeException",
    "database unavailable",
    "HTTP 500",
    "HTTP 502",
    "HTTP 503",
    "HTTP 504",
]

WARNING_PATTERNS = [
    "ERROR",
    "failed",
    "failure",
    "timeout",
    "connection refused",
    "connection reset",
    "permission denied",
    "authentication failed",
    "denied",
    "unavailable",
    "service restart",
    "retrying",
]

ISO_TIMESTAMP_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Summarize suspicious and critical patterns in a local log file."
    )
    parser.add_argument("--file", required=True, help="Local log file to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        help="Limit finding groups after severity and count sorting.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match all configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--since",
        type=parse_filter_timestamp,
        help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--until",
        type=parse_filter_timestamp,
        help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding group. Default: 3.",
    )
    return parser


def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def parse_filter_timestamp(value: str) -> datetime:
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise argparse.ArgumentTypeError(
        'expected timestamp format "YYYY-MM-DD HH:MM:SS"'
    )


def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    flags = re.IGNORECASE if ignore_case else 0
    pattern_defs: list[dict[str, str]] = []
    pattern_defs.extend(
        {"pattern": pattern, "severity": "CRITICAL"} for pattern in CRITICAL_PATTERNS
    )
    pattern_defs.extend(
        {"pattern": pattern, "severity": "WARNING"} for pattern in WARNING_PATTERNS
    )

    compiled = []
    for item in pattern_defs:
        compiled.append(
            {
                "pattern": item["pattern"],
                "severity": item["severity"],
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    return compiled


def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str | None]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        try:
            return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S"), raw
        except ValueError:
            return None, None

    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        normalized = f"{syslog_year} {raw}"
        try:
            parsed = datetime.strptime(normalized, "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, None
        return parsed, parsed.strftime("%Y-%m-%d %H:%M:%S")

    return None, None


def line_in_time_window(
    parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True


def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()


def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    since: datetime | None,
    until: datetime | None,
    max_samples: int,
) -> dict[str, Any]:
    syslog_year = since.year if since is not None else datetime.now().year
    groups: dict[str, dict[str, Any]] = {}

    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if not line_in_time_window(parsed_at, since, until):
            continue

        for item in patterns:
            if not item["regex"].search(line):
                continue

            key = f"{item['severity']}::{item['pattern']}"
            group = groups.setdefault(
                key,
                {
                    "pattern": item["pattern"],
                    "severity": item["severity"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1

            if parsed_at is not None:
                if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
                    group["first_seen"] = (parsed_at, rendered_at)
                if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
                    group["last_seen"] = (parsed_at, rendered_at)

            if len(group["samples"]) < max_samples:
                group["samples"].append(line)

    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["pattern"].lower(),
        ),
    )

    rendered_findings = []
    for group in findings:
        rendered_findings.append(
            {
                "pattern": group["pattern"],
                "severity": group["severity"],
                "occurrences": group["occurrences"],
                "first_seen": render_seen(group["first_seen"]),
                "last_seen": render_seen(group["last_seen"]),
                "samples": group["samples"],
            }
        )

    return {
        "total_lines_scanned": len(lines),
        "findings": rendered_findings,
    }


def render_seen(value: tuple[datetime, str | None] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def apply_top_limit(report: dict[str, Any], top: int | None) -> dict[str, Any]:
    if top is None:
        return report
    limited = dict(report)
    limited["findings"] = report["findings"][:top]
    return limited


def add_summary(report: dict[str, Any]) -> dict[str, Any]:
    findings = report["findings"]
    critical_groups = sum(1 for item in findings if item["severity"] == "CRITICAL")
    warning_groups = sum(1 for item in findings if item["severity"] == "WARNING")
    total_findings = sum(item["occurrences"] for item in findings)

    if critical_groups > 0:
        status = "CRITICAL"
    elif warning_groups > 0:
        status = "WARNING"
    else:
        status = "OK"

    enriched = dict(report)
    enriched["summary"] = {
        "total_lines_scanned": report["total_lines_scanned"],
        "total_findings": total_findings,
        "critical_finding_groups": critical_groups,
        "warning_finding_groups": warning_groups,
        "overall_status": status,
    }
    return enriched


def render_text(report: dict[str, Any]) -> str:
    lines = ["Incident Log Summary", "====================", ""]
    if not report["findings"]:
        lines.append("No configured incident patterns were detected.")
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['pattern']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f" - {sample}" for sample in finding["samples"])
            else:
                lines.append(" - No samples retained")
            lines.append("")

    lines.extend(render_text_summary(report["summary"]))
    return "\n".join(lines) + "\n"


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    return [
        "Operational Summary",
        "-------------------",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Total findings: {summary['total_findings']}",
        f"Critical finding groups: {summary['critical_finding_groups']}",
        f"Warning finding groups: {summary['warning_finding_groups']}",
        f"Overall status: {summary['overall_status']}",
    ]


def render_markdown(report: dict[str, Any]) -> str:
    lines = ["# Incident Log Summary", ""]
    if not report["findings"]:
        lines.extend(["No configured incident patterns were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"## {finding['severity']}: {finding['pattern']}",
                    "",
                    f"- Occurrences: {finding['occurrences']}",
                    f"- First seen: {finding['first_seen']}",
                    f"- Last seen: {finding['last_seen']}",
                    "",
                    "Sample log lines:",
                    "",
                ]
            )
            if finding["samples"]:
                lines.append("```text")
                lines.extend(finding["samples"])
                lines.append("```")
            else:
                lines.append("_No samples retained._")
            lines.append("")

    summary = report["summary"]
    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Total lines scanned: {summary['total_lines_scanned']}",
            f"- Total findings: {summary['total_findings']}",
            f"- Critical finding groups: {summary['critical_finding_groups']}",
            f"- Warning finding groups: {summary['warning_finding_groups']}",
            f"- Overall status: {summary['overall_status']}",
            "",
        ]
    )
    return "\n".join(lines)


def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"


def write_report(output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return

    path = Path(output_path)
    try:
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()

    if args.since is not None and args.until is not None and args.since > args.until:
        parser.error("--since must be earlier than or equal to --until")

    try:
        lines = read_log_file(Path(args.file))
        report = analyze_log(
            lines=lines,
            patterns=compile_patterns(args.ignore_case),
            since=args.since,
            until=args.until,
            max_samples=args.max_samples,
        )
        report = add_summary(apply_top_limit(report, args.top))

        if args.format == "text":
            content = render_text(report)
        elif args.format == "markdown":
            content = render_markdown(report)
        else:
            content = render_json(report)

        write_report(args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID

    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,215 @@
# journal-analyzer

`journal-analyzer` is a read-only Python CLI for reviewing exported `journalctl` text logs. It summarizes systemd, service, and system-level journal findings that require operator review during Linux incident response, post-patching validation, restart troubleshooting, and change evidence collection.

The tool analyzes exported journal text only. It does not call `journalctl` directly, does not modify host state, and does not claim root cause.

## Purpose

- Summarize which units failed and which services appear repeatedly affected.
- Surface dependency failures, restart loops, timeout patterns, OOM symptoms, disk/filesystem errors, TLS/certificate issues, authentication events, and network-related warnings.
- Produce predictable text, Markdown, or JSON output that can be attached to an incident or change ticket.

## When To Use

- After exporting a scoped `journalctl` window during incident response.
- After package patching or service restarts when failed units or degraded services need review.
- During Linux service troubleshooting when repeated restart or dependency messages need a quick grouped summary.
- Before attaching journal evidence to an incident, problem, or change record.

## What It Does Not Do

- It does not call `journalctl` directly in v1.
- It does not modify the input log, systemd state, service state, or host configuration.
- It does not read remote systems or live journal streams.
- It does not query SIEM, ELK, Zabbix, APM, or ticketing systems.
- It does not prove root cause or a service defect.
- It does not classify every vendor-specific journal message.

## Supported Input Type

- One exported local `journalctl` text file supplied with `--file`.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.

Example export commands:

```bash
journalctl --since "1 hour ago" > journal.log
journalctl -u nginx --since today > nginx-journal.log
journalctl -p warning..alert --since "24 hours ago" > warnings.log
journalctl --no-pager --since "2026-05-11 10:00:00" > journal.log
```

## Supported Event Categories

Critical-oriented categories:

- Failed unit or failed start findings.
- Dependency failures.
- Kernel panic and panic findings.
- OOM killer and killed process findings.
- Disk and filesystem issues such as `no space left on device`, read-only filesystem, filesystem errors, and I/O errors.
- Service or application crash patterns such as `segfault`.
- TLS and certificate failures.
- Emergency mode findings.

Warning-oriented categories:

- Restart and repeated start request findings.
- Timeout and timed out findings.
- Connection refused and connection reset findings.
- Permission denied and denied findings.
- Authentication failure findings.
- Availability, degraded, failed, and warning findings that still require review.

Matching is pattern-based and deliberately simple. Default matching already tolerates common case variants (for example, `Failed to start` and `failed to start`), and `--ignore-case` forces case-insensitive matching when that behavior should be explicit. The tool is intended for first-pass operational review, not for proving causality.
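
As a rough illustration of literal, pattern-based matching, here is a minimal sketch. It assumes patterns are escaped and compiled in the same way as `compile_patterns` in `incident-log-summary`; the `PATTERNS` table and the `match_line` helper are hypothetical and are not the analyzer's actual internals.

```python
import re

# Illustrative subset only; the analyzer's real pattern table is an assumption here.
PATTERNS = {
    "CRITICAL": ["TLS handshake failed", "certificate expired", "no space left on device"],
    "WARNING": ["connection refused", "timeout"],
}


def compile_patterns(ignore_case: bool = True):
    """Compile each pattern as a literal string (re.escape), optionally ignoring case."""
    flags = re.IGNORECASE if ignore_case else 0
    return [
        (severity, pattern, re.compile(re.escape(pattern), flags))
        for severity, patterns in PATTERNS.items()
        for pattern in patterns
    ]


def match_line(line: str, compiled):
    """Return every (severity, pattern) pair whose literal text occurs in the line."""
    return [(severity, pattern) for severity, pattern, regex in compiled if regex.search(line)]


line = (
    "2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed "
    "for backup endpoint: certificate expired on peer connection"
)
print(match_line(line, compile_patterns()))
```

One line can produce more than one finding here, which is why the sample above yields both TLS-related patterns.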

## Timestamp Support

The analyzer attempts to parse common journal and syslog timestamp formats:

- `May 11 10:15:30`
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `2026-05-11 10:15:30.123456`
- `2026-05-11 10:15:30,123`

If a timestamp cannot be parsed:

- the line is still analyzed
- first seen / last seen remain `UNKNOWN` where needed
- time-window filters keep the line by default rather than silently discarding it

Syslog-style timestamps without a year use the current local year internally unless `--since` provides a year context.
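
The parsing rules above could be sketched roughly as follows. This is an assumption-laden illustration modeled on `parse_line_timestamp` in `incident-log-summary`, not the analyzer's actual code: the regexes, `parse_timestamp`, and `in_window` are hypothetical names, and fractional seconds are matched but dropped.

```python
import re
from datetime import datetime
from typing import Optional

# Optional fractional seconds (".123456" or ",123") are matched but discarded.
ISO_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})(?:[.,]\d+)?")
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def parse_timestamp(line: str, since: Optional[datetime] = None) -> Optional[datetime]:
    """Return the line's timestamp, or None when nothing parseable is found."""
    iso = ISO_RE.search(line)
    if iso:
        try:
            return datetime.strptime(f"{iso.group(1)} {iso.group(2)}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            return None
    syslog = SYSLOG_RE.search(line)
    if syslog:
        # Syslog timestamps carry no year: borrow it from --since, else use the current year.
        year = since.year if since is not None else datetime.now().year
        try:
            return datetime.strptime(f"{year} {syslog.group(1)}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None
    return None


def in_window(parsed: Optional[datetime], since=None, until=None) -> bool:
    """Keep lines without a parseable timestamp instead of silently dropping them."""
    if parsed is None:
        return True
    if since is not None and parsed < since:
        return False
    if until is not None and parsed > until:
        return False
    return True
```

Month-name parsing with `%b` depends on the process locale, so this sketch assumes an English or C locale.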

## Service Filtering

Use `--service SERVICE_NAME` to keep findings for a specific service, unit, or process name. Partial matches are allowed.

Examples:

```bash
python3 journal_analyzer.py --file examples/sample-journal.log --service nginx
python3 journal_analyzer.py --file examples/sample-journal.log --service sshd
```

`--service nginx` matches practical variants such as `nginx`, `nginx.service`, and lines where the raw journal text includes `nginx`.
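
One plausible way to implement this kind of partial matching is a case-insensitive substring check against the extracted unit, the process name, and the raw line. The `matches_service` helper below is a hypothetical sketch under that assumption, not the analyzer's actual filter.

```python
def matches_service(service: str, unit: str, process: str, raw_line: str) -> bool:
    """Return True when the service name appears in the unit, process, or raw line."""
    needle = service.lower()
    return any(needle in candidate.lower() for candidate in (unit, process, raw_line))


nginx_line = "May 11 10:16:12 web01 systemd[1]: nginx.service: Unit entered failed state."
kernel_line = "2026-05-11 10:17:01 web01 kernel: invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0"

print(matches_service("nginx", "nginx.service", "systemd", nginx_line))
print(matches_service("sshd", "UNKNOWN", "kernel", kernel_line))
```

A substring check is permissive by design: `--service ssh` would also match `sshd` lines, which fits the README's note that partial matches are allowed.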

## Severity Filtering

Use `--severity warning` or `--severity critical` to limit the displayed findings.

Examples:

```bash
python3 journal_analyzer.py --file examples/sample-journal.log --severity critical
python3 journal_analyzer.py --file examples/sample-journal.log --severity warning
```

## Severity Model

Overall status is conservative:

- `OK` - no journal findings detected.
- `WARNING` - warning-level findings exist but no critical findings exist.
- `CRITICAL` - one or more critical findings exist.

Critical status is driven by failed units, dependency failures, OOM events, kernel panic findings, disk full or read-only filesystem symptoms, emergency mode, TLS/certificate failures, and I/O or filesystem errors.

Warning status is driven by restart-related findings, timeout patterns, connection issues, permission denied events, authentication failures, degraded messages, and generic warning/failure entries that still require review.

The report summarizes exported journal findings that require review. It does not claim root cause.
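
The conservative roll-up can be expressed in a few lines. This sketch assumes each finding is a dict with a `severity` key, as in `incident-log-summary`; the `overall_status` function name is illustrative.

```python
def overall_status(findings: list) -> str:
    """Any critical finding wins; otherwise any warning; otherwise OK."""
    severities = {finding["severity"] for finding in findings}
    if "CRITICAL" in severities:
        return "CRITICAL"
    if "WARNING" in severities:
        return "WARNING"
    return "OK"


print(overall_status([{"severity": "WARNING"}, {"severity": "CRITICAL"}]))
```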

## Usage

```bash
cd infra-run/scripts/python/journal-analyzer

python3 journal_analyzer.py --file examples/sample-journal.log
python3 journal_analyzer.py --file examples/sample-journal.log --format markdown
python3 journal_analyzer.py --file examples/sample-journal.log --format markdown --output journal-report.md
python3 journal_analyzer.py --file examples/sample-journal.log --format json
python3 journal_analyzer.py --file examples/sample-journal.log --service sshd
python3 journal_analyzer.py --file examples/sample-journal.log --service nginx
python3 journal_analyzer.py --file examples/sample-journal.log --severity critical
python3 journal_analyzer.py --file examples/sample-journal.log --top 10
python3 journal_analyzer.py --file examples/sample-journal.log --since "2026-05-11 10:00:00"
python3 journal_analyzer.py --file examples/sample-journal.log --until "2026-05-11 12:00:00"
python3 journal_analyzer.py --file examples/sample-journal.log --ignore-case
```

## Output Formats

- `text` - default terminal-oriented report.
- `markdown` - incident or change ticket attachment format.
- `json` - structured output for local automation.

Use `--output <path>` to write the report to a separate file. Without `--output`, the report is printed to stdout.

## Exit Codes

- `0` - OK, no journal findings.
- `1` - Journal findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
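
The exit-code contract can be sketched as a thin wrapper around the analysis step, following the same pattern as `main()` in `incident-log-summary`. The `run` helper and its `analyze` callable are hypothetical; the real entry point may be structured differently.

```python
import sys
from typing import Callable

EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2


def run(analyze: Callable[[], str]) -> int:
    """Map an analysis result (overall status string) or a failure to an exit code."""
    try:
        status = analyze()
    except (OSError, ValueError, RuntimeError) as exc:
        # Bad input, unreadable files, and write failures all map to exit code 2.
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    return EXIT_OK if status == "OK" else EXIT_FINDINGS


def missing_file() -> str:
    raise OSError("file does not exist: journal.log")


print(run(lambda: "OK"), run(lambda: "WARNING"), run(missing_file))
```

Keeping `1` for findings and `2` for tool or input failures lets calling scripts tell "the log has problems" apart from "the analysis did not run".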

## Example Text Output

```text
Journal Analyzer
================

Overall status: CRITICAL
Journal findings require review; logs alone do not prove root cause.

[CRITICAL] nginx.service - failed_unit
Pattern: failed to start
Occurrences: 1
Unit: nginx.service
Process: systemd
PID: 1
First seen: May 11 10:16:11
Last seen: May 11 10:16:11
Samples:
- May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.

Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 17
Total findings: 13
Critical finding groups: 7
Warning finding groups: 5
Affected services/units count: 9
```

## Markdown Workflow

Generate a Markdown report from an exported journal and attach it to the incident or change ticket as supporting evidence:

```bash
python3 journal_analyzer.py \
  --file examples/sample-journal.log \
  --format markdown \
  --output journal-report.md
```

Review the report before attaching it. Use it as a concise summary of exported journal findings, then correlate it with service status, monitoring, recent changes, package history, and runbook-specific post-checks.

## Operational Limitations

- Pattern matching is intentionally simple and predictable.
- A single line can match more than one finding when it contains more than one meaningful symptom, such as a TLS failure plus certificate expiry.
- Default matching is already case-tolerant for practical journal review; use `--ignore-case` to force case-insensitive matching.
- Unit, process, and PID extraction are best-effort and may return `UNKNOWN`.
- Time filtering is best-effort because lines without parseable timestamps are retained.
- Large log files are read into memory; use scoped journal exports for very large review windows.
- The tool does not inspect structured journal fields because v1 works on exported text logs.
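
For context on the memory limitation: the tool reads the whole file at once, but a line-by-line reader would keep memory use flat. The sketch below shows that alternative under stated assumptions; `iter_log_lines` and `count_matches` are hypothetical helpers and are not part of `journal-analyzer`.

```python
import io
from typing import Iterable, Iterator


def iter_log_lines(handle: Iterable[str]) -> Iterator[str]:
    """Yield one line at a time with the trailing newline removed."""
    for raw in handle:
        yield raw.rstrip("\r\n")


def count_matches(handle: Iterable[str], needle: str) -> int:
    """Count lines containing the needle, case-insensitively, without loading the file."""
    lowered = needle.lower()
    return sum(1 for line in iter_log_lines(handle) if lowered in line.lower())


# In real use the handle would come from open(path, encoding="utf-8", errors="replace").
sample = io.StringIO(
    "May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service.\n"
    "May 11 10:14:02 web01 systemd[1]: Started ssh.service - OpenBSD Secure Shell server.\n"
)
print(count_matches(sample, "failed"))
```

The trade-off is that streaming makes a second pass over the data (for example, to re-sort samples) more awkward, which may be why reading the file whole was chosen for v1.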

## Safety Notes

- The tool only reads the input journal export and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require root privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, or unsanitized production details in portfolio examples.
- Treat operational findings as triage evidence that requires review; the tool does not determine root cause automatically.
@@ -0,0 +1,143 @@
# Journal Analyzer Report

- Overall status: `CRITICAL`
- Journal findings require review; logs alone do not prove root cause.

## Finding Groups

### [CRITICAL] backup-agent - tls_certificate

- Pattern: `certificate expired`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `backup-agent`
- PID: `777`
- First seen: `2026-05-11 10:18:10`
- Last seen: `2026-05-11 10:18:10`
- Samples:
  - `2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection`

### [CRITICAL] backup-agent - tls_certificate

- Pattern: `TLS handshake failed`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `backup-agent`
- PID: `777`
- First seen: `2026-05-11 10:18:10`
- Last seen: `2026-05-11 10:18:10`
- Samples:
  - `2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection`

### [CRITICAL] dockerd - disk_filesystem

- Pattern: `no space left on device`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `dockerd`
- PID: `1347`
- First seen: `2026-05-11 10:17:33`
- Last seen: `2026-05-11 10:17:33`
- Samples:
  - `2026-05-11 10:17:33 web01 dockerd[1347]: Error response from daemon: write /var/lib/docker/tmp/GetImageBlob123456: no space left on device`

### [CRITICAL] java - oom

- Pattern: `Out of memory`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `java`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:02`
- Last seen: `2026-05-11 10:17:02`
- Samples:
  - `2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB`

### [CRITICAL] java - oom

- Pattern: `killed process`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `java`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:02`
- Last seen: `2026-05-11 10:17:02`
- Samples:
  - `2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB`

### [CRITICAL] kernel - disk_filesystem

- Pattern: `read-only file system`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `kernel`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:54`
- Last seen: `2026-05-11 10:17:54`
- Samples:
  - `2026-05-11 10:17:54 web01 kernel: EXT4-fs error (device sda2): Remounting read-only file system`

### [CRITICAL] kernel - oom

- Pattern: `invoked oom-killer`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `kernel`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:01`
- Last seen: `2026-05-11 10:17:01`
- Samples:
  - `2026-05-11 10:17:01 web01 kernel: invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0`

### [CRITICAL] nginx.service - dependency_failure

- Pattern: `dependency failed`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:08`
- Last seen: `May 11 10:16:08`
- Samples:
  - `May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service.`

### [CRITICAL] nginx.service - failed_unit

- Pattern: `failed to start`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:11`
- Last seen: `May 11 10:16:11`
- Samples:
  - `May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.`

### [CRITICAL] nginx.service - failed_unit

- Pattern: `entered failed state`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:12`
- Last seen: `May 11 10:16:12`
- Samples:
  - `May 11 10:16:12 web01 systemd[1]: nginx.service: Unit entered failed state.`

## Operational Summary

- Overall status: `CRITICAL`
- Total lines scanned: `17`
- Total findings: `18`
- Critical finding groups: `11`
- Warning finding groups: `7`
- Affected services/units count: `9`
- Top affected services/units: nginx.service (5), sshd.service (3), kernel (2), java (2), backup-agent (2), sshd (1), dockerd (1), NetworkManager (1), systemd (1)
- Top finding categories: restart (3), oom (3), failed_unit (2), disk_filesystem (2), tls_certificate (2), authentication (1), timeout (1), dependency_failure (1), generic_failure (1), network (1)
- Failed unit findings: nginx.service (3)
- Restart findings: `3`
- OOM findings: `3`
- Filesystem/disk findings: `2`
- Timestamp coverage: parsed=`17`, unknown=`0`
- Filters used: service=`None`, severity=`None`, since=`None`, until=`None`
@@ -0,0 +1,17 @@
May 11 10:14:01 web01 systemd[1]: Starting nginx.service - A high performance web server and a reverse proxy server...
May 11 10:14:02 web01 systemd[1]: Started ssh.service - OpenBSD Secure Shell server.
May 11 10:15:03 web01 sshd[2284]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=198.51.100.23 user=deploy
May 11 10:15:22 web01 systemd[1]: sshd.service: Scheduled restart job, restart counter is at 3.
May 11 10:15:23 web01 systemd[1]: sshd.service: Service restart completed after watchdog timeout warning
May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service.
May 11 10:16:09 web01 systemd[1]: nginx.service: Job nginx.service/start failed with result 'dependency'.
May 11 10:16:10 web01 systemd[1]: nginx.service: Start request repeated too quickly.
May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
May 11 10:16:12 web01 systemd[1]: nginx.service: Unit entered failed state.
2026-05-11 10:17:01 web01 kernel: invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB
2026-05-11 10:17:33 web01 dockerd[1347]: Error response from daemon: write /var/lib/docker/tmp/GetImageBlob123456: no space left on device
2026-05-11 10:17:54 web01 kernel: EXT4-fs error (device sda2): Remounting read-only file system
2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection
2026-05-11 10:18:28 web01 NetworkManager[691]: Connection activation failed: Connection refused while reaching upstream gateway
2026-05-11 10:18:42 web01 systemd[1]: Emergency mode is enabled. System cannot continue normal boot.
@@ -0,0 +1,895 @@
#!/usr/bin/env python3
"""Analyze exported journalctl text logs for operational findings."""

from __future__ import annotations

import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any


EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2

UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}

CRITICAL_PATTERNS = [
    {
        "name": "failed to start",
        "pattern": "failed to start",
        "category": "failed_unit",
        "service_hint": "systemd",
    },
    {
        "name": "entered failed state",
        "pattern": "entered failed state",
        "category": "failed_unit",
        "service_hint": "systemd",
    },
    {
        "name": "dependency failed",
        "pattern": "dependency failed",
        "category": "dependency_failure",
        "service_hint": "systemd",
    },
    {
        "name": "job failed",
        "pattern": "job failed",
        "category": "failed_unit",
        "service_hint": "systemd",
    },
    {
        "name": "unit failed",
        "pattern": "unit failed",
        "category": "failed_unit",
        "service_hint": "systemd",
    },
    {
        "name": "kernel panic",
        "pattern": "kernel panic",
        "category": "kernel_panic",
        "service_hint": "kernel",
    },
    {
        "name": "panic",
        "pattern": "panic",
        "category": "kernel_panic",
        "service_hint": "kernel",
    },
    {
        "name": "Out of memory",
        "pattern": "Out of memory",
        "category": "oom",
        "service_hint": "kernel",
    },
    {
        "name": "invoked oom-killer",
        "pattern": "invoked oom-killer",
        "category": "oom",
        "service_hint": "kernel",
    },
    {
        "name": "killed process",
        "pattern": "killed process",
        "category": "oom",
        "service_hint": "kernel",
    },
    {
        "name": "no space left on device",
        "pattern": "no space left on device",
        "category": "disk_filesystem",
        "service_hint": "storage",
    },
    {
        "name": "read-only file system",
        "pattern": "read-only file system",
        "category": "disk_filesystem",
        "service_hint": "storage",
    },
    {
        "name": "segmentation fault",
        "pattern": "segmentation fault",
        "category": "crash",
        "service_hint": "application",
    },
    {
        "name": "segfault",
        "pattern": "segfault",
        "category": "crash",
        "service_hint": "application",
    },
    {
        "name": "certificate expired",
        "pattern": "certificate expired",
        "category": "tls_certificate",
        "service_hint": "tls",
    },
    {
        "name": "TLS handshake failed",
        "pattern": "TLS handshake failed",
        "category": "tls_certificate",
        "service_hint": "tls",
    },
    {
        "name": "emergency mode",
        "pattern": "emergency mode",
        "category": "system_recovery",
        "service_hint": "systemd",
    },
    {
        "name": "filesystem error",
        "pattern": "filesystem error",
        "category": "disk_filesystem",
        "service_hint": "storage",
    },
    {
        "name": "I/O error",
        "pattern": "I/O error",
        "category": "disk_filesystem",
        "service_hint": "storage",
    },
]

WARNING_PATTERNS = [
    {
        "name": "service restart",
        "pattern": "service restart",
        "category": "restart",
        "service_hint": "systemd",
    },
    {
        "name": "scheduled restart job",
        "pattern": "scheduled restart job",
        "category": "restart",
        "service_hint": "systemd",
    },
    {
        "name": "start request repeated too quickly",
        "pattern": "start request repeated too quickly",
        "category": "restart",
        "service_hint": "systemd",
    },
    {
        "name": "timeout",
        "pattern": "timeout",
        "category": "timeout",
        "service_hint": "application",
    },
    {
        "name": "timed out",
        "pattern": "timed out",
        "category": "timeout",
        "service_hint": "application",
    },
    {
        "name": "connection refused",
        "pattern": "connection refused",
        "category": "network",
        "service_hint": "network",
    },
    {
        "name": "connection reset",
        "pattern": "connection reset",
        "category": "network",
        "service_hint": "network",
    },
    {
        "name": "permission denied",
        "pattern": "permission denied",
        "category": "permission",
        "service_hint": "security",
    },
    {
        "name": "authentication failure",
        "pattern": "authentication failure",
        "category": "authentication",
        "service_hint": "security",
    },
    {
        "name": "denied",
        "pattern": "denied",
        "category": "permission",
        "service_hint": "security",
    },
    {
        "name": "unavailable",
        "pattern": "unavailable",
        "category": "availability",
        "service_hint": "application",
    },
    {
        "name": "degraded",
        "pattern": "degraded",
        "category": "degraded",
        "service_hint": "systemd",
    },
    {
        "name": "failed",
        "pattern": "failed",
        "category": "generic_failure",
        "service_hint": "application",
    },
    {
        "name": "warning",
        "pattern": "warning",
        "category": "warning",
        "service_hint": "application",
    },
]

ISO_TIMESTAMP_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
UNIT_RE = re.compile(r"\b([A-Za-z0-9_.@:-]+\.service)\b")
ANY_UNIT_RE = re.compile(
    r"\b([A-Za-z0-9_.@:-]+\.(?:service|socket|mount|target|timer|path|slice|scope|device))\b"
)
PREFIX_RE = re.compile(
    r"^(?:[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+)?"
    r"(?:\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[,.]\d{1,6})?\s+)?"
    r"(?:(?P<host>[A-Za-z0-9_.:-]+)\s+)?"
    r"(?P<proc>[A-Za-z0-9_.@/-]+)(?:\[(?P<pid>\d+)\])?:"
)
KILLED_PROCESS_RE = re.compile(r"Killed process \d+ \(([^)]+)\)")
SYSTEMD_FAILED_START_RE = re.compile(r"Failed to start\s+(.+?)\.")
SYSTEMD_TRIGGER_RE = re.compile(r"Triggered By:\s*([A-Za-z0-9_.@:-]+\.(?:service|socket|mount|target|timer|path|slice|scope|device))")
PID_RE = re.compile(r"\bpid[ =](\d+)\b", re.IGNORECASE)


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Analyze exported journalctl text logs for systemd and service findings."
    )
    parser.add_argument("--file", required=True, help="Exported journal log file to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--service",
        help="Filter findings to a service, unit, or process name. Partial matching is allowed.",
    )
    parser.add_argument(
        "--severity",
        choices=("warning", "critical"),
        help="Show only warning or critical findings.",
    )
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of top groups, services, and categories to display. Default: 10.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding group. Default: 3.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--since",
        type=parse_filter_timestamp,
        help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--until",
        type=parse_filter_timestamp,
        help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
    )
    return parser


def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def parse_filter_timestamp(value: str) -> datetime:
    for fmt in (
        "%Y-%m-%d %H:%M:%S",
        "%Y-%m-%dT%H:%M:%S",
        "%Y-%m-%d %H:%M:%S.%f",
        "%Y-%m-%d %H:%M:%S,%f",
    ):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise argparse.ArgumentTypeError(
        'expected timestamp format "YYYY-MM-DD HH:MM:SS"'
    )


def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    # Matching is case-insensitive regardless of ignore_case: lowercase patterns
    # such as "failed to start" must match "Failed to start ..." in journal output.
    flags = re.IGNORECASE
    compiled = []
    for item in CRITICAL_PATTERNS:
        compiled.append(
            {
                **item,
                "severity": "CRITICAL",
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    for item in WARNING_PATTERNS:
        compiled.append(
            {
                **item,
                "severity": "WARNING",
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    return compiled


def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()


def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        fraction = iso_match.group(3) or ""
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        parse_value = raw
        fmt = "%Y-%m-%d %H:%M:%S"
        if fraction:
            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
            fmt = "%Y-%m-%d %H:%M:%S.%f"
        try:
            return datetime.strptime(parse_value, fmt), raw + fraction
        except ValueError:
            return None, UNKNOWN

    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        try:
            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw

    return None, UNKNOWN


def line_in_time_window(
    parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True


def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
    if parsed_at is None:
        return
    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
        group["first_seen"] = (parsed_at, rendered_at)
    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
        group["last_seen"] = (parsed_at, rendered_at)


def append_limited(items: list[str], value: str, limit: int) -> None:
    if limit == 0:
        return
    if value in items:
        return
    if len(items) < limit:
        items.append(value)


def normalize_service_name(value: str) -> str:
    stripped = value.strip()
    if not stripped:
        return UNKNOWN
    return stripped


def extract_service_info(line: str, pattern_item: dict[str, Any]) -> dict[str, str]:
    unit_match = UNIT_RE.search(line)
    any_unit_match = ANY_UNIT_RE.search(line)
    prefix_match = PREFIX_RE.search(line)
    killed_match = KILLED_PROCESS_RE.search(line)
    triggered_match = SYSTEMD_TRIGGER_RE.search(line)
    pid_match = PID_RE.search(line)

    unit = UNKNOWN
    process = UNKNOWN
    pid = UNKNOWN

    if unit_match:
        unit = unit_match.group(1)
    elif any_unit_match:
        unit = any_unit_match.group(1)

    if prefix_match:
        process = prefix_match.group("proc") or UNKNOWN
        pid = prefix_match.group("pid") or UNKNOWN

    if killed_match:
        process = normalize_service_name(killed_match.group(1))

    if pid == UNKNOWN and pid_match:
        pid = pid_match.group(1)

    if unit == UNKNOWN and process == "systemd":
        failed_start_match = SYSTEMD_FAILED_START_RE.search(line)
        if failed_start_match:
            unit = normalize_service_name(
                failed_start_match.group(1).strip().replace(" ", "-")
            )
            if not unit.endswith(".service"):
                unit = f"{unit}.service"

    if unit == UNKNOWN and triggered_match:
        unit = triggered_match.group(1)

    service = UNKNOWN
    if unit != UNKNOWN:
        service = unit
    elif process != UNKNOWN:
        service = process
    elif pattern_item.get("service_hint"):
        service = pattern_item["service_hint"]

    return {
        "service": service,
        "unit": unit,
        "process": process,
        "pid": pid,
    }


def service_filter_matches(service_filter: str | None, service_info: dict[str, str], line: str) -> bool:
    if not service_filter:
        return True
    needle = service_filter.lower()
    candidates = [line.lower()]
    for key in ("service", "unit", "process"):
        value = service_info.get(key, UNKNOWN)
        if value != UNKNOWN:
            candidates.append(value.lower())
    return any(needle in candidate for candidate in candidates)


def severity_filter_matches(selected: str | None, severity: str) -> bool:
    if selected is None:
        return True
    return selected.upper() == severity


def detect_failed_unit(line: str, service_info: dict[str, str], category: str) -> str | None:
    if category not in {"failed_unit", "dependency_failure"}:
        return None
    if service_info["unit"] != UNKNOWN:
        return service_info["unit"]
    match = ANY_UNIT_RE.search(line)
    if match:
        return match.group(1)
    return None


def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    since: datetime | None,
    until: datetime | None,
    service_filter: str | None,
    severity_filter: str | None,
    top: int,
    max_samples: int,
) -> dict[str, Any]:
    syslog_year = since.year if since is not None else datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    total_lines_scanned = 0
    parsed_timestamps = 0
    unknown_timestamps = 0
    top_services = Counter()
    top_categories = Counter()
    failed_units = Counter()
    restart_findings = 0
    oom_findings = 0
    filesystem_findings = 0

    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        total_lines_scanned += 1
        if parsed_at is not None:
            parsed_timestamps += 1
        else:
            unknown_timestamps += 1

        if not line_in_time_window(parsed_at, since, until):
            continue

        matched_items = [item for item in patterns if item["regex"].search(line)]
        if matched_items:
            has_specific_match = any(
                item["name"] not in {"failed", "warning"} for item in matched_items
            )
            if has_specific_match:
                matched_items = [
                    item for item in matched_items if item["name"] not in {"failed", "warning"}
                ]

        for item in matched_items:
            if not severity_filter_matches(severity_filter, item["severity"]):
                continue

            service_info = extract_service_info(line, item)
            if not service_filter_matches(service_filter, service_info, line):
                continue

            key = (
                f"{service_info['service']}::{item['name']}::{item['category']}::{item['severity']}"
            )
            group = groups.setdefault(
                key,
                {
                    "service": service_info["service"],
                    "unit": service_info["unit"],
                    "process": service_info["process"],
                    "pid": service_info["pid"],
                    "category": item["category"],
                    "pattern": item["name"],
                    "severity": item["severity"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1
            update_seen(group, parsed_at, rendered_at)
            append_limited(group["samples"], line, max_samples)

            top_services[group["service"]] += 1
            top_categories[group["category"]] += 1

            failed_unit = detect_failed_unit(line, service_info, item["category"])
            if failed_unit:
                failed_units[failed_unit] += 1

            if item["category"] == "restart":
                restart_findings += 1
            if item["category"] == "oom":
                oom_findings += 1
            if item["category"] == "disk_filesystem":
                filesystem_findings += 1

    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["service"].lower(),
            item["category"].lower(),
        ),
    )

    rendered_findings = []
    for group in findings:
        rendered_findings.append(
            {
                "service": group["service"],
                "unit": group["unit"],
                "process": group["process"],
                "pid": group["pid"],
                "category": group["category"],
                "pattern": group["pattern"],
                "severity": group["severity"],
                "occurrences": group["occurrences"],
                "first_seen": render_seen(group["first_seen"]),
                "last_seen": render_seen(group["last_seen"]),
                "samples": group["samples"],
            }
        )

    critical_groups = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
    warning_groups = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
    overall_status = "OK"
    if critical_groups > 0:
        overall_status = "CRITICAL"
    elif warning_groups > 0:
        overall_status = "WARNING"

    displayed_findings = rendered_findings[:top]

    return {
        "overall_status": overall_status,
        "total_lines_scanned": total_lines_scanned,
        "total_findings": sum(item["occurrences"] for item in rendered_findings),
        "critical_finding_groups": critical_groups,
        "warning_finding_groups": warning_groups,
        "affected_services_count": len([name for name in top_services if name != UNKNOWN]),
        "top_affected_services": [
            {"service": name, "count": count}
            for name, count in top_services.most_common(top)
        ],
        "top_categories": [
            {"category": name, "count": count}
            for name, count in top_categories.most_common(top)
        ],
        "failed_units": [
            {"unit": name, "count": count} for name, count in failed_units.most_common(top)
        ],
        "restart_findings": restart_findings,
        "oom_findings": oom_findings,
        "filesystem_disk_findings": filesystem_findings,
        "timestamp_coverage": {
            "parsed_timestamps_count": parsed_timestamps,
            "unknown_timestamps_count": unknown_timestamps,
        },
        "filters_used": {
            "service": service_filter or None,
            "severity": severity_filter or None,
            "since": since.strftime("%Y-%m-%d %H:%M:%S") if since else None,
            "until": until.strftime("%Y-%m-%d %H:%M:%S") if until else None,
        },
        "finding_groups": displayed_findings,
        "finding_groups_total": len(rendered_findings),
    }


def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
    if not items:
        return "None"
    return ", ".join(f"{item[key]} ({item['count']})" for item in items)


def render_text(report: dict[str, Any]) -> str:
    lines = [
        "Journal Analyzer",
        "================",
        "",
        f"Overall status: {report['overall_status']}",
        "Journal findings require review; logs alone do not prove root cause.",
        "",
    ]

    if report["finding_groups"]:
        for finding in report["finding_groups"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['service']} - {finding['category']}",
                    f"Pattern: {finding['pattern']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"Unit: {finding['unit']}",
                    f"Process: {finding['process']}",
                    f"PID: {finding['pid']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                for sample in finding["samples"]:
                    lines.append(f" - {sample}")
            else:
                lines.append(" - None")
            lines.append("")
    else:
        lines.extend(["No journal findings detected for the selected filters.", ""])

    lines.extend(
        [
            "Operational Summary",
            "-------------------",
            f"Overall status: {report['overall_status']}",
            f"Total lines scanned: {report['total_lines_scanned']}",
            f"Total findings: {report['total_findings']}",
            f"Critical finding groups: {report['critical_finding_groups']}",
            f"Warning finding groups: {report['warning_finding_groups']}",
            f"Affected services/units count: {report['affected_services_count']}",
            "Top affected services/units: "
            + render_top_pairs(report["top_affected_services"], "service"),
            "Top finding categories: "
            + render_top_pairs(report["top_categories"], "category"),
            "Failed unit findings: "
            + render_top_pairs(report["failed_units"], "unit"),
            f"Restart findings: {report['restart_findings']}",
            f"OOM findings: {report['oom_findings']}",
            f"Filesystem/disk findings: {report['filesystem_disk_findings']}",
            "Timestamp coverage: "
            f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
            f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
            "Filters used: "
            f"service={report['filters_used']['service'] or 'None'}, "
            f"severity={report['filters_used']['severity'] or 'None'}, "
            f"since={report['filters_used']['since'] or 'None'}, "
            f"until={report['filters_used']['until'] or 'None'}",
        ]
    )
    return "\n".join(lines)


def render_markdown(report: dict[str, Any]) -> str:
    lines = [
        "# Journal Analyzer Report",
        "",
        f"- Overall status: `{report['overall_status']}`",
        "- Journal findings require review; logs alone do not prove root cause.",
        "",
    ]

    if report["finding_groups"]:
        lines.append("## Finding Groups")
        lines.append("")
        for finding in report["finding_groups"]:
            lines.extend(
                [
                    f"### [{finding['severity']}] {finding['service']} - {finding['category']}",
                    "",
                    f"- Pattern: `{finding['pattern']}`",
                    f"- Occurrences: `{finding['occurrences']}`",
                    f"- Unit: `{finding['unit']}`",
                    f"- Process: `{finding['process']}`",
                    f"- PID: `{finding['pid']}`",
                    f"- First seen: `{finding['first_seen']}`",
                    f"- Last seen: `{finding['last_seen']}`",
                    "- Samples:",
                ]
            )
            if finding["samples"]:
                for sample in finding["samples"]:
                    lines.append(f" - `{sample}`")
            else:
                lines.append(" - `None`")
            lines.append("")
    else:
        lines.extend(["## Finding Groups", "", "No journal findings detected for the selected filters.", ""])

    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Overall status: `{report['overall_status']}`",
            f"- Total lines scanned: `{report['total_lines_scanned']}`",
            f"- Total findings: `{report['total_findings']}`",
            f"- Critical finding groups: `{report['critical_finding_groups']}`",
            f"- Warning finding groups: `{report['warning_finding_groups']}`",
            f"- Affected services/units count: `{report['affected_services_count']}`",
            "- Top affected services/units: "
            + (render_top_pairs(report["top_affected_services"], "service") or "None"),
            "- Top finding categories: "
            + (render_top_pairs(report["top_categories"], "category") or "None"),
            "- Failed unit findings: "
            + (render_top_pairs(report["failed_units"], "unit") or "None"),
            f"- Restart findings: `{report['restart_findings']}`",
            f"- OOM findings: `{report['oom_findings']}`",
            f"- Filesystem/disk findings: `{report['filesystem_disk_findings']}`",
            "- Timestamp coverage: "
            f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
            f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
            "- Filters used: "
            f"service=`{report['filters_used']['service'] or 'None'}`, "
            f"severity=`{report['filters_used']['severity'] or 'None'}`, "
            f"since=`{report['filters_used']['since'] or 'None'}`, "
            f"until=`{report['filters_used']['until'] or 'None'}`",
        ]
    )
    return "\n".join(lines)


def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2)


def write_output(text: str, output_path: str | None, input_path: Path) -> None:
    if output_path is None:
        print(text)
        return

    destination = Path(output_path)
    # Run the overwrite guard outside any try/except so the refusal reaches
    # the caller instead of being swallowed.
    if destination.exists() and destination.resolve() == input_path.resolve():
        raise OSError("output path must not overwrite the input log file")

    try:
        destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write report to {destination}: {exc}") from exc


def determine_exit_code(report: dict[str, Any]) -> int:
    if report["total_findings"] > 0:
        return EXIT_FINDINGS
    return EXIT_OK


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()

    try:
        input_path = Path(args.file)
        lines = read_log_file(input_path)
        patterns = compile_patterns(args.ignore_case)
        report = analyze_log(
            lines=lines,
            patterns=patterns,
            since=args.since,
            until=args.until,
            service_filter=args.service,
            severity_filter=args.severity.upper() if args.severity else None,
            top=args.top,
            max_samples=args.max_samples,
        )

        if args.format == "text":
            rendered = render_text(report)
        elif args.format == "markdown":
            rendered = render_markdown(report)
        else:
            rendered = render_json(report)

        write_output(rendered, args.output, input_path)
        return determine_exit_code(report)
    except (OSError, ValueError) as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except Exception as exc:  # pragma: no cover - defensive operational fallback
        print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr)
        return EXIT_INVALID


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,218 @@
# jvm-log-analyzer

`jvm-log-analyzer` is a read-only Python CLI for reviewing local JVM and Java application logs. It summarizes common Java exceptions, stack trace fragments, JVM failure symptoms, database issues, network/TLS problems, HTTP 5xx entries, and repeated application warning/error patterns that require operator review.

The tool is intended for Linux infrastructure, SRE, and application support workflows where a collected log file needs a quick first-pass operational summary. It does not modify logs or system state.

## When To Use

- During incident response when a JVM application log needs a fast exception and symptom summary.
- During application support handoff when stack traces, HTTP 5xx entries, or database failures need to be attached as evidence.
- After a restart, deployment, certificate change, database incident, or capacity event when local log extracts are available.
- When predictable text, Markdown, or JSON output is useful for local review.

## What It Does

- Reads one local JVM or Java application log supplied with `--file`.
- Detects configured critical and warning JVM/application patterns.
- Extracts timestamps, log levels, thread names, logger/class names, exception types, raw samples, and short stack trace fragments where practical.
- Aggregates top finding groups, exception types, and operational symptoms.
- Produces text, Markdown, or JSON output.
|
||||
|
||||
## What It Does Not Do
|
||||
|
||||
- It does not read remote systems or live journal streams.
|
||||
- It does not modify logs, services, application files, JVM flags, certificates, or database state.
|
||||
- It does not query APM, ELK, SIEM, Zabbix, ticketing systems, or application APIs.
|
||||
- It does not determine root cause automatically.
|
||||
- It does not prove an application defect.
|
||||
- It does not classify every vendor-specific Java framework or application message.
|
||||
|
||||
## Supported Input Types
|
||||
|
||||
- Java / JVM application logs.
|
||||
- Spring Boot style logs.
|
||||
- Tomcat-style application logs.
|
||||
- Generic application logs containing Java exceptions and stack traces.
|
||||
|
||||
UTF-8 text input is expected. Invalid byte sequences are replaced during read so review can continue. Empty files and missing, unreadable, or non-file paths are rejected with exit code `2`.
|
||||
|
||||
## Supported JVM/Application Patterns
|
||||
|
||||
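Critical patterns (`HTTP 5xx` entries are reported as warnings until their total reaches `--http-critical-threshold`):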
|
||||
|
||||
- `OutOfMemoryError`
|
||||
- `Java heap space`
|
||||
- `GC overhead limit exceeded`
|
||||
- `StackOverflowError`
|
||||
- `NoClassDefFoundError`
|
||||
- `ClassNotFoundException`
|
||||
- `ExceptionInInitializerError`
|
||||
- `SSLHandshakeException`
|
||||
- `CertificateExpiredException`
|
||||
- `SQLException`
|
||||
- `SQLRecoverableException`
|
||||
- `CommunicationsException`
|
||||
- `database unavailable`
|
||||
- `connection pool exhausted`
|
||||
- `HTTP 500`
|
||||
- `HTTP 502`
|
||||
- `HTTP 503`
|
||||
- `HTTP 504`
|
||||
- `FATAL`
|
||||
|
||||
Warning patterns:
|
||||
|
||||
- `NullPointerException`
|
||||
- `IllegalArgumentException`
|
||||
- `IllegalStateException`
|
||||
- `SocketTimeoutException`
|
||||
- `ConnectException`
|
||||
- `TimeoutException`
|
||||
- `connection refused`
|
||||
- `connection reset`
|
||||
- `Broken pipe`
|
||||
- `WARN`
|
||||
- `ERROR`
|
||||
- `retrying`
|
||||
- `slow query`
|
||||
- `deadlock detected`
|
||||
|
||||
By default matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across configured patterns.
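The effect of `--ignore-case` can be illustrated with a small sketch. It assumes patterns are treated as literal strings and compiled with the standard `re` module; the helper name `compile_literal` is hypothetical and used only for illustration.

```python
from __future__ import annotations

import re


def compile_literal(pattern: str, ignore_case: bool = False) -> re.Pattern[str]:
    """Compile a literal pattern, escaping any regex metacharacters."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(re.escape(pattern), flags)


line = "java.net.ConnectException: Connection refused"

# Case-sensitive matching misses the capitalized variant.
strict = compile_literal("connection refused")
# Case-insensitive matching catches it.
relaxed = compile_literal("connection refused", ignore_case=True)

print(strict.search(line) is not None)
print(relaxed.search(line) is not None)
```

With case-sensitive matching, `Connection refused` is not counted under `connection refused`, which is why `--ignore-case` is worth considering for mixed-case driver or OS messages.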
|
||||
|
||||
## Stack Trace Handling
|
||||
|
||||
The scanner detects multiline Java stack traces by recognizing common starting lines such as:
|
||||
|
||||
- Fully qualified Java exception lines, such as `java.lang.NullPointerException`.
|
||||
- `Exception in thread "main"`.
|
||||
- `Caused by:`.
|
||||
- Application exceptions ending in `Exception` or `Error`.
|
||||
|
||||
Subsequent lines are grouped into the same trace when they look like Java stack frames:
|
||||
|
||||
- Lines starting with whitespace followed by `at `.
|
||||
- Lines starting with `Caused by:`.
|
||||
- Lines containing `... N more`.
|
||||
|
||||
Stack traces are associated with the detected exception type where possible. Text and Markdown output include only short sample lines by default. Use `--include-stacktraces` to include capped multiline stack trace fragments.
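The grouping rules above can be sketched roughly as follows. This is a simplified illustration, not the tool's actual implementation: the function `group_stack_traces`, the simplified `EXCEPTION` regex, and the rule that a trace needs at least one continuation line are assumptions made for the example.

```python
import re

STACK_FRAME = re.compile(r"^\s+at\s+")
CAUSED_BY = re.compile(r"^\s*Caused by:\s+")
MORE = re.compile(r"^\s*\.\.\.\s+\d+\s+more\b")
# Simplified: any dotted or bare identifier ending in Exception or Error.
EXCEPTION = re.compile(r"\b[A-Za-z_$][\w$.]*(?:Exception|Error)\b")


def is_continuation(line):
    """Return True when the line looks like part of an open stack trace."""
    return any(rx.search(line) for rx in (STACK_FRAME, CAUSED_BY, MORE))


def group_stack_traces(lines, max_lines=12):
    """Group an exception line with the frame lines that follow it."""
    traces = []
    current = None
    for line in lines:
        if current is not None and is_continuation(line):
            if len(current) < max_lines:  # cap retained lines per trace
                current.append(line)
            continue
        # A non-continuation line closes any open trace.
        if current is not None and len(current) > 1:
            traces.append(current)
        current = [line] if EXCEPTION.search(line) else None
    if current is not None and len(current) > 1:
        traces.append(current)
    return traces


sample = [
    "2026-05-11 10:05:31 ERROR request failed while loading order",
    "java.lang.NullPointerException: customer is null",
    "    at com.example.orders.OrderService.submit(OrderService.java:92)",
    "Caused by: java.lang.IllegalStateException: empty result",
    "    ... 3 more",
    "2026-05-11 10:06:00 INFO recovered",
]
for trace in group_stack_traces(sample):
    print("\n".join(trace))
```

The `max_lines` cap mirrors the idea of capped stack trace fragments: long traces are truncated rather than reproduced in full.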
|
||||
|
||||
## Timestamp Handling
|
||||
|
||||
The scanner attempts to parse:
|
||||
|
||||
- `2026-05-11 10:15:30`
|
||||
- `2026-05-11T10:15:30`
|
||||
- `2026-05-11 10:15:30,123`
|
||||
- `2026-05-11 10:15:30.123`
|
||||
- `May 11 10:15:30`
|
||||
|
||||
Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed. When `--since` or `--until` is used, lines without parseable timestamps are retained by default so potentially important findings are not silently discarded.
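The retention rule for `--since` and `--until` can be sketched as below. This is a simplified illustration that handles only the ISO-style forms and ignores fractional seconds and syslog-style timestamps; the names `parse_timestamp` and `keep_line` are hypothetical.

```python
from __future__ import annotations

import re
from datetime import datetime

ISO_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})")


def parse_timestamp(line: str) -> datetime | None:
    """Return the first ISO-style timestamp in the line, or None."""
    match = ISO_RE.search(line)
    if not match:
        return None
    try:
        return datetime.strptime(
            f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H:%M:%S"
        )
    except ValueError:
        return None


def keep_line(
    line: str, since: datetime | None = None, until: datetime | None = None
) -> bool:
    """Keep lines inside the window; lines without a timestamp are retained."""
    parsed = parse_timestamp(line)
    if parsed is None:
        return True
    if since is not None and parsed < since:
        return False
    if until is not None and parsed > until:
        return False
    return True


since = datetime(2026, 5, 11, 10, 0, 0)
print(keep_line("2026-05-11 09:58:01 INFO starting", since=since))
print(keep_line("2026-05-11T10:15:30 ERROR request failed", since=since))
print(keep_line("    at com.example.Foo.bar(Foo.java:1)", since=since))
```

Retaining lines without timestamps means stack frames stay attached to the log entry that produced them, at the cost of possibly including frames from outside the window.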
|
||||
|
||||
## Severity Model
|
||||
|
||||
Overall status is conservative:
|
||||
|
||||
- `OK` - no JVM/application findings.
|
||||
- `WARNING` - warning-level findings exist but no critical findings exist.
|
||||
- `CRITICAL` - one or more critical findings exist.
|
||||
|
||||
Critical status is driven by JVM memory failures, fatal JVM symptoms, selected class loading errors, TLS/certificate failures, database unavailable or pool exhaustion symptoms, and HTTP 5xx volume at or above the configured threshold.
|
||||
|
||||
Warning status is driven by non-fatal exceptions, `WARN`/`ERROR` entries, timeout/retry patterns, connection refused/reset symptoms, slow query findings, and deadlock patterns.
|
||||
|
||||
HTTP 5xx findings are warnings until their total reaches `--http-critical-threshold`, which defaults to `5`. The report summarizes findings that require review; it does not claim root cause.
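The status rules above can be summarized in a few lines. This is a minimal sketch under the assumption that the report already holds counts of critical groups, warning groups, and HTTP 5xx entries; the function name `overall_status` and its parameters are hypothetical.

```python
def overall_status(
    critical_groups: int,
    warning_groups: int,
    http_5xx_count: int,
    http_critical_threshold: int = 5,
) -> str:
    """Derive the overall status from finding counts."""
    # HTTP 5xx entries escalate to CRITICAL only at or above the threshold.
    if critical_groups > 0 or http_5xx_count >= http_critical_threshold:
        return "CRITICAL"
    if warning_groups > 0 or http_5xx_count > 0:
        return "WARNING"
    return "OK"


print(overall_status(0, 0, 0))
print(overall_status(0, 2, 3))
print(overall_status(0, 0, 3, http_critical_threshold=2))
```

Lowering `--http-critical-threshold` makes the tool more sensitive to small bursts of 5xx entries, which may suit short log extracts.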
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
cd infra-run/scripts/python/jvm-log-analyzer
|
||||
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format markdown
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format markdown --output jvm-report.md
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format json
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --top 10
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --max-samples 5
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --include-stacktraces
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --since "2026-05-11 10:00:00"
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --until "2026-05-11 12:00:00"
|
||||
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --http-critical-threshold 2
|
||||
```
|
||||
|
||||
## Output Formats
|
||||
|
||||
- `text` - default terminal-oriented report.
|
||||
- `markdown` - incident or application support ticket attachment format.
|
||||
- `json` - structured output for local automation.
|
||||
|
||||
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file.
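One way to express the input/output path guard is to compare resolved paths. The sketch below is illustrative only; `same_file_target` and `check_output_path` are hypothetical names, and raising `ValueError` is an assumption about how the rejection might be signalled.

```python
from pathlib import Path


def same_file_target(input_path: str, output_path: str) -> bool:
    """Return True when both paths resolve to the same location."""
    return Path(input_path).resolve() == Path(output_path).resolve()


def check_output_path(input_path: str, output_path: str) -> None:
    """Refuse to write the report over the input log."""
    if same_file_target(input_path, output_path):
        raise ValueError(f"output path must differ from input log: {output_path}")


check_output_path("app.log", "jvm-report.md")  # accepted
try:
    check_output_path("app.log", "./app.log")
except ValueError as exc:
    print(f"ERROR: {exc}")
```

Resolving both paths first means that relative spellings such as `./app.log` are treated the same as `app.log`.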
|
||||
|
||||
## Exit Codes
|
||||
|
||||
- `0` - OK, no JVM/application findings.
|
||||
- `1` - JVM/application findings detected.
|
||||
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
|
||||
|
||||
## Example Text Output
|
||||
|
||||
```text
|
||||
JVM Log Analyzer
|
||||
================
|
||||
|
||||
Overall status: CRITICAL
|
||||
Findings require review; logs alone do not prove root cause.
|
||||
|
||||
[CRITICAL] OutOfMemoryError
|
||||
Occurrences: 1
|
||||
Symptom: jvm_memory
|
||||
First seen: UNKNOWN
|
||||
Last seen: UNKNOWN
|
||||
Stack traces linked: 1
|
||||
Samples:
|
||||
- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
|
||||
|
||||
Operational Summary
|
||||
-------------------
|
||||
Overall status: CRITICAL
|
||||
Total lines scanned: 33
|
||||
Total findings: 27
|
||||
Total stack traces detected: 4
|
||||
Critical finding groups: 11
|
||||
Warning finding groups: 8
|
||||
HTTP 5xx count: 3
|
||||
Parsed timestamps count: 21
|
||||
Unknown timestamps count: 12
|
||||
```
|
||||
|
||||
## Markdown Workflow
|
||||
|
||||
Generate a Markdown report from a collected JVM application log and attach it to the incident or application support ticket as supporting evidence:
|
||||
|
||||
```bash
|
||||
python3 jvm_log_analyzer.py \
|
||||
--file examples/sample-jvm-app.log \
|
||||
--format markdown \
|
||||
--include-stacktraces \
|
||||
--output jvm-report.md
|
||||
```
|
||||
|
||||
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be reviewed with application health checks, JVM memory telemetry, database status, certificate state, recent deployments, and the relevant application owner.
|
||||
|
||||
## Operational Limitations
|
||||
|
||||
- Pattern matching is intentionally simple and predictable.
|
||||
- A single log line can match multiple findings, such as `ERROR`, `HTTP 503`, and a Java exception.
|
||||
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
|
||||
- Stack trace grouping is practical, not a complete Java parser.
|
||||
- Timestamp parsing is best-effort; unparseable lines are retained during time filtering.
|
||||
- HTTP 5xx counts are raw log counts, not request rates or customer impact.
|
||||
- Large log files are read into memory; collect scoped extracts for very large incidents.
|
||||
|
||||
## Safety Notes
|
||||
|
||||
- The tool only reads the input log and optionally writes a separate report.
|
||||
- The implementation uses the Python standard library only and does not require package installation.
|
||||
- It does not require elevated privileges unless the chosen log path requires them.
|
||||
- Do not include secrets, customer data, private hostnames, tokens, or unsanitized production details in portfolio examples.
|
||||
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
|
||||
@@ -0,0 +1,32 @@
|
||||
2026-05-11 09:58:01 INFO inventory-api[2214] --- [main] com.example.InventoryApplication : Starting InventoryApplication v2.8.4
|
||||
2026-05-11 09:58:07 INFO inventory-api[2214] --- [main] com.example.InventoryApplication : Started InventoryApplication in 6.2 seconds
|
||||
2026-05-11 10:02:14 WARN inventory-api[2214] --- [order-worker-2] com.example.retry.PaymentClient : upstream timeout, retrying payment authorization attempt=2
|
||||
2026-05-11 10:05:31 ERROR inventory-api[2214] --- [http-nio-8080-exec-7] com.example.orders.OrderController : request failed while loading order id=4812
|
||||
java.lang.NullPointerException: Cannot invoke "Customer.getStatus()" because "customer" is null
|
||||
at com.example.orders.OrderService.validateCustomer(OrderService.java:144)
|
||||
at com.example.orders.OrderService.submit(OrderService.java:92)
|
||||
at com.example.orders.OrderController.create(OrderController.java:61)
|
||||
Caused by: java.lang.IllegalStateException: customer lookup returned empty result
|
||||
at com.example.customers.CustomerRepository.findRequired(CustomerRepository.java:38)
|
||||
... 3 more
|
||||
2026-05-11 10:08:42 WARN inventory-api[2214] --- [http-nio-8080-exec-2] com.example.integration.ShippingClient : java.net.SocketTimeoutException: Read timed out calling shipping endpoint
|
||||
2026-05-11 10:09:13 ERROR inventory-api[2214] --- [pool-4-thread-1] com.example.integration.TaxClient : java.net.ConnectException: connection refused connecting to tax-service:8443
|
||||
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
|
||||
2026-05-11 10:13:02 ERROR inventory-api[2214] --- [http-nio-8080-exec-4] com.example.db.InventoryRepository : database unavailable during checkout commit
|
||||
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
|
||||
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:743)
|
||||
at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:666)
|
||||
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
|
||||
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
|
||||
... 2 more
|
||||
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
|
||||
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
|
||||
at sun.security.provider.certpath.BasicChecker.verifyTimestamp(BasicChecker.java:194)
|
||||
2026-05-11 10:18:01 ERROR inventory-api[2214] --- [http-nio-8080-exec-8] com.example.web.ErrorHandler : HTTP 500 POST /api/orders requestId=req-1001
|
||||
2026-05-11 10:18:03 ERROR inventory-api[2214] --- [http-nio-8080-exec-9] com.example.web.ErrorHandler : HTTP 503 GET /api/inventory requestId=req-1002
|
||||
2026-05-11 10:18:06 ERROR inventory-api[2214] --- [http-nio-8080-exec-3] com.example.web.ErrorHandler : HTTP 503 GET /api/inventory requestId=req-1003
|
||||
2026-05-11 10:21:27 FATAL inventory-api[2214] --- [main] org.apache.catalina.core.StandardService : JVM failure detected, stopping service
|
||||
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
|
||||
at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
|
||||
at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
|
||||
at com.example.cache.ReportCache.loadAll(ReportCache.java:87)
|
||||
@@ -0,0 +1,215 @@
|
||||
# JVM Log Analyzer
|
||||
|
||||
- Overall status: CRITICAL
|
||||
- Finding language is a triage summary; logs alone do not prove root cause.
|
||||
|
||||
## CRITICAL: CertificateExpiredException
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: tls_certificate
|
||||
- First seen: 2026-05-11 10:16:40
|
||||
- Last seen: 2026-05-11 10:16:40
|
||||
- Stack traces linked: 0
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
|
||||
```
|
||||
|
||||
## CRITICAL: CommunicationsException
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: database
|
||||
- First seen: 2026-05-11 10:13:02
|
||||
- Last seen: 2026-05-11 10:13:02
|
||||
- Stack traces linked: 0
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
|
||||
```
|
||||
|
||||
## CRITICAL: connection pool exhausted
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: database
|
||||
- First seen: 2026-05-11 10:12:55
|
||||
- Last seen: 2026-05-11 10:12:55
|
||||
- Stack traces linked: 0
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
|
||||
```
|
||||
|
||||
## CRITICAL: database unavailable
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: database
|
||||
- First seen: 2026-05-11 10:13:02
|
||||
- Last seen: 2026-05-11 10:13:02
|
||||
- Stack traces linked: 0
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
2026-05-11 10:13:02 ERROR inventory-api[2214] --- [http-nio-8080-exec-4] com.example.db.InventoryRepository : database unavailable during checkout commit
|
||||
```
|
||||
|
||||
## CRITICAL: FATAL
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: fatal
|
||||
- First seen: 2026-05-11 10:21:27
|
||||
- Last seen: 2026-05-11 10:21:27
|
||||
- Stack traces linked: 0
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
2026-05-11 10:21:27 FATAL inventory-api[2214] --- [main] org.apache.catalina.core.StandardService : JVM failure detected, stopping service
|
||||
```
|
||||
|
||||
## CRITICAL: Java heap space
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: jvm_memory
|
||||
- First seen: 2026-05-11 10:21:27
|
||||
- Last seen: 2026-05-11 10:21:27
|
||||
- Stack traces linked: 0
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
|
||||
```
|
||||
|
||||
## CRITICAL: OutOfMemoryError
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: jvm_memory
|
||||
- First seen: 2026-05-11 10:21:27
|
||||
- Last seen: 2026-05-11 10:21:27
|
||||
- Stack traces linked: 1
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
|
||||
```
|
||||
|
||||
Stack trace samples:
|
||||
|
||||
```text
|
||||
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
|
||||
at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
|
||||
at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
|
||||
at com.example.cache.ReportCache.loadAll(ReportCache.java:87)
|
||||
```
|
||||
|
||||
## CRITICAL: SQLRecoverableException
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: database
|
||||
- First seen: 2026-05-11 10:13:02
|
||||
- Last seen: 2026-05-11 10:13:02
|
||||
- Stack traces linked: 1
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
|
||||
```
|
||||
|
||||
Stack trace samples:
|
||||
|
||||
```text
|
||||
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
|
||||
at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:743)
|
||||
at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:666)
|
||||
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
|
||||
at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
|
||||
... 2 more
|
||||
```
|
||||
|
||||
## CRITICAL: SSLHandshakeException
|
||||
|
||||
- Occurrences: 1
|
||||
- Symptom: tls_certificate
|
||||
- First seen: 2026-05-11 10:16:40
|
||||
- Last seen: 2026-05-11 10:16:40
|
||||
- Stack traces linked: 1
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
|
||||
```
|
||||
|
||||
Stack trace samples:
|
||||
|
||||
```text
|
||||
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
|
||||
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
|
||||
at sun.security.provider.certpath.BasicChecker.verifyTimestamp(BasicChecker.java:194)
|
||||
```
|
||||
|
||||
## WARNING: ERROR
|
||||
|
||||
- Occurrences: 8
|
||||
- Symptom: log_level
|
||||
- First seen: 2026-05-11 10:05:31
|
||||
- Last seen: 2026-05-11 10:18:06
|
||||
- Stack traces linked: 0
|
||||
|
||||
Sample log lines:
|
||||
|
||||
```text
|
||||
2026-05-11 10:05:31 ERROR inventory-api[2214] --- [http-nio-8080-exec-7] com.example.orders.OrderController : request failed while loading order id=4812
|
||||
2026-05-11 10:09:13 ERROR inventory-api[2214] --- [pool-4-thread-1] com.example.integration.TaxClient : java.net.ConnectException: connection refused connecting to tax-service:8443
|
||||
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
|
||||
```
|
||||
|
||||
## Top Exception Types
|
||||
|
||||
| Value | Count |
|
||||
| --- | ---: |
|
||||
| NullPointerException | 1 |
|
||||
| IllegalStateException | 1 |
|
||||
| SocketTimeoutException | 1 |
|
||||
| ConnectException | 1 |
|
||||
| SQLRecoverableException | 1 |
|
||||
| CommunicationsException | 1 |
|
||||
| SSLHandshakeException | 1 |
|
||||
| CertificateExpiredException | 1 |
|
||||
| OutOfMemoryError | 1 |
|
||||
|
||||
## Top Operational Symptoms
|
||||
|
||||
| Value | Count |
|
||||
| --- | ---: |
|
||||
| log_level | 10 |
|
||||
| database | 4 |
|
||||
| http_5xx | 3 |
|
||||
| application_exception | 2 |
|
||||
| network_timeout | 2 |
|
||||
| network_connectivity | 2 |
|
||||
| tls_certificate | 2 |
|
||||
| jvm_memory | 2 |
|
||||
| retry | 1 |
|
||||
| fatal | 1 |
|
||||
|
||||
## Operational Summary
|
||||
|
||||
- Overall status: CRITICAL
|
||||
- Total lines scanned: 32
|
||||
- Total findings: 29
|
||||
- Total stack traces detected: 4
|
||||
- Critical finding groups: 9
|
||||
- Warning finding groups: 11
|
||||
- HTTP 5xx count: 3
|
||||
- Parsed timestamps count: 13
|
||||
- Unknown timestamps count: 19
|
||||
@@ -0,0 +1,837 @@
|
#!/usr/bin/env python3
"""Analyze JVM and Java application logs for operational findings."""

from __future__ import annotations

import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any


EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2

UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}

CRITICAL_PATTERNS = [
    {"name": "OutOfMemoryError", "pattern": "OutOfMemoryError", "symptom": "jvm_memory"},
    {"name": "Java heap space", "pattern": "Java heap space", "symptom": "jvm_memory"},
    {"name": "GC overhead limit exceeded", "pattern": "GC overhead limit exceeded", "symptom": "jvm_memory"},
    {"name": "StackOverflowError", "pattern": "StackOverflowError", "symptom": "jvm_stack"},
    {"name": "NoClassDefFoundError", "pattern": "NoClassDefFoundError", "symptom": "class_loading"},
    {"name": "ClassNotFoundException", "pattern": "ClassNotFoundException", "symptom": "class_loading"},
    {"name": "ExceptionInInitializerError", "pattern": "ExceptionInInitializerError", "symptom": "class_loading"},
    {"name": "SSLHandshakeException", "pattern": "SSLHandshakeException", "symptom": "tls_certificate"},
    {"name": "CertificateExpiredException", "pattern": "CertificateExpiredException", "symptom": "tls_certificate"},
    {"name": "SQLException", "pattern": "SQLException", "symptom": "database"},
    {"name": "SQLRecoverableException", "pattern": "SQLRecoverableException", "symptom": "database"},
    {"name": "CommunicationsException", "pattern": "CommunicationsException", "symptom": "database"},
    {"name": "database unavailable", "pattern": "database unavailable", "symptom": "database"},
    {"name": "connection pool exhausted", "pattern": "connection pool exhausted", "symptom": "database"},
    {"name": "FATAL", "pattern": "FATAL", "symptom": "fatal"},
]

WARNING_PATTERNS = [
    {"name": "NullPointerException", "pattern": "NullPointerException", "symptom": "application_exception"},
    {"name": "IllegalArgumentException", "pattern": "IllegalArgumentException", "symptom": "application_exception"},
    {"name": "IllegalStateException", "pattern": "IllegalStateException", "symptom": "application_exception"},
    {"name": "SocketTimeoutException", "pattern": "SocketTimeoutException", "symptom": "network_timeout"},
    {"name": "ConnectException", "pattern": "ConnectException", "symptom": "network_connectivity"},
    {"name": "TimeoutException", "pattern": "TimeoutException", "symptom": "network_timeout"},
    {"name": "connection refused", "pattern": "connection refused", "symptom": "network_connectivity"},
    {"name": "connection reset", "pattern": "connection reset", "symptom": "network_connectivity"},
    {"name": "Broken pipe", "pattern": "Broken pipe", "symptom": "network_connectivity"},
    {"name": "WARN", "pattern": "WARN", "symptom": "log_level"},
    {"name": "ERROR", "pattern": "ERROR", "symptom": "log_level"},
    {"name": "retrying", "pattern": "retrying", "symptom": "retry"},
    {"name": "slow query", "pattern": "slow query", "symptom": "database"},
    {"name": "deadlock detected", "pattern": "deadlock detected", "symptom": "database"},
]

HTTP_PATTERNS = [
    {"name": "HTTP 500", "pattern": "HTTP 500", "symptom": "http_5xx"},
    {"name": "HTTP 502", "pattern": "HTTP 502", "symptom": "http_5xx"},
    {"name": "HTTP 503", "pattern": "HTTP 503", "symptom": "http_5xx"},
    {"name": "HTTP 504", "pattern": "HTTP 504", "symptom": "http_5xx"},
]

ISO_TIMESTAMP_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
LEVEL_RE = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")
SPRING_LOGGER_RE = re.compile(r"\s---\s+\[[^\]]+\]\s+([A-Za-z0-9_.$-]+)\s*:")
GENERIC_LOGGER_RE = re.compile(
    r"\b(?:TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b\s+(?:\d+\s+)?([A-Za-z0-9_.$-]+)\s*:"
)
THREAD_RE = re.compile(r"\[([^\]]+)\]")
SPRING_THREAD_RE = re.compile(r"\s---\s+\[([^\]]+)\]")
EXCEPTION_RE = re.compile(
    r"\b((?:[A-Za-z_$][\w$]*\.)+[A-Za-z_$][\w$]*(?:Exception|Error)|[A-Za-z_$][\w$]*(?:Exception|Error))\b"
)
STACK_FRAME_RE = re.compile(r"^\s+at\s+")
CAUSED_BY_RE = re.compile(r"^\s*Caused by:\s+")
MORE_RE = re.compile(r"^\s*\.\.\.\s+\d+\s+more\b")
||||
|
||||
|
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Analyze local JVM and Java application logs for operational findings."
    )
    parser.add_argument("--file", required=True, help="Local JVM or Java application log to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of top finding groups, exception types, and symptoms to display. Default: 10.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding group. Default: 3.",
    )
    parser.add_argument(
        "--include-stacktraces",
        action="store_true",
        help="Include short multiline stack trace samples in text and Markdown reports.",
    )
    parser.add_argument(
        "--max-stack-lines",
        type=positive_int,
        default=12,
        help="Maximum lines retained per stack trace sample. Default: 12.",
    )
    parser.add_argument(
        "--http-critical-threshold",
        type=positive_int,
        default=5,
        help="HTTP 5xx count that raises HTTP findings to CRITICAL. Default: 5.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--since",
        type=parse_filter_timestamp,
        help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--until",
        type=parse_filter_timestamp,
        help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
    )
    return parser
||||
|
||||
|
def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def parse_filter_timestamp(value: str) -> datetime:
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise argparse.ArgumentTypeError('expected timestamp format "YYYY-MM-DD HH:MM:SS"')
||||
|
||||
|
def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    flags = re.IGNORECASE if ignore_case else 0
    compiled = []
    for item in CRITICAL_PATTERNS:
        compiled.append({**item, "severity": "CRITICAL", "kind": "pattern", "regex": re.compile(re.escape(item["pattern"]), flags)})
    for item in WARNING_PATTERNS:
        compiled.append({**item, "severity": "WARNING", "kind": "pattern", "regex": re.compile(re.escape(item["pattern"]), flags)})
    for item in HTTP_PATTERNS:
        compiled.append({**item, "severity": "WARNING", "kind": "http_5xx", "regex": re.compile(re.escape(item["pattern"]), flags)})
    return compiled


def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()
||||
|
||||
|
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        fraction = iso_match.group(3) or ""
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        parse_value = raw
        fmt = "%Y-%m-%d %H:%M:%S"
        if fraction:
            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
            fmt = "%Y-%m-%d %H:%M:%S.%f"
        try:
            return datetime.strptime(parse_value, fmt), raw + fraction
        except ValueError:
            return None, UNKNOWN

    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        try:
            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw

    return None, UNKNOWN


def line_in_time_window(
    parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True
||||
|
||||
|
def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def extract_level(line: str) -> str:
    match = LEVEL_RE.search(line)
    if match:
        return match.group(1)
    return UNKNOWN


def extract_thread(line: str) -> str:
    for regex in (SPRING_THREAD_RE, THREAD_RE):
        match = regex.search(line)
        if match:
            return match.group(1)
    return UNKNOWN


def extract_logger(line: str) -> str:
    for regex in (SPRING_LOGGER_RE, GENERIC_LOGGER_RE):
        match = regex.search(line)
        if match:
            return match.group(1)
    return UNKNOWN


def normalize_exception_type(value: str) -> str:
    return value.split(".")[-1]


def extract_exception_type(line: str) -> str:
    match = EXCEPTION_RE.search(line)
    if match:
        return normalize_exception_type(match.group(1))
    return UNKNOWN
||||
|
||||
|
||||
def is_stack_start(line: str) -> bool:
|
||||
return (
|
||||
"Exception in thread" in line
|
||||
or CAUSED_BY_RE.search(line) is not None
|
||||
or EXCEPTION_RE.search(line) is not None
|
||||
)
|
||||
|
||||
|
||||
def is_stack_continuation(line: str) -> bool:
    return (
        STACK_FRAME_RE.search(line) is not None
        or CAUSED_BY_RE.search(line) is not None
        or MORE_RE.search(line) is not None
    )
|
||||
|
||||
|
||||
def update_seen(
    group: dict[str, Any], parsed_at: datetime | None, rendered_at: str
) -> None:
    if parsed_at is None:
        return
    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
        group["first_seen"] = (parsed_at, rendered_at)
    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
        group["last_seen"] = (parsed_at, rendered_at)
|
||||
|
||||
|
||||
def append_limited(items: list[Any], value: Any, limit: int) -> None:
    if limit == 0:
        return
    if value in items:
        return
    if len(items) < limit:
        items.append(value)
|
||||
|
||||
|
||||
def finding_key(severity: str, name: str) -> str:
    return f"{severity}::{name}"
|
||||
|
||||
|
||||
def ensure_group(
    groups: dict[str, dict[str, Any]],
    name: str,
    severity: str,
    symptom: str,
    kind: str,
) -> dict[str, Any]:
    key = finding_key(severity, name)
    return groups.setdefault(
        key,
        {
            "name": name,
            "severity": severity,
            "symptom": symptom,
            "kind": kind,
            "occurrences": 0,
            "stack_trace_count": 0,
            "first_seen": None,
            "last_seen": None,
            "samples": [],
            "stack_trace_samples": [],
            "fields": [],
        },
    )
|
||||
|
||||
|
||||
def add_finding(
    groups: dict[str, dict[str, Any]],
    name: str,
    severity: str,
    symptom: str,
    kind: str,
    line: str,
    parsed_at: datetime | None,
    rendered_at: str,
    max_samples: int,
) -> dict[str, Any]:
    group = ensure_group(groups, name, severity, symptom, kind)
    group["occurrences"] += 1
    update_seen(group, parsed_at, rendered_at)
    append_limited(group["samples"], line, max_samples)
    append_limited(
        group["fields"],
        {
            "timestamp": rendered_at,
            "log_level": extract_level(line),
            "logger": extract_logger(line),
            "thread": extract_thread(line),
            "exception_type": extract_exception_type(line),
            "raw": line,
        },
        max_samples,
    )
    return group
|
||||
|
||||
|
||||
def record_stack_trace(
    groups: dict[str, dict[str, Any]],
    stack: dict[str, Any],
    max_samples: int,
    max_stack_lines: int,
) -> None:
    exception_type = stack["exception_type"] if stack["exception_type"] != UNKNOWN else "Java stack trace"
    severity = severity_for_exception(exception_type)
    group = ensure_group(groups, exception_type, severity, "stack_trace", "stack_trace")
    group["stack_trace_count"] += 1
    update_seen(group, stack["parsed_at"], stack["rendered_at"])
    append_limited(group["samples"], stack["lines"][0], max_samples)
    append_limited(group["stack_trace_samples"], stack["lines"][:max_stack_lines], max_samples)
|
||||
|
||||
|
||||
def severity_for_exception(exception_type: str) -> str:
    critical = {item["name"] for item in CRITICAL_PATTERNS}
    if exception_type in critical or exception_type in {"OutOfMemoryError", "StackOverflowError"}:
        return "CRITICAL"
    return "WARNING"
|
||||
|
||||
|
||||
def detect_stack_traces(
    included: list[dict[str, Any]],
    groups: dict[str, dict[str, Any]],
    max_samples: int,
    max_stack_lines: int,
) -> int:
    stack: dict[str, Any] | None = None
    stack_count = 0

    for item in included:
        line = item["line"]
        if stack is None:
            if is_stack_start(line):
                stack = {
                    "lines": [line],
                    "exception_type": extract_exception_type(line),
                    "parsed_at": item["parsed_at"],
                    "rendered_at": item["rendered_at"],
                }
            continue

        if is_stack_continuation(line):
            stack["lines"].append(line)
            if stack["exception_type"] == UNKNOWN:
                stack["exception_type"] = extract_exception_type(line)
            continue

        if len(stack["lines"]) > 1:
            record_stack_trace(groups, stack, max_samples, max_stack_lines)
            stack_count += 1
        stack = None
        if is_stack_start(line):
            stack = {
                "lines": [line],
                "exception_type": extract_exception_type(line),
                "parsed_at": item["parsed_at"],
                "rendered_at": item["rendered_at"],
            }

    if stack is not None and len(stack["lines"]) > 1:
        record_stack_trace(groups, stack, max_samples, max_stack_lines)
        stack_count += 1

    return stack_count
|
||||
|
||||
|
||||
def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    since: datetime | None,
    until: datetime | None,
    top: int,
    max_samples: int,
    max_stack_lines: int,
    http_critical_threshold: int,
) -> dict[str, Any]:
    syslog_year = since.year if since is not None else datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    exception_counts: Counter[str] = Counter()
    symptom_counts: Counter[str] = Counter()
    parsed_timestamps = 0
    unknown_timestamps = 0
    included: list[dict[str, Any]] = []
    http_5xx_count = 0
    context_parsed_at: datetime | None = None
    context_rendered_at = UNKNOWN

    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if parsed_at is None:
            unknown_timestamps += 1
        else:
            parsed_timestamps += 1
            context_parsed_at = parsed_at
            context_rendered_at = rendered_at
        if not line_in_time_window(parsed_at, since, until):
            continue

        # Stack trace frames often omit timestamps; keep nearby log context for first/last seen.
        effective_parsed_at = parsed_at
        effective_rendered_at = rendered_at
        if parsed_at is None and (is_stack_start(line) or is_stack_continuation(line)):
            effective_parsed_at = context_parsed_at
            effective_rendered_at = context_rendered_at

        included.append(
            {
                "line": line,
                "parsed_at": effective_parsed_at,
                "rendered_at": effective_rendered_at,
            }
        )
        matched_names = set()

        for item in patterns:
            if not item["regex"].search(line):
                continue
            severity = item["severity"]
            if item["kind"] == "http_5xx":
                http_5xx_count += 1
            add_finding(
                groups=groups,
                name=item["name"],
                severity=severity,
                symptom=item["symptom"],
                kind=item["kind"],
                line=line,
                parsed_at=effective_parsed_at,
                rendered_at=effective_rendered_at,
                max_samples=max_samples,
            )
            symptom_counts[item["symptom"]] += 1
            matched_names.add(item["name"])

        exception_type = extract_exception_type(line)
        if exception_type != UNKNOWN:
            exception_counts[exception_type] += 1
            if exception_type not in matched_names:
                severity = severity_for_exception(exception_type)
                add_finding(
                    groups=groups,
                    name=exception_type,
                    severity=severity,
                    symptom="application_exception",
                    kind="exception",
                    line=line,
                    parsed_at=effective_parsed_at,
                    rendered_at=effective_rendered_at,
                    max_samples=max_samples,
                )
                symptom_counts["application_exception"] += 1

    stack_trace_count = detect_stack_traces(included, groups, max_samples, max_stack_lines)
    promote_http_5xx(groups, http_5xx_count, http_critical_threshold)

    findings = sorted(
        (render_group(group) for group in groups.values()),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["name"].lower(),
        ),
    )

    summary = build_summary(
        total_lines=len(lines),
        findings=findings,
        stack_trace_count=stack_trace_count,
        http_5xx_count=http_5xx_count,
        parsed_timestamps=parsed_timestamps,
        unknown_timestamps=unknown_timestamps,
    )

    return {
        "summary": summary,
        "findings": findings[:top],
        "top_exception_types": top_items(exception_counts, top),
        "top_operational_symptoms": top_items(symptom_counts, top),
    }
|
||||
|
||||
|
||||
def promote_http_5xx(
    groups: dict[str, dict[str, Any]], http_5xx_count: int, threshold: int
) -> None:
    if http_5xx_count < threshold:
        return

    http_names = {item["name"] for item in HTTP_PATTERNS}
    for old_key, group in list(groups.items()):
        if group["name"] not in http_names or group["severity"] == "CRITICAL":
            continue
        group["severity"] = "CRITICAL"
        new_key = finding_key("CRITICAL", group["name"])
        groups[new_key] = group
        del groups[old_key]
|
||||
|
||||
|
||||
def render_group(group: dict[str, Any]) -> dict[str, Any]:
    return {
        "name": group["name"],
        "severity": group["severity"],
        "symptom": group["symptom"],
        "kind": group["kind"],
        "occurrences": group["occurrences"],
        "stack_trace_count": group["stack_trace_count"],
        "first_seen": render_seen(group["first_seen"]),
        "last_seen": render_seen(group["last_seen"]),
        "samples": group["samples"],
        "stack_trace_samples": group["stack_trace_samples"],
        "fields": group["fields"],
    }
|
||||
|
||||
|
||||
def build_summary(
    total_lines: int,
    findings: list[dict[str, Any]],
    stack_trace_count: int,
    http_5xx_count: int,
    parsed_timestamps: int,
    unknown_timestamps: int,
) -> dict[str, Any]:
    critical_groups = sum(1 for item in findings if item["severity"] == "CRITICAL")
    warning_groups = sum(1 for item in findings if item["severity"] == "WARNING")
    total_findings = sum(item["occurrences"] for item in findings)

    if critical_groups > 0:
        status = "CRITICAL"
    elif warning_groups > 0:
        status = "WARNING"
    else:
        status = "OK"

    return {
        "overall_status": status,
        "total_lines_scanned": total_lines,
        "total_findings": total_findings,
        "total_stack_traces_detected": stack_trace_count,
        "critical_finding_groups": critical_groups,
        "warning_finding_groups": warning_groups,
        "http_5xx_count": http_5xx_count,
        "timestamp_coverage": {
            "parsed_timestamps_count": parsed_timestamps,
            "unknown_timestamps_count": unknown_timestamps,
        },
    }
|
||||
|
||||
|
||||
def top_items(counter: Counter[str], limit: int) -> list[dict[str, Any]]:
    return [{"value": value, "count": count} for value, count in counter.most_common(limit)]
|
||||
|
||||
|
||||
def render_text(report: dict[str, Any], include_stacktraces: bool) -> str:
    lines = ["JVM Log Analyzer", "================", ""]
    summary = report["summary"]
    lines.extend(
        [
            f"Overall status: {summary['overall_status']}",
            "Findings require review; logs alone do not prove root cause.",
            "",
        ]
    )

    if not report["findings"]:
        lines.extend(["No configured JVM/application findings were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['name']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"Symptom: {finding['symptom']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    f"Stack traces linked: {finding['stack_trace_count']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f" - {sample}" for sample in finding["samples"])
            else:
                lines.append(" - No samples retained")
            if include_stacktraces and finding["stack_trace_samples"]:
                lines.append("Stack trace samples:")
                for stack in finding["stack_trace_samples"]:
                    lines.append(" ---")
                    lines.extend(f" {entry}" for entry in stack)
            lines.append("")

    lines.extend(render_text_table("Top Exception Types", report["top_exception_types"]))
    lines.extend(render_text_table("Top Operational Symptoms", report["top_operational_symptoms"]))
    lines.extend(render_text_summary(summary))
    return "\n".join(lines) + "\n"
|
||||
|
||||
|
||||
def render_text_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [title, "-" * len(title)]
    if not rows:
        lines.append("No entries detected.")
    else:
        lines.extend(f"- {item['value']}: {item['count']}" for item in rows)
    lines.append("")
    return lines
|
||||
|
||||
|
||||
def render_text_summary(summary: dict[str, Any]) -> list[str]:
    coverage = summary["timestamp_coverage"]
    return [
        "Operational Summary",
        "-------------------",
        f"Overall status: {summary['overall_status']}",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Total findings: {summary['total_findings']}",
        f"Total stack traces detected: {summary['total_stack_traces_detected']}",
        f"Critical finding groups: {summary['critical_finding_groups']}",
        f"Warning finding groups: {summary['warning_finding_groups']}",
        f"HTTP 5xx count: {summary['http_5xx_count']}",
        f"Parsed timestamps count: {coverage['parsed_timestamps_count']}",
        f"Unknown timestamps count: {coverage['unknown_timestamps_count']}",
    ]
|
||||
|
||||
|
||||
def render_markdown(report: dict[str, Any], include_stacktraces: bool) -> str:
    summary = report["summary"]
    lines = [
        "# JVM Log Analyzer",
        "",
        f"- Overall status: {summary['overall_status']}",
        "- Finding language is a triage summary; logs alone do not prove root cause.",
        "",
    ]

    if not report["findings"]:
        lines.extend(["No configured JVM/application findings were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"## {finding['severity']}: {finding['name']}",
                    "",
                    f"- Occurrences: {finding['occurrences']}",
                    f"- Symptom: {finding['symptom']}",
                    f"- First seen: {finding['first_seen']}",
                    f"- Last seen: {finding['last_seen']}",
                    f"- Stack traces linked: {finding['stack_trace_count']}",
                    "",
                    "Sample log lines:",
                    "",
                ]
            )
            if finding["samples"]:
                lines.append("```text")
                lines.extend(finding["samples"])
                lines.append("```")
            else:
                lines.append("_No samples retained._")
            lines.append("")

            if include_stacktraces and finding["stack_trace_samples"]:
                lines.extend(["Stack trace samples:", ""])
                for stack in finding["stack_trace_samples"]:
                    lines.append("```text")
                    lines.extend(stack)
                    lines.append("```")
                    lines.append("")

    lines.extend(render_markdown_table("Top Exception Types", report["top_exception_types"]))
    lines.extend(render_markdown_table("Top Operational Symptoms", report["top_operational_symptoms"]))
    lines.extend(render_markdown_summary(summary))
    return "\n".join(lines)
|
||||
|
||||
|
||||
def render_markdown_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [f"## {title}", ""]
    if not rows:
        lines.extend(["No entries detected.", ""])
        return lines
    lines.extend(["| Value | Count |", "| --- | ---: |"])
    lines.extend(f"| {item['value']} | {item['count']} |" for item in rows)
    lines.append("")
    return lines
|
||||
|
||||
|
||||
def render_markdown_summary(summary: dict[str, Any]) -> list[str]:
    coverage = summary["timestamp_coverage"]
    return [
        "## Operational Summary",
        "",
        f"- Overall status: {summary['overall_status']}",
        f"- Total lines scanned: {summary['total_lines_scanned']}",
        f"- Total findings: {summary['total_findings']}",
        f"- Total stack traces detected: {summary['total_stack_traces_detected']}",
        f"- Critical finding groups: {summary['critical_finding_groups']}",
        f"- Warning finding groups: {summary['warning_finding_groups']}",
        f"- HTTP 5xx count: {summary['http_5xx_count']}",
        f"- Parsed timestamps count: {coverage['parsed_timestamps_count']}",
        f"- Unknown timestamps count: {coverage['unknown_timestamps_count']}",
        "",
    ]
|
||||
|
||||
|
||||
def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"
|
||||
|
||||
|
||||
def write_report(input_path: Path, output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return

    path = Path(output_path)
    try:
        if path.resolve() == input_path.resolve():
            raise OSError("output path must not be the same as input file")
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc
|
||||
|
||||
|
||||
def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    input_path = Path(args.file)

    if args.since is not None and args.until is not None and args.since > args.until:
        parser.error("--since must be earlier than or equal to --until")

    try:
        lines = read_log_file(input_path)
        report = analyze_log(
            lines=lines,
            patterns=compile_patterns(args.ignore_case),
            since=args.since,
            until=args.until,
            top=args.top,
            max_samples=args.max_samples,
            max_stack_lines=args.max_stack_lines,
            http_critical_threshold=args.http_critical_threshold,
        )

        if args.format == "text":
            content = render_text(report, args.include_stacktraces)
        elif args.format == "markdown":
            content = render_markdown(report, args.include_stacktraces)
        else:
            content = render_json(report)

        write_report(input_path, args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID

    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS
|
||||
|
||||
|
||||
if __name__ == "__main__":
    sys.exit(main())
|
||||
@@ -0,0 +1,199 @@
|
||||
# known-error-matcher
|
||||
|
||||
`known-error-matcher` is a read-only Python CLI for scanning local log files against a JSON catalog of known operational error patterns. It connects matched log symptoms with severity, category, sample lines, and runbook references so an infrastructure engineer can decide what needs review next.
|
||||
|
||||
The tool matches known operational error patterns that require review. It does not prove an incident, identify root cause automatically, or replace service-specific runbooks.
|
||||
|
||||
## Purpose
|
||||
|
||||
- Identify which cataloged operational problems are visible in a collected log.
|
||||
- Count how often each known error pattern appears.
|
||||
- Surface warning and critical matches conservatively.
|
||||
- Point operators toward relevant runbooks or supporting local tools.
|
||||
- Produce predictable text, Markdown, or JSON output for incident notes.
|
||||
|
||||
## When To Use
|
||||
|
||||
- During incident response when a collected application, system, or journal extract needs quick known-error matching.
|
||||
- Before attaching log evidence to an incident, problem, or change ticket.
|
||||
- When teams maintain a small local catalog of operational patterns and runbook links.
|
||||
- When JSON output is useful for later local automation.
|
||||
|
||||
## What It Does Not Do
|
||||
|
||||
- It does not read remote systems or live streams.
|
||||
- It does not modify logs, services, applications, accounts, or host state.
|
||||
- It does not query ELK, SIEM, APM, Zabbix, ticketing systems, or external services.
|
||||
- It does not find root cause automatically.
|
||||
- It does not prove an incident or confirm customer impact.
|
||||
- It does not classify every vendor-specific log message.
|
||||
|
||||
## Pattern Catalog Format
|
||||
|
||||
Patterns are defined in JSON because the Python standard library can parse JSON without third-party dependencies.
|
||||
|
||||
```json
{
  "patterns": [
    {
      "id": "disk_full",
      "name": "Disk full",
      "severity": "CRITICAL",
      "regex": "No space left on device|disk full",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem or application failed because free space was exhausted."
    }
  ]
}
```
|
||||
|
||||
Required fields per pattern:
|
||||
|
||||
- `id` - stable non-empty identifier.
|
||||
- `name` - human-readable finding name.
|
||||
- `severity` - `WARNING` or `CRITICAL`.
|
||||
- `regex` - Python regular expression used for matching.
|
||||
|
||||
Optional fields:
|
||||
|
||||
- `category` - operational grouping such as `storage`, `network`, `security`, `application`, or `systemd`. Missing values are reported as `UNKNOWN`.
|
||||
- `runbook` - repository path to review when the pattern matches. Missing values are reported as `None`.
|
||||
- `description` - short operator-facing explanation. Missing values are reported as `None`.
|
||||
|
||||
The catalog is validated before scanning starts. Invalid JSON, missing required fields, duplicate IDs, invalid severity values, and invalid regexes fail with exit code `2`.
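
The validation rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual implementation: the function name `validate_catalog`, the `compiled` key, and the error messages are hypothetical, and mapping a `ValueError` to exit code `2` is left to the caller.

```python
import json
import re

ALLOWED_SEVERITIES = {"WARNING", "CRITICAL"}
REQUIRED_FIELDS = ("id", "name", "severity", "regex")


def validate_catalog(raw_text: str) -> list[dict]:
    """Hypothetical helper: parse a catalog and raise ValueError on any invalid entry."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc}") from exc

    patterns = data.get("patterns") if isinstance(data, dict) else None
    if not isinstance(patterns, list):
        raise ValueError("catalog must contain a 'patterns' list")

    seen_ids: set[str] = set()
    validated = []
    for index, entry in enumerate(patterns):
        if not isinstance(entry, dict):
            raise ValueError(f"pattern {index}: must be a JSON object")
        # Required fields must be present and non-empty strings.
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"pattern {index}: missing required field '{field}'")
        if entry["id"] in seen_ids:
            raise ValueError(f"pattern {index}: duplicate id '{entry['id']}'")
        seen_ids.add(entry["id"])
        if entry["severity"] not in ALLOWED_SEVERITIES:
            raise ValueError(f"pattern {index}: invalid severity '{entry['severity']}'")
        try:
            compiled = re.compile(entry["regex"])
        except re.error as exc:
            raise ValueError(f"pattern {index}: invalid regex: {exc}") from exc
        # Optional category falls back to UNKNOWN, as described above.
        validated.append(
            {**entry, "category": entry.get("category", "UNKNOWN"), "compiled": compiled}
        )
    return validated
```

Validating everything before the first log line is read means a broken catalog fails fast instead of producing a partial report.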
|
||||
|
||||
## Adding A Known Error Pattern
|
||||
|
||||
Add a new object under `patterns` in `patterns.json`:
|
||||
|
||||
```json
{
  "id": "example_dependency_failure",
  "name": "Example dependency failure",
  "severity": "WARNING",
  "regex": "dependency request failed|upstream dependency unavailable",
  "category": "application",
  "runbook": "infra-run/runbooks/incidents/dependency-failure.md",
  "description": "Application logged a dependency failure that requires review."
}
```
|
||||
|
||||
Use a stable `id`, choose the lowest severity that still reflects operational risk, and keep the regex specific enough to avoid noisy generic matches. Prefer a runbook path that already exists; otherwise use a plausible future path under `infra-run/runbooks/incidents/` or leave it empty.
|
||||
|
||||
## Severity Model
|
||||
|
||||
Overall status is conservative:
|
||||
|
||||
- `OK` - no known error patterns matched.
|
||||
- `WARNING` - one or more warning patterns matched and no critical patterns matched.
|
||||
- `CRITICAL` - one or more critical patterns matched.
|
||||
|
||||
The status means known error patterns require review. It is not a final root-cause statement.
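
As a rough sketch of this model, the overall status can be derived from the severities of the matched patterns, with the exit code following the status (`0` for `OK`, `1` when known error matches were detected). The function names here are hypothetical and only illustrate the rule; the tool's own code may be organized differently.

```python
EXIT_OK = 0
EXIT_FINDINGS = 1


def overall_status(matched_severities: list[str]) -> str:
    """Hypothetical helper: the most severe matched pattern decides the status."""
    if "CRITICAL" in matched_severities:
        return "CRITICAL"
    if "WARNING" in matched_severities:
        return "WARNING"
    return "OK"


def exit_code_for(status: str) -> int:
    """Hypothetical helper: any non-OK status signals that review is required."""
    return EXIT_OK if status == "OK" else EXIT_FINDINGS


if __name__ == "__main__":
    status = overall_status(["WARNING", "CRITICAL", "WARNING"])
    print(status, exit_code_for(status))
```

A single critical match is enough to raise the overall status, regardless of how many warning matches accompany it.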
|
||||
|
||||
## Category Filtering
|
||||
|
||||
Use `--category CATEGORY` to include only matches where the pattern category exactly matches the provided value.
|
||||
|
||||
Examples:
|
||||
|
||||
```bash
|
||||
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json --category storage
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --category application
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
cd infra-run/scripts/python/known-error-matcher
|
||||
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json
|
||||
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown --output known-error-report.md
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format json
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --ignore-case
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --severity critical
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --top 10
|
||||
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --max-samples 5
|
||||
```
|
||||
|
||||
## Output Formats
|
||||
|
||||
- `text` - default terminal-oriented report.
|
||||
- `markdown` - incident, problem, or change ticket attachment format.
|
||||
- `json` - structured output for local automation.
|
||||
|
||||
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file or pattern catalog file.
|
||||
|
||||
## Exit Codes
|
||||
|
||||
- `0` - OK, no known error matches.
|
||||
- `1` - Known error matches detected.
|
||||
- `2` - Invalid input, unreadable file, invalid JSON, invalid pattern catalog, invalid regex, bad argument, output write failure, or runtime error.
|
||||
|
||||
## Example Text Output
|
||||
|
||||
```text
|
||||
Known Error Matcher
|
||||
===================
|
||||
|
||||
Overall status: CRITICAL
|
||||
Known error pattern matches require operator review; logs alone do not prove root cause.
|
||||
|
||||
[CRITICAL] database_unavailable - Database unavailable
|
||||
Category: application
|
||||
Occurrences: 1
|
||||
First seen: 2026-05-11 10:16:07
|
||||
Last seen: 2026-05-11 10:16:07
|
||||
Runbook: infra-run/scripts/python/jvm-log-analyzer/README.md
|
||||
Description: Application logged unavailable database or database connectivity symptoms.
|
||||
Samples:
|
||||
- 2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
|
||||
|
||||
Operational Summary
|
||||
-------------------
|
||||
Overall status: CRITICAL
|
||||
Total lines scanned: 9
|
||||
Known error matches: 7
|
||||
Matched known error patterns: 7
|
||||
Critical matched patterns: 5
|
||||
Warning matched patterns: 2
|
||||
Top categories: application (3), network (2), application_jvm (2)
|
||||
Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
|
||||
Timestamp coverage: parsed=9, unknown=0
|
||||
Filters used: severity=None, category=None
|
||||
Pattern catalog path: patterns.json
|
||||
```
|
||||
|
||||
## Markdown Workflow
|
||||
|
||||
Generate a Markdown report from the collected log and attach it to the incident or problem ticket as supporting evidence:
|
||||
|
||||
```bash
python3 known_error_matcher.py \
  --file examples/sample-app.log \
  --patterns patterns.json \
  --format markdown \
  --output known-error-report.md
```
|
||||
|
||||
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be correlated with service health, monitoring, recent changes, dependency status, and the referenced runbook.
|
||||
|
||||
## Operational Limitations
|
||||
|
||||
- Pattern matching is intentionally simple and predictable.
|
||||
- A single log line can match multiple known error patterns.
|
||||
- Matching is case-sensitive by default, so log lines that differ only in letter case from the cataloged regex are missed unless `--ignore-case` is used.
|
||||
- Timestamp parsing is best-effort; unparseable timestamps are reported as `UNKNOWN`.
|
||||
- Counts are raw log-line matches, not request rates, incident duration, or customer impact.
|
||||
- `--top` limits displayed findings only. The summary still reflects all matched patterns after filters.
|
||||
- Large log files are read into memory; use scoped extracts for very large incidents.
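
Two of these limitations, multiple patterns matching one line and case-sensitive defaults, can be seen in a small sketch. The helper names and the two regexes below are illustrative assumptions, not entries copied from `patterns.json`.

```python
import re
from collections import Counter


def compile_catalog(patterns: dict[str, str], ignore_case: bool) -> dict[str, re.Pattern]:
    """Hypothetical helper: compile each regex, optionally case-insensitively."""
    flags = re.IGNORECASE if ignore_case else 0
    return {pattern_id: re.compile(regex, flags) for pattern_id, regex in patterns.items()}


def count_matches(lines: list[str], compiled: dict[str, re.Pattern]) -> Counter:
    """Count raw line matches; one line may increment several pattern counters."""
    counts: Counter = Counter()
    for line in lines:
        for pattern_id, regex in compiled.items():
            if regex.search(line):
                counts[pattern_id] += 1
    return counts


if __name__ == "__main__":
    catalog = {"http_500": r"HTTP 500", "connection_refused": r"connection refused"}
    log_lines = [
        "ERROR HTTP 500 after upstream connection refused",
        "WARN Connection refused by peer",
    ]
    print(count_matches(log_lines, compile_catalog(catalog, ignore_case=False)))
    print(count_matches(log_lines, compile_catalog(catalog, ignore_case=True)))
```

The first line counts toward both patterns, and the second line is only counted when matching ignores case, which is why the counts should be read as raw line matches rather than distinct events.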
|
||||
|
||||
## Safety Notes
|
||||
|
||||
- The tool only reads the input log and pattern catalog and optionally writes a separate report.
|
||||
- The implementation uses the Python standard library only and does not require package installation.
|
||||
- It does not require elevated privileges unless the chosen log path requires them.
|
||||
- Do not include secrets, private hostnames, customer identifiers, tokens, or unsanitized production details in portfolio examples.
|
||||
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
|
||||
@@ -0,0 +1,9 @@
|
||||
2026-05-11 10:15:30 app01 checkout-api[1842]: INFO request_id=a1 path=/checkout status=200 duration_ms=42
|
||||
2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted
|
||||
2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
|
||||
2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443
|
||||
2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms
|
||||
2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed
|
||||
2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"
|
||||
2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space
|
||||
2026-05-11 10:17:03 app01 checkout-api[1842]: INFO healthcheck completed status=degraded
|
||||
@@ -0,0 +1,97 @@
|
||||
# Known Error Matcher Report
|
||||
|
||||
- Overall status: `CRITICAL`
|
||||
- Known error pattern matches require operator review; logs alone do not prove root cause.
|
||||
|
||||
## Matched Known Errors
|
||||
|
||||
### [CRITICAL] database_unavailable - Database unavailable
|
||||
|
||||
- Category: `application`
|
||||
- Occurrences: `1`
|
||||
- First seen: `2026-05-11 10:16:07`
|
||||
- Last seen: `2026-05-11 10:16:07`
|
||||
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
|
||||
- Description: Application logged unavailable database or database connectivity symptoms.
|
||||
- Samples:
|
||||
- `2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool`
|
||||
|
||||
### [CRITICAL] http_500 - HTTP 500
|
||||
|
||||
- Category: `application`
|
||||
- Occurrences: `1`
|
||||
- First seen: `2026-05-11 10:16:02`
|
||||
- Last seen: `2026-05-11 10:16:02`
|
||||
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
|
||||
- Description: Application or proxy logged HTTP 500 responses.
|
||||
- Samples:
|
||||
- `2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted`
|
||||
|
||||
### [CRITICAL] http_503 - HTTP 503
|
||||
|
||||
- Category: `application`
|
||||
- Occurrences: `1`
|
||||
- First seen: `2026-05-11 10:16:31.456`
|
||||
- Last seen: `2026-05-11 10:16:31.456`
|
||||
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
|
||||
- Description: Application or proxy logged HTTP 503 service unavailable responses.
|
||||
- Samples:
|
||||
- `2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"`
|
||||
|
||||
### [CRITICAL] java_out_of_memory - Java OutOfMemoryError
|
||||
|
||||
- Category: `application_jvm`
|
||||
- Occurrences: `1`
|
||||
- First seen: `2026-05-11 10:16:40`
|
||||
- Last seen: `2026-05-11 10:16:40`
|
||||
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
|
||||
- Description: Java process logged memory exhaustion symptoms.
|
||||
- Samples:
|
||||
- `2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space`
|
||||
|
||||
### [CRITICAL] ssl_handshake_exception - SSLHandshakeException
|
||||
|
||||
- Category: `application_jvm`
|
||||
- Occurrences: `1`
|
||||
- First seen: `2026-05-11 10:16:22`
|
||||
- Last seen: `2026-05-11 10:16:22`
|
||||
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
|
||||
- Description: Java TLS handshake exception was logged.
|
||||
- Samples:
|
||||
- `2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed`
|
||||
|
||||
### [WARNING] connection_refused - Connection refused
|
||||
|
||||
- Category: `network`
|
||||
- Occurrences: `1`
|
||||
- First seen: `2026-05-11 10:16:11`
|
||||
- Last seen: `2026-05-11 10:16:11`
|
||||
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
|
||||
- Description: Client connection attempts were refused by the destination service or host.
|
||||
- Samples:
|
||||
- `2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443`
|
||||
|
||||
### [WARNING] timeout - Timeout
|
||||
|
||||
- Category: `network`
|
||||
- Occurrences: `1`
|
||||
- First seen: `2026-05-11 10:16:15,123`
|
||||
- Last seen: `2026-05-11 10:16:15,123`
|
||||
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
|
||||
- Description: Operation timed out and may require network, service, or dependency review.
|
||||
- Samples:
|
||||
- `2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms`
|
||||
|
||||
## Operational Summary
|
||||
|
||||
- Overall status: `CRITICAL`
|
||||
- Total lines scanned: `9`
|
||||
- Known error matches: `7`
|
||||
- Matched known error patterns: `7`
|
||||
- Critical matched patterns: `5`
|
||||
- Warning matched patterns: `2`
|
||||
- Top categories: application (3), network (2), application_jvm (2)
|
||||
- Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
|
||||
- Timestamp coverage: parsed=`9`, unknown=`0`
|
||||
- Filters used: severity=`None`, category=`None`
|
||||
- Pattern catalog path: `patterns.json`
|
||||
@@ -0,0 +1,10 @@
|
||||
May 11 10:15:30 web01 kernel: EXT4-fs warning: No space left on device while writing /var/log/messages
|
||||
May 11 10:15:35 web01 kernel: EXT4-fs error (device dm-0): Remounting filesystem read-only
|
||||
May 11 10:15:41 web01 kernel: nginx invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE)
|
||||
May 11 10:15:42 web01 kernel: Out of memory: Killed process 2281 (java) total-vm:2097152kB
|
||||
May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
|
||||
May 11 10:16:12 web01 systemd[1]: Dependency failed for webapp.service - Local web application.
|
||||
May 11 10:16:13 web01 systemd[1]: nginx.service: Start request repeated too quickly.
|
||||
May 11 10:16:25 web01 sshd[3371]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.50 user=deploy
|
||||
May 11 10:16:31 web01 sudo: deploy : command not allowed ; TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/bin/systemctl restart webapp
|
||||
May 11 10:16:32 web01 sudo: deploy : permission denied while opening /etc/sudoers.d/webapp
|
||||
@@ -0,0 +1,562 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Match local logs against a JSON catalog of known operational error patterns."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import argparse
|
||||
import json
|
||||
import re
|
||||
import sys
|
||||
from collections import Counter
|
||||
from datetime import datetime
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
|
||||
EXIT_OK = 0
|
||||
EXIT_FINDINGS = 1
|
||||
EXIT_INVALID = 2
|
||||
|
||||
UNKNOWN = "UNKNOWN"
|
||||
VALID_SEVERITIES = {"WARNING", "CRITICAL"}
|
||||
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
|
||||
|
||||
ISO_TIMESTAMP_RE = re.compile(
|
||||
r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
|
||||
)
|
||||
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
|
||||
|
||||
|
||||
def build_parser() -> argparse.ArgumentParser:
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Scan a local log file for known operational error patterns."
|
||||
)
|
||||
parser.add_argument("--file", required=True, help="Local log file to scan.")
|
||||
parser.add_argument("--patterns", required=True, help="JSON known error pattern catalog.")
|
||||
parser.add_argument(
|
||||
"--format",
|
||||
choices=("text", "markdown", "json"),
|
||||
default="text",
|
||||
help="Report format. Default: text.",
|
||||
)
|
||||
parser.add_argument("--output", help="Write report to this path instead of stdout.")
|
||||
parser.add_argument(
|
||||
"--severity",
|
||||
choices=("warning", "critical"),
|
||||
help="Only include findings with this severity.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--category",
|
||||
help="Only include findings from this exact category.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--top",
|
||||
type=positive_int,
|
||||
default=10,
|
||||
help="Number of matched known errors and summary entries to display. Default: 10.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--max-samples",
|
||||
type=non_negative_int,
|
||||
default=3,
|
||||
help="Maximum sample lines per matched known error. Default: 3.",
|
||||
)
|
||||
parser.add_argument(
|
||||
"--ignore-case",
|
||||
action="store_true",
|
||||
help="Compile catalog regex patterns case-insensitively.",
|
||||
)
|
||||
return parser
|
||||
|
||||
|
||||
def positive_int(value: str) -> int:
|
||||
try:
|
||||
number = int(value)
|
||||
except ValueError as exc:
|
||||
raise argparse.ArgumentTypeError("must be a positive integer") from exc
|
||||
if number <= 0:
|
||||
raise argparse.ArgumentTypeError("must be a positive integer")
|
||||
return number
|
||||
|
||||
|
||||
def non_negative_int(value: str) -> int:
|
||||
try:
|
||||
number = int(value)
|
||||
except ValueError as exc:
|
||||
raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
|
||||
if number < 0:
|
||||
raise argparse.ArgumentTypeError("must be zero or a positive integer")
|
||||
return number
|
||||
|
||||
|
||||
def read_text_file(path: Path, label: str) -> str:
|
||||
if not path.exists():
|
||||
raise OSError(f"{label} does not exist: {path}")
|
||||
if not path.is_file():
|
||||
raise OSError(f"{label} is not a regular file: {path}")
|
||||
try:
|
||||
text = path.read_text(encoding="utf-8", errors="replace")
|
||||
except PermissionError as exc:
|
||||
raise OSError(f"{label} is not readable: {path}") from exc
|
||||
except OSError as exc:
|
||||
raise OSError(f"unable to read {label} {path}: {exc}") from exc
|
||||
if text == "":
|
||||
raise ValueError(f"{label} is empty: {path}")
|
||||
return text
|
||||
|
||||
|
||||
def load_pattern_catalog(path: Path, ignore_case: bool) -> list[dict[str, Any]]:
|
||||
text = read_text_file(path, "pattern catalog")
|
||||
try:
|
||||
catalog = json.loads(text)
|
||||
except json.JSONDecodeError as exc:
|
||||
raise ValueError(f"invalid JSON in pattern catalog {path}: {exc}") from exc
|
||||
|
||||
errors: list[str] = []
|
||||
if not isinstance(catalog, dict):
|
||||
raise ValueError("invalid pattern catalog: top-level JSON value must be an object")
|
||||
if "patterns" not in catalog:
|
||||
raise ValueError('invalid pattern catalog: missing top-level "patterns" field')
|
||||
if not isinstance(catalog["patterns"], list):
|
||||
raise ValueError('invalid pattern catalog: "patterns" must be a list')
|
||||
|
||||
seen_ids: set[str] = set()
|
||||
compiled_patterns: list[dict[str, Any]] = []
|
||||
flags = re.IGNORECASE if ignore_case else 0
|
||||
|
||||
for index, item in enumerate(catalog["patterns"], start=1):
|
||||
if not isinstance(item, dict):
|
||||
errors.append(f"pattern #{index}: must be an object")
|
||||
continue
|
||||
|
||||
pattern_id = normalize_required_text(item, "id")
|
||||
name = normalize_required_text(item, "name")
|
||||
severity = normalize_required_text(item, "severity").upper()
|
||||
regex_text = normalize_required_text(item, "regex")
|
||||
|
||||
if not pattern_id:
|
||||
errors.append(f"pattern #{index}: id is required and must be non-empty")
|
||||
elif pattern_id in seen_ids:
|
||||
errors.append(f"pattern #{index}: duplicate id {pattern_id}")
|
||||
else:
|
||||
seen_ids.add(pattern_id)
|
||||
|
||||
if not name:
|
||||
errors.append(f"pattern {pattern_id or f'#{index}'}: name is required and must be non-empty")
|
||||
if severity not in VALID_SEVERITIES:
|
||||
errors.append(
|
||||
f"pattern {pattern_id or f'#{index}'}: severity must be WARNING or CRITICAL"
|
||||
)
|
||||
if not regex_text:
|
||||
errors.append(f"pattern {pattern_id or f'#{index}'}: regex is required and must be non-empty")
|
||||
|
||||
compiled_regex = None
|
||||
if regex_text:
|
||||
try:
|
||||
compiled_regex = re.compile(regex_text, flags)
|
||||
except re.error as exc:
|
||||
errors.append(f"pattern {pattern_id or f'#{index}'}: invalid regex: {exc}")
|
||||
|
||||
if pattern_id and name and severity in VALID_SEVERITIES and regex_text and compiled_regex:
|
||||
compiled_patterns.append(
|
||||
{
|
||||
"id": pattern_id,
|
||||
"name": name,
|
||||
"severity": severity,
|
||||
"regex_text": regex_text,
|
||||
"regex": compiled_regex,
|
||||
"category": normalize_optional_text(item, "category", UNKNOWN),
|
||||
"runbook": normalize_optional_text(item, "runbook", ""),
|
||||
"description": normalize_optional_text(item, "description", ""),
|
||||
}
|
||||
)
|
||||
|
||||
if errors:
|
||||
raise ValueError("invalid pattern catalog:\n- " + "\n- ".join(errors))
|
||||
if not compiled_patterns:
|
||||
raise ValueError("invalid pattern catalog: no patterns configured")
|
||||
return compiled_patterns
|
||||
|
||||
|
||||
def normalize_required_text(item: dict[str, Any], field: str) -> str:
|
||||
value = item.get(field)
|
||||
if not isinstance(value, str):
|
||||
return ""
|
||||
return value.strip()
|
||||
|
||||
|
||||
def normalize_optional_text(item: dict[str, Any], field: str, default: str) -> str:
|
||||
value = item.get(field, default)
|
||||
if not isinstance(value, str):
|
||||
return default
|
||||
value = value.strip()
|
||||
return value if value else default
|
||||
|
||||
|
||||
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
|
||||
iso_match = ISO_TIMESTAMP_RE.search(line)
|
||||
if iso_match:
|
||||
fraction = iso_match.group(3) or ""
|
||||
raw = f"{iso_match.group(1)} {iso_match.group(2)}"
|
||||
parse_value = raw
|
||||
fmt = "%Y-%m-%d %H:%M:%S"
|
||||
if fraction:
|
||||
parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
|
||||
fmt = "%Y-%m-%d %H:%M:%S.%f"
|
||||
try:
|
||||
return datetime.strptime(parse_value, fmt), raw + fraction
|
||||
except ValueError:
|
||||
return None, UNKNOWN
|
||||
|
||||
syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
|
||||
if syslog_match:
|
||||
raw = syslog_match.group(1)
|
||||
try:
|
||||
parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
|
||||
except ValueError:
|
||||
return None, UNKNOWN
|
||||
return parsed, raw
|
||||
|
||||
return None, UNKNOWN
|
||||
|
||||
|
||||
def severity_filter_matches(selected: str | None, severity: str) -> bool:
|
||||
if selected is None:
|
||||
return True
|
||||
return selected.upper() == severity
|
||||
|
||||
|
||||
def category_filter_matches(selected: str | None, category: str) -> bool:
|
||||
if selected is None:
|
||||
return True
|
||||
return selected == category
|
||||
|
||||
|
||||
def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
|
||||
if parsed_at is None:
|
||||
return
|
||||
if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
|
||||
group["first_seen"] = (parsed_at, rendered_at)
|
||||
if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
|
||||
group["last_seen"] = (parsed_at, rendered_at)
|
||||
|
||||
|
||||
def render_seen(value: tuple[datetime, str] | None) -> str:
|
||||
if value is None:
|
||||
return UNKNOWN
|
||||
return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
|
||||
|
||||
|
||||
def append_limited(items: list[str], value: str, limit: int) -> None:
|
||||
if limit == 0:
|
||||
return
|
||||
if value in items:
|
||||
return
|
||||
if len(items) < limit:
|
||||
items.append(value)
|
||||
|
||||
|
||||
def analyze_log(
|
||||
lines: list[str],
|
||||
patterns: list[dict[str, Any]],
|
||||
severity_filter: str | None,
|
||||
category_filter: str | None,
|
||||
top: int,
|
||||
max_samples: int,
|
||||
pattern_catalog_path: Path,
|
||||
) -> dict[str, Any]:
|
||||
syslog_year = datetime.now().year  # syslog timestamps carry no year; assume the current one
|
||||
groups: dict[str, dict[str, Any]] = {}
|
||||
top_categories = Counter()
|
||||
total_lines_scanned = 0
|
||||
parsed_timestamps = 0
|
||||
unknown_timestamps = 0
|
||||
|
||||
for line in lines:
|
||||
total_lines_scanned += 1
|
||||
parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
|
||||
if parsed_at is None:
|
||||
unknown_timestamps += 1
|
||||
else:
|
||||
parsed_timestamps += 1
|
||||
|
||||
for pattern in patterns:
|
||||
if not severity_filter_matches(severity_filter, pattern["severity"]):
|
||||
continue
|
||||
if not category_filter_matches(category_filter, pattern["category"]):
|
||||
continue
|
||||
if not pattern["regex"].search(line):
|
||||
continue
|
||||
|
||||
group = groups.setdefault(
|
||||
pattern["id"],
|
||||
{
|
||||
"id": pattern["id"],
|
||||
"name": pattern["name"],
|
||||
"severity": pattern["severity"],
|
||||
"category": pattern["category"],
|
||||
"runbook": pattern["runbook"],
|
||||
"description": pattern["description"],
|
||||
"regex": pattern["regex_text"],
|
||||
"occurrences": 0,
|
||||
"first_seen": None,
|
||||
"last_seen": None,
|
||||
"samples": [],
|
||||
},
|
||||
)
|
||||
group["occurrences"] += 1
|
||||
update_seen(group, parsed_at, rendered_at)
|
||||
append_limited(group["samples"], line, max_samples)
|
||||
top_categories[pattern["category"]] += 1
|
||||
|
||||
findings = sorted(
|
||||
groups.values(),
|
||||
key=lambda item: (
|
||||
SEVERITY_ORDER[item["severity"]],
|
||||
-item["occurrences"],
|
||||
item["id"],
|
||||
),
|
||||
)
|
||||
|
||||
rendered_findings = [
|
||||
{
|
||||
**finding,
|
||||
"first_seen": render_seen(finding["first_seen"]),
|
||||
"last_seen": render_seen(finding["last_seen"]),
|
||||
}
|
||||
for finding in findings
|
||||
]
|
||||
|
||||
critical_patterns = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
|
||||
warning_patterns = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
|
||||
total_matches = sum(item["occurrences"] for item in rendered_findings)
|
||||
|
||||
overall_status = "OK"
|
||||
if critical_patterns > 0:
|
||||
overall_status = "CRITICAL"
|
||||
elif warning_patterns > 0:
|
||||
overall_status = "WARNING"
|
||||
|
||||
return {
|
||||
"overall_status": overall_status,
|
||||
"total_lines_scanned": total_lines_scanned,
|
||||
"total_known_error_matches": total_matches,
|
||||
"matched_pattern_count": len(rendered_findings),
|
||||
"critical_matched_pattern_count": critical_patterns,
|
||||
"warning_matched_pattern_count": warning_patterns,
|
||||
"top_categories": [
|
||||
{"category": name, "count": count}
|
||||
for name, count in top_categories.most_common(top)
|
||||
],
|
||||
"top_known_errors": [
|
||||
{"id": item["id"], "name": item["name"], "severity": item["severity"], "count": item["occurrences"]}
|
||||
for item in rendered_findings[:top]
|
||||
],
|
||||
"timestamp_coverage": {
|
||||
"parsed_timestamps_count": parsed_timestamps,
|
||||
"unknown_timestamps_count": unknown_timestamps,
|
||||
},
|
||||
"filters_used": {
|
||||
"severity": severity_filter.lower() if severity_filter else None,
|
||||
"category": category_filter,
|
||||
},
|
||||
"pattern_catalog_path": str(pattern_catalog_path),
|
||||
"findings": rendered_findings[:top],
|
||||
"findings_total": len(rendered_findings),
|
||||
}
|
||||
|
||||
|
||||
def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
|
||||
if not items:
|
||||
return "None"
|
||||
return ", ".join(f"{item[key]} ({item['count']})" for item in items)
|
||||
|
||||
|
||||
def render_text(report: dict[str, Any]) -> str:
|
||||
lines = [
|
||||
"Known Error Matcher",
|
||||
"===================",
|
||||
"",
|
||||
f"Overall status: {report['overall_status']}",
|
||||
"Known error pattern matches require operator review; logs alone do not prove root cause.",
|
||||
"",
|
||||
]
|
||||
|
||||
if report["findings"]:
|
||||
for finding in report["findings"]:
|
||||
lines.extend(
|
||||
[
|
||||
f"[{finding['severity']}] {finding['id']} - {finding['name']}",
|
||||
f"Category: {finding['category']}",
|
||||
f"Occurrences: {finding['occurrences']}",
|
||||
f"First seen: {finding['first_seen']}",
|
||||
f"Last seen: {finding['last_seen']}",
|
||||
f"Runbook: {finding['runbook'] or 'None'}",
|
||||
f"Description: {finding['description'] or 'None'}",
|
||||
"Samples:",
|
||||
]
|
||||
)
|
||||
if finding["samples"]:
|
||||
for sample in finding["samples"]:
|
||||
lines.append(f" - {sample}")
|
||||
else:
|
||||
lines.append(" - None")
|
||||
lines.append("")
|
||||
else:
|
||||
lines.extend(["No known error patterns matched for the selected filters.", ""])
|
||||
|
||||
lines.extend(render_summary_lines(report, markdown=False))
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def render_summary_lines(report: dict[str, Any], markdown: bool) -> list[str]:
|
||||
if markdown:
|
||||
return [
|
||||
"## Operational Summary",
|
||||
"",
|
||||
f"- Overall status: `{report['overall_status']}`",
|
||||
f"- Total lines scanned: `{report['total_lines_scanned']}`",
|
||||
f"- Known error matches: `{report['total_known_error_matches']}`",
|
||||
f"- Matched known error patterns: `{report['matched_pattern_count']}`",
|
||||
f"- Critical matched patterns: `{report['critical_matched_pattern_count']}`",
|
||||
f"- Warning matched patterns: `{report['warning_matched_pattern_count']}`",
|
||||
"- Top categories: " + render_top_pairs(report["top_categories"], "category"),
|
||||
"- Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
|
||||
"- Timestamp coverage: "
|
||||
f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
|
||||
f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
|
||||
"- Filters used: "
|
||||
f"severity=`{report['filters_used']['severity'] or 'None'}`, "
|
||||
f"category=`{report['filters_used']['category'] or 'None'}`",
|
||||
f"- Pattern catalog path: `{report['pattern_catalog_path']}`",
|
||||
]
|
||||
return [
|
||||
"Operational Summary",
|
||||
"-------------------",
|
||||
f"Overall status: {report['overall_status']}",
|
||||
f"Total lines scanned: {report['total_lines_scanned']}",
|
||||
f"Known error matches: {report['total_known_error_matches']}",
|
||||
f"Matched known error patterns: {report['matched_pattern_count']}",
|
||||
f"Critical matched patterns: {report['critical_matched_pattern_count']}",
|
||||
f"Warning matched patterns: {report['warning_matched_pattern_count']}",
|
||||
"Top categories: " + render_top_pairs(report["top_categories"], "category"),
|
||||
"Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
|
||||
"Timestamp coverage: "
|
||||
f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
|
||||
f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
|
||||
"Filters used: "
|
||||
f"severity={report['filters_used']['severity'] or 'None'}, "
|
||||
f"category={report['filters_used']['category'] or 'None'}",
|
||||
f"Pattern catalog path: {report['pattern_catalog_path']}",
|
||||
]
|
||||
|
||||
|
||||
def render_markdown(report: dict[str, Any]) -> str:
|
||||
lines = [
|
||||
"# Known Error Matcher Report",
|
||||
"",
|
||||
f"- Overall status: `{report['overall_status']}`",
|
||||
"- Known error pattern matches require operator review; logs alone do not prove root cause.",
|
||||
"",
|
||||
]
|
||||
|
||||
if report["findings"]:
|
||||
lines.extend(["## Matched Known Errors", ""])
|
||||
for finding in report["findings"]:
|
||||
lines.extend(
|
||||
[
|
||||
f"### [{finding['severity']}] {finding['id']} - {finding['name']}",
|
||||
"",
|
||||
f"- Category: `{finding['category']}`",
|
||||
f"- Occurrences: `{finding['occurrences']}`",
|
||||
f"- First seen: `{finding['first_seen']}`",
|
||||
f"- Last seen: `{finding['last_seen']}`",
|
||||
f"- Runbook: `{finding['runbook'] or 'None'}`",
|
||||
f"- Description: {finding['description'] or 'None'}",
|
||||
"- Samples:",
|
||||
]
|
||||
)
|
||||
if finding["samples"]:
|
||||
for sample in finding["samples"]:
|
||||
lines.append(f" - `{sample}`")
|
||||
else:
|
||||
lines.append(" - `None`")
|
||||
lines.append("")
|
||||
else:
|
||||
lines.extend(["## Matched Known Errors", "", "No known error patterns matched for the selected filters.", ""])
|
||||
|
||||
lines.extend(render_summary_lines(report, markdown=True))
|
||||
return "\n".join(lines)
|
||||
|
||||
|
||||
def render_json(report: dict[str, Any]) -> str:
|
||||
return json.dumps(report, indent=2)
|
||||
|
||||
|
||||
def write_output(text: str, output_path: str | None, protected_inputs: list[Path]) -> None:
|
||||
if output_path is None:
|
||||
print(text)
|
||||
return
|
||||
|
||||
destination = Path(output_path)
|
||||
try:
|
||||
destination_resolved = destination.resolve()
|
||||
for input_path in protected_inputs:
|
||||
if input_path.resolve() == destination_resolved:
|
||||
raise OSError("output path must not overwrite an input file")
|
||||
except FileNotFoundError as exc:
|
||||
raise OSError(f"unable to resolve output path {destination}: {exc}") from exc
|
||||
|
||||
try:
|
||||
destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
|
||||
except OSError as exc:
|
||||
raise OSError(f"unable to write report to {destination}: {exc}") from exc
|
||||
|
||||
|
||||
def determine_exit_code(report: dict[str, Any]) -> int:
|
||||
if report["total_known_error_matches"] > 0:
|
||||
return EXIT_FINDINGS
|
||||
return EXIT_OK
|
||||
|
||||
|
||||
def main() -> int:
|
||||
parser = build_parser()
|
||||
args = parser.parse_args()
|
||||
|
||||
try:
|
||||
log_path = Path(args.file)
|
||||
pattern_path = Path(args.patterns)
|
||||
log_text = read_text_file(log_path, "log file")
|
||||
lines = log_text.splitlines()
|
||||
patterns = load_pattern_catalog(pattern_path, args.ignore_case)
|
||||
severity_filter = args.severity.upper() if args.severity else None
|
||||
|
||||
report = analyze_log(
|
||||
lines=lines,
|
||||
patterns=patterns,
|
||||
severity_filter=severity_filter,
|
||||
category_filter=args.category,
|
||||
top=args.top,
|
||||
max_samples=args.max_samples,
|
||||
pattern_catalog_path=pattern_path,
|
||||
)
|
||||
|
||||
if args.format == "text":
|
||||
rendered = render_text(report)
|
||||
elif args.format == "markdown":
|
||||
rendered = render_markdown(report)
|
||||
else:
|
||||
rendered = render_json(report)
|
||||
|
||||
write_output(rendered, args.output, [log_path, pattern_path])
|
||||
return determine_exit_code(report)
|
||||
except (OSError, ValueError) as exc:
|
||||
print(f"ERROR: {exc}", file=sys.stderr)
|
||||
return EXIT_INVALID
|
||||
except Exception as exc: # pragma: no cover - defensive operational fallback
|
||||
print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr)
|
||||
return EXIT_INVALID
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
sys.exit(main())
|
||||
@@ -0,0 +1,220 @@
|
||||
{
|
||||
"patterns": [
|
||||
{
|
||||
"id": "disk_full",
|
||||
"name": "Disk full",
|
||||
"severity": "CRITICAL",
|
||||
"regex": "No space left on device|disk full|filesystem full",
|
||||
"category": "storage",
|
||||
"runbook": "infra-run/scripts/bash/disk-full/README.md",
|
||||
"description": "Filesystem or application failed because free space was exhausted."
|
||||
},
|
||||
{
|
||||
"id": "inode_exhaustion",
|
||||
"name": "Inode exhaustion",
|
||||
"severity": "CRITICAL",
|
||||
"regex": "No space left on device.*inode|inode.*exhaust|free inodes.*0",
|
||||
"category": "storage",
|
||||
"runbook": "infra-run/scripts/bash/disk-full/README.md",
|
||||
"description": "Filesystem may have free blocks but too few available inodes."
|
||||
},
|
||||
{
|
||||
"id": "read_only_filesystem",
|
||||
"name": "Read-only filesystem",
|
||||
"severity": "CRITICAL",
|
||||
"regex": "read-only file system|read-only filesystem|Remounting filesystem read-only",
|
||||
"category": "storage",
|
||||
"runbook": "infra-run/runbooks/incidents/read-only-filesystem.md",
|
||||
"description": "Filesystem writes failed because the mount was read-only or remounted read-only."
|
||||
},
|
||||
{
|
||||
"id": "io_error",
|
||||
"name": "I/O error",
|
||||
"severity": "CRITICAL",
|
||||
"regex": "\\bI/O error\\b|Buffer I/O error|blk_update_request.*I/O error",
|
||||
"category": "storage",
|
||||
"runbook": "infra-run/runbooks/incidents/storage-io-error.md",
|
||||
"description": "Kernel or application reported storage I/O errors that require device and filesystem review."
|
||||
},
|
||||
{
|
||||
"id": "out_of_memory",
|
||||
"name": "Out of memory",
|
||||
"severity": "CRITICAL",
|
||||
"regex": "\\bout of memory\\b|Cannot allocate memory",
|
||||
"category": "memory",
|
||||
"runbook": "infra-run/runbooks/incidents/memory-pressure.md",
|
||||
"description": "Process or host reported memory exhaustion symptoms."
|
||||
},
|
||||
{
|
||||
"id": "oom_killer",
|
||||
"name": "OOM killer invoked",
|
||||
"severity": "CRITICAL",
|
||||
"regex": "oom-killer|Killed process \\d+|Out of memory: Killed process",
|
||||
"category": "memory",
|
||||
"runbook": "infra-run/runbooks/incidents/oom-killer.md",
|
||||
"description": "Kernel OOM killer activity was logged and affected processes should be reviewed."
|
||||
},
|
||||
{
|
||||
"id": "segmentation_fault",
|
||||
"name": "Segmentation fault",
|
||||
"severity": "CRITICAL",
|
||||
"regex": "segmentation fault|segfault",
|
||||
"category": "process",
|
||||
"runbook": "infra-run/runbooks/incidents/process-crash.md",
|
||||
"description": "A process crash pattern was logged."
|
||||
},
|
||||
{
|
||||
"id": "connection_refused",
|
||||
"name": "Connection refused",
|
||||
"severity": "WARNING",
|
||||
"regex": "connection refused|ConnectException: Connection refused",
|
||||
"category": "network",
|
||||
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
|
||||
"description": "Client connection attempts were refused by the destination service or host."
|
||||
},
|
||||
{
|
||||
"id": "connection_reset",
|
||||
"name": "Connection reset",
|
||||
"severity": "WARNING",
|
||||
"regex": "connection reset|Connection reset by peer",
|
||||
"category": "network",
|
||||
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
|
||||
"description": "Established network connections were reset and require endpoint review."
|
||||
},
|
||||
{
|
||||
"id": "timeout",
|
||||
"name": "Timeout",
|
||||
"severity": "WARNING",
|
||||
"regex": "\\btimeout\\b|timed out|TimeoutException|SocketTimeoutException",
|
||||
"category": "network",
|
||||
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
|
||||
"description": "Operation timed out and may require network, service, or dependency review."
|
||||
},
|
||||
{
|
||||
"id": "dns_resolution_failure",
|
||||
"name": "DNS resolution failure",
|
||||
"severity": "WARNING",
|
||||
"regex": "Temporary failure in name resolution|Name or service not known|NXDOMAIN|UnknownHostException|could not resolve host",
|
||||
"category": "network",
|
||||
"runbook": "infra-run/runbooks/incidents/dns-resolution.md",
|
||||
"description": "Name resolution failed for a host or service dependency."
|
||||
},
|
||||
{
|
||||
"id": "certificate_expired",
|
||||
"name": "Certificate expired",
|
||||
"severity": "CRITICAL",
|
||||
"regex": "certificate expired|CertificateExpiredException|certificate has expired|notAfter",
|
||||
"category": "tls",
|
||||
"runbook": "infra-run/runbooks/incidents/certificate-expired.md",
|
||||
"description": "TLS certificate expiry was logged and certificate state should be reviewed."
|
||||
},
|
||||
{
|
||||
"id": "tls_handshake_failed",
|
||||
"name": "TLS handshake failed",
|
||||
"severity": "WARNING",
|
||||
"regex": "TLS handshake failed|SSL handshake failed|handshake_failure",
|
||||
"category": "tls",
|
||||
"runbook": "infra-run/runbooks/incidents/tls-handshake.md",
|
||||
"description": "TLS handshake failed and may require certificate, protocol, or trust-store review."
|
||||
},
|
||||
{
|
||||
"id": "authentication_failure",
|
||||
"name": "Authentication failure",
|
||||
"severity": "WARNING",
|
||||
"regex": "authentication failure|Failed password|authentication failed",
|
||||
"category": "security",
|
||||
"runbook": "infra-run/scripts/python/auth-log-audit/README.md",
|
||||
"description": "Authentication failures were logged and may require access review."
|
||||
},
|
||||
{
|
||||
"id": "permission_denied",
|
||||
"name": "Permission denied",
|
||||
"severity": "WARNING",
|
||||
"regex": "permission denied|access denied|denied by policy",
|
||||
"category": "security",
|
||||
"runbook": "infra-run/runbooks/incidents/permission-denied.md",
|
||||
"description": "Access or permission denial was logged."
|
||||
},
|
||||
{
|
||||
"id": "invalid_user",
|
||||
"name": "Invalid user",
|
||||
"severity": "WARNING",
|
||||
"regex": "Invalid user|invalid user|user unknown|User not known",
|
||||
"category": "security",
|
||||
"runbook": "infra-run/scripts/python/auth-log-audit/README.md",
|
||||
"description": "Log contains attempts involving invalid or unknown users."
    },
    {
      "id": "java_out_of_memory",
      "name": "Java OutOfMemoryError",
      "severity": "CRITICAL",
      "regex": "OutOfMemoryError|Java heap space|GC overhead limit exceeded",
      "category": "application_jvm",
      "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
      "description": "Java process logged memory exhaustion symptoms."
    },
    {
      "id": "ssl_handshake_exception",
      "name": "SSLHandshakeException",
      "severity": "CRITICAL",
      "regex": "SSLHandshakeException|javax\\.net\\.ssl\\.SSLHandshakeException",
      "category": "application_jvm",
      "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
      "description": "Java TLS handshake exception was logged."
    },
    {
      "id": "database_unavailable",
      "name": "Database unavailable",
      "severity": "CRITICAL",
      "regex": "database unavailable|database is unavailable|SQLRecoverableException|CommunicationsException|connection pool exhausted",
      "category": "application",
      "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
      "description": "Application logged unavailable database or database connectivity symptoms."
    },
    {
      "id": "http_500",
      "name": "HTTP 500",
      "severity": "CRITICAL",
      "regex": "\\bHTTP\\s+500\\b|\\bstatus=500\\b|\\s500\\s",
      "category": "application",
      "runbook": "infra-run/runbooks/incidents/http-5xx.md",
      "description": "Application or proxy logged HTTP 500 responses."
    },
    {
      "id": "http_503",
      "name": "HTTP 503",
      "severity": "CRITICAL",
      "regex": "\\bHTTP\\s+503\\b|\\bstatus=503\\b|\\s503\\s|Service Unavailable",
      "category": "application",
      "runbook": "infra-run/runbooks/incidents/http-5xx.md",
      "description": "Application or proxy logged HTTP 503 service unavailable responses."
    },
    {
      "id": "service_failed",
      "name": "Systemd service failed",
      "severity": "CRITICAL",
      "regex": "Failed to start .*\\.service|entered failed state|Unit .*\\.service failed|Main process exited.*status=",
      "category": "systemd",
      "runbook": "infra-run/scripts/python/journal-analyzer/README.md",
      "description": "Systemd logged a failed service or failed service start."
    },
    {
      "id": "dependency_failed",
      "name": "Systemd dependency failed",
      "severity": "CRITICAL",
      "regex": "Dependency failed for|dependency failed",
      "category": "systemd",
      "runbook": "infra-run/scripts/python/journal-analyzer/README.md",
      "description": "Systemd logged a unit dependency failure."
    },
    {
      "id": "start_request_repeated",
      "name": "Start request repeated too quickly",
      "severity": "WARNING",
      "regex": "Start request repeated too quickly|start request repeated too quickly",
      "category": "systemd",
      "runbook": "infra-run/scripts/python/journal-analyzer/README.md",
      "description": "Systemd throttled service restarts after repeated start failures."
    }
  ]
}
@@ -0,0 +1,164 @@
# log-diff-checker

`log-diff-checker` is a read-only Python CLI for comparing configured operational log patterns before and after a change. It is intended to help an infrastructure engineer decide whether a patch, deployment, configuration change, or service restart introduced new log risk or reduced existing noise.

The tool compares local pre-change and post-change log extracts. It does not modify input logs or system state.

## When To Use

- After a planned change when pre-check and post-check log extracts are available.
- During change validation when the question is whether errors increased, disappeared, or stayed flat.
- Before attaching log evidence to a change, incident, or problem ticket.
- When predictable text, Markdown, or JSON output is useful for local review.

## What It Does

- Reads two local text log files supplied with `--before` and `--after`.
- Scans both files for configured critical and warning patterns.
- Counts the lines that match each pattern and compares the before and after counts.
- Classifies patterns as `NEW`, `INCREASED`, `DECREASED`, `RESOLVED`, or `UNCHANGED`.
- Sets an overall status: `CRITICAL` when a critical pattern is new or increased, `WARNING` when only warning patterns are new or increased, and `OK` otherwise.
- Includes sample log lines from the side that best explains the change: the after log for `NEW`, `INCREASED`, and `UNCHANGED` findings, the before log for `DECREASED` and `RESOLVED` findings.

## What It Does Not Do

- It does not read remote systems.
- It does not modify logs, services, or host state.
- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
- It does not prove root cause or change safety.
- It does not replace service-specific post-change checks.
- It does not classify every possible vendor or application error.

## Supported Input

- Two local text log files:
  - `--before` for the pre-change log extract.
  - `--after` for the post-change log extract.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
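
The replacement behavior for invalid bytes can be illustrated with a minimal sketch. It assumes decoding uses Python's `errors="replace"` handler, which is how the accompanying script reads its input; the helper name `decode_log_bytes` is hypothetical and not part of the tool.

```python
def decode_log_bytes(raw: bytes) -> list[str]:
    """Decode raw log bytes as UTF-8 and split into lines (hypothetical helper)."""
    # errors="replace" substitutes U+FFFD for undecodable bytes instead of raising,
    # so one corrupt byte does not stop the review of the rest of the file.
    return raw.decode("utf-8", errors="replace").splitlines()


# Hand-written sample: \xff is never valid in UTF-8.
sample = b"app01 ERROR bad byte \xff here\nsecond line\n"
for line in decode_log_bytes(sample):
    print(line)
```

A replaced byte appears as the Unicode replacement character in sample lines, so a report can still be attached to a ticket even when the extract was partially corrupted.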

## Supported Patterns

Critical patterns:

- `CRITICAL`
- `FATAL`
- `panic`
- `kernel panic`
- `no space left on device`
- `out of memory`
- `killed process`
- `read-only file system`
- `segmentation fault`
- `segfault`
- `certificate expired`
- `TLS handshake failed`
- `SSLHandshakeException`
- `database unavailable`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`

Warning patterns:

- `ERROR`
- `failed`
- `failure`
- `timeout`
- `connection refused`
- `connection reset`
- `permission denied`
- `authentication failed`
- `denied`
- `unavailable`
- `service restart`
- `retrying`

Patterns are matched as literal text anywhere in a line. By default, matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across all configured patterns.
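
The effect of `--ignore-case` can be sketched as follows. This is a minimal illustration that assumes patterns are escaped and compiled with `re`, as the accompanying script does; `count_matching_lines` and the two log lines are hypothetical.

```python
import re


def count_matching_lines(lines: list[str], pattern: str, ignore_case: bool = False) -> int:
    """Count lines containing the literal pattern (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape keeps the pattern literal, so characters such as "." are not regex syntax.
    regex = re.compile(re.escape(pattern), flags)
    return sum(1 for line in lines if regex.search(line))


lines = [
    "app01 ERROR failed to refresh cache",
    "app01 error Failed to refresh cache",
]
print(count_matching_lines(lines, "failed"))                    # case-sensitive
print(count_matching_lines(lines, "failed", ignore_case=True))  # case-insensitive
```

With default matching only the first line counts for `failed`; with case-insensitive matching both lines count.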

## Usage

```bash
cd infra-run/scripts/python/log-diff-checker

python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format markdown
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format markdown --output change-log-diff.md
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format json
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --ignore-case
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --top 20
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --max-samples 5
```

## Output Formats

- `text` - default terminal-oriented report.
- `markdown` - change or incident ticket attachment format.
- `json` - structured output for local automation.

Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to either input log file.
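
For local automation, the `json` format can be consumed with the standard library. The sketch below assumes the report is an object with `findings` and `summary` keys and per-finding `pattern`, `severity`, `status`, and `delta` fields, as the accompanying script emits; the `regressions` helper and the inline sample are illustrative, not real tool output.

```python
import json


def regressions(report_json: str) -> list[dict]:
    """Return findings whose status is NEW or INCREASED (illustrative helper)."""
    report = json.loads(report_json)
    return [f for f in report["findings"] if f["status"] in ("NEW", "INCREASED")]


# Hand-written sample shaped like the JSON report.
sample = json.dumps(
    {
        "findings": [
            {"pattern": "timeout", "severity": "WARNING", "status": "INCREASED", "delta": 1},
            {"pattern": "denied", "severity": "WARNING", "status": "RESOLVED", "delta": -1},
        ],
        "summary": {"overall_status": "WARNING"},
    }
)

for finding in regressions(sample):
    print(f"{finding['severity']} {finding['pattern']} {finding['delta']:+d}")
```

Filtering on `status` rather than on `delta` keeps the check aligned with how the tool decides its overall status.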

## Exit Codes

- `0` - OK, no new or increased findings.
- `1` - New or increased findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
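
A wrapper script can branch on these exit codes. The following is a minimal sketch: the labels in `EXIT_MEANINGS` and the helpers `describe_exit_code` and `run_checker` are hypothetical names, and `run_checker` assumes `log_diff_checker.py` is in the current directory.

```python
import subprocess
import sys

# Illustrative labels for the documented exit codes.
EXIT_MEANINGS = {0: "OK", 1: "FINDINGS", 2: "INVALID"}


def describe_exit_code(code: int) -> str:
    """Map a documented exit code to a short label; anything else is UNKNOWN."""
    return EXIT_MEANINGS.get(code, "UNKNOWN")


def run_checker(before: str, after: str) -> str:
    """Run the CLI on two caller-supplied log paths and label the result."""
    result = subprocess.run(
        [sys.executable, "log_diff_checker.py", "--before", before, "--after", after],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means findings, not a crash
    )
    return describe_exit_code(result.returncode)
```

Passing `check=False` matters here because exit code `1` is an expected review signal rather than an execution failure.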

## Example Text Output

The excerpt below shows the first finding and the operational summary for the bundled example logs.

```text
Log Diff Checker
================

[CRITICAL] CRITICAL - NEW
Before count: 0
After count: 1
Delta: +1
Sample source: after
Samples:
 - 2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection

Operational Summary
-------------------
Total lines scanned before: 7
Total lines scanned after: 8
Total unique patterns compared: 9
New findings count: 3
Increased findings count: 3
Decreased findings count: 0
Resolved findings count: 2
Unchanged findings count: 1
Overall status: CRITICAL
```

## Markdown Workflow

Generate a Markdown report from collected pre-change and post-change logs, review it, and attach it to the change ticket as supporting evidence:

```bash
python3 log_diff_checker.py \
  --before examples/pre-change.log \
  --after examples/post-change.log \
  --format markdown \
  --output change-log-diff.md
```

Use the report as a log perspective on the change. A `CRITICAL` or `WARNING` result should be reviewed with service health checks, monitoring, rollback criteria, and the relevant application owner.

## Operational Limitations

- Pattern matching is intentionally simple and predictable.
- A single line can match multiple patterns, such as `CRITICAL`, `database unavailable`, and `unavailable`, and is counted once for each.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
- The tool compares counts of matching lines, not rates, time windows, or request volume.
- Large log files are read into memory; collect scoped extracts for very large incidents.
- `--top` limits displayed findings only. The operational summary still reflects all compared patterns.
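
The overlap between patterns can be sketched with a short example. It assumes literal, per-line matching as in the accompanying script; `PATTERNS` is only a subset of the documented list and `matching_patterns` is a hypothetical helper.

```python
import re

# Subset of the documented patterns, chosen because they overlap.
PATTERNS = ["CRITICAL", "database unavailable", "unavailable"]


def matching_patterns(line: str) -> list[str]:
    """Return every configured pattern found in the line (hypothetical helper)."""
    return [p for p in PATTERNS if re.search(re.escape(p), line)]


line = "app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection"
print(matching_patterns(line))

# A pattern that occurs twice on one line still contributes a single matching line.
repeated = ["timeout after timeout contacting cache01"]
timeout_lines = sum(1 for entry in repeated if re.search(re.escape("timeout"), entry))
print(timeout_lines)
```

One log line can therefore raise several findings at once, so the number of findings is not the number of distinct events.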

## Safety Notes

- The tool only reads the input logs and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
@@ -0,0 +1,8 @@
2026-05-11 10:10:01 app01 systemd[1]: Started inventory-api.service after package update.
2026-05-11 10:10:15 app01 inventory-api[2294]: INFO readiness check passed
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
2026-05-11 10:15:00 app01 inventory-api[2294]: INFO background reconciliation completed
@@ -0,0 +1,7 @@
2026-05-11 09:55:01 app01 systemd[1]: Started inventory-api.service.
2026-05-11 09:56:12 app01 inventory-api[1842]: INFO readiness check passed
2026-05-11 09:57:20 app01 inventory-api[1842]: WARNING timeout contacting cache01, retrying
2026-05-11 09:58:04 app01 inventory-api[1842]: ERROR failed to refresh optional pricing cache
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
2026-05-11 10:00:00 app01 systemd[1]: Stopping inventory-api.service for planned restart.
2026-05-11 10:00:03 app01 systemd[1]: Started inventory-api.service.
@@ -0,0 +1,134 @@
# Log Diff Checker

## CRITICAL: CRITICAL (NEW)

- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after

Sample log lines:

```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```

## CRITICAL: database unavailable (NEW)

- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after

Sample log lines:

```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```

## WARNING: unavailable (NEW)

- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after

Sample log lines:

```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```

## WARNING: failed (INCREASED)

- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after

Sample log lines:

```text
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
```

## WARNING: retrying (INCREASED)

- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after

Sample log lines:

```text
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
```

## WARNING: timeout (INCREASED)

- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after

Sample log lines:

```text
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
```

## WARNING: denied (RESOLVED)

- Before count: 1
- After count: 0
- Delta: -1
- Sample source: before

Sample log lines:

```text
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
```

## WARNING: permission denied (RESOLVED)

- Before count: 1
- After count: 0
- Delta: -1
- Sample source: before

Sample log lines:

```text
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
```

## WARNING: ERROR (UNCHANGED)

- Before count: 2
- After count: 2
- Delta: +0
- Sample source: after

Sample log lines:

```text
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
```

## Operational Summary

- Total lines scanned before: 7
- Total lines scanned after: 8
- Total unique patterns compared: 9
- New findings count: 3
- Increased findings count: 3
- Decreased findings count: 0
- Resolved findings count: 2
- Unchanged findings count: 1
- Overall status: CRITICAL
@@ -0,0 +1,462 @@
#!/usr/bin/env python3
"""Compare incident-oriented log patterns before and after a change."""

from __future__ import annotations

import argparse
import json
import re
import sys
from pathlib import Path
from typing import Any


EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2

STATUS_ORDER = {
    "NEW": 0,
    "INCREASED": 1,
    "DECREASED": 2,
    "RESOLVED": 3,
    "UNCHANGED": 4,
}
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}

CRITICAL_PATTERNS = [
    "CRITICAL",
    "FATAL",
    "panic",
    "kernel panic",
    "no space left on device",
    "out of memory",
    "killed process",
    "read-only file system",
    "segmentation fault",
    "segfault",
    "certificate expired",
    "TLS handshake failed",
    "SSLHandshakeException",
    "database unavailable",
    "HTTP 500",
    "HTTP 502",
    "HTTP 503",
    "HTTP 504",
]

WARNING_PATTERNS = [
    "ERROR",
    "failed",
    "failure",
    "timeout",
    "connection refused",
    "connection reset",
    "permission denied",
    "authentication failed",
    "denied",
    "unavailable",
    "service restart",
    "retrying",
]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Compare configured operational log patterns before and after a change."
    )
    parser.add_argument("--before", required=True, help="Pre-change local log file.")
    parser.add_argument("--after", required=True, help="Post-change local log file.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        help="Limit displayed findings after operational importance sorting.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match all configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding. Default: 3.",
    )
    return parser


def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    flags = re.IGNORECASE if ignore_case else 0
    pattern_defs: list[dict[str, str]] = []
    pattern_defs.extend(
        {"pattern": pattern, "severity": "CRITICAL"} for pattern in CRITICAL_PATTERNS
    )
    pattern_defs.extend(
        {"pattern": pattern, "severity": "WARNING"} for pattern in WARNING_PATTERNS
    )

    compiled = []
    for item in pattern_defs:
        compiled.append(
            {
                "pattern": item["pattern"],
                "severity": item["severity"],
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    return compiled


def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()


def scan_log(
    lines: list[str], patterns: list[dict[str, Any]], max_samples: int
) -> dict[str, dict[str, Any]]:
    groups: dict[str, dict[str, Any]] = {}

    for line in lines:
        for item in patterns:
            if not item["regex"].search(line):
                continue

            key = f"{item['severity']}::{item['pattern']}"
            group = groups.setdefault(
                key,
                {
                    "pattern": item["pattern"],
                    "severity": item["severity"],
                    "count": 0,
                    "samples": [],
                },
            )
            group["count"] += 1
            if len(group["samples"]) < max_samples:
                group["samples"].append(line)

    return groups


def classify_status(before_count: int, after_count: int) -> str:
    if before_count == 0 and after_count > 0:
        return "NEW"
    if before_count > 0 and after_count == 0:
        return "RESOLVED"
    if after_count > before_count:
        return "INCREASED"
    if after_count < before_count:
        return "DECREASED"
    return "UNCHANGED"


def sample_source_for(status: str) -> str:
    if status in ("NEW", "INCREASED"):
        return "after"
    if status in ("DECREASED", "RESOLVED"):
        return "before"
    return "after"


def compare_logs(
    before_lines: list[str],
    after_lines: list[str],
    patterns: list[dict[str, Any]],
    max_samples: int,
    top: int | None,
) -> dict[str, Any]:
    before_groups = scan_log(before_lines, patterns, max_samples)
    after_groups = scan_log(after_lines, patterns, max_samples)
    compared_keys = sorted(set(before_groups) | set(after_groups))

    findings = []
    for key in compared_keys:
        before_group = before_groups.get(key)
        after_group = after_groups.get(key)
        reference = before_group or after_group
        if reference is None:
            continue

        before_count = before_group["count"] if before_group is not None else 0
        after_count = after_group["count"] if after_group is not None else 0
        status = classify_status(before_count, after_count)
        source = sample_source_for(status)
        sample_group = after_group if source == "after" else before_group

        findings.append(
            {
                "pattern": reference["pattern"],
                "severity": reference["severity"],
                "before_count": before_count,
                "after_count": after_count,
                "delta": after_count - before_count,
                "status": status,
                "sample_source": source,
                "samples": sample_group["samples"] if sample_group is not None else [],
            }
        )

    sorted_findings = sorted(findings, key=finding_sort_key)
    summary = build_summary(
        before_lines=before_lines,
        after_lines=after_lines,
        findings=sorted_findings,
    )

    displayed_findings = sorted_findings if top is None else sorted_findings[:top]
    return {
        "findings": displayed_findings,
        "summary": summary,
    }


def finding_sort_key(finding: dict[str, Any]) -> tuple[int, int, int, int, str]:
    return (
        STATUS_ORDER[finding["status"]],
        SEVERITY_ORDER[finding["severity"]],
        -abs(finding["delta"]),
        -finding["after_count"],
        finding["pattern"].lower(),
    )


def build_summary(
    before_lines: list[str], after_lines: list[str], findings: list[dict[str, Any]]
) -> dict[str, Any]:
    status_counts = {
        "NEW": 0,
        "INCREASED": 0,
        "DECREASED": 0,
        "RESOLVED": 0,
        "UNCHANGED": 0,
    }
    for finding in findings:
        status_counts[finding["status"]] += 1

    critical_regressions = any(
        finding["severity"] == "CRITICAL"
        and finding["status"] in ("NEW", "INCREASED")
        for finding in findings
    )
    warning_regressions = any(
        finding["severity"] == "WARNING"
        and finding["status"] in ("NEW", "INCREASED")
        for finding in findings
    )

    if critical_regressions:
        overall_status = "CRITICAL"
    elif warning_regressions:
        overall_status = "WARNING"
    else:
        overall_status = "OK"

    return {
        "total_lines_scanned_before": len(before_lines),
        "total_lines_scanned_after": len(after_lines),
        "total_unique_patterns_compared": len(findings),
        "new_findings_count": status_counts["NEW"],
        "increased_findings_count": status_counts["INCREASED"],
        "decreased_findings_count": status_counts["DECREASED"],
        "resolved_findings_count": status_counts["RESOLVED"],
        "unchanged_findings_count": status_counts["UNCHANGED"],
        "overall_status": overall_status,
    }


def render_text(report: dict[str, Any]) -> str:
    lines = ["Log Diff Checker", "================", ""]
    if not report["findings"]:
        lines.append("No configured operational patterns were detected in either log.")
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['pattern']} - {finding['status']}",
                    f"Before count: {finding['before_count']}",
                    f"After count: {finding['after_count']}",
                    f"Delta: {finding['delta']:+d}",
                    f"Sample source: {finding['sample_source']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f" - {sample}" for sample in finding["samples"])
            else:
                lines.append(" - No samples retained")
            lines.append("")

    lines.extend(render_text_summary(report["summary"]))
    return "\n".join(lines) + "\n"


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    return [
        "Operational Summary",
        "-------------------",
        f"Total lines scanned before: {summary['total_lines_scanned_before']}",
        f"Total lines scanned after: {summary['total_lines_scanned_after']}",
        f"Total unique patterns compared: {summary['total_unique_patterns_compared']}",
        f"New findings count: {summary['new_findings_count']}",
        f"Increased findings count: {summary['increased_findings_count']}",
        f"Decreased findings count: {summary['decreased_findings_count']}",
        f"Resolved findings count: {summary['resolved_findings_count']}",
        f"Unchanged findings count: {summary['unchanged_findings_count']}",
        f"Overall status: {summary['overall_status']}",
    ]


def render_markdown(report: dict[str, Any]) -> str:
    lines = ["# Log Diff Checker", ""]
    if not report["findings"]:
        lines.extend(["No configured operational patterns were detected in either log.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"## {finding['severity']}: {finding['pattern']} ({finding['status']})",
                    "",
                    f"- Before count: {finding['before_count']}",
                    f"- After count: {finding['after_count']}",
                    f"- Delta: {finding['delta']:+d}",
                    f"- Sample source: {finding['sample_source']}",
                    "",
                    "Sample log lines:",
                    "",
                ]
            )
            if finding["samples"]:
                lines.append("```text")
                lines.extend(finding["samples"])
                lines.append("```")
            else:
                lines.append("_No samples retained._")
            lines.append("")

    summary = report["summary"]
    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Total lines scanned before: {summary['total_lines_scanned_before']}",
            f"- Total lines scanned after: {summary['total_lines_scanned_after']}",
            f"- Total unique patterns compared: {summary['total_unique_patterns_compared']}",
            f"- New findings count: {summary['new_findings_count']}",
            f"- Increased findings count: {summary['increased_findings_count']}",
            f"- Decreased findings count: {summary['decreased_findings_count']}",
            f"- Resolved findings count: {summary['resolved_findings_count']}",
            f"- Unchanged findings count: {summary['unchanged_findings_count']}",
            f"- Overall status: {summary['overall_status']}",
            "",
        ]
    )
    return "\n".join(lines)


def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"


def write_report(
    output_path: str | None, content: str, input_paths: tuple[Path, Path]
) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return

    path = Path(output_path)
    try:
        output_resolved = path.resolve()
        input_resolved = {input_path.resolve() for input_path in input_paths}
    except OSError as exc:
        raise OSError(f"unable to validate output path {path}: {exc}") from exc

    if output_resolved in input_resolved:
        raise OSError("output path must not overwrite an input log file")

    try:
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()

    before_path = Path(args.before)
    after_path = Path(args.after)

    try:
        before_lines = read_log_file(before_path)
        after_lines = read_log_file(after_path)
        report = compare_logs(
            before_lines=before_lines,
            after_lines=after_lines,
            patterns=compile_patterns(args.ignore_case),
            max_samples=args.max_samples,
            top=args.top,
        )

        if args.format == "text":
            content = render_text(report)
        elif args.format == "markdown":
            content = render_markdown(report)
        else:
            content = render_json(report)

        write_report(args.output, content, (before_path, after_path))
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID

    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
@@ -10,6 +10,11 @@ Current subdirectories are planning areas unless their own README documents a ru
- `ci-cd`
- `docker`

## Linux operations labs

- [Linux Fresh Setup Toolkit](./linux/setup/) - Bootstrap automation for fresh Ubuntu lab hosts, including shell profile, Cockpit, Docker, libvirt/KVM, NVIDIA diagnostics, tuning and safe baseline defaults.
- [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety and systemd timers.

Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready.

Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/).

@@ -0,0 +1,308 @@
# AI Lab Maintenance Toolkit

## Executive summary

The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
configuration backup, and non-destructive VM inventory into a small toolkit
that is readable enough for review and guarded enough for homelab use.

This is a portfolio and lab implementation, not evidence of production
certification. Review package policy, backup coverage, maintenance windows, and
application impact before deploying it to another host.

## Problem solved

AI lab hosts accumulate operating system packages, kernel packages, container
images, build cache, journals, and configuration changes while also carrying
stateful workloads. Manual maintenance is easy to defer and risky to perform
without evidence. This project provides scheduled, logged tasks with explicit
safety boundaries and separate read-only audit commands.

## What this demonstrates

- Bash strict mode, input validation, dependency checks, and operational exit
  codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.

## Architecture and directory layout

```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
```

The installer deploys scripts to `/usr/local/sbin` and units to
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
through an additional framework.

## Maintenance tasks

| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |

## Safety model

Change-capable scripts default to dry-run behavior. Manual execution requires
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
use `--execute --non-interactive`; installing and enabling those reviewed unit
files is the explicit authorization for scheduled maintenance.

Exit codes follow the repository convention:

- `0`: completed successfully or an optional component was absent.
- `1`: an operational check or maintenance action failed.
- `2`: invalid input, missing required dependency, or insufficient privilege.

The scripts do not bypass APT or Docker locks, delete VM resources, manually
select kernel names for removal, or hide command failures.

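The flag handling described above can be reduced to a small decision function. The following is a minimal sketch, not code from the toolkit: `resolve_mode` and the mode names `dry-run`, `confirm`, and `scheduled` are hypothetical, chosen only to show how the two flags map to three behaviors and to exit code `2`.

```bash
set -eu

# resolve_mode is a hypothetical helper, not part of the toolkit.
# It prints the mode implied by the flags and returns 2 on invalid input.
resolve_mode() {
  local execute=false non_interactive=false arg
  for arg in "$@"; do
    case "$arg" in
      --execute) execute=true ;;
      --non-interactive) non_interactive=true ;;
      *) printf 'CRITICAL: unknown argument: %s\n' "$arg" >&2; return 2 ;;
    esac
  done

  # --non-interactive on its own is rejected rather than silently ignored.
  if [ "$non_interactive" = true ] && [ "$execute" != true ]; then
    printf 'CRITICAL: --non-interactive requires --execute\n' >&2
    return 2
  fi

  if [ "$execute" != true ]; then
    printf 'dry-run\n'      # default: report only
  elif [ "$non_interactive" = true ]; then
    printf 'scheduled\n'    # systemd path: no prompt
  else
    printf 'confirm\n'      # manual path: operator must type EXECUTE
  fi
}

resolve_mode                              # prints: dry-run
resolve_mode --execute                    # prints: confirm
resolve_mode --execute --non-interactive  # prints: scheduled
```

Keeping the decision in one place makes it easy to check that no flag combination reaches a change without either a prompt or an explicit `--non-interactive`.
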
## Installation

Review every script and unit first. Installation changes package state,
journald settings, Docker daemon settings when Docker exists, and enabled timer
state.

```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```

The installer:

1. Installs the documented Ubuntu utilities.
2. Deploys scripts and systemd units with fixed permissions.
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
4. Restarts `systemd-journald`.
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
   with `jq`, and attempts a Docker restart.
6. Enables all five timers.
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.

The installer is intended for Ubuntu 26.04. It is not run automatically by
repository validation.

## Manual commands

Read-only reports:

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```

Preview maintenance:

```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```

Apply reviewed maintenance interactively:

```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```

`--non-interactive` is reserved for reviewed automation and is rejected unless
`--execute` is also present.

## Systemd timers

| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |

All timers use `Persistent=true`, so a missed event runs after the host becomes
available. Inspect timer and service evidence with:

```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```

## Logs

Scheduled and manual maintenance writes to:

```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```

systemd also records service output in the journal. Logrotate is installed as a
dependency, but this lab does not create a custom rotation policy for these
small maintenance logs.

## Docker policy

Docker cleanup runs `docker system prune -af` and removes build cache older
than seven days. It never passes `--volumes`. Named and anonymous volumes
remain outside this automated policy and require application-aware review.

The installer configures the `json-file` driver with a maximum size of `50m`
and five files. Existing valid JSON is backed up and merged. Invalid JSON
causes installation to stop rather than overwrite operator configuration.

## Kernel policy

Kernel removal is delegated to `apt autoremove --purge`; package names are not
constructed or purged with regular expressions. Before execution, the script
logs the APT simulation and refuses cleanup unless at least two installed
versioned kernel image packages remain after simulated removals.

This protects a fallback kernel while preserving Ubuntu dependency policy.
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
Secure Boot state, and the simulated removal set before manual execution.

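The "at least two images remain" guard can be illustrated on canned text, without touching APT. This is a sketch under stated assumptions: `count_remaining_images` is a hypothetical helper, and the package names and versions in the sample data are made up. It accepts both `Remv` and `Purg` action labels, because an APT simulation run with `--purge` labels purge actions `Purg`.

```bash
set -eu

# count_remaining_images is a hypothetical helper for illustration.
# $1: newline-separated installed kernel image package names
# $2: text of an APT simulation (for example from `apt -s autoremove --purge`)
count_remaining_images() {
  local installed="$1" simulation="$2" removed image remaining=0

  # Versioned kernel images that the simulation would remove or purge.
  removed="$(printf '%s\n' "$simulation" \
    | awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' \
    | sort -u)"

  # Count installed images that do not appear in the removal set.
  for image in $installed; do
    if ! printf '%s\n' "$removed" | grep -Fxq -- "$image"; then
      remaining=$((remaining + 1))
    fi
  done
  printf '%d\n' "$remaining"
}

# Made-up sample data: three installed images, one scheduled for purge.
installed='linux-image-6.8.0-40-generic
linux-image-6.8.0-41-generic
linux-image-6.8.0-45-generic'
simulation='Purg linux-image-6.8.0-40-generic [6.8.0-40.40]
Purg linux-headers-6.8.0-40-generic [6.8.0-40.40]'

count_remaining_images "$installed" "$simulation"   # prints: 2
```

A caller would refuse to proceed when the printed count is below two, which is the policy described above.
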
## Backup policy

Backups are written to `/srv/backups/ailab-config` as
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
deleted only after a new archive is created.

The backup covers `/etc`, selected root shell configuration,
`/opt/ailab-maintenance` when present, and libvirt configuration under
`/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data,
Ollama models, VM disk images, or other large application datasets. Because
`/etc` is included, explicitly listed configuration subdirectories are already
covered even when optional-path reporting mentions them separately.

This is a local configuration backup, not a disaster-recovery design. A real
deployment should copy archives to independently protected storage and test
restoration.

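The retention rule ("delete old archives only after a new one exists") can be tried safely in a temporary directory. The sketch below is illustrative only: `prune_old_archives` is a hypothetical helper, the explicit non-empty check on the new archive is an assumption added for clarity, and `touch -d '45 days ago'` relies on GNU coreutils.

```bash
set -eu

# prune_old_archives is a hypothetical helper: delete matching archives older
# than $2 days from directory $1, but only when the new archive $3 exists
# and is not empty.
prune_old_archives() {
  local dir="$1" days="$2" new_archive="$3"
  if [ ! -s "$new_archive" ]; then
    printf 'CRITICAL: new archive missing or empty; retention skipped\n' >&2
    return 1
  fi
  find "$dir" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' \
    -mtime "+$days" -print -delete
}

# Demonstration in a throwaway directory; file names and ages are made up.
demo_dir="$(mktemp -d)"
trap 'rm -rf "$demo_dir"' EXIT

old_archive="$demo_dir/ailab-config-20250101-033000.tar.gz"
new_archive="$demo_dir/ailab-config-20250215-033000.tar.gz"
printf 'old' >"$old_archive"
touch -d '45 days ago' "$old_archive"   # GNU touch: backdate the file
printf 'new' >"$new_archive"

# Prints the path of the backdated archive and deletes it.
prune_old_archives "$demo_dir" 30 "$new_archive"
```

Restricting `find` with `-maxdepth 1`, `-type f`, and a fixed name pattern keeps the deletion scoped to the toolkit's own archives.
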
## Journald policy

The installer applies:

```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```

These settings bound journal growth while retaining useful troubleshooting
evidence. Capacity and retention should be adjusted to the host's disk size
and incident-response requirements.

## Disk watch policy

The disk check uses `df -P` with `tmpfs` and `devtmpfs` excluded, defaults to
an 85 percent threshold, and returns `1` when any checked filesystem meets or
exceeds the threshold. Override the threshold for a manual invocation by
setting the variable on the command line:

```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```

For the scheduled unit, set the same variable with an `Environment=` line in a
service drop-in. The threshold must be an integer from 1 to 100; any other
value is rejected with exit code `2`.

The script reports every checked filesystem as `OK` or `WARNING`; it does not
delete data or attempt remediation.

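The threshold comparison can be exercised on canned rows instead of live `df` output. This is a minimal sketch, not the toolkit's script: `classify_usage` is a hypothetical helper and the device names and sizes are made-up sample data in `df -P` column order.

```bash
set -eu

# classify_usage is a hypothetical helper. It reads `df -P`-style rows
# (header already removed) on stdin and returns 1 if any row meets or
# exceeds the threshold given as $1, or cannot be parsed.
classify_usage() {
  local threshold="$1" status=0
  local filesystem blocks used available use_percent mountpoint usage
  while read -r filesystem blocks used available use_percent mountpoint; do
    usage="${use_percent%\%}"   # strip the trailing percent sign
    case "$usage" in
      ''|*[!0-9]*)
        printf 'WARNING: unable to parse usage for %s\n' "$filesystem"
        status=1
        continue
        ;;
    esac
    if [ "$usage" -ge "$threshold" ]; then
      printf 'WARNING: %s is %s used\n' "$mountpoint" "$use_percent"
      status=1
    else
      printf 'OK: %s is %s used\n' "$mountpoint" "$use_percent"
    fi
  done
  return "$status"
}

# Made-up sample rows: filesystem, blocks, used, available, use%, mountpoint.
sample='/dev/sda2 102400 92160 10240 90% /
/dev/sdb1 204800 20480 184320 10% /srv'

printf '%s\n' "$sample" | classify_usage 85 || printf 'exit status: %s\n' "$?"
```

Because the function only reads stdin, the same logic can be checked against saved `df -P` output from another host.
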
## Example operational workflows

### Weekly maintenance review

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```

Review the kernel simulation, Docker usage, failed units, backup freshness, and
disk warnings before approving manual changes.

### Disk pressure investigation

```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```

Use the evidence to identify ownership. Do not treat Docker pruning or file
deletion as a substitute for application-specific retention policy.

### Post-maintenance evidence

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
  | sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```

## Interview talking points

- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application
  data lifecycle decisions.
- How optional dependency handling keeps one health command useful across
  container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.

## Future improvements

- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on
  timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
  commands.
- Add package-lock detection with bounded retry policy if recurring contention
  is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
  dedicated read-only audit.

Executable
@@ -0,0 +1,103 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf"
|
||||
DOCKER_CONFIG="/etc/docker/daemon.json"
|
||||
packages=(
|
||||
logrotate
|
||||
needrestart
|
||||
smartmontools
|
||||
nvme-cli
|
||||
sysstat
|
||||
iotop
|
||||
ncdu
|
||||
duf
|
||||
jq
|
||||
lsof
|
||||
psmisc
|
||||
tar
|
||||
gzip
|
||||
)
|
||||
timers=(
|
||||
ailab-apt-cleanup.timer
|
||||
ailab-kernel-cleanup.timer
|
||||
ailab-docker-cleanup.timer
|
||||
ailab-config-backup.timer
|
||||
ailab-disk-watch.timer
|
||||
)
|
||||
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: install.sh must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
for command_name in apt-get install systemctl; do
|
||||
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
|
||||
printf 'Installing maintenance dependencies...\n'
|
||||
apt-get update
|
||||
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
|
||||
|
||||
printf 'Installing scripts and systemd units...\n'
|
||||
for script in "$SCRIPT_DIR"/scripts/*.sh; do
|
||||
install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")"
|
||||
done
|
||||
for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do
|
||||
install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")"
|
||||
done
|
||||
|
||||
install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")"
|
||||
tmp_journald="$(mktemp)"
|
||||
trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT
|
||||
cat >"$tmp_journald" <<'EOF'
|
||||
[Journal]
|
||||
SystemMaxUse=1G
|
||||
SystemKeepFree=2G
|
||||
MaxRetentionSec=14day
|
||||
Compress=yes
|
||||
EOF
|
||||
install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN"
|
||||
systemctl restart systemd-journald
|
||||
|
||||
if command -v docker >/dev/null 2>&1; then
|
||||
printf 'Configuring Docker log rotation limits...\n'
|
||||
install -d -m 0755 /etc/docker
|
||||
tmp_docker="$(mktemp)"
|
||||
|
||||
if [[ -f "$DOCKER_CONFIG" ]]; then
|
||||
if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: %s is not valid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
|
||||
exit 1
|
||||
fi
|
||||
backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak"
|
||||
install -m 0644 "$DOCKER_CONFIG" "$backup"
|
||||
jq '. + {
|
||||
"log-driver": "json-file",
|
||||
"log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
|
||||
}' "$DOCKER_CONFIG" >"$tmp_docker"
|
||||
else
|
||||
jq -n '{
|
||||
"log-driver": "json-file",
|
||||
"log-opts": {"max-size": "50m", "max-file": "5"}
|
||||
}' >"$tmp_docker"
|
||||
fi
|
||||
|
||||
jq empty "$tmp_docker"
|
||||
install -m 0644 "$tmp_docker" "$DOCKER_CONFIG"
|
||||
systemctl restart docker || true
|
||||
else
|
||||
printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n'
|
||||
fi
|
||||
|
||||
systemctl daemon-reload
|
||||
systemctl enable --now "${timers[@]}"
|
||||
|
||||
printf '\nEnabled AI Lab timers:\n'
|
||||
systemctl list-timers --all --no-pager | grep 'ailab-' || true
|
||||
|
||||
/usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt
|
||||
printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n'
|
||||
@@ -0,0 +1,66 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
LOG_FILE="/var/log/ailab-apt-cleanup.log"
|
||||
execute=false
|
||||
non_interactive=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--execute) execute=true ;;
|
||||
--non-interactive) non_interactive=true ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ! command -v apt >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: apt is required\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)"
|
||||
|
||||
if [[ "$execute" != true ]]; then
|
||||
printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n'
|
||||
printf 'INFO: simulated autoremove follows\n'
|
||||
LC_ALL=C apt -s autoremove --purge
|
||||
printf 'INFO: rerun with --execute and confirm to apply changes\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$non_interactive" != true ]]; then
|
||||
printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n'
|
||||
printf 'Type EXECUTE to continue: '
|
||||
read -r confirmation
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||
exit 2
|
||||
fi
|
||||
fi
|
||||
|
||||
apt update
|
||||
apt autoremove --purge -y
|
||||
apt autoclean -y
|
||||
if command -v needrestart >/dev/null 2>&1; then
|
||||
needrestart -b || true
|
||||
else
|
||||
printf 'WARNING: needrestart is not installed\n'
|
||||
fi
|
||||
printf 'OK: APT cleanup completed\n'
|
||||
@@ -0,0 +1,90 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
LOG_FILE="/var/log/ailab-config-backup.log"
|
||||
BACKUP_DIR="/srv/backups/ailab-config"
|
||||
RETENTION_DAYS=30
|
||||
execute=false
|
||||
non_interactive=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--execute) execute=true ;;
|
||||
--non-interactive) non_interactive=true ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
for command_name in tar gzip find; do
|
||||
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
timestamp="$(date '+%Y%m%d-%H%M%S')"
|
||||
archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz"
|
||||
candidate_paths=(
|
||||
/etc
|
||||
/root/.bashrc
|
||||
/root/.bashrc.d
|
||||
/opt/ailab-maintenance
|
||||
/var/lib/libvirt/qemu
|
||||
)
|
||||
source_paths=()
|
||||
|
||||
printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)"
|
||||
for path in "${candidate_paths[@]}"; do
|
||||
if [[ -e "$path" ]]; then
|
||||
source_paths+=("${path#/}")
|
||||
printf 'OK: include %s\n' "$path"
|
||||
else
|
||||
printf 'INFO: optional path is absent: %s\n' "$path"
|
||||
fi
|
||||
done
|
||||
|
||||
if ((${#source_paths[@]} == 0)); then
|
||||
printf 'CRITICAL: no backup source paths are present\n'
|
||||
exit 1
|
||||
fi
|
||||
|
||||
printf 'Backup destination: %s\n' "$archive"
|
||||
printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS"
|
||||
printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n'
|
||||
printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n'
|
||||
|
||||
if [[ "$execute" != true ]]; then
|
||||
printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$non_interactive" != true ]]; then
|
||||
printf 'Type EXECUTE to create the archive and apply retention: '
|
||||
read -r confirmation
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||
exit 2
|
||||
fi
|
||||
fi
|
||||
|
||||
install -d -m 0750 "$BACKUP_DIR"
|
||||
tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- "${source_paths[@]}"
|
||||
find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete
|
||||
printf 'OK: configuration backup created: %s\n' "$archive"
|
||||
@@ -0,0 +1,38 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
LOG_FILE="/var/log/ailab-disk-watch.log"
|
||||
threshold="${AILAB_DISK_THRESHOLD:-85}"
|
||||
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((threshold < 1 || threshold > 100)); then
|
||||
printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold"
|
||||
|
||||
status=0
|
||||
while read -r filesystem _blocks _used available use_percent mountpoint; do
|
||||
usage="${use_percent%\%}"
|
||||
|
||||
if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
|
||||
printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint"
|
||||
status=1
|
||||
elif ((usage >= threshold)); then
|
||||
printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \
|
||||
"$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available"
|
||||
status=1
|
||||
else
|
||||
printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
|
||||
fi
|
||||
done < <(df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {print $1, $2, $3, $4, $5, $6}')
|
||||
|
||||
exit "$status"
|
||||
@@ -0,0 +1,70 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
LOG_FILE="/var/log/ailab-docker-cleanup.log"
|
||||
execute=false
|
||||
non_interactive=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--execute) execute=true ;;
|
||||
--non-interactive) non_interactive=true ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)"
|
||||
|
||||
if ! command -v docker >/dev/null 2>&1; then
|
||||
printf 'INFO: Docker is not installed; nothing to do\n'
|
||||
exit 0
|
||||
fi
|
||||
if command -v systemctl >/dev/null 2>&1 && ! systemctl is-active --quiet docker; then
|
||||
printf 'INFO: docker.service is inactive; nothing to do\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
printf '\nDocker disk usage before cleanup:\n'
|
||||
docker system df
|
||||
|
||||
if [[ "$execute" != true ]]; then
|
||||
printf 'INFO: dry-run mode; would run docker system prune -af\n'
|
||||
printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n'
|
||||
printf 'INFO: Docker volumes are never included in this cleanup\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$non_interactive" != true ]]; then
|
||||
printf 'WARNING: this removes unused containers, networks, images, and old build cache, but not volumes.\n'
|
||||
printf 'Type EXECUTE to continue: '
|
||||
read -r confirmation
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||
exit 2
|
||||
fi
|
||||
fi
|
||||
|
||||
docker system prune -af
|
||||
docker builder prune -af --filter "until=168h"
|
||||
|
||||
printf '\nDocker disk usage after cleanup:\n'
|
||||
docker system df
|
||||
printf 'OK: Docker cleanup completed; volumes were not pruned\n'
|
||||
@@ -0,0 +1,111 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
section() {
|
||||
printf '\n== %s ==\n' "$1"
|
||||
}
|
||||
|
||||
run_optional() {
|
||||
local description="$1"
|
||||
shift
|
||||
|
||||
if "$@"; then
|
||||
return 0
|
||||
fi
|
||||
|
||||
printf 'WARNING: %s failed\n' "$description"
|
||||
return 0
|
||||
}
|
||||
|
||||
section "Host identity"
|
||||
if command -v hostnamectl >/dev/null 2>&1; then
|
||||
run_optional "hostnamectl" hostnamectl
|
||||
else
|
||||
run_optional "hostname" hostname
|
||||
fi
|
||||
run_optional "kernel information" uname -a
|
||||
run_optional "uptime" uptime
|
||||
|
||||
section "Memory"
|
||||
if command -v free >/dev/null 2>&1; then
|
||||
run_optional "memory report" free -h
|
||||
else
|
||||
printf 'WARNING: free is not available\n'
|
||||
fi
|
||||
|
||||
section "Filesystems"
|
||||
if command -v df >/dev/null 2>&1; then
|
||||
run_optional "filesystem report" df -hT
|
||||
printf '\nKey mountpoints present:\n'
|
||||
for mountpoint in / /boot /var /srv /opt /home; do
|
||||
if findmnt -rn --mountpoint "$mountpoint" >/dev/null 2>&1; then
|
||||
run_optional "filesystem report for $mountpoint" df -hT "$mountpoint"
|
||||
fi
|
||||
done
|
||||
else
|
||||
printf 'WARNING: df is not available\n'
|
||||
fi
|
||||
|
||||
section "Journal usage"
|
||||
if command -v journalctl >/dev/null 2>&1; then
|
||||
run_optional "journal disk usage" journalctl --disk-usage
|
||||
else
|
||||
printf 'WARNING: journalctl is not available\n'
|
||||
fi
|
||||
|
||||
section "Docker"
|
||||
if command -v docker >/dev/null 2>&1; then
|
||||
if command -v systemctl >/dev/null 2>&1; then
|
||||
run_optional "Docker service state" systemctl is-active docker
|
||||
fi
|
||||
run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
|
||||
run_optional "Docker disk usage" docker system df
|
||||
else
|
||||
printf 'INFO: Docker is not installed\n'
|
||||
fi
|
||||
|
||||
section "Libvirt"
|
||||
if command -v virsh >/dev/null 2>&1; then
|
||||
if command -v systemctl >/dev/null 2>&1; then
|
||||
run_optional "libvirtd service state" systemctl is-active libvirtd
|
||||
fi
|
||||
run_optional "libvirt guest list" virsh list --all
|
||||
else
|
||||
printf 'INFO: virsh is not installed\n'
|
||||
fi
|
||||
|
||||
section "NVIDIA"
|
||||
if command -v nvidia-smi >/dev/null 2>&1; then
|
||||
run_optional "NVIDIA status" nvidia-smi
|
||||
else
|
||||
printf 'INFO: nvidia-smi is not installed\n'
|
||||
fi
|
||||
|
||||
section "Failed systemd units"
|
||||
if command -v systemctl >/dev/null 2>&1; then
|
||||
run_optional "failed systemd unit report" systemctl --failed --no-pager
|
||||
else
|
||||
printf 'WARNING: systemctl is not available\n'
|
||||
fi
|
||||
|
||||
section "SMART quick health"
|
||||
if command -v smartctl >/dev/null 2>&1; then
|
||||
shopt -s nullglob
|
||||
devices=(/dev/sd? /dev/nvme?n?)
|
||||
shopt -u nullglob
|
||||
|
||||
if ((${#devices[@]} == 0)); then
|
||||
printf 'INFO: no matching SATA/SCSI or NVMe devices found\n'
|
||||
else
|
||||
for device in "${devices[@]}"; do
|
||||
printf '\n-- %s --\n' "$device"
|
||||
run_optional "SMART health check for $device" smartctl -H "$device"
|
||||
done
|
||||
fi
|
||||
else
|
||||
printf 'INFO: smartctl is not installed\n'
|
||||
fi
|
||||
|
||||
exit 0
|
||||
@@ -0,0 +1,117 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
# APT autoremove respects package dependencies and kernel protection rules. That
|
||||
# is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO.
|
||||
|
||||
LOG_FILE="/var/log/ailab-kernel-cleanup.log"
|
||||
execute=false
|
||||
non_interactive=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
kernel_packages() {
|
||||
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \
|
||||
'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \
|
||||
| awk '$1 ~ /^ii/ {print $2}' \
|
||||
| sort -u || true
|
||||
}
|
||||
|
||||
versioned_kernel_images() {
|
||||
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \
|
||||
| awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \
|
||||
| sort -u || true
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--execute) execute=true ;;
|
||||
--non-interactive) non_interactive=true ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
for command_name in apt dpkg-query uname; do
|
||||
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)"
|
||||
printf 'Running kernel: %s\n' "$(uname -r)"
|
||||
printf '\nInstalled kernel-related packages before cleanup:\n'
|
||||
kernel_packages
|
||||
|
||||
simulation="$(LC_ALL=C apt -s autoremove --purge)"
|
||||
printf '\nAPT autoremove simulation:\n%s\n' "$simulation"
|
||||
|
||||
mapfile -t installed_images < <(versioned_kernel_images)
|
||||
mapfile -t removed_images < <(
|
||||
# With --purge, the APT simulation labels removals "Purg" rather than "Remv".
awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u
|
||||
)
|
||||
|
||||
remaining_images=0
|
||||
for image in "${installed_images[@]}"; do
|
||||
remove_image=false
|
||||
for removed in "${removed_images[@]}"; do
|
||||
if [[ "$image" == "$removed" ]]; then
|
||||
remove_image=true
|
||||
break
|
||||
fi
|
||||
done
|
||||
if [[ "$remove_image" != true ]]; then
|
||||
remaining_images=$((remaining_images + 1))
|
||||
fi
|
||||
done
|
||||
|
||||
printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \
|
||||
"${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images"
|
||||
|
||||
if ((${#installed_images[@]} < 2 || remaining_images < 2)); then
|
||||
printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n'
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if [[ "$execute" != true ]]; then
|
||||
printf 'INFO: dry-run mode; no packages were removed\n'
|
||||
printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$non_interactive" != true ]]; then
|
||||
printf 'WARNING: APT will remove the packages shown in the simulation above.\n'
|
||||
printf 'Type EXECUTE to continue: '
|
||||
read -r confirmation
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||
exit 2
|
||||
fi
|
||||
fi
|
||||
|
||||
apt autoremove --purge -y
|
||||
apt autoclean -y
|
||||
if command -v update-grub >/dev/null 2>&1; then
|
||||
update-grub || true
|
||||
else
|
||||
printf 'WARNING: update-grub is not installed\n'
|
||||
fi
|
||||
|
||||
printf '\nInstalled kernel-related packages after cleanup:\n'
|
||||
kernel_packages
|
||||
printf 'OK: kernel cleanup completed with APT-managed package selection\n'
|
||||
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

section() {
  printf '\n== %s ==\n' "$1"
}

if ! command -v virsh >/dev/null 2>&1; then
  printf 'INFO: virsh is not installed; VM audit skipped\n'
  exit 0
fi

section "Virtual machines"
virsh list --all || printf 'WARNING: unable to list virtual machines\n'

section "Storage pools"
virsh pool-list --all || printf 'WARNING: unable to list storage pools\n'

mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true)
for pool in "${pools[@]}"; do
  section "Volumes in pool: $pool"
  virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool"
done

section "Possible VM disk and installation images"
search_roots=()
for path in /var/lib/libvirt /srv /opt; do
  [[ -d "$path" ]] && search_roots+=("$path")
done

if ((${#search_roots[@]} == 0)); then
  printf 'INFO: no configured search roots are present\n'
else
  find "${search_roots[@]}" -xdev -type f \
    \( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \
    -printf '%12s bytes %p\n' 2>/dev/null \
    | sort -nr || true
fi

printf '\nINFO: audit complete; no files or libvirt resources were modified\n'
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe APT cleanup
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab APT cleanup weekly

[Timer]
OnCalendar=Sun *-*-* 04:00:00
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab configuration backup

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab configuration backup daily

[Timer]
OnCalendar=*-*-* 03:30:00
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab disk usage check

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-disk-watch.sh
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab disk usage check hourly

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe Docker cleanup
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab Docker cleanup weekly

[Timer]
OnCalendar=Sun *-*-* 04:40:00
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe kernel cleanup
After=network-online.target ailab-apt-cleanup.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab kernel cleanup weekly

[Timer]
OnCalendar=Sun *-*-* 04:20:00
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,276 @@
# Linux Fresh Setup Toolkit

## Executive summary

The Linux Fresh Setup Toolkit is day-0 bootstrap automation for a clean Ubuntu
lab server or workstation. It prepares a host for routine administration,
Cockpit, Docker workloads, libvirt/KVM virtual machines, optional NVIDIA
diagnostics, bounded logging, practical kernel tuning, and a conservative
security baseline.

The scripts are modular and safe to rerun. Optional components remain optional,
UFW is not enabled without a specific flag, and an NVIDIA driver is never
installed without an explicit version. This is a portfolio and homelab
implementation, not a production-certified build standard.

## Scope and non-goals

The toolkit supports Ubuntu 24.04 and newer and assumes a systemd-based host
with APT package management. It is suitable for a host such as `ailab` that may
run WebODM, Open WebUI, Homepage, NVIDIA workloads, or test virtual machines.

It does not:

- Deploy applications, containers, or virtual machines.
- Configure GPU passthrough, VFIO bindings, bridges, or Windows guests.
- Select an NVIDIA driver automatically.
- Define a complete firewall policy or compliance baseline.
- Replace backup, monitoring, patching, or ongoing maintenance processes.
- Claim live validation against every future Ubuntu release.

## Why this is separate from ailab-maintenance

This project establishes a fresh host. The sibling
[AI Lab Maintenance Toolkit](../ailab-maintenance/) handles day-2 health
checks, scheduled cleanup, configuration backup, disk monitoring, and VM
inventory after a host is operating.

Keeping bootstrap and maintenance separate makes the change boundary clear:
this toolkit installs platform capabilities and baseline configuration, while
the maintenance toolkit manages recurring operational tasks.

## Directory layout

```text
setup/
├── README.md
├── install.sh
├── scripts/
│   ├── 00-preflight.sh
│   ├── 00-platform-guard.inc
│   ├── 01-base-packages.sh
│   ├── 02-shell-profile.sh
│   ├── 03-cockpit.sh
│   ├── 04-docker.sh
│   ├── 05-libvirt.sh
│   ├── 06-nvidia-tools.sh
│   ├── 07-tuning.sh
│   ├── 08-security-baseline.sh
│   └── 99-postcheck.sh
├── files/
│   ├── bashrc.d/ailab.sh
│   ├── docker/daemon.json
│   ├── sysctl/99-ailab.conf
│   └── systemd/journald-ailab-limits.conf
└── docs/
    ├── fresh-install-checklist.md
    ├── cockpit.md
    ├── docker.md
    ├── libvirt.md
    ├── nvidia.md
    └── bash-shell.md
```

`00-platform-guard.inc` is an internal sourced helper used by mutating
component scripts; it is not an executable profile.

## Supported profiles and flags

| Flag | Result |
| --- | --- |
| `--base` | Install operational CLI, diagnostic, storage, and network packages |
| `--shell` | Install the root AI lab Bash profile |
| `--cockpit` | Install and enable Cockpit |
| `--docker` | Install Docker and bounded JSON-file logging |
| `--libvirt` | Install and enable libvirt/KVM |
| `--nvidia-tools` | Install NVIDIA and OpenCL diagnostics without a driver |
| `--install-nvidia-driver VERSION` | Install diagnostics and the named Ubuntu driver package |
| `--tuning` | Apply journald, sysctl, sensor, and sysstat settings |
| `--security` | Install and enable fail2ban; install but do not enable UFW |
| `--enable-ufw` | Run security setup and explicitly enable UFW |
| `--all` | Run every standard profile without UFW enablement or driver installation |

`--install-nvidia-driver` implies `--nvidia-tools`. `--enable-ufw` implies
`--security`. With no flags, the installer prints help and makes no changes.

## Installation examples

Review the scripts and current host access path before execution:

```bash
cd labs/linux/setup
./install.sh
sudo ./install.sh --base --shell
sudo ./install.sh --cockpit --docker --libvirt
sudo ./install.sh --all
```

Explicit high-impact options can be combined with `--all`:

```bash
sudo ./install.sh --all --enable-ufw
sudo ./install.sh --all --install-nvidia-driver 550
```

The installer runs the read-only preflight once before selected profiles and a
postcheck after all successful profile steps.

## Fresh host workflow

1. Patch the base Ubuntu installation and confirm console or out-of-band access.
2. Review [the fresh install checklist](docs/fresh-install-checklist.md).
3. Run `sudo ./install.sh --base --shell`.
4. Add only the platform profiles needed by the host.
5. Review service state, listening ports, storage, networking, and warnings in
   the postcheck.
6. Reboot if a driver or kernel-related package requires it.
7. Capture host-specific configuration and backup requirements separately.

## AI lab workflow

A general AI lab host can start with:

```bash
sudo ./install.sh --base --shell --cockpit --docker --nvidia-tools --tuning --security
```

This installs GPU diagnostics but leaves driver choice to the operator. Add
libvirt only when the host will run VMs. Enable UFW only after confirming SSH,
Cockpit, application, bridge, and VM networking requirements.

## Safety model

- Mutating profiles require root and refuse non-Ubuntu systems or Ubuntu older
  than 24.04.
- Component profiles install their own direct prerequisites.
- Existing managed configuration is changed only when content differs.
- Changed root shell, Docker, journald, and sysctl files receive timestamped
  backups.
- Existing valid Docker JSON is merged so unrelated settings survive.
- Invalid Docker JSON stops configuration rather than being overwritten.
- UFW and NVIDIA driver installation require explicit flags.
- Package and service failures are not hidden.
- Postcheck warnings report optional or inactive components without masking a
  successfully completed diagnostic script.

APT installation and service restarts are real system changes. Test first on a
disposable host and maintain a console path when changing remote access policy.
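
The "changed only when content differs" and "timestamped backups" points in the safety model can be illustrated with a short sketch. This is not the toolkit's own code: `install_managed_file` and its two arguments are hypothetical names, and the `.bak` suffix is assumed from the rollback note in this README.

```bash
# Hypothetical helper (not taken from the toolkit): install SRC at DEST only
# when the content differs, keeping a timestamped backup of a replaced file.
install_managed_file() {
  local src="$1" dest="$2"

  # Unchanged content: no backup is written and no restart is needed.
  if [[ -f "$dest" ]] && cmp -s -- "$src" "$dest"; then
    printf 'OK: %s is already current\n' "$dest"
    return 0
  fi

  # Preserve the previous version before replacing it.
  if [[ -e "$dest" ]]; then
    cp -a -- "$dest" "${dest}.$(date '+%Y%m%d-%H%M%S').bak"
  fi

  # -D creates missing parent directories for DEST.
  install -D -m 0644 -- "$src" "$dest"
  printf 'CHANGED: %s\n' "$dest"
}
```

A caller could restart the owning service only when the helper prints `CHANGED`, which is one way to avoid backup and restart churn on reruns.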

## Bash shell profile

The shell profile is installed as `/root/.bashrc.d/ailab.sh`, and one exact
source line is maintained in `/root/.bashrc`. It adds concise helpers for
systemd, journals, Docker, libvirt, NVIDIA, ports, archives, and disk usage.
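
Maintaining "one exact source line" is typically done with a whole-line, fixed-string match before appending. The sketch below assumes that approach; `ensure_line` is a hypothetical name and the example source line is illustrative, since the toolkit's exact text is not shown here.

```bash
# Hypothetical helper: append LINE to FILE only when an identical line is
# not already present, so repeated runs leave exactly one copy.
ensure_line() {
  local line="$1" file="$2"
  touch -- "$file"
  # -x matches the whole line, -F treats LINE as a fixed string, not a regex.
  grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >>"$file"
}

# Example with an assumed source line (not the toolkit's actual text):
# ensure_line '[ -f /root/.bashrc.d/ailab.sh ] && . /root/.bashrc.d/ailab.sh' /root/.bashrc
```

Because the match is exact, a hand-edited variant of the line would not be detected and a second line would be appended.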

See [Bash shell profile](docs/bash-shell.md) for command details and cautions.

## Cockpit setup

Cockpit provides browser-based host, storage, network, package, VM, metrics,
and support-report views. The installer enables `cockpit.socket` and reports
`https://HOSTNAME:9090`. `cockpit-files` is optional because it is not
available in every enabled Ubuntu repository.

See [Cockpit setup](docs/cockpit.md).

## Docker setup

The Ubuntu `docker.io` package path is preferred. The Docker official
repository is configured only when `docker.io` is unavailable. The daemon uses
the `json-file` log driver with five 50 MB files per container.

The toolkit configures log retention only. It does not prune data, deploy
Compose applications, or configure an NVIDIA container runtime.

See [Docker setup](docs/docker.md).

## libvirt/KVM setup

The libvirt profile installs QEMU, OVMF, software TPM support, virt-install,
virt-manager, bridge utilities, and libvirt clients and services. It enables
`libvirtd` and prints existing guests and networks.

See [libvirt/KVM setup](docs/libvirt.md).

## NVIDIA tooling

The default NVIDIA profile installs `nvtop`, `clinfo`, and PCI diagnostics.
It reports detected NVIDIA devices, `nvidia-smi`, and DKMS state when those
commands exist.

Driver installation requires a numeric version that maps to an available
Ubuntu package, for example `nvidia-driver-550`. Secure Boot enrollment,
driver suitability, CUDA, container runtime support, and passthrough remain
operator decisions.

See [NVIDIA tooling](docs/nvidia.md).

## Tuning

The tuning profile bounds persistent journal use, raises inotify limits for
development and container workloads, reduces swappiness, enables sysstat, and
runs automatic sensor detection when available.

Review these values against available memory, storage, monitoring retention,
and workload behavior before deployment beyond a lab.
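
One way to review the applied values is to compare a sysctl drop-in against the running kernel. The sketch below is an illustration only: `check_sysctl_file` is a hypothetical helper, it assumes simple `key=value` lines without surrounding spaces, and the optional second argument exists purely so the comparison can be pointed at a test directory instead of `/proc/sys`.

```bash
# Hypothetical helper: compare key=value lines in a sysctl file with the
# values currently exposed under /proc/sys (or another root for testing).
check_sysctl_file() {
  local file="$1" root="${2:-/proc/sys}" key want have rc=0
  while IFS='=' read -r key want; do
    # Skip blank lines and comments.
    [[ -z "$key" || "$key" == \#* ]] && continue
    # vm.swappiness maps to /proc/sys/vm/swappiness.
    have="$(cat "$root/${key//.//}" 2>/dev/null || printf 'unreadable')"
    if [[ "$have" == "$want" ]]; then
      printf 'OK: %s=%s\n' "$key" "$have"
    else
      printf 'WARNING: %s is %s, expected %s\n' "$key" "$have" "$want"
      rc=1
    fi
  done <"$file"
  return "$rc"
}

# Example (path assumed from the directory layout in this README):
# check_sysctl_file files/sysctl/99-ailab.conf
```

A non-zero return signals at least one mismatch, which fits the exit-code convention of reporting operational issues without changing anything.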

## Security baseline

The security profile installs UFW and fail2ban and enables fail2ban. It leaves
UFW disabled unless `--enable-ufw` is present. Explicit UFW enablement permits
OpenSSH and TCP port 9090 before activation.

This is a minimal access-preservation baseline, not a complete host firewall or
hardening standard. Application and VM networking may require additional
reviewed rules.
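
The allow-before-enable ordering can be sketched as below. This is an assumption about the sequence rather than a copy of the profile script: `run` and `enable_ufw_preserving_access` are hypothetical names, and `DRY_RUN` defaults to `1` here so the commands are printed instead of executed.

```bash
# Hypothetical wrapper: print the command in dry-run mode, otherwise run it.
run() {
  if [[ "${DRY_RUN:-1}" == 1 ]]; then
    printf 'DRY-RUN: %s\n' "$*"
  else
    "$@"
  fi
}

# Allow the access paths first, then activate the firewall.
enable_ufw_preserving_access() {
  run ufw allow OpenSSH      # requires the OpenSSH UFW application profile
  run ufw allow 9090/tcp     # Cockpit
  run ufw --force enable     # --force skips the interactive confirmation
}

# Review the plan first, then apply as root with: DRY_RUN=0 enable_ufw_preserving_access
```

Printing the plan before applying it gives the operator a chance to add application or VM rules while console access is still confirmed.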

## Postcheck

The final script reports:

- Failed systemd units.
- Cockpit, Docker, libvirt, and fail2ban status when installed.
- Running Docker containers and defined virtual machines.
- NVIDIA runtime state.
- Filesystem usage and listening ports.

Warnings require operator review but optional component absence does not cause
the postcheck itself to fail.

## Troubleshooting

Run individual read-only checks after correcting a failed profile:

```bash
sudo ./scripts/00-preflight.sh
sudo ./scripts/99-postcheck.sh
systemctl --failed
journalctl -u docker -u libvirtd -u cockpit.socket -u fail2ban
```

Common failure areas are unavailable APT repositories, unsupported package
names on a future Ubuntu release, invalid pre-existing Docker JSON, Secure Boot
module signing, disabled CPU virtualization, and remote firewall assumptions.

To roll back a managed configuration, compare the current file with its
timestamped `.bak` copy, restore the reviewed backup, and restart or reload the
owning service. Package removal is intentionally not automated because it may
affect workloads and dependencies.

## Interview talking points

- Why day-0 bootstrap and day-2 maintenance have separate ownership.
- How explicit flags protect firewall and GPU driver decisions.
- Why Docker JSON is validated, backed up, and merged.
- How idempotent content checks prevent backup and restart churn.
- Why preflight and postcheck evidence surround mutating profiles.
- Which virtualization, Secure Boot, IOMMU, and GPU decisions remain manual.

## Future improvements

- Add automated tests using disposable Ubuntu VMs.
- Add a documented NVIDIA Container Toolkit profile.
- Add optional non-root administrative user and group membership management.
- Add bridge and VFIO planning checks without applying passthrough changes.
- Add package compatibility matrices after validating future Ubuntu releases.
- Export postcheck results in a structured format for evidence collection.
@@ -0,0 +1,53 @@
# Bash Shell Profile

## Installation

The shell profile is installed for root:

```text
/root/.bashrc.d/ailab.sh
```

The installer maintains one exact source line in `/root/.bashrc` and backs up
changed files. Start a new Bash session or run:

```bash
source /root/.bashrc
```

## Aliases

| Alias | Purpose |
| --- | --- |
| `ll`, `la` | Detailed and hidden-file directory listings |
| `ports` | Listening TCP/UDP sockets and processes |
| `dus`, `dufh` | Directory and filesystem usage |
| `failed`, `jerr`, `timers` | systemd failure, journal error, and timer views |
| `dps`, `ddf`, `dcu` | Docker containers, disk use, and Compose startup |
| `vms` | All libvirt guests |
| `gpu`, `gpuloop` | NVIDIA status once or refreshed every two seconds |
| `now` | Current timestamp and timezone |

`dcu` runs `docker compose up -d` in the current directory and therefore may
create or start resources. Review the Compose project before using it.

## Functions

- `svc_status SERVICE`
- `svc_logs SERVICE [LINES]`
- `docker_logs CONTAINER [LINES]`
- `docker_restart CONTAINER`
- `vm_autostart VM`
- `vm_no_autostart VM`
- `path_backup PATH`
- `extract ARCHIVE`

Functions validate argument counts, and Docker, libvirt, and NVIDIA helpers
report missing commands clearly. `path_backup` creates a timestamped adjacent
copy and can consume substantial space for large paths.

## Rollback

Review timestamped backups under `/root`, restore the desired `.bashrc` or
profile copy, and start a new shell. Avoid restoring a backup without checking
for unrelated shell changes made after bootstrap.
@@ -0,0 +1,41 @@
# Cockpit

## Purpose

The Cockpit profile installs browser-based host administration modules for
system state, storage, networking, packages, virtual machines, metrics, and
support reports. It enables the socket-activated service.

## Installation and validation

```bash
sudo ./install.sh --cockpit
systemctl status cockpit.socket
ss -ltnp | grep ':9090'
```

Connect to `https://HOSTNAME:9090`. A browser warning is expected when the
default host certificate is not trusted.

`cockpit-files` is installed when available and skipped with a warning
otherwise.

## Access and firewall

The Cockpit profile does not change UFW. Explicit toolkit UFW enablement allows
TCP 9090, but upstream firewalls and network ACLs remain external concerns.
Use normal Linux accounts and review which users may administer the host.

## Troubleshooting and rollback

```bash
journalctl -u cockpit.socket -u cockpit.service
systemctl restart cockpit.socket
apt-cache policy cockpit cockpit-machines cockpit-files
```

To disable remote access without removing packages:

```bash
sudo systemctl disable --now cockpit.socket
```
@@ -0,0 +1,56 @@
# Docker

## Package policy

The profile prefers Ubuntu's `docker.io` package. If that package is
unavailable after an APT refresh, it configures Docker's official Ubuntu
repository and installs Docker Engine, containerd, Buildx, and Compose plugins.

This fallback requires network access to `download.docker.com`.

## Daemon configuration

The managed settings are:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
```

Existing valid `/etc/docker/daemon.json` content is preserved and merged with
these log settings. A changed file is backed up with a timestamp. Invalid JSON
causes the profile to stop rather than overwrite operator configuration.
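
A validate-then-merge step of this kind can be sketched with `jq`, which the validation commands in this document already use. The function name `merge_daemon_json` is hypothetical, the sketch assumes the existing file holds a JSON object, and the toolkit's real merge logic may differ.

```bash
# Hypothetical helper: print EXISTING merged with MANAGED, refusing to
# continue when EXISTING is not valid JSON.
merge_daemon_json() {
  local existing="$1" managed="$2"

  if ! jq empty "$existing" >/dev/null 2>&1; then
    printf 'CRITICAL: %s is not valid JSON; leaving it untouched\n' "$existing" >&2
    return 1
  fi

  # -s slurps both files into one array; '*' merges objects recursively,
  # so unrelated keys survive and managed keys win on conflict.
  jq -s '.[0] * .[1]' "$existing" "$managed"
}

# Example: write the result to a temporary file and review it before replacing
# /etc/docker/daemon.json.
# merge_daemon_json /etc/docker/daemon.json files/docker/daemon.json > /tmp/daemon.json.new
```

Writing to a separate file first keeps the live configuration unchanged until the merged result has been inspected and backed up.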

Log limits apply to newly created containers. Existing containers may retain
their original logging configuration until recreated.

## Validation

```bash
docker version
docker compose version
docker info
docker ps
docker system df
jq . /etc/docker/daemon.json
```

## Troubleshooting and rollback

```bash
systemctl status docker
journalctl -u docker
jq empty /etc/docker/daemon.json
```

To restore a previous daemon configuration, review a timestamped backup,
replace the current file, validate it with `jq empty`, and restart Docker.
Do not restore blindly when workloads depend on newer daemon settings.

The profile does not configure Docker data roots, prune objects, deploy
applications, or install the NVIDIA Container Toolkit.
@@ -0,0 +1,47 @@
# Fresh Install Checklist

## Before bootstrap

- Confirm Ubuntu 24.04 or newer and record the release and kernel.
- Apply firmware settings for virtualization, IOMMU, or Secure Boot as needed.
- Confirm console or out-of-band access before firewall work.
- Record interfaces, addresses, routes, DNS, storage, and intended mountpoints.
- Patch the base system and reboot if required.
- Decide whether the host needs Docker, libvirt, Cockpit, or NVIDIA support.
- Review application ports and VM networking before enabling UFW.
- Confirm backups exist for any pre-existing host configuration.

## Bootstrap

Start with the least capability required:

```bash
sudo ./install.sh --base --shell
```

Add reviewed platform profiles:

```bash
sudo ./install.sh --cockpit --docker --libvirt --nvidia-tools --tuning --security
```

Do not select `--enable-ufw` until remote access and application rules are
understood. Do not install an NVIDIA driver until hardware, kernel, Secure Boot,
and workload compatibility are known.

## Post-bootstrap evidence

- Review all installer warnings.
- Run `systemctl --failed`.
- Confirm expected services with `systemctl status`.
- Review `ss -tulpn`, `df -hT`, `ip -brief address`, and `ip route`.
- Confirm Docker with `docker version` and `docker compose version`.
- Confirm libvirt with `virsh list --all` and `virsh net-list --all`.
- Confirm GPU state with `lspci -nn | grep -i nvidia` and `nvidia-smi`.
- Reboot after driver installation and repeat the postcheck.

## Handover

Document host-specific storage, network, firewall, backup, application, GPU,
and VM decisions. Install the separate `ailab-maintenance` toolkit only after
reviewing its scheduled day-2 behavior.
@@ -0,0 +1,54 @@
# libvirt and KVM

## Purpose

The libvirt profile installs QEMU/KVM administration, UEFI firmware, software
TPM support, VM creation tools, bridge utilities, and the libvirt daemon. This
supports later Linux or Windows 11 VM work without defining guests.

## Firmware pre-checks

Confirm CPU virtualization is enabled:

```bash
lscpu | grep -E 'Virtualization|Hypervisor'
grep -Eom1 '(vmx|svm)' /proc/cpuinfo
```

IOMMU and GPU passthrough require separate firmware, kernel command-line,
device isolation, driver binding, and recovery planning. This toolkit reports
hints but does not apply those changes.

## Validation

```bash
systemctl status libvirtd
virsh list --all
virsh net-list --all
virsh pool-list --all
```

Use `virt-host-validate` when available for a broader host capability report.
Desktop use of `virt-manager` requires a graphical environment or remote
display strategy.

## Networking and Windows 11

The default libvirt NAT network is distinct from host bridge networking. Review
DHCP, DNS, forwarding, and firewall behavior before changing it.

Windows 11 typically needs UEFI and a TPM device. The installed OVMF and swtpm
packages provide those building blocks, but guest creation and licensing remain
manual.

## Troubleshooting

```bash
journalctl -u libvirtd
virsh net-info default
virsh pool-list --all
lsmod | grep kvm
```

Disabling `libvirtd` does not remove VM disks or definitions. Package removal
and VM data deletion are intentionally outside this toolkit.
@@ -0,0 +1,52 @@
# NVIDIA Tooling

## Diagnostic-only default

The normal NVIDIA profile installs `nvtop`, `clinfo`, and PCI utilities. It
does not install or select a driver:

```bash
sudo ./install.sh --nvidia-tools
```

Review hardware and current module state:

```bash
lspci -nn | grep -i nvidia
nvidia-smi
dkms status
mokutil --sb-state
```

## Explicit driver installation

Install only a reviewed Ubuntu driver package version:

```bash
sudo ./install.sh --install-nvidia-driver 550
```

The numeric value maps directly to `nvidia-driver-VERSION`. The profile refuses
an unavailable package. Reboot after installation, then validate `nvidia-smi`,
kernel logs, DKMS state, and application behavior.
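
The version-to-package mapping and the availability refusal can be sketched as two small checks. Both function names are hypothetical, and reading the `Candidate:` line of `apt-cache policy` is an assumed way to test availability, not necessarily the one the profile uses.

```bash
# Hypothetical helper: print the package name for a digits-only VERSION.
nvidia_driver_package() {
  local version="$1"
  if [[ ! "$version" =~ ^[0-9]+$ ]]; then
    printf 'CRITICAL: VERSION must contain digits only\n' >&2
    return 2
  fi
  printf 'nvidia-driver-%s\n' "$version"
}

# Hypothetical helper: succeed only when APT reports an installable
# candidate version for PACKAGE.
package_available() {
  local candidate
  candidate="$(apt-cache policy "$1" 2>/dev/null | awk '/Candidate:/ {print $2}' || true)"
  [[ -n "$candidate" && "$candidate" != "(none)" ]]
}
```

A caller would resolve the name with `nvidia_driver_package 550`, then stop before any `apt-get install` unless `package_available` succeeds for that name.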

## Selection considerations

- GPU generation and supported driver branch.
- Ubuntu release, kernel, and HWE stack.
- Secure Boot module enrollment.
- CUDA or application compatibility.
- Docker NVIDIA Container Toolkit requirements.
- Whether the device will be bound to VFIO instead of the host driver.

## Troubleshooting

```bash
journalctl -k | grep -Ei 'nvidia|nouveau|NVRM'
lsmod | grep -E 'nvidia|nouveau'
dkms status
apt-cache policy 'nvidia-driver-*'
```

Driver rollback is environment-specific and is not automated. Preserve console
access and a known-good kernel before changing GPU or Secure Boot configuration.
@@ -0,0 +1,133 @@
#!/usr/bin/env bash
# AI lab operational shell helpers. This file is intended to be sourced.

alias ll='ls -alF'
alias la='ls -A'
alias ports='ss -tulpn'
alias dus='du -xhd1 2>/dev/null | sort -h'
alias dufh='df -hT'
alias failed='systemctl --failed --no-pager'
alias jerr='journalctl -p err -b --no-pager'
alias timers='systemctl list-timers --all --no-pager'
alias dps='if command -v docker >/dev/null 2>&1; then docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}"; else printf "Docker is not installed\n"; fi'
alias ddf='if command -v docker >/dev/null 2>&1; then docker system df; else printf "Docker is not installed\n"; fi'
alias dcu='if docker compose version >/dev/null 2>&1; then docker compose up -d; else printf "Docker Compose is not available\n"; fi'
alias vms='if command -v virsh >/dev/null 2>&1; then virsh list --all; else printf "virsh is not installed\n"; fi'
alias gpu='if command -v nvidia-smi >/dev/null 2>&1; then nvidia-smi; else printf "nvidia-smi is not installed\n"; fi'
alias gpuloop='if command -v nvidia-smi >/dev/null 2>&1; then watch -n 2 nvidia-smi; else printf "nvidia-smi is not installed\n"; fi'
alias now='date "+%Y-%m-%d %H:%M:%S %Z"'

svc_status() {
  if (($# != 1)); then
    printf 'Usage: svc_status SERVICE\n' >&2
    return 2
  fi
  systemctl status "$1" --no-pager
}

svc_logs() {
  if (($# < 1 || $# > 2)); then
    printf 'Usage: svc_logs SERVICE [LINES]\n' >&2
    return 2
  fi
  local lines="${2:-100}"
  [[ "$lines" =~ ^[0-9]+$ ]] || {
    printf 'LINES must be numeric\n' >&2
    return 2
  }
  journalctl -u "$1" -n "$lines" --no-pager
}

docker_logs() {
  if (($# < 1 || $# > 2)); then
    printf 'Usage: docker_logs CONTAINER [LINES]\n' >&2
    return 2
  fi
  command -v docker >/dev/null 2>&1 || {
    printf 'Docker is not installed\n' >&2
    return 1
  }
  local lines="${2:-100}"
  [[ "$lines" =~ ^[0-9]+$ ]] || {
    printf 'LINES must be numeric\n' >&2
    return 2
  }
  docker logs --tail "$lines" "$1"
}

docker_restart() {
  if (($# != 1)); then
    printf 'Usage: docker_restart CONTAINER\n' >&2
    return 2
  fi
  command -v docker >/dev/null 2>&1 || {
    printf 'Docker is not installed\n' >&2
    return 1
  }
  docker restart "$1"
}

vm_autostart() {
  if (($# != 1)); then
    printf 'Usage: vm_autostart VM\n' >&2
    return 2
  fi
  command -v virsh >/dev/null 2>&1 || {
    printf 'virsh is not installed\n' >&2
    return 1
  }
  virsh autostart "$1"
}

vm_no_autostart() {
  if (($# != 1)); then
    printf 'Usage: vm_no_autostart VM\n' >&2
    return 2
  fi
  command -v virsh >/dev/null 2>&1 || {
    printf 'virsh is not installed\n' >&2
    return 1
  }
  virsh autostart --disable "$1"
}

path_backup() {
  if (($# != 1)); then
    printf 'Usage: path_backup PATH\n' >&2
    return 2
  fi
  if [[ ! -e "$1" ]]; then
    printf 'Path does not exist: %s\n' "$1" >&2
    return 1
  fi
  local destination
  destination="${1%/}.$(date '+%Y%m%d-%H%M%S').bak"
  cp -a -- "$1" "$destination"
  printf 'Backup created: %s\n' "$destination"
}

extract() {
  if (($# != 1)); then
    printf 'Usage: extract ARCHIVE\n' >&2
    return 2
  fi
  if [[ ! -f "$1" ]]; then
    printf 'Archive does not exist: %s\n' "$1" >&2
    return 1
  fi
  case "$1" in
    *.tar.bz2|*.tbz2) tar xjf "$1" ;;
    *.tar.gz|*.tgz) tar xzf "$1" ;;
    *.tar.xz|*.txz) tar xJf "$1" ;;
    *.tar) tar xf "$1" ;;
    *.bz2) bunzip2 "$1" ;;
    *.gz) gunzip "$1" ;;
    *.zip) unzip "$1" ;;
    *.7z) 7z x "$1" ;;
    *.rar) unrar x "$1" ;;
    *)
      printf 'Unsupported archive type: %s\n' "$1" >&2
      return 2
      ;;
  esac
}
@@ -0,0 +1,7 @@
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
@@ -0,0 +1,3 @@
fs.inotify.max_user_watches=1048576
fs.inotify.max_user_instances=1024
vm.swappiness=10
@@ -0,0 +1,5 @@
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
@@ -0,0 +1,182 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

run_base=0
run_shell=0
run_cockpit=0
run_docker=0
run_libvirt=0
run_nvidia=0
run_tuning=0
run_security=0
enable_ufw=0
nvidia_driver_version=""

usage() {
  cat <<'EOF'
Usage: sudo ./install.sh [OPTIONS]

Day-0 bootstrap automation for Ubuntu 24.04 or newer.

Profiles:
  --base                 Install baseline operational packages
  --shell                Install the root shell profile
  --cockpit              Install and enable Cockpit
  --docker               Install and configure Docker
  --libvirt              Install and enable libvirt/KVM
  --nvidia-tools         Install NVIDIA diagnostic tools only
  --install-nvidia-driver VERSION
                         Install diagnostic tools and the explicit driver
  --tuning               Install journald and sysctl tuning
  --security             Install fail2ban and UFW; do not enable UFW
  --enable-ufw           Run security profile and explicitly enable UFW
  --all                  Run every profile without enabling UFW or
                         installing an NVIDIA driver
  -h, --help             Show this help

Examples:
  sudo ./install.sh --base --shell
  sudo ./install.sh --all
  sudo ./install.sh --all --enable-ufw
  sudo ./install.sh --nvidia-tools --install-nvidia-driver 550
EOF
}

require_supported_ubuntu() {
  if [[ ! -r /etc/os-release ]]; then
    printf 'CRITICAL: /etc/os-release is unavailable; refusing system changes\n' >&2
    exit 2
  fi

  # shellcheck disable=SC1091
  source /etc/os-release
  if [[ "${ID:-}" != "ubuntu" ]]; then
    printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
    exit 2
  fi
  if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
    printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
      "${VERSION_ID:-unknown}" >&2
    exit 2
  fi
}

if (($# == 0)); then
  usage
  exit 0
fi

while (($# > 0)); do
  case "$1" in
    --base)
      run_base=1
      ;;
    --shell)
      run_shell=1
      ;;
    --cockpit)
      run_cockpit=1
      ;;
    --docker)
      run_docker=1
      ;;
    --libvirt)
      run_libvirt=1
      ;;
    --nvidia-tools)
      run_nvidia=1
      ;;
    --install-nvidia-driver)
      if (($# < 2)); then
        printf 'CRITICAL: --install-nvidia-driver requires a VERSION\n' >&2
        exit 2
      fi
      nvidia_driver_version="$2"
      if [[ ! "$nvidia_driver_version" =~ ^[0-9]+$ ]]; then
        printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
        exit 2
      fi
      run_nvidia=1
      shift
      ;;
    --tuning)
      run_tuning=1
      ;;
    --security)
      run_security=1
      ;;
    --enable-ufw)
      enable_ufw=1
      run_security=1
      ;;
    --all)
      run_base=1
      run_shell=1
      run_cockpit=1
      run_docker=1
      run_libvirt=1
      run_nvidia=1
      run_tuning=1
      run_security=1
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n\n' "$1" >&2
      usage >&2
      exit 2
|
||||
;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: install.sh must run as root for selected profiles\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
for required_command in bash dpkg; do
|
||||
if ! command -v "$required_command" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command is missing: %s\n' "$required_command" >&2
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
|
||||
require_supported_ubuntu
|
||||
|
||||
printf 'INFO: running read-only preflight\n'
|
||||
"$SCRIPT_DIR/scripts/00-preflight.sh"
|
||||
|
||||
((run_base == 0)) || "$SCRIPT_DIR/scripts/01-base-packages.sh"
|
||||
((run_shell == 0)) || "$SCRIPT_DIR/scripts/02-shell-profile.sh"
|
||||
((run_cockpit == 0)) || "$SCRIPT_DIR/scripts/03-cockpit.sh"
|
||||
((run_docker == 0)) || "$SCRIPT_DIR/scripts/04-docker.sh"
|
||||
((run_libvirt == 0)) || "$SCRIPT_DIR/scripts/05-libvirt.sh"
|
||||
|
||||
if ((run_nvidia == 1)); then
|
||||
if [[ -n "$nvidia_driver_version" ]]; then
|
||||
"$SCRIPT_DIR/scripts/06-nvidia-tools.sh" --install-driver "$nvidia_driver_version"
|
||||
else
|
||||
"$SCRIPT_DIR/scripts/06-nvidia-tools.sh"
|
||||
fi
|
||||
fi
|
||||
|
||||
((run_tuning == 0)) || "$SCRIPT_DIR/scripts/07-tuning.sh"
|
||||
|
||||
if ((run_security == 1)); then
|
||||
if ((enable_ufw == 1)); then
|
||||
"$SCRIPT_DIR/scripts/08-security-baseline.sh" --enable-ufw
|
||||
else
|
||||
"$SCRIPT_DIR/scripts/08-security-baseline.sh"
|
||||
fi
|
||||
fi
|
||||
|
||||
printf '\nINFO: running post-install checks\n'
|
||||
"$SCRIPT_DIR/scripts/99-postcheck.sh"
|
||||
printf '\nOK: selected Linux setup profiles completed\n'
|
||||
@@ -0,0 +1,20 @@
# shellcheck shell=bash

require_supported_ubuntu() {
  if [[ ! -r /etc/os-release ]] || ! command -v dpkg >/dev/null 2>&1; then
    printf 'CRITICAL: Ubuntu release detection requires /etc/os-release and dpkg\n' >&2
    exit 2
  fi

  # shellcheck disable=SC1091
  source /etc/os-release
  if [[ "${ID:-}" != "ubuntu" ]]; then
    printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
    exit 2
  fi
  if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
    printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
      "${VERSION_ID:-unknown}" >&2
    exit 2
  fi
}
Executable
+124
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

section() {
  printf '\n== %s ==\n' "$1"
}

run_optional() {
  local description="$1"
  shift

  if "$@"; then
    return 0
  fi
  printf 'WARNING: %s failed\n' "$description"
  return 0
}

section "Operating system"
if [[ -r /etc/os-release ]]; then
  run_optional "OS release report" cat /etc/os-release
else
  printf 'WARNING: /etc/os-release is unavailable\n'
fi
run_optional "kernel report" uname -a

section "Host"
run_optional "hostname report" hostname
run_optional "uptime report" uptime

section "CPU and virtualization"
if command -v lscpu >/dev/null 2>&1; then
  run_optional "CPU report" lscpu
  printf '\nVirtualization flags:\n'
  lscpu | grep -E 'Virtualization|Hypervisor vendor' || \
    printf 'INFO: no virtualization summary reported by lscpu\n'
else
  printf 'WARNING: lscpu is unavailable\n'
fi
if grep -Eqm1 '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' /proc/cpuinfo; then
  printf 'OK: CPU virtualization flags detected\n'
else
  printf 'WARNING: CPU virtualization flags were not detected\n'
fi

section "Memory"
if command -v free >/dev/null 2>&1; then
  run_optional "memory report" free -h
else
  run_optional "memory report" cat /proc/meminfo
fi

section "Disks"
if command -v lsblk >/dev/null 2>&1; then
  run_optional "block device report" lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,MODEL
else
  printf 'WARNING: lsblk is unavailable\n'
fi
run_optional "filesystem report" df -hT

section "Network"
if command -v ip >/dev/null 2>&1; then
  run_optional "network interface report" ip -brief address
  run_optional "route report" ip route
else
  printf 'WARNING: ip is unavailable\n'
fi

section "Firmware and Secure Boot"
if [[ -d /sys/firmware/efi ]]; then
  printf 'OK: boot mode is UEFI\n'
else
  printf 'INFO: boot mode appears to be legacy BIOS\n'
fi
if command -v mokutil >/dev/null 2>&1; then
  run_optional "Secure Boot report" mokutil --sb-state
else
  printf 'INFO: mokutil is unavailable; Secure Boot state not queried\n'
fi

section "IOMMU"
if [[ -r /proc/cmdline ]]; then
  printf 'Kernel command line:\n'
  cat /proc/cmdline
  if grep -Eq '(^|[[:space:]])(intel_iommu=on|amd_iommu=on|iommu=)' /proc/cmdline; then
    printf 'OK: IOMMU-related kernel arguments detected\n'
  else
    printf 'INFO: no explicit IOMMU kernel argument detected\n'
  fi
fi
if command -v dmesg >/dev/null 2>&1; then
  dmesg 2>/dev/null | grep -Ei 'DMAR|IOMMU|AMD-Vi' | tail -n 30 || \
    printf 'INFO: no readable IOMMU hints found in dmesg\n'
fi

section "NVIDIA hardware"
if command -v lspci >/dev/null 2>&1; then
  lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
else
  printf 'INFO: lspci is unavailable\n'
fi

section "Existing platform components"
for command_name in docker virsh cockpit-bridge; do
  if command -v "$command_name" >/dev/null 2>&1; then
    printf 'OK: %s is installed at %s\n' "$command_name" "$(command -v "$command_name")"
  else
    printf 'INFO: %s is not installed\n' "$command_name"
  fi
done
if command -v systemctl >/dev/null 2>&1; then
  for unit in docker.service libvirtd.service cockpit.socket; do
    if systemctl cat "$unit" >/dev/null 2>&1; then
      state="$(systemctl is-active "$unit" 2>/dev/null || true)"
      printf 'INFO: %-20s state=%s\n' "$unit" "${state:-unknown}"
    else
      printf 'INFO: %s is not installed\n' "$unit"
    fi
  done
fi

printf '\nOK: preflight completed without modifying the host\n'
Executable
+41
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"

packages=(
  curl wget git vim nano tmux byobu htop btop glances
  jq unzip zip rsync tree ncdu duf
  lsof strace tcpdump nmap dnsutils net-tools iperf3 ethtool
  smartmontools nvme-cli lm-sensors pciutils usbutils hwinfo
  sysstat iotop iftop nload
  ca-certificates gnupg software-properties-common apt-transport-https
  needrestart unattended-upgrades logrotate
)

if ((EUID != 0)); then
  printf 'CRITICAL: base package setup must run as root\n' >&2
  exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
fi

printf 'INFO: refreshing APT metadata\n'
apt-get update
printf 'INFO: installing baseline operational packages\n'
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"

if command -v systemctl >/dev/null 2>&1; then
  systemctl enable --now sysstat
else
  printf 'WARNING: systemctl is unavailable; sysstat was not enabled\n'
fi

printf 'OK: baseline operational packages are installed\n'
Executable
+60
@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
SOURCE_FILE="$SCRIPT_DIR/../files/bashrc.d/ailab.sh"
PROFILE_DIR="/root/.bashrc.d"
PROFILE_FILE="$PROFILE_DIR/ailab.sh"
BASHRC="/root/.bashrc"
SOURCE_LINE='[[ -f /root/.bashrc.d/ailab.sh ]] && source /root/.bashrc.d/ailab.sh'

backup_file() {
  local path="$1"
  local backup

  backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
  install -m 0644 "$path" "$backup"
  printf 'INFO: backed up %s to %s\n' "$path" "$backup"
}

if ((EUID != 0)); then
  printf 'CRITICAL: shell profile setup must run as root\n' >&2
  exit 2
fi
require_supported_ubuntu
if [[ ! -r "$SOURCE_FILE" ]]; then
  printf 'CRITICAL: shell profile source is missing: %s\n' "$SOURCE_FILE" >&2
  exit 2
fi

install -d -m 0755 "$PROFILE_DIR"
if [[ ! -f "$PROFILE_FILE" ]] || ! cmp -s "$SOURCE_FILE" "$PROFILE_FILE"; then
  if [[ -f "$PROFILE_FILE" ]]; then
    backup_file "$PROFILE_FILE"
  fi
  install -m 0644 "$SOURCE_FILE" "$PROFILE_FILE"
  printf 'OK: installed %s\n' "$PROFILE_FILE"
else
  printf 'OK: shell profile is already current\n'
fi

if [[ ! -f "$BASHRC" ]]; then
  install -m 0644 /dev/null "$BASHRC"
fi

source_count="$(grep -Fxc "$SOURCE_LINE" "$BASHRC" || true)"
if [[ "$source_count" != "1" ]]; then
  tmp_bashrc="$(mktemp)"
  trap 'rm -f "$tmp_bashrc"' EXIT
  grep -Fvx "$SOURCE_LINE" "$BASHRC" >"$tmp_bashrc" || true
  printf '\n%s\n' "$SOURCE_LINE" >>"$tmp_bashrc"
  backup_file "$BASHRC"
  install -m 0644 "$tmp_bashrc" "$BASHRC"
  printf 'OK: configured %s to source the AI lab profile exactly once\n' "$BASHRC"
else
  printf 'OK: %s already sources the AI lab profile exactly once\n' "$BASHRC"
fi
Some files were not shown because too many files have changed in this diff.