Compare commits

8 Commits

Author SHA1 Message Date
Mateusz Suski 8a7b7c5abc Clean up Python log analysis documentation 2026-05-11 17:10:10 +00:00
lint / shell-yaml-ansible (push) Failing after 20s
Mateusz Suski 1636f46f81 Add known error matcher tool 2026-05-11 17:06:46 +00:00
Mateusz Suski 5fc96348c5 Add journal analyzer tool 2026-05-11 17:06:05 +00:00
Mateusz Suski 89b7fabb96 Add JVM log analyzer tool 2026-05-11 17:05:27 +00:00
Mateusz Suski 2da5e8b46c Add authentication log audit tool 2026-05-11 17:04:48 +00:00
Mateusz Suski 452ff4fac1 Add log diff checker tool 2026-05-11 17:04:10 +00:00
Mateusz Suski 5dde403ce3 Add incident log summary tool 2026-05-11 17:03:31 +00:00
Mateusz Suski 61483c233f Add Python tooling validation foundation 2026-05-11 17:02:35 +00:00
40 changed files with 6502 additions and 14 deletions
+3
@@ -28,6 +28,9 @@ jobs:
-P infra-run/scripts/bash/gpfs \
-P infra-run/scripts/bash/veritas
- name: Python syntax checks
run: bash scripts/check-python.sh
- name: yamllint
run: yamllint .
+10
@@ -41,6 +41,7 @@ Focused checks:
```bash
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```
@@ -64,6 +65,15 @@ Also run targeted checks for changed files, such as `bash -n`, `ansible-playbook
- Exit codes: `0` OK, `1` operational issue, `2` invalid input or missing dependency.
- Keep scripts readable; separate discovery, pre-check, change, post-check, and reporting when it helps.
## Python Standards
- Use Python for parsing, reporting, and structured operational tooling where it adds value over Bash.
- Keep Python tools read-only by default.
- Prefer the Python standard library.
- Avoid frameworks and unnecessary abstractions.
- Use clear operational output and meaningful exit codes.
- Keep tools small, focused, and easy to validate.
## Ansible Standards
- Keep playbooks short and roles simple.
+9
@@ -4,6 +4,13 @@
### Added
- Python tooling validation for operational scripts.
- `incident-log-summary` for general incident log summarization.
- `log-diff-checker` for pre-change and post-change log comparison.
- `auth-log-audit` for Linux authentication log review.
- `jvm-log-analyzer` for JVM application log summaries.
- `journal-analyzer` for exported `journalctl` log review.
- `known-error-matcher` with JSON-based known error patterns.
- Repository-level Codex guidance:
- `AGENTS.md`
- `docs/codex/README.md`
@@ -33,6 +40,8 @@
- Updated root, `infra-run`, Bash, Ansible, platform, and lab README guidance for safety-first usage, validation, and future Codex-driven work.
- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.
- Updated Python tooling documentation and repository roadmap.
- Integrated Python syntax validation into repository validation workflow and CI.
### Notes
+10 -1
@@ -33,6 +33,13 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
- [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
- [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
- [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
- [Incident log summary](./infra-run/scripts/python/incident-log-summary/) - read-only Python helper for local incident log pattern summaries.
- [Log diff checker](./infra-run/scripts/python/log-diff-checker/) - read-only Python helper for before/after change log comparison.
- [Auth log audit](./infra-run/scripts/python/auth-log-audit/) - read-only Python helper for local authentication log review.
- [JVM log analyzer](./infra-run/scripts/python/jvm-log-analyzer/) - read-only Python helper for local JVM and Java application log review.
- [Journal analyzer](./infra-run/scripts/python/journal-analyzer/) - read-only Python helper for exported `journalctl` text review.
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
## Planned Areas
@@ -78,10 +85,11 @@ Basic local validation:
./scripts/validate-repo.sh
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh
```
-The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Set `STRICT=1` to fail when optional tools are missing.
+The validation helpers run required lightweight checks and use optional tools such as `shellcheck`, `yamllint`, `ansible-playbook`, `ansible-lint`, and `markdownlint` when available. Python checks use `python3 -m py_compile` and do not require external Python tooling. Set `STRICT=1` to fail when optional tools are missing.
Some scripts depend on platform tools such as `vxdisk`, `hagrp`, `mmcrnsd`, and `mmlscluster`. Those commands are not expected to exist on a normal workstation, so functional testing against Veritas or GPFS requires a real lab environment.
@@ -90,6 +98,7 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
## Operational Areas Demonstrated
- Linux operations triage and reporting.
- Local operational log analysis with read-only Python helpers.
- Disk pressure and deleted-file incident analysis.
- Dry-run-first Bash automation.
- Controlled storage change workflow design.
+18 -2
@@ -16,6 +16,22 @@ This file keeps future portfolio ideas in one place so empty folders do not look
- Clustering: service group checks, failover review, and operational checklists.
- Monitoring: Zabbix-oriented alert review and host onboarding notes.
- Virtualization: VM lifecycle and platform operations examples.
-- Log analysis: ELK-style search examples for incident review.
+- Log analysis: optional ELK-style search case study under `platform-projects`, separate from current local Python helpers.
-Nothing in this roadmap should be read as completed implementation.
+## Implemented Portfolio Additions
- Python operational log analysis suite under `infra-run/scripts/python/`:
- `incident-log-summary`
- `log-diff-checker`
- `auth-log-audit`
- `jvm-log-analyzer`
- `journal-analyzer`
- `known-error-matcher`
## Future Python Tooling Ideas
- Real-world sample report examples using sanitized evidence.
- Integration examples that combine log summaries with change evidence collection.
- A shared Python helper library only if the standalone tools begin duplicating enough stable behavior to justify it.
Planned sections remain future work unless listed as implemented.
+21 -2
@@ -1,16 +1,34 @@
# infra-run
-`infra-run` is a sanitized infrastructure operations project. It contains Bash and Ansible examples based on Linux administration, incident response, storage operations, hardening, prechecks, postchecks, and controlled change workflows.
+`infra-run` is a sanitized infrastructure operations project. It contains Bash, Ansible, Python, and documentation examples based on Linux administration, incident response, storage operations, hardening, prechecks, postchecks, and controlled change workflows.
The goal is to show operational judgment, not to ship a universal automation product.
## Current Contents
### Bash Operational Scripts
- [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
- [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
- [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
- [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.
### Python Log And Reporting Tools
- [scripts/python](./scripts/python/) - read-only Python operational helpers using the standard library only.
- [scripts/python/incident-log-summary](./scripts/python/incident-log-summary/) - read-only Python log summary helper for incident pattern review.
- [scripts/python/log-diff-checker](./scripts/python/log-diff-checker/) - read-only Python before/after log comparison helper for change review.
- [scripts/python/auth-log-audit](./scripts/python/auth-log-audit/) - read-only Python authentication log audit helper for SSH, sudo, su, and PAM review.
- [scripts/python/jvm-log-analyzer](./scripts/python/jvm-log-analyzer/) - read-only Python JVM and Java application log analyzer for exception, stack trace, HTTP 5xx, database, and TLS review.
- [scripts/python/journal-analyzer](./scripts/python/journal-analyzer/) - read-only Python exported journal analyzer for failed units, restart patterns, OOM events, and service warnings.
- [scripts/python/known-error-matcher](./scripts/python/known-error-matcher/) - read-only Python matcher for local logs and JSON known-error catalogs with runbook references.
### Ansible Automation
- [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.
### Runbooks And Documentation
- [examples](./examples/) - sanitized sample command outputs and incident notes.
## Documentation
@@ -36,6 +54,7 @@ The goal is to show operational judgment, not to ship a universal automation pro
- Bash syntax can be checked locally.
- Shell scripts can be reviewed and partially exercised on a Linux workstation when platform commands are available or mocked.
- Disk-full read-only scripts can be run against local paths for basic behavior checks.
- Python log analysis examples can be run against sanitized sample logs under each tool directory.
- Ansible YAML and role structure can be linted locally.
## Running Safely
@@ -70,7 +89,7 @@ From the repository root:
./scripts/validate-repo.sh
```
-Focused checks are available in `scripts/check-bash.sh`, `scripts/check-ansible.sh`, and `scripts/check-docs.sh`. If `ansible-lint` reports collection-related issues, install the collections listed in [ansible/collections/requirements.yml](./ansible/collections/requirements.yml) and rerun it. Treat lint as a starting point; platform testing still requires actual target systems.
+Focused checks are available in `scripts/check-bash.sh`, `scripts/check-ansible.sh`, `scripts/check-python.sh`, and `scripts/check-docs.sh`. If `ansible-lint` reports collection-related issues, install the collections listed in [ansible/collections/requirements.yml](./ansible/collections/requirements.yml) and rerun it. Treat lint as a starting point; platform testing still requires actual target systems.
## Supporting Notes
+14
@@ -8,6 +8,20 @@ This file tracks planned `infra-run` additions without presenting them as comple
- A small Python parser for converting script output into a markdown change note.
- Additional Ansible molecule or container-based syntax checks where platform support is realistic.
- Standalone runbooks that reference the existing Bash workflows.
- Shared known-error pattern catalog review.
- Additional links between Python findings and existing runbooks.
- Change evidence collector for pre-check and post-check notes.
- Report examples suitable for incident and change tickets.
- Optional wrapper command only after the standalone Python tools stabilize.
## Implemented Additions
- `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files.
- `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review.
- `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review.
- `infra-run/scripts/python/jvm-log-analyzer/` - read-only JVM and Java application log analyzer for exceptions, stack traces, HTTP 5xx entries, database issues, TLS failures, and JVM failure symptoms.
- `infra-run/scripts/python/journal-analyzer/` - read-only exported `journalctl` text analyzer for summarizing failed units, dependency issues, restart patterns, OOM findings, disk/filesystem symptoms, and related service warnings.
- `infra-run/scripts/python/known-error-matcher/` - read-only known-error matcher for local logs and JSON pattern catalogs with severity, category, samples, and runbook references.
## Not Planned
+7 -6
@@ -1,6 +1,6 @@
# infra-run/scripts
-This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from future Python-based utilities while keeping both under one automation entry point.
+This directory groups executable tooling used across the `infra-run` project. It separates shell-first operational scripts from Python-based analysis utilities while keeping both under one automation entry point.
## Diagram
@@ -9,16 +9,17 @@ flowchart TD
A["scripts"] --> B["bash"]
A --> C["python"]
B --> D["Operational toolkits"]
-C --> E["Future helper utilities"]
+C --> E["Analysis helper utilities"]
```
## Scope
-- `bash` - current implementation area with operations toolkits.
-- `python` - reserved space for future supporting utilities.
+- [bash](./bash/) - operational toolkits for host health checks, disk-full triage, Veritas examples, and GPFS examples.
+- [python](./python/) - read-only tools for local log parsing, reporting, and structured operational analysis.
## Notes
-- The repository currently emphasizes Bash because it maps directly to day-to-day Linux operations.
-- The structure leaves room for higher-level helpers without mixing concerns.
+- Bash remains the right default for direct host checks and operational wrappers.
+- Python is used where parsing, report generation, comparison, or JSON output is clearer than shell.
- Bash tooling should remain safe by default, readable, and validated with `../../scripts/check-bash.sh` from the repository root.
- Python tooling should remain read-only by default, standard-library based, and validated with `../../scripts/check-python.sh` from the repository root.
+67 -3
@@ -1,5 +1,69 @@
-# python
-Planned area for small Python helpers.
-No Python tooling is implemented in `infra-run` yet.
+# Python Operational Tools
+This directory contains small Python utilities that support operational analysis in `infra-run`.
+Python is used here only when it adds practical value over Bash: parsing structured or noisy input, producing repeatable reports, comparing evidence, or emitting machine-readable output for later automation. Shell remains the default choice for direct host checks and simple command wrappers.
## Tools
| Tool | Path | Purpose | Typical use | Example command |
| --- | --- | --- | --- | --- |
| incident-log-summary | [incident-log-summary](./incident-log-summary/) | Summarize configured incident patterns from one local log file. | First-pass incident notes from system or application logs. | `python3 incident_log_summary.py --file examples/system-messages.log` |
| log-diff-checker | [log-diff-checker](./log-diff-checker/) | Compare configured patterns before and after a change. | Post-change review for new, increased, decreased, resolved, or unchanged log symptoms. | `python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log` |
| auth-log-audit | [auth-log-audit](./auth-log-audit/) | Summarize SSH, sudo, su, and PAM findings from local authentication logs. | Authentication incident review or access-control evidence gathering. | `python3 auth_log_audit.py --file examples/sample-auth.log` |
| jvm-log-analyzer | [jvm-log-analyzer](./jvm-log-analyzer/) | Summarize JVM exceptions, stack traces, HTTP 5xx entries, database issues, and TLS symptoms. | Java application support, restart review, or incident handoff evidence. | `python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log` |
| journal-analyzer | [journal-analyzer](./journal-analyzer/) | Summarize exported `journalctl` text for failed units, restart loops, OOM events, and service warnings. | Linux service incident review or patching/change evidence. | `python3 journal_analyzer.py --file examples/sample-journal.log` |
| known-error-matcher | [known-error-matcher](./known-error-matcher/) | Match local logs against a JSON known-error catalog. | Connect known symptoms to severity, category, samples, and runbook references. | `python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json` |
## Expected Use Cases
- Log parsing for incident review.
- Markdown or text report generation from collected evidence.
- Change evidence helpers for pre-check and post-check notes.
- Incident summary builders from sanitized inputs.
- Structured output for automation, such as JSON where useful.
## Standards
- Use the Python standard library only unless a later tool clearly justifies another dependency.
- Keep tools read-only by default.
- Do not perform destructive actions.
- Use `argparse` for command-line interfaces.
- Produce predictable text output suitable for terminal review and change notes.
- Support text, Markdown, and JSON output where useful for terminal review, tickets, or local automation.
- Use an `OK`, `WARNING`, `CRITICAL`, and `UNKNOWN` status model for findings.
- Handle malformed input, permission problems, and runtime errors defensively.
- Return meaningful exit codes.
- Keep each tool small, focused, and easy to review.
## Exit Codes
- `0` - OK, no findings, or successful validation.
- `1` - Operational findings detected.
- `2` - Invalid input, missing dependency, permission issue, or runtime error.
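
A minimal sketch of how a tool could map outcomes to these exit codes. The `run` wrapper and the `analyze` callable are hypothetical names used for illustration; they are not taken from the tools in this directory.

```python
import sys

EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2


def run(analyze, path):
    """Run analyze(path) and translate the outcome into an exit code.

    `analyze` is a hypothetical callable that returns a list of findings
    and raises OSError or ValueError on unreadable or invalid input.
    """
    try:
        findings = analyze(path)
    except (OSError, ValueError) as exc:
        # Report the problem on stderr so stdout stays usable for reports.
        print(f"ERROR: {exc}", file=sys.stderr)
        return EXIT_INVALID
    return EXIT_FINDINGS if findings else EXIT_OK
```

A script entry point could then end with `sys.exit(run(analyze, args.file))`, so callers and CI jobs can branch on the result.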
## Validation
From the repository root:
```bash
bash scripts/check-python.sh
bash scripts/validate-repo.sh
```
The checks use `python3 -m py_compile` and do not require external Python dependencies.
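
For illustration, the same kind of syntax check can be sketched directly with the standard-library `py_compile` module. This is an assumption about what such a check might look like, not the contents of `scripts/check-python.sh`; the function name `check_python_files` is hypothetical. Note that `py_compile.compile` writes bytecode to `__pycache__` next to each source file.

```python
import py_compile
from pathlib import Path


def check_python_files(root: Path) -> list[str]:
    """Return one error message per *.py file under root that fails to compile."""
    errors = []
    for path in sorted(root.rglob("*.py")):
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(str(path), doraise=True)
        except py_compile.PyCompileError as exc:
            errors.append(f"{path}: {exc.msg}")
    return errors
```

An empty list means every file compiled; a wrapper script could exit non-zero when the list is not empty.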
## Expected Tool Structure
Future tools should use a small self-contained layout:
```text
tool-name/
tool_name.py
README.md
examples/
sample-input.log
sample-report.md
```
Do not add package metadata, framework scaffolding, or external dependency files unless a future tool has a specific operational reason.
@@ -0,0 +1,190 @@
# auth-log-audit
`auth-log-audit` is a read-only Python CLI for reviewing local Linux authentication logs. It summarizes suspicious SSH, sudo, su, and PAM authentication patterns that may require operator review during incident response, hardening checks, or access-control evidence gathering.
The tool analyzes collected log files only. It does not modify logs, query remote systems, or prove compromise.
## When To Use
- During incident response when `/var/log/auth.log`, `/var/log/secure`, or an exported authentication log needs a quick first-pass summary.
- During Linux hardening or access review when repeated failures, invalid users, root login attempts, or sudo failures need to be surfaced.
- Before attaching authentication evidence to an incident, security, problem, or compliance review ticket.
- When JSON output is useful for local automation or repeatable reporting.
## What It Does
- Reads one local authentication log supplied with `--file`.
- Detects common SSH, sudo, su, and PAM authentication events.
- Extracts usernames, source IPs, authentication methods, services, timestamps, and sample raw lines where practical.
- Aggregates failed login counts by source IP and username.
- Flags suspicious source IPs and usernames when failed attempts meet the configured threshold.
- Produces text, Markdown, or JSON output.
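
The aggregation and threshold steps can be sketched as follows. This is a simplified illustration, assuming only failed SSH password lines are counted and that "meet the threshold" means greater than or equal; the helper name `suspicious_sources` is hypothetical, and the regular expression mirrors the failed-password pattern in `auth_log_audit.py`.

```python
import re
from collections import Counter

FAILED_PASSWORD_RE = re.compile(
    r"sshd(?:\[\d+\])?: Failed password for (?:(invalid user) )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
)


def suspicious_sources(lines, threshold=5):
    """Count failed password attempts per source IP and keep those at or above threshold."""
    failed_by_ip = Counter()
    for line in lines:
        match = FAILED_PASSWORD_RE.search(line)
        if match:
            # Group 3 is the IPv4 source address.
            failed_by_ip[match.group(3)] += 1
    return {ip: count for ip, count in failed_by_ip.items() if count >= threshold}
```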
## What It Does Not Do
- It does not detect breaches or prove compromise.
- It does not read remote systems or live journal streams.
- It does not modify logs, accounts, SSH configuration, sudoers, or host state.
- It does not query SIEM, SOC tooling, ELK, Zabbix, identity providers, or ticketing systems.
- It does not replace host-specific incident response, access review, or forensic procedures.
- It does not classify every vendor-specific authentication message.
## Supported Input Types
- Debian/Ubuntu-style `/var/log/auth.log`.
- RHEL/Oracle Linux-style `/var/log/secure`.
- Exported authentication logs with similar syslog-style lines.
- UTF-8 text input is expected. Invalid byte sequences are replaced during read so review can continue.
Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
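
A minimal sketch of this input handling, assuming the caller maps `ValueError` and `OSError` to exit code `2`. The function name `read_log_lines` is hypothetical and is not taken from the tool source.

```python
from pathlib import Path


def read_log_lines(path_text: str) -> list[str]:
    """Validate a local log path and return its lines, replacing invalid bytes."""
    path = Path(path_text)
    if not path.is_file():
        # Covers missing paths, directories, and other non-file inputs.
        raise ValueError(f"not a regular file: {path}")
    # errors="replace" substitutes U+FFFD for invalid UTF-8 so review can continue.
    # A permission problem surfaces here as OSError.
    text = path.read_text(encoding="utf-8", errors="replace")
    if not text.strip():
        raise ValueError(f"input file is empty: {path}")
    return text.splitlines()
```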
## Supported Event Categories
SSH-related:
- Failed SSH password login.
- Failed SSH publickey login.
- Successful SSH login.
- Invalid user attempts.
- Root login attempts.
- Refused or disallowed user attempts.
- Disconnects after failed authentication where detectable.
- Too many authentication failures where detectable.
sudo and su-related:
- sudo command usage.
- sudo authentication failure.
- su session opened.
- su authentication failure.
Generic authentication:
- authentication failure.
- `pam_unix` authentication failure.
- Account locked messages where detectable.
- User not known to the underlying authentication module.
## Timestamp Handling
The scanner attempts to parse:
- `May 11 10:15:30`
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and first seen / last seen values are reported as `UNKNOWN` when no parseable event timestamps are found. Syslog timestamps without a year use the current local year internally while preserving the original timestamp shape in text and Markdown output.
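
The best-effort parsing described above might look like the sketch below. The two regular expressions match those defined in `auth_log_audit.py`; the function `parse_timestamp` and its `default_year` parameter are hypothetical. Month-name parsing with `%b` depends on the process locale, so this assumes English month abbreviations.

```python
import re
from datetime import datetime

ISO_TIMESTAMP_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def parse_timestamp(line, default_year=None):
    """Return a datetime for the line, or None when no timestamp can be parsed."""
    iso = ISO_TIMESTAMP_RE.search(line)
    if iso:
        try:
            return datetime.strptime(f"{iso.group(1)} {iso.group(2)}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            return None
    syslog = SYSLOG_TIMESTAMP_RE.match(line)
    if syslog:
        # Syslog lines carry no year, so one is supplied (current year by default).
        year = default_year or datetime.now().year
        normalized = " ".join(syslog.group(1).split())
        try:
            return datetime.strptime(f"{year} {normalized}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None
    return None
```

Returning `None` instead of raising keeps the line available for event matching even when its timestamp cannot be read.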
## Suspicious Activity Model
Default threshold:
```text
--threshold-failed 5
```
The report classifies findings conservatively:
- `OK` - no suspicious findings.
- `WARNING` - repeated failed logins, invalid users, root login attempts below the threshold, or sudo authentication failures.
- `CRITICAL` - root login attempts above threshold, high-volume brute-force indicators, or multiple suspicious source IPs above threshold.
This status is a triage signal. It identifies suspicious authentication patterns that require review; it does not confirm a breach.
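
One way to read this model as code is sketched below. The exact rules in `auth_log_audit.py` are not shown in this README, so the conditions here are assumptions: "multiple suspicious source IPs" is taken as more than one, "repeated failed logins" as more than one, and the high-volume brute-force indicator is left out. The function name `classify_status` is hypothetical.

```python
def classify_status(*, root_attempts, suspicious_ips, failed_logins,
                    invalid_users, sudo_failures, threshold=5):
    """Return OK, WARNING, or CRITICAL from summary counts (assumed rules)."""
    # CRITICAL: root attempts above the threshold, or more than one suspicious IP.
    if root_attempts > threshold or suspicious_ips > 1:
        return "CRITICAL"
    # WARNING: any remaining signal that still needs operator review.
    if any((root_attempts, suspicious_ips, failed_logins > 1, invalid_users, sudo_failures)):
        return "WARNING"
    return "OK"
```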
## Usage
```bash
cd infra-run/scripts/python/auth-log-audit
python3 auth_log_audit.py --file examples/sample-auth.log
python3 auth_log_audit.py --file examples/sample-secure.log
python3 auth_log_audit.py --file examples/sample-auth.log --format markdown
python3 auth_log_audit.py --file examples/sample-auth.log --format markdown --output auth-report.md
python3 auth_log_audit.py --file examples/sample-auth.log --format json
python3 auth_log_audit.py --file examples/sample-auth.log --top 10
python3 auth_log_audit.py --file examples/sample-auth.log --threshold-failed 5
python3 auth_log_audit.py --file examples/sample-auth.log --ignore-users monitoring,backup,ansible
```
Ignored users are excluded from suspicious username threshold findings. Their events are still counted in totals and can still appear in top-user summaries so operational context is not silently hidden.
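
The effect of `--ignore-users` can be sketched like this, assuming a comma-separated value and a greater-than-or-equal threshold comparison. Both helper names are hypothetical.

```python
from collections import Counter


def parse_ignore_users(value):
    """Split a comma-separated --ignore-users value into a set of names."""
    return {name.strip() for name in value.split(",") if name.strip()}


def suspicious_usernames(failed_by_user, threshold, ignored):
    """Apply the threshold, skipping ignored users; the input counts are not modified."""
    return {
        user: count
        for user, count in failed_by_user.items()
        if count >= threshold and user not in ignored
    }
```

Because the counter itself is left untouched, totals and top-user summaries built from it still include the ignored accounts.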
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident or security ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file.
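
A small sketch of the input/output collision check, assuming both paths are compared after resolution. The helper name `output_conflicts_with_input` is hypothetical; `Path.resolve()` normalizes `..` segments and follows symlinks where the path exists.

```python
from pathlib import Path


def output_conflicts_with_input(input_path, output_path):
    """Return True when the report path resolves to the input log itself."""
    return Path(output_path).resolve() == Path(input_path).resolve()
```

Rejecting such a path before writing protects the collected evidence from being overwritten by its own report.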
## Exit Codes
- `0` - OK, no suspicious findings.
- `1` - Suspicious findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
## Example Text Output
```text
Auth Log Audit
==============
Overall status: WARNING
First seen: May 11 09:58:12
Last seen: May 11 10:07:48
Top Source IPs by Failed Attempts
---------------------------------
- 203.0.113.50: 7
- 198.51.100.23: 1
Suspicious Source IPs
---------------------
- 203.0.113.50: 7
Operational Summary
-------------------
Overall status: WARNING
Total lines scanned: 15
Authentication events detected: 15
Failed logins: 8
Successful logins: 1
Invalid user attempts: 1
Root login attempts: 2
Sudo usage events: 1
Sudo authentication failures: 1
Suspicious source IPs: 1
Suspicious usernames: 0
Threshold used: 5
Ignored users: None
```
## Markdown Workflow
Generate a Markdown report from a collected authentication log and attach it to the incident or security ticket as supporting evidence:
```bash
python3 auth_log_audit.py \
--file examples/sample-auth.log \
--format markdown \
--output auth-report.md
```
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be reviewed with host access history, SSH configuration, sudo policy, user ownership, and any relevant monitoring evidence.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single line may produce more than one event when PAM and service messages overlap.
- Syslog timestamps without a year are normalized internally with the current local year.
- Source IP extraction is IPv4-oriented.
- The tool compares counts, not rates, authentication windows, geolocation, or identity context.
- Large log files are read into memory; collect scoped extracts for very large incidents.
- Vendor-specific PAM modules or SSH daemon formats may need future patterns.
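
To illustrate the IPv4-oriented limitation: the pattern below is the `IP_RE` expression from `auth_log_audit.py`, which matches any dotted quad and never matches IPv6 text. The extra validation with the standard-library `ipaddress` module is an assumption added for this sketch, and `extract_ipv4` is a hypothetical helper.

```python
import ipaddress
import re

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ipv4(line):
    """Return the first valid IPv4 address in the line, or None."""
    for candidate in IP_RE.findall(line):
        try:
            # Rejects out-of-range octets such as 999.1.1.1.
            return str(ipaddress.IPv4Address(candidate))
        except ipaddress.AddressValueError:
            continue
    return None
```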
## Safety Notes
- The tool only reads the input log and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not prove compromise or determine root cause automatically.
@@ -0,0 +1,734 @@
#!/usr/bin/env python3
"""Summarize suspicious authentication activity in local Linux auth logs."""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter, defaultdict
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
ISO_TIMESTAMP_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
SERVICE_RE = re.compile(r"\s([A-Za-z0-9_.-]+)(?:\[\d+\])?:\s")
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EVENT_PATTERNS = [
{
"event_type": "failed_ssh_password",
"category": "failed_login",
"method": "password",
"regex": re.compile(
r"sshd(?:\[\d+\])?: Failed password for (?:(invalid user) )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
),
},
{
"event_type": "failed_ssh_publickey",
"category": "failed_login",
"method": "publickey",
"regex": re.compile(
r"sshd(?:\[\d+\])?: Failed publickey for (?:(invalid user) )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
),
},
{
"event_type": "successful_ssh_login",
"category": "successful_login",
"method": None,
"regex": re.compile(
r"sshd(?:\[\d+\])?: Accepted (\S+) for (\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
),
},
{
"event_type": "invalid_user_attempt",
"category": "invalid_user",
"method": None,
"regex": re.compile(
r"sshd(?:\[\d+\])?: Invalid user (\S+) from ((?:\d{1,3}\.){3}\d{1,3})"
),
},
{
"event_type": "refused_user_attempt",
"category": "refused_user",
"method": None,
"regex": re.compile(
r"sshd(?:\[\d+\])?: (?:User|Connection closed by invalid user) (\S+).*?from ((?:\d{1,3}\.){3}\d{1,3})"
),
},
{
"event_type": "disconnect_after_failed_auth",
"category": "disconnect_after_failed_auth",
"method": None,
"regex": re.compile(
r"sshd(?:\[\d+\])?: Disconnected from (?:authenticating user \S+ |invalid user \S+ )?((?:\d{1,3}\.){3}\d{1,3}).*(?:preauth|Too many authentication failures)"
),
},
{
"event_type": "too_many_auth_failures",
"category": "failed_login",
"method": None,
"regex": re.compile(
r"sshd(?:\[\d+\])?: .*(?:Too many authentication failures|maximum authentication attempts exceeded).*"
),
},
{
"event_type": "sudo_command",
"category": "sudo_usage",
"method": None,
"regex": re.compile(r"sudo(?:\[\d+\])?:\s+(\S+)\s+:\s+TTY=.*COMMAND=(.+)$"),
},
{
"event_type": "sudo_auth_failure",
"category": "sudo_failure",
"method": None,
"regex": re.compile(r"sudo(?:\[\d+\])?: pam_unix\(sudo:auth\): authentication failure;.*"),
},
{
"event_type": "su_session_opened",
"category": "su_event",
"method": None,
"regex": re.compile(r"su(?:\[\d+\])?: pam_unix\(su(?:-l)?:session\): session opened for user (\S+)"),
},
{
"event_type": "su_auth_failure",
"category": "su_event",
"method": None,
"regex": re.compile(r"su(?:\[\d+\])?: pam_unix\(su(?:-l)?:auth\): authentication failure;.*"),
},
{
"event_type": "pam_unix_auth_failure",
"category": "generic_auth_failure",
"method": None,
"regex": re.compile(r"pam_unix\([^)]*:auth\): authentication failure;.*"),
},
{
"event_type": "user_unknown",
"category": "generic_auth_failure",
"method": None,
# PAM capitalizes this message ("User not known ..."), so match case-insensitively.
"regex": re.compile(r"user (?:unknown|not known to the underlying authentication module)", re.IGNORECASE),
},
{
"event_type": "account_locked",
"category": "generic_auth_failure",
"method": None,
"regex": re.compile(r"account locked", re.IGNORECASE),
},
]
FAILED_CATEGORIES = {"failed_login", "generic_auth_failure"}
SAMPLE_CATEGORIES = [
"failed_login",
"invalid_user",
"root_login_attempt",
"sudo_failure",
"suspicious_source_ip",
]
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(
description="Analyze local Linux authentication logs for suspicious patterns."
)
parser.add_argument("--file", required=True, help="Local auth.log or secure file to analyze.")
parser.add_argument(
"--format",
choices=("text", "markdown", "json"),
default="text",
help="Report format. Default: text.",
)
parser.add_argument("--output", help="Write report to this path instead of stdout.")
parser.add_argument(
"--top",
type=positive_int,
default=10,
help="Number of top IPs, usernames, and event types to display. Default: 10.",
)
parser.add_argument(
"--threshold-failed",
type=positive_int,
default=5,
help="Failed attempt threshold for suspicious IPs and usernames. Default: 5.",
)
parser.add_argument(
"--ignore-users",
default="",
help="Comma-separated usernames excluded from suspicious username thresholds.",
)
parser.add_argument(
"--max-samples",
type=non_negative_int,
default=3,
help="Maximum sample lines per finding category. Default: 3.",
)
return parser
def positive_int(value: str) -> int:
try:
number = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError("must be a positive integer") from exc
if number <= 0:
raise argparse.ArgumentTypeError("must be a positive integer")
return number
def non_negative_int(value: str) -> int:
try:
number = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
if number < 0:
raise argparse.ArgumentTypeError("must be zero or a positive integer")
return number
def parse_ignore_users(value: str) -> list[str]:
if not value.strip():
return []
users = []
for item in value.split(","):
user = item.strip()
if user:
users.append(user)
return sorted(set(users))
def read_log_file(path: Path) -> list[str]:
if not path.exists():
raise OSError(f"file does not exist: {path}")
if not path.is_file():
raise OSError(f"path is not a regular file: {path}")
try:
text = path.read_text(encoding="utf-8", errors="replace")
except PermissionError as exc:
raise OSError(f"file is not readable: {path}") from exc
except OSError as exc:
raise OSError(f"unable to read file {path}: {exc}") from exc
if text == "":
raise ValueError(f"file is empty: {path}")
return text.splitlines()
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
iso_match = ISO_TIMESTAMP_RE.search(line)
if iso_match:
raw = f"{iso_match.group(1)} {iso_match.group(2)}"
try:
return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S"), raw
except ValueError:
return None, UNKNOWN
syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
if syslog_match:
raw = syslog_match.group(1)
normalized = f"{syslog_year} {raw}"
try:
parsed = datetime.strptime(normalized, "%Y %b %d %H:%M:%S")
except ValueError:
return None, UNKNOWN
return parsed, raw
return None, UNKNOWN
def render_seen(value: tuple[datetime, str] | None) -> str:
if value is None:
return UNKNOWN
return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
def extract_service(line: str) -> str:
match = SERVICE_RE.search(line)
if match:
return match.group(1)
return UNKNOWN
def extract_ip(line: str) -> str:
match = IP_RE.search(line)
if match:
return match.group(0)
return UNKNOWN
def extract_user_from_key_values(line: str) -> str:
for pattern in (
r"\buser=([A-Za-z0-9_.@-]+)",
r"\bruser=([A-Za-z0-9_.@-]+)",
r"\bUSER=([A-Za-z0-9_.@-]+)",
):
match = re.search(pattern, line)
if match and match.group(1):
return match.group(1)
return UNKNOWN
def event_from_match(line: str, pattern: dict[str, Any], match: re.Match[str]) -> dict[str, Any]:
event_type = pattern["event_type"]
username = UNKNOWN
source_ip = extract_ip(line)
method = pattern["method"] or UNKNOWN
if event_type in ("failed_ssh_password", "failed_ssh_publickey"):
username = match.group(2)
source_ip = match.group(3)
elif event_type == "successful_ssh_login":
method = match.group(1)
username = match.group(2)
source_ip = match.group(3)
elif event_type in ("invalid_user_attempt", "refused_user_attempt"):
username = match.group(1)
source_ip = match.group(2)
elif event_type == "sudo_command":
username = match.group(1)
elif event_type == "su_session_opened":
# Newer PAM output logs "root(uid=0)"; keep only the account name.
username = match.group(1).split("(", 1)[0]
elif event_type in ("sudo_auth_failure", "su_auth_failure", "pam_unix_auth_failure"):
username = extract_user_from_key_values(line)
if username == "root" and event_type in (
"failed_ssh_password",
"failed_ssh_publickey",
"successful_ssh_login",
"invalid_user_attempt",
"refused_user_attempt",
):
event_type = "root_login_attempt"
return {
"event_type": event_type,
"category": pattern["category"],
"username": username or UNKNOWN,
"source_ip": source_ip or UNKNOWN,
"method": method,
"service": extract_service(line),
"raw": line,
}
def detect_events(line: str) -> list[dict[str, Any]]:
events = []
for pattern in EVENT_PATTERNS:
match = pattern["regex"].search(line)
if match:
events.append(event_from_match(line, pattern, match))
if any(event["event_type"] in ("sudo_auth_failure", "su_auth_failure") for event in events):
events = [
event for event in events if event["event_type"] != "pam_unix_auth_failure"
]
if "authentication failure" in line and not events:
events.append(
{
"event_type": "authentication_failure",
"category": "generic_auth_failure",
"username": extract_user_from_key_values(line),
"source_ip": extract_ip(line),
"method": UNKNOWN,
"service": extract_service(line),
"raw": line,
}
)
return dedupe_events(events)
def dedupe_events(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
deduped = []
seen = set()
for event in events:
key = (event["event_type"], event["username"], event["source_ip"], event["raw"])
if key in seen:
continue
seen.add(key)
deduped.append(event)
return deduped
def append_sample(samples: dict[str, list[str]], category: str, line: str, max_samples: int) -> None:
if max_samples == 0:
return
if len(samples[category]) < max_samples:
samples[category].append(line)
def update_seen(
first_seen: tuple[datetime, str] | None,
last_seen: tuple[datetime, str] | None,
parsed_at: datetime | None,
rendered_at: str,
) -> tuple[tuple[datetime, str] | None, tuple[datetime, str] | None]:
if parsed_at is None:
return first_seen, last_seen
if first_seen is None or parsed_at < first_seen[0]:
first_seen = (parsed_at, rendered_at)
if last_seen is None or parsed_at > last_seen[0]:
last_seen = (parsed_at, rendered_at)
return first_seen, last_seen
def analyze_log(
lines: list[str],
threshold_failed: int,
ignore_users: list[str],
top: int,
max_samples: int,
) -> dict[str, Any]:
syslog_year = datetime.now().year
events = []
samples: dict[str, list[str]] = defaultdict(list)
event_type_counts: Counter[str] = Counter()
failed_by_ip: Counter[str] = Counter()
failed_by_user: Counter[str] = Counter()
success_by_ip: Counter[str] = Counter()
success_by_user: Counter[str] = Counter()
first_seen: tuple[datetime, str] | None = None
last_seen: tuple[datetime, str] | None = None
for line in lines:
parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
line_events = detect_events(line)
if not line_events:
continue
first_seen, last_seen = update_seen(first_seen, last_seen, parsed_at, rendered_at)
for event in line_events:
event["timestamp"] = rendered_at
events.append(event)
event_type_counts[event["event_type"]] += 1
category = event["category"]
username = event["username"]
source_ip = event["source_ip"]
if event["event_type"] == "root_login_attempt":
append_sample(samples, "root_login_attempt", line, max_samples)
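# Root attempts count as failed logins unless the login actually succeeded.
category = category if category == "successful_login" else "failed_login"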
if category in FAILED_CATEGORIES:
if source_ip != UNKNOWN:
failed_by_ip[source_ip] += 1
if username != UNKNOWN:
failed_by_user[username] += 1
append_sample(samples, "failed_login", line, max_samples)
if category == "successful_login":
if source_ip != UNKNOWN:
success_by_ip[source_ip] += 1
if username != UNKNOWN:
success_by_user[username] += 1
if category == "invalid_user":
append_sample(samples, "invalid_user", line, max_samples)
if category == "sudo_failure":
append_sample(samples, "sudo_failure", line, max_samples)
suspicious_ips = {
ip: count for ip, count in failed_by_ip.items() if count >= threshold_failed
}
suspicious_users = {
user: count
for user, count in failed_by_user.items()
if count >= threshold_failed and user not in ignore_users
}
for event in events:
if event["source_ip"] in suspicious_ips:
append_sample(samples, "suspicious_source_ip", event["raw"], max_samples)
summary = build_summary(
lines=lines,
events=events,
failed_by_ip=failed_by_ip,
failed_by_user=failed_by_user,
suspicious_ips=suspicious_ips,
suspicious_users=suspicious_users,
event_type_counts=event_type_counts,
threshold_failed=threshold_failed,
ignore_users=ignore_users,
first_seen=first_seen,
last_seen=last_seen,
)
return {
"summary": summary,
"top_source_ips_by_failed_attempts": top_items(failed_by_ip, top),
"top_usernames_by_failed_attempts": top_items(failed_by_user, top),
"top_source_ips_by_successful_logins": top_items(success_by_ip, top),
"top_usernames_by_successful_logins": top_items(success_by_user, top),
"top_event_types": top_items(event_type_counts, top),
"suspicious_source_ips": sorted_count_items(suspicious_ips),
"suspicious_usernames": sorted_count_items(suspicious_users),
"samples": {category: samples.get(category, []) for category in SAMPLE_CATEGORIES},
}
def build_summary(
lines: list[str],
events: list[dict[str, Any]],
failed_by_ip: Counter[str],
failed_by_user: Counter[str],
suspicious_ips: dict[str, int],
suspicious_users: dict[str, int],
event_type_counts: Counter[str],
threshold_failed: int,
ignore_users: list[str],
first_seen: tuple[datetime, str] | None,
last_seen: tuple[datetime, str] | None,
) -> dict[str, Any]:
root_attempts = event_type_counts["root_login_attempt"]
sudo_failures = event_type_counts["sudo_auth_failure"]
invalid_users = event_type_counts["invalid_user_attempt"]
high_volume_ips = sum(1 for count in suspicious_ips.values() if count >= threshold_failed * 2)
high_volume_users = sum(1 for count in suspicious_users.values() if count >= threshold_failed * 2)
if (
root_attempts >= threshold_failed
or high_volume_ips > 0
or high_volume_users > 0
or len(suspicious_ips) >= 2
):
status = "CRITICAL"
elif suspicious_ips or suspicious_users or invalid_users > 0 or sudo_failures > 0 or root_attempts > 0:
status = "WARNING"
else:
status = "OK"
return {
"overall_status": status,
"first_seen": render_seen(first_seen),
"last_seen": render_seen(last_seen),
"total_lines_scanned": len(lines),
"authentication_events_detected": len(events),
"failed_login_count": sum(failed_by_ip.values()),
# Count by category so successful root logins (relabeled root_login_attempt) are included.
"successful_login_count": sum(1 for event in events if event["category"] == "successful_login"),
"invalid_user_count": invalid_users,
"root_login_attempt_count": root_attempts,
"sudo_command_count": event_type_counts["sudo_command"],
"sudo_failure_count": sudo_failures,
"su_event_count": event_type_counts["su_session_opened"] + event_type_counts["su_auth_failure"],
"suspicious_source_ip_count": len(suspicious_ips),
"suspicious_username_count": len(suspicious_users),
"threshold_failed": threshold_failed,
"ignored_users": ignore_users,
}
def top_items(counter: Counter[str], limit: int) -> list[dict[str, Any]]:
return [{"value": value, "count": count} for value, count in counter.most_common(limit)]
def sorted_count_items(items: dict[str, int]) -> list[dict[str, Any]]:
return [
{"value": value, "count": count}
for value, count in sorted(items.items(), key=lambda item: (-item[1], item[0]))
]
def render_text(report: dict[str, Any]) -> str:
summary = report["summary"]
lines = [
"Auth Log Audit",
"==============",
"",
f"Overall status: {summary['overall_status']}",
f"First seen: {summary['first_seen']}",
f"Last seen: {summary['last_seen']}",
"",
]
lines.extend(render_text_table("Top Source IPs by Failed Attempts", report["top_source_ips_by_failed_attempts"]))
lines.extend(render_text_table("Top Usernames by Failed Attempts", report["top_usernames_by_failed_attempts"]))
lines.extend(render_text_table("Top Source IPs by Successful Logins", report["top_source_ips_by_successful_logins"]))
lines.extend(render_text_table("Top Usernames by Successful Logins", report["top_usernames_by_successful_logins"]))
lines.extend(render_text_table("Suspicious Source IPs", report["suspicious_source_ips"]))
lines.extend(render_text_table("Suspicious Usernames", report["suspicious_usernames"]))
lines.extend(render_text_table("Top Event Types", report["top_event_types"]))
lines.extend(render_text_samples(report["samples"]))
lines.extend(render_text_summary(summary))
return "\n".join(lines) + "\n"
def render_text_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
lines = [title, "-" * len(title)]
if not rows:
lines.append("No entries detected.")
else:
for item in rows:
lines.append(f"- {item['value']}: {item['count']}")
lines.append("")
return lines
def render_text_samples(samples: dict[str, list[str]]) -> list[str]:
lines = ["Sample Log Lines", "----------------"]
for category in SAMPLE_CATEGORIES:
lines.append(f"{category}:")
if samples.get(category):
lines.extend(f" - {sample}" for sample in samples[category])
else:
lines.append(" - No samples retained")
lines.append("")
return lines
def render_text_summary(summary: dict[str, Any]) -> list[str]:
ignored = ", ".join(summary["ignored_users"]) if summary["ignored_users"] else "None"
return [
"Operational Summary",
"-------------------",
f"Overall status: {summary['overall_status']}",
f"Total lines scanned: {summary['total_lines_scanned']}",
f"Authentication events detected: {summary['authentication_events_detected']}",
f"Failed logins: {summary['failed_login_count']}",
f"Successful logins: {summary['successful_login_count']}",
f"Invalid user attempts: {summary['invalid_user_count']}",
f"Root login attempts: {summary['root_login_attempt_count']}",
f"Sudo usage events: {summary['sudo_command_count']}",
f"Sudo authentication failures: {summary['sudo_failure_count']}",
f"su events: {summary['su_event_count']}",
f"Suspicious source IPs: {summary['suspicious_source_ip_count']}",
f"Suspicious usernames: {summary['suspicious_username_count']}",
f"Threshold used: {summary['threshold_failed']}",
f"Ignored users: {ignored}",
]
def render_markdown(report: dict[str, Any]) -> str:
summary = report["summary"]
lines = [
"# Auth Log Audit",
"",
f"- Overall status: {summary['overall_status']}",
f"- First seen: {summary['first_seen']}",
f"- Last seen: {summary['last_seen']}",
"",
]
lines.extend(render_markdown_table("Top Source IPs by Failed Attempts", report["top_source_ips_by_failed_attempts"]))
lines.extend(render_markdown_table("Top Usernames by Failed Attempts", report["top_usernames_by_failed_attempts"]))
lines.extend(render_markdown_table("Top Source IPs by Successful Logins", report["top_source_ips_by_successful_logins"]))
lines.extend(render_markdown_table("Top Usernames by Successful Logins", report["top_usernames_by_successful_logins"]))
lines.extend(render_markdown_table("Suspicious Source IPs", report["suspicious_source_ips"]))
lines.extend(render_markdown_table("Suspicious Usernames", report["suspicious_usernames"]))
lines.extend(render_markdown_table("Top Event Types", report["top_event_types"]))
lines.extend(render_markdown_samples(report["samples"]))
ignored = ", ".join(summary["ignored_users"]) if summary["ignored_users"] else "None"
lines.extend(
[
"## Operational Summary",
"",
f"- Overall status: {summary['overall_status']}",
f"- Total lines scanned: {summary['total_lines_scanned']}",
f"- Authentication events detected: {summary['authentication_events_detected']}",
f"- Failed logins: {summary['failed_login_count']}",
f"- Successful logins: {summary['successful_login_count']}",
f"- Invalid user attempts: {summary['invalid_user_count']}",
f"- Root login attempts: {summary['root_login_attempt_count']}",
f"- Sudo usage events: {summary['sudo_command_count']}",
f"- Sudo authentication failures: {summary['sudo_failure_count']}",
f"- su events: {summary['su_event_count']}",
f"- Suspicious source IPs: {summary['suspicious_source_ip_count']}",
f"- Suspicious usernames: {summary['suspicious_username_count']}",
f"- Threshold used: {summary['threshold_failed']}",
f"- Ignored users: {ignored}",
"",
]
)
return "\n".join(lines)
def render_markdown_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
lines = [f"## {title}", ""]
if not rows:
lines.extend(["No entries detected.", ""])
return lines
lines.extend(["| Value | Count |", "| --- | ---: |"])
lines.extend(f"| {item['value']} | {item['count']} |" for item in rows)
lines.append("")
return lines
def render_markdown_samples(samples: dict[str, list[str]]) -> list[str]:
lines = ["## Sample Log Lines", ""]
for category in SAMPLE_CATEGORIES:
lines.extend([f"### {category}", ""])
if samples.get(category):
lines.append("```text")
lines.extend(samples[category])
lines.append("```")
else:
lines.append("_No samples retained._")
lines.append("")
return lines
def render_json(report: dict[str, Any]) -> str:
return json.dumps(report, indent=2, sort_keys=True) + "\n"
def write_report(input_path: Path, output_path: str | None, content: str) -> None:
if output_path is None:
sys.stdout.write(content)
return
path = Path(output_path)
try:
if path.resolve() == input_path.resolve():
raise OSError("output path must not be the same as input file")
path.write_text(content, encoding="utf-8")
except OSError as exc:
raise OSError(f"unable to write output {path}: {exc}") from exc
def main() -> int:
parser = build_parser()
args = parser.parse_args()
input_path = Path(args.file)
ignore_users = parse_ignore_users(args.ignore_users)
try:
lines = read_log_file(input_path)
report = analyze_log(
lines=lines,
threshold_failed=args.threshold_failed,
ignore_users=ignore_users,
top=args.top,
max_samples=args.max_samples,
)
if args.format == "text":
content = render_text(report)
elif args.format == "markdown":
content = render_markdown(report)
else:
content = render_json(report)
write_report(input_path, args.output, content)
except (OSError, ValueError) as exc:
print(f"CRITICAL: {exc}", file=sys.stderr)
return EXIT_INVALID
except RuntimeError as exc:
print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
return EXIT_INVALID
if report["summary"]["overall_status"] == "OK":
return EXIT_OK
return EXIT_FINDINGS
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,112 @@
# Auth Log Audit
- Overall status: WARNING
- First seen: May 11 09:58:12
- Last seen: May 11 10:07:48
## Top Source IPs by Failed Attempts
| Value | Count |
| --- | ---: |
| 203.0.113.50 | 7 |
| 198.51.100.23 | 1 |
## Top Usernames by Failed Attempts
| Value | Count |
| --- | ---: |
| appuser | 3 |
| root | 2 |
| admin | 1 |
| backup | 1 |
## Top Source IPs by Successful Logins
| Value | Count |
| --- | ---: |
| 10.20.30.15 | 1 |
## Top Usernames by Successful Logins
| Value | Count |
| --- | ---: |
| deploy | 1 |
## Suspicious Source IPs
| Value | Count |
| --- | ---: |
| 203.0.113.50 | 7 |
## Suspicious Usernames
No entries detected.
## Top Event Types
| Value | Count |
| --- | ---: |
| failed_ssh_password | 4 |
| root_login_attempt | 2 |
| successful_ssh_login | 1 |
| sudo_command | 1 |
| invalid_user_attempt | 1 |
| disconnect_after_failed_auth | 1 |
| failed_ssh_publickey | 1 |
| sudo_auth_failure | 1 |
| su_session_opened | 1 |
| refused_user_attempt | 1 |
## Sample Log Lines
### failed_login
```text
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```
### invalid_user
```text
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
```
### root_login_attempt
```text
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```
### sudo_failure
```text
May 11 10:04:20 web01 sudo: pam_unix(sudo:auth): authentication failure; logname=deploy uid=1001 euid=0 tty=/dev/pts/0 ruser=deploy rhost= user=deploy
```
### suspicious_source_ip
```text
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
```
## Operational Summary
- Overall status: WARNING
- Total lines scanned: 15
- Authentication events detected: 15
- Failed logins: 8
- Successful logins: 1
- Invalid user attempts: 1
- Root login attempts: 2
- Sudo usage events: 1
- Sudo authentication failures: 1
- su events: 1
- Suspicious source IPs: 1
- Suspicious usernames: 0
- Threshold used: 5
- Ignored users: None
@@ -0,0 +1,15 @@
May 11 09:58:12 web01 sshd[1201]: Accepted publickey for deploy from 10.20.30.15 port 52214 ssh2: ED25519 SHA256:samplekey
May 11 10:00:01 web01 sudo: deploy : TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/usr/bin/systemctl status nginx
May 11 10:01:44 web01 sshd[1220]: Failed password for invalid user admin from 203.0.113.50 port 45001 ssh2
May 11 10:01:46 web01 sshd[1220]: Invalid user admin from 203.0.113.50 port 45001
May 11 10:02:03 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:06 web01 sshd[1224]: Failed password for root from 203.0.113.50 port 45012 ssh2
May 11 10:02:11 web01 sshd[1224]: Disconnected from authenticating user root 203.0.113.50 port 45012 [preauth]
May 11 10:03:10 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:14 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:18 web01 sshd[1231]: Failed password for appuser from 203.0.113.50 port 45101 ssh2
May 11 10:03:41 web01 sshd[1238]: Failed publickey for backup from 198.51.100.23 port 50222 ssh2
May 11 10:04:20 web01 sudo: pam_unix(sudo:auth): authentication failure; logname=deploy uid=1001 euid=0 tty=/dev/pts/0 ruser=deploy rhost= user=deploy
May 11 10:05:02 web01 su[1244]: pam_unix(su:session): session opened for user root by deploy(uid=1001)
May 11 10:06:31 web01 sshd[1250]: User testuser from 192.0.2.77 not allowed because not listed in AllowUsers
May 11 10:07:48 web01 sshd[1254]: error: maximum authentication attempts exceeded for invalid user oracle from 203.0.113.50 port 45200 ssh2 [preauth]
@@ -0,0 +1,14 @@
May 11 09:52:44 db01 sshd[2110]: Accepted publickey for admin from 10.40.10.25 port 60124 ssh2: RSA SHA256:samplekey
May 11 09:55:10 db01 sudo[2120]: admin : TTY=pts/1 ; PWD=/home/admin ; USER=root ; COMMAND=/usr/bin/systemctl restart auditd
May 11 09:55:10 db01 sudo[2120]: pam_unix(sudo:session): session opened for user root(uid=0) by admin(uid=1000)
May 11 10:00:01 db01 sshd[2130]: Failed password for invalid user postgres from 198.51.100.90 port 42101 ssh2
May 11 10:00:03 db01 sshd[2130]: Invalid user postgres from 198.51.100.90 port 42101
May 11 10:00:09 db01 sshd[2132]: Failed password for root from 198.51.100.90 port 42105 ssh2
May 11 10:00:13 db01 sshd[2132]: Failed password for root from 198.51.100.90 port 42105 ssh2
May 11 10:00:20 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:00:25 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:00:31 db01 sshd[2135]: Failed password for oracle from 198.51.100.90 port 42111 ssh2
May 11 10:01:12 db01 su[2142]: pam_unix(su:auth): authentication failure; logname=admin uid=1000 euid=0 tty=pts/1 ruser=admin rhost= user=root
May 11 10:01:45 db01 sshd[2149]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.77 user=monitoring
May 11 10:02:03 db01 sshd[2154]: error: PAM: User not known to the underlying authentication module for illegal user deploy from 203.0.113.77
May 11 10:02:36 db01 sshd[2159]: Disconnecting authenticating user oracle 198.51.100.90 port 42111: Too many authentication failures [preauth]
@@ -0,0 +1,159 @@
# incident-log-summary
`incident-log-summary` is a read-only Python CLI for quick incident log review. It scans a local Linux system log or application log, groups matches by configured operational pattern, and reports each group's severity, occurrence count, first and last timestamps, and sample lines.
The tool is meant for first-pass triage and incident notes. It does not replace full log search, alert correlation, service-specific runbooks, or review by an operator who understands the affected platform.
## When To Use
- During incident response when a collected log file needs a fast pattern summary.
- Before attaching evidence to an incident, problem, or change ticket.
- When checking whether a log contains obvious storage, memory, service, TLS, HTTP, or connectivity failures.
- When JSON output is useful for later local automation.
## What It Does Not Do
- It does not read remote systems.
- It does not modify logs or system state.
- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
- It does not prove root cause.
- It does not classify every possible vendor or application error.
- It does not treat sanitized examples as production validation.
## Supported Input
- One local text log file provided with `--file`.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
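The input handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the tool's actual source: `read_log_lines` is a hypothetical helper name, and translating the raised exceptions into exit code `2` is left to the caller.

```python
from __future__ import annotations

from pathlib import Path


def read_log_lines(path: Path) -> list[str]:
    """Return the lines of a local log file, rejecting unusable paths.

    Hypothetical helper for illustration only.
    """
    if not path.is_file():
        # Covers both missing paths and non-file paths such as directories.
        raise OSError(f"path is missing or not a regular file: {path}")
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising,
    # so a single bad byte does not stop the review.
    text = path.read_text(encoding="utf-8", errors="replace")
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()
```

A caller would typically catch `OSError` and `ValueError`, print the message to stderr, and exit with `2`.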
## Supported Patterns
Critical patterns:
- `CRITICAL`
- `FATAL`
- `panic`
- `kernel panic`
- `no space left on device`
- `out of memory`
- `killed process`
- `read-only file system`
- `segmentation fault`
- `segfault`
- `certificate expired`
- `TLS handshake failed`
- `SSLHandshakeException`
- `database unavailable`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`
Warning patterns:
- `ERROR`
- `failed`
- `failure`
- `timeout`
- `connection refused`
- `connection reset`
- `permission denied`
- `authentication failed`
- `denied`
- `unavailable`
- `service restart`
- `retrying`
By default, matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across all configured patterns.
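The matching behavior can be pictured with a small sketch. It assumes plain substring matching and uses a shortened pattern list. The names `CRITICAL_PATTERNS`, `WARNING_PATTERNS`, and `match_patterns` are hypothetical and are not taken from the tool's source.

```python
from __future__ import annotations

import re

# Shortened, illustrative subsets of the documented pattern lists.
CRITICAL_PATTERNS = ["HTTP 503", "database unavailable"]
WARNING_PATTERNS = ["ERROR", "unavailable"]


def match_patterns(line: str, ignore_case: bool = False) -> list[tuple[str, str]]:
    """Return (severity, pattern) pairs for every pattern found in the line."""
    flags = re.IGNORECASE if ignore_case else 0
    hits = []
    for severity, patterns in (
        ("CRITICAL", CRITICAL_PATTERNS),
        ("WARNING", WARNING_PATTERNS),
    ):
        for pattern in patterns:
            # re.escape treats the pattern as literal text, not as a regex.
            if re.search(re.escape(pattern), line, flags):
                hits.append((severity, pattern))
    return hits
```

With this sketch, a line logged as `error: http 503` produces no hits under the default and two hits when `ignore_case=True`.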
## Timestamp Handling
The scanner attempts to parse:
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `May 11 10:15:30`
Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and date filtering keeps those lines by default so potentially important findings are not silently discarded.
Syslog-style timestamps do not include a year. For filtering, the tool uses the year from `--since` when present, otherwise the current local year.
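A minimal sketch of best-effort parsing for the three formats listed above. The regular expressions mirror the ones in the companion auth log audit script. The function name `parse_timestamp` and the `assumed_year` parameter are illustrative assumptions, and `%b` depends on an English or C locale for month names.

```python
from __future__ import annotations

import re
from datetime import datetime

ISO_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def parse_timestamp(line: str, assumed_year: int) -> datetime | None:
    """Return the parsed timestamp, or None when the line has no usable one."""
    iso = ISO_RE.search(line)
    if iso:
        try:
            return datetime.strptime(
                f"{iso.group(1)} {iso.group(2)}", "%Y-%m-%d %H:%M:%S"
            )
        except ValueError:
            return None
    syslog = SYSLOG_RE.search(line)
    if syslog:
        try:
            # Syslog lines carry no year, so the caller supplies one.
            return datetime.strptime(
                f"{assumed_year} {syslog.group(1)}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            return None
    return None
```

Passing the year in explicitly keeps the inference rule (year from `--since`, otherwise the current local year) outside the parser, which makes the parser easy to test.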
## Usage
```bash
cd infra-run/scripts/python/incident-log-summary
python3 incident_log_summary.py --file examples/system-messages.log
python3 incident_log_summary.py --file examples/app-error.log --format markdown --output incident-report.md
python3 incident_log_summary.py --file examples/app-error.log --format json
python3 incident_log_summary.py --file examples/app-error.log --top 20
python3 incident_log_summary.py --file examples/app-error.log --ignore-case
python3 incident_log_summary.py --file examples/app-error.log --since "2026-05-11 10:00:00"
python3 incident_log_summary.py --file examples/app-error.log --until "2026-05-11 12:00:00"
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident or change ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a file. Without `--output`, the report is printed to stdout.
## Exit Codes
- `0` - OK, no findings.
- `1` - Operational findings detected.
- `2` - Invalid input, unreadable file, bad argument, or runtime error.
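For wrapper scripts, the exit code contract above could be modeled like this. The constant names follow the companion auth log audit script. `exit_code_for` and `describe_exit_code` are hypothetical helpers added for illustration.

```python
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2


def exit_code_for(total_findings: int) -> int:
    """Map a finding count to the documented exit code for a completed run."""
    return EXIT_FINDINGS if total_findings > 0 else EXIT_OK


def describe_exit_code(code: int) -> str:
    """Translate an exit code into a short label for logs or tickets."""
    labels = {
        EXIT_OK: "no findings",
        EXIT_FINDINGS: "operational findings detected",
        EXIT_INVALID: "invalid input or runtime error",
    }
    return labels.get(code, "unexpected exit code")
```

A wrapper would read the code from the finished process and treat `1` as "review needed" rather than as a tool failure.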
## Example Text Output
```text
Incident Log Summary
====================
[CRITICAL] no space left on device
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Samples:
- May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages
Operational Summary
-------------------
Total lines scanned: 7
Total findings: 7
Critical finding groups: 3
Warning finding groups: 4
Overall status: CRITICAL
```
## Markdown Workflow
Generate a markdown report from the collected log and attach it to the incident or change ticket as supporting evidence:
```bash
python3 incident_log_summary.py \
--file examples/app-error.log \
--format markdown \
--output incident-report.md
```
Review the report before attaching it. The output is evidence for triage; it is not a final root cause statement.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single line can match multiple patterns, such as `ERROR`, `HTTP 503`, and `unavailable`.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
- Syslog timestamps without a year are normalized with an inferred year.
- Date filters are best-effort because lines without parseable timestamps are retained.
- Large log files are read into memory; collect a scoped file or time-windowed extract for very large incidents.
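The best-effort date filter mentioned in the list above can be sketched as follows. `keep_line` is a hypothetical name, and treating both bounds as inclusive is an assumption rather than documented behavior.

```python
from __future__ import annotations

from datetime import datetime


def keep_line(
    parsed_at: datetime | None,
    since: datetime | None,
    until: datetime | None,
) -> bool:
    """Decide whether a line survives the --since / --until window."""
    if parsed_at is None:
        # Lines without a parseable timestamp are retained so that
        # potentially important findings are not silently dropped.
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True
```

Because undated lines always pass, a filtered report can still contain entries from outside the requested window, so review the sample lines before citing times in a ticket.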
## Safety Notes
- The tool only reads the input log and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
@@ -0,0 +1,8 @@
2026-05-11 09:48:12 app01 api[4150]: INFO request_id=7f3a status=200 path=/health
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
2026-05-11 12:10:01 app01 api[4150]: INFO request_id=9001 status=200 path=/health
@@ -0,0 +1,144 @@
# Incident Log Summary
## CRITICAL: certificate expired
- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33
Sample log lines:
```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```
## CRITICAL: CRITICAL
- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33
Sample log lines:
```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```
## CRITICAL: database unavailable
- Occurrences: 1
- First seen: 2026-05-11 10:03:19
- Last seen: 2026-05-11 10:03:19
Sample log lines:
```text
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
```
## CRITICAL: HTTP 500
- Occurrences: 1
- First seen: 2026-05-11 10:01:03
- Last seen: 2026-05-11 10:01:03
Sample log lines:
```text
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
```
## CRITICAL: HTTP 503
- Occurrences: 1
- First seen: 2026-05-11 10:13:58
- Last seen: 2026-05-11 10:13:58
Sample log lines:
```text
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
```
## CRITICAL: TLS handshake failed
- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33
Sample log lines:
```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```
## WARNING: ERROR
- Occurrences: 4
- First seen: 2026-05-11 10:01:03
- Last seen: 2026-05-11 10:13:58
Sample log lines:
```text
2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
```
## WARNING: unavailable
- Occurrences: 2
- First seen: 2026-05-11 10:03:19
- Last seen: 2026-05-11 10:13:58
Sample log lines:
```text
2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
```
## WARNING: connection refused
- Occurrences: 1
- First seen: 2026-05-11 10:07:02
- Last seen: 2026-05-11 10:07:02
Sample log lines:
```text
2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
```
## WARNING: failed
- Occurrences: 1
- First seen: 2026-05-11 10:11:33
- Last seen: 2026-05-11 10:11:33
Sample log lines:
```text
2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
```
## WARNING: timeout
- Occurrences: 1
- First seen: 2026-05-11 10:05:44
- Last seen: 2026-05-11 10:05:44
Sample log lines:
```text
2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
```
## Operational Summary
- Total lines scanned: 8
- Total findings: 15
- Critical finding groups: 6
- Warning finding groups: 5
- Overall status: CRITICAL
@@ -0,0 +1,7 @@
May 11 09:57:01 ops-node-01 systemd[1]: Started Session 443 of user svc_backup.
May 11 10:02:14 ops-node-01 systemd[1]: failed to start nightly-report.service: Unit entered failed state.
May 11 10:04:22 ops-node-01 sudo[18442]: svc_backup : command not allowed ; permission denied
May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages
May 11 10:21:45 ops-node-01 kernel: out of memory: killed process 2517 (java) total-vm:2048000kB
May 11 10:22:03 ops-node-01 systemd[1]: service restart scheduled for app-worker.service
May 11 10:30:31 ops-node-01 sshd[19210]: Accepted publickey for admin from 192.0.2.15 port 52210 ssh2
@@ -0,0 +1,448 @@
#!/usr/bin/env python3
"""Summarize incident-oriented patterns in local log files."""
from __future__ import annotations
import argparse
import json
import re
import sys
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
CRITICAL_PATTERNS = [
    "CRITICAL",
    "FATAL",
    "panic",
    "kernel panic",
    "no space left on device",
    "out of memory",
    "killed process",
    "read-only file system",
    "segmentation fault",
    "segfault",
    "certificate expired",
    "TLS handshake failed",
    "SSLHandshakeException",
    "database unavailable",
    "HTTP 500",
    "HTTP 502",
    "HTTP 503",
    "HTTP 504",
]
WARNING_PATTERNS = [
    "ERROR",
    "failed",
    "failure",
    "timeout",
    "connection refused",
    "connection reset",
    "permission denied",
    "authentication failed",
    "denied",
    "unavailable",
    "service restart",
    "retrying",
]
ISO_TIMESTAMP_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Summarize suspicious and critical patterns in a local log file."
    )
    parser.add_argument("--file", required=True, help="Local log file to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        help="Limit finding groups after severity and count sorting.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match all configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--since",
        type=parse_filter_timestamp,
        help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--until",
        type=parse_filter_timestamp,
        help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding group. Default: 3.",
    )
    return parser
def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def parse_filter_timestamp(value: str) -> datetime:
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise argparse.ArgumentTypeError(
        'expected timestamp format "YYYY-MM-DD HH:MM:SS"'
    )
def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    flags = re.IGNORECASE if ignore_case else 0
    pattern_defs: list[dict[str, str]] = []
    pattern_defs.extend(
        {"pattern": pattern, "severity": "CRITICAL"} for pattern in CRITICAL_PATTERNS
    )
    pattern_defs.extend(
        {"pattern": pattern, "severity": "WARNING"} for pattern in WARNING_PATTERNS
    )
    compiled = []
    for item in pattern_defs:
        compiled.append(
            {
                "pattern": item["pattern"],
                "severity": item["severity"],
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    return compiled
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str | None]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        try:
            return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S"), raw
        except ValueError:
            return None, None
    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        normalized = f"{syslog_year} {raw}"
        try:
            parsed = datetime.strptime(normalized, "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, None
        return parsed, parsed.strftime("%Y-%m-%d %H:%M:%S")
    return None, None
def line_in_time_window(
    parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True
def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()
def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    since: datetime | None,
    until: datetime | None,
    max_samples: int,
) -> dict[str, Any]:
    syslog_year = since.year if since is not None else datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if not line_in_time_window(parsed_at, since, until):
            continue
        for item in patterns:
            if not item["regex"].search(line):
                continue
            key = f"{item['severity']}::{item['pattern']}"
            group = groups.setdefault(
                key,
                {
                    "pattern": item["pattern"],
                    "severity": item["severity"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1
            if parsed_at is not None:
                if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
                    group["first_seen"] = (parsed_at, rendered_at)
                if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
                    group["last_seen"] = (parsed_at, rendered_at)
            if len(group["samples"]) < max_samples:
                group["samples"].append(line)
    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["pattern"].lower(),
        ),
    )
    rendered_findings = []
    for group in findings:
        rendered_findings.append(
            {
                "pattern": group["pattern"],
                "severity": group["severity"],
                "occurrences": group["occurrences"],
                "first_seen": render_seen(group["first_seen"]),
                "last_seen": render_seen(group["last_seen"]),
                "samples": group["samples"],
            }
        )
    return {
        "total_lines_scanned": len(lines),
        "findings": rendered_findings,
    }
def render_seen(value: tuple[datetime, str | None] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def apply_top_limit(report: dict[str, Any], top: int | None) -> dict[str, Any]:
    if top is None:
        return report
    limited = dict(report)
    limited["findings"] = report["findings"][:top]
    return limited
def add_summary(report: dict[str, Any]) -> dict[str, Any]:
    findings = report["findings"]
    critical_groups = sum(1 for item in findings if item["severity"] == "CRITICAL")
    warning_groups = sum(1 for item in findings if item["severity"] == "WARNING")
    total_findings = sum(item["occurrences"] for item in findings)
    if critical_groups > 0:
        status = "CRITICAL"
    elif warning_groups > 0:
        status = "WARNING"
    else:
        status = "OK"
    enriched = dict(report)
    enriched["summary"] = {
        "total_lines_scanned": report["total_lines_scanned"],
        "total_findings": total_findings,
        "critical_finding_groups": critical_groups,
        "warning_finding_groups": warning_groups,
        "overall_status": status,
    }
    return enriched
def render_text(report: dict[str, Any]) -> str:
    lines = ["Incident Log Summary", "====================", ""]
    if not report["findings"]:
        lines.append("No configured incident patterns were detected.")
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['pattern']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f" - {sample}" for sample in finding["samples"])
            else:
                lines.append(" - No samples retained")
            lines.append("")
    lines.extend(render_text_summary(report["summary"]))
    return "\n".join(lines) + "\n"


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    return [
        "Operational Summary",
        "-------------------",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Total findings: {summary['total_findings']}",
        f"Critical finding groups: {summary['critical_finding_groups']}",
        f"Warning finding groups: {summary['warning_finding_groups']}",
        f"Overall status: {summary['overall_status']}",
    ]
def render_markdown(report: dict[str, Any]) -> str:
    lines = ["# Incident Log Summary", ""]
    if not report["findings"]:
        lines.extend(["No configured incident patterns were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"## {finding['severity']}: {finding['pattern']}",
                    "",
                    f"- Occurrences: {finding['occurrences']}",
                    f"- First seen: {finding['first_seen']}",
                    f"- Last seen: {finding['last_seen']}",
                    "",
                    "Sample log lines:",
                    "",
                ]
            )
            if finding["samples"]:
                lines.append("```text")
                lines.extend(finding["samples"])
                lines.append("```")
            else:
                lines.append("_No samples retained._")
            lines.append("")
    summary = report["summary"]
    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Total lines scanned: {summary['total_lines_scanned']}",
            f"- Total findings: {summary['total_findings']}",
            f"- Critical finding groups: {summary['critical_finding_groups']}",
            f"- Warning finding groups: {summary['warning_finding_groups']}",
            f"- Overall status: {summary['overall_status']}",
            "",
        ]
    )
    return "\n".join(lines)
def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"


def write_report(output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return
    path = Path(output_path)
    try:
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc
def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    if args.since is not None and args.until is not None and args.since > args.until:
        parser.error("--since must be earlier than or equal to --until")
    try:
        lines = read_log_file(Path(args.file))
        report = analyze_log(
            lines=lines,
            patterns=compile_patterns(args.ignore_case),
            since=args.since,
            until=args.until,
            max_samples=args.max_samples,
        )
        report = add_summary(apply_top_limit(report, args.top))
        if args.format == "text":
            content = render_text(report)
        elif args.format == "markdown":
            content = render_markdown(report)
        else:
            content = render_json(report)
        write_report(args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID
    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,215 @@
# journal-analyzer
`journal-analyzer` is a read-only Python CLI for reviewing exported `journalctl` text logs. It summarizes systemd, service, and system-level journal findings that require operator review during Linux incident response, post-patching validation, restart troubleshooting, and change evidence collection.
The tool analyzes exported journal text only. It does not call `journalctl` directly, does not modify host state, and does not claim root cause.
## Purpose
- Summarize which units failed and which services appear repeatedly affected.
- Surface dependency failures, restart loops, timeout patterns, OOM symptoms, disk/filesystem errors, TLS/certificate issues, authentication events, and network-related warnings.
- Produce predictable text, Markdown, or JSON output that can be attached to an incident or change ticket.
## When To Use
- After exporting a scoped `journalctl` window during incident response.
- After package patching or service restarts when failed units or degraded services need review.
- During Linux service troubleshooting when repeated restart or dependency messages need a quick grouped summary.
- Before attaching journal evidence to an incident, problem, or change record.
## What It Does Not Do
- It does not call `journalctl` directly in v1.
- It does not modify the input log, systemd state, service state, or host configuration.
- It does not read remote systems or live journal streams.
- It does not query SIEM, ELK, Zabbix, APM, or ticketing systems.
- It does not prove root cause or a service defect.
- It does not classify every vendor-specific journal message.
## Supported Input Type
- One exported local `journalctl` text file supplied with `--file`.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty files and missing, unreadable, or non-regular-file paths are rejected with exit code `2`.
Example export commands:
```bash
journalctl --since "1 hour ago" > journal.log
journalctl -u nginx --since today > nginx-journal.log
journalctl -p warning..alert --since "24 hours ago" > warnings.log
journalctl --no-pager --since "2026-05-11 10:00:00" > journal.log
```
## Supported Event Categories
Critical-oriented categories:
- Failed unit or failed start findings.
- Dependency failures.
- Kernel panic and panic findings.
- OOM killer and killed process findings.
- Disk and filesystem issues such as `no space left on device`, read-only filesystem, filesystem errors, and I/O errors.
- Service or application crash patterns such as `segfault`.
- TLS and certificate failures.
- Emergency mode findings.
Warning-oriented categories:
- Restart and repeated start request findings.
- Timeout and timed out findings.
- Connection refused and connection reset findings.
- Permission denied and denied findings.
- Authentication failure findings.
- Availability, degraded, failed, and warning findings that still require review.
Matching is practical and pattern-based. Default matching already tolerates common case variants of operational wording; `--ignore-case` forces case-insensitive matching when that is the explicit intent. The tool is intended for first-pass operational review, not for proving causality.
## Timestamp Support
The analyzer attempts to parse common journal and syslog timestamp formats:
- `May 11 10:15:30`
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `2026-05-11 10:15:30.123456`
- `2026-05-11 10:15:30,123`
If a timestamp cannot be parsed:
- the line is still analyzed
- first seen / last seen remain `UNKNOWN` where needed
- time-window filters keep the line by default rather than silently discarding it
Syslog-style timestamps without a year use the current local year internally unless `--since` provides a year context.
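The best-effort parsing described above could look roughly like the following sketch. The regular expressions and the `parse_timestamp` helper are assumptions made for illustration; the analyzer's own implementation may differ. Fractional seconds are matched but discarded, and `%b` parsing assumes an English month-name locale.

```python
import re
from datetime import datetime
from typing import Optional

# Assumed patterns: ISO-style date and time with an optional fractional part,
# and a syslog-style prefix without a year.
ISO_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})(?:[.,]\d+)?")
SYSLOG_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def parse_timestamp(line: str, fallback_year: int) -> Optional[datetime]:
    """Return the parsed timestamp, or None when no supported format is found."""
    iso = ISO_RE.search(line)
    if iso:
        try:
            return datetime.strptime(f"{iso.group(1)} {iso.group(2)}", "%Y-%m-%d %H:%M:%S")
        except ValueError:
            return None
    syslog = SYSLOG_RE.search(line)
    if syslog:
        try:
            # Syslog lines carry no year, so one has to be supplied by the caller.
            return datetime.strptime(f"{fallback_year} {syslog.group(1)}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None
    return None


print(parse_timestamp("May 11 10:15:30 web01 systemd[1]: example", 2026))
print(parse_timestamp("2026-05-11 10:15:30,123 web01 kernel: example", 2026))
print(parse_timestamp("no timestamp on this line", 2026))
```

A caller would treat a `None` result as an unknown timestamp and keep the line, matching the behaviour listed above.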
## Service Filtering
Use `--service SERVICE_NAME` to keep findings for a specific service, unit, or process name. Partial matches are allowed.
Examples:
```bash
python3 journal_analyzer.py --file examples/sample-journal.log --service nginx
python3 journal_analyzer.py --file examples/sample-journal.log --service sshd
```
`--service nginx` matches practical variants such as `nginx`, `nginx.service`, and lines where the raw journal text includes `nginx`.
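As a rough illustration of partial matching, the sketch below keeps any line whose raw text contains the requested name. The `line_matches_service` helper and its case-insensitive substring comparison are assumptions, not the analyzer's confirmed logic.

```python
def line_matches_service(line: str, service: str) -> bool:
    """Hypothetical partial match: keep the line if the name appears anywhere in it."""
    return service.lower() in line.lower()


lines = [
    "May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service.",
    "May 11 10:15:03 web01 sshd[2284]: pam_unix(sshd:auth): authentication failure",
]
kept = [line for line in lines if line_matches_service(line, "nginx")]
print(len(kept))
```

A substring match this loose also keeps unrelated lines that merely mention the name, which is one reason filtered output still needs operator review.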
## Severity Filtering
Use `--severity warning` or `--severity critical` to limit the displayed findings.
Examples:
```bash
python3 journal_analyzer.py --file examples/sample-journal.log --severity critical
python3 journal_analyzer.py --file examples/sample-journal.log --severity warning
```
## Severity Model
Overall status is conservative:
- `OK` - no journal findings detected.
- `WARNING` - warning-level findings exist but no critical findings exist.
- `CRITICAL` - one or more critical findings exist.
Critical status is driven by failed units, dependency failures, OOM events, kernel panic findings, disk full or read-only filesystem symptoms, emergency mode, TLS/certificate failures, and I/O or filesystem errors.
Warning status is driven by restart-related findings, timeout patterns, connection issues, permission denied events, authentication failures, degraded messages, and generic warning/failure entries that still require review.
The report summarizes exported journal findings that require review. It does not claim root cause.
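The conservative status rule above reduces to a short function. This is a minimal sketch; `overall_status` is an illustrative name and the input is assumed to be a flat list of per-finding severity strings.

```python
def overall_status(severities: list) -> str:
    """Collapse per-finding severities into one conservative status."""
    if "CRITICAL" in severities:
        return "CRITICAL"  # any critical finding dominates
    if "WARNING" in severities:
        return "WARNING"
    return "OK"


print(overall_status(["WARNING", "CRITICAL", "WARNING"]))
print(overall_status(["WARNING"]))
print(overall_status([]))
```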
## Usage
```bash
cd infra-run/scripts/python/journal-analyzer
python3 journal_analyzer.py --file examples/sample-journal.log
python3 journal_analyzer.py --file examples/sample-journal.log --format markdown
python3 journal_analyzer.py --file examples/sample-journal.log --format markdown --output journal-report.md
python3 journal_analyzer.py --file examples/sample-journal.log --format json
python3 journal_analyzer.py --file examples/sample-journal.log --service sshd
python3 journal_analyzer.py --file examples/sample-journal.log --service nginx
python3 journal_analyzer.py --file examples/sample-journal.log --severity critical
python3 journal_analyzer.py --file examples/sample-journal.log --top 10
python3 journal_analyzer.py --file examples/sample-journal.log --since "2026-05-11 10:00:00"
python3 journal_analyzer.py --file examples/sample-journal.log --until "2026-05-11 12:00:00"
python3 journal_analyzer.py --file examples/sample-journal.log --ignore-case
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident or change ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the report to a separate file. Without `--output`, the report is printed to stdout.
## Exit Codes
- `0` - OK, no journal findings.
- `1` - Journal findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
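A wrapper script can branch on these exit codes. The sketch below is one possible shape, assuming it runs from the tool directory; `classify_exit` and `run_analyzer` are hypothetical helper names, and only the documented `--file` flag is used.

```python
import subprocess
import sys

# Labels for the documented exit codes; anything else is treated as unexpected.
EXIT_MEANINGS = {0: "ok", 1: "findings", 2: "invalid"}


def classify_exit(code: int) -> str:
    """Translate an analyzer exit code into a short label."""
    return EXIT_MEANINGS.get(code, "unexpected")


def run_analyzer(log_path: str) -> str:
    """Run the analyzer on an exported journal file and classify the result."""
    result = subprocess.run(
        [sys.executable, "journal_analyzer.py", "--file", log_path],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means findings, not a crash
    )
    return classify_exit(result.returncode)
```

Passing `check=False` matters here because exit code `1` signals findings to review and should not be raised as an exception.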
## Example Text Output
```text
Journal Analyzer
================
Overall status: CRITICAL
Journal findings require review; logs alone do not prove root cause.
[CRITICAL] nginx.service - failed_unit
Pattern: failed to start
Occurrences: 1
Unit: nginx.service
Process: systemd
PID: 1
First seen: May 11 10:16:11
Last seen: May 11 10:16:11
Samples:
- May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 17
Total findings: 13
Critical finding groups: 7
Warning finding groups: 5
Affected services/units count: 9
```
## Markdown Workflow
Generate a Markdown report from an exported journal and attach it to the incident or change ticket as supporting evidence:
```bash
python3 journal_analyzer.py \
--file examples/sample-journal.log \
--format markdown \
--output journal-report.md
```
Review the report before attaching it. Use it as a concise summary of exported journal findings, then correlate it with service status, monitoring, recent changes, package history, and runbook-specific post-checks.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single line can match more than one finding when it contains more than one meaningful symptom, such as a TLS failure plus certificate expiry.
- Default matching already tolerates common case variants; use `--ignore-case` to force case-insensitive matching.
- Unit, process, and PID extraction are best-effort and may return `UNKNOWN`.
- Time filtering is best-effort because lines without parseable timestamps are retained.
- Large log files are read into memory; use scoped journal exports for very large review windows.
- The tool does not inspect structured journal fields because v1 works on exported text logs.
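To show why unit, process, and PID extraction is only best-effort on exported text, here is a minimal sketch. Both regular expressions and the `extract_fields` helper are hypothetical and are not taken from the analyzer.

```python
import re

UNKNOWN = "UNKNOWN"
# Hypothetical pattern for "<host> <process>[<pid>]: " or "<host> <process>: ".
PROCESS_RE = re.compile(r"\s(?P<process>[A-Za-z0-9_.@-]+)(?:\[(?P<pid>\d+)\])?:\s")
# Hypothetical pattern for a systemd unit name mentioned anywhere in the message.
UNIT_RE = re.compile(r"\b([A-Za-z0-9_.@-]+\.(?:service|socket|timer|mount|target))\b")


def extract_fields(line: str) -> dict:
    """Return unit, process, and PID, falling back to UNKNOWN for missing parts."""
    process = PROCESS_RE.search(line)
    unit = UNIT_RE.search(line)
    return {
        "unit": unit.group(1) if unit else UNKNOWN,
        "process": process.group("process") if process else UNKNOWN,
        "pid": (process.group("pid") or UNKNOWN) if process else UNKNOWN,
    }


print(extract_fields("May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service."))
print(extract_fields("2026-05-11 10:17:54 web01 kernel: EXT4-fs error (device sda2): Remounting read-only file system"))
```

Kernel lines carry no PID and many messages name no unit, so `UNKNOWN` is an expected value in reports rather than an error.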
## Safety Notes
- The tool only reads the input journal export and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require root privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, or unsanitized production details in portfolio examples.
- Treat operational findings as triage evidence that requires review; the tool does not determine root cause automatically.
@@ -0,0 +1,143 @@
# Journal Analyzer Report
- Overall status: `CRITICAL`
- Journal findings require review; logs alone do not prove root cause.
## Finding Groups
### [CRITICAL] backup-agent - tls_certificate
- Pattern: `certificate expired`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `backup-agent`
- PID: `777`
- First seen: `2026-05-11 10:18:10`
- Last seen: `2026-05-11 10:18:10`
- Samples:
- `2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection`
### [CRITICAL] backup-agent - tls_certificate
- Pattern: `TLS handshake failed`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `backup-agent`
- PID: `777`
- First seen: `2026-05-11 10:18:10`
- Last seen: `2026-05-11 10:18:10`
- Samples:
- `2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection`
### [CRITICAL] dockerd - disk_filesystem
- Pattern: `no space left on device`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `dockerd`
- PID: `1347`
- First seen: `2026-05-11 10:17:33`
- Last seen: `2026-05-11 10:17:33`
- Samples:
- `2026-05-11 10:17:33 web01 dockerd[1347]: Error response from daemon: write /var/lib/docker/tmp/GetImageBlob123456: no space left on device`
### [CRITICAL] java - oom
- Pattern: `Out of memory`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `java`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:02`
- Last seen: `2026-05-11 10:17:02`
- Samples:
- `2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB`
### [CRITICAL] java - oom
- Pattern: `killed process`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `java`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:02`
- Last seen: `2026-05-11 10:17:02`
- Samples:
- `2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB`
### [CRITICAL] kernel - disk_filesystem
- Pattern: `read-only file system`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `kernel`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:54`
- Last seen: `2026-05-11 10:17:54`
- Samples:
- `2026-05-11 10:17:54 web01 kernel: EXT4-fs error (device sda2): Remounting read-only file system`
### [CRITICAL] kernel - oom
- Pattern: `invoked oom-killer`
- Occurrences: `1`
- Unit: `UNKNOWN`
- Process: `kernel`
- PID: `UNKNOWN`
- First seen: `2026-05-11 10:17:01`
- Last seen: `2026-05-11 10:17:01`
- Samples:
- `2026-05-11 10:17:01 web01 kernel: invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0`
### [CRITICAL] nginx.service - dependency_failure
- Pattern: `dependency failed`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:08`
- Last seen: `May 11 10:16:08`
- Samples:
- `May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service.`
### [CRITICAL] nginx.service - failed_unit
- Pattern: `failed to start`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:11`
- Last seen: `May 11 10:16:11`
- Samples:
- `May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.`
### [CRITICAL] nginx.service - failed_unit
- Pattern: `entered failed state`
- Occurrences: `1`
- Unit: `nginx.service`
- Process: `systemd`
- PID: `1`
- First seen: `May 11 10:16:12`
- Last seen: `May 11 10:16:12`
- Samples:
- `May 11 10:16:12 web01 systemd[1]: nginx.service: Unit entered failed state.`
## Operational Summary
- Overall status: `CRITICAL`
- Total lines scanned: `17`
- Total findings: `18`
- Critical finding groups: `11`
- Warning finding groups: `7`
- Affected services/units count: `9`
- Top affected services/units: nginx.service (5), sshd.service (3), kernel (2), java (2), backup-agent (2), sshd (1), dockerd (1), NetworkManager (1), systemd (1)
- Top finding categories: restart (3), oom (3), failed_unit (2), disk_filesystem (2), tls_certificate (2), authentication (1), timeout (1), dependency_failure (1), generic_failure (1), network (1)
- Failed unit findings: nginx.service (3)
- Restart findings: `3`
- OOM findings: `3`
- Filesystem/disk findings: `2`
- Timestamp coverage: parsed=`17`, unknown=`0`
- Filters used: service=`None`, severity=`None`, since=`None`, until=`None`
@@ -0,0 +1,17 @@
May 11 10:14:01 web01 systemd[1]: Starting nginx.service - A high performance web server and a reverse proxy server...
May 11 10:14:02 web01 systemd[1]: Started ssh.service - OpenBSD Secure Shell server.
May 11 10:15:03 web01 sshd[2284]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=198.51.100.23 user=deploy
May 11 10:15:22 web01 systemd[1]: sshd.service: Scheduled restart job, restart counter is at 3.
May 11 10:15:23 web01 systemd[1]: sshd.service: Service restart completed after watchdog timeout warning
May 11 10:16:08 web01 systemd[1]: Dependency failed for nginx.service.
May 11 10:16:09 web01 systemd[1]: nginx.service: Job nginx.service/start failed with result 'dependency'.
May 11 10:16:10 web01 systemd[1]: nginx.service: Start request repeated too quickly.
May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
May 11 10:16:12 web01 systemd[1]: nginx.service: Unit entered failed state.
2026-05-11 10:17:01 web01 kernel: invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
2026-05-11 10:17:02 web01 kernel: Out of memory: Killed process 4421 (java) total-vm:2048000kB, anon-rss:1024000kB, file-rss:1024kB, shmem-rss:0kB
2026-05-11 10:17:33 web01 dockerd[1347]: Error response from daemon: write /var/lib/docker/tmp/GetImageBlob123456: no space left on device
2026-05-11 10:17:54 web01 kernel: EXT4-fs error (device sda2): Remounting read-only file system
2026-05-11 10:18:10 web01 backup-agent[777]: TLS handshake failed for backup endpoint: certificate expired on peer connection
2026-05-11 10:18:28 web01 NetworkManager[691]: Connection activation failed: Connection refused while reaching upstream gateway
2026-05-11 10:18:42 web01 systemd[1]: Emergency mode is enabled. System cannot continue normal boot.
@@ -0,0 +1,895 @@
#!/usr/bin/env python3
"""Analyze exported journalctl text logs for operational findings."""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
CRITICAL_PATTERNS = [
    {
        "name": "failed to start",
        "pattern": "failed to start",
        "category": "failed_unit",
        "service_hint": "systemd",
    },
    {
        "name": "entered failed state",
        "pattern": "entered failed state",
        "category": "failed_unit",
        "service_hint": "systemd",
    },
    {
        "name": "dependency failed",
        "pattern": "dependency failed",
        "category": "dependency_failure",
        "service_hint": "systemd",
    },
    {
        "name": "job failed",
        "pattern": "job failed",
        "category": "failed_unit",
        "service_hint": "systemd",
    },
    {
        "name": "unit failed",
        "pattern": "unit failed",
        "category": "failed_unit",
        "service_hint": "systemd",
    },
    {
        "name": "kernel panic",
        "pattern": "kernel panic",
        "category": "kernel_panic",
        "service_hint": "kernel",
    },
    {
        "name": "panic",
        "pattern": "panic",
        "category": "kernel_panic",
        "service_hint": "kernel",
    },
    {
        "name": "Out of memory",
        "pattern": "Out of memory",
        "category": "oom",
        "service_hint": "kernel",
    },
    {
        "name": "invoked oom-killer",
        "pattern": "invoked oom-killer",
        "category": "oom",
        "service_hint": "kernel",
    },
    {
        "name": "killed process",
        "pattern": "killed process",
        "category": "oom",
        "service_hint": "kernel",
    },
    {
        "name": "no space left on device",
        "pattern": "no space left on device",
        "category": "disk_filesystem",
        "service_hint": "storage",
    },
    {
        "name": "read-only file system",
        "pattern": "read-only file system",
        "category": "disk_filesystem",
        "service_hint": "storage",
    },
    {
        "name": "segmentation fault",
        "pattern": "segmentation fault",
        "category": "crash",
        "service_hint": "application",
    },
    {
        "name": "segfault",
        "pattern": "segfault",
        "category": "crash",
        "service_hint": "application",
    },
    {
        "name": "certificate expired",
        "pattern": "certificate expired",
        "category": "tls_certificate",
        "service_hint": "tls",
    },
    {
        "name": "TLS handshake failed",
        "pattern": "TLS handshake failed",
        "category": "tls_certificate",
        "service_hint": "tls",
    },
    {
        "name": "emergency mode",
        "pattern": "emergency mode",
        "category": "system_recovery",
        "service_hint": "systemd",
    },
    {
        "name": "filesystem error",
        "pattern": "filesystem error",
        "category": "disk_filesystem",
        "service_hint": "storage",
    },
    {
        "name": "I/O error",
        "pattern": "I/O error",
        "category": "disk_filesystem",
        "service_hint": "storage",
    },
]
WARNING_PATTERNS = [
{
"name": "service restart",
"pattern": "service restart",
"category": "restart",
"service_hint": "systemd",
},
{
"name": "scheduled restart job",
"pattern": "scheduled restart job",
"category": "restart",
"service_hint": "systemd",
},
{
"name": "start request repeated too quickly",
"pattern": "start request repeated too quickly",
"category": "restart",
"service_hint": "systemd",
},
{
"name": "timeout",
"pattern": "timeout",
"category": "timeout",
"service_hint": "application",
},
{
"name": "timed out",
"pattern": "timed out",
"category": "timeout",
"service_hint": "application",
},
{
"name": "connection refused",
"pattern": "connection refused",
"category": "network",
"service_hint": "network",
},
{
"name": "connection reset",
"pattern": "connection reset",
"category": "network",
"service_hint": "network",
},
{
"name": "permission denied",
"pattern": "permission denied",
"category": "permission",
"service_hint": "security",
},
{
"name": "authentication failure",
"pattern": "authentication failure",
"category": "authentication",
"service_hint": "security",
},
{
"name": "denied",
"pattern": "denied",
"category": "permission",
"service_hint": "security",
},
{
"name": "unavailable",
"pattern": "unavailable",
"category": "availability",
"service_hint": "application",
},
{
"name": "degraded",
"pattern": "degraded",
"category": "degraded",
"service_hint": "systemd",
},
{
"name": "failed",
"pattern": "failed",
"category": "generic_failure",
"service_hint": "application",
},
{
"name": "warning",
"pattern": "warning",
"category": "warning",
"service_hint": "application",
},
]
ISO_TIMESTAMP_RE = re.compile(
r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
UNIT_RE = re.compile(r"\b([A-Za-z0-9_.@:-]+\.service)\b")
ANY_UNIT_RE = re.compile(
r"\b([A-Za-z0-9_.@:-]+\.(?:service|socket|mount|target|timer|path|slice|scope|device))\b"
)
PREFIX_RE = re.compile(
r"^(?:[A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+)?"
r"(?:\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[,.]\d{1,6})?\s+)?"
r"(?:(?P<host>[A-Za-z0-9_.:-]+)\s+)?"
r"(?P<proc>[A-Za-z0-9_.@/-]+)(?:\[(?P<pid>\d+)\])?:"
)
KILLED_PROCESS_RE = re.compile(r"Killed process \d+ \(([^)]+)\)")
SYSTEMD_FAILED_START_RE = re.compile(r"Failed to start\s+(.+?)\.")
SYSTEMD_TRIGGER_RE = re.compile(r"Triggered By:\s*([A-Za-z0-9_.@:-]+\.(?:service|socket|mount|target|timer|path|slice|scope|device))")
PID_RE = re.compile(r"\bpid[ =](\d+)\b", re.IGNORECASE)
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(
description="Analyze exported journalctl text logs for systemd and service findings."
)
parser.add_argument("--file", required=True, help="Exported journal log file to analyze.")
parser.add_argument(
"--format",
choices=("text", "markdown", "json"),
default="text",
help="Report format. Default: text.",
)
parser.add_argument("--output", help="Write report to this path instead of stdout.")
parser.add_argument(
"--service",
help="Filter findings to a service, unit, or process name. Partial matching is allowed.",
)
parser.add_argument(
"--severity",
choices=("warning", "critical"),
help="Show only warning or critical findings.",
)
parser.add_argument(
"--top",
type=positive_int,
default=10,
help="Number of top groups, services, and categories to display. Default: 10.",
)
parser.add_argument(
"--max-samples",
type=non_negative_int,
default=3,
help="Maximum sample lines per finding group. Default: 3.",
)
parser.add_argument(
"--ignore-case",
action="store_true",
help="Match configured patterns case-insensitively.",
)
parser.add_argument(
"--since",
type=parse_filter_timestamp,
help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
)
parser.add_argument(
"--until",
type=parse_filter_timestamp,
help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
)
return parser
def positive_int(value: str) -> int:
try:
number = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError("must be a positive integer") from exc
if number <= 0:
raise argparse.ArgumentTypeError("must be a positive integer")
return number
def non_negative_int(value: str) -> int:
try:
number = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
if number < 0:
raise argparse.ArgumentTypeError("must be zero or a positive integer")
return number
def parse_filter_timestamp(value: str) -> datetime:
for fmt in (
"%Y-%m-%d %H:%M:%S",
"%Y-%m-%dT%H:%M:%S",
"%Y-%m-%d %H:%M:%S.%f",
"%Y-%m-%d %H:%M:%S,%f",
):
try:
return datetime.strptime(value, fmt)
except ValueError:
continue
raise argparse.ArgumentTypeError(
'expected timestamp format "YYYY-MM-DD HH:MM:SS"'
)
def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    # As written, matching is always case-insensitive: the flags start at
    # re.IGNORECASE, so `ignore_case` (from --ignore-case) does not change the outcome.
    flags = re.IGNORECASE
    compiled = []
    for item in CRITICAL_PATTERNS:
        compiled.append(
            {
                **item,
                "severity": "CRITICAL",
                # Patterns are literal substrings, so escape regex metacharacters.
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    for item in WARNING_PATTERNS:
        compiled.append(
            {
                **item,
                "severity": "WARNING",
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    return compiled
def read_log_file(path: Path) -> list[str]:
if not path.exists():
raise OSError(f"file does not exist: {path}")
if not path.is_file():
raise OSError(f"path is not a regular file: {path}")
try:
text = path.read_text(encoding="utf-8", errors="replace")
except PermissionError as exc:
raise OSError(f"file is not readable: {path}") from exc
except OSError as exc:
raise OSError(f"unable to read file {path}: {exc}") from exc
if text == "":
raise ValueError(f"file is empty: {path}")
return text.splitlines()
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
iso_match = ISO_TIMESTAMP_RE.search(line)
if iso_match:
fraction = iso_match.group(3) or ""
raw = f"{iso_match.group(1)} {iso_match.group(2)}"
parse_value = raw
fmt = "%Y-%m-%d %H:%M:%S"
if fraction:
parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
fmt = "%Y-%m-%d %H:%M:%S.%f"
try:
return datetime.strptime(parse_value, fmt), raw + fraction
except ValueError:
return None, UNKNOWN
syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
if syslog_match:
raw = syslog_match.group(1)
try:
parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
except ValueError:
return None, UNKNOWN
return parsed, raw
return None, UNKNOWN
def line_in_time_window(
parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
if parsed_at is None:
return True
if since is not None and parsed_at < since:
return False
if until is not None and parsed_at > until:
return False
return True
def render_seen(value: tuple[datetime, str] | None) -> str:
if value is None:
return UNKNOWN
return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
if parsed_at is None:
return
if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
group["first_seen"] = (parsed_at, rendered_at)
if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
group["last_seen"] = (parsed_at, rendered_at)
def append_limited(items: list[str], value: str, limit: int) -> None:
if limit == 0:
return
if value in items:
return
if len(items) < limit:
items.append(value)
def normalize_service_name(value: str) -> str:
stripped = value.strip()
if not stripped:
return UNKNOWN
return stripped
def extract_service_info(line: str, pattern_item: dict[str, Any]) -> dict[str, str]:
unit_match = UNIT_RE.search(line)
any_unit_match = ANY_UNIT_RE.search(line)
prefix_match = PREFIX_RE.search(line)
killed_match = KILLED_PROCESS_RE.search(line)
triggered_match = SYSTEMD_TRIGGER_RE.search(line)
pid_match = PID_RE.search(line)
unit = UNKNOWN
process = UNKNOWN
pid = UNKNOWN
if unit_match:
unit = unit_match.group(1)
elif any_unit_match:
unit = any_unit_match.group(1)
if prefix_match:
process = prefix_match.group("proc") or UNKNOWN
pid = prefix_match.group("pid") or UNKNOWN
if killed_match:
process = normalize_service_name(killed_match.group(1))
if pid == UNKNOWN and pid_match:
pid = pid_match.group(1)
if unit == UNKNOWN and process == "systemd":
failed_start_match = SYSTEMD_FAILED_START_RE.search(line)
if failed_start_match:
unit = normalize_service_name(
failed_start_match.group(1).strip().replace(" ", "-")
)
if not unit.endswith(".service"):
unit = f"{unit}.service"
if unit == UNKNOWN and triggered_match:
unit = triggered_match.group(1)
service = UNKNOWN
if unit != UNKNOWN:
service = unit
elif process != UNKNOWN:
service = process
elif pattern_item.get("service_hint"):
service = pattern_item["service_hint"]
return {
"service": service,
"unit": unit,
"process": process,
"pid": pid,
}
def service_filter_matches(service_filter: str | None, service_info: dict[str, str], line: str) -> bool:
if not service_filter:
return True
needle = service_filter.lower()
candidates = [line.lower()]
for key in ("service", "unit", "process"):
value = service_info.get(key, UNKNOWN)
if value != UNKNOWN:
candidates.append(value.lower())
return any(needle in candidate for candidate in candidates)
def severity_filter_matches(selected: str | None, severity: str) -> bool:
if selected is None:
return True
return selected.upper() == severity
def detect_failed_unit(line: str, service_info: dict[str, str], category: str) -> str | None:
if category not in {"failed_unit", "dependency_failure"}:
return None
if service_info["unit"] != UNKNOWN:
return service_info["unit"]
match = ANY_UNIT_RE.search(line)
if match:
return match.group(1)
return None
def analyze_log(
lines: list[str],
patterns: list[dict[str, Any]],
since: datetime | None,
until: datetime | None,
service_filter: str | None,
severity_filter: str | None,
top: int,
max_samples: int,
) -> dict[str, Any]:
syslog_year = since.year if since is not None else datetime.now().year
groups: dict[str, dict[str, Any]] = {}
total_lines_scanned = 0
parsed_timestamps = 0
unknown_timestamps = 0
top_services = Counter()
top_categories = Counter()
failed_units = Counter()
restart_findings = 0
oom_findings = 0
filesystem_findings = 0
for line in lines:
parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
total_lines_scanned += 1
if parsed_at is not None:
parsed_timestamps += 1
else:
unknown_timestamps += 1
if not line_in_time_window(parsed_at, since, until):
continue
matched_items = [item for item in patterns if item["regex"].search(line)]
if matched_items:
has_specific_match = any(
item["name"] not in {"failed", "warning"} for item in matched_items
)
if has_specific_match:
matched_items = [
item for item in matched_items if item["name"] not in {"failed", "warning"}
]
for item in matched_items:
if not severity_filter_matches(severity_filter, item["severity"]):
continue
service_info = extract_service_info(line, item)
if not service_filter_matches(service_filter, service_info, line):
continue
key = (
f"{service_info['service']}::{item['name']}::{item['category']}::{item['severity']}"
)
group = groups.setdefault(
key,
{
"service": service_info["service"],
"unit": service_info["unit"],
"process": service_info["process"],
"pid": service_info["pid"],
"category": item["category"],
"pattern": item["name"],
"severity": item["severity"],
"occurrences": 0,
"first_seen": None,
"last_seen": None,
"samples": [],
},
)
group["occurrences"] += 1
update_seen(group, parsed_at, rendered_at)
append_limited(group["samples"], line, max_samples)
top_services[group["service"]] += 1
top_categories[group["category"]] += 1
failed_unit = detect_failed_unit(line, service_info, item["category"])
if failed_unit:
failed_units[failed_unit] += 1
if item["category"] == "restart":
restart_findings += 1
if item["category"] == "oom":
oom_findings += 1
if item["category"] == "disk_filesystem":
filesystem_findings += 1
findings = sorted(
groups.values(),
key=lambda item: (
SEVERITY_ORDER[item["severity"]],
-item["occurrences"],
item["service"].lower(),
item["category"].lower(),
),
)
rendered_findings = []
for group in findings:
rendered_findings.append(
{
"service": group["service"],
"unit": group["unit"],
"process": group["process"],
"pid": group["pid"],
"category": group["category"],
"pattern": group["pattern"],
"severity": group["severity"],
"occurrences": group["occurrences"],
"first_seen": render_seen(group["first_seen"]),
"last_seen": render_seen(group["last_seen"]),
"samples": group["samples"],
}
)
critical_groups = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
warning_groups = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
overall_status = "OK"
if critical_groups > 0:
overall_status = "CRITICAL"
elif warning_groups > 0:
overall_status = "WARNING"
displayed_findings = rendered_findings[:top]
return {
"overall_status": overall_status,
"total_lines_scanned": total_lines_scanned,
"total_findings": sum(item["occurrences"] for item in rendered_findings),
"critical_finding_groups": critical_groups,
"warning_finding_groups": warning_groups,
"affected_services_count": len([name for name in top_services if name != UNKNOWN]),
"top_affected_services": [
{"service": name, "count": count}
for name, count in top_services.most_common(top)
],
"top_categories": [
{"category": name, "count": count}
for name, count in top_categories.most_common(top)
],
"failed_units": [
{"unit": name, "count": count} for name, count in failed_units.most_common(top)
],
"restart_findings": restart_findings,
"oom_findings": oom_findings,
"filesystem_disk_findings": filesystem_findings,
"timestamp_coverage": {
"parsed_timestamps_count": parsed_timestamps,
"unknown_timestamps_count": unknown_timestamps,
},
"filters_used": {
"service": service_filter or None,
"severity": severity_filter or None,
"since": since.strftime("%Y-%m-%d %H:%M:%S") if since else None,
"until": until.strftime("%Y-%m-%d %H:%M:%S") if until else None,
},
"finding_groups": displayed_findings,
"finding_groups_total": len(rendered_findings),
}
def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
if not items:
return "None"
return ", ".join(f"{item[key]} ({item['count']})" for item in items)
def render_text(report: dict[str, Any]) -> str:
lines = [
"Journal Analyzer",
"================",
"",
f"Overall status: {report['overall_status']}",
"Journal findings require review; logs alone do not prove root cause.",
"",
]
if report["finding_groups"]:
for finding in report["finding_groups"]:
lines.extend(
[
f"[{finding['severity']}] {finding['service']} - {finding['category']}",
f"Pattern: {finding['pattern']}",
f"Occurrences: {finding['occurrences']}",
f"Unit: {finding['unit']}",
f"Process: {finding['process']}",
f"PID: {finding['pid']}",
f"First seen: {finding['first_seen']}",
f"Last seen: {finding['last_seen']}",
"Samples:",
]
)
if finding["samples"]:
for sample in finding["samples"]:
lines.append(f" - {sample}")
else:
lines.append(" - None")
lines.append("")
else:
lines.extend(["No journal findings detected for the selected filters.", ""])
lines.extend(
[
"Operational Summary",
"-------------------",
f"Overall status: {report['overall_status']}",
f"Total lines scanned: {report['total_lines_scanned']}",
f"Total findings: {report['total_findings']}",
f"Critical finding groups: {report['critical_finding_groups']}",
f"Warning finding groups: {report['warning_finding_groups']}",
f"Affected services/units count: {report['affected_services_count']}",
"Top affected services/units: "
+ render_top_pairs(report["top_affected_services"], "service"),
"Top finding categories: "
+ render_top_pairs(report["top_categories"], "category"),
"Failed unit findings: "
+ render_top_pairs(report["failed_units"], "unit"),
f"Restart findings: {report['restart_findings']}",
f"OOM findings: {report['oom_findings']}",
f"Filesystem/disk findings: {report['filesystem_disk_findings']}",
"Timestamp coverage: "
f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
"Filters used: "
f"service={report['filters_used']['service'] or 'None'}, "
f"severity={report['filters_used']['severity'] or 'None'}, "
f"since={report['filters_used']['since'] or 'None'}, "
f"until={report['filters_used']['until'] or 'None'}",
]
)
return "\n".join(lines)
def render_markdown(report: dict[str, Any]) -> str:
lines = [
"# Journal Analyzer Report",
"",
f"- Overall status: `{report['overall_status']}`",
"- Journal findings require review; logs alone do not prove root cause.",
"",
]
if report["finding_groups"]:
lines.append("## Finding Groups")
lines.append("")
for finding in report["finding_groups"]:
lines.extend(
[
f"### [{finding['severity']}] {finding['service']} - {finding['category']}",
"",
f"- Pattern: `{finding['pattern']}`",
f"- Occurrences: `{finding['occurrences']}`",
f"- Unit: `{finding['unit']}`",
f"- Process: `{finding['process']}`",
f"- PID: `{finding['pid']}`",
f"- First seen: `{finding['first_seen']}`",
f"- Last seen: `{finding['last_seen']}`",
"- Samples:",
]
)
if finding["samples"]:
for sample in finding["samples"]:
lines.append(f" - `{sample}`")
else:
lines.append(" - `None`")
lines.append("")
else:
lines.extend(["## Finding Groups", "", "No journal findings detected for the selected filters.", ""])
lines.extend(
[
"## Operational Summary",
"",
f"- Overall status: `{report['overall_status']}`",
f"- Total lines scanned: `{report['total_lines_scanned']}`",
f"- Total findings: `{report['total_findings']}`",
f"- Critical finding groups: `{report['critical_finding_groups']}`",
f"- Warning finding groups: `{report['warning_finding_groups']}`",
f"- Affected services/units count: `{report['affected_services_count']}`",
"- Top affected services/units: "
+ (render_top_pairs(report["top_affected_services"], "service") or "None"),
"- Top finding categories: "
+ (render_top_pairs(report["top_categories"], "category") or "None"),
"- Failed unit findings: "
+ (render_top_pairs(report["failed_units"], "unit") or "None"),
f"- Restart findings: `{report['restart_findings']}`",
f"- OOM findings: `{report['oom_findings']}`",
f"- Filesystem/disk findings: `{report['filesystem_disk_findings']}`",
"- Timestamp coverage: "
f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
"- Filters used: "
f"service=`{report['filters_used']['service'] or 'None'}`, "
f"severity=`{report['filters_used']['severity'] or 'None'}`, "
f"since=`{report['filters_used']['since'] or 'None'}`, "
f"until=`{report['filters_used']['until'] or 'None'}`",
]
)
return "\n".join(lines)
def render_json(report: dict[str, Any]) -> str:
return json.dumps(report, indent=2)
def write_output(text: str, output_path: str | None, input_path: Path) -> None:
    if output_path is None:
        print(text)
        return
    destination = Path(output_path)
    try:
        same_file = destination.exists() and destination.resolve() == input_path.resolve()
    except OSError:
        # Path resolution failed; let the write below report any real problem.
        same_file = False
    # Raise outside the try block so the guard is not swallowed by its own handler.
    if same_file:
        raise OSError("output path must not overwrite the input log file")
    try:
        destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write report to {destination}: {exc}") from exc
def determine_exit_code(report: dict[str, Any]) -> int:
if report["total_findings"] > 0:
return EXIT_FINDINGS
return EXIT_OK
def main() -> int:
parser = build_parser()
args = parser.parse_args()
try:
input_path = Path(args.file)
lines = read_log_file(input_path)
patterns = compile_patterns(args.ignore_case)
report = analyze_log(
lines=lines,
patterns=patterns,
since=args.since,
until=args.until,
service_filter=args.service,
severity_filter=args.severity.upper() if args.severity else None,
top=args.top,
max_samples=args.max_samples,
)
if args.format == "text":
rendered = render_text(report)
elif args.format == "markdown":
rendered = render_markdown(report)
else:
rendered = render_json(report)
write_output(rendered, args.output, input_path)
return determine_exit_code(report)
except (OSError, ValueError) as exc:
print(f"ERROR: {exc}", file=sys.stderr)
return EXIT_INVALID
except Exception as exc: # pragma: no cover - defensive operational fallback
print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr)
return EXIT_INVALID
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,218 @@
# jvm-log-analyzer
`jvm-log-analyzer` is a read-only Python CLI for reviewing local JVM and Java application logs. It summarizes common Java exceptions, stack trace fragments, JVM failure symptoms, database issues, network/TLS problems, HTTP 5xx entries, and repeated application warning/error patterns that require operator review.
The tool is intended for Linux infrastructure, SRE, and application support workflows where a collected log file needs a quick first-pass operational summary. It does not modify logs or system state.
## When To Use
- During incident response when a JVM application log needs a fast exception and symptom summary.
- During application support handoff when stack traces, HTTP 5xx entries, or database failures need to be attached as evidence.
- After a restart, deployment, certificate change, database incident, or capacity event when local log extracts are available.
- When predictable text, Markdown, or JSON output is useful for local review.
## What It Does
- Reads one local JVM or Java application log supplied with `--file`.
- Detects configured critical and warning JVM/application patterns.
- Extracts timestamps, log levels, thread names, logger/class names, exception types, raw samples, and short stack trace fragments where practical.
- Aggregates top finding groups, exception types, and operational symptoms.
- Produces text, Markdown, or JSON output.
## What It Does Not Do
- It does not read remote systems or live log streams.
- It does not modify logs, services, application files, JVM flags, certificates, or database state.
- It does not query APM, ELK, SIEM, Zabbix, ticketing systems, or application APIs.
- It does not find root cause automatically.
- It does not prove an application defect.
- It does not classify every vendor-specific Java framework or application message.
## Supported Input Types
- Java / JVM application logs.
- Spring Boot style logs.
- Tomcat-style application logs.
- Generic application logs containing Java exceptions and stack traces.
UTF-8 text input is expected. Invalid byte sequences are replaced during read so review can continue. Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
## Supported JVM/Application Patterns
Critical patterns:
- `OutOfMemoryError`
- `Java heap space`
- `GC overhead limit exceeded`
- `StackOverflowError`
- `NoClassDefFoundError`
- `ClassNotFoundException`
- `ExceptionInInitializerError`
- `SSLHandshakeException`
- `CertificateExpiredException`
- `SQLException`
- `SQLRecoverableException`
- `CommunicationsException`
- `database unavailable`
- `connection pool exhausted`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`
- `FATAL`
Warning patterns:
- `NullPointerException`
- `IllegalArgumentException`
- `IllegalStateException`
- `SocketTimeoutException`
- `ConnectException`
- `TimeoutException`
- `connection refused`
- `connection reset`
- `Broken pipe`
- `WARN`
- `ERROR`
- `retrying`
- `slow query`
- `deadlock detected`
By default matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across configured patterns.
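
The effect of `--ignore-case` can be illustrated with a minimal sketch. The helper names `compile_literal_patterns` and `matched_patterns` and the shortened pattern list are assumptions for illustration, not the tool's actual internals; the sketch only assumes that patterns are treated as literal substrings.

```python
import re

# Illustrative subset of the configured patterns (assumption: literal substrings).
PATTERNS = ["OutOfMemoryError", "connection refused", "Broken pipe"]


def compile_literal_patterns(patterns, ignore_case=False):
    """Compile each literal pattern, escaping regex metacharacters."""
    flags = re.IGNORECASE if ignore_case else 0
    return {name: re.compile(re.escape(name), flags) for name in patterns}


def matched_patterns(line, compiled):
    """Return the names of all patterns found in one log line."""
    return [name for name, regex in compiled.items() if regex.search(line)]


line = "java.net.ConnectException: Connection refused"
strict = matched_patterns(line, compile_literal_patterns(PATTERNS))
relaxed = matched_patterns(line, compile_literal_patterns(PATTERNS, ignore_case=True))
print(strict)   # []
print(relaxed)  # ['connection refused']
```

The capitalized `Connection refused` is missed by the case-sensitive default and found once case is ignored, which is the kind of variant the flag is meant to catch.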
## Stack Trace Handling
The scanner detects practical multiline Java stack traces using common starts such as:
- Fully qualified Java exception lines, such as `java.lang.NullPointerException`.
- `Exception in thread "main"`.
- `Caused by:`.
- Application exception class names ending in `Exception` or `Error`.
Following stack frames are grouped when they look like Java frames:
- Lines starting with whitespace followed by `at `.
- Lines starting with `Caused by:`.
- Lines containing `... N more`.
Stack traces are associated with the detected exception type where possible. Text and Markdown output include only short sample lines by default. Use `--include-stacktraces` to include capped multiline stack trace fragments.
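
The start and continuation rules above can be sketched roughly as follows. This is a simplified illustration, not the tool's parser: the regular expressions and the `group_stack_traces` helper are assumptions, and chained `Caused by:` sections are folded into the trace that precedes them.

```python
import re

# Assumed heuristics that mirror the start and continuation rules listed above.
TRACE_START_RE = re.compile(
    r'^(?:Exception in thread "[^"]+"\s+)?(?:Caused by:\s+)?'
    r"(?P<exc>(?:[A-Za-z_$][\w$]*\.)*[A-Za-z_$][\w$]*(?:Exception|Error))\b"
)
FRAME_RE = re.compile(r"^\s+at\s|^\s*Caused by:|\.\.\. \d+ more")


def group_stack_traces(lines):
    """Return a list of (exception_type, trace_lines) tuples."""
    traces = []
    current = None
    for line in lines:
        # Continuation lines extend the trace that is currently open.
        if current is not None and FRAME_RE.search(line):
            current[1].append(line)
            continue
        current = None
        start = TRACE_START_RE.match(line)
        if start:
            # Keep only the simple class name as the exception type.
            current = (start.group("exc").rsplit(".", 1)[-1], [line])
            traces.append(current)
    return traces


sample = [
    "2026-05-11 10:05:31 ERROR request failed while loading order",
    'java.lang.NullPointerException: "customer" is null',
    "\tat com.example.orders.OrderService.submit(OrderService.java:92)",
    "Caused by: java.lang.IllegalStateException: empty result",
    "\tat com.example.customers.CustomerRepository.findRequired(CustomerRepository.java:38)",
    "\t... 3 more",
    "2026-05-11 10:08:42 WARN next entry",
]
traces = group_stack_traces(sample)
print(traces[0][0], len(traces[0][1]))  # NullPointerException 5
```

A timestamped log line closes the open trace because it matches neither the frame rules nor the start rule.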
## Timestamp Handling
The scanner attempts to parse:
- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `2026-05-11 10:15:30,123`
- `2026-05-11 10:15:30.123`
- `May 11 10:15:30`
Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed. When `--since` or `--until` is used, lines without parseable timestamps are retained by default so potentially important findings are not silently discarded.
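
A minimal sketch of the retention rule, assuming hypothetical helpers `parse_timestamp` and `in_window`. It covers only the ISO-style forms listed above (not `May 11 10:15:30`) and is not the tool's own implementation.

```python
import re
from datetime import datetime

ISO_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})(?:[,.](\d{1,6}))?"
)


def parse_timestamp(line):
    """Best-effort ISO-style parse; returns None when nothing parses."""
    match = ISO_RE.search(line)
    if not match:
        return None
    date_part, time_part, fraction = match.groups()
    try:
        parsed = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    if fraction:
        # Pad the fraction to microseconds: "123" becomes 123000.
        parsed = parsed.replace(microsecond=int(fraction.ljust(6, "0")))
    return parsed


def in_window(parsed, since=None, until=None):
    """Keep lines without a parseable timestamp instead of dropping them."""
    if parsed is None:
        return True
    if since is not None and parsed < since:
        return False
    if until is not None and parsed > until:
        return False
    return True


since = datetime(2026, 5, 11, 10, 0, 0)
lines = [
    "2026-05-11 09:58:01 INFO Starting",
    "2026-05-11 10:15:30,123 ERROR failure",
    "\tat com.example.Service.run(Service.java:10)",
]
kept = [line for line in lines if in_window(parse_timestamp(line), since=since)]
```

Here `kept` holds the `ERROR` line and the stack frame that has no timestamp, while the earlier `INFO` line is filtered out.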
## Severity Model
Overall status is conservative:
- `OK` - no JVM/application findings.
- `WARNING` - warning-level findings exist but no critical findings exist.
- `CRITICAL` - one or more critical findings exist.
Critical status is driven by JVM memory failures, fatal JVM symptoms, selected class loading errors, TLS/certificate failures, database unavailable or pool exhaustion symptoms, and HTTP 5xx volume at or above the configured threshold.
Warning status is driven by non-fatal exceptions, `WARN`/`ERROR` entries, timeout/retry patterns, connection refused/reset symptoms, slow query findings, and deadlock patterns.
HTTP 5xx findings are warnings until their total reaches `--http-critical-threshold`, which defaults to `5`. The report summarizes findings that require review; it does not claim root cause.
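
The status rule as described in this section can be sketched as below. The function name `overall_status` and its parameters are hypothetical, and the sketch assumes that `critical_count` excludes HTTP 5xx findings, which are counted separately in `http_5xx_count`.

```python
def overall_status(critical_count, warning_count, http_5xx_count, http_critical_threshold=5):
    """Derive the overall status; HTTP 5xx escalates only at the threshold."""
    if critical_count > 0 or http_5xx_count >= http_critical_threshold:
        return "CRITICAL"
    if warning_count > 0 or http_5xx_count > 0:
        return "WARNING"
    return "OK"


print(overall_status(0, 0, 3))                             # WARNING
print(overall_status(0, 0, 3, http_critical_threshold=2))  # CRITICAL
```

With the default threshold of `5`, three HTTP 5xx entries alone stay at `WARNING`; lowering the threshold to `2` escalates the same log to `CRITICAL`.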
## Usage
```bash
cd infra-run/scripts/python/jvm-log-analyzer
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format markdown
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format markdown --output jvm-report.md
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --format json
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --top 10
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --max-samples 5
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --include-stacktraces
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --since "2026-05-11 10:00:00"
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --until "2026-05-11 12:00:00"
python3 jvm_log_analyzer.py --file examples/sample-jvm-app.log --http-critical-threshold 2
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident or application support ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file.
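
One way such an output path check might look is sketched below. `ensure_distinct_output` is a hypothetical helper; the tool's own check may differ in the exception type and message.

```python
import tempfile
from pathlib import Path


def ensure_distinct_output(output_path, input_path):
    """Raise ValueError when the report would overwrite the input log."""
    destination = Path(output_path)
    source = Path(input_path)
    # Compare resolved paths so relative paths and symlinks are caught too.
    if destination.exists() and destination.resolve() == source.resolve():
        raise ValueError("output path must not overwrite the input log file")
    return destination


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "app.log"
    log.write_text("sample\n", encoding="utf-8")
    # A separate report path is accepted.
    report = ensure_distinct_output(Path(tmp) / "report.md", log)
    # The same file reached through a different spelling is rejected.
    try:
        ensure_distinct_output(Path(tmp) / "." / "app.log", log)
        rejected = False
    except ValueError:
        rejected = True
```

The check raises outside any `try` block that would catch its own exception, so a rejected path cannot fall through to the write.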
## Exit Codes
- `0` - OK, no JVM/application findings.
- `1` - JVM/application findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
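
For local automation, a wrapper could branch on these exit codes. The sketch below is an assumption about how that might look: `classify_exit_code` and `run_analyzer` are hypothetical helpers, and only the `--file` and `--format json` options shown in the usage section are used.

```python
import subprocess
import sys

# Exit code meanings as documented above.
EXIT_MEANINGS = {0: "ok", 1: "findings", 2: "invalid"}


def classify_exit_code(returncode):
    """Map an analyzer exit code to a short label."""
    return EXIT_MEANINGS.get(returncode, "unexpected")


def run_analyzer(log_path, script="jvm_log_analyzer.py"):
    """Run the analyzer and return (label, stdout) without raising on findings."""
    completed = subprocess.run(
        [sys.executable, script, "--file", str(log_path), "--format", "json"],
        capture_output=True,
        text=True,
        check=False,  # Exit code 1 means findings, not a crash.
    )
    return classify_exit_code(completed.returncode), completed.stdout
```

Passing `check=False` matters here because exit code `1` is a normal result that signals findings to review.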
## Example Text Output
```text
JVM Log Analyzer
================
Overall status: CRITICAL
Findings require review; logs alone do not prove root cause.
[CRITICAL] OutOfMemoryError
Occurrences: 1
Symptom: jvm_memory
First seen: UNKNOWN
Last seen: UNKNOWN
Stack traces linked: 1
Samples:
- Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 33
Total findings: 27
Total stack traces detected: 4
Critical finding groups: 11
Warning finding groups: 8
HTTP 5xx count: 3
Parsed timestamps count: 21
Unknown timestamps count: 12
```
## Markdown Workflow
Generate a Markdown report from a collected JVM application log and attach it to the incident or application support ticket as supporting evidence:
```bash
python3 jvm_log_analyzer.py \
--file examples/sample-jvm-app.log \
--format markdown \
--include-stacktraces \
--output jvm-report.md
```
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be reviewed alongside application health checks, JVM memory telemetry, database status, certificate state, and recent deployments, together with the relevant application owner.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single log line can match multiple findings, such as `ERROR`, `HTTP 503`, and a Java exception.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
- Stack trace grouping is practical, not a complete Java parser.
- Timestamp parsing is best-effort; unparseable lines are retained during time filtering.
- HTTP 5xx counts are raw log counts, not request rates or customer impact.
- Large log files are read into memory; collect scoped extracts for very large incidents.
## Safety Notes
- The tool only reads the input log and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, tokens, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
@@ -0,0 +1,32 @@
2026-05-11 09:58:01 INFO inventory-api[2214] --- [main] com.example.InventoryApplication : Starting InventoryApplication v2.8.4
2026-05-11 09:58:07 INFO inventory-api[2214] --- [main] com.example.InventoryApplication : Started InventoryApplication in 6.2 seconds
2026-05-11 10:02:14 WARN inventory-api[2214] --- [order-worker-2] com.example.retry.PaymentClient : upstream timeout, retrying payment authorization attempt=2
2026-05-11 10:05:31 ERROR inventory-api[2214] --- [http-nio-8080-exec-7] com.example.orders.OrderController : request failed while loading order id=4812
java.lang.NullPointerException: Cannot invoke "Customer.getStatus()" because "customer" is null
at com.example.orders.OrderService.validateCustomer(OrderService.java:144)
at com.example.orders.OrderService.submit(OrderService.java:92)
at com.example.orders.OrderController.create(OrderController.java:61)
Caused by: java.lang.IllegalStateException: customer lookup returned empty result
at com.example.customers.CustomerRepository.findRequired(CustomerRepository.java:38)
... 3 more
2026-05-11 10:08:42 WARN inventory-api[2214] --- [http-nio-8080-exec-2] com.example.integration.ShippingClient : java.net.SocketTimeoutException: Read timed out calling shipping endpoint
2026-05-11 10:09:13 ERROR inventory-api[2214] --- [pool-4-thread-1] com.example.integration.TaxClient : java.net.ConnectException: connection refused connecting to tax-service:8443
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
2026-05-11 10:13:02 ERROR inventory-api[2214] --- [http-nio-8080-exec-4] com.example.db.InventoryRepository : database unavailable during checkout commit
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:743)
    at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:666)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
    at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
    ... 2 more
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
    at sun.security.provider.certpath.BasicChecker.verifyTimestamp(BasicChecker.java:194)
2026-05-11 10:18:01 ERROR inventory-api[2214] --- [http-nio-8080-exec-8] com.example.web.ErrorHandler : HTTP 500 POST /api/orders requestId=req-1001
2026-05-11 10:18:03 ERROR inventory-api[2214] --- [http-nio-8080-exec-9] com.example.web.ErrorHandler : HTTP 503 GET /api/inventory requestId=req-1002
2026-05-11 10:18:06 ERROR inventory-api[2214] --- [http-nio-8080-exec-3] com.example.web.ErrorHandler : HTTP 503 GET /api/inventory requestId=req-1003
2026-05-11 10:21:27 FATAL inventory-api[2214] --- [main] org.apache.catalina.core.StandardService : JVM failure detected, stopping service
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
    at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
    at com.example.cache.ReportCache.loadAll(ReportCache.java:87)
@@ -0,0 +1,215 @@
# JVM Log Analyzer
- Overall status: CRITICAL
- Finding language is a triage summary; logs alone do not prove root cause.
## CRITICAL: CertificateExpiredException
- Occurrences: 1
- Symptom: tls_certificate
- First seen: 2026-05-11 10:16:40
- Last seen: 2026-05-11 10:16:40
- Stack traces linked: 0
Sample log lines:
```text
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
```
## CRITICAL: CommunicationsException
- Occurrences: 1
- Symptom: database
- First seen: 2026-05-11 10:13:02
- Last seen: 2026-05-11 10:13:02
- Stack traces linked: 0
Sample log lines:
```text
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
```
## CRITICAL: connection pool exhausted
- Occurrences: 1
- Symptom: database
- First seen: 2026-05-11 10:12:55
- Last seen: 2026-05-11 10:12:55
- Stack traces linked: 0
Sample log lines:
```text
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
```
## CRITICAL: database unavailable
- Occurrences: 1
- Symptom: database
- First seen: 2026-05-11 10:13:02
- Last seen: 2026-05-11 10:13:02
- Stack traces linked: 0
Sample log lines:
```text
2026-05-11 10:13:02 ERROR inventory-api[2214] --- [http-nio-8080-exec-4] com.example.db.InventoryRepository : database unavailable during checkout commit
```
## CRITICAL: FATAL
- Occurrences: 1
- Symptom: fatal
- First seen: 2026-05-11 10:21:27
- Last seen: 2026-05-11 10:21:27
- Stack traces linked: 0
Sample log lines:
```text
2026-05-11 10:21:27 FATAL inventory-api[2214] --- [main] org.apache.catalina.core.StandardService : JVM failure detected, stopping service
```
## CRITICAL: Java heap space
- Occurrences: 1
- Symptom: jvm_memory
- First seen: 2026-05-11 10:21:27
- Last seen: 2026-05-11 10:21:27
- Stack traces linked: 0
Sample log lines:
```text
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
```
## CRITICAL: OutOfMemoryError
- Occurrences: 1
- Symptom: jvm_memory
- First seen: 2026-05-11 10:21:27
- Last seen: 2026-05-11 10:21:27
- Stack traces linked: 1
Sample log lines:
```text
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
```
Stack trace samples:
```text
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.base/java.util.Arrays.copyOf(Arrays.java:3537)
    at java.base/java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:228)
    at com.example.cache.ReportCache.loadAll(ReportCache.java:87)
```
## CRITICAL: SQLRecoverableException
- Occurrences: 1
- Symptom: database
- First seen: 2026-05-11 10:13:02
- Last seen: 2026-05-11 10:13:02
- Stack traces linked: 1
Sample log lines:
```text
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
```
Stack trace samples:
```text
java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection
    at oracle.jdbc.driver.T4CConnection.logon(T4CConnection.java:743)
    at oracle.jdbc.driver.PhysicalConnection.connect(PhysicalConnection.java:666)
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: Communications link failure
    at com.mysql.cj.jdbc.ConnectionImpl.createNewIO(ConnectionImpl.java:836)
    ... 2 more
```
## CRITICAL: SSLHandshakeException
- Occurrences: 1
- Symptom: tls_certificate
- First seen: 2026-05-11 10:16:40
- Last seen: 2026-05-11 10:16:40
- Stack traces linked: 1
Sample log lines:
```text
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
```
Stack trace samples:
```text
2026-05-11 10:16:40 ERROR inventory-api[2214] --- [cert-refresh] com.example.security.TrustStoreLoader : javax.net.ssl.SSLHandshakeException: PKIX path validation failed
Caused by: java.security.cert.CertificateExpiredException: NotAfter: Mon May 11 10:00:00 UTC 2026
    at sun.security.provider.certpath.BasicChecker.verifyTimestamp(BasicChecker.java:194)
```
## WARNING: ERROR
- Occurrences: 8
- Symptom: log_level
- First seen: 2026-05-11 10:05:31
- Last seen: 2026-05-11 10:18:06
- Stack traces linked: 0
Sample log lines:
```text
2026-05-11 10:05:31 ERROR inventory-api[2214] --- [http-nio-8080-exec-7] com.example.orders.OrderController : request failed while loading order id=4812
2026-05-11 10:09:13 ERROR inventory-api[2214] --- [pool-4-thread-1] com.example.integration.TaxClient : java.net.ConnectException: connection refused connecting to tax-service:8443
2026-05-11 10:12:55 ERROR inventory-api[2214] --- [HikariPool-1 housekeeper] com.zaxxer.hikari.pool.HikariPool : connection pool exhausted waiting for database connection
```
## Top Exception Types
| Value | Count |
| --- | ---: |
| NullPointerException | 1 |
| IllegalStateException | 1 |
| SocketTimeoutException | 1 |
| ConnectException | 1 |
| SQLRecoverableException | 1 |
| CommunicationsException | 1 |
| SSLHandshakeException | 1 |
| CertificateExpiredException | 1 |
| OutOfMemoryError | 1 |
## Top Operational Symptoms
| Value | Count |
| --- | ---: |
| log_level | 10 |
| database | 4 |
| http_5xx | 3 |
| application_exception | 2 |
| network_timeout | 2 |
| network_connectivity | 2 |
| tls_certificate | 2 |
| jvm_memory | 2 |
| retry | 1 |
| fatal | 1 |
## Operational Summary
- Overall status: CRITICAL
- Total lines scanned: 32
- Total findings: 29
- Total stack traces detected: 4
- Critical finding groups: 9
- Warning finding groups: 11
- HTTP 5xx count: 3
- Parsed timestamps count: 13
- Unknown timestamps count: 19
@@ -0,0 +1,837 @@
#!/usr/bin/env python3
"""Analyze JVM and Java application logs for operational findings."""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
CRITICAL_PATTERNS = [
{"name": "OutOfMemoryError", "pattern": "OutOfMemoryError", "symptom": "jvm_memory"},
{"name": "Java heap space", "pattern": "Java heap space", "symptom": "jvm_memory"},
{"name": "GC overhead limit exceeded", "pattern": "GC overhead limit exceeded", "symptom": "jvm_memory"},
{"name": "StackOverflowError", "pattern": "StackOverflowError", "symptom": "jvm_stack"},
{"name": "NoClassDefFoundError", "pattern": "NoClassDefFoundError", "symptom": "class_loading"},
{"name": "ClassNotFoundException", "pattern": "ClassNotFoundException", "symptom": "class_loading"},
{"name": "ExceptionInInitializerError", "pattern": "ExceptionInInitializerError", "symptom": "class_loading"},
{"name": "SSLHandshakeException", "pattern": "SSLHandshakeException", "symptom": "tls_certificate"},
{"name": "CertificateExpiredException", "pattern": "CertificateExpiredException", "symptom": "tls_certificate"},
{"name": "SQLException", "pattern": "SQLException", "symptom": "database"},
{"name": "SQLRecoverableException", "pattern": "SQLRecoverableException", "symptom": "database"},
{"name": "CommunicationsException", "pattern": "CommunicationsException", "symptom": "database"},
{"name": "database unavailable", "pattern": "database unavailable", "symptom": "database"},
{"name": "connection pool exhausted", "pattern": "connection pool exhausted", "symptom": "database"},
{"name": "FATAL", "pattern": "FATAL", "symptom": "fatal"},
]
WARNING_PATTERNS = [
{"name": "NullPointerException", "pattern": "NullPointerException", "symptom": "application_exception"},
{"name": "IllegalArgumentException", "pattern": "IllegalArgumentException", "symptom": "application_exception"},
{"name": "IllegalStateException", "pattern": "IllegalStateException", "symptom": "application_exception"},
{"name": "SocketTimeoutException", "pattern": "SocketTimeoutException", "symptom": "network_timeout"},
{"name": "ConnectException", "pattern": "ConnectException", "symptom": "network_connectivity"},
{"name": "TimeoutException", "pattern": "TimeoutException", "symptom": "network_timeout"},
{"name": "connection refused", "pattern": "connection refused", "symptom": "network_connectivity"},
{"name": "connection reset", "pattern": "connection reset", "symptom": "network_connectivity"},
{"name": "Broken pipe", "pattern": "Broken pipe", "symptom": "network_connectivity"},
{"name": "WARN", "pattern": "WARN", "symptom": "log_level"},
{"name": "ERROR", "pattern": "ERROR", "symptom": "log_level"},
{"name": "retrying", "pattern": "retrying", "symptom": "retry"},
{"name": "slow query", "pattern": "slow query", "symptom": "database"},
{"name": "deadlock detected", "pattern": "deadlock detected", "symptom": "database"},
]
HTTP_PATTERNS = [
{"name": "HTTP 500", "pattern": "HTTP 500", "symptom": "http_5xx"},
{"name": "HTTP 502", "pattern": "HTTP 502", "symptom": "http_5xx"},
{"name": "HTTP 503", "pattern": "HTTP 503", "symptom": "http_5xx"},
{"name": "HTTP 504", "pattern": "HTTP 504", "symptom": "http_5xx"},
]
ISO_TIMESTAMP_RE = re.compile(
r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
LEVEL_RE = re.compile(r"\b(TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b")
SPRING_LOGGER_RE = re.compile(r"\s---\s+\[[^\]]+\]\s+([A-Za-z0-9_.$-]+)\s*:")
GENERIC_LOGGER_RE = re.compile(
r"\b(?:TRACE|DEBUG|INFO|WARN|ERROR|FATAL)\b\s+(?:\d+\s+)?([A-Za-z0-9_.$-]+)\s*:"
)
THREAD_RE = re.compile(r"\[([^\]]+)\]")
SPRING_THREAD_RE = re.compile(r"\s---\s+\[([^\]]+)\]")
EXCEPTION_RE = re.compile(
r"\b((?:[A-Za-z_$][\w$]*\.)+[A-Za-z_$][\w$]*(?:Exception|Error)|[A-Za-z_$][\w$]*(?:Exception|Error))\b"
)
STACK_FRAME_RE = re.compile(r"^\s+at\s+")
CAUSED_BY_RE = re.compile(r"^\s*Caused by:\s+")
MORE_RE = re.compile(r"^\s*\.\.\.\s+\d+\s+more\b")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Analyze local JVM and Java application logs for operational findings."
    )
    parser.add_argument("--file", required=True, help="Local JVM or Java application log to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of top finding groups, exception types, and symptoms to display. Default: 10.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding group. Default: 3.",
    )
    parser.add_argument(
        "--include-stacktraces",
        action="store_true",
        help="Include short multiline stack trace samples in text and Markdown reports.",
    )
    parser.add_argument(
        "--max-stack-lines",
        type=positive_int,
        default=12,
        help="Maximum lines retained per stack trace sample. Default: 12.",
    )
    parser.add_argument(
        "--http-critical-threshold",
        type=positive_int,
        default=5,
        help="HTTP 5xx count that raises HTTP findings to CRITICAL. Default: 5.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--since",
        type=parse_filter_timestamp,
        help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--until",
        type=parse_filter_timestamp,
        help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
    )
    return parser
def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def parse_filter_timestamp(value: str) -> datetime:
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise argparse.ArgumentTypeError('expected timestamp format "YYYY-MM-DD HH:MM:SS"')
def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    flags = re.IGNORECASE if ignore_case else 0
    compiled = []
    for item in CRITICAL_PATTERNS:
        compiled.append({**item, "severity": "CRITICAL", "kind": "pattern", "regex": re.compile(re.escape(item["pattern"]), flags)})
    for item in WARNING_PATTERNS:
        compiled.append({**item, "severity": "WARNING", "kind": "pattern", "regex": re.compile(re.escape(item["pattern"]), flags)})
    for item in HTTP_PATTERNS:
        compiled.append({**item, "severity": "WARNING", "kind": "http_5xx", "regex": re.compile(re.escape(item["pattern"]), flags)})
    return compiled


def read_log_file(path: Path) -> list[str]:
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        fraction = iso_match.group(3) or ""
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        parse_value = raw
        fmt = "%Y-%m-%d %H:%M:%S"
        if fraction:
            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
            fmt = "%Y-%m-%d %H:%M:%S.%f"
        try:
            return datetime.strptime(parse_value, fmt), raw + fraction
        except ValueError:
            return None, UNKNOWN
    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        try:
            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw
    return None, UNKNOWN


def line_in_time_window(
    parsed_at: datetime | None, since: datetime | None, until: datetime | None
) -> bool:
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True
def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def extract_level(line: str) -> str:
    match = LEVEL_RE.search(line)
    if match:
        return match.group(1)
    return UNKNOWN


def extract_thread(line: str) -> str:
    for regex in (SPRING_THREAD_RE, THREAD_RE):
        match = regex.search(line)
        if match:
            return match.group(1)
    return UNKNOWN


def extract_logger(line: str) -> str:
    for regex in (SPRING_LOGGER_RE, GENERIC_LOGGER_RE):
        match = regex.search(line)
        if match:
            return match.group(1)
    return UNKNOWN


def normalize_exception_type(value: str) -> str:
    return value.split(".")[-1]


def extract_exception_type(line: str) -> str:
    match = EXCEPTION_RE.search(line)
    if match:
        return normalize_exception_type(match.group(1))
    return UNKNOWN


def is_stack_start(line: str) -> bool:
    return (
        "Exception in thread" in line
        or CAUSED_BY_RE.search(line) is not None
        or EXCEPTION_RE.search(line) is not None
    )


def is_stack_continuation(line: str) -> bool:
    return (
        STACK_FRAME_RE.search(line) is not None
        or CAUSED_BY_RE.search(line) is not None
        or MORE_RE.search(line) is not None
    )
def update_seen(
    group: dict[str, Any], parsed_at: datetime | None, rendered_at: str
) -> None:
    if parsed_at is None:
        return
    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
        group["first_seen"] = (parsed_at, rendered_at)
    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
        group["last_seen"] = (parsed_at, rendered_at)


def append_limited(items: list[Any], value: Any, limit: int) -> None:
    if limit == 0:
        return
    if value in items:
        return
    if len(items) < limit:
        items.append(value)


def finding_key(severity: str, name: str) -> str:
    return f"{severity}::{name}"


def ensure_group(
    groups: dict[str, dict[str, Any]],
    name: str,
    severity: str,
    symptom: str,
    kind: str,
) -> dict[str, Any]:
    key = finding_key(severity, name)
    return groups.setdefault(
        key,
        {
            "name": name,
            "severity": severity,
            "symptom": symptom,
            "kind": kind,
            "occurrences": 0,
            "stack_trace_count": 0,
            "first_seen": None,
            "last_seen": None,
            "samples": [],
            "stack_trace_samples": [],
            "fields": [],
        },
    )
def add_finding(
    groups: dict[str, dict[str, Any]],
    name: str,
    severity: str,
    symptom: str,
    kind: str,
    line: str,
    parsed_at: datetime | None,
    rendered_at: str,
    max_samples: int,
) -> dict[str, Any]:
    group = ensure_group(groups, name, severity, symptom, kind)
    group["occurrences"] += 1
    update_seen(group, parsed_at, rendered_at)
    append_limited(group["samples"], line, max_samples)
    append_limited(
        group["fields"],
        {
            "timestamp": rendered_at,
            "log_level": extract_level(line),
            "logger": extract_logger(line),
            "thread": extract_thread(line),
            "exception_type": extract_exception_type(line),
            "raw": line,
        },
        max_samples,
    )
    return group


def record_stack_trace(
    groups: dict[str, dict[str, Any]],
    stack: dict[str, Any],
    max_samples: int,
    max_stack_lines: int,
) -> None:
    exception_type = stack["exception_type"] if stack["exception_type"] != UNKNOWN else "Java stack trace"
    severity = severity_for_exception(exception_type)
    group = ensure_group(groups, exception_type, severity, "stack_trace", "stack_trace")
    group["stack_trace_count"] += 1
    update_seen(group, stack["parsed_at"], stack["rendered_at"])
    append_limited(group["samples"], stack["lines"][0], max_samples)
    append_limited(group["stack_trace_samples"], stack["lines"][:max_stack_lines], max_samples)


def severity_for_exception(exception_type: str) -> str:
    critical = {item["name"] for item in CRITICAL_PATTERNS}
    if exception_type in critical or exception_type in {"OutOfMemoryError", "StackOverflowError"}:
        return "CRITICAL"
    return "WARNING"
def detect_stack_traces(
    included: list[dict[str, Any]],
    groups: dict[str, dict[str, Any]],
    max_samples: int,
    max_stack_lines: int,
) -> int:
    stack: dict[str, Any] | None = None
    stack_count = 0
    for item in included:
        line = item["line"]
        if stack is None:
            if is_stack_start(line):
                stack = {
                    "lines": [line],
                    "exception_type": extract_exception_type(line),
                    "parsed_at": item["parsed_at"],
                    "rendered_at": item["rendered_at"],
                }
            continue
        if is_stack_continuation(line):
            stack["lines"].append(line)
            if stack["exception_type"] == UNKNOWN:
                stack["exception_type"] = extract_exception_type(line)
            continue
        if len(stack["lines"]) > 1:
            record_stack_trace(groups, stack, max_samples, max_stack_lines)
            stack_count += 1
        stack = None
        if is_stack_start(line):
            stack = {
                "lines": [line],
                "exception_type": extract_exception_type(line),
                "parsed_at": item["parsed_at"],
                "rendered_at": item["rendered_at"],
            }
    if stack is not None and len(stack["lines"]) > 1:
        record_stack_trace(groups, stack, max_samples, max_stack_lines)
        stack_count += 1
    return stack_count
def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    since: datetime | None,
    until: datetime | None,
    top: int,
    max_samples: int,
    max_stack_lines: int,
    http_critical_threshold: int,
) -> dict[str, Any]:
    syslog_year = since.year if since is not None else datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    exception_counts: Counter[str] = Counter()
    symptom_counts: Counter[str] = Counter()
    parsed_timestamps = 0
    unknown_timestamps = 0
    included: list[dict[str, Any]] = []
    http_5xx_count = 0
    context_parsed_at: datetime | None = None
    context_rendered_at = UNKNOWN

    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if parsed_at is None:
            unknown_timestamps += 1
        else:
            parsed_timestamps += 1
            context_parsed_at = parsed_at
            context_rendered_at = rendered_at
        if not line_in_time_window(parsed_at, since, until):
            continue

        # Stack trace frames often omit timestamps; keep nearby log context for first/last seen.
        effective_parsed_at = parsed_at
        effective_rendered_at = rendered_at
        if parsed_at is None and (is_stack_start(line) or is_stack_continuation(line)):
            effective_parsed_at = context_parsed_at
            effective_rendered_at = context_rendered_at

        included.append(
            {
                "line": line,
                "parsed_at": effective_parsed_at,
                "rendered_at": effective_rendered_at,
            }
        )

        matched_names = set()
        for item in patterns:
            if not item["regex"].search(line):
                continue
            severity = item["severity"]
            if item["kind"] == "http_5xx":
                http_5xx_count += 1
            add_finding(
                groups=groups,
                name=item["name"],
                severity=severity,
                symptom=item["symptom"],
                kind=item["kind"],
                line=line,
                parsed_at=effective_parsed_at,
                rendered_at=effective_rendered_at,
                max_samples=max_samples,
            )
            symptom_counts[item["symptom"]] += 1
            matched_names.add(item["name"])

        exception_type = extract_exception_type(line)
        if exception_type != UNKNOWN:
            exception_counts[exception_type] += 1
            if exception_type not in matched_names:
                severity = severity_for_exception(exception_type)
                add_finding(
                    groups=groups,
                    name=exception_type,
                    severity=severity,
                    symptom="application_exception",
                    kind="exception",
                    line=line,
                    parsed_at=effective_parsed_at,
                    rendered_at=effective_rendered_at,
                    max_samples=max_samples,
                )
                symptom_counts["application_exception"] += 1

    stack_trace_count = detect_stack_traces(included, groups, max_samples, max_stack_lines)
    promote_http_5xx(groups, http_5xx_count, http_critical_threshold)

    findings = sorted(
        (render_group(group) for group in groups.values()),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["name"].lower(),
        ),
    )
    summary = build_summary(
        total_lines=len(lines),
        findings=findings,
        stack_trace_count=stack_trace_count,
        http_5xx_count=http_5xx_count,
        parsed_timestamps=parsed_timestamps,
        unknown_timestamps=unknown_timestamps,
    )
    return {
        "summary": summary,
        "findings": findings[:top],
        "top_exception_types": top_items(exception_counts, top),
        "top_operational_symptoms": top_items(symptom_counts, top),
    }
def promote_http_5xx(
    groups: dict[str, dict[str, Any]], http_5xx_count: int, threshold: int
) -> None:
    if http_5xx_count < threshold:
        return
    http_names = {item["name"] for item in HTTP_PATTERNS}
    for old_key, group in list(groups.items()):
        if group["name"] not in http_names or group["severity"] == "CRITICAL":
            continue
        group["severity"] = "CRITICAL"
        new_key = finding_key("CRITICAL", group["name"])
        groups[new_key] = group
        del groups[old_key]


def render_group(group: dict[str, Any]) -> dict[str, Any]:
    return {
        "name": group["name"],
        "severity": group["severity"],
        "symptom": group["symptom"],
        "kind": group["kind"],
        "occurrences": group["occurrences"],
        "stack_trace_count": group["stack_trace_count"],
        "first_seen": render_seen(group["first_seen"]),
        "last_seen": render_seen(group["last_seen"]),
        "samples": group["samples"],
        "stack_trace_samples": group["stack_trace_samples"],
        "fields": group["fields"],
    }


def build_summary(
    total_lines: int,
    findings: list[dict[str, Any]],
    stack_trace_count: int,
    http_5xx_count: int,
    parsed_timestamps: int,
    unknown_timestamps: int,
) -> dict[str, Any]:
    critical_groups = sum(1 for item in findings if item["severity"] == "CRITICAL")
    warning_groups = sum(1 for item in findings if item["severity"] == "WARNING")
    total_findings = sum(item["occurrences"] for item in findings)
    if critical_groups > 0:
        status = "CRITICAL"
    elif warning_groups > 0:
        status = "WARNING"
    else:
        status = "OK"
    return {
        "overall_status": status,
        "total_lines_scanned": total_lines,
        "total_findings": total_findings,
        "total_stack_traces_detected": stack_trace_count,
        "critical_finding_groups": critical_groups,
        "warning_finding_groups": warning_groups,
        "http_5xx_count": http_5xx_count,
        "timestamp_coverage": {
            "parsed_timestamps_count": parsed_timestamps,
            "unknown_timestamps_count": unknown_timestamps,
        },
    }


def top_items(counter: Counter[str], limit: int) -> list[dict[str, Any]]:
    return [{"value": value, "count": count} for value, count in counter.most_common(limit)]
def render_text(report: dict[str, Any], include_stacktraces: bool) -> str:
    lines = ["JVM Log Analyzer", "================", ""]
    summary = report["summary"]
    lines.extend(
        [
            f"Overall status: {summary['overall_status']}",
            "Findings require review; logs alone do not prove root cause.",
            "",
        ]
    )
    if not report["findings"]:
        lines.extend(["No configured JVM/application findings were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['name']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"Symptom: {finding['symptom']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    f"Stack traces linked: {finding['stack_trace_count']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f" - {sample}" for sample in finding["samples"])
            else:
                lines.append(" - No samples retained")
            if include_stacktraces and finding["stack_trace_samples"]:
                lines.append("Stack trace samples:")
                for stack in finding["stack_trace_samples"]:
                    lines.append(" ---")
                    lines.extend(f" {entry}" for entry in stack)
            lines.append("")
    lines.extend(render_text_table("Top Exception Types", report["top_exception_types"]))
    lines.extend(render_text_table("Top Operational Symptoms", report["top_operational_symptoms"]))
    lines.extend(render_text_summary(summary))
    return "\n".join(lines) + "\n"


def render_text_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [title, "-" * len(title)]
    if not rows:
        lines.append("No entries detected.")
    else:
        lines.extend(f"- {item['value']}: {item['count']}" for item in rows)
    lines.append("")
    return lines


def render_text_summary(summary: dict[str, Any]) -> list[str]:
    coverage = summary["timestamp_coverage"]
    return [
        "Operational Summary",
        "-------------------",
        f"Overall status: {summary['overall_status']}",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Total findings: {summary['total_findings']}",
        f"Total stack traces detected: {summary['total_stack_traces_detected']}",
        f"Critical finding groups: {summary['critical_finding_groups']}",
        f"Warning finding groups: {summary['warning_finding_groups']}",
        f"HTTP 5xx count: {summary['http_5xx_count']}",
        f"Parsed timestamps count: {coverage['parsed_timestamps_count']}",
        f"Unknown timestamps count: {coverage['unknown_timestamps_count']}",
    ]
def render_markdown(report: dict[str, Any], include_stacktraces: bool) -> str:
    summary = report["summary"]
    lines = [
        "# JVM Log Analyzer",
        "",
        f"- Overall status: {summary['overall_status']}",
        "- Finding language is a triage summary; logs alone do not prove root cause.",
        "",
    ]
    if not report["findings"]:
        lines.extend(["No configured JVM/application findings were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"## {finding['severity']}: {finding['name']}",
                    "",
                    f"- Occurrences: {finding['occurrences']}",
                    f"- Symptom: {finding['symptom']}",
                    f"- First seen: {finding['first_seen']}",
                    f"- Last seen: {finding['last_seen']}",
                    f"- Stack traces linked: {finding['stack_trace_count']}",
                    "",
                    "Sample log lines:",
                    "",
                ]
            )
            if finding["samples"]:
                lines.append("```text")
                lines.extend(finding["samples"])
                lines.append("```")
            else:
                lines.append("_No samples retained._")
            lines.append("")
            if include_stacktraces and finding["stack_trace_samples"]:
                lines.extend(["Stack trace samples:", ""])
                for stack in finding["stack_trace_samples"]:
                    lines.append("```text")
                    lines.extend(stack)
                    lines.append("```")
                    lines.append("")
    lines.extend(render_markdown_table("Top Exception Types", report["top_exception_types"]))
    lines.extend(render_markdown_table("Top Operational Symptoms", report["top_operational_symptoms"]))
    lines.extend(render_markdown_summary(summary))
    return "\n".join(lines)


def render_markdown_table(title: str, rows: list[dict[str, Any]]) -> list[str]:
    lines = [f"## {title}", ""]
    if not rows:
        lines.extend(["No entries detected.", ""])
        return lines
    lines.extend(["| Value | Count |", "| --- | ---: |"])
    lines.extend(f"| {item['value']} | {item['count']} |" for item in rows)
    lines.append("")
    return lines


def render_markdown_summary(summary: dict[str, Any]) -> list[str]:
    coverage = summary["timestamp_coverage"]
    return [
        "## Operational Summary",
        "",
        f"- Overall status: {summary['overall_status']}",
        f"- Total lines scanned: {summary['total_lines_scanned']}",
        f"- Total findings: {summary['total_findings']}",
        f"- Total stack traces detected: {summary['total_stack_traces_detected']}",
        f"- Critical finding groups: {summary['critical_finding_groups']}",
        f"- Warning finding groups: {summary['warning_finding_groups']}",
        f"- HTTP 5xx count: {summary['http_5xx_count']}",
        f"- Parsed timestamps count: {coverage['parsed_timestamps_count']}",
        f"- Unknown timestamps count: {coverage['unknown_timestamps_count']}",
        "",
    ]
def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"


def write_report(input_path: Path, output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return
    path = Path(output_path)
    try:
        if path.resolve() == input_path.resolve():
            raise OSError("output path must not be the same as input file")
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    input_path = Path(args.file)
    if args.since is not None and args.until is not None and args.since > args.until:
        parser.error("--since must be earlier than or equal to --until")
    try:
        lines = read_log_file(input_path)
        report = analyze_log(
            lines=lines,
            patterns=compile_patterns(args.ignore_case),
            since=args.since,
            until=args.until,
            top=args.top,
            max_samples=args.max_samples,
            max_stack_lines=args.max_stack_lines,
            http_critical_threshold=args.http_critical_threshold,
        )
        if args.format == "text":
            content = render_text(report, args.include_stacktraces)
        elif args.format == "markdown":
            content = render_markdown(report, args.include_stacktraces)
        else:
            content = render_json(report)
        write_report(input_path, args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID
    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,199 @@
# known-error-matcher
`known-error-matcher` is a read-only Python CLI for scanning local log files against a JSON catalog of known operational error patterns. It connects matched log symptoms with severity, category, sample lines, and runbook references so an infrastructure engineer can decide what needs review next.
The tool matches known operational error patterns that require review. It does not prove an incident, identify root cause automatically, or replace service-specific runbooks.
## Purpose
- Identify which cataloged operational problems are visible in a collected log.
- Count how often each known error pattern appears.
- Surface warning and critical matches conservatively.
- Point operators toward relevant runbooks or supporting local tools.
- Produce predictable text, Markdown, or JSON output for incident notes.
## When To Use
- During incident response when a collected application, system, or journal extract needs quick known-error matching.
- Before attaching log evidence to an incident, problem, or change ticket.
- When teams maintain a small local catalog of operational patterns and runbook links.
- When JSON output is useful for later local automation.
## What It Does Not Do
- It does not read remote systems or live streams.
- It does not modify logs, services, applications, accounts, or host state.
- It does not query ELK, SIEM, APM, Zabbix, ticketing systems, or external services.
- It does not find root cause automatically.
- It does not prove an incident or confirm customer impact.
- It does not classify every vendor-specific log message.
## Pattern Catalog Format
Patterns are defined in JSON because the Python standard library can parse JSON without third-party dependencies.
```json
{
  "patterns": [
    {
      "id": "disk_full",
      "name": "Disk full",
      "severity": "CRITICAL",
      "regex": "No space left on device|disk full",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem or application failed because free space was exhausted."
    }
  ]
}
```
Required fields per pattern:
- `id` - stable non-empty identifier.
- `name` - human-readable finding name.
- `severity` - `WARNING` or `CRITICAL`.
- `regex` - Python regular expression used for matching.
Optional fields:
- `category` - operational grouping such as `storage`, `network`, `security`, `application`, or `systemd`. Missing values are reported as `UNKNOWN`.
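- `runbook` - repository path to review when the pattern matches. Missing or empty values are shown as `None` in text and Markdown reports and as an empty string in JSON output.
- `description` - short operator-facing explanation. Missing or empty values are shown as `None` in text and Markdown reports and as an empty string in JSON output.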
The catalog is validated before scanning starts. Invalid JSON, missing required fields, duplicate IDs, invalid severity values, and invalid regexes fail with exit code `2`.
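The checks listed above can be approximated outside the tool before committing a catalog change. The following is a minimal sketch, not the tool's own validator: the function name `validate_catalog` and the wording of the messages are assumptions made for illustration.

```python
import json
import re

VALID_SEVERITIES = {"WARNING", "CRITICAL"}
REQUIRED_FIELDS = ("id", "name", "severity", "regex")


def validate_catalog(text: str) -> list[str]:
    """Return validation problems; an empty list means the catalog passed.

    Hypothetical helper for illustration, not part of the tool.
    """
    try:
        catalog = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(catalog, dict) or not isinstance(catalog.get("patterns"), list):
        return ['a top-level object with a "patterns" list is required']

    problems: list[str] = []
    seen_ids: set[str] = set()
    for index, item in enumerate(catalog["patterns"], start=1):
        if not isinstance(item, dict):
            problems.append(f"pattern #{index}: must be an object")
            continue
        # Every required field must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"pattern #{index}: {field} is required")
        pattern_id = item.get("id")
        if isinstance(pattern_id, str) and pattern_id.strip():
            if pattern_id.strip() in seen_ids:
                problems.append(f"pattern #{index}: duplicate id {pattern_id.strip()}")
            seen_ids.add(pattern_id.strip())
        severity = item.get("severity")
        if isinstance(severity, str) and severity.strip():
            if severity.strip().upper() not in VALID_SEVERITIES:
                problems.append(f"pattern #{index}: severity must be WARNING or CRITICAL")
        regex_text = item.get("regex")
        if isinstance(regex_text, str) and regex_text.strip():
            try:
                re.compile(regex_text)
            except re.error as exc:
                problems.append(f"pattern #{index}: invalid regex: {exc}")
    return problems
```

Collecting every problem before reporting, rather than stopping at the first one, lets a catalog author fix several entries in one pass.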
## Adding A Known Error Pattern
Add a new object under `patterns` in `patterns.json`:
```json
{
  "id": "example_dependency_failure",
  "name": "Example dependency failure",
  "severity": "WARNING",
  "regex": "dependency request failed|upstream dependency unavailable",
  "category": "application",
  "runbook": "infra-run/runbooks/incidents/dependency-failure.md",
  "description": "Application logged a dependency failure that requires review."
}
```
Use a stable `id`, choose the lowest severity that still reflects operational risk, and keep the regex specific enough to avoid noisy generic matches. Prefer a runbook path that already exists; otherwise use a plausible future path under `infra-run/runbooks/incidents/` or leave it empty.
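One way to judge whether a candidate regex is specific enough is to preview it against a few representative lines before adding it to the catalog. This is a rough sketch under stated assumptions: `preview_matches` is a hypothetical helper, and the sample lines are invented for illustration rather than taken from a real log.

```python
import re


def preview_matches(regex_text: str, lines: list[str], ignore_case: bool = False) -> list[str]:
    """Return the lines a candidate catalog regex would match (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(regex_text, flags)
    return [line for line in lines if compiled.search(line)]


# Invented sample lines, for illustration only.
sample_lines = [
    "2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream dependency unavailable",
    "2026-05-11 10:16:12 app01 checkout-api[1842]: INFO dependency cache refreshed",
    "2026-05-11 10:16:13 app01 checkout-api[1842]: ERROR Dependency request failed",
]

candidate = "dependency request failed|upstream dependency unavailable"
specific = preview_matches(candidate, sample_lines)
noisy = preview_matches("dependency", sample_lines)

print(f"specific regex matched {len(specific)} line(s)")
print(f"generic regex matched {len(noisy)} line(s), including an INFO line")
```

In this made-up sample the generic regex also matches a harmless `INFO` line, while the specific regex misses the capitalized `Dependency request failed` line unless `ignore_case=True` is passed, which mirrors the effect of `--ignore-case`.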
## Severity Model
Overall status is conservative:
- `OK` - no known error patterns matched.
- `WARNING` - one or more warning patterns matched and no critical patterns matched.
- `CRITICAL` - one or more critical patterns matched.
The status means known error patterns require review. It is not a final root-cause statement.
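The severity roll-up described above amounts to "worst severity wins". A minimal sketch, assuming the matched severities are available as plain strings (the function name `overall_status` is illustrative):

```python
def overall_status(matched_severities: list[str]) -> str:
    """Collapse the severities of all matched patterns into one overall status."""
    severities = set(matched_severities)
    if "CRITICAL" in severities:
        return "CRITICAL"
    if "WARNING" in severities:
        return "WARNING"
    return "OK"


print(overall_status([]))
print(overall_status(["WARNING", "WARNING"]))
print(overall_status(["WARNING", "CRITICAL"]))
```

A single critical match outweighs any number of warning matches, which keeps the status conservative.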
## Category Filtering
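Use `--category CATEGORY` to include only matches where the pattern category exactly matches the provided value. The comparison is case-sensitive and is not affected by `--ignore-case`. Patterns without a category can be selected with `--category UNKNOWN`.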
Examples:
```bash
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json --category storage
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --category application
```
## Usage
```bash
cd infra-run/scripts/python/known-error-matcher
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown --output known-error-report.md
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --ignore-case
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --severity critical
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --top 10
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --max-samples 5
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident, problem, or change ticket attachment format.
- `json` - structured output for local automation.
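As an illustration of local automation on top of the JSON format, the sketch below collects the runbooks referenced by critical findings. The field names `findings`, `severity`, and `runbook` follow the report structure built by the script; the helper name `critical_runbooks` and the trimmed sample report are assumptions made for this example.

```python
import json


def critical_runbooks(report_text: str) -> list[str]:
    """Return runbook paths referenced by CRITICAL findings, without duplicates."""
    report = json.loads(report_text)
    runbooks: list[str] = []
    for finding in report.get("findings", []):
        runbook = finding.get("runbook") or ""
        if finding.get("severity") == "CRITICAL" and runbook and runbook not in runbooks:
            runbooks.append(runbook)
    return runbooks


# Hypothetical, heavily trimmed report used only for illustration.
sample_report = json.dumps(
    {
        "overall_status": "CRITICAL",
        "findings": [
            {"id": "http_500", "severity": "CRITICAL", "runbook": "infra-run/runbooks/incidents/http-5xx.md"},
            {"id": "http_503", "severity": "CRITICAL", "runbook": "infra-run/runbooks/incidents/http-5xx.md"},
            {"id": "timeout", "severity": "WARNING", "runbook": "infra-run/scripts/bash/os-healthcheck/README.md"},
        ],
    }
)

print(critical_runbooks(sample_report))
```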
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file or pattern catalog file.
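The overwrite protection can be pictured as a comparison of resolved paths, so that `./patterns.json` and `patterns.json` count as the same file. A small sketch, assuming a hypothetical helper named `is_protected`:

```python
from pathlib import Path


def is_protected(output_path: str, protected_inputs: list[str]) -> bool:
    """Return True when the output path resolves to one of the input files."""
    destination = Path(output_path).resolve()
    return any(Path(item).resolve() == destination for item in protected_inputs)


inputs = ["examples/sample-app.log", "patterns.json"]
print(is_protected("./patterns.json", inputs))
print(is_protected("known-error-report.md", inputs))
```

Comparing resolved paths instead of raw strings also catches relative-path variants that point at an input file.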
## Exit Codes
- `0` - OK, no known error matches.
- `1` - Known error matches detected.
- `2` - Invalid input, unreadable file, invalid JSON, invalid pattern catalog, invalid regex, bad argument, output write failure, or runtime error.
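A wrapper script could branch on these exit codes without parsing the report. The sketch below is one possible shape and not part of the tool: `run_and_classify` and the labels in `EXIT_MEANINGS` are assumptions, and the command passed in would be the `known_error_matcher.py` invocation in practice.

```python
import subprocess
import sys

# Labels are illustrative; only the numeric codes come from the tool's documentation.
EXIT_MEANINGS = {0: "OK", 1: "FINDINGS", 2: "INVALID"}


def run_and_classify(command: list[str]) -> str:
    """Run a command and map its exit code to a coarse label."""
    completed = subprocess.run(command, capture_output=True, text=True, check=False)
    return EXIT_MEANINGS.get(completed.returncode, "UNEXPECTED")


# Stand-in command that simply exits with code 1, used here instead of the real tool.
print(run_and_classify([sys.executable, "-c", "import sys; sys.exit(1)"]))
```

Passing `check=False` matters here because exit code `1` signals findings rather than a failed run.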
## Example Text Output
```text
Known Error Matcher
===================
Overall status: CRITICAL
Known error pattern matches require operator review; logs alone do not prove root cause.
[CRITICAL] database_unavailable - Database unavailable
Category: application
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Runbook: infra-run/scripts/python/jvm-log-analyzer/README.md
Description: Application logged unavailable database or database connectivity symptoms.
Samples:
- 2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 9
Known error matches: 7
Matched known error patterns: 7
Critical matched patterns: 5
Warning matched patterns: 2
Top categories: application (3), network (2), application_jvm (2)
Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
Timestamp coverage: parsed=9, unknown=0
Filters used: severity=None, category=None
Pattern catalog path: patterns.json
```
## Markdown Workflow
Generate a Markdown report from the collected log and attach it to the incident or problem ticket as supporting evidence:
```bash
python3 known_error_matcher.py \
--file examples/sample-app.log \
--patterns patterns.json \
--format markdown \
--output known-error-report.md
```
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be correlated with service health, monitoring, recent changes, dependency status, and the referenced runbook.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single log line can match multiple known error patterns.
- Matching is case-sensitive by default, so a pattern can miss log lines that differ only in letter case unless `--ignore-case` is used.
- Timestamp parsing is best-effort; unparseable timestamps are reported as `UNKNOWN`. Syslog-style timestamps carry no year, so the current year is assumed.
- Counts are raw log-line matches, not request rates, incident duration, or customer impact.
- `--top` limits the displayed findings and the top category and top known error lists. The summary counts still reflect all matched patterns after filters.
- Large log files are read into memory; use scoped extracts for very large incidents.
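The point that one log line can count toward several patterns is easy to see in a small sketch. The two regexes below are simplified versions of catalog entries, and `out_of_memory` is compiled case-insensitively to mimic `--ignore-case`; the helper `count_matches` is hypothetical.

```python
import re
from collections import Counter

# Simplified stand-ins for two catalog entries.
patterns = {
    "oom_killer": re.compile(r"oom-killer|Killed process \d+"),
    "out_of_memory": re.compile(r"\bout of memory\b", re.IGNORECASE),
}


def count_matches(lines: list[str], compiled: dict[str, re.Pattern[str]]) -> Counter:
    """Count matching lines per pattern; one line may increment several counters."""
    counts: Counter = Counter()
    for line in lines:
        for pattern_id, regex in compiled.items():
            if regex.search(line):
                counts[pattern_id] += 1
    return counts


lines = [
    "May 11 10:15:42 web01 kernel: Out of memory: Killed process 2281 (java) total-vm:2097152kB",
]
counts = count_matches(lines, patterns)
print(dict(counts), "total matches:", sum(counts.values()))
```

A single kernel line produces two matches here, which is why match totals should not be read as a count of distinct events.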
## Safety Notes
- The tool only reads the input log and pattern catalog and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, tokens, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
@@ -0,0 +1,9 @@
2026-05-11 10:15:30 app01 checkout-api[1842]: INFO request_id=a1 path=/checkout status=200 duration_ms=42
2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted
2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443
2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms
2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed
2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"
2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space
2026-05-11 10:17:03 app01 checkout-api[1842]: INFO healthcheck completed status=degraded
@@ -0,0 +1,97 @@
# Known Error Matcher Report
- Overall status: `CRITICAL`
- Known error pattern matches require operator review; logs alone do not prove root cause.
## Matched Known Errors
### [CRITICAL] database_unavailable - Database unavailable
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:07`
- Last seen: `2026-05-11 10:16:07`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Application logged unavailable database or database connectivity symptoms.
- Samples:
- `2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool`
### [CRITICAL] http_500 - HTTP 500
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:02`
- Last seen: `2026-05-11 10:16:02`
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
- Description: Application or proxy logged HTTP 500 responses.
- Samples:
- `2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted`
### [CRITICAL] http_503 - HTTP 503
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:31.456`
- Last seen: `2026-05-11 10:16:31.456`
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
- Description: Application or proxy logged HTTP 503 service unavailable responses.
- Samples:
- `2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"`
### [CRITICAL] java_out_of_memory - Java OutOfMemoryError
- Category: `application_jvm`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:40`
- Last seen: `2026-05-11 10:16:40`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Java process logged memory exhaustion symptoms.
- Samples:
- `2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space`
### [CRITICAL] ssl_handshake_exception - SSLHandshakeException
- Category: `application_jvm`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:22`
- Last seen: `2026-05-11 10:16:22`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Java TLS handshake exception was logged.
- Samples:
- `2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed`
### [WARNING] connection_refused - Connection refused
- Category: `network`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:11`
- Last seen: `2026-05-11 10:16:11`
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
- Description: Client connection attempts were refused by the destination service or host.
- Samples:
- `2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443`
### [WARNING] timeout - Timeout
- Category: `network`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:15,123`
- Last seen: `2026-05-11 10:16:15,123`
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
- Description: Operation timed out and may require network, service, or dependency review.
- Samples:
- `2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms`
## Operational Summary
- Overall status: `CRITICAL`
- Total lines scanned: `9`
- Known error matches: `7`
- Matched known error patterns: `7`
- Critical matched patterns: `5`
- Warning matched patterns: `2`
- Top categories: application (3), network (2), application_jvm (2)
- Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
- Timestamp coverage: parsed=`9`, unknown=`0`
- Filters used: severity=`None`, category=`None`
- Pattern catalog path: `patterns.json`
@@ -0,0 +1,10 @@
May 11 10:15:30 web01 kernel: EXT4-fs warning: No space left on device while writing /var/log/messages
May 11 10:15:35 web01 kernel: EXT4-fs error (device dm-0): Remounting filesystem read-only
May 11 10:15:41 web01 kernel: nginx invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE)
May 11 10:15:42 web01 kernel: Out of memory: Killed process 2281 (java) total-vm:2097152kB
May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
May 11 10:16:12 web01 systemd[1]: Dependency failed for webapp.service - Local web application.
May 11 10:16:13 web01 systemd[1]: nginx.service: Start request repeated too quickly.
May 11 10:16:25 web01 sshd[3371]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.50 user=deploy
May 11 10:16:31 web01 sudo: deploy : command not allowed ; TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/bin/systemctl restart webapp
May 11 10:16:32 web01 sudo: deploy : permission denied while opening /etc/sudoers.d/webapp
@@ -0,0 +1,562 @@
#!/usr/bin/env python3
"""Match local logs against a JSON catalog of known operational error patterns."""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
VALID_SEVERITIES = {"WARNING", "CRITICAL"}
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
ISO_TIMESTAMP_RE = re.compile(
r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Scan a local log file for known operational error patterns."
    )
    parser.add_argument("--file", required=True, help="Local log file to scan.")
    parser.add_argument("--patterns", required=True, help="JSON known error pattern catalog.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--severity",
        choices=("warning", "critical"),
        help="Only include findings with this severity.",
    )
    parser.add_argument(
        "--category",
        help="Only include findings from this exact category.",
    )
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of matched known errors and summary entries to display. Default: 10.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per matched known error. Default: 3.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Compile catalog regex patterns case-insensitively.",
    )
    return parser
def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number
def read_text_file(path: Path, label: str) -> str:
    if not path.exists():
        raise OSError(f"{label} does not exist: {path}")
    if not path.is_file():
        raise OSError(f"{label} is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"{label} is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read {label} {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"{label} is empty: {path}")
    return text
def load_pattern_catalog(path: Path, ignore_case: bool) -> list[dict[str, Any]]:
    text = read_text_file(path, "pattern catalog")
    try:
        catalog = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in pattern catalog {path}: {exc}") from exc
    errors: list[str] = []
    if not isinstance(catalog, dict):
        raise ValueError("invalid pattern catalog: top-level JSON value must be an object")
    if "patterns" not in catalog:
        raise ValueError('invalid pattern catalog: missing top-level "patterns" field')
    if not isinstance(catalog["patterns"], list):
        raise ValueError('invalid pattern catalog: "patterns" must be a list')
    seen_ids: set[str] = set()
    compiled_patterns: list[dict[str, Any]] = []
    flags = re.IGNORECASE if ignore_case else 0
    for index, item in enumerate(catalog["patterns"], start=1):
        if not isinstance(item, dict):
            errors.append(f"pattern #{index}: must be an object")
            continue
        pattern_id = normalize_required_text(item, "id")
        name = normalize_required_text(item, "name")
        severity = normalize_required_text(item, "severity").upper()
        regex_text = normalize_required_text(item, "regex")
        if not pattern_id:
            errors.append(f"pattern #{index}: id is required and must be non-empty")
        elif pattern_id in seen_ids:
            errors.append(f"pattern #{index}: duplicate id {pattern_id}")
        else:
            seen_ids.add(pattern_id)
        if not name:
            errors.append(f"pattern {pattern_id or f'#{index}'}: name is required and must be non-empty")
        if severity not in VALID_SEVERITIES:
            errors.append(
                f"pattern {pattern_id or f'#{index}'}: severity must be WARNING or CRITICAL"
            )
        if not regex_text:
            errors.append(f"pattern {pattern_id or f'#{index}'}: regex is required and must be non-empty")
        compiled_regex = None
        if regex_text:
            try:
                compiled_regex = re.compile(regex_text, flags)
            except re.error as exc:
                errors.append(f"pattern {pattern_id or f'#{index}'}: invalid regex: {exc}")
        if pattern_id and name and severity in VALID_SEVERITIES and regex_text and compiled_regex:
            compiled_patterns.append(
                {
                    "id": pattern_id,
                    "name": name,
                    "severity": severity,
                    "regex_text": regex_text,
                    "regex": compiled_regex,
                    "category": normalize_optional_text(item, "category", UNKNOWN),
                    "runbook": normalize_optional_text(item, "runbook", ""),
                    "description": normalize_optional_text(item, "description", ""),
                }
            )
    if errors:
        raise ValueError("invalid pattern catalog:\n- " + "\n- ".join(errors))
    if not compiled_patterns:
        raise ValueError("invalid pattern catalog: no patterns configured")
    return compiled_patterns
def normalize_required_text(item: dict[str, Any], field: str) -> str:
    value = item.get(field)
    if not isinstance(value, str):
        return ""
    return value.strip()


def normalize_optional_text(item: dict[str, Any], field: str, default: str) -> str:
    value = item.get(field, default)
    if not isinstance(value, str):
        return default
    value = value.strip()
    return value if value else default
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        fraction = iso_match.group(3) or ""
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        parse_value = raw
        fmt = "%Y-%m-%d %H:%M:%S"
        if fraction:
            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
            fmt = "%Y-%m-%d %H:%M:%S.%f"
        try:
            return datetime.strptime(parse_value, fmt), raw + fraction
        except ValueError:
            return None, UNKNOWN
    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        try:
            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw
    return None, UNKNOWN
def severity_filter_matches(selected: str | None, severity: str) -> bool:
    if selected is None:
        return True
    return selected.upper() == severity


def category_filter_matches(selected: str | None, category: str) -> bool:
    if selected is None:
        return True
    return selected == category


def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
    if parsed_at is None:
        return
    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
        group["first_seen"] = (parsed_at, rendered_at)
    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
        group["last_seen"] = (parsed_at, rendered_at)


def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def append_limited(items: list[str], value: str, limit: int) -> None:
    if limit == 0:
        return
    if value in items:
        return
    if len(items) < limit:
        items.append(value)
def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    severity_filter: str | None,
    category_filter: str | None,
    top: int,
    max_samples: int,
    pattern_catalog_path: Path,
) -> dict[str, Any]:
    syslog_year = datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    top_categories = Counter()
    total_lines_scanned = 0
    parsed_timestamps = 0
    unknown_timestamps = 0
    for line in lines:
        total_lines_scanned += 1
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if parsed_at is None:
            unknown_timestamps += 1
        else:
            parsed_timestamps += 1
        for pattern in patterns:
            if not severity_filter_matches(severity_filter, pattern["severity"]):
                continue
            if not category_filter_matches(category_filter, pattern["category"]):
                continue
            if not pattern["regex"].search(line):
                continue
            group = groups.setdefault(
                pattern["id"],
                {
                    "id": pattern["id"],
                    "name": pattern["name"],
                    "severity": pattern["severity"],
                    "category": pattern["category"],
                    "runbook": pattern["runbook"],
                    "description": pattern["description"],
                    "regex": pattern["regex_text"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1
            update_seen(group, parsed_at, rendered_at)
            append_limited(group["samples"], line, max_samples)
            top_categories[pattern["category"]] += 1
    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["id"],
        ),
    )
    rendered_findings = [
        {
            **finding,
            "first_seen": render_seen(finding["first_seen"]),
            "last_seen": render_seen(finding["last_seen"]),
        }
        for finding in findings
    ]
    critical_patterns = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
    warning_patterns = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
    total_matches = sum(item["occurrences"] for item in rendered_findings)
    overall_status = "OK"
    if critical_patterns > 0:
        overall_status = "CRITICAL"
    elif warning_patterns > 0:
        overall_status = "WARNING"
    return {
        "overall_status": overall_status,
        "total_lines_scanned": total_lines_scanned,
        "total_known_error_matches": total_matches,
        "matched_pattern_count": len(rendered_findings),
        "critical_matched_pattern_count": critical_patterns,
        "warning_matched_pattern_count": warning_patterns,
        "top_categories": [
            {"category": name, "count": count}
            for name, count in top_categories.most_common(top)
        ],
        "top_known_errors": [
            {"id": item["id"], "name": item["name"], "severity": item["severity"], "count": item["occurrences"]}
            for item in rendered_findings[:top]
        ],
        "timestamp_coverage": {
            "parsed_timestamps_count": parsed_timestamps,
            "unknown_timestamps_count": unknown_timestamps,
        },
        "filters_used": {
            "severity": severity_filter.lower() if severity_filter else None,
            "category": category_filter,
        },
        "pattern_catalog_path": str(pattern_catalog_path),
        "findings": rendered_findings[:top],
        "findings_total": len(rendered_findings),
    }
def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
    if not items:
        return "None"
    return ", ".join(f"{item[key]} ({item['count']})" for item in items)


def render_text(report: dict[str, Any]) -> str:
    lines = [
        "Known Error Matcher",
        "===================",
        "",
        f"Overall status: {report['overall_status']}",
        "Known error pattern matches require operator review; logs alone do not prove root cause.",
        "",
    ]
    if report["findings"]:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['id']} - {finding['name']}",
                    f"Category: {finding['category']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    f"Runbook: {finding['runbook'] or 'None'}",
                    f"Description: {finding['description'] or 'None'}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                for sample in finding["samples"]:
                    lines.append(f" - {sample}")
            else:
                lines.append(" - None")
            lines.append("")
    else:
        lines.extend(["No known error patterns matched for the selected filters.", ""])
    lines.extend(render_summary_lines(report, markdown=False))
    return "\n".join(lines)
def render_summary_lines(report: dict[str, Any], markdown: bool) -> list[str]:
    if markdown:
        return [
            "## Operational Summary",
            "",
            f"- Overall status: `{report['overall_status']}`",
            f"- Total lines scanned: `{report['total_lines_scanned']}`",
            f"- Known error matches: `{report['total_known_error_matches']}`",
            f"- Matched known error patterns: `{report['matched_pattern_count']}`",
            f"- Critical matched patterns: `{report['critical_matched_pattern_count']}`",
            f"- Warning matched patterns: `{report['warning_matched_pattern_count']}`",
            "- Top categories: " + render_top_pairs(report["top_categories"], "category"),
            "- Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
            "- Timestamp coverage: "
            f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
            f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
            "- Filters used: "
            f"severity=`{report['filters_used']['severity'] or 'None'}`, "
            f"category=`{report['filters_used']['category'] or 'None'}`",
            f"- Pattern catalog path: `{report['pattern_catalog_path']}`",
        ]
    return [
        "Operational Summary",
        "-------------------",
        f"Overall status: {report['overall_status']}",
        f"Total lines scanned: {report['total_lines_scanned']}",
        f"Known error matches: {report['total_known_error_matches']}",
        f"Matched known error patterns: {report['matched_pattern_count']}",
        f"Critical matched patterns: {report['critical_matched_pattern_count']}",
        f"Warning matched patterns: {report['warning_matched_pattern_count']}",
        "Top categories: " + render_top_pairs(report["top_categories"], "category"),
        "Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
        "Timestamp coverage: "
        f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
        f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
        "Filters used: "
        f"severity={report['filters_used']['severity'] or 'None'}, "
        f"category={report['filters_used']['category'] or 'None'}",
        f"Pattern catalog path: {report['pattern_catalog_path']}",
    ]
def render_markdown(report: dict[str, Any]) -> str:
    lines = [
        "# Known Error Matcher Report",
        "",
        f"- Overall status: `{report['overall_status']}`",
        "- Known error pattern matches require operator review; logs alone do not prove root cause.",
        "",
    ]
    if report["findings"]:
        lines.extend(["## Matched Known Errors", ""])
        for finding in report["findings"]:
            lines.extend(
                [
                    f"### [{finding['severity']}] {finding['id']} - {finding['name']}",
                    "",
                    f"- Category: `{finding['category']}`",
                    f"- Occurrences: `{finding['occurrences']}`",
                    f"- First seen: `{finding['first_seen']}`",
                    f"- Last seen: `{finding['last_seen']}`",
                    f"- Runbook: `{finding['runbook'] or 'None'}`",
                    f"- Description: {finding['description'] or 'None'}",
                    "- Samples:",
                ]
            )
            if finding["samples"]:
                for sample in finding["samples"]:
                    lines.append(f" - `{sample}`")
            else:
                lines.append(" - `None`")
            lines.append("")
    else:
        lines.extend(["## Matched Known Errors", "", "No known error patterns matched for the selected filters.", ""])
    lines.extend(render_summary_lines(report, markdown=True))
    return "\n".join(lines)


def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2)
def write_output(text: str, output_path: str | None, protected_inputs: list[Path]) -> None:
    if output_path is None:
        print(text)
        return
    destination = Path(output_path)
    try:
        destination_resolved = destination.resolve()
        for input_path in protected_inputs:
            if input_path.resolve() == destination_resolved:
                raise OSError("output path must not overwrite an input file")
    except FileNotFoundError as exc:
        raise OSError(f"unable to resolve output path {destination}: {exc}") from exc
    try:
        destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write report to {destination}: {exc}") from exc


def determine_exit_code(report: dict[str, Any]) -> int:
    if report["total_known_error_matches"] > 0:
        return EXIT_FINDINGS
    return EXIT_OK


def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    try:
        log_path = Path(args.file)
        pattern_path = Path(args.patterns)
        log_text = read_text_file(log_path, "log file")
        lines = log_text.splitlines()
        patterns = load_pattern_catalog(pattern_path, args.ignore_case)
        severity_filter = args.severity.upper() if args.severity else None
        report = analyze_log(
            lines=lines,
            patterns=patterns,
            severity_filter=severity_filter,
            category_filter=args.category,
            top=args.top,
            max_samples=args.max_samples,
            pattern_catalog_path=pattern_path,
        )
        if args.format == "text":
            rendered = render_text(report)
        elif args.format == "markdown":
            rendered = render_markdown(report)
        else:
            rendered = render_json(report)
        write_output(rendered, args.output, [log_path, pattern_path])
        return determine_exit_code(report)
    except (OSError, ValueError) as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except Exception as exc:  # pragma: no cover - defensive operational fallback
        print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr)
        return EXIT_INVALID


if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,220 @@
{
"patterns": [
{
"id": "disk_full",
"name": "Disk full",
"severity": "CRITICAL",
"regex": "No space left on device|disk full|filesystem full",
"category": "storage",
"runbook": "infra-run/scripts/bash/disk-full/README.md",
"description": "Filesystem or application failed because free space was exhausted."
},
{
"id": "inode_exhaustion",
"name": "Inode exhaustion",
"severity": "CRITICAL",
"regex": "No space left on device.*inode|inode.*exhaust|free inodes.*0",
"category": "storage",
"runbook": "infra-run/scripts/bash/disk-full/README.md",
"description": "Filesystem may have free blocks but too few available inodes."
},
{
"id": "read_only_filesystem",
"name": "Read-only filesystem",
"severity": "CRITICAL",
"regex": "read-only file system|read-only filesystem|Remounting filesystem read-only",
"category": "storage",
"runbook": "infra-run/runbooks/incidents/read-only-filesystem.md",
"description": "Filesystem writes failed because the mount was read-only or remounted read-only."
},
{
"id": "io_error",
"name": "I/O error",
"severity": "CRITICAL",
"regex": "\\bI/O error\\b|Buffer I/O error|blk_update_request.*I/O error",
"category": "storage",
"runbook": "infra-run/runbooks/incidents/storage-io-error.md",
"description": "Kernel or application reported storage I/O errors that require device and filesystem review."
},
{
"id": "out_of_memory",
"name": "Out of memory",
"severity": "CRITICAL",
"regex": "\\bout of memory\\b|Cannot allocate memory",
"category": "memory",
"runbook": "infra-run/runbooks/incidents/memory-pressure.md",
"description": "Process or host reported memory exhaustion symptoms."
},
{
"id": "oom_killer",
"name": "OOM killer invoked",
"severity": "CRITICAL",
"regex": "oom-killer|Killed process \\d+|Out of memory: Killed process",
"category": "memory",
"runbook": "infra-run/runbooks/incidents/oom-killer.md",
"description": "Kernel OOM killer activity was logged and affected processes should be reviewed."
},
{
"id": "segmentation_fault",
"name": "Segmentation fault",
"severity": "CRITICAL",
"regex": "segmentation fault|segfault",
"category": "process",
"runbook": "infra-run/runbooks/incidents/process-crash.md",
"description": "A process crash pattern was logged."
},
{
"id": "connection_refused",
"name": "Connection refused",
"severity": "WARNING",
"regex": "connection refused|ConnectException: Connection refused",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Client connection attempts were refused by the destination service or host."
},
{
"id": "connection_reset",
"name": "Connection reset",
"severity": "WARNING",
"regex": "connection reset|Connection reset by peer",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Established network connections were reset and require endpoint review."
},
{
"id": "timeout",
"name": "Timeout",
"severity": "WARNING",
"regex": "\\btimeout\\b|timed out|TimeoutException|SocketTimeoutException",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Operation timed out and may require network, service, or dependency review."
},
{
"id": "dns_resolution_failure",
"name": "DNS resolution failure",
"severity": "WARNING",
"regex": "Temporary failure in name resolution|Name or service not known|NXDOMAIN|UnknownHostException|could not resolve host",
"category": "network",
"runbook": "infra-run/runbooks/incidents/dns-resolution.md",
"description": "Name resolution failed for a host or service dependency."
},
{
"id": "certificate_expired",
"name": "Certificate expired",
"severity": "CRITICAL",
"regex": "certificate expired|CertificateExpiredException|certificate has expired|notAfter",
"category": "tls",
"runbook": "infra-run/runbooks/incidents/certificate-expired.md",
"description": "TLS certificate expiry was logged and certificate state should be reviewed."
},
{
"id": "tls_handshake_failed",
"name": "TLS handshake failed",
"severity": "WARNING",
"regex": "TLS handshake failed|SSL handshake failed|handshake_failure",
"category": "tls",
"runbook": "infra-run/runbooks/incidents/tls-handshake.md",
"description": "TLS handshake failed and may require certificate, protocol, or trust-store review."
},
{
"id": "authentication_failure",
"name": "Authentication failure",
"severity": "WARNING",
"regex": "authentication failure|Failed password|authentication failed",
"category": "security",
"runbook": "infra-run/scripts/python/auth-log-audit/README.md",
"description": "Authentication failures were logged and may require access review."
},
{
"id": "permission_denied",
"name": "Permission denied",
"severity": "WARNING",
"regex": "permission denied|access denied|denied by policy",
"category": "security",
"runbook": "infra-run/runbooks/incidents/permission-denied.md",
"description": "Access or permission denial was logged."
},
{
"id": "invalid_user",
"name": "Invalid user",
"severity": "WARNING",
"regex": "Invalid user|invalid user|user unknown|User not known",
"category": "security",
"runbook": "infra-run/scripts/python/auth-log-audit/README.md",
"description": "Log contains attempts involving invalid or unknown users."
},
{
"id": "java_out_of_memory",
"name": "Java OutOfMemoryError",
"severity": "CRITICAL",
"regex": "OutOfMemoryError|Java heap space|GC overhead limit exceeded",
"category": "application_jvm",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Java process logged memory exhaustion symptoms."
},
{
"id": "ssl_handshake_exception",
"name": "SSLHandshakeException",
"severity": "CRITICAL",
"regex": "SSLHandshakeException|javax\\.net\\.ssl\\.SSLHandshakeException",
"category": "application_jvm",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Java TLS handshake exception was logged."
},
{
"id": "database_unavailable",
"name": "Database unavailable",
"severity": "CRITICAL",
"regex": "database unavailable|database is unavailable|SQLRecoverableException|CommunicationsException|connection pool exhausted",
"category": "application",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Application logged unavailable database or database connectivity symptoms."
},
{
"id": "http_500",
"name": "HTTP 500",
"severity": "CRITICAL",
"regex": "\\bHTTP\\s+500\\b|\\bstatus=500\\b|\\s500\\s",
"category": "application",
"runbook": "infra-run/runbooks/incidents/http-5xx.md",
"description": "Application or proxy logged HTTP 500 responses."
},
{
"id": "http_503",
"name": "HTTP 503",
"severity": "CRITICAL",
"regex": "\\bHTTP\\s+503\\b|\\bstatus=503\\b|\\s503\\s|Service Unavailable",
"category": "application",
"runbook": "infra-run/runbooks/incidents/http-5xx.md",
"description": "Application or proxy logged HTTP 503 service unavailable responses."
},
{
"id": "service_failed",
"name": "Systemd service failed",
"severity": "CRITICAL",
"regex": "Failed to start .*\\.service|entered failed state|Unit .*\\.service failed|Main process exited.*status=",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd logged a failed service or failed service start."
},
{
"id": "dependency_failed",
"name": "Systemd dependency failed",
"severity": "CRITICAL",
"regex": "Dependency failed for|dependency failed",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd logged a unit dependency failure."
},
{
"id": "start_request_repeated",
"name": "Start request repeated too quickly",
"severity": "WARNING",
"regex": "Start request repeated too quickly|start request repeated too quickly",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd throttled service restarts after repeated start failures."
}
]
}
@@ -0,0 +1,164 @@
# log-diff-checker
`log-diff-checker` is a read-only Python CLI for comparing configured operational log patterns before and after a change. It is intended to help an infrastructure engineer decide whether a patch, deployment, configuration change, or service restart introduced new log risk or reduced existing noise.
The tool compares local pre-change and post-change log extracts. It does not modify input logs or system state.
## When To Use
- After a planned change when pre-check and post-check log extracts are available.
- During change validation when the question is whether errors increased, disappeared, or stayed flat.
- Before attaching log evidence to a change, incident, or problem ticket.
- When predictable text, Markdown, or JSON output is useful for local review.
## What It Does
- Reads two local text log files supplied with `--before` and `--after`.
- Scans both files for configured critical and warning patterns.
- Compares the number of matching lines before and after the change for each detected pattern.
- Classifies patterns as `NEW`, `INCREASED`, `DECREASED`, `RESOLVED`, or `UNCHANGED`.
- Sets an overall status of `OK`, `WARNING`, or `CRITICAL`.
- Includes sample log lines from the side that best explains the change.
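The classification step above can be sketched roughly as follows. This is a minimal illustration of how a pair of per-pattern counts maps to the listed statuses; the function name `classify` and the sample counts are assumptions made for the example, not part of the tool's interface.

```python
def classify(before_count: int, after_count: int) -> str:
    """Map a pair of per-pattern line counts to a change status."""
    if before_count == 0 and after_count > 0:
        return "NEW"
    if before_count > 0 and after_count == 0:
        return "RESOLVED"
    if after_count > before_count:
        return "INCREASED"
    if after_count < before_count:
        return "DECREASED"
    return "UNCHANGED"


# Hypothetical counts for illustration: pattern -> (before, after)
counts = {"timeout": (1, 2), "permission denied": (1, 0), "ERROR": (2, 2)}
statuses = {pattern: classify(b, a) for pattern, (b, a) in counts.items()}
print(statuses)
```

Checking the zero cases first keeps `NEW` and `RESOLVED` distinct from a plain increase or decrease, which is usually the distinction a change reviewer cares about.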
## What It Does Not Do
- It does not read remote systems.
- It does not modify logs, services, or host state.
- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
- It does not prove root cause or change safety.
- It does not replace service-specific post-change checks.
- It does not classify every possible vendor or application error.
## Supported Input
- Two local text log files:
  - `--before` for the pre-change log extract.
  - `--after` for the post-change log extract.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Missing paths, unreadable files, paths that are not regular files, and empty files are rejected with exit code `2`.
## Supported Patterns
Critical patterns:
- `CRITICAL`
- `FATAL`
- `panic`
- `kernel panic`
- `no space left on device`
- `out of memory`
- `killed process`
- `read-only file system`
- `segmentation fault`
- `segfault`
- `certificate expired`
- `TLS handshake failed`
- `SSLHandshakeException`
- `database unavailable`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`
Warning patterns:
- `ERROR`
- `failed`
- `failure`
- `timeout`
- `connection refused`
- `connection reset`
- `permission denied`
- `authentication failed`
- `denied`
- `unavailable`
- `service restart`
- `retrying`
By default, matching is case-sensitive, so `ERROR` does not match `error`. Use `--ignore-case` for case-insensitive matching across all configured patterns.
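The effect of the case setting can be illustrated with a short sketch. It assumes patterns are treated as literal substrings and compiled with the standard `re` module; the helper name `compile_literal` and the log line are made up for the example.

```python
import re


def compile_literal(pattern: str, ignore_case: bool = False) -> re.Pattern:
    """Compile a pattern as a literal substring, escaping regex metacharacters."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(re.escape(pattern), flags)


line = "app01 api[1]: Error: Connection Refused by db01"  # made-up log line

strict = compile_literal("connection refused")
relaxed = compile_literal("connection refused", ignore_case=True)

print(bool(strict.search(line)))   # False: the case differs
print(bool(relaxed.search(line)))  # True: case is ignored
```

If application logs mix casing conventions, running once with and once without `--ignore-case` may help show how much the default setting is missing.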
## Usage
```bash
cd infra-run/scripts/python/log-diff-checker
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format markdown
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format markdown --output change-log-diff.md
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --format json
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --ignore-case
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --top 20
python3 log_diff_checker.py --before examples/pre-change.log --after examples/post-change.log --max-samples 5
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - change or incident ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to either input log file.
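For local automation, the JSON report can be filtered with a few lines of Python. The sketch below is an assumption-laden illustration: the embedded report is hand-written and abridged, and only the key names (`findings`, `summary`, `pattern`, `severity`, `status`, `delta`, `overall_status`) follow the report structure the tool emits.

```python
import json

# Abridged, hand-written report used only for illustration.
raw = """
{
  "findings": [
    {"pattern": "database unavailable", "severity": "CRITICAL", "status": "NEW",
     "before_count": 0, "after_count": 1, "delta": 1},
    {"pattern": "permission denied", "severity": "WARNING", "status": "RESOLVED",
     "before_count": 1, "after_count": 0, "delta": -1}
  ],
  "summary": {"overall_status": "CRITICAL"}
}
"""

report = json.loads(raw)

# Keep only findings that got worse after the change.
regressions = [f for f in report["findings"] if f["status"] in ("NEW", "INCREASED")]

for finding in regressions:
    print(f"{finding['severity']}: {finding['pattern']} ({finding['delta']:+d})")
print("Overall:", report["summary"]["overall_status"])
```

In practice the `raw` string would come from a file written with `--format json --output <path>`.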
## Exit Codes
- `0` - OK, no new or increased findings.
- `1` - New or increased findings detected.
- `2` - Invalid input, unreadable file, bad argument, output write failure, or runtime error.
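A wrapper script might interpret these exit codes along the following lines. This is a sketch only: `describe_exit` and `needs_review` are hypothetical helper names, and the "anything non-zero needs review" policy is an assumption rather than something the tool enforces.

```python
EXIT_MEANINGS = {
    0: "OK: no new or increased findings",
    1: "FINDINGS: new or increased findings detected",
    2: "INVALID: bad input, output write failure, or runtime error",
}


def describe_exit(code: int) -> str:
    """Return a short human-readable meaning for a checker exit code."""
    return EXIT_MEANINGS.get(code, f"UNEXPECTED: exit code {code}")


def needs_review(code: int) -> bool:
    """Hypothetical gate policy: any non-zero exit code needs human review."""
    return code != 0


for code in (0, 1, 2):
    print(code, describe_exit(code), "| review needed:", needs_review(code))
```

Treating exit code `2` as "review needed" rather than "passed" avoids silently accepting a change when the comparison itself could not run.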
## Example Text Output
```text
Log Diff Checker
================
[CRITICAL] CRITICAL - NEW
Before count: 0
After count: 1
Delta: +1
Sample source: after
Samples:
- 2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
Operational Summary
-------------------
Total lines scanned before: 7
Total lines scanned after: 8
Total unique patterns compared: 9
New findings count: 3
Increased findings count: 3
Decreased findings count: 0
Resolved findings count: 2
Unchanged findings count: 1
Overall status: CRITICAL
```
## Markdown Workflow
Generate a Markdown report from collected pre-change and post-change logs, review it, and attach it to the change ticket as supporting evidence:
```bash
python3 log_diff_checker.py \
--before examples/pre-change.log \
--after examples/post-change.log \
--format markdown \
--output change-log-diff.md
```
Treat the report as one log-based view of the change, not a verdict. Review a `CRITICAL` or `WARNING` result alongside service health checks, monitoring, rollback criteria, and the relevant application owner.
## Operational Limitations
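- Pattern matching is intentionally simple and predictable: each pattern is a literal substring, not a regular expression, and a line counts once per pattern however often the pattern repeats within it.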
- A single line can match multiple patterns, such as `CRITICAL`, `database unavailable`, and `unavailable`.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
- The tool compares counts, not rates, time windows, or request volume.
- Large log files are read into memory; collect scoped extracts for very large incidents.
- `--top` limits displayed findings only. The operational summary still reflects all compared patterns.
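The overlap and counting behaviour described in these limitations can be reproduced with a small sketch. It assumes literal substring patterns and one count per matching line; the second log line is made up for the example.

```python
import re
from collections import Counter

patterns = ["CRITICAL", "database unavailable", "unavailable", "timeout"]
compiled = {p: re.compile(re.escape(p)) for p in patterns}

lines = [
    "CRITICAL database unavailable while opening checkout connection",
    "WARNING timeout contacting cache01, retry after timeout",  # made-up line
]

counts = Counter()
for line in lines:
    for name, regex in compiled.items():
        if regex.search(line):  # one hit per line, however often it repeats
            counts[name] += 1

print(dict(counts))
```

The first line contributes to three patterns at once, while the second contributes a single `timeout` count even though the word appears twice, so per-pattern totals should not be summed into an "error count".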
## Safety Notes
- The tool only reads the input logs and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.
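The guard that keeps a report from overwriting an input log can be sketched as below. This is an illustrative approximation using `pathlib`; the helper name `output_collides` and the file names are assumptions for the example.

```python
from pathlib import Path
from typing import Iterable


def output_collides(output: Path, inputs: Iterable[Path]) -> bool:
    """Return True when the output path resolves to one of the input paths."""
    resolved_inputs = {path.resolve() for path in inputs}
    return output.resolve() in resolved_inputs


inputs = [Path("pre-change.log"), Path("post-change.log")]  # hypothetical names

print(output_collides(Path("./pre-change.log"), inputs))    # True: same file
print(output_collides(Path("change-log-diff.md"), inputs))  # False: separate file
```

Comparing resolved paths rather than raw strings catches cases where the same file is named through a relative prefix or a symlink.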
@@ -0,0 +1,8 @@
2026-05-11 10:10:01 app01 systemd[1]: Started inventory-api.service after package update.
2026-05-11 10:10:15 app01 inventory-api[2294]: INFO readiness check passed
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
2026-05-11 10:15:00 app01 inventory-api[2294]: INFO background reconciliation completed
@@ -0,0 +1,7 @@
2026-05-11 09:55:01 app01 systemd[1]: Started inventory-api.service.
2026-05-11 09:56:12 app01 inventory-api[1842]: INFO readiness check passed
2026-05-11 09:57:20 app01 inventory-api[1842]: WARNING timeout contacting cache01, retrying
2026-05-11 09:58:04 app01 inventory-api[1842]: ERROR failed to refresh optional pricing cache
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
2026-05-11 10:00:00 app01 systemd[1]: Stopping inventory-api.service for planned restart.
2026-05-11 10:00:03 app01 systemd[1]: Started inventory-api.service.
@@ -0,0 +1,134 @@
# Log Diff Checker
## CRITICAL: CRITICAL (NEW)
- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```
## CRITICAL: database unavailable (NEW)
- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```
## WARNING: unavailable (NEW)
- Before count: 0
- After count: 1
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:14:31 app01 inventory-api[2294]: CRITICAL database unavailable while opening checkout connection
```
## WARNING: failed (INCREASED)
- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
```
## WARNING: retrying (INCREASED)
- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
```
## WARNING: timeout (INCREASED)
- Before count: 1
- After count: 2
- Delta: +1
- Sample source: after
Sample log lines:
```text
2026-05-11 10:11:02 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
2026-05-11 10:11:18 app01 inventory-api[2294]: WARNING timeout contacting cache01, retrying
```
## WARNING: denied (RESOLVED)
- Before count: 1
- After count: 0
- Delta: -1
- Sample source: before
Sample log lines:
```text
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
```
## WARNING: permission denied (RESOLVED)
- Before count: 1
- After count: 0
- Delta: -1
- Sample source: before
Sample log lines:
```text
2026-05-11 09:59:10 app01 inventory-api[1842]: ERROR permission denied reading /etc/inventory/legacy.conf
```
## WARNING: ERROR (UNCHANGED)
- Before count: 2
- After count: 2
- Delta: +0
- Sample source: after
Sample log lines:
```text
2026-05-11 10:12:44 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
2026-05-11 10:13:05 app01 inventory-api[2294]: ERROR failed to refresh optional pricing cache
```
## Operational Summary
- Total lines scanned before: 7
- Total lines scanned after: 8
- Total unique patterns compared: 9
- New findings count: 3
- Increased findings count: 3
- Decreased findings count: 0
- Resolved findings count: 2
- Unchanged findings count: 1
- Overall status: CRITICAL
@@ -0,0 +1,462 @@
#!/usr/bin/env python3
"""Compare incident-oriented log patterns before and after a change."""
from __future__ import annotations
import argparse
import json
import re
import sys
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
STATUS_ORDER = {
"NEW": 0,
"INCREASED": 1,
"DECREASED": 2,
"RESOLVED": 3,
"UNCHANGED": 4,
}
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
CRITICAL_PATTERNS = [
"CRITICAL",
"FATAL",
"panic",
"kernel panic",
"no space left on device",
"out of memory",
"killed process",
"read-only file system",
"segmentation fault",
"segfault",
"certificate expired",
"TLS handshake failed",
"SSLHandshakeException",
"database unavailable",
"HTTP 500",
"HTTP 502",
"HTTP 503",
"HTTP 504",
]
WARNING_PATTERNS = [
"ERROR",
"failed",
"failure",
"timeout",
"connection refused",
"connection reset",
"permission denied",
"authentication failed",
"denied",
"unavailable",
"service restart",
"retrying",
]
def build_parser() -> argparse.ArgumentParser:
parser = argparse.ArgumentParser(
description="Compare configured operational log patterns before and after a change."
)
parser.add_argument("--before", required=True, help="Pre-change local log file.")
parser.add_argument("--after", required=True, help="Post-change local log file.")
parser.add_argument(
"--format",
choices=("text", "markdown", "json"),
default="text",
help="Report format. Default: text.",
)
parser.add_argument("--output", help="Write report to this path instead of stdout.")
parser.add_argument(
"--top",
type=positive_int,
help="Limit displayed findings after operational importance sorting.",
)
parser.add_argument(
"--ignore-case",
action="store_true",
help="Match all configured patterns case-insensitively.",
)
parser.add_argument(
"--max-samples",
type=non_negative_int,
default=3,
help="Maximum sample lines per finding. Default: 3.",
)
return parser
def positive_int(value: str) -> int:
try:
number = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError("must be a positive integer") from exc
if number <= 0:
raise argparse.ArgumentTypeError("must be a positive integer")
return number
def non_negative_int(value: str) -> int:
try:
number = int(value)
except ValueError as exc:
raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
if number < 0:
raise argparse.ArgumentTypeError("must be zero or a positive integer")
return number
def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
flags = re.IGNORECASE if ignore_case else 0
pattern_defs: list[dict[str, str]] = []
pattern_defs.extend(
{"pattern": pattern, "severity": "CRITICAL"} for pattern in CRITICAL_PATTERNS
)
pattern_defs.extend(
{"pattern": pattern, "severity": "WARNING"} for pattern in WARNING_PATTERNS
)
compiled = []
for item in pattern_defs:
compiled.append(
{
"pattern": item["pattern"],
"severity": item["severity"],
"regex": re.compile(re.escape(item["pattern"]), flags),
}
)
return compiled
def read_log_file(path: Path) -> list[str]:
if not path.exists():
raise OSError(f"file does not exist: {path}")
if not path.is_file():
raise OSError(f"path is not a regular file: {path}")
try:
text = path.read_text(encoding="utf-8", errors="replace")
except PermissionError as exc:
raise OSError(f"file is not readable: {path}") from exc
except OSError as exc:
raise OSError(f"unable to read file {path}: {exc}") from exc
if text == "":
raise ValueError(f"file is empty: {path}")
return text.splitlines()
def scan_log(
lines: list[str], patterns: list[dict[str, Any]], max_samples: int
) -> dict[str, dict[str, Any]]:
groups: dict[str, dict[str, Any]] = {}
for line in lines:
for item in patterns:
if not item["regex"].search(line):
continue
key = f"{item['severity']}::{item['pattern']}"
group = groups.setdefault(
key,
{
"pattern": item["pattern"],
"severity": item["severity"],
"count": 0,
"samples": [],
},
)
group["count"] += 1
if len(group["samples"]) < max_samples:
group["samples"].append(line)
return groups
def classify_status(before_count: int, after_count: int) -> str:
if before_count == 0 and after_count > 0:
return "NEW"
if before_count > 0 and after_count == 0:
return "RESOLVED"
if after_count > before_count:
return "INCREASED"
if after_count < before_count:
return "DECREASED"
return "UNCHANGED"
def sample_source_for(status: str) -> str:
if status in ("NEW", "INCREASED"):
return "after"
if status in ("DECREASED", "RESOLVED"):
return "before"
return "after"
def compare_logs(
before_lines: list[str],
after_lines: list[str],
patterns: list[dict[str, Any]],
max_samples: int,
top: int | None,
) -> dict[str, Any]:
before_groups = scan_log(before_lines, patterns, max_samples)
after_groups = scan_log(after_lines, patterns, max_samples)
compared_keys = sorted(set(before_groups) | set(after_groups))
findings = []
for key in compared_keys:
before_group = before_groups.get(key)
after_group = after_groups.get(key)
reference = before_group or after_group
if reference is None:
continue
before_count = before_group["count"] if before_group is not None else 0
after_count = after_group["count"] if after_group is not None else 0
status = classify_status(before_count, after_count)
source = sample_source_for(status)
sample_group = after_group if source == "after" else before_group
findings.append(
{
"pattern": reference["pattern"],
"severity": reference["severity"],
"before_count": before_count,
"after_count": after_count,
"delta": after_count - before_count,
"status": status,
"sample_source": source,
"samples": sample_group["samples"] if sample_group is not None else [],
}
)
sorted_findings = sorted(findings, key=finding_sort_key)
summary = build_summary(
before_lines=before_lines,
after_lines=after_lines,
findings=sorted_findings,
)
displayed_findings = sorted_findings if top is None else sorted_findings[:top]
return {
"findings": displayed_findings,
"summary": summary,
}
def finding_sort_key(finding: dict[str, Any]) -> tuple[int, int, int, int, str]:
return (
STATUS_ORDER[finding["status"]],
SEVERITY_ORDER[finding["severity"]],
-abs(finding["delta"]),
-finding["after_count"],
finding["pattern"].lower(),
)
def build_summary(
before_lines: list[str], after_lines: list[str], findings: list[dict[str, Any]]
) -> dict[str, Any]:
status_counts = {
"NEW": 0,
"INCREASED": 0,
"DECREASED": 0,
"RESOLVED": 0,
"UNCHANGED": 0,
}
for finding in findings:
status_counts[finding["status"]] += 1
critical_regressions = any(
finding["severity"] == "CRITICAL"
and finding["status"] in ("NEW", "INCREASED")
for finding in findings
)
warning_regressions = any(
finding["severity"] == "WARNING"
and finding["status"] in ("NEW", "INCREASED")
for finding in findings
)
if critical_regressions:
overall_status = "CRITICAL"
elif warning_regressions:
overall_status = "WARNING"
else:
overall_status = "OK"
return {
"total_lines_scanned_before": len(before_lines),
"total_lines_scanned_after": len(after_lines),
"total_unique_patterns_compared": len(findings),
"new_findings_count": status_counts["NEW"],
"increased_findings_count": status_counts["INCREASED"],
"decreased_findings_count": status_counts["DECREASED"],
"resolved_findings_count": status_counts["RESOLVED"],
"unchanged_findings_count": status_counts["UNCHANGED"],
"overall_status": overall_status,
}
def render_text(report: dict[str, Any]) -> str:
lines = ["Log Diff Checker", "================", ""]
if not report["findings"]:
lines.append("No configured operational patterns were detected in either log.")
else:
for finding in report["findings"]:
lines.extend(
[
f"[{finding['severity']}] {finding['pattern']} - {finding['status']}",
f"Before count: {finding['before_count']}",
f"After count: {finding['after_count']}",
f"Delta: {finding['delta']:+d}",
f"Sample source: {finding['sample_source']}",
"Samples:",
]
)
if finding["samples"]:
lines.extend(f" - {sample}" for sample in finding["samples"])
else:
lines.append(" - No samples retained")
lines.append("")
lines.extend(render_text_summary(report["summary"]))
return "\n".join(lines) + "\n"
def render_text_summary(summary: dict[str, Any]) -> list[str]:
return [
"Operational Summary",
"-------------------",
f"Total lines scanned before: {summary['total_lines_scanned_before']}",
f"Total lines scanned after: {summary['total_lines_scanned_after']}",
f"Total unique patterns compared: {summary['total_unique_patterns_compared']}",
f"New findings count: {summary['new_findings_count']}",
f"Increased findings count: {summary['increased_findings_count']}",
f"Decreased findings count: {summary['decreased_findings_count']}",
f"Resolved findings count: {summary['resolved_findings_count']}",
f"Unchanged findings count: {summary['unchanged_findings_count']}",
f"Overall status: {summary['overall_status']}",
]
def render_markdown(report: dict[str, Any]) -> str:
lines = ["# Log Diff Checker", ""]
if not report["findings"]:
lines.extend(["No configured operational patterns were detected in either log.", ""])
else:
for finding in report["findings"]:
lines.extend(
[
f"## {finding['severity']}: {finding['pattern']} ({finding['status']})",
"",
f"- Before count: {finding['before_count']}",
f"- After count: {finding['after_count']}",
f"- Delta: {finding['delta']:+d}",
f"- Sample source: {finding['sample_source']}",
"",
"Sample log lines:",
"",
]
)
if finding["samples"]:
lines.append("```text")
lines.extend(finding["samples"])
lines.append("```")
else:
lines.append("_No samples retained._")
lines.append("")
summary = report["summary"]
lines.extend(
[
"## Operational Summary",
"",
f"- Total lines scanned before: {summary['total_lines_scanned_before']}",
f"- Total lines scanned after: {summary['total_lines_scanned_after']}",
f"- Total unique patterns compared: {summary['total_unique_patterns_compared']}",
f"- New findings count: {summary['new_findings_count']}",
f"- Increased findings count: {summary['increased_findings_count']}",
f"- Decreased findings count: {summary['decreased_findings_count']}",
f"- Resolved findings count: {summary['resolved_findings_count']}",
f"- Unchanged findings count: {summary['unchanged_findings_count']}",
f"- Overall status: {summary['overall_status']}",
"",
]
)
return "\n".join(lines)
def render_json(report: dict[str, Any]) -> str:
return json.dumps(report, indent=2, sort_keys=True) + "\n"
def write_report(
output_path: str | None, content: str, input_paths: tuple[Path, Path]
) -> None:
if output_path is None:
sys.stdout.write(content)
return
path = Path(output_path)
try:
output_resolved = path.resolve()
input_resolved = {input_path.resolve() for input_path in input_paths}
except OSError as exc:
raise OSError(f"unable to validate output path {path}: {exc}") from exc
if output_resolved in input_resolved:
raise OSError("output path must not overwrite an input log file")
try:
path.write_text(content, encoding="utf-8")
except OSError as exc:
raise OSError(f"unable to write output {path}: {exc}") from exc
def main() -> int:
parser = build_parser()
args = parser.parse_args()
before_path = Path(args.before)
after_path = Path(args.after)
try:
before_lines = read_log_file(before_path)
after_lines = read_log_file(after_path)
report = compare_logs(
before_lines=before_lines,
after_lines=after_lines,
patterns=compile_patterns(args.ignore_case),
max_samples=args.max_samples,
top=args.top,
)
if args.format == "text":
content = render_text(report)
elif args.format == "markdown":
content = render_markdown(report)
else:
content = render_json(report)
write_report(args.output, content, (before_path, after_path))
except (OSError, ValueError) as exc:
print(f"CRITICAL: {exc}", file=sys.stderr)
return EXIT_INVALID
except RuntimeError as exc:
print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
return EXIT_INVALID
if report["summary"]["overall_status"] == "OK":
return EXIT_OK
return EXIT_FINDINGS
if __name__ == "__main__":
sys.exit(main())
@@ -0,0 +1,62 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
ROOT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
PYTHON_DIR="$ROOT_DIR/infra-run/scripts/python"
ok_count=0
warn_count=0
fail_count=0
ok() {
printf 'OK: %s\n' "$*"
ok_count=$((ok_count + 1))
}
warning() {
printf 'WARNING: %s\n' "$*"
warn_count=$((warn_count + 1))
}
critical() {
printf 'CRITICAL: %s\n' "$*"
fail_count=$((fail_count + 1))
}
if ! command -v python3 >/dev/null 2>&1; then
critical "python3 not installed"
printf '\nPython summary: %d OK, %d WARNING, %d CRITICAL\n' "$ok_count" "$warn_count" "$fail_count"
exit 2
fi
if [[ ! -d "$PYTHON_DIR" ]]; then
warning "No infra-run/scripts/python directory found"
printf '\nPython summary: %d OK, %d WARNING, %d CRITICAL\n' "$ok_count" "$warn_count" "$fail_count"
exit 0
fi
mapfile -t python_files < <(find "$PYTHON_DIR" -type f -name '*.py' -print | sort)
if ((${#python_files[@]} == 0)); then
warning "No Python files found under infra-run/scripts/python"
else
ok "Found ${#python_files[@]} Python files"
fi
for file in "${python_files[@]}"; do
if python3 -m py_compile "$file"; then
ok "py_compile ${file#"$ROOT_DIR"/}"
else
critical "Python syntax failed: ${file#"$ROOT_DIR"/}"
fi
done
printf '\nPython summary: %d OK, %d WARNING, %d CRITICAL\n' "$ok_count" "$warn_count" "$fail_count"
if ((fail_count > 0)); then
exit 1
fi
exit 0
@@ -12,6 +12,11 @@ run_check() {
shift
printf '\n== %s ==\n' "$name"
if [[ ! -x "$1" ]]; then
printf 'WARNING: %s check is missing or not executable: %s\n' "$name" "$1"
return 0
fi
if "$@"; then
printf 'OK: %s completed\n' "$name"
else
@@ -22,6 +27,7 @@ run_check() {
run_check "Bash" "$ROOT_DIR/scripts/check-bash.sh"
run_check "Ansible" "$ROOT_DIR/scripts/check-ansible.sh"
run_check "Python" "$ROOT_DIR/scripts/check-python.sh"
run_check "Docs" "$ROOT_DIR/scripts/check-docs.sh"
printf '\n== Repository summary ==\n'