Add incident log summary tool

2026-05-11 17:03:31 +00:00
parent 61483c233f
commit 5dde403ce3
5 changed files with 765 additions and 0 deletions
@@ -0,0 +1,158 @@
+# incident-log-summary
+
+`incident-log-summary` is a read-only Python CLI for quick incident log review. It scans a local Linux system log or application log for a configured set of operational patterns and reports each matched pattern with its severity, occurrence count, first and last timestamps, and sample lines.
+
+The tool is meant for first-pass triage and incident notes. It does not replace full log search, alert correlation, service-specific runbooks, or review by an operator who understands the affected platform.
+
+## When To Use
+
+- During incident response when a collected log file needs a fast pattern summary.
+- Before attaching evidence to an incident, problem, or change ticket.
+- When checking whether a log contains obvious storage, memory, service, TLS, HTTP, or connectivity failures.
+- When JSON output is useful for later local automation.
+
+## What It Does Not Do
+
+- It does not read remote systems.
+- It does not modify logs or system state.
+- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
+- It does not prove root cause.
+- It does not classify every possible vendor or application error.
+- It does not claim production validation; the bundled examples are sanitized samples.
+
+## Supported Input
+
+- One local text log file provided with `--file`.
+- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
+- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
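+
+The replace-on-read behavior can be sketched with Python's standard `errors="replace"` decoding mode. This is a minimal illustration, not the tool's actual code: `read_log_lines` is a hypothetical helper name, and the real CLI reports rejected paths through exit code `2` rather than an exception.
+
+```python
+from pathlib import Path
+
+
+def read_log_lines(path: str) -> list[str]:
+    """Read a log as UTF-8, substituting U+FFFD for undecodable bytes."""
+    log_path = Path(path)
+    if not log_path.is_file():
+        # Covers missing paths and non-file paths such as directories.
+        raise ValueError(f"not a readable file: {path}")
+    # errors="replace" keeps the read going when a byte is not valid UTF-8.
+    text = log_path.read_text(encoding="utf-8", errors="replace")
+    return text.splitlines()
+```
+
+With this approach a stray `0xFF` byte shows up in the sample line as the replacement character instead of aborting the scan.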
+
+## Supported Patterns
+
+Critical patterns:
+
+- `CRITICAL`
+- `FATAL`
+- `panic`
+- `kernel panic`
+- `no space left on device`
+- `out of memory`
+- `killed process`
+- `read-only file system`
+- `segmentation fault`
+- `segfault`
+- `certificate expired`
+- `TLS handshake failed`
+- `SSLHandshakeException`
+- `database unavailable`
+- `HTTP 500`
+- `HTTP 502`
+- `HTTP 503`
+- `HTTP 504`
+
+Warning patterns:
+
+- `ERROR`
+- `failed`
+- `failure`
+- `timeout`
+- `connection refused`
+- `connection reset`
+- `permission denied`
+- `authentication failed`
+- `denied`
+- `unavailable`
+- `service restart`
+- `retrying`
+
+By default, matching is case-sensitive, so a pattern listed as `failed` does not match `FAILED`. Use `--ignore-case` for case-insensitive matching across all configured patterns.
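+
+The matching rule can be sketched as a literal substring search with an optional case-insensitive flag. This is an assumption about how such a scanner might work, not the tool's implementation: `PATTERNS` is a small hypothetical subset of the lists above, and `match_line` is an illustrative name.
+
+```python
+import re
+
+# Hypothetical subset of the configured patterns, for illustration only.
+PATTERNS = {
+    "critical": ["HTTP 503", "out of memory"],
+    "warning": ["ERROR", "unavailable"],
+}
+
+
+def match_line(line: str, ignore_case: bool = False) -> list[tuple[str, str]]:
+    """Return (severity, pattern) pairs for every pattern found in the line."""
+    flags = re.IGNORECASE if ignore_case else 0
+    hits = []
+    for severity, patterns in PATTERNS.items():
+        for pattern in patterns:
+            # re.escape keeps the search a literal substring match.
+            if re.search(re.escape(pattern), line, flags):
+                hits.append((severity, pattern))
+    return hits
+```
+
+Under these assumptions, `match_line("error: Out Of Memory")` returns an empty list, while passing `ignore_case=True` returns `[("critical", "out of memory"), ("warning", "ERROR")]`. A single line can also produce several hits at once.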
+
+## Timestamp Handling
+
+The scanner attempts to parse:
+
+- `2026-05-11 10:15:30`
+- `2026-05-11T10:15:30`
+- `May 11 10:15:30`
+
+Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and date filtering keeps those lines by default so potentially important findings are not silently discarded.
+
+Syslog-style timestamps do not include a year. For filtering, the tool uses the year from `--since` when present, otherwise the current local year.
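+
+A best-effort parser for the three formats above might look like the following sketch. It assumes timestamps sit at the start of the line and that `%b` month names are parsed under an English locale; `parse_timestamp` and `keep_line` are hypothetical names, not the tool's functions.
+
+```python
+from datetime import datetime
+from typing import Optional
+
+ISO_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S")
+
+
+def parse_timestamp(line: str, default_year: int) -> Optional[datetime]:
+    """Best-effort parse of a leading timestamp; None if nothing matches."""
+    for fmt in ISO_FORMATS:
+        try:
+            return datetime.strptime(line[:19], fmt)
+        except ValueError:
+            pass
+    try:
+        # Syslog form such as "May 11 10:15:30" has no year, so one is supplied.
+        return datetime.strptime(f"{default_year} {line[:15]}", "%Y %b %d %H:%M:%S")
+    except ValueError:
+        return None
+
+
+def keep_line(line: str, since: Optional[datetime], default_year: int) -> bool:
+    """Keep lines at or after `since`; lines without a timestamp are kept."""
+    stamp = parse_timestamp(line, default_year)
+    if stamp is None or since is None:
+        return True
+    return stamp >= since
+```
+
+A caller would pass `since.year` as `default_year` when a lower bound is given and `datetime.now().year` otherwise, which mirrors the year rule described above.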
+
+## Usage
+
+```bash
+cd infra-run/scripts/python/incident-log-summary
+
+python3 incident_log_summary.py --file examples/system-messages.log
+python3 incident_log_summary.py --file examples/app-error.log --format markdown --output incident-report.md
+python3 incident_log_summary.py --file examples/app-error.log --format json
+python3 incident_log_summary.py --file examples/app-error.log --top 20
+python3 incident_log_summary.py --file examples/app-error.log --ignore-case
+python3 incident_log_summary.py --file examples/app-error.log --since "2026-05-11 10:00:00"
+python3 incident_log_summary.py --file examples/app-error.log --until "2026-05-11 12:00:00"
+```
+
+## Output Formats
+
+- `text` - default terminal-oriented report.
+- `markdown` - incident or change ticket attachment format.
+- `json` - structured output for local automation.
+
+Use `--output <path>` to write the rendered report to a file. Without `--output`, the report is printed to stdout.
+
+## Exit Codes
+
+- `0` - OK, no findings.
+- `1` - Operational findings detected.
+- `2` - Invalid input, unreadable file, bad argument, or runtime error.
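+
+Because exit code `1` signals findings and not a crash, a wrapper script should branch on the return code instead of treating any non-zero value as failure. The sketch below shows one way to do that; `EXIT_MEANINGS` and `run_and_classify` are hypothetical names introduced for illustration.
+
+```python
+import subprocess
+import sys
+
+# Labels follow the exit code list in this README.
+EXIT_MEANINGS = {
+    0: "ok: no findings",
+    1: "findings: operator review needed",
+    2: "error: invalid input or runtime failure",
+}
+
+
+def run_and_classify(command: list[str]) -> str:
+    """Run a command and translate its exit code using the table above."""
+    # check=True is deliberately not used: exit code 1 is an expected outcome.
+    result = subprocess.run(command, capture_output=True, text=True)
+    return EXIT_MEANINGS.get(
+        result.returncode, f"unexpected exit code {result.returncode}"
+    )
+
+
+# Example call, using the script path and flag from the usage section:
+# run_and_classify([sys.executable, "incident_log_summary.py",
+#                   "--file", "examples/app-error.log"])
+```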
+
+## Example Text Output
+
+```text
+Incident Log Summary
+====================
+
+[CRITICAL] no space left on device
+Occurrences: 1
+First seen: 2026-05-11 10:16:07
+Last seen: 2026-05-11 10:16:07
+Samples:
+  - May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages
+
+Operational Summary
+-------------------
+Total lines scanned: 7
+Total findings: 7
+Critical finding groups: 3
+Warning finding groups: 4
+Overall status: CRITICAL
+```
+
+## Markdown Workflow
+
+Generate a markdown report from the collected log and attach it to the incident or change ticket as supporting evidence:
+
+```bash
+python3 incident_log_summary.py \
+  --file examples/app-error.log \
+  --format markdown \
+  --output incident-report.md
+```
+
+Review the report before attaching it. The output is evidence for triage; it is not a final root cause statement.
+
+## Operational Limitations
+
+- Pattern matching is intentionally simple and predictable.
+- A single line can match multiple patterns, such as `ERROR`, `HTTP 503`, and `unavailable`.
+- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
+- Syslog timestamps without a year are normalized with an inferred year.
+- Date filters are best-effort: lines without parseable timestamps are retained, so filtered output can include lines from outside the requested window.
+- The whole log file is read into memory; for very large incidents, collect a scoped file or time-windowed extract first.
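+
+One way to produce a scoped file before running the tool is to stream the source log and keep only its last lines. This is a separate helper sketch, not part of the tool; `write_tail_extract` is a hypothetical name and the line limit is an arbitrary choice.
+
+```python
+from collections import deque
+
+
+def write_tail_extract(source: str, destination: str, max_lines: int) -> int:
+    """Copy the last `max_lines` lines of `source` to `destination`."""
+    with open(source, encoding="utf-8", errors="replace") as src:
+        # A deque with maxlen drops older lines as newer ones arrive,
+        # so memory use is bounded by max_lines, not by file size.
+        tail = deque(src, maxlen=max_lines)
+    with open(destination, "w", encoding="utf-8") as dst:
+        dst.writelines(tail)
+    return len(tail)
+```
+
+The resulting extract can then be passed to `--file` in place of the full log.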
+
+## Safety Notes
+
+- The tool only reads the input log and optionally writes a separate report.
+- It does not require elevated privileges unless the chosen log path requires them.
+- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
+- Treat findings as prompts for operator review, not automated remediation instructions.