Add incident log summary tool
@@ -0,0 +1,158 @@
# incident-log-summary

`incident-log-summary` is a read-only Python CLI for quick incident log review. It scans a local Linux system log or application log and groups configured operational patterns by severity, count, timestamps, and sample lines.

The tool is meant for first-pass triage and incident notes. It does not replace full log search, alert correlation, service-specific runbooks, or review by an operator who understands the affected platform.

## When To Use

- During incident response when a collected log file needs a fast pattern summary.
- Before attaching evidence to an incident, problem, or change ticket.
- When comparing whether a log contains obvious storage, memory, service, TLS, HTTP, or connectivity failures.
- When JSON output is useful for later local automation.

## What It Does Not Do

- It does not read remote systems.
- It does not modify logs or system state.
- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
- It does not prove root cause.
- It does not classify every possible vendor or application error.
- It does not treat sanitized examples as production validation.

## Supported Input

- One local text log file provided with `--file`.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.

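The input rules above can be sketched roughly as follows. This is an illustration only, not the tool's actual code: `read_log_lines` and `main` are hypothetical names, and the sketch assumes "empty" refers to an empty path argument.

```python
import sys
from pathlib import Path

EXIT_INVALID_INPUT = 2  # exit code the README documents for rejected input


def read_log_lines(path_arg: str) -> list[str]:
    """Return the lines of a local log file, tolerating invalid UTF-8.

    Hypothetical helper: raises ValueError for the path problems listed above.
    """
    if not path_arg:
        raise ValueError("empty path")
    path = Path(path_arg)
    if not path.exists():
        raise ValueError(f"missing path: {path}")
    if not path.is_file():
        raise ValueError(f"not a regular file: {path}")
    try:
        # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
        text = path.read_text(encoding="utf-8", errors="replace")
    except OSError as exc:
        raise ValueError(f"unreadable file: {exc}") from exc
    return text.splitlines()


def main(argv: list[str]) -> int:
    """Return 2 for rejected input, 0 otherwise (findings are out of scope here)."""
    try:
        lines = read_log_lines(argv[1] if len(argv) > 1 else "")
    except ValueError as exc:
        print(f"error: {exc}", file=sys.stderr)
        return EXIT_INVALID_INPUT
    print(f"read {len(lines)} lines")
    return 0
```

Replacing invalid bytes rather than failing means a partially corrupted log can still be reviewed, at the cost of seeing `U+FFFD` replacement characters in sample lines.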
## Supported Patterns

Critical patterns:

- `CRITICAL`
- `FATAL`
- `panic`
- `kernel panic`
- `no space left on device`
- `out of memory`
- `killed process`
- `read-only file system`
- `segmentation fault`
- `segfault`
- `certificate expired`
- `TLS handshake failed`
- `SSLHandshakeException`
- `database unavailable`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`

Warning patterns:

- `ERROR`
- `failed`
- `failure`
- `timeout`
- `connection refused`
- `connection reset`
- `permission denied`
- `authentication failed`
- `denied`
- `unavailable`
- `service restart`
- `retrying`

By default matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across all configured patterns.

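To illustrate how case sensitivity affects matching, here is a minimal sketch. It assumes plain substring matching, which the README does not state explicitly, and uses only a small subset of the patterns; `match_patterns` is a hypothetical name, not the tool's API.

```python
# Abbreviated subset of the documented patterns, for illustration only.
CRITICAL_PATTERNS = ["CRITICAL", "FATAL", "no space left on device", "HTTP 503"]
WARNING_PATTERNS = ["ERROR", "failed", "timeout", "unavailable"]


def match_patterns(line: str, ignore_case: bool = False) -> list[tuple[str, str]]:
    """Return (severity, pattern) pairs for every pattern found in the line."""
    haystack = line.lower() if ignore_case else line
    hits = []
    for severity, patterns in (
        ("CRITICAL", CRITICAL_PATTERNS),
        ("WARNING", WARNING_PATTERNS),
    ):
        for pattern in patterns:
            needle = pattern.lower() if ignore_case else pattern
            if needle in haystack:  # assumption: simple substring test
                hits.append((severity, pattern))
    return hits


line = "2026-05-11 10:15:30 ERROR upstream returned HTTP 503: service unavailable"
print(match_patterns(line))
# [('CRITICAL', 'HTTP 503'), ('WARNING', 'ERROR'), ('WARNING', 'unavailable')]

print(match_patterns("request Failed after Timeout"))
# []

print(match_patterns("request Failed after Timeout", ignore_case=True))
# [('WARNING', 'failed'), ('WARNING', 'timeout')]
```

Under this substring assumption, overlapping patterns such as `panic` and `kernel panic` would both match the same line, so group counts should not be read as counts of distinct events.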
## Timestamp Handling

The scanner attempts to parse:

- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `May 11 10:15:30`

Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and date filtering keeps those lines by default so potentially important findings are not silently discarded.

Syslog-style timestamps do not include a year. For filtering, the tool uses the year from `--since` when present, otherwise the current local year.

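The parsing and filtering behaviour described above might look roughly like this. It is a sketch under assumptions, not the tool's implementation: it assumes the timestamp sits at the start of the line, that `--since` and `--until` bounds are inclusive, and that month abbreviations are English. `parse_timestamp` and `keep_line` are hypothetical names.

```python
import re
from datetime import datetime
from typing import Optional

ISO_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")
SYSLOG_RE = re.compile(r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}")


def parse_timestamp(line: str, default_year: int) -> Optional[datetime]:
    """Best-effort parse of a leading timestamp; None when nothing matches."""
    m = ISO_RE.match(line)
    if m:
        try:
            return datetime.strptime(m.group(0).replace("T", " "), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            return None
    m = SYSLOG_RE.match(line)
    if m:
        try:
            # Syslog timestamps carry no year, so one is supplied explicitly.
            return datetime.strptime(f"{default_year} {m.group(0)}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None
    return None


def keep_line(line: str, since: Optional[datetime], until: Optional[datetime]) -> bool:
    """Apply the date bounds; lines without a parseable timestamp are kept."""
    year = since.year if since else datetime.now().year
    ts = parse_timestamp(line, year)
    if ts is None:
        return True  # never silently drop a line we could not date
    if since and ts < since:
        return False
    if until and ts > until:
        return False
    return True
```

Because the year is inferred, a syslog extract that spans a year boundary (late December into January) can be filtered incorrectly; passing an explicit `--since` pins the year used for the comparison.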
## Usage

```bash
cd infra-run/scripts/python/incident-log-summary

python3 incident_log_summary.py --file examples/system-messages.log
python3 incident_log_summary.py --file examples/app-error.log --format markdown --output incident-report.md
python3 incident_log_summary.py --file examples/app-error.log --format json
python3 incident_log_summary.py --file examples/app-error.log --top 20
python3 incident_log_summary.py --file examples/app-error.log --ignore-case
python3 incident_log_summary.py --file examples/app-error.log --since "2026-05-11 10:00:00"
python3 incident_log_summary.py --file examples/app-error.log --until "2026-05-11 12:00:00"
```

## Output Formats

- `text` - default terminal-oriented report.
- `markdown` - incident or change ticket attachment format.
- `json` - structured output for local automation.

Use `--output <path>` to write the rendered report to a file. Without `--output`, the report is printed to stdout.

## Exit Codes

- `0` - OK, no findings.
- `1` - Operational findings detected.
- `2` - Invalid input, unreadable file, bad argument, or runtime error.

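A wrapper script might branch on these codes as in the sketch below. `describe_exit_code` and `run_summary` are hypothetical helpers, and the sketch assumes it runs from the directory that contains `incident_log_summary.py`.

```python
import subprocess
import sys

# Labels taken from the exit code list above.
EXIT_MEANINGS = {
    0: "ok: no findings",
    1: "findings: operational findings detected",
    2: "error: invalid input, unreadable file, bad argument, or runtime error",
}


def describe_exit_code(code: int) -> str:
    """Map a documented exit code to a label; anything else is unexpected."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_summary(log_path: str) -> int:
    """Run the CLI on one file and return its exit code without raising."""
    result = subprocess.run(
        [sys.executable, "incident_log_summary.py", "--file", log_path],
        check=False,  # exit code 1 means findings, not a tool failure
    )
    return result.returncode
```

Note that exit code `1` signals findings rather than a failed run, so callers that use `check=True` or a shell with `set -e` would treat a log with findings as an error unless they handle that code explicitly.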
## Example Text Output

```text
Incident Log Summary
====================

[CRITICAL] no space left on device
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Samples:
- May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages

Operational Summary
-------------------
Total lines scanned: 7
Total findings: 7
Critical finding groups: 3
Warning finding groups: 4
Overall status: CRITICAL
```

## Markdown Workflow

Generate a markdown report from the collected log and attach it to the incident or change ticket as supporting evidence:

```bash
python3 incident_log_summary.py \
  --file examples/app-error.log \
  --format markdown \
  --output incident-report.md
```

Review the report before attaching it. The output is evidence for triage; it is not a final root cause statement.

## Operational Limitations

- Pattern matching is intentionally simple and predictable.
- A single line can match multiple patterns, such as `ERROR`, `HTTP 503`, and `unavailable`.
- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
- Syslog timestamps without a year are normalized with an inferred year.
- Date filters are best-effort because lines without parseable timestamps are retained.
- Large log files are read into memory; collect a scoped file or time-windowed extract for very large incidents.

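For very large logs, one way to produce a time-windowed extract before running the tool is to stream the file line by line. The helper below is a hypothetical pre-processing sketch, not part of `incident-log-summary`. It assumes lines start with an ISO-style timestamp (`YYYY-MM-DD HH:MM:SS` or with a `T` separator) and that undated lines, such as stack-trace continuations, belong to the preceding dated line.

```python
from typing import Iterable, Iterator

TS_LEN = len("2026-05-11 10:15:30")


def _looks_like_timestamp(prefix: str) -> bool:
    """Cheap shape check for 'YYYY-MM-DD HH:MM:SS'."""
    return (
        len(prefix) == TS_LEN
        and prefix[4] == "-" and prefix[7] == "-" and prefix[10] == " "
        and prefix[13] == ":" and prefix[16] == ":"
        and prefix.replace("-", "").replace(":", "").replace(" ", "").isdigit()
    )


def extract_window(lines: Iterable[str], start: str, end: str) -> Iterator[str]:
    """Yield lines whose leading timestamp falls within [start, end].

    Undated lines inherit the decision made for the most recent dated line.
    """
    keep = False
    for line in lines:
        prefix = line[:TS_LEN].replace("T", " ")
        if _looks_like_timestamp(prefix):
            # Fixed-width ISO timestamps sort lexicographically in time order.
            keep = start <= prefix <= end
        if keep:
            yield line


def write_extract(src_path: str, dst_path: str, start: str, end: str) -> None:
    """Stream src_path into dst_path without loading the whole file."""
    with open(src_path, encoding="utf-8", errors="replace") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(extract_window(src, start, end))
```

For example, `write_extract("big.log", "extract.log", "2026-05-11 10:00:00", "2026-05-11 12:00:00")` writes a smaller file that can then be passed to `--file`. Syslog-style timestamps are not handled by this sketch.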
## Safety Notes

- The tool only reads the input log and optionally writes a separate report.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat findings as prompts for operator review, not automated remediation instructions.