
# incident-log-summary

`incident-log-summary` is a read-only Python CLI for quick incident log review. It scans a local Linux system log or application log for configured operational patterns and reports each matched pattern with its severity, occurrence count, first and last timestamps, and sample lines.

The tool is meant for first-pass triage and incident notes. It does not replace full log search, alert correlation, service-specific runbooks, or review by an operator who understands the affected platform.

## When To Use

- During incident response when a collected log file needs a fast pattern summary.
- Before attaching evidence to an incident, problem, or change ticket.
- When comparing whether a log contains obvious storage, memory, service, TLS, HTTP, or connectivity failures.
- When JSON output is useful for later local automation.

## What It Does Not Do

- It does not read remote systems.
- It does not modify logs or system state.
- It does not query ELK, Zabbix, SIEM, journald, or application APIs.
- It does not prove root cause.
- It does not classify every possible vendor or application error.
- It does not treat sanitized examples as production validation.

## Supported Input

- One local text log file provided with `--file`.
- UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
- Empty, missing, unreadable, or non-file paths are rejected with exit code `2`.
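
The input rules above could be implemented roughly as follows. This is a minimal sketch under stated assumptions, not the tool's actual code: `read_log_lines` and `InputError` are hypothetical names, and "empty" is read here as an empty path string.

```python
from pathlib import Path

EXIT_INVALID_INPUT = 2  # exit code documented for rejected input


class InputError(Exception):
    """Hypothetical error type for input that should end with exit code 2."""


def read_log_lines(path_text: str) -> list[str]:
    """Return the lines of a local log file, replacing undecodable bytes."""
    if not path_text:
        raise InputError("empty path")
    path = Path(path_text)
    if not path.exists():
        raise InputError(f"missing path: {path}")
    if not path.is_file():
        raise InputError(f"not a regular file: {path}")
    try:
        # errors="replace" turns invalid UTF-8 bytes into U+FFFD so review can continue.
        return path.read_text(encoding="utf-8", errors="replace").splitlines()
    except OSError as exc:
        raise InputError(f"unreadable file: {path}: {exc}") from exc
```

A caller would catch `InputError`, print the message to stderr, and exit with `EXIT_INVALID_INPUT`.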

## Supported Patterns

Critical patterns:

- `CRITICAL`
- `FATAL`
- `panic`
- `kernel panic`
- `no space left on device`
- `out of memory`
- `killed process`
- `read-only file system`
- `segmentation fault`
- `segfault`
- `certificate expired`
- `TLS handshake failed`
- `SSLHandshakeException`
- `database unavailable`
- `HTTP 500`
- `HTTP 502`
- `HTTP 503`
- `HTTP 504`

Warning patterns:

- `ERROR`
- `failed`
- `failure`
- `timeout`
- `connection refused`
- `connection reset`
- `permission denied`
- `authentication failed`
- `denied`
- `unavailable`
- `service restart`
- `retrying`

By default, matching is case-sensitive. Use `--ignore-case` to match all configured patterns case-insensitively.
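
The matching behavior can be illustrated with a short sketch. It assumes plain substring matching, which this README does not confirm, and uses only a subset of the patterns listed above; `match_line` is a hypothetical helper, not a function exported by the tool.

```python
# Subset of the documented patterns, for illustration only.
CRITICAL_PATTERNS = ["no space left on device", "out of memory", "HTTP 503"]
WARNING_PATTERNS = ["ERROR", "failed", "timeout", "unavailable"]


def match_line(line: str, ignore_case: bool = False) -> list[tuple[str, str]]:
    """Return (severity, pattern) pairs for every pattern found in the line."""
    haystack = line.lower() if ignore_case else line
    hits = []
    for severity, patterns in (("CRITICAL", CRITICAL_PATTERNS), ("WARNING", WARNING_PATTERNS)):
        for pattern in patterns:
            # Assumption: substring search, lowercased on both sides when ignoring case.
            needle = pattern.lower() if ignore_case else pattern
            if needle in haystack:
                hits.append((severity, pattern))
    return hits


line = "2026-05-11 10:15:30 ERROR upstream returned HTTP 503: service unavailable"
print(match_line(line))
# [('CRITICAL', 'HTTP 503'), ('WARNING', 'ERROR'), ('WARNING', 'unavailable')]
print(match_line("error: Timeout talking to db"))
# []
print(match_line("error: Timeout talking to db", ignore_case=True))
# [('WARNING', 'ERROR'), ('WARNING', 'timeout')]
```

The second call shows why the default can under-report: neither `error` nor `Timeout` matches the configured casing.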

## Timestamp Handling

The scanner attempts to parse:

- `2026-05-11 10:15:30`
- `2026-05-11T10:15:30`
- `May 11 10:15:30`

Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and date filtering keeps those lines by default so potentially important findings are not silently discarded.

Syslog-style timestamps do not include a year. For filtering, the tool uses the year from `--since` when present, otherwise the current local year.
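
A best-effort parser for the three formats might look like the sketch below. It is illustrative only: `parse_timestamp` and `default_year` are hypothetical names, and the sketch assumes the timestamp sits at the start of the line. The caller would pass the year taken from `--since` as `default_year`, or leave it unset to fall back to the current local year.

```python
import re
from datetime import datetime
from typing import Optional

ISO_RE = re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}")
SYSLOG_RE = re.compile(r"[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}")


def parse_timestamp(line: str, default_year: Optional[int] = None) -> Optional[datetime]:
    """Best-effort parse of a leading timestamp; returns None when nothing parses."""
    iso = ISO_RE.match(line)
    if iso:
        try:
            return datetime.strptime(iso.group(0).replace("T", " "), "%Y-%m-%d %H:%M:%S")
        except ValueError:
            return None  # looks like a date but is not a valid one
    syslog = SYSLOG_RE.match(line)
    if syslog:
        year = default_year if default_year is not None else datetime.now().year
        # Syslog lines carry no year, so prepend the inferred one before parsing.
        # Note: %b depends on the process locale; English month names are assumed.
        text = f"{year} {' '.join(syslog.group(0).split())}"
        try:
            return datetime.strptime(text, "%Y %b %d %H:%M:%S")
        except ValueError:
            return None
    return None
```

Returning `None` instead of raising keeps the line available for pattern analysis even when its timestamp cannot be read.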

## Usage

```bash
cd infra-run/scripts/python/incident-log-summary

python3 incident_log_summary.py --file examples/system-messages.log
python3 incident_log_summary.py --file examples/app-error.log --format markdown --output incident-report.md
python3 incident_log_summary.py --file examples/app-error.log --format json
python3 incident_log_summary.py --file examples/app-error.log --top 20
python3 incident_log_summary.py --file examples/app-error.log --ignore-case
python3 incident_log_summary.py --file examples/app-error.log --since "2026-05-11 10:00:00"
python3 incident_log_summary.py --file examples/app-error.log --until "2026-05-11 12:00:00"
```

## Output Formats

- `text` - default terminal-oriented report.
- `markdown` - incident or change ticket attachment format.
- `json` - structured output for local automation.

Use `--output <path>` to write the rendered report to a file. Without `--output`, the report is printed to stdout.

## Exit Codes

- `0` - OK, no findings.
- `1` - Operational findings detected.
- `2` - Invalid input, unreadable file, bad argument, or runtime error.
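
When the CLI is called from another local script, exit code `1` should be read as a result rather than a crash. The wrapper below is a sketch: `describe_exit` and `run_summary` are hypothetical helpers, and it assumes the script is in the current working directory.

```python
import subprocess
import sys

EXIT_MEANINGS = {
    0: "ok: no findings",
    1: "findings: operational patterns detected",
    2: "error: invalid input, unreadable file, bad argument, or runtime error",
}


def describe_exit(code: int) -> str:
    """Map a documented exit code to a short label; anything else is unexpected."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_summary(log_path: str) -> int:
    """Run the CLI on one log file and return its exit code."""
    result = subprocess.run(
        [sys.executable, "incident_log_summary.py", "--file", log_path],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means findings, so do not raise on non-zero
    )
    return result.returncode
```

For example, `describe_exit(run_summary("examples/app-error.log"))` would yield a label suitable for a shell prompt or ticket note.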

## Example Text Output

Abridged excerpt: only one finding group is shown ahead of the summary.

```text
Incident Log Summary
====================

[CRITICAL] no space left on device
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Samples:
  - May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages

Operational Summary
-------------------
Total lines scanned: 7
Total findings: 7
Critical finding groups: 3
Warning finding groups: 4
Overall status: CRITICAL
```

## Markdown Workflow

Generate a markdown report from the collected log and attach it to the incident or change ticket as supporting evidence:

```bash
python3 incident_log_summary.py \
  --file examples/app-error.log \
  --format markdown \
  --output incident-report.md
```

Review the report before attaching it. The output is evidence for triage; it is not a final root cause statement.

## Operational Limitations

- Pattern matching is intentionally simple and predictable.
- A single line can match multiple patterns, such as `ERROR`, `HTTP 503`, and `unavailable`.
- Default case-sensitive matching can miss differently cased variants, such as `error` or `FAILED`, unless `--ignore-case` is used.
- Syslog timestamps without a year are normalized with an inferred year.
- Date filters are best-effort: lines without parseable timestamps are retained, so a filtered report can still include lines from outside the requested window.
- Large log files are read into memory; collect a scoped file or time-windowed extract for very large incidents.
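
The retention rule for date filters can be sketched as a small predicate. This is an assumption-laden illustration: `within_window` is a hypothetical name, and inclusive `since` and `until` bounds are assumed because this README does not state how boundaries are handled.

```python
from datetime import datetime
from typing import Optional


def within_window(
    timestamp: Optional[datetime],
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> bool:
    """Return True when a line should be kept by the date filter."""
    if timestamp is None:
        # No parseable timestamp: keep the line rather than drop a possible finding.
        return True
    if since is not None and timestamp < since:
        return False
    if until is not None and timestamp > until:
        return False
    return True
```

Because undated lines always pass, a multi-line stack trace that follows a timestamped error stays in the report even when a window is set.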

## Safety Notes

- The tool only reads the input log and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.