# known-error-matcher

`known-error-matcher` is a read-only Python CLI for scanning local log files against a JSON catalog of known operational error patterns. It connects matched log symptoms with severity, category, sample lines, and runbook references so an infrastructure engineer can decide what needs review next.

A match means that a cataloged operational error pattern appeared in the log and needs review. It does not prove an incident, identify root cause automatically, or replace service-specific runbooks.

## Purpose

- Identify which cataloged operational problems are visible in a collected log.
- Count how often each known error pattern appears.
- Surface warning and critical matches conservatively.
- Point operators toward relevant runbooks or supporting local tools.
- Produce predictable text, Markdown, or JSON output for incident notes.

## When To Use

- During incident response when a collected application, system, or journal extract needs quick known-error matching.
- Before attaching log evidence to an incident, problem, or change ticket.
- When teams maintain a small local catalog of operational patterns and runbook links.
- When JSON output is useful for later local automation.

## What It Does Not Do

- It does not read remote systems or live streams.
- It does not modify logs, services, applications, accounts, or host state.
- It does not query ELK, SIEM, APM, Zabbix, ticketing systems, or external services.
- It does not find root cause automatically.
- It does not prove an incident or confirm customer impact.
- It does not classify every vendor-specific log message.

## Pattern Catalog Format

Patterns are defined in JSON because the Python standard library can parse JSON without third-party dependencies.

```json
{
  "patterns": [
    {
      "id": "disk_full",
      "name": "Disk full",
      "severity": "CRITICAL",
      "regex": "No space left on device|disk full",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem or application failed because free space was exhausted."
    }
  ]
}
```

Required fields per pattern:

- `id` - stable non-empty identifier.
- `name` - human-readable finding name.
- `severity` - `WARNING` or `CRITICAL`.
- `regex` - Python regular expression used for matching.

Optional fields:

- `category` - operational grouping such as `storage`, `network`, `security`, `application`, or `systemd`. Missing values are reported as `UNKNOWN`.
- `runbook` - repository path to review when the pattern matches. Missing values are reported as `None`.
- `description` - short operator-facing explanation. Missing values are reported as `None`.

The catalog is validated before scanning starts. Invalid JSON, missing required fields, duplicate IDs, invalid severity values, and invalid regexes fail with exit code `2`.
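
The checks listed above can be sketched roughly as follows. This is a minimal illustration rather than the tool's actual implementation: the function name `validate_catalog` and the wording of the problem messages are assumptions made for this example.

```python
import json
import re

REQUIRED_FIELDS = ("id", "name", "severity", "regex")
ALLOWED_SEVERITIES = {"WARNING", "CRITICAL"}


def validate_catalog(text):
    """Return a list of problems found in a catalog JSON string.

    Hypothetical helper: an empty list means the catalog passed every check.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    patterns = data.get("patterns") if isinstance(data, dict) else None
    if not isinstance(patterns, list):
        return ["top-level 'patterns' list is missing"]

    problems = []
    seen_ids = set()
    for index, entry in enumerate(patterns):
        if not isinstance(entry, dict):
            problems.append(f"pattern {index}: not a JSON object")
            continue
        # Required fields must be present and be non-empty strings.
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(entry.get(field), str) or not entry[field]
        ]
        if missing:
            problems.append(f"pattern {index}: missing or empty {', '.join(missing)}")
            continue
        if entry["id"] in seen_ids:
            problems.append(f"pattern {index}: duplicate id {entry['id']!r}")
        seen_ids.add(entry["id"])
        if entry["severity"] not in ALLOWED_SEVERITIES:
            problems.append(f"pattern {index}: invalid severity {entry['severity']!r}")
        try:
            re.compile(entry["regex"])
        except re.error as exc:
            problems.append(f"pattern {index}: invalid regex ({exc})")
    return problems
```

Collecting every problem before reporting, instead of stopping at the first one, lets a catalog author fix several entries in one pass. A caller would treat a non-empty list as the exit code `2` condition.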

## Adding A Known Error Pattern

Add a new object under `patterns` in `patterns.json`:

```json
{
  "id": "example_dependency_failure",
  "name": "Example dependency failure",
  "severity": "WARNING",
  "regex": "dependency request failed|upstream dependency unavailable",
  "category": "application",
  "runbook": "infra-run/runbooks/incidents/dependency-failure.md",
  "description": "Application logged a dependency failure that requires review."
}
```

Use a stable `id`, choose the lowest severity that still reflects operational risk, and keep the regex specific enough to avoid noisy generic matches. Prefer a runbook path that already exists; otherwise use a plausible future path under `infra-run/runbooks/incidents/` or leave it empty.
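
Before committing a new entry, it can help to try the regex against a few lines that should match and a few that should not. The sketch below is one way to do that. The helper `check_regex` is hypothetical, the sample log lines are made up, and the example assumes the tool searches anywhere in a line (as `re.search` does), which this README does not state explicitly.

```python
import re


def check_regex(regex, should_match, should_not_match, ignore_case=False):
    """Return (missed, noisy) lines for a candidate pattern regex.

    Hypothetical helper, not part of known_error_matcher.py.
    missed: lines that were expected to match but did not.
    noisy:  lines that were expected to stay clean but matched.
    """
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(regex, flags)
    missed = [line for line in should_match if not compiled.search(line)]
    noisy = [line for line in should_not_match if compiled.search(line)]
    return missed, noisy


if __name__ == "__main__":
    missed, noisy = check_regex(
        "dependency request failed|upstream dependency unavailable",
        should_match=["ERROR upstream dependency unavailable for billing"],
        should_not_match=["INFO dependency check passed"],
    )
    print("missed:", missed)
    print("noisy:", noisy)
```

A regex that produces entries in `noisy` is usually too generic for the catalog; one that produces entries in `missed` may need an extra alternative or `--ignore-case` at scan time.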

## Severity Model

Overall status is conservative and takes the highest severity among the matched patterns:

- `OK` - no known error patterns matched.
- `WARNING` - one or more warning patterns matched and no critical patterns matched.
- `CRITICAL` - one or more critical patterns matched.

The status means known error patterns require review. It is not a final root-cause statement.
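
The rule above can be expressed in a few lines. This is a sketch of the described behavior, not the tool's source; the function name `overall_status` is an assumption.

```python
def overall_status(matched_severities):
    """Derive the overall status from the severities of all matched patterns.

    Hypothetical helper: any CRITICAL match wins, otherwise any WARNING
    match, otherwise OK.
    """
    severities = set(matched_severities)
    if "CRITICAL" in severities:
        return "CRITICAL"
    if "WARNING" in severities:
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    print(overall_status([]))
    print(overall_status(["WARNING", "WARNING"]))
    print(overall_status(["WARNING", "CRITICAL"]))
```

Note that the number of matches does not change the result: a single critical match outweighs any number of warnings.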

## Category Filtering

Use `--category CATEGORY` to include only matches where the pattern category exactly matches the provided value.

Examples:

```bash
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json --category storage
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --category application
```

## Usage

```bash
cd infra-run/scripts/python/known-error-matcher

python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown --output known-error-report.md
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --ignore-case
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --severity critical
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --top 10
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --max-samples 5
```

## Output Formats

- `text` - default terminal-oriented report.
- `markdown` - report formatted for attaching to an incident, problem, or change ticket.
- `json` - structured output for local automation.

Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file or pattern catalog file.
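
The output-path guard exists so that a report can never overwrite the evidence it was generated from. One way such a check could work is to compare resolved paths, as in the sketch below. The function name `output_collides` is hypothetical, and the actual tool may perform the comparison differently.

```python
from pathlib import Path


def output_collides(output_path, protected_paths):
    """Return True if output_path resolves to any of the protected input paths.

    Hypothetical helper. Resolving first means that different spellings of
    the same location, such as "./patterns.json" and "patterns.json",
    compare as equal.
    """
    resolved = Path(output_path).resolve()
    return any(resolved == Path(p).resolve() for p in protected_paths)


if __name__ == "__main__":
    inputs = ["examples/sample-app.log", "patterns.json"]
    print(output_collides("known-error-report.md", inputs))
    print(output_collides("./patterns.json", inputs))
```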

## Exit Codes

- `0` - OK, no known error matches.
- `1` - Known error matches detected.
- `2` - Invalid input, unreadable file, invalid JSON, invalid pattern catalog, invalid regex, bad argument, output write failure, or runtime error.
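
When the tool is called from a local wrapper script, exit code `1` signals findings rather than a tool failure, so it should not be treated as an error. The sketch below shows one way to handle that. The helpers `describe_exit_code` and `run_matcher` are assumptions for this example, and `run_matcher` expects to be run from the tool's directory.

```python
import subprocess
import sys

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {
    0: "no known error matches",
    1: "known error matches detected; review required",
    2: "invalid input or runtime error; the scan result is not usable",
}


def describe_exit_code(code):
    """Map an exit code to a short operator-facing description."""
    return EXIT_MEANINGS.get(code, f"unexpected exit code {code}")


def run_matcher(log_path, catalog_path):
    """Run the CLI and return (exit_code, stdout).

    check=False is deliberate: exit code 1 means matches were found,
    which is a normal outcome and should not raise CalledProcessError.
    """
    result = subprocess.run(
        [sys.executable, "known_error_matcher.py",
         "--file", log_path, "--patterns", catalog_path],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.returncode, result.stdout


if __name__ == "__main__":
    code, report = run_matcher("examples/sample-app.log", "patterns.json")
    print(describe_exit_code(code))
```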

## Example Text Output

```text
Known Error Matcher
===================

Overall status: CRITICAL
Known error pattern matches require operator review; logs alone do not prove root cause.

[CRITICAL] database_unavailable - Database unavailable
Category: application
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Runbook: infra-run/scripts/python/jvm-log-analyzer/README.md
Description: Application logged unavailable database or database connectivity symptoms.
Samples:
  - 2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool

Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 9
Known error matches: 7
Matched known error patterns: 7
Critical matched patterns: 5
Warning matched patterns: 2
Top categories: application (3), network (2), application_jvm (2)
Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
Timestamp coverage: parsed=9, unknown=0
Filters used: severity=None, category=None
Pattern catalog path: patterns.json
```

## Markdown Workflow

Generate a Markdown report from the collected log and attach it to the incident or problem ticket as supporting evidence:

```bash
python3 known_error_matcher.py \
  --file examples/sample-app.log \
  --patterns patterns.json \
  --format markdown \
  --output known-error-report.md
```

Review the report before attaching it. A `WARNING` or `CRITICAL` result should be correlated with service health, monitoring, recent changes, dependency status, and the referenced runbook.

## Operational Limitations

- Pattern matching is intentionally simple and predictable.
- A single log line can match multiple known error patterns.
- Matching is case-sensitive by default, so differently cased variants of a message can be missed unless `--ignore-case` is used.
- Timestamp parsing is best-effort; unparseable timestamps are reported as `UNKNOWN`.
- Counts are raw log-line matches, not request rates, incident duration, or customer impact.
- `--top` limits displayed findings only. The summary still reflects all matched patterns after filters.
- Large log files are read into memory; use scoped extracts for very large incidents.
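
Two of the limitations above, multiple patterns matching one line and case sensitivity, are easy to see in a small example. The regexes and log lines below are illustrative assumptions and are not taken from the shipped `patterns.json`; the helper `matching_ids` is hypothetical.

```python
import re


def matching_ids(line, patterns, ignore_case=False):
    """Return the ids of every pattern whose regex is found in the line."""
    flags = re.IGNORECASE if ignore_case else 0
    return [
        pattern_id
        for pattern_id, regex in patterns.items()
        if re.search(regex, line, flags)
    ]


# Hypothetical regexes for illustration only.
PATTERNS = {
    "connection_refused": "Connection refused",
    "timeout": "timed out",
}

if __name__ == "__main__":
    # One line can count toward two different known errors.
    print(matching_ids("ERROR Connection refused after request timed out", PATTERNS))
    # A differently cased message is missed unless case is ignored.
    print(matching_ids("error: connection refused", PATTERNS))
    print(matching_ids("error: connection refused", PATTERNS, ignore_case=True))
```

Because one line can increment several counters, the sum of per-pattern occurrences can exceed the number of affected log lines.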

## Safety Notes

- The tool only reads the input log and pattern catalog and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, tokens, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.