# known-error-matcher

`known-error-matcher` is a read-only Python CLI for scanning local log files against a JSON catalog of known operational error patterns. It connects matched log symptoms with severity, category, sample lines, and runbook references so an infrastructure engineer can decide what needs review next.

The tool matches known operational error patterns that require review. It does not prove an incident, identify root cause automatically, or replace service-specific runbooks.

## Purpose

- Identify which cataloged operational problems are visible in a collected log.
- Count how often each known error pattern appears.
- Surface warning and critical matches conservatively.
- Point operators toward relevant runbooks or supporting local tools.
- Produce predictable text, Markdown, or JSON output for incident notes.

## When To Use

- During incident response when a collected application, system, or journal extract needs quick known-error matching.
- Before attaching log evidence to an incident, problem, or change ticket.
- When teams maintain a small local catalog of operational patterns and runbook links.
- When JSON output is useful for later local automation.

## What It Does Not Do

- It does not read remote systems or live streams.
- It does not modify logs, services, applications, accounts, or host state.
- It does not query ELK, SIEM, APM, Zabbix, ticketing systems, or external services.
- It does not find root cause automatically.
- It does not prove an incident or confirm customer impact.
- It does not classify every vendor-specific log message.

## Pattern Catalog Format

Patterns are defined in JSON because the Python standard library can parse JSON without third-party dependencies.

```json
{
  "patterns": [
    {
      "id": "disk_full",
      "name": "Disk full",
      "severity": "CRITICAL",
      "regex": "No space left on device|disk full",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem or application failed because free space was exhausted."
    }
  ]
}
```

Required fields per pattern:

- `id` - stable non-empty identifier.
- `name` - human-readable finding name.
- `severity` - `WARNING` or `CRITICAL`.
- `regex` - Python regular expression used for matching.

Optional fields:

- `category` - operational grouping such as `storage`, `network`, `security`, `application`, or `systemd`. Missing values are reported as `UNKNOWN`.
- `runbook` - repository path to review when the pattern matches. Missing values are reported as `None`.
- `description` - short operator-facing explanation. Missing values are reported as `None`.

The catalog is validated before scanning starts. Invalid JSON, missing required fields, duplicate IDs, invalid severity values, and invalid regexes fail with exit code `2`.

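The validation rules above could be checked roughly as follows. This is a minimal sketch, not the tool's actual implementation: the function name `catalog_errors`, the constants, and the error messages are hypothetical, and it assumes that required fields must be non-empty strings.

```python
import json
import re

REQUIRED_FIELDS = ("id", "name", "severity", "regex")  # from the list above
ALLOWED_SEVERITIES = {"WARNING", "CRITICAL"}


def catalog_errors(text):
    """Return a list of problems found in catalog text; empty means valid."""
    try:
        catalog = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    patterns = catalog.get("patterns") if isinstance(catalog, dict) else None
    if not isinstance(patterns, list):
        return ["catalog must be an object with a 'patterns' list"]

    errors = []
    seen_ids = set()
    for index, entry in enumerate(patterns):
        if not isinstance(entry, dict):
            errors.append(f"pattern #{index}: not an object")
            continue
        # Assumption: a required field must be a non-empty string.
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(entry.get(field), str) or not entry[field].strip()
        ]
        if missing:
            errors.append(f"pattern #{index}: missing or empty {', '.join(missing)}")
            continue
        if entry["id"] in seen_ids:
            errors.append(f"pattern #{index}: duplicate id {entry['id']!r}")
        seen_ids.add(entry["id"])
        if entry["severity"] not in ALLOWED_SEVERITIES:
            errors.append(f"pattern #{index}: invalid severity {entry['severity']!r}")
        try:
            re.compile(entry["regex"])
        except re.error as exc:
            errors.append(f"pattern #{index}: invalid regex ({exc})")
    return errors
```

A caller could print each problem to stderr and exit with code `2` when the returned list is non-empty.
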
## Adding A Known Error Pattern

Add a new object under `patterns` in `patterns.json`:

```json
{
  "id": "example_dependency_failure",
  "name": "Example dependency failure",
  "severity": "WARNING",
  "regex": "dependency request failed|upstream dependency unavailable",
  "category": "application",
  "runbook": "infra-run/runbooks/incidents/dependency-failure.md",
  "description": "Application logged a dependency failure that requires review."
}
```

Use a stable `id`, choose the lowest severity that still reflects operational risk, and keep the regex specific enough to avoid noisy generic matches. Prefer a runbook path that already exists; otherwise use a plausible future path under `infra-run/runbooks/incidents/` or leave it empty.

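Before committing a new pattern, it can help to try the regex against a few lines that should and should not match. The helper below is a hypothetical sketch, not part of the tool; it assumes the matcher uses substring-style matching (`re.search`), which this README does not state, and the sample log lines are invented.

```python
import re


def try_pattern(regex, should_match, should_not_match, ignore_case=False):
    """Return (missed, noisy): lines that behave unexpectedly for the regex."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(regex, flags)
    # Lines the pattern was expected to catch but did not.
    missed = [line for line in should_match if not compiled.search(line)]
    # Lines the pattern was expected to ignore but matched anyway.
    noisy = [line for line in should_not_match if compiled.search(line)]
    return missed, noisy


missed, noisy = try_pattern(
    "dependency request failed|upstream dependency unavailable",
    should_match=["ERROR upstream dependency unavailable: billing"],
    should_not_match=["INFO dependency check passed"],
)
print(missed, noisy)  # prints: [] []
```

Two empty lists suggest the regex is neither too narrow nor too broad for the chosen samples; it says nothing about lines that were not tried.
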
## Severity Model

Overall status is conservative:

- `OK` - no known error patterns matched.
- `WARNING` - one or more warning patterns matched and no critical patterns matched.
- `CRITICAL` - one or more critical patterns matched.

The status means known error patterns require review. It is not a final root-cause statement.

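The three rules above amount to "the highest matched severity wins". A minimal sketch, with a hypothetical function name and input shape (an iterable of severity strings, one per matched pattern):

```python
def overall_status(matched_severities):
    """Reduce the severities of matched patterns to one overall status."""
    severities = set(matched_severities)
    if "CRITICAL" in severities:
        return "CRITICAL"  # any critical match dominates
    if "WARNING" in severities:
        return "WARNING"   # warnings only, no critical match
    return "OK"            # nothing matched
```
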
## Category Filtering

Use `--category CATEGORY` to include only matches where the pattern category exactly matches the provided value.

Examples:

```bash
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json --category storage
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --category application
```

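Exact matching could look like the sketch below. The function name and the dict-based finding shape are hypothetical. Two behaviors are assumptions rather than documented facts: the comparison is case-sensitive, and patterns without a category are compared as `UNKNOWN`.

```python
def filter_by_category(findings, category=None):
    """Keep findings whose category equals the requested value exactly."""
    if category is None:
        return list(findings)  # no filter requested
    return [
        finding for finding in findings
        # Assumption: a missing category is treated as "UNKNOWN".
        if finding.get("category", "UNKNOWN") == category
    ]
```

Under these assumptions, `--category Storage` would not select a pattern whose category is `storage`.
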
## Usage

```bash
cd infra-run/scripts/python/known-error-matcher

python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown --output known-error-report.md
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --ignore-case
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --severity critical
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --top 10
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --max-samples 5
```

## Output Formats

- `text` - default terminal-oriented report.
- `markdown` - incident, problem, or change ticket attachment format.
- `json` - structured output for local automation.

Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file or pattern catalog file.

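The output-path guard could be approximated by comparing resolved paths, so that `./app.log` and `app.log` are treated as the same file. This is a sketch under that assumption; the function name is hypothetical and the tool's real check may differ (for example in how it handles hard links).

```python
from pathlib import Path


def output_conflicts(output_path, log_path, catalog_path):
    """Return True when the output path resolves to one of the input files."""
    target = Path(output_path).resolve()
    inputs = {Path(log_path).resolve(), Path(catalog_path).resolve()}
    return target in inputs
```

Resolving first also follows symbolic links that exist on disk, so a symlink pointing at the input log would be detected as a conflict.
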
## Exit Codes

- `0` - OK, no known error matches.
- `1` - Known error matches detected.
- `2` - Invalid input, unreadable file, invalid JSON, invalid pattern catalog, invalid regex, bad argument, output write failure, or runtime error.

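Because exit code `1` signals findings rather than a failure, local automation should not treat every non-zero code as an error. A hedged sketch of a wrapper: the helper names are hypothetical, and only flags documented in this README are passed through.

```python
import subprocess

EXIT_MEANINGS = {
    0: "no known error matches",
    1: "known error matches detected",
    2: "invalid input or runtime error",
}


def describe_exit(code):
    """Map a matcher exit code to a short description."""
    return EXIT_MEANINGS.get(code, "unexpected exit code")


def run_matcher(log_path, catalog_path):
    """Run the CLI and return (exit code, stdout) without raising on 1 or 2."""
    result = subprocess.run(
        ["python3", "known_error_matcher.py",
         "--file", log_path, "--patterns", catalog_path, "--format", "json"],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means findings, not a crash
    )
    return result.returncode, result.stdout
```

With this split, a caller can parse the report when the code is `0` or `1` and stop only when it is `2`.
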
## Example Text Output

```text
Known Error Matcher
===================

Overall status: CRITICAL
Known error pattern matches require operator review; logs alone do not prove root cause.

[CRITICAL] database_unavailable - Database unavailable
Category: application
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Runbook: infra-run/scripts/python/jvm-log-analyzer/README.md
Description: Application logged unavailable database or database connectivity symptoms.
Samples:
- 2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool

Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 9
Known error matches: 7
Matched known error patterns: 7
Critical matched patterns: 5
Warning matched patterns: 2
Top categories: application (3), network (2), application_jvm (2)
Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
Timestamp coverage: parsed=9, unknown=0
Filters used: severity=None, category=None
Pattern catalog path: patterns.json
```

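The `First seen` and `Last seen` fields in the example could be derived with best-effort parsing along these lines. This sketch assumes a single timestamp layout taken from the sample line above; the formats the tool actually recognizes are not documented here, and the function names are hypothetical.

```python
from datetime import datetime

# Assumption: timestamps look like "2026-05-11 10:16:07" at the start of a line.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(line):
    """Return a datetime parsed from the first 19 characters, or None."""
    try:
        return datetime.strptime(line[:19], TIMESTAMP_FORMAT)
    except ValueError:
        return None  # best-effort: unparseable lines are skipped


def seen_range(lines):
    """Return (first_seen, last_seen) as strings, or UNKNOWN when none parse."""
    stamps = [ts for ts in map(parse_timestamp, lines) if ts is not None]
    if not stamps:
        return "UNKNOWN", "UNKNOWN"
    return str(min(stamps)), str(max(stamps))
```
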
## Markdown Workflow

Generate a Markdown report from the collected log and attach it to the incident or problem ticket as supporting evidence:

```bash
python3 known_error_matcher.py \
  --file examples/sample-app.log \
  --patterns patterns.json \
  --format markdown \
  --output known-error-report.md
```

Review the report before attaching it. A `WARNING` or `CRITICAL` result should be correlated with service health, monitoring, recent changes, dependency status, and the referenced runbook.

## Operational Limitations

- Pattern matching is intentionally simple and predictable.
- A single log line can match multiple known error patterns.
- Matching is case-sensitive by default, so differently cased variants of a message are missed unless `--ignore-case` is used.
- Timestamp parsing is best-effort; unparseable timestamps are reported as `UNKNOWN`.
- Counts are raw log-line matches, not request rates, incident duration, or customer impact.
- `--top` limits displayed findings only. The summary still reflects all matched patterns after filters.
- Large log files are read into memory; use scoped extracts for very large incidents.

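Several of these limitations can be seen in a small counting sketch. It is illustrative only: the function name is hypothetical, the two regexes are invented for the example (the catalog's real expressions for `timeout` and `connection_refused` are not shown in this README), and `re.search` semantics are assumed.

```python
import re
from collections import Counter


def count_matches(lines, patterns, ignore_case=False):
    """Count matching lines per pattern id; one line may count for several."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = {pid: re.compile(rx, flags) for pid, rx in patterns.items()}
    counts = Counter()
    for line in lines:
        for pid, rx in compiled.items():
            if rx.search(line):
                counts[pid] += 1
    return counts


patterns = {"timeout": "timed out", "connection_refused": "Connection refused"}
lines = [
    "ERROR Connection refused: request timed out",  # matches both patterns
    "WARN connection refused by upstream",          # lowercase variant
]

counts = count_matches(lines, patterns)
counts_ci = count_matches(lines, patterns, ignore_case=True)
top_one = counts_ci.most_common(1)  # display limit only; counts_ci is unchanged
```

Here the first line is counted once for each pattern, the second line is only counted when case is ignored, and truncating the displayed list leaves the underlying totals untouched.
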
## Safety Notes

- The tool only reads the input log and pattern catalog and optionally writes a separate report.
- The implementation uses the Python standard library only and does not require package installation.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, tokens, or unsanitized production details in portfolio examples.
- Treat operational findings as prompts that require review; the tool does not determine root cause automatically.