Add known error matcher tool

Mateusz Suski
2026-05-11 17:06:46 +00:00
parent 5fc96348c5
commit 1636f46f81
6 changed files with 1096 additions and 0 deletions
@@ -0,0 +1,198 @@
# known-error-matcher
`known-error-matcher` is a read-only Python CLI for scanning local log files against a JSON catalog of known operational error patterns. It connects matched log symptoms with severity, category, sample lines, and runbook references so an infrastructure engineer can decide what needs review next.
The tool matches known operational error patterns that require review. It does not prove an incident, identify root cause automatically, or replace service-specific runbooks.
## Purpose
- Identify which cataloged operational problems are visible in a collected log.
- Count how often each known error pattern appears.
- Surface warning and critical matches conservatively.
- Point operators toward relevant runbooks or supporting local tools.
- Produce predictable text, Markdown, or JSON output for incident notes.
## When To Use
- During incident response when a collected application, system, or journal extract needs quick known-error matching.
- Before attaching log evidence to an incident, problem, or change ticket.
- When teams maintain a small local catalog of operational patterns and runbook links.
- When JSON output is useful for later local automation.
## What It Does Not Do
- It does not read remote systems or live streams.
- It does not modify logs, services, applications, accounts, or host state.
- It does not query ELK, SIEM, APM, Zabbix, ticketing systems, or external services.
- It does not find root cause automatically.
- It does not prove an incident or confirm customer impact.
- It does not classify every vendor-specific log message.
## Pattern Catalog Format
Patterns are defined in JSON because the Python standard library can parse JSON without third-party dependencies.
```json
{
  "patterns": [
    {
      "id": "disk_full",
      "name": "Disk full",
      "severity": "CRITICAL",
      "regex": "No space left on device|disk full",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem or application failed because free space was exhausted."
    }
  ]
}
```
Required fields per pattern:
- `id` - stable non-empty identifier.
- `name` - human-readable finding name.
- `severity` - `WARNING` or `CRITICAL`. The value is upper-cased during validation, so `warning` and `critical` are also accepted.
- `regex` - Python regular expression used for matching.
Optional fields:
- `category` - operational grouping such as `storage`, `network`, `security`, `application`, or `systemd`. Missing values are reported as `UNKNOWN`.
- `runbook` - repository path to review when the pattern matches. Missing values are shown as `None` in text and Markdown reports and as an empty string in JSON output.
- `description` - short operator-facing explanation. Missing values are handled the same way as a missing `runbook`.
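The catalog is validated before scanning starts. Invalid JSON, a missing or empty `patterns` list, missing required fields, duplicate IDs, invalid severity values, and invalid regexes fail with exit code `2`. Per-pattern problems are collected and reported together in a single error message.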
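The validation rules above can be approximated with a short pre-flight check before committing a catalog change. This is a simplified sketch, not the tool's own loader: the function name `check_catalog` is hypothetical, and it only mirrors the rules listed in this section.

```python
import json
import re

VALID_SEVERITIES = {"WARNING", "CRITICAL"}
REQUIRED_FIELDS = ("id", "name", "severity", "regex")


def check_catalog(text: str) -> list[str]:
    """Return a list of problems; an empty list means the catalog passed."""
    try:
        catalog = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(catalog, dict) or not isinstance(catalog.get("patterns"), list):
        return ['a top-level object with a "patterns" list is required']

    problems: list[str] = []
    seen_ids: set[str] = set()
    for index, item in enumerate(catalog["patterns"], start=1):
        if not isinstance(item, dict):
            problems.append(f"pattern #{index}: must be an object")
            continue
        # Every required field must be a non-empty string.
        for field in REQUIRED_FIELDS:
            value = item.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"pattern #{index}: {field} is required")
        pattern_id = item.get("id")
        if isinstance(pattern_id, str) and pattern_id.strip():
            if pattern_id.strip() in seen_ids:
                problems.append(f"pattern #{index}: duplicate id {pattern_id.strip()}")
            seen_ids.add(pattern_id.strip())
        severity = item.get("severity")
        if isinstance(severity, str) and severity.strip():
            if severity.strip().upper() not in VALID_SEVERITIES:
                problems.append(f"pattern #{index}: severity must be WARNING or CRITICAL")
        regex_text = item.get("regex")
        if isinstance(regex_text, str) and regex_text.strip():
            try:
                re.compile(regex_text)
            except re.error as exc:
                problems.append(f"pattern #{index}: invalid regex: {exc}")
    return problems
```

An empty result suggests the catalog would pass the checks described here; the CLI itself remains the authority on what it accepts.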
## Adding A Known Error Pattern
Add a new object under `patterns` in `patterns.json`:
```json
{
  "id": "example_dependency_failure",
  "name": "Example dependency failure",
  "severity": "WARNING",
  "regex": "dependency request failed|upstream dependency unavailable",
  "category": "application",
  "runbook": "infra-run/runbooks/incidents/dependency-failure.md",
  "description": "Application logged a dependency failure that requires review."
}
```
Use a stable `id`, choose the lowest severity that still reflects operational risk, and keep the regex specific enough to avoid noisy generic matches. Prefer a runbook path that already exists; otherwise use a plausible future path under `infra-run/runbooks/incidents/` or leave it empty.
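One way to judge whether a regex is "specific enough" is to try it against lines that should and should not match before adding it to the catalog. The sketch below assumes a hypothetical helper named `evaluate`; the two positive sample lines are invented for illustration and are not part of the shipped examples.

```python
import re

candidate = r"dependency request failed|upstream dependency unavailable"

# Hypothetical lines the new pattern is expected to catch.
should_match = [
    "2026-05-11 10:16:11 app01 checkout-api[1842]: ERROR dependency request failed for inventory-api",
    "2026-05-11 10:16:12 app01 checkout-api[1842]: WARN upstream dependency unavailable",
]
# Lines that must stay quiet; the second one is an invented near miss.
should_not_match = [
    "2026-05-11 10:15:30 app01 checkout-api[1842]: INFO request_id=a1 path=/checkout status=200 duration_ms=42",
    "2026-05-11 10:17:03 app01 checkout-api[1842]: INFO dependency check completed",
]


def evaluate(regex_text: str, positives: list[str], negatives: list[str]) -> tuple[list[str], list[str]]:
    """Return (missed positives, noisy negatives) for a candidate regex."""
    compiled = re.compile(regex_text)
    missed = [line for line in positives if not compiled.search(line)]
    noisy = [line for line in negatives if compiled.search(line)]
    return missed, noisy


missed, noisy = evaluate(candidate, should_match, should_not_match)
print(f"missed={len(missed)} noisy={len(noisy)}")
```

A broader regex such as `dependency` alone would also flag the informational health-check line, which is the kind of noise the guidance above tries to avoid.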
## Severity Model
Overall status is conservative:
- `OK` - no known error patterns matched.
- `WARNING` - one or more warning patterns matched and no critical patterns matched.
- `CRITICAL` - one or more critical patterns matched.
The status means known error patterns require review. It is not a final root-cause statement.
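The roll-up rule can be stated in a few lines. This is a minimal sketch of the rule as described above, using a hypothetical function name; it takes the severities of the patterns that matched and returns the overall status.

```python
def overall_status(matched_severities: list[str]) -> str:
    """Apply the conservative roll-up: any CRITICAL wins, then any WARNING, else OK."""
    if "CRITICAL" in matched_severities:
        return "CRITICAL"
    if "WARNING" in matched_severities:
        return "WARNING"
    return "OK"


print(overall_status([]))                       # no matches
print(overall_status(["WARNING", "WARNING"]))   # warnings only
print(overall_status(["WARNING", "CRITICAL"]))  # one critical is enough
```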
## Category Filtering
Use `--category CATEGORY` to include only matches where the pattern category exactly matches the provided value. The comparison is case-sensitive, so `Storage` does not select patterns cataloged as `storage`.
Examples:
```bash
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json --category storage
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --category application
```
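The exact-match behavior can be illustrated with a small sketch. The helper name `filter_by_category` is hypothetical, and the catalog entries are abridged to the fields that matter here.

```python
from __future__ import annotations

catalog = [
    {"id": "disk_full", "category": "storage"},
    {"id": "timeout", "category": "network"},
    {"id": "uncategorized_example"},  # no category: treated as UNKNOWN
]


def filter_by_category(patterns: list[dict], category: str | None) -> list[dict]:
    """Keep patterns whose category equals the requested value exactly."""
    if category is None:
        return list(patterns)
    return [p for p in patterns if p.get("category", "UNKNOWN") == category]


print([p["id"] for p in filter_by_category(catalog, "storage")])
print([p["id"] for p in filter_by_category(catalog, "Storage")])  # different case: no match
```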
## Usage
```bash
cd infra-run/scripts/python/known-error-matcher
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown --output known-error-report.md
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --ignore-case
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --severity critical
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --top 10
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --max-samples 5
```
## Output Formats
- `text` - default terminal-oriented report.
- `markdown` - incident, problem, or change ticket attachment format.
- `json` - structured output for local automation.
Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file or pattern catalog file.
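For local automation, the JSON report can be consumed with the standard library alone. The sketch below assumes the field names `overall_status`, `findings`, `severity`, and `runbook` from the report dictionary built in `known_error_matcher.py` in this commit; the payload is abridged and hand-written, and the helper name `critical_runbooks` is hypothetical.

```python
import json

# Abridged, hand-written payload; a real report contains more fields.
sample_report = """
{
  "overall_status": "CRITICAL",
  "total_known_error_matches": 2,
  "findings": [
    {"id": "database_unavailable", "severity": "CRITICAL", "occurrences": 1,
     "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md"},
    {"id": "timeout", "severity": "WARNING", "occurrences": 1,
     "runbook": "infra-run/scripts/bash/os-healthcheck/README.md"}
  ]
}
"""


def critical_runbooks(report_text: str) -> list[str]:
    """Collect unique runbook paths referenced by CRITICAL findings, in report order."""
    report = json.loads(report_text)
    runbooks: list[str] = []
    for finding in report.get("findings", []):
        runbook = finding.get("runbook")
        if finding.get("severity") == "CRITICAL" and runbook and runbook not in runbooks:
            runbooks.append(runbook)
    return runbooks


print(critical_runbooks(sample_report))
```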
## Exit Codes
- `0` - OK, no known error matches.
- `1` - Known error matches detected.
- `2` - Invalid input, unreadable or empty file, invalid JSON, invalid pattern catalog, invalid regex, bad argument, output write failure, or runtime error.
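A wrapper script can branch on these exit codes. The following is a sketch under stated assumptions: it expects `known_error_matcher.py` in the current working directory, and the names `classify_exit` and `run_matcher` are hypothetical.

```python
import subprocess
import sys

EXIT_MEANINGS = {0: "ok", 1: "findings", 2: "invalid"}


def classify_exit(returncode: int) -> str:
    """Translate the documented exit codes into short labels."""
    return EXIT_MEANINGS.get(returncode, "unexpected")


def run_matcher(log_path: str, catalog_path: str) -> tuple[str, str]:
    """Run the CLI with JSON output and return (label, stdout)."""
    completed = subprocess.run(
        [
            sys.executable,
            "known_error_matcher.py",  # assumed to be in the working directory
            "--file", log_path,
            "--patterns", catalog_path,
            "--format", "json",
        ],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means findings, not a failure of the wrapper
    )
    return classify_exit(completed.returncode), completed.stdout
```

Passing `check=False` matters here because exit code `1` is a normal result that signals matches, so it should not be raised as an exception.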
## Example Text Output
```text
Known Error Matcher
===================
Overall status: CRITICAL
Known error pattern matches require operator review; logs alone do not prove root cause.
[CRITICAL] database_unavailable - Database unavailable
Category: application
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Runbook: infra-run/scripts/python/jvm-log-analyzer/README.md
Description: Application logged unavailable database or database connectivity symptoms.
Samples:
- 2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 9
Known error matches: 7
Matched known error patterns: 7
Critical matched patterns: 5
Warning matched patterns: 2
Top categories: application (3), network (2), application_jvm (2)
Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
Timestamp coverage: parsed=9, unknown=0
Filters used: severity=None, category=None
Pattern catalog path: patterns.json
```
## Markdown Workflow
Generate a Markdown report from the collected log and attach it to the incident or problem ticket as supporting evidence:
```bash
python3 known_error_matcher.py \
--file examples/sample-app.log \
--patterns patterns.json \
--format markdown \
--output known-error-report.md
```
Review the report before attaching it. A `WARNING` or `CRITICAL` result should be correlated with service health, monitoring, recent changes, dependency status, and the referenced runbook.
## Operational Limitations
- Pattern matching is intentionally simple and predictable.
- A single log line can match multiple known error patterns.
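- Matching is case-sensitive by default and can miss differently cased variants unless `--ignore-case` is used.
- Timestamp parsing is best-effort; unparseable timestamps are reported as `UNKNOWN`, and syslog-style timestamps without a year are assumed to belong to the current year.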
- Counts are raw log-line matches, not request rates, incident duration, or customer impact.
- `--top` limits the displayed findings and the top-category and top-known-error lists. The summary counts and overall status still reflect all matched patterns after filters.
- Large log files are read into memory; use scoped extracts for very large incidents.
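Two of the limitations above, multiple matches per line and case sensitivity, can be seen with a line from `examples/sample-system.log` and two regexes from `patterns.json`. The helper name `matching_ids` is hypothetical; the sketch uses plain `re` calls rather than the tool itself.

```python
import re

line = "May 11 10:15:42 web01 kernel: Out of memory: Killed process 2281 (java) total-vm:2097152kB"

catalog_regexes = {
    "out_of_memory": r"\bout of memory\b|Cannot allocate memory",
    "oom_killer": r"oom-killer|Killed process \d+|Out of memory: Killed process",
}


def matching_ids(text: str, ignore_case: bool) -> list[str]:
    """Return the IDs of every regex that matches the line."""
    flags = re.IGNORECASE if ignore_case else 0
    return [
        pattern_id
        for pattern_id, regex_text in catalog_regexes.items()
        if re.search(regex_text, text, flags)
    ]


print(matching_ids(line, ignore_case=False))  # lowercase "out of memory" misses "Out of memory"
print(matching_ids(line, ignore_case=True))   # both patterns match the same line
```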
## Safety Notes
- The tool only reads the input log and pattern catalog and optionally writes a separate report.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, tokens, or unsanitized production details in portfolio examples.
- Treat matches as prompts for operator review, not automated remediation instructions.
@@ -0,0 +1,9 @@
2026-05-11 10:15:30 app01 checkout-api[1842]: INFO request_id=a1 path=/checkout status=200 duration_ms=42
2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted
2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443
2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms
2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed
2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"
2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space
2026-05-11 10:17:03 app01 checkout-api[1842]: INFO healthcheck completed status=degraded
@@ -0,0 +1,97 @@
# Known Error Matcher Report
- Overall status: `CRITICAL`
- Known error pattern matches require operator review; logs alone do not prove root cause.
## Matched Known Errors
### [CRITICAL] database_unavailable - Database unavailable
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:07`
- Last seen: `2026-05-11 10:16:07`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Application logged unavailable database or database connectivity symptoms.
- Samples:
- `2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool`
### [CRITICAL] http_500 - HTTP 500
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:02`
- Last seen: `2026-05-11 10:16:02`
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
- Description: Application or proxy logged HTTP 500 responses.
- Samples:
- `2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted`
### [CRITICAL] http_503 - HTTP 503
- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:31.456`
- Last seen: `2026-05-11 10:16:31.456`
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
- Description: Application or proxy logged HTTP 503 service unavailable responses.
- Samples:
- `2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"`
### [CRITICAL] java_out_of_memory - Java OutOfMemoryError
- Category: `application_jvm`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:40`
- Last seen: `2026-05-11 10:16:40`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Java process logged memory exhaustion symptoms.
- Samples:
- `2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space`
### [CRITICAL] ssl_handshake_exception - SSLHandshakeException
- Category: `application_jvm`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:22`
- Last seen: `2026-05-11 10:16:22`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Java TLS handshake exception was logged.
- Samples:
- `2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed`
### [WARNING] connection_refused - Connection refused
- Category: `network`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:11`
- Last seen: `2026-05-11 10:16:11`
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
- Description: Client connection attempts were refused by the destination service or host.
- Samples:
- `2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443`
### [WARNING] timeout - Timeout
- Category: `network`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:15,123`
- Last seen: `2026-05-11 10:16:15,123`
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
- Description: Operation timed out and may require network, service, or dependency review.
- Samples:
- `2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms`
## Operational Summary
- Overall status: `CRITICAL`
- Total lines scanned: `9`
- Known error matches: `7`
- Matched known error patterns: `7`
- Critical matched patterns: `5`
- Warning matched patterns: `2`
- Top categories: application (3), network (2), application_jvm (2)
- Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
- Timestamp coverage: parsed=`9`, unknown=`0`
- Filters used: severity=`None`, category=`None`
- Pattern catalog path: `patterns.json`
@@ -0,0 +1,10 @@
May 11 10:15:30 web01 kernel: EXT4-fs warning: No space left on device while writing /var/log/messages
May 11 10:15:35 web01 kernel: EXT4-fs error (device dm-0): Remounting filesystem read-only
May 11 10:15:41 web01 kernel: nginx invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE)
May 11 10:15:42 web01 kernel: Out of memory: Killed process 2281 (java) total-vm:2097152kB
May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
May 11 10:16:12 web01 systemd[1]: Dependency failed for webapp.service - Local web application.
May 11 10:16:13 web01 systemd[1]: nginx.service: Start request repeated too quickly.
May 11 10:16:25 web01 sshd[3371]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.50 user=deploy
May 11 10:16:31 web01 sudo: deploy : command not allowed ; TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/bin/systemctl restart webapp
May 11 10:16:32 web01 sudo: deploy : permission denied while opening /etc/sudoers.d/webapp
@@ -0,0 +1,562 @@
#!/usr/bin/env python3
"""Match local logs against a JSON catalog of known operational error patterns."""
from __future__ import annotations
import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any
EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2
UNKNOWN = "UNKNOWN"
VALID_SEVERITIES = {"WARNING", "CRITICAL"}
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
ISO_TIMESTAMP_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Scan a local log file for known operational error patterns."
    )
    parser.add_argument("--file", required=True, help="Local log file to scan.")
    parser.add_argument("--patterns", required=True, help="JSON known error pattern catalog.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--severity",
        choices=("warning", "critical"),
        help="Only include findings with this severity.",
    )
    parser.add_argument(
        "--category",
        help="Only include findings from this exact category.",
    )
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of matched known errors and summary entries to display. Default: 10.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per matched known error. Default: 3.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Compile catalog regex patterns case-insensitively.",
    )
    return parser
def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number
def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number
def read_text_file(path: Path, label: str) -> str:
    if not path.exists():
        raise OSError(f"{label} does not exist: {path}")
    if not path.is_file():
        raise OSError(f"{label} is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"{label} is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read {label} {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"{label} is empty: {path}")
    return text
def load_pattern_catalog(path: Path, ignore_case: bool) -> list[dict[str, Any]]:
    text = read_text_file(path, "pattern catalog")
    try:
        catalog = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in pattern catalog {path}: {exc}") from exc
    errors: list[str] = []
    if not isinstance(catalog, dict):
        raise ValueError("invalid pattern catalog: top-level JSON value must be an object")
    if "patterns" not in catalog:
        raise ValueError('invalid pattern catalog: missing top-level "patterns" field')
    if not isinstance(catalog["patterns"], list):
        raise ValueError('invalid pattern catalog: "patterns" must be a list')
    seen_ids: set[str] = set()
    compiled_patterns: list[dict[str, Any]] = []
    flags = re.IGNORECASE if ignore_case else 0
    for index, item in enumerate(catalog["patterns"], start=1):
        if not isinstance(item, dict):
            errors.append(f"pattern #{index}: must be an object")
            continue
        pattern_id = normalize_required_text(item, "id")
        name = normalize_required_text(item, "name")
        severity = normalize_required_text(item, "severity").upper()
        regex_text = normalize_required_text(item, "regex")
        if not pattern_id:
            errors.append(f"pattern #{index}: id is required and must be non-empty")
        elif pattern_id in seen_ids:
            errors.append(f"pattern #{index}: duplicate id {pattern_id}")
        else:
            seen_ids.add(pattern_id)
        if not name:
            errors.append(f"pattern {pattern_id or f'#{index}'}: name is required and must be non-empty")
        if severity not in VALID_SEVERITIES:
            errors.append(
                f"pattern {pattern_id or f'#{index}'}: severity must be WARNING or CRITICAL"
            )
        if not regex_text:
            errors.append(f"pattern {pattern_id or f'#{index}'}: regex is required and must be non-empty")
        compiled_regex = None
        if regex_text:
            try:
                compiled_regex = re.compile(regex_text, flags)
            except re.error as exc:
                errors.append(f"pattern {pattern_id or f'#{index}'}: invalid regex: {exc}")
        if pattern_id and name and severity in VALID_SEVERITIES and regex_text and compiled_regex:
            compiled_patterns.append(
                {
                    "id": pattern_id,
                    "name": name,
                    "severity": severity,
                    "regex_text": regex_text,
                    "regex": compiled_regex,
                    "category": normalize_optional_text(item, "category", UNKNOWN),
                    "runbook": normalize_optional_text(item, "runbook", ""),
                    "description": normalize_optional_text(item, "description", ""),
                }
            )
    if errors:
        raise ValueError("invalid pattern catalog:\n- " + "\n- ".join(errors))
    if not compiled_patterns:
        raise ValueError("invalid pattern catalog: no patterns configured")
    return compiled_patterns
def normalize_required_text(item: dict[str, Any], field: str) -> str:
    value = item.get(field)
    if not isinstance(value, str):
        return ""
    return value.strip()
def normalize_optional_text(item: dict[str, Any], field: str, default: str) -> str:
    value = item.get(field, default)
    if not isinstance(value, str):
        return default
    value = value.strip()
    return value if value else default
def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        fraction = iso_match.group(3) or ""
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        parse_value = raw
        fmt = "%Y-%m-%d %H:%M:%S"
        if fraction:
            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
            fmt = "%Y-%m-%d %H:%M:%S.%f"
        try:
            return datetime.strptime(parse_value, fmt), raw + fraction
        except ValueError:
            return None, UNKNOWN
    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        try:
            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw
    return None, UNKNOWN
def severity_filter_matches(selected: str | None, severity: str) -> bool:
    if selected is None:
        return True
    return selected.upper() == severity
def category_filter_matches(selected: str | None, category: str) -> bool:
    if selected is None:
        return True
    return selected == category
def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
    if parsed_at is None:
        return
    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
        group["first_seen"] = (parsed_at, rendered_at)
    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
        group["last_seen"] = (parsed_at, rendered_at)
def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
def append_limited(items: list[str], value: str, limit: int) -> None:
    if limit == 0:
        return
    if value in items:
        return
    if len(items) < limit:
        items.append(value)
def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    severity_filter: str | None,
    category_filter: str | None,
    top: int,
    max_samples: int,
    pattern_catalog_path: Path,
) -> dict[str, Any]:
    syslog_year = datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    top_categories = Counter()
    total_lines_scanned = 0
    parsed_timestamps = 0
    unknown_timestamps = 0
    for line in lines:
        total_lines_scanned += 1
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if parsed_at is None:
            unknown_timestamps += 1
        else:
            parsed_timestamps += 1
        for pattern in patterns:
            if not severity_filter_matches(severity_filter, pattern["severity"]):
                continue
            if not category_filter_matches(category_filter, pattern["category"]):
                continue
            if not pattern["regex"].search(line):
                continue
            group = groups.setdefault(
                pattern["id"],
                {
                    "id": pattern["id"],
                    "name": pattern["name"],
                    "severity": pattern["severity"],
                    "category": pattern["category"],
                    "runbook": pattern["runbook"],
                    "description": pattern["description"],
                    "regex": pattern["regex_text"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1
            update_seen(group, parsed_at, rendered_at)
            append_limited(group["samples"], line, max_samples)
            top_categories[pattern["category"]] += 1
    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["id"],
        ),
    )
    rendered_findings = [
        {
            **finding,
            "first_seen": render_seen(finding["first_seen"]),
            "last_seen": render_seen(finding["last_seen"]),
        }
        for finding in findings
    ]
    critical_patterns = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
    warning_patterns = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
    total_matches = sum(item["occurrences"] for item in rendered_findings)
    overall_status = "OK"
    if critical_patterns > 0:
        overall_status = "CRITICAL"
    elif warning_patterns > 0:
        overall_status = "WARNING"
    return {
        "overall_status": overall_status,
        "total_lines_scanned": total_lines_scanned,
        "total_known_error_matches": total_matches,
        "matched_pattern_count": len(rendered_findings),
        "critical_matched_pattern_count": critical_patterns,
        "warning_matched_pattern_count": warning_patterns,
        "top_categories": [
            {"category": name, "count": count}
            for name, count in top_categories.most_common(top)
        ],
        "top_known_errors": [
            {"id": item["id"], "name": item["name"], "severity": item["severity"], "count": item["occurrences"]}
            for item in rendered_findings[:top]
        ],
        "timestamp_coverage": {
            "parsed_timestamps_count": parsed_timestamps,
            "unknown_timestamps_count": unknown_timestamps,
        },
        "filters_used": {
            "severity": severity_filter.lower() if severity_filter else None,
            "category": category_filter,
        },
        "pattern_catalog_path": str(pattern_catalog_path),
        "findings": rendered_findings[:top],
        "findings_total": len(rendered_findings),
    }
def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
    if not items:
        return "None"
    return ", ".join(f"{item[key]} ({item['count']})" for item in items)
def render_text(report: dict[str, Any]) -> str:
    lines = [
        "Known Error Matcher",
        "===================",
        "",
        f"Overall status: {report['overall_status']}",
        "Known error pattern matches require operator review; logs alone do not prove root cause.",
        "",
    ]
    if report["findings"]:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['id']} - {finding['name']}",
                    f"Category: {finding['category']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    f"Runbook: {finding['runbook'] or 'None'}",
                    f"Description: {finding['description'] or 'None'}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                for sample in finding["samples"]:
                    lines.append(f" - {sample}")
            else:
                lines.append(" - None")
            lines.append("")
    else:
        lines.extend(["No known error patterns matched for the selected filters.", ""])
    lines.extend(render_summary_lines(report, markdown=False))
    return "\n".join(lines)
def render_summary_lines(report: dict[str, Any], markdown: bool) -> list[str]:
    if markdown:
        return [
            "## Operational Summary",
            "",
            f"- Overall status: `{report['overall_status']}`",
            f"- Total lines scanned: `{report['total_lines_scanned']}`",
            f"- Known error matches: `{report['total_known_error_matches']}`",
            f"- Matched known error patterns: `{report['matched_pattern_count']}`",
            f"- Critical matched patterns: `{report['critical_matched_pattern_count']}`",
            f"- Warning matched patterns: `{report['warning_matched_pattern_count']}`",
            "- Top categories: " + render_top_pairs(report["top_categories"], "category"),
            "- Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
            "- Timestamp coverage: "
            f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
            f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
            "- Filters used: "
            f"severity=`{report['filters_used']['severity'] or 'None'}`, "
            f"category=`{report['filters_used']['category'] or 'None'}`",
            f"- Pattern catalog path: `{report['pattern_catalog_path']}`",
        ]
    return [
        "Operational Summary",
        "-------------------",
        f"Overall status: {report['overall_status']}",
        f"Total lines scanned: {report['total_lines_scanned']}",
        f"Known error matches: {report['total_known_error_matches']}",
        f"Matched known error patterns: {report['matched_pattern_count']}",
        f"Critical matched patterns: {report['critical_matched_pattern_count']}",
        f"Warning matched patterns: {report['warning_matched_pattern_count']}",
        "Top categories: " + render_top_pairs(report["top_categories"], "category"),
        "Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
        "Timestamp coverage: "
        f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
        f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
        "Filters used: "
        f"severity={report['filters_used']['severity'] or 'None'}, "
        f"category={report['filters_used']['category'] or 'None'}",
        f"Pattern catalog path: {report['pattern_catalog_path']}",
    ]
def render_markdown(report: dict[str, Any]) -> str:
    lines = [
        "# Known Error Matcher Report",
        "",
        f"- Overall status: `{report['overall_status']}`",
        "- Known error pattern matches require operator review; logs alone do not prove root cause.",
        "",
    ]
    if report["findings"]:
        lines.extend(["## Matched Known Errors", ""])
        for finding in report["findings"]:
            lines.extend(
                [
                    f"### [{finding['severity']}] {finding['id']} - {finding['name']}",
                    "",
                    f"- Category: `{finding['category']}`",
                    f"- Occurrences: `{finding['occurrences']}`",
                    f"- First seen: `{finding['first_seen']}`",
                    f"- Last seen: `{finding['last_seen']}`",
                    f"- Runbook: `{finding['runbook'] or 'None'}`",
                    f"- Description: {finding['description'] or 'None'}",
                    "- Samples:",
                ]
            )
            if finding["samples"]:
                for sample in finding["samples"]:
                    lines.append(f" - `{sample}`")
            else:
                lines.append(" - `None`")
            lines.append("")
    else:
        lines.extend(["## Matched Known Errors", "", "No known error patterns matched for the selected filters.", ""])
    lines.extend(render_summary_lines(report, markdown=True))
    return "\n".join(lines)
def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2)
def write_output(text: str, output_path: str | None, protected_inputs: list[Path]) -> None:
    if output_path is None:
        print(text)
        return
    destination = Path(output_path)
    try:
        destination_resolved = destination.resolve()
        for input_path in protected_inputs:
            if input_path.resolve() == destination_resolved:
                raise OSError("output path must not overwrite an input file")
    except FileNotFoundError as exc:
        raise OSError(f"unable to resolve output path {destination}: {exc}") from exc
    try:
        destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write report to {destination}: {exc}") from exc
def determine_exit_code(report: dict[str, Any]) -> int:
    if report["total_known_error_matches"] > 0:
        return EXIT_FINDINGS
    return EXIT_OK
def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    try:
        log_path = Path(args.file)
        pattern_path = Path(args.patterns)
        log_text = read_text_file(log_path, "log file")
        lines = log_text.splitlines()
        patterns = load_pattern_catalog(pattern_path, args.ignore_case)
        severity_filter = args.severity.upper() if args.severity else None
        report = analyze_log(
            lines=lines,
            patterns=patterns,
            severity_filter=severity_filter,
            category_filter=args.category,
            top=args.top,
            max_samples=args.max_samples,
            pattern_catalog_path=pattern_path,
        )
        if args.format == "text":
            rendered = render_text(report)
        elif args.format == "markdown":
            rendered = render_markdown(report)
        else:
            rendered = render_json(report)
        write_output(rendered, args.output, [log_path, pattern_path])
        return determine_exit_code(report)
    except (OSError, ValueError) as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except Exception as exc:  # pragma: no cover - defensive operational fallback
        print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr)
        return EXIT_INVALID
if __name__ == "__main__":
    sys.exit(main())
@@ -0,0 +1,220 @@
{
"patterns": [
{
"id": "disk_full",
"name": "Disk full",
"severity": "CRITICAL",
"regex": "No space left on device|disk full|filesystem full",
"category": "storage",
"runbook": "infra-run/scripts/bash/disk-full/README.md",
"description": "Filesystem or application failed because free space was exhausted."
},
{
"id": "inode_exhaustion",
"name": "Inode exhaustion",
"severity": "CRITICAL",
"regex": "No space left on device.*inode|inode.*exhaust|free inodes.*0",
"category": "storage",
"runbook": "infra-run/scripts/bash/disk-full/README.md",
"description": "Filesystem may have free blocks but too few available inodes."
},
{
"id": "read_only_filesystem",
"name": "Read-only filesystem",
"severity": "CRITICAL",
"regex": "read-only file system|read-only filesystem|Remounting filesystem read-only",
"category": "storage",
"runbook": "infra-run/runbooks/incidents/read-only-filesystem.md",
"description": "Filesystem writes failed because the mount was read-only or remounted read-only."
},
{
"id": "io_error",
"name": "I/O error",
"severity": "CRITICAL",
"regex": "\\bI/O error\\b|Buffer I/O error|blk_update_request.*I/O error",
"category": "storage",
"runbook": "infra-run/runbooks/incidents/storage-io-error.md",
"description": "Kernel or application reported storage I/O errors that require device and filesystem review."
},
{
"id": "out_of_memory",
"name": "Out of memory",
"severity": "CRITICAL",
"regex": "\\bout of memory\\b|Cannot allocate memory",
"category": "memory",
"runbook": "infra-run/runbooks/incidents/memory-pressure.md",
"description": "Process or host reported memory exhaustion symptoms."
},
{
"id": "oom_killer",
"name": "OOM killer invoked",
"severity": "CRITICAL",
"regex": "oom-killer|Killed process \\d+|Out of memory: Killed process",
"category": "memory",
"runbook": "infra-run/runbooks/incidents/oom-killer.md",
"description": "Kernel OOM killer activity was logged and affected processes should be reviewed."
},
{
"id": "segmentation_fault",
"name": "Segmentation fault",
"severity": "CRITICAL",
"regex": "segmentation fault|segfault",
"category": "process",
"runbook": "infra-run/runbooks/incidents/process-crash.md",
"description": "A process crash pattern was logged."
},
{
"id": "connection_refused",
"name": "Connection refused",
"severity": "WARNING",
"regex": "connection refused|ConnectException: Connection refused",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Client connection attempts were refused by the destination service or host."
},
{
"id": "connection_reset",
"name": "Connection reset",
"severity": "WARNING",
"regex": "connection reset|Connection reset by peer",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Established network connections were reset and require endpoint review."
},
{
"id": "timeout",
"name": "Timeout",
"severity": "WARNING",
"regex": "\\btimeout\\b|timed out|TimeoutException|SocketTimeoutException",
"category": "network",
"runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
"description": "Operation timed out and may require network, service, or dependency review."
},
{
"id": "dns_resolution_failure",
"name": "DNS resolution failure",
"severity": "WARNING",
"regex": "Temporary failure in name resolution|Name or service not known|NXDOMAIN|UnknownHostException|could not resolve host",
"category": "network",
"runbook": "infra-run/runbooks/incidents/dns-resolution.md",
"description": "Name resolution failed for a host or service dependency."
},
{
"id": "certificate_expired",
"name": "Certificate expired",
"severity": "CRITICAL",
"regex": "certificate expired|CertificateExpiredException|certificate has expired",
"category": "tls",
"runbook": "infra-run/runbooks/incidents/certificate-expired.md",
"description": "TLS certificate expiry was logged and certificate state should be reviewed."
},
{
"id": "tls_handshake_failed",
"name": "TLS handshake failed",
"severity": "WARNING",
"regex": "TLS handshake failed|SSL handshake failed|handshake_failure",
"category": "tls",
"runbook": "infra-run/runbooks/incidents/tls-handshake.md",
"description": "TLS handshake failed and may require certificate, protocol, or trust-store review."
},
{
"id": "authentication_failure",
"name": "Authentication failure",
"severity": "WARNING",
"regex": "authentication failure|Failed password|authentication failed",
"category": "security",
"runbook": "infra-run/scripts/python/auth-log-audit/README.md",
"description": "Authentication failures were logged and may require access review."
},
{
"id": "permission_denied",
"name": "Permission denied",
"severity": "WARNING",
"regex": "permission denied|access denied|denied by policy",
"category": "security",
"runbook": "infra-run/runbooks/incidents/permission-denied.md",
"description": "Access or permission denial was logged."
},
{
"id": "invalid_user",
"name": "Invalid user",
"severity": "WARNING",
"regex": "Invalid user|invalid user|user unknown|User not known",
"category": "security",
"runbook": "infra-run/scripts/python/auth-log-audit/README.md",
"description": "Log contains attempts involving invalid or unknown users."
},
{
"id": "java_out_of_memory",
"name": "Java OutOfMemoryError",
"severity": "CRITICAL",
"regex": "OutOfMemoryError|Java heap space|GC overhead limit exceeded",
"category": "application_jvm",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Java process logged memory exhaustion symptoms."
},
{
"id": "ssl_handshake_exception",
"name": "SSLHandshakeException",
"severity": "CRITICAL",
"regex": "SSLHandshakeException",
"category": "application_jvm",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Java TLS handshake exception was logged."
},
{
"id": "database_unavailable",
"name": "Database unavailable",
"severity": "CRITICAL",
"regex": "database unavailable|database is unavailable|SQLRecoverableException|CommunicationsException|connection pool exhausted",
"category": "application",
"runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
"description": "Application logged unavailable database or database connectivity symptoms."
},
{
"id": "http_500",
"name": "HTTP 500",
"severity": "CRITICAL",
"regex": "\\bHTTP\\s+500\\b|\\bstatus=500\\b|\\s500\\s",
"category": "application",
"runbook": "infra-run/runbooks/incidents/http-5xx.md",
"description": "Application or proxy logged HTTP 500 responses."
},
{
"id": "http_503",
"name": "HTTP 503",
"severity": "CRITICAL",
"regex": "\\bHTTP\\s+503\\b|\\bstatus=503\\b|\\s503\\s|Service Unavailable",
"category": "application",
"runbook": "infra-run/runbooks/incidents/http-5xx.md",
"description": "Application or proxy logged HTTP 503 service unavailable responses."
},
{
"id": "service_failed",
"name": "Systemd service failed",
"severity": "CRITICAL",
"regex": "Failed to start .*\\.service|entered failed state|Unit .*\\.service failed|Main process exited.*status=",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd logged a failed service or failed service start."
},
{
"id": "dependency_failed",
"name": "Systemd dependency failed",
"severity": "CRITICAL",
"regex": "Dependency failed for|dependency failed",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd logged a unit dependency failure."
},
{
"id": "start_request_repeated",
"name": "Start request repeated too quickly",
"severity": "WARNING",
"regex": "Start request repeated too quickly|start request repeated too quickly",
"category": "systemd",
"runbook": "infra-run/scripts/python/journal-analyzer/README.md",
"description": "Systemd throttled service restarts after repeated start failures."
}
]
}