diff --git a/infra-run/scripts/python/known-error-matcher/README.md b/infra-run/scripts/python/known-error-matcher/README.md
new file mode 100644
index 0000000..13df8aa
--- /dev/null
+++ b/infra-run/scripts/python/known-error-matcher/README.md
@@ -0,0 +1,198 @@
+# known-error-matcher
+
+`known-error-matcher` is a read-only Python CLI for scanning local log files against a JSON catalog of known operational error patterns. It connects matched log symptoms with severity, category, sample lines, and runbook references so an infrastructure engineer can decide what needs review next.
+
+The tool matches known operational error patterns that require review. It does not prove an incident, identify root cause automatically, or replace service-specific runbooks.
+
+## Purpose
+
+- Identify which cataloged operational problems are visible in a collected log.
+- Count how often each known error pattern appears.
+- Surface warning and critical matches conservatively.
+- Point operators toward relevant runbooks or supporting local tools.
+- Produce predictable text, Markdown, or JSON output for incident notes.
+
+## When To Use
+
+- During incident response when a collected application, system, or journal extract needs quick known-error matching.
+- Before attaching log evidence to an incident, problem, or change ticket.
+- When teams maintain a small local catalog of operational patterns and runbook links.
+- When JSON output is useful for later local automation.
+
+## What It Does Not Do
+
+- It does not read remote systems or live streams.
+- It does not modify logs, services, applications, accounts, or host state.
+- It does not query ELK, SIEM, APM, Zabbix, ticketing systems, or external services.
+- It does not find root cause automatically.
+- It does not prove an incident or confirm customer impact.
+- It does not classify every vendor-specific log message.
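The Purpose list above (count matches per known pattern, then report a conservative overall status) can be illustrated with a minimal sketch. This is not the tool's implementation: the two-entry catalog and the helper names `count_matches` and `overall_status` are hypothetical and exist only for this illustration.

```python
import re
from collections import Counter

# Hypothetical two-entry catalog, for illustration only.
CATALOG = [
    {"id": "disk_full", "severity": "CRITICAL", "regex": r"No space left on device|disk full"},
    {"id": "connection_refused", "severity": "WARNING", "regex": r"connection refused"},
]


def count_matches(lines, catalog):
    """Count, per pattern id, how many lines match that pattern's regex."""
    compiled = [(entry["id"], re.compile(entry["regex"])) for entry in catalog]
    counts = Counter()
    for line in lines:
        for pattern_id, regex in compiled:
            if regex.search(line):
                counts[pattern_id] += 1
    return counts


def overall_status(counts, catalog):
    """Return CRITICAL if any critical pattern matched, else WARNING, else OK."""
    matched = {entry["severity"] for entry in catalog if counts[entry["id"]] > 0}
    if "CRITICAL" in matched:
        return "CRITICAL"
    if "WARNING" in matched:
        return "WARNING"
    return "OK"


sample_lines = [
    "kernel: EXT4-fs warning: No space left on device",
    "checkout-api: WARN upstream connection refused",
    "checkout-api: INFO request completed",
]
counts = count_matches(sample_lines, CATALOG)
print(dict(counts), overall_status(counts, CATALOG))
```

A single critical match outranks any number of warning matches, which is one way to read "conservative"; a status derived this way still only says that known patterns need review.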
+
+## Pattern Catalog Format
+
+Patterns are defined in JSON because the Python standard library can parse JSON without third-party dependencies.
+
+```json
+{
+  "patterns": [
+    {
+      "id": "disk_full",
+      "name": "Disk full",
+      "severity": "CRITICAL",
+      "regex": "No space left on device|disk full",
+      "category": "storage",
+      "runbook": "infra-run/scripts/bash/disk-full/README.md",
+      "description": "Filesystem or application failed because free space was exhausted."
+    }
+  ]
+}
+```
+
+Required fields per pattern:
+
+- `id` - stable non-empty identifier.
+- `name` - human-readable finding name.
+- `severity` - `WARNING` or `CRITICAL`.
+- `regex` - Python regular expression used for matching.
+
+Optional fields:
+
+- `category` - operational grouping such as `storage`, `network`, `security`, `application`, or `systemd`. Missing values are reported as `UNKNOWN`.
+- `runbook` - repository path to review when the pattern matches. Missing values are reported as `None`.
+- `description` - short operator-facing explanation. Missing values are reported as `None`.
+
+The catalog is validated before scanning starts. Invalid JSON, missing required fields, duplicate IDs, invalid severity values, and invalid regexes fail with exit code `2`.
+
+## Adding A Known Error Pattern
+
+Add a new object under `patterns` in `patterns.json`:
+
+```json
+{
+  "id": "example_dependency_failure",
+  "name": "Example dependency failure",
+  "severity": "WARNING",
+  "regex": "dependency request failed|upstream dependency unavailable",
+  "category": "application",
+  "runbook": "infra-run/runbooks/incidents/dependency-failure.md",
+  "description": "Application logged a dependency failure that requires review."
+}
+```
+
+Use a stable `id`, choose the lowest severity that still reflects operational risk, and keep the regex specific enough to avoid noisy generic matches.
Prefer a runbook path that already exists; otherwise use a plausible future path under `infra-run/runbooks/incidents/` or leave it empty.
+
+## Severity Model
+
+Overall status is conservative:
+
+- `OK` - no known error patterns matched.
+- `WARNING` - one or more warning patterns matched and no critical patterns matched.
+- `CRITICAL` - one or more critical patterns matched.
+
+The status means known error patterns require review. It is not a final root-cause statement.
+
+## Category Filtering
+
+Use `--category CATEGORY` to include only matches where the pattern category exactly matches the provided value.
+
+Examples:
+
+```bash
+python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json --category storage
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --category application
+```
+
+## Usage
+
+```bash
+cd infra-run/scripts/python/known-error-matcher
+
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json
+python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown --output known-error-report.md
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format json
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --ignore-case
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --severity critical
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --top 10
+python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --max-samples 5
+```
+
+## Output Formats
+
+- `text` - default terminal-oriented report.
+- `markdown` - incident, problem, or change ticket attachment format.
+- `json` - structured output for local automation.
+
+Use `--output PATH` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file or pattern catalog file.
+
+## Exit Codes
+
+- `0` - OK, no known error matches.
+- `1` - Known error matches detected.
+- `2` - Invalid input, unreadable file, invalid JSON, invalid pattern catalog, invalid regex, bad argument, output write failure, or runtime error.
+
+## Example Text Output
+
+```text
+Known Error Matcher
+===================
+
+Overall status: CRITICAL
+Known error pattern matches require operator review; logs alone do not prove root cause.
+
+[CRITICAL] database_unavailable - Database unavailable
+Category: application
+Occurrences: 1
+First seen: 2026-05-11 10:16:07
+Last seen: 2026-05-11 10:16:07
+Runbook: infra-run/scripts/python/jvm-log-analyzer/README.md
+Description: Application logged unavailable database or database connectivity symptoms.
+Samples:
+  - 2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
+
+Operational Summary
+-------------------
+Overall status: CRITICAL
+Total lines scanned: 9
+Known error matches: 7
+Matched known error patterns: 7
+Critical matched patterns: 5
+Warning matched patterns: 2
+Top categories: application (3), network (2), application_jvm (2)
+Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
+Timestamp coverage: parsed=9, unknown=0
+Filters used: severity=None, category=None
+Pattern catalog path: patterns.json
+```
+
+## Markdown Workflow
+
+Generate a Markdown report from the collected log and attach it to the incident or problem ticket as supporting evidence:
+
+```bash
+python3 known_error_matcher.py \
+  --file examples/sample-app.log \
+  --patterns patterns.json \
+  --format markdown \
+  --output known-error-report.md
+```
+
+Review the report before attaching it. A `WARNING` or `CRITICAL` result should be correlated with service health, monitoring, recent changes, dependency status, and the referenced runbook.
+
+## Operational Limitations
+
+- Pattern matching is intentionally simple and predictable.
+- A single log line can match multiple known error patterns.
+- Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
+- Timestamp parsing is best-effort; unparseable timestamps are reported as `UNKNOWN`.
+- Counts are raw log-line matches, not request rates, incident duration, or customer impact.
+- `--top` limits displayed findings only. The summary still reflects all matched patterns after filters.
+- Large log files are read into memory; use scoped extracts for very large incidents.
+
+## Safety Notes
+
+- The tool only reads the input log and pattern catalog and optionally writes a separate report.
+- It does not require elevated privileges unless the chosen log path requires them.
+- Do not include secrets, private hostnames, customer identifiers, tokens, or unsanitized production details in portfolio examples.
+- Treat matches as prompts for operator review, not automated remediation instructions.
diff --git a/infra-run/scripts/python/known-error-matcher/examples/sample-app.log b/infra-run/scripts/python/known-error-matcher/examples/sample-app.log
new file mode 100644
index 0000000..9c509b2
--- /dev/null
+++ b/infra-run/scripts/python/known-error-matcher/examples/sample-app.log
@@ -0,0 +1,9 @@
+2026-05-11 10:15:30 app01 checkout-api[1842]: INFO request_id=a1 path=/checkout status=200 duration_ms=42
+2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted
+2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
+2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443
+2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms
+2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed
+2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"
+2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space
+2026-05-11 10:17:03 app01 checkout-api[1842]: INFO healthcheck completed status=degraded
diff --git a/infra-run/scripts/python/known-error-matcher/examples/sample-known-error-report.md b/infra-run/scripts/python/known-error-matcher/examples/sample-known-error-report.md
new file mode 100644
index 0000000..3e28f0f
--- /dev/null
+++ b/infra-run/scripts/python/known-error-matcher/examples/sample-known-error-report.md
@@ -0,0 +1,97 @@
+# Known Error Matcher Report
+
+- Overall status: `CRITICAL`
+- Known
error pattern matches require operator review; logs alone do not prove root cause.
+
+## Matched Known Errors
+
+### [CRITICAL] database_unavailable - Database unavailable
+
+- Category: `application`
+- Occurrences: `1`
+- First seen: `2026-05-11 10:16:07`
+- Last seen: `2026-05-11 10:16:07`
+- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
+- Description: Application logged unavailable database or database connectivity symptoms.
+- Samples:
+  - `2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool`
+
+### [CRITICAL] http_500 - HTTP 500
+
+- Category: `application`
+- Occurrences: `1`
+- First seen: `2026-05-11 10:16:02`
+- Last seen: `2026-05-11 10:16:02`
+- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
+- Description: Application or proxy logged HTTP 500 responses.
+- Samples:
+  - `2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted`
+
+### [CRITICAL] http_503 - HTTP 503
+
+- Category: `application`
+- Occurrences: `1`
+- First seen: `2026-05-11 10:16:31.456`
+- Last seen: `2026-05-11 10:16:31.456`
+- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
+- Description: Application or proxy logged HTTP 503 service unavailable responses.
+- Samples:
+  - `2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"`
+
+### [CRITICAL] java_out_of_memory - Java OutOfMemoryError
+
+- Category: `application_jvm`
+- Occurrences: `1`
+- First seen: `2026-05-11 10:16:40`
+- Last seen: `2026-05-11 10:16:40`
+- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
+- Description: Java process logged memory exhaustion symptoms.
+- Samples:
+  - `2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space`
+
+### [CRITICAL] ssl_handshake_exception - SSLHandshakeException
+
+- Category: `application_jvm`
+- Occurrences: `1`
+- First seen: `2026-05-11 10:16:22`
+- Last seen: `2026-05-11 10:16:22`
+- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
+- Description: Java TLS handshake exception was logged.
+- Samples:
+  - `2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed`
+
+### [WARNING] connection_refused - Connection refused
+
+- Category: `network`
+- Occurrences: `1`
+- First seen: `2026-05-11 10:16:11`
+- Last seen: `2026-05-11 10:16:11`
+- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
+- Description: Client connection attempts were refused by the destination service or host.
+- Samples:
+  - `2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443`
+
+### [WARNING] timeout - Timeout
+
+- Category: `network`
+- Occurrences: `1`
+- First seen: `2026-05-11 10:16:15,123`
+- Last seen: `2026-05-11 10:16:15,123`
+- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
+- Description: Operation timed out and may require network, service, or dependency review.
+- Samples:
+  - `2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms`
+
+## Operational Summary
+
+- Overall status: `CRITICAL`
+- Total lines scanned: `9`
+- Known error matches: `7`
+- Matched known error patterns: `7`
+- Critical matched patterns: `5`
+- Warning matched patterns: `2`
+- Top categories: application (3), network (2), application_jvm (2)
+- Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
+- Timestamp coverage: parsed=`9`, unknown=`0`
+- Filters used: severity=`None`, category=`None`
+- Pattern catalog path: `patterns.json`
diff --git a/infra-run/scripts/python/known-error-matcher/examples/sample-system.log b/infra-run/scripts/python/known-error-matcher/examples/sample-system.log
new file mode 100644
index 0000000..192c99e
--- /dev/null
+++ b/infra-run/scripts/python/known-error-matcher/examples/sample-system.log
@@ -0,0 +1,10 @@
+May 11 10:15:30 web01 kernel: EXT4-fs warning: No space left on device while writing /var/log/messages
+May 11 10:15:35 web01 kernel: EXT4-fs error (device dm-0): Remounting filesystem read-only
+May 11 10:15:41 web01 kernel: nginx invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE)
+May 11 10:15:42 web01 kernel: Out of memory: Killed process 2281 (java) total-vm:2097152kB
+May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
+May 11 10:16:12 web01 systemd[1]: Dependency failed for webapp.service - Local web application.
+May 11 10:16:13 web01 systemd[1]: nginx.service: Start request repeated too quickly.
+May 11 10:16:25 web01 sshd[3371]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.50 user=deploy
+May 11 10:16:31 web01 sudo: deploy : command not allowed ; TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/bin/systemctl restart webapp
+May 11 10:16:32 web01 sudo: deploy : permission denied while opening /etc/sudoers.d/webapp
diff --git a/infra-run/scripts/python/known-error-matcher/known_error_matcher.py b/infra-run/scripts/python/known-error-matcher/known_error_matcher.py
new file mode 100644
index 0000000..d382361
--- /dev/null
+++ b/infra-run/scripts/python/known-error-matcher/known_error_matcher.py
@@ -0,0 +1,562 @@
+#!/usr/bin/env python3
+"""Match local logs against a JSON catalog of known operational error patterns."""
+
+from __future__ import annotations
+
+import argparse
+import json
+import re
+import sys
+from collections import Counter
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+
+
+EXIT_OK = 0
+EXIT_FINDINGS = 1
+EXIT_INVALID = 2
+
+UNKNOWN = "UNKNOWN"
+VALID_SEVERITIES = {"WARNING", "CRITICAL"}
+SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
+
+ISO_TIMESTAMP_RE = re.compile(
+    r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
+)
+SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
+
+
+def build_parser() -> argparse.ArgumentParser:
+    parser = argparse.ArgumentParser(
+        description="Scan a local log file for known operational error patterns."
+    )
+    parser.add_argument("--file", required=True, help="Local log file to scan.")
+    parser.add_argument("--patterns", required=True, help="JSON known error pattern catalog.")
+    parser.add_argument(
+        "--format",
+        choices=("text", "markdown", "json"),
+        default="text",
+        help="Report format.
Default: text.",
+    )
+    parser.add_argument("--output", help="Write report to this path instead of stdout.")
+    parser.add_argument(
+        "--severity",
+        choices=("warning", "critical"),
+        help="Only include findings with this severity.",
+    )
+    parser.add_argument(
+        "--category",
+        help="Only include findings from this exact category.",
+    )
+    parser.add_argument(
+        "--top",
+        type=positive_int,
+        default=10,
+        help="Number of matched known errors and summary entries to display. Default: 10.",
+    )
+    parser.add_argument(
+        "--max-samples",
+        type=non_negative_int,
+        default=3,
+        help="Maximum sample lines per matched known error. Default: 3.",
+    )
+    parser.add_argument(
+        "--ignore-case",
+        action="store_true",
+        help="Compile catalog regex patterns case-insensitively.",
+    )
+    return parser
+
+
+def positive_int(value: str) -> int:
+    try:
+        number = int(value)
+    except ValueError as exc:
+        raise argparse.ArgumentTypeError("must be a positive integer") from exc
+    if number <= 0:
+        raise argparse.ArgumentTypeError("must be a positive integer")
+    return number
+
+
+def non_negative_int(value: str) -> int:
+    try:
+        number = int(value)
+    except ValueError as exc:
+        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
+    if number < 0:
+        raise argparse.ArgumentTypeError("must be zero or a positive integer")
+    return number
+
+
+def read_text_file(path: Path, label: str) -> str:
+    if not path.exists():
+        raise OSError(f"{label} does not exist: {path}")
+    if not path.is_file():
+        raise OSError(f"{label} is not a regular file: {path}")
+    try:
+        text = path.read_text(encoding="utf-8", errors="replace")
+    except PermissionError as exc:
+        raise OSError(f"{label} is not readable: {path}") from exc
+    except OSError as exc:
+        raise OSError(f"unable to read {label} {path}: {exc}") from exc
+    if text == "":
+        raise ValueError(f"{label} is empty: {path}")
+    return text
+
+
+def load_pattern_catalog(path: Path, ignore_case: bool) -> list[dict[str, Any]]:
+    text =
read_text_file(path, "pattern catalog")
+    try:
+        catalog = json.loads(text)
+    except json.JSONDecodeError as exc:
+        raise ValueError(f"invalid JSON in pattern catalog {path}: {exc}") from exc
+
+    errors: list[str] = []
+    if not isinstance(catalog, dict):
+        raise ValueError("invalid pattern catalog: top-level JSON value must be an object")
+    if "patterns" not in catalog:
+        raise ValueError('invalid pattern catalog: missing top-level "patterns" field')
+    if not isinstance(catalog["patterns"], list):
+        raise ValueError('invalid pattern catalog: "patterns" must be a list')
+
+    seen_ids: set[str] = set()
+    compiled_patterns: list[dict[str, Any]] = []
+    flags = re.IGNORECASE if ignore_case else 0
+
+    for index, item in enumerate(catalog["patterns"], start=1):
+        if not isinstance(item, dict):
+            errors.append(f"pattern #{index}: must be an object")
+            continue
+
+        pattern_id = normalize_required_text(item, "id")
+        name = normalize_required_text(item, "name")
+        severity = normalize_required_text(item, "severity").upper()
+        regex_text = normalize_required_text(item, "regex")
+
+        if not pattern_id:
+            errors.append(f"pattern #{index}: id is required and must be non-empty")
+        elif pattern_id in seen_ids:
+            errors.append(f"pattern #{index}: duplicate id {pattern_id}")
+        else:
+            seen_ids.add(pattern_id)
+
+        if not name:
+            errors.append(f"pattern {pattern_id or f'#{index}'}: name is required and must be non-empty")
+        if severity not in VALID_SEVERITIES:
+            errors.append(
+                f"pattern {pattern_id or f'#{index}'}: severity must be WARNING or CRITICAL"
+            )
+        if not regex_text:
+            errors.append(f"pattern {pattern_id or f'#{index}'}: regex is required and must be non-empty")
+
+        compiled_regex = None
+        if regex_text:
+            try:
+                compiled_regex = re.compile(regex_text, flags)
+            except re.error as exc:
+                errors.append(f"pattern {pattern_id or f'#{index}'}: invalid regex: {exc}")
+
+        if pattern_id and name and severity in VALID_SEVERITIES and regex_text and compiled_regex:
+            
compiled_patterns.append(
+                {
+                    "id": pattern_id,
+                    "name": name,
+                    "severity": severity,
+                    "regex_text": regex_text,
+                    "regex": compiled_regex,
+                    "category": normalize_optional_text(item, "category", UNKNOWN),
+                    "runbook": normalize_optional_text(item, "runbook", ""),
+                    "description": normalize_optional_text(item, "description", ""),
+                }
+            )
+
+    if errors:
+        raise ValueError("invalid pattern catalog:\n- " + "\n- ".join(errors))
+    if not compiled_patterns:
+        raise ValueError("invalid pattern catalog: no patterns configured")
+    return compiled_patterns
+
+
+def normalize_required_text(item: dict[str, Any], field: str) -> str:
+    value = item.get(field)
+    if not isinstance(value, str):
+        return ""
+    return value.strip()
+
+
+def normalize_optional_text(item: dict[str, Any], field: str, default: str) -> str:
+    value = item.get(field, default)
+    if not isinstance(value, str):
+        return default
+    value = value.strip()
+    return value if value else default
+
+
+def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
+    iso_match = ISO_TIMESTAMP_RE.search(line)
+    if iso_match:
+        fraction = iso_match.group(3) or ""
+        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
+        parse_value = raw
+        fmt = "%Y-%m-%d %H:%M:%S"
+        if fraction:
+            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
+            fmt = "%Y-%m-%d %H:%M:%S.%f"
+        try:
+            return datetime.strptime(parse_value, fmt), raw + fraction
+        except ValueError:
+            return None, UNKNOWN
+
+    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
+    if syslog_match:
+        raw = syslog_match.group(1)
+        try:
+            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
+        except ValueError:
+            return None, UNKNOWN
+        return parsed, raw
+
+    return None, UNKNOWN
+
+
+def severity_filter_matches(selected: str | None, severity: str) -> bool:
+    if selected is None:
+        return True
+    return selected.upper() == severity
+
+
+def category_filter_matches(selected: str | None, category: str) -> bool:
+    if selected is None:
+        
return True
+    return selected == category
+
+
+def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
+    if parsed_at is None:
+        return
+    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
+        group["first_seen"] = (parsed_at, rendered_at)
+    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
+        group["last_seen"] = (parsed_at, rendered_at)
+
+
+def render_seen(value: tuple[datetime, str] | None) -> str:
+    if value is None:
+        return UNKNOWN
+    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
+
+
+def append_limited(items: list[str], value: str, limit: int) -> None:
+    if limit == 0:
+        return
+    if value in items:
+        return
+    if len(items) < limit:
+        items.append(value)
+
+
+def analyze_log(
+    lines: list[str],
+    patterns: list[dict[str, Any]],
+    severity_filter: str | None,
+    category_filter: str | None,
+    top: int,
+    max_samples: int,
+    pattern_catalog_path: Path,
+) -> dict[str, Any]:
+    syslog_year = datetime.now().year
+    groups: dict[str, dict[str, Any]] = {}
+    top_categories = Counter()
+    total_lines_scanned = 0
+    parsed_timestamps = 0
+    unknown_timestamps = 0
+
+    for line in lines:
+        total_lines_scanned += 1
+        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
+        if parsed_at is None:
+            unknown_timestamps += 1
+        else:
+            parsed_timestamps += 1
+
+        for pattern in patterns:
+            if not severity_filter_matches(severity_filter, pattern["severity"]):
+                continue
+            if not category_filter_matches(category_filter, pattern["category"]):
+                continue
+            if not pattern["regex"].search(line):
+                continue
+
+            group = groups.setdefault(
+                pattern["id"],
+                {
+                    "id": pattern["id"],
+                    "name": pattern["name"],
+                    "severity": pattern["severity"],
+                    "category": pattern["category"],
+                    "runbook": pattern["runbook"],
+                    "description": pattern["description"],
+                    "regex": pattern["regex_text"],
+                    "occurrences": 0,
+                    "first_seen": None,
+                    "last_seen": None,
+                    "samples": [],
+                },
+            )
+            group["occurrences"] += 1
+            update_seen(group, parsed_at, rendered_at)
+            append_limited(group["samples"], line, max_samples)
+            top_categories[pattern["category"]] += 1
+
+    findings = sorted(
+        groups.values(),
+        key=lambda item: (
+            SEVERITY_ORDER[item["severity"]],
+            -item["occurrences"],
+            item["id"],
+        ),
+    )
+
+    rendered_findings = [
+        {
+            **finding,
+            "first_seen": render_seen(finding["first_seen"]),
+            "last_seen": render_seen(finding["last_seen"]),
+        }
+        for finding in findings
+    ]
+
+    critical_patterns = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
+    warning_patterns = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
+    total_matches = sum(item["occurrences"] for item in rendered_findings)
+
+    overall_status = "OK"
+    if critical_patterns > 0:
+        overall_status = "CRITICAL"
+    elif warning_patterns > 0:
+        overall_status = "WARNING"
+
+    return {
+        "overall_status": overall_status,
+        "total_lines_scanned": total_lines_scanned,
+        "total_known_error_matches": total_matches,
+        "matched_pattern_count": len(rendered_findings),
+        "critical_matched_pattern_count": critical_patterns,
+        "warning_matched_pattern_count": warning_patterns,
+        "top_categories": [
+            {"category": name, "count": count}
+            for name, count in top_categories.most_common(top)
+        ],
+        "top_known_errors": [
+            {"id": item["id"], "name": item["name"], "severity": item["severity"], "count": item["occurrences"]}
+            for item in rendered_findings[:top]
+        ],
+        "timestamp_coverage": {
+            "parsed_timestamps_count": parsed_timestamps,
+            "unknown_timestamps_count": unknown_timestamps,
+        },
+        "filters_used": {
+            "severity": severity_filter.lower() if severity_filter else None,
+            "category": category_filter,
+        },
+        "pattern_catalog_path": str(pattern_catalog_path),
+        "findings": rendered_findings[:top],
+        "findings_total": len(rendered_findings),
+    }
+
+
+def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
+    if not items:
+        return "None"
+    return ", ".join(f"{item[key]}
({item['count']})" for item in items)
+
+
+def render_text(report: dict[str, Any]) -> str:
+    lines = [
+        "Known Error Matcher",
+        "===================",
+        "",
+        f"Overall status: {report['overall_status']}",
+        "Known error pattern matches require operator review; logs alone do not prove root cause.",
+        "",
+    ]
+
+    if report["findings"]:
+        for finding in report["findings"]:
+            lines.extend(
+                [
+                    f"[{finding['severity']}] {finding['id']} - {finding['name']}",
+                    f"Category: {finding['category']}",
+                    f"Occurrences: {finding['occurrences']}",
+                    f"First seen: {finding['first_seen']}",
+                    f"Last seen: {finding['last_seen']}",
+                    f"Runbook: {finding['runbook'] or 'None'}",
+                    f"Description: {finding['description'] or 'None'}",
+                    "Samples:",
+                ]
+            )
+            if finding["samples"]:
+                for sample in finding["samples"]:
+                    lines.append(f"  - {sample}")
+            else:
+                lines.append("  - None")
+            lines.append("")
+    else:
+        lines.extend(["No known error patterns matched for the selected filters.", ""])
+
+    lines.extend(render_summary_lines(report, markdown=False))
+    return "\n".join(lines)
+
+
+def render_summary_lines(report: dict[str, Any], markdown: bool) -> list[str]:
+    if markdown:
+        return [
+            "## Operational Summary",
+            "",
+            f"- Overall status: `{report['overall_status']}`",
+            f"- Total lines scanned: `{report['total_lines_scanned']}`",
+            f"- Known error matches: `{report['total_known_error_matches']}`",
+            f"- Matched known error patterns: `{report['matched_pattern_count']}`",
+            f"- Critical matched patterns: `{report['critical_matched_pattern_count']}`",
+            f"- Warning matched patterns: `{report['warning_matched_pattern_count']}`",
+            "- Top categories: " + render_top_pairs(report["top_categories"], "category"),
+            "- Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
+            "- Timestamp coverage: "
+            f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
+            f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
+            "- Filters used: "
+            
f"severity=`{report['filters_used']['severity'] or 'None'}`, "
+            f"category=`{report['filters_used']['category'] or 'None'}`",
+            f"- Pattern catalog path: `{report['pattern_catalog_path']}`",
+        ]
+    return [
+        "Operational Summary",
+        "-------------------",
+        f"Overall status: {report['overall_status']}",
+        f"Total lines scanned: {report['total_lines_scanned']}",
+        f"Known error matches: {report['total_known_error_matches']}",
+        f"Matched known error patterns: {report['matched_pattern_count']}",
+        f"Critical matched patterns: {report['critical_matched_pattern_count']}",
+        f"Warning matched patterns: {report['warning_matched_pattern_count']}",
+        "Top categories: " + render_top_pairs(report["top_categories"], "category"),
+        "Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
+        "Timestamp coverage: "
+        f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
+        f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
+        "Filters used: "
+        f"severity={report['filters_used']['severity'] or 'None'}, "
+        f"category={report['filters_used']['category'] or 'None'}",
+        f"Pattern catalog path: {report['pattern_catalog_path']}",
+    ]
+
+
+def render_markdown(report: dict[str, Any]) -> str:
+    lines = [
+        "# Known Error Matcher Report",
+        "",
+        f"- Overall status: `{report['overall_status']}`",
+        "- Known error pattern matches require operator review; logs alone do not prove root cause.",
+        "",
+    ]
+
+    if report["findings"]:
+        lines.extend(["## Matched Known Errors", ""])
+        for finding in report["findings"]:
+            lines.extend(
+                [
+                    f"### [{finding['severity']}] {finding['id']} - {finding['name']}",
+                    "",
+                    f"- Category: `{finding['category']}`",
+                    f"- Occurrences: `{finding['occurrences']}`",
+                    f"- First seen: `{finding['first_seen']}`",
+                    f"- Last seen: `{finding['last_seen']}`",
+                    f"- Runbook: `{finding['runbook'] or 'None'}`",
+                    f"- Description: {finding['description'] or 'None'}",
+                    "- Samples:",
+                ]
+            )
+            if
finding["samples"]:
+                for sample in finding["samples"]:
+                    lines.append(f"  - `{sample}`")
+            else:
+                lines.append("  - `None`")
+            lines.append("")
+    else:
+        lines.extend(["## Matched Known Errors", "", "No known error patterns matched for the selected filters.", ""])
+
+    lines.extend(render_summary_lines(report, markdown=True))
+    return "\n".join(lines)
+
+
+def render_json(report: dict[str, Any]) -> str:
+    return json.dumps(report, indent=2)
+
+
+def write_output(text: str, output_path: str | None, protected_inputs: list[Path]) -> None:
+    if output_path is None:
+        print(text)
+        return
+
+    destination = Path(output_path)
+    try:
+        destination_resolved = destination.resolve()
+        for input_path in protected_inputs:
+            if input_path.resolve() == destination_resolved:
+                raise OSError("output path must not overwrite an input file")
+    except FileNotFoundError as exc:
+        raise OSError(f"unable to resolve output path {destination}: {exc}") from exc
+
+    try:
+        destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
+    except OSError as exc:
+        raise OSError(f"unable to write report to {destination}: {exc}") from exc
+
+
+def determine_exit_code(report: dict[str, Any]) -> int:
+    if report["total_known_error_matches"] > 0:
+        return EXIT_FINDINGS
+    return EXIT_OK
+
+
+def main() -> int:
+    parser = build_parser()
+    args = parser.parse_args()
+
+    try:
+        log_path = Path(args.file)
+        pattern_path = Path(args.patterns)
+        log_text = read_text_file(log_path, "log file")
+        lines = log_text.splitlines()
+        patterns = load_pattern_catalog(pattern_path, args.ignore_case)
+        severity_filter = args.severity.upper() if args.severity else None
+
+        report = analyze_log(
+            lines=lines,
+            patterns=patterns,
+            severity_filter=severity_filter,
+            category_filter=args.category,
+            top=args.top,
+            max_samples=args.max_samples,
+            pattern_catalog_path=pattern_path,
+        )
+
+        if args.format == "text":
+            rendered = render_text(report)
+        elif args.format == "markdown":
+            rendered =
render_markdown(report) + else: + rendered = render_json(report) + + write_output(rendered, args.output, [log_path, pattern_path]) + return determine_exit_code(report) + except (OSError, ValueError) as exc: + print(f"ERROR: {exc}", file=sys.stderr) + return EXIT_INVALID + except Exception as exc: # pragma: no cover - defensive operational fallback + print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr) + return EXIT_INVALID + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/infra-run/scripts/python/known-error-matcher/patterns.json b/infra-run/scripts/python/known-error-matcher/patterns.json new file mode 100644 index 0000000..3d30007 --- /dev/null +++ b/infra-run/scripts/python/known-error-matcher/patterns.json @@ -0,0 +1,220 @@ +{ + "patterns": [ + { + "id": "disk_full", + "name": "Disk full", + "severity": "CRITICAL", + "regex": "No space left on device|disk full|filesystem full", + "category": "storage", + "runbook": "infra-run/scripts/bash/disk-full/README.md", + "description": "Filesystem or application failed because free space was exhausted." + }, + { + "id": "inode_exhaustion", + "name": "Inode exhaustion", + "severity": "CRITICAL", + "regex": "No space left on device.*inode|inode.*exhaust|free inodes.*0", + "category": "storage", + "runbook": "infra-run/scripts/bash/disk-full/README.md", + "description": "Filesystem may have free blocks but too few available inodes." + }, + { + "id": "read_only_filesystem", + "name": "Read-only filesystem", + "severity": "CRITICAL", + "regex": "read-only file system|read-only filesystem|Remounting filesystem read-only", + "category": "storage", + "runbook": "infra-run/runbooks/incidents/read-only-filesystem.md", + "description": "Filesystem writes failed because the mount was read-only or remounted read-only." 
+ }, + { + "id": "io_error", + "name": "I/O error", + "severity": "CRITICAL", + "regex": "\\bI/O error\\b|Buffer I/O error|blk_update_request.*I/O error", + "category": "storage", + "runbook": "infra-run/runbooks/incidents/storage-io-error.md", + "description": "Kernel or application reported storage I/O errors that require device and filesystem review." + }, + { + "id": "out_of_memory", + "name": "Out of memory", + "severity": "CRITICAL", + "regex": "\\bout of memory\\b|Cannot allocate memory", + "category": "memory", + "runbook": "infra-run/runbooks/incidents/memory-pressure.md", + "description": "Process or host reported memory exhaustion symptoms." + }, + { + "id": "oom_killer", + "name": "OOM killer invoked", + "severity": "CRITICAL", + "regex": "oom-killer|Killed process \\d+|Out of memory: Killed process", + "category": "memory", + "runbook": "infra-run/runbooks/incidents/oom-killer.md", + "description": "Kernel OOM killer activity was logged and affected processes should be reviewed." + }, + { + "id": "segmentation_fault", + "name": "Segmentation fault", + "severity": "CRITICAL", + "regex": "segmentation fault|segfault", + "category": "process", + "runbook": "infra-run/runbooks/incidents/process-crash.md", + "description": "A process crash pattern was logged." + }, + { + "id": "connection_refused", + "name": "Connection refused", + "severity": "WARNING", + "regex": "connection refused|ConnectException: Connection refused", + "category": "network", + "runbook": "infra-run/scripts/bash/os-healthcheck/README.md", + "description": "Client connection attempts were refused by the destination service or host." + }, + { + "id": "connection_reset", + "name": "Connection reset", + "severity": "WARNING", + "regex": "connection reset|Connection reset by peer", + "category": "network", + "runbook": "infra-run/scripts/bash/os-healthcheck/README.md", + "description": "Established network connections were reset and require endpoint review." 
+ }, + { + "id": "timeout", + "name": "Timeout", + "severity": "WARNING", + "regex": "\\btimeout\\b|timed out|TimeoutException|SocketTimeoutException", + "category": "network", + "runbook": "infra-run/scripts/bash/os-healthcheck/README.md", + "description": "Operation timed out and may require network, service, or dependency review." + }, + { + "id": "dns_resolution_failure", + "name": "DNS resolution failure", + "severity": "WARNING", + "regex": "Temporary failure in name resolution|Name or service not known|NXDOMAIN|UnknownHostException|could not resolve host", + "category": "network", + "runbook": "infra-run/runbooks/incidents/dns-resolution.md", + "description": "Name resolution failed for a host or service dependency." + }, + { + "id": "certificate_expired", + "name": "Certificate expired", + "severity": "CRITICAL", + "regex": "certificate expired|CertificateExpiredException|certificate has expired", + "category": "tls", + "runbook": "infra-run/runbooks/incidents/certificate-expired.md", + "description": "TLS certificate expiry was logged and certificate state should be reviewed." + }, + { + "id": "tls_handshake_failed", + "name": "TLS handshake failed", + "severity": "WARNING", + "regex": "TLS handshake failed|SSL handshake failed|handshake_failure", + "category": "tls", + "runbook": "infra-run/runbooks/incidents/tls-handshake.md", + "description": "TLS handshake failed and may require certificate, protocol, or trust-store review." + }, + { + "id": "authentication_failure", + "name": "Authentication failure", + "severity": "WARNING", + "regex": "authentication failure|Failed password|authentication failed", + "category": "security", + "runbook": "infra-run/scripts/python/auth-log-audit/README.md", + "description": "Authentication failures were logged and may require access review."
+ }, + { + "id": "permission_denied", + "name": "Permission denied", + "severity": "WARNING", + "regex": "permission denied|access denied|denied by policy", + "category": "security", + "runbook": "infra-run/runbooks/incidents/permission-denied.md", + "description": "Access or permission denial was logged." + }, + { + "id": "invalid_user", + "name": "Invalid user", + "severity": "WARNING", + "regex": "Invalid user|invalid user|user unknown|User not known", + "category": "security", + "runbook": "infra-run/scripts/python/auth-log-audit/README.md", + "description": "Log contains attempts involving invalid or unknown users." + }, + { + "id": "java_out_of_memory", + "name": "Java OutOfMemoryError", + "severity": "CRITICAL", + "regex": "OutOfMemoryError|Java heap space|GC overhead limit exceeded", + "category": "application_jvm", + "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md", + "description": "Java process logged memory exhaustion symptoms." + }, + { + "id": "ssl_handshake_exception", + "name": "SSLHandshakeException", + "severity": "CRITICAL", + "regex": "SSLHandshakeException|javax\\.net\\.ssl\\.SSLHandshakeException", + "category": "application_jvm", + "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md", + "description": "Java TLS handshake exception was logged." + }, + { + "id": "database_unavailable", + "name": "Database unavailable", + "severity": "CRITICAL", + "regex": "database unavailable|database is unavailable|SQLRecoverableException|CommunicationsException|connection pool exhausted", + "category": "application", + "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md", + "description": "Application logged unavailable database or database connectivity symptoms." 
+ }, + { + "id": "http_500", + "name": "HTTP 500", + "severity": "CRITICAL", + "regex": "\\bHTTP\\s+500\\b|\\bstatus=500\\b|\\s500\\s", + "category": "application", + "runbook": "infra-run/runbooks/incidents/http-5xx.md", + "description": "Application or proxy logged HTTP 500 responses." + }, + { + "id": "http_503", + "name": "HTTP 503", + "severity": "CRITICAL", + "regex": "\\bHTTP\\s+503\\b|\\bstatus=503\\b|\\s503\\s|Service Unavailable", + "category": "application", + "runbook": "infra-run/runbooks/incidents/http-5xx.md", + "description": "Application or proxy logged HTTP 503 service unavailable responses." + }, + { + "id": "service_failed", + "name": "Systemd service failed", + "severity": "CRITICAL", + "regex": "Failed to start .*\\.service|entered failed state|Unit .*\\.service failed|Main process exited.*status=", + "category": "systemd", + "runbook": "infra-run/scripts/python/journal-analyzer/README.md", + "description": "Systemd logged a failed service or failed service start." + }, + { + "id": "dependency_failed", + "name": "Systemd dependency failed", + "severity": "CRITICAL", + "regex": "Dependency failed for|dependency failed", + "category": "systemd", + "runbook": "infra-run/scripts/python/journal-analyzer/README.md", + "description": "Systemd logged a unit dependency failure." + }, + { + "id": "start_request_repeated", + "name": "Start request repeated too quickly", + "severity": "WARNING", + "regex": "Start request repeated too quickly|start request repeated too quickly", + "category": "systemd", + "runbook": "infra-run/scripts/python/journal-analyzer/README.md", + "description": "Systemd throttled service restarts after repeated start failures." + } + ] +}