Add incident log summary tool

2026-05-11 17:03:31 +00:00
parent 61483c233f
commit 5dde403ce3
5 changed files with 765 additions and 0 deletions
@@ -0,0 +1,158 @@
 # incident-log-summary
 `incident-log-summary` is a read-only Python CLI for quick incident log review. It scans a local Linux system or application log, groups matches by configured pattern, and reports each group's severity, occurrence count, first and last timestamps, and sample lines.
 The tool is meant for first-pass triage and incident notes. It does not replace full log search, alert correlation, service-specific runbooks, or review by an operator who understands the affected platform.
 ## When To Use
 - During incident response when a collected log file needs a fast pattern summary.
 - Before attaching evidence to an incident, problem, or change ticket.
 - When comparing whether a log contains obvious storage, memory, service, TLS, HTTP, or connectivity failures.
 - When JSON output is useful for later local automation.
 ## What It Does Not Do
 - It does not read remote systems.
 - It does not modify logs or system state.
 - It does not query ELK, Zabbix, SIEM, journald, or application APIs.
 - It does not prove root cause.
 - It does not classify every possible vendor or application error.
 - It does not treat sanitized examples as production validation.
 ## Supported Input
 - One local text log file provided with `--file`.
 - UTF-8 input is expected. Invalid byte sequences are replaced during read so review can continue.
 - Empty files, missing paths, unreadable files, and paths that are not regular files are rejected with exit code `2`.
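The replacement of invalid byte sequences can be illustrated with Python's standard UTF-8 decoder. This is a minimal sketch that assumes a hypothetical byte string rather than a real log file:

```python
# Hypothetical raw bytes containing one invalid UTF-8 byte (0xff).
raw = b"kernel: no space left on device \xff while writing"

# errors="replace" swaps each undecodable byte for U+FFFD instead of raising.
decoded = raw.decode("utf-8", errors="replace")

print("\ufffd" in decoded)                   # True
print("no space left on device" in decoded)  # True
```

The surrounding text stays intact, so configured patterns on the same line can still match.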
 ## Supported Patterns
 Critical patterns:
 - `CRITICAL`
 - `FATAL`
 - `panic`
 - `kernel panic`
 - `no space left on device`
 - `out of memory`
 - `killed process`
 - `read-only file system`
 - `segmentation fault`
 - `segfault`
 - `certificate expired`
 - `TLS handshake failed`
 - `SSLHandshakeException`
 - `database unavailable`
 - `HTTP 500`
 - `HTTP 502`
 - `HTTP 503`
 - `HTTP 504`
 Warning patterns:
 - `ERROR`
 - `failed`
 - `failure`
 - `timeout`
 - `connection refused`
 - `connection reset`
 - `permission denied`
 - `authentication failed`
 - `denied`
 - `unavailable`
 - `service restart`
 - `retrying`
 By default matching is case-sensitive. Use `--ignore-case` for case-insensitive matching across all configured patterns.
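The effect of `--ignore-case` can be sketched with a small helper. `count_matching_lines` and the two sample lines are hypothetical and exist only for illustration; the escaping and flag handling are assumed to mirror how the script compiles its patterns:

```python
from __future__ import annotations

import re


def count_matching_lines(pattern: str, lines: list[str], ignore_case: bool = False) -> int:
    """Count lines that contain the literal pattern (hypothetical helper)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.escape makes the pattern a literal substring, not a regular expression.
    regex = re.compile(re.escape(pattern), flags)
    return sum(1 for line in lines if regex.search(line))


sample = [
    "app01 api: ERROR upstream unavailable",
    "app01 api: error while retrying request",
]

print(count_matching_lines("ERROR", sample))                    # 1
print(count_matching_lines("ERROR", sample, ignore_case=True))  # 2
```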
 ## Timestamp Handling
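 The scanner attempts to parse the first ISO-style timestamp found anywhere in a line or, failing that, a syslog-style timestamp at the start of the line: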
 - `2026-05-11 10:15:30`
 - `2026-05-11T10:15:30`
 - `May 11 10:15:30`
 Timestamp parsing is best-effort. Lines with unparseable timestamps are still analyzed, and date filtering keeps those lines by default so potentially important findings are not silently discarded.
 Syslog-style timestamps do not include a year. The tool assigns them the year from `--since` when present, otherwise the current local year, and uses that inferred year for both filtering and the reported first-seen and last-seen values.
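The year inference and the keep-by-default rule can be sketched as follows. `parse_syslog` and `in_window` are hypothetical names for this illustration, not the script's own functions, and the sample timestamps are made up:

```python
from __future__ import annotations

from datetime import datetime


def parse_syslog(raw: str, year: int) -> datetime | None:
    """Parse a year-less syslog timestamp using a caller-supplied year."""
    try:
        return datetime.strptime(f"{year} {raw}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None


def in_window(parsed: datetime | None, since: datetime | None, until: datetime | None) -> bool:
    """Keep lines without a usable timestamp; otherwise apply inclusive bounds."""
    if parsed is None:
        return True
    if since is not None and parsed < since:
        return False
    if until is not None and parsed > until:
        return False
    return True


since = datetime(2026, 5, 11, 10, 0, 0)
inside = parse_syslog("May 11 10:15:30", since.year)
before = parse_syslog("May 11 09:00:00", since.year)
unparsed = parse_syslog("not a timestamp", since.year)

print(in_window(inside, since, None))    # True
print(in_window(before, since, None))    # False
print(in_window(unparsed, since, None))  # True
```

Because the year is inferred, a log that spans a year boundary can be filtered incorrectly; passing `--since` with the intended year reduces that risk.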
 ## Usage
 ```bash
 cd infra-run/scripts/python/incident-log-summary
 python3 incident_log_summary.py --file examples/system-messages.log
 python3 incident_log_summary.py --file examples/app-error.log --format markdown --output incident-report.md
 python3 incident_log_summary.py --file examples/app-error.log --format json
 python3 incident_log_summary.py --file examples/app-error.log --top 20
 python3 incident_log_summary.py --file examples/app-error.log --ignore-case
 python3 incident_log_summary.py --file examples/app-error.log --since "2026-05-11 10:00:00"
 python3 incident_log_summary.py --file examples/app-error.log --until "2026-05-11 12:00:00"
 ```
 ## Output Formats
 - `text` - default terminal-oriented report.
 - `markdown` - incident or change ticket attachment format.
 - `json` - structured output for local automation.
 Use `--output <path>` to write the rendered report to a file. Without `--output`, the report is printed to stdout.
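For local automation, the JSON report can be loaded with the standard library. The payload below is hand-written and abridged for illustration; its key names follow the report structure the script builds, while the values are illustrative and loosely based on the example report:

```python
import json

# Hand-written, abridged payload (an assumption for illustration, not captured output).
report_text = """
{
  "findings": [
    {"pattern": "HTTP 503", "severity": "CRITICAL", "occurrences": 1,
     "first_seen": "2026-05-11 10:13:58", "last_seen": "2026-05-11 10:13:58",
     "samples": []},
    {"pattern": "ERROR", "severity": "WARNING", "occurrences": 4,
     "first_seen": "2026-05-11 10:01:03", "last_seen": "2026-05-11 10:13:58",
     "samples": []}
  ],
  "summary": {"critical_finding_groups": 1, "overall_status": "CRITICAL",
              "total_findings": 5, "total_lines_scanned": 8,
              "warning_finding_groups": 1},
  "total_lines_scanned": 8
}
"""

report = json.loads(report_text)

# Collect the patterns of all critical finding groups.
critical = [f["pattern"] for f in report["findings"] if f["severity"] == "CRITICAL"]

print(report["summary"]["overall_status"], critical)  # CRITICAL ['HTTP 503']
```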
 ## Exit Codes
 - `0` - OK, no findings.
 - `1` - Operational findings detected.
 - `2` - Invalid input, unreadable file, bad argument, or runtime error.
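A wrapper script can branch on these exit codes. This is a minimal sketch: `classify_exit_code` and `run_summary` are hypothetical helpers, and `run_summary` assumes `incident_log_summary.py` is in the current working directory:

```python
import subprocess
import sys

# Labels chosen for this sketch; the numeric codes are the documented ones.
EXIT_MEANINGS = {0: "ok", 1: "findings", 2: "invalid"}


def classify_exit_code(code: int) -> str:
    """Map a documented exit code to a short label."""
    return EXIT_MEANINGS.get(code, "unexpected")


def run_summary(log_path: str) -> str:
    """Run the CLI on one log file and classify its exit code."""
    completed = subprocess.run(
        [sys.executable, "incident_log_summary.py", "--file", log_path],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 means findings, not a crash
    )
    return classify_exit_code(completed.returncode)


print(classify_exit_code(1))  # findings
```

Passing `check=False` matters here: exit code `1` is a normal result and should not be raised as an exception.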
 ## Example Text Output
 Abridged report for `examples/system-messages.log`; only one of the seven finding groups is shown.
 ```text
 Incident Log Summary
 ====================
 [CRITICAL] no space left on device
 Occurrences: 1
 First seen: 2026-05-11 10:16:07
 Last seen: 2026-05-11 10:16:07
 Samples:
  - May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages
 Operational Summary
 -------------------
 Total lines scanned: 7
 Total findings: 7
 Critical finding groups: 3
 Warning finding groups: 4
 Overall status: CRITICAL
 ```
 ## Markdown Workflow
 Generate a markdown report from the collected log and attach it to the incident or change ticket as supporting evidence:
 ```bash
 python3 incident_log_summary.py \
  --file examples/app-error.log \
  --format markdown \
  --output incident-report.md
 ```
 Review the report before attaching it. The output is evidence for triage; it is not a final root cause statement.
 ## Operational Limitations
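 - Pattern matching is intentionally simple and predictable: each pattern is a literal substring with no word-boundary checks, so `denied` also matches inside `permission denied`, and each pattern counts at most once per line.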
 - A single line can match multiple patterns, such as `ERROR`, `HTTP 503`, and `unavailable`.
 - Case-sensitive default matching can miss lowercase variants unless `--ignore-case` is used.
 - Syslog timestamps without a year are normalized with an inferred year.
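 - Date filters are best-effort because lines without parseable timestamps are retained.
 - With `--top`, the operational summary counts only the retained finding groups, not every group found in the log.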
 - Large log files are read into memory; collect a scoped file or time-windowed extract for very large incidents.
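For the large-file case, a time-windowed extract can be produced line by line before running the tool. This is a sketch under stated assumptions: `window_extract` is a hypothetical helper, it only handles lines that start with an ISO-style timestamp, and unlike the tool it drops lines it cannot date. The sample lines are shortened from the example application log:

```python
from __future__ import annotations

import re
from datetime import datetime
from typing import Iterable, Iterator

# Leading ISO-style timestamp, with either a space or "T" separator.
ISO_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})")


def window_extract(lines: Iterable[str], since: datetime, until: datetime) -> Iterator[str]:
    """Yield only lines whose leading timestamp falls inside [since, until]."""
    for line in lines:
        match = ISO_RE.match(line)
        if match is None:
            continue  # design choice of this sketch: skip undated lines
        try:
            stamp = datetime.strptime(
                f"{match.group(1)} {match.group(2)}", "%Y-%m-%d %H:%M:%S"
            )
        except ValueError:
            continue  # digits matched but do not form a valid date
        if since <= stamp <= until:
            yield line


lines = [
    "2026-05-11 09:48:12 app01 api[4150]: INFO status=200",
    "2026-05-11 10:05:44 app01 api[4150]: ERROR timeout waiting for inventory service",
    "2026-05-11 12:10:01 app01 api[4150]: INFO status=200",
]
kept = list(
    window_extract(lines, datetime(2026, 5, 11, 10, 0, 0), datetime(2026, 5, 11, 12, 0, 0))
)
print(len(kept))  # 1
```

Passing an open file object instead of a list keeps memory use flat, because the generator consumes one line at a time.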
 ## Safety Notes
 - The tool only reads the input log and optionally writes a separate report.
 - It does not require elevated privileges unless the chosen log path requires them.
 - Do not include secrets, customer data, private hostnames, or unsanitized production details in portfolio examples.
 - Treat findings as prompts for operator review, not automated remediation instructions.
@@ -0,0 +1,8 @@
 2026-05-11 09:48:12 app01 api[4150]: INFO request_id=7f3a status=200 path=/health
 2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
 2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
 2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
 2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
 2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
 2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
 2026-05-11 12:10:01 app01 api[4150]: INFO request_id=9001 status=200 path=/health
@@ -0,0 +1,144 @@
 # Incident Log Summary
 ## CRITICAL: certificate expired
 - Occurrences: 1
 - First seen: 2026-05-11 10:11:33
 - Last seen: 2026-05-11 10:11:33
 Sample log lines:
 ```text
 2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
 ```
 ## CRITICAL: CRITICAL
 - Occurrences: 1
 - First seen: 2026-05-11 10:11:33
 - Last seen: 2026-05-11 10:11:33
 Sample log lines:
 ```text
 2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
 ```
 ## CRITICAL: database unavailable
 - Occurrences: 1
 - First seen: 2026-05-11 10:03:19
 - Last seen: 2026-05-11 10:03:19
 Sample log lines:
 ```text
 2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
 ```
 ## CRITICAL: HTTP 500
 - Occurrences: 1
 - First seen: 2026-05-11 10:01:03
 - Last seen: 2026-05-11 10:01:03
 Sample log lines:
 ```text
 2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
 ```
 ## CRITICAL: HTTP 503
 - Occurrences: 1
 - First seen: 2026-05-11 10:13:58
 - Last seen: 2026-05-11 10:13:58
 Sample log lines:
 ```text
 2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
 ```
 ## CRITICAL: TLS handshake failed
 - Occurrences: 1
 - First seen: 2026-05-11 10:11:33
 - Last seen: 2026-05-11 10:11:33
 Sample log lines:
 ```text
 2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
 ```
 ## WARNING: ERROR
 - Occurrences: 4
 - First seen: 2026-05-11 10:01:03
 - Last seen: 2026-05-11 10:13:58
 Sample log lines:
 ```text
 2026-05-11 10:01:03 app01 api[4150]: ERROR request_id=8b21 HTTP 500 path=/checkout duration_ms=942
 2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
 2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
 ```
 ## WARNING: unavailable
 - Occurrences: 2
 - First seen: 2026-05-11 10:03:19
 - Last seen: 2026-05-11 10:13:58
 Sample log lines:
 ```text
 2026-05-11 10:03:19 app01 api[4150]: WARNING request_id=8b22 database unavailable for payments cluster
 2026-05-11 10:13:58 app01 api[4150]: ERROR request_id=8b44 HTTP 503 path=/checkout upstream unavailable
 ```
 ## WARNING: connection refused
 - Occurrences: 1
 - First seen: 2026-05-11 10:07:02
 - Last seen: 2026-05-11 10:07:02
 Sample log lines:
 ```text
 2026-05-11 10:07:02 app01 api[4150]: ERROR request_id=8b29 connection refused connecting to redis-cache:6379
 ```
 ## WARNING: failed
 - Occurrences: 1
 - First seen: 2026-05-11 10:11:33
 - Last seen: 2026-05-11 10:11:33
 Sample log lines:
 ```text
 2026-05-11T10:11:33 app01 api[4150]: CRITICAL request_id=8b31 TLS handshake failed: certificate expired
 ```
 ## WARNING: timeout
 - Occurrences: 1
 - First seen: 2026-05-11 10:05:44
 - Last seen: 2026-05-11 10:05:44
 Sample log lines:
 ```text
 2026-05-11 10:05:44 app01 api[4150]: ERROR request_id=8b25 timeout waiting for inventory service
 ```
 ## Operational Summary
 - Total lines scanned: 8
 - Total findings: 15
 - Critical finding groups: 6
 - Warning finding groups: 5
 - Overall status: CRITICAL
@@ -0,0 +1,7 @@
 May 11 09:57:01 ops-node-01 systemd[1]: Started Session 443 of user svc_backup.
 May 11 10:02:14 ops-node-01 systemd[1]: failed to start nightly-report.service: Unit entered failed state.
 May 11 10:04:22 ops-node-01 sudo[18442]: svc_backup : command not allowed ; permission denied
 May 11 10:16:07 ops-node-01 kernel: EXT4-fs warning: no space left on device while writing /var/log/messages
 May 11 10:21:45 ops-node-01 kernel: out of memory: killed process 2517 (java) total-vm:2048000kB
 May 11 10:22:03 ops-node-01 systemd[1]: service restart scheduled for app-worker.service
 May 11 10:30:31 ops-node-01 sshd[19210]: Accepted publickey for admin from 192.0.2.15 port 52210 ssh2
@@ -0,0 +1,448 @@
 #!/usr/bin/env python3
 """Summarize incident-oriented patterns in local log files."""
 from __future__ import annotations
 import argparse
 import json
 import re
 import sys
 from datetime import datetime
 from pathlib import Path
 from typing import Any
 EXIT_OK = 0
 EXIT_FINDINGS = 1
 EXIT_INVALID = 2
 UNKNOWN = "UNKNOWN"
 SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}
 CRITICAL_PATTERNS = [
    "CRITICAL",
    "FATAL",
    "panic",
    "kernel panic",
    "no space left on device",
    "out of memory",
    "killed process",
    "read-only file system",
    "segmentation fault",
    "segfault",
    "certificate expired",
    "TLS handshake failed",
    "SSLHandshakeException",
    "database unavailable",
    "HTTP 500",
    "HTTP 502",
    "HTTP 503",
    "HTTP 504",
 ]
 WARNING_PATTERNS = [
    "ERROR",
    "failed",
    "failure",
    "timeout",
    "connection refused",
    "connection reset",
    "permission denied",
    "authentication failed",
    "denied",
    "unavailable",
    "service restart",
    "retrying",
 ]
 ISO_TIMESTAMP_RE = re.compile(r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})\b")
 SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")
 def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Summarize suspicious and critical patterns in a local log file."
    )
    parser.add_argument("--file", required=True, help="Local log file to analyze.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--top",
        type=positive_int,
        help="Limit finding groups after severity and count sorting.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Match all configured patterns case-insensitively.",
    )
    parser.add_argument(
        "--since",
        type=parse_filter_timestamp,
        help='Include lines at or after "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--until",
        type=parse_filter_timestamp,
        help='Include lines at or before "YYYY-MM-DD HH:MM:SS".',
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per finding group. Default: 3.",
    )
    return parser
 def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number
 def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number
 def parse_filter_timestamp(value: str) -> datetime:
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%dT%H:%M:%S"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise argparse.ArgumentTypeError(
        'expected timestamp format "YYYY-MM-DD HH:MM:SS"'
    )
 def compile_patterns(ignore_case: bool) -> list[dict[str, Any]]:
    """Compile every configured pattern as an escaped literal, tagged with its severity."""
    flags = re.IGNORECASE if ignore_case else 0
    pattern_defs: list[dict[str, str]] = []
    pattern_defs.extend(
        {"pattern": pattern, "severity": "CRITICAL"} for pattern in CRITICAL_PATTERNS
    )
    pattern_defs.extend(
        {"pattern": pattern, "severity": "WARNING"} for pattern in WARNING_PATTERNS
    )
    compiled = []
    for item in pattern_defs:
        compiled.append(
            {
                "pattern": item["pattern"],
                "severity": item["severity"],
                "regex": re.compile(re.escape(item["pattern"]), flags),
            }
        )
    return compiled
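 def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str | None]:
    """Return (parsed datetime, normalized text) for the line, or (None, None).

    ISO-style timestamps are searched anywhere in the line; syslog-style
    timestamps are matched only at the start and take their year from syslog_year.
    """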
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        try:
            return datetime.strptime(raw, "%Y-%m-%d %H:%M:%S"), raw
        except ValueError:
            return None, None
    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        normalized = f"{syslog_year} {raw}"
        try:
            parsed = datetime.strptime(normalized, "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, None
        return parsed, parsed.strftime("%Y-%m-%d %H:%M:%S")
    return None, None
 def line_in_time_window(
    parsed_at: datetime | None, since: datetime | None, until: datetime | None
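 ) -> bool:
    """Return True when the line is inside the window; lines without a parsed timestamp are always kept."""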
    if parsed_at is None:
        return True
    if since is not None and parsed_at < since:
        return False
    if until is not None and parsed_at > until:
        return False
    return True
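 def read_log_file(path: Path) -> list[str]:
    """Read the log as UTF-8 (invalid bytes replaced) and return its lines.

    Raises OSError for missing, non-file, or unreadable paths and ValueError for empty files.
    """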
    if not path.exists():
        raise OSError(f"file does not exist: {path}")
    if not path.is_file():
        raise OSError(f"path is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"file is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read file {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"file is empty: {path}")
    return text.splitlines()
 def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    since: datetime | None,
    until: datetime | None,
    max_samples: int,
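 ) -> dict[str, Any]:
    """Group pattern matches per (severity, pattern) and return the sorted findings."""
    # Syslog lines carry no year: borrow it from --since, else use the current local year.
    syslog_year = since.year if since is not None else datetime.now().year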
    groups: dict[str, dict[str, Any]] = {}
    for line in lines:
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if not line_in_time_window(parsed_at, since, until):
            continue
        for item in patterns:
            if not item["regex"].search(line):
                continue
            key = f"{item['severity']}::{item['pattern']}"
            group = groups.setdefault(
                key,
                {
                    "pattern": item["pattern"],
                    "severity": item["severity"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1
            if parsed_at is not None:
                if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
                    group["first_seen"] = (parsed_at, rendered_at)
                if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
                    group["last_seen"] = (parsed_at, rendered_at)
            if len(group["samples"]) < max_samples:
                group["samples"].append(line)
    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["pattern"].lower(),
        ),
    )
    rendered_findings = []
    for group in findings:
        rendered_findings.append(
            {
                "pattern": group["pattern"],
                "severity": group["severity"],
                "occurrences": group["occurrences"],
                "first_seen": render_seen(group["first_seen"]),
                "last_seen": render_seen(group["last_seen"]),
                "samples": group["samples"],
            }
        )
    return {
        "total_lines_scanned": len(lines),
        "findings": rendered_findings,
    }
 def render_seen(value: tuple[datetime, str | None] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")
 def apply_top_limit(report: dict[str, Any], top: int | None) -> dict[str, Any]:
    """Keep only the first `top` finding groups; the summary added afterwards covers just these."""
    if top is None:
        return report
    limited = dict(report)
    limited["findings"] = report["findings"][:top]
    return limited
 def add_summary(report: dict[str, Any]) -> dict[str, Any]:
    findings = report["findings"]
    critical_groups = sum(1 for item in findings if item["severity"] == "CRITICAL")
    warning_groups = sum(1 for item in findings if item["severity"] == "WARNING")
    total_findings = sum(item["occurrences"] for item in findings)
    if critical_groups > 0:
        status = "CRITICAL"
    elif warning_groups > 0:
        status = "WARNING"
    else:
        status = "OK"
    enriched = dict(report)
    enriched["summary"] = {
        "total_lines_scanned": report["total_lines_scanned"],
        "total_findings": total_findings,
        "critical_finding_groups": critical_groups,
        "warning_finding_groups": warning_groups,
        "overall_status": status,
    }
    return enriched
 def render_text(report: dict[str, Any]) -> str:
    lines = ["Incident Log Summary", "====================", ""]
    if not report["findings"]:
        lines.append("No configured incident patterns were detected.")
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"[{finding['severity']}] {finding['pattern']}",
                    f"Occurrences: {finding['occurrences']}",
                    f"First seen: {finding['first_seen']}",
                    f"Last seen: {finding['last_seen']}",
                    "Samples:",
                ]
            )
            if finding["samples"]:
                lines.extend(f"  - {sample}" for sample in finding["samples"])
            else:
                lines.append("  - No samples retained")
            lines.append("")
    lines.extend(render_text_summary(report["summary"]))
    return "\n".join(lines) + "\n"
 def render_text_summary(summary: dict[str, Any]) -> list[str]:
    return [
        "Operational Summary",
        "-------------------",
        f"Total lines scanned: {summary['total_lines_scanned']}",
        f"Total findings: {summary['total_findings']}",
        f"Critical finding groups: {summary['critical_finding_groups']}",
        f"Warning finding groups: {summary['warning_finding_groups']}",
        f"Overall status: {summary['overall_status']}",
    ]
 def render_markdown(report: dict[str, Any]) -> str:
    lines = ["# Incident Log Summary", ""]
    if not report["findings"]:
        lines.extend(["No configured incident patterns were detected.", ""])
    else:
        for finding in report["findings"]:
            lines.extend(
                [
                    f"## {finding['severity']}: {finding['pattern']}",
                    "",
                    f"- Occurrences: {finding['occurrences']}",
                    f"- First seen: {finding['first_seen']}",
                    f"- Last seen: {finding['last_seen']}",
                    "",
                    "Sample log lines:",
                    "",
                ]
            )
            if finding["samples"]:
                lines.append("```text")
                lines.extend(finding["samples"])
                lines.append("```")
            else:
                lines.append("_No samples retained._")
            lines.append("")
    summary = report["summary"]
    lines.extend(
        [
            "## Operational Summary",
            "",
            f"- Total lines scanned: {summary['total_lines_scanned']}",
            f"- Total findings: {summary['total_findings']}",
            f"- Critical finding groups: {summary['critical_finding_groups']}",
            f"- Warning finding groups: {summary['warning_finding_groups']}",
            f"- Overall status: {summary['overall_status']}",
            "",
        ]
    )
    return "\n".join(lines)
 def render_json(report: dict[str, Any]) -> str:
    return json.dumps(report, indent=2, sort_keys=True) + "\n"
 def write_report(output_path: str | None, content: str) -> None:
    if output_path is None:
        sys.stdout.write(content)
        return
    path = Path(output_path)
    try:
        path.write_text(content, encoding="utf-8")
    except OSError as exc:
        raise OSError(f"unable to write output {path}: {exc}") from exc
 def main() -> int:
    parser = build_parser()
    args = parser.parse_args()
    if args.since is not None and args.until is not None and args.since > args.until:
        parser.error("--since must be earlier than or equal to --until")
    try:
        lines = read_log_file(Path(args.file))
        report = analyze_log(
            lines=lines,
            patterns=compile_patterns(args.ignore_case),
            since=args.since,
            until=args.until,
            max_samples=args.max_samples,
        )
        report = add_summary(apply_top_limit(report, args.top))
        if args.format == "text":
            content = render_text(report)
        elif args.format == "markdown":
            content = render_markdown(report)
        else:
            content = render_json(report)
        write_report(args.output, content)
    except (OSError, ValueError) as exc:
        print(f"CRITICAL: {exc}", file=sys.stderr)
        return EXIT_INVALID
    except RuntimeError as exc:
        print(f"CRITICAL: runtime error: {exc}", file=sys.stderr)
        return EXIT_INVALID
    if report["summary"]["overall_status"] == "OK":
        return EXIT_OK
    return EXIT_FINDINGS
 if __name__ == "__main__":
    sys.exit(main())