Add known error matcher tool
@@ -0,0 +1,198 @@
# known-error-matcher

`known-error-matcher` is a read-only Python CLI for scanning local log files against a JSON catalog of known operational error patterns. It connects matched log symptoms with severity, category, sample lines, and runbook references so an infrastructure engineer can decide what needs review next.

The tool matches known operational error patterns that require review. It does not prove an incident, identify root cause automatically, or replace service-specific runbooks.

## Purpose

- Identify which cataloged operational problems are visible in a collected log.
- Count how often each known error pattern appears.
- Surface warning and critical matches conservatively.
- Point operators toward relevant runbooks or supporting local tools.
- Produce predictable text, Markdown, or JSON output for incident notes.

## When To Use

- During incident response when a collected application, system, or journal extract needs quick known-error matching.
- Before attaching log evidence to an incident, problem, or change ticket.
- When teams maintain a small local catalog of operational patterns and runbook links.
- When JSON output is useful for later local automation.

## What It Does Not Do

- It does not read remote systems or live streams.
- It does not modify logs, services, applications, accounts, or host state.
- It does not query ELK, SIEM, APM, Zabbix, ticketing systems, or external services.
- It does not find root cause automatically.
- It does not prove an incident or confirm customer impact.
- It does not classify every vendor-specific log message.

## Pattern Catalog Format

Patterns are defined in JSON because the Python standard library can parse JSON without third-party dependencies.

```json
{
  "patterns": [
    {
      "id": "disk_full",
      "name": "Disk full",
      "severity": "CRITICAL",
      "regex": "No space left on device|disk full",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem or application failed because free space was exhausted."
    }
  ]
}
```

Required fields per pattern:

- `id` - stable non-empty identifier.
- `name` - human-readable finding name.
- `severity` - `WARNING` or `CRITICAL` (the value is upper-cased before validation).
- `regex` - Python regular expression used for matching.

Optional fields:

- `category` - operational grouping such as `storage`, `network`, `security`, `application`, or `systemd`. Missing values are reported as `UNKNOWN`.
- `runbook` - repository path to review when the pattern matches. Missing values are reported as `None`.
- `description` - short operator-facing explanation. Missing values are reported as `None`.

The catalog is validated before scanning starts. Invalid JSON, missing required fields, duplicate IDs, invalid severity values, invalid regexes, and a catalog with no usable patterns fail with exit code `2`.

## Adding A Known Error Pattern

Add a new object under `patterns` in `patterns.json`:

```json
{
  "id": "example_dependency_failure",
  "name": "Example dependency failure",
  "severity": "WARNING",
  "regex": "dependency request failed|upstream dependency unavailable",
  "category": "application",
  "runbook": "infra-run/runbooks/incidents/dependency-failure.md",
  "description": "Application logged a dependency failure that requires review."
}
```

Use a stable `id`, choose the lowest severity that still reflects operational risk, and keep the regex specific enough to avoid noisy generic matches. Prefer a runbook path that already exists; otherwise use a plausible future path under `infra-run/runbooks/incidents/` or leave it empty.
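
One way to check that a candidate regex is specific enough is to try it against a few lines that should match and a few that should not before adding it to the catalog. The sketch below is illustrative only: the helper `check_candidate` and the sample log lines are hypothetical and are not part of the tool.

```python
import re

# Hypothetical candidate pattern and sample lines, for illustration only.
CANDIDATE_REGEX = r"dependency request failed|upstream dependency unavailable"

SHOULD_MATCH = [
    "2026-05-11 10:18:02 app01 checkout-api[1842]: ERROR dependency request failed for inventory-api",
    "2026-05-11 10:18:09 app01 checkout-api[1842]: WARN upstream dependency unavailable, retrying",
]
SHOULD_NOT_MATCH = [
    "2026-05-11 10:18:30 app01 checkout-api[1842]: INFO dependency check completed status=ok",
]


def check_candidate(regex_text, should_match, should_not_match, ignore_case=False):
    """Return a list of problems; an empty list means the regex behaved as expected."""
    flags = re.IGNORECASE if ignore_case else 0
    compiled = re.compile(regex_text, flags)
    problems = []
    for line in should_match:
        if not compiled.search(line):
            problems.append(f"missed expected match: {line}")
    for line in should_not_match:
        if compiled.search(line):
            problems.append(f"unexpected match: {line}")
    return problems


problems = check_candidate(CANDIDATE_REGEX, SHOULD_MATCH, SHOULD_NOT_MATCH)
print(problems or "candidate regex behaved as expected")
```

A broader regex such as `dependency` alone would also hit the healthy `INFO` line, which is the kind of noisy generic match the guidance above warns against.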

## Severity Model

Overall status is conservative:

- `OK` - no known error patterns matched.
- `WARNING` - one or more warning patterns matched and no critical patterns matched.
- `CRITICAL` - one or more critical patterns matched.

The status means known error patterns require review. It is not a final root-cause statement.
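
The roll-up rule can be sketched as a small function. This is a minimal illustration of the rule as described, not the tool's own code; the function name `overall_status` is hypothetical.

```python
def overall_status(matched_severities):
    """Roll matched pattern severities up into OK, WARNING, or CRITICAL."""
    severities = {severity.upper() for severity in matched_severities}
    if "CRITICAL" in severities:
        # Any critical match wins, regardless of how many warnings matched.
        return "CRITICAL"
    if "WARNING" in severities:
        return "WARNING"
    return "OK"


print(overall_status(["WARNING", "CRITICAL", "WARNING"]))
```

The rule depends only on which severities matched, not on how often, so a single critical line is enough to raise the overall status.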

## Category Filtering

Use `--category CATEGORY` to include only matches where the pattern category exactly matches the provided value. The comparison is case-sensitive.

Examples:

```bash
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json --category storage
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --category application
```

## Usage

```bash
cd infra-run/scripts/python/known-error-matcher

python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-system.log --patterns patterns.json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format markdown --output known-error-report.md
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --format json
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --ignore-case
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --severity critical
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --top 10
python3 known_error_matcher.py --file examples/sample-app.log --patterns patterns.json --max-samples 5
```

## Output Formats

- `text` - default terminal-oriented report.
- `markdown` - incident, problem, or change ticket attachment format.
- `json` - structured output for local automation.

Use `--output <path>` to write the rendered report to a separate file. Without `--output`, the report is printed to stdout. The tool rejects an output path that resolves to the input log file or pattern catalog file.
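
For local automation, a JSON report can be consumed with the standard library alone. The sketch below assumes the JSON output carries top-level `overall_status` and `findings` keys with per-finding `severity` and `runbook` fields; that shape is an assumption based on the report fields this tool builds, and the embedded report is a hand-written stand-in rather than captured output. The helper `critical_runbooks` is hypothetical.

```python
import json

# Abbreviated, hand-written stand-in for a JSON report (assumed shape).
SAMPLE_REPORT = """
{
  "overall_status": "CRITICAL",
  "findings": [
    {"id": "http_500", "severity": "CRITICAL", "occurrences": 1,
     "runbook": "infra-run/runbooks/incidents/http-5xx.md"},
    {"id": "timeout", "severity": "WARNING", "occurrences": 1,
     "runbook": "infra-run/scripts/bash/os-healthcheck/README.md"}
  ]
}
"""


def critical_runbooks(report_text):
    """Return the sorted, de-duplicated runbook paths referenced by critical findings."""
    report = json.loads(report_text)
    return sorted(
        {
            finding["runbook"]
            for finding in report.get("findings", [])
            if finding.get("severity") == "CRITICAL" and finding.get("runbook")
        }
    )


runbooks = critical_runbooks(SAMPLE_REPORT)
print(runbooks)
```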

## Exit Codes

- `0` - OK, no known error matches.
- `1` - Known error matches detected.
- `2` - Invalid input, unreadable file, invalid JSON, invalid pattern catalog, invalid regex, bad argument, output write failure, or runtime error.
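
A wrapper script can branch on these exit codes. The following is a minimal sketch, assuming the script is run from the tool's directory; `describe_exit_code` and `run_matcher` are hypothetical helper names, and only the `--file` and `--patterns` flags documented here are used.

```python
import subprocess
import sys

# Meanings taken from the exit code list above.
EXIT_MEANINGS = {
    0: "OK: no known error matches",
    1: "REVIEW: known error matches detected",
    2: "ERROR: invalid input or runtime failure",
}


def describe_exit_code(code):
    """Translate a matcher exit code into a short operator-facing message."""
    return EXIT_MEANINGS.get(code, f"UNEXPECTED: exit code {code}")


def run_matcher(log_path, catalog_path):
    """Run the matcher once and return (exit_code, meaning). Not called on import."""
    completed = subprocess.run(
        [sys.executable, "known_error_matcher.py", "--file", log_path, "--patterns", catalog_path],
        capture_output=True,
        text=True,
        check=False,  # exit code 1 is a normal "matches found" result, not a failure
    )
    return completed.returncode, describe_exit_code(completed.returncode)


print(describe_exit_code(1))
```

Passing `check=False` matters here because exit code `1` signals findings to review rather than a crash.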

## Example Text Output

```text
Known Error Matcher
===================

Overall status: CRITICAL
Known error pattern matches require operator review; logs alone do not prove root cause.

[CRITICAL] database_unavailable - Database unavailable
Category: application
Occurrences: 1
First seen: 2026-05-11 10:16:07
Last seen: 2026-05-11 10:16:07
Runbook: infra-run/scripts/python/jvm-log-analyzer/README.md
Description: Application logged unavailable database or database connectivity symptoms.
Samples:
 - 2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool

Operational Summary
-------------------
Overall status: CRITICAL
Total lines scanned: 9
Known error matches: 7
Matched known error patterns: 7
Critical matched patterns: 5
Warning matched patterns: 2
Top categories: application (3), network (2), application_jvm (2)
Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
Timestamp coverage: parsed=9, unknown=0
Filters used: severity=None, category=None
Pattern catalog path: patterns.json
```

## Markdown Workflow

Generate a Markdown report from the collected log and attach it to the incident or problem ticket as supporting evidence:

```bash
python3 known_error_matcher.py \
  --file examples/sample-app.log \
  --patterns patterns.json \
  --format markdown \
  --output known-error-report.md
```

Review the report before attaching it. A `WARNING` or `CRITICAL` result should be correlated with service health, monitoring, recent changes, dependency status, and the referenced runbook.

## Operational Limitations

- Pattern matching is intentionally simple and predictable.
- A single log line can match multiple known error patterns.
- Matching is case-sensitive by default and can miss variants in a different case unless `--ignore-case` is used.
- Timestamp parsing is best-effort; unparseable timestamps are reported as `UNKNOWN`. Syslog-style timestamps carry no year, so the current year is assumed.
- Counts are raw log-line matches, not request rates, incident duration, or customer impact.
- `--top` limits the displayed findings and the top-N summary lists. The summary counts still reflect all matched patterns after filters.
- Large log files are read into memory; use scoped extracts for very large incidents.
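
The case-sensitivity limitation is easy to reproduce with Python's `re` module. The sketch below reuses the `disk_full` regex from the catalog example; the log line is hypothetical.

```python
import re

REGEX_TEXT = "No space left on device|disk full"
# Hypothetical log line that capitalizes the phrase differently from the regex.
LINE = "May 11 10:15:30 web01 app: ERROR Disk Full on /var"

case_sensitive = re.compile(REGEX_TEXT)
case_insensitive = re.compile(REGEX_TEXT, re.IGNORECASE)

print(bool(case_sensitive.search(LINE)))    # False: "Disk Full" differs in case
print(bool(case_insensitive.search(LINE)))  # True: case is ignored
```

Compiling with `re.IGNORECASE` corresponds to running the tool with `--ignore-case`.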

## Safety Notes

- The tool only reads the input log and pattern catalog and optionally writes a separate report.
- It does not require elevated privileges unless the chosen log path requires them.
- Do not include secrets, private hostnames, customer identifiers, tokens, or unsanitized production details in portfolio examples.
- Treat matches as prompts for operator review, not automated remediation instructions.
@@ -0,0 +1,9 @@
2026-05-11 10:15:30 app01 checkout-api[1842]: INFO request_id=a1 path=/checkout status=200 duration_ms=42
2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted
2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool
2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443
2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms
2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed
2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"
2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space
2026-05-11 10:17:03 app01 checkout-api[1842]: INFO healthcheck completed status=degraded
@@ -0,0 +1,97 @@
# Known Error Matcher Report

- Overall status: `CRITICAL`
- Known error pattern matches require operator review; logs alone do not prove root cause.

## Matched Known Errors

### [CRITICAL] database_unavailable - Database unavailable

- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:07`
- Last seen: `2026-05-11 10:16:07`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Application logged unavailable database or database connectivity symptoms.
- Samples:
  - `2026-05-11 10:16:07 app01 checkout-api[1842]: ERROR database unavailable while opening checkout connection pool`

### [CRITICAL] http_500 - HTTP 500

- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:02`
- Last seen: `2026-05-11 10:16:02`
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
- Description: Application or proxy logged HTTP 500 responses.
- Samples:
  - `2026-05-11 10:16:02 app01 checkout-api[1842]: ERROR HTTP 500 request_id=a2 path=/checkout customer_id=redacted`

### [CRITICAL] http_503 - HTTP 503

- Category: `application`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:31.456`
- Last seen: `2026-05-11 10:16:31.456`
- Runbook: `infra-run/runbooks/incidents/http-5xx.md`
- Description: Application or proxy logged HTTP 503 service unavailable responses.
- Samples:
  - `2026-05-11 10:16:31.456 app01 nginx[907]: 198.51.100.25 - - "GET /checkout HTTP/1.1" 503 312 "-" "synthetic-check"`

### [CRITICAL] java_out_of_memory - Java OutOfMemoryError

- Category: `application_jvm`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:40`
- Last seen: `2026-05-11 10:16:40`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Java process logged memory exhaustion symptoms.
- Samples:
  - `2026-05-11 10:16:40 app01 checkout-api[1842]: FATAL java.lang.OutOfMemoryError: Java heap space`

### [CRITICAL] ssl_handshake_exception - SSLHandshakeException

- Category: `application_jvm`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:22`
- Last seen: `2026-05-11 10:16:22`
- Runbook: `infra-run/scripts/python/jvm-log-analyzer/README.md`
- Description: Java TLS handshake exception was logged.
- Samples:
  - `2026-05-11T10:16:22 app01 checkout-api[1842]: ERROR javax.net.ssl.SSLHandshakeException: PKIX path building failed`

### [WARNING] connection_refused - Connection refused

- Category: `network`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:11`
- Last seen: `2026-05-11 10:16:11`
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
- Description: Client connection attempts were refused by the destination service or host.
- Samples:
  - `2026-05-11 10:16:11 app01 checkout-api[1842]: WARN upstream inventory-api connection refused at 10.20.30.40:8443`

### [WARNING] timeout - Timeout

- Category: `network`
- Occurrences: `1`
- First seen: `2026-05-11 10:16:15,123`
- Last seen: `2026-05-11 10:16:15,123`
- Runbook: `infra-run/scripts/bash/os-healthcheck/README.md`
- Description: Operation timed out and may require network, service, or dependency review.
- Samples:
  - `2026-05-11 10:16:15,123 app01 checkout-api[1842]: WARN payment provider request timed out after 5000 ms`

## Operational Summary

- Overall status: `CRITICAL`
- Total lines scanned: `9`
- Known error matches: `7`
- Matched known error patterns: `7`
- Critical matched patterns: `5`
- Warning matched patterns: `2`
- Top categories: application (3), network (2), application_jvm (2)
- Top matched known errors: database_unavailable (1), http_500 (1), http_503 (1), java_out_of_memory (1), ssl_handshake_exception (1), connection_refused (1), timeout (1)
- Timestamp coverage: parsed=`9`, unknown=`0`
- Filters used: severity=`None`, category=`None`
- Pattern catalog path: `patterns.json`
@@ -0,0 +1,10 @@
May 11 10:15:30 web01 kernel: EXT4-fs warning: No space left on device while writing /var/log/messages
May 11 10:15:35 web01 kernel: EXT4-fs error (device dm-0): Remounting filesystem read-only
May 11 10:15:41 web01 kernel: nginx invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE)
May 11 10:15:42 web01 kernel: Out of memory: Killed process 2281 (java) total-vm:2097152kB
May 11 10:16:11 web01 systemd[1]: Failed to start nginx.service - A high performance web server and a reverse proxy server.
May 11 10:16:12 web01 systemd[1]: Dependency failed for webapp.service - Local web application.
May 11 10:16:13 web01 systemd[1]: nginx.service: Start request repeated too quickly.
May 11 10:16:25 web01 sshd[3371]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=203.0.113.50 user=deploy
May 11 10:16:31 web01 sudo: deploy : command not allowed ; TTY=pts/0 ; PWD=/srv/app ; USER=root ; COMMAND=/bin/systemctl restart webapp
May 11 10:16:32 web01 sudo: deploy : permission denied while opening /etc/sudoers.d/webapp
@@ -0,0 +1,562 @@
#!/usr/bin/env python3
"""Match local logs against a JSON catalog of known operational error patterns."""

from __future__ import annotations

import argparse
import json
import re
import sys
from collections import Counter
from datetime import datetime
from pathlib import Path
from typing import Any


EXIT_OK = 0
EXIT_FINDINGS = 1
EXIT_INVALID = 2

UNKNOWN = "UNKNOWN"
VALID_SEVERITIES = {"WARNING", "CRITICAL"}
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1}

ISO_TIMESTAMP_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})([,.]\d{1,6})?\b"
)
SYSLOG_TIMESTAMP_RE = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\b")


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Scan a local log file for known operational error patterns."
    )
    parser.add_argument("--file", required=True, help="Local log file to scan.")
    parser.add_argument("--patterns", required=True, help="JSON known error pattern catalog.")
    parser.add_argument(
        "--format",
        choices=("text", "markdown", "json"),
        default="text",
        help="Report format. Default: text.",
    )
    parser.add_argument("--output", help="Write report to this path instead of stdout.")
    parser.add_argument(
        "--severity",
        choices=("warning", "critical"),
        help="Only include findings with this severity.",
    )
    parser.add_argument(
        "--category",
        help="Only include findings from this exact category.",
    )
    parser.add_argument(
        "--top",
        type=positive_int,
        default=10,
        help="Number of matched known errors and summary entries to display. Default: 10.",
    )
    parser.add_argument(
        "--max-samples",
        type=non_negative_int,
        default=3,
        help="Maximum sample lines per matched known error. Default: 3.",
    )
    parser.add_argument(
        "--ignore-case",
        action="store_true",
        help="Compile catalog regex patterns case-insensitively.",
    )
    return parser


def positive_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be a positive integer") from exc
    if number <= 0:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return number


def non_negative_int(value: str) -> int:
    try:
        number = int(value)
    except ValueError as exc:
        raise argparse.ArgumentTypeError("must be zero or a positive integer") from exc
    if number < 0:
        raise argparse.ArgumentTypeError("must be zero or a positive integer")
    return number


def read_text_file(path: Path, label: str) -> str:
    if not path.exists():
        raise OSError(f"{label} does not exist: {path}")
    if not path.is_file():
        raise OSError(f"{label} is not a regular file: {path}")
    try:
        text = path.read_text(encoding="utf-8", errors="replace")
    except PermissionError as exc:
        raise OSError(f"{label} is not readable: {path}") from exc
    except OSError as exc:
        raise OSError(f"unable to read {label} {path}: {exc}") from exc
    if text == "":
        raise ValueError(f"{label} is empty: {path}")
    return text


def load_pattern_catalog(path: Path, ignore_case: bool) -> list[dict[str, Any]]:
    text = read_text_file(path, "pattern catalog")
    try:
        catalog = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON in pattern catalog {path}: {exc}") from exc

    errors: list[str] = []
    if not isinstance(catalog, dict):
        raise ValueError("invalid pattern catalog: top-level JSON value must be an object")
    if "patterns" not in catalog:
        raise ValueError('invalid pattern catalog: missing top-level "patterns" field')
    if not isinstance(catalog["patterns"], list):
        raise ValueError('invalid pattern catalog: "patterns" must be a list')

    seen_ids: set[str] = set()
    compiled_patterns: list[dict[str, Any]] = []
    flags = re.IGNORECASE if ignore_case else 0

    for index, item in enumerate(catalog["patterns"], start=1):
        if not isinstance(item, dict):
            errors.append(f"pattern #{index}: must be an object")
            continue

        pattern_id = normalize_required_text(item, "id")
        name = normalize_required_text(item, "name")
        severity = normalize_required_text(item, "severity").upper()
        regex_text = normalize_required_text(item, "regex")

        if not pattern_id:
            errors.append(f"pattern #{index}: id is required and must be non-empty")
        elif pattern_id in seen_ids:
            errors.append(f"pattern #{index}: duplicate id {pattern_id}")
        else:
            seen_ids.add(pattern_id)

        if not name:
            errors.append(f"pattern {pattern_id or f'#{index}'}: name is required and must be non-empty")
        if severity not in VALID_SEVERITIES:
            errors.append(
                f"pattern {pattern_id or f'#{index}'}: severity must be WARNING or CRITICAL"
            )
        if not regex_text:
            errors.append(f"pattern {pattern_id or f'#{index}'}: regex is required and must be non-empty")

        compiled_regex = None
        if regex_text:
            try:
                compiled_regex = re.compile(regex_text, flags)
            except re.error as exc:
                errors.append(f"pattern {pattern_id or f'#{index}'}: invalid regex: {exc}")

        if pattern_id and name and severity in VALID_SEVERITIES and regex_text and compiled_regex:
            compiled_patterns.append(
                {
                    "id": pattern_id,
                    "name": name,
                    "severity": severity,
                    "regex_text": regex_text,
                    "regex": compiled_regex,
                    "category": normalize_optional_text(item, "category", UNKNOWN),
                    "runbook": normalize_optional_text(item, "runbook", ""),
                    "description": normalize_optional_text(item, "description", ""),
                }
            )

    if errors:
        raise ValueError("invalid pattern catalog:\n- " + "\n- ".join(errors))
    if not compiled_patterns:
        raise ValueError("invalid pattern catalog: no patterns configured")
    return compiled_patterns


def normalize_required_text(item: dict[str, Any], field: str) -> str:
    value = item.get(field)
    if not isinstance(value, str):
        return ""
    return value.strip()


def normalize_optional_text(item: dict[str, Any], field: str, default: str) -> str:
    value = item.get(field, default)
    if not isinstance(value, str):
        return default
    value = value.strip()
    return value if value else default


def parse_line_timestamp(line: str, syslog_year: int) -> tuple[datetime | None, str]:
    iso_match = ISO_TIMESTAMP_RE.search(line)
    if iso_match:
        fraction = iso_match.group(3) or ""
        raw = f"{iso_match.group(1)} {iso_match.group(2)}"
        parse_value = raw
        fmt = "%Y-%m-%d %H:%M:%S"
        if fraction:
            parse_value = f"{raw}.{fraction[1:].ljust(6, '0')[:6]}"
            fmt = "%Y-%m-%d %H:%M:%S.%f"
        try:
            return datetime.strptime(parse_value, fmt), raw + fraction
        except ValueError:
            return None, UNKNOWN

    syslog_match = SYSLOG_TIMESTAMP_RE.search(line)
    if syslog_match:
        raw = syslog_match.group(1)
        try:
            parsed = datetime.strptime(f"{syslog_year} {raw}", "%Y %b %d %H:%M:%S")
        except ValueError:
            return None, UNKNOWN
        return parsed, raw

    return None, UNKNOWN


def severity_filter_matches(selected: str | None, severity: str) -> bool:
    if selected is None:
        return True
    return selected.upper() == severity


def category_filter_matches(selected: str | None, category: str) -> bool:
    if selected is None:
        return True
    return selected == category


def update_seen(group: dict[str, Any], parsed_at: datetime | None, rendered_at: str) -> None:
    if parsed_at is None:
        return
    if group["first_seen"] is None or parsed_at < group["first_seen"][0]:
        group["first_seen"] = (parsed_at, rendered_at)
    if group["last_seen"] is None or parsed_at > group["last_seen"][0]:
        group["last_seen"] = (parsed_at, rendered_at)


def render_seen(value: tuple[datetime, str] | None) -> str:
    if value is None:
        return UNKNOWN
    return value[1] or value[0].strftime("%Y-%m-%d %H:%M:%S")


def append_limited(items: list[str], value: str, limit: int) -> None:
    if limit == 0:
        return
    if value in items:
        return
    if len(items) < limit:
        items.append(value)


def analyze_log(
    lines: list[str],
    patterns: list[dict[str, Any]],
    severity_filter: str | None,
    category_filter: str | None,
    top: int,
    max_samples: int,
    pattern_catalog_path: Path,
) -> dict[str, Any]:
    syslog_year = datetime.now().year
    groups: dict[str, dict[str, Any]] = {}
    top_categories = Counter()
    total_lines_scanned = 0
    parsed_timestamps = 0
    unknown_timestamps = 0

    for line in lines:
        total_lines_scanned += 1
        parsed_at, rendered_at = parse_line_timestamp(line, syslog_year)
        if parsed_at is None:
            unknown_timestamps += 1
        else:
            parsed_timestamps += 1

        for pattern in patterns:
            if not severity_filter_matches(severity_filter, pattern["severity"]):
                continue
            if not category_filter_matches(category_filter, pattern["category"]):
                continue
            if not pattern["regex"].search(line):
                continue

            group = groups.setdefault(
                pattern["id"],
                {
                    "id": pattern["id"],
                    "name": pattern["name"],
                    "severity": pattern["severity"],
                    "category": pattern["category"],
                    "runbook": pattern["runbook"],
                    "description": pattern["description"],
                    "regex": pattern["regex_text"],
                    "occurrences": 0,
                    "first_seen": None,
                    "last_seen": None,
                    "samples": [],
                },
            )
            group["occurrences"] += 1
            update_seen(group, parsed_at, rendered_at)
            append_limited(group["samples"], line, max_samples)
            top_categories[pattern["category"]] += 1

    findings = sorted(
        groups.values(),
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["id"],
        ),
    )

    rendered_findings = [
        {
            **finding,
            "first_seen": render_seen(finding["first_seen"]),
            "last_seen": render_seen(finding["last_seen"]),
        }
        for finding in findings
    ]

    critical_patterns = sum(1 for item in rendered_findings if item["severity"] == "CRITICAL")
    warning_patterns = sum(1 for item in rendered_findings if item["severity"] == "WARNING")
    total_matches = sum(item["occurrences"] for item in rendered_findings)

    overall_status = "OK"
    if critical_patterns > 0:
        overall_status = "CRITICAL"
    elif warning_patterns > 0:
        overall_status = "WARNING"

    return {
        "overall_status": overall_status,
        "total_lines_scanned": total_lines_scanned,
        "total_known_error_matches": total_matches,
        "matched_pattern_count": len(rendered_findings),
        "critical_matched_pattern_count": critical_patterns,
        "warning_matched_pattern_count": warning_patterns,
        "top_categories": [
            {"category": name, "count": count}
            for name, count in top_categories.most_common(top)
        ],
        "top_known_errors": [
            {"id": item["id"], "name": item["name"], "severity": item["severity"], "count": item["occurrences"]}
            for item in rendered_findings[:top]
        ],
        "timestamp_coverage": {
            "parsed_timestamps_count": parsed_timestamps,
            "unknown_timestamps_count": unknown_timestamps,
        },
        "filters_used": {
            "severity": severity_filter.lower() if severity_filter else None,
            "category": category_filter,
        },
        "pattern_catalog_path": str(pattern_catalog_path),
        "findings": rendered_findings[:top],
        "findings_total": len(rendered_findings),
    }


def render_top_pairs(items: list[dict[str, Any]], key: str) -> str:
|
||||||
|
if not items:
|
||||||
|
return "None"
|
||||||
|
return ", ".join(f"{item[key]} ({item['count']})" for item in items)
|
||||||
|
|
||||||
|
|
||||||
|
def render_text(report: dict[str, Any]) -> str:
|
||||||
|
lines = [
|
||||||
|
"Known Error Matcher",
|
||||||
|
"===================",
|
||||||
|
"",
|
||||||
|
f"Overall status: {report['overall_status']}",
|
||||||
|
"Known error pattern matches require operator review; logs alone do not prove root cause.",
|
||||||
|
"",
|
||||||
|
]
|
||||||
|
|
||||||
|
if report["findings"]:
|
||||||
|
for finding in report["findings"]:
|
||||||
|
lines.extend(
|
||||||
|
[
|
||||||
|
f"[{finding['severity']}] {finding['id']} - {finding['name']}",
|
||||||
|
f"Category: {finding['category']}",
|
||||||
|
f"Occurrences: {finding['occurrences']}",
|
||||||
|
f"First seen: {finding['first_seen']}",
|
||||||
|
f"Last seen: {finding['last_seen']}",
|
||||||
|
f"Runbook: {finding['runbook'] or 'None'}",
|
||||||
|
f"Description: {finding['description'] or 'None'}",
|
||||||
|
"Samples:",
|
||||||
|
]
|
||||||
|
)
|
||||||
|
if finding["samples"]:
|
||||||
|
for sample in finding["samples"]:
|
||||||
|
lines.append(f" - {sample}")
|
||||||
|
else:
|
||||||
|
lines.append(" - None")
|
||||||
|
lines.append("")
|
||||||
|
else:
|
||||||
|
lines.extend(["No known error patterns matched for the selected filters.", ""])
|
||||||
|
|
||||||
|
lines.extend(render_summary_lines(report, markdown=False))
|
||||||
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
def render_summary_lines(report: dict[str, Any], markdown: bool) -> list[str]:
|
||||||
|
if markdown:
|
||||||
|
return [
|
||||||
|
"## Operational Summary",
|
||||||
|
"",
|
||||||
|
f"- Overall status: `{report['overall_status']}`",
|
||||||
|
f"- Total lines scanned: `{report['total_lines_scanned']}`",
|
||||||
|
f"- Known error matches: `{report['total_known_error_matches']}`",
|
||||||
|
f"- Matched known error patterns: `{report['matched_pattern_count']}`",
|
||||||
|
f"- Critical matched patterns: `{report['critical_matched_pattern_count']}`",
|
||||||
|
f"- Warning matched patterns: `{report['warning_matched_pattern_count']}`",
|
||||||
|
"- Top categories: " + render_top_pairs(report["top_categories"], "category"),
|
||||||
|
"- Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
|
||||||
|
"- Timestamp coverage: "
|
||||||
|
f"parsed=`{report['timestamp_coverage']['parsed_timestamps_count']}`, "
|
||||||
|
f"unknown=`{report['timestamp_coverage']['unknown_timestamps_count']}`",
|
||||||
|
"- Filters used: "
|
||||||
|
f"severity=`{report['filters_used']['severity'] or 'None'}`, "
|
||||||
|
f"category=`{report['filters_used']['category'] or 'None'}`",
|
||||||
|
f"- Pattern catalog path: `{report['pattern_catalog_path']}`",
|
||||||
|
]
|
||||||
|
return [
|
||||||
|
"Operational Summary",
|
||||||
|
"-------------------",
|
||||||
|
f"Overall status: {report['overall_status']}",
|
||||||
|
f"Total lines scanned: {report['total_lines_scanned']}",
|
||||||
|
f"Known error matches: {report['total_known_error_matches']}",
|
||||||
|
f"Matched known error patterns: {report['matched_pattern_count']}",
|
||||||
|
f"Critical matched patterns: {report['critical_matched_pattern_count']}",
|
||||||
|
f"Warning matched patterns: {report['warning_matched_pattern_count']}",
|
||||||
|
"Top categories: " + render_top_pairs(report["top_categories"], "category"),
|
||||||
|
"Top matched known errors: " + render_top_pairs(report["top_known_errors"], "id"),
|
||||||
|
"Timestamp coverage: "
|
||||||
|
f"parsed={report['timestamp_coverage']['parsed_timestamps_count']}, "
|
||||||
|
f"unknown={report['timestamp_coverage']['unknown_timestamps_count']}",
|
||||||
|
"Filters used: "
|
||||||
|
f"severity={report['filters_used']['severity'] or 'None'}, "
|
||||||
|
f"category={report['filters_used']['category'] or 'None'}",
|
||||||
|
f"Pattern catalog path: {report['pattern_catalog_path']}",
|
||||||
|
]
|
||||||
|
|
||||||
|
|
||||||
|
def render_markdown(report: dict[str, Any]) -> str:
|
||||||
|
lines = [
|
||||||
|
"# Known Error Matcher Report",
|
||||||
|
"",
|
||||||
|
f"- Overall status: `{report['overall_status']}`",
|
||||||
|
"- Known error pattern matches require operator review; logs alone do not prove root cause.",
|
||||||
|
"",
|
||||||
|
]
|
||||||
|
|
||||||
|
if report["findings"]:
|
||||||
|
lines.extend(["## Matched Known Errors", ""])
|
||||||
|
for finding in report["findings"]:
|
||||||
|
lines.extend(
|
||||||
|
[
|
||||||
|
f"### [{finding['severity']}] {finding['id']} - {finding['name']}",
|
||||||
|
"",
|
||||||
|
f"- Category: `{finding['category']}`",
|
||||||
|
f"- Occurrences: `{finding['occurrences']}`",
|
||||||
|
f"- First seen: `{finding['first_seen']}`",
|
||||||
|
f"- Last seen: `{finding['last_seen']}`",
|
||||||
|
f"- Runbook: `{finding['runbook'] or 'None'}`",
|
||||||
|
f"- Description: {finding['description'] or 'None'}",
|
||||||
|
"- Samples:",
|
||||||
|
]
|
||||||
|
)
|
||||||
|
if finding["samples"]:
|
||||||
|
for sample in finding["samples"]:
|
||||||
|
lines.append(f" - `{sample}`")
|
||||||
|
else:
|
||||||
|
lines.append(" - `None`")
|
||||||
|
lines.append("")
|
||||||
|
else:
|
||||||
|
lines.extend(["## Matched Known Errors", "", "No known error patterns matched for the selected filters.", ""])
|
||||||
|
|
||||||
|
lines.extend(render_summary_lines(report, markdown=True))
|
||||||
|
return "\n".join(lines)
|
||||||
|
|
||||||
|
|
||||||
|
def render_json(report: dict[str, Any]) -> str:
|
||||||
|
return json.dumps(report, indent=2)
|
||||||
|
|
||||||
|
|
||||||
|
def write_output(text: str, output_path: str | None, protected_inputs: list[Path]) -> None:
|
||||||
|
if output_path is None:
|
||||||
|
print(text)
|
||||||
|
return
|
||||||
|
|
||||||
|
destination = Path(output_path)
|
||||||
|
try:
|
||||||
|
destination_resolved = destination.resolve()
|
||||||
|
for input_path in protected_inputs:
|
||||||
|
if input_path.resolve() == destination_resolved:
|
||||||
|
raise OSError("output path must not overwrite an input file")
|
||||||
|
except FileNotFoundError as exc:
|
||||||
|
raise OSError(f"unable to resolve output path {destination}: {exc}") from exc
|
||||||
|
|
||||||
|
try:
|
||||||
|
destination.write_text(text + ("\n" if not text.endswith("\n") else ""), encoding="utf-8")
|
||||||
|
except OSError as exc:
|
||||||
|
raise OSError(f"unable to write report to {destination}: {exc}") from exc
|
||||||
|
|
||||||
|
|
||||||
|
def determine_exit_code(report: dict[str, Any]) -> int:
|
||||||
|
if report["total_known_error_matches"] > 0:
|
||||||
|
return EXIT_FINDINGS
|
||||||
|
return EXIT_OK
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
parser = build_parser()
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
try:
|
||||||
|
log_path = Path(args.file)
|
||||||
|
pattern_path = Path(args.patterns)
|
||||||
|
log_text = read_text_file(log_path, "log file")
|
||||||
|
lines = log_text.splitlines()
|
||||||
|
patterns = load_pattern_catalog(pattern_path, args.ignore_case)
|
||||||
|
severity_filter = args.severity.upper() if args.severity else None
|
||||||
|
|
||||||
|
report = analyze_log(
|
||||||
|
lines=lines,
|
||||||
|
patterns=patterns,
|
||||||
|
severity_filter=severity_filter,
|
||||||
|
category_filter=args.category,
|
||||||
|
top=args.top,
|
||||||
|
max_samples=args.max_samples,
|
||||||
|
pattern_catalog_path=pattern_path,
|
||||||
|
)
|
||||||
|
|
||||||
|
if args.format == "text":
|
||||||
|
rendered = render_text(report)
|
||||||
|
elif args.format == "markdown":
|
||||||
|
rendered = render_markdown(report)
|
||||||
|
else:
|
||||||
|
rendered = render_json(report)
|
||||||
|
|
||||||
|
write_output(rendered, args.output, [log_path, pattern_path])
|
||||||
|
return determine_exit_code(report)
|
||||||
|
except (OSError, ValueError) as exc:
|
||||||
|
print(f"ERROR: {exc}", file=sys.stderr)
|
||||||
|
return EXIT_INVALID
|
||||||
|
except Exception as exc: # pragma: no cover - defensive operational fallback
|
||||||
|
print(f"ERROR: unexpected runtime failure: {exc}", file=sys.stderr)
|
||||||
|
return EXIT_INVALID
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
||||||
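The ranking and status roll-up at the end of `analyze_log` can be illustrated in isolation. The sketch below is not part of the commit: `SEVERITY_ORDER` is defined outside the lines shown in this diff, so the mapping used here is an assumption, and `rank_findings` and `roll_up_status` are hypothetical helper names introduced only for illustration.

```python
# Assumption: the real SEVERITY_ORDER is not visible in this part of the diff.
# Here a lower rank sorts first, so CRITICAL findings lead the report.
SEVERITY_ORDER = {"CRITICAL": 0, "WARNING": 1, "INFO": 2}


def rank_findings(findings):
    """Sort by severity rank, then by descending occurrences, then by id."""
    return sorted(
        findings,
        key=lambda item: (
            SEVERITY_ORDER[item["severity"]],
            -item["occurrences"],
            item["id"],
        ),
    )


def roll_up_status(findings):
    """Return the most severe status present, or OK when nothing matched."""
    severities = {item["severity"] for item in findings}
    if "CRITICAL" in severities:
        return "CRITICAL"
    if "WARNING" in severities:
        return "WARNING"
    return "OK"


sample = [
    {"id": "timeout", "severity": "WARNING", "occurrences": 40},
    {"id": "io_error", "severity": "CRITICAL", "occurrences": 2},
    {"id": "disk_full", "severity": "CRITICAL", "occurrences": 2},
    {"id": "oom_killer", "severity": "CRITICAL", "occurrences": 7},
]

ranked_ids = [item["id"] for item in rank_findings(sample)]
print(ranked_ids)              # ['oom_killer', 'disk_full', 'io_error', 'timeout']
print(roll_up_status(sample))  # CRITICAL
```

Under this assumed ordering, a WARNING pattern with many occurrences still ranks below every CRITICAL pattern, and ties on severity and count fall back to the pattern `id`, which keeps the output order deterministic between runs.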
@@ -0,0 +1,220 @@
{
  "patterns": [
    {
      "id": "disk_full",
      "name": "Disk full",
      "severity": "CRITICAL",
      "regex": "No space left on device|disk full|filesystem full",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem or application failed because free space was exhausted."
    },
    {
      "id": "inode_exhaustion",
      "name": "Inode exhaustion",
      "severity": "CRITICAL",
      "regex": "No space left on device.*inode|inode.*exhaust|free inodes.*0",
      "category": "storage",
      "runbook": "infra-run/scripts/bash/disk-full/README.md",
      "description": "Filesystem may have free blocks but too few available inodes."
    },
    {
      "id": "read_only_filesystem",
      "name": "Read-only filesystem",
      "severity": "CRITICAL",
      "regex": "read-only file system|read-only filesystem|Remounting filesystem read-only",
      "category": "storage",
      "runbook": "infra-run/runbooks/incidents/read-only-filesystem.md",
      "description": "Filesystem writes failed because the mount was read-only or remounted read-only."
    },
    {
      "id": "io_error",
      "name": "I/O error",
      "severity": "CRITICAL",
      "regex": "\\bI/O error\\b|Buffer I/O error|blk_update_request.*I/O error",
      "category": "storage",
      "runbook": "infra-run/runbooks/incidents/storage-io-error.md",
      "description": "Kernel or application reported storage I/O errors that require device and filesystem review."
    },
    {
      "id": "out_of_memory",
      "name": "Out of memory",
      "severity": "CRITICAL",
      "regex": "\\bout of memory\\b|Cannot allocate memory",
      "category": "memory",
      "runbook": "infra-run/runbooks/incidents/memory-pressure.md",
      "description": "Process or host reported memory exhaustion symptoms."
    },
    {
      "id": "oom_killer",
      "name": "OOM killer invoked",
      "severity": "CRITICAL",
      "regex": "oom-killer|Killed process \\d+|Out of memory: Killed process",
      "category": "memory",
      "runbook": "infra-run/runbooks/incidents/oom-killer.md",
      "description": "Kernel OOM killer activity was logged and affected processes should be reviewed."
    },
    {
      "id": "segmentation_fault",
      "name": "Segmentation fault",
      "severity": "CRITICAL",
      "regex": "segmentation fault|segfault",
      "category": "process",
      "runbook": "infra-run/runbooks/incidents/process-crash.md",
      "description": "A process crash pattern was logged."
    },
    {
      "id": "connection_refused",
      "name": "Connection refused",
      "severity": "WARNING",
      "regex": "connection refused|ConnectException: Connection refused",
      "category": "network",
      "runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
      "description": "Client connection attempts were refused by the destination service or host."
    },
    {
      "id": "connection_reset",
      "name": "Connection reset",
      "severity": "WARNING",
      "regex": "connection reset|Connection reset by peer",
      "category": "network",
      "runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
      "description": "Established network connections were reset and require endpoint review."
    },
    {
      "id": "timeout",
      "name": "Timeout",
      "severity": "WARNING",
      "regex": "\\btimeout\\b|timed out|TimeoutException|SocketTimeoutException",
      "category": "network",
      "runbook": "infra-run/scripts/bash/os-healthcheck/README.md",
      "description": "Operation timed out and may require network, service, or dependency review."
    },
    {
      "id": "dns_resolution_failure",
      "name": "DNS resolution failure",
      "severity": "WARNING",
      "regex": "Temporary failure in name resolution|Name or service not known|NXDOMAIN|UnknownHostException|could not resolve host",
      "category": "network",
      "runbook": "infra-run/runbooks/incidents/dns-resolution.md",
      "description": "Name resolution failed for a host or service dependency."
    },
    {
      "id": "certificate_expired",
      "name": "Certificate expired",
      "severity": "CRITICAL",
      "regex": "certificate expired|CertificateExpiredException|certificate has expired|notAfter",
      "category": "tls",
      "runbook": "infra-run/runbooks/incidents/certificate-expired.md",
      "description": "TLS certificate expiry was logged and certificate state should be reviewed."
    },
    {
      "id": "tls_handshake_failed",
      "name": "TLS handshake failed",
      "severity": "WARNING",
      "regex": "TLS handshake failed|SSL handshake failed|handshake_failure",
      "category": "tls",
      "runbook": "infra-run/runbooks/incidents/tls-handshake.md",
      "description": "TLS handshake failed and may require certificate, protocol, or trust-store review."
    },
    {
      "id": "authentication_failure",
      "name": "Authentication failure",
      "severity": "WARNING",
      "regex": "authentication failure|Failed password|authentication failed",
      "category": "security",
      "runbook": "infra-run/scripts/python/auth-log-audit/README.md",
      "description": "Authentication failures were logged and may require access review."
    },
    {
      "id": "permission_denied",
      "name": "Permission denied",
      "severity": "WARNING",
      "regex": "permission denied|access denied|denied by policy",
      "category": "security",
      "runbook": "infra-run/runbooks/incidents/permission-denied.md",
      "description": "Access or permission denial was logged."
    },
    {
      "id": "invalid_user",
      "name": "Invalid user",
      "severity": "WARNING",
      "regex": "Invalid user|invalid user|user unknown|User not known",
      "category": "security",
      "runbook": "infra-run/scripts/python/auth-log-audit/README.md",
      "description": "Log contains attempts involving invalid or unknown users."
    },
    {
      "id": "java_out_of_memory",
      "name": "Java OutOfMemoryError",
      "severity": "CRITICAL",
      "regex": "OutOfMemoryError|Java heap space|GC overhead limit exceeded",
      "category": "application_jvm",
      "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
      "description": "Java process logged memory exhaustion symptoms."
    },
    {
      "id": "ssl_handshake_exception",
      "name": "SSLHandshakeException",
      "severity": "CRITICAL",
      "regex": "SSLHandshakeException|javax\\.net\\.ssl\\.SSLHandshakeException",
      "category": "application_jvm",
      "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
      "description": "Java TLS handshake exception was logged."
    },
    {
      "id": "database_unavailable",
      "name": "Database unavailable",
      "severity": "CRITICAL",
      "regex": "database unavailable|database is unavailable|SQLRecoverableException|CommunicationsException|connection pool exhausted",
      "category": "application",
      "runbook": "infra-run/scripts/python/jvm-log-analyzer/README.md",
      "description": "Application logged unavailable database or database connectivity symptoms."
    },
    {
      "id": "http_500",
      "name": "HTTP 500",
      "severity": "CRITICAL",
      "regex": "\\bHTTP\\s+500\\b|\\bstatus=500\\b|\\s500\\s",
      "category": "application",
      "runbook": "infra-run/runbooks/incidents/http-5xx.md",
      "description": "Application or proxy logged HTTP 500 responses."
    },
    {
      "id": "http_503",
      "name": "HTTP 503",
      "severity": "CRITICAL",
      "regex": "\\bHTTP\\s+503\\b|\\bstatus=503\\b|\\s503\\s|Service Unavailable",
      "category": "application",
      "runbook": "infra-run/runbooks/incidents/http-5xx.md",
      "description": "Application or proxy logged HTTP 503 service unavailable responses."
    },
    {
      "id": "service_failed",
      "name": "Systemd service failed",
      "severity": "CRITICAL",
      "regex": "Failed to start .*\\.service|entered failed state|Unit .*\\.service failed|Main process exited.*status=",
      "category": "systemd",
      "runbook": "infra-run/scripts/python/journal-analyzer/README.md",
      "description": "Systemd logged a failed service or failed service start."
    },
    {
      "id": "dependency_failed",
      "name": "Systemd dependency failed",
      "severity": "CRITICAL",
      "regex": "Dependency failed for|dependency failed",
      "category": "systemd",
      "runbook": "infra-run/scripts/python/journal-analyzer/README.md",
      "description": "Systemd logged a unit dependency failure."
    },
    {
      "id": "start_request_repeated",
      "name": "Start request repeated too quickly",
      "severity": "WARNING",
      "regex": "Start request repeated too quickly|start request repeated too quickly",
      "category": "systemd",
      "runbook": "infra-run/scripts/python/journal-analyzer/README.md",
      "description": "Systemd throttled service restarts after repeated start failures."
    }
  ]
}
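Catalog regexes are easiest to review by running them against a few representative log lines. The sketch below is not part of the commit. It copies two `regex` values from the catalog above and checks which ids match each line. How the tool applies the patterns is not visible in this diff, so the use of `re.search` and of `re.IGNORECASE` for the `ignore_case` option is an assumption, and `matching_ids` is a hypothetical helper.

```python
import re

# Two entries copied from the catalog; JSON "\\b" decodes to the regex token \b.
CATALOG_EXCERPT = [
    {"id": "http_500", "regex": r"\bHTTP\s+500\b|\bstatus=500\b|\s500\s"},
    {"id": "timeout", "regex": r"\btimeout\b|timed out|TimeoutException|SocketTimeoutException"},
]


def matching_ids(line, patterns, ignore_case=False):
    """Return the ids of all patterns found anywhere in the line.

    Assumption: the tool searches each line with re.search and applies
    re.IGNORECASE when its ignore-case option is set.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return [p["id"] for p in patterns if re.search(p["regex"], line, flags)]


print(matching_ids('GET /api 500 1532 "-"', CATALOG_EXCERPT))        # ['http_500']
print(matching_ids("sent 500 bytes to client", CATALOG_EXCERPT))     # ['http_500']
print(matching_ids("Read Timed Out after 30s", CATALOG_EXCERPT))     # []
print(matching_ids("Read Timed Out after 30s", CATALOG_EXCERPT, ignore_case=True))  # ['timeout']
```

The second call shows that the `\s500\s` alternative matches any standalone `500`, including byte counts or durations, so `http_500` hits on free-form logs deserve a look at the sample lines before they are treated as HTTP errors. The last two calls show why case-insensitive matching matters for entries written in a single case.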