From e851568c8c5c9528c616c96f3ed0a1332d8b5c09 Mon Sep 17 00:00:00 2001 From: Mateusz Suski Date: Mon, 11 May 2026 18:49:00 +0000 Subject: [PATCH] Add standalone Bash incident check scripts --- CHANGELOG.md | 1 + README.md | 1 + ROADMAP.md | 1 + infra-run/README.md | 1 + infra-run/ROADMAP.md | 1 + infra-run/scripts/bash/README.md | 21 ++- .../scripts/bash/incident-checks/README.md | 108 ++++++++++++ .../check_certificate_expiry.sh | 134 +++++++++++++++ .../incident-checks/check_dns_connectivity.sh | 161 ++++++++++++++++++ .../check_failed_ssh_logins.sh | 124 ++++++++++++++ .../check_filesystem_readonly.sh | 89 ++++++++++ .../bash/incident-checks/check_high_cpu.sh | 146 ++++++++++++++++ .../incident-checks/check_high_memory_oom.sh | 138 +++++++++++++++ .../bash/incident-checks/check_inode_usage.sh | 103 +++++++++++ .../incident-checks/check_jvm_threads_heap.sh | 134 +++++++++++++++ .../incident-checks/check_ntp_time_drift.sh | 121 +++++++++++++ .../check_service_restart_loop.sh | 111 ++++++++++++ .../examples/certificate-expiry.sample.txt | 20 +++ .../examples/dns-connectivity.sample.txt | 23 +++ .../examples/failed-ssh-logins.sample.txt | 26 +++ .../examples/filesystem-readonly.sample.txt | 16 ++ .../examples/high-cpu.sample.txt | 22 +++ .../examples/high-memory-oom.sample.txt | 25 +++ .../examples/inode-usage.sample.txt | 22 +++ .../examples/jvm-threads-heap.sample.txt | 30 ++++ .../examples/ntp-time-drift.sample.txt | 23 +++ .../examples/service-restart-loop.sample.txt | 27 +++ 27 files changed, 1623 insertions(+), 6 deletions(-) create mode 100644 infra-run/scripts/bash/incident-checks/README.md create mode 100755 infra-run/scripts/bash/incident-checks/check_certificate_expiry.sh create mode 100755 infra-run/scripts/bash/incident-checks/check_dns_connectivity.sh create mode 100755 infra-run/scripts/bash/incident-checks/check_failed_ssh_logins.sh create mode 100755 infra-run/scripts/bash/incident-checks/check_filesystem_readonly.sh create mode 100755 
infra-run/scripts/bash/incident-checks/check_high_cpu.sh create mode 100755 infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh create mode 100755 infra-run/scripts/bash/incident-checks/check_inode_usage.sh create mode 100755 infra-run/scripts/bash/incident-checks/check_jvm_threads_heap.sh create mode 100755 infra-run/scripts/bash/incident-checks/check_ntp_time_drift.sh create mode 100755 infra-run/scripts/bash/incident-checks/check_service_restart_loop.sh create mode 100644 infra-run/scripts/bash/incident-checks/examples/certificate-expiry.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/dns-connectivity.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/failed-ssh-logins.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/filesystem-readonly.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/high-cpu.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/high-memory-oom.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/inode-usage.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/jvm-threads-heap.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/ntp-time-drift.sample.txt create mode 100644 infra-run/scripts/bash/incident-checks/examples/service-restart-loop.sample.txt diff --git a/CHANGELOG.md b/CHANGELOG.md index d5a9343..b60d04a 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,6 +11,7 @@ - `jvm-log-analyzer` for JVM application log summaries. - `journal-analyzer` for exported `journalctl` log review. - `known-error-matcher` with JSON-based known error patterns. +- Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics. 
- Repository-level Codex guidance: - `AGENTS.md` - `docs/codex/README.md` diff --git a/README.md b/README.md index 48fcb6f..eee2af3 100644 --- a/README.md +++ b/README.md @@ -30,6 +30,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope - [infra-run](./infra-run/) - the main implemented project in this repository. - [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers. +- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, designed for copy-to-server triage and ticket evidence. - [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks. - [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples. - [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples. diff --git a/ROADMAP.md b/ROADMAP.md index 32f4386..ff411b0 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -20,6 +20,7 @@ This file keeps future portfolio ideas in one place so empty folders do not look ## Implemented Portfolio Additions +- Standalone Bash incident checks under `infra-run/scripts/bash/incident-checks/` for common Linux incident triage and ticket evidence. - Python operational log analysis suite under `infra-run/scripts/python/`: - `incident-log-summary` - `log-diff-checker` diff --git a/infra-run/README.md b/infra-run/README.md index 295047d..ea10cac 100644 --- a/infra-run/README.md +++ b/infra-run/README.md @@ -9,6 +9,7 @@ The goal is to show operational judgment, not to ship a universal automation pro ### Bash Operational Scripts - [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts. 
+- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, and JVM diagnostics. - [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow. - [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples. - [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples. diff --git a/infra-run/ROADMAP.md b/infra-run/ROADMAP.md index 8bc7627..0e1a1d1 100644 --- a/infra-run/ROADMAP.md +++ b/infra-run/ROADMAP.md @@ -16,6 +16,7 @@ This file tracks planned `infra-run` additions without presenting them as comple ## Implemented Additions +- `infra-run/scripts/bash/incident-checks/` - standalone read-only Bash checks for CPU, memory/OOM, service restart loops, failed SSH logins, TLS certificate expiry, DNS connectivity, time sync drift, read-only filesystems, inode pressure, and JVM process diagnostics. - `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files. - `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review. - `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review. diff --git a/infra-run/scripts/bash/README.md b/infra-run/scripts/bash/README.md index 0c3846e..77fd12e 100644 --- a/infra-run/scripts/bash/README.md +++ b/infra-run/scripts/bash/README.md @@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. 
T ```mermaid flowchart TD A["bash"] --> B["os-healthcheck"] - A --> C["disk-full"] - A --> D["veritas"] - A --> E["gpfs"] + A --> C["incident-checks"] + A --> D["disk-full"] + A --> E["veritas"] + A --> F["gpfs"] B --> B1["Host diagnostics"] - C --> C1["Incident workflow"] - D --> D1["VxVM and VCS change flow"] - E --> E1["Spectrum Scale expansion flow"] + C --> C1["Standalone triage checks"] + D --> D1["Incident workflow"] + E --> E1["VxVM and VCS change flow"] + F --> F1["Spectrum Scale expansion flow"] ``` ## Scripts @@ -23,6 +25,7 @@ flowchart TD - `os-healthcheck/service_check.sh` - critical service status check. - `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`. - `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics. +- `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics. ## Usage @@ -37,6 +40,12 @@ cd infra-run/scripts/bash/os-healthcheck ./system_report.sh ./network_troubleshoot.sh ./network_troubleshoot.sh google.com + +cd ../incident-checks +./check_high_cpu.sh +./check_high_memory_oom.sh --since "24 hours ago" +./check_service_restart_loop.sh --service sshd +./check_certificate_expiry.sh --host example.com ``` ## Standards diff --git a/infra-run/scripts/bash/incident-checks/README.md b/infra-run/scripts/bash/incident-checks/README.md new file mode 100644 index 0000000..7851e8f --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/README.md @@ -0,0 +1,108 @@ +# Bash Incident Checks + +Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident, run without repository context, and pasted into an incident or change ticket as evidence. + +They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing. 
+ +## Scripts + +- `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes. +- `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence. +- `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines. +- `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs. +- `check_certificate_expiry.sh` - remote or local TLS certificate expiry check. +- `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints. +- `check_ntp_time_drift.sh` - time sync status and offset evidence when available. +- `check_filesystem_readonly.sh` - read-only filesystem detection. +- `check_inode_usage.sh` - inode pressure and top affected mount points. +- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics. + +## Usage Examples + +```bash +./check_high_cpu.sh +./check_high_cpu.sh --warning 70 --critical 90 --top 15 + +./check_high_memory_oom.sh +./check_high_memory_oom.sh --since "6 hours ago" --top 5 + +./check_service_restart_loop.sh --service nginx +./check_service_restart_loop.sh --service app.service --since "30 minutes ago" + +./check_failed_ssh_logins.sh +./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25 + +./check_certificate_expiry.sh --host example.com +./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com +./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt + +./check_dns_connectivity.sh --host example.com +./check_dns_connectivity.sh --host db.example.internal --port 5432 + +./check_ntp_time_drift.sh +./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000 + +./check_filesystem_readonly.sh +./check_filesystem_readonly.sh --include-system + +./check_inode_usage.sh +./check_inode_usage.sh --warning 75 --critical 90 + +./check_jvm_threads_heap.sh +./check_jvm_threads_heap.sh --pid 1234 +./check_jvm_threads_heap.sh 
--match app-name +``` + +## Exit Codes + +- `0` - OK. +- `1` - WARNING or operational issue detected. +- `2` - invalid input or missing required dependency. +- `3` - CRITICAL issue detected. + +## Supported Platforms + +These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed. + +Some data sources vary by distribution: + +- RHEL-like systems often use `/var/log/secure` and `/var/log/messages`. +- Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`. +- systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available. + +## Safety Notes + +- Scripts are read-only. +- Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files. +- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions. +- Treat output as triage evidence, not as complete root-cause analysis. + +## Dependency Notes + +Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`. + +Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files. + +## Copy-To-Server Example + +```bash +scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/ +ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"' +``` + +Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations. 
+ +## Sample Outputs + +Sanitized examples are available in [examples](./examples/): + +- `high-cpu.sample.txt` +- `high-memory-oom.sample.txt` +- `service-restart-loop.sample.txt` +- `failed-ssh-logins.sample.txt` +- `certificate-expiry.sample.txt` +- `dns-connectivity.sample.txt` +- `ntp-time-drift.sample.txt` +- `filesystem-readonly.sample.txt` +- `inode-usage.sample.txt` +- `jvm-threads-heap.sample.txt` diff --git a/infra-run/scripts/bash/incident-checks/check_certificate_expiry.sh b/infra-run/scripts/bash/incident-checks/check_certificate_expiry.sh new file mode 100755 index 0000000..1b676d9 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_certificate_expiry.sh @@ -0,0 +1,134 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +host_name="" +port=443 +cert_file="" +warning_days=30 +critical_days=7 +servername="" + +usage() { + cat <<'USAGE' +Usage: check_certificate_expiry.sh (--host HOST [--port PORT] | --file CERT_FILE) [--servername SNI_NAME] [--warning-days DAYS] [--critical-days DAYS] [--help] + +Check TLS certificate expiry for a remote endpoint or local certificate file. 
+USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +while (($# > 0)); do + case "$1" in + --host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;; + --port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;; + --file) [[ $# -ge 2 ]] || { printf 'CRITICAL: --file requires a value\n'; exit 2; }; cert_file="$2"; shift 2 ;; + --servername) [[ $# -ge 2 ]] || { printf 'CRITICAL: --servername requires a value\n'; exit 2; }; servername="$2"; shift 2 ;; + --warning-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-days requires a value\n'; exit 2; }; warning_days="$2"; shift 2 ;; + --critical-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-days requires a value\n'; exit 2; }; critical_days="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +if ! command -v openssl >/dev/null 2>&1; then + printf 'CRITICAL: required command not found: openssl\n' + exit 2 +fi +for value in "$port" "$warning_days" "$critical_days"; do + if ! is_number "$value"; then + printf 'CRITICAL: numeric option expected, got: %s\n' "$value" + exit 2 + fi +done +if ((critical_days >= warning_days)); then + printf 'CRITICAL: --critical-days must be lower than --warning-days\n' + exit 2 +fi +if [[ -n "$host_name" && -n "$cert_file" ]]; then + printf 'CRITICAL: use either --host or --file, not both\n' + exit 2 +fi +if [[ -z "$host_name" && -z "$cert_file" ]]; then + printf 'CRITICAL: either --host or --file is required\n' + usage + exit 2 +fi +if [[ -n "$cert_file" && ! -r "$cert_file" ]]; then + printf 'CRITICAL: certificate file is not readable: %s\n' "$cert_file" + exit 2 +fi +if [[ -z "$servername" ]]; then + servername="$host_name" +fi + +tmp_cert="$(mktemp)" +trap 'rm -f "$tmp_cert"' EXIT + +if [[ -n "$host_name" ]]; then + if ! 
openssl s_client -connect "${host_name}:${port}" -servername "$servername" -showcerts </dev/null 2>/dev/null \ + | openssl x509 -outform PEM > "$tmp_cert" 2>/dev/null; then + printf 'CRITICAL: unable to retrieve certificate from %s:%s\n' "$host_name" "$port" + exit 2 + fi +else + cp "$cert_file" "$tmp_cert" +fi + +subject="$(openssl x509 -in "$tmp_cert" -noout -subject 2>/dev/null | sed 's/^subject=//')" +issuer="$(openssl x509 -in "$tmp_cert" -noout -issuer 2>/dev/null | sed 's/^issuer=//')" +not_before="$(openssl x509 -in "$tmp_cert" -noout -startdate 2>/dev/null | sed 's/^notBefore=//')" +not_after="$(openssl x509 -in "$tmp_cert" -noout -enddate 2>/dev/null | sed 's/^notAfter=//')" +san_text="$(openssl x509 -in "$tmp_cert" -noout -ext subjectAltName 2>/dev/null | sed '1d' | sed 's/^ *//')" + +expiry_epoch="$(date -d "$not_after" +%s 2>/dev/null || printf '')" +now_epoch="$(date +%s)" +if [[ -z "$expiry_epoch" ]]; then + printf 'CRITICAL: unable to parse certificate expiry date: %s\n' "$not_after" + exit 2 +fi +seconds_left=$((expiry_epoch - now_epoch)) +days_left=$((seconds_left / 86400)) + +status="OK" +exit_code=0 +if ((days_left < critical_days)); then + status="CRITICAL" + exit_code=3 +elif ((days_left < warning_days)); then + status="WARNING" + exit_code=1 +fi + +target="$cert_file" +if [[ -n "$host_name" ]]; then + target="${host_name}:${port}" +fi + +printf '%s: Certificate for %s expires in %s day(s)\n\n' "$status" "$target" "$days_left" + +printf 'Certificate details:\n' +printf 'Subject: %s\n' "$subject" +printf 'Issuer: %s\n' "$issuer" +printf 'notBefore: %s\n' "$not_before" +printf 'notAfter: %s\n' "$not_after" +printf 'SAN/CN: %s\n' "${san_text:-$subject}" +printf '\n' + +printf 'Evidence:\n' +printf 'Target: %s\n' "$target" +printf 'SNI: %s\n' "${servername:-not used}" +printf 'Thresholds: warning=%s days critical=%s days\n\n' "$warning_days" "$critical_days" + +printf 'Recommended next steps:\n' +printf -- '- Renew certificate before the operational threshold 
is breached\n' +printf -- '- Check the full chain and intermediate certificates\n' +printf -- '- Check the load balancer, ingress, or reverse proxy serving this certificate\n' +printf -- '- Verify monitoring threshold and alert ownership\n' +printf -- '- Attach this output to incident or change ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_dns_connectivity.sh b/infra-run/scripts/bash/incident-checks/check_dns_connectivity.sh new file mode 100755 index 0000000..449f14e --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_dns_connectivity.sh @@ -0,0 +1,161 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +host_name="" +port="" +count=3 +timeout_seconds=3 + +usage() { + cat <<'USAGE' +Usage: check_dns_connectivity.sh --host HOST [--port PORT] [--count COUNT] [--timeout SECONDS] [--help] + +Check DNS resolution, ping, optional TCP connectivity, and local route hints. +USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +while (($# > 0)); do + case "$1" in + --host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;; + --port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;; + --count) [[ $# -ge 2 ]] || { printf 'CRITICAL: --count requires a value\n'; exit 2; }; count="$2"; shift 2 ;; + --timeout) [[ $# -ge 2 ]] || { printf 'CRITICAL: --timeout requires a value\n'; exit 2; }; timeout_seconds="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +if [[ -z "$host_name" ]]; then + printf 'CRITICAL: --host is required\n' + usage + exit 2 +fi +for value in "$count" "$timeout_seconds"; do + if ! is_number "$value"; then + printf 'CRITICAL: numeric option expected, got: %s\n' "$value" + exit 2 + fi +done +if [[ -n "$port" ]] && ! is_number "$port"; then + printf 'CRITICAL: --port must be numeric\n' + exit 2 +fi +if ! 
command -v getent >/dev/null 2>&1; then + printf 'CRITICAL: required command not found: getent\n' + exit 2 +fi + +dns_ok=0 +ping_ok=0 +tcp_ok=0 +tcp_checked=0 +tcp_note="" +ping_output="$(mktemp)" +trap 'rm -f "$ping_output"' EXIT + +dns_output="$(getent hosts "$host_name" 2>/dev/null || true)" +if [[ -n "$dns_output" ]]; then + dns_ok=1 +fi + +if command -v ping >/dev/null 2>&1; then + if ping -c "$count" -W "$timeout_seconds" "$host_name" > "$ping_output" 2>&1; then + ping_ok=1 + fi +else + printf 'WARNING: ping command not available; ICMP check skipped\n' > "$ping_output" +fi + +if [[ -n "$port" ]]; then + tcp_checked=1 + if command -v timeout >/dev/null 2>&1; then + if timeout "$timeout_seconds" bash -c ":</dev/tcp/${host_name}/${port}" >/dev/null 2>&1; then + tcp_ok=1 + fi + else + tcp_note="WARNING: timeout command not available; TCP /dev/tcp check used without external timeout" + if bash -c ":</dev/tcp/${host_name}/${port}" >/dev/null 2>&1; then + tcp_ok=1 + fi + fi +fi + +status="OK" +exit_code=0 +if ((dns_ok == 0)); then + status="CRITICAL" + exit_code=3 +elif ((tcp_checked == 1 && tcp_ok == 0)); then + status="CRITICAL" + exit_code=3 +elif command -v ping >/dev/null 2>&1 && ((ping_ok == 0)); then + status="WARNING" + exit_code=1 +fi + +printf '%s: DNS=%s ping=%s' "$status" "$([[ "$dns_ok" == 1 ]] && printf OK || printf FAILED)" "$([[ "$ping_ok" == 1 ]] && printf OK || printf UNKNOWN_OR_FAILED)" +if ((tcp_checked == 1)); then + printf ' tcp_%s=%s' "$port" "$([[ "$tcp_ok" == 1 ]] && printf OK || printf FAILED)" +fi +printf '\n\n' + +printf 'DNS result:\n' +if [[ -n "$dns_output" ]]; then + printf '%s\n' "$dns_output" +else + printf 'CRITICAL: getent hosts returned no records for %s\n' "$host_name" +fi +printf '\n' + +printf 'Ping result:\n' +if [[ -s "$ping_output" ]]; then + cat "$ping_output" +else + printf 'WARNING: ping result unavailable or ping command missing\n' +fi +printf '\n' + +if ((tcp_checked == 1)); then + printf 'TCP port result:\n' + if ((tcp_ok == 1)); then + printf 'OK: TCP connection to %s:%s 
succeeded\n' "$host_name" "$port" + else + printf 'CRITICAL: TCP connection to %s:%s failed or timed out\n' "$host_name" "$port" + fi + if [[ -n "$tcp_note" ]]; then + printf '%s\n' "$tcp_note" + fi + printf '\n' +fi + +printf 'Local network hints:\n' +if command -v ip >/dev/null 2>&1; then + ip route show default 2>/dev/null || printf 'WARNING: unable to read default route\n' +elif command -v ss >/dev/null 2>&1; then + ss -tuln 2>/dev/null | head -n 20 || printf 'WARNING: unable to read socket summary\n' +else + printf 'WARNING: ip and ss are unavailable; local network hints skipped\n' +fi +printf '\n' + +printf 'Evidence:\n' +printf 'Host: %s count=%s timeout=%ss port=%s\n' "$host_name" "$count" "$timeout_seconds" "${port:-not checked}" +if [[ -n "$tcp_note" ]]; then + printf '%s\n' "$tcp_note" +fi +printf '\n' + +printf 'Recommended next steps:\n' +printf -- '- Verify the DNS record and resolver path\n' +printf -- '- Check firewall, routing, security group, or proxy policy\n' +printf -- '- Compare results from another host or network segment\n' +printf -- '- Check application endpoint health after network reachability is confirmed\n' +printf -- '- Attach this output to incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_failed_ssh_logins.sh b/infra-run/scripts/bash/incident-checks/check_failed_ssh_logins.sh new file mode 100755 index 0000000..cc54fde --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_failed_ssh_logins.sh @@ -0,0 +1,124 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +since_value="1 hour ago" +warning_count=20 +critical_count=50 +top_count=10 + +usage() { + cat <<'USAGE' +Usage: check_failed_ssh_logins.sh [--since TEXT] [--warning COUNT] [--critical COUNT] [--top N] [--help] + +Detect failed SSH login bursts from journal or readable authentication logs. 
+USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +while (($# > 0)); do + case "$1" in + --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;; + --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;; + --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;; + --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +for value in "$warning_count" "$critical_count" "$top_count"; do + if ! is_number "$value"; then + printf 'CRITICAL: numeric option expected, got: %s\n' "$value" + exit 2 + fi +done +if ((warning_count >= critical_count)); then + printf 'CRITICAL: --warning must be lower than --critical\n' + exit 2 +fi + +tmp_log="$(mktemp)" +trap 'rm -f "$tmp_log"' EXIT +log_source="journalctl" + +if command -v journalctl >/dev/null 2>&1; then + journalctl --since "$since_value" --no-pager 2>/dev/null \ + | grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' > "$tmp_log" || true +else + log_source="log file fallback" +fi + +if [[ ! 
-s "$tmp_log" ]]; then + for log_file in /var/log/auth.log /var/log/secure /var/log/messages; do + if [[ -r "$log_file" ]]; then + grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' "$log_file" >> "$tmp_log" || true + log_source="$log_file" + fi + done +fi + +attempts="$(wc -l < "$tmp_log" | awk '{print $1}')" + +status="OK" +exit_code=0 +if ((attempts >= critical_count)); then + status="CRITICAL" + exit_code=3 +elif ((attempts >= warning_count)); then + status="WARNING" + exit_code=1 +fi + +printf '%s: Found %s failed SSH login attempt(s) for requested window\n\n' "$status" "$attempts" + +printf 'Top source IPs:\n' +if [[ -s "$tmp_log" ]]; then + grep -Eo 'from ([0-9]{1,3}\.){3}[0-9]{1,3}|rhost=([0-9]{1,3}\.){3}[0-9]{1,3}' "$tmp_log" \ + | sed -E 's/^(from |rhost=)//' \ + | sort | uniq -c | sort -rn | head -n "$top_count" || true +else + printf 'OK: no failed SSH attempts found in available sources\n' +fi +printf '\n' + +printf 'Top attempted users:\n' +if [[ -s "$tmp_log" ]]; then + sed -nE 's/.*Invalid user ([^ ]+).*/\1/p; s/.*Failed password for invalid user ([^ ]+).*/\1/p; s/.*Failed password for ([^ ]+).*/\1/p; s/.*user=([^ ]+).*/\1/p' "$tmp_log" \ + | sort | uniq -c | sort -rn | head -n "$top_count" || true +else + printf 'OK: no attempted users extracted\n' +fi +printf '\n' + +printf 'Sample recent lines:\n' +if [[ -s "$tmp_log" ]]; then + tail -n "$top_count" "$tmp_log" +else + printf 'OK: no sample lines available\n' +fi +printf '\n\n' + +printf 'Evidence:\n' +printf 'Thresholds: warning=%s critical=%s since="%s"\n' "$warning_count" "$critical_count" "$since_value" +printf 'Log source: %s\n' "$log_source" +if [[ "$log_source" != "journalctl" ]]; then + printf 'WARNING: log file fallback may include entries outside the requested --since window\n' +fi +if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then + printf 'WARNING: running without root; authentication log visibility may be limited\n' 
+fi +printf '\n' + +printf 'Recommended next steps:\n' +printf -- '- Verify source IPs against expected scanners, admins, or automation\n' +printf -- '- Check firewall, fail2ban, or security tooling state\n' +printf -- '- Confirm whether the attempts are expected for this host\n' +printf -- '- Review successful logins too, not only failures\n' +printf -- '- Attach this output to incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_filesystem_readonly.sh b/infra-run/scripts/bash/incident-checks/check_filesystem_readonly.sh new file mode 100755 index 0000000..6e38ec7 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_filesystem_readonly.sh @@ -0,0 +1,89 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +include_system=0 + +usage() { + cat <<'USAGE' +Usage: check_filesystem_readonly.sh [--include-system] [--help] + +Detect filesystems mounted read-only. Read-only. +USAGE +} + +while (($# > 0)); do + case "$1" in + --include-system) include_system=1; shift ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +tmp_mounts="$(mktemp)" +trap 'rm -f "$tmp_mounts"' EXIT + +if command -v findmnt >/dev/null 2>&1; then + findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > "$tmp_mounts" 2>/dev/null || true +elif command -v mount >/dev/null 2>&1; then + mount | awk '{ source=$1; target=$3; type=$5; opts=$6; gsub(/[()]/, "", opts); print target, source, type, opts }' > "$tmp_mounts" +else + printf 'CRITICAL: findmnt or mount is required\n' + exit 2 +fi + +tmp_ro="$(mktemp)" +trap 'rm -f "$tmp_mounts" "$tmp_ro"' EXIT + +awk -v include_system="$include_system" ' + function system_fs(type, target) { + return type ~ /^(proc|sysfs|tmpfs|devtmpfs|devpts|securityfs|cgroup|cgroup2|pstore|bpf|tracefs|debugfs|configfs|fusectl|mqueue|hugetlbfs|overlay|squashfs|autofs)$/ || target ~ /^\/(proc|sys|dev|run)(\/|$)/ + } + { + target=$1; source=$2; type=$3; 
opts=$4 + if (opts ~ /(^|,)ro(,|$)/) { + if (include_system == 1 || ! system_fs(type, target)) { + print target "\t" source "\t" type "\t" opts + } + } + } +' "$tmp_mounts" > "$tmp_ro" + +readonly_count="$(wc -l < "$tmp_ro" | awk '{print $1}')" +status="OK" +exit_code=0 +if ((readonly_count > 0)); then + status="CRITICAL" + exit_code=3 +fi + +printf '%s: Found %s read-only filesystem(s)\n\n' "$status" "$readonly_count" + +printf 'Read-only filesystems:\n' +if [[ -s "$tmp_ro" ]]; then + printf 'MOUNT_POINT\tSOURCE\tFSTYPE\tOPTIONS\n' + cat "$tmp_ro" +else + printf 'OK: no read-only filesystems found with current filters\n' +fi +printf '\n' + +printf 'Evidence:\n' +printf 'include_system=%s\n' "$include_system" +printf 'Collector: ' +if command -v findmnt >/dev/null 2>&1; then + printf 'findmnt\n' +else + printf 'mount fallback\n' +fi +printf '\n' + +printf 'Recommended next steps:\n' +printf -- '- Check dmesg or journal logs for I/O errors and filesystem remount events\n' +printf -- '- Check storage path, multipath, SAN, cloud volume, or underlying disk health\n' +printf -- '- Check filesystem health with the platform-approved procedure\n' +printf -- '- Do not remount read-write before understanding the cause\n' +printf -- '- Attach this output to incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_high_cpu.sh b/infra-run/scripts/bash/incident-checks/check_high_cpu.sh new file mode 100755 index 0000000..9ac8fb8 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_high_cpu.sh @@ -0,0 +1,146 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +warning_threshold=75 +critical_threshold=90 +top_count=10 + +usage() { + cat <<'USAGE' +Usage: check_high_cpu.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help] + +Detect high CPU load and show top CPU-consuming processes. 
+ +Exit codes: + 0 OK + 1 WARNING / operational issue detected + 2 invalid input / missing required dependency + 3 CRITICAL issue detected +USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +require_cmd() { + if ! command -v "$1" >/dev/null 2>&1; then + printf 'CRITICAL: required command not found: %s\n' "$1" + exit 2 + fi +} + +while (($# > 0)); do + case "$1" in + --warning) + [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; } + warning_threshold="$2" + shift 2 + ;; + --critical) + [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; } + critical_threshold="$2" + shift 2 + ;; + --top) + [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; } + top_count="$2" + shift 2 + ;; + --help|-h) + usage + exit 0 + ;; + *) + printf 'CRITICAL: unknown option: %s\n' "$1" + usage + exit 2 + ;; + esac +done + +for value in "$warning_threshold" "$critical_threshold" "$top_count"; do + if ! is_number "$value"; then + printf 'CRITICAL: numeric option expected, got: %s\n' "$value" + exit 2 + fi +done + +if ((warning_threshold >= critical_threshold)); then + printf 'CRITICAL: --warning must be lower than --critical\n' + exit 2 +fi + +require_cmd ps +require_cmd awk +require_cmd head + +cpu_count=1 +if command -v getconf >/dev/null 2>&1; then + cpu_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || printf '1')" +elif [[ -r /proc/cpuinfo ]]; then + cpu_count="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || printf '1')" +fi +[[ "$cpu_count" =~ ^[0-9]+$ ]] || cpu_count=1 +((cpu_count > 0)) || cpu_count=1 + +load_1m="unavailable" +load_5m="unavailable" +load_15m="unavailable" +load_per_cpu_pct=0 +if [[ -r /proc/loadavg ]]; then + read -r load_1m load_5m load_15m _ < /proc/loadavg + load_per_cpu_pct="$(awk -v load="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load / cpus) * 100 }')" +elif command -v uptime >/dev/null 2>&1; then + load_line="$(uptime 2>/dev/null || true)" + load_1m="$(printf '%s\n' 
"$load_line" | sed -n 's/.*load average[s]*: *\([^,]*\).*/\1/p')" +fi + +status="OK" +exit_code=0 +if ((load_per_cpu_pct >= critical_threshold)); then + status="CRITICAL" + exit_code=3 +elif ((load_per_cpu_pct >= warning_threshold)); then + status="WARNING" + exit_code=1 +fi + +printf '%s: 1-minute load is %s across %s CPU(s) (%s%% of CPU count)\n\n' "$status" "$load_1m" "$cpu_count" "$load_per_cpu_pct" + +printf 'Load average:\n' +printf '1m=%s 5m=%s 15m=%s\n\n' "$load_1m" "$load_5m" "$load_15m" + +printf 'CPU count:\n' +printf '%s\n\n' "$cpu_count" + +printf 'Top CPU processes:\n' +ps -eo pid,ppid,user,pcpu,pmem,comm,args --sort=-pcpu | head -n "$((top_count + 1))" +printf '\n' + +printf 'Evidence:\n' +if command -v uptime >/dev/null 2>&1; then + uptime || true +else + printf 'WARNING: uptime command not available; used /proc/loadavg where possible\n' +fi +if ((load_per_cpu_pct >= 100)); then + printf 'WARNING: load is higher than online CPU count; runnable task saturation is possible\n' +else + printf 'OK: load is not above online CPU count at collection time\n' +fi +if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then + printf 'WARNING: running without root; process ownership details are usually available, but some command lines may be limited\n' +fi +printf '\n' + +printf 'Recommended next steps:\n' +printf -- '- Check process ownership and whether the top process is expected\n' +printf -- '- Check recent deployments, cron jobs, batch jobs, or maintenance activity\n' +printf -- '- Review logs for the top CPU-consuming process\n' +printf -- '- Compare with longer trend data from monitoring before taking action\n' +printf -- '- Attach this output to the incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh b/infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh new file mode 100755 index 0000000..b9dea91 --- /dev/null +++ 
b/infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh @@ -0,0 +1,138 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +warning_threshold=80 +critical_threshold=90 +since_value="24 hours ago" +top_count=10 + +usage() { + cat <<'USAGE' +Usage: check_high_memory_oom.sh [--warning PERCENT] [--critical PERCENT] [--since TEXT] [--top N] [--help] + +Detect high memory or swap usage and show recent OOM killer evidence. +USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +require_cmd() { + if ! command -v "$1" >/dev/null 2>&1; then + printf 'CRITICAL: required command not found: %s\n' "$1" + exit 2 + fi +} + +while (($# > 0)); do + case "$1" in + --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;; + --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;; + --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;; + --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +for value in "$warning_threshold" "$critical_threshold" "$top_count"; do + if ! 
is_number "$value"; then + printf 'CRITICAL: numeric option expected, got: %s\n' "$value" + exit 2 + fi +done +if ((warning_threshold >= critical_threshold)); then + printf 'CRITICAL: --warning must be lower than --critical\n' + exit 2 +fi + +require_cmd free +require_cmd ps +require_cmd awk +require_cmd head + +read -r mem_total mem_used swap_total swap_used < <(free -m | awk ' + /^Mem:/ { mt=$2; mu=$3 } + /^Swap:/ { st=$2; su=$3 } + END { printf "%d %d %d %d\n", mt, mu, st, su } +') + +mem_pct=0 +swap_pct=0 +if ((mem_total > 0)); then + mem_pct=$((mem_used * 100 / mem_total)) +fi +if ((swap_total > 0)); then + swap_pct=$((swap_used * 100 / swap_total)) +fi + +status="OK" +exit_code=0 +if ((mem_pct >= critical_threshold || swap_pct >= critical_threshold)); then + status="CRITICAL" + exit_code=3 +elif ((mem_pct >= warning_threshold || swap_pct >= warning_threshold)); then + status="WARNING" + exit_code=1 +fi + +printf '%s: Memory usage is %s%% and swap usage is %s%%\n\n' "$status" "$mem_pct" "$swap_pct" + +printf 'Memory summary:\n' +free -m +printf '\n' + +printf 'Top memory processes:\n' +printf 'PID RSS_MB COMMAND\n' +ps -eo pid=,rss=,comm= --sort=-rss | head -n "$top_count" | awk '{ printf "%-7s %-8d %s\n", $1, int($2 / 1024), $3 }' +printf '\n' + +printf 'OOM events since %s:\n' "$since_value" +oom_found=0 +oom_source="journalctl" +if command -v journalctl >/dev/null 2>&1; then + if journalctl --since "$since_value" -k --no-pager 2>/dev/null | grep -Ei 'out of memory|oom-killer|killed process' | tail -n 20; then + oom_found=1 + fi +else + printf 'WARNING: journalctl not available; checking readable log files\n' + oom_source="log file fallback" +fi +if ((oom_found == 0)); then + for log_file in /var/log/messages /var/log/syslog /var/log/kern.log; do + if [[ -r "$log_file" ]]; then + if grep -Ei 'out of memory|oom-killer|killed process' "$log_file" | tail -n 20; then + oom_found=1 + oom_source="$log_file" + break + fi + fi + done +fi +if ((oom_found == 0)); then 
+ printf 'OK: no OOM evidence found in available sources\n' +fi +printf '\n' + +printf 'Evidence:\n' +printf 'Thresholds: warning=%s%% critical=%s%% since="%s"\n' "$warning_threshold" "$critical_threshold" "$since_value" +printf 'OOM evidence source: %s\n' "$oom_source" +if [[ "$oom_source" != "journalctl" ]]; then + printf 'WARNING: log file fallback may include entries outside the requested --since window\n' +fi +if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then + printf 'WARNING: running without root; kernel logs or process details may be limited\n' +fi +printf '\n' + +printf 'Recommended next steps:\n' +printf -- '- Check application memory trend\n' +printf -- '- Review JVM heap settings if process is Java\n' +printf -- '- Verify swap pressure and paging activity\n' +printf -- '- Confirm whether OOM events align with application impact\n' +printf -- '- Attach this output to incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_inode_usage.sh b/infra-run/scripts/bash/incident-checks/check_inode_usage.sh new file mode 100755 index 0000000..01d81aa --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_inode_usage.sh @@ -0,0 +1,103 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +warning_threshold=80 +critical_threshold=90 +top_count=10 + +usage() { + cat <<'USAGE' +Usage: check_inode_usage.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help] + +Detect inode exhaustion using df -i. 
+USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +while (($# > 0)); do + case "$1" in + --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;; + --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;; + --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +for value in "$warning_threshold" "$critical_threshold" "$top_count"; do + if ! is_number "$value"; then + printf 'CRITICAL: numeric option expected, got: %s\n' "$value" + exit 2 + fi +done +if ((warning_threshold >= critical_threshold)); then + printf 'CRITICAL: --warning must be lower than --critical\n' + exit 2 +fi +if ! command -v df >/dev/null 2>&1; then + printf 'CRITICAL: required command not found: df\n' + exit 2 +fi + +tmp_df="$(mktemp)" +tmp_alerts="$(mktemp)" +trap 'rm -f "$tmp_df" "$tmp_alerts"' EXIT + +df -Pi > "$tmp_df" +awk -v warn="$warning_threshold" ' + NR > 1 { + pct=$5 + gsub(/%/, "", pct) + if (pct + 0 >= warn + 0) { # + 0 forces a numeric comparison; gsub leaves pct as a string + print $0 + } + } +' "$tmp_df" > "$tmp_alerts" + +max_pct="$(awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); if (pct + 0 > max + 0) max=pct + 0 } END { printf "%d", max }' "$tmp_df")" +status="OK" +exit_code=0 +if ((max_pct >= critical_threshold)); then + status="CRITICAL" + exit_code=3 +elif ((max_pct >= warning_threshold)); then + status="WARNING" + exit_code=1 +fi + +printf '%s: Highest inode usage is %s%%\n\n' "$status" "$max_pct" + +printf 'Filesystems above threshold:\n' +if [[ -s "$tmp_alerts" ]]; then + cat "$tmp_alerts" +else + printf 'OK: no filesystems above warning threshold\n' +fi +printf '\n' + +printf 'Inode usage table:\n' +cat "$tmp_df" +printf '\n' + +printf 'Top affected mount points:\n' +awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); print pct, $6, $1, $2, $3, $4 }'
"$tmp_df" \ + | sort -rn | head -n "$top_count" \ + | awk '{ printf "%s%% %s %s inodes=%s used=%s free=%s\n", $1, $2, $3, $4, $5, $6 }' +printf '\n' + +printf 'Evidence:\n' +printf 'Thresholds: warning=%s%% critical=%s%%\n\n' "$warning_threshold" "$critical_threshold" + +printf 'Recommended next steps:\n' +printf -- '- Find directories with many small files under affected mount points\n' +printf -- '- Check logs, cache, spool, session, and temporary directories\n' +printf -- '- Avoid deleting blindly; confirm ownership and application impact first\n' +printf -- '- Confirm whether inode exhaustion is causing write or deploy failures\n' +printf -- '- Attach this output to incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_jvm_threads_heap.sh b/infra-run/scripts/bash/incident-checks/check_jvm_threads_heap.sh new file mode 100755 index 0000000..477c202 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_jvm_threads_heap.sh @@ -0,0 +1,134 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +target_pid="" +match_string="" +top_count=10 + +usage() { + cat <<'USAGE' +Usage: check_jvm_threads_heap.sh [--pid PID | --match STRING] [--top N] [--help] + +Provide lightweight JVM process diagnostics. Does not create heap dumps or modify processes. 
+USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +while (($# > 0)); do + case "$1" in + --pid) [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }; target_pid="$2"; shift 2 ;; + --match) [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }; match_string="$2"; shift 2 ;; + --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +if [[ -n "$target_pid" && -n "$match_string" ]]; then + printf 'CRITICAL: use either --pid or --match, not both\n' + exit 2 +fi +if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then + printf 'CRITICAL: --pid must be numeric\n' + exit 2 +fi +if ! is_number "$top_count"; then + printf 'CRITICAL: --top must be numeric\n' + exit 2 +fi +if ! command -v ps >/dev/null 2>&1; then + printf 'CRITICAL: required command not found: ps\n' + exit 2 +fi + +tmp_java="$(mktemp)" +trap 'rm -f "$tmp_java"' EXIT + +ps -eo pid=,user=,rss=,pcpu=,comm=,args= \ + | awk 'tolower($0) ~ /java/ && $1 != "" { print }' > "$tmp_java" + +if [[ -z "$target_pid" && -n "$match_string" ]]; then + target_pid="$(grep -F "$match_string" "$tmp_java" | awk 'NR == 1 { print $1 }' || true)" +fi + +if [[ -z "$target_pid" ]]; then + detected_count="$(wc -l < "$tmp_java" | awk '{print $1}')" + if ((detected_count == 0)); then + printf 'WARNING: No Java processes detected\n\n' + else + printf 'OK: Detected %s Java process(es); rerun with --pid PID for heap detail\n\n' "$detected_count" + fi + printf 'Detected JVM processes:\n' + printf 'PID USER RSS_MB CPU COMMAND\n' + awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' "$tmp_java" | head -n "$top_count" + printf '\nRecommended next steps:\n' + printf -- '- Select a JVM process with --pid for focused diagnostics\n' + printf -- 
'- Review GC logs and application logs for the selected process\n' + printf -- '- Check heap sizing and thread count trend\n' + printf -- '- Capture jstack only if approved by operational process\n' + exit 1 +fi + +if ! ps -p "$target_pid" >/dev/null 2>&1; then + printf 'CRITICAL: process does not exist or is not visible: %s\n' "$target_pid" + exit 2 +fi + +proc_line="$(ps -p "$target_pid" -o pid=,user=,rss=,pcpu=,comm=,args=)" +if ! printf '%s\n' "$proc_line" | grep -qi 'java'; then + printf 'WARNING: PID %s does not appear to be a Java process from ps output\n\n' "$target_pid" + status="WARNING" + exit_code=1 +else + status="OK" + exit_code=0 +fi + +thread_count="unavailable" +if [[ -r "/proc/${target_pid}/status" ]]; then + thread_count="$(awk '/^Threads:/ { print $2 }' "/proc/${target_pid}/status")" +fi + +printf '%s: JVM diagnostics collected for PID %s\n\n' "$status" "$target_pid" + +printf 'Detected JVM process:\n' +printf 'PID USER RSS_MB CPU COMMAND\n' +printf '%s\n' "$proc_line" | awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' +printf 'Thread count: %s\n\n' "$thread_count" + +printf 'Heap and JVM evidence:\n' +if command -v jcmd >/dev/null 2>&1; then + printf '\n[jcmd VM.flags]\n' + jcmd "$target_pid" VM.flags 2>/dev/null || printf 'WARNING: jcmd VM.flags failed; permissions may be limited\n' + printf '\n[jcmd GC.heap_info]\n' + jcmd "$target_pid" GC.heap_info 2>/dev/null || printf 'WARNING: jcmd GC.heap_info failed; permissions may be limited\n' + printf '\n[jcmd Thread.print summary]\n' + jcmd "$target_pid" Thread.print 2>/dev/null | awk '/java.lang.Thread.State/ { state[$0]++ } END { for (item in state) print state[item], item }' | sort -rn | head -n "$top_count" || printf 'WARNING: jcmd Thread.print failed; permissions may be limited\n' +elif command -v jstat >/dev/null 2>&1; then + printf '\n[jstat -gc]\n' + jstat -gc "$target_pid" 1 1 2>/dev/null || printf 
'WARNING: jstat failed; permissions may be limited\n' +else + printf 'WARNING: jcmd and jstat are unavailable; heap details skipped\n' +fi +printf '\n' + +printf 'Evidence:\n' +printf 'PID=%s thread_count=%s top=%s\n' "$target_pid" "$thread_count" "$top_count" +if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then + printf 'WARNING: running without root; JVM attach and /proc details may be limited by process ownership\n' +fi +printf '\n' + +printf 'Recommended next steps:\n' +printf -- '- Review GC logs and recent application errors\n' +printf -- '- Check JVM heap sizing against container or host memory limits\n' +printf -- '- Check thread count trend in monitoring before concluding a leak\n' +printf -- '- Capture jstack only if approved by operational process\n' +printf -- '- Attach this output to incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_ntp_time_drift.sh b/infra-run/scripts/bash/incident-checks/check_ntp_time_drift.sh new file mode 100755 index 0000000..b12c524 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_ntp_time_drift.sh @@ -0,0 +1,121 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +warning_offset_ms=500 +critical_offset_ms=5000 + +usage() { + cat <<'USAGE' +Usage: check_ntp_time_drift.sh [--warning-offset MS] [--critical-offset MS] [--help] + +Check time synchronization status and offset evidence when available. 
+USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +while (($# > 0)); do + case "$1" in + --warning-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-offset requires a value\n'; exit 2; }; warning_offset_ms="$2"; shift 2 ;; + --critical-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-offset requires a value\n'; exit 2; }; critical_offset_ms="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +for value in "$warning_offset_ms" "$critical_offset_ms"; do + if ! is_number "$value"; then + printf 'CRITICAL: numeric option expected, got: %s\n' "$value" + exit 2 + fi +done +if ((warning_offset_ms >= critical_offset_ms)); then + printf 'CRITICAL: --warning-offset must be lower than --critical-offset\n' + exit 2 +fi + +system_time="$(date '+%Y-%m-%d %H:%M:%S %Z %z')" +timezone="$(date '+%Z %z')" +sync_status="unknown" +detected_tool="none" +offset_ms="" + +timedate_output="" +if command -v timedatectl >/dev/null 2>&1; then + detected_tool="timedatectl" + timedate_output="$(timedatectl 2>/dev/null || true)" + sync_status="$(printf '%s\n' "$timedate_output" | awk -F: '/System clock synchronized|NTP synchronized/ { gsub(/^ +/, "", $2); print $2; exit }')" + [[ -n "$sync_status" ]] || sync_status="unknown" +fi + +chronyc_output="" +if command -v chronyc >/dev/null 2>&1; then + detected_tool="chronyc" + chronyc_output="$(chronyc tracking 2>/dev/null || true)" + raw_offset="$(printf '%s\n' "$chronyc_output" | awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }')" + if [[ -n "$raw_offset" ]]; then + offset_ms="$(awk -v seconds="$raw_offset" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d", seconds * 1000 }')" + fi +elif command -v ntpq >/dev/null 2>&1; then + detected_tool="ntpq" +fi + +status="OK" +exit_code=0 +if [[ "$sync_status" =~ ^(no|false)$ ]]; then + status="WARNING" + exit_code=1 +fi +if [[ -n "$offset_ms" ]]; then + if 
((offset_ms >= critical_offset_ms)); then + status="CRITICAL" + exit_code=3 + elif ((offset_ms >= warning_offset_ms)); then + status="WARNING" + exit_code=1 + fi +elif [[ "$detected_tool" == "none" ]]; then + status="WARNING" + exit_code=1 +fi + +printf '%s: Time sync status=%s offset_ms=%s\n\n' "$status" "$sync_status" "${offset_ms:-unavailable}" + +printf 'Time status:\n' +printf 'System time: %s\n' "$system_time" +printf 'Timezone: %s\n' "$timezone" +printf 'Detected tool: %s\n' "$detected_tool" +printf 'NTP synchronized: %s\n' "$sync_status" +printf 'Offset ms: %s\n\n' "${offset_ms:-unavailable}" + +printf 'Tool evidence:\n' +if [[ -n "$chronyc_output" ]]; then + printf '%s\n' "$chronyc_output" +elif command -v ntpq >/dev/null 2>&1; then + ntpq -p 2>/dev/null || printf 'WARNING: ntpq command failed\n' +elif [[ -n "$timedate_output" ]]; then + printf '%s\n' "$timedate_output" +else + printf 'WARNING: timedatectl, chronyc, and ntpq are unavailable or returned no data\n' +fi +printf '\n' + +printf 'Evidence:\n' +printf 'Thresholds: warning=%sms critical=%sms\n' "$warning_offset_ms" "$critical_offset_ms" +if [[ -z "$offset_ms" ]]; then + printf 'WARNING: offset unavailable; status is based on available synchronization indicators only\n' +fi +printf '\n' + +printf 'Recommended next steps:\n' +printf -- '- Verify chrony or ntpd service status and configuration\n' +printf -- '- Check NTP sources and reachability\n' +printf -- '- Check virtualization host time if this is a VM\n' +printf -- '- Avoid restarting time services blindly in production\n' +printf -- '- Attach this output to incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/check_service_restart_loop.sh b/infra-run/scripts/bash/incident-checks/check_service_restart_loop.sh new file mode 100755 index 0000000..4aec4c0 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/check_service_restart_loop.sh @@ -0,0 +1,111 @@ +#!/usr/bin/env bash +set -o errexit +set -o 
nounset +set -o pipefail + +service_name="" +since_value="1 hour ago" +warning_count=3 +critical_count=10 + +usage() { + cat <<'USAGE' +Usage: check_service_restart_loop.sh --service SERVICE_NAME [--since TEXT] [--warning COUNT] [--critical COUNT] [--help] + +Detect restart-loop evidence for a systemd service. Read-only. +USAGE +} + +is_number() { + [[ "$1" =~ ^[0-9]+$ ]] +} + +require_cmd() { + if ! command -v "$1" >/dev/null 2>&1; then + printf 'CRITICAL: required command not found: %s\n' "$1" + exit 2 + fi +} + +while (($# > 0)); do + case "$1" in + --service) [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }; service_name="$2"; shift 2 ;; + --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;; + --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;; + --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;; + --help|-h) usage; exit 0 ;; + *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;; + esac +done + +if [[ -z "$service_name" ]]; then + printf 'CRITICAL: --service is required\n' + usage + exit 2 +fi +for value in "$warning_count" "$critical_count"; do + if ! is_number "$value"; then + printf 'CRITICAL: numeric option expected, got: %s\n' "$value" + exit 2 + fi +done +if ((warning_count >= critical_count)); then + printf 'CRITICAL: --warning must be lower than --critical\n' + exit 2 +fi + +require_cmd systemctl + +active_state="$(systemctl show "$service_name" --property=ActiveState --value 2>/dev/null || printf 'unknown')" +sub_state="$(systemctl show "$service_name" --property=SubState --value 2>/dev/null || printf 'unknown')" +n_restarts="$(systemctl show "$service_name" --property=NRestarts --value 2>/dev/null || printf '')" +restart_count="${n_restarts:-0}" +if ! 
is_number "$restart_count"; then + restart_count=0 +fi + +status="OK" +exit_code=0 +if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then + status="CRITICAL" + exit_code=3 +elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then + status="WARNING" + exit_code=1 +fi + +printf '%s: Service %s state=%s substate=%s restarts=%s\n\n' "$status" "$service_name" "$active_state" "$sub_state" "$restart_count" + +printf 'Service state:\n' +systemctl status "$service_name" --no-pager --lines=8 2>/dev/null || printf 'WARNING: unable to read service status for %s\n' "$service_name" +printf '\n' + +printf 'Systemd properties:\n' +systemctl show "$service_name" --property=Id,Names,LoadState,ActiveState,SubState,Result,ExecMainStatus,NRestarts,Restart,RestartUSec --no-pager 2>/dev/null || true +printf '\n' + +printf 'Recent start/stop/failure log lines since %s:\n' "$since_value" +if command -v journalctl >/dev/null 2>&1; then + journalctl -u "$service_name" --since "$since_value" --no-pager 2>/dev/null \ + | grep -Ei 'start|stop|fail|restart|exit|status|main process' \ + | tail -n 40 || printf 'OK: no matching journal lines found\n' +else + printf 'WARNING: journalctl not available; service logs unavailable from this script\n' +fi +printf '\n' + +printf 'Evidence:\n' +printf 'Thresholds: warning=%s restarts critical=%s restarts since="%s"\n' "$warning_count" "$critical_count" "$since_value" +if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then + printf 'WARNING: running without root; journal visibility may be limited\n' +fi +printf '\n' + +printf 'Recommended next steps:\n' +printf -- '- Inspect the unit file and drop-in overrides\n' +printf -- '- Review application logs around the restart timestamps\n' +printf -- '- Check dependencies such as network, storage, database, or secrets\n' +printf -- '- Verify recent configuration or package changes\n' +printf -- '- Do not restart blindly; attach this output to the 
incident ticket\n' + +exit "$exit_code" diff --git a/infra-run/scripts/bash/incident-checks/examples/certificate-expiry.sample.txt b/infra-run/scripts/bash/incident-checks/examples/certificate-expiry.sample.txt new file mode 100644 index 0000000..54a3975 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/certificate-expiry.sample.txt @@ -0,0 +1,20 @@ +WARNING: Certificate for app.example.com:443 expires in 18 day(s) + +Certificate details: +Subject: CN = app.example.com +Issuer: C = US, O = Example CA, CN = Example Intermediate CA +notBefore: Apr 11 00:00:00 2026 GMT +notAfter: May 29 23:59:59 2026 GMT +SAN/CN: DNS:app.example.com, DNS:api.example.com + +Evidence: +Target: app.example.com:443 +SNI: app.example.com +Thresholds: warning=30 days critical=7 days + +Recommended next steps: +- Renew certificate before the operational threshold is breached +- Check the full chain and intermediate certificates +- Check the load balancer, ingress, or reverse proxy serving this certificate +- Verify monitoring threshold and alert ownership +- Attach this output to incident or change ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/dns-connectivity.sample.txt b/infra-run/scripts/bash/incident-checks/examples/dns-connectivity.sample.txt new file mode 100644 index 0000000..9fcf504 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/dns-connectivity.sample.txt @@ -0,0 +1,23 @@ +OK: DNS=OK ping=OK tcp_443=OK + +DNS result: +93.184.216.34 example.com + +Ping result: +3 packets transmitted, 3 received, 0% packet loss, time 2002ms + +TCP port result: +OK: TCP connection to example.com:443 succeeded + +Local network hints: +default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15 + +Evidence: +Host: example.com count=3 timeout=3s port=443 + +Recommended next steps: +- Verify the DNS record and resolver path +- Check firewall, routing, security group, or proxy policy +- Compare results from another host or network segment +- Check 
application endpoint health after network reachability is confirmed +- Attach this output to incident ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/failed-ssh-logins.sample.txt b/infra-run/scripts/bash/incident-checks/examples/failed-ssh-logins.sample.txt new file mode 100644 index 0000000..7372365 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/failed-ssh-logins.sample.txt @@ -0,0 +1,26 @@ +CRITICAL: Found 73 failed SSH login attempt(s) for requested window + +Top source IPs: +52 203.0.113.44 +12 198.51.100.20 +9 192.0.2.10 + +Top attempted users: +31 admin +24 oracle +18 root + +Sample recent lines: +May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2 +May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20 + +Evidence: +Thresholds: warning=20 critical=50 since="1 hour ago" +Log source: journalctl + +Recommended next steps: +- Verify source IPs against expected scanners, admins, or automation +- Check firewall, fail2ban, or security tooling state +- Confirm whether the attempts are expected for this host +- Review successful logins too, not only failures +- Attach this output to incident ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/filesystem-readonly.sample.txt b/infra-run/scripts/bash/incident-checks/examples/filesystem-readonly.sample.txt new file mode 100644 index 0000000..7fff968 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/filesystem-readonly.sample.txt @@ -0,0 +1,16 @@ +CRITICAL: Found 1 read-only filesystem(s) + +Read-only filesystems: +MOUNT_POINT SOURCE FSTYPE OPTIONS +/data /dev/mapper/vg_data-lv_data xfs ro,relatime,seclabel,attr2,inode64 + +Evidence: +include_system=0 +Collector: findmnt + +Recommended next steps: +- Check dmesg or journal logs for I/O errors and filesystem remount events +- Check storage path, multipath, SAN, cloud volume, or underlying disk health +- Check filesystem health 
with the platform-approved procedure +- Do not remount read-write before understanding the cause +- Attach this output to incident ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/high-cpu.sample.txt b/infra-run/scripts/bash/incident-checks/examples/high-cpu.sample.txt new file mode 100644 index 0000000..14e5d63 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/high-cpu.sample.txt @@ -0,0 +1,22 @@ +CRITICAL: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count) + +Load average: +1m=7.82 5m=6.91 15m=5.40 + +CPU count: +8 + +Top CPU processes: +PID PPID USER %CPU %MEM COMMAND COMMAND +2314 1 app 245 12.1 java java -jar order-api.jar +991 1 root 38 0.4 backup-agent backup-agent --scan + +Evidence: +OK: load is not above online CPU count at collection time + +Recommended next steps: +- Check process ownership and whether the top process is expected +- Check recent deployments, cron jobs, batch jobs, or maintenance activity +- Review logs for the top CPU-consuming process +- Compare with longer trend data from monitoring before taking action +- Attach this output to the incident ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/high-memory-oom.sample.txt b/infra-run/scripts/bash/incident-checks/examples/high-memory-oom.sample.txt new file mode 100644 index 0000000..f79a0fd --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/high-memory-oom.sample.txt @@ -0,0 +1,25 @@ +WARNING: Memory usage is 84% and swap usage is 12% + +Memory summary: + total used free shared buff/cache available +Mem: 15934 13386 512 121 2036 2101 +Swap: 4095 512 3583 + +Top memory processes: +PID RSS_MB COMMAND +1234 2048 java +987 812 postgres + +OOM events since 24 hours ago: +2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java) + +Evidence: +Thresholds: warning=80% critical=90% since="24 hours ago" +OOM evidence source: journalctl + +Recommended next steps: +- Check application memory
trend +- Review JVM heap settings if process is Java +- Verify swap pressure and paging activity +- Confirm whether OOM events align with application impact +- Attach this output to incident ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/inode-usage.sample.txt b/infra-run/scripts/bash/incident-checks/examples/inode-usage.sample.txt new file mode 100644 index 0000000..23e4186 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/inode-usage.sample.txt @@ -0,0 +1,22 @@ +WARNING: Highest inode usage is 87% + +Filesystems above threshold: +/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var + +Inode usage table: +Filesystem Inodes IUsed IFree IUse% Mounted on +/dev/mapper/vg_root-lv_root 524288 91300 432988 18% / +/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var + +Top affected mount points: +87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394 + +Evidence: +Thresholds: warning=80% critical=90% + +Recommended next steps: +- Find directories with many small files under affected mount points +- Check logs, cache, spool, session, and temporary directories +- Avoid deleting blindly; confirm ownership and application impact first +- Confirm whether inode exhaustion is causing write or deploy failures +- Attach this output to incident ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/jvm-threads-heap.sample.txt b/infra-run/scripts/bash/incident-checks/examples/jvm-threads-heap.sample.txt new file mode 100644 index 0000000..5696fb2 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/jvm-threads-heap.sample.txt @@ -0,0 +1,30 @@ +OK: JVM diagnostics collected for PID 1234 + +Detected JVM process: +PID USER RSS_MB CPU COMMAND +1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar +Thread count: 188 + +Heap and JVM evidence: + +[jcmd VM.flags] +1234: +-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648 + +[jcmd GC.heap_info] +garbage-first heap total 2097152K, used 
1521000K + +[jcmd Thread.print summary] +102 java.lang.Thread.State: WAITING +53 java.lang.Thread.State: RUNNABLE +33 java.lang.Thread.State: TIMED_WAITING + +Evidence: +PID=1234 thread_count=188 top=10 + +Recommended next steps: +- Review GC logs and recent application errors +- Check JVM heap sizing against container or host memory limits +- Check thread count trend in monitoring before concluding a leak +- Capture jstack only if approved by operational process +- Attach this output to incident ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/ntp-time-drift.sample.txt b/infra-run/scripts/bash/incident-checks/examples/ntp-time-drift.sample.txt new file mode 100644 index 0000000..e1df960 --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/ntp-time-drift.sample.txt @@ -0,0 +1,23 @@ +WARNING: Time sync status=yes offset_ms=812 + +Time status: +System time: 2026-05-11 10:18:01 UTC +0000 +Timezone: UTC +0000 +Detected tool: chronyc +NTP synchronized: yes +Offset ms: 812 + +Tool evidence: +Reference ID : 203.0.113.10 +System time : 0.812345 seconds fast of NTP time +Last offset : +0.812345 seconds + +Evidence: +Thresholds: warning=500ms critical=5000ms + +Recommended next steps: +- Verify chrony or ntpd service status and configuration +- Check NTP sources and reachability +- Check virtualization host time if this is a VM +- Avoid restarting time services blindly in production +- Attach this output to incident ticket diff --git a/infra-run/scripts/bash/incident-checks/examples/service-restart-loop.sample.txt b/infra-run/scripts/bash/incident-checks/examples/service-restart-loop.sample.txt new file mode 100644 index 0000000..d8a0dcc --- /dev/null +++ b/infra-run/scripts/bash/incident-checks/examples/service-restart-loop.sample.txt @@ -0,0 +1,27 @@ +CRITICAL: Service app.service state=failed substate=failed restarts=12 + +Service state: +app.service - Example application + Loaded: loaded (/etc/systemd/system/app.service; enabled) + 
Active: failed (Result: exit-code) + +Systemd properties: +Id=app.service +ActiveState=failed +SubState=failed +Result=exit-code +NRestarts=12 + +Recent start/stop/failure log lines since 1 hour ago: +May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE +May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'. + +Evidence: +Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago" + +Recommended next steps: +- Inspect the unit file and drop-in overrides +- Review application logs around the restart timestamps +- Check dependencies such as network, storage, database, or secrets +- Verify recent configuration or package changes +- Do not restart blindly; attach this output to the incident ticket
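
Reviewer note: the checks above report their result through the exit code as well as the first output line. The following is a minimal sketch of how a caller might interpret that code, assuming the convention documented in the check_high_cpu.sh usage text (0 OK, 1 WARNING, 2 invalid input or missing dependency, 3 CRITICAL) holds for the other checks too. The helper names `status_label` and `run_check`, and the `INVALID_INPUT` and `UNKNOWN` labels, are hypothetical and not part of this patch.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper helpers; not part of the patch.

# Map an incident-check exit code to a short label.
status_label() {
  case "$1" in
    0) printf 'OK\n' ;;
    1) printf 'WARNING\n' ;;
    2) printf 'INVALID_INPUT\n' ;;
    3) printf 'CRITICAL\n' ;;
    *) printf 'UNKNOWN\n' ;;
  esac
}

# Run a check, let its output pass through, then print a one-line summary.
# The non-zero exit code is captured instead of aborting the caller.
run_check() {
  local rc=0
  "$@" || rc=$?
  printf 'exit=%s label=%s\n' "$rc" "$(status_label "$rc")"
  return 0
}
```

For example, `run_check ./check_service_restart_loop.sh --service app.service` would print the check's own report followed by a line such as `exit=3 label=CRITICAL` for the sample above. Capturing the code with `|| rc=$?` keeps a caller that uses `set -o errexit` from stopping at the first WARNING or CRITICAL result.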