Bash Incident Checks

Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident, run without repository context, and pasted into an incident or change ticket as evidence.

They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.

Scripts

  • check_high_cpu.sh - load, CPU saturation hint, and top CPU processes.
  • check_high_memory_oom.sh - memory and swap pressure plus recent OOM evidence.
  • check_service_restart_loop.sh - systemd service state, restart count, and recent failure lines.
  • check_failed_ssh_logins.sh - failed SSH login burst review from journal or auth logs.
  • check_certificate_expiry.sh - remote or local TLS certificate expiry check.
  • check_dns_connectivity.sh - DNS resolution, ping, optional TCP check, and local route hints.
  • check_ntp_time_drift.sh - time sync status and offset evidence when available.
  • check_filesystem_readonly.sh - read-only filesystem detection.
  • check_inode_usage.sh - inode pressure and top affected mount points.
  • check_jvm_threads_heap.sh - lightweight JVM process, heap, and thread diagnostics.

Usage Examples

./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15

./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5

./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"

./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25

./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt

./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432

./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000

./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system

./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90

./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name

Exit Codes

  • 0 - OK.
  • 1 - WARNING or operational issue detected.
  • 2 - invalid input or missing required dependency.
  • 3 - CRITICAL issue detected.
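Because the codes above differ from the Nagios plugin convention (where 2 means CRITICAL), a wrapper that consumes them may want an explicit mapping. The following is a minimal sketch, not part of the scripts themselves: `describe_exit_code` and `run_check` are hypothetical helper names, and the label chosen for exit code 2 is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not shipped with the incident checks.

# Translate an exit code into the meaning documented under "Exit Codes".
describe_exit_code() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "USAGE_OR_DEPENDENCY" ;;  # invalid input or missing dependency
    3) echo "CRITICAL" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Run a check, leave its output untouched, then print a one-line summary.
# The original exit code is preserved for the caller.
run_check() {
  local rc=0
  "$@" || rc=$?
  echo "result=$(describe_exit_code "$rc") exit_code=$rc command=$*"
  return "$rc"
}
```

For example, `run_check ./check_inode_usage.sh --warning 75 --critical 90` prints the script output followed by a summary line that can be pasted into the ticket.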

Supported Platforms

These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.

Some data sources vary by distribution:

  • RHEL-like systems often use /var/log/secure and /var/log/messages.
  • Debian/Ubuntu systems often use /var/log/auth.log, /var/log/syslog, and /var/log/kern.log.
  • systemd-based checks require systemctl; journal-based evidence uses journalctl when available.
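One way to handle these per-distribution differences is to probe a list of candidate log files and use the first readable one. The sketch below illustrates that idea only; `first_readable_log` is a hypothetical helper, and the scripts may select their sources differently (for example, by preferring journalctl when it is present).

```shell
#!/usr/bin/env bash
# Hypothetical helper; illustrates log-source selection only.

# Print the first readable file among the arguments.
# Returns 1 if none of the candidates can be read.
first_readable_log() {
  local candidate
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}
```

A caller could then write `auth_log="$(first_readable_log /var/log/secure /var/log/auth.log)" || echo "no readable auth log; elevated permissions may be needed"`.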

Safety Notes

  • Scripts are read-only.
  • Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files.
  • Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
  • Treat output as triage evidence, not as complete root-cause analysis.

Dependency Notes

Required dependencies vary by script and are checked at runtime. Common dependencies include bash, awk, sed, grep, sort, head, ps, df, free, systemctl, getent, openssl, date, mount, and findmnt.

Optional dependencies include journalctl, ping, ip, ss, timedatectl, chronyc, ntpq, jcmd, jstat, and readable /proc files.
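A runtime dependency check of this kind can be sketched with `command -v`. The helper names `check_required` and `note_optional` and the message wording are assumptions for illustration; only the exit code 2 for a missing required dependency comes from this README.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the real scripts may word their messages differently.

# Return 2 (missing required dependency) if any listed command is absent.
check_required() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "MISSING required command: $cmd" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] || return 2
}

# Report optional commands without failing when one is absent.
note_optional() {
  local cmd
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "optional: $cmd available"
    else
      echo "optional: $cmd not found, skipping related evidence"
    fi
  done
}
```

A script could call `check_required awk ps df || exit 2` near the top and `note_optional journalctl chronyc` before using those tools.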

Copy-To-Server Example

scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'

Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
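To keep the command, capture time, and exit code together with the output, the run can be wrapped on the responder's workstation. This is a minimal sketch: `capture_evidence` is a hypothetical helper, the `#`-prefixed header format is an assumption, and the evidence file is written locally, so nothing is written on the server.

```shell
#!/usr/bin/env bash
# Hypothetical helper; run it on the workstation, not on the server.

# Usage: capture_evidence OUTFILE COMMAND [ARGS...]
# Writes the command line, a UTC timestamp, the combined output, and the
# exit code to OUTFILE, prints the file, and returns the command's exit code.
capture_evidence() {
  local outfile=$1 rc=0
  shift
  {
    echo "# command: $*"
    echo "# captured_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
  } >"$outfile"
  "$@" >>"$outfile" 2>&1 || rc=$?
  echo "# exit_code: $rc" >>"$outfile"
  cat "$outfile"
  return "$rc"
}
```

For example: `capture_evidence oom-evidence.txt ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'`, where `oom-evidence.txt` is an arbitrary local file name.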

Sample Outputs

Sanitized sample outputs are available in the examples directory:

  • high-cpu.sample.txt
  • high-memory-oom.sample.txt
  • service-restart-loop.sample.txt
  • failed-ssh-logins.sample.txt
  • certificate-expiry.sample.txt
  • dns-connectivity.sample.txt
  • ntp-time-drift.sample.txt
  • filesystem-readonly.sample.txt
  • inode-usage.sample.txt
  • jvm-threads-heap.sample.txt