Bash Incident Checks

Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident, run without repository context, and pasted into an incident or change ticket as evidence.

They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.

Scripts

check_high_cpu.sh - load, CPU saturation hint, and top CPU processes.
check_high_memory_oom.sh - memory and swap pressure plus recent OOM evidence.
check_service_restart_loop.sh - systemd service state, restart count, and recent failure lines.
check_failed_ssh_logins.sh - failed SSH login burst review from journal or auth logs.
check_certificate_expiry.sh - remote or local TLS certificate expiry check.
check_dns_connectivity.sh - DNS resolution, ping, optional TCP check, and local route hints.
check_ntp_time_drift.sh - time sync status and offset evidence when available.
check_filesystem_readonly.sh - read-only filesystem detection.
check_inode_usage.sh - inode pressure and top affected mount points.
check_jvm_threads_heap.sh - lightweight JVM process, heap, and thread diagnostics.
incident_triage_report.sh - wrapper that runs selected checks and writes a single Markdown L2 handover report.

Usage Examples

./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15

./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5

./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"

./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25

./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt

./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432

./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000

./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system

./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90

./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name

./incident_triage_report.sh --type cpu
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
./incident_triage_report.sh --type network --host app.example.com --port 443
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md

L2 Triage Report Wrapper

incident_triage_report.sh collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.

Supported report types are cpu, memory, service, network, auth, cert, filesystem, jvm, and all.

The wrapper is read-only apart from writing the requested --output file. It does not require root. It skips a check safely when the underlying script is missing or not executable, or when required context such as --service or --host was not supplied.
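
The skip behavior described above could look roughly like the sketch below. This is only an illustration: `run_or_skip` is a hypothetical helper name, and the wrapper's real internals and message wording are not shown in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from incident_triage_report.sh.
# Runs a check script when it is usable; otherwise prints why it was skipped.
run_or_skip() {
  local script="$1"
  shift
  if [ ! -e "$script" ]; then
    echo "SKIPPED: $script (missing)"
  elif [ ! -x "$script" ]; then
    echo "SKIPPED: $script (not executable)"
  else
    # A non-zero exit is noted but does not stop the remaining checks.
    "$script" "$@" || echo "NOTE: $script exited with code $?"
  fi
}
```

For example, `run_or_skip ./check_high_cpu.sh --top 15` would run the check if it is present and executable, and print a SKIPPED line otherwise.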

Exit Codes

0 - OK.
1 - WARNING or operational issue detected.
2 - invalid input or missing required dependency.
3 - CRITICAL issue detected.
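
A caller might consume these codes along the lines of the sketch below. `status_label` and `run_check` are hypothetical names chosen for illustration; only the four numeric codes come from the list above.

```shell
#!/usr/bin/env bash
# Map a check's exit code to the status documented in this README.
status_label() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "INVALID_INPUT_OR_MISSING_DEPENDENCY" ;;
    3) echo "CRITICAL" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Run a check, leave its output untouched, then print a one-line status.
# The check's exit code is passed back to the caller unchanged.
run_check() {
  local rc=0
  "$@" || rc=$?
  echo "status=$(status_label "$rc") exit_code=$rc"
  return "$rc"
}
```

For example, `run_check ./check_inode_usage.sh --warning 75 --critical 90` prints the check's own output followed by a status line. Because 3 means CRITICAL and 2 means a usage or dependency problem, a caller should not assume that a higher code is always more severe.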

Supported Platforms

These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.

Some data sources vary by distribution:

RHEL-like systems often use /var/log/secure and /var/log/messages.
Debian/Ubuntu systems often use /var/log/auth.log, /var/log/syslog, and /var/log/kern.log.
systemd-based checks require systemctl; journal-based evidence uses journalctl when available.
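
One way to cope with these differences is to probe for a usable source at runtime, as in the sketch below. The helper names `first_readable` and `auth_log_source` are hypothetical, and preferring the journal over log files is an assumption made for the example, not a statement about how the scripts choose.

```shell
#!/usr/bin/env bash
# Print the first readable path among the arguments; fail if none is readable.
first_readable() {
  local candidate
  for candidate in "$@"; do
    if [ -r "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Assumed preference order: journalctl if installed, else a distro auth log.
# Prints "journal", a log file path, or "none" (with a non-zero return).
auth_log_source() {
  if command -v journalctl >/dev/null 2>&1; then
    echo "journal"
  elif first_readable /var/log/secure /var/log/auth.log; then
    :  # first_readable already printed the chosen path
  else
    echo "none"
    return 1
  fi
}
```

Without elevated permissions the auth logs are often unreadable, so a "none" result may indicate a permissions limit rather than missing logs.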

Safety Notes

The check scripts are read-only. The only file written is the report that incident_triage_report.sh writes to its --output path.
Scripts do not restart services, kill processes, remount filesystems, or change time services.
Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
Treat output as triage evidence, not as complete root-cause analysis.

Dependency Notes

Required dependencies vary by script and are checked at runtime. Common dependencies include bash, awk, sed, grep, sort, head, ps, df, free, systemctl, getent, openssl, date, mount, and findmnt.

Optional dependencies include journalctl, ping, ip, ss, timedatectl, chronyc, ntpq, jcmd, jstat, and readable /proc files.
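
A runtime dependency check of this kind can be sketched with `command -v`. The function names below are hypothetical and the messages are illustrative; the return value 2 follows the exit code table in this README.

```shell
#!/usr/bin/env bash
# Print each named command that is not found on PATH, one per line.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Return 2 (documented as "missing required dependency") if any is absent.
require_commands() {
  local missing
  missing=$(missing_commands "$@")
  if [ -n "$missing" ]; then
    echo "ERROR: missing required commands: $(printf '%s' "$missing" | tr '\n' ' ')" >&2
    return 2
  fi
}

# Report absent optional commands without failing.
note_optional_commands() {
  local cmd
  for cmd in $(missing_commands "$@"); do
    echo "NOTE: optional command '$cmd' not found; related evidence skipped"
  done
}
```

A script could call `require_commands awk df findmnt || exit 2` near the top, then `note_optional_commands journalctl chronyc ntpq` so that the report states which evidence was unavailable.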

Copy-To-Server Example

scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'

Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
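
To keep the command, capture time, and exit code together with the output, a small local helper such as the one below may be useful. `capture_evidence` is a hypothetical name and the header format is an assumption for illustration. It runs on the responder's workstation, so the file it writes does not touch the server.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command and save its combined output to a file,
# with the command line, UTC capture time, and exit code recorded alongside.
capture_evidence() {
  local outfile="$1" rc=0
  shift
  {
    echo "# command: $*"
    echo "# captured_utc: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    "$@" 2>&1 || rc=$?
    echo "# exit_code: $rc"
  } > "$outfile"
  return "$rc"
}
```

For example, `capture_evidence oom-evidence.txt ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'` produces a single file that can be attached to the ticket.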

Sample Outputs

Sanitized examples are available in examples:

high-cpu.sample.txt
high-memory-oom.sample.txt
service-restart-loop.sample.txt
failed-ssh-logins.sample.txt
certificate-expiry.sample.txt
dns-connectivity.sample.txt
ntp-time-drift.sample.txt
filesystem-readonly.sample.txt
inode-usage.sample.txt
jvm-threads-heap.sample.txt

A sanitized report sample is available at ../../../examples/incident-triage/l2-incident-triage-report.sample.md.
