Bash Incident Checks
Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident, run without repository context, and pasted into an incident or change ticket as evidence.
They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.
Scripts
- check_high_cpu.sh - load, CPU saturation hint, and top CPU processes.
- check_high_memory_oom.sh - memory and swap pressure plus recent OOM evidence.
- check_service_restart_loop.sh - systemd service state, restart count, and recent failure lines.
- check_failed_ssh_logins.sh - failed SSH login burst review from journal or auth logs.
- check_certificate_expiry.sh - remote or local TLS certificate expiry check.
- check_dns_connectivity.sh - DNS resolution, ping, optional TCP check, and local route hints.
- check_ntp_time_drift.sh - time sync status and offset evidence when available.
- check_filesystem_readonly.sh - read-only filesystem detection.
- check_inode_usage.sh - inode pressure and top affected mount points.
- check_jvm_threads_heap.sh - lightweight JVM process, heap, and thread diagnostics.
- incident_triage_report.sh - wrapper that runs selected checks and writes a single Markdown L2 handover report.
Usage Examples
./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15
./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5
./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"
./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25
./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt
./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432
./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000
./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system
./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90
./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name
./incident_triage_report.sh --type cpu
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
./incident_triage_report.sh --type network --host app.example.com --port 443
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
L2 Triage Report Wrapper
incident_triage_report.sh collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.
Supported report types are cpu, memory, service, network, auth, cert, filesystem, jvm, and all.
The wrapper is read-only apart from writing the requested --output file. It does not require root and skips checks safely when an underlying script is missing, not executable, or missing required context such as --service or --host.
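The skip behavior described above could be sketched roughly as follows. This is an illustration only, not the wrapper's actual code: the function names run_or_skip and skip_if_empty and the SKIPPED message format are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; incident_triage_report.sh may be implemented differently.

# Run a check script with its arguments, or print a skip note when the
# script is missing or not executable. Skipping is not treated as a failure.
run_or_skip() {
  local script="$1"
  shift
  if [ ! -f "$script" ]; then
    echo "SKIPPED: $script not found"
    return 0
  fi
  if [ ! -x "$script" ]; then
    echo "SKIPPED: $script is not executable"
    return 0
  fi
  "$script" "$@"
}

# Print a skip note and return 1 when a required option value is empty.
# $1 is the option name (for the message), $2 is its value.
skip_if_empty() {
  if [ -z "$2" ]; then
    echo "SKIPPED: $1 was not provided"
    return 1
  fi
}
```

With these helpers, a service check would only run when a service name was supplied, for example: skip_if_empty --service "$service" && run_or_skip ./check_service_restart_loop.sh --service "$service"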
Exit Codes
- 0 - OK.
- 1 - WARNING or operational issue detected.
- 2 - invalid input or missing required dependency.
- 3 - CRITICAL issue detected.
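When calling a check from another script, the exit code can be translated back into a status label. The following is a minimal sketch; describe_exit_code and run_check are hypothetical helper names, and the label strings are chosen here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers built on the exit codes listed above.

# Translate a documented exit code into a short label.
describe_exit_code() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "INVALID_INPUT_OR_MISSING_DEPENDENCY" ;;
    3) echo "CRITICAL" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Run any command, let its output pass through, then print a one-line
# summary and return the command's own exit code.
run_check() {
  local rc=0
  "$@" || rc=$?
  echo "exit=${rc} status=$(describe_exit_code "$rc")"
  return "$rc"
}
```

For example, run_check ./check_high_cpu.sh --warning 70 --critical 90 prints the script output followed by a summary line that can be pasted into a ticket.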
Supported Platforms
These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.
Some data sources vary by distribution:
- RHEL-like systems often use /var/log/secure and /var/log/messages.
- Debian/Ubuntu systems often use /var/log/auth.log, /var/log/syslog, and /var/log/kern.log.
- systemd-based checks require systemctl; journal-based evidence uses journalctl when available.
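One way a script might pick an authentication log source across these distributions is sketched below. The helper names first_readable and auth_log_source are hypothetical, and the order of preference (journal first, then log files) is an assumption rather than the documented behavior of the checks.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the real checks may select sources differently.

# Print the first readable path from the arguments; return 1 if none is readable.
first_readable() {
  local f
  for f in "$@"; do
    if [ -r "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}

# Prefer journalctl when installed (an assumption), otherwise fall back to
# the RHEL-like or Debian/Ubuntu auth log, and report "none" if neither is readable.
auth_log_source() {
  local path
  if command -v journalctl >/dev/null 2>&1; then
    echo "journalctl"
  elif path=$(first_readable /var/log/secure /var/log/auth.log); then
    echo "$path"
  else
    echo "none"
  fi
}
```

Reporting "none" explicitly, rather than failing silently, keeps the limitation visible in the ticket evidence.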
Safety Notes
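- Scripts are read-only, apart from the report file that incident_triage_report.sh writes when --output is given.
- Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files other than that report.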
- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
- Treat output as triage evidence, not as complete root-cause analysis.
Dependency Notes
Required dependencies vary by script and are checked at runtime. Common dependencies include bash, awk, sed, grep, sort, head, ps, df, free, systemctl, getent, openssl, date, mount, and findmnt.
Optional dependencies include journalctl, ping, ip, ss, timedatectl, chronyc, ntpq, jcmd, jstat, and readable /proc files.
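A runtime dependency check of this kind can be sketched with command -v. The function names and message wording below are hypothetical; returning 2 for a missing required command follows the exit code convention in this README.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; each script may word its messages differently.

# Verify that every required command exists. Report each missing one on
# stderr and return 2 if any is absent.
check_dependencies() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "MISSING required command: $cmd" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ] || return 2
}

# Return 0 if an optional command exists; otherwise print a note and return 1
# so the caller can skip that part of the check.
have_optional() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "NOTE: optional command $1 not found; related output skipped"
  return 1
}
```

A script could start with check_dependencies awk df ps || exit 2 and guard optional sections with if have_optional chronyc; then ... fi.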
Copy-To-Server Example
scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
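To keep the output and exit code together for the ticket, the remote run can be wrapped on the workstation side. The sketch below uses a hypothetical capture_evidence helper; the header fields and file layout are illustrative and not produced by the scripts themselves.

```shell
#!/usr/bin/env bash
# Hypothetical helper, run locally; it writes only the named evidence file.

# Usage: capture_evidence OUTFILE COMMAND [ARGS...]
# Saves the command line, a UTC timestamp, the combined stdout and stderr,
# and the exit code into OUTFILE, then returns the command's exit code.
capture_evidence() {
  local outfile="$1"
  shift
  local rc=0
  {
    echo "command: $*"
    echo "captured_at: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    "$@" 2>&1 || rc=$?
    echo "exit_code: $rc"
  } > "$outfile"
  return "$rc"
}
```

For example: capture_evidence oom-evidence.txt ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'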
Sample Outputs
Sanitized examples are available in examples:
- high-cpu.sample.txt
- high-memory-oom.sample.txt
- service-restart-loop.sample.txt
- failed-ssh-logins.sample.txt
- certificate-expiry.sample.txt
- dns-connectivity.sample.txt
- ntp-time-drift.sample.txt
- filesystem-readonly.sample.txt
- inode-usage.sample.txt
- jvm-threads-heap.sample.txt
A sanitized report sample is available at ../../../examples/incident-triage/l2-incident-triage-report.sample.md.