# Bash Incident Checks

Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident and run without repository context; their output can be pasted into an incident or change ticket as evidence.

They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.

## Scripts

- `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes.
- `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence.
- `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines.
- `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs.
- `check_certificate_expiry.sh` - remote or local TLS certificate expiry check.
- `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints.
- `check_ntp_time_drift.sh` - time sync status and offset evidence when available.
- `check_filesystem_readonly.sh` - read-only filesystem detection.
- `check_inode_usage.sh` - inode pressure and top affected mount points.
- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.

## Usage Examples

```bash
./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15

./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5

./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"

./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25

./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt

./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432

./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000

./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system

./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90

./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name
```

## Exit Codes

- `0` - OK.
- `1` - WARNING or operational issue detected.
- `2` - invalid input or missing required dependency.
- `3` - CRITICAL issue detected.
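
Because `2` signals a usage or dependency problem rather than a severity between WARNING and CRITICAL, callers should match exit codes by equality instead of comparing them numerically. The following is a minimal sketch of a caller-side wrapper under that assumption; `status_label` and `run_check` are hypothetical helper names and are not part of the scripts themselves.

```bash
# Hypothetical helper: map a check's exit code to the label from the list above.
status_label() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    2) echo "USAGE_OR_DEPENDENCY_ERROR" ;;
    3) echo "CRITICAL" ;;
    *) echo "UNKNOWN" ;;
  esac
}

# Hypothetical wrapper: run a check, let its output pass through unchanged,
# append a one-line summary, and return the check's own exit code.
run_check() {
  local rc=0
  "$@" || rc=$?
  printf 'check=%s exit=%d status=%s\n' "$1" "$rc" "$(status_label "$rc")"
  return "$rc"
}
```

For example, `run_check ./check_inode_usage.sh --warning 75 --critical 90` would print the script's normal output followed by the summary line.
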

## Supported Platforms

These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.

Some data sources vary by distribution:

- RHEL-like systems often use `/var/log/secure` and `/var/log/messages`.
- Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`.
- systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available.
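
One way a check could choose between these sources is sketched below. It is an illustration only: `first_readable` and `auth_log_source` are hypothetical names, and preferring the journal over log files is an assumption about ordering, not something the scripts are confirmed to do.

```bash
# Hypothetical helper: print the first readable file from a list of candidates.
first_readable() {
  local f
  for f in "$@"; do
    if [ -r "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  return 1
}

# Hypothetical selector for SSH authentication evidence.
# Assumed order: journal first, then the RHEL-like path, then the Debian/Ubuntu path.
auth_log_source() {
  if command -v journalctl >/dev/null 2>&1; then
    echo "journal"
  elif first_readable /var/log/secure /var/log/auth.log; then
    :  # first_readable already printed the chosen path
  else
    echo "none"
    return 1
  fi
}
```

Reporting `none` explicitly, rather than failing silently, keeps the missing data source visible in the ticket evidence.
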

## Safety Notes

- Scripts are read-only.
- Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files.
- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
- Treat output as triage evidence, not as complete root-cause analysis.

## Dependency Notes

Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`.

Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files.
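
A runtime dependency check of this kind can be sketched with the shell's `command -v` builtin. The helper names `require_cmd` and `have_optional` and the message wording are assumptions for illustration; the return value `2` follows the exit code listed above for a missing required dependency.

```bash
# Hypothetical helper: fail with code 2 if any required command is missing.
require_cmd() {
  local cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "ERROR: required command not found: $cmd" >&2
      return 2
    fi
  done
}

# Hypothetical helper: succeed if an optional command exists,
# otherwise report the gap and let the caller skip that section.
have_optional() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "NOTE: optional command not found, skipping: $1"
  return 1
}
```

A script might then call `require_cmd awk df || exit 2` near the top and guard optional sections with `if have_optional journalctl; then ...; fi`.
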

## Copy-To-Server Example

```bash
scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
```

Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
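
To make attached evidence easier to interpret later, the command and a UTC timestamp can be recorded alongside the output. The sketch below is one possible approach for the responder's own workstation; `with_evidence_header` is a hypothetical helper and is not shipped with the scripts.

```bash
# Hypothetical wrapper: print a short header, then run the given command.
# Note: "$*" flattens the arguments, so the original quoting is not shown.
with_evidence_header() {
  printf '# utc: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
  printf '# command: %s\n' "$*"
  "$@"  # the wrapper's exit status is the command's exit status
}
```

For example, `with_evidence_header ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"' > oom-evidence.txt` writes the header and the check output to a local file, leaving nothing extra on the server.
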

## Sample Outputs

Sanitized examples are available in [examples](./examples/):

- `high-cpu.sample.txt`
- `high-memory-oom.sample.txt`
- `service-restart-loop.sample.txt`
- `failed-ssh-logins.sample.txt`
- `certificate-expiry.sample.txt`
- `dns-connectivity.sample.txt`
- `ntp-time-drift.sample.txt`
- `filesystem-readonly.sample.txt`
- `inode-usage.sample.txt`
- `jvm-threads-heap.sample.txt`