Add standalone Bash incident check scripts

2026-05-11 18:49:00 +00:00
parent 8a7b7c5abc
commit e851568c8c
27 changed files with 1623 additions and 6 deletions
@@ -0,0 +1,20 @@
+WARNING: Certificate for app.example.com:443 expires in 18 day(s)
+
+Certificate details:
+Subject: CN = app.example.com
+Issuer: C = US, O = Example CA, CN = Example Intermediate CA
+notBefore: Apr 11 00:00:00 2026 GMT
+notAfter: May 29 23:59:59 2026 GMT
+SAN/CN: DNS:app.example.com, DNS:api.example.com
+
+Evidence:
+Target: app.example.com:443
+SNI: app.example.com
+Thresholds: warning=30 days critical=7 days
+
+Recommended next steps:
+- Renew certificate before the operational threshold is breached
+- Check the full chain and intermediate certificates
+- Check the load balancer, ingress, or reverse proxy serving this certificate
+- Verify monitoring threshold and alert ownership
+- Attach this output to incident or change ticket
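The status line in this sample follows from comparing the days remaining with the two thresholds listed under Evidence. A minimal sketch of that comparison is below. The function names `classify_days_left` and `days_until` are hypothetical and not taken from the commit, and treating a value equal to a threshold as a breach is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit: maps days-until-expiry to a status.
# Assumption: a value at or below a threshold triggers that level.
classify_days_left() {
  local days_left=$1 warning_days=$2 critical_days=$3
  if (( days_left <= critical_days )); then
    echo "CRITICAL"
  elif (( days_left <= warning_days )); then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Whole days between "now" and the expiry time, both given as epoch seconds.
days_until() {
  local now_epoch=$1 expiry_epoch=$2
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# Values from the sample output: 18 days left, warning=30, critical=7.
classify_days_left 18 30 7
```

With the sample values this prints `WARNING`, which agrees with the first line of the sample output.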
@@ -0,0 +1,23 @@
+OK: DNS=OK ping=OK tcp_443=OK
+
+DNS result:
+93.184.216.34 example.com
+
+Ping result:
+3 packets transmitted, 3 received, 0% packet loss, time 2002ms
+
+TCP port result:
+OK: TCP connection to example.com:443 succeeded
+
+Local network hints:
+default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15
+
+Evidence:
+Host: example.com count=3 timeout=3s port=443
+
+Recommended next steps:
+- Verify the DNS record and resolver path
+- Check firewall, routing, security group, or proxy policy
+- Compare results from another host or network segment
+- Check application endpoint health after network reachability is confirmed
+- Attach this output to incident ticket
@@ -0,0 +1,26 @@
+CRITICAL: Found 73 failed SSH login attempt(s) for requested window
+
+Top source IPs:
+52 203.0.113.44
+12 198.51.100.20
+9 192.0.2.10
+
+Top attempted users:
+31 admin
+24 oracle
+18 root
+
+Sample recent lines:
+May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
+May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20
+
+Evidence:
+Thresholds: warning=20 critical=50 since="1 hour ago"
+Log source: journalctl
+
+Recommended next steps:
+- Verify source IPs against expected scanners, admins, or automation
+- Check firewall, fail2ban, or security tooling state
+- Confirm whether the attempts are expected for this host
+- Review successful logins too, not only failures
+- Attach this output to incident ticket
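The "Top source IPs" block is a frequency count over the failed-login lines. One way such a count could be produced is sketched below. The function name `top_source_ips` is hypothetical, and the approach (take the field after the word "from") is an assumption based on the two sample lines, not the commit's actual parsing logic. The middle input line is a made-up duplicate added only to make the count visible.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads sshd log lines on stdin and prints "<count> <ip>"
# for each source address, most frequent first.
top_source_ips() {
  awk '{
    # The source address is the field that follows the word "from".
    for (i = 1; i < NF; i++) {
      if ($i == "from") { print $(i + 1); break }
    }
  }' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Demo input: first and last lines come from the sample; the middle one is invented.
top_source_ips <<'EOF'
May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
May 11 10:01:04 host sshd[2203]: Failed password for invalid user admin from 203.0.113.44 port 51244 ssh2
May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20
EOF
```

The trailing `awk` only strips the padding that `uniq -c` adds, so the result matches the compact "count address" layout of the sample.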
@@ -0,0 +1,16 @@
+CRITICAL: Found 1 read-only filesystem(s)
+
+Read-only filesystems:
+MOUNT_POINT	SOURCE	FSTYPE	OPTIONS
+/data	/dev/mapper/vg_data-lv_data	xfs	ro,relatime,seclabel,attr2,inode64
+
+Evidence:
+include_system=0
+Collector: findmnt
+
+Recommended next steps:
+- Check dmesg or journal logs for I/O errors and filesystem remount events
+- Check storage path, multipath, SAN, cloud volume, or underlying disk health
+- Check filesystem health with the platform-approved procedure
+- Do not remount read-write before understanding the cause
+- Attach this output to incident ticket
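Detecting a read-only mount comes down to finding `ro` as a whole token in the OPTIONS column. A small sketch of that test follows. The name `has_ro_option` is hypothetical, and the commit may well filter inside the collector instead.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a comma-separated mount option list
# contains "ro" as a complete option (not as part of another option).
has_ro_option() {
  case ",$1," in
    *,ro,*) return 0 ;;
    *)      return 1 ;;
  esac
}

# Options string taken from the sample row for /data.
if has_ro_option "ro,relatime,seclabel,attr2,inode64"; then
  echo "read-only"
fi
```

Wrapping the list in commas keeps options such as `errors=remount-ro` from matching by accident.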
@@ -0,0 +1,22 @@
+WARNING: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)
+
+Load average:
+1m=7.82 5m=6.91 15m=5.40
+
+CPU count:
+8
+
+Top CPU processes:
+PID   PPID  USER      %CPU %MEM COMMAND          COMMAND
+2314  1     app       245  12.1 java             java -jar order-api.jar
+991   1     root      38   0.4  backup-agent     backup-agent --scan
+
+Evidence:
+WARNING: load is close to online CPU count; runnable task saturation is possible
+
+Recommended next steps:
+- Check process ownership and whether the top process is expected
+- Check recent deployments, cron jobs, batch jobs, or maintenance activity
+- Review logs for the top CPU-consuming process
+- Compare with longer trend data from monitoring before taking action
+- Attach this output to the incident ticket
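The "97% of CPU count" figure can be reproduced with integer arithmetic alone, which matters because Bash has no floating point. The sketch below assumes the percentage is truncated rather than rounded, since 7.82 / 8 is 97.75% and the sample reports 97%. The function name `load_percent` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a load average as a whole-number percentage of
# the CPU count. The load is converted to hundredths first to avoid floats.
load_percent() {
  local load=$1 cpus=$2
  local whole=${load%%.*} frac=0
  if [[ $load == *.* ]]; then
    frac=${load#*.}
  fi
  frac="${frac}00"      # pad so "7.8" is read as 80 hundredths
  frac=${frac:0:2}      # keep two decimal places
  local hundredths=$(( 10#$whole * 100 + 10#$frac ))
  echo $(( hundredths / cpus ))   # integer division truncates
}

# Values from the sample output: 1-minute load 7.82 on 8 CPUs.
load_percent 7.82 8
```

For the sample values this prints `97`. The `10#` prefix forces base 10 so that a fraction such as `08` is not rejected as invalid octal.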
@@ -0,0 +1,25 @@
+WARNING: Memory usage is 84% and swap usage is 12%
+
+Memory summary:
+              total        used        free      shared  buff/cache   available
+Mem:          15934       13386         512         121        2036        2101
+Swap:          4095         512        3583
+
+Top memory processes:
+PID     RSS_MB   COMMAND
+1234    2048     java
+987     812      postgres
+
+OOM events since 24 hours ago:
+2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)
+
+Evidence:
+Thresholds: warning=80% critical=90% since="24 hours ago"
+OOM evidence source: journalctl
+
+Recommended next steps:
+- Check application memory trend
+- Review JVM heap settings if process is Java
+- Verify swap pressure and paging activity
+- Confirm whether OOM events align with application impact
+- Attach this output to incident ticket
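The two percentages in the status line agree with used divided by total from the memory summary (13386 / 15934 and 512 / 4095, both truncated). The sketch below reproduces them under that assumption. The commit's script may instead base memory usage on the `available` column, and the name `usage_percent` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `free -m` style text on stdin and prints
# used/total as a truncated percentage for the row labelled $1.
usage_percent() {
  awk -v label="$1" '
    $1 == label {
      if ($2 > 0) { printf "%d\n", ($3 * 100) / $2 } else { print 0 }
    }'
}

# Rows copied from the sample memory summary.
sample='Mem:          15934       13386         512         121        2036        2101
Swap:          4095         512        3583'

printf '%s\n' "$sample" | usage_percent Mem:
printf '%s\n' "$sample" | usage_percent Swap:
```

This prints `84` and then `12`, the same figures as the sample status line. The zero check avoids a division error on hosts with no swap configured.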
@@ -0,0 +1,22 @@
+WARNING: Highest inode usage is 87%
+
+Filesystems above threshold:
+/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var
+
+Inode usage table:
+Filesystem                 Inodes   IUsed  IFree IUse% Mounted on
+/dev/mapper/vg_root-lv_root 524288   91300 432988   18% /
+/dev/mapper/vg_var-lv_var  1310720 1140326 170394   87% /var
+
+Top affected mount points:
+87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394
+
+Evidence:
+Thresholds: warning=80% critical=90%
+
+Recommended next steps:
+- Find directories with many small files under affected mount points
+- Check logs, cache, spool, session, and temporary directories
+- Avoid deleting blindly; confirm ownership and application impact first
+- Confirm whether inode exhaustion is causing write or deploy failures
+- Attach this output to incident ticket
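The "Filesystems above threshold" row looks like the matching line of the inode table with its columns squeezed to single spaces. A sketch of a filter that yields that shape is below. The name `inodes_over_threshold` is hypothetical, and the at-or-above comparison is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `df -i` style text on stdin and prints rows whose
# IUse% is at or above the threshold in $1, fields separated by single spaces.
inodes_over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      pct = $5
      sub(/%$/, "", pct)          # "87%" becomes "87"
      if (pct + 0 >= limit + 0) { print $1, $2, $3, $4, $5, $6 }
    }'
}

# Table copied from the sample output; warning threshold is 80.
inodes_over_threshold 80 <<'EOF'
Filesystem                 Inodes   IUsed  IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root 524288   91300 432988   18% /
/dev/mapper/vg_var-lv_var  1310720 1140326 170394   87% /var
EOF
```

Only the `/var` row is printed. Mount points that contain spaces would need different field handling than this sketch provides.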
@@ -0,0 +1,30 @@
+OK: JVM diagnostics collected for PID 1234
+
+Detected JVM process:
+PID USER RSS_MB CPU COMMAND
+1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar
+Thread count: 188
+
+Heap and JVM evidence:
+
+[jcmd VM.flags]
+1234:
+-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648
+
+[jcmd GC.heap_info]
+garbage-first heap total 2097152K, used 1521000K
+
+[jcmd Thread.print summary]
+102 java.lang.Thread.State: WAITING
+53 java.lang.Thread.State: RUNNABLE
+33 java.lang.Thread.State: TIMED_WAITING
+
+Evidence:
+PID=1234 thread_count=188 top=10
+
+Recommended next steps:
+- Review GC logs and recent application errors
+- Check JVM heap sizing against container or host memory limits
+- Check thread count trend in monitoring before concluding a leak
+- Capture jstack only if approved by operational process
+- Attach this output to incident ticket
@@ -0,0 +1,23 @@
+WARNING: Time sync status=yes offset_ms=812
+
+Time status:
+System time: 2026-05-11 10:18:01 UTC +0000
+Timezone: UTC +0000
+Detected tool: chronyc
+NTP synchronized: yes
+Offset ms: 812
+
+Tool evidence:
+Reference ID    : 203.0.113.10
+System time     : 0.812345 seconds fast of NTP time
+Last offset     : +0.812345 seconds
+
+Evidence:
+Thresholds: warning=500ms critical=5000ms
+
+Recommended next steps:
+- Verify chrony or ntpd service status and configuration
+- Check NTP sources and reachability
+- Check virtualization host time if this is a VM
+- Avoid restarting time services blindly in production
+- Attach this output to incident ticket
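The `offset_ms=812` value corresponds to the "System time" line in the tool evidence, converted from seconds to whole milliseconds. A sketch of that conversion and the threshold check follows. The names `offset_ms` and `classify_offset` are hypothetical, and both the truncation and the at-or-above comparison are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `chronyc tracking` style text on stdin and prints
# the magnitude of the "System time" offset in whole milliseconds.
offset_ms() {
  awk '$1 == "System" && $2 == "time" && $3 == ":" { printf "%d\n", $4 * 1000 }'
}

# Assumption: an offset at or above a threshold triggers that level.
classify_offset() {
  local ms=$1 warning_ms=$2 critical_ms=$3
  if (( ms >= critical_ms )); then
    echo "CRITICAL"
  elif (( ms >= warning_ms )); then
    echo "WARNING"
  else
    echo "OK"
  fi
}

# Line copied from the sample tool evidence; thresholds from the Evidence block.
ms=$(printf '%s\n' 'System time     : 0.812345 seconds fast of NTP time' | offset_ms)
classify_offset "$ms" 500 5000
```

This prints `WARNING`. The direction of the offset ("fast" or "slow") is ignored here because only the magnitude is compared with the thresholds.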
@@ -0,0 +1,27 @@
+CRITICAL: Service app.service state=failed substate=failed restarts=12
+
+Service state:
+app.service - Example application
+   Loaded: loaded (/etc/systemd/system/app.service; enabled)
+   Active: failed (Result: exit-code)
+
+Systemd properties:
+Id=app.service
+ActiveState=failed
+SubState=failed
+Result=exit-code
+NRestarts=12
+
+Recent start/stop/failure log lines since 1 hour ago:
+May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
+May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.
+
+Evidence:
+Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"
+
+Recommended next steps:
+- Inspect the unit file and drop-in overrides
+- Review application logs around the restart timestamps
+- Check dependencies such as network, storage, database, or secrets
+- Verify recent configuration or package changes
+- Do not restart blindly; attach this output to the incident ticket
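The status line combines values from the "Systemd properties" block with the restart thresholds. A sketch of reading those `KEY=VALUE` lines and deriving a status is below. The name `get_property` is hypothetical, and the rule that a failed unit is always critical is an assumption, as is the at-or-above comparison.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads KEY=VALUE lines on stdin and prints the value
# for the key given in $1 (first match only).
get_property() {
  awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# Properties copied from the sample output.
props='Id=app.service
ActiveState=failed
SubState=failed
Result=exit-code
NRestarts=12'

restarts=$(printf '%s\n' "$props" | get_property NRestarts)
state=$(printf '%s\n' "$props" | get_property ActiveState)

# Thresholds from the Evidence block: warning=3 restarts, critical=10 restarts.
if [[ $state == failed ]] || (( restarts >= 10 )); then
  echo "CRITICAL"
elif (( restarts >= 3 )); then
  echo "WARNING"
else
  echo "OK"
fi
```

For the sample properties this prints `CRITICAL`. Matching on the key followed by `=` at the start of the line keeps a key such as `Result` from matching a longer key that shares its prefix.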