Add standalone Bash incident check scripts

Mateusz Suski
2026-05-11 18:49:00 +00:00
parent 8a7b7c5abc
commit e851568c8c
27 changed files with 1623 additions and 6 deletions
@@ -0,0 +1,20 @@
WARNING: Certificate for app.example.com:443 expires in 18 day(s)
Certificate details:
Subject: CN = app.example.com
Issuer: C = US, O = Example CA, CN = Example Intermediate CA
notBefore: Apr 11 00:00:00 2026 GMT
notAfter: May 29 23:59:59 2026 GMT
SAN/CN: DNS:app.example.com, DNS:api.example.com
Evidence:
Target: app.example.com:443
SNI: app.example.com
Thresholds: warning=30 days critical=7 days
Recommended next steps:
- Renew certificate before the operational threshold is breached
- Check the full chain and intermediate certificates
- Check the load balancer, ingress, or reverse proxy serving this certificate
- Verify monitoring threshold and alert ownership
- Attach this output to the incident or change ticket
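The 18-day figure and the WARNING level in the sample above follow from the notAfter date and the two thresholds. A minimal sketch of that arithmetic, assuming GNU date is available. The function names `days_until_expiry` and `classify_days` are hypothetical and not taken from this commit, and treating the thresholds as inclusive is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and boundary handling are assumptions,
# not the commit's actual implementation. Requires GNU date (-d).

# Print whole days between a reference time and a notAfter value.
# $1: notAfter string, e.g. "May 29 23:59:59 2026 GMT"
# $2: reference time (defaults to "now")
days_until_expiry() {
  local not_after=$1 now=${2:-now} exp_epoch now_epoch
  exp_epoch=$(date -u -d "$not_after" +%s) || return 2
  now_epoch=$(date -u -d "$now" +%s) || return 2
  echo $(( (exp_epoch - now_epoch) / 86400 ))
}

# Map remaining days to a severity; fewer days is worse.
# $1: days left, $2: warning threshold, $3: critical threshold
classify_days() {
  local days=$1 warn=${2:-30} crit=${3:-7}
  if [ "$days" -le "$crit" ]; then
    echo CRITICAL
  elif [ "$days" -le "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Usage (not executed here):
#   end=$(openssl x509 -noout -enddate -in cert.pem | cut -d= -f2)
#   classify_days "$(days_until_expiry "$end")" 30 7
```

With the sample's notAfter value and the commit timestamp as the reference time, this yields 18 days and WARNING.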
@@ -0,0 +1,23 @@
OK: DNS=OK ping=OK tcp_443=OK
DNS result:
93.184.216.34 example.com
Ping result:
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
TCP port result:
OK: TCP connection to example.com:443 succeeded
Local network hints:
default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15
Evidence:
Host: example.com count=3 timeout=3s port=443
Recommended next steps:
- Verify the DNS record and resolver path
- Check firewall, routing, security group, or proxy policy
- Compare results from another host or network segment
- Check application endpoint health after network reachability is confirmed
- Attach this output to the incident ticket
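The TCP port result above can be produced without extra tools. A minimal sketch, assuming bash was built with `/dev/tcp` support and coreutils `timeout` is installed. The function name `tcp_check` and the wording of the failure line are assumptions, since the sample only shows the success case.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit.
# Opens a TCP connection through bash's /dev/tcp pseudo-device,
# bounded by coreutils `timeout`. Returns 0 on success, 2 on failure.
# $1: host, $2: port, $3: timeout in seconds (default 3)
tcp_check() {
  local host=$1 port=$2 timeout_s=${3:-3}
  if timeout "$timeout_s" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "OK: TCP connection to ${host}:${port} succeeded"
  else
    # Failure wording is an assumption; the sample shows only the OK line.
    echo "CRITICAL: TCP connection to ${host}:${port} failed"
    return 2
  fi
}

# Usage (not executed here):
#   tcp_check example.com 443 3
```

Passing host and port as positional arguments to the inner shell avoids interpolating them into the command string.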
@@ -0,0 +1,26 @@
CRITICAL: Found 73 failed SSH login attempt(s) for requested window
Top source IPs:
52 203.0.113.44
12 198.51.100.20
9 192.0.2.10
Top attempted users:
31 admin
24 oracle
18 root
Sample recent lines:
May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20
Evidence:
Thresholds: warning=20 critical=50 since="1 hour ago"
Log source: journalctl
Recommended next steps:
- Verify source IPs against expected scanners, admins, or automation
- Check firewall, fail2ban, or security tooling state
- Confirm whether the attempts are expected for this host
- Review successful logins too, not only failures
- Attach this output to the incident ticket
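The "Top source IPs" table above is a count of matching log lines grouped by address. A minimal sketch of that grouping, reading log lines on standard input. The function name `top_source_ips` is hypothetical, only IPv4 sources are handled, and matching both message types (as the sample lines suggest) may count a single attempt twice when sshd logs both lines for it.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit.
# Reads sshd log lines on stdin and prints "COUNT IP" per IPv4 source,
# highest count first.
top_source_ips() {
  grep -E 'Failed password|Invalid user' \
    | grep -oE 'from [0-9]{1,3}(\.[0-9]{1,3}){3}' \
    | awk '{ print $2 }' \
    | sort | uniq -c | sort -rn \
    | awk '{ printf "%d %s\n", $1, $2 }'
}

# Usage (not executed here):
#   journalctl -u sshd --since "1 hour ago" --no-pager | top_source_ips
```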
@@ -0,0 +1,16 @@
CRITICAL: Found 1 read-only filesystem(s)
Read-only filesystems:
MOUNT_POINT SOURCE FSTYPE OPTIONS
/data /dev/mapper/vg_data-lv_data xfs ro,relatime,seclabel,attr2,inode64
Evidence:
include_system=0
Collector: findmnt
Recommended next steps:
- Check dmesg or journal logs for I/O errors and filesystem remount events
- Check storage path, multipath, SAN, cloud volume, or underlying disk health
- Check filesystem health with the platform-approved procedure
- Do not remount read-write before understanding the cause
- Attach this output to the incident ticket
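The read-only table above can be derived by filtering the mount options column. A minimal sketch, assuming input in the four-column form that `findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS` prints. The function name `readonly_mounts` is hypothetical, and the sample's `include_system` switch is not modelled here.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit.
# Reads "TARGET SOURCE FSTYPE OPTIONS" lines on stdin and prints those
# whose option list contains the standalone flag "ro".
# The anchored pattern avoids matching options such as errors=remount-ro.
readonly_mounts() {
  awk '$4 ~ /(^|,)ro(,|$)/ { print $1, $2, $3, $4 }'
}

# Usage (not executed here):
#   findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS | readonly_mounts
```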
@@ -0,0 +1,22 @@
WARNING: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)
Load average:
1m=7.82 5m=6.91 15m=5.40
CPU count:
8
Top CPU processes:
PID PPID USER %CPU %MEM COMMAND COMMAND
2314 1 app 245 12.1 java java -jar order-api.jar
991 1 root 38 0.4 backup-agent backup-agent --scan
Evidence:
WARNING: load is close to online CPU count; runnable task saturation is possible
Recommended next steps:
- Check process ownership and whether the top process is expected
- Check recent deployments, cron jobs, batch jobs, or maintenance activity
- Review logs for the top CPU-consuming process
- Compare with longer trend data from monitoring before taking action
- Attach this output to the incident ticket
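The "97% of CPU count" figure above is the 1-minute load divided by the number of online CPUs. A minimal sketch of that calculation; the function name `load_percent` is hypothetical, and truncating rather than rounding is an assumption chosen because it matches the sample (7.82 / 8 = 97.75%).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit.
# Prints the load average as a truncated percentage of the CPU count.
# $1: load average (e.g. 7.82), $2: CPU count (must be >= 1)
load_percent() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%d\n", int(load * 100 / cpus) }'
}

# Usage (not executed here):
#   read -r one_min _ < /proc/loadavg
#   load_percent "$one_min" "$(nproc)"
```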
@@ -0,0 +1,25 @@
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
total used free shared buff/cache available
Mem: 15934 13386 512 121 2036 2101
Swap: 4095 512 3583
Top memory processes:
PID RSS_MB COMMAND
1234 2048 java
987 812 postgres
OOM events since 24 hours ago:
2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)
Evidence:
Thresholds: warning=80% critical=90% since="24 hours ago"
OOM evidence source: journalctl
Recommended next steps:
- Check application memory trend
- Review JVM heap settings if process is Java
- Verify swap pressure and paging activity
- Confirm whether OOM events align with application impact
- Attach this output to the incident ticket
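The 84% and 12% figures above are consistent with used/total from the `free` table (13386/15934 and 512/4095), truncated to whole percent. A minimal sketch under that assumption; the function name `mem_percentages` is hypothetical, and the actual script may use a different formula (for example one based on the "available" column).

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit.
# Reads `free -m` output on stdin and prints "mem=N swap=N", where N is
# used/total as a truncated percentage. Swap is 0 when none is configured.
mem_percentages() {
  awk '
    /^Mem:/  { mem  = ($2 > 0) ? int($3 * 100 / $2) : 0 }
    /^Swap:/ { swap = ($2 > 0) ? int($3 * 100 / $2) : 0 }
    END      { printf "mem=%d swap=%d\n", mem, swap }
  '
}

# Usage (not executed here):
#   free -m | mem_percentages
```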
@@ -0,0 +1,22 @@
WARNING: Highest inode usage is 87%
Filesystems above threshold:
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var
Inode usage table:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root 524288 91300 432988 18% /
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var
Top affected mount points:
87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394
Evidence:
Thresholds: warning=80% critical=90%
Recommended next steps:
- Find directories with many small files under affected mount points
- Check logs, cache, spool, session, and temporary directories
- Avoid deleting blindly; confirm ownership and application impact first
- Confirm whether inode exhaustion is causing write or deploy failures
- Attach this output to the incident ticket
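The "Filesystems above threshold" list above is the inode table filtered on the IUse% column. A minimal sketch, assuming input in the form printed by `df -Pi`. The function name `inodes_above` is hypothetical, and treating the threshold as inclusive is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit.
# Reads `df -Pi` output on stdin and prints rows whose IUse% is at or
# above the given threshold. Rows with "-" in IUse% evaluate to 0.
# $1: threshold in percent (default 80)
inodes_above() {
  awk -v threshold="${1:-80}" '
    NR > 1 {
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 >= threshold) print $0
    }
  '
}

# Usage (not executed here):
#   df -Pi | inodes_above 80
```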
@@ -0,0 +1,30 @@
OK: JVM diagnostics collected for PID 1234
Detected JVM process:
PID USER RSS_MB CPU COMMAND
1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar
Thread count: 188
Heap and JVM evidence:
[jcmd VM.flags]
1234:
-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648
[jcmd GC.heap_info]
garbage-first heap total 2097152K, used 1521000K
[jcmd Thread.print summary]
102 java.lang.Thread.State: WAITING
53 java.lang.Thread.State: RUNNABLE
33 java.lang.Thread.State: TIMED_WAITING
Evidence:
PID=1234 thread_count=188 top=10
Recommended next steps:
- Review GC logs and recent application errors
- Check JVM heap sizing against container or host memory limits
- Check thread count trend in monitoring before concluding a leak
- Capture jstack only if approved by operational process
- Attach this output to the incident ticket
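The thread state summary above (102 WAITING, 53 RUNNABLE, 33 TIMED_WAITING, 188 in total) is a grouped count over a thread dump. A minimal sketch of that grouping, reading dump text on standard input; the function name `thread_state_summary` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit.
# Reads a JVM thread dump on stdin and prints "COUNT java.lang.Thread.State: STATE",
# highest count first. Qualifiers such as "(parking)" are dropped.
thread_state_summary() {
  grep -oE 'java\.lang\.Thread\.State: [A-Z_]+' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2, $3 }'
}

# Usage (not executed here):
#   jcmd 1234 Thread.print | thread_state_summary
```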
@@ -0,0 +1,23 @@
WARNING: Time sync status=yes offset_ms=812
Time status:
System time: 2026-05-11 10:18:01 UTC +0000
Timezone: UTC +0000
Detected tool: chronyc
NTP synchronized: yes
Offset ms: 812
Tool evidence:
Reference ID : 203.0.113.10
System time : 0.812345 seconds fast of NTP time
Last offset : +0.812345 seconds
Evidence:
Thresholds: warning=500ms critical=5000ms
Recommended next steps:
- Verify chrony or ntpd service status and configuration
- Check NTP sources and reachability
- Check virtualization host time if this is a VM
- Avoid restarting time services blindly in production
- Attach this output to the incident ticket
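The offset of 812 ms above corresponds to the "System time" line in the tool evidence, converted from seconds to milliseconds and compared with the thresholds. A minimal sketch, assuming input in the form printed by `chronyc tracking`. The function names `offset_ms` and `classify_offset` are hypothetical, only the magnitude is used (fast and slow are treated alike), and inclusive thresholds are an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not taken from the commit.

# Reads `chronyc tracking` output on stdin and prints the "System time"
# offset magnitude in whole milliseconds (truncated).
offset_ms() {
  awk '/^System time/ { printf "%d\n", int($4 * 1000) }'
}

# Map an offset to a severity; a larger offset is worse.
# $1: offset in ms, $2: warning threshold, $3: critical threshold
classify_offset() {
  local ms=$1 warn=${2:-500} crit=${3:-5000}
  if [ "$ms" -ge "$crit" ]; then
    echo CRITICAL
  elif [ "$ms" -ge "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Usage (not executed here):
#   classify_offset "$(chronyc tracking | offset_ms)" 500 5000
```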
@@ -0,0 +1,27 @@
CRITICAL: Service app.service state=failed substate=failed restarts=12
Service state:
app.service - Example application
Loaded: loaded (/etc/systemd/system/app.service; enabled)
Active: failed (Result: exit-code)
Systemd properties:
Id=app.service
ActiveState=failed
SubState=failed
Result=exit-code
NRestarts=12
Recent start/stop/failure log lines since 1 hour ago:
May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.
Evidence:
Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"
Recommended next steps:
- Inspect the unit file and drop-in overrides
- Review application logs around the restart timestamps
- Check dependencies such as network, storage, database, or secrets
- Verify recent configuration or package changes
- Do not restart blindly; attach this output to the incident ticket
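The CRITICAL headline above combines the unit state with the restart counter. A minimal sketch of one way to derive it from `KEY=VALUE` properties such as those printed by `systemctl show`. The function name `service_severity` is hypothetical, and both the rule that a failed unit is always CRITICAL and the inclusive thresholds are assumptions rather than the commit's confirmed logic.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not taken from the commit.
# Reads systemd properties (KEY=VALUE per line) on stdin and prints a severity.
# Assumed rule: ActiveState=failed or NRestarts >= critical -> CRITICAL,
# NRestarts >= warning -> WARNING, otherwise OK.
# $1: warning restart threshold, $2: critical restart threshold
service_severity() {
  local warn=${1:-3} crit=${2:-10} key value state="" restarts=0
  while IFS='=' read -r key value; do
    case $key in
      ActiveState) state=$value ;;
      NRestarts)   restarts=$value ;;
    esac
  done
  if [ "$state" = "failed" ] || [ "$restarts" -ge "$crit" ]; then
    echo CRITICAL
  elif [ "$restarts" -ge "$warn" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Usage (not executed here):
#   systemctl show app.service -p ActiveState -p SubState -p NRestarts \
#     | service_severity 3 10
```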