# L2 Incident Triage Report

- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago

## Executed Checks

| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |

## Summary

- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected
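
The summary percentages can be re-derived from the raw figures in the evidence as a quick sanity check before handoff. This is a minimal sketch under stated assumptions: the `Mem:`/`Swap:` rows are assumed to follow the `free -m` column order (total, used, free, shared, buff/cache, available), "usage" is assumed to mean used/total, and the scripts are assumed to truncate rather than round. None of this is confirmed by the report, and the function names are hypothetical.

```python
from __future__ import annotations

import math
from fractions import Fraction


def load_percent(load_1m: str, cpu_count: int) -> int:
    """Return the 1-minute load as a whole percentage of the CPU count.

    Fraction keeps "0.42" exact, so 10.5% truncates to 10 without
    depending on float rounding.
    """
    return math.trunc(Fraction(load_1m) * 100 / cpu_count)


def used_percent(total: int, used: int) -> int:
    """Return used/total as a whole percentage, truncated (assumed script behaviour)."""
    return used * 100 // total


def parse_free_row(line: str) -> list[int]:
    """Turn a 'Mem:' or 'Swap:' row into its numeric columns, dropping the label."""
    return [int(field) for field in line.split()[1:]]


if __name__ == "__main__":
    mem = parse_free_row("Mem: 15800 13272 1110 210 1418 1840")
    swap = parse_free_row("Swap: 4095 512 3583")
    # Assumed column order (as in `free -m`): total, used, free, ...
    print(f"CPU load: {load_percent('0.42', 4)}% of CPU count")
    print(f"Memory used: {used_percent(mem[0], mem[1])}%")
    print(f"Swap used: {used_percent(swap[0], swap[1])}%")
```

Run as a script, it prints 10%, 84%, and 12%, matching the summary. Truncation is the rule consistent with all three figures: swap is 512/4095, about 12.5%, which rounding to nearest would report as 13%.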
## Raw Evidence

### CPU saturation

- Script: `check_high_cpu.sh`
- Command: `./check_high_cpu.sh`
- Status: OK, exit: 0

```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
Load average: 1m=0.42 5m=0.38 15m=0.31
Top CPU processes:
 PID PPID USER %CPU %MEM COMMAND ARGS
1450    1 app   7.2  2.1 nginx   nginx: worker process
Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```

### Memory and OOM

- Script: `check_high_memory_oom.sh`
- Command: `./check_high_memory_oom.sh --since "30 minutes ago"`
- Status: WARNING, exit: 1

```text
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
Mem:   15800  13272   1110    210   1418   1840
Swap:   4095    512   3583
OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```

### Service restart loop

- Script: `check_service_restart_loop.sh`
- Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`
- Status: OK, exit: 0

```text
OK: Service nginx state=active substate=running restarts=0
Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```

### Skipped or limited checks

```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run.
No service restarts, process kills, remounts, or configuration changes were attempted.
```

## L2 Handover Checklist

- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified

## Escalation Notes

- Escalate when impact is active, spreading, or customer-facing, or when the fix needs access beyond L2.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder can close those gaps instead of rediscovering them.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.

## Recommended Next Steps

- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
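
The Executed Checks table pairs each exit code with a status (0 with OK and 1 with WARNING in this run). Below is a minimal Python sketch of how a wrapper might run one read-only check and render its table row. It assumes the scripts follow the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN); the report itself only shows 0 and 1. `STATUS_BY_EXIT`, `CheckResult`, and `run_check` are hypothetical names, not part of the triage scripts.

```python
from __future__ import annotations

import shlex
import subprocess
from dataclasses import dataclass

# Assumption: Nagios-style exit codes. Only 0 and 1 appear in this report.
STATUS_BY_EXIT = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


@dataclass
class CheckResult:
    name: str
    script: str
    argv: list[str]
    exit_code: int
    output: str  # kept verbatim for the Raw Evidence section

    @property
    def status(self) -> str:
        # Any exit code outside the known range is reported as UNKNOWN.
        return STATUS_BY_EXIT.get(self.exit_code, "UNKNOWN")

    def table_row(self) -> str:
        """Render one row of the Executed Checks table."""
        command = shlex.join(self.argv)
        return (f"| {self.name} | `{self.script}` | {self.status} "
                f"| {self.exit_code} | `{command}` |")


def run_check(name: str, script: str, argv: list[str],
              timeout: float = 60) -> CheckResult:
    """Run one read-only check without a shell and capture exit code and stdout.

    Raises subprocess.TimeoutExpired if the check runs past `timeout` seconds.
    """
    proc = subprocess.run(argv, capture_output=True, text=True,
                          timeout=timeout, check=False)
    return CheckResult(name, script, argv, proc.returncode, proc.stdout)


if __name__ == "__main__":
    # Values taken from the Memory and OOM row of this report.
    result = CheckResult(
        name="Memory and OOM",
        script="check_high_memory_oom.sh",
        argv=["./check_high_memory_oom.sh", "--since", "30 minutes ago"],
        exit_code=1,
        output="WARNING: Memory usage is 84% and swap usage is 12%",
    )
    print(result.table_row())
```

For the memory row, `shlex.join` quotes the window as `'30 minutes ago'` rather than with the double quotes used in the table; both forms are equivalent in a POSIX shell. Passing an argument list instead of a shell string means the window text is never reinterpreted by a shell.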