L2 Incident Triage Report
- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago
Executed Checks
| Check | Script | Status | Exit | Command |
|---|---|---|---|---|
| CPU saturation | check_high_cpu.sh | OK | 0 | ./check_high_cpu.sh |
| Memory and OOM | check_high_memory_oom.sh | WARNING | 1 | ./check_high_memory_oom.sh --since "30 minutes ago" |
| Service restart loop | check_service_restart_loop.sh | OK | 0 | ./check_service_restart_loop.sh --service nginx --since "30 minutes ago" |
| DNS and connectivity | check_dns_connectivity.sh | OK | 0 | ./check_dns_connectivity.sh --host app.example.com --port 443 |
| Failed SSH logins | check_failed_ssh_logins.sh | OK | 0 | ./check_failed_ssh_logins.sh --since "30 minutes ago" |
| Certificate expiry | check_certificate_expiry.sh | OK | 0 | ./check_certificate_expiry.sh --host app.example.com --port 443 |
| Read-only filesystems | check_filesystem_readonly.sh | OK | 0 | ./check_filesystem_readonly.sh |
| Inode usage | check_inode_usage.sh | OK | 0 | ./check_inode_usage.sh |
| JVM threads and heap | check_jvm_threads_heap.sh | WARNING | 1 | ./check_jvm_threads_heap.sh |
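In this run, exit code 0 pairs with OK and exit code 1 pairs with WARNING. The sketch below shows one way a wrapper might map exit codes to status labels. The function name `status_from_exit` is hypothetical, and the UNKNOWN fallback for any other code is an assumption, since no other code appears in this report.

```shell
# Hypothetical helper: map a check script's exit code to a status label.
# Only 0 -> OK and 1 -> WARNING are confirmed by the table in this report.
# The UNKNOWN fallback is an assumption for illustration.
status_from_exit() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    *) echo "UNKNOWN" ;;
  esac
}
```

A wrapper could run a check, capture `$?`, and pass it to `status_from_exit` to fill the Status column.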
Summary
- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected
Raw Evidence
CPU saturation
Script: check_high_cpu.sh
Command: ./check_high_cpu.sh
Status: OK, exit: 0
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
Load average:
1m=0.42 5m=0.38 15m=0.31
Top CPU processes:
PID PPID USER %CPU %MEM COMMAND ARGS
1450 1 app 7.2 2.1 nginx nginx: worker process
Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
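The "10% of CPU count" figure is consistent with dividing the 1-minute load by the number of CPUs and truncating to a whole percent (0.42 / 4 = 10.5%, shown as 10%). The sketch below reproduces that arithmetic. The function name `load_pct` is hypothetical, and truncation rather than rounding is an assumption about how check_high_cpu.sh formats the value.

```shell
# Hypothetical helper: express the 1-minute load as a percentage of CPU count.
# $1 = 1-minute load average, $2 = number of CPUs.
# Truncation to a whole number is an assumption that matches the 10% reported.
load_pct() {
  awk -v load="$1" -v cpus="$2" 'BEGIN {
    if (cpus <= 0) exit 1            # avoid division by zero
    printf "%d\n", int((load * 100) / cpus)
  }'
}

load_pct 0.42 4   # prints 10
```

On a Linux host the inputs could come from `/proc/loadavg` and `nproc`, for example `load_pct "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.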
Memory and OOM
Script: check_high_memory_oom.sh
Command: ./check_high_memory_oom.sh --since "30 minutes ago"
Status: WARNING, exit: 1
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583
OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
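The reported percentages can be cross-checked against the memory summary, assuming the columns follow `free -m` order (total, then used): 13272 / 15800 = 84% and 512 / 4095 = 12% after truncation. The sketch below does that calculation. The function name `usage_pct` is hypothetical, and the column order is an assumption because the header row is not shown in this report.

```shell
# Hypothetical helper: used/total as a whole-number percentage.
# Reads free-style lines on stdin; $1 is the row label ("Mem:" or "Swap:").
# Assumes column 2 is total and column 3 is used, as in `free -m` output.
usage_pct() {
  awk -v row="$1" '$1 == row && $2 > 0 { printf "%d\n", int(($3 * 100) / $2) }'
}

sample='Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583'
printf '%s\n' "$sample" | usage_pct Mem:    # prints 84
printf '%s\n' "$sample" | usage_pct Swap:   # prints 12
```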
Service restart loop
Script: check_service_restart_loop.sh
Command: ./check_service_restart_loop.sh --service nginx --since "30 minutes ago"
Status: OK, exit: 0
OK: Service nginx state=active substate=running restarts=0
Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
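The properties above are plain `Key=Value` lines, so individual values such as the restart count are easy to extract for comparison over time. The sketch below shows one way to do that. The function name `prop_value` is hypothetical, and the exact command check_service_restart_loop.sh uses to collect these properties is not shown in this report.

```shell
# Hypothetical helper: read Key=Value lines on stdin and print the value
# of the first line whose key matches $1.
prop_value() {
  awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

props='Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0'
printf '%s\n' "$props" | prop_value NRestarts     # prints 0
printf '%s\n' "$props" | prop_value ActiveState   # prints active
```

On a systemd host, similar input can be produced with `systemctl show nginx --property=Id,ActiveState,SubState,NRestarts`.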
Skipped or limited checks
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
L2 Handover Checklist
- Business impact confirmed
- Affected host/service identified
- Monitoring alert attached
- Recent changes checked
- Logs attached
- Service owner identified
- Escalation target identified
Escalation Notes
- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.
Recommended Next Steps
- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.