# L2 Incident Triage Report
- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago
## Executed Checks
| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
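
The table pairs exit code 0 with `OK` and exit code 1 with `WARNING`. As a minimal sketch of that mapping, a wrapper could translate exit codes into status labels as below. The function name `status_from_exit` and the `UNKNOWN` fallback are hypothetical; only the 0 and 1 cases are confirmed by this report.

```shell
# Hypothetical helper: translate a check's exit code into a status label.
# Only 0 -> OK and 1 -> WARNING appear in the report; the fallback
# label for any other code is an assumption.
status_from_exit() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    *) echo "UNKNOWN" ;;
  esac
}

status_from_exit 0
status_from_exit 1
```

Keeping the mapping in one place would let every row of the table derive its Status column from the Exit column rather than tracking the two separately.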
## Summary
- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected
## Raw Evidence
### CPU saturation
Script: `check_high_cpu.sh`
Command: `./check_high_cpu.sh`
Status: OK, exit: 0
```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
Load average:
1m=0.42 5m=0.38 15m=0.31
Top CPU processes:
PID PPID USER %CPU %MEM COMMAND ARGS
1450 1 app 7.2 2.1 nginx nginx: worker process
Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```
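
The "10% of CPU count" figure follows from dividing the 1-minute load by the number of CPUs. The sketch below reproduces that arithmetic under the assumption that the result is truncated to a whole number; the helper name `load_pct` is hypothetical and the actual logic in `check_high_cpu.sh` may differ.

```shell
# Hypothetical helper: express a load average as a whole-number
# percentage of the CPU count. Truncating instead of rounding is an
# assumption; the report only shows the final figure.
load_pct() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%d\n", load * 100 / cpus }'
}

# Values taken from the evidence above: load 0.42 on 4 CPUs.
load_pct 0.42 4
```

On a Linux host the same helper could be fed live values, for example `load_pct "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"`.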
### Memory and OOM
Script: `check_high_memory_oom.sh`
Command: `./check_high_memory_oom.sh --since "30 minutes ago"`
Status: WARNING, exit: 1
```text
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583
OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```
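
The percentages in the status line can be checked against the memory summary. The sketch below assumes the columns follow the `free -m` layout (total, used, free, shared, buff/cache, available), which is consistent with 13272 + 1110 + 1418 = 15800, and that percentages are truncated to whole numbers. The helper name `pct_used` is hypothetical.

```shell
# Hypothetical helper: used/total as a whole-number percentage,
# using integer arithmetic (truncates toward zero).
pct_used() {
  echo $(( $1 * 100 / $2 ))
}

# Values from the memory summary above, assumed to be in MiB.
pct_used 13272 15800   # memory: used, total
pct_used 512 4095      # swap: used, total
```

With these inputs the helper yields 84 for memory and 12 for swap, matching the WARNING line.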
### Service restart loop
Script: `check_service_restart_loop.sh`
Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`
Status: OK, exit: 0
```text
OK: Service nginx state=active substate=running restarts=0
Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```
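
The properties above have the `KEY=VALUE` shape printed by `systemctl show`. As a sketch, a single value such as the restart counter could be extracted as follows. The helper name `prop_value` is hypothetical, and whether `check_service_restart_loop.sh` parses its input this way is an assumption.

```shell
# Hypothetical helper: read KEY=VALUE lines on stdin and print the
# value for the requested key (everything after the first "=").
prop_value() {
  awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print }'
}

# Sample input copied from the evidence above.
props='Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0'

printf '%s\n' "$props" | prop_value NRestarts
printf '%s\n' "$props" | prop_value ActiveState
```

On a systemd host the same input could come from `systemctl show nginx.service --property=Id,ActiveState,SubState,NRestarts`.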
### Skipped or limited checks
```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
```
## L2 Handover Checklist
- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified
## Escalation Notes
- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder knows which gaps remain and does not assume they were covered.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.
## Recommended Next Steps
- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.