# L2 Incident Triage Report

- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago

## Executed Checks

| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
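
In the table above, every OK row exits 0 and every WARNING row exits 1. The following is a minimal Python sketch of that pairing, useful when filtering checks that need a second look. The function name `status_from_exit` and the `UNKNOWN` fallback are hypothetical: the report does not show what exit codes above 1 mean.

```python
# Exit-code-to-status pairs observed in the Executed Checks table.
# Only 0 -> OK and 1 -> WARNING appear in this report.
OBSERVED_STATUSES = {0: "OK", 1: "WARNING"}


def status_from_exit(exit_code: int) -> str:
    """Return the status label for a check script's exit code.

    Codes the report never shows fall back to a hypothetical UNKNOWN label.
    """
    return OBSERVED_STATUSES.get(exit_code, "UNKNOWN")


# A few exit codes copied from the table.
exit_codes = {
    "CPU saturation": 0,
    "Memory and OOM": 1,
    "JVM threads and heap": 1,
}

# Keep only the checks whose status is not OK.
needs_review = [
    name for name, code in exit_codes.items() if status_from_exit(code) != "OK"
]
print(needs_review)
```

Keeping the fallback explicit avoids silently treating an unfamiliar exit code as OK.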

## Summary

- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected

## Raw Evidence

### CPU saturation

Script: `check_high_cpu.sh`

Command: `./check_high_cpu.sh`

Status: OK, exit: 0

```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)

Load average:
1m=0.42 5m=0.38 15m=0.31

Top CPU processes:
PID   PPID  USER  %CPU  %MEM  COMMAND  ARGS
1450  1     app   7.2   2.1   nginx    nginx: worker process

Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```
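
The "10% of CPU count" figure can be reproduced from the two numbers in the same line. This is a sketch only: the script's actual formula is not shown, the function name `load_percent_of_cpus` is hypothetical, and truncation to a whole number is an assumption that happens to match the printed value (0.42 / 4 is 10.5%).

```python
def load_percent_of_cpus(load_1m: float, cpu_count: int) -> int:
    """Express the 1-minute load as a whole-number percentage of the CPU count."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    # Assumption: the fractional part is dropped, so 10.5% is reported as 10%.
    return int(load_1m / cpu_count * 100)


# Values copied from the CPU saturation evidence above.
print(load_percent_of_cpus(0.42, 4))  # 10
```

A value near or above 100 would mean the run queue roughly matches or exceeds the number of CPUs.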

### Memory and OOM

Script: `check_high_memory_oom.sh`

Command: `./check_high_memory_oom.sh --since "30 minutes ago"`

Status: WARNING, exit: 1

```text
WARNING: Memory usage is 84% and swap usage is 12%

Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583

OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```
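
The 84% and 12% figures are consistent with used divided by total. The sketch below assumes the `Mem:` and `Swap:` rows follow `free -m` column order (total, used, free, shared, buff/cache, available); that is an inference from the numbers, since 13272 + 1110 + 1418 = 15800, and the header row is not part of the evidence. The function name `percent_used` is hypothetical, and the script's own formula is not shown.

```python
def percent_used(used: int, total: int) -> int:
    """Whole-number percentage of `total` that is used, with the fraction dropped."""
    if total <= 0:
        return 0  # for example, a host with no swap configured
    # Integer arithmetic avoids floating-point surprises at exact boundaries.
    return used * 100 // total


# Values copied from the Memory summary above (assumed to be MiB).
mem_total, mem_used = 15800, 13272
swap_total, swap_used = 4095, 512

print(percent_used(mem_used, mem_total))    # 84
print(percent_used(swap_used, swap_total))  # 12
```

Computing from the available column instead would give a different number, so it is worth confirming which definition the monitoring alert uses before comparing values.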

### Service restart loop

Script: `check_service_restart_loop.sh`

Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`

Status: OK, exit: 0

```text
OK: Service nginx state=active substate=running restarts=0

Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```
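
The properties above have the `Key=Value` shape that `systemctl show` prints; whether the script gathers them with that command is an assumption. A short sketch of turning those lines into the summary string, with the hypothetical helper name `parse_properties`:

```python
def parse_properties(text: str) -> dict[str, str]:
    """Parse `Key=Value` lines into a dict, skipping lines without `=`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            props[key] = value
    return props


# Property lines copied from the evidence block above.
raw = """\
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
"""

props = parse_properties(raw)
summary = (
    f"Service {props['Id'].removesuffix('.service')} "
    f"state={props['ActiveState']} substate={props['SubState']} "
    f"restarts={props['NRestarts']}"
)
print(summary)
```

Splitting on the first `=` only keeps values that themselves contain `=` intact. `str.removesuffix` requires Python 3.9 or later.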

### Skipped or limited checks

```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
```

## L2 Handover Checklist

- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified

## Escalation Notes

- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.

## Recommended Next Steps

- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.