infra-run/examples/incident-triage/l2-incident-triage-report.sample.md

# L2 Incident Triage Report

- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago

## Executed Checks

| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
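
The Status and Exit columns above pair exit code 0 with OK and exit code 1 with WARNING. A minimal sketch of how a wrapper might turn a check script's exit code into the status label; the function name `status_from_exit` is hypothetical, and treating any other exit code as UNKNOWN is an assumption, since this report only shows codes 0 and 1.

```shell
# Hypothetical helper: map a check script's exit code to a status label.
# Only 0 -> OK and 1 -> WARNING appear in this report; any other code is
# labeled UNKNOWN here as an assumption.
status_from_exit() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "WARNING" ;;
    *) echo "UNKNOWN" ;;
  esac
}

status_from_exit 0   # prints OK
status_from_exit 1   # prints WARNING
```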

## Summary

- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected
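
The summary lines above share the shape `- <check>: <STATUS>: <message>`, so the checks that need attention can be pulled out mechanically. A small sketch, assuming that line shape holds for every entry; `non_ok_checks` is a hypothetical name and is not part of the triage scripts.

```shell
# Hypothetical helper: print the names of checks whose summary status is not OK.
# Expects lines shaped like "- <check>: <STATUS>: <message>" on stdin.
non_ok_checks() {
  awk -F': ' '/^- / && $2 != "OK" { sub(/^- /, "", $1); print $1 }'
}

printf '%s\n' \
  '- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)' \
  '- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%' \
  '- JVM threads and heap: WARNING: No Java processes detected' \
  | non_ok_checks
```

For the three sample lines this prints `Memory and OOM` and `JVM threads and heap`, one per line.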

## Raw Evidence

### CPU saturation

Script: `check_high_cpu.sh`

Command: `./check_high_cpu.sh`

Status: OK, exit: 0

```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)

Load average:
1m=0.42 5m=0.38 15m=0.31

Top CPU processes:
PID PPID USER %CPU %MEM COMMAND ARGS
1450 1 app 7.2 2.1 nginx nginx: worker process

Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```
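
The "10% of CPU count" figure is consistent with dividing the 1-minute load by the number of CPUs and dropping the fraction (0.42 / 4 = 10.5%). A sketch of that arithmetic; `load_percent` is a hypothetical name, and truncating rather than rounding is an assumption about how `check_high_cpu.sh` reports the value.

```shell
# Hypothetical helper: express the 1-minute load as a whole percentage of the
# CPU count. The fraction is truncated, which is an assumption.
load_percent() {
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%d\n", load * 100 / cpus }'
}

load_percent 0.42 4   # prints 10

# On a live Linux host the inputs could come from /proc/loadavg and nproc:
# load_percent "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```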

### Memory and OOM

Script: `check_high_memory_oom.sh`

Command: `./check_high_memory_oom.sh --since "30 minutes ago"`

Status: WARNING, exit: 1

```text
WARNING: Memory usage is 84% and swap usage is 12%

Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583

OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```
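
The memory summary has no header row. Assuming its columns follow the `free -m` order (total, used, free, ...), the 84% and 12% figures match used divided by total with the fraction dropped. A sketch of that calculation; `used_percent` is a hypothetical name, and the column interpretation is an assumption.

```shell
# Hypothetical helper: integer percentage of used over total (rounds down).
# Arguments: <used> <total>, both whole numbers in the same unit.
used_percent() {
  if [ "$2" -eq 0 ]; then
    # Avoid dividing by zero, for example on a host without swap.
    echo 0
    return
  fi
  echo $(( $1 * 100 / $2 ))
}

used_percent 13272 15800   # memory: prints 84
used_percent 512 4095      # swap: prints 12
```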

### Service restart loop

Script: `check_service_restart_loop.sh`

Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`

Status: OK, exit: 0

```text
OK: Service nginx state=active substate=running restarts=0

Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```
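
The status line in the block above can be rebuilt from the `Key=Value` properties listed under it. A sketch of that step, assuming input in the format that `systemctl show` prints; `summarize_unit` is a hypothetical name and may not match how `check_service_restart_loop.sh` is written.

```shell
# Hypothetical helper: read Key=Value lines from stdin and print a one-line
# summary shaped like the status line in the evidence block.
summarize_unit() {
  awk -F= '
    $1 == "Id"          { id = $2 }
    $1 == "ActiveState" { state = $2 }
    $1 == "SubState"    { sub_state = $2 }
    $1 == "NRestarts"   { restarts = $2 }
    END {
      sub(/\.service$/, "", id)   # report "nginx" rather than "nginx.service"
      printf "Service %s state=%s substate=%s restarts=%s\n", id, state, sub_state, restarts
    }'
}

printf '%s\n' 'Id=nginx.service' 'ActiveState=active' 'SubState=running' 'NRestarts=0' \
  | summarize_unit

# On a systemd host the input could come from:
# systemctl show nginx.service --property=Id,ActiveState,SubState,NRestarts
```

With the sample properties this prints `Service nginx state=active substate=running restarts=0`.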

### Skipped or limited checks

```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
```

## L2 Handover Checklist

- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified

## Escalation Notes

- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder knows which gaps remain and does not have to rediscover them.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.

## Recommended Next Steps

- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handover.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.

Add L2 incident triage report wrapper 2026-05-12 20:00:42 +00:00