revert 6475f76787
lint / shell-yaml-ansible (push) Failing after 17s
lint / shell-yaml-ansible (push) Failing after 17s
revert Add L2 incident triage report wrapper
This commit is contained in:
@@ -12,7 +12,6 @@
|
||||
- `journal-analyzer` for exported `journalctl` log review.
|
||||
- `known-error-matcher` with JSON-based known error patterns.
|
||||
- Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics.
|
||||
- `incident_triage_report.sh` for L2 Markdown incident handover reports built from existing Bash incident checks.
|
||||
- Repository-level Codex guidance:
|
||||
- `AGENTS.md`
|
||||
- `docs/codex/README.md`
|
||||
|
||||
@@ -30,7 +30,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
|
||||
|
||||
- [infra-run](./infra-run/) - the main implemented project in this repository.
|
||||
- [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
|
||||
- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
|
||||
- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, designed for copy-to-server triage and ticket evidence.
|
||||
- [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
|
||||
- [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
|
||||
- [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
|
||||
|
||||
+1
-1
@@ -9,7 +9,7 @@ The goal is to show operational judgment, not to ship a universal automation pro
|
||||
### Bash Operational Scripts
|
||||
|
||||
- [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
|
||||
- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, JVM diagnostics, and an L2 Markdown triage report wrapper.
|
||||
- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, and JVM diagnostics.
|
||||
- [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
|
||||
- [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
|
||||
- [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.
|
||||
|
||||
@@ -7,6 +7,5 @@ These files use fake hostnames, reserved example domains, reserved IP address ra
|
||||
## Included
|
||||
|
||||
- `disk-full/` - sample filesystem usage, deleted open files, and a short after-action report.
|
||||
- `incident-triage/` - sample L2 incident triage report for repeatable handoff and ticket evidence.
|
||||
- `veritas/` - sample VxVM disk and VCS service group output.
|
||||
- `gpfs/` - sample GPFS cluster and NSD output.
|
||||
|
||||
@@ -1,131 +0,0 @@
|
||||
# L2 Incident Triage Report
|
||||
|
||||
- Generated: 2026-05-12T19:30:00Z
|
||||
- Local hostname: app01.example.internal
|
||||
- Current user: triage
|
||||
- Incident type: all
|
||||
- Service: nginx
|
||||
- Host: app.example.com
|
||||
- Port: 443
|
||||
- PID: not provided
|
||||
- Process match: not provided
|
||||
- Since: 30 minutes ago
|
||||
|
||||
## Executed Checks
|
||||
|
||||
| Check | Script | Status | Exit | Command |
|
||||
| --- | --- | --- | --- | --- |
|
||||
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
|
||||
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
|
||||
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
|
||||
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
|
||||
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
|
||||
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
|
||||
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
|
||||
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
|
||||
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
|
||||
|
||||
## Summary
|
||||
|
||||
- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
|
||||
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
|
||||
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
|
||||
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
|
||||
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
|
||||
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
|
||||
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
|
||||
- Inode usage: OK: Highest inode usage is 42%
|
||||
- JVM threads and heap: WARNING: No Java processes detected
|
||||
|
||||
## Raw Evidence
|
||||
|
||||
### CPU saturation
|
||||
|
||||
Script: `check_high_cpu.sh`
|
||||
|
||||
Command: `./check_high_cpu.sh`
|
||||
|
||||
Status: OK, exit: 0
|
||||
|
||||
```text
|
||||
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
|
||||
|
||||
Load average:
|
||||
1m=0.42 5m=0.38 15m=0.31
|
||||
|
||||
Top CPU processes:
|
||||
PID PPID USER %CPU %MEM COMMAND ARGS
|
||||
1450 1 app 7.2 2.1 nginx nginx: worker process
|
||||
|
||||
Recommended next steps:
|
||||
- Check process ownership and whether the top process is expected
|
||||
- Review logs for the top CPU-consuming process
|
||||
```
|
||||
|
||||
### Memory and OOM
|
||||
|
||||
Script: `check_high_memory_oom.sh`
|
||||
|
||||
Command: `./check_high_memory_oom.sh --since "30 minutes ago"`
|
||||
|
||||
Status: WARNING, exit: 1
|
||||
|
||||
```text
|
||||
WARNING: Memory usage is 84% and swap usage is 12%
|
||||
|
||||
Memory summary:
|
||||
Mem: 15800 13272 1110 210 1418 1840
|
||||
Swap: 4095 512 3583
|
||||
|
||||
OOM events since 30 minutes ago:
|
||||
OK: no OOM evidence found in available sources
|
||||
```
|
||||
|
||||
### Service restart loop
|
||||
|
||||
Script: `check_service_restart_loop.sh`
|
||||
|
||||
Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`
|
||||
|
||||
Status: OK, exit: 0
|
||||
|
||||
```text
|
||||
OK: Service nginx state=active substate=running restarts=0
|
||||
|
||||
Systemd properties:
|
||||
Id=nginx.service
|
||||
ActiveState=active
|
||||
SubState=running
|
||||
NRestarts=0
|
||||
```
|
||||
|
||||
### Skipped or limited checks
|
||||
|
||||
```text
|
||||
JVM threads and heap returned WARNING because no Java process was detected.
|
||||
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
|
||||
```
|
||||
|
||||
## L2 Handover Checklist
|
||||
|
||||
- [ ] Business impact confirmed
|
||||
- [ ] Affected host/service identified
|
||||
- [ ] Monitoring alert attached
|
||||
- [ ] Recent changes checked
|
||||
- [ ] Logs attached
|
||||
- [ ] Service owner identified
|
||||
- [ ] Escalation target identified
|
||||
|
||||
## Escalation Notes
|
||||
|
||||
- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
|
||||
- Include the alert, timeline, commands run, and the raw evidence above.
|
||||
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
|
||||
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.
|
||||
|
||||
## Recommended Next Steps
|
||||
|
||||
- Confirm the symptom against monitoring and user reports.
|
||||
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
|
||||
- Attach this report to the incident ticket before handoff.
|
||||
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
|
||||
@@ -16,7 +16,6 @@ They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional
|
||||
- `check_filesystem_readonly.sh` - read-only filesystem detection.
|
||||
- `check_inode_usage.sh` - inode pressure and top affected mount points.
|
||||
- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.
|
||||
- `incident_triage_report.sh` - wrapper that runs selected checks and writes a single Markdown L2 handover report.
|
||||
|
||||
## Usage Examples
|
||||
|
||||
@@ -52,21 +51,8 @@ They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional
|
||||
./check_jvm_threads_heap.sh
|
||||
./check_jvm_threads_heap.sh --pid 1234
|
||||
./check_jvm_threads_heap.sh --match app-name
|
||||
|
||||
./incident_triage_report.sh --type cpu
|
||||
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
|
||||
./incident_triage_report.sh --type network --host app.example.com --port 443
|
||||
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
|
||||
```
|
||||
|
||||
## L2 Triage Report Wrapper
|
||||
|
||||
`incident_triage_report.sh` collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.
|
||||
|
||||
Supported report types are `cpu`, `memory`, `service`, `network`, `auth`, `cert`, `filesystem`, `jvm`, and `all`.
|
||||
|
||||
The wrapper is read-only apart from writing the requested `--output` file. It does not require root and skips checks safely when an underlying script is missing, not executable, or missing required context such as `--service` or `--host`.
|
||||
|
||||
## Exit Codes
|
||||
|
||||
- `0` - OK.
|
||||
@@ -120,5 +106,3 @@ Sanitized examples are available in [examples](./examples/):
|
||||
- `filesystem-readonly.sample.txt`
|
||||
- `inode-usage.sample.txt`
|
||||
- `jvm-threads-heap.sample.txt`
|
||||
|
||||
A sanitized report sample is available at [../../../examples/incident-triage/l2-incident-triage-report.sample.md](../../../examples/incident-triage/l2-incident-triage-report.sample.md).
|
||||
|
||||
@@ -92,7 +92,7 @@ load_15m="unavailable"
|
||||
load_per_cpu_pct=0
|
||||
if [[ -r /proc/loadavg ]]; then
|
||||
read -r load_1m load_5m load_15m _ < /proc/loadavg
|
||||
load_per_cpu_pct="$(awk -v load_avg="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"
|
||||
load_per_cpu_pct="$(awk -v load="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load / cpus) * 100 }')"
|
||||
elif command -v uptime >/dev/null 2>&1; then
|
||||
load_line="$(uptime 2>/dev/null || true)"
|
||||
load_1m="$(printf '%s\n' "$load_line" | sed -n 's/.*load average[s]*: *\([^,]*\).*/\1/p')"
|
||||
|
||||
@@ -1,385 +0,0 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
incident_type=""
|
||||
service_name=""
|
||||
host_name=""
|
||||
port=""
|
||||
target_pid=""
|
||||
match_string=""
|
||||
output_file=""
|
||||
since_value="1 hour ago"
|
||||
|
||||
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
|
||||
usage() {
|
||||
cat <<'USAGE'
|
||||
Usage: incident_triage_report.sh --type TYPE [options]
|
||||
|
||||
Run selected read-only incident checks and produce a Markdown triage report.
|
||||
|
||||
Incident types:
|
||||
cpu
|
||||
memory
|
||||
service
|
||||
network
|
||||
auth
|
||||
cert
|
||||
filesystem
|
||||
jvm
|
||||
all
|
||||
|
||||
Options:
|
||||
--type TYPE Incident type to collect
|
||||
--service SERVICE_NAME systemd service name for service checks
|
||||
--host HOSTNAME_OR_FQDN host for DNS, network, or certificate checks
|
||||
--port PORT TCP or TLS port for host checks
|
||||
--pid PID JVM process ID
|
||||
--match PROCESS_MATCH JVM process match string
|
||||
--output FILE write Markdown report to FILE
|
||||
--since VALUE time window for log-based checks
|
||||
--help show this help
|
||||
|
||||
Examples:
|
||||
./incident_triage_report.sh --type cpu
|
||||
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
|
||||
./incident_triage_report.sh --type network --host app.example.com --port 443
|
||||
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
|
||||
USAGE
|
||||
}
|
||||
|
||||
is_number() {
|
||||
[[ "$1" =~ ^[0-9]+$ ]]
|
||||
}
|
||||
|
||||
valid_type() {
|
||||
case "$1" in
|
||||
cpu|memory|service|network|auth|cert|filesystem|jvm|all) return 0 ;;
|
||||
*) return 1 ;;
|
||||
esac
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--type)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --type requires a value\n'; exit 2; }
|
||||
incident_type="$2"
|
||||
shift 2
|
||||
;;
|
||||
--service)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }
|
||||
service_name="$2"
|
||||
shift 2
|
||||
;;
|
||||
--host)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }
|
||||
host_name="$2"
|
||||
shift 2
|
||||
;;
|
||||
--port)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }
|
||||
port="$2"
|
||||
shift 2
|
||||
;;
|
||||
--pid)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }
|
||||
target_pid="$2"
|
||||
shift 2
|
||||
;;
|
||||
--match)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }
|
||||
match_string="$2"
|
||||
shift 2
|
||||
;;
|
||||
--output)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --output requires a value\n'; exit 2; }
|
||||
output_file="$2"
|
||||
shift 2
|
||||
;;
|
||||
--since)
|
||||
[[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }
|
||||
since_value="$2"
|
||||
shift 2
|
||||
;;
|
||||
--help|-h)
|
||||
usage
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
printf 'CRITICAL: unknown option: %s\n' "$1"
|
||||
usage
|
||||
exit 2
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$incident_type" ]]; then
|
||||
printf 'CRITICAL: --type is required\n'
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
if ! valid_type "$incident_type"; then
|
||||
printf 'CRITICAL: unsupported incident type: %s\n' "$incident_type"
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
if [[ -n "$port" ]] && ! is_number "$port"; then
|
||||
printf 'CRITICAL: --port must be numeric\n'
|
||||
exit 2
|
||||
fi
|
||||
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
|
||||
printf 'CRITICAL: --pid must be numeric\n'
|
||||
exit 2
|
||||
fi
|
||||
if [[ -n "$target_pid" && -n "$match_string" ]]; then
|
||||
printf 'CRITICAL: use either --pid or --match for JVM checks, not both\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
tmp_dir="$(mktemp -d)"
|
||||
trap 'rm -rf "$tmp_dir"' EXIT
|
||||
|
||||
report_file="$tmp_dir/report.md"
|
||||
|
||||
check_labels=()
|
||||
check_names=()
|
||||
check_commands=()
|
||||
check_statuses=()
|
||||
check_exit_codes=()
|
||||
check_summaries=()
|
||||
check_outputs=()
|
||||
|
||||
status_from_exit() {
|
||||
case "$1" in
|
||||
0) printf 'OK' ;;
|
||||
1) printf 'WARNING' ;;
|
||||
2) printf 'INVALID' ;;
|
||||
3) printf 'CRITICAL' ;;
|
||||
*) printf 'ERROR' ;;
|
||||
esac
|
||||
}
|
||||
|
||||
render_command() {
|
||||
local item
|
||||
for item in "$@"; do
|
||||
printf '%q ' "$item"
|
||||
done | sed 's/[[:space:]]*$//'
|
||||
}
|
||||
|
||||
append_skipped_check() {
|
||||
local label="$1"
|
||||
local name="$2"
|
||||
local reason="$3"
|
||||
local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
|
||||
|
||||
printf 'SKIPPED: %s\n' "$reason" > "$output_path"
|
||||
|
||||
check_labels+=("$label")
|
||||
check_names+=("$name")
|
||||
check_commands+=("not run")
|
||||
check_statuses+=("SKIPPED")
|
||||
check_exit_codes+=("-")
|
||||
check_summaries+=("$reason")
|
||||
check_outputs+=("$output_path")
|
||||
}
|
||||
|
||||
run_check() {
|
||||
local label="$1"
|
||||
local script_name="$2"
|
||||
shift 2
|
||||
|
||||
local script_path="${script_dir}/${script_name}"
|
||||
local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
|
||||
local command_text
|
||||
local exit_code
|
||||
local status
|
||||
local summary
|
||||
|
||||
command_text="$(render_command "$script_path" "$@")"
|
||||
|
||||
if [[ ! -e "$script_path" ]]; then
|
||||
append_skipped_check "$label" "$script_name" "missing script: $script_name"
|
||||
return
|
||||
fi
|
||||
if [[ ! -x "$script_path" ]]; then
|
||||
append_skipped_check "$label" "$script_name" "script is not executable: $script_name"
|
||||
return
|
||||
fi
|
||||
|
||||
set +e
|
||||
"$script_path" "$@" > "$output_path" 2>&1
|
||||
exit_code=$?
|
||||
set -e
|
||||
|
||||
status="$(status_from_exit "$exit_code")"
|
||||
summary="$(sed -n '1p' "$output_path")"
|
||||
if [[ -z "$summary" ]]; then
|
||||
summary="no output captured"
|
||||
fi
|
||||
|
||||
check_labels+=("$label")
|
||||
check_names+=("$script_name")
|
||||
check_commands+=("$command_text")
|
||||
check_statuses+=("$status")
|
||||
check_exit_codes+=("$exit_code")
|
||||
check_summaries+=("$summary")
|
||||
check_outputs+=("$output_path")
|
||||
}
|
||||
|
||||
run_cpu_checks() {
|
||||
run_check "CPU saturation" "check_high_cpu.sh"
|
||||
}
|
||||
|
||||
run_memory_checks() {
|
||||
run_check "Memory and OOM" "check_high_memory_oom.sh" --since "$since_value"
|
||||
}
|
||||
|
||||
run_service_checks() {
|
||||
if [[ -z "$service_name" ]]; then
|
||||
append_skipped_check "Service restart loop" "check_service_restart_loop.sh" "requires --service SERVICE_NAME"
|
||||
return
|
||||
fi
|
||||
run_check "Service restart loop" "check_service_restart_loop.sh" --service "$service_name" --since "$since_value"
|
||||
}
|
||||
|
||||
run_network_checks() {
|
||||
local args=(--host "$host_name")
|
||||
if [[ -z "$host_name" ]]; then
|
||||
append_skipped_check "DNS and connectivity" "check_dns_connectivity.sh" "requires --host HOSTNAME_OR_FQDN"
|
||||
return
|
||||
fi
|
||||
if [[ -n "$port" ]]; then
|
||||
args+=(--port "$port")
|
||||
fi
|
||||
run_check "DNS and connectivity" "check_dns_connectivity.sh" "${args[@]}"
|
||||
}
|
||||
|
||||
run_auth_checks() {
|
||||
run_check "Failed SSH logins" "check_failed_ssh_logins.sh" --since "$since_value"
|
||||
}
|
||||
|
||||
run_cert_checks() {
|
||||
local args=(--host "$host_name")
|
||||
if [[ -z "$host_name" ]]; then
|
||||
append_skipped_check "Certificate expiry" "check_certificate_expiry.sh" "requires --host HOSTNAME_OR_FQDN"
|
||||
return
|
||||
fi
|
||||
if [[ -n "$port" ]]; then
|
||||
args+=(--port "$port")
|
||||
fi
|
||||
run_check "Certificate expiry" "check_certificate_expiry.sh" "${args[@]}"
|
||||
}
|
||||
|
||||
run_filesystem_checks() {
|
||||
run_check "Read-only filesystems" "check_filesystem_readonly.sh"
|
||||
run_check "Inode usage" "check_inode_usage.sh"
|
||||
}
|
||||
|
||||
run_jvm_checks() {
|
||||
local args=()
|
||||
if [[ -n "$target_pid" ]]; then
|
||||
args+=(--pid "$target_pid")
|
||||
elif [[ -n "$match_string" ]]; then
|
||||
args+=(--match "$match_string")
|
||||
fi
|
||||
run_check "JVM threads and heap" "check_jvm_threads_heap.sh" "${args[@]}"
|
||||
}
|
||||
|
||||
case "$incident_type" in
|
||||
cpu) run_cpu_checks ;;
|
||||
memory) run_memory_checks ;;
|
||||
service) run_service_checks ;;
|
||||
network) run_network_checks ;;
|
||||
auth) run_auth_checks ;;
|
||||
cert) run_cert_checks ;;
|
||||
filesystem) run_filesystem_checks ;;
|
||||
jvm) run_jvm_checks ;;
|
||||
all)
|
||||
run_cpu_checks
|
||||
run_memory_checks
|
||||
run_service_checks
|
||||
run_network_checks
|
||||
run_auth_checks
|
||||
run_cert_checks
|
||||
run_filesystem_checks
|
||||
run_jvm_checks
|
||||
;;
|
||||
esac
|
||||
|
||||
generated_at="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
|
||||
local_hostname="$(hostname 2>/dev/null || printf 'unknown')"
|
||||
current_user="$(id -un 2>/dev/null || printf 'unknown')"
|
||||
|
||||
{
|
||||
printf '# L2 Incident Triage Report\n\n'
|
||||
printf -- '- Generated: %s\n' "$generated_at"
|
||||
printf -- '- Local hostname: %s\n' "$local_hostname"
|
||||
printf -- '- Current user: %s\n' "$current_user"
|
||||
printf -- '- Incident type: %s\n' "$incident_type"
|
||||
printf -- '- Service: %s\n' "${service_name:-not provided}"
|
||||
printf -- '- Host: %s\n' "${host_name:-not provided}"
|
||||
printf -- '- Port: %s\n' "${port:-not provided}"
|
||||
printf -- '- PID: %s\n' "${target_pid:-not provided}"
|
||||
printf -- '- Process match: %s\n' "${match_string:-not provided}"
|
||||
printf -- '- Since: %s\n\n' "$since_value"
|
||||
|
||||
printf '## Executed Checks\n\n'
|
||||
printf '| Check | Script | Status | Exit | Command |\n'
|
||||
printf '| --- | --- | --- | --- | --- |\n'
|
||||
for index in "${!check_labels[@]}"; do
|
||||
printf "| %s | \`%s\` | %s | %s | \`%s\` |\n" \
|
||||
"${check_labels[$index]}" \
|
||||
"${check_names[$index]}" \
|
||||
"${check_statuses[$index]}" \
|
||||
"${check_exit_codes[$index]}" \
|
||||
"${check_commands[$index]}"
|
||||
done
|
||||
printf '\n'
|
||||
|
||||
printf '## Summary\n\n'
|
||||
for index in "${!check_labels[@]}"; do
|
||||
printf -- '- %s: %s\n' "${check_labels[$index]}" "${check_summaries[$index]}"
|
||||
done
|
||||
printf '\n'
|
||||
|
||||
printf '## Raw Evidence\n\n'
|
||||
for index in "${!check_labels[@]}"; do
|
||||
printf '### %s\n\n' "${check_labels[$index]}"
|
||||
printf "Script: \`%s\`\n\n" "${check_names[$index]}"
|
||||
printf "Command: \`%s\`\n\n" "${check_commands[$index]}"
|
||||
printf 'Status: %s, exit: %s\n\n' "${check_statuses[$index]}" "${check_exit_codes[$index]}"
|
||||
printf '```text\n'
|
||||
cat "${check_outputs[$index]}"
|
||||
printf '\n```\n\n'
|
||||
done
|
||||
|
||||
printf '## L2 Handover Checklist\n\n'
|
||||
printf -- '- [ ] Business impact confirmed\n'
|
||||
printf -- '- [ ] Affected host/service identified\n'
|
||||
printf -- '- [ ] Monitoring alert attached\n'
|
||||
printf -- '- [ ] Recent changes checked\n'
|
||||
printf -- '- [ ] Logs attached\n'
|
||||
printf -- '- [ ] Service owner identified\n'
|
||||
printf -- '- [ ] Escalation target identified\n\n'
|
||||
|
||||
printf '## Escalation Notes\n\n'
|
||||
printf -- '- Escalate when impact is active, spreading, customer-facing, or outside L2 access.\n'
|
||||
printf -- '- Include the alert, timeline, commands run, and the raw evidence above.\n'
|
||||
printf -- '- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.\n'
|
||||
printf -- '- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.\n\n'
|
||||
|
||||
printf '## Recommended Next Steps\n\n'
|
||||
printf -- '- Confirm the symptom against monitoring and user reports.\n'
|
||||
printf -- '- Compare this point-in-time evidence with recent deploys, config changes, and host events.\n'
|
||||
printf -- '- Attach this report to the incident ticket before handoff.\n'
|
||||
printf -- '- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.\n'
|
||||
} > "$report_file"
|
||||
|
||||
if [[ -n "$output_file" ]]; then
|
||||
cp "$report_file" "$output_file"
|
||||
printf 'OK: wrote L2 incident triage report to %s\n' "$output_file"
|
||||
else
|
||||
cat "$report_file"
|
||||
fi
|
||||
Reference in New Issue
Block a user