@@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. T
 ```mermaid
 flowchart TD
 A["bash"] --> B["os-healthcheck"]
-A --> C["disk-full"]
-A --> D["veritas"]
-A --> E["gpfs"]
+A --> C["incident-checks"]
+A --> D["disk-full"]
+A --> E["veritas"]
+A --> F["gpfs"]
 B --> B1["Host diagnostics"]
-C --> C1["Incident workflow"]
-D --> D1["VxVM and VCS change flow"]
-E --> E1["Spectrum Scale expansion flow"]
+C --> C1["Standalone triage checks"]
+D --> D1["Incident workflow"]
+E --> E1["VxVM and VCS change flow"]
+F --> F1["Spectrum Scale expansion flow"]
 ```

 ## Scripts
@@ -23,6 +25,7 @@ flowchart TD
 - `os-healthcheck/service_check.sh` - critical service status check.
 - `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
 - `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
+- `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics.

 ## Usage

@@ -37,6 +40,12 @@ cd infra-run/scripts/bash/os-healthcheck
 ./system_report.sh
 ./network_troubleshoot.sh
 ./network_troubleshoot.sh google.com
+
+cd ../incident-checks
+./check_high_cpu.sh
+./check_high_memory_oom.sh --since "24 hours ago"
+./check_service_restart_loop.sh --service sshd
+./check_certificate_expiry.sh --host example.com
 ```

 ## Standards

@@ -0,0 +1,108 @@
# Bash Incident Checks

Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident and run without repository context, with their output pasted into an incident or change ticket as evidence.

They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available, and their absence is reported clearly.

## Scripts

- `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes.
- `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence.
- `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines.
- `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs.
- `check_certificate_expiry.sh` - remote or local TLS certificate expiry check.
- `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints.
- `check_ntp_time_drift.sh` - time sync status and offset evidence when available.
- `check_filesystem_readonly.sh` - read-only filesystem detection.
- `check_inode_usage.sh` - inode pressure and top affected mount points.
- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.

## Usage Examples

```bash
./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15

./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5

./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"

./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25

./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt

./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432

./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000

./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system

./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90

./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name
```

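Several checks accept `--warning` and `--critical` thresholds. As a rough sketch of how a measured value could map to a status, the following uses the same `>=` ordering that `check_high_cpu.sh` applies to its load percentage. The function name `classify` is illustrative and does not appear in the scripts; `check_certificate_expiry.sh` works the other way round and alerts when the days remaining fall below its thresholds.

```bash
# Hypothetical helper: map a measured value to a status label.
# Mirrors the ">= critical, then >= warning" ordering used by check_high_cpu.sh.
classify() {
  local value="$1" warning="$2" critical="$3"
  if ((value >= critical)); then
    printf 'CRITICAL\n'
  elif ((value >= warning)); then
    printf 'WARNING\n'
  else
    printf 'OK\n'
  fi
}

# Example: 85 percent against --warning 70 --critical 90.
classify 85 70 90
```

Because the critical comparison runs first, a value that meets both thresholds is reported once, at the higher severity.
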
## Exit Codes

- `0` - OK.
- `1` - WARNING or operational issue detected.
- `2` - invalid input or missing required dependency.
- `3` - CRITICAL issue detected.

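The codes are not ordered by severity: `2` signals a usage or dependency problem, while `3` is the most severe finding. A wrapper that collects results may therefore want to translate codes before acting on them. A minimal sketch, assuming a helper named `describe_exit_code` (hypothetical, not part of the scripts):

```bash
# Hypothetical helper: translate an incident-check exit code into a label.
# The mapping follows the exit code list; any other code is reported as UNKNOWN.
describe_exit_code() {
  case "$1" in
    0) printf 'OK\n' ;;
    1) printf 'WARNING\n' ;;
    2) printf 'USAGE_OR_DEPENDENCY_ERROR\n' ;;
    3) printf 'CRITICAL\n' ;;
    *) printf 'UNKNOWN\n' ;;
  esac
}

# Example with a literal code; in practice the argument would be "$?" from a check.
describe_exit_code 3
```
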
## Supported Platforms

These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.

Some data sources vary by distribution:

- RHEL-like systems often use `/var/log/secure` and `/var/log/messages`.
- Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`.
- systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available.

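Since log locations differ between distributions, a check can probe a list of candidate paths and use whichever is readable. The sketch below illustrates that idea only; the function name `first_readable` is hypothetical, and the scripts themselves may combine several readable files rather than stopping at the first.

```bash
# Hypothetical helper: print the first readable path among the arguments.
# Returns non-zero when none of the candidates can be read.
first_readable() {
  local candidate
  for candidate in "$@"; do
    if [[ -r "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Debian/Ubuntu path first, then the RHEL-like path.
auth_log="$(first_readable /var/log/auth.log /var/log/secure || true)"
printf 'Authentication log: %s\n' "${auth_log:-none readable}"
```

Without elevated permissions neither file may be readable, in which case the sketch reports `none readable` instead of failing.
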
## Safety Notes

- Scripts are read-only.
- Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files; temporary files created with `mktemp` are removed on exit.
- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
- Treat output as triage evidence, not as complete root-cause analysis.

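The scripts in this directory keep intermediate data in a `mktemp` file and remove it through an `EXIT` trap. A minimal sketch of that pattern, wrapped in a hypothetical function `count_lines_via_scratch` whose body runs in a subshell so the trap fires when the function returns:

```bash
# Hypothetical helper: run a command, buffer its output in a scratch file,
# and print "<line count> <scratch path>". The parentheses make the body a
# subshell, so the EXIT trap removes the scratch file before the caller resumes.
count_lines_via_scratch() (
  scratch="$(mktemp)"
  trap 'rm -f "$scratch"' EXIT
  "$@" > "$scratch" 2>/dev/null || true
  printf '%s %s\n' "$(wc -l < "$scratch" | awk '{print $1}')" "$scratch"
)

# Example: count the lines of a read-only command's output.
count_lines_via_scratch df -P
```

The printed path no longer exists once the function has returned, which is the property the safety notes rely on.
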
## Dependency Notes

Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`.

Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files.

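Before copying scripts to an unfamiliar host, it can help to see which of these commands are absent. The following is a sketch only; `missing_cmds` is a hypothetical helper built on `command -v`, the same probe the scripts use for their runtime checks, and the command list in the example is an arbitrary subset.

```bash
# Hypothetical helper: print each named command that is not on PATH, one per line.
missing_cmds() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: probe a few of the dependencies named in this section.
missing="$(missing_cmds awk sed grep openssl findmnt journalctl)"
if [[ -n "$missing" ]]; then
  printf 'Missing commands:\n%s\n' "$missing"
else
  printf 'All probed commands are available\n'
fi
```
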
## Copy-To-Server Example

```bash
scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
```

Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.

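To keep both the output and the exit code of a run, a small wrapper can write the output to an evidence file and pass the status through. This is a sketch under stated assumptions: `run_and_capture` and the evidence file name are hypothetical and not part of the repository.

```bash
# Hypothetical helper: run a command, store stdout and stderr in an evidence file,
# echo the captured output, and return the command's own exit code.
run_and_capture() {
  local evidence_file="$1"
  shift
  local rc=0
  "$@" > "$evidence_file" 2>&1 || rc=$?
  cat "$evidence_file"
  return "$rc"
}

# Possible usage with the remote command from the example (not executed here):
# run_and_capture "oom-evidence.txt" ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
```

Since `ssh` returns the exit status of the remote command, the wrapper's return value should reflect the check's own exit code, provided the connection itself succeeds.
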
## Sample Outputs

Sanitized examples are available in [examples](./examples/):

- `high-cpu.sample.txt`
- `high-memory-oom.sample.txt`
- `service-restart-loop.sample.txt`
- `failed-ssh-logins.sample.txt`
- `certificate-expiry.sample.txt`
- `dns-connectivity.sample.txt`
- `ntp-time-drift.sample.txt`
- `filesystem-readonly.sample.txt`
- `inode-usage.sample.txt`
- `jvm-threads-heap.sample.txt`
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

host_name=""
port=443
cert_file=""
warning_days=30
critical_days=7
servername=""

usage() {
  cat <<'USAGE'
Usage: check_certificate_expiry.sh (--host HOST [--port PORT] | --file CERT_FILE) [--servername SNI_NAME] [--warning-days DAYS] [--critical-days DAYS] [--help]

Check TLS certificate expiry for a remote endpoint or local certificate file.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
    --port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
    --file) [[ $# -ge 2 ]] || { printf 'CRITICAL: --file requires a value\n'; exit 2; }; cert_file="$2"; shift 2 ;;
    --servername) [[ $# -ge 2 ]] || { printf 'CRITICAL: --servername requires a value\n'; exit 2; }; servername="$2"; shift 2 ;;
    --warning-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-days requires a value\n'; exit 2; }; warning_days="$2"; shift 2 ;;
    --critical-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-days requires a value\n'; exit 2; }; critical_days="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if ! command -v openssl >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: openssl\n'
  exit 2
fi
for value in "$port" "$warning_days" "$critical_days"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((critical_days >= warning_days)); then
  printf 'CRITICAL: --critical-days must be lower than --warning-days\n'
  exit 2
fi
if [[ -n "$host_name" && -n "$cert_file" ]]; then
  printf 'CRITICAL: use either --host or --file, not both\n'
  exit 2
fi
if [[ -z "$host_name" && -z "$cert_file" ]]; then
  printf 'CRITICAL: either --host or --file is required\n'
  usage
  exit 2
fi
if [[ -n "$cert_file" && ! -r "$cert_file" ]]; then
  printf 'CRITICAL: certificate file is not readable: %s\n' "$cert_file"
  exit 2
fi
if [[ -z "$servername" ]]; then
  servername="$host_name"
fi

tmp_cert="$(mktemp)"
trap 'rm -f "$tmp_cert"' EXIT

if [[ -n "$host_name" ]]; then
  if ! openssl s_client -connect "${host_name}:${port}" -servername "$servername" -showcerts </dev/null 2>/dev/null \
    | openssl x509 -outform PEM > "$tmp_cert" 2>/dev/null; then
    printf 'CRITICAL: unable to retrieve certificate from %s:%s\n' "$host_name" "$port"
    exit 2
  fi
else
  cp "$cert_file" "$tmp_cert"
fi

# Report an unparsable certificate explicitly instead of letting errexit abort silently below.
if ! openssl x509 -in "$tmp_cert" -noout >/dev/null 2>&1; then
  printf 'CRITICAL: unable to parse certificate from %s\n' "${cert_file:-${host_name}:${port}}"
  exit 2
fi

subject="$(openssl x509 -in "$tmp_cert" -noout -subject 2>/dev/null | sed 's/^subject=//')"
issuer="$(openssl x509 -in "$tmp_cert" -noout -issuer 2>/dev/null | sed 's/^issuer=//')"
not_before="$(openssl x509 -in "$tmp_cert" -noout -startdate 2>/dev/null | sed 's/^notBefore=//')"
not_after="$(openssl x509 -in "$tmp_cert" -noout -enddate 2>/dev/null | sed 's/^notAfter=//')"
# Older OpenSSL releases lack -ext; tolerate that failure so pipefail and errexit do not end the script.
# The SAN/CN line printed later falls back to the subject when this is empty.
san_text="$(openssl x509 -in "$tmp_cert" -noout -ext subjectAltName 2>/dev/null | sed '1d' | sed 's/^ *//' || true)"

expiry_epoch="$(date -d "$not_after" +%s 2>/dev/null || printf '')"
now_epoch="$(date +%s)"
if [[ -z "$expiry_epoch" ]]; then
  printf 'CRITICAL: unable to parse certificate expiry date: %s\n' "$not_after"
  exit 2
fi
seconds_left=$((expiry_epoch - now_epoch))
days_left=$((seconds_left / 86400))

status="OK"
exit_code=0
if ((days_left < critical_days)); then
  status="CRITICAL"
  exit_code=3
elif ((days_left < warning_days)); then
  status="WARNING"
  exit_code=1
fi

target="$cert_file"
if [[ -n "$host_name" ]]; then
  target="${host_name}:${port}"
fi

printf '%s: Certificate for %s expires in %s day(s)\n\n' "$status" "$target" "$days_left"

printf 'Certificate details:\n'
printf 'Subject: %s\n' "$subject"
printf 'Issuer: %s\n' "$issuer"
printf 'notBefore: %s\n' "$not_before"
printf 'notAfter: %s\n' "$not_after"
printf 'SAN/CN: %s\n' "${san_text:-$subject}"
printf '\n'

printf 'Evidence:\n'
printf 'Target: %s\n' "$target"
printf 'SNI: %s\n' "${servername:-not used}"
printf 'Thresholds: warning=%s days critical=%s days\n\n' "$warning_days" "$critical_days"

printf 'Recommended next steps:\n'
printf -- '- Renew certificate before the operational threshold is breached\n'
printf -- '- Check the full chain and intermediate certificates\n'
printf -- '- Check the load balancer, ingress, or reverse proxy serving this certificate\n'
printf -- '- Verify monitoring threshold and alert ownership\n'
printf -- '- Attach this output to incident or change ticket\n'

exit "$exit_code"
@@ -0,0 +1,161 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

host_name=""
port=""
count=3
timeout_seconds=3

usage() {
  cat <<'USAGE'
Usage: check_dns_connectivity.sh --host HOST [--port PORT] [--count COUNT] [--timeout SECONDS] [--help]

Check DNS resolution, ping, optional TCP connectivity, and local route hints.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
    --port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
    --count) [[ $# -ge 2 ]] || { printf 'CRITICAL: --count requires a value\n'; exit 2; }; count="$2"; shift 2 ;;
    --timeout) [[ $# -ge 2 ]] || { printf 'CRITICAL: --timeout requires a value\n'; exit 2; }; timeout_seconds="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -z "$host_name" ]]; then
  printf 'CRITICAL: --host is required\n'
  usage
  exit 2
fi
for value in "$count" "$timeout_seconds"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if [[ -n "$port" ]] && ! is_number "$port"; then
  printf 'CRITICAL: --port must be numeric\n'
  exit 2
fi
if ! command -v getent >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: getent\n'
  exit 2
fi

dns_ok=0
ping_ok=0
tcp_ok=0
tcp_checked=0
tcp_note=""
ping_output="$(mktemp)"
trap 'rm -f "$ping_output"' EXIT

dns_output="$(getent hosts "$host_name" 2>/dev/null || true)"
if [[ -n "$dns_output" ]]; then
  dns_ok=1
fi

if command -v ping >/dev/null 2>&1; then
  if ping -c "$count" -W "$timeout_seconds" "$host_name" > "$ping_output" 2>&1; then
    ping_ok=1
  fi
else
  printf 'WARNING: ping command not available; ICMP check skipped\n' > "$ping_output"
fi

if [[ -n "$port" ]]; then
  tcp_checked=1
  # Host and port are passed as positional parameters so the host name is never parsed as shell code.
  if command -v timeout >/dev/null 2>&1; then
    if timeout "$timeout_seconds" bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
      tcp_ok=1
    fi
  else
    tcp_note="WARNING: timeout command not available; TCP /dev/tcp check used without external timeout"
    if bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
      tcp_ok=1
    fi
  fi
fi

status="OK"
exit_code=0
if ((dns_ok == 0)); then
  status="CRITICAL"
  exit_code=3
elif ((tcp_checked == 1 && tcp_ok == 0)); then
  status="CRITICAL"
  exit_code=3
elif command -v ping >/dev/null 2>&1 && ((ping_ok == 0)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: DNS=%s ping=%s' "$status" "$([[ "$dns_ok" == 1 ]] && printf OK || printf FAILED)" "$([[ "$ping_ok" == 1 ]] && printf OK || printf UNKNOWN_OR_FAILED)"
if ((tcp_checked == 1)); then
  printf ' tcp_%s=%s' "$port" "$([[ "$tcp_ok" == 1 ]] && printf OK || printf FAILED)"
fi
printf '\n\n'

printf 'DNS result:\n'
if [[ -n "$dns_output" ]]; then
  printf '%s\n' "$dns_output"
else
  printf 'CRITICAL: getent hosts returned no records for %s\n' "$host_name"
fi
printf '\n'

printf 'Ping result:\n'
if [[ -s "$ping_output" ]]; then
  cat "$ping_output"
else
  printf 'WARNING: ping result unavailable or ping command missing\n'
fi
printf '\n'

if ((tcp_checked == 1)); then
  printf 'TCP port result:\n'
  if ((tcp_ok == 1)); then
    printf 'OK: TCP connection to %s:%s succeeded\n' "$host_name" "$port"
  else
    printf 'CRITICAL: TCP connection to %s:%s failed or timed out\n' "$host_name" "$port"
  fi
  if [[ -n "$tcp_note" ]]; then
    printf '%s\n' "$tcp_note"
  fi
  printf '\n'
fi

printf 'Local network hints:\n'
if command -v ip >/dev/null 2>&1; then
  ip route show default 2>/dev/null || printf 'WARNING: unable to read default route\n'
elif command -v ss >/dev/null 2>&1; then
  ss -tuln 2>/dev/null | head -n 20 || printf 'WARNING: unable to read socket summary\n'
else
  printf 'WARNING: ip and ss are unavailable; local network hints skipped\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Host: %s count=%s timeout=%ss port=%s\n' "$host_name" "$count" "$timeout_seconds" "${port:-not checked}"
if [[ -n "$tcp_note" ]]; then
  printf '%s\n' "$tcp_note"
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Verify the DNS record and resolver path\n'
printf -- '- Check firewall, routing, security group, or proxy policy\n'
printf -- '- Compare results from another host or network segment\n'
printf -- '- Check application endpoint health after network reachability is confirmed\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

since_value="1 hour ago"
warning_count=20
critical_count=50
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_failed_ssh_logins.sh [--since TEXT] [--warning COUNT] [--critical COUNT] [--top N] [--help]

Detect failed SSH login bursts from journal or readable authentication logs.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_count" "$critical_count" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

tmp_log="$(mktemp)"
trap 'rm -f "$tmp_log"' EXIT
log_source="journalctl"

if command -v journalctl >/dev/null 2>&1; then
  journalctl --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' > "$tmp_log" || true
else
  log_source="log file fallback"
fi

if [[ ! -s "$tmp_log" ]]; then
  for log_file in /var/log/auth.log /var/log/secure /var/log/messages; do
    if [[ -r "$log_file" ]]; then
      grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' "$log_file" >> "$tmp_log" || true
      log_source="$log_file"
    fi
  done
fi

attempts="$(wc -l < "$tmp_log" | awk '{print $1}')"

status="OK"
exit_code=0
if ((attempts >= critical_count)); then
  status="CRITICAL"
  exit_code=3
elif ((attempts >= warning_count)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: Found %s failed SSH login attempt(s) for requested window\n\n' "$status" "$attempts"

printf 'Top source IPs:\n'
if [[ -s "$tmp_log" ]]; then
  # Strip either prefix; "rhost=" has no trailing space, so the space belongs to "from " only.
  grep -Eo 'from ([0-9]{1,3}\.){3}[0-9]{1,3}|rhost=([0-9]{1,3}\.){3}[0-9]{1,3}' "$tmp_log" \
    | sed -E 's/^(from |rhost=)//' \
    | sort | uniq -c | sort -rn | head -n "$top_count" || true
else
  printf 'OK: no failed SSH attempts found in available sources\n'
fi
printf '\n'

printf 'Top attempted users:\n'
if [[ -s "$tmp_log" ]]; then
  sed -nE 's/.*Invalid user ([^ ]+).*/\1/p; s/.*Failed password for invalid user ([^ ]+).*/\1/p; s/.*Failed password for ([^ ]+).*/\1/p; s/.*user=([^ ]+).*/\1/p' "$tmp_log" \
    | sort | uniq -c | sort -rn | head -n "$top_count" || true
else
  printf 'OK: no attempted users extracted\n'
fi
printf '\n'

printf 'Sample recent lines:\n'
if [[ -s "$tmp_log" ]]; then
  tail -n "$top_count" "$tmp_log"
else
  printf 'OK: no sample lines available\n'
fi
printf '\n\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s critical=%s since="%s"\n' "$warning_count" "$critical_count" "$since_value"
printf 'Log source: %s\n' "$log_source"
if [[ "$log_source" != "journalctl" ]]; then
  printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; authentication log visibility may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Verify source IPs against expected scanners, admins, or automation\n'
printf -- '- Check firewall, fail2ban, or security tooling state\n'
printf -- '- Confirm whether the attempts are expected for this host\n'
printf -- '- Review successful logins too, not only failures\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

include_system=0

usage() {
  cat <<'USAGE'
Usage: check_filesystem_readonly.sh [--include-system] [--help]

Detect filesystems mounted read-only. This check itself makes no changes.
USAGE
}

while (($# > 0)); do
  case "$1" in
    --include-system) include_system=1; shift ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

tmp_mounts="$(mktemp)"
trap 'rm -f "$tmp_mounts"' EXIT

if command -v findmnt >/dev/null 2>&1; then
  findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > "$tmp_mounts" 2>/dev/null || true
elif command -v mount >/dev/null 2>&1; then
  mount | awk '{ source=$1; target=$3; type=$5; opts=$6; gsub(/[()]/, "", opts); print target, source, type, opts }' > "$tmp_mounts"
else
  printf 'CRITICAL: findmnt or mount is required\n'
  exit 2
fi

tmp_ro="$(mktemp)"
trap 'rm -f "$tmp_mounts" "$tmp_ro"' EXIT

awk -v include_system="$include_system" '
  function system_fs(type, target) {
    return type ~ /^(proc|sysfs|tmpfs|devtmpfs|devpts|securityfs|cgroup|cgroup2|pstore|bpf|tracefs|debugfs|configfs|fusectl|mqueue|hugetlbfs|overlay|squashfs|autofs)$/ || target ~ /^\/(proc|sys|dev|run)(\/|$)/
  }
  {
    target=$1; source=$2; type=$3; opts=$4
    if (opts ~ /(^|,)ro(,|$)/) {
      if (include_system == 1 || ! system_fs(type, target)) {
        print target "\t" source "\t" type "\t" opts
      }
    }
  }
' "$tmp_mounts" > "$tmp_ro"

readonly_count="$(wc -l < "$tmp_ro" | awk '{print $1}')"
status="OK"
exit_code=0
if ((readonly_count > 0)); then
  status="CRITICAL"
  exit_code=3
fi

printf '%s: Found %s read-only filesystem(s)\n\n' "$status" "$readonly_count"

printf 'Read-only filesystems:\n'
if [[ -s "$tmp_ro" ]]; then
  printf 'MOUNT_POINT\tSOURCE\tFSTYPE\tOPTIONS\n'
  cat "$tmp_ro"
else
  printf 'OK: no read-only filesystems found with current filters\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'include_system=%s\n' "$include_system"
printf 'Collector: '
if command -v findmnt >/dev/null 2>&1; then
  printf 'findmnt\n'
else
  printf 'mount fallback\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check dmesg or journal logs for I/O errors and filesystem remount events\n'
printf -- '- Check storage path, multipath, SAN, cloud volume, or underlying disk health\n'
printf -- '- Check filesystem health with the platform-approved procedure\n'
printf -- '- Do not remount read-write before understanding the cause\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,146 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=75
critical_threshold=90
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_high_cpu.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]

Detect high CPU load and show top CPU-consuming processes.

Exit codes:
  0 OK
  1 WARNING / operational issue detected
  2 invalid input / missing required dependency
  3 CRITICAL issue detected
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --warning)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }
      warning_threshold="$2"
      shift 2
      ;;
    --critical)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }
      critical_threshold="$2"
      shift 2
      ;;
    --top)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }
      top_count="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1"
      usage
      exit 2
      ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done

if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd ps
require_cmd awk
require_cmd head

cpu_count=1
if command -v getconf >/dev/null 2>&1; then
  cpu_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || printf '1')"
elif [[ -r /proc/cpuinfo ]]; then
  cpu_count="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || printf '1')"
fi
[[ "$cpu_count" =~ ^[0-9]+$ ]] || cpu_count=1
((cpu_count > 0)) || cpu_count=1

load_1m="unavailable"
load_5m="unavailable"
load_15m="unavailable"
load_per_cpu_pct=0
if [[ -r /proc/loadavg ]]; then
  read -r load_1m load_5m load_15m _ < /proc/loadavg
elif command -v uptime >/dev/null 2>&1; then
  load_line="$(uptime 2>/dev/null || true)"
  load_1m="$(printf '%s\n' "$load_line" | sed -n 's/.*load average[s]*: *\([^,]*\).*/\1/p')"
  [[ -n "$load_1m" ]] || load_1m="unavailable"
fi
# Compute the percentage for either source, but only when the 1-minute load is numeric.
if [[ "$load_1m" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
  load_per_cpu_pct="$(awk -v load="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load / cpus) * 100 }')"
fi

status="OK"
exit_code=0
if ((load_per_cpu_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
elif ((load_per_cpu_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: 1-minute load is %s across %s CPU(s) (%s%% of CPU count)\n\n' "$status" "$load_1m" "$cpu_count" "$load_per_cpu_pct"

printf 'Load average:\n'
printf '1m=%s 5m=%s 15m=%s\n\n' "$load_1m" "$load_5m" "$load_15m"

printf 'CPU count:\n'
printf '%s\n\n' "$cpu_count"

printf 'Top CPU processes:\n'
# head may close the pipe early; with pipefail and errexit the resulting SIGPIPE status would abort the script.
ps -eo pid,ppid,user,pcpu,pmem,comm,args --sort=-pcpu | head -n "$((top_count + 1))" || true
printf '\n'

printf 'Evidence:\n'
if command -v uptime >/dev/null 2>&1; then
  uptime || true
else
  printf 'WARNING: uptime command not available; used /proc/loadavg where possible\n'
fi
if ((load_per_cpu_pct >= 100)); then
  printf 'WARNING: load is higher than online CPU count; runnable task saturation is possible\n'
else
  printf 'OK: load is not above online CPU count at collection time\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; process ownership details are usually available, but some command lines may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check process ownership and whether the top process is expected\n'
printf -- '- Check recent deployments, cron jobs, batch jobs, or maintenance activity\n'
printf -- '- Review logs for the top CPU-consuming process\n'
printf -- '- Compare with longer trend data from monitoring before taking action\n'
printf -- '- Attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,138 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=80
critical_threshold=90
since_value="24 hours ago"
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_high_memory_oom.sh [--warning PERCENT] [--critical PERCENT] [--since TEXT] [--top N] [--help]

Detect high memory or swap usage and show recent OOM killer evidence.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
|
||||
if ! is_number "$value"; then
|
||||
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
if ((warning_threshold >= critical_threshold)); then
|
||||
printf 'CRITICAL: --warning must be lower than --critical\n'
|
||||
exit 2
|
||||
fi
|
||||
|
||||
require_cmd free
|
||||
require_cmd ps
|
||||
require_cmd awk
|
||||
require_cmd head
|
||||
|
||||
read -r mem_total mem_used swap_total swap_used < <(free -m | awk '
|
||||
/^Mem:/ { mt=$2; mu=$3 }
|
||||
/^Swap:/ { st=$2; su=$3 }
|
||||
END { printf "%d %d %d %d\n", mt, mu, st, su }
|
||||
')
|
||||
|
||||
mem_pct=0
|
||||
swap_pct=0
|
||||
if ((mem_total > 0)); then
|
||||
mem_pct=$((mem_used * 100 / mem_total))
|
||||
fi
|
||||
if ((swap_total > 0)); then
|
||||
swap_pct=$((swap_used * 100 / swap_total))
|
||||
fi
|
||||
|
||||
status="OK"
|
||||
exit_code=0
|
||||
if ((mem_pct >= critical_threshold || swap_pct >= critical_threshold)); then
|
||||
status="CRITICAL"
|
||||
exit_code=3
|
||||
elif ((mem_pct >= warning_threshold || swap_pct >= warning_threshold)); then
|
||||
status="WARNING"
|
||||
exit_code=1
|
||||
fi
|
||||
|
||||
printf '%s: Memory usage is %s%% and swap usage is %s%%\n\n' "$status" "$mem_pct" "$swap_pct"
|
||||
|
||||
printf 'Memory summary:\n'
|
||||
free -m
|
||||
printf '\n'
|
||||
|
||||
printf 'Top memory processes:\n'
|
||||
printf 'PID RSS_MB COMMAND\n'
|
||||
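# Limit rows inside awk rather than with head: under errexit and pipefail, head
# exiting early can end ps with SIGPIPE and abort the script.
ps -eo pid=,rss=,comm= --sort=-rss | awk -v n="$top_count" 'NR <= n { printf "%-7s %-8d %s\n", $1, int($2 / 1024), $3 }'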
printf '\n'

printf 'OOM events since %s:\n' "$since_value"
oom_found=0
oom_source="journalctl"
if command -v journalctl >/dev/null 2>&1; then
  if journalctl --since "$since_value" -k --no-pager 2>/dev/null | grep -Ei 'out of memory|oom-killer|killed process' | tail -n 20; then
    oom_found=1
  fi
else
  printf 'WARNING: journalctl not available; checking readable log files\n'
  oom_source="log file fallback"
fi
if ((oom_found == 0)); then
  for log_file in /var/log/messages /var/log/syslog /var/log/kern.log; do
    if [[ -r "$log_file" ]]; then
      if grep -Ei 'out of memory|oom-killer|killed process' "$log_file" | tail -n 20; then
        oom_found=1
        oom_source="$log_file"
        break
      fi
    fi
  done
fi
if ((oom_found == 0)); then
  printf 'OK: no OOM evidence found in available sources\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%% since="%s"\n' "$warning_threshold" "$critical_threshold" "$since_value"
printf 'OOM evidence source: %s\n' "$oom_source"
if [[ "$oom_source" != "journalctl" ]]; then
  printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; kernel logs or process details may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check application memory trend\n'
printf -- '- Review JVM heap settings if process is Java\n'
printf -- '- Verify swap pressure and paging activity\n'
printf -- '- Confirm whether OOM events align with application impact\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
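As a rough illustration of the percentage arithmetic in the memory check, the sketch below applies the same truncating integer division to captured `free -m` style text instead of a live host. The function name `usage_pct_from_free` and the sample figures are assumptions made for this example and are not part of the commit.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): read `free -m` style text on
# stdin and print "mem_pct swap_pct" using truncating integer percentages.
usage_pct_from_free() {
  awk '
    /^Mem:/  && $2 > 0 { mem  = int($3 * 100 / $2) }
    /^Swap:/ && $2 > 0 { swap = int($3 * 100 / $2) }
    END { printf "%d %d\n", mem, swap }
  '
}

# Illustrative sample values, not real measurements.
sample='Mem: 15934 13386 512 121 2036 2101
Swap: 4095 512 3583'

usage_pct_from_free <<< "$sample"
```

A host without swap reports a swap total of 0; the guard on `$2 > 0` leaves the swap percentage at 0 in that case, which mirrors the default used by the script.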
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=80
critical_threshold=90
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_inode_usage.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]

Detect inode exhaustion using df -i.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi
if ! command -v df >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: df\n'
  exit 2
fi

tmp_df="$(mktemp)"
tmp_alerts="$(mktemp)"
trap 'rm -f "$tmp_df" "$tmp_alerts"' EXIT

df -Pi > "$tmp_df"
# Coerce IUse% to a number with "+ 0". A value edited by gsub() is a string in
# awk, so comparisons such as 9 >= 80 or 100 >= 80 may be evaluated as text.
awk -v warn="$warning_threshold" '
  NR > 1 {
    pct = $5 + 0
    if (pct >= warn) {
      print $0
    }
  }
' "$tmp_df" > "$tmp_alerts"

max_pct="$(awk 'NR > 1 { pct = $5 + 0; if (pct > max) max = pct } END { printf "%d", max }' "$tmp_df")"
status="OK"
exit_code=0
if ((max_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
elif ((max_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: Highest inode usage is %s%%\n\n' "$status" "$max_pct"

printf 'Filesystems above threshold:\n'
if [[ -s "$tmp_alerts" ]]; then
  cat "$tmp_alerts"
else
  printf 'OK: no filesystems above warning threshold\n'
fi
printf '\n'

printf 'Inode usage table:\n'
cat "$tmp_df"
printf '\n'

printf 'Top affected mount points:\n'
# The row limit is applied in the final awk stage instead of head, so an early
# reader exit cannot fail the pipeline under errexit and pipefail.
awk 'NR > 1 { print $5 + 0, $6, $1, $2, $3, $4 }' "$tmp_df" \
  | sort -rn \
  | awk -v n="$top_count" 'NR <= n { printf "%s%% %s %s inodes=%s used=%s free=%s\n", $1, $2, $3, $4, $5, $6 }'
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%%\n\n' "$warning_threshold" "$critical_threshold"

printf 'Recommended next steps:\n'
printf -- '- Find directories with many small files under affected mount points\n'
printf -- '- Check logs, cache, spool, session, and temporary directories\n'
printf -- '- Avoid deleting blindly; confirm ownership and application impact first\n'
printf -- '- Confirm whether inode exhaustion is causing write or deploy failures\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
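To illustrate how a percentage column such as `IUse%` can be compared numerically in awk, the sketch below finds the highest value in captured `df -Pi` style text. The function name `max_inode_pct`, the device names, and the sample rows are hypothetical; `$5 + 0` is a common awk idiom that keeps only the leading numeric part of a field such as `87%`.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): print the highest IUse% value
# found in `df -Pi` style text on stdin. Non-numeric fields such as "-" become 0.
max_inode_pct() {
  awk 'NR > 1 { pct = $5 + 0; if (pct > max) max = pct } END { printf "%d\n", max }'
}

# Illustrative sample rows, not real measurements.
sample='Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/sda1 524288 91300 432988 18% /
tmpfs 100000 9000 91000 9% /run
/dev/sdb1 1000 1000 0 100% /data'

max_inode_pct <<< "$sample"
```

The sample deliberately mixes one-, two-, and three-digit percentages, which is where a text comparison and a numeric comparison would disagree.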
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

target_pid=""
match_string=""
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_jvm_threads_heap.sh [--pid PID | --match STRING] [--top N] [--help]

Provide lightweight JVM process diagnostics. Does not create heap dumps or modify processes.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --pid) [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }; target_pid="$2"; shift 2 ;;
    --match) [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }; match_string="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -n "$target_pid" && -n "$match_string" ]]; then
  printf 'CRITICAL: use either --pid or --match, not both\n'
  exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
  printf 'CRITICAL: --pid must be numeric\n'
  exit 2
fi
if ! is_number "$top_count"; then
  printf 'CRITICAL: --top must be numeric\n'
  exit 2
fi
if ! command -v ps >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: ps\n'
  exit 2
fi

tmp_java="$(mktemp)"
trap 'rm -f "$tmp_java"' EXIT

# The bracket in /[j]ava/ keeps this awk process, whose own command line
# contains the pattern text, from matching itself in the ps listing.
ps -eo pid=,user=,rss=,pcpu=,comm=,args= \
  | awk 'tolower($0) ~ /[j]ava/ && $1 != "" { print }' > "$tmp_java"

if [[ -z "$target_pid" && -n "$match_string" ]]; then
  target_pid="$(grep -F "$match_string" "$tmp_java" | awk 'NR == 1 { print $1 }' || true)"
fi

if [[ -z "$target_pid" ]]; then
  detected_count="$(wc -l < "$tmp_java" | awk '{print $1}')"
  if ((detected_count == 0)); then
    printf 'WARNING: No Java processes detected\n\n'
  else
    printf 'OK: Detected %s Java process(es); rerun with --pid PID for heap detail\n\n' "$detected_count"
  fi
  printf 'Detected JVM processes:\n'
  printf 'PID USER RSS_MB CPU COMMAND\n'
  # Limit rows inside awk; piping into head can end awk with SIGPIPE under pipefail.
  awk -v n="$top_count" 'NR <= n { pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' "$tmp_java"
  printf '\nRecommended next steps:\n'
  printf -- '- Select a JVM process with --pid for focused diagnostics\n'
  printf -- '- Review GC logs and application logs for the selected process\n'
  printf -- '- Check heap sizing and thread count trend\n'
  printf -- '- Capture jstack only if approved by operational process\n'
  exit 1
fi

if ! ps -p "$target_pid" >/dev/null 2>&1; then
  printf 'CRITICAL: process does not exist or is not visible: %s\n' "$target_pid"
  exit 2
fi

proc_line="$(ps -p "$target_pid" -o pid=,user=,rss=,pcpu=,comm=,args=)"
if ! printf '%s\n' "$proc_line" | grep -qi 'java'; then
  printf 'WARNING: PID %s does not appear to be a Java process from ps output\n\n' "$target_pid"
  status="WARNING"
  exit_code=1
else
  status="OK"
  exit_code=0
fi

thread_count="unavailable"
if [[ -r "/proc/${target_pid}/status" ]]; then
  thread_count="$(awk '/^Threads:/ { print $2 }' "/proc/${target_pid}/status")"
fi

printf '%s: JVM diagnostics collected for PID %s\n\n' "$status" "$target_pid"

printf 'Detected JVM process:\n'
printf 'PID USER RSS_MB CPU COMMAND\n'
printf '%s\n' "$proc_line" | awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }'
printf 'Thread count: %s\n\n' "$thread_count"

printf 'Heap and JVM evidence:\n'
if command -v jcmd >/dev/null 2>&1; then
  printf '\n[jcmd VM.flags]\n'
  jcmd "$target_pid" VM.flags 2>/dev/null || printf 'WARNING: jcmd VM.flags failed; permissions may be limited\n'
  printf '\n[jcmd GC.heap_info]\n'
  jcmd "$target_pid" GC.heap_info 2>/dev/null || printf 'WARNING: jcmd GC.heap_info failed; permissions may be limited\n'
  printf '\n[jcmd Thread.print summary]\n'
  jcmd "$target_pid" Thread.print 2>/dev/null | awk '/java.lang.Thread.State/ { state[$0]++ } END { for (item in state) print state[item], item }' | sort -rn | head -n "$top_count" || printf 'WARNING: jcmd Thread.print failed; permissions may be limited\n'
elif command -v jstat >/dev/null 2>&1; then
  printf '\n[jstat -gc]\n'
  jstat -gc "$target_pid" 1 1 2>/dev/null || printf 'WARNING: jstat failed; permissions may be limited\n'
else
  printf 'WARNING: jcmd and jstat are unavailable; heap details skipped\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'PID=%s thread_count=%s top=%s\n' "$target_pid" "$thread_count" "$top_count"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; JVM attach and /proc details may be limited by process ownership\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Review GC logs and recent application errors\n'
printf -- '- Check JVM heap sizing against container or host memory limits\n'
printf -- '- Check thread count trend in monitoring before concluding a leak\n'
printf -- '- Capture jstack only if approved by operational process\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,121 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_offset_ms=500
critical_offset_ms=5000

usage() {
  cat <<'USAGE'
Usage: check_ntp_time_drift.sh [--warning-offset MS] [--critical-offset MS] [--help]

Check time synchronization status and offset evidence when available.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --warning-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-offset requires a value\n'; exit 2; }; warning_offset_ms="$2"; shift 2 ;;
    --critical-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-offset requires a value\n'; exit 2; }; critical_offset_ms="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_offset_ms" "$critical_offset_ms"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_offset_ms >= critical_offset_ms)); then
  printf 'CRITICAL: --warning-offset must be lower than --critical-offset\n'
  exit 2
fi

system_time="$(date '+%Y-%m-%d %H:%M:%S %Z %z')"
timezone="$(date '+%Z %z')"
sync_status="unknown"
detected_tool="none"
offset_ms=""

timedate_output=""
if command -v timedatectl >/dev/null 2>&1; then
  detected_tool="timedatectl"
  timedate_output="$(timedatectl 2>/dev/null || true)"
  sync_status="$(printf '%s\n' "$timedate_output" | awk -F: '/System clock synchronized|NTP synchronized/ { gsub(/^ +/, "", $2); print $2; exit }')"
  [[ -n "$sync_status" ]] || sync_status="unknown"
fi

chronyc_output=""
if command -v chronyc >/dev/null 2>&1; then
  detected_tool="chronyc"
  chronyc_output="$(chronyc tracking 2>/dev/null || true)"
  raw_offset="$(printf '%s\n' "$chronyc_output" | awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }')"
  if [[ -n "$raw_offset" ]]; then
    offset_ms="$(awk -v seconds="$raw_offset" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d", seconds * 1000 }')"
  fi
elif command -v ntpq >/dev/null 2>&1; then
  detected_tool="ntpq"
fi

status="OK"
exit_code=0
if [[ "$sync_status" =~ ^(no|false)$ ]]; then
  status="WARNING"
  exit_code=1
fi
if [[ -n "$offset_ms" ]]; then
  if ((offset_ms >= critical_offset_ms)); then
    status="CRITICAL"
    exit_code=3
  elif ((offset_ms >= warning_offset_ms)); then
    status="WARNING"
    exit_code=1
  fi
elif [[ "$detected_tool" == "none" ]]; then
  status="WARNING"
  exit_code=1
fi

printf '%s: Time sync status=%s offset_ms=%s\n\n' "$status" "$sync_status" "${offset_ms:-unavailable}"

printf 'Time status:\n'
printf 'System time: %s\n' "$system_time"
printf 'Timezone: %s\n' "$timezone"
printf 'Detected tool: %s\n' "$detected_tool"
printf 'NTP synchronized: %s\n' "$sync_status"
printf 'Offset ms: %s\n\n' "${offset_ms:-unavailable}"

printf 'Tool evidence:\n'
if [[ -n "$chronyc_output" ]]; then
  printf '%s\n' "$chronyc_output"
elif command -v ntpq >/dev/null 2>&1; then
  ntpq -p 2>/dev/null || printf 'WARNING: ntpq command failed\n'
elif [[ -n "$timedate_output" ]]; then
  printf '%s\n' "$timedate_output"
else
  printf 'WARNING: timedatectl, chronyc, and ntpq are unavailable or returned no data\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%sms critical=%sms\n' "$warning_offset_ms" "$critical_offset_ms"
if [[ -z "$offset_ms" ]]; then
  printf 'WARNING: offset unavailable; status is based on available synchronization indicators only\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Verify chrony or ntpd service status and configuration\n'
printf -- '- Check NTP sources and reachability\n'
printf -- '- Check virtualization host time if this is a VM\n'
printf -- '- Avoid restarting time services blindly in production\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
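As a small worked example of the offset conversion used by the time drift check, the sketch below extracts the `System time` value from captured `chronyc tracking` style text and converts it to whole milliseconds. The function name `chrony_offset_ms` and the sample text are assumptions for illustration; real `chronyc tracking` output has more fields than shown here.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): read `chronyc tracking` style
# text on stdin and print the absolute "System time" offset in whole milliseconds.
chrony_offset_ms() {
  awk -F: '/^System time/ {
    split($2, parts, " ")            # parts[1] is the value in seconds
    seconds = parts[1] + 0
    if (seconds < 0) seconds = -seconds
    printf "%d\n", seconds * 1000    # %d truncates the fractional millisecond
    exit
  }'
}

# Illustrative sample text, not a real measurement.
sample='Reference ID    : 203.0.113.10
System time     : 0.812345 seconds fast of NTP time
Last offset     : +0.812345 seconds'

chrony_offset_ms <<< "$sample"
```

With the default thresholds of 500 ms and 5000 ms, an offset of this size would fall in the warning band.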
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

service_name=""
since_value="1 hour ago"
warning_count=3
critical_count=10

usage() {
  cat <<'USAGE'
Usage: check_service_restart_loop.sh --service SERVICE_NAME [--since TEXT] [--warning COUNT] [--critical COUNT] [--help]

Detect restart-loop evidence for a systemd service. Read-only.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --service) [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }; service_name="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -z "$service_name" ]]; then
  printf 'CRITICAL: --service is required\n'
  usage
  exit 2
fi
for value in "$warning_count" "$critical_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd systemctl

active_state="$(systemctl show "$service_name" --property=ActiveState --value 2>/dev/null || printf 'unknown')"
sub_state="$(systemctl show "$service_name" --property=SubState --value 2>/dev/null || printf 'unknown')"
n_restarts="$(systemctl show "$service_name" --property=NRestarts --value 2>/dev/null || printf '')"
restart_count="${n_restarts:-0}"
if ! is_number "$restart_count"; then
  restart_count=0
fi

status="OK"
exit_code=0
if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then
  status="CRITICAL"
  exit_code=3
elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then
  status="WARNING"
  exit_code=1
fi

printf '%s: Service %s state=%s substate=%s restarts=%s\n\n' "$status" "$service_name" "$active_state" "$sub_state" "$restart_count"

printf 'Service state:\n'
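# systemctl status exits non-zero for inactive or failed units even when it
# prints their state, so warn only when no output was produced.
service_status_output="$(systemctl status "$service_name" --no-pager --lines=8 2>/dev/null || true)"
if [[ -n "$service_status_output" ]]; then
  printf '%s\n' "$service_status_output"
else
  printf 'WARNING: unable to read service status for %s\n' "$service_name"
fi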
printf '\n'

printf 'Systemd properties:\n'
systemctl show "$service_name" --property=Id,Names,LoadState,ActiveState,SubState,Result,ExecMainStatus,NRestarts,Restart,RestartUSec --no-pager 2>/dev/null || true
printf '\n'

printf 'Recent start/stop/failure log lines since %s:\n' "$since_value"
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u "$service_name" --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'start|stop|fail|restart|exit|status|main process' \
    | tail -n 40 || printf 'OK: no matching journal lines found\n'
else
  printf 'WARNING: journalctl not available; service logs unavailable from this script\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s restarts critical=%s restarts since="%s"\n' "$warning_count" "$critical_count" "$since_value"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; journal visibility may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Inspect the unit file and drop-in overrides\n'
printf -- '- Review application logs around the restart timestamps\n'
printf -- '- Check dependencies such as network, storage, database, or secrets\n'
printf -- '- Verify recent configuration or package changes\n'
printf -- '- Do not restart blindly; attach this output to the incident ticket\n'

exit "$exit_code"
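The checks shown above appear to share one exit-code convention: 0 for OK, 1 for WARNING, 3 for CRITICAL, and 2 when the check itself could not run (invalid options, a missing command, or an unusable target). A minimal sketch of a wrapper that labels a check result by that convention follows. The names `describe_check_exit`, `run_check`, and `fake_critical_check` are hypothetical and are not part of the commit.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map an exit code from one of the checks to a label.
describe_check_exit() {
  case "$1" in
    0) printf 'OK\n' ;;
    1) printf 'WARNING\n' ;;
    2) printf 'CHECK_ERROR\n' ;;   # bad options, missing command, unusable target
    3) printf 'CRITICAL\n' ;;
    *) printf 'UNKNOWN\n' ;;
  esac
}

# Hypothetical wrapper: run a check, pass its output through, then print a
# one-line summary of the exit code without aborting the caller.
run_check() {
  local rc=0
  "$@" || rc=$?
  printf 'exit=%s (%s)\n' "$rc" "$(describe_check_exit "$rc")"
}

# Stand-in for a real check so the sketch runs on any host.
fake_critical_check() {
  printf 'CRITICAL: example finding\n'
  return 3
}

run_check fake_critical_check
```

In practice the stand-in would be replaced by a real invocation, for example `run_check ./check_service_restart_loop.sh --service sshd`.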
@@ -0,0 +1,20 @@
WARNING: Certificate for app.example.com:443 expires in 18 day(s)

Certificate details:
Subject: CN = app.example.com
Issuer: C = US, O = Example CA, CN = Example Intermediate CA
notBefore: Apr 11 00:00:00 2026 GMT
notAfter: May 29 23:59:59 2026 GMT
SAN/CN: DNS:app.example.com, DNS:api.example.com

Evidence:
Target: app.example.com:443
SNI: app.example.com
Thresholds: warning=30 days critical=7 days

Recommended next steps:
- Renew certificate before the operational threshold is breached
- Check the full chain and intermediate certificates
- Check the load balancer, ingress, or reverse proxy serving this certificate
- Verify monitoring threshold and alert ownership
- Attach this output to incident or change ticket
@@ -0,0 +1,23 @@
OK: DNS=OK ping=OK tcp_443=OK

DNS result:
93.184.216.34 example.com

Ping result:
3 packets transmitted, 3 received, 0% packet loss, time 2002ms

TCP port result:
OK: TCP connection to example.com:443 succeeded

Local network hints:
default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15

Evidence:
Host: example.com count=3 timeout=3s port=443

Recommended next steps:
- Verify the DNS record and resolver path
- Check firewall, routing, security group, or proxy policy
- Compare results from another host or network segment
- Check application endpoint health after network reachability is confirmed
- Attach this output to incident ticket
@@ -0,0 +1,26 @@
CRITICAL: Found 73 failed SSH login attempt(s) for requested window

Top source IPs:
52 203.0.113.44
12 198.51.100.20
9 192.0.2.10

Top attempted users:
31 admin
24 oracle
18 root

Sample recent lines:
May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20

Evidence:
Thresholds: warning=20 critical=50 since="1 hour ago"
Log source: journalctl

Recommended next steps:
- Verify source IPs against expected scanners, admins, or automation
- Check firewall, fail2ban, or security tooling state
- Confirm whether the attempts are expected for this host
- Review successful logins too, not only failures
- Attach this output to incident ticket
@@ -0,0 +1,16 @@
CRITICAL: Found 1 read-only filesystem(s)

Read-only filesystems:
MOUNT_POINT SOURCE FSTYPE OPTIONS
/data /dev/mapper/vg_data-lv_data xfs ro,relatime,seclabel,attr2,inode64

Evidence:
include_system=0
Collector: findmnt

Recommended next steps:
- Check dmesg or journal logs for I/O errors and filesystem remount events
- Check storage path, multipath, SAN, cloud volume, or underlying disk health
- Check filesystem health with the platform-approved procedure
- Do not remount read-write before understanding the cause
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)

Load average:
1m=7.82 5m=6.91 15m=5.40

CPU count:
8

Top CPU processes:
PID PPID USER %CPU %MEM COMMAND COMMAND
2314 1 app 245 12.1 java java -jar order-api.jar
991 1 root 38 0.4 backup-agent backup-agent --scan

Evidence:
WARNING: load is close to online CPU count; runnable task saturation is possible

Recommended next steps:
- Check process ownership and whether the top process is expected
- Check recent deployments, cron jobs, batch jobs, or maintenance activity
- Review logs for the top CPU-consuming process
- Compare with longer trend data from monitoring before taking action
- Attach this output to the incident ticket
@@ -0,0 +1,25 @@
WARNING: Memory usage is 84% and swap usage is 12%

Memory summary:
total used free shared buff/cache available
Mem: 15934 13386 512 121 2036 2101
Swap: 4095 512 3583

Top memory processes:
PID RSS_MB COMMAND
1234 2048 java
987 812 postgres

OOM events since 24 hours ago:
2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)

Evidence:
Thresholds: warning=80% critical=90% since="24 hours ago"
OOM evidence source: journalctl

Recommended next steps:
- Check application memory trend
- Review JVM heap settings if process is Java
- Verify swap pressure and paging activity
- Confirm whether OOM events align with application impact
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: Highest inode usage is 87%

Filesystems above threshold:
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var

Inode usage table:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root 524288 91300 432988 18% /
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var

Top affected mount points:
87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394

Evidence:
Thresholds: warning=80% critical=90%

Recommended next steps:
- Find directories with many small files under affected mount points
- Check logs, cache, spool, session, and temporary directories
- Avoid deleting blindly; confirm ownership and application impact first
- Confirm whether inode exhaustion is causing write or deploy failures
- Attach this output to incident ticket
@@ -0,0 +1,30 @@
OK: JVM diagnostics collected for PID 1234

Detected JVM process:
PID USER RSS_MB CPU COMMAND
1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar
Thread count: 188

Heap and JVM evidence:

[jcmd VM.flags]
1234:
-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648

[jcmd GC.heap_info]
garbage-first heap total 2097152K, used 1521000K

[jcmd Thread.print summary]
102 java.lang.Thread.State: WAITING
53 java.lang.Thread.State: RUNNABLE
33 java.lang.Thread.State: TIMED_WAITING

Evidence:
PID=1234 thread_count=188 top=10

Recommended next steps:
- Review GC logs and recent application errors
- Check JVM heap sizing against container or host memory limits
- Check thread count trend in monitoring before concluding a leak
- Capture jstack only if approved by operational process
- Attach this output to incident ticket
@@ -0,0 +1,23 @@
WARNING: Time sync status=yes offset_ms=812

Time status:
System time: 2026-05-11 10:18:01 UTC +0000
Timezone: UTC +0000
Detected tool: chronyc
NTP synchronized: yes
Offset ms: 812

Tool evidence:
Reference ID : 203.0.113.10
System time : 0.812345 seconds fast of NTP time
Last offset : +0.812345 seconds

Evidence:
Thresholds: warning=500ms critical=5000ms

Recommended next steps:
- Verify chrony or ntpd service status and configuration
- Check NTP sources and reachability
- Check virtualization host time if this is a VM
- Avoid restarting time services blindly in production
- Attach this output to incident ticket
@@ -0,0 +1,27 @@
CRITICAL: Service app.service state=failed substate=failed restarts=12

Service state:
app.service - Example application
Loaded: loaded (/etc/systemd/system/app.service; enabled)
Active: failed (Result: exit-code)

Systemd properties:
Id=app.service
ActiveState=failed
SubState=failed
Result=exit-code
NRestarts=12

Recent start/stop/failure log lines since 1 hour ago:
May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.

Evidence:
Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"

Recommended next steps:
- Inspect the unit file and drop-in overrides
- Review application logs around the restart timestamps
- Check dependencies such as network, storage, database, or secrets
- Verify recent configuration or package changes
- Do not restart blindly; attach this output to the incident ticket