Add Linux fresh setup toolkit

Add AI lab maintenance toolkit
Document Slurm AI/HPC cluster project
2026-06-06 00:23:11 +00:00 · 2026-06-06 00:10:44 +00:00 · 2026-06-05 15:39:24 +00:00 · 2026-06-05 15:38:56 +00:00 · 2026-05-14 21:23:49 +02:00 · 2026-05-12 20:00:42 +00:00
122 changed files with 9759 additions and 8 deletions
@@ -4,6 +4,8 @@
 ### Added
 - Added Linux Fresh Setup Toolkit under `labs/linux/setup` for day-0 Ubuntu lab host bootstrap automation.
 - Added AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation.
 - Python tooling validation for operational scripts.
 - `incident-log-summary` for general incident log summarization.
 - `log-diff-checker` for pre-change and post-change log comparison.
@@ -11,6 +13,8 @@
 - `jvm-log-analyzer` for JVM application log summaries.
 - `journal-analyzer` for exported `journalctl` log review.
 - `known-error-matcher` with JSON-based known error patterns.
 - Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics.
 - `incident_triage_report.sh` for L2 Markdown incident handover reports built from existing Bash incident checks.
 - Repository-level Codex guidance:
  - `AGENTS.md`
  - `docs/codex/README.md`
@@ -34,6 +38,7 @@
  - IBM AIX 7 role and playbook.
 - Shared sanitized Ansible inventory defaults for Linux and AIX examples.
 - Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
 - Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
 ### Changed
@@ -30,6 +30,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
 - [infra-run](./infra-run/) - the main implemented project in this repository.
 - [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
 - [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
 - [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
 - [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
 - [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
@@ -41,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
 - [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
 - [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
 - [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
 - [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
 ## Planned Areas
@@ -105,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
 - Veritas VxVM/VCS operational awareness.
 - GPFS / IBM Spectrum Scale operational awareness.
 - Ansible role organization for selected hardening controls.
 - Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
 - Clear documentation of what was tested and what still needs a real system.
@@ -20,6 +20,7 @@ This file keeps future portfolio ideas in one place so empty folders do not look
 ## Implemented Portfolio Additions
 - Standalone Bash incident checks under `infra-run/scripts/bash/incident-checks/` for common Linux incident triage and ticket evidence.
 - Python operational log analysis suite under `infra-run/scripts/python/`:
  - `incident-log-summary`
  - `log-diff-checker`
@@ -9,6 +9,7 @@ The goal is to show operational judgment, not to ship a universal automation pro
 ### Bash Operational Scripts
 - [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
 - [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, JVM diagnostics, and an L2 Markdown triage report wrapper.
 - [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
 - [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
 - [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.
@@ -16,6 +16,7 @@ This file tracks planned `infra-run` additions without presenting them as comple
 ## Implemented Additions
 - `infra-run/scripts/bash/incident-checks/` - standalone read-only Bash checks for CPU, memory/OOM, service restart loops, failed SSH logins, TLS certificate expiry, DNS connectivity, time sync drift, read-only filesystems, inode pressure, and JVM process diagnostics.
 - `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files.
 - `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review.
 - `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review.
@@ -7,5 +7,6 @@ These files use fake hostnames, reserved example domains, reserved IP address ra
 ## Included
 - `disk-full/` - sample filesystem usage, deleted open files, and a short after-action report.
 - `incident-triage/` - sample L2 incident triage report for repeatable handoff and ticket evidence.
 - `veritas/` - sample VxVM disk and VCS service group output.
 - `gpfs/` - sample GPFS cluster and NSD output.
@@ -0,0 +1,131 @@
 # L2 Incident Triage Report
 - Generated: 2026-05-12T19:30:00Z
 - Local hostname: app01.example.internal
 - Current user: triage
 - Incident type: all
 - Service: nginx
 - Host: app.example.com
 - Port: 443
 - PID: not provided
 - Process match: not provided
 - Since: 30 minutes ago
 ## Executed Checks
 | Check | Script | Status | Exit | Command |
 | --- | --- | --- | --- | --- |
 | CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
 | Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
 | Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
 | DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
 | Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
 | Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
 | Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
 | Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
 | JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
 ## Summary
 - CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
 - Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
 - Service restart loop: OK: Service nginx state=active substate=running restarts=0
 - DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
 - Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
 - Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
 - Read-only filesystems: OK: Found 0 read-only filesystem(s)
 - Inode usage: OK: Highest inode usage is 42%
 - JVM threads and heap: WARNING: No Java processes detected
 ## Raw Evidence
 ### CPU saturation
 Script: `check_high_cpu.sh`
 Command: `./check_high_cpu.sh`
 Status: OK, exit: 0
 ```text
 OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
 Load average:
 1m=0.42 5m=0.38 15m=0.31
 Top CPU processes:
 PID PPID USER %CPU %MEM COMMAND ARGS
 1450 1 app 7.2 2.1 nginx nginx: worker process
 Recommended next steps:
 - Check process ownership and whether the top process is expected
 - Review logs for the top CPU-consuming process
 ```
 ### Memory and OOM
 Script: `check_high_memory_oom.sh`
 Command: `./check_high_memory_oom.sh --since "30 minutes ago"`
 Status: WARNING, exit: 1
 ```text
 WARNING: Memory usage is 84% and swap usage is 12%
 Memory summary:
 Mem: 15800 13272 1110 210 1418 1840
 Swap: 4095 512 3583
 OOM events since 30 minutes ago:
 OK: no OOM evidence found in available sources
 ```
 ### Service restart loop
 Script: `check_service_restart_loop.sh`
 Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`
 Status: OK, exit: 0
 ```text
 OK: Service nginx state=active substate=running restarts=0
 Systemd properties:
 Id=nginx.service
 ActiveState=active
 SubState=running
 NRestarts=0
 ```
 ### Skipped or limited checks
 ```text
 JVM threads and heap returned WARNING because no Java process was detected.
 No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
 ```
 ## L2 Handover Checklist
 - [ ] Business impact confirmed
 - [ ] Affected host/service identified
 - [ ] Monitoring alert attached
 - [ ] Recent changes checked
 - [ ] Logs attached
 - [ ] Service owner identified
 - [ ] Escalation target identified
 ## Escalation Notes
 - Escalate when impact is active, spreading, customer-facing, or outside L2 access.
 - Include the alert, timeline, commands run, and the raw evidence above.
 - Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
 - Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.
 ## Recommended Next Steps
 - Confirm the symptom against monitoring and user reports.
 - Compare this point-in-time evidence with recent deploys, config changes, and host events.
 - Attach this report to the incident ticket before handoff.
 - If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
@@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. T
 ```mermaid
 flowchart TD
  A["bash"] --> B["os-healthcheck"]
-  A --> C["disk-full"]
+  A --> C["incident-checks"]
-  A --> D["veritas"]
+  A --> D["disk-full"]
-  A --> E["gpfs"]
+  A --> E["veritas"]
  A --> F["gpfs"]
  B --> B1["Host diagnostics"]
-  C --> C1["Incident workflow"]
+  C --> C1["Standalone triage checks"]
-  D --> D1["VxVM and VCS change flow"]
+  D --> D1["Incident workflow"]
-  E --> E1["Spectrum Scale expansion flow"]
+  E --> E1["VxVM and VCS change flow"]
  F --> F1["Spectrum Scale expansion flow"]
 ```
 ## Scripts
@@ -23,6 +25,7 @@ flowchart TD
 - `os-healthcheck/service_check.sh` - critical service status check.
 - `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
 - `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
 - `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics.
 ## Usage
@@ -37,6 +40,12 @@ cd infra-run/scripts/bash/os-healthcheck
 ./system_report.sh
 ./network_troubleshoot.sh
 ./network_troubleshoot.sh google.com
 cd ../incident-checks
 ./check_high_cpu.sh
 ./check_high_memory_oom.sh --since "24 hours ago"
 ./check_service_restart_loop.sh --service sshd
 ./check_certificate_expiry.sh --host example.com
 ```
 ## Standards
@@ -0,0 +1,124 @@
 # Bash Incident Checks
 Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident and run without repository context, so that their output can be pasted into an incident or change ticket as evidence.
 They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.
 ## Scripts
 - `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes.
 - `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence.
 - `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines.
 - `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs.
 - `check_certificate_expiry.sh` - remote or local TLS certificate expiry check.
 - `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints.
 - `check_ntp_time_drift.sh` - time sync status and offset evidence when available.
 - `check_filesystem_readonly.sh` - read-only filesystem detection.
 - `check_inode_usage.sh` - inode pressure and top affected mount points.
 - `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.
 - `incident_triage_report.sh` - wrapper that runs selected checks and writes a single Markdown L2 handover report.
 ## Usage Examples
 ```bash
 ./check_high_cpu.sh
 ./check_high_cpu.sh --warning 70 --critical 90 --top 15
 ./check_high_memory_oom.sh
 ./check_high_memory_oom.sh --since "6 hours ago" --top 5
 ./check_service_restart_loop.sh --service nginx
 ./check_service_restart_loop.sh --service app.service --since "30 minutes ago"
 ./check_failed_ssh_logins.sh
 ./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25
 ./check_certificate_expiry.sh --host example.com
 ./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
 ./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt
 ./check_dns_connectivity.sh --host example.com
 ./check_dns_connectivity.sh --host db.example.internal --port 5432
 ./check_ntp_time_drift.sh
 ./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000
 ./check_filesystem_readonly.sh
 ./check_filesystem_readonly.sh --include-system
 ./check_inode_usage.sh
 ./check_inode_usage.sh --warning 75 --critical 90
 ./check_jvm_threads_heap.sh
 ./check_jvm_threads_heap.sh --pid 1234
 ./check_jvm_threads_heap.sh --match app-name
 ./incident_triage_report.sh --type cpu
 ./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
 ./incident_triage_report.sh --type network --host app.example.com --port 443
 ./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
 ```
 ## L2 Triage Report Wrapper
 `incident_triage_report.sh` collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.
 Supported report types are `cpu`, `memory`, `service`, `network`, `auth`, `cert`, `filesystem`, `jvm`, and `all`.
 The wrapper is read-only apart from writing the requested `--output` file. It does not require root and skips checks safely when an underlying script is missing, not executable, or missing required context such as `--service` or `--host`.
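The skip behavior described above can be sketched as a small helper. This is only an illustration, not the wrapper's actual code: the function name `run_or_skip` and the `SKIPPED:` message format are assumptions made for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a check script if it is usable, otherwise report a skip.
run_or_skip() {
  local script="$1"
  shift
  if [[ ! -e "$script" ]]; then
    printf 'SKIPPED: %s is missing\n' "$script"
    return 0
  fi
  if [[ ! -x "$script" ]]; then
    printf 'SKIPPED: %s is not executable\n' "$script"
    return 0
  fi
  # Run the check with the remaining arguments and pass its exit code through.
  "$script" "$@"
}
```

A caller could use it as `run_or_skip ./check_high_cpu.sh --top 5`, treating a skip as a non-fatal event while still seeing the real exit code of any check that runs.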
 ## Exit Codes
 - `0` - OK.
 - `1` - WARNING or operational issue detected.
 - `2` - invalid input or missing required dependency.
 - `3` - CRITICAL issue detected.
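When a check is called from another script, the exit code can be turned back into a status label. The sketch below assumes only the four codes listed above; the function name `status_from_exit` and the label used for code `2` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a check's exit code into a status label
# that follows the Exit Codes list.
status_from_exit() {
  case "$1" in
    0) printf 'OK\n' ;;
    1) printf 'WARNING\n' ;;
    2) printf 'INVALID_INPUT_OR_MISSING_DEPENDENCY\n' ;;
    3) printf 'CRITICAL\n' ;;
    *) printf 'UNKNOWN\n' ;;
  esac
}

# Capture the exit code without tripping `set -o errexit` in the caller.
# The subshell is a stand-in for running a real check script.
rc=0
(exit 3) || rc=$?
status_from_exit "$rc"   # prints CRITICAL
```

The `|| rc=$?` form matters because the checks return non-zero for WARNING and CRITICAL, which would otherwise stop a caller that runs with `errexit`.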
 ## Supported Platforms
 These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.
 Some data sources vary by distribution:
 - RHEL-like systems often use `/var/log/secure` and `/var/log/messages`.
 - Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`.
 - systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available.
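One way to handle the distribution differences above is to probe a list of candidate paths and use the first readable one. This is a minimal sketch, not code taken from the checks; the helper name `first_readable_log` is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first readable path from a candidate list.
# Returns non-zero when none of the candidates is readable.
first_readable_log() {
  local candidate
  for candidate in "$@"; do
    if [[ -r "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Debian/Ubuntu path first, then the RHEL-like path.
auth_log="$(first_readable_log /var/log/auth.log /var/log/secure || true)"
printf 'Auth log source: %s\n' "${auth_log:-none readable}"
```

Reporting `none readable` explicitly keeps the limitation visible in the ticket evidence instead of producing an empty result.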
 ## Safety Notes
 - Scripts are read-only.
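 - Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files. Temporary files created by some checks are removed on exit, and only the report wrapper writes the file named by `--output`.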
 - Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
 - Treat output as triage evidence, not as complete root-cause analysis.
 ## Dependency Notes
 Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`.
 Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files.
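A runtime dependency check of the kind described above could look like the following. It is a sketch under assumptions: the helper name `missing_commands` is hypothetical, and each real script decides for itself which tools are required and which are optional.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each command that is not on PATH, one per line.
# Returns non-zero if at least one command is missing.
missing_commands() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      missing=1
    fi
  done
  return "$missing"
}

# Optional tools only produce a note; the check continues without them.
if ! missing_commands journalctl >/dev/null; then
  printf 'WARNING: journalctl not available; journal evidence skipped\n'
fi
```

For required tools, a caller would print a `CRITICAL:` line and exit with code `2`, matching the exit code list.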
 ## Copy-To-Server Example
 ```bash
 scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
 ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
 ```
 Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
 ## Sample Outputs
 Sanitized examples are available in [examples](./examples/):
 - `high-cpu.sample.txt`
 - `high-memory-oom.sample.txt`
 - `service-restart-loop.sample.txt`
 - `failed-ssh-logins.sample.txt`
 - `certificate-expiry.sample.txt`
 - `dns-connectivity.sample.txt`
 - `ntp-time-drift.sample.txt`
 - `filesystem-readonly.sample.txt`
 - `inode-usage.sample.txt`
 - `jvm-threads-heap.sample.txt`
 A sanitized report sample is available at [../../../examples/incident-triage/l2-incident-triage-report.sample.md](../../../examples/incident-triage/l2-incident-triage-report.sample.md).
@@ -0,0 +1,134 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 host_name=""
 port=443
 cert_file=""
 warning_days=30
 critical_days=7
 servername=""
 usage() {
  cat <<'USAGE'
 Usage: check_certificate_expiry.sh (--host HOST [--port PORT] | --file CERT_FILE) [--servername SNI_NAME] [--warning-days DAYS] [--critical-days DAYS] [--help]
 Check TLS certificate expiry for a remote endpoint or local certificate file.
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 while (($# > 0)); do
  case "$1" in
    --host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
    --port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
    --file) [[ $# -ge 2 ]] || { printf 'CRITICAL: --file requires a value\n'; exit 2; }; cert_file="$2"; shift 2 ;;
    --servername) [[ $# -ge 2 ]] || { printf 'CRITICAL: --servername requires a value\n'; exit 2; }; servername="$2"; shift 2 ;;
    --warning-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-days requires a value\n'; exit 2; }; warning_days="$2"; shift 2 ;;
    --critical-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-days requires a value\n'; exit 2; }; critical_days="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 if ! command -v openssl >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: openssl\n'
  exit 2
 fi
 for value in "$port" "$warning_days" "$critical_days"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
 done
 if ((critical_days >= warning_days)); then
  printf 'CRITICAL: --critical-days must be lower than --warning-days\n'
  exit 2
 fi
 if [[ -n "$host_name" && -n "$cert_file" ]]; then
  printf 'CRITICAL: use either --host or --file, not both\n'
  exit 2
 fi
 if [[ -z "$host_name" && -z "$cert_file" ]]; then
  printf 'CRITICAL: either --host or --file is required\n'
  usage
  exit 2
 fi
 if [[ -n "$cert_file" && ! -r "$cert_file" ]]; then
  printf 'CRITICAL: certificate file is not readable: %s\n' "$cert_file"
  exit 2
 fi
 if [[ -z "$servername" ]]; then
  servername="$host_name"
 fi
 tmp_cert="$(mktemp)"
 trap 'rm -f "$tmp_cert"' EXIT
 if [[ -n "$host_name" ]]; then
  if ! openssl s_client -connect "${host_name}:${port}" -servername "$servername" -showcerts </dev/null 2>/dev/null \
      | openssl x509 -outform PEM > "$tmp_cert" 2>/dev/null; then
    printf 'CRITICAL: unable to retrieve certificate from %s:%s\n' "$host_name" "$port"
    exit 2
  fi
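 else
   cp "$cert_file" "$tmp_cert"
 fi
 # Fail with a clear message instead of exiting silently when the certificate cannot be parsed.
 if ! openssl x509 -in "$tmp_cert" -noout >/dev/null 2>&1; then
   printf 'CRITICAL: unable to parse certificate: %s\n' "${cert_file:-${host_name}:${port}}"
   exit 2
 fi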
 subject="$(openssl x509 -in "$tmp_cert" -noout -subject 2>/dev/null | sed 's/^subject=//')"
 issuer="$(openssl x509 -in "$tmp_cert" -noout -issuer 2>/dev/null | sed 's/^issuer=//')"
 not_before="$(openssl x509 -in "$tmp_cert" -noout -startdate 2>/dev/null | sed 's/^notBefore=//')"
 not_after="$(openssl x509 -in "$tmp_cert" -noout -enddate 2>/dev/null | sed 's/^notAfter=//')"
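 # SAN output is optional: older OpenSSL builds lack -ext, so do not let errexit/pipefail abort here.
 san_text="$(openssl x509 -in "$tmp_cert" -noout -ext subjectAltName 2>/dev/null | sed '1d' | sed 's/^ *//' || true)"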
 expiry_epoch="$(date -d "$not_after" +%s 2>/dev/null || printf '')"
 now_epoch="$(date +%s)"
 if [[ -z "$expiry_epoch" ]]; then
  printf 'CRITICAL: unable to parse certificate expiry date: %s\n' "$not_after"
  exit 2
 fi
 seconds_left=$((expiry_epoch - now_epoch))
 days_left=$((seconds_left / 86400))
 status="OK"
 exit_code=0
 if ((days_left < critical_days)); then
  status="CRITICAL"
  exit_code=3
 elif ((days_left < warning_days)); then
  status="WARNING"
  exit_code=1
 fi
 target="$cert_file"
 if [[ -n "$host_name" ]]; then
  target="${host_name}:${port}"
 fi
 printf '%s: Certificate for %s expires in %s day(s)\n\n' "$status" "$target" "$days_left"
 printf 'Certificate details:\n'
 printf 'Subject: %s\n' "$subject"
 printf 'Issuer: %s\n' "$issuer"
 printf 'notBefore: %s\n' "$not_before"
 printf 'notAfter: %s\n' "$not_after"
 printf 'SAN/CN: %s\n' "${san_text:-$subject}"
 printf '\n'
 printf 'Evidence:\n'
 printf 'Target: %s\n' "$target"
 printf 'SNI: %s\n' "${servername:-not used}"
 printf 'Thresholds: warning=%s days critical=%s days\n\n' "$warning_days" "$critical_days"
 printf 'Recommended next steps:\n'
 printf -- '- Renew certificate before the operational threshold is breached\n'
 printf -- '- Check the full chain and intermediate certificates\n'
 printf -- '- Check the load balancer, ingress, or reverse proxy serving this certificate\n'
 printf -- '- Verify monitoring threshold and alert ownership\n'
 printf -- '- Attach this output to incident or change ticket\n'
 exit "$exit_code"
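The threshold logic in this script can be isolated to see how boundary values are classified. The sketch below mirrors the comparisons shown above; the function names `days_until` and `classify_days_left` are hypothetical and exist only for this example.

```shell
#!/usr/bin/env bash
# Hypothetical helper: whole days between two epoch timestamps.
# Bash integer division truncates toward zero, so a certificate that
# expired less than a day ago yields 0 rather than -1.
days_until() {
  local now_epoch="$1" expiry_epoch="$2"
  printf '%s\n' "$(( (expiry_epoch - now_epoch) / 86400 ))"
}

# Hypothetical helper: apply the same strict "less than" comparisons as the script.
classify_days_left() {
  local days_left="$1" warning_days="$2" critical_days="$3"
  if ((days_left < critical_days)); then
    printf 'CRITICAL\n'
  elif ((days_left < warning_days)); then
    printf 'WARNING\n'
  else
    printf 'OK\n'
  fi
}

# 74 days left with the default thresholds (warning=30, critical=7).
classify_days_left "$(days_until 0 6393600)" 30 7   # prints OK
```

Because the comparisons are strict, a certificate with exactly `--warning-days` remaining is still reported as OK, and one with exactly `--critical-days` remaining is reported as WARNING.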
@@ -0,0 +1,161 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 host_name=""
 port=""
 count=3
 timeout_seconds=3
 usage() {
  cat <<'USAGE'
 Usage: check_dns_connectivity.sh --host HOST [--port PORT] [--count COUNT] [--timeout SECONDS] [--help]
 Check DNS resolution, ping, optional TCP connectivity, and local route hints.
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 while (($# > 0)); do
  case "$1" in
    --host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
    --port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
    --count) [[ $# -ge 2 ]] || { printf 'CRITICAL: --count requires a value\n'; exit 2; }; count="$2"; shift 2 ;;
    --timeout) [[ $# -ge 2 ]] || { printf 'CRITICAL: --timeout requires a value\n'; exit 2; }; timeout_seconds="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 if [[ -z "$host_name" ]]; then
  printf 'CRITICAL: --host is required\n'
  usage
  exit 2
 fi
 for value in "$count" "$timeout_seconds"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
 done
 if [[ -n "$port" ]] && ! is_number "$port"; then
  printf 'CRITICAL: --port must be numeric\n'
  exit 2
 fi
 if ! command -v getent >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: getent\n'
  exit 2
 fi
 dns_ok=0
 ping_ok=0
 tcp_ok=0
 tcp_checked=0
 tcp_note=""
 ping_output="$(mktemp)"
 trap 'rm -f "$ping_output"' EXIT
 dns_output="$(getent hosts "$host_name" 2>/dev/null || true)"
 if [[ -n "$dns_output" ]]; then
  dns_ok=1
 fi
 if command -v ping >/dev/null 2>&1; then
  if ping -c "$count" -W "$timeout_seconds" "$host_name" > "$ping_output" 2>&1; then
    ping_ok=1
  fi
 else
  printf 'WARNING: ping command not available; ICMP check skipped\n' > "$ping_output"
 fi
 if [[ -n "$port" ]]; then
  tcp_checked=1
  if command -v timeout >/dev/null 2>&1; then
    # Pass host and port as positional parameters so they are never parsed as shell code.
    if timeout "$timeout_seconds" bash -c ':</dev/tcp/$1/$2' _ "$host_name" "$port" >/dev/null 2>&1; then
      tcp_ok=1
    fi
  else
    tcp_note="WARNING: timeout command not available; TCP /dev/tcp check used without external timeout"
    if bash -c ':</dev/tcp/$1/$2' _ "$host_name" "$port" >/dev/null 2>&1; then
      tcp_ok=1
    fi
  fi
 fi
 status="OK"
 exit_code=0
 if ((dns_ok == 0)); then
  status="CRITICAL"
  exit_code=3
 elif ((tcp_checked == 1 && tcp_ok == 0)); then
  status="CRITICAL"
  exit_code=3
 elif command -v ping >/dev/null 2>&1 && ((ping_ok == 0)); then
  status="WARNING"
  exit_code=1
 fi
 printf '%s: DNS=%s ping=%s' "$status" "$([[ "$dns_ok" == 1 ]] && printf OK || printf FAILED)" "$([[ "$ping_ok" == 1 ]] && printf OK || printf UNKNOWN_OR_FAILED)"
 if ((tcp_checked == 1)); then
  printf ' tcp_%s=%s' "$port" "$([[ "$tcp_ok" == 1 ]] && printf OK || printf FAILED)"
 fi
 printf '\n\n'
 printf 'DNS result:\n'
 if [[ -n "$dns_output" ]]; then
  printf '%s\n' "$dns_output"
 else
  printf 'CRITICAL: getent hosts returned no records for %s\n' "$host_name"
 fi
 printf '\n'
 printf 'Ping result:\n'
 if [[ -s "$ping_output" ]]; then
  cat "$ping_output"
 else
  printf 'WARNING: ping result unavailable or ping command missing\n'
 fi
 printf '\n'
 if ((tcp_checked == 1)); then
  printf 'TCP port result:\n'
  if ((tcp_ok == 1)); then
    printf 'OK: TCP connection to %s:%s succeeded\n' "$host_name" "$port"
  else
    printf 'CRITICAL: TCP connection to %s:%s failed or timed out\n' "$host_name" "$port"
  fi
  if [[ -n "$tcp_note" ]]; then
    printf '%s\n' "$tcp_note"
  fi
  printf '\n'
 fi
 printf 'Local network hints:\n'
 if command -v ip >/dev/null 2>&1; then
  ip route show default 2>/dev/null || printf 'WARNING: unable to read default route\n'
 elif command -v ss >/dev/null 2>&1; then
  ss -tuln 2>/dev/null | head -n 20 || printf 'WARNING: unable to read socket summary\n'
 else
  printf 'WARNING: ip and ss are unavailable; local network hints skipped\n'
 fi
 printf '\n'
 printf 'Evidence:\n'
 printf 'Host: %s count=%s timeout=%ss port=%s\n' "$host_name" "$count" "$timeout_seconds" "${port:-not checked}"
 if [[ -n "$tcp_note" ]]; then
  printf '%s\n' "$tcp_note"
 fi
 printf '\n'
 printf 'Recommended next steps:\n'
 printf -- '- Verify the DNS record and resolver path\n'
 printf -- '- Check firewall, routing, security group, or proxy policy\n'
 printf -- '- Compare results from another host or network segment\n'
 printf -- '- Check application endpoint health after network reachability is confirmed\n'
 printf -- '- Attach this output to incident ticket\n'
 exit "$exit_code"
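The TCP check in this script can be isolated into a small function. The sketch passes host and port to the inner shell as positional parameters, so they are never parsed as shell code. The function name `tcp_probe` is hypothetical, and the sketch assumes `timeout` is installed and that Bash was built with `/dev/tcp` support.

```shell
#!/usr/bin/env bash
# Hypothetical helper: return 0 if a TCP connection to host:port succeeds
# within the timeout, non-zero otherwise.
tcp_probe() {
  local host="$1" port="$2" timeout_seconds="${3:-3}"
  # The inner shell opens /dev/tcp/HOST/PORT for reading and closes it immediately.
  timeout "$timeout_seconds" bash -c ':</dev/tcp/$1/$2' _ "$host" "$port" >/dev/null 2>&1
}

# Example call against the local host; the result depends on what is listening.
if tcp_probe 127.0.0.1 22 1; then
  printf 'tcp_22=OK\n'
else
  printf 'tcp_22=FAILED\n'
fi
```

A failed probe does not distinguish between a refused connection, a filtered port, and a timeout, which is why the script reports it as "failed or timed out".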
@@ -0,0 +1,124 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 since_value="1 hour ago"
 warning_count=20
 critical_count=50
 top_count=10
 usage() {
  cat <<'USAGE'
 Usage: check_failed_ssh_logins.sh [--since TEXT] [--warning COUNT] [--critical COUNT] [--top N] [--help]
 Detect failed SSH login bursts from journal or readable authentication logs.
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 while (($# > 0)); do
  case "$1" in
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 for value in "$warning_count" "$critical_count" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
 done
 if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
 fi
 tmp_log="$(mktemp)"
 trap 'rm -f "$tmp_log"' EXIT
 log_source="journalctl"
 if command -v journalctl >/dev/null 2>&1; then
  journalctl --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' > "$tmp_log" || true
 else
  log_source="log file fallback"
 fi
 if [[ ! -s "$tmp_log" ]]; then
  for log_file in /var/log/auth.log /var/log/secure /var/log/messages; do
    if [[ -r "$log_file" ]]; then
      grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' "$log_file" >> "$tmp_log" || true
      log_source="$log_file"
    fi
  done
 fi
 attempts="$(wc -l < "$tmp_log" | awk '{print $1}')"
 status="OK"
 exit_code=0
 if ((attempts >= critical_count)); then
  status="CRITICAL"
  exit_code=3
 elif ((attempts >= warning_count)); then
  status="WARNING"
  exit_code=1
 fi
 printf '%s: Found %s failed SSH login attempt(s) for requested window\n\n' "$status" "$attempts"
 printf 'Top source IPs:\n'
 if [[ -s "$tmp_log" ]]; then
  grep -Eo 'from ([0-9]{1,3}\.){3}[0-9]{1,3}|rhost=([0-9]{1,3}\.){3}[0-9]{1,3}' "$tmp_log" \
    | sed -E 's/^(from |rhost=)//' \
    | sort | uniq -c | sort -rn | head -n "$top_count" || true
 else
  printf 'OK: no failed SSH attempts found in available sources\n'
 fi
 printf '\n'
 printf 'Top attempted users:\n'
 if [[ -s "$tmp_log" ]]; then
  sed -nE 's/.*Invalid user ([^ ]+).*/\1/p; s/.*Failed password for invalid user ([^ ]+).*/\1/p; s/.*Failed password for ([^ ]+).*/\1/p; s/.*user=([^ ]+).*/\1/p' "$tmp_log" \
    | sort | uniq -c | sort -rn | head -n "$top_count" || true
 else
  printf 'OK: no attempted users extracted\n'
 fi
 printf '\n'
 printf 'Sample recent lines:\n'
 if [[ -s "$tmp_log" ]]; then
  tail -n "$top_count" "$tmp_log"
 else
  printf 'OK: no sample lines available\n'
 fi
 printf '\n\n'
 printf 'Evidence:\n'
 printf 'Thresholds: warning=%s critical=%s since="%s"\n' "$warning_count" "$critical_count" "$since_value"
 printf 'Log source: %s\n' "$log_source"
 if [[ "$log_source" != "journalctl" ]]; then
  printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
 fi
 if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; authentication log visibility may be limited\n'
 fi
 printf '\n'
 printf 'Recommended next steps:\n'
 printf -- '- Verify source IPs against expected scanners, admins, or automation\n'
 printf -- '- Check firewall, fail2ban, or security tooling state\n'
 printf -- '- Confirm whether the attempts are expected for this host\n'
 printf -- '- Review successful logins too, not only failures\n'
 printf -- '- Attach this output to incident ticket\n'
 exit "$exit_code"
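The source IP tally in this script can be tried in isolation. The sketch below reads log lines from standard input and strips both the `from ` and `rhost=` prefixes before counting. The function name `top_source_ips` and the sample log lines are hypothetical; the addresses come from reserved documentation ranges.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count IPv4 source addresses in SSH failure lines on stdin.
top_source_ips() {
  local top_count="${1:-10}"
  grep -Eo '(from |rhost=)([0-9]{1,3}\.){3}[0-9]{1,3}' \
    | sed -E 's/^(from |rhost=)//' \
    | sort | uniq -c | sort -rn | head -n "$top_count"
}

# Made-up sample lines covering the "from" and "rhost=" forms.
top_source_ips 5 <<'SAMPLE'
May 12 19:01:02 app01 sshd[1001]: Failed password for root from 192.0.2.10 port 50122 ssh2
May 12 19:01:05 app01 sshd[1002]: Invalid user admin from 198.51.100.7 port 40022
May 12 19:01:09 app01 sshd[1003]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=192.0.2.10  user=root
SAMPLE
```

In this sample, `192.0.2.10` is counted twice because the `from` and `rhost=` entries collapse into the same key once the prefixes are removed. The pattern only matches IPv4 addresses, so IPv6 sources are not tallied.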
@@ -0,0 +1,89 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 include_system=0
 usage() {
  cat <<'USAGE'
 Usage: check_filesystem_readonly.sh [--include-system] [--help]
 Detect filesystems mounted read-only. This check is itself read-only and changes nothing on the host.
 USAGE
 }
 while (($# > 0)); do
  case "$1" in
    --include-system) include_system=1; shift ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 tmp_mounts="$(mktemp)"
 trap 'rm -f "$tmp_mounts"' EXIT
 if command -v findmnt >/dev/null 2>&1; then
  findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > "$tmp_mounts" 2>/dev/null || true
 elif command -v mount >/dev/null 2>&1; then
  mount | awk '{ source=$1; target=$3; type=$5; opts=$6; gsub(/[()]/, "", opts); print target, source, type, opts }' > "$tmp_mounts"
 else
  printf 'CRITICAL: findmnt or mount is required\n'
  exit 2
 fi
 tmp_ro="$(mktemp)"
 trap 'rm -f "$tmp_mounts" "$tmp_ro"' EXIT
 awk -v include_system="$include_system" '
  function system_fs(type, target) {
    return type ~ /^(proc|sysfs|tmpfs|devtmpfs|devpts|securityfs|cgroup|cgroup2|pstore|bpf|tracefs|debugfs|configfs|fusectl|mqueue|hugetlbfs|overlay|squashfs|autofs)$/ || target ~ /^\/(proc|sys|dev|run)(\/|$)/
  }
  {
    target=$1; source=$2; type=$3; opts=$4
    if (opts ~ /(^|,)ro(,|$)/) {
      if (include_system == 1 || ! system_fs(type, target)) {
        print target "\t" source "\t" type "\t" opts
      }
    }
  }
 ' "$tmp_mounts" > "$tmp_ro"
 readonly_count="$(wc -l < "$tmp_ro" | awk '{print $1}')"
 status="OK"
 exit_code=0
 if ((readonly_count > 0)); then
  status="CRITICAL"
  exit_code=3
 fi
 printf '%s: Found %s read-only filesystem(s)\n\n' "$status" "$readonly_count"
 printf 'Read-only filesystems:\n'
 if [[ -s "$tmp_ro" ]]; then
  printf 'MOUNT_POINT\tSOURCE\tFSTYPE\tOPTIONS\n'
  cat "$tmp_ro"
 else
  printf 'OK: no read-only filesystems found with current filters\n'
 fi
 printf '\n'
 printf 'Evidence:\n'
 printf 'include_system=%s\n' "$include_system"
 printf 'Collector: '
 if command -v findmnt >/dev/null 2>&1; then
  printf 'findmnt\n'
 else
  printf 'mount fallback\n'
 fi
 printf '\n'
 printf 'Recommended next steps:\n'
 printf -- '- Check dmesg or journal logs for I/O errors and filesystem remount events\n'
 printf -- '- Check storage path, multipath, SAN, cloud volume, or underlying disk health\n'
 printf -- '- Check filesystem health with the platform-approved procedure\n'
 printf -- '- Do not remount read-write before understanding the cause\n'
 printf -- '- Attach this output to incident ticket\n'
 exit "$exit_code"
@@ -0,0 +1,146 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 warning_threshold=75
 critical_threshold=90
 top_count=10
 usage() {
  cat <<'USAGE'
 Usage: check_high_cpu.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]
 Detect high CPU load and show top CPU-consuming processes.
 Exit codes:
  0 OK
  1 WARNING / operational issue detected
  2 invalid input / missing required dependency
  3 CRITICAL issue detected
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
 }
 while (($# > 0)); do
  case "$1" in
    --warning)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }
      warning_threshold="$2"
      shift 2
      ;;
    --critical)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }
      critical_threshold="$2"
      shift 2
      ;;
    --top)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }
      top_count="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1"
      usage
      exit 2
      ;;
  esac
 done
 for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
 done
 if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
 fi
 require_cmd ps
 require_cmd awk
 require_cmd head
 cpu_count=1
 if command -v getconf >/dev/null 2>&1; then
  cpu_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || printf '1')"
 elif [[ -r /proc/cpuinfo ]]; then
  cpu_count="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || printf '1')"
 fi
 [[ "$cpu_count" =~ ^[0-9]+$ ]] || cpu_count=1
 ((cpu_count > 0)) || cpu_count=1
 load_1m="unavailable"
 load_5m="unavailable"
 load_15m="unavailable"
 load_per_cpu_pct=0
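 if [[ -r /proc/loadavg ]]; then
  read -r load_1m load_5m load_15m _ < /proc/loadavg
 elif command -v uptime >/dev/null 2>&1; then
  load_line="$(uptime 2>/dev/null || true)"
  load_1m="$(printf '%s\n' "$load_line" | sed -n 's/.*load average[s]*: *\([^,]*\).*/\1/p')"
  [[ -n "$load_1m" ]] || load_1m="unavailable"
 fi
 # Compute the per-CPU percentage from whichever source produced a numeric 1-minute load,
 # so the uptime fallback is also evaluated against the thresholds.
 if [[ "$load_1m" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
  load_per_cpu_pct="$(awk -v load_avg="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"
 fi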
 status="OK"
 exit_code=0
 if ((load_per_cpu_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
 elif ((load_per_cpu_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
 fi
 printf '%s: 1-minute load is %s across %s CPU(s) (%s%% of CPU count)\n\n' "$status" "$load_1m" "$cpu_count" "$load_per_cpu_pct"
 printf 'Load average:\n'
 printf '1m=%s 5m=%s 15m=%s\n\n' "$load_1m" "$load_5m" "$load_15m"
 printf 'CPU count:\n'
 printf '%s\n\n' "$cpu_count"
 printf 'Top CPU processes:\n'
 ps -eo pid,ppid,user,pcpu,pmem,comm,args --sort=-pcpu | head -n "$((top_count + 1))" || true
 printf '\n'
 printf 'Evidence:\n'
 if command -v uptime >/dev/null 2>&1; then
  uptime || true
 else
  printf 'WARNING: uptime command not available; used /proc/loadavg where possible\n'
 fi
 if ((load_per_cpu_pct >= 100)); then
  printf 'WARNING: load is higher than online CPU count; runnable task saturation is possible\n'
 else
  printf 'OK: load is not above online CPU count at collection time\n'
 fi
 if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; process ownership details are usually available, but some command lines may be limited\n'
 fi
 printf '\n'
 printf 'Recommended next steps:\n'
 printf -- '- Check process ownership and whether the top process is expected\n'
 printf -- '- Check recent deployments, cron jobs, batch jobs, or maintenance activity\n'
 printf -- '- Review logs for the top CPU-consuming process\n'
 printf -- '- Compare with longer trend data from monitoring before taking action\n'
 printf -- '- Attach this output to the incident ticket\n'
 exit "$exit_code"
@@ -0,0 +1,138 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 warning_threshold=80
 critical_threshold=90
 since_value="24 hours ago"
 top_count=10
 usage() {
  cat <<'USAGE'
 Usage: check_high_memory_oom.sh [--warning PERCENT] [--critical PERCENT] [--since TEXT] [--top N] [--help]
 Detect high memory or swap usage and show recent OOM killer evidence.
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
 }
 while (($# > 0)); do
  case "$1" in
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
 done
 if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
 fi
 require_cmd free
 require_cmd ps
 require_cmd awk
 require_cmd head
 read -r mem_total mem_used swap_total swap_used < <(free -m | awk '
  /^Mem:/ { mt=$2; mu=$3 }
  /^Swap:/ { st=$2; su=$3 }
  END { printf "%d %d %d %d\n", mt, mu, st, su }
 ')
 mem_pct=0
 swap_pct=0
 if ((mem_total > 0)); then
  mem_pct=$((mem_used * 100 / mem_total))
 fi
 if ((swap_total > 0)); then
  swap_pct=$((swap_used * 100 / swap_total))
 fi
 status="OK"
 exit_code=0
 if ((mem_pct >= critical_threshold || swap_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
 elif ((mem_pct >= warning_threshold || swap_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
 fi
 printf '%s: Memory usage is %s%% and swap usage is %s%%\n\n' "$status" "$mem_pct" "$swap_pct"
 printf 'Memory summary:\n'
 free -m
 printf '\n'
 printf 'Top memory processes:\n'
 printf 'PID     RSS_MB   COMMAND\n'
 ps -eo pid=,rss=,comm= --sort=-rss | head -n "$top_count" | awk '{ printf "%-7s %-8d %s\n", $1, int($2 / 1024), $3 }' || true
 printf '\n'
 printf 'OOM events since %s:\n' "$since_value"
 oom_found=0
 oom_source="journalctl"
 if command -v journalctl >/dev/null 2>&1; then
  if journalctl --since "$since_value" -k --no-pager 2>/dev/null | grep -Ei 'out of memory|oom-killer|killed process' | tail -n 20; then
    oom_found=1
  fi
 else
  printf 'WARNING: journalctl not available; checking readable log files\n'
  oom_source="log file fallback"
 fi
 if ((oom_found == 0)); then
  for log_file in /var/log/messages /var/log/syslog /var/log/kern.log; do
    if [[ -r "$log_file" ]]; then
      if grep -Ei 'out of memory|oom-killer|killed process' "$log_file" | tail -n 20; then
        oom_found=1
        oom_source="$log_file"
        break
      fi
    fi
  done
 fi
 if ((oom_found == 0)); then
  printf 'OK: no OOM evidence found in available sources\n'
 fi
 printf '\n'
 printf 'Evidence:\n'
 printf 'Thresholds: warning=%s%% critical=%s%% since="%s"\n' "$warning_threshold" "$critical_threshold" "$since_value"
 printf 'OOM evidence source: %s\n' "$oom_source"
 if [[ "$oom_source" != "journalctl" ]]; then
  printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
 fi
 if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; kernel logs or process details may be limited\n'
 fi
 printf '\n'
 printf 'Recommended next steps:\n'
 printf -- '- Check application memory trend\n'
 printf -- '- Review JVM heap settings if process is Java\n'
 printf -- '- Verify swap pressure and paging activity\n'
 printf -- '- Confirm whether OOM events align with application impact\n'
 printf -- '- Attach this output to incident ticket\n'
 exit "$exit_code"
@@ -0,0 +1,103 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 warning_threshold=80
 critical_threshold=90
 top_count=10
 usage() {
  cat <<'USAGE'
 Usage: check_inode_usage.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]
 Detect inode exhaustion using df -i.
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 while (($# > 0)); do
  case "$1" in
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
 done
 if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
 fi
 if ! command -v df >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: df\n'
  exit 2
 fi
 tmp_df="$(mktemp)"
 tmp_alerts="$(mktemp)"
 trap 'rm -f "$tmp_df" "$tmp_alerts"' EXIT
 df -Pi > "$tmp_df"
 awk -v warn="$warning_threshold" '
  NR > 1 {
    pct=$5
    gsub(/%/, "", pct)
    # Force a numeric comparison: after gsub, pct is a string, and "9" >= "80" is true as strings.
    if (pct + 0 >= warn + 0) {
      print $0
    }
  }
 ' "$tmp_df" > "$tmp_alerts"
 max_pct="$(awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); if (pct + 0 > max + 0) max=pct + 0 } END { printf "%d", max }' "$tmp_df")"
 status="OK"
 exit_code=0
 if ((max_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
 elif ((max_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
 fi
 printf '%s: Highest inode usage is %s%%\n\n' "$status" "$max_pct"
 printf 'Filesystems above threshold:\n'
 if [[ -s "$tmp_alerts" ]]; then
  cat "$tmp_alerts"
 else
  printf 'OK: no filesystems above warning threshold\n'
 fi
 printf '\n'
 printf 'Inode usage table:\n'
 cat "$tmp_df"
 printf '\n'
 printf 'Top affected mount points:\n'
 awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); print pct, $6, $1, $2, $3, $4 }' "$tmp_df" \
  | sort -rn | head -n "$top_count" \
  | awk '{ printf "%s%% %s %s inodes=%s used=%s free=%s\n", $1, $2, $3, $4, $5, $6 }'
 printf '\n'
 printf 'Evidence:\n'
 printf 'Thresholds: warning=%s%% critical=%s%%\n\n' "$warning_threshold" "$critical_threshold"
 printf 'Recommended next steps:\n'
 printf -- '- Find directories with many small files under affected mount points\n'
 printf -- '- Check logs, cache, spool, session, and temporary directories\n'
 printf -- '- Avoid deleting blindly; confirm ownership and application impact first\n'
 printf -- '- Confirm whether inode exhaustion is causing write or deploy failures\n'
 printf -- '- Attach this output to incident ticket\n'
 exit "$exit_code"
@@ -0,0 +1,134 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 target_pid=""
 match_string=""
 top_count=10
 usage() {
  cat <<'USAGE'
 Usage: check_jvm_threads_heap.sh [--pid PID | --match STRING] [--top N] [--help]
 Provide lightweight JVM process diagnostics. Does not create heap dumps or modify processes.
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 while (($# > 0)); do
  case "$1" in
    --pid) [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }; target_pid="$2"; shift 2 ;;
    --match) [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }; match_string="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 if [[ -n "$target_pid" && -n "$match_string" ]]; then
  printf 'CRITICAL: use either --pid or --match, not both\n'
  exit 2
 fi
 if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
  printf 'CRITICAL: --pid must be numeric\n'
  exit 2
 fi
 if ! is_number "$top_count"; then
  printf 'CRITICAL: --top must be numeric\n'
  exit 2
 fi
 if ! command -v ps >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: ps\n'
  exit 2
 fi
 tmp_java="$(mktemp)"
 trap 'rm -f "$tmp_java"' EXIT
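 # The [j]ava pattern keeps this awk process from matching its own command line in the ps listing.
 ps -eo pid=,user=,rss=,pcpu=,comm=,args= \
  | awk 'tolower($0) ~ /[j]ava/ && $1 != "" { print }' > "$tmp_java"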
 if [[ -z "$target_pid" && -n "$match_string" ]]; then
  target_pid="$(grep -F -- "$match_string" "$tmp_java" | awk 'NR == 1 { print $1 }' || true)"
 fi
 if [[ -z "$target_pid" ]]; then
  detected_count="$(wc -l < "$tmp_java" | awk '{print $1}')"
  # Keep the exit code aligned with the printed status: 0 for OK, 1 for WARNING.
  list_exit_code=0
  if ((detected_count == 0)); then
    printf 'WARNING: No Java processes detected\n\n'
    list_exit_code=1
  else
    printf 'OK: Detected %s Java process(es); rerun with --pid PID for heap detail\n\n' "$detected_count"
  fi
  printf 'Detected JVM processes:\n'
  printf 'PID USER RSS_MB CPU COMMAND\n'
  awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' "$tmp_java" | head -n "$top_count"
  printf '\nRecommended next steps:\n'
  printf -- '- Select a JVM process with --pid for focused diagnostics\n'
  printf -- '- Review GC logs and application logs for the selected process\n'
  printf -- '- Check heap sizing and thread count trend\n'
  printf -- '- Capture jstack only if approved by operational process\n'
  exit "$list_exit_code"
 fi
 if ! ps -p "$target_pid" >/dev/null 2>&1; then
  printf 'CRITICAL: process does not exist or is not visible: %s\n' "$target_pid"
  exit 2
 fi
 proc_line="$(ps -p "$target_pid" -o pid=,user=,rss=,pcpu=,comm=,args=)"
 if ! printf '%s\n' "$proc_line" | grep -qi 'java'; then
  printf 'WARNING: PID %s does not appear to be a Java process from ps output\n\n' "$target_pid"
  status="WARNING"
  exit_code=1
 else
  status="OK"
  exit_code=0
 fi
 thread_count="unavailable"
 if [[ -r "/proc/${target_pid}/status" ]]; then
  thread_count="$(awk '/^Threads:/ { print $2 }' "/proc/${target_pid}/status")"
 fi
 printf '%s: JVM diagnostics collected for PID %s\n\n' "$status" "$target_pid"
 printf 'Detected JVM process:\n'
 printf 'PID USER RSS_MB CPU COMMAND\n'
 printf '%s\n' "$proc_line" | awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }'
 printf 'Thread count: %s\n\n' "$thread_count"
 printf 'Heap and JVM evidence:\n'
 if command -v jcmd >/dev/null 2>&1; then
  printf '\n[jcmd VM.flags]\n'
  jcmd "$target_pid" VM.flags 2>/dev/null || printf 'WARNING: jcmd VM.flags failed; permissions may be limited\n'
  printf '\n[jcmd GC.heap_info]\n'
  jcmd "$target_pid" GC.heap_info 2>/dev/null || printf 'WARNING: jcmd GC.heap_info failed; permissions may be limited\n'
  printf '\n[jcmd Thread.print summary]\n'
  jcmd "$target_pid" Thread.print 2>/dev/null | awk '/java.lang.Thread.State/ { state[$0]++ } END { for (item in state) print state[item], item }' | sort -rn | head -n "$top_count" || printf 'WARNING: jcmd Thread.print failed; permissions may be limited\n'
 elif command -v jstat >/dev/null 2>&1; then
  printf '\n[jstat -gc]\n'
  jstat -gc "$target_pid" 1 1 2>/dev/null || printf 'WARNING: jstat failed; permissions may be limited\n'
 else
  printf 'WARNING: jcmd and jstat are unavailable; heap details skipped\n'
 fi
 printf '\n'
 printf 'Evidence:\n'
 printf 'PID=%s thread_count=%s top=%s\n' "$target_pid" "$thread_count" "$top_count"
 if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; JVM attach and /proc details may be limited by process ownership\n'
 fi
 printf '\n'
 printf 'Recommended next steps:\n'
 printf -- '- Review GC logs and recent application errors\n'
 printf -- '- Check JVM heap sizing against container or host memory limits\n'
 printf -- '- Check thread count trend in monitoring before concluding a leak\n'
 printf -- '- Capture jstack only if approved by operational process\n'
 printf -- '- Attach this output to incident ticket\n'
 exit "$exit_code"
@@ -0,0 +1,121 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 warning_offset_ms=500
 critical_offset_ms=5000
 usage() {
  cat <<'USAGE'
 Usage: check_ntp_time_drift.sh [--warning-offset MS] [--critical-offset MS] [--help]
 Check time synchronization status and offset evidence when available.
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 while (($# > 0)); do
  case "$1" in
    --warning-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-offset requires a value\n'; exit 2; }; warning_offset_ms="$2"; shift 2 ;;
    --critical-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-offset requires a value\n'; exit 2; }; critical_offset_ms="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 for value in "$warning_offset_ms" "$critical_offset_ms"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
 done
 if ((warning_offset_ms >= critical_offset_ms)); then
  printf 'CRITICAL: --warning-offset must be lower than --critical-offset\n'
  exit 2
 fi
 system_time="$(date '+%Y-%m-%d %H:%M:%S %Z %z')"
 timezone="$(date '+%Z %z')"
 sync_status="unknown"
 detected_tool="none"
 offset_ms=""
 timedate_output=""
 if command -v timedatectl >/dev/null 2>&1; then
  detected_tool="timedatectl"
  timedate_output="$(timedatectl 2>/dev/null || true)"
  sync_status="$(printf '%s\n' "$timedate_output" | awk -F: '/System clock synchronized|NTP synchronized/ { gsub(/^ +/, "", $2); print $2; exit }')"
  [[ -n "$sync_status" ]] || sync_status="unknown"
 fi
 chronyc_output=""
 if command -v chronyc >/dev/null 2>&1; then
  detected_tool="chronyc"
  chronyc_output="$(chronyc tracking 2>/dev/null || true)"
  raw_offset="$(printf '%s\n' "$chronyc_output" | awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }')"
  if [[ -n "$raw_offset" ]]; then
    offset_ms="$(awk -v seconds="$raw_offset" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d", seconds * 1000 }')"
  fi
 elif command -v ntpq >/dev/null 2>&1; then
  detected_tool="ntpq"
 fi
 status="OK"
 exit_code=0
 if [[ "$sync_status" =~ ^(no|false)$ ]]; then
  status="WARNING"
  exit_code=1
 fi
 if [[ -n "$offset_ms" ]]; then
  if ((offset_ms >= critical_offset_ms)); then
    status="CRITICAL"
    exit_code=3
  elif ((offset_ms >= warning_offset_ms)); then
    status="WARNING"
    exit_code=1
  fi
 elif [[ "$detected_tool" == "none" ]]; then
  status="WARNING"
  exit_code=1
 fi
 printf '%s: Time sync status=%s offset_ms=%s\n\n' "$status" "$sync_status" "${offset_ms:-unavailable}"
 printf 'Time status:\n'
 printf 'System time: %s\n' "$system_time"
 printf 'Timezone: %s\n' "$timezone"
 printf 'Detected tool: %s\n' "$detected_tool"
 printf 'NTP synchronized: %s\n' "$sync_status"
 printf 'Offset ms: %s\n\n' "${offset_ms:-unavailable}"
 printf 'Tool evidence:\n'
 if [[ -n "$chronyc_output" ]]; then
  printf '%s\n' "$chronyc_output"
 elif command -v ntpq >/dev/null 2>&1; then
  ntpq -p 2>/dev/null || printf 'WARNING: ntpq command failed\n'
 elif [[ -n "$timedate_output" ]]; then
  printf '%s\n' "$timedate_output"
 else
  printf 'WARNING: timedatectl, chronyc, and ntpq are unavailable or returned no data\n'
 fi
 printf '\n'
 printf 'Evidence:\n'
 printf 'Thresholds: warning=%sms critical=%sms\n' "$warning_offset_ms" "$critical_offset_ms"
 if [[ -z "$offset_ms" ]]; then
  printf 'WARNING: offset unavailable; status is based on available synchronization indicators only\n'
 fi
 printf '\n'
 printf 'Recommended next steps:\n'
 printf -- '- Verify chrony or ntpd service status and configuration\n'
 printf -- '- Check NTP sources and reachability\n'
 printf -- '- Check virtualization host time if this is a VM\n'
 printf -- '- Avoid restarting time services blindly in production\n'
 printf -- '- Attach this output to incident ticket\n'
 exit "$exit_code"
@@ -0,0 +1,111 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 service_name=""
 since_value="1 hour ago"
 warning_count=3
 critical_count=10
 usage() {
  cat <<'USAGE'
 Usage: check_service_restart_loop.sh --service SERVICE_NAME [--since TEXT] [--warning COUNT] [--critical COUNT] [--help]
 Detect restart-loop evidence for a systemd service. This check is read-only and does not restart or modify the service.
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
 }
 while (($# > 0)); do
  case "$1" in
    --service) [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }; service_name="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
 done
 if [[ -z "$service_name" ]]; then
  printf 'CRITICAL: --service is required\n'
  usage
  exit 2
 fi
 for value in "$warning_count" "$critical_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
 done
 if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
 fi
 require_cmd systemctl
 active_state="$(systemctl show "$service_name" --property=ActiveState --value 2>/dev/null || printf 'unknown')"
 sub_state="$(systemctl show "$service_name" --property=SubState --value 2>/dev/null || printf 'unknown')"
 n_restarts="$(systemctl show "$service_name" --property=NRestarts --value 2>/dev/null || printf '')"
 restart_count="${n_restarts:-0}"
 if ! is_number "$restart_count"; then
  restart_count=0
 fi
 status="OK"
 exit_code=0
 if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then
  status="CRITICAL"
  exit_code=3
 elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then
  status="WARNING"
  exit_code=1
 fi
 printf '%s: Service %s state=%s substate=%s restarts=%s\n\n' "$status" "$service_name" "$active_state" "$sub_state" "$restart_count"
 printf 'Service state:\n'
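 # systemctl status exits non-zero for inactive or failed units even when it prints output,
 # so warn only when nothing could be read.
 status_output="$(systemctl status "$service_name" --no-pager --lines=8 2>/dev/null || true)"
 if [[ -n "$status_output" ]]; then
  printf '%s\n' "$status_output"
 else
  printf 'WARNING: unable to read service status for %s\n' "$service_name"
 fi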
 printf '\n'
 printf 'Systemd properties:\n'
 systemctl show "$service_name" --property=Id,Names,LoadState,ActiveState,SubState,Result,ExecMainStatus,NRestarts,Restart,RestartUSec --no-pager 2>/dev/null || true
 printf '\n'
 printf 'Recent start/stop/failure log lines since %s:\n' "$since_value"
 if command -v journalctl >/dev/null 2>&1; then
  journalctl -u "$service_name" --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'start|stop|fail|restart|exit|status|main process' \
    | tail -n 40 || printf 'OK: no matching journal lines found\n'
 else
  printf 'WARNING: journalctl not available; service logs unavailable from this script\n'
 fi
 printf '\n'
 printf 'Evidence:\n'
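 printf 'Thresholds: warning=%s restarts critical=%s restarts since="%s"\n' "$warning_count" "$critical_count" "$since_value"
 printf 'NOTE: the restart count comes from the systemd NRestarts property and is not limited to the --since window; only the journal lines are\n'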
 if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; journal visibility may be limited\n'
 fi
 printf '\n'
 printf 'Recommended next steps:\n'
 printf -- '- Inspect the unit file and drop-in overrides\n'
 printf -- '- Review application logs around the restart timestamps\n'
 printf -- '- Check dependencies such as network, storage, database, or secrets\n'
 printf -- '- Verify recent configuration or package changes\n'
 printf -- '- Do not restart blindly; attach this output to the incident ticket\n'
 exit "$exit_code"
@@ -0,0 +1,20 @@
 WARNING: Certificate for app.example.com:443 expires in 18 day(s)
 Certificate details:
 Subject: CN = app.example.com
 Issuer: C = US, O = Example CA, CN = Example Intermediate CA
 notBefore: Apr 11 00:00:00 2026 GMT
 notAfter: May 29 23:59:59 2026 GMT
 SAN/CN: DNS:app.example.com, DNS:api.example.com
 Evidence:
 Target: app.example.com:443
 SNI: app.example.com
 Thresholds: warning=30 days critical=7 days
 Recommended next steps:
 - Renew certificate before the operational threshold is breached
 - Check the full chain and intermediate certificates
 - Check the load balancer, ingress, or reverse proxy serving this certificate
 - Verify monitoring threshold and alert ownership
 - Attach this output to incident or change ticket
@@ -0,0 +1,23 @@
 OK: DNS=OK ping=OK tcp_443=OK
 DNS result:
 93.184.216.34 example.com
 Ping result:
 3 packets transmitted, 3 received, 0% packet loss, time 2002ms
 TCP port result:
 OK: TCP connection to example.com:443 succeeded
 Local network hints:
 default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15
 Evidence:
 Host: example.com count=3 timeout=3s port=443
 Recommended next steps:
 - Verify the DNS record and resolver path
 - Check firewall, routing, security group, or proxy policy
 - Compare results from another host or network segment
 - Check application endpoint health after network reachability is confirmed
 - Attach this output to incident ticket
@@ -0,0 +1,26 @@
 CRITICAL: Found 73 failed SSH login attempt(s) for requested window
 Top source IPs:
 52 203.0.113.44
 12 198.51.100.20
 9 192.0.2.10
 Top attempted users:
 31 admin
 24 oracle
 18 root
 Sample recent lines:
 May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
 May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20
 Evidence:
 Thresholds: warning=20 critical=50 since="1 hour ago"
 Log source: journalctl
 Recommended next steps:
 - Verify source IPs against expected scanners, admins, or automation
 - Check firewall, fail2ban, or security tooling state
 - Confirm whether the attempts are expected for this host
 - Review successful logins too, not only failures
 - Attach this output to incident ticket
@@ -0,0 +1,16 @@
 CRITICAL: Found 1 read-only filesystem(s)
 Read-only filesystems:
 MOUNT_POINT	SOURCE	FSTYPE	OPTIONS
 /data	/dev/mapper/vg_data-lv_data	xfs	ro,relatime,seclabel,attr2,inode64
 Evidence:
 include_system=0
 Collector: findmnt
 Recommended next steps:
 - Check dmesg or journal logs for I/O errors and filesystem remount events
 - Check storage path, multipath, SAN, cloud volume, or underlying disk health
 - Check filesystem health with the platform-approved procedure
 - Do not remount read-write before understanding the cause
 - Attach this output to incident ticket
@@ -0,0 +1,22 @@
 CRITICAL: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)
 Load average:
 1m=7.82 5m=6.91 15m=5.40
 CPU count:
 8
 Top CPU processes:
 PID   PPID  USER      %CPU %MEM COMMAND          COMMAND
 2314  1     app       245  12.1 java             java -jar order-api.jar
 991   1     root      38   0.4  backup-agent     backup-agent --scan
 Evidence:
 OK: load is not above online CPU count at collection time
 Recommended next steps:
 - Check process ownership and whether the top process is expected
 - Check recent deployments, cron jobs, batch jobs, or maintenance activity
 - Review logs for the top CPU-consuming process
 - Compare with longer trend data from monitoring before taking action
 - Attach this output to the incident ticket
@@ -0,0 +1,25 @@
 WARNING: Memory usage is 84% and swap usage is 12%
 Memory summary:
              total        used        free      shared  buff/cache   available
 Mem:          15934       13386         512         121        2036        2101
 Swap:          4095         512        3583
 Top memory processes:
 PID     RSS_MB   COMMAND
 1234    2048     java
 987     812      postgres
 OOM events since 24 hours ago:
 2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)
 Evidence:
 Thresholds: warning=80% critical=90% since="24 hours ago"
 OOM evidence source: journalctl
 Recommended next steps:
 - Check application memory trend
 - Review JVM heap settings if process is Java
 - Verify swap pressure and paging activity
 - Confirm whether OOM events align with application impact
 - Attach this output to incident ticket
@@ -0,0 +1,22 @@
 WARNING: Highest inode usage is 87%
 Filesystems above threshold:
 /dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var
 Inode usage table:
 Filesystem                 Inodes   IUsed  IFree IUse% Mounted on
 /dev/mapper/vg_root-lv_root 524288   91300 432988   18% /
 /dev/mapper/vg_var-lv_var  1310720 1140326 170394   87% /var
 Top affected mount points:
 87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394
 18% / /dev/mapper/vg_root-lv_root inodes=524288 used=91300 free=432988
 Evidence:
 Thresholds: warning=80% critical=90%
 Recommended next steps:
 - Find directories with many small files under affected mount points
 - Check logs, cache, spool, session, and temporary directories
 - Avoid deleting blindly; confirm ownership and application impact first
 - Confirm whether inode exhaustion is causing write or deploy failures
 - Attach this output to incident ticket
@@ -0,0 +1,30 @@
 OK: JVM diagnostics collected for PID 1234
 Detected JVM process:
 PID USER RSS_MB CPU COMMAND
 1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar
 Thread count: 188
 Heap and JVM evidence:
 [jcmd VM.flags]
 1234:
 -XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648
 [jcmd GC.heap_info]
 garbage-first heap total 2097152K, used 1521000K
 [jcmd Thread.print summary]
 102 java.lang.Thread.State: WAITING
 53 java.lang.Thread.State: RUNNABLE
 33 java.lang.Thread.State: TIMED_WAITING
 Evidence:
 PID=1234 thread_count=188 top=10
 Recommended next steps:
 - Review GC logs and recent application errors
 - Check JVM heap sizing against container or host memory limits
 - Check thread count trend in monitoring before concluding a leak
 - Capture jstack only if approved by operational process
 - Attach this output to incident ticket
@@ -0,0 +1,23 @@
 WARNING: Time sync status=yes offset_ms=812
 Time status:
 System time: 2026-05-11 10:18:01 UTC +0000
 Timezone: UTC +0000
 Detected tool: chronyc
 NTP synchronized: yes
 Offset ms: 812
 Tool evidence:
 Reference ID    : 203.0.113.10
 System time     : 0.812345 seconds fast of NTP time
 Last offset     : +0.812345 seconds
 Evidence:
 Thresholds: warning=500ms critical=5000ms
 Recommended next steps:
 - Verify chrony or ntpd service status and configuration
 - Check NTP sources and reachability
 - Check virtualization host time if this is a VM
 - Avoid restarting time services blindly in production
 - Attach this output to incident ticket
@@ -0,0 +1,27 @@
 CRITICAL: Service app.service state=failed substate=failed restarts=12
 Service state:
 app.service - Example application
   Loaded: loaded (/etc/systemd/system/app.service; enabled)
   Active: failed (Result: exit-code)
 Systemd properties:
 Id=app.service
 ActiveState=failed
 SubState=failed
 Result=exit-code
 NRestarts=12
 Recent start/stop/failure log lines since 1 hour ago:
 May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
 May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.
 Evidence:
 Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"
 Recommended next steps:
 - Inspect the unit file and drop-in overrides
 - Review application logs around the restart timestamps
 - Check dependencies such as network, storage, database, or secrets
 - Verify recent configuration or package changes
 - Do not restart blindly; attach this output to the incident ticket
@@ -0,0 +1,385 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 incident_type=""
 service_name=""
 host_name=""
 port=""
 target_pid=""
 match_string=""
 output_file=""
 since_value="1 hour ago"
 script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 usage() {
  cat <<'USAGE'
 Usage: incident_triage_report.sh --type TYPE [options]
 Run selected read-only incident checks and produce a Markdown triage report.
 Incident types:
  cpu
  memory
  service
  network
  auth
  cert
  filesystem
  jvm
  all
 Options:
  --type TYPE                 Incident type to collect
  --service SERVICE_NAME      systemd service name for service checks
  --host HOSTNAME_OR_FQDN     host for DNS, network, or certificate checks
  --port PORT                 TCP or TLS port for host checks
  --pid PID                   JVM process ID
  --match PROCESS_MATCH       JVM process match string
  --output FILE               write Markdown report to FILE
  --since VALUE               time window for log-based checks
  --help                      show this help
 Examples:
  ./incident_triage_report.sh --type cpu
  ./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
  ./incident_triage_report.sh --type network --host app.example.com --port 443
  ./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
 USAGE
 }
 is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
 }
 valid_type() {
  case "$1" in
    cpu|memory|service|network|auth|cert|filesystem|jvm|all) return 0 ;;
    *) return 1 ;;
  esac
 }
 while (($# > 0)); do
  case "$1" in
    --type)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --type requires a value\n' >&2; exit 2; }
      incident_type="$2"
      shift 2
      ;;
    --service)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n' >&2; exit 2; }
      service_name="$2"
      shift 2
      ;;
    --host)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n' >&2; exit 2; }
      host_name="$2"
      shift 2
      ;;
    --port)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n' >&2; exit 2; }
      port="$2"
      shift 2
      ;;
    --pid)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n' >&2; exit 2; }
      target_pid="$2"
      shift 2
      ;;
    --match)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n' >&2; exit 2; }
      match_string="$2"
      shift 2
      ;;
    --output)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --output requires a value\n' >&2; exit 2; }
      output_file="$2"
      shift 2
      ;;
    --since)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n' >&2; exit 2; }
      since_value="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1" >&2
      usage >&2
      exit 2
      ;;
  esac
 done
 # Usage errors go to stderr so they never end up in a redirected report.
 if [[ -z "$incident_type" ]]; then
  printf 'CRITICAL: --type is required\n' >&2
  usage >&2
  exit 2
 fi
 if ! valid_type "$incident_type"; then
  printf 'CRITICAL: unsupported incident type: %s\n' "$incident_type" >&2
  usage >&2
  exit 2
 fi
 if [[ -n "$port" ]] && ! is_number "$port"; then
  printf 'CRITICAL: --port must be numeric\n' >&2
  exit 2
 fi
 if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
  printf 'CRITICAL: --pid must be numeric\n' >&2
  exit 2
 fi
 if [[ -n "$target_pid" && -n "$match_string" ]]; then
  printf 'CRITICAL: use either --pid or --match for JVM checks, not both\n' >&2
  exit 2
 fi
 tmp_dir="$(mktemp -d)"
 trap 'rm -rf "$tmp_dir"' EXIT
 report_file="$tmp_dir/report.md"
 check_labels=()
 check_names=()
 check_commands=()
 check_statuses=()
 check_exit_codes=()
 check_summaries=()
 check_outputs=()
 status_from_exit() {
  case "$1" in
    0) printf 'OK' ;;
    1) printf 'WARNING' ;;
    2) printf 'INVALID' ;;
    3) printf 'CRITICAL' ;;
    *) printf 'ERROR' ;;
  esac
 }
 render_command() {
  local item
  for item in "$@"; do
    printf '%q ' "$item"
  done | sed 's/[[:space:]]*$//'
 }
 append_skipped_check() {
  local label="$1"
  local name="$2"
  local reason="$3"
  local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
  printf 'SKIPPED: %s\n' "$reason" > "$output_path"
  check_labels+=("$label")
  check_names+=("$name")
  check_commands+=("not run")
  check_statuses+=("SKIPPED")
  check_exit_codes+=("-")
  check_summaries+=("$reason")
  check_outputs+=("$output_path")
 }
 run_check() {
  local label="$1"
  local script_name="$2"
  shift 2
  local script_path="${script_dir}/${script_name}"
  local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
  local command_text
  local exit_code
  local status
  local summary
  command_text="$(render_command "$script_path" "$@")"
  if [[ ! -e "$script_path" ]]; then
    append_skipped_check "$label" "$script_name" "missing script: $script_name"
    return
  fi
  if [[ ! -x "$script_path" ]]; then
    append_skipped_check "$label" "$script_name" "script is not executable: $script_name"
    return
  fi
  set +e
  "$script_path" "$@" > "$output_path" 2>&1
  exit_code=$?
  set -e
  status="$(status_from_exit "$exit_code")"
  summary="$(sed -n '1p' "$output_path")"
  if [[ -z "$summary" ]]; then
    summary="no output captured"
  fi
  check_labels+=("$label")
  check_names+=("$script_name")
  check_commands+=("$command_text")
  check_statuses+=("$status")
  check_exit_codes+=("$exit_code")
  check_summaries+=("$summary")
  check_outputs+=("$output_path")
 }
 run_cpu_checks() {
  run_check "CPU saturation" "check_high_cpu.sh"
 }
 run_memory_checks() {
  run_check "Memory and OOM" "check_high_memory_oom.sh" --since "$since_value"
 }
 run_service_checks() {
  if [[ -z "$service_name" ]]; then
    append_skipped_check "Service restart loop" "check_service_restart_loop.sh" "requires --service SERVICE_NAME"
    return
  fi
  run_check "Service restart loop" "check_service_restart_loop.sh" --service "$service_name" --since "$since_value"
 }
 run_network_checks() {
  local args=(--host "$host_name")
  if [[ -z "$host_name" ]]; then
    append_skipped_check "DNS and connectivity" "check_dns_connectivity.sh" "requires --host HOSTNAME_OR_FQDN"
    return
  fi
  if [[ -n "$port" ]]; then
    args+=(--port "$port")
  fi
  run_check "DNS and connectivity" "check_dns_connectivity.sh" "${args[@]}"
 }
 run_auth_checks() {
  run_check "Failed SSH logins" "check_failed_ssh_logins.sh" --since "$since_value"
 }
 run_cert_checks() {
  local args=(--host "$host_name")
  if [[ -z "$host_name" ]]; then
    append_skipped_check "Certificate expiry" "check_certificate_expiry.sh" "requires --host HOSTNAME_OR_FQDN"
    return
  fi
  if [[ -n "$port" ]]; then
    args+=(--port "$port")
  fi
  run_check "Certificate expiry" "check_certificate_expiry.sh" "${args[@]}"
 }
 run_filesystem_checks() {
  run_check "Read-only filesystems" "check_filesystem_readonly.sh"
  run_check "Inode usage" "check_inode_usage.sh"
 }
 run_jvm_checks() {
  local args=()
  if [[ -n "$target_pid" ]]; then
    args+=(--pid "$target_pid")
  elif [[ -n "$match_string" ]]; then
    args+=(--match "$match_string")
  fi
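  # The ${args[@]+...} form keeps an empty array safe under `set -o nounset` on Bash older than 4.4.
  run_check "JVM threads and heap" "check_jvm_threads_heap.sh" ${args[@]+"${args[@]}"}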
 }
 case "$incident_type" in
  cpu) run_cpu_checks ;;
  memory) run_memory_checks ;;
  service) run_service_checks ;;
  network) run_network_checks ;;
  auth) run_auth_checks ;;
  cert) run_cert_checks ;;
  filesystem) run_filesystem_checks ;;
  jvm) run_jvm_checks ;;
  all)
    run_cpu_checks
    run_memory_checks
    run_service_checks
    run_network_checks
    run_auth_checks
    run_cert_checks
    run_filesystem_checks
    run_jvm_checks
    ;;
 esac
 generated_at="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
 local_hostname="$(hostname 2>/dev/null || printf 'unknown')"
 current_user="$(id -un 2>/dev/null || printf 'unknown')"
 {
  printf '# L2 Incident Triage Report\n\n'
  printf -- '- Generated: %s\n' "$generated_at"
  printf -- '- Local hostname: %s\n' "$local_hostname"
  printf -- '- Current user: %s\n' "$current_user"
  printf -- '- Incident type: %s\n' "$incident_type"
  printf -- '- Service: %s\n' "${service_name:-not provided}"
  printf -- '- Host: %s\n' "${host_name:-not provided}"
  printf -- '- Port: %s\n' "${port:-not provided}"
  printf -- '- PID: %s\n' "${target_pid:-not provided}"
  printf -- '- Process match: %s\n' "${match_string:-not provided}"
  printf -- '- Since: %s\n\n' "$since_value"
  printf '## Executed Checks\n\n'
  printf '| Check | Script | Status | Exit | Command |\n'
  printf '| --- | --- | --- | --- | --- |\n'
  for index in "${!check_labels[@]}"; do
    printf "| %s | \`%s\` | %s | %s | \`%s\` |\n" \
      "${check_labels[$index]}" \
      "${check_names[$index]}" \
      "${check_statuses[$index]}" \
      "${check_exit_codes[$index]}" \
      "${check_commands[$index]}"
  done
  printf '\n'
  printf '## Summary\n\n'
  for index in "${!check_labels[@]}"; do
    printf -- '- %s: %s\n' "${check_labels[$index]}" "${check_summaries[$index]}"
  done
  printf '\n'
  printf '## Raw Evidence\n\n'
  for index in "${!check_labels[@]}"; do
    printf '### %s\n\n' "${check_labels[$index]}"
    printf "Script: \`%s\`\n\n" "${check_names[$index]}"
    printf "Command: \`%s\`\n\n" "${check_commands[$index]}"
    printf 'Status: %s, exit: %s\n\n' "${check_statuses[$index]}" "${check_exit_codes[$index]}"
    printf '```text\n'
    cat "${check_outputs[$index]}"
    printf '\n```\n\n'
  done
  printf '## L2 Handover Checklist\n\n'
  printf -- '- [ ] Business impact confirmed\n'
  printf -- '- [ ] Affected host/service identified\n'
  printf -- '- [ ] Monitoring alert attached\n'
  printf -- '- [ ] Recent changes checked\n'
  printf -- '- [ ] Logs attached\n'
  printf -- '- [ ] Service owner identified\n'
  printf -- '- [ ] Escalation target identified\n\n'
  printf '## Escalation Notes\n\n'
  printf -- '- Escalate when impact is active, spreading, customer-facing, or outside L2 access.\n'
  printf -- '- Include the alert, timeline, commands run, and the raw evidence above.\n'
  printf -- '- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.\n'
  printf -- '- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.\n\n'
  printf '## Recommended Next Steps\n\n'
  printf -- '- Confirm the symptom against monitoring and user reports.\n'
  printf -- '- Compare this point-in-time evidence with recent deploys, config changes, and host events.\n'
  printf -- '- Attach this report to the incident ticket before handoff.\n'
  printf -- '- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.\n'
 } > "$report_file"
 if [[ -n "$output_file" ]]; then
  cp "$report_file" "$output_file"
  printf 'OK: wrote L2 incident triage report to %s\n' "$output_file"
 else
  cat "$report_file"
 fi
@@ -10,6 +10,11 @@ Current subdirectories are planning areas unless their own README documents a ru
 - `ci-cd`
 - `docker`
 ## Linux operations labs
 - [Linux Fresh Setup Toolkit](./linux/setup/) - Bootstrap automation for fresh Ubuntu lab hosts, including shell profile, Cockpit, Docker, libvirt/KVM, NVIDIA diagnostics, tuning and safe baseline defaults.
 - [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety and systemd timers.
 Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready.
 Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/).
@@ -0,0 +1,308 @@
 # AI Lab Maintenance Toolkit
 ## Executive summary
 The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
 Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
 reporting, disk monitoring, conservative package cleanup, Docker hygiene,
 configuration backup, and non-destructive VM inventory into a small toolkit
 that is readable enough for review and guarded enough for homelab use.
 This is a portfolio and lab implementation, not evidence of production
 certification. Review package policy, backup coverage, maintenance windows, and
 application impact before deploying it to another host.
 ## Problem solved
 AI lab hosts accumulate operating system packages, kernel packages, container
 images, build cache, journals, and configuration changes while also carrying
 stateful workloads. Manual maintenance is easy to defer and risky to perform
 without evidence. This project provides scheduled, logged tasks with explicit
 safety boundaries and separate read-only audit commands.
 ## What this demonstrates
 - Bash strict mode, input validation, dependency checks, and operational exit
  codes.
 - Dry-run-first maintenance with explicit authorization for changes.
 - systemd oneshot services and persistent calendar timers.
 - APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
 - Docker cleanup that preserves volumes.
 - Configuration-focused backups with bounded retention.
 - Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
 - Idempotent installation and guarded JSON configuration updates.
 ## Architecture and directory layout
 ```text
 ailab-maintenance/
 ├── README.md
 ├── install.sh
 ├── scripts/
 │   ├── ailab-healthcheck.sh
 │   ├── ailab-disk-watch.sh
 │   ├── ailab-apt-cleanup.sh
 │   ├── ailab-kernel-cleanup.sh
 │   ├── ailab-docker-cleanup.sh
 │   ├── ailab-config-backup.sh
 │   └── ailab-vm-audit.sh
 └── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
 ```
 The installer deploys scripts to `/usr/local/sbin` and units to
 `/etc/systemd/system`. Scripts run directly as root from systemd rather than
 through an additional framework.
 ## Maintenance tasks
 | Command | Purpose | Change behavior |
 | --- | --- | --- |
 | `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
 | `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
 | `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
 | `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
 | `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
 | `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
 | `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |
 ## Safety model
 Change-capable scripts default to dry-run behavior. Manual execution requires
 `--execute` and an interactive `EXECUTE` confirmation. The systemd services
 use `--execute --non-interactive`; installing and enabling those reviewed unit
 files is the explicit authorization for scheduled maintenance.
 Exit codes follow the repository convention:
 - `0`: completed successfully or an optional component was absent.
 - `1`: an operational check or maintenance action failed.
 - `2`: invalid input, missing required dependency, or insufficient privilege.
 The scripts do not bypass APT or Docker locks, delete VM resources, manually
 select kernel names for removal, or hide command failures.
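
As a minimal sketch of the gate described above, the hypothetical helper below
condenses the flag handling into one function. The name `resolve_mode` and the
mode labels are illustrative assumptions, not part of the toolkit; the real
scripts perform the same checks inline and still prompt for `EXECUTE` in the
interactive case.

```bash
#!/usr/bin/env bash
# Hypothetical helper: decide the run mode from command-line flags.
# Prints the mode on stdout and returns 2 for invalid combinations,
# in line with the exit-code convention for invalid input.
resolve_mode() {
  local execute=false non_interactive=false arg
  for arg in "$@"; do
    case "$arg" in
      --execute) execute=true ;;
      --non-interactive) non_interactive=true ;;
      *) printf 'CRITICAL: unknown argument: %s\n' "$arg" >&2; return 2 ;;
    esac
  done
  if [[ "$non_interactive" == true && "$execute" != true ]]; then
    printf 'CRITICAL: --non-interactive requires --execute\n' >&2
    return 2
  fi
  if [[ "$execute" != true ]]; then
    printf 'dry-run\n'
  elif [[ "$non_interactive" == true ]]; then
    printf 'scheduled-execute\n'
  else
    # The caller must still ask the operator to type EXECUTE.
    printf 'interactive-execute\n'
  fi
}
```

Keeping the decision in one place makes the default visible: with no flags the
only reachable mode is the dry run.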
 ## Installation
 Review every script and unit first. Installation changes package state,
 journald settings, Docker daemon settings when Docker exists, and enabled timer
 state.
 ```bash
 cd labs/linux/ailab-maintenance
 sudo ./install.sh
 ```
 The installer:
 1. Installs supporting Ubuntu packages such as `logrotate`, `needrestart`, `smartmontools`, `nvme-cli`, `sysstat`, and `jq`.
 2. Deploys scripts and systemd units with fixed permissions.
 3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
 4. Restarts `systemd-journald`.
 5. Validates and backs up an existing Docker `daemon.json`, merges log limits
   with `jq`, and attempts a Docker restart.
 6. Enables all five timers.
 7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.
 The installer is intended for Ubuntu 26.04. It is not run automatically by
 repository validation.
 ## Manual commands
 Read-only reports:
 ```bash
 sudo /usr/local/sbin/ailab-healthcheck.sh
 sudo /usr/local/sbin/ailab-disk-watch.sh
 sudo /usr/local/sbin/ailab-vm-audit.sh
 ```
 Preview maintenance:
 ```bash
 sudo /usr/local/sbin/ailab-apt-cleanup.sh
 sudo /usr/local/sbin/ailab-kernel-cleanup.sh
 sudo /usr/local/sbin/ailab-docker-cleanup.sh
 sudo /usr/local/sbin/ailab-config-backup.sh
 ```
 Apply reviewed maintenance interactively:
 ```bash
 sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
 sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
 sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
 sudo /usr/local/sbin/ailab-config-backup.sh --execute
 ```
 `--non-interactive` is reserved for reviewed automation and is rejected unless
 `--execute` is also present.
 ## Systemd timers
 | Timer | Schedule |
 | --- | --- |
 | `ailab-config-backup.timer` | Daily at 03:30 |
 | `ailab-disk-watch.timer` | Hourly |
 | `ailab-apt-cleanup.timer` | Sunday at 04:00 |
 | `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
 | `ailab-docker-cleanup.timer` | Sunday at 04:40 |
 All timers use `Persistent=true`, so a run missed while the host was powered
 off or the timer was inactive is triggered once when the timer next starts.
 Inspect timer and service evidence with:
 ```bash
 systemctl list-timers --all | grep ailab-
 systemctl status ailab-config-backup.timer
 journalctl -u ailab-kernel-cleanup.service
 ```
 ## Logs
 Scheduled and manual maintenance runs append to:
 ```text
 /var/log/ailab-apt-cleanup.log
 /var/log/ailab-kernel-cleanup.log
 /var/log/ailab-docker-cleanup.log
 /var/log/ailab-config-backup.log
 /var/log/ailab-disk-watch.log
 ```
 systemd also records service output in the journal. Logrotate is installed as a
 dependency, but this lab does not create a custom rotation policy for these
 small maintenance logs.
 ## Docker policy
 Docker cleanup runs `docker system prune -af` and removes build cache older
 than seven days. It never passes `--volumes`. Named and anonymous volumes
 remain outside this automated policy and require application-aware review.
 The installer configures the `json-file` driver with a maximum size of `50m`
 and five files. Existing valid JSON is backed up and merged. Invalid JSON
 causes installation to stop rather than overwrite operator configuration.
 ## Kernel policy
 Kernel removal is delegated to `apt autoremove --purge`; package names are not
 constructed or purged with regular expressions. Before execution, the script
 logs the APT simulation and refuses cleanup unless at least two installed
 versioned kernel image packages remain after simulated removals.
 This protects a fallback kernel while preserving Ubuntu dependency policy.
 Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
 Secure Boot state, and the simulated removal set before manual execution.
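
The fallback-kernel guard can be sketched as follows. This is an illustration
under stated assumptions, not the toolkit's implementation: the function names
are hypothetical, and the sketch assumes that the APT simulation names each
removal on a line starting with `Remv` or `Purg` followed by the package name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count versioned kernel images that would remain
# after the removals listed in an APT simulation.
#   $1: newline-separated installed package names
#   $2: text of the simulation (Remv/Purg lines name the removals)
remaining_kernel_images() {
  local installed="$1" simulation="$2"
  local pkg removed remaining=0
  removed="$(awk '$1 == "Remv" || $1 == "Purg" { print $2 }' <<<"$simulation")"
  while IFS= read -r pkg; do
    # Only versioned images such as linux-image-6.8.0-45-generic count;
    # metapackages such as linux-image-generic do not.
    [[ "$pkg" =~ ^linux-image-[0-9] ]] || continue
    if ! grep -Fxq -- "$pkg" <<<"$removed"; then
      remaining=$((remaining + 1))
    fi
  done <<<"$installed"
  printf '%d\n' "$remaining"
}

# Allow cleanup only when at least two kernel images would remain.
kernel_cleanup_allowed() {
  (( $(remaining_kernel_images "$1" "$2") >= 2 ))
}
```

Because both inputs are plain text, the guard can be exercised with canned
simulation output before it is trusted on a live host.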
 ## Backup policy
 Backups are written to `/srv/backups/ailab-config` as
 `ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
 deleted only after a new archive is created.
 The backup covers `/etc`, selected root shell configuration,
 `/opt/ailab-maintenance` when present, and libvirt configuration under
 `/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data,
 Ollama models, VM disk images, or other large application datasets. Because
 `/etc` is included, configuration subdirectories beneath it (such as the
 libvirt, Docker, and systemd configuration) are already covered, even where
 the script's log output mentions them separately.
 This is a local configuration backup, not a disaster-recovery design. A real
 deployment should copy archives to independently protected storage and test
 restoration.
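
A minimal sketch of "prune only after a good archive exists" is shown below.
The function name `verify_then_prune` is hypothetical, and the verification
step is an addition for illustration: the backup script in this change creates
the archive and then runs `find` directly. The sketch assumes GNU `tar` and
`find`, as shipped on Ubuntu.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify a new archive, then apply retention.
#   $1: backup directory, $2: new archive path, $3: retention in days
verify_then_prune() {
  local backup_dir="$1" archive="$2" retention_days="$3"
  # Listing the archive forces tar and gzip to read it end to end.
  if ! tar --list --gzip --file "$archive" >/dev/null 2>&1; then
    printf 'CRITICAL: archive failed verification: %s\n' "$archive" >&2
    return 1
  fi
  # Same name pattern and age test that the backup policy describes.
  find "$backup_dir" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' \
    -mtime "+$retention_days" -print -delete
}
```

Listing an archive shows only that it is readable; it does not replace a
restore test on independent storage.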
 ## Journald policy
 The installer applies:
 ```ini
 [Journal]
 SystemMaxUse=1G
 SystemKeepFree=2G
 MaxRetentionSec=14day
 Compress=yes
 ```
 These settings bound journal growth while retaining useful troubleshooting
 evidence. Capacity and retention should be adjusted to the host's disk size
 and incident-response requirements.
 ## Disk watch policy
 The disk check uses `df -P`, defaults to an 85 percent threshold, and returns
 `1` when any checked filesystem meets or exceeds the threshold. Override the
 threshold for a manual or unit invocation with:
 ```bash
 sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
 ```
 The script reports every filesystem as `OK` or `WARNING`; it does not delete
 data or attempt remediation.
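
The threshold logic can be tried without touching real filesystems by feeding
canned `df -P` rows to a function. The sketch below is illustrative: the name
`classify_df_rows` and the message wording are assumptions, and the installed
script additionally validates the threshold and appends to its log file.

```bash
#!/usr/bin/env bash
# Hypothetical helper: classify `df -P` rows read from stdin.
#   $1: threshold percent
# Returns 1 if any row meets or exceeds the threshold or cannot be parsed.
classify_df_rows() {
  local threshold="$1" status=0
  local filesystem _blocks _used _available use_percent mountpoint usage
  while read -r filesystem _blocks _used _available use_percent mountpoint; do
    # Skip the header row that df prints first.
    if [[ "$filesystem" == "Filesystem" ]]; then
      continue
    fi
    usage="${use_percent%\%}"
    if [[ ! "$usage" =~ ^[0-9]+$ ]] || ((usage >= threshold)); then
      printf 'WARNING: %s on %s is %s\n' "$filesystem" "$mountpoint" "$use_percent"
      status=1
    else
      printf 'OK: %s on %s is %s\n' "$filesystem" "$mountpoint" "$use_percent"
    fi
  done
  return "$status"
}

# Usage on a live host: df -P | classify_df_rows 85
```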
 ## Example operational workflows
 ### Weekly maintenance review
 ```bash
 sudo /usr/local/sbin/ailab-healthcheck.sh
 sudo /usr/local/sbin/ailab-kernel-cleanup.sh
 sudo /usr/local/sbin/ailab-docker-cleanup.sh
 systemctl list-timers --all | grep ailab-
 ```
 Review the kernel simulation, Docker usage, failed units, backup freshness, and
 disk warnings before approving manual changes.
 ### Disk pressure investigation
 ```bash
 sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
 sudo docker system df
 sudo journalctl --disk-usage
 sudo /usr/local/sbin/ailab-vm-audit.sh
 ```
 Use the evidence to identify ownership. Do not treat Docker pruning or file
 deletion as a substitute for application-specific retention policy.
 ### Post-maintenance evidence
 ```bash
 sudo /usr/local/sbin/ailab-healthcheck.sh \
  | sudo tee /root/ailab-healthcheck-after-maintenance.txt
 journalctl --since today -u 'ailab-*.service'
 ```
 ## Interview talking points
 - Why timer units explicitly carry the non-interactive execution boundary.
 - Why APT dependency policy is safer than regex-based kernel deletion.
 - How Docker volume preservation separates platform hygiene from application
  data lifecycle decisions.
 - How optional dependency handling keeps one health command useful across
  container, GPU, and virtualization host variants.
 - Why configuration backup and application-data backup are separate concerns.
 - How exit codes, persistent timers, logs, and post-checks support operations.
 ## Future improvements
 - Add a dedicated logrotate policy after measuring log growth.
 - Export disk-watch status to a monitoring system instead of relying only on
  timer failure state.
 - Add automated archive integrity checks and off-host replication.
 - Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
  commands.
 - Add package-lock detection with bounded retry policy if recurring contention
  is observed.
 - Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
  dedicated read-only audit.
@@ -0,0 +1,103 @@
 #!/usr/bin/env bash
 set -euo pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf"
 DOCKER_CONFIG="/etc/docker/daemon.json"
 packages=(
  logrotate
  needrestart
  smartmontools
  nvme-cli
  sysstat
  iotop
  ncdu
  duf
  jq
  lsof
  psmisc
  tar
  gzip
 )
 timers=(
  ailab-apt-cleanup.timer
  ailab-kernel-cleanup.timer
  ailab-docker-cleanup.timer
  ailab-config-backup.timer
  ailab-disk-watch.timer
 )
 if ((EUID != 0)); then
  printf 'CRITICAL: install.sh must run as root\n' >&2
  exit 2
 fi
 for command_name in apt-get install systemctl; do
  if ! command -v "$command_name" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
    exit 2
  fi
 done
 printf 'Installing maintenance dependencies...\n'
 apt-get update
 DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
 printf 'Installing scripts and systemd units...\n'
 for script in "$SCRIPT_DIR"/scripts/*.sh; do
  install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")"
 done
 for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do
  install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")"
 done
 install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")"
 tmp_journald="$(mktemp)"
 trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT
 cat >"$tmp_journald" <<'EOF'
 [Journal]
 SystemMaxUse=1G
 SystemKeepFree=2G
 MaxRetentionSec=14day
 Compress=yes
 EOF
 install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN"
 systemctl restart systemd-journald
 if command -v docker >/dev/null 2>&1; then
  printf 'Configuring Docker log rotation limits...\n'
  install -d -m 0755 /etc/docker
  tmp_docker="$(mktemp)"
  if [[ -f "$DOCKER_CONFIG" ]]; then
    if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
      printf 'CRITICAL: %s is not valid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
      exit 1
    fi
    backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak"
    install -m 0644 "$DOCKER_CONFIG" "$backup"
    jq '. + {
      "log-driver": "json-file",
      "log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
    }' "$DOCKER_CONFIG" >"$tmp_docker"
  else
    jq -n '{
      "log-driver": "json-file",
      "log-opts": {"max-size": "50m", "max-file": "5"}
    }' >"$tmp_docker"
  fi
  jq empty "$tmp_docker"
  install -m 0644 "$tmp_docker" "$DOCKER_CONFIG"
  systemctl restart docker || true
 else
  printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n'
 fi
 systemctl daemon-reload
 systemctl enable --now "${timers[@]}"
 printf '\nEnabled AI Lab timers:\n'
 systemctl list-timers --all --no-pager | grep 'ailab-' || true
 /usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt
 printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n'
@@ -0,0 +1,66 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 LOG_FILE="/var/log/ailab-apt-cleanup.log"
 execute=false
 non_interactive=false
 usage() {
  printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
 }
 while (($# > 0)); do
  case "$1" in
    --execute) execute=true ;;
    --non-interactive) non_interactive=true ;;
    -h|--help) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
  esac
  shift
 done
 if [[ "$non_interactive" == true && "$execute" != true ]]; then
  printf 'CRITICAL: --non-interactive requires --execute\n' >&2
  exit 2
 fi
 if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root\n' >&2
  exit 2
 fi
 if ! command -v apt >/dev/null 2>&1; then
  printf 'CRITICAL: apt is required\n' >&2
  exit 2
 fi
 exec > >(tee -a "$LOG_FILE") 2>&1
 printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)"
 if [[ "$execute" != true ]]; then
  printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n'
  printf 'INFO: simulated autoremove follows\n'
  LC_ALL=C apt -s autoremove --purge
  printf 'INFO: rerun with --execute and confirm to apply changes\n'
  exit 0
 fi
 if [[ "$non_interactive" != true ]]; then
  printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n'
  printf 'Type EXECUTE to continue: '
  read -r confirmation
  if [[ "$confirmation" != "EXECUTE" ]]; then
    printf 'CRITICAL: confirmation failed; no changes made\n'
    exit 2
  fi
 fi
 apt update
 apt autoremove --purge -y
 apt autoclean -y
 if command -v needrestart >/dev/null 2>&1; then
  needrestart -b || true
 else
  printf 'WARNING: needrestart is not installed\n'
 fi
 printf 'OK: APT cleanup completed\n'
@@ -0,0 +1,90 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 LOG_FILE="/var/log/ailab-config-backup.log"
 BACKUP_DIR="/srv/backups/ailab-config"
 RETENTION_DAYS=30
 execute=false
 non_interactive=false
 usage() {
  printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
 }
 while (($# > 0)); do
  case "$1" in
    --execute) execute=true ;;
    --non-interactive) non_interactive=true ;;
    -h|--help) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
  esac
  shift
 done
 if [[ "$non_interactive" == true && "$execute" != true ]]; then
  printf 'CRITICAL: --non-interactive requires --execute\n' >&2
  exit 2
 fi
 if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root\n' >&2
  exit 2
 fi
 for command_name in tar gzip find; do
  if ! command -v "$command_name" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
    exit 2
  fi
 done
 exec > >(tee -a "$LOG_FILE") 2>&1
 timestamp="$(date '+%Y%m%d-%H%M%S')"
 archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz"
 candidate_paths=(
  /etc
  /root/.bashrc
  /root/.bashrc.d
  /opt/ailab-maintenance
  /var/lib/libvirt/qemu
 )
 source_paths=()
 printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)"
 for path in "${candidate_paths[@]}"; do
  if [[ -e "$path" ]]; then
    source_paths+=("${path#/}")
    printf 'OK: include %s\n' "$path"
  else
    printf 'INFO: optional path is absent: %s\n' "$path"
  fi
 done
 if ((${#source_paths[@]} == 0)); then
  printf 'CRITICAL: no backup source paths are present\n'
  exit 1
 fi
 printf 'Backup destination: %s\n' "$archive"
 printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS"
 printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n'
 printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n'
 if [[ "$execute" != true ]]; then
  printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n'
  exit 0
 fi
 if [[ "$non_interactive" != true ]]; then
  printf 'Type EXECUTE to create the archive and apply retention: '
  read -r confirmation
  if [[ "$confirmation" != "EXECUTE" ]]; then
    printf 'CRITICAL: confirmation failed; no changes made\n'
    exit 2
  fi
 fi
 install -d -m 0750 "$BACKUP_DIR"
 tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- "${source_paths[@]}"
 find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete
 printf 'OK: configuration backup created: %s\n' "$archive"
@@ -0,0 +1,38 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 LOG_FILE="/var/log/ailab-disk-watch.log"
 threshold="${AILAB_DISK_THRESHOLD:-85}"
 if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2
  exit 2
 fi
 if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((threshold < 1 || threshold > 100)); then
  printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2
  exit 2
 fi
 exec > >(tee -a "$LOG_FILE") 2>&1
 printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold"
 status=0
 while read -r filesystem _blocks _used available use_percent mountpoint; do
  usage="${use_percent%\%}"
  if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
    printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint"
    status=1
  elif ((usage >= threshold)); then
    printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \
      "$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available"
    status=1
  else
    printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
  fi
 done < <(df -Pk -x tmpfs -x devtmpfs | awk 'NR > 1 {print $1, $2, $3, $4, $5, $6}')
 exit "$status"
@@ -0,0 +1,70 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 LOG_FILE="/var/log/ailab-docker-cleanup.log"
 execute=false
 non_interactive=false
 usage() {
  printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
 }
 while (($# > 0)); do
  case "$1" in
    --execute) execute=true ;;
    --non-interactive) non_interactive=true ;;
    -h|--help) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
  esac
  shift
 done
 if [[ "$non_interactive" == true && "$execute" != true ]]; then
  printf 'CRITICAL: --non-interactive requires --execute\n' >&2
  exit 2
 fi
 if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root\n' >&2
  exit 2
 fi
 exec > >(tee -a "$LOG_FILE") 2>&1
 printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)"
 if ! command -v docker >/dev/null 2>&1; then
  printf 'INFO: Docker is not installed; nothing to do\n'
  exit 0
 fi
 if command -v systemctl >/dev/null 2>&1 && ! systemctl is-active --quiet docker; then
  printf 'INFO: docker.service is inactive; nothing to do\n'
  exit 0
 fi
 printf '\nDocker disk usage before cleanup:\n'
 docker system df
 if [[ "$execute" != true ]]; then
  printf 'INFO: dry-run mode; would run docker system prune -af\n'
  printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n'
  printf 'INFO: Docker volumes are never included in this cleanup\n'
  exit 0
 fi
 if [[ "$non_interactive" != true ]]; then
  printf 'WARNING: this removes unused containers, networks, images, and old build cache, but not volumes.\n'
  printf 'Type EXECUTE to continue: '
  read -r confirmation
  if [[ "$confirmation" != "EXECUTE" ]]; then
    printf 'CRITICAL: confirmation failed; no changes made\n'
    exit 2
  fi
 fi
 docker system prune -af
 docker builder prune -af --filter "until=168h"
 printf '\nDocker disk usage after cleanup:\n'
 docker system df
 printf 'OK: Docker cleanup completed; volumes were not pruned\n'
@@ -0,0 +1,111 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 section() {
  printf '\n== %s ==\n' "$1"
 }
 run_optional() {
  local description="$1"
  shift
  if "$@"; then
    return 0
  fi
  printf 'WARNING: %s failed\n' "$description"
  return 0
 }
 section "Host identity"
 if command -v hostnamectl >/dev/null 2>&1; then
  run_optional "hostnamectl" hostnamectl
 else
  run_optional "hostname" hostname
 fi
 run_optional "kernel information" uname -a
 run_optional "uptime" uptime
 section "Memory"
 if command -v free >/dev/null 2>&1; then
  run_optional "memory report" free -h
 else
  printf 'WARNING: free is not available\n'
 fi
 section "Filesystems"
 if command -v df >/dev/null 2>&1; then
  run_optional "filesystem report" df -hT
  printf '\nKey mountpoints present:\n'
  for mountpoint in / /boot /var /srv /opt /home; do
    if findmnt -rn --mountpoint "$mountpoint" >/dev/null 2>&1; then
      run_optional "filesystem report for $mountpoint" df -hT "$mountpoint"
    fi
  done
 else
  printf 'WARNING: df is not available\n'
 fi
 section "Journal usage"
 if command -v journalctl >/dev/null 2>&1; then
  run_optional "journal disk usage" journalctl --disk-usage
 else
  printf 'WARNING: journalctl is not available\n'
 fi
 section "Docker"
 if command -v docker >/dev/null 2>&1; then
  if command -v systemctl >/dev/null 2>&1; then
    run_optional "Docker service state" systemctl is-active docker
  fi
  run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
  run_optional "Docker disk usage" docker system df
 else
  printf 'INFO: Docker is not installed\n'
 fi
 section "Libvirt"
 if command -v virsh >/dev/null 2>&1; then
  if command -v systemctl >/dev/null 2>&1; then
    run_optional "libvirtd service state" systemctl is-active libvirtd
  fi
  run_optional "libvirt guest list" virsh list --all
 else
  printf 'INFO: virsh is not installed\n'
 fi
 section "NVIDIA"
 if command -v nvidia-smi >/dev/null 2>&1; then
  run_optional "NVIDIA status" nvidia-smi
 else
  printf 'INFO: nvidia-smi is not installed\n'
 fi
 section "Failed systemd units"
 if command -v systemctl >/dev/null 2>&1; then
  run_optional "failed systemd unit report" systemctl --failed --no-pager
 else
  printf 'WARNING: systemctl is not available\n'
 fi
 section "SMART quick health"
 if command -v smartctl >/dev/null 2>&1; then
  shopt -s nullglob
  devices=(/dev/sd? /dev/nvme?n?)
  shopt -u nullglob
  if ((${#devices[@]} == 0)); then
    printf 'INFO: no matching SATA/SCSI or NVMe devices found\n'
  else
    for device in "${devices[@]}"; do
      printf '\n-- %s --\n' "$device"
      run_optional "SMART health check for $device" smartctl -H "$device"
    done
  fi
 else
  printf 'INFO: smartctl is not installed\n'
 fi
 exit 0
@@ -0,0 +1,117 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 # APT autoremove respects package dependencies and kernel protection rules. That
 # is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO.
 LOG_FILE="/var/log/ailab-kernel-cleanup.log"
 execute=false
 non_interactive=false
 usage() {
  printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
 }
 kernel_packages() {
  dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \
    'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \
    | awk '$1 ~ /^ii/ {print $2}' \
    | sort -u || true
 }
 versioned_kernel_images() {
  dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \
    | awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \
    | sort -u || true
 }
 while (($# > 0)); do
  case "$1" in
    --execute) execute=true ;;
    --non-interactive) non_interactive=true ;;
    -h|--help) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
  esac
  shift
 done
 if [[ "$non_interactive" == true && "$execute" != true ]]; then
  printf 'CRITICAL: --non-interactive requires --execute\n' >&2
  exit 2
 fi
 if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root\n' >&2
  exit 2
 fi
 for command_name in apt dpkg-query uname; do
  if ! command -v "$command_name" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
    exit 2
  fi
 done
 exec > >(tee -a "$LOG_FILE") 2>&1
 printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)"
 printf 'Running kernel: %s\n' "$(uname -r)"
 printf '\nInstalled kernel-related packages before cleanup:\n'
 kernel_packages
 simulation="$(LC_ALL=C apt-get -s autoremove --purge)"
 printf '\nAPT autoremove simulation:\n%s\n' "$simulation"
 mapfile -t installed_images < <(versioned_kernel_images)
 mapfile -t removed_images < <(
  # APT simulation prints "Purg" instead of "Remv" when --purge is in effect.
  awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u
 )
 remaining_images=0
 for image in "${installed_images[@]}"; do
  remove_image=false
  for removed in "${removed_images[@]}"; do
    if [[ "$image" == "$removed" ]]; then
      remove_image=true
      break
    fi
  done
  if [[ "$remove_image" != true ]]; then
    remaining_images=$((remaining_images + 1))
  fi
 done
 printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \
  "${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images"
 if ((${#installed_images[@]} < 2 || remaining_images < 2)); then
  printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n'
  exit 1
 fi
 if [[ "$execute" != true ]]; then
  printf 'INFO: dry-run mode; no packages were removed\n'
  printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n'
  exit 0
 fi
 if [[ "$non_interactive" != true ]]; then
  printf 'WARNING: APT will remove the packages shown in the simulation above.\n'
  printf 'Type EXECUTE to continue: '
  read -r confirmation
  if [[ "$confirmation" != "EXECUTE" ]]; then
    printf 'CRITICAL: confirmation failed; no changes made\n'
    exit 2
  fi
 fi
 apt-get autoremove --purge -y
 apt-get autoclean -y
 if command -v update-grub >/dev/null 2>&1; then
  update-grub || true
 else
  printf 'WARNING: update-grub is not installed\n'
 fi
 printf '\nInstalled kernel-related packages after cleanup:\n'
 kernel_packages
 printf 'OK: kernel cleanup completed with APT-managed package selection\n'
@@ -0,0 +1,42 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 section() {
  printf '\n== %s ==\n' "$1"
 }
 if ! command -v virsh >/dev/null 2>&1; then
  printf 'INFO: virsh is not installed; VM audit skipped\n'
  exit 0
 fi
 section "Virtual machines"
 virsh list --all || printf 'WARNING: unable to list virtual machines\n'
 section "Storage pools"
 virsh pool-list --all || printf 'WARNING: unable to list storage pools\n'
 mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true)
 for pool in "${pools[@]}"; do
  section "Volumes in pool: $pool"
  virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool"
 done
 section "Possible VM disk and installation images"
 search_roots=()
 for path in /var/lib/libvirt /srv /opt; do
  [[ -d "$path" ]] && search_roots+=("$path")
 done
 if ((${#search_roots[@]} == 0)); then
  printf 'INFO: no configured search roots are present\n'
 else
  find "${search_roots[@]}" -xdev -type f \
    \( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \
    -printf '%12s bytes  %p\n' 2>/dev/null \
    | sort -nr || true
 fi
 printf '\nINFO: audit complete; no files or libvirt resources were modified\n'
@@ -0,0 +1,8 @@
 [Unit]
 Description=AI Lab safe APT cleanup
 After=network-online.target
 Wants=network-online.target
 [Service]
 Type=oneshot
 ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
 [Unit]
 Description=Run AI Lab APT cleanup weekly
 [Timer]
 OnCalendar=Sun *-*-* 04:00:00
 Persistent=true
 [Install]
 WantedBy=timers.target
@@ -0,0 +1,6 @@
 [Unit]
 Description=AI Lab configuration backup
 [Service]
 Type=oneshot
 ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
 [Unit]
 Description=Run AI Lab configuration backup daily
 [Timer]
 OnCalendar=*-*-* 03:30:00
 Persistent=true
 [Install]
 WantedBy=timers.target
@@ -0,0 +1,6 @@
 [Unit]
 Description=AI Lab disk usage check
 [Service]
 Type=oneshot
 ExecStart=/usr/local/sbin/ailab-disk-watch.sh
@@ -0,0 +1,9 @@
 [Unit]
 Description=Run AI Lab disk usage check hourly
 [Timer]
 OnCalendar=hourly
 Persistent=true
 [Install]
 WantedBy=timers.target
@@ -0,0 +1,8 @@
 [Unit]
 Description=AI Lab safe Docker cleanup
 Requires=docker.service
 After=docker.service
 [Service]
 Type=oneshot
 ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
 [Unit]
 Description=Run AI Lab Docker cleanup weekly
 [Timer]
 OnCalendar=Sun *-*-* 04:40:00
 Persistent=true
 [Install]
 WantedBy=timers.target
@@ -0,0 +1,8 @@
 [Unit]
 Description=AI Lab safe kernel cleanup
 After=network-online.target ailab-apt-cleanup.service
 Wants=network-online.target
 [Service]
 Type=oneshot
 ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
 [Unit]
 Description=Run AI Lab kernel cleanup weekly
 [Timer]
 OnCalendar=Sun *-*-* 04:20:00
 Persistent=true
 [Install]
 WantedBy=timers.target
@@ -0,0 +1,276 @@
 # Linux Fresh Setup Toolkit
 ## Executive summary
 The Linux Fresh Setup Toolkit is day-0 bootstrap automation for a clean Ubuntu
 lab server or workstation. It prepares a host for routine administration,
 Cockpit, Docker workloads, libvirt/KVM virtual machines, optional NVIDIA
 diagnostics, bounded logging, practical kernel tuning, and a conservative
 security baseline.
 The scripts are modular and safe to rerun. Optional components remain optional,
 UFW is not enabled without a specific flag, and an NVIDIA driver is never
 installed without an explicit version. This is a portfolio and homelab
 implementation, not a production-certified build standard.
 ## Scope and non-goals
 The toolkit supports Ubuntu 24.04 and newer and assumes a systemd-based host
 with APT package management. It is suitable for a host such as `ailab` that may
 run WebODM, Open WebUI, Homepage, NVIDIA workloads, or test virtual machines.
 It does not:
 - Deploy applications, containers, or virtual machines.
 - Configure GPU passthrough, VFIO bindings, bridges, or Windows guests.
 - Select an NVIDIA driver automatically.
 - Define a complete firewall policy or compliance baseline.
 - Replace backup, monitoring, patching, or ongoing maintenance processes.
 - Claim live validation against every future Ubuntu release.
 ## Why this is separate from ailab-maintenance
 This project establishes a fresh host. The sibling
 [AI Lab Maintenance Toolkit](../ailab-maintenance/) handles day-2 health
 checks, scheduled cleanup, configuration backup, disk monitoring, and VM
 inventory after a host is operating.
 Keeping bootstrap and maintenance separate makes the change boundary clear:
 this toolkit installs platform capabilities and baseline configuration, while
 the maintenance toolkit manages recurring operational tasks.
 ## Directory layout
 ```text
 setup/
 ├── README.md
 ├── install.sh
 ├── scripts/
 │   ├── 00-preflight.sh
 │   ├── 00-platform-guard.inc
 │   ├── 01-base-packages.sh
 │   ├── 02-shell-profile.sh
 │   ├── 03-cockpit.sh
 │   ├── 04-docker.sh
 │   ├── 05-libvirt.sh
 │   ├── 06-nvidia-tools.sh
 │   ├── 07-tuning.sh
 │   ├── 08-security-baseline.sh
 │   └── 99-postcheck.sh
 ├── files/
 │   ├── bashrc.d/ailab.sh
 │   ├── docker/daemon.json
 │   ├── sysctl/99-ailab.conf
 │   └── systemd/journald-ailab-limits.conf
 └── docs/
    ├── fresh-install-checklist.md
    ├── cockpit.md
    ├── docker.md
    ├── libvirt.md
    ├── nvidia.md
    └── bash-shell.md
 ```
 `00-platform-guard.inc` is an internal sourced helper used by mutating
 component scripts; it is not an executable profile.
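A guard of this kind could look roughly like the sketch below. It is not the content of the actual helper: the function names `version_at_least` and `require_supported_ubuntu`, the messages, and the optional os-release path argument are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a sourced platform guard; not the toolkit's actual helper.

# Succeeds when version $1 is greater than or equal to version $2.
version_at_least() {
  local lowest
  lowest="$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)"
  [[ "$lowest" == "$2" ]]
}

# Refuses anything other than Ubuntu 24.04 or newer.
# The os-release path is a parameter only so the check can be tested.
require_supported_ubuntu() {
  local os_release="${1:-/etc/os-release}"
  local id version_id
  if [[ ! -r "$os_release" ]]; then
    printf 'CRITICAL: cannot read %s\n' "$os_release" >&2
    return 2
  fi
  # Extract only the two keys needed instead of sourcing the file.
  id="$(sed -n 's/^ID=//p' "$os_release" | tr -d '"')"
  version_id="$(sed -n 's/^VERSION_ID=//p' "$os_release" | tr -d '"')"
  if [[ "$id" != "ubuntu" ]]; then
    printf 'CRITICAL: unsupported distribution: %s\n' "${id:-unknown}" >&2
    return 2
  fi
  if ! version_at_least "$version_id" "24.04"; then
    printf 'CRITICAL: Ubuntu %s is older than 24.04\n' "${version_id:-unknown}" >&2
    return 2
  fi
  printf 'OK: Ubuntu %s is supported\n' "$version_id"
}
```

A mutating script would source the file and call `require_supported_ubuntu || exit 2` before making any change.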
 ## Supported profiles and flags
 | Flag | Result |
 | --- | --- |
 | `--base` | Install operational CLI, diagnostic, storage, and network packages |
 | `--shell` | Install the root AI lab Bash profile |
 | `--cockpit` | Install and enable Cockpit |
 | `--docker` | Install Docker and bounded JSON-file logging |
 | `--libvirt` | Install and enable libvirt/KVM |
 | `--nvidia-tools` | Install NVIDIA and OpenCL diagnostics without a driver |
 | `--install-nvidia-driver VERSION` | Install diagnostics and the named Ubuntu driver package |
 | `--tuning` | Apply journald, sysctl, sensor, and sysstat settings |
 | `--security` | Install and enable fail2ban; install but do not enable UFW |
 | `--enable-ufw` | Run security setup and explicitly enable UFW |
 | `--all` | Run every standard profile without UFW enablement or driver installation |
 `--install-nvidia-driver` implies `--nvidia-tools`. `--enable-ufw` implies
 `--security`. With no flags, the installer prints help and makes no changes.
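The implied-flag rules can be sketched as a small argument parser. This is a minimal illustration covering only four of the documented flags; the function name `parse_flags` and all variable names are assumptions, not the installer's real internals.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of implied-flag handling; names are illustrative and only
# a subset of the documented flags is shown.

parse_flags() {
  run_security=false
  enable_ufw=false
  run_nvidia_tools=false
  nvidia_driver_version=""
  show_help=false

  if (($# == 0)); then
    show_help=true   # no flags: print help and change nothing
    return 0
  fi
  while (($# > 0)); do
    case "$1" in
      --security) run_security=true ;;
      --enable-ufw) enable_ufw=true; run_security=true ;;
      --nvidia-tools) run_nvidia_tools=true ;;
      --install-nvidia-driver)
        if (($# < 2)); then
          printf 'CRITICAL: --install-nvidia-driver requires VERSION\n' >&2
          return 2
        fi
        nvidia_driver_version="$2"
        run_nvidia_tools=true
        shift
        ;;
      *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; return 2 ;;
    esac
    shift
  done
}
```

Setting the implied variable in the same `case` branch keeps the dependency in one place, so later steps only need to test `run_security` or `run_nvidia_tools`.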
 ## Installation examples
 Review the scripts and current host access path before execution:
 ```bash
 cd labs/linux/setup
 ./install.sh
 sudo ./install.sh --base --shell
 sudo ./install.sh --cockpit --docker --libvirt
 sudo ./install.sh --all
 ```
 Explicit high-impact options can be combined with `--all`:
 ```bash
 sudo ./install.sh --all --enable-ufw
 sudo ./install.sh --all --install-nvidia-driver 550
 ```
 The installer runs the read-only preflight once, before the selected profiles,
 and runs the postcheck after all profile steps have completed successfully.
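One way to picture that run order is the sketch below. It assumes the postcheck is skipped when a profile fails, which is one reading of the sentence above; the function name `run_selected_profiles` and the use of `bash SCRIPT` are illustrative and not taken from `install.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the run order; the function name and the use of
# "bash SCRIPT" are illustrative, not taken from install.sh.

run_selected_profiles() {
  local script_dir="$1"
  shift
  local script

  # Read-only preflight runs once, before any mutating profile.
  bash "$script_dir/00-preflight.sh" || return 1

  for script in "$@"; do
    if ! bash "$script_dir/$script"; then
      printf 'CRITICAL: %s failed; postcheck skipped\n' "$script" >&2
      return 1
    fi
  done

  # Reached only when every selected profile succeeded.
  bash "$script_dir/99-postcheck.sh"
}
```

For example, `run_selected_profiles ./scripts 01-base-packages.sh 02-shell-profile.sh` would run four scripts in order: preflight, the two profiles, then the postcheck.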
 ## Fresh host workflow
 1. Patch the base Ubuntu installation and confirm console or out-of-band access.
 2. Review [the fresh install checklist](docs/fresh-install-checklist.md).
 3. Run `sudo ./install.sh --base --shell`.
 4. Add only the platform profiles needed by the host.
 5. Review service state, listening ports, storage, networking, and warnings in
   the postcheck.
 6. Reboot if a driver or kernel-related package requires it.
 7. Capture host-specific configuration and backup requirements separately.
 ## AI lab workflow
 A general AI lab host can start with:
 ```bash
 sudo ./install.sh --base --shell --cockpit --docker --nvidia-tools --tuning --security
 ```
 This installs GPU diagnostics but leaves driver choice to the operator. Add
 libvirt only when the host will run VMs. Enable UFW only after confirming SSH,
 Cockpit, application, bridge, and VM networking requirements.
 ## Safety model
 - Mutating profiles require root and refuse non-Ubuntu systems or Ubuntu older
  than 24.04.
 - Component profiles install their own direct prerequisites.
 - Existing managed configuration is changed only when content differs.
 - Changed root shell, Docker, journald, and sysctl files receive timestamped
  backups.
 - Existing valid Docker JSON is merged so unrelated settings survive.
 - Invalid Docker JSON stops configuration rather than being overwritten.
 - UFW and NVIDIA driver installation require explicit flags.
 - Package and service failures are not hidden.
 - Postcheck warnings flag optional or inactive components without failing a
  diagnostic script that otherwise completed successfully.
 APT installation and service restarts are real system changes. Test first on a
 disposable host and maintain a console path when changing remote access policy.
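The "change only when content differs, back up before replacing" rule might be implemented along these lines. This is a sketch under stated assumptions: the function name `install_managed_file`, the `0644` mode, and the `TARGET.TIMESTAMP.bak` naming are hypothetical and may not match the toolkit's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a content-checked file install; the function name and
# the backup naming scheme are assumptions.

install_managed_file() {
  local source_file="$1" target_file="$2"
  local backup_file

  # Identical content: leave the file, its backups, and its service alone.
  if [[ -f "$target_file" ]] && cmp -s "$source_file" "$target_file"; then
    printf 'OK: %s is already current\n' "$target_file"
    return 0
  fi

  # Different content: keep a timestamped copy before replacing it.
  if [[ -f "$target_file" ]]; then
    backup_file="$target_file.$(date '+%Y%m%d-%H%M%S').bak"
    cp -p -- "$target_file" "$backup_file" || return 1
    printf 'INFO: backed up %s to %s\n' "$target_file" "$backup_file"
  fi

  install -m 0644 -- "$source_file" "$target_file" || return 1
  printf 'CHANGED: %s\n' "$target_file"
}
```

Because a rerun with unchanged content returns early, repeated installs create no new backups and give the caller no reason to restart the owning service.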
 ## Bash shell profile
 The shell profile is installed as `/root/.bashrc.d/ailab.sh`, and one exact
 source line is maintained in `/root/.bashrc`. It adds concise helpers for
 systemd, journals, Docker, libvirt, NVIDIA, ports, archives, and disk usage.
 See [Bash shell profile](docs/bash-shell.md) for command details and cautions.
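Maintaining "one exact source line" can be done with a whole-line fixed-string match, as in the sketch below. The function name `ensure_line` and the example source line in the comment are assumptions; the text the installer actually writes may differ.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the function name and the example source line are assumptions.

ensure_line() {
  local line="$1" file="$2"
  touch -- "$file" || return 1
  # -F fixed string, -x whole-line match, -q quiet: only an exact line counts.
  if grep -Fxq -- "$line" "$file"; then
    return 0
  fi
  # Terminate an unterminated last line so the new line is not glued onto it.
  if [[ -s "$file" && -n "$(tail -c 1 "$file")" ]]; then
    printf '\n' >> "$file"
  fi
  printf '%s\n' "$line" >> "$file"
}

# Example call (illustrative line; the toolkit's exact text may differ):
# ensure_line '[ -f /root/.bashrc.d/ailab.sh ] && . /root/.bashrc.d/ailab.sh' /root/.bashrc
```

Running the function repeatedly leaves a single copy of the line, which is what makes the shell profile step safe to rerun.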
 ## Cockpit setup
 Cockpit provides browser-based host, storage, network, package, VM, metrics,
 and support-report views. The installer enables `cockpit.socket` and reports
 `https://HOSTNAME:9090`. `cockpit-files` is optional because it is not
 available in every enabled Ubuntu repository.
 See [Cockpit setup](docs/cockpit.md).
 ## Docker setup
 The Ubuntu `docker.io` package path is preferred. The Docker official
 repository is configured only when `docker.io` is unavailable. The daemon uses
 the `json-file` log driver with five 50 MB files per container.
 The toolkit configures log retention only. It does not prune data, deploy
 Compose applications, or configure an NVIDIA container runtime.
 See [Docker setup](docs/docker.md).
 ## libvirt/KVM setup
 The libvirt profile installs QEMU, OVMF, software TPM support, virt-install,
 virt-manager, bridge utilities, and libvirt clients and services. It enables
 `libvirtd` and prints existing guests and networks.
 See [libvirt/KVM setup](docs/libvirt.md).
 ## NVIDIA tooling
 The default NVIDIA profile installs `nvtop`, `clinfo`, and PCI diagnostics.
 It reports detected NVIDIA devices, `nvidia-smi`, and DKMS state when those
 commands exist.
 Driver installation requires a numeric version that maps to an available
 Ubuntu package, for example `nvidia-driver-550`. Secure Boot enrollment,
 driver suitability, CUDA, container runtime support, and passthrough remain
 operator decisions.
 See [NVIDIA tooling](docs/nvidia.md).
 ## Tuning
 The tuning profile bounds persistent journal use, raises inotify limits for
 development and container workloads, reduces swappiness, enables sysstat, and
 runs automatic sensor detection when available.
 Review these values against available memory, storage, monitoring retention,
 and workload behavior before deployment beyond a lab.
 ## Security baseline
 The security profile installs UFW and fail2ban and enables fail2ban. It leaves
 UFW disabled unless `--enable-ufw` is present. When that flag is given, the
 profile adds allow rules for OpenSSH and TCP port 9090 before it activates UFW.
 This is a minimal access-preservation baseline, not a complete host firewall or
 hardening standard. Application and VM networking may require additional
 reviewed rules.
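The ordering described in this section, allow rules first and activation last, could be sketched as follows. The function names and the `dry_run` switch are assumptions for illustration; with `dry_run=true` the commands are only printed.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of rule ordering; function names and the dry-run switch
# are assumptions. With dry_run=true nothing is changed.

dry_run=true

run() {
  if [[ "$dry_run" == true ]]; then
    printf 'DRY-RUN: %s\n' "$*"
  else
    "$@"
  fi
}

enable_ufw_preserving_access() {
  # Allow rules are added first; "enable" runs only if both succeed.
  run ufw allow OpenSSH \
    && run ufw allow 9090/tcp \
    && run ufw --force enable
}
```

Chaining with `&&` means a failed allow rule stops the sequence before the firewall is switched on, which is the point of adding access rules first.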
 ## Postcheck
 The final script reports:
 - Failed systemd units.
 - Cockpit, Docker, libvirt, and fail2ban status when installed.
 - Running Docker containers and defined virtual machines.
 - NVIDIA runtime state.
 - Filesystem usage and listening ports.
 Warnings require operator review, but the absence of an optional component
 does not cause the postcheck itself to fail.
 ## Troubleshooting
 Run individual read-only checks after correcting a failed profile:
 ```bash
 sudo ./scripts/00-preflight.sh
 sudo ./scripts/99-postcheck.sh
 systemctl --failed
 journalctl -u docker -u libvirtd -u cockpit.socket -u fail2ban
 ```
 Common failure areas are unavailable APT repositories, unsupported package
 names on a future Ubuntu release, invalid pre-existing Docker JSON, Secure Boot
 module signing, disabled CPU virtualization, and remote firewall assumptions.
 To roll back a managed configuration, compare the current file with its
 timestamped `.bak` copy, restore the reviewed backup, and restart or reload the
 owning service. Package removal is intentionally not automated because it may
 affect workloads and dependencies.
 ## Interview talking points
 - Why day-0 bootstrap and day-2 maintenance have separate ownership.
 - How explicit flags protect firewall and GPU driver decisions.
 - Why Docker JSON is validated, backed up, and merged.
 - How idempotent content checks prevent backup and restart churn.
 - Why preflight and postcheck evidence surround mutating profiles.
 - Which virtualization, Secure Boot, IOMMU, and GPU decisions remain manual.
 ## Future improvements
 - Add automated tests using disposable Ubuntu VMs.
 - Add a documented NVIDIA Container Toolkit profile.
 - Add optional non-root administrative user and group membership management.
 - Add bridge and VFIO planning checks without applying passthrough changes.
 - Add package compatibility matrices after validating future Ubuntu releases.
 - Export postcheck results in a structured format for evidence collection.
@@ -0,0 +1,53 @@
 # Bash Shell Profile
 ## Installation
 The shell profile is installed for root:
 ```text
 /root/.bashrc.d/ailab.sh
 ```
 The installer maintains one exact source line in `/root/.bashrc` and backs up
 changed files. Start a new Bash session or run:
 ```bash
 source /root/.bashrc
 ```
 ## Aliases
 | Alias | Purpose |
 | --- | --- |
 | `ll`, `la` | Detailed and hidden-file directory listings |
 | `ports` | Listening TCP/UDP sockets and processes |
 | `dus`, `dufh` | Directory and filesystem usage |
 | `failed`, `jerr`, `timers` | systemd failure, journal error, and timer views |
 | `dps`, `ddf`, `dcu` | Docker containers, disk use, and Compose startup |
 | `vms` | All libvirt guests |
 | `gpu`, `gpuloop` | NVIDIA status once or refreshed every two seconds |
 | `now` | Current timestamp and timezone |
 `dcu` runs `docker compose up -d` in the current directory and therefore may
 create or start resources. Review the Compose project before using it.
 ## Functions
 - `svc_status SERVICE`
 - `svc_logs SERVICE [LINES]`
 - `docker_logs CONTAINER [LINES]`
 - `docker_restart CONTAINER`
 - `vm_autostart VM`
 - `vm_no_autostart VM`
 - `path_backup PATH`
 - `extract ARCHIVE`
 Functions validate argument counts, and Docker, libvirt, and NVIDIA helpers
 report missing commands clearly. `path_backup` creates a timestamped adjacent
 copy and can consume substantial space for large paths.
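As a rough picture of the argument validation and adjacent timestamped copy described above, a helper might look like the sketch below. It is named `path_backup_sketch` to make clear that it is not the profile's real `path_backup`; the backup suffix format is an assumption.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; not the profile's real path_backup. The suffix format is assumed.

path_backup_sketch() {
  if (($# != 1)); then
    printf 'Usage: path_backup_sketch PATH\n' >&2
    return 2
  fi
  local source_path="${1%/}"   # drop one trailing slash so the copy sits beside the path
  local backup_path
  if [[ ! -e "$source_path" ]]; then
    printf 'CRITICAL: path does not exist: %s\n' "$1" >&2
    return 1
  fi
  backup_path="$source_path.$(date '+%Y%m%d-%H%M%S').bak"
  # -a preserves ownership, modes, and timestamps and copies directories recursively.
  cp -a -- "$source_path" "$backup_path" || return 1
  printf 'OK: backup created: %s\n' "$backup_path"
}
```

Because the copy is a full recursive duplicate, checking `du -sh PATH` first is sensible for large directories.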
 ## Rollback
 Review timestamped backups under `/root`, restore the desired `.bashrc` or
 profile copy, and start a new shell. Avoid restoring a backup without checking
 for unrelated shell changes made after bootstrap.
@@ -0,0 +1,41 @@
 # Cockpit
 ## Purpose
 The Cockpit profile installs browser-based host administration modules for
 system state, storage, networking, packages, virtual machines, metrics, and
 support reports. It enables the socket-activated service.
 ## Installation and validation
 ```bash
 sudo ./install.sh --cockpit
 systemctl status cockpit.socket
 ss -ltnp | grep ':9090'
 ```
 Connect to `https://HOSTNAME:9090`. A browser warning is expected when the
 default host certificate is not trusted.
 `cockpit-files` is installed when available and skipped with a warning
 otherwise.
 ## Access and firewall
 The Cockpit profile does not change UFW. Enabling UFW through the toolkit with
 `--enable-ufw` allows TCP 9090, but upstream firewalls and network ACLs remain
 external concerns. Use normal Linux accounts and review which users may
 administer the host.
 ## Troubleshooting and rollback
 ```bash
 journalctl -u cockpit.socket -u cockpit.service
 systemctl restart cockpit.socket
 apt-cache policy cockpit cockpit-machines cockpit-files
 ```
 To disable remote access without removing packages:
 ```bash
 sudo systemctl disable --now cockpit.socket
 ```
@@ -0,0 +1,56 @@
 # Docker
 ## Package policy
 The profile prefers Ubuntu's `docker.io` package. If that package is
 unavailable after an APT refresh, it configures Docker's official Ubuntu
 repository and installs Docker Engine, containerd, Buildx, and Compose plugins.
 This fallback requires network access to `download.docker.com`.
 ## Daemon configuration
 The managed settings are:
 ```json
 {
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
 }
 ```
 Existing valid `/etc/docker/daemon.json` content is preserved and merged with
 these log settings. A changed file is backed up with a timestamp. Invalid JSON
 causes the profile to stop rather than overwrite operator configuration.
 Log limits apply to newly created containers. Existing containers may retain
 their original logging configuration until recreated.
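The validate-then-merge behavior can be approximated with `jq`, as sketched below. This is an assumption about how such a merge could be written, not the profile's actual code; the function name `merge_log_settings` is hypothetical, and the sketch only prints the merged JSON rather than writing the file.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a validate-then-merge step using jq; it prints the
# merged document and does not write /etc/docker/daemon.json.

merge_log_settings() {
  local current_file="$1"
  local managed='{"log-driver":"json-file","log-opts":{"max-size":"50m","max-file":"5"}}'

  # Missing or empty file: the managed settings are the whole document.
  if [[ ! -s "$current_file" ]]; then
    jq . <<<"$managed"
    return
  fi

  # Invalid JSON: stop instead of overwriting operator configuration.
  if ! jq empty "$current_file" >/dev/null 2>&1; then
    printf 'CRITICAL: %s is not valid JSON; refusing to overwrite\n' "$current_file" >&2
    return 1
  fi

  # "*" merges objects recursively; on conflicts the right-hand (managed) value wins.
  jq --argjson managed "$managed" '. * $managed' "$current_file"
}
```

With this approach, unrelated keys such as `data-root` and extra `log-opts` entries survive, while the four managed values are set to the documented limits.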
 ## Validation
 ```bash
 docker version
 docker compose version
 docker info
 docker ps
 docker system df
 jq . /etc/docker/daemon.json
 ```
 ## Troubleshooting and rollback
 ```bash
 systemctl status docker
 journalctl -u docker
 jq empty /etc/docker/daemon.json
 ```
 To restore a previous daemon configuration, review a timestamped backup,
 replace the current file, validate it with `jq empty`, and restart Docker.
 Do not restore blindly when workloads depend on newer daemon settings.
 The profile does not configure Docker data roots, prune objects, deploy
 applications, or install the NVIDIA Container Toolkit.
@@ -0,0 +1,47 @@
 # Fresh Install Checklist
 ## Before bootstrap
 - Confirm Ubuntu 24.04 or newer and record the release and kernel.
 - Apply firmware settings for virtualization, IOMMU, or Secure Boot as needed.
 - Confirm console or out-of-band access before firewall work.
 - Record interfaces, addresses, routes, DNS, storage, and intended mountpoints.
 - Patch the base system and reboot if required.
 - Decide whether the host needs Docker, libvirt, Cockpit, or NVIDIA support.
 - Review application ports and VM networking before enabling UFW.
 - Confirm backups exist for any pre-existing host configuration.
 ## Bootstrap
 Start with the least capability required:
 ```bash
 sudo ./install.sh --base --shell
 ```
 Add reviewed platform profiles:
 ```bash
 sudo ./install.sh --cockpit --docker --libvirt --nvidia-tools --tuning --security
 ```
 Do not select `--enable-ufw` until remote access and application rules are
 understood. Do not install an NVIDIA driver until hardware, kernel, Secure Boot,
 and workload compatibility are known.
 ## Post-bootstrap evidence
 - Review all installer warnings.
 - Run `systemctl --failed`.
 - Confirm expected services with `systemctl status`.
 - Review `ss -tulpn`, `df -hT`, `ip -brief address`, and `ip route`.
 - Confirm Docker with `docker version` and `docker compose version`.
 - Confirm libvirt with `virsh list --all` and `virsh net-list --all`.
 - Confirm GPU state with `lspci -nn | grep -i nvidia` and `nvidia-smi`.
 - Reboot after driver installation and repeat the postcheck.
 ## Handover
 Document host-specific storage, network, firewall, backup, application, GPU,
 and VM decisions. Install the separate `ailab-maintenance` toolkit only after
 reviewing its scheduled day-2 behavior.
@@ -0,0 +1,54 @@
 # libvirt and KVM
 ## Purpose
 The libvirt profile installs QEMU/KVM administration, UEFI firmware, software
 TPM support, VM creation tools, bridge utilities, and the libvirt daemon. This
 supports later Linux or Windows 11 VM work without defining guests.
 ## Firmware pre-checks
 Confirm CPU virtualization is enabled:
 ```bash
 lscpu | grep -E 'Virtualization|Hypervisor'
 grep -Eom1 '(vmx|svm)' /proc/cpuinfo
 ```
 IOMMU and GPU passthrough require separate firmware, kernel command-line,
 device isolation, driver binding, and recovery planning. This toolkit reports
 hints but does not apply those changes.
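A read-only hint report of this kind might resemble the sketch below. The function name, the messages, and the specific paths checked (`/dev/kvm`, `/sys/kernel/iommu_groups`) are assumptions about what such hints could cover, not a description of the toolkit's output.

```bash
#!/usr/bin/env bash
# Hypothetical read-only sketch; the function name and messages are assumptions.
# The cpuinfo path is a parameter only so the check can be tested.

report_virtualization_hints() {
  local cpuinfo="${1:-/proc/cpuinfo}"
  local flag

  # vmx = Intel VT-x, svm = AMD-V; -w avoids matching longer flag names.
  flag="$(grep -Eowm1 '(vmx|svm)' "$cpuinfo" 2>/dev/null | head -n 1)"
  if [[ -n "$flag" ]]; then
    printf 'OK: CPU virtualization flag present: %s\n' "$flag"
  else
    printf 'WARNING: no vmx or svm flag found; check firmware settings\n'
  fi

  if [[ -e /dev/kvm ]]; then
    printf 'OK: /dev/kvm exists\n'
  else
    printf 'INFO: /dev/kvm is absent; KVM modules may not be loaded\n'
  fi

  # IOMMU groups are listed here only when the kernel has an active IOMMU.
  if compgen -G '/sys/kernel/iommu_groups/*' >/dev/null; then
    printf 'OK: IOMMU groups are present\n'
  else
    printf 'INFO: no IOMMU groups found; passthrough would need firmware and kernel changes\n'
  fi
}
```

The function only reads files and prints messages, so it can run as an unprivileged user before any profile is applied.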
 ## Validation
 ```bash
 systemctl status libvirtd
 virsh list --all
 virsh net-list --all
 virsh pool-list --all
 ```
 Use `virt-host-validate` when available for a broader host capability report.
 Desktop use of `virt-manager` requires a graphical environment or remote
 display strategy.
 ## Networking and Windows 11
 The default libvirt NAT network is distinct from host bridge networking. Review
 DHCP, DNS, forwarding, and firewall behavior before changing it.
 Windows 11 typically needs UEFI and a TPM device. The installed OVMF and swtpm
 packages provide those building blocks, but guest creation and licensing remain
 manual.
 ## Troubleshooting
 ```bash
 journalctl -u libvirtd
 virsh net-info default
 virsh pool-list --all
 lsmod | grep kvm
 ```
 Disabling `libvirtd` does not remove VM disks or definitions. Package removal
 and VM data deletion are intentionally outside this toolkit.
@@ -0,0 +1,52 @@
 # NVIDIA Tooling
 ## Diagnostic-only default
 The normal NVIDIA profile installs `nvtop`, `clinfo`, and PCI utilities. It
 does not install or select a driver:
 ```bash
 sudo ./install.sh --nvidia-tools
 ```
 Review hardware, driver, DKMS, and Secure Boot state. `nvidia-smi` is normally available only after a driver package is installed:
 ```bash
 lspci -nn | grep -i nvidia
 nvidia-smi
 dkms status
 mokutil --sb-state
 ```
 ## Explicit driver installation
 Install only a reviewed Ubuntu driver package version:
 ```bash
 sudo ./install.sh --install-nvidia-driver 550
 ```
 The numeric value maps directly to `nvidia-driver-VERSION`. The profile refuses
 an unavailable package. Reboot after installation, then validate `nvidia-smi`,
 kernel logs, DKMS state, and application behavior.
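The mapping and the refusal can be previewed without installing anything. The sketch below mirrors the digits-only check and the `nvidia-driver-VERSION` naming described above; `nvidia_driver_package` is a hypothetical helper name, and the version `550` is only an example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: validate VERSION and print the mapped package name.
nvidia_driver_package() {
  local version="${1:-}"
  if [[ ! "$version" =~ ^[0-9]+$ ]]; then
    printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
    return 2
  fi
  printf 'nvidia-driver-%s\n' "$version"
}

package="$(nvidia_driver_package 550)"
# Read-only availability check against the configured APT repositories.
if command -v apt-cache >/dev/null 2>&1 && apt-cache show "$package" >/dev/null 2>&1; then
  printf 'OK: %s is available from the configured repositories\n' "$package"
else
  printf 'WARNING: %s is not available here; the profile would refuse it\n' "$package"
fi
```

A value such as `550-server` fails the digits-only check, so variant packages are not reachable through this option.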
 ## Selection considerations
 - GPU generation and supported driver branch.
 - Ubuntu release, kernel, and HWE stack.
 - Secure Boot module enrollment.
 - CUDA or application compatibility.
 - Docker NVIDIA Container Toolkit requirements.
 - Whether the device will be bound to VFIO instead of the host driver.
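For the last consideration, the driver currently bound to each NVIDIA function can be read from `lspci -nnk`. This is a minimal sketch: `nvidia_bound_drivers` is a hypothetical helper that parses that output from standard input and relies on NVIDIA's PCI vendor ID `10de`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `lspci -nnk` output on stdin and print
# "<slot> <driver>" for each NVIDIA function, or "none" if unbound.
nvidia_bound_drivers() {
  awk '
    /^[0-9a-f]/ {
      # Device header line: remember the slot and whether it is NVIDIA.
      slot = $1
      is_nvidia = ($0 ~ /\[10de:[0-9a-f]+\]/)
      if (is_nvidia) { order[++count] = slot; driver[slot] = "none" }
      next
    }
    is_nvidia && /Kernel driver in use:/ { driver[slot] = $NF }
    END {
      for (i = 1; i <= count; i++) printf "%s %s\n", order[i], driver[order[i]]
    }
  '
}

if command -v lspci >/dev/null 2>&1; then
  lspci -nnk | nvidia_bound_drivers
fi
```

A device reserved for passthrough would be expected to show `vfio-pci` here, while a host-managed GPU would typically show `nvidia` or `nouveau`.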
 ## Troubleshooting
 ```bash
 journalctl -k | grep -Ei 'nvidia|nouveau|NVRM'
 lsmod | grep -E 'nvidia|nouveau'
 dkms status
 apt-cache policy 'nvidia-driver-*'
 ```
 Driver rollback is environment-specific and is not automated. Preserve console
 access and a known-good kernel before changing GPU or Secure Boot configuration.
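Because rollback is not automated, it can help to record the current state before any driver or Secure Boot change. The sketch below is one possible approach, not part of the toolkit; `record_gpu_baseline` and the output file names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write a small evidence bundle into DIRECTORY.
record_gpu_baseline() {
  local out_dir="${1:-}"
  if [[ -z "$out_dir" ]]; then
    printf 'Usage: record_gpu_baseline DIRECTORY\n' >&2
    return 2
  fi
  mkdir -p "$out_dir" || return 1
  uname -r >"$out_dir/running-kernel.txt"
  if command -v dpkg-query >/dev/null 2>&1; then
    # Kernel image and NVIDIA packages known to dpkg, with versions.
    dpkg-query -W -f='${Package} ${Version}\n' \
      'linux-image-*' 'nvidia-*' 'libnvidia-*' \
      >"$out_dir/packages.txt" 2>/dev/null || true
  fi
  if command -v dkms >/dev/null 2>&1; then
    dkms status >"$out_dir/dkms-status.txt" 2>&1 || true
  fi
  if command -v mokutil >/dev/null 2>&1; then
    mokutil --sb-state >"$out_dir/secure-boot.txt" 2>&1 || true
  fi
  printf 'OK: baseline recorded in %s\n' "$out_dir"
}
```

For example, `record_gpu_baseline "/root/gpu-baseline-$(date '+%Y%m%d-%H%M%S')"` stores the running kernel, package versions, DKMS state, and Secure Boot state for later comparison.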
@@ -0,0 +1,133 @@
 #!/usr/bin/env bash
 # AI lab operational shell helpers. This file is intended to be sourced.
 alias ll='ls -alF'
 alias la='ls -A'
 alias ports='ss -tulpn'
 alias dus='du -xhd1 2>/dev/null | sort -h'
 alias dufh='df -hT'
 alias failed='systemctl --failed --no-pager'
 alias jerr='journalctl -p err -b --no-pager'
 alias timers='systemctl list-timers --all --no-pager'
 # if/else keeps a failing command from also printing the "not installed" message.
 alias dps='if command -v docker >/dev/null 2>&1; then docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}"; else printf "Docker is not installed\n"; fi'
 alias ddf='if command -v docker >/dev/null 2>&1; then docker system df; else printf "Docker is not installed\n"; fi'
 alias dcu='if docker compose version >/dev/null 2>&1; then docker compose up -d; else printf "Docker Compose is not available\n"; fi'
 alias vms='if command -v virsh >/dev/null 2>&1; then virsh list --all; else printf "virsh is not installed\n"; fi'
 alias gpu='if command -v nvidia-smi >/dev/null 2>&1; then nvidia-smi; else printf "nvidia-smi is not installed\n"; fi'
 alias gpuloop='if command -v nvidia-smi >/dev/null 2>&1; then watch -n 2 nvidia-smi; else printf "nvidia-smi is not installed\n"; fi'
 alias now='date "+%Y-%m-%d %H:%M:%S %Z"'
 svc_status() {
  if (($# != 1)); then
    printf 'Usage: svc_status SERVICE\n' >&2
    return 2
  fi
  systemctl status "$1" --no-pager
 }
 svc_logs() {
  if (($# < 1 || $# > 2)); then
    printf 'Usage: svc_logs SERVICE [LINES]\n' >&2
    return 2
  fi
  local lines="${2:-100}"
  [[ "$lines" =~ ^[0-9]+$ ]] || {
    printf 'LINES must be numeric\n' >&2
    return 2
  }
  journalctl -u "$1" -n "$lines" --no-pager
 }
 docker_logs() {
  if (($# < 1 || $# > 2)); then
    printf 'Usage: docker_logs CONTAINER [LINES]\n' >&2
    return 2
  fi
  command -v docker >/dev/null 2>&1 || {
    printf 'Docker is not installed\n' >&2
    return 1
  }
  local lines="${2:-100}"
  [[ "$lines" =~ ^[0-9]+$ ]] || {
    printf 'LINES must be numeric\n' >&2
    return 2
  }
  docker logs --tail "$lines" "$1"
 }
 docker_restart() {
  if (($# != 1)); then
    printf 'Usage: docker_restart CONTAINER\n' >&2
    return 2
  fi
  command -v docker >/dev/null 2>&1 || {
    printf 'Docker is not installed\n' >&2
    return 1
  }
  docker restart "$1"
 }
 vm_autostart() {
  if (($# != 1)); then
    printf 'Usage: vm_autostart VM\n' >&2
    return 2
  fi
  command -v virsh >/dev/null 2>&1 || {
    printf 'virsh is not installed\n' >&2
    return 1
  }
  virsh autostart "$1"
 }
 vm_no_autostart() {
  if (($# != 1)); then
    printf 'Usage: vm_no_autostart VM\n' >&2
    return 2
  fi
  command -v virsh >/dev/null 2>&1 || {
    printf 'virsh is not installed\n' >&2
    return 1
  }
  virsh autostart --disable "$1"
 }
 path_backup() {
  if (($# != 1)); then
    printf 'Usage: path_backup PATH\n' >&2
    return 2
  fi
  if [[ ! -e "$1" ]]; then
    printf 'Path does not exist: %s\n' "$1" >&2
    return 1
  fi
  local destination
  destination="${1%/}.$(date '+%Y%m%d-%H%M%S').bak"
  cp -a -- "$1" "$destination"
  printf 'Backup created: %s\n' "$destination"
 }
 extract() {
  if (($# != 1)); then
    printf 'Usage: extract ARCHIVE\n' >&2
    return 2
  fi
  if [[ ! -f "$1" ]]; then
    printf 'Archive does not exist: %s\n' "$1" >&2
    return 1
  fi
  case "$1" in
    *.tar.bz2|*.tbz2) tar xjf "$1" ;;
    *.tar.gz|*.tgz) tar xzf "$1" ;;
    *.tar.xz|*.txz) tar xJf "$1" ;;
    *.tar) tar xf "$1" ;;
    *.bz2) bunzip2 "$1" ;;
    *.gz) gunzip "$1" ;;
    *.zip) unzip "$1" ;;
    *.7z) 7z x "$1" ;;
    *.rar) unrar x "$1" ;;
    *)
      printf 'Unsupported archive type: %s\n' "$1" >&2
      return 2
      ;;
  esac
 }
@@ -0,0 +1,7 @@
 {
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
 }
@@ -0,0 +1,3 @@
 fs.inotify.max_user_watches=1048576
 fs.inotify.max_user_instances=1024
 vm.swappiness=10
@@ -0,0 +1,5 @@
 [Journal]
 SystemMaxUse=1G
 SystemKeepFree=2G
 MaxRetentionSec=14day
 Compress=yes
@@ -0,0 +1,182 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 run_base=0
 run_shell=0
 run_cockpit=0
 run_docker=0
 run_libvirt=0
 run_nvidia=0
 run_tuning=0
 run_security=0
 enable_ufw=0
 nvidia_driver_version=""
 usage() {
  cat <<'EOF'
 Usage: sudo ./install.sh [OPTIONS]
 Day-0 bootstrap automation for Ubuntu 24.04 or newer.
 Profiles:
  --base                         Install baseline operational packages
  --shell                        Install the root shell profile
  --cockpit                      Install and enable Cockpit
  --docker                       Install and configure Docker
  --libvirt                      Install and enable libvirt/KVM
  --nvidia-tools                 Install NVIDIA diagnostic tools only
  --install-nvidia-driver VERSION
                                 Install diagnostic tools and the explicit driver
  --tuning                       Install journald and sysctl tuning
  --security                     Install fail2ban and UFW; do not enable UFW
  --enable-ufw                   Run security profile and explicitly enable UFW
  --all                          Run every profile without enabling UFW or
                                 installing an NVIDIA driver
  -h, --help                     Show this help
 Examples:
  sudo ./install.sh --base --shell
  sudo ./install.sh --all
  sudo ./install.sh --all --enable-ufw
  sudo ./install.sh --install-nvidia-driver 550
 EOF
 }
 require_supported_ubuntu() {
  if [[ ! -r /etc/os-release ]]; then
    printf 'CRITICAL: /etc/os-release is unavailable; refusing system changes\n' >&2
    exit 2
  fi
  # shellcheck disable=SC1091
  source /etc/os-release
  if [[ "${ID:-}" != "ubuntu" ]]; then
    printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
    exit 2
  fi
  if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
    printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
      "${VERSION_ID:-unknown}" >&2
    exit 2
  fi
 }
 if (($# == 0)); then
  usage
  exit 0
 fi
 while (($# > 0)); do
  case "$1" in
    --base)
      run_base=1
      ;;
    --shell)
      run_shell=1
      ;;
    --cockpit)
      run_cockpit=1
      ;;
    --docker)
      run_docker=1
      ;;
    --libvirt)
      run_libvirt=1
      ;;
    --nvidia-tools)
      run_nvidia=1
      ;;
    --install-nvidia-driver)
      if (($# < 2)); then
        printf 'CRITICAL: --install-nvidia-driver requires a VERSION\n' >&2
        exit 2
      fi
      nvidia_driver_version="$2"
      if [[ ! "$nvidia_driver_version" =~ ^[0-9]+$ ]]; then
        printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
        exit 2
      fi
      run_nvidia=1
      shift
      ;;
    --tuning)
      run_tuning=1
      ;;
    --security)
      run_security=1
      ;;
    --enable-ufw)
      enable_ufw=1
      run_security=1
      ;;
    --all)
      run_base=1
      run_shell=1
      run_cockpit=1
      run_docker=1
      run_libvirt=1
      run_nvidia=1
      run_tuning=1
      run_security=1
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n\n' "$1" >&2
      usage >&2
      exit 2
      ;;
  esac
  shift
 done
 if ((EUID != 0)); then
  printf 'CRITICAL: install.sh must run as root for selected profiles\n' >&2
  exit 2
 fi
 for required_command in bash dpkg; do
  if ! command -v "$required_command" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$required_command" >&2
    exit 2
  fi
 done
 require_supported_ubuntu
 printf 'INFO: running read-only preflight\n'
 "$SCRIPT_DIR/scripts/00-preflight.sh"
 ((run_base == 0)) || "$SCRIPT_DIR/scripts/01-base-packages.sh"
 ((run_shell == 0)) || "$SCRIPT_DIR/scripts/02-shell-profile.sh"
 ((run_cockpit == 0)) || "$SCRIPT_DIR/scripts/03-cockpit.sh"
 ((run_docker == 0)) || "$SCRIPT_DIR/scripts/04-docker.sh"
 ((run_libvirt == 0)) || "$SCRIPT_DIR/scripts/05-libvirt.sh"
 if ((run_nvidia == 1)); then
  if [[ -n "$nvidia_driver_version" ]]; then
    "$SCRIPT_DIR/scripts/06-nvidia-tools.sh" --install-driver "$nvidia_driver_version"
  else
    "$SCRIPT_DIR/scripts/06-nvidia-tools.sh"
  fi
 fi
 ((run_tuning == 0)) || "$SCRIPT_DIR/scripts/07-tuning.sh"
 if ((run_security == 1)); then
  if ((enable_ufw == 1)); then
    "$SCRIPT_DIR/scripts/08-security-baseline.sh" --enable-ufw
  else
    "$SCRIPT_DIR/scripts/08-security-baseline.sh"
  fi
 fi
 printf '\nINFO: running post-install checks\n'
 "$SCRIPT_DIR/scripts/99-postcheck.sh"
 printf '\nOK: selected Linux setup profiles completed\n'
@@ -0,0 +1,20 @@
 # shellcheck shell=bash
 require_supported_ubuntu() {
  if [[ ! -r /etc/os-release ]] || ! command -v dpkg >/dev/null 2>&1; then
    printf 'CRITICAL: Ubuntu release detection requires /etc/os-release and dpkg\n' >&2
    exit 2
  fi
  # shellcheck disable=SC1091
  source /etc/os-release
  if [[ "${ID:-}" != "ubuntu" ]]; then
    printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
    exit 2
  fi
  if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
    printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
      "${VERSION_ID:-unknown}" >&2
    exit 2
  fi
 }
@@ -0,0 +1,124 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 section() {
  printf '\n== %s ==\n' "$1"
 }
 run_optional() {
  local description="$1"
  shift
  if "$@"; then
    return 0
  fi
  printf 'WARNING: %s failed\n' "$description"
  return 0
 }
 section "Operating system"
 if [[ -r /etc/os-release ]]; then
  run_optional "OS release report" cat /etc/os-release
 else
  printf 'WARNING: /etc/os-release is unavailable\n'
 fi
 run_optional "kernel report" uname -a
 section "Host"
 run_optional "hostname report" hostname
 run_optional "uptime report" uptime
 section "CPU and virtualization"
 if command -v lscpu >/dev/null 2>&1; then
  run_optional "CPU report" lscpu
  printf '\nVirtualization flags:\n'
  lscpu | grep -E 'Virtualization|Hypervisor vendor' || \
    printf 'INFO: no virtualization summary reported by lscpu\n'
 else
  printf 'WARNING: lscpu is unavailable\n'
 fi
 if grep -Eqm1 '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' /proc/cpuinfo; then
  printf 'OK: CPU virtualization flags detected\n'
 else
  printf 'WARNING: CPU virtualization flags were not detected\n'
 fi
 section "Memory"
 if command -v free >/dev/null 2>&1; then
  run_optional "memory report" free -h
 else
  run_optional "memory report" cat /proc/meminfo
 fi
 section "Disks"
 if command -v lsblk >/dev/null 2>&1; then
  run_optional "block device report" lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,MODEL
 else
  printf 'WARNING: lsblk is unavailable\n'
 fi
 run_optional "filesystem report" df -hT
 section "Network"
 if command -v ip >/dev/null 2>&1; then
  run_optional "network interface report" ip -brief address
  run_optional "route report" ip route
 else
  printf 'WARNING: ip is unavailable\n'
 fi
 section "Firmware and Secure Boot"
 if [[ -d /sys/firmware/efi ]]; then
  printf 'OK: boot mode is UEFI\n'
 else
  printf 'INFO: boot mode appears to be legacy BIOS\n'
 fi
 if command -v mokutil >/dev/null 2>&1; then
  run_optional "Secure Boot report" mokutil --sb-state
 else
  printf 'INFO: mokutil is unavailable; Secure Boot state not queried\n'
 fi
 section "IOMMU"
 if [[ -r /proc/cmdline ]]; then
  printf 'Kernel command line:\n'
  cat /proc/cmdline
  if grep -Eq '(^|[[:space:]])(intel_iommu=on|amd_iommu=on|iommu=)' /proc/cmdline; then
    printf 'OK: IOMMU-related kernel arguments detected\n'
  else
    printf 'INFO: no explicit IOMMU kernel argument detected\n'
  fi
 fi
 if command -v dmesg >/dev/null 2>&1; then
  dmesg 2>/dev/null | grep -Ei 'DMAR|IOMMU|AMD-Vi' | tail -n 30 || \
    printf 'INFO: no readable IOMMU hints found in dmesg\n'
 fi
 section "NVIDIA hardware"
 if command -v lspci >/dev/null 2>&1; then
  lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
 else
  printf 'INFO: lspci is unavailable\n'
 fi
 section "Existing platform components"
 for command_name in docker virsh cockpit-bridge; do
  if command -v "$command_name" >/dev/null 2>&1; then
    printf 'OK: %s is installed at %s\n' "$command_name" "$(command -v "$command_name")"
  else
    printf 'INFO: %s is not installed\n' "$command_name"
  fi
 done
 if command -v systemctl >/dev/null 2>&1; then
  for unit in docker.service libvirtd.service cockpit.socket; do
    if systemctl cat "$unit" >/dev/null 2>&1; then
      state="$(systemctl is-active "$unit" 2>/dev/null || true)"
      printf 'INFO: %-20s state=%s\n' "$unit" "${state:-unknown}"
    else
      printf 'INFO: %s is not installed\n' "$unit"
    fi
  done
 fi
 printf '\nOK: preflight completed without modifying the host\n'
@@ -0,0 +1,41 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00-platform-guard.inc
 source "$SCRIPT_DIR/00-platform-guard.inc"
 packages=(
  curl wget git vim nano tmux byobu htop btop glances
  jq unzip zip rsync tree ncdu duf
  lsof strace tcpdump nmap dnsutils net-tools iperf3 ethtool
  smartmontools nvme-cli lm-sensors pciutils usbutils hwinfo
  sysstat iotop iftop nload
  ca-certificates gnupg software-properties-common apt-transport-https
  needrestart unattended-upgrades logrotate
 )
 if ((EUID != 0)); then
  printf 'CRITICAL: base package setup must run as root\n' >&2
  exit 2
 fi
 require_supported_ubuntu
 if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
 fi
 printf 'INFO: refreshing APT metadata\n'
 apt-get update
 printf 'INFO: installing baseline operational packages\n'
 DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
 if command -v systemctl >/dev/null 2>&1; then
  systemctl enable --now sysstat
 else
  printf 'WARNING: systemctl is unavailable; sysstat was not enabled\n'
 fi
 printf 'OK: baseline operational packages are installed\n'
@@ -0,0 +1,60 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00-platform-guard.inc
 source "$SCRIPT_DIR/00-platform-guard.inc"
 SOURCE_FILE="$SCRIPT_DIR/../files/bashrc.d/ailab.sh"
 PROFILE_DIR="/root/.bashrc.d"
 PROFILE_FILE="$PROFILE_DIR/ailab.sh"
 BASHRC="/root/.bashrc"
 SOURCE_LINE='[[ -f /root/.bashrc.d/ailab.sh ]] && source /root/.bashrc.d/ailab.sh'
 backup_file() {
  local path="$1"
  local backup
  backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
  install -m 0644 "$path" "$backup"
  printf 'INFO: backed up %s to %s\n' "$path" "$backup"
 }
 if ((EUID != 0)); then
  printf 'CRITICAL: shell profile setup must run as root\n' >&2
  exit 2
 fi
 require_supported_ubuntu
 if [[ ! -r "$SOURCE_FILE" ]]; then
  printf 'CRITICAL: shell profile source is missing: %s\n' "$SOURCE_FILE" >&2
  exit 2
 fi
 install -d -m 0755 "$PROFILE_DIR"
 if [[ ! -f "$PROFILE_FILE" ]] || ! cmp -s "$SOURCE_FILE" "$PROFILE_FILE"; then
  if [[ -f "$PROFILE_FILE" ]]; then
    backup_file "$PROFILE_FILE"
  fi
  install -m 0644 "$SOURCE_FILE" "$PROFILE_FILE"
  printf 'OK: installed %s\n' "$PROFILE_FILE"
 else
  printf 'OK: shell profile is already current\n'
 fi
 if [[ ! -f "$BASHRC" ]]; then
  install -m 0644 /dev/null "$BASHRC"
 fi
 source_count="$(grep -Fxc "$SOURCE_LINE" "$BASHRC" || true)"
 if [[ "$source_count" != "1" ]]; then
  tmp_bashrc="$(mktemp)"
  trap 'rm -f "$tmp_bashrc"' EXIT
  grep -Fvx "$SOURCE_LINE" "$BASHRC" >"$tmp_bashrc" || true
  printf '\n%s\n' "$SOURCE_LINE" >>"$tmp_bashrc"
  backup_file "$BASHRC"
  install -m 0644 "$tmp_bashrc" "$BASHRC"
  printf 'OK: configured %s to source the AI lab profile exactly once\n' "$BASHRC"
 else
  printf 'OK: %s already sources the AI lab profile exactly once\n' "$BASHRC"
 fi
@@ -0,0 +1,36 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00-platform-guard.inc
 source "$SCRIPT_DIR/00-platform-guard.inc"
 required_packages=(
  cockpit cockpit-system cockpit-storaged cockpit-networkmanager
  cockpit-packagekit cockpit-machines cockpit-sosreport cockpit-pcp
 )
 if ((EUID != 0)); then
  printf 'CRITICAL: Cockpit setup must run as root\n' >&2
  exit 2
 fi
 require_supported_ubuntu
 if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
 fi
 apt-get update
 DEBIAN_FRONTEND=noninteractive apt-get install -y "${required_packages[@]}"
 if apt-cache show cockpit-files >/dev/null 2>&1; then
  DEBIAN_FRONTEND=noninteractive apt-get install -y cockpit-files
  printf 'OK: installed optional cockpit-files package\n'
 else
  printf 'WARNING: cockpit-files is unavailable; continuing without it\n'
 fi
 systemctl enable --now cockpit.socket
 printf 'OK: Cockpit is enabled at https://%s:9090\n' "$(hostname)"
@@ -0,0 +1,136 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00-platform-guard.inc
 source "$SCRIPT_DIR/00-platform-guard.inc"
 SOURCE_CONFIG="$SCRIPT_DIR/../files/docker/daemon.json"
 DOCKER_CONFIG="/etc/docker/daemon.json"
 temporary_files=()
 cleanup() {
  local path
  for path in "${temporary_files[@]}"; do
    rm -f "$path"
  done
 }
 trap cleanup EXIT
 backup_file() {
  local path="$1"
  local backup
  backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
  install -m 0644 "$path" "$backup"
  printf 'INFO: backed up %s to %s\n' "$path" "$backup"
 }
 if ((EUID != 0)); then
  printf 'CRITICAL: Docker setup must run as root\n' >&2
  exit 2
 fi
 require_supported_ubuntu
 for command_name in apt-get apt-cache; do
  if ! command -v "$command_name" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
    exit 2
  fi
 done
 apt-get update
 DEBIAN_FRONTEND=noninteractive apt-get install -y ca-certificates curl gnupg jq
 if apt-cache show docker.io >/dev/null 2>&1; then
  packages=(docker.io)
  if apt-cache show docker-compose-v2 >/dev/null 2>&1; then
    packages+=(docker-compose-v2)
  else
    printf 'WARNING: docker-compose-v2 is unavailable from Ubuntu repositories\n'
  fi
 else
  printf 'WARNING: docker.io is unavailable; configuring Docker official repository\n'
  install -d -m 0755 /etc/apt/keyrings
  tmp_key="$(mktemp)"
  temporary_files+=("$tmp_key")
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
    | gpg --dearmor --yes -o "$tmp_key"
  if [[ ! -f /etc/apt/keyrings/docker.gpg ]] || \
    ! cmp -s "$tmp_key" /etc/apt/keyrings/docker.gpg; then
    if [[ -f /etc/apt/keyrings/docker.gpg ]]; then
      backup_file /etc/apt/keyrings/docker.gpg
    fi
    install -m 0644 "$tmp_key" /etc/apt/keyrings/docker.gpg
  fi
  # shellcheck disable=SC1091
  source /etc/os-release
  architecture="$(dpkg --print-architecture)"
  tmp_repository="$(mktemp)"
  temporary_files+=("$tmp_repository")
  printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu %s stable\n' \
    "$architecture" "${VERSION_CODENAME:?}" \
    >"$tmp_repository"
  if [[ ! -f /etc/apt/sources.list.d/docker.list ]] || \
    ! cmp -s "$tmp_repository" /etc/apt/sources.list.d/docker.list; then
    if [[ -f /etc/apt/sources.list.d/docker.list ]]; then
      backup_file /etc/apt/sources.list.d/docker.list
    fi
    install -m 0644 "$tmp_repository" /etc/apt/sources.list.d/docker.list
  fi
  apt-get update
  packages=(docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin)
 fi
 DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
 install -d -m 0755 /etc/docker
 if [[ ! -r "$SOURCE_CONFIG" ]]; then
  printf 'CRITICAL: Docker configuration template is missing: %s\n' "$SOURCE_CONFIG" >&2
  exit 2
 fi
 jq empty "$SOURCE_CONFIG"
 tmp_config="$(mktemp)"
 temporary_files+=("$tmp_config")
 if [[ -f "$DOCKER_CONFIG" ]]; then
  if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
    printf 'CRITICAL: %s is invalid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
    exit 1
  fi
  jq '. + {
    "log-driver": "json-file",
    "log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
  }' "$DOCKER_CONFIG" >"$tmp_config"
 else
  install -m 0644 "$SOURCE_CONFIG" "$tmp_config"
 fi
 jq empty "$tmp_config"
 config_changed=0
 if [[ ! -f "$DOCKER_CONFIG" ]] || ! cmp -s "$tmp_config" "$DOCKER_CONFIG"; then
  if [[ -f "$DOCKER_CONFIG" ]]; then
    backup_file "$DOCKER_CONFIG"
  fi
  install -m 0644 "$tmp_config" "$DOCKER_CONFIG"
  config_changed=1
  printf 'OK: installed Docker daemon log limits\n'
 else
  printf 'OK: Docker daemon configuration is already current\n'
 fi
 systemctl enable --now docker
 if ((config_changed == 1)); then
  systemctl restart docker
 fi
 docker version
 if docker compose version >/dev/null 2>&1; then
  docker compose version
 else
  printf 'WARNING: Docker Compose v2 is unavailable\n'
 fi
 printf 'OK: Docker setup completed\n'
@@ -0,0 +1,33 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00-platform-guard.inc
 source "$SCRIPT_DIR/00-platform-guard.inc"
 packages=(
  qemu-system-x86 qemu-utils libvirt-daemon-system libvirt-clients
  virtinst virt-manager bridge-utils ovmf swtpm swtpm-tools dnsmasq-base
 )
 if ((EUID != 0)); then
  printf 'CRITICAL: libvirt setup must run as root\n' >&2
  exit 2
 fi
 require_supported_ubuntu
 if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
 fi
 apt-get update
 DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
 systemctl enable --now libvirtd
 printf '\n== Virtual machines ==\n'
 virsh list --all || printf 'WARNING: unable to list virtual machines\n'
 printf '\n== Virtual networks ==\n'
 virsh net-list --all || printf 'WARNING: unable to list virtual networks\n'
 printf 'OK: libvirt/KVM setup completed\n'
@@ -0,0 +1,88 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00-platform-guard.inc
 source "$SCRIPT_DIR/00-platform-guard.inc"
 driver_version=""
 usage() {
  cat <<'EOF'
 Usage: sudo ./06-nvidia-tools.sh [--install-driver VERSION]
 Without --install-driver, only non-driver diagnostic tools are installed.
 EOF
 }
 while (($# > 0)); do
  case "$1" in
    --install-driver)
      if (($# < 2)); then
        printf 'CRITICAL: --install-driver requires a VERSION\n' >&2
        exit 2
      fi
      driver_version="$2"
      if [[ ! "$driver_version" =~ ^[0-9]+$ ]]; then
        printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
        exit 2
      fi
      shift
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1" >&2
      exit 2
      ;;
  esac
  shift
 done
 if ((EUID != 0)); then
  printf 'CRITICAL: NVIDIA tooling setup must run as root\n' >&2
  exit 2
 fi
 require_supported_ubuntu
 if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
 fi
 apt-get update
 DEBIAN_FRONTEND=noninteractive apt-get install -y nvtop clinfo pciutils
 printf '\n== NVIDIA PCI devices ==\n'
 lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
 printf '\n== NVIDIA runtime ==\n'
 if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi || printf 'WARNING: nvidia-smi returned an error\n'
 else
  printf 'INFO: nvidia-smi is not installed\n'
 fi
 printf '\n== DKMS ==\n'
 if command -v dkms >/dev/null 2>&1; then
  dkms status || printf 'WARNING: dkms status returned an error\n'
 else
  printf 'INFO: dkms is not installed\n'
 fi
 if [[ -n "$driver_version" ]]; then
  driver_package="nvidia-driver-$driver_version"
  if ! apt-cache show "$driver_package" >/dev/null 2>&1; then
    printf 'CRITICAL: requested NVIDIA driver package is unavailable: %s\n' \
      "$driver_package" >&2
    exit 1
  fi
  DEBIAN_FRONTEND=noninteractive apt-get install -y "$driver_package"
  printf 'WARNING: NVIDIA driver %s was installed; reboot before validation\n' \
    "$driver_version"
 else
  printf 'OK: NVIDIA diagnostic tools installed; no driver was installed\n'
 fi
@@ -0,0 +1,67 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00-platform-guard.inc
 source "$SCRIPT_DIR/00-platform-guard.inc"
 JOURNAL_SOURCE="$SCRIPT_DIR/../files/systemd/journald-ailab-limits.conf"
 JOURNAL_DEST="/etc/systemd/journald.conf.d/ailab-limits.conf"
 SYSCTL_SOURCE="$SCRIPT_DIR/../files/sysctl/99-ailab.conf"
 SYSCTL_DEST="/etc/sysctl.d/99-ailab.conf"
 install_config() {
  local source_path="$1"
  local destination_path="$2"
  local mode="$3"
  local backup
  install -d -m 0755 "$(dirname "$destination_path")"
  if [[ -f "$destination_path" ]] && cmp -s "$source_path" "$destination_path"; then
    printf 'OK: %s is already current\n' "$destination_path"
    return 0
  fi
  if [[ -f "$destination_path" ]]; then
    backup="${destination_path}.$(date '+%Y%m%d-%H%M%S').bak"
    install -m "$mode" "$destination_path" "$backup"
    printf 'INFO: backed up %s to %s\n' "$destination_path" "$backup"
  fi
  install -m "$mode" "$source_path" "$destination_path"
  printf 'OK: installed %s\n' "$destination_path"
 }
 if ((EUID != 0)); then
  printf 'CRITICAL: tuning setup must run as root\n' >&2
  exit 2
 fi
 require_supported_ubuntu
 for source_path in "$JOURNAL_SOURCE" "$SYSCTL_SOURCE"; do
  if [[ ! -r "$source_path" ]]; then
    printf 'CRITICAL: required configuration is missing: %s\n' "$source_path" >&2
    exit 2
  fi
 done
 if ! command -v sysctl >/dev/null 2>&1 || ! command -v systemctl >/dev/null 2>&1; then
  printf 'CRITICAL: sysctl and systemctl are required\n' >&2
  exit 2
 fi
 if ! command -v sensors-detect >/dev/null 2>&1 || \
  ! systemctl cat sysstat.service >/dev/null 2>&1; then
  apt-get update
  DEBIAN_FRONTEND=noninteractive apt-get install -y lm-sensors sysstat
 fi
 install_config "$JOURNAL_SOURCE" "$JOURNAL_DEST" 0644
 install_config "$SYSCTL_SOURCE" "$SYSCTL_DEST" 0644
 sysctl --system
 systemctl restart systemd-journald
 systemctl enable --now sysstat
 if command -v sensors-detect >/dev/null 2>&1; then
  sensors-detect --auto || printf 'WARNING: sensors-detect did not complete successfully\n'
 fi
 printf 'OK: host tuning completed\n'
@@ -0,0 +1,61 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 # shellcheck source=00-platform-guard.inc
 source "$SCRIPT_DIR/00-platform-guard.inc"
 enable_ufw=0
 usage() {
  cat <<'EOF'
 Usage: sudo ./08-security-baseline.sh [--enable-ufw]
 Installs fail2ban and UFW. UFW is enabled only with the explicit flag.
 EOF
 }
 while (($# > 0)); do
  case "$1" in
    --enable-ufw)
      enable_ufw=1
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1" >&2
      exit 2
      ;;
  esac
  shift
 done
 if ((EUID != 0)); then
  printf 'CRITICAL: security baseline setup must run as root\n' >&2
  exit 2
 fi
 require_supported_ubuntu
 if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
 fi
 apt-get update
 DEBIAN_FRONTEND=noninteractive apt-get install -y fail2ban ufw
 systemctl enable --now fail2ban
 if ((enable_ufw == 1)); then
  printf 'WARNING: UFW was explicitly requested; SSH and Cockpit rules will be added before enablement\n'
  ufw allow OpenSSH
  ufw allow 9090/tcp comment 'Cockpit'
  ufw --force enable
 else
  printf 'WARNING: UFW is installed but was not enabled; use --enable-ufw after reviewing access requirements\n'
 fi
 ufw status verbose || printf 'WARNING: unable to read UFW status\n'
 printf 'OK: security baseline completed\n'
@@ -0,0 +1,69 @@
 #!/usr/bin/env bash
 set -o errexit
 set -o nounset
 set -o pipefail
 section() {
  printf '\n== %s ==\n' "$1"
 }
 run_optional() {
  local description="$1"
  shift
  if "$@"; then
    return 0
  fi
  printf 'WARNING: %s failed\n' "$description"
  return 0
 }
 section "Failed systemd units"
 if command -v systemctl >/dev/null 2>&1; then
  run_optional "failed systemd unit report" systemctl --failed --no-pager
  section "Selected service status"
  for unit in cockpit.socket docker.service libvirtd.service fail2ban.service; do
    if systemctl cat "$unit" >/dev/null 2>&1; then
      run_optional "$unit status" systemctl status "$unit" --no-pager
    else
      printf 'INFO: %s is not installed\n' "$unit"
    fi
  done
 else
  printf 'WARNING: systemctl is unavailable\n'
 fi
 section "Docker"
 if command -v docker >/dev/null 2>&1; then
  run_optional "Docker container list" docker ps
 else
  printf 'INFO: Docker is not installed\n'
 fi
 section "Libvirt"
 if command -v virsh >/dev/null 2>&1; then
  run_optional "libvirt guest list" virsh list --all
 else
  printf 'INFO: virsh is not installed\n'
 fi
 section "NVIDIA"
 if command -v nvidia-smi >/dev/null 2>&1; then
  run_optional "NVIDIA status" nvidia-smi
 else
  printf 'INFO: nvidia-smi is not installed\n'
 fi
 section "Filesystems"
 run_optional "filesystem report" df -hT
 section "Listening ports"
 if command -v ss >/dev/null 2>&1; then
  run_optional "listening port report" ss -tulpn
 else
  printf 'WARNING: ss is unavailable\n'
 fi
 printf '\nOK: postcheck completed; review warnings above\n'
 exit 0
@@ -1,8 +1,14 @@
 # platform-projects
-This directory is reserved for larger infrastructure platform topics and future case studies. The current implemented project is [infra-run](../infra-run/).
+This directory contains larger infrastructure platform topics and case studies, grouped into implemented projects and planning areas.
-Current subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
+## Implemented platform projects
 - [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
 ## Planning areas
 These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
 - `monitoring-zabbix`
 - `elk-log-analysis`
@@ -0,0 +1,233 @@
 # Slurm AI/HPC Cluster Automation Lab
 ## Executive summary
 This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.
 The goal is not to present a certified production platform. The goal is to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.
 ## What this project demonstrates
 - Slurm controller and worker node management.
 - Munge authentication across the cluster.
 - GPU node integration through Slurm GRES.
 - cgroup CPU, memory, and GPU device enforcement.
 - SlurmDBD with MariaDB-backed accounting.
 - `sacct`, `sreport`, and `sacctmgr` workflows.
 - QOS, fairshare, and multifactor priority configuration.
 - Node provisioning and decommissioning workflows.
 - Rolling OS upgrades with canary validation.
 - Health checks and auto-remediation.
 - Backup and restore-check workflow for the accounting database.
 - Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.
 ## Architecture overview
 ```mermaid
 flowchart LR
    operator[Ansible control node]
    munge[Munge authentication]
    controller[Slurm controller<br/>slurmctld]
    db[MariaDB + SlurmDBD<br/>accounting]
    shared[Shared filesystem<br/>site dependency]
    cpu_part[CPU partition]
    gpu_part[GPU partition]
    cpu_nodes[CPU compute nodes<br/>slurmd]
    gpu_node[GPU node<br/>slurmd + GRES]
    jobs[User jobs<br/>sbatch / srun]
    operator -->|bootstrap and configure| controller
    operator -->|configure workers| cpu_nodes
    operator -->|configure GPU worker| gpu_node
    operator -->|deploy key and service| munge
    munge --> controller
    munge --> cpu_nodes
    munge --> gpu_node
    controller -->|accounting RPC| db
    jobs -->|submit to Slurm| controller
    controller -->|schedule CPU jobs| cpu_part
    controller -->|schedule GPU jobs| gpu_part
    cpu_part --> cpu_nodes
    gpu_part --> gpu_node
    cpu_nodes --- shared
    gpu_node --- shared
    controller --- shared
 ```
 The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.
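As a rough illustration of how a GPU worker can be declared through GRES, the sketch below renders a `slurm.conf` node line and the matching `gres.conf` line from inventory-style values. The function names are hypothetical, and the real templates under `templates/` may render these lines differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not taken from the lab templates.

# Print a slurm.conf NodeName line. Gres= is added only when a GRES value is given.
render_node_line() {
  local name="$1" addr="$2" cpus="$3" mem="$4" gres="${5:-}"
  local line="NodeName=${name} NodeAddr=${addr} CPUs=${cpus} RealMemory=${mem}"
  if [ -n "$gres" ]; then
    line="${line} Gres=${gres}"
  fi
  printf '%s State=UNKNOWN\n' "$line"
}

# Print the gres.conf line that maps a GRES name to a device file on one node.
render_gres_line() {
  local name="$1" gres_name="$2" device="$3"
  printf 'NodeName=%s Name=%s File=%s\n' "$name" "$gres_name" "$device"
}
```

With the sanitized lab values, `render_node_line gpu01 10.10.10.14 12 60000 gpu:1` and `render_gres_line gpu01 gpu /dev/nvidia0` would describe the GPU worker, while a CPU worker simply omits the fifth argument.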
 ## Repository layout
 ```text
 inventories/lab/          Sanitized lab inventory and group variables
 playbooks/bootstrap/      Initial SSH, sudo, operator user, and host setup
 playbooks/core/           Munge, Slurm config, and safe restart workflows
 playbooks/accounting/     SlurmDBD, MariaDB, backup, restore-check, and reporting validation
 playbooks/qos/            QOS, fairshare, and priority configuration
 playbooks/lifecycle/      Node provisioning, inspection, and decommissioning
 playbooks/upgrade/        Canary and rolling OS upgrade workflows
 playbooks/health/         Health checks, repair, and auto-remediation
 playbooks/tests/          CPU, GPU, cgroup, accounting, and reporting validation jobs
 playbooks/backup/         Slurm and Munge state backup helpers
 templates/                Slurm, cgroup, GRES, and SlurmDBD templates
 docs/                     Operational runbook
 prompts/                  Documentation prompts used to expand this project
 ```
 ## Main operational workflows
 Run commands from `platform-projects/hpc-slurm-ai-cluster/`. Review inventory and variables before running any playbook.
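A small preflight check can catch the common mistake of running a playbook from the wrong directory. This is a minimal sketch: the helper names are hypothetical, and it assumes the Ansible configuration file is named `ansible.cfg` and the inventory lives at `inventories/lab/inventory.yml`.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers; the file names checked here are assumptions.

# Succeed only when DIR contains the files expected in the project root.
is_project_root() {
  local dir="${1:-.}"
  [ -f "$dir/ansible.cfg" ] && [ -f "$dir/inventories/lab/inventory.yml" ]
}

# Print a short verdict for the operator and return the matching status.
preflight() {
  local dir="${1:-.}"
  if is_project_root "$dir"; then
    printf 'OK: %s looks like the project root\n' "$dir"
  else
    printf 'ERROR: run from platform-projects/hpc-slurm-ai-cluster/\n' >&2
    return 1
  fi
}
```

Following a successful `preflight .` with `ansible-inventory --graph` is one way to review the host groups before any playbook runs.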
 ### Bootstrap access
 ```bash
 ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
 ansible-playbook playbooks/bootstrap/slurm-hosts.yml
 ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
 ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
 ```
 ### Deploy Munge
 ```bash
 ansible-playbook playbooks/core/manage-munge.yml
 ```
 ### Deploy Slurm config
 ```bash
 ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
 ansible-playbook playbooks/core/manage-slurm-config.yml --diff
 ansible-playbook playbooks/core/restart-slurm-safe.yml
 ```
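The check-then-apply sequence above can be wrapped so that the apply step only runs when the dry run succeeds. The wrapper below is a sketch, not part of the repository; `ANSIBLE_PLAYBOOK_BIN` is a hypothetical override that exists only so the wrapper can be exercised without Ansible installed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: dry run first, apply only if the dry run exits cleanly.
check_then_apply() {
  local playbook="$1"
  # Default to the real CLI; tests can point this at a stub command.
  local runner="${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}"
  "$runner" "$playbook" --check --diff || {
    printf 'ERROR: check run failed for %s; not applying\n' "$playbook" >&2
    return 1
  }
  "$runner" "$playbook" --diff
}
```

A clean dry run is a necessary signal rather than a sufficient one: Ansible skips `command` and `shell` tasks in check mode by default, so those tasks are only exercised by the real run.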
 ### Validate CPU jobs
 ```bash
 ansible-playbook playbooks/tests/validate-slurm-operator.yml
 ansible-playbook playbooks/tests/test-cpu-job.yml
 ```
 ### Validate GPU jobs
 ```bash
 ansible-playbook playbooks/tests/test-gpu-job.yml
 ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
 ```
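One way to reason about these two tests is to count the GPUs a job can actually see and compare that with what it requested. The helpers below are a hypothetical sketch that parses `nvidia-smi -L` style output; the lab playbooks may verify GPU visibility differently.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they only parse text and never touch Slurm or the driver.

# Count GPUs in `nvidia-smi -L` style output read from stdin.
# Device lines look like "GPU 0: <name> (UUID: ...)"; other lines are ignored.
count_listed_gpus() {
  grep -c '^GPU [0-9][0-9]*:' || true
}

# Succeed only when the observed GPU count equals the expected count.
expect_gpu_count() {
  local expected="$1" actual
  actual="$(count_listed_gpus)"
  if [ "$actual" -eq "$expected" ]; then
    printf 'OK: job saw %s GPU(s)\n' "$actual"
  else
    printf 'FAIL: expected %s GPU(s), job saw %s\n' "$expected" "$actual" >&2
    return 1
  fi
}
```

Under these assumptions, `srun --partition=gpu --gres=gpu:1 nvidia-smi -L | expect_gpu_count 1` would cover the positive case, and the same pipeline without `--gres` and with `expect_gpu_count 0` would cover the denial case.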
 ### Enable accounting
 ```bash
 ansible-playbook playbooks/accounting/setup-slurmdbd.yml
 ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
 ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
 ansible-playbook playbooks/tests/test-sreport-usage.yml
 ```
 ### Configure QOS and fairshare
 ```bash
 ansible-playbook playbooks/qos/configure-slurm-qos.yml
 ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
 ```
 ### Provision a node
 ```bash
 ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=<node>
 ansible-playbook playbooks/tests/test-specific-node.yml -e target_node=<node>
 ```
 ### Decommission a node
 ```bash
 ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml \
  -e target_node=<node> \
  -e "decom_reason=planned maintenance"
 ```
 ### Rolling OS upgrade
 ```bash
 ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=<node>
 ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
  -e canary_node=<node> \
  -e skip_canary=true
 ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
 ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
 ```
 ### Health check and auto-remediation
 ```bash
 ansible-playbook playbooks/health/check-slurm-health.yml
 ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
 ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=<node>
 ```
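The choice between checking, waiting, and repairing can be thought of as a mapping from node state to action. The mapping below is an illustrative assumption, not the logic of the lab playbooks; it takes state names in the extended form printed by `sinfo -h -N -o '%T'`.

```shell
#!/usr/bin/env bash
# Hypothetical mapping from a Slurm node state to a suggested operator action.
suggest_action() {
  local state="$1"
  # A trailing "*" marks a node that is not responding; strip it before matching.
  state="${state%\*}"
  case "$state" in
    idle|mixed|allocated) printf 'none\n' ;;
    draining)             printf 'wait\n' ;;
    drained|down)         printf 'repair\n' ;;
    *)                    printf 'investigate\n' ;;
  esac
}
```

In this sketch, `repair` would correspond to running `playbooks/health/repair-slurm-node.yml` with `-e target_node=<node>`, and `investigate` means a human should look before any automation runs.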
 ### Accounting backup and restore-check
 ```bash
 ansible-playbook playbooks/accounting/backup-slurmdbd.yml
 ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
 ```
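Before a restore-check imports a dump, it helps to confirm that the newest archive exists and is a readable gzip file. The helper below is a hypothetical sketch that assumes backups are named `<database>-<timestamp>.sql.gz` inside one directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the newest intact backup for a database, or fail.
latest_backup() {
  local dir="$1" db="$2" newest
  # Newest file first by modification time; empty when nothing matches.
  newest="$(ls -1t "$dir"/"$db"-*.sql.gz 2>/dev/null | head -n 1)" || true
  if [ -z "$newest" ]; then
    printf 'ERROR: no backup found for %s in %s\n' "$db" "$dir" >&2
    return 1
  fi
  # gzip -t verifies archive integrity without writing the decompressed data.
  if ! gzip -t "$newest" 2>/dev/null; then
    printf 'ERROR: backup is not a valid gzip file: %s\n' "$newest" >&2
    return 1
  fi
  printf '%s\n' "$newest"
}
```

For example, `latest_backup /var/backups/slurmdbd slurm_acct_db` would print the path a restore-check could then import. The integrity test only validates the compression layer, so it complements rather than replaces an actual import into a test database.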
 ## Operational maturity
 The lab goes beyond a static `slurm.conf` example: the playbooks add operational controls around the cluster.
 - Ansible workflows are designed to be repeatable and readable.
 - Configuration deployment supports check and diff review before applying changes.
 - Validation jobs exercise CPU scheduling, GPU scheduling, cgroup behavior, accounting, and reporting in the lab.
 - SlurmDBD and MariaDB accounting are configured with `sacct`, `sreport`, and `sacctmgr` validation.
 - QOS, fairshare, priority, and association workflows show resource governance.
 - Node lifecycle playbooks drain, decommission, reprovision, resume, and validate nodes.
 - Rolling upgrade playbooks include canary validation before broader worker upgrades.
 - Health and repair playbooks document remediation paths for common node states.
 - Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
 ## Tested capabilities
 - [x] CPU job scheduling.
 - [x] GPU job scheduling.
 - [x] GPU denial when no GRES is requested.
 - [x] CPU cgroup enforcement.
 - [x] SlurmDBD accounting setup.
 - [x] `sacct` job history visibility.
 - [x] `sreport` usage reporting.
 - [x] QOS creation and validation.
 - [x] Fairshare and priority visibility.
 - [x] Node decommission and reprovision workflow.
 - [x] Rolling upgrade canary workflow.
 - [x] Node health check and auto-remediation workflow.
 These checks represent sanitized lab validation, not a claim of production certification.
 ## Safety and sanitization
 This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.
 Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files such as `inventories/lab/group_vars/vault.yml`. Real credentials, including SlurmDBD database passwords, belong in Ansible Vault or another approved secret store.
 Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.
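A lightweight guard can flag risky file names before they are committed. The filter below is a sketch: the function name is hypothetical and the patterns are illustrative assumptions based on the list above, so they would need tuning for a real repository.

```shell
#!/usr/bin/env bash
# Hypothetical filter: read file paths from stdin and print those that should
# never be committed. Returns 0 when at least one path matched, 1 otherwise.
find_sensitive_paths() {
  local file found=1
  while IFS= read -r file; do
    case "$file" in
      *munge.key|*.sql.gz|*.sql|*/vault.yml|vault.yml|*id_rsa|*id_ed25519)
        printf '%s\n' "$file"
        found=0
        ;;
    esac
  done
  return "$found"
}
```

Piping `git diff --cached --name-only` into `find_sensitive_paths` from a pre-commit hook is one possible use. The example file `vault.example.yml` is deliberately not matched, because it holds only placeholders.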
 ## Why this matters for AI/HPC infrastructure roles
 AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.
 This project demonstrates practical understanding of:
 - Linux systems operations.
 - Slurm cluster operations.
 - GPU infrastructure and GRES scheduling.
 - Job scheduling and resource isolation.
 - Accounting, reporting, QOS, fairshare, and priority policy.
 - Automation and repeatability with Ansible.
 - Troubleshooting and operational ownership.
 ## Deeper docs
 - [Runbook](docs/runbook.md)
@@ -0,0 +1,14 @@
 [defaults]
 inventory = ./inventories/lab/inventory.yml
 host_key_checking = False
 retry_files_enabled = False
 stdout_callback = default
 callback_result_format = yaml
 interpreter_python = auto_silent
 timeout = 30
 roles_path = ./roles
 collections_path = ./collections
 [ssh_connection]
 pipelining = True
 ssh_args = -o ControlMaster=auto -o ControlPersist=60s
@@ -0,0 +1 @@
 Generated backups and reports can be stored here locally. This directory is ignored by git.
@@ -0,0 +1,75 @@
 # Slurm AI/HPC Lab Runbook
 ## Standard deployment order
 ```bash
 ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
 ansible-playbook playbooks/bootstrap/slurm-hosts.yml
 ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
 ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
 ansible-playbook playbooks/core/manage-munge.yml
 ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
 ansible-playbook playbooks/core/manage-slurm-config.yml --diff
 ansible-playbook playbooks/core/restart-slurm-safe.yml
 ansible-playbook playbooks/tests/validate-slurm-operator.yml
 ansible-playbook playbooks/tests/test-cpu-job.yml
 ansible-playbook playbooks/tests/test-gpu-job.yml
 ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
 ansible-playbook playbooks/accounting/setup-slurmdbd.yml
 ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
 ansible-playbook playbooks/accounting/backup-slurmdbd.yml
 ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
 ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
 ansible-playbook playbooks/qos/configure-slurm-qos.yml
 ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
 ansible-playbook playbooks/health/check-slurm-health.yml
 ```
 ## Node lifecycle
 Provision a node:
 ```bash
 ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
 ```
 Decommission a node:
 ```bash
 ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
 ```
 Repair a node:
 ```bash
 ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
 ```
 Run health remediation for nodes that can be recovered by the automated workflow:
 ```bash
 ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
 ```
 Back up Slurm and Munge state before planned lifecycle work:
 ```bash
 ansible-playbook playbooks/backup/backup-slurm-state.yml
 ansible-playbook playbooks/backup/fetch-slurm-backups.yml
 ```
 ## Rolling OS upgrade
 ```bash
 ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
 ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
 ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
 ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
 ```
 If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
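The fallback described above can be made explicit in a wrapper script. This is a hypothetical sketch that only prints the next step instead of executing it, so an operator can see whether the controller upgrade is automated or manual in their checkout.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the controller upgrade step for this checkout.
controller_upgrade_step() {
  local playbook="${1:-playbooks/upgrade/upgrade-slurm-controller.yml}"
  if [ -f "$playbook" ]; then
    printf 'ansible-playbook %s\n' "$playbook"
  else
    printf 'MANUAL: %s not found; upgrade the controller by hand\n' "$playbook"
  fi
}
```

Calling `controller_upgrade_step` with no arguments from the project root checks the default playbook path.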
@@ -0,0 +1,128 @@
 ---
 # Example lab inventory variables. Replace addresses, users and node topology for your environment.
 slurm_cluster_name: labcluster
 slurm_control_machine: slurm-ctl01
 slurm_control_addr: 10.10.10.11
 slurm_config_dir: /etc/slurm
 slurm_user: slurm
 slurm_operator_user: slurmuser
 slurmctld_port: 6817
 slurmd_port: 6818
 slurm_job_comp_type: jobcomp/none
 slurm_select_type: select/cons_tres
 slurm_select_type_parameters: CR_Core_Memory
 slurm_return_to_service: 2
 slurm_default_mpi_type: none
 slurm_gres_types: gpu
 slurm_nodes:
  - name: slurm-c01
    managed_state: present
    addr: 10.10.10.12
    cpus: 2
    real_memory: 1800
    features: ""
    gres: ""
    topology: ""
  - name: slurm-c02
    managed_state: present
    addr: 10.10.10.13
    cpus: 2
    real_memory: 1800
    features: ""
    gres: ""
    topology: ""
  - name: gpu01
    managed_state: present
    addr: 10.10.10.14
    cpus: 12
    real_memory: 60000
    features: "gpu"
    gres: "gpu:1"
    gres_file: /dev/nvidia0
    topology: "Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2"
 slurm_partitions:
  - name: debug
    managed_state: present
    nodes: "slurm-c[01-02]"
    default: "YES"
    max_time: "INFINITE"
    state: "UP"
  - name: gpu
    managed_state: present
    nodes: "gpu01"
    default: "NO"
    max_time: "INFINITE"
    state: "UP"
  - name: all
    managed_state: present
    nodes: "slurm-c[01-02],gpu01"
    default: "NO"
    max_time: "INFINITE"
    state: "UP"
 # Cgroup enforcement
 slurm_enable_cgroup: true
 slurm_task_plugin: task/cgroup,task/affinity
 slurm_proctrack_type: proctrack/cgroup
 slurm_job_acct_gather_type: jobacct_gather/cgroup
 # Slurm accounting / SlurmDBD
 slurm_accounting_storage_type: accounting_storage/slurmdbd
 slurm_accounting_storage_host: slurm-ctl01
 slurm_accounting_storage_port: 6819
 slurm_accounting_storage_enforce: associations,limits,qos
 slurm_accounting_storage_tres: cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu
 slurmdbd_host: slurm-ctl01
 slurmdbd_port: 6819
 slurmdbd_storage_type: accounting_storage/mysql
 slurmdbd_storage_host: localhost
 slurmdbd_storage_port: 3306
 slurmdbd_storage_loc: slurm_acct_db
 slurmdbd_storage_user: slurm
 # Use Ansible Vault in real environments. See inventories/lab/group_vars/vault.example.yml
 slurmdbd_storage_pass: "{{ vault_slurmdbd_storage_pass | default('CHANGE_ME_USE_ANSIBLE_VAULT') }}"
 slurm_account_name: lab
 slurm_account_description: "AI/HPC Slurm lab account"
 slurm_account_organization: "labcluster"
 # SlurmDBD purge / retention policy for lab
 slurmdbd_commit_delay: 1
 slurmdbd_purge_event_after: 12months
 slurmdbd_purge_job_after: 12months
 slurmdbd_purge_resv_after: 12months
 slurmdbd_purge_step_after: 3months
 slurmdbd_purge_suspend_after: 3months
 slurmdbd_purge_txn_after: 12months
 slurmdbd_purge_usage_after: 24months
 # Archive is disabled for the lab; backup playbooks handle database dumps.
 slurmdbd_archive_events: no
 slurmdbd_archive_jobs: no
 slurmdbd_archive_steps: no
 slurmdbd_archive_suspend: no
 slurmdbd_archive_txn: no
 slurmdbd_archive_usage: no
 # Slurm priority / fairshare
 slurm_priority_type: priority/multifactor
 slurm_priority_decay_half_life: 7-0
 slurm_priority_calc_period: 5
 slurm_priority_favor_small: "NO"
 slurm_priority_weight_age: 1000
 slurm_priority_weight_fairshare: 10000
 slurm_priority_weight_job_size: 1000
 slurm_priority_weight_partition: 1000
 slurm_priority_weight_qos: 10000
 slurm_priority_max_age: 1-0
@@ -0,0 +1,5 @@
 ---
 # Copy this file to vault.yml and encrypt it with ansible-vault.
 # ansible-vault encrypt inventories/lab/group_vars/vault.yml
 vault_slurmdbd_storage_pass: CHANGE_ME
@@ -0,0 +1,24 @@
 all:
  vars:
    ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
  children:
    slurm_cluster:
      children:
        slurm_controller:
          hosts:
            slurm-ctl01:
              ansible_host: 10.10.10.11
              ansible_user: ansible
        slurm_compute:
          hosts:
            slurm-c01:
              ansible_host: 10.10.10.12
              ansible_user: ansible
            slurm-c02:
              ansible_host: 10.10.10.13
              ansible_user: ansible
        slurm_gpu:
          hosts:
            gpu01:
              ansible_host: 10.10.10.14
              ansible_user: ansible
@@ -0,0 +1,90 @@
 ---
 - name: Backup SlurmDBD MariaDB database
  hosts: slurm_controller
  become: true
  gather_facts: true
  vars:
    slurmdbd_backup_dir: /var/backups/slurmdbd
    local_fetch_dir: "{{ playbook_dir }}/../../artifacts/backups/slurmdbd"
  tasks:
    - name: Create remote backup directory
      ansible.builtin.file:
        path: "{{ slurmdbd_backup_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0700"
    - name: Create local fetch directory on Ansible controller
      # Runs unprivileged on the control node, so ownership stays with the invoking user.
      ansible.builtin.file:
        path: "{{ local_fetch_dir }}"
        state: directory
        mode: "0700"
      delegate_to: localhost
      become: false
    - name: Validate MariaDB is running
      ansible.builtin.command:
        cmd: systemctl is-active mariadb
      changed_when: false
    - name: Validate SlurmDBD is running
      ansible.builtin.command:
        cmd: systemctl is-active slurmdbd
      changed_when: false
    - name: Validate Slurm accounting database exists
      ansible.builtin.shell: |
        set -euo pipefail
        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';" | grep -qx "{{ slurmdbd_storage_loc }}"
      args:
        executable: /bin/bash
      changed_when: false
    - name: Dump Slurm accounting database
      ansible.builtin.shell: |
        set -euo pipefail
        ts="$(date +%F-%H%M%S)"
        out="{{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-${ts}.sql.gz"
        mysqldump \
          --single-transaction \
          --routines \
          --events \
          --triggers \
          {{ slurmdbd_storage_loc }} | gzip -9 > "$out"
        chmod 0600 "$out"
        echo "$out"
      args:
        executable: /bin/bash
      register: db_dump
      changed_when: true
    - name: Stat backup file
      ansible.builtin.stat:
        path: "{{ db_dump.stdout }}"
      register: backup_file
    - name: Fail if backup file is missing or smaller than 1 KiB
      ansible.builtin.fail:
        msg: "Backup file is missing or smaller than 1 KiB: {{ db_dump.stdout }}"
      when: not backup_file.stat.exists or backup_file.stat.size | int < 1024
    - name: Fetch DB backup to Ansible controller
      ansible.builtin.fetch:
        src: "{{ db_dump.stdout }}"
        dest: "{{ local_fetch_dir }}/"
        flat: true
    - name: Show DB backup result
      ansible.builtin.debug:
        msg:
          - "Remote backup: {{ db_dump.stdout }}"
          - "Backup size bytes: {{ backup_file.stat.size }}"
          - "Fetched to: {{ local_fetch_dir }}/"
@@ -0,0 +1,126 @@
 ---
 - name: Initialize Slurm accounting entities
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Wait for sacctmgr connectivity
      ansible.builtin.command:
        cmd: sacctmgr -n list cluster
      register: sacctmgr_cluster_list
      retries: 20
      delay: 2
      until: sacctmgr_cluster_list.rc == 0
      changed_when: false
    - name: Show current accounting state before changes
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org
        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin
        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: accounting_state_before
      changed_when: false
    - name: Print current accounting state before changes
      ansible.builtin.debug:
        var: accounting_state_before.stdout_lines
    - name: Ensure Slurm cluster exists in accounting DB
      ansible.builtin.shell: |
        set -euo pipefail
        if sacctmgr -n list cluster format=Cluster | awk '{print $1}' | grep -qx "{{ slurm_cluster_name }}"; then
          echo "Cluster {{ slurm_cluster_name }} already exists"
        else
          sacctmgr -i add cluster {{ slurm_cluster_name }}
        fi
      args:
        executable: /bin/bash
      register: ensure_cluster
      changed_when: "'Adding Cluster' in ensure_cluster.stdout"
    - name: Ensure default lab account exists for cluster
      ansible.builtin.shell: |
        set -euo pipefail
        if sacctmgr -n -P list assoc format=Cluster,Account,User | awk -F'|' '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="" {found=1} END {exit !found}'; then
          echo "Account {{ slurm_account_name }} already associated with cluster {{ slurm_cluster_name }}"
        else
          sacctmgr -i add account {{ slurm_account_name }} \
            Cluster={{ slurm_cluster_name }} \
            Description="{{ slurm_account_description }}" \
            Organization="{{ slurm_account_organization }}"
        fi
      args:
        executable: /bin/bash
      register: ensure_account
      changed_when: "'Adding Account' in ensure_account.stdout"
    - name: Ensure slurmuser exists with lab account association
      ansible.builtin.shell: |
        set -euo pipefail
        if sacctmgr -n -P list assoc format=Cluster,Account,User | awk -F'|' '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="slurmuser" {found=1} END {exit !found}'; then
          echo "User slurmuser already associated with account {{ slurm_account_name }} on cluster {{ slurm_cluster_name }}"
        else
          sacctmgr -i add user slurmuser \
            Cluster={{ slurm_cluster_name }} \
            Account={{ slurm_account_name }} \
            DefaultAccount={{ slurm_account_name }}
        fi
      args:
        executable: /bin/bash
      register: ensure_user_assoc
      changed_when: "'Adding User' in ensure_user_assoc.stdout"
    - name: Ensure slurmuser has default account set
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
      args:
        executable: /bin/bash
      register: set_default_account
      changed_when: "'Modified user' in (set_default_account.stdout + set_default_account.stderr)"
    - name: Show final accounting state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org
        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin
        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: accounting_state_after
      changed_when: false
    - name: Print final accounting state
      ansible.builtin.debug:
        var: accounting_state_after.stdout_lines
@@ -0,0 +1,98 @@
 ---
 - name: Restore-check latest SlurmDBD backup into test database
  hosts: slurm_controller
  become: true
  gather_facts: false
  vars:
    restore_check_db: "{{ slurmdbd_storage_loc }}_restorecheck"
    slurmdbd_backup_dir: /var/backups/slurmdbd
  tasks:
    - name: Validate MariaDB is running
      ansible.builtin.command:
        cmd: systemctl is-active mariadb
      changed_when: false
    - name: Find latest SlurmDBD backup
      ansible.builtin.shell: |
        set -euo pipefail
        ls -1t {{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-*.sql.gz | head -n 1
      args:
        executable: /bin/bash
      register: latest_backup
      changed_when: false
    - name: Validate latest backup exists
      ansible.builtin.stat:
        path: "{{ latest_backup.stdout }}"
      register: latest_backup_stat
    - name: Fail if latest backup is missing or empty
      ansible.builtin.fail:
        msg: "Latest SlurmDBD backup is missing or empty: {{ latest_backup.stdout }}"
      when:
        - not latest_backup_stat.stat.exists or latest_backup_stat.stat.size | int < 1024
    - name: Recreate restore-check database
      ansible.builtin.shell: |
        set -euo pipefail
        mysql <<SQL
        DROP DATABASE IF EXISTS {{ restore_check_db }};
        CREATE DATABASE {{ restore_check_db }};
        SQL
      args:
        executable: /bin/bash
      changed_when: true
    - name: Import backup into restore-check database
      ansible.builtin.shell: |
        set -euo pipefail
        zcat "{{ latest_backup.stdout }}" | mysql {{ restore_check_db }}
      args:
        executable: /bin/bash
      changed_when: true
    - name: Validate restored table count
      ansible.builtin.shell: |
        set -euo pipefail
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
      args:
        executable: /bin/bash
      register: restored_tables
      changed_when: false
      failed_when: restored_tables.stdout | int < 1
    - name: Validate restored row count sample
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### restored database"
        echo "{{ restore_check_db }}"
        echo
        echo "### table count"
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
        echo
        echo "### largest tables"
        mysql -N -B -e "
          SELECT table_name, table_rows
          FROM information_schema.tables
          WHERE table_schema='{{ restore_check_db }}'
          ORDER BY table_rows DESC
          LIMIT 10;
        "
      args:
        executable: /bin/bash
      register: restore_check_summary
      changed_when: false
    - name: Show restore-check result
      ansible.builtin.debug:
        msg:
          - "Imported backup: {{ latest_backup.stdout }}"
          - "Restore-check DB: {{ restore_check_db }}"
          - "Restored tables: {{ restored_tables.stdout }}"
          - "Summary:"
          - "{{ restore_check_summary.stdout_lines }}"
@@ -0,0 +1,105 @@
 ---
 - name: Install and configure MariaDB for SlurmDBD
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Install MariaDB and SlurmDBD packages
      ansible.builtin.apt:
        name:
          - mariadb-server
          - mariadb-client
          - slurmdbd
          - slurm-wlm-mysql-plugin
        state: present
        update_cache: true
    - name: Ensure MariaDB is enabled and running
      ansible.builtin.systemd:
        name: mariadb
        enabled: true
        state: started
    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
    - name: Create Slurm accounting database and DB user
      ansible.builtin.shell: |
        set -euo pipefail
        mysql <<SQL
        CREATE DATABASE IF NOT EXISTS {{ slurmdbd_storage_loc }};
        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'localhost' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'127.0.0.1' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'localhost';
        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'127.0.0.1';
        FLUSH PRIVILEGES;
        SQL
      args:
        executable: /bin/bash
      changed_when: true
      # The rendered SQL contains the database password; keep it out of task output and logs.
      no_log: true
    - name: Ensure /etc/slurm exists
      ansible.builtin.file:
        path: /etc/slurm
        state: directory
        owner: root
        group: root
        mode: "0755"
    - name: Deploy slurmdbd.conf
      ansible.builtin.template:
        src: ../../templates/slurmdbd.conf.j2
        dest: /etc/slurm/slurmdbd.conf
        owner: slurm
        group: slurm
        mode: "0600"
      notify:
        - Restart slurmdbd
    - name: Ensure slurmdbd is enabled and running
      ansible.builtin.systemd:
        name: slurmdbd
        enabled: true
        state: started
    - name: Flush handlers before validation
      ansible.builtin.meta: flush_handlers
    - name: Validate slurmdbd service is active
      ansible.builtin.command:
        cmd: systemctl is-active slurmdbd
      register: slurmdbd_active
      retries: 10
      delay: 2
      until: slurmdbd_active.stdout == "active"
      changed_when: false
    - name: Validate slurmdbd is listening on port
      ansible.builtin.shell: |
        set -euo pipefail
        ss -lntp | grep ':{{ slurmdbd_port }} '
      args:
        executable: /bin/bash
      register: slurmdbd_port_check
      retries: 10
      delay: 2
      until: slurmdbd_port_check.rc == 0
      changed_when: false
    - name: Show slurmdbd service validation
      ansible.builtin.debug:
        msg:
          - "slurmdbd is active"
          - "{{ slurmdbd_port_check.stdout_lines }}"
  handlers:
    - name: Restart slurmdbd
      ansible.builtin.systemd:
        name: slurmdbd
        state: restarted
@@ -0,0 +1,178 @@
 ---
 - name: Validate Slurm accounting production-like setup
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Validate accounting services
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### services"
        systemctl is-active mariadb
        systemctl is-active slurmdbd
        systemctl is-active slurmctld
        echo
        echo "### slurmdbd listener"
        ss -lntp | grep ':{{ slurmdbd_port }} '
      args:
        executable: /bin/bash
      register: service_check
      changed_when: false
    - name: Validate Slurm accounting runtime config
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### accounting config"
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|ClusterName"
        echo
        echo "### priority / select / cgroup config"
        scontrol show config | grep -E "PriorityType|SelectType|TaskPlugin|ProctrackType"
      args:
        executable: /bin/bash
      register: config_check
      changed_when: false
    - name: Validate sacctmgr entities
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org
        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin
        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: entity_check
      changed_when: false
    - name: Submit accounting validation job
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=acct-prodlike-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/acct-prodlike-test-%j.out
        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/acct-prodlike-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: acct_job
      changed_when: true
    - name: Validate sacct can read recent jobs
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### recent jobs"
        sacct -S today --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
      args:
        executable: /bin/bash
      register: sacct_recent
      changed_when: false
    - name: Validate sreport commands
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### cluster utilization"
        sreport cluster utilization start=today || true
        echo
        echo "### account utilization by user"
        sreport cluster AccountUtilizationByUser start=today || true
        echo
        echo "### user top"
        sreport user top start=today || true
      args:
        executable: /bin/bash
      register: sreport_check
      changed_when: false
    - name: Validate MariaDB table health summary
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### database exists"
        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';"
        echo
        echo "### table count"
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ slurmdbd_storage_loc }}';"
        echo
        echo "### largest tables"
        mysql -N -B -e "
          SELECT table_name, table_rows
          FROM information_schema.tables
          WHERE table_schema='{{ slurmdbd_storage_loc }}'
          ORDER BY table_rows DESC
          LIMIT 10;
        "
      args:
        executable: /bin/bash
      register: db_health
      changed_when: false
    - name: Print accounting validation
      ansible.builtin.debug:
        msg:
          - "### services"
          - "{{ service_check.stdout_lines }}"
          - "### runtime config"
          - "{{ config_check.stdout_lines }}"
          - "### accounting entities"
          - "{{ entity_check.stdout_lines }}"
          - "### accounting validation job"
          - "{{ acct_job.stdout_lines }}"
          - "### recent sacct data"
          - "{{ sacct_recent.stdout_lines }}"
          - "### sreport"
          - "{{ sreport_check.stdout_lines }}"
          - "### database health"
          - "{{ db_health.stdout_lines }}"
@@ -0,0 +1,83 @@
 ---
 - name: Backup Slurm and Munge state on all cluster nodes
  hosts: slurm_cluster
  become: true
  gather_facts: true
  vars:
    backup_base_dir: /var/backups/slurm
  tasks:
    - name: Create backup base directory
      ansible.builtin.file:
        path: "{{ backup_base_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0700"
    - name: Create timestamped backup directory
      ansible.builtin.shell: |
        set -euo pipefail
        ts="$(date +%F-%H%M%S)"
        dir="{{ backup_base_dir }}/$ts"
        mkdir -p "$dir"
        echo "$dir"
      args:
        executable: /bin/bash
      register: backup_dir_result
      changed_when: true
    - name: Store backup directory fact
      ansible.builtin.set_fact:
        node_backup_dir: "{{ backup_dir_result.stdout }}"
    - name: Backup Slurm and Munge config/state if present
      ansible.builtin.shell: |
        set -euo pipefail
        backup_dir="{{ node_backup_dir }}"
        for p in \
          /etc/slurm \
          /etc/slurm-llnl \
          /etc/munge \
          /var/spool/slurmctld \
          /var/spool/slurmd \
          /var/log/slurm \
          /var/log/slurm-llnl
        do
          if [ -e "$p" ]; then
            cp -a "$p" "$backup_dir/"
          fi
        done
        systemctl status munge --no-pager > "$backup_dir/systemctl-munge.txt" 2>&1 || true
        systemctl status slurmctld --no-pager > "$backup_dir/systemctl-slurmctld.txt" 2>&1 || true
        systemctl status slurmd --no-pager > "$backup_dir/systemctl-slurmd.txt" 2>&1 || true
        journalctl -u munge -n 200 --no-pager > "$backup_dir/journal-munge.txt" 2>&1 || true
        journalctl -u slurmctld -n 200 --no-pager > "$backup_dir/journal-slurmctld.txt" 2>&1 || true
        journalctl -u slurmd -n 200 --no-pager > "$backup_dir/journal-slurmd.txt" 2>&1 || true
        if command -v sinfo >/dev/null 2>&1; then
          sinfo > "$backup_dir/sinfo.txt" 2>&1 || true
        fi
        if command -v scontrol >/dev/null 2>&1; then
          scontrol show config > "$backup_dir/scontrol-show-config.txt" 2>&1 || true
          scontrol show nodes > "$backup_dir/scontrol-show-nodes.txt" 2>&1 || true
          scontrol show partitions > "$backup_dir/scontrol-show-partitions.txt" 2>&1 || true
        fi
        find "$backup_dir" -maxdepth 2 \( -type f -o -type d \)
      args:
        executable: /bin/bash
      register: backup_content
      changed_when: true
    - name: Show backup location on node
      ansible.builtin.debug:
        msg:
          - "Host: {{ inventory_hostname }}"
          - "Backup directory: {{ node_backup_dir }}"
@@ -0,0 +1,46 @@
 ---
 - name: Fetch latest Slurm backups from nodes to pvef
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    remote_backup_base: /var/backups/slurm
    local_backup_base: "{{ playbook_dir }}/../../artifacts/backups"
  tasks:
    - name: Find latest remote backup directory
      ansible.builtin.shell: |
        set -euo pipefail
        ls -1dt {{ remote_backup_base }}/* | head -n 1
      args:
        executable: /bin/bash
      register: latest_backup_dir
      changed_when: false
    - name: Create local backup directory on pvef
      ansible.builtin.file:
        path: "{{ local_backup_base }}/{{ inventory_hostname }}"
        state: directory
        mode: "0700"
      delegate_to: localhost
      become: false
    - name: Archive latest backup directory on remote node
      community.general.archive:
        path: "{{ latest_backup_dir.stdout }}"
        dest: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        format: gz
        force_archive: true
      changed_when: true
    - name: Fetch archive to pvef
      ansible.builtin.fetch:
        src: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        dest: "{{ local_backup_base }}/{{ inventory_hostname }}/"
        flat: true
    - name: Remove temporary remote archive
      ansible.builtin.file:
        path: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        state: absent
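    # Sketch, not part of the original playbook: confirm the archive actually arrived
    # on the control host. The register name `fetched_archive` is illustrative; the
    # path mirrors the fetch task's dest (flat fetch into a per-host directory).
    - name: Check fetched archive exists on pvef
      ansible.builtin.stat:
        path: "{{ local_backup_base }}/{{ inventory_hostname }}/{{ inventory_hostname }}-slurm-backup.tgz"
      delegate_to: localhost
      become: false
      register: fetched_archive
    - name: Assert fetched archive is a non-empty file
      ansible.builtin.assert:
        that:
          - fetched_archive.stat.exists
          - fetched_archive.stat.size > 0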
@@ -0,0 +1,58 @@
 ---
 - name: Bootstrap Ansible SSH access from pvef to Slurm nodes
  hosts: slurm_cluster
  gather_facts: false
  become: true
  vars:
    ansible_controller_pubkey: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_ed25519.pub') }}"
  pre_tasks:
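    # wait_for_connection runs the ping module and so needs Python on the target;
    # probe the SSH port from the control host instead, before Python is installed.
    - name: Wait for SSH
      ansible.builtin.wait_for:
        host: "{{ ansible_host | default(inventory_hostname) }}"
        port: 22
        timeout: 30
      connection: local
      become: false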
    - name: Install Python if missing - Debian/Ubuntu
      ansible.builtin.raw: |
        test -e /usr/bin/python3 || (apt-get update && apt-get install -y python3)
      changed_when: false
  tasks:
    - name: Ensure sudo and OpenSSH server are installed
      ansible.builtin.apt:
        name:
          - sudo
          - openssh-server
        state: present
        update_cache: true
    - name: Ensure SSH server is enabled and running
      ansible.builtin.service:
        name: ssh
        state: started
        enabled: true
    - name: Ensure .ssh directory exists for login user
      ansible.builtin.file:
        path: "/home/{{ ansible_user }}/.ssh"
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: "0700"
    - name: Add pvef root public key to login user's authorized_keys
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"
        key: "{{ ansible_controller_pubkey }}"
        state: present
        manage_dir: true
    - name: Allow bootstrap login user passwordless sudo
      ansible.builtin.copy:
        dest: "/etc/sudoers.d/90-ansible-{{ ansible_user }}"
        owner: root
        group: root
        mode: "0440"
        content: |
          {{ ansible_user }} ALL=(ALL) NOPASSWD:ALL
        validate: "visudo -cf %s"
@@ -0,0 +1,16 @@
 ---
 - name: Configure /etc/hosts for Slurm cluster
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
    - name: Add Slurm cluster hosts to /etc/hosts
      ansible.builtin.blockinfile:
        path: /etc/hosts
        marker: "# {mark} ANSIBLE MANAGED SLURM CLUSTER HOSTS"
        block: |
          {{ slurm_control_addr }} {{ slurm_control_machine }}
          {% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
          {{ node.addr }} {{ node.name }}
          {% endfor %}
@@ -0,0 +1,218 @@
 ---
 - name: Create slurmuser and generate SSH keys on every Slurm node
  hosts: slurm_cluster
  become: true
  gather_facts: true
  vars:
    slurm_operator_user: slurmuser
    slurm_operator_shell: /bin/bash
  tasks:
    - name: Ensure useful packages are installed
      ansible.builtin.apt:
        name:
          - sudo
          - openssh-client
          - openssh-server
          - acl
        state: present
        update_cache: true
    - name: Ensure slurmuser exists
      ansible.builtin.user:
        name: "{{ slurm_operator_user }}"
        shell: "{{ slurm_operator_shell }}"
        create_home: true
        state: present
    - name: Ensure .ssh directory exists for slurmuser
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh"
        state: directory
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0700"
    - name: Generate SSH key for slurmuser if missing
      community.crypto.openssh_keypair:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
        type: ed25519
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0600"
        comment: "{{ slurm_operator_user }}@{{ inventory_hostname }}"
        force: false
    - name: Read public key from each node
      ansible.builtin.slurp:
        src: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
      register: slurmuser_pubkey_raw
    - name: Store decoded public key as host fact
      ansible.builtin.set_fact:
        slurmuser_pubkey: "{{ slurmuser_pubkey_raw.content | b64decode | trim }}"
 - name: Exchange slurmuser SSH keys across all Slurm nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Install all slurmuser public keys into authorized_keys on every node
      ansible.posix.authorized_key:
        user: "{{ slurm_operator_user }}"
        key: "{{ hostvars[item].slurmuser_pubkey }}"
        state: present
        manage_dir: true
      loop: "{{ groups['slurm_cluster'] }}"
    - name: Build SSH known_hosts entries for all cluster nodes
      ansible.builtin.shell: |
        set -e
        known_hosts="/home/{{ slurm_operator_user }}/.ssh/known_hosts"
        mkdir -p "$(dirname "$known_hosts")"
        touch "$known_hosts"
        # Unhashed entries on purpose: ssh-keyscan -H salts every hash, so sort -u could never deduplicate reruns.
        {% for host in groups['slurm_cluster'] %}
        ssh-keyscan {{ host }} {{ hostvars[host].ansible_host | default(host) }} 2>/dev/null >> "$known_hosts" || true
        {% endfor %}
        sort -u "$known_hosts" -o "$known_hosts"
        chown {{ slurm_operator_user }}:{{ slurm_operator_user }} "$known_hosts"
        chmod 0644 "$known_hosts"
      args:
        executable: /bin/bash
      changed_when: true
    - name: Ensure SSH permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh"
        state: directory
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0700"
    - name: Ensure private key permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0600"
    - name: Ensure public key permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0644"
 - name: Configure sudo permissions for slurmuser
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Configure sudoers for slurmuser on Slurm controller
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          # Operator access for Slurm controller node.
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
            /bin/systemctl status slurmctld, \
            /bin/systemctl restart slurmctld, \
            /bin/systemctl reload slurmctld, \
            /bin/systemctl stop slurmctld, \
            /bin/systemctl start slurmctld, \
            /bin/systemctl status slurmd, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl stop slurmd, \
            /bin/systemctl start slurmd, \
            /bin/journalctl -u slurmctld, \
            /bin/journalctl -u slurmd, \
            /usr/bin/scontrol, \
            /usr/bin/sinfo, \
            /usr/bin/squeue, \
            /usr/bin/scancel, \
            /usr/bin/sacct, \
            /usr/bin/sacctmgr, \
            /usr/bin/sbatch, \
            /usr/bin/srun, \
            /usr/bin/salloc
        validate: "visudo -cf %s"
      when: inventory_hostname in groups['slurm_controller']
    - name: Configure sudoers for slurmuser on Slurm compute and GPU nodes
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          # Operator access for Slurm worker/GPU nodes.
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
            /bin/systemctl status slurmd, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl stop slurmd, \
            /bin/systemctl start slurmd, \
            /bin/journalctl -u slurmd, \
            /usr/bin/scontrol, \
            /usr/bin/sinfo, \
            /usr/bin/squeue, \
            /usr/bin/scancel, \
            /usr/bin/sacct, \
            /usr/bin/sbatch, \
            /usr/bin/srun, \
            /usr/bin/salloc
        validate: "visudo -cf %s"
      when: inventory_hostname not in groups['slurm_controller']
 - name: Validate slurmuser SSH mesh and Slurm access
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Test local Slurm commands as slurmuser
      ansible.builtin.command: "sudo -iu {{ slurm_operator_user }} sinfo"
      register: sinfo_test
      changed_when: false
      failed_when: sinfo_test.rc != 0
    - name: Show sinfo result
      ansible.builtin.debug:
        var: sinfo_test.stdout_lines
    - name: Test SSH from each node to every other node as slurmuser
      ansible.builtin.shell: |
        set -e
        {% for host in groups['slurm_cluster'] %}
        ssh -o BatchMode=yes -o ConnectTimeout=5 {{ host }} 'hostname'
        {% endfor %}
      args:
        executable: /bin/bash
      become_user: "{{ slurm_operator_user }}"
      register: ssh_mesh_test
      changed_when: false
    - name: Show SSH mesh test result
      ansible.builtin.debug:
        var: ssh_mesh_test.stdout_lines
@@ -0,0 +1,112 @@
 ---
 - name: Fix sudo permissions for slurmuser Slurm operations
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Configure sudoers for slurmuser on controller
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          Cmnd_Alias SLURM_SYSTEMCTL_CONTROLLER = \
            /bin/systemctl status slurmctld, \
            /bin/systemctl status slurmctld *, \
            /bin/systemctl restart slurmctld, \
            /bin/systemctl reload slurmctld, \
            /bin/systemctl start slurmctld, \
            /bin/systemctl stop slurmctld, \
            /bin/systemctl status slurmd, \
            /bin/systemctl status slurmd *, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl start slurmd, \
            /bin/systemctl stop slurmd, \
            /usr/bin/systemctl status slurmctld, \
            /usr/bin/systemctl status slurmctld *, \
            /usr/bin/systemctl restart slurmctld, \
            /usr/bin/systemctl reload slurmctld, \
            /usr/bin/systemctl start slurmctld, \
            /usr/bin/systemctl stop slurmctld, \
            /usr/bin/systemctl status slurmd, \
            /usr/bin/systemctl status slurmd *, \
            /usr/bin/systemctl restart slurmd, \
            /usr/bin/systemctl reload slurmd, \
            /usr/bin/systemctl start slurmd, \
            /usr/bin/systemctl stop slurmd
          Cmnd_Alias SLURM_JOURNAL_CONTROLLER = \
            /bin/journalctl -u slurmctld, \
            /bin/journalctl -u slurmctld *, \
            /bin/journalctl -u slurmd, \
            /bin/journalctl -u slurmd *, \
            /usr/bin/journalctl -u slurmctld, \
            /usr/bin/journalctl -u slurmctld *, \
            /usr/bin/journalctl -u slurmd, \
            /usr/bin/journalctl -u slurmd *
          Cmnd_Alias SLURM_COMMANDS = \
            /usr/bin/scontrol, /usr/bin/scontrol *, \
            /usr/bin/sinfo, /usr/bin/sinfo *, \
            /usr/bin/squeue, /usr/bin/squeue *, \
            /usr/bin/scancel, /usr/bin/scancel *, \
            /usr/bin/sacct, /usr/bin/sacct *, \
            /usr/bin/sacctmgr, /usr/bin/sacctmgr *, \
            /usr/bin/sbatch, /usr/bin/sbatch *, \
            /usr/bin/srun, /usr/bin/srun *, \
            /usr/bin/salloc, /usr/bin/salloc *
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_CONTROLLER, SLURM_JOURNAL_CONTROLLER, SLURM_COMMANDS
        validate: "visudo -cf %s"
      when: inventory_hostname in groups['slurm_controller']
    - name: Configure sudoers for slurmuser on compute and GPU nodes
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          Cmnd_Alias SLURM_SYSTEMCTL_COMPUTE = \
            /bin/systemctl status slurmd, \
            /bin/systemctl status slurmd *, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl start slurmd, \
            /bin/systemctl stop slurmd, \
            /usr/bin/systemctl status slurmd, \
            /usr/bin/systemctl status slurmd *, \
            /usr/bin/systemctl restart slurmd, \
            /usr/bin/systemctl reload slurmd, \
            /usr/bin/systemctl start slurmd, \
            /usr/bin/systemctl stop slurmd
          Cmnd_Alias SLURM_JOURNAL_COMPUTE = \
            /bin/journalctl -u slurmd, \
            /bin/journalctl -u slurmd *, \
            /usr/bin/journalctl -u slurmd, \
            /usr/bin/journalctl -u slurmd *
          Cmnd_Alias SLURM_COMMANDS = \
            /usr/bin/scontrol, /usr/bin/scontrol *, \
            /usr/bin/sinfo, /usr/bin/sinfo *, \
            /usr/bin/squeue, /usr/bin/squeue *, \
            /usr/bin/scancel, /usr/bin/scancel *, \
            /usr/bin/sacct, /usr/bin/sacct *, \
            /usr/bin/sbatch, /usr/bin/sbatch *, \
            /usr/bin/srun, /usr/bin/srun *, \
            /usr/bin/salloc, /usr/bin/salloc *
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_COMPUTE, SLURM_JOURNAL_COMPUTE, SLURM_COMMANDS
        validate: "visudo -cf %s"
      when: inventory_hostname not in groups['slurm_controller']
@@ -0,0 +1,133 @@
 ---
 - name: Read Munge key from Slurm controller
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Check controller munge.key exists
      ansible.builtin.stat:
        path: /etc/munge/munge.key
      register: controller_munge_key
    - name: Fail if controller munge.key is missing
      ansible.builtin.fail:
        msg: "/etc/munge/munge.key is missing on controller. Do not continue."
      when: not controller_munge_key.stat.exists
    - name: Read controller munge.key
      ansible.builtin.slurp:
        src: /etc/munge/munge.key
      register: controller_munge_key_raw
    - name: Store controller Munge key as fact
      ansible.builtin.set_fact:
        cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
 - name: Deploy controller Munge key to all Slurm nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    controller_host: "{{ groups['slurm_controller'][0] }}"
  tasks:
    - name: Ensure munge package is installed
      ansible.builtin.apt:
        name:
          - munge
          - libmunge2
        state: present
        update_cache: true
    - name: Ensure munge group exists
      ansible.builtin.group:
        name: munge
        system: true
        state: present
    - name: Ensure munge user exists
      ansible.builtin.user:
        name: munge
        group: munge
        system: true
        shell: /usr/sbin/nologin
        home: /nonexistent
        create_home: false
        state: present
    - name: Ensure /etc/munge exists
      ansible.builtin.file:
        path: /etc/munge
        state: directory
        owner: munge
        group: munge
        mode: "0700"
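    - name: Deploy shared munge.key from controller
      # Decode on the node: the key is binary, and the b64decode filter in copy content can corrupt it.
      ansible.builtin.shell: |
        set -euo pipefail
        tmp="$(mktemp /etc/munge/munge.key.XXXXXX)"
        printf '%s' '{{ hostvars[controller_host].cluster_munge_key_b64 }}' | base64 -d > "$tmp"
        if cmp -s "$tmp" /etc/munge/munge.key; then rm -f "$tmp"; exit 0; fi
        chown munge:munge "$tmp"
        chmod 0400 "$tmp"
        mv -f "$tmp" /etc/munge/munge.key
        echo CHANGED
      args:
        executable: /bin/bash
      register: munge_key_deploy
      changed_when: "'CHANGED' in munge_key_deploy.stdout"
      no_log: true
      notify:
        - Restart munge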
    - name: Ensure /var/log/munge exists
      ansible.builtin.file:
        path: /var/log/munge
        state: directory
        owner: munge
        group: munge
        mode: "0755"
    - name: Ensure /var/lib/munge exists
      ansible.builtin.file:
        path: /var/lib/munge
        state: directory
        owner: munge
        group: munge
        mode: "0711"
    - name: Ensure /run/munge exists
      ansible.builtin.file:
        path: /run/munge
        state: directory
        owner: munge
        group: munge
        mode: "0755"
    - name: Ensure munge is enabled and running
      ansible.builtin.systemd:
        name: munge
        enabled: true
        state: started
  handlers:
    - name: Restart munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
 - name: Validate Munge locally on all nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
    - name: Test local munge encode/decode
      ansible.builtin.shell: |
        set -euo pipefail
        munge -n | unmunge
      args:
        executable: /bin/bash
      register: munge_local_test
      changed_when: false
    - name: Show local Munge validation
      ansible.builtin.debug:
        var: munge_local_test.stdout_lines
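    # Sketch, not part of the original playbook: a local encode/decode succeeds even
    # when a node holds a different key, so also compare each node's key checksum with
    # the controller's. Assumes the controller is a member of slurm_cluster, so its
    # registered result is reachable through hostvars; `munge_key_stat` is an
    # illustrative name.
    - name: Read munge.key checksum
      ansible.builtin.stat:
        path: /etc/munge/munge.key
        checksum_algorithm: sha256
      register: munge_key_stat
    - name: Assert munge.key matches the controller
      ansible.builtin.assert:
        that:
          - munge_key_stat.stat.checksum == hostvars[groups['slurm_controller'][0]].munge_key_stat.stat.checksum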
@@ -0,0 +1,132 @@
 ---
 - name: Prepare Slurm config directories and logs
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
    - name: Ensure Slurm config directory exists
      ansible.builtin.file:
        path: "{{ slurm_config_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0755"
    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
    - name: Ensure slurmctld spool directory exists on controller
      ansible.builtin.file:
        path: /var/spool/slurmctld
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
      when: inventory_hostname in groups['slurm_controller']
    - name: Ensure slurmd spool directory exists on workers
      ansible.builtin.file:
        path: /var/spool/slurmd
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
 - name: Deploy Slurm config files
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
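    - name: Check for an existing slurm.conf
      ansible.builtin.stat:
        path: "{{ slurm_config_dir }}/slurm.conf"
      register: existing_slurm_conf
    - name: Backup current slurm.conf before managed deployment
      ansible.builtin.copy:
        src: "{{ slurm_config_dir }}/slurm.conf"
        dest: "{{ slurm_config_dir }}/slurm.conf.pre-ansible-managed"
        remote_src: true
        owner: root
        group: root
        mode: "0644"
        force: false
      when: existing_slurm_conf.stat.exists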
    - name: Deploy managed slurm.conf
      ansible.builtin.template:
        src: ../../templates/slurm.conf.j2
        dest: "{{ slurm_config_dir }}/slurm.conf"
        owner: root
        group: root
        mode: "0644"
      notify:
        - Reconfigure slurmctld
        - Restart slurmd
    - name: Deploy managed cgroup.conf
      ansible.builtin.template:
        src: ../../templates/cgroup.conf.j2
        dest: "{{ slurm_config_dir }}/cgroup.conf"
        owner: root
        group: root
        mode: "0644"
      when: slurm_enable_cgroup | default(false) | bool
      notify:
        - Reconfigure slurmctld
        - Restart slurmd
    - name: Deploy managed gres.conf only on GPU nodes
      ansible.builtin.template:
        src: ../../templates/gres.conf.j2
        dest: "{{ slurm_config_dir }}/gres.conf"
        owner: root
        group: root
        mode: "0644"
      when: inventory_hostname in groups['slurm_gpu']
      notify:
        - Reconfigure slurmctld
        - Restart slurmd
  handlers:
    - name: Reconfigure slurmctld
      ansible.builtin.command:
        cmd: scontrol reconfigure
      when: inventory_hostname in groups['slurm_controller']
      changed_when: true
    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
 - name: Validate Slurm after config deployment
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Reconfigure controller
      ansible.builtin.command:
        cmd: scontrol reconfigure
      changed_when: true
    - name: Validate cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        scontrol show nodes
      args:
        executable: /bin/bash
      register: slurm_config_validation
      changed_when: false
    - name: Show validation output
      ansible.builtin.debug:
        var: slurm_config_validation.stdout_lines
@@ -0,0 +1,103 @@
 ---
 - name: Restart Slurm controller safely
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Restart munge on controller
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmctld on controller
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
        enabled: true
    - name: Wait for slurmctld to answer
      ansible.builtin.command:
        cmd: scontrol ping
      register: scontrol_ping
      retries: 15
      delay: 2
      until: scontrol_ping.rc == 0
      changed_when: false
    - name: Show controller ping
      ansible.builtin.debug:
        var: scontrol_ping.stdout_lines
 - name: Restart Slurm workers safely one by one
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1
  tasks:
    - name: Restart munge on worker
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmd on worker
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true
    - name: Wait for slurmd to be active
      ansible.builtin.command:
        cmd: systemctl is-active slurmd
      register: slurmd_active
      retries: 15
      delay: 2
      until: slurmd_active.stdout == "active"
      changed_when: false
    - name: Wait until this node is visible in Slurm
      ansible.builtin.command:
        cmd: scontrol show node {{ inventory_hostname }}
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: node_visible
      retries: 15
      delay: 2
      until: node_visible.rc == 0
      changed_when: false
 - name: Validate Slurm after restart
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Validate Slurm cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### scontrol ping"
        scontrol ping
        echo
        echo "### sinfo"
        sinfo
        echo
        echo "### nodes"
        scontrol show nodes
        echo
        echo "### partitions"
        scontrol show partitions
      args:
        executable: /bin/bash
      register: slurm_validation
      changed_when: false
    - name: Show Slurm validation
      ansible.builtin.debug:
        var: slurm_validation.stdout_lines
@@ -0,0 +1,40 @@
 ---
 - name: Discover node resources for Slurm config
  hosts: slurm_cluster
  become: true
  gather_facts: true
  tasks:
    - name: Discover CPU and memory
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST={{ inventory_hostname }}"
        echo "CPUS=$(nproc)"
        echo "REAL_MEMORY_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)"
        echo "SOCKETS=$(lscpu | awk -F: '/Socket\(s\)/ {gsub(/ /,"",$2); print $2}')"
        echo "CORES_PER_SOCKET=$(lscpu | awk -F: '/Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')"
        echo "THREADS_PER_CORE=$(lscpu | awk -F: '/Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')"
      args:
        executable: /bin/bash
      register: cpu_mem
      changed_when: false
    - name: Discover NVIDIA GPU if present
      ansible.builtin.shell: |
        set -euo pipefail
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: gpu_info
      changed_when: false
    - name: Show discovered resources
      ansible.builtin.debug:
        msg:
          - "{{ cpu_mem.stdout_lines }}"
          - "GPU:"
          - "{{ gpu_info.stdout_lines }}"
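    # Sketch, not part of the original playbook: render a candidate slurm.conf node
    # line from gathered facts for review. It is only printed, never written; compare
    # it with the output of `slurmd -C` on the node before using it.
    - name: Show candidate NodeName line
      ansible.builtin.debug:
        msg: "NodeName={{ inventory_hostname }} CPUs={{ ansible_processor_vcpus }} RealMemory={{ ansible_memtotal_mb }}"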
@@ -0,0 +1,89 @@
 ---
 - name: Inspect current Slurm and Munge state
  hosts: slurm_cluster
  become: true
  gather_facts: true
  tasks:
    - name: Basic host info
      ansible.builtin.shell: |
        set -e
        echo "HOST=$(hostname -f 2>/dev/null || hostname)"
        echo "SHORT_HOST=$(hostname -s)"
        echo "IP_ADDRESSES=$(hostname -I)"
        echo "OS=$(lsb_release -ds 2>/dev/null || grep PRETTY_NAME /etc/os-release || true)"
        echo "KERNEL=$(uname -r)"
      args:
        executable: /bin/bash
      register: host_info
      changed_when: false
    - name: Slurm package info
      ansible.builtin.shell: |
        dpkg -l | grep -Ei 'slurm|munge' || true
      args:
        executable: /bin/bash
      register: package_info
      changed_when: false
    - name: Slurm config paths
      ansible.builtin.shell: |
        set -e
        for p in /etc/slurm /etc/slurm-llnl /etc/munge; do
          echo "### $p"
          if [ -e "$p" ]; then
            find "$p" -maxdepth 2 -type f -printf "%m %u %g %p\n" | sort
          else
            echo "MISSING"
          fi
        done
      args:
        executable: /bin/bash
      register: config_paths
      changed_when: false
    - name: Service state
      ansible.builtin.shell: |
        for s in munge slurmctld slurmd; do
          echo "### $s"
          systemctl is-enabled "$s" 2>/dev/null || true
          systemctl is-active "$s" 2>/dev/null || true
        done
      args:
        executable: /bin/bash
      register: service_state
      changed_when: false
    - name: Slurm commands
      ansible.builtin.shell: |
        echo "### which"
        command -v sinfo || true
        command -v scontrol || true
        command -v sbatch || true
        command -v srun || true
        command -v munge || true
        command -v unmunge || true
        echo "### sinfo"
        sinfo 2>&1 || true
        echo "### scontrol ping"
        scontrol ping 2>&1 || true
      args:
        executable: /bin/bash
      register: slurm_commands
      changed_when: false
    - name: Show inspection report
      ansible.builtin.debug:
        msg:
          - "===== {{ inventory_hostname }} :: host_info ====="
          - "{{ host_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: packages ====="
          - "{{ package_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: config_paths ====="
          - "{{ config_paths.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: services ====="
          - "{{ service_state.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: slurm_commands ====="
          - "{{ slurm_commands.stdout_lines }}"
@@ -0,0 +1,216 @@
 ---
 - name: Detect problematic Slurm nodes
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Detect nodes needing remediation
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: bad_nodes_raw
      changed_when: false
    - name: Store bad node list
      ansible.builtin.set_fact:
        bad_nodes: "{{ bad_nodes_raw.stdout_lines }}"
    - name: Show detected problematic nodes
      ansible.builtin.debug:
        var: bad_nodes
 - name: Attempt auto-remediation on problematic nodes
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1
  vars:
    bad_nodes_from_controller: "{{ hostvars[groups['slurm_controller'][0]].bad_nodes | default([]) }}"
  tasks:
    - name: Skip healthy nodes
      ansible.builtin.meta: end_host
      when: inventory_hostname not in bad_nodes_from_controller
    - name: Restart Munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true
    - name: Validate local services after remediation attempt
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST=$(hostname)"
        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd
        echo
        echo "### munge"
        munge -n | unmunge >/dev/null
        echo "munge OK"
        echo
        echo "### controller ping"
        scontrol ping
        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true
        echo
        echo "### recent slurmd logs"
        journalctl -u slurmd -n 30 --no-pager || true
      args:
        executable: /bin/bash
      register: local_repair_check
      changed_when: false
    - name: Print local remediation result
      ansible.builtin.debug:
        var: local_repair_check.stdout_lines
 - name: Refresh controller and validate remediated nodes
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Restart slurmctld to refresh node states
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
    - name: Wait for controller
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false
    - name: Clear maintenance state on previously bad nodes
      ansible.builtin.shell: |
        set -euo pipefail
        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
        if [ -z "$bad_nodes" ]; then
          echo "No bad nodes detected. Nothing to clear."
          sinfo -N
          exit 0
        fi
        for node in $bad_nodes; do
          echo "### clearing state on $node"
          scontrol update NodeName="$node" State=RESUME 2>/dev/null || true
          scontrol update NodeName="$node" State=UNDRAIN 2>/dev/null || true
          scontrol update NodeName="$node" State=IDLE 2>/dev/null || true
        done
        sleep 5
        sinfo -N
      args:
        executable: /bin/bash
      register: clear_result
      changed_when: bad_nodes | default([]) | length > 0
    - name: Print clear-state result
      ansible.builtin.debug:
        var: clear_result.stdout_lines
    - name: Detect nodes still unhealthy after remediation
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: still_bad_nodes_raw
      changed_when: false
    - name: Store still bad nodes
      ansible.builtin.set_fact:
        still_bad_nodes: "{{ still_bad_nodes_raw.stdout_lines }}"
    - name: Drain nodes that remain unhealthy
      ansible.builtin.shell: |
        set -euo pipefail
        unresolved_nodes="{{ still_bad_nodes | join(' ') }}"
        if [ -z "$unresolved_nodes" ]; then
          echo "No unresolved unhealthy nodes."
          sinfo -N
          exit 0
        fi
        for node in $unresolved_nodes; do
          echo "### draining unresolved node $node"
          scontrol update NodeName="$node" State=DRAIN Reason="auto-remediation failed"
        done
        sinfo -N
      args:
        executable: /bin/bash
      register: drain_unresolved
      changed_when: still_bad_nodes | length > 0
    - name: Show remediation summary
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### initial bad nodes"
        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
        if [ -z "$bad_nodes" ]; then
          echo "none"
        else
          printf '%s\n' $bad_nodes
        fi
        echo
        echo "### still bad nodes"
        still_bad_nodes="{{ (still_bad_nodes | default([])) | join(' ') }}"
        if [ -z "$still_bad_nodes" ]; then
          echo "none"
        else
          printf '%s\n' $still_bad_nodes
        fi
        echo
        echo "### final sinfo"
        sinfo -N
        echo
        echo "### queue"
        squeue
      args:
        executable: /bin/bash
      register: remediation_summary
      changed_when: false
    - name: Print remediation summary
      ansible.builtin.debug:
        var: remediation_summary.stdout_lines
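The node-state filter used by the detection tasks above can be exercised without a cluster. A minimal sketch, assuming input shaped like `sinfo -N -h -o "%N %T"` output; `filter_bad_nodes` and the sample node names are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the playbook's awk filter.
# Reads "<node> <state>" lines on stdin and prints the unique names of
# nodes whose state suggests they need attention.
filter_bad_nodes() {
  awk 'tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}' | sort -u
}

# Sample lines standing in for real sinfo output. A node that belongs to
# several partitions is listed more than once, hence the sort -u above.
sample_sinfo='cpu01 idle
cpu02 idle*
cpu03 mixed
gpu01 drained
gpu01 drained
gpu02 allocated
gpu03 down*'

printf '%s\n' "$sample_sinfo" | filter_bad_nodes
```

With this sample the filter selects `cpu02`, `gpu01`, and `gpu03`. Plain `idle` is left alone, while `idle*` (a node that is not responding) is flagged.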
@@ -0,0 +1,149 @@
 ---
 - name: Check Slurm controller health
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Check controller services and cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### controller services"
        systemctl is-active munge
        systemctl is-active slurmctld
        systemctl is-active slurmdbd || true
        systemctl is-active mariadb || true
        echo
        echo "### slurm ping"
        scontrol ping
        echo
        echo "### nodes"
        sinfo -N
        echo
        echo "### partitions"
        sinfo
        echo
        echo "### queue"
        squeue
        echo
        echo "### problematic nodes"
        sinfo -N -h -o "%N %T %E" | awk '$2 !~ /idle|alloc|mix/ {print}' || true
        echo
        echo "### accounting"
        sacctmgr -n list cluster || true
        echo
        echo "### recent failed jobs"
        sacct -S today --state=FAILED,CANCELLED,TIMEOUT,NODE_FAIL,OUT_OF_MEMORY \
          --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,NodeList | tail -30 || true
      args:
        executable: /bin/bash
      register: controller_health
      changed_when: false
    - name: Print controller health
      ansible.builtin.debug:
        var: controller_health.stdout_lines
 - name: Check Slurm worker health
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true
  tasks:
    - name: Check worker services, config and connectivity
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST=$(hostname)"
        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
        echo "KERNEL=$(uname -r)"
        echo "UPTIME=$(uptime -p)"
        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd
        echo
        echo "### munge local test"
        munge -n | unmunge >/dev/null
        echo "munge OK"
        echo
        echo "### controller connectivity"
        getent hosts slurm-ctl01 || true
        scontrol ping
        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true
        echo
        echo "### config checksums"
        sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true
        echo
        echo "### shared filesystem"
        test -d /shared
        touch /shared/.slurm-health-$(hostname)
        ls -l /shared/.slurm-health-$(hostname)
        rm -f /shared/.slurm-health-$(hostname)
        echo
        echo "### cgroup"
        mount | grep cgroup || true
        echo
        echo "### gpu check"
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,driver_version,memory.total,temperature.gpu,utilization.gpu --format=csv,noheader || true
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: worker_health
      changed_when: false
    - name: Print worker health
      ansible.builtin.debug:
        var: worker_health.stdout_lines
 - name: Check Slurm-reported node state consistency
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Build Slurm node health summary
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### node summary"
        sinfo -N -o "%N %P %T %C %m %G %E"
        echo
        echo "### full problematic node details"
        for node in $(sinfo -N -h -o "%N %T" | awk '$2 ~ /down|drain|fail|unk|not_responding|idle\*/ {print $1}' | sort -u); do
          echo
          echo "### $node"
          scontrol show node "$node"
        done
      args:
        executable: /bin/bash
      register: slurm_node_summary
      changed_when: false
    - name: Print Slurm node summary
      ansible.builtin.debug:
        var: slurm_node_summary.stdout_lines
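The worker health play above prints `sha256sum` values for `slurm.conf` and `cgroup.conf` on each node, presumably so they can be compared across hosts. A minimal sketch of that comparison, assuming the per-host output has been collected into `<host> <sha256> <path>` lines; that layout, `find_config_drift`, and the sample values are hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: given "<host> <sha256> <path>" lines on stdin, print
# every path whose checksum differs between hosts. Prints nothing when all
# hosts agree.
find_config_drift() {
  awk '
    { key = $3 }
    !(key in first)  { first[key] = $2; next }  # remember first checksum per path
    first[key] != $2 { drift[key] = 1 }         # later host disagrees
    END { for (k in drift) print k }
  ' | sort
}

# Illustrative data only; real checksums are 64 hex characters.
sample_sums='ctl01 aaa111 /etc/slurm/slurm.conf
cpu01 aaa111 /etc/slurm/slurm.conf
gpu01 bbb222 /etc/slurm/slurm.conf
ctl01 ccc333 /etc/slurm/cgroup.conf
cpu01 ccc333 /etc/slurm/cgroup.conf'

printf '%s\n' "$sample_sums" | find_config_drift
```

Here only `/etc/slurm/slurm.conf` is reported, because `gpu01` carries a different checksum for it.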
Author	SHA1	Message	Date
Mateusz Suski	4e739c5c99	Add Linux fresh setup toolkit	2026-06-06 00:23:11 +00:00
Mateusz Suski	8cb92de06f	Add AI lab maintenance toolkit	2026-06-06 00:10:44 +00:00
Mateusz Suski	1843796e92	Document Slurm AI/HPC cluster project	2026-06-05 15:39:24 +00:00
Mateusz Suski	cd6830334b	Add Slurm AI/HPC cluster platform project	2026-06-05 15:38:56 +00:00
mateusz	e2624a7533	PDF CV file upload	2026-05-14 21:23:49 +02:00
Mateusz Suski	6475f76787	Add L2 incident triage report wrapper	2026-05-12 20:00:42 +00:00
Mateusz Suski	e851568c8c	Add standalone Bash incident check scripts	2026-05-11 18:49:00 +00:00
@@ -0,0 +1 @@
 Generated backups and reports can be stored here locally. This directory is ignored by git.