Compare commits


7 Commits

Author SHA1 Message Date
Mateusz Suski 4e739c5c99 Add Linux fresh setup toolkit
lint / shell-yaml-ansible (push) Failing after 16s
2026-06-06 00:23:11 +00:00
Mateusz Suski 8cb92de06f Add AI lab maintenance toolkit
lint / shell-yaml-ansible (push) Failing after 17s
2026-06-06 00:10:44 +00:00
Mateusz Suski 1843796e92 Document Slurm AI/HPC cluster project
lint / shell-yaml-ansible (push) Failing after 17s
2026-06-05 15:39:24 +00:00
Mateusz Suski cd6830334b Add Slurm AI/HPC cluster platform project
2026-06-05 15:38:56 +00:00
mateusz e2624a7533 PDF CV file upload
lint / shell-yaml-ansible (push) Failing after 16s
2026-05-14 21:23:49 +02:00
Mateusz Suski 6475f76787 Add L2 incident triage report wrapper
lint / shell-yaml-ansible (push) Failing after 17s
2026-05-12 20:00:42 +00:00
Mateusz Suski e851568c8c Add standalone Bash incident check scripts
lint / shell-yaml-ansible (push) Failing after 16s
2026-05-11 18:49:00 +00:00
122 changed files with 9759 additions and 8 deletions
+5
@@ -4,6 +4,8 @@
### Added
- Added Linux Fresh Setup Toolkit under `labs/linux/setup` for day-0 Ubuntu lab host bootstrap automation.
- Added AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation.
- Python tooling validation for operational scripts.
- `incident-log-summary` for general incident log summarization.
- `log-diff-checker` for pre-change and post-change log comparison.
@@ -11,6 +13,8 @@
- `jvm-log-analyzer` for JVM application log summaries.
- `journal-analyzer` for exported `journalctl` log review.
- `known-error-matcher` with JSON-based known error patterns.
- Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics.
- `incident_triage_report.sh` for L2 Markdown incident handover reports built from existing Bash incident checks.
- Repository-level Codex guidance:
  - `AGENTS.md`
  - `docs/codex/README.md`
@@ -34,6 +38,7 @@
- IBM AIX 7 role and playbook.
- Shared sanitized Ansible inventory defaults for Linux and AIX examples.
- Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
### Changed
Binary file not shown.
+3
@@ -30,6 +30,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
- [infra-run](./infra-run/) - the main implemented project in this repository.
- [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
- [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
- [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
- [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
@@ -41,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
## Planned Areas
@@ -105,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
- Veritas VxVM/VCS operational awareness.
- GPFS / IBM Spectrum Scale operational awareness.
- Ansible role organization for selected hardening controls.
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
- Clear documentation of what was tested and what still needs a real system.
+1
@@ -20,6 +20,7 @@ This file keeps future portfolio ideas in one place so empty folders do not look
## Implemented Portfolio Additions
- Standalone Bash incident checks under `infra-run/scripts/bash/incident-checks/` for common Linux incident triage and ticket evidence.
- Python operational log analysis suite under `infra-run/scripts/python/`:
  - `incident-log-summary`
  - `log-diff-checker`
+1
@@ -9,6 +9,7 @@ The goal is to show operational judgment, not to ship a universal automation pro
### Bash Operational Scripts
- [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, and JVM diagnostics, plus an L2 Markdown triage report wrapper.
- [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
- [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
- [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.
+1
@@ -16,6 +16,7 @@ This file tracks planned `infra-run` additions without presenting them as comple
## Implemented Additions
- `infra-run/scripts/bash/incident-checks/` - standalone read-only Bash checks for CPU, memory/OOM, service restart loops, failed SSH logins, TLS certificate expiry, DNS connectivity, time sync drift, read-only filesystems, inode pressure, and JVM process diagnostics.
- `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files.
- `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review.
- `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review.
+1
@@ -7,5 +7,6 @@ These files use fake hostnames, reserved example domains, reserved IP address ra
## Included
- `disk-full/` - sample filesystem usage, deleted open files, and a short after-action report.
- `incident-triage/` - sample L2 incident triage report for repeatable handoff and ticket evidence.
- `veritas/` - sample VxVM disk and VCS service group output.
- `gpfs/` - sample GPFS cluster and NSD output.
@@ -0,0 +1,131 @@
# L2 Incident Triage Report
- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago
## Executed Checks
| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |
## Summary
- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected
## Raw Evidence
### CPU saturation
Script: `check_high_cpu.sh`
Command: `./check_high_cpu.sh`
Status: OK, exit: 0
```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
Load average:
1m=0.42 5m=0.38 15m=0.31
Top CPU processes:
PID PPID USER %CPU %MEM COMMAND ARGS
1450 1 app 7.2 2.1 nginx nginx: worker process
Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```
### Memory and OOM
Script: `check_high_memory_oom.sh`
Command: `./check_high_memory_oom.sh --since "30 minutes ago"`
Status: WARNING, exit: 1
```text
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583
OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```
### Service restart loop
Script: `check_service_restart_loop.sh`
Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`
Status: OK, exit: 0
```text
OK: Service nginx state=active substate=running restarts=0
Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```
### Skipped or limited checks
```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
```
## L2 Handover Checklist
- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified
## Escalation Notes
- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.
## Recommended Next Steps
- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
+15 -6
@@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. T
```mermaid
flowchart TD
A["bash"] --> B["os-healthcheck"]
A --> C["incident-checks"]
A --> D["disk-full"]
A --> E["veritas"]
A --> F["gpfs"]
B --> B1["Host diagnostics"]
C --> C1["Standalone triage checks"]
D --> D1["Incident workflow"]
E --> E1["VxVM and VCS change flow"]
F --> F1["Spectrum Scale expansion flow"]
```
## Scripts
@@ -23,6 +25,7 @@ flowchart TD
- `os-healthcheck/service_check.sh` - critical service status check.
- `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
- `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
- `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics.
## Usage
@@ -37,6 +40,12 @@ cd infra-run/scripts/bash/os-healthcheck
./system_report.sh
./network_troubleshoot.sh
./network_troubleshoot.sh google.com
cd ../incident-checks
./check_high_cpu.sh
./check_high_memory_oom.sh --since "24 hours ago"
./check_service_restart_loop.sh --service sshd
./check_certificate_expiry.sh --host example.com
```
## Standards
@@ -0,0 +1,124 @@
# Bash Incident Checks
Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident, run without repository context, and pasted into an incident or change ticket as evidence.
They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.
## Scripts
- `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes.
- `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence.
- `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines.
- `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs.
- `check_certificate_expiry.sh` - remote or local TLS certificate expiry check.
- `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints.
- `check_ntp_time_drift.sh` - time sync status and offset evidence when available.
- `check_filesystem_readonly.sh` - read-only filesystem detection.
- `check_inode_usage.sh` - inode pressure and top affected mount points.
- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.
- `incident_triage_report.sh` - wrapper that runs selected checks and writes a single Markdown L2 handover report.
## Usage Examples
```bash
./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15
./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5
./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"
./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25
./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt
./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432
./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000
./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system
./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90
./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name
./incident_triage_report.sh --type cpu
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
./incident_triage_report.sh --type network --host app.example.com --port 443
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
```
## L2 Triage Report Wrapper
`incident_triage_report.sh` collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.
Supported report types are `cpu`, `memory`, `service`, `network`, `auth`, `cert`, `filesystem`, `jvm`, and `all`.
The wrapper is read-only apart from writing the requested `--output` file. It does not require root and skips checks safely when an underlying script is missing, not executable, or missing required context such as `--service` or `--host`.
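The skip behavior described above could be sketched roughly as follows. This is an illustration only, not the wrapper's actual code: `run_or_skip` and its messages are hypothetical names chosen for the example.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: one check that cannot run should not fail the whole report.
run_or_skip() {
  local script_path="$1"
  shift
  if [[ ! -e "$script_path" ]]; then
    printf 'SKIPPED: %s is missing\n' "$script_path"
    return 0
  fi
  if [[ ! -x "$script_path" ]]; then
    printf 'SKIPPED: %s is not executable\n' "$script_path"
    return 0
  fi
  # Run the check with the remaining arguments and record its exit code.
  local rc=0
  "$script_path" "$@" || rc=$?
  printf 'exit=%s\n' "$rc"
}

# The path below does not exist, so the helper reports a skip instead of failing.
run_or_skip ./no_such_check.sh --service nginx
```

Returning `0` after a skip keeps the report going. The skip message itself is what tells the next responder which evidence is missing.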
## Exit Codes
- `0` - OK.
- `1` - WARNING or operational issue detected.
- `2` - invalid input or missing required dependency.
- `3` - CRITICAL issue detected.
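A caller could translate these exit codes into labels before deciding whether to escalate. The sketch below is a hedged illustration: `describe_exit_code` and `run_and_label` are hypothetical helpers, not part of the scripts in this directory, and stand-in commands replace the real checks so the example runs anywhere.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a check's exit code to the labels documented above.
describe_exit_code() {
  case "$1" in
    0) printf 'OK\n' ;;
    1) printf 'WARNING\n' ;;
    2) printf 'INVALID_INPUT_OR_MISSING_DEPENDENCY\n' ;;
    3) printf 'CRITICAL\n' ;;
    *) printf 'UNKNOWN\n' ;;
  esac
}

# Hypothetical helper: run any command, discard its output, and print the label.
run_and_label() {
  local rc=0
  "$@" >/dev/null 2>&1 || rc=$?
  describe_exit_code "$rc"
}

# Stand-in commands; in practice the argument would be a check such as
# ./check_high_cpu.sh.
run_and_label true
run_and_label bash -c 'exit 3'
```

Because `2` means the check could not run properly, a caller should treat it as "no evidence collected" rather than as a severity between WARNING and CRITICAL.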
## Supported Platforms
These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.
Some data sources vary by distribution:
- RHEL-like systems often use `/var/log/secure` and `/var/log/messages`.
- Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`.
- systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available.
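One way to handle these distribution differences is to probe a list of candidate files and use the first readable one. The sketch below assumes a hypothetical helper named `first_readable_log`; it illustrates the idea and is not taken from the check scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first readable file among the candidates given.
first_readable_log() {
  local candidate
  for candidate in "$@"; do
    if [[ -r "$candidate" ]]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Debian/Ubuntu path first, then the RHEL-like path from the list above.
if auth_log="$(first_readable_log /var/log/auth.log /var/log/secure)"; then
  printf 'Using authentication log: %s\n' "$auth_log"
else
  printf 'WARNING: no readable authentication log found; try journalctl or elevated permissions\n'
fi
```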
## Safety Notes
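- Check scripts are read-only; `incident_triage_report.sh` writes only the requested `--output` file.
- Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files; some checks use a temporary `mktemp` file that is removed on exit.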
- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
- Treat output as triage evidence, not as complete root-cause analysis.
## Dependency Notes
Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`.
Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files.
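A runtime dependency check along these lines might look like the sketch below. The helper names `require_commands` and `note_optional` are hypothetical; the message format and exit code `2` follow the conventions documented in this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report every missing required command, then return 2
# (the documented "missing required dependency" exit code) if any is missing.
require_commands() {
  local missing=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf 'CRITICAL: required command not found: %s\n' "$cmd"
      missing=1
    fi
  done
  if ((missing == 1)); then
    return 2
  fi
  return 0
}

# Hypothetical helper: mention a missing optional command without failing.
note_optional() {
  command -v "$1" >/dev/null 2>&1 || printf 'WARNING: optional command not available: %s\n' "$1"
}

require_commands bash grep sort || exit 2
note_optional chronyc
```

Reporting all missing commands before exiting saves a round trip during an incident, because the operator sees the full list at once.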
## Copy-To-Server Example
```bash
scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
```
Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
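To keep the exit code together with the output, an operator could capture both into one evidence file on their own workstation. This is a minimal sketch under that assumption; `capture_evidence` is a hypothetical helper and the stand-in command replaces a real check.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command, save its combined output and exit code
# to an evidence file, and return the command's own exit code.
capture_evidence() {
  local evidence_file="$1"
  shift
  local rc=0
  {
    printf 'Command: %s\n' "$*"
    printf 'Captured: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    "$@" 2>&1 || rc=$?
    printf 'Exit code: %s\n' "$rc"
  } > "$evidence_file"
  return "$rc"
}

# Stand-in command; in practice this would be the ssh invocation shown above.
evidence_file="$(mktemp)"
capture_evidence "$evidence_file" bash -c 'printf "WARNING: sample output\n"; exit 1' || true
cat "$evidence_file"
rm -f "$evidence_file"
```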
## Sample Outputs
Sanitized examples are available in [examples](./examples/):
- `high-cpu.sample.txt`
- `high-memory-oom.sample.txt`
- `service-restart-loop.sample.txt`
- `failed-ssh-logins.sample.txt`
- `certificate-expiry.sample.txt`
- `dns-connectivity.sample.txt`
- `ntp-time-drift.sample.txt`
- `filesystem-readonly.sample.txt`
- `inode-usage.sample.txt`
- `jvm-threads-heap.sample.txt`
A sanitized report sample is available at [../../../examples/incident-triage/l2-incident-triage-report.sample.md](../../../examples/incident-triage/l2-incident-triage-report.sample.md).
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
host_name=""
port=443
cert_file=""
warning_days=30
critical_days=7
servername=""
usage() {
cat <<'USAGE'
Usage: check_certificate_expiry.sh (--host HOST [--port PORT] | --file CERT_FILE) [--servername SNI_NAME] [--warning-days DAYS] [--critical-days DAYS] [--help]
Check TLS certificate expiry for a remote endpoint or local certificate file.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
--port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
--file) [[ $# -ge 2 ]] || { printf 'CRITICAL: --file requires a value\n'; exit 2; }; cert_file="$2"; shift 2 ;;
--servername) [[ $# -ge 2 ]] || { printf 'CRITICAL: --servername requires a value\n'; exit 2; }; servername="$2"; shift 2 ;;
--warning-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-days requires a value\n'; exit 2; }; warning_days="$2"; shift 2 ;;
--critical-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-days requires a value\n'; exit 2; }; critical_days="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
if ! command -v openssl >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: openssl\n'
exit 2
fi
for value in "$port" "$warning_days" "$critical_days"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((critical_days >= warning_days)); then
printf 'CRITICAL: --critical-days must be lower than --warning-days\n'
exit 2
fi
if [[ -n "$host_name" && -n "$cert_file" ]]; then
printf 'CRITICAL: use either --host or --file, not both\n'
exit 2
fi
if [[ -z "$host_name" && -z "$cert_file" ]]; then
printf 'CRITICAL: either --host or --file is required\n'
usage
exit 2
fi
if [[ -n "$cert_file" && ! -r "$cert_file" ]]; then
printf 'CRITICAL: certificate file is not readable: %s\n' "$cert_file"
exit 2
fi
if [[ -z "$servername" ]]; then
servername="$host_name"
fi
tmp_cert="$(mktemp)"
trap 'rm -f "$tmp_cert"' EXIT
if [[ -n "$host_name" ]]; then
if ! openssl s_client -connect "${host_name}:${port}" -servername "$servername" -showcerts </dev/null 2>/dev/null \
| openssl x509 -outform PEM > "$tmp_cert" 2>/dev/null; then
printf 'CRITICAL: unable to retrieve certificate from %s:%s\n' "$host_name" "$port"
exit 2
fi
else
cp "$cert_file" "$tmp_cert"
fi
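# Under errexit and pipefail, a failing openssl call inside an assignment would end
# the script without a message, so each lookup tolerates failure and the expiry
# date is validated explicitly below.
subject="$(openssl x509 -in "$tmp_cert" -noout -subject 2>/dev/null | sed 's/^subject=//' || true)"
issuer="$(openssl x509 -in "$tmp_cert" -noout -issuer 2>/dev/null | sed 's/^issuer=//' || true)"
not_before="$(openssl x509 -in "$tmp_cert" -noout -startdate 2>/dev/null | sed 's/^notBefore=//' || true)"
not_after="$(openssl x509 -in "$tmp_cert" -noout -enddate 2>/dev/null | sed 's/^notAfter=//' || true)"
# -ext needs OpenSSL 1.1.1 or newer; when it is unavailable the SAN/CN line falls back to the subject.
san_text="$(openssl x509 -in "$tmp_cert" -noout -ext subjectAltName 2>/dev/null | sed '1d' | sed 's/^ *//' || true)"
# GNU date accepts an empty -d value, so only parse a non-empty expiry date.
expiry_epoch=""
if [[ -n "$not_after" ]]; then
expiry_epoch="$(date -d "$not_after" +%s 2>/dev/null || printf '')"
fi
now_epoch="$(date +%s)"
if [[ -z "$expiry_epoch" ]]; then
printf 'CRITICAL: unable to parse certificate expiry date: %s\n' "${not_after:-not available}"
exit 2
fi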
seconds_left=$((expiry_epoch - now_epoch))
days_left=$((seconds_left / 86400))
status="OK"
exit_code=0
if ((days_left < critical_days)); then
status="CRITICAL"
exit_code=3
elif ((days_left < warning_days)); then
status="WARNING"
exit_code=1
fi
target="$cert_file"
if [[ -n "$host_name" ]]; then
target="${host_name}:${port}"
fi
printf '%s: Certificate for %s expires in %s day(s)\n\n' "$status" "$target" "$days_left"
printf 'Certificate details:\n'
printf 'Subject: %s\n' "$subject"
printf 'Issuer: %s\n' "$issuer"
printf 'notBefore: %s\n' "$not_before"
printf 'notAfter: %s\n' "$not_after"
printf 'SAN/CN: %s\n' "${san_text:-$subject}"
printf '\n'
printf 'Evidence:\n'
printf 'Target: %s\n' "$target"
printf 'SNI: %s\n' "${servername:-not used}"
printf 'Thresholds: warning=%s days critical=%s days\n\n' "$warning_days" "$critical_days"
printf 'Recommended next steps:\n'
printf -- '- Renew certificate before the operational threshold is breached\n'
printf -- '- Check the full chain and intermediate certificates\n'
printf -- '- Check the load balancer, ingress, or reverse proxy serving this certificate\n'
printf -- '- Verify monitoring threshold and alert ownership\n'
printf -- '- Attach this output to incident or change ticket\n'
exit "$exit_code"
@@ -0,0 +1,161 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
host_name=""
port=""
count=3
timeout_seconds=3
usage() {
cat <<'USAGE'
Usage: check_dns_connectivity.sh --host HOST [--port PORT] [--count COUNT] [--timeout SECONDS] [--help]
Check DNS resolution, ping, optional TCP connectivity, and local route hints.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
--port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
--count) [[ $# -ge 2 ]] || { printf 'CRITICAL: --count requires a value\n'; exit 2; }; count="$2"; shift 2 ;;
--timeout) [[ $# -ge 2 ]] || { printf 'CRITICAL: --timeout requires a value\n'; exit 2; }; timeout_seconds="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
if [[ -z "$host_name" ]]; then
printf 'CRITICAL: --host is required\n'
usage
exit 2
fi
for value in "$count" "$timeout_seconds"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if [[ -n "$port" ]] && ! is_number "$port"; then
printf 'CRITICAL: --port must be numeric\n'
exit 2
fi
if ! command -v getent >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: getent\n'
exit 2
fi
dns_ok=0
ping_ok=0
tcp_ok=0
tcp_checked=0
tcp_note=""
ping_output="$(mktemp)"
trap 'rm -f "$ping_output"' EXIT
dns_output="$(getent hosts "$host_name" 2>/dev/null || true)"
if [[ -n "$dns_output" ]]; then
dns_ok=1
fi
if command -v ping >/dev/null 2>&1; then
if ping -c "$count" -W "$timeout_seconds" "$host_name" > "$ping_output" 2>&1; then
ping_ok=1
fi
else
printf 'WARNING: ping command not available; ICMP check skipped\n' > "$ping_output"
fi
if [[ -n "$port" ]]; then
tcp_checked=1
# Host and port are passed as positional arguments so the --host value is never
# re-parsed as shell code inside bash -c.
if command -v timeout >/dev/null 2>&1; then
if timeout "$timeout_seconds" bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
tcp_ok=1
fi
else
tcp_note="WARNING: timeout command not available; TCP /dev/tcp check used without external timeout"
if bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
tcp_ok=1
fi
fi
fi
status="OK"
exit_code=0
if ((dns_ok == 0)); then
status="CRITICAL"
exit_code=3
elif ((tcp_checked == 1 && tcp_ok == 0)); then
status="CRITICAL"
exit_code=3
elif command -v ping >/dev/null 2>&1 && ((ping_ok == 0)); then
status="WARNING"
exit_code=1
fi
printf '%s: DNS=%s ping=%s' "$status" "$([[ "$dns_ok" == 1 ]] && printf OK || printf FAILED)" "$([[ "$ping_ok" == 1 ]] && printf OK || printf UNKNOWN_OR_FAILED)"
if ((tcp_checked == 1)); then
printf ' tcp_%s=%s' "$port" "$([[ "$tcp_ok" == 1 ]] && printf OK || printf FAILED)"
fi
printf '\n\n'
printf 'DNS result:\n'
if [[ -n "$dns_output" ]]; then
printf '%s\n' "$dns_output"
else
printf 'CRITICAL: getent hosts returned no records for %s\n' "$host_name"
fi
printf '\n'
printf 'Ping result:\n'
if [[ -s "$ping_output" ]]; then
cat "$ping_output"
else
printf 'WARNING: ping result unavailable or ping command missing\n'
fi
printf '\n'
if ((tcp_checked == 1)); then
printf 'TCP port result:\n'
if ((tcp_ok == 1)); then
printf 'OK: TCP connection to %s:%s succeeded\n' "$host_name" "$port"
else
printf 'CRITICAL: TCP connection to %s:%s failed or timed out\n' "$host_name" "$port"
fi
if [[ -n "$tcp_note" ]]; then
printf '%s\n' "$tcp_note"
fi
printf '\n'
fi
printf 'Local network hints:\n'
if command -v ip >/dev/null 2>&1; then
ip route show default 2>/dev/null || printf 'WARNING: unable to read default route\n'
elif command -v ss >/dev/null 2>&1; then
ss -tuln 2>/dev/null | head -n 20 || printf 'WARNING: unable to read socket summary\n'
else
printf 'WARNING: ip and ss are unavailable; local network hints skipped\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Host: %s count=%s timeout=%ss port=%s\n' "$host_name" "$count" "$timeout_seconds" "${port:-not checked}"
if [[ -n "$tcp_note" ]]; then
printf '%s\n' "$tcp_note"
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Verify the DNS record and resolver path\n'
printf -- '- Check firewall, routing, security group, or proxy policy\n'
printf -- '- Compare results from another host or network segment\n'
printf -- '- Check application endpoint health after network reachability is confirmed\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
since_value="1 hour ago"
warning_count=20
critical_count=50
top_count=10
usage() {
cat <<'USAGE'
Usage: check_failed_ssh_logins.sh [--since TEXT] [--warning COUNT] [--critical COUNT] [--top N] [--help]
Detect failed SSH login bursts from journal or readable authentication logs.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
for value in "$warning_count" "$critical_count" "$top_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_count >= critical_count)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
tmp_log="$(mktemp)"
trap 'rm -f "$tmp_log"' EXIT
log_source="journalctl"
if command -v journalctl >/dev/null 2>&1; then
journalctl --since "$since_value" --no-pager 2>/dev/null \
| grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' > "$tmp_log" || true
else
log_source="log file fallback"
fi
if [[ ! -s "$tmp_log" ]]; then
for log_file in /var/log/auth.log /var/log/secure /var/log/messages; do
if [[ -r "$log_file" ]]; then
grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' "$log_file" >> "$tmp_log" || true
log_source="$log_file"
fi
done
fi
attempts="$(wc -l < "$tmp_log" | awk '{print $1}')"
status="OK"
exit_code=0
if ((attempts >= critical_count)); then
status="CRITICAL"
exit_code=3
elif ((attempts >= warning_count)); then
status="WARNING"
exit_code=1
fi
printf '%s: Found %s failed SSH login attempt(s) for requested window\n\n' "$status" "$attempts"
printf 'Top source IPs:\n'
if [[ -s "$tmp_log" ]]; then
grep -Eo 'from ([0-9]{1,3}\.){3}[0-9]{1,3}|rhost=([0-9]{1,3}\.){3}[0-9]{1,3}' "$tmp_log" \
| sed -E 's/^(from |rhost=)//' \
| sort | uniq -c | sort -rn | head -n "$top_count" || true
else
printf 'OK: no failed SSH attempts found in available sources\n'
fi
printf '\n'
printf 'Top attempted users:\n'
if [[ -s "$tmp_log" ]]; then
sed -nE 's/.*Invalid user ([^ ]+).*/\1/p; s/.*Failed password for invalid user ([^ ]+).*/\1/p; s/.*Failed password for ([^ ]+).*/\1/p; s/.*user=([^ ]+).*/\1/p' "$tmp_log" \
| sort | uniq -c | sort -rn | head -n "$top_count" || true
else
printf 'OK: no attempted users extracted\n'
fi
printf '\n'
printf 'Sample recent lines:\n'
if [[ -s "$tmp_log" ]]; then
tail -n "$top_count" "$tmp_log"
else
printf 'OK: no sample lines available\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%s critical=%s since="%s"\n' "$warning_count" "$critical_count" "$since_value"
printf 'Log source: %s\n' "$log_source"
if [[ "$log_source" != "journalctl" ]]; then
printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; authentication log visibility may be limited\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Verify source IPs against expected scanners, admins, or automation\n'
printf -- '- Check firewall, fail2ban, or security tooling state\n'
printf -- '- Confirm whether the attempts are expected for this host\n'
printf -- '- Review successful logins too, not only failures\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
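A minimal sketch of the source-IP extraction used above, written as a standalone function so it can be tried against sample log lines. The function name `extract_source_ips` is hypothetical and not part of the script; like the script, it only handles IPv4 addresses in the `from <ip>` and `rhost=<ip>` forms.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of check_failed_ssh_logins.sh.
# Reads sshd log lines on stdin and prints "COUNT IP", most frequent first.
extract_source_ips() {
  # Keep only the "from <ip>" or "rhost=<ip>" fragment of each line,
  # strip the prefix, then count occurrences per address.
  grep -Eo '(from |rhost=)([0-9]{1,3}\.){3}[0-9]{1,3}' \
    | sed -E 's/^(from |rhost=)//' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2 }'
}
```

It could be fed the same filtered lines the script writes to its temporary file, for example `extract_source_ips < "$tmp_log"`.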
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
include_system=0
usage() {
cat <<'USAGE'
Usage: check_filesystem_readonly.sh [--include-system] [--help]
Detect filesystems mounted read-only. This check itself is read-only and changes nothing.
USAGE
}
while (($# > 0)); do
case "$1" in
--include-system) include_system=1; shift ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
tmp_mounts="$(mktemp)"
trap 'rm -f "$tmp_mounts"' EXIT
if command -v findmnt >/dev/null 2>&1; then
findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > "$tmp_mounts" 2>/dev/null || true
elif command -v mount >/dev/null 2>&1; then
mount | awk '{ source=$1; target=$3; type=$5; opts=$6; gsub(/[()]/, "", opts); print target, source, type, opts }' > "$tmp_mounts"
else
printf 'CRITICAL: findmnt or mount is required\n'
exit 2
fi
tmp_ro="$(mktemp)"
trap 'rm -f "$tmp_mounts" "$tmp_ro"' EXIT
awk -v include_system="$include_system" '
function system_fs(type, target) {
return type ~ /^(proc|sysfs|tmpfs|devtmpfs|devpts|securityfs|cgroup|cgroup2|pstore|bpf|tracefs|debugfs|configfs|fusectl|mqueue|hugetlbfs|overlay|squashfs|autofs)$/ || target ~ /^\/(proc|sys|dev|run)(\/|$)/
}
{
target=$1; source=$2; type=$3; opts=$4
if (opts ~ /(^|,)ro(,|$)/) {
if (include_system == 1 || ! system_fs(type, target)) {
print target "\t" source "\t" type "\t" opts
}
}
}
' "$tmp_mounts" > "$tmp_ro"
readonly_count="$(wc -l < "$tmp_ro" | awk '{print $1}')"
status="OK"
exit_code=0
if ((readonly_count > 0)); then
status="CRITICAL"
exit_code=3
fi
printf '%s: Found %s read-only filesystem(s)\n\n' "$status" "$readonly_count"
printf 'Read-only filesystems:\n'
if [[ -s "$tmp_ro" ]]; then
printf 'MOUNT_POINT\tSOURCE\tFSTYPE\tOPTIONS\n'
cat "$tmp_ro"
else
printf 'OK: no read-only filesystems found with current filters\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'include_system=%s\n' "$include_system"
printf 'Collector: '
if command -v findmnt >/dev/null 2>&1; then
printf 'findmnt\n'
else
printf 'mount fallback\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Check dmesg or journal logs for I/O errors and filesystem remount events\n'
printf -- '- Check storage path, multipath, SAN, cloud volume, or underlying disk health\n'
printf -- '- Check filesystem health with the platform-approved procedure\n'
printf -- '- Do not remount read-write before understanding the cause\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
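The awk filter above treats a mount as read-only only when `ro` appears as a whole comma-separated option, so values such as `errors=remount-ro` do not match. A small sketch of the same test in plain shell, assuming a hypothetical function name `mount_mode` that is not part of the script:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of check_filesystem_readonly.sh.
# Prints "read-only" when the option list contains "ro" as a whole option.
mount_mode() {
  # Wrapping the list in commas lets one pattern cover the first,
  # middle, and last position without matching "remount-ro".
  case ",$1," in
    *,ro,*) printf 'read-only\n' ;;
    *)      printf 'read-write\n' ;;
  esac
}
```

For example, `mount_mode "ro,relatime,seclabel"` reports read-only, while `mount_mode "rw,errors=remount-ro"` does not.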
@@ -0,0 +1,146 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
warning_threshold=75
critical_threshold=90
top_count=10
usage() {
cat <<'USAGE'
Usage: check_high_cpu.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]
Detect high CPU load and show top CPU-consuming processes.
Exit codes:
0 OK
1 WARNING / operational issue detected
2 invalid input / missing required dependency
3 CRITICAL issue detected
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
require_cmd() {
if ! command -v "$1" >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: %s\n' "$1"
exit 2
fi
}
while (($# > 0)); do
case "$1" in
--warning)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }
warning_threshold="$2"
shift 2
;;
--critical)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }
critical_threshold="$2"
shift 2
;;
--top)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }
top_count="$2"
shift 2
;;
--help|-h)
usage
exit 0
;;
*)
printf 'CRITICAL: unknown option: %s\n' "$1"
usage
exit 2
;;
esac
done
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_threshold >= critical_threshold)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
require_cmd ps
require_cmd awk
require_cmd head
cpu_count=1
if command -v getconf >/dev/null 2>&1; then
cpu_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || printf '1')"
elif [[ -r /proc/cpuinfo ]]; then
cpu_count="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || printf '1')"
fi
[[ "$cpu_count" =~ ^[0-9]+$ ]] || cpu_count=1
((cpu_count > 0)) || cpu_count=1
load_1m="unavailable"
load_5m="unavailable"
load_15m="unavailable"
load_per_cpu_pct=0
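if [[ -r /proc/loadavg ]]; then
read -r load_1m load_5m load_15m _ < /proc/loadavg
elif command -v uptime >/dev/null 2>&1; then
load_line="$(uptime 2>/dev/null || true)"
load_1m="$(printf '%s\n' "$load_line" | sed -n 's/.*load average[s]*: *\([^,]*\).*/\1/p')"
[[ -n "$load_1m" ]] || load_1m="unavailable"
fi
# Compute the percentage for either source so the uptime fallback is also checked against the thresholds.
if [[ "$load_1m" =~ ^[0-9]+(\.[0-9]+)?$ ]]; then
load_per_cpu_pct="$(awk -v load_avg="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"
fi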
status="OK"
exit_code=0
if ((load_per_cpu_pct >= critical_threshold)); then
status="CRITICAL"
exit_code=3
elif ((load_per_cpu_pct >= warning_threshold)); then
status="WARNING"
exit_code=1
fi
printf '%s: 1-minute load is %s across %s CPU(s) (%s%% of CPU count)\n\n' "$status" "$load_1m" "$cpu_count" "$load_per_cpu_pct"
printf 'Load average:\n'
printf '1m=%s 5m=%s 15m=%s\n\n' "$load_1m" "$load_5m" "$load_15m"
printf 'CPU count:\n'
printf '%s\n\n' "$cpu_count"
printf 'Top CPU processes:\n'
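# head may close the pipe early; with pipefail and errexit the resulting SIGPIPE would otherwise abort the script.
ps -eo pid,ppid,user,pcpu,pmem,comm,args --sort=-pcpu | head -n "$((top_count + 1))" || true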
printf '\n'
printf 'Evidence:\n'
if command -v uptime >/dev/null 2>&1; then
uptime || true
else
printf 'WARNING: uptime command not available; used /proc/loadavg where possible\n'
fi
if ((load_per_cpu_pct >= 100)); then
printf 'WARNING: load is higher than online CPU count; runnable task saturation is possible\n'
else
printf 'OK: load is not above online CPU count at collection time\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; process ownership details are usually available, but some command lines may be limited\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Check process ownership and whether the top process is expected\n'
printf -- '- Check recent deployments, cron jobs, batch jobs, or maintenance activity\n'
printf -- '- Review logs for the top CPU-consuming process\n'
printf -- '- Compare with longer trend data from monitoring before taking action\n'
printf -- '- Attach this output to the incident ticket\n'
exit "$exit_code"
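A rough sketch of the threshold logic above: the 1-minute load is expressed as a truncated percentage of the online CPU count and then compared with the warning and critical thresholds. The function names `load_pct` and `classify_pct` are hypothetical and exist only for this illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of check_high_cpu.sh.

# load_pct LOAD CPUS: load as a whole-number percentage of the CPU count.
# The caller is assumed to pass a CPU count greater than zero.
load_pct() {
  awk -v load_avg="$1" -v cpus="$2" 'BEGIN { printf "%d\n", (load_avg / cpus) * 100 }'
}

# classify_pct PCT WARNING CRITICAL: map a percentage to a status word.
classify_pct() {
  if [ "$1" -ge "$3" ]; then
    printf 'CRITICAL\n'
  elif [ "$1" -ge "$2" ]; then
    printf 'WARNING\n'
  else
    printf 'OK\n'
  fi
}
```

With the script defaults (warning 75, critical 90), a load of 7.82 on 8 CPUs gives 97 percent and is classified as CRITICAL.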
@@ -0,0 +1,138 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
warning_threshold=80
critical_threshold=90
since_value="24 hours ago"
top_count=10
usage() {
cat <<'USAGE'
Usage: check_high_memory_oom.sh [--warning PERCENT] [--critical PERCENT] [--since TEXT] [--top N] [--help]
Detect high memory or swap usage and show recent OOM killer evidence.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
require_cmd() {
if ! command -v "$1" >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: %s\n' "$1"
exit 2
fi
}
while (($# > 0)); do
case "$1" in
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
--since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_threshold >= critical_threshold)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
require_cmd free
require_cmd ps
require_cmd awk
require_cmd head
read -r mem_total mem_used swap_total swap_used < <(free -m | awk '
/^Mem:/ { mt=$2; mu=$3 }
/^Swap:/ { st=$2; su=$3 }
END { printf "%d %d %d %d\n", mt, mu, st, su }
')
mem_pct=0
swap_pct=0
if ((mem_total > 0)); then
mem_pct=$((mem_used * 100 / mem_total))
fi
if ((swap_total > 0)); then
swap_pct=$((swap_used * 100 / swap_total))
fi
status="OK"
exit_code=0
if ((mem_pct >= critical_threshold || swap_pct >= critical_threshold)); then
status="CRITICAL"
exit_code=3
elif ((mem_pct >= warning_threshold || swap_pct >= warning_threshold)); then
status="WARNING"
exit_code=1
fi
printf '%s: Memory usage is %s%% and swap usage is %s%%\n\n' "$status" "$mem_pct" "$swap_pct"
printf 'Memory summary:\n'
free -m
printf '\n'
printf 'Top memory processes:\n'
printf 'PID RSS_MB COMMAND\n'
ps -eo pid=,rss=,comm= --sort=-rss | head -n "$top_count" | awk '{ printf "%-7s %-8d %s\n", $1, int($2 / 1024), $3 }' || true
printf '\n'
printf 'OOM events since %s:\n' "$since_value"
oom_found=0
oom_source="journalctl"
if command -v journalctl >/dev/null 2>&1; then
if journalctl --since "$since_value" -k --no-pager 2>/dev/null | grep -Ei 'out of memory|oom-killer|killed process' | tail -n 20; then
oom_found=1
fi
else
printf 'WARNING: journalctl not available; checking readable log files\n'
oom_source="log file fallback"
fi
if ((oom_found == 0)); then
for log_file in /var/log/messages /var/log/syslog /var/log/kern.log; do
if [[ -r "$log_file" ]]; then
if grep -Ei 'out of memory|oom-killer|killed process' "$log_file" | tail -n 20; then
oom_found=1
oom_source="$log_file"
break
fi
fi
done
fi
if ((oom_found == 0)); then
printf 'OK: no OOM evidence found in available sources\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%% since="%s"\n' "$warning_threshold" "$critical_threshold" "$since_value"
printf 'OOM evidence source: %s\n' "$oom_source"
if [[ "$oom_source" != "journalctl" ]]; then
printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; kernel logs or process details may be limited\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Check application memory trend\n'
printf -- '- Review JVM heap settings if process is Java\n'
printf -- '- Verify swap pressure and paging activity\n'
printf -- '- Confirm whether OOM events align with application impact\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
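One edge case in the numeric validation above: a value such as `08` passes the `^[0-9]+$` check, but bash arithmetic reads a leading zero as an octal literal and rejects it. A possible way to normalize such input before comparison is sketched here; the function name `normalize_count` is hypothetical and the script does not currently do this.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of check_high_memory_oom.sh.
# Prints the decimal value without leading zeros, or fails on non-digits.
normalize_count() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
  esac
  # Strip leading zeros; an all-zero input collapses to "0".
  stripped="$(printf '%s' "$1" | sed 's/^0*//')"
  printf '%s\n' "${stripped:-0}"
}
```

A caller could then write something like `warning_threshold="$(normalize_count "$warning_threshold")"` before the `((...))` comparisons.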
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
warning_threshold=80
critical_threshold=90
top_count=10
usage() {
cat <<'USAGE'
Usage: check_inode_usage.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]
Detect inode exhaustion using df -i.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_threshold >= critical_threshold)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
if ! command -v df >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: df\n'
exit 2
fi
tmp_df="$(mktemp)"
tmp_alerts="$(mktemp)"
trap 'rm -f "$tmp_df" "$tmp_alerts"' EXIT
df -Pi > "$tmp_df"
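awk -v warn="$warning_threshold" '
NR > 1 {
pct=$5
gsub(/%/, "", pct)
# Force a numeric comparison; gsub leaves pct as a string, and "9" >= "80" is true when compared as strings.
pct += 0
if (pct >= warn) {
print $0
}
}
' "$tmp_df" > "$tmp_alerts"
max_pct="$(awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); pct += 0; if (pct > max) max=pct } END { printf "%d", max }' "$tmp_df")"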
status="OK"
exit_code=0
if ((max_pct >= critical_threshold)); then
status="CRITICAL"
exit_code=3
elif ((max_pct >= warning_threshold)); then
status="WARNING"
exit_code=1
fi
printf '%s: Highest inode usage is %s%%\n\n' "$status" "$max_pct"
printf 'Filesystems above threshold:\n'
if [[ -s "$tmp_alerts" ]]; then
cat "$tmp_alerts"
else
printf 'OK: no filesystems above warning threshold\n'
fi
printf '\n'
printf 'Inode usage table:\n'
cat "$tmp_df"
printf '\n'
printf 'Top affected mount points:\n'
awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); print pct, $6, $1, $2, $3, $4 }' "$tmp_df" \
| sort -rn | head -n "$top_count" \
| awk '{ printf "%s%% %s %s inodes=%s used=%s free=%s\n", $1, $2, $3, $4, $5, $6 }'
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%%\n\n' "$warning_threshold" "$critical_threshold"
printf 'Recommended next steps:\n'
printf -- '- Find directories with many small files under affected mount points\n'
printf -- '- Check logs, cache, spool, session, and temporary directories\n'
printf -- '- Avoid deleting blindly; confirm ownership and application impact first\n'
printf -- '- Confirm whether inode exhaustion is causing write or deploy failures\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
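A short sketch of how the highest inode percentage can be pulled from `df -Pi` style output. It assumes the POSIX column layout shown in the sample table (IUse% in column 5), and the function name `max_inode_pct` is hypothetical. The `pct += 0` step matters: after `gsub` the value is a string, and awk would otherwise compare it lexically.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of check_inode_usage.sh.
# Reads df -Pi style output on stdin and prints the highest IUse% value.
max_inode_pct() {
  awk 'NR > 1 {
    pct = $5
    gsub(/%/, "", pct)
    pct += 0                 # coerce to a number; "-" becomes 0
    if (pct > max) max = pct
  }
  END { printf "%d\n", max }'
}
```

In the script this would correspond to `max_inode_pct < "$tmp_df"`.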
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
target_pid=""
match_string=""
top_count=10
usage() {
cat <<'USAGE'
Usage: check_jvm_threads_heap.sh [--pid PID | --match STRING] [--top N] [--help]
Provide lightweight JVM process diagnostics. Does not create heap dumps or modify processes.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--pid) [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }; target_pid="$2"; shift 2 ;;
--match) [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }; match_string="$2"; shift 2 ;;
--top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
if [[ -n "$target_pid" && -n "$match_string" ]]; then
printf 'CRITICAL: use either --pid or --match, not both\n'
exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
printf 'CRITICAL: --pid must be numeric\n'
exit 2
fi
if ! is_number "$top_count"; then
printf 'CRITICAL: --top must be numeric\n'
exit 2
fi
if ! command -v ps >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: ps\n'
exit 2
fi
tmp_java="$(mktemp)"
trap 'rm -f "$tmp_java"' EXIT
ps -eo pid=,user=,rss=,pcpu=,comm=,args= \
| awk 'tolower($0) ~ /java/ && $1 != "" { print }' > "$tmp_java"
if [[ -z "$target_pid" && -n "$match_string" ]]; then
target_pid="$(grep -F "$match_string" "$tmp_java" | awk 'NR == 1 { print $1 }' || true)"
fi
if [[ -z "$target_pid" ]]; then
detected_count="$(wc -l < "$tmp_java" | awk '{print $1}')"
if ((detected_count == 0)); then
printf 'WARNING: No Java processes detected\n\n'
else
printf 'OK: Detected %s Java process(es); rerun with --pid PID for heap detail\n\n' "$detected_count"
fi
printf 'Detected JVM processes:\n'
printf 'PID USER RSS_MB CPU COMMAND\n'
awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' "$tmp_java" | head -n "$top_count"
printf '\nRecommended next steps:\n'
printf -- '- Select a JVM process with --pid for focused diagnostics\n'
printf -- '- Review GC logs and application logs for the selected process\n'
printf -- '- Check heap sizing and thread count trend\n'
printf -- '- Capture jstack only if approved by operational process\n'
exit 1
fi
if ! ps -p "$target_pid" >/dev/null 2>&1; then
printf 'CRITICAL: process does not exist or is not visible: %s\n' "$target_pid"
exit 2
fi
proc_line="$(ps -p "$target_pid" -o pid=,user=,rss=,pcpu=,comm=,args=)"
if ! printf '%s\n' "$proc_line" | grep -qi 'java'; then
printf 'WARNING: PID %s does not appear to be a Java process from ps output\n\n' "$target_pid"
status="WARNING"
exit_code=1
else
status="OK"
exit_code=0
fi
thread_count="unavailable"
if [[ -r "/proc/${target_pid}/status" ]]; then
thread_count="$(awk '/^Threads:/ { print $2 }' "/proc/${target_pid}/status")"
fi
printf '%s: JVM diagnostics collected for PID %s\n\n' "$status" "$target_pid"
printf 'Detected JVM process:\n'
printf 'PID USER RSS_MB CPU COMMAND\n'
printf '%s\n' "$proc_line" | awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }'
printf 'Thread count: %s\n\n' "$thread_count"
printf 'Heap and JVM evidence:\n'
if command -v jcmd >/dev/null 2>&1; then
printf '\n[jcmd VM.flags]\n'
jcmd "$target_pid" VM.flags 2>/dev/null || printf 'WARNING: jcmd VM.flags failed; permissions may be limited\n'
printf '\n[jcmd GC.heap_info]\n'
jcmd "$target_pid" GC.heap_info 2>/dev/null || printf 'WARNING: jcmd GC.heap_info failed; permissions may be limited\n'
printf '\n[jcmd Thread.print summary]\n'
jcmd "$target_pid" Thread.print 2>/dev/null | awk '/java.lang.Thread.State/ { state[$0]++ } END { for (item in state) print state[item], item }' | sort -rn | head -n "$top_count" || printf 'WARNING: jcmd Thread.print failed; permissions may be limited\n'
elif command -v jstat >/dev/null 2>&1; then
printf '\n[jstat -gc]\n'
jstat -gc "$target_pid" 1 1 2>/dev/null || printf 'WARNING: jstat failed; permissions may be limited\n'
else
printf 'WARNING: jcmd and jstat are unavailable; heap details skipped\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'PID=%s thread_count=%s top=%s\n' "$target_pid" "$thread_count" "$top_count"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; JVM attach and /proc details may be limited by process ownership\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Review GC logs and recent application errors\n'
printf -- '- Check JVM heap sizing against container or host memory limits\n'
printf -- '- Check thread count trend in monitoring before concluding a leak\n'
printf -- '- Capture jstack only if approved by operational process\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,121 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
warning_offset_ms=500
critical_offset_ms=5000
usage() {
cat <<'USAGE'
Usage: check_ntp_time_drift.sh [--warning-offset MS] [--critical-offset MS] [--help]
Check time synchronization status and offset evidence when available.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
while (($# > 0)); do
case "$1" in
--warning-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-offset requires a value\n'; exit 2; }; warning_offset_ms="$2"; shift 2 ;;
--critical-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-offset requires a value\n'; exit 2; }; critical_offset_ms="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
for value in "$warning_offset_ms" "$critical_offset_ms"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_offset_ms >= critical_offset_ms)); then
printf 'CRITICAL: --warning-offset must be lower than --critical-offset\n'
exit 2
fi
system_time="$(date '+%Y-%m-%d %H:%M:%S %Z %z')"
timezone="$(date '+%Z %z')"
sync_status="unknown"
detected_tool="none"
offset_ms=""
timedate_output=""
if command -v timedatectl >/dev/null 2>&1; then
detected_tool="timedatectl"
timedate_output="$(timedatectl 2>/dev/null || true)"
sync_status="$(printf '%s\n' "$timedate_output" | awk -F: '/System clock synchronized|NTP synchronized/ { gsub(/^ +/, "", $2); print $2; exit }')"
[[ -n "$sync_status" ]] || sync_status="unknown"
fi
chronyc_output=""
if command -v chronyc >/dev/null 2>&1; then
detected_tool="chronyc"
chronyc_output="$(chronyc tracking 2>/dev/null || true)"
raw_offset="$(printf '%s\n' "$chronyc_output" | awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }')"
if [[ -n "$raw_offset" ]]; then
offset_ms="$(awk -v seconds="$raw_offset" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d", seconds * 1000 }')"
fi
elif command -v ntpq >/dev/null 2>&1; then
detected_tool="ntpq"
fi
status="OK"
exit_code=0
if [[ "$sync_status" =~ ^(no|false)$ ]]; then
status="WARNING"
exit_code=1
fi
if [[ -n "$offset_ms" ]]; then
if ((offset_ms >= critical_offset_ms)); then
status="CRITICAL"
exit_code=3
elif ((offset_ms >= warning_offset_ms)); then
status="WARNING"
exit_code=1
fi
elif [[ "$detected_tool" == "none" ]]; then
status="WARNING"
exit_code=1
fi
printf '%s: Time sync status=%s offset_ms=%s\n\n' "$status" "$sync_status" "${offset_ms:-unavailable}"
printf 'Time status:\n'
printf 'System time: %s\n' "$system_time"
printf 'Timezone: %s\n' "$timezone"
printf 'Detected tool: %s\n' "$detected_tool"
printf 'NTP synchronized: %s\n' "$sync_status"
printf 'Offset ms: %s\n\n' "${offset_ms:-unavailable}"
printf 'Tool evidence:\n'
if [[ -n "$chronyc_output" ]]; then
printf '%s\n' "$chronyc_output"
elif command -v ntpq >/dev/null 2>&1; then
ntpq -p 2>/dev/null || printf 'WARNING: ntpq command failed\n'
elif [[ -n "$timedate_output" ]]; then
printf '%s\n' "$timedate_output"
else
printf 'WARNING: timedatectl, chronyc, and ntpq are unavailable or returned no data\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%sms critical=%sms\n' "$warning_offset_ms" "$critical_offset_ms"
if [[ -z "$offset_ms" ]]; then
printf 'WARNING: offset unavailable; status is based on available synchronization indicators only\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Verify chrony or ntpd service status and configuration\n'
printf -- '- Check NTP sources and reachability\n'
printf -- '- Check virtualization host time if this is a VM\n'
printf -- '- Avoid restarting time services blindly in production\n'
printf -- '- Attach this output to incident ticket\n'
exit "$exit_code"
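The offset handling above converts a signed value in seconds into an absolute whole number of milliseconds before comparing it with the thresholds. A minimal sketch of that conversion, using a hypothetical function name `offset_to_ms`:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of check_ntp_time_drift.sh.
# offset_to_ms SECONDS: absolute offset in whole milliseconds (truncated).
offset_to_ms() {
  awk -v seconds="$1" 'BEGIN {
    if (seconds < 0) seconds = -seconds
    printf "%d\n", seconds * 1000
  }'
}
```

Because the result is truncated, offsets below one millisecond are reported as 0.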
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
service_name=""
since_value="1 hour ago"
warning_count=3
critical_count=10
usage() {
cat <<'USAGE'
Usage: check_service_restart_loop.sh --service SERVICE_NAME [--since TEXT] [--warning COUNT] [--critical COUNT] [--help]
Detect restart-loop evidence for a systemd service. Read-only.
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
require_cmd() {
if ! command -v "$1" >/dev/null 2>&1; then
printf 'CRITICAL: required command not found: %s\n' "$1"
exit 2
fi
}
while (($# > 0)); do
case "$1" in
--service) [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }; service_name="$2"; shift 2 ;;
--since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
--warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
--critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
--help|-h) usage; exit 0 ;;
*) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
esac
done
if [[ -z "$service_name" ]]; then
printf 'CRITICAL: --service is required\n'
usage
exit 2
fi
for value in "$warning_count" "$critical_count"; do
if ! is_number "$value"; then
printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
exit 2
fi
done
if ((warning_count >= critical_count)); then
printf 'CRITICAL: --warning must be lower than --critical\n'
exit 2
fi
require_cmd systemctl
active_state="$(systemctl show "$service_name" --property=ActiveState --value 2>/dev/null || printf 'unknown')"
sub_state="$(systemctl show "$service_name" --property=SubState --value 2>/dev/null || printf 'unknown')"
n_restarts="$(systemctl show "$service_name" --property=NRestarts --value 2>/dev/null || printf '')"
restart_count="${n_restarts:-0}"
if ! is_number "$restart_count"; then
restart_count=0
fi
status="OK"
exit_code=0
if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then
status="CRITICAL"
exit_code=3
elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then
status="WARNING"
exit_code=1
fi
printf '%s: Service %s state=%s substate=%s restarts=%s\n\n' "$status" "$service_name" "$active_state" "$sub_state" "$restart_count"
printf 'Service state:\n'
systemctl status "$service_name" --no-pager --lines=8 2>/dev/null || printf 'WARNING: unable to read service status for %s\n' "$service_name"
printf '\n'
printf 'Systemd properties:\n'
systemctl show "$service_name" --property=Id,Names,LoadState,ActiveState,SubState,Result,ExecMainStatus,NRestarts,Restart,RestartUSec --no-pager 2>/dev/null || true
printf '\n'
printf 'Recent start/stop/failure log lines since %s:\n' "$since_value"
if command -v journalctl >/dev/null 2>&1; then
journalctl -u "$service_name" --since "$since_value" --no-pager 2>/dev/null \
| grep -Ei 'start|stop|fail|restart|exit|status|main process' \
| tail -n 40 || printf 'OK: no matching journal lines found\n'
else
printf 'WARNING: journalctl not available; service logs unavailable from this script\n'
fi
printf '\n'
printf 'Evidence:\n'
printf 'Thresholds: warning=%s restarts critical=%s restarts since="%s"\n' "$warning_count" "$critical_count" "$since_value"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
printf 'WARNING: running without root; journal visibility may be limited\n'
fi
printf '\n'
printf 'Recommended next steps:\n'
printf -- '- Inspect the unit file and drop-in overrides\n'
printf -- '- Review application logs around the restart timestamps\n'
printf -- '- Check dependencies such as network, storage, database, or secrets\n'
printf -- '- Verify recent configuration or package changes\n'
printf -- '- Do not restart blindly; attach this output to the incident ticket\n'
exit "$exit_code"
@@ -0,0 +1,20 @@
WARNING: Certificate for app.example.com:443 expires in 18 day(s)
Certificate details:
Subject: CN = app.example.com
Issuer: C = US, O = Example CA, CN = Example Intermediate CA
notBefore: Apr 11 00:00:00 2026 GMT
notAfter: May 29 23:59:59 2026 GMT
SAN/CN: DNS:app.example.com, DNS:api.example.com
Evidence:
Target: app.example.com:443
SNI: app.example.com
Thresholds: warning=30 days critical=7 days
Recommended next steps:
- Renew certificate before the operational threshold is breached
- Check the full chain and intermediate certificates
- Check the load balancer, ingress, or reverse proxy serving this certificate
- Verify monitoring threshold and alert ownership
- Attach this output to incident or change ticket
@@ -0,0 +1,23 @@
OK: DNS=OK ping=OK tcp_443=OK
DNS result:
93.184.216.34 example.com
Ping result:
3 packets transmitted, 3 received, 0% packet loss, time 2002ms
TCP port result:
OK: TCP connection to example.com:443 succeeded
Local network hints:
default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15
Evidence:
Host: example.com count=3 timeout=3s port=443
Recommended next steps:
- Verify the DNS record and resolver path
- Check firewall, routing, security group, or proxy policy
- Compare results from another host or network segment
- Check application endpoint health after network reachability is confirmed
- Attach this output to incident ticket
@@ -0,0 +1,26 @@
CRITICAL: Found 73 failed SSH login attempt(s) for requested window
Top source IPs:
52 203.0.113.44
12 198.51.100.20
9 192.0.2.10
Top attempted users:
31 admin
24 oracle
18 root
Sample recent lines:
May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20
Evidence:
Thresholds: warning=20 critical=50 since="1 hour ago"
Log source: journalctl
Recommended next steps:
- Verify source IPs against expected scanners, admins, or automation
- Check firewall, fail2ban, or security tooling state
- Confirm whether the attempts are expected for this host
- Review successful logins too, not only failures
- Attach this output to incident ticket
@@ -0,0 +1,16 @@
CRITICAL: Found 1 read-only filesystem(s)
Read-only filesystems:
MOUNT_POINT SOURCE FSTYPE OPTIONS
/data /dev/mapper/vg_data-lv_data xfs ro,relatime,seclabel,attr2,inode64
Evidence:
include_system=0
Collector: findmnt
Recommended next steps:
- Check dmesg or journal logs for I/O errors and filesystem remount events
- Check storage path, multipath, SAN, cloud volume, or underlying disk health
- Check filesystem health with the platform-approved procedure
- Do not remount read-write before understanding the cause
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)
Load average:
1m=7.82 5m=6.91 15m=5.40
CPU count:
8
Top CPU processes:
PID PPID USER %CPU %MEM COMMAND COMMAND
2314 1 app 245 12.1 java java -jar order-api.jar
991 1 root 38 0.4 backup-agent backup-agent --scan
Evidence:
WARNING: load is close to online CPU count; runnable task saturation is possible
Recommended next steps:
- Check process ownership and whether the top process is expected
- Check recent deployments, cron jobs, batch jobs, or maintenance activity
- Review logs for the top CPU-consuming process
- Compare with longer trend data from monitoring before taking action
- Attach this output to the incident ticket
@@ -0,0 +1,25 @@
WARNING: Memory usage is 84% and swap usage is 12%
Memory summary:
total used free shared buff/cache available
Mem: 15934 13386 512 121 2036 2101
Swap: 4095 512 3583
Top memory processes:
PID RSS_MB COMMAND
1234 2048 java
987 812 postgres
OOM events since 24 hours ago:
2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)
Evidence:
Thresholds: warning=80% critical=90% since="24 hours ago"
OOM evidence source: journalctl
Recommended next steps:
- Check application memory trend
- Review JVM heap settings if process is Java
- Verify swap pressure and paging activity
- Confirm whether OOM events align with application impact
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: Highest inode usage is 87%
Filesystems above threshold:
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var
Inode usage table:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root 524288 91300 432988 18% /
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var
Top affected mount points:
87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394
Evidence:
Thresholds: warning=80% critical=90%
Recommended next steps:
- Find directories with many small files under affected mount points
- Check logs, cache, spool, session, and temporary directories
- Avoid deleting blindly; confirm ownership and application impact first
- Confirm whether inode exhaustion is causing write or deploy failures
- Attach this output to incident ticket
@@ -0,0 +1,30 @@
OK: JVM diagnostics collected for PID 1234
Detected JVM process:
PID USER RSS_MB CPU COMMAND
1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar
Thread count: 188
Heap and JVM evidence:
[jcmd VM.flags]
1234:
-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648
[jcmd GC.heap_info]
garbage-first heap total 2097152K, used 1521000K
[jcmd Thread.print summary]
102 java.lang.Thread.State: WAITING
53 java.lang.Thread.State: RUNNABLE
33 java.lang.Thread.State: TIMED_WAITING
Evidence:
PID=1234 thread_count=188 top=10
Recommended next steps:
- Review GC logs and recent application errors
- Check JVM heap sizing against container or host memory limits
- Check thread count trend in monitoring before concluding a leak
- Capture jstack only if approved by operational process
- Attach this output to incident ticket
@@ -0,0 +1,23 @@
WARNING: Time sync status=yes offset_ms=812
Time status:
System time: 2026-05-11 10:18:01 UTC +0000
Timezone: UTC +0000
Detected tool: chronyc
NTP synchronized: yes
Offset ms: 812
Tool evidence:
Reference ID : 203.0.113.10
System time : 0.812345 seconds fast of NTP time
Last offset : +0.812345 seconds
Evidence:
Thresholds: warning=500ms critical=5000ms
Recommended next steps:
- Verify chrony or ntpd service status and configuration
- Check NTP sources and reachability
- Check virtualization host time if this is a VM
- Avoid restarting time services blindly in production
- Attach this output to incident ticket
@@ -0,0 +1,27 @@
CRITICAL: Service app.service state=failed substate=failed restarts=12
Service state:
app.service - Example application
Loaded: loaded (/etc/systemd/system/app.service; enabled)
Active: failed (Result: exit-code)
Systemd properties:
Id=app.service
ActiveState=failed
SubState=failed
Result=exit-code
NRestarts=12
Recent start/stop/failure log lines since 1 hour ago:
May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.
Evidence:
Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"
Recommended next steps:
- Inspect the unit file and drop-in overrides
- Review application logs around the restart timestamps
- Check dependencies such as network, storage, database, or secrets
- Verify recent configuration or package changes
- Do not restart blindly; attach this output to the incident ticket
@@ -0,0 +1,385 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
incident_type=""
service_name=""
host_name=""
port=""
target_pid=""
match_string=""
output_file=""
since_value="1 hour ago"
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
usage() {
cat <<'USAGE'
Usage: incident_triage_report.sh --type TYPE [options]
Run selected read-only incident checks and produce a Markdown triage report.
Incident types:
cpu
memory
service
network
auth
cert
filesystem
jvm
all
Options:
--type TYPE Incident type to collect
--service SERVICE_NAME systemd service name for service checks
--host HOSTNAME_OR_FQDN host for DNS, network, or certificate checks
--port PORT TCP or TLS port for host checks
--pid PID JVM process ID
--match PROCESS_MATCH JVM process match string
--output FILE write Markdown report to FILE
--since VALUE time window for log-based checks
--help show this help
Examples:
./incident_triage_report.sh --type cpu
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
./incident_triage_report.sh --type network --host app.example.com --port 443
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
USAGE
}
is_number() {
[[ "$1" =~ ^[0-9]+$ ]]
}
valid_type() {
case "$1" in
cpu|memory|service|network|auth|cert|filesystem|jvm|all) return 0 ;;
*) return 1 ;;
esac
}
while (($# > 0)); do
case "$1" in
--type)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --type requires a value\n'; exit 2; }
incident_type="$2"
shift 2
;;
--service)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }
service_name="$2"
shift 2
;;
--host)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }
host_name="$2"
shift 2
;;
--port)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }
port="$2"
shift 2
;;
--pid)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }
target_pid="$2"
shift 2
;;
--match)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }
match_string="$2"
shift 2
;;
--output)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --output requires a value\n'; exit 2; }
output_file="$2"
shift 2
;;
--since)
[[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }
since_value="$2"
shift 2
;;
--help|-h)
usage
exit 0
;;
*)
printf 'CRITICAL: unknown option: %s\n' "$1"
usage
exit 2
;;
esac
done
if [[ -z "$incident_type" ]]; then
printf 'CRITICAL: --type is required\n'
usage
exit 2
fi
if ! valid_type "$incident_type"; then
printf 'CRITICAL: unsupported incident type: %s\n' "$incident_type"
usage
exit 2
fi
if [[ -n "$port" ]] && { ! is_number "$port" || ((10#$port < 1 || 10#$port > 65535)); }; then
printf 'CRITICAL: --port must be a number between 1 and 65535\n'
exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
printf 'CRITICAL: --pid must be numeric\n'
exit 2
fi
if [[ -n "$target_pid" && -n "$match_string" ]]; then
printf 'CRITICAL: use either --pid or --match for JVM checks, not both\n'
exit 2
fi
tmp_dir="$(mktemp -d)"
trap 'rm -rf "$tmp_dir"' EXIT
report_file="$tmp_dir/report.md"
check_labels=()
check_names=()
check_commands=()
check_statuses=()
check_exit_codes=()
check_summaries=()
check_outputs=()
status_from_exit() {
case "$1" in
0) printf 'OK' ;;
1) printf 'WARNING' ;;
2) printf 'INVALID' ;;
3) printf 'CRITICAL' ;;
*) printf 'ERROR' ;;
esac
}
render_command() {
local item
for item in "$@"; do
printf '%q ' "$item"
done | sed 's/[[:space:]]*$//'
}
append_skipped_check() {
local label="$1"
local name="$2"
local reason="$3"
local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
printf 'SKIPPED: %s\n' "$reason" > "$output_path"
check_labels+=("$label")
check_names+=("$name")
check_commands+=("not run")
check_statuses+=("SKIPPED")
check_exit_codes+=("-")
check_summaries+=("$reason")
check_outputs+=("$output_path")
}
run_check() {
local label="$1"
local script_name="$2"
shift 2
local script_path="${script_dir}/${script_name}"
local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
local command_text
local exit_code
local status
local summary
command_text="$(render_command "$script_path" "$@")"
if [[ ! -e "$script_path" ]]; then
append_skipped_check "$label" "$script_name" "missing script: $script_name"
return
fi
if [[ ! -x "$script_path" ]]; then
append_skipped_check "$label" "$script_name" "script is not executable: $script_name"
return
fi
set +e
"$script_path" "$@" > "$output_path" 2>&1
exit_code=$?
set -e
status="$(status_from_exit "$exit_code")"
summary="$(sed -n '1p' "$output_path")"
if [[ -z "$summary" ]]; then
summary="no output captured"
fi
check_labels+=("$label")
check_names+=("$script_name")
check_commands+=("$command_text")
check_statuses+=("$status")
check_exit_codes+=("$exit_code")
check_summaries+=("$summary")
check_outputs+=("$output_path")
}
run_cpu_checks() {
run_check "CPU saturation" "check_high_cpu.sh"
}
run_memory_checks() {
run_check "Memory and OOM" "check_high_memory_oom.sh" --since "$since_value"
}
run_service_checks() {
if [[ -z "$service_name" ]]; then
append_skipped_check "Service restart loop" "check_service_restart_loop.sh" "requires --service SERVICE_NAME"
return
fi
run_check "Service restart loop" "check_service_restart_loop.sh" --service "$service_name" --since "$since_value"
}
run_network_checks() {
local args=(--host "$host_name")
if [[ -z "$host_name" ]]; then
append_skipped_check "DNS and connectivity" "check_dns_connectivity.sh" "requires --host HOSTNAME_OR_FQDN"
return
fi
if [[ -n "$port" ]]; then
args+=(--port "$port")
fi
run_check "DNS and connectivity" "check_dns_connectivity.sh" "${args[@]}"
}
run_auth_checks() {
run_check "Failed SSH logins" "check_failed_ssh_logins.sh" --since "$since_value"
}
run_cert_checks() {
local args=(--host "$host_name")
if [[ -z "$host_name" ]]; then
append_skipped_check "Certificate expiry" "check_certificate_expiry.sh" "requires --host HOSTNAME_OR_FQDN"
return
fi
if [[ -n "$port" ]]; then
args+=(--port "$port")
fi
run_check "Certificate expiry" "check_certificate_expiry.sh" "${args[@]}"
}
run_filesystem_checks() {
run_check "Read-only filesystems" "check_filesystem_readonly.sh"
run_check "Inode usage" "check_inode_usage.sh"
}
run_jvm_checks() {
local args=()
if [[ -n "$target_pid" ]]; then
args+=(--pid "$target_pid")
elif [[ -n "$match_string" ]]; then
args+=(--match "$match_string")
fi
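# The ${args[@]+...} form keeps an empty array from tripping nounset on Bash older than 4.4.
run_check "JVM threads and heap" "check_jvm_threads_heap.sh" ${args[@]+"${args[@]}"}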
}
case "$incident_type" in
cpu) run_cpu_checks ;;
memory) run_memory_checks ;;
service) run_service_checks ;;
network) run_network_checks ;;
auth) run_auth_checks ;;
cert) run_cert_checks ;;
filesystem) run_filesystem_checks ;;
jvm) run_jvm_checks ;;
all)
run_cpu_checks
run_memory_checks
run_service_checks
run_network_checks
run_auth_checks
run_cert_checks
run_filesystem_checks
run_jvm_checks
;;
esac
generated_at="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
local_hostname="$(hostname 2>/dev/null || printf 'unknown')"
current_user="$(id -un 2>/dev/null || printf 'unknown')"
{
printf '# L2 Incident Triage Report\n\n'
printf -- '- Generated: %s\n' "$generated_at"
printf -- '- Local hostname: %s\n' "$local_hostname"
printf -- '- Current user: %s\n' "$current_user"
printf -- '- Incident type: %s\n' "$incident_type"
printf -- '- Service: %s\n' "${service_name:-not provided}"
printf -- '- Host: %s\n' "${host_name:-not provided}"
printf -- '- Port: %s\n' "${port:-not provided}"
printf -- '- PID: %s\n' "${target_pid:-not provided}"
printf -- '- Process match: %s\n' "${match_string:-not provided}"
printf -- '- Since: %s\n\n' "$since_value"
printf '## Executed Checks\n\n'
printf '| Check | Script | Status | Exit | Command |\n'
printf '| --- | --- | --- | --- | --- |\n'
for index in "${!check_labels[@]}"; do
printf "| %s | \`%s\` | %s | %s | \`%s\` |\n" \
"${check_labels[$index]}" \
"${check_names[$index]}" \
"${check_statuses[$index]}" \
"${check_exit_codes[$index]}" \
"${check_commands[$index]}"
done
printf '\n'
printf '## Summary\n\n'
for index in "${!check_labels[@]}"; do
printf -- '- %s: %s\n' "${check_labels[$index]}" "${check_summaries[$index]}"
done
printf '\n'
printf '## Raw Evidence\n\n'
for index in "${!check_labels[@]}"; do
printf '### %s\n\n' "${check_labels[$index]}"
printf "Script: \`%s\`\n\n" "${check_names[$index]}"
printf "Command: \`%s\`\n\n" "${check_commands[$index]}"
printf 'Status: %s, exit: %s\n\n' "${check_statuses[$index]}" "${check_exit_codes[$index]}"
printf '```text\n'
cat "${check_outputs[$index]}"
printf '\n```\n\n'
done
printf '## L2 Handover Checklist\n\n'
printf -- '- [ ] Business impact confirmed\n'
printf -- '- [ ] Affected host/service identified\n'
printf -- '- [ ] Monitoring alert attached\n'
printf -- '- [ ] Recent changes checked\n'
printf -- '- [ ] Logs attached\n'
printf -- '- [ ] Service owner identified\n'
printf -- '- [ ] Escalation target identified\n\n'
printf '## Escalation Notes\n\n'
printf -- '- Escalate when impact is active, spreading, customer-facing, or outside L2 access.\n'
printf -- '- Include the alert, timeline, commands run, and the raw evidence above.\n'
printf -- '- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.\n'
printf -- '- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.\n\n'
printf '## Recommended Next Steps\n\n'
printf -- '- Confirm the symptom against monitoring and user reports.\n'
printf -- '- Compare this point-in-time evidence with recent deploys, config changes, and host events.\n'
printf -- '- Attach this report to the incident ticket before handoff.\n'
printf -- '- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.\n'
} > "$report_file"
if [[ -n "$output_file" ]]; then
cp "$report_file" "$output_file"
printf 'OK: wrote L2 incident triage report to %s\n' "$output_file"
else
cat "$report_file"
fi
@@ -10,6 +10,11 @@ Current subdirectories are planning areas unless their own README documents a ru
- `ci-cd`
- `docker`
## Linux operations labs
- [Linux Fresh Setup Toolkit](./linux/setup/) - Bootstrap automation for fresh Ubuntu lab hosts, including shell profile, Cockpit, Docker, libvirt/KVM, NVIDIA diagnostics, tuning and safe baseline defaults.
- [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety and systemd timers.
Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready.
Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/).
@@ -0,0 +1,308 @@
# AI Lab Maintenance Toolkit
## Executive summary
The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
configuration backup, and non-destructive VM inventory into a small toolkit
that is readable enough for review and guarded enough for homelab use.
This is a portfolio and lab implementation, not evidence of production
certification. Review package policy, backup coverage, maintenance windows, and
application impact before deploying it to another host.
## Problem solved
AI lab hosts accumulate operating system packages, kernel packages, container
images, build cache, journals, and configuration changes while also carrying
stateful workloads. Manual maintenance is easy to defer and risky to perform
without evidence. This project provides scheduled, logged tasks with explicit
safety boundaries and separate read-only audit commands.
## What this demonstrates
- Bash strict mode, input validation, dependency checks, and operational exit
codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.
## Architecture and directory layout
```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
```
The installer deploys scripts to `/usr/local/sbin` and units to
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
through an additional framework.
## Maintenance tasks
| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |
## Safety model
Change-capable scripts default to dry-run behavior. Manual execution requires
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
use `--execute --non-interactive`; installing and enabling those reviewed unit
files is the explicit authorization for scheduled maintenance.
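The gate described above can be sketched as a small function. This is a
minimal illustration rather than the toolkit's own code: `resolve_mode` and
the three mode labels it prints are hypothetical names chosen for this
example.

```bash
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: map command-line flags to a run mode.
# Prints dry-run, interactive, or scheduled; returns 2 on invalid input.
resolve_mode() {
  local execute=false non_interactive=false arg
  for arg in "$@"; do
    case "$arg" in
      --execute) execute=true ;;
      --non-interactive) non_interactive=true ;;
      *) printf 'CRITICAL: unknown argument: %s\n' "$arg" >&2; return 2 ;;
    esac
  done
  # Unattended runs are only valid when changes were explicitly requested.
  if [[ "$non_interactive" == true && "$execute" != true ]]; then
    printf 'CRITICAL: --non-interactive requires --execute\n' >&2
    return 2
  fi
  if [[ "$execute" != true ]]; then
    printf 'dry-run\n'
  elif [[ "$non_interactive" == true ]]; then
    printf 'scheduled\n'
  else
    printf 'interactive\n'
  fi
}
```

In this sketch only the `interactive` mode would go on to prompt for the
`EXECUTE` confirmation; `scheduled` corresponds to what the timer-driven
services pass.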
Exit codes follow the repository convention:
- `0`: completed successfully or an optional component was absent.
- `1`: an operational check or maintenance action failed.
- `2`: invalid input, missing required dependency, or insufficient privilege.
The scripts do not bypass APT or Docker locks, delete VM resources, manually
select kernel names for removal, or hide command failures.
## Installation
Review every script and unit first. Installation changes package state,
journald settings, Docker daemon settings when Docker exists, and enabled timer
state.
```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```
The installer:
1. Installs the documented Ubuntu utilities.
2. Deploys scripts and systemd units with fixed permissions.
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
4. Restarts `systemd-journald`.
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
with `jq`, and attempts a Docker restart.
6. Enables all five timers.
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.
The installer is intended for Ubuntu 26.04. It is not run automatically by
repository validation.
## Manual commands
Read-only reports:
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```
Preview maintenance:
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```
Apply reviewed maintenance interactively:
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```
`--non-interactive` is reserved for reviewed automation and is rejected unless
`--execute` is also present.
## Systemd timers
| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |
All timers use `Persistent=true`, so a missed event runs after the host becomes
available. Inspect timer and service evidence with:
```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```
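The schedule table can also be checked against systemd's calendar parser. The
sketch below assumes `OnCalendar=` expressions that match the table; the unit
files may spell them differently, and `calendar_for_timer` is a hypothetical
helper used only for illustration.

```bash
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: print an assumed OnCalendar expression for a timer.
# The expressions mirror the schedule table, not the shipped unit files.
calendar_for_timer() {
  case "$1" in
    ailab-config-backup.timer) printf '%s\n' '*-*-* 03:30:00' ;;
    ailab-disk-watch.timer) printf '%s\n' 'hourly' ;;
    ailab-apt-cleanup.timer) printf '%s\n' 'Sun *-*-* 04:00:00' ;;
    ailab-kernel-cleanup.timer) printf '%s\n' 'Sun *-*-* 04:20:00' ;;
    ailab-docker-cleanup.timer) printf '%s\n' 'Sun *-*-* 04:40:00' ;;
    *) return 2 ;;
  esac
}
```

On a host with systemd, passing one of these expressions to
`systemd-analyze calendar` prints its normalized form and the next elapse
time, which is a quick way to confirm that a schedule means what the table
says.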
## Logs
Scheduled and manual maintenance writes to:
```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```
systemd also records service output in the journal. Logrotate is installed as a
dependency, but this lab does not create a custom rotation policy for these
small maintenance logs.
## Docker policy
Docker cleanup runs `docker system prune -af` and removes build cache older
than seven days. It never passes `--volumes`. Named and anonymous volumes
remain outside this automated policy and require application-aware review.
The installer configures the `json-file` driver with a maximum size of `50m`
and five files. Existing valid JSON is backed up and merged. Invalid JSON
causes installation to stop rather than overwrite operator configuration.
## Kernel policy
Kernel removal is delegated to `apt autoremove --purge`; package names are not
constructed or purged with regular expressions. Before execution, the script
logs the APT simulation and refuses cleanup unless at least two installed
versioned kernel image packages remain after simulated removals.
This protects a fallback kernel while preserving Ubuntu dependency policy.
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
Secure Boot state, and the simulated removal set before manual execution.
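The fallback-kernel guard can be approximated as follows. This is a sketch
under stated assumptions, not the script's actual logic: it assumes the
simulation marks removals with lines starting `Remv` or `Purg`, that versioned
kernel images are named `linux-image-<digit>...`, and that package names carry
no architecture suffix. `count_remaining_kernels` is a hypothetical name.

```bash
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: count installed versioned kernel images that a
# simulated autoremove would leave in place.
#   $1: newline-separated installed package names
#   $2: text captured from a simulated `autoremove --purge` run
count_remaining_kernels() {
  local installed="$1" simulation="$2" removed pkg remaining=0
  # Collect package names from simulated removal lines.
  removed="$(awk '$1 == "Remv" || $1 == "Purg" { print $2 }' <<<"$simulation")"
  while IFS= read -r pkg; do
    # Only versioned images count; metapackages such as linux-image-generic do not.
    [[ "$pkg" =~ ^linux-image-[0-9] ]] || continue
    if ! grep -Fxq -- "$pkg" <<<"$removed"; then
      remaining=$((remaining + 1))
    fi
  done <<<"$installed"
  printf '%d\n' "$remaining"
}
```

A caller would refuse to proceed when the printed count is below two. Matching
exact package names from the simulation keeps the decision with APT instead of
deriving removal candidates from a pattern.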
## Backup policy
Backups are written to `/srv/backups/ailab-config` as
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
deleted only after a new archive is created.
The backup covers `/etc`, `/root/.bashrc`, `/root/.bashrc.d`,
`/opt/ailab-maintenance`, and libvirt configuration under
`/var/lib/libvirt/qemu`; a path that is absent is reported and skipped. It
does not include `/var/lib/docker`, WebODM data, Ollama models, VM disk
images, or other large application datasets. Because `/etc` is included, the
libvirt, Docker, and systemd configuration stored beneath it is already
covered and does not need separate source paths.
This is a local configuration backup, not a disaster-recovery design. A real
deployment should copy archives to independently protected storage and test
restoration.
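A first step toward restore testing is confirming that an archive can be read
at all. The sketch below is an assumption-laden example, not part of the
toolkit: `verify_archive` is a hypothetical helper, and it relies on archive
members being stored relative to `/`, so `/etc` appears as `etc/`.

```bash
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: succeed only when the archive can be listed and
# contains at least one member under etc/.
verify_archive() {
  local archive="$1" listing
  # A truncated or corrupt gzip stream makes tar exit non-zero here.
  listing="$(tar --list --gzip --file "$archive")" || return 1
  grep -q '^etc/' <<<"$listing"
}
```

Listing an archive proves only that it is readable. A real restore test would
also extract it into a scratch directory and compare selected files.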
## Journald policy
The installer applies:
```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```
These settings bound journal growth while retaining useful troubleshooting
evidence. Capacity and retention should be adjusted to the host's disk size
and incident-response requirements.
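To confirm which limits a drop-in actually sets, a small reader is enough. The
sketch below assumes plain `KEY=value` lines as shown above;
`journald_setting` is a hypothetical helper, not something the installer
provides.

```bash
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: print the value of KEY from a journald drop-in file.
# Returns non-zero when the key is not present.
journald_setting() {
  local key="$1" file="$2"
  awk -F= -v key="$key" '
    $1 == key { print $2; found = 1 }
    END { exit !found }
  ' "$file"
}
```

For example, `journald_setting SystemMaxUse /etc/systemd/journald.conf.d/ailab-limits.conf`
can be compared with the figure reported by `journalctl --disk-usage`.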
## Disk watch policy
The disk check uses `df -P`, defaults to an 85 percent threshold, and returns
`1` when any checked filesystem meets or exceeds the threshold. Override the
threshold for a manual or unit invocation with:
```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```
The script reports every filesystem as `OK` or `WARNING`; it does not delete
data or attempt remediation.
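The threshold rule can be illustrated with a function that reads `df -P`
output from standard input. This is a simplified sketch, not the shipped
script: `check_df_output` is a hypothetical name, and device names containing
spaces are not handled.

```bash
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

# Hypothetical helper: read `df -P` output on stdin and report each mount.
# Returns 1 when any filesystem meets or exceeds the threshold.
check_df_output() {
  local threshold="$1" status=0
  local fs blocks used avail pct mount
  while read -r fs blocks used avail pct mount; do
    # The header line has no percentage in this column and is skipped.
    [[ "$pct" =~ ^([0-9]+)%$ ]] || continue
    if ((BASH_REMATCH[1] >= threshold)); then
      printf 'WARNING: %s is at %s\n' "$mount" "$pct"
      status=1
    else
      printf 'OK: %s is at %s\n' "$mount" "$pct"
    fi
  done
  return "$status"
}
```

A live check would look like `df -P | check_df_output "${AILAB_DISK_THRESHOLD:-85}"`.
Reading from standard input keeps the comparison testable with captured
output.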
## Example operational workflows
### Weekly maintenance review
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```
Review the kernel simulation, Docker usage, failed units, backup freshness, and
disk warnings before approving manual changes.
### Disk pressure investigation
```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```
Use the evidence to identify ownership. Do not treat Docker pruning or file
deletion as a substitute for application-specific retention policy.
### Post-maintenance evidence
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
| sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```
## Interview talking points
- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application
data lifecycle decisions.
- How optional dependency handling keeps one health command useful across
container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.
## Future improvements
- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on
timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
commands.
- Add package-lock detection with bounded retry policy if recurring contention
is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
dedicated read-only audit.
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf"
DOCKER_CONFIG="/etc/docker/daemon.json"
packages=(
logrotate
needrestart
smartmontools
nvme-cli
sysstat
iotop
ncdu
duf
jq
lsof
psmisc
tar
gzip
)
timers=(
ailab-apt-cleanup.timer
ailab-kernel-cleanup.timer
ailab-docker-cleanup.timer
ailab-config-backup.timer
ailab-disk-watch.timer
)
if ((EUID != 0)); then
printf 'CRITICAL: install.sh must run as root\n' >&2
exit 2
fi
for command_name in apt-get install systemctl; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
printf 'Installing maintenance dependencies...\n'
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
printf 'Installing scripts and systemd units...\n'
for script in "$SCRIPT_DIR"/scripts/*.sh; do
install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")"
done
for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do
install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")"
done
install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")"
tmp_journald="$(mktemp)"
trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT
cat >"$tmp_journald" <<'EOF'
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
EOF
install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN"
systemctl restart systemd-journald
if command -v docker >/dev/null 2>&1; then
printf 'Configuring Docker log rotation limits...\n'
install -d -m 0755 /etc/docker
tmp_docker="$(mktemp)"
if [[ -f "$DOCKER_CONFIG" ]]; then
if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
printf 'CRITICAL: %s is not valid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
exit 1
fi
backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak"
install -m 0644 "$DOCKER_CONFIG" "$backup"
jq '. + {
"log-driver": "json-file",
"log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
}' "$DOCKER_CONFIG" >"$tmp_docker"
else
jq -n '{
"log-driver": "json-file",
"log-opts": {"max-size": "50m", "max-file": "5"}
}' >"$tmp_docker"
fi
jq empty "$tmp_docker"
install -m 0644 "$tmp_docker" "$DOCKER_CONFIG"
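systemctl restart docker || printf 'WARNING: Docker restart failed; review the daemon state before relying on the new log limits\n' >&2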
else
printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n'
fi
systemctl daemon-reload
systemctl enable --now "${timers[@]}"
printf '\nEnabled AI Lab timers:\n'
systemctl list-timers --all --no-pager | grep 'ailab-' || true
/usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt
printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n'
@@ -0,0 +1,66 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-apt-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
if ! command -v apt >/dev/null 2>&1; then
printf 'CRITICAL: apt is required\n' >&2
exit 2
fi
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)"
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n'
printf 'INFO: simulated autoremove follows\n'
LC_ALL=C apt -s autoremove --purge
printf 'INFO: rerun with --execute and confirm to apply changes\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
apt update
apt autoremove --purge -y
apt autoclean -y
if command -v needrestart >/dev/null 2>&1; then
needrestart -b || true
else
printf 'WARNING: needrestart is not installed\n'
fi
printf 'OK: APT cleanup completed\n'
@@ -0,0 +1,90 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-config-backup.log"
BACKUP_DIR="/srv/backups/ailab-config"
RETENTION_DAYS=30
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
for command_name in tar gzip find; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
exec > >(tee -a "$LOG_FILE") 2>&1
timestamp="$(date '+%Y%m%d-%H%M%S')"
archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz"
candidate_paths=(
/etc
/root/.bashrc
/root/.bashrc.d
/opt/ailab-maintenance
/var/lib/libvirt/qemu
)
source_paths=()
printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)"
for path in "${candidate_paths[@]}"; do
if [[ -e "$path" ]]; then
source_paths+=("${path#/}")
printf 'OK: include %s\n' "$path"
else
printf 'INFO: optional path is absent: %s\n' "$path"
fi
done
if ((${#source_paths[@]} == 0)); then
printf 'CRITICAL: no backup source paths are present\n'
exit 1
fi
printf 'Backup destination: %s\n' "$archive"
printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS"
printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n'
printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n'
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'Type EXECUTE to create the archive and apply retention: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
install -d -m 0750 "$BACKUP_DIR"
tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- "${source_paths[@]}"
find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete
printf 'OK: configuration backup created: %s\n' "$archive"
@@ -0,0 +1,38 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-disk-watch.log"
threshold="${AILAB_DISK_THRESHOLD:-85}"
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2
exit 2
fi
if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((threshold < 1 || threshold > 100)); then
printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2
exit 2
fi
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold"
status=0
while read -r filesystem _blocks _used available use_percent mountpoint; do
usage="${use_percent%\%}"
if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint"
status=1
elif ((usage >= threshold)); then
printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \
"$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available"
status=1
else
printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
fi
# Pass df rows through whole so a mountpoint containing spaces stays intact in the last field; -k pins 1 KiB blocks.
done < <(df -Pk -x tmpfs -x devtmpfs | awk 'NR > 1')
exit "$status"
@@ -0,0 +1,70 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-docker-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)"
if ! command -v docker >/dev/null 2>&1; then
printf 'INFO: Docker is not installed; nothing to do\n'
exit 0
fi
if command -v systemctl >/dev/null 2>&1 && ! systemctl is-active --quiet docker; then
printf 'INFO: docker.service is inactive; nothing to do\n'
exit 0
fi
printf '\nDocker disk usage before cleanup:\n'
docker system df
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; would run docker system prune -af\n'
printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n'
printf 'INFO: Docker volumes are never included in this cleanup\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: this removes unused containers, networks, images, and old build cache, but not volumes.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
docker system prune -af
docker builder prune -af --filter "until=168h"
printf '\nDocker disk usage after cleanup:\n'
docker system df
printf 'OK: Docker cleanup completed; volumes were not pruned\n'
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
run_optional() {
local description="$1"
shift
if "$@"; then
return 0
fi
printf 'WARNING: %s failed\n' "$description"
return 0
}
section "Host identity"
if command -v hostnamectl >/dev/null 2>&1; then
run_optional "hostnamectl" hostnamectl
else
run_optional "hostname" hostname
fi
run_optional "kernel information" uname -a
run_optional "uptime" uptime
section "Memory"
if command -v free >/dev/null 2>&1; then
run_optional "memory report" free -h
else
printf 'WARNING: free is not available\n'
fi
section "Filesystems"
if command -v df >/dev/null 2>&1; then
run_optional "filesystem report" df -hT
printf '\nKey mountpoints present:\n'
for mountpoint in / /boot /var /srv /opt /home; do
if findmnt -rn --mountpoint "$mountpoint" >/dev/null 2>&1; then
run_optional "filesystem report for $mountpoint" df -hT "$mountpoint"
fi
done
else
printf 'WARNING: df is not available\n'
fi
section "Journal usage"
if command -v journalctl >/dev/null 2>&1; then
run_optional "journal disk usage" journalctl --disk-usage
else
printf 'WARNING: journalctl is not available\n'
fi
section "Docker"
if command -v docker >/dev/null 2>&1; then
if command -v systemctl >/dev/null 2>&1; then
run_optional "Docker service state" systemctl is-active docker
fi
run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
run_optional "Docker disk usage" docker system df
else
printf 'INFO: Docker is not installed\n'
fi
section "Libvirt"
if command -v virsh >/dev/null 2>&1; then
if command -v systemctl >/dev/null 2>&1; then
run_optional "libvirtd service state" systemctl is-active libvirtd
fi
run_optional "libvirt guest list" virsh list --all
else
printf 'INFO: virsh is not installed\n'
fi
section "NVIDIA"
if command -v nvidia-smi >/dev/null 2>&1; then
run_optional "NVIDIA status" nvidia-smi
else
printf 'INFO: nvidia-smi is not installed\n'
fi
section "Failed systemd units"
if command -v systemctl >/dev/null 2>&1; then
run_optional "failed systemd unit report" systemctl --failed --no-pager
else
printf 'WARNING: systemctl is not available\n'
fi
section "SMART quick health"
if command -v smartctl >/dev/null 2>&1; then
shopt -s nullglob
devices=(/dev/sd? /dev/nvme?n?)
shopt -u nullglob
if ((${#devices[@]} == 0)); then
printf 'INFO: no matching SATA/SCSI or NVMe devices found\n'
else
for device in "${devices[@]}"; do
printf '\n-- %s --\n' "$device"
run_optional "SMART health check for $device" smartctl -H "$device"
done
fi
else
printf 'INFO: smartctl is not installed\n'
fi
exit 0
@@ -0,0 +1,117 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
# APT autoremove respects package dependencies and kernel protection rules. That
# is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO.
LOG_FILE="/var/log/ailab-kernel-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
kernel_packages() {
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \
'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \
| awk '$1 ~ /^ii/ {print $2}' \
| sort -u || true
}
versioned_kernel_images() {
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \
| awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \
| sort -u || true
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
for command_name in apt dpkg-query uname; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)"
printf 'Running kernel: %s\n' "$(uname -r)"
printf '\nInstalled kernel-related packages before cleanup:\n'
kernel_packages
simulation="$(LC_ALL=C apt -s autoremove --purge)"
printf '\nAPT autoremove simulation:\n%s\n' "$simulation"
mapfile -t installed_images < <(versioned_kernel_images)
mapfile -t removed_images < <(
# Simulated --purge removals are reported as "Purg"; plain removals as "Remv".
awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u
)
remaining_images=0
for image in "${installed_images[@]}"; do
remove_image=false
for removed in "${removed_images[@]}"; do
if [[ "$image" == "$removed" ]]; then
remove_image=true
break
fi
done
if [[ "$remove_image" != true ]]; then
remaining_images=$((remaining_images + 1))
fi
done
printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \
"${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images"
if ((${#installed_images[@]} < 2 || remaining_images < 2)); then
printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n'
exit 1
fi
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; no packages were removed\n'
printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: APT will remove the packages shown in the simulation above.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
apt autoremove --purge -y
apt autoclean -y
if command -v update-grub >/dev/null 2>&1; then
update-grub || true
else
printf 'WARNING: update-grub is not installed\n'
fi
printf '\nInstalled kernel-related packages after cleanup:\n'
kernel_packages
printf 'OK: kernel cleanup completed with APT-managed package selection\n'
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
if ! command -v virsh >/dev/null 2>&1; then
printf 'INFO: virsh is not installed; VM audit skipped\n'
exit 0
fi
section "Virtual machines"
virsh list --all || printf 'WARNING: unable to list virtual machines\n'
section "Storage pools"
virsh pool-list --all || printf 'WARNING: unable to list storage pools\n'
mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true)
for pool in "${pools[@]}"; do
section "Volumes in pool: $pool"
virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool"
done
section "Possible VM disk and installation images"
search_roots=()
for path in /var/lib/libvirt /srv /opt; do
[[ -d "$path" ]] && search_roots+=("$path")
done
if ((${#search_roots[@]} == 0)); then
printf 'INFO: no configured search roots are present\n'
else
find "${search_roots[@]}" -xdev -type f \
\( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \
-printf '%12s bytes %p\n' 2>/dev/null \
| sort -nr || true
fi
printf '\nINFO: audit complete; no files or libvirt resources were modified\n'
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe APT cleanup
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab APT cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:00:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab configuration backup
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab configuration backup daily
[Timer]
OnCalendar=*-*-* 03:30:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab disk usage check
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-disk-watch.sh
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab disk usage check hourly
[Timer]
OnCalendar=hourly
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe Docker cleanup
Requires=docker.service
After=docker.service
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab Docker cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:40:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe kernel cleanup
After=network-online.target ailab-apt-cleanup.service
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab kernel cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:20:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,276 @@
# Linux Fresh Setup Toolkit
## Executive summary
The Linux Fresh Setup Toolkit is day-0 bootstrap automation for a clean Ubuntu
lab server or workstation. It prepares a host for routine administration,
Cockpit, Docker workloads, libvirt/KVM virtual machines, optional NVIDIA
diagnostics, bounded logging, practical kernel tuning, and a conservative
security baseline.
The scripts are modular and safe to rerun. Optional components remain optional,
UFW is not enabled without a specific flag, and an NVIDIA driver is never
installed without an explicit version. This is a portfolio and homelab
implementation, not a production-certified build standard.
## Scope and non-goals
The toolkit supports Ubuntu 24.04 and newer and assumes a systemd-based host
with APT package management. It is suitable for a host such as `ailab` that may
run WebODM, Open WebUI, Homepage, NVIDIA workloads, or test virtual machines.
It does not:
- Deploy applications, containers, or virtual machines.
- Configure GPU passthrough, VFIO bindings, bridges, or Windows guests.
- Select an NVIDIA driver automatically.
- Define a complete firewall policy or compliance baseline.
- Replace backup, monitoring, patching, or ongoing maintenance processes.
- Claim live validation against every future Ubuntu release.
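
The supported-platform rule in this section could be enforced with a guard along these lines. This is a minimal sketch, not the contents of `00-platform-guard.inc`; the function name `require_supported_ubuntu` and its optional file argument are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only on Ubuntu 24.04 or newer.
# The os-release path is a parameter so the check can be tried against a sample file.
require_supported_ubuntu() {
    local os_release="${1:-/etc/os-release}"
    local id="" version_id=""
    if [[ ! -r "$os_release" ]]; then
        printf 'CRITICAL: cannot read %s\n' "$os_release" >&2
        return 2
    fi
    # Extract ID and VERSION_ID without sourcing the file into the current shell.
    id="$(sed -n 's/^ID=//p' "$os_release" | tr -d '"')"
    version_id="$(sed -n 's/^VERSION_ID=//p' "$os_release" | tr -d '"')"
    if [[ "$id" != "ubuntu" ]]; then
        printf 'CRITICAL: unsupported distribution: %s\n' "${id:-unknown}" >&2
        return 2
    fi
    # sort -V orders version strings; 24.04 must sort first or be equal.
    if [[ -z "$version_id" ]] ||
        [[ "$(printf '%s\n' "24.04" "$version_id" | sort -V | head -n 1)" != "24.04" ]]; then
        printf 'CRITICAL: Ubuntu %s is older than 24.04\n' "${version_id:-unknown}" >&2
        return 2
    fi
    return 0
}
```

A mutating script would call `require_supported_ubuntu || exit 2` before making any change.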
## Why this is separate from ailab-maintenance
This project establishes a fresh host. The sibling
[AI Lab Maintenance Toolkit](../ailab-maintenance/) handles day-2 health
checks, scheduled cleanup, configuration backup, disk monitoring, and VM
inventory after a host is operating.
Keeping bootstrap and maintenance separate makes the change boundary clear:
this toolkit installs platform capabilities and baseline configuration, while
the maintenance toolkit manages recurring operational tasks.
## Directory layout
```text
setup/
├── README.md
├── install.sh
├── scripts/
│ ├── 00-preflight.sh
│ ├── 00-platform-guard.inc
│ ├── 01-base-packages.sh
│ ├── 02-shell-profile.sh
│ ├── 03-cockpit.sh
│ ├── 04-docker.sh
│ ├── 05-libvirt.sh
│ ├── 06-nvidia-tools.sh
│ ├── 07-tuning.sh
│ ├── 08-security-baseline.sh
│ └── 99-postcheck.sh
├── files/
│ ├── bashrc.d/ailab.sh
│ ├── docker/daemon.json
│ ├── sysctl/99-ailab.conf
│ └── systemd/journald-ailab-limits.conf
└── docs/
├── fresh-install-checklist.md
├── cockpit.md
├── docker.md
├── libvirt.md
├── nvidia.md
└── bash-shell.md
```
`00-platform-guard.inc` is an internal sourced helper used by mutating
component scripts; it is not an executable profile.
## Supported profiles and flags
| Flag | Result |
| --- | --- |
| `--base` | Install operational CLI, diagnostic, storage, and network packages |
| `--shell` | Install the root AI lab Bash profile |
| `--cockpit` | Install and enable Cockpit |
| `--docker` | Install Docker and bounded JSON-file logging |
| `--libvirt` | Install and enable libvirt/KVM |
| `--nvidia-tools` | Install NVIDIA and OpenCL diagnostics without a driver |
| `--install-nvidia-driver VERSION` | Install diagnostics and the named Ubuntu driver package |
| `--tuning` | Apply journald, sysctl, sensor, and sysstat settings |
| `--security` | Install and enable fail2ban; install but do not enable UFW |
| `--enable-ufw` | Run security setup and explicitly enable UFW |
| `--all` | Run every standard profile without UFW enablement or driver installation |
`--install-nvidia-driver` implies `--nvidia-tools`. `--enable-ufw` implies
`--security`. With no flags, the installer prints help and makes no changes.
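
The implication rules above can be sketched as a small resolver. It covers only four of the flags from the table, and the function name `resolve_profiles` and its output format are hypothetical; the real `install.sh` may be structured differently.

```bash
#!/usr/bin/env bash
# Hypothetical flag resolver: prints one selected item per line.
resolve_profiles() {
    local nvidia_tools=false security=false enable_ufw=false driver_version=""
    if (($# == 0)); then
        printf 'help\n' # no flags: show help and change nothing
        return 0
    fi
    while (($# > 0)); do
        case "$1" in
            --nvidia-tools) nvidia_tools=true ;;
            --security) security=true ;;
            --enable-ufw)
                enable_ufw=true
                security=true # implies --security
                ;;
            --install-nvidia-driver)
                if (($# < 2)); then
                    printf 'CRITICAL: %s requires VERSION\n' "$1" >&2
                    return 2
                fi
                driver_version="$2"
                nvidia_tools=true # implies --nvidia-tools
                shift
                ;;
            *)
                printf 'CRITICAL: unknown argument: %s\n' "$1" >&2
                return 2
                ;;
        esac
        shift
    done
    if [[ "$nvidia_tools" == true ]]; then printf 'nvidia-tools\n'; fi
    if [[ -n "$driver_version" ]]; then printf 'nvidia-driver=%s\n' "$driver_version"; fi
    if [[ "$security" == true ]]; then printf 'security\n'; fi
    if [[ "$enable_ufw" == true ]]; then printf 'enable-ufw\n'; fi
    return 0
}
```

For example, `resolve_profiles --enable-ufw` lists `security` before `enable-ufw`, mirroring the rule that the firewall step never runs without the security profile.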
## Installation examples
Review the scripts and current host access path before execution:
```bash
cd labs/linux/setup
./install.sh
sudo ./install.sh --base --shell
sudo ./install.sh --cockpit --docker --libvirt
sudo ./install.sh --all
```
Explicit high-impact options can be combined with `--all`:
```bash
sudo ./install.sh --all --enable-ufw
sudo ./install.sh --all --install-nvidia-driver 550
```
The installer runs the read-only preflight once before selected profiles and a
postcheck after all successful profile steps.
## Fresh host workflow
1. Patch the base Ubuntu installation and confirm console or out-of-band access.
2. Review [the fresh install checklist](docs/fresh-install-checklist.md).
3. Run `sudo ./install.sh --base --shell`.
4. Add only the platform profiles needed by the host.
5. Review service state, listening ports, storage, networking, and warnings in
the postcheck.
6. Reboot if a driver or kernel-related package requires it.
7. Capture host-specific configuration and backup requirements separately.
## AI lab workflow
A general AI lab host can start with:
```bash
sudo ./install.sh --base --shell --cockpit --docker --nvidia-tools --tuning --security
```
This installs GPU diagnostics but leaves driver choice to the operator. Add
libvirt only when the host will run VMs. Enable UFW only after confirming SSH,
Cockpit, application, bridge, and VM networking requirements.
## Safety model
- Mutating profiles require root and refuse non-Ubuntu systems or Ubuntu older
than 24.04.
- Component profiles install their own direct prerequisites.
- Existing managed configuration is changed only when content differs.
- Changed root shell, Docker, journald, and sysctl files receive timestamped
backups.
- Existing valid Docker JSON is merged so unrelated settings survive.
- Invalid Docker JSON stops configuration rather than being overwritten.
- UFW and NVIDIA driver installation require explicit flags.
- Package and service failures are not hidden.
- Postcheck warnings report absent or inactive optional components without
causing an otherwise successful postcheck to fail.
APT installation and service restarts are real system changes. Test first on a
disposable host and maintain a console path when changing remote access policy.
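
The "change only when content differs, and back up what is replaced" behavior described in this section can be sketched as follows. The helper name `install_if_changed`, the `0644` mode, and the backup naming pattern are assumptions for illustration, not the toolkit's actual code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: install SOURCE as TARGET only when content differs,
# keeping a timestamped backup of any file it replaces.
# Prints "unchanged", "created", or "updated" so a caller can decide on restarts.
install_if_changed() {
    local source_file="$1" target_file="$2"
    if [[ -f "$target_file" ]] && cmp -s "$source_file" "$target_file"; then
        printf 'unchanged\n'
        return 0
    fi
    if [[ -f "$target_file" ]]; then
        # Assumed naming: TARGET.YYYYmmdd-HHMMSS.bak next to the original.
        cp -a -- "$target_file" "$target_file.$(date '+%Y%m%d-%H%M%S').bak"
        install -m 0644 -- "$source_file" "$target_file"
        printf 'updated\n'
    else
        # -D creates missing parent directories.
        install -D -m 0644 -- "$source_file" "$target_file"
        printf 'created\n'
    fi
}
```

Restarting a service only when the helper prints `updated` or `created` is one way to avoid backup and restart churn on repeated runs.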
## Bash shell profile
The shell profile is installed as `/root/.bashrc.d/ailab.sh`, and one exact
source line is maintained in `/root/.bashrc`. It adds concise helpers for
systemd, journals, Docker, libvirt, NVIDIA, ports, archives, and disk usage.
See [Bash shell profile](docs/bash-shell.md) for command details and cautions.
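
Maintaining "one exact source line" is commonly done with a whole-line, literal match before appending. The sketch below assumes a hypothetical helper named `ensure_line`; the exact line the installer writes is not shown in this document, so the commented example line is illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helper: append LINE to FILE only when that exact line is absent.
ensure_line() {
    local line="$1" file="$2"
    touch -- "$file"
    # -x matches whole lines, -F treats the text literally, -q suppresses output.
    if grep -qxF -- "$line" "$file"; then
        return 0
    fi
    # Add a newline first if the file does not already end with one.
    if [[ -s "$file" && -n "$(tail -c 1 -- "$file")" ]]; then
        printf '\n' >> "$file"
    fi
    printf '%s\n' "$line" >> "$file"
}

# Illustrative call; the real source line may differ:
# ensure_line '[ -f /root/.bashrc.d/ailab.sh ] && . /root/.bashrc.d/ailab.sh' /root/.bashrc
```

Because the match is on the whole line, rerunning the installer leaves the file untouched once the line is present.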
## Cockpit setup
Cockpit provides browser-based host, storage, network, package, VM, metrics,
and support-report views. The installer enables `cockpit.socket` and reports
`https://HOSTNAME:9090`. `cockpit-files` is optional because it is not
available in every enabled Ubuntu repository.
See [Cockpit setup](docs/cockpit.md).
## Docker setup
The Ubuntu `docker.io` package path is preferred. The Docker official
repository is configured only when `docker.io` is unavailable. The daemon uses
the `json-file` log driver with five 50 MB files per container.
The toolkit configures log retention only. It does not prune data, deploy
Compose applications, or configure an NVIDIA container runtime.
See [Docker setup](docs/docker.md).
## libvirt/KVM setup
The libvirt profile installs QEMU, OVMF, software TPM support, virt-install,
virt-manager, bridge utilities, and libvirt clients and services. It enables
`libvirtd` and prints existing guests and networks.
See [libvirt/KVM setup](docs/libvirt.md).
## NVIDIA tooling
The default NVIDIA profile installs `nvtop`, `clinfo`, and PCI diagnostics.
It reports detected NVIDIA devices, `nvidia-smi`, and DKMS state when those
commands exist.
Driver installation requires a numeric version that maps to an available
Ubuntu package, for example `nvidia-driver-550`. Secure Boot enrollment,
driver suitability, CUDA, container runtime support, and passthrough remain
operator decisions.
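
The numeric-version rule can be checked before any package operation. This is a sketch under assumptions: the function name `nvidia_driver_package` is hypothetical, and the commented availability check is one possible approach rather than the profile's confirmed method.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn a numeric VERSION into an Ubuntu driver package name.
nvidia_driver_package() {
    local version="${1:-}"
    if [[ ! "$version" =~ ^[0-9]+$ ]]; then
        printf 'CRITICAL: driver version must be numeric, for example 550\n' >&2
        return 2
    fi
    printf 'nvidia-driver-%s\n' "$version"
}

# Possible availability check against the configured APT repositories;
# apt-cache show reports an error for package names APT does not know:
# package="$(nvidia_driver_package 550)" && apt-cache show "$package" >/dev/null 2>&1
```

Rejecting anything other than digits keeps arbitrary text out of the package name that is later passed to APT.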
See [NVIDIA tooling](docs/nvidia.md).
## Tuning
The tuning profile bounds persistent journal use, raises inotify limits for
development and container workloads, reduces swappiness, enables sysstat, and
runs automatic sensor detection when available.
Review these values against available memory, storage, monitoring retention,
and workload behavior before deployment beyond a lab.
## Security baseline
The security profile installs UFW and fail2ban and enables fail2ban. It leaves
UFW disabled unless `--enable-ufw` is present. Explicit UFW enablement permits
OpenSSH and TCP port 9090 before activation.
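
The ordering described above, allow rules first and activation last, might look like the following. The wrapper `run`, the `DRY_RUN` variable, and the function name are illustrative assumptions; the `ufw` subcommands themselves are standard.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: allow management access first, enable the firewall last.
# With DRY_RUN=true (the default here) commands are printed rather than executed.
DRY_RUN="${DRY_RUN:-true}"

run() {
    if [[ "$DRY_RUN" == true ]]; then
        printf 'would run: %s\n' "$*"
    else
        "$@"
    fi
}

enable_ufw_preserving_access() {
    run ufw allow OpenSSH  # application profile provided by openssh-server
    run ufw allow 9090/tcp # Cockpit web console
    run ufw --force enable # --force skips the interactive confirmation
}
```

Running `enable_ufw_preserving_access` with the default `DRY_RUN=true` prints the three commands in order, which is a cheap way to review the plan before applying it over a remote session.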
This is a minimal access-preservation baseline, not a complete host firewall or
hardening standard. Application and VM networking may require additional
reviewed rules.
## Postcheck
The final script reports:
- Failed systemd units.
- Cockpit, Docker, libvirt, and fail2ban status when installed.
- Running Docker containers and defined virtual machines.
- NVIDIA runtime state.
- Filesystem usage and listening ports.
Warnings require operator review, but the absence of an optional component does
not cause the postcheck itself to fail.
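
One way to report warnings without failing the script is to count them and always return success from the check helper. The names `report_service` and `warnings` below are hypothetical and the sketch is not taken from `99-postcheck.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical postcheck helper: report a unit's state without failing the script.
warnings=0

report_service() {
    local unit="$1"
    if ! command -v systemctl >/dev/null 2>&1; then
        printf 'WARNING: systemctl is not available; %s not checked\n' "$unit"
        warnings=$((warnings + 1))
    elif systemctl is-active --quiet "$unit"; then
        printf 'OK: %s is active\n' "$unit"
    else
        printf 'WARNING: %s is not active\n' "$unit"
        warnings=$((warnings + 1))
    fi
    return 0 # a warning is evidence for the operator, not a script failure
}
```

A closing line such as `printf 'Postcheck finished with %d warning(s)\n' "$warnings"` keeps the review signal visible while the exit status stays zero.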
## Troubleshooting
Run individual read-only checks after correcting a failed profile:
```bash
sudo ./scripts/00-preflight.sh
sudo ./scripts/99-postcheck.sh
systemctl --failed
journalctl -u docker -u libvirtd -u cockpit.socket -u fail2ban
```
Common failure areas are unavailable APT repositories, unsupported package
names on a future Ubuntu release, invalid pre-existing Docker JSON, Secure Boot
module signing, disabled CPU virtualization, and remote firewall assumptions.
To roll back a managed configuration, compare the current file with its
timestamped `.bak` copy, restore the reviewed backup, and restart or reload the
owning service. Package removal is intentionally not automated because it may
affect workloads and dependencies.
## Interview talking points
- Why day-0 bootstrap and day-2 maintenance have separate ownership.
- How explicit flags protect firewall and GPU driver decisions.
- Why Docker JSON is validated, backed up, and merged.
- How idempotent content checks prevent backup and restart churn.
- Why preflight and postcheck evidence surround mutating profiles.
- Which virtualization, Secure Boot, IOMMU, and GPU decisions remain manual.
## Future improvements
- Add automated tests using disposable Ubuntu VMs.
- Add a documented NVIDIA Container Toolkit profile.
- Add optional non-root administrative user and group membership management.
- Add bridge and VFIO planning checks without applying passthrough changes.
- Add package compatibility matrices after validating future Ubuntu releases.
- Export postcheck results in a structured format for evidence collection.
@@ -0,0 +1,53 @@
# Bash Shell Profile
## Installation
The shell profile is installed for root:
```text
/root/.bashrc.d/ailab.sh
```
The installer maintains one exact source line in `/root/.bashrc` and backs up
changed files. Start a new Bash session or run:
```bash
source /root/.bashrc
```
## Aliases
| Alias | Purpose |
| --- | --- |
| `ll`, `la` | Detailed and hidden-file directory listings |
| `ports` | Listening TCP/UDP sockets and processes |
| `dus`, `dufh` | Directory and filesystem usage |
| `failed`, `jerr`, `timers` | systemd failure, journal error, and timer views |
| `dps`, `ddf`, `dcu` | Docker containers, disk use, and Compose startup |
| `vms` | All libvirt guests |
| `gpu`, `gpuloop` | NVIDIA status once or refreshed every two seconds |
| `now` | Current timestamp and timezone |
`dcu` runs `docker compose up -d` in the current directory and therefore may
create or start resources. Review the Compose project before using it.
## Functions
- `svc_status SERVICE`
- `svc_logs SERVICE [LINES]`
- `docker_logs CONTAINER [LINES]`
- `docker_restart CONTAINER`
- `vm_autostart VM`
- `vm_no_autostart VM`
- `path_backup PATH`
- `extract ARCHIVE`
Functions validate argument counts, and Docker, libvirt, and NVIDIA helpers
report missing commands clearly. `path_backup` creates a timestamped adjacent
copy and can consume substantial space for large paths.
## Rollback
Review timestamped backups under `/root`, restore the desired `.bashrc` or
profile copy, and start a new shell. Avoid restoring a backup without checking
for unrelated shell changes made after bootstrap.
@@ -0,0 +1,41 @@
# Cockpit
## Purpose
The Cockpit profile installs browser-based host administration modules for
system state, storage, networking, packages, virtual machines, metrics, and
support reports. It enables the socket-activated service.
## Installation and validation
```bash
sudo ./install.sh --cockpit
systemctl status cockpit.socket
ss -ltnp | grep ':9090'
```
Connect to `https://HOSTNAME:9090`. A browser warning is expected when the
default host certificate is not trusted.
`cockpit-files` is installed when available and skipped with a warning
otherwise.
## Access and firewall
The Cockpit profile does not change UFW. Explicit toolkit UFW enablement allows
TCP 9090, but upstream firewalls and network ACLs remain external concerns.
Use normal Linux accounts and review which users may administer the host.
## Troubleshooting and rollback
```bash
journalctl -u cockpit.socket -u cockpit.service
systemctl restart cockpit.socket
apt-cache policy cockpit cockpit-machines cockpit-files
```
To disable remote access without removing packages:
```bash
sudo systemctl disable --now cockpit.socket
```
@@ -0,0 +1,56 @@
# Docker
## Package policy
The profile prefers Ubuntu's `docker.io` package. If that package is
unavailable after an APT refresh, it configures Docker's official Ubuntu
repository and installs Docker Engine, containerd, Buildx, and Compose plugins.
This fallback requires network access to `download.docker.com`.
## Daemon configuration
The managed settings are:
```json
{
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5"
}
}
```
Existing valid `/etc/docker/daemon.json` content is preserved and merged with
these log settings. A changed file is backed up with a timestamp. Invalid JSON
causes the profile to stop rather than overwrite operator configuration.
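
The validate-then-merge behavior can be sketched with `jq`. The function name `merge_daemon_json` and the two-file interface are assumptions for illustration; the sketch prints the merged document and leaves writing and backup to the caller.

```bash
#!/usr/bin/env bash
# Hypothetical helper: merge managed log settings into an existing daemon.json.
# Requires jq. Prints the merged document to stdout; writes nothing.
merge_daemon_json() {
    local existing="$1" managed="$2"
    # jq -e exits non-zero for invalid JSON, empty input, or a non-object document.
    if ! jq -e 'type == "object"' "$existing" >/dev/null 2>&1; then
        printf 'CRITICAL: %s is not a valid JSON object; refusing to merge\n' "$existing" >&2
        return 1
    fi
    # "*" merges objects recursively; keys from the managed file win on conflict.
    jq -s '.[0] * .[1]' "$existing" "$managed"
}
```

With this approach an unrelated key in the existing file survives, while `log-driver` and the managed `log-opts` keys take the toolkit's values.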
Log limits apply to newly created containers. Existing containers may retain
their original logging configuration until recreated.
## Validation
```bash
docker version
docker compose version
docker info
docker ps
docker system df
jq . /etc/docker/daemon.json
```
## Troubleshooting and rollback
```bash
systemctl status docker
journalctl -u docker
jq empty /etc/docker/daemon.json
```
To restore a previous daemon configuration, review a timestamped backup,
replace the current file, validate it with `jq empty`, and restart Docker.
Do not restore blindly when workloads depend on newer daemon settings.
The profile does not configure Docker data roots, prune objects, deploy
applications, or install the NVIDIA Container Toolkit.
@@ -0,0 +1,47 @@
# Fresh Install Checklist
## Before bootstrap
- Confirm Ubuntu 24.04 or newer and record the release and kernel.
- Apply firmware settings for virtualization, IOMMU, or Secure Boot as needed.
- Confirm console or out-of-band access before firewall work.
- Record interfaces, addresses, routes, DNS, storage, and intended mountpoints.
- Patch the base system and reboot if required.
- Decide whether the host needs Docker, libvirt, Cockpit, or NVIDIA support.
- Review application ports and VM networking before enabling UFW.
- Confirm backups exist for any pre-existing host configuration.
## Bootstrap
Start with the least capability required:
```bash
sudo ./install.sh --base --shell
```
Add reviewed platform profiles:
```bash
sudo ./install.sh --cockpit --docker --libvirt --nvidia-tools --tuning --security
```
Do not select `--enable-ufw` until remote access and application rules are
understood. Do not install an NVIDIA driver until hardware, kernel, Secure Boot,
and workload compatibility are known.
## Post-bootstrap evidence
- Review all installer warnings.
- Run `systemctl --failed`.
- Confirm expected services with `systemctl status`.
- Review `ss -tulpn`, `df -hT`, `ip -brief address`, and `ip route`.
- Confirm Docker with `docker version` and `docker compose version`.
- Confirm libvirt with `virsh list --all` and `virsh net-list --all`.
- Confirm GPU state with `lspci -nn | grep -i nvidia` and `nvidia-smi`.
- Reboot after driver installation and repeat the postcheck.
## Handover
Document host-specific storage, network, firewall, backup, application, GPU,
and VM decisions. Install the separate `ailab-maintenance` toolkit only after
reviewing its scheduled day-2 behavior.
@@ -0,0 +1,54 @@
# libvirt and KVM
## Purpose
The libvirt profile installs QEMU/KVM administration, UEFI firmware, software
TPM support, VM creation tools, bridge utilities, and the libvirt daemon. This
supports later Linux or Windows 11 VM work without defining guests.
## Firmware pre-checks
Confirm CPU virtualization is enabled:
```bash
lscpu | grep -E 'Virtualization|Hypervisor'
grep -Eom1 '(vmx|svm)' /proc/cpuinfo
```
IOMMU and GPU passthrough require separate firmware, kernel command-line,
device isolation, driver binding, and recovery planning. This toolkit reports
hints but does not apply those changes.
## Validation
```bash
systemctl status libvirtd
virsh list --all
virsh net-list --all
virsh pool-list --all
```
Use `virt-host-validate` when available for a broader host capability report.
Desktop use of `virt-manager` requires a graphical environment or remote
display strategy.
## Networking and Windows 11
The default libvirt NAT network is distinct from host bridge networking. Review
DHCP, DNS, forwarding, and firewall behavior before changing it.
Windows 11 typically needs UEFI and a TPM device. The installed OVMF and swtpm
packages provide those building blocks, but guest creation and licensing remain
manual.
## Troubleshooting
```bash
journalctl -u libvirtd
virsh net-info default
virsh pool-list --all
lsmod | grep kvm
```
Disabling `libvirtd` does not remove VM disks or definitions. Package removal
and VM data deletion are intentionally outside this toolkit.
@@ -0,0 +1,52 @@
# NVIDIA Tooling
## Diagnostic-only default
The normal NVIDIA profile installs `nvtop`, `clinfo`, and PCI utilities. It
does not install or select a driver:
```bash
sudo ./install.sh --nvidia-tools
```
Review hardware and current module state:
```bash
lspci -nn | grep -i nvidia
nvidia-smi
dkms status
mokutil --sb-state
```
## Explicit driver installation
Install only a reviewed Ubuntu driver package version:
```bash
sudo ./install.sh --install-nvidia-driver 550
```
The numeric value maps directly to the `nvidia-driver-VERSION` package, so `550`
selects `nvidia-driver-550`. The profile exits with an error if that package is
not available from the configured APT sources. Reboot after installation, then
validate `nvidia-smi`, kernel logs, DKMS state, and application behavior.
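The mapping from the numeric value to a package name can be sketched as follows. The function name `nvidia_driver_package` is hypothetical; the digits-only rule and the `nvidia-driver-VERSION` naming mirror what the toolkit scripts enforce.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a driver branch number to its Ubuntu package name.
nvidia_driver_package() {
  local version="${1:-}"
  if [[ ! "$version" =~ ^[0-9]+$ ]]; then
    printf 'VERSION must contain digits only\n' >&2
    return 2
  fi
  printf 'nvidia-driver-%s\n' "$version"
}
```

For a read-only availability check before installing, something like `apt-cache policy "$(nvidia_driver_package 550)"` shows the candidate version without changing the host.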
## Selection considerations
- GPU generation and supported driver branch.
- Ubuntu release, kernel, and HWE stack.
- Secure Boot module enrollment.
- CUDA or application compatibility.
- Docker NVIDIA Container Toolkit requirements.
- Whether the device will be bound to VFIO instead of the host driver.
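Secure Boot state is one of the easier items to check up front. The sketch below reduces `mokutil --sb-state` output to a single word. The function name `secure_boot_state` is hypothetical, and it assumes the output contains a line starting with `SecureBoot enabled` or `SecureBoot disabled`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reduce `mokutil --sb-state` output (stdin) to one word.
# Assumption: mokutil prints "SecureBoot enabled" or "SecureBoot disabled".
secure_boot_state() {
  local line state="unknown"
  while IFS= read -r line; do
    case "$line" in
      "SecureBoot enabled"*) state="enabled" ;;
      "SecureBoot disabled"*) state="disabled" ;;
    esac
  done
  printf '%s\n' "$state"
}
```

Typical use would be `mokutil --sb-state 2>&1 | secure_boot_state`. When the result is `enabled`, DKMS-built modules generally need to be signed with an enrolled key before the kernel will load them.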
## Troubleshooting
```bash
journalctl -k | grep -Ei 'nvidia|nouveau|NVRM'
lsmod | grep -E 'nvidia|nouveau'
dkms status
apt-cache policy 'nvidia-driver-*'
```
Driver rollback is environment-specific and is not automated. Preserve console
access and a known-good kernel before changing GPU or Secure Boot configuration.
@@ -0,0 +1,133 @@
#!/usr/bin/env bash
# AI lab operational shell helpers. This file is intended to be sourced.
alias ll='ls -alF'
alias la='ls -A'
alias ports='ss -tulpn'
alias dus='du -xhd1 2>/dev/null | sort -h'
alias dufh='df -hT'
alias failed='systemctl --failed --no-pager'
alias jerr='journalctl -p err -b --no-pager'
alias timers='systemctl list-timers --all --no-pager'
alias dps='command -v docker >/dev/null 2>&1 && docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}" || printf "Docker is not installed\n"'
alias ddf='command -v docker >/dev/null 2>&1 && docker system df || printf "Docker is not installed\n"'
alias dcu='command -v docker >/dev/null 2>&1 && docker compose up -d || printf "Docker Compose is not available\n"'
alias vms='command -v virsh >/dev/null 2>&1 && virsh list --all || printf "virsh is not installed\n"'
alias gpu='command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi || printf "nvidia-smi is not installed\n"'
alias gpuloop='command -v nvidia-smi >/dev/null 2>&1 && watch -n 2 nvidia-smi || printf "nvidia-smi is not installed\n"'
alias now='date "+%Y-%m-%d %H:%M:%S %Z"'
svc_status() {
if (($# != 1)); then
printf 'Usage: svc_status SERVICE\n' >&2
return 2
fi
systemctl status "$1" --no-pager
}
svc_logs() {
if (($# < 1 || $# > 2)); then
printf 'Usage: svc_logs SERVICE [LINES]\n' >&2
return 2
fi
local lines="${2:-100}"
[[ "$lines" =~ ^[0-9]+$ ]] || {
printf 'LINES must be numeric\n' >&2
return 2
}
journalctl -u "$1" -n "$lines" --no-pager
}
docker_logs() {
if (($# < 1 || $# > 2)); then
printf 'Usage: docker_logs CONTAINER [LINES]\n' >&2
return 2
fi
command -v docker >/dev/null 2>&1 || {
printf 'Docker is not installed\n' >&2
return 1
}
local lines="${2:-100}"
[[ "$lines" =~ ^[0-9]+$ ]] || {
printf 'LINES must be numeric\n' >&2
return 2
}
docker logs --tail "$lines" "$1"
}
docker_restart() {
if (($# != 1)); then
printf 'Usage: docker_restart CONTAINER\n' >&2
return 2
fi
command -v docker >/dev/null 2>&1 || {
printf 'Docker is not installed\n' >&2
return 1
}
docker restart "$1"
}
vm_autostart() {
if (($# != 1)); then
printf 'Usage: vm_autostart VM\n' >&2
return 2
fi
command -v virsh >/dev/null 2>&1 || {
printf 'virsh is not installed\n' >&2
return 1
}
virsh autostart "$1"
}
vm_no_autostart() {
if (($# != 1)); then
printf 'Usage: vm_no_autostart VM\n' >&2
return 2
fi
command -v virsh >/dev/null 2>&1 || {
printf 'virsh is not installed\n' >&2
return 1
}
virsh autostart --disable "$1"
}
path_backup() {
if (($# != 1)); then
printf 'Usage: path_backup PATH\n' >&2
return 2
fi
if [[ ! -e "$1" ]]; then
printf 'Path does not exist: %s\n' "$1" >&2
return 1
fi
local destination
destination="${1%/}.$(date '+%Y%m%d-%H%M%S').bak"
cp -a -- "$1" "$destination"
printf 'Backup created: %s\n' "$destination"
}
extract() {
if (($# != 1)); then
printf 'Usage: extract ARCHIVE\n' >&2
return 2
fi
if [[ ! -f "$1" ]]; then
printf 'Archive does not exist: %s\n' "$1" >&2
return 1
fi
case "$1" in
*.tar.bz2|*.tbz2) tar xjf "$1" ;;
*.tar.gz|*.tgz) tar xzf "$1" ;;
*.tar.xz|*.txz) tar xJf "$1" ;;
*.tar) tar xf "$1" ;;
*.bz2) bunzip2 "$1" ;;
*.gz) gunzip "$1" ;;
*.zip) unzip "$1" ;;
*.7z) 7z x "$1" ;;
*.rar) unrar x "$1" ;;
*)
printf 'Unsupported archive type: %s\n' "$1" >&2
return 2
;;
esac
}
@@ -0,0 +1,7 @@
{
"log-driver": "json-file",
"log-opts": {
"max-size": "50m",
"max-file": "5"
}
}
@@ -0,0 +1,3 @@
fs.inotify.max_user_watches=1048576
fs.inotify.max_user_instances=1024
vm.swappiness=10
@@ -0,0 +1,5 @@
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
@@ -0,0 +1,182 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
run_base=0
run_shell=0
run_cockpit=0
run_docker=0
run_libvirt=0
run_nvidia=0
run_tuning=0
run_security=0
enable_ufw=0
nvidia_driver_version=""
usage() {
cat <<'EOF'
Usage: sudo ./install.sh [OPTIONS]
Day-0 bootstrap automation for Ubuntu 24.04 or newer.
Profiles:
--base Install baseline operational packages
--shell Install the root shell profile
--cockpit Install and enable Cockpit
--docker Install and configure Docker
--libvirt Install and enable libvirt/KVM
--nvidia-tools Install NVIDIA diagnostic tools only
--install-nvidia-driver VERSION
Install diagnostic tools and the explicit driver
--tuning Install journald and sysctl tuning
--security Install fail2ban and UFW; do not enable UFW
--enable-ufw Run security profile and explicitly enable UFW
--all Run every profile without enabling UFW or
installing an NVIDIA driver
-h, --help Show this help
Examples:
sudo ./install.sh --base --shell
sudo ./install.sh --all
sudo ./install.sh --all --enable-ufw
sudo ./install.sh --nvidia-tools --install-nvidia-driver 550
EOF
}
require_supported_ubuntu() {
if [[ ! -r /etc/os-release ]]; then
printf 'CRITICAL: /etc/os-release is unavailable; refusing system changes\n' >&2
exit 2
fi
# shellcheck disable=SC1091
source /etc/os-release
if [[ "${ID:-}" != "ubuntu" ]]; then
printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
exit 2
fi
if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
"${VERSION_ID:-unknown}" >&2
exit 2
fi
}
if (($# == 0)); then
usage
exit 0
fi
while (($# > 0)); do
case "$1" in
--base)
run_base=1
;;
--shell)
run_shell=1
;;
--cockpit)
run_cockpit=1
;;
--docker)
run_docker=1
;;
--libvirt)
run_libvirt=1
;;
--nvidia-tools)
run_nvidia=1
;;
--install-nvidia-driver)
if (($# < 2)); then
printf 'CRITICAL: --install-nvidia-driver requires a VERSION\n' >&2
exit 2
fi
nvidia_driver_version="$2"
if [[ ! "$nvidia_driver_version" =~ ^[0-9]+$ ]]; then
printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
exit 2
fi
run_nvidia=1
shift
;;
--tuning)
run_tuning=1
;;
--security)
run_security=1
;;
--enable-ufw)
enable_ufw=1
run_security=1
;;
--all)
run_base=1
run_shell=1
run_cockpit=1
run_docker=1
run_libvirt=1
run_nvidia=1
run_tuning=1
run_security=1
;;
-h|--help)
usage
exit 0
;;
*)
printf 'CRITICAL: unknown option: %s\n\n' "$1" >&2
usage >&2
exit 2
;;
esac
shift
done
if ((EUID != 0)); then
printf 'CRITICAL: install.sh must run as root for selected profiles\n' >&2
exit 2
fi
for required_command in bash dpkg; do
if ! command -v "$required_command" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$required_command" >&2
exit 2
fi
done
require_supported_ubuntu
printf 'INFO: running read-only preflight\n'
"$SCRIPT_DIR/scripts/00-preflight.sh"
((run_base == 0)) || "$SCRIPT_DIR/scripts/01-base-packages.sh"
((run_shell == 0)) || "$SCRIPT_DIR/scripts/02-shell-profile.sh"
((run_cockpit == 0)) || "$SCRIPT_DIR/scripts/03-cockpit.sh"
((run_docker == 0)) || "$SCRIPT_DIR/scripts/04-docker.sh"
((run_libvirt == 0)) || "$SCRIPT_DIR/scripts/05-libvirt.sh"
if ((run_nvidia == 1)); then
if [[ -n "$nvidia_driver_version" ]]; then
"$SCRIPT_DIR/scripts/06-nvidia-tools.sh" --install-driver "$nvidia_driver_version"
else
"$SCRIPT_DIR/scripts/06-nvidia-tools.sh"
fi
fi
((run_tuning == 0)) || "$SCRIPT_DIR/scripts/07-tuning.sh"
if ((run_security == 1)); then
if ((enable_ufw == 1)); then
"$SCRIPT_DIR/scripts/08-security-baseline.sh" --enable-ufw
else
"$SCRIPT_DIR/scripts/08-security-baseline.sh"
fi
fi
printf '\nINFO: running post-install checks\n'
"$SCRIPT_DIR/scripts/99-postcheck.sh"
printf '\nOK: selected Linux setup profiles completed\n'
@@ -0,0 +1,20 @@
# shellcheck shell=bash
require_supported_ubuntu() {
if [[ ! -r /etc/os-release ]] || ! command -v dpkg >/dev/null 2>&1; then
printf 'CRITICAL: Ubuntu release detection requires /etc/os-release and dpkg\n' >&2
exit 2
fi
# shellcheck disable=SC1091
source /etc/os-release
if [[ "${ID:-}" != "ubuntu" ]]; then
printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
exit 2
fi
if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
"${VERSION_ID:-unknown}" >&2
exit 2
fi
}
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
run_optional() {
local description="$1"
shift
if "$@"; then
return 0
fi
printf 'WARNING: %s failed\n' "$description"
return 0
}
section "Operating system"
if [[ -r /etc/os-release ]]; then
run_optional "OS release report" cat /etc/os-release
else
printf 'WARNING: /etc/os-release is unavailable\n'
fi
run_optional "kernel report" uname -a
section "Host"
run_optional "hostname report" hostname
run_optional "uptime report" uptime
section "CPU and virtualization"
if command -v lscpu >/dev/null 2>&1; then
run_optional "CPU report" lscpu
printf '\nVirtualization flags:\n'
lscpu | grep -E 'Virtualization|Hypervisor vendor' || \
printf 'INFO: no virtualization summary reported by lscpu\n'
else
printf 'WARNING: lscpu is unavailable\n'
fi
if grep -Eqm1 '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' /proc/cpuinfo; then
printf 'OK: CPU virtualization flags detected\n'
else
printf 'WARNING: CPU virtualization flags were not detected\n'
fi
section "Memory"
if command -v free >/dev/null 2>&1; then
run_optional "memory report" free -h
else
run_optional "memory report" cat /proc/meminfo
fi
section "Disks"
if command -v lsblk >/dev/null 2>&1; then
run_optional "block device report" lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,MODEL
else
printf 'WARNING: lsblk is unavailable\n'
fi
run_optional "filesystem report" df -hT
section "Network"
if command -v ip >/dev/null 2>&1; then
run_optional "network interface report" ip -brief address
run_optional "route report" ip route
else
printf 'WARNING: ip is unavailable\n'
fi
section "Firmware and Secure Boot"
if [[ -d /sys/firmware/efi ]]; then
printf 'OK: boot mode is UEFI\n'
else
printf 'INFO: boot mode appears to be legacy BIOS\n'
fi
if command -v mokutil >/dev/null 2>&1; then
run_optional "Secure Boot report" mokutil --sb-state
else
printf 'INFO: mokutil is unavailable; Secure Boot state not queried\n'
fi
section "IOMMU"
if [[ -r /proc/cmdline ]]; then
printf 'Kernel command line:\n'
cat /proc/cmdline
if grep -Eq '(^|[[:space:]])(intel_iommu=on|amd_iommu=on|iommu=)' /proc/cmdline; then
printf 'OK: IOMMU-related kernel arguments detected\n'
else
printf 'INFO: no explicit IOMMU kernel argument detected\n'
fi
fi
if command -v dmesg >/dev/null 2>&1; then
dmesg 2>/dev/null | grep -Ei 'DMAR|IOMMU|AMD-Vi' | tail -n 30 || \
printf 'INFO: no readable IOMMU hints found in dmesg\n'
fi
section "NVIDIA hardware"
if command -v lspci >/dev/null 2>&1; then
lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
else
printf 'INFO: lspci is unavailable\n'
fi
section "Existing platform components"
for command_name in docker virsh cockpit-bridge; do
if command -v "$command_name" >/dev/null 2>&1; then
printf 'OK: %s is installed at %s\n' "$command_name" "$(command -v "$command_name")"
else
printf 'INFO: %s is not installed\n' "$command_name"
fi
done
if command -v systemctl >/dev/null 2>&1; then
for unit in docker.service libvirtd.service cockpit.socket; do
if systemctl cat "$unit" >/dev/null 2>&1; then
state="$(systemctl is-active "$unit" 2>/dev/null || true)"
printf 'INFO: %-20s state=%s\n' "$unit" "${state:-unknown}"
else
printf 'INFO: %s is not installed\n' "$unit"
fi
done
fi
printf '\nOK: preflight completed without modifying the host\n'
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
packages=(
curl wget git vim nano tmux byobu htop btop glances
jq unzip zip rsync tree ncdu duf
lsof strace tcpdump nmap dnsutils net-tools iperf3 ethtool
smartmontools nvme-cli lm-sensors pciutils usbutils hwinfo
sysstat iotop iftop nload
ca-certificates gnupg software-properties-common apt-transport-https
needrestart unattended-upgrades logrotate
)
if ((EUID != 0)); then
printf 'CRITICAL: base package setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
printf 'CRITICAL: apt-get is required\n' >&2
exit 2
fi
printf 'INFO: refreshing APT metadata\n'
apt-get update
printf 'INFO: installing baseline operational packages\n'
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
if command -v systemctl >/dev/null 2>&1; then
systemctl enable --now sysstat
else
printf 'WARNING: systemctl is unavailable; sysstat was not enabled\n'
fi
printf 'OK: baseline operational packages are installed\n'
@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
SOURCE_FILE="$SCRIPT_DIR/../files/bashrc.d/ailab.sh"
PROFILE_DIR="/root/.bashrc.d"
PROFILE_FILE="$PROFILE_DIR/ailab.sh"
BASHRC="/root/.bashrc"
SOURCE_LINE='[[ -f /root/.bashrc.d/ailab.sh ]] && source /root/.bashrc.d/ailab.sh'
backup_file() {
local path="$1"
local backup
backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
install -m 0644 "$path" "$backup"
printf 'INFO: backed up %s to %s\n' "$path" "$backup"
}
if ((EUID != 0)); then
printf 'CRITICAL: shell profile setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
if [[ ! -r "$SOURCE_FILE" ]]; then
printf 'CRITICAL: shell profile source is missing: %s\n' "$SOURCE_FILE" >&2
exit 2
fi
install -d -m 0755 "$PROFILE_DIR"
if [[ ! -f "$PROFILE_FILE" ]] || ! cmp -s "$SOURCE_FILE" "$PROFILE_FILE"; then
if [[ -f "$PROFILE_FILE" ]]; then
backup_file "$PROFILE_FILE"
fi
install -m 0644 "$SOURCE_FILE" "$PROFILE_FILE"
printf 'OK: installed %s\n' "$PROFILE_FILE"
else
printf 'OK: shell profile is already current\n'
fi
if [[ ! -f "$BASHRC" ]]; then
install -m 0644 /dev/null "$BASHRC"
fi
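# grep -c prints the match count but exits non-zero when the count is 0,
# so the exit status is ignored and only the printed count is used.
source_count="$(grep -Fxc "$SOURCE_LINE" "$BASHRC" || true)"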
if [[ "$source_count" != "1" ]]; then
tmp_bashrc="$(mktemp)"
trap 'rm -f "$tmp_bashrc"' EXIT
grep -Fvx "$SOURCE_LINE" "$BASHRC" >"$tmp_bashrc" || true
printf '\n%s\n' "$SOURCE_LINE" >>"$tmp_bashrc"
backup_file "$BASHRC"
install -m 0644 "$tmp_bashrc" "$BASHRC"
printf 'OK: configured %s to source the AI lab profile exactly once\n' "$BASHRC"
else
printf 'OK: %s already sources the AI lab profile exactly once\n' "$BASHRC"
fi
@@ -0,0 +1,36 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
required_packages=(
cockpit cockpit-system cockpit-storaged cockpit-networkmanager
cockpit-packagekit cockpit-machines cockpit-sosreport cockpit-pcp
)
if ((EUID != 0)); then
printf 'CRITICAL: Cockpit setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
printf 'CRITICAL: apt-get is required\n' >&2
exit 2
fi
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${required_packages[@]}"
if apt-cache show cockpit-files >/dev/null 2>&1; then
DEBIAN_FRONTEND=noninteractive apt-get install -y cockpit-files
printf 'OK: installed optional cockpit-files package\n'
else
printf 'WARNING: cockpit-files is unavailable; continuing without it\n'
fi
systemctl enable --now cockpit.socket
printf 'OK: Cockpit is enabled at https://%s:9090\n' "$(hostname)"
@@ -0,0 +1,136 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
SOURCE_CONFIG="$SCRIPT_DIR/../files/docker/daemon.json"
DOCKER_CONFIG="/etc/docker/daemon.json"
temporary_files=()
cleanup() {
local path
for path in "${temporary_files[@]}"; do
rm -f "$path"
done
}
trap cleanup EXIT
backup_file() {
local path="$1"
local backup
backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
install -m 0644 "$path" "$backup"
printf 'INFO: backed up %s to %s\n' "$path" "$backup"
}
if ((EUID != 0)); then
printf 'CRITICAL: Docker setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
for command_name in apt-get apt-cache; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y ca-certificates curl gnupg jq
if apt-cache show docker.io >/dev/null 2>&1; then
packages=(docker.io)
if apt-cache show docker-compose-v2 >/dev/null 2>&1; then
packages+=(docker-compose-v2)
else
printf 'WARNING: docker-compose-v2 is unavailable from Ubuntu repositories\n'
fi
else
printf 'WARNING: docker.io is unavailable; configuring Docker official repository\n'
install -d -m 0755 /etc/apt/keyrings
tmp_key="$(mktemp)"
temporary_files+=("$tmp_key")
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
| gpg --dearmor --yes -o "$tmp_key"
if [[ ! -f /etc/apt/keyrings/docker.gpg ]] || \
! cmp -s "$tmp_key" /etc/apt/keyrings/docker.gpg; then
if [[ -f /etc/apt/keyrings/docker.gpg ]]; then
backup_file /etc/apt/keyrings/docker.gpg
fi
install -m 0644 "$tmp_key" /etc/apt/keyrings/docker.gpg
fi
# shellcheck disable=SC1091
source /etc/os-release
architecture="$(dpkg --print-architecture)"
tmp_repository="$(mktemp)"
temporary_files+=("$tmp_repository")
printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu %s stable\n' \
"$architecture" "${VERSION_CODENAME:?}" \
>"$tmp_repository"
if [[ ! -f /etc/apt/sources.list.d/docker.list ]] || \
! cmp -s "$tmp_repository" /etc/apt/sources.list.d/docker.list; then
if [[ -f /etc/apt/sources.list.d/docker.list ]]; then
backup_file /etc/apt/sources.list.d/docker.list
fi
install -m 0644 "$tmp_repository" /etc/apt/sources.list.d/docker.list
fi
apt-get update
packages=(docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin)
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
install -d -m 0755 /etc/docker
if [[ ! -r "$SOURCE_CONFIG" ]]; then
printf 'CRITICAL: Docker configuration template is missing: %s\n' "$SOURCE_CONFIG" >&2
exit 2
fi
jq empty "$SOURCE_CONFIG"
tmp_config="$(mktemp)"
temporary_files+=("$tmp_config")
if [[ -f "$DOCKER_CONFIG" ]]; then
if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
printf 'CRITICAL: %s is invalid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
exit 1
fi
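# Merge the log limits into the existing configuration. Other top-level keys
# and any extra "log-opts" entries are preserved.
jq '. + {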
"log-driver": "json-file",
"log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
}' "$DOCKER_CONFIG" >"$tmp_config"
else
install -m 0644 "$SOURCE_CONFIG" "$tmp_config"
fi
jq empty "$tmp_config"
config_changed=0
if [[ ! -f "$DOCKER_CONFIG" ]] || ! cmp -s "$tmp_config" "$DOCKER_CONFIG"; then
if [[ -f "$DOCKER_CONFIG" ]]; then
backup_file "$DOCKER_CONFIG"
fi
install -m 0644 "$tmp_config" "$DOCKER_CONFIG"
config_changed=1
printf 'OK: installed Docker daemon log limits\n'
else
printf 'OK: Docker daemon configuration is already current\n'
fi
systemctl enable --now docker
if ((config_changed == 1)); then
systemctl restart docker
fi
docker version
if docker compose version >/dev/null 2>&1; then
docker compose version
else
printf 'WARNING: Docker Compose v2 is unavailable\n'
fi
printf 'OK: Docker setup completed\n'
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
packages=(
qemu-system-x86 qemu-utils libvirt-daemon-system libvirt-clients
virtinst virt-manager bridge-utils ovmf swtpm swtpm-tools dnsmasq-base
)
if ((EUID != 0)); then
printf 'CRITICAL: libvirt setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
printf 'CRITICAL: apt-get is required\n' >&2
exit 2
fi
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
systemctl enable --now libvirtd
printf '\n== Virtual machines ==\n'
virsh list --all || printf 'WARNING: unable to list virtual machines\n'
printf '\n== Virtual networks ==\n'
virsh net-list --all || printf 'WARNING: unable to list virtual networks\n'
printf 'OK: libvirt/KVM setup completed\n'
@@ -0,0 +1,88 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
driver_version=""
usage() {
cat <<'EOF'
Usage: sudo ./06-nvidia-tools.sh [--install-driver VERSION]
Without --install-driver, only non-driver diagnostic tools are installed.
EOF
}
while (($# > 0)); do
case "$1" in
--install-driver)
if (($# < 2)); then
printf 'CRITICAL: --install-driver requires a VERSION\n' >&2
exit 2
fi
driver_version="$2"
if [[ ! "$driver_version" =~ ^[0-9]+$ ]]; then
printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
exit 2
fi
shift
;;
-h|--help)
usage
exit 0
;;
*)
printf 'CRITICAL: unknown option: %s\n' "$1" >&2
exit 2
;;
esac
shift
done
if ((EUID != 0)); then
printf 'CRITICAL: NVIDIA tooling setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
printf 'CRITICAL: apt-get is required\n' >&2
exit 2
fi
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y nvtop clinfo pciutils
printf '\n== NVIDIA PCI devices ==\n'
lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
printf '\n== NVIDIA runtime ==\n'
if command -v nvidia-smi >/dev/null 2>&1; then
nvidia-smi || printf 'WARNING: nvidia-smi returned an error\n'
else
printf 'INFO: nvidia-smi is not installed\n'
fi
printf '\n== DKMS ==\n'
if command -v dkms >/dev/null 2>&1; then
dkms status || printf 'WARNING: dkms status returned an error\n'
else
printf 'INFO: dkms is not installed\n'
fi
if [[ -n "$driver_version" ]]; then
driver_package="nvidia-driver-$driver_version"
if ! apt-cache show "$driver_package" >/dev/null 2>&1; then
printf 'CRITICAL: requested NVIDIA driver package is unavailable: %s\n' \
"$driver_package" >&2
exit 1
fi
DEBIAN_FRONTEND=noninteractive apt-get install -y "$driver_package"
printf 'WARNING: NVIDIA driver %s was installed; reboot before validation\n' \
"$driver_version"
else
printf 'OK: NVIDIA diagnostic tools installed; no driver was installed\n'
fi
@@ -0,0 +1,67 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
JOURNAL_SOURCE="$SCRIPT_DIR/../files/systemd/journald-ailab-limits.conf"
JOURNAL_DEST="/etc/systemd/journald.conf.d/ailab-limits.conf"
SYSCTL_SOURCE="$SCRIPT_DIR/../files/sysctl/99-ailab.conf"
SYSCTL_DEST="/etc/sysctl.d/99-ailab.conf"
install_config() {
local source_path="$1"
local destination_path="$2"
local mode="$3"
local backup
install -d -m 0755 "$(dirname "$destination_path")"
if [[ -f "$destination_path" ]] && cmp -s "$source_path" "$destination_path"; then
printf 'OK: %s is already current\n' "$destination_path"
return 0
fi
if [[ -f "$destination_path" ]]; then
backup="${destination_path}.$(date '+%Y%m%d-%H%M%S').bak"
install -m "$mode" "$destination_path" "$backup"
printf 'INFO: backed up %s to %s\n' "$destination_path" "$backup"
fi
install -m "$mode" "$source_path" "$destination_path"
printf 'OK: installed %s\n' "$destination_path"
}
if ((EUID != 0)); then
printf 'CRITICAL: tuning setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
for source_path in "$JOURNAL_SOURCE" "$SYSCTL_SOURCE"; do
if [[ ! -r "$source_path" ]]; then
printf 'CRITICAL: required configuration is missing: %s\n' "$source_path" >&2
exit 2
fi
done
if ! command -v sysctl >/dev/null 2>&1 || ! command -v systemctl >/dev/null 2>&1; then
printf 'CRITICAL: sysctl and systemctl are required\n' >&2
exit 2
fi
if ! command -v sensors-detect >/dev/null 2>&1 || \
! systemctl cat sysstat.service >/dev/null 2>&1; then
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y lm-sensors sysstat
fi
install_config "$JOURNAL_SOURCE" "$JOURNAL_DEST" 0644
install_config "$SYSCTL_SOURCE" "$SYSCTL_DEST" 0644
sysctl --system
systemctl restart systemd-journald
systemctl enable --now sysstat
if command -v sensors-detect >/dev/null 2>&1; then
sensors-detect --auto || printf 'WARNING: sensors-detect did not complete successfully\n'
fi
printf 'OK: host tuning completed\n'
@@ -0,0 +1,61 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
enable_ufw=0
usage() {
cat <<'EOF'
Usage: sudo ./08-security-baseline.sh [--enable-ufw]
Installs fail2ban and UFW. UFW is enabled only with the explicit flag.
EOF
}
while (($# > 0)); do
case "$1" in
--enable-ufw)
enable_ufw=1
;;
-h|--help)
usage
exit 0
;;
*)
printf 'CRITICAL: unknown option: %s\n' "$1" >&2
exit 2
;;
esac
shift
done
if ((EUID != 0)); then
printf 'CRITICAL: security baseline setup must run as root\n' >&2
exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
printf 'CRITICAL: apt-get is required\n' >&2
exit 2
fi
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y fail2ban ufw
systemctl enable --now fail2ban
if ((enable_ufw == 1)); then
printf 'WARNING: UFW was explicitly requested; SSH and Cockpit rules will be added before enablement\n'
ufw allow OpenSSH
ufw allow 9090/tcp comment 'Cockpit'
ufw --force enable
else
printf 'WARNING: UFW is installed but was not enabled; use --enable-ufw after reviewing access requirements\n'
fi
ufw status verbose || printf 'WARNING: unable to read UFW status\n'
printf 'OK: security baseline completed\n'
@@ -0,0 +1,69 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
run_optional() {
local description="$1"
shift
if "$@"; then
return 0
fi
printf 'WARNING: %s failed\n' "$description"
return 0
}
section "Failed systemd units"
if command -v systemctl >/dev/null 2>&1; then
run_optional "failed systemd unit report" systemctl --failed --no-pager
section "Selected service status"
for unit in cockpit.socket docker.service libvirtd.service fail2ban.service; do
if systemctl cat "$unit" >/dev/null 2>&1; then
run_optional "$unit status" systemctl status "$unit" --no-pager
else
printf 'INFO: %s is not installed\n' "$unit"
fi
done
else
printf 'WARNING: systemctl is unavailable\n'
fi
section "Docker"
if command -v docker >/dev/null 2>&1; then
run_optional "Docker container list" docker ps
else
printf 'INFO: Docker is not installed\n'
fi
section "Libvirt"
if command -v virsh >/dev/null 2>&1; then
run_optional "libvirt guest list" virsh list --all
else
printf 'INFO: virsh is not installed\n'
fi
section "NVIDIA"
if command -v nvidia-smi >/dev/null 2>&1; then
run_optional "NVIDIA status" nvidia-smi
else
printf 'INFO: nvidia-smi is not installed\n'
fi
section "Filesystems"
run_optional "filesystem report" df -hT
section "Listening ports"
if command -v ss >/dev/null 2>&1; then
run_optional "listening port report" ss -tulpn
else
printf 'WARNING: ss is unavailable\n'
fi
printf '\nOK: postcheck completed; review warnings above\n'
exit 0
@@ -1,8 +1,14 @@
# platform-projects
This directory contains larger infrastructure platform topics and case studies. Most subdirectories are planning areas unless their own README says otherwise.
## Implemented platform projects
- [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
## Planning areas
These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
- `monitoring-zabbix`
- `elk-log-analysis`
@@ -0,0 +1,233 @@
# Slurm AI/HPC Cluster Automation Lab
## Executive summary
This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.
The goal is not to present a certified production platform. The goal is to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.
## What this project demonstrates
- Slurm controller and worker node management.
- Munge authentication across the cluster.
- GPU node integration through Slurm GRES.
- cgroup CPU, memory, and GPU device enforcement.
- SlurmDBD with MariaDB-backed accounting.
- `sacct`, `sreport`, and `sacctmgr` workflows.
- QOS, fairshare, and multifactor priority configuration.
- Node provisioning and decommissioning workflows.
- Rolling OS upgrades with canary validation.
- Health checks and auto-remediation.
- Backup and restore-check workflow for the accounting database.
- Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.
## Architecture overview
```mermaid
flowchart LR
operator[Ansible control node]
munge[Munge authentication]
controller[Slurm controller<br/>slurmctld]
db[MariaDB + SlurmDBD<br/>accounting]
shared[Shared filesystem<br/>site dependency]
cpu_part[CPU partition]
gpu_part[GPU partition]
cpu_nodes[CPU compute nodes<br/>slurmd]
gpu_node[GPU node<br/>slurmd + GRES]
jobs[User jobs<br/>sbatch / srun]
operator -->|bootstrap and configure| controller
operator -->|configure workers| cpu_nodes
operator -->|configure GPU worker| gpu_node
operator -->|deploy key and service| munge
munge --> controller
munge --> cpu_nodes
munge --> gpu_node
controller -->|accounting RPC| db
jobs -->|submit to Slurm| controller
controller -->|schedule CPU jobs| cpu_part
controller -->|schedule GPU jobs| gpu_part
cpu_part --> cpu_nodes
gpu_part --> gpu_node
cpu_nodes --- shared
gpu_node --- shared
controller --- shared
```
The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.
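To make the GRES path concrete, a minimal batch script might look like the sketch below. It is not one of the project's validation jobs: the partition name `gpu`, the job name, and the function name `report_gpu_visibility` are assumptions for illustration. Slurm typically sets `CUDA_VISIBLE_DEVICES` for jobs that were allocated GPUs through GRES.

```bash
#!/usr/bin/env bash
#SBATCH --job-name=gpu-smoke
# Assumption: the GPU partition is named "gpu" in this lab.
#SBATCH --partition=gpu
# Request one GPU through GRES.
#SBATCH --gres=gpu:1
#SBATCH --time=00:05:00

# Hypothetical helper: print what the job can see of the allocated GPU.
report_gpu_visibility() {
  printf 'host=%s\n' "${HOSTNAME:-unknown}"
  printf 'CUDA_VISIBLE_DEVICES=%s\n' "${CUDA_VISIBLE_DEVICES:-unset}"

  if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -L || printf 'WARNING: nvidia-smi returned an error\n'
  else
    printf 'INFO: nvidia-smi is not installed on this node\n'
  fi
}

report_gpu_visibility
```

Submitted with `sbatch`, the script should report one visible device. Submitting the same script without the `--gres` line is a simple way to see whether cgroup device constraints hide the GPU.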
## Repository layout
```text
inventories/lab/ Sanitized lab inventory and group variables
playbooks/bootstrap/ Initial SSH, sudo, operator user, and host setup
playbooks/core/ Munge, Slurm config, and safe restart workflows
playbooks/accounting/ SlurmDBD, MariaDB, backup, restore-check, and reporting validation
playbooks/qos/ QOS, fairshare, and priority configuration
playbooks/lifecycle/ Node provisioning, inspection, and decommissioning
playbooks/upgrade/ Canary and rolling OS upgrade workflows
playbooks/health/ Health checks, repair, and auto-remediation
playbooks/tests/ CPU, GPU, cgroup, accounting, and reporting validation jobs
playbooks/backup/ Slurm and Munge state backup helpers
templates/ Slurm, cgroup, GRES, and SlurmDBD templates
docs/ Operational runbook
prompts/ Documentation prompts used to expand this project
```
## Main operational workflows
Run commands from `platform-projects/hpc-slurm-ai-cluster/`. Review inventory and variables before running any playbook.
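One possible way to do that review is a small preflight wrapper, sketched here under assumptions: the `run` and `preflight` helpers and the `DRY_RUN` variable are hypothetical, while `ansible-inventory --graph`, `--syntax-check`, and `--list-hosts` are standard Ansible options. By default the wrapper only prints the commands it would run.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print commands by default; set DRY_RUN=0 to execute them.
run() {
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Show the inventory tree, then syntax-check the playbook and list its target hosts.
preflight() {
  local playbook="$1"
  run ansible-inventory --graph
  run ansible-playbook "$playbook" --syntax-check
  run ansible-playbook "$playbook" --list-hosts
}

preflight playbooks/core/manage-slurm-config.yml
```

Running it with `DRY_RUN=0` executes the three commands against the inventory configured in `ansible.cfg`.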
### Bootstrap access
```bash
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
```
### Deploy Munge
```bash
ansible-playbook playbooks/core/manage-munge.yml
```
### Deploy Slurm config
```bash
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml
```
### Validate CPU jobs
```bash
ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
```
### Validate GPU jobs
```bash
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
```
### Enable accounting
```bash
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
ansible-playbook playbooks/tests/test-sreport-usage.yml
```
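For spot checks outside the playbooks, a job's accounting record can be tested from parsable `sacct` output. This is a sketch only: `job_succeeded` is a hypothetical helper, and the sample line is illustrative rather than real cluster output.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Succeeds only when a parsable sacct line reports COMPLETED with exit code 0:0.
# Expected input: one line from `sacct -X -n -P --format=JobID,State,ExitCode`.
job_succeeded() {
  local job_id state exit_code
  IFS='|' read -r job_id state exit_code <<<"$1"
  [[ "$state" == "COMPLETED" && "$exit_code" == "0:0" ]]
}

# Hypothetical sample line; a real one would come from sacct.
if job_succeeded "42|COMPLETED|0:0"; then
  echo "job 42 recorded as successful"
fi
```

Against a live cluster the call might look like `job_succeeded "$(sacct -j "$job_id" -X -n -P --format=JobID,State,ExitCode)"`, where `-X` restricts output to the allocation line and `-P` selects pipe-delimited fields.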
### Configure QOS and fairshare
```bash
ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
```
### Provision a node
```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=<node>
ansible-playbook playbooks/tests/test-specific-node.yml -e target_node=<node>
```
### Decommission a node
```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml \
-e target_node=<node> \
-e "decom_reason=planned maintenance"
```
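The playbook body is not shown in this README. As a rough manual counterpart to its drain step, the sketch below only builds the `scontrol` command line; `drain_node_cmd` is a hypothetical helper, and the node name and reason are the runbook's example values.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print (not run) a drain command with shell-safe quoting of node and reason.
drain_node_cmd() {
  local node="$1" reason="$2"
  printf 'scontrol update NodeName=%q State=DRAIN Reason=%q\n' "$node" "$reason"
}

drain_node_cmd slurm-c02 "planned maintenance"
```

Printing the command first leaves room for review before an operator runs it on the controller.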
### Rolling OS upgrade
```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=<node>
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
-e canary_node=<node> \
-e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```
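Between the canary run and the wider rollout, an operator may want an explicit gate on the canary's node state. The sketch below assumes that `idle`, `mixed`, and `allocated` are the acceptable states; `node_state_ok` is a hypothetical helper and is not taken from the upgrade playbooks.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Accept only states in which the node is healthy and schedulable.
# Flagged states such as "down*" or "drained" fall through to failure.
node_state_ok() {
  case "$1" in
    idle|mixed|allocated) return 0 ;;
    *) return 1 ;;
  esac
}

for state in idle 'down*' drained; do
  if node_state_ok "$state"; then
    echo "$state: continue rollout"
  else
    echo "$state: stop and investigate"
  fi
done
```

On a live cluster the state could come from `sinfo -h -n "$canary_node" -o '%T' | sort -u`; the `sort -u` collapses duplicate lines when the node belongs to more than one partition.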
### Health check and auto-remediation
```bash
ansible-playbook playbooks/health/check-slurm-health.yml
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=<node>
```
### Accounting backup and restore-check
```bash
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
```
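A fetched dump can also be sanity-checked locally before anyone relies on it. This sketch reuses the 1024-byte minimum from the backup playbook and adds a `gzip -t` integrity test; `check_backup` is a hypothetical helper, and the path in the usage comment is only an example.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Fail when the dump is missing, smaller than min_bytes, or not a valid gzip stream.
check_backup() {
  local file="$1" min_bytes="${2:-1024}"
  [[ -f "$file" ]] || { echo "missing: $file" >&2; return 1; }
  local size
  size="$(wc -c <"$file")"
  (( size >= min_bytes )) || { echo "too small: ${size} bytes" >&2; return 1; }
  gzip -t "$file"
}

# Example (path is illustrative):
#   check_backup artifacts/backups/slurmdbd/slurm_acct_db-<timestamp>.sql.gz
```

The check confirms only that the archive is intact; the restore-check playbook remains the test that the dump actually imports.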
## Operational maturity
The lab goes beyond a static `slurm.conf` example: it includes the operational controls that surround a working cluster.
- Ansible workflows are designed to be repeatable and readable.
- Configuration deployment supports check and diff review before applying changes.
- Validation jobs prove CPU scheduling, GPU scheduling, cgroup behavior, accounting, and reporting.
- SlurmDBD and MariaDB accounting are configured with `sacct`, `sreport`, and `sacctmgr` validation.
- QOS, fairshare, priority, and association workflows show resource governance.
- Node lifecycle playbooks drain, decommission, reprovision, resume, and validate nodes.
- Rolling upgrade playbooks include canary validation before broader worker upgrades.
- Health and repair playbooks document remediation paths for common node states.
- Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
## Tested capabilities
- [x] CPU job scheduling.
- [x] GPU job scheduling.
- [x] GPU denial when no GRES is requested.
- [x] CPU cgroup enforcement.
- [x] SlurmDBD accounting setup.
- [x] `sacct` job history visibility.
- [x] `sreport` usage reporting.
- [x] QOS creation and validation.
- [x] Fairshare and priority visibility.
- [x] Node decommission and reprovision workflow.
- [x] Rolling upgrade canary workflow.
- [x] Node health check and auto-remediation workflow.
These checks represent sanitized lab validation, not a claim of production certification.
## Safety and sanitization
This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.
Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files. Real credentials, including SlurmDBD database passwords, belong in Ansible Vault or another approved secret store.
Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.
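One way to guard against using a plaintext secrets file by mistake is to test for the header that `ansible-vault encrypt` writes on the first line. This is a sketch: `is_vault_encrypted` is a hypothetical helper, and the path in the usage comment follows the vault example file in this project.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Succeeds when the first line of the file starts with the Ansible Vault header.
is_vault_encrypted() {
  local first_line
  IFS= read -r first_line <"$1" || true
  [[ "$first_line" == '$ANSIBLE_VAULT;'* ]]
}

# Example:
#   is_vault_encrypted inventories/lab/group_vars/vault.yml || echo "vault.yml is plaintext" >&2
```

The check looks only at the header, so it cannot tell whether the encrypted payload is valid or which password protects it.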
## Why this matters for AI/HPC infrastructure roles
AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.
This project demonstrates practical understanding of:
- Linux systems operations.
- Slurm cluster operations.
- GPU infrastructure and GRES scheduling.
- Job scheduling and resource isolation.
- Accounting, reporting, QOS, fairshare, and priority policy.
- Automation and repeatability with Ansible.
- Troubleshooting and operational ownership.
## Deeper docs
- [Runbook](docs/runbook.md)
@@ -0,0 +1,14 @@
[defaults]
inventory = ./inventories/lab/inventory.yml
host_key_checking = False
retry_files_enabled = False
stdout_callback = default
callback_result_format = yaml
interpreter_python = auto_silent
timeout = 30
roles_path = ./roles
collections_path = ./collections
[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
@@ -0,0 +1 @@
Generated backups and reports can be stored here locally. This directory is ignored by git.
@@ -0,0 +1,75 @@
# Slurm AI/HPC Lab Runbook
## Standard deployment order
```bash
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
ansible-playbook playbooks/core/manage-munge.yml
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml
ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
ansible-playbook playbooks/health/check-slurm-health.yml
```
## Node lifecycle
Provision a node:
```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
```
Decommission a node:
```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
```
Repair a node:
```bash
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
```
Run health remediation for nodes that can be recovered by the automated workflow:
```bash
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
```
Back up Slurm and Munge state before planned lifecycle work:
```bash
ansible-playbook playbooks/backup/backup-slurm-state.yml
ansible-playbook playbooks/backup/fetch-slurm-backups.yml
```
## Rolling OS upgrade
```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```
If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
@@ -0,0 +1,128 @@
---
# Example lab inventory variables. Replace addresses, users and node topology for your environment.
slurm_cluster_name: labcluster
slurm_control_machine: slurm-ctl01
slurm_control_addr: 10.10.10.11
slurm_config_dir: /etc/slurm
slurm_user: slurm
slurm_operator_user: slurmuser
slurmctld_port: 6817
slurmd_port: 6818
slurm_job_comp_type: jobcomp/none
slurm_select_type: select/cons_tres
slurm_select_type_parameters: CR_Core_Memory
slurm_return_to_service: 2
slurm_default_mpi_type: none
slurm_gres_types: gpu
slurm_nodes:
- name: slurm-c01
managed_state: present
addr: 10.10.10.12
cpus: 2
real_memory: 1800
features: ""
gres: ""
topology: ""
- name: slurm-c02
managed_state: present
addr: 10.10.10.13
cpus: 2
real_memory: 1800
features: ""
gres: ""
topology: ""
- name: gpu01
managed_state: present
addr: 10.10.10.14
cpus: 12
real_memory: 60000
features: "gpu"
gres: "gpu:1"
gres_file: /dev/nvidia0
topology: "Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2"
slurm_partitions:
- name: debug
managed_state: present
nodes: "slurm-c[01-02]"
default: "YES"
max_time: "INFINITE"
state: "UP"
- name: gpu
managed_state: present
nodes: "gpu01"
default: "NO"
max_time: "INFINITE"
state: "UP"
- name: all
managed_state: present
nodes: "slurm-c[01-02],gpu01"
default: "NO"
max_time: "INFINITE"
state: "UP"
# Cgroup enforcement
slurm_enable_cgroup: true
slurm_task_plugin: task/cgroup,task/affinity
slurm_proctrack_type: proctrack/cgroup
slurm_job_acct_gather_type: jobacct_gather/cgroup
# Slurm accounting / SlurmDBD
slurm_accounting_storage_type: accounting_storage/slurmdbd
slurm_accounting_storage_host: slurm-ctl01
slurm_accounting_storage_port: 6819
slurm_accounting_storage_enforce: associations,limits,qos
slurm_accounting_storage_tres: cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu
slurmdbd_host: slurm-ctl01
slurmdbd_port: 6819
slurmdbd_storage_type: accounting_storage/mysql
slurmdbd_storage_host: localhost
slurmdbd_storage_port: 3306
slurmdbd_storage_loc: slurm_acct_db
slurmdbd_storage_user: slurm
# Use Ansible Vault in real environments. See inventories/lab/group_vars/vault.example.yml
slurmdbd_storage_pass: "{{ vault_slurmdbd_storage_pass | default('CHANGE_ME_USE_ANSIBLE_VAULT') }}"
slurm_account_name: lab
slurm_account_description: "AI/HPC Slurm lab account"
slurm_account_organization: "labcluster"
# SlurmDBD purge / retention policy for lab
slurmdbd_commit_delay: 1
slurmdbd_purge_event_after: 12months
slurmdbd_purge_job_after: 12months
slurmdbd_purge_resv_after: 12months
slurmdbd_purge_step_after: 3months
slurmdbd_purge_suspend_after: 3months
slurmdbd_purge_txn_after: 12months
slurmdbd_purge_usage_after: 24months
# Archive is disabled for the lab; backup playbooks handle database dumps.
slurmdbd_archive_events: "no"
slurmdbd_archive_jobs: "no"
slurmdbd_archive_steps: "no"
slurmdbd_archive_suspend: "no"
slurmdbd_archive_txn: "no"
slurmdbd_archive_usage: "no"
# Slurm priority / fairshare
slurm_priority_type: priority/multifactor
slurm_priority_decay_half_life: 7-0
slurm_priority_calc_period: 5
slurm_priority_favor_small: "NO"
slurm_priority_weight_age: 1000
slurm_priority_weight_fairshare: 10000
slurm_priority_weight_job_size: 1000
slurm_priority_weight_partition: 1000
slurm_priority_weight_qos: 10000
slurm_priority_max_age: 1-0
@@ -0,0 +1,5 @@
---
# Copy this file to vault.yml and encrypt it with ansible-vault.
# ansible-vault encrypt inventories/lab/group_vars/vault.yml
vault_slurmdbd_storage_pass: CHANGE_ME
@@ -0,0 +1,24 @@
all:
vars:
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
children:
slurm_cluster:
children:
slurm_controller:
hosts:
slurm-ctl01:
ansible_host: 10.10.10.11
ansible_user: ansible
slurm_compute:
hosts:
slurm-c01:
ansible_host: 10.10.10.12
ansible_user: ansible
slurm-c02:
ansible_host: 10.10.10.13
ansible_user: ansible
slurm_gpu:
hosts:
gpu01:
ansible_host: 10.10.10.14
ansible_user: ansible
@@ -0,0 +1,90 @@
---
- name: Backup SlurmDBD MariaDB database
hosts: slurm_controller
become: true
gather_facts: true
vars:
slurmdbd_backup_dir: /var/backups/slurmdbd
local_fetch_dir: "{{ playbook_dir }}/../../artifacts/backups/slurmdbd"
tasks:
- name: Create remote backup directory
ansible.builtin.file:
path: "{{ slurmdbd_backup_dir }}"
state: directory
owner: root
group: root
mode: "0700"
- name: Create local fetch directory on Ansible controller
ansible.builtin.file:
path: "{{ local_fetch_dir }}"
state: directory
mode: "0700"
delegate_to: localhost
become: false
- name: Validate MariaDB is running
ansible.builtin.command:
cmd: systemctl is-active mariadb
changed_when: false
- name: Validate SlurmDBD is running
ansible.builtin.command:
cmd: systemctl is-active slurmdbd
changed_when: false
- name: Validate Slurm accounting database exists
ansible.builtin.shell: |
set -euo pipefail
mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';" | grep -qx "{{ slurmdbd_storage_loc }}"
args:
executable: /bin/bash
changed_when: false
- name: Dump Slurm accounting database
ansible.builtin.shell: |
set -euo pipefail
ts="$(date +%F-%H%M%S)"
out="{{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-${ts}.sql.gz"
mysqldump \
--single-transaction \
--routines \
--events \
--triggers \
{{ slurmdbd_storage_loc }} | gzip -9 > "$out"
chmod 0600 "$out"
echo "$out"
args:
executable: /bin/bash
register: db_dump
changed_when: true
- name: Validate backup file is non-empty
ansible.builtin.stat:
path: "{{ db_dump.stdout }}"
register: backup_file
- name: Fail if backup file is too small
ansible.builtin.fail:
msg: "Backup file is smaller than 1 KiB and is likely empty or truncated: {{ db_dump.stdout }}"
when: backup_file.stat.size | int < 1024
- name: Fetch DB backup to Ansible controller
ansible.builtin.fetch:
src: "{{ db_dump.stdout }}"
dest: "{{ local_fetch_dir }}/"
flat: true
- name: Show DB backup result
ansible.builtin.debug:
msg:
- "Remote backup: {{ db_dump.stdout }}"
- "Backup size bytes: {{ backup_file.stat.size }}"
- "Fetched to: {{ local_fetch_dir }}/"
@@ -0,0 +1,126 @@
---
- name: Initialize Slurm accounting entities
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Wait for sacctmgr connectivity
ansible.builtin.command:
cmd: sacctmgr -n list cluster
register: sacctmgr_cluster_list
retries: 20
delay: 2
until: sacctmgr_cluster_list.rc == 0
changed_when: false
- name: Show current accounting state before changes
ansible.builtin.shell: |
set -euo pipefail
echo "### clusters"
sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
echo
echo "### accounts"
sacctmgr list account format=Account,Descr,Org
echo
echo "### users"
sacctmgr list user format=User,DefaultAccount,Admin
echo
echo "### associations"
sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
args:
executable: /bin/bash
register: accounting_state_before
changed_when: false
- name: Print current accounting state before changes
ansible.builtin.debug:
var: accounting_state_before.stdout_lines
- name: Ensure Slurm cluster exists in accounting DB
ansible.builtin.shell: |
set -euo pipefail
if sacctmgr -n list cluster format=Cluster | awk '{print $1}' | grep -qx "{{ slurm_cluster_name }}"; then
echo "Cluster {{ slurm_cluster_name }} already exists"
else
sacctmgr -i add cluster {{ slurm_cluster_name }}
fi
args:
executable: /bin/bash
register: ensure_cluster
changed_when: "'Adding Cluster' in ensure_cluster.stdout"
- name: Ensure default lab account exists for cluster
ansible.builtin.shell: |
set -euo pipefail
if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="" {found=1} END {exit !found}'; then
echo "Account {{ slurm_account_name }} already associated with cluster {{ slurm_cluster_name }}"
else
sacctmgr -i add account {{ slurm_account_name }} \
Cluster={{ slurm_cluster_name }} \
Description="{{ slurm_account_description }}" \
Organization="{{ slurm_account_organization }}"
fi
args:
executable: /bin/bash
register: ensure_account
changed_when: "'Adding Account' in ensure_account.stdout"
- name: Ensure operator user exists with lab account association
ansible.builtin.shell: |
set -euo pipefail
if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="{{ slurm_operator_user }}" {found=1} END {exit !found}'; then
echo "User {{ slurm_operator_user }} already associated with account {{ slurm_account_name }} on cluster {{ slurm_cluster_name }}"
else
sacctmgr -i add user {{ slurm_operator_user }} \
Cluster={{ slurm_cluster_name }} \
Account={{ slurm_account_name }} \
DefaultAccount={{ slurm_account_name }}
fi
args:
executable: /bin/bash
register: ensure_user_assoc
changed_when: "'Adding User' in ensure_user_assoc.stdout"
- name: Ensure operator user has default account set
ansible.builtin.shell: |
set -euo pipefail
sacctmgr -i modify user where name={{ slurm_operator_user }} set DefaultAccount={{ slurm_account_name }}
args:
executable: /bin/bash
register: set_default_account
changed_when: "'Modified user' in (set_default_account.stdout + set_default_account.stderr)"
- name: Show final accounting state
ansible.builtin.shell: |
set -euo pipefail
echo "### clusters"
sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
echo
echo "### accounts"
sacctmgr list account format=Account,Descr,Org
echo
echo "### users"
sacctmgr list user format=User,DefaultAccount,Admin
echo
echo "### associations"
sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
args:
executable: /bin/bash
register: accounting_state_after
changed_when: false
- name: Print final accounting state
ansible.builtin.debug:
var: accounting_state_after.stdout_lines
@@ -0,0 +1,98 @@
---
- name: Restore-check latest SlurmDBD backup into test database
hosts: slurm_controller
become: true
gather_facts: false
vars:
restore_check_db: "{{ slurmdbd_storage_loc }}_restorecheck"
slurmdbd_backup_dir: /var/backups/slurmdbd
tasks:
- name: Validate MariaDB is running
ansible.builtin.command:
cmd: systemctl is-active mariadb
changed_when: false
- name: Find latest SlurmDBD backup
ansible.builtin.shell: |
set -euo pipefail
ls -1t {{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-*.sql.gz | head -n 1
args:
executable: /bin/bash
register: latest_backup
changed_when: false
- name: Validate latest backup exists
ansible.builtin.stat:
path: "{{ latest_backup.stdout }}"
register: latest_backup_stat
- name: Fail if latest backup is missing or empty
ansible.builtin.fail:
msg: "Latest SlurmDBD backup is missing or empty: {{ latest_backup.stdout }}"
when:
- not latest_backup_stat.stat.exists or latest_backup_stat.stat.size | int < 1024
- name: Recreate restore-check database
ansible.builtin.shell: |
set -euo pipefail
mysql <<SQL
DROP DATABASE IF EXISTS {{ restore_check_db }};
CREATE DATABASE {{ restore_check_db }};
SQL
args:
executable: /bin/bash
changed_when: true
- name: Import backup into restore-check database
ansible.builtin.shell: |
set -euo pipefail
zcat "{{ latest_backup.stdout }}" | mysql {{ restore_check_db }}
args:
executable: /bin/bash
changed_when: true
- name: Validate restored table count
ansible.builtin.shell: |
set -euo pipefail
mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
args:
executable: /bin/bash
register: restored_tables
changed_when: false
failed_when: restored_tables.stdout | int < 1
- name: Validate restored row count sample
ansible.builtin.shell: |
set -euo pipefail
echo "### restored database"
echo "{{ restore_check_db }}"
echo
echo "### table count"
mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
echo
echo "### largest tables"
mysql -N -B -e "
SELECT table_name, table_rows
FROM information_schema.tables
WHERE table_schema='{{ restore_check_db }}'
ORDER BY table_rows DESC
LIMIT 10;
"
args:
executable: /bin/bash
register: restore_check_summary
changed_when: false
- name: Show restore-check result
ansible.builtin.debug:
msg:
- "Imported backup: {{ latest_backup.stdout }}"
- "Restore-check DB: {{ restore_check_db }}"
- "Restored tables: {{ restored_tables.stdout }}"
- "Summary:"
- "{{ restore_check_summary.stdout_lines }}"
@@ -0,0 +1,105 @@
---
- name: Install and configure MariaDB for SlurmDBD
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Install MariaDB and SlurmDBD packages
ansible.builtin.apt:
name:
- mariadb-server
- mariadb-client
- slurmdbd
- slurm-wlm-mysql-plugin
state: present
update_cache: true
- name: Ensure MariaDB is enabled and running
ansible.builtin.systemd:
name: mariadb
enabled: true
state: started
- name: Ensure Slurm log directory exists
ansible.builtin.file:
path: /var/log/slurm
state: directory
owner: slurm
group: slurm
mode: "0755"
- name: Create Slurm accounting database and DB user
ansible.builtin.shell: |
set -euo pipefail
mysql <<SQL
CREATE DATABASE IF NOT EXISTS {{ slurmdbd_storage_loc }};
CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'localhost' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'127.0.0.1' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'localhost';
GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'127.0.0.1';
FLUSH PRIVILEGES;
SQL
args:
executable: /bin/bash
changed_when: true
no_log: true
- name: Ensure /etc/slurm exists
ansible.builtin.file:
path: /etc/slurm
state: directory
owner: root
group: root
mode: "0755"
- name: Deploy slurmdbd.conf
ansible.builtin.template:
src: ../../templates/slurmdbd.conf.j2
dest: /etc/slurm/slurmdbd.conf
owner: slurm
group: slurm
mode: "0600"
notify:
- Restart slurmdbd
- name: Ensure slurmdbd is enabled and running
ansible.builtin.systemd:
name: slurmdbd
enabled: true
state: started
- name: Flush handlers before validation
ansible.builtin.meta: flush_handlers
- name: Validate slurmdbd service is active
ansible.builtin.command:
cmd: systemctl is-active slurmdbd
register: slurmdbd_active
retries: 10
delay: 2
until: slurmdbd_active.stdout == "active"
changed_when: false
- name: Validate slurmdbd is listening on port
ansible.builtin.shell: |
set -euo pipefail
ss -lntp | grep ':{{ slurmdbd_port }} '
args:
executable: /bin/bash
register: slurmdbd_port_check
retries: 10
delay: 2
until: slurmdbd_port_check.rc == 0
changed_when: false
- name: Show slurmdbd service validation
ansible.builtin.debug:
msg:
- "slurmdbd is active"
- "{{ slurmdbd_port_check.stdout_lines }}"
handlers:
- name: Restart slurmdbd
ansible.builtin.systemd:
name: slurmdbd
state: restarted
@@ -0,0 +1,178 @@
---
- name: Validate production-like Slurm accounting setup
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Validate accounting services
ansible.builtin.shell: |
set -euo pipefail
echo "### services"
systemctl is-active mariadb
systemctl is-active slurmdbd
systemctl is-active slurmctld
echo
echo "### slurmdbd listener"
ss -lntp | grep ':{{ slurmdbd_port }} '
args:
executable: /bin/bash
register: service_check
changed_when: false
- name: Validate Slurm accounting runtime config
ansible.builtin.shell: |
set -euo pipefail
echo "### accounting config"
scontrol show config | grep -E "AccountingStorage|JobAcctGather|ClusterName"
echo
echo "### priority / select / cgroup config"
scontrol show config | grep -E "PriorityType|SelectType|TaskPlugin|ProctrackType"
args:
executable: /bin/bash
register: config_check
changed_when: false
- name: Validate sacctmgr entities
ansible.builtin.shell: |
set -euo pipefail
echo "### clusters"
sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
echo
echo "### accounts"
sacctmgr list account format=Account,Descr,Org
echo
echo "### users"
sacctmgr list user format=User,DefaultAccount,Admin
echo
echo "### associations"
sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
args:
executable: /bin/bash
register: entity_check
changed_when: false
- name: Submit accounting validation job
ansible.builtin.shell: |
set -euo pipefail
job_id="$(
sudo -iu slurmuser sbatch --parsable <<'SBATCH'
#!/bin/bash
#SBATCH --job-name=acct-prodlike-test
#SBATCH --partition=debug
#SBATCH --cpus-per-task=1
#SBATCH --mem=256M
#SBATCH --time=00:02:00
#SBATCH --output=/shared/acct-prodlike-test-%j.out
echo "HOST=$(hostname)"
echo "USER=$(whoami)"
echo "SLURM_JOB_ID=$SLURM_JOB_ID"
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
date
SBATCH
)"
echo "JOB_ID=$job_id"
for i in $(seq 1 90); do
if squeue -h -j "$job_id" | grep -q .; then
squeue -j "$job_id"
sleep 1
else
break
fi
done
echo "### sacct"
sacct -j "$job_id" --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
echo "### output"
cat "/shared/acct-prodlike-test-${job_id}.out"
args:
executable: /bin/bash
register: acct_job
changed_when: true
- name: Validate sacct can read recent jobs
ansible.builtin.shell: |
set -euo pipefail
echo "### recent jobs"
sacct -S today --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
args:
executable: /bin/bash
register: sacct_recent
changed_when: false
- name: Validate sreport commands
ansible.builtin.shell: |
set -euo pipefail
echo "### cluster utilization"
sreport cluster utilization start=today || true
echo
echo "### account utilization by user"
sreport cluster AccountUtilizationByUser start=today || true
echo
echo "### user top"
sreport user top start=today || true
args:
executable: /bin/bash
register: sreport_check
changed_when: false
- name: Validate MariaDB table health summary
ansible.builtin.shell: |
set -euo pipefail
echo "### database exists"
mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';"
echo
echo "### table count"
mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ slurmdbd_storage_loc }}';"
echo
echo "### largest tables"
mysql -N -B -e "
SELECT table_name, table_rows
FROM information_schema.tables
WHERE table_schema='{{ slurmdbd_storage_loc }}'
ORDER BY table_rows DESC
LIMIT 10;
"
args:
executable: /bin/bash
register: db_health
changed_when: false
- name: Print accounting validation
ansible.builtin.debug:
msg:
- "### services"
- "{{ service_check.stdout_lines }}"
- "### runtime config"
- "{{ config_check.stdout_lines }}"
- "### accounting entities"
- "{{ entity_check.stdout_lines }}"
- "### accounting validation job"
- "{{ acct_job.stdout_lines }}"
- "### recent sacct data"
- "{{ sacct_recent.stdout_lines }}"
- "### sreport"
- "{{ sreport_check.stdout_lines }}"
- "### database health"
- "{{ db_health.stdout_lines }}"
@@ -0,0 +1,83 @@
---
- name: Backup Slurm and Munge state on all cluster nodes
hosts: slurm_cluster
become: true
gather_facts: true
vars:
backup_base_dir: /var/backups/slurm
tasks:
- name: Create backup base directory
ansible.builtin.file:
path: "{{ backup_base_dir }}"
state: directory
owner: root
group: root
mode: "0700"
- name: Create timestamped backup directory
ansible.builtin.shell: |
set -euo pipefail
ts="$(date +%F-%H%M%S)"
dir="{{ backup_base_dir }}/$ts"
mkdir -p "$dir"
echo "$dir"
args:
executable: /bin/bash
register: backup_dir_result
changed_when: true
- name: Store backup directory fact
ansible.builtin.set_fact:
node_backup_dir: "{{ backup_dir_result.stdout }}"
- name: Backup Slurm and Munge config/state if present
ansible.builtin.shell: |
set -euo pipefail
backup_dir="{{ node_backup_dir }}"
for p in \
/etc/slurm \
/etc/slurm-llnl \
/etc/munge \
/var/spool/slurmctld \
/var/spool/slurmd \
/var/log/slurm \
/var/log/slurm-llnl
do
if [ -e "$p" ]; then
cp -a "$p" "$backup_dir/"
fi
done
systemctl status munge --no-pager > "$backup_dir/systemctl-munge.txt" 2>&1 || true
systemctl status slurmctld --no-pager > "$backup_dir/systemctl-slurmctld.txt" 2>&1 || true
systemctl status slurmd --no-pager > "$backup_dir/systemctl-slurmd.txt" 2>&1 || true
journalctl -u munge -n 200 --no-pager > "$backup_dir/journal-munge.txt" 2>&1 || true
journalctl -u slurmctld -n 200 --no-pager > "$backup_dir/journal-slurmctld.txt" 2>&1 || true
journalctl -u slurmd -n 200 --no-pager > "$backup_dir/journal-slurmd.txt" 2>&1 || true
if command -v sinfo >/dev/null 2>&1; then
sinfo > "$backup_dir/sinfo.txt" 2>&1 || true
fi
if command -v scontrol >/dev/null 2>&1; then
scontrol show config > "$backup_dir/scontrol-show-config.txt" 2>&1 || true
scontrol show nodes > "$backup_dir/scontrol-show-nodes.txt" 2>&1 || true
scontrol show partitions > "$backup_dir/scontrol-show-partitions.txt" 2>&1 || true
fi
find "$backup_dir" -maxdepth 2 \( -type f -o -type d \)
args:
executable: /bin/bash
register: backup_content
changed_when: true
- name: Show backup location on node
ansible.builtin.debug:
msg:
- "Host: {{ inventory_hostname }}"
- "Backup directory: {{ node_backup_dir }}"
@@ -0,0 +1,46 @@
---
- name: Fetch latest Slurm backups from nodes to the Ansible control node
hosts: slurm_cluster
become: true
gather_facts: false
vars:
remote_backup_base: /var/backups/slurm
local_backup_base: "{{ playbook_dir }}/../../artifacts/backups"
tasks:
- name: Find latest remote backup directory
ansible.builtin.shell: |
set -euo pipefail
ls -1dt {{ remote_backup_base }}/* | head -n 1
args:
executable: /bin/bash
register: latest_backup_dir
changed_when: false
- name: Create local backup directory on the Ansible control node
ansible.builtin.file:
path: "{{ local_backup_base }}/{{ inventory_hostname }}"
state: directory
mode: "0700"
delegate_to: localhost
become: false
- name: Archive latest backup directory on remote node
community.general.archive:
path: "{{ latest_backup_dir.stdout }}"
dest: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
format: gz
force_archive: true
changed_when: true
- name: Fetch archive to the Ansible control node
ansible.builtin.fetch:
src: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
dest: "{{ local_backup_base }}/{{ inventory_hostname }}/"
flat: true
- name: Remove temporary remote archive
ansible.builtin.file:
path: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
state: absent
@@ -0,0 +1,58 @@
---
- name: Bootstrap Ansible SSH access from the control node to Slurm nodes
hosts: slurm_cluster
gather_facts: false
become: true
vars:
ansible_controller_pubkey: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_ed25519.pub') }}"
pre_tasks:
- name: Wait for SSH
ansible.builtin.wait_for_connection:
timeout: 30
- name: Install Python if missing - Debian/Ubuntu
ansible.builtin.raw: |
test -e /usr/bin/python3 || (apt-get update && apt-get install -y python3)
changed_when: false
tasks:
- name: Ensure sudo and OpenSSH server are installed
ansible.builtin.apt:
name:
- sudo
- openssh-server
state: present
update_cache: true
- name: Ensure SSH server is enabled and running
ansible.builtin.service:
name: ssh
state: started
enabled: true
- name: Ensure .ssh directory exists for login user
ansible.builtin.file:
path: "/home/{{ ansible_user }}/.ssh"
state: directory
owner: "{{ ansible_user }}"
group: "{{ ansible_user }}"
mode: "0700"
- name: Add control node public key to login user's authorized_keys
ansible.posix.authorized_key:
user: "{{ ansible_user }}"
key: "{{ ansible_controller_pubkey }}"
state: present
manage_dir: true
- name: Allow bootstrap login user passwordless sudo
ansible.builtin.copy:
dest: "/etc/sudoers.d/90-ansible-{{ ansible_user }}"
owner: root
group: root
mode: "0440"
content: |
{{ ansible_user }} ALL=(ALL) NOPASSWD:ALL
validate: "visudo -cf %s"
@@ -0,0 +1,16 @@
---
- name: Configure /etc/hosts for Slurm cluster
hosts: slurm_cluster
become: true
gather_facts: false
tasks:
- name: Add Slurm cluster hosts to /etc/hosts
ansible.builtin.blockinfile:
path: /etc/hosts
marker: "# {mark} ANSIBLE MANAGED SLURM CLUSTER HOSTS"
block: |
{{ slurm_control_addr }} {{ slurm_control_machine }}
{% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
{{ node.addr }} {{ node.name }}
{% endfor %}
@@ -0,0 +1,218 @@
---
- name: Create slurmuser and generate SSH keys on every Slurm node
hosts: slurm_cluster
become: true
gather_facts: true
vars:
slurm_operator_user: slurmuser
slurm_operator_shell: /bin/bash
tasks:
- name: Ensure sudo, OpenSSH, and ACL packages are installed
ansible.builtin.apt:
name:
- sudo
- openssh-client
- openssh-server
- acl
state: present
update_cache: true
- name: Ensure slurmuser exists
ansible.builtin.user:
name: "{{ slurm_operator_user }}"
shell: "{{ slurm_operator_shell }}"
create_home: true
state: present
- name: Ensure .ssh directory exists for slurmuser
ansible.builtin.file:
path: "/home/{{ slurm_operator_user }}/.ssh"
state: directory
owner: "{{ slurm_operator_user }}"
group: "{{ slurm_operator_user }}"
mode: "0700"
- name: Generate SSH key for slurmuser if missing
ansible.builtin.openssh_keypair:
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
type: ed25519
owner: "{{ slurm_operator_user }}"
group: "{{ slurm_operator_user }}"
mode: "0600"
comment: "{{ slurm_operator_user }}@{{ inventory_hostname }}"
force: false
- name: Read public key from each node
ansible.builtin.slurp:
src: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
register: slurmuser_pubkey_raw
- name: Store decoded public key as host fact
ansible.builtin.set_fact:
slurmuser_pubkey: "{{ slurmuser_pubkey_raw.content | b64decode | trim }}"
- name: Exchange slurmuser SSH keys across all Slurm nodes
hosts: slurm_cluster
become: true
gather_facts: false
vars:
slurm_operator_user: slurmuser
tasks:
- name: Install all slurmuser public keys into authorized_keys on every node
ansible.builtin.authorized_key:
user: "{{ slurm_operator_user }}"
key: "{{ hostvars[item].slurmuser_pubkey }}"
state: present
manage_dir: true
loop: "{{ groups['slurm_cluster'] }}"
- name: Build SSH known_hosts entries for all cluster nodes
ansible.builtin.shell: |
set -e
mkdir -p /home/{{ slurm_operator_user }}/.ssh
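# Rebuild the file on every run: hashed (-H) entries carry a random salt, so sort -u cannot deduplicate appended re-scans.
: > /home/{{ slurm_operator_user }}/.ssh/known_hosts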
{% for host in groups['slurm_cluster'] %}
ssh-keyscan -H {{ host }} {{ hostvars[host].ansible_host }} 2>/dev/null >> /home/{{ slurm_operator_user }}/.ssh/known_hosts || true
{% endfor %}
sort -u /home/{{ slurm_operator_user }}/.ssh/known_hosts -o /home/{{ slurm_operator_user }}/.ssh/known_hosts
chown {{ slurm_operator_user }}:{{ slurm_operator_user }} /home/{{ slurm_operator_user }}/.ssh/known_hosts
chmod 0644 /home/{{ slurm_operator_user }}/.ssh/known_hosts
args:
executable: /bin/bash
changed_when: true
- name: Ensure SSH permissions are correct
ansible.builtin.file:
path: "/home/{{ slurm_operator_user }}/.ssh"
state: directory
owner: "{{ slurm_operator_user }}"
group: "{{ slurm_operator_user }}"
mode: "0700"
- name: Ensure private key permissions are correct
ansible.builtin.file:
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
owner: "{{ slurm_operator_user }}"
group: "{{ slurm_operator_user }}"
mode: "0600"
- name: Ensure public key permissions are correct
ansible.builtin.file:
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
owner: "{{ slurm_operator_user }}"
group: "{{ slurm_operator_user }}"
mode: "0644"
- name: Configure sudo permissions for slurmuser
hosts: slurm_cluster
become: true
gather_facts: false
vars:
slurm_operator_user: slurmuser
tasks:
- name: Configure sudoers for slurmuser on Slurm controller
ansible.builtin.copy:
dest: /etc/sudoers.d/91-slurmuser-slurm-controller
owner: root
group: root
mode: "0440"
content: |
# Managed by Ansible
# Operator access for Slurm controller node.
{{ slurm_operator_user }} ALL=(root) NOPASSWD: \
/bin/systemctl status slurmctld, \
/bin/systemctl restart slurmctld, \
/bin/systemctl reload slurmctld, \
/bin/systemctl stop slurmctld, \
/bin/systemctl start slurmctld, \
/bin/systemctl status slurmd, \
/bin/systemctl restart slurmd, \
/bin/systemctl reload slurmd, \
/bin/systemctl stop slurmd, \
/bin/systemctl start slurmd, \
/bin/journalctl -u slurmctld, \
/bin/journalctl -u slurmd, \
/usr/bin/scontrol, \
/usr/bin/sinfo, \
/usr/bin/squeue, \
/usr/bin/scancel, \
/usr/bin/sacct, \
/usr/bin/sacctmgr, \
/usr/bin/sbatch, \
/usr/bin/srun, \
/usr/bin/salloc
validate: "visudo -cf %s"
when: inventory_hostname in groups['slurm_controller']
- name: Configure sudoers for slurmuser on Slurm compute and GPU nodes
ansible.builtin.copy:
dest: /etc/sudoers.d/91-slurmuser-slurm-compute
owner: root
group: root
mode: "0440"
content: |
# Managed by Ansible
# Operator access for Slurm worker/GPU nodes.
{{ slurm_operator_user }} ALL=(root) NOPASSWD: \
/bin/systemctl status slurmd, \
/bin/systemctl restart slurmd, \
/bin/systemctl reload slurmd, \
/bin/systemctl stop slurmd, \
/bin/systemctl start slurmd, \
/bin/journalctl -u slurmd, \
/usr/bin/scontrol, \
/usr/bin/sinfo, \
/usr/bin/squeue, \
/usr/bin/scancel, \
/usr/bin/sacct, \
/usr/bin/sbatch, \
/usr/bin/srun, \
/usr/bin/salloc
validate: "visudo -cf %s"
when: inventory_hostname not in groups['slurm_controller']
- name: Validate slurmuser SSH mesh and Slurm access
hosts: slurm_cluster
become: true
gather_facts: false
vars:
slurm_operator_user: slurmuser
tasks:
- name: Test local Slurm commands as slurmuser
ansible.builtin.command: "sudo -iu {{ slurm_operator_user }} sinfo"
register: sinfo_test
changed_when: false
failed_when: sinfo_test.rc != 0
- name: Show sinfo result
ansible.builtin.debug:
var: sinfo_test.stdout_lines
- name: Test SSH from each node to every other node as slurmuser
ansible.builtin.shell: |
set -e
{% for host in groups['slurm_cluster'] %}
ssh -o BatchMode=yes -o ConnectTimeout=5 {{ host }} 'hostname'
{% endfor %}
args:
executable: /bin/bash
become_user: "{{ slurm_operator_user }}"
register: ssh_mesh_test
changed_when: false
- name: Show SSH mesh test result
ansible.builtin.debug:
var: ssh_mesh_test.stdout_lines
@@ -0,0 +1,112 @@
---
- name: Fix sudo permissions for slurmuser Slurm operations
hosts: slurm_cluster
become: true
gather_facts: false
vars:
slurm_operator_user: slurmuser
tasks:
- name: Configure sudoers for slurmuser on controller
ansible.builtin.copy:
dest: /etc/sudoers.d/91-slurmuser-slurm-controller
owner: root
group: root
mode: "0440"
content: |
# Managed by Ansible
Cmnd_Alias SLURM_SYSTEMCTL_CONTROLLER = \
/bin/systemctl status slurmctld, \
/bin/systemctl status slurmctld *, \
/bin/systemctl restart slurmctld, \
/bin/systemctl reload slurmctld, \
/bin/systemctl start slurmctld, \
/bin/systemctl stop slurmctld, \
/bin/systemctl status slurmd, \
/bin/systemctl status slurmd *, \
/bin/systemctl restart slurmd, \
/bin/systemctl reload slurmd, \
/bin/systemctl start slurmd, \
/bin/systemctl stop slurmd, \
/usr/bin/systemctl status slurmctld, \
/usr/bin/systemctl status slurmctld *, \
/usr/bin/systemctl restart slurmctld, \
/usr/bin/systemctl reload slurmctld, \
/usr/bin/systemctl start slurmctld, \
/usr/bin/systemctl stop slurmctld, \
/usr/bin/systemctl status slurmd, \
/usr/bin/systemctl status slurmd *, \
/usr/bin/systemctl restart slurmd, \
/usr/bin/systemctl reload slurmd, \
/usr/bin/systemctl start slurmd, \
/usr/bin/systemctl stop slurmd
Cmnd_Alias SLURM_JOURNAL_CONTROLLER = \
/bin/journalctl -u slurmctld, \
/bin/journalctl -u slurmctld *, \
/bin/journalctl -u slurmd, \
/bin/journalctl -u slurmd *, \
/usr/bin/journalctl -u slurmctld, \
/usr/bin/journalctl -u slurmctld *, \
/usr/bin/journalctl -u slurmd, \
/usr/bin/journalctl -u slurmd *
Cmnd_Alias SLURM_COMMANDS = \
/usr/bin/scontrol, /usr/bin/scontrol *, \
/usr/bin/sinfo, /usr/bin/sinfo *, \
/usr/bin/squeue, /usr/bin/squeue *, \
/usr/bin/scancel, /usr/bin/scancel *, \
/usr/bin/sacct, /usr/bin/sacct *, \
/usr/bin/sacctmgr, /usr/bin/sacctmgr *, \
/usr/bin/sbatch, /usr/bin/sbatch *, \
/usr/bin/srun, /usr/bin/srun *, \
/usr/bin/salloc, /usr/bin/salloc *
{{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_CONTROLLER, SLURM_JOURNAL_CONTROLLER, SLURM_COMMANDS
validate: "visudo -cf %s"
when: inventory_hostname in groups['slurm_controller']
- name: Configure sudoers for slurmuser on compute and GPU nodes
ansible.builtin.copy:
dest: /etc/sudoers.d/91-slurmuser-slurm-compute
owner: root
group: root
mode: "0440"
content: |
# Managed by Ansible
Cmnd_Alias SLURM_SYSTEMCTL_COMPUTE = \
/bin/systemctl status slurmd, \
/bin/systemctl status slurmd *, \
/bin/systemctl restart slurmd, \
/bin/systemctl reload slurmd, \
/bin/systemctl start slurmd, \
/bin/systemctl stop slurmd, \
/usr/bin/systemctl status slurmd, \
/usr/bin/systemctl status slurmd *, \
/usr/bin/systemctl restart slurmd, \
/usr/bin/systemctl reload slurmd, \
/usr/bin/systemctl start slurmd, \
/usr/bin/systemctl stop slurmd
Cmnd_Alias SLURM_JOURNAL_COMPUTE = \
/bin/journalctl -u slurmd, \
/bin/journalctl -u slurmd *, \
/usr/bin/journalctl -u slurmd, \
/usr/bin/journalctl -u slurmd *
Cmnd_Alias SLURM_COMMANDS = \
/usr/bin/scontrol, /usr/bin/scontrol *, \
/usr/bin/sinfo, /usr/bin/sinfo *, \
/usr/bin/squeue, /usr/bin/squeue *, \
/usr/bin/scancel, /usr/bin/scancel *, \
/usr/bin/sacct, /usr/bin/sacct *, \
/usr/bin/sbatch, /usr/bin/sbatch *, \
/usr/bin/srun, /usr/bin/srun *, \
/usr/bin/salloc, /usr/bin/salloc *
{{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_COMPUTE, SLURM_JOURNAL_COMPUTE, SLURM_COMMANDS
validate: "visudo -cf %s"
when: inventory_hostname not in groups['slurm_controller']
@@ -0,0 +1,133 @@
---
- name: Read Munge key from Slurm controller
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Check controller munge.key exists
ansible.builtin.stat:
path: /etc/munge/munge.key
register: controller_munge_key
- name: Fail if controller munge.key is missing
ansible.builtin.fail:
msg: "/etc/munge/munge.key is missing on controller. Do not continue."
when: not controller_munge_key.stat.exists
- name: Read controller munge.key
ansible.builtin.slurp:
src: /etc/munge/munge.key
register: controller_munge_key_raw
- name: Store controller Munge key as fact
ansible.builtin.set_fact:
cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
- name: Deploy controller Munge key to all Slurm nodes
hosts: slurm_cluster
become: true
gather_facts: false
vars:
controller_host: "{{ groups['slurm_controller'][0] }}"
tasks:
- name: Ensure munge package is installed
ansible.builtin.apt:
name:
- munge
- libmunge2
state: present
update_cache: true
- name: Ensure munge group exists
ansible.builtin.group:
name: munge
system: true
state: present
- name: Ensure munge user exists
ansible.builtin.user:
name: munge
group: munge
system: true
shell: /usr/sbin/nologin
home: /nonexistent
create_home: false
state: present
- name: Ensure /etc/munge exists
ansible.builtin.file:
path: /etc/munge
state: directory
owner: munge
group: munge
mode: "0700"
- name: Deploy shared munge.key from controller
ansible.builtin.copy:
dest: /etc/munge/munge.key
content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
owner: munge
group: munge
mode: "0400"
notify:
- Restart munge
- name: Ensure /var/log/munge exists
ansible.builtin.file:
path: /var/log/munge
state: directory
owner: munge
group: munge
mode: "0755"
- name: Ensure /var/lib/munge exists
ansible.builtin.file:
path: /var/lib/munge
state: directory
owner: munge
group: munge
mode: "0711"
- name: Ensure /run/munge exists
ansible.builtin.file:
path: /run/munge
state: directory
owner: munge
group: munge
mode: "0755"
- name: Ensure munge is enabled and running
ansible.builtin.systemd:
name: munge
enabled: true
state: started
handlers:
- name: Restart munge
ansible.builtin.systemd:
name: munge
state: restarted
- name: Validate Munge locally on all nodes
hosts: slurm_cluster
become: true
gather_facts: false
tasks:
- name: Test local munge encode/decode
ansible.builtin.shell: |
set -euo pipefail
munge -n | unmunge
args:
executable: /bin/bash
register: munge_local_test
changed_when: false
- name: Show local Munge validation
ansible.builtin.debug:
var: munge_local_test.stdout_lines
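
Review note: the plays above copy the controller's `munge.key` to every node and then only test a local encode/decode. A cheap extra check is to confirm that all nodes hold byte-identical key material by comparing checksums. The sketch below is one way to do that, assuming the `host checksum` pairs were collected beforehand (for example with `sha256sum /etc/munge/munge.key` on each node). The function name `find_key_mismatches`, the host names, and the shortened checksums are all hypothetical.

```shell
# Hypothetical helper: reads "host checksum" pairs on stdin and prints every
# host whose checksum differs from the reference host given as $1.
find_key_mismatches() {
  local reference_host="$1"
  awk -v ref="$reference_host" '
    { sum[$1] = $2; order[NR] = $1 }
    END {
      if (!(ref in sum)) { print "reference host missing: " ref; exit 2 }
      for (i = 1; i <= NR; i++)
        if (sum[order[i]] != sum[ref]) print order[i]
    }
  '
}

# Invented sample data: gpu01 holds a different key than the controller.
find_key_mismatches ctl01 <<'EOF'
ctl01 aaa111
cpu01 aaa111
gpu01 bbb222
EOF
```

An empty result means every listed node matches the reference host. Any printed host would be a candidate for redeploying the key before restarting `munge`.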
@@ -0,0 +1,132 @@
---
- name: Prepare Slurm config directories and logs
hosts: slurm_cluster
become: true
gather_facts: false
tasks:
- name: Ensure Slurm config directory exists
ansible.builtin.file:
path: "{{ slurm_config_dir }}"
state: directory
owner: root
group: root
mode: "0755"
- name: Ensure Slurm log directory exists
ansible.builtin.file:
path: /var/log/slurm
state: directory
owner: slurm
group: slurm
mode: "0755"
- name: Ensure slurmctld spool directory exists on controller
ansible.builtin.file:
path: /var/spool/slurmctld
state: directory
owner: slurm
group: slurm
mode: "0755"
when: inventory_hostname in groups['slurm_controller']
- name: Ensure slurmd spool directory exists on workers
ansible.builtin.file:
path: /var/spool/slurmd
state: directory
owner: slurm
group: slurm
mode: "0755"
when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
- name: Deploy Slurm config files
hosts: slurm_cluster
become: true
gather_facts: false
tasks:
- name: Backup current slurm.conf before managed deployment
ansible.builtin.copy:
src: "{{ slurm_config_dir }}/slurm.conf"
dest: "{{ slurm_config_dir }}/slurm.conf.pre-ansible-managed"
remote_src: true
owner: root
group: root
mode: "0644"
force: false
- name: Deploy managed slurm.conf
ansible.builtin.template:
src: ../../templates/slurm.conf.j2
dest: "{{ slurm_config_dir }}/slurm.conf"
owner: root
group: root
mode: "0644"
notify:
- Reconfigure slurmctld
- Restart slurmd
- name: Deploy managed cgroup.conf
ansible.builtin.template:
src: ../../templates/cgroup.conf.j2
dest: "{{ slurm_config_dir }}/cgroup.conf"
owner: root
group: root
mode: "0644"
when: slurm_enable_cgroup | default(false) | bool
notify:
- Reconfigure slurmctld
- Restart slurmd
- name: Deploy managed gres.conf only on GPU nodes
ansible.builtin.template:
src: ../../templates/gres.conf.j2
dest: "{{ slurm_config_dir }}/gres.conf"
owner: root
group: root
mode: "0644"
when: inventory_hostname in groups['slurm_gpu']
notify:
- Reconfigure slurmctld
- Restart slurmd
handlers:
- name: Reconfigure slurmctld
ansible.builtin.command:
cmd: scontrol reconfigure
when: inventory_hostname in groups['slurm_controller']
changed_when: true
- name: Restart slurmd
ansible.builtin.systemd:
name: slurmd
state: restarted
when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
- name: Validate Slurm after config deployment
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Reconfigure controller
ansible.builtin.command:
cmd: scontrol reconfigure
changed_when: true
- name: Validate cluster state
ansible.builtin.shell: |
set -euo pipefail
scontrol ping
sinfo
scontrol show nodes
args:
executable: /bin/bash
register: slurm_config_validation
changed_when: false
- name: Show validation output
ansible.builtin.debug:
var: slurm_config_validation.stdout_lines
@@ -0,0 +1,103 @@
---
- name: Restart Slurm controller safely
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Restart munge on controller
ansible.builtin.systemd:
name: munge
state: restarted
enabled: true
- name: Restart slurmctld on controller
ansible.builtin.systemd:
name: slurmctld
state: restarted
enabled: true
- name: Wait for slurmctld to answer
ansible.builtin.command:
cmd: scontrol ping
register: scontrol_ping
retries: 15
delay: 2
until: scontrol_ping.rc == 0
changed_when: false
- name: Show controller ping
ansible.builtin.debug:
var: scontrol_ping.stdout_lines
- name: Restart Slurm workers safely one by one
hosts: slurm_compute:slurm_gpu
become: true
gather_facts: false
serial: 1
tasks:
- name: Restart munge on worker
ansible.builtin.systemd:
name: munge
state: restarted
enabled: true
- name: Restart slurmd on worker
ansible.builtin.systemd:
name: slurmd
state: restarted
enabled: true
- name: Wait for slurmd to be active
ansible.builtin.command:
cmd: systemctl is-active slurmd
register: slurmd_active
retries: 15
delay: 2
until: slurmd_active.stdout == "active"
changed_when: false
- name: Wait until this node is visible in Slurm
ansible.builtin.command:
cmd: scontrol show node {{ inventory_hostname }}
delegate_to: "{{ groups['slurm_controller'][0] }}"
register: node_visible
retries: 15
delay: 2
until: node_visible.rc == 0
changed_when: false
- name: Validate Slurm after restart
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Validate Slurm cluster state
ansible.builtin.shell: |
set -euo pipefail
echo "### scontrol ping"
scontrol ping
echo
echo "### sinfo"
sinfo
echo
echo "### nodes"
scontrol show nodes
echo
echo "### partitions"
scontrol show partitions
args:
executable: /bin/bash
register: slurm_validation
changed_when: false
- name: Show Slurm validation
ansible.builtin.debug:
var: slurm_validation.stdout_lines
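
Review note: the restart plays above wait for `scontrol ping` and `systemctl is-active slurmd` with `retries`, `delay`, and `until`. For anyone reproducing the same wait by hand on a node, a rough shell equivalent is sketched below. `retry_until` is a hypothetical helper, not something shipped in this repository, and the stand-in command `flaky` only exists to make the example runnable without a cluster.

```shell
# retry_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS runs have failed.
retry_until() {
  local attempts="$1" delay="$2" n=1
  shift 2
  while ! "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Stand-in for a service that becomes healthy on the third check.
checks=0
flaky() {
  checks=$((checks + 1))
  [ "$checks" -ge 3 ]
}

retry_until 5 0 flaky && echo "succeeded after $checks checks"
# On a real node the call might look like: retry_until 15 2 scontrol ping
```

The helper returns non-zero when it gives up, so it can gate later steps in a script in the same way a failed `until` stops the play.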
@@ -0,0 +1,40 @@
---
- name: Discover node resources for Slurm config
hosts: slurm_cluster
become: true
gather_facts: true
tasks:
- name: Discover CPU and memory
ansible.builtin.shell: |
set -euo pipefail
echo "HOST={{ inventory_hostname }}"
echo "CPUS=$(nproc)"
echo "REAL_MEMORY_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)"
echo "SOCKETS=$(lscpu | awk -F: '/Socket\(s\)/ {gsub(/ /,"",$2); print $2}')"
echo "CORES_PER_SOCKET=$(lscpu | awk -F: '/Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')"
echo "THREADS_PER_CORE=$(lscpu | awk -F: '/Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')"
args:
executable: /bin/bash
register: cpu_mem
changed_when: false
- name: Discover NVIDIA GPU if present
ansible.builtin.shell: |
set -euo pipefail
if command -v nvidia-smi >/dev/null 2>&1; then
nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
else
echo "NO_NVIDIA_SMI"
fi
args:
executable: /bin/bash
register: gpu_info
changed_when: false
- name: Show discovered resources
ansible.builtin.debug:
msg:
- "{{ cpu_mem.stdout_lines }}"
- "GPU:"
- "{{ gpu_info.stdout_lines }}"
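
Review note: the discovery play above extracts socket, core, and thread counts from `lscpu` so they can feed the node definitions in `slurm.conf`. The sketch below shows the same extraction against canned `lscpu`-style text, matching the label exactly instead of by regex so the parentheses need no escaping. The helper `lscpu_field`, the sample text, and the node name `node01` are assumptions for illustration.

```shell
# lscpu_field LABEL: print the value of the first "LABEL: value" line on stdin.
lscpu_field() {
  awk -F: -v key="$1" '
    {
      label = $1
      gsub(/^[ \t]+|[ \t]+$/, "", label)   # tolerate indented labels
      if (label == key) {
        value = $2
        gsub(/^[ \t]+|[ \t]+$/, "", value)
        print value
        exit
      }
    }
  '
}

# Canned sample standing in for real lscpu output.
sample='Architecture:        x86_64
CPU(s):              16
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           1'

sockets=$(printf '%s\n' "$sample" | lscpu_field 'Socket(s)')
cores=$(printf '%s\n' "$sample" | lscpu_field 'Core(s) per socket')
threads=$(printf '%s\n' "$sample" | lscpu_field 'Thread(s) per core')

echo "NodeName=node01 Sockets=$sockets CoresPerSocket=$cores ThreadsPerCore=$threads"
```

On a host with Slurm installed, `slurmd -C` prints the detected hardware in `slurm.conf` syntax and can serve as a cross-check for values gathered this way.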
@@ -0,0 +1,89 @@
---
- name: Inspect current Slurm and Munge state
hosts: slurm_cluster
become: true
gather_facts: true
tasks:
- name: Basic host info
ansible.builtin.shell: |
set -e
echo "HOST=$(hostname -f 2>/dev/null || hostname)"
echo "SHORT_HOST=$(hostname -s)"
echo "IP_ADDRESSES=$(hostname -I)"
echo "OS=$(lsb_release -ds 2>/dev/null || grep PRETTY_NAME /etc/os-release || true)"
echo "KERNEL=$(uname -r)"
args:
executable: /bin/bash
register: host_info
changed_when: false
- name: Slurm package info
ansible.builtin.shell: |
dpkg -l | grep -Ei 'slurm|munge' || true
args:
executable: /bin/bash
register: package_info
changed_when: false
- name: Slurm config paths
ansible.builtin.shell: |
set -e
for p in /etc/slurm /etc/slurm-llnl /etc/munge; do
echo "### $p"
if [ -e "$p" ]; then
find "$p" -maxdepth 2 -type f -printf "%m %u %g %p\n" | sort
else
echo "MISSING"
fi
done
args:
executable: /bin/bash
register: config_paths
changed_when: false
- name: Service state
ansible.builtin.shell: |
for s in munge slurmctld slurmd; do
echo "### $s"
systemctl is-enabled "$s" 2>/dev/null || true
systemctl is-active "$s" 2>/dev/null || true
done
args:
executable: /bin/bash
register: service_state
changed_when: false
- name: Slurm commands
ansible.builtin.shell: |
echo "### which"
command -v sinfo || true
command -v scontrol || true
command -v sbatch || true
command -v srun || true
command -v munge || true
command -v unmunge || true
echo "### sinfo"
sinfo 2>&1 || true
echo "### scontrol ping"
scontrol ping 2>&1 || true
args:
executable: /bin/bash
register: slurm_commands
changed_when: false
- name: Show inspection report
ansible.builtin.debug:
msg:
- "===== {{ inventory_hostname }} :: host_info ====="
- "{{ host_info.stdout_lines }}"
- "===== {{ inventory_hostname }} :: packages ====="
- "{{ package_info.stdout_lines }}"
- "===== {{ inventory_hostname }} :: config_paths ====="
- "{{ config_paths.stdout_lines }}"
- "===== {{ inventory_hostname }} :: services ====="
- "{{ service_state.stdout_lines }}"
- "===== {{ inventory_hostname }} :: slurm_commands ====="
- "{{ slurm_commands.stdout_lines }}"
@@ -0,0 +1,216 @@
---
- name: Detect problematic Slurm nodes
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Detect nodes needing remediation
ansible.builtin.shell: |
set -euo pipefail
sinfo -N -h -o "%N %T" | awk '
tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
' | sort -u
args:
executable: /bin/bash
register: bad_nodes_raw
changed_when: false
- name: Store bad node list
ansible.builtin.set_fact:
bad_nodes: "{{ bad_nodes_raw.stdout_lines }}"
- name: Show detected problematic nodes
ansible.builtin.debug:
var: bad_nodes
- name: Attempt auto-remediation on problematic nodes
hosts: slurm_compute:slurm_gpu
become: true
gather_facts: false
serial: 1
vars:
bad_nodes_from_controller: "{{ hostvars[groups['slurm_controller'][0]].bad_nodes | default([]) }}"
tasks:
- name: Skip healthy nodes
ansible.builtin.meta: end_host
when: inventory_hostname not in bad_nodes_from_controller
- name: Restart Munge
ansible.builtin.systemd:
name: munge
state: restarted
enabled: true
- name: Restart slurmd
ansible.builtin.systemd:
name: slurmd
state: restarted
enabled: true
- name: Validate local services after remediation attempt
ansible.builtin.shell: |
set -euo pipefail
echo "HOST=$(hostname)"
echo
echo "### services"
systemctl is-active munge
systemctl is-active slurmd
echo
echo "### munge"
munge -n | unmunge >/dev/null
echo "munge OK"
echo
echo "### controller ping"
scontrol ping
echo
echo "### slurmd listener"
ss -lntp | grep ':6818 ' || true
echo
echo "### recent slurmd logs"
journalctl -u slurmd -n 30 --no-pager || true
args:
executable: /bin/bash
register: local_repair_check
changed_when: false
- name: Print local remediation result
ansible.builtin.debug:
var: local_repair_check.stdout_lines
- name: Refresh controller and validate remediated nodes
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Restart slurmctld to refresh node states
ansible.builtin.systemd:
name: slurmctld
state: restarted
- name: Wait for controller
ansible.builtin.command:
cmd: scontrol ping
register: slurmctld_ping
retries: 15
delay: 2
until: slurmctld_ping.rc == 0
changed_when: false
- name: Clear maintenance state on previously bad nodes
ansible.builtin.shell: |
set -euo pipefail
bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
if [ -z "$bad_nodes" ]; then
echo "No bad nodes detected. Nothing to clear."
sinfo -N
exit 0
fi
for node in $bad_nodes; do
echo "### clearing state on $node"
scontrol update NodeName="$node" State=RESUME 2>/dev/null || true
scontrol update NodeName="$node" State=UNDRAIN 2>/dev/null || true
scontrol update NodeName="$node" State=IDLE 2>/dev/null || true
done
sleep 5
sinfo -N
args:
executable: /bin/bash
register: clear_result
changed_when: true
- name: Print clear-state result
ansible.builtin.debug:
var: clear_result.stdout_lines
- name: Detect nodes still unhealthy after remediation
ansible.builtin.shell: |
set -euo pipefail
sinfo -N -h -o "%N %T" | awk '
tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
' | sort -u
args:
executable: /bin/bash
register: still_bad_nodes_raw
changed_when: false
- name: Store still bad nodes
ansible.builtin.set_fact:
still_bad_nodes: "{{ still_bad_nodes_raw.stdout_lines }}"
- name: Drain nodes that remain unhealthy
ansible.builtin.shell: |
set -euo pipefail
unresolved_nodes="{{ still_bad_nodes | join(' ') }}"
if [ -z "$unresolved_nodes" ]; then
echo "No unresolved unhealthy nodes."
sinfo -N
exit 0
fi
for node in $unresolved_nodes; do
echo "### draining unresolved node $node"
scontrol update NodeName="$node" State=DRAIN Reason="auto-remediation failed"
done
sinfo -N
args:
executable: /bin/bash
register: drain_unresolved
changed_when: still_bad_nodes | length > 0
- name: Show remediation summary
ansible.builtin.shell: |
set -euo pipefail
echo "### initial bad nodes"
bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
if [ -z "$bad_nodes" ]; then
echo "none"
else
printf '%s\n' $bad_nodes
fi
echo
echo "### still bad nodes"
still_bad_nodes="{{ (still_bad_nodes | default([])) | join(' ') }}"
if [ -z "$still_bad_nodes" ]; then
echo "none"
else
printf '%s\n' $still_bad_nodes
fi
echo
echo "### final sinfo"
sinfo -N
echo
echo "### queue"
squeue
args:
executable: /bin/bash
register: remediation_summary
changed_when: false
- name: Print remediation summary
ansible.builtin.debug:
var: remediation_summary.stdout_lines
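
Review note: the remediation plays above decide which nodes to touch by filtering `sinfo -N -h -o "%N %T"` output with `awk`. The sketch below runs that kind of filter against invented sample lines so its behaviour can be checked without a cluster. `filter_bad_nodes` and the node names are hypothetical. The `\*` in the pattern matches the literal asterisk that `sinfo` appends to the state of a node that is not responding.

```shell
# Reads "node state" pairs on stdin and prints the unique names of nodes
# whose state suggests they need attention.
filter_bad_nodes() {
  awk 'tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}' | sort -u
}

# Invented sample: cpu02 is idle but not responding, gpu02 is down, cpu03 is drained.
filter_bad_nodes <<'EOF'
cpu01 idle
cpu02 idle*
gpu01 mixed
gpu02 down*
cpu03 drained
EOF
```

Healthy `idle` and `mixed` nodes are left out, which is what keeps the remediation play from restarting services on nodes that are working.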
@@ -0,0 +1,149 @@
---
- name: Check Slurm controller health
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Check controller services and cluster state
ansible.builtin.shell: |
set -euo pipefail
echo "### controller services"
systemctl is-active munge
systemctl is-active slurmctld
systemctl is-active slurmdbd || true
systemctl is-active mariadb || true
echo
echo "### slurm ping"
scontrol ping
echo
echo "### nodes"
sinfo -N
echo
echo "### partitions"
sinfo
echo
echo "### queue"
squeue
echo
echo "### problematic nodes"
sinfo -N -h -o "%N %T %E" | awk '$2 !~ /idle|alloc|mix/ || $2 ~ /\*$/ {print}' || true
echo
echo "### accounting"
sacctmgr -n list cluster || true
echo
echo "### recent failed jobs"
sacct -S today --state=FAILED,CANCELLED,TIMEOUT,NODE_FAIL,OUT_OF_MEMORY \
--format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,NodeList | tail -30 || true
args:
executable: /bin/bash
register: controller_health
changed_when: false
- name: Print controller health
ansible.builtin.debug:
var: controller_health.stdout_lines
- name: Check Slurm worker health
hosts: slurm_compute:slurm_gpu
become: true
gather_facts: true
tasks:
- name: Check worker services, config and connectivity
ansible.builtin.shell: |
set -euo pipefail
echo "HOST=$(hostname)"
echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
echo "KERNEL=$(uname -r)"
echo "UPTIME=$(uptime -p)"
echo
echo "### services"
systemctl is-active munge
systemctl is-active slurmd
echo
echo "### munge local test"
munge -n | unmunge >/dev/null
echo "munge OK"
echo
echo "### controller connectivity"
getent hosts slurm-ctl01 || true
scontrol ping
echo
echo "### slurmd listener"
ss -lntp | grep ':6818 ' || true
echo
echo "### config checksums"
sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true
echo
echo "### shared filesystem"
test -d /shared
touch /shared/.slurm-health-$(hostname)
ls -l /shared/.slurm-health-$(hostname)
rm -f /shared/.slurm-health-$(hostname)
echo
echo "### cgroup"
mount | grep cgroup || true
echo
echo "### gpu check"
if command -v nvidia-smi >/dev/null 2>&1; then
nvidia-smi --query-gpu=index,name,driver_version,memory.total,temperature.gpu,utilization.gpu --format=csv,noheader || true
else
echo "NO_NVIDIA_SMI"
fi
args:
executable: /bin/bash
register: worker_health
changed_when: false
- name: Print worker health
ansible.builtin.debug:
var: worker_health.stdout_lines
- name: Check Slurm-reported node state consistency
hosts: slurm_controller
become: true
gather_facts: false
tasks:
- name: Build Slurm node health summary
ansible.builtin.shell: |
set -euo pipefail
echo "### node summary"
sinfo -N -o "%N %P %T %C %m %G %E"
echo
echo "### full problematic node details"
for node in $(sinfo -N -h -o "%N %T" | awk '$2 ~ /down|drain|fail|unk|not_responding|idle\*/ {print $1}' | sort -u); do
echo
echo "### $node"
scontrol show node "$node"
done
args:
executable: /bin/bash
register: slurm_node_summary
changed_when: false
- name: Print Slurm node summary
ansible.builtin.debug:
var: slurm_node_summary.stdout_lines
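
Review note: the worker health check above probes the shared filesystem by creating, listing, and removing a marker file under `/shared`. The same probe is sketched below as a reusable function that takes the directory as an argument and reports why it failed. `probe_shared_dir` is a hypothetical name, and the demo runs against a temporary directory rather than a real shared mount.

```shell
# probe_shared_dir DIR: verify DIR exists and is writable by creating and
# removing a per-host marker file.
probe_shared_dir() {
  local dir="$1" marker
  [ -d "$dir" ] || { echo "missing: $dir" >&2; return 1; }
  marker="$dir/.slurm-health-$(uname -n)-$$"
  touch "$marker" || { echo "not writable: $dir" >&2; return 1; }
  rm -f "$marker"
  echo "shared dir OK: $dir"
}

# Demo against a throwaway directory standing in for the shared mount.
demo_dir="$(mktemp -d)"
probe_shared_dir "$demo_dir"
rmdir "$demo_dir"
```

Adding the process ID to the marker name is a small design choice that avoids two concurrent checks on the same host deleting each other's file.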

Some files were not shown because too many files have changed in this diff.