Compare commits
7 Commits
8a7b7c5abc...main
| Author | SHA1 | Date |
|---|---|---|
| | 4e739c5c99 | |
| | 8cb92de06f | |
| | 1843796e92 | |
| | cd6830334b | |
| | e2624a7533 | |
| | 6475f76787 | |
| | e851568c8c | |
@@ -4,6 +4,8 @@

 ### Added

+- Added Linux Fresh Setup Toolkit under `labs/linux/setup` for day-0 Ubuntu lab host bootstrap automation.
+- Added AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation.
 - Python tooling validation for operational scripts.
 - `incident-log-summary` for general incident log summarization.
 - `log-diff-checker` for pre-change and post-change log comparison.
@@ -11,6 +13,8 @@
 - `jvm-log-analyzer` for JVM application log summaries.
 - `journal-analyzer` for exported `journalctl` log review.
 - `known-error-matcher` with JSON-based known error patterns.
+- Standalone Bash incident checks for CPU, memory/OOM, service restart loops, failed SSH logins, certificate expiry, DNS connectivity, NTP drift, read-only filesystems, inode usage, and JVM process diagnostics.
+- `incident_triage_report.sh` for L2 Markdown incident handover reports built from existing Bash incident checks.
 - Repository-level Codex guidance:
 - `AGENTS.md`
 - `docs/codex/README.md`
@@ -34,6 +38,7 @@
 - IBM AIX 7 role and playbook.
 - Shared sanitized Ansible inventory defaults for Linux and AIX examples.
 - Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
+- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.

 ### Changed

Binary file not shown.
@@ -30,6 +30,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope

 - [infra-run](./infra-run/) - the main implemented project in this repository.
 - [Linux healthcheck scripts](./infra-run/scripts/bash/os-healthcheck/) - host, disk, service, network, and report helpers.
+- [Bash incident checks](./infra-run/scripts/bash/incident-checks/) - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
 - [Disk full workflow](./infra-run/scripts/bash/disk-full/) - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
 - [Veritas examples](./infra-run/scripts/bash/veritas/) - dry-run-first VxVM/VCS storage expansion workflow examples.
 - [GPFS examples](./infra-run/scripts/bash/gpfs/) - dry-run-first IBM Spectrum Scale expansion workflow examples.
@@ -41,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
 - [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
 - [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
 - [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
+- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.

 ## Planned Areas

@@ -105,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
 - Veritas VxVM/VCS operational awareness.
 - GPFS / IBM Spectrum Scale operational awareness.
 - Ansible role organization for selected hardening controls.
+- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
 - Clear documentation of what was tested and what still needs a real system.
@@ -20,6 +20,7 @@ This file keeps future portfolio ideas in one place so empty folders do not look

 ## Implemented Portfolio Additions

+- Standalone Bash incident checks under `infra-run/scripts/bash/incident-checks/` for common Linux incident triage and ticket evidence.
 - Python operational log analysis suite under `infra-run/scripts/python/`:
 - `incident-log-summary`
 - `log-diff-checker`
@@ -9,6 +9,7 @@ The goal is to show operational judgment, not to ship a universal automation pro
 ### Bash Operational Scripts

 - [scripts/bash/os-healthcheck](./scripts/bash/os-healthcheck/) - general Linux health, service, disk, network, and report scripts.
+- [scripts/bash/incident-checks](./scripts/bash/incident-checks/) - standalone read-only incident checks for CPU, memory/OOM, SSH failures, TLS expiry, DNS, NTP, filesystems, inodes, services, JVM diagnostics, and an L2 Markdown triage report wrapper.
 - [scripts/bash/disk-full](./scripts/bash/disk-full/) - disk-full triage and cleanup review workflow.
 - [scripts/bash/veritas](./scripts/bash/veritas/) - Veritas VxVM/VCS storage expansion workflow examples.
 - [scripts/bash/gpfs](./scripts/bash/gpfs/) - GPFS / IBM Spectrum Scale expansion workflow examples.
@@ -16,6 +16,7 @@ This file tracks planned `infra-run` additions without presenting them as comple

 ## Implemented Additions

+- `infra-run/scripts/bash/incident-checks/` - standalone read-only Bash checks for CPU, memory/OOM, service restart loops, failed SSH logins, TLS certificate expiry, DNS connectivity, time sync drift, read-only filesystems, inode pressure, and JVM process diagnostics.
 - `infra-run/scripts/python/incident-log-summary/` - first read-only Python log analysis helper for summarizing configured incident patterns from local log files.
 - `infra-run/scripts/python/log-diff-checker/` - read-only before/after log comparison helper for post-change pattern review.
 - `infra-run/scripts/python/auth-log-audit/` - read-only authentication log audit helper for local SSH, sudo, su, and PAM review.
@@ -7,5 +7,6 @@ These files use fake hostnames, reserved example domains, reserved IP address ra
 ## Included

 - `disk-full/` - sample filesystem usage, deleted open files, and a short after-action report.
+- `incident-triage/` - sample L2 incident triage report for repeatable handoff and ticket evidence.
 - `veritas/` - sample VxVM disk and VCS service group output.
 - `gpfs/` - sample GPFS cluster and NSD output.
@@ -0,0 +1,131 @@
# L2 Incident Triage Report

- Generated: 2026-05-12T19:30:00Z
- Local hostname: app01.example.internal
- Current user: triage
- Incident type: all
- Service: nginx
- Host: app.example.com
- Port: 443
- PID: not provided
- Process match: not provided
- Since: 30 minutes ago

## Executed Checks

| Check | Script | Status | Exit | Command |
| --- | --- | --- | --- | --- |
| CPU saturation | `check_high_cpu.sh` | OK | 0 | `./check_high_cpu.sh` |
| Memory and OOM | `check_high_memory_oom.sh` | WARNING | 1 | `./check_high_memory_oom.sh --since "30 minutes ago"` |
| Service restart loop | `check_service_restart_loop.sh` | OK | 0 | `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"` |
| DNS and connectivity | `check_dns_connectivity.sh` | OK | 0 | `./check_dns_connectivity.sh --host app.example.com --port 443` |
| Failed SSH logins | `check_failed_ssh_logins.sh` | OK | 0 | `./check_failed_ssh_logins.sh --since "30 minutes ago"` |
| Certificate expiry | `check_certificate_expiry.sh` | OK | 0 | `./check_certificate_expiry.sh --host app.example.com --port 443` |
| Read-only filesystems | `check_filesystem_readonly.sh` | OK | 0 | `./check_filesystem_readonly.sh` |
| Inode usage | `check_inode_usage.sh` | OK | 0 | `./check_inode_usage.sh` |
| JVM threads and heap | `check_jvm_threads_heap.sh` | WARNING | 1 | `./check_jvm_threads_heap.sh` |

## Summary

- CPU saturation: OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)
- Memory and OOM: WARNING: Memory usage is 84% and swap usage is 12%
- Service restart loop: OK: Service nginx state=active substate=running restarts=0
- DNS and connectivity: OK: DNS=OK ping=OK tcp_443=OK
- Failed SSH logins: OK: Found 2 failed SSH login attempt(s) for requested window
- Certificate expiry: OK: Certificate for app.example.com:443 expires in 74 day(s)
- Read-only filesystems: OK: Found 0 read-only filesystem(s)
- Inode usage: OK: Highest inode usage is 42%
- JVM threads and heap: WARNING: No Java processes detected

## Raw Evidence

### CPU saturation

Script: `check_high_cpu.sh`

Command: `./check_high_cpu.sh`

Status: OK, exit: 0

```text
OK: 1-minute load is 0.42 across 4 CPU(s) (10% of CPU count)

Load average:
1m=0.42 5m=0.38 15m=0.31

Top CPU processes:
PID PPID USER %CPU %MEM COMMAND ARGS
1450 1 app 7.2 2.1 nginx nginx: worker process

Recommended next steps:
- Check process ownership and whether the top process is expected
- Review logs for the top CPU-consuming process
```

### Memory and OOM

Script: `check_high_memory_oom.sh`

Command: `./check_high_memory_oom.sh --since "30 minutes ago"`

Status: WARNING, exit: 1

```text
WARNING: Memory usage is 84% and swap usage is 12%

Memory summary:
Mem: 15800 13272 1110 210 1418 1840
Swap: 4095 512 3583

OOM events since 30 minutes ago:
OK: no OOM evidence found in available sources
```

### Service restart loop

Script: `check_service_restart_loop.sh`

Command: `./check_service_restart_loop.sh --service nginx --since "30 minutes ago"`

Status: OK, exit: 0

```text
OK: Service nginx state=active substate=running restarts=0

Systemd properties:
Id=nginx.service
ActiveState=active
SubState=running
NRestarts=0
```

### Skipped or limited checks

```text
JVM threads and heap returned WARNING because no Java process was detected.
No destructive commands were run. No service restarts, process kills, remounts, or configuration changes were attempted.
```

## L2 Handover Checklist

- [ ] Business impact confirmed
- [ ] Affected host/service identified
- [ ] Monitoring alert attached
- [ ] Recent changes checked
- [ ] Logs attached
- [ ] Service owner identified
- [ ] Escalation target identified

## Escalation Notes

- Escalate when impact is active, spreading, customer-facing, or outside L2 access.
- Include the alert, timeline, commands run, and the raw evidence above.
- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.
- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.

## Recommended Next Steps

- Confirm the symptom against monitoring and user reports.
- Compare this point-in-time evidence with recent deploys, config changes, and host events.
- Attach this report to the incident ticket before handoff.
- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.
@@ -7,13 +7,15 @@ Small, practical Bash scripts for Linux operations checks and incident triage. T
 ```mermaid
 flowchart TD
 A["bash"] --> B["os-healthcheck"]
-A --> C["disk-full"]
-A --> D["veritas"]
-A --> E["gpfs"]
+A --> C["incident-checks"]
+A --> D["disk-full"]
+A --> E["veritas"]
+A --> F["gpfs"]
 B --> B1["Host diagnostics"]
-C --> C1["Incident workflow"]
-D --> D1["VxVM and VCS change flow"]
-E --> E1["Spectrum Scale expansion flow"]
+C --> C1["Standalone triage checks"]
+D --> D1["Incident workflow"]
+E --> E1["VxVM and VCS change flow"]
+F --> F1["Spectrum Scale expansion flow"]
 ```

 ## Scripts
@@ -23,6 +25,7 @@ flowchart TD
 - `os-healthcheck/service_check.sh` - critical service status check.
 - `os-healthcheck/system_report.sh` - writes a timestamped system report to `/tmp`.
 - `os-healthcheck/network_troubleshoot.sh` - local and optional remote network diagnostics.
+- `incident-checks/` - standalone read-only incident checks for CPU, memory/OOM, services, SSH failures, TLS certificates, DNS, NTP, filesystems, inodes, and JVM diagnostics.

 ## Usage

@@ -37,6 +40,12 @@ cd infra-run/scripts/bash/os-healthcheck
 ./system_report.sh
 ./network_troubleshoot.sh
 ./network_troubleshoot.sh google.com
+
+cd ../incident-checks
+./check_high_cpu.sh
+./check_high_memory_oom.sh --since "24 hours ago"
+./check_service_restart_loop.sh --service sshd
+./check_certificate_expiry.sh --host example.com
 ```

 ## Standards
@@ -0,0 +1,124 @@
# Bash Incident Checks

Standalone, read-only Bash checks for common Linux incident triage. These scripts are designed to be copied to a server during an incident, run without repository context, and pasted into an incident or change ticket as evidence.

They favor standard tools found on RHEL-like and Debian/Ubuntu systems. Optional commands are used when available and reported clearly when missing.

## Scripts

- `check_high_cpu.sh` - load, CPU saturation hint, and top CPU processes.
- `check_high_memory_oom.sh` - memory and swap pressure plus recent OOM evidence.
- `check_service_restart_loop.sh` - systemd service state, restart count, and recent failure lines.
- `check_failed_ssh_logins.sh` - failed SSH login burst review from journal or auth logs.
- `check_certificate_expiry.sh` - remote or local TLS certificate expiry check.
- `check_dns_connectivity.sh` - DNS resolution, ping, optional TCP check, and local route hints.
- `check_ntp_time_drift.sh` - time sync status and offset evidence when available.
- `check_filesystem_readonly.sh` - read-only filesystem detection.
- `check_inode_usage.sh` - inode pressure and top affected mount points.
- `check_jvm_threads_heap.sh` - lightweight JVM process, heap, and thread diagnostics.
- `incident_triage_report.sh` - wrapper that runs selected checks and writes a single Markdown L2 handover report.

## Usage Examples

```bash
./check_high_cpu.sh
./check_high_cpu.sh --warning 70 --critical 90 --top 15

./check_high_memory_oom.sh
./check_high_memory_oom.sh --since "6 hours ago" --top 5

./check_service_restart_loop.sh --service nginx
./check_service_restart_loop.sh --service app.service --since "30 minutes ago"

./check_failed_ssh_logins.sh
./check_failed_ssh_logins.sh --since "15 minutes ago" --warning 10 --critical 25

./check_certificate_expiry.sh --host example.com
./check_certificate_expiry.sh --host app.example.com --port 8443 --servername app.example.com
./check_certificate_expiry.sh --file /etc/pki/tls/certs/example.crt

./check_dns_connectivity.sh --host example.com
./check_dns_connectivity.sh --host db.example.internal --port 5432

./check_ntp_time_drift.sh
./check_ntp_time_drift.sh --warning-offset 250 --critical-offset 2000

./check_filesystem_readonly.sh
./check_filesystem_readonly.sh --include-system

./check_inode_usage.sh
./check_inode_usage.sh --warning 75 --critical 90

./check_jvm_threads_heap.sh
./check_jvm_threads_heap.sh --pid 1234
./check_jvm_threads_heap.sh --match app-name

./incident_triage_report.sh --type cpu
./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
./incident_triage_report.sh --type network --host app.example.com --port 443
./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
```

## L2 Triage Report Wrapper

`incident_triage_report.sh` collects selected incident checks into one Markdown report. It is useful for L2 mentoring, repeatable triage, and ticket evidence because it keeps the command list, point-in-time output, handover checklist, escalation notes, and recommended next steps in one place.

Supported report types are `cpu`, `memory`, `service`, `network`, `auth`, `cert`, `filesystem`, `jvm`, and `all`.

The wrapper is read-only apart from writing the requested `--output` file. It does not require root and skips checks safely when an underlying script is missing, not executable, or missing required context such as `--service` or `--host`.
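
The skip behavior described above could be sketched roughly as follows. This is an illustration only, not the wrapper's actual code: `run_check` and the `SKIPPED` label are hypothetical names, and the real script may structure this differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: run a check script only when it is usable,
# otherwise record why it was skipped instead of failing the whole report.

run_check() {
  local label="$1" script="$2"
  shift 2

  if [[ ! -f "$script" ]]; then
    printf '%s: SKIPPED (script not found: %s)\n' "$label" "$script"
    return 0
  fi
  if [[ ! -x "$script" ]]; then
    printf '%s: SKIPPED (script not executable: %s)\n' "$label" "$script"
    return 0
  fi

  local output exit_code=0
  # Capture output and exit code without aborting the caller.
  output="$("$script" "$@" 2>&1)" || exit_code=$?
  printf '%s: exit=%s\n%s\n' "$label" "$exit_code" "$output"
}

# Example: a missing script is reported, not treated as a fatal error.
run_check "CPU saturation" ./does_not_exist.sh
```

Returning `0` for a skipped check keeps one unusable script from stopping the remaining checks, while the printed reason still lands in the report.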

## Exit Codes

- `0` - OK.
- `1` - WARNING or operational issue detected.
- `2` - invalid input or missing required dependency.
- `3` - CRITICAL issue detected.
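
A caller can translate these exit codes back into status labels. The helper below is a minimal sketch; `status_from_exit` and the `USAGE_OR_DEPENDENCY` and `UNKNOWN` labels are hypothetical and are not part of the scripts in this change.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map the documented exit codes to status labels.

status_from_exit() {
  case "$1" in
    0) printf 'OK\n' ;;
    1) printf 'WARNING\n' ;;
    2) printf 'USAGE_OR_DEPENDENCY\n' ;;
    3) printf 'CRITICAL\n' ;;
    *) printf 'UNKNOWN\n' ;;
  esac
}

# Example: run any command, then label its exit code.
exit_code=0
false || exit_code=$?
printf 'exit=%s status=%s\n' "$exit_code" "$(status_from_exit "$exit_code")"
```

Capturing the code with `|| exit_code=$?` matters when the caller runs under `set -o errexit`, because a non-zero check result would otherwise end the calling script.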

## Supported Platforms

These checks are written for Bash on Linux and should work on common RHEL/Rocky/Alma/Oracle Linux and Debian/Ubuntu systems where the relevant platform tools are installed.

Some data sources vary by distribution:

- RHEL-like systems often use `/var/log/secure` and `/var/log/messages`.
- Debian/Ubuntu systems often use `/var/log/auth.log`, `/var/log/syslog`, and `/var/log/kern.log`.
- systemd-based checks require `systemctl`; journal-based evidence uses `journalctl` when available.
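
One way to handle these differences is to probe for the first readable source. The sketch below is an assumption about how such a probe could look, not the logic used by `check_failed_ssh_logins.sh`; `pick_auth_source` and its `file:` output format are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: choose an authentication log source per distribution.

pick_auth_source() {
  local candidate
  # Prefer the first readable log file from the candidate list.
  for candidate in "$@"; do
    if [[ -r "$candidate" ]]; then
      printf 'file:%s\n' "$candidate"
      return 0
    fi
  done
  # Fall back to the journal when no file is readable.
  if command -v journalctl >/dev/null 2>&1; then
    printf 'journalctl\n'
    return 0
  fi
  printf 'none\n'
  return 1
}

# RHEL-like path first, then the Debian/Ubuntu path.
source_name="$(pick_auth_source /var/log/secure /var/log/auth.log)" || true
printf 'Auth evidence source: %s\n' "$source_name"
```

Reporting `none` instead of failing silently matches the stated goal of making missing sources visible in the evidence.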

## Safety Notes

- Scripts are read-only.
- Scripts do not restart services, kill processes, remount filesystems, change time services, or write persistent files.
- Root is not required, but some logs, process command lines, and JVM attach details may be limited without elevated permissions.
- Treat output as triage evidence, not as complete root-cause analysis.

## Dependency Notes

Required dependencies vary by script and are checked at runtime. Common dependencies include `bash`, `awk`, `sed`, `grep`, `sort`, `head`, `ps`, `df`, `free`, `systemctl`, `getent`, `openssl`, `date`, `mount`, and `findmnt`.

Optional dependencies include `journalctl`, `ping`, `ip`, `ss`, `timedatectl`, `chronyc`, `ntpq`, `jcmd`, `jstat`, and readable `/proc` files.
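
A runtime dependency check along these lines might look like the sketch below. `require_cmd` and `optional_cmd` are hypothetical helper names, and the `INFO:` message is an assumed format; only the `CRITICAL: required command not found` wording appears in the scripts in this change.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: report missing required commands, note missing optional ones.

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    return 2
  fi
}

optional_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'INFO: optional command not available: %s\n' "$1"
    return 1
  fi
}

missing_required=0
for cmd in awk grep ps; do
  require_cmd "$cmd" || missing_required=1
done
optional_cmd journalctl || true

# A real check would exit 2 here when missing_required is 1.
printf 'missing_required=%s\n' "$missing_required"
```

Checking every required command before exiting lets one run list all missing tools instead of stopping at the first.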

## Copy-To-Server Example

```bash
scp infra-run/scripts/bash/incident-checks/check_high_memory_oom.sh admin@server:/tmp/
ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
```

Attach the script output to the incident or change ticket so the next responder can see the exact evidence, thresholds, and limitations.
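
To keep the output and the exit code together for the ticket, the run can be wrapped on the operator's workstation. This is a sketch under assumptions: `capture_evidence` and the evidence file layout are hypothetical, and the demo uses a local stand-in command rather than a real remote host.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: store a command's output and exit code in one evidence file.

capture_evidence() {
  local outfile="$1"
  shift
  local exit_code=0
  {
    printf 'Command: %s\n' "$*"
    printf 'Started: %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    printf -- '--- output ---\n'
    "$@" 2>&1 || exit_code=$?
    printf -- '--- exit code: %s ---\n' "$exit_code"
  } > "$outfile"
  return "$exit_code"
}

# Local stand-in command; for a remote host the command could instead be
#   ssh admin@server 'bash /tmp/check_high_memory_oom.sh --since "24 hours ago"'
evidence_file="$(mktemp)"
capture_evidence "$evidence_file" uname -s || true
cat "$evidence_file"
rm -f "$evidence_file"
```

Recording the exit code next to the output matters because a WARNING or CRITICAL result is signalled through the exit code as well as the first output line.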

## Sample Outputs

Sanitized examples are available in [examples](./examples/):

- `high-cpu.sample.txt`
- `high-memory-oom.sample.txt`
- `service-restart-loop.sample.txt`
- `failed-ssh-logins.sample.txt`
- `certificate-expiry.sample.txt`
- `dns-connectivity.sample.txt`
- `ntp-time-drift.sample.txt`
- `filesystem-readonly.sample.txt`
- `inode-usage.sample.txt`
- `jvm-threads-heap.sample.txt`

A sanitized report sample is available at [../../../examples/incident-triage/l2-incident-triage-report.sample.md](../../../examples/incident-triage/l2-incident-triage-report.sample.md).
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

host_name=""
port=443
cert_file=""
warning_days=30
critical_days=7
servername=""

usage() {
  cat <<'USAGE'
Usage: check_certificate_expiry.sh (--host HOST [--port PORT] | --file CERT_FILE) [--servername SNI_NAME] [--warning-days DAYS] [--critical-days DAYS] [--help]

Check TLS certificate expiry for a remote endpoint or local certificate file.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
    --port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
    --file) [[ $# -ge 2 ]] || { printf 'CRITICAL: --file requires a value\n'; exit 2; }; cert_file="$2"; shift 2 ;;
    --servername) [[ $# -ge 2 ]] || { printf 'CRITICAL: --servername requires a value\n'; exit 2; }; servername="$2"; shift 2 ;;
    --warning-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-days requires a value\n'; exit 2; }; warning_days="$2"; shift 2 ;;
    --critical-days) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-days requires a value\n'; exit 2; }; critical_days="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if ! command -v openssl >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: openssl\n'
  exit 2
fi
for value in "$port" "$warning_days" "$critical_days"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((critical_days >= warning_days)); then
  printf 'CRITICAL: --critical-days must be lower than --warning-days\n'
  exit 2
fi
if [[ -n "$host_name" && -n "$cert_file" ]]; then
  printf 'CRITICAL: use either --host or --file, not both\n'
  exit 2
fi
if [[ -z "$host_name" && -z "$cert_file" ]]; then
  printf 'CRITICAL: either --host or --file is required\n'
  usage
  exit 2
fi
if [[ -n "$cert_file" && ! -r "$cert_file" ]]; then
  printf 'CRITICAL: certificate file is not readable: %s\n' "$cert_file"
  exit 2
fi
if [[ -z "$servername" ]]; then
  servername="$host_name"
fi

tmp_cert="$(mktemp)"
trap 'rm -f "$tmp_cert"' EXIT

if [[ -n "$host_name" ]]; then
  if ! openssl s_client -connect "${host_name}:${port}" -servername "$servername" -showcerts </dev/null 2>/dev/null \
    | openssl x509 -outform PEM > "$tmp_cert" 2>/dev/null; then
    printf 'CRITICAL: unable to retrieve certificate from %s:%s\n' "$host_name" "$port"
    exit 2
  fi
else
  cp "$cert_file" "$tmp_cert"
fi

subject="$(openssl x509 -in "$tmp_cert" -noout -subject 2>/dev/null | sed 's/^subject=//')"
issuer="$(openssl x509 -in "$tmp_cert" -noout -issuer 2>/dev/null | sed 's/^issuer=//')"
not_before="$(openssl x509 -in "$tmp_cert" -noout -startdate 2>/dev/null | sed 's/^notBefore=//')"
not_after="$(openssl x509 -in "$tmp_cert" -noout -enddate 2>/dev/null | sed 's/^notAfter=//')"
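# Tolerate failure here (for example, an OpenSSL build without -ext support) so errexit and pipefail do not abort before the SAN/CN fallback below.
san_text="$(openssl x509 -in "$tmp_cert" -noout -ext subjectAltName 2>/dev/null | sed '1d' | sed 's/^ *//' || true)"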

expiry_epoch="$(date -d "$not_after" +%s 2>/dev/null || printf '')"
now_epoch="$(date +%s)"
if [[ -z "$expiry_epoch" ]]; then
  printf 'CRITICAL: unable to parse certificate expiry date: %s\n' "$not_after"
  exit 2
fi
seconds_left=$((expiry_epoch - now_epoch))
days_left=$((seconds_left / 86400))

status="OK"
exit_code=0
if ((days_left < critical_days)); then
  status="CRITICAL"
  exit_code=3
elif ((days_left < warning_days)); then
  status="WARNING"
  exit_code=1
fi

target="$cert_file"
if [[ -n "$host_name" ]]; then
  target="${host_name}:${port}"
fi

printf '%s: Certificate for %s expires in %s day(s)\n\n' "$status" "$target" "$days_left"

printf 'Certificate details:\n'
printf 'Subject: %s\n' "$subject"
printf 'Issuer: %s\n' "$issuer"
printf 'notBefore: %s\n' "$not_before"
printf 'notAfter: %s\n' "$not_after"
printf 'SAN/CN: %s\n' "${san_text:-$subject}"
printf '\n'

printf 'Evidence:\n'
printf 'Target: %s\n' "$target"
printf 'SNI: %s\n' "${servername:-not used}"
printf 'Thresholds: warning=%s days critical=%s days\n\n' "$warning_days" "$critical_days"

printf 'Recommended next steps:\n'
printf -- '- Renew certificate before the operational threshold is breached\n'
printf -- '- Check the full chain and intermediate certificates\n'
printf -- '- Check the load balancer, ingress, or reverse proxy serving this certificate\n'
printf -- '- Verify monitoring threshold and alert ownership\n'
printf -- '- Attach this output to incident or change ticket\n'

exit "$exit_code"
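
The threshold comparison in the certificate check above can be exercised on its own. The sketch below restates that comparison as a function; `classify_days_left` is a hypothetical name introduced for illustration, and its defaults mirror the script's `warning_days=30` and `critical_days=7`.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the script's comparison:
# below critical -> CRITICAL (3), below warning -> WARNING (1), otherwise OK (0).

classify_days_left() {
  local days_left="$1" warning_days="${2:-30}" critical_days="${3:-7}"
  if ((days_left < critical_days)); then
    printf 'CRITICAL\n'
    return 3
  elif ((days_left < warning_days)); then
    printf 'WARNING\n'
    return 1
  fi
  printf 'OK\n'
}

# Boundary values: the comparisons are strict, so 30 is OK and 7 is WARNING.
for days in 74 30 29 7 6 -1; do
  status="$(classify_days_left "$days")" || true
  printf '%s day(s) left: %s\n' "$days" "$status"
done
```

An already expired certificate yields a negative day count and therefore falls into the CRITICAL branch.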
@@ -0,0 +1,161 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

host_name=""
port=""
count=3
timeout_seconds=3

usage() {
  cat <<'USAGE'
Usage: check_dns_connectivity.sh --host HOST [--port PORT] [--count COUNT] [--timeout SECONDS] [--help]

Check DNS resolution, ping, optional TCP connectivity, and local route hints.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --host) [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }; host_name="$2"; shift 2 ;;
    --port) [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }; port="$2"; shift 2 ;;
    --count) [[ $# -ge 2 ]] || { printf 'CRITICAL: --count requires a value\n'; exit 2; }; count="$2"; shift 2 ;;
    --timeout) [[ $# -ge 2 ]] || { printf 'CRITICAL: --timeout requires a value\n'; exit 2; }; timeout_seconds="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -z "$host_name" ]]; then
  printf 'CRITICAL: --host is required\n'
  usage
  exit 2
fi
for value in "$count" "$timeout_seconds"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if [[ -n "$port" ]] && ! is_number "$port"; then
  printf 'CRITICAL: --port must be numeric\n'
  exit 2
fi
if ! command -v getent >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: getent\n'
  exit 2
fi

dns_ok=0
ping_ok=0
tcp_ok=0
tcp_checked=0
tcp_note=""
ping_output="$(mktemp)"
trap 'rm -f "$ping_output"' EXIT

dns_output="$(getent hosts "$host_name" 2>/dev/null || true)"
if [[ -n "$dns_output" ]]; then
  dns_ok=1
fi

if command -v ping >/dev/null 2>&1; then
  if ping -c "$count" -W "$timeout_seconds" "$host_name" > "$ping_output" 2>&1; then
    ping_ok=1
  fi
else
  printf 'WARNING: ping command not available; ICMP check skipped\n' > "$ping_output"
fi

if [[ -n "$port" ]]; then
  tcp_checked=1
  if command -v timeout >/dev/null 2>&1; then
    # Pass host and port as positional arguments so a crafted --host value
    # cannot inject commands into the bash -c string.
    if timeout "$timeout_seconds" bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
      tcp_ok=1
    fi
  else
    tcp_note="WARNING: timeout command not available; TCP /dev/tcp check used without external timeout"
    if bash -c ':</dev/tcp/"$1"/"$2"' _ "$host_name" "$port" >/dev/null 2>&1; then
      tcp_ok=1
    fi
  fi
fi

status="OK"
exit_code=0
if ((dns_ok == 0)); then
  status="CRITICAL"
  exit_code=3
elif ((tcp_checked == 1 && tcp_ok == 0)); then
  status="CRITICAL"
  exit_code=3
elif command -v ping >/dev/null 2>&1 && ((ping_ok == 0)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: DNS=%s ping=%s' "$status" "$([[ "$dns_ok" == 1 ]] && printf OK || printf FAILED)" "$([[ "$ping_ok" == 1 ]] && printf OK || printf UNKNOWN_OR_FAILED)"
if ((tcp_checked == 1)); then
  printf ' tcp_%s=%s' "$port" "$([[ "$tcp_ok" == 1 ]] && printf OK || printf FAILED)"
fi
printf '\n\n'

printf 'DNS result:\n'
if [[ -n "$dns_output" ]]; then
  printf '%s\n' "$dns_output"
else
  printf 'CRITICAL: getent hosts returned no records for %s\n' "$host_name"
fi
printf '\n'

printf 'Ping result:\n'
if [[ -s "$ping_output" ]]; then
  cat "$ping_output"
else
  printf 'WARNING: ping result unavailable or ping command missing\n'
fi
printf '\n'

if ((tcp_checked == 1)); then
  printf 'TCP port result:\n'
  if ((tcp_ok == 1)); then
    printf 'OK: TCP connection to %s:%s succeeded\n' "$host_name" "$port"
  else
    printf 'CRITICAL: TCP connection to %s:%s failed or timed out\n' "$host_name" "$port"
  fi
  if [[ -n "$tcp_note" ]]; then
    printf '%s\n' "$tcp_note"
  fi
  printf '\n'
fi

printf 'Local network hints:\n'
if command -v ip >/dev/null 2>&1; then
  ip route show default 2>/dev/null || printf 'WARNING: unable to read default route\n'
elif command -v ss >/dev/null 2>&1; then
  ss -tuln 2>/dev/null | head -n 20 || printf 'WARNING: unable to read socket summary\n'
else
  printf 'WARNING: ip and ss are unavailable; local network hints skipped\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Host: %s count=%s timeout=%ss port=%s\n' "$host_name" "$count" "$timeout_seconds" "${port:-not checked}"
if [[ -n "$tcp_note" ]]; then
  printf '%s\n' "$tcp_note"
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Verify the DNS record and resolver path\n'
printf -- '- Check firewall, routing, security group, or proxy policy\n'
printf -- '- Compare results from another host or network segment\n'
printf -- '- Check application endpoint health after network reachability is confirmed\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

since_value="1 hour ago"
warning_count=20
critical_count=50
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_failed_ssh_logins.sh [--since TEXT] [--warning COUNT] [--critical COUNT] [--top N] [--help]

Detect failed SSH login bursts from journal or readable authentication logs.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_count" "$critical_count" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

tmp_log="$(mktemp)"
trap 'rm -f "$tmp_log"' EXIT
log_source="journalctl"

if command -v journalctl >/dev/null 2>&1; then
  journalctl --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' > "$tmp_log" || true
else
  log_source="log file fallback"
fi

if [[ ! -s "$tmp_log" ]]; then
  for log_file in /var/log/auth.log /var/log/secure /var/log/messages; do
    if [[ -r "$log_file" ]]; then
      grep -Ei 'sshd.*(Failed password|Invalid user|authentication failure)|authentication failure.*sshd' "$log_file" >> "$tmp_log" || true
      log_source="$log_file"
    fi
  done
fi

attempts="$(wc -l < "$tmp_log" | awk '{print $1}')"

status="OK"
exit_code=0
if ((attempts >= critical_count)); then
  status="CRITICAL"
  exit_code=3
elif ((attempts >= warning_count)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: Found %s failed SSH login attempt(s) for requested window\n\n' "$status" "$attempts"

printf 'Top source IPs:\n'
if [[ -s "$tmp_log" ]]; then
  grep -Eo 'from ([0-9]{1,3}\.){3}[0-9]{1,3}|rhost=([0-9]{1,3}\.){3}[0-9]{1,3}' "$tmp_log" \
    | sed -E 's/^(from |rhost=)//' \
    | sort | uniq -c | sort -rn | head -n "$top_count" || true
else
  printf 'OK: no failed SSH attempts found in available sources\n'
fi
printf '\n'

printf 'Top attempted users:\n'
if [[ -s "$tmp_log" ]]; then
  sed -nE 's/.*Invalid user ([^ ]+).*/\1/p; s/.*Failed password for invalid user ([^ ]+).*/\1/p; s/.*Failed password for ([^ ]+).*/\1/p; s/.*user=([^ ]+).*/\1/p' "$tmp_log" \
    | sort | uniq -c | sort -rn | head -n "$top_count" || true
else
  printf 'OK: no attempted users extracted\n'
fi
printf '\n'

printf 'Sample recent lines:\n'
if [[ -s "$tmp_log" ]]; then
  tail -n "$top_count" "$tmp_log"
else
  printf 'OK: no sample lines available\n'
fi
printf '\n\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s critical=%s since="%s"\n' "$warning_count" "$critical_count" "$since_value"
printf 'Log source: %s\n' "$log_source"
if [[ "$log_source" != "journalctl" ]]; then
  printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; authentication log visibility may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Verify source IPs against expected scanners, admins, or automation\n'
printf -- '- Check firewall, fail2ban, or security tooling state\n'
printf -- '- Confirm whether the attempts are expected for this host\n'
printf -- '- Review successful logins too, not only failures\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,89 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

include_system=0

usage() {
  cat <<'USAGE'
Usage: check_filesystem_readonly.sh [--include-system] [--help]

Detect filesystems mounted read-only. The check itself is read-only and makes no changes.
USAGE
}

while (($# > 0)); do
  case "$1" in
    --include-system) include_system=1; shift ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

tmp_mounts="$(mktemp)"
trap 'rm -f "$tmp_mounts"' EXIT

if command -v findmnt >/dev/null 2>&1; then
  findmnt -rn -o TARGET,SOURCE,FSTYPE,OPTIONS > "$tmp_mounts" 2>/dev/null || true
elif command -v mount >/dev/null 2>&1; then
  mount | awk '{ source=$1; target=$3; type=$5; opts=$6; gsub(/[()]/, "", opts); print target, source, type, opts }' > "$tmp_mounts"
else
  printf 'CRITICAL: findmnt or mount is required\n'
  exit 2
fi

tmp_ro="$(mktemp)"
trap 'rm -f "$tmp_mounts" "$tmp_ro"' EXIT

awk -v include_system="$include_system" '
  function system_fs(type, target) {
    return type ~ /^(proc|sysfs|tmpfs|devtmpfs|devpts|securityfs|cgroup|cgroup2|pstore|bpf|tracefs|debugfs|configfs|fusectl|mqueue|hugetlbfs|overlay|squashfs|autofs)$/ || target ~ /^\/(proc|sys|dev|run)(\/|$)/
  }
  {
    target=$1; source=$2; type=$3; opts=$4
    if (opts ~ /(^|,)ro(,|$)/) {
      if (include_system == 1 || ! system_fs(type, target)) {
        print target "\t" source "\t" type "\t" opts
      }
    }
  }
' "$tmp_mounts" > "$tmp_ro"

readonly_count="$(wc -l < "$tmp_ro" | awk '{print $1}')"
status="OK"
exit_code=0
if ((readonly_count > 0)); then
  status="CRITICAL"
  exit_code=3
fi

printf '%s: Found %s read-only filesystem(s)\n\n' "$status" "$readonly_count"

printf 'Read-only filesystems:\n'
if [[ -s "$tmp_ro" ]]; then
  printf 'MOUNT_POINT\tSOURCE\tFSTYPE\tOPTIONS\n'
  cat "$tmp_ro"
else
  printf 'OK: no read-only filesystems found with current filters\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'include_system=%s\n' "$include_system"
printf 'Collector: '
if command -v findmnt >/dev/null 2>&1; then
  printf 'findmnt\n'
else
  printf 'mount fallback\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check dmesg or journal logs for I/O errors and filesystem remount events\n'
printf -- '- Check storage path, multipath, SAN, cloud volume, or underlying disk health\n'
printf -- '- Check filesystem health with the platform-approved procedure\n'
printf -- '- Do not remount read-write before understanding the cause\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,146 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=75
critical_threshold=90
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_high_cpu.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]

Detect high CPU load and show top CPU-consuming processes.

Exit codes:
  0 OK
  1 WARNING / operational issue detected
  2 invalid input / missing required dependency
  3 CRITICAL issue detected
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --warning)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }
      warning_threshold="$2"
      shift 2
      ;;
    --critical)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }
      critical_threshold="$2"
      shift 2
      ;;
    --top)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }
      top_count="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1"
      usage
      exit 2
      ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done

if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd ps
require_cmd awk
require_cmd head

cpu_count=1
if command -v getconf >/dev/null 2>&1; then
  cpu_count="$(getconf _NPROCESSORS_ONLN 2>/dev/null || printf '1')"
elif [[ -r /proc/cpuinfo ]]; then
  cpu_count="$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || printf '1')"
fi
[[ "$cpu_count" =~ ^[0-9]+$ ]] || cpu_count=1
((cpu_count > 0)) || cpu_count=1

load_1m="unavailable"
load_5m="unavailable"
load_15m="unavailable"
load_per_cpu_pct=0
if [[ -r /proc/loadavg ]]; then
  read -r load_1m load_5m load_15m _ < /proc/loadavg
  load_per_cpu_pct="$(awk -v load_avg="$load_1m" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"
elif command -v uptime >/dev/null 2>&1; then
  load_line="$(uptime 2>/dev/null || true)"
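  load_1m="$(printf '%s\n' "$load_line" | sed -n 's/.*load average[s]*: *\([^,]*\).*/\1/p')"
  # Derive the percentage here too; otherwise the uptime fallback always reports 0%.
  load_per_cpu_pct="$(awk -v load_avg="${load_1m:-0}" -v cpus="$cpu_count" 'BEGIN { printf "%d", (load_avg / cpus) * 100 }')"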
fi

status="OK"
exit_code=0
if ((load_per_cpu_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
elif ((load_per_cpu_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: 1-minute load is %s across %s CPU(s) (%s%% of CPU count)\n\n' "$status" "$load_1m" "$cpu_count" "$load_per_cpu_pct"

printf 'Load average:\n'
printf '1m=%s 5m=%s 15m=%s\n\n' "$load_1m" "$load_5m" "$load_15m"

printf 'CPU count:\n'
printf '%s\n\n' "$cpu_count"

printf 'Top CPU processes:\n'
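# head may close the pipe early; with pipefail and errexit set, tolerate the resulting SIGPIPE from ps.
ps -eo pid,ppid,user,pcpu,pmem,comm,args --sort=-pcpu | head -n "$((top_count + 1))" || true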
printf '\n'

printf 'Evidence:\n'
if command -v uptime >/dev/null 2>&1; then
  uptime || true
else
  printf 'WARNING: uptime command not available; used /proc/loadavg where possible\n'
fi
if ((load_per_cpu_pct >= 100)); then
  printf 'WARNING: load is higher than online CPU count; runnable task saturation is possible\n'
else
  printf 'OK: load is not above online CPU count at collection time\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; process ownership details are usually available, but some command lines may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check process ownership and whether the top process is expected\n'
printf -- '- Check recent deployments, cron jobs, batch jobs, or maintenance activity\n'
printf -- '- Review logs for the top CPU-consuming process\n'
printf -- '- Compare with longer trend data from monitoring before taking action\n'
printf -- '- Attach this output to the incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,138 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=80
critical_threshold=90
since_value="24 hours ago"
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_high_memory_oom.sh [--warning PERCENT] [--critical PERCENT] [--since TEXT] [--top N] [--help]

Detect high memory or swap usage and show recent OOM killer evidence.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd free
require_cmd ps
require_cmd awk
require_cmd head

read -r mem_total mem_used swap_total swap_used < <(free -m | awk '
  /^Mem:/ { mt=$2; mu=$3 }
  /^Swap:/ { st=$2; su=$3 }
  END { printf "%d %d %d %d\n", mt, mu, st, su }
')

mem_pct=0
swap_pct=0
if ((mem_total > 0)); then
  mem_pct=$((mem_used * 100 / mem_total))
fi
if ((swap_total > 0)); then
  swap_pct=$((swap_used * 100 / swap_total))
fi

status="OK"
exit_code=0
if ((mem_pct >= critical_threshold || swap_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
elif ((mem_pct >= warning_threshold || swap_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: Memory usage is %s%% and swap usage is %s%%\n\n' "$status" "$mem_pct" "$swap_pct"

printf 'Memory summary:\n'
free -m
printf '\n'

printf 'Top memory processes:\n'
printf 'PID RSS_MB COMMAND\n'
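# head may close the pipe early; with pipefail and errexit set, tolerate the resulting SIGPIPE from ps.
ps -eo pid=,rss=,comm= --sort=-rss | head -n "$top_count" | awk '{ printf "%-7s %-8d %s\n", $1, int($2 / 1024), $3 }' || true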
printf '\n'

printf 'OOM events since %s:\n' "$since_value"
oom_found=0
oom_source="journalctl"
if command -v journalctl >/dev/null 2>&1; then
  if journalctl --since "$since_value" -k --no-pager 2>/dev/null | grep -Ei 'out of memory|oom-killer|killed process' | tail -n 20; then
    oom_found=1
  fi
else
  printf 'WARNING: journalctl not available; checking readable log files\n'
  oom_source="log file fallback"
fi
if ((oom_found == 0)); then
  for log_file in /var/log/messages /var/log/syslog /var/log/kern.log; do
    if [[ -r "$log_file" ]]; then
      if grep -Ei 'out of memory|oom-killer|killed process' "$log_file" | tail -n 20; then
        oom_found=1
        oom_source="$log_file"
        break
      fi
    fi
  done
fi
if ((oom_found == 0)); then
  printf 'OK: no OOM evidence found in available sources\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%% since="%s"\n' "$warning_threshold" "$critical_threshold" "$since_value"
printf 'OOM evidence source: %s\n' "$oom_source"
if [[ "$oom_source" != "journalctl" ]]; then
  printf 'WARNING: log file fallback may include entries outside the requested --since window\n'
fi
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; kernel logs or process details may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Check application memory trend\n'
printf -- '- Review JVM heap settings if process is Java\n'
printf -- '- Verify swap pressure and paging activity\n'
printf -- '- Confirm whether OOM events align with application impact\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_threshold=80
critical_threshold=90
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_inode_usage.sh [--warning PERCENT] [--critical PERCENT] [--top N] [--help]

Detect inode exhaustion using df -i.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_threshold="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_threshold="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_threshold" "$critical_threshold" "$top_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_threshold >= critical_threshold)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi
if ! command -v df >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: df\n'
  exit 2
fi

tmp_df="$(mktemp)"
tmp_alerts="$(mktemp)"
trap 'rm -f "$tmp_df" "$tmp_alerts"' EXIT

df -Pi > "$tmp_df"
awk -v warn="$warning_threshold" '
  NR > 1 {
    pct=$5
    gsub(/%/, "", pct)
    # gsub leaves pct as a string; force numeric comparison so "9" is not treated as >= "80".
    if (pct + 0 >= warn + 0) {
      print $0
    }
  }
' "$tmp_df" > "$tmp_alerts"

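# pct += 0 forces a numeric comparison; without it awk compares the stripped values as strings.
max_pct="$(awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); pct += 0; if (pct > max) max=pct } END { printf "%d", max }' "$tmp_df")"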
status="OK"
exit_code=0
if ((max_pct >= critical_threshold)); then
  status="CRITICAL"
  exit_code=3
elif ((max_pct >= warning_threshold)); then
  status="WARNING"
  exit_code=1
fi

printf '%s: Highest inode usage is %s%%\n\n' "$status" "$max_pct"

printf 'Filesystems above threshold:\n'
if [[ -s "$tmp_alerts" ]]; then
  cat "$tmp_alerts"
else
  printf 'OK: no filesystems above warning threshold\n'
fi
printf '\n'

printf 'Inode usage table:\n'
cat "$tmp_df"
printf '\n'

printf 'Top affected mount points:\n'
awk 'NR > 1 { pct=$5; gsub(/%/, "", pct); print pct, $6, $1, $2, $3, $4 }' "$tmp_df" \
  | sort -rn | head -n "$top_count" \
  | awk '{ printf "%s%% %s %s inodes=%s used=%s free=%s\n", $1, $2, $3, $4, $5, $6 }'
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s%% critical=%s%%\n\n' "$warning_threshold" "$critical_threshold"

printf 'Recommended next steps:\n'
printf -- '- Find directories with many small files under affected mount points\n'
printf -- '- Check logs, cache, spool, session, and temporary directories\n'
printf -- '- Avoid deleting blindly; confirm ownership and application impact first\n'
printf -- '- Confirm whether inode exhaustion is causing write or deploy failures\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

target_pid=""
match_string=""
top_count=10

usage() {
  cat <<'USAGE'
Usage: check_jvm_threads_heap.sh [--pid PID | --match STRING] [--top N] [--help]

Provide lightweight JVM process diagnostics. Does not create heap dumps or modify processes.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --pid) [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }; target_pid="$2"; shift 2 ;;
    --match) [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }; match_string="$2"; shift 2 ;;
    --top) [[ $# -ge 2 ]] || { printf 'CRITICAL: --top requires a value\n'; exit 2; }; top_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -n "$target_pid" && -n "$match_string" ]]; then
  printf 'CRITICAL: use either --pid or --match, not both\n'
  exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
  printf 'CRITICAL: --pid must be numeric\n'
  exit 2
fi
if ! is_number "$top_count"; then
  printf 'CRITICAL: --top must be numeric\n'
  exit 2
fi
if ! command -v ps >/dev/null 2>&1; then
  printf 'CRITICAL: required command not found: ps\n'
  exit 2
fi

tmp_java="$(mktemp)"
trap 'rm -f "$tmp_java"' EXIT

ps -eo pid=,user=,rss=,pcpu=,comm=,args= \
  | awk 'tolower($0) ~ /java/ && $1 != "" { print }' > "$tmp_java"

if [[ -z "$target_pid" && -n "$match_string" ]]; then
  target_pid="$(grep -F "$match_string" "$tmp_java" | awk 'NR == 1 { print $1 }' || true)"
fi

if [[ -z "$target_pid" ]]; then
  detected_count="$(wc -l < "$tmp_java" | awk '{print $1}')"
  if ((detected_count == 0)); then
    printf 'WARNING: No Java processes detected\n\n'
  else
printf 'OK: Detected %s Java process(es); rerun with --pid PID for heap detail\n\n' "$detected_count"
|
||||||
|
fi
|
||||||
|
printf 'Detected JVM processes:\n'
|
||||||
|
printf 'PID USER RSS_MB CPU COMMAND\n'
|
||||||
|
awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }' "$tmp_java" | head -n "$top_count"
|
||||||
|
printf '\nRecommended next steps:\n'
|
||||||
|
printf -- '- Select a JVM process with --pid for focused diagnostics\n'
|
||||||
|
printf -- '- Review GC logs and application logs for the selected process\n'
|
||||||
|
printf -- '- Check heap sizing and thread count trend\n'
|
||||||
|
printf -- '- Capture jstack only if approved by operational process\n'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! ps -p "$target_pid" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: process does not exist or is not visible: %s\n' "$target_pid"
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
proc_line="$(ps -p "$target_pid" -o pid=,user=,rss=,pcpu=,comm=,args=)"
|
||||||
|
if ! printf '%s\n' "$proc_line" | grep -qi 'java'; then
|
||||||
|
printf 'WARNING: PID %s does not appear to be a Java process from ps output\n\n' "$target_pid"
|
||||||
|
status="WARNING"
|
||||||
|
exit_code=1
|
||||||
|
else
|
||||||
|
status="OK"
|
||||||
|
exit_code=0
|
||||||
|
fi
|
||||||
|
|
||||||
|
thread_count="unavailable"
|
||||||
|
if [[ -r "/proc/${target_pid}/status" ]]; then
|
||||||
|
thread_count="$(awk '/^Threads:/ { print $2 }' "/proc/${target_pid}/status")"
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '%s: JVM diagnostics collected for PID %s\n\n' "$status" "$target_pid"
|
||||||
|
|
||||||
|
printf 'Detected JVM process:\n'
|
||||||
|
printf 'PID USER RSS_MB CPU COMMAND\n'
|
||||||
|
printf '%s\n' "$proc_line" | awk '{ pid=$1; user=$2; rss=int($3 / 1024); cpu=$4; $1=$2=$3=$4=""; sub(/^ +/, ""); printf "%s %s %s %s %s\n", pid, user, rss, cpu, $0 }'
|
||||||
|
printf 'Thread count: %s\n\n' "$thread_count"
|
||||||
|
|
||||||
|
printf 'Heap and JVM evidence:\n'
|
||||||
|
if command -v jcmd >/dev/null 2>&1; then
|
||||||
|
printf '\n[jcmd VM.flags]\n'
|
||||||
|
jcmd "$target_pid" VM.flags 2>/dev/null || printf 'WARNING: jcmd VM.flags failed; permissions may be limited\n'
|
||||||
|
printf '\n[jcmd GC.heap_info]\n'
|
||||||
|
jcmd "$target_pid" GC.heap_info 2>/dev/null || printf 'WARNING: jcmd GC.heap_info failed; permissions may be limited\n'
|
||||||
|
printf '\n[jcmd Thread.print summary]\n'
|
||||||
|
jcmd "$target_pid" Thread.print 2>/dev/null | awk '/java.lang.Thread.State/ { state[$0]++ } END { for (item in state) print state[item], item }' | sort -rn | head -n "$top_count" || printf 'WARNING: jcmd Thread.print failed; permissions may be limited\n'
|
||||||
|
elif command -v jstat >/dev/null 2>&1; then
|
||||||
|
printf '\n[jstat -gc]\n'
|
||||||
|
jstat -gc "$target_pid" 1 1 2>/dev/null || printf 'WARNING: jstat failed; permissions may be limited\n'
|
||||||
|
else
|
||||||
|
printf 'WARNING: jcmd and jstat are unavailable; heap details skipped\n'
|
||||||
|
fi
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf 'Evidence:\n'
|
||||||
|
printf 'PID=%s thread_count=%s top=%s\n' "$target_pid" "$thread_count" "$top_count"
|
||||||
|
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
|
||||||
|
printf 'WARNING: running without root; JVM attach and /proc details may be limited by process ownership\n'
|
||||||
|
fi
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf 'Recommended next steps:\n'
|
||||||
|
printf -- '- Review GC logs and recent application errors\n'
|
||||||
|
printf -- '- Check JVM heap sizing against container or host memory limits\n'
|
||||||
|
printf -- '- Check thread count trend in monitoring before concluding a leak\n'
|
||||||
|
printf -- '- Capture jstack only if approved by operational process\n'
|
||||||
|
printf -- '- Attach this output to incident ticket\n'
|
||||||
|
|
||||||
|
exit "$exit_code"
|
||||||
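The `Thread.print` summary in the script above counts whole `java.lang.Thread.State` lines, so `WAITING (parking)` and `WAITING (on object monitor)` are tallied separately. As a rough sketch of a variation, assuming one count per state name is wanted, a hypothetical helper `summarize_thread_states` (not part of the toolkit) could group on the second field instead. It only parses text from stdin and never attaches to a JVM.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of check_jvm_threads_heap.sh.
# Reads a thread dump on stdin and prints "<count> <STATE>" per state name,
# highest count first (ties broken alphabetically).
summarize_thread_states() {
  awk '/java\.lang\.Thread\.State:/ { state[$2]++ }
       END { for (name in state) printf "%d %s\n", state[name], name }' \
    | sort -k1,1nr -k2,2
}

# Example with canned input instead of a live `jcmd PID Thread.print`:
printf '%s\n' \
  '   java.lang.Thread.State: WAITING (parking)' \
  '   java.lang.Thread.State: RUNNABLE' \
  '   java.lang.Thread.State: WAITING (on object monitor)' \
  | summarize_thread_states
```

Grouping by state name gives a shorter table for a handover report; keeping the full line, as the script does, preserves the wait reason, which can matter when chasing lock contention.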
@@ -0,0 +1,121 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

warning_offset_ms=500
critical_offset_ms=5000

usage() {
  cat <<'USAGE'
Usage: check_ntp_time_drift.sh [--warning-offset MS] [--critical-offset MS] [--help]

Check time synchronization status and offset evidence when available.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

while (($# > 0)); do
  case "$1" in
    --warning-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning-offset requires a value\n'; exit 2; }; warning_offset_ms="$2"; shift 2 ;;
    --critical-offset) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical-offset requires a value\n'; exit 2; }; critical_offset_ms="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

for value in "$warning_offset_ms" "$critical_offset_ms"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_offset_ms >= critical_offset_ms)); then
  printf 'CRITICAL: --warning-offset must be lower than --critical-offset\n'
  exit 2
fi

system_time="$(date '+%Y-%m-%d %H:%M:%S %Z %z')"
timezone="$(date '+%Z %z')"
sync_status="unknown"
detected_tool="none"
offset_ms=""

timedate_output=""
if command -v timedatectl >/dev/null 2>&1; then
  detected_tool="timedatectl"
  timedate_output="$(timedatectl 2>/dev/null || true)"
  sync_status="$(printf '%s\n' "$timedate_output" | awk -F: '/System clock synchronized|NTP synchronized/ { gsub(/^ +/, "", $2); print $2; exit }')"
  [[ -n "$sync_status" ]] || sync_status="unknown"
fi

chronyc_output=""
if command -v chronyc >/dev/null 2>&1; then
  detected_tool="chronyc"
  chronyc_output="$(chronyc tracking 2>/dev/null || true)"
  raw_offset="$(printf '%s\n' "$chronyc_output" | awk -F: '/Last offset|System time/ { gsub(/^ +| seconds.*$/, "", $2); print $2; exit }')"
  if [[ -n "$raw_offset" ]]; then
    offset_ms="$(awk -v seconds="$raw_offset" 'BEGIN { if (seconds < 0) seconds = -seconds; printf "%d", seconds * 1000 }')"
  fi
elif command -v ntpq >/dev/null 2>&1; then
  detected_tool="ntpq"
fi

status="OK"
exit_code=0
if [[ "$sync_status" =~ ^(no|false)$ ]]; then
  status="WARNING"
  exit_code=1
fi
if [[ -n "$offset_ms" ]]; then
  if ((offset_ms >= critical_offset_ms)); then
    status="CRITICAL"
    exit_code=3
  elif ((offset_ms >= warning_offset_ms)); then
    status="WARNING"
    exit_code=1
  fi
elif [[ "$detected_tool" == "none" ]]; then
  status="WARNING"
  exit_code=1
fi

printf '%s: Time sync status=%s offset_ms=%s\n\n' "$status" "$sync_status" "${offset_ms:-unavailable}"

printf 'Time status:\n'
printf 'System time: %s\n' "$system_time"
printf 'Timezone: %s\n' "$timezone"
printf 'Detected tool: %s\n' "$detected_tool"
printf 'NTP synchronized: %s\n' "$sync_status"
printf 'Offset ms: %s\n\n' "${offset_ms:-unavailable}"

printf 'Tool evidence:\n'
if [[ -n "$chronyc_output" ]]; then
  printf '%s\n' "$chronyc_output"
elif command -v ntpq >/dev/null 2>&1; then
  ntpq -p 2>/dev/null || printf 'WARNING: ntpq command failed\n'
elif [[ -n "$timedate_output" ]]; then
  printf '%s\n' "$timedate_output"
else
  printf 'WARNING: timedatectl, chronyc, and ntpq are unavailable or returned no data\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%sms critical=%sms\n' "$warning_offset_ms" "$critical_offset_ms"
if [[ -z "$offset_ms" ]]; then
  printf 'WARNING: offset unavailable; status is based on available synchronization indicators only\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Verify chrony or ntpd service status and configuration\n'
printf -- '- Check NTP sources and reachability\n'
printf -- '- Check virtualization host time if this is a VM\n'
printf -- '- Avoid restarting time services blindly in production\n'
printf -- '- Attach this output to incident ticket\n'

exit "$exit_code"
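The offset handling in the script above boils down to two steps: turn a signed offset in seconds into whole, absolute milliseconds, then compare it against the warning and critical thresholds. A minimal sketch of just that arithmetic follows; the function names `offset_to_ms` and `classify_offset` are hypothetical and exist only for illustration, and nothing here calls `chronyc`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the arithmetic in check_ntp_time_drift.sh.

# Convert a signed offset in seconds (e.g. "+0.812345") to whole, absolute ms.
offset_to_ms() {
  awk -v seconds="$1" 'BEGIN {
    value = seconds + 0             # force numeric interpretation
    if (value < 0) value = -value   # direction does not matter for thresholds
    printf "%d\n", value * 1000     # %d truncates toward zero
  }'
}

# Map an offset in ms to a status. $1=offset_ms $2=warning_ms $3=critical_ms
classify_offset() {
  if [ "$1" -ge "$3" ]; then
    printf 'CRITICAL\n'
  elif [ "$1" -ge "$2" ]; then
    printf 'WARNING\n'
  else
    printf 'OK\n'
  fi
}

ms="$(offset_to_ms '+0.812345')"
classify_offset "$ms" 500 5000
```

Because the conversion truncates rather than rounds, an offset of 499.9 ms stays below a 500 ms warning threshold. Whether that is acceptable depends on how tight the thresholds are in practice.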
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

service_name=""
since_value="1 hour ago"
warning_count=3
critical_count=10

usage() {
  cat <<'USAGE'
Usage: check_service_restart_loop.sh --service SERVICE_NAME [--since TEXT] [--warning COUNT] [--critical COUNT] [--help]

Detect restart-loop evidence for a systemd service. Read-only.
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    printf 'CRITICAL: required command not found: %s\n' "$1"
    exit 2
  fi
}

while (($# > 0)); do
  case "$1" in
    --service) [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }; service_name="$2"; shift 2 ;;
    --since) [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }; since_value="$2"; shift 2 ;;
    --warning) [[ $# -ge 2 ]] || { printf 'CRITICAL: --warning requires a value\n'; exit 2; }; warning_count="$2"; shift 2 ;;
    --critical) [[ $# -ge 2 ]] || { printf 'CRITICAL: --critical requires a value\n'; exit 2; }; critical_count="$2"; shift 2 ;;
    --help|-h) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown option: %s\n' "$1"; usage; exit 2 ;;
  esac
done

if [[ -z "$service_name" ]]; then
  printf 'CRITICAL: --service is required\n'
  usage
  exit 2
fi
for value in "$warning_count" "$critical_count"; do
  if ! is_number "$value"; then
    printf 'CRITICAL: numeric option expected, got: %s\n' "$value"
    exit 2
  fi
done
if ((warning_count >= critical_count)); then
  printf 'CRITICAL: --warning must be lower than --critical\n'
  exit 2
fi

require_cmd systemctl

active_state="$(systemctl show "$service_name" --property=ActiveState --value 2>/dev/null || printf 'unknown')"
sub_state="$(systemctl show "$service_name" --property=SubState --value 2>/dev/null || printf 'unknown')"
n_restarts="$(systemctl show "$service_name" --property=NRestarts --value 2>/dev/null || printf '')"
restart_count="${n_restarts:-0}"
if ! is_number "$restart_count"; then
  restart_count=0
fi

status="OK"
exit_code=0
if [[ "$active_state" == "failed" ]] || ((restart_count >= critical_count)); then
  status="CRITICAL"
  exit_code=3
elif ((restart_count >= warning_count)) || [[ "$active_state" != "active" ]]; then
  status="WARNING"
  exit_code=1
fi

printf '%s: Service %s state=%s substate=%s restarts=%s\n\n' "$status" "$service_name" "$active_state" "$sub_state" "$restart_count"

printf 'Service state:\n'
systemctl status "$service_name" --no-pager --lines=8 2>/dev/null || printf 'WARNING: unable to read service status for %s\n' "$service_name"
printf '\n'

printf 'Systemd properties:\n'
systemctl show "$service_name" --property=Id,Names,LoadState,ActiveState,SubState,Result,ExecMainStatus,NRestarts,Restart,RestartUSec --no-pager 2>/dev/null || true
printf '\n'

printf 'Recent start/stop/failure log lines since %s:\n' "$since_value"
if command -v journalctl >/dev/null 2>&1; then
  journalctl -u "$service_name" --since "$since_value" --no-pager 2>/dev/null \
    | grep -Ei 'start|stop|fail|restart|exit|status|main process' \
    | tail -n 40 || printf 'OK: no matching journal lines found\n'
else
  printf 'WARNING: journalctl not available; service logs unavailable from this script\n'
fi
printf '\n'

printf 'Evidence:\n'
printf 'Thresholds: warning=%s restarts critical=%s restarts since="%s"\n' "$warning_count" "$critical_count" "$since_value"
if [[ "${EUID:-$(id -u 2>/dev/null || printf '1')}" != "0" ]]; then
  printf 'WARNING: running without root; journal visibility may be limited\n'
fi
printf '\n'

printf 'Recommended next steps:\n'
printf -- '- Inspect the unit file and drop-in overrides\n'
printf -- '- Review application logs around the restart timestamps\n'
printf -- '- Check dependencies such as network, storage, database, or secrets\n'
printf -- '- Verify recent configuration or package changes\n'
printf -- '- Do not restart blindly; attach this output to the incident ticket\n'

exit "$exit_code"
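The status decision in the script above combines two inputs: the unit's `ActiveState` and its `NRestarts` counter. The sketch below isolates that decision so it can be exercised against canned `Key=Value` text, without a running systemd. Both `prop_value` and `classify_service` are hypothetical names introduced here for illustration, not functions from the toolkit.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; they only parse text and never call systemctl.

# Print the value of property $1 from "Key=Value" lines on stdin.
prop_value() {
  awk -v key="$1" 'index($0, key "=") == 1 { print substr($0, length(key) + 2); exit }'
}

# $1=ActiveState  $2=restart count  $3=warning threshold  $4=critical threshold
classify_service() {
  if [ "$1" = "failed" ] || [ "$2" -ge "$4" ]; then
    printf 'CRITICAL\n'
  elif [ "$2" -ge "$3" ] || [ "$1" != "active" ]; then
    printf 'WARNING\n'
  else
    printf 'OK\n'
  fi
}

# Canned properties in the same shape that `systemctl show` prints.
props='ActiveState=active
SubState=running
NRestarts=4'
state="$(printf '%s\n' "$props" | prop_value ActiveState)"
restarts="$(printf '%s\n' "$props" | prop_value NRestarts)"
classify_service "$state" "$restarts" 3 10
```

Note that any state other than `active` is at least a warning under this rule, so an intentionally stopped unit also reports `WARNING`. That matches the script's logic, but it may be noisy for services that are expected to be inactive.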
@@ -0,0 +1,20 @@
WARNING: Certificate for app.example.com:443 expires in 18 day(s)

Certificate details:
Subject: CN = app.example.com
Issuer: C = US, O = Example CA, CN = Example Intermediate CA
notBefore: Apr 11 00:00:00 2026 GMT
notAfter: May 29 23:59:59 2026 GMT
SAN/CN: DNS:app.example.com, DNS:api.example.com

Evidence:
Target: app.example.com:443
SNI: app.example.com
Thresholds: warning=30 days critical=7 days

Recommended next steps:
- Renew certificate before the operational threshold is breached
- Check the full chain and intermediate certificates
- Check the load balancer, ingress, or reverse proxy serving this certificate
- Verify monitoring threshold and alert ownership
- Attach this output to incident or change ticket
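The "expires in 18 day(s)" figure in the sample above is plain date arithmetic on the `notAfter` value. As a hedged sketch of how such a number could be derived, assuming GNU `date` (its `-d` option is not available in every `date` implementation), a hypothetical `days_until` helper might look like this. It is not taken from `check_certificate_expiry.sh`, whose source is not shown in this part of the diff.

```shell
#!/usr/bin/env bash
# Hypothetical helper; assumes GNU date for the -d option.

# Print whole days between a reference time and a certificate notAfter value.
# $1 = notAfter text as printed by openssl, $2 = reference time (default: now)
days_until() (
  expiry_epoch="$(date -u -d "$1" +%s)"
  now_epoch="$(date -u -d "${2:-now}" +%s)"
  # Integer division drops the partial day.
  printf '%s\n' "$(( (expiry_epoch - now_epoch) / 86400 ))"
)

# Reference time chosen to match the other sample outputs in this change.
days_until 'May 29 23:59:59 2026 GMT' '2026-05-11 10:18:01 UTC'
```

The function body runs in a subshell, so its working variables do not leak into the caller. Passing the reference time explicitly keeps the calculation reproducible when attaching evidence to a ticket.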
@@ -0,0 +1,23 @@
OK: DNS=OK ping=OK tcp_443=OK

DNS result:
93.184.216.34 example.com

Ping result:
3 packets transmitted, 3 received, 0% packet loss, time 2002ms

TCP port result:
OK: TCP connection to example.com:443 succeeded

Local network hints:
default via 10.0.2.1 dev eth0 proto dhcp src 10.0.2.15

Evidence:
Host: example.com count=3 timeout=3s port=443

Recommended next steps:
- Verify the DNS record and resolver path
- Check firewall, routing, security group, or proxy policy
- Compare results from another host or network segment
- Check application endpoint health after network reachability is confirmed
- Attach this output to incident ticket
@@ -0,0 +1,26 @@
CRITICAL: Found 73 failed SSH login attempt(s) for requested window

Top source IPs:
52 203.0.113.44
12 198.51.100.20
9 192.0.2.10

Top attempted users:
31 admin
24 oracle
18 root

Sample recent lines:
May 11 10:01:02 host sshd[2201]: Failed password for invalid user admin from 203.0.113.44 port 51240 ssh2
May 11 10:01:06 host sshd[2205]: Invalid user oracle from 198.51.100.20

Evidence:
Thresholds: warning=20 critical=50 since="1 hour ago"
Log source: journalctl

Recommended next steps:
- Verify source IPs against expected scanners, admins, or automation
- Check firewall, fail2ban, or security tooling state
- Confirm whether the attempts are expected for this host
- Review successful logins too, not only failures
- Attach this output to incident ticket
@@ -0,0 +1,16 @@
CRITICAL: Found 1 read-only filesystem(s)

Read-only filesystems:
MOUNT_POINT SOURCE FSTYPE OPTIONS
/data /dev/mapper/vg_data-lv_data xfs ro,relatime,seclabel,attr2,inode64

Evidence:
include_system=0
Collector: findmnt

Recommended next steps:
- Check dmesg or journal logs for I/O errors and filesystem remount events
- Check storage path, multipath, SAN, cloud volume, or underlying disk health
- Check filesystem health with the platform-approved procedure
- Do not remount read-write before understanding the cause
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: 1-minute load is 7.82 across 8 CPU(s) (97% of CPU count)

Load average:
1m=7.82 5m=6.91 15m=5.40

CPU count:
8

Top CPU processes:
PID   PPID USER %CPU %MEM COMMAND      COMMAND
2314  1    app  245  12.1 java         java -jar order-api.jar
991   1    root 38   0.4  backup-agent backup-agent --scan

Evidence:
WARNING: load is close to online CPU count; runnable task saturation is possible

Recommended next steps:
- Check process ownership and whether the top process is expected
- Check recent deployments, cron jobs, batch jobs, or maintenance activity
- Review logs for the top CPU-consuming process
- Compare with longer trend data from monitoring before taking action
- Attach this output to the incident ticket
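The "97% of CPU count" figure in the sample above is the 1-minute load divided by the CPU count: 7.82 / 8 = 0.9775. The sample shows 97 rather than 98, which suggests the value is truncated instead of rounded; that is an inference from the sample, since the source of `check_high_cpu.sh` is not shown in this part of the diff. A minimal sketch of that arithmetic, with a hypothetical `load_percent` helper:

```shell
#!/usr/bin/env bash
# Hypothetical helper: express a load average as a whole percentage of CPU count.
# $1 = load average (may be fractional), $2 = number of CPUs
load_percent() {
  # %d truncates, so 97.75 is reported as 97.
  awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%d\n", (load / cpus) * 100 }'
}

load_percent 7.82 8
```

On a live Linux host the inputs could come from `/proc/loadavg` and `nproc`, for example `load_percent "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`.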
@@ -0,0 +1,25 @@
WARNING: Memory usage is 84% and swap usage is 12%

Memory summary:
               total        used        free      shared  buff/cache   available
Mem:           15934       13386         512         121        2036        2101
Swap:           4095         512        3583

Top memory processes:
PID RSS_MB COMMAND
1234 2048 java
987 812 postgres

OOM events since 24 hours ago:
2026-05-11 08:42:13 kernel: Out of memory: Killed process 1234 (java)

Evidence:
Thresholds: warning=80% critical=90% since="24 hours ago"
OOM evidence source: journalctl

Recommended next steps:
- Check application memory trend
- Review JVM heap settings if process is Java
- Verify swap pressure and paging activity
- Confirm whether OOM events align with application impact
- Attach this output to incident ticket
@@ -0,0 +1,22 @@
WARNING: Highest inode usage is 87%

Filesystems above threshold:
/dev/mapper/vg_var-lv_var 1310720 1140326 170394 87% /var

Inode usage table:
Filesystem                   Inodes   IUsed  IFree IUse% Mounted on
/dev/mapper/vg_root-lv_root  524288   91300 432988   18% /
/dev/mapper/vg_var-lv_var   1310720 1140326 170394   87% /var

Top affected mount points:
87% /var /dev/mapper/vg_var-lv_var inodes=1310720 used=1140326 free=170394

Evidence:
Thresholds: warning=80% critical=90%

Recommended next steps:
- Find directories with many small files under affected mount points
- Check logs, cache, spool, session, and temporary directories
- Avoid deleting blindly; confirm ownership and application impact first
- Confirm whether inode exhaustion is causing write or deploy failures
- Attach this output to incident ticket
@@ -0,0 +1,30 @@
OK: JVM diagnostics collected for PID 1234

Detected JVM process:
PID USER RSS_MB CPU COMMAND
1234 app 2048 42.1 java -Xms2g -Xmx2g -jar order-api.jar
Thread count: 188

Heap and JVM evidence:

[jcmd VM.flags]
1234:
-XX:InitialHeapSize=2147483648 -XX:MaxHeapSize=2147483648

[jcmd GC.heap_info]
garbage-first heap total 2097152K, used 1521000K

[jcmd Thread.print summary]
102 java.lang.Thread.State: WAITING
53 java.lang.Thread.State: RUNNABLE
33 java.lang.Thread.State: TIMED_WAITING

Evidence:
PID=1234 thread_count=188 top=10

Recommended next steps:
- Review GC logs and recent application errors
- Check JVM heap sizing against container or host memory limits
- Check thread count trend in monitoring before concluding a leak
- Capture jstack only if approved by operational process
- Attach this output to incident ticket
@@ -0,0 +1,23 @@
WARNING: Time sync status=yes offset_ms=812

Time status:
System time: 2026-05-11 10:18:01 UTC +0000
Timezone: UTC +0000
Detected tool: chronyc
NTP synchronized: yes
Offset ms: 812

Tool evidence:
Reference ID : 203.0.113.10
System time : 0.812345 seconds fast of NTP time
Last offset : +0.812345 seconds

Evidence:
Thresholds: warning=500ms critical=5000ms

Recommended next steps:
- Verify chrony or ntpd service status and configuration
- Check NTP sources and reachability
- Check virtualization host time if this is a VM
- Avoid restarting time services blindly in production
- Attach this output to incident ticket
@@ -0,0 +1,27 @@
CRITICAL: Service app.service state=failed substate=failed restarts=12

Service state:
app.service - Example application
Loaded: loaded (/etc/systemd/system/app.service; enabled)
Active: failed (Result: exit-code)

Systemd properties:
Id=app.service
ActiveState=failed
SubState=failed
Result=exit-code
NRestarts=12

Recent start/stop/failure log lines since 1 hour ago:
May 11 09:05:01 host systemd[1]: app.service: Main process exited, status=1/FAILURE
May 11 09:05:01 host systemd[1]: app.service: Failed with result 'exit-code'.

Evidence:
Thresholds: warning=3 restarts critical=10 restarts since="1 hour ago"

Recommended next steps:
- Inspect the unit file and drop-in overrides
- Review application logs around the restart timestamps
- Check dependencies such as network, storage, database, or secrets
- Verify recent configuration or package changes
- Do not restart blindly; attach this output to the incident ticket
@@ -0,0 +1,385 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

incident_type=""
service_name=""
host_name=""
port=""
target_pid=""
match_string=""
output_file=""
since_value="1 hour ago"

script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

usage() {
  cat <<'USAGE'
Usage: incident_triage_report.sh --type TYPE [options]

Run selected read-only incident checks and produce a Markdown triage report.

Incident types:
  cpu
  memory
  service
  network
  auth
  cert
  filesystem
  jvm
  all

Options:
  --type TYPE                Incident type to collect
  --service SERVICE_NAME     systemd service name for service checks
  --host HOSTNAME_OR_FQDN    host for DNS, network, or certificate checks
  --port PORT                TCP or TLS port for host checks
  --pid PID                  JVM process ID
  --match PROCESS_MATCH      JVM process match string
  --output FILE              write Markdown report to FILE
  --since VALUE              time window for log-based checks
  --help                     show this help

Examples:
  ./incident_triage_report.sh --type cpu
  ./incident_triage_report.sh --type service --service nginx --since "30 minutes ago"
  ./incident_triage_report.sh --type network --host app.example.com --port 443
  ./incident_triage_report.sh --type all --service nginx --host app.example.com --port 443 --output triage.md
USAGE
}

is_number() {
  [[ "$1" =~ ^[0-9]+$ ]]
}

valid_type() {
  case "$1" in
    cpu|memory|service|network|auth|cert|filesystem|jvm|all) return 0 ;;
    *) return 1 ;;
  esac
}

while (($# > 0)); do
  case "$1" in
    --type)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --type requires a value\n'; exit 2; }
      incident_type="$2"
      shift 2
      ;;
    --service)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --service requires a value\n'; exit 2; }
      service_name="$2"
      shift 2
      ;;
    --host)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --host requires a value\n'; exit 2; }
      host_name="$2"
      shift 2
      ;;
    --port)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --port requires a value\n'; exit 2; }
      port="$2"
      shift 2
      ;;
    --pid)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --pid requires a value\n'; exit 2; }
      target_pid="$2"
      shift 2
      ;;
    --match)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --match requires a value\n'; exit 2; }
      match_string="$2"
      shift 2
      ;;
    --output)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --output requires a value\n'; exit 2; }
      output_file="$2"
      shift 2
      ;;
    --since)
      [[ $# -ge 2 ]] || { printf 'CRITICAL: --since requires a value\n'; exit 2; }
      since_value="$2"
      shift 2
      ;;
    --help|-h)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1"
      usage
      exit 2
      ;;
  esac
done

if [[ -z "$incident_type" ]]; then
  printf 'CRITICAL: --type is required\n'
  usage
  exit 2
fi
if ! valid_type "$incident_type"; then
  printf 'CRITICAL: unsupported incident type: %s\n' "$incident_type"
  usage
  exit 2
fi
if [[ -n "$port" ]] && ! is_number "$port"; then
  printf 'CRITICAL: --port must be numeric\n'
  exit 2
fi
if [[ -n "$target_pid" ]] && ! is_number "$target_pid"; then
  printf 'CRITICAL: --pid must be numeric\n'
  exit 2
fi
if [[ -n "$target_pid" && -n "$match_string" ]]; then
  printf 'CRITICAL: use either --pid or --match for JVM checks, not both\n'
  exit 2
fi

tmp_dir="$(mktemp -d)"
trap 'rm -rf "$tmp_dir"' EXIT

report_file="$tmp_dir/report.md"

check_labels=()
check_names=()
check_commands=()
check_statuses=()
check_exit_codes=()
check_summaries=()
check_outputs=()

status_from_exit() {
  case "$1" in
    0) printf 'OK' ;;
    1) printf 'WARNING' ;;
    2) printf 'INVALID' ;;
    3) printf 'CRITICAL' ;;
    *) printf 'ERROR' ;;
  esac
}

render_command() {
  local item
  for item in "$@"; do
    printf '%q ' "$item"
  done | sed 's/[[:space:]]*$//'
}

append_skipped_check() {
  local label="$1"
  local name="$2"
  local reason="$3"
  local output_path="$tmp_dir/check_${#check_labels[@]}.txt"

  printf 'SKIPPED: %s\n' "$reason" > "$output_path"

  check_labels+=("$label")
  check_names+=("$name")
  check_commands+=("not run")
  check_statuses+=("SKIPPED")
  check_exit_codes+=("-")
  check_summaries+=("$reason")
  check_outputs+=("$output_path")
}

run_check() {
  local label="$1"
  local script_name="$2"
  shift 2

  local script_path="${script_dir}/${script_name}"
  local output_path="$tmp_dir/check_${#check_labels[@]}.txt"
  local command_text
  local exit_code
  local status
  local summary

  command_text="$(render_command "$script_path" "$@")"

  if [[ ! -e "$script_path" ]]; then
    append_skipped_check "$label" "$script_name" "missing script: $script_name"
    return
  fi
  if [[ ! -x "$script_path" ]]; then
    append_skipped_check "$label" "$script_name" "script is not executable: $script_name"
    return
  fi

  set +e
  "$script_path" "$@" > "$output_path" 2>&1
  exit_code=$?
  set -e

  status="$(status_from_exit "$exit_code")"
  summary="$(sed -n '1p' "$output_path")"
  if [[ -z "$summary" ]]; then
    summary="no output captured"
  fi

  check_labels+=("$label")
  check_names+=("$script_name")
  check_commands+=("$command_text")
  check_statuses+=("$status")
  check_exit_codes+=("$exit_code")
  check_summaries+=("$summary")
  check_outputs+=("$output_path")
}

run_cpu_checks() {
  run_check "CPU saturation" "check_high_cpu.sh"
}

run_memory_checks() {
  run_check "Memory and OOM" "check_high_memory_oom.sh" --since "$since_value"
}

run_service_checks() {
  if [[ -z "$service_name" ]]; then
    append_skipped_check "Service restart loop" "check_service_restart_loop.sh" "requires --service SERVICE_NAME"
    return
  fi
  run_check "Service restart loop" "check_service_restart_loop.sh" --service "$service_name" --since "$since_value"
}

run_network_checks() {
  local args=(--host "$host_name")
  if [[ -z "$host_name" ]]; then
    append_skipped_check "DNS and connectivity" "check_dns_connectivity.sh" "requires --host HOSTNAME_OR_FQDN"
    return
  fi
  if [[ -n "$port" ]]; then
    args+=(--port "$port")
  fi
  run_check "DNS and connectivity" "check_dns_connectivity.sh" "${args[@]}"
}

run_auth_checks() {
  run_check "Failed SSH logins" "check_failed_ssh_logins.sh" --since "$since_value"
}

run_cert_checks() {
  local args=(--host "$host_name")
  if [[ -z "$host_name" ]]; then
    append_skipped_check "Certificate expiry" "check_certificate_expiry.sh" "requires --host HOSTNAME_OR_FQDN"
    return
  fi
  if [[ -n "$port" ]]; then
    args+=(--port "$port")
  fi
  run_check "Certificate expiry" "check_certificate_expiry.sh" "${args[@]}"
}

run_filesystem_checks() {
  run_check "Read-only filesystems" "check_filesystem_readonly.sh"
  run_check "Inode usage" "check_inode_usage.sh"
|
||||||
|
}
|
||||||
|
|
||||||
|
run_jvm_checks() {
|
||||||
|
local args=()
|
||||||
|
if [[ -n "$target_pid" ]]; then
|
||||||
|
args+=(--pid "$target_pid")
|
||||||
|
elif [[ -n "$match_string" ]]; then
|
||||||
|
args+=(--match "$match_string")
|
||||||
|
fi
|
||||||
|
run_check "JVM threads and heap" "check_jvm_threads_heap.sh" "${args[@]}"
|
||||||
|
}
|
||||||
|
|
||||||
|
case "$incident_type" in
|
||||||
|
cpu) run_cpu_checks ;;
|
||||||
|
memory) run_memory_checks ;;
|
||||||
|
service) run_service_checks ;;
|
||||||
|
network) run_network_checks ;;
|
||||||
|
auth) run_auth_checks ;;
|
||||||
|
cert) run_cert_checks ;;
|
||||||
|
filesystem) run_filesystem_checks ;;
|
||||||
|
jvm) run_jvm_checks ;;
|
||||||
|
all)
|
||||||
|
run_cpu_checks
|
||||||
|
run_memory_checks
|
||||||
|
run_service_checks
|
||||||
|
run_network_checks
|
||||||
|
run_auth_checks
|
||||||
|
run_cert_checks
|
||||||
|
run_filesystem_checks
|
||||||
|
run_jvm_checks
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
generated_at="$(date -u '+%Y-%m-%dT%H:%M:%SZ')"
|
||||||
|
local_hostname="$(hostname 2>/dev/null || printf 'unknown')"
|
||||||
|
current_user="$(id -un 2>/dev/null || printf 'unknown')"
|
||||||
|
|
||||||
|
{
|
||||||
|
printf '# L2 Incident Triage Report\n\n'
|
||||||
|
printf -- '- Generated: %s\n' "$generated_at"
|
||||||
|
printf -- '- Local hostname: %s\n' "$local_hostname"
|
||||||
|
printf -- '- Current user: %s\n' "$current_user"
|
||||||
|
printf -- '- Incident type: %s\n' "$incident_type"
|
||||||
|
printf -- '- Service: %s\n' "${service_name:-not provided}"
|
||||||
|
printf -- '- Host: %s\n' "${host_name:-not provided}"
|
||||||
|
printf -- '- Port: %s\n' "${port:-not provided}"
|
||||||
|
printf -- '- PID: %s\n' "${target_pid:-not provided}"
|
||||||
|
printf -- '- Process match: %s\n' "${match_string:-not provided}"
|
||||||
|
printf -- '- Since: %s\n\n' "$since_value"
|
||||||
|
|
||||||
|
printf '## Executed Checks\n\n'
|
||||||
|
printf '| Check | Script | Status | Exit | Command |\n'
|
||||||
|
printf '| --- | --- | --- | --- | --- |\n'
|
||||||
|
for index in "${!check_labels[@]}"; do
|
||||||
|
printf "| %s | \`%s\` | %s | %s | \`%s\` |\n" \
|
||||||
|
"${check_labels[$index]}" \
|
||||||
|
"${check_names[$index]}" \
|
||||||
|
"${check_statuses[$index]}" \
|
||||||
|
"${check_exit_codes[$index]}" \
|
||||||
|
"${check_commands[$index]}"
|
||||||
|
done
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf '## Summary\n\n'
|
||||||
|
for index in "${!check_labels[@]}"; do
|
||||||
|
printf -- '- %s: %s\n' "${check_labels[$index]}" "${check_summaries[$index]}"
|
||||||
|
done
|
||||||
|
printf '\n'
|
||||||
|
|
||||||
|
printf '## Raw Evidence\n\n'
|
||||||
|
for index in "${!check_labels[@]}"; do
|
||||||
|
printf '### %s\n\n' "${check_labels[$index]}"
|
||||||
|
printf "Script: \`%s\`\n\n" "${check_names[$index]}"
|
||||||
|
printf "Command: \`%s\`\n\n" "${check_commands[$index]}"
|
||||||
|
printf 'Status: %s, exit: %s\n\n' "${check_statuses[$index]}" "${check_exit_codes[$index]}"
|
||||||
|
printf '```text\n'
|
||||||
|
cat "${check_outputs[$index]}"
|
||||||
|
printf '\n```\n\n'
|
||||||
|
done
|
||||||
|
|
||||||
|
printf '## L2 Handover Checklist\n\n'
|
||||||
|
printf -- '- [ ] Business impact confirmed\n'
|
||||||
|
printf -- '- [ ] Affected host/service identified\n'
|
||||||
|
printf -- '- [ ] Monitoring alert attached\n'
|
||||||
|
printf -- '- [ ] Recent changes checked\n'
|
||||||
|
printf -- '- [ ] Logs attached\n'
|
||||||
|
printf -- '- [ ] Service owner identified\n'
|
||||||
|
printf -- '- [ ] Escalation target identified\n\n'
|
||||||
|
|
||||||
|
printf '## Escalation Notes\n\n'
|
||||||
|
printf -- '- Escalate when impact is active, spreading, customer-facing, or outside L2 access.\n'
|
||||||
|
printf -- '- Include the alert, timeline, commands run, and the raw evidence above.\n'
|
||||||
|
printf -- '- Call out skipped checks and missing inputs so the next responder does not repeat the same gap.\n'
|
||||||
|
printf -- '- Do not restart, kill, remount, or rotate anything unless the incident owner approves the action.\n\n'
|
||||||
|
|
||||||
|
printf '## Recommended Next Steps\n\n'
|
||||||
|
printf -- '- Confirm the symptom against monitoring and user reports.\n'
|
||||||
|
printf -- '- Compare this point-in-time evidence with recent deploys, config changes, and host events.\n'
|
||||||
|
printf -- '- Attach this report to the incident ticket before handoff.\n'
|
||||||
|
printf -- '- If escalation is needed, include exact hostnames, service names, timestamps, and observed impact.\n'
|
||||||
|
} > "$report_file"
|
||||||
|
|
||||||
|
if [[ -n "$output_file" ]]; then
|
||||||
|
cp "$report_file" "$output_file"
|
||||||
|
printf 'OK: wrote L2 incident triage report to %s\n' "$output_file"
|
||||||
|
else
|
||||||
|
cat "$report_file"
|
||||||
|
fi
|
||||||
@@ -10,6 +10,11 @@ Current subdirectories are planning areas unless their own README documents a ru
|
|||||||
- `ci-cd`
|
||||||
- `docker`
|
||||||
|
|
||||||
|
## Linux operations labs
|
||||||
|
|
||||||
|
- [Linux Fresh Setup Toolkit](./linux/setup/) - Bootstrap automation for fresh Ubuntu lab hosts, including shell profile, Cockpit, Docker, libvirt/KVM, NVIDIA diagnostics, tuning and safe baseline defaults.
|
||||||
|
- [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety and systemd timers.
|
||||||
|
|
||||||
Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready.
|
||||||
|
|
||||||
Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/).
|
||||||
|
|||||||
@@ -0,0 +1,308 @@
|
|||||||
|
# AI Lab Maintenance Toolkit
|
||||||
|
|
||||||
|
## Executive summary
|
||||||
|
|
||||||
|
The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
|
||||||
|
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
|
||||||
|
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
|
||||||
|
configuration backup, and non-destructive VM inventory into a small toolkit
|
||||||
|
that is readable enough for review and guarded enough for homelab use.
|
||||||
|
|
||||||
|
This is a portfolio and lab implementation, not evidence of production
|
||||||
|
certification. Review package policy, backup coverage, maintenance windows, and
|
||||||
|
application impact before deploying it to another host.
|
||||||
|
|
||||||
|
## Problem solved
|
||||||
|
|
||||||
|
AI lab hosts accumulate operating system packages, kernel packages, container
|
||||||
|
images, build cache, journals, and configuration changes while also carrying
|
||||||
|
stateful workloads. Manual maintenance is easy to defer and risky to perform
|
||||||
|
without evidence. This project provides scheduled, logged tasks with explicit
|
||||||
|
safety boundaries and separate read-only audit commands.
|
||||||
|
|
||||||
|
## What this demonstrates
|
||||||
|
|
||||||
|
- Bash strict mode, input validation, dependency checks, and operational exit
|
||||||
|
codes.
|
||||||
|
- Dry-run-first maintenance with explicit authorization for changes.
|
||||||
|
- systemd oneshot services and persistent calendar timers.
|
||||||
|
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
|
||||||
|
- Docker cleanup that preserves volumes.
|
||||||
|
- Configuration-focused backups with bounded retention.
|
||||||
|
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
|
||||||
|
- Idempotent installation and guarded JSON configuration updates.
|
||||||
|
|
||||||
|
## Architecture and directory layout
|
||||||
|
|
||||||
|
```text
|
||||||
|
ailab-maintenance/
|
||||||
|
├── README.md
|
||||||
|
├── install.sh
|
||||||
|
├── scripts/
|
||||||
|
│ ├── ailab-healthcheck.sh
|
||||||
|
│ ├── ailab-disk-watch.sh
|
||||||
|
│ ├── ailab-apt-cleanup.sh
|
||||||
|
│ ├── ailab-kernel-cleanup.sh
|
||||||
|
│ ├── ailab-docker-cleanup.sh
|
||||||
|
│ ├── ailab-config-backup.sh
|
||||||
|
│ └── ailab-vm-audit.sh
|
||||||
|
└── systemd/
|
||||||
|
├── ailab-apt-cleanup.service
|
||||||
|
├── ailab-apt-cleanup.timer
|
||||||
|
├── ailab-kernel-cleanup.service
|
||||||
|
├── ailab-kernel-cleanup.timer
|
||||||
|
├── ailab-docker-cleanup.service
|
||||||
|
├── ailab-docker-cleanup.timer
|
||||||
|
├── ailab-config-backup.service
|
||||||
|
├── ailab-config-backup.timer
|
||||||
|
├── ailab-disk-watch.service
|
||||||
|
└── ailab-disk-watch.timer
|
||||||
|
```
|
||||||
|
|
||||||
|
The installer deploys scripts to `/usr/local/sbin` and units to
|
||||||
|
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
|
||||||
|
through an additional framework.
|
||||||
|
|
||||||
|
## Maintenance tasks
|
||||||
|
|
||||||
|
| Command | Purpose | Change behavior |
|
||||||
|
| --- | --- | --- |
|
||||||
|
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
|
||||||
|
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
|
||||||
|
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
|
||||||
|
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
|
||||||
|
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
|
||||||
|
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
|
||||||
|
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |
|
||||||
|
|
||||||
|
## Safety model
|
||||||
|
|
||||||
|
Change-capable scripts default to dry-run behavior. Manual execution requires
|
||||||
|
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
|
||||||
|
use `--execute --non-interactive`; installing and enabling those reviewed unit
|
||||||
|
files is the explicit authorization for scheduled maintenance.
|
||||||
|
|
||||||
|
Exit codes follow the repository convention:
|
||||||
|
|
||||||
|
- `0`: completed successfully or an optional component was absent.
|
||||||
|
- `1`: an operational check or maintenance action failed.
|
||||||
|
- `2`: invalid input, missing required dependency, or insufficient privilege.
|
||||||
|
|
||||||
|
The scripts do not bypass APT or Docker locks, delete VM resources, manually
|
||||||
|
select kernel names for removal, or hide command failures.
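
The flag handling described in this section can be sketched as a small gate function. This is a minimal illustration rather than the toolkit's actual implementation: `resolve_mode` and the mode names it prints (`dry-run`, `interactive`, `scheduled`) are hypothetical.

```bash
# Hypothetical sketch: map command-line flags to a run mode.
# Prints "dry-run", "interactive", or "scheduled"; returns 2 on invalid input.
resolve_mode() {
  local execute=false non_interactive=false
  while (($# > 0)); do
    case "$1" in
      --execute) execute=true ;;
      --non-interactive) non_interactive=true ;;
      *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; return 2 ;;
    esac
    shift
  done
  # --non-interactive without --execute is rejected as invalid input.
  if [[ "$non_interactive" == true && "$execute" != true ]]; then
    printf 'CRITICAL: --non-interactive requires --execute\n' >&2
    return 2
  fi
  if [[ "$execute" != true ]]; then
    printf 'dry-run\n'       # default: report only, change nothing
  elif [[ "$non_interactive" == true ]]; then
    printf 'scheduled\n'     # reviewed automation, no prompt
  else
    printf 'interactive\n'   # operator must still type EXECUTE
  fi
}
```

A caller would branch on the printed mode and prompt for the `EXECUTE` confirmation only in the `interactive` case.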
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
Review every script and unit first. Installation changes package state,
|
||||||
|
journald settings, Docker daemon settings when Docker exists, and enabled timer
|
||||||
|
state.
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd labs/linux/ailab-maintenance
|
||||||
|
sudo ./install.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
The installer:
|
||||||
|
|
||||||
|
1. Installs the documented Ubuntu utilities.
|
||||||
|
2. Deploys scripts and systemd units with fixed permissions.
|
||||||
|
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
|
||||||
|
4. Restarts `systemd-journald`.
|
||||||
|
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
|
||||||
|
with `jq`, and attempts a Docker restart.
|
||||||
|
6. Enables all five timers.
|
||||||
|
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.
|
||||||
|
|
||||||
|
The installer is intended for Ubuntu 26.04. It is not run automatically by
|
||||||
|
repository validation.
|
||||||
|
|
||||||
|
## Manual commands
|
||||||
|
|
||||||
|
Read-only reports:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo /usr/local/sbin/ailab-healthcheck.sh
|
||||||
|
sudo /usr/local/sbin/ailab-disk-watch.sh
|
||||||
|
sudo /usr/local/sbin/ailab-vm-audit.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Preview maintenance:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo /usr/local/sbin/ailab-apt-cleanup.sh
|
||||||
|
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
|
||||||
|
sudo /usr/local/sbin/ailab-docker-cleanup.sh
|
||||||
|
sudo /usr/local/sbin/ailab-config-backup.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Apply reviewed maintenance interactively:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
|
||||||
|
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
|
||||||
|
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
|
||||||
|
sudo /usr/local/sbin/ailab-config-backup.sh --execute
|
||||||
|
```
|
||||||
|
|
||||||
|
`--non-interactive` is reserved for reviewed automation and is rejected unless
|
||||||
|
`--execute` is also present.
|
||||||
|
|
||||||
|
## Systemd timers
|
||||||
|
|
||||||
|
| Timer | Schedule |
|
||||||
|
| --- | --- |
|
||||||
|
| `ailab-config-backup.timer` | Daily at 03:30 |
|
||||||
|
| `ailab-disk-watch.timer` | Hourly |
|
||||||
|
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
|
||||||
|
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
|
||||||
|
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |
|
||||||
|
|
||||||
|
All timers use `Persistent=true`, so a missed event runs after the host becomes
|
||||||
|
available. Inspect timer and service evidence with:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl list-timers --all | grep ailab-
|
||||||
|
systemctl status ailab-config-backup.timer
|
||||||
|
journalctl -u ailab-kernel-cleanup.service
|
||||||
|
```
|
||||||
|
|
||||||
|
## Logs
|
||||||
|
|
||||||
|
Scheduled and manual maintenance writes to:
|
||||||
|
|
||||||
|
```text
|
||||||
|
/var/log/ailab-apt-cleanup.log
|
||||||
|
/var/log/ailab-kernel-cleanup.log
|
||||||
|
/var/log/ailab-docker-cleanup.log
|
||||||
|
/var/log/ailab-config-backup.log
|
||||||
|
/var/log/ailab-disk-watch.log
|
||||||
|
```
|
||||||
|
|
||||||
|
systemd also records service output in the journal. Logrotate is installed as a
|
||||||
|
dependency, but this lab does not create a custom rotation policy for these
|
||||||
|
small maintenance logs.
|
||||||
|
|
||||||
|
## Docker policy
|
||||||
|
|
||||||
|
Docker cleanup runs `docker system prune -af` and removes build cache older
|
||||||
|
than seven days. It never passes `--volumes`. Named and anonymous volumes
|
||||||
|
remain outside this automated policy and require application-aware review.
|
||||||
|
|
||||||
|
The installer configures the `json-file` driver with a maximum size of `50m`
|
||||||
|
and five files. Existing valid JSON is backed up and merged. Invalid JSON
|
||||||
|
causes installation to stop rather than overwrite operator configuration.
|
||||||
|
|
||||||
|
## Kernel policy
|
||||||
|
|
||||||
|
Kernel removal is delegated to `apt autoremove --purge`; package names are not
|
||||||
|
constructed or purged with regular expressions. Before execution, the script
|
||||||
|
logs the APT simulation and refuses cleanup unless at least two installed
|
||||||
|
versioned kernel image packages remain after simulated removals.
|
||||||
|
|
||||||
|
This protects a fallback kernel while preserving Ubuntu dependency policy.
|
||||||
|
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
|
||||||
|
Secure Boot state, and the simulated removal set before manual execution.
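
The fallback-kernel guard described in this section could be approximated as follows. This is a sketch under stated assumptions, not the script's actual logic: `count_remaining_kernels` is a hypothetical helper, the installed package list is assumed to come from a tool such as `dpkg-query`, and the simulation text is assumed to contain one `Remv` or `Purg` line per affected package with the package name in the second field and no architecture suffix.

```bash
# Hypothetical sketch: count versioned kernel image packages that would
# remain installed after a simulated removal.
#   $1: newline-separated installed package names
#   $2: APT simulation text (lines beginning with "Remv" or "Purg")
count_remaining_kernels() {
  local installed="$1" simulation="$2"
  local package removed remaining=0
  # The second field of each Remv/Purg line is taken as the package name.
  removed="$(awk '$1 == "Remv" || $1 == "Purg" {print $2}' <<<"$simulation")"
  while IFS= read -r package; do
    # Only versioned images count; metapackages such as linux-image-generic do not.
    [[ "$package" =~ ^linux-image-[0-9] ]] || continue
    if ! grep -Fxq -- "$package" <<<"$removed"; then
      remaining=$((remaining + 1))
    fi
  done <<<"$installed"
  printf '%d\n' "$remaining"
}
```

A caller would refuse cleanup when the printed count is below two, which matches the policy stated above.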
|
||||||
|
|
||||||
|
## Backup policy
|
||||||
|
|
||||||
|
Backups are written to `/srv/backups/ailab-config` as
|
||||||
|
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
|
||||||
|
deleted only after a new archive is created.
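
The retention step can be sketched as a single `find` invocation. This is a minimal illustration, assuming GNU `find`; `prune_old_archives` is a hypothetical wrapper name. Note that `-mtime +30` matches files whose age in whole 24-hour periods is greater than 30, so an archive must be at least 31 days old before it is removed.

```bash
# Hypothetical sketch: delete only matching archives older than N days.
#   $1: backup directory
#   $2: retention in days
prune_old_archives() {
  local backup_dir="$1" retention_days="$2"
  # -maxdepth 1 and the name pattern keep unrelated files out of scope;
  # -print logs each deleted path before -delete removes it.
  find "$backup_dir" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' \
    -mtime "+$retention_days" -print -delete
}
```

Calling this only after the new archive has been written successfully preserves the ordering described above.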
|
||||||
|
|
||||||
|
The backup covers `/etc`, selected root shell configuration,
|
||||||
|
`/opt/ailab-maintenance` when present, and libvirt configuration under
|
||||||
|
`/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data,
|
||||||
|
Ollama models, VM disk images, or other large application datasets. Because
|
||||||
|
`/etc` is included, explicitly listed configuration subdirectories are already
|
||||||
|
covered even when optional-path reporting mentions them separately.
|
||||||
|
|
||||||
|
This is a local configuration backup, not a disaster-recovery design. A real
|
||||||
|
deployment should copy archives to independently protected storage and test
|
||||||
|
restoration.
|
||||||
|
|
||||||
|
## Journald policy
|
||||||
|
|
||||||
|
The installer applies:
|
||||||
|
|
||||||
|
```ini
|
||||||
|
[Journal]
|
||||||
|
SystemMaxUse=1G
|
||||||
|
SystemKeepFree=2G
|
||||||
|
MaxRetentionSec=14day
|
||||||
|
Compress=yes
|
||||||
|
```
|
||||||
|
|
||||||
|
These settings bound journal growth while retaining useful troubleshooting
|
||||||
|
evidence. Capacity and retention should be adjusted to the host's disk size
|
||||||
|
and incident-response requirements.
|
||||||
|
|
||||||
|
## Disk watch policy
|
||||||
|
|
||||||
|
The disk check uses `df -P`, defaults to an 85 percent threshold, and returns
|
||||||
|
`1` when any checked filesystem meets or exceeds the threshold. Override the
|
||||||
|
threshold for a manual or unit invocation with:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
The script reports every checked filesystem as `OK` or `WARNING` (`tmpfs` and `devtmpfs` are excluded); it does not delete
|
||||||
|
data or attempt remediation.
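
The threshold comparison can be sketched as a pure function over one `Use%` value. This is an illustration under assumptions, not the script itself: `classify_usage` is a hypothetical name, and the input is assumed to look like the capacity column of `df -P` (for example `91%`).

```bash
# Hypothetical sketch: classify one "Use%" value against a threshold.
# Prints OK or WARNING; returns 2 when either value is unusable.
classify_usage() {
  local use_percent="$1" threshold="${2:-85}"
  local usage="${use_percent%\%}"   # strip the trailing percent sign
  [[ "$threshold" =~ ^[0-9]+$ ]] || return 2
  [[ "$usage" =~ ^[0-9]+$ ]] || return 2
  # Force base 10 so values with leading zeros are not read as octal.
  threshold=$((10#$threshold))
  usage=$((10#$usage))
  if ((threshold < 1 || threshold > 100)); then
    return 2
  fi
  if ((usage >= threshold)); then
    printf 'WARNING\n'
  else
    printf 'OK\n'
  fi
}
```

A value equal to the threshold is classified as `WARNING`, matching the "meets or exceeds" wording above.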
|
||||||
|
|
||||||
|
## Example operational workflows
|
||||||
|
|
||||||
|
### Weekly maintenance review
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo /usr/local/sbin/ailab-healthcheck.sh
|
||||||
|
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
|
||||||
|
sudo /usr/local/sbin/ailab-docker-cleanup.sh
|
||||||
|
systemctl list-timers --all | grep ailab-
|
||||||
|
```
|
||||||
|
|
||||||
|
Review the kernel simulation, Docker usage, failed units, backup freshness, and
|
||||||
|
disk warnings before approving manual changes.
|
||||||
|
|
||||||
|
### Disk pressure investigation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
|
||||||
|
sudo docker system df
|
||||||
|
sudo journalctl --disk-usage
|
||||||
|
sudo /usr/local/sbin/ailab-vm-audit.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
Use the evidence to identify ownership. Do not treat Docker pruning or file
|
||||||
|
deletion as a substitute for application-specific retention policy.
|
||||||
|
|
||||||
|
### Post-maintenance evidence
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo /usr/local/sbin/ailab-healthcheck.sh \
|
||||||
|
| sudo tee /root/ailab-healthcheck-after-maintenance.txt
|
||||||
|
journalctl --since today -u 'ailab-*.service'
|
||||||
|
```
|
||||||
|
|
||||||
|
## Interview talking points
|
||||||
|
|
||||||
|
- Why timer units explicitly carry the non-interactive execution boundary.
|
||||||
|
- Why APT dependency policy is safer than regex-based kernel deletion.
|
||||||
|
- How Docker volume preservation separates platform hygiene from application
|
||||||
|
data lifecycle decisions.
|
||||||
|
- How optional dependency handling keeps one health command useful across
|
||||||
|
container, GPU, and virtualization host variants.
|
||||||
|
- Why configuration backup and application-data backup are separate concerns.
|
||||||
|
- How exit codes, persistent timers, logs, and post-checks support operations.
|
||||||
|
|
||||||
|
## Future improvements
|
||||||
|
|
||||||
|
- Add a dedicated logrotate policy after measuring log growth.
|
||||||
|
- Export disk-watch status to a monitoring system instead of relying only on
|
||||||
|
timer failure state.
|
||||||
|
- Add automated archive integrity checks and off-host replication.
|
||||||
|
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
|
||||||
|
commands.
|
||||||
|
- Add package-lock detection with bounded retry policy if recurring contention
|
||||||
|
is observed.
|
||||||
|
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
|
||||||
|
dedicated read-only audit.
|
||||||
Executable
+103
@@ -0,0 +1,103 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf"
|
||||||
|
DOCKER_CONFIG="/etc/docker/daemon.json"
|
||||||
|
packages=(
|
||||||
|
logrotate
|
||||||
|
needrestart
|
||||||
|
smartmontools
|
||||||
|
nvme-cli
|
||||||
|
sysstat
|
||||||
|
iotop
|
||||||
|
ncdu
|
||||||
|
duf
|
||||||
|
jq
|
||||||
|
lsof
|
||||||
|
psmisc
|
||||||
|
tar
|
||||||
|
gzip
|
||||||
|
)
|
||||||
|
timers=(
|
||||||
|
ailab-apt-cleanup.timer
|
||||||
|
ailab-kernel-cleanup.timer
|
||||||
|
ailab-docker-cleanup.timer
|
||||||
|
ailab-config-backup.timer
|
||||||
|
ailab-disk-watch.timer
|
||||||
|
)
|
||||||
|
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: install.sh must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
for command_name in apt-get install systemctl; do
|
||||||
|
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
printf 'Installing maintenance dependencies...\n'
|
||||||
|
apt-get update
|
||||||
|
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
|
||||||
|
|
||||||
|
printf 'Installing scripts and systemd units...\n'
|
||||||
|
for script in "$SCRIPT_DIR"/scripts/*.sh; do
|
||||||
|
install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")"
|
||||||
|
done
|
||||||
|
for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do
|
||||||
|
install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")"
|
||||||
|
done
|
||||||
|
|
||||||
|
install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")"
|
||||||
|
tmp_journald="$(mktemp)"
|
||||||
|
trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT
|
||||||
|
cat >"$tmp_journald" <<'EOF'
|
||||||
|
[Journal]
|
||||||
|
SystemMaxUse=1G
|
||||||
|
SystemKeepFree=2G
|
||||||
|
MaxRetentionSec=14day
|
||||||
|
Compress=yes
|
||||||
|
EOF
|
||||||
|
install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN"
|
||||||
|
systemctl restart systemd-journald
|
||||||
|
|
||||||
|
if command -v docker >/dev/null 2>&1; then
|
||||||
|
printf 'Configuring Docker log rotation limits...\n'
|
||||||
|
install -d -m 0755 /etc/docker
|
||||||
|
tmp_docker="$(mktemp)"
|
||||||
|
|
||||||
|
if [[ -f "$DOCKER_CONFIG" ]]; then
|
||||||
|
if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: %s is not valid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak"
|
||||||
|
install -m 0644 "$DOCKER_CONFIG" "$backup"
|
||||||
|
jq '. + {
|
||||||
|
"log-driver": "json-file",
|
||||||
|
"log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
|
||||||
|
}' "$DOCKER_CONFIG" >"$tmp_docker"
|
||||||
|
else
|
||||||
|
jq -n '{
|
||||||
|
"log-driver": "json-file",
|
||||||
|
"log-opts": {"max-size": "50m", "max-file": "5"}
|
||||||
|
}' >"$tmp_docker"
|
||||||
|
fi
|
||||||
|
|
||||||
|
jq empty "$tmp_docker"
|
||||||
|
install -m 0644 "$tmp_docker" "$DOCKER_CONFIG"
|
||||||
|
systemctl restart docker || true
|
||||||
|
else
|
||||||
|
printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
systemctl daemon-reload
|
||||||
|
systemctl enable --now "${timers[@]}"
|
||||||
|
|
||||||
|
printf '\nEnabled AI Lab timers:\n'
|
||||||
|
systemctl list-timers --all --no-pager | grep 'ailab-' || true
|
||||||
|
|
||||||
|
/usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt
|
||||||
|
printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n'
|
||||||
@@ -0,0 +1,66 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
LOG_FILE="/var/log/ailab-apt-cleanup.log"
|
||||||
|
execute=false
|
||||||
|
non_interactive=false
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--execute) execute=true ;;
|
||||||
|
--non-interactive) non_interactive=true ;;
|
||||||
|
-h|--help) usage; exit 0 ;;
|
||||||
|
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||||
|
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: this script must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if ! command -v apt >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: apt is required\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||||
|
printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)"
|
||||||
|
|
||||||
|
if [[ "$execute" != true ]]; then
|
||||||
|
printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n'
|
||||||
|
printf 'INFO: simulated autoremove follows\n'
|
||||||
|
LC_ALL=C apt -s autoremove --purge
|
||||||
|
printf 'INFO: rerun with --execute and confirm to apply changes\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$non_interactive" != true ]]; then
|
||||||
|
printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n'
|
||||||
|
printf 'Type EXECUTE to continue: '
|
||||||
|
read -r confirmation
|
||||||
|
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||||
|
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
apt update
|
||||||
|
apt autoremove --purge -y
|
||||||
|
apt autoclean -y
|
||||||
|
if command -v needrestart >/dev/null 2>&1; then
|
||||||
|
needrestart -b || true
|
||||||
|
else
|
||||||
|
printf 'WARNING: needrestart is not installed\n'
|
||||||
|
fi
|
||||||
|
printf 'OK: APT cleanup completed\n'
|
||||||
@@ -0,0 +1,90 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
LOG_FILE="/var/log/ailab-config-backup.log"
|
||||||
|
BACKUP_DIR="/srv/backups/ailab-config"
|
||||||
|
RETENTION_DAYS=30
|
||||||
|
execute=false
|
||||||
|
non_interactive=false
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--execute) execute=true ;;
|
||||||
|
--non-interactive) non_interactive=true ;;
|
||||||
|
-h|--help) usage; exit 0 ;;
|
||||||
|
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||||
|
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: this script must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
for command_name in tar gzip find; do
|
||||||
|
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||||
|
timestamp="$(date '+%Y%m%d-%H%M%S')"
|
||||||
|
archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz"
|
||||||
|
candidate_paths=(
|
||||||
|
/etc
|
||||||
|
/root/.bashrc
|
||||||
|
/root/.bashrc.d
|
||||||
|
/opt/ailab-maintenance
|
||||||
|
/var/lib/libvirt/qemu
|
||||||
|
)
|
||||||
|
source_paths=()
|
||||||
|
|
||||||
|
printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)"
|
||||||
|
for path in "${candidate_paths[@]}"; do
|
||||||
|
if [[ -e "$path" ]]; then
|
||||||
|
source_paths+=("${path#/}")
|
||||||
|
printf 'OK: include %s\n' "$path"
|
||||||
|
else
|
||||||
|
printf 'INFO: optional path is absent: %s\n' "$path"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
if ((${#source_paths[@]} == 0)); then
|
||||||
|
printf 'CRITICAL: no backup source paths are present\n'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf 'Backup destination: %s\n' "$archive"
|
||||||
|
printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS"
|
||||||
|
printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n'
|
||||||
|
printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n'
|
||||||
|
|
||||||
|
if [[ "$execute" != true ]]; then
|
||||||
|
printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$non_interactive" != true ]]; then
|
||||||
|
printf 'Type EXECUTE to create the archive and apply retention: '
|
||||||
|
read -r confirmation
|
||||||
|
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||||
|
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
install -d -m 0750 "$BACKUP_DIR"
|
||||||
|
tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- "${source_paths[@]}"
|
||||||
|
find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete
|
||||||
|
printf 'OK: configuration backup created: %s\n' "$archive"
|
||||||
@@ -0,0 +1,38 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
LOG_FILE="/var/log/ailab-disk-watch.log"
|
||||||
|
threshold="${AILAB_DISK_THRESHOLD:-85}"
|
||||||
|
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((threshold < 1 || threshold > 100)); then
|
||||||
|
printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||||
|
printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold"
|
||||||
|
|
||||||
|
status=0
|
||||||
|
while read -r filesystem _blocks _used available use_percent mountpoint; do
|
||||||
|
usage="${use_percent%\%}"
|
||||||
|
|
||||||
|
if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
|
||||||
|
printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint"
|
||||||
|
status=1
|
||||||
|
elif ((usage >= threshold)); then
|
||||||
|
printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \
|
||||||
|
"$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available"
|
||||||
|
status=1
|
||||||
|
else
|
||||||
|
printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
|
||||||
|
fi
|
||||||
|
done < <(df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {print $1, $2, $3, $4, $5, $6}')
|
||||||
|
|
||||||
|
exit "$status"
|
||||||
@@ -0,0 +1,70 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
LOG_FILE="/var/log/ailab-docker-cleanup.log"
|
||||||
|
execute=false
|
||||||
|
non_interactive=false
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--execute) execute=true ;;
|
||||||
|
--non-interactive) non_interactive=true ;;
|
||||||
|
-h|--help) usage; exit 0 ;;
|
||||||
|
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||||
|
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: this script must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||||
|
printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)"
|
||||||
|
|
||||||
|
if ! command -v docker >/dev/null 2>&1; then
|
||||||
|
printf 'INFO: Docker is not installed; nothing to do\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
if command -v systemctl >/dev/null 2>&1 && ! systemctl is-active --quiet docker; then
|
||||||
|
printf 'INFO: docker.service is inactive; nothing to do\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '\nDocker disk usage before cleanup:\n'
|
||||||
|
docker system df
|
||||||
|
|
||||||
|
if [[ "$execute" != true ]]; then
|
||||||
|
printf 'INFO: dry-run mode; would run docker system prune -af\n'
|
||||||
|
printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n'
|
||||||
|
printf 'INFO: Docker volumes are never included in this cleanup\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$non_interactive" != true ]]; then
|
||||||
|
printf 'WARNING: this removes stopped containers, unused networks, unused images, and build cache, but not volumes.\n'
|
||||||
|
printf 'Type EXECUTE to continue: '
|
||||||
|
read -r confirmation
|
||||||
|
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||||
|
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
docker system prune -af
|
||||||
|
docker builder prune -af --filter "until=168h"
|
||||||
|
|
||||||
|
printf '\nDocker disk usage after cleanup:\n'
|
||||||
|
docker system df
|
||||||
|
printf 'OK: Docker cleanup completed; volumes were not pruned\n'
|
||||||
@@ -0,0 +1,111 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
section() {
|
||||||
|
printf '\n== %s ==\n' "$1"
|
||||||
|
}
|
||||||
|
|
||||||
|
run_optional() {
|
||||||
|
local description="$1"
|
||||||
|
shift
|
||||||
|
|
||||||
|
if "$@"; then
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf 'WARNING: %s failed\n' "$description"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
section "Host identity"
|
||||||
|
if command -v hostnamectl >/dev/null 2>&1; then
|
||||||
|
run_optional "hostnamectl" hostnamectl
|
||||||
|
else
|
||||||
|
run_optional "hostname" hostname
|
||||||
|
fi
|
||||||
|
run_optional "kernel information" uname -a
|
||||||
|
run_optional "uptime" uptime
|
||||||
|
|
||||||
|
section "Memory"
|
||||||
|
if command -v free >/dev/null 2>&1; then
|
||||||
|
run_optional "memory report" free -h
|
||||||
|
else
|
||||||
|
printf 'WARNING: free is not available\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Filesystems"
|
||||||
|
if command -v df >/dev/null 2>&1; then
|
||||||
|
run_optional "filesystem report" df -hT
|
||||||
|
printf '\nKey mountpoints present:\n'
|
||||||
|
for mountpoint in / /boot /var /srv /opt /home; do
|
||||||
|
if findmnt -rn --mountpoint "$mountpoint" >/dev/null 2>&1; then
|
||||||
|
run_optional "filesystem report for $mountpoint" df -hT "$mountpoint"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
else
|
||||||
|
printf 'WARNING: df is not available\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Journal usage"
|
||||||
|
if command -v journalctl >/dev/null 2>&1; then
|
||||||
|
run_optional "journal disk usage" journalctl --disk-usage
|
||||||
|
else
|
||||||
|
printf 'WARNING: journalctl is not available\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Docker"
|
||||||
|
if command -v docker >/dev/null 2>&1; then
|
||||||
|
if command -v systemctl >/dev/null 2>&1; then
|
||||||
|
run_optional "Docker service state" systemctl is-active docker
|
||||||
|
fi
|
||||||
|
run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
|
||||||
|
run_optional "Docker disk usage" docker system df
|
||||||
|
else
|
||||||
|
printf 'INFO: Docker is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Libvirt"
|
||||||
|
if command -v virsh >/dev/null 2>&1; then
|
||||||
|
if command -v systemctl >/dev/null 2>&1; then
|
||||||
|
run_optional "libvirtd service state" systemctl is-active libvirtd
|
||||||
|
fi
|
||||||
|
run_optional "libvirt guest list" virsh list --all
|
||||||
|
else
|
||||||
|
printf 'INFO: virsh is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "NVIDIA"
|
||||||
|
if command -v nvidia-smi >/dev/null 2>&1; then
|
||||||
|
run_optional "NVIDIA status" nvidia-smi
|
||||||
|
else
|
||||||
|
printf 'INFO: nvidia-smi is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Failed systemd units"
|
||||||
|
if command -v systemctl >/dev/null 2>&1; then
|
||||||
|
run_optional "failed systemd unit report" systemctl --failed --no-pager
|
||||||
|
else
|
||||||
|
printf 'WARNING: systemctl is not available\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "SMART quick health"
|
||||||
|
if command -v smartctl >/dev/null 2>&1; then
|
||||||
|
shopt -s nullglob
|
||||||
|
devices=(/dev/sd? /dev/nvme?n?)
|
||||||
|
shopt -u nullglob
|
||||||
|
|
||||||
|
if ((${#devices[@]} == 0)); then
|
||||||
|
printf 'INFO: no matching SATA/SCSI or NVMe devices found\n'
|
||||||
|
else
|
||||||
|
for device in "${devices[@]}"; do
|
||||||
|
printf '\n-- %s --\n' "$device"
|
||||||
|
run_optional "SMART health check for $device" smartctl -H "$device"
|
||||||
|
done
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
printf 'INFO: smartctl is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
exit 0
|
||||||
@@ -0,0 +1,117 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
# APT autoremove respects package dependencies and kernel protection rules. That
|
||||||
|
# is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO.
|
||||||
|
|
||||||
|
LOG_FILE="/var/log/ailab-kernel-cleanup.log"
|
||||||
|
execute=false
|
||||||
|
non_interactive=false
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||||
|
}
|
||||||
|
|
||||||
|
kernel_packages() {
|
||||||
|
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \
|
||||||
|
'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \
|
||||||
|
| awk '$1 ~ /^ii/ {print $2}' \
|
||||||
|
| sort -u || true
|
||||||
|
}
|
||||||
|
|
||||||
|
versioned_kernel_images() {
|
||||||
|
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \
|
||||||
|
| awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \
|
||||||
|
| sort -u || true
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--execute) execute=true ;;
|
||||||
|
--non-interactive) non_interactive=true ;;
|
||||||
|
-h|--help) usage; exit 0 ;;
|
||||||
|
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||||
|
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: this script must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
for command_name in apt dpkg-query uname; do
|
||||||
|
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||||
|
printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)"
|
||||||
|
printf 'Running kernel: %s\n' "$(uname -r)"
|
||||||
|
printf '\nInstalled kernel-related packages before cleanup:\n'
|
||||||
|
kernel_packages
|
||||||
|
|
||||||
|
simulation="$(LC_ALL=C apt-get -s autoremove --purge)"
|
||||||
|
printf '\nAPT autoremove simulation:\n%s\n' "$simulation"
|
||||||
|
|
||||||
|
mapfile -t installed_images < <(versioned_kernel_images)
|
||||||
|
mapfile -t removed_images < <(
|
||||||
|
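# With --purge, the APT simulation reports removals as "Purg" rather than "Remv".
awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u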
|
||||||
|
)
|
||||||
|
|
||||||
|
remaining_images=0
|
||||||
|
for image in "${installed_images[@]}"; do
|
||||||
|
remove_image=false
|
||||||
|
for removed in "${removed_images[@]}"; do
|
||||||
|
if [[ "$image" == "$removed" ]]; then
|
||||||
|
remove_image=true
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
if [[ "$remove_image" != true ]]; then
|
||||||
|
remaining_images=$((remaining_images + 1))
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \
|
||||||
|
"${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images"
|
||||||
|
|
||||||
|
if ((${#installed_images[@]} < 2 || remaining_images < 2)); then
|
||||||
|
printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$execute" != true ]]; then
|
||||||
|
printf 'INFO: dry-run mode; no packages were removed\n'
|
||||||
|
printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$non_interactive" != true ]]; then
|
||||||
|
printf 'WARNING: APT will remove the packages shown in the simulation above.\n'
|
||||||
|
printf 'Type EXECUTE to continue: '
|
||||||
|
read -r confirmation
|
||||||
|
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||||
|
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
apt-get autoremove --purge -y
apt-get autoclean -y
|
||||||
|
if command -v update-grub >/dev/null 2>&1; then
|
||||||
|
update-grub || true
|
||||||
|
else
|
||||||
|
printf 'WARNING: update-grub is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '\nInstalled kernel-related packages after cleanup:\n'
|
||||||
|
kernel_packages
|
||||||
|
printf 'OK: kernel cleanup completed with APT-managed package selection\n'
|
||||||
@@ -0,0 +1,42 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
section() {
|
||||||
|
printf '\n== %s ==\n' "$1"
|
||||||
|
}
|
||||||
|
|
||||||
|
if ! command -v virsh >/dev/null 2>&1; then
|
||||||
|
printf 'INFO: virsh is not installed; VM audit skipped\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Virtual machines"
|
||||||
|
virsh list --all || printf 'WARNING: unable to list virtual machines\n'
|
||||||
|
|
||||||
|
section "Storage pools"
|
||||||
|
virsh pool-list --all || printf 'WARNING: unable to list storage pools\n'
|
||||||
|
|
||||||
|
mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true)
|
||||||
|
for pool in "${pools[@]}"; do
|
||||||
|
section "Volumes in pool: $pool"
|
||||||
|
virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool"
|
||||||
|
done
|
||||||
|
|
||||||
|
section "Possible VM disk and installation images"
|
||||||
|
search_roots=()
|
||||||
|
for path in /var/lib/libvirt /srv /opt; do
|
||||||
|
[[ -d "$path" ]] && search_roots+=("$path")
|
||||||
|
done
|
||||||
|
|
||||||
|
if ((${#search_roots[@]} == 0)); then
|
||||||
|
printf 'INFO: no configured search roots are present\n'
|
||||||
|
else
|
||||||
|
find "${search_roots[@]}" -xdev -type f \
|
||||||
|
\( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \
|
||||||
|
-printf '%12s bytes %p\n' 2>/dev/null \
|
||||||
|
| sort -nr || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '\nINFO: audit complete; no files or libvirt resources were modified\n'
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab safe APT cleanup
|
||||||
|
After=network-online.target
|
||||||
|
Wants=network-online.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab APT cleanup weekly
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=Sun *-*-* 04:00:00
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,6 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab configuration backup
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab configuration backup daily
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=*-*-* 03:30:00
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,6 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab disk usage check
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-disk-watch.sh
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab disk usage check hourly
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=hourly
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab safe Docker cleanup
|
||||||
|
Requires=docker.service
|
||||||
|
After=docker.service
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab Docker cleanup weekly
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=Sun *-*-* 04:40:00
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab safe kernel cleanup
|
||||||
|
After=network-online.target ailab-apt-cleanup.service
|
||||||
|
Wants=network-online.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab kernel cleanup weekly
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=Sun *-*-* 04:20:00
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,276 @@
|
|||||||
|
# Linux Fresh Setup Toolkit
|
||||||
|
|
||||||
|
## Executive summary
|
||||||
|
|
||||||
|
The Linux Fresh Setup Toolkit is day-0 bootstrap automation for a clean Ubuntu
|
||||||
|
lab server or workstation. It prepares a host for routine administration,
|
||||||
|
Cockpit, Docker workloads, libvirt/KVM virtual machines, optional NVIDIA
|
||||||
|
diagnostics, bounded logging, practical kernel tuning, and a conservative
|
||||||
|
security baseline.
|
||||||
|
|
||||||
|
The scripts are modular and safe to rerun. Optional components remain optional,
|
||||||
|
UFW is not enabled without a specific flag, and an NVIDIA driver is never
|
||||||
|
installed without an explicit version. This is a portfolio and homelab
|
||||||
|
implementation, not a production-certified build standard.
|
||||||
|
|
||||||
|
## Scope and non-goals
|
||||||
|
|
||||||
|
The toolkit supports Ubuntu 24.04 and newer and assumes a systemd-based host
|
||||||
|
with APT package management. It is suitable for a host such as `ailab` that may
|
||||||
|
run WebODM, Open WebUI, Homepage, NVIDIA workloads, or test virtual machines.
|
||||||
|
|
||||||
|
It does not:
|
||||||
|
|
||||||
|
- Deploy applications, containers, or virtual machines.
|
||||||
|
- Configure GPU passthrough, VFIO bindings, bridges, or Windows guests.
|
||||||
|
- Select an NVIDIA driver automatically.
|
||||||
|
- Define a complete firewall policy or compliance baseline.
|
||||||
|
- Replace backup, monitoring, patching, or ongoing maintenance processes.
|
||||||
|
- Claim live validation against every future Ubuntu release.
|
||||||
|
|
||||||
|
## Why this is separate from ailab-maintenance
|
||||||
|
|
||||||
|
This project establishes a fresh host. The sibling
|
||||||
|
[AI Lab Maintenance Toolkit](../ailab-maintenance/) handles day-2 health
|
||||||
|
checks, scheduled cleanup, configuration backup, disk monitoring, and VM
|
||||||
|
inventory after a host is operating.
|
||||||
|
|
||||||
|
Keeping bootstrap and maintenance separate makes the change boundary clear:
|
||||||
|
this toolkit installs platform capabilities and baseline configuration, while
|
||||||
|
the maintenance toolkit manages recurring operational tasks.
|
||||||
|
|
||||||
|
## Directory layout
|
||||||
|
|
||||||
|
```text
setup/
├── README.md
├── install.sh
├── scripts/
│   ├── 00-preflight.sh
│   ├── 00-platform-guard.inc
│   ├── 01-base-packages.sh
│   ├── 02-shell-profile.sh
│   ├── 03-cockpit.sh
│   ├── 04-docker.sh
│   ├── 05-libvirt.sh
│   ├── 06-nvidia-tools.sh
│   ├── 07-tuning.sh
│   ├── 08-security-baseline.sh
│   └── 99-postcheck.sh
├── files/
│   ├── bashrc.d/ailab.sh
│   ├── docker/daemon.json
│   ├── sysctl/99-ailab.conf
│   └── systemd/journald-ailab-limits.conf
└── docs/
    ├── fresh-install-checklist.md
    ├── cockpit.md
    ├── docker.md
    ├── libvirt.md
    ├── nvidia.md
    └── bash-shell.md
```
|
||||||
|
|
||||||
|
`00-platform-guard.inc` is an internal sourced helper used by mutating
|
||||||
|
component scripts; it is not an executable profile.
|
||||||
|
|
||||||
|
## Supported profiles and flags
|
||||||
|
|
||||||
|
| Flag | Result |
| --- | --- |
| `--base` | Install operational CLI, diagnostic, storage, and network packages |
| `--shell` | Install the root AI lab Bash profile |
| `--cockpit` | Install and enable Cockpit |
| `--docker` | Install Docker and bounded JSON-file logging |
| `--libvirt` | Install and enable libvirt/KVM |
| `--nvidia-tools` | Install NVIDIA and OpenCL diagnostics without a driver |
| `--install-nvidia-driver VERSION` | Install diagnostics and the named Ubuntu driver package |
| `--tuning` | Apply journald, sysctl, sensor, and sysstat settings |
| `--security` | Install and enable fail2ban; install but do not enable UFW |
| `--enable-ufw` | Run security setup and explicitly enable UFW |
| `--all` | Run every standard profile without UFW enablement or driver installation |
|
||||||
|
|
||||||
|
`--install-nvidia-driver` implies `--nvidia-tools`. `--enable-ufw` implies
|
||||||
|
`--security`. With no flags, the installer prints help and makes no changes.
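
The implication rules above can be sketched as a small argument parser. This is an illustration only, not the toolkit's `install.sh`: the function name `resolve_profiles` and its one-item-per-line output are assumptions made for the example.

```bash
# Hypothetical sketch of the documented flag rules; resolve_profiles is not
# part of the toolkit, and its output format is illustrative only.
resolve_profiles() {
    local selected="" ufw="no" driver="" name
    local all="base shell cockpit docker libvirt nvidia-tools tuning security"

    # With no flags, the documented behavior is to print help and change nothing.
    if [ "$#" -eq 0 ]; then
        printf 'help\n'
        return 0
    fi

    while [ "$#" -gt 0 ]; do
        case "$1" in
            --base|--shell|--cockpit|--docker|--libvirt|--nvidia-tools|--tuning|--security)
                selected="$selected ${1#--}" ;;
            --all)
                # --all never enables UFW or installs a driver by itself.
                selected="$selected $all" ;;
            --enable-ufw)
                # Implies --security.
                ufw="yes"
                selected="$selected security" ;;
            --install-nvidia-driver)
                # Implies --nvidia-tools and needs a numeric version argument.
                case "${2:-}" in
                    ''|*[!0-9]*)
                        printf 'error: numeric driver version required\n' >&2
                        return 2 ;;
                esac
                driver="$2"
                selected="$selected nvidia-tools"
                shift ;;
            *)
                printf 'error: unknown flag: %s\n' "$1" >&2
                return 2 ;;
        esac
        shift
    done

    # Print each selected profile once, in a fixed order.
    for name in $all; do
        case " $selected " in
            *" $name "*) printf '%s\n' "$name" ;;
        esac
    done
    if [ "$ufw" = "yes" ]; then printf 'ufw=enabled\n'; fi
    if [ -n "$driver" ]; then printf 'driver=%s\n' "$driver"; fi
    return 0
}
```

Calling `resolve_profiles --all --enable-ufw` lists all eight standard profiles followed by `ufw=enabled`, which mirrors how the high-impact options stay opt-in even when `--all` is present.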
|
||||||
|
|
||||||
|
## Installation examples
|
||||||
|
|
||||||
|
Review the scripts and current host access path before execution:
|
||||||
|
|
||||||
|
```bash
cd labs/linux/setup
./install.sh
sudo ./install.sh --base --shell
sudo ./install.sh --cockpit --docker --libvirt
sudo ./install.sh --all
```
|
||||||
|
|
||||||
|
Explicit high-impact options can be combined with `--all`:
|
||||||
|
|
||||||
|
```bash
sudo ./install.sh --all --enable-ufw
sudo ./install.sh --all --install-nvidia-driver 550
```
|
||||||
|
|
||||||
|
The installer runs the read-only preflight once, before any selected profile,
and runs the postcheck after all profile steps complete successfully.
|
||||||
|
|
||||||
|
## Fresh host workflow
|
||||||
|
|
||||||
|
1. Patch the base Ubuntu installation and confirm console or out-of-band access.
2. Review [the fresh install checklist](docs/fresh-install-checklist.md).
3. Run `sudo ./install.sh --base --shell`.
4. Add only the platform profiles needed by the host.
5. Review service state, listening ports, storage, networking, and warnings in
   the postcheck.
6. Reboot if a driver or kernel-related package requires it.
7. Capture host-specific configuration and backup requirements separately.
|
||||||
|
|
||||||
|
## AI lab workflow
|
||||||
|
|
||||||
|
A general AI lab host can start with:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo ./install.sh --base --shell --cockpit --docker --nvidia-tools --tuning --security
|
||||||
|
```
|
||||||
|
|
||||||
|
This installs GPU diagnostics but leaves driver choice to the operator. Add
|
||||||
|
libvirt only when the host will run VMs. Enable UFW only after confirming SSH,
|
||||||
|
Cockpit, application, bridge, and VM networking requirements.
|
||||||
|
|
||||||
|
## Safety model
|
||||||
|
|
||||||
|
- Mutating profiles require root and refuse non-Ubuntu systems or Ubuntu older
|
||||||
|
than 24.04.
|
||||||
|
- Component profiles install their own direct prerequisites.
|
||||||
|
- Existing managed configuration is changed only when content differs.
|
||||||
|
- Changed root shell, Docker, journald, and sysctl files receive timestamped
|
||||||
|
backups.
|
||||||
|
- Existing valid Docker JSON is merged so unrelated settings survive.
|
||||||
|
- Invalid Docker JSON stops configuration rather than being overwritten.
|
||||||
|
- UFW and NVIDIA driver installation require explicit flags.
|
||||||
|
- Package and service failures are not hidden.
|
||||||
|
- Postcheck warnings report optional or inactive components without masking a
|
||||||
|
successfully completed diagnostic script.
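
The "changed only when content differs" and "timestamped backups" rules can be sketched as follows. This is a minimal illustration under assumptions: `install_if_changed`, the `.bak` naming pattern, and the `0644` mode are hypothetical and may not match the toolkit's own helpers.

```bash
# Hypothetical helper: install_if_changed is illustrative, not the toolkit's code.
# Copies SOURCE to TARGET only when content differs, keeping a timestamped backup.
install_if_changed() {
    local source_file="$1" target_file="$2" backup_file

    # Identical content: report it and leave the target untouched.
    if [ -f "$target_file" ] && cmp -s "$source_file" "$target_file"; then
        printf 'unchanged\n'
        return 0
    fi

    # Existing but different: keep a timestamped copy before replacing it.
    # The naming pattern here is an assumption for the example.
    if [ -f "$target_file" ]; then
        backup_file="$target_file.$(date '+%Y%m%d-%H%M%S').bak"
        cp -p "$target_file" "$backup_file"
    fi

    install -m 0644 "$source_file" "$target_file"
    printf 'changed\n'
}
```

A caller could restart the owning service only when the function prints `changed`, which is one way to avoid backup and restart churn on reruns.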
|
||||||
|
|
||||||
|
APT installation and service restarts are real system changes. Test first on a
|
||||||
|
disposable host and maintain a console path when changing remote access policy.
|
||||||
|
|
||||||
|
## Bash shell profile
|
||||||
|
|
||||||
|
The shell profile is installed as `/root/.bashrc.d/ailab.sh`, and one exact
|
||||||
|
source line is maintained in `/root/.bashrc`. It adds concise helpers for
|
||||||
|
systemd, journals, Docker, libvirt, NVIDIA, ports, archives, and disk usage.
|
||||||
|
|
||||||
|
See [Bash shell profile](docs/bash-shell.md) for command details and cautions.
|
||||||
|
|
||||||
|
## Cockpit setup
|
||||||
|
|
||||||
|
Cockpit provides browser-based host, storage, network, package, VM, metrics,
|
||||||
|
and support-report views. The installer enables `cockpit.socket` and reports
|
||||||
|
`https://HOSTNAME:9090`. `cockpit-files` is optional because it is not
|
||||||
|
available in every enabled Ubuntu repository.
|
||||||
|
|
||||||
|
See [Cockpit setup](docs/cockpit.md).
|
||||||
|
|
||||||
|
## Docker setup
|
||||||
|
|
||||||
|
The Ubuntu `docker.io` package path is preferred. The Docker official
|
||||||
|
repository is configured only when `docker.io` is unavailable. The daemon uses
|
||||||
|
the `json-file` log driver with five 50 MB files per container.
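
As a rough illustration of the limits described above, the fragment below prints the daemon settings that correspond to five 50 MB `json-file` logs per container. The function name `print_docker_log_config` is hypothetical, and the file shipped in `files/docker/daemon.json` may contain additional keys.

```bash
# Sketch only: prints a daemon.json fragment matching the documented limits.
# print_docker_log_config is an illustrative name, not a toolkit function.
print_docker_log_config() {
    cat <<'EOF'
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
EOF
}
```

After a daemon restart, `docker info --format '{{.LoggingDriver}}'` shows the active driver. Changed log options apply to newly created containers; existing containers keep the settings they were created with.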
|
||||||
|
|
||||||
|
The toolkit configures log retention only. It does not prune data, deploy
|
||||||
|
Compose applications, or configure an NVIDIA container runtime.
|
||||||
|
|
||||||
|
See [Docker setup](docs/docker.md).
|
||||||
|
|
||||||
|
## libvirt/KVM setup
|
||||||
|
|
||||||
|
The libvirt profile installs QEMU, OVMF, software TPM support, virt-install,
|
||||||
|
virt-manager, bridge utilities, and libvirt clients and services. It enables
|
||||||
|
`libvirtd` and prints existing guests and networks.
|
||||||
|
|
||||||
|
See [libvirt/KVM setup](docs/libvirt.md).
|
||||||
|
|
||||||
|
## NVIDIA tooling
|
||||||
|
|
||||||
|
The default NVIDIA profile installs `nvtop`, `clinfo`, and PCI diagnostics.
|
||||||
|
It reports detected NVIDIA devices, `nvidia-smi`, and DKMS state when those
|
||||||
|
commands exist.
|
||||||
|
|
||||||
|
Driver installation requires a numeric version that maps to an available
|
||||||
|
Ubuntu package, for example `nvidia-driver-550`. Secure Boot enrollment,
|
||||||
|
driver suitability, CUDA, container runtime support, and passthrough remain
|
||||||
|
operator decisions.
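
The numeric-version requirement can be sketched as a small validation step. `nvidia_driver_package` is a hypothetical name used only for this example; the toolkit's own check may differ.

```bash
# Hypothetical helper: nvidia_driver_package is illustrative only.
# Maps a numeric version to the Ubuntu package name and rejects anything else.
nvidia_driver_package() {
    case "${1:-}" in
        ''|*[!0-9]*)
            printf 'error: driver version must be numeric, for example 550\n' >&2
            return 2 ;;
    esac
    printf 'nvidia-driver-%s\n' "$1"
}
```

Running `apt-cache policy "$(nvidia_driver_package 550)"` before installation is one way to confirm that the enabled repositories actually offer the package.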
|
||||||
|
|
||||||
|
See [NVIDIA tooling](docs/nvidia.md).
|
||||||
|
|
||||||
|
## Tuning
|
||||||
|
|
||||||
|
The tuning profile bounds persistent journal use, raises inotify limits for
|
||||||
|
development and container workloads, reduces swappiness, enables sysstat, and
|
||||||
|
runs automatic sensor detection when available.
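
A read-only way to inspect the kernel values this profile touches is sketched below. The helper name `report_tuning` and the exact list of keys are assumptions; the toolkit's `99-ailab.conf` may set a different set of sysctls.

```bash
# Read-only sketch: report_tuning is a hypothetical helper that only reads values.
# The optional argument replaces /proc/sys, which is useful for testing.
report_tuning() {
    local proc_root="${1:-/proc/sys}" key file name
    for key in vm/swappiness fs/inotify/max_user_watches fs/inotify/max_user_instances; do
        file="$proc_root/$key"
        # Convert the path form (vm/swappiness) to the sysctl form (vm.swappiness).
        name="$(printf '%s' "$key" | tr '/' '.')"
        if [ -r "$file" ]; then
            printf '%s = %s\n' "$name" "$(cat "$file")"
        else
            printf '%s is not readable\n' "$name"
        fi
    done
}
```

Comparing this output before and after `--tuning` shows which values changed on a given host.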
|
||||||
|
|
||||||
|
Review these values against available memory, storage, monitoring retention,
|
||||||
|
and workload behavior before deployment beyond a lab.
|
||||||
|
|
||||||
|
## Security baseline
|
||||||
|
|
||||||
|
The security profile installs UFW and fail2ban and enables fail2ban. It leaves
|
||||||
|
UFW disabled unless `--enable-ufw` is present. Explicit UFW enablement permits
|
||||||
|
OpenSSH and TCP port 9090 before activation.
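
The allow-before-enable ordering can be sketched as below. `enable_ufw_safely` and the `RUN` variable are hypothetical; by default the function only prints the commands so the sequence can be reviewed without changing the firewall.

```bash
# Sketch under assumptions: enable_ufw_safely is a hypothetical name, and RUN
# exists only so the sequence can be previewed without making changes.
enable_ufw_safely() {
    local run="${RUN:-echo}"

    # Allow the remote access paths first so that enabling the firewall cannot
    # cut off the current SSH session or the Cockpit console.
    $run ufw allow OpenSSH
    $run ufw allow 9090/tcp
    $run ufw --force enable
}
```

With the default `RUN=echo` the three commands are printed in order; setting `RUN=sudo` would apply them, which should only be done with a console path available.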
|
||||||
|
|
||||||
|
This is a minimal access-preservation baseline, not a complete host firewall or
|
||||||
|
hardening standard. Application and VM networking may require additional
|
||||||
|
reviewed rules.
|
||||||
|
|
||||||
|
## Postcheck
|
||||||
|
|
||||||
|
The final script reports:
|
||||||
|
|
||||||
|
- Failed systemd units.
|
||||||
|
- Cockpit, Docker, libvirt, and fail2ban status when installed.
|
||||||
|
- Running Docker containers and defined virtual machines.
|
||||||
|
- NVIDIA runtime state.
|
||||||
|
- Filesystem usage and listening ports.
|
||||||
|
|
||||||
|
Warnings require operator review, but the absence of an optional component does
not cause the postcheck itself to fail.
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
Run individual read-only checks after correcting a failed profile:
|
||||||
|
|
||||||
|
```bash
sudo ./scripts/00-preflight.sh
sudo ./scripts/99-postcheck.sh
systemctl --failed
journalctl -u docker -u libvirtd -u cockpit.socket -u fail2ban
```
|
||||||
|
|
||||||
|
Common failure areas are unavailable APT repositories, unsupported package
|
||||||
|
names on a future Ubuntu release, invalid pre-existing Docker JSON, Secure Boot
|
||||||
|
module signing, disabled CPU virtualization, and remote firewall assumptions.
|
||||||
|
|
||||||
|
To roll back a managed configuration, compare the current file with its
|
||||||
|
timestamped `.bak` copy, restore the reviewed backup, and restart or reload the
|
||||||
|
owning service. Package removal is intentionally not automated because it may
|
||||||
|
affect workloads and dependencies.
|
||||||
|
|
||||||
|
## Interview talking points
|
||||||
|
|
||||||
|
- Why day-0 bootstrap and day-2 maintenance have separate ownership.
|
||||||
|
- How explicit flags protect firewall and GPU driver decisions.
|
||||||
|
- Why Docker JSON is validated, backed up, and merged.
|
||||||
|
- How idempotent content checks prevent backup and restart churn.
|
||||||
|
- Why preflight and postcheck evidence surround mutating profiles.
|
||||||
|
- Which virtualization, Secure Boot, IOMMU, and GPU decisions remain manual.
|
||||||
|
|
||||||
|
## Future improvements
|
||||||
|
|
||||||
|
- Add automated tests using disposable Ubuntu VMs.
|
||||||
|
- Add a documented NVIDIA Container Toolkit profile.
|
||||||
|
- Add optional non-root administrative user and group membership management.
|
||||||
|
- Add bridge and VFIO planning checks without applying passthrough changes.
|
||||||
|
- Add package compatibility matrices after validating future Ubuntu releases.
|
||||||
|
- Export postcheck results in a structured format for evidence collection.
|
||||||
@@ -0,0 +1,53 @@
|
|||||||
|
# Bash Shell Profile
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
The shell profile is installed for root:
|
||||||
|
|
||||||
|
```text
|
||||||
|
/root/.bashrc.d/ailab.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
The installer maintains one exact source line in `/root/.bashrc` and backs up
|
||||||
|
changed files. Start a new Bash session or run:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
source /root/.bashrc
|
||||||
|
```
|
||||||
|
|
||||||
|
## Aliases
|
||||||
|
|
||||||
|
| Alias | Purpose |
| --- | --- |
| `ll`, `la` | Detailed and hidden-file directory listings |
| `ports` | Listening TCP/UDP sockets and processes |
| `dus`, `dufh` | Directory and filesystem usage |
| `failed`, `jerr`, `timers` | systemd failure, journal error, and timer views |
| `dps`, `ddf`, `dcu` | Docker containers, disk use, and Compose startup |
| `vms` | All libvirt guests |
| `gpu`, `gpuloop` | NVIDIA status once or refreshed every two seconds |
| `now` | Current timestamp and timezone |
|
||||||
|
|
||||||
|
`dcu` runs `docker compose up -d` in the current directory and therefore may
|
||||||
|
create or start resources. Review the Compose project before using it.
|
||||||
|
|
||||||
|
## Functions
|
||||||
|
|
||||||
|
- `svc_status SERVICE`
|
||||||
|
- `svc_logs SERVICE [LINES]`
|
||||||
|
- `docker_logs CONTAINER [LINES]`
|
||||||
|
- `docker_restart CONTAINER`
|
||||||
|
- `vm_autostart VM`
|
||||||
|
- `vm_no_autostart VM`
|
||||||
|
- `path_backup PATH`
|
||||||
|
- `extract ARCHIVE`
|
||||||
|
|
||||||
|
Functions validate argument counts, and Docker, libvirt, and NVIDIA helpers
|
||||||
|
report missing commands clearly. `path_backup` creates a timestamped adjacent
|
||||||
|
copy and can consume substantial space for large paths.
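
The naming of the adjacent copy can be sketched as follows. `backup_destination` is a hypothetical helper introduced only for this example; the optional second argument exists purely to make the output repeatable and is not something `path_backup` accepts.

```bash
# Sketch only: backup_destination is a hypothetical helper that prints the kind of
# name path_backup builds; STAMP is injectable here only for repeatable output.
backup_destination() {
    local path="$1"
    local stamp="${2:-$(date '+%Y%m%d-%H%M%S')}"

    # ${path%/} drops one trailing slash, so /srv/models/ becomes
    # /srv/models.<stamp>.bak instead of a hidden file inside the directory.
    printf '%s.%s.bak\n' "${path%/}" "$stamp"
}

backup_destination /srv/models/ 20250101-120000
# prints /srv/models.20250101-120000.bak
```

Because the copy lands next to the original, comparing `du -sh PATH` with `df -h` for the parent filesystem beforehand gives a rough idea of whether it will fit.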

## Rollback

Review timestamped backups under `/root`, restore the desired `.bashrc` or
profile copy, and start a new shell. Avoid restoring a backup without checking
for unrelated shell changes made after bootstrap.
@@ -0,0 +1,41 @@
# Cockpit

## Purpose

The Cockpit profile installs browser-based host administration modules for
system state, storage, networking, packages, virtual machines, metrics, and
support reports. It enables the socket-activated service.

## Installation and validation

```bash
sudo ./install.sh --cockpit
systemctl status cockpit.socket
ss -ltnp | grep ':9090'
```

Connect to `https://HOSTNAME:9090`. A browser warning is expected when the
default host certificate is not trusted.

`cockpit-files` is installed when available and skipped with a warning
otherwise.

## Access and firewall

The Cockpit profile does not change UFW. Explicit toolkit UFW enablement allows
TCP 9090, but upstream firewalls and network ACLs remain external concerns.
Use normal Linux accounts and review which users may administer the host.

## Troubleshooting and rollback

```bash
journalctl -u cockpit.socket -u cockpit.service
systemctl restart cockpit.socket
apt-cache policy cockpit cockpit-machines cockpit-files
```

To disable remote access without removing packages:

```bash
sudo systemctl disable --now cockpit.socket
```
@@ -0,0 +1,56 @@
# Docker

## Package policy

The profile prefers Ubuntu's `docker.io` package. If that package is
unavailable after an APT refresh, it configures Docker's official Ubuntu
repository and installs Docker Engine, containerd, Buildx, and Compose plugins.

This fallback requires network access to `download.docker.com`.

## Daemon configuration

The managed settings are:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
```

Existing valid `/etc/docker/daemon.json` content is preserved and merged with
these log settings. A changed file is backed up with a timestamp. Invalid JSON
causes the profile to stop rather than overwrite operator configuration.
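
A validate-then-merge step of this kind might look like the sketch below. It assumes `jq` is installed and that both files hold a top-level JSON object; `merge_daemon_json` is a hypothetical name, and the profile's real implementation may differ.

```bash
# Sketch only: merge_daemon_json is a hypothetical helper, not the profile's code.
# Prints EXISTING merged with MANAGED: managed keys win, unrelated keys survive.
merge_daemon_json() {
    local existing="$1" managed="$2"

    # Stop instead of overwriting when the operator's file is not valid JSON.
    if ! jq empty "$existing" >/dev/null 2>&1; then
        printf 'CRITICAL: invalid JSON in %s\n' "$existing" >&2
        return 2
    fi

    # -s reads both files into one array; * merges objects recursively.
    # The // {} fallbacks keep an empty existing file from breaking the merge.
    jq -s '(.[0] // {}) * (.[1] // {})' "$existing" "$managed"
}
```

Writing the output to a temporary file and comparing it with the current configuration before replacing anything keeps an unchanged host from producing a backup or a daemon restart.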

Log limits apply to newly created containers. Existing containers may retain
their original logging configuration until recreated.
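
As a rough sizing aid, the managed limits bound the `json-file` logs of each new container at `max-size` times `max-file`. The helper name `log_footprint_mb` and the container counts below are illustrative assumptions.

```bash
# Sketch only: worst-case json-file log footprint under the managed settings.
# log_footprint_mb is a hypothetical helper; it prints a size in megabytes.
log_footprint_mb() {
    local containers="$1"
    local max_size_mb=50   # "max-size": "50m"
    local max_file=5       # "max-file": "5"

    printf '%d\n' $((max_size_mb * max_file * containers))
}

log_footprint_mb 1     # prints 250
log_footprint_mb 20    # prints 5000
```

To see which limits an existing container actually carries, `docker inspect --format '{{json .HostConfig.LogConfig}}' CONTAINER` prints its logging configuration.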

## Validation

```bash
docker version
docker compose version
docker info
docker ps
docker system df
jq . /etc/docker/daemon.json
```

## Troubleshooting and rollback

```bash
systemctl status docker
journalctl -u docker
jq empty /etc/docker/daemon.json
```

To restore a previous daemon configuration, review a timestamped backup,
replace the current file, validate it with `jq empty`, and restart Docker.
Do not restore blindly when workloads depend on newer daemon settings.

The profile does not configure Docker data roots, prune objects, deploy
applications, or install the NVIDIA Container Toolkit.
@@ -0,0 +1,47 @@
# Fresh Install Checklist

## Before bootstrap

- Confirm Ubuntu 24.04 or newer and record the release and kernel.
- Apply firmware settings for virtualization, IOMMU, or Secure Boot as needed.
- Confirm console or out-of-band access before firewall work.
- Record interfaces, addresses, routes, DNS, storage, and intended mountpoints.
- Patch the base system and reboot if required.
- Decide whether the host needs Docker, libvirt, Cockpit, or NVIDIA support.
- Review application ports and VM networking before enabling UFW.
- Confirm backups exist for any pre-existing host configuration.
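
The release check in the first item can be done by hand with a short sketch. The toolkit itself compares versions with `dpkg --compare-versions`; this example swaps in GNU `sort -V` so that it also runs where `dpkg` is absent. `version_ge` and `release_ok` are hypothetical names.

```bash
# Sketch only: version_ge and release_ok are hypothetical helpers.
# version_ge succeeds when version A is greater than or equal to version B.
version_ge() {
    local a="$1" b="$2" lowest
    lowest="$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)"
    [[ "$lowest" == "$b" ]]
}

# release_ok succeeds when the os-release file describes Ubuntu 24.04 or newer.
# The file is sourced in a subshell so the caller's variables stay untouched.
release_ok() {
    local os_release="${1:-/etc/os-release}"
    (
        # shellcheck disable=SC1090
        source "$os_release"
        [[ "${ID:-}" == "ubuntu" ]] && version_ge "${VERSION_ID:-0}" "24.04"
    )
}
```

`release_ok && uname -r` then confirms the release and prints the running kernel for the record.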

## Bootstrap

Start with the least capability required:

```bash
sudo ./install.sh --base --shell
```

Add reviewed platform profiles:

```bash
sudo ./install.sh --cockpit --docker --libvirt --nvidia-tools --tuning --security
```

Do not select `--enable-ufw` until remote access and application rules are
understood. Do not install an NVIDIA driver until hardware, kernel, Secure Boot,
and workload compatibility are known.

## Post-bootstrap evidence

- Review all installer warnings.
- Run `systemctl --failed`.
- Confirm expected services with `systemctl status`.
- Review `ss -tulpn`, `df -hT`, `ip -brief address`, and `ip route`.
- Confirm Docker with `docker version` and `docker compose version`.
- Confirm libvirt with `virsh list --all` and `virsh net-list --all`.
- Confirm GPU state with `lspci -nn | grep -i nvidia` and `nvidia-smi`.
- Reboot after driver installation and repeat the postcheck.

## Handover

Document host-specific storage, network, firewall, backup, application, GPU,
and VM decisions. Install the separate `ailab-maintenance` toolkit only after
reviewing its scheduled day-2 behavior.
@@ -0,0 +1,54 @@
# libvirt and KVM

## Purpose

The libvirt profile installs QEMU/KVM administration, UEFI firmware, software
TPM support, VM creation tools, bridge utilities, and the libvirt daemon. This
supports later Linux or Windows 11 VM work without defining guests.

## Firmware pre-checks

Confirm CPU virtualization is enabled:

```bash
lscpu | grep -E 'Virtualization|Hypervisor'
grep -Eom1 '(vmx|svm)' /proc/cpuinfo
```
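
For scripted checks, the same test can be wrapped in a function that returns a status instead of printing a match. This is a sketch: `has_virt_flags` is a hypothetical name, and the optional file argument exists only so the check can be tried against a saved copy of `/proc/cpuinfo`.

```bash
# Sketch only: has_virt_flags is a hypothetical helper for illustration.
# Succeeds when the cpuinfo-style file lists vmx (Intel VT-x) or svm (AMD-V)
# as a whole word; the file defaults to /proc/cpuinfo.
has_virt_flags() {
    local cpuinfo="${1:-/proc/cpuinfo}"
    grep -Eqm1 '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' "$cpuinfo"
}

if has_virt_flags; then
    printf 'OK: CPU virtualization flags detected\n'
else
    printf 'WARNING: CPU virtualization flags were not detected\n'
fi
```

Anchoring the pattern on whitespace means only a complete `vmx` or `svm` flag counts as a match.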

IOMMU and GPU passthrough require separate firmware, kernel command-line,
device isolation, driver binding, and recovery planning. This toolkit reports
hints but does not apply those changes.

## Validation

```bash
systemctl status libvirtd
virsh list --all
virsh net-list --all
virsh pool-list --all
```

Use `virt-host-validate` when available for a broader host capability report.
Desktop use of `virt-manager` requires a graphical environment or remote
display strategy.

## Networking and Windows 11

The default libvirt NAT network is distinct from host bridge networking. Review
DHCP, DNS, forwarding, and firewall behavior before changing it.

Windows 11 typically needs UEFI and a TPM device. The installed OVMF and swtpm
packages provide those building blocks, but guest creation and licensing remain
manual.

## Troubleshooting

```bash
journalctl -u libvirtd
virsh net-info default
virsh pool-list --all
lsmod | grep kvm
```

Disabling `libvirtd` does not remove VM disks or definitions. Package removal
and VM data deletion are intentionally outside this toolkit.
@@ -0,0 +1,52 @@
# NVIDIA Tooling

## Diagnostic-only default

The normal NVIDIA profile installs `nvtop`, `clinfo`, and PCI utilities. It
does not install or select a driver:

```bash
sudo ./install.sh --nvidia-tools
```

Review hardware and current module state:

```bash
lspci -nn | grep -i nvidia
nvidia-smi
dkms status
mokutil --sb-state
```

## Explicit driver installation

Install only a reviewed Ubuntu driver package version:

```bash
sudo ./install.sh --install-nvidia-driver 550
```

The numeric value maps directly to `nvidia-driver-VERSION`. The profile refuses
an unavailable package. Reboot after installation, then validate `nvidia-smi`,
kernel logs, DKMS state, and application behavior.
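
The mapping from the numeric value to a package name can be sketched as below. The digits-only rule matches the installer's argument check; `nvidia_package_name` itself is a hypothetical helper used only for illustration.

```bash
# Sketch only: nvidia_package_name is a hypothetical helper for illustration.
# Prints the package name for a digits-only driver branch, or fails with status 2.
nvidia_package_name() {
    local version="$1"

    if [[ ! "$version" =~ ^[0-9]+$ ]]; then
        printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
        return 2
    fi
    printf 'nvidia-driver-%s\n' "$version"
}

nvidia_package_name 550                                  # prints nvidia-driver-550
nvidia_package_name 550-server || printf 'rejected\n'    # prints rejected
```

Before installing, `apt-cache policy "$(nvidia_package_name 550)"` shows whether the configured repositories offer a candidate for that package.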

## Selection considerations

- GPU generation and supported driver branch.
- Ubuntu release, kernel, and HWE stack.
- Secure Boot module enrollment.
- CUDA or application compatibility.
- Docker NVIDIA Container Toolkit requirements.
- Whether the device will be bound to VFIO instead of the host driver.

## Troubleshooting

```bash
journalctl -k | grep -Ei 'nvidia|nouveau|NVRM'
lsmod | grep -E 'nvidia|nouveau'
dkms status
apt-cache policy 'nvidia-driver-*'
```

Driver rollback is environment-specific and is not automated. Preserve console
access and a known-good kernel before changing GPU or Secure Boot configuration.
@@ -0,0 +1,133 @@
#!/usr/bin/env bash
# AI lab operational shell helpers. This file is intended to be sourced.

alias ll='ls -alF'
alias la='ls -A'
alias ports='ss -tulpn'
alias dus='du -xhd1 2>/dev/null | sort -h'
alias dufh='df -hT'
alias failed='systemctl --failed --no-pager'
alias jerr='journalctl -p err -b --no-pager'
alias timers='systemctl list-timers --all --no-pager'
alias dps='command -v docker >/dev/null 2>&1 && docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}" || printf "Docker is not installed\n"'
alias ddf='command -v docker >/dev/null 2>&1 && docker system df || printf "Docker is not installed\n"'
alias dcu='command -v docker >/dev/null 2>&1 && docker compose up -d || printf "Docker Compose is not available\n"'
alias vms='command -v virsh >/dev/null 2>&1 && virsh list --all || printf "virsh is not installed\n"'
alias gpu='command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi || printf "nvidia-smi is not installed\n"'
alias gpuloop='command -v nvidia-smi >/dev/null 2>&1 && watch -n 2 nvidia-smi || printf "nvidia-smi is not installed\n"'
alias now='date "+%Y-%m-%d %H:%M:%S %Z"'

svc_status() {
    if (($# != 1)); then
        printf 'Usage: svc_status SERVICE\n' >&2
        return 2
    fi
    systemctl status "$1" --no-pager
}

svc_logs() {
    if (($# < 1 || $# > 2)); then
        printf 'Usage: svc_logs SERVICE [LINES]\n' >&2
        return 2
    fi
    local lines="${2:-100}"
    [[ "$lines" =~ ^[0-9]+$ ]] || {
        printf 'LINES must be numeric\n' >&2
        return 2
    }
    journalctl -u "$1" -n "$lines" --no-pager
}

docker_logs() {
    if (($# < 1 || $# > 2)); then
        printf 'Usage: docker_logs CONTAINER [LINES]\n' >&2
        return 2
    fi
    command -v docker >/dev/null 2>&1 || {
        printf 'Docker is not installed\n' >&2
        return 1
    }
    local lines="${2:-100}"
    [[ "$lines" =~ ^[0-9]+$ ]] || {
        printf 'LINES must be numeric\n' >&2
        return 2
    }
    docker logs --tail "$lines" "$1"
}

docker_restart() {
    if (($# != 1)); then
        printf 'Usage: docker_restart CONTAINER\n' >&2
        return 2
    fi
    command -v docker >/dev/null 2>&1 || {
        printf 'Docker is not installed\n' >&2
        return 1
    }
    docker restart "$1"
}

vm_autostart() {
    if (($# != 1)); then
        printf 'Usage: vm_autostart VM\n' >&2
        return 2
    fi
    command -v virsh >/dev/null 2>&1 || {
        printf 'virsh is not installed\n' >&2
        return 1
    }
    virsh autostart "$1"
}

vm_no_autostart() {
    if (($# != 1)); then
        printf 'Usage: vm_no_autostart VM\n' >&2
        return 2
    fi
    command -v virsh >/dev/null 2>&1 || {
        printf 'virsh is not installed\n' >&2
        return 1
    }
    virsh autostart --disable "$1"
}

path_backup() {
    if (($# != 1)); then
        printf 'Usage: path_backup PATH\n' >&2
        return 2
    fi
    if [[ ! -e "$1" ]]; then
        printf 'Path does not exist: %s\n' "$1" >&2
        return 1
    fi
    local destination
    destination="${1%/}.$(date '+%Y%m%d-%H%M%S').bak"
    cp -a -- "$1" "$destination"
    printf 'Backup created: %s\n' "$destination"
}

extract() {
    if (($# != 1)); then
        printf 'Usage: extract ARCHIVE\n' >&2
        return 2
    fi
    if [[ ! -f "$1" ]]; then
        printf 'Archive does not exist: %s\n' "$1" >&2
        return 1
    fi
    case "$1" in
        *.tar.bz2|*.tbz2) tar xjf "$1" ;;
        *.tar.gz|*.tgz) tar xzf "$1" ;;
        *.tar.xz|*.txz) tar xJf "$1" ;;
        *.tar) tar xf "$1" ;;
        *.bz2) bunzip2 "$1" ;;
        *.gz) gunzip "$1" ;;
        *.zip) unzip "$1" ;;
        *.7z) 7z x "$1" ;;
        *.rar) unrar x "$1" ;;
        *)
            printf 'Unsupported archive type: %s\n' "$1" >&2
            return 2
            ;;
    esac
}
@@ -0,0 +1,7 @@
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
@@ -0,0 +1,3 @@
fs.inotify.max_user_watches=1048576
fs.inotify.max_user_instances=1024
vm.swappiness=10
@@ -0,0 +1,5 @@
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
Executable
+182
@@ -0,0 +1,182 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

run_base=0
run_shell=0
run_cockpit=0
run_docker=0
run_libvirt=0
run_nvidia=0
run_tuning=0
run_security=0
enable_ufw=0
nvidia_driver_version=""

usage() {
    cat <<'EOF'
Usage: sudo ./install.sh [OPTIONS]

Day-0 bootstrap automation for Ubuntu 24.04 or newer.

Profiles:
  --base                    Install baseline operational packages
  --shell                   Install the root shell profile
  --cockpit                 Install and enable Cockpit
  --docker                  Install and configure Docker
  --libvirt                 Install and enable libvirt/KVM
  --nvidia-tools            Install NVIDIA diagnostic tools only
  --install-nvidia-driver VERSION
                            Install diagnostic tools and the explicit driver
  --tuning                  Install journald and sysctl tuning
  --security                Install fail2ban and UFW; do not enable UFW
  --enable-ufw              Run security profile and explicitly enable UFW
  --all                     Run every profile without enabling UFW or
                            installing an NVIDIA driver
  -h, --help                Show this help

Examples:
  sudo ./install.sh --base --shell
  sudo ./install.sh --all
  sudo ./install.sh --all --enable-ufw
  sudo ./install.sh --nvidia-tools --install-nvidia-driver 550
EOF
}

require_supported_ubuntu() {
    if [[ ! -r /etc/os-release ]]; then
        printf 'CRITICAL: /etc/os-release is unavailable; refusing system changes\n' >&2
        exit 2
    fi

    # shellcheck disable=SC1091
    source /etc/os-release
    if [[ "${ID:-}" != "ubuntu" ]]; then
        printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
        exit 2
    fi
    if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
        printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
            "${VERSION_ID:-unknown}" >&2
        exit 2
    fi
}

if (($# == 0)); then
    usage
    exit 0
fi

while (($# > 0)); do
    case "$1" in
        --base)
            run_base=1
            ;;
        --shell)
            run_shell=1
            ;;
        --cockpit)
            run_cockpit=1
            ;;
        --docker)
            run_docker=1
            ;;
        --libvirt)
            run_libvirt=1
            ;;
        --nvidia-tools)
            run_nvidia=1
            ;;
        --install-nvidia-driver)
            if (($# < 2)); then
                printf 'CRITICAL: --install-nvidia-driver requires a VERSION\n' >&2
                exit 2
            fi
            nvidia_driver_version="$2"
            if [[ ! "$nvidia_driver_version" =~ ^[0-9]+$ ]]; then
                printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
                exit 2
            fi
            run_nvidia=1
            shift
            ;;
        --tuning)
            run_tuning=1
            ;;
        --security)
            run_security=1
            ;;
        --enable-ufw)
            enable_ufw=1
            run_security=1
            ;;
        --all)
            run_base=1
            run_shell=1
            run_cockpit=1
            run_docker=1
            run_libvirt=1
            run_nvidia=1
            run_tuning=1
            run_security=1
            ;;
        -h|--help)
            usage
            exit 0
            ;;
        *)
            printf 'CRITICAL: unknown option: %s\n\n' "$1" >&2
            usage >&2
            exit 2
            ;;
    esac
    shift
done

if ((EUID != 0)); then
    printf 'CRITICAL: install.sh must run as root for selected profiles\n' >&2
    exit 2
fi

for required_command in bash dpkg; do
    if ! command -v "$required_command" >/dev/null 2>&1; then
        printf 'CRITICAL: required command is missing: %s\n' "$required_command" >&2
        exit 2
    fi
done

require_supported_ubuntu

printf 'INFO: running read-only preflight\n'
"$SCRIPT_DIR/scripts/00-preflight.sh"

((run_base == 0)) || "$SCRIPT_DIR/scripts/01-base-packages.sh"
((run_shell == 0)) || "$SCRIPT_DIR/scripts/02-shell-profile.sh"
((run_cockpit == 0)) || "$SCRIPT_DIR/scripts/03-cockpit.sh"
((run_docker == 0)) || "$SCRIPT_DIR/scripts/04-docker.sh"
((run_libvirt == 0)) || "$SCRIPT_DIR/scripts/05-libvirt.sh"

if ((run_nvidia == 1)); then
    if [[ -n "$nvidia_driver_version" ]]; then
        "$SCRIPT_DIR/scripts/06-nvidia-tools.sh" --install-driver "$nvidia_driver_version"
    else
        "$SCRIPT_DIR/scripts/06-nvidia-tools.sh"
    fi
fi

((run_tuning == 0)) || "$SCRIPT_DIR/scripts/07-tuning.sh"

if ((run_security == 1)); then
    if ((enable_ufw == 1)); then
        "$SCRIPT_DIR/scripts/08-security-baseline.sh" --enable-ufw
    else
        "$SCRIPT_DIR/scripts/08-security-baseline.sh"
    fi
fi

printf '\nINFO: running post-install checks\n'
"$SCRIPT_DIR/scripts/99-postcheck.sh"
printf '\nOK: selected Linux setup profiles completed\n'
@@ -0,0 +1,20 @@
# shellcheck shell=bash

require_supported_ubuntu() {
    if [[ ! -r /etc/os-release ]] || ! command -v dpkg >/dev/null 2>&1; then
        printf 'CRITICAL: Ubuntu release detection requires /etc/os-release and dpkg\n' >&2
        exit 2
    fi

    # shellcheck disable=SC1091
    source /etc/os-release
    if [[ "${ID:-}" != "ubuntu" ]]; then
        printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
        exit 2
    fi
    if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
        printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
            "${VERSION_ID:-unknown}" >&2
        exit 2
    fi
}
Executable
+124
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

section() {
    printf '\n== %s ==\n' "$1"
}

run_optional() {
    local description="$1"
    shift

    if "$@"; then
        return 0
    fi
    printf 'WARNING: %s failed\n' "$description"
    return 0
}

section "Operating system"
if [[ -r /etc/os-release ]]; then
    run_optional "OS release report" cat /etc/os-release
else
    printf 'WARNING: /etc/os-release is unavailable\n'
fi
run_optional "kernel report" uname -a

section "Host"
run_optional "hostname report" hostname
run_optional "uptime report" uptime

section "CPU and virtualization"
if command -v lscpu >/dev/null 2>&1; then
    run_optional "CPU report" lscpu
    printf '\nVirtualization flags:\n'
    lscpu | grep -E 'Virtualization|Hypervisor vendor' || \
        printf 'INFO: no virtualization summary reported by lscpu\n'
else
    printf 'WARNING: lscpu is unavailable\n'
fi
if grep -Eqm1 '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' /proc/cpuinfo; then
    printf 'OK: CPU virtualization flags detected\n'
else
    printf 'WARNING: CPU virtualization flags were not detected\n'
fi

section "Memory"
if command -v free >/dev/null 2>&1; then
    run_optional "memory report" free -h
else
    run_optional "memory report" cat /proc/meminfo
fi

section "Disks"
if command -v lsblk >/dev/null 2>&1; then
    run_optional "block device report" lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,MODEL
else
    printf 'WARNING: lsblk is unavailable\n'
fi
run_optional "filesystem report" df -hT

section "Network"
if command -v ip >/dev/null 2>&1; then
    run_optional "network interface report" ip -brief address
    run_optional "route report" ip route
else
    printf 'WARNING: ip is unavailable\n'
fi

section "Firmware and Secure Boot"
if [[ -d /sys/firmware/efi ]]; then
    printf 'OK: boot mode is UEFI\n'
else
    printf 'INFO: boot mode appears to be legacy BIOS\n'
fi
if command -v mokutil >/dev/null 2>&1; then
    run_optional "Secure Boot report" mokutil --sb-state
else
    printf 'INFO: mokutil is unavailable; Secure Boot state not queried\n'
fi

section "IOMMU"
if [[ -r /proc/cmdline ]]; then
    printf 'Kernel command line:\n'
    cat /proc/cmdline
    if grep -Eq '(^|[[:space:]])(intel_iommu=on|amd_iommu=on|iommu=)' /proc/cmdline; then
        printf 'OK: IOMMU-related kernel arguments detected\n'
    else
        printf 'INFO: no explicit IOMMU kernel argument detected\n'
    fi
fi
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | grep -Ei 'DMAR|IOMMU|AMD-Vi' | tail -n 30 || \
        printf 'INFO: no readable IOMMU hints found in dmesg\n'
fi

section "NVIDIA hardware"
if command -v lspci >/dev/null 2>&1; then
    lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
else
    printf 'INFO: lspci is unavailable\n'
fi

section "Existing platform components"
for command_name in docker virsh cockpit-bridge; do
    if command -v "$command_name" >/dev/null 2>&1; then
        printf 'OK: %s is installed at %s\n' "$command_name" "$(command -v "$command_name")"
    else
        printf 'INFO: %s is not installed\n' "$command_name"
    fi
done
if command -v systemctl >/dev/null 2>&1; then
    for unit in docker.service libvirtd.service cockpit.socket; do
        if systemctl cat "$unit" >/dev/null 2>&1; then
            state="$(systemctl is-active "$unit" 2>/dev/null || true)"
            printf 'INFO: %-20s state=%s\n' "$unit" "${state:-unknown}"
        else
            printf 'INFO: %s is not installed\n' "$unit"
        fi
    done
fi

printf '\nOK: preflight completed without modifying the host\n'
Executable
+41
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"

packages=(
    curl wget git vim nano tmux byobu htop btop glances
    jq unzip zip rsync tree ncdu duf
    lsof strace tcpdump nmap dnsutils net-tools iperf3 ethtool
    smartmontools nvme-cli lm-sensors pciutils usbutils hwinfo
    sysstat iotop iftop nload
    ca-certificates gnupg software-properties-common apt-transport-https
    needrestart unattended-upgrades logrotate
)

if ((EUID != 0)); then
    printf 'CRITICAL: base package setup must run as root\n' >&2
    exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
    printf 'CRITICAL: apt-get is required\n' >&2
    exit 2
fi

printf 'INFO: refreshing APT metadata\n'
apt-get update
printf 'INFO: installing baseline operational packages\n'
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"

if command -v systemctl >/dev/null 2>&1; then
    systemctl enable --now sysstat
else
    printf 'WARNING: systemctl is unavailable; sysstat was not enabled\n'
fi

printf 'OK: baseline operational packages are installed\n'
Executable
+60
@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
SOURCE_FILE="$SCRIPT_DIR/../files/bashrc.d/ailab.sh"
PROFILE_DIR="/root/.bashrc.d"
PROFILE_FILE="$PROFILE_DIR/ailab.sh"
BASHRC="/root/.bashrc"
SOURCE_LINE='[[ -f /root/.bashrc.d/ailab.sh ]] && source /root/.bashrc.d/ailab.sh'

backup_file() {
    local path="$1"
    local backup

    backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
    install -m 0644 "$path" "$backup"
    printf 'INFO: backed up %s to %s\n' "$path" "$backup"
}

if ((EUID != 0)); then
    printf 'CRITICAL: shell profile setup must run as root\n' >&2
    exit 2
fi
require_supported_ubuntu
if [[ ! -r "$SOURCE_FILE" ]]; then
    printf 'CRITICAL: shell profile source is missing: %s\n' "$SOURCE_FILE" >&2
    exit 2
fi

install -d -m 0755 "$PROFILE_DIR"
if [[ ! -f "$PROFILE_FILE" ]] || ! cmp -s "$SOURCE_FILE" "$PROFILE_FILE"; then
    if [[ -f "$PROFILE_FILE" ]]; then
        backup_file "$PROFILE_FILE"
    fi
    install -m 0644 "$SOURCE_FILE" "$PROFILE_FILE"
    printf 'OK: installed %s\n' "$PROFILE_FILE"
else
    printf 'OK: shell profile is already current\n'
fi

if [[ ! -f "$BASHRC" ]]; then
    install -m 0644 /dev/null "$BASHRC"
fi

source_count="$(grep -Fxc "$SOURCE_LINE" "$BASHRC" || true)"
if [[ "$source_count" != "1" ]]; then
    tmp_bashrc="$(mktemp)"
    trap 'rm -f "$tmp_bashrc"' EXIT
    grep -Fvx "$SOURCE_LINE" "$BASHRC" >"$tmp_bashrc" || true
    printf '\n%s\n' "$SOURCE_LINE" >>"$tmp_bashrc"
    backup_file "$BASHRC"
    install -m 0644 "$tmp_bashrc" "$BASHRC"
    printf 'OK: configured %s to source the AI lab profile exactly once\n' "$BASHRC"
else
    printf 'OK: %s already sources the AI lab profile exactly once\n' "$BASHRC"
fi
Executable
+36
@@ -0,0 +1,36 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"

required_packages=(
    cockpit cockpit-system cockpit-storaged cockpit-networkmanager
    cockpit-packagekit cockpit-machines cockpit-sosreport cockpit-pcp
)

if ((EUID != 0)); then
    printf 'CRITICAL: Cockpit setup must run as root\n' >&2
    exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
    printf 'CRITICAL: apt-get is required\n' >&2
    exit 2
fi

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${required_packages[@]}"

if apt-cache show cockpit-files >/dev/null 2>&1; then
    DEBIAN_FRONTEND=noninteractive apt-get install -y cockpit-files
    printf 'OK: installed optional cockpit-files package\n'
else
    printf 'WARNING: cockpit-files is unavailable; continuing without it\n'
fi

systemctl enable --now cockpit.socket
printf 'OK: Cockpit is enabled at https://%s:9090\n' "$(hostname)"
Executable
+136
@@ -0,0 +1,136 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
# shellcheck source=00-platform-guard.inc
|
||||||
|
source "$SCRIPT_DIR/00-platform-guard.inc"
|
||||||
|
SOURCE_CONFIG="$SCRIPT_DIR/../files/docker/daemon.json"
|
||||||
|
DOCKER_CONFIG="/etc/docker/daemon.json"
|
||||||
|
temporary_files=()
|
||||||
|
|
||||||
|
cleanup() {
|
||||||
|
local path
|
||||||
|
|
||||||
|
for path in "${temporary_files[@]}"; do
|
||||||
|
rm -f "$path"
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
trap cleanup EXIT
|
||||||
|
|
||||||
|
backup_file() {
|
||||||
|
local path="$1"
|
||||||
|
local backup
|
||||||
|
|
||||||
|
backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
|
||||||
|
install -m 0644 "$path" "$backup"
|
||||||
|
printf 'INFO: backed up %s to %s\n' "$path" "$backup"
|
||||||
|
}
|
||||||
|
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: Docker setup must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
require_supported_ubuntu
|
||||||
|
for command_name in apt-get apt-cache; do
|
||||||
|
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
apt-get update
|
||||||
|
DEBIAN_FRONTEND=noninteractive apt-get install -y ca-certificates curl gnupg jq
|
||||||
|
|
||||||
|
if apt-cache show docker.io >/dev/null 2>&1; then
|
||||||
|
packages=(docker.io)
|
||||||
|
if apt-cache show docker-compose-v2 >/dev/null 2>&1; then
|
||||||
|
packages+=(docker-compose-v2)
|
||||||
|
else
|
||||||
|
printf 'WARNING: docker-compose-v2 is unavailable from Ubuntu repositories\n'
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
printf 'WARNING: docker.io is unavailable; configuring Docker official repository\n'
|
||||||
|
install -d -m 0755 /etc/apt/keyrings
|
||||||
|
tmp_key="$(mktemp)"
|
||||||
|
temporary_files+=("$tmp_key")
|
||||||
|
curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
|
||||||
|
| gpg --dearmor --yes -o "$tmp_key"
|
||||||
|
if [[ ! -f /etc/apt/keyrings/docker.gpg ]] || \
|
||||||
|
! cmp -s "$tmp_key" /etc/apt/keyrings/docker.gpg; then
|
||||||
|
if [[ -f /etc/apt/keyrings/docker.gpg ]]; then
|
||||||
|
backup_file /etc/apt/keyrings/docker.gpg
|
||||||
|
fi
|
||||||
|
install -m 0644 "$tmp_key" /etc/apt/keyrings/docker.gpg
|
||||||
|
fi
|
||||||
|
|
||||||
|
# shellcheck disable=SC1091
|
||||||
|
source /etc/os-release
|
||||||
|
architecture="$(dpkg --print-architecture)"
|
||||||
|
tmp_repository="$(mktemp)"
|
||||||
|
temporary_files+=("$tmp_repository")
|
||||||
|
printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu %s stable\n' \
|
||||||
|
"$architecture" "${VERSION_CODENAME:?}" \
|
||||||
|
>"$tmp_repository"
|
||||||
|
if [[ ! -f /etc/apt/sources.list.d/docker.list ]] || \
|
||||||
|
! cmp -s "$tmp_repository" /etc/apt/sources.list.d/docker.list; then
|
||||||
|
if [[ -f /etc/apt/sources.list.d/docker.list ]]; then
|
||||||
|
backup_file /etc/apt/sources.list.d/docker.list
|
||||||
|
fi
|
||||||
|
install -m 0644 "$tmp_repository" /etc/apt/sources.list.d/docker.list
|
||||||
|
fi
|
||||||
|
apt-get update
|
||||||
|
packages=(docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin)
|
||||||
|
fi
|
||||||
|
|
||||||
|
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
|
||||||
|
install -d -m 0755 /etc/docker
|
||||||
|
|
||||||
|
if [[ ! -r "$SOURCE_CONFIG" ]]; then
|
||||||
|
printf 'CRITICAL: Docker configuration template is missing: %s\n' "$SOURCE_CONFIG" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
jq empty "$SOURCE_CONFIG"
|
||||||
|
|
||||||
|
tmp_config="$(mktemp)"
|
||||||
|
temporary_files+=("$tmp_config")
|
||||||
|
if [[ -f "$DOCKER_CONFIG" ]]; then
|
||||||
|
if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: %s is invalid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
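# Overlay the log limits on the existing configuration; unrelated keys and other log-opts are preserved.
jq '. + {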
|
||||||
|
"log-driver": "json-file",
|
||||||
|
"log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
|
||||||
|
}' "$DOCKER_CONFIG" >"$tmp_config"
|
||||||
|
else
|
||||||
|
install -m 0644 "$SOURCE_CONFIG" "$tmp_config"
|
||||||
|
fi
|
||||||
|
jq empty "$tmp_config"
|
||||||
|
|
||||||
|
config_changed=0
|
||||||
|
if [[ ! -f "$DOCKER_CONFIG" ]] || ! cmp -s "$tmp_config" "$DOCKER_CONFIG"; then
|
||||||
|
if [[ -f "$DOCKER_CONFIG" ]]; then
|
||||||
|
backup_file "$DOCKER_CONFIG"
|
||||||
|
fi
|
||||||
|
install -m 0644 "$tmp_config" "$DOCKER_CONFIG"
|
||||||
|
config_changed=1
|
||||||
|
printf 'OK: installed Docker daemon log limits\n'
|
||||||
|
else
|
||||||
|
printf 'OK: Docker daemon configuration is already current\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
systemctl enable --now docker
|
||||||
|
if ((config_changed == 1)); then
|
||||||
|
systemctl restart docker
|
||||||
|
fi
|
||||||
|
|
||||||
|
docker version
|
||||||
|
if docker compose version >/dev/null 2>&1; then
|
||||||
|
docker compose version
|
||||||
|
else
|
||||||
|
printf 'WARNING: Docker Compose v2 is unavailable\n'
|
||||||
|
fi
|
||||||
|
printf 'OK: Docker setup completed\n'
|
||||||
Executable
+33
@@ -0,0 +1,33 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
# shellcheck source=00-platform-guard.inc
|
||||||
|
source "$SCRIPT_DIR/00-platform-guard.inc"
|
||||||
|
|
||||||
|
packages=(
|
||||||
|
qemu-system-x86 qemu-utils libvirt-daemon-system libvirt-clients
|
||||||
|
virtinst virt-manager bridge-utils ovmf swtpm swtpm-tools dnsmasq-base
|
||||||
|
)
|
||||||
|
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: libvirt setup must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
require_supported_ubuntu
|
||||||
|
if ! command -v apt-get >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: apt-get is required\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
apt-get update
|
||||||
|
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
|
||||||
|
systemctl enable --now libvirtd
|
||||||
|
|
||||||
|
printf '\n== Virtual machines ==\n'
|
||||||
|
virsh list --all || printf 'WARNING: unable to list virtual machines\n'
|
||||||
|
printf '\n== Virtual networks ==\n'
|
||||||
|
virsh net-list --all || printf 'WARNING: unable to list virtual networks\n'
|
||||||
|
printf 'OK: libvirt/KVM setup completed\n'
|
||||||
Executable
+88
@@ -0,0 +1,88 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
# shellcheck source=00-platform-guard.inc
|
||||||
|
source "$SCRIPT_DIR/00-platform-guard.inc"
|
||||||
|
|
||||||
|
driver_version=""
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
cat <<'EOF'
|
||||||
|
Usage: sudo ./06-nvidia-tools.sh [--install-driver VERSION]
|
||||||
|
|
||||||
|
Without --install-driver, only non-driver diagnostic tools are installed.
|
||||||
|
EOF
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--install-driver)
|
||||||
|
if (($# < 2)); then
|
||||||
|
printf 'CRITICAL: --install-driver requires a VERSION\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
driver_version="$2"
|
||||||
|
if [[ ! "$driver_version" =~ ^[0-9]+$ ]]; then
|
||||||
|
printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
shift
|
||||||
|
;;
|
||||||
|
-h|--help)
|
||||||
|
usage
|
||||||
|
exit 0
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
printf 'CRITICAL: unknown option: %s\n' "$1" >&2
|
||||||
|
exit 2
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: NVIDIA tooling setup must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
require_supported_ubuntu
|
||||||
|
if ! command -v apt-get >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: apt-get is required\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
apt-get update
|
||||||
|
DEBIAN_FRONTEND=noninteractive apt-get install -y nvtop clinfo pciutils
|
||||||
|
|
||||||
|
printf '\n== NVIDIA PCI devices ==\n'
|
||||||
|
lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
|
||||||
|
|
||||||
|
printf '\n== NVIDIA runtime ==\n'
|
||||||
|
if command -v nvidia-smi >/dev/null 2>&1; then
|
||||||
|
nvidia-smi || printf 'WARNING: nvidia-smi returned an error\n'
|
||||||
|
else
|
||||||
|
printf 'INFO: nvidia-smi is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '\n== DKMS ==\n'
|
||||||
|
if command -v dkms >/dev/null 2>&1; then
|
||||||
|
dkms status || printf 'WARNING: dkms status returned an error\n'
|
||||||
|
else
|
||||||
|
printf 'INFO: dkms is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ -n "$driver_version" ]]; then
|
||||||
|
driver_package="nvidia-driver-$driver_version"
|
||||||
|
if ! apt-cache show "$driver_package" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: requested NVIDIA driver package is unavailable: %s\n' \
|
||||||
|
"$driver_package" >&2
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
DEBIAN_FRONTEND=noninteractive apt-get install -y "$driver_package"
|
||||||
|
printf 'WARNING: NVIDIA driver %s was installed; reboot before validation\n' \
|
||||||
|
"$driver_version"
|
||||||
|
else
|
||||||
|
printf 'OK: NVIDIA diagnostic tools installed; no driver was installed\n'
|
||||||
|
fi
|
||||||
Executable
+67
@@ -0,0 +1,67 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
# shellcheck source=00-platform-guard.inc
|
||||||
|
source "$SCRIPT_DIR/00-platform-guard.inc"
|
||||||
|
JOURNAL_SOURCE="$SCRIPT_DIR/../files/systemd/journald-ailab-limits.conf"
|
||||||
|
JOURNAL_DEST="/etc/systemd/journald.conf.d/ailab-limits.conf"
|
||||||
|
SYSCTL_SOURCE="$SCRIPT_DIR/../files/sysctl/99-ailab.conf"
|
||||||
|
SYSCTL_DEST="/etc/sysctl.d/99-ailab.conf"
|
||||||
|
|
||||||
|
install_config() {
|
||||||
|
local source_path="$1"
|
||||||
|
local destination_path="$2"
|
||||||
|
local mode="$3"
|
||||||
|
local backup
|
||||||
|
|
||||||
|
install -d -m 0755 "$(dirname "$destination_path")"
|
||||||
|
if [[ -f "$destination_path" ]] && cmp -s "$source_path" "$destination_path"; then
|
||||||
|
printf 'OK: %s is already current\n' "$destination_path"
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
if [[ -f "$destination_path" ]]; then
|
||||||
|
backup="${destination_path}.$(date '+%Y%m%d-%H%M%S').bak"
|
||||||
|
install -m "$mode" "$destination_path" "$backup"
|
||||||
|
printf 'INFO: backed up %s to %s\n' "$destination_path" "$backup"
|
||||||
|
fi
|
||||||
|
install -m "$mode" "$source_path" "$destination_path"
|
||||||
|
printf 'OK: installed %s\n' "$destination_path"
|
||||||
|
}
|
||||||
|
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: tuning setup must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
require_supported_ubuntu
|
||||||
|
for source_path in "$JOURNAL_SOURCE" "$SYSCTL_SOURCE"; do
|
||||||
|
if [[ ! -r "$source_path" ]]; then
|
||||||
|
printf 'CRITICAL: required configuration is missing: %s\n' "$source_path" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
if ! command -v sysctl >/dev/null 2>&1 || ! command -v systemctl >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: sysctl and systemctl are required\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! command -v sensors-detect >/dev/null 2>&1 || \
|
||||||
|
! systemctl cat sysstat.service >/dev/null 2>&1; then
|
||||||
|
apt-get update
|
||||||
|
DEBIAN_FRONTEND=noninteractive apt-get install -y lm-sensors sysstat
|
||||||
|
fi
|
||||||
|
|
||||||
|
install_config "$JOURNAL_SOURCE" "$JOURNAL_DEST" 0644
|
||||||
|
install_config "$SYSCTL_SOURCE" "$SYSCTL_DEST" 0644
|
||||||
|
|
||||||
|
sysctl --system
|
||||||
|
systemctl restart systemd-journald
|
||||||
|
systemctl enable --now sysstat
|
||||||
|
|
||||||
|
if command -v sensors-detect >/dev/null 2>&1; then
|
||||||
|
sensors-detect --auto || printf 'WARNING: sensors-detect did not complete successfully\n'
|
||||||
|
fi
|
||||||
|
printf 'OK: host tuning completed\n'
|
||||||
+61
@@ -0,0 +1,61 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||||
|
# shellcheck source=00-platform-guard.inc
|
||||||
|
source "$SCRIPT_DIR/00-platform-guard.inc"
|
||||||
|
|
||||||
|
enable_ufw=0
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
cat <<'EOF'
|
||||||
|
Usage: sudo ./08-security-baseline.sh [--enable-ufw]
|
||||||
|
|
||||||
|
Installs fail2ban and UFW. UFW is enabled only with the explicit flag.
|
||||||
|
EOF
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--enable-ufw)
|
||||||
|
enable_ufw=1
|
||||||
|
;;
|
||||||
|
-h|--help)
|
||||||
|
usage
|
||||||
|
exit 0
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
printf 'CRITICAL: unknown option: %s\n' "$1" >&2
|
||||||
|
exit 2
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: security baseline setup must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
require_supported_ubuntu
|
||||||
|
if ! command -v apt-get >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: apt-get is required\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
apt-get update
|
||||||
|
DEBIAN_FRONTEND=noninteractive apt-get install -y fail2ban ufw
|
||||||
|
systemctl enable --now fail2ban
|
||||||
|
|
||||||
|
if ((enable_ufw == 1)); then
|
||||||
|
printf 'WARNING: UFW was explicitly requested; SSH and Cockpit rules will be added before enablement\n'
|
||||||
|
ufw allow OpenSSH
|
||||||
|
ufw allow 9090/tcp comment 'Cockpit'
|
||||||
|
ufw --force enable
|
||||||
|
else
|
||||||
|
printf 'WARNING: UFW is installed but was not enabled; use --enable-ufw after reviewing access requirements\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
ufw status verbose || printf 'WARNING: unable to read UFW status\n'
|
||||||
|
printf 'OK: security baseline completed\n'
|
||||||
Executable
+69
@@ -0,0 +1,69 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
section() {
|
||||||
|
printf '\n== %s ==\n' "$1"
|
||||||
|
}
|
||||||
|
|
||||||
|
run_optional() {
|
||||||
|
local description="$1"
|
||||||
|
shift
|
||||||
|
|
||||||
|
if "$@"; then
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
printf 'WARNING: %s failed\n' "$description"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
section "Failed systemd units"
|
||||||
|
if command -v systemctl >/dev/null 2>&1; then
|
||||||
|
run_optional "failed systemd unit report" systemctl --failed --no-pager
|
||||||
|
|
||||||
|
section "Selected service status"
|
||||||
|
for unit in cockpit.socket docker.service libvirtd.service fail2ban.service; do
|
||||||
|
if systemctl cat "$unit" >/dev/null 2>&1; then
|
||||||
|
run_optional "$unit status" systemctl status "$unit" --no-pager
|
||||||
|
else
|
||||||
|
printf 'INFO: %s is not installed\n' "$unit"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
else
|
||||||
|
printf 'WARNING: systemctl is unavailable\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Docker"
|
||||||
|
if command -v docker >/dev/null 2>&1; then
|
||||||
|
run_optional "Docker container list" docker ps
|
||||||
|
else
|
||||||
|
printf 'INFO: Docker is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Libvirt"
|
||||||
|
if command -v virsh >/dev/null 2>&1; then
|
||||||
|
run_optional "libvirt guest list" virsh list --all
|
||||||
|
else
|
||||||
|
printf 'INFO: virsh is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "NVIDIA"
|
||||||
|
if command -v nvidia-smi >/dev/null 2>&1; then
|
||||||
|
run_optional "NVIDIA status" nvidia-smi
|
||||||
|
else
|
||||||
|
printf 'INFO: nvidia-smi is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Filesystems"
|
||||||
|
run_optional "filesystem report" df -hT
|
||||||
|
|
||||||
|
section "Listening ports"
|
||||||
|
if command -v ss >/dev/null 2>&1; then
|
||||||
|
run_optional "listening port report" ss -tulpn
|
||||||
|
else
|
||||||
|
printf 'WARNING: ss is unavailable\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '\nOK: postcheck completed; review warnings above\n'
|
||||||
|
exit 0
|
||||||
@@ -1,8 +1,14 @@
|
|||||||
# platform-projects
|
# platform-projects
|
||||||
|
|
||||||
This directory is reserved for larger infrastructure platform topics and future case studies. The current implemented project is [infra-run](../infra-run/).
|
This directory contains larger infrastructure platform topics and case studies. Treat a subdirectory as a planning area unless its own README says otherwise.
|
||||||
|
|
||||||
Current subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
|
## Implemented platform projects
|
||||||
|
|
||||||
|
- [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
|
||||||
|
|
||||||
|
## Planning areas
|
||||||
|
|
||||||
|
These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
|
||||||
|
|
||||||
- `monitoring-zabbix`
|
- `monitoring-zabbix`
|
||||||
- `elk-log-analysis`
|
- `elk-log-analysis`
|
||||||
|
|||||||
@@ -0,0 +1,233 @@
|
|||||||
|
# Slurm AI/HPC Cluster Automation Lab
|
||||||
|
|
||||||
|
## Executive summary
|
||||||
|
|
||||||
|
This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.
|
||||||
|
|
||||||
|
The goal is not to present a certified production platform but to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.
|
||||||
|
|
||||||
|
## What this project demonstrates
|
||||||
|
|
||||||
|
- Slurm controller and worker node management.
|
||||||
|
- Munge authentication across the cluster.
|
||||||
|
- GPU node integration through Slurm GRES.
|
||||||
|
- cgroup CPU, memory, and GPU device enforcement.
|
||||||
|
- SlurmDBD with MariaDB-backed accounting.
|
||||||
|
- `sacct`, `sreport`, and `sacctmgr` workflows.
|
||||||
|
- QOS, fairshare, and multifactor priority configuration.
|
||||||
|
- Node provisioning and decommissioning workflows.
|
||||||
|
- Rolling OS upgrades with canary validation.
|
||||||
|
- Health checks and auto-remediation.
|
||||||
|
- Backup and restore-check workflow for the accounting database.
|
||||||
|
- Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.
|
||||||
|
|
||||||
|
## Architecture overview
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
flowchart LR
|
||||||
|
operator[Ansible control node]
|
||||||
|
munge[Munge authentication]
|
||||||
|
controller[Slurm controller<br/>slurmctld]
|
||||||
|
db[MariaDB + SlurmDBD<br/>accounting]
|
||||||
|
shared[Shared filesystem<br/>site dependency]
|
||||||
|
cpu_part[CPU partition]
|
||||||
|
gpu_part[GPU partition]
|
||||||
|
cpu_nodes[CPU compute nodes<br/>slurmd]
|
||||||
|
gpu_node[GPU node<br/>slurmd + GRES]
|
||||||
|
jobs[User jobs<br/>sbatch / srun]
|
||||||
|
|
||||||
|
operator -->|bootstrap and configure| controller
|
||||||
|
operator -->|configure workers| cpu_nodes
|
||||||
|
operator -->|configure GPU worker| gpu_node
|
||||||
|
operator -->|deploy key and service| munge
|
||||||
|
|
||||||
|
munge --> controller
|
||||||
|
munge --> cpu_nodes
|
||||||
|
munge --> gpu_node
|
||||||
|
|
||||||
|
controller -->|accounting RPC| db
|
||||||
|
jobs -->|submit to Slurm| controller
|
||||||
|
controller -->|schedule CPU jobs| cpu_part
|
||||||
|
controller -->|schedule GPU jobs| gpu_part
|
||||||
|
cpu_part --> cpu_nodes
|
||||||
|
gpu_part --> gpu_node
|
||||||
|
|
||||||
|
cpu_nodes --- shared
|
||||||
|
gpu_node --- shared
|
||||||
|
controller --- shared
|
||||||
|
```
|
||||||
|
|
||||||
|
The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.
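
As a rough illustration of how GRES declarations can be read back in an operator script, the sketch below totals the GPU count in a GRES string. The function name `gres_gpu_count` and the sample strings are hypothetical and not part of this repository. The input format is assumed to follow the `name[:type][:count]` form that Slurm reports for a node, optionally with a socket suffix in parentheses.

```bash
# Hypothetical helper: total the GPUs declared in a GRES string.
# Assumes comma-separated entries of the form name[:type][:count][(suffix)].
gres_gpu_count() {
    local gres="$1" entry count total=0
    local -a entries=()

    IFS=',' read -r -a entries <<<"$gres"
    for entry in "${entries[@]}"; do
        # Skip resources that are not GPUs, for example "mps:100".
        [[ "$entry" == gpu || "$entry" == gpu:* ]] || continue
        # Drop a trailing socket suffix such as "(S:0-1)".
        entry="${entry%%\(*}"
        # The count is the last colon-separated field when it is numeric.
        count="${entry##*:}"
        [[ "$count" =~ ^[0-9]+$ ]] || count=1
        total=$((total + count))
    done
    printf '%s\n' "$total"
}
```

A call such as `gres_gpu_count 'gpu:2,mps:100'` counts only the GPU entry. In practice the string would come from a Slurm query, for example the value printed by `sinfo -h -o '%G'`; check that field's format on the installed Slurm version before relying on this parsing.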
|
||||||
|
|
||||||
|
## Repository layout
|
||||||
|
|
||||||
|
```text
|
||||||
|
inventories/lab/ Sanitized lab inventory and group variables
|
||||||
|
playbooks/bootstrap/ Initial SSH, sudo, operator user, and host setup
|
||||||
|
playbooks/core/ Munge, Slurm config, and safe restart workflows
|
||||||
|
playbooks/accounting/ SlurmDBD, MariaDB, backup, restore-check, and reporting validation
|
||||||
|
playbooks/qos/ QOS, fairshare, and priority configuration
|
||||||
|
playbooks/lifecycle/ Node provisioning, inspection, and decommissioning
|
||||||
|
playbooks/upgrade/ Canary and rolling OS upgrade workflows
|
||||||
|
playbooks/health/ Health checks, repair, and auto-remediation
|
||||||
|
playbooks/tests/ CPU, GPU, cgroup, accounting, and reporting validation jobs
|
||||||
|
playbooks/backup/ Slurm and Munge state backup helpers
|
||||||
|
templates/ Slurm, cgroup, GRES, and SlurmDBD templates
|
||||||
|
docs/ Operational runbook
|
||||||
|
prompts/ Documentation prompts used to expand this project
|
||||||
|
```
|
||||||
|
|
||||||
|
## Main operational workflows
|
||||||
|
|
||||||
|
Run commands from `platform-projects/hpc-slurm-ai-cluster/`. Review inventory and variables before running any playbook.
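
One way to make that review routine is a small pre-flight wrapper. The sketch below is an assumption, not a script shipped in this project: `preflight_playbook` and the `ANSIBLE_INVENTORY_BIN` / `ANSIBLE_PLAYBOOK_BIN` overrides are hypothetical names. It relies only on the standard `ansible-inventory --graph`, `ansible-playbook --syntax-check`, and `ansible-playbook --list-hosts` options, none of which change a managed host.

```bash
# Hypothetical pre-flight helper: show the inventory, then syntax-check the
# playbook and list the hosts it would target. Nothing is applied.
preflight_playbook() {
    local playbook="$1"
    # Assumed overrides so the commands can be stubbed in tests.
    local inventory_cmd="${ANSIBLE_INVENTORY_BIN:-ansible-inventory}"
    local playbook_cmd="${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}"

    if [[ ! -r "$playbook" ]]; then
        printf 'CRITICAL: playbook is missing: %s\n' "$playbook" >&2
        return 2
    fi

    # The inventory path comes from ansible.cfg in the current directory.
    "$inventory_cmd" --graph || return 1
    "$playbook_cmd" "$playbook" --syntax-check || return 1
    "$playbook_cmd" "$playbook" --list-hosts || return 1
}
```

For example, `preflight_playbook playbooks/core/manage-munge.yml` prints the inventory graph and the targeted hosts, which is a convenient moment to confirm that only lab hosts are in scope.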
|
||||||
|
|
||||||
|
### Bootstrap access
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
|
||||||
|
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
|
||||||
|
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
|
||||||
|
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Deploy Munge
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/core/manage-munge.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Deploy Slurm config
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
|
||||||
|
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
|
||||||
|
ansible-playbook playbooks/core/restart-slurm-safe.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Validate CPU jobs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/tests/validate-slurm-operator.yml
|
||||||
|
ansible-playbook playbooks/tests/test-cpu-job.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Validate GPU jobs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/tests/test-gpu-job.yml
|
||||||
|
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Enable accounting
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
|
||||||
|
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
|
||||||
|
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
|
||||||
|
ansible-playbook playbooks/tests/test-sreport-usage.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configure QOS and fairshare
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/qos/configure-slurm-qos.yml
|
||||||
|
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Provision a node
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=<node>
|
||||||
|
ansible-playbook playbooks/tests/test-specific-node.yml -e target_node=<node>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Decommission a node
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml \
|
||||||
|
-e target_node=<node> \
|
||||||
|
-e "decom_reason=planned maintenance"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Rolling OS upgrade
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=<node>
|
||||||
|
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
|
||||||
|
-e canary_node=<node> \
|
||||||
|
-e skip_canary=true
|
||||||
|
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
|
||||||
|
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Health check and auto-remediation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/health/check-slurm-health.yml
|
||||||
|
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
|
||||||
|
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=<node>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Accounting backup and restore-check
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
|
||||||
|
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
## Operational maturity
|
||||||
|
|
||||||
|
This is more than a toy lab because it includes operational controls around the cluster, not only a static `slurm.conf` example.
|
||||||
|
|
||||||
|
- Ansible workflows are designed to be repeatable and readable.
|
||||||
|
- Configuration deployment supports check and diff review before applying changes.
|
||||||
|
- Validation jobs prove CPU scheduling, GPU scheduling, cgroup behavior, accounting, and reporting.
|
||||||
|
- SlurmDBD and MariaDB accounting are configured with `sacct`, `sreport`, and `sacctmgr` validation.
|
||||||
|
- QOS, fairshare, priority, and association workflows show resource governance.
|
||||||
|
- Node lifecycle playbooks drain, decommission, reprovision, resume, and validate nodes.
|
||||||
|
- Rolling upgrade playbooks include canary validation before broader worker upgrades.
|
||||||
|
- Health and repair playbooks document remediation paths for common node states.
|
||||||
|
- Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
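
The check-and-diff review mentioned above can be turned into a simple gate. The following is a minimal sketch under stated assumptions: `deploy_with_review`, the `APPLY` switch, and the `ANSIBLE_PLAYBOOK_BIN` override are hypothetical names chosen for illustration, not part of the playbooks in this project.

```bash
# Hypothetical gate: always preview with --check --diff, and apply only when
# the operator opts in with APPLY=1.
deploy_with_review() {
    local playbook="$1"
    shift
    # Assumed override so the command can be stubbed in tests.
    local runner="${ANSIBLE_PLAYBOOK_BIN:-ansible-playbook}"

    # A failing preview stops the function before any change is made.
    "$runner" "$playbook" --check --diff "$@" || return 1

    if [[ "${APPLY:-0}" != "1" ]]; then
        printf 'INFO: preview only; rerun with APPLY=1 to apply %s\n' "$playbook"
        return 0
    fi

    "$runner" "$playbook" --diff "$@"
}
```

Running `deploy_with_review playbooks/core/manage-slurm-config.yml` shows the pending changes without touching the cluster, and `APPLY=1 deploy_with_review playbooks/core/manage-slurm-config.yml` repeats the preview before applying. Keeping the preview mandatory is a design choice: the apply step never runs against a playbook that has not just passed check mode.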
|
||||||
|
|
||||||
|
## Tested capabilities
|
||||||
|
|
||||||
|
- [x] CPU job scheduling.
|
||||||
|
- [x] GPU job scheduling.
|
||||||
|
- [x] GPU denial when no GRES is requested.
|
||||||
|
- [x] CPU cgroup enforcement.
|
||||||
|
- [x] SlurmDBD accounting setup.
|
||||||
|
- [x] `sacct` job history visibility.
|
||||||
|
- [x] `sreport` usage reporting.
|
||||||
|
- [x] QOS creation and validation.
|
||||||
|
- [x] Fairshare and priority visibility.
|
||||||
|
- [x] Node decommission and reprovision workflow.
|
||||||
|
- [x] Rolling upgrade canary workflow.
|
||||||
|
- [x] Node health check and auto-remediation workflow.
|
||||||
|
|
||||||
|
These checks represent sanitized lab validation, not a claim of production certification.
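
To make the GPU denial check concrete, the sketch below counts the devices listed by `nvidia-smi -L` inside a job and compares a run with GRES against a run without it. It is an illustration only: the function names are hypothetical, and the actual test playbooks may verify isolation differently.

```bash
# Hypothetical helper: count GPUs in `nvidia-smi -L` output read from stdin.
# Each visible device is listed on a line that starts with "GPU <index>:".
count_visible_gpus() {
    # grep -c prints 0 and exits 1 when nothing matches, so mask the status.
    grep -Ec '^GPU [0-9]+:' || true
}

# Hypothetical check: isolation holds when a job with GRES sees at least one
# GPU and a job without GRES sees none.
report_gpu_isolation() {
    local with_gres="$1" without_gres="$2"

    if ((with_gres >= 1 && without_gres == 0)); then
        printf 'OK: GPUs are visible only when GRES is requested\n'
        return 0
    fi
    printf 'WARNING: unexpected GPU visibility (with GRES: %s, without: %s)\n' \
        "$with_gres" "$without_gres"
    return 1
}
```

In a lab run, the two counts could come from commands such as `srun --gres=gpu:1 nvidia-smi -L | count_visible_gpus` and the same command without `--gres`. Partition names and any additional `srun` options are site-specific and left out here.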
|
||||||
|
|
||||||
|
## Safety and sanitization
|
||||||
|
|
||||||
|
This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.
|
||||||
|
|
||||||
|
Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files. Real credentials, including SlurmDBD database passwords, belong in Ansible Vault files kept outside this repository or in another approved secret store.
|
||||||
|
|
||||||
|
Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.
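
A lightweight guard can catch the most obvious mistakes before a commit. The sketch below is hypothetical: the function names and the deny-list patterns are illustrative assumptions, not rules defined by this repository, and they should be adjusted to the files a real environment produces.

```bash
# Hypothetical deny-list for file paths that should never be committed.
# The patterns are illustrative and intentionally broad.
is_sensitive_path() {
    local path="$1"

    case "$path" in
        *munge.key|*id_rsa|*id_ed25519|*.pem|*.sql|*.sql.gz|*.dump|*vault*.yml|*.tar.gz)
            return 0
            ;;
    esac
    return 1
}

# Read one path per line from stdin and fail if any of them is sensitive.
check_paths() {
    local path status=0

    while IFS= read -r path; do
        if is_sensitive_path "$path"; then
            printf 'CRITICAL: refusing sensitive file: %s\n' "$path" >&2
            status=1
        fi
    done
    return "$status"
}
```

One possible use is a pre-commit hook that runs `git diff --cached --name-only | check_paths` and aborts on a non-zero exit status. A filename check is only a first line of defence; it does not detect secrets embedded inside otherwise ordinary files.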
|
||||||
|
|
||||||
|
## Why this matters for AI/HPC infrastructure roles
|
||||||
|
|
||||||
|
AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.
|
||||||
|
|
||||||
|
This project demonstrates practical understanding of:
|
||||||
|
|
||||||
|
- Linux systems operations.
|
||||||
|
- Slurm cluster operations.
|
||||||
|
- GPU infrastructure and GRES scheduling.
|
||||||
|
- Job scheduling and resource isolation.
|
||||||
|
- Accounting, reporting, QOS, fairshare, and priority policy.
|
||||||
|
- Automation and repeatability with Ansible.
|
||||||
|
- Troubleshooting and operational ownership.
|
||||||
|
|
||||||
|
## Deeper docs
|
||||||
|
|
||||||
|
- [Runbook](docs/runbook.md)
|
||||||
@@ -0,0 +1,14 @@
|
|||||||
|
[defaults]
|
||||||
|
inventory = ./inventories/lab/inventory.yml
|
||||||
|
host_key_checking = False
|
||||||
|
retry_files_enabled = False
|
||||||
|
stdout_callback = default
|
||||||
|
result_format = yaml
|
||||||
|
interpreter_python = auto_silent
|
||||||
|
timeout = 30
|
||||||
|
roles_path = ./roles
|
||||||
|
collections_path = ./collections
|
||||||
|
|
||||||
|
[ssh_connection]
|
||||||
|
pipelining = True
|
||||||
|
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
|
||||||
@@ -0,0 +1 @@
|
|||||||
|
Generated backups and reports can be stored here locally. This directory is ignored by git.
|
||||||
@@ -0,0 +1,75 @@
# Slurm AI/HPC Lab Runbook

## Standard deployment order

```bash
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml

ansible-playbook playbooks/core/manage-munge.yml
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml

ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml

ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml

ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml

ansible-playbook playbooks/health/check-slurm-health.yml
```
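
The order above matters because later playbooks assume the earlier ones succeeded. A minimal sketch of a stop-on-first-failure wrapper is shown here; `run_in_order` and the `RUNNER` variable are hypothetical names and are not part of the repository. Flags such as `--ask-pass` or `--check --diff` are left out of the sketch.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper, not part of the repository.
# RUNNER defaults to ansible-playbook; set RUNNER=echo to preview the order.
set -eu

RUNNER="${RUNNER:-ansible-playbook}"

run_in_order() {
  for playbook in "$@"; do
    echo ">>> ${playbook}"
    # Stop at the first failing playbook so later stages never run on a broken base.
    if ! "${RUNNER}" "${playbook}"; then
      echo "FAILED: ${playbook}" >&2
      return 1
    fi
  done
}
```

For example, `RUNNER=echo run_in_order playbooks/core/manage-munge.yml playbooks/core/restart-slurm-safe.yml` prints each path in order without contacting any host.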

## Node lifecycle

Provision a node:

```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
```

Decommission a node:

```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
```

Repair a node:

```bash
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
```

Run health remediation for nodes that can be recovered by the automated workflow:

```bash
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
```

Back up Slurm and Munge state before planned lifecycle work:

```bash
ansible-playbook playbooks/backup/backup-slurm-state.yml
ansible-playbook playbooks/backup/fetch-slurm-backups.yml
```
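
The lifecycle commands above differ only in the playbook path and the `target_node` value, so a small helper can validate the node name before anything runs. This is a sketch under assumptions: `valid_node_name` and `lifecycle_cmd` are hypothetical names, the allowed character set is guessed from the example hosts (`slurm-c02`, `gpu01`), and the helper only prints the command instead of executing it. The `decom_reason` variable is omitted for brevity.

```bash
# Hypothetical helpers, not part of the repository.

# Accept only lowercase letters, digits and hyphens (assumed naming scheme).
valid_node_name() {
  case "$1" in
    ''|*[!a-z0-9-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# $1 = action (provision|decommission|repair), $2 = node name.
# Prints the ansible-playbook command for that action.
lifecycle_cmd() {
  action="$1"
  node="$2"
  if ! valid_node_name "$node"; then
    echo "invalid node name: $node" >&2
    return 2
  fi
  case "$action" in
    provision)    echo "ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=$node" ;;
    decommission) echo "ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=$node" ;;
    repair)       echo "ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=$node" ;;
    *)            echo "unknown action: $action" >&2; return 2 ;;
  esac
}
```

Printing the command first keeps the sketch safe to try on a workstation that has no access to the cluster.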

## Rolling OS upgrade

```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```

If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
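
The note about the optional controller playbook can be turned into a guard. The sketch below assumes a hypothetical function name, `controller_upgrade_step`, and only reports what it would do; it does not run the upgrade.

```bash
# Hypothetical guard, not part of the repository.
# $1 = path to the controller upgrade playbook (defaults to the runbook path).
controller_upgrade_step() {
  playbook="${1:-playbooks/upgrade/upgrade-slurm-controller.yml}"
  if [ -f "$playbook" ]; then
    echo "run: ansible-playbook $playbook"
  else
    echo "skip: $playbook not found, upgrade the controller manually"
  fi
}
```

Called without arguments from the repository root, it reports whether the controller step can be automated in the current checkout.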
@@ -0,0 +1,128 @@
---
# Example lab inventory variables. Replace addresses, users and node topology for your environment.

slurm_cluster_name: labcluster

slurm_control_machine: slurm-ctl01
slurm_control_addr: 10.10.10.11

slurm_config_dir: /etc/slurm
slurm_user: slurm
slurm_operator_user: slurmuser

slurmctld_port: 6817
slurmd_port: 6818

slurm_job_comp_type: jobcomp/none

slurm_select_type: select/cons_tres
slurm_select_type_parameters: CR_Core_Memory

slurm_return_to_service: 2
slurm_default_mpi_type: none

slurm_gres_types: gpu

slurm_nodes:
  - name: slurm-c01
    managed_state: present
    addr: 10.10.10.12
    cpus: 2
    real_memory: 1800
    features: ""
    gres: ""
    topology: ""
  - name: slurm-c02
    managed_state: present
    addr: 10.10.10.13
    cpus: 2
    real_memory: 1800
    features: ""
    gres: ""
    topology: ""
  - name: gpu01
    managed_state: present
    addr: 10.10.10.14
    cpus: 12
    real_memory: 60000
    features: "gpu"
    gres: "gpu:1"
    gres_file: /dev/nvidia0
    topology: "Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2"

slurm_partitions:
  - name: debug
    managed_state: present
    nodes: "slurm-c[01-02]"
    default: "YES"
    max_time: "INFINITE"
    state: "UP"
  - name: gpu
    managed_state: present
    nodes: "gpu01"
    default: "NO"
    max_time: "INFINITE"
    state: "UP"
  - name: all
    managed_state: present
    nodes: "slurm-c[01-02],gpu01"
    default: "NO"
    max_time: "INFINITE"
    state: "UP"

# Cgroup enforcement
slurm_enable_cgroup: true
slurm_task_plugin: task/cgroup,task/affinity
slurm_proctrack_type: proctrack/cgroup
slurm_job_acct_gather_type: jobacct_gather/cgroup

# Slurm accounting / SlurmDBD
slurm_accounting_storage_type: accounting_storage/slurmdbd
slurm_accounting_storage_host: slurm-ctl01
slurm_accounting_storage_port: 6819
slurm_accounting_storage_enforce: associations,limits,qos
slurm_accounting_storage_tres: cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu

slurmdbd_host: slurm-ctl01
slurmdbd_port: 6819
slurmdbd_storage_type: accounting_storage/mysql
slurmdbd_storage_host: localhost
slurmdbd_storage_port: 3306
slurmdbd_storage_loc: slurm_acct_db
slurmdbd_storage_user: slurm
# Use Ansible Vault in real environments. See inventories/lab/group_vars/vault.example.yml
slurmdbd_storage_pass: "{{ vault_slurmdbd_storage_pass | default('CHANGE_ME_USE_ANSIBLE_VAULT') }}"

slurm_account_name: lab
slurm_account_description: "AI/HPC Slurm lab account"
slurm_account_organization: "labcluster"

# SlurmDBD purge / retention policy for lab
slurmdbd_commit_delay: 1
slurmdbd_purge_event_after: 12months
slurmdbd_purge_job_after: 12months
slurmdbd_purge_resv_after: 12months
slurmdbd_purge_step_after: 3months
slurmdbd_purge_suspend_after: 3months
slurmdbd_purge_txn_after: 12months
slurmdbd_purge_usage_after: 24months

# Archive is disabled for the lab; backup playbooks handle database dumps.
slurmdbd_archive_events: no
slurmdbd_archive_jobs: no
slurmdbd_archive_steps: no
slurmdbd_archive_suspend: no
slurmdbd_archive_txn: no
slurmdbd_archive_usage: no

# Slurm priority / fairshare
slurm_priority_type: priority/multifactor
slurm_priority_decay_half_life: 7-0
slurm_priority_calc_period: 5
slurm_priority_favor_small: "NO"
slurm_priority_weight_age: 1000
slurm_priority_weight_fairshare: 10000
slurm_priority_weight_job_size: 1000
slurm_priority_weight_partition: 1000
slurm_priority_weight_qos: 10000
slurm_priority_max_age: 1-0
@@ -0,0 +1,5 @@
---
# Copy this file to vault.yml and encrypt it with ansible-vault.
# ansible-vault encrypt inventories/lab/group_vars/vault.yml

vault_slurmdbd_storage_pass: CHANGE_ME
@@ -0,0 +1,24 @@
all:
  vars:
    ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
  children:
    slurm_cluster:
      children:
        slurm_controller:
          hosts:
            slurm-ctl01:
              ansible_host: 10.10.10.11
              ansible_user: ansible
        slurm_compute:
          hosts:
            slurm-c01:
              ansible_host: 10.10.10.12
              ansible_user: ansible
            slurm-c02:
              ansible_host: 10.10.10.13
              ansible_user: ansible
        slurm_gpu:
          hosts:
            gpu01:
              ansible_host: 10.10.10.14
              ansible_user: ansible
@@ -0,0 +1,90 @@
---
- name: Backup SlurmDBD MariaDB database
  hosts: slurm_controller
  become: true
  gather_facts: true

  vars:
    slurmdbd_backup_dir: /var/backups/slurmdbd
    local_fetch_dir: "{{ playbook_dir }}/../../artifacts/backups/slurmdbd"

  tasks:
    - name: Create remote backup directory
      ansible.builtin.file:
        path: "{{ slurmdbd_backup_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0700"

    # No owner/group here: this task runs unprivileged on the control node,
    # where changing ownership to root would fail.
    - name: Create local fetch directory on Ansible controller
      ansible.builtin.file:
        path: "{{ local_fetch_dir }}"
        state: directory
        mode: "0700"
      delegate_to: localhost
      become: false

    - name: Validate MariaDB is running
      ansible.builtin.command:
        cmd: systemctl is-active mariadb
      changed_when: false

    - name: Validate SlurmDBD is running
      ansible.builtin.command:
        cmd: systemctl is-active slurmdbd
      changed_when: false

    - name: Validate Slurm accounting database exists
      ansible.builtin.shell: |
        set -euo pipefail
        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';" | grep -qx "{{ slurmdbd_storage_loc }}"
      args:
        executable: /bin/bash
      changed_when: false

    - name: Dump Slurm accounting database
      ansible.builtin.shell: |
        set -euo pipefail

        ts="$(date +%F-%H%M%S)"
        out="{{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-${ts}.sql.gz"

        mysqldump \
          --single-transaction \
          --routines \
          --events \
          --triggers \
          {{ slurmdbd_storage_loc }} | gzip -9 > "$out"

        chmod 0600 "$out"
        echo "$out"
      args:
        executable: /bin/bash
      register: db_dump
      changed_when: true

    - name: Validate backup file is non-empty
      ansible.builtin.stat:
        path: "{{ db_dump.stdout }}"
      register: backup_file

    - name: Fail if backup file is too small
      ansible.builtin.fail:
        msg: "Backup file is smaller than 1 KiB: {{ db_dump.stdout }}"
      when: backup_file.stat.size | int < 1024

    - name: Fetch DB backup to Ansible controller
      ansible.builtin.fetch:
        src: "{{ db_dump.stdout }}"
        dest: "{{ local_fetch_dir }}/"
        flat: true

    - name: Show DB backup result
      ansible.builtin.debug:
        msg:
          - "Remote backup: {{ db_dump.stdout }}"
          - "Backup size bytes: {{ backup_file.stat.size }}"
          - "Fetched to: {{ local_fetch_dir }}/"
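
The dump-and-verify steps in this playbook reduce to two small operations: build a timestamped file name, then reject a dump that is suspiciously small. A standalone sketch follows; `backup_name` and `backup_big_enough` are hypothetical helper names, and the 1024-byte default mirrors the threshold used in the playbook.

```bash
# Hypothetical helpers, not part of the playbook.

# Build a timestamped dump path: <dir>/<db>-YYYY-MM-DD-HHMMSS.sql.gz
backup_name() {
  printf '%s/%s-%s.sql.gz\n' "$1" "$2" "$(date +%F-%H%M%S)"
}

# Succeed only if the file exists and holds at least $2 bytes (default 1024).
backup_big_enough() {
  [ -f "$1" ] || return 1
  size=$(wc -c < "$1")
  # $((size)) strips any padding that some wc implementations emit.
  [ "$((size))" -ge "${2:-1024}" ]
}
```

A size floor catches the common failure where `mysqldump` exits early and `gzip` still writes a valid but nearly empty archive.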
+126
@@ -0,0 +1,126 @@
---
- name: Initialize Slurm accounting entities
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Wait for sacctmgr connectivity
      ansible.builtin.command:
        cmd: sacctmgr -n list cluster
      register: sacctmgr_cluster_list
      retries: 20
      delay: 2
      until: sacctmgr_cluster_list.rc == 0
      changed_when: false

    - name: Show current accounting state before changes
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC

        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org

        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin

        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: accounting_state_before
      changed_when: false

    - name: Print current accounting state before changes
      ansible.builtin.debug:
        var: accounting_state_before.stdout_lines

    - name: Ensure Slurm cluster exists in accounting DB
      ansible.builtin.shell: |
        set -euo pipefail

        if sacctmgr -n list cluster format=Cluster | awk '{print $1}' | grep -qx "{{ slurm_cluster_name }}"; then
          echo "Cluster {{ slurm_cluster_name }} already exists"
        else
          sacctmgr -i add cluster {{ slurm_cluster_name }}
        fi
      args:
        executable: /bin/bash
      register: ensure_cluster
      changed_when: "'Adding Cluster' in ensure_cluster.stdout"

    - name: Ensure default lab account exists for cluster
      ansible.builtin.shell: |
        set -euo pipefail

        if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="" {found=1} END {exit !found}'; then
          echo "Account {{ slurm_account_name }} already associated with cluster {{ slurm_cluster_name }}"
        else
          sacctmgr -i add account {{ slurm_account_name }} \
            Cluster={{ slurm_cluster_name }} \
            Description="{{ slurm_account_description }}" \
            Organization="{{ slurm_account_organization }}"
        fi
      args:
        executable: /bin/bash
      register: ensure_account
      changed_when: "'Adding Account' in ensure_account.stdout"

    - name: Ensure slurmuser exists with lab account association
      ansible.builtin.shell: |
        set -euo pipefail

        if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="slurmuser" {found=1} END {exit !found}'; then
          echo "User slurmuser already associated with account {{ slurm_account_name }} on cluster {{ slurm_cluster_name }}"
        else
          sacctmgr -i add user slurmuser \
            Cluster={{ slurm_cluster_name }} \
            Account={{ slurm_account_name }} \
            DefaultAccount={{ slurm_account_name }}
        fi
      args:
        executable: /bin/bash
      register: ensure_user_assoc
      changed_when: "'Adding User' in ensure_user_assoc.stdout"

    - name: Ensure slurmuser has default account set
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
      args:
        executable: /bin/bash
      register: set_default_account
      changed_when: "'Modified user' in (set_default_account.stdout + set_default_account.stderr)"

    - name: Show final accounting state
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC

        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org

        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin

        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: accounting_state_after
      changed_when: false

    - name: Print final accounting state
      ansible.builtin.debug:
        var: accounting_state_after.stdout_lines
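
The idempotency checks in this playbook rely on an `awk` filter that exits 0 only when a matching association row exists. The sketch below isolates that filter so it can be tried without a Slurm installation. The sample rows are illustrative text, not captured `sacctmgr` output, and `sample_assoc` and `assoc_exists` are hypothetical names. Unlike the playbook, the values are passed with `awk -v` instead of being templated into the program text.

```bash
# Illustrative stand-in for: sacctmgr -n list assoc format=Cluster,Account,User
sample_assoc() {
  cat <<'EOF'
labcluster       root
labcluster       root       root
labcluster        lab
labcluster        lab  slurmuser
EOF
}

# Reads association rows on stdin.
# $1 = cluster, $2 = account, $3 = user ("" matches the account-level row).
# Exit status 0 means a matching row was found.
assoc_exists() {
  awk -v c="$1" -v a="$2" -v u="$3" \
    '$1==c && $2==a && $3==u {found=1} END {exit !found}'
}
```

For example, `sample_assoc | assoc_exists labcluster lab slurmuser && echo present` prints `present`, while an unknown user makes the pipeline exit with status 1.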
+98
@@ -0,0 +1,98 @@
---
- name: Restore-check latest SlurmDBD backup into test database
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    restore_check_db: "{{ slurmdbd_storage_loc }}_restorecheck"
    slurmdbd_backup_dir: /var/backups/slurmdbd

  tasks:
    - name: Validate MariaDB is running
      ansible.builtin.command:
        cmd: systemctl is-active mariadb
      changed_when: false

    - name: Find latest SlurmDBD backup
      ansible.builtin.shell: |
        set -euo pipefail
        ls -1t {{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-*.sql.gz | head -n 1
      args:
        executable: /bin/bash
      register: latest_backup
      changed_when: false

    - name: Validate latest backup exists
      ansible.builtin.stat:
        path: "{{ latest_backup.stdout }}"
      register: latest_backup_stat

    - name: Fail if latest backup is missing or empty
      ansible.builtin.fail:
        msg: "Latest SlurmDBD backup is missing or empty: {{ latest_backup.stdout }}"
      when:
        - not latest_backup_stat.stat.exists or latest_backup_stat.stat.size | int < 1024

    - name: Recreate restore-check database
      ansible.builtin.shell: |
        set -euo pipefail
        mysql <<SQL
        DROP DATABASE IF EXISTS {{ restore_check_db }};
        CREATE DATABASE {{ restore_check_db }};
        SQL
      args:
        executable: /bin/bash
      changed_when: true

    - name: Import backup into restore-check database
      ansible.builtin.shell: |
        set -euo pipefail
        zcat "{{ latest_backup.stdout }}" | mysql {{ restore_check_db }}
      args:
        executable: /bin/bash
      changed_when: true

    - name: Validate restored table count
      ansible.builtin.shell: |
        set -euo pipefail
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
      args:
        executable: /bin/bash
      register: restored_tables
      changed_when: false
      failed_when: restored_tables.stdout | int < 1

    - name: Validate restored row count sample
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### restored database"
        echo "{{ restore_check_db }}"

        echo
        echo "### table count"
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"

        echo
        echo "### largest tables"
        mysql -N -B -e "
          SELECT table_name, table_rows
          FROM information_schema.tables
          WHERE table_schema='{{ restore_check_db }}'
          ORDER BY table_rows DESC
          LIMIT 10;
        "
      args:
        executable: /bin/bash
      register: restore_check_summary
      changed_when: false

    - name: Show restore-check result
      ansible.builtin.debug:
        msg:
          - "Imported backup: {{ latest_backup.stdout }}"
          - "Restore-check DB: {{ restore_check_db }}"
          - "Restored tables: {{ restored_tables.stdout }}"
          - "Summary:"
          - "{{ restore_check_summary.stdout_lines }}"
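
This playbook picks the newest dump by modification time (`ls -1t`). Because the dump names embed a `date +%F-%H%M%S` timestamp, sorting by name gives the same answer and is unaffected by a later `touch` or copy. The sketch below shows that alternative, which is a swapped-in technique and not what the playbook does; `latest_backup` is a hypothetical function name.

```bash
# Hypothetical helper: newest dump by file name rather than mtime.
# Works because the YYYY-MM-DD-HHMMSS timestamp in the name sorts lexicographically.
# $1 = backup directory, $2 = database name. Prints nothing if no dump exists.
latest_backup() {
  for f in "$1"/"$2"-*.sql.gz; do
    # An unmatched glob stays literal, so skip anything that is not a real file.
    [ -e "$f" ] && printf '%s\n' "$f"
  done | sort | tail -n 1
}
```

An empty result lets the caller fail with a clear message instead of passing a literal glob pattern on to `zcat`.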
@@ -0,0 +1,105 @@
---
- name: Install and configure MariaDB for SlurmDBD
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Install MariaDB and SlurmDBD packages
      ansible.builtin.apt:
        name:
          - mariadb-server
          - mariadb-client
          - slurmdbd
          - slurm-wlm-mysql-plugin
        state: present
        update_cache: true

    - name: Ensure MariaDB is enabled and running
      ansible.builtin.systemd:
        name: mariadb
        enabled: true
        state: started

    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"

    - name: Create Slurm accounting database and DB user
      ansible.builtin.shell: |
        set -euo pipefail
        mysql <<SQL
        CREATE DATABASE IF NOT EXISTS {{ slurmdbd_storage_loc }};
        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'localhost' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'127.0.0.1' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'localhost';
        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'127.0.0.1';
        FLUSH PRIVILEGES;
        SQL
      args:
        executable: /bin/bash
      changed_when: true

    - name: Ensure /etc/slurm exists
      ansible.builtin.file:
        path: /etc/slurm
        state: directory
        owner: root
        group: root
        mode: "0755"

    - name: Deploy slurmdbd.conf
      ansible.builtin.template:
        src: ../../templates/slurmdbd.conf.j2
        dest: /etc/slurm/slurmdbd.conf
        owner: slurm
        group: slurm
        mode: "0600"
      notify:
        - Restart slurmdbd

    - name: Ensure slurmdbd is enabled and running
      ansible.builtin.systemd:
        name: slurmdbd
        enabled: true
        state: started

    - name: Flush handlers before validation
      ansible.builtin.meta: flush_handlers

    - name: Validate slurmdbd service is active
      ansible.builtin.command:
        cmd: systemctl is-active slurmdbd
      register: slurmdbd_active
      retries: 10
      delay: 2
      until: slurmdbd_active.stdout == "active"
      changed_when: false

    - name: Validate slurmdbd is listening on port
      ansible.builtin.shell: |
        set -euo pipefail
        ss -lntp | grep ':{{ slurmdbd_port }} '
      args:
        executable: /bin/bash
      register: slurmdbd_port_check
      retries: 10
      delay: 2
      until: slurmdbd_port_check.rc == 0
      changed_when: false

    - name: Show slurmdbd service validation
      ansible.builtin.debug:
        msg:
          - "slurmdbd is active"
          - "{{ slurmdbd_port_check.stdout_lines }}"

  handlers:
    - name: Restart slurmdbd
      ansible.builtin.systemd:
        name: slurmdbd
        state: restarted
+178
@@ -0,0 +1,178 @@
---
- name: Validate Slurm accounting production-like setup
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Validate accounting services
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### services"
        systemctl is-active mariadb
        systemctl is-active slurmdbd
        systemctl is-active slurmctld

        echo
        echo "### slurmdbd listener"
        ss -lntp | grep ':6819 '
      args:
        executable: /bin/bash
      register: service_check
      changed_when: false

    - name: Validate Slurm accounting runtime config
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### accounting config"
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|ClusterName"

        echo
        echo "### priority / select / cgroup config"
        scontrol show config | grep -E "SelectType|TaskPlugin|ProctrackType"
      args:
        executable: /bin/bash
      register: config_check
      changed_when: false

    - name: Validate sacctmgr entities
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC

        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org

        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin

        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: entity_check
      changed_when: false

    - name: Submit accounting validation job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=acct-prodlike-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/acct-prodlike-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/acct-prodlike-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: acct_job
      changed_when: true

    - name: Validate sacct can read recent jobs
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### recent jobs"
        sacct -S today --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
      args:
        executable: /bin/bash
      register: sacct_recent
      changed_when: false

    - name: Validate sreport commands
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### cluster utilization"
        sreport cluster utilization start=today || true

        echo
        echo "### account utilization by user"
        sreport cluster AccountUtilizationByUser start=today || true

        echo
        echo "### user top"
        sreport user top start=today || true
      args:
        executable: /bin/bash
      register: sreport_check
      changed_when: false

    - name: Validate MariaDB table health summary
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### database exists"
        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';"

        echo
        echo "### table count"
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ slurmdbd_storage_loc }}';"

        echo
        echo "### largest tables"
        mysql -N -B -e "
          SELECT table_name, table_rows
          FROM information_schema.tables
          WHERE table_schema='{{ slurmdbd_storage_loc }}'
          ORDER BY table_rows DESC
          LIMIT 10;
        "
      args:
        executable: /bin/bash
      register: db_health
      changed_when: false

    - name: Print accounting validation
      ansible.builtin.debug:
        msg:
          - "### services"
          - "{{ service_check.stdout_lines }}"
          - "### runtime config"
          - "{{ config_check.stdout_lines }}"
          - "### accounting entities"
          - "{{ entity_check.stdout_lines }}"
          - "### accounting validation job"
          - "{{ acct_job.stdout_lines }}"
          - "### recent sacct data"
          - "{{ sacct_recent.stdout_lines }}"
          - "### sreport"
          - "{{ sreport_check.stdout_lines }}"
          - "### database health"
          - "{{ db_health.stdout_lines }}"
@@ -0,0 +1,83 @@
|
|||||||
|
---
|
||||||
|
- name: Backup Slurm and Munge state on all cluster nodes
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
vars:
|
||||||
|
backup_base_dir: /var/backups/slurm
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Create backup base directory
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ backup_base_dir }}"
|
||||||
|
state: directory
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Create timestamped backup directory
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
ts="$(date +%F-%H%M%S)"
|
||||||
|
dir="{{ backup_base_dir }}/$ts"
|
||||||
|
mkdir -p "$dir"
|
||||||
|
echo "$dir"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: backup_dir_result
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Store backup directory fact
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
node_backup_dir: "{{ backup_dir_result.stdout }}"
|
||||||
|
|
||||||
|
- name: Backup Slurm and Munge config/state if present
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
backup_dir="{{ node_backup_dir }}"
|
||||||
|
|
||||||
|
for p in \
|
||||||
|
/etc/slurm \
|
||||||
|
/etc/slurm-llnl \
|
||||||
|
/etc/munge \
|
||||||
|
/var/spool/slurmctld \
|
||||||
|
/var/spool/slurmd \
|
||||||
|
/var/log/slurm \
|
||||||
|
/var/log/slurm-llnl
|
||||||
|
do
|
||||||
|
if [ -e "$p" ]; then
|
||||||
|
cp -a "$p" "$backup_dir/"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
systemctl status munge --no-pager > "$backup_dir/systemctl-munge.txt" 2>&1 || true
|
||||||
|
systemctl status slurmctld --no-pager > "$backup_dir/systemctl-slurmctld.txt" 2>&1 || true
|
||||||
|
systemctl status slurmd --no-pager > "$backup_dir/systemctl-slurmd.txt" 2>&1 || true
|
||||||
|
|
||||||
|
journalctl -u munge -n 200 --no-pager > "$backup_dir/journal-munge.txt" 2>&1 || true
|
||||||
|
journalctl -u slurmctld -n 200 --no-pager > "$backup_dir/journal-slurmctld.txt" 2>&1 || true
|
||||||
|
journalctl -u slurmd -n 200 --no-pager > "$backup_dir/journal-slurmd.txt" 2>&1 || true
|
||||||
|
|
||||||
|
if command -v sinfo >/dev/null 2>&1; then
|
||||||
|
sinfo > "$backup_dir/sinfo.txt" 2>&1 || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
if command -v scontrol >/dev/null 2>&1; then
|
||||||
|
scontrol show config > "$backup_dir/scontrol-show-config.txt" 2>&1 || true
|
||||||
|
scontrol show nodes > "$backup_dir/scontrol-show-nodes.txt" 2>&1 || true
|
||||||
|
scontrol show partitions > "$backup_dir/scontrol-show-partitions.txt" 2>&1 || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
find "$backup_dir" -maxdepth 2 \( -type f -o -type d \)
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: backup_content
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Show backup location on node
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
- "Host: {{ inventory_hostname }}"
|
||||||
|
- "Backup directory: {{ node_backup_dir }}"
|
||||||
@@ -0,0 +1,46 @@
|
|||||||
|
---
|
||||||
|
- name: Fetch latest Slurm backups from nodes to pvef
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
remote_backup_base: /var/backups/slurm
|
||||||
|
local_backup_base: "{{ playbook_dir }}/../../artifacts/backups"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Find latest remote backup directory
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
ls -1dt {{ remote_backup_base }}/* | head -n 1
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: latest_backup_dir
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Create local backup directory on pvef
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ local_backup_base }}/{{ inventory_hostname }}"
|
||||||
|
state: directory
|
||||||
|
mode: "0700"
|
||||||
|
delegate_to: localhost
|
||||||
|
become: false
|
||||||
|
|
||||||
|
- name: Archive latest backup directory on remote node
|
||||||
|
ansible.builtin.archive:
|
||||||
|
path: "{{ latest_backup_dir.stdout }}"
|
||||||
|
dest: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
|
||||||
|
format: gz
|
||||||
|
force_archive: true
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Fetch archive to pvef
|
||||||
|
ansible.builtin.fetch:
|
||||||
|
src: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
|
||||||
|
dest: "{{ local_backup_base }}/{{ inventory_hostname }}/"
|
||||||
|
flat: true
|
||||||
|
|
||||||
|
- name: Remove temporary remote archive
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
|
||||||
|
state: absent
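The `Find latest remote backup directory` task above selects the newest directory by modification time with `ls -1dt ... | head -n 1`. Below is a minimal alternative sketch. It assumes the directory names keep the `%F-%H%M%S` timestamp format created by the backup playbook, so that name order matches creation order. `latest_backup_dir` is a hypothetical helper name, not part of this change.

```shell
# Hypothetical helper: print the lexically greatest subdirectory of $1.
# Assumption: subdirectory names are timestamps such as 2024-03-05-120000.
latest_backup_dir() {
  lbd_base="$1"
  lbd_newest=""
  # Pathname expansion is sorted, so the last match has the greatest name.
  for lbd_dir in "$lbd_base"/*/; do
    [ -d "$lbd_dir" ] || continue   # skips the unexpanded pattern when nothing matches
    lbd_newest="${lbd_dir%/}"
  done
  [ -n "$lbd_newest" ] || return 1  # no backup directory found
  printf '%s\n' "$lbd_newest"
}
```

A call such as `latest_backup_dir /var/backups/slurm` prints one path, or returns a non-zero status when the base directory holds no subdirectories, which lets the calling task fail with a clear condition.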
|
||||||
@@ -0,0 +1,58 @@
|
|||||||
|
---
|
||||||
|
- name: Bootstrap Ansible SSH access from pvef to Slurm nodes
|
||||||
|
hosts: slurm_cluster
|
||||||
|
gather_facts: false
|
||||||
|
become: true
|
||||||
|
|
||||||
|
vars:
|
||||||
|
ansible_controller_pubkey: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_ed25519.pub') }}"
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Wait for SSH
|
||||||
|
ansible.builtin.wait_for_connection:
|
||||||
|
timeout: 30
|
||||||
|
|
||||||
|
- name: Install Python if missing - Debian/Ubuntu
|
||||||
|
ansible.builtin.raw: |
|
||||||
|
test -e /usr/bin/python3 || (apt-get update && apt-get install -y python3)
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Ensure sudo and OpenSSH server are installed
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name:
|
||||||
|
- sudo
|
||||||
|
- openssh-server
|
||||||
|
state: present
|
||||||
|
update_cache: true
|
||||||
|
|
||||||
|
- name: Ensure SSH server is enabled and running
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: ssh
|
||||||
|
state: started
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Ensure .ssh directory exists for login user
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ ansible_user }}/.ssh"
|
||||||
|
state: directory
|
||||||
|
owner: "{{ ansible_user }}"
|
||||||
|
group: "{{ ansible_user }}"
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Add pvef root public key to login user's authorized_keys
|
||||||
|
ansible.posix.authorized_key:
|
||||||
|
user: "{{ ansible_user }}"
|
||||||
|
key: "{{ ansible_controller_pubkey }}"
|
||||||
|
state: present
|
||||||
|
manage_dir: true
|
||||||
|
|
||||||
|
- name: Allow bootstrap login user passwordless sudo
|
||||||
|
ansible.builtin.copy:
|
||||||
|
dest: "/etc/sudoers.d/90-ansible-{{ ansible_user }}"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0440"
|
||||||
|
content: |
|
||||||
|
{{ ansible_user }} ALL=(ALL) NOPASSWD:ALL
|
||||||
|
validate: "visudo -cf %s"
|
||||||
@@ -0,0 +1,16 @@
|
|||||||
|
---
|
||||||
|
- name: Configure /etc/hosts for Slurm cluster
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Add Slurm cluster hosts to /etc/hosts
|
||||||
|
ansible.builtin.blockinfile:
|
||||||
|
path: /etc/hosts
|
||||||
|
marker: "# {mark} ANSIBLE MANAGED SLURM CLUSTER HOSTS"
|
||||||
|
block: |
|
||||||
|
{{ slurm_control_addr }} {{ slurm_control_machine }}
|
||||||
|
{% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
|
||||||
|
{{ node.addr }} {{ node.name }}
|
||||||
|
{% endfor %}
|
||||||
@@ -0,0 +1,218 @@
|
|||||||
|
---
|
||||||
|
- name: Create slurmuser and generate SSH keys on every Slurm node
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
slurm_operator_shell: /bin/bash
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Ensure useful packages are installed
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name:
|
||||||
|
- sudo
|
||||||
|
- openssh-client
|
||||||
|
- openssh-server
|
||||||
|
- acl
|
||||||
|
state: present
|
||||||
|
update_cache: true
|
||||||
|
|
||||||
|
- name: Ensure slurmuser exists
|
||||||
|
ansible.builtin.user:
|
||||||
|
name: "{{ slurm_operator_user }}"
|
||||||
|
shell: "{{ slurm_operator_shell }}"
|
||||||
|
create_home: true
|
||||||
|
state: present
|
||||||
|
|
||||||
|
- name: Ensure .ssh directory exists for slurmuser
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh"
|
||||||
|
state: directory
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Generate SSH key for slurmuser if missing
|
||||||
|
community.crypto.openssh_keypair:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
|
||||||
|
type: ed25519
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0600"
|
||||||
|
comment: "{{ slurm_operator_user }}@{{ inventory_hostname }}"
|
||||||
|
force: false
|
||||||
|
|
||||||
|
- name: Read public key from each node
|
||||||
|
ansible.builtin.slurp:
|
||||||
|
src: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
|
||||||
|
register: slurmuser_pubkey_raw
|
||||||
|
|
||||||
|
- name: Store decoded public key as host fact
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
slurmuser_pubkey: "{{ slurmuser_pubkey_raw.content | b64decode | trim }}"
|
||||||
|
|
||||||
|
|
||||||
|
- name: Exchange slurmuser SSH keys across all Slurm nodes
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Install all slurmuser public keys into authorized_keys on every node
|
||||||
|
ansible.posix.authorized_key:
|
||||||
|
user: "{{ slurm_operator_user }}"
|
||||||
|
key: "{{ hostvars[item].slurmuser_pubkey }}"
|
||||||
|
state: present
|
||||||
|
manage_dir: true
|
||||||
|
loop: "{{ groups['slurm_cluster'] }}"
|
||||||
|
|
||||||
|
- name: Build SSH known_hosts entries for all cluster nodes
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -e
|
||||||
|
mkdir -p /home/{{ slurm_operator_user }}/.ssh
|
||||||
|
touch /home/{{ slurm_operator_user }}/.ssh/known_hosts
|
||||||
|
|
||||||
|
{% for host in groups['slurm_cluster'] %}
|
||||||
|
ssh-keyscan {{ host }} {{ hostvars[host].ansible_host | default(host) }} 2>/dev/null >> /home/{{ slurm_operator_user }}/.ssh/known_hosts || true
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
sort -u /home/{{ slurm_operator_user }}/.ssh/known_hosts -o /home/{{ slurm_operator_user }}/.ssh/known_hosts
|
||||||
|
chown {{ slurm_operator_user }}:{{ slurm_operator_user }} /home/{{ slurm_operator_user }}/.ssh/known_hosts
|
||||||
|
chmod 0644 /home/{{ slurm_operator_user }}/.ssh/known_hosts
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Ensure SSH permissions are correct
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh"
|
||||||
|
state: directory
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Ensure private key permissions are correct
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0600"
|
||||||
|
|
||||||
|
- name: Ensure public key permissions are correct
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0644"
|
||||||
|
|
||||||
|
|
||||||
|
- name: Configure sudo permissions for slurmuser
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Configure sudoers for slurmuser on Slurm controller
|
||||||
|
ansible.builtin.copy:
|
||||||
|
dest: /etc/sudoers.d/91-slurmuser-slurm-controller
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0440"
|
||||||
|
content: |
|
||||||
|
# Managed by Ansible
|
||||||
|
# Operator access for Slurm controller node.
|
||||||
|
{{ slurm_operator_user }} ALL=(root) NOPASSWD: \
|
||||||
|
/bin/systemctl status slurmctld, \
|
||||||
|
/bin/systemctl restart slurmctld, \
|
||||||
|
/bin/systemctl reload slurmctld, \
|
||||||
|
/bin/systemctl stop slurmctld, \
|
||||||
|
/bin/systemctl start slurmctld, \
|
||||||
|
/bin/systemctl status slurmd, \
|
||||||
|
/bin/systemctl restart slurmd, \
|
||||||
|
/bin/systemctl reload slurmd, \
|
||||||
|
/bin/systemctl stop slurmd, \
|
||||||
|
/bin/systemctl start slurmd, \
|
||||||
|
/bin/journalctl -u slurmctld, \
|
||||||
|
/bin/journalctl -u slurmd, \
|
||||||
|
/usr/bin/scontrol, \
|
||||||
|
/usr/bin/sinfo, \
|
||||||
|
/usr/bin/squeue, \
|
||||||
|
/usr/bin/scancel, \
|
||||||
|
/usr/bin/sacct, \
|
||||||
|
/usr/bin/sacctmgr, \
|
||||||
|
/usr/bin/sbatch, \
|
||||||
|
/usr/bin/srun, \
|
||||||
|
/usr/bin/salloc
|
||||||
|
validate: "visudo -cf %s"
|
||||||
|
when: inventory_hostname in groups['slurm_controller']
|
||||||
|
|
||||||
|
- name: Configure sudoers for slurmuser on Slurm compute and GPU nodes
|
||||||
|
ansible.builtin.copy:
|
||||||
|
dest: /etc/sudoers.d/91-slurmuser-slurm-compute
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0440"
|
||||||
|
content: |
|
||||||
|
# Managed by Ansible
|
||||||
|
# Operator access for Slurm worker/GPU nodes.
|
||||||
|
{{ slurm_operator_user }} ALL=(root) NOPASSWD: \
|
||||||
|
/bin/systemctl status slurmd, \
|
||||||
|
/bin/systemctl restart slurmd, \
|
||||||
|
/bin/systemctl reload slurmd, \
|
||||||
|
/bin/systemctl stop slurmd, \
|
||||||
|
/bin/systemctl start slurmd, \
|
||||||
|
/bin/journalctl -u slurmd, \
|
||||||
|
/usr/bin/scontrol, \
|
||||||
|
/usr/bin/sinfo, \
|
||||||
|
/usr/bin/squeue, \
|
||||||
|
/usr/bin/scancel, \
|
||||||
|
/usr/bin/sacct, \
|
||||||
|
/usr/bin/sbatch, \
|
||||||
|
/usr/bin/srun, \
|
||||||
|
/usr/bin/salloc
|
||||||
|
validate: "visudo -cf %s"
|
||||||
|
when: inventory_hostname not in groups['slurm_controller']
|
||||||
|
|
||||||
|
|
||||||
|
- name: Validate slurmuser SSH mesh and Slurm access
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Test local Slurm commands as slurmuser
|
||||||
|
ansible.builtin.command: "sudo -iu {{ slurm_operator_user }} sinfo"
|
||||||
|
register: sinfo_test
|
||||||
|
changed_when: false
|
||||||
|
failed_when: sinfo_test.rc != 0
|
||||||
|
|
||||||
|
- name: Show sinfo result
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: sinfo_test.stdout_lines
|
||||||
|
|
||||||
|
- name: Test SSH from each node to every other node as slurmuser
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -e
|
||||||
|
{% for host in groups['slurm_cluster'] %}
|
||||||
|
ssh -o BatchMode=yes -o ConnectTimeout=5 {{ host }} 'hostname'
|
||||||
|
{% endfor %}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
become_user: "{{ slurm_operator_user }}"
|
||||||
|
register: ssh_mesh_test
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show SSH mesh test result
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: ssh_mesh_test.stdout_lines
|
||||||
@@ -0,0 +1,112 @@
|
|||||||
|
---
|
||||||
|
- name: Fix sudo permissions for slurmuser Slurm operations
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Configure sudoers for slurmuser on controller
|
||||||
|
ansible.builtin.copy:
|
||||||
|
dest: /etc/sudoers.d/91-slurmuser-slurm-controller
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0440"
|
||||||
|
content: |
|
||||||
|
# Managed by Ansible
|
||||||
|
|
||||||
|
Cmnd_Alias SLURM_SYSTEMCTL_CONTROLLER = \
|
||||||
|
/bin/systemctl status slurmctld, \
|
||||||
|
/bin/systemctl status slurmctld *, \
|
||||||
|
/bin/systemctl restart slurmctld, \
|
||||||
|
/bin/systemctl reload slurmctld, \
|
||||||
|
/bin/systemctl start slurmctld, \
|
||||||
|
/bin/systemctl stop slurmctld, \
|
||||||
|
/bin/systemctl status slurmd, \
|
||||||
|
/bin/systemctl status slurmd *, \
|
||||||
|
/bin/systemctl restart slurmd, \
|
||||||
|
/bin/systemctl reload slurmd, \
|
||||||
|
/bin/systemctl start slurmd, \
|
||||||
|
/bin/systemctl stop slurmd, \
|
||||||
|
/usr/bin/systemctl status slurmctld, \
|
||||||
|
/usr/bin/systemctl status slurmctld *, \
|
||||||
|
/usr/bin/systemctl restart slurmctld, \
|
||||||
|
/usr/bin/systemctl reload slurmctld, \
|
||||||
|
/usr/bin/systemctl start slurmctld, \
|
||||||
|
/usr/bin/systemctl stop slurmctld, \
|
||||||
|
/usr/bin/systemctl status slurmd, \
|
||||||
|
/usr/bin/systemctl status slurmd *, \
|
||||||
|
/usr/bin/systemctl restart slurmd, \
|
||||||
|
/usr/bin/systemctl reload slurmd, \
|
||||||
|
/usr/bin/systemctl start slurmd, \
|
||||||
|
/usr/bin/systemctl stop slurmd
|
||||||
|
|
||||||
|
Cmnd_Alias SLURM_JOURNAL_CONTROLLER = \
|
||||||
|
/bin/journalctl -u slurmctld, \
|
||||||
|
/bin/journalctl -u slurmctld *, \
|
||||||
|
/bin/journalctl -u slurmd, \
|
||||||
|
/bin/journalctl -u slurmd *, \
|
||||||
|
/usr/bin/journalctl -u slurmctld, \
|
||||||
|
/usr/bin/journalctl -u slurmctld *, \
|
||||||
|
/usr/bin/journalctl -u slurmd, \
|
||||||
|
/usr/bin/journalctl -u slurmd *
|
||||||
|
|
||||||
|
Cmnd_Alias SLURM_COMMANDS = \
|
||||||
|
/usr/bin/scontrol, /usr/bin/scontrol *, \
|
||||||
|
/usr/bin/sinfo, /usr/bin/sinfo *, \
|
||||||
|
/usr/bin/squeue, /usr/bin/squeue *, \
|
||||||
|
/usr/bin/scancel, /usr/bin/scancel *, \
|
||||||
|
/usr/bin/sacct, /usr/bin/sacct *, \
|
||||||
|
/usr/bin/sacctmgr, /usr/bin/sacctmgr *, \
|
||||||
|
/usr/bin/sbatch, /usr/bin/sbatch *, \
|
||||||
|
/usr/bin/srun, /usr/bin/srun *, \
|
||||||
|
/usr/bin/salloc, /usr/bin/salloc *
|
||||||
|
|
||||||
|
{{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_CONTROLLER, SLURM_JOURNAL_CONTROLLER, SLURM_COMMANDS
|
||||||
|
validate: "visudo -cf %s"
|
||||||
|
when: inventory_hostname in groups['slurm_controller']
|
||||||
|
|
||||||
|
- name: Configure sudoers for slurmuser on compute and GPU nodes
|
||||||
|
ansible.builtin.copy:
|
||||||
|
dest: /etc/sudoers.d/91-slurmuser-slurm-compute
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0440"
|
||||||
|
content: |
|
||||||
|
# Managed by Ansible
|
||||||
|
|
||||||
|
Cmnd_Alias SLURM_SYSTEMCTL_COMPUTE = \
|
||||||
|
/bin/systemctl status slurmd, \
|
||||||
|
/bin/systemctl status slurmd *, \
|
||||||
|
/bin/systemctl restart slurmd, \
|
||||||
|
/bin/systemctl reload slurmd, \
|
||||||
|
/bin/systemctl start slurmd, \
|
||||||
|
/bin/systemctl stop slurmd, \
|
||||||
|
/usr/bin/systemctl status slurmd, \
|
||||||
|
/usr/bin/systemctl status slurmd *, \
|
||||||
|
/usr/bin/systemctl restart slurmd, \
|
||||||
|
/usr/bin/systemctl reload slurmd, \
|
||||||
|
/usr/bin/systemctl start slurmd, \
|
||||||
|
/usr/bin/systemctl stop slurmd
|
||||||
|
|
||||||
|
Cmnd_Alias SLURM_JOURNAL_COMPUTE = \
|
||||||
|
/bin/journalctl -u slurmd, \
|
||||||
|
/bin/journalctl -u slurmd *, \
|
||||||
|
/usr/bin/journalctl -u slurmd, \
|
||||||
|
/usr/bin/journalctl -u slurmd *
|
||||||
|
|
||||||
|
Cmnd_Alias SLURM_COMMANDS = \
|
||||||
|
/usr/bin/scontrol, /usr/bin/scontrol *, \
|
||||||
|
/usr/bin/sinfo, /usr/bin/sinfo *, \
|
||||||
|
/usr/bin/squeue, /usr/bin/squeue *, \
|
||||||
|
/usr/bin/scancel, /usr/bin/scancel *, \
|
||||||
|
/usr/bin/sacct, /usr/bin/sacct *, \
|
||||||
|
/usr/bin/sbatch, /usr/bin/sbatch *, \
|
||||||
|
/usr/bin/srun, /usr/bin/srun *, \
|
||||||
|
/usr/bin/salloc, /usr/bin/salloc *
|
||||||
|
|
||||||
|
{{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_COMPUTE, SLURM_JOURNAL_COMPUTE, SLURM_COMMANDS
|
||||||
|
validate: "visudo -cf %s"
|
||||||
|
when: inventory_hostname not in groups['slurm_controller']
|
||||||
@@ -0,0 +1,133 @@
|
|||||||
|
---
|
||||||
|
- name: Read Munge key from Slurm controller
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Check controller munge.key exists
|
||||||
|
ansible.builtin.stat:
|
||||||
|
path: /etc/munge/munge.key
|
||||||
|
register: controller_munge_key
|
||||||
|
|
||||||
|
- name: Fail if controller munge.key is missing
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "/etc/munge/munge.key is missing on controller. Do not continue."
|
||||||
|
when: not controller_munge_key.stat.exists
|
||||||
|
|
||||||
|
- name: Read controller munge.key
|
||||||
|
ansible.builtin.slurp:
|
||||||
|
src: /etc/munge/munge.key
|
||||||
|
register: controller_munge_key_raw
|
||||||
|
|
||||||
|
- name: Store controller Munge key as fact
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
|
||||||
|
|
||||||
|
|
||||||
|
- name: Deploy controller Munge key to all Slurm nodes
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
controller_host: "{{ groups['slurm_controller'][0] }}"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Ensure munge package is installed
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name:
|
||||||
|
- munge
|
||||||
|
- libmunge2
|
||||||
|
state: present
|
||||||
|
update_cache: true
|
||||||
|
|
||||||
|
- name: Ensure munge group exists
|
||||||
|
ansible.builtin.group:
|
||||||
|
name: munge
|
||||||
|
system: true
|
||||||
|
state: present
|
||||||
|
|
||||||
|
- name: Ensure munge user exists
|
||||||
|
ansible.builtin.user:
|
||||||
|
name: munge
|
||||||
|
group: munge
|
||||||
|
system: true
|
||||||
|
shell: /usr/sbin/nologin
|
||||||
|
home: /nonexistent
|
||||||
|
create_home: false
|
||||||
|
state: present
|
||||||
|
|
||||||
|
- name: Ensure /etc/munge exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /etc/munge
|
||||||
|
state: directory
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Deploy shared munge.key from controller
|
||||||
|
ansible.builtin.copy:
|
||||||
|
dest: /etc/munge/munge.key
|
||||||
|
content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "0400"
|
||||||
|
notify:
|
||||||
|
- Restart munge
|
||||||
|
|
||||||
|
- name: Ensure /var/log/munge exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/log/munge
|
||||||
|
state: directory
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure /var/lib/munge exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/lib/munge
|
||||||
|
state: directory
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "0711"
|
||||||
|
|
||||||
|
- name: Ensure /run/munge exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /run/munge
|
||||||
|
state: directory
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure munge is enabled and running
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
enabled: true
|
||||||
|
state: started
|
||||||
|
|
||||||
|
handlers:
|
||||||
|
- name: Restart munge
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
|
||||||
|
- name: Validate Munge locally on all nodes
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Test local munge encode/decode
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
munge -n | unmunge
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: munge_local_test
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show local Munge validation
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: munge_local_test.stdout_lines
|
||||||
@@ -0,0 +1,132 @@
|
|||||||
|
---
|
||||||
|
- name: Prepare Slurm config directories and logs
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Ensure Slurm config directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ slurm_config_dir }}"
|
||||||
|
state: directory
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure Slurm log directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/log/slurm
|
||||||
|
state: directory
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure slurmctld spool directory exists on controller
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/spool/slurmctld
|
||||||
|
state: directory
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0755"
|
||||||
|
when: inventory_hostname in groups['slurm_controller']
|
||||||
|
|
||||||
|
- name: Ensure slurmd spool directory exists on workers
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/spool/slurmd
|
||||||
|
state: directory
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0755"
|
||||||
|
when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
|
||||||
|
|
||||||
|
|
||||||
|
- name: Deploy Slurm config files
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Backup current slurm.conf before managed deployment
|
||||||
|
ansible.builtin.copy:
|
||||||
|
src: "{{ slurm_config_dir }}/slurm.conf"
|
||||||
|
dest: "{{ slurm_config_dir }}/slurm.conf.pre-ansible-managed"
|
||||||
|
remote_src: true
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
force: false
|
||||||
|
|
||||||
|
- name: Deploy managed slurm.conf
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/slurm.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/slurm.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
notify:
|
||||||
|
- Reconfigure slurmctld
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
- name: Deploy managed cgroup.conf
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/cgroup.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/cgroup.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
when: slurm_enable_cgroup | default(false) | bool
|
||||||
|
notify:
|
||||||
|
- Reconfigure slurmctld
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
- name: Deploy managed gres.conf only on GPU nodes
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/gres.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/gres.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
when: inventory_hostname in groups['slurm_gpu']
|
||||||
|
notify:
|
||||||
|
- Reconfigure slurmctld
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
handlers:
|
||||||
|
- name: Reconfigure slurmctld
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol reconfigure
|
||||||
|
when: inventory_hostname in groups['slurm_controller']
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Restart slurmd
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
state: restarted
|
||||||
|
when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
|
||||||
|
|
||||||
|
|
||||||
|
- name: Validate Slurm after config deployment
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Reconfigure controller
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol reconfigure
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Validate cluster state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
scontrol ping
|
||||||
|
sinfo
|
||||||
|
scontrol show nodes
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: slurm_config_validation
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show validation output
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: slurm_config_validation.stdout_lines
|
||||||
@@ -0,0 +1,103 @@
|
|||||||
|
---
|
||||||
|
- name: Restart Slurm controller safely
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Restart munge on controller
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Restart slurmctld on controller
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmctld
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Wait for slurmctld to answer
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol ping
|
||||||
|
register: scontrol_ping
|
||||||
|
retries: 15
|
||||||
|
delay: 2
|
||||||
|
until: scontrol_ping.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show controller ping
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: scontrol_ping.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Restart Slurm workers safely one by one
|
||||||
|
hosts: slurm_compute:slurm_gpu
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
serial: 1
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Restart munge on worker
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Restart slurmd on worker
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Wait for slurmd to be active
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: systemctl is-active slurmd
|
||||||
|
register: slurmd_active
|
||||||
|
retries: 15
|
||||||
|
delay: 2
|
||||||
|
until: slurmd_active.stdout == "active"
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Wait until this node is visible in Slurm
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol show node {{ inventory_hostname }}
|
||||||
|
delegate_to: "{{ groups['slurm_controller'][0] }}"
|
||||||
|
register: node_visible
|
||||||
|
retries: 15
|
||||||
|
delay: 2
|
||||||
|
until: node_visible.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
|
||||||
|
- name: Validate Slurm after restart
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Validate Slurm cluster state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
echo "### scontrol ping"
|
||||||
|
scontrol ping
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### sinfo"
|
||||||
|
sinfo
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### nodes"
|
||||||
|
scontrol show nodes
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### partitions"
|
||||||
|
scontrol show partitions
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: slurm_validation
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show Slurm validation
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: slurm_validation.stdout_lines
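The `Wait for slurmctld to answer` and `Wait for slurmd to be active` tasks above poll a command a bounded number of times with a fixed delay. The same pattern can be sketched in plain shell for manual checks outside Ansible. `retry` is a hypothetical helper name, and the attempt count and delay in the usage line simply mirror the playbook values.

```shell
# Hypothetical helper: run a command until it succeeds or attempts run out.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS command [args...]
retry() {
  retry_max="$1"
  retry_delay="$2"
  shift 2
  retry_n=1
  while ! "$@"; do
    if [ "$retry_n" -ge "$retry_max" ]; then
      return 1                      # give up after the final failed attempt
    fi
    retry_n=$((retry_n + 1))
    sleep "$retry_delay"
  done
}
```

For example, `retry 15 2 scontrol ping` would poll the controller for up to 15 attempts, two seconds apart.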
|
||||||
@@ -0,0 +1,40 @@
|
|||||||
|
---
|
||||||
|
- name: Discover node resources for Slurm config
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Discover CPU and memory
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
echo "HOST={{ inventory_hostname }}"
|
||||||
|
echo "CPUS=$(nproc)"
|
||||||
|
echo "REAL_MEMORY_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)"
|
||||||
|
echo "SOCKETS=$(lscpu | awk -F: '/Socket\(s\)/ {gsub(/ /,"",$2); print $2}')"
|
||||||
|
echo "CORES_PER_SOCKET=$(lscpu | awk -F: '/Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')"
|
||||||
|
echo "THREADS_PER_CORE=$(lscpu | awk -F: '/Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: cpu_mem
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Discover NVIDIA GPU if present
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
if command -v nvidia-smi >/dev/null 2>&1; then
|
||||||
|
nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
|
||||||
|
else
|
||||||
|
echo "NO_NVIDIA_SMI"
|
||||||
|
fi
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: gpu_info
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show discovered resources
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
- "{{ cpu_mem.stdout_lines }}"
|
||||||
|
- "GPU:"
|
||||||
|
- "{{ gpu_info.stdout_lines }}"
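The `Discover CPU and memory` task above pulls individual fields out of `lscpu` output with regex matches. A possible variation is sketched here: it matches the field label exactly, which avoids accidental hits on similar labels and tolerates the indented labels that some `lscpu` versions print. `lscpu_field` is a hypothetical helper name, and the sketch assumes the usual `Label: value` layout.

```shell
# Hypothetical helper: read lscpu output on stdin and print the value of
# the field whose label equals $1 exactly (for example "Socket(s)").
lscpu_field() {
  awk -F: -v key="$1" '
    {
      label = $1
      gsub(/^[ \t]+|[ \t]+$/, "", label)   # trim indentation around the label
    }
    label == key && !seen++ {
      value = $2
      gsub(/^[ \t]+|[ \t]+$/, "", value)   # trim padding around the value
      print value
    }
  '
}
```

It could be used as `lscpu | lscpu_field 'Socket(s)'`. An empty result means the label was not present, which the caller should treat as unknown rather than zero.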
|
||||||
@@ -0,0 +1,89 @@
---
- name: Inspect current Slurm and Munge state
  hosts: slurm_cluster
  become: true
  gather_facts: true

  tasks:
    - name: Basic host info
      ansible.builtin.shell: |
        set -e
        echo "HOST=$(hostname -f 2>/dev/null || hostname)"
        echo "SHORT_HOST=$(hostname -s)"
        echo "IP_ADDRESSES=$(hostname -I)"
        echo "OS=$(lsb_release -ds 2>/dev/null || cat /etc/os-release | grep PRETTY_NAME || true)"
        echo "KERNEL=$(uname -r)"
      args:
        executable: /bin/bash
      register: host_info
      changed_when: false

    - name: Slurm package info
      ansible.builtin.shell: |
        dpkg -l | grep -Ei 'slurm|munge' || true
      args:
        executable: /bin/bash
      register: package_info
      changed_when: false

    - name: Slurm config paths
      ansible.builtin.shell: |
        set -e
        for p in /etc/slurm /etc/slurm-llnl /etc/munge; do
          echo "### $p"
          if [ -e "$p" ]; then
            find "$p" -maxdepth 2 -type f -printf "%m %u %g %p\n" | sort
          else
            echo "MISSING"
          fi
        done
      args:
        executable: /bin/bash
      register: config_paths
      changed_when: false

    - name: Service state
      ansible.builtin.shell: |
        for s in munge slurmctld slurmd; do
          echo "### $s"
          systemctl is-enabled "$s" 2>/dev/null || true
          systemctl is-active "$s" 2>/dev/null || true
        done
      args:
        executable: /bin/bash
      register: service_state
      changed_when: false

    - name: Slurm commands
      ansible.builtin.shell: |
        echo "### which"
        command -v sinfo || true
        command -v scontrol || true
        command -v sbatch || true
        command -v srun || true
        command -v munge || true
        command -v unmunge || true

        echo "### sinfo"
        sinfo 2>&1 || true

        echo "### scontrol ping"
        scontrol ping 2>&1 || true
      args:
        executable: /bin/bash
      register: slurm_commands
      changed_when: false

    - name: Show inspection report
      ansible.builtin.debug:
        msg:
          - "===== {{ inventory_hostname }} :: host_info ====="
          - "{{ host_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: packages ====="
          - "{{ package_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: config_paths ====="
          - "{{ config_paths.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: services ====="
          - "{{ service_state.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: slurm_commands ====="
          - "{{ slurm_commands.stdout_lines }}"
@@ -0,0 +1,216 @@
---
- name: Detect problematic Slurm nodes
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Detect nodes needing remediation
      ansible.builtin.shell: |
        set -euo pipefail

        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: bad_nodes_raw
      changed_when: false

    - name: Store bad node list
      ansible.builtin.set_fact:
        bad_nodes: "{{ bad_nodes_raw.stdout_lines }}"

    - name: Show detected problematic nodes
      ansible.builtin.debug:
        var: bad_nodes


- name: Attempt auto-remediation on problematic nodes
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1

  vars:
    bad_nodes_from_controller: "{{ hostvars[groups['slurm_controller'][0]].bad_nodes | default([]) }}"

  tasks:
    - name: Skip healthy nodes
      ansible.builtin.meta: end_host
      when: inventory_hostname not in bad_nodes_from_controller

    - name: Restart Munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Validate local services after remediation attempt
      ansible.builtin.shell: |
        set -euo pipefail

        echo "HOST=$(hostname)"

        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd

        echo
        echo "### munge"
        munge -n | unmunge >/dev/null
        echo "munge OK"

        echo
        echo "### controller ping"
        scontrol ping

        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true

        echo
        echo "### recent slurmd logs"
        journalctl -u slurmd -n 30 --no-pager || true
      args:
        executable: /bin/bash
      register: local_repair_check
      changed_when: false

    - name: Print local remediation result
      ansible.builtin.debug:
        var: local_repair_check.stdout_lines


- name: Refresh controller and validate remediated nodes
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Restart slurmctld to refresh node states
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted

    - name: Wait for controller
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false

    - name: Clear maintenance state on previously bad nodes
      ansible.builtin.shell: |
        set -euo pipefail

        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"

        if [ -z "$bad_nodes" ]; then
          echo "No bad nodes detected. Nothing to clear."
          sinfo -N
          exit 0
        fi

        for node in $bad_nodes; do
          echo "### clearing state on $node"
          scontrol update NodeName="$node" State=RESUME 2>/dev/null || true
          scontrol update NodeName="$node" State=UNDRAIN 2>/dev/null || true
          scontrol update NodeName="$node" State=IDLE 2>/dev/null || true
        done

        sleep 5
        sinfo -N
      args:
        executable: /bin/bash
      register: clear_result
      changed_when: true

    - name: Print clear-state result
      ansible.builtin.debug:
        var: clear_result.stdout_lines

    - name: Detect nodes still unhealthy after remediation
      ansible.builtin.shell: |
        set -euo pipefail

        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: still_bad_nodes_raw
      changed_when: false

    - name: Store still bad nodes
      ansible.builtin.set_fact:
        still_bad_nodes: "{{ still_bad_nodes_raw.stdout_lines }}"

    - name: Drain nodes that remain unhealthy
      ansible.builtin.shell: |
        set -euo pipefail

        unresolved_nodes="{{ still_bad_nodes | join(' ') }}"

        if [ -z "$unresolved_nodes" ]; then
          echo "No unresolved unhealthy nodes."
          sinfo -N
          exit 0
        fi

        for node in $unresolved_nodes; do
          echo "### draining unresolved node $node"
          scontrol update NodeName="$node" State=DRAIN Reason="auto-remediation failed"
        done

        sinfo -N
      args:
        executable: /bin/bash
      register: drain_unresolved
      changed_when: still_bad_nodes | length > 0

    - name: Show remediation summary
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### initial bad nodes"
        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
        if [ -z "$bad_nodes" ]; then
          echo "none"
        else
          printf '%s\n' $bad_nodes
        fi

        echo
        echo "### still bad nodes"
        still_bad_nodes="{{ (still_bad_nodes | default([])) | join(' ') }}"
        if [ -z "$still_bad_nodes" ]; then
          echo "none"
        else
          printf '%s\n' $still_bad_nodes
        fi

        echo
        echo "### final sinfo"
        sinfo -N

        echo
        echo "### queue"
        squeue
      args:
        executable: /bin/bash
      register: remediation_summary
      changed_when: false

    - name: Print remediation summary
      ansible.builtin.debug:
        var: remediation_summary.stdout_lines
@@ -0,0 +1,149 @@
---
- name: Check Slurm controller health
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Check controller services and cluster state
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### controller services"
        systemctl is-active munge
        systemctl is-active slurmctld
        systemctl is-active slurmdbd || true
        systemctl is-active mariadb || true

        echo
        echo "### slurm ping"
        scontrol ping

        echo
        echo "### nodes"
        sinfo -N

        echo
        echo "### partitions"
        sinfo

        echo
        echo "### queue"
        squeue

        echo
        echo "### problematic nodes"
        sinfo -N -h -o "%N %T %E" | awk '$2 !~ /idle|alloc|mix/ {print}' || true

        echo
        echo "### accounting"
        sacctmgr -n list cluster || true

        echo
        echo "### recent failed jobs"
        sacct -S today --state=FAILED,CANCELLED,TIMEOUT,NODE_FAIL,OUT_OF_MEMORY \
          --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,NodeList | tail -30 || true
      args:
        executable: /bin/bash
      register: controller_health
      changed_when: false

    - name: Print controller health
      ansible.builtin.debug:
        var: controller_health.stdout_lines


- name: Check Slurm worker health
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true

  tasks:
    - name: Check worker services, config and connectivity
      ansible.builtin.shell: |
        set -euo pipefail

        echo "HOST=$(hostname)"
        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
        echo "KERNEL=$(uname -r)"
        echo "UPTIME=$(uptime -p)"

        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd

        echo
        echo "### munge local test"
        munge -n | unmunge >/dev/null
        echo "munge OK"

        echo
        echo "### controller connectivity"
        getent hosts slurm-ctl01 || true
        scontrol ping

        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true

        echo
        echo "### config checksums"
        sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true

        echo
        echo "### shared filesystem"
        test -d /shared
        touch /shared/.slurm-health-$(hostname)
        ls -l /shared/.slurm-health-$(hostname)
        rm -f /shared/.slurm-health-$(hostname)

        echo
        echo "### cgroup"
        mount | grep cgroup || true

        echo
        echo "### gpu check"
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,driver_version,memory.total,temperature.gpu,utilization.gpu --format=csv,noheader || true
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: worker_health
      changed_when: false

    - name: Print worker health
      ansible.builtin.debug:
        var: worker_health.stdout_lines


- name: Check Slurm-reported node state consistency
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Build Slurm node health summary
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### node summary"
        sinfo -N -o "%N %P %T %C %m %G %E"

        echo
        echo "### full problematic node details"
        for node in $(sinfo -N -h -o "%N %T" | awk '$2 ~ /down|drain|fail|unk|not_responding|idle\*/ {print $1}' | sort -u); do
          echo
          echo "### $node"
          scontrol show node "$node"
        done
      args:
        executable: /bin/bash
      register: slurm_node_summary
      changed_when: false

    - name: Print Slurm node summary
      ansible.builtin.debug:
        var: slurm_node_summary.stdout_lines
Some files were not shown because too many files have changed in this diff.