Compare commits: 83877fb598...main (4 commits)

| SHA1 |
|---|
| 4e739c5c99 |
| 8cb92de06f |
| 1843796e92 |
| cd6830334b |
@@ -4,6 +4,8 @@

 ### Added

+- Added Linux Fresh Setup Toolkit under `labs/linux/setup` for day-0 Ubuntu lab host bootstrap automation.
+- Added AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation.
 - Python tooling validation for operational scripts.
 - `incident-log-summary` for general incident log summarization.
 - `log-diff-checker` for pre-change and post-change log comparison.
@@ -36,6 +38,7 @@
 - IBM AIX 7 role and playbook.
 - Shared sanitized Ansible inventory defaults for Linux and AIX examples.
 - Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
+- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.

 ### Changed

@@ -42,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
 - [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
 - [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
 - [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
+- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.

 ## Planned Areas

@@ -106,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
 - Veritas VxVM/VCS operational awareness.
 - GPFS / IBM Spectrum Scale operational awareness.
 - Ansible role organization for selected hardening controls.
+- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
 - Clear documentation of what was tested and what still needs a real system.
@@ -10,6 +10,11 @@ Current subdirectories are planning areas unless their own README documents a ru
 - `ci-cd`
 - `docker`

+## Linux operations labs
+
+- [Linux Fresh Setup Toolkit](./linux/setup/) - Bootstrap automation for fresh Ubuntu lab hosts, including shell profile, Cockpit, Docker, libvirt/KVM, NVIDIA diagnostics, tuning and safe baseline defaults.
+- [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety and systemd timers.
+
 Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready.

 Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/).
@@ -0,0 +1,308 @@
# AI Lab Maintenance Toolkit

## Executive summary

The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
configuration backup, and non-destructive VM inventory into a small toolkit
that is readable enough for review and guarded enough for homelab use.

This is a portfolio and lab implementation, not evidence of production
certification. Review package policy, backup coverage, maintenance windows, and
application impact before deploying it to another host.

## Problem solved

AI lab hosts accumulate operating system packages, kernel packages, container
images, build cache, journals, and configuration changes while also carrying
stateful workloads. Manual maintenance is easy to defer and risky to perform
without evidence. This project provides scheduled, logged tasks with explicit
safety boundaries and separate read-only audit commands.

## What this demonstrates

- Bash strict mode, input validation, dependency checks, and operational exit
  codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.

## Architecture and directory layout

```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
```

The installer deploys scripts to `/usr/local/sbin` and units to
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
through an additional framework.

## Maintenance tasks

| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |

## Safety model

Change-capable scripts default to dry-run behavior. Manual execution requires
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
use `--execute --non-interactive`; installing and enabling those reviewed unit
files is the explicit authorization for scheduled maintenance.

Exit codes follow the repository convention:

- `0`: completed successfully or an optional component was absent.
- `1`: an operational check or maintenance action failed.
- `2`: invalid input, missing required dependency, or insufficient privilege.

The scripts do not bypass APT or Docker locks, delete VM resources, manually
select kernel names for removal, or hide command failures.
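
The gate described in this section can be sketched as a small pure function. This is a minimal illustration under stated assumptions, not the toolkit's own code: `resolve_mode` and `confirmed` are hypothetical helper names, and the sketch avoids Bash-only syntax so it can be exercised without root, APT, or Docker.

```bash
# Sketch only: resolve_mode and confirmed are hypothetical names.
# resolve_mode prints the mode implied by the flags on stdout:
#   dry-run      no flags; nothing may change
#   interactive  --execute; the operator must still type EXECUTE
#   scheduled    --execute --non-interactive; what a reviewed unit would pass
# It returns 2 for invalid input, following the exit-code convention above.
resolve_mode() {
  local execute=false non_interactive=false
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --execute) execute=true ;;
      --non-interactive) non_interactive=true ;;
      *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; return 2 ;;
    esac
    shift
  done
  # --non-interactive alone is rejected so it can never skip the dry-run default.
  if [ "$non_interactive" = true ] && [ "$execute" != true ]; then
    printf 'CRITICAL: --non-interactive requires --execute\n' >&2
    return 2
  fi
  if [ "$execute" != true ]; then
    printf 'dry-run\n'
  elif [ "$non_interactive" = true ]; then
    printf 'scheduled\n'
  else
    printf 'interactive\n'
  fi
}

# Returns 0 only when the operator typed the literal word EXECUTE.
confirmed() {
  [ "$1" = "EXECUTE" ]
}
```

A caller would branch on the printed mode and, in `interactive` mode, read one line from the operator and pass it to `confirmed` before changing anything. Keeping the decision in one function makes the three states easy to test in isolation.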

## Installation

Review every script and unit first. Installation changes package state,
journald settings, Docker daemon settings when Docker exists, and enabled timer
state.

```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```

The installer:

1. Installs the documented Ubuntu utilities.
2. Deploys scripts and systemd units with fixed permissions.
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
4. Restarts `systemd-journald`.
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
   with `jq`, and attempts a Docker restart.
6. Enables all five timers.
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.

The installer is intended for Ubuntu 26.04. It is not run automatically by
repository validation.

## Manual commands

Read-only reports:

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```

Preview maintenance:

```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```

Apply reviewed maintenance interactively:

```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```

`--non-interactive` is reserved for reviewed automation and is rejected unless
`--execute` is also present.

## Systemd timers

| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |

All timers use `Persistent=true`, so a missed event runs after the host becomes
available. Inspect timer and service evidence with:

```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```
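
Because the catch-up behavior depends on every timer keeping `Persistent=true`, one option is a read-only check over the unit files before they are installed. The sketch below is an assumption-laden illustration rather than part of the toolkit: `timer_is_persistent` is a hypothetical helper, and it only inspects the `[Timer]` section of a single file, without resolving drop-ins the way systemd does.

```bash
# Sketch only: timer_is_persistent is a hypothetical helper name.
# Returns 0 when the last Persistent= assignment inside the [Timer] section
# of the given unit file is a true value (true, yes, on, or 1).
timer_is_persistent() {
  awk '
    # Track which section we are in; only [Timer] settings count.
    /^\[/ { in_timer = ($0 == "[Timer]"); next }
    in_timer && /^Persistent[ \t]*=/ {
      value = $0
      sub(/^Persistent[ \t]*=[ \t]*/, "", value)
      sub(/[ \t]+$/, "", value)
      # A later assignment overrides an earlier one.
      found = (value ~ /^(true|yes|on|1)$/)
    }
    END { exit (found ? 0 : 1) }
  ' "$1"
}

# Example loop over the repository layout shown earlier:
#   for unit in systemd/*.timer; do
#     timer_is_persistent "$unit" || printf 'WARNING: %s is not persistent\n' "$unit"
#   done
```

On an installed host, `systemctl show -p Persistent <timer>` reports the value systemd actually loaded, which is the more authoritative check.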

## Logs

Scheduled and manual maintenance writes to:

```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```

systemd also records service output in the journal. Logrotate is installed as a
dependency, but this lab does not create a custom rotation policy for these
small maintenance logs.

## Docker policy

Docker cleanup runs `docker system prune -af` and removes build cache older
than seven days. It never passes `--volumes`. Named and anonymous volumes
remain outside this automated policy and require application-aware review.

The installer configures the `json-file` driver with a maximum size of `50m`
and five files. Existing valid JSON is backed up and merged. Invalid JSON
causes installation to stop rather than overwrite operator configuration.

## Kernel policy

Kernel removal is delegated to `apt autoremove --purge`; package names are not
constructed or purged with regular expressions. Before execution, the script
logs the APT simulation and refuses cleanup unless at least two installed
versioned kernel image packages remain after simulated removals.

This protects a fallback kernel while preserving Ubuntu dependency policy.
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
Secure Boot state, and the simulated removal set before manual execution.
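
The "at least two kernels remain" guard can be illustrated with a short sketch. It is not the toolkit's implementation; it assumes that the captured APT simulation marks removals with lines beginning `Remv` or `Purg`, and that versioned kernel image packages are named `linux-image-` followed by a digit. `count_remaining_kernels` and `kernel_cleanup_allowed` are hypothetical names. The regular expression is used only to count packages, never to choose what gets removed.

```bash
# Sketch only: both function names are hypothetical.
# count_remaining_kernels INSTALLED_FILE SIMULATION_FILE
#   INSTALLED_FILE   one installed package name per line
#   SIMULATION_FILE  captured output of an APT autoremove simulation
# Prints how many versioned kernel image packages would survive the removal.
count_remaining_kernels() {
  awk '
    # First input (the simulation): remember kernel images marked for removal.
    FILENAME == ARGV[1] {
      if (($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/) removed[$2] = 1
      next
    }
    # Second input (installed packages): count kernel images not being removed.
    $1 ~ /^linux-image-[0-9]/ && !($1 in removed) { remaining++ }
    END { print remaining + 0 }
  ' "$2" "$1"
}

# Returns 0 only when at least two versioned kernel images would remain.
kernel_cleanup_allowed() {
  local remaining
  remaining="$(count_remaining_kernels "$1" "$2")" || return 1
  [ "$remaining" -ge 2 ]
}
```

Working from captured files keeps the decision reviewable: the same simulation text that is written to the log is the text the guard evaluates.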

## Backup policy

Backups are written to `/srv/backups/ailab-config` as
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
deleted only after a new archive is created.

The backup covers `/etc`, selected root shell configuration,
`/opt/ailab-maintenance` when present, and libvirt configuration under
`/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data,
Ollama models, VM disk images, or other large application datasets. Because
`/etc` is included, explicitly listed configuration subdirectories are already
covered even when optional-path reporting mentions them separately.

This is a local configuration backup, not a disaster-recovery design. A real
deployment should copy archives to independently protected storage and test
restoration.
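
The ordering rule, retention only after a successful archive, can be tried safely against a scratch directory. The following sketch is illustrative and makes its own assumptions: `create_and_rotate` is a hypothetical helper, it archives a single source directory instead of the path list described above, and it relies on `find` supporting `-maxdepth` and `-delete`.

```bash
# Sketch only: create_and_rotate is a hypothetical helper name.
# create_and_rotate SOURCE_DIR BACKUP_DIR RETENTION_DAYS
# Creates a timestamped archive of SOURCE_DIR, then prunes matching archives
# older than RETENTION_DAYS. Prints the new archive path on success.
create_and_rotate() {
  local source_dir="$1" backup_dir="$2" retention_days="$3" archive
  archive="$backup_dir/ailab-config-$(date '+%Y%m%d-%H%M%S').tar.gz"
  mkdir -p "$backup_dir" || return 1
  # Stop here if tar fails, so a failed backup never shrinks the set of
  # existing archives.
  tar -czf "$archive" -C "$source_dir" . || return 1
  # Only files matching the archive naming pattern are eligible for deletion.
  find "$backup_dir" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' \
    -mtime "+$retention_days" -delete
  printf '%s\n' "$archive"
}
```

For example, `create_and_rotate /tmp/demo-src /tmp/demo-backups 30` exercises the same create-then-prune sequence without touching `/srv/backups`; both paths are placeholders.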

## Journald policy

The installer applies:

```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```

These settings bound journal growth while retaining useful troubleshooting
evidence. Capacity and retention should be adjusted to the host's disk size
and incident-response requirements.

## Disk watch policy

The disk check uses `df -P`, defaults to an 85 percent threshold, and returns
`1` when any checked filesystem meets or exceeds the threshold. Override the
threshold for a manual or unit invocation with:

```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```

The script reports every filesystem as `OK` or `WARNING`; it does not delete
data or attempt remediation.
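
The classification step can be separated from the `df` call so that it is testable against captured output. This is a sketch under assumptions, not the shipped script: `classify_df` is a hypothetical name, the message wording is illustrative, and unlike a fixed six-column read it rejoins mount points that contain spaces.

```bash
# Sketch only: classify_df is a hypothetical helper name.
# classify_df THRESHOLD
# Reads `df -P` output (header included) on stdin, prints one OK or WARNING
# line per filesystem, and returns 1 if any filesystem is at or above
# THRESHOLD percent or cannot be parsed.
classify_df() {
  awk -v threshold="$1" '
    NR == 1 { next }                      # skip the df -P header row
    {
      usage = $5
      sub(/%$/, "", usage)
      mount = $6                          # rejoin mount points with spaces
      for (i = 7; i <= NF; i++) mount = mount " " $i
      if (usage !~ /^[0-9]+$/) {
        printf "WARNING: unable to parse usage for %s mounted on %s\n", $1, mount
        failed = 1
      } else if (usage + 0 >= threshold + 0) {
        printf "WARNING: %s mounted on %s is %s%% used; threshold=%s%%\n", $1, mount, usage, threshold
        failed = 1
      } else {
        printf "OK: %s mounted on %s is %s%% used\n", $1, mount, usage
      }
    }
    END { exit (failed ? 1 : 0) }
  '
}
```

A live invocation would look like `df -P -x tmpfs -x devtmpfs | classify_df "${AILAB_DISK_THRESHOLD:-85}"`, with the return code mapped directly to the documented `0` or `1`.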

## Example operational workflows

### Weekly maintenance review

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```

Review the kernel simulation, Docker usage, failed units, backup freshness, and
disk warnings before approving manual changes.

### Disk pressure investigation

```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```

Use the evidence to identify ownership. Do not treat Docker pruning or file
deletion as a substitute for application-specific retention policy.

### Post-maintenance evidence

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
  | sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```

## Interview talking points

- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application
  data lifecycle decisions.
- How optional dependency handling keeps one health command useful across
  container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.

## Future improvements

- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on
  timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
  commands.
- Add package-lock detection with bounded retry policy if recurring contention
  is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
  dedicated read-only audit.
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf"
DOCKER_CONFIG="/etc/docker/daemon.json"
packages=(
  logrotate
  needrestart
  smartmontools
  nvme-cli
  sysstat
  iotop
  ncdu
  duf
  jq
  lsof
  psmisc
  tar
  gzip
)
timers=(
  ailab-apt-cleanup.timer
  ailab-kernel-cleanup.timer
  ailab-docker-cleanup.timer
  ailab-config-backup.timer
  ailab-disk-watch.timer
)

if ((EUID != 0)); then
  printf 'CRITICAL: install.sh must run as root\n' >&2
  exit 2
fi
for command_name in apt-get install systemctl; do
  if ! command -v "$command_name" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
    exit 2
  fi
done

printf 'Installing maintenance dependencies...\n'
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"

printf 'Installing scripts and systemd units...\n'
for script in "$SCRIPT_DIR"/scripts/*.sh; do
  install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")"
done
for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do
  install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")"
done

install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")"
tmp_journald="$(mktemp)"
trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT
cat >"$tmp_journald" <<'EOF'
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
EOF
install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN"
systemctl restart systemd-journald

if command -v docker >/dev/null 2>&1; then
  printf 'Configuring Docker log rotation limits...\n'
  install -d -m 0755 /etc/docker
  tmp_docker="$(mktemp)"

  if [[ -f "$DOCKER_CONFIG" ]]; then
    if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
      printf 'CRITICAL: %s is not valid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
      exit 1
    fi
    backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak"
    install -m 0644 "$DOCKER_CONFIG" "$backup"
    jq '. + {
      "log-driver": "json-file",
      "log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
    }' "$DOCKER_CONFIG" >"$tmp_docker"
  else
    jq -n '{
      "log-driver": "json-file",
      "log-opts": {"max-size": "50m", "max-file": "5"}
    }' >"$tmp_docker"
  fi

  jq empty "$tmp_docker"
  install -m 0644 "$tmp_docker" "$DOCKER_CONFIG"
  systemctl restart docker || true
else
  printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n'
fi

systemctl daemon-reload
systemctl enable --now "${timers[@]}"

printf '\nEnabled AI Lab timers:\n'
systemctl list-timers --all --no-pager | grep 'ailab-' || true

/usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt
printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n'
@@ -0,0 +1,66 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

LOG_FILE="/var/log/ailab-apt-cleanup.log"
execute=false
non_interactive=false

usage() {
  printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}

while (($# > 0)); do
  case "$1" in
    --execute) execute=true ;;
    --non-interactive) non_interactive=true ;;
    -h|--help) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
  esac
  shift
done

if [[ "$non_interactive" == true && "$execute" != true ]]; then
  printf 'CRITICAL: --non-interactive requires --execute\n' >&2
  exit 2
fi
if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root\n' >&2
  exit 2
fi
if ! command -v apt >/dev/null 2>&1; then
  printf 'CRITICAL: apt is required\n' >&2
  exit 2
fi

exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)"

if [[ "$execute" != true ]]; then
  printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n'
  printf 'INFO: simulated autoremove follows\n'
  LC_ALL=C apt -s autoremove --purge
  printf 'INFO: rerun with --execute and confirm to apply changes\n'
  exit 0
fi

if [[ "$non_interactive" != true ]]; then
  printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n'
  printf 'Type EXECUTE to continue: '
  read -r confirmation
  if [[ "$confirmation" != "EXECUTE" ]]; then
    printf 'CRITICAL: confirmation failed; no changes made\n'
    exit 2
  fi
fi

apt update
apt autoremove --purge -y
apt autoclean -y
if command -v needrestart >/dev/null 2>&1; then
  needrestart -b || true
else
  printf 'WARNING: needrestart is not installed\n'
fi
printf 'OK: APT cleanup completed\n'
@@ -0,0 +1,90 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

LOG_FILE="/var/log/ailab-config-backup.log"
BACKUP_DIR="/srv/backups/ailab-config"
RETENTION_DAYS=30
execute=false
non_interactive=false

usage() {
  printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}

while (($# > 0)); do
  case "$1" in
    --execute) execute=true ;;
    --non-interactive) non_interactive=true ;;
    -h|--help) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
  esac
  shift
done

if [[ "$non_interactive" == true && "$execute" != true ]]; then
  printf 'CRITICAL: --non-interactive requires --execute\n' >&2
  exit 2
fi
if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root\n' >&2
  exit 2
fi
for command_name in tar gzip find; do
  if ! command -v "$command_name" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
    exit 2
  fi
done

exec > >(tee -a "$LOG_FILE") 2>&1
timestamp="$(date '+%Y%m%d-%H%M%S')"
archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz"
candidate_paths=(
  /etc
  /root/.bashrc
  /root/.bashrc.d
  /opt/ailab-maintenance
  /var/lib/libvirt/qemu
)
source_paths=()

printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)"
for path in "${candidate_paths[@]}"; do
  if [[ -e "$path" ]]; then
    source_paths+=("${path#/}")
    printf 'OK: include %s\n' "$path"
  else
    printf 'INFO: optional path is absent: %s\n' "$path"
  fi
done

if ((${#source_paths[@]} == 0)); then
  printf 'CRITICAL: no backup source paths are present\n'
  exit 1
fi

printf 'Backup destination: %s\n' "$archive"
printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS"
printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n'
printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n'

if [[ "$execute" != true ]]; then
  printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n'
  exit 0
fi

if [[ "$non_interactive" != true ]]; then
  printf 'Type EXECUTE to create the archive and apply retention: '
  read -r confirmation
  if [[ "$confirmation" != "EXECUTE" ]]; then
    printf 'CRITICAL: confirmation failed; no changes made\n'
    exit 2
  fi
fi

install -d -m 0750 "$BACKUP_DIR"
tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- "${source_paths[@]}"
find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete
printf 'OK: configuration backup created: %s\n' "$archive"
@@ -0,0 +1,38 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

LOG_FILE="/var/log/ailab-disk-watch.log"
threshold="${AILAB_DISK_THRESHOLD:-85}"

if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2
  exit 2
fi

if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((threshold < 1 || threshold > 100)); then
  printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2
  exit 2
fi

exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold"

status=0
while read -r filesystem _blocks _used available use_percent mountpoint; do
  usage="${use_percent%\%}"

  if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
    printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint"
    status=1
  elif ((usage >= threshold)); then
    printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \
      "$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available"
    status=1
  else
    printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
  fi
done < <(df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {print $1, $2, $3, $4, $5, $6}')

exit "$status"
@@ -0,0 +1,70 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
LOG_FILE="/var/log/ailab-docker-cleanup.log"
|
||||||
|
execute=false
|
||||||
|
non_interactive=false
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--execute) execute=true ;;
|
||||||
|
--non-interactive) non_interactive=true ;;
|
||||||
|
-h|--help) usage; exit 0 ;;
|
||||||
|
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||||
|
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: this script must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||||
|
printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)"
|
||||||
|
|
||||||
|
if ! command -v docker >/dev/null 2>&1; then
|
||||||
|
printf 'INFO: Docker is not installed; nothing to do\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
if command -v systemctl >/dev/null 2>&1 && ! systemctl is-active --quiet docker; then
|
||||||
|
printf 'INFO: docker.service is inactive; nothing to do\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '\nDocker disk usage before cleanup:\n'
|
||||||
|
docker system df
|
||||||
|
|
||||||
|
if [[ "$execute" != true ]]; then
|
||||||
|
printf 'INFO: dry-run mode; would run docker system prune -af\n'
|
||||||
|
printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n'
|
||||||
|
printf 'INFO: Docker volumes are never included in this cleanup\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$non_interactive" != true ]]; then
|
||||||
|
printf 'WARNING: this removes stopped containers, unused networks, unused images, and build cache, but not volumes.\n'
|
||||||
|
printf 'Type EXECUTE to continue: '
|
||||||
|
read -r confirmation
|
||||||
|
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||||
|
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
docker system prune -af
|
||||||
|
docker builder prune -af --filter "until=168h"
|
||||||
|
|
||||||
|
printf '\nDocker disk usage after cleanup:\n'
|
||||||
|
docker system df
|
||||||
|
printf 'OK: Docker cleanup completed; volumes were not pruned\n'
|
||||||
@@ -0,0 +1,111 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
section() {
|
||||||
|
printf '\n== %s ==\n' "$1"
|
||||||
|
}
|
||||||
|
|
||||||
|
run_optional() {
|
||||||
|
local description="$1"
|
||||||
|
shift
|
||||||
|
|
||||||
|
if "$@"; then
|
||||||
|
return 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf 'WARNING: %s failed\n' "$description"
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
section "Host identity"
|
||||||
|
if command -v hostnamectl >/dev/null 2>&1; then
|
||||||
|
run_optional "hostnamectl" hostnamectl
|
||||||
|
else
|
||||||
|
run_optional "hostname" hostname
|
||||||
|
fi
|
||||||
|
run_optional "kernel information" uname -a
|
||||||
|
run_optional "uptime" uptime
|
||||||
|
|
||||||
|
section "Memory"
|
||||||
|
if command -v free >/dev/null 2>&1; then
|
||||||
|
run_optional "memory report" free -h
|
||||||
|
else
|
||||||
|
printf 'WARNING: free is not available\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Filesystems"
|
||||||
|
if command -v df >/dev/null 2>&1; then
|
||||||
|
run_optional "filesystem report" df -hT
|
||||||
|
printf '\nKey mountpoints present:\n'
|
||||||
|
for mountpoint in / /boot /var /srv /opt /home; do
|
||||||
|
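# --mountpoint matches only real mountpoints; --target also matches any path inside a filesystem.
if findmnt -rn --mountpoint "$mountpoint" >/dev/null 2>&1; then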
|
||||||
|
run_optional "filesystem report for $mountpoint" df -hT "$mountpoint"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
else
|
||||||
|
printf 'WARNING: df is not available\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Journal usage"
|
||||||
|
if command -v journalctl >/dev/null 2>&1; then
|
||||||
|
run_optional "journal disk usage" journalctl --disk-usage
|
||||||
|
else
|
||||||
|
printf 'WARNING: journalctl is not available\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Docker"
|
||||||
|
if command -v docker >/dev/null 2>&1; then
|
||||||
|
if command -v systemctl >/dev/null 2>&1; then
|
||||||
|
run_optional "Docker service state" systemctl is-active docker
|
||||||
|
fi
|
||||||
|
run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
|
||||||
|
run_optional "Docker disk usage" docker system df
|
||||||
|
else
|
||||||
|
printf 'INFO: Docker is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Libvirt"
|
||||||
|
if command -v virsh >/dev/null 2>&1; then
|
||||||
|
if command -v systemctl >/dev/null 2>&1; then
|
||||||
|
run_optional "libvirtd service state" systemctl is-active libvirtd
|
||||||
|
fi
|
||||||
|
run_optional "libvirt guest list" virsh list --all
|
||||||
|
else
|
||||||
|
printf 'INFO: virsh is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "NVIDIA"
|
||||||
|
if command -v nvidia-smi >/dev/null 2>&1; then
|
||||||
|
run_optional "NVIDIA status" nvidia-smi
|
||||||
|
else
|
||||||
|
printf 'INFO: nvidia-smi is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Failed systemd units"
|
||||||
|
if command -v systemctl >/dev/null 2>&1; then
|
||||||
|
run_optional "failed systemd unit report" systemctl --failed --no-pager
|
||||||
|
else
|
||||||
|
printf 'WARNING: systemctl is not available\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "SMART quick health"
|
||||||
|
if command -v smartctl >/dev/null 2>&1; then
|
||||||
|
shopt -s nullglob
|
||||||
|
devices=(/dev/sd? /dev/nvme?n?)
|
||||||
|
shopt -u nullglob
|
||||||
|
|
||||||
|
if ((${#devices[@]} == 0)); then
|
||||||
|
printf 'INFO: no matching SATA/SCSI or NVMe devices found\n'
|
||||||
|
else
|
||||||
|
for device in "${devices[@]}"; do
|
||||||
|
printf '\n-- %s --\n' "$device"
|
||||||
|
run_optional "SMART health check for $device" smartctl -H "$device"
|
||||||
|
done
|
||||||
|
fi
|
||||||
|
else
|
||||||
|
printf 'INFO: smartctl is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
exit 0
|
||||||
@@ -0,0 +1,117 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
# APT autoremove respects package dependencies and kernel protection rules. That
|
||||||
|
# is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO.
|
||||||
|
|
||||||
|
LOG_FILE="/var/log/ailab-kernel-cleanup.log"
|
||||||
|
execute=false
|
||||||
|
non_interactive=false
|
||||||
|
|
||||||
|
usage() {
|
||||||
|
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||||
|
}
|
||||||
|
|
||||||
|
kernel_packages() {
|
||||||
|
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \
|
||||||
|
'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \
|
||||||
|
| awk '$1 ~ /^ii/ {print $2}' \
|
||||||
|
| sort -u || true
|
||||||
|
}
|
||||||
|
|
||||||
|
versioned_kernel_images() {
|
||||||
|
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \
|
||||||
|
| awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \
|
||||||
|
| sort -u || true
|
||||||
|
}
|
||||||
|
|
||||||
|
while (($# > 0)); do
|
||||||
|
case "$1" in
|
||||||
|
--execute) execute=true ;;
|
||||||
|
--non-interactive) non_interactive=true ;;
|
||||||
|
-h|--help) usage; exit 0 ;;
|
||||||
|
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||||
|
esac
|
||||||
|
shift
|
||||||
|
done
|
||||||
|
|
||||||
|
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||||
|
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
if ((EUID != 0)); then
|
||||||
|
printf 'CRITICAL: this script must run as root\n' >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
for command_name in apt-get dpkg-query uname; do
|
||||||
|
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||||
|
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||||
|
printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)"
|
||||||
|
printf 'Running kernel: %s\n' "$(uname -r)"
|
||||||
|
printf '\nInstalled kernel-related packages before cleanup:\n'
|
||||||
|
kernel_packages
|
||||||
|
|
||||||
|
# apt-get has a stable output format for scripts; the apt front end does not.
simulation="$(LC_ALL=C apt-get -s autoremove --purge)"
|
||||||
|
printf '\nAPT autoremove simulation:\n%s\n' "$simulation"
|
||||||
|
|
||||||
|
mapfile -t installed_images < <(versioned_kernel_images)
|
||||||
|
mapfile -t removed_images < <(
|
||||||
|
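# With --purge, the APT simulation prints "Purg" rather than "Remv" for each package.
awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u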
|
||||||
|
)
|
||||||
|
|
||||||
|
remaining_images=0
|
||||||
|
for image in "${installed_images[@]}"; do
|
||||||
|
remove_image=false
|
||||||
|
for removed in "${removed_images[@]}"; do
|
||||||
|
if [[ "$image" == "$removed" ]]; then
|
||||||
|
remove_image=true
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
if [[ "$remove_image" != true ]]; then
|
||||||
|
remaining_images=$((remaining_images + 1))
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \
|
||||||
|
"${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images"
|
||||||
|
|
||||||
|
if ((${#installed_images[@]} < 2 || remaining_images < 2)); then
|
||||||
|
printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n'
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$execute" != true ]]; then
|
||||||
|
printf 'INFO: dry-run mode; no packages were removed\n'
|
||||||
|
printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [[ "$non_interactive" != true ]]; then
|
||||||
|
printf 'WARNING: APT will remove the packages shown in the simulation above.\n'
|
||||||
|
printf 'Type EXECUTE to continue: '
|
||||||
|
read -r confirmation
|
||||||
|
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||||
|
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
apt-get autoremove --purge -y
|
||||||
|
apt-get autoclean -y
|
||||||
|
if command -v update-grub >/dev/null 2>&1; then
|
||||||
|
update-grub || true
|
||||||
|
else
|
||||||
|
printf 'WARNING: update-grub is not installed\n'
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '\nInstalled kernel-related packages after cleanup:\n'
|
||||||
|
kernel_packages
|
||||||
|
printf 'OK: kernel cleanup completed with APT-managed package selection\n'
|
||||||
@@ -0,0 +1,42 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
set -o errexit
|
||||||
|
set -o nounset
|
||||||
|
set -o pipefail
|
||||||
|
|
||||||
|
section() {
|
||||||
|
printf '\n== %s ==\n' "$1"
|
||||||
|
}
|
||||||
|
|
||||||
|
if ! command -v virsh >/dev/null 2>&1; then
|
||||||
|
printf 'INFO: virsh is not installed; VM audit skipped\n'
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
section "Virtual machines"
|
||||||
|
virsh list --all || printf 'WARNING: unable to list virtual machines\n'
|
||||||
|
|
||||||
|
section "Storage pools"
|
||||||
|
virsh pool-list --all || printf 'WARNING: unable to list storage pools\n'
|
||||||
|
|
||||||
|
mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true)
|
||||||
|
for pool in "${pools[@]}"; do
|
||||||
|
section "Volumes in pool: $pool"
|
||||||
|
virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool"
|
||||||
|
done
|
||||||
|
|
||||||
|
section "Possible VM disk and installation images"
|
||||||
|
search_roots=()
|
||||||
|
for path in /var/lib/libvirt /srv /opt; do
|
||||||
|
[[ -d "$path" ]] && search_roots+=("$path")
|
||||||
|
done
|
||||||
|
|
||||||
|
if ((${#search_roots[@]} == 0)); then
|
||||||
|
printf 'INFO: no configured search roots are present\n'
|
||||||
|
else
|
||||||
|
find "${search_roots[@]}" -xdev -type f \
|
||||||
|
\( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \
|
||||||
|
-printf '%12s bytes %p\n' 2>/dev/null \
|
||||||
|
| sort -nr || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
printf '\nINFO: audit complete; no files or libvirt resources were modified\n'
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab safe APT cleanup
|
||||||
|
After=network-online.target
|
||||||
|
Wants=network-online.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab APT cleanup weekly
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=Sun *-*-* 04:00:00
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,6 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab configuration backup
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab configuration backup daily
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=*-*-* 03:30:00
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,6 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab disk usage check
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-disk-watch.sh
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab disk usage check hourly
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=hourly
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab safe Docker cleanup
|
||||||
|
Requires=docker.service
|
||||||
|
After=docker.service
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab Docker cleanup weekly
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=Sun *-*-* 04:40:00
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=AI Lab safe kernel cleanup
|
||||||
|
After=network-online.target ailab-apt-cleanup.service
|
||||||
|
Wants=network-online.target
|
||||||
|
|
||||||
|
[Service]
|
||||||
|
Type=oneshot
|
||||||
|
ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
[Unit]
|
||||||
|
Description=Run AI Lab kernel cleanup weekly
|
||||||
|
|
||||||
|
[Timer]
|
||||||
|
OnCalendar=Sun *-*-* 04:20:00
|
||||||
|
Persistent=true
|
||||||
|
|
||||||
|
[Install]
|
||||||
|
WantedBy=timers.target
|
||||||
@@ -0,0 +1,276 @@
|
|||||||
|
# Linux Fresh Setup Toolkit
|
||||||
|
|
||||||
|
## Executive summary
|
||||||
|
|
||||||
|
The Linux Fresh Setup Toolkit is day-0 bootstrap automation for a clean Ubuntu
|
||||||
|
lab server or workstation. It prepares a host for routine administration,
|
||||||
|
Cockpit, Docker workloads, libvirt/KVM virtual machines, optional NVIDIA
|
||||||
|
diagnostics, bounded logging, practical kernel tuning, and a conservative
|
||||||
|
security baseline.
|
||||||
|
|
||||||
|
The scripts are modular and safe to rerun. Optional components remain optional,
|
||||||
|
UFW is not enabled without a specific flag, and an NVIDIA driver is never
|
||||||
|
installed without an explicit version. This is a portfolio and homelab
|
||||||
|
implementation, not a production-certified build standard.
|
||||||
|
|
||||||
|
## Scope and non-goals
|
||||||
|
|
||||||
|
The toolkit supports Ubuntu 24.04 and newer and assumes a systemd-based host
|
||||||
|
with APT package management. It is suitable for a host such as `ailab` that may
|
||||||
|
run WebODM, Open WebUI, Homepage, NVIDIA workloads, or test virtual machines.
|
||||||
|
|
||||||
|
It does not:
|
||||||
|
|
||||||
|
- Deploy applications, containers, or virtual machines.
|
||||||
|
- Configure GPU passthrough, VFIO bindings, bridges, or Windows guests.
|
||||||
|
- Select an NVIDIA driver automatically.
|
||||||
|
- Define a complete firewall policy or compliance baseline.
|
||||||
|
- Replace backup, monitoring, patching, or ongoing maintenance processes.
|
||||||
|
- Claim live validation against every future Ubuntu release.
|
||||||
|
|
||||||
|
## Why this is separate from ailab-maintenance
|
||||||
|
|
||||||
|
This project establishes a fresh host. The sibling
|
||||||
|
[AI Lab Maintenance Toolkit](../ailab-maintenance/) handles day-2 health
|
||||||
|
checks, scheduled cleanup, configuration backup, disk monitoring, and VM
|
||||||
|
inventory after a host is operating.
|
||||||
|
|
||||||
|
Keeping bootstrap and maintenance separate makes the change boundary clear:
|
||||||
|
this toolkit installs platform capabilities and baseline configuration, while
|
||||||
|
the maintenance toolkit manages recurring operational tasks.
|
||||||
|
|
||||||
|
## Directory layout
|
||||||
|
|
||||||
|
```text
|
||||||
|
setup/
|
||||||
|
├── README.md
|
||||||
|
├── install.sh
|
||||||
|
├── scripts/
|
||||||
|
│ ├── 00-preflight.sh
|
||||||
|
│ ├── 00-platform-guard.inc
|
||||||
|
│ ├── 01-base-packages.sh
|
||||||
|
│ ├── 02-shell-profile.sh
|
||||||
|
│ ├── 03-cockpit.sh
|
||||||
|
│ ├── 04-docker.sh
|
||||||
|
│ ├── 05-libvirt.sh
|
||||||
|
│ ├── 06-nvidia-tools.sh
|
||||||
|
│ ├── 07-tuning.sh
|
||||||
|
│ ├── 08-security-baseline.sh
|
||||||
|
│ └── 99-postcheck.sh
|
||||||
|
├── files/
|
||||||
|
│ ├── bashrc.d/ailab.sh
|
||||||
|
│ ├── docker/daemon.json
|
||||||
|
│ ├── sysctl/99-ailab.conf
|
||||||
|
│ └── systemd/journald-ailab-limits.conf
|
||||||
|
└── docs/
|
||||||
|
├── fresh-install-checklist.md
|
||||||
|
├── cockpit.md
|
||||||
|
├── docker.md
|
||||||
|
├── libvirt.md
|
||||||
|
├── nvidia.md
|
||||||
|
└── bash-shell.md
|
||||||
|
```
|
||||||
|
|
||||||
|
`00-platform-guard.inc` is an internal sourced helper used by mutating
|
||||||
|
component scripts; it is not an executable profile.
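
A sourced guard of this kind might look like the sketch below. The function
names `version_at_least` and `supported_ubuntu` and the optional file argument
are illustrative assumptions, not the toolkit's actual helper. Only the policy
(Ubuntu 24.04 or newer) comes from this README.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of a sourced platform guard; all names are illustrative.

# Return 0 when version $1 (MAJOR.MINOR) is at least version $2.
version_at_least() {
    local have="$1" want="$2"
    local have_major have_minor want_major want_minor
    [[ "$have" =~ ^([0-9]+)\.([0-9]+)$ ]] || return 1
    have_major="${BASH_REMATCH[1]}" have_minor="${BASH_REMATCH[2]}"
    [[ "$want" =~ ^([0-9]+)\.([0-9]+)$ ]] || return 1
    want_major="${BASH_REMATCH[1]}" want_minor="${BASH_REMATCH[2]}"
    # 10# forces base 10 so a component such as "08" is not read as octal.
    if ((10#$have_major != 10#$want_major)); then
        ((10#$have_major > 10#$want_major))
        return
    fi
    ((10#$have_minor >= 10#$want_minor))
}

# Check an os-release style file. The path is a parameter so the check can be
# exercised against a fixture instead of the live /etc/os-release.
supported_ubuntu() {
    local os_release="${1:-/etc/os-release}" id version_id
    [[ -r "$os_release" ]] || return 1
    # Source in a subshell so os-release variables do not leak into the caller.
    id="$(. "$os_release" && printf '%s' "${ID:-}")"
    version_id="$(. "$os_release" && printf '%s' "${VERSION_ID:-}")"
    [[ "$id" == "ubuntu" ]] && version_at_least "$version_id" "24.04"
}
```

A mutating script could source such a file and call `supported_ubuntu || exit 2`
before making any change.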
|
||||||
|
|
||||||
|
## Supported profiles and flags
|
||||||
|
|
||||||
|
| Flag | Result |
|
||||||
|
| --- | --- |
|
||||||
|
| `--base` | Install operational CLI, diagnostic, storage, and network packages |
|
||||||
|
| `--shell` | Install the root AI lab Bash profile |
|
||||||
|
| `--cockpit` | Install and enable Cockpit |
|
||||||
|
| `--docker` | Install Docker and bounded JSON-file logging |
|
||||||
|
| `--libvirt` | Install and enable libvirt/KVM |
|
||||||
|
| `--nvidia-tools` | Install NVIDIA and OpenCL diagnostics without a driver |
|
||||||
|
| `--install-nvidia-driver VERSION` | Install diagnostics and the named Ubuntu driver package |
|
||||||
|
| `--tuning` | Apply journald, sysctl, sensor, and sysstat settings |
|
||||||
|
| `--security` | Install and enable fail2ban; install but do not enable UFW |
|
||||||
|
| `--enable-ufw` | Run security setup and explicitly enable UFW |
|
||||||
|
| `--all` | Run every standard profile without UFW enablement or driver installation |
|
||||||
|
|
||||||
|
`--install-nvidia-driver` implies `--nvidia-tools`. `--enable-ufw` implies
|
||||||
|
`--security`. With no flags, the installer prints help and makes no changes.
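
The implication rules above can be sketched as a small resolver. This is a
minimal illustration, not the toolkit's `install.sh`: the function name
`resolve_profiles` and its line-oriented output are assumptions made for the
example.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of flag resolution; not the toolkit's install.sh.

# Print the resolved profile names, one per line, for the given flags.
resolve_profiles() {
    local -A selected=()
    local driver_version="" enable_ufw=false
    local profile

    while (($# > 0)); do
        case "$1" in
            --base|--shell|--cockpit|--docker|--libvirt|--nvidia-tools|--tuning|--security)
                selected["${1#--}"]=1 ;;
            --all)
                # Standard profiles only: no UFW enablement, no driver install.
                for profile in base shell cockpit docker libvirt nvidia-tools tuning security; do
                    selected["$profile"]=1
                done ;;
            --enable-ufw)
                enable_ufw=true
                selected[security]=1 ;;      # --enable-ufw implies --security
            --install-nvidia-driver)
                (($# >= 2)) || { printf 'missing driver version\n' >&2; return 2; }
                driver_version="$2"
                selected[nvidia-tools]=1     # implies --nvidia-tools
                shift ;;
            *)
                printf 'unknown argument: %s\n' "$1" >&2
                return 2 ;;
        esac
        shift
    done

    # Emit in a fixed order so output does not depend on hash ordering.
    for profile in base shell cockpit docker libvirt nvidia-tools tuning security; do
        [[ -n "${selected[$profile]:-}" ]] && printf '%s\n' "$profile"
    done
    [[ "$enable_ufw" == true ]] && printf 'enable-ufw\n'
    [[ -n "$driver_version" ]] && printf 'nvidia-driver=%s\n' "$driver_version"
    return 0
}
```

With no arguments the sketch prints nothing, which a caller could treat as the
signal to show help and exit without changes.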
|
||||||
|
|
||||||
|
## Installation examples
|
||||||
|
|
||||||
|
Review the scripts and current host access path before execution:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd labs/linux/setup
|
||||||
|
./install.sh
|
||||||
|
sudo ./install.sh --base --shell
|
||||||
|
sudo ./install.sh --cockpit --docker --libvirt
|
||||||
|
sudo ./install.sh --all
|
||||||
|
```
|
||||||
|
|
||||||
|
Explicit high-impact options can be combined with `--all`:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo ./install.sh --all --enable-ufw
|
||||||
|
sudo ./install.sh --all --install-nvidia-driver 550
|
||||||
|
```
|
||||||
|
|
||||||
|
The installer runs the read-only preflight once before the selected profiles
and runs the postcheck once after every selected profile has succeeded.
|
||||||
|
|
||||||
|
## Fresh host workflow
|
||||||
|
|
||||||
|
1. Patch the base Ubuntu installation and confirm console or out-of-band access.
|
||||||
|
2. Review [the fresh install checklist](docs/fresh-install-checklist.md).
|
||||||
|
3. Run `sudo ./install.sh --base --shell`.
|
||||||
|
4. Add only the platform profiles needed by the host.
|
||||||
|
5. Review service state, listening ports, storage, networking, and warnings in
|
||||||
|
the postcheck.
|
||||||
|
6. Reboot if a driver or kernel-related package requires it.
|
||||||
|
7. Capture host-specific configuration and backup requirements separately.
|
||||||
|
|
||||||
|
## AI lab workflow
|
||||||
|
|
||||||
|
A general AI lab host can start with:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo ./install.sh --base --shell --cockpit --docker --nvidia-tools --tuning --security
|
||||||
|
```
|
||||||
|
|
||||||
|
This installs GPU diagnostics but leaves driver choice to the operator. Add
|
||||||
|
libvirt only when the host will run VMs. Enable UFW only after confirming SSH,
|
||||||
|
Cockpit, application, bridge, and VM networking requirements.
|
||||||
|
|
||||||
|
## Safety model
|
||||||
|
|
||||||
|
- Mutating profiles require root and refuse non-Ubuntu systems or Ubuntu older
|
||||||
|
than 24.04.
|
||||||
|
- Component profiles install their own direct prerequisites.
|
||||||
|
- Existing managed configuration is changed only when content differs.
|
||||||
|
- Changed root shell, Docker, journald, and sysctl files receive timestamped
|
||||||
|
backups.
|
||||||
|
- Existing valid Docker JSON is merged so unrelated settings survive.
|
||||||
|
- Invalid Docker JSON stops configuration rather than being overwritten.
|
||||||
|
- UFW and NVIDIA driver installation require explicit flags.
|
||||||
|
- Package and service failures are not hidden.
|
||||||
|
- Postcheck warnings flag optional or inactive components, but they do not turn
a diagnostic script that completed successfully into a failure.
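
The change-only-when-different and backup behavior listed above could be
implemented along these lines. This is a sketch under assumptions: the function
name `install_if_changed`, the file mode, and the backup name format are
illustrative and are not taken from the toolkit's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of idempotent file management; names are illustrative.

# Install SOURCE at TARGET only when content differs.
# Prints "unchanged" or "changed"; backs up an existing target first.
install_if_changed() {
    local source_file="$1" target_file="$2" backup

    # cmp -s compares byte for byte and stays silent.
    if [[ -f "$target_file" ]] && cmp -s "$source_file" "$target_file"; then
        printf 'unchanged\n'
        return 0
    fi

    if [[ -f "$target_file" ]]; then
        # The backup name format is an assumption for this sketch.
        backup="${target_file}.$(date +%Y%m%d-%H%M%S).bak"
        cp -p -- "$target_file" "$backup"
    fi

    install -m 0644 -- "$source_file" "$target_file"
    printf 'changed\n'
}
```

A caller can restart or reload the owning service only when the function
reports `changed`, which avoids backup and restart churn on reruns.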
|
||||||
|
|
||||||
|
APT installation and service restarts are real system changes. Test first on a
|
||||||
|
disposable host and maintain a console path when changing remote access policy.
|
||||||
|
|
||||||
|
## Bash shell profile
|
||||||
|
|
||||||
|
The shell profile is installed as `/root/.bashrc.d/ailab.sh`, and one exact
|
||||||
|
source line is maintained in `/root/.bashrc`. It adds concise helpers for
|
||||||
|
systemd, journals, Docker, libvirt, NVIDIA, ports, archives, and disk usage.
|
||||||
|
|
||||||
|
See [Bash shell profile](docs/bash-shell.md) for command details and cautions.
|
||||||
|
|
||||||
|
## Cockpit setup
|
||||||
|
|
||||||
|
Cockpit provides browser-based host, storage, network, package, VM, metrics,
|
||||||
|
and support-report views. The installer enables `cockpit.socket` and reports
|
||||||
|
`https://HOSTNAME:9090`. `cockpit-files` is optional because it is not
|
||||||
|
available in every enabled Ubuntu repository.
|
||||||
|
|
||||||
|
See [Cockpit setup](docs/cockpit.md).
|
||||||
|
|
||||||
|
## Docker setup
|
||||||
|
|
||||||
|
The Ubuntu `docker.io` package is preferred. Docker's official repository is
configured only when `docker.io` is unavailable. The daemon uses the
`json-file` log driver with five 50 MB files per container, which caps log
storage at roughly 250 MB per container.
|
||||||
|
|
||||||
|
The toolkit configures log retention only. It does not prune data, deploy
|
||||||
|
Compose applications, or configure an NVIDIA container runtime.
|
||||||
|
|
||||||
|
See [Docker setup](docs/docker.md).
|
||||||
|
|
||||||
|
## libvirt/KVM setup
|
||||||
|
|
||||||
|
The libvirt profile installs QEMU, OVMF, software TPM support, virt-install,
|
||||||
|
virt-manager, bridge utilities, and libvirt clients and services. It enables
|
||||||
|
`libvirtd` and prints existing guests and networks.
|
||||||
|
|
||||||
|
See [libvirt/KVM setup](docs/libvirt.md).
|
||||||
|
|
||||||
|
## NVIDIA tooling
|
||||||
|
|
||||||
|
The default NVIDIA profile installs `nvtop`, `clinfo`, and PCI diagnostics.
|
||||||
|
It reports detected NVIDIA devices, `nvidia-smi`, and DKMS state when those
|
||||||
|
commands exist.
|
||||||
|
|
||||||
|
Driver installation requires a numeric version that maps to an available
|
||||||
|
Ubuntu package, for example `nvidia-driver-550`. Secure Boot enrollment,
|
||||||
|
driver suitability, CUDA, container runtime support, and passthrough remain
|
||||||
|
operator decisions.
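
The numeric-version rule can be checked before any package work starts. The
sketch below is illustrative only: `nvidia_driver_package` is a hypothetical
name, and the real profile may validate its input differently.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; the function name is illustrative.

# Map a numeric driver version to an Ubuntu package name, or fail.
nvidia_driver_package() {
    local version="${1:-}"
    if [[ ! "$version" =~ ^[0-9]+$ ]]; then
        printf 'driver version must be numeric, got: %s\n' "$version" >&2
        return 2
    fi
    printf 'nvidia-driver-%s\n' "$version"
}
```

A caller would still need to confirm that the resulting package exists in the
enabled repositories, for example by inspecting `apt-cache policy` output,
before attempting installation.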
|
||||||
|
|
||||||
|
See [NVIDIA tooling](docs/nvidia.md).
|
||||||
|
|
||||||
|
## Tuning
|
||||||
|
|
||||||
|
The tuning profile bounds persistent journal use, raises inotify limits for
|
||||||
|
development and container workloads, reduces swappiness, enables sysstat, and
|
||||||
|
runs automatic sensor detection when available.
|
||||||
|
|
||||||
|
Review these values against available memory, storage, monitoring retention,
|
||||||
|
and workload behavior before deployment beyond a lab.
|
||||||
|
|
||||||
|
## Security baseline
|
||||||
|
|
||||||
|
The security profile installs UFW and fail2ban and enables fail2ban. It leaves
UFW disabled unless `--enable-ufw` is present. When that flag is given, the
profile allows OpenSSH and TCP port 9090 before it activates UFW.
|
||||||
|
|
||||||
|
This is a minimal access-preservation baseline, not a complete host firewall or
|
||||||
|
hardening standard. Application and VM networking may require additional
|
||||||
|
reviewed rules.
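
The allow-before-enable ordering matters because activating UFW with no SSH
rule can cut off a remote session. A minimal sketch of that ordering follows;
`ufw_enable_plan` is a hypothetical helper that only prints the commands, and
the toolkit's real script may differ.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of allow-before-enable ordering; not the toolkit's script.

# Print the UFW commands in the order they would run. Printing rather than
# executing keeps the sketch reviewable without touching the firewall.
ufw_enable_plan() {
    printf '%s\n' \
        'ufw allow OpenSSH' \
        'ufw allow 9090/tcp' \
        'ufw --force enable'
}
```

Reviewing the printed plan before running it as root is one way to confirm
that access rules precede activation.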
|
||||||
|
|
||||||
|
## Postcheck
|
||||||
|
|
||||||
|
The final script reports:
|
||||||
|
|
||||||
|
- Failed systemd units.
|
||||||
|
- Cockpit, Docker, libvirt, and fail2ban status when installed.
|
||||||
|
- Running Docker containers and defined virtual machines.
|
||||||
|
- NVIDIA runtime state.
|
||||||
|
- Filesystem usage and listening ports.
|
||||||
|
|
||||||
|
Warnings require operator review, but the absence of an optional component
does not cause the postcheck itself to fail.
|
||||||
|
|
||||||
|
## Troubleshooting
|
||||||
|
|
||||||
|
Run individual read-only checks after correcting a failed profile:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo ./scripts/00-preflight.sh
|
||||||
|
sudo ./scripts/99-postcheck.sh
|
||||||
|
systemctl --failed
|
||||||
|
journalctl -u docker -u libvirtd -u cockpit.socket -u fail2ban
|
||||||
|
```
|
||||||
|
|
||||||
|
Common failure areas are unavailable APT repositories, unsupported package
|
||||||
|
names on a future Ubuntu release, invalid pre-existing Docker JSON, Secure Boot
|
||||||
|
module signing, disabled CPU virtualization, and remote firewall assumptions.
|
||||||
|
|
||||||
|
To roll back a managed configuration, compare the current file with its
|
||||||
|
timestamped `.bak` copy, restore the reviewed backup, and restart or reload the
|
||||||
|
owning service. Package removal is intentionally not automated because it may
|
||||||
|
affect workloads and dependencies.
|
||||||
|
|
||||||
|
## Interview talking points
|
||||||
|
|
||||||
|
- Why day-0 bootstrap and day-2 maintenance have separate ownership.
|
||||||
|
- How explicit flags protect firewall and GPU driver decisions.
|
||||||
|
- Why Docker JSON is validated, backed up, and merged.
|
||||||
|
- How idempotent content checks prevent backup and restart churn.
|
||||||
|
- Why preflight and postcheck evidence surround mutating profiles.
|
||||||
|
- Which virtualization, Secure Boot, IOMMU, and GPU decisions remain manual.
|
||||||
|
|
||||||
|
## Future improvements
|
||||||
|
|
||||||
|
- Add automated tests using disposable Ubuntu VMs.
|
||||||
|
- Add a documented NVIDIA Container Toolkit profile.
|
||||||
|
- Add optional non-root administrative user and group membership management.
|
||||||
|
- Add bridge and VFIO planning checks without applying passthrough changes.
|
||||||
|
- Add package compatibility matrices after validating future Ubuntu releases.
|
||||||
|
- Export postcheck results in a structured format for evidence collection.
|
||||||
@@ -0,0 +1,53 @@
|
|||||||
|
# Bash Shell Profile
|
||||||
|
|
||||||
|
## Installation
|
||||||
|
|
||||||
|
The shell profile is installed for root:
|
||||||
|
|
||||||
|
```text
|
||||||
|
/root/.bashrc.d/ailab.sh
|
||||||
|
```
|
||||||
|
|
||||||
|
The installer maintains one exact source line in `/root/.bashrc` and backs up
|
||||||
|
changed files. Start a new Bash session or run:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
source /root/.bashrc
|
||||||
|
```
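
Maintaining one exact source line usually means matching the whole line
literally before appending. The sketch below shows that idea. The content of
`SOURCE_LINE` and the function name `ensure_source_line` are assumptions,
since the exact line the installer writes is not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical sketch; SOURCE_LINE is an assumed value, not the toolkit's.

SOURCE_LINE='[ -f /root/.bashrc.d/ailab.sh ] && . /root/.bashrc.d/ailab.sh'

# Append SOURCE_LINE to the given rc file unless an identical line exists.
ensure_source_line() {
    local rc_file="$1"
    touch -- "$rc_file"
    # -F fixed string, -x whole-line match, -q quiet.
    if grep -Fxq -- "$SOURCE_LINE" "$rc_file"; then
        return 0
    fi
    # Add a newline first when the file does not already end with one.
    if [[ -s "$rc_file" && -n "$(tail -c 1 -- "$rc_file")" ]]; then
        printf '\n' >> "$rc_file"
    fi
    printf '%s\n' "$SOURCE_LINE" >> "$rc_file"
}
```

Running the function twice against the same file leaves a single copy of the
line, which is the property that makes reruns safe.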
|
||||||
|
|
||||||
|
## Aliases
|
||||||
|
|
||||||
|
| Alias | Purpose |
|
||||||
|
| --- | --- |
|
||||||
|
| `ll`, `la` | Detailed and hidden-file directory listings |
|
||||||
|
| `ports` | Listening TCP/UDP sockets and processes |
|
||||||
|
| `dus`, `dufh` | Directory and filesystem usage |
|
||||||
|
| `failed`, `jerr`, `timers` | systemd failure, journal error, and timer views |
|
||||||
|
| `dps`, `ddf`, `dcu` | Docker containers, disk use, and Compose startup |
|
||||||
|
| `vms` | All libvirt guests |
|
||||||
|
| `gpu`, `gpuloop` | NVIDIA status once or refreshed every two seconds |
|
||||||
|
| `now` | Current timestamp and timezone |
|
||||||
|
|
||||||
|
`dcu` runs `docker compose up -d` in the current directory and therefore may
|
||||||
|
create or start resources. Review the Compose project before using it.
|
||||||
|
|
||||||
|
## Functions
|
||||||
|
|
||||||
|
- `svc_status SERVICE`
|
||||||
|
- `svc_logs SERVICE [LINES]`
|
||||||
|
- `docker_logs CONTAINER [LINES]`
|
||||||
|
- `docker_restart CONTAINER`
|
||||||
|
- `vm_autostart VM`
|
||||||
|
- `vm_no_autostart VM`
|
||||||
|
- `path_backup PATH`
|
||||||
|
- `extract ARCHIVE`
|
||||||
|
|
||||||
|
Functions validate argument counts, and Docker, libvirt, and NVIDIA helpers
|
||||||
|
report missing commands clearly. `path_backup` creates a timestamped adjacent
|
||||||
|
copy and can consume substantial space for large paths.
|
||||||
|
|
||||||
|
## Rollback
|
||||||
|
|
||||||
|
Review timestamped backups under `/root`, restore the desired `.bashrc` or
|
||||||
|
profile copy, and start a new shell. Avoid restoring a backup without checking
|
||||||
|
for unrelated shell changes made after bootstrap.
|
||||||
@@ -0,0 +1,41 @@
|
|||||||
|
# Cockpit
|
||||||
|
|
||||||
|
## Purpose
|
||||||
|
|
||||||
|
The Cockpit profile installs browser-based host administration modules for
|
||||||
|
system state, storage, networking, packages, virtual machines, metrics, and
|
||||||
|
support reports. It enables the socket-activated service.
|
||||||
|
|
||||||
|
## Installation and validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo ./install.sh --cockpit
|
||||||
|
systemctl status cockpit.socket
|
||||||
|
ss -ltnp | grep ':9090'
|
||||||
|
```
|
||||||
|
|
||||||
|
Connect to `https://HOSTNAME:9090`. A browser warning is expected when the
|
||||||
|
default host certificate is not trusted.
|
||||||
|
|
||||||
|
`cockpit-files` is installed when available and skipped with a warning
|
||||||
|
otherwise.
|
||||||
|
|
||||||
|
## Access and firewall
|
||||||
|
|
||||||
|
The Cockpit profile does not change UFW. Enabling UFW through the toolkit with
`--enable-ufw` allows TCP 9090, but upstream firewalls and network ACLs remain
external concerns. Use normal Linux accounts and review which users may
administer the host.
|
||||||
|
|
||||||
|
## Troubleshooting and rollback
|
||||||
|
|
||||||
|
```bash
|
||||||
|
journalctl -u cockpit.socket -u cockpit.service
|
||||||
|
systemctl restart cockpit.socket
|
||||||
|
apt-cache policy cockpit cockpit-machines cockpit-files
|
||||||
|
```
|
||||||
|
|
||||||
|
To disable remote access without removing packages:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo systemctl disable --now cockpit.socket
|
||||||
|
```
|
||||||
@@ -0,0 +1,56 @@
# Docker

## Package policy

The profile prefers Ubuntu's `docker.io` package. If that package is
unavailable after an APT refresh, it configures Docker's official Ubuntu
repository and installs Docker Engine, containerd, Buildx, and Compose plugins.

This fallback requires network access to `download.docker.com`.

## Daemon configuration

The managed settings are:

```json
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
```

Existing valid `/etc/docker/daemon.json` content is preserved and merged with
these log settings. A changed file is backed up with a timestamp. Invalid JSON
causes the profile to stop rather than overwrite operator configuration.
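
One way to picture the merge described above is jq's recursive object merge.
This is a sketch only: the toolkit's own merge logic may differ,
`merge_daemon_config` is a hypothetical helper name, and it assumes `jq` is
installed. Invalid JSON in either input makes `jq` exit non-zero, so a caller
can stop before anything is written.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print EXISTING merged with MANAGED. Keys from MANAGED
# win on conflict; unrelated operator keys in EXISTING are kept.
merge_daemon_config() {
    local existing="$1" managed="$2"
    # --slurp reads both files into one array; `*` merges objects recursively.
    jq --slurp '.[0] * .[1]' "$existing" "$managed"
}
```

Writing the result to a temporary file and replacing the live file only after
`jq empty` succeeds on it keeps a failed merge from touching operator
configuration.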

Log limits apply to newly created containers. Existing containers may retain
their original logging configuration until recreated.

## Validation

```bash
docker version
docker compose version
docker info
docker ps
docker system df
jq . /etc/docker/daemon.json
```

## Troubleshooting and rollback

```bash
systemctl status docker
journalctl -u docker
jq empty /etc/docker/daemon.json
```

To restore a previous daemon configuration, review a timestamped backup,
replace the current file, validate it with `jq empty`, and restart Docker.
Do not restore blindly when workloads depend on newer daemon settings.

The profile does not configure Docker data roots, prune objects, deploy
applications, or install the NVIDIA Container Toolkit.
@@ -0,0 +1,47 @@
# Fresh Install Checklist

## Before bootstrap

- Confirm Ubuntu 24.04 or newer and record the release and kernel.
- Apply firmware settings for virtualization, IOMMU, or Secure Boot as needed.
- Confirm console or out-of-band access before firewall work.
- Record interfaces, addresses, routes, DNS, storage, and intended mountpoints.
- Patch the base system and reboot if required.
- Decide whether the host needs Docker, libvirt, Cockpit, or NVIDIA support.
- Review application ports and VM networking before enabling UFW.
- Confirm backups exist for any pre-existing host configuration.
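
The release requirement in the first item can be checked mechanically. The
sketch below is an illustration rather than the toolkit's own guard, which
relies on `dpkg --compare-versions`: `ubuntu_release_ok` is a hypothetical
helper that compares a `VERSION_ID`-style string such as `24.04` against a
minimum using plain Bash arithmetic.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when RELEASE (e.g. "24.04") is at least MINIMUM.
# Both values must look like MAJOR.MINOR; anything else returns 2.
ubuntu_release_ok() {
    local release="$1" minimum="${2:-24.04}"
    local pattern='^([0-9]+)\.([0-9]+)$'
    local rel_major rel_minor min_major min_minor

    [[ "$release" =~ $pattern ]] || return 2
    rel_major="${BASH_REMATCH[1]}"
    rel_minor="${BASH_REMATCH[2]}"
    [[ "$minimum" =~ $pattern ]] || return 2
    min_major="${BASH_REMATCH[1]}"
    min_minor="${BASH_REMATCH[2]}"

    # The 10# prefix forces base 10 so "04" or "08" is not read as octal.
    if ((10#$rel_major != 10#$min_major)); then
        ((10#$rel_major > 10#$min_major))
        return
    fi
    ((10#$rel_minor >= 10#$min_minor))
}
```

For example, `. /etc/os-release && ubuntu_release_ok "$VERSION_ID" && uname -r`
confirms the release and prints the kernel version to record.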

## Bootstrap

Start with the least capability required:

```bash
sudo ./install.sh --base --shell
```

Add reviewed platform profiles:

```bash
sudo ./install.sh --cockpit --docker --libvirt --nvidia-tools --tuning --security
```

Do not select `--enable-ufw` until remote access and application rules are
understood. Do not install an NVIDIA driver until hardware, kernel, Secure Boot,
and workload compatibility are known.

## Post-bootstrap evidence

- Review all installer warnings.
- Run `systemctl --failed`.
- Confirm expected services with `systemctl status`.
- Review `ss -tulpn`, `df -hT`, `ip -brief address`, and `ip route`.
- Confirm Docker with `docker version` and `docker compose version`.
- Confirm libvirt with `virsh list --all` and `virsh net-list --all`.
- Confirm GPU state with `lspci -nn | grep -i nvidia` and `nvidia-smi`.
- Reboot after driver installation and repeat the postcheck.
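
The evidence commands above can be captured in one pass for the handover
record. This is a minimal sketch under stated assumptions: `collect_evidence`
is a hypothetical helper, not part of the toolkit. It only records output, and
it notes tools that are missing or that fail instead of aborting.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run each command line given as an argument and append
# its output to REPORT, noting commands that are missing or that fail.
collect_evidence() {
    local report="$1"
    shift
    local entry
    local -a words

    for entry in "$@"; do
        # Split the entry on whitespace; quoting inside entries is not supported.
        read -r -a words <<<"$entry"
        printf '\n== %s ==\n' "$entry" >>"$report"
        if ! command -v "${words[0]}" >/dev/null 2>&1; then
            printf 'SKIPPED: %s is not installed\n' "${words[0]}" >>"$report"
        elif ! "${words[@]}" >>"$report" 2>&1; then
            printf 'WARNING: %s failed\n' "$entry" >>"$report"
        fi
    done
}
```

A run such as `collect_evidence evidence.txt 'systemctl --failed' 'ss -tulpn' 'df -hT' 'ip route' 'docker version' 'virsh list --all' 'nvidia-smi'`
leaves one file to attach to the handover notes. Pipelines like
`lspci -nn | grep -i nvidia` would still need to be run separately.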

## Handover

Document host-specific storage, network, firewall, backup, application, GPU,
and VM decisions. Install the separate `ailab-maintenance` toolkit only after
reviewing its scheduled day-2 behavior.
@@ -0,0 +1,54 @@
# libvirt and KVM

## Purpose

The libvirt profile installs QEMU/KVM administration, UEFI firmware, software
TPM support, VM creation tools, bridge utilities, and the libvirt daemon. This
supports later Linux or Windows 11 VM work without defining guests.

## Firmware pre-checks

Confirm CPU virtualization is enabled:

```bash
lscpu | grep -E 'Virtualization|Hypervisor'
grep -Eom1 '(vmx|svm)' /proc/cpuinfo
```

IOMMU and GPU passthrough require separate firmware, kernel command-line,
device isolation, driver binding, and recovery planning. This toolkit reports
hints but does not apply those changes.
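
The kind of hint the toolkit reports can be reproduced with a read-only check
of the kernel command line. The sketch below is illustrative:
`cmdline_has_iommu` is a hypothetical helper that takes the command line as a
string so it can be tried on sample input, and it changes no boot
configuration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed when a kernel command line string contains an
# explicit IOMMU argument (intel_iommu=on, amd_iommu=on, or any iommu=VALUE).
cmdline_has_iommu() {
    local cmdline="$1"
    local pattern='(^|[[:space:]])(intel_iommu=on|amd_iommu=on|iommu=)'
    [[ "$cmdline" =~ $pattern ]]
}
```

On a live host, `cmdline_has_iommu "$(cat /proc/cmdline)"` answers whether an
explicit argument is present. A missing argument does not prove the IOMMU is
off, because firmware and kernel defaults also matter.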

## Validation

```bash
systemctl status libvirtd
virsh list --all
virsh net-list --all
virsh pool-list --all
```

Use `virt-host-validate` when available for a broader host capability report.
Desktop use of `virt-manager` requires a graphical environment or remote
display strategy.

## Networking and Windows 11

The default libvirt NAT network is distinct from host bridge networking. Review
DHCP, DNS, forwarding, and firewall behavior before changing it.

Windows 11 typically needs UEFI and a TPM device. The installed OVMF and swtpm
packages provide those building blocks, but guest creation and licensing remain
manual.

## Troubleshooting

```bash
journalctl -u libvirtd
virsh net-info default
virsh pool-list --all
lsmod | grep kvm
```

Disabling `libvirtd` does not remove VM disks or definitions. Package removal
and VM data deletion are intentionally outside this toolkit.
@@ -0,0 +1,52 @@
# NVIDIA Tooling

## Diagnostic-only default

The normal NVIDIA profile installs `nvtop`, `clinfo`, and PCI utilities. It
does not install or select a driver:

```bash
sudo ./install.sh --nvidia-tools
```

Review hardware and current module state:

```bash
lspci -nn | grep -i nvidia
nvidia-smi
dkms status
mokutil --sb-state
```

## Explicit driver installation

Install only a reviewed Ubuntu driver package version:

```bash
sudo ./install.sh --install-nvidia-driver 550
```

The numeric value maps directly to `nvidia-driver-VERSION`. The profile refuses
an unavailable package. Reboot after installation, then validate `nvidia-smi`,
kernel logs, DKMS state, and application behavior.
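
The mapping from the numeric value to a package name can be sketched as below.
This is an illustration, not the profile's actual code:
`nvidia_driver_package` is a hypothetical helper that applies the same
digits-only rule the installer enforces on `VERSION` and prints the resulting
package name.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the Ubuntu package name for a driver branch.
# Rejects anything that is not digits only, mirroring the installer's check.
nvidia_driver_package() {
    local version="$1"
    if [[ ! "$version" =~ ^[0-9]+$ ]]; then
        printf 'VERSION must contain digits only\n' >&2
        return 2
    fi
    printf 'nvidia-driver-%s\n' "$version"
}
```

Running `apt-cache policy "$(nvidia_driver_package 550)"` before installing
shows whether the package has an install candidate on the host.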

## Selection considerations

- GPU generation and supported driver branch.
- Ubuntu release, kernel, and HWE stack.
- Secure Boot module enrollment.
- CUDA or application compatibility.
- Docker NVIDIA Container Toolkit requirements.
- Whether the device will be bound to VFIO instead of the host driver.

## Troubleshooting

```bash
journalctl -k | grep -Ei 'nvidia|nouveau|NVRM'
lsmod | grep -E 'nvidia|nouveau'
dkms status
apt-cache policy 'nvidia-driver-*'
```

Driver rollback is environment-specific and is not automated. Preserve console
access and a known-good kernel before changing GPU or Secure Boot configuration.
@@ -0,0 +1,133 @@
#!/usr/bin/env bash
# AI lab operational shell helpers. This file is intended to be sourced.

alias ll='ls -alF'
alias la='ls -A'
alias ports='ss -tulpn'
alias dus='du -xhd1 2>/dev/null | sort -h'
alias dufh='df -hT'
alias failed='systemctl --failed --no-pager'
alias jerr='journalctl -p err -b --no-pager'
alias timers='systemctl list-timers --all --no-pager'
alias dps='command -v docker >/dev/null 2>&1 && docker ps --format "table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}" || printf "Docker is not installed\n"'
alias ddf='command -v docker >/dev/null 2>&1 && docker system df || printf "Docker is not installed\n"'
alias dcu='command -v docker >/dev/null 2>&1 && docker compose up -d || printf "Docker Compose is not available\n"'
alias vms='command -v virsh >/dev/null 2>&1 && virsh list --all || printf "virsh is not installed\n"'
alias gpu='command -v nvidia-smi >/dev/null 2>&1 && nvidia-smi || printf "nvidia-smi is not installed\n"'
alias gpuloop='command -v nvidia-smi >/dev/null 2>&1 && watch -n 2 nvidia-smi || printf "nvidia-smi is not installed\n"'
alias now='date "+%Y-%m-%d %H:%M:%S %Z"'

svc_status() {
    if (($# != 1)); then
        printf 'Usage: svc_status SERVICE\n' >&2
        return 2
    fi
    systemctl status "$1" --no-pager
}

svc_logs() {
    if (($# < 1 || $# > 2)); then
        printf 'Usage: svc_logs SERVICE [LINES]\n' >&2
        return 2
    fi
    local lines="${2:-100}"
    [[ "$lines" =~ ^[0-9]+$ ]] || {
        printf 'LINES must be numeric\n' >&2
        return 2
    }
    journalctl -u "$1" -n "$lines" --no-pager
}

docker_logs() {
    if (($# < 1 || $# > 2)); then
        printf 'Usage: docker_logs CONTAINER [LINES]\n' >&2
        return 2
    fi
    command -v docker >/dev/null 2>&1 || {
        printf 'Docker is not installed\n' >&2
        return 1
    }
    local lines="${2:-100}"
    [[ "$lines" =~ ^[0-9]+$ ]] || {
        printf 'LINES must be numeric\n' >&2
        return 2
    }
    docker logs --tail "$lines" "$1"
}

docker_restart() {
    if (($# != 1)); then
        printf 'Usage: docker_restart CONTAINER\n' >&2
        return 2
    fi
    command -v docker >/dev/null 2>&1 || {
        printf 'Docker is not installed\n' >&2
        return 1
    }
    docker restart "$1"
}

vm_autostart() {
    if (($# != 1)); then
        printf 'Usage: vm_autostart VM\n' >&2
        return 2
    fi
    command -v virsh >/dev/null 2>&1 || {
        printf 'virsh is not installed\n' >&2
        return 1
    }
    virsh autostart "$1"
}

vm_no_autostart() {
    if (($# != 1)); then
        printf 'Usage: vm_no_autostart VM\n' >&2
        return 2
    fi
    command -v virsh >/dev/null 2>&1 || {
        printf 'virsh is not installed\n' >&2
        return 1
    }
    virsh autostart --disable "$1"
}

path_backup() {
    if (($# != 1)); then
        printf 'Usage: path_backup PATH\n' >&2
        return 2
    fi
    if [[ ! -e "$1" ]]; then
        printf 'Path does not exist: %s\n' "$1" >&2
        return 1
    fi
    local destination
    destination="${1%/}.$(date '+%Y%m%d-%H%M%S').bak"
    cp -a -- "$1" "$destination"
    printf 'Backup created: %s\n' "$destination"
}

extract() {
    if (($# != 1)); then
        printf 'Usage: extract ARCHIVE\n' >&2
        return 2
    fi
    if [[ ! -f "$1" ]]; then
        printf 'Archive does not exist: %s\n' "$1" >&2
        return 1
    fi
    case "$1" in
        *.tar.bz2|*.tbz2) tar xjf "$1" ;;
        *.tar.gz|*.tgz) tar xzf "$1" ;;
        *.tar.xz|*.txz) tar xJf "$1" ;;
        *.tar) tar xf "$1" ;;
        *.bz2) bunzip2 "$1" ;;
        *.gz) gunzip "$1" ;;
        *.zip) unzip "$1" ;;
        *.7z) 7z x "$1" ;;
        *.rar) unrar x "$1" ;;
        *)
            printf 'Unsupported archive type: %s\n' "$1" >&2
            return 2
            ;;
    esac
}
@@ -0,0 +1,7 @@
{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}
@@ -0,0 +1,3 @@
fs.inotify.max_user_watches=1048576
fs.inotify.max_user_instances=1024
vm.swappiness=10
@@ -0,0 +1,5 @@
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
Executable
+182
@@ -0,0 +1,182 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

run_base=0
run_shell=0
run_cockpit=0
run_docker=0
run_libvirt=0
run_nvidia=0
run_tuning=0
run_security=0
enable_ufw=0
nvidia_driver_version=""

usage() {
    cat <<'EOF'
Usage: sudo ./install.sh [OPTIONS]

Day-0 bootstrap automation for Ubuntu 24.04 or newer.

Profiles:
  --base                    Install baseline operational packages
  --shell                   Install the root shell profile
  --cockpit                 Install and enable Cockpit
  --docker                  Install and configure Docker
  --libvirt                 Install and enable libvirt/KVM
  --nvidia-tools            Install NVIDIA diagnostic tools only
  --install-nvidia-driver VERSION
                            Install diagnostic tools and the explicit driver
  --tuning                  Install journald and sysctl tuning
  --security                Install fail2ban and UFW; do not enable UFW
  --enable-ufw              Run security profile and explicitly enable UFW
  --all                     Run every profile without enabling UFW or
                            installing an NVIDIA driver
  -h, --help                Show this help

Examples:
  sudo ./install.sh --base --shell
  sudo ./install.sh --all
  sudo ./install.sh --all --enable-ufw
  sudo ./install.sh --nvidia-tools --install-nvidia-driver 550
EOF
}

require_supported_ubuntu() {
    if [[ ! -r /etc/os-release ]]; then
        printf 'CRITICAL: /etc/os-release is unavailable; refusing system changes\n' >&2
        exit 2
    fi

    # shellcheck disable=SC1091
    source /etc/os-release
    if [[ "${ID:-}" != "ubuntu" ]]; then
        printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
        exit 2
    fi
    if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
        printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
            "${VERSION_ID:-unknown}" >&2
        exit 2
    fi
}

if (($# == 0)); then
    usage
    exit 0
fi

while (($# > 0)); do
    case "$1" in
        --base)
            run_base=1
            ;;
        --shell)
            run_shell=1
            ;;
        --cockpit)
            run_cockpit=1
            ;;
        --docker)
            run_docker=1
            ;;
        --libvirt)
            run_libvirt=1
            ;;
        --nvidia-tools)
            run_nvidia=1
            ;;
        --install-nvidia-driver)
            if (($# < 2)); then
                printf 'CRITICAL: --install-nvidia-driver requires a VERSION\n' >&2
                exit 2
            fi
            nvidia_driver_version="$2"
            if [[ ! "$nvidia_driver_version" =~ ^[0-9]+$ ]]; then
                printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
                exit 2
            fi
            run_nvidia=1
            shift
            ;;
        --tuning)
            run_tuning=1
            ;;
        --security)
            run_security=1
            ;;
        --enable-ufw)
            enable_ufw=1
            run_security=1
            ;;
        --all)
            run_base=1
            run_shell=1
            run_cockpit=1
            run_docker=1
            run_libvirt=1
            run_nvidia=1
            run_tuning=1
            run_security=1
            ;;
        -h|--help)
            usage
            exit 0
            ;;
        *)
            printf 'CRITICAL: unknown option: %s\n\n' "$1" >&2
            usage >&2
            exit 2
            ;;
    esac
    shift
done

if ((EUID != 0)); then
    printf 'CRITICAL: install.sh must run as root for selected profiles\n' >&2
    exit 2
fi

for required_command in bash dpkg; do
    if ! command -v "$required_command" >/dev/null 2>&1; then
        printf 'CRITICAL: required command is missing: %s\n' "$required_command" >&2
        exit 2
    fi
done

require_supported_ubuntu

printf 'INFO: running read-only preflight\n'
"$SCRIPT_DIR/scripts/00-preflight.sh"

((run_base == 0)) || "$SCRIPT_DIR/scripts/01-base-packages.sh"
((run_shell == 0)) || "$SCRIPT_DIR/scripts/02-shell-profile.sh"
((run_cockpit == 0)) || "$SCRIPT_DIR/scripts/03-cockpit.sh"
((run_docker == 0)) || "$SCRIPT_DIR/scripts/04-docker.sh"
((run_libvirt == 0)) || "$SCRIPT_DIR/scripts/05-libvirt.sh"

if ((run_nvidia == 1)); then
    if [[ -n "$nvidia_driver_version" ]]; then
        "$SCRIPT_DIR/scripts/06-nvidia-tools.sh" --install-driver "$nvidia_driver_version"
    else
        "$SCRIPT_DIR/scripts/06-nvidia-tools.sh"
    fi
fi

((run_tuning == 0)) || "$SCRIPT_DIR/scripts/07-tuning.sh"

if ((run_security == 1)); then
    if ((enable_ufw == 1)); then
        "$SCRIPT_DIR/scripts/08-security-baseline.sh" --enable-ufw
    else
        "$SCRIPT_DIR/scripts/08-security-baseline.sh"
    fi
fi

printf '\nINFO: running post-install checks\n'
"$SCRIPT_DIR/scripts/99-postcheck.sh"
printf '\nOK: selected Linux setup profiles completed\n'
@@ -0,0 +1,20 @@
# shellcheck shell=bash

require_supported_ubuntu() {
    if [[ ! -r /etc/os-release ]] || ! command -v dpkg >/dev/null 2>&1; then
        printf 'CRITICAL: Ubuntu release detection requires /etc/os-release and dpkg\n' >&2
        exit 2
    fi

    # shellcheck disable=SC1091
    source /etc/os-release
    if [[ "${ID:-}" != "ubuntu" ]]; then
        printf 'CRITICAL: this toolkit supports Ubuntu only; detected %s\n' "${ID:-unknown}" >&2
        exit 2
    fi
    if ! dpkg --compare-versions "${VERSION_ID:-0}" ge "24.04"; then
        printf 'CRITICAL: Ubuntu 24.04 or newer is required; detected %s\n' \
            "${VERSION_ID:-unknown}" >&2
        exit 2
    fi
}
Executable
+124
@@ -0,0 +1,124 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

section() {
    printf '\n== %s ==\n' "$1"
}

run_optional() {
    local description="$1"
    shift

    if "$@"; then
        return 0
    fi
    printf 'WARNING: %s failed\n' "$description"
    return 0
}

section "Operating system"
if [[ -r /etc/os-release ]]; then
    run_optional "OS release report" cat /etc/os-release
else
    printf 'WARNING: /etc/os-release is unavailable\n'
fi
run_optional "kernel report" uname -a

section "Host"
run_optional "hostname report" hostname
run_optional "uptime report" uptime

section "CPU and virtualization"
if command -v lscpu >/dev/null 2>&1; then
    run_optional "CPU report" lscpu
    printf '\nVirtualization flags:\n'
    lscpu | grep -E 'Virtualization|Hypervisor vendor' || \
        printf 'INFO: no virtualization summary reported by lscpu\n'
else
    printf 'WARNING: lscpu is unavailable\n'
fi
if grep -Eqm1 '(^|[[:space:]])(vmx|svm)([[:space:]]|$)' /proc/cpuinfo; then
    printf 'OK: CPU virtualization flags detected\n'
else
    printf 'WARNING: CPU virtualization flags were not detected\n'
fi

section "Memory"
if command -v free >/dev/null 2>&1; then
    run_optional "memory report" free -h
else
    run_optional "memory report" cat /proc/meminfo
fi

section "Disks"
if command -v lsblk >/dev/null 2>&1; then
    run_optional "block device report" lsblk -o NAME,TYPE,SIZE,FSTYPE,MOUNTPOINTS,MODEL
else
    printf 'WARNING: lsblk is unavailable\n'
fi
run_optional "filesystem report" df -hT

section "Network"
if command -v ip >/dev/null 2>&1; then
    run_optional "network interface report" ip -brief address
    run_optional "route report" ip route
else
    printf 'WARNING: ip is unavailable\n'
fi

section "Firmware and Secure Boot"
if [[ -d /sys/firmware/efi ]]; then
    printf 'OK: boot mode is UEFI\n'
else
    printf 'INFO: boot mode appears to be legacy BIOS\n'
fi
if command -v mokutil >/dev/null 2>&1; then
    run_optional "Secure Boot report" mokutil --sb-state
else
    printf 'INFO: mokutil is unavailable; Secure Boot state not queried\n'
fi

section "IOMMU"
if [[ -r /proc/cmdline ]]; then
    printf 'Kernel command line:\n'
    cat /proc/cmdline
    if grep -Eq '(^|[[:space:]])(intel_iommu=on|amd_iommu=on|iommu=)' /proc/cmdline; then
        printf 'OK: IOMMU-related kernel arguments detected\n'
    else
        printf 'INFO: no explicit IOMMU kernel argument detected\n'
    fi
fi
if command -v dmesg >/dev/null 2>&1; then
    dmesg 2>/dev/null | grep -Ei 'DMAR|IOMMU|AMD-Vi' | tail -n 30 || \
        printf 'INFO: no readable IOMMU hints found in dmesg\n'
fi

section "NVIDIA hardware"
if command -v lspci >/dev/null 2>&1; then
    lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'
else
    printf 'INFO: lspci is unavailable\n'
fi

section "Existing platform components"
for command_name in docker virsh cockpit-bridge; do
    if command -v "$command_name" >/dev/null 2>&1; then
        printf 'OK: %s is installed at %s\n' "$command_name" "$(command -v "$command_name")"
    else
        printf 'INFO: %s is not installed\n' "$command_name"
    fi
done
if command -v systemctl >/dev/null 2>&1; then
    for unit in docker.service libvirtd.service cockpit.socket; do
        if systemctl cat "$unit" >/dev/null 2>&1; then
            state="$(systemctl is-active "$unit" 2>/dev/null || true)"
            printf 'INFO: %-20s state=%s\n' "$unit" "${state:-unknown}"
        else
            printf 'INFO: %s is not installed\n' "$unit"
        fi
    done
fi

printf '\nOK: preflight completed without modifying the host\n'
Executable
+41
@@ -0,0 +1,41 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"

packages=(
    curl wget git vim nano tmux byobu htop btop glances
    jq unzip zip rsync tree ncdu duf
    lsof strace tcpdump nmap dnsutils net-tools iperf3 ethtool
    smartmontools nvme-cli lm-sensors pciutils usbutils hwinfo
    sysstat iotop iftop nload
    ca-certificates gnupg software-properties-common apt-transport-https
    needrestart unattended-upgrades logrotate
)

if ((EUID != 0)); then
    printf 'CRITICAL: base package setup must run as root\n' >&2
    exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
    printf 'CRITICAL: apt-get is required\n' >&2
    exit 2
fi

printf 'INFO: refreshing APT metadata\n'
apt-get update
printf 'INFO: installing baseline operational packages\n'
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"

if command -v systemctl >/dev/null 2>&1; then
    systemctl enable --now sysstat
else
    printf 'WARNING: systemctl is unavailable; sysstat was not enabled\n'
fi

printf 'OK: baseline operational packages are installed\n'
Executable
+60
@@ -0,0 +1,60 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
SOURCE_FILE="$SCRIPT_DIR/../files/bashrc.d/ailab.sh"
PROFILE_DIR="/root/.bashrc.d"
PROFILE_FILE="$PROFILE_DIR/ailab.sh"
BASHRC="/root/.bashrc"
SOURCE_LINE='[[ -f /root/.bashrc.d/ailab.sh ]] && source /root/.bashrc.d/ailab.sh'

backup_file() {
    local path="$1"
    local backup

    backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
    install -m 0644 "$path" "$backup"
    printf 'INFO: backed up %s to %s\n' "$path" "$backup"
}

if ((EUID != 0)); then
    printf 'CRITICAL: shell profile setup must run as root\n' >&2
    exit 2
fi
require_supported_ubuntu
if [[ ! -r "$SOURCE_FILE" ]]; then
    printf 'CRITICAL: shell profile source is missing: %s\n' "$SOURCE_FILE" >&2
    exit 2
fi

install -d -m 0755 "$PROFILE_DIR"
if [[ ! -f "$PROFILE_FILE" ]] || ! cmp -s "$SOURCE_FILE" "$PROFILE_FILE"; then
    if [[ -f "$PROFILE_FILE" ]]; then
        backup_file "$PROFILE_FILE"
    fi
    install -m 0644 "$SOURCE_FILE" "$PROFILE_FILE"
    printf 'OK: installed %s\n' "$PROFILE_FILE"
else
    printf 'OK: shell profile is already current\n'
fi

if [[ ! -f "$BASHRC" ]]; then
    install -m 0644 /dev/null "$BASHRC"
fi

source_count="$(grep -Fxc "$SOURCE_LINE" "$BASHRC" || true)"
if [[ "$source_count" != "1" ]]; then
    tmp_bashrc="$(mktemp)"
    trap 'rm -f "$tmp_bashrc"' EXIT
    grep -Fvx "$SOURCE_LINE" "$BASHRC" >"$tmp_bashrc" || true
    printf '\n%s\n' "$SOURCE_LINE" >>"$tmp_bashrc"
    backup_file "$BASHRC"
    install -m 0644 "$tmp_bashrc" "$BASHRC"
    printf 'OK: configured %s to source the AI lab profile exactly once\n' "$BASHRC"
else
    printf 'OK: %s already sources the AI lab profile exactly once\n' "$BASHRC"
fi
Executable
+36
@@ -0,0 +1,36 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"

required_packages=(
    cockpit cockpit-system cockpit-storaged cockpit-networkmanager
    cockpit-packagekit cockpit-machines cockpit-sosreport cockpit-pcp
)

if ((EUID != 0)); then
    printf 'CRITICAL: Cockpit setup must run as root\n' >&2
    exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
    printf 'CRITICAL: apt-get is required\n' >&2
    exit 2
fi

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${required_packages[@]}"

if apt-cache show cockpit-files >/dev/null 2>&1; then
    DEBIAN_FRONTEND=noninteractive apt-get install -y cockpit-files
    printf 'OK: installed optional cockpit-files package\n'
else
    printf 'WARNING: cockpit-files is unavailable; continuing without it\n'
fi

systemctl enable --now cockpit.socket
printf 'OK: Cockpit is enabled at https://%s:9090\n' "$(hostname)"
Executable
+136
@@ -0,0 +1,136 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
SOURCE_CONFIG="$SCRIPT_DIR/../files/docker/daemon.json"
DOCKER_CONFIG="/etc/docker/daemon.json"
temporary_files=()

cleanup() {
  local path

  for path in "${temporary_files[@]}"; do
    rm -f "$path"
  done
}

trap cleanup EXIT

backup_file() {
  local path="$1"
  local backup

  backup="${path}.$(date '+%Y%m%d-%H%M%S').bak"
  install -m 0644 "$path" "$backup"
  printf 'INFO: backed up %s to %s\n' "$path" "$backup"
}

if ((EUID != 0)); then
  printf 'CRITICAL: Docker setup must run as root\n' >&2
  exit 2
fi
require_supported_ubuntu
for command_name in apt-get apt-cache; do
  if ! command -v "$command_name" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
    exit 2
  fi
done

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y ca-certificates curl gnupg jq

if apt-cache show docker.io >/dev/null 2>&1; then
  packages=(docker.io)
  if apt-cache show docker-compose-v2 >/dev/null 2>&1; then
    packages+=(docker-compose-v2)
  else
    printf 'WARNING: docker-compose-v2 is unavailable from Ubuntu repositories\n'
  fi
else
  printf 'WARNING: docker.io is unavailable; configuring Docker official repository\n'
  install -d -m 0755 /etc/apt/keyrings
  tmp_key="$(mktemp)"
  temporary_files+=("$tmp_key")
  curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
    | gpg --dearmor --yes -o "$tmp_key"
  if [[ ! -f /etc/apt/keyrings/docker.gpg ]] || \
    ! cmp -s "$tmp_key" /etc/apt/keyrings/docker.gpg; then
    if [[ -f /etc/apt/keyrings/docker.gpg ]]; then
      backup_file /etc/apt/keyrings/docker.gpg
    fi
    install -m 0644 "$tmp_key" /etc/apt/keyrings/docker.gpg
  fi

  # shellcheck disable=SC1091
  source /etc/os-release
  architecture="$(dpkg --print-architecture)"
  tmp_repository="$(mktemp)"
  temporary_files+=("$tmp_repository")
  printf 'deb [arch=%s signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu %s stable\n' \
    "$architecture" "${VERSION_CODENAME:?}" \
    >"$tmp_repository"
  if [[ ! -f /etc/apt/sources.list.d/docker.list ]] || \
    ! cmp -s "$tmp_repository" /etc/apt/sources.list.d/docker.list; then
    if [[ -f /etc/apt/sources.list.d/docker.list ]]; then
      backup_file /etc/apt/sources.list.d/docker.list
    fi
    install -m 0644 "$tmp_repository" /etc/apt/sources.list.d/docker.list
  fi
  apt-get update
  packages=(docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin)
fi

DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
install -d -m 0755 /etc/docker

if [[ ! -r "$SOURCE_CONFIG" ]]; then
  printf 'CRITICAL: Docker configuration template is missing: %s\n' "$SOURCE_CONFIG" >&2
  exit 2
fi
jq empty "$SOURCE_CONFIG"

tmp_config="$(mktemp)"
temporary_files+=("$tmp_config")
if [[ -f "$DOCKER_CONFIG" ]]; then
  if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
    printf 'CRITICAL: %s is invalid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
    exit 1
  fi
  jq '. + {
    "log-driver": "json-file",
    "log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
  }' "$DOCKER_CONFIG" >"$tmp_config"
else
  install -m 0644 "$SOURCE_CONFIG" "$tmp_config"
fi
jq empty "$tmp_config"

config_changed=0
if [[ ! -f "$DOCKER_CONFIG" ]] || ! cmp -s "$tmp_config" "$DOCKER_CONFIG"; then
  if [[ -f "$DOCKER_CONFIG" ]]; then
    backup_file "$DOCKER_CONFIG"
  fi
  install -m 0644 "$tmp_config" "$DOCKER_CONFIG"
  config_changed=1
  printf 'OK: installed Docker daemon log limits\n'
else
  printf 'OK: Docker daemon configuration is already current\n'
fi

systemctl enable --now docker
if ((config_changed == 1)); then
  systemctl restart docker
fi

docker version
if docker compose version >/dev/null 2>&1; then
  docker compose version
else
  printf 'WARNING: Docker Compose v2 is unavailable\n'
fi
printf 'OK: Docker setup completed\n'
Executable
+33
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"

packages=(
  qemu-system-x86 qemu-utils libvirt-daemon-system libvirt-clients
  virtinst virt-manager bridge-utils ovmf swtpm swtpm-tools dnsmasq-base
)

if ((EUID != 0)); then
  printf 'CRITICAL: libvirt setup must run as root\n' >&2
  exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
fi

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
systemctl enable --now libvirtd

printf '\n== Virtual machines ==\n'
virsh list --all || printf 'WARNING: unable to list virtual machines\n'
printf '\n== Virtual networks ==\n'
virsh net-list --all || printf 'WARNING: unable to list virtual networks\n'
printf 'OK: libvirt/KVM setup completed\n'
Executable
+88
@@ -0,0 +1,88 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"

driver_version=""

usage() {
  cat <<'EOF'
Usage: sudo ./06-nvidia-tools.sh [--install-driver VERSION]

Without --install-driver, only non-driver diagnostic tools are installed.
EOF
}

while (($# > 0)); do
  case "$1" in
    --install-driver)
      if (($# < 2)); then
        printf 'CRITICAL: --install-driver requires a VERSION\n' >&2
        exit 2
      fi
      driver_version="$2"
      if [[ ! "$driver_version" =~ ^[0-9]+$ ]]; then
        printf 'CRITICAL: NVIDIA driver VERSION must contain digits only\n' >&2
        exit 2
      fi
      shift
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1" >&2
      exit 2
      ;;
  esac
  shift
done

if ((EUID != 0)); then
  printf 'CRITICAL: NVIDIA tooling setup must run as root\n' >&2
  exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
fi

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y nvtop clinfo pciutils

printf '\n== NVIDIA PCI devices ==\n'
lspci -nn | grep -i nvidia || printf 'INFO: no NVIDIA PCI devices detected\n'

printf '\n== NVIDIA runtime ==\n'
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi || printf 'WARNING: nvidia-smi returned an error\n'
else
  printf 'INFO: nvidia-smi is not installed\n'
fi

printf '\n== DKMS ==\n'
if command -v dkms >/dev/null 2>&1; then
  dkms status || printf 'WARNING: dkms status returned an error\n'
else
  printf 'INFO: dkms is not installed\n'
fi

if [[ -n "$driver_version" ]]; then
  driver_package="nvidia-driver-$driver_version"
  if ! apt-cache show "$driver_package" >/dev/null 2>&1; then
    printf 'CRITICAL: requested NVIDIA driver package is unavailable: %s\n' \
      "$driver_package" >&2
    exit 1
  fi
  DEBIAN_FRONTEND=noninteractive apt-get install -y "$driver_package"
  printf 'WARNING: NVIDIA driver %s was installed; reboot before validation\n' \
    "$driver_version"
else
  printf 'OK: NVIDIA diagnostic tools installed; no driver was installed\n'
fi
Executable
+67
@@ -0,0 +1,67 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"
JOURNAL_SOURCE="$SCRIPT_DIR/../files/systemd/journald-ailab-limits.conf"
JOURNAL_DEST="/etc/systemd/journald.conf.d/ailab-limits.conf"
SYSCTL_SOURCE="$SCRIPT_DIR/../files/sysctl/99-ailab.conf"
SYSCTL_DEST="/etc/sysctl.d/99-ailab.conf"

install_config() {
  local source_path="$1"
  local destination_path="$2"
  local mode="$3"
  local backup

  install -d -m 0755 "$(dirname "$destination_path")"
  if [[ -f "$destination_path" ]] && cmp -s "$source_path" "$destination_path"; then
    printf 'OK: %s is already current\n' "$destination_path"
    return 0
  fi
  if [[ -f "$destination_path" ]]; then
    backup="${destination_path}.$(date '+%Y%m%d-%H%M%S').bak"
    install -m "$mode" "$destination_path" "$backup"
    printf 'INFO: backed up %s to %s\n' "$destination_path" "$backup"
  fi
  install -m "$mode" "$source_path" "$destination_path"
  printf 'OK: installed %s\n' "$destination_path"
}

if ((EUID != 0)); then
  printf 'CRITICAL: tuning setup must run as root\n' >&2
  exit 2
fi
require_supported_ubuntu
for source_path in "$JOURNAL_SOURCE" "$SYSCTL_SOURCE"; do
  if [[ ! -r "$source_path" ]]; then
    printf 'CRITICAL: required configuration is missing: %s\n' "$source_path" >&2
    exit 2
  fi
done

if ! command -v sysctl >/dev/null 2>&1 || ! command -v systemctl >/dev/null 2>&1; then
  printf 'CRITICAL: sysctl and systemctl are required\n' >&2
  exit 2
fi

if ! command -v sensors-detect >/dev/null 2>&1 || \
  ! systemctl cat sysstat.service >/dev/null 2>&1; then
  apt-get update
  DEBIAN_FRONTEND=noninteractive apt-get install -y lm-sensors sysstat
fi

install_config "$JOURNAL_SOURCE" "$JOURNAL_DEST" 0644
install_config "$SYSCTL_SOURCE" "$SYSCTL_DEST" 0644

sysctl --system
systemctl restart systemd-journald
systemctl enable --now sysstat

if command -v sensors-detect >/dev/null 2>&1; then
  sensors-detect --auto || printf 'WARNING: sensors-detect did not complete successfully\n'
fi
printf 'OK: host tuning completed\n'
+61
@@ -0,0 +1,61 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00-platform-guard.inc
source "$SCRIPT_DIR/00-platform-guard.inc"

enable_ufw=0

usage() {
  cat <<'EOF'
Usage: sudo ./08-security-baseline.sh [--enable-ufw]

Installs fail2ban and UFW. UFW is enabled only with the explicit flag.
EOF
}

while (($# > 0)); do
  case "$1" in
    --enable-ufw)
      enable_ufw=1
      ;;
    -h|--help)
      usage
      exit 0
      ;;
    *)
      printf 'CRITICAL: unknown option: %s\n' "$1" >&2
      exit 2
      ;;
  esac
  shift
done

if ((EUID != 0)); then
  printf 'CRITICAL: security baseline setup must run as root\n' >&2
  exit 2
fi
require_supported_ubuntu
if ! command -v apt-get >/dev/null 2>&1; then
  printf 'CRITICAL: apt-get is required\n' >&2
  exit 2
fi

apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y fail2ban ufw
systemctl enable --now fail2ban

if ((enable_ufw == 1)); then
  printf 'WARNING: UFW was explicitly requested; SSH and Cockpit rules will be added before enablement\n'
  ufw allow OpenSSH
  ufw allow 9090/tcp comment 'Cockpit'
  ufw --force enable
else
  printf 'WARNING: UFW is installed but was not enabled; use --enable-ufw after reviewing access requirements\n'
fi

ufw status verbose || printf 'WARNING: unable to read UFW status\n'
printf 'OK: security baseline completed\n'
Executable
+69
@@ -0,0 +1,69 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

section() {
  printf '\n== %s ==\n' "$1"
}

run_optional() {
  local description="$1"
  shift

  if "$@"; then
    return 0
  fi
  printf 'WARNING: %s failed\n' "$description"
  return 0
}

section "Failed systemd units"
if command -v systemctl >/dev/null 2>&1; then
  run_optional "failed systemd unit report" systemctl --failed --no-pager

  section "Selected service status"
  for unit in cockpit.socket docker.service libvirtd.service fail2ban.service; do
    if systemctl cat "$unit" >/dev/null 2>&1; then
      run_optional "$unit status" systemctl status "$unit" --no-pager
    else
      printf 'INFO: %s is not installed\n' "$unit"
    fi
  done
else
  printf 'WARNING: systemctl is unavailable\n'
fi

section "Docker"
if command -v docker >/dev/null 2>&1; then
  run_optional "Docker container list" docker ps
else
  printf 'INFO: Docker is not installed\n'
fi

section "Libvirt"
if command -v virsh >/dev/null 2>&1; then
  run_optional "libvirt guest list" virsh list --all
else
  printf 'INFO: virsh is not installed\n'
fi

section "NVIDIA"
if command -v nvidia-smi >/dev/null 2>&1; then
  run_optional "NVIDIA status" nvidia-smi
else
  printf 'INFO: nvidia-smi is not installed\n'
fi

section "Filesystems"
run_optional "filesystem report" df -hT

section "Listening ports"
if command -v ss >/dev/null 2>&1; then
  run_optional "listening port report" ss -tulpn
else
  printf 'WARNING: ss is unavailable\n'
fi

printf '\nOK: postcheck completed; review warnings above\n'
exit 0
@@ -1,8 +1,14 @@
 # platform-projects
 
-This directory is reserved for larger infrastructure platform topics and future case studies. The current implemented project is [infra-run](../infra-run/).
+This directory contains larger infrastructure platform topics and case studies. Most subdirectories are planning areas unless their own README says otherwise.
 
-Current subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
+## Implemented platform projects
+
+- [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
+
+## Planning areas
+
+These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
 
 - `monitoring-zabbix`
 - `elk-log-analysis`
@@ -0,0 +1,233 @@
# Slurm AI/HPC Cluster Automation Lab

## Executive summary

This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.

The goal is not to present a certified production platform. The goal is to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.

## What this project demonstrates

- Slurm controller and worker node management.
- Munge authentication across the cluster.
- GPU node integration through Slurm GRES.
- cgroup CPU, memory, and GPU device enforcement.
- SlurmDBD with MariaDB-backed accounting.
- `sacct`, `sreport`, and `sacctmgr` workflows.
- QOS, fairshare, and multifactor priority configuration.
- Node provisioning and decommissioning workflows.
- Rolling OS upgrades with canary validation.
- Health checks and auto-remediation.
- Backup and restore-check workflow for the accounting database.
- Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.

## Architecture overview

```mermaid
flowchart LR
    operator[Ansible control node]
    munge[Munge authentication]
    controller[Slurm controller<br/>slurmctld]
    db[MariaDB + SlurmDBD<br/>accounting]
    shared[Shared filesystem<br/>site dependency]
    cpu_part[CPU partition]
    gpu_part[GPU partition]
    cpu_nodes[CPU compute nodes<br/>slurmd]
    gpu_node[GPU node<br/>slurmd + GRES]
    jobs[User jobs<br/>sbatch / srun]

    operator -->|bootstrap and configure| controller
    operator -->|configure workers| cpu_nodes
    operator -->|configure GPU worker| gpu_node
    operator -->|deploy key and service| munge

    munge --> controller
    munge --> cpu_nodes
    munge --> gpu_node

    controller -->|accounting RPC| db
    jobs -->|submit to Slurm| controller
    controller -->|schedule CPU jobs| cpu_part
    controller -->|schedule GPU jobs| gpu_part
    cpu_part --> cpu_nodes
    gpu_part --> gpu_node

    cpu_nodes --- shared
    gpu_node --- shared
    controller --- shared
```

The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.

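
As a minimal sketch of the GRES declaration described above, the helper below renders one `gres.conf`-style line from a node name, a GRES name, and a device file. The function name `gres_conf_line` is hypothetical, and the project's own templates may render this line differently.

```bash
# Hypothetical helper: print one gres.conf-style line for a GPU device.
# Arguments: node name, GRES name, device file.
gres_conf_line() {
  if [ "$#" -ne 3 ]; then
    printf 'usage: gres_conf_line NODE NAME DEVICE_FILE\n' >&2
    return 2
  fi
  printf 'NodeName=%s Name=%s File=%s\n' "$1" "$2" "$3"
}
```

With the sanitized lab placeholders, `gres_conf_line gpu01 gpu /dev/nvidia0` prints `NodeName=gpu01 Name=gpu File=/dev/nvidia0`.
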

## Repository layout

```text
inventories/lab/        Sanitized lab inventory and group variables
playbooks/bootstrap/    Initial SSH, sudo, operator user, and host setup
playbooks/core/         Munge, Slurm config, and safe restart workflows
playbooks/accounting/   SlurmDBD, MariaDB, backup, restore-check, and reporting validation
playbooks/qos/          QOS, fairshare, and priority configuration
playbooks/lifecycle/    Node provisioning, inspection, and decommissioning
playbooks/upgrade/      Canary and rolling OS upgrade workflows
playbooks/health/       Health checks, repair, and auto-remediation
playbooks/tests/        CPU, GPU, cgroup, accounting, and reporting validation jobs
playbooks/backup/       Slurm and Munge state backup helpers
templates/              Slurm, cgroup, GRES, and SlurmDBD templates
docs/                   Operational runbook
prompts/                Documentation prompts used to expand this project
```

## Main operational workflows

Run commands from `platform-projects/hpc-slurm-ai-cluster/`. Review inventory and variables before running any playbook.

### Bootstrap access

```bash
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
```

### Deploy Munge

```bash
ansible-playbook playbooks/core/manage-munge.yml
```

### Deploy Slurm config

```bash
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml
```
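
The check-then-apply order used for the Slurm config could be wrapped as in the sketch below. The function name `deploy_with_review` and the `ANSIBLE_PLAYBOOK` override variable are hypothetical and exist only to make the wrapper easy to dry-run; only the `--check` and `--diff` flags come from the commands above.

```bash
# Hypothetical wrapper: preview a playbook in check mode, then apply it only
# if the preview succeeded. ANSIBLE_PLAYBOOK can be overridden for dry runs.
deploy_with_review() {
  local playbook="${1:-}"
  local runner="${ANSIBLE_PLAYBOOK:-ansible-playbook}"

  if ! "$runner" "$playbook" --check --diff; then
    printf 'CRITICAL: check-mode preview failed for %s\n' "$playbook" >&2
    return 1
  fi
  "$runner" "$playbook" --diff
}
```

Calling `deploy_with_review playbooks/core/manage-slurm-config.yml` would run both passes back to back. A human review of the check-mode diff between the two passes is still advisable, which this sketch does not enforce.
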

### Validate CPU jobs

```bash
ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
```

### Validate GPU jobs

```bash
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
```
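
One way a validation job might tell a GPU allocation from a denial is to inspect the job environment. The sketch below assumes Slurm exports `CUDA_VISIBLE_DEVICES` for jobs that were granted a GPU through GRES; the function name `gpu_allocation_status` is hypothetical, and the test playbooks listed above may check this differently (for example with `nvidia-smi`).

```bash
# Hypothetical check for a validation job script: report whether the job
# environment exposes any GPU indices via CUDA_VISIBLE_DEVICES.
gpu_allocation_status() {
  if [ -n "${CUDA_VISIBLE_DEVICES:-}" ]; then
    printf 'OK: GPU indices visible: %s\n' "$CUDA_VISIBLE_DEVICES"
  else
    printf 'INFO: no GPU indices visible to this job\n'
  fi
}
```

A job submitted with a GPU GRES request would be expected to take the first branch, and a job submitted without one the second.
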

### Enable accounting

```bash
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
ansible-playbook playbooks/tests/test-sreport-usage.yml
```
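
After accounting is enabled, job history can be spot-checked by summarizing `sacct` output. The sketch below is an illustration only: the function name `summarize_job_states` is hypothetical, and it assumes input in the `JobID|State` shape produced by `sacct -n -P -o JobID,State`.

```bash
# Hypothetical helper: read "JobID|State" lines on stdin (the shape produced
# by `sacct -n -P -o JobID,State`) and print a sorted count per state.
summarize_job_states() {
  awk -F'|' 'NF >= 2 { count[$2]++ } END { for (s in count) print s, count[s] }' | sort
}
```

On a cluster with working accounting, `sacct -n -P -o JobID,State | summarize_job_states` would print one line per job state. Job steps are counted as separate rows unless `sacct` is also given `-X`.
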

### Configure QOS and fairshare

```bash
ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
```

### Provision a node

```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=<node>
ansible-playbook playbooks/tests/test-specific-node.yml -e target_node=<node>
```

### Decommission a node

```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml \
  -e target_node=<node> \
  -e "decom_reason=planned maintenance"
```
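
Decommissioning normally starts by draining the node with a recorded reason. As a sketch of that first step, the helper below prints the `scontrol` command for review instead of running it. The function name `print_drain_command` is hypothetical, and whether the playbook issues exactly this command is an assumption.

```bash
# Hypothetical helper: print (not run) the scontrol command that would drain
# a node with a recorded reason, so an operator can review it first.
print_drain_command() {
  local node="${1:-}"
  local reason="${2:-}"

  if [ -z "$node" ] || [ -z "$reason" ]; then
    printf 'CRITICAL: node and reason are both required\n' >&2
    return 2
  fi
  printf 'scontrol update NodeName=%s State=DRAIN Reason="%s"\n' "$node" "$reason"
}
```

For example, `print_drain_command slurm-c02 "planned maintenance"` prints `scontrol update NodeName=slurm-c02 State=DRAIN Reason="planned maintenance"`.
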

### Rolling OS upgrade

```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=<node>
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
  -e canary_node=<node> \
  -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```
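
The sequence above upgrades the canary first and then passes `skip_canary=true` to the rolling run. One plausible reading, shown here purely as an assumption about the playbook's behavior, is that the rolling run covers every worker except the already-upgraded canary. The function name `remaining_workers` is hypothetical.

```bash
# Hypothetical helper: print the workers that remain after the canary has
# been upgraded and validated, one per line, in the order given.
remaining_workers() {
  local canary="${1:-}"
  local node

  shift
  for node in "$@"; do
    if [ "$node" != "$canary" ]; then
      printf '%s\n' "$node"
    fi
  done
}
```

With the lab placeholders, `remaining_workers slurm-c02 slurm-c01 slurm-c02 gpu01` lists `slurm-c01` and `gpu01`.
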

### Health check and auto-remediation

```bash
ansible-playbook playbooks/health/check-slurm-health.yml
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=<node>
```
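
A health check usually begins by finding nodes whose Slurm state needs attention. The sketch below is an illustration, not the playbook's logic: the function name `nodes_needing_attention` is hypothetical, and it assumes input in the `node state` shape produced by `sinfo -N -h -o '%N %T'`.

```bash
# Hypothetical helper: read "node state" pairs on stdin (the shape produced
# by `sinfo -N -h -o '%N %T'`) and print each node whose state starts with
# down, drain, or fail. Duplicate rows (one per partition) are collapsed.
nodes_needing_attention() {
  awk '$2 ~ /^(down|drain|fail)/ { print $1 }' | sort -u
}
```

On a live controller this would be used as `sinfo -N -h -o '%N %T' | nodes_needing_attention`, and the resulting node names could then be passed to the repair playbook one at a time.
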

### Accounting backup and restore-check

```bash
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
```

## Operational maturity

This lab goes beyond a static `slurm.conf` example by including operational controls around the cluster:

- Ansible workflows are designed to be repeatable and readable.
- Configuration deployment supports check and diff review before applying changes.
- Validation jobs exercise CPU scheduling, GPU scheduling, cgroup behavior, accounting, and reporting.
- SlurmDBD and MariaDB accounting are configured with `sacct`, `sreport`, and `sacctmgr` validation.
- QOS, fairshare, priority, and association workflows show resource governance.
- Node lifecycle playbooks drain, decommission, reprovision, resume, and validate nodes.
- Rolling upgrade playbooks include canary validation before broader worker upgrades.
- Health and repair playbooks document remediation paths for common node states.
- Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.

## Tested capabilities

- [x] CPU job scheduling.
- [x] GPU job scheduling.
- [x] GPU denial when no GRES is requested.
- [x] CPU cgroup enforcement.
- [x] SlurmDBD accounting setup.
- [x] `sacct` job history visibility.
- [x] `sreport` usage reporting.
- [x] QOS creation and validation.
- [x] Fairshare and priority visibility.
- [x] Node decommission and reprovision workflow.
- [x] Rolling upgrade canary workflow.
- [x] Node health check and auto-remediation workflow.

These checks represent sanitized lab validation, not a claim of production certification.

## Safety and sanitization

This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.

Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files to this repository. Real credentials, including SlurmDBD database passwords, belong in Ansible Vault or another approved secret store that is kept out of this repository.

Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.
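
The commit rules above could be backed by a simple filename check before files are staged. The sketch below is illustrative only: the function name `is_sensitive_path` and the pattern list are hypothetical and would need to be adapted to the names a real environment actually uses.

```bash
# Hypothetical pre-commit style check: succeed when a path's file name
# suggests material that should not be committed. Patterns are examples only.
is_sensitive_path() {
  case "${1##*/}" in
    munge.key | id_rsa | id_ed25519 | *.pem | *.sql | *.sql.gz | *.tar.gz | *.vault.yml)
      return 0
      ;;
    *)
      return 1
      ;;
  esac
}
```

A staging script could loop over candidate paths and refuse to continue when `is_sensitive_path "$path"` succeeds. A name-based check like this is only a first filter and does not inspect file contents.
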

## Why this matters for AI/HPC infrastructure roles

AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.

This project demonstrates practical understanding of:

- Linux systems operations.
- Slurm cluster operations.
- GPU infrastructure and GRES scheduling.
- Job scheduling and resource isolation.
- Accounting, reporting, QOS, fairshare, and priority policy.
- Automation and repeatability with Ansible.
- Troubleshooting and operational ownership.

## Deeper docs

- [Runbook](docs/runbook.md)
@@ -0,0 +1,14 @@
[defaults]
inventory = ./inventories/lab/inventory.yml
host_key_checking = False
retry_files_enabled = False
stdout_callback = default
result_format = yaml
interpreter_python = auto_silent
timeout = 30
roles_path = ./roles
collections_path = ./collections

[ssh_connection]
pipelining = True
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
@@ -0,0 +1 @@
Generated backups and reports can be stored here locally. This directory is ignored by git.
@@ -0,0 +1,75 @@
# Slurm AI/HPC Lab Runbook

## Standard deployment order

```bash
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml

ansible-playbook playbooks/core/manage-munge.yml
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
ansible-playbook playbooks/core/restart-slurm-safe.yml

ansible-playbook playbooks/tests/validate-slurm-operator.yml
ansible-playbook playbooks/tests/test-cpu-job.yml
ansible-playbook playbooks/tests/test-gpu-job.yml
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml

ansible-playbook playbooks/accounting/setup-slurmdbd.yml
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml

ansible-playbook playbooks/qos/configure-slurm-qos.yml
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml

ansible-playbook playbooks/health/check-slurm-health.yml
```

## Node lifecycle

Provision a node:

```bash
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
```

Decommission a node:

```bash
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
```

Repair a node:

```bash
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
```

Run health remediation for nodes that can be recovered by the automated workflow:

```bash
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
```

Back up Slurm and Munge state before planned lifecycle work:

```bash
ansible-playbook playbooks/backup/backup-slurm-state.yml
ansible-playbook playbooks/backup/fetch-slurm-backups.yml
```

## Rolling OS upgrade

```bash
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
```

If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
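
Because a runbook step may reference a playbook that is not present, a pre-flight check like the sketch below can make the gap explicit before the sequence starts. The function name `require_playbook` is hypothetical and is not part of the repository.

```bash
# Hypothetical pre-flight check: confirm a playbook file exists and is
# readable before a runbook step depends on it.
require_playbook() {
  local playbook="${1:-}"

  if [ -r "$playbook" ]; then
    printf 'OK: found %s\n' "$playbook"
    return 0
  fi
  printf 'WARNING: %s is missing; keep this step manual\n' "$playbook" >&2
  return 1
}
```

For the controller step, `require_playbook playbooks/upgrade/upgrade-slurm-controller.yml` returns non-zero when the file is absent, so a wrapper script can skip that step and leave the controller upgrade to a manual procedure.
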
@@ -0,0 +1,128 @@
---
# Example lab inventory variables. Replace addresses, users and node topology for your environment.

slurm_cluster_name: labcluster

slurm_control_machine: slurm-ctl01
slurm_control_addr: 10.10.10.11

slurm_config_dir: /etc/slurm
slurm_user: slurm
slurm_operator_user: slurmuser

slurmctld_port: 6817
slurmd_port: 6818

slurm_job_comp_type: jobcomp/none

slurm_select_type: select/cons_tres
slurm_select_type_parameters: CR_Core_Memory

slurm_return_to_service: 2
slurm_default_mpi_type: none

slurm_gres_types: gpu

slurm_nodes:
  - name: slurm-c01
    managed_state: present
    addr: 10.10.10.12
    cpus: 2
    real_memory: 1800
    features: ""
    gres: ""
    topology: ""
  - name: slurm-c02
    managed_state: present
    addr: 10.10.10.13
    cpus: 2
    real_memory: 1800
    features: ""
    gres: ""
    topology: ""
  - name: gpu01
    managed_state: present
    addr: 10.10.10.14
    cpus: 12
    real_memory: 60000
    features: "gpu"
    gres: "gpu:1"
    gres_file: /dev/nvidia0
    topology: "Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2"

slurm_partitions:
  - name: debug
    managed_state: present
    nodes: "slurm-c[01-02]"
    default: "YES"
    max_time: "INFINITE"
    state: "UP"
  - name: gpu
    managed_state: present
    nodes: "gpu01"
    default: "NO"
    max_time: "INFINITE"
    state: "UP"
  - name: all
    managed_state: present
    nodes: "slurm-c[01-02],gpu01"
    default: "NO"
    max_time: "INFINITE"
    state: "UP"

# Cgroup enforcement
slurm_enable_cgroup: true
slurm_task_plugin: task/cgroup,task/affinity
slurm_proctrack_type: proctrack/cgroup
slurm_job_acct_gather_type: jobacct_gather/cgroup

# Slurm accounting / SlurmDBD
slurm_accounting_storage_type: accounting_storage/slurmdbd
slurm_accounting_storage_host: slurm-ctl01
slurm_accounting_storage_port: 6819
slurm_accounting_storage_enforce: associations,limits,qos
slurm_accounting_storage_tres: cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu

slurmdbd_host: slurm-ctl01
slurmdbd_port: 6819
slurmdbd_storage_type: accounting_storage/mysql
slurmdbd_storage_host: localhost
slurmdbd_storage_port: 3306
slurmdbd_storage_loc: slurm_acct_db
slurmdbd_storage_user: slurm
# Use Ansible Vault in real environments. See inventories/lab/group_vars/vault.example.yml
slurmdbd_storage_pass: "{{ vault_slurmdbd_storage_pass | default('CHANGE_ME_USE_ANSIBLE_VAULT') }}"

slurm_account_name: lab
slurm_account_description: "AI/HPC Slurm lab account"
slurm_account_organization: "labcluster"

# SlurmDBD purge / retention policy for lab
slurmdbd_commit_delay: 1
slurmdbd_purge_event_after: 12months
slurmdbd_purge_job_after: 12months
slurmdbd_purge_resv_after: 12months
slurmdbd_purge_step_after: 3months
slurmdbd_purge_suspend_after: 3months
slurmdbd_purge_txn_after: 12months
slurmdbd_purge_usage_after: 24months

# Archive is disabled for the lab; backup playbooks handle database dumps.
slurmdbd_archive_events: no
slurmdbd_archive_jobs: no
slurmdbd_archive_steps: no
slurmdbd_archive_suspend: no
slurmdbd_archive_txn: no
slurmdbd_archive_usage: no

# Slurm priority / fairshare
slurm_priority_type: priority/multifactor
slurm_priority_decay_half_life: 7-0
slurm_priority_calc_period: 5
slurm_priority_favor_small: "NO"
slurm_priority_weight_age: 1000
slurm_priority_weight_fairshare: 10000
slurm_priority_weight_job_size: 1000
slurm_priority_weight_partition: 1000
slurm_priority_weight_qos: 10000
slurm_priority_max_age: 1-0
@@ -0,0 +1,5 @@
---
# Copy this file to vault.yml and encrypt it with ansible-vault.
# ansible-vault encrypt inventories/lab/group_vars/vault.yml

vault_slurmdbd_storage_pass: CHANGE_ME
@@ -0,0 +1,24 @@
all:
  vars:
    ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
  children:
    slurm_cluster:
      children:
        slurm_controller:
          hosts:
            slurm-ctl01:
              ansible_host: 10.10.10.11
              ansible_user: ansible
        slurm_compute:
          hosts:
            slurm-c01:
              ansible_host: 10.10.10.12
              ansible_user: ansible
            slurm-c02:
              ansible_host: 10.10.10.13
              ansible_user: ansible
        slurm_gpu:
          hosts:
            gpu01:
              ansible_host: 10.10.10.14
              ansible_user: ansible
@@ -0,0 +1,90 @@
|
|||||||
|
---
|
||||||
|
- name: Backup SlurmDBD MariaDB database
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurmdbd_backup_dir: /var/backups/slurmdbd
|
||||||
|
local_fetch_dir: "{{ playbook_dir }}/../../artifacts/backups/slurmdbd"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Create remote backup directory
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ slurmdbd_backup_dir }}"
|
||||||
|
state: directory
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Create local fetch directory on Ansible controller
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ local_fetch_dir }}"
|
||||||
|
state: directory
|
||||||
|
mode: "0700"
|
||||||
|
delegate_to: localhost
|
||||||
|
become: false
|
||||||
|
|
||||||
|
- name: Validate MariaDB is running
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: systemctl is-active mariadb
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate SlurmDBD is running
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: systemctl is-active slurmdbd
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate Slurm accounting database exists
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';" | grep -qx "{{ slurmdbd_storage_loc }}"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Dump Slurm accounting database
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
ts="$(date +%F-%H%M%S)"
|
||||||
|
out="{{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-${ts}.sql.gz"
|
||||||
|
|
||||||
|
mysqldump \
|
||||||
|
--single-transaction \
|
||||||
|
--routines \
|
||||||
|
--events \
|
||||||
|
--triggers \
|
||||||
|
{{ slurmdbd_storage_loc }} | gzip -9 > "$out"
|
||||||
|
|
||||||
|
chmod 0600 "$out"
|
||||||
|
echo "$out"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: db_dump
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Validate backup file is non-empty
|
||||||
|
ansible.builtin.stat:
|
||||||
|
path: "{{ db_dump.stdout }}"
|
||||||
|
register: backup_file
|
||||||
|
|
||||||
|
- name: Fail if backup file is empty
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Backup file is smaller than 1 KiB and is likely incomplete: {{ db_dump.stdout }}"
|
||||||
|
when: backup_file.stat.size | int < 1024
|
||||||
|
|
||||||
|
- name: Fetch DB backup to Ansible controller
|
||||||
|
ansible.builtin.fetch:
|
||||||
|
src: "{{ db_dump.stdout }}"
|
||||||
|
dest: "{{ local_fetch_dir }}/"
|
||||||
|
flat: true
|
||||||
|
|
||||||
|
- name: Show DB backup result
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
- "Remote backup: {{ db_dump.stdout }}"
|
||||||
|
- "Backup size bytes: {{ backup_file.stat.size }}"
|
||||||
|
- "Fetched to: {{ local_fetch_dir }}/"
|
||||||
@@ -0,0 +1,126 @@
|
|||||||
|
---
|
||||||
|
- name: Initialize Slurm accounting entities
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Wait for sacctmgr connectivity
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: sacctmgr -n list cluster
|
||||||
|
register: sacctmgr_cluster_list
|
||||||
|
retries: 20
|
||||||
|
delay: 2
|
||||||
|
until: sacctmgr_cluster_list.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show current accounting state before changes
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### clusters"
|
||||||
|
sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### accounts"
|
||||||
|
sacctmgr list account format=Account,Descr,Org
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### users"
|
||||||
|
sacctmgr list user format=User,DefaultAccount,Admin
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### associations"
|
||||||
|
sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: accounting_state_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print current accounting state before changes
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: accounting_state_before.stdout_lines
|
||||||
|
|
||||||
|
- name: Ensure Slurm cluster exists in accounting DB
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
if sacctmgr -n list cluster format=Cluster | awk '{print $1}' | grep -qx "{{ slurm_cluster_name }}"; then
|
||||||
|
echo "Cluster {{ slurm_cluster_name }} already exists"
|
||||||
|
else
|
||||||
|
sacctmgr -i add cluster {{ slurm_cluster_name }}
|
||||||
|
fi
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: ensure_cluster
|
||||||
|
changed_when: "'Adding Cluster' in ensure_cluster.stdout"
|
||||||
|
|
||||||
|
- name: Ensure default lab account exists for cluster
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="" {found=1} END {exit !found}'; then
|
||||||
|
echo "Account {{ slurm_account_name }} already associated with cluster {{ slurm_cluster_name }}"
|
||||||
|
else
|
||||||
|
sacctmgr -i add account {{ slurm_account_name }} \
|
||||||
|
Cluster={{ slurm_cluster_name }} \
|
||||||
|
Description="{{ slurm_account_description }}" \
|
||||||
|
Organization="{{ slurm_account_organization }}"
|
||||||
|
fi
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: ensure_account
|
||||||
|
changed_when: "'Adding Account' in ensure_account.stdout"
|
||||||
|
|
||||||
|
- name: Ensure slurmuser exists with lab account association
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="{{ slurm_operator_user }}" {found=1} END {exit !found}'; then
|
||||||
|
echo "User {{ slurm_operator_user }} already associated with account {{ slurm_account_name }} on cluster {{ slurm_cluster_name }}"
|
||||||
|
else
|
||||||
|
sacctmgr -i add user {{ slurm_operator_user }} \
|
||||||
|
Cluster={{ slurm_cluster_name }} \
|
||||||
|
Account={{ slurm_account_name }} \
|
||||||
|
DefaultAccount={{ slurm_account_name }}
|
||||||
|
fi
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: ensure_user_assoc
|
||||||
|
changed_when: "'Adding User' in ensure_user_assoc.stdout"
|
||||||
|
|
||||||
|
- name: Ensure slurmuser has default account set
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i modify user where name={{ slurm_operator_user }} set DefaultAccount={{ slurm_account_name }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: set_default_account
|
||||||
|
changed_when: "'Modified user' in (set_default_account.stdout + set_default_account.stderr)"
|
||||||
|
|
||||||
|
- name: Show final accounting state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### clusters"
|
||||||
|
sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### accounts"
|
||||||
|
sacctmgr list account format=Account,Descr,Org
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### users"
|
||||||
|
sacctmgr list user format=User,DefaultAccount,Admin
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### associations"
|
||||||
|
sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: accounting_state_after
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print final accounting state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: accounting_state_after.stdout_lines
|
||||||
@@ -0,0 +1,98 @@
|
|||||||
|
---
|
||||||
|
- name: Restore-check latest SlurmDBD backup into test database
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
restore_check_db: "{{ slurmdbd_storage_loc }}_restorecheck"
|
||||||
|
slurmdbd_backup_dir: /var/backups/slurmdbd
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Validate MariaDB is running
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: systemctl is-active mariadb
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Find latest SlurmDBD backup
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
ls -1t {{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-*.sql.gz | head -n 1
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: latest_backup
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate latest backup exists
|
||||||
|
ansible.builtin.stat:
|
||||||
|
path: "{{ latest_backup.stdout }}"
|
||||||
|
register: latest_backup_stat
|
||||||
|
|
||||||
|
- name: Fail if latest backup is missing or empty
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Latest SlurmDBD backup is missing or empty: {{ latest_backup.stdout }}"
|
||||||
|
when:
|
||||||
|
- not latest_backup_stat.stat.exists or latest_backup_stat.stat.size | int < 1024
|
||||||
|
|
||||||
|
- name: Recreate restore-check database
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
mysql <<SQL
|
||||||
|
DROP DATABASE IF EXISTS {{ restore_check_db }};
|
||||||
|
CREATE DATABASE {{ restore_check_db }};
|
||||||
|
SQL
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Import backup into restore-check database
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
zcat "{{ latest_backup.stdout }}" | mysql {{ restore_check_db }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Validate restored table count
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: restored_tables
|
||||||
|
changed_when: false
|
||||||
|
failed_when: restored_tables.stdout | int < 1
|
||||||
|
|
||||||
|
- name: Validate restored row count sample
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### restored database"
|
||||||
|
echo "{{ restore_check_db }}"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### table count"
|
||||||
|
mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### largest tables"
|
||||||
|
mysql -N -B -e "
|
||||||
|
SELECT table_name, table_rows
|
||||||
|
FROM information_schema.tables
|
||||||
|
WHERE table_schema='{{ restore_check_db }}'
|
||||||
|
ORDER BY table_rows DESC
|
||||||
|
LIMIT 10;
|
||||||
|
"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: restore_check_summary
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show restore-check result
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
- "Imported backup: {{ latest_backup.stdout }}"
|
||||||
|
- "Restore-check DB: {{ restore_check_db }}"
|
||||||
|
- "Restored tables: {{ restored_tables.stdout }}"
|
||||||
|
- "Summary:"
|
||||||
|
- "{{ restore_check_summary.stdout_lines }}"
|
||||||
@@ -0,0 +1,105 @@
|
|||||||
|
---
|
||||||
|
- name: Install and configure MariaDB for SlurmDBD
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Install MariaDB and SlurmDBD packages
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name:
|
||||||
|
- mariadb-server
|
||||||
|
- mariadb-client
|
||||||
|
- slurmdbd
|
||||||
|
- slurm-wlm-mysql-plugin
|
||||||
|
state: present
|
||||||
|
update_cache: true
|
||||||
|
|
||||||
|
- name: Ensure MariaDB is enabled and running
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: mariadb
|
||||||
|
enabled: true
|
||||||
|
state: started
|
||||||
|
|
||||||
|
- name: Ensure Slurm log directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/log/slurm
|
||||||
|
state: directory
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Create Slurm accounting database and DB user
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
mysql <<SQL
|
||||||
|
CREATE DATABASE IF NOT EXISTS {{ slurmdbd_storage_loc }};
|
||||||
|
CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'localhost' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
|
||||||
|
CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'127.0.0.1' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
|
||||||
|
GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'localhost';
|
||||||
|
GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'127.0.0.1';
|
||||||
|
FLUSH PRIVILEGES;
|
||||||
|
SQL
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Ensure /etc/slurm exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /etc/slurm
|
||||||
|
state: directory
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Deploy slurmdbd.conf
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/slurmdbd.conf.j2
|
||||||
|
dest: /etc/slurm/slurmdbd.conf
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0600"
|
||||||
|
notify:
|
||||||
|
- Restart slurmdbd
|
||||||
|
|
||||||
|
- name: Ensure slurmdbd is enabled and running
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmdbd
|
||||||
|
enabled: true
|
||||||
|
state: started
|
||||||
|
|
||||||
|
- name: Flush handlers before validation
|
||||||
|
ansible.builtin.meta: flush_handlers
|
||||||
|
|
||||||
|
- name: Validate slurmdbd service is active
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: systemctl is-active slurmdbd
|
||||||
|
register: slurmdbd_active
|
||||||
|
retries: 10
|
||||||
|
delay: 2
|
||||||
|
until: slurmdbd_active.stdout == "active"
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate slurmdbd is listening on port
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
ss -lntp | grep ':{{ slurmdbd_port }} '
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: slurmdbd_port_check
|
||||||
|
retries: 10
|
||||||
|
delay: 2
|
||||||
|
until: slurmdbd_port_check.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show slurmdbd service validation
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
- "slurmdbd is active"
|
||||||
|
- "{{ slurmdbd_port_check.stdout_lines }}"
|
||||||
|
|
||||||
|
handlers:
|
||||||
|
- name: Restart slurmdbd
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmdbd
|
||||||
|
state: restarted
|
||||||
@@ -0,0 +1,178 @@
|
|||||||
|
---
|
||||||
|
- name: Validate Slurm accounting production-like setup
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Validate accounting services
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### services"
|
||||||
|
systemctl is-active mariadb
|
||||||
|
systemctl is-active slurmdbd
|
||||||
|
systemctl is-active slurmctld
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### slurmdbd listener"
|
||||||
|
ss -lntp | grep ':{{ slurmdbd_port }} '
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: service_check
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate Slurm accounting runtime config
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### accounting config"
|
||||||
|
scontrol show config | grep -E "AccountingStorage|JobAcctGather|ClusterName"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### priority / select / cgroup config"
|
||||||
|
scontrol show config | grep -E "SelectType|TaskPlugin|ProctrackType"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: config_check
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate sacctmgr entities
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### clusters"
|
||||||
|
sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### accounts"
|
||||||
|
sacctmgr list account format=Account,Descr,Org
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### users"
|
||||||
|
sacctmgr list user format=User,DefaultAccount,Admin
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### associations"
|
||||||
|
sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: entity_check
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Submit accounting validation job
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
job_id="$(
|
||||||
|
sudo -iu {{ slurm_operator_user }} sbatch --parsable <<'SBATCH'
|
||||||
|
#!/bin/bash
|
||||||
|
#SBATCH --job-name=acct-prodlike-test
|
||||||
|
#SBATCH --partition=debug
|
||||||
|
#SBATCH --cpus-per-task=1
|
||||||
|
#SBATCH --mem=256M
|
||||||
|
#SBATCH --time=00:02:00
|
||||||
|
#SBATCH --output=/shared/acct-prodlike-test-%j.out
|
||||||
|
|
||||||
|
echo "HOST=$(hostname)"
|
||||||
|
echo "USER=$(whoami)"
|
||||||
|
echo "SLURM_JOB_ID=$SLURM_JOB_ID"
|
||||||
|
echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
|
||||||
|
echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
|
||||||
|
date
|
||||||
|
SBATCH
|
||||||
|
)"
|
||||||
|
|
||||||
|
echo "JOB_ID=$job_id"
|
||||||
|
|
||||||
|
for i in $(seq 1 90); do
|
||||||
|
if squeue -h -j "$job_id" | grep -q .; then
|
||||||
|
squeue -j "$job_id"
|
||||||
|
sleep 1
|
||||||
|
else
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
echo "### sacct"
|
||||||
|
sacct -j "$job_id" --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
|
||||||
|
|
||||||
|
echo "### output"
|
||||||
|
cat "/shared/acct-prodlike-test-${job_id}.out"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: acct_job
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Validate sacct can read recent jobs
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### recent jobs"
|
||||||
|
sacct -S today --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: sacct_recent
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate sreport commands
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### cluster utilization"
|
||||||
|
sreport cluster utilization start=today || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### account utilization by user"
|
||||||
|
sreport cluster AccountUtilizationByUser start=today || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### user top"
|
||||||
|
sreport user top start=today || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: sreport_check
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate MariaDB table health summary
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### database exists"
|
||||||
|
mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### table count"
|
||||||
|
mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ slurmdbd_storage_loc }}';"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### largest tables"
|
||||||
|
mysql -N -B -e "
|
||||||
|
SELECT table_name, table_rows
|
||||||
|
FROM information_schema.tables
|
||||||
|
WHERE table_schema='{{ slurmdbd_storage_loc }}'
|
||||||
|
ORDER BY table_rows DESC
|
||||||
|
LIMIT 10;
|
||||||
|
"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: db_health
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print accounting validation
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
- "### services"
|
||||||
|
- "{{ service_check.stdout_lines }}"
|
||||||
|
- "### runtime config"
|
||||||
|
- "{{ config_check.stdout_lines }}"
|
||||||
|
- "### accounting entities"
|
||||||
|
- "{{ entity_check.stdout_lines }}"
|
||||||
|
- "### accounting validation job"
|
||||||
|
- "{{ acct_job.stdout_lines }}"
|
||||||
|
- "### recent sacct data"
|
||||||
|
- "{{ sacct_recent.stdout_lines }}"
|
||||||
|
- "### sreport"
|
||||||
|
- "{{ sreport_check.stdout_lines }}"
|
||||||
|
- "### database health"
|
||||||
|
- "{{ db_health.stdout_lines }}"
|
||||||
@@ -0,0 +1,83 @@
|
|||||||
|
---
|
||||||
|
- name: Backup Slurm and Munge state on all cluster nodes
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
vars:
|
||||||
|
backup_base_dir: /var/backups/slurm
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Create backup base directory
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ backup_base_dir }}"
|
||||||
|
state: directory
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Create timestamped backup directory
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
ts="$(date +%F-%H%M%S)"
|
||||||
|
dir="{{ backup_base_dir }}/$ts"
|
||||||
|
mkdir -p "$dir"
|
||||||
|
echo "$dir"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: backup_dir_result
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Store backup directory fact
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
node_backup_dir: "{{ backup_dir_result.stdout }}"
|
||||||
|
|
||||||
|
- name: Backup Slurm and Munge config/state if present
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
backup_dir="{{ node_backup_dir }}"
|
||||||
|
|
||||||
|
for p in \
|
||||||
|
/etc/slurm \
|
||||||
|
/etc/slurm-llnl \
|
||||||
|
/etc/munge \
|
||||||
|
/var/spool/slurmctld \
|
||||||
|
/var/spool/slurmd \
|
||||||
|
/var/log/slurm \
|
||||||
|
/var/log/slurm-llnl
|
||||||
|
do
|
||||||
|
if [ -e "$p" ]; then
|
||||||
|
cp -a "$p" "$backup_dir/"
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
systemctl status munge --no-pager > "$backup_dir/systemctl-munge.txt" 2>&1 || true
|
||||||
|
systemctl status slurmctld --no-pager > "$backup_dir/systemctl-slurmctld.txt" 2>&1 || true
|
||||||
|
systemctl status slurmd --no-pager > "$backup_dir/systemctl-slurmd.txt" 2>&1 || true
|
||||||
|
|
||||||
|
journalctl -u munge -n 200 --no-pager > "$backup_dir/journal-munge.txt" 2>&1 || true
|
||||||
|
journalctl -u slurmctld -n 200 --no-pager > "$backup_dir/journal-slurmctld.txt" 2>&1 || true
|
||||||
|
journalctl -u slurmd -n 200 --no-pager > "$backup_dir/journal-slurmd.txt" 2>&1 || true
|
||||||
|
|
||||||
|
if command -v sinfo >/dev/null 2>&1; then
|
||||||
|
sinfo > "$backup_dir/sinfo.txt" 2>&1 || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
if command -v scontrol >/dev/null 2>&1; then
|
||||||
|
scontrol show config > "$backup_dir/scontrol-show-config.txt" 2>&1 || true
|
||||||
|
scontrol show nodes > "$backup_dir/scontrol-show-nodes.txt" 2>&1 || true
|
||||||
|
scontrol show partitions > "$backup_dir/scontrol-show-partitions.txt" 2>&1 || true
|
||||||
|
fi
|
||||||
|
|
||||||
|
find "$backup_dir" -maxdepth 2 \( -type f -o -type d \)
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: backup_content
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Show backup location on node
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
- "Host: {{ inventory_hostname }}"
|
||||||
|
- "Backup directory: {{ node_backup_dir }}"
|
||||||
@@ -0,0 +1,46 @@
|
|||||||
|
---
|
||||||
|
- name: Fetch latest Slurm backups from nodes to pvef
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
remote_backup_base: /var/backups/slurm
|
||||||
|
local_backup_base: "{{ playbook_dir }}/../../artifacts/backups"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Find latest remote backup directory
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
ls -1dt {{ remote_backup_base }}/* | head -n 1
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: latest_backup_dir
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Create local backup directory on pvef
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ local_backup_base }}/{{ inventory_hostname }}"
|
||||||
|
state: directory
|
||||||
|
mode: "0700"
|
||||||
|
delegate_to: localhost
|
||||||
|
become: false
|
||||||
|
|
||||||
|
- name: Archive latest backup directory on remote node
|
||||||
|
ansible.builtin.archive:
|
||||||
|
path: "{{ latest_backup_dir.stdout }}"
|
||||||
|
dest: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
|
||||||
|
format: gz
|
||||||
|
force_archive: true
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Fetch archive to pvef
|
||||||
|
ansible.builtin.fetch:
|
||||||
|
src: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
|
||||||
|
dest: "{{ local_backup_base }}/{{ inventory_hostname }}/"
|
||||||
|
flat: true
|
||||||
|
|
||||||
|
- name: Remove temporary remote archive
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
|
||||||
|
state: absent
|
||||||
@@ -0,0 +1,58 @@
---
- name: Bootstrap Ansible SSH access from pvef to Slurm nodes
  hosts: slurm_cluster
  gather_facts: false
  become: true

  vars:
    ansible_controller_pubkey: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_ed25519.pub') }}"

  pre_tasks:
    - name: Wait for SSH
      ansible.builtin.wait_for_connection:
        timeout: 30

    - name: Install Python if missing - Debian/Ubuntu
      ansible.builtin.raw: |
        test -e /usr/bin/python3 || (apt-get update && apt-get install -y python3)
      changed_when: false

  tasks:
    - name: Ensure sudo and OpenSSH server are installed
      ansible.builtin.apt:
        name:
          - sudo
          - openssh-server
        state: present
        update_cache: true

    - name: Ensure SSH server is enabled and running
      ansible.builtin.service:
        name: ssh
        state: started
        enabled: true

    - name: Ensure .ssh directory exists for login user
      ansible.builtin.file:
        path: "/home/{{ ansible_user }}/.ssh"
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: "0700"

    - name: Add pvef root public key to login user's authorized_keys
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"
        key: "{{ ansible_controller_pubkey }}"
        state: present
        manage_dir: true

    - name: Allow bootstrap login user passwordless sudo
      ansible.builtin.copy:
        dest: "/etc/sudoers.d/90-ansible-{{ ansible_user }}"
        owner: root
        group: root
        mode: "0440"
        content: |
          {{ ansible_user }} ALL=(ALL) NOPASSWD:ALL
        validate: "visudo -cf %s"
@@ -0,0 +1,16 @@
---
- name: Configure /etc/hosts for Slurm cluster
  hosts: slurm_cluster
  become: true
  gather_facts: false

  tasks:
    - name: Add Slurm cluster hosts to /etc/hosts
      ansible.builtin.blockinfile:
        path: /etc/hosts
        marker: "# {mark} ANSIBLE MANAGED SLURM CLUSTER HOSTS"
        block: |
          {{ slurm_control_addr }} {{ slurm_control_machine }}
          {% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
          {{ node.addr }} {{ node.name }}
          {% endfor %}
@@ -0,0 +1,218 @@
---
- name: Create slurmuser and generate SSH keys on every Slurm node
  hosts: slurm_cluster
  become: true
  gather_facts: true

  vars:
    slurm_operator_user: slurmuser
    slurm_operator_shell: /bin/bash

  tasks:
    - name: Ensure useful packages are installed
      ansible.builtin.apt:
        name:
          - sudo
          - openssh-client
          - openssh-server
          - acl
        state: present
        update_cache: true

    - name: Ensure slurmuser exists
      ansible.builtin.user:
        name: "{{ slurm_operator_user }}"
        shell: "{{ slurm_operator_shell }}"
        create_home: true
        state: present

    - name: Ensure .ssh directory exists for slurmuser
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh"
        state: directory
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0700"

    - name: Generate SSH key for slurmuser if missing
      community.crypto.openssh_keypair:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
        type: ed25519
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0600"
        comment: "{{ slurm_operator_user }}@{{ inventory_hostname }}"
        force: false

    - name: Read public key from each node
      ansible.builtin.slurp:
        src: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
      register: slurmuser_pubkey_raw

    - name: Store decoded public key as host fact
      ansible.builtin.set_fact:
        slurmuser_pubkey: "{{ slurmuser_pubkey_raw.content | b64decode | trim }}"


- name: Exchange slurmuser SSH keys across all Slurm nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Install all slurmuser public keys into authorized_keys on every node
      ansible.posix.authorized_key:
        user: "{{ slurm_operator_user }}"
        key: "{{ hostvars[item].slurmuser_pubkey }}"
        state: present
        manage_dir: true
      loop: "{{ groups['slurm_cluster'] }}"

    - name: Build SSH known_hosts entries for all cluster nodes
      ansible.builtin.shell: |
        set -e
        mkdir -p /home/{{ slurm_operator_user }}/.ssh
        touch /home/{{ slurm_operator_user }}/.ssh/known_hosts

        {% for host in groups['slurm_cluster'] %}
        ssh-keyscan -H {{ host }} {{ hostvars[host].ansible_host }} 2>/dev/null >> /home/{{ slurm_operator_user }}/.ssh/known_hosts || true
        {% endfor %}

        sort -u /home/{{ slurm_operator_user }}/.ssh/known_hosts -o /home/{{ slurm_operator_user }}/.ssh/known_hosts
        chown {{ slurm_operator_user }}:{{ slurm_operator_user }} /home/{{ slurm_operator_user }}/.ssh/known_hosts
        chmod 0644 /home/{{ slurm_operator_user }}/.ssh/known_hosts
      args:
        executable: /bin/bash
      changed_when: true

    - name: Ensure SSH permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh"
        state: directory
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0700"

    - name: Ensure private key permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0600"

    - name: Ensure public key permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0644"


- name: Configure sudo permissions for slurmuser
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Configure sudoers for slurmuser on Slurm controller
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          # Operator access for Slurm controller node.
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
            /bin/systemctl status slurmctld, \
            /bin/systemctl restart slurmctld, \
            /bin/systemctl reload slurmctld, \
            /bin/systemctl stop slurmctld, \
            /bin/systemctl start slurmctld, \
            /bin/systemctl status slurmd, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl stop slurmd, \
            /bin/systemctl start slurmd, \
            /bin/journalctl -u slurmctld, \
            /bin/journalctl -u slurmd, \
            /usr/bin/scontrol, \
            /usr/bin/sinfo, \
            /usr/bin/squeue, \
            /usr/bin/scancel, \
            /usr/bin/sacct, \
            /usr/bin/sacctmgr, \
            /usr/bin/sbatch, \
            /usr/bin/srun, \
            /usr/bin/salloc
        validate: "visudo -cf %s"
      when: inventory_hostname in groups['slurm_controller']

    - name: Configure sudoers for slurmuser on Slurm compute and GPU nodes
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          # Operator access for Slurm worker/GPU nodes.
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
            /bin/systemctl status slurmd, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl stop slurmd, \
            /bin/systemctl start slurmd, \
            /bin/journalctl -u slurmd, \
            /usr/bin/scontrol, \
            /usr/bin/sinfo, \
            /usr/bin/squeue, \
            /usr/bin/scancel, \
            /usr/bin/sacct, \
            /usr/bin/sbatch, \
            /usr/bin/srun, \
            /usr/bin/salloc
        validate: "visudo -cf %s"
      when: inventory_hostname not in groups['slurm_controller']


- name: Validate slurmuser SSH mesh and Slurm access
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Test local Slurm commands as slurmuser
      ansible.builtin.command: "sudo -iu {{ slurm_operator_user }} sinfo"
      register: sinfo_test
      changed_when: false
      failed_when: sinfo_test.rc != 0

    - name: Show sinfo result
      ansible.builtin.debug:
        var: sinfo_test.stdout_lines

    - name: Test SSH from each node to every other node as slurmuser
      ansible.builtin.shell: |
        set -e
        {% for host in groups['slurm_cluster'] %}
        ssh -o BatchMode=yes -o ConnectTimeout=5 {{ host }} 'hostname'
        {% endfor %}
      args:
        executable: /bin/bash
      become_user: "{{ slurm_operator_user }}"
      register: ssh_mesh_test
      changed_when: false

    - name: Show SSH mesh test result
      ansible.builtin.debug:
        var: ssh_mesh_test.stdout_lines
@@ -0,0 +1,112 @@
---
- name: Fix sudo permissions for slurmuser Slurm operations
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Configure sudoers for slurmuser on controller
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible

          Cmnd_Alias SLURM_SYSTEMCTL_CONTROLLER = \
            /bin/systemctl status slurmctld, \
            /bin/systemctl status slurmctld *, \
            /bin/systemctl restart slurmctld, \
            /bin/systemctl reload slurmctld, \
            /bin/systemctl start slurmctld, \
            /bin/systemctl stop slurmctld, \
            /bin/systemctl status slurmd, \
            /bin/systemctl status slurmd *, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl start slurmd, \
            /bin/systemctl stop slurmd, \
            /usr/bin/systemctl status slurmctld, \
            /usr/bin/systemctl status slurmctld *, \
            /usr/bin/systemctl restart slurmctld, \
            /usr/bin/systemctl reload slurmctld, \
            /usr/bin/systemctl start slurmctld, \
            /usr/bin/systemctl stop slurmctld, \
            /usr/bin/systemctl status slurmd, \
            /usr/bin/systemctl status slurmd *, \
            /usr/bin/systemctl restart slurmd, \
            /usr/bin/systemctl reload slurmd, \
            /usr/bin/systemctl start slurmd, \
            /usr/bin/systemctl stop slurmd

          Cmnd_Alias SLURM_JOURNAL_CONTROLLER = \
            /bin/journalctl -u slurmctld, \
            /bin/journalctl -u slurmctld *, \
            /bin/journalctl -u slurmd, \
            /bin/journalctl -u slurmd *, \
            /usr/bin/journalctl -u slurmctld, \
            /usr/bin/journalctl -u slurmctld *, \
            /usr/bin/journalctl -u slurmd, \
            /usr/bin/journalctl -u slurmd *

          Cmnd_Alias SLURM_COMMANDS = \
            /usr/bin/scontrol, /usr/bin/scontrol *, \
            /usr/bin/sinfo, /usr/bin/sinfo *, \
            /usr/bin/squeue, /usr/bin/squeue *, \
            /usr/bin/scancel, /usr/bin/scancel *, \
            /usr/bin/sacct, /usr/bin/sacct *, \
            /usr/bin/sacctmgr, /usr/bin/sacctmgr *, \
            /usr/bin/sbatch, /usr/bin/sbatch *, \
            /usr/bin/srun, /usr/bin/srun *, \
            /usr/bin/salloc, /usr/bin/salloc *

          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_CONTROLLER, SLURM_JOURNAL_CONTROLLER, SLURM_COMMANDS
        validate: "visudo -cf %s"
      when: inventory_hostname in groups['slurm_controller']

    - name: Configure sudoers for slurmuser on compute and GPU nodes
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible

          Cmnd_Alias SLURM_SYSTEMCTL_COMPUTE = \
            /bin/systemctl status slurmd, \
            /bin/systemctl status slurmd *, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl start slurmd, \
            /bin/systemctl stop slurmd, \
            /usr/bin/systemctl status slurmd, \
            /usr/bin/systemctl status slurmd *, \
            /usr/bin/systemctl restart slurmd, \
            /usr/bin/systemctl reload slurmd, \
            /usr/bin/systemctl start slurmd, \
            /usr/bin/systemctl stop slurmd

          Cmnd_Alias SLURM_JOURNAL_COMPUTE = \
            /bin/journalctl -u slurmd, \
            /bin/journalctl -u slurmd *, \
            /usr/bin/journalctl -u slurmd, \
            /usr/bin/journalctl -u slurmd *

          Cmnd_Alias SLURM_COMMANDS = \
            /usr/bin/scontrol, /usr/bin/scontrol *, \
            /usr/bin/sinfo, /usr/bin/sinfo *, \
            /usr/bin/squeue, /usr/bin/squeue *, \
            /usr/bin/scancel, /usr/bin/scancel *, \
            /usr/bin/sacct, /usr/bin/sacct *, \
            /usr/bin/sbatch, /usr/bin/sbatch *, \
            /usr/bin/srun, /usr/bin/srun *, \
            /usr/bin/salloc, /usr/bin/salloc *

          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_COMPUTE, SLURM_JOURNAL_COMPUTE, SLURM_COMMANDS
        validate: "visudo -cf %s"
      when: inventory_hostname not in groups['slurm_controller']
@@ -0,0 +1,133 @@
---
- name: Read Munge key from Slurm controller
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Check controller munge.key exists
      ansible.builtin.stat:
        path: /etc/munge/munge.key
      register: controller_munge_key

    - name: Fail if controller munge.key is missing
      ansible.builtin.fail:
        msg: "/etc/munge/munge.key is missing on controller. Do not continue."
      when: not controller_munge_key.stat.exists

    - name: Read controller munge.key
      ansible.builtin.slurp:
        src: /etc/munge/munge.key
      register: controller_munge_key_raw

    - name: Store controller Munge key as fact
      ansible.builtin.set_fact:
        cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"


- name: Deploy controller Munge key to all Slurm nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    controller_host: "{{ groups['slurm_controller'][0] }}"

  tasks:
    - name: Ensure munge package is installed
      ansible.builtin.apt:
        name:
          - munge
          - libmunge2
        state: present
        update_cache: true

    - name: Ensure munge group exists
      ansible.builtin.group:
        name: munge
        system: true
        state: present

    - name: Ensure munge user exists
      ansible.builtin.user:
        name: munge
        group: munge
        system: true
        shell: /usr/sbin/nologin
        home: /nonexistent
        create_home: false
        state: present

    - name: Ensure /etc/munge exists
      ansible.builtin.file:
        path: /etc/munge
        state: directory
        owner: munge
        group: munge
        mode: "0700"

    - name: Deploy shared munge.key from controller
      ansible.builtin.copy:
        dest: /etc/munge/munge.key
        content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
        owner: munge
        group: munge
        mode: "0400"
      notify:
        - Restart munge

    - name: Ensure /var/log/munge exists
      ansible.builtin.file:
        path: /var/log/munge
        state: directory
        owner: munge
        group: munge
        mode: "0755"

    - name: Ensure /var/lib/munge exists
      ansible.builtin.file:
        path: /var/lib/munge
        state: directory
        owner: munge
        group: munge
        mode: "0711"

    - name: Ensure /run/munge exists
      ansible.builtin.file:
        path: /run/munge
        state: directory
        owner: munge
        group: munge
        mode: "0755"

    - name: Ensure munge is enabled and running
      ansible.builtin.systemd:
        name: munge
        enabled: true
        state: started

  handlers:
    - name: Restart munge
      ansible.builtin.systemd:
        name: munge
        state: restarted


- name: Validate Munge locally on all nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false

  tasks:
    - name: Test local munge encode/decode
      ansible.builtin.shell: |
        set -euo pipefail
        munge -n | unmunge
      args:
        executable: /bin/bash
      register: munge_local_test
      changed_when: false

    - name: Show local Munge validation
      ansible.builtin.debug:
        var: munge_local_test.stdout_lines
@@ -0,0 +1,132 @@
---
- name: Prepare Slurm config directories and logs
  hosts: slurm_cluster
  become: true
  gather_facts: false

  tasks:
    - name: Ensure Slurm config directory exists
      ansible.builtin.file:
        path: "{{ slurm_config_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0755"

    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"

    - name: Ensure slurmctld spool directory exists on controller
      ansible.builtin.file:
        path: /var/spool/slurmctld
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
      when: inventory_hostname in groups['slurm_controller']

    - name: Ensure slurmd spool directory exists on workers
      ansible.builtin.file:
        path: /var/spool/slurmd
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']


- name: Deploy Slurm config files
  hosts: slurm_cluster
  become: true
  gather_facts: false

  tasks:
    - name: Backup current slurm.conf before managed deployment
      ansible.builtin.copy:
        src: "{{ slurm_config_dir }}/slurm.conf"
        dest: "{{ slurm_config_dir }}/slurm.conf.pre-ansible-managed"
        remote_src: true
        owner: root
        group: root
        mode: "0644"
        force: false

    - name: Deploy managed slurm.conf
      ansible.builtin.template:
        src: ../../templates/slurm.conf.j2
        dest: "{{ slurm_config_dir }}/slurm.conf"
        owner: root
        group: root
        mode: "0644"
      notify:
        - Reconfigure slurmctld
        - Restart slurmd

    - name: Deploy managed cgroup.conf
      ansible.builtin.template:
        src: ../../templates/cgroup.conf.j2
        dest: "{{ slurm_config_dir }}/cgroup.conf"
        owner: root
        group: root
        mode: "0644"
      when: slurm_enable_cgroup | default(false) | bool
      notify:
        - Reconfigure slurmctld
        - Restart slurmd

    - name: Deploy managed gres.conf only on GPU nodes
      ansible.builtin.template:
        src: ../../templates/gres.conf.j2
        dest: "{{ slurm_config_dir }}/gres.conf"
        owner: root
        group: root
        mode: "0644"
      when: inventory_hostname in groups['slurm_gpu']
      notify:
        - Reconfigure slurmctld
        - Restart slurmd

  handlers:
    - name: Reconfigure slurmctld
      ansible.builtin.command:
        cmd: scontrol reconfigure
      when: inventory_hostname in groups['slurm_controller']
      changed_when: true

    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']


- name: Validate Slurm after config deployment
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Reconfigure controller
      ansible.builtin.command:
        cmd: scontrol reconfigure
      changed_when: true

    - name: Validate cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        scontrol show nodes
      args:
        executable: /bin/bash
      register: slurm_config_validation
      changed_when: false

    - name: Show validation output
      ansible.builtin.debug:
        var: slurm_config_validation.stdout_lines
@@ -0,0 +1,103 @@
---
- name: Restart Slurm controller safely
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Restart munge on controller
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmctld on controller
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
        enabled: true

    - name: Wait for slurmctld to answer
      ansible.builtin.command:
        cmd: scontrol ping
      register: scontrol_ping
      retries: 15
      delay: 2
      until: scontrol_ping.rc == 0
      changed_when: false

    - name: Show controller ping
      ansible.builtin.debug:
        var: scontrol_ping.stdout_lines


- name: Restart Slurm workers safely one by one
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1

  tasks:
    - name: Restart munge on worker
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmd on worker
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Wait for slurmd to be active
      ansible.builtin.command:
        cmd: systemctl is-active slurmd
      register: slurmd_active
      retries: 15
      delay: 2
      until: slurmd_active.stdout == "active"
      changed_when: false

    - name: Wait until this node is visible in Slurm
      ansible.builtin.command:
        cmd: scontrol show node {{ inventory_hostname }}
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: node_visible
      retries: 15
      delay: 2
      until: node_visible.rc == 0
      changed_when: false


- name: Validate Slurm after restart
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Validate Slurm cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### scontrol ping"
        scontrol ping

        echo
        echo "### sinfo"
        sinfo

        echo
        echo "### nodes"
        scontrol show nodes

        echo
        echo "### partitions"
        scontrol show partitions
      args:
        executable: /bin/bash
      register: slurm_validation
      changed_when: false

    - name: Show Slurm validation
      ansible.builtin.debug:
        var: slurm_validation.stdout_lines
@@ -0,0 +1,40 @@
---
- name: Discover node resources for Slurm config
  hosts: slurm_cluster
  become: true
  gather_facts: true

  tasks:
    - name: Discover CPU and memory
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST={{ inventory_hostname }}"
        echo "CPUS=$(nproc)"
        echo "REAL_MEMORY_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)"
        echo "SOCKETS=$(lscpu | awk -F: '/Socket\(s\)/ {gsub(/ /,"",$2); print $2}')"
        echo "CORES_PER_SOCKET=$(lscpu | awk -F: '/Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')"
        echo "THREADS_PER_CORE=$(lscpu | awk -F: '/Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')"
      args:
        executable: /bin/bash
      register: cpu_mem
      changed_when: false

    - name: Discover NVIDIA GPU if present
      ansible.builtin.shell: |
        set -euo pipefail
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: gpu_info
      changed_when: false

    - name: Show discovered resources
      ansible.builtin.debug:
        msg:
          - "{{ cpu_mem.stdout_lines }}"
          - "GPU:"
          - "{{ gpu_info.stdout_lines }}"
@@ -0,0 +1,89 @@
---
- name: Inspect current Slurm and Munge state
  hosts: slurm_cluster
  become: true
  gather_facts: true

  tasks:
    - name: Basic host info
      ansible.builtin.shell: |
        set -e
        echo "HOST=$(hostname -f 2>/dev/null || hostname)"
        echo "SHORT_HOST=$(hostname -s)"
        echo "IP_ADDRESSES=$(hostname -I)"
        echo "OS=$(lsb_release -ds 2>/dev/null || cat /etc/os-release | grep PRETTY_NAME || true)"
        echo "KERNEL=$(uname -r)"
      args:
        executable: /bin/bash
      register: host_info
      changed_when: false

    - name: Slurm package info
      ansible.builtin.shell: |
        dpkg -l | grep -Ei 'slurm|munge' || true
      args:
        executable: /bin/bash
      register: package_info
      changed_when: false

    - name: Slurm config paths
      ansible.builtin.shell: |
        set -e
        for p in /etc/slurm /etc/slurm-llnl /etc/munge; do
          echo "### $p"
          if [ -e "$p" ]; then
            find "$p" -maxdepth 2 -type f -printf "%m %u %g %p\n" | sort
          else
            echo "MISSING"
          fi
        done
      args:
        executable: /bin/bash
      register: config_paths
      changed_when: false

    - name: Service state
      ansible.builtin.shell: |
        for s in munge slurmctld slurmd; do
          echo "### $s"
          systemctl is-enabled "$s" 2>/dev/null || true
          systemctl is-active "$s" 2>/dev/null || true
        done
      args:
        executable: /bin/bash
      register: service_state
      changed_when: false

    - name: Slurm commands
      ansible.builtin.shell: |
        echo "### which"
        command -v sinfo || true
        command -v scontrol || true
        command -v sbatch || true
        command -v srun || true
        command -v munge || true
        command -v unmunge || true

        echo "### sinfo"
        sinfo 2>&1 || true

        echo "### scontrol ping"
        scontrol ping 2>&1 || true
      args:
        executable: /bin/bash
      register: slurm_commands
      changed_when: false

    - name: Show inspection report
      ansible.builtin.debug:
        msg:
          - "===== {{ inventory_hostname }} :: host_info ====="
          - "{{ host_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: packages ====="
          - "{{ package_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: config_paths ====="
          - "{{ config_paths.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: services ====="
          - "{{ service_state.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: slurm_commands ====="
          - "{{ slurm_commands.stdout_lines }}"
@@ -0,0 +1,216 @@
---
- name: Detect problematic Slurm nodes
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Detect nodes needing remediation
      ansible.builtin.shell: |
        set -euo pipefail

        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: bad_nodes_raw
      changed_when: false

    - name: Store bad node list
      ansible.builtin.set_fact:
        bad_nodes: "{{ bad_nodes_raw.stdout_lines }}"

    - name: Show detected problematic nodes
      ansible.builtin.debug:
        var: bad_nodes


- name: Attempt auto-remediation on problematic nodes
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1

  vars:
    bad_nodes_from_controller: "{{ hostvars[groups['slurm_controller'][0]].bad_nodes | default([]) }}"

  tasks:
    - name: Skip healthy nodes
      ansible.builtin.meta: end_host
      when: inventory_hostname not in bad_nodes_from_controller

    - name: Restart Munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Validate local services after remediation attempt
      ansible.builtin.shell: |
        set -euo pipefail

        echo "HOST=$(hostname)"

        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd

        echo
        echo "### munge"
        munge -n | unmunge >/dev/null
        echo "munge OK"

        echo
        echo "### controller ping"
        scontrol ping

        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true

        echo
        echo "### recent slurmd logs"
        journalctl -u slurmd -n 30 --no-pager || true
      args:
        executable: /bin/bash
      register: local_repair_check
      changed_when: false

    - name: Print local remediation result
      ansible.builtin.debug:
||||||
|
var: local_repair_check.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Refresh controller and validate remediated nodes
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Restart slurmctld to refresh node states
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmctld
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: Wait for controller
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol ping
|
||||||
|
register: slurmctld_ping
|
||||||
|
retries: 15
|
||||||
|
delay: 2
|
||||||
|
until: slurmctld_ping.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Clear maintenance state on previously bad nodes
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
|
||||||
|
|
||||||
|
if [ -z "$bad_nodes" ]; then
|
||||||
|
echo "No bad nodes detected. Nothing to clear."
|
||||||
|
sinfo -N
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
for node in $bad_nodes; do
|
||||||
|
echo "### clearing state on $node"
|
||||||
|
scontrol update NodeName="$node" State=RESUME 2>/dev/null || true
|
||||||
|
scontrol update NodeName="$node" State=UNDRAIN 2>/dev/null || true
|
||||||
|
scontrol update NodeName="$node" State=IDLE 2>/dev/null || true
|
||||||
|
done
|
||||||
|
|
||||||
|
sleep 5
|
||||||
|
sinfo -N
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: clear_result
|
||||||
|
changed_when: bad_nodes | default([]) | length > 0
|
||||||
|
|
||||||
|
- name: Print clear-state result
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: clear_result.stdout_lines
|
||||||
|
|
||||||
|
- name: Detect nodes still unhealthy after remediation
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
sinfo -N -h -o "%N %T" | awk '
|
||||||
|
tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
|
||||||
|
' | sort -u
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: still_bad_nodes_raw
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Store still bad nodes
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
still_bad_nodes: "{{ still_bad_nodes_raw.stdout_lines }}"
|
||||||
|
|
||||||
|
- name: Drain nodes that remain unhealthy
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
unresolved_nodes="{{ still_bad_nodes | join(' ') }}"
|
||||||
|
|
||||||
|
if [ -z "$unresolved_nodes" ]; then
|
||||||
|
echo "No unresolved unhealthy nodes."
|
||||||
|
sinfo -N
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
|
||||||
|
for node in $unresolved_nodes; do
|
||||||
|
echo "### draining unresolved node $node"
|
||||||
|
scontrol update NodeName="$node" State=DRAIN Reason="auto-remediation failed"
|
||||||
|
done
|
||||||
|
|
||||||
|
sinfo -N
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: drain_unresolved
|
||||||
|
changed_when: still_bad_nodes | length > 0
|
||||||
|
|
||||||
|
- name: Show remediation summary
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### initial bad nodes"
|
||||||
|
bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
|
||||||
|
if [ -z "$bad_nodes" ]; then
|
||||||
|
echo "none"
|
||||||
|
else
|
||||||
|
printf '%s\n' $bad_nodes
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### still bad nodes"
|
||||||
|
still_bad_nodes="{{ (still_bad_nodes | default([])) | join(' ') }}"
|
||||||
|
if [ -z "$still_bad_nodes" ]; then
|
||||||
|
echo "none"
|
||||||
|
else
|
||||||
|
printf '%s\n' $still_bad_nodes
|
||||||
|
fi
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### final sinfo"
|
||||||
|
sinfo -N
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### queue"
|
||||||
|
squeue
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: remediation_summary
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print remediation summary
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: remediation_summary.stdout_lines
|
||||||
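Both detection tasks in this playbook rely on one awk filter over `sinfo -N -h -o "%N %T"` output. A minimal sketch of that filter as a standalone function, fed with made-up sample lines rather than a live `sinfo` call; the function name `filter_bad_nodes` and the node names are illustrative assumptions:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: reads "<node> <state>" lines on stdin and prints
# the names of nodes whose state matches the playbook's "bad" pattern.
# "idle\*" matches only idle nodes flagged with a trailing "*"
# (not responding), so plain "idle" nodes are left alone.
filter_bad_nodes() {
  awk 'tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}' | sort -u
}

# Sample input standing in for: sinfo -N -h -o "%N %T"
# (a node can appear once per partition, hence the duplicate line)
printf '%s\n' \
  "node01 idle" \
  "node02 down*" \
  "node03 idle*" \
  "node04 mixed" \
  "node02 down*" | filter_bad_nodes
```

With this sample the function prints `node02` and `node03`; `sort -u` collapses the duplicate entry for `node02`.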
@@ -0,0 +1,149 @@
|
|||||||
|
---
|
||||||
|
- name: Check Slurm controller health
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Check controller services and cluster state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### controller services"
|
||||||
|
systemctl is-active munge
|
||||||
|
systemctl is-active slurmctld
|
||||||
|
systemctl is-active slurmdbd || true
|
||||||
|
systemctl is-active mariadb || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### slurm ping"
|
||||||
|
scontrol ping
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### nodes"
|
||||||
|
sinfo -N
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### partitions"
|
||||||
|
sinfo
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### queue"
|
||||||
|
squeue
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### problematic nodes"
|
||||||
|
sinfo -N -h -o "%N %T %E" | awk '$2 !~ /idle|alloc|mix/ {print}' || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### accounting"
|
||||||
|
sacctmgr -n list cluster || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### recent failed jobs"
|
||||||
|
sacct -S today --state=FAILED,CANCELLED,TIMEOUT,NODE_FAIL,OUT_OF_MEMORY \
|
||||||
|
--format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,NodeList | tail -30 || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: controller_health
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print controller health
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: controller_health.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Check Slurm worker health
|
||||||
|
hosts: slurm_compute:slurm_gpu
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Check worker services, config and connectivity
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "HOST=$(hostname)"
|
||||||
|
echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
|
||||||
|
echo "KERNEL=$(uname -r)"
|
||||||
|
echo "UPTIME=$(uptime -p)"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### services"
|
||||||
|
systemctl is-active munge
|
||||||
|
systemctl is-active slurmd
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### munge local test"
|
||||||
|
munge -n | unmunge >/dev/null
|
||||||
|
echo "munge OK"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### controller connectivity"
|
||||||
|
getent hosts slurm-ctl01 || true
|
||||||
|
scontrol ping
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### slurmd listener"
|
||||||
|
ss -lntp | grep ':6818 ' || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### config checksums"
|
||||||
|
sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### shared filesystem"
|
||||||
|
test -d /shared
|
||||||
|
touch /shared/.slurm-health-$(hostname)
|
||||||
|
ls -l /shared/.slurm-health-$(hostname)
|
||||||
|
rm -f /shared/.slurm-health-$(hostname)
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### cgroup"
|
||||||
|
mount | grep cgroup || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### gpu check"
|
||||||
|
if command -v nvidia-smi >/dev/null 2>&1; then
|
||||||
|
nvidia-smi --query-gpu=index,name,driver_version,memory.total,temperature.gpu,utilization.gpu --format=csv,noheader || true
|
||||||
|
else
|
||||||
|
echo "NO_NVIDIA_SMI"
|
||||||
|
fi
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: worker_health
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print worker health
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: worker_health.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Check Slurm-reported node state consistency
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Build Slurm node health summary
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### node summary"
|
||||||
|
sinfo -N -o "%N %P %T %C %m %G %E"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### full problematic node details"
|
||||||
|
for node in $(sinfo -N -h -o "%N %T" | awk 'tolower($2) ~ /down|drain|fail|unk|not_responding|idle\*/ {print $1}' | sort -u); do
|
||||||
|
echo
|
||||||
|
echo "### $node"
|
||||||
|
scontrol show node "$node"
|
||||||
|
done
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: slurm_node_summary
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print Slurm node summary
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: slurm_node_summary.stdout_lines
|
||||||
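The worker check above verifies the shared filesystem by touching and removing a file named after the host. A minimal sketch of the same write probe as a reusable function, using `mktemp` so that two concurrent checks cannot collide; the function name `probe_shared_dir` is an illustrative assumption, and the demo runs against a throwaway directory rather than `/shared`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: verifies that a directory is writable by creating
# and removing a uniquely named probe file. Returns non-zero on failure.
probe_shared_dir() {
  local dir="$1" probe
  [ -d "$dir" ] || { echo "missing directory: $dir" >&2; return 1; }
  probe="$(mktemp "$dir/.slurm-health-XXXXXX")" || return 1
  rm -f -- "$probe"
  echo "shared filesystem OK: $dir"
}

# Demo against a temporary directory; on a worker this would be /shared.
demo_dir="$(mktemp -d)"
probe_shared_dir "$demo_dir"
rmdir "$demo_dir"
```

Returning a non-zero status instead of relying on `set -e` alone lets a caller decide whether a failed probe should abort the whole health check or only be reported.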
@@ -0,0 +1,217 @@
|
|||||||
|
---
|
||||||
|
- name: Validate target node
|
||||||
|
hosts: localhost
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Require target_node
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Use: ansible-playbook repair-slurm-node.yml -e target_node=<hostname>"
|
||||||
|
when: target_node is not defined
|
||||||
|
|
||||||
|
- name: Ensure target_node is in inventory
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "target_node={{ target_node }} is not in Ansible inventory"
|
||||||
|
when: target_node not in groups['all']
|
||||||
|
|
||||||
|
|
||||||
|
- name: Capture node state before repair
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Show target node state before repair
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### sinfo"
|
||||||
|
sinfo -N -n {{ target_node }} || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### scontrol"
|
||||||
|
scontrol show node {{ target_node }} || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### jobs"
|
||||||
|
squeue -w {{ target_node }} || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: node_state_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print target node state before repair
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: node_state_before.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Repair local services on target node
|
||||||
|
hosts: "{{ target_node }}"
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Restart Munge
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Restart slurmd
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
|
when:
|
||||||
|
- inventory_hostname in groups.get('slurm_compute', []) or inventory_hostname in groups.get('slurm_gpu', [])
|
||||||
|
|
||||||
|
- name: Validate local repair
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### services"
|
||||||
|
systemctl is-active munge
|
||||||
|
systemctl is-active slurmd
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### munge"
|
||||||
|
munge -n | unmunge >/dev/null
|
||||||
|
echo "munge OK"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### controller ping"
|
||||||
|
scontrol ping
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### slurmd listener"
|
||||||
|
ss -lntp | grep ':6818 ' || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### recent slurmd logs"
|
||||||
|
journalctl -u slurmd -n 40 --no-pager || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: local_repair_state
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print local repair state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: local_repair_state.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Clear Slurm maintenance/down state after repair
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Restart controller to refresh node state
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmctld
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: Wait for controller
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol ping
|
||||||
|
register: slurmctld_ping
|
||||||
|
retries: 15
|
||||||
|
delay: 2
|
||||||
|
until: slurmctld_ping.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Clear target node state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
scontrol update NodeName={{ target_node }} State=RESUME 2>/dev/null || true
|
||||||
|
scontrol update NodeName={{ target_node }} State=UNDRAIN 2>/dev/null || true
|
||||||
|
scontrol update NodeName={{ target_node }} State=IDLE 2>/dev/null || true
|
||||||
|
|
||||||
|
sleep 5
|
||||||
|
|
||||||
|
sinfo -N -n {{ target_node }}
|
||||||
|
scontrol show node {{ target_node }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: clear_state
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Wait until node is healthy
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sinfo -N -n {{ target_node }}
|
||||||
|
scontrol show node {{ target_node }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: node_health_after
|
||||||
|
retries: 30
|
||||||
|
delay: 5
|
||||||
|
until:
|
||||||
|
- node_health_after.rc == 0
|
||||||
|
- "'not_responding' not in node_health_after.stdout.lower()"
|
||||||
|
- "'down' not in node_health_after.stdout.lower()"
|
||||||
|
- "'drain' not in node_health_after.stdout.lower()"
|
||||||
|
- "'idle*' not in node_health_after.stdout.lower()"
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print node state after repair
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: node_health_after.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Submit repair validation job
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Submit validation job to repaired node
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
job_id="$(
|
||||||
|
sudo -iu slurmuser sbatch --parsable <<SBATCH
|
||||||
|
#!/bin/bash
|
||||||
|
#SBATCH --job-name=repair-node-test
|
||||||
|
#SBATCH --partition=all
|
||||||
|
#SBATCH --nodelist={{ target_node }}
|
||||||
|
#SBATCH --cpus-per-task=1
|
||||||
|
#SBATCH --mem=256M
|
||||||
|
#SBATCH --time=00:02:00
|
||||||
|
#SBATCH --account=lab
|
||||||
|
#SBATCH --qos=normal
|
||||||
|
#SBATCH --output=/shared/repair-node-test-%j.out
|
||||||
|
|
||||||
|
echo "HOST=\$(hostname)"
|
||||||
|
echo "USER=\$(whoami)"
|
||||||
|
echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
|
||||||
|
echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
|
||||||
|
echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
|
||||||
|
date
|
||||||
|
SBATCH
|
||||||
|
)"
|
||||||
|
|
||||||
|
echo "JOB_ID=$job_id"
|
||||||
|
|
||||||
|
for i in $(seq 1 90); do
|
||||||
|
if squeue -h -j "$job_id" | grep -q .; then
|
||||||
|
squeue -j "$job_id"
|
||||||
|
sleep 1
|
||||||
|
else
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
echo "### sacct"
|
||||||
|
sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList
|
||||||
|
|
||||||
|
echo "### output"
|
||||||
|
cat "/shared/repair-node-test-${job_id}.out"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: repair_validation_job
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Print repair validation job
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: repair_validation_job.stdout_lines
|
||||||
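The validation job above waits for the job to leave the queue with a fixed 90-iteration loop and then continues whether or not the job finished. A minimal sketch of a polling helper that reports a timeout through its exit status; `wait_until_empty` and `fake_squeue` are hypothetical names, and `fake_squeue` only stands in for `squeue -h -j "$job_id"`:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: runs the given command up to <attempts> times,
# sleeping <delay> seconds between tries, until it prints nothing.
# Returns 0 once the output is empty, 1 if attempts run out.
wait_until_empty() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if [ -z "$("$@" 2>/dev/null || true)" ]; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Stand-in for squeue: reports a job for the first two polls only.
state_file="$(mktemp)"
echo 2 > "$state_file"
fake_squeue() {
  local left
  left="$(cat "$state_file")"
  if [ "$left" -gt 0 ]; then
    echo "12345 all repair-node-test R"
    echo $((left - 1)) > "$state_file"
  fi
}

if wait_until_empty 5 0 fake_squeue; then
  echo "job left the queue"
else
  echo "timed out waiting for job" >&2
fi
rm -f "$state_file"
```

In the playbook context the call would look like `wait_until_empty 90 1 squeue -h -j "$job_id"`, followed by an explicit failure if it returns non-zero.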
@@ -0,0 +1,126 @@
|
|||||||
|
---
|
||||||
|
- name: Validate target_node variable
|
||||||
|
hosts: localhost
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Require target_node
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Use: ansible-playbook decommission-slurm-node.yml -e target_node=<hostname> [-e decom_reason='reason']"
|
||||||
|
when: target_node is not defined
|
||||||
|
|
||||||
|
- name: Ensure target_node is in inventory
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "target_node={{ target_node }} is not in Ansible inventory"
|
||||||
|
when: target_node not in groups['all']
|
||||||
|
|
||||||
|
|
||||||
|
- name: Drain target node and wait for jobs to leave
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
decom_reason_effective: "{{ decom_reason | default('decommission by Ansible') }}"
|
||||||
|
decom_wait_retries_effective: "{{ decom_wait_retries | default(120) }}"
|
||||||
|
decom_wait_delay_effective: "{{ decom_wait_delay | default(10) }}"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Show current target node state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sinfo -N -n {{ target_node }} || true
|
||||||
|
scontrol show node {{ target_node }} || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: node_state_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print current target node state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: node_state_before.stdout_lines
|
||||||
|
|
||||||
|
- name: Drain target node
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol update NodeName={{ target_node }} State=DRAIN Reason="{{ decom_reason_effective }}"
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Wait until no jobs are running on target node
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
squeue -h -w {{ target_node }} || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: jobs_on_node
|
||||||
|
retries: "{{ decom_wait_retries_effective | int }}"
|
||||||
|
delay: "{{ decom_wait_delay_effective | int }}"
|
||||||
|
until: jobs_on_node.stdout | trim == ""
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show drained node state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sinfo -N -n {{ target_node }} || true
|
||||||
|
scontrol show node {{ target_node }} | grep -E "NodeName=|State=|Reason=" || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: node_state_drained
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print drained node state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: node_state_drained.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Stop Slurm worker service on target node
|
||||||
|
hosts: "{{ target_node }}"
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Stop slurmd
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
state: stopped
|
||||||
|
enabled: false
|
||||||
|
when:
|
||||||
|
- inventory_hostname in groups.get('slurm_compute', []) or inventory_hostname in groups.get('slurm_gpu', [])
|
||||||
|
|
||||||
|
- name: Show slurmd state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
systemctl is-enabled slurmd 2>/dev/null || true
|
||||||
|
systemctl is-active slurmd 2>/dev/null || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: slurmd_state_after
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print slurmd state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: slurmd_state_after.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Mark node down in Slurm controller
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Mark target node DOWN after service stop
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol update NodeName={{ target_node }} State=DOWN Reason="decommissioned"
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Show final node state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sinfo -N -n {{ target_node }} || true
|
||||||
|
scontrol show node {{ target_node }} | grep -E "NodeName=|State=|Reason=" || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: final_node_state
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print final node state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: final_node_state.stdout_lines
|
||||||
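The lifecycle playbooks interpolate `target_node` directly into shell and `scontrol` command lines. A minimal sketch of a guard that could run before such interpolation, assuming node names are plain hostnames made of letters, digits, dots, and hyphens; that naming rule and the function name `is_valid_node_name` are assumptions, not something the playbooks state:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: accepts only names made of letters, digits, dots
# and hyphens that start with a letter or digit.
is_valid_node_name() {
  [[ "$1" =~ ^[A-Za-z0-9][A-Za-z0-9.-]*$ ]]
}

for candidate in "gpu01" "compute-02.lab" "node01; reboot" ""; do
  if is_valid_node_name "$candidate"; then
    echo "accept: $candidate"
  else
    echo "reject: '$candidate'"
  fi
done
```

The existing inventory membership check already rejects unknown names; a pattern check like this is an extra layer for cases where inventory names themselves are not trusted.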
@@ -0,0 +1,246 @@
|
|||||||
|
---
|
||||||
|
- name: Validate target_node variable
|
||||||
|
hosts: localhost
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Require target_node
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Use: ansible-playbook provision-slurm-node.yml -e target_node=<hostname>"
|
||||||
|
when: target_node is not defined
|
||||||
|
|
||||||
|
- name: Ensure target_node is in inventory
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "target_node={{ target_node }} is not in Ansible inventory"
|
||||||
|
when: target_node not in groups['all']
|
||||||
|
|
||||||
|
|
||||||
|
- name: Prepare OS, packages and Slurm directories on target node
|
||||||
|
hosts: "{{ target_node }}"
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Ensure target is a Slurm worker or GPU node
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "{{ inventory_hostname }} must be in slurm_compute or slurm_gpu group"
|
||||||
|
when:
|
||||||
|
- inventory_hostname not in groups.get('slurm_compute', [])
|
||||||
|
- inventory_hostname not in groups.get('slurm_gpu', [])
|
||||||
|
|
||||||
|
- name: Install Slurm worker packages
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name:
|
||||||
|
- munge
|
||||||
|
- libmunge2
|
||||||
|
- slurm-client
|
||||||
|
- slurmd
|
||||||
|
- slurm-wlm-basic-plugins
|
||||||
|
- slurm-wlm-plugins
|
||||||
|
- slurm-wlm-mysql-plugin
|
||||||
|
state: present
|
||||||
|
update_cache: true
|
||||||
|
|
||||||
|
- name: Ensure Slurm config directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ slurm_config_dir }}"
|
||||||
|
state: directory
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure Slurm log directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/log/slurm
|
||||||
|
state: directory
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure slurmd spool directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/spool/slurmd
|
||||||
|
state: directory
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure munge dirs exist
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ item.path }}"
|
||||||
|
state: directory
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "{{ item.mode }}"
|
||||||
|
loop:
|
||||||
|
- { path: /etc/munge, mode: "0700" }
|
||||||
|
- { path: /var/log/munge, mode: "0755" }
|
||||||
|
- { path: /var/lib/munge, mode: "0711" }
|
||||||
|
- { path: /run/munge, mode: "0755" }
|
||||||
|
|
||||||
|
|
||||||
|
- name: Deploy Munge key from controller to target node
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Read controller munge.key
|
||||||
|
ansible.builtin.slurp:
|
||||||
|
src: /etc/munge/munge.key
|
||||||
|
register: controller_munge_key_raw
|
||||||
|
|
||||||
|
- name: Store controller Munge key as fact
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
|
||||||
|
|
||||||
|
|
||||||
|
- name: Configure target node with Munge and Slurm files
|
||||||
|
hosts: "{{ target_node }}"
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
controller_host: "{{ groups['slurm_controller'][0] }}"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Deploy shared munge.key
|
||||||
|
ansible.builtin.copy:
|
||||||
|
dest: /etc/munge/munge.key
|
||||||
|
content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "0400"
|
||||||
|
notify:
|
||||||
|
- Restart munge
|
||||||
|
|
||||||
|
- name: Deploy managed slurm.conf
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/slurm.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/slurm.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
notify:
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
- name: Deploy managed cgroup.conf
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/cgroup.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/cgroup.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
when: slurm_enable_cgroup | default(false) | bool
|
||||||
|
notify:
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
- name: Deploy managed gres.conf on GPU nodes
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/gres.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/gres.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
when: inventory_hostname in groups.get('slurm_gpu', [])
|
||||||
|
notify:
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
- name: Ensure munge is enabled and running
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
enabled: true
|
||||||
|
state: started
|
||||||
|
|
||||||
|
- name: Ensure slurmd is enabled and running
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
enabled: true
|
||||||
|
state: started
|
||||||
|
|
||||||
|
handlers:
|
||||||
|
- name: Restart munge
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: Restart slurmd
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
|
||||||
|
- name: Deploy updated Slurm config to whole cluster
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Deploy managed slurm.conf to all nodes
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/slurm.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/slurm.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
|
||||||
|
- name: Deploy managed cgroup.conf to all nodes
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/cgroup.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/cgroup.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
when: slurm_enable_cgroup | default(false) | bool
|
||||||
|
|
||||||
|
|
||||||
|
- name: Reconfigure Slurm and validate target node
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Reconfigure Slurm controller
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol reconfigure
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Restart Slurm controller after node reprovision
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmctld
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: Wait for Slurm controller after restart
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol ping
|
||||||
|
register: slurmctld_ping_after_restart
|
||||||
|
retries: 15
|
||||||
|
delay: 2
|
||||||
|
until: slurmctld_ping_after_restart.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Resume target node in Slurm
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol update NodeName={{ target_node }} State=RESUME
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Wait until target node is visible and not down
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
scontrol show node {{ target_node }}
|
||||||
|
sinfo -N -n {{ target_node }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: target_node_state
|
||||||
|
retries: 20
|
||||||
|
delay: 3
|
||||||
|
until:
|
||||||
|
- target_node_state.rc == 0
|
||||||
|
- "'down' not in target_node_state.stdout.lower()"
|
||||||
|
- "'not_responding' not in target_node_state.stdout.lower()"
|
||||||
|
- "'idle*' not in target_node_state.stdout.lower()"
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show target node state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: target_node_state.stdout_lines
|
||||||
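The final wait condition above searches the whole `scontrol show node` and `sinfo` output for substrings such as `down`, which can also match a reason text or a hostname. A minimal sketch that inspects only the `State=` field instead; the helper names `node_state` and `node_state_ok` are hypothetical, and the sample text is a trimmed, made-up stand-in for real `scontrol show node` output:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: prints the value of the first State= field found
# in `scontrol show node` style text on stdin.
node_state() {
  awk '{
    for (i = 1; i <= NF; i++)
      if (!seen && $i ~ /^State=/) { sub(/^State=/, "", $i); print $i; seen = 1 }
  }'
}

# Hypothetical helper: succeeds only if a state was found and it carries
# none of the unhealthy flags (including the "*" not-responding suffix).
node_state_ok() {
  local state
  state="$(node_state | tr '[:upper:]' '[:lower:]')"
  [ -n "$state" ] || return 1
  case "$state" in
    *down*|*drain*|*fail*|*not_responding*|*\**) return 1 ;;
    *) return 0 ;;
  esac
}

# Illustrative sample: the node is idle, but the reason text mentions "down".
sample='NodeName=gpu01 Arch=x86_64 CoresPerSocket=4
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   Reason=was down for maintenance'

if printf '%s\n' "$sample" | node_state_ok; then
  echo "healthy"
else
  echo "unhealthy"
fi
```

With this sample the script prints `healthy`, whereas a substring search over the full text would trip on the word "down" in the reason line.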
@@ -0,0 +1,33 @@
|
|||||||
|
---
|
||||||
|
- name: Show Slurm node state
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Require target_node
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Use: ansible-playbook show-slurm-node.yml -e target_node=<hostname>"
|
||||||
|
when: target_node is not defined
|
||||||
|
|
||||||
|
- name: Show node state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
echo "### sinfo"
|
||||||
|
sinfo -N -n {{ target_node }} || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### scontrol"
|
||||||
|
scontrol show node {{ target_node }} || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### jobs on node"
|
||||||
|
squeue -w {{ target_node }} || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: node_lifecycle_state
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print node lifecycle state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: node_lifecycle_state.stdout_lines
|
||||||
@@ -0,0 +1,169 @@
---
- name: Configure Slurm QOS, limits and fairshare
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Ensure sacctmgr is available
      ansible.builtin.command:
        cmd: sacctmgr -n list cluster
      changed_when: false

    - name: Validate accounting GPU TRES exists
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### configured AccountingStorageTRES"
        scontrol show config | grep -E "AccountingStorageTRES|AccountingStorageType|AccountingStorageEnforce"

        echo
        echo "### known TRES"
        sacctmgr show tres

        echo
        echo "### checking gres/gpu"
        sacctmgr -n show tres format=Type,Name | awk '$1=="gres" && $2=="gpu" {found=1} END {exit !found}'
      args:
        executable: /bin/bash
      register: gpu_tres_check
      changed_when: false

    - name: Ensure normal QOS exists
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i add qos normal Priority=100
      args:
        executable: /bin/bash
      register: add_qos_normal
      changed_when: "'Adding QOS' in (add_qos_normal.stdout + add_qos_normal.stderr)"
      failed_when: >
        add_qos_normal.rc != 0 and
        'Nothing new added' not in (add_qos_normal.stdout + add_qos_normal.stderr) and
        'already exists' not in (add_qos_normal.stdout + add_qos_normal.stderr) and
        'Already existing' not in (add_qos_normal.stdout + add_qos_normal.stderr)

    - name: Ensure debug-short QOS exists
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i add qos debug-short Priority=500
      args:
        executable: /bin/bash
      register: add_qos_debug
      changed_when: "'Adding QOS' in (add_qos_debug.stdout + add_qos_debug.stderr)"
      failed_when: >
        add_qos_debug.rc != 0 and
        'Nothing new added' not in (add_qos_debug.stdout + add_qos_debug.stderr) and
        'already exists' not in (add_qos_debug.stdout + add_qos_debug.stderr) and
        'Already existing' not in (add_qos_debug.stdout + add_qos_debug.stderr)

    - name: Ensure gpu-short QOS exists
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i add qos gpu-short Priority=1000
      args:
        executable: /bin/bash
      register: add_qos_gpu
      changed_when: "'Adding QOS' in (add_qos_gpu.stdout + add_qos_gpu.stderr)"
      failed_when: >
        add_qos_gpu.rc != 0 and
        'Nothing new added' not in (add_qos_gpu.stdout + add_qos_gpu.stderr) and
        'already exists' not in (add_qos_gpu.stdout + add_qos_gpu.stderr) and
        'Already existing' not in (add_qos_gpu.stdout + add_qos_gpu.stderr)

    - name: Ensure maintenance QOS exists
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i add qos maintenance Priority=5000
      args:
        executable: /bin/bash
      register: add_qos_maintenance
      changed_when: "'Adding QOS' in (add_qos_maintenance.stdout + add_qos_maintenance.stderr)"
      failed_when: >
        add_qos_maintenance.rc != 0 and
        'Nothing new added' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr) and
        'already exists' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr) and
        'Already existing' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr)

    - name: Normalize normal QOS settings
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify qos normal set Priority=100
      args:
        executable: /bin/bash
      changed_when: true

    - name: Normalize debug-short QOS settings
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify qos debug-short set Priority=500 MaxWall=00:10:00 MaxTRESPU=cpu=2 MaxJobsPU=4
      args:
        executable: /bin/bash
      changed_when: true

    - name: Normalize gpu-short QOS settings
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify qos gpu-short set Priority=1000 MaxWall=01:00:00 MaxTRESPU=gres/gpu=1,cpu=12 MaxJobsPU=2
      args:
        executable: /bin/bash
      changed_when: true

    - name: Normalize maintenance QOS settings
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify qos maintenance set Priority=5000 MaxWall=02:00:00
      args:
        executable: /bin/bash
      changed_when: true

    - name: Assign QOS set to lab account
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify account {{ slurm_account_name }} set QOS=normal,debug-short,gpu-short,maintenance DefaultQOS=normal Fairshare=100
      args:
        executable: /bin/bash
      changed_when: true

    - name: Assign default account to slurmuser
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
      args:
        executable: /bin/bash
      changed_when: true

    - name: Assign QOS set to slurmuser association
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify user where name=slurmuser account={{ slurm_account_name }} set QOS=normal,debug-short,gpu-short,maintenance DefaultQOS=normal Fairshare=100
      args:
        executable: /bin/bash
      changed_when: true

    - name: Show configured QOS and associations
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### TRES"
        sacctmgr show tres

        echo
        echo "### QOS"
        sacctmgr show qos format=Name%20,Priority,MaxWall,MaxTRESPU%40,MaxJobsPU

        echo
        echo "### Associations"
        sacctmgr show assoc format=Cluster,Account,User,Share,QOS%60,DefaultQOS,Fairshare

        echo
        echo "### Fairshare"
        sshare -A {{ slurm_account_name }} || true
      args:
        executable: /bin/bash
      register: qos_state
      changed_when: false

    - name: Print QOS state
      ansible.builtin.debug:
        var: qos_state.stdout_lines
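The four "Ensure ... QOS exists" tasks above share one `changed_when`/`failed_when` pattern. As a rough illustration of how that pattern resolves, the sketch below applies the same rules to a return code and a captured output string. The function name `classify_qos_add` is hypothetical, and the strings passed in the example call are illustrative inputs, not recorded `sacctmgr` output.

```shell
# Hypothetical helper (not from the playbook): apply the same rules as the
# changed_when / failed_when expressions to one "sacctmgr add qos" attempt.
# $1 = return code, $2 = combined stdout and stderr.
classify_qos_add() {
  rc=$1
  out=$2
  tolerated=no
  case "$out" in
    *"Nothing new added"*|*"already exists"*|*"Already existing"*) tolerated=yes ;;
  esac
  # failed_when: non-zero rc and none of the tolerated phrases present.
  if [ "$rc" -ne 0 ] && [ "$tolerated" = no ]; then
    echo failed
    return 1
  fi
  # changed_when: the output mentions that a QOS was added.
  case "$out" in
    *"Adding QOS"*) echo changed ;;
    *) echo ok ;;
  esac
}

classify_qos_add 1 "Nothing new added."
```

The failure check runs first because Ansible evaluates `failed_when` independently of `changed_when`; a non-zero return code without a tolerated phrase fails the task regardless of what else the output contains.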
@@ -0,0 +1,235 @@
---
- name: Validate Slurm QOS, fairshare and priority
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Validate priority runtime config
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### priority config"
        scontrol show config | grep -E "PriorityType|PriorityWeight|PriorityDecay|PriorityCalc|PriorityMaxAge|PriorityFavor"

        echo
        echo "### accounting enforcement"
        scontrol show config | grep -E "AccountingStorageType|AccountingStorageEnforce|AccountingStorageTRES"

        echo
        echo "### QOS"
        sacctmgr show qos format=Name%20,Priority,MaxWall,MaxTRESPU%50,MaxJobsPU

        echo
        echo "### associations"
        sacctmgr show assoc format=Cluster,Account,User,Share,QOS%80,DefaultQOS,Fairshare

        echo
        echo "### fairshare"
        sshare -A {{ slurm_account_name }} || true
      args:
        executable: /bin/bash
      register: priority_state
      changed_when: false

    - name: Submit debug-short QOS job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=qos-debug-test
        #SBATCH --partition=debug
        #SBATCH --qos=debug-short
        #SBATCH --account=lab
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/qos-debug-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "QOS=${SLURM_JOB_QOS:-}"
        echo "ACCOUNT=${SLURM_JOB_ACCOUNT:-}"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/qos-debug-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: debug_qos_job
      changed_when: true

    - name: Submit gpu-short QOS job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=qos-gpu-test
        #SBATCH --partition=gpu
        #SBATCH --qos=gpu-short
        #SBATCH --account=lab
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/qos-gpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "QOS=${SLURM_JOB_QOS:-}"
        echo "ACCOUNT=${SLURM_JOB_ACCOUNT:-}"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo
        nvidia-smi
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/qos-gpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_qos_job
      changed_when: true

    - name: Validate debug-short walltime limit behavior
      ansible.builtin.shell: |
        set -euo pipefail

        set +e
        output="$(
        sudo -iu slurmuser sbatch --parsable <<'SBATCH' 2>&1
        #!/bin/bash
        #SBATCH --job-name=qos-limit-fail
        #SBATCH --partition=debug
        #SBATCH --qos=debug-short
        #SBATCH --account=lab
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:30:00
        #SBATCH --output=/shared/qos-limit-fail-%j.out

        sleep 10
        SBATCH
        )"
        rc=$?
        set -e

        echo "RC=$rc"
        echo "$output"

        if [ "$rc" -ne 0 ]; then
          echo "Limit rejection test passed at submit time"
          exit 0
        fi

        job_id="$output"
        echo "Submitted job despite expected limit check: $job_id"

        sleep 3

        echo "### squeue"
        squeue -j "$job_id" -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R" || true

        echo
        echo "### job detail"
        scontrol show job "$job_id" || true

        state="$(squeue -h -j "$job_id" -o "%T" || true)"
        reason="$(squeue -h -j "$job_id" -o "%R" || true)"

        echo "STATE=$state"
        echo "REASON=$reason"

        if echo "$state" | grep -qE "PENDING|CONFIGURING"; then
          if echo "$reason" | grep -qiE "qos|limit|time|max|assoc"; then
            echo "Limit enforcement test passed via pending reason"
            scancel "$job_id" || true
            exit 0
          fi
        fi

        echo "Job was accepted without an obvious QOS/limit pending reason"
        scancel "$job_id" || true
        exit 1
      args:
        executable: /bin/bash
      register: limit_rejection
      changed_when: false

    - name: Show priority and fairshare snapshot
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### queue"
        squeue || true

        echo
        echo "### sprio"
        sprio || true

        echo
        echo "### sshare"
        sshare -A {{ slurm_account_name }} || true

        echo
        echo "### recent sacct"
        sacct -S today --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList | tail -40
      args:
        executable: /bin/bash
      register: priority_snapshot
      changed_when: false

    - name: Print validation result
      ansible.builtin.debug:
        msg:
          - "### priority state"
          - "{{ priority_state.stdout_lines }}"
          - "### debug QOS job"
          - "{{ debug_qos_job.stdout_lines }}"
          - "### GPU QOS job"
          - "{{ gpu_qos_job.stdout_lines }}"
          - "### limit rejection"
          - "{{ limit_rejection.stdout_lines }}"
          - "### priority snapshot"
          - "{{ priority_snapshot.stdout_lines }}"
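The walltime limit task above passes in two ways: `sbatch` rejects the job at submit time, or the job is accepted but held pending with a limit-related reason. The second branch can be sketched in isolation as below. The function name `limit_enforced` is hypothetical, and the reason strings in the example are assumed sample values rather than captured output from this lab.

```shell
# Hypothetical helper (not from the playbook): succeed when a job is still
# pending or configuring AND its reason text points at a QOS or limit.
# $1 = job state (squeue %T), $2 = pending reason (squeue %R).
limit_enforced() {
  printf '%s\n' "$1" | grep -qE "PENDING|CONFIGURING" || return 1
  # Same case-insensitive keyword set as the playbook task.
  printf '%s\n' "$2" | grep -qiE "qos|limit|time|max|assoc"
}

if limit_enforced "PENDING" "QOSMaxWallDurationPerJobLimit"; then
  echo "held by a limit"
fi
```

The keyword match is deliberately loose, so it may accept reasons unrelated to the walltime limit under test; tightening it to the exact reason expected is a possible refinement.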
@@ -0,0 +1,59 @@
---
- name: Test CPU cgroup enforcement on gpu01
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit cgroup CPU test to gpu01
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=cgroup-cpu-test
        #SBATCH --partition=all
        #SBATCH --nodelist=gpu01
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/cgroup-cpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "MEM_ALLOWED=$(grep Mems_allowed_list /proc/self/status || true)"
        echo
        echo "### cgroup"
        cat /proc/self/cgroup
        echo
        echo "### mounted cgroups"
        mount | grep cgroup || true
        sleep 5
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### output"
        cat "/shared/cgroup-cpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: cgroup_cpu_result
      changed_when: true

    - name: Show cgroup CPU result
      ansible.builtin.debug:
        var: cgroup_cpu_result.stdout_lines
@@ -0,0 +1,60 @@
---
- name: Submit CPU test job
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit test job to debug partition
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=cpu-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=512M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/cpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sudo -iu slurmuser sacct -j "$job_id" --format=JobID,JobName,Partition,State,ExitCode 2>/dev/null || true

        echo "### output"
        if [ -f "/shared/cpu-test-${job_id}.out" ]; then
          cat "/shared/cpu-test-${job_id}.out"
        else
          echo "Output file not found: /shared/cpu-test-${job_id}.out"
          find /shared -maxdepth 1 -name "cpu-test-*.out" -ls | tail -5 || true
          exit 1
        fi
      args:
        executable: /bin/bash
      register: cpu_job_result
      changed_when: true

    - name: Show CPU job result
      ansible.builtin.debug:
        var: cpu_job_result.stdout_lines
@@ -0,0 +1,58 @@
---
- name: Test GPU access without GRES allocation
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit job to gpu01 without requesting GPU
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=gpu-deny-test
        #SBATCH --partition=all
        #SBATCH --nodelist=gpu01
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=1G
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/gpu-deny-test-%j.out

        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo
        echo "### ls nvidia devices"
        ls -l /dev/nvidia* 2>&1 || true
        echo
        echo "### nvidia-smi without GRES"
        nvidia-smi 2>&1 || true
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### output"
        cat "/shared/gpu-deny-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_deny_result
      changed_when: true

    - name: Show GPU deny test result
      ansible.builtin.debug:
        var: gpu_deny_result.stdout_lines
@@ -0,0 +1,70 @@
---
- name: Submit GPU test job
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit test job to gpu partition
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=gpu-test
        #SBATCH --partition=gpu
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=2G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/gpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo

        echo "### nvidia-smi"
        nvidia-smi

        echo
        echo "### GPU process table"
        nvidia-smi pmon -c 1 || true
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sudo -iu slurmuser sacct -j "$job_id" --format=JobID,JobName,Partition,State,ExitCode 2>/dev/null || true

        echo "### output"
        if [ -f "/shared/gpu-test-${job_id}.out" ]; then
          cat "/shared/gpu-test-${job_id}.out"
        else
          echo "Output file not found: /shared/gpu-test-${job_id}.out"
          find /shared -maxdepth 1 -name "gpu-test-*.out" -ls | tail -5 || true
          exit 1
        fi
      args:
        executable: /bin/bash
      register: gpu_job_result
      changed_when: true

    - name: Show GPU job result
      ansible.builtin.debug:
        var: gpu_job_result.stdout_lines
@@ -0,0 +1,95 @@
---
- name: Submit job to specific Slurm node
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook test-specific-node.yml -e target_node=<hostname>"
      when: target_node is not defined

    - name: Submit test job to target node
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=node-test
        #SBATCH --partition=debug
        #SBATCH --nodelist={{ target_node }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --account=lab
        #SBATCH --qos=normal
        #SBATCH --output=/shared/node-test-%j.out

        echo "HOST=\$(hostname)"
        echo "USER=\$(whoami)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        echo "### waiting for job to leave queue"
        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### waiting for output file"
        for i in $(seq 1 30); do
          if [ -s "/shared/node-test-${job_id}.out" ]; then
            break
          fi
          sleep 1
        done

        echo "### waiting for sacct final state"
        final_state=""
        for i in $(seq 1 30); do
          final_state="$(
            sacct -n -P -j "$job_id" --format=State 2>/dev/null \
              | head -n 1 \
              | cut -d'|' -f1 \
              | awk '{print $1}'
          )"

          if echo "$final_state" | grep -qE "COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY"; then
            break
          fi

          sleep 1
        done

        echo "FINAL_STATE=${final_state:-UNKNOWN}"

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/node-test-${job_id}.out"

        if [ "${final_state:-UNKNOWN}" != "COMPLETED" ]; then
          echo "Job did not reach COMPLETED state according to sacct"
          exit 1
        fi
      args:
        executable: /bin/bash
      register: node_test
      changed_when: true

    - name: Show node test result
      ansible.builtin.debug:
        var: node_test.stdout_lines
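The final-state loop in the task above reduces `sacct -n -P` output to a single state word and then checks whether that word is terminal. The sketch below isolates those two steps so they can be tried on canned input. The function names are hypothetical, and the sample line is an assumed illustration of parsable output, not a record from this cluster.

```shell
# Hypothetical helpers (not from the playbook).

# Read "sacct -n -P --format=State" style lines on stdin and print the
# first word of the first field of the first line.
final_state_from_sacct() {
  head -n 1 | cut -d'|' -f1 | awk '{print $1}'
}

# Succeed when the state is one of the terminal states the playbook waits for.
is_terminal_state() {
  printf '%s\n' "$1" | grep -qE "COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY"
}

state=$(printf 'CANCELLED by 1000\nCANCELLED\n' | final_state_from_sacct)
echo "FINAL_STATE=$state"
```

Taking only the first word matters for states that carry a suffix, which would otherwise never compare equal to a bare state name in the final check.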
@@ -0,0 +1,60 @@
---
- name: Generate measurable Slurm usage for sreport
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit CPU usage job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=sreport-usage
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=512M
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/sreport-usage-%j.out

        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "Burning CPU for 90 seconds"

        timeout 90 bash -c 'while true; do :; done' &
        timeout 90 bash -c 'while true; do :; done' &
        wait

        echo "Done"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 150); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 2
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/sreport-usage-${job_id}.out"
      args:
        executable: /bin/bash
      register: sreport_usage_job
      changed_when: true

    - name: Show usage job result
      ansible.builtin.debug:
        var: sreport_usage_job.stdout_lines
@@ -0,0 +1,140 @@
---
- name: Validate Slurm operator user and SSH mesh
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser
    slurm_hosts: "{{ groups['slurm_cluster'] }}"

  tasks:
    - name: Validate slurmuser exists
      ansible.builtin.command:
        cmd: id {{ slurm_operator_user }}
      changed_when: false

    - name: Validate sinfo as slurmuser
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sinfo
      changed_when: false

    - name: Validate squeue as slurmuser
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} squeue
      changed_when: false

    - name: Validate SSH mesh as slurmuser
      ansible.builtin.shell: |
        set -euo pipefail
        for h in {{ slurm_hosts | join(' ') }}; do
          echo "=== $h ==="
          ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" hostname
        done
      args:
        executable: /bin/bash
      become_user: "{{ slurm_operator_user }}"
      changed_when: false


- name: Validate Slurm controller commands
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Validate slurmctld status through sudo
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sudo -n systemctl status slurmctld --no-pager
      changed_when: false

    - name: Validate controller Slurm commands
      ansible.builtin.shell: |
        set -euo pipefail
        sudo -iu {{ slurm_operator_user }} sinfo
        sudo -iu {{ slurm_operator_user }} squeue
        sudo -iu {{ slurm_operator_user }} scontrol show nodes
      args:
        executable: /bin/bash
      changed_when: false


- name: Validate Slurm worker commands
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Validate slurmd status through sudo
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sudo -n systemctl status slurmd --no-pager
      changed_when: false

    - name: Validate worker Slurm commands
      ansible.builtin.shell: |
        set -euo pipefail
        sudo -iu {{ slurm_operator_user }} sinfo
        sudo -iu {{ slurm_operator_user }} squeue
        sudo -iu {{ slurm_operator_user }} scontrol show nodes
      args:
        executable: /bin/bash
      changed_when: false


- name: Validate basic job submission
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Submit simple Slurm test job as slurmuser
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
        sudo -iu {{ slurm_operator_user }} sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=ansible-validate
        #SBATCH --partition=debug
        #SBATCH --time=00:01:00
        #SBATCH --output=/tmp/ansible-validate-%j.out

        hostname
        whoami
        date
        SBATCH
        )"

        echo "$job_id"

        for i in $(seq 1 20); do
          state="$(sudo -iu {{ slurm_operator_user }} squeue -h -j "$job_id" -o "%T" || true)"
          if [ -z "$state" ]; then
            break
          fi
          echo "job_state=$state"
          sleep 1
        done

        sudo -iu {{ slurm_operator_user }} sacct -j "$job_id" --format=JobID,JobName,State,ExitCode 2>/dev/null || true

        if ls /tmp/ansible-validate-"$job_id".out >/dev/null 2>&1; then
          cat /tmp/ansible-validate-"$job_id".out
        fi
      args:
        executable: /bin/bash
      register: slurm_job_test
      changed_when: true

    - name: Show basic job submission result
      ansible.builtin.debug:
        var: slurm_job_test.stdout_lines
+236
@@ -0,0 +1,236 @@
---
- name: Validate canary node variable
  hosts: localhost
  gather_facts: false

  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"

  tasks:
    - name: Ensure canary node is in inventory
      ansible.builtin.fail:
        msg: "canary_node={{ canary_node_effective }} is not in inventory"
      when: canary_node_effective not in groups['all']

    - name: Ensure canary node is not the controller
      ansible.builtin.fail:
        msg: "Do not use controller as canary for worker rolling upgrade"
      when: canary_node_effective in groups['slurm_controller']


- name: Drain canary node
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"

  tasks:
    - name: Show canary state before drain
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ canary_node_effective }} || true
        scontrol show node {{ canary_node_effective }} || true
        squeue -w {{ canary_node_effective }} || true
      args:
        executable: /bin/bash
      register: canary_before
      changed_when: false

    - name: Print canary state before drain
      ansible.builtin.debug:
        var: canary_before.stdout_lines

    - name: Drain canary node
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ canary_node_effective }} State=DRAIN Reason="canary OS upgrade"
      changed_when: true

    - name: Wait until canary has no running jobs
      ansible.builtin.shell: |
        set -euo pipefail
        squeue -h -w {{ canary_node_effective }} || true
      args:
        executable: /bin/bash
      register: canary_jobs
      retries: 120
      delay: 10
      until: canary_jobs.stdout | trim == ""
      changed_when: false


- name: Upgrade canary node OS packages
  hosts: "{{ canary_node | default('slurm-c02') }}"
  become: true
  gather_facts: true

  tasks:
    - name: Ensure apt cache is updated
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800

    - name: Full upgrade packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: apt_upgrade_result

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required

    - name: Show upgrade summary
      ansible.builtin.debug:
        msg:
          - "Host: {{ inventory_hostname }}"
          - "Apt changed: {{ apt_upgrade_result.changed }}"
          - "Reboot required: {{ reboot_required.stat.exists }}"

    - name: Reboot canary if required
      ansible.builtin.reboot:
        msg: "Reboot after canary OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 20
      when: reboot_required.stat.exists

    - name: Ensure munge is running
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Ensure slurmd is running
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Validate local services
      ansible.builtin.shell: |
        set -euo pipefail
        systemctl is-active munge
        systemctl is-active slurmd
        munge -n | unmunge >/dev/null
        scontrol ping
      args:
        executable: /bin/bash
      changed_when: false


- name: Resume canary node and run canary job
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"

  tasks:
    - name: Reconfigure controller
      ansible.builtin.command:
        cmd: scontrol reconfigure
      changed_when: true

    - name: Restart controller to refresh node state
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted

    - name: Wait for controller
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false

    - name: Clear canary node maintenance state
      ansible.builtin.shell: |
        set -euo pipefail

        scontrol update NodeName={{ canary_node_effective }} State=RESUME 2>/dev/null || true
        scontrol update NodeName={{ canary_node_effective }} State=UNDRAIN 2>/dev/null || true
        scontrol update NodeName={{ canary_node_effective }} State=IDLE 2>/dev/null || true

        sleep 3
        sinfo -N -n {{ canary_node_effective }}
        scontrol show node {{ canary_node_effective }}
      args:
        executable: /bin/bash
      register: resume_canary
      changed_when: true

    - name: Wait until canary is IDLE and responding
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ canary_node_effective }}
        scontrol show node {{ canary_node_effective }}
      args:
        executable: /bin/bash
      register: canary_state
      retries: 30
      delay: 5
      until:
        - canary_state.rc == 0
        - "'not_responding' not in canary_state.stdout.lower()"
        - "'down' not in canary_state.stdout.lower()"
        - "'drain' not in canary_state.stdout.lower()"
        - "'idle*' not in canary_state.stdout.lower()"
      changed_when: false

    - name: Submit canary test job to upgraded node
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=canary-upgrade-test
        #SBATCH --partition=all
        #SBATCH --nodelist={{ canary_node_effective }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/canary-upgrade-test-%j.out

        echo "HOST=\$(hostname)"
        echo "USER=\$(whoami)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=\$(uname -r)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/canary-upgrade-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: canary_job
      changed_when: true

    - name: Show canary test result
      ansible.builtin.debug:
        var: canary_job.stdout_lines
+197
@@ -0,0 +1,197 @@
---
- name: Rolling upgrade Slurm worker nodes
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true
  serial: 1

  vars:
    skip_canary_node: "{{ canary_node | default('slurm-c02') }}"
    do_skip_canary: "{{ skip_canary | default(true) | bool }}"

  pre_tasks:
    - name: Skip canary node if requested
      ansible.builtin.meta: end_host
      when:
        - do_skip_canary
        - inventory_hostname == skip_canary_node

    - name: Drain node before OS upgrade
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ inventory_hostname }} State=DRAIN Reason="rolling OS upgrade"
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      changed_when: true

    - name: Wait until no jobs are running on this node
      ansible.builtin.shell: |
        set -euo pipefail
        squeue -h -w {{ inventory_hostname }} || true
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: jobs_on_node
      retries: 120
      delay: 10
      until: jobs_on_node.stdout | trim == ""
      changed_when: false

  tasks:
    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800

    - name: Full upgrade packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: apt_upgrade_result

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required

    - name: Show upgrade status
      ansible.builtin.debug:
        msg:
          - "Node: {{ inventory_hostname }}"
          - "Apt changed: {{ apt_upgrade_result.changed }}"
          - "Reboot required: {{ reboot_required.stat.exists }}"

    - name: Reboot node if required
      ansible.builtin.reboot:
        msg: "Reboot after rolling OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 20
      when: reboot_required.stat.exists

    - name: Restart munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Validate local slurm services
      ansible.builtin.shell: |
        set -euo pipefail
        systemctl is-active munge
        systemctl is-active slurmd
        munge -n | unmunge >/dev/null
        scontrol ping
      args:
        executable: /bin/bash
      changed_when: false

  post_tasks:
    - name: Restart controller to refresh state after node upgrade
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      run_once: false

    - name: Wait for controller after restart
      ansible.builtin.command:
        cmd: scontrol ping
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false

    - name: Clear upgraded node maintenance state
      ansible.builtin.shell: |
        set -euo pipefail

        scontrol update NodeName={{ inventory_hostname }} State=RESUME 2>/dev/null || true
        scontrol update NodeName={{ inventory_hostname }} State=UNDRAIN 2>/dev/null || true
        scontrol update NodeName={{ inventory_hostname }} State=IDLE 2>/dev/null || true

        sleep 3
        sinfo -N -n {{ inventory_hostname }}
        scontrol show node {{ inventory_hostname }}
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: resume_node
      changed_when: true

    - name: Wait until node is healthy
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ inventory_hostname }}
        scontrol show node {{ inventory_hostname }}
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: upgraded_node_state
      retries: 30
      delay: 5
      until:
        - upgraded_node_state.rc == 0
        - "'not_responding' not in upgraded_node_state.stdout.lower()"
        - "'down' not in upgraded_node_state.stdout.lower()"
        - "'drain' not in upgraded_node_state.stdout.lower()"
        - "'idle*' not in upgraded_node_state.stdout.lower()"
      changed_when: false

    - name: Submit node-local post-upgrade test job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=rolling-upgrade-test
        #SBATCH --partition=all
        #SBATCH --nodelist={{ inventory_hostname }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/rolling-upgrade-test-%j.out

        echo "HOST=\$(hostname)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=\$(uname -r)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/rolling-upgrade-test-${job_id}.out"
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: node_test_job
      changed_when: true

    - name: Show node post-upgrade test result
      ansible.builtin.debug:
        var: node_test_job.stdout_lines
@@ -0,0 +1,94 @@
---
- name: Upgrade Slurm controller OS safely
  hosts: slurm_controller
  become: true
  gather_facts: true

  tasks:
    - name: Show cluster state before controller upgrade
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        squeue
        systemctl is-active munge
        systemctl is-active slurmctld
        systemctl is-active slurmdbd || true
        systemctl is-active mariadb || true
      args:
        executable: /bin/bash
      register: before_state
      changed_when: false

    - name: Print cluster state before controller upgrade
      ansible.builtin.debug:
        var: before_state.stdout_lines

    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800

    - name: Full upgrade controller packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: controller_upgrade

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: controller_reboot_required

    - name: Show controller upgrade status
      ansible.builtin.debug:
        msg:
          - "Apt changed: {{ controller_upgrade.changed }}"
          - "Reboot required: {{ controller_reboot_required.stat.exists }}"

    - name: Reboot controller if required
      ansible.builtin.reboot:
        msg: "Reboot after controller OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 30
      when: controller_reboot_required.stat.exists

    - name: Restart controller services
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: restarted
        enabled: true
      loop:
        - munge
        - mariadb
        - slurmdbd
        - slurmctld

    - name: Wait for slurmctld
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 20
      delay: 3
      until: slurmctld_ping.rc == 0
      changed_when: false

    - name: Validate controller after upgrade
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        squeue
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|TaskPlugin|ProctrackType"
        sacct -S today --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList | tail -20
      args:
        executable: /bin/bash
      register: controller_after
      changed_when: false

    - name: Print controller validation after upgrade
      ansible.builtin.debug:
        var: controller_after.stdout_lines
+207
@@ -0,0 +1,207 @@
---
- name: Validate cluster after OS rolling upgrade
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Validate Slurm controller and cluster state
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### slurmctld ping"
        scontrol ping

        echo
        echo "### nodes"
        sinfo -N

        echo
        echo "### partitions"
        sinfo

        echo
        echo "### queue"
        squeue

        echo
        echo "### important config"
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|TaskPlugin|ProctrackType|SelectType|ClusterName"

        echo
        echo "### accounting recent jobs"
        sacct -S today --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
      args:
        executable: /bin/bash
      register: cluster_state
      changed_when: false

    - name: Print cluster state
      ansible.builtin.debug:
        var: cluster_state.stdout_lines


- name: Validate worker services after OS rolling upgrade
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true

  tasks:
    - name: Validate local worker services and Slurm connectivity
      ansible.builtin.shell: |
        set -euo pipefail

        echo "HOST=$(hostname)"
        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
        echo "KERNEL=$(uname -r)"
        echo "UPTIME=$(uptime -p)"

        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd

        echo
        echo "### munge local test"
        munge -n | unmunge >/dev/null
        echo "munge OK"

        echo
        echo "### controller ping"
        scontrol ping

        echo
        echo "### local slurm.conf checksum"
        sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true

        echo
        echo "### gpu check if present"
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader || true
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: worker_state
      changed_when: false

    - name: Print worker state
      ansible.builtin.debug:
        var: worker_state.stdout_lines


- name: Submit post-upgrade CPU validation job
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit CPU validation job to debug partition
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=os-upgrade-cpu-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/os-upgrade-cpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=$(uname -r)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/os-upgrade-cpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: cpu_validation_job
      changed_when: true

    - name: Print CPU validation job
      ansible.builtin.debug:
        var: cpu_validation_job.stdout_lines


- name: Submit post-upgrade GPU validation job
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit GPU validation job to gpu partition
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=os-upgrade-gpu-test
        #SBATCH --partition=gpu
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/os-upgrade-gpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=$(uname -r)"
        echo
        nvidia-smi
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/os-upgrade-gpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_validation_job
      changed_when: true

    - name: Print GPU validation job
      ansible.builtin.debug:
        var: gpu_validation_job.stdout_lines
@@ -0,0 +1,15 @@
# Codex prompt: generate repository documentation

You are working in an Ansible repository that automates a Slurm AI/HPC lab.

Please review the repository and generate or improve documentation under `docs/` with the following goals:

1. Explain the architecture and repository layout.
2. Document the end-to-end deployment sequence.
3. Document operational workflows: provisioning, decommissioning, rolling upgrades, health checks and auto-remediation.
4. Document SlurmDBD accounting, QOS, fairshare and priority workflows.
5. Add troubleshooting notes based on the playbooks and templates.
6. Avoid exposing secrets, real IP addresses, real hostnames, SQL dumps, backup archives, private keys or vault content.
7. Keep all text in English.

Output should be practical, operator-focused and suitable for a public Git repository.
@@ -0,0 +1,16 @@
# Managed by Ansible
# Slurm cgroup configuration

CgroupPlugin=autodetect

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes

AllowedRAMSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=0

MinRAMSpace=30
@@ -0,0 +1,4 @@
# Managed by Ansible
{% for node in slurm_nodes if node.managed_state | default('present') == 'present' and node.gres | default('') | length > 0 %}
NodeName={{ node.name }} Name=gpu File={{ node.gres_file | default('/dev/nvidia0') }}
{% endfor %}
@@ -0,0 +1,67 @@
# Managed by Ansible

ClusterName={{ slurm_cluster_name }}
SlurmctldHost={{ slurm_control_machine }}({{ slurm_control_addr }})

SlurmUser={{ slurm_user }}
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault={{ slurm_default_mpi_type }}
ProctrackType={{ slurm_proctrack_type }}
ReturnToService={{ slurm_return_to_service }}
{% if slurm_gres_types is defined and slurm_gres_types | length > 0 %}
GresTypes={{ slurm_gres_types }}
{% endif %}

SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmctldPort={{ slurmctld_port }}
SlurmdPort={{ slurmd_port }}

TaskPlugin={{ slurm_task_plugin }}
SelectType={{ slurm_select_type }}
SelectTypeParameters={{ slurm_select_type_parameters }}

SchedulerType=sched/backfill
# Priority / fairshare
PriorityType={{ slurm_priority_type | default('priority/multifactor') }}
PriorityDecayHalfLife={{ slurm_priority_decay_half_life | default('7-0') }}
PriorityCalcPeriod={{ slurm_priority_calc_period | default(5) }}
PriorityFavorSmall={{ slurm_priority_favor_small | default('NO') }}
PriorityWeightAge={{ slurm_priority_weight_age | default(1000) }}
PriorityWeightFairshare={{ slurm_priority_weight_fairshare | default(10000) }}
PriorityWeightJobSize={{ slurm_priority_weight_job_size | default(1000) }}
PriorityWeightPartition={{ slurm_priority_weight_partition | default(1000) }}
PriorityWeightQOS={{ slurm_priority_weight_qos | default(10000) }}
PriorityMaxAge={{ slurm_priority_max_age | default('1-0') }}

SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=0
KillWait=30
Waittime=0

AccountingStorageType={{ slurm_accounting_storage_type }}
{% if slurm_accounting_storage_type == "accounting_storage/slurmdbd" %}
AccountingStorageHost={{ slurm_accounting_storage_host }}
AccountingStoragePort={{ slurm_accounting_storage_port }}
AccountingStorageEnforce={{ slurm_accounting_storage_enforce | default('associations,limits,qos') }}
AccountingStorageTRES={{ slurm_accounting_storage_tres | default('cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu') }}
{% endif %}
JobAcctGatherType={{ slurm_job_acct_gather_type | default('jobacct_gather/none') }}
JobCompType={{ slurm_job_comp_type }}

SlurmctldDebug=info
SlurmdDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

{% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
NodeName={{ node.name }} NodeAddr={{ node.addr }} CPUs={{ node.cpus }}{% if node.topology | default('') | length > 0 %} {{ node.topology }}{% endif %} RealMemory={{ node.real_memory }}{% if node.gres | default('') | length > 0 %} Gres={{ node.gres }}{% endif %}{% if node.features | default('') | length > 0 %} Feature={{ node.features }}{% endif %} State=UNKNOWN
{% endfor %}

{% for partition in slurm_partitions %}
PartitionName={{ partition.name }} Nodes={{ partition.nodes }} Default={{ partition.default }} MaxTime={{ partition.max_time }} State={{ partition.state }}
{% endfor %}
@@ -0,0 +1,38 @@
# Managed by Ansible
# Slurm database daemon configuration

AuthType=auth/munge

DbdHost={{ slurmdbd_host }}
DbdPort={{ slurmdbd_port }}

SlurmUser={{ slurm_user }}

DebugLevel=info
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid

CommitDelay={{ slurmdbd_commit_delay | default(1) }}

StorageType={{ slurmdbd_storage_type }}
StorageHost={{ slurmdbd_storage_host }}
StoragePort={{ slurmdbd_storage_port }}
StorageLoc={{ slurmdbd_storage_loc }}
StorageUser={{ slurmdbd_storage_user }}
StoragePass={{ slurmdbd_storage_pass }}

# Retention / purge policy
PurgeEventAfter={{ slurmdbd_purge_event_after | default('12months') }}
PurgeJobAfter={{ slurmdbd_purge_job_after | default('12months') }}
PurgeResvAfter={{ slurmdbd_purge_resv_after | default('12months') }}
PurgeStepAfter={{ slurmdbd_purge_step_after | default('3months') }}
PurgeSuspendAfter={{ slurmdbd_purge_suspend_after | default('3months') }}
PurgeTXNAfter={{ slurmdbd_purge_txn_after | default('12months') }}
PurgeUsageAfter={{ slurmdbd_purge_usage_after | default('24months') }}

ArchiveEvents={{ slurmdbd_archive_events | default('no') }}
ArchiveJobs={{ slurmdbd_archive_jobs | default('no') }}
ArchiveSteps={{ slurmdbd_archive_steps | default('no') }}
ArchiveSuspend={{ slurmdbd_archive_suspend | default('no') }}
ArchiveTXN={{ slurmdbd_archive_txn | default('no') }}
ArchiveUsage={{ slurmdbd_archive_usage | default('no') }}