@@ -10,6 +10,10 @@ Current subdirectories are planning areas unless their own README documents a ru
|
||||
- `ci-cd`
|
||||
- `docker`
|
||||
|
||||
## Linux operations labs
|
||||
|
||||
- [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety, and systemd timers.
|
||||
|
||||
Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready.
|
||||
|
||||
Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/).
|
||||
|
||||
@@ -0,0 +1,308 @@
|
||||
# AI Lab Maintenance Toolkit
|
||||
|
||||
## Executive summary
|
||||
|
||||
The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
|
||||
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
|
||||
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
|
||||
configuration backup, and non-destructive VM inventory into a small toolkit
|
||||
that is readable enough for review and guarded enough for homelab use.
|
||||
|
||||
This is a portfolio and lab implementation, not evidence of production
|
||||
certification. Review package policy, backup coverage, maintenance windows, and
|
||||
application impact before deploying it to another host.
|
||||
|
||||
## Problem solved
|
||||
|
||||
AI lab hosts accumulate operating system packages, kernel packages, container
|
||||
images, build cache, journals, and configuration changes while also carrying
|
||||
stateful workloads. Manual maintenance is easy to defer and risky to perform
|
||||
without evidence. This project provides scheduled, logged tasks with explicit
|
||||
safety boundaries and separate read-only audit commands.
|
||||
|
||||
## What this demonstrates
|
||||
|
||||
- Bash strict mode, input validation, dependency checks, and operational exit
|
||||
codes.
|
||||
- Dry-run-first maintenance with explicit authorization for changes.
|
||||
- systemd oneshot services and persistent calendar timers.
|
||||
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
|
||||
- Docker cleanup that preserves volumes.
|
||||
- Configuration-focused backups with bounded retention.
|
||||
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
|
||||
- Idempotent installation and guarded JSON configuration updates.
|
||||
|
||||
## Architecture and directory layout
|
||||
|
||||
```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
```
|
||||
|
||||
The installer deploys scripts to `/usr/local/sbin` and units to
|
||||
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
|
||||
through an additional framework.
|
||||
|
||||
## Maintenance tasks
|
||||
|
||||
| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |
|
||||
|
||||
## Safety model
|
||||
|
||||
Change-capable scripts default to dry-run behavior. Manual execution requires
|
||||
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
|
||||
use `--execute --non-interactive`; installing and enabling those reviewed unit
|
||||
files is the explicit authorization for scheduled maintenance.
|
||||
|
||||
Exit codes follow the repository convention:
|
||||
|
||||
- `0`: completed successfully or an optional component was absent.
|
||||
- `1`: an operational check or maintenance action failed.
|
||||
- `2`: invalid input, missing required dependency, or insufficient privilege.
|
||||
|
||||
The scripts do not bypass APT or Docker locks, delete VM resources, manually
|
||||
select kernel names for removal, or hide command failures.
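
The exit-code convention above can be consumed by a small wrapper. The sketch below is illustrative only: `describe_exit` and `run_and_describe` are hypothetical helper names that are not part of the toolkit, and the labels are assumptions chosen for readability.

```bash
# Hypothetical helper: map the repository exit-code convention to a short label.
describe_exit() {
  case "$1" in
    0) printf 'ok\n' ;;
    1) printf 'operational-failure\n' ;;
    2) printf 'usage-or-privilege-error\n' ;;
    *) printf 'unexpected\n' ;;
  esac
}

# Hypothetical wrapper: run a command, print its label, and keep its exit code.
run_and_describe() {
  local rc=0
  "$@" || rc=$?
  printf '%s: %s\n' "$1" "$(describe_exit "$rc")"
  return "$rc"
}

run_and_describe true   # prints: true: ok
```

A caller could wrap any of the installed scripts the same way, for example `run_and_describe /usr/local/sbin/ailab-disk-watch.sh`, and branch on the preserved exit code.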
|
||||
|
||||
## Installation
|
||||
|
||||
Review every script and unit first. Installation changes package state,
|
||||
journald settings, Docker daemon settings when Docker exists, and enabled timer
|
||||
state.
|
||||
|
||||
```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```
|
||||
|
||||
The installer:
|
||||
|
||||
1. Installs the documented Ubuntu utilities.
|
||||
2. Deploys scripts and systemd units with fixed permissions.
|
||||
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
|
||||
4. Restarts `systemd-journald`.
|
||||
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
|
||||
with `jq`, and attempts a Docker restart.
|
||||
6. Enables all five timers.
|
||||
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.
|
||||
|
||||
The installer is intended for Ubuntu 26.04. It is not run automatically by
|
||||
repository validation.
|
||||
|
||||
## Manual commands
|
||||
|
||||
Read-only reports:
|
||||
|
||||
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```
|
||||
|
||||
Preview maintenance:
|
||||
|
||||
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```
|
||||
|
||||
Apply reviewed maintenance interactively:
|
||||
|
||||
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```
|
||||
|
||||
`--non-interactive` is reserved for reviewed automation and is rejected unless
|
||||
`--execute` is also present.
|
||||
|
||||
## Systemd timers
|
||||
|
||||
| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |
|
||||
|
||||
All timers use `Persistent=true`, so a missed event runs after the host becomes
|
||||
available. Inspect timer and service evidence with:
|
||||
|
||||
```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```
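
The timer unit files themselves are not reproduced in this README. As a rough sketch of how a schedule from the table and `Persistent=true` might be expressed, the hypothetical `render_timer` function below prints a timer unit. The `Description=` text and the exact `OnCalendar=` strings are assumptions, not the contents of the shipped units.

```bash
# Hypothetical helper: print a timer unit for a description and an OnCalendar value.
render_timer() {
  local description="$1" calendar="$2"
  cat <<EOF
[Unit]
Description=$description

[Timer]
OnCalendar=$calendar
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Assumed calendar expression for "Daily at 03:30".
render_timer 'AI Lab configuration backup (illustrative)' '*-*-* 03:30:00'
```

A calendar expression can be checked before use with `systemd-analyze calendar 'Sun *-*-* 04:20:00'`, which prints the normalized form and the next elapse time.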
|
||||
|
||||
## Logs
|
||||
|
||||
Scheduled and manual maintenance writes to:
|
||||
|
||||
```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```
|
||||
|
||||
systemd also records service output in the journal. Logrotate is installed as a
|
||||
dependency, but this lab does not create a custom rotation policy for these
|
||||
small maintenance logs.
|
||||
|
||||
## Docker policy
|
||||
|
||||
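Docker cleanup runs `docker system prune -af`, followed by
`docker builder prune -af --filter until=168h` for build cache older than seven
days. It never passes `--volumes`. Named and anonymous volumes remain outside
this automated policy and require application-aware review. Because
`docker system prune -a` also clears build cache, confirm on the installed
Docker version whether the seven-day filter still changes the result.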
|
||||
|
||||
The installer configures the `json-file` driver with a maximum size of `50m`
|
||||
and five files. Existing valid JSON is backed up and merged. Invalid JSON
|
||||
causes installation to stop rather than overwrite operator configuration.
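
To make the merge behavior concrete, the sketch below wraps the same `jq` filter the installer uses in a hypothetical `merge_log_limits` function that reads JSON on stdin. The sample input in the comment is invented for illustration.

```bash
# Hypothetical helper: apply the installer's log-limit merge to JSON on stdin.
# Requires jq. Keys on the right-hand side of "+" win over existing keys.
merge_log_limits() {
  jq '. + {
    "log-driver": "json-file",
    "log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
  }'
}

# Example invocation with an invented daemon.json fragment:
#   printf '%s' '{"log-opts": {"max-size": "10m", "tag": "x"}}' | merge_log_limits
```

With that sample input, unrelated keys such as `tag` are preserved, while an existing `max-size` of `10m` is replaced by `50m`. Operators who rely on different log limits should review the timestamped backup the installer writes next to `daemon.json`.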
|
||||
|
||||
## Kernel policy
|
||||
|
||||
Kernel removal is delegated to `apt autoremove --purge`; package names are not
|
||||
constructed or purged with regular expressions. Before execution, the script
|
||||
logs the APT simulation and refuses cleanup unless at least two installed
|
||||
versioned kernel image packages remain after simulated removals.
|
||||
|
||||
This protects a fallback kernel while preserving Ubuntu dependency policy.
|
||||
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
|
||||
Secure Boot state, and the simulated removal set before manual execution.
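
The two-image safety rule can be illustrated with a small function that works on captured text instead of live package state. This is a sketch under stated assumptions: `count_remaining_images` is a hypothetical name, the package versions are invented, and the parser accepts both `Remv` and `Purg` prefixes because APT's simulation marks purged packages with `Purg`.

```bash
# Hypothetical helper: count installed kernel images that a simulation keeps.
#   $1: newline-separated installed linux-image package names
#   $2: captured text of an `apt -s autoremove --purge` run
count_remaining_images() {
  local installed="$1" simulation="$2" removed
  removed="$(awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {
    sub(/:.*/, "", $2); print $2
  }' <<<"$simulation")"
  # Drop installed names that exactly match a simulated removal, then count.
  grep -vxF -f <(printf '%s\n' "$removed") <<<"$installed" | grep -c . || true
}

# Invented sample data: three installed images, one scheduled for purge.
installed_sample=$'linux-image-6.8.0-40-generic\nlinux-image-6.8.0-41-generic\nlinux-image-6.8.0-45-generic'
simulation_sample=$'Reading package lists...\nPurg linux-image-6.8.0-40-generic [6.8.0-40.40]\nPurg linux-modules-6.8.0-40-generic [6.8.0-40.40]'

count_remaining_images "$installed_sample" "$simulation_sample"   # prints: 2
```

A result below `2` corresponds to the case in which the script refuses to proceed.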
|
||||
|
||||
## Backup policy
|
||||
|
||||
Backups are written to `/srv/backups/ailab-config` as
|
||||
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
|
||||
deleted only after a new archive is created.
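
The retention rule can be previewed without deleting anything. The sketch below uses a hypothetical `list_expired_archives` function with the same `find` predicates as the backup script, minus `-delete`, and exercises it against throwaway files in a temporary directory. It assumes GNU `touch` for the relative date syntax.

```bash
# Hypothetical helper: list archives the retention rule would select.
list_expired_archives() {
  local dir="$1" days="$2"
  find "$dir" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$days" -print
}

# Demonstration against invented files in a temporary directory.
demo_dir="$(mktemp -d)"
touch -d '40 days ago' "$demo_dir/ailab-config-20240101-033000.tar.gz"
touch "$demo_dir/ailab-config-20240210-033000.tar.gz"
touch -d '40 days ago' "$demo_dir/unrelated.tar.gz"

# Only the 40-day-old file that matches the archive name pattern is listed.
list_expired_archives "$demo_dir" 30
rm -rf "$demo_dir"
```

`find` counts age in whole 24-hour periods, so `-mtime +30` selects files that are at least 31 days old.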
|
||||
|
||||
The backup covers `/etc`, selected root shell configuration,
|
||||
`/opt/ailab-maintenance` when present, and libvirt configuration under
|
||||
`/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data,
|
||||
Ollama models, VM disk images, or other large application datasets. Because
|
||||
`/etc` is included, explicitly listed configuration subdirectories are already
|
||||
covered even when optional-path reporting mentions them separately.
|
||||
|
||||
This is a local configuration backup, not a disaster-recovery design. A real
|
||||
deployment should copy archives to independently protected storage and test
|
||||
restoration.
|
||||
|
||||
## Journald policy
|
||||
|
||||
The installer applies:
|
||||
|
||||
```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```
|
||||
|
||||
These settings bound journal growth while retaining useful troubleshooting
|
||||
evidence. Capacity and retention should be adjusted to the host's disk size
|
||||
and incident-response requirements.
|
||||
|
||||
## Disk watch policy
|
||||
|
||||
The disk check uses `df -P`, defaults to an 85 percent threshold, and returns
|
||||
`1` when any checked filesystem meets or exceeds the threshold. Override the
|
||||
threshold for a manual or unit invocation with:
|
||||
|
||||
```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```
|
||||
|
||||
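The script reports every checked filesystem as `OK` or `WARNING`; `tmpfs` and
`devtmpfs` mounts are excluded, and a row whose usage cannot be parsed is also
reported as `WARNING`. It does not delete data or attempt remediation.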
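
The threshold comparison can be tried on a single captured `df -P` row. The sketch below is illustrative: `classify_df_row` is a hypothetical function, the sample row is invented, and the output labels are simplified compared with the real script's messages.

```bash
# Hypothetical helper: classify one `df -P` data row against a threshold.
classify_df_row() {
  local threshold="$1" row="$2"
  local filesystem _blocks _used _available use_percent mountpoint usage
  # The last variable receives the rest of the line, so mount points with
  # spaces stay intact.
  read -r filesystem _blocks _used _available use_percent mountpoint <<<"$row"
  usage="${use_percent%\%}"
  if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
    printf 'UNPARSED %s\n' "$mountpoint"
  elif ((usage >= threshold)); then
    printf 'WARNING %s %s\n' "$mountpoint" "$use_percent"
  else
    printf 'OK %s %s\n' "$mountpoint" "$use_percent"
  fi
}

classify_df_row 85 '/dev/nvme0n1p2 102400 92160 10240 90% /'   # prints: WARNING / 90%
```

A row whose usage equals the threshold is classified as a warning, matching the "meets or exceeds" rule.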
|
||||
|
||||
## Example operational workflows
|
||||
|
||||
### Weekly maintenance review
|
||||
|
||||
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```
|
||||
|
||||
Review the kernel simulation, Docker usage, failed units, backup freshness, and
|
||||
disk warnings before approving manual changes.
|
||||
|
||||
### Disk pressure investigation
|
||||
|
||||
```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```
|
||||
|
||||
Use the evidence to identify ownership. Do not treat Docker pruning or file
|
||||
deletion as a substitute for application-specific retention policy.
|
||||
|
||||
### Post-maintenance evidence
|
||||
|
||||
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
  | sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```
|
||||
|
||||
## Interview talking points
|
||||
|
||||
- Why timer units explicitly carry the non-interactive execution boundary.
|
||||
- Why APT dependency policy is safer than regex-based kernel deletion.
|
||||
- How Docker volume preservation separates platform hygiene from application
|
||||
data lifecycle decisions.
|
||||
- How optional dependency handling keeps one health command useful across
|
||||
container, GPU, and virtualization host variants.
|
||||
- Why configuration backup and application-data backup are separate concerns.
|
||||
- How exit codes, persistent timers, logs, and post-checks support operations.
|
||||
|
||||
## Future improvements
|
||||
|
||||
- Add a dedicated logrotate policy after measuring log growth.
|
||||
- Export disk-watch status to a monitoring system instead of relying only on
|
||||
timer failure state.
|
||||
- Add automated archive integrity checks and off-host replication.
|
||||
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
|
||||
commands.
|
||||
- Add package-lock detection with bounded retry policy if recurring contention
|
||||
is observed.
|
||||
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
|
||||
dedicated read-only audit.
|
||||
@@ -0,0 +1,103 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf"
|
||||
DOCKER_CONFIG="/etc/docker/daemon.json"
|
||||
packages=(
|
||||
logrotate
|
||||
needrestart
|
||||
smartmontools
|
||||
nvme-cli
|
||||
sysstat
|
||||
iotop
|
||||
ncdu
|
||||
duf
|
||||
jq
|
||||
lsof
|
||||
psmisc
|
||||
tar
|
||||
gzip
|
||||
)
|
||||
timers=(
|
||||
ailab-apt-cleanup.timer
|
||||
ailab-kernel-cleanup.timer
|
||||
ailab-docker-cleanup.timer
|
||||
ailab-config-backup.timer
|
||||
ailab-disk-watch.timer
|
||||
)
|
||||
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: install.sh must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
for command_name in apt-get install systemctl; do
|
||||
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
|
||||
printf 'Installing maintenance dependencies...\n'
|
||||
apt-get update
|
||||
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
|
||||
|
||||
printf 'Installing scripts and systemd units...\n'
|
||||
for script in "$SCRIPT_DIR"/scripts/*.sh; do
|
||||
install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")"
|
||||
done
|
||||
for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do
|
||||
install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")"
|
||||
done
|
||||
|
||||
install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")"
|
||||
tmp_journald="$(mktemp)"
|
||||
trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT
|
||||
cat >"$tmp_journald" <<'EOF'
|
||||
[Journal]
|
||||
SystemMaxUse=1G
|
||||
SystemKeepFree=2G
|
||||
MaxRetentionSec=14day
|
||||
Compress=yes
|
||||
EOF
|
||||
install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN"
|
||||
systemctl restart systemd-journald
|
||||
|
||||
if command -v docker >/dev/null 2>&1; then
|
||||
printf 'Configuring Docker log rotation limits...\n'
|
||||
install -d -m 0755 /etc/docker
|
||||
tmp_docker="$(mktemp)"
|
||||
|
||||
if [[ -f "$DOCKER_CONFIG" ]]; then
|
||||
if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: %s is not valid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
|
||||
exit 1
|
||||
fi
|
||||
backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak"
|
||||
install -m 0644 "$DOCKER_CONFIG" "$backup"
|
||||
jq '. + {
|
||||
"log-driver": "json-file",
|
||||
"log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
|
||||
}' "$DOCKER_CONFIG" >"$tmp_docker"
|
||||
else
|
||||
jq -n '{
|
||||
"log-driver": "json-file",
|
||||
"log-opts": {"max-size": "50m", "max-file": "5"}
|
||||
}' >"$tmp_docker"
|
||||
fi
|
||||
|
||||
jq empty "$tmp_docker"
|
||||
install -m 0644 "$tmp_docker" "$DOCKER_CONFIG"
|
||||
systemctl restart docker || printf 'WARNING: Docker restart failed; review %s and restart Docker manually\n' "$DOCKER_CONFIG" >&2
|
||||
else
|
||||
printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n'
|
||||
fi
|
||||
|
||||
systemctl daemon-reload
|
||||
systemctl enable --now "${timers[@]}"
|
||||
|
||||
printf '\nEnabled AI Lab timers:\n'
|
||||
systemctl list-timers --all --no-pager | grep 'ailab-' || true
|
||||
|
||||
/usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt
|
||||
printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n'
|
||||
@@ -0,0 +1,66 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
LOG_FILE="/var/log/ailab-apt-cleanup.log"
|
||||
execute=false
|
||||
non_interactive=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--execute) execute=true ;;
|
||||
--non-interactive) non_interactive=true ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ! command -v apt >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: apt is required\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)"
|
||||
|
||||
if [[ "$execute" != true ]]; then
|
||||
printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n'
|
||||
printf 'INFO: simulated autoremove follows\n'
|
||||
LC_ALL=C apt -s autoremove --purge
|
||||
printf 'INFO: rerun with --execute and confirm to apply changes\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$non_interactive" != true ]]; then
|
||||
printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n'
|
||||
printf 'Type EXECUTE to continue: '
|
||||
read -r confirmation
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||
exit 2
|
||||
fi
|
||||
fi
|
||||
|
||||
apt update
|
||||
apt autoremove --purge -y
|
||||
apt autoclean -y
|
||||
if command -v needrestart >/dev/null 2>&1; then
|
||||
needrestart -b || true
|
||||
else
|
||||
printf 'WARNING: needrestart is not installed\n'
|
||||
fi
|
||||
printf 'OK: APT cleanup completed\n'
|
||||
@@ -0,0 +1,90 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
LOG_FILE="/var/log/ailab-config-backup.log"
|
||||
BACKUP_DIR="/srv/backups/ailab-config"
|
||||
RETENTION_DAYS=30
|
||||
execute=false
|
||||
non_interactive=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--execute) execute=true ;;
|
||||
--non-interactive) non_interactive=true ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
for command_name in tar gzip find; do
|
||||
if ! command -v "$command_name" >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
|
||||
exit 2
|
||||
fi
|
||||
done
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
timestamp="$(date '+%Y%m%d-%H%M%S')"
|
||||
archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz"
|
||||
candidate_paths=(
|
||||
/etc
|
||||
/root/.bashrc
|
||||
/root/.bashrc.d
|
||||
/opt/ailab-maintenance
|
||||
/var/lib/libvirt/qemu
|
||||
)
|
||||
source_paths=()
|
||||
|
||||
printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)"
|
||||
for path in "${candidate_paths[@]}"; do
|
||||
if [[ -e "$path" ]]; then
|
||||
source_paths+=("${path#/}")
|
||||
printf 'OK: include %s\n' "$path"
|
||||
else
|
||||
printf 'INFO: optional path is absent: %s\n' "$path"
|
||||
fi
|
||||
done
|
||||
|
||||
if ((${#source_paths[@]} == 0)); then
|
||||
printf 'CRITICAL: no backup source paths are present\n'
|
||||
exit 1
|
||||
fi
|
||||
|
||||
printf 'Backup destination: %s\n' "$archive"
|
||||
printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS"
|
||||
printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n'
|
||||
printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n'
|
||||
|
||||
if [[ "$execute" != true ]]; then
|
||||
printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$non_interactive" != true ]]; then
|
||||
printf 'Type EXECUTE to create the archive and apply retention: '
|
||||
read -r confirmation
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||
exit 2
|
||||
fi
|
||||
fi
|
||||
|
||||
install -d -m 0750 "$BACKUP_DIR"
|
||||
tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- "${source_paths[@]}"
|
||||
find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete
|
||||
printf 'OK: configuration backup created: %s\n' "$archive"
|
||||
@@ -0,0 +1,38 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
LOG_FILE="/var/log/ailab-disk-watch.log"
|
||||
threshold="${AILAB_DISK_THRESHOLD:-85}"
|
||||
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((threshold < 1 || threshold > 100)); then
|
||||
printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold"
|
||||
|
||||
status=0
|
||||
while read -r filesystem _blocks _used available use_percent mountpoint; do
|
||||
usage="${use_percent%\%}"
|
||||
|
||||
if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
|
||||
printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint"
|
||||
status=1
|
||||
elif ((usage >= threshold)); then
|
||||
printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \
|
||||
"$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available"
|
||||
status=1
|
||||
else
|
||||
printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
|
||||
fi
|
||||
# Skip the header with tail so that read keeps mount points containing spaces intact.
done < <(df -P -x tmpfs -x devtmpfs | tail -n +2)
|
||||
|
||||
exit "$status"
|
||||
@@ -0,0 +1,70 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
LOG_FILE="/var/log/ailab-docker-cleanup.log"
|
||||
execute=false
|
||||
non_interactive=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while (($# > 0)); do
|
||||
case "$1" in
|
||||
--execute) execute=true ;;
|
||||
--non-interactive) non_interactive=true ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
|
||||
esac
|
||||
shift
|
||||
done
|
||||
|
||||
if [[ "$non_interactive" == true && "$execute" != true ]]; then
|
||||
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
if ((EUID != 0)); then
|
||||
printf 'CRITICAL: this script must run as root\n' >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
exec > >(tee -a "$LOG_FILE") 2>&1
|
||||
printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)"
|
||||
|
||||
if ! command -v docker >/dev/null 2>&1; then
|
||||
printf 'INFO: Docker is not installed; nothing to do\n'
|
||||
exit 0
|
||||
fi
|
||||
if command -v systemctl >/dev/null 2>&1 && ! systemctl is-active --quiet docker; then
|
||||
printf 'INFO: docker.service is inactive; nothing to do\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
printf '\nDocker disk usage before cleanup:\n'
|
||||
docker system df
|
||||
|
||||
if [[ "$execute" != true ]]; then
|
||||
printf 'INFO: dry-run mode; would run docker system prune -af\n'
|
||||
printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n'
|
||||
printf 'INFO: Docker volumes are never included in this cleanup\n'
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$non_interactive" != true ]]; then
|
||||
printf 'WARNING: this removes unused containers, networks, images, and old build cache, but not volumes.\n'
|
||||
printf 'Type EXECUTE to continue: '
|
||||
read -r confirmation
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
printf 'CRITICAL: confirmation failed; no changes made\n'
|
||||
exit 2
|
||||
fi
|
||||
fi
|
||||
|
||||
docker system prune -af
|
||||
docker builder prune -af --filter "until=168h"
|
||||
|
||||
printf '\nDocker disk usage after cleanup:\n'
|
||||
docker system df
|
||||
printf 'OK: Docker cleanup completed; volumes were not pruned\n'
|
||||
@@ -0,0 +1,111 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
section() {
|
||||
printf '\n== %s ==\n' "$1"
|
||||
}
|
||||
|
||||
run_optional() {
|
||||
local description="$1"
|
||||
shift
|
||||
|
||||
if "$@"; then
|
||||
return 0
|
||||
fi
|
||||
|
||||
printf 'WARNING: %s failed\n' "$description"
|
||||
return 0
|
||||
}
|
||||
|
||||
section "Host identity"
|
||||
if command -v hostnamectl >/dev/null 2>&1; then
|
||||
run_optional "hostnamectl" hostnamectl
|
||||
else
|
||||
run_optional "hostname" hostname
|
||||
fi
|
||||
run_optional "kernel information" uname -a
|
||||
run_optional "uptime" uptime
|
||||
|
||||
section "Memory"
|
||||
if command -v free >/dev/null 2>&1; then
|
||||
run_optional "memory report" free -h
|
||||
else
|
||||
printf 'WARNING: free is not available\n'
|
||||
fi
|
||||
|
||||
section "Filesystems"
|
||||
if command -v df >/dev/null 2>&1; then
|
||||
run_optional "filesystem report" df -hT
|
||||
printf '\nKey mountpoints present:\n'
|
||||
for mountpoint in / /boot /var /srv /opt /home; do
|
||||
# --mountpoint matches only real mount points; --target would match any existing path.
if findmnt -rn --mountpoint "$mountpoint" >/dev/null 2>&1; then
|
||||
run_optional "filesystem report for $mountpoint" df -hT "$mountpoint"
|
||||
fi
|
||||
done
|
||||
else
|
||||
printf 'WARNING: df is not available\n'
|
||||
fi
|
||||
|
||||
section "Journal usage"
|
||||
if command -v journalctl >/dev/null 2>&1; then
|
||||
run_optional "journal disk usage" journalctl --disk-usage
|
||||
else
|
||||
printf 'WARNING: journalctl is not available\n'
|
||||
fi
|
||||
|
||||
section "Docker"
|
||||
if command -v docker >/dev/null 2>&1; then
|
||||
if command -v systemctl >/dev/null 2>&1; then
|
||||
run_optional "Docker service state" systemctl is-active docker
|
||||
fi
|
||||
run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
|
||||
run_optional "Docker disk usage" docker system df
|
||||
else
|
||||
printf 'INFO: Docker is not installed\n'
|
||||
fi
|
||||
|
||||
section "Libvirt"
|
||||
if command -v virsh >/dev/null 2>&1; then
|
||||
if command -v systemctl >/dev/null 2>&1; then
|
||||
run_optional "libvirtd service state" systemctl is-active libvirtd
|
||||
fi
|
||||
run_optional "libvirt guest list" virsh list --all
|
||||
else
|
||||
printf 'INFO: virsh is not installed\n'
|
||||
fi
|
||||
|
||||
section "NVIDIA"
|
||||
if command -v nvidia-smi >/dev/null 2>&1; then
|
||||
run_optional "NVIDIA status" nvidia-smi
|
||||
else
|
||||
printf 'INFO: nvidia-smi is not installed\n'
|
||||
fi
|
||||
|
||||
section "Failed systemd units"
|
||||
if command -v systemctl >/dev/null 2>&1; then
|
||||
run_optional "failed systemd unit report" systemctl --failed --no-pager
|
||||
else
|
||||
printf 'WARNING: systemctl is not available\n'
|
||||
fi
|
||||
|
||||
section "SMART quick health"
|
||||
if command -v smartctl >/dev/null 2>&1; then
|
||||
shopt -s nullglob
|
||||
devices=(/dev/sd? /dev/nvme?n?)
|
||||
shopt -u nullglob
|
||||
|
||||
if ((${#devices[@]} == 0)); then
|
||||
printf 'INFO: no matching SATA/SCSI or NVMe devices found\n'
|
||||
else
|
||||
for device in "${devices[@]}"; do
|
||||
printf '\n-- %s --\n' "$device"
|
||||
run_optional "SMART health check for $device" smartctl -H "$device"
|
||||
done
|
||||
fi
|
||||
else
|
||||
printf 'INFO: smartctl is not installed\n'
|
||||
fi
|
||||
|
||||
exit 0
|
||||
@@ -0,0 +1,117 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
# APT autoremove respects package dependencies and kernel protection rules. That
# is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO.

LOG_FILE="/var/log/ailab-kernel-cleanup.log"
execute=false
non_interactive=false

usage() {
  printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}

kernel_packages() {
  dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \
    'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \
    | awk '$1 ~ /^ii/ {print $2}' \
    | sort -u || true
}

versioned_kernel_images() {
  dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \
    | awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \
    | sort -u || true
}

while (($# > 0)); do
  case "$1" in
    --execute) execute=true ;;
    --non-interactive) non_interactive=true ;;
    -h|--help) usage; exit 0 ;;
    *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
  esac
  shift
done

if [[ "$non_interactive" == true && "$execute" != true ]]; then
  printf 'CRITICAL: --non-interactive requires --execute\n' >&2
  exit 2
fi
if ((EUID != 0)); then
  printf 'CRITICAL: this script must run as root\n' >&2
  exit 2
fi
for command_name in apt dpkg-query uname; do
  if ! command -v "$command_name" >/dev/null 2>&1; then
    printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
    exit 2
  fi
done

exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)"
printf 'Running kernel: %s\n' "$(uname -r)"
printf '\nInstalled kernel-related packages before cleanup:\n'
kernel_packages

simulation="$(LC_ALL=C apt -s autoremove --purge)"
printf '\nAPT autoremove simulation:\n%s\n' "$simulation"

mapfile -t installed_images < <(versioned_kernel_images)
mapfile -t removed_images < <(
  awk '$1 == "Remv" && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u
)

remaining_images=0
for image in "${installed_images[@]}"; do
  remove_image=false
  for removed in "${removed_images[@]}"; do
    if [[ "$image" == "$removed" ]]; then
      remove_image=true
      break
    fi
  done
  if [[ "$remove_image" != true ]]; then
    remaining_images=$((remaining_images + 1))
  fi
done

printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \
  "${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images"

if ((${#installed_images[@]} < 2 || remaining_images < 2)); then
  printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n'
  exit 1
fi

if [[ "$execute" != true ]]; then
  printf 'INFO: dry-run mode; no packages were removed\n'
  printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n'
  exit 0
fi

if [[ "$non_interactive" != true ]]; then
  printf 'WARNING: APT will remove the packages shown in the simulation above.\n'
  printf 'Type EXECUTE to continue: '
  read -r confirmation
  if [[ "$confirmation" != "EXECUTE" ]]; then
    printf 'CRITICAL: confirmation failed; no changes made\n'
    exit 2
  fi
fi

apt autoremove --purge -y
apt autoclean -y
if command -v update-grub >/dev/null 2>&1; then
  update-grub || true
else
  printf 'WARNING: update-grub is not installed\n'
fi

printf '\nInstalled kernel-related packages after cleanup:\n'
kernel_packages
printf 'OK: kernel cleanup completed with APT-managed package selection\n'
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

section() {
  printf '\n== %s ==\n' "$1"
}

if ! command -v virsh >/dev/null 2>&1; then
  printf 'INFO: virsh is not installed; VM audit skipped\n'
  exit 0
fi

section "Virtual machines"
virsh list --all || printf 'WARNING: unable to list virtual machines\n'

section "Storage pools"
virsh pool-list --all || printf 'WARNING: unable to list storage pools\n'

mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true)
for pool in "${pools[@]}"; do
  section "Volumes in pool: $pool"
  virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool"
done

section "Possible VM disk and installation images"
search_roots=()
for path in /var/lib/libvirt /srv /opt; do
  [[ -d "$path" ]] && search_roots+=("$path")
done

if ((${#search_roots[@]} == 0)); then
  printf 'INFO: no configured search roots are present\n'
else
  find "${search_roots[@]}" -xdev -type f \
    \( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \
    -printf '%12s bytes %p\n' 2>/dev/null \
    | sort -nr || true
fi

printf '\nINFO: audit complete; no files or libvirt resources were modified\n'
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe APT cleanup
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab APT cleanup weekly

[Timer]
OnCalendar=Sun *-*-* 04:00:00
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab configuration backup

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab configuration backup daily

[Timer]
OnCalendar=*-*-* 03:30:00
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab disk usage check

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-disk-watch.sh
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab disk usage check hourly

[Timer]
OnCalendar=hourly
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe Docker cleanup
Requires=docker.service
After=docker.service

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab Docker cleanup weekly

[Timer]
OnCalendar=Sun *-*-* 04:40:00
Persistent=true

[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe kernel cleanup
After=network-online.target ailab-apt-cleanup.service
Wants=network-online.target

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab kernel cleanup weekly

[Timer]
OnCalendar=Sun *-*-* 04:20:00
Persistent=true

[Install]
WantedBy=timers.target