Compare commits: 1843796e92...83877fb598 (2 commits)

| SHA1 |
|---|
| 83877fb598 |
| d300d490f5 |
@@ -36,6 +36,7 @@
|
|||||||
- IBM AIX 7 role and playbook.
|
- IBM AIX 7 role and playbook.
|
||||||
- Shared sanitized Ansible inventory defaults for Linux and AIX examples.
|
- Shared sanitized Ansible inventory defaults for Linux and AIX examples.
|
||||||
- Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
|
- Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
|
||||||
|
- Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
|
||||||
|
|
||||||
### Changed
|
### Changed
|
||||||
|
|
||||||
|
|||||||
@@ -42,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
|
|||||||
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
|
- [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
|
||||||
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
|
- [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
|
||||||
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
|
- [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
|
||||||
|
- [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
|
||||||
|
|
||||||
## Planned Areas
|
## Planned Areas
|
||||||
|
|
||||||
@@ -106,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
|
|||||||
- Veritas VxVM/VCS operational awareness.
|
- Veritas VxVM/VCS operational awareness.
|
||||||
- GPFS / IBM Spectrum Scale operational awareness.
|
- GPFS / IBM Spectrum Scale operational awareness.
|
||||||
- Ansible role organization for selected hardening controls.
|
- Ansible role organization for selected hardening controls.
|
||||||
|
- Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
|
||||||
- Clear documentation of what was tested and what still needs a real system.
|
- Clear documentation of what was tested and what still needs a real system.
|
||||||
|
|||||||
@@ -1,8 +1,14 @@
|
|||||||
# platform-projects
|
# platform-projects
|
||||||
|
|
||||||
This directory is reserved for larger infrastructure platform topics and future case studies. The current implemented project is [infra-run](../infra-run/).
|
This directory contains larger infrastructure platform topics and case studies. Most subdirectories are planning areas unless their own README says otherwise.
|
||||||
|
|
||||||
Current subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
|
## Implemented platform projects
|
||||||
|
|
||||||
|
- [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
|
||||||
|
|
||||||
|
## Planning areas
|
||||||
|
|
||||||
|
These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
|
||||||
|
|
||||||
- `monitoring-zabbix`
|
- `monitoring-zabbix`
|
||||||
- `elk-log-analysis`
|
- `elk-log-analysis`
|
||||||
|
|||||||
@@ -0,0 +1,236 @@
|
|||||||
|
# Slurm AI/HPC Cluster Automation Lab
|
||||||
|
|
||||||
|
## Executive summary
|
||||||
|
|
||||||
|
This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.
|
||||||
|
|
||||||
|
The goal is not to present a certified production platform. The goal is to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.
|
||||||
|
|
||||||
|
## What this project demonstrates
|
||||||
|
|
||||||
|
- Slurm controller and worker node management.
|
||||||
|
- Munge authentication across the cluster.
|
||||||
|
- GPU node integration through Slurm GRES.
|
||||||
|
- cgroup CPU, memory, and GPU device enforcement.
|
||||||
|
- SlurmDBD with MariaDB-backed accounting.
|
||||||
|
- `sacct`, `sreport`, and `sacctmgr` workflows.
|
||||||
|
- QOS, fairshare, and multifactor priority configuration.
|
||||||
|
- Node provisioning and decommissioning workflows.
|
||||||
|
- Rolling OS upgrades with canary validation.
|
||||||
|
- Health checks and auto-remediation.
|
||||||
|
- Backup and restore-check workflow for the accounting database.
|
||||||
|
- Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.
|
||||||
|
|
||||||
|
## Architecture overview
|
||||||
|
|
||||||
|
```mermaid
|
||||||
|
flowchart LR
|
||||||
|
operator[Ansible control node]
|
||||||
|
munge[Munge authentication]
|
||||||
|
controller[Slurm controller<br/>slurmctld]
|
||||||
|
db[MariaDB + SlurmDBD<br/>accounting]
|
||||||
|
shared[Shared filesystem<br/>site dependency]
|
||||||
|
cpu_part[CPU partition]
|
||||||
|
gpu_part[GPU partition]
|
||||||
|
cpu_nodes[CPU compute nodes<br/>slurmd]
|
||||||
|
gpu_node[GPU node<br/>slurmd + GRES]
|
||||||
|
jobs[User jobs<br/>sbatch / srun]
|
||||||
|
|
||||||
|
operator -->|bootstrap and configure| controller
|
||||||
|
operator -->|configure workers| cpu_nodes
|
||||||
|
operator -->|configure GPU worker| gpu_node
|
||||||
|
operator -->|deploy key and service| munge
|
||||||
|
|
||||||
|
munge --> controller
|
||||||
|
munge --> cpu_nodes
|
||||||
|
munge --> gpu_node
|
||||||
|
|
||||||
|
controller -->|accounting RPC| db
|
||||||
|
jobs -->|submit to Slurm| controller
|
||||||
|
controller -->|schedule CPU jobs| cpu_part
|
||||||
|
controller -->|schedule GPU jobs| gpu_part
|
||||||
|
cpu_part --> cpu_nodes
|
||||||
|
gpu_part --> gpu_node
|
||||||
|
|
||||||
|
cpu_nodes --- shared
|
||||||
|
gpu_node --- shared
|
||||||
|
controller --- shared
|
||||||
|
```
|
||||||
|
|
||||||
|
The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.
|
||||||
|
|
||||||
|
## Repository layout
|
||||||
|
|
||||||
|
```text
|
||||||
|
inventories/lab/ Sanitized lab inventory and group variables
|
||||||
|
playbooks/bootstrap/ Initial SSH, sudo, operator user, and host setup
|
||||||
|
playbooks/core/ Munge, Slurm config, and safe restart workflows
|
||||||
|
playbooks/accounting/ SlurmDBD, MariaDB, backup, restore-check, and reporting validation
|
||||||
|
playbooks/qos/ QOS, fairshare, and priority configuration
|
||||||
|
playbooks/lifecycle/ Node provisioning, inspection, and decommissioning
|
||||||
|
playbooks/upgrade/ Canary and rolling OS upgrade workflows
|
||||||
|
playbooks/health/ Health checks, repair, and auto-remediation
|
||||||
|
playbooks/tests/ CPU, GPU, cgroup, accounting, and reporting validation jobs
|
||||||
|
playbooks/backup/ Slurm and Munge state backup helpers
|
||||||
|
templates/ Slurm, cgroup, GRES, and SlurmDBD templates
|
||||||
|
docs/ Runbook, interview notes, and troubleshooting cases
|
||||||
|
prompts/ Documentation prompts used to expand this project
|
||||||
|
```
|
||||||
|
|
||||||
|
## Main operational workflows
|
||||||
|
|
||||||
|
Run commands from `platform-projects/hpc-slurm-ai-cluster/`. Review inventory and variables before running any playbook.
|
||||||
|
|
||||||
|
### Bootstrap access
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
|
||||||
|
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
|
||||||
|
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
|
||||||
|
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Deploy Munge
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/core/manage-munge.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Deploy Slurm config
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
|
||||||
|
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
|
||||||
|
ansible-playbook playbooks/core/restart-slurm-safe.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Validate CPU jobs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/tests/validate-slurm-operator.yml
|
||||||
|
ansible-playbook playbooks/tests/test-cpu-job.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Validate GPU jobs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/tests/test-gpu-job.yml
|
||||||
|
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Enable accounting
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
|
||||||
|
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
|
||||||
|
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
|
||||||
|
ansible-playbook playbooks/tests/test-sreport-usage.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Configure QOS and fairshare
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/qos/configure-slurm-qos.yml
|
||||||
|
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Provision a node
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=<node>
|
||||||
|
ansible-playbook playbooks/tests/test-specific-node.yml -e target_node=<node>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Decommission a node
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml \
|
||||||
|
-e target_node=<node> \
|
||||||
|
-e "decom_reason=planned maintenance"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Rolling OS upgrade
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=<node>
|
||||||
|
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
|
||||||
|
-e canary_node=<node> \
|
||||||
|
-e skip_canary=true
|
||||||
|
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
|
||||||
|
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
### Health check and auto-remediation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/health/check-slurm-health.yml
|
||||||
|
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
|
||||||
|
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=<node>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Accounting backup and restore-check
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
|
||||||
|
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
## Operational maturity
|
||||||
|
|
||||||
|
This is more than a toy lab because it includes operational controls around the cluster, not only a static `slurm.conf` example.
|
||||||
|
|
||||||
|
- Ansible workflows are designed to be repeatable and readable.
|
||||||
|
- Configuration deployment supports check and diff review before applying changes.
|
||||||
|
- Validation jobs exercise CPU scheduling, GPU scheduling, cgroup behavior, accounting, and reporting in the lab.
|
||||||
|
- SlurmDBD and MariaDB accounting are configured with `sacct`, `sreport`, and `sacctmgr` validation.
|
||||||
|
- QOS, fairshare, priority, and association workflows show resource governance.
|
||||||
|
- Node lifecycle playbooks drain, decommission, reprovision, resume, and validate nodes.
|
||||||
|
- Rolling upgrade playbooks include canary validation before broader worker upgrades.
|
||||||
|
- Health and repair playbooks document remediation paths for common node states.
|
||||||
|
- Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
|
||||||
|
- Troubleshooting cases document real lab failure modes without exposing private infrastructure details.
|
||||||
|
|
||||||
|
## Tested capabilities
|
||||||
|
|
||||||
|
- [x] CPU job scheduling.
|
||||||
|
- [x] GPU job scheduling.
|
||||||
|
- [x] GPU denial when no GRES is requested.
|
||||||
|
- [x] CPU cgroup enforcement.
|
||||||
|
- [x] SlurmDBD accounting setup.
|
||||||
|
- [x] `sacct` job history visibility.
|
||||||
|
- [x] `sreport` usage reporting.
|
||||||
|
- [x] QOS creation and validation.
|
||||||
|
- [x] Fairshare and priority visibility.
|
||||||
|
- [x] Node decommission and reprovision workflow.
|
||||||
|
- [x] Rolling upgrade canary workflow.
|
||||||
|
- [x] Node health check and auto-remediation workflow.
|
||||||
|
|
||||||
|
These checks represent sanitized lab validation, not a claim of production certification.
|
||||||
|
|
||||||
|
## Safety and sanitization
|
||||||
|
|
||||||
|
This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.
|
||||||
|
|
||||||
|
Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files. Real credentials, including SlurmDBD database passwords, belong in Ansible Vault or another approved secret store.
|
||||||
|
|
||||||
|
Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.
|
||||||
|
|
||||||
|
## Why this matters for AI/HPC infrastructure roles
|
||||||
|
|
||||||
|
AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.
|
||||||
|
|
||||||
|
This project demonstrates practical understanding of:
|
||||||
|
|
||||||
|
- Linux systems operations.
|
||||||
|
- Slurm cluster operations.
|
||||||
|
- GPU infrastructure and GRES scheduling.
|
||||||
|
- Job scheduling and resource isolation.
|
||||||
|
- Accounting, reporting, QOS, fairshare, and priority policy.
|
||||||
|
- Automation and repeatability with Ansible.
|
||||||
|
- Troubleshooting and operational ownership.
|
||||||
|
|
||||||
|
## Deeper docs
|
||||||
|
|
||||||
|
- [Runbook](docs/runbook.md)
|
||||||
|
- [Interview cheatsheet](docs/interview-cheatsheet.md)
|
||||||
|
- [Troubleshooting cases](docs/troubleshooting-cases.md)
|
||||||
@@ -0,0 +1,14 @@
|
|||||||
|
[defaults]
|
||||||
|
inventory = ./inventories/lab/inventory.yml
|
||||||
|
host_key_checking = False
|
||||||
|
retry_files_enabled = False
|
||||||
|
stdout_callback = default
|
||||||
|
callback_result_format = yaml
|
||||||
|
interpreter_python = auto_silent
|
||||||
|
timeout = 30
|
||||||
|
roles_path = ./roles
|
||||||
|
collections_path = ./collections
|
||||||
|
|
||||||
|
[ssh_connection]
|
||||||
|
pipelining = True
|
||||||
|
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
|
||||||
@@ -0,0 +1 @@
|
|||||||
|
Generated backups and reports can be stored here locally. This directory is ignored by git.
|
||||||
@@ -0,0 +1,22 @@
|
|||||||
|
# Interview Cheatsheet: Slurm AI/HPC Lab
|
||||||
|
|
||||||
|
## One-minute summary
|
||||||
|
|
||||||
|
I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.
|
||||||
|
|
||||||
|
## Topics I can discuss
|
||||||
|
|
||||||
|
- How Slurm schedules CPU and GPU workloads.
|
||||||
|
- Difference between GRES scheduling and cgroup device enforcement.
|
||||||
|
- Why Munge key consistency matters.
|
||||||
|
- How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
|
||||||
|
- How QOS, account associations, fairshare and multifactor priority work.
|
||||||
|
- Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
|
||||||
|
|
||||||
|
## Real troubleshooting examples
|
||||||
|
|
||||||
|
- `IDLE+NOT_RESPONDING` after node reprovisioning.
|
||||||
|
- Accounting delay where `sacct` temporarily showed `PENDING` while job output existed.
|
||||||
|
- Missing `gres/gpu` TRES before QOS GPU limits could be configured.
|
||||||
|
- `sacctmgr` idempotency issues such as `Nothing new added`.
|
||||||
|
- Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.
|
||||||
@@ -0,0 +1,75 @@
|
|||||||
|
# Slurm AI/HPC Lab Runbook
|
||||||
|
|
||||||
|
## Standard deployment order
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
|
||||||
|
ansible-playbook playbooks/bootstrap/slurm-hosts.yml
|
||||||
|
ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
|
||||||
|
ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
|
||||||
|
|
||||||
|
ansible-playbook playbooks/core/manage-munge.yml
|
||||||
|
ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
|
||||||
|
ansible-playbook playbooks/core/manage-slurm-config.yml --diff
|
||||||
|
ansible-playbook playbooks/core/restart-slurm-safe.yml
|
||||||
|
|
||||||
|
ansible-playbook playbooks/tests/validate-slurm-operator.yml
|
||||||
|
ansible-playbook playbooks/tests/test-cpu-job.yml
|
||||||
|
ansible-playbook playbooks/tests/test-gpu-job.yml
|
||||||
|
ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
|
||||||
|
|
||||||
|
ansible-playbook playbooks/accounting/setup-slurmdbd.yml
|
||||||
|
ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
|
||||||
|
ansible-playbook playbooks/accounting/backup-slurmdbd.yml
|
||||||
|
ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
|
||||||
|
ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
|
||||||
|
|
||||||
|
ansible-playbook playbooks/qos/configure-slurm-qos.yml
|
||||||
|
ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
|
||||||
|
|
||||||
|
ansible-playbook playbooks/health/check-slurm-health.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
## Node lifecycle
|
||||||
|
|
||||||
|
Provision a node:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
|
||||||
|
```
|
||||||
|
|
||||||
|
Decommission a node:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
|
||||||
|
```
|
||||||
|
|
||||||
|
Repair a node:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
|
||||||
|
```
|
||||||
|
|
||||||
|
Run health remediation for nodes that can be recovered by the automated workflow:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
Back up Slurm and Munge state before planned lifecycle work:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/backup/backup-slurm-state.yml
|
||||||
|
ansible-playbook playbooks/backup/fetch-slurm-backups.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
## Rolling OS upgrade
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
|
||||||
|
ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
|
||||||
|
ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
|
||||||
|
ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
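
The canary-then-workers ordering above can be sketched in a few lines of shell. This assumes that `skip_canary=true` means the canary node was already upgraded and should be left out of the rolling batch; that reading, and the helper name `rolling_batch`, are assumptions rather than something the playbooks confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the lab playbooks).
# $1 = canary node name; remaining arguments = all worker nodes.
# Prints the workers that still need the rolling upgrade, one per line.
rolling_batch() {
  rb_canary="$1"
  shift
  for rb_node in "$@"; do
    # The canary was upgraded in the previous step, so skip it here.
    [ "$rb_node" = "$rb_canary" ] && continue
    printf '%s\n' "$rb_node"
  done
}

# Example with the sample node names from the lab inventory.
rolling_batch slurm-c02 slurm-c01 slurm-c02 gpu01
```

With the sample inventory this prints `slurm-c01` and `gpu01`, which is the order a serial rolling batch would follow under the stated assumption.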
|
||||||
@@ -0,0 +1,28 @@
|
|||||||
|
# Troubleshooting Cases
|
||||||
|
|
||||||
|
## `IDLE+NOT_RESPONDING` after node maintenance
|
||||||
|
|
||||||
|
Symptoms: `sinfo` shows `idle*` or `scontrol show node` shows `IDLE+NOT_RESPONDING`.
|
||||||
|
|
||||||
|
Actions:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# On the affected compute node:
systemctl restart munge
|
||||||
|
systemctl restart slurmd
|
||||||
|
# On the controller:
systemctl restart slurmctld
|
||||||
|
scontrol update NodeName=<node> State=RESUME || true
|
||||||
|
scontrol update NodeName=<node> State=UNDRAIN || true
|
||||||
|
scontrol update NodeName=<node> State=IDLE || true
|
||||||
|
```
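
The `idle*` symptom can also be detected from a script. The following is a minimal sketch, not part of the lab playbooks: `not_responding_nodes` is a hypothetical helper that reads `node state` pairs on stdin and prints the nodes whose state ends in `*`, the suffix `sinfo` uses for a node that is not responding.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the lab repository).
# Reads lines of "<node> <state>" on stdin and prints every node whose
# state carries the trailing "*" that marks a non-responding node.
not_responding_nodes() {
  awk '$2 ~ /\*$/ { print $1 }'
}

# Canned input so the sketch runs without a cluster.
printf 'slurm-c01 idle\nslurm-c02 idle*\ngpu01 mix\n' | not_responding_nodes
```

On a live cluster the input would typically come from something like `sinfo -h -N -o '%N %t'`, and the resulting node list could feed the repair steps above.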
|
||||||
|
|
||||||
|
## Missing GPU TRES
|
||||||
|
|
||||||
|
Symptoms: `sacctmgr` fails with `no TRES known by type gres/gpu`.
|
||||||
|
|
||||||
|
Fix: add `AccountingStorageTRES=...,gres/gpu`, restart/reconfigure Slurm, run a GPU job and verify with `sacctmgr show tres`.
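
Before restarting anything, it can help to confirm whether the configured TRES list already names the GPU resource. This is a minimal sketch, assuming the `AccountingStorageTRES` value has already been extracted as a comma-separated string; `tres_list_contains` is a hypothetical helper, not a Slurm command.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exact-match lookup in a comma-separated TRES list.
# $1 = list such as "cpu,mem,gres/gpu", $2 = TRES name to look for.
tres_list_contains() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

# Sample value taken from the lab group variables.
configured="cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu"
if tres_list_contains "$configured" "gres/gpu"; then
  echo "gres/gpu is tracked"
else
  echo "gres/gpu is missing from AccountingStorageTRES"
fi
```

Wrapping the list in commas keeps the match exact, so a typed entry such as `gres/gpu:a100` is not mistaken for the generic `gres/gpu`.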
|
||||||
|
|
||||||
|
## SlurmDBD objects already exist
|
||||||
|
|
||||||
|
Symptoms: `sacctmgr` returns `Nothing new added` or `Already existing`.
|
||||||
|
|
||||||
|
Fix: make Ansible tasks idempotent: attempt the change, tolerate known existing-object messages, then normalize state with `modify`.
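
The tolerate-then-normalize pattern can be sketched as a small classifier. It assumes only the two messages quoted above; `classify_sacctmgr_add` is a hypothetical helper, and real `sacctmgr` output may contain other wording that would need its own case.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the lab playbooks).
# $1 = exit code of the sacctmgr "add" attempt, $2 = its captured output.
# Prints one of: unchanged, changed, failed.
classify_sacctmgr_add() {
  case "$2" in
    *"Nothing new added"*|*"Already existing"*)
      echo "unchanged" ;;          # known existing-object messages
    *)
      if [ "$1" -eq 0 ]; then
        echo "changed"             # clean exit without an existing-object message
      else
        echo "failed"              # anything else is treated as a real error
      fi ;;
  esac
}

# Canned inputs; the last string is a placeholder, not real sacctmgr output.
classify_sacctmgr_add 1 "Nothing new added"
classify_sacctmgr_add 0 "Adding Cluster"
classify_sacctmgr_add 1 "placeholder for any other error text"
```

In an Ansible task the same idea maps to `changed_when` and `failed_when` conditions on the registered output, followed by a `sacctmgr modify` task to normalize state.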
|
||||||
@@ -0,0 +1,128 @@
|
|||||||
|
---
|
||||||
|
# Example lab inventory variables. Replace addresses, users and node topology for your environment.
|
||||||
|
|
||||||
|
slurm_cluster_name: labcluster
|
||||||
|
|
||||||
|
slurm_control_machine: slurm-ctl01
|
||||||
|
slurm_control_addr: 10.10.10.11
|
||||||
|
|
||||||
|
slurm_config_dir: /etc/slurm
|
||||||
|
slurm_user: slurm
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
|
||||||
|
slurmctld_port: 6817
|
||||||
|
slurmd_port: 6818
|
||||||
|
|
||||||
|
slurm_job_comp_type: jobcomp/none
|
||||||
|
|
||||||
|
slurm_select_type: select/cons_tres
|
||||||
|
slurm_select_type_parameters: CR_Core_Memory
|
||||||
|
|
||||||
|
slurm_return_to_service: 2
|
||||||
|
slurm_default_mpi_type: none
|
||||||
|
|
||||||
|
slurm_gres_types: gpu
|
||||||
|
|
||||||
|
slurm_nodes:
|
||||||
|
- name: slurm-c01
|
||||||
|
managed_state: present
|
||||||
|
addr: 10.10.10.12
|
||||||
|
cpus: 2
|
||||||
|
real_memory: 1800
|
||||||
|
features: ""
|
||||||
|
gres: ""
|
||||||
|
topology: ""
|
||||||
|
- name: slurm-c02
|
||||||
|
managed_state: present
|
||||||
|
addr: 10.10.10.13
|
||||||
|
cpus: 2
|
||||||
|
real_memory: 1800
|
||||||
|
features: ""
|
||||||
|
gres: ""
|
||||||
|
topology: ""
|
||||||
|
- name: gpu01
|
||||||
|
managed_state: present
|
||||||
|
addr: 10.10.10.14
|
||||||
|
cpus: 12
|
||||||
|
real_memory: 60000
|
||||||
|
features: "gpu"
|
||||||
|
gres: "gpu:1"
|
||||||
|
gres_file: /dev/nvidia0
|
||||||
|
topology: "Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2"
|
||||||
|
|
||||||
|
slurm_partitions:
|
||||||
|
- name: debug
|
||||||
|
managed_state: present
|
||||||
|
nodes: "slurm-c[01-02]"
|
||||||
|
default: "YES"
|
||||||
|
max_time: "INFINITE"
|
||||||
|
state: "UP"
|
||||||
|
- name: gpu
|
||||||
|
managed_state: present
|
||||||
|
nodes: "gpu01"
|
||||||
|
default: "NO"
|
||||||
|
max_time: "INFINITE"
|
||||||
|
state: "UP"
|
||||||
|
- name: all
|
||||||
|
managed_state: present
|
||||||
|
nodes: "slurm-c[01-02],gpu01"
|
||||||
|
default: "NO"
|
||||||
|
max_time: "INFINITE"
|
||||||
|
state: "UP"
|
||||||
|
|
||||||
|
# Cgroup enforcement
|
||||||
|
slurm_enable_cgroup: true
|
||||||
|
slurm_task_plugin: task/cgroup,task/affinity
|
||||||
|
slurm_proctrack_type: proctrack/cgroup
|
||||||
|
slurm_job_acct_gather_type: jobacct_gather/cgroup
|
||||||
|
|
||||||
|
# Slurm accounting / SlurmDBD
|
||||||
|
slurm_accounting_storage_type: accounting_storage/slurmdbd
|
||||||
|
slurm_accounting_storage_host: slurm-ctl01
|
||||||
|
slurm_accounting_storage_port: 6819
|
||||||
|
slurm_accounting_storage_enforce: associations,limits,qos
|
||||||
|
slurm_accounting_storage_tres: cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu
|
||||||
|
|
||||||
|
slurmdbd_host: slurm-ctl01
|
||||||
|
slurmdbd_port: 6819
|
||||||
|
slurmdbd_storage_type: accounting_storage/mysql
|
||||||
|
slurmdbd_storage_host: localhost
|
||||||
|
slurmdbd_storage_port: 3306
|
||||||
|
slurmdbd_storage_loc: slurm_acct_db
|
||||||
|
slurmdbd_storage_user: slurm
|
||||||
|
# Use Ansible Vault in real environments. See inventories/lab/group_vars/vault.example.yml
|
||||||
|
slurmdbd_storage_pass: "{{ vault_slurmdbd_storage_pass | default('CHANGE_ME_USE_ANSIBLE_VAULT') }}"
|
||||||
|
|
||||||
|
slurm_account_name: lab
|
||||||
|
slurm_account_description: "AI/HPC Slurm lab account"
|
||||||
|
slurm_account_organization: "labcluster"
|
||||||
|
|
||||||
|
# SlurmDBD purge / retention policy for lab
|
||||||
|
slurmdbd_commit_delay: 1
|
||||||
|
slurmdbd_purge_event_after: 12months
|
||||||
|
slurmdbd_purge_job_after: 12months
|
||||||
|
slurmdbd_purge_resv_after: 12months
|
||||||
|
slurmdbd_purge_step_after: 3months
|
||||||
|
slurmdbd_purge_suspend_after: 3months
|
||||||
|
slurmdbd_purge_txn_after: 12months
|
||||||
|
slurmdbd_purge_usage_after: 24months
|
||||||
|
|
||||||
|
# Archive is disabled for the lab; backup playbooks handle database dumps.
|
||||||
|
slurmdbd_archive_events: no
|
||||||
|
slurmdbd_archive_jobs: no
|
||||||
|
slurmdbd_archive_steps: no
|
||||||
|
slurmdbd_archive_suspend: no
|
||||||
|
slurmdbd_archive_txn: no
|
||||||
|
slurmdbd_archive_usage: no
|
||||||
|
|
||||||
|
# Slurm priority / fairshare
|
||||||
|
slurm_priority_type: priority/multifactor
|
||||||
|
slurm_priority_decay_half_life: 7-0
|
||||||
|
slurm_priority_calc_period: 5
|
||||||
|
slurm_priority_favor_small: "NO"
|
||||||
|
slurm_priority_weight_age: 1000
|
||||||
|
slurm_priority_weight_fairshare: 10000
|
||||||
|
slurm_priority_weight_job_size: 1000
|
||||||
|
slurm_priority_weight_partition: 1000
|
||||||
|
slurm_priority_weight_qos: 10000
|
||||||
|
slurm_priority_max_age: 1-0
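
To see how these weights interact, the sketch below computes a simplified multifactor priority as a weighted sum of factors in the range 0 to 1. It deliberately ignores the site, association, TRES, and nice terms of Slurm's full formula, and the factor values are made up for illustration; `simple_priority` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper: simplified multifactor priority.
# Arguments are factors between 0 and 1, in this order:
#   age fairshare job_size partition qos
# The weights mirror the slurm_priority_weight_* values in this file.
simple_priority() {
  awk -v age="$1" -v fs="$2" -v size="$3" -v part="$4" -v qos="$5" 'BEGIN {
    total = 1000 * age + 10000 * fs + 1000 * size + 1000 * part + 10000 * qos
    printf "%d\n", total
  }'
}

# Illustrative factors only: 500 + 2500 + 500 + 1000 + 5000
simple_priority 0.5 0.25 0.5 1 0.5
```

Because fairshare and QOS are weighted ten times higher than age, job size, and partition, those two factors dominate the result in this configuration.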
|
||||||
@@ -0,0 +1,5 @@
|
|||||||
|
---
|
||||||
|
# Copy this file to vault.yml and encrypt it with ansible-vault.
|
||||||
|
# ansible-vault encrypt inventories/lab/group_vars/vault.yml
|
||||||
|
|
||||||
|
vault_slurmdbd_storage_pass: CHANGE_ME
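
A preflight guard can catch the placeholder before it reaches a real database. This is a minimal sketch, assuming the only placeholder convention is the `CHANGE_ME` prefix used in this file and in the `default(...)` fallback of the group variables; `is_placeholder_secret` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of the lab playbooks).
# Returns 0 when the value is empty or still starts with CHANGE_ME.
is_placeholder_secret() {
  case "${1:-}" in
    ""|CHANGE_ME*) return 0 ;;
    *)             return 1 ;;
  esac
}

# The third value is an obviously fake stand-in for a real secret.
for candidate in "CHANGE_ME" "CHANGE_ME_USE_ANSIBLE_VAULT" "example-not-a-real-secret"; do
  if is_placeholder_secret "$candidate"; then
    echo "$candidate: placeholder, refuse to deploy"
  else
    echo "$candidate: looks set"
  fi
done
```

The same condition could be expressed as an `ansible.builtin.assert` task at the start of the accounting playbooks; the shell form is shown only because it runs without an inventory.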
|
||||||
@@ -0,0 +1,24 @@
|
|||||||
|
all:
|
||||||
|
vars:
|
||||||
|
ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
|
||||||
|
children:
|
||||||
|
slurm_cluster:
|
||||||
|
children:
|
||||||
|
slurm_controller:
|
||||||
|
hosts:
|
||||||
|
slurm-ctl01:
|
||||||
|
ansible_host: 10.10.10.11
|
||||||
|
ansible_user: ansible
|
||||||
|
slurm_compute:
|
||||||
|
hosts:
|
||||||
|
slurm-c01:
|
||||||
|
ansible_host: 10.10.10.12
|
||||||
|
ansible_user: ansible
|
||||||
|
slurm-c02:
|
||||||
|
ansible_host: 10.10.10.13
|
||||||
|
ansible_user: ansible
|
||||||
|
slurm_gpu:
|
||||||
|
hosts:
|
||||||
|
gpu01:
|
||||||
|
ansible_host: 10.10.10.14
|
||||||
|
ansible_user: ansible
|
||||||
@@ -0,0 +1,90 @@
|
|||||||
|
---
|
||||||
|
- name: Backup SlurmDBD MariaDB database
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurmdbd_backup_dir: /var/backups/slurmdbd
|
||||||
|
local_fetch_dir: "{{ playbook_dir }}/../../artifacts/backups/slurmdbd"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Create remote backup directory
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ slurmdbd_backup_dir }}"
|
||||||
|
state: directory
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Create local fetch directory on Ansible controller
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ local_fetch_dir }}"
|
||||||
|
state: directory
|
||||||
|
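# owner/group are omitted: this task runs on localhost with become: false,
|
||||||
|
# so a chown to root would fail for a non-root operator.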
|
||||||
|
mode: "0700"
|
||||||
|
delegate_to: localhost
|
||||||
|
become: false
|
||||||
|
|
||||||
|
- name: Validate MariaDB is running
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: systemctl is-active mariadb
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate SlurmDBD is running
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: systemctl is-active slurmdbd
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate Slurm accounting database exists
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';" | grep -qx "{{ slurmdbd_storage_loc }}"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Dump Slurm accounting database
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
ts="$(date +%F-%H%M%S)"
|
||||||
|
out="{{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-${ts}.sql.gz"
|
||||||
|
|
||||||
|
mysqldump \
|
||||||
|
--single-transaction \
|
||||||
|
--routines \
|
||||||
|
--events \
|
||||||
|
--triggers \
|
||||||
|
{{ slurmdbd_storage_loc }} | gzip -9 > "$out"
|
||||||
|
|
||||||
|
chmod 0600 "$out"
|
||||||
|
echo "$out"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: db_dump
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Validate backup file is non-empty
|
||||||
|
ansible.builtin.stat:
|
||||||
|
path: "{{ db_dump.stdout }}"
|
||||||
|
register: backup_file
|
||||||
|
|
||||||
|
- name: Fail if backup file is suspiciously small
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Backup file is smaller than 1024 bytes (empty or truncated): {{ db_dump.stdout }}"
|
||||||
|
when: backup_file.stat.size | int < 1024
|
||||||
|
|
||||||
|
- name: Fetch DB backup to Ansible controller
|
||||||
|
ansible.builtin.fetch:
|
||||||
|
src: "{{ db_dump.stdout }}"
|
||||||
|
dest: "{{ local_fetch_dir }}/"
|
||||||
|
flat: true
|
||||||
|
|
||||||
|
- name: Show DB backup result
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
- "Remote backup: {{ db_dump.stdout }}"
|
||||||
|
- "Backup size bytes: {{ backup_file.stat.size }}"
|
||||||
|
- "Fetched to: {{ local_fetch_dir }}/"
|
||||||
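
The size check in the playbook above guards against an empty dump, but it does not confirm that the gzip stream itself is intact. Below is a minimal sketch of a combined check, assuming a hypothetical helper `backup_looks_valid` and the same 1024-byte threshold; it is not part of the playbook.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the playbook).
# $1 = path to a .sql.gz dump, $2 = minimum size in bytes (default 1024).
# Succeeds only if the file exists, meets the size floor, and passes gzip -t.
backup_looks_valid() {
  bk_file="$1"
  bk_min="${2:-1024}"
  [ -f "$bk_file" ] || return 1
  bk_size=$(( $(wc -c < "$bk_file") ))   # arithmetic strips any padding from wc
  [ "$bk_size" -ge "$bk_min" ] || return 1
  gzip -t < "$bk_file" 2>/dev/null       # integrity test of the gzip stream
}

# Self-contained demo with a throwaway file instead of a real dump.
demo="$(mktemp)"
head -c 4096 /dev/urandom | gzip -9 > "$demo"
if backup_looks_valid "$demo"; then
  echo "demo dump passes"
else
  echo "demo dump rejected"
fi
rm -f "$demo"
```

A passing `gzip -t` only shows that the archive is readable; the restore-check playbook remains the real test that the SQL inside can be imported.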
@@ -0,0 +1,126 @@
---
- name: Initialize Slurm accounting entities
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Wait for sacctmgr connectivity
      ansible.builtin.command:
        cmd: sacctmgr -n list cluster
      register: sacctmgr_cluster_list
      retries: 20
      delay: 2
      until: sacctmgr_cluster_list.rc == 0
      changed_when: false

    - name: Show current accounting state before changes
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC

        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org

        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin

        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: accounting_state_before
      changed_when: false

    - name: Print current accounting state before changes
      ansible.builtin.debug:
        var: accounting_state_before.stdout_lines

    - name: Ensure Slurm cluster exists in accounting DB
      ansible.builtin.shell: |
        set -euo pipefail

        if sacctmgr -n list cluster format=Cluster | awk '{print $1}' | grep -qx "{{ slurm_cluster_name }}"; then
          echo "Cluster {{ slurm_cluster_name }} already exists"
        else
          sacctmgr -i add cluster {{ slurm_cluster_name }}
        fi
      args:
        executable: /bin/bash
      register: ensure_cluster
      changed_when: "'Adding Cluster' in ensure_cluster.stdout"

    - name: Ensure default lab account exists for cluster
      ansible.builtin.shell: |
        set -euo pipefail

        if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="" {found=1} END {exit !found}'; then
          echo "Account {{ slurm_account_name }} already associated with cluster {{ slurm_cluster_name }}"
        else
          sacctmgr -i add account {{ slurm_account_name }} \
            Cluster={{ slurm_cluster_name }} \
            Description="{{ slurm_account_description }}" \
            Organization="{{ slurm_account_organization }}"
        fi
      args:
        executable: /bin/bash
      register: ensure_account
      changed_when: "'Adding Account' in ensure_account.stdout"

    - name: Ensure slurmuser exists with lab account association
      ansible.builtin.shell: |
        set -euo pipefail

        if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="slurmuser" {found=1} END {exit !found}'; then
          echo "User slurmuser already associated with account {{ slurm_account_name }} on cluster {{ slurm_cluster_name }}"
        else
          sacctmgr -i add user slurmuser \
            Cluster={{ slurm_cluster_name }} \
            Account={{ slurm_account_name }} \
            DefaultAccount={{ slurm_account_name }}
        fi
      args:
        executable: /bin/bash
      register: ensure_user_assoc
      changed_when: "'Adding User' in ensure_user_assoc.stdout"

    - name: Ensure slurmuser has default account set
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
      args:
        executable: /bin/bash
      register: set_default_account
      changed_when: "'Modified user' in (set_default_account.stdout + set_default_account.stderr)"

    - name: Show final accounting state
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC

        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org

        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin

        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: accounting_state_after
      changed_when: false

    - name: Print final accounting state
      ansible.builtin.debug:
        var: accounting_state_after.stdout_lines
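The existence checks above parse column-aligned `sacctmgr` output with `awk`. A minimal sketch of the same idempotent check using pipe-delimited output (`-P`, i.e. `--parsable2`) might look like the following. The registered variable `assoc_list` is a hypothetical name, and the sketch assumes an account-level association prints as `cluster|account|` with an empty user field.

```yaml
# Sketch only: check-then-add using sacctmgr's pipe-delimited output.
- name: List associations in parsable form
  ansible.builtin.command:
    cmd: sacctmgr -nP list assoc format=Cluster,Account,User
  register: assoc_list   # hypothetical variable name
  changed_when: false

- name: Add the lab account only when its association is missing
  ansible.builtin.command:
    cmd: >-
      sacctmgr -i add account {{ slurm_account_name }}
      Cluster={{ slurm_cluster_name }}
  # An account-level association has an empty User column, hence the trailing '|'.
  when: (slurm_cluster_name ~ '|' ~ slurm_account_name ~ '|') not in assoc_list.stdout_lines
```

Because the add task only runs when the line is absent, its default `changed` status is accurate without matching on `Adding Account` in the output.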
+98
@@ -0,0 +1,98 @@
---
- name: Restore-check latest SlurmDBD backup into test database
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    restore_check_db: "{{ slurmdbd_storage_loc }}_restorecheck"
    slurmdbd_backup_dir: /var/backups/slurmdbd

  tasks:
    - name: Validate MariaDB is running
      ansible.builtin.command:
        cmd: systemctl is-active mariadb
      changed_when: false

    - name: Find latest SlurmDBD backup
      ansible.builtin.shell: |
        set -euo pipefail
        ls -1t {{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-*.sql.gz | head -n 1
      args:
        executable: /bin/bash
      register: latest_backup
      changed_when: false

    - name: Validate latest backup exists
      ansible.builtin.stat:
        path: "{{ latest_backup.stdout }}"
      register: latest_backup_stat

    - name: Fail if latest backup is missing or empty
      ansible.builtin.fail:
        msg: "Latest SlurmDBD backup is missing or empty: {{ latest_backup.stdout }}"
      when:
        - not latest_backup_stat.stat.exists or latest_backup_stat.stat.size | int < 1024

    - name: Recreate restore-check database
      ansible.builtin.shell: |
        set -euo pipefail
        mysql <<SQL
        DROP DATABASE IF EXISTS {{ restore_check_db }};
        CREATE DATABASE {{ restore_check_db }};
        SQL
      args:
        executable: /bin/bash
      changed_when: true

    - name: Import backup into restore-check database
      ansible.builtin.shell: |
        set -euo pipefail
        zcat "{{ latest_backup.stdout }}" | mysql {{ restore_check_db }}
      args:
        executable: /bin/bash
      changed_when: true

    - name: Validate restored table count
      ansible.builtin.shell: |
        set -euo pipefail
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
      args:
        executable: /bin/bash
      register: restored_tables
      changed_when: false
      failed_when: restored_tables.stdout | int < 1

    - name: Validate restored row count sample
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### restored database"
        echo "{{ restore_check_db }}"

        echo
        echo "### table count"
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"

        echo
        echo "### largest tables"
        mysql -N -B -e "
        SELECT table_name, table_rows
        FROM information_schema.tables
        WHERE table_schema='{{ restore_check_db }}'
        ORDER BY table_rows DESC
        LIMIT 10;
        "
      args:
        executable: /bin/bash
      register: restore_check_summary
      changed_when: false

    - name: Show restore-check result
      ansible.builtin.debug:
        msg:
          - "Imported backup: {{ latest_backup.stdout }}"
          - "Restore-check DB: {{ restore_check_db }}"
          - "Restored tables: {{ restored_tables.stdout }}"
          - "Summary:"
          - "{{ restore_check_summary.stdout_lines }}"
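The restore check above locates the newest dump with `ls -1t | head`. A hedged sketch of the same lookup with `ansible.builtin.find`, which avoids parsing `ls` output, could look like this. The names `found_backups` and `latest_backup_path` are hypothetical and do not appear in the playbook.

```yaml
# Sketch only: pick the newest backup by modification time without a shell.
- name: Collect SlurmDBD backup archives
  ansible.builtin.find:
    paths: "{{ slurmdbd_backup_dir }}"
    patterns: "{{ slurmdbd_storage_loc }}-*.sql.gz"
    file_type: file
  register: found_backups   # hypothetical variable name

- name: Fail early when no backup is present
  ansible.builtin.assert:
    that: found_backups.files | length > 0
    fail_msg: "No SlurmDBD backups found in {{ slurmdbd_backup_dir }}"

- name: Remember the newest backup path
  ansible.builtin.set_fact:
    # Sorting by mtime ascending puts the most recent file last.
    latest_backup_path: "{{ (found_backups.files | sort(attribute='mtime') | last).path }}"
```

Each entry returned by `find` also carries a `size` field, so the separate `stat` task and the 1024-byte check could read from the same result.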
@@ -0,0 +1,105 @@
---
- name: Install and configure MariaDB for SlurmDBD
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Install MariaDB and SlurmDBD packages
      ansible.builtin.apt:
        name:
          - mariadb-server
          - mariadb-client
          - slurmdbd
          - slurm-wlm-mysql-plugin
        state: present
        update_cache: true

    - name: Ensure MariaDB is enabled and running
      ansible.builtin.systemd:
        name: mariadb
        enabled: true
        state: started

    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"

    - name: Create Slurm accounting database and DB user
      ansible.builtin.shell: |
        set -euo pipefail
        mysql <<SQL
        CREATE DATABASE IF NOT EXISTS {{ slurmdbd_storage_loc }};
        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'localhost' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'127.0.0.1' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'localhost';
        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'127.0.0.1';
        FLUSH PRIVILEGES;
        SQL
      args:
        executable: /bin/bash
      changed_when: true

    - name: Ensure /etc/slurm exists
      ansible.builtin.file:
        path: /etc/slurm
        state: directory
        owner: root
        group: root
        mode: "0755"

    - name: Deploy slurmdbd.conf
      ansible.builtin.template:
        src: ../../templates/slurmdbd.conf.j2
        dest: /etc/slurm/slurmdbd.conf
        owner: slurm
        group: slurm
        mode: "0600"
      notify:
        - Restart slurmdbd

    - name: Ensure slurmdbd is enabled and running
      ansible.builtin.systemd:
        name: slurmdbd
        enabled: true
        state: started

    - name: Flush handlers before validation
      ansible.builtin.meta: flush_handlers

    - name: Validate slurmdbd service is active
      ansible.builtin.command:
        cmd: systemctl is-active slurmdbd
      register: slurmdbd_active
      retries: 10
      delay: 2
      until: slurmdbd_active.stdout == "active"
      changed_when: false

    - name: Validate slurmdbd is listening on port
      ansible.builtin.shell: |
        set -euo pipefail
        ss -lntp | grep ':{{ slurmdbd_port }} '
      args:
        executable: /bin/bash
      register: slurmdbd_port_check
      retries: 10
      delay: 2
      until: slurmdbd_port_check.rc == 0
      changed_when: false

    - name: Show slurmdbd service validation
      ansible.builtin.debug:
        msg:
          - "slurmdbd is active"
          - "{{ slurmdbd_port_check.stdout_lines }}"

  handlers:
    - name: Restart slurmdbd
      ansible.builtin.systemd:
        name: slurmdbd
        state: restarted
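The database bootstrap above runs SQL through a heredoc and always reports `changed`. A rough sketch of the same setup with the `community.mysql` modules, which report changes accurately and keep the password out of task output via `no_log`, is shown below. It assumes the `community.mysql` collection and PyMySQL are installed on the target, and the socket path is an assumption for Debian/Ubuntu.

```yaml
# Sketch only: module-based equivalent of the CREATE DATABASE / CREATE USER / GRANT heredoc.
- name: Ensure accounting database exists
  community.mysql.mysql_db:
    name: "{{ slurmdbd_storage_loc }}"
    state: present
    login_unix_socket: /run/mysqld/mysqld.sock   # assumed socket path

- name: Ensure SlurmDBD database user exists with privileges
  community.mysql.mysql_user:
    name: "{{ slurmdbd_storage_user }}"
    host: "{{ item }}"
    password: "{{ slurmdbd_storage_pass }}"
    priv: "{{ slurmdbd_storage_loc }}.*:ALL"
    state: present
    login_unix_socket: /run/mysqld/mysqld.sock   # assumed socket path
  loop:
    - localhost
    - 127.0.0.1
  no_log: true   # keeps the password out of logs and verbose output
```

The loop mirrors the two host entries (`localhost` and `127.0.0.1`) that the heredoc creates explicitly.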
+178
@@ -0,0 +1,178 @@
---
- name: Validate Slurm accounting production-like setup
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Validate accounting services
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### services"
        systemctl is-active mariadb
        systemctl is-active slurmdbd
        systemctl is-active slurmctld

        echo
        echo "### slurmdbd listener"
        ss -lntp | grep ':6819 '
      args:
        executable: /bin/bash
      register: service_check
      changed_when: false

    - name: Validate Slurm accounting runtime config
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### accounting config"
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|ClusterName"

        echo
        echo "### priority / select / cgroup config"
        scontrol show config | grep -E "SelectType|TaskPlugin|ProctrackType"
      args:
        executable: /bin/bash
      register: config_check
      changed_when: false

    - name: Validate sacctmgr entities
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC

        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org

        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin

        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: entity_check
      changed_when: false

    - name: Submit accounting validation job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=acct-prodlike-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/acct-prodlike-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/acct-prodlike-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: acct_job
      changed_when: true

    - name: Validate sacct can read recent jobs
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### recent jobs"
        sacct -S today --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
      args:
        executable: /bin/bash
      register: sacct_recent
      changed_when: false

    - name: Validate sreport commands
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### cluster utilization"
        sreport cluster utilization start=today || true

        echo
        echo "### account utilization by user"
        sreport cluster AccountUtilizationByUser start=today || true

        echo
        echo "### user top"
        sreport user top start=today || true
      args:
        executable: /bin/bash
      register: sreport_check
      changed_when: false

    - name: Validate MariaDB table health summary
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### database exists"
        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';"

        echo
        echo "### table count"
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ slurmdbd_storage_loc }}';"

        echo
        echo "### largest tables"
        mysql -N -B -e "
        SELECT table_name, table_rows
        FROM information_schema.tables
        WHERE table_schema='{{ slurmdbd_storage_loc }}'
        ORDER BY table_rows DESC
        LIMIT 10;
        "
      args:
        executable: /bin/bash
      register: db_health
      changed_when: false

    - name: Print accounting validation
      ansible.builtin.debug:
        msg:
          - "### services"
          - "{{ service_check.stdout_lines }}"
          - "### runtime config"
          - "{{ config_check.stdout_lines }}"
          - "### accounting entities"
          - "{{ entity_check.stdout_lines }}"
          - "### accounting validation job"
          - "{{ acct_job.stdout_lines }}"
          - "### recent sacct data"
          - "{{ sacct_recent.stdout_lines }}"
          - "### sreport"
          - "{{ sreport_check.stdout_lines }}"
          - "### database health"
          - "{{ db_health.stdout_lines }}"
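The validation job above waits for completion with a bash `for` loop inside one shell task. A minimal sketch of the same wait expressed with Ansible's own `until`/`retries` mechanism follows. The variable `validation_job_id` is hypothetical and assumes the job ID was captured by an earlier task, and the sketch assumes `squeue` still recognizes the ID for the duration of the wait.

```yaml
# Sketch only: poll the queue from Ansible instead of a bash loop.
- name: Wait for the validation job to leave the queue
  ansible.builtin.command:
    cmd: "squeue -h -j {{ validation_job_id }}"   # hypothetical variable
  register: queue_state
  # An empty listing means the job is no longer pending or running.
  until: queue_state.stdout | trim | length == 0
  retries: 90
  delay: 1
  changed_when: false
```

Splitting submission, waiting, and `sacct` inspection into separate tasks makes it easier to see which stage failed, at the cost of a few more tasks in the play.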
@@ -0,0 +1,83 @@
---
- name: Backup Slurm and Munge state on all cluster nodes
  hosts: slurm_cluster
  become: true
  gather_facts: true

  vars:
    backup_base_dir: /var/backups/slurm

  tasks:
    - name: Create backup base directory
      ansible.builtin.file:
        path: "{{ backup_base_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0700"

    - name: Create timestamped backup directory
      ansible.builtin.shell: |
        set -euo pipefail
        ts="$(date +%F-%H%M%S)"
        dir="{{ backup_base_dir }}/$ts"
        mkdir -p "$dir"
        echo "$dir"
      args:
        executable: /bin/bash
      register: backup_dir_result
      changed_when: true

    - name: Store backup directory fact
      ansible.builtin.set_fact:
        node_backup_dir: "{{ backup_dir_result.stdout }}"

    - name: Backup Slurm and Munge config/state if present
      ansible.builtin.shell: |
        set -euo pipefail

        backup_dir="{{ node_backup_dir }}"

        for p in \
          /etc/slurm \
          /etc/slurm-llnl \
          /etc/munge \
          /var/spool/slurmctld \
          /var/spool/slurmd \
          /var/log/slurm \
          /var/log/slurm-llnl
        do
          if [ -e "$p" ]; then
            cp -a "$p" "$backup_dir/"
          fi
        done

        systemctl status munge --no-pager > "$backup_dir/systemctl-munge.txt" 2>&1 || true
        systemctl status slurmctld --no-pager > "$backup_dir/systemctl-slurmctld.txt" 2>&1 || true
        systemctl status slurmd --no-pager > "$backup_dir/systemctl-slurmd.txt" 2>&1 || true

        journalctl -u munge -n 200 --no-pager > "$backup_dir/journal-munge.txt" 2>&1 || true
        journalctl -u slurmctld -n 200 --no-pager > "$backup_dir/journal-slurmctld.txt" 2>&1 || true
        journalctl -u slurmd -n 200 --no-pager > "$backup_dir/journal-slurmd.txt" 2>&1 || true

        if command -v sinfo >/dev/null 2>&1; then
          sinfo > "$backup_dir/sinfo.txt" 2>&1 || true
        fi

        if command -v scontrol >/dev/null 2>&1; then
          scontrol show config > "$backup_dir/scontrol-show-config.txt" 2>&1 || true
          scontrol show nodes > "$backup_dir/scontrol-show-nodes.txt" 2>&1 || true
          scontrol show partitions > "$backup_dir/scontrol-show-partitions.txt" 2>&1 || true
        fi

        find "$backup_dir" -maxdepth 2 \( -type f -o -type d \)
      args:
        executable: /bin/bash
      register: backup_content
      changed_when: true

    - name: Show backup location on node
      ansible.builtin.debug:
        msg:
          - "Host: {{ inventory_hostname }}"
          - "Backup directory: {{ node_backup_dir }}"
@@ -0,0 +1,46 @@
---
- name: Fetch latest Slurm backups from nodes to pvef
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    remote_backup_base: /var/backups/slurm
    local_backup_base: "{{ playbook_dir }}/../../artifacts/backups"

  tasks:
    - name: Find latest remote backup directory
      ansible.builtin.shell: |
        set -euo pipefail
        ls -1dt {{ remote_backup_base }}/* | head -n 1
      args:
        executable: /bin/bash
      register: latest_backup_dir
      changed_when: false

    - name: Create local backup directory on pvef
      ansible.builtin.file:
        path: "{{ local_backup_base }}/{{ inventory_hostname }}"
        state: directory
        mode: "0700"
      delegate_to: localhost
      become: false

    - name: Archive latest backup directory on remote node
      community.general.archive:
        path: "{{ latest_backup_dir.stdout }}"
        dest: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        format: gz
        force_archive: true
      changed_when: true

    - name: Fetch archive to pvef
      ansible.builtin.fetch:
        src: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        dest: "{{ local_backup_base }}/{{ inventory_hostname }}/"
        flat: true

    - name: Remove temporary remote archive
      ansible.builtin.file:
        path: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        state: absent
@@ -0,0 +1,58 @@
---
- name: Bootstrap Ansible SSH access from pvef to Slurm nodes
  hosts: slurm_cluster
  gather_facts: false
  become: true

  vars:
    ansible_controller_pubkey: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_ed25519.pub') }}"

  pre_tasks:
    - name: Wait for SSH
      ansible.builtin.wait_for_connection:
        timeout: 30

    - name: Install Python if missing - Debian/Ubuntu
      ansible.builtin.raw: |
        test -e /usr/bin/python3 || (apt-get update && apt-get install -y python3)
      changed_when: false

  tasks:
    - name: Ensure sudo and OpenSSH server are installed
      ansible.builtin.apt:
        name:
          - sudo
          - openssh-server
        state: present
        update_cache: true

    - name: Ensure SSH server is enabled and running
      ansible.builtin.service:
        name: ssh
        state: started
        enabled: true

    - name: Ensure .ssh directory exists for login user
      ansible.builtin.file:
        path: "/home/{{ ansible_user }}/.ssh"
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: "0700"

    - name: Add pvef root public key to login user's authorized_keys
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"
        key: "{{ ansible_controller_pubkey }}"
        state: present
        manage_dir: true

    - name: Allow bootstrap login user passwordless sudo
      ansible.builtin.copy:
        dest: "/etc/sudoers.d/90-ansible-{{ ansible_user }}"
        owner: root
        group: root
        mode: "0440"
        content: |
          {{ ansible_user }} ALL=(ALL) NOPASSWD:ALL
        validate: "visudo -cf %s"
@@ -0,0 +1,16 @@
---
- name: Configure /etc/hosts for Slurm cluster
  hosts: slurm_cluster
  become: true
  gather_facts: false

  tasks:
    - name: Add Slurm cluster hosts to /etc/hosts
      ansible.builtin.blockinfile:
        path: /etc/hosts
        marker: "# {mark} ANSIBLE MANAGED SLURM CLUSTER HOSTS"
        block: |
          {{ slurm_control_addr }} {{ slurm_control_machine }}
          {% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
          {{ node.addr }} {{ node.name }}
          {% endfor %}
@@ -0,0 +1,218 @@
|
|||||||
|
---
|
||||||
|
- name: Create slurmuser and generate SSH keys on every Slurm node
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
slurm_operator_shell: /bin/bash
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Ensure useful packages are installed
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name:
|
||||||
|
- sudo
|
||||||
|
- openssh-client
|
||||||
|
- openssh-server
|
||||||
|
- acl
|
||||||
|
state: present
|
||||||
|
update_cache: true
|
||||||
|
|
||||||
|
- name: Ensure slurmuser exists
|
||||||
|
ansible.builtin.user:
|
||||||
|
name: "{{ slurm_operator_user }}"
|
||||||
|
shell: "{{ slurm_operator_shell }}"
|
||||||
|
create_home: true
|
||||||
|
state: present
|
||||||
|
|
||||||
|
- name: Ensure .ssh directory exists for slurmuser
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh"
|
||||||
|
state: directory
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Generate SSH key for slurmuser if missing
|
||||||
|
ansible.builtin.openssh_keypair:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
|
||||||
|
type: ed25519
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0600"
|
||||||
|
comment: "{{ slurm_operator_user }}@{{ inventory_hostname }}"
|
||||||
|
force: false
|
||||||
|
|
||||||
|
- name: Read public key from each node
|
||||||
|
ansible.builtin.slurp:
|
||||||
|
src: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
|
||||||
|
register: slurmuser_pubkey_raw
|
||||||
|
|
||||||
|
- name: Store decoded public key as host fact
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
slurmuser_pubkey: "{{ slurmuser_pubkey_raw.content | b64decode | trim }}"
|
||||||
|
|
||||||
|
|
||||||
|
- name: Exchange slurmuser SSH keys across all Slurm nodes
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Install all slurmuser public keys into authorized_keys on every node
|
||||||
|
ansible.builtin.authorized_key:
|
||||||
|
user: "{{ slurm_operator_user }}"
|
||||||
|
key: "{{ hostvars[item].slurmuser_pubkey }}"
|
||||||
|
state: present
|
||||||
|
manage_dir: true
|
||||||
|
loop: "{{ groups['slurm_cluster'] }}"
|
||||||
|
|
||||||
|
- name: Build SSH known_hosts entries for all cluster nodes
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -e
|
||||||
|
mkdir -p /home/{{ slurm_operator_user }}/.ssh
|
||||||
|
touch /home/{{ slurm_operator_user }}/.ssh/known_hosts
|
||||||
|
|
||||||
|
{% for host in groups['slurm_cluster'] %}
|
||||||
|
ssh-keyscan -H {{ host }} {{ hostvars[host].ansible_host }} 2>/dev/null >> /home/{{ slurm_operator_user }}/.ssh/known_hosts || true
|
||||||
|
{% endfor %}
|
||||||
|
|
||||||
|
sort -u /home/{{ slurm_operator_user }}/.ssh/known_hosts -o /home/{{ slurm_operator_user }}/.ssh/known_hosts
|
||||||
|
chown {{ slurm_operator_user }}:{{ slurm_operator_user }} /home/{{ slurm_operator_user }}/.ssh/known_hosts
|
||||||
|
chmod 0644 /home/{{ slurm_operator_user }}/.ssh/known_hosts
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Ensure SSH permissions are correct
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh"
|
||||||
|
state: directory
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0700"
|
||||||
|
|
||||||
|
- name: Ensure private key permissions are correct
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0600"
|
||||||
|
|
||||||
|
- name: Ensure public key permissions are correct
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
|
||||||
|
owner: "{{ slurm_operator_user }}"
|
||||||
|
group: "{{ slurm_operator_user }}"
|
||||||
|
mode: "0644"
|
||||||
|
|
||||||
|
|
||||||
|
- name: Configure sudo permissions for slurmuser
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
slurm_operator_user: slurmuser
|
||||||
|
|
  tasks:
    - name: Configure sudoers for slurmuser on Slurm controller
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          # Operator access for Slurm controller node.
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
            /bin/systemctl status slurmctld, \
            /bin/systemctl restart slurmctld, \
            /bin/systemctl reload slurmctld, \
            /bin/systemctl stop slurmctld, \
            /bin/systemctl start slurmctld, \
            /bin/systemctl status slurmd, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl stop slurmd, \
            /bin/systemctl start slurmd, \
            /bin/journalctl -u slurmctld, \
            /bin/journalctl -u slurmd, \
            /usr/bin/scontrol, \
            /usr/bin/sinfo, \
            /usr/bin/squeue, \
            /usr/bin/scancel, \
            /usr/bin/sacct, \
            /usr/bin/sacctmgr, \
            /usr/bin/sbatch, \
            /usr/bin/srun, \
            /usr/bin/salloc
        validate: "visudo -cf %s"
      when: inventory_hostname in groups['slurm_controller']

    - name: Configure sudoers for slurmuser on Slurm compute and GPU nodes
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          # Operator access for Slurm worker/GPU nodes.
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
            /bin/systemctl status slurmd, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl stop slurmd, \
            /bin/systemctl start slurmd, \
            /bin/journalctl -u slurmd, \
            /usr/bin/scontrol, \
            /usr/bin/sinfo, \
            /usr/bin/squeue, \
            /usr/bin/scancel, \
            /usr/bin/sacct, \
            /usr/bin/sbatch, \
            /usr/bin/srun, \
            /usr/bin/salloc
        validate: "visudo -cf %s"
      when: inventory_hostname not in groups['slurm_controller']

- name: Validate slurmuser SSH mesh and Slurm access
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Test local Slurm commands as slurmuser
      ansible.builtin.command: "sudo -iu {{ slurm_operator_user }} sinfo"
      register: sinfo_test
      changed_when: false
      failed_when: sinfo_test.rc != 0

    - name: Show sinfo result
      ansible.builtin.debug:
        var: sinfo_test.stdout_lines

    - name: Test SSH from each node to every other node as slurmuser
      ansible.builtin.shell: |
        set -e
        {% for host in groups['slurm_cluster'] %}
        ssh -o BatchMode=yes -o ConnectTimeout=5 {{ host }} 'hostname'
        {% endfor %}
      args:
        executable: /bin/bash
      become_user: "{{ slurm_operator_user }}"
      register: ssh_mesh_test
      changed_when: false

    - name: Show SSH mesh test result
      ansible.builtin.debug:
        var: ssh_mesh_test.stdout_lines
@@ -0,0 +1,112 @@
---
- name: Fix sudo permissions for slurmuser Slurm operations
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Configure sudoers for slurmuser on controller
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible

          Cmnd_Alias SLURM_SYSTEMCTL_CONTROLLER = \
            /bin/systemctl status slurmctld, \
            /bin/systemctl status slurmctld *, \
            /bin/systemctl restart slurmctld, \
            /bin/systemctl reload slurmctld, \
            /bin/systemctl start slurmctld, \
            /bin/systemctl stop slurmctld, \
            /bin/systemctl status slurmd, \
            /bin/systemctl status slurmd *, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl start slurmd, \
            /bin/systemctl stop slurmd, \
            /usr/bin/systemctl status slurmctld, \
            /usr/bin/systemctl status slurmctld *, \
            /usr/bin/systemctl restart slurmctld, \
            /usr/bin/systemctl reload slurmctld, \
            /usr/bin/systemctl start slurmctld, \
            /usr/bin/systemctl stop slurmctld, \
            /usr/bin/systemctl status slurmd, \
            /usr/bin/systemctl status slurmd *, \
            /usr/bin/systemctl restart slurmd, \
            /usr/bin/systemctl reload slurmd, \
            /usr/bin/systemctl start slurmd, \
            /usr/bin/systemctl stop slurmd

          Cmnd_Alias SLURM_JOURNAL_CONTROLLER = \
            /bin/journalctl -u slurmctld, \
            /bin/journalctl -u slurmctld *, \
            /bin/journalctl -u slurmd, \
            /bin/journalctl -u slurmd *, \
            /usr/bin/journalctl -u slurmctld, \
            /usr/bin/journalctl -u slurmctld *, \
            /usr/bin/journalctl -u slurmd, \
            /usr/bin/journalctl -u slurmd *

          Cmnd_Alias SLURM_COMMANDS = \
            /usr/bin/scontrol, /usr/bin/scontrol *, \
            /usr/bin/sinfo, /usr/bin/sinfo *, \
            /usr/bin/squeue, /usr/bin/squeue *, \
            /usr/bin/scancel, /usr/bin/scancel *, \
            /usr/bin/sacct, /usr/bin/sacct *, \
            /usr/bin/sacctmgr, /usr/bin/sacctmgr *, \
            /usr/bin/sbatch, /usr/bin/sbatch *, \
            /usr/bin/srun, /usr/bin/srun *, \
            /usr/bin/salloc, /usr/bin/salloc *

          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_CONTROLLER, SLURM_JOURNAL_CONTROLLER, SLURM_COMMANDS
        validate: "visudo -cf %s"
      when: inventory_hostname in groups['slurm_controller']

    - name: Configure sudoers for slurmuser on compute and GPU nodes
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible

          Cmnd_Alias SLURM_SYSTEMCTL_COMPUTE = \
            /bin/systemctl status slurmd, \
            /bin/systemctl status slurmd *, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl start slurmd, \
            /bin/systemctl stop slurmd, \
            /usr/bin/systemctl status slurmd, \
            /usr/bin/systemctl status slurmd *, \
            /usr/bin/systemctl restart slurmd, \
            /usr/bin/systemctl reload slurmd, \
            /usr/bin/systemctl start slurmd, \
            /usr/bin/systemctl stop slurmd

          Cmnd_Alias SLURM_JOURNAL_COMPUTE = \
            /bin/journalctl -u slurmd, \
            /bin/journalctl -u slurmd *, \
            /usr/bin/journalctl -u slurmd, \
            /usr/bin/journalctl -u slurmd *

          Cmnd_Alias SLURM_COMMANDS = \
            /usr/bin/scontrol, /usr/bin/scontrol *, \
            /usr/bin/sinfo, /usr/bin/sinfo *, \
            /usr/bin/squeue, /usr/bin/squeue *, \
            /usr/bin/scancel, /usr/bin/scancel *, \
            /usr/bin/sacct, /usr/bin/sacct *, \
            /usr/bin/sbatch, /usr/bin/sbatch *, \
            /usr/bin/srun, /usr/bin/srun *, \
            /usr/bin/salloc, /usr/bin/salloc *

          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_COMPUTE, SLURM_JOURNAL_COMPUTE, SLURM_COMMANDS
        validate: "visudo -cf %s"
      when: inventory_hostname not in groups['slurm_controller']
@@ -0,0 +1,133 @@
---
- name: Read Munge key from Slurm controller
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Check controller munge.key exists
      ansible.builtin.stat:
        path: /etc/munge/munge.key
      register: controller_munge_key

    - name: Fail if controller munge.key is missing
      ansible.builtin.fail:
        msg: "/etc/munge/munge.key is missing on controller. Do not continue."
      when: not controller_munge_key.stat.exists

    - name: Read controller munge.key
      ansible.builtin.slurp:
        src: /etc/munge/munge.key
      register: controller_munge_key_raw

    - name: Store controller Munge key as fact
      ansible.builtin.set_fact:
        cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"

- name: Deploy controller Munge key to all Slurm nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    controller_host: "{{ groups['slurm_controller'][0] }}"

  tasks:
    - name: Ensure munge package is installed
      ansible.builtin.apt:
        name:
          - munge
          - libmunge2
        state: present
        update_cache: true

    - name: Ensure munge group exists
      ansible.builtin.group:
        name: munge
        system: true
        state: present

    - name: Ensure munge user exists
      ansible.builtin.user:
        name: munge
        group: munge
        system: true
        shell: /usr/sbin/nologin
        home: /nonexistent
        create_home: false
        state: present

    - name: Ensure /etc/munge exists
      ansible.builtin.file:
        path: /etc/munge
        state: directory
        owner: munge
        group: munge
        mode: "0700"

    # Caveat: munge.key is usually raw binary, and decoding it through Jinja
    # (b64decode) can alter non-UTF-8 bytes. Compare checksums with the
    # controller after deployment.
    - name: Deploy shared munge.key from controller
      ansible.builtin.copy:
        dest: /etc/munge/munge.key
        content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
        owner: munge
        group: munge
        mode: "0400"
      notify:
        - Restart munge

    - name: Ensure /var/log/munge exists
      ansible.builtin.file:
        path: /var/log/munge
        state: directory
        owner: munge
        group: munge
        mode: "0755"

    - name: Ensure /var/lib/munge exists
      ansible.builtin.file:
        path: /var/lib/munge
        state: directory
        owner: munge
        group: munge
        mode: "0711"

    - name: Ensure /run/munge exists
      ansible.builtin.file:
        path: /run/munge
        state: directory
        owner: munge
        group: munge
        mode: "0755"

    - name: Ensure munge is enabled and running
      ansible.builtin.systemd:
        name: munge
        enabled: true
        state: started

  handlers:
    - name: Restart munge
      ansible.builtin.systemd:
        name: munge
        state: restarted

- name: Validate Munge locally on all nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false

  tasks:
    - name: Test local munge encode/decode
      ansible.builtin.shell: |
        set -euo pipefail
        munge -n | unmunge
      args:
        executable: /bin/bash
      register: munge_local_test
      changed_when: false

    - name: Show local Munge validation
      ansible.builtin.debug:
        var: munge_local_test.stdout_lines
@@ -0,0 +1,132 @@
---
- name: Prepare Slurm config directories and logs
  hosts: slurm_cluster
  become: true
  gather_facts: false

  tasks:
    - name: Ensure Slurm config directory exists
      ansible.builtin.file:
        path: "{{ slurm_config_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0755"

    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"

    - name: Ensure slurmctld spool directory exists on controller
      ansible.builtin.file:
        path: /var/spool/slurmctld
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
      when: inventory_hostname in groups['slurm_controller']

    - name: Ensure slurmd spool directory exists on workers
      ansible.builtin.file:
        path: /var/spool/slurmd
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']

- name: Deploy Slurm config files
  hosts: slurm_cluster
  become: true
  gather_facts: false

  tasks:
    - name: Backup current slurm.conf before managed deployment
      ansible.builtin.copy:
        src: "{{ slurm_config_dir }}/slurm.conf"
        dest: "{{ slurm_config_dir }}/slurm.conf.pre-ansible-managed"
        remote_src: true
        owner: root
        group: root
        mode: "0644"
        force: false

    - name: Deploy managed slurm.conf
      ansible.builtin.template:
        src: ../../templates/slurm.conf.j2
        dest: "{{ slurm_config_dir }}/slurm.conf"
        owner: root
        group: root
        mode: "0644"
      notify:
        - Reconfigure slurmctld
        - Restart slurmd

    - name: Deploy managed cgroup.conf
      ansible.builtin.template:
        src: ../../templates/cgroup.conf.j2
        dest: "{{ slurm_config_dir }}/cgroup.conf"
        owner: root
        group: root
        mode: "0644"
      when: slurm_enable_cgroup | default(false) | bool
      notify:
        - Reconfigure slurmctld
        - Restart slurmd

    - name: Deploy managed gres.conf only on GPU nodes
      ansible.builtin.template:
        src: ../../templates/gres.conf.j2
        dest: "{{ slurm_config_dir }}/gres.conf"
        owner: root
        group: root
        mode: "0644"
      when: inventory_hostname in groups['slurm_gpu']
      notify:
        - Reconfigure slurmctld
        - Restart slurmd

  handlers:
    - name: Reconfigure slurmctld
      ansible.builtin.command:
        cmd: scontrol reconfigure
      when: inventory_hostname in groups['slurm_controller']
      changed_when: true

    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']

- name: Validate Slurm after config deployment
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Reconfigure controller
      ansible.builtin.command:
        cmd: scontrol reconfigure
      changed_when: true

    - name: Validate cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        scontrol show nodes
      args:
        executable: /bin/bash
      register: slurm_config_validation
      changed_when: false

    - name: Show validation output
      ansible.builtin.debug:
        var: slurm_config_validation.stdout_lines
@@ -0,0 +1,103 @@
---
- name: Restart Slurm controller safely
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Restart munge on controller
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmctld on controller
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
        enabled: true

    - name: Wait for slurmctld to answer
      ansible.builtin.command:
        cmd: scontrol ping
      register: scontrol_ping
      retries: 15
      delay: 2
      until: scontrol_ping.rc == 0
      changed_when: false

    - name: Show controller ping
      ansible.builtin.debug:
        var: scontrol_ping.stdout_lines

- name: Restart Slurm workers safely one by one
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1

  tasks:
    - name: Restart munge on worker
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmd on worker
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Wait for slurmd to be active
      ansible.builtin.command:
        cmd: systemctl is-active slurmd
      register: slurmd_active
      retries: 15
      delay: 2
      until: slurmd_active.stdout == "active"
      changed_when: false

    - name: Wait until this node is visible in Slurm
      ansible.builtin.command:
        cmd: scontrol show node {{ inventory_hostname }}
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: node_visible
      retries: 15
      delay: 2
      until: node_visible.rc == 0
      changed_when: false

- name: Validate Slurm after restart
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Validate Slurm cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### scontrol ping"
        scontrol ping

        echo
        echo "### sinfo"
        sinfo

        echo
        echo "### nodes"
        scontrol show nodes

        echo
        echo "### partitions"
        scontrol show partitions
      args:
        executable: /bin/bash
      register: slurm_validation
      changed_when: false

    - name: Show Slurm validation
      ansible.builtin.debug:
        var: slurm_validation.stdout_lines
@@ -0,0 +1,40 @@
---
- name: Discover node resources for Slurm config
  hosts: slurm_cluster
  become: true
  gather_facts: true

  tasks:
    - name: Discover CPU and memory
      # Inside a YAML literal block backslashes are literal, so the awk
      # patterns use single escapes and plain double quotes.
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST={{ inventory_hostname }}"
        echo "CPUS=$(nproc)"
        echo "REAL_MEMORY_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)"
        echo "SOCKETS=$(lscpu | awk -F: '/Socket\(s\)/ {gsub(/ /,"",$2); print $2}')"
        echo "CORES_PER_SOCKET=$(lscpu | awk -F: '/Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')"
        echo "THREADS_PER_CORE=$(lscpu | awk -F: '/Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')"
      args:
        executable: /bin/bash
      register: cpu_mem
      changed_when: false

    - name: Discover NVIDIA GPU if present
      ansible.builtin.shell: |
        set -euo pipefail
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: gpu_info
      changed_when: false

    - name: Show discovered resources
      ansible.builtin.debug:
        msg:
          - "{{ cpu_mem.stdout_lines }}"
          - "GPU:"
          - "{{ gpu_info.stdout_lines }}"
@@ -0,0 +1,89 @@
---
- name: Inspect current Slurm and Munge state
  hosts: slurm_cluster
  become: true
  gather_facts: true

  tasks:
    - name: Basic host info
      ansible.builtin.shell: |
        set -e
        echo "HOST=$(hostname -f 2>/dev/null || hostname)"
        echo "SHORT_HOST=$(hostname -s)"
        echo "IP_ADDRESSES=$(hostname -I)"
        echo "OS=$(lsb_release -ds 2>/dev/null || cat /etc/os-release | grep PRETTY_NAME || true)"
        echo "KERNEL=$(uname -r)"
      args:
        executable: /bin/bash
      register: host_info
      changed_when: false

    - name: Slurm package info
      ansible.builtin.shell: |
        dpkg -l | grep -Ei 'slurm|munge' || true
      args:
        executable: /bin/bash
      register: package_info
      changed_when: false

    - name: Slurm config paths
      ansible.builtin.shell: |
        set -e
        for p in /etc/slurm /etc/slurm-llnl /etc/munge; do
          echo "### $p"
          if [ -e "$p" ]; then
            find "$p" -maxdepth 2 -type f -printf "%m %u %g %p\n" | sort
          else
            echo "MISSING"
          fi
        done
      args:
        executable: /bin/bash
      register: config_paths
      changed_when: false

    - name: Service state
      ansible.builtin.shell: |
        for s in munge slurmctld slurmd; do
          echo "### $s"
          systemctl is-enabled "$s" 2>/dev/null || true
          systemctl is-active "$s" 2>/dev/null || true
        done
      args:
        executable: /bin/bash
      register: service_state
      changed_when: false

    - name: Slurm commands
      ansible.builtin.shell: |
        echo "### which"
        command -v sinfo || true
        command -v scontrol || true
        command -v sbatch || true
        command -v srun || true
        command -v munge || true
        command -v unmunge || true

        echo "### sinfo"
        sinfo 2>&1 || true

        echo "### scontrol ping"
        scontrol ping 2>&1 || true
      args:
        executable: /bin/bash
      register: slurm_commands
      changed_when: false

    - name: Show inspection report
      ansible.builtin.debug:
        msg:
          - "===== {{ inventory_hostname }} :: host_info ====="
          - "{{ host_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: packages ====="
          - "{{ package_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: config_paths ====="
          - "{{ config_paths.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: services ====="
          - "{{ service_state.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: slurm_commands ====="
          - "{{ slurm_commands.stdout_lines }}"
@@ -0,0 +1,216 @@
---
- name: Detect problematic Slurm nodes
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Detect nodes needing remediation
      ansible.builtin.shell: |
        set -euo pipefail

        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: bad_nodes_raw
      changed_when: false

    - name: Store bad node list
      ansible.builtin.set_fact:
        bad_nodes: "{{ bad_nodes_raw.stdout_lines }}"

    - name: Show detected problematic nodes
      ansible.builtin.debug:
        var: bad_nodes

- name: Attempt auto-remediation on problematic nodes
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1

  vars:
    bad_nodes_from_controller: "{{ hostvars[groups['slurm_controller'][0]].bad_nodes | default([]) }}"

  tasks:
    - name: Skip healthy nodes
      ansible.builtin.meta: end_host
      when: inventory_hostname not in bad_nodes_from_controller

    - name: Restart Munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Validate local services after remediation attempt
      ansible.builtin.shell: |
        set -euo pipefail

        echo "HOST=$(hostname)"

        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd

        echo
        echo "### munge"
        munge -n | unmunge >/dev/null
        echo "munge OK"

        echo
        echo "### controller ping"
        scontrol ping

        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true

        echo
        echo "### recent slurmd logs"
        journalctl -u slurmd -n 30 --no-pager || true
      args:
        executable: /bin/bash
      register: local_repair_check
      changed_when: false

    - name: Print local remediation result
      ansible.builtin.debug:
        var: local_repair_check.stdout_lines

- name: Refresh controller and validate remediated nodes
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Restart slurmctld to refresh node states
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted

    - name: Wait for controller
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false

    - name: Clear maintenance state on previously bad nodes
      ansible.builtin.shell: |
        set -euo pipefail

        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"

        if [ -z "$bad_nodes" ]; then
          echo "No bad nodes detected. Nothing to clear."
          sinfo -N
          exit 0
        fi

        for node in $bad_nodes; do
          echo "### clearing state on $node"
          scontrol update NodeName="$node" State=RESUME 2>/dev/null || true
          scontrol update NodeName="$node" State=UNDRAIN 2>/dev/null || true
          scontrol update NodeName="$node" State=IDLE 2>/dev/null || true
        done

        sleep 5
        sinfo -N
      args:
        executable: /bin/bash
      register: clear_result
      changed_when: true

    - name: Print clear-state result
      ansible.builtin.debug:
        var: clear_result.stdout_lines

    - name: Detect nodes still unhealthy after remediation
      ansible.builtin.shell: |
        set -euo pipefail

        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: still_bad_nodes_raw
      changed_when: false

    - name: Store still bad nodes
      ansible.builtin.set_fact:
        still_bad_nodes: "{{ still_bad_nodes_raw.stdout_lines }}"

    - name: Drain nodes that remain unhealthy
      ansible.builtin.shell: |
        set -euo pipefail

        unresolved_nodes="{{ still_bad_nodes | join(' ') }}"

        if [ -z "$unresolved_nodes" ]; then
          echo "No unresolved unhealthy nodes."
          sinfo -N
          exit 0
        fi

        for node in $unresolved_nodes; do
          echo "### draining unresolved node $node"
          scontrol update NodeName="$node" State=DRAIN Reason="auto-remediation failed"
        done

        sinfo -N
      args:
        executable: /bin/bash
      register: drain_unresolved
      changed_when: still_bad_nodes | length > 0

    - name: Show remediation summary
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### initial bad nodes"
        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
        if [ -z "$bad_nodes" ]; then
          echo "none"
        else
          printf '%s\n' $bad_nodes
        fi

        echo
        echo "### still bad nodes"
        still_bad_nodes="{{ (still_bad_nodes | default([])) | join(' ') }}"
        if [ -z "$still_bad_nodes" ]; then
          echo "none"
        else
          printf '%s\n' $still_bad_nodes
        fi

        echo
        echo "### final sinfo"
        sinfo -N

        echo
        echo "### queue"
        squeue
      args:
        executable: /bin/bash
      register: remediation_summary
      changed_when: false

    - name: Print remediation summary
      ansible.builtin.debug:
        var: remediation_summary.stdout_lines
@@ -0,0 +1,149 @@
---
- name: Check Slurm controller health
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Check controller services and cluster state
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### controller services"
        systemctl is-active munge
        systemctl is-active slurmctld
        systemctl is-active slurmdbd || true
        systemctl is-active mariadb || true

        echo
        echo "### slurm ping"
        scontrol ping

        echo
        echo "### nodes"
        sinfo -N

        echo
        echo "### partitions"
        sinfo

        echo
        echo "### queue"
        squeue

        echo
        echo "### problematic nodes"
        sinfo -N -h -o "%N %T %E" | awk '$2 !~ /idle|alloc|mix/ {print}' || true

        echo
        echo "### accounting"
        sacctmgr -n list cluster || true

        echo
        echo "### recent failed jobs"
        sacct -S today --state=FAILED,CANCELLED,TIMEOUT,NODE_FAIL,OUT_OF_MEMORY \
          --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,NodeList | tail -30 || true
      args:
        executable: /bin/bash
      register: controller_health
      changed_when: false

    - name: Print controller health
      ansible.builtin.debug:
        var: controller_health.stdout_lines

- name: Check Slurm worker health
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true

  tasks:
    - name: Check worker services, config and connectivity
      ansible.builtin.shell: |
        set -euo pipefail

        echo "HOST=$(hostname)"
        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
        echo "KERNEL=$(uname -r)"
        echo "UPTIME=$(uptime -p)"

        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd

        echo
        echo "### munge local test"
        munge -n | unmunge >/dev/null
        echo "munge OK"

        echo
        echo "### controller connectivity"
        getent hosts slurm-ctl01 || true
        scontrol ping

        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true

        echo
        echo "### config checksums"
        sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true

        echo
        echo "### shared filesystem"
        test -d /shared
        touch /shared/.slurm-health-$(hostname)
        ls -l /shared/.slurm-health-$(hostname)
        rm -f /shared/.slurm-health-$(hostname)

        echo
        echo "### cgroup"
        mount | grep cgroup || true

        echo
        echo "### gpu check"
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,driver_version,memory.total,temperature.gpu,utilization.gpu --format=csv,noheader || true
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: worker_health
      changed_when: false

    - name: Print worker health
      ansible.builtin.debug:
        var: worker_health.stdout_lines

- name: Check Slurm-reported node state consistency
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Build Slurm node health summary
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### node summary"
|
||||||
|
sinfo -N -o "%N %P %T %C %m %G %E"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### full problematic node details"
|
||||||
|
for node in $(sinfo -N -h -o "%N %T" | awk '$2 ~ /down|drain|fail|unk|not_responding|idle\*/ {print $1}' | sort -u); do
|
||||||
|
echo
|
||||||
|
echo "### $node"
|
||||||
|
scontrol show node "$node"
|
||||||
|
done
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: slurm_node_summary
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print Slurm node summary
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: slurm_node_summary.stdout_lines
|
||||||
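The node-state filter used in the health check above can be exercised without a live cluster. The following is a minimal sketch, assuming input in the `sinfo -N -h -o "%N %T"` shape (`NODE STATE` per line) and treating a trailing `*` on the state as "not responding". The function name `problem_nodes` and the sample node names are hypothetical and not part of the playbook.

```shell
# Hypothetical helper: reads "NODE STATE" pairs on stdin and prints the
# names of nodes whose state suggests they need attention.
problem_nodes() {
  # \*$ matches a literal trailing asterisk (e.g. "idle*"), so a plain
  # "idle" node is not reported.
  awk '$2 ~ /down|drain|fail|unk|not_responding|\*$/ { print $1 }' | sort -u
}

# Sample input stands in for real sinfo output (assumed lab node names).
printf '%s\n' \
  'cpu01 idle' \
  'cpu02 idle*' \
  'gpu01 drained' \
  'cpu03 mixed' | problem_nodes
# prints:
# cpu02
# gpu01
```

On a controller the same function could be fed directly, for example `sinfo -N -h -o "%N %T" | problem_nodes`.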
@@ -0,0 +1,217 @@
|
|||||||
|
---
|
||||||
|
- name: Validate target node
|
||||||
|
hosts: localhost
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Require target_node
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Use: ansible-playbook repair-slurm-node.yml -e target_node=<hostname>"
|
||||||
|
when: target_node is not defined
|
||||||
|
|
||||||
|
- name: Ensure target_node is in inventory
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "target_node={{ target_node }} is not in Ansible inventory"
|
||||||
|
when: target_node not in groups['all']
|
||||||
|
|
||||||
|
|
||||||
|
- name: Capture node state before repair
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Show target node state before repair
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### sinfo"
|
||||||
|
sinfo -N -n {{ target_node }} || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### scontrol"
|
||||||
|
scontrol show node {{ target_node }} || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### jobs"
|
||||||
|
squeue -w {{ target_node }} || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: node_state_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print target node state before repair
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: node_state_before.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Repair local services on target node
|
||||||
|
hosts: "{{ target_node }}"
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Restart Munge
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Restart slurmd
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
|
when:
|
||||||
|
- inventory_hostname in groups.get('slurm_compute', []) or inventory_hostname in groups.get('slurm_gpu', [])
|
||||||
|
|
||||||
|
- name: Validate local repair
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### services"
|
||||||
|
systemctl is-active munge
|
||||||
|
systemctl is-active slurmd
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### munge"
|
||||||
|
munge -n | unmunge >/dev/null
|
||||||
|
echo "munge OK"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### controller ping"
|
||||||
|
scontrol ping
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### slurmd listener"
|
||||||
|
ss -lntp | grep ':6818 ' || true
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### recent slurmd logs"
|
||||||
|
journalctl -u slurmd -n 40 --no-pager || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: local_repair_state
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print local repair state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: local_repair_state.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Clear Slurm maintenance/down state after repair
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Restart controller to refresh node state
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmctld
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: Wait for controller
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol ping
|
||||||
|
register: slurmctld_ping
|
||||||
|
retries: 15
|
||||||
|
delay: 2
|
||||||
|
until: slurmctld_ping.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Clear target node state
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
scontrol update NodeName={{ target_node }} State=RESUME 2>/dev/null || true
|
||||||
|
scontrol update NodeName={{ target_node }} State=UNDRAIN 2>/dev/null || true
|
||||||
|
scontrol update NodeName={{ target_node }} State=IDLE 2>/dev/null || true
|
||||||
|
|
||||||
|
sleep 5
|
||||||
|
|
||||||
|
sinfo -N -n {{ target_node }}
|
||||||
|
scontrol show node {{ target_node }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: clear_state
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Wait until node is healthy
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sinfo -N -n {{ target_node }}
|
||||||
|
scontrol show node {{ target_node }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: node_health_after
|
||||||
|
retries: 30
|
||||||
|
delay: 5
|
||||||
|
until:
|
||||||
|
- node_health_after.rc == 0
|
||||||
|
- "'not_responding' not in node_health_after.stdout.lower()"
|
||||||
|
- "'down' not in node_health_after.stdout.lower()"
|
||||||
|
- "'drain' not in node_health_after.stdout.lower()"
|
||||||
|
- "'idle*' not in node_health_after.stdout.lower()"
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print node state after repair
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: node_health_after.stdout_lines
|
||||||
|
|
||||||
|
|
||||||
|
- name: Submit repair validation job
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Submit validation job to repaired node
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
job_id="$(
|
||||||
|
sudo -iu slurmuser sbatch --parsable <<SBATCH
|
||||||
|
#!/bin/bash
|
||||||
|
#SBATCH --job-name=repair-node-test
|
||||||
|
#SBATCH --partition=all
|
||||||
|
#SBATCH --nodelist={{ target_node }}
|
||||||
|
#SBATCH --cpus-per-task=1
|
||||||
|
#SBATCH --mem=256M
|
||||||
|
#SBATCH --time=00:02:00
|
||||||
|
#SBATCH --account=lab
|
||||||
|
#SBATCH --qos=normal
|
||||||
|
#SBATCH --output=/shared/repair-node-test-%j.out
|
||||||
|
|
||||||
|
echo "HOST=\$(hostname)"
|
||||||
|
echo "USER=\$(whoami)"
|
||||||
|
echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
|
||||||
|
echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
|
||||||
|
echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
|
||||||
|
date
|
||||||
|
SBATCH
|
||||||
|
)"
|
||||||
|
|
||||||
|
echo "JOB_ID=$job_id"
|
||||||
|
|
||||||
|
for i in $(seq 1 90); do
|
||||||
|
if squeue -h -j "$job_id" | grep -q .; then
|
||||||
|
squeue -j "$job_id"
|
||||||
|
sleep 1
|
||||||
|
else
|
||||||
|
break
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
|
||||||
|
echo "### sacct"
|
||||||
|
sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList
|
||||||
|
|
||||||
|
echo "### output"
|
||||||
|
cat "/shared/repair-node-test-${job_id}.out"
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: repair_validation_job
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Print repair validation job
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: repair_validation_job.stdout_lines
|
||||||
@@ -0,0 +1,126 @@
---
- name: Validate target_node variable
  hosts: localhost
  gather_facts: false

  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook decommission-slurm-node.yml -e target_node=<hostname> [-e decom_reason='reason']"
      when: target_node is not defined

    - name: Ensure target_node is in inventory
      ansible.builtin.fail:
        msg: "target_node={{ target_node }} is not in Ansible inventory"
      when: target_node not in groups['all']


- name: Drain target node and wait for jobs to leave
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    decom_reason_effective: "{{ decom_reason | default('decommission by Ansible') }}"
    decom_wait_retries_effective: "{{ decom_wait_retries | default(120) }}"
    decom_wait_delay_effective: "{{ decom_wait_delay | default(10) }}"

  tasks:
    - name: Show current target node state
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ target_node }} || true
        scontrol show node {{ target_node }} || true
      args:
        executable: /bin/bash
      register: node_state_before
      changed_when: false

    - name: Print current target node state
      ansible.builtin.debug:
        var: node_state_before.stdout_lines

    - name: Drain target node
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ target_node }} State=DRAIN Reason="{{ decom_reason_effective }}"
      changed_when: true

    - name: Wait until no jobs are running on target node
      ansible.builtin.shell: |
        set -euo pipefail
        squeue -h -w {{ target_node }} || true
      args:
        executable: /bin/bash
      register: jobs_on_node
      retries: "{{ decom_wait_retries_effective | int }}"
      delay: "{{ decom_wait_delay_effective | int }}"
      until: jobs_on_node.stdout | trim == ""
      changed_when: false

    - name: Show drained node state
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ target_node }} || true
        scontrol show node {{ target_node }} | grep -E "NodeName=|State=|Reason=" || true
      args:
        executable: /bin/bash
      register: node_state_drained
      changed_when: false

    - name: Print drained node state
      ansible.builtin.debug:
        var: node_state_drained.stdout_lines


- name: Stop Slurm worker service on target node
  hosts: "{{ target_node }}"
  become: true
  gather_facts: false

  tasks:
    - name: Stop slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: stopped
        enabled: false
      when:
        - inventory_hostname in groups.get('slurm_compute', []) or inventory_hostname in groups.get('slurm_gpu', [])

    - name: Show slurmd state
      ansible.builtin.shell: |
        systemctl is-enabled slurmd 2>/dev/null || true
        systemctl is-active slurmd 2>/dev/null || true
      args:
        executable: /bin/bash
      register: slurmd_state_after
      changed_when: false

    - name: Print slurmd state
      ansible.builtin.debug:
        var: slurmd_state_after.stdout_lines


- name: Mark node down in Slurm controller
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Mark target node DOWN after service stop
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ target_node }} State=DOWN Reason="decommissioned"
      changed_when: true

    - name: Show final node state
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ target_node }} || true
        scontrol show node {{ target_node }} | grep -E "NodeName=|State=|Reason=" || true
      args:
        executable: /bin/bash
      register: final_node_state
      changed_when: false

    - name: Print final node state
      ansible.builtin.debug:
        var: final_node_state.stdout_lines
@@ -0,0 +1,246 @@
|
|||||||
|
---
|
||||||
|
- name: Validate target_node variable
|
||||||
|
hosts: localhost
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Require target_node
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "Use: ansible-playbook provision-slurm-node.yml -e target_node=<hostname>"
|
||||||
|
when: target_node is not defined
|
||||||
|
|
||||||
|
- name: Ensure target_node is in inventory
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "target_node={{ target_node }} is not in Ansible inventory"
|
||||||
|
when: target_node not in groups['all']
|
||||||
|
|
||||||
|
|
||||||
|
- name: Prepare OS, packages and Slurm directories on target node
|
||||||
|
hosts: "{{ target_node }}"
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Ensure target is a Slurm worker or GPU node
|
||||||
|
ansible.builtin.fail:
|
||||||
|
msg: "{{ inventory_hostname }} must be in slurm_compute or slurm_gpu group"
|
||||||
|
when:
|
||||||
|
- inventory_hostname not in groups.get('slurm_compute', [])
|
||||||
|
- inventory_hostname not in groups.get('slurm_gpu', [])
|
||||||
|
|
||||||
|
- name: Install Slurm worker packages
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name:
|
||||||
|
- munge
|
||||||
|
- libmunge2
|
||||||
|
- slurm-client
|
||||||
|
- slurmd
|
||||||
|
- slurm-wlm-basic-plugins
|
||||||
|
- slurm-wlm-plugins
|
||||||
|
- slurm-wlm-mysql-plugin
|
||||||
|
state: present
|
||||||
|
update_cache: true
|
||||||
|
|
||||||
|
- name: Ensure Slurm config directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ slurm_config_dir }}"
|
||||||
|
state: directory
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure Slurm log directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/log/slurm
|
||||||
|
state: directory
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure slurmd spool directory exists
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: /var/spool/slurmd
|
||||||
|
state: directory
|
||||||
|
owner: slurm
|
||||||
|
group: slurm
|
||||||
|
mode: "0755"
|
||||||
|
|
||||||
|
- name: Ensure munge dirs exist
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ item.path }}"
|
||||||
|
state: directory
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "{{ item.mode }}"
|
||||||
|
loop:
|
||||||
|
- { path: /etc/munge, mode: "0700" }
|
||||||
|
- { path: /var/log/munge, mode: "0755" }
|
||||||
|
- { path: /var/lib/munge, mode: "0711" }
|
||||||
|
- { path: /run/munge, mode: "0755" }
|
||||||
|
|
||||||
|
|
||||||
|
- name: Deploy Munge key from controller to target node
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Read controller munge.key
|
||||||
|
ansible.builtin.slurp:
|
||||||
|
src: /etc/munge/munge.key
|
||||||
|
register: controller_munge_key_raw
|
||||||
|
|
||||||
|
- name: Store controller Munge key as fact
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
|
||||||
|
|
||||||
|
|
||||||
|
- name: Configure target node with Munge and Slurm files
|
||||||
|
hosts: "{{ target_node }}"
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
vars:
|
||||||
|
controller_host: "{{ groups['slurm_controller'][0] }}"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Deploy shared munge.key
|
||||||
|
ansible.builtin.copy:
|
||||||
|
dest: /etc/munge/munge.key
|
||||||
|
content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
|
||||||
|
owner: munge
|
||||||
|
group: munge
|
||||||
|
mode: "0400"
|
||||||
|
notify:
|
||||||
|
- Restart munge
|
||||||
|
|
||||||
|
- name: Deploy managed slurm.conf
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/slurm.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/slurm.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
notify:
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
- name: Deploy managed cgroup.conf
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/cgroup.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/cgroup.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
when: slurm_enable_cgroup | default(false) | bool
|
||||||
|
notify:
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
- name: Deploy managed gres.conf on GPU nodes
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/gres.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/gres.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
when: inventory_hostname in groups.get('slurm_gpu', [])
|
||||||
|
notify:
|
||||||
|
- Restart slurmd
|
||||||
|
|
||||||
|
- name: Ensure munge is enabled and running
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
enabled: true
|
||||||
|
state: started
|
||||||
|
|
||||||
|
- name: Ensure slurmd is enabled and running
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
enabled: true
|
||||||
|
state: started
|
||||||
|
|
||||||
|
handlers:
|
||||||
|
- name: Restart munge
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: munge
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: Restart slurmd
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmd
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
|
||||||
|
- name: Deploy updated Slurm config to whole cluster and reconfigure controller
|
||||||
|
hosts: slurm_cluster
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Deploy managed slurm.conf to all nodes
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/slurm.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/slurm.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
|
||||||
|
- name: Deploy managed cgroup.conf to all nodes
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: ../../templates/cgroup.conf.j2
|
||||||
|
dest: "{{ slurm_config_dir }}/cgroup.conf"
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: "0644"
|
||||||
|
when: slurm_enable_cgroup | default(false) | bool
|
||||||
|
|
||||||
|
|
||||||
|
- name: Reconfigure Slurm and validate target node
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Reconfigure Slurm controller
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol reconfigure
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Restart Slurm controller after node reprovision
|
||||||
|
ansible.builtin.systemd:
|
||||||
|
name: slurmctld
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: Wait for Slurm controller after restart
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol ping
|
||||||
|
register: slurmctld_ping_after_restart
|
||||||
|
retries: 15
|
||||||
|
delay: 2
|
||||||
|
until: slurmctld_ping_after_restart.rc == 0
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Resume target node in Slurm
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: scontrol update NodeName={{ target_node }} State=RESUME
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Wait until target node is visible and not down
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
scontrol show node {{ target_node }}
|
||||||
|
sinfo -N -n {{ target_node }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: target_node_state
|
||||||
|
retries: 20
|
||||||
|
delay: 3
|
||||||
|
until:
|
||||||
|
- target_node_state.rc == 0
|
||||||
|
- "'down' not in target_node_state.stdout.lower()"
|
||||||
|
- "'not_responding' not in target_node_state.stdout.lower()"
|
||||||
|
- "'idle*' not in target_node_state.stdout.lower()"
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show target node state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: target_node_state.stdout_lines
|
||||||
@@ -0,0 +1,33 @@
---
- name: Show Slurm node state
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook show-slurm-node.yml -e target_node=<hostname>"
      when: target_node is not defined

    - name: Show node state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### sinfo"
        sinfo -N -n {{ target_node }} || true

        echo
        echo "### scontrol"
        scontrol show node {{ target_node }} || true

        echo
        echo "### jobs on node"
        squeue -w {{ target_node }} || true
      args:
        executable: /bin/bash
      register: node_lifecycle_state
      changed_when: false

    - name: Print node lifecycle state
      ansible.builtin.debug:
        var: node_lifecycle_state.stdout_lines
@@ -0,0 +1,169 @@
|
|||||||
|
---
|
||||||
|
- name: Configure Slurm QOS, limits and fairshare
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Ensure sacctmgr is available
|
||||||
|
ansible.builtin.command:
|
||||||
|
cmd: sacctmgr -n list cluster
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate accounting GPU TRES exists
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### configured AccountingStorageTRES"
|
||||||
|
scontrol show config | grep -E "AccountingStorageTRES|AccountingStorageType|AccountingStorageEnforce"
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### known TRES"
|
||||||
|
sacctmgr show tres
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### checking gres/gpu"
|
||||||
|
sacctmgr -n show tres format=Type,Name | awk '$1=="gres" && $2=="gpu" {found=1} END {exit !found}'
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: gpu_tres_check
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Ensure normal QOS exists
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i add qos normal Priority=100
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: add_qos_normal
|
||||||
|
changed_when: "'Adding QOS' in (add_qos_normal.stdout + add_qos_normal.stderr)"
|
||||||
|
failed_when: >
|
||||||
|
add_qos_normal.rc != 0 and
|
||||||
|
'Nothing new added' not in (add_qos_normal.stdout + add_qos_normal.stderr) and
|
||||||
|
'already exists' not in (add_qos_normal.stdout + add_qos_normal.stderr) and
|
||||||
|
'Already existing' not in (add_qos_normal.stdout + add_qos_normal.stderr)
|
||||||
|
|
||||||
|
- name: Ensure debug-short QOS exists
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i add qos debug-short Priority=500
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: add_qos_debug
|
||||||
|
changed_when: "'Adding QOS' in (add_qos_debug.stdout + add_qos_debug.stderr)"
|
||||||
|
failed_when: >
|
||||||
|
add_qos_debug.rc != 0 and
|
||||||
|
'Nothing new added' not in (add_qos_debug.stdout + add_qos_debug.stderr) and
|
||||||
|
'already exists' not in (add_qos_debug.stdout + add_qos_debug.stderr) and
|
||||||
|
'Already existing' not in (add_qos_debug.stdout + add_qos_debug.stderr)
|
||||||
|
|
||||||
|
- name: Ensure gpu-short QOS exists
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i add qos gpu-short Priority=1000
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: add_qos_gpu
|
||||||
|
changed_when: "'Adding QOS' in (add_qos_gpu.stdout + add_qos_gpu.stderr)"
|
||||||
|
failed_when: >
|
||||||
|
add_qos_gpu.rc != 0 and
|
||||||
|
'Nothing new added' not in (add_qos_gpu.stdout + add_qos_gpu.stderr) and
|
||||||
|
'already exists' not in (add_qos_gpu.stdout + add_qos_gpu.stderr) and
|
||||||
|
'Already existing' not in (add_qos_gpu.stdout + add_qos_gpu.stderr)
|
||||||
|
|
||||||
|
- name: Ensure maintenance QOS exists
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i add qos maintenance Priority=5000
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: add_qos_maintenance
|
||||||
|
changed_when: "'Adding QOS' in (add_qos_maintenance.stdout + add_qos_maintenance.stderr)"
|
||||||
|
failed_when: >
|
||||||
|
add_qos_maintenance.rc != 0 and
|
||||||
|
'Nothing new added' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr) and
|
||||||
|
'already exists' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr) and
|
||||||
|
'Already existing' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr)
|
||||||
|
|
||||||
|
- name: Normalize normal QOS settings
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i modify qos normal set Priority=100
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Normalize debug-short QOS settings
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i modify qos debug-short set Priority=500 MaxWall=00:10:00 MaxTRESPU=cpu=2 MaxJobsPU=4
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Normalize gpu-short QOS settings
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i modify qos gpu-short set Priority=1000 MaxWall=01:00:00 MaxTRESPU=gres/gpu=1,cpu=12 MaxJobsPU=2
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Normalize maintenance QOS settings
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i modify qos maintenance set Priority=5000 MaxWall=02:00:00
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Assign QOS set to lab account
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i modify account {{ slurm_account_name }} set QOS=normal,debug-short,gpu-short,maintenance DefaultQOS=normal Fairshare=100
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Assign default account to slurmuser
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Assign QOS set to slurmuser association
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
sacctmgr -i modify user where name=slurmuser account={{ slurm_account_name }} set QOS=normal,debug-short,gpu-short,maintenance DefaultQOS=normal Fairshare=100
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
changed_when: true
|
||||||
|
|
||||||
|
- name: Show configured QOS and associations
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
echo "### TRES"
|
||||||
|
sacctmgr show tres
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### QOS"
|
||||||
|
sacctmgr show qos format=Name%20,Priority,MaxWall,MaxTRESPU%40,MaxJobsPU
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### Associations"
|
||||||
|
sacctmgr show assoc format=Cluster,Account,User,Share,QOS%60,DefaultQOS,Fairshare
|
||||||
|
|
||||||
|
echo
|
||||||
|
echo "### Fairshare"
|
||||||
|
sshare -A {{ slurm_account_name }} || true
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
register: qos_state
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Print QOS state
|
||||||
|
ansible.builtin.debug:
|
||||||
|
var: qos_state.stdout_lines
|
||||||
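The "Normalize ... QOS settings" tasks above call `sacctmgr modify` on every run and always report a change. One possible way to make such a step conditional is to compare the current value first. The sketch below assumes the input is pipe-delimited `name|priority` lines, the shape `sacctmgr -nP show qos format=Name,Priority` is expected to produce. The function name `qos_priority_matches` is hypothetical, and the sample data stands in for real accounting output.

```shell
# Hypothetical helper: exit 0 when the QOS named $1 is present on stdin
# with priority $2, exit 1 otherwise (including when the QOS is missing).
qos_priority_matches() {
  awk -F'|' -v name="$1" -v want="$2" '
    $1 == name { found = 1; ok = ($2 == want) }
    END { exit !(found && ok) }
  '
}

# Sample data in place of live sacctmgr output (assumed format).
sample='normal|100
debug-short|500'

if printf '%s\n' "$sample" | qos_priority_matches normal 100; then
  echo "normal: priority already set"
else
  echo "normal: would run sacctmgr modify"
fi
# prints: normal: priority already set
```

In a playbook, a check like this could back a `changed_when` expression, or gate the `modify` call so that repeated runs report no change when nothing differs.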
@@ -0,0 +1,235 @@
|
|||||||
|
---
|
||||||
|
- name: Validate Slurm QOS, fairshare and priority
|
||||||
|
hosts: slurm_controller
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
  tasks:
    - name: Validate priority runtime config
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### priority config"
        scontrol show config | grep -E "PriorityType|PriorityWeight|PriorityDecay|PriorityCalc|PriorityMaxAge|PriorityFavor"

        echo
        echo "### accounting enforcement"
        scontrol show config | grep -E "AccountingStorageType|AccountingStorageEnforce|AccountingStorageTRES"

        echo
        echo "### QOS"
        sacctmgr show qos format=Name%20,Priority,MaxWall,MaxTRESPU%50,MaxJobsPU

        echo
        echo "### associations"
        sacctmgr show assoc format=Cluster,Account,User,Share,QOS%80,DefaultQOS,Fairshare

        echo
        echo "### fairshare"
        sshare -A {{ slurm_account_name }} || true
      args:
        executable: /bin/bash
      register: priority_state
      changed_when: false

    - name: Submit debug-short QOS job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=qos-debug-test
        #SBATCH --partition=debug
        #SBATCH --qos=debug-short
        #SBATCH --account=lab
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/qos-debug-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "QOS=${SLURM_JOB_QOS:-}"
        echo "ACCOUNT=${SLURM_JOB_ACCOUNT:-}"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/qos-debug-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: debug_qos_job
      changed_when: true

    - name: Submit gpu-short QOS job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=qos-gpu-test
        #SBATCH --partition=gpu
        #SBATCH --qos=gpu-short
        #SBATCH --account=lab
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/qos-gpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "QOS=${SLURM_JOB_QOS:-}"
        echo "ACCOUNT=${SLURM_JOB_ACCOUNT:-}"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo
        nvidia-smi
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/qos-gpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_qos_job
      changed_when: true

    - name: Validate debug-short walltime limit behavior
      ansible.builtin.shell: |
        set -euo pipefail

        set +e
        output="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH' 2>&1
        #!/bin/bash
        #SBATCH --job-name=qos-limit-fail
        #SBATCH --partition=debug
        #SBATCH --qos=debug-short
        #SBATCH --account=lab
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:30:00
        #SBATCH --output=/shared/qos-limit-fail-%j.out

        sleep 10
        SBATCH
        )"
        rc=$?
        set -e

        echo "RC=$rc"
        echo "$output"

        if [ "$rc" -ne 0 ]; then
          echo "Limit rejection test passed at submit time"
          exit 0
        fi

        job_id="$output"
        echo "Submitted job despite expected limit check: $job_id"

        sleep 3

        echo "### squeue"
        squeue -j "$job_id" -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R" || true

        echo
        echo "### job detail"
        scontrol show job "$job_id" || true

        state="$(squeue -h -j "$job_id" -o "%T" || true)"
        reason="$(squeue -h -j "$job_id" -o "%R" || true)"

        echo "STATE=$state"
        echo "REASON=$reason"

        if echo "$state" | grep -qE "PENDING|CONFIGURING"; then
          if echo "$reason" | grep -qiE "qos|limit|time|max|assoc"; then
            echo "Limit enforcement test passed via pending reason"
            scancel "$job_id" || true
            exit 0
          fi
        fi

        echo "Job was accepted without an obvious QOS/limit pending reason"
        scancel "$job_id" || true
        exit 1
      args:
        executable: /bin/bash
      register: limit_rejection
      changed_when: false

    - name: Show priority and fairshare snapshot
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### queue"
        squeue || true

        echo
        echo "### sprio"
        sprio || true

        echo
        echo "### sshare"
        sshare -A {{ slurm_account_name }} || true

        echo
        echo "### recent sacct"
        sacct -S today --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList | tail -40
      args:
        executable: /bin/bash
      register: priority_snapshot
      changed_when: false

    - name: Print validation result
      ansible.builtin.debug:
        msg:
          - "### priority state"
          - "{{ priority_state.stdout_lines }}"
          - "### debug QOS job"
          - "{{ debug_qos_job.stdout_lines }}"
          - "### GPU QOS job"
          - "{{ gpu_qos_job.stdout_lines }}"
          - "### limit rejection"
          - "{{ limit_rejection.stdout_lines }}"
          - "### priority snapshot"
          - "{{ priority_snapshot.stdout_lines }}"
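The walltime limit task above submits a `--time=00:30:00` request under the `debug-short` QOS and expects Slurm to reject or hold it. The comparison behind that expectation can be sketched as follows. This is a minimal illustration, not part of the playbook: the helper names `walltime_to_seconds` and `exceeds_limit` are hypothetical, only the `[D-]HH:MM:SS` form is handled, and the `00:10:00` limit in the usage note is an assumed value because the diff does not show the configured `MaxWall`.

```shell
# Convert a [D-]HH:MM:SS walltime string to seconds.
# Sketch only: Slurm accepts other time forms that are not handled here.
# The function body runs in a subshell, so IFS and variables do not leak.
walltime_to_seconds() (
  spec=$1
  days=0
  case $spec in
    *-*) days=${spec%%-*}; spec=${spec#*-} ;;
  esac
  IFS=:
  set -- $spec
  # Strip one leading zero so values like "08" are not read as octal.
  h=${1#0}; m=${2#0}; s=${3#0}; days=${days#0}
  echo $(( ${days:-0} * 86400 + ${h:-0} * 3600 + ${m:-0} * 60 + ${s:-0} ))
)

# Succeeds when the requested walltime ($1) is longer than the limit ($2).
exceeds_limit() {
  [ "$(walltime_to_seconds "$1")" -gt "$(walltime_to_seconds "$2")" ]
}
```

For example, `exceeds_limit 00:30:00 00:10:00` succeeds under the assumed limit, which is the situation the task expects Slurm to flag either at submit time or through a pending reason.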
@@ -0,0 +1,59 @@
---
- name: Test CPU cgroup enforcement on gpu01
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit cgroup CPU test to gpu01
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=cgroup-cpu-test
        #SBATCH --partition=all
        #SBATCH --nodelist=gpu01
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/cgroup-cpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "MEM_ALLOWED=$(grep Mems_allowed_list /proc/self/status || true)"
        echo
        echo "### cgroup"
        cat /proc/self/cgroup
        echo
        echo "### mounted cgroups"
        mount | grep cgroup || true
        sleep 5
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### output"
        cat "/shared/cgroup-cpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: cgroup_cpu_result
      changed_when: true

    - name: Show cgroup CPU result
      ansible.builtin.debug:
        var: cgroup_cpu_result.stdout_lines
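The cgroup test above prints `Cpus_allowed_list` so a reader can compare it by eye against `--cpus-per-task=2`. A minimal sketch of turning that comparison into a number is shown below. The helper name `count_cpus` is hypothetical, and whether the list is actually restricted depends on the cluster's task and cgroup plugin configuration, which this diff does not show. On nodes with hardware threads, Slurm may also allocate whole cores, so the count can be larger than the request.

```shell
# Count the CPUs in a kernel CPU list such as "0-1" or "0,2,4-7".
# The function body runs in a subshell, so the IFS change stays local.
count_cpus() (
  total=0
  IFS=,
  for part in $1; do
    case $part in
      *-*) total=$(( total + ${part#*-} - ${part%-*} + 1 )) ;;
      *)   total=$(( total + 1 )) ;;
    esac
  done
  echo "$total"
)
```

One possible use is `count_cpus "$(awk '/^CPUS_ALLOWED=/ {print $NF}' "/shared/cgroup-cpu-test-${job_id}.out")"`, then comparing the result with the requested CPU count.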
@@ -0,0 +1,60 @@
---
- name: Submit CPU test job
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit test job to debug partition
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=cpu-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=512M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/cpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sudo -iu slurmuser sacct -j "$job_id" --format=JobID,JobName,Partition,State,ExitCode 2>/dev/null || true

        echo "### output"
        if [ -f "/shared/cpu-test-${job_id}.out" ]; then
          cat "/shared/cpu-test-${job_id}.out"
        else
          echo "Output file not found: /shared/cpu-test-${job_id}.out"
          find /shared -maxdepth 1 -name "cpu-test-*.out" -ls | tail -5 || true
          exit 1
        fi
      args:
        executable: /bin/bash
      register: cpu_job_result
      changed_when: true

    - name: Show CPU job result
      ansible.builtin.debug:
        var: cpu_job_result.stdout_lines
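Each test playbook polls `squeue` in a bounded loop and then carries on whether or not the job has left the queue. A small helper that reports a timeout explicitly might look like the sketch below. The name `wait_for_job` and the `POLL_INTERVAL` variable are hypothetical additions for illustration; the helper relies only on `squeue -h -j`, which the playbooks already use.

```shell
# Poll until the job leaves the queue.
# Returns 0 once squeue prints nothing for the job, 1 if the job is still
# queued after max_tries polls. POLL_INTERVAL (seconds) is a hypothetical knob.
wait_for_job() {
  job_id=$1
  max_tries=${2:-60}
  tries=0
  while [ "$tries" -lt "$max_tries" ]; do
    if [ -z "$(squeue -h -j "$job_id" 2>/dev/null)" ]; then
      return 0
    fi
    tries=$(( tries + 1 ))
    sleep "${POLL_INTERVAL:-1}"
  done
  return 1
}
```

A caller could then write `wait_for_job "$job_id" 60 || { echo "job still queued"; exit 1; }` before reading the output file, so a stuck job fails the task with a clear message rather than a missing-file error.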
@@ -0,0 +1,58 @@
---
- name: Test GPU access without GRES allocation
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit job to gpu01 without requesting GPU
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=gpu-deny-test
        #SBATCH --partition=all
        #SBATCH --nodelist=gpu01
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=1G
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/gpu-deny-test-%j.out

        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo
        echo "### ls nvidia devices"
        ls -l /dev/nvidia* 2>&1 || true
        echo
        echo "### nvidia-smi without GRES"
        nvidia-smi 2>&1 || true
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### output"
        cat "/shared/gpu-deny-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_deny_result
      changed_when: true

    - name: Show GPU deny test result
      ansible.builtin.debug:
        var: gpu_deny_result.stdout_lines
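The deny test above only prints evidence and leaves the judgement to the reader. One narrow part of that judgement can be sketched as a check that neither GPU variable was populated for a job submitted without `--gres`. The helper name `gpu_env_is_empty` is hypothetical. Empty variables alone do not prove that device access is blocked: that depends on device confinement in the cluster's cgroup configuration (for example `ConstrainDevices` in `cgroup.conf`), which this diff does not show, so the `ls` and `nvidia-smi` sections of the output still need to be read.

```shell
# Succeeds when the job output contains both GPU variables with empty values,
# i.e. the exact lines "SLURM_JOB_GPUS=" and "CUDA_VISIBLE_DEVICES=".
gpu_env_is_empty() {
  out_file=$1
  grep -qx 'SLURM_JOB_GPUS=' "$out_file" &&
    grep -qx 'CUDA_VISIBLE_DEVICES=' "$out_file"
}
```

A possible call is `gpu_env_is_empty "/shared/gpu-deny-test-${job_id}.out" || echo "GPU variables were set without a GRES request"`.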
@@ -0,0 +1,70 @@
---
- name: Submit GPU test job
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit test job to gpu partition
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=gpu-test
        #SBATCH --partition=gpu
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=2G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/gpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo

        echo "### nvidia-smi"
        nvidia-smi

        echo
        echo "### GPU process table"
        nvidia-smi pmon -c 1 || true
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sudo -iu slurmuser sacct -j "$job_id" --format=JobID,JobName,Partition,State,ExitCode 2>/dev/null || true

        echo "### output"
        if [ -f "/shared/gpu-test-${job_id}.out" ]; then
          cat "/shared/gpu-test-${job_id}.out"
        else
          echo "Output file not found: /shared/gpu-test-${job_id}.out"
          find /shared -maxdepth 1 -name "gpu-test-*.out" -ls | tail -5 || true
          exit 1
        fi
      args:
        executable: /bin/bash
      register: gpu_job_result
      changed_when: true

    - name: Show GPU job result
      ansible.builtin.debug:
        var: gpu_job_result.stdout_lines
@@ -0,0 +1,95 @@
---
- name: Submit job to specific Slurm node
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook test-specific-node.yml -e target_node=<hostname>"
      when: target_node is not defined

    - name: Submit test job to target node
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=node-test
        #SBATCH --partition=debug
        #SBATCH --nodelist={{ target_node }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --account=lab
        #SBATCH --qos=normal
        #SBATCH --output=/shared/node-test-%j.out

        echo "HOST=\$(hostname)"
        echo "USER=\$(whoami)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        echo "### waiting for job to leave queue"
        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### waiting for output file"
        for i in $(seq 1 30); do
          if [ -s "/shared/node-test-${job_id}.out" ]; then
            break
          fi
          sleep 1
        done

        echo "### waiting for sacct final state"
        final_state=""
        for i in $(seq 1 30); do
          final_state="$(
            sacct -n -P -j "$job_id" --format=State 2>/dev/null \
              | head -n 1 \
              | cut -d'|' -f1 \
              | awk '{print $1}'
          )"

          if echo "$final_state" | grep -qE "COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY"; then
            break
          fi

          sleep 1
        done

        echo "FINAL_STATE=${final_state:-UNKNOWN}"

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/node-test-${job_id}.out"

        if [ "${final_state:-UNKNOWN}" != "COMPLETED" ]; then
          echo "Job did not reach COMPLETED state according to sacct"
          exit 1
        fi
      args:
        executable: /bin/bash
      register: node_test
      changed_when: true

    - name: Show node test result
      ansible.builtin.debug:
        var: node_test.stdout_lines
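The playbook above waits until `sacct` reports one of a fixed set of final states. The same classification can be sketched as a small predicate. The name `is_terminal_state` is hypothetical; the state list is the one used in the playbook's `grep -qE` pattern. Unlike that pattern, this sketch compares the first word exactly, which also copes with values such as `CANCELLED by 1000`.

```shell
# Succeeds when a sacct State value is one of the final states the playbook
# waits for. Only the first word is compared, so trailing detail is ignored.
is_terminal_state() {
  case ${1%% *} in
    COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY) return 0 ;;
    *) return 1 ;;
  esac
}
```

Inside the polling loop this could be used as `is_terminal_state "$final_state" && break`.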
@@ -0,0 +1,60 @@
---
- name: Generate measurable Slurm usage for sreport
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit CPU usage job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=sreport-usage
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=512M
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/sreport-usage-%j.out

        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "Burning CPU for 90 seconds"

        timeout 90 bash -c 'while true; do :; done' &
        timeout 90 bash -c 'while true; do :; done' &
        wait

        echo "Done"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 150); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 2
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/sreport-usage-${job_id}.out"
      args:
        executable: /bin/bash
      register: sreport_usage_job
      changed_when: true

    - name: Show usage job result
      ansible.builtin.debug:
        var: sreport_usage_job.stdout_lines
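The usage job above allocates 2 CPUs and keeps them busy for 90 seconds, so the accounting records should show at least 2 × 90 = 180 CPU-seconds, or 3 CPU-minutes, for this job. The arithmetic can be sketched as below. The helper name `expected_cpu_minutes` is hypothetical, and the figure is a floor rather than an exact prediction: accounting follows allocated CPUs and elapsed time, so job start-up and teardown add a little on top. Minutes are used on the assumption that `sreport` is left at its default time unit.

```shell
# Lower bound of allocated CPU time in whole CPU-minutes (rounded down).
expected_cpu_minutes() {
  cpus=$1
  seconds=$2
  echo $(( cpus * seconds / 60 ))
}
```

For the job in this playbook the call would be `expected_cpu_minutes 2 90`.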
@@ -0,0 +1,140 @@
---
- name: Validate Slurm operator user and SSH mesh
  hosts: slurm_cluster
  become: true
  gather_facts: false

  vars:
    # Override with -e slurm_operator_user=<name> if needed.
    slurm_operator_user: slurmuser
    slurm_hosts: "{{ groups['slurm_cluster'] }}"

  tasks:
    - name: Validate slurmuser exists
      ansible.builtin.command:
        cmd: id {{ slurm_operator_user }}
      changed_when: false

    - name: Validate sinfo as slurmuser
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sinfo
      changed_when: false

    - name: Validate squeue as slurmuser
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} squeue
      changed_when: false

    - name: Validate SSH mesh as slurmuser
      ansible.builtin.shell: |
        set -euo pipefail
        for h in {{ slurm_hosts | join(' ') }}; do
          echo "=== $h ==="
          ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" hostname
        done
      args:
        executable: /bin/bash
      become_user: "{{ slurm_operator_user }}"
      changed_when: false


- name: Validate Slurm controller commands
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Validate slurmctld status through sudo
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sudo -n systemctl status slurmctld --no-pager
      changed_when: false

    - name: Validate controller Slurm commands
      ansible.builtin.shell: |
        set -euo pipefail
        sudo -iu {{ slurm_operator_user }} sinfo
        sudo -iu {{ slurm_operator_user }} squeue
        sudo -iu {{ slurm_operator_user }} scontrol show nodes
      args:
        executable: /bin/bash
      changed_when: false


- name: Validate Slurm worker commands
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Validate slurmd status through sudo
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sudo -n systemctl status slurmd --no-pager
      changed_when: false

    - name: Validate worker Slurm commands
      ansible.builtin.shell: |
        set -euo pipefail
        sudo -iu {{ slurm_operator_user }} sinfo
        sudo -iu {{ slurm_operator_user }} squeue
        sudo -iu {{ slurm_operator_user }} scontrol show nodes
      args:
        executable: /bin/bash
      changed_when: false


- name: Validate basic job submission
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    slurm_operator_user: slurmuser

  tasks:
    - name: Submit simple Slurm test job as slurmuser
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu {{ slurm_operator_user }} sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=ansible-validate
        #SBATCH --partition=debug
        #SBATCH --time=00:01:00
        #SBATCH --output=/tmp/ansible-validate-%j.out

        hostname
        whoami
        date
        SBATCH
        )"

        echo "$job_id"

        for i in $(seq 1 20); do
          state="$(sudo -iu {{ slurm_operator_user }} squeue -h -j "$job_id" -o "%T" || true)"
          if [ -z "$state" ]; then
            break
          fi
          echo "job_state=$state"
          sleep 1
        done

        sudo -iu {{ slurm_operator_user }} sacct -j "$job_id" --format=JobID,JobName,State,ExitCode 2>/dev/null || true

        if [ -f "/tmp/ansible-validate-${job_id}.out" ]; then
          cat "/tmp/ansible-validate-${job_id}.out"
        fi
      args:
        executable: /bin/bash
      register: slurm_job_test
      changed_when: true

    - name: Show basic job submission result
      ansible.builtin.debug:
        var: slurm_job_test.stdout_lines
@@ -0,0 +1,236 @@
---
- name: Validate canary node variable
  hosts: localhost
  gather_facts: false

  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"

  tasks:
    - name: Ensure canary node is in inventory
      ansible.builtin.fail:
        msg: "canary_node={{ canary_node_effective }} is not in inventory"
      when: canary_node_effective not in groups['all']

    - name: Ensure canary node is not the controller
      ansible.builtin.fail:
        msg: "Do not use controller as canary for worker rolling upgrade"
      when: canary_node_effective in groups['slurm_controller']


- name: Drain canary node
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"

  tasks:
    - name: Show canary state before drain
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ canary_node_effective }} || true
        scontrol show node {{ canary_node_effective }} || true
        squeue -w {{ canary_node_effective }} || true
      args:
        executable: /bin/bash
      register: canary_before
      changed_when: false

    - name: Print canary state before drain
      ansible.builtin.debug:
        var: canary_before.stdout_lines

    - name: Drain canary node
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ canary_node_effective }} State=DRAIN Reason="canary OS upgrade"
      changed_when: true

    - name: Wait until canary has no running jobs
      ansible.builtin.shell: |
        set -euo pipefail
        squeue -h -w {{ canary_node_effective }} || true
      args:
        executable: /bin/bash
      register: canary_jobs
      retries: 120
      delay: 10
      until: canary_jobs.stdout | trim == ""
      changed_when: false


- name: Upgrade canary node OS packages
  hosts: "{{ canary_node | default('slurm-c02') }}"
  become: true
  gather_facts: true

  tasks:
    - name: Ensure apt cache is updated
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800

    - name: Full upgrade packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: apt_upgrade_result

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required

    - name: Show upgrade summary
      ansible.builtin.debug:
        msg:
          - "Host: {{ inventory_hostname }}"
          - "Apt changed: {{ apt_upgrade_result.changed }}"
          - "Reboot required: {{ reboot_required.stat.exists }}"

    - name: Reboot canary if required
      ansible.builtin.reboot:
        msg: "Reboot after canary OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 20
      when: reboot_required.stat.exists

    - name: Restart munge and keep it enabled
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmd and keep it enabled
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Validate local services
      ansible.builtin.shell: |
        set -euo pipefail
        systemctl is-active munge
        systemctl is-active slurmd
        munge -n | unmunge >/dev/null
        scontrol ping
      args:
        executable: /bin/bash
      changed_when: false


- name: Resume canary node and run canary job
  hosts: slurm_controller
  become: true
  gather_facts: false

  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"

  tasks:
    - name: Reconfigure controller
      ansible.builtin.command:
        cmd: scontrol reconfigure
      changed_when: true

    - name: Restart controller to refresh node state
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted

    - name: Wait for controller
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false

    - name: Clear canary node maintenance state
      ansible.builtin.shell: |
        set -euo pipefail

        scontrol update NodeName={{ canary_node_effective }} State=RESUME 2>/dev/null || true
        scontrol update NodeName={{ canary_node_effective }} State=UNDRAIN 2>/dev/null || true
        scontrol update NodeName={{ canary_node_effective }} State=IDLE 2>/dev/null || true

        sleep 3
        sinfo -N -n {{ canary_node_effective }}
        scontrol show node {{ canary_node_effective }}
      args:
        executable: /bin/bash
      register: resume_canary
      changed_when: true

    - name: Wait until canary is IDLE and responding
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ canary_node_effective }}
        scontrol show node {{ canary_node_effective }}
      args:
        executable: /bin/bash
      register: canary_state
      retries: 30
      delay: 5
      until:
        - canary_state.rc == 0
        - "'not_responding' not in canary_state.stdout.lower()"
        - "'down' not in canary_state.stdout.lower()"
        - "'drain' not in canary_state.stdout.lower()"
        - "'idle*' not in canary_state.stdout.lower()"
      changed_when: false

    - name: Submit canary test job to upgraded node
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=canary-upgrade-test
        #SBATCH --partition=all
        #SBATCH --nodelist={{ canary_node_effective }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/canary-upgrade-test-%j.out

        echo "HOST=\$(hostname)"
        echo "USER=\$(whoami)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=\$(uname -r)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/canary-upgrade-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: canary_job
      changed_when: true

    - name: Show canary test result
      ansible.builtin.debug:
        var: canary_job.stdout_lines
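The canary playbook's `until` conditions search the combined `sinfo` and `scontrol` output for substrings such as `down`, `drain` and `idle*`. A stricter variant that judges a single state token can be sketched as below. The helper name `node_state_ok` is hypothetical, and the list of accepted states is an assumption chosen for this sketch: the playbook itself only lists states to reject.

```shell
# Succeeds only for a state token that looks schedulable.
# Rejects anything containing "down", "drain" or "not_responding", and any
# token carrying a "*" suffix, which sinfo uses for nodes that do not respond.
node_state_ok() {
  state=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case $state in
    *'*'*|*down*|*drain*|*not_responding*) return 1 ;;
    idle|mixed|allocated|alloc|mix) return 0 ;;
    *) return 1 ;;
  esac
}
```

A possible call is `node_state_ok "$(sinfo -h -N -n "$node" -o '%T' | head -n 1)"`, where `$node` stands for the canary hostname. Checking one token avoids false matches on unrelated text in the `scontrol show node` output, such as a leftover `Reason=` string.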
@@ -0,0 +1,197 @@
---
- name: Rolling upgrade Slurm worker nodes
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true
  serial: 1

  vars:
    skip_canary_node: "{{ canary_node | default('slurm-c02') }}"
    do_skip_canary: "{{ skip_canary | default(true) | bool }}"

  pre_tasks:
    - name: Skip canary node if requested
      ansible.builtin.meta: end_host
      when:
        - do_skip_canary
        - inventory_hostname == skip_canary_node

    - name: Drain node before OS upgrade
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ inventory_hostname }} State=DRAIN Reason="rolling OS upgrade"
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      changed_when: true

    - name: Wait until no jobs are running on this node
      ansible.builtin.shell: |
        set -euo pipefail
        squeue -h -w {{ inventory_hostname }} || true
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: jobs_on_node
      retries: 120
      delay: 10
      until: jobs_on_node.stdout | trim == ""
      changed_when: false

  tasks:
    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800

    - name: Full upgrade packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: apt_upgrade_result

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required

    - name: Show upgrade status
      ansible.builtin.debug:
        msg:
          - "Node: {{ inventory_hostname }}"
          - "Apt changed: {{ apt_upgrade_result.changed }}"
          - "Reboot required: {{ reboot_required.stat.exists }}"

    - name: Reboot node if required
      ansible.builtin.reboot:
        msg: "Reboot after rolling OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 20
      when: reboot_required.stat.exists

    - name: Restart munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true

    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true

    - name: Validate local slurm services
      ansible.builtin.shell: |
        set -euo pipefail
        systemctl is-active munge
        systemctl is-active slurmd
        munge -n | unmunge >/dev/null
        scontrol ping
      args:
        executable: /bin/bash
      changed_when: false

  post_tasks:
    - name: Restart controller to refresh state after node upgrade
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      run_once: false

    - name: Wait for controller after restart
      ansible.builtin.command:
        cmd: scontrol ping
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false

    - name: Clear upgraded node maintenance state
      ansible.builtin.shell: |
        set -euo pipefail

        scontrol update NodeName={{ inventory_hostname }} State=RESUME 2>/dev/null || true
        scontrol update NodeName={{ inventory_hostname }} State=UNDRAIN 2>/dev/null || true
        scontrol update NodeName={{ inventory_hostname }} State=IDLE 2>/dev/null || true

        sleep 3
        sinfo -N -n {{ inventory_hostname }}
        scontrol show node {{ inventory_hostname }}
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: resume_node
      changed_when: true

    - name: Wait until node is healthy
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ inventory_hostname }}
        scontrol show node {{ inventory_hostname }}
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: upgraded_node_state
      retries: 30
      delay: 5
      until:
        - upgraded_node_state.rc == 0
        - "'not_responding' not in upgraded_node_state.stdout.lower()"
        - "'down' not in upgraded_node_state.stdout.lower()"
        - "'drain' not in upgraded_node_state.stdout.lower()"
        - "'idle*' not in upgraded_node_state.stdout.lower()"
      changed_when: false

    - name: Submit node-local post-upgrade test job
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=rolling-upgrade-test
        #SBATCH --partition=all
        #SBATCH --nodelist={{ inventory_hostname }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/rolling-upgrade-test-%j.out

        echo "HOST=\$(hostname)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=\$(uname -r)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/rolling-upgrade-test-${job_id}.out"
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: node_test_job
      changed_when: true

    - name: Show node post-upgrade test result
      ansible.builtin.debug:
        var: node_test_job.stdout_lines
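Review note: the `until` conditions in the "Wait until node is healthy" task amount to a case-insensitive substring check on the combined `sinfo` and `scontrol show node` output. The sketch below restates that predicate in Python so it can be exercised offline. The function name `node_looks_healthy` and the sample output strings are hypothetical and are not part of the repository.

```python
# Hypothetical helper mirroring the playbook's `until` list; not repository code.
UNHEALTHY_MARKERS = ("not_responding", "down", "drain", "idle*")


def node_looks_healthy(rc: int, stdout: str) -> bool:
    """Return True when the command succeeded and no unhealthy marker appears.

    Like the playbook, this is a plain substring match on lowercased output,
    so any unrelated field containing e.g. "down" would also count as unhealthy.
    """
    text = stdout.lower()
    return rc == 0 and not any(marker in text for marker in UNHEALTHY_MARKERS)


if __name__ == "__main__":
    # Made-up sample lines, only to show the call shape.
    print(node_looks_healthy(0, "node01 1 all idle"))
    print(node_looks_healthy(0, "node01 1 all drained"))
```

Because the match is on substrings rather than on the parsed `State=` field, a free-text `Reason=` value could in principle keep the retry loop from succeeding; parsing the state column would be stricter, at the cost of more shell or Jinja logic.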
@@ -0,0 +1,94 @@
---
- name: Upgrade Slurm controller OS safely
  hosts: slurm_controller
  become: true
  gather_facts: true

  tasks:
    - name: Show cluster state before controller upgrade
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        squeue
        systemctl is-active munge
        systemctl is-active slurmctld
        systemctl is-active slurmdbd || true
        systemctl is-active mariadb || true
      args:
        executable: /bin/bash
      register: before_state
      changed_when: false

    - name: Print cluster state before controller upgrade
      ansible.builtin.debug:
        var: before_state.stdout_lines

    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800

    - name: Full upgrade controller packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: controller_upgrade

    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: controller_reboot_required

    - name: Show controller upgrade status
      ansible.builtin.debug:
        msg:
          - "Apt changed: {{ controller_upgrade.changed }}"
          - "Reboot required: {{ controller_reboot_required.stat.exists }}"

    - name: Reboot controller if required
      ansible.builtin.reboot:
        msg: "Reboot after controller OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 30
      when: controller_reboot_required.stat.exists

    - name: Restart controller services
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: restarted
        enabled: true
      loop:
        - munge
        - mariadb
        - slurmdbd
        - slurmctld

    - name: Wait for slurmctld
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 20
      delay: 3
      until: slurmctld_ping.rc == 0
      changed_when: false

    - name: Validate controller after upgrade
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        squeue
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|TaskPlugin|ProctrackType"
        sacct -S today --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList | tail -20
      args:
        executable: /bin/bash
      register: controller_after
      changed_when: false

    - name: Print controller validation after upgrade
      ansible.builtin.debug:
        var: controller_after.stdout_lines
@@ -0,0 +1,207 @@
---
- name: Validate cluster after OS rolling upgrade
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Validate Slurm controller and cluster state
      ansible.builtin.shell: |
        set -euo pipefail

        echo "### slurmctld ping"
        scontrol ping

        echo
        echo "### nodes"
        sinfo -N

        echo
        echo "### partitions"
        sinfo

        echo
        echo "### queue"
        squeue

        echo
        echo "### important config"
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|TaskPlugin|ProctrackType|SelectType|ClusterName"

        echo
        echo "### accounting recent jobs"
        sacct -S today --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
      args:
        executable: /bin/bash
      register: cluster_state
      changed_when: false

    - name: Print cluster state
      ansible.builtin.debug:
        var: cluster_state.stdout_lines


- name: Validate worker services after OS rolling upgrade
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true

  tasks:
    - name: Validate local worker services and Slurm connectivity
      ansible.builtin.shell: |
        set -euo pipefail

        echo "HOST=$(hostname)"
        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
        echo "KERNEL=$(uname -r)"
        echo "UPTIME=$(uptime -p)"

        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd

        echo
        echo "### munge local test"
        munge -n | unmunge >/dev/null
        echo "munge OK"

        echo
        echo "### controller ping"
        scontrol ping

        echo
        echo "### local slurm.conf checksum"
        sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true

        echo
        echo "### gpu check if present"
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader || true
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: worker_state
      changed_when: false

    - name: Print worker state
      ansible.builtin.debug:
        var: worker_state.stdout_lines


- name: Submit post-upgrade CPU validation job
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit CPU validation job to debug partition
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=os-upgrade-cpu-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/os-upgrade-cpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=$(uname -r)"
        date
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/os-upgrade-cpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: cpu_validation_job
      changed_when: true

    - name: Print CPU validation job
      ansible.builtin.debug:
        var: cpu_validation_job.stdout_lines


- name: Submit post-upgrade GPU validation job
  hosts: slurm_controller
  become: true
  gather_facts: false

  tasks:
    - name: Submit GPU validation job to gpu partition
      ansible.builtin.shell: |
        set -euo pipefail

        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=os-upgrade-gpu-test
        #SBATCH --partition=gpu
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/os-upgrade-gpu-test-%j.out

        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=$(uname -r)"
        echo
        nvidia-smi
        SBATCH
        )"

        echo "JOB_ID=$job_id"

        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done

        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList

        echo "### output"
        cat "/shared/os-upgrade-gpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_validation_job
      changed_when: true

    - name: Print GPU validation job
      ansible.builtin.debug:
        var: gpu_validation_job.stdout_lines
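Review note: each validation job polls `squeue` once per second for a fixed number of attempts (90 for the CPU job, 120 for the GPU job) and then proceeds to `sacct` and `cat` whether or not the job has left the queue. The sketch below shows the same bounded polling in Python, with the one difference that it reports whether the wait succeeded. The name `wait_until_gone` and the injected `in_queue` callable are hypothetical; in a real script `in_queue` would wrap a `squeue -h -j <job_id>` call.

```python
import time
from typing import Callable


def wait_until_gone(
    in_queue: Callable[[], bool],
    attempts: int = 90,
    delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until in_queue() returns False.

    Returns True if the job left the queue within `attempts` polls,
    False if the attempts ran out (the case the shell loop does not report).
    """
    for _ in range(attempts):
        if not in_queue():
            return True
        sleep(delay)
    return False


if __name__ == "__main__":
    # Simulated queue states: still queued twice, then gone.
    states = iter([True, True, False])
    finished = wait_until_gone(lambda: next(states), attempts=5, delay=0)
    print("finished" if finished else "timed out")
```

Returning a boolean lets the caller fail fast on a timeout instead of reading an output file that may not exist yet.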
@@ -0,0 +1,15 @@
# Codex prompt: generate repository documentation

You are working in an Ansible repository that automates a Slurm AI/HPC lab.

Please review the repository and generate or improve documentation under `docs/` with the following goals:

1. Explain the architecture and repository layout.
2. Document the end-to-end deployment sequence.
3. Document operational workflows: provisioning, decommissioning, rolling upgrades, health checks and auto-remediation.
4. Document SlurmDBD accounting, QOS, fairshare and priority workflows.
5. Add troubleshooting notes based on the playbooks and templates.
6. Avoid exposing secrets, real IP addresses, real hostnames, SQL dumps, backup archives, private keys or vault content.
7. Keep all text in English.

Output should be practical, operator-focused and suitable for a public Git repository.
@@ -0,0 +1,16 @@
# Managed by Ansible
# Slurm cgroup configuration

CgroupPlugin=autodetect

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=no
ConstrainDevices=yes

AllowedRAMSpace=100
AllowedSwapSpace=0
MaxRAMPercent=100
MaxSwapPercent=0

MinRAMSpace=30
@@ -0,0 +1,4 @@
# Managed by Ansible
{% for node in slurm_nodes if node.managed_state | default('present') == 'present' and node.gres | default('') | length > 0 %}
NodeName={{ node.name }} Name=gpu File={{ node.gres_file | default('/dev/nvidia0') }}
{% endfor %}
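Review note: the `gres.conf` template emits one line per node that is both managed as `present` (the default when `managed_state` is unset) and has a non-empty `gres` value, falling back to `/dev/nvidia0` when `gres_file` is unset. The sketch below reproduces that selection in plain Python without Jinja. The function name `gres_lines` and the sample node dictionaries are hypothetical; only the key names come from the template.

```python
def gres_lines(nodes):
    """Return the gres.conf lines the template's loop filter would select."""
    lines = []
    for node in nodes:
        # managed_state defaults to 'present', as in the template.
        if node.get("managed_state", "present") != "present":
            continue
        # Nodes without a gres string are skipped entirely.
        if not node.get("gres", ""):
            continue
        gres_file = node.get("gres_file", "/dev/nvidia0")
        lines.append(f"NodeName={node['name']} Name=gpu File={gres_file}")
    return lines


if __name__ == "__main__":
    # Made-up inventory entries for illustration only.
    sample_nodes = [
        {"name": "gpu01", "gres": "gpu:1"},
        {"name": "cpu01"},
        {"name": "gpu02", "gres": "gpu:1", "managed_state": "absent"},
    ]
    for line in gres_lines(sample_nodes):
        print(line)
```

One consequence visible in the sketch is that every selected node gets exactly one `File=` entry, so a node with several GPUs would need `gres_file` set to a device range or the template extended.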
@@ -0,0 +1,67 @@
# Managed by Ansible

ClusterName={{ slurm_cluster_name }}
SlurmctldHost={{ slurm_control_machine }}({{ slurm_control_addr }})

SlurmUser={{ slurm_user }}
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SwitchType=switch/none
MpiDefault={{ slurm_default_mpi_type }}
ProctrackType={{ slurm_proctrack_type }}
ReturnToService={{ slurm_return_to_service }}
{% if slurm_gres_types is defined and slurm_gres_types | length > 0 %}
GresTypes={{ slurm_gres_types }}
{% endif %}

SlurmctldPidFile=/run/slurmctld.pid
SlurmdPidFile=/run/slurmd.pid
SlurmctldPort={{ slurmctld_port }}
SlurmdPort={{ slurmd_port }}

TaskPlugin={{ slurm_task_plugin }}
SelectType={{ slurm_select_type }}
SelectTypeParameters={{ slurm_select_type_parameters }}

SchedulerType=sched/backfill
# Priority / fairshare
PriorityType={{ slurm_priority_type | default('priority/multifactor') }}
PriorityDecayHalfLife={{ slurm_priority_decay_half_life | default('7-0') }}
PriorityCalcPeriod={{ slurm_priority_calc_period | default(5) }}
PriorityFavorSmall={{ slurm_priority_favor_small | default('NO') }}
PriorityWeightAge={{ slurm_priority_weight_age | default(1000) }}
PriorityWeightFairshare={{ slurm_priority_weight_fairshare | default(10000) }}
PriorityWeightJobSize={{ slurm_priority_weight_job_size | default(1000) }}
PriorityWeightPartition={{ slurm_priority_weight_partition | default(1000) }}
PriorityWeightQOS={{ slurm_priority_weight_qos | default(10000) }}
PriorityMaxAge={{ slurm_priority_max_age | default('1-0') }}

SlurmctldTimeout=120
SlurmdTimeout=300
InactiveLimit=0
KillWait=30
Waittime=0

AccountingStorageType={{ slurm_accounting_storage_type }}
{% if slurm_accounting_storage_type == "accounting_storage/slurmdbd" %}
AccountingStorageHost={{ slurm_accounting_storage_host }}
AccountingStoragePort={{ slurm_accounting_storage_port }}
AccountingStorageEnforce={{ slurm_accounting_storage_enforce | default('associations,limits,qos') }}
AccountingStorageTRES={{ slurm_accounting_storage_tres | default('cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu') }}
{% endif %}
JobAcctGatherType={{ slurm_job_acct_gather_type | default('jobacct_gather/none') }}
JobCompType={{ slurm_job_comp_type }}

SlurmctldDebug=info
SlurmdDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log

{% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
NodeName={{ node.name }} NodeAddr={{ node.addr }} CPUs={{ node.cpus }}{% if node.topology | default('') | length > 0 %} {{ node.topology }}{% endif %} RealMemory={{ node.real_memory }}{% if node.gres | default('') | length > 0 %} Gres={{ node.gres }}{% endif %}{% if node.features | default('') | length > 0 %} Feature={{ node.features }}{% endif %} State=UNKNOWN
{% endfor %}

{% for partition in slurm_partitions %}
PartitionName={{ partition.name }} Nodes={{ partition.nodes }} Default={{ partition.default }} MaxTime={{ partition.max_time }} State={{ partition.state }}
{% endfor %}
@@ -0,0 +1,38 @@
# Managed by Ansible
# Slurm database daemon configuration

AuthType=auth/munge

DbdHost={{ slurmdbd_host }}
DbdPort={{ slurmdbd_port }}

SlurmUser={{ slurm_user }}

DebugLevel=info
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/run/slurmdbd.pid

CommitDelay={{ slurmdbd_commit_delay | default(1) }}

StorageType={{ slurmdbd_storage_type }}
StorageHost={{ slurmdbd_storage_host }}
StoragePort={{ slurmdbd_storage_port }}
StorageLoc={{ slurmdbd_storage_loc }}
StorageUser={{ slurmdbd_storage_user }}
StoragePass={{ slurmdbd_storage_pass }}

# Retention / purge policy
PurgeEventAfter={{ slurmdbd_purge_event_after | default('12months') }}
PurgeJobAfter={{ slurmdbd_purge_job_after | default('12months') }}
PurgeResvAfter={{ slurmdbd_purge_resv_after | default('12months') }}
PurgeStepAfter={{ slurmdbd_purge_step_after | default('3months') }}
PurgeSuspendAfter={{ slurmdbd_purge_suspend_after | default('3months') }}
PurgeTXNAfter={{ slurmdbd_purge_txn_after | default('12months') }}
PurgeUsageAfter={{ slurmdbd_purge_usage_after | default('24months') }}

ArchiveEvents={{ slurmdbd_archive_events | default('no') }}
ArchiveJobs={{ slurmdbd_archive_jobs | default('no') }}
ArchiveSteps={{ slurmdbd_archive_steps | default('no') }}
ArchiveSuspend={{ slurmdbd_archive_suspend | default('no') }}
ArchiveTXN={{ slurmdbd_archive_txn | default('no') }}
ArchiveUsage={{ slurmdbd_archive_usage | default('no') }}