Add Slurm AI/HPC cluster platform project

2026-06-04 19:41:05 +00:00
parent e2624a7533
commit d300d490f5
49 changed files with 4777 additions and 0 deletions
@@ -0,0 +1,59 @@
+# Ansible Slurm AI/HPC Lab
+
+Ansible automation for a small Slurm AI/HPC lab with CPU nodes, a GPU node, Munge, cgroups, GRES, SlurmDBD accounting, QOS/fairshare, node lifecycle workflows, rolling OS upgrades and health remediation.
+
+This repository is sanitized for publication. Replace the example inventory values under `inventories/lab/` with your own hostnames, IP addresses and users before running it.
+
+## What this lab covers
+
+- Slurm controller and worker configuration
+- Munge key distribution
+- GPU GRES configuration
+- cgroup CPU/GPU/device enforcement
+- SlurmDBD + MariaDB accounting
+- `sacct`, `sreport`, `sacctmgr` validation
+- QOS, limits, fairshare and priority/multifactor
+- Node provisioning and decommissioning
+- Rolling OS upgrades with canary validation
+- Health checks and node auto-remediation
+
+## Repository layout
+
+```text
+inventories/lab/          Example inventory and group variables
+templates/                Slurm, cgroup, gres and slurmdbd templates
+playbooks/bootstrap/      Initial SSH, sudo and /etc/hosts setup
+playbooks/core/           Munge, Slurm config and safe restart workflows
+playbooks/accounting/     SlurmDBD, backup/restore-check and accounting validation
+playbooks/qos/            QOS, fairshare and priority configuration
+playbooks/lifecycle/      Provisioning and decommissioning nodes
+playbooks/upgrade/        Rolling OS upgrade and canary workflow
+playbooks/health/         Health checks and auto-remediation
+playbooks/tests/          CPU/GPU/cgroup/accounting validation jobs
+playbooks/backup/         Slurm config backup helpers
+docs/                     Runbooks and interview notes
+prompts/codex/            Prompts for generating or expanding documentation
+```
+
+## Quick start
+
+1. Edit `inventories/lab/inventory.yml`.
+2. Edit `inventories/lab/group_vars/slurm_cluster.yml`.
+3. Create and encrypt a vault file for database credentials:
+
+```bash
+cp inventories/lab/group_vars/vault.example.yml inventories/lab/group_vars/vault.yml
+ansible-vault encrypt inventories/lab/group_vars/vault.yml
+```
+
+4. Run syntax checks:
+
+```bash
+find playbooks -name '*.yml' -print0 | xargs -0 -n1 ansible-playbook --syntax-check
+```
+
+5. Run the bootstrap/core workflows in the order described in `docs/runbook.md`.
+
+## Security notes
+
+Do not commit real inventories, backup archives, SQL dumps, Munge keys, private SSH keys or Ansible Vault files. This repository intentionally excludes generated backup artifacts.
@@ -0,0 +1,14 @@
+[defaults]
+inventory = ./inventories/lab/inventory.yml
+host_key_checking = False
+retry_files_enabled = False
+stdout_callback = default
+callback_result_format = yaml
+interpreter_python = auto_silent
+timeout = 30
+roles_path = ./roles
+collections_path = ./collections
+
+[ssh_connection]
+pipelining = True
+ssh_args = -o ControlMaster=auto -o ControlPersist=60s
@@ -0,0 +1 @@
+Generated backups and reports can be stored here locally. This directory is ignored by git.
@@ -0,0 +1,22 @@
+# Interview Cheatsheet: Slurm AI/HPC Lab
+
+## One-minute summary
+
+I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.
+
+## Topics I can discuss
+
+- How Slurm schedules CPU and GPU workloads.
+- Difference between GRES scheduling and cgroup device enforcement.
+- Why Munge key consistency matters.
+- How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
+- How QOS, account associations, fairshare and multifactor priority work.
+- Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
+
+## Real troubleshooting examples
+
+- `IDLE+NOT_RESPONDING` after node reprovisioning.
+- Accounting delay where `sacct` temporarily showed `PENDING` while job output existed.
+- Missing `gres/gpu` TRES before QOS GPU limits could be configured.
+- `sacctmgr` idempotency issues such as `Nothing new added`.
+- Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.
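
The accounting-delay case in the list above comes down to waiting until `sacct` reports a finished state before trusting it. A minimal sketch, assuming `sacct` is on the PATH and SlurmDBD is reachable; the function names `is_terminal_state` and `wait_for_sacct_state` are hypothetical and do not exist in this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this repository).

# Succeeds if the given sacct State value means the job has finished.
is_terminal_state() {
  case "$1" in
    COMPLETED|FAILED|CANCELLED*|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY|BOOT_FAIL|DEADLINE|PREEMPTED)
      return 0 ;;
    *)
      return 1 ;;
  esac
}

# Polls sacct until the job reaches a terminal state or the retries run out.
# Usage: wait_for_sacct_state <job_id> [tries] [delay_seconds]
wait_for_sacct_state() {
  local job_id tries delay state i
  job_id="$1"
  tries="${2:-30}"
  delay="${3:-2}"
  state=""
  i=0
  while [ "$i" -lt "$tries" ]; do
    # -X: allocation line only, -n: no header, -P: parsable output
    state="$(sacct -j "$job_id" -X -n -P --format=State | head -n 1)"
    if is_terminal_state "$state"; then
      printf '%s\n' "$state"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  printf 'job %s still reported as %s\n' "$job_id" "${state:-unknown}" >&2
  return 1
}
```

For example, `wait_for_sacct_state 1234 30 2` would poll job 1234 for up to about a minute. The job ID and retry values are illustrative only.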
@@ -0,0 +1,62 @@
+# Slurm AI/HPC Lab Runbook
+
+## Standard deployment order
+
+```bash
+ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
+ansible-playbook playbooks/bootstrap/slurm-hosts.yml
+ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
+ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
+
+ansible-playbook playbooks/core/manage-munge.yml
+ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
+ansible-playbook playbooks/core/manage-slurm-config.yml --diff
+ansible-playbook playbooks/core/restart-slurm-safe.yml
+
+ansible-playbook playbooks/tests/validate-slurm-operator.yml
+ansible-playbook playbooks/tests/test-cpu-job.yml
+ansible-playbook playbooks/tests/test-gpu-job.yml
+ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
+
+ansible-playbook playbooks/accounting/setup-slurmdbd.yml
+ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
+ansible-playbook playbooks/accounting/backup-slurmdbd.yml
+ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
+ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
+
+ansible-playbook playbooks/qos/configure-slurm-qos.yml
+ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
+
+ansible-playbook playbooks/health/check-slurm-health.yml
+```
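
The deployment order above only helps if a failed step stops the sequence. A minimal sketch of a wrapper that enforces this, assuming each argument is a playbook path with no per-playbook options (so the `--check --diff` and `--ask-pass` steps would still be run by hand). `run_in_order` and the `PLAYBOOK_CMD` override are hypothetical names, not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of this repository): run playbooks in the
# given order and stop at the first failure.
# PLAYBOOK_CMD is an assumed override hook so the loop can be dry-run.
run_in_order() {
  local pb
  for pb in "$@"; do
    echo ">>> $pb"
    if ! ${PLAYBOOK_CMD:-ansible-playbook} "$pb"; then
      echo "stopped at: $pb" >&2
      return 1
    fi
  done
}
```

A dry run such as `PLAYBOOK_CMD=echo run_in_order playbooks/core/manage-munge.yml playbooks/core/restart-slurm-safe.yml` prints the order without touching any host.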
+
+## Node lifecycle
+
+Provision a node:
+
+```bash
+ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
+```
+
+Decommission a node:
+
+```bash
+ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
+```
+
+Repair a node:
+
+```bash
+ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
+```
+
+## Rolling OS upgrade
+
+```bash
+ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
+ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
+ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
+ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
+```
+
+If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
@@ -0,0 +1,28 @@
+# Troubleshooting Cases
+
+## `IDLE+NOT_RESPONDING` after node maintenance
+
+Symptoms: `sinfo` shows `idle*` or `scontrol show node` shows `IDLE+NOT_RESPONDING`.
+
+Actions:
+
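+```bash
+# On the affected node:
+systemctl restart munge
+systemctl restart slurmd
+# On the controller:
+systemctl restart slurmctld
+# Which state change is accepted depends on the Slurm version and the node's current state, so failures are ignored.
+scontrol update NodeName=<node> State=RESUME || true
+scontrol update NodeName=<node> State=UNDRAIN || true
+scontrol update NodeName=<node> State=IDLE || true
+```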
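
To confirm the symptom before and after these actions, the affected nodes can be listed from `sinfo` output. A minimal sketch, assuming input lines of the form `<node> <compact state>`; the helper name `not_responding_nodes` is hypothetical and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Reads "<node> <state>" lines on stdin and prints each node whose compact
# state ends in "*", which is how sinfo marks a node that is not responding.
# sort -u collapses nodes that appear once per partition.
not_responding_nodes() {
  awk '$2 ~ /\*$/ { print $1 }' | sort -u
}
```

On a live cluster the input would come from something like `sinfo -h -N -o '%N %t' | not_responding_nodes`. An empty result means no node is flagged.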
+
+## Missing GPU TRES
+
+Symptoms: `sacctmgr` fails with `no TRES known by type gres/gpu`.
+
+Fix: add `AccountingStorageTRES=...,gres/gpu`, restart/reconfigure Slurm, run a GPU job and verify with `sacctmgr show tres`.
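
Both halves of this fix can be checked from the shell. A minimal sketch, assuming `sacctmgr` parsable output in `Type|Name` form and a slurm.conf path passed by the caller; `has_gpu_tres` and `conf_tracks_gpu` are hypothetical helper names, not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of this repository).

# Reads "Type|Name" lines on stdin; succeeds if the gres/gpu TRES is listed.
has_gpu_tres() {
  grep -qxF 'gres|gpu'
}

# Succeeds if the given config file has an AccountingStorageTRES line
# that includes gres/gpu.
conf_tracks_gpu() {
  grep -Eq '^[[:space:]]*AccountingStorageTRES=.*gres/gpu' "$1"
}
```

Typical use would be `conf_tracks_gpu /etc/slurm/slurm.conf` followed by `sacctmgr -nP show tres format=Type,Name | has_gpu_tres`. The second check needs a running SlurmDBD.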
+
+## SlurmDBD objects already exist
+
+Symptoms: `sacctmgr` returns `Nothing new added` or `Already existing`.
+
+Fix: make Ansible tasks idempotent: attempt the change, tolerate known existing-object messages, then normalize state with `modify`.
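
The tolerate-then-normalize pattern described above can be sketched as a small shell helper. This is an illustration under assumptions, not the playbooks' actual implementation: `add_tolerating_existing` is a hypothetical name, and the only messages it tolerates are the two quoted in this section.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository).
# Runs the given command, echoes its combined output, and treats a failure
# as success only if the output contains a known "already exists" message.
add_tolerating_existing() {
  local out rc
  out="$("$@" 2>&1)" && rc=0 || rc=$?
  printf '%s\n' "$out"
  if [ "$rc" -eq 0 ]; then
    return 0
  fi
  case "$out" in
    *"Nothing new added"*|*"Already existing"*) return 0 ;;
  esac
  return "$rc"
}

# Example against a live SlurmDBD (account and cluster names taken from the
# example inventory):
#   add_tolerating_existing sacctmgr -i add account lab Cluster=labcluster
#   sacctmgr -i modify account where name=lab set Description="AI/HPC Slurm lab account"
```

Any other failure, such as an unreachable SlurmDBD, still propagates its exit code, so real errors are not masked.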
@@ -0,0 +1,128 @@
+---
+# Example lab inventory variables. Replace addresses, users and node topology for your environment.
+
+slurm_cluster_name: labcluster
+
+slurm_control_machine: slurm-ctl01
+slurm_control_addr: 10.10.10.11
+
+slurm_config_dir: /etc/slurm
+slurm_user: slurm
+slurm_operator_user: slurmuser
+
+slurmctld_port: 6817
+slurmd_port: 6818
+
+slurm_job_comp_type: jobcomp/none
+
+slurm_select_type: select/cons_tres
+slurm_select_type_parameters: CR_Core_Memory
+
+slurm_return_to_service: 2
+slurm_default_mpi_type: none
+
+slurm_gres_types: gpu
+
+slurm_nodes:
+  - name: slurm-c01
+    managed_state: present
+    addr: 10.10.10.12
+    cpus: 2
+    real_memory: 1800
+    features: ""
+    gres: ""
+    topology: ""
+  - name: slurm-c02
+    managed_state: present
+    addr: 10.10.10.13
+    cpus: 2
+    real_memory: 1800
+    features: ""
+    gres: ""
+    topology: ""
+  - name: gpu01
+    managed_state: present
+    addr: 10.10.10.14
+    cpus: 12
+    real_memory: 60000
+    features: "gpu"
+    gres: "gpu:1"
+    gres_file: /dev/nvidia0
+    topology: "Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2"
+
+slurm_partitions:
+  - name: debug
+    managed_state: present
+    nodes: "slurm-c[01-02]"
+    default: "YES"
+    max_time: "INFINITE"
+    state: "UP"
+  - name: gpu
+    managed_state: present
+    nodes: "gpu01"
+    default: "NO"
+    max_time: "INFINITE"
+    state: "UP"
+  - name: all
+    managed_state: present
+    nodes: "slurm-c[01-02],gpu01"
+    default: "NO"
+    max_time: "INFINITE"
+    state: "UP"
+
+# Cgroup enforcement
+slurm_enable_cgroup: true
+slurm_task_plugin: task/cgroup,task/affinity
+slurm_proctrack_type: proctrack/cgroup
+slurm_job_acct_gather_type: jobacct_gather/cgroup
+
+# Slurm accounting / SlurmDBD
+slurm_accounting_storage_type: accounting_storage/slurmdbd
+slurm_accounting_storage_host: slurm-ctl01
+slurm_accounting_storage_port: 6819
+slurm_accounting_storage_enforce: associations,limits,qos
+slurm_accounting_storage_tres: cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu
+
+slurmdbd_host: slurm-ctl01
+slurmdbd_port: 6819
+slurmdbd_storage_type: accounting_storage/mysql
+slurmdbd_storage_host: localhost
+slurmdbd_storage_port: 3306
+slurmdbd_storage_loc: slurm_acct_db
+slurmdbd_storage_user: slurm
+# Use Ansible Vault in real environments. See inventories/lab/group_vars/vault.example.yml
+slurmdbd_storage_pass: "{{ vault_slurmdbd_storage_pass | default('CHANGE_ME_USE_ANSIBLE_VAULT') }}"
+
+slurm_account_name: lab
+slurm_account_description: "AI/HPC Slurm lab account"
+slurm_account_organization: "labcluster"
+
+# SlurmDBD purge / retention policy for lab
+slurmdbd_commit_delay: 1
+slurmdbd_purge_event_after: 12months
+slurmdbd_purge_job_after: 12months
+slurmdbd_purge_resv_after: 12months
+slurmdbd_purge_step_after: 3months
+slurmdbd_purge_suspend_after: 3months
+slurmdbd_purge_txn_after: 12months
+slurmdbd_purge_usage_after: 24months
+
+# Archive is disabled for the lab; backup playbooks handle database dumps.
+slurmdbd_archive_events: no
+slurmdbd_archive_jobs: no
+slurmdbd_archive_steps: no
+slurmdbd_archive_suspend: no
+slurmdbd_archive_txn: no
+slurmdbd_archive_usage: no
+
+# Slurm priority / fairshare
+slurm_priority_type: priority/multifactor
+slurm_priority_decay_half_life: 7-0
+slurm_priority_calc_period: 5
+slurm_priority_favor_small: "NO"
+slurm_priority_weight_age: 1000
+slurm_priority_weight_fairshare: 10000
+slurm_priority_weight_job_size: 1000
+slurm_priority_weight_partition: 1000
+slurm_priority_weight_qos: 10000
+slurm_priority_max_age: 1-0
@@ -0,0 +1,5 @@
+---
+# Copy this file to vault.yml and encrypt it with ansible-vault.
+# ansible-vault encrypt inventories/lab/group_vars/vault.yml
+
+vault_slurmdbd_storage_pass: CHANGE_ME
@@ -0,0 +1,24 @@
+all:
+  vars:
+    ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
+  children:
+    slurm_cluster:
+      children:
+        slurm_controller:
+          hosts:
+            slurm-ctl01:
+              ansible_host: 10.10.10.11
+              ansible_user: ansible
+        slurm_compute:
+          hosts:
+            slurm-c01:
+              ansible_host: 10.10.10.12
+              ansible_user: ansible
+            slurm-c02:
+              ansible_host: 10.10.10.13
+              ansible_user: ansible
+        slurm_gpu:
+          hosts:
+            gpu01:
+              ansible_host: 10.10.10.14
+              ansible_user: ansible
@@ -0,0 +1,90 @@
+---
+- name: Backup SlurmDBD MariaDB database
+  hosts: slurm_controller
+  become: true
+  gather_facts: true
+
+  vars:
+    slurmdbd_backup_dir: /var/backups/slurmdbd
+    local_fetch_dir: "{{ playbook_dir }}/../../artifacts/backups/slurmdbd"
+
+  tasks:
+    - name: Create remote backup directory
+      ansible.builtin.file:
+        path: "{{ slurmdbd_backup_dir }}"
+        state: directory
+        owner: root
+        group: root
+        mode: "0700"
+
+    - name: Create local fetch directory on Ansible controller
+      # No owner/group here: the task runs unprivileged (become: false), so it cannot chown to root.
+      ansible.builtin.file:
+        path: "{{ local_fetch_dir }}"
+        state: directory
+        mode: "0700"
+      delegate_to: localhost
+      become: false
+
+    - name: Validate MariaDB is running
+      ansible.builtin.command:
+        cmd: systemctl is-active mariadb
+      changed_when: false
+
+    - name: Validate SlurmDBD is running
+      ansible.builtin.command:
+        cmd: systemctl is-active slurmdbd
+      changed_when: false
+
+    - name: Validate Slurm accounting database exists
+      ansible.builtin.shell: |
+        set -euo pipefail
+        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';" | grep -qx "{{ slurmdbd_storage_loc }}"
+      args:
+        executable: /bin/bash
+      changed_when: false
+
+    - name: Dump Slurm accounting database
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        ts="$(date +%F-%H%M%S)"
+        out="{{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-${ts}.sql.gz"
+
+        mysqldump \
+          --single-transaction \
+          --routines \
+          --events \
+          --triggers \
+          {{ slurmdbd_storage_loc }} | gzip -9 > "$out"
+
+        chmod 0600 "$out"
+        echo "$out"
+      args:
+        executable: /bin/bash
+      register: db_dump
+      changed_when: true
+
+    - name: Validate backup file is non-empty
+      ansible.builtin.stat:
+        path: "{{ db_dump.stdout }}"
+      register: backup_file
+
+    - name: Fail if backup file is missing or suspiciously small
+      ansible.builtin.fail:
+        msg: "Backup file is missing or smaller than 1 KiB: {{ db_dump.stdout }}"
+      when: not backup_file.stat.exists or backup_file.stat.size | int < 1024
+
+    - name: Fetch DB backup to Ansible controller
+      ansible.builtin.fetch:
+        src: "{{ db_dump.stdout }}"
+        dest: "{{ local_fetch_dir }}/"
+        flat: true
+
+    - name: Show DB backup result
+      ansible.builtin.debug:
+        msg:
+          - "Remote backup: {{ db_dump.stdout }}"
+          - "Backup size bytes: {{ backup_file.stat.size }}"
+          - "Fetched to: {{ local_fetch_dir }}/"
@@ -0,0 +1,126 @@
+---
+- name: Initialize Slurm accounting entities
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Wait for sacctmgr connectivity
+      ansible.builtin.command:
+        cmd: sacctmgr -n list cluster
+      register: sacctmgr_cluster_list
+      retries: 20
+      delay: 2
+      until: sacctmgr_cluster_list.rc == 0
+      changed_when: false
+
+    - name: Show current accounting state before changes
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### clusters"
+        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
+
+        echo
+        echo "### accounts"
+        sacctmgr list account format=Account,Descr,Org
+
+        echo
+        echo "### users"
+        sacctmgr list user format=User,DefaultAccount,Admin
+
+        echo
+        echo "### associations"
+        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
+      args:
+        executable: /bin/bash
+      register: accounting_state_before
+      changed_when: false
+
+    - name: Print current accounting state before changes
+      ansible.builtin.debug:
+        var: accounting_state_before.stdout_lines
+
+    - name: Ensure Slurm cluster exists in accounting DB
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        if sacctmgr -n list cluster format=Cluster | awk '{print $1}' | grep -qx "{{ slurm_cluster_name }}"; then
+          echo "Cluster {{ slurm_cluster_name }} already exists"
+        else
+          sacctmgr -i add cluster {{ slurm_cluster_name }}
+        fi
+      args:
+        executable: /bin/bash
+      register: ensure_cluster
+      changed_when: "'Adding Cluster' in ensure_cluster.stdout"
+
+    - name: Ensure default lab account exists for cluster
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="" {found=1} END {exit !found}'; then
+          echo "Account {{ slurm_account_name }} already associated with cluster {{ slurm_cluster_name }}"
+        else
+          sacctmgr -i add account {{ slurm_account_name }} \
+            Cluster={{ slurm_cluster_name }} \
+            Description="{{ slurm_account_description }}" \
+            Organization="{{ slurm_account_organization }}"
+        fi
+      args:
+        executable: /bin/bash
+      register: ensure_account
+      changed_when: "'Adding Account' in ensure_account.stdout"
+
+    - name: Ensure slurmuser exists with lab account association
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="slurmuser" {found=1} END {exit !found}'; then
+          echo "User slurmuser already associated with account {{ slurm_account_name }} on cluster {{ slurm_cluster_name }}"
+        else
+          sacctmgr -i add user slurmuser \
+            Cluster={{ slurm_cluster_name }} \
+            Account={{ slurm_account_name }} \
+            DefaultAccount={{ slurm_account_name }}
+        fi
+      args:
+        executable: /bin/bash
+      register: ensure_user_assoc
+      changed_when: "'Adding User' in ensure_user_assoc.stdout"
+
+    - name: Ensure slurmuser has default account set
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
+      args:
+        executable: /bin/bash
+      register: set_default_account
+      changed_when: "'Modified user' in (set_default_account.stdout + set_default_account.stderr)"
+
+    - name: Show final accounting state
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### clusters"
+        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
+
+        echo
+        echo "### accounts"
+        sacctmgr list account format=Account,Descr,Org
+
+        echo
+        echo "### users"
+        sacctmgr list user format=User,DefaultAccount,Admin
+
+        echo
+        echo "### associations"
+        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
+      args:
+        executable: /bin/bash
+      register: accounting_state_after
+      changed_when: false
+
+    - name: Print final accounting state
+      ansible.builtin.debug:
+        var: accounting_state_after.stdout_lines
@@ -0,0 +1,98 @@
+---
+- name: Restore-check latest SlurmDBD backup into test database
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  vars:
+    restore_check_db: "{{ slurmdbd_storage_loc }}_restorecheck"
+    slurmdbd_backup_dir: /var/backups/slurmdbd
+
+  tasks:
+    - name: Validate MariaDB is running
+      ansible.builtin.command:
+        cmd: systemctl is-active mariadb
+      changed_when: false
+
+    - name: Find latest SlurmDBD backup
+      ansible.builtin.shell: |
+        set -euo pipefail
+        ls -1t {{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-*.sql.gz | head -n 1
+      args:
+        executable: /bin/bash
+      register: latest_backup
+      changed_when: false
+
+    - name: Validate latest backup exists
+      ansible.builtin.stat:
+        path: "{{ latest_backup.stdout }}"
+      register: latest_backup_stat
+
+    - name: Fail if latest backup is missing or empty
+      ansible.builtin.fail:
+        msg: "Latest SlurmDBD backup is missing or empty: {{ latest_backup.stdout }}"
+      when:
+        - not latest_backup_stat.stat.exists or latest_backup_stat.stat.size | int < 1024
+
+    - name: Recreate restore-check database
+      ansible.builtin.shell: |
+        set -euo pipefail
+        mysql <<SQL
+        DROP DATABASE IF EXISTS {{ restore_check_db }};
+        CREATE DATABASE {{ restore_check_db }};
+        SQL
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Import backup into restore-check database
+      ansible.builtin.shell: |
+        set -euo pipefail
+        zcat "{{ latest_backup.stdout }}" | mysql {{ restore_check_db }}
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Validate restored table count
+      ansible.builtin.shell: |
+        set -euo pipefail
+        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
+      args:
+        executable: /bin/bash
+      register: restored_tables
+      changed_when: false
+      failed_when: restored_tables.stdout | int < 1
+
+    - name: Validate restored row count sample
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### restored database"
+        echo "{{ restore_check_db }}"
+
+        echo
+        echo "### table count"
+        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
+
+        echo
+        echo "### largest tables"
+        mysql -N -B -e "
+          SELECT table_name, table_rows
+          FROM information_schema.tables
+          WHERE table_schema='{{ restore_check_db }}'
+          ORDER BY table_rows DESC
+          LIMIT 10;
+        "
+      args:
+        executable: /bin/bash
+      register: restore_check_summary
+      changed_when: false
+
+    - name: Show restore-check result
+      ansible.builtin.debug:
+        msg:
+          - "Imported backup: {{ latest_backup.stdout }}"
+          - "Restore-check DB: {{ restore_check_db }}"
+          - "Restored tables: {{ restored_tables.stdout }}"
+          - "Summary:"
+          - "{{ restore_check_summary.stdout_lines }}"
@@ -0,0 +1,105 @@
+---
+- name: Install and configure MariaDB for SlurmDBD
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Install MariaDB and SlurmDBD packages
+      ansible.builtin.apt:
+        name:
+          - mariadb-server
+          - mariadb-client
+          - slurmdbd
+          - slurm-wlm-mysql-plugin
+        state: present
+        update_cache: true
+
+    - name: Ensure MariaDB is enabled and running
+      ansible.builtin.systemd:
+        name: mariadb
+        enabled: true
+        state: started
+
+    - name: Ensure Slurm log directory exists
+      ansible.builtin.file:
+        path: /var/log/slurm
+        state: directory
+        owner: slurm
+        group: slurm
+        mode: "0755"
+
+    - name: Create Slurm accounting database and DB user
+      ansible.builtin.shell: |
+        set -euo pipefail
+        mysql <<SQL
+        CREATE DATABASE IF NOT EXISTS {{ slurmdbd_storage_loc }};
+        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'localhost' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
+        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'127.0.0.1' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
+        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'localhost';
+        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'127.0.0.1';
+        FLUSH PRIVILEGES;
+        SQL
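+      args:
+        executable: /bin/bash
+      # The rendered script contains the database password; keep it out of task output and logs.
+      no_log: true
+      changed_when: true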
+
+    - name: Ensure /etc/slurm exists
+      ansible.builtin.file:
+        path: /etc/slurm
+        state: directory
+        owner: root
+        group: root
+        mode: "0755"
+
+    - name: Deploy slurmdbd.conf
+      ansible.builtin.template:
+        src: ../../templates/slurmdbd.conf.j2
+        dest: /etc/slurm/slurmdbd.conf
+        owner: slurm
+        group: slurm
+        mode: "0600"
+      notify:
+        - Restart slurmdbd
+
+    - name: Ensure slurmdbd is enabled and running
+      ansible.builtin.systemd:
+        name: slurmdbd
+        enabled: true
+        state: started
+
+    - name: Flush handlers before validation
+      ansible.builtin.meta: flush_handlers
+
+    - name: Validate slurmdbd service is active
+      ansible.builtin.command:
+        cmd: systemctl is-active slurmdbd
+      register: slurmdbd_active
+      retries: 10
+      delay: 2
+      until: slurmdbd_active.stdout == "active"
+      changed_when: false
+
+    - name: Validate slurmdbd is listening on port
+      ansible.builtin.shell: |
+        set -euo pipefail
+        ss -lntp | grep ':{{ slurmdbd_port }} '
+      args:
+        executable: /bin/bash
+      register: slurmdbd_port_check
+      retries: 10
+      delay: 2
+      until: slurmdbd_port_check.rc == 0
+      changed_when: false
+
+    - name: Show slurmdbd service validation
+      ansible.builtin.debug:
+        msg:
+          - "slurmdbd is active"
+          - "{{ slurmdbd_port_check.stdout_lines }}"
+
+  handlers:
+    - name: Restart slurmdbd
+      ansible.builtin.systemd:
+        name: slurmdbd
+        state: restarted
@@ -0,0 +1,178 @@
+---
+- name: Validate Slurm accounting production-like setup
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Validate accounting services
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### services"
+        systemctl is-active mariadb
+        systemctl is-active slurmdbd
+        systemctl is-active slurmctld
+
+        echo
+        echo "### slurmdbd listener"
+        ss -lntp | grep ':{{ slurmdbd_port }} '
+      args:
+        executable: /bin/bash
+      register: service_check
+      changed_when: false
+
+    - name: Validate Slurm accounting runtime config
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### accounting config"
+        scontrol show config | grep -E "AccountingStorage|JobAcctGather|ClusterName"
+
+        echo
+        echo "### priority / select / cgroup config"
+        scontrol show config | grep -E "PriorityType|SelectType|TaskPlugin|ProctrackType"
+      args:
+        executable: /bin/bash
+      register: config_check
+      changed_when: false
+
+    - name: Validate sacctmgr entities
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### clusters"
+        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
+
+        echo
+        echo "### accounts"
+        sacctmgr list account format=Account,Descr,Org
+
+        echo
+        echo "### users"
+        sacctmgr list user format=User,DefaultAccount,Admin
+
+        echo
+        echo "### associations"
+        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
+      args:
+        executable: /bin/bash
+      register: entity_check
+      changed_when: false
+
+    - name: Submit accounting validation job
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=acct-prodlike-test
+        #SBATCH --partition=debug
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=256M
+        #SBATCH --time=00:02:00
+        #SBATCH --output=/shared/acct-prodlike-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "USER=$(whoami)"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 90); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/acct-prodlike-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: acct_job
+      changed_when: true
+
+    - name: Validate sacct can read recent jobs
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### recent jobs"
+        sacct -S today --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
+      args:
+        executable: /bin/bash
+      register: sacct_recent
+      changed_when: false
+
+    - name: Validate sreport commands
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### cluster utilization"
+        sreport cluster utilization start=today || true
+
+        echo
+        echo "### account utilization by user"
+        sreport cluster AccountUtilizationByUser start=today || true
+
+        echo
+        echo "### user top"
+        sreport user top start=today || true
+      args:
+        executable: /bin/bash
+      register: sreport_check
+      changed_when: false
+
+    - name: Validate MariaDB table health summary
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### database exists"
+        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';"
+
+        echo
+        echo "### table count"
+        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ slurmdbd_storage_loc }}';"
+
+        echo
+        echo "### largest tables"
+        mysql -N -B -e "
+          SELECT table_name, table_rows
+          FROM information_schema.tables
+          WHERE table_schema='{{ slurmdbd_storage_loc }}'
+          ORDER BY table_rows DESC
+          LIMIT 10;
+        "
+      args:
+        executable: /bin/bash
+      register: db_health
+      changed_when: false
+
+    - name: Print accounting validation
+      ansible.builtin.debug:
+        msg:
+          - "### services"
+          - "{{ service_check.stdout_lines }}"
+          - "### runtime config"
+          - "{{ config_check.stdout_lines }}"
+          - "### accounting entities"
+          - "{{ entity_check.stdout_lines }}"
+          - "### accounting validation job"
+          - "{{ acct_job.stdout_lines }}"
+          - "### recent sacct data"
+          - "{{ sacct_recent.stdout_lines }}"
+          - "### sreport"
+          - "{{ sreport_check.stdout_lines }}"
+          - "### database health"
+          - "{{ db_health.stdout_lines }}"
@@ -0,0 +1,83 @@
+---
+- name: Backup Slurm and Munge state on all cluster nodes
+  hosts: slurm_cluster
+  become: true
+  gather_facts: true
+
+  vars:
+    backup_base_dir: /var/backups/slurm
+
+  tasks:
+    - name: Create backup base directory
+      ansible.builtin.file:
+        path: "{{ backup_base_dir }}"
+        state: directory
+        owner: root
+        group: root
+        mode: "0700"
+
+    - name: Create timestamped backup directory
+      ansible.builtin.shell: |
+        set -euo pipefail
+        ts="$(date +%F-%H%M%S)"
+        dir="{{ backup_base_dir }}/$ts"
+        mkdir -p "$dir"
+        echo "$dir"
+      args:
+        executable: /bin/bash
+      register: backup_dir_result
+      changed_when: true
+
+    - name: Store backup directory fact
+      ansible.builtin.set_fact:
+        node_backup_dir: "{{ backup_dir_result.stdout }}"
+
+    - name: Backup Slurm and Munge config/state if present
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        backup_dir="{{ node_backup_dir }}"
+
+        for p in \
+          /etc/slurm \
+          /etc/slurm-llnl \
+          /etc/munge \
+          /var/spool/slurmctld \
+          /var/spool/slurmd \
+          /var/log/slurm \
+          /var/log/slurm-llnl
+        do
+          if [ -e "$p" ]; then
+            cp -a "$p" "$backup_dir/"
+          fi
+        done
+
+        systemctl status munge --no-pager > "$backup_dir/systemctl-munge.txt" 2>&1 || true
+        systemctl status slurmctld --no-pager > "$backup_dir/systemctl-slurmctld.txt" 2>&1 || true
+        systemctl status slurmd --no-pager > "$backup_dir/systemctl-slurmd.txt" 2>&1 || true
+
+        journalctl -u munge -n 200 --no-pager > "$backup_dir/journal-munge.txt" 2>&1 || true
+        journalctl -u slurmctld -n 200 --no-pager > "$backup_dir/journal-slurmctld.txt" 2>&1 || true
+        journalctl -u slurmd -n 200 --no-pager > "$backup_dir/journal-slurmd.txt" 2>&1 || true
+
+        if command -v sinfo >/dev/null 2>&1; then
+          sinfo > "$backup_dir/sinfo.txt" 2>&1 || true
+        fi
+
+        if command -v scontrol >/dev/null 2>&1; then
+          scontrol show config > "$backup_dir/scontrol-show-config.txt" 2>&1 || true
+          scontrol show nodes > "$backup_dir/scontrol-show-nodes.txt" 2>&1 || true
+          scontrol show partitions > "$backup_dir/scontrol-show-partitions.txt" 2>&1 || true
+        fi
+
+        find "$backup_dir" -maxdepth 2 \( -type f -o -type d \)
+      args:
+        executable: /bin/bash
+      register: backup_content
+      changed_when: true
+
+    - name: Show backup location on node
+      ansible.builtin.debug:
+        msg:
+          - "Host: {{ inventory_hostname }}"
+          - "Backup directory: {{ node_backup_dir }}"
@@ -0,0 +1,46 @@
+---
+- name: Fetch latest Slurm backups from nodes to pvef
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  vars:
+    remote_backup_base: /var/backups/slurm
+    local_backup_base: "{{ playbook_dir }}/../../artifacts/backups"
+
+  tasks:
+    - name: Find latest remote backup directory
+      ansible.builtin.shell: |
+        set -euo pipefail
+        ls -1dt {{ remote_backup_base }}/* | head -n 1
+      args:
+        executable: /bin/bash
+      register: latest_backup_dir
+      changed_when: false
+
+    - name: Create local backup directory on pvef
+      ansible.builtin.file:
+        path: "{{ local_backup_base }}/{{ inventory_hostname }}"
+        state: directory
+        mode: "0700"
+      delegate_to: localhost
+      become: false
+
+    - name: Archive latest backup directory on remote node
+      community.general.archive:
+        path: "{{ latest_backup_dir.stdout }}"
+        dest: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
+        format: gz
+        force_archive: true
+        mode: "0600"
+
+    - name: Fetch archive to pvef
+      ansible.builtin.fetch:
+        src: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
+        dest: "{{ local_backup_base }}/{{ inventory_hostname }}/"
+        flat: true
+
+    - name: Remove temporary remote archive
+      ansible.builtin.file:
+        path: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
+        state: absent
@@ -0,0 +1,58 @@
+---
+- name: Bootstrap Ansible SSH access from pvef to Slurm nodes
+  hosts: slurm_cluster
+  gather_facts: false
+  become: true
+
+  vars:
+    ansible_controller_pubkey: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_ed25519.pub') }}"
+
+  pre_tasks:
+    - name: Wait for SSH
+      ansible.builtin.wait_for_connection:
+        timeout: 30
+
+    - name: Install Python if missing - Debian/Ubuntu
+      ansible.builtin.raw: |
+        test -e /usr/bin/python3 || (apt-get update && apt-get install -y python3)
+      changed_when: false
+
+  tasks:
+    - name: Ensure sudo and OpenSSH server are installed
+      ansible.builtin.apt:
+        name:
+          - sudo
+          - openssh-server
+        state: present
+        update_cache: true
+
+    - name: Ensure SSH server is enabled and running
+      ansible.builtin.service:
+        name: ssh
+        state: started
+        enabled: true
+
+    - name: Ensure .ssh directory exists for login user
+      ansible.builtin.file:
+        path: "/home/{{ ansible_user }}/.ssh"
+        state: directory
+        owner: "{{ ansible_user }}"
+        group: "{{ ansible_user }}"
+        mode: "0700"
+
+    - name: Add pvef root public key to login user's authorized_keys
+      ansible.posix.authorized_key:
+        user: "{{ ansible_user }}"
+        key: "{{ ansible_controller_pubkey }}"
+        state: present
+        manage_dir: true
+
+    - name: Allow bootstrap login user passwordless sudo
+      ansible.builtin.copy:
+        dest: "/etc/sudoers.d/90-ansible-{{ ansible_user }}"
+        owner: root
+        group: root
+        mode: "0440"
+        content: |
+          {{ ansible_user }} ALL=(ALL) NOPASSWD:ALL
+        validate: "visudo -cf %s"
@@ -0,0 +1,16 @@
+---
+- name: Configure /etc/hosts for Slurm cluster
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Add Slurm cluster hosts to /etc/hosts
+      ansible.builtin.blockinfile:
+        path: /etc/hosts
+        marker: "# {mark} ANSIBLE MANAGED SLURM CLUSTER HOSTS"
+        block: |
+          {{ slurm_control_addr }} {{ slurm_control_machine }}
+          {% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
+          {{ node.addr }} {{ node.name }}
+          {% endfor %}
@@ -0,0 +1,218 @@
+---
+- name: Create slurmuser and generate SSH keys on every Slurm node
+  hosts: slurm_cluster
+  become: true
+  gather_facts: true
+
+  vars:
+    slurm_operator_user: slurmuser
+    slurm_operator_shell: /bin/bash
+
+  tasks:
+    - name: Ensure useful packages are installed
+      ansible.builtin.apt:
+        name:
+          - sudo
+          - openssh-client
+          - openssh-server
+          - acl
+        state: present
+        update_cache: true
+
+    - name: Ensure slurmuser exists
+      ansible.builtin.user:
+        name: "{{ slurm_operator_user }}"
+        shell: "{{ slurm_operator_shell }}"
+        create_home: true
+        state: present
+
+    - name: Ensure .ssh directory exists for slurmuser
+      ansible.builtin.file:
+        path: "/home/{{ slurm_operator_user }}/.ssh"
+        state: directory
+        owner: "{{ slurm_operator_user }}"
+        group: "{{ slurm_operator_user }}"
+        mode: "0700"
+
+    - name: Generate SSH key for slurmuser if missing
+      community.crypto.openssh_keypair:
+        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
+        type: ed25519
+        owner: "{{ slurm_operator_user }}"
+        group: "{{ slurm_operator_user }}"
+        mode: "0600"
+        comment: "{{ slurm_operator_user }}@{{ inventory_hostname }}"
+        force: false
+
+    - name: Read public key from each node
+      ansible.builtin.slurp:
+        src: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
+      register: slurmuser_pubkey_raw
+
+    - name: Store decoded public key as host fact
+      ansible.builtin.set_fact:
+        slurmuser_pubkey: "{{ slurmuser_pubkey_raw.content | b64decode | trim }}"
+
+
+- name: Exchange slurmuser SSH keys across all Slurm nodes
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  vars:
+    slurm_operator_user: slurmuser
+
+  tasks:
+    - name: Install all slurmuser public keys into authorized_keys on every node
+      ansible.posix.authorized_key:
+        user: "{{ slurm_operator_user }}"
+        key: "{{ hostvars[item].slurmuser_pubkey }}"
+        state: present
+        manage_dir: true
+      loop: "{{ groups['slurm_cluster'] }}"
+
+    - name: Build SSH known_hosts entries for all cluster nodes
+      ansible.builtin.shell: |
+        set -e
+        mkdir -p /home/{{ slurm_operator_user }}/.ssh
+        touch /home/{{ slurm_operator_user }}/.ssh/known_hosts
+
+        {% for host in groups['slurm_cluster'] %}
+        ssh-keyscan {{ host }} {{ hostvars[host].ansible_host | default(host) }} 2>/dev/null >> /home/{{ slurm_operator_user }}/.ssh/known_hosts || true
+        {% endfor %}
+
+        sort -u /home/{{ slurm_operator_user }}/.ssh/known_hosts -o /home/{{ slurm_operator_user }}/.ssh/known_hosts
+        chown {{ slurm_operator_user }}:{{ slurm_operator_user }} /home/{{ slurm_operator_user }}/.ssh/known_hosts
+        chmod 0644 /home/{{ slurm_operator_user }}/.ssh/known_hosts
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Ensure SSH permissions are correct
+      ansible.builtin.file:
+        path: "/home/{{ slurm_operator_user }}/.ssh"
+        state: directory
+        owner: "{{ slurm_operator_user }}"
+        group: "{{ slurm_operator_user }}"
+        mode: "0700"
+
+    - name: Ensure private key permissions are correct
+      ansible.builtin.file:
+        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
+        owner: "{{ slurm_operator_user }}"
+        group: "{{ slurm_operator_user }}"
+        mode: "0600"
+
+    - name: Ensure public key permissions are correct
+      ansible.builtin.file:
+        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
+        owner: "{{ slurm_operator_user }}"
+        group: "{{ slurm_operator_user }}"
+        mode: "0644"
+
+
+- name: Configure sudo permissions for slurmuser
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  vars:
+    slurm_operator_user: slurmuser
+
+  tasks:
+    - name: Configure sudoers for slurmuser on Slurm controller
+      ansible.builtin.copy:
+        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
+        owner: root
+        group: root
+        mode: "0440"
+        content: |
+          # Managed by Ansible
+          # Operator access for Slurm controller node.
+          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
+            /bin/systemctl status slurmctld, \
+            /bin/systemctl restart slurmctld, \
+            /bin/systemctl reload slurmctld, \
+            /bin/systemctl stop slurmctld, \
+            /bin/systemctl start slurmctld, \
+            /bin/systemctl status slurmd, \
+            /bin/systemctl restart slurmd, \
+            /bin/systemctl reload slurmd, \
+            /bin/systemctl stop slurmd, \
+            /bin/systemctl start slurmd, \
+            /bin/journalctl -u slurmctld, \
+            /bin/journalctl -u slurmd, \
+            /usr/bin/scontrol, \
+            /usr/bin/sinfo, \
+            /usr/bin/squeue, \
+            /usr/bin/scancel, \
+            /usr/bin/sacct, \
+            /usr/bin/sacctmgr, \
+            /usr/bin/sbatch, \
+            /usr/bin/srun, \
+            /usr/bin/salloc
+        validate: "visudo -cf %s"
+      when: inventory_hostname in groups['slurm_controller']
+
+    - name: Configure sudoers for slurmuser on Slurm compute and GPU nodes
+      ansible.builtin.copy:
+        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
+        owner: root
+        group: root
+        mode: "0440"
+        content: |
+          # Managed by Ansible
+          # Operator access for Slurm worker/GPU nodes.
+          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
+            /bin/systemctl status slurmd, \
+            /bin/systemctl restart slurmd, \
+            /bin/systemctl reload slurmd, \
+            /bin/systemctl stop slurmd, \
+            /bin/systemctl start slurmd, \
+            /bin/journalctl -u slurmd, \
+            /usr/bin/scontrol, \
+            /usr/bin/sinfo, \
+            /usr/bin/squeue, \
+            /usr/bin/scancel, \
+            /usr/bin/sacct, \
+            /usr/bin/sbatch, \
+            /usr/bin/srun, \
+            /usr/bin/salloc
+        validate: "visudo -cf %s"
+      when: inventory_hostname not in groups['slurm_controller']
+
+
+- name: Validate slurmuser SSH mesh and Slurm access
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  vars:
+    slurm_operator_user: slurmuser
+
+  tasks:
+    - name: Test local Slurm commands as slurmuser
+      ansible.builtin.command: sinfo
+      register: sinfo_test
+      changed_when: false
+      become_user: "{{ slurm_operator_user }}"
+
+    - name: Show sinfo result
+      ansible.builtin.debug:
+        var: sinfo_test.stdout_lines
+
+    - name: Test SSH from each node to every other node as slurmuser
+      ansible.builtin.shell: |
+        set -e
+        {% for host in groups['slurm_cluster'] %}
+        ssh -o BatchMode=yes -o ConnectTimeout=5 {{ host }} 'hostname'
+        {% endfor %}
+      args:
+        executable: /bin/bash
+      become_user: "{{ slurm_operator_user }}"
+      register: ssh_mesh_test
+      changed_when: false
+
+    - name: Show SSH mesh test result
+      ansible.builtin.debug:
+        var: ssh_mesh_test.stdout_lines
@@ -0,0 +1,112 @@
+---
+- name: Fix sudo permissions for slurmuser Slurm operations
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  vars:
+    slurm_operator_user: slurmuser
+
+  tasks:
+    - name: Configure sudoers for slurmuser on controller
+      ansible.builtin.copy:
+        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
+        owner: root
+        group: root
+        mode: "0440"
+        content: |
+          # Managed by Ansible
+
+          Cmnd_Alias SLURM_SYSTEMCTL_CONTROLLER = \
+            /bin/systemctl status slurmctld, \
+            /bin/systemctl status slurmctld *, \
+            /bin/systemctl restart slurmctld, \
+            /bin/systemctl reload slurmctld, \
+            /bin/systemctl start slurmctld, \
+            /bin/systemctl stop slurmctld, \
+            /bin/systemctl status slurmd, \
+            /bin/systemctl status slurmd *, \
+            /bin/systemctl restart slurmd, \
+            /bin/systemctl reload slurmd, \
+            /bin/systemctl start slurmd, \
+            /bin/systemctl stop slurmd, \
+            /usr/bin/systemctl status slurmctld, \
+            /usr/bin/systemctl status slurmctld *, \
+            /usr/bin/systemctl restart slurmctld, \
+            /usr/bin/systemctl reload slurmctld, \
+            /usr/bin/systemctl start slurmctld, \
+            /usr/bin/systemctl stop slurmctld, \
+            /usr/bin/systemctl status slurmd, \
+            /usr/bin/systemctl status slurmd *, \
+            /usr/bin/systemctl restart slurmd, \
+            /usr/bin/systemctl reload slurmd, \
+            /usr/bin/systemctl start slurmd, \
+            /usr/bin/systemctl stop slurmd
+
+          Cmnd_Alias SLURM_JOURNAL_CONTROLLER = \
+            /bin/journalctl -u slurmctld, \
+            /bin/journalctl -u slurmctld *, \
+            /bin/journalctl -u slurmd, \
+            /bin/journalctl -u slurmd *, \
+            /usr/bin/journalctl -u slurmctld, \
+            /usr/bin/journalctl -u slurmctld *, \
+            /usr/bin/journalctl -u slurmd, \
+            /usr/bin/journalctl -u slurmd *
+
+          Cmnd_Alias SLURM_COMMANDS = \
+            /usr/bin/scontrol, /usr/bin/scontrol *, \
+            /usr/bin/sinfo, /usr/bin/sinfo *, \
+            /usr/bin/squeue, /usr/bin/squeue *, \
+            /usr/bin/scancel, /usr/bin/scancel *, \
+            /usr/bin/sacct, /usr/bin/sacct *, \
+            /usr/bin/sacctmgr, /usr/bin/sacctmgr *, \
+            /usr/bin/sbatch, /usr/bin/sbatch *, \
+            /usr/bin/srun, /usr/bin/srun *, \
+            /usr/bin/salloc, /usr/bin/salloc *
+
+          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_CONTROLLER, SLURM_JOURNAL_CONTROLLER, SLURM_COMMANDS
+        validate: "visudo -cf %s"
+      when: inventory_hostname in groups['slurm_controller']
+
+    - name: Configure sudoers for slurmuser on compute and GPU nodes
+      ansible.builtin.copy:
+        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
+        owner: root
+        group: root
+        mode: "0440"
+        content: |
+          # Managed by Ansible
+
+          Cmnd_Alias SLURM_SYSTEMCTL_COMPUTE = \
+            /bin/systemctl status slurmd, \
+            /bin/systemctl status slurmd *, \
+            /bin/systemctl restart slurmd, \
+            /bin/systemctl reload slurmd, \
+            /bin/systemctl start slurmd, \
+            /bin/systemctl stop slurmd, \
+            /usr/bin/systemctl status slurmd, \
+            /usr/bin/systemctl status slurmd *, \
+            /usr/bin/systemctl restart slurmd, \
+            /usr/bin/systemctl reload slurmd, \
+            /usr/bin/systemctl start slurmd, \
+            /usr/bin/systemctl stop slurmd
+
+          Cmnd_Alias SLURM_JOURNAL_COMPUTE = \
+            /bin/journalctl -u slurmd, \
+            /bin/journalctl -u slurmd *, \
+            /usr/bin/journalctl -u slurmd, \
+            /usr/bin/journalctl -u slurmd *
+
+          Cmnd_Alias SLURM_COMMANDS = \
+            /usr/bin/scontrol, /usr/bin/scontrol *, \
+            /usr/bin/sinfo, /usr/bin/sinfo *, \
+            /usr/bin/squeue, /usr/bin/squeue *, \
+            /usr/bin/scancel, /usr/bin/scancel *, \
+            /usr/bin/sacct, /usr/bin/sacct *, \
+            /usr/bin/sbatch, /usr/bin/sbatch *, \
+            /usr/bin/srun, /usr/bin/srun *, \
+            /usr/bin/salloc, /usr/bin/salloc *
+
+          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_COMPUTE, SLURM_JOURNAL_COMPUTE, SLURM_COMMANDS
+        validate: "visudo -cf %s"
+      when: inventory_hostname not in groups['slurm_controller']
@@ -0,0 +1,133 @@
+---
+- name: Read Munge key from Slurm controller
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Check controller munge.key exists
+      ansible.builtin.stat:
+        path: /etc/munge/munge.key
+      register: controller_munge_key
+
+    - name: Fail if controller munge.key is missing
+      ansible.builtin.fail:
+        msg: "/etc/munge/munge.key is missing on controller. Do not continue."
+      when: not controller_munge_key.stat.exists
+
+    - name: Read controller munge.key
+      ansible.builtin.slurp:
+        src: /etc/munge/munge.key
+      register: controller_munge_key_raw
+
+    - name: Store controller Munge key as fact
+      ansible.builtin.set_fact:
+        cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
+
+
+- name: Deploy controller Munge key to all Slurm nodes
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  vars:
+    controller_host: "{{ groups['slurm_controller'][0] }}"
+
+  tasks:
+    - name: Ensure munge package is installed
+      ansible.builtin.apt:
+        name:
+          - munge
+          - libmunge2
+        state: present
+        update_cache: true
+
+    - name: Ensure munge group exists
+      ansible.builtin.group:
+        name: munge
+        system: true
+        state: present
+
+    - name: Ensure munge user exists
+      ansible.builtin.user:
+        name: munge
+        group: munge
+        system: true
+        shell: /usr/sbin/nologin
+        home: /nonexistent
+        create_home: false
+        state: present
+
+    - name: Ensure /etc/munge exists
+      ansible.builtin.file:
+        path: /etc/munge
+        state: directory
+        owner: munge
+        group: munge
+        mode: "0700"
+
+    - name: Deploy shared munge.key from controller
+      ansible.builtin.copy:
+        dest: /etc/munge/munge.key
+        content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
+        owner: munge
+        group: munge
+        mode: "0400"
+      notify:
+        - Restart munge
+
+    - name: Ensure /var/log/munge exists
+      ansible.builtin.file:
+        path: /var/log/munge
+        state: directory
+        owner: munge
+        group: munge
+        mode: "0755"
+
+    - name: Ensure /var/lib/munge exists
+      ansible.builtin.file:
+        path: /var/lib/munge
+        state: directory
+        owner: munge
+        group: munge
+        mode: "0711"
+
+    - name: Ensure /run/munge exists
+      ansible.builtin.file:
+        path: /run/munge
+        state: directory
+        owner: munge
+        group: munge
+        mode: "0755"
+
+    - name: Ensure munge is enabled and running
+      ansible.builtin.systemd:
+        name: munge
+        enabled: true
+        state: started
+
+  handlers:
+    - name: Restart munge
+      ansible.builtin.systemd:
+        name: munge
+        state: restarted
+
+
+- name: Validate Munge locally on all nodes
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Test local munge encode/decode
+      ansible.builtin.shell: |
+        set -euo pipefail
+        munge -n | unmunge
+      args:
+        executable: /bin/bash
+      register: munge_local_test
+      changed_when: false
+
+    - name: Show local Munge validation
+      ansible.builtin.debug:
+        var: munge_local_test.stdout_lines
@@ -0,0 +1,132 @@
+---
+- name: Prepare Slurm config directories and logs
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Ensure Slurm config directory exists
+      ansible.builtin.file:
+        path: "{{ slurm_config_dir }}"
+        state: directory
+        owner: root
+        group: root
+        mode: "0755"
+
+    - name: Ensure Slurm log directory exists
+      ansible.builtin.file:
+        path: /var/log/slurm
+        state: directory
+        owner: slurm
+        group: slurm
+        mode: "0755"
+
+    - name: Ensure slurmctld spool directory exists on controller
+      ansible.builtin.file:
+        path: /var/spool/slurmctld
+        state: directory
+        owner: slurm
+        group: slurm
+        mode: "0755"
+      when: inventory_hostname in groups['slurm_controller']
+
+    - name: Ensure slurmd spool directory exists on workers
+      ansible.builtin.file:
+        path: /var/spool/slurmd
+        state: directory
+        owner: slurm
+        group: slurm
+        mode: "0755"
+      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
+
+
+- name: Deploy Slurm config files
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Backup current slurm.conf before managed deployment
+      ansible.builtin.copy:
+        src: "{{ slurm_config_dir }}/slurm.conf"
+        dest: "{{ slurm_config_dir }}/slurm.conf.pre-ansible-managed"
+        remote_src: true
+        owner: root
+        group: root
+        mode: "0644"
+        force: false
+
+    - name: Deploy managed slurm.conf
+      ansible.builtin.template:
+        src: ../../templates/slurm.conf.j2
+        dest: "{{ slurm_config_dir }}/slurm.conf"
+        owner: root
+        group: root
+        mode: "0644"
+      notify:
+        - Reconfigure slurmctld
+        - Restart slurmd
+
+    - name: Deploy managed cgroup.conf
+      ansible.builtin.template:
+        src: ../../templates/cgroup.conf.j2
+        dest: "{{ slurm_config_dir }}/cgroup.conf"
+        owner: root
+        group: root
+        mode: "0644"
+      when: slurm_enable_cgroup | default(false) | bool
+      notify:
+        - Reconfigure slurmctld
+        - Restart slurmd
+
+    - name: Deploy managed gres.conf only on GPU nodes
+      ansible.builtin.template:
+        src: ../../templates/gres.conf.j2
+        dest: "{{ slurm_config_dir }}/gres.conf"
+        owner: root
+        group: root
+        mode: "0644"
+      when: inventory_hostname in groups['slurm_gpu']
+      notify:
+        - Reconfigure slurmctld
+        - Restart slurmd
+
+  handlers:
+    - name: Reconfigure slurmctld
+      ansible.builtin.command:
+        cmd: scontrol reconfigure
+      when: inventory_hostname in groups['slurm_controller']
+      changed_when: true
+
+    - name: Restart slurmd
+      ansible.builtin.systemd:
+        name: slurmd
+        state: restarted
+      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
+
+
+- name: Validate Slurm after config deployment
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Reconfigure controller
+      ansible.builtin.command:
+        cmd: scontrol reconfigure
+      changed_when: true
+
+    - name: Validate cluster state
+      ansible.builtin.shell: |
+        set -euo pipefail
+        scontrol ping
+        sinfo
+        scontrol show nodes
+      args:
+        executable: /bin/bash
+      register: slurm_config_validation
+      changed_when: false
+
+    - name: Show validation output
+      ansible.builtin.debug:
+        var: slurm_config_validation.stdout_lines
@@ -0,0 +1,103 @@
+---
+- name: Restart Slurm controller safely
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Restart munge on controller
+      ansible.builtin.systemd:
+        name: munge
+        state: restarted
+        enabled: true
+
+    - name: Restart slurmctld on controller
+      ansible.builtin.systemd:
+        name: slurmctld
+        state: restarted
+        enabled: true
+
+    - name: Wait for slurmctld to answer
+      ansible.builtin.command:
+        cmd: scontrol ping
+      register: scontrol_ping
+      retries: 15
+      delay: 2
+      until: scontrol_ping.rc == 0
+      changed_when: false
+
+    - name: Show controller ping
+      ansible.builtin.debug:
+        var: scontrol_ping.stdout_lines
+
+
+- name: Restart Slurm workers safely one by one
+  hosts: slurm_compute:slurm_gpu
+  become: true
+  gather_facts: false
+  serial: 1
+
+  tasks:
+    - name: Restart munge on worker
+      ansible.builtin.systemd:
+        name: munge
+        state: restarted
+        enabled: true
+
+    - name: Restart slurmd on worker
+      ansible.builtin.systemd:
+        name: slurmd
+        state: restarted
+        enabled: true
+
+    - name: Wait for slurmd to be active
+      ansible.builtin.command:
+        cmd: systemctl is-active slurmd
+      register: slurmd_active
+      retries: 15
+      delay: 2
+      until: slurmd_active.stdout == "active"
+      changed_when: false
+
+    - name: Wait until this node is visible in Slurm
+      ansible.builtin.command:
+        cmd: scontrol show node {{ inventory_hostname }}
+      delegate_to: "{{ groups['slurm_controller'][0] }}"
+      register: node_visible
+      retries: 15
+      delay: 2
+      until: node_visible.rc == 0
+      changed_when: false
+
+
+- name: Validate Slurm after restart
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Validate Slurm cluster state
+      ansible.builtin.shell: |
+        set -euo pipefail
+        echo "### scontrol ping"
+        scontrol ping
+
+        echo
+        echo "### sinfo"
+        sinfo
+
+        echo
+        echo "### nodes"
+        scontrol show nodes
+
+        echo
+        echo "### partitions"
+        scontrol show partitions
+      args:
+        executable: /bin/bash
+      register: slurm_validation
+      changed_when: false
+
+    - name: Show Slurm validation
+      ansible.builtin.debug:
+        var: slurm_validation.stdout_lines
@@ -0,0 +1,40 @@
+---
+- name: Discover node resources for Slurm config
+  hosts: slurm_cluster
+  become: true
+  gather_facts: true
+
+  tasks:
+    - name: Discover CPU and memory
+      ansible.builtin.shell: |
+        set -euo pipefail
+        echo "HOST={{ inventory_hostname }}"
+        echo "CPUS=$(nproc)"
+        echo "REAL_MEMORY_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)"
+        echo "SOCKETS=$(lscpu | awk -F: '/Socket\(s\)/ {gsub(/ /,"",$2); print $2}')"
+        echo "CORES_PER_SOCKET=$(lscpu | awk -F: '/Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')"
+        echo "THREADS_PER_CORE=$(lscpu | awk -F: '/Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')"
+      args:
+        executable: /bin/bash
+      register: cpu_mem
+      changed_when: false
+
+    - name: Discover NVIDIA GPU if present
+      ansible.builtin.shell: |
+        set -euo pipefail
+        if command -v nvidia-smi >/dev/null 2>&1; then
+          nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
+        else
+          echo "NO_NVIDIA_SMI"
+        fi
+      args:
+        executable: /bin/bash
+      register: gpu_info
+      changed_when: false
+
+    - name: Show discovered resources
+      ansible.builtin.debug:
+        msg:
+          - "{{ cpu_mem.stdout_lines }}"
+          - "GPU:"
+          - "{{ gpu_info.stdout_lines }}"
@@ -0,0 +1,89 @@
+---
+- name: Inspect current Slurm and Munge state
+  hosts: slurm_cluster
+  become: true
+  gather_facts: true
+
+  tasks:
+    - name: Basic host info
+      ansible.builtin.shell: |
+        set -e
+        echo "HOST=$(hostname -f 2>/dev/null || hostname)"
+        echo "SHORT_HOST=$(hostname -s)"
+        echo "IP_ADDRESSES=$(hostname -I)"
+        echo "OS=$(lsb_release -ds 2>/dev/null || sed -n 's/^PRETTY_NAME=//p' /etc/os-release | tr -d '"' || true)"
+        echo "KERNEL=$(uname -r)"
+      args:
+        executable: /bin/bash
+      register: host_info
+      changed_when: false
+
+    - name: Slurm package info
+      ansible.builtin.shell: |
+        dpkg -l | grep -Ei 'slurm|munge' || true
+      args:
+        executable: /bin/bash
+      register: package_info
+      changed_when: false
+
+    - name: Slurm config paths
+      ansible.builtin.shell: |
+        set -e
+        for p in /etc/slurm /etc/slurm-llnl /etc/munge; do
+          echo "### $p"
+          if [ -e "$p" ]; then
+            find "$p" -maxdepth 2 -type f -printf "%m %u %g %p\n" | sort
+          else
+            echo "MISSING"
+          fi
+        done
+      args:
+        executable: /bin/bash
+      register: config_paths
+      changed_when: false
+
+    - name: Service state
+      ansible.builtin.shell: |
+        for s in munge slurmctld slurmd; do
+          echo "### $s"
+          systemctl is-enabled "$s" 2>/dev/null || true
+          systemctl is-active "$s" 2>/dev/null || true
+        done
+      args:
+        executable: /bin/bash
+      register: service_state
+      changed_when: false
+
+    - name: Slurm commands
+      ansible.builtin.shell: |
+        echo "### which"
+        command -v sinfo || true
+        command -v scontrol || true
+        command -v sbatch || true
+        command -v srun || true
+        command -v munge || true
+        command -v unmunge || true
+
+        echo "### sinfo"
+        sinfo 2>&1 || true
+
+        echo "### scontrol ping"
+        scontrol ping 2>&1 || true
+      args:
+        executable: /bin/bash
+      register: slurm_commands
+      changed_when: false
+
+    - name: Show inspection report
+      ansible.builtin.debug:
+        msg:
+          - "===== {{ inventory_hostname }} :: host_info ====="
+          - "{{ host_info.stdout_lines }}"
+          - "===== {{ inventory_hostname }} :: packages ====="
+          - "{{ package_info.stdout_lines }}"
+          - "===== {{ inventory_hostname }} :: config_paths ====="
+          - "{{ config_paths.stdout_lines }}"
+          - "===== {{ inventory_hostname }} :: services ====="
+          - "{{ service_state.stdout_lines }}"
+          - "===== {{ inventory_hostname }} :: slurm_commands ====="
+          - "{{ slurm_commands.stdout_lines }}"
@@ -0,0 +1,216 @@
+---
+- name: Detect problematic Slurm nodes
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Detect nodes needing remediation
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        sinfo -N -h -o "%N %T" | awk '
+          tolower($2) ~ /down|drain|fail|unknown|not_responding|\*$/ {print $1}
+        ' | sort -u
+      args:
+        executable: /bin/bash
+      register: bad_nodes_raw
+      changed_when: false
+
+    - name: Store bad node list
+      ansible.builtin.set_fact:
+        bad_nodes: "{{ bad_nodes_raw.stdout_lines }}"
+
+    - name: Show detected problematic nodes
+      ansible.builtin.debug:
+        var: bad_nodes
+
+
+- name: Attempt auto-remediation on problematic nodes
+  hosts: slurm_compute:slurm_gpu
+  become: true
+  gather_facts: false
+  serial: 1  # remediate one node at a time so a bad fix cannot hit every worker at once
+
+  vars:
+    bad_nodes_from_controller: "{{ hostvars[groups['slurm_controller'][0]].bad_nodes | default([]) }}"
+
+  tasks:
+    - name: Skip healthy nodes
+      ansible.builtin.meta: end_host
+      when: inventory_hostname not in bad_nodes_from_controller
+
+    - name: Restart Munge
+      ansible.builtin.systemd:
+        name: munge
+        state: restarted
+        enabled: true
+
+    - name: Restart slurmd
+      ansible.builtin.systemd:
+        name: slurmd
+        state: restarted
+        enabled: true
+
+    - name: Validate local services after remediation attempt
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "HOST=$(hostname)"
+
+        echo
+        echo "### services"
+        systemctl is-active munge
+        systemctl is-active slurmd
+
+        echo
+        echo "### munge"
+        munge -n | unmunge >/dev/null
+        echo "munge OK"
+
+        echo
+        echo "### controller ping"
+        scontrol ping
+
+        echo
+        echo "### slurmd listener"
+        ss -lntp | grep ':6818 ' || true
+
+        echo
+        echo "### recent slurmd logs"
+        journalctl -u slurmd -n 30 --no-pager || true
+      args:
+        executable: /bin/bash
+      register: local_repair_check
+      changed_when: false
+
+    - name: Print local remediation result
+      ansible.builtin.debug:
+        var: local_repair_check.stdout_lines
+
+
+- name: Refresh controller and validate remediated nodes
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Restart slurmctld to refresh node states
+      ansible.builtin.systemd:
+        name: slurmctld
+        state: restarted
+
+    - name: Wait for controller
+      ansible.builtin.command:
+        cmd: scontrol ping
+      register: slurmctld_ping
+      retries: 15
+      delay: 2
+      until: slurmctld_ping.rc == 0
+      changed_when: false
+
+    - name: Clear maintenance state on previously bad nodes
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
+
+        if [ -z "$bad_nodes" ]; then
+          echo "No bad nodes detected. Nothing to clear."
+          sinfo -N
+          exit 0
+        fi
+
+        for node in $bad_nodes; do
+          echo "### clearing state on $node"
+          scontrol update NodeName="$node" State=RESUME 2>/dev/null || true
+          scontrol update NodeName="$node" State=UNDRAIN 2>/dev/null || true
+          scontrol update NodeName="$node" State=IDLE 2>/dev/null || true
+        done
+
+        sleep 5
+        sinfo -N
+      args:
+        executable: /bin/bash
+      register: clear_result
+      changed_when: bad_nodes | default([]) | length > 0
+
+    - name: Print clear-state result
+      ansible.builtin.debug:
+        var: clear_result.stdout_lines
+
+    - name: Detect nodes still unhealthy after remediation
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        sinfo -N -h -o "%N %T" | awk '
+          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
+        ' | sort -u
+      args:
+        executable: /bin/bash
+      register: still_bad_nodes_raw
+      changed_when: false
+
+    - name: Store still bad nodes
+      ansible.builtin.set_fact:
+        still_bad_nodes: "{{ still_bad_nodes_raw.stdout_lines }}"
+
+    - name: Drain nodes that remain unhealthy
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        unresolved_nodes="{{ still_bad_nodes | join(' ') }}"
+
+        if [ -z "$unresolved_nodes" ]; then
+          echo "No unresolved unhealthy nodes."
+          sinfo -N
+          exit 0
+        fi
+
+        for node in $unresolved_nodes; do
+          echo "### draining unresolved node $node"
+          scontrol update NodeName="$node" State=DRAIN Reason="auto-remediation failed"
+        done
+
+        sinfo -N
+      args:
+        executable: /bin/bash
+      register: drain_unresolved
+      changed_when: still_bad_nodes | length > 0
+
+    - name: Show remediation summary
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### initial bad nodes"
+        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
+        if [ -z "$bad_nodes" ]; then
+          echo "none"
+        else
+          printf '%s\n' $bad_nodes
+        fi
+
+        echo
+        echo "### still bad nodes"
+        still_bad_nodes="{{ (still_bad_nodes | default([])) | join(' ') }}"
+        if [ -z "$still_bad_nodes" ]; then
+          echo "none"
+        else
+          printf '%s\n' $still_bad_nodes
+        fi
+
+        echo
+        echo "### final sinfo"
+        sinfo -N
+
+        echo
+        echo "### queue"
+        squeue
+      args:
+        executable: /bin/bash
+      register: remediation_summary
+      changed_when: false
+
+    - name: Print remediation summary
+      ansible.builtin.debug:
+        var: remediation_summary.stdout_lines
@@ -0,0 +1,149 @@
+---
+- name: Check Slurm controller health
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Check controller services and cluster state
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### controller services"
+        systemctl is-active munge
+        systemctl is-active slurmctld
+        systemctl is-active slurmdbd || true
+        systemctl is-active mariadb || true
+
+        echo
+        echo "### slurm ping"
+        scontrol ping
+
+        echo
+        echo "### nodes"
+        sinfo -N
+
+        echo
+        echo "### partitions"
+        sinfo
+
+        echo
+        echo "### queue"
+        squeue
+
+        echo
+        echo "### problematic nodes"
+        sinfo -N -h -o "%N %T %E" | awk '$2 ~ /\*$/ || $2 !~ /idle|alloc|mix/ {print}' || true
+
+        echo
+        echo "### accounting"
+        sacctmgr -n list cluster || true
+
+        echo
+        echo "### recent failed jobs"
+        sacct -S today --state=FAILED,CANCELLED,TIMEOUT,NODE_FAIL,OUT_OF_MEMORY \
+          --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,NodeList | tail -30 || true
+      args:
+        executable: /bin/bash
+      register: controller_health
+      changed_when: false
+
+    - name: Print controller health
+      ansible.builtin.debug:
+        var: controller_health.stdout_lines
+
+
+- name: Check Slurm worker health
+  hosts: slurm_compute:slurm_gpu
+  become: true
+  gather_facts: true
+
+  tasks:
+    - name: Check worker services, config and connectivity
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "HOST=$(hostname)"
+        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
+        echo "KERNEL=$(uname -r)"
+        echo "UPTIME=$(uptime -p)"
+
+        echo
+        echo "### services"
+        systemctl is-active munge
+        systemctl is-active slurmd
+
+        echo
+        echo "### munge local test"
+        munge -n | unmunge >/dev/null
+        echo "munge OK"
+
+        echo
+        echo "### controller connectivity"
+        getent hosts {{ groups['slurm_controller'][0] }} || true
+        scontrol ping
+
+        echo
+        echo "### slurmd listener"
+        ss -lntp | grep ':6818 ' || true
+
+        echo
+        echo "### config checksums"
+        sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true
+
+        echo
+        echo "### shared filesystem"
+        test -d /shared
+        touch /shared/.slurm-health-$(hostname)
+        ls -l /shared/.slurm-health-$(hostname)
+        rm -f /shared/.slurm-health-$(hostname)
+
+        echo
+        echo "### cgroup"
+        mount | grep cgroup || true
+
+        echo
+        echo "### gpu check"
+        if command -v nvidia-smi >/dev/null 2>&1; then
+          nvidia-smi --query-gpu=index,name,driver_version,memory.total,temperature.gpu,utilization.gpu --format=csv,noheader || true
+        else
+          echo "NO_NVIDIA_SMI"
+        fi
+      args:
+        executable: /bin/bash
+      register: worker_health
+      changed_when: false
+
+    - name: Print worker health
+      ansible.builtin.debug:
+        var: worker_health.stdout_lines
+
+
+- name: Check Slurm-reported node state consistency
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Build Slurm node health summary
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### node summary"
+        sinfo -N -o "%N %P %T %C %m %G %E"
+
+        echo
+        echo "### full problematic node details"
+        for node in $(sinfo -N -h -o "%N %T" | awk 'tolower($2) ~ /down|drain|fail|unk|not_responding|idle\*/ {print $1}' | sort -u); do
+          echo
+          echo "### $node"
+          scontrol show node "$node"
+        done
+      args:
+        executable: /bin/bash
+      register: slurm_node_summary
+      changed_when: false
+
+    - name: Print Slurm node summary
+      ansible.builtin.debug:
+        var: slurm_node_summary.stdout_lines
@@ -0,0 +1,217 @@
+---
+- name: Validate target node
+  hosts: localhost
+  gather_facts: false
+
+  tasks:
+    - name: Require target_node
+      ansible.builtin.fail:
+        msg: "Use: ansible-playbook repair-slurm-node.yml -e target_node=<hostname>"
+      when: target_node is not defined
+
+    - name: Ensure target_node is in inventory
+      ansible.builtin.fail:
+        msg: "target_node={{ target_node }} is not in Ansible inventory"
+      when: target_node not in groups['all']
+
+
+- name: Capture node state before repair
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Show target node state before repair
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### sinfo"
+        sinfo -N -n {{ target_node }} || true
+
+        echo
+        echo "### scontrol"
+        scontrol show node {{ target_node }} || true
+
+        echo
+        echo "### jobs"
+        squeue -w {{ target_node }} || true
+      args:
+        executable: /bin/bash
+      register: node_state_before
+      changed_when: false
+
+    - name: Print target node state before repair
+      ansible.builtin.debug:
+        var: node_state_before.stdout_lines
+
+
+- name: Repair local services on target node
+  hosts: "{{ target_node }}"
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Restart Munge
+      ansible.builtin.systemd:
+        name: munge
+        state: restarted
+        enabled: true
+
+    - name: Restart slurmd
+      ansible.builtin.systemd:
+        name: slurmd
+        state: restarted
+        enabled: true
+      when:
+        - inventory_hostname in groups.get('slurm_compute', []) or inventory_hostname in groups.get('slurm_gpu', [])
+
+    - name: Validate local repair
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### services"
+        systemctl is-active munge
+        systemctl is-active slurmd
+
+        echo
+        echo "### munge"
+        munge -n | unmunge >/dev/null
+        echo "munge OK"
+
+        echo
+        echo "### controller ping"
+        scontrol ping
+
+        echo
+        echo "### slurmd listener"
+        ss -lntp | grep ':6818 ' || true
+
+        echo
+        echo "### recent slurmd logs"
+        journalctl -u slurmd -n 40 --no-pager || true
+      args:
+        executable: /bin/bash
+      register: local_repair_state
+      changed_when: false
+
+    - name: Print local repair state
+      ansible.builtin.debug:
+        var: local_repair_state.stdout_lines
+
+
+- name: Clear Slurm maintenance/down state after repair
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Restart controller to refresh node state
+      ansible.builtin.systemd:
+        name: slurmctld
+        state: restarted
+
+    - name: Wait for controller
+      ansible.builtin.command:
+        cmd: scontrol ping
+      register: slurmctld_ping
+      retries: 15
+      delay: 2
+      until: slurmctld_ping.rc == 0
+      changed_when: false
+
+    - name: Clear target node state
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        scontrol update NodeName={{ target_node }} State=RESUME 2>/dev/null || true
+        scontrol update NodeName={{ target_node }} State=UNDRAIN 2>/dev/null || true
+        scontrol update NodeName={{ target_node }} State=IDLE 2>/dev/null || true
+
+        sleep 5
+
+        sinfo -N -n {{ target_node }}
+        scontrol show node {{ target_node }}
+      args:
+        executable: /bin/bash
+      register: clear_state
+      changed_when: true
+
+    - name: Wait until node is healthy
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sinfo -N -n {{ target_node }}
+        scontrol show node {{ target_node }}
+      args:
+        executable: /bin/bash
+      register: node_health_after
+      retries: 30
+      delay: 5
+      until:
+        - node_health_after.rc == 0
+        - "'not_responding' not in node_health_after.stdout.lower()"
+        - "'down' not in node_health_after.stdout.lower()"
+        - "'drain' not in node_health_after.stdout.lower()"
+        - "'idle*' not in node_health_after.stdout.lower()"
+      changed_when: false
+
+    - name: Print node state after repair
+      ansible.builtin.debug:
+        var: node_health_after.stdout_lines
+
+
+- name: Submit repair validation job
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Submit validation job to repaired node
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<SBATCH
+        #!/bin/bash
+        #SBATCH --job-name=repair-node-test
+        #SBATCH --partition=all
+        #SBATCH --nodelist={{ target_node }}
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=256M
+        #SBATCH --time=00:02:00
+        #SBATCH --account=lab
+        #SBATCH --qos=normal
+        #SBATCH --output=/shared/repair-node-test-%j.out
+
+        echo "HOST=\$(hostname)"
+        echo "USER=\$(whoami)"
+        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
+        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 90); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList
+
+        echo "### output"
+        cat "/shared/repair-node-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: repair_validation_job
+      changed_when: true
+
+    - name: Print repair validation job
+      ansible.builtin.debug:
+        var: repair_validation_job.stdout_lines
@@ -0,0 +1,126 @@
+---
+- name: Validate target_node variable
+  hosts: localhost
+  gather_facts: false
+
+  tasks:
+    - name: Require target_node
+      ansible.builtin.fail:
+        msg: "Use: ansible-playbook decommission-slurm-node.yml -e target_node=<hostname> [-e decom_reason='reason']"
+      when: target_node is not defined
+
+    - name: Ensure target_node is in inventory
+      ansible.builtin.fail:
+        msg: "target_node={{ target_node }} is not in Ansible inventory"
+      when: target_node not in groups['all']
+
+
+- name: Drain target node and wait for jobs to leave
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  vars:
+    decom_reason_effective: "{{ decom_reason | default('decommission by Ansible') }}"
+    decom_wait_retries_effective: "{{ decom_wait_retries | default(120) }}"
+    decom_wait_delay_effective: "{{ decom_wait_delay | default(10) }}"
+
+  tasks:
+    - name: Show current target node state
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sinfo -N -n {{ target_node }} || true
+        scontrol show node {{ target_node }} || true
+      args:
+        executable: /bin/bash
+      register: node_state_before
+      changed_when: false
+
+    - name: Print current target node state
+      ansible.builtin.debug:
+        var: node_state_before.stdout_lines
+
+    - name: Drain target node
+      ansible.builtin.command:
+        cmd: scontrol update NodeName={{ target_node }} State=DRAIN Reason="{{ decom_reason_effective }}"
+      changed_when: true
+
+    - name: Wait until no jobs are running on target node
+      ansible.builtin.shell: |
+        set -euo pipefail
+        squeue -h -w {{ target_node }} || true
+      args:
+        executable: /bin/bash
+      register: jobs_on_node
+      retries: "{{ decom_wait_retries_effective | int }}"
+      delay: "{{ decom_wait_delay_effective | int }}"
+      until: jobs_on_node.stdout | trim == ""
+      changed_when: false
+
+    - name: Show drained node state
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sinfo -N -n {{ target_node }} || true
+        scontrol show node {{ target_node }} | grep -E "NodeName=|State=|Reason=" || true
+      args:
+        executable: /bin/bash
+      register: node_state_drained
+      changed_when: false
+
+    - name: Print drained node state
+      ansible.builtin.debug:
+        var: node_state_drained.stdout_lines
+
+
+- name: Stop Slurm worker service on target node
+  hosts: "{{ target_node }}"
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Stop slurmd
+      ansible.builtin.systemd:
+        name: slurmd
+        state: stopped
+        enabled: false
+      when:
+        - inventory_hostname in groups.get('slurm_compute', []) or inventory_hostname in groups.get('slurm_gpu', [])
+
+    - name: Show slurmd state
+      ansible.builtin.shell: |
+        systemctl is-enabled slurmd 2>/dev/null || true
+        systemctl is-active slurmd 2>/dev/null || true
+      args:
+        executable: /bin/bash
+      register: slurmd_state_after
+      changed_when: false
+
+    - name: Print slurmd state
+      ansible.builtin.debug:
+        var: slurmd_state_after.stdout_lines
+
+
+- name: Mark node down in Slurm controller
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Mark target node DOWN after service stop
+      ansible.builtin.command:
+        cmd: scontrol update NodeName={{ target_node }} State=DOWN Reason="decommissioned"
+      changed_when: true
+
+    - name: Show final node state
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sinfo -N -n {{ target_node }} || true
+        scontrol show node {{ target_node }} | grep -E "NodeName=|State=|Reason=" || true
+      args:
+        executable: /bin/bash
+      register: final_node_state
+      changed_when: false
+
+    - name: Print final node state
+      ansible.builtin.debug:
+        var: final_node_state.stdout_lines
@@ -0,0 +1,246 @@
+---
+- name: Validate target_node variable
+  hosts: localhost
+  gather_facts: false
+
+  tasks:
+    - name: Require target_node
+      ansible.builtin.fail:
+        msg: "Use: ansible-playbook provision-slurm-node.yml -e target_node=<hostname>"
+      when: target_node is not defined
+
+    - name: Ensure target_node is in inventory
+      ansible.builtin.fail:
+        msg: "target_node={{ target_node }} is not in Ansible inventory"
+      when: target_node not in groups['all']
+
+
+- name: Prepare OS, packages and Slurm directories on target node
+  hosts: "{{ target_node }}"
+  become: true
+  gather_facts: true
+
+  tasks:
+    - name: Ensure target is a Slurm worker or GPU node
+      ansible.builtin.fail:
+        msg: "{{ inventory_hostname }} must be in slurm_compute or slurm_gpu group"
+      when:
+        - inventory_hostname not in groups.get('slurm_compute', [])
+        - inventory_hostname not in groups.get('slurm_gpu', [])
+
+    - name: Install Slurm worker packages
+      ansible.builtin.apt:
+        name:
+          - munge
+          - libmunge2
+          - slurm-client
+          - slurmd
+          - slurm-wlm-basic-plugins
+          - slurm-wlm-plugins
+          - slurm-wlm-mysql-plugin
+        state: present
+        update_cache: true
+
+    - name: Ensure Slurm config directory exists
+      ansible.builtin.file:
+        path: "{{ slurm_config_dir }}"
+        state: directory
+        owner: root
+        group: root
+        mode: "0755"
+
+    - name: Ensure Slurm log directory exists
+      ansible.builtin.file:
+        path: /var/log/slurm
+        state: directory
+        owner: slurm
+        group: slurm
+        mode: "0755"
+
+    - name: Ensure slurmd spool directory exists
+      ansible.builtin.file:
+        path: /var/spool/slurmd
+        state: directory
+        owner: slurm
+        group: slurm
+        mode: "0755"
+
+    - name: Ensure munge dirs exist
+      ansible.builtin.file:
+        path: "{{ item.path }}"
+        state: directory
+        owner: munge
+        group: munge
+        mode: "{{ item.mode }}"
+      loop:
+        - { path: /etc/munge, mode: "0700" }
+        - { path: /var/log/munge, mode: "0755" }
+        - { path: /var/lib/munge, mode: "0711" }
+        - { path: /run/munge, mode: "0755" }
+
+
+- name: Deploy Munge key from controller to target node
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Read controller munge.key
+      ansible.builtin.slurp:
+        src: /etc/munge/munge.key
+      register: controller_munge_key_raw
+
+    - name: Store controller Munge key as fact
+      ansible.builtin.set_fact:
+        cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
+
+
+- name: Configure target node with Munge and Slurm files
+  hosts: "{{ target_node }}"
+  become: true
+  gather_facts: false
+
+  vars:
+    controller_host: "{{ groups['slurm_controller'][0] }}"
+
+  tasks:
+    - name: Deploy shared munge.key
+      ansible.builtin.copy:
+        dest: /etc/munge/munge.key
+        content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
+        owner: munge
+        group: munge
+        mode: "0400"
+      notify:
+        - Restart munge
+
+    - name: Deploy managed slurm.conf
+      ansible.builtin.template:
+        src: ../../templates/slurm.conf.j2
+        dest: "{{ slurm_config_dir }}/slurm.conf"
+        owner: root
+        group: root
+        mode: "0644"
+      notify:
+        - Restart slurmd
+
+    - name: Deploy managed cgroup.conf
+      ansible.builtin.template:
+        src: ../../templates/cgroup.conf.j2
+        dest: "{{ slurm_config_dir }}/cgroup.conf"
+        owner: root
+        group: root
+        mode: "0644"
+      when: slurm_enable_cgroup | default(false) | bool
+      notify:
+        - Restart slurmd
+
+    - name: Deploy managed gres.conf on GPU nodes
+      ansible.builtin.template:
+        src: ../../templates/gres.conf.j2
+        dest: "{{ slurm_config_dir }}/gres.conf"
+        owner: root
+        group: root
+        mode: "0644"
+      when: inventory_hostname in groups.get('slurm_gpu', [])
+      notify:
+        - Restart slurmd
+
+    - name: Ensure munge is enabled and running
+      ansible.builtin.systemd:
+        name: munge
+        enabled: true
+        state: started
+
+    - name: Ensure slurmd is enabled and running
+      ansible.builtin.systemd:
+        name: slurmd
+        enabled: true
+        state: started
+
+  handlers:
+    - name: Restart munge
+      ansible.builtin.systemd:
+        name: munge
+        state: restarted
+
+    - name: Restart slurmd
+      ansible.builtin.systemd:
+        name: slurmd
+        state: restarted
+
+
+- name: Deploy updated Slurm config to whole cluster and reconfigure controller
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Deploy managed slurm.conf to all nodes
+      ansible.builtin.template:
+        src: ../../templates/slurm.conf.j2
+        dest: "{{ slurm_config_dir }}/slurm.conf"
+        owner: root
+        group: root
+        mode: "0644"
+
+    - name: Deploy managed cgroup.conf to all nodes
+      ansible.builtin.template:
+        src: ../../templates/cgroup.conf.j2
+        dest: "{{ slurm_config_dir }}/cgroup.conf"
+        owner: root
+        group: root
+        mode: "0644"
+      when: slurm_enable_cgroup | default(false) | bool
+
+
+- name: Reconfigure Slurm and validate target node
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Reconfigure Slurm controller
+      ansible.builtin.command:
+        cmd: scontrol reconfigure
+      changed_when: true
+
+    - name: Restart Slurm controller after node reprovision
+      ansible.builtin.systemd:
+        name: slurmctld
+        state: restarted
+
+    - name: Wait for Slurm controller after restart
+      ansible.builtin.command:
+        cmd: scontrol ping
+      register: slurmctld_ping_after_restart
+      retries: 15
+      delay: 2
+      until: slurmctld_ping_after_restart.rc == 0
+      changed_when: false
+
+    - name: Resume target node in Slurm
+      ansible.builtin.command:
+        cmd: scontrol update NodeName={{ target_node }} State=RESUME
+      changed_when: true
+
+    - name: Wait until target node is visible and not down
+      ansible.builtin.shell: |
+        set -euo pipefail
+        scontrol show node {{ target_node }}
+        sinfo -N -n {{ target_node }}
+      args:
+        executable: /bin/bash
+      register: target_node_state
+      retries: 20
+      delay: 3
+      until:
+        - target_node_state.rc == 0
+        - "'down' not in target_node_state.stdout.lower()"
+        - "'not_responding' not in target_node_state.stdout.lower()"
+        - "'idle*' not in target_node_state.stdout.lower()"
+      changed_when: false
+
+    - name: Show target node state
+      ansible.builtin.debug:
+        var: target_node_state.stdout_lines
@@ -0,0 +1,33 @@
+---
+- name: Show Slurm node state
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Require target_node
+      ansible.builtin.fail:
+        msg: "Use: ansible-playbook show-slurm-node.yml -e target_node=<hostname>"
+      when: target_node is not defined
+
+    - name: Show node state
+      ansible.builtin.shell: |
+        set -euo pipefail
+        echo "### sinfo"
+        sinfo -N -n {{ target_node }} || true
+
+        echo
+        echo "### scontrol"
+        scontrol show node {{ target_node }} || true
+
+        echo
+        echo "### jobs on node"
+        squeue -w {{ target_node }} || true
+      args:
+        executable: /bin/bash
+      register: node_lifecycle_state
+      changed_when: false
+
+    - name: Print node lifecycle state
+      ansible.builtin.debug:
+        var: node_lifecycle_state.stdout_lines
@@ -0,0 +1,169 @@
+---
+- name: Configure Slurm QOS, limits and fairshare
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Ensure sacctmgr is available
+      ansible.builtin.command:
+        cmd: sacctmgr -n list cluster
+      changed_when: false
+
+    - name: Validate accounting GPU TRES exists
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### configured AccountingStorageTRES"
+        scontrol show config | grep -E "AccountingStorageTRES|AccountingStorageType|AccountingStorageEnforce"
+
+        echo
+        echo "### known TRES"
+        sacctmgr show tres
+
+        echo
+        echo "### checking gres/gpu"
+        sacctmgr -n show tres format=Type,Name | awk '$1=="gres" && $2=="gpu" {found=1} END {exit !found}'
+      args:
+        executable: /bin/bash
+      register: gpu_tres_check
+      changed_when: false
+
+    - name: Ensure normal QOS exists
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i add qos normal Priority=100
+      args:
+        executable: /bin/bash
+      register: add_qos_normal
+      changed_when: "'Adding QOS' in (add_qos_normal.stdout + add_qos_normal.stderr)"
+      failed_when: >
+        add_qos_normal.rc != 0 and
+        'Nothing new added' not in (add_qos_normal.stdout + add_qos_normal.stderr) and
+        'already exists' not in (add_qos_normal.stdout + add_qos_normal.stderr) and
+        'Already existing' not in (add_qos_normal.stdout + add_qos_normal.stderr)
+
+    - name: Ensure debug-short QOS exists
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i add qos debug-short Priority=500
+      args:
+        executable: /bin/bash
+      register: add_qos_debug
+      changed_when: "'Adding QOS' in (add_qos_debug.stdout + add_qos_debug.stderr)"
+      failed_when: >
+        add_qos_debug.rc != 0 and
+        'Nothing new added' not in (add_qos_debug.stdout + add_qos_debug.stderr) and
+        'already exists' not in (add_qos_debug.stdout + add_qos_debug.stderr) and
+        'Already existing' not in (add_qos_debug.stdout + add_qos_debug.stderr)
+
+    - name: Ensure gpu-short QOS exists
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i add qos gpu-short Priority=1000
+      args:
+        executable: /bin/bash
+      register: add_qos_gpu
+      changed_when: "'Adding QOS' in (add_qos_gpu.stdout + add_qos_gpu.stderr)"
+      failed_when: >
+        add_qos_gpu.rc != 0 and
+        'Nothing new added' not in (add_qos_gpu.stdout + add_qos_gpu.stderr) and
+        'already exists' not in (add_qos_gpu.stdout + add_qos_gpu.stderr) and
+        'Already existing' not in (add_qos_gpu.stdout + add_qos_gpu.stderr)
+
+    - name: Ensure maintenance QOS exists
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i add qos maintenance Priority=5000
+      args:
+        executable: /bin/bash
+      register: add_qos_maintenance
+      changed_when: "'Adding QOS' in (add_qos_maintenance.stdout + add_qos_maintenance.stderr)"
+      failed_when: >
+        add_qos_maintenance.rc != 0 and
+        'Nothing new added' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr) and
+        'already exists' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr) and
+        'Already existing' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr)
+
+    - name: Normalize normal QOS settings
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i modify qos normal set Priority=100
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Normalize debug-short QOS settings
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i modify qos debug-short set Priority=500 MaxWall=00:10:00 MaxTRESPU=cpu=2 MaxJobsPU=4
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Normalize gpu-short QOS settings
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i modify qos gpu-short set Priority=1000 MaxWall=01:00:00 MaxTRESPU=gres/gpu=1,cpu=12 MaxJobsPU=2
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Normalize maintenance QOS settings
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i modify qos maintenance set Priority=5000 MaxWall=02:00:00
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Assign QOS set to lab account
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i modify account {{ slurm_account_name }} set QOS=normal,debug-short,gpu-short,maintenance DefaultQOS=normal Fairshare=100
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Assign default account to slurmuser
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Assign QOS set to slurmuser association
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sacctmgr -i modify user where name=slurmuser account={{ slurm_account_name }} set QOS=normal,debug-short,gpu-short,maintenance DefaultQOS=normal Fairshare=100
+      args:
+        executable: /bin/bash
+      changed_when: true
+
+    - name: Show configured QOS and associations
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### TRES"
+        sacctmgr show tres
+
+        echo
+        echo "### QOS"
+        sacctmgr show qos format=Name%20,Priority,MaxWall,MaxTRESPU%40,MaxJobsPU
+
+        echo
+        echo "### Associations"
+        sacctmgr show assoc format=Cluster,Account,User,Share,QOS%60,DefaultQOS,Fairshare
+
+        echo
+        echo "### Fairshare"
+        sshare -A {{ slurm_account_name }} || true
+      args:
+        executable: /bin/bash
+      register: qos_state
+      changed_when: false
+
+    - name: Print QOS state
+      ansible.builtin.debug:
+        var: qos_state.stdout_lines
@@ -0,0 +1,235 @@
+---
+- name: Validate Slurm QOS, fairshare and priority
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Validate priority runtime config
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### priority config"
+        scontrol show config | grep -E "PriorityType|PriorityWeight|PriorityDecay|PriorityCalc|PriorityMaxAge|PriorityFavor"
+
+        echo
+        echo "### accounting enforcement"
+        scontrol show config | grep -E "AccountingStorageType|AccountingStorageEnforce|AccountingStorageTRES"
+
+        echo
+        echo "### QOS"
+        sacctmgr show qos format=Name%20,Priority,MaxWall,MaxTRESPU%50,MaxJobsPU
+
+        echo
+        echo "### associations"
+        sacctmgr show assoc format=Cluster,Account,User,Share,QOS%80,DefaultQOS,Fairshare
+
+        echo
+        echo "### fairshare"
+        sshare -A {{ slurm_account_name }} || true
+      args:
+        executable: /bin/bash
+      register: priority_state
+      changed_when: false
+
+    - name: Submit debug-short QOS job
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=qos-debug-test
+        #SBATCH --partition=debug
+        #SBATCH --qos=debug-short
+        #SBATCH --account={{ slurm_account_name }}
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=256M
+        #SBATCH --time=00:02:00
+        #SBATCH --output=/shared/qos-debug-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "USER=$(whoami)"
+        echo "QOS=${SLURM_JOB_QOS:-}"
+        echo "ACCOUNT=${SLURM_JOB_ACCOUNT:-}"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 90); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/qos-debug-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: debug_qos_job
+      changed_when: true
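+
+    # Illustrative sketch, not part of the original playbook: the test job
+    # echoes QOS=${SLURM_JOB_QOS:-}, so the registered output can be checked
+    # for the requested QOS. Assumes sbatch exports SLURM_JOB_QOS on this
+    # Slurm version; drop the task if that variable is not set in your jobs.
+    - name: Assert debug-short job ran under the requested QOS (sketch)
+      ansible.builtin.assert:
+        that:
+          - "'QOS=debug-short' in debug_qos_job.stdout"
+        fail_msg: "debug-short test job output does not show QOS=debug-short"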
+
+    - name: Submit gpu-short QOS job
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=qos-gpu-test
+        #SBATCH --partition=gpu
+        #SBATCH --qos=gpu-short
+        #SBATCH --account={{ slurm_account_name }}
+        #SBATCH --gres=gpu:1
+        #SBATCH --cpus-per-task=2
+        #SBATCH --mem=1G
+        #SBATCH --time=00:03:00
+        #SBATCH --output=/shared/qos-gpu-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "USER=$(whoami)"
+        echo "QOS=${SLURM_JOB_QOS:-}"
+        echo "ACCOUNT=${SLURM_JOB_ACCOUNT:-}"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
+        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        echo
+        nvidia-smi
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 120); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/qos-gpu-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: gpu_qos_job
+      changed_when: true
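+
+    # Illustrative sketch, not part of the original playbook: the GPU test job
+    # echoes CUDA_VISIBLE_DEVICES, so the registered output can confirm that a
+    # device index was handed to the job. Assumes the GRES plugin exposes GPUs
+    # by numeric index (for example 0); MIG or UUID-style values would need a
+    # different pattern.
+    - name: Assert gpu-short job received a GPU device index (sketch)
+      ansible.builtin.assert:
+        that:
+          - "gpu_qos_job.stdout is search('CUDA_VISIBLE_DEVICES=[0-9]')"
+        fail_msg: "gpu-short test job output shows no numeric CUDA_VISIBLE_DEVICES value"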
+
+    - name: Validate debug-short walltime limit behavior
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        set +e
+        output="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH' 2>&1
+        #!/bin/bash
+        #SBATCH --job-name=qos-limit-fail
+        #SBATCH --partition=debug
+        #SBATCH --qos=debug-short
+        #SBATCH --account={{ slurm_account_name }}
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=256M
+        #SBATCH --time=00:30:00
+        #SBATCH --output=/shared/qos-limit-fail-%j.out
+
+        sleep 10
+        SBATCH
+        )"
+        rc=$?
+        set -e
+
+        echo "RC=$rc"
+        echo "$output"
+
+        if [ "$rc" -ne 0 ]; then
+          echo "Limit rejection test passed at submit time"
+          exit 0
+        fi
+
+        job_id="$output"
+        echo "Submitted job despite expected limit check: $job_id"
+
+        sleep 3
+
+        echo "### squeue"
+        squeue -j "$job_id" -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R" || true
+
+        echo
+        echo "### job detail"
+        scontrol show job "$job_id" || true
+
+        state="$(squeue -h -j "$job_id" -o "%T" || true)"
+        reason="$(squeue -h -j "$job_id" -o "%R" || true)"
+
+        echo "STATE=$state"
+        echo "REASON=$reason"
+
+        if echo "$state" | grep -qE "PENDING|CONFIGURING"; then
+          if echo "$reason" | grep -qiE "qos|limit|time|max|assoc"; then
+            echo "Limit enforcement test passed via pending reason"
+            scancel "$job_id" || true
+            exit 0
+          fi
+        fi
+
+        echo "Job was accepted without an obvious QOS/limit pending reason"
+        scancel "$job_id" || true
+        exit 1
+      args:
+        executable: /bin/bash
+      register: limit_rejection
+      changed_when: false
+
+    - name: Show priority and fairshare snapshot
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### queue"
+        squeue || true
+
+        echo
+        echo "### sprio"
+        sprio || true
+
+        echo
+        echo "### sshare"
+        sshare -A {{ slurm_account_name }} || true
+
+        echo
+        echo "### recent sacct"
+        sacct -S today --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList | tail -40
+      args:
+        executable: /bin/bash
+      register: priority_snapshot
+      changed_when: false
+
+    - name: Print validation result
+      ansible.builtin.debug:
+        msg:
+          - "### priority state"
+          - "{{ priority_state.stdout_lines }}"
+          - "### debug QOS job"
+          - "{{ debug_qos_job.stdout_lines }}"
+          - "### GPU QOS job"
+          - "{{ gpu_qos_job.stdout_lines }}"
+          - "### limit rejection"
+          - "{{ limit_rejection.stdout_lines }}"
+          - "### priority snapshot"
+          - "{{ priority_snapshot.stdout_lines }}"
@@ -0,0 +1,59 @@
+---
+- name: Test CPU cgroup enforcement on gpu01
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Submit cgroup CPU test to gpu01
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=cgroup-cpu-test
+        #SBATCH --partition=all
+        #SBATCH --nodelist=gpu01
+        #SBATCH --cpus-per-task=2
+        #SBATCH --mem=1G
+        #SBATCH --time=00:02:00
+        #SBATCH --output=/shared/cgroup-cpu-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-}"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        echo "MEM_ALLOWED=$(grep Mems_allowed_list /proc/self/status || true)"
+        echo
+        echo "### cgroup"
+        cat /proc/self/cgroup
+        echo
+        echo "### mounted cgroups"
+        mount | grep cgroup || true
+        sleep 5
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 60); do
+          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
+            sudo -iu slurmuser squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### output"
+        cat "/shared/cgroup-cpu-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: cgroup_cpu_result
+      changed_when: true
+
+    - name: Show cgroup CPU result
+      ansible.builtin.debug:
+        var: cgroup_cpu_result.stdout_lines
@@ -0,0 +1,60 @@
+---
+- name: Submit CPU test job
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Submit test job to debug partition
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=cpu-test
+        #SBATCH --partition=debug
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=512M
+        #SBATCH --time=00:02:00
+        #SBATCH --output=/shared/cpu-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "USER=$(whoami)"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 60); do
+          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
+            sudo -iu slurmuser squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### sacct"
+        sudo -iu slurmuser sacct -j "$job_id" --format=JobID,JobName,Partition,State,ExitCode 2>/dev/null || true
+
+        echo "### output"
+        if [ -f "/shared/cpu-test-${job_id}.out" ]; then
+          cat "/shared/cpu-test-${job_id}.out"
+        else
+          echo "Output file not found: /shared/cpu-test-${job_id}.out"
+          find /shared -maxdepth 1 -name "cpu-test-*.out" -ls | tail -5 || true
+          exit 1
+        fi
+      args:
+        executable: /bin/bash
+      register: cpu_job_result
+      changed_when: true
+
+    - name: Show CPU job result
+      ansible.builtin.debug:
+        var: cpu_job_result.stdout_lines
@@ -0,0 +1,58 @@
+---
+- name: Test GPU access without GRES allocation
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Submit job to gpu01 without requesting GPU
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=gpu-deny-test
+        #SBATCH --partition=all
+        #SBATCH --nodelist=gpu01
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=1G
+        #SBATCH --time=00:02:00
+        #SBATCH --output=/shared/gpu-deny-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
+        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        echo
+        echo "### ls nvidia devices"
+        ls -l /dev/nvidia* 2>&1 || true
+        echo
+        echo "### nvidia-smi without GRES"
+        nvidia-smi 2>&1 || true
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 60); do
+          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
+            sudo -iu slurmuser squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### output"
+        cat "/shared/gpu-deny-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: gpu_deny_result
+      changed_when: true
+
+    - name: Show GPU deny test result
+      ansible.builtin.debug:
+        var: gpu_deny_result.stdout_lines
@@ -0,0 +1,70 @@
+---
+- name: Submit GPU test job
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Submit test job to gpu partition
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=gpu-test
+        #SBATCH --partition=gpu
+        #SBATCH --gres=gpu:1
+        #SBATCH --cpus-per-task=2
+        #SBATCH --mem=2G
+        #SBATCH --time=00:03:00
+        #SBATCH --output=/shared/gpu-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "USER=$(whoami)"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
+        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        echo
+
+        echo "### nvidia-smi"
+        nvidia-smi
+
+        echo
+        echo "### GPU process table"
+        nvidia-smi pmon -c 1 || true
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 90); do
+          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
+            sudo -iu slurmuser squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### sacct"
+        sudo -iu slurmuser sacct -j "$job_id" --format=JobID,JobName,Partition,State,ExitCode 2>/dev/null || true
+
+        echo "### output"
+        if [ -f "/shared/gpu-test-${job_id}.out" ]; then
+          cat "/shared/gpu-test-${job_id}.out"
+        else
+          echo "Output file not found: /shared/gpu-test-${job_id}.out"
+          find /shared -maxdepth 1 -name "gpu-test-*.out" -ls | tail -5 || true
+          exit 1
+        fi
+      args:
+        executable: /bin/bash
+      register: gpu_job_result
+      changed_when: true
+
+    - name: Show GPU job result
+      ansible.builtin.debug:
+        var: gpu_job_result.stdout_lines
@@ -0,0 +1,95 @@
+---
+- name: Submit job to specific Slurm node
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Require target_node
+      ansible.builtin.fail:
+        msg: "Use: ansible-playbook test-specific-node.yml -e target_node=<hostname>"
+      when: target_node is not defined
+
+    - name: Submit test job to target node
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<SBATCH
+        #!/bin/bash
+        #SBATCH --job-name=node-test
+        #SBATCH --partition=debug
+        #SBATCH --nodelist={{ target_node }}
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=256M
+        #SBATCH --time=00:02:00
+        #SBATCH --account=lab
+        #SBATCH --qos=normal
+        #SBATCH --output=/shared/node-test-%j.out
+
+        echo "HOST=\$(hostname)"
+        echo "USER=\$(whoami)"
+        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
+        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        echo "### waiting for job to leave queue"
+        for i in $(seq 1 120); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### waiting for output file"
+        for i in $(seq 1 30); do
+          if [ -s "/shared/node-test-${job_id}.out" ]; then
+            break
+          fi
+          sleep 1
+        done
+
+        echo "### waiting for sacct final state"
+        final_state=""
+        for i in $(seq 1 30); do
+          final_state="$(
+            sacct -n -P -j "$job_id" --format=State 2>/dev/null \
+              | head -n 1 \
+              | cut -d'|' -f1 \
+              | awk '{print $1}'
+          )"
+
+          if echo "$final_state" | grep -qE "COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY"; then
+            break
+          fi
+
+          sleep 1
+        done
+
+        echo "FINAL_STATE=${final_state:-UNKNOWN}"
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/node-test-${job_id}.out"
+
+        if [ "${final_state:-UNKNOWN}" != "COMPLETED" ]; then
+          echo "Job did not reach COMPLETED state according to sacct"
+          exit 1
+        fi
+      args:
+        executable: /bin/bash
+      register: node_test
+      changed_when: true
+
+    - name: Show node test result
+      ansible.builtin.debug:
+        var: node_test.stdout_lines
@@ -0,0 +1,60 @@
+---
+- name: Generate measurable Slurm usage for sreport
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Submit CPU usage job
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=sreport-usage
+        #SBATCH --partition=debug
+        #SBATCH --cpus-per-task=2
+        #SBATCH --mem=512M
+        #SBATCH --time=00:03:00
+        #SBATCH --output=/shared/sreport-usage-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-}"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        echo "Burning CPU for 90 seconds"
+
+        timeout 90 bash -c 'while true; do :; done' &
+        timeout 90 bash -c 'while true; do :; done' &
+        wait
+
+        echo "Done"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 150); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 2
+          else
+            break
+          fi
+        done
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/sreport-usage-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: sreport_usage_job
+      changed_when: true
+
+    - name: Show usage job result
+      ansible.builtin.debug:
+        var: sreport_usage_job.stdout_lines
@@ -0,0 +1,140 @@
+---
+- name: Validate Slurm operator user and SSH mesh
+  hosts: slurm_cluster
+  become: true
+  gather_facts: false
+
+  vars:
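+    # A play var that references itself triggers Ansible's "recursive loop
+    # detected" error; use a plain default (extra vars can still override it).
+    slurm_operator_user: slurmuser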
+    slurm_hosts: "{{ groups['slurm_cluster'] }}"
+
+  tasks:
+    - name: Validate slurmuser exists
+      ansible.builtin.command:
+        cmd: id {{ slurm_operator_user }}
+      changed_when: false
+
+    - name: Validate sinfo as slurmuser
+      ansible.builtin.command:
+        cmd: sudo -iu {{ slurm_operator_user }} sinfo
+      changed_when: false
+
+    - name: Validate squeue as slurmuser
+      ansible.builtin.command:
+        cmd: sudo -iu {{ slurm_operator_user }} squeue
+      changed_when: false
+
+    - name: Validate SSH mesh as slurmuser
+      ansible.builtin.shell: |
+        set -euo pipefail
+        for h in {{ slurm_hosts | join(' ') }}; do
+          echo "=== $h ==="
+          ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" hostname
+        done
+      args:
+        executable: /bin/bash
+      become_user: "{{ slurm_operator_user }}"
+      changed_when: false
+
+
+- name: Validate Slurm controller commands
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  vars:
+    slurm_operator_user: slurmuser
+
+  tasks:
+    - name: Validate slurmctld status through sudo
+      ansible.builtin.command:
+        cmd: sudo -iu {{ slurm_operator_user }} sudo -n systemctl status slurmctld --no-pager
+      changed_when: false
+
+    - name: Validate controller Slurm commands
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sudo -iu {{ slurm_operator_user }} sinfo
+        sudo -iu {{ slurm_operator_user }} squeue
+        sudo -iu {{ slurm_operator_user }} scontrol show nodes
+      args:
+        executable: /bin/bash
+      changed_when: false
+
+
+- name: Validate Slurm worker commands
+  hosts: slurm_compute:slurm_gpu
+  become: true
+  gather_facts: false
+
+  vars:
+    slurm_operator_user: slurmuser
+
+  tasks:
+    - name: Validate slurmd status through sudo
+      ansible.builtin.command:
+        cmd: sudo -iu {{ slurm_operator_user }} sudo -n systemctl status slurmd --no-pager
+      changed_when: false
+
+    - name: Validate worker Slurm commands
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sudo -iu {{ slurm_operator_user }} sinfo
+        sudo -iu {{ slurm_operator_user }} squeue
+        sudo -iu {{ slurm_operator_user }} scontrol show nodes
+      args:
+        executable: /bin/bash
+      changed_when: false
+
+
+- name: Validate basic job submission
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  vars:
+    slurm_operator_user: slurmuser
+
+  tasks:
+    - name: Submit simple Slurm test job as slurmuser
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu {{ slurm_operator_user }} sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=ansible-validate
+        #SBATCH --partition=debug
+        #SBATCH --time=00:01:00
+        #SBATCH --output=/tmp/ansible-validate-%j.out
+
+        hostname
+        whoami
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 20); do
+          state="$(sudo -iu {{ slurm_operator_user }} squeue -h -j "$job_id" -o "%T" || true)"
+          if [ -z "$state" ]; then
+            break
+          fi
+          echo "job_state=$state"
+          sleep 1
+        done
+
+        sudo -iu {{ slurm_operator_user }} sacct -j "$job_id" --format=JobID,JobName,State,ExitCode 2>/dev/null || true
+
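+        # /tmp is node-local: the file is only present here when the job ran on this host.
+        if [ -f "/tmp/ansible-validate-${job_id}.out" ]; then
+          cat "/tmp/ansible-validate-${job_id}.out"
+        else
+          echo "Output file not found on this host; check /tmp on the node that ran job ${job_id}"
+        fi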
+      args:
+        executable: /bin/bash
+      register: slurm_job_test
+      changed_when: true
+
+    - name: Show basic job submission result
+      ansible.builtin.debug:
+        var: slurm_job_test.stdout_lines
@@ -0,0 +1,236 @@
+---
+- name: Validate canary node variable
+  hosts: localhost
+  gather_facts: false
+
+  vars:
+    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"
+
+  tasks:
+    - name: Ensure canary node is in inventory
+      ansible.builtin.fail:
+        msg: "canary_node={{ canary_node_effective }} is not in inventory"
+      when: canary_node_effective not in groups['all']
+
+    - name: Ensure canary node is not the controller
+      ansible.builtin.fail:
+        msg: "Do not use controller as canary for worker rolling upgrade"
+      when: canary_node_effective in groups['slurm_controller']
+
+
+- name: Drain canary node
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  vars:
+    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"
+
+  tasks:
+    - name: Show canary state before drain
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sinfo -N -n {{ canary_node_effective }} || true
+        scontrol show node {{ canary_node_effective }} || true
+        squeue -w {{ canary_node_effective }} || true
+      args:
+        executable: /bin/bash
+      register: canary_before
+      changed_when: false
+
+    - name: Print canary state before drain
+      ansible.builtin.debug:
+        var: canary_before.stdout_lines
+
+    - name: Drain canary node
+      ansible.builtin.command:
+        cmd: scontrol update NodeName={{ canary_node_effective }} State=DRAIN Reason="canary OS upgrade"
+      changed_when: true
+
+    - name: Wait until canary has no running jobs
+      ansible.builtin.shell: |
+        set -euo pipefail
+        squeue -h -w {{ canary_node_effective }} || true
+      args:
+        executable: /bin/bash
+      register: canary_jobs
+      retries: 120
+      delay: 10
+      until: canary_jobs.stdout | trim == ""
+      changed_when: false
+
+
+- name: Upgrade canary node OS packages
+  hosts: "{{ canary_node | default('slurm-c02') }}"
+  become: true
+  gather_facts: true
+
+  tasks:
+    - name: Ensure apt cache is updated
+      ansible.builtin.apt:
+        update_cache: true
+        cache_valid_time: 1800
+
+    - name: Full upgrade packages
+      ansible.builtin.apt:
+        upgrade: full
+        autoremove: true
+        autoclean: true
+      register: apt_upgrade_result
+
+    - name: Check if reboot is required
+      ansible.builtin.stat:
+        path: /var/run/reboot-required
+      register: reboot_required
+
+    - name: Show upgrade summary
+      ansible.builtin.debug:
+        msg:
+          - "Host: {{ inventory_hostname }}"
+          - "Apt changed: {{ apt_upgrade_result.changed }}"
+          - "Reboot required: {{ reboot_required.stat.exists }}"
+
+    - name: Reboot canary if required
+      ansible.builtin.reboot:
+        msg: "Reboot after canary OS upgrade"
+        reboot_timeout: 900
+        connect_timeout: 20
+        pre_reboot_delay: 5
+        post_reboot_delay: 20
+      when: reboot_required.stat.exists
+
+    - name: Restart munge and enable it at boot
+      ansible.builtin.systemd:
+        name: munge
+        state: restarted
+        enabled: true
+
+    - name: Restart slurmd and enable it at boot
+      ansible.builtin.systemd:
+        name: slurmd
+        state: restarted
+        enabled: true
+
+    - name: Validate local services
+      ansible.builtin.shell: |
+        set -euo pipefail
+        systemctl is-active munge
+        systemctl is-active slurmd
+        munge -n | unmunge >/dev/null
+        scontrol ping
+      args:
+        executable: /bin/bash
+      changed_when: false
+
+
+- name: Resume canary node and run canary job
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  vars:
+    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"
+
+  tasks:
+    - name: Reconfigure controller
+      ansible.builtin.command:
+        cmd: scontrol reconfigure
+      changed_when: true
+
+    - name: Restart controller to refresh node state
+      ansible.builtin.systemd:
+        name: slurmctld
+        state: restarted
+
+    - name: Wait for controller
+      ansible.builtin.command:
+        cmd: scontrol ping
+      register: slurmctld_ping
+      retries: 15
+      delay: 2
+      until: slurmctld_ping.rc == 0
+      changed_when: false
+
+    - name: Clear canary node maintenance state
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        scontrol update NodeName={{ canary_node_effective }} State=RESUME 2>/dev/null || true
+        scontrol update NodeName={{ canary_node_effective }} State=UNDRAIN 2>/dev/null || true
+        scontrol update NodeName={{ canary_node_effective }} State=IDLE 2>/dev/null || true
+
+        sleep 3
+        sinfo -N -n {{ canary_node_effective }}
+        scontrol show node {{ canary_node_effective }}
+      args:
+        executable: /bin/bash
+      register: resume_canary
+      changed_when: true
+
+    - name: Wait until canary is IDLE and responding
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sinfo -N -n {{ canary_node_effective }}
+        scontrol show node {{ canary_node_effective }}
+      args:
+        executable: /bin/bash
+      register: canary_state
+      retries: 30
+      delay: 5
+      until:
+        - canary_state.rc == 0
+        - "'not_responding' not in canary_state.stdout.lower()"
+        - "'down' not in canary_state.stdout.lower()"
+        - "'drain' not in canary_state.stdout.lower()"
+        - "'idle*' not in canary_state.stdout.lower()"
+      changed_when: false
+
+    - name: Submit canary test job to upgraded node
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<SBATCH
+        #!/bin/bash
+        #SBATCH --job-name=canary-upgrade-test
+        #SBATCH --partition=all
+        #SBATCH --nodelist={{ canary_node_effective }}
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=256M
+        #SBATCH --time=00:02:00
+        #SBATCH --output=/shared/canary-upgrade-test-%j.out
+
+        echo "HOST=\$(hostname)"
+        echo "USER=\$(whoami)"
+        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
+        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
+        echo "KERNEL=\$(uname -r)"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
+        for i in $(seq 1 90); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/canary-upgrade-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: canary_job
+      changed_when: true
+
+    - name: Show canary test result
+      ansible.builtin.debug:
+        var: canary_job.stdout_lines
@@ -0,0 +1,197 @@
+---
+- name: Rolling upgrade Slurm worker nodes
+  hosts: slurm_compute:slurm_gpu
+  become: true
+  gather_facts: true
+  serial: 1
+
+  vars:
+    skip_canary_node: "{{ canary_node | default('slurm-c02') }}"
+    do_skip_canary: "{{ skip_canary | default(true) | bool }}"
+
+  pre_tasks:
+    - name: Skip canary node if requested
+      ansible.builtin.meta: end_host
+      when:
+        - do_skip_canary
+        - inventory_hostname == skip_canary_node
+
+    - name: Drain node before OS upgrade
+      ansible.builtin.command:
+        cmd: scontrol update NodeName={{ inventory_hostname }} State=DRAIN Reason="rolling OS upgrade"
+      delegate_to: "{{ groups['slurm_controller'][0] }}"
+      changed_when: true
+
+    - name: Wait until no jobs are running on this node
+      ansible.builtin.shell: |
+        set -euo pipefail
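+        # Do not mask squeue errors: an unreachable controller must not look like an empty node.
+        squeue -h -w {{ inventory_hostname }}
+      args:
+        executable: /bin/bash
+      delegate_to: "{{ groups['slurm_controller'][0] }}"
+      register: jobs_on_node
+      retries: 120
+      delay: 10
+      until:
+        - jobs_on_node.rc == 0
+        - "jobs_on_node.stdout | trim == ''"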
+      changed_when: false
+
+  tasks:
+    - name: Update apt cache
+      ansible.builtin.apt:
+        update_cache: true
+        cache_valid_time: 1800
+
+    - name: Full upgrade packages
+      ansible.builtin.apt:
+        upgrade: full
+        autoremove: true
+        autoclean: true
+      register: apt_upgrade_result
+
+    - name: Check if reboot is required
+      ansible.builtin.stat:
+        path: /var/run/reboot-required
+      register: reboot_required
+
+    - name: Show upgrade status
+      ansible.builtin.debug:
+        msg:
+          - "Node: {{ inventory_hostname }}"
+          - "Apt changed: {{ apt_upgrade_result.changed }}"
+          - "Reboot required: {{ reboot_required.stat.exists }}"
+
+    - name: Reboot node if required
+      ansible.builtin.reboot:
+        msg: "Reboot after rolling OS upgrade"
+        reboot_timeout: 900
+        connect_timeout: 20
+        pre_reboot_delay: 5
+        post_reboot_delay: 20
+      when: reboot_required.stat.exists
+
+    - name: Restart munge
+      ansible.builtin.systemd:
+        name: munge
+        state: restarted
+        enabled: true
+
+    - name: Restart slurmd
+      ansible.builtin.systemd:
+        name: slurmd
+        state: restarted
+        enabled: true
+
+    - name: Validate local slurm services
+      ansible.builtin.shell: |
+        set -euo pipefail
+        systemctl is-active munge
+        systemctl is-active slurmd
+        munge -n | unmunge >/dev/null
+        scontrol ping
+      args:
+        executable: /bin/bash
+      changed_when: false
+
+  post_tasks:
+    - name: Restart controller to refresh state after node upgrade
+      ansible.builtin.systemd:
+        name: slurmctld
+        state: restarted
+      delegate_to: "{{ groups['slurm_controller'][0] }}"
+      run_once: false
+
+    - name: Wait for controller after restart
+      ansible.builtin.command:
+        cmd: scontrol ping
+      delegate_to: "{{ groups['slurm_controller'][0] }}"
+      register: slurmctld_ping
+      retries: 15
+      delay: 2
+      until: slurmctld_ping.rc == 0
+      changed_when: false
+
+    - name: Clear upgraded node maintenance state
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        scontrol update NodeName={{ inventory_hostname }} State=RESUME 2>/dev/null || true
+        scontrol update NodeName={{ inventory_hostname }} State=UNDRAIN 2>/dev/null || true
+        scontrol update NodeName={{ inventory_hostname }} State=IDLE 2>/dev/null || true
+
+        sleep 3
+        sinfo -N -n {{ inventory_hostname }}
+        scontrol show node {{ inventory_hostname }}
+      args:
+        executable: /bin/bash
+      delegate_to: "{{ groups['slurm_controller'][0] }}"
+      register: resume_node
+      changed_when: true
+
+    - name: Wait until node is healthy
+      ansible.builtin.shell: |
+        set -euo pipefail
+        sinfo -N -n {{ inventory_hostname }}
+        scontrol show node {{ inventory_hostname }}
+      args:
+        executable: /bin/bash
+      delegate_to: "{{ groups['slurm_controller'][0] }}"
+      register: upgraded_node_state
+      retries: 30
+      delay: 5
+      until:
+        - upgraded_node_state.rc == 0
+        - "'not_responding' not in upgraded_node_state.stdout.lower()"
+        - "'down' not in upgraded_node_state.stdout.lower()"
+        - "'drain' not in upgraded_node_state.stdout.lower()"
+        - "'idle*' not in upgraded_node_state.stdout.lower()"
+      changed_when: false
+
+    - name: Submit node-local post-upgrade test job
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<SBATCH
+        #!/bin/bash
+        #SBATCH --job-name=rolling-upgrade-test
+        #SBATCH --partition=all
+        #SBATCH --nodelist={{ inventory_hostname }}
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=256M
+        #SBATCH --time=00:02:00
+        #SBATCH --output=/shared/rolling-upgrade-test-%j.out
+
+        echo "HOST=\$(hostname)"
+        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
+        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
+        echo "KERNEL=\$(uname -r)"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
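+        # Poll for up to 90 seconds while the job is still pending or running.
+        for i in $(seq 1 90); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        # Fail with a clear message instead of reading a missing or partial output file.
+        if squeue -h -j "$job_id" | grep -q .; then
+          echo "Job $job_id did not finish within 90 seconds" >&2
+          exit 1
+        fi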
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/rolling-upgrade-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      delegate_to: "{{ groups['slurm_controller'][0] }}"
+      register: node_test_job
+      changed_when: true
+
+    - name: Show node post-upgrade test result
+      ansible.builtin.debug:
+        var: node_test_job.stdout_lines
@@ -0,0 +1,94 @@
+---
+- name: Upgrade Slurm controller OS safely
+  hosts: slurm_controller
+  become: true
+  gather_facts: true
+
+  tasks:
+    - name: Show cluster state before controller upgrade
+      ansible.builtin.shell: |
+        set -euo pipefail
+        scontrol ping
+        sinfo
+        squeue
+        systemctl is-active munge
+        systemctl is-active slurmctld
+        systemctl is-active slurmdbd || true
+        systemctl is-active mariadb || true
+      args:
+        executable: /bin/bash
+      register: before_state
+      changed_when: false
+
+    - name: Print cluster state before controller upgrade
+      ansible.builtin.debug:
+        var: before_state.stdout_lines
+
+    - name: Update apt cache
+      ansible.builtin.apt:
+        update_cache: true
+        cache_valid_time: 1800
+
+    - name: Full upgrade controller packages
+      ansible.builtin.apt:
+        upgrade: full
+        autoremove: true
+        autoclean: true
+      register: controller_upgrade
+
+    - name: Check if reboot is required
+      ansible.builtin.stat:
+        path: /var/run/reboot-required
+      register: controller_reboot_required
+
+    - name: Show controller upgrade status
+      ansible.builtin.debug:
+        msg:
+          - "Apt changed: {{ controller_upgrade.changed }}"
+          - "Reboot required: {{ controller_reboot_required.stat.exists }}"
+
+    - name: Reboot controller if required
+      ansible.builtin.reboot:
+        msg: "Reboot after controller OS upgrade"
+        reboot_timeout: 900
+        connect_timeout: 20
+        pre_reboot_delay: 5
+        post_reboot_delay: 30
+      when: controller_reboot_required.stat.exists
+
+    - name: Restart controller services
+      ansible.builtin.systemd:
+        name: "{{ item }}"
+        state: restarted
+        enabled: true
+      loop:
+        - munge
+        - mariadb
+        - slurmdbd
+        - slurmctld
+
+    - name: Wait for slurmctld
+      ansible.builtin.command:
+        cmd: scontrol ping
+      register: slurmctld_ping
+      retries: 20
+      delay: 3
+      until: slurmctld_ping.rc == 0
+      changed_when: false
+
+    - name: Validate controller after upgrade
+      ansible.builtin.shell: |
+        set -euo pipefail
+        scontrol ping
+        sinfo
+        squeue
+        scontrol show config | grep -E "AccountingStorage|JobAcctGather|TaskPlugin|ProctrackType"
+        sacct -S today --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList | tail -20
+      args:
+        executable: /bin/bash
+      register: controller_after
+      changed_when: false
+
+    - name: Print controller validation after upgrade
+      ansible.builtin.debug:
+        var: controller_after.stdout_lines
@@ -0,0 +1,207 @@
+---
+- name: Validate cluster after OS rolling upgrade
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Validate Slurm controller and cluster state
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "### slurmctld ping"
+        scontrol ping
+
+        echo
+        echo "### nodes"
+        sinfo -N
+
+        echo
+        echo "### partitions"
+        sinfo
+
+        echo
+        echo "### queue"
+        squeue
+
+        echo
+        echo "### important config"
+        scontrol show config | grep -E "AccountingStorage|JobAcctGather|TaskPlugin|ProctrackType|SelectType|ClusterName"
+
+        echo
+        echo "### accounting recent jobs"
+        sacct -S today --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
+      args:
+        executable: /bin/bash
+      register: cluster_state
+      changed_when: false
+
+    - name: Print cluster state
+      ansible.builtin.debug:
+        var: cluster_state.stdout_lines
+
+
+- name: Validate worker services after OS rolling upgrade
+  hosts: slurm_compute:slurm_gpu
+  become: true
+  gather_facts: true
+
+  tasks:
+    - name: Validate local worker services and Slurm connectivity
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        echo "HOST=$(hostname)"
+        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
+        echo "KERNEL=$(uname -r)"
+        echo "UPTIME=$(uptime -p)"
+
+        echo
+        echo "### services"
+        systemctl is-active munge
+        systemctl is-active slurmd
+
+        echo
+        echo "### munge local test"
+        munge -n | unmunge >/dev/null
+        echo "munge OK"
+
+        echo
+        echo "### controller ping"
+        scontrol ping
+
+        echo
+        echo "### local slurm.conf checksum"
+        sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true
+
+        echo
+        echo "### gpu check if present"
+        if command -v nvidia-smi >/dev/null 2>&1; then
+          nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader || true
+        else
+          echo "NO_NVIDIA_SMI"
+        fi
+      args:
+        executable: /bin/bash
+      register: worker_state
+      changed_when: false
+
+    - name: Print worker state
+      ansible.builtin.debug:
+        var: worker_state.stdout_lines
+
+
+- name: Submit post-upgrade CPU validation job
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Submit CPU validation job to debug partition
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=os-upgrade-cpu-test
+        #SBATCH --partition=debug
+        #SBATCH --cpus-per-task=1
+        #SBATCH --mem=256M
+        #SBATCH --time=00:02:00
+        #SBATCH --output=/shared/os-upgrade-cpu-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "USER=$(whoami)"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        echo "KERNEL=$(uname -r)"
+        date
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
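+        # Poll for up to 90 seconds while the job is still pending or running.
+        for i in $(seq 1 90); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        # Fail with a clear message instead of reading a missing or partial output file.
+        if squeue -h -j "$job_id" | grep -q .; then
+          echo "Job $job_id did not finish within 90 seconds" >&2
+          exit 1
+        fi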
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/os-upgrade-cpu-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: cpu_validation_job
+      changed_when: true
+
+    - name: Print CPU validation job
+      ansible.builtin.debug:
+        var: cpu_validation_job.stdout_lines
+
+
+- name: Submit post-upgrade GPU validation job
+  hosts: slurm_controller
+  become: true
+  gather_facts: false
+
+  tasks:
+    - name: Submit GPU validation job to gpu partition
+      ansible.builtin.shell: |
+        set -euo pipefail
+
+        job_id="$(
+          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
+        #!/bin/bash
+        #SBATCH --job-name=os-upgrade-gpu-test
+        #SBATCH --partition=gpu
+        #SBATCH --gres=gpu:1
+        #SBATCH --cpus-per-task=2
+        #SBATCH --mem=1G
+        #SBATCH --time=00:03:00
+        #SBATCH --output=/shared/os-upgrade-gpu-test-%j.out
+
+        echo "HOST=$(hostname)"
+        echo "USER=$(whoami)"
+        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
+        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
+        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
+        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
+        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
+        echo "KERNEL=$(uname -r)"
+        echo
+        nvidia-smi
+        SBATCH
+        )"
+
+        echo "JOB_ID=$job_id"
+
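+        # Poll for up to 120 seconds while the job is still pending or running.
+        for i in $(seq 1 120); do
+          if squeue -h -j "$job_id" | grep -q .; then
+            squeue -j "$job_id"
+            sleep 1
+          else
+            break
+          fi
+        done
+
+        # Fail with a clear message instead of reading a missing or partial output file.
+        if squeue -h -j "$job_id" | grep -q .; then
+          echo "Job $job_id did not finish within 120 seconds" >&2
+          exit 1
+        fi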
+
+        echo "### sacct"
+        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
+
+        echo "### output"
+        cat "/shared/os-upgrade-gpu-test-${job_id}.out"
+      args:
+        executable: /bin/bash
+      register: gpu_validation_job
+      changed_when: true
+
+    - name: Print GPU validation job
+      ansible.builtin.debug:
+        var: gpu_validation_job.stdout_lines
@@ -0,0 +1,15 @@
+# Codex prompt: generate repository documentation
+
+You are working in an Ansible repository that automates a Slurm AI/HPC lab.
+
+Please review the repository and generate or improve documentation under `docs/` with the following goals:
+
+1. Explain the architecture and repository layout.
+2. Document the end-to-end deployment sequence.
+3. Document operational workflows: provisioning, decommissioning, rolling upgrades, health checks and auto-remediation.
+4. Document SlurmDBD accounting, QOS, fairshare and priority workflows.
+5. Add troubleshooting notes based on the playbooks and templates.
+6. Avoid exposing secrets, real IP addresses, real hostnames, SQL dumps, backup archives, private keys or vault content.
+7. Keep all text in English.
+
+The output should be practical, operator-focused and suitable for a public Git repository.
@@ -0,0 +1,16 @@
+# Managed by Ansible
+# Slurm cgroup configuration
+
+CgroupPlugin=autodetect
+
+ConstrainCores=yes
+ConstrainRAMSpace=yes
+ConstrainSwapSpace=no
+ConstrainDevices=yes
+
+AllowedRAMSpace=100
+AllowedSwapSpace=0
+MaxRAMPercent=100
+MaxSwapPercent=0
+
+MinRAMSpace=30
@@ -0,0 +1,4 @@
+# Managed by Ansible
+{% for node in slurm_nodes if node.managed_state | default('present') == 'present' and node.gres | default('') | length > 0 %}
+NodeName={{ node.name }} Name=gpu File={{ node.gres_file | default('/dev/nvidia0') }}
+{% endfor %}
@@ -0,0 +1,67 @@
+# Managed by Ansible
+
+ClusterName={{ slurm_cluster_name }}
+SlurmctldHost={{ slurm_control_machine }}({{ slurm_control_addr }})
+
+SlurmUser={{ slurm_user }}
+AuthType=auth/munge
+StateSaveLocation=/var/spool/slurmctld
+SlurmdSpoolDir=/var/spool/slurmd
+SwitchType=switch/none
+MpiDefault={{ slurm_default_mpi_type }}
+ProctrackType={{ slurm_proctrack_type }}
+ReturnToService={{ slurm_return_to_service }}
+{% if slurm_gres_types is defined and slurm_gres_types | length > 0 %}
+GresTypes={{ slurm_gres_types }}
+{% endif %}
+
+SlurmctldPidFile=/run/slurmctld.pid
+SlurmdPidFile=/run/slurmd.pid
+SlurmctldPort={{ slurmctld_port }}
+SlurmdPort={{ slurmd_port }}
+
+TaskPlugin={{ slurm_task_plugin }}
+SelectType={{ slurm_select_type }}
+SelectTypeParameters={{ slurm_select_type_parameters }}
+
+SchedulerType=sched/backfill
+# Priority / fairshare
+PriorityType={{ slurm_priority_type | default('priority/multifactor') }}
+PriorityDecayHalfLife={{ slurm_priority_decay_half_life | default('7-0') }}
+PriorityCalcPeriod={{ slurm_priority_calc_period | default(5) }}
+PriorityFavorSmall={{ slurm_priority_favor_small | default('NO') }}
+PriorityWeightAge={{ slurm_priority_weight_age | default(1000) }}
+PriorityWeightFairshare={{ slurm_priority_weight_fairshare | default(10000) }}
+PriorityWeightJobSize={{ slurm_priority_weight_job_size | default(1000) }}
+PriorityWeightPartition={{ slurm_priority_weight_partition | default(1000) }}
+PriorityWeightQOS={{ slurm_priority_weight_qos | default(10000) }}
+PriorityMaxAge={{ slurm_priority_max_age | default('1-0') }}
+
+SlurmctldTimeout=120
+SlurmdTimeout=300
+InactiveLimit=0
+KillWait=30
+Waittime=0
+
+AccountingStorageType={{ slurm_accounting_storage_type }}
+{% if slurm_accounting_storage_type == "accounting_storage/slurmdbd" %}
+AccountingStorageHost={{ slurm_accounting_storage_host }}
+AccountingStoragePort={{ slurm_accounting_storage_port }}
+AccountingStorageEnforce={{ slurm_accounting_storage_enforce | default('associations,limits,qos') }}
+AccountingStorageTRES={{ slurm_accounting_storage_tres | default('cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu') }}
+{% endif %}
+JobAcctGatherType={{ slurm_job_acct_gather_type | default('jobacct_gather/none') }}
+JobCompType={{ slurm_job_comp_type }}
+
+SlurmctldDebug=info
+SlurmdDebug=info
+SlurmctldLogFile=/var/log/slurm/slurmctld.log
+SlurmdLogFile=/var/log/slurm/slurmd.log
+
+{% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
+NodeName={{ node.name }} NodeAddr={{ node.addr }} CPUs={{ node.cpus }}{% if node.topology | default('') | length > 0 %} {{ node.topology }}{% endif %} RealMemory={{ node.real_memory }}{% if node.gres | default('') | length > 0 %} Gres={{ node.gres }}{% endif %}{% if node.features | default('') | length > 0 %} Feature={{ node.features }}{% endif %} State=UNKNOWN
+{% endfor %}
+
+{% for partition in slurm_partitions %}
+PartitionName={{ partition.name }} Nodes={{ partition.nodes }} Default={{ partition.default }} MaxTime={{ partition.max_time }} State={{ partition.state }}
+{% endfor %}
@@ -0,0 +1,38 @@
+# Managed by Ansible
+# Slurm database daemon configuration
+
+AuthType=auth/munge
+
+DbdHost={{ slurmdbd_host }}
+DbdPort={{ slurmdbd_port }}
+
+SlurmUser={{ slurm_user }}
+
+DebugLevel=info
+LogFile=/var/log/slurm/slurmdbd.log
+PidFile=/run/slurmdbd.pid
+
+CommitDelay={{ slurmdbd_commit_delay | default(1) }}
+
+StorageType={{ slurmdbd_storage_type }}
+StorageHost={{ slurmdbd_storage_host }}
+StoragePort={{ slurmdbd_storage_port }}
+StorageLoc={{ slurmdbd_storage_loc }}
+StorageUser={{ slurmdbd_storage_user }}
+StoragePass={{ slurmdbd_storage_pass }}
+
+# Retention / purge policy
+PurgeEventAfter={{ slurmdbd_purge_event_after | default('12months') }}
+PurgeJobAfter={{ slurmdbd_purge_job_after | default('12months') }}
+PurgeResvAfter={{ slurmdbd_purge_resv_after | default('12months') }}
+PurgeStepAfter={{ slurmdbd_purge_step_after | default('3months') }}
+PurgeSuspendAfter={{ slurmdbd_purge_suspend_after | default('3months') }}
+PurgeTXNAfter={{ slurmdbd_purge_txn_after | default('12months') }}
+PurgeUsageAfter={{ slurmdbd_purge_usage_after | default('24months') }}
+
+ArchiveEvents={{ slurmdbd_archive_events | default('no') }}
+ArchiveJobs={{ slurmdbd_archive_jobs | default('no') }}
+ArchiveSteps={{ slurmdbd_archive_steps | default('no') }}
+ArchiveSuspend={{ slurmdbd_archive_suspend | default('no') }}
+ArchiveTXN={{ slurmdbd_archive_txn | default('no') }}
+ArchiveUsage={{ slurmdbd_archive_usage | default('no') }}
@@ -0,0 +1 @@
+Generated backups and reports can be stored here locally. This directory is ignored by git.