Document Slurm AI/HPC cluster project

Add Slurm AI/HPC cluster platform project
2026-06-04 19:54:43 +00:00 · 2026-06-04 19:42:45 +00:00
52 changed files with 4978 additions and 2 deletions
@@ -36,6 +36,7 @@
  - IBM AIX 7 role and playbook.
 - Shared sanitized Ansible inventory defaults for Linux and AIX examples.
 - Role-level task structure covering pre-checks, SSH, sudo, auditing, logging, services, filesystem controls, platform-specific settings, handlers, and post-check validation.
 - Slurm AI/HPC Cluster Automation Lab under `platform-projects`, covering Ansible-managed Slurm operations, GPU scheduling, cgroup enforcement, SlurmDBD accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
 ### Changed
@@ -42,6 +42,7 @@ It is a technical portfolio, not a production toolkit. The examples show how ope
 - [Known error matcher](./infra-run/scripts/python/known-error-matcher/) - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
 - [Python operational log analysis tools](./infra-run/scripts/python/) - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
 - [Ansible hardening examples](./infra-run/ansible/) - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
 - [Slurm AI/HPC cluster automation lab](./platform-projects/hpc-slurm-ai-cluster/) - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
 ## Planned Areas
@@ -106,4 +107,5 @@ See [infra-run/TESTED.md](./infra-run/TESTED.md) and [infra-run/KNOWN_LIMITATION
 - Veritas VxVM/VCS operational awareness.
 - GPFS / IBM Spectrum Scale operational awareness.
 - Ansible role organization for selected hardening controls.
 - Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
 - Clear documentation of what was tested and what still needs a real system.
@@ -1,8 +1,14 @@
 # platform-projects
-This directory is reserved for larger infrastructure platform topics and future case studies. The current implemented project is [infra-run](../infra-run/).
+This directory contains larger infrastructure platform topics and case studies. Most subdirectories are planning areas unless their own README says otherwise.
-Current subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
+## Implemented platform projects
 - [hpc-slurm-ai-cluster](./hpc-slurm-ai-cluster/) - Slurm AI/HPC cluster automation covering Ansible-managed Slurm operations, GPU scheduling with GRES, cgroup enforcement, SlurmDBD accounting, QOS/fairshare/priority, node lifecycle operations, rolling upgrades, and health remediation.
 ## Planning areas
 These subdirectories are intentionally light and should be read as planning areas unless their own README says otherwise:
 - `monitoring-zabbix`
 - `elk-log-analysis`
@@ -0,0 +1,236 @@
 # Slurm AI/HPC Cluster Automation Lab
 ## Executive summary
 This project builds and operates a small production-like Slurm AI/HPC cluster in a sanitized lab. It uses Ansible to bootstrap hosts, manage Munge authentication, deploy Slurm controller and worker configuration, integrate a GPU node through GRES, enable cgroup enforcement, configure accounting, apply QOS/fairshare policy, and run operational validation jobs.
 The goal is not to present a certified production platform. The goal is to show practical Linux, HPC, and SRE-style operational work: controlled automation, repeatable workflows, explicit checks, recovery steps, and evidence that the cluster behaves as expected.
 ## What this project demonstrates
 - Slurm controller and worker node management.
 - Munge authentication across the cluster.
 - GPU node integration through Slurm GRES.
 - cgroup CPU, memory, and GPU device enforcement.
 - SlurmDBD with MariaDB-backed accounting.
 - `sacct`, `sreport`, and `sacctmgr` workflows.
 - QOS, fairshare, and multifactor priority configuration.
 - Node provisioning and decommissioning workflows.
 - Rolling OS upgrades with canary validation.
 - Health checks and auto-remediation.
 - Backup and restore-check workflow for the accounting database.
 - Operational validation jobs for CPU, GPU, cgroup, accounting, and reporting behavior.
 ## Architecture overview
 ```mermaid
 flowchart LR
    operator[Ansible control node]
    munge[Munge authentication]
    controller[Slurm controller<br/>slurmctld]
    db[MariaDB + SlurmDBD<br/>accounting]
    shared[Shared filesystem<br/>site dependency]
    cpu_part[CPU partition]
    gpu_part[GPU partition]
    cpu_nodes[CPU compute nodes<br/>slurmd]
    gpu_node[GPU node<br/>slurmd + GRES]
    jobs[User jobs<br/>sbatch / srun]
    operator -->|bootstrap and configure| controller
    operator -->|configure workers| cpu_nodes
    operator -->|configure GPU worker| gpu_node
    operator -->|deploy key and service| munge
    munge --> controller
    munge --> cpu_nodes
    munge --> gpu_node
    controller -->|accounting RPC| db
    jobs -->|submit to Slurm| controller
    controller -->|schedule CPU jobs| cpu_part
    controller -->|schedule GPU jobs| gpu_part
    cpu_part --> cpu_nodes
    gpu_part --> gpu_node
    cpu_nodes --- shared
    gpu_node --- shared
    controller --- shared
 ```
 The lab models a common Slurm pattern: an Ansible control node manages a Slurm controller, CPU workers, a GPU worker, Munge authentication, SlurmDBD accounting, and policy configuration. CPU and GPU jobs flow through Slurm partitions; GPU access is declared through GRES and constrained with cgroups.
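One quick way to observe the GRES side of this from inside a job is to look at the GPU indices Slurm exports to the step. The sketch below is illustrative and not part of the repository: `count_visible_gpus` is a hypothetical helper, and it assumes the job environment carries a `CUDA_VISIBLE_DEVICES`-style comma-separated index list when a GPU GRES was granted.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): count the GPU indices listed
# in a CUDA_VISIBLE_DEVICES-style value such as "0" or "0,1".
count_visible_gpus() {
  local devices="${1:-}"
  if [ -z "$devices" ]; then
    # Empty or unset value: no GPU was exposed to this step.
    echo 0
    return 0
  fi
  local IFS=','
  # Split on commas only and count the resulting fields.
  # shellcheck disable=SC2086
  set -- $devices
  echo "$#"
}

# Inside a job step, pass whatever Slurm exported (empty when nothing was granted).
count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}"
```

With the lab's single `gpu:1` GRES, a job submitted with `--gres=gpu:1` would be expected to report 1, and a job without a GRES request would be expected to report 0. This only checks the environment; device-level denial is the job of the cgroup configuration.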
 ## Repository layout
 ```text
 inventories/lab/          Sanitized lab inventory and group variables
 playbooks/bootstrap/      Initial SSH, sudo, operator user, and host setup
 playbooks/core/           Munge, Slurm config, and safe restart workflows
 playbooks/accounting/     SlurmDBD, MariaDB, backup, restore-check, and reporting validation
 playbooks/qos/            QOS, fairshare, and priority configuration
 playbooks/lifecycle/      Node provisioning, inspection, and decommissioning
 playbooks/upgrade/        Canary and rolling OS upgrade workflows
 playbooks/health/         Health checks, repair, and auto-remediation
 playbooks/tests/          CPU, GPU, cgroup, accounting, and reporting validation jobs
 playbooks/backup/         Slurm and Munge state backup helpers
 templates/                Slurm, cgroup, GRES, and SlurmDBD templates
 docs/                     Runbook, interview notes, and troubleshooting cases
 prompts/                  Documentation prompts used to expand this project
 ```
 ## Main operational workflows
 Run commands from `platform-projects/hpc-slurm-ai-cluster/`. Review inventory and variables before running any playbook.
 ### Bootstrap access
 ```bash
 ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
 ansible-playbook playbooks/bootstrap/slurm-hosts.yml
 ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
 ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
 ```
 ### Deploy Munge
 ```bash
 ansible-playbook playbooks/core/manage-munge.yml
 ```
 ### Deploy Slurm config
 ```bash
 ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
 ansible-playbook playbooks/core/manage-slurm-config.yml --diff
 ansible-playbook playbooks/core/restart-slurm-safe.yml
 ```
 ### Validate CPU jobs
 ```bash
 ansible-playbook playbooks/tests/validate-slurm-operator.yml
 ansible-playbook playbooks/tests/test-cpu-job.yml
 ```
 ### Validate GPU jobs
 ```bash
 ansible-playbook playbooks/tests/test-gpu-job.yml
 ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
 ```
 ### Enable accounting
 ```bash
 ansible-playbook playbooks/accounting/setup-slurmdbd.yml
 ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
 ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
 ansible-playbook playbooks/tests/test-sreport-usage.yml
 ```
 ### Configure QOS and fairshare
 ```bash
 ansible-playbook playbooks/qos/configure-slurm-qos.yml
 ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
 ```
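For orientation, the multifactor plugin combines weighted factors, each normalized to the range 0.0 to 1.0. The sketch below is a simplified illustration, not the scheduler's implementation: `job_priority` is a hypothetical helper, the weights are the lab's group-variable defaults (age 1000, fairshare 10000, job size 1000, partition 1000, QOS 10000), the factor values are invented, and the site, association, TRES, and nice terms are ignored.

```bash
#!/usr/bin/env bash
# Hypothetical helper: weighted sum of normalized priority factors.
# Arguments (each 0.0-1.0): age fairshare job_size partition qos
job_priority() {
  awk -v age="$1" -v fs="$2" -v js="$3" -v part="$4" -v qos="$5" 'BEGIN {
    # Weights mirror the lab defaults; adjust them to match your slurm.conf.
    w_age = 1000; w_fs = 10000; w_js = 1000; w_part = 1000; w_qos = 10000
    printf "%d\n", w_age*age + w_fs*fs + w_js*js + w_part*part + w_qos*qos
  }'
}

# Invented factor values for a single job:
# 500 (age) + 2500 (fairshare) + 500 (job size) + 1000 (partition) + 5000 (QOS)
job_priority 0.5 0.25 0.5 1.0 0.5   # prints 9500
```

Because fairshare and QOS carry ten times the weight of the other factors in this lab, a change in either one moves the result far more than job age or size does. On a live cluster, `sprio` shows the per-factor breakdown the scheduler actually computed.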
 ### Provision a node
 ```bash
 ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=<node>
 ansible-playbook playbooks/tests/test-specific-node.yml -e target_node=<node>
 ```
 ### Decommission a node
 ```bash
 ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml \
  -e target_node=<node> \
  -e "decom_reason=planned maintenance"
 ```
 ### Rolling OS upgrade
 ```bash
 ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=<node>
 ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml \
  -e canary_node=<node> \
  -e skip_canary=true
 ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
 ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
 ```
 ### Health check and auto-remediation
 ```bash
 ansible-playbook playbooks/health/check-slurm-health.yml
 ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
 ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=<node>
 ```
 ### Accounting backup and restore-check
 ```bash
 ansible-playbook playbooks/accounting/backup-slurmdbd.yml
 ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
 ```
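The backup playbook rejects dumps smaller than 1024 bytes. A local pre-check along the same lines might look like the sketch below. `backup_looks_valid` is a hypothetical helper, and the `gzip -t` integrity test is an addition for illustration rather than something the playbooks are known to run.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file exists, meets a minimum size
# (1024 bytes by default, matching the playbook threshold), and is a readable
# gzip stream.
backup_looks_valid() {
  local file="$1" min_bytes="${2:-1024}"
  [ -f "$file" ] || return 1
  local size
  size=$(( $(wc -c < "$file") ))
  [ "$size" -ge "$min_bytes" ] || return 1
  gzip -t "$file" 2>/dev/null
}

# Self-contained demo using a throwaway file instead of a real dump.
demo_dir="$(mktemp -d)"
head -c 4096 /dev/urandom | gzip -9 > "$demo_dir/sample.sql.gz"
if backup_looks_valid "$demo_dir/sample.sql.gz"; then
  echo "sample backup passes the size and gzip checks"
fi
rm -rf "$demo_dir"
```

A size and integrity check only shows that the archive is readable. The restore-check playbook remains the stronger test because it imports the dump into a separate database.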
 ## Operational maturity
 This is more than a toy lab because it includes operational controls around the cluster, not only a static `slurm.conf` example.
 - Ansible workflows are designed to be repeatable and readable.
 - Configuration deployment supports check and diff review before applying changes.
 - Validation jobs exercise CPU scheduling, GPU scheduling, cgroup behavior, accounting, and reporting, and fail when the expected result is missing.
 - SlurmDBD and MariaDB accounting are configured with `sacct`, `sreport`, and `sacctmgr` validation.
 - QOS, fairshare, priority, and association workflows show resource governance.
 - Node lifecycle playbooks drain, decommission, reprovision, resume, and validate nodes.
 - Rolling upgrade playbooks include canary validation before broader worker upgrades.
 - Health and repair playbooks document remediation paths for common node states.
 - Backup and restore-check playbooks verify that accounting data can be dumped and imported into a test database.
 - Troubleshooting cases document real lab failure modes without exposing private infrastructure details.
 ## Tested capabilities
 - [x] CPU job scheduling.
 - [x] GPU job scheduling.
 - [x] GPU denial when no GRES is requested.
 - [x] CPU cgroup enforcement.
 - [x] SlurmDBD accounting setup.
 - [x] `sacct` job history visibility.
 - [x] `sreport` usage reporting.
 - [x] QOS creation and validation.
 - [x] Fairshare and priority visibility.
 - [x] Node decommission and reprovision workflow.
 - [x] Rolling upgrade canary workflow.
 - [x] Node health check and auto-remediation workflow.
 These checks represent sanitized lab validation, not a claim of production certification.
 ## Safety and sanitization
 This repository is prepared for public portfolio review. Inventory values are examples, and the sample `10.10.10.x` addresses are sanitized lab placeholders.
 Do not commit real inventories, internal hostnames, private IP plans, Munge keys, SSH private keys, database dumps, generated backup archives, or Ansible Vault files. Real credentials, including SlurmDBD database passwords, belong in Ansible Vault or another approved secret store.
 Generated backup artifacts are intentionally excluded from the repository. Treat backup paths and database names in playbooks as examples that must be reviewed before use in a real environment.
 ## Why this matters for AI/HPC infrastructure roles
 AI and HPC platforms depend on more than GPU hardware. They need Linux system ownership, scheduler operations, authentication, resource isolation, accounting, upgrade discipline, and a clear recovery path when nodes drift or fail.
 This project demonstrates practical understanding of:
 - Linux systems operations.
 - Slurm cluster operations.
 - GPU infrastructure and GRES scheduling.
 - Job scheduling and resource isolation.
 - Accounting, reporting, QOS, fairshare, and priority policy.
 - Automation and repeatability with Ansible.
 - Troubleshooting and operational ownership.
 ## Deeper docs
 - [Runbook](docs/runbook.md)
 - [Interview cheatsheet](docs/interview-cheatsheet.md)
 - [Troubleshooting cases](docs/troubleshooting-cases.md)
@@ -0,0 +1,14 @@
 [defaults]
 inventory = ./inventories/lab/inventory.yml
 host_key_checking = False
 retry_files_enabled = False
 stdout_callback = default
 callback_result_format = yaml
 interpreter_python = auto_silent
 timeout = 30
 roles_path = ./roles
 collections_path = ./collections
 [ssh_connection]
 pipelining = True
 ssh_args = -o ControlMaster=auto -o ControlPersist=60s
@@ -0,0 +1 @@
 Generated backups and reports can be stored here locally. This directory is ignored by git.
@@ -0,0 +1,22 @@
 # Interview Cheatsheet: Slurm AI/HPC Lab
 ## One-minute summary
 I built an Ansible-managed Slurm AI/HPC lab with a controller, CPU compute nodes and a GPU node. The lab includes Munge authentication, cgroup-based CPU/GPU enforcement, GRES GPU scheduling, SlurmDBD accounting backed by MariaDB, QOS/fairshare/priority policies, rolling OS upgrades, node provisioning/decommissioning and health remediation workflows.
 ## Topics I can discuss
 - How Slurm schedules CPU and GPU workloads.
 - Difference between GRES scheduling and cgroup device enforcement.
 - Why Munge key consistency matters.
 - How `slurmdbd`, `sacct`, `sacctmgr` and `sreport` fit together.
 - How QOS, account associations, fairshare and multifactor priority work.
 - Operational workflows: drain, decommission, provision, rolling upgrade, canary test and auto-remediation.
 ## Real troubleshooting examples
 - `IDLE+NOT_RESPONDING` after node reprovisioning.
 - Accounting delay where `sacct` temporarily showed `PENDING` while job output existed.
 - Missing `gres/gpu` TRES before QOS GPU limits could be configured.
 - `sacctmgr` idempotency issues such as `Nothing new added`.
 - Slurm version differences around state transitions such as `RESUME`, `UNDRAIN` and `IDLE`.
@@ -0,0 +1,75 @@
 # Slurm AI/HPC Lab Runbook
 ## Standard deployment order
 ```bash
 ansible-playbook playbooks/bootstrap/bootstrap-ansible.yml --ask-pass --ask-become-pass
 ansible-playbook playbooks/bootstrap/slurm-hosts.yml
 ansible-playbook playbooks/bootstrap/slurmuser-ssh-mesh.yml
 ansible-playbook playbooks/bootstrap/slurmuser-sudoers-fix.yml
 ansible-playbook playbooks/core/manage-munge.yml
 ansible-playbook playbooks/core/manage-slurm-config.yml --check --diff
 ansible-playbook playbooks/core/manage-slurm-config.yml --diff
 ansible-playbook playbooks/core/restart-slurm-safe.yml
 ansible-playbook playbooks/tests/validate-slurm-operator.yml
 ansible-playbook playbooks/tests/test-cpu-job.yml
 ansible-playbook playbooks/tests/test-gpu-job.yml
 ansible-playbook playbooks/tests/test-gpu-deny-without-gres.yml
 ansible-playbook playbooks/accounting/setup-slurmdbd.yml
 ansible-playbook playbooks/accounting/initialize-slurm-accounting.yml
 ansible-playbook playbooks/accounting/backup-slurmdbd.yml
 ansible-playbook playbooks/accounting/restore-check-slurmdbd.yml
 ansible-playbook playbooks/accounting/validate-slurm-accounting.yml
 ansible-playbook playbooks/qos/configure-slurm-qos.yml
 ansible-playbook playbooks/qos/validate-slurm-qos-priority.yml
 ansible-playbook playbooks/health/check-slurm-health.yml
 ```
 ## Node lifecycle
 Provision a node:
 ```bash
 ansible-playbook playbooks/lifecycle/provision-slurm-node.yml -e target_node=slurm-c02
 ```
 Decommission a node:
 ```bash
 ansible-playbook playbooks/lifecycle/decommission-slurm-node.yml -e target_node=slurm-c02 -e "decom_reason=planned maintenance"
 ```
 Repair a node:
 ```bash
 ansible-playbook playbooks/health/repair-slurm-node.yml -e target_node=slurm-c02
 ```
 Run health remediation for nodes that can be recovered by the automated workflow:
 ```bash
 ansible-playbook playbooks/health/auto-remediate-slurm-health.yml
 ```
 Back up Slurm and Munge state before planned lifecycle work:
 ```bash
 ansible-playbook playbooks/backup/backup-slurm-state.yml
 ansible-playbook playbooks/backup/fetch-slurm-backups.yml
 ```
 ## Rolling OS upgrade
 ```bash
 ansible-playbook playbooks/upgrade/canary-slurm-node-upgrade.yml -e canary_node=slurm-c02
 ansible-playbook playbooks/upgrade/rolling-upgrade-slurm-workers.yml -e canary_node=slurm-c02 -e skip_canary=true
 ansible-playbook playbooks/upgrade/upgrade-slurm-controller.yml
 ansible-playbook playbooks/upgrade/validate-after-os-upgrade.yml
 ```
 If `upgrade-slurm-controller.yml` is not present, create it from the documented controller upgrade workflow or keep controller upgrades manual.
@@ -0,0 +1,28 @@
 # Troubleshooting Cases
 ## `IDLE+NOT_RESPONDING` after node maintenance
 Symptoms: `sinfo` shows `idle*` or `scontrol show node` shows `IDLE+NOT_RESPONDING`.
 Actions:
 ```bash
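 # On the affected compute node: restart authentication first, then the worker daemon.
 systemctl restart munge
 systemctl restart slurmd
 # On the controller:
 systemctl restart slurmctld
 # Clear the node state. Accepted state names differ between Slurm versions,
 # so each attempt is allowed to fail.
 scontrol update NodeName=<node> State=RESUME || true
 scontrol update NodeName=<node> State=UNDRAIN || true
 scontrol update NodeName=<node> State=IDLE || true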
 ```
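To list affected nodes before acting, the trailing `*` in the compact state can be filtered out of `sinfo` output. This is a minimal sketch: `non_responding_nodes` is a hypothetical helper that expects `<node> <state>` pairs on stdin, which is the shape produced by `sinfo -N -h -o '%N %t'`. The sample data below is invented.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read "<node> <state>" lines on stdin and print each
# node whose compact state ends with "*" (not responding), once per node.
non_responding_nodes() {
  awk '$2 ~ /\*$/ {print $1}' | sort -u
}

# Invented sample; a node can appear more than once when it belongs to
# several partitions.
sample='slurm-c01 idle
slurm-c02 idle*
gpu01 mix
slurm-c02 idle*'

printf '%s\n' "$sample" | non_responding_nodes   # prints slurm-c02
```

On a live controller the same function could be fed directly, for example `sinfo -N -h -o '%N %t' | non_responding_nodes`.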
 ## Missing GPU TRES
 Symptoms: `sacctmgr` fails with `no TRES known by type gres/gpu`.
 Fix: add `AccountingStorageTRES=...,gres/gpu`, restart/reconfigure Slurm, run a GPU job and verify with `sacctmgr show tres`.
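The verification step can be scripted so that QOS configuration only proceeds once the TRES is known. The sketch below assumes pipe-delimited `Type|Name|ID` rows, as produced by `sacctmgr -n -P show tres`. `has_gpu_tres` is a hypothetical helper and the sample rows and IDs are invented.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exit 0 if a gres/gpu row is present in pipe-delimited
# "Type|Name|ID" input on stdin, exit 1 otherwise.
has_gpu_tres() {
  awk -F'|' '$1 == "gres" && $2 == "gpu" {found = 1} END {exit !found}'
}

# Invented sample rows standing in for real sacctmgr output.
sample_tres='cpu||1
mem||2
gres|gpu|1001'

if printf '%s\n' "$sample_tres" | has_gpu_tres; then
  echo "gres/gpu is tracked"
else
  echo "gres/gpu is missing"
fi
```

Against a running SlurmDBD, the equivalent check would be `sacctmgr -n -P show tres | has_gpu_tres`.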
 ## SlurmDBD objects already exist
 Symptoms: `sacctmgr` returns `Nothing new added` or `Already existing`.
 Fix: make the Ansible tasks idempotent. Attempt the change, tolerate known existing-object messages, then normalize state with `modify`.
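The "tolerate known messages" step can be sketched as a small classifier over captured `sacctmgr` output. `classify_sacctmgr_output` is a hypothetical helper. The strings it matches are the ones quoted in this case plus the `Adding` and `Modified` prefixes that the lab playbooks already test for. Other Slurm versions may word these messages differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map captured sacctmgr output to a change status that
# an automation wrapper can act on.
classify_sacctmgr_output() {
  local output="$1"
  case "$output" in
    # Known "object already there" messages: treat as success, no change.
    *"Nothing new added"*|*"Already existing"*) echo "unchanged" ;;
    # Messages that indicate a real modification.
    *"Adding "*|*"Modified "*)                  echo "changed" ;;
    # Anything else should be surfaced to the operator.
    *)                                          echo "unknown" ;;
  esac
}

classify_sacctmgr_output " Nothing new added."   # prints unchanged
classify_sacctmgr_output " Adding QOS(s)"        # prints changed
```

In an Ansible task, the same idea maps onto `changed_when` for the "changed" case and `failed_when` for the "unknown" case.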
@@ -0,0 +1,128 @@
 ---
 # Example lab inventory variables. Replace addresses, users and node topology for your environment.
 slurm_cluster_name: labcluster
 slurm_control_machine: slurm-ctl01
 slurm_control_addr: 10.10.10.11
 slurm_config_dir: /etc/slurm
 slurm_user: slurm
 slurm_operator_user: slurmuser
 slurmctld_port: 6817
 slurmd_port: 6818
 slurm_job_comp_type: jobcomp/none
 slurm_select_type: select/cons_tres
 slurm_select_type_parameters: CR_Core_Memory
 slurm_return_to_service: 2
 slurm_default_mpi_type: none
 slurm_gres_types: gpu
 slurm_nodes:
  - name: slurm-c01
    managed_state: present
    addr: 10.10.10.12
    cpus: 2
    real_memory: 1800
    features: ""
    gres: ""
    topology: ""
  - name: slurm-c02
    managed_state: present
    addr: 10.10.10.13
    cpus: 2
    real_memory: 1800
    features: ""
    gres: ""
    topology: ""
  - name: gpu01
    managed_state: present
    addr: 10.10.10.14
    cpus: 12
    real_memory: 60000
    features: "gpu"
    gres: "gpu:1"
    gres_file: /dev/nvidia0
    topology: "Boards=1 SocketsPerBoard=1 CoresPerSocket=6 ThreadsPerCore=2"
 slurm_partitions:
  - name: debug
    managed_state: present
    nodes: "slurm-c[01-02]"
    default: "YES"
    max_time: "INFINITE"
    state: "UP"
  - name: gpu
    managed_state: present
    nodes: "gpu01"
    default: "NO"
    max_time: "INFINITE"
    state: "UP"
  - name: all
    managed_state: present
    nodes: "slurm-c[01-02],gpu01"
    default: "NO"
    max_time: "INFINITE"
    state: "UP"
 # Cgroup enforcement
 slurm_enable_cgroup: true
 slurm_task_plugin: task/cgroup,task/affinity
 slurm_proctrack_type: proctrack/cgroup
 slurm_job_acct_gather_type: jobacct_gather/cgroup
 # Slurm accounting / SlurmDBD
 slurm_accounting_storage_type: accounting_storage/slurmdbd
 slurm_accounting_storage_host: slurm-ctl01
 slurm_accounting_storage_port: 6819
 slurm_accounting_storage_enforce: associations,limits,qos
 slurm_accounting_storage_tres: cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu
 slurmdbd_host: slurm-ctl01
 slurmdbd_port: 6819
 slurmdbd_storage_type: accounting_storage/mysql
 slurmdbd_storage_host: localhost
 slurmdbd_storage_port: 3306
 slurmdbd_storage_loc: slurm_acct_db
 slurmdbd_storage_user: slurm
 # Use Ansible Vault in real environments. See inventories/lab/group_vars/vault.example.yml
 slurmdbd_storage_pass: "{{ vault_slurmdbd_storage_pass | default('CHANGE_ME_USE_ANSIBLE_VAULT') }}"
 slurm_account_name: lab
 slurm_account_description: "AI/HPC Slurm lab account"
 slurm_account_organization: "labcluster"
 # SlurmDBD purge / retention policy for lab
 slurmdbd_commit_delay: 1
 slurmdbd_purge_event_after: 12months
 slurmdbd_purge_job_after: 12months
 slurmdbd_purge_resv_after: 12months
 slurmdbd_purge_step_after: 3months
 slurmdbd_purge_suspend_after: 3months
 slurmdbd_purge_txn_after: 12months
 slurmdbd_purge_usage_after: 24months
 # Archive is disabled for the lab; backup playbooks handle database dumps.
 slurmdbd_archive_events: no
 slurmdbd_archive_jobs: no
 slurmdbd_archive_steps: no
 slurmdbd_archive_suspend: no
 slurmdbd_archive_txn: no
 slurmdbd_archive_usage: no
 # Slurm priority / fairshare
 slurm_priority_type: priority/multifactor
 slurm_priority_decay_half_life: 7-0
 slurm_priority_calc_period: 5
 slurm_priority_favor_small: "NO"
 slurm_priority_weight_age: 1000
 slurm_priority_weight_fairshare: 10000
 slurm_priority_weight_job_size: 1000
 slurm_priority_weight_partition: 1000
 slurm_priority_weight_qos: 10000
 slurm_priority_max_age: 1-0
@@ -0,0 +1,5 @@
 ---
 # Copy this file to vault.yml and encrypt it with ansible-vault.
 # ansible-vault encrypt inventories/lab/group_vars/vault.yml
 vault_slurmdbd_storage_pass: CHANGE_ME
@@ -0,0 +1,24 @@
 all:
  vars:
    ansible_ssh_common_args: '-o StrictHostKeyChecking=no'
  children:
    slurm_cluster:
      children:
        slurm_controller:
          hosts:
            slurm-ctl01:
              ansible_host: 10.10.10.11
              ansible_user: ansible
        slurm_compute:
          hosts:
            slurm-c01:
              ansible_host: 10.10.10.12
              ansible_user: ansible
            slurm-c02:
              ansible_host: 10.10.10.13
              ansible_user: ansible
        slurm_gpu:
          hosts:
            gpu01:
              ansible_host: 10.10.10.14
              ansible_user: ansible
@@ -0,0 +1,90 @@
 ---
 - name: Backup SlurmDBD MariaDB database
  hosts: slurm_controller
  become: true
  gather_facts: true
  vars:
    slurmdbd_backup_dir: /var/backups/slurmdbd
    local_fetch_dir: "{{ playbook_dir }}/../../artifacts/backups/slurmdbd"
  tasks:
    - name: Create remote backup directory
      ansible.builtin.file:
        path: "{{ slurmdbd_backup_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0700"
    - name: Create local fetch directory on Ansible controller
      ansible.builtin.file:
        path: "{{ local_fetch_dir }}"
        state: directory
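        # Ownership is left to the invoking user: this task runs on localhost
        # with become: false, so it cannot chown the directory to root.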
        mode: "0700"
      delegate_to: localhost
      become: false
    - name: Validate MariaDB is running
      ansible.builtin.command:
        cmd: systemctl is-active mariadb
      changed_when: false
    - name: Validate SlurmDBD is running
      ansible.builtin.command:
        cmd: systemctl is-active slurmdbd
      changed_when: false
    - name: Validate Slurm accounting database exists
      ansible.builtin.shell: |
        set -euo pipefail
        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';" | grep -qx "{{ slurmdbd_storage_loc }}"
      args:
        executable: /bin/bash
      changed_when: false
    - name: Dump Slurm accounting database
      ansible.builtin.shell: |
        set -euo pipefail
        ts="$(date +%F-%H%M%S)"
        out="{{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-${ts}.sql.gz"
        mysqldump \
          --single-transaction \
          --routines \
          --events \
          --triggers \
          {{ slurmdbd_storage_loc }} | gzip -9 > "$out"
        chmod 0600 "$out"
        echo "$out"
      args:
        executable: /bin/bash
      register: db_dump
      changed_when: true
    - name: Validate backup file is non-empty
      ansible.builtin.stat:
        path: "{{ db_dump.stdout }}"
      register: backup_file
    - name: Fail if backup file is suspiciously small
      ansible.builtin.fail:
        msg: "Backup file is smaller than 1024 bytes and is likely empty or truncated: {{ db_dump.stdout }}"
      when: backup_file.stat.size | int < 1024
    - name: Fetch DB backup to Ansible controller
      ansible.builtin.fetch:
        src: "{{ db_dump.stdout }}"
        dest: "{{ local_fetch_dir }}/"
        flat: true
    - name: Show DB backup result
      ansible.builtin.debug:
        msg:
          - "Remote backup: {{ db_dump.stdout }}"
          - "Backup size bytes: {{ backup_file.stat.size }}"
          - "Fetched to: {{ local_fetch_dir }}/"
@@ -0,0 +1,126 @@
 ---
 - name: Initialize Slurm accounting entities
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Wait for sacctmgr connectivity
      ansible.builtin.command:
        cmd: sacctmgr -n list cluster
      register: sacctmgr_cluster_list
      retries: 20
      delay: 2
      until: sacctmgr_cluster_list.rc == 0
      changed_when: false
    - name: Show current accounting state before changes
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org
        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin
        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: accounting_state_before
      changed_when: false
    - name: Print current accounting state before changes
      ansible.builtin.debug:
        var: accounting_state_before.stdout_lines
    - name: Ensure Slurm cluster exists in accounting DB
      ansible.builtin.shell: |
        set -euo pipefail
        if sacctmgr -n list cluster format=Cluster | awk '{print $1}' | grep -qx "{{ slurm_cluster_name }}"; then
          echo "Cluster {{ slurm_cluster_name }} already exists"
        else
          sacctmgr -i add cluster {{ slurm_cluster_name }}
        fi
      args:
        executable: /bin/bash
      register: ensure_cluster
      changed_when: "'Adding Cluster' in ensure_cluster.stdout"
    - name: Ensure default lab account exists for cluster
      ansible.builtin.shell: |
        set -euo pipefail
        if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="" {found=1} END {exit !found}'; then
          echo "Account {{ slurm_account_name }} already associated with cluster {{ slurm_cluster_name }}"
        else
          sacctmgr -i add account {{ slurm_account_name }} \
            Cluster={{ slurm_cluster_name }} \
            Description="{{ slurm_account_description }}" \
            Organization="{{ slurm_account_organization }}"
        fi
      args:
        executable: /bin/bash
      register: ensure_account
      changed_when: "'Adding Account' in ensure_account.stdout"
    - name: Ensure slurmuser exists with lab account association
      ansible.builtin.shell: |
        set -euo pipefail
        if sacctmgr -n list assoc format=Cluster,Account,User | awk '$1=="{{ slurm_cluster_name }}" && $2=="{{ slurm_account_name }}" && $3=="slurmuser" {found=1} END {exit !found}'; then
          echo "User slurmuser already associated with account {{ slurm_account_name }} on cluster {{ slurm_cluster_name }}"
        else
          sacctmgr -i add user slurmuser \
            Cluster={{ slurm_cluster_name }} \
            Account={{ slurm_account_name }} \
            DefaultAccount={{ slurm_account_name }}
        fi
      args:
        executable: /bin/bash
      register: ensure_user_assoc
      changed_when: "'Adding User' in ensure_user_assoc.stdout"
    - name: Ensure slurmuser has default account set
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
      args:
        executable: /bin/bash
      register: set_default_account
      changed_when: "'Modified user' in (set_default_account.stdout + set_default_account.stderr)"
    - name: Show final accounting state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org
        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin
        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: accounting_state_after
      changed_when: false
    - name: Print final accounting state
      ansible.builtin.debug:
        var: accounting_state_after.stdout_lines
@@ -0,0 +1,98 @@
 ---
 - name: Restore-check latest SlurmDBD backup into test database
  hosts: slurm_controller
  become: true
  gather_facts: false
  vars:
    restore_check_db: "{{ slurmdbd_storage_loc }}_restorecheck"
    slurmdbd_backup_dir: /var/backups/slurmdbd
  tasks:
    - name: Validate MariaDB is running
      ansible.builtin.command:
        cmd: systemctl is-active mariadb
      changed_when: false
    - name: Find latest SlurmDBD backup
      ansible.builtin.shell: |
        set -euo pipefail
        ls -1t {{ slurmdbd_backup_dir }}/{{ slurmdbd_storage_loc }}-*.sql.gz | head -n 1
      args:
        executable: /bin/bash
      register: latest_backup
      changed_when: false
    - name: Validate latest backup exists
      ansible.builtin.stat:
        path: "{{ latest_backup.stdout }}"
      register: latest_backup_stat
    - name: Fail if latest backup is missing or empty
      ansible.builtin.fail:
        msg: "Latest SlurmDBD backup is missing or empty: {{ latest_backup.stdout }}"
      when:
        - not latest_backup_stat.stat.exists or latest_backup_stat.stat.size | int < 1024
    - name: Recreate restore-check database
      ansible.builtin.shell: |
        set -euo pipefail
        mysql <<SQL
        DROP DATABASE IF EXISTS {{ restore_check_db }};
        CREATE DATABASE {{ restore_check_db }};
        SQL
      args:
        executable: /bin/bash
      changed_when: true
    - name: Import backup into restore-check database
      ansible.builtin.shell: |
        set -euo pipefail
        zcat "{{ latest_backup.stdout }}" | mysql {{ restore_check_db }}
      args:
        executable: /bin/bash
      changed_when: true
    - name: Validate restored table count
      ansible.builtin.shell: |
        set -euo pipefail
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
      args:
        executable: /bin/bash
      register: restored_tables
      changed_when: false
      failed_when: restored_tables.stdout | int < 1
    - name: Validate restored row count sample
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### restored database"
        echo "{{ restore_check_db }}"
        echo
        echo "### table count"
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ restore_check_db }}';"
        echo
        echo "### largest tables"
        mysql -N -B -e "
          SELECT table_name, table_rows
          FROM information_schema.tables
          WHERE table_schema='{{ restore_check_db }}'
          ORDER BY table_rows DESC
          LIMIT 10;
        "
      args:
        executable: /bin/bash
      register: restore_check_summary
      changed_when: false
    - name: Show restore-check result
      ansible.builtin.debug:
        msg:
          - "Imported backup: {{ latest_backup.stdout }}"
          - "Restore-check DB: {{ restore_check_db }}"
          - "Restored tables: {{ restored_tables.stdout }}"
          - "Summary:"
          - "{{ restore_check_summary.stdout_lines }}"
@@ -0,0 +1,105 @@
 ---
 - name: Install and configure MariaDB for SlurmDBD
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Install MariaDB and SlurmDBD packages
      ansible.builtin.apt:
        name:
          - mariadb-server
          - mariadb-client
          - slurmdbd
          - slurm-wlm-mysql-plugin
        state: present
        update_cache: true
    - name: Ensure MariaDB is enabled and running
      ansible.builtin.systemd:
        name: mariadb
        enabled: true
        state: started
    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
    - name: Create Slurm accounting database and DB user
      ansible.builtin.shell: |
        set -euo pipefail
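        # Quote the heredoc delimiter so the shell does not expand $ or backticks in the rendered password.
        mysql <<'SQL'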
        CREATE DATABASE IF NOT EXISTS {{ slurmdbd_storage_loc }};
        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'localhost' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
        CREATE USER IF NOT EXISTS '{{ slurmdbd_storage_user }}'@'127.0.0.1' IDENTIFIED BY '{{ slurmdbd_storage_pass }}';
        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'localhost';
        GRANT ALL PRIVILEGES ON {{ slurmdbd_storage_loc }}.* TO '{{ slurmdbd_storage_user }}'@'127.0.0.1';
        FLUSH PRIVILEGES;
        SQL
      args:
        executable: /bin/bash
      changed_when: true
      no_log: true
    - name: Ensure /etc/slurm exists
      ansible.builtin.file:
        path: /etc/slurm
        state: directory
        owner: root
        group: root
        mode: "0755"
    - name: Deploy slurmdbd.conf
      ansible.builtin.template:
        src: ../../templates/slurmdbd.conf.j2
        dest: /etc/slurm/slurmdbd.conf
        owner: slurm
        group: slurm
        mode: "0600"
      notify:
        - Restart slurmdbd
    - name: Ensure slurmdbd is enabled and running
      ansible.builtin.systemd:
        name: slurmdbd
        enabled: true
        state: started
    - name: Flush handlers before validation
      ansible.builtin.meta: flush_handlers
    - name: Validate slurmdbd service is active
      ansible.builtin.command:
        cmd: systemctl is-active slurmdbd
      register: slurmdbd_active
      retries: 10
      delay: 2
      until: slurmdbd_active.stdout == "active"
      changed_when: false
    - name: Validate slurmdbd is listening on port
      ansible.builtin.shell: |
        set -euo pipefail
        ss -lntp | grep ':{{ slurmdbd_port }} '
      args:
        executable: /bin/bash
      register: slurmdbd_port_check
      retries: 10
      delay: 2
      until: slurmdbd_port_check.rc == 0
      changed_when: false
    - name: Show slurmdbd service validation
      ansible.builtin.debug:
        msg:
          - "slurmdbd is active"
          - "{{ slurmdbd_port_check.stdout_lines }}"
  handlers:
    - name: Restart slurmdbd
      ansible.builtin.systemd:
        name: slurmdbd
        state: restarted
@@ -0,0 +1,178 @@
 ---
 - name: Validate Slurm accounting production-like setup
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Validate accounting services
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### services"
        systemctl is-active mariadb
        systemctl is-active slurmdbd
        systemctl is-active slurmctld
        echo
        echo "### slurmdbd listener"
        ss -lntp | grep ':{{ slurmdbd_port | default(6819) }} '
      args:
        executable: /bin/bash
      register: service_check
      changed_when: false
    - name: Validate Slurm accounting runtime config
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### accounting config"
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|ClusterName"
        echo
        echo "### priority / select / cgroup config"
        scontrol show config | grep -E "PriorityType|SelectType|TaskPlugin|ProctrackType"
      args:
        executable: /bin/bash
      register: config_check
      changed_when: false
    - name: Validate sacctmgr entities
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### clusters"
        sacctmgr list cluster format=Cluster,ControlHost,ControlPort,RPC
        echo
        echo "### accounts"
        sacctmgr list account format=Account,Descr,Org
        echo
        echo "### users"
        sacctmgr list user format=User,DefaultAccount,Admin
        echo
        echo "### associations"
        sacctmgr list assoc format=Cluster,Account,User,Partition,Share,QOS,DefaultQOS
      args:
        executable: /bin/bash
      register: entity_check
      changed_when: false
    - name: Submit accounting validation job
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=acct-prodlike-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/acct-prodlike-test-%j.out
        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        # Fail with a clear message if the job is still queued or running after the wait loop.
        if squeue -h -j "$job_id" 2>/dev/null | grep -q .; then
          echo "Job $job_id did not finish within 90 seconds" >&2
          exit 1
        fi
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/acct-prodlike-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: acct_job
      changed_when: true
    - name: Validate sacct can read recent jobs
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### recent jobs"
        sacct -S today --format=JobID,JobName,User,Account,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
      args:
        executable: /bin/bash
      register: sacct_recent
      changed_when: false
    - name: Validate sreport commands
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### cluster utilization"
        sreport cluster utilization start=today || true
        echo
        echo "### account utilization by user"
        sreport cluster AccountUtilizationByUser start=today || true
        echo
        echo "### user top"
        sreport user top start=today || true
      args:
        executable: /bin/bash
      register: sreport_check
      changed_when: false
    - name: Validate MariaDB table health summary
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### database exists"
        mysql -N -B -e "SHOW DATABASES LIKE '{{ slurmdbd_storage_loc }}';"
        echo
        echo "### table count"
        mysql -N -B -e "SELECT COUNT(*) FROM information_schema.tables WHERE table_schema='{{ slurmdbd_storage_loc }}';"
        echo
        echo "### largest tables"
        mysql -N -B -e "
          SELECT table_name, table_rows
          FROM information_schema.tables
          WHERE table_schema='{{ slurmdbd_storage_loc }}'
          ORDER BY table_rows DESC
          LIMIT 10;
        "
      args:
        executable: /bin/bash
      register: db_health
      changed_when: false
    - name: Print accounting validation
      ansible.builtin.debug:
        msg:
          - "### services"
          - "{{ service_check.stdout_lines }}"
          - "### runtime config"
          - "{{ config_check.stdout_lines }}"
          - "### accounting entities"
          - "{{ entity_check.stdout_lines }}"
          - "### accounting validation job"
          - "{{ acct_job.stdout_lines }}"
          - "### recent sacct data"
          - "{{ sacct_recent.stdout_lines }}"
          - "### sreport"
          - "{{ sreport_check.stdout_lines }}"
          - "### database health"
          - "{{ db_health.stdout_lines }}"
@@ -0,0 +1,83 @@
 ---
 - name: Backup Slurm and Munge state on all cluster nodes
  hosts: slurm_cluster
  become: true
  gather_facts: true
  vars:
    backup_base_dir: /var/backups/slurm
  tasks:
    - name: Create backup base directory
      ansible.builtin.file:
        path: "{{ backup_base_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0700"
    - name: Create timestamped backup directory
      ansible.builtin.shell: |
        set -euo pipefail
        ts="$(date +%F-%H%M%S)"
        dir="{{ backup_base_dir }}/$ts"
        mkdir -p "$dir"
        echo "$dir"
      args:
        executable: /bin/bash
      register: backup_dir_result
      changed_when: true
    - name: Store backup directory fact
      ansible.builtin.set_fact:
        node_backup_dir: "{{ backup_dir_result.stdout }}"
    - name: Backup Slurm and Munge config/state if present
      ansible.builtin.shell: |
        set -euo pipefail
        backup_dir="{{ node_backup_dir }}"
        for p in \
          /etc/slurm \
          /etc/slurm-llnl \
          /etc/munge \
          /var/spool/slurmctld \
          /var/spool/slurmd \
          /var/log/slurm \
          /var/log/slurm-llnl
        do
          if [ -e "$p" ]; then
            cp -a "$p" "$backup_dir/"
          fi
        done
        systemctl status munge --no-pager > "$backup_dir/systemctl-munge.txt" 2>&1 || true
        systemctl status slurmctld --no-pager > "$backup_dir/systemctl-slurmctld.txt" 2>&1 || true
        systemctl status slurmd --no-pager > "$backup_dir/systemctl-slurmd.txt" 2>&1 || true
        journalctl -u munge -n 200 --no-pager > "$backup_dir/journal-munge.txt" 2>&1 || true
        journalctl -u slurmctld -n 200 --no-pager > "$backup_dir/journal-slurmctld.txt" 2>&1 || true
        journalctl -u slurmd -n 200 --no-pager > "$backup_dir/journal-slurmd.txt" 2>&1 || true
        if command -v sinfo >/dev/null 2>&1; then
          sinfo > "$backup_dir/sinfo.txt" 2>&1 || true
        fi
        if command -v scontrol >/dev/null 2>&1; then
          scontrol show config > "$backup_dir/scontrol-show-config.txt" 2>&1 || true
          scontrol show nodes > "$backup_dir/scontrol-show-nodes.txt" 2>&1 || true
          scontrol show partitions > "$backup_dir/scontrol-show-partitions.txt" 2>&1 || true
        fi
        find "$backup_dir" -maxdepth 2 \( -type f -o -type d \)
      args:
        executable: /bin/bash
      register: backup_content
      changed_when: true
    - name: Show backup location on node
      ansible.builtin.debug:
        msg:
          - "Host: {{ inventory_hostname }}"
          - "Backup directory: {{ node_backup_dir }}"
@@ -0,0 +1,46 @@
 ---
 - name: Fetch latest Slurm backups from nodes to pvef
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    remote_backup_base: /var/backups/slurm
    local_backup_base: "{{ playbook_dir }}/../../artifacts/backups"
  tasks:
    - name: Find latest remote backup directory
      ansible.builtin.shell: |
        set -euo pipefail
        ls -1dt {{ remote_backup_base }}/* | head -n 1
      args:
        executable: /bin/bash
      register: latest_backup_dir
      changed_when: false
    - name: Create local backup directory on pvef
      ansible.builtin.file:
        path: "{{ local_backup_base }}/{{ inventory_hostname }}"
        state: directory
        mode: "0700"
      delegate_to: localhost
      become: false
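    - name: Archive latest backup directory on remote node
      # The archive module ships in the community.general collection.
      # The backup includes /etc/munge, so keep the temporary archive readable by root only.
      community.general.archive:
        path: "{{ latest_backup_dir.stdout }}"
        dest: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        format: gz
        force_archive: true
        mode: "0600"
      changed_when: true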
    - name: Fetch archive to pvef
      ansible.builtin.fetch:
        src: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        dest: "{{ local_backup_base }}/{{ inventory_hostname }}/"
        flat: true
    - name: Remove temporary remote archive
      ansible.builtin.file:
        path: "/tmp/{{ inventory_hostname }}-slurm-backup.tgz"
        state: absent
@@ -0,0 +1,58 @@
 ---
 - name: Bootstrap Ansible SSH access from pvef to Slurm nodes
  hosts: slurm_cluster
  gather_facts: false
  become: true
  vars:
    ansible_controller_pubkey: "{{ lookup('file', lookup('env', 'HOME') + '/.ssh/id_ed25519.pub') }}"
  pre_tasks:
    - name: Wait for SSH
      ansible.builtin.wait_for_connection:
        timeout: 30
    - name: Install Python if missing - Debian/Ubuntu
      ansible.builtin.raw: |
        test -e /usr/bin/python3 || (apt-get update && apt-get install -y python3)
      changed_when: false
  tasks:
    - name: Ensure sudo and OpenSSH server are installed
      ansible.builtin.apt:
        name:
          - sudo
          - openssh-server
        state: present
        update_cache: true
    - name: Ensure SSH server is enabled and running
      ansible.builtin.service:
        name: ssh
        state: started
        enabled: true
    - name: Ensure .ssh directory exists for login user
      ansible.builtin.file:
        path: "/home/{{ ansible_user }}/.ssh"
        state: directory
        owner: "{{ ansible_user }}"
        group: "{{ ansible_user }}"
        mode: "0700"
    - name: Add pvef root public key to login user's authorized_keys
      ansible.posix.authorized_key:
        user: "{{ ansible_user }}"
        key: "{{ ansible_controller_pubkey }}"
        state: present
        manage_dir: true
    - name: Allow bootstrap login user passwordless sudo
      ansible.builtin.copy:
        dest: "/etc/sudoers.d/90-ansible-{{ ansible_user }}"
        owner: root
        group: root
        mode: "0440"
        content: |
          {{ ansible_user }} ALL=(ALL) NOPASSWD:ALL
        validate: "visudo -cf %s"
@@ -0,0 +1,16 @@
 ---
 - name: Configure /etc/hosts for Slurm cluster
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
    - name: Add Slurm cluster hosts to /etc/hosts
      ansible.builtin.blockinfile:
        path: /etc/hosts
        marker: "# {mark} ANSIBLE MANAGED SLURM CLUSTER HOSTS"
        block: |
          {{ slurm_control_addr }} {{ slurm_control_machine }}
          {% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
          {{ node.addr }} {{ node.name }}
          {% endfor %}
@@ -0,0 +1,218 @@
 ---
 - name: Create slurmuser and generate SSH keys on every Slurm node
  hosts: slurm_cluster
  become: true
  gather_facts: true
  vars:
    slurm_operator_user: slurmuser
    slurm_operator_shell: /bin/bash
  tasks:
    - name: Ensure sudo, OpenSSH, and ACL packages are installed
      ansible.builtin.apt:
        name:
          - sudo
          - openssh-client
          - openssh-server
          - acl
        state: present
        update_cache: true
    - name: Ensure slurmuser exists
      ansible.builtin.user:
        name: "{{ slurm_operator_user }}"
        shell: "{{ slurm_operator_shell }}"
        create_home: true
        state: present
    - name: Ensure .ssh directory exists for slurmuser
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh"
        state: directory
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0700"
    - name: Generate SSH key for slurmuser if missing
      community.crypto.openssh_keypair:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
        type: ed25519
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0600"
        comment: "{{ slurm_operator_user }}@{{ inventory_hostname }}"
        force: false
    - name: Read public key from each node
      ansible.builtin.slurp:
        src: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
      register: slurmuser_pubkey_raw
    - name: Store decoded public key as host fact
      ansible.builtin.set_fact:
        slurmuser_pubkey: "{{ slurmuser_pubkey_raw.content | b64decode | trim }}"
 - name: Exchange slurmuser SSH keys across all Slurm nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Install all slurmuser public keys into authorized_keys on every node
      ansible.posix.authorized_key:
        user: "{{ slurm_operator_user }}"
        key: "{{ hostvars[item].slurmuser_pubkey }}"
        state: present
        manage_dir: true
      loop: "{{ groups['slurm_cluster'] }}"
    - name: Build SSH known_hosts entries for all cluster nodes
      ansible.builtin.shell: |
        set -e
        mkdir -p /home/{{ slurm_operator_user }}/.ssh
        touch /home/{{ slurm_operator_user }}/.ssh/known_hosts
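        # Entries are left unhashed: hashed entries get a fresh salt on every scan,
        # so the sort -u below could not deduplicate them and the file would grow on each run.
        {% for host in groups['slurm_cluster'] %}
        ssh-keyscan {{ host }} {{ hostvars[host].ansible_host | default(host) }} 2>/dev/null >> /home/{{ slurm_operator_user }}/.ssh/known_hosts || true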
        {% endfor %}
        sort -u /home/{{ slurm_operator_user }}/.ssh/known_hosts -o /home/{{ slurm_operator_user }}/.ssh/known_hosts
        chown {{ slurm_operator_user }}:{{ slurm_operator_user }} /home/{{ slurm_operator_user }}/.ssh/known_hosts
        chmod 0644 /home/{{ slurm_operator_user }}/.ssh/known_hosts
      args:
        executable: /bin/bash
      changed_when: true
    - name: Ensure SSH permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh"
        state: directory
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0700"
    - name: Ensure private key permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519"
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0600"
    - name: Ensure public key permissions are correct
      ansible.builtin.file:
        path: "/home/{{ slurm_operator_user }}/.ssh/id_ed25519.pub"
        owner: "{{ slurm_operator_user }}"
        group: "{{ slurm_operator_user }}"
        mode: "0644"
 - name: Configure sudo permissions for slurmuser
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Configure sudoers for slurmuser on Slurm controller
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          # Operator access for Slurm controller node.
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
            /bin/systemctl status slurmctld, \
            /bin/systemctl restart slurmctld, \
            /bin/systemctl reload slurmctld, \
            /bin/systemctl stop slurmctld, \
            /bin/systemctl start slurmctld, \
            /bin/systemctl status slurmd, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl stop slurmd, \
            /bin/systemctl start slurmd, \
            /bin/journalctl -u slurmctld, \
            /bin/journalctl -u slurmd, \
            /usr/bin/scontrol, \
            /usr/bin/sinfo, \
            /usr/bin/squeue, \
            /usr/bin/scancel, \
            /usr/bin/sacct, \
            /usr/bin/sacctmgr, \
            /usr/bin/sbatch, \
            /usr/bin/srun, \
            /usr/bin/salloc
        validate: "visudo -cf %s"
      when: inventory_hostname in groups['slurm_controller']
    - name: Configure sudoers for slurmuser on Slurm compute and GPU nodes
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          # Operator access for Slurm worker/GPU nodes.
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: \
            /bin/systemctl status slurmd, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl stop slurmd, \
            /bin/systemctl start slurmd, \
            /bin/journalctl -u slurmd, \
            /usr/bin/scontrol, \
            /usr/bin/sinfo, \
            /usr/bin/squeue, \
            /usr/bin/scancel, \
            /usr/bin/sacct, \
            /usr/bin/sbatch, \
            /usr/bin/srun, \
            /usr/bin/salloc
        validate: "visudo -cf %s"
      when: inventory_hostname not in groups['slurm_controller']
 - name: Validate slurmuser SSH mesh and Slurm access
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Test local Slurm commands as slurmuser
      ansible.builtin.command: "sudo -iu {{ slurm_operator_user }} sinfo"
      register: sinfo_test
      changed_when: false
      failed_when: sinfo_test.rc != 0
    - name: Show sinfo result
      ansible.builtin.debug:
        var: sinfo_test.stdout_lines
    - name: Test SSH from each node to every other node as slurmuser
      ansible.builtin.shell: |
        set -e
        {% for host in groups['slurm_cluster'] %}
        ssh -o BatchMode=yes -o ConnectTimeout=5 {{ host }} 'hostname'
        {% endfor %}
      args:
        executable: /bin/bash
      become_user: "{{ slurm_operator_user }}"
      register: ssh_mesh_test
      changed_when: false
    - name: Show SSH mesh test result
      ansible.builtin.debug:
        var: ssh_mesh_test.stdout_lines
@@ -0,0 +1,112 @@
 ---
 - name: Fix sudo permissions for slurmuser Slurm operations
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Configure sudoers for slurmuser on controller
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-controller
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          Cmnd_Alias SLURM_SYSTEMCTL_CONTROLLER = \
            /bin/systemctl status slurmctld, \
            /bin/systemctl status slurmctld *, \
            /bin/systemctl restart slurmctld, \
            /bin/systemctl reload slurmctld, \
            /bin/systemctl start slurmctld, \
            /bin/systemctl stop slurmctld, \
            /bin/systemctl status slurmd, \
            /bin/systemctl status slurmd *, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl start slurmd, \
            /bin/systemctl stop slurmd, \
            /usr/bin/systemctl status slurmctld, \
            /usr/bin/systemctl status slurmctld *, \
            /usr/bin/systemctl restart slurmctld, \
            /usr/bin/systemctl reload slurmctld, \
            /usr/bin/systemctl start slurmctld, \
            /usr/bin/systemctl stop slurmctld, \
            /usr/bin/systemctl status slurmd, \
            /usr/bin/systemctl status slurmd *, \
            /usr/bin/systemctl restart slurmd, \
            /usr/bin/systemctl reload slurmd, \
            /usr/bin/systemctl start slurmd, \
            /usr/bin/systemctl stop slurmd
          Cmnd_Alias SLURM_JOURNAL_CONTROLLER = \
            /bin/journalctl -u slurmctld, \
            /bin/journalctl -u slurmctld *, \
            /bin/journalctl -u slurmd, \
            /bin/journalctl -u slurmd *, \
            /usr/bin/journalctl -u slurmctld, \
            /usr/bin/journalctl -u slurmctld *, \
            /usr/bin/journalctl -u slurmd, \
            /usr/bin/journalctl -u slurmd *
          Cmnd_Alias SLURM_COMMANDS = \
            /usr/bin/scontrol, /usr/bin/scontrol *, \
            /usr/bin/sinfo, /usr/bin/sinfo *, \
            /usr/bin/squeue, /usr/bin/squeue *, \
            /usr/bin/scancel, /usr/bin/scancel *, \
            /usr/bin/sacct, /usr/bin/sacct *, \
            /usr/bin/sacctmgr, /usr/bin/sacctmgr *, \
            /usr/bin/sbatch, /usr/bin/sbatch *, \
            /usr/bin/srun, /usr/bin/srun *, \
            /usr/bin/salloc, /usr/bin/salloc *
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_CONTROLLER, SLURM_JOURNAL_CONTROLLER, SLURM_COMMANDS
        validate: "visudo -cf %s"
      when: inventory_hostname in groups['slurm_controller']
    - name: Configure sudoers for slurmuser on compute and GPU nodes
      ansible.builtin.copy:
        dest: /etc/sudoers.d/91-slurmuser-slurm-compute
        owner: root
        group: root
        mode: "0440"
        content: |
          # Managed by Ansible
          Cmnd_Alias SLURM_SYSTEMCTL_COMPUTE = \
            /bin/systemctl status slurmd, \
            /bin/systemctl status slurmd *, \
            /bin/systemctl restart slurmd, \
            /bin/systemctl reload slurmd, \
            /bin/systemctl start slurmd, \
            /bin/systemctl stop slurmd, \
            /usr/bin/systemctl status slurmd, \
            /usr/bin/systemctl status slurmd *, \
            /usr/bin/systemctl restart slurmd, \
            /usr/bin/systemctl reload slurmd, \
            /usr/bin/systemctl start slurmd, \
            /usr/bin/systemctl stop slurmd
          Cmnd_Alias SLURM_JOURNAL_COMPUTE = \
            /bin/journalctl -u slurmd, \
            /bin/journalctl -u slurmd *, \
            /usr/bin/journalctl -u slurmd, \
            /usr/bin/journalctl -u slurmd *
          Cmnd_Alias SLURM_COMMANDS = \
            /usr/bin/scontrol, /usr/bin/scontrol *, \
            /usr/bin/sinfo, /usr/bin/sinfo *, \
            /usr/bin/squeue, /usr/bin/squeue *, \
            /usr/bin/scancel, /usr/bin/scancel *, \
            /usr/bin/sacct, /usr/bin/sacct *, \
            /usr/bin/sbatch, /usr/bin/sbatch *, \
            /usr/bin/srun, /usr/bin/srun *, \
            /usr/bin/salloc, /usr/bin/salloc *
          {{ slurm_operator_user }} ALL=(root) NOPASSWD: SLURM_SYSTEMCTL_COMPUTE, SLURM_JOURNAL_COMPUTE, SLURM_COMMANDS
        validate: "visudo -cf %s"
      when: inventory_hostname not in groups['slurm_controller']
@@ -0,0 +1,133 @@
 ---
 - name: Read Munge key from Slurm controller
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Check controller munge.key exists
      ansible.builtin.stat:
        path: /etc/munge/munge.key
      register: controller_munge_key
    - name: Fail if controller munge.key is missing
      ansible.builtin.fail:
        msg: "/etc/munge/munge.key is missing on controller. Do not continue."
      when: not controller_munge_key.stat.exists
    - name: Read controller munge.key
      ansible.builtin.slurp:
        src: /etc/munge/munge.key
      register: controller_munge_key_raw
      no_log: true
    - name: Store controller Munge key as fact
      ansible.builtin.set_fact:
        cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
      no_log: true
 - name: Deploy controller Munge key to all Slurm nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    controller_host: "{{ groups['slurm_controller'][0] }}"
  tasks:
    - name: Ensure munge package is installed
      ansible.builtin.apt:
        name:
          - munge
          - libmunge2
        state: present
        update_cache: true
    - name: Ensure munge group exists
      ansible.builtin.group:
        name: munge
        system: true
        state: present
    - name: Ensure munge user exists
      ansible.builtin.user:
        name: munge
        group: munge
        system: true
        shell: /usr/sbin/nologin
        home: /nonexistent
        create_home: false
        state: present
    - name: Ensure /etc/munge exists
      ansible.builtin.file:
        path: /etc/munge
        state: directory
        owner: munge
        group: munge
        mode: "0700"
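    - name: Deploy shared munge.key from controller
      # b64decode yields text, so this assumes the key bytes round-trip cleanly.
      # Compare checksums across nodes if Munge authentication fails after deployment.
      ansible.builtin.copy:
        dest: /etc/munge/munge.key
        content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
        owner: munge
        group: munge
        mode: "0400"
      no_log: true
      notify:
        - Restart munge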
    - name: Ensure /var/log/munge exists
      ansible.builtin.file:
        path: /var/log/munge
        state: directory
        owner: munge
        group: munge
        mode: "0755"
    - name: Ensure /var/lib/munge exists
      ansible.builtin.file:
        path: /var/lib/munge
        state: directory
        owner: munge
        group: munge
        mode: "0711"
    - name: Ensure /run/munge exists
      ansible.builtin.file:
        path: /run/munge
        state: directory
        owner: munge
        group: munge
        mode: "0755"
    - name: Ensure munge is enabled and running
      ansible.builtin.systemd:
        name: munge
        enabled: true
        state: started
  handlers:
    - name: Restart munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
 - name: Validate Munge locally on all nodes
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
    - name: Test local munge encode/decode
      ansible.builtin.shell: |
        set -euo pipefail
        munge -n | unmunge
      args:
        executable: /bin/bash
      register: munge_local_test
      changed_when: false
    - name: Show local Munge validation
      ansible.builtin.debug:
        var: munge_local_test.stdout_lines
@@ -0,0 +1,132 @@
 ---
 - name: Prepare Slurm config directories and logs
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
    - name: Ensure Slurm config directory exists
      ansible.builtin.file:
        path: "{{ slurm_config_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0755"
    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
    - name: Ensure slurmctld spool directory exists on controller
      ansible.builtin.file:
        path: /var/spool/slurmctld
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
      when: inventory_hostname in groups['slurm_controller']
    - name: Ensure slurmd spool directory exists on workers
      ansible.builtin.file:
        path: /var/spool/slurmd
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
 - name: Deploy Slurm config files
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
    - name: Backup current slurm.conf before managed deployment
      ansible.builtin.copy:
        src: "{{ slurm_config_dir }}/slurm.conf"
        dest: "{{ slurm_config_dir }}/slurm.conf.pre-ansible-managed"
        remote_src: true
        owner: root
        group: root
        mode: "0644"
        force: false
    - name: Deploy managed slurm.conf
      ansible.builtin.template:
        src: ../../templates/slurm.conf.j2
        dest: "{{ slurm_config_dir }}/slurm.conf"
        owner: root
        group: root
        mode: "0644"
      notify:
        - Reconfigure slurmctld
        - Restart slurmd
    - name: Deploy managed cgroup.conf
      ansible.builtin.template:
        src: ../../templates/cgroup.conf.j2
        dest: "{{ slurm_config_dir }}/cgroup.conf"
        owner: root
        group: root
        mode: "0644"
      when: slurm_enable_cgroup | default(false) | bool
      notify:
        - Reconfigure slurmctld
        - Restart slurmd
    - name: Deploy managed gres.conf only on GPU nodes
      ansible.builtin.template:
        src: ../../templates/gres.conf.j2
        dest: "{{ slurm_config_dir }}/gres.conf"
        owner: root
        group: root
        mode: "0644"
      when: inventory_hostname in groups['slurm_gpu']
      notify:
        - Reconfigure slurmctld
        - Restart slurmd
  handlers:
    - name: Reconfigure slurmctld
      ansible.builtin.command:
        cmd: scontrol reconfigure
      when: inventory_hostname in groups['slurm_controller']
      changed_when: true
    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
      when: inventory_hostname in groups['slurm_compute'] or inventory_hostname in groups['slurm_gpu']
 - name: Validate Slurm after config deployment
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Reconfigure controller
      ansible.builtin.command:
        cmd: scontrol reconfigure
      changed_when: true
    - name: Validate cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        scontrol show nodes
      args:
        executable: /bin/bash
      register: slurm_config_validation
      changed_when: false
    - name: Show validation output
      ansible.builtin.debug:
        var: slurm_config_validation.stdout_lines
@@ -0,0 +1,103 @@
 ---
 - name: Restart Slurm controller safely
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Restart munge on controller
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmctld on controller
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
        enabled: true
    - name: Wait for slurmctld to answer
      ansible.builtin.command:
        cmd: scontrol ping
      register: scontrol_ping
      retries: 15
      delay: 2
      until: scontrol_ping.rc == 0
      changed_when: false
    - name: Show controller ping
      ansible.builtin.debug:
        var: scontrol_ping.stdout_lines
 - name: Restart Slurm workers safely one by one
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1
  tasks:
    - name: Restart munge on worker
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmd on worker
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true
    - name: Wait for slurmd to be active
      ansible.builtin.command:
        cmd: systemctl is-active slurmd
      register: slurmd_active
      retries: 15
      delay: 2
      until: slurmd_active.stdout == "active"
      changed_when: false
    - name: Wait until this node is visible in Slurm
      ansible.builtin.command:
        cmd: scontrol show node {{ inventory_hostname }}
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: node_visible
      retries: 15
      delay: 2
      until: node_visible.rc == 0
      changed_when: false
 - name: Validate Slurm after restart
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Validate Slurm cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### scontrol ping"
        scontrol ping
        echo
        echo "### sinfo"
        sinfo
        echo
        echo "### nodes"
        scontrol show nodes
        echo
        echo "### partitions"
        scontrol show partitions
      args:
        executable: /bin/bash
      register: slurm_validation
      changed_when: false
    - name: Show Slurm validation
      ansible.builtin.debug:
        var: slurm_validation.stdout_lines
@@ -0,0 +1,40 @@
 ---
 - name: Discover node resources for Slurm config
  hosts: slurm_cluster
  become: true
  gather_facts: true
  tasks:
    - name: Discover CPU and memory
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST={{ inventory_hostname }}"
        echo "CPUS=$(nproc)"
        echo "REAL_MEMORY_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)"
        echo "SOCKETS=$(lscpu | awk -F: '/Socket\(s\)/ {gsub(/ /,"",$2); print $2}')"
        echo "CORES_PER_SOCKET=$(lscpu | awk -F: '/Core\(s\) per socket/ {gsub(/ /,"",$2); print $2}')"
        echo "THREADS_PER_CORE=$(lscpu | awk -F: '/Thread\(s\) per core/ {gsub(/ /,"",$2); print $2}')"
      args:
        executable: /bin/bash
      register: cpu_mem
      changed_when: false
    - name: Discover NVIDIA GPU if present
      ansible.builtin.shell: |
        set -euo pipefail
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: gpu_info
      changed_when: false
    - name: Show discovered resources
      ansible.builtin.debug:
        msg:
          - "{{ cpu_mem.stdout_lines }}"
          - "GPU:"
          - "{{ gpu_info.stdout_lines }}"
@@ -0,0 +1,89 @@
 ---
 - name: Inspect current Slurm and Munge state
  hosts: slurm_cluster
  become: true
  gather_facts: true
  tasks:
    - name: Basic host info
      ansible.builtin.shell: |
        set -e
        echo "HOST=$(hostname -f 2>/dev/null || hostname)"
        echo "SHORT_HOST=$(hostname -s)"
        echo "IP_ADDRESSES=$(hostname -I)"
        echo "OS=$(lsb_release -ds 2>/dev/null || cat /etc/os-release | grep PRETTY_NAME || true)"
        echo "KERNEL=$(uname -r)"
      args:
        executable: /bin/bash
      register: host_info
      changed_when: false
    - name: Slurm package info
      ansible.builtin.shell: |
        dpkg -l | grep -Ei 'slurm|munge' || true
      args:
        executable: /bin/bash
      register: package_info
      changed_when: false
    - name: Slurm config paths
      ansible.builtin.shell: |
        set -e
        for p in /etc/slurm /etc/slurm-llnl /etc/munge; do
          echo "### $p"
          if [ -e "$p" ]; then
            find "$p" -maxdepth 2 -type f -printf "%m %u %g %p\n" | sort
          else
            echo "MISSING"
          fi
        done
      args:
        executable: /bin/bash
      register: config_paths
      changed_when: false
    - name: Service state
      ansible.builtin.shell: |
        for s in munge slurmctld slurmd; do
          echo "### $s"
          systemctl is-enabled "$s" 2>/dev/null || true
          systemctl is-active "$s" 2>/dev/null || true
        done
      args:
        executable: /bin/bash
      register: service_state
      changed_when: false
    - name: Slurm commands
      ansible.builtin.shell: |
        echo "### which"
        command -v sinfo || true
        command -v scontrol || true
        command -v sbatch || true
        command -v srun || true
        command -v munge || true
        command -v unmunge || true
        echo "### sinfo"
        sinfo 2>&1 || true
        echo "### scontrol ping"
        scontrol ping 2>&1 || true
      args:
        executable: /bin/bash
      register: slurm_commands
      changed_when: false
    - name: Show inspection report
      ansible.builtin.debug:
        msg:
          - "===== {{ inventory_hostname }} :: host_info ====="
          - "{{ host_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: packages ====="
          - "{{ package_info.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: config_paths ====="
          - "{{ config_paths.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: services ====="
          - "{{ service_state.stdout_lines }}"
          - "===== {{ inventory_hostname }} :: slurm_commands ====="
          - "{{ slurm_commands.stdout_lines }}"
@@ -0,0 +1,216 @@
 ---
 - name: Detect problematic Slurm nodes
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Detect nodes needing remediation
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: bad_nodes_raw
      changed_when: false
    - name: Store bad node list
      ansible.builtin.set_fact:
        bad_nodes: "{{ bad_nodes_raw.stdout_lines }}"
    - name: Show detected problematic nodes
      ansible.builtin.debug:
        var: bad_nodes
 - name: Attempt auto-remediation on problematic nodes
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  serial: 1
  vars:
    bad_nodes_from_controller: "{{ hostvars[groups['slurm_controller'][0]].bad_nodes | default([]) }}"
  tasks:
    - name: Skip healthy nodes
      ansible.builtin.meta: end_host
      when: inventory_hostname not in bad_nodes_from_controller
    - name: Restart Munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true
    - name: Validate local services after remediation attempt
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST=$(hostname)"
        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd
        echo
        echo "### munge"
        munge -n | unmunge >/dev/null
        echo "munge OK"
        echo
        echo "### controller ping"
        scontrol ping
        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true
        echo
        echo "### recent slurmd logs"
        journalctl -u slurmd -n 30 --no-pager || true
      args:
        executable: /bin/bash
      register: local_repair_check
      changed_when: false
    - name: Print local remediation result
      ansible.builtin.debug:
        var: local_repair_check.stdout_lines
 - name: Refresh controller and validate remediated nodes
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Restart slurmctld to refresh node states
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
    - name: Wait for controller
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false
    - name: Clear maintenance state on previously bad nodes
      ansible.builtin.shell: |
        set -euo pipefail
        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
        if [ -z "$bad_nodes" ]; then
          echo "No bad nodes detected. Nothing to clear."
          sinfo -N
          exit 0
        fi
        for node in $bad_nodes; do
          echo "### clearing state on $node"
          scontrol update NodeName="$node" State=RESUME 2>/dev/null || true
          scontrol update NodeName="$node" State=UNDRAIN 2>/dev/null || true
          scontrol update NodeName="$node" State=IDLE 2>/dev/null || true
        done
        sleep 5
        sinfo -N
      args:
        executable: /bin/bash
      register: clear_result
      changed_when: true
    - name: Print clear-state result
      ansible.builtin.debug:
        var: clear_result.stdout_lines
    - name: Detect nodes still unhealthy after remediation
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -h -o "%N %T" | awk '
          tolower($2) ~ /down|drain|fail|unknown|not_responding|idle\*/ {print $1}
        ' | sort -u
      args:
        executable: /bin/bash
      register: still_bad_nodes_raw
      changed_when: false
    - name: Store still bad nodes
      ansible.builtin.set_fact:
        still_bad_nodes: "{{ still_bad_nodes_raw.stdout_lines }}"
    - name: Drain nodes that remain unhealthy
      ansible.builtin.shell: |
        set -euo pipefail
        unresolved_nodes="{{ still_bad_nodes | join(' ') }}"
        if [ -z "$unresolved_nodes" ]; then
          echo "No unresolved unhealthy nodes."
          sinfo -N
          exit 0
        fi
        for node in $unresolved_nodes; do
          echo "### draining unresolved node $node"
          scontrol update NodeName="$node" State=DRAIN Reason="auto-remediation failed"
        done
        sinfo -N
      args:
        executable: /bin/bash
      register: drain_unresolved
      changed_when: still_bad_nodes | length > 0
    - name: Show remediation summary
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### initial bad nodes"
        bad_nodes="{{ (bad_nodes | default([])) | join(' ') }}"
        if [ -z "$bad_nodes" ]; then
          echo "none"
        else
          printf '%s\n' $bad_nodes
        fi
        echo
        echo "### still bad nodes"
        still_bad_nodes="{{ (still_bad_nodes | default([])) | join(' ') }}"
        if [ -z "$still_bad_nodes" ]; then
          echo "none"
        else
          printf '%s\n' $still_bad_nodes
        fi
        echo
        echo "### final sinfo"
        sinfo -N
        echo
        echo "### queue"
        squeue
      args:
        executable: /bin/bash
      register: remediation_summary
      changed_when: false
    - name: Print remediation summary
      ansible.builtin.debug:
        var: remediation_summary.stdout_lines
@@ -0,0 +1,149 @@
 ---
 - name: Check Slurm controller health
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Check controller services and cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### controller services"
        systemctl is-active munge
        systemctl is-active slurmctld
        systemctl is-active slurmdbd || true
        systemctl is-active mariadb || true
        echo
        echo "### slurm ping"
        scontrol ping
        echo
        echo "### nodes"
        sinfo -N
        echo
        echo "### partitions"
        sinfo
        echo
        echo "### queue"
        squeue
        echo
        echo "### problematic nodes"
        sinfo -N -h -o "%N %T %E" | awk '$2 ~ /\*/ || $2 !~ /idle|alloc|mix/ {print}' || true
        echo
        echo "### accounting"
        sacctmgr -n list cluster || true
        echo
        echo "### recent failed jobs"
        sacct -S today --state=FAILED,CANCELLED,TIMEOUT,NODE_FAIL,OUT_OF_MEMORY \
          --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,NodeList | tail -30 || true
      args:
        executable: /bin/bash
      register: controller_health
      changed_when: false
    - name: Print controller health
      ansible.builtin.debug:
        var: controller_health.stdout_lines
 - name: Check Slurm worker health
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true
  tasks:
    - name: Check worker services, config and connectivity
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST=$(hostname)"
        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
        echo "KERNEL=$(uname -r)"
        echo "UPTIME=$(uptime -p)"
        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd
        echo
        echo "### munge local test"
        munge -n | unmunge >/dev/null
        echo "munge OK"
        echo
        echo "### controller connectivity"
        getent hosts slurm-ctl01 || true
        scontrol ping
        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true
        echo
        echo "### config checksums"
        sha256sum "{{ slurm_config_dir | default('/etc/slurm') }}/slurm.conf" "{{ slurm_config_dir | default('/etc/slurm') }}/cgroup.conf" 2>/dev/null || true
        echo
        echo "### shared filesystem"
        test -d /shared
        touch /shared/.slurm-health-$(hostname)
        ls -l /shared/.slurm-health-$(hostname)
        rm -f /shared/.slurm-health-$(hostname)
        echo
        echo "### cgroup"
        mount | grep cgroup || true
        echo
        echo "### gpu check"
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,driver_version,memory.total,temperature.gpu,utilization.gpu --format=csv,noheader || true
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: worker_health
      changed_when: false
    - name: Print worker health
      ansible.builtin.debug:
        var: worker_health.stdout_lines
 - name: Check Slurm-reported node state consistency
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Build Slurm node health summary
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### node summary"
        sinfo -N -o "%N %P %T %C %m %G %E"
        echo
        echo "### full problematic node details"
        for node in $(sinfo -N -h -o "%N %T" | awk '$2 ~ /down|drain|fail|unk|not_responding|idle\*/ {print $1}' | sort -u); do
          echo
          echo "### $node"
          scontrol show node "$node"
        done
      args:
        executable: /bin/bash
      register: slurm_node_summary
      changed_when: false
    - name: Print Slurm node summary
      ansible.builtin.debug:
        var: slurm_node_summary.stdout_lines
@@ -0,0 +1,217 @@
 ---
 - name: Validate target node
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook repair-slurm-node.yml -e target_node=<hostname>"
      when: target_node is not defined
    - name: Ensure target_node is in inventory
      ansible.builtin.fail:
        msg: "target_node={{ target_node }} is not in Ansible inventory"
      when: target_node not in groups['all']
 - name: Capture node state before repair
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Show target node state before repair
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### sinfo"
        sinfo -N -n {{ target_node }} || true
        echo
        echo "### scontrol"
        scontrol show node {{ target_node }} || true
        echo
        echo "### jobs"
        squeue -w {{ target_node }} || true
      args:
        executable: /bin/bash
      register: node_state_before
      changed_when: false
    - name: Print target node state before repair
      ansible.builtin.debug:
        var: node_state_before.stdout_lines
 - name: Repair local services on target node
  hosts: "{{ target_node }}"
  become: true
  gather_facts: false
  tasks:
    - name: Restart Munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true
      when:
        - inventory_hostname in groups.get('slurm_compute', []) or inventory_hostname in groups.get('slurm_gpu', [])
    - name: Validate local repair
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd
        echo
        echo "### munge"
        munge -n | unmunge >/dev/null
        echo "munge OK"
        echo
        echo "### controller ping"
        scontrol ping
        echo
        echo "### slurmd listener"
        ss -lntp | grep ':6818 ' || true
        echo
        echo "### recent slurmd logs"
        journalctl -u slurmd -n 40 --no-pager || true
      args:
        executable: /bin/bash
      register: local_repair_state
      changed_when: false
    - name: Print local repair state
      ansible.builtin.debug:
        var: local_repair_state.stdout_lines
 - name: Clear Slurm maintenance/down state after repair
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Restart controller to refresh node state
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
    - name: Wait for controller
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false
    - name: Clear target node state
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol update NodeName={{ target_node }} State=RESUME 2>/dev/null || true
        scontrol update NodeName={{ target_node }} State=UNDRAIN 2>/dev/null || true
        scontrol update NodeName={{ target_node }} State=IDLE 2>/dev/null || true
        sleep 5
        sinfo -N -n {{ target_node }}
        scontrol show node {{ target_node }}
      args:
        executable: /bin/bash
      register: clear_state
      changed_when: true
    - name: Wait until node is healthy
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ target_node }}
        scontrol show node {{ target_node }}
      args:
        executable: /bin/bash
      register: node_health_after
      retries: 30
      delay: 5
      until:
        - node_health_after.rc == 0
        - "'not_responding' not in node_health_after.stdout.lower()"
        - "'down' not in node_health_after.stdout.lower()"
        - "'drain' not in node_health_after.stdout.lower()"
        - "'idle*' not in node_health_after.stdout.lower()"
      changed_when: false
    - name: Print node state after repair
      ansible.builtin.debug:
        var: node_health_after.stdout_lines
 - name: Submit repair validation job
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Submit validation job to repaired node
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=repair-node-test
        #SBATCH --partition=all
        #SBATCH --nodelist={{ target_node }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --account=lab
        #SBATCH --qos=normal
        #SBATCH --output=/shared/repair-node-test-%j.out
        echo "HOST=\$(hostname)"
        echo "USER=\$(whoami)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList
        echo "### output"
        cat "/shared/repair-node-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: repair_validation_job
      changed_when: true
    - name: Print repair validation job
      ansible.builtin.debug:
        var: repair_validation_job.stdout_lines
@@ -0,0 +1,126 @@
 ---
 - name: Validate target_node variable
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook decommission-slurm-node.yml -e target_node=<hostname> [-e decom_reason='reason']"
      when: target_node is not defined
    - name: Ensure target_node is in inventory
      ansible.builtin.fail:
        msg: "target_node={{ target_node }} is not in Ansible inventory"
      when: target_node not in groups['all']
 - name: Drain target node and wait for jobs to leave
  hosts: slurm_controller
  become: true
  gather_facts: false
  vars:
    decom_reason_effective: "{{ decom_reason | default('decommission by Ansible') }}"
    decom_wait_retries_effective: "{{ decom_wait_retries | default(120) }}"
    decom_wait_delay_effective: "{{ decom_wait_delay | default(10) }}"
  tasks:
    - name: Show current target node state
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ target_node }} || true
        scontrol show node {{ target_node }} || true
      args:
        executable: /bin/bash
      register: node_state_before
      changed_when: false
    - name: Print current target node state
      ansible.builtin.debug:
        var: node_state_before.stdout_lines
    - name: Drain target node
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ target_node }} State=DRAIN Reason="{{ decom_reason_effective }}"
      changed_when: true
    - name: Wait until no jobs are running on target node
      ansible.builtin.shell: |
        set -euo pipefail
        squeue -h -w {{ target_node }} || true
      args:
        executable: /bin/bash
      register: jobs_on_node
      retries: "{{ decom_wait_retries_effective | int }}"
      delay: "{{ decom_wait_delay_effective | int }}"
      until: jobs_on_node.stdout | trim == ""
      changed_when: false
    - name: Show drained node state
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ target_node }} || true
        scontrol show node {{ target_node }} | grep -E "NodeName=|State=|Reason=" || true
      args:
        executable: /bin/bash
      register: node_state_drained
      changed_when: false
    - name: Print drained node state
      ansible.builtin.debug:
        var: node_state_drained.stdout_lines
 - name: Stop Slurm worker service on target node
  hosts: "{{ target_node }}"
  become: true
  gather_facts: false
  tasks:
    - name: Stop slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: stopped
        enabled: false
      when:
        - inventory_hostname in groups.get('slurm_compute', []) or inventory_hostname in groups.get('slurm_gpu', [])
    - name: Show slurmd state
      ansible.builtin.shell: |
        systemctl is-enabled slurmd 2>/dev/null || true
        systemctl is-active slurmd 2>/dev/null || true
      args:
        executable: /bin/bash
      register: slurmd_state_after
      changed_when: false
    - name: Print slurmd state
      ansible.builtin.debug:
        var: slurmd_state_after.stdout_lines
 - name: Mark node down in Slurm controller
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Mark target node DOWN after service stop
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ target_node }} State=DOWN Reason="decommissioned"
      changed_when: true
    - name: Show final node state
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ target_node }} || true
        scontrol show node {{ target_node }} | grep -E "NodeName=|State=|Reason=" || true
      args:
        executable: /bin/bash
      register: final_node_state
      changed_when: false
    - name: Print final node state
      ansible.builtin.debug:
        var: final_node_state.stdout_lines
@@ -0,0 +1,246 @@
 ---
 - name: Validate target_node variable
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook provision-slurm-node.yml -e target_node=<hostname>"
      when: target_node is not defined
    - name: Ensure target_node is in inventory
      ansible.builtin.fail:
        msg: "target_node={{ target_node }} is not in Ansible inventory"
      when: target_node not in groups['all']
 - name: Prepare OS, packages and Slurm directories on target node
  hosts: "{{ target_node }}"
  become: true
  gather_facts: true
  tasks:
    - name: Ensure target is a Slurm worker or GPU node
      ansible.builtin.fail:
        msg: "{{ inventory_hostname }} must be in slurm_compute or slurm_gpu group"
      when:
        - inventory_hostname not in groups.get('slurm_compute', [])
        - inventory_hostname not in groups.get('slurm_gpu', [])
    - name: Install Slurm worker packages
      ansible.builtin.apt:
        name:
          - munge
          - libmunge2
          - slurm-client
          - slurmd
          - slurm-wlm-basic-plugins
          - slurm-wlm-plugins
          - slurm-wlm-mysql-plugin
        state: present
        update_cache: true
    - name: Ensure Slurm config directory exists
      ansible.builtin.file:
        path: "{{ slurm_config_dir }}"
        state: directory
        owner: root
        group: root
        mode: "0755"
    - name: Ensure Slurm log directory exists
      ansible.builtin.file:
        path: /var/log/slurm
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
    - name: Ensure slurmd spool directory exists
      ansible.builtin.file:
        path: /var/spool/slurmd
        state: directory
        owner: slurm
        group: slurm
        mode: "0755"
    - name: Ensure munge dirs exist
      ansible.builtin.file:
        path: "{{ item.path }}"
        state: directory
        owner: munge
        group: munge
        mode: "{{ item.mode }}"
      loop:
        - { path: /etc/munge, mode: "0700" }
        - { path: /var/log/munge, mode: "0755" }
        - { path: /var/lib/munge, mode: "0711" }
        - { path: /run/munge, mode: "0755" }
 - name: Deploy Munge key from controller to target node
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Read controller munge.key
      ansible.builtin.slurp:
        src: /etc/munge/munge.key
      register: controller_munge_key_raw
      no_log: true
    - name: Store controller Munge key as fact
      ansible.builtin.set_fact:
        cluster_munge_key_b64: "{{ controller_munge_key_raw.content }}"
      no_log: true
 - name: Configure target node with Munge and Slurm files
  hosts: "{{ target_node }}"
  become: true
  gather_facts: false
  vars:
    controller_host: "{{ groups['slurm_controller'][0] }}"
  tasks:
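    # Caveat: b64decode returns text, so a key containing non-UTF-8 bytes may not
    # round-trip byte-for-byte; compare checksums with the controller after deployment.
    - name: Deploy shared munge.key
      ansible.builtin.copy:
        dest: /etc/munge/munge.key
        content: "{{ hostvars[controller_host].cluster_munge_key_b64 | b64decode }}"
        owner: munge
        group: munge
        mode: "0400"
      no_log: true
      notify:
        - Restart munge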
    - name: Deploy managed slurm.conf
      ansible.builtin.template:
        src: ../../templates/slurm.conf.j2
        dest: "{{ slurm_config_dir }}/slurm.conf"
        owner: root
        group: root
        mode: "0644"
      notify:
        - Restart slurmd
    - name: Deploy managed cgroup.conf
      ansible.builtin.template:
        src: ../../templates/cgroup.conf.j2
        dest: "{{ slurm_config_dir }}/cgroup.conf"
        owner: root
        group: root
        mode: "0644"
      when: slurm_enable_cgroup | default(false) | bool
      notify:
        - Restart slurmd
    - name: Deploy managed gres.conf on GPU nodes
      ansible.builtin.template:
        src: ../../templates/gres.conf.j2
        dest: "{{ slurm_config_dir }}/gres.conf"
        owner: root
        group: root
        mode: "0644"
      when: inventory_hostname in groups.get('slurm_gpu', [])
      notify:
        - Restart slurmd
    - name: Ensure munge is enabled and running
      ansible.builtin.systemd:
        name: munge
        enabled: true
        state: started
    - name: Ensure slurmd is enabled and running
      ansible.builtin.systemd:
        name: slurmd
        enabled: true
        state: started
  handlers:
    - name: Restart munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
 - name: Deploy updated Slurm config to whole cluster and reconfigure controller
  hosts: slurm_cluster
  become: true
  gather_facts: false
  tasks:
    - name: Deploy managed slurm.conf to all nodes
      ansible.builtin.template:
        src: ../../templates/slurm.conf.j2
        dest: "{{ slurm_config_dir }}/slurm.conf"
        owner: root
        group: root
        mode: "0644"
    - name: Deploy managed cgroup.conf to all nodes
      ansible.builtin.template:
        src: ../../templates/cgroup.conf.j2
        dest: "{{ slurm_config_dir }}/cgroup.conf"
        owner: root
        group: root
        mode: "0644"
      when: slurm_enable_cgroup | default(false) | bool
 - name: Reconfigure Slurm and validate target node
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Reconfigure Slurm controller
      ansible.builtin.command:
        cmd: scontrol reconfigure
      changed_when: true
    - name: Restart Slurm controller after node reprovision
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
    - name: Wait for Slurm controller after restart
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping_after_restart
      retries: 15
      delay: 2
      until: slurmctld_ping_after_restart.rc == 0
      changed_when: false
    - name: Resume target node in Slurm
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ target_node }} State=RESUME
      changed_when: true
    - name: Wait until target node is visible and not down
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol show node {{ target_node }}
        sinfo -N -n {{ target_node }}
      args:
        executable: /bin/bash
      register: target_node_state
      retries: 20
      delay: 3
      until:
        - target_node_state.rc == 0
        - "'down' not in target_node_state.stdout.lower()"
        - "'not_responding' not in target_node_state.stdout.lower()"
        - "'idle*' not in target_node_state.stdout.lower()"
      changed_when: false
    - name: Show target node state
      ansible.builtin.debug:
        var: target_node_state.stdout_lines
@@ -0,0 +1,33 @@
 ---
 - name: Show Slurm node state
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook show-slurm-node.yml -e target_node=<hostname>"
      when: target_node is not defined
    - name: Show node state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### sinfo"
        sinfo -N -n {{ target_node }} || true
        echo
        echo "### scontrol"
        scontrol show node {{ target_node }} || true
        echo
        echo "### jobs on node"
        squeue -w {{ target_node }} || true
      args:
        executable: /bin/bash
      register: node_lifecycle_state
      changed_when: false
    - name: Print node lifecycle state
      ansible.builtin.debug:
        var: node_lifecycle_state.stdout_lines
@@ -0,0 +1,169 @@
 ---
 - name: Configure Slurm QOS, limits and fairshare
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Ensure sacctmgr is available
      ansible.builtin.command:
        cmd: sacctmgr -n list cluster
      changed_when: false
    - name: Validate accounting GPU TRES exists
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### configured AccountingStorageTRES"
        scontrol show config | grep -E "AccountingStorageTRES|AccountingStorageType|AccountingStorageEnforce"
        echo
        echo "### known TRES"
        sacctmgr show tres
        echo
        echo "### checking gres/gpu"
        sacctmgr -n show tres format=Type,Name | awk '$1=="gres" && $2=="gpu" {found=1} END {exit !found}'
      args:
        executable: /bin/bash
      register: gpu_tres_check
      changed_when: false
    - name: Ensure normal QOS exists
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i add qos normal Priority=100
      args:
        executable: /bin/bash
      register: add_qos_normal
      changed_when: "'Adding QOS' in (add_qos_normal.stdout + add_qos_normal.stderr)"
      failed_when: >
        add_qos_normal.rc != 0 and
        'Nothing new added' not in (add_qos_normal.stdout + add_qos_normal.stderr) and
        'already exists' not in (add_qos_normal.stdout + add_qos_normal.stderr) and
        'Already existing' not in (add_qos_normal.stdout + add_qos_normal.stderr)
    - name: Ensure debug-short QOS exists
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i add qos debug-short Priority=500
      args:
        executable: /bin/bash
      register: add_qos_debug
      changed_when: "'Adding QOS' in (add_qos_debug.stdout + add_qos_debug.stderr)"
      failed_when: >
        add_qos_debug.rc != 0 and
        'Nothing new added' not in (add_qos_debug.stdout + add_qos_debug.stderr) and
        'already exists' not in (add_qos_debug.stdout + add_qos_debug.stderr) and
        'Already existing' not in (add_qos_debug.stdout + add_qos_debug.stderr)
    - name: Ensure gpu-short QOS exists
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i add qos gpu-short Priority=1000
      args:
        executable: /bin/bash
      register: add_qos_gpu
      changed_when: "'Adding QOS' in (add_qos_gpu.stdout + add_qos_gpu.stderr)"
      failed_when: >
        add_qos_gpu.rc != 0 and
        'Nothing new added' not in (add_qos_gpu.stdout + add_qos_gpu.stderr) and
        'already exists' not in (add_qos_gpu.stdout + add_qos_gpu.stderr) and
        'Already existing' not in (add_qos_gpu.stdout + add_qos_gpu.stderr)
    - name: Ensure maintenance QOS exists
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i add qos maintenance Priority=5000
      args:
        executable: /bin/bash
      register: add_qos_maintenance
      changed_when: "'Adding QOS' in (add_qos_maintenance.stdout + add_qos_maintenance.stderr)"
      failed_when: >
        add_qos_maintenance.rc != 0 and
        'Nothing new added' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr) and
        'already exists' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr) and
        'Already existing' not in (add_qos_maintenance.stdout + add_qos_maintenance.stderr)
    - name: Normalize normal QOS settings
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify qos normal set Priority=100
      args:
        executable: /bin/bash
      changed_when: true
    - name: Normalize debug-short QOS settings
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify qos debug-short set Priority=500 MaxWall=00:10:00 MaxTRESPU=cpu=2 MaxJobsPU=4
      args:
        executable: /bin/bash
      changed_when: true
    - name: Normalize gpu-short QOS settings
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify qos gpu-short set Priority=1000 MaxWall=01:00:00 MaxTRESPU=gres/gpu=1,cpu=12 MaxJobsPU=2
      args:
        executable: /bin/bash
      changed_when: true
    - name: Normalize maintenance QOS settings
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify qos maintenance set Priority=5000 MaxWall=02:00:00
      args:
        executable: /bin/bash
      changed_when: true
    - name: Assign QOS set to lab account
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify account {{ slurm_account_name }} set QOS=normal,debug-short,gpu-short,maintenance DefaultQOS=normal Fairshare=100
      args:
        executable: /bin/bash
      changed_when: true
    - name: Assign default account to slurmuser
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify user where name=slurmuser set DefaultAccount={{ slurm_account_name }}
      args:
        executable: /bin/bash
      changed_when: true
    - name: Assign QOS set to slurmuser association
      ansible.builtin.shell: |
        set -euo pipefail
        sacctmgr -i modify user where name=slurmuser account={{ slurm_account_name }} set QOS=normal,debug-short,gpu-short,maintenance DefaultQOS=normal Fairshare=100
      args:
        executable: /bin/bash
      changed_when: true
    - name: Show configured QOS and associations
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### TRES"
        sacctmgr show tres
        echo
        echo "### QOS"
        sacctmgr show qos format=Name%20,Priority,MaxWall,MaxTRESPU%40,MaxJobsPU
        echo
        echo "### Associations"
        sacctmgr show assoc format=Cluster,Account,User,Share,QOS%60,DefaultQOS,Fairshare
        echo
        echo "### Fairshare"
        sshare -A {{ slurm_account_name }} || true
      args:
        executable: /bin/bash
      register: qos_state
      changed_when: false
    - name: Print QOS state
      ansible.builtin.debug:
        var: qos_state.stdout_lines
@@ -0,0 +1,235 @@
 ---
 - name: Validate Slurm QOS, fairshare and priority
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Validate priority runtime config
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### priority config"
        scontrol show config | grep -E "PriorityType|PriorityWeight|PriorityDecay|PriorityCalc|PriorityMaxAge|PriorityFavor"
        echo
        echo "### accounting enforcement"
        scontrol show config | grep -E "AccountingStorageType|AccountingStorageEnforce|AccountingStorageTRES"
        echo
        echo "### QOS"
        sacctmgr show qos format=Name%20,Priority,MaxWall,MaxTRESPU%50,MaxJobsPU
        echo
        echo "### associations"
        sacctmgr show assoc format=Cluster,Account,User,Share,QOS%80,DefaultQOS,Fairshare
        echo
        echo "### fairshare"
        sshare -A {{ slurm_account_name }} || true
      args:
        executable: /bin/bash
      register: priority_state
      changed_when: false
    - name: Submit debug-short QOS job
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=qos-debug-test
        #SBATCH --partition=debug
        #SBATCH --qos=debug-short
        #SBATCH --account={{ slurm_account_name }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/qos-debug-test-%j.out
        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "QOS=${SLURM_JOB_QOS:-}"
        echo "ACCOUNT=${SLURM_JOB_ACCOUNT:-}"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/qos-debug-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: debug_qos_job
      changed_when: true
    - name: Submit gpu-short QOS job
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=qos-gpu-test
        #SBATCH --partition=gpu
        #SBATCH --qos=gpu-short
        #SBATCH --account={{ slurm_account_name }}
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/qos-gpu-test-%j.out
        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "QOS=${SLURM_JOB_QOS:-}"
        echo "ACCOUNT=${SLURM_JOB_ACCOUNT:-}"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo
        nvidia-smi
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/qos-gpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_qos_job
      changed_when: true
    - name: Validate debug-short walltime limit behavior
      ansible.builtin.shell: |
        set -euo pipefail
        set +e
        output="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH' 2>&1
        #!/bin/bash
        #SBATCH --job-name=qos-limit-fail
        #SBATCH --partition=debug
        #SBATCH --qos=debug-short
        #SBATCH --account={{ slurm_account_name }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:30:00
        #SBATCH --output=/shared/qos-limit-fail-%j.out
        sleep 10
        SBATCH
        )"
        rc=$?
        set -e
        echo "RC=$rc"
        echo "$output"
        if [ "$rc" -ne 0 ]; then
          echo "Limit rejection test passed at submit time"
          exit 0
        fi
        job_id="$output"
        echo "Submitted job despite expected limit check: $job_id"
        sleep 3
        echo "### squeue"
        squeue -j "$job_id" -o "%.18i %.9P %.20j %.8u %.2t %.10M %.6D %R" || true
        echo
        echo "### job detail"
        scontrol show job "$job_id" || true
        state="$(squeue -h -j "$job_id" -o "%T" || true)"
        reason="$(squeue -h -j "$job_id" -o "%R" || true)"
        echo "STATE=$state"
        echo "REASON=$reason"
        if echo "$state" | grep -qE "PENDING|CONFIGURING"; then
          if echo "$reason" | grep -qiE "qos|limit|time|max|assoc"; then
            echo "Limit enforcement test passed via pending reason"
            scancel "$job_id" || true
            exit 0
          fi
        fi
        echo "Job was accepted without an obvious QOS/limit pending reason"
        scancel "$job_id" || true
        exit 1
      args:
        executable: /bin/bash
      register: limit_rejection
      changed_when: false
    - name: Show priority and fairshare snapshot
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### queue"
        squeue || true
        echo
        echo "### sprio"
        sprio || true
        echo
        echo "### sshare"
        sshare -A {{ slurm_account_name }} || true
        echo
        echo "### recent sacct"
        sacct -S today --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList | tail -40
      args:
        executable: /bin/bash
      register: priority_snapshot
      changed_when: false
    - name: Print validation result
      ansible.builtin.debug:
        msg:
          - "### priority state"
          - "{{ priority_state.stdout_lines }}"
          - "### debug QOS job"
          - "{{ debug_qos_job.stdout_lines }}"
          - "### GPU QOS job"
          - "{{ gpu_qos_job.stdout_lines }}"
          - "### limit rejection"
          - "{{ limit_rejection.stdout_lines }}"
          - "### priority snapshot"
          - "{{ priority_snapshot.stdout_lines }}"
@@ -0,0 +1,59 @@
 ---
 - name: Test CPU cgroup enforcement on gpu01
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Submit cgroup CPU test to gpu01
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=cgroup-cpu-test
        #SBATCH --partition=all
        #SBATCH --nodelist=gpu01
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/cgroup-cpu-test-%j.out
        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "MEMS_ALLOWED=$(grep Mems_allowed_list /proc/self/status || true)"
        echo
        echo "### cgroup"
        cat /proc/self/cgroup
        echo
        echo "### mounted cgroups"
        mount | grep cgroup || true
        sleep 5
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### output"
        cat "/shared/cgroup-cpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: cgroup_cpu_result
      changed_when: true
    - name: Show cgroup CPU result
      ansible.builtin.debug:
        var: cgroup_cpu_result.stdout_lines
@@ -0,0 +1,60 @@
 ---
 - name: Submit CPU test job
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Submit test job to debug partition
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=cpu-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=512M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/cpu-test-%j.out
        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sudo -iu slurmuser sacct -j "$job_id" --format=JobID,JobName,Partition,State,ExitCode 2>/dev/null || true
        echo "### output"
        if [ -f "/shared/cpu-test-${job_id}.out" ]; then
          cat "/shared/cpu-test-${job_id}.out"
        else
          echo "Output file not found: /shared/cpu-test-${job_id}.out"
          find /shared -maxdepth 1 -name "cpu-test-*.out" -ls | tail -5 || true
          exit 1
        fi
      args:
        executable: /bin/bash
      register: cpu_job_result
      changed_when: true
    - name: Show CPU job result
      ansible.builtin.debug:
        var: cpu_job_result.stdout_lines
@@ -0,0 +1,58 @@
 ---
 - name: Test GPU access without GRES allocation
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Submit job to gpu01 without requesting GPU
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=gpu-deny-test
        #SBATCH --partition=all
        #SBATCH --nodelist=gpu01
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=1G
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/gpu-deny-test-%j.out
        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo
        echo "### ls nvidia devices"
        ls -l /dev/nvidia* 2>&1 || true
        echo
        echo "### nvidia-smi without GRES"
        nvidia-smi 2>&1 || true
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 60); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### output"
        cat "/shared/gpu-deny-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_deny_result
      changed_when: true
    - name: Show GPU deny test result
      ansible.builtin.debug:
        var: gpu_deny_result.stdout_lines
@@ -0,0 +1,70 @@
 ---
 - name: Submit GPU test job
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Submit test job to gpu partition
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=gpu-test
        #SBATCH --partition=gpu
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=2G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/gpu-test-%j.out
        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo
        echo "### nvidia-smi"
        nvidia-smi
        echo
        echo "### GPU process table"
        nvidia-smi pmon -c 1 || true
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 90); do
          if sudo -iu slurmuser squeue -h -j "$job_id" | grep -q .; then
            sudo -iu slurmuser squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sudo -iu slurmuser sacct -j "$job_id" --format=JobID,JobName,Partition,State,ExitCode 2>/dev/null || true
        echo "### output"
        if [ -f "/shared/gpu-test-${job_id}.out" ]; then
          cat "/shared/gpu-test-${job_id}.out"
        else
          echo "Output file not found: /shared/gpu-test-${job_id}.out"
          find /shared -maxdepth 1 -name "gpu-test-*.out" -ls | tail -5 || true
          exit 1
        fi
      args:
        executable: /bin/bash
      register: gpu_job_result
      changed_when: true
    - name: Show GPU job result
      ansible.builtin.debug:
        var: gpu_job_result.stdout_lines
@@ -0,0 +1,95 @@
 ---
 - name: Submit job to specific Slurm node
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Require target_node
      ansible.builtin.fail:
        msg: "Use: ansible-playbook test-specific-node.yml -e target_node=<hostname>"
      when: target_node is not defined
    - name: Submit test job to target node
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=node-test
        #SBATCH --partition=debug
        #SBATCH --nodelist={{ target_node }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --account=lab
        #SBATCH --qos=normal
        #SBATCH --output=/shared/node-test-%j.out
        echo "HOST=\$(hostname)"
        echo "USER=\$(whoami)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        echo "### waiting for job to leave queue"
        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### waiting for output file"
        for i in $(seq 1 30); do
          if [ -s "/shared/node-test-${job_id}.out" ]; then
            break
          fi
          sleep 1
        done
        echo "### waiting for sacct final state"
        final_state=""
        for i in $(seq 1 30); do
          final_state="$(
            sacct -n -P -j "$job_id" --format=State 2>/dev/null \
              | head -n 1 \
              | cut -d'|' -f1 \
              | awk '{print $1}'
          )"
          if echo "$final_state" | grep -qE "COMPLETED|FAILED|CANCELLED|TIMEOUT|NODE_FAIL|OUT_OF_MEMORY"; then
            break
          fi
          sleep 1
        done
        echo "FINAL_STATE=${final_state:-UNKNOWN}"
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Account,QOS,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/node-test-${job_id}.out"
        if [ "${final_state:-UNKNOWN}" != "COMPLETED" ]; then
          echo "Job did not reach COMPLETED state according to sacct"
          exit 1
        fi
      args:
        executable: /bin/bash
      register: node_test
      changed_when: true
    - name: Show node test result
      ansible.builtin.debug:
        var: node_test.stdout_lines
@@ -0,0 +1,60 @@
 ---
 - name: Generate measurable Slurm usage for sreport
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Submit CPU usage job
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=sreport-usage
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=512M
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/sreport-usage-%j.out
        echo "HOST=$(hostname)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_CPUS_PER_TASK=${SLURM_CPUS_PER_TASK:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "Burning CPU for 90 seconds"
        timeout 90 bash -c 'while true; do :; done' &
        timeout 90 bash -c 'while true; do :; done' &
        wait
        echo "Done"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 150); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 2
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/sreport-usage-${job_id}.out"
      args:
        executable: /bin/bash
      register: sreport_usage_job
      changed_when: true
    - name: Show usage job result
      ansible.builtin.debug:
        var: sreport_usage_job.stdout_lines
@@ -0,0 +1,140 @@
 ---
 - name: Validate Slurm operator user and SSH mesh
  hosts: slurm_cluster
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
    slurm_hosts: "{{ groups['slurm_cluster'] }}"
  tasks:
    - name: Validate slurmuser exists
      ansible.builtin.command:
        cmd: id {{ slurm_operator_user }}
      changed_when: false
    - name: Validate sinfo as slurmuser
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sinfo
      changed_when: false
    - name: Validate squeue as slurmuser
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} squeue
      changed_when: false
    - name: Validate SSH mesh as slurmuser
      ansible.builtin.shell: |
        set -euo pipefail
        for h in {{ slurm_hosts | join(' ') }}; do
          echo "=== $h ==="
          ssh -o BatchMode=yes -o ConnectTimeout=5 "$h" hostname
        done
      args:
        executable: /bin/bash
      become_user: "{{ slurm_operator_user }}"
      changed_when: false
 - name: Validate Slurm controller commands
  hosts: slurm_controller
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Validate slurmctld status through sudo
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sudo -n systemctl status slurmctld --no-pager
      changed_when: false
    - name: Validate controller Slurm commands
      ansible.builtin.shell: |
        set -euo pipefail
        sudo -iu {{ slurm_operator_user }} sinfo
        sudo -iu {{ slurm_operator_user }} squeue
        sudo -iu {{ slurm_operator_user }} scontrol show nodes
      args:
        executable: /bin/bash
      changed_when: false
 - name: Validate Slurm worker commands
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Validate slurmd status through sudo
      ansible.builtin.command:
        cmd: sudo -iu {{ slurm_operator_user }} sudo -n systemctl status slurmd --no-pager
      changed_when: false
    - name: Validate worker Slurm commands
      ansible.builtin.shell: |
        set -euo pipefail
        sudo -iu {{ slurm_operator_user }} sinfo
        sudo -iu {{ slurm_operator_user }} squeue
        sudo -iu {{ slurm_operator_user }} scontrol show nodes
      args:
        executable: /bin/bash
      changed_when: false
 - name: Validate basic job submission
  hosts: slurm_controller
  become: true
  gather_facts: false
  vars:
    slurm_operator_user: slurmuser
  tasks:
    - name: Submit simple Slurm test job as slurmuser
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu {{ slurm_operator_user }} sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=ansible-validate
        #SBATCH --partition=debug
        #SBATCH --time=00:01:00
        #SBATCH --output=/tmp/ansible-validate-%j.out
        hostname
        whoami
        date
        SBATCH
        )"
        echo "$job_id"
        for i in $(seq 1 20); do
          state="$(sudo -iu {{ slurm_operator_user }} squeue -h -j "$job_id" -o "%T" || true)"
          if [ -z "$state" ]; then
            break
          fi
          echo "job_state=$state"
          sleep 1
        done
        sudo -iu {{ slurm_operator_user }} sacct -j "$job_id" --format=JobID,JobName,State,ExitCode 2>/dev/null || true
        if ls /tmp/ansible-validate-"$job_id".out >/dev/null 2>&1; then
          cat /tmp/ansible-validate-"$job_id".out
        fi
      args:
        executable: /bin/bash
      register: slurm_job_test
      changed_when: true
    - name: Show basic job submission result
      ansible.builtin.debug:
        var: slurm_job_test.stdout_lines
@@ -0,0 +1,236 @@
 ---
 - name: Validate canary node variable
  hosts: localhost
  gather_facts: false
  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"
  tasks:
    - name: Ensure canary node is in inventory
      ansible.builtin.fail:
        msg: "canary_node={{ canary_node_effective }} is not in inventory"
      when: canary_node_effective not in groups['all']
    - name: Ensure canary node is not the controller
      ansible.builtin.fail:
        msg: "Do not use controller as canary for worker rolling upgrade"
      when: canary_node_effective in groups['slurm_controller']
 - name: Drain canary node
  hosts: slurm_controller
  become: true
  gather_facts: false
  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"
  tasks:
    - name: Show canary state before drain
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ canary_node_effective }} || true
        scontrol show node {{ canary_node_effective }} || true
        squeue -w {{ canary_node_effective }} || true
      args:
        executable: /bin/bash
      register: canary_before
      changed_when: false
    - name: Print canary state before drain
      ansible.builtin.debug:
        var: canary_before.stdout_lines
    - name: Drain canary node
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ canary_node_effective }} State=DRAIN Reason="canary OS upgrade"
      changed_when: true
    - name: Wait until canary has no running jobs
      ansible.builtin.command:
        cmd: squeue -h -w {{ canary_node_effective }}
      register: canary_jobs
      retries: 120
      delay: 10
      # Require a successful squeue call; an error must not look like an empty queue.
      until:
        - canary_jobs.rc == 0
        - canary_jobs.stdout | trim == ""
      changed_when: false
 - name: Upgrade canary node OS packages
  hosts: "{{ canary_node | default('slurm-c02') }}"
  become: true
  gather_facts: true
  tasks:
    - name: Ensure apt cache is updated
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800
    - name: Full upgrade packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: apt_upgrade_result
    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required
    - name: Show upgrade summary
      ansible.builtin.debug:
        msg:
          - "Host: {{ inventory_hostname }}"
          - "Apt changed: {{ apt_upgrade_result.changed }}"
          - "Reboot required: {{ reboot_required.stat.exists }}"
    - name: Reboot canary if required
      ansible.builtin.reboot:
        msg: "Reboot after canary OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 20
      when: reboot_required.stat.exists
    - name: Restart munge and enable it at boot
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmd and enable it at boot
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true
    - name: Validate local services
      ansible.builtin.shell: |
        set -euo pipefail
        systemctl is-active munge
        systemctl is-active slurmd
        munge -n | unmunge >/dev/null
        scontrol ping
      args:
        executable: /bin/bash
      changed_when: false
 - name: Resume canary node and run canary job
  hosts: slurm_controller
  become: true
  gather_facts: false
  vars:
    canary_node_effective: "{{ canary_node | default('slurm-c02') }}"
  tasks:
    - name: Reconfigure controller
      ansible.builtin.command:
        cmd: scontrol reconfigure
      changed_when: true
    - name: Restart controller to refresh node state
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
    - name: Wait for controller
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false
    - name: Clear canary node maintenance state
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol update NodeName={{ canary_node_effective }} State=RESUME 2>/dev/null || true
        scontrol update NodeName={{ canary_node_effective }} State=UNDRAIN 2>/dev/null || true
        scontrol update NodeName={{ canary_node_effective }} State=IDLE 2>/dev/null || true
        sleep 3
        sinfo -N -n {{ canary_node_effective }}
        scontrol show node {{ canary_node_effective }}
      args:
        executable: /bin/bash
      register: resume_canary
      changed_when: true
    - name: Wait until canary is IDLE and responding
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ canary_node_effective }}
        scontrol show node {{ canary_node_effective }}
      args:
        executable: /bin/bash
      register: canary_state
      retries: 30
      delay: 5
      until:
        - canary_state.rc == 0
        - "'not_responding' not in canary_state.stdout.lower()"
        - "'down' not in canary_state.stdout.lower()"
        - "'drain' not in canary_state.stdout.lower()"
        - "'idle*' not in canary_state.stdout.lower()"
      changed_when: false
    - name: Submit canary test job to upgraded node
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=canary-upgrade-test
        #SBATCH --partition=all
        #SBATCH --nodelist={{ canary_node_effective }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/canary-upgrade-test-%j.out
        echo "HOST=\$(hostname)"
        echo "USER=\$(whoami)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=\$(uname -r)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/canary-upgrade-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: canary_job
      changed_when: true
    - name: Show canary test result
      ansible.builtin.debug:
        var: canary_job.stdout_lines
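    # Illustrative addition, not part of the original playbook: the task above
    # prints the sacct table but never checks it. A minimal sketch of a gate,
    # assuming the accounting record is already visible when sacct runs and that
    # the State column reads COMPLETED for a successful job.
    - name: Assert canary job reached COMPLETED (illustrative)
      ansible.builtin.assert:
        that:
          - canary_job.stdout is search('COMPLETED')
        fail_msg: "Canary job did not report COMPLETED; review the sacct output before continuing."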
@@ -0,0 +1,197 @@
 ---
 - name: Rolling upgrade Slurm worker nodes
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true
  serial: 1
  vars:
    skip_canary_node: "{{ canary_node | default('slurm-c02') }}"
    do_skip_canary: "{{ skip_canary | default(true) | bool }}"
  pre_tasks:
    - name: Skip canary node if requested
      ansible.builtin.meta: end_host
      when:
        - do_skip_canary
        - inventory_hostname == skip_canary_node
    - name: Drain node before OS upgrade
      ansible.builtin.command:
        cmd: scontrol update NodeName={{ inventory_hostname }} State=DRAIN Reason="rolling OS upgrade"
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      changed_when: true
    - name: Wait until no jobs are running on this node
      ansible.builtin.command:
        cmd: squeue -h -w {{ inventory_hostname }}
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: jobs_on_node
      retries: 120
      delay: 10
      # Require a successful squeue call; an error must not look like an empty queue.
      until:
        - jobs_on_node.rc == 0
        - jobs_on_node.stdout | trim == ""
      changed_when: false
  tasks:
    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800
    - name: Full upgrade packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: apt_upgrade_result
    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required
    - name: Show upgrade status
      ansible.builtin.debug:
        msg:
          - "Node: {{ inventory_hostname }}"
          - "Apt changed: {{ apt_upgrade_result.changed }}"
          - "Reboot required: {{ reboot_required.stat.exists }}"
    - name: Reboot node if required
      ansible.builtin.reboot:
        msg: "Reboot after rolling OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 20
      when: reboot_required.stat.exists
    - name: Restart munge
      ansible.builtin.systemd:
        name: munge
        state: restarted
        enabled: true
    - name: Restart slurmd
      ansible.builtin.systemd:
        name: slurmd
        state: restarted
        enabled: true
    - name: Validate local slurm services
      ansible.builtin.shell: |
        set -euo pipefail
        systemctl is-active munge
        systemctl is-active slurmd
        munge -n | unmunge >/dev/null
        scontrol ping
      args:
        executable: /bin/bash
      changed_when: false
  post_tasks:
    - name: Restart controller to refresh state after node upgrade
      ansible.builtin.systemd:
        name: slurmctld
        state: restarted
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      run_once: false
    - name: Wait for controller after restart
      ansible.builtin.command:
        cmd: scontrol ping
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: slurmctld_ping
      retries: 15
      delay: 2
      until: slurmctld_ping.rc == 0
      changed_when: false
    - name: Clear upgraded node maintenance state
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol update NodeName={{ inventory_hostname }} State=RESUME 2>/dev/null || true
        scontrol update NodeName={{ inventory_hostname }} State=UNDRAIN 2>/dev/null || true
        scontrol update NodeName={{ inventory_hostname }} State=IDLE 2>/dev/null || true
        sleep 3
        sinfo -N -n {{ inventory_hostname }}
        scontrol show node {{ inventory_hostname }}
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: resume_node
      changed_when: true
    - name: Wait until node is healthy
      ansible.builtin.shell: |
        set -euo pipefail
        sinfo -N -n {{ inventory_hostname }}
        scontrol show node {{ inventory_hostname }}
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: upgraded_node_state
      retries: 30
      delay: 5
      until:
        - upgraded_node_state.rc == 0
        - "'not_responding' not in upgraded_node_state.stdout.lower()"
        - "'down' not in upgraded_node_state.stdout.lower()"
        - "'drain' not in upgraded_node_state.stdout.lower()"
        - "'idle*' not in upgraded_node_state.stdout.lower()"
      changed_when: false
    - name: Submit node-local post-upgrade test job
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<SBATCH
        #!/bin/bash
        #SBATCH --job-name=rolling-upgrade-test
        #SBATCH --partition=all
        #SBATCH --nodelist={{ inventory_hostname }}
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/rolling-upgrade-test-%j.out
        echo "HOST=\$(hostname)"
        echo "SLURM_JOB_ID=\$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=\$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=\$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=\$(uname -r)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/rolling-upgrade-test-${job_id}.out"
      args:
        executable: /bin/bash
      delegate_to: "{{ groups['slurm_controller'][0] }}"
      register: node_test_job
      changed_when: true
    - name: Show node post-upgrade test result
      ansible.builtin.debug:
        var: node_test_job.stdout_lines
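    # Illustrative addition, not part of the original playbook: a sketch that
    # confirms the test job really ran on the node just upgraded. It assumes the
    # Slurm NodeName equals inventory_hostname, which the --nodelist directive in
    # the test job already relies on.
    - name: Assert test job ran on the upgraded node (illustrative)
      ansible.builtin.assert:
        that:
          - "('SLURM_JOB_NODELIST=' ~ inventory_hostname) in node_test_job.stdout"
        fail_msg: "Test job output does not name {{ inventory_hostname }}; check the job output file."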
@@ -0,0 +1,94 @@
 ---
 - name: Upgrade Slurm controller OS safely
  hosts: slurm_controller
  become: true
  gather_facts: true
  tasks:
    - name: Show cluster state before controller upgrade
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        squeue
        systemctl is-active munge
        systemctl is-active slurmctld
        systemctl is-active slurmdbd || true
        systemctl is-active mariadb || true
      args:
        executable: /bin/bash
      register: before_state
      changed_when: false
    - name: Print cluster state before controller upgrade
      ansible.builtin.debug:
        var: before_state.stdout_lines
    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 1800
    - name: Full upgrade controller packages
      ansible.builtin.apt:
        upgrade: full
        autoremove: true
        autoclean: true
      register: controller_upgrade
    - name: Check if reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: controller_reboot_required
    - name: Show controller upgrade status
      ansible.builtin.debug:
        msg:
          - "Apt changed: {{ controller_upgrade.changed }}"
          - "Reboot required: {{ controller_reboot_required.stat.exists }}"
    - name: Reboot controller if required
      ansible.builtin.reboot:
        msg: "Reboot after controller OS upgrade"
        reboot_timeout: 900
        connect_timeout: 20
        pre_reboot_delay: 5
        post_reboot_delay: 30
      when: controller_reboot_required.stat.exists
    - name: Restart controller services
      ansible.builtin.systemd:
        name: "{{ item }}"
        state: restarted
        enabled: true
      loop:
        - munge
        - mariadb
        - slurmdbd
        - slurmctld
    - name: Wait for slurmctld
      ansible.builtin.command:
        cmd: scontrol ping
      register: slurmctld_ping
      retries: 20
      delay: 3
      until: slurmctld_ping.rc == 0
      changed_when: false
    - name: Validate controller after upgrade
      ansible.builtin.shell: |
        set -euo pipefail
        scontrol ping
        sinfo
        squeue
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|TaskPlugin|ProctrackType"
        sacct -S today --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,NodeList | tail -20
      args:
        executable: /bin/bash
      register: controller_after
      changed_when: false
    - name: Print controller validation after upgrade
      ansible.builtin.debug:
        var: controller_after.stdout_lines
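    # Illustrative addition, not part of the original playbook: the validation
    # above reads accounting through sacct only. This sketch asks SlurmDBD for
    # the registered cluster list, assuming accounting_storage/slurmdbd is in use
    # (the play restarts slurmdbd unconditionally, so that seems to be the case).
    - name: Check that SlurmDBD answers accounting queries (illustrative)
      ansible.builtin.command:
        cmd: sacctmgr -n -P show cluster format=Cluster
      register: dbd_cluster_check
      changed_when: false
      failed_when: dbd_cluster_check.rc != 0 or (dbd_cluster_check.stdout | trim) == ""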
@@ -0,0 +1,207 @@
 ---
 - name: Validate cluster after OS rolling upgrade
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Validate Slurm controller and cluster state
      ansible.builtin.shell: |
        set -euo pipefail
        echo "### slurmctld ping"
        scontrol ping
        echo
        echo "### nodes"
        sinfo -N
        echo
        echo "### partitions"
        sinfo
        echo
        echo "### queue"
        squeue
        echo
        echo "### important config"
        scontrol show config | grep -E "AccountingStorage|JobAcctGather|TaskPlugin|ProctrackType|SelectType|ClusterName"
        echo
        echo "### accounting recent jobs"
        sacct -S today --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList | tail -30
      args:
        executable: /bin/bash
      register: cluster_state
      changed_when: false
    - name: Print cluster state
      ansible.builtin.debug:
        var: cluster_state.stdout_lines
 - name: Validate worker services after OS rolling upgrade
  hosts: slurm_compute:slurm_gpu
  become: true
  gather_facts: true
  tasks:
    - name: Validate local worker services and Slurm connectivity
      ansible.builtin.shell: |
        set -euo pipefail
        echo "HOST=$(hostname)"
        echo "FQDN=$(hostname -f 2>/dev/null || hostname)"
        echo "KERNEL=$(uname -r)"
        echo "UPTIME=$(uptime -p)"
        echo
        echo "### services"
        systemctl is-active munge
        systemctl is-active slurmd
        echo
        echo "### munge local test"
        munge -n | unmunge >/dev/null
        echo "munge OK"
        echo
        echo "### controller ping"
        scontrol ping
        echo
        echo "### local slurm.conf checksum"
        sha256sum /etc/slurm/slurm.conf /etc/slurm/cgroup.conf 2>/dev/null || true
        echo
        echo "### gpu check if present"
        if command -v nvidia-smi >/dev/null 2>&1; then
          nvidia-smi --query-gpu=index,name,driver_version,memory.total --format=csv,noheader || true
        else
          echo "NO_NVIDIA_SMI"
        fi
      args:
        executable: /bin/bash
      register: worker_state
      changed_when: false
    - name: Print worker state
      ansible.builtin.debug:
        var: worker_state.stdout_lines
 - name: Submit post-upgrade CPU validation job
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Submit CPU validation job to debug partition
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=os-upgrade-cpu-test
        #SBATCH --partition=debug
        #SBATCH --cpus-per-task=1
        #SBATCH --mem=256M
        #SBATCH --time=00:02:00
        #SBATCH --output=/shared/os-upgrade-cpu-test-%j.out
        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=$(uname -r)"
        date
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 90); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/os-upgrade-cpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: cpu_validation_job
      changed_when: true
    - name: Print CPU validation job
      ansible.builtin.debug:
        var: cpu_validation_job.stdout_lines
 - name: Submit post-upgrade GPU validation job
  hosts: slurm_controller
  become: true
  gather_facts: false
  tasks:
    - name: Submit GPU validation job to gpu partition
      ansible.builtin.shell: |
        set -euo pipefail
        job_id="$(
          sudo -iu slurmuser sbatch --parsable <<'SBATCH'
        #!/bin/bash
        #SBATCH --job-name=os-upgrade-gpu-test
        #SBATCH --partition=gpu
        #SBATCH --gres=gpu:1
        #SBATCH --cpus-per-task=2
        #SBATCH --mem=1G
        #SBATCH --time=00:03:00
        #SBATCH --output=/shared/os-upgrade-gpu-test-%j.out
        echo "HOST=$(hostname)"
        echo "USER=$(whoami)"
        echo "SLURM_JOB_ID=$SLURM_JOB_ID"
        echo "SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
        echo "SLURM_JOB_GPUS=${SLURM_JOB_GPUS:-}"
        echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
        echo "CPUS_ALLOWED=$(grep Cpus_allowed_list /proc/self/status)"
        echo "KERNEL=$(uname -r)"
        echo
        nvidia-smi
        SBATCH
        )"
        echo "JOB_ID=$job_id"
        for i in $(seq 1 120); do
          if squeue -h -j "$job_id" | grep -q .; then
            squeue -j "$job_id"
            sleep 1
          else
            break
          fi
        done
        echo "### sacct"
        sacct -j "$job_id" --format=JobID,JobName,User,Partition,State,ExitCode,Elapsed,AllocCPUS,ReqMem,NodeList
        echo "### output"
        cat "/shared/os-upgrade-gpu-test-${job_id}.out"
      args:
        executable: /bin/bash
      register: gpu_validation_job
      changed_when: true
    - name: Print GPU validation job
      ansible.builtin.debug:
        var: gpu_validation_job.stdout_lines
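    # Illustrative addition, not part of the original playbook: a sketch of a
    # final gate that fails the run if any node is still down or drained.
    # It assumes no node is meant to stay drained once the upgrade is finished.
    - name: Fail if any node is still down or drained (illustrative)
      ansible.builtin.command:
        cmd: sinfo -h -N -t down,drain -o "%N %T"
      register: unhealthy_nodes
      changed_when: false
      failed_when: unhealthy_nodes.rc != 0 or (unhealthy_nodes.stdout | trim) != ""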
@@ -0,0 +1,15 @@
 # Codex prompt: generate repository documentation
 You are working in an Ansible repository that automates a Slurm AI/HPC lab.
 Please review the repository and generate or improve documentation under `docs/` with the following goals:
 1. Explain the architecture and repository layout.
 2. Document the end-to-end deployment sequence.
 3. Document operational workflows: provisioning, decommissioning, rolling upgrades, health checks and auto-remediation.
 4. Document SlurmDBD accounting, QOS, fairshare and priority workflows.
 5. Add troubleshooting notes based on the playbooks and templates.
 6. Avoid exposing secrets, real IP addresses, real hostnames, SQL dumps, backup archives, private keys or vault content.
 7. Keep all text in English.
 Output should be practical, operator-focused and suitable for a public Git repository.
@@ -0,0 +1,16 @@
 # Managed by Ansible
 # Slurm cgroup configuration
 CgroupPlugin=autodetect
 ConstrainCores=yes
 ConstrainRAMSpace=yes
 ConstrainSwapSpace=no
 ConstrainDevices=yes
 AllowedRAMSpace=100
 AllowedSwapSpace=0
 MaxRAMPercent=100
 MaxSwapPercent=0
 MinRAMSpace=30
@@ -0,0 +1,4 @@
 # Managed by Ansible
 {% for node in slurm_nodes if node.managed_state | default('present') == 'present' and node.gres | default('') | length > 0 %}
 NodeName={{ node.name }} Name=gpu File={{ node.gres_file | default('/dev/nvidia0') }}
 {% endfor %}
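 {#
 Illustrative only, not part of the original template. Assuming a hypothetical
 node entry with name "node-a", gres "gpu:1" and no gres_file, the loop above
 would render:

   NodeName=node-a Name=gpu File=/dev/nvidia0

 A node with more than one GPU would need gres_file set explicitly, for example
 to a bracket range such as /dev/nvidia[0-1], because the default names a
 single device.
 #}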
@@ -0,0 +1,67 @@
 # Managed by Ansible
 ClusterName={{ slurm_cluster_name }}
 SlurmctldHost={{ slurm_control_machine }}({{ slurm_control_addr }})
 SlurmUser={{ slurm_user }}
 AuthType=auth/munge
 StateSaveLocation=/var/spool/slurmctld
 SlurmdSpoolDir=/var/spool/slurmd
 SwitchType=switch/none
 MpiDefault={{ slurm_default_mpi_type }}
 ProctrackType={{ slurm_proctrack_type }}
 ReturnToService={{ slurm_return_to_service }}
 {% if slurm_gres_types is defined and slurm_gres_types | length > 0 %}
 GresTypes={{ slurm_gres_types }}
 {% endif %}
 SlurmctldPidFile=/run/slurmctld.pid
 SlurmdPidFile=/run/slurmd.pid
 SlurmctldPort={{ slurmctld_port }}
 SlurmdPort={{ slurmd_port }}
 TaskPlugin={{ slurm_task_plugin }}
 SelectType={{ slurm_select_type }}
 SelectTypeParameters={{ slurm_select_type_parameters }}
 SchedulerType=sched/backfill
 # Priority / fairshare
 PriorityType={{ slurm_priority_type | default('priority/multifactor') }}
 PriorityDecayHalfLife={{ slurm_priority_decay_half_life | default('7-0') }}
 PriorityCalcPeriod={{ slurm_priority_calc_period | default(5) }}
 PriorityFavorSmall={{ slurm_priority_favor_small | default('NO') }}
 PriorityWeightAge={{ slurm_priority_weight_age | default(1000) }}
 PriorityWeightFairshare={{ slurm_priority_weight_fairshare | default(10000) }}
 PriorityWeightJobSize={{ slurm_priority_weight_job_size | default(1000) }}
 PriorityWeightPartition={{ slurm_priority_weight_partition | default(1000) }}
 PriorityWeightQOS={{ slurm_priority_weight_qos | default(10000) }}
 PriorityMaxAge={{ slurm_priority_max_age | default('1-0') }}
 SlurmctldTimeout=120
 SlurmdTimeout=300
 InactiveLimit=0
 KillWait=30
 Waittime=0
 AccountingStorageType={{ slurm_accounting_storage_type }}
 {% if slurm_accounting_storage_type == "accounting_storage/slurmdbd" %}
 AccountingStorageHost={{ slurm_accounting_storage_host }}
 AccountingStoragePort={{ slurm_accounting_storage_port }}
 AccountingStorageEnforce={{ slurm_accounting_storage_enforce | default('associations,limits,qos') }}
 AccountingStorageTRES={{ slurm_accounting_storage_tres | default('cpu,mem,energy,node,billing,fs/disk,pages,vmem,gres/gpu') }}
 {% endif %}
 JobAcctGatherType={{ slurm_job_acct_gather_type | default('jobacct_gather/none') }}
 JobCompType={{ slurm_job_comp_type }}
 SlurmctldDebug=info
 SlurmdDebug=info
 SlurmctldLogFile=/var/log/slurm/slurmctld.log
 SlurmdLogFile=/var/log/slurm/slurmd.log
 {% for node in slurm_nodes if node.managed_state | default('present') == 'present' %}
 NodeName={{ node.name }} NodeAddr={{ node.addr }} CPUs={{ node.cpus }}{% if node.topology | default('') | length > 0 %} {{ node.topology }}{% endif %} RealMemory={{ node.real_memory }}{% if node.gres | default('') | length > 0 %} Gres={{ node.gres }}{% endif %}{% if node.features | default('') | length > 0 %} Feature={{ node.features }}{% endif %} State=UNKNOWN
 {% endfor %}
 {% for partition in slurm_partitions %}
 PartitionName={{ partition.name }} Nodes={{ partition.nodes }} Default={{ partition.default }} MaxTime={{ partition.max_time }} State={{ partition.state }}
 {% endfor %}
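 {#
 Illustrative only, not part of the original template: a sketch of the variable
 shapes the two loops above read. Every name and value below is a hypothetical
 placeholder; only the keys come from the template.

 slurm_nodes:
   - name: node-a            # NodeName
     addr: 192.0.2.11        # NodeAddr (documentation address range)
     cpus: 4
     real_memory: 7800       # RealMemory, in megabytes
     gres: "gpu:1"           # optional; also selects the node in gres.conf
     features: "gpu"         # optional
     managed_state: present  # optional; any other value skips the node
 slurm_partitions:
   - name: all
     nodes: node-a
     default: "YES"          # quoted so YAML does not turn it into a boolean
     max_time: "01:00:00"
     state: UP
 #}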
@@ -0,0 +1,38 @@
 # Managed by Ansible
 # Slurm database daemon configuration
 AuthType=auth/munge
 DbdHost={{ slurmdbd_host }}
 DbdPort={{ slurmdbd_port }}
 SlurmUser={{ slurm_user }}
 DebugLevel=info
 LogFile=/var/log/slurm/slurmdbd.log
 PidFile=/run/slurmdbd.pid
 CommitDelay={{ slurmdbd_commit_delay | default(1) }}
 StorageType={{ slurmdbd_storage_type }}
 StorageHost={{ slurmdbd_storage_host }}
 StoragePort={{ slurmdbd_storage_port }}
 StorageLoc={{ slurmdbd_storage_loc }}
 StorageUser={{ slurmdbd_storage_user }}
 StoragePass={{ slurmdbd_storage_pass }}
 # Retention / purge policy
 PurgeEventAfter={{ slurmdbd_purge_event_after | default('12months') }}
 PurgeJobAfter={{ slurmdbd_purge_job_after | default('12months') }}
 PurgeResvAfter={{ slurmdbd_purge_resv_after | default('12months') }}
 PurgeStepAfter={{ slurmdbd_purge_step_after | default('3months') }}
 PurgeSuspendAfter={{ slurmdbd_purge_suspend_after | default('3months') }}
 PurgeTXNAfter={{ slurmdbd_purge_txn_after | default('12months') }}
 PurgeUsageAfter={{ slurmdbd_purge_usage_after | default('24months') }}
 ArchiveEvents={{ slurmdbd_archive_events | default('no') }}
 ArchiveJobs={{ slurmdbd_archive_jobs | default('no') }}
 ArchiveSteps={{ slurmdbd_archive_steps | default('no') }}
 ArchiveSuspend={{ slurmdbd_archive_suspend | default('no') }}
 ArchiveTXN={{ slurmdbd_archive_txn | default('no') }}
 ArchiveUsage={{ slurmdbd_archive_usage | default('no') }}
Author	SHA1	Message	Date
Mateusz Suski	83877fb598	Document Slurm AI/HPC cluster project	2026-06-04 19:54:43 +00:00
Mateusz Suski	d300d490f5	Add Slurm AI/HPC cluster platform project	2026-06-04 19:42:45 +00:00
@@ -0,0 +1 @@
 Generated backups and reports can be stored here locally. This directory is ignored by git.