Initial CV-aligned infrastructure portfolio

Rework portfolio around Linux operations, Zabbix monitoring, migration validation, and ELK/Grafana log observability. Add AAP-style LVM resize workflow, Zabbix server/proxy/agent automation assets, Linux/AIX monitoring templates, and updated validation CI.
2026-05-04 17:37:24 +00:00
commit 35e6b139fc
114 changed files with 6422 additions and 0 deletions
@@ -0,0 +1,47 @@
 name: ci
 on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
 jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v4
      - name: Install validation tools
        run: |
          python3 -m venv .venv
          . .venv/bin/activate
          pip install --upgrade pip
          pip install ansible ansible-lint
      - name: Migration validation framework
        run: |
          cd professional-infra/migration-validation-framework
          make test
          make demo
      - name: Linux operations automation
        run: |
          . .venv/bin/activate
          cd professional-infra/linux-operations-automation
          # make test already runs the playbook syntax check and ansible-lint
          make test
      - name: Zabbix monitoring and incident response
        run: |
          . .venv/bin/activate
          cd professional-infra/zabbix-monitoring-incident-response
          make test
      - name: Log observability ELK Grafana
        run: |
          cd professional-infra/log-observability-elk-grafana
          make test
@@ -0,0 +1,24 @@
 __pycache__/
 *.pyc
 *.pyo
 # Runtime logs and generated evidence
 *.log
 logs/
 snapshots/
 reports/
 *.html
 migration_report_*.json
 migration_report_*.html
 # Local environment files
 .env
 .env.*
 !.env.example
 .venv/
 venv/
 # OS/editor noise
 .DS_Store
 *.swp
@@ -0,0 +1,61 @@
 # AI Context File - CV-Aligned Portfolio Guide
 ## Positioning
 This repository should support a Linux/Unix Infrastructure Engineer CV. The main story is operational infrastructure work: Linux operations, Zabbix monitoring, migration validation, incident response, and log observability.
 Do not reposition the repo as a generic cloud-native platform portfolio. DevOps side labs can be mentioned, but the main `professional-infra/` section should stay grounded in real operational work.
 ## Current Professional Projects
 ### Linux Operations Automation
 Technology stack: Ansible, Bash, Docker Compose.
 Focus:
 - Linux provisioning, patching, hardening, and decommissioning,
 - service and failure simulation,
 - AAP-style LVM filesystem resize workflow,
 - before/after operational evidence.
 ### Zabbix Monitoring + Incident Response
 Technology stack: Ansible, Zabbix template assets, JSON/YAML-style operational docs.
 Focus:
 - Zabbix server, proxy, and agent automation structure,
 - active/passive proxy design,
 - Linux and AIX OS monitoring templates,
 - maintenance and incident response runbooks,
 - sample check and alert evidence.
 ### Migration Validation Framework
 Technology stack: Python, JSON, HTML reports.
 Focus:
 - pre/post migration snapshots,
 - drift detection,
 - risk assessment,
 - migration evidence reports.
 ### Log Observability ELK/Grafana
 Technology stack: Docker Compose, Elasticsearch, Logstash, Kibana, Grafana, Filebeat.
 Focus:
 - log ingestion and parsing,
 - incident log evidence,
 - local demo observability stack.
 ## Standards
 - Every project should have `make test`.
 - Documentation should include CV relevance and interview talking points.
 - Runtime logs, caches, snapshots, and generated reports should stay out of git.
 - AIX should be represented by templates, samples, and runbooks; do not claim local AIX runtime.
 - AAP should be represented as workflow/job-template documentation plus playbook variables; do not add AWX/AAP runtime unless explicitly requested.
@@ -0,0 +1,39 @@
 # Portfolio Changelog
 ## [1.2.0] - 2026-05-04 - CV-Aligned Infrastructure Portfolio Rework
 ### Changed
 - Reorganized the repository around `professional-infra/` projects aligned with the Linux Engineer CV.
 - Repositioned the root README around Linux/Unix operations, Zabbix monitoring, migration validation, and log observability.
 - Moved the former infrastructure simulator into Linux Operations Automation.
 - Moved the former observability stack into Log Observability ELK/Grafana.
 ### Added
 - Zabbix Monitoring + Incident Response project with Ansible-first server/proxy/agent structure.
 - Linux and AIX Zabbix OS monitoring template assets and sample check payloads.
 - Zabbix proxy design, maintenance, and incident response runbooks.
 - AAP-style LVM filesystem resize workflow in Linux Operations Automation.
 - LVM resize workflow documentation and sample evidence output.
 ### Fixed
 - Updated CI paths for the new `professional-infra/` structure.
 - Updated runbooks and architecture notes to match the CV-aligned portfolio structure.
 - Kept runtime logs, caches, and generated reports out of tracked project evidence.
 ## [1.1.0] - 2026-05-04 - Portfolio Reliability Pass
 ### Changed
 - Restored lightweight CI for Python validation, Ansible syntax/lint checks, and Docker Compose validation.
 - Separated versioned examples from runtime logs, caches, snapshots, and generated reports.
 ### Fixed
 - Hardened migration report generation by escaping untrusted snapshot content.
 - Added missing local configuration scaffolding for the ELK/Grafana stack.
 ## [1.0.0] - 2026-04-29 - Initial Portfolio Release
 ### Added
 - Ansible lifecycle automation examples.
 - Migration validation Python CLI.
 - ELK/Grafana local observability scaffold.
 - Per-project documentation, architecture notes, examples, and scenarios.
@@ -0,0 +1,31 @@
 # CV-Aligned Linux / Unix Infrastructure Portfolio
 This repository maps my Linux/Unix infrastructure experience into practical, reviewable projects: operations automation, monitoring, incident response, migration validation, and log observability. It is intentionally grounded in the kind of work described in my CV: large Linux/Unix estates, Zabbix monitoring, storage and migration work, provisioning, patching, troubleshooting, and operational evidence.
 The repository also leaves room for DevOps side labs, but the main section is professional infrastructure work rather than cloud-native/platform fantasy.
 ## What To Review First
 | Order | Project | CV relevance | Technologies | Validation |
 | --- | --- | --- | --- | --- |
 | 1 | [Linux Operations Automation](professional-infra/linux-operations-automation/) | Linux server deployment, patching, hardening, LVM resize operations, AAP-style automation | Ansible, Bash, Docker Compose | `cd professional-infra/linux-operations-automation && make test` |
 | 2 | [Zabbix Monitoring + Incident Response](professional-infra/zabbix-monitoring-incident-response/) | Zabbix maintenance, proxy topology, active/passive checks, Linux/AIX OS monitoring | Ansible, Zabbix templates, YAML | `cd professional-infra/zabbix-monitoring-incident-response && make test` |
 | 3 | [Migration Validation Framework](professional-infra/migration-validation-framework/) | Pre/post migration validation, reporting, drift detection, evidence generation | Python, JSON, HTML | `cd professional-infra/migration-validation-framework && make test && make demo` |
 | 4 | [Log Observability ELK/Grafana](professional-infra/log-observability-elk-grafana/) | Log ingestion, incident evidence, environment observability | Docker, ELK, Grafana, Filebeat | `cd professional-infra/log-observability-elk-grafana && make test` |
 ## CV Skills To Repo Map
 - **Linux/Unix operations:** Linux Operations Automation, LVM resize workflow, patching, hardening, service checks.
 - **Automation:** Ansible playbooks/roles, Bash simulation scripts, Python validation tooling.
 - **Monitoring:** Zabbix project for OS checks and proxy operations; ELK/Grafana project for log monitoring.
 - **Migration work:** Migration Validation Framework for before/after evidence and drift reports.
 - **Incident response:** Zabbix runbooks, ELK incident simulation, failure simulation examples.
 - **DevOps practices:** lightweight CI, Git workflows, repeatable `make test` targets, containerized lab components.
 ## Professional Infrastructure Projects
 The `professional-infra/` directory contains the projects that should be read as direct support for the CV. Each project includes a reviewer-focused README, validation command, examples, and interview talking points.
 ## DevOps Side Labs
 Future side projects can live under `devops-labs/` when they are ready. Good candidates are K3s, CI/CD workflow demos, cloud experiments, Wazuh, or other after-hours labs. Empty placeholder project directories are intentionally avoided.
@@ -0,0 +1,64 @@
 # Architecture Overview
 ## Portfolio Shape
 This repository is organized around professional infrastructure work that maps to the CV:
 ```text
 professional-infra/
  linux-operations-automation/
  zabbix-monitoring-incident-response/
  migration-validation-framework/
  log-observability-elk-grafana/
 devops-labs/
  future side projects only when ready
 ```
 ## Project Roles
 ### Linux Operations Automation
 Operational automation for Linux server work:
 - provisioning and baseline configuration,
 - patching and hardening,
 - service/failure simulation,
 - AAP-style LVM filesystem resize workflow with before/after evidence.
 ### Zabbix Monitoring + Incident Response
 Simple checks and OS monitoring:
 - Zabbix server/proxy/agent automation structure,
 - active and passive proxy design,
 - Linux and AIX monitoring templates,
 - maintenance and incident response runbooks.
 ### Migration Validation Framework
 Evidence tooling for platform/storage migrations:
 - before/after snapshot collection,
 - drift detection,
 - risk assessment,
 - JSON and HTML reports.
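The drift detection step can be pictured as a comparison of two snapshot dictionaries. The following is a minimal sketch, not the framework's actual implementation: the function name `detect_drift`, the flat key/value snapshot shape, and the sample keys are all assumptions made for illustration.

```python
from typing import Any, Dict


def detect_drift(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Compare two flat snapshots and group the differences by kind."""
    drift: Dict[str, Dict[str, Any]] = {"added": {}, "removed": {}, "changed": {}}
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            drift["added"][key] = after[key]
        elif key not in after:
            drift["removed"][key] = before[key]
        elif before[key] != after[key]:
            drift["changed"][key] = {"before": before[key], "after": after[key]}
    return drift


# Hypothetical snapshot content, for illustration only.
before = {"kernel": "5.14.0", "mount:/data": "xfs", "service:nginx": "active"}
after = {"kernel": "5.14.0", "mount:/data": "ext4", "service:chronyd": "active"}

report = detect_drift(before, after)
print(report)
```

Grouping results into added, removed, and changed keeps the later risk assessment simple, because each group can be weighted separately.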
 ### Log Observability ELK/Grafana
 Log monitoring and incident evidence:
 - Filebeat ingestion,
 - Logstash parsing,
 - Elasticsearch storage,
 - Kibana/Grafana review surfaces,
 - incident log simulation.
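To make the parsing stage concrete, here is a small Python sketch of turning one simulation log line into a structured event. It is an illustration only: the real parsing in this project is done by Logstash, and the field names (`@timestamp`, `message`, `probe_failed`) and the `parse_line` helper are assumptions. The line format matches the `YYYY-MM-DD HH:MM:SS - message` layout used in the failure simulation example output.

```python
import re
from datetime import datetime
from typing import Optional

# Matches lines such as "2026-04-29 02:13:44 - Health probe failed: ..."
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) - (?P<message>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return a structured event, or None when the line does not match."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match["timestamp"], "%Y-%m-%d %H:%M:%S")
    message = match["message"]
    return {
        "@timestamp": timestamp.isoformat(),
        "message": message,
        # Hypothetical enrichment field for filtering incident evidence.
        "probe_failed": message.startswith("Health probe failed"),
    }


event = parse_line("2026-04-29 02:13:44 - Health probe failed: http://web01/health returned 503")
print(event)
```

Lines that do not match return `None` instead of raising, so a malformed line cannot stop ingestion of the rest of the file.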
 ## Design Principles
 - Keep implemented work separate from roadmap ideas.
 - Prefer reviewable automation and evidence over overbuilt local labs.
 - Make every project independently validatable.
 - Treat Zabbix and ELK/Grafana as complementary monitoring tools:
  - Zabbix for simple checks and OS health,
  - ELK/Grafana for logs and observability evidence.
@@ -0,0 +1,92 @@
 # Portfolio Runbooks
 These runbooks are scoped to the portfolio version of the projects. They favor fast validation and reviewable evidence over full production operations.
 ## Linux Operations Automation
 Validate the implemented Ansible core:
 ```bash
 cd professional-infra/linux-operations-automation
 make test
 ```
 Run a safe failure simulation without live SSH hosts:
 ```bash
 make demo
 ```
 Review the AAP-style LVM workflow:
 ```bash
 ansible-playbook -i inventory/hosts.ini playbooks/lvm_resize.yml --syntax-check
 cat docs/aap_lvm_resize_workflow.md
 ```
 Run playbooks against your own lab hosts after updating `inventory/hosts.ini`:
 ```bash
 make run
 make patch
 make harden
 make decommission
 ```
 ## Zabbix Monitoring + Incident Response
 Validate Zabbix playbooks and templates:
 ```bash
 cd professional-infra/zabbix-monitoring-incident-response
 make test
 ```
 Review proxy and OS monitoring operations:
 ```bash
 cat docs/proxy-design.md
 cat docs/maintenance-runbook.md
 cat docs/incident-response-runbook.md
 ```
 Linux and AIX checks are represented as templates and sample data. AIX is not run locally.
 ## Migration Validation Framework
 Validate code and parser/report behavior:
 ```bash
 cd professional-infra/migration-validation-framework
 make test
 ```
 Run the included before/after comparison:
 ```bash
 make demo
 ```
 The demo intentionally reports `FAIL` to show a high-risk migration finding.
 ## Log Observability ELK/Grafana
 Validate Docker Compose and required bind-mounted configs:
 ```bash
 cd professional-infra/log-observability-elk-grafana
 make test
 ```
 Generate sample incident logs without starting the full stack:
 ```bash
 make demo
 ```
 Start the full local demo stack with Docker, then stop it when you are finished:
 ```bash
 make run
 make down
 ```
@@ -0,0 +1,14 @@
 ---
 # Ansible-lint configuration
 skip_list:
  - 'role-name'
  - 'name[casing]'
  - 'line-too-long'
 exclude_paths:
  - .git
  - .gitea
  - molecule/
  - molecule/default/tests/
  - scenarios/
@@ -0,0 +1,95 @@
 # Linux Operations Automation Makefile
 .PHONY: help test run demo patch harden decommission lvm-check up down status logs validate clean lint scale-up-web scale-up-db scale-down-web scale-down-db fail-network fail-disk fail-service fail-node scenario-scaling help-scaling help-failure
 help: ## Show this help message
 	@echo "Linux Operations Automation"
 	@echo ""
 	@echo "Available commands:"
 	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "  %-18s %s\n", $$1, $$2}'
 test: ## Run offline validation checks
 	ansible-playbook -i inventory/hosts.ini --syntax-check playbooks/*.yml
 	ansible-lint
 run: ## Run provisioning against the configured inventory
 	ansible-playbook -i inventory/hosts.ini playbooks/provision.yml
 demo: ## Run a safe local demonstration without requiring live SSH hosts
 	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh service 5 web
 patch: ## Apply patching workflow against the configured inventory
 	ansible-playbook -i inventory/hosts.ini playbooks/patch.yml
 harden: ## Apply hardening workflow against the configured inventory
 	ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml
 decommission: ## Run decommissioning workflow against the configured inventory
 	ansible-playbook -i inventory/hosts.ini playbooks/decommission.yml
 lvm-check: ## Validate the AAP-style LVM resize workflow
 	ansible-playbook -i inventory/hosts.ini --syntax-check playbooks/lvm_resize.yml
 up: ## Start the optional local container scaffold
 	docker compose up -d
 down: ## Stop the optional local container scaffold
 	docker compose down
 status: ## Show local scaffold status and inventory hosts
 	docker compose ps
 	ansible -i inventory/hosts.ini --list-hosts all || echo "Inventory check failed"
 logs: ## Show local scaffold logs
 	docker compose logs -f --tail=100
 validate: ## Run all offline validation checks
 	$(MAKE) test
 	docker compose config --quiet
 clean: ## Clean up generated local logs and reports
 	rm -f logs/*.log reports/*.txt
 lint: ## Lint Ansible content
 	ansible-lint
 scale-up-web: ## Scale up web servers in simulation mode (usage: make scale-up-web COUNT=2)
 	SIMULATION_MODE=true bash ./scripts/simulate_scaling.sh up $(or $(COUNT),1) web
 scale-up-db: ## Scale up database servers in simulation mode (usage: make scale-up-db COUNT=1)
 	SIMULATION_MODE=true bash ./scripts/simulate_scaling.sh up $(or $(COUNT),1) db
 scale-down-web: ## Scale down web servers in simulation mode (usage: make scale-down-web COUNT=1)
 	SIMULATION_MODE=true bash ./scripts/simulate_scaling.sh down $(or $(COUNT),1) web
 scale-down-db: ## Scale down database servers in simulation mode (usage: make scale-down-db COUNT=1)
 	SIMULATION_MODE=true bash ./scripts/simulate_scaling.sh down $(or $(COUNT),1) db
 fail-network: ## Simulate network failure safely (usage: make fail-network DURATION=60)
 	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh network $(or $(DURATION),60)
 fail-disk: ## Simulate disk pressure safely (usage: make fail-disk DURATION=120)
 	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh disk $(or $(DURATION),120)
 fail-service: ## Simulate service failures safely (usage: make fail-service DURATION=30)
 	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh service $(or $(DURATION),30)
 fail-node: ## Simulate node failure safely (usage: make fail-node DURATION=300)
 	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh node $(or $(DURATION),300)
 scenario-scaling: ## Run scaling event syntax validation
 	ansible-playbook -i inventory/hosts.ini --syntax-check scenarios/scaling_event.yml
 help-scaling: ## Show scaling-related commands
 	@echo "Scaling Commands:"
 	@echo "  make scale-up-web COUNT=2"
 	@echo "  make scale-up-db COUNT=1"
 	@echo "  make scale-down-web COUNT=1"
 	@echo "  make scale-down-db COUNT=1"
 help-failure: ## Show failure simulation commands
 	@echo "Failure Simulation Commands:"
 	@echo "  make fail-network DURATION=60"
 	@echo "  make fail-disk DURATION=120"
 	@echo "  make fail-service DURATION=30"
 	@echo "  make fail-node DURATION=300"
@@ -0,0 +1,92 @@
 # Linux Operations Automation
 ## Problem
 Linux infrastructure work often starts as ticket-driven operations: deploy a server, patch it, harden SSH, check a failed service, expand a filesystem, and leave evidence that the change was safe. These tasks need automation that is readable, repeatable, and cautious enough for production-style environments.
 ## CV Relevance
 This project maps directly to Linux/Unix operations, server deployment, patching, troubleshooting, and storage/LVM work from enterprise infrastructure environments. The LVM resize workflow is written in an AAP-style shape: explicit survey variables, dry-run defaults, pre-checks, resize actions, and before/after evidence.
 ## What This Project Demonstrates
 - Ansible playbooks for common Linux node lifecycle operations.
 - Role-based task organization with clear defaults and handlers.
 - LVM filesystem expansion workflow suitable for Ansible Automation Platform job templates.
 - Safe simulation scripts for failure, service, and scaling exercises.
 - Reviewer-friendly evidence in `examples/` without relying on a live enterprise lab.
 ## Architecture
 ```text
 Operator -> Make targets -> Ansible inventory -> Playbooks/Roles -> Linux nodes
                         -> Simulation scripts -> Example evidence
                         -> AAP-style LVM workflow -> Before/after report
 ```
 Core components:
 - `inventory/hosts.ini` defines realistic host groups.
 - `playbooks/` contains provision, patch, harden, and decommission workflows.
 - `playbooks/lvm_resize.yml` contains the storage expansion workflow.
 - `roles/` contains the implemented Ansible roles.
 - `scripts/` provides safe simulation helpers.
 - `docker-compose.yml` is a lightweight local scaffold, not a production lab.
 ## Quickstart
 ```bash
 cd professional-infra/linux-operations-automation
 make test
 make demo
 ```
 `make test` runs offline syntax and lint checks. `make demo` runs a safe simulation with `SIMULATION_MODE=true` and does not require reachable SSH hosts.
 To run playbooks against real or lab hosts, update `inventory/hosts.ini` and run:
 ```bash
 make run
 make patch
 make harden
 make decommission
 ```
 Review the LVM workflow:
 ```bash
 ansible-playbook -i inventory/hosts.ini playbooks/lvm_resize.yml --syntax-check
 cat docs/aap_lvm_resize_workflow.md
 ```
 ## Validation
 ```bash
 make test
 docker compose config --quiet
 ```
 The optional compose scaffold can be started and stopped with:
 ```bash
 make up
 make down
 ```
 ## Example Output
 Sample evidence is available in [examples/patch-output.txt](examples/patch-output.txt), [examples/failure-simulation.txt](examples/failure-simulation.txt), and [examples/lvm-resize-output.txt](examples/lvm-resize-output.txt).
 ## Interview Talking Points
 - How to make LVM resize automation safe with dry-run defaults and explicit approval.
 - Why before/after evidence matters for storage and filesystem changes.
 - How Ansible roles keep Linux baseline operations repeatable.
 - Where AAP surveys and job templates reduce ticket handling errors.
 ## Roadmap
 - Add complete service roles for application deployment examples.
 - Add backup, security scan, and disaster recovery playbooks.
 - Add a richer local lab with SSH-ready containers.
 - Add cloud or Kubernetes deployment variants.
@@ -0,0 +1,43 @@
 # Vault Configuration Guide
 ## Overview
 The current portfolio demo does not require Ansible Vault for `make test` or `make demo`. Secrets are intentionally kept out of the main validation path so reviewers can run the project offline.
 Use Vault only when extending the project to manage real hosts or credentials.
 ## Recommended Pattern
 1. Start from the example file:
 ```bash
 cp group_vars/vault.example.yml group_vars/vault.yml
 ```
 2. Replace placeholder values locally.
 3. Encrypt the file before using it with real systems:
 ```bash
 ansible-vault encrypt group_vars/vault.yml
 ```
 4. Do not commit real secret values. Keep `group_vars/vault.example.yml` as the committed reference.
 ## Running With Vault
 ```bash
 ansible-playbook -i inventory/hosts.ini playbooks/provision.yml --ask-vault-pass
 ```
 or:
 ```bash
 ansible-playbook -i inventory/hosts.ini playbooks/provision.yml --vault-password-file ~/.vault_pass.txt
 ```
 ## Notes
 - The delivered playbooks do not import a vault file by default.
 - Add `vars_files` only in an environment-specific branch or private overlay.
 - Prefer a secret manager or automation controller for production use.
@@ -0,0 +1,5 @@
 [defaults]
 roles_path = ./roles
 inventory = ./inventory/hosts.ini
 host_key_checking = False
 retry_files_enabled = False
@@ -0,0 +1,28 @@
 services:
  web:
    image: debian:12-slim
    command: ["sleep", "infinity"]
    networks:
      infra_sim:
        ipv4_address: 172.20.0.11
  db:
    image: debian:12-slim
    command: ["sleep", "infinity"]
    networks:
      infra_sim:
        ipv4_address: 172.20.0.21
  lb:
    image: debian:12-slim
    command: ["sleep", "infinity"]
    networks:
      infra_sim:
        ipv4_address: 172.20.0.31
 networks:
  infra_sim:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24
@@ -0,0 +1,45 @@
 # AAP-Style LVM Resize Workflow
 ## Purpose
 This workflow shows how a routine storage ticket can be converted into a controlled Ansible Automation Platform job. It is intentionally conservative: dry-run is the default, required variables are explicit, and every run produces before/after evidence.
 ## Suggested Job Template
 - Name: `Linux - LVM Filesystem Resize`
 - Inventory: Linux production or pre-production inventory
 - Playbook: `playbooks/lvm_resize.yml`
 - Credentials: privileged Linux automation credential
 - Privilege escalation: enabled
 - Default extra vars:
 ```yaml
 lvm_dry_run: true
 lvm_resize_filesystem: true
 ```
 ## Suggested Survey Variables
 | Variable | Example | Required | Notes |
 | --- | --- | --- | --- |
 | `lvm_vg_name` | `vg_app` | yes | Target volume group. |
 | `lvm_lv_name` | `lv_data` | yes | Target logical volume. |
 | `lvm_mountpoint` | `/data` | yes | Filesystem mountpoint to validate before/after. |
 | `lvm_size_request` | `+20G` | yes | Passed to `lvextend -L`; use explicit growth syntax for tickets. |
 | `lvm_dry_run` | `true` | yes | Start with `true`; switch to `false` after evidence review. |
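The survey rules above can be checked before a job is launched. The sketch below is one possible pre-check written in Python for illustration; it is not part of the playbook. The `validate_survey` helper is hypothetical, and the size pattern is deliberately narrower than what `lvextend -L` accepts: it assumes tickets only use whole-number growth in M, G, or T.

```python
import re
from typing import List

REQUIRED = ("lvm_vg_name", "lvm_lv_name", "lvm_mountpoint", "lvm_size_request", "lvm_dry_run")

# Assumption: only explicit growth such as "+20G" is allowed for tickets.
GROWTH_PATTERN = re.compile(r"^\+\d+[MGT]$")


def validate_survey(survey: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the survey is valid."""
    errors = [f"missing required variable: {name}" for name in REQUIRED if name not in survey]
    size = survey.get("lvm_size_request")
    if size is not None and not GROWTH_PATTERN.match(str(size)):
        errors.append(f"lvm_size_request must use growth syntax like +20G, got {size!r}")
    mountpoint = survey.get("lvm_mountpoint")
    if mountpoint is not None and not str(mountpoint).startswith("/"):
        errors.append("lvm_mountpoint must be an absolute path")
    if "lvm_dry_run" in survey and not isinstance(survey["lvm_dry_run"], bool):
        errors.append("lvm_dry_run must be a boolean")
    return errors


survey = {
    "lvm_vg_name": "vg_app",
    "lvm_lv_name": "lv_data",
    "lvm_mountpoint": "/data",
    "lvm_size_request": "+20G",
    "lvm_dry_run": True,
}
print(validate_survey(survey))
```

Rejecting an absolute size such as `20G` is a design choice: growth syntax keeps the request aligned with the ticket and avoids an accidental shrink target.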
 ## Safety Notes
 - Run with `lvm_dry_run=true` first and attach output to the ticket.
 - Confirm backup/snapshot status before actual resize.
 - Confirm filesystem type; this workflow supports XFS and ext filesystems.
 - Keep requested size aligned with the ticket approval.
 - Use maintenance windows for critical systems.
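The dry-run behaviour can be sketched as "build the command, report it, and only execute when approved". The Python below is a hedged illustration of that idea and never runs anything. The `plan_resize` helper is hypothetical, and the use of `lvextend --resizefs` to grow the filesystem together with the logical volume is an assumption about how `lvm_resize_filesystem` might be applied, not a statement about the playbook's tasks.

```python
import shlex


def plan_resize(vg: str, lv: str, size_request: str,
                resize_filesystem: bool = True, dry_run: bool = True) -> dict:
    """Build the resize plan as evidence; this sketch never executes the command."""
    lv_path = f"/dev/{vg}/{lv}"
    command = ["lvextend", "-L", size_request]
    if resize_filesystem:
        # Assumption: the filesystem is grown in the same step as the LV.
        command.append("--resizefs")
    command.append(lv_path)
    return {
        "target": lv_path,
        "dry_run": dry_run,
        "command": shlex.join(command),
        "will_execute": not dry_run,
    }


plan = plan_resize("vg_app", "lv_data", "+20G")
print(plan["command"])
```

Attaching the planned command to the ticket during the dry run gives the approver the exact action that a later run with `lvm_dry_run=false` would perform.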
 ## Evidence Captured
 - `lsblk --fs`
 - `pvs`, `vgs`, `lvs`
 - `df -hT <mountpoint>` before and after
 - target LV path and filesystem type
 - dry-run flag and requested size
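Before/after `df -hT` output is easier to compare once it is split into named fields. The following Python sketch shows one way to do that. The helper names are hypothetical, and the sample lines follow the layout of the evidence in `examples/lvm-resize-output.txt`.

```python
from typing import List

DF_FIELDS = ("filesystem", "type", "size", "used", "avail", "use_percent", "mounted_on")


def parse_df_line(line: str) -> dict:
    """Split one data line of `df -hT` output into named fields."""
    return dict(zip(DF_FIELDS, line.split(None, 6)))


def size_changed(before_df: List[str], after_df: List[str]) -> bool:
    """Compare the size column of the first data line (index 0 is the header)."""
    return parse_df_line(before_df[1])["size"] != parse_df_line(after_df[1])["size"]


header = "Filesystem                 Type  Size  Used Avail Use% Mounted on"
data = "/dev/mapper/vg_app-lv_data xfs   100G   83G   17G  84% /data"

before_df = [header, data]
after_df = [header, data]  # identical in a dry run, because nothing was resized

print(parse_df_line(data))
print(size_changed(before_df, after_df))
```

In a dry run the two captures are identical, so an unchanged size is the expected result there and only indicates a problem after a real resize.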
@@ -0,0 +1,30 @@
 # Linux Operations Automation Architecture
 ## Components
 - Operator interface: `make` targets and direct Ansible commands.
 - Inventory: static host groups in `inventory/hosts.ini`.
 - Automation: lifecycle playbooks in `playbooks/`.
 - Simulation scripts: controlled failure and scaling events in `scripts/`.
 - Evidence: logs, reports, scenario notes, and examples.
 ## Data Flow
 ```text
 Operator
  -> Make target or shell script
  -> Ansible inventory
  -> lifecycle playbook
  -> managed Linux node
  -> log/report artifact
 ```
 Failure drills follow a parallel flow:
 ```text
 Operator -> simulate_failure.sh -> target node/service -> health check -> patch/hardening playbook -> evidence
 ```
 ## Notes
 The project favors explicit playbooks over hidden orchestration so the operational intent is visible during review. In a production implementation, the same workflows would typically run from a CI runner or automation controller with credentials supplied by a secret manager.
@@ -0,0 +1,8 @@
 2026-04-29 02:13:41 - Starting failure simulation: service 30 web
 2026-04-29 02:13:41 - Simulating service failures on containers: web
 2026-04-29 02:13:42 - Stopping services in container enterprise-web-1
 2026-04-29 02:13:44 - Health probe failed: http://web01/health returned 503
 2026-04-29 02:14:12 - Cleaning up failure simulation
 2026-04-29 02:14:13 - Restarted nginx in enterprise-web-1
 2026-04-29 02:14:18 - Health probe recovered: http://web01/health returned 200
 2026-04-29 02:14:18 - Failure simulation completed successfully
@@ -0,0 +1,19 @@
 TASK [Report LVM resize evidence] **********************************************
 ok: [app01] => {
  "msg": {
    "host": "app01",
    "dry_run": true,
    "target": "/dev/vg_app/lv_data",
    "mountpoint": "/data",
    "requested_size": "+20G",
    "filesystem_type": "xfs",
    "before_df": [
      "Filesystem                 Type  Size  Used Avail Use% Mounted on",
      "/dev/mapper/vg_app-lv_data xfs   100G   83G   17G  84% /data"
    ],
    "after_df": [
      "Filesystem                 Type  Size  Used Avail Use% Mounted on",
      "/dev/mapper/vg_app-lv_data xfs   100G   83G   17G  84% /data"
    ]
  }
 }
@@ -0,0 +1,33 @@
 PLAY [Apply Security Patches and Updates] **************************************
 TASK [Update package cache] *****************************************************
 changed: [web01]
 changed: [db01]
 ok: [lb01]
 TASK [Check for available updates] **********************************************
 ok: [web01] => {"stdout": "9"}
 ok: [db01] => {"stdout": "4"}
 ok: [lb01] => {"stdout": "0"}
 TASK [Apply security updates only] **********************************************
 changed: [web01]
 changed: [db01]
 ok: [lb01]
 TASK [Verify critical services] *************************************************
 ok: [web01] => (item=systemd-journald)
 ok: [web01] => (item=cron)
 ok: [db01] => (item=systemd-journald)
 ok: [lb01] => (item=cron)
 PLAY RECAP *********************************************************************
 web01 : ok=19 changed=6 unreachable=0 failed=0 skipped=2 rescued=0 ignored=1
 db01  : ok=18 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=1
 lb01  : ok=15 changed=1 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0
 Patch report
 Status: SUCCESS
 Window: 02:00-04:00 UTC
 Reboot required: false
 Notification: infra-team@example.com
@@ -0,0 +1,20 @@
 ---
 # Group variables for all hosts
 # SSH Configuration
 ssh_config:
  port: 22
  max_auth_tries: 3
  alive_interval: 300
 # Firewall defaults
 firewall_enabled: true
 firewall_default_policy: deny
 # Patching defaults
 patch_enabled: true
 enforce_patch_window: true
 # Services monitoring
 enable_monitoring: false
 enable_health_checks: true
@@ -0,0 +1,9 @@
 ---
 # Database servers group configuration
 db_type: postgresql
 db_port: 5432
 db_backup_enabled: true
 db_backup_path: /var/backups/database
 # Database user (use vault for production)
 db_admin_user: postgres
@@ -0,0 +1,10 @@
 ---
 # Load balancers group configuration
 lb_type: haproxy
 lb_port: 443
 lb_stats_port: 8404
 lb_stats_enabled: true
 # Frontend configuration
 frontend_host: "0.0.0.0"
 frontend_port: 80
@@ -0,0 +1,10 @@
 ---
 # Monitoring servers group configuration
 monitoring_type: prometheus
 monitoring_port: 9090
 monitoring_retention: 30d
 monitoring_scrape_interval: 15s
 # Grafana configuration
 grafana_port: 3000
 grafana_admin_password: "{{ vault_grafana_password }}"
@@ -0,0 +1,8 @@
 ---
 # Example variables for secret values.
 # Copy these keys into an Ansible Vault encrypted file when real secrets are needed.
 admin_password: "replace-with-vault-managed-value"
 db_root_password: "replace-with-vault-managed-value"
 vault_grafana_password: "replace-with-vault-managed-value"
 ssh_key_passphrase: "replace-with-vault-managed-value"
@@ -0,0 +1,11 @@
 ---
 # Webservers group configuration
 webserver_type: nginx
 http_port: 80
 https_port: 443
 health_check_path: /health
 # Application configuration
 app_name: "{{ group_names[0] | default('app') }}"
 app_user: "{{ admin_user }}"
 app_group: "{{ admin_user }}"
@@ -0,0 +1,35 @@
 [webservers]
 web01 ansible_host=172.20.0.11 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
 web02 ansible_host=172.20.0.12 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
 web03 ansible_host=172.20.0.13 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
 [databases]
 db01 ansible_host=172.20.0.21 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
 db02 ansible_host=172.20.0.22 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
 [loadbalancers]
 lb01 ansible_host=172.20.0.31 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
 [monitoring]
 mon01 ansible_host=172.20.0.41 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
 [all:vars]
 ansible_python_interpreter=/usr/bin/python3
 ansible_ssh_common_args='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
 ansible_connection=ssh
 # Named deploy_environment because "environment" is a reserved Ansible keyword.
 [webservers:vars]
 node_type=web
 deploy_environment=production
 [databases:vars]
 node_type=database
 deploy_environment=production
 [loadbalancers:vars]
 node_type=loadbalancer
 deploy_environment=production
 [monitoring:vars]
 node_type=monitoring
 deploy_environment=production
@@ -0,0 +1,24 @@
 ---
 # Molecule converge playbook - applies roles to test them
 - name: Converge
  hosts: all
  become: true
  gather_facts: true
  pre_tasks:
    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 3600
      when: ansible_os_family == "Debian"
  roles:
    - role: base_provision
    - role: hardening
    - role: patching
  post_tasks:
    - name: Print Ansible facts
      ansible.builtin.debug:
        var: ansible_facts
@@ -0,0 +1,15 @@
 ---
 # Molecule destroy playbook
 - name: Destroy
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Destroy molecule containers
      community.docker.docker_container:
        name: "{{ item }}"
        state: absent
        force_kill: true
      loop: "{{ molecule_yml.platforms | map(attribute='name') | list }}"
      register: destroy_result
      ignore_errors: true
@@ -0,0 +1,31 @@
 ---
 # Molecule configuration for Ansible role testing
 driver:
  name: docker
 platforms:
  - name: ubuntu-22.04
    image: geerlingguy/docker-ubuntu2204-ansible:latest
    pre_build_image: true
    privileged: true
    volumes:
      - /sys/fs/cgroup:/sys/fs/cgroup:rw
 provisioner:
  name: ansible
  config_options:
    defaults:
      gathering: smart
      fact_caching: jsonfile
      fact_caching_connection: /tmp/ansible_facts
      fact_caching_timeout: 3600
      deprecation_warnings: false
 verifier:
  name: ansible
  directory: molecule/default/tests
 lint: |
  yamllint .
  ansible-lint
@@ -0,0 +1,32 @@
 ---
 # Molecule verify playbook - runs tests to verify roles
 - name: Verify
  hosts: all
  gather_facts: false
  tasks:
    - name: Check if base OS packages are installed
      ansible.builtin.shell: dpkg -l | grep -E '(curl|wget|vim|htop)'
      register: package_check
      changed_when: false
      failed_when: package_check.rc not in [0, 1]
    - name: Check SSH configuration
      ansible.builtin.stat:
        path: /etc/ssh/sshd_config
      register: ssh_config_stat
      failed_when: not ssh_config_stat.stat.exists
    - name: Check firewall status
      ansible.builtin.shell: ufw status | grep -q active
      register: firewall_check
      changed_when: false
      failed_when: false
    - name: Verify admin user exists
      ansible.builtin.getent:
        database: passwd
        key: infra-admin
      failed_when: false
    - name: Print verification results
      ansible.builtin.debug:
        msg: "Role verification completed"
@@ -0,0 +1,34 @@
 ---
 - name: Decommission Enterprise Infrastructure Nodes
  hosts: all
  become: true
  gather_facts: true
  pre_tasks:
    - name: Confirm decommissioning
      ansible.builtin.pause:
        prompt: |
          WARNING: This will decommission {{ inventory_hostname }}
          Backup Data: {{ backup_data }}
          Export Config: {{ export_config }}
          Press ENTER to continue or Ctrl+C to cancel
    - name: Display decommissioning information
      ansible.builtin.debug:
        msg: |
          Decommissioning {{ inventory_hostname }}
          Auto Shutdown: {{ auto_shutdown }}
          Backup Enabled: {{ backup_data }}
  roles:
    - role: decommission
      tags: ['decommission', 'cleanup']
  post_tasks:
    - name: Display decommissioning summary
      ansible.builtin.debug:
        msg: |
          Decommissioning completed!
          Host: {{ inventory_hostname }}
          Backup Location: /var/backups/decommission-{{ ansible_date_time.iso8601 }}/
@@ -0,0 +1,124 @@
 ---
 - name: Harden Enterprise Infrastructure Nodes
  hosts: all
  become: true
  gather_facts: true
  pre_tasks:
    - name: Validate hardening prerequisites
      ansible.builtin.assert:
        that:
          - ansible_os_family == "Debian"
          - cis_level in [1, 2]
        fail_msg: "Invalid hardening configuration"
    - name: Display hardening information
      ansible.builtin.debug:
        msg: |
          Hardening {{ inventory_hostname }}
          CIS Level: {{ cis_level }}
          Disable Root Login: {{ disable_root_login }}
  roles:
    - role: hardening
      tags: ['hardening', 'security']
  post_tasks:
    - name: Display hardening summary
      ansible.builtin.debug:
        msg: |
          Hardening completed successfully!
          Host: {{ inventory_hostname }}
      when: ansible_os_family == "Debian"
    - name: Configure auditd
      when: auditd_enabled
      block:
        - name: Install auditd
          ansible.builtin.apt:
            name: auditd
            state: present
          when: ansible_os_family == "Debian"
        - name: Configure audit rules
          ansible.builtin.template:
            src: templates/audit.rules.j2
            dest: /etc/audit/rules.d/hardening.rules
            mode: '0644'
        - name: Enable auditd service
          ansible.builtin.service:
            name: auditd
            state: started
            enabled: true
    - name: Configure AppArmor
      when: apparmor_enabled and ansible_os_family == "Debian"
      block:
        - name: Install apparmor
          ansible.builtin.apt:
            name: apparmor
            state: present
          when: ansible_os_family == "Debian"
        - name: Enable apparmor service
          ansible.builtin.service:
            name: apparmor
            state: started
            enabled: true
    - name: Configure sysctl hardening
      ansible.posix.sysctl:
        name: "{{ item.key }}"
        value: "{{ item.value }}"
        state: present
        reload: true
      loop:
        - { key: 'net.ipv4.ip_forward', value: '0' }
        - { key: 'net.ipv4.conf.all.send_redirects', value: '0' }
        - { key: 'net.ipv4.conf.default.send_redirects', value: '0' }
        - { key: 'net.ipv4.tcp_syncookies', value: '1' }
        - { key: 'net.ipv4.icmp_echo_ignore_broadcasts', value: '1' }
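    - name: Set secure file permissions
      ansible.builtin.file:
        path: "{{ item.path }}"
        mode: "{{ item.mode }}"
        owner: root
        group: "{{ item.group }}"
      loop:
        # Shadow files hold password hashes and must not be world-readable
        - { path: /etc/passwd, mode: '0644', group: root }
        - { path: /etc/group, mode: '0644', group: root }
        - { path: /etc/shadow, mode: '0640', group: shadow }
        - { path: /etc/gshadow, mode: '0640', group: shadow }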
    - name: Lock inactive user accounts
      ansible.builtin.command: usermod -L "{{ item }}"
      loop: "{{ inactive_users | default([]) }}"
      changed_when: false
    - name: Set hard open-file limit (nofile) for all users
      community.general.pam_limits:
        domain: '*'
        limit_type: hard
        limit_item: nofile
        value: 1024
    - name: Generate hardening report
      ansible.builtin.template:
        src: templates/hardening_report.j2
        dest: "/var/log/hardening_report_{{ ansible_date_time.iso8601 }}.log"
        mode: '0644'
  handlers:
    - name: restart sshd
      ansible.builtin.service:
        name: ssh
        state: restarted
    - name: restart auditd
      ansible.builtin.service:
        name: auditd
        state: restarted
      when: auditd_enabled
@@ -0,0 +1,149 @@
 ---
 - name: AAP-style LVM filesystem resize workflow
  hosts: all
  become: true
  gather_facts: true
  vars:
    lvm_dry_run: true
    lvm_vg_name: ""
    lvm_lv_name: ""
    lvm_mountpoint: ""
    lvm_size_request: "+10G"
    lvm_resize_filesystem: true
  pre_tasks:
    - name: Validate required survey variables
      ansible.builtin.assert:
        that:
          - lvm_vg_name | length > 0
          - lvm_lv_name | length > 0
          - lvm_mountpoint | length > 0
          - lvm_size_request | length > 0
        fail_msg: "Required variables: lvm_vg_name, lvm_lv_name, lvm_mountpoint, lvm_size_request"
  tasks:
    - name: Capture block device layout before resize
      ansible.builtin.command:
        argv:
          - lsblk
          - --fs
      register: lvm_lsblk_before
      changed_when: false
    - name: Capture physical volumes before resize
      ansible.builtin.command:
        argv:
          - pvs
          - --noheadings
          - --units
          - g
      register: lvm_pvs_before
      changed_when: false
    - name: Capture volume groups before resize
      ansible.builtin.command:
        argv:
          - vgs
          - --noheadings
          - --units
          - g
      register: lvm_vgs_before
      changed_when: false
    - name: Capture logical volumes before resize
      ansible.builtin.command:
        argv:
          - lvs
          - --noheadings
          - --units
          - g
      register: lvm_lvs_before
      changed_when: false
    - name: Capture filesystem usage before resize
      ansible.builtin.command:
        argv:
          - df
          - -hT
          - "{{ lvm_mountpoint }}"
      register: lvm_df_before
      changed_when: false
    - name: Validate target logical volume exists
      ansible.builtin.command:
        argv:
          - lvs
          - "/dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
      register: lvm_target_check
      changed_when: false
    - name: Show dry-run resize command
      ansible.builtin.debug:
        msg: "DRY RUN: would run lvextend -L {{ lvm_size_request }} /dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
      when: lvm_dry_run | bool
    - name: Extend logical volume
      ansible.builtin.command:
        argv:
          - lvextend
          - -L
          - "{{ lvm_size_request }}"
          - "/dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
      register: lvm_lvextend_result
      changed_when: true
      when: not (lvm_dry_run | bool)
    - name: Detect filesystem type
      ansible.builtin.command:
        argv:
          - findmnt
          - -n
          - -o
          - FSTYPE
          - "{{ lvm_mountpoint }}"
      register: lvm_fstype
      changed_when: false
    - name: Resize XFS filesystem
      ansible.builtin.command:
        argv:
          - xfs_growfs
          - "{{ lvm_mountpoint }}"
      changed_when: true
      when:
        - not (lvm_dry_run | bool)
        - lvm_resize_filesystem | bool
        - lvm_fstype.stdout == "xfs"
    - name: Resize ext filesystem
      ansible.builtin.command:
        argv:
          - resize2fs
          - "/dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
      changed_when: true
      when:
        - not (lvm_dry_run | bool)
        - lvm_resize_filesystem | bool
        - lvm_fstype.stdout in ["ext2", "ext3", "ext4"]
    - name: Capture filesystem usage after resize
      ansible.builtin.command:
        argv:
          - df
          - -hT
          - "{{ lvm_mountpoint }}"
      register: lvm_df_after
      changed_when: false
    - name: Report LVM resize evidence
      ansible.builtin.debug:
        msg:
          host: "{{ inventory_hostname }}"
          dry_run: "{{ lvm_dry_run }}"
          target: "/dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
          mountpoint: "{{ lvm_mountpoint }}"
          requested_size: "{{ lvm_size_request }}"
          filesystem_type: "{{ lvm_fstype.stdout | default('unknown') }}"
          before_df: "{{ lvm_df_before.stdout_lines }}"
          after_df: "{{ lvm_df_after.stdout_lines }}"
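 # Example survey answers / extra vars for a real run. This is a sketch only:
 # vg_data, lv_app and /opt/application are hypothetical values and must match
 # an existing volume group, logical volume and mountpoint on the target host.
 #
 #   lvm_dry_run: false
 #   lvm_vg_name: vg_data
 #   lvm_lv_name: lv_app
 #   lvm_mountpoint: /opt/application
 #   lvm_size_request: "+5G"
 #
 # Leaving lvm_dry_run at its default (true) only prints the lvextend command.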
@@ -0,0 +1,31 @@
 ---
 - name: Apply Security Patches and Updates
  hosts: all
  become: true
  gather_facts: true
  pre_tasks:
    - name: Validate patch prerequisites
      ansible.builtin.assert:
        that:
          - ansible_os_family == "Debian"
        fail_msg: "Patching supported only on Debian-based systems"
    - name: Display patch information
      ansible.builtin.debug:
        msg: |
          Patching {{ inventory_hostname }}
          Patch Window: {{ patch_window_start }} - {{ patch_window_end }}
          Security Only: {{ patch_security_only }}
  roles:
    - role: patching
      tags: ['patch', 'updates']
  post_tasks:
    - name: Display patching summary
      ansible.builtin.debug:
        msg: |
          Patching completed!
          Host: {{ inventory_hostname }}
          Reboot Required: {{ reboot_required | default(false) }}
@@ -0,0 +1,33 @@
 ---
 - name: Provision Enterprise Infrastructure Nodes
  hosts: all
  become: true
  gather_facts: true
  pre_tasks:
    - name: Validate Ansible version
      ansible.builtin.assert:
        that:
          - ansible_version.full is version('2.9', '>=')
        fail_msg: "Ansible 2.9+ is required"
    - name: Display provisioning information
      ansible.builtin.debug:
        msg: |
          Provisioning {{ inventory_hostname }}
          OS: {{ ansible_os_family }}
          Python: {{ ansible_python_version }}
  roles:
    - role: base_provision
      tags: ['provision', 'base']
  post_tasks:
    - name: Generate provisioning summary
      ansible.builtin.debug:
        msg: |
          Provisioning completed successfully!
          Host: {{ inventory_hostname }}
          IP: {{ ansible_default_ipv4.address }}
          OS: {{ ansible_distribution }} {{ ansible_distribution_version }}
@@ -0,0 +1,48 @@
 # Base Provision Role
 Provision basic infrastructure on enterprise nodes with security hardening.
 ## Features
 - **Idempotent**: All tasks use proper idempotency markers (`changed_when`, `failed_when`)
 - **Handlers**: SSH and fail2ban restarts use handlers instead of direct service calls
 - **Variables**: All configuration in `defaults/main.yml` - no hardcoding
 - **Validation**: Pre-flight checks for system requirements
 - **Firewall**: UFW firewall configuration with configurable rules
 - **SSH Security**: Root login disabled, password auth disabled, key-based auth only
 ## Role Variables
 See `defaults/main.yml` for all available variables.
 ### Key Variables
 - `node_timezone`: System timezone (default: UTC)
 - `admin_user`: Admin username for infrastructure access
 - `ssh_port`: SSH service port (default: 22)
 - `base_packages`: List of base packages to install
 - `firewall_enabled`: Enable UFW firewall (default: true)
 - `firewall_allowed_tcp_ports`: Allowed TCP ports for firewall
 ## Secret Variables
 This portfolio demo does not require secrets for offline validation. If you extend it with real passwords or keys, copy `group_vars/vault.example.yml` into an encrypted Ansible Vault file and keep real values out of normal git history.
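 A minimal sketch of how a play could load such a vault file, assuming a hypothetical encrypted copy at `group_vars/vault.yml` and a hypothetical variable `vault_admin_password_hash` (neither exists in this repository):
 ```yaml
 - name: Provision with vaulted secrets (sketch)
  hosts: all
  become: true
  vars_files:
    # Hypothetical path: an encrypted copy of group_vars/vault.example.yml
    - group_vars/vault.yml
  tasks:
    - name: Set admin password from the vault
      ansible.builtin.user:
        name: infra-admin  # matches the admin_user default
        # Hypothetical variable holding a pre-hashed password
        password: "{{ vault_admin_password_hash }}"
      no_log: true
 ```
 Setting `no_log: true` keeps the hashed value out of task output and job logs.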
 ## Usage
 ```yaml
 - role: base_provision
  vars:
    node_timezone: "Europe/Warsaw"
    firewall_enabled: true
 ```
 ## Handlers
 - `restart sshd`: Restarts SSH service (triggered by config changes)
 - `restart fail2ban`: Restarts fail2ban service (triggered by config changes)
 ## Tags
 - `provision`: All provisioning tasks
 - `base`: Base provision role tasks
@@ -0,0 +1,44 @@
 ---
 # Base provisioning configuration
 node_timezone: "UTC"
 admin_user: "infra-admin"
 ssh_port: 22
 ssh_disabled_root_login: true
 ssh_disable_password_auth: true
 # Packages to install
 base_packages:
  - curl
  - wget
  - vim
  - htop
  - net-tools
  - iptables
  - fail2ban
  - unattended-upgrades
 # Firewall rules
 firewall_enabled: true
 firewall_default_policy: deny
 firewall_allowed_tcp_ports:
  - 22
  - 80
  - 443
 # Application directories
 app_directories:
  - path: /opt/application
    owner: "{{ admin_user }}"
    group: "{{ admin_user }}"
    mode: '0755'
  - path: /var/log/application
    owner: "{{ admin_user }}"
    group: "{{ admin_user }}"
    mode: '0755'
  - path: /etc/application
    owner: root
    group: root
    mode: '0755'
 # Service verification
 services_to_verify: []
@@ -0,0 +1,11 @@
 ---
 - name: restart sshd
  ansible.builtin.service:
    name: sshd
    state: restarted
 - name: restart fail2ban
  ansible.builtin.service:
    name: fail2ban
    state: restarted
    enabled: true
@@ -0,0 +1,138 @@
 ---
 - name: Validate system requirements
  ansible.builtin.assert:
    that:
      - ansible_os_family == "Debian"
      - ansible_python_version is version('3.6', '>=')
    fail_msg: "Unsupported system - requires Debian and Python 3.6+"
 - name: Update package cache
  ansible.builtin.apt:
    update_cache: true
    cache_valid_time: 3600
  changed_when: false
 - name: Install base packages
  ansible.builtin.apt:
    name: "{{ base_packages }}"
    state: present
    update_cache: true
 - name: Check if admin user exists
  ansible.builtin.getent:
    database: passwd
    key: "{{ admin_user }}"
  register: admin_check
  failed_when: false
  changed_when: false
 - name: Create admin user
  # The user module is idempotent, so it runs unconditionally
  ansible.builtin.user:
    name: "{{ admin_user }}"
    groups: sudo
    append: true
    create_home: true
    shell: /bin/bash
 - name: Configure timezone
  community.general.timezone:
    name: "{{ node_timezone }}"
 - name: Configure SSH security
  block:
    - name: Disable root SSH login
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^PermitRootLogin'
        line: 'PermitRootLogin no'
        state: present
      when: ssh_disabled_root_login
      notify: restart sshd
    - name: Set SSH port
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^Port'
        line: "Port {{ ssh_port }}"
        state: present
      notify: restart sshd
    - name: Disable password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^PasswordAuthentication'
        line: 'PasswordAuthentication no'
        state: present
      when: ssh_disable_password_auth
      notify: restart sshd
 - name: Configure firewall
  block:
    - name: Allow SSH access before enabling the firewall
      community.general.ufw:
        rule: allow
        port: "{{ ssh_port }}"
        proto: tcp
      when: firewall_enabled
    - name: Enable UFW firewall
      community.general.ufw:
        state: enabled
        policy: "{{ firewall_default_policy }}"
      when: firewall_enabled
    - name: Allow HTTP/HTTPS
      community.general.ufw:
        rule: allow
        port: "{{ item }}"
        proto: tcp
      loop: "{{ firewall_allowed_tcp_ports }}"
      when: firewall_enabled and item not in [ssh_port]
 - name: Configure fail2ban
  ansible.builtin.template:
    src: jail.local.j2
    dest: /etc/fail2ban/jail.local
    backup: true
    mode: '0644'
  notify: restart fail2ban
 - name: Enable unattended upgrades
  ansible.builtin.lineinfile:
    path: /etc/apt/apt.conf.d/20auto-upgrades
    regexp: '^APT::Periodic::Unattended-Upgrade'
    line: 'APT::Periodic::Unattended-Upgrade "1";'
    state: present
    # The file is not guaranteed to exist on a fresh install
    create: true
    mode: '0644'
 - name: Create application directories
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: directory
    owner: "{{ item.owner }}"
    group: "{{ item.group }}"
    mode: "{{ item.mode }}"
  loop: "{{ app_directories }}"
 - name: Record role-specific service intent
  ansible.builtin.debug:
    msg: "Would configure {{ node_type | default('generic') }} service components in a full lab deployment"
 - name: Verify services are running
  ansible.builtin.service:
    name: "{{ item }}"
    state: started
    enabled: true
  loop: "{{ services_to_verify }}"
  when: services_to_verify | length > 0
  failed_when: false
 - name: Run health checks
  ansible.builtin.uri:
    url: http://localhost/health
    method: GET
    status_code: 200
  register: health_check
  failed_when: false
  ignore_errors: true
  when: "'webservers' in group_names"
@@ -0,0 +1,14 @@
 # fail2ban configuration
 [DEFAULT]
 bantime = 3600
 findtime = 600
 maxretry = 5
 [sshd]
 enabled = true
 port = {{ ssh_port }}
 logpath = /var/log/auth.log
 maxretry = 3
 [recidive]
 enabled = true
@@ -0,0 +1,62 @@
 # Decommission Role
 Gracefully decommission enterprise infrastructure nodes with comprehensive backup and cleanup.
 ## Features
 - **Confirmation Prompt**: Interactive confirmation before decommissioning
 - **Graceful Shutdown**: Stop services gracefully with connection drain time
 - **Comprehensive Backup**: Archive configurations and data before cleanup
 - **Selective Cleanup**: Only remove items that were deployed
 - **Logging**: Detailed decommissioning logs for audit trail
 - **Notifications**: Optional email notifications on completion
 ## Role Variables
 See `defaults/main.yml` for all available variables.
 ### Key Variables
 - `backup_data`: Backup application data (default: true)
 - `export_config`: Export system configuration (default: true)
 - `graceful_shutdown`: Graceful service shutdown (default: true)
 - `auto_shutdown`: Auto shutdown after decommissioning (default: false)
 - `application_services`: Services to stop
 - `application_packages`: Packages to remove
 - `decommission_notification_email`: Email for notifications (optional)
 ## Usage
 ```yaml
 - role: decommission
  vars:
    backup_data: true
    export_config: true
    auto_shutdown: false
    decommission_notification_email: "ops@company.com"
 ```
 ## Backup Locations
 - Configuration: `/var/backups/decommission-<timestamp>/config/`
 - Data: `/var/backups/decommission-<timestamp>/data/`
 - Report: `/var/log/decommission_report_<timestamp>.log`
 ## Supported Groups
 - `webservers`: Backs up /var/www/html
 - `databases`: Backs up PostgreSQL data
 - `monitoring`: Backs up Prometheus data
 - `loadbalancers`: Loadbalancer cleanup
 ## Safety Features
 - Interactive confirmation before execution
 - Connection drain time before shutdown (30 seconds)
 - Errors are logged but don't stop the process
 - Comprehensive audit log
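 The interactive prompt relies on `ansible.builtin.pause`, which typically does not wait for input when no terminal is attached (for example in a scheduled or controller-launched job). A minimal sketch of a non-interactive guard, assuming a hypothetical `confirm_decommission` variable that is not defined anywhere in this role:
 ```yaml
 - name: Require explicit confirmation variable
  ansible.builtin.assert:
    that:
      # Hypothetical variable; pass it with -e confirm_decommission=true
      - confirm_decommission | default(false) | bool
    fail_msg: "Set confirm_decommission=true to decommission {{ inventory_hostname }}"
 ```
 Because the variable defaults to false, a run that omits it fails before any service is stopped.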
 ## Tags
 - `decommission`: All decommissioning tasks
 - `cleanup`: Cleanup-related tasks
@@ -0,0 +1,34 @@
 ---
 # Decommissioning configuration
 backup_data: true
 export_config: true
 graceful_shutdown: true
 cleanup_inventory: true
 auto_shutdown: false
 shutdown_delay: 10
 # Services to stop gracefully
 application_services:
  - nginx
  - postgresql
  - haproxy
 # Packages to remove
 application_packages:
  - nginx
  - postgresql
  - haproxy
  - prometheus
 # Directories to archive
 config_paths:
  - /etc/
  - /opt/application/
 data_paths:
  - /var/www/html
  - /var/lib/postgresql
  - /var/lib/prometheus
 # Notification settings
 decommission_notification_email: null
@@ -0,0 +1,177 @@
 ---
 - name: Validate decommissioning requirements
  ansible.builtin.assert:
    that:
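      - shutdown_delay | int >= 0
      - data_paths | length > 0 or not (backup_data | bool)
      - config_paths | length > 0 or not (export_config | bool)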
    fail_msg: "Invalid decommissioning configuration"
 - name: Pre-decommissioning checks
  block:
    - name: Check node health
      ansible.builtin.uri:
        url: http://localhost/health
        method: GET
        status_code: 200
      register: health_check
      failed_when: false
      ignore_errors: true
      when: "'webservers' in group_names"
    - name: Create decommissioning backup directory
      ansible.builtin.file:
        path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}"
        state: directory
        mode: '0755'
    - name: Initialize decommissioning log
      ansible.builtin.file:
        path: "/var/log/decommission.log"
        state: touch
        mode: '0644'
        modification_time: now
        access_time: now
    - name: Log decommissioning start
      ansible.builtin.lineinfile:
        path: "/var/log/decommission.log"
        line: "{{ ansible_date_time.iso8601 }} - Starting decommissioning of {{ inventory_hostname }}"
        state: present
 - name: Graceful application shutdown
  block:
    - name: Stop application services
      ansible.builtin.service:
        name: "{{ item }}"
        state: stopped
      loop: "{{ application_services }}"
      failed_when: false
      when: graceful_shutdown
    - name: Wait for connections to drain
      ansible.builtin.pause:
        seconds: 30
      when: graceful_shutdown and ("webservers" in group_names or "loadbalancers" in group_names)
 - name: Export and backup data
  block:
    - name: Create config export directory
      ansible.builtin.file:
        path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/config"
        state: directory
        mode: '0755'
    - name: Archive system configuration
      community.general.archive:
        path: "{{ config_paths }}"
        dest: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/config/system_config.tar.gz"
        format: gz
      when: export_config
      failed_when: false  # noqa risky-file-permissions
    - name: Create data backup directory
      ansible.builtin.file:
        path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/data"
        state: directory
        mode: '0755'
      when: backup_data
    - name: Backup individual data paths
      community.general.archive:
        path: "{{ item }}"
        dest: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/data/{{ item | regex_replace('/', '_') }}.tar.gz"
        format: gz
      loop: "{{ data_paths }}"
      when: backup_data
      failed_when: false  # noqa risky-file-permissions
 - name: Update monitoring and load balancing
  block:
    - name: Remove from load balancer
      ansible.builtin.debug:
        msg: "Would remove {{ inventory_hostname }} from load balancer"
      when: "'webservers' in group_names or 'databases' in group_names"
    - name: Update monitoring alerts
      ansible.builtin.debug:
        msg: "Would update monitoring alerts for {{ inventory_hostname }}"
      when: "'monitoring' not in group_names"
 - name: Clean up application
  block:
    - name: Remove application directories
      ansible.builtin.file:
        path: "{{ item }}"
        state: absent
      loop:
        - /opt/application
        - /var/www/html
        - /var/lib/postgresql
        - /var/lib/prometheus
      failed_when: false
    - name: Remove application packages
      ansible.builtin.apt:
        name: "{{ item }}"
        state: absent
        purge: true
      loop: "{{ application_packages }}"
      failed_when: false
    - name: Clean system logs (keep the decommission audit trail)
      ansible.builtin.shell: |
        find /var/log -name "*.log" ! -name "decommission*" -type f -size +0 -exec truncate -s 0 {} \;
      changed_when: false
      failed_when: false
    - name: Remove SSH credentials
      ansible.builtin.file:
        path: "{{ item }}"
        state: absent
      loop:
        - /root/.ssh/authorized_keys
        - /root/.ssh/known_hosts
        - /home/infra-admin/.ssh/authorized_keys
      failed_when: false
 - name: Generate decommissioning report
  ansible.builtin.template:
    src: decommission_report.j2
    dest: "/var/log/decommission_report_{{ ansible_date_time.iso8601 }}.log"
    mode: '0644'
  vars:
    backup_location: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}"
 - name: Send decommissioning notification
  community.general.mail:
    host: localhost
    port: 25
    to: "{{ decommission_notification_email }}"
    subject: "Node Decommissioned - {{ inventory_hostname }}"
    body: |
      Node {{ inventory_hostname }} has been successfully decommissioned.
      Backup location: /var/backups/decommission-{{ ansible_date_time.iso8601 }}/
      Services stopped: {{ application_services | join(', ') }}
      Configuration exported: {{ export_config }}
      Data backed up: {{ backup_data }}
      See /var/log/decommission_report_{{ ansible_date_time.iso8601 }}.log for details
  when: decommission_notification_email | default('', true) | length > 0
  failed_when: false
 - name: Finalize decommissioning
  block:
    - name: Log decommissioning completion
      ansible.builtin.lineinfile:
        path: "/var/log/decommission.log"
        line: "{{ ansible_date_time.iso8601 }} - Decommissioning completed for {{ inventory_hostname }}"
        state: present
    - name: Perform system shutdown
      community.general.shutdown:
        msg: "System scheduled for shutdown after decommissioning"
        delay: "{{ shutdown_delay }}"
      when: auto_shutdown | bool
@@ -0,0 +1,13 @@
 Decommissioning Report
 ======================
 Generated: {{ ansible_date_time.iso8601 }}
 Host: {{ inventory_hostname }}
 Status: COMPLETED
 Backup Location: {{ backup_location }}
 Configuration Exported: {{ export_config }}
 Data Backed Up: {{ backup_data }}
 Services Stopped: {{ application_services | join(', ') }}
 Log Location: /var/log/decommission.log
@@ -0,0 +1,58 @@
 # Hardening Role
 Apply security hardening to enterprise infrastructure nodes following CIS benchmarks.
 ## Features
 - **CIS Compliance**: Support for CIS hardening levels 1 and 2
 - **SSH Hardening**: Disable root login, password auth, set auth limits
 - **Firewall Configuration**: UFW with configurable rules
 - **Service Cleanup**: Disable unnecessary services and remove insecure packages
 - **Handlers**: SSH restarts via handlers
 ## Role Variables
 See `defaults/main.yml` for all available variables.
 ### Key Variables
 - `cis_level`: CIS hardening level (1 or 2)
 - `disable_root_login`: Disable root SSH login (default: true)
 - `secure_ssh_config`: Apply SSH security hardening (default: true)
 - `firewall_policy`: Firewall default policy (default: deny)
 - `ssh_max_auth_tries`: Maximum SSH authentication attempts (default: 3)
 - `ssh_client_alive_interval`: SSH client alive interval in seconds (default: 300)
 - `ssh_allowed_networks`: Networks allowed SSH access from
 ### SSH Allowed Networks
 Default trusted networks (the RFC 1918 private ranges):
 - 10.0.0.0/8
 - 172.16.0.0/12
 - 192.168.0.0/16
 ## Usage
 ```yaml
 - role: hardening
  vars:
    cis_level: 1
    disable_root_login: true
    ssh_allowed_networks:
      - 10.0.0.0/8
      - 203.0.113.0/24
 ```
 ## SSH Configuration Changes
 - Root login disabled
 - Password authentication disabled
 - Maximum auth tries: 3
 - Empty passwords prohibited
 - Client alive interval: 300 seconds
 - Client alive count max: 2
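 A broken `sshd_config` can lock out remote access on the next restart. One way to guard each edit is the `validate` option of `ansible.builtin.lineinfile`. This is a sketch under the assumption that the daemon binary lives at `/usr/sbin/sshd`, not a description of what the role currently does:
 ```yaml
 - name: Disable root SSH login (validated before write)
  ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^PermitRootLogin'
    line: 'PermitRootLogin no'
    state: present
    # %s is replaced with a temporary copy of the edited file
    validate: /usr/sbin/sshd -t -f %s
  notify: restart sshd
 ```
 If the syntax test fails, the original file is left untouched and the handler is not notified.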
 ## Tags
 - `hardening`: All hardening tasks
 - `security`: Security-related tasks
@@ -0,0 +1,35 @@
 ---
 # Hardening configuration
 cis_level: 1
 disable_root_login: true
 secure_ssh_config: true
 firewall_policy: deny
 auditd_enabled: true
 selinux_mode: enforcing
 apparmor_enabled: true
 # SSH Hardening
 ssh_max_auth_tries: 3
 ssh_client_alive_interval: 300
 ssh_client_alive_count_max: 2
 # Firewall rules for SSH (trusted networks)
 ssh_allowed_networks:
  - 10.0.0.0/8
  - 172.16.0.0/12
  - 192.168.0.0/16
 # Services to disable
 unnecessary_services:
  - cups
  - avahi-daemon
  - bluetooth
  - nfs-server
  - rpcbind
 # Packages to remove
 unnecessary_packages:
  - telnet
  - rsh-client
  - talk
  - ntalk
@@ -0,0 +1,5 @@
 ---
 - name: restart sshd
  ansible.builtin.service:
    name: sshd
    state: restarted
@@ -0,0 +1,7 @@
 ---
 # CIS Hardening Level 1 tasks (stub for future expansion)
 # https://www.cisecurity.org/cis-benchmarks/
 - name: Check CIS status
  ansible.builtin.debug:
    msg: "CIS Hardening Level {{ cis_level }} would be applied here"
@@ -0,0 +1,95 @@
 ---
 - name: Validate hardening requirements
  ansible.builtin.assert:
    that:
      - ansible_os_family == "Debian"
      - cis_level in [1, 2]
    fail_msg: "Unsupported configuration for hardening"
 - name: Apply CIS hardening tasks
  ansible.builtin.include_tasks: cis_hardening.yml
  when: cis_level >= 1
 - name: Configure SSH hardening
  block:
    - name: Disable root SSH login
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^PermitRootLogin'
        line: 'PermitRootLogin no'
        state: present
      when: disable_root_login
      notify: restart sshd
    - name: Disable password authentication
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^PasswordAuthentication'
        line: 'PasswordAuthentication no'
        state: present
      when: secure_ssh_config
      notify: restart sshd
    - name: Set MaxAuthTries
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^MaxAuthTries'
        line: "MaxAuthTries {{ ssh_max_auth_tries }}"
        state: present
      notify: restart sshd
    - name: Disable empty passwords
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^PermitEmptyPasswords'
        line: 'PermitEmptyPasswords no'
        state: present
      notify: restart sshd
    - name: Set ClientAliveInterval
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^ClientAliveInterval'
        line: "ClientAliveInterval {{ ssh_client_alive_interval }}"
        state: present
      notify: restart sshd
    - name: Set ClientAliveCountMax
      ansible.builtin.lineinfile:
        path: /etc/ssh/sshd_config
        regexp: '^ClientAliveCountMax'
        line: "ClientAliveCountMax {{ ssh_client_alive_count_max }}"
        state: present
      notify: restart sshd
 - name: Configure firewall rules
  block:
    - name: Enable firewall
      community.general.ufw:
        state: enabled
        policy: "{{ firewall_policy }}"
      when: firewall_policy is defined
    - name: Allow SSH from trusted networks
      community.general.ufw:
        rule: allow
        port: '22'
        proto: tcp
        from: "{{ item }}"
      loop: "{{ ssh_allowed_networks }}"
 - name: Disable unnecessary services
  ansible.builtin.service:
    name: "{{ item }}"
    state: stopped
    enabled: false
  loop: "{{ unnecessary_services }}"
  failed_when: false
 - name: Remove unnecessary packages
  ansible.builtin.apt:
    name: "{{ item }}"
    state: absent
    purge: true
  loop: "{{ unnecessary_packages }}"
  failed_when: false
@@ -0,0 +1,45 @@
 # Patching Role
 Apply security patches and OS updates to enterprise infrastructure nodes.
 ## Features
 - **Idempotent**: Properly checks for changes with `changed_when`
 - **Patch Window**: Optional enforcement of patch time windows
 - **Pre-patch Backup**: Backs up package list before patching
 - **Smart Reboot**: Automatically detects if reboot is required
 - **Service Restart**: Restarts only necessary services after patching
 - **Health Checks**: Verifies services and runs health endpoint checks
 ## Role Variables
 See `defaults/main.yml` for all available variables.
 ### Key Variables
 - `patch_window_start`: Patch window start time (default: 02:00)
 - `patch_window_end`: Patch window end time (default: 04:00)
 - `enforce_patch_window`: Enforce patch time window (default: true)
 - `patch_security_only`: Apply security updates only (default: true)
 - `backup_before_patch`: Create backup before patching (default: true)
 - `reboot_if_required`: Auto-reboot if required (default: false)
 - `services_to_restart`: Services to restart after patching
 - `critical_services`: Critical services to verify after patching
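 The window variables are zero-padded `HH:MM` strings, so they can be compared directly as strings. A minimal sketch of a minute-precision check, assuming gathered facts, a window that does not cross midnight, and a hypothetical helper variable `now_hhmm`:
 ```yaml
 - name: Check current time against patch window (minute precision)
  ansible.builtin.assert:
    that:
      # Zero-padded HH:MM strings sort in chronological order
      - now_hhmm >= patch_window_start
      - now_hhmm < patch_window_end
    fail_msg: "Current time {{ now_hhmm }} is outside patch window {{ patch_window_start }}-{{ patch_window_end }}"
  vars:
    # Hypothetical helper variable built from gathered facts
    now_hhmm: "{{ ansible_date_time.hour }}:{{ ansible_date_time.minute }}"
 ```
 A window such as 23:00 to 01:00 would need an extra branch, because the end string sorts before the start string.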
 ## Usage
 ```yaml
 - role: patching
  vars:
    patch_security_only: true
    enforce_patch_window: false
    reboot_if_required: true
 ```
 ## Report
 Patch report is generated at: `/var/log/patch_report_<timestamp>.log`
 ## Backup Location
 Pre-patch backups saved to: `/var/backups/pre-patch-<timestamp>/`
@@ -0,0 +1,20 @@
 ---
 # Patching configuration
 patch_window_start: "02:00"
 patch_window_end: "04:00"
 enforce_patch_window: true
 patch_security_only: true
 backup_before_patch: true
 reboot_if_required: false
 reboot_timeout: 300
 # Services to restart after patching
 services_to_restart:
  - sshd
  - fail2ban
 # Services to verify after patching
 critical_services:
  - systemd-journald
  - systemd-logind
  - cron
@@ -0,0 +1,6 @@
 ---
 - name: restart patching services
  ansible.builtin.service:
    name: "{{ item }}"
    state: restarted
  loop: "{{ services_to_restart }}"
@@ -0,0 +1,105 @@
 ---
 - name: Validate patch window
  when: enforce_patch_window | bool
  block:
    - name: Check current time against patch window
      ansible.builtin.assert:
        that:
          - ansible_date_time.hour | int >= patch_window_start.split(':')[0] | int
          - ansible_date_time.hour | int < patch_window_end.split(':')[0] | int
        fail_msg: |
          Current time {{ ansible_date_time.hour }}:{{ ansible_date_time.minute }} is outside patch window {{ patch_window_start }}-{{ patch_window_end }}
 - name: Create pre-patch backup
  when: backup_before_patch | bool
  block:
    - name: Create backup directory
      ansible.builtin.file:
        path: "/var/backups/pre-patch-{{ ansible_date_time.iso8601 }}"
        state: directory
        mode: '0755'
    - name: Capture current package list
      ansible.builtin.shell: |
        set -o pipefail
        dpkg --get-selections > /var/backups/pre-patch-{{ ansible_date_time.iso8601 }}/packages.list
      args:
        # pipefail is a bash option; the default /bin/sh may reject it
        executable: /bin/bash
      changed_when: false
 - name: Check for available updates
  ansible.builtin.shell: |
    set -o pipefail
    apt list --upgradable 2>/dev/null | grep -v "Listing..." | wc -l
  args:
    # pipefail is a bash option; the default /bin/sh may reject it
    executable: /bin/bash
  register: updates_available_count
  changed_when: false
  failed_when: false
 - name: Update package cache
  ansible.builtin.apt:
    update_cache: true
    cache_valid_time: 300
  changed_when: false
 - name: Check if reboot required before patching
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: reboot_required_before
  changed_when: false
 # Note: `upgrade: dist` upgrades every package with a pending update. Limiting a
 # run to security fixes needs a security-only APT source or unattended-upgrades.
 # One task keeps `apt_update_result` from being overwritten by a skipped task.
 - name: Apply package updates
  ansible.builtin.apt:
    upgrade: "{{ (patch_security_only | bool) | ternary('dist', 'full') }}"
    update_cache: true
  register: apt_update_result
  notify: restart patching services
 - name: Check if reboot required after patching
  ansible.builtin.stat:
    path: /var/run/reboot-required
  register: reboot_required_after
  changed_when: false
 - name: Verify critical services are running
  ansible.builtin.service:
    name: "{{ item }}"
    state: started
    enabled: true
  loop: "{{ critical_services }}"
  failed_when: false
 - name: Run post-patch health checks
  ansible.builtin.uri:
    url: http://localhost/health
    method: GET
    status_code: 200
  register: health_check
  failed_when: false
  ignore_errors: true
  when: "'webservers' in group_names"
 - name: Set reboot required flag
  ansible.builtin.set_fact:
    reboot_required: "{{ reboot_required_after.stat.exists | default(false) }}"
 - name: Perform system reboot if required
  ansible.builtin.reboot:
    msg: "Rebooting after security patches"
    timeout: "{{ reboot_timeout }}"
  when: reboot_required and reboot_if_required | bool
 - name: Generate patching report
  ansible.builtin.template:
    src: patch_report.j2
    dest: /var/log/patch_report_{{ ansible_date_time.iso8601 }}.log
    mode: '0644'
  vars:
    updates_applied_count: "{{ apt_update_result.changed | ternary('Yes', 'No') }}"
    reboot_required_flag: "{{ reboot_required }}"
@@ -0,0 +1,10 @@
 Patching Report
 ===============
 Generated: {{ ansible_date_time.iso8601 }}
 Host: {{ inventory_hostname }}
 Updates Applied: {{ updates_applied_count }}
 Reboot Required: {{ reboot_required_flag }}
 Services Restarted: {{ services_to_restart | join(', ') }}
 Backup Location: /var/backups/pre-patch-{{ ansible_date_time.iso8601 }}/
@@ -0,0 +1,21 @@
 # Scenario: Simulate Failure and Patch
 ## Description
 Validate that a service-level failure can be detected, recovered, and followed by a controlled patch workflow. This mirrors a maintenance window where a degraded node is stabilized before package updates are applied.
 ## Commands
 ```bash
 cd professional-infra/linux-operations-automation
 ./scripts/simulate_failure.sh service 30 web
 ansible-playbook -i inventory/hosts.ini playbooks/patch.yml
 ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml --check
 ```
 ## Expected Result
 - The simulation records a temporary service failure.
 - The service is restored after cleanup.
 - The patch playbook completes without unreachable hosts.
 - Hardening check mode reports no destructive changes.
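 The "without unreachable hosts" expectation can be checked from the playbook output. The sketch below assumes the standard `PLAY RECAP` lines that `ansible-playbook` prints, where each host reports an `unreachable=N` counter. `recap_has_unreachable` is a hypothetical helper, and the sample recap text is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: detect unreachable hosts in ansible-playbook output.

# Reads playbook output on stdin and succeeds if any host
# reports an unreachable counter greater than zero.
recap_has_unreachable() {
    grep -Eq 'unreachable=[1-9][0-9]*'
}

# Illustrative recap text (not real run output).
sample_recap='web1 : ok=5 changed=1 unreachable=0 failed=0
db1  : ok=0 changed=0 unreachable=1 failed=0'

if printf '%s\n' "$sample_recap" | recap_has_unreachable; then
    echo "unreachable hosts detected"
else
    echo "all hosts reachable"
fi
```

 In practice the playbook output could be saved with `tee` during the patch run and passed to the helper on stdin.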
@@ -0,0 +1,116 @@
 ---
 - name: Enterprise Scaling Event Scenario
  hosts: all
  become: yes
  gather_facts: yes
  vars:
    scaling_threshold: 80
    cooldown_period: 300
    max_scale_up: 5
    min_instances: 2
  pre_tasks:
    - name: Log scenario start
      lineinfile:
        path: "/var/log/scaling_scenario.log"
        line: "{{ ansible_date_time.iso8601 }} - Starting scaling event scenario"
        create: yes
    - name: Check current load
      command: uptime
      register: system_load
      changed_when: false
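    - name: Parse load average
      set_fact:
        # Split on the "load average:" label first; splitting the whole uptime line on commas picks up the uptime and user fields
        load_1min: "{{ system_load.stdout.split('load average:')[-1].split(',')[0] | trim | float }}"
        load_5min: "{{ system_load.stdout.split('load average:')[-1].split(',')[1] | trim | float }}"
        load_15min: "{{ system_load.stdout.split('load average:')[-1].split(',')[2] | trim | float }}"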
  tasks:
    - name: Evaluate scaling conditions
      set_fact:
        scale_up_needed: "{{ (load_5min | float) > scaling_threshold }}"
        scale_down_needed: "{{ (load_5min | float) < (scaling_threshold * 0.3) }}"
    - name: Scale up web servers
      include_role:
        name: scale_up
        tasks_from: web_servers
      vars:
        scale_count: "{{ [max_scale_up, ((load_5min | float) / 10) | int] | min }}"
      when: scale_up_needed | bool and 'webservers' in group_names
    - name: Scale up database servers
      include_role:
        name: scale_up
        tasks_from: database_servers
      vars:
        scale_count: "{{ [2, ((load_5min | float) / 20) | int] | min }}"
      when: scale_up_needed | bool and 'databases' in group_names
    - name: Update load balancer configuration
      include_role:
        name: load_balancer
        tasks_from: update_backends
      when: scale_up_needed
    - name: Scale down web servers
      include_role:
        name: scale_down
        tasks_from: web_servers
      vars:
        scale_count: "{{ [(inventory_hostname | regex_findall('[0-9]+') | first | int) - min_instances, 1] | max }}"
      when: scale_down_needed | bool and 'webservers' in group_names and (inventory_hostname | regex_findall('[0-9]+') | first | int) > min_instances
    - name: Wait for cooldown period
      pause:
        seconds: "{{ cooldown_period }}"
      when: scale_up_needed or scale_down_needed
    - name: Verify scaling results
      uri:
        url: http://localhost/health
        method: GET
        status_code: 200
      register: health_check
      until: health_check.status == 200
      retries: 5
      delay: 10
      when: "'webservers' in group_names"
    - name: Update monitoring thresholds
      include_role:
        name: monitoring
        tasks_from: update_alerts
      vars:
        new_threshold: "{{ scaling_threshold + 10 }}"
    - name: Send scaling notification
      mail:
        to: "{{ scaling_notification_email | default('infra-team@company.com') }}"
        subject: "Infrastructure Scaling Event - {{ inventory_hostname }}"
        body: |
          Scaling event completed on {{ inventory_hostname }}
          Load averages: {{ load_1min }}, {{ load_5min }}, {{ load_15min }}
          Action taken: {{ 'Scale Up' if scale_up_needed else 'Scale Down' if scale_down_needed else 'No Action' }}
          Health check: {{ 'PASSED' if (health_check.status | default(0)) == 200 else 'FAILED' }}
          See /var/log/scaling_scenario.log for details
      when: scaling_notification_email is defined
      ignore_errors: yes
  post_tasks:
    - name: Generate scaling scenario report
      template:
        src: templates/scaling_scenario_report.j2
        dest: "/var/log/scaling_scenario_report_{{ ansible_date_time.iso8601 }}.log"
      vars:
        scenario_outcome: "{{ 'SUCCESS' if (health_check.status | default(0)) == 200 else 'WARNING' }}"
        load_metrics: "{{ load_1min }}, {{ load_5min }}, {{ load_15min }}"
    - name: Log scenario completion
      lineinfile:
        path: "/var/log/scaling_scenario.log"
        line: "{{ ansible_date_time.iso8601 }} - Scaling event scenario completed"
@@ -0,0 +1,388 @@
 #!/bin/bash
 # Enterprise Infrastructure Failure Simulation Script
 # Simulates various types of infrastructure failures for testing
 set -euo pipefail
 # Configuration
 DOCKER_COMPOSE_FILE="docker-compose.yml"
 INVENTORY_FILE="inventory/hosts.ini"
 LOG_FILE="logs/failure_simulation.log"
 # Default values
 FAILURE_TYPE="${1:-network}"
 DURATION="${2:-60}"
 TARGET_NODES="${3:-all}"
 INTENSITY="${INTENSITY:-medium}"
 # Logging function
 log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "$LOG_FILE"
 }
 # Error handling
 error_exit() {
    log "ERROR: $1"
    # Cleanup any active failures
    cleanup_failure
    exit 1
 }
 # Validate inputs
 validate_inputs() {
    case "$FAILURE_TYPE" in
        network|disk|service|node|cpu|memory) ;;
        *) error_exit "Invalid failure type: $FAILURE_TYPE. Must be network, disk, service, node, cpu, or memory" ;;
    esac
    if ! [[ "$DURATION" =~ ^[0-9]+$ ]] || [ "$DURATION" -lt 1 ]; then
        error_exit "Invalid duration: $DURATION. Must be a positive integer (seconds)"
    fi
    case "$INTENSITY" in
        low|medium|high|critical) ;;
        *) error_exit "Invalid intensity: $INTENSITY. Must be low, medium, high, or critical" ;;
    esac
 }
 # Get target containers
 get_target_containers() {
    if [ "${SIMULATION_MODE:-false}" = true ]; then
        case "$TARGET_NODES" in
            all) echo "web db lb" ;;
            *) echo "$TARGET_NODES" ;;
        esac
        return
    fi
    case "$TARGET_NODES" in
        all)
            docker compose ps --services | grep -v "^NAME$" || true
            ;;
        web)
            echo "web"
            ;;
        db)
            echo "db"
            ;;
        lb)
            echo "lb"
            ;;
        monitor)
            echo "monitor"
            ;;
        *)
            echo "$TARGET_NODES"
            ;;
    esac
 }
 # Network failure simulation
 simulate_network_failure() {
    local containers=$(get_target_containers)
    log "Simulating network failure on containers: $containers"
    if [ "${SIMULATION_MODE:-false}" = true ]; then
        log "SIMULATION_MODE=true: skipping Docker network changes"
        return
    fi
    for container in $containers; do
        local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
        for cid in $container_ids; do
            if [ -n "$cid" ]; then
                log "Disconnecting network for container $cid"
                # Record the original network before disconnecting so cleanup can restore it
                local network_name
                network_name=$(docker inspect "$cid" --format '{{.HostConfig.NetworkMode}}' 2>/dev/null || true)
                echo "$cid:$network_name" >> /tmp/network_failure_state
                docker network disconnect "$network_name" "$cid" 2>/dev/null || true
            fi
        done
    done
 }
 # Disk failure simulation
 simulate_disk_failure() {
    local containers=$(get_target_containers)
    log "Simulating disk space exhaustion on containers: $containers"
    if [ "${SIMULATION_MODE:-false}" = true ]; then
        log "SIMULATION_MODE=true: skipping container disk writes"
        return
    fi
    for container in $containers; do
        local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
        for cid in $container_ids; do
            if [ -n "$cid" ]; then
                log "Filling disk space in container $cid"
                # Create a large file to consume disk space
                local fill_size_mb=100
                case "$INTENSITY" in
                    low) fill_size_mb=50 ;;
                    medium) fill_size_mb=100 ;;
                    high) fill_size_mb=500 ;;
                    critical) fill_size_mb=1024 ;;
                esac
                docker exec "$cid" bash -c "dd if=/dev/zero of=/tmp/disk_fill bs=1M count=${fill_size_mb}" 2>/dev/null || true
                echo "$cid:disk_fill" >> /tmp/disk_failure_state
            fi
        done
    done
 }
 # Service failure simulation
 simulate_service_failure() {
    local containers=$(get_target_containers)
    log "Simulating service failures on containers: $containers"
    if [ "${SIMULATION_MODE:-false}" = true ]; then
        for container in $containers; do
            log "SIMULATION_MODE=true: would stop services in $container"
        done
        return
    fi
    for container in $containers; do
        local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
        for cid in $container_ids; do
            if [ -n "$cid" ]; then
                log "Stopping services in container $cid"
                # Stop common services
                docker exec "$cid" systemctl stop nginx 2>/dev/null || true
                docker exec "$cid" systemctl stop postgresql 2>/dev/null || true
                docker exec "$cid" systemctl stop haproxy 2>/dev/null || true
                echo "$cid:services" >> /tmp/service_failure_state
            fi
        done
    done
 }
 # Node failure simulation
 simulate_node_failure() {
    local containers=$(get_target_containers)
    log "Simulating complete node failures on containers: $containers"
    if [ "${SIMULATION_MODE:-false}" = true ]; then
        log "SIMULATION_MODE=true: skipping container pause"
        return
    fi
    for container in $containers; do
        local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
        for cid in $container_ids; do
            if [ -n "$cid" ]; then
                log "Pausing container $cid (node failure)"
                docker pause "$cid"
                echo "$cid:paused" >> /tmp/node_failure_state
            fi
        done
    done
 }
 # CPU stress simulation
 simulate_cpu_failure() {
    local containers=$(get_target_containers)
    log "Simulating CPU stress on containers: $containers"
    if [ "${SIMULATION_MODE:-false}" = true ]; then
        log "SIMULATION_MODE=true: skipping CPU stress"
        return
    fi
    for container in $containers; do
        local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
        for cid in $container_ids; do
            if [ -n "$cid" ]; then
                log "Starting CPU stress in container $cid"
                # Start CPU stress process
                docker exec -d "$cid" bash -c "while true; do :; done" 2>/dev/null || true
                echo "$cid:cpu_stress:$(docker exec "$cid" ps aux | grep "while true" | grep -v grep | awk '{print $2}' | head -1)" >> /tmp/cpu_failure_state
            fi
        done
    done
 }
 # Memory stress simulation
 simulate_memory_failure() {
    local containers=$(get_target_containers)
    log "Simulating memory exhaustion on containers: $containers"
    if [ "${SIMULATION_MODE:-false}" = true ]; then
        log "SIMULATION_MODE=true: skipping memory stress"
        return
    fi
    for container in $containers; do
        local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
        for cid in $container_ids; do
            if [ -n "$cid" ]; then
                log "Starting memory stress in container $cid"
                # Start memory stress process
                docker exec -d "$cid" bash -c "tail /dev/zero" 2>/dev/null || true
                echo "$cid:memory_stress:$(docker exec "$cid" ps aux | grep "tail /dev/zero" | grep -v grep | awk '{print $2}' | head -1)" >> /tmp/memory_failure_state
            fi
        done
    done
 }
 # Inject failure
 inject_failure() {
    case "$FAILURE_TYPE" in
        network) simulate_network_failure ;;
        disk) simulate_disk_failure ;;
        service) simulate_service_failure ;;
        node) simulate_node_failure ;;
        cpu) simulate_cpu_failure ;;
        memory) simulate_memory_failure ;;
    esac
 }
 # Cleanup failure
 cleanup_failure() {
    log "Cleaning up failure simulation"
    # Restore network connections
    if [ -f /tmp/network_failure_state ]; then
        while IFS=: read -r cid network; do
            docker network connect "$network" "$cid" 2>/dev/null || true
        done < /tmp/network_failure_state
        rm -f /tmp/network_failure_state
    fi
    # Clean up disk fill files
    if [ -f /tmp/disk_failure_state ]; then
        while IFS=: read -r cid _; do
            docker exec "$cid" rm -f /tmp/disk_fill 2>/dev/null || true
        done < /tmp/disk_failure_state
        rm -f /tmp/disk_failure_state
    fi
    # Restart services
    if [ -f /tmp/service_failure_state ]; then
        while IFS=: read -r cid _; do
            docker exec "$cid" systemctl start nginx 2>/dev/null || true
            docker exec "$cid" systemctl start postgresql 2>/dev/null || true
            docker exec "$cid" systemctl start haproxy 2>/dev/null || true
        done < /tmp/service_failure_state
        rm -f /tmp/service_failure_state
    fi
    # Unpause containers
    if [ -f /tmp/node_failure_state ]; then
        while IFS=: read -r cid _; do
            docker unpause "$cid" 2>/dev/null || true
        done < /tmp/node_failure_state
        rm -f /tmp/node_failure_state
    fi
    # Kill stress processes
    if [ -f /tmp/cpu_failure_state ]; then
        while IFS=: read -r cid _ pid; do
            docker exec "$cid" kill -9 "$pid" 2>/dev/null || true
        done < /tmp/cpu_failure_state
        rm -f /tmp/cpu_failure_state
    fi
    if [ -f /tmp/memory_failure_state ]; then
        while IFS=: read -r cid _ pid; do
            docker exec "$cid" kill -9 "$pid" 2>/dev/null || true
        done < /tmp/memory_failure_state
        rm -f /tmp/memory_failure_state
    fi
 }
 # Monitor failure
 monitor_failure() {
    local end_time=$(( $(date +%s) + DURATION ))
    log "Monitoring failure for $DURATION seconds"
    while [ "$(date +%s)" -lt "$end_time" ]; do
        if [ "${SIMULATION_MODE:-false}" = true ]; then
            log "SIMULATION_MODE=true: validation simulated"
            return
        fi
        # Check container status
        if ! docker compose ps | grep -q "Up\|Paused"; then
            log "WARNING: All containers are down"
        fi
        # Log system metrics
        log "System status: $(docker stats --no-stream --format 'table {{.Container}}\t{{.CPUPerc}}\t{{.MemPerc}}' | tail -n +2)"
        sleep 10
    done
 }
 # Generate failure report
 generate_report() {
    local report_file="reports/failure_simulation_$(date +%Y%m%d_%H%M%S).txt"
    cat > "$report_file" << EOF
 Failure Simulation Report
 ========================
 Timestamp: $(date)
 Failure Type: $FAILURE_TYPE
 Duration: $DURATION seconds
 Target Nodes: $TARGET_NODES
 Intensity: $INTENSITY
 Pre-failure Status:
 $(docker compose ps 2>/dev/null || echo "Docker Compose not running")
 Post-failure Status:
 $(docker compose ps 2>/dev/null || echo "Docker Compose not running")
 Log File: $LOG_FILE
 EOF
    log "Failure simulation report generated: $report_file"
 }
 # Main execution
 main() {
    log "Starting failure simulation: $FAILURE_TYPE for $DURATION seconds"
    validate_inputs
    # Inject failure
    inject_failure
    # Monitor during failure
    monitor_failure
    # Cleanup
    cleanup_failure
    # Generate report
    generate_report
    log "Failure simulation completed successfully"
 }
 # Trap for cleanup on script exit
 trap cleanup_failure EXIT
 # Initialize logging
 mkdir -p logs reports
 # Run main function
 main "$@"
@@ -0,0 +1,229 @@
 #!/bin/bash
 # Enterprise Infrastructure Scaling Simulation Script
 # Simulates scaling operations for infrastructure nodes
 set -euo pipefail
 # Configuration
 DOCKER_COMPOSE_FILE="docker-compose.yml"
 INVENTORY_FILE="inventory/hosts.ini"
 LOG_FILE="logs/scaling_simulation.log"
 # Default values
 DIRECTION="${1:-up}"
 COUNT="${2:-1}"
 NODE_TYPE="${3:-web}"
 SIMULATION_MODE="${SIMULATION_MODE:-false}"
 # Logging function
 log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "$LOG_FILE"
 }
 # Error handling
 error_exit() {
    log "ERROR: $1"
    exit 1
 }
 # Validate inputs
 validate_inputs() {
    if [[ "$DIRECTION" != "up" && "$DIRECTION" != "down" ]]; then
        error_exit "Invalid direction: $DIRECTION. Must be 'up' or 'down'"
    fi
    if ! [[ "$COUNT" =~ ^[0-9]+$ ]] || [ "$COUNT" -lt 1 ]; then
        error_exit "Invalid count: $COUNT. Must be a positive integer"
    fi
    case "$NODE_TYPE" in
        web|db|lb|monitor) ;;
        *) error_exit "Invalid node type: $NODE_TYPE. Must be web, db, lb, or monitor" ;;
    esac
 }
 # Get current node count
 get_current_count() {
    local type="$1"
    if [ "$SIMULATION_MODE" = true ]; then
        case "$type" in
            web) echo 3 ;;
            db) echo 2 ;;
            lb|monitor) echo 1 ;;
        esac
        return
    fi
    case "$type" in
        web) docker compose ps web | grep -c "Up" ;;
        db) docker compose ps db | grep -c "Up" ;;
        lb) docker compose ps lb | grep -c "Up" ;;
        monitor) docker compose ps monitor | grep -c "Up" ;;
    esac
 }
 # Scale up infrastructure
 scale_up() {
    local type="$1"
    local count="$2"
    log "Scaling up $count $type nodes"
    if [ "$SIMULATION_MODE" = true ]; then
        log "SIMULATION_MODE=true: skipping Docker Compose mutation and Ansible provisioning"
        update_inventory "$type" "$count" "add"
        log "Successfully simulated scale up of $count $type nodes"
        return
    fi
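    # --scale sets the total replica count, so add the requested nodes to the current count
    local current_count
    current_count=$(get_current_count "$type") || current_count=0
    local target_count=$(( current_count + count ))
    docker compose -f "$DOCKER_COMPOSE_FILE" up -d --scale "${type}=${target_count}"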
    # Wait for containers to be ready
    log "Waiting for containers to be ready..."
    sleep 30
    # Update inventory
    update_inventory "$type" "$count" "add"
    # Run provisioning playbook on new nodes
    if [ "$SIMULATION_MODE" = false ]; then
        ansible-playbook -i "$INVENTORY_FILE" playbooks/provision.yml --limit "${type}*"
    fi
    log "Successfully scaled up $count $type nodes"
 }
 # Scale down infrastructure
 scale_down() {
    local type="$1"
    local count="$2"
    local current_count=$(get_current_count "$type")
    if [ "$current_count" -lt "$count" ]; then
        error_exit "Cannot scale down $count nodes. Only $current_count $type nodes currently running"
    fi
    log "Scaling down $count $type nodes"
    # Select nodes to remove (oldest first)
    if [ "$SIMULATION_MODE" = true ]; then
        log "SIMULATION_MODE=true: skipping Docker Compose mutation and Ansible decommissioning"
        update_inventory "$type" "$count" "remove"
        log "Successfully simulated scale down of $count $type nodes"
        return
    fi
    local nodes_to_remove=$(docker compose ps "$type" | grep "Up" | head -n "$count" | awk '{print $1}')
    # Decommission nodes
    for node in $nodes_to_remove; do
        if [ "$SIMULATION_MODE" = false ]; then
            ansible-playbook -i "$INVENTORY_FILE" playbooks/decommission.yml --limit "$node"
        fi
        docker stop "$node"
        docker rm "$node"
    done
    # Update inventory
    update_inventory "$type" "$count" "remove"
    log "Successfully scaled down $count $type nodes"
 }
 # Update Ansible inventory
 update_inventory() {
    local type="$1"
    local count="$2"
    local action="$3"
    log "Updating inventory for $action $count $type nodes"
    # This would be more complex in a real implementation
    # For simulation, we'll just log the action
    case "$action" in
        add)
            log "Added $count $type nodes to inventory"
            ;;
        remove)
            log "Removed $count $type nodes from inventory"
            ;;
    esac
 }
 # Health check after scaling
 health_check() {
    log "Running health checks after scaling"
    # Check container status
    if [ "$SIMULATION_MODE" = true ]; then
        log "SIMULATION_MODE=true: health checks simulated"
        return
    fi
    if ! docker compose ps | grep -q "Up"; then
        error_exit "No containers are running after scaling"
    fi
    # Ansible ping check
    if [ "$SIMULATION_MODE" = false ]; then
        if ! ansible -i "$INVENTORY_FILE" all -m ping >/dev/null 2>&1; then
            log "WARNING: Some nodes failed Ansible ping check"
        fi
    fi
    log "Health checks completed"
 }
 # Generate scaling report
 generate_report() {
    local report_file="reports/scaling_report_$(date +%Y%m%d_%H%M%S).txt"
    cat > "$report_file" << EOF
 Scaling Simulation Report
 ========================
 Timestamp: $(date)
 Direction: $DIRECTION
 Node Type: $NODE_TYPE
 Count: $COUNT
 Simulation Mode: $SIMULATION_MODE
 Current Status:
 $(docker compose ps 2>/dev/null || echo "Docker Compose not running")
 Inventory Status:
 $(ansible -i "$INVENTORY_FILE" --list-hosts all 2>/dev/null || echo "Ansible inventory check failed")
 Log File: $LOG_FILE
 EOF
    log "Scaling report generated: $report_file"
 }
 # Main execution
 main() {
    log "Starting scaling simulation: $DIRECTION $COUNT $NODE_TYPE nodes"
    validate_inputs
    case "$DIRECTION" in
        up)
            scale_up "$NODE_TYPE" "$COUNT"
            ;;
        down)
            scale_down "$NODE_TYPE" "$COUNT"
            ;;
    esac
    health_check
    generate_report
    log "Scaling simulation completed successfully"
 }
 # Initialize logging
 mkdir -p logs reports
 # Run main function
 main "$@"
@@ -0,0 +1,20 @@
 .PHONY: run test demo down
 run:
 	docker compose up -d
 test:
 	docker compose config --quiet
 	test -f elasticsearch/config/elasticsearch.yml
 	test -f logstash/config/logstash.yml
 	test -f logstash/pipeline/logstash.conf
 	test -f kibana/config/kibana.yml
 	test -f filebeat/config/filebeat.yml
 	test -d grafana/provisioning
 	test -d grafana/dashboards
 demo:
 	bash ./scenarios/incident_simulation.sh app-errors 3
 down:
 	docker compose down
@@ -0,0 +1,98 @@
 # Log Observability ELK/Grafana
 ## Problem
 Operations teams need searchable logs and reviewable incident evidence in addition to simple OS checks. Zabbix is useful for host and service health signals; ELK/Grafana is better suited for log ingestion, error analysis, dashboards, and environment-level observability.
 ## CV Relevance
 This project supports the monitoring and troubleshooting part of the CV by showing how incident logs can be collected, parsed, searched, and reviewed. It is separate from the Zabbix project: Zabbix handles simple checks, while this project focuses on logs and observability evidence.
 ## What This Project Demonstrates
 - A local Docker Compose scaffold for Elasticsearch, Logstash, Kibana, Grafana, and Filebeat.
 - Minimal configs required for the stack to validate independently.
 - Sample logs and alert intent that can be reviewed without starting the full stack.
 - An incident simulation script for generating operational log evidence.
 This is a local demo stack. The default credentials are for non-production use only.
 ## Architecture
 ```
 Application/System Logs -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                                                       |
                                                       v
                                                    Grafana
 Incident Scenario -> Sample Logs -> Alert Rules -> Operator Review
 ```
 Core components:
 - `docker-compose.yml` defines the observability services.
 - `alerting/alert_rules.yml` records alert intent and severity.
 - `examples/` contains representative operational logs and alert output.
 - `scenarios/incident_simulation.sh` emits incident activity.
 - `grafana/`, `kibana/`, `logstash/`, `filebeat/`, and `elasticsearch/` contain minimal local configs.
 ## Quickstart
 ```bash
 cd professional-infra/log-observability-elk-grafana
 make test
 make demo
 ```
 Start the full local stack with Docker, then stop it when finished:
 ```bash
 make test
 make run
 make down
 ```
 When running locally:
 - Kibana: `http://localhost:5601`
 - Grafana: `http://localhost:3000`
 - Elasticsearch: `http://localhost:9200`
 Default demo credentials:
 - Elasticsearch/Kibana: `elastic` / `elastic`
 - Grafana: `admin` / `admin`
 ## Validation
 ```bash
 make test
 docker compose config --quiet
 ```
 `make test` also checks that all bind-mounted config files and directories exist.
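 The existence checks can also be run by hand. The sketch below is a rough equivalent that reports every missing path instead of stopping at the first one. `missing_paths` is a hypothetical helper and is not part of the Makefile; the paths are the ones the `test` target checks.

```shell
#!/usr/bin/env bash
# Sketch only: report all missing bind-mount sources in one pass.

# Prints each argument that does not exist and
# returns non-zero if at least one path is missing.
missing_paths() {
    local status=0 path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path"
            status=1
        fi
    done
    return "$status"
}

# Run from professional-infra/log-observability-elk-grafana.
missing_paths \
    elasticsearch/config/elasticsearch.yml \
    logstash/config/logstash.yml \
    logstash/pipeline/logstash.conf \
    kibana/config/kibana.yml \
    filebeat/config/filebeat.yml \
    grafana/provisioning \
    grafana/dashboards \
    || echo "create the paths listed above before running make run"
```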
 ## Example Output
 ```text
 [2026-04-29 04:18:23] WARN Database connection pool nearing capacity
 [2026-04-29 04:18:28] ERROR Database connection pool exhausted
 [2026-04-29 04:18:33] ERROR Database query timeout occurred
 [2026-04-29 04:18:44] INFO Database connections restored
 ```
 Additional examples are available in [examples/alert-output.txt](examples/alert-output.txt) and [examples/sample-log.txt](examples/sample-log.txt).
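 A quick way to review log evidence without starting the stack is to count entries per level. The sketch below assumes every line follows the `[YYYY-MM-DD HH:MM:SS] LEVEL message` layout shown in the example output. `count_by_level` is a hypothetical helper and is not part of the project scripts.

```shell
#!/usr/bin/env bash
# Sketch only: summarize log lines by level.

# Reads log lines on stdin; the level is the third whitespace-separated
# field, after the bracketed date and time.
count_by_level() {
    awk '{ counts[$3]++ } END { for (level in counts) print level, counts[level] }' | sort
}

# The four lines from the example output above.
printf '%s\n' \
    '[2026-04-29 04:18:23] WARN Database connection pool nearing capacity' \
    '[2026-04-29 04:18:28] ERROR Database connection pool exhausted' \
    '[2026-04-29 04:18:33] ERROR Database query timeout occurred' \
    '[2026-04-29 04:18:44] INFO Database connections restored' \
    | count_by_level
# Prints:
# ERROR 2
# INFO 1
# WARN 1
```

 The same helper could read `examples/sample-log.txt` on stdin, provided that file uses the same line layout.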
 ## Interview Talking Points
 - When to use Zabbix checks versus ELK log analysis.
 - How Filebeat, Logstash, and Elasticsearch fit into a basic log pipeline.
 - How incident simulations create evidence for troubleshooting discussions.
 - Why local demo credentials and single-node Elasticsearch are not production architecture.
 ## Roadmap
 - Add curated Grafana and Kibana dashboards.
 - Add Prometheus metrics collection.
 - Add distributed tracing with Jaeger or OpenTelemetry.
 - Add synthetic monitoring checks.
@@ -0,0 +1,326 @@
 # Enterprise Observability Alert Rules
 # Alert definitions for automated incident detection and notification
 alert_rules:
  # System Resource Alerts
  - name: "High CPU Usage"
    description: "CPU utilization exceeds threshold"
    condition: "cpu_usage_percent > 90"
    duration: "5m"
    severity: "critical"
    tags:
      - system
      - performance
    channels:
      - email
      - slack
    labels:
      team: "platform"
      component: "system"
  - name: "High Memory Usage"
    description: "Memory utilization exceeds threshold"
    condition: "memory_usage_percent > 85"
    duration: "3m"
    severity: "warning"
    tags:
      - system
      - memory
    channels:
      - email
    labels:
      team: "platform"
      component: "system"
  - name: "Disk Space Critical"
    description: "Disk usage exceeds critical threshold"
    condition: "disk_usage_percent > 95"
    duration: "2m"
    severity: "critical"
    tags:
      - storage
      - disk
    channels:
      - email
      - pagerduty
    labels:
      team: "platform"
      component: "storage"
  - name: "Disk Space Warning"
    description: "Disk usage exceeds warning threshold"
    condition: "disk_usage_percent > 85"
    duration: "10m"
    severity: "warning"
    tags:
      - storage
      - disk
    channels:
      - email
    labels:
      team: "platform"
      component: "storage"
  # Service Availability Alerts
  - name: "Service Down"
    description: "Critical service is not responding"
    condition: "service_status == 'down' OR http_status_code >= 500"
    duration: "2m"
    severity: "critical"
    tags:
      - service
      - availability
    channels:
      - email
      - slack
      - pagerduty
    labels:
      team: "application"
      component: "service"
  - name: "Database Connection Failed"
    description: "Database connection pool exhausted or unresponsive"
    condition: "db_connections_active == 0 OR db_response_time > 5000"
    duration: "1m"
    severity: "critical"
    tags:
      - database
      - connectivity
    channels:
      - email
      - pagerduty
    labels:
      team: "database"
      component: "postgresql"
  - name: "Cache Unavailable"
    description: "Cache hit ratio has collapsed or cache responses are slow"
    condition: "cache_hit_ratio < 0.1 OR cache_response_time > 1000"
    duration: "3m"
    severity: "warning"
    tags:
      - cache
      - performance
    channels:
      - email
    labels:
      team: "infrastructure"
      component: "redis"
  # Application Performance Alerts
  - name: "High Error Rate"
    description: "Application error rate exceeds threshold"
    condition: "error_rate_percent > 5"
    duration: "5m"
    severity: "critical"
    tags:
      - application
      - errors
    channels:
      - email
      - slack
    labels:
      team: "application"
      component: "api"
  - name: "Slow Response Time"
    description: "API response time exceeds SLA"
    condition: "response_time_p95 > 2000"
    duration: "5m"
    severity: "warning"
    tags:
      - application
      - performance
    channels:
      - email
    labels:
      team: "application"
      component: "api"
  - name: "High Request Queue"
    description: "Request queue depth is too high"
    condition: "queue_depth > 100"
    duration: "3m"
    severity: "warning"
    tags:
      - application
      - queue
    channels:
      - email
    labels:
      team: "application"
      component: "queue"
  # Infrastructure Alerts
  - name: "Network Latency High"
    description: "Network round-trip time exceeds threshold"
    condition: "network_rtt > 100"
    duration: "5m"
    severity: "warning"
    tags:
      - network
      - latency
    channels:
      - email
    labels:
      team: "network"
      component: "infrastructure"
  - name: "Load Balancer Unhealthy"
    description: "Load balancer backend servers are unhealthy"
    condition: "lb_unhealthy_backends > 0"
    duration: "2m"
    severity: "critical"
    tags:
      - loadbalancer
      - availability
    channels:
      - email
      - pagerduty
    labels:
      team: "infrastructure"
      component: "loadbalancer"
  # Security Alerts
  - name: "Failed Login Attempts"
    description: "Multiple failed authentication attempts detected"
    condition: "failed_login_attempts > 5"
    duration: "5m"
    severity: "warning"
    tags:
      - security
      - authentication
    channels:
      - email
      - slack
    labels:
      team: "security"
      component: "authentication"
  - name: "Suspicious Network Traffic"
    description: "Unusual network traffic patterns detected"
    condition: "network_bytes_unusual > 1000000"
    duration: "10m"
    severity: "warning"
    tags:
      - security
      - network
    channels:
      - email
    labels:
      team: "security"
      component: "network"
  # Log-based Alerts
  - name: "Application Errors"
    description: "High volume of application error logs"
    condition: "log_errors_per_minute > 10"
    duration: "2m"
    severity: "warning"
    tags:
      - logs
      - errors
    channels:
      - email
    labels:
      team: "application"
      component: "logs"
  - name: "Out of Memory Errors"
    description: "Out of memory errors detected in logs"
    condition: "log_oom_errors > 0"
    duration: "1m"
    severity: "critical"
    tags:
      - memory
      - errors
    channels:
      - email
      - pagerduty
    labels:
      team: "application"
      component: "memory"
  # Business Logic Alerts
  - name: "Low Business Transactions"
    description: "Business transaction volume below expected threshold"
    condition: "business_transactions_per_hour < 100"
    duration: "15m"
    severity: "warning"
    tags:
      - business
      - transactions
    channels:
      - email
    labels:
      team: "business"
      component: "transactions"
  - name: "Payment Failures"
    description: "Payment processing failure rate is high"
    condition: "payment_failure_rate > 0.05"
    duration: "5m"
    severity: "critical"
    tags:
      - payments
      - business
    channels:
      - email
      - pagerduty
    labels:
      team: "payments"
      component: "processing"
 # Alert Channels Configuration
 alert_channels:
  email:
    type: "email"
    recipients:
      - "platform-team@company.com"
      - "oncall@company.com"
    subject_template: "[{{severity}}] {{name}} - {{description}}"
  slack:
    type: "slack"
    webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    channel: "#alerts"
    username: "Observability Bot"
    icon_emoji: ":warning:"
  pagerduty:
    type: "pagerduty"
    integration_key: "your-pagerduty-integration-key"
    severity_mapping:
      critical: "critical"
      warning: "warning"
      info: "info"
 # Alert Silencing Rules
 silence_rules:
  - name: "Maintenance Window"
    condition: "maintenance_window == true"
    duration: "4h"
    comment: "Silenced during scheduled maintenance"
  - name: "Known Issue"
    condition: "known_issue_id == 'TICKET-123'"
    duration: "24h"
    comment: "Silenced for known issue resolution"
 # Escalation Policies
 escalation_policies:
  - name: "Default Escalation"
    steps:
      - delay: "5m"
        channels: ["email"]
      - delay: "15m"
        channels: ["slack"]
      - delay: "30m"
        channels: ["pagerduty"]
  - name: "Critical Escalation"
    steps:
      - delay: "0m"
        channels: ["email", "slack", "pagerduty"]
      - delay: "10m"
        channels: ["pagerduty"]  # Second step: page again after 10 minutes
@@ -0,0 +1,120 @@
 services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: observability-elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=elastic
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
      - ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -u elastic:elastic -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: observability-logstash
    environment:
      - "LS_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logs:/usr/share/logstash/logs
    ports:
      - "5044:5044"
      - "8080:8080"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9600/_node/pipelines || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: observability-kibana
    environment:
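      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      # Kibana 8.x refuses to start with the elastic superuser, so use kibana_system.
      # Set the kibana_system password in Elasticsearch to match before starting this service.
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=elastic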
    volumes:
      - ./kibana/config/kibana.yml:/usr/share/kibana/config/kibana.yml
    ports:
      - "5601:5601"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5601/api/status || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
  grafana:
    image: grafana/grafana:10.2.0
    container_name: observability-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3
  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    container_name: observability-filebeat
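    user: root
    # The bind-mounted filebeat.yml is usually not owned by root, so relax the config ownership check.
    command: ["filebeat", "-e", "--strict.perms=false"]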
    volumes:
      - ./filebeat/config/filebeat.yml:/usr/share/filebeat/filebeat.yml
      - ./logs:/var/log/sample
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - observability
    depends_on:
      - logstash
    restart: unless-stopped
 volumes:
  elasticsearch_data:
    driver: local
  grafana_data:
    driver: local
 networks:
  observability:
    driver: bridge
    ipam:
      config:
        - subnet: 172.25.0.0/16
@@ -0,0 +1,30 @@
 # Log Observability ELK/Grafana Architecture
 ## Components
 - Filebeat: tails sample logs from `logs/`. Container log paths are mounted, but no container input is configured yet.
 - Logstash: receives and processes log events.
 - Elasticsearch: stores searchable observability data.
 - Kibana: supports log exploration and dashboards.
 - Grafana: provides operational dashboards.
 - Alert rules: document symptoms, thresholds, and severity.
 - Incident simulation: generates controlled failure signals.
 ## Data Flow
 ```
 Log source -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                                            |
                                            v
                                         Grafana
 ```
 Incident exercises follow this flow:
 ```
 Operator -> incident_simulation.sh -> logs/incident_simulation.log -> Filebeat -> Logstash -> alerts/dashboards
 ```
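The Logstash grok stage in this flow splits each simulation line into a timestamp, a level, and a message. The following is a minimal Python sketch of the same split, assuming lines shaped like `[2026-04-29 04:18:21] ERROR ...` as written by `incident_simulation.sh`. The `parse_line` helper and the sample lines are illustrative and are not part of the repository.

```python
import re
from collections import Counter

# Mirrors the "[timestamp] LEVEL message" shape written by incident_simulation.sh.
LINE_RE = re.compile(
    r"^\[(?P<observed_at>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>[A-Z]+) (?P<event_message>.*)$"
)


def parse_line(line):
    """Return a dict of named fields, or None when the line does not match."""
    match = LINE_RE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


# Hypothetical sample lines; the last one stands in for a parse failure.
sample = [
    "[2026-04-29 04:18:21] INFO Starting database issues simulation for 25 seconds",
    "[2026-04-29 04:18:26] ERROR Database connection pool exhausted",
    "not a simulation line",
]

parsed = [parse_line(line) for line in sample]
level_counts = Counter(p["level"] for p in parsed if p)
unparsed = sum(1 for p in parsed if p is None)
print(dict(level_counts), "unparsed:", unparsed)
```

Counting unparsed lines separately plays the same role as the `portfolio_parse_failure` tag in the pipeline: it keeps format drift visible instead of silently dropping events.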
 ## Notes
 This is a local demonstration stack, not a production Elasticsearch deployment. A production version would add dedicated nodes, TLS, secret management, retention policies, index lifecycle management, and external alert delivery.
@@ -0,0 +1,4 @@
 cluster.name: portfolio-observability
 node.name: elasticsearch-demo
 network.host: 0.0.0.0
 xpack.security.enabled: true
@@ -0,0 +1,4 @@
 2026-04-29T04:19:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=100 threshold=95 status=firing
 2026-04-29T04:19:30Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=7.8 threshold=5.0 status=firing
 2026-04-29T04:22:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=71 threshold=95 status=resolved
 2026-04-29T04:23:15Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=1.2 threshold=5.0 status=resolved
@@ -0,0 +1,5 @@
 2026-04-29T04:18:21Z INFO service=checkout-api host=app-web-02 request_id=8f4b2 path=/checkout status=200 latency_ms=142
 2026-04-29T04:18:28Z WARN service=checkout-api host=app-web-02 event=db_pool_pressure active=92 max=100
 2026-04-29T04:18:33Z ERROR service=checkout-api host=app-web-02 event=db_timeout query=CreateOrder timeout_ms=5000 customer_tier=enterprise
 2026-04-29T04:18:39Z ERROR service=checkout-api host=app-web-02 event=payment_retry_exhausted order_id=ord-104288 provider=stripe
 2026-04-29T04:18:44Z INFO service=checkout-api host=app-web-02 event=recovery db_pool_active=48
@@ -0,0 +1,11 @@
 filebeat.inputs:
  - type: filestream
    id: portfolio-sample-logs
    enabled: true
    paths:
      - /var/log/sample/*.log
 output.logstash:
  hosts: ["logstash:5044"]
 logging.level: info
@@ -0,0 +1,3 @@
 # Dashboards
 This directory is reserved for local demo dashboards. The current portfolio scope validates the observability stack scaffold, sample logs, alert intent, and incident simulation without claiming production-ready dashboards.
@@ -0,0 +1,14 @@
 apiVersion: 1
 datasources:
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    basicAuth: true
    basicAuthUser: elastic
    jsonData:
      index: portfolio-logs-*
      timeField: "@timestamp"
    secureJsonData:
      basicAuthPassword: elastic
@@ -0,0 +1,4 @@
 server.host: 0.0.0.0
 elasticsearch.hosts: ["http://elasticsearch:9200"]
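 # Kibana 8.x rejects the elastic superuser here; kibana_system needs its password set in Elasticsearch first.
 elasticsearch.username: kibana_system
 elasticsearch.password: elastic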
@@ -0,0 +1,2 @@
 http.host: 0.0.0.0
 pipeline.ecs_compatibility: disabled
@@ -0,0 +1,24 @@
 input {
  beats {
    port => 5044
  }
 }
 filter {
  grok {
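    # Brackets are optional so that both "[timestamp] LEVEL message" and "timestamp LEVEL key=value" lines parse.
    match => { "message" => "\[?%{TIMESTAMP_ISO8601:observed_at}\]? %{LOGLEVEL:level} %{GREEDYDATA:event_message}" }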
    tag_on_failure => ["portfolio_parse_failure"]
  }
 }
 output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    user => "elastic"
    password => "elastic"
    index => "portfolio-logs-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
 }
@@ -0,0 +1,21 @@
 # Scenario: Incident Simulation
 ## Description
 Generate a controlled application and infrastructure incident so the logging pipeline, alert rules, and dashboards can be reviewed with realistic event timing.
 ## Commands
 ```bash
 cd professional-infra/log-observability-elk-grafana
 docker compose config
 ./scenarios/incident_simulation.sh comprehensive
 tail -n 40 logs/incident_simulation.log
 ```
 ## Expected Result
 - The compose file validates successfully.
 - The simulation writes a sequence of CPU, memory, service, database, and application error events.
 - Alert examples indicate firing and resolved states.
 - Operators can trace incident progression through logs and dashboard queries.
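One way to review the firing and resolved states is to pair each alert's two events and measure how long it stayed open. The sketch below assumes the `timestamp key=value ...` layout of the sample alert log in this project and reuses its four lines as input. The `parse_alert` and `firing_durations` helpers are hypothetical and exist only for this illustration.

```python
from datetime import datetime

# Lines copied from the sample alert log in this project.
ALERT_LINES = [
    "2026-04-29T04:19:00Z alert=database_connection_pool_exhausted severity=critical "
    "service=checkout-api host=app-web-02 value=100 threshold=95 status=firing",
    "2026-04-29T04:19:30Z alert=api_error_rate_high severity=warning "
    "service=checkout-api host=app-web-02 value=7.8 threshold=5.0 status=firing",
    "2026-04-29T04:22:00Z alert=database_connection_pool_exhausted severity=critical "
    "service=checkout-api host=app-web-02 value=71 threshold=95 status=resolved",
    "2026-04-29T04:23:15Z alert=api_error_rate_high severity=warning "
    "service=checkout-api host=app-web-02 value=1.2 threshold=5.0 status=resolved",
]


def parse_alert(line):
    """Split '<timestamp> key=value ...' into a dict with a parsed timestamp."""
    stamp, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    fields["at"] = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return fields


def firing_durations(lines):
    """Return seconds between each alert's firing and resolved events."""
    opened = {}
    durations = {}
    for event in map(parse_alert, lines):
        key = (event["alert"], event["host"])
        if event["status"] == "firing":
            opened[key] = event["at"]
        elif event["status"] == "resolved" and key in opened:
            elapsed = event["at"] - opened.pop(key)
            durations[event["alert"]] = elapsed.total_seconds()
    return durations


durations = firing_durations(ALERT_LINES)
for name, seconds in durations.items():
    print(f"{name}: open for {seconds:.0f}s")
```

Keying open alerts on both alert name and host keeps the same alert on two hosts from closing each other.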
@@ -0,0 +1,318 @@
 #!/bin/bash
 # Enterprise Incident Simulation Script
 # Simulates various failure scenarios for testing observability stack
 set -e
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
 LOG_FILE="$PROJECT_ROOT/logs/incident_simulation.log"
 # Colors for output
 RED='\033[0;31m'
 GREEN='\033[0;32m'
 YELLOW='\033[1;33m'
 BLUE='\033[0;34m'
 NC='\033[0m' # No Color
 # Logging function
 log() {
    local level=$1
    local message=$2
    local timestamp
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] $level $message" >> "$LOG_FILE"
    echo -e "${BLUE}[$timestamp]${NC} $level $message"
 }
 # Function to simulate CPU spike
 simulate_cpu_spike() {
    local duration=${1:-60}
    log "INFO" "Starting CPU spike simulation for ${duration} seconds"
    # Launch CPU-intensive processes and track their PIDs in a function-local array
    local -a pids=()
    local i pid
    for i in {1..4}; do
        (
            end_time=$((SECONDS + duration))
            while [ $SECONDS -lt $end_time ]; do
                # CPU-intensive calculation
                result=0
                for j in {1..100000}; do
                    result=$((result + j))
                done
            done
        ) &
        pids+=("$!")
    done
    # Wait for simulation to complete
    for pid in "${pids[@]}"; do
        wait "$pid" 2>/dev/null || true
    done
    log "INFO" "CPU spike simulation completed"
 }
 # Function to simulate memory leak
 simulate_memory_leak() {
    local duration=${1:-30}
    log "INFO" "Starting memory leak simulation for ${duration} seconds"
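    # Grow a shell string by one character every 0.1 s. This produces a light
    # memory-growth signal for the logs; it does not create real memory pressure.
    (
        data=""
        end_time=$((SECONDS + duration))
        while [ $SECONDS -lt $end_time ]; do
            # Append one character per iteration
            data="${data}X"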
            sleep 0.1
        done
    ) &
    MEM_PID=$!
    wait $MEM_PID 2>/dev/null || true
    log "INFO" "Memory leak simulation completed"
 }
 # Function to simulate disk space exhaustion
 simulate_disk_full() {
    local target_dir=${1:-"/tmp"}
    local duration=${2:-30}
    log "INFO" "Starting disk space exhaustion simulation in ${target_dir} for ${duration} seconds"
    # Create large files to fill disk space
    (
        end_time=$((SECONDS + duration))
        while [ $SECONDS -lt $end_time ]; do
            # Create 100MB file
            dd if=/dev/zero of="${target_dir}/incident_test_file_$(date +%s).tmp" bs=1M count=100 2>/dev/null || true
            sleep 2
        done
    ) &
    DISK_PID=$!
    wait $DISK_PID 2>/dev/null || true
    # Cleanup test files
    rm -f "${target_dir}"/incident_test_file_*.tmp 2>/dev/null || true
    log "INFO" "Disk space exhaustion simulation completed and cleaned up"
 }
 # Function to simulate network issues
 simulate_network_issues() {
    local interface=${1:-"lo"}
    local duration=${2:-20}
    log "INFO" "Starting network issues simulation on ${interface} for ${duration} seconds"
    # Add network delay and packet loss
    sudo tc qdisc add dev "$interface" root netem delay 100ms 50ms loss 10% 2>/dev/null || true
    sleep "$duration"
    # Remove network simulation
    sudo tc qdisc del dev "$interface" root 2>/dev/null || true
    log "INFO" "Network issues simulation completed"
 }
 # Function to simulate service crashes
 simulate_service_crash() {
    local service_name=${1:-"test-service"}
    log "INFO" "Starting service crash simulation for ${service_name}"
    # Simulate service going down
    log "ERROR" "Service ${service_name} crashed unexpectedly"
    sleep 5
    log "INFO" "Service ${service_name} restarted automatically"
    # Simulate multiple crashes
    for i in {1..3}; do
        sleep 2
        log "ERROR" "Service ${service_name} crashed again (attempt $i)"
        sleep 1
        log "INFO" "Service ${service_name} recovered after crash $i"
    done
 }
 # Function to simulate database issues
 simulate_database_issues() {
    local duration=${1:-25}
    # Spread the four phases evenly across the requested duration
    local step=$((duration / 4))
    log "INFO" "Starting database issues simulation for ${duration} seconds"
    # Simulate connection pool exhaustion
    log "WARN" "Database connection pool nearing capacity"
    sleep "$step"
    log "ERROR" "Database connection pool exhausted"
    sleep "$step"
    log "ERROR" "Database query timeout occurred"
    sleep "$step"
    log "WARN" "Database connections recovering"
    sleep "$step"
    log "INFO" "Database connections restored"
    log "INFO" "Database issues simulation completed"
 }
 # Function to simulate application errors
 simulate_application_errors() {
    local error_count=${1:-10}
    log "INFO" "Starting application error simulation (${error_count} errors)"
    for i in $(seq 1 "$error_count"); do
        case $((RANDOM % 4)) in
            0)
                log "ERROR" "NullPointerException in UserService.getUser($i)"
                ;;
            1)
                log "ERROR" "TimeoutException: Database query timed out for user ID: $i"
                ;;
            2)
                log "ERROR" "ValidationException: Invalid input data for request $i"
                ;;
            3)
                log "ERROR" "IOException: Failed to write to log file"
                ;;
        esac
        sleep $((RANDOM % 3 + 1))
    done
    log "INFO" "Application error simulation completed"
 }
 # Function to run comprehensive incident scenario
 run_comprehensive_scenario() {
    log "INFO" "Starting comprehensive incident scenario simulation"
    # Phase 1: Initial system stress
    log "INFO" "Phase 1: System stress simulation"
    simulate_cpu_spike 30 &
    CPU_PID=$!
    simulate_memory_leak 20 &
    MEM_PID=$!
    sleep 10
    # Phase 2: Service degradation
    log "INFO" "Phase 2: Service degradation simulation"
    simulate_service_crash "web-service" &
    SERVICE_PID=$!
    sleep 5
    # Phase 3: Database issues
    log "INFO" "Phase 3: Database issues simulation"
    simulate_database_issues 15 &
    DB_PID=$!
    # Phase 4: Application errors
    log "INFO" "Phase 4: Application error burst"
    simulate_application_errors 15 &
    APP_PID=$!
    # Phase 5: Infrastructure issues
    log "INFO" "Phase 5: Infrastructure issues simulation"
    simulate_disk_full "/tmp" 10 &
    DISK_PID=$!
    # Wait for all simulations to complete
    wait $CPU_PID 2>/dev/null || true
    wait $MEM_PID 2>/dev/null || true
    wait $SERVICE_PID 2>/dev/null || true
    wait $DB_PID 2>/dev/null || true
    wait $APP_PID 2>/dev/null || true
    wait $DISK_PID 2>/dev/null || true
    log "INFO" "Comprehensive incident scenario completed"
 }
 # Function to show usage
 show_usage() {
    echo "Enterprise Incident Simulation Script"
    echo "Usage: $0 [SCENARIO] [OPTIONS]"
    echo ""
    echo "SCENARIOS:"
    echo "  cpu [DURATION]          - Simulate CPU spike (default: 60s)"
    echo "  memory [DURATION]       - Simulate memory leak (default: 30s)"
    echo "  disk [DIR] [DURATION]   - Simulate disk space exhaustion (default: /tmp, 30s)"
    echo "  network [INTERFACE] [DURATION] - Simulate network issues (default: lo, 20s)"
    echo "  service [NAME]          - Simulate service crashes (default: test-service)"
    echo "  database [DURATION]     - Simulate database issues (default: 25s)"
    echo "  app-errors [COUNT]      - Simulate application errors (default: 10)"
    echo "  comprehensive           - Run full incident scenario"
    echo "  all                     - Run all individual scenarios sequentially"
    echo ""
    echo "EXAMPLES:"
    echo "  $0 cpu 120              - CPU spike for 2 minutes"
    echo "  $0 disk /var/log 45     - Disk full simulation in /var/log for 45 seconds"
    echo "  $0 comprehensive        - Full incident simulation"
    echo ""
 }
 # Main execution
 main() {
    local scenario=${1:-"comprehensive"}
    # Create log directory if it doesn't exist
    mkdir -p "$(dirname "$LOG_FILE")"
    log "INFO" "Incident simulation script started"
    log "INFO" "Scenario: $scenario"
    case $scenario in
        "cpu")
            simulate_cpu_spike "${2:-60}"
            ;;
        "memory")
            simulate_memory_leak "${2:-30}"
            ;;
        "disk")
            simulate_disk_full "${2:-/tmp}" "${3:-30}"
            ;;
        "network")
            simulate_network_issues "${2:-lo}" "${3:-20}"
            ;;
        "service")
            simulate_service_crash "${2:-test-service}"
            ;;
        "database")
            simulate_database_issues "${2:-25}"
            ;;
        "app-errors")
            simulate_application_errors "${2:-10}"
            ;;
        "comprehensive")
            run_comprehensive_scenario
            ;;
        "all")
            log "INFO" "Running all scenarios sequentially"
            simulate_cpu_spike 30
            sleep 5
            simulate_memory_leak 20
            sleep 5
            simulate_disk_full "/tmp" 15
            sleep 5
            simulate_service_crash "test-service"
            sleep 5
            simulate_database_issues 15
            sleep 5
            simulate_application_errors 8
            sleep 5
            simulate_network_issues "lo" 10
            ;;
        "help"|"-h"|"--help")
            show_usage
            exit 0
            ;;
        *)
            echo -e "${RED}Error: Unknown scenario '$scenario'${NC}"
            echo ""
            show_usage
            exit 1
            ;;
    esac
    log "INFO" "Incident simulation script completed successfully"
    echo -e "${GREEN}Simulation completed. Check logs at: $LOG_FILE${NC}"
 }
 # Run main function with all arguments
 main "$@"
@@ -0,0 +1,11 @@
 .PHONY: run test demo
 run:
 	python3 cli.py --help
 test:
 	python3 -m py_compile cli.py collectors/*.py validators/*.py reports/*.py
 	python3 -m unittest discover -s tests
 demo:
 	python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
@@ -0,0 +1,87 @@
 # Migration Validation Framework
 ## Problem
 Infrastructure migrations often fail in small, expensive ways: a mount option changes, a service is disabled, or disk usage moves past an operational threshold. Teams need structured evidence that the migrated host still matches the expected operating profile.
 ## CV Relevance
 This project maps to storage/platform migration validation work: collecting pre-migration and post-migration state, comparing results, and producing evidence that can be attached to change or migration tickets.
 ## What This Project Demonstrates
 - A Python CLI for collecting and comparing system snapshots.
 - Modular collectors for mounts, services, and disk usage.
 - Risk assessment and validation checks for before/after drift.
 - JSON and HTML evidence suitable for migration review.
 - Offline tests and examples that run without remote hosts.
 ## Architecture
 ```
 Operator -> CLI -> Collectors -> JSON Snapshot -> Comparator -> Diff/Report
 ```
 Core components:
 - `cli.py` provides collect, compare, snapshot, list, and report commands.
 - `collectors/` gathers mounts, services, and disk usage.
 - `validators/compare.py` identifies drift and validation failures.
 - `reports/` contains report generation helpers with escaped HTML output.
 - `examples/` contains realistic before/after evidence.
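The drift-detection step can be sketched as a walk over two snapshots that records every value that differs. This is not the code in `validators/compare.py`. It assumes the `data -> system -> collector` nesting that `cli.py` writes and, as a simplification, treats each collector's output as a flat key/value mapping. The `diff_snapshots` helper and the sample snapshots are hypothetical.

```python
def diff_snapshots(before, after):
    """List (system, collector, key, old, new) tuples for every changed value."""
    changes = []
    systems = sorted(set(before["data"]) | set(after["data"]))
    for system in systems:
        old_sys = before["data"].get(system, {})
        new_sys = after["data"].get(system, {})
        for collector in sorted(set(old_sys) | set(new_sys)):
            old = old_sys.get(collector, {})
            new = new_sys.get(collector, {})
            for key in sorted(set(old) | set(new)):
                if old.get(key) != new.get(key):
                    changes.append(
                        (system, collector, key, old.get(key), new.get(key))
                    )
    return changes


# Hypothetical snapshots: a mount option changed and sshd stopped after migration.
before = {"data": {"web01": {
    "services": {"sshd": "running", "nginx": "running"},
    "mounts": {"/data": "rw,noatime"},
}}}
after = {"data": {"web01": {
    "services": {"sshd": "stopped", "nginx": "running"},
    "mounts": {"/data": "rw,relatime"},
}}}

changes = diff_snapshots(before, after)
for change in changes:
    print(change)
```

Because the function only reads two dictionaries, the same saved snapshots can be compared again later without touching the hosts.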
 ## Quickstart
 ```bash
 cd professional-infra/migration-validation-framework
 make test
 make demo
 ```
 The demo compares the included example snapshots:
 ```bash
 python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
 ```
 The example intentionally returns `FAIL` because it demonstrates a high-risk migration finding.
 Legacy snapshot IDs are still supported:
 ```bash
 python3 cli.py snapshot --env prod --label pre --systems web01,db01
 python3 cli.py compare prod-pre-20260429_020000 prod-post-20260429_030000 --output change-0429
 ```
 ## Validation
 ```bash
 make test
 ```
 This compiles the Python modules and runs unit tests for example comparison, parser behavior, and HTML escaping.
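As an illustration of what an HTML-escaping check guards against, the sketch below renders one report row with the standard library's `html.escape`. The `render_change_row` helper is hypothetical and only stands in for the report helpers under `reports/`, whose actual interface is not shown here.

```python
import html


def render_change_row(key, old, new):
    """Render one table row, escaping every value taken from a snapshot."""
    cells = (html.escape(str(value)) for value in (key, old, new))
    return "<tr>" + "".join(f"<td>{cell}</td>" for cell in cells) + "</tr>"


# A snapshot value containing markup must not reach the report unescaped.
row = render_change_row("motd", "ok", "<script>alert(1)</script>")
print(row)
```

Escaping at render time means the stored JSON evidence stays unmodified while the HTML report remains safe to open in a browser.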
 ## Example Output
 ```text
 Comparison completed: diff.json (FAIL)
 Overall risk: high
 Total changes: 4
 Failed checks: critical_services_running
 Recommendation: restore sshd before production cutover
 ```
 Sample inputs and output are available in [examples/before.json](examples/before.json), [examples/after.json](examples/after.json), and [examples/diff.json](examples/diff.json).
 ## Roadmap
 - Add database-specific migration checks.
 - Add performance baseline comparisons.
 - Add a REST API wrapper for CI/CD integration.
 - Add compliance-oriented validation profiles.
 ## Interview Talking Points
 - Why pre/post migration evidence reduces risk during storage and platform migrations.
 - How to separate collection from comparison so evidence can be replayed.
 - How drift detection supports change approval and rollback decisions.
@@ -0,0 +1,323 @@
 #!/usr/bin/env python3
 """
 Migration Validation Framework - CLI Interface
 A comprehensive tool for validating system migrations through data collection,
 snapshot comparison, and automated reporting.
 """
 import argparse
 import json
 import logging
 import sys
 from datetime import datetime
 from pathlib import Path
 from typing import Dict, List, Optional, Any
 # Import framework modules
 from collectors import mounts, services, disk_usage
 from validators import compare
 from reports import html_report
 # Configuration
 SNAPSHOTS_DIR = Path("snapshots")
 LOGS_DIR = Path("logs")
 REPORTS_DIR = Path("reports")
 class MigrationValidator:
    """Main migration validation class."""
    def __init__(self, verbose: bool = False):
        self.verbose = verbose
        self.ensure_directories()
        self.setup_logging()
    def setup_logging(self):
        """Configure logging."""
        log_level = logging.DEBUG if self.verbose else logging.INFO
        logging.basicConfig(
            level=log_level,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler(LOGS_DIR / "validation.log"),
                logging.StreamHandler(sys.stdout)
            ]
        )
        self.logger = logging.getLogger(__name__)
    def ensure_directories(self):
        """Ensure required directories exist."""
        for directory in [SNAPSHOTS_DIR, LOGS_DIR, REPORTS_DIR]:
            directory.mkdir(exist_ok=True)
    def collect_system_data(self, systems: List[str]) -> Dict[str, Any]:
        """Collect data from target systems."""
        self.logger.info(f"Collecting data from systems: {systems}")
        snapshot = {
            "metadata": {
                "timestamp": datetime.now().isoformat(),
                "systems": systems,
                "version": "1.0"
            },
            "data": {}
        }
        collectors = [
            ("mounts", mounts.collect),
            ("services", services.collect),
            ("disk_usage", disk_usage.collect)
        ]
        for system in systems:
            self.logger.info(f"Collecting data from {system}")
            snapshot["data"][system] = {}
            for collector_name, collector_func in collectors:
                try:
                    self.logger.debug(f"Running {collector_name} collector on {system}")
                    data = collector_func(system)
                    snapshot["data"][system][collector_name] = data
                except Exception as e:
                    self.logger.error(f"Failed to collect {collector_name} from {system}: {e}")
                    snapshot["data"][system][collector_name] = {"error": str(e)}
        return snapshot
    def save_snapshot(self, snapshot: Dict[str, Any], label: str, env: str) -> str:
        """Save snapshot to disk."""
        snapshot_id = f"{env}-{label}-{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        snapshot_file = SNAPSHOTS_DIR / f"{snapshot_id}.json"
        with open(snapshot_file, 'w') as f:
            json.dump(snapshot, f, indent=2)
        self.logger.info(f"Snapshot saved: {snapshot_id}")
        return snapshot_id
    def load_snapshot(self, snapshot_id: str) -> Dict[str, Any]:
        """Load snapshot from disk."""
        snapshot_path = Path(snapshot_id)
        snapshot_file = snapshot_path if snapshot_path.exists() else SNAPSHOTS_DIR / f"{snapshot_id}.json"
        if not snapshot_file.exists():
            raise FileNotFoundError(f"Snapshot {snapshot_id} not found")
        with open(snapshot_file, 'r') as f:
            return json.load(f)
    def collect_to_file(self, output_file: str, systems: List[str]) -> str:
        """Collect a snapshot and write it to an explicit file path."""
        snapshot = self.collect_system_data(systems)
        with open(output_file, 'w') as f:
            json.dump(snapshot, f, indent=2)
            f.write("\n")
        self.logger.info(f"Snapshot written: {output_file}")
        return output_file
    def create_snapshot(self, env: str, label: str, systems: List[str]) -> str:
        """Create and save a system snapshot."""
        self.logger.info(f"Creating snapshot for environment: {env}, label: {label}")
        snapshot = self.collect_system_data(systems)
        snapshot_id = self.save_snapshot(snapshot, label, env)
        return snapshot_id
    def compare_snapshots(self, snapshot1_id: str, snapshot2_id: str, output_id: str) -> Dict[str, Any]:
        """Compare two snapshots."""
        self.logger.info(f"Comparing snapshots: {snapshot1_id} vs {snapshot2_id}")
        snapshot1 = self.load_snapshot(snapshot1_id)
        snapshot2 = self.load_snapshot(snapshot2_id)
        comparison = compare.compare_snapshots(snapshot1, snapshot2)
        comparison["metadata"] = {
            "snapshot1": snapshot1_id,
            "snapshot2": snapshot2_id,
            "timestamp": datetime.now().isoformat(),
            "comparison_id": output_id
        }
        # Save comparison results
        comparison_file = REPORTS_DIR / f"comparison_{output_id}.json"
        with open(comparison_file, 'w') as f:
            json.dump(comparison, f, indent=2)
        self.logger.info(f"Comparison saved: {output_id}")
        return comparison
    def compare_files(self, before_file: str, after_file: str, output_file: Optional[str] = None) -> Dict[str, Any]:
        """Compare two explicit JSON snapshot files."""
        self.logger.info(f"Comparing files: {before_file} vs {after_file}")
        before = self.load_snapshot(before_file)
        after = self.load_snapshot(after_file)
        comparison = compare.compare_snapshots(before, after)
        comparison["metadata"] = {
            "before": before_file,
            "after": after_file,
            "timestamp": datetime.now().isoformat()
        }
        if output_file:
            with open(output_file, 'w') as f:
                json.dump(comparison, f, indent=2)
                f.write("\n")
            self.logger.info(f"Comparison written: {output_file}")
        return comparison
    def generate_report(self, comparison_id: str, format_type: str, output_file: Optional[str] = None) -> str:
        """Generate a report from comparison results."""
        self.logger.info(f"Generating {format_type} report for comparison: {comparison_id}")
        comparison_file = REPORTS_DIR / f"comparison_{comparison_id}.json"
        if not comparison_file.exists():
            raise FileNotFoundError(f"Comparison {comparison_id} not found")
        with open(comparison_file, 'r') as f:
            comparison = json.load(f)
        if format_type == "html":
            if output_file is None:
                output_file = f"migration_report_{comparison_id}.html"
            html_report.generate(comparison, output_file)
        elif format_type == "json":
            if output_file is None:
                output_file = f"migration_report_{comparison_id}.json"
            with open(output_file, 'w') as f:
                json.dump(comparison, f, indent=2)
        else:
            raise ValueError(f"Unsupported format: {format_type}")
        self.logger.info(f"Report generated: {output_file}")
        return output_file
 def main():
    """Main CLI entry point."""
    parser = argparse.ArgumentParser(
        description="Migration Validation Framework",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
 Examples:
  # Collect pre-migration snapshot
  python3 cli.py collect --output before.json --systems web01,db01
  # Compare snapshot files
  python3 cli.py compare before.json after.json --output diff.json
  # Generate HTML report
  python3 cli.py report --comparison 001 --format html
        """
    )
    parser.add_argument('--verbose', '-v', action='store_true', help='Enable verbose logging')
    parser.add_argument('--dry-run', action='store_true', help='Preview actions without execution')
    subparsers = parser.add_subparsers(dest='command', help='Available commands')
    # Collect command
    collect_parser = subparsers.add_parser('collect', help='Collect a system snapshot to a JSON file')
    collect_parser.add_argument('--output', required=True, help='Output JSON file')
    collect_parser.add_argument('--systems', default='localhost', help='Comma-separated list of systems')
    # Snapshot command
    snapshot_parser = subparsers.add_parser('snapshot', help='Create system snapshot')
    snapshot_parser.add_argument('--env', required=True, help='Target environment')
    snapshot_parser.add_argument('--label', required=True, help='Snapshot label')
    snapshot_parser.add_argument('--systems', required=True, help='Comma-separated list of systems')
    # Compare command
    compare_parser = subparsers.add_parser('compare', help='Compare two snapshots')
    compare_parser.add_argument('snapshot1', help='First snapshot ID, or JSON file when --output ends in .json')
    compare_parser.add_argument('snapshot2', help='Second snapshot ID, or JSON file when --output ends in .json')
    compare_parser.add_argument('--output', help='Output comparison ID or JSON file')
    # Report command
    report_parser = subparsers.add_parser('report', help='Generate report from comparison')
    report_parser.add_argument('--comparison', required=True, help='Comparison ID')
    report_parser.add_argument('--format', choices=['html', 'json'], default='html', help='Report format')
    report_parser.add_argument('--output', help='Output file path')
    # List command
    list_parser = subparsers.add_parser('list', help='List snapshots or comparisons')
    list_parser.add_argument('type', choices=['snapshots', 'comparisons'], help='Type to list')
    args = parser.parse_args()
    if not args.command:
        parser.print_help()
        return
    # Initialize validator
    validator = MigrationValidator(verbose=args.verbose)
    try:
        if args.command == 'collect':
            systems = [system.strip() for system in args.systems.split(',') if system.strip()]
            if args.dry_run:
                print(f"DRY RUN: Would collect {systems} into {args.output}")
                return
            output_file = validator.collect_to_file(args.output, systems)
            print(f"Snapshot written: {output_file}")
        elif args.command == 'snapshot':
            systems = [system.strip() for system in args.systems.split(',') if system.strip()]
            if args.dry_run:
                print(f"DRY RUN: Would create snapshot for systems: {systems}")
                return
            snapshot_id = validator.create_snapshot(args.env, args.label, systems)
            print(f"Snapshot created: {snapshot_id}")
        elif args.command == 'compare':
            if args.dry_run:
                print(f"DRY RUN: Would compare {args.snapshot1} vs {args.snapshot2}")
                return
            output = args.output
            if output and output.endswith('.json'):
                comparison = validator.compare_files(args.snapshot1, args.snapshot2, output)
                result = "PASS" if comparison.get("validation_results", {}).get("passed") else "FAIL"
                print(f"Comparison completed: {output} ({result})")
            else:
                output_id = output or datetime.now().strftime('%Y%m%d_%H%M%S')
                comparison = validator.compare_snapshots(args.snapshot1, args.snapshot2, output_id)
                result = "PASS" if comparison.get("validation_results", {}).get("passed") else "FAIL"
                print(f"Comparison completed: {output_id} ({result})")
        elif args.command == 'report':
            if args.dry_run:
                print(f"DRY RUN: Would generate {args.format} report for {args.comparison}")
                return
            output_file = validator.generate_report(args.comparison, args.format, args.output)
            print(f"Report generated: {output_file}")
        elif args.command == 'list':
            if args.type == 'snapshots':
                snapshots = list(SNAPSHOTS_DIR.glob("*.json"))
                if snapshots:
                    print("Available snapshots:")
                    for snapshot in sorted(snapshots):
                        print(f"  {snapshot.stem}")
                else:
                    print("No snapshots found")
            elif args.type == 'comparisons':
                comparisons = list(REPORTS_DIR.glob("comparison_*.json"))
                if comparisons:
                    print("Available comparisons:")
                    for comparison in sorted(comparisons):
                        comp_id = comparison.stem.replace('comparison_', '')
                        print(f"  {comp_id}")
                else:
                    print("No comparisons found")
    except Exception as e:
        validator.logger.error(f"Command failed: {e}")
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)
 if __name__ == "__main__":
    main()
@@ -0,0 +1,206 @@
 """
 Disk Usage Data Collector
 Collects disk usage statistics including directory sizes,
 file system usage, and largest files information.
 """
 import logging
 import shlex
 import subprocess
 from typing import Dict, Any, List
 logger = logging.getLogger(__name__)
 class DiskUsageCollector:
    """Collector for disk usage statistics."""
    def __init__(self):
        self.max_depth = 3
        self.exclude_paths = [
            "/proc",
            "/sys",
            "/dev",
            "/run",
            "/tmp",
            "/var/log"
        ]
    def collect_disk_usage(self, system: str) -> Dict[str, Any]:
        """Collect disk usage information from target system."""
        logger.info(f"Collecting disk usage data from {system}")
        try:
            # Collect filesystem usage
            filesystem_usage = self.collect_filesystem_usage(system)
            # Collect directory sizes
            directory_sizes = self.collect_directory_sizes(system)
            # Collect largest files
            largest_files = self.collect_largest_files(system)
            return {
                "filesystem_usage": filesystem_usage,
                "directory_sizes": directory_sizes,
                "largest_files": largest_files,
                "timestamp": self.get_timestamp(system)
            }
        except Exception as e:
            logger.error(f"Failed to collect disk usage from {system}: {e}")
            raise
    def collect_filesystem_usage(self, system: str) -> List[Dict[str, Any]]:
        """Collect filesystem usage statistics."""
        usage_stats = []
        try:
            # Run df command
            result = subprocess.run(
                ["ssh", system, "df -h --output=source,fstype,size,used,avail,pcent,target"],
                capture_output=True,
                text=True,
                timeout=30
            )
            if result.returncode != 0:
                raise RuntimeError(f"df command failed: {result.stderr}")
            # Parse output
            lines = result.stdout.strip().split('\n')
            if len(lines) < 2:
                return usage_stats
            for line in lines[1:]:  # Skip header
                parts = line.split()
                if len(parts) >= 7:
                    usage_stat = {
                        "filesystem": parts[0],
                        "type": parts[1],
                        "size": parts[2],
                        "used": parts[3],
                        "available": parts[4],
                        "use_percent": parts[5],
                        "mountpoint": parts[6]
                    }
                    usage_stats.append(usage_stat)
        except subprocess.TimeoutExpired:
            logger.error(f"Timeout collecting filesystem usage from {system}")
            raise
        except Exception as e:
            logger.error(f"Failed to collect filesystem usage from {system}: {e}")
            raise
        return usage_stats
    def collect_directory_sizes(self, system: str) -> List[Dict[str, Any]]:
        """Collect sizes of top-level directories."""
        directory_sizes = []
        try:
            # Get top-level directories
            dirs_to_check = ["/", "/home", "/var", "/usr", "/opt", "/etc"]
            for directory in dirs_to_check:
                if directory in self.exclude_paths:
                    continue
                try:
                    # Run du command for directory size
                    result = subprocess.run(
                        ["ssh", system, f"du -sh -- {shlex.quote(directory)} 2>/dev/null"],
                        capture_output=True,
                        text=True,
                        timeout=60
                    )
                    # du exits non-zero if any subdirectory is unreadable but still prints the total
                    if result.stdout.strip():
                        size, path = result.stdout.strip().split('\t', 1)
                        directory_sizes.append({
                            "path": path,
                            "size": size
                        })
                except subprocess.TimeoutExpired:
                    logger.warning(f"Timeout getting size for {directory} on {system}")
                    continue
                except Exception as e:
                    logger.warning(f"Failed to get size for {directory} on {system}: {e}")
                    continue
        except Exception as e:
            logger.error(f"Failed to collect directory sizes from {system}: {e}")
            raise
        return directory_sizes
    def collect_largest_files(self, system: str) -> List[Dict[str, Any]]:
        """Collect information about largest files in the system."""
        largest_files = []
        try:
            # Find largest files (excluding certain paths)
            exclude_expr = " ".join(f"-not -path {shlex.quote(path + '/*')}" for path in self.exclude_paths)
            cmd = f"find / {exclude_expr} -type f -exec ls -lh {{}} \\; 2>/dev/null | sort -k5 -hr | head -20"
            result = subprocess.run(
                ["ssh", system, cmd],
                capture_output=True,
                text=True,
                timeout=120
            )
            if result.returncode == 0:
                for line in result.stdout.strip().split('\n'):
                    if not line.strip():
                        continue
                    parts = line.split()
                    if len(parts) >= 9:
                        file_info = {
                            "permissions": parts[0],
                            "links": parts[1],
                            "owner": parts[2],
                            "group": parts[3],
                            "size": parts[4],
                            "month": parts[5],
                            "day": parts[6],
                            "time": parts[7],
                            "path": " ".join(parts[8:])
                        }
                        largest_files.append(file_info)
        except subprocess.TimeoutExpired:
            logger.error(f"Timeout collecting largest files from {system}")
            raise
        except Exception as e:
            logger.error(f"Failed to collect largest files from {system}: {e}")
            raise
        return largest_files
    def get_timestamp(self, system: str) -> str:
        """Get current timestamp from target system."""
        try:
            result = subprocess.run(
                ["ssh", system, "date -Iseconds"],
                capture_output=True,
                text=True,
                timeout=10
            )
            if result.returncode == 0:
                return result.stdout.strip()
            else:
                return "unknown"
        except Exception:
            return "unknown"
 def collect(system: str) -> Dict[str, Any]:
    """Main collection function for disk usage data."""
    collector = DiskUsageCollector()
    return collector.collect_disk_usage(system)
@@ -0,0 +1,172 @@
 """
 Mounts Data Collector
 Collects filesystem mount information including mount points, devices,
 filesystem types, and usage statistics.
 """
 import logging
 import shlex
 import subprocess
 from typing import Dict, Any, List
 logger = logging.getLogger(__name__)
 class MountsCollector:
    """Collector for filesystem mount information."""
    def __init__(self):
        self.exclude_patterns = [
            "/proc/*",
            "/sys/*",
            "/dev/*",
            "/run/*"
        ]
    def collect_mounts(self, system: str) -> Dict[str, Any]:
        """Collect mount information from target system."""
        logger.info(f"Collecting mounts data from {system}")
        try:
            # Run mount command
            result = subprocess.run(
                ["ssh", system, "mount"],
                capture_output=True,
                text=True,
                timeout=30
            )
            if result.returncode != 0:
                raise RuntimeError(f"Mount command failed: {result.stderr}")
            mounts = self.parse_mount_output(result.stdout)
            filtered_mounts = self.filter_mounts(mounts)
            # Get usage statistics
            usage_stats = self.collect_usage_stats(system, filtered_mounts)
            return {
                "mounts": filtered_mounts,
                "usage": usage_stats,
                "timestamp": self.get_timestamp(system)
            }
        except subprocess.TimeoutExpired:
            logger.error(f"Timeout collecting mounts from {system}")
            raise
        except Exception as e:
            logger.error(f"Failed to collect mounts from {system}: {e}")
            raise
    def parse_mount_output(self, output: str) -> List[Dict[str, str]]:
        """Parse mount command output."""
        mounts = []
        for line in output.strip().split('\n'):
            if not line.strip():
                continue
            # Parse mount output format: device on mountpoint type fstype (options)
            parts = line.split()
            if len(parts) >= 6 and parts[1] == 'on' and parts[3] == 'type':
                mount_info = {
                    "device": parts[0],
                    "mountpoint": parts[2],
                    "fstype": parts[4],
                    "options": parts[5].strip('()')
                }
                mounts.append(mount_info)
        return mounts
    def filter_mounts(self, mounts: List[Dict[str, str]]) -> List[Dict[str, str]]:
        """Filter out unwanted mount points."""
        filtered = []
        for mount in mounts:
            mountpoint = mount["mountpoint"]
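            # "/proc/*" excludes "/proc" itself as well as everything below it
            prefixes = [pattern.rstrip('*') for pattern in self.exclude_patterns]
            if not any(mountpoint.startswith(p) or mountpoint == p.rstrip('/') for p in prefixes):
                filtered.append(mount)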
        return filtered
    def collect_usage_stats(self, system: str, mounts: List[Dict[str, str]]) -> Dict[str, Any]:
        """Collect disk usage statistics for mount points."""
        usage_stats = {}
        for mount in mounts:
            mountpoint = mount["mountpoint"]
            try:
                # Run df command for usage statistics
                result = subprocess.run(
                    ["ssh", system, f"df -BG -- {shlex.quote(mountpoint)}"],
                    capture_output=True,
                    text=True,
                    timeout=15
                )
                if result.returncode == 0:
                    usage_stats[mountpoint] = self.parse_df_output(result.stdout)
            except subprocess.TimeoutExpired:
                logger.warning(f"Timeout getting usage for {mountpoint} on {system}")
                usage_stats[mountpoint] = {"error": "timeout"}
            except Exception as e:
                logger.warning(f"Failed to get usage for {mountpoint} on {system}: {e}")
                usage_stats[mountpoint] = {"error": str(e)}
        return usage_stats
    def parse_df_output(self, output: str) -> Dict[str, Any]:
        """Parse df command output."""
        lines = output.strip().split('\n')
        if len(lines) < 2:
            return {"error": "invalid df output"}
        # Parse header and data
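        header = lines[0].split()
        data = lines[1].split()
        # df prints "Mounted on" as one column, but split() yields two header tokens
        if header[-2:] == ["Mounted", "on"]:
            header[-2:] = ["mounted_on"]
        if len(header) != len(data):
            return {"error": "header/data mismatch"}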
        stats = {}
        for i, field in enumerate(header):
            if i < len(data):
                if field in ['1G-blocks', 'Used', 'Available']:
                    # Convert to GB
                    value = data[i]
                    if value.endswith('G'):
                        stats[field.lower()] = float(value.rstrip('G'))
                    else:
                        stats[field.lower()] = float(value) / (1024**3)  # Assume bytes
                elif field == 'Use%':
                    stats['use_percent'] = int(data[i].rstrip('%'))
                else:
                    stats[field.lower()] = data[i]
        return stats
    def get_timestamp(self, system: str) -> str:
        """Get current timestamp from target system."""
        try:
            result = subprocess.run(
                ["ssh", system, "date -Iseconds"],
                capture_output=True,
                text=True,
                timeout=10
            )
            if result.returncode == 0:
                return result.stdout.strip()
            else:
                return "unknown"
        except Exception:
            return "unknown"
 def collect(system: str) -> Dict[str, Any]:
    """Main collection function for mounts data."""
    collector = MountsCollector()
    return collector.collect_mounts(system)
@@ -0,0 +1,223 @@
 """
 Services Data Collector
 Collects system service information including running services,
 their states, startup configuration, and dependencies.
 """
 import logging
 import shlex
 import subprocess
 from typing import Dict, Any, List
 logger = logging.getLogger(__name__)
 class ServicesCollector:
    """Collector for system service information."""
    def __init__(self):
        self.service_manager = "systemd"  # Default to systemd
        self.include_disabled = False
    def collect_services(self, system: str) -> Dict[str, Any]:
        """Collect service information from target system."""
        logger.info(f"Collecting services data from {system}")
        try:
            # Detect service manager
            service_manager = self.detect_service_manager(system)
            if service_manager == "systemd":
                services = self.collect_systemd_services(system)
            elif service_manager == "sysv":
                services = self.collect_sysv_services(system)
            else:
                raise RuntimeError(f"Unsupported service manager: {service_manager}")
            return {
                "service_manager": service_manager,
                "services": services,
                "timestamp": self.get_timestamp(system)
            }
        except Exception as e:
            logger.error(f"Failed to collect services from {system}: {e}")
            raise
    def detect_service_manager(self, system: str) -> str:
        """Detect which service manager is running on the system."""
        try:
            # Check for systemd
            result = subprocess.run(
                ["ssh", system, "ps -p 1 -o comm="],
                capture_output=True,
                text=True,
                timeout=10
            )
            if result.returncode == 0:
                if "systemd" in result.stdout.strip():
                    return "systemd"
                elif "init" in result.stdout.strip():
                    return "sysv"
            # Fallback check
            result = subprocess.run(
                ["ssh", system, "which systemctl"],
                capture_output=True,
                text=True,
                timeout=10
            )
            if result.returncode == 0:
                return "systemd"
            return "sysv"
        except Exception:
            return "unknown"
    def collect_systemd_services(self, system: str) -> List[Dict[str, Any]]:
        """Collect systemd service information."""
        services = []
        try:
            # Get all services
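            result = subprocess.run(
                # --plain drops the leading status bullet that would otherwise shift the columns for failed units
                ["ssh", system, "systemctl list-units --type=service --all --plain --no-pager --no-legend"],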
                capture_output=True,
                text=True,
                timeout=30
            )
            if result.returncode != 0:
                raise RuntimeError(f"systemctl list-units failed: {result.stderr}")
            # Parse service list
            for line in result.stdout.strip().split('\n'):
                if not line.strip():
                    continue
                parts = line.split()
                if len(parts) >= 4:
                    service_name = parts[0]
                    load_state = parts[1]
                    active_state = parts[2]
                    sub_state = parts[3]
                    # Skip units whose unit file was not found unless include_disabled is set
                    if not self.include_disabled and load_state == "not-found":
                        continue
                    # Get detailed service info
                    service_info = self.get_systemd_service_details(system, service_name)
                    services.append({
                        "name": service_name,
                        "load_state": load_state,
                        "active_state": active_state,
                        "sub_state": sub_state,
                        **service_info
                    })
        except subprocess.TimeoutExpired:
            logger.error(f"Timeout collecting systemd services from {system}")
            raise
        except Exception as e:
            logger.error(f"Failed to collect systemd services from {system}: {e}")
            raise
        return services
    def get_systemd_service_details(self, system: str, service_name: str) -> Dict[str, Any]:
        """Get detailed information for a systemd service."""
        details = {}
        try:
            # Get service status
            result = subprocess.run(
                ["ssh", system, f"systemctl show {shlex.quote(service_name)} --no-pager"],
                capture_output=True,
                text=True,
                timeout=15
            )
            if result.returncode == 0:
                for line in result.stdout.strip().split('\n'):
                    if '=' in line:
                        key, value = line.split('=', 1)
                        details[key.lower()] = value
        except Exception as e:
            logger.warning(f"Failed to get details for {service_name}: {e}")
        return details
    def collect_sysv_services(self, system: str) -> List[Dict[str, Any]]:
        """Collect SysV init service information."""
        services = []
        try:
            # Get service list from /etc/init.d/
            result = subprocess.run(
                ["ssh", system, "ls -1 /etc/init.d/"],
                capture_output=True,
                text=True,
                timeout=15
            )
            if result.returncode != 0:
                raise RuntimeError(f"Failed to list init.d services: {result.stderr}")
            for service_name in result.stdout.strip().split('\n'):
                if not service_name.strip():
                    continue
                # Get service status
                status_result = subprocess.run(
                    ["ssh", system, f"/etc/init.d/{shlex.quote(service_name)} status"],
                    capture_output=True,
                    text=True,
                    timeout=10
                )
                status = "unknown"
                if status_result.returncode == 0:
                    status = "running"
                elif "not running" in status_result.stdout.lower():
                    status = "stopped"
                services.append({
                    "name": service_name,
                    "status": status,
                    "type": "sysv"
                })
        except Exception as e:
            logger.error(f"Failed to collect SysV services from {system}: {e}")
            raise
        return services
    def get_timestamp(self, system: str) -> str:
        """Get current timestamp from target system."""
        try:
            result = subprocess.run(
                ["ssh", system, "date -Iseconds"],
                capture_output=True,
                text=True,
                timeout=10
            )
            if result.returncode == 0:
                return result.stdout.strip()
            else:
                return "unknown"
        except Exception:
            return "unknown"
 def collect(system: str) -> Dict[str, Any]:
    """Main collection function for services data."""
    collector = ServicesCollector()
    return collector.collect_services(system)
@@ -0,0 +1,30 @@
 # Migration Validation Framework Architecture
 ## Components
 - CLI: parses operator commands and coordinates workflows.
 - Collectors: gather mounts, services, and disk usage from target systems.
 - Snapshot files: JSON evidence used as immutable migration checkpoints.
 - Comparator: evaluates drift between before and after snapshots.
 - Reports: stores JSON or HTML output for audit and review.
 ## Data Flow
 ```
 Operator
  -> python3 cli.py collect
  -> collectors over SSH
  -> before.json / after.json
  -> python3 cli.py compare
  -> diff.json with PASS/FAIL validation
 ```
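
As a rough illustration of the flow above, the sketch below builds a snapshot dictionary in the same shape as the sample snapshot files (`metadata` plus per-system `data`). `build_snapshot` and `fake_mounts` are hypothetical names used only for this example. The real collectors run over SSH; a stub stands in for them here so the snippet runs anywhere.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

# Hypothetical collector signature: takes a host name, returns a dict.
Collector = Callable[[str], Dict[str, Any]]


def build_snapshot(systems: List[str], collectors: Dict[str, Collector]) -> Dict[str, Any]:
    """Run every collector against every system and wrap the results in metadata."""
    return {
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "systems": systems,
            "version": "1.0",
        },
        "data": {
            system: {name: collect(system) for name, collect in collectors.items()}
            for system in systems
        },
    }


def fake_mounts(system: str) -> Dict[str, Any]:
    """Stub collector (assumption for the demo); no SSH connection is made."""
    return {"mounts": [{"device": "/dev/sda1", "mountpoint": "/", "fstype": "ext4"}]}


snapshot = build_snapshot(["web01"], {"mounts": fake_mounts})
print(json.dumps(snapshot, indent=2))
```

Because the result is plain JSON-serialisable data, it can be written to `before.json` or `after.json` and compared later without touching the host again.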
 ## Validation Flow
 ```
 before.json -> Comparator -> service checks
 after.json  -> Comparator -> filesystem checks -> validation result
                         -> mount checks
 ```
 The framework keeps collection and comparison separate so migration evidence can be reviewed, archived, and replayed without recollecting from production systems.
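
To make the replay idea concrete, here is a minimal sketch of one check a comparator could run against two already-collected snapshots. It is not the framework's actual comparator: `service_status_changes` is a hypothetical helper, and the inline data is a trimmed copy of the sample `web01` snapshots.

```python
from typing import Any, Dict, List

STATE_KEYS = ("active_state", "sub_state")


def service_status_changes(before: Dict[str, Any], after: Dict[str, Any], system: str) -> List[Dict[str, Any]]:
    """Return services present in both snapshots whose state fields differ."""
    def by_name(snapshot: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
        services = snapshot["data"][system]["services"]["services"]
        return {service["name"]: service for service in services}

    old, new = by_name(before), by_name(after)
    changes = []
    for name in sorted(old.keys() & new.keys()):
        state_before = {key: old[name].get(key) for key in STATE_KEYS}
        state_after = {key: new[name].get(key) for key in STATE_KEYS}
        if state_before != state_after:
            changes.append({"name": name, "before": state_before, "after": state_after})
    return changes


def snapshot_with(services: List[Dict[str, str]]) -> Dict[str, Any]:
    """Wrap a service list in the snapshot layout (demo helper, hypothetical)."""
    return {"data": {"web01": {"services": {"services": services}}}}


before = snapshot_with([
    {"name": "sshd", "active_state": "active", "sub_state": "running"},
    {"name": "nginx", "active_state": "active", "sub_state": "running"},
])
after = snapshot_with([
    {"name": "sshd", "active_state": "failed", "sub_state": "failed"},
    {"name": "nginx", "active_state": "active", "sub_state": "running"},
    {"name": "node-exporter", "active_state": "active", "sub_state": "running"},
])

changes = service_status_changes(before, after, "web01")
passed = not changes  # any state change on an existing service fails this check
print(changes, "PASS" if passed else "FAIL")
```

The function only reads the two dictionaries, so the same archived files can be re-evaluated with stricter or looser rules after the migration window has closed.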
@@ -0,0 +1,40 @@
 {
  "metadata": {
    "timestamp": "2026-04-29T03:40:00Z",
    "systems": ["web01"],
    "version": "1.0"
  },
  "data": {
    "web01": {
      "mounts": {
        "mounts": [
          {"device": "/dev/sda1", "mountpoint": "/", "fstype": "ext4", "options": "rw,relatime"},
          {"device": "/dev/sdb1", "mountpoint": "/var", "fstype": "xfs", "options": "rw,noatime"}
        ],
        "usage": {
          "/": {"filesystem": "/dev/sda1", "use_percent": "62%"},
          "/var": {"filesystem": "/dev/sdb1", "use_percent": "94%"}
        },
        "timestamp": "2026-04-29T03:40:00Z"
      },
      "services": {
        "service_manager": "systemd",
        "services": [
          {"name": "sshd", "active_state": "failed", "sub_state": "failed"},
          {"name": "nginx", "active_state": "active", "sub_state": "running"},
          {"name": "node-exporter", "active_state": "active", "sub_state": "running"}
        ],
        "timestamp": "2026-04-29T03:40:00Z"
      },
      "disk_usage": {
        "filesystem_usage": [
          {"filesystem": "/dev/sda1", "type": "ext4", "size": "80G", "used": "50G", "available": "30G", "use_percent": "62%", "mountpoint": "/"},
          {"filesystem": "/dev/sdb1", "type": "xfs", "size": "200G", "used": "188G", "available": "12G", "use_percent": "94%", "mountpoint": "/var"}
        ],
        "directory_sizes": [{"path": "/var/lib/app", "size": "139G"}],
        "largest_files": [{"path": "/var/lib/app/import/archive.tar", "size": "42G"}],
        "timestamp": "2026-04-29T03:40:00Z"
      }
    }
  }
 }
@@ -0,0 +1,39 @@
 {
  "metadata": {
    "timestamp": "2026-04-29T01:15:00Z",
    "systems": ["web01"],
    "version": "1.0"
  },
  "data": {
    "web01": {
      "mounts": {
        "mounts": [
          {"device": "/dev/sda1", "mountpoint": "/", "fstype": "ext4", "options": "rw,relatime"},
          {"device": "/dev/sdb1", "mountpoint": "/var", "fstype": "xfs", "options": "rw,noatime"}
        ],
        "usage": {
          "/": {"filesystem": "/dev/sda1", "use_percent": "61%"},
          "/var": {"filesystem": "/dev/sdb1", "use_percent": "68%"}
        },
        "timestamp": "2026-04-29T01:15:00Z"
      },
      "services": {
        "service_manager": "systemd",
        "services": [
          {"name": "sshd", "active_state": "active", "sub_state": "running"},
          {"name": "nginx", "active_state": "active", "sub_state": "running"}
        ],
        "timestamp": "2026-04-29T01:15:00Z"
      },
      "disk_usage": {
        "filesystem_usage": [
          {"filesystem": "/dev/sda1", "type": "ext4", "size": "80G", "used": "49G", "available": "31G", "use_percent": "61%", "mountpoint": "/"},
          {"filesystem": "/dev/sdb1", "type": "xfs", "size": "200G", "used": "136G", "available": "64G", "use_percent": "68%", "mountpoint": "/var"}
        ],
        "directory_sizes": [{"path": "/var/lib/app", "size": "84G"}],
        "largest_files": [],
        "timestamp": "2026-04-29T01:15:00Z"
      }
    }
  }
 }
@@ -0,0 +1,211 @@
 {
  "summary": {
    "total_systems": 1,
    "systems_with_changes": 1,
    "total_changes": 7,
    "changes_by_type": {
      "mounts": 2,
      "services": 2,
      "disk_usage": 3
    },
    "most_affected_systems": [
      [
        "web01",
        7
      ]
    ]
  },
  "differences": {
    "mounts": {
      "web01": {
        "added_mounts": [],
        "removed_mounts": [],
        "changed_mounts": [],
        "usage_changes": [
          {
            "mountpoint": "/",
            "before": {
              "filesystem": "/dev/sda1",
              "use_percent": "61%"
            },
            "after": {
              "filesystem": "/dev/sda1",
              "use_percent": "62%"
            }
          },
          {
            "mountpoint": "/var",
            "before": {
              "filesystem": "/dev/sdb1",
              "use_percent": "68%"
            },
            "after": {
              "filesystem": "/dev/sdb1",
              "use_percent": "94%"
            }
          }
        ]
      }
    },
    "services": {
      "web01": {
        "added_services": [
          {
            "name": "node-exporter",
            "active_state": "active",
            "sub_state": "running"
          }
        ],
        "removed_services": [],
        "status_changes": [
          {
            "name": "sshd",
            "before": {
              "active_state": "active",
              "sub_state": "running"
            },
            "after": {
              "active_state": "failed",
              "sub_state": "failed"
            }
          }
        ],
        "configuration_changes": []
      }
    },
    "disk_usage": {
      "web01": {
        "filesystem_changes": [
          {
            "mountpoint": "/",
            "before": {
              "filesystem": "/dev/sda1",
              "type": "ext4",
              "size": "80G",
              "used": "49G",
              "available": "31G",
              "use_percent": "61%",
              "mountpoint": "/"
            },
            "after": {
              "filesystem": "/dev/sda1",
              "type": "ext4",
              "size": "80G",
              "used": "50G",
              "available": "30G",
              "use_percent": "62%",
              "mountpoint": "/"
            }
          },
          {
            "mountpoint": "/var",
            "before": {
              "filesystem": "/dev/sdb1",
              "type": "xfs",
              "size": "200G",
              "used": "136G",
              "available": "64G",
              "use_percent": "68%",
              "mountpoint": "/var"
            },
            "after": {
              "filesystem": "/dev/sdb1",
              "type": "xfs",
              "size": "200G",
              "used": "188G",
              "available": "12G",
              "use_percent": "94%",
              "mountpoint": "/var"
            }
          }
        ],
        "directory_size_changes": [],
        "significant_usage_changes": [
          {
            "mountpoint": "/var",
            "change_percent": 26,
            "before": {
              "filesystem": "/dev/sdb1",
              "type": "xfs",
              "size": "200G",
              "used": "136G",
              "available": "64G",
              "use_percent": "68%",
              "mountpoint": "/var"
            },
            "after": {
              "filesystem": "/dev/sdb1",
              "type": "xfs",
              "size": "200G",
              "used": "188G",
              "available": "12G",
              "use_percent": "94%",
              "mountpoint": "/var"
            }
          }
        ]
      }
    }
  },
  "risk_assessment": {
    "overall_risk": "high",
    "risk_factors": [
      {
        "type": "service_failure",
        "description": "Service failed: sshd",
        "level": 3
      },
      {
        "type": "disk_usage_spike",
        "description": "Significant disk usage change: /var (26%)",
        "level": 2
      }
    ],
    "critical_changes": [],
    "recommendations": [
      "Immediate review required - critical changes detected",
      "Consider rolling back migration if critical services are affected"
    ]
  },
  "validation_results": {
    "passed": false,
    "checks": [
      {
        "name": "critical_services_running",
        "description": "Verify critical services remain operational",
        "passed": false,
        "details": [
          "Critical service sshd failed on web01"
        ]
      },
      {
        "name": "filesystem_integrity",
        "description": "Verify filesystem integrity maintained",
        "passed": true,
        "details": []
      },
      {
        "name": "no_critical_mounts_removed",
        "description": "Verify critical mount points remain",
        "passed": true,
        "details": []
      }
    ],
    "failed_checks": [
      {
        "name": "critical_services_running",
        "description": "Verify critical services remain operational",
        "passed": false,
        "details": [
          "Critical service sshd failed on web01"
        ]
      }
    ],
    "result": "FAIL"
  },
  "metadata": {
    "before": "migration-validation-framework/examples/before.json",
    "after": "migration-validation-framework/examples/after.json",
    "timestamp": "2026-04-29T23:29:07.510774"
  }
 }
@@ -0,0 +1,19 @@
 # Scenario: Before/After Migration Comparison
 ## Description
 Compare a pre-cutover host snapshot against a post-cutover snapshot and determine whether the migrated system is ready for production traffic.
 ## Commands
 ```bash
 cd professional-infra/migration-validation-framework
 python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
 ```
 ## Expected Result
 - The command writes a JSON diff.
 - The result is `FAIL` because `sshd` is in the `failed` state after migration.
 - The risk assessment highlights the `/var` disk usage increase (68% to 94%).
 - The remediation path is to restore SSH and either free space on `/var` or expand it before approving cutover.
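
The expected result above can be checked programmatically. The following is a minimal sketch, not part of the framework: `summarize_diff` is a hypothetical helper, and `SAMPLE_DIFF` is a trimmed excerpt whose field names mirror the example diff output.

```python
import json

# Trimmed excerpt of the example diff; field names mirror the sample output.
SAMPLE_DIFF = json.loads("""
{
  "validation_results": {
    "result": "FAIL",
    "failed_checks": [
      {"name": "critical_services_running",
       "details": ["Critical service sshd failed on web01"]}
    ]
  },
  "risk_assessment": {
    "overall_risk": "high",
    "risk_factors": [
      {"type": "service_failure", "description": "Service failed: sshd", "level": 3},
      {"type": "disk_usage_spike", "description": "Significant disk usage change: /var (26%)", "level": 2}
    ]
  }
}
""")


def summarize_diff(diff):
    """Return short review lines for a comparison result (hypothetical helper)."""
    validation = diff.get("validation_results", {})
    risk = diff.get("risk_assessment", {})
    lines = [
        f"result={validation.get('result', 'UNKNOWN')} "
        f"risk={risk.get('overall_risk', 'unknown')}"
    ]
    # One line per failed validation detail.
    for check in validation.get("failed_checks", []):
        for detail in check.get("details", []):
            lines.append(f"failed check {check['name']}: {detail}")
    # Risk factors, highest level first.
    factors = sorted(risk.get("risk_factors", []), key=lambda f: f["level"], reverse=True)
    for factor in factors:
        lines.append(f"risk level {factor['level']}: {factor['description']}")
    return lines


if __name__ == "__main__":
    # To inspect a real run, load the written file instead of SAMPLE_DIFF, e.g.
    # json.loads(pathlib.Path("/tmp/migration-diff.json").read_text()).
    for line in summarize_diff(SAMPLE_DIFF):
        print(line)
```

Sorting by level keeps the `sshd` failure above the `/var` growth, which matches the order in which an operator would likely remediate.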
@@ -0,0 +1,67 @@
 import json
 import unittest
 from pathlib import Path
 from collectors.mounts import MountsCollector
 from reports.html_report import HTMLReportGenerator
 from validators.compare import compare_snapshots
 PROJECT_ROOT = Path(__file__).resolve().parents[1]
 class ComparatorExampleTests(unittest.TestCase):
    def test_example_comparison_detects_expected_failure(self):
        before = json.loads((PROJECT_ROOT / "examples" / "before.json").read_text())
        after = json.loads((PROJECT_ROOT / "examples" / "after.json").read_text())
        comparison = compare_snapshots(before, after)
        self.assertFalse(comparison["validation_results"]["passed"])
        self.assertEqual(comparison["validation_results"]["result"], "FAIL")
        self.assertGreater(comparison["summary"]["total_changes"], 0)
 class HtmlReportTests(unittest.TestCase):
    def test_report_escapes_untrusted_snapshot_content(self):
        report = HTMLReportGenerator().build_html_content({
            "metadata": {"comparison_id": "<script>alert(1)</script>"},
            "summary": {
                "total_systems": 1,
                "systems_with_changes": 1,
                "total_changes": 1,
                "changes_by_type": {"services": 1},
                "most_affected_systems": [("<img src=x onerror=alert(1)>", 1)],
            },
            "differences": {},
            "risk_assessment": {"overall_risk": "low", "risk_factors": [], "critical_changes": [], "recommendations": ["Review <b>change</b>"]},
            "validation_results": {"passed": True, "checks": []},
        })
        self.assertNotIn("<script>alert(1)</script>", report)
        self.assertNotIn("<img src=x onerror=alert(1)>", report)
        self.assertIn("&lt;script&gt;alert(1)&lt;/script&gt;", report)
        self.assertIn("&lt;b&gt;change&lt;/b&gt;", report)
 class CollectorParserTests(unittest.TestCase):
    def test_mount_parser_handles_standard_mount_output(self):
        output = "/dev/sda1 on / type ext4 (rw,relatime)\nproc on /proc type proc (rw,nosuid,nodev,noexec,relatime)\n"
        mounts = MountsCollector().parse_mount_output(output)
        self.assertEqual(mounts[0]["device"], "/dev/sda1")
        self.assertEqual(mounts[0]["mountpoint"], "/")
        self.assertEqual(mounts[0]["fstype"], "ext4")
    def test_df_parser_handles_gigabyte_output(self):
        output = "Filesystem 1G-blocks Used Available Use% Mounted\n/dev/sda1 100G 45G 55G 45% /\n"
        stats = MountsCollector().parse_df_output(output)
        self.assertEqual(stats["1g-blocks"], 100.0)
        self.assertEqual(stats["used"], 45.0)
        self.assertEqual(stats["available"], 55.0)
        self.assertEqual(stats["use_percent"], 45)
 if __name__ == "__main__":
    unittest.main()
@@ -0,0 +1,501 @@
 """
 Snapshot Comparison Engine
 Compares two system snapshots and identifies differences,
 risk levels, and validation results.
 """
 import logging
 from typing import Dict, Any, List
 logger = logging.getLogger(__name__)
 class SnapshotComparator:
    """Engine for comparing system snapshots."""
    def __init__(self):
        self.risk_levels = {
            "low": 1,
            "medium": 2,
            "high": 3,
            "critical": 4
        }
    def compare_snapshots(self, snapshot1: Dict[str, Any], snapshot2: Dict[str, Any]) -> Dict[str, Any]:
        """Compare two snapshots and return detailed comparison results."""
        logger.info("Starting snapshot comparison")
        comparison = {
            "summary": {},
            "differences": {},
            "risk_assessment": {},
            "validation_results": {}
        }
        # Compare each data type
        data_types = ["mounts", "services", "disk_usage"]
        data1 = snapshot1.get("data", {})
        data2 = snapshot2.get("data", {})
        for data_type in data_types:
            if self.data_type_exists(data1, data_type) or self.data_type_exists(data2, data_type):
                differences = self.compare_data_type(data1, data2, data_type)
                comparison["differences"][data_type] = differences
        # Generate summary
        comparison["summary"] = self.generate_summary(comparison["differences"])
        # Risk assessment
        comparison["risk_assessment"] = self.assess_risks(comparison["differences"])
        # Validation results
        comparison["validation_results"] = self.validate_changes(comparison["differences"])
        comparison["validation_results"]["result"] = "PASS" if comparison["validation_results"]["passed"] else "FAIL"
        logger.info("Snapshot comparison completed")
        return comparison
    def data_type_exists(self, systems: Dict[str, Any], data_type: str) -> bool:
        """Return true when at least one system has the requested collector data."""
        return any(data_type in system_data for system_data in systems.values())
    def compare_data_type(self, data1: Dict[str, Any], data2: Dict[str, Any], data_type: str) -> Dict[str, Any]:
        """Compare a specific data type between two snapshots."""
        differences = {}
        # Get all systems from both snapshots
        systems1 = set(data1.keys())
        systems2 = set(data2.keys())
        all_systems = systems1.union(systems2)
        for system in all_systems:
            system_diffs = {}
            if system not in data1:
                system_diffs["status"] = "added"
                system_diffs["details"] = {"new_system": True}
            elif system not in data2:
                system_diffs["status"] = "removed"
                system_diffs["details"] = {"removed_system": True}
            else:
                # Compare data for this system and data type
                if data_type in data1[system] and data_type in data2[system]:
                    system_diffs = self.compare_system_data(
                        data1[system][data_type],
                        data2[system][data_type],
                        data_type
                    )
                else:
                    system_diffs["status"] = "data_missing"
                    system_diffs["details"] = {"missing_data_type": data_type}
            if system_diffs:
                differences[system] = system_diffs
        return differences
    def compare_system_data(self, data1: Dict[str, Any], data2: Dict[str, Any], data_type: str) -> Dict[str, Any]:
        """Compare data for a specific system and data type."""
        differences = {}
        if data_type == "mounts":
            differences = self.compare_mounts(data1, data2)
        elif data_type == "services":
            differences = self.compare_services(data1, data2)
        elif data_type == "disk_usage":
            differences = self.compare_disk_usage(data1, data2)
        else:
            differences["status"] = "unknown_data_type"
        return differences
    def compare_mounts(self, mounts1: Dict[str, Any], mounts2: Dict[str, Any]) -> Dict[str, Any]:
        """Compare mounts data between snapshots."""
        differences = {
            "added_mounts": [],
            "removed_mounts": [],
            "changed_mounts": [],
            "usage_changes": []
        }
        # Compare mount lists
        mounts_list1 = mounts1.get("mounts", [])
        mounts_list2 = mounts2.get("mounts", [])
        # Create mountpoint maps
        mounts_map1 = {m["mountpoint"]: m for m in mounts_list1}
        mounts_map2 = {m["mountpoint"]: m for m in mounts_list2}
        # Find added and removed mounts
        added = set(mounts_map2.keys()) - set(mounts_map1.keys())
        removed = set(mounts_map1.keys()) - set(mounts_map2.keys())
        differences["added_mounts"] = [{"mountpoint": mp, **mounts_map2[mp]} for mp in added]
        differences["removed_mounts"] = [{"mountpoint": mp, **mounts_map1[mp]} for mp in removed]
        # Find changed mounts
        common = set(mounts_map1.keys()) & set(mounts_map2.keys())
        for mp in common:
            m1, m2 = mounts_map1[mp], mounts_map2[mp]
            if m1 != m2:
                differences["changed_mounts"].append({
                    "mountpoint": mp,
                    "before": m1,
                    "after": m2
                })
        # Compare usage statistics
        usage1 = mounts1.get("usage", {})
        usage2 = mounts2.get("usage", {})
        for mp in set(usage1.keys()) | set(usage2.keys()):
            if mp in usage1 and mp in usage2:
                u1, u2 = usage1[mp], usage2[mp]
                if u1 != u2:
                    differences["usage_changes"].append({
                        "mountpoint": mp,
                        "before": u1,
                        "after": u2
                    })
        return differences
    def compare_services(self, services1: Dict[str, Any], services2: Dict[str, Any]) -> Dict[str, Any]:
        """Compare services data between snapshots."""
        differences = {
            "added_services": [],
            "removed_services": [],
            "status_changes": [],
            "configuration_changes": []
        }
        # Compare service lists
        services_list1 = services1.get("services", [])
        services_list2 = services2.get("services", [])
        # Create service maps
        services_map1 = {s["name"]: s for s in services_list1}
        services_map2 = {s["name"]: s for s in services_list2}
        # Find added and removed services
        added = set(services_map2.keys()) - set(services_map1.keys())
        removed = set(services_map1.keys()) - set(services_map2.keys())
        differences["added_services"] = [{"name": name, **services_map2[name]} for name in added]
        differences["removed_services"] = [{"name": name, **services_map1[name]} for name in removed]
        # Find status changes
        common = set(services_map1.keys()) & set(services_map2.keys())
        for name in common:
            s1, s2 = services_map1[name], services_map2[name]
            if s1.get("active_state") != s2.get("active_state") or s1.get("sub_state") != s2.get("sub_state"):
                differences["status_changes"].append({
                    "name": name,
                    "before": {"active_state": s1.get("active_state"), "sub_state": s1.get("sub_state")},
                    "after": {"active_state": s2.get("active_state"), "sub_state": s2.get("sub_state")}
                })
        return differences
    def compare_disk_usage(self, usage1: Dict[str, Any], usage2: Dict[str, Any]) -> Dict[str, Any]:
        """Compare disk usage data between snapshots."""
        differences = {
            "filesystem_changes": [],
            "directory_size_changes": [],
            "significant_usage_changes": []
        }
        # Compare filesystem usage
        fs1 = usage1.get("filesystem_usage", [])
        fs2 = usage2.get("filesystem_usage", [])
        # Create filesystem maps by mountpoint
        fs_map1 = {fs["mountpoint"]: fs for fs in fs1}
        fs_map2 = {fs["mountpoint"]: fs for fs in fs2}
        common_fs = set(fs_map1.keys()) & set(fs_map2.keys())
        for mp in common_fs:
            f1, f2 = fs_map1[mp], fs_map2[mp]
            if f1 != f2:
                differences["filesystem_changes"].append({
                    "mountpoint": mp,
                    "before": f1,
                    "after": f2
                })
                # Check for significant usage changes
                try:
                    # use_percent may arrive as "68%" (string) or 68 (int), so normalize via str()
                    use1 = int(str(f1.get("use_percent", "0")).rstrip("%"))
                    use2 = int(str(f2.get("use_percent", "0")).rstrip("%"))
                    if abs(use2 - use1) > 10:  # more than 10 percentage points
                        differences["significant_usage_changes"].append({
                            "mountpoint": mp,
                            "change_percent": use2 - use1,
                            "before": f1,
                            "after": f2
                        })
                except (ValueError, KeyError):
                    pass
        return differences
    def generate_summary(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Generate a summary of all differences."""
        summary = {
            "total_systems": 0,
            "systems_with_changes": 0,
            "total_changes": 0,
            "changes_by_type": {},
            "most_affected_systems": []
        }
        system_change_counts = {}
        for data_type, systems in differences.items():
            summary["changes_by_type"][data_type] = 0
            for system, system_diffs in systems.items():
                if system not in system_change_counts:
                    system_change_counts[system] = 0
                # Count changes for this system and data type
                change_count = self.count_changes(system_diffs)
                system_change_counts[system] += change_count
                summary["changes_by_type"][data_type] += change_count
                summary["total_changes"] += change_count
        summary["total_systems"] = len(system_change_counts)
        # Count systems with changes
        summary["systems_with_changes"] = len([s for s in system_change_counts.values() if s > 0])
        # Find most affected systems
        sorted_systems = sorted(system_change_counts.items(), key=lambda x: x[1], reverse=True)
        summary["most_affected_systems"] = sorted_systems[:5]
        return summary
    def count_changes(self, system_diffs: Dict[str, Any]) -> int:
        """Count the number of changes in system differences."""
        count = 0
        for key, value in system_diffs.items():
            if isinstance(value, list):
                count += len(value)
            elif isinstance(value, dict) and key not in ["status"]:
                # Count nested changes
                count += sum(1 for v in value.values() if isinstance(v, list) and v)
        return count
    def assess_risks(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Assess risk levels for the changes."""
        risk_assessment = {
            "overall_risk": "low",
            "risk_factors": [],
            "critical_changes": [],
            "recommendations": []
        }
        max_risk_level = 1
        # Analyze each type of change
        for data_type, systems in differences.items():
            for system, system_diffs in systems.items():
                risk_factors = self.analyze_system_risks(system_diffs, data_type)
                risk_assessment["risk_factors"].extend(risk_factors)
                for factor in risk_factors:
                    if factor["level"] > max_risk_level:
                        max_risk_level = factor["level"]
                    if factor["level"] >= 4:  # Critical
                        risk_assessment["critical_changes"].append({
                            "system": system,
                            "data_type": data_type,
                            "factor": factor
                        })
        # Set overall risk
        risk_levels = {level: name for name, level in self.risk_levels.items()}
        risk_assessment["overall_risk"] = risk_levels.get(max_risk_level, "unknown")
        # Generate recommendations
        risk_assessment["recommendations"] = self.generate_recommendations(risk_assessment)
        return risk_assessment
    def analyze_system_risks(self, system_diffs: Dict[str, Any], data_type: str) -> List[Dict[str, Any]]:
        """Analyze risks for a specific system's changes."""
        risk_factors = []
        if data_type == "mounts":
            # Check for removed critical mounts
            for mount in system_diffs.get("removed_mounts", []):
                if mount["mountpoint"] in ["/", "/boot", "/usr", "/var"]:
                    risk_factors.append({
                        "type": "critical_mount_removed",
                        "description": f"Critical mount point removed: {mount['mountpoint']}",
                        "level": 4
                    })
            # Check for significant usage changes
            for change in system_diffs.get("usage_changes", []):
                try:
                    # Only the post-migration value matters here; accept "94%" or 94
                    after_pct = int(str(change["after"].get("use_percent", "0")).rstrip("%"))
                    if after_pct > 95:
                        risk_factors.append({
                            "type": "filesystem_full",
                            "description": f"Filesystem usage critical: {change['mountpoint']} at {after_pct}%",
                            "level": 3
                        })
                except (ValueError, KeyError):
                    pass
        elif data_type == "services":
            # Check for critical service changes
            critical_services = ["sshd", "systemd", "networking", "dbus"]
            for service in system_diffs.get("removed_services", []):
                if service["name"] in critical_services:
                    risk_factors.append({
                        "type": "critical_service_removed",
                        "description": f"Critical service removed: {service['name']}",
                        "level": 4
                    })
            for change in system_diffs.get("status_changes", []):
                if change["after"]["active_state"] == "failed":
                    risk_factors.append({
                        "type": "service_failure",
                        "description": f"Service failed: {change['name']}",
                        "level": 3
                    })
        elif data_type == "disk_usage":
            for change in system_diffs.get("significant_usage_changes", []):
                if change["change_percent"] > 20:
                    risk_factors.append({
                        "type": "disk_usage_spike",
                        "description": f"Significant disk usage change: {change['mountpoint']} ({change['change_percent']}%)",
                        "level": 2
                    })
        return risk_factors
    def generate_recommendations(self, risk_assessment: Dict[str, Any]) -> List[str]:
        """Generate recommendations based on risk assessment."""
        recommendations = []
        if risk_assessment["overall_risk"] in ["high", "critical"]:
            recommendations.append("Immediate review required - critical changes detected")
            recommendations.append("Consider rolling back migration if critical services are affected")
        if any(f["type"] == "critical_mount_removed" for f in risk_assessment["risk_factors"]):
            recommendations.append("Verify system boot capability after mount changes")
        if any(f["type"] == "critical_service_removed" for f in risk_assessment["risk_factors"]):
            recommendations.append("Ensure critical services are restored before production cutover")
        if any(f["type"] == "filesystem_full" for f in risk_assessment["risk_factors"]):
            recommendations.append("Monitor disk space closely - cleanup may be required")
        if not recommendations:
            recommendations.append("Changes appear safe - proceed with standard validation procedures")
        return recommendations
    def validate_changes(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Validate that changes meet requirements."""
        validation_results = {
            "passed": True,
            "checks": [],
            "failed_checks": []
        }
        # Define validation checks
        checks = [
            self.check_critical_services_running,
            self.check_filesystem_integrity,
            self.check_no_critical_mounts_removed
        ]
        for check_func in checks:
            check_result = check_func(differences)
            validation_results["checks"].append(check_result)
            if not check_result["passed"]:
                validation_results["passed"] = False
                validation_results["failed_checks"].append(check_result)
        return validation_results
    def check_critical_services_running(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Fail the check when a critical service has moved to the failed state."""
        check = {
            "name": "critical_services_running",
            "description": "Verify critical services remain operational",
            "passed": True,
            "details": []
        }
        critical_services = ["sshd", "systemd"]
        for data_type, systems in differences.items():
            if data_type == "services":
                for system, system_diffs in systems.items():
                    for change in system_diffs.get("status_changes", []):
                        if change["name"] in critical_services:
                            if change["after"]["active_state"] == "failed":
                                check["passed"] = False
                                check["details"].append(f"Critical service {change['name']} failed on {system}")
        return check
    def check_filesystem_integrity(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Check filesystem integrity after changes."""
        check = {
            "name": "filesystem_integrity",
            "description": "Verify filesystem integrity maintained",
            "passed": True,
            "details": []
        }
        for data_type, systems in differences.items():
            if data_type == "disk_usage":
                for system, system_diffs in systems.items():
                    for change in system_diffs.get("significant_usage_changes", []):
                        if change["change_percent"] > 50:  # Arbitrary threshold
                            check["passed"] = False
                            check["details"].append(f"Extreme usage change on {system}:{change['mountpoint']}")
        return check
    def check_no_critical_mounts_removed(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Check that no critical mount points were removed."""
        check = {
            "name": "no_critical_mounts_removed",
            "description": "Verify critical mount points remain",
            "passed": True,
            "details": []
        }
        critical_mounts = ["/", "/boot", "/usr", "/var"]
        for data_type, systems in differences.items():
            if data_type == "mounts":
                for system, system_diffs in systems.items():
                    for mount in system_diffs.get("removed_mounts", []):
                        if mount["mountpoint"] in critical_mounts:
                            check["passed"] = False
                            check["details"].append(f"Critical mount {mount['mountpoint']} removed from {system}")
        return check
 def compare_snapshots(snapshot1: Dict[str, Any], snapshot2: Dict[str, Any]) -> Dict[str, Any]:
    """Main comparison function."""
    comparator = SnapshotComparator()
    return comparator.compare_snapshots(snapshot1, snapshot2)
@@ -0,0 +1,8 @@
 ---
 skip_list:
  - role-name
  - name[casing]
  - line-too-long
 exclude_paths:
  - .git
@@ -0,0 +1,19 @@
 .PHONY: help test lint syntax validate-assets
 help:
 	@echo "Zabbix Monitoring + Incident Response"
 	@echo "  make test             Run syntax, lint, and asset validation"
 	@echo "  make syntax           Run Ansible syntax checks"
 	@echo "  make lint             Run ansible-lint"
 	@echo "  make validate-assets  Validate template and sample JSON assets"
 test: syntax lint validate-assets
 syntax:
 	ansible-playbook --syntax-check playbooks/*.yml
 lint:
 	ansible-lint
 validate-assets:
 	python3 scripts/validate_assets.py
@@ -0,0 +1,63 @@
 # Zabbix Monitoring + Incident Response
 ## Problem
 Large Linux/Unix environments need simple, reliable OS checks before more advanced observability becomes useful. Filesystems, CPU, memory, network, process status, proxy backlog, and agent availability must be monitored consistently across Linux and AIX estates.
 ## CV Relevance
 This project maps to Zabbix monitoring platform work, proxy maintenance, custom checks, alert noise reduction, and incident response in enterprise environments. It shows operational design and automation without pretending to run AIX locally.
 ## What This Project Demonstrates
 - Ansible-first Zabbix server, proxy, and agent/agent2 configuration structure.
 - Proxy topology for active and passive checks.
 - Linux and AIX OS monitoring templates as reviewable JSON assets.
 - Sample Linux/AIX check data for filesystem, CPU, memory, network, and process monitoring.
 - Runbooks for Zabbix maintenance and incident response.
 ## Architecture
 ```text
 Linux/AIX hosts -> Zabbix agent/agent2 -> Zabbix proxy -> Zabbix server/web
                       |                       |
                       v                       v
                 OS simple checks       proxy queue/cache
 Incident -> Alert -> Operator triage -> Maintenance or remediation evidence
 ```
 ## Quickstart
 ```bash
 cd professional-infra/zabbix-monitoring-incident-response
 make test
 ```
 `make test` performs Ansible syntax/lint checks and validates the Zabbix template/sample JSON assets.
 ## Validation
 ```bash
 ansible-playbook --syntax-check playbooks/*.yml
 ansible-lint
 python3 scripts/validate_assets.py
 ```
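
The contents of `scripts/validate_assets.py` are not shown in this README. As a rough sketch of the kind of check such a script could perform, assuming the assets are plain JSON files under a directory tree, the hypothetical helper `find_invalid_json` below reports every file that fails to parse. It does not validate Zabbix template semantics.

```python
import json
import tempfile
from pathlib import Path


def find_invalid_json(root):
    """Return (path, error) pairs for each *.json file under root that fails to parse."""
    failures = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as exc:
            failures.append((str(path), str(exc)))
    return failures


if __name__ == "__main__":
    # Self-contained demo in a throwaway directory, not the real asset folders.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "good.json").write_text('{"host": "web01"}', encoding="utf-8")
        Path(tmp, "bad.json").write_text('{"host": ', encoding="utf-8")
        for path, error in find_invalid_json(tmp):
            print(f"INVALID {path}: {error}")
```

Collecting all failures before reporting, rather than stopping at the first one, lets a single CI run list every broken asset.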
 ## Example Output
 Sample check payloads are available in `samples/linux-os-checks.json` and `samples/aix-os-checks.json`. These show what a reviewable `zabbix_sender` or API-driven evidence artifact could look like for Linux and AIX hosts.
 ## Interview Talking Points
 - Why Zabbix is suitable for simple OS checks while ELK/Grafana is better for log analysis.
 - How proxies reduce WAN dependency and support branch/client environments.
 - Difference between active and passive checks.
 - How to troubleshoot unsupported items, missing data, proxy backlog, and agent reachability.
 - How Linux and AIX monitoring differ, without claiming a local AIX runtime.
 ## Roadmap
 - Add API import helpers for templates.
 - Add a Docker-based Zabbix server/proxy demo scaffold.
 - Add Wazuh or security monitoring integration as a separate side lab.
@@ -0,0 +1,5 @@
 [defaults]
 roles_path = ./roles
 inventory = ./inventory/hosts.ini
 host_key_checking = False
 retry_files_enabled = False
@@ -0,0 +1,30 @@
 # Incident Response Runbook
 ## Filesystem Alert
 1. Confirm current usage and growth trend.
 2. Check whether the host is Linux or AIX and use the correct runbook.
 3. Validate application ownership of the filesystem.
 4. Clean known temporary paths or request LVM expansion when approved.
 5. Attach before/after evidence to the incident ticket.
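
For steps 1 and 5, a small script can capture a usage record before and after cleanup or expansion. This is a minimal sketch assuming Python 3 is available on the host; `filesystem_evidence` and its field names are illustrative and not part of this repository.

```python
import json
import shutil
from datetime import datetime, timezone


def filesystem_evidence(path):
    """Return a small usage record for the filesystem that contains path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    # Note: df derives Use% from used and available blocks, so its figure can
    # differ slightly from used/total on filesystems with reserved blocks.
    use_percent = round(usage.used / usage.total * 100, 1) if usage.total else 0.0
    return {
        "path": path,
        "total_gib": round(usage.total / 2**30, 1),
        "used_gib": round(usage.used / 2**30, 1),
        "free_gib": round(usage.free / 2**30, 1),
        "use_percent": use_percent,
        "captured_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


if __name__ == "__main__":
    # "/" is only a demo path; pass the alerting mountpoint in practice.
    print(json.dumps(filesystem_evidence("/"), indent=2))
```

Running it once before and once after remediation yields two JSON records that can be attached to the incident ticket.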
 ## Agent Unreachable
 1. Confirm whether data loss affects one host, one proxy, or one network segment.
 2. Check proxy queue and last seen timestamp.
 3. Validate agent service state and firewall path.
 4. For active checks, confirm `ServerActive` points at the expected server or proxy and the agent `Hostname` matches the host name configured in Zabbix.
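
The firewall-path part of step 3 can be approximated with a plain TCP connect test. This is a sketch under assumptions: it uses the Zabbix default ports (10050 for passive agent checks, 10051 for the server or proxy trapper), which may be overridden in a given environment, and `tcp_port_reachable` is a hypothetical helper. A successful connect only shows the port is open, not that the agent accepts the caller.

```python
import socket


def tcp_port_reachable(host, port, timeout=3.0):
    """Return True when a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Demo target only; replace with the agent host (10050) when run from the
    # proxy, or with the proxy/server address (10051) when run from the agent.
    target = ("127.0.0.1", 10050)
    state = "reachable" if tcp_port_reachable(*target) else "unreachable"
    print(f"{target[0]}:{target[1]} is {state}")
```

Run it from the side that initiates the connection: the proxy for passive checks, the agent host for active checks.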
 ## Proxy Backlog
 1. Check server reachability from proxy.
 2. Check proxy DB filesystem usage.
 3. Confirm whether config sync recently changed.
 4. Reduce noise by temporarily disabling non-critical discovery rules if required.
 ## Unsupported Items
 1. Identify affected template and item key.
 2. Check whether item is Linux-specific or AIX-specific.
 3. Validate agent version and custom user parameters.
 4. Roll back template change if canary host group is affected.
@@ -0,0 +1,29 @@
 # Zabbix Maintenance Runbook
 ## Server Checks
 - Confirm Zabbix server process and web frontend availability.
 - Check database health, free space, and slow queries.
 - Review cache usage, poller utilization, and housekeeper activity.
 - Confirm recent values are arriving for representative Linux and AIX hosts.
 ## Proxy Checks
 - Confirm proxy last seen timestamp.
 - Check proxy queue and delayed values.
 - Validate proxy database size and filesystem usage.
 - Confirm connectivity in the direction the proxy mode requires: proxy to server for active mode, server to proxy for passive mode.
 ## Template Maintenance
 - Import templates in a controlled window.
 - Watch unsupported items after import.
 - Validate a small canary host group before wider rollout.
 - Document changed triggers and thresholds.
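The canary step can be reduced to a simple before/after comparison of unsupported-item counts per host. The sketch below assumes those counts have already been collected (for example from the frontend or the API) and shows only the decision logic. `canary_verdict` and its `tolerance` parameter are hypothetical names.

```python
def canary_verdict(before, after, tolerance=0):
    """Compare unsupported-item counts per host before and after a template import.

    Returns ("proceed" or "rollback", {host: (count_before, count_after)}).
    """
    regressions = {
        host: (before.get(host, 0), count)
        for host, count in after.items()
        if count > before.get(host, 0) + tolerance
    }
    return ("rollback" if regressions else "proceed", regressions)


if __name__ == "__main__":
    verdict, detail = canary_verdict(
        {"linux-app01": 0, "aix-core01": 2},
        {"linux-app01": 0, "aix-core01": 5},
    )
    print(verdict, detail)
```

Keeping the per-host detail in the result makes it easy to attach the comparison to the change record.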
 ## Common Failure Modes
 - Agent unreachable: check DNS, firewall, agent service, proxy route.
 - Unsupported item: check key spelling, OS capability, agent version, user parameter.
 - Proxy backlog: check WAN, DB size, proxy process, server availability.
 - Alert noise: review trigger thresholds and dependency design.
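For the "agent unreachable" case, the DNS and firewall checks can be separated with a short probe that resolves the name first and only then attempts a TCP connection. This is a sketch using the standard `socket` module. `check_tcp` is a hypothetical helper, and the port to probe (for example the agent listen port) is whatever your environment uses.

```python
import socket


def check_tcp(host, port, timeout=3.0):
    """Return (ok, detail): resolve the name first, then attempt a TCP connect."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return False, f"DNS resolution failed: {exc}"
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, "TCP connect succeeded"
    except OSError as exc:
        return False, f"TCP connect failed: {exc}"


if __name__ == "__main__":
    # Host name and port are examples; substitute real values.
    print(check_tcp("linux-app01", 10050))
```

A successful connect only proves the network path and listener. It does not prove the agent accepts the caller, since the agent can still reject addresses not listed in its `Server` setting.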
@@ -0,0 +1,27 @@
 # Zabbix Proxy Design
 ## Purpose
 Zabbix proxies reduce dependency on direct connectivity between the central server and monitored hosts. They are useful for client networks, segmented environments, remote sites, and maintenance windows.
 ## Active Proxy
 - Proxy connects to the Zabbix server.
 - Good for restricted networks where inbound access to the proxy is not allowed.
 - Hosts can use active agent checks against the proxy.
 - Main operational checks: proxy last seen, delayed values, local DB size, config sync.
 ## Passive Proxy
 - Zabbix server connects to the proxy.
 - Useful when the central server can reach the proxy network.
 - Requires firewall rules from server to proxy.
 - Main operational checks: proxy listener, network latency, poller load.
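The practical difference between the two modes is who opens the connection, which decides the firewall rule to request. The sketch below encodes that as a lookup. The host names are taken from the demo inventory, the function name is hypothetical, and port 10051 is the usual Zabbix default, so adjust it if `ListenPort` has been changed.

```python
def required_flow(proxy_mode, server="zbx-server01", proxy="zbx-proxy-bank01", port=10051):
    """Return (source, destination, port) for the connection that must be allowed."""
    if proxy_mode == "active":
        # An active proxy initiates the connection to the server.
        return (proxy, server, port)
    if proxy_mode == "passive":
        # The server initiates the connection to a passive proxy.
        return (server, proxy, port)
    raise ValueError(f"unknown proxy mode: {proxy_mode!r}")


if __name__ == "__main__":
    for mode in ("active", "passive"):
        source, destination, port = required_flow(mode)
        print(f"{mode}: allow {source} -> {destination} tcp/{port}")
```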
 ## Operational Signals
 - Proxy queue growth.
 - Unsupported items after template changes.
 - Agent unreachable or active checks delayed.
 - Proxy DB growth during WAN outage.
 - Config sync failures after maintenance.
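Several of these signals can be evaluated together once the raw numbers are available. The sketch below is illustrative only: the `ProxyStatus` fields, the `evaluate` function, and every threshold are assumptions, and collecting the values from Zabbix is left out.

```python
from dataclasses import dataclass


@dataclass
class ProxyStatus:
    name: str
    last_seen_age_s: int    # seconds since the server last heard from the proxy
    delayed_values: int     # values waiting in the proxy queue
    db_used_percent: float  # usage of the filesystem holding the proxy DB


def evaluate(status, max_age_s=300, max_delayed=400, max_db_percent=80.0):
    """Return human-readable warnings for one proxy; thresholds are illustrative."""
    warnings = []
    if status.last_seen_age_s > max_age_s:
        warnings.append(f"{status.name}: last seen {status.last_seen_age_s}s ago")
    if status.delayed_values > max_delayed:
        warnings.append(f"{status.name}: {status.delayed_values} delayed values")
    if status.db_used_percent > max_db_percent:
        warnings.append(f"{status.name}: proxy DB filesystem at {status.db_used_percent}%")
    return warnings


if __name__ == "__main__":
    print(evaluate(ProxyStatus("zbx-proxy-bank01", 45, 420, 61.0)))
```

Evaluating the signals side by side helps distinguish a WAN outage (stale last seen plus growing DB) from a local overload (fresh last seen but a growing queue).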
@@ -0,0 +1,4 @@
 2026-05-04 10:21:14 WARN zbx-proxy-bank01 proxy queue above threshold: 420 delayed values
 2026-05-04 10:22:01 HIGH linux-app01 Root filesystem above 85 percent
 2026-05-04 10:25:33 INFO linux-app01 filesystem cleanup completed, usage back to 74 percent
 2026-05-04 10:30:12 WARN aix-core01 active check delayed, proxy connectivity validated
@@ -0,0 +1,12 @@
 [zabbix_server]
 zbx-server01 ansible_connection=local
 [zabbix_proxy]
 zbx-proxy-bank01 ansible_connection=local zabbix_proxy_mode=active
 zbx-proxy-bank02 ansible_connection=local zabbix_proxy_mode=passive
 [zabbix_agents_linux]
 linux-app01 ansible_connection=local zabbix_agent_mode=active
 [zabbix_agents_aix]
 aix-core01 ansible_connection=local zabbix_agent_mode=active
@@ -0,0 +1,8 @@
 ---
 - name: Configure Zabbix agents
  hosts: zabbix_agents_linux:zabbix_agents_aix
  become: true
  gather_facts: false
  roles:
    - role: zabbix_agent
@@ -0,0 +1,8 @@
 ---
 - name: Configure Zabbix proxy nodes
  hosts: zabbix_proxy
  become: true
  gather_facts: false
  roles:
    - role: zabbix_proxy
@@ -0,0 +1,8 @@
 ---
 - name: Configure Zabbix server control plane
  hosts: zabbix_server
  become: true
  gather_facts: false
  roles:
    - role: zabbix_server
@@ -0,0 +1,7 @@
 ---
 zabbix_agent_server: zbx-proxy-bank01
 zabbix_agent_server_active: zbx-proxy-bank01
 zabbix_agent_hostname: "{{ inventory_hostname }}"
 zabbix_agent_mode: active
 zabbix_agent_listen_port: 10050
 zabbix_agent_include_dir: /etc/zabbix/zabbix_agentd.d
@@ -0,0 +1,3 @@
 # Dashboards
 This directory is reserved for local demo dashboards. The current portfolio scope validates the observability stack scaffold, sample logs, alert intent, and incident simulation without claiming production-ready dashboards.
@@ -0,0 +1,2 @@
 http.host: 0.0.0.0
 pipeline.ecs_compatibility: disabled