Rework portfolio around Linux operations, Zabbix monitoring, migration validation, and ELK/Grafana log observability. Add AAP-style LVM resize workflow, Zabbix server/proxy/agent automation assets, Linux/AIX monitoring templates, and updated validation CI.
@@ -0,0 +1,47 @@
name: ci

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  validate:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Install validation tools
        run: |
          python3 -m venv .venv
          . .venv/bin/activate
          pip install --upgrade pip
          pip install ansible ansible-lint

      - name: Migration validation framework
        run: |
          cd professional-infra/migration-validation-framework
          make test
          make demo

      - name: Linux operations automation
        run: |
          . .venv/bin/activate
          cd professional-infra/linux-operations-automation
          ansible-playbook --syntax-check playbooks/*.yml
          ansible-lint
          make test

      - name: Zabbix monitoring and incident response
        run: |
          . .venv/bin/activate
          cd professional-infra/zabbix-monitoring-incident-response
          make test

      - name: Log observability ELK Grafana
        run: |
          cd professional-infra/log-observability-elk-grafana
          make test
@@ -0,0 +1,24 @@
__pycache__/
**/__pycache__/
*.pyc
*.pyo

# Runtime logs and generated evidence
*.log
logs/
snapshots/
reports/
*.html
migration_report_*.json
migration_report_*.html

# Local environment files
.env
.env.*
!.env.example
.venv/
venv/

# OS/editor noise
.DS_Store
*.swp
@@ -0,0 +1,61 @@
# AI Context File - CV-Aligned Portfolio Guide

## Positioning

This repository should support a Linux/Unix Infrastructure Engineer CV. The main story is operational infrastructure work: Linux operations, Zabbix monitoring, migration validation, incident response, and log observability.

Do not reposition the repo as a generic cloud-native platform portfolio. DevOps side labs can be mentioned, but the main `professional-infra/` section should stay grounded in real operational work.

## Current Professional Projects

### Linux Operations Automation

Technology stack: Ansible, Bash, Docker Compose.

Focus:

- Linux provisioning, patching, hardening, and decommissioning,
- service and failure simulation,
- AAP-style LVM filesystem resize workflow,
- before/after operational evidence.

### Zabbix Monitoring + Incident Response

Technology stack: Ansible, Zabbix template assets, JSON/YAML-style operational docs.

Focus:

- Zabbix server, proxy, and agent automation structure,
- active/passive proxy design,
- Linux and AIX OS monitoring templates,
- maintenance and incident response runbooks,
- sample check and alert evidence.

### Migration Validation Framework

Technology stack: Python, JSON, HTML reports.

Focus:

- pre/post migration snapshots,
- drift detection,
- risk assessment,
- migration evidence reports.

### Log Observability ELK/Grafana

Technology stack: Docker Compose, Elasticsearch, Logstash, Kibana, Grafana, Filebeat.

Focus:

- log ingestion and parsing,
- incident log evidence,
- local demo observability stack.

## Standards

- Every project should have `make test`.
- Documentation should include CV relevance and interview talking points.
- Runtime logs, caches, snapshots, and generated reports should stay out of git.
- AIX should be represented by templates, samples, and runbooks; do not claim local AIX runtime.
- AAP should be represented as workflow/job-template documentation plus playbook variables; do not add AWX/AAP runtime unless explicitly requested.
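
The first standard above (every project has `make test`) is easy to check mechanically. The sketch below is only an illustration: the helper name `projects_missing_test_target` and the simple regex for spotting a `test:` target are assumptions, not part of this repository.

```python
import re
from pathlib import Path

# Matches a Makefile rule that starts with "test:" at the beginning of a line.
TEST_TARGET = re.compile(r"^test\s*:", re.MULTILINE)


def projects_missing_test_target(root: Path) -> list[str]:
    """List project directories under root whose Makefile lacks a `test` target."""
    missing = []
    for project in sorted(p for p in root.iterdir() if p.is_dir()):
        makefile = project / "Makefile"
        # A missing Makefile counts the same as a Makefile without the target.
        if not makefile.is_file() or not TEST_TARGET.search(makefile.read_text()):
            missing.append(project.name)
    return missing
```

Calling `projects_missing_test_target(Path("professional-infra"))` from the repository root would return the names of any projects that still need a `test` target.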
@@ -0,0 +1,39 @@
# Portfolio Changelog

## [1.2.0] - 2026-05-04 - CV-Aligned Infrastructure Portfolio Rework

### Changed

- Reorganized the repository around `professional-infra/` projects aligned with the Linux Engineer CV.
- Repositioned the root README around Linux/Unix operations, Zabbix monitoring, migration validation, and log observability.
- Moved the former infrastructure simulator into Linux Operations Automation.
- Moved the former observability stack into Log Observability ELK/Grafana.

### Added

- Zabbix Monitoring + Incident Response project with Ansible-first server/proxy/agent structure.
- Linux and AIX Zabbix OS monitoring template assets and sample check payloads.
- Zabbix proxy design, maintenance, and incident response runbooks.
- AAP-style LVM filesystem resize workflow in Linux Operations Automation.
- LVM resize workflow documentation and sample evidence output.

### Fixed

- Updated CI paths for the new `professional-infra/` structure.
- Updated runbooks and architecture notes to match the CV-aligned portfolio structure.
- Kept runtime logs, caches, and generated reports out of tracked project evidence.

## [1.1.0] - 2026-05-04 - Portfolio Reliability Pass

### Changed

- Restored lightweight CI for Python validation, Ansible syntax/lint checks, and Docker Compose validation.
- Separated versioned examples from runtime logs, caches, snapshots, and generated reports.

### Fixed

- Hardened migration report generation by escaping untrusted snapshot content.
- Added missing local configuration scaffolding for the ELK/Grafana stack.

## [1.0.0] - 2026-04-29 - Initial Portfolio Release

### Added

- Ansible lifecycle automation examples.
- Migration validation Python CLI.
- ELK/Grafana local observability scaffold.
- Per-project documentation, architecture notes, examples, and scenarios.
Binary file not shown.
@@ -0,0 +1,31 @@
# CV-Aligned Linux / Unix Infrastructure Portfolio

This repository maps my Linux/Unix infrastructure experience into practical, reviewable projects: operations automation, monitoring, incident response, migration validation, and log observability. It is intentionally grounded in the kind of work described in my CV: large Linux/Unix estates, Zabbix monitoring, storage and migration work, provisioning, patching, troubleshooting, and operational evidence.

The repository also leaves room for DevOps side labs, but the main section covers professional infrastructure work rather than a generic cloud-native platform showcase.

## What To Review First

| Order | Project | CV relevance | Technologies | Validation |
| --- | --- | --- | --- | --- |
| 1 | [Linux Operations Automation](professional-infra/linux-operations-automation/) | Linux server deployment, patching, hardening, LVM resize operations, AAP-style automation | Ansible, Bash, Docker Compose | `cd professional-infra/linux-operations-automation && make test` |
| 2 | [Zabbix Monitoring + Incident Response](professional-infra/zabbix-monitoring-incident-response/) | Zabbix maintenance, proxy topology, active/passive checks, Linux/AIX OS monitoring | Ansible, Zabbix templates, YAML | `cd professional-infra/zabbix-monitoring-incident-response && make test` |
| 3 | [Migration Validation Framework](professional-infra/migration-validation-framework/) | Pre/post migration validation, reporting, drift detection, evidence generation | Python, JSON, HTML | `cd professional-infra/migration-validation-framework && make test && make demo` |
| 4 | [Log Observability ELK/Grafana](professional-infra/log-observability-elk-grafana/) | Log ingestion, incident evidence, environment observability | Docker, ELK, Grafana, Filebeat | `cd professional-infra/log-observability-elk-grafana && make test` |

## CV Skills To Repo Map

- **Linux/Unix operations:** Linux Operations Automation, LVM resize workflow, patching, hardening, service checks.
- **Automation:** Ansible playbooks/roles, Bash simulation scripts, Python validation tooling.
- **Monitoring:** Zabbix project for OS checks and proxy operations; ELK/Grafana project for log monitoring.
- **Migration work:** Migration Validation Framework for before/after evidence and drift reports.
- **Incident response:** Zabbix runbooks, ELK incident simulation, failure simulation examples.
- **DevOps practices:** lightweight CI, Git workflows, repeatable `make test` targets, containerized lab components.

## Professional Infrastructure Projects

The `professional-infra/` directory contains the projects that should be read as direct support for the CV. Each project includes a reviewer-focused README, validation command, examples, and interview talking points.

## DevOps Side Labs

Future side projects can live under `devops-labs/` when they are ready. Good candidates are K3s, CI/CD workflow demos, cloud experiments, Wazuh, or other after-hours labs. Empty placeholder project directories are intentionally avoided.
@@ -0,0 +1,64 @@
# Architecture Overview

## Portfolio Shape

This repository is organized around professional infrastructure work that maps to the CV:

```text
professional-infra/
  linux-operations-automation/
  zabbix-monitoring-incident-response/
  migration-validation-framework/
  log-observability-elk-grafana/

devops-labs/
  future side projects only when ready
```

## Project Roles

### Linux Operations Automation

Operational automation for Linux server work:

- provisioning and baseline configuration,
- patching and hardening,
- service/failure simulation,
- AAP-style LVM filesystem resize workflow with before/after evidence.

### Zabbix Monitoring + Incident Response

Simple checks and OS monitoring:

- Zabbix server/proxy/agent automation structure,
- active and passive proxy design,
- Linux and AIX monitoring templates,
- maintenance and incident response runbooks.

### Migration Validation Framework

Evidence tooling for platform/storage migrations:

- before/after snapshot collection,
- drift detection,
- risk assessment,
- JSON and HTML reports.
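
As a rough illustration of the drift detection step listed above, the sketch below compares two snapshot dictionaries. The snapshot keys, the sample data, and the `detect_drift` helper are hypothetical and are not taken from the framework's code.

```python
from typing import Any


def detect_drift(before: dict[str, Any], after: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Return keys that were added, removed, or changed between two snapshots."""
    drift: dict[str, dict[str, Any]] = {"added": {}, "removed": {}, "changed": {}}
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            drift["added"][key] = after[key]
        elif key not in after:
            drift["removed"][key] = before[key]
        elif before[key] != after[key]:
            drift["changed"][key] = {"before": before[key], "after": after[key]}
    return drift


# Hypothetical snapshot data, for illustration only.
pre = {"kernel": "5.14.0", "mounts": ["/", "/data"], "sshd": "active"}
post = {"kernel": "5.14.0", "mounts": ["/"], "chronyd": "active"}
result = detect_drift(pre, post)
```

Keeping added, removed, and changed items in separate buckets makes it straightforward to weight them differently in a later risk assessment.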

### Log Observability ELK/Grafana

Log monitoring and incident evidence:

- Filebeat ingestion,
- Logstash parsing,
- Elasticsearch storage,
- Kibana/Grafana review surfaces,
- incident log simulation.

## Design Principles

- Keep implemented work separate from roadmap ideas.
- Prefer reviewable automation and evidence over overbuilt local labs.
- Make every project independently validatable.
- Treat Zabbix and ELK/Grafana as complementary monitoring tools:
  - Zabbix for simple checks and OS health,
  - ELK/Grafana for logs and observability evidence.
@@ -0,0 +1,92 @@
# Portfolio Runbooks

These runbooks are scoped to the portfolio version of the projects. They favor fast validation and reviewable evidence over full production operations.

## Linux Operations Automation

Validate the implemented Ansible core:

```bash
cd professional-infra/linux-operations-automation
make test
```

Run a safe failure simulation without live SSH hosts:

```bash
make demo
```

Review the AAP-style LVM workflow:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/lvm_resize.yml --syntax-check
cat docs/aap_lvm_resize_workflow.md
```

Run playbooks against your own lab hosts after updating `inventory/hosts.ini`:

```bash
make run
make patch
make harden
make decommission
```

## Zabbix Monitoring + Incident Response

Validate Zabbix playbooks and templates:

```bash
cd professional-infra/zabbix-monitoring-incident-response
make test
```

Review proxy and OS monitoring operations:

```bash
cat docs/proxy-design.md
cat docs/maintenance-runbook.md
cat docs/incident-response-runbook.md
```

Linux and AIX checks are represented as templates and sample data. AIX is not run locally.

## Migration Validation Framework

Validate code and parser/report behavior:

```bash
cd professional-infra/migration-validation-framework
make test
```

Run the included before/after comparison:

```bash
make demo
```

The demo intentionally reports `FAIL` to show a high-risk migration finding.
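
One way a high-risk finding could translate into a `FAIL` result is sketched below. The severity levels, the `Finding` type, the sample findings, and the rule that any high-severity finding fails the run are all assumptions for illustration; the framework's actual scoring may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    check: str
    severity: str  # assumed levels: "low", "medium", "high"


def overall_status(findings: list[Finding]) -> str:
    """Return FAIL if any finding is high severity, WARN for medium, else PASS."""
    severities = {finding.severity for finding in findings}
    if "high" in severities:
        return "FAIL"
    if "medium" in severities:
        return "WARN"
    return "PASS"


# Hypothetical findings, for illustration only.
demo_findings = [
    Finding("mount /data missing after migration", "high"),
    Finding("package count changed", "low"),
]
status = overall_status(demo_findings)
```

With these sample findings, a single high-severity item is enough to set `status` to `FAIL`, regardless of how many low-severity items accompany it.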

## Log Observability ELK/Grafana

Validate Docker Compose and required bind-mounted configs:

```bash
cd professional-infra/log-observability-elk-grafana
make test
```

Generate sample incident logs without starting the full stack:

```bash
make demo
```

Start the full local demo stack with Docker, then stop it when finished:

```bash
make run
make down
```
@@ -0,0 +1,14 @@
---
# Ansible-lint configuration

skip_list:
  - 'role-name'
  - 'name[casing]'
  - 'line-too-long'

exclude_paths:
  - .git
  - .gitea
  - molecule/
  - molecule/default/tests/
  - scenarios/
@@ -0,0 +1,95 @@
# Linux Operations Automation Makefile

.PHONY: help test run demo patch harden decommission lvm-check up down status logs validate clean lint scale-up-web scale-up-db scale-down-web scale-down-db fail-network fail-disk fail-service fail-node scenario-scaling help-scaling help-failure

help: ## Show this help message
	@echo "Linux Operations Automation"
	@echo ""
	@echo "Available commands:"
	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf " %-18s %s\n", $$1, $$2}'

test: ## Run offline validation checks
	ansible-playbook -i inventory/hosts.ini --syntax-check playbooks/*.yml
	ansible-lint

run: ## Run provisioning against the configured inventory
	ansible-playbook -i inventory/hosts.ini playbooks/provision.yml

demo: ## Run a safe local demonstration without requiring live SSH hosts
	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh service 5 web

patch: ## Apply patching workflow against the configured inventory
	ansible-playbook -i inventory/hosts.ini playbooks/patch.yml

harden: ## Apply hardening workflow against the configured inventory
	ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml

decommission: ## Run decommissioning workflow against the configured inventory
	ansible-playbook -i inventory/hosts.ini playbooks/decommission.yml

lvm-check: ## Validate the AAP-style LVM resize workflow
	ansible-playbook -i inventory/hosts.ini --syntax-check playbooks/lvm_resize.yml

up: ## Start the optional local container scaffold
	docker compose up -d

down: ## Stop the optional local container scaffold
	docker compose down

status: ## Show local scaffold status and inventory hosts
	docker compose ps
	ansible -i inventory/hosts.ini --list-hosts all || echo "Inventory check failed"

logs: ## Show local scaffold logs
	docker compose logs -f --tail=100

validate: ## Run all offline validation checks
	$(MAKE) test
	docker compose config --quiet

clean: ## Clean up generated local logs and reports
	rm -f logs/*.log reports/*.txt

lint: ## Lint Ansible content
	ansible-lint

scale-up-web: ## Scale up web servers in simulation mode (usage: make scale-up-web COUNT=2)
	SIMULATION_MODE=true bash ./scripts/simulate_scaling.sh up $(or $(COUNT),1) web

scale-up-db: ## Scale up database servers in simulation mode (usage: make scale-up-db COUNT=1)
	SIMULATION_MODE=true bash ./scripts/simulate_scaling.sh up $(or $(COUNT),1) db

scale-down-web: ## Scale down web servers in simulation mode (usage: make scale-down-web COUNT=1)
	SIMULATION_MODE=true bash ./scripts/simulate_scaling.sh down $(or $(COUNT),1) web

scale-down-db: ## Scale down database servers in simulation mode (usage: make scale-down-db COUNT=1)
	SIMULATION_MODE=true bash ./scripts/simulate_scaling.sh down $(or $(COUNT),1) db

fail-network: ## Simulate network failure safely (usage: make fail-network DURATION=60)
	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh network $(or $(DURATION),60)

fail-disk: ## Simulate disk pressure safely (usage: make fail-disk DURATION=120)
	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh disk $(or $(DURATION),120)

fail-service: ## Simulate service failures safely (usage: make fail-service DURATION=30)
	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh service $(or $(DURATION),30)

fail-node: ## Simulate node failure safely (usage: make fail-node DURATION=300)
	SIMULATION_MODE=true bash ./scripts/simulate_failure.sh node $(or $(DURATION),300)

scenario-scaling: ## Run scaling event syntax validation
	ansible-playbook -i inventory/hosts.ini --syntax-check scenarios/scaling_event.yml

help-scaling: ## Show scaling-related commands
	@echo "Scaling Commands:"
	@echo " make scale-up-web COUNT=2"
	@echo " make scale-up-db COUNT=1"
	@echo " make scale-down-web COUNT=1"
	@echo " make scale-down-db COUNT=1"

help-failure: ## Show failure simulation commands
	@echo "Failure Simulation Commands:"
	@echo " make fail-network DURATION=60"
	@echo " make fail-disk DURATION=120"
	@echo " make fail-service DURATION=30"
	@echo " make fail-node DURATION=300"
@@ -0,0 +1,92 @@
# Linux Operations Automation

## Problem

Linux infrastructure work often starts as ticket-driven operations: deploy a server, patch it, harden SSH, check a failed service, expand a filesystem, and leave evidence that the change was safe. These tasks need automation that is readable, repeatable, and cautious enough for production-style environments.

## CV Relevance

This project maps directly to Linux/Unix operations, server deployment, patching, troubleshooting, and storage/LVM work from enterprise infrastructure environments. The LVM resize workflow is written in an AAP-style shape: explicit survey variables, dry-run defaults, pre-checks, resize actions, and before/after evidence.

## What This Project Demonstrates

- Ansible playbooks for common Linux node lifecycle operations.
- Role-based task organization with clear defaults and handlers.
- LVM filesystem expansion workflow suitable for Ansible Automation Platform job templates.
- Safe simulation scripts for failure, service, and scaling exercises.
- Reviewer-friendly evidence in `examples/` without relying on a live enterprise lab.

## Architecture

```text
Operator -> Make targets -> Ansible inventory -> Playbooks/Roles -> Linux nodes
         -> Simulation scripts -> Example evidence
         -> AAP-style LVM workflow -> Before/after report
```

Core components:

- `inventory/hosts.ini` defines realistic host groups.
- `playbooks/` contains provision, patch, harden, and decommission workflows.
- `playbooks/lvm_resize.yml` contains the storage expansion workflow.
- `roles/` contains the implemented Ansible roles.
- `scripts/` provides safe simulation helpers.
- `docker-compose.yml` is a lightweight local scaffold, not a production lab.

## Quickstart

```bash
cd professional-infra/linux-operations-automation
make test
make demo
```

`make test` runs offline syntax and lint checks. `make demo` runs a safe simulation with `SIMULATION_MODE=true` and does not require reachable SSH hosts.
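
The simulation guard can be pictured as follows. This is a Python sketch of the idea only: the repository's simulation helpers are Bash scripts, and the `simulation_enabled` and `run_action` functions here are hypothetical.

```python
import os
import subprocess
from collections.abc import Mapping
from typing import Optional


def simulation_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Treat SIMULATION_MODE=true (case-insensitive) as 'do not touch real systems'."""
    env = os.environ if env is None else env
    return env.get("SIMULATION_MODE", "false").strip().lower() == "true"


def run_action(command: list[str], env: Optional[Mapping[str, str]] = None) -> str:
    """Describe the command in simulation mode; execute it otherwise."""
    if simulation_enabled(env):
        return "[SIMULATION] would run: " + " ".join(command)
    completed = subprocess.run(command, check=True, capture_output=True, text=True)
    return completed.stdout


# With the flag set, nothing is executed; the command is only described.
message = run_action(["systemctl", "stop", "nginx"], env={"SIMULATION_MODE": "true"})
```

Routing every side effect through one guarded function keeps the safe path and the real path identical up to the final call, which is the property that makes a demo trustworthy.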

To run playbooks against real or lab hosts, update `inventory/hosts.ini` and run:

```bash
make run
make patch
make harden
make decommission
```

Review the LVM workflow:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/lvm_resize.yml --syntax-check
cat docs/aap_lvm_resize_workflow.md
```

## Validation

```bash
make test
docker compose config --quiet
```

The optional compose scaffold can be started and stopped with:

```bash
make up
make down
```

## Example Output

Sample evidence is available in [examples/patch-output.txt](examples/patch-output.txt), [examples/failure-simulation.txt](examples/failure-simulation.txt), and [examples/lvm-resize-output.txt](examples/lvm-resize-output.txt).

## Interview Talking Points

- How to make LVM resize automation safe with dry-run defaults and explicit approval.
- Why before/after evidence matters for storage and filesystem changes.
- How Ansible roles keep Linux baseline operations repeatable.
- Where AAP surveys and job templates reduce ticket handling errors.

## Roadmap

- Add complete service roles for application deployment examples.
- Add backup, security scan, and disaster recovery playbooks.
- Add a richer local lab with SSH-ready containers.
- Add cloud or Kubernetes deployment variants.
@@ -0,0 +1,43 @@
# Vault Configuration Guide

## Overview

The current portfolio demo does not require Ansible Vault for `make test` or `make demo`. Secrets are intentionally kept out of the main validation path so reviewers can run the project offline.

Use Vault only when extending the project to manage real hosts or credentials.

## Recommended Pattern

1. Start from the example file:

```bash
cp group_vars/vault.example.yml group_vars/vault.yml
```

2. Replace placeholder values locally.

3. Encrypt the file before using it with real systems:

```bash
ansible-vault encrypt group_vars/vault.yml
```

4. Do not commit real secret values. Keep `group_vars/vault.example.yml` as the committed reference.

## Running With Vault

```bash
ansible-playbook -i inventory/hosts.ini playbooks/provision.yml --ask-vault-pass
```

or:

```bash
ansible-playbook -i inventory/hosts.ini playbooks/provision.yml --vault-password-file ~/.vault_pass.txt
```

## Notes

- The delivered playbooks do not import a vault file by default.
- Add `vars_files` only in an environment-specific branch or private overlay.
- Prefer a secret manager or automation controller for production use.
@@ -0,0 +1,5 @@
[defaults]
roles_path = ./roles
inventory = ./inventory/hosts.ini
host_key_checking = False
retry_files_enabled = False
@@ -0,0 +1,28 @@
services:
  web:
    image: debian:12-slim
    command: ["sleep", "infinity"]
    networks:
      infra_sim:
        ipv4_address: 172.20.0.11

  db:
    image: debian:12-slim
    command: ["sleep", "infinity"]
    networks:
      infra_sim:
        ipv4_address: 172.20.0.21

  lb:
    image: debian:12-slim
    command: ["sleep", "infinity"]
    networks:
      infra_sim:
        ipv4_address: 172.20.0.31

networks:
  infra_sim:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24
@@ -0,0 +1,45 @@
# AAP-Style LVM Resize Workflow

## Purpose

This workflow shows how a routine storage ticket can be converted into a controlled Ansible Automation Platform job. It is intentionally conservative: dry-run is the default, required variables are explicit, and every run produces before/after evidence.

## Suggested Job Template

- Name: `Linux - LVM Filesystem Resize`
- Inventory: Linux production or pre-production inventory
- Playbook: `playbooks/lvm_resize.yml`
- Credentials: privileged Linux automation credential
- Privilege escalation: enabled
- Default extra vars:

```yaml
lvm_dry_run: true
lvm_resize_filesystem: true
```

## Suggested Survey Variables

| Variable | Example | Required | Notes |
| --- | --- | --- | --- |
| `lvm_vg_name` | `vg_app` | yes | Target volume group. |
| `lvm_lv_name` | `lv_data` | yes | Target logical volume. |
| `lvm_mountpoint` | `/data` | yes | Filesystem mountpoint to validate before/after. |
| `lvm_size_request` | `+20G` | yes | Passed to `lvextend -L`; use explicit growth syntax for tickets. |
| `lvm_dry_run` | `true` | yes | Start with `true`; switch to `false` after evidence review. |
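
The survey inputs above can be sanity-checked before a job is launched. The sketch below is one possible pre-check, not the playbook's own validation: the `validate_survey` helper is hypothetical, and restricting `lvm_size_request` to growth-only values with `M`, `G`, or `T` units is an assumption based on the "explicit growth syntax" note in the table.

```python
import re

REQUIRED_VARS = ("lvm_vg_name", "lvm_lv_name", "lvm_mountpoint", "lvm_size_request", "lvm_dry_run")
# Growth-only syntax such as "+20G"; units limited to M, G, T for this sketch.
SIZE_PATTERN = re.compile(r"^\+\d+[MGT]$")


def validate_survey(extra_vars: dict) -> list[str]:
    """Return human-readable problems; an empty list means the input looks sane."""
    problems = [f"missing required variable: {name}" for name in REQUIRED_VARS if name not in extra_vars]
    size = extra_vars.get("lvm_size_request")
    if size is not None and not SIZE_PATTERN.match(str(size)):
        problems.append(f"lvm_size_request must look like +20G, got {size!r}")
    mountpoint = extra_vars.get("lvm_mountpoint")
    if mountpoint is not None and not str(mountpoint).startswith("/"):
        problems.append(f"lvm_mountpoint must be an absolute path, got {mountpoint!r}")
    if "lvm_dry_run" in extra_vars and not isinstance(extra_vars["lvm_dry_run"], bool):
        problems.append("lvm_dry_run must be a boolean")
    return problems


# Values taken from the Example column of the survey table.
ticket = {
    "lvm_vg_name": "vg_app",
    "lvm_lv_name": "lv_data",
    "lvm_mountpoint": "/data",
    "lvm_size_request": "+20G",
    "lvm_dry_run": True,
}
problems = validate_survey(ticket)
```

Rejecting a bare `20G` matters because `lvextend -L 20G` sets an absolute size while `-L +20G` grows by that amount, so the two requests mean different things on the same ticket.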

## Safety Notes

- Run with `lvm_dry_run=true` first and attach output to the ticket.
- Confirm backup/snapshot status before actual resize.
- Confirm filesystem type; this workflow supports XFS and ext filesystems.
- Keep requested size aligned with the ticket approval.
- Use maintenance windows for critical systems.
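
The filesystem-type check matters because XFS and ext filesystems are grown with different tools. The sketch below shows that branching in Python; the `grow_command` helper is hypothetical, and the playbook itself may handle this differently (for example through Ansible modules or `lvextend --resizefs`).

```python
def grow_command(fs_type: str, lv_path: str, mountpoint: str) -> list[str]:
    """Pick the filesystem grow command for a given filesystem type."""
    if fs_type == "xfs":
        # xfs_growfs operates on the mounted filesystem.
        return ["xfs_growfs", mountpoint]
    if fs_type in {"ext2", "ext3", "ext4"}:
        # resize2fs operates on the block device.
        return ["resize2fs", lv_path]
    # Anything else is outside what this workflow claims to support.
    raise ValueError(f"unsupported filesystem type: {fs_type}")


command = grow_command("xfs", "/dev/vg_app/lv_data", "/data")
```

Failing loudly on an unknown filesystem type is deliberate: a resize job that guesses is worse than one that stops and asks for review.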

## Evidence Captured

- `lsblk --fs`
- `pvs`, `vgs`, `lvs`
- `df -hT <mountpoint>` before and after
- target LV path and filesystem type
- dry-run flag and requested size
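
Before/after `df -hT` rows become easier to compare once they are split into named fields. The sketch below is illustrative only: `parse_df_line` and `size_changed` are hypothetical helpers, and the parser assumes a mountpoint without spaces. The two sample rows match the dry-run example output in `examples/lvm-resize-output.txt`.

```python
def parse_df_line(line: str) -> dict[str, str]:
    """Split one data row of `df -hT` output into named fields."""
    fields = line.split()
    if len(fields) != 7:
        raise ValueError(f"unexpected df -hT row: {line!r}")
    keys = ("filesystem", "type", "size", "used", "avail", "use_percent", "mountpoint")
    return dict(zip(keys, fields))


def size_changed(before_row: str, after_row: str) -> bool:
    """Report whether the filesystem size differs between two df -hT rows."""
    return parse_df_line(before_row)["size"] != parse_df_line(after_row)["size"]


before = "/dev/mapper/vg_app-lv_data xfs 100G 83G 17G 84% /data"
after = "/dev/mapper/vg_app-lv_data xfs 100G 83G 17G 84% /data"
# In a dry run nothing is resized, so the size column should be identical.
unchanged_in_dry_run = not size_changed(before, after)
```

The same comparison can be inverted for a real run, where an unchanged size after `lvm_dry_run=false` would be the finding worth flagging.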
@@ -0,0 +1,30 @@
# Linux Operations Automation Architecture

## Components

- Operator interface: `make` targets and direct Ansible commands.
- Inventory: static host groups in `inventory/hosts.ini`.
- Automation: lifecycle playbooks in `playbooks/`.
- Simulation scripts: controlled failure and scaling events in `scripts/`.
- Evidence: logs, reports, scenario notes, and examples.

## Data Flow

```
Operator
  -> Make target or shell script
  -> Ansible inventory
  -> lifecycle playbook
  -> managed Linux node
  -> log/report artifact
```

Failure drills follow a parallel flow:

```
Operator -> simulate_failure.sh -> target node/service -> health check -> patch/hardening playbook -> evidence
```

## Notes

The project favors explicit playbooks over hidden orchestration so the operational intent is visible during review. In a production implementation, the same workflows would typically run from a CI runner or automation controller with credentials supplied by a secret manager.
@@ -0,0 +1,8 @@
2026-04-29 02:13:41 - Starting failure simulation: service 30 web
2026-04-29 02:13:41 - Simulating service failures on containers: web
2026-04-29 02:13:42 - Stopping services in container enterprise-web-1
2026-04-29 02:13:44 - Health probe failed: http://web01/health returned 503
2026-04-29 02:14:12 - Cleaning up failure simulation
2026-04-29 02:14:13 - Restarted nginx in enterprise-web-1
2026-04-29 02:14:18 - Health probe recovered: http://web01/health returned 200
2026-04-29 02:14:18 - Failure simulation completed successfully
@@ -0,0 +1,19 @@
|
|||||||
|
TASK [Report LVM resize evidence] **********************************************
|
||||||
|
ok: [app01] => {
|
||||||
|
"msg": {
|
||||||
|
"host": "app01",
|
||||||
|
"dry_run": true,
|
||||||
|
"target": "/dev/vg_app/lv_data",
|
||||||
|
"mountpoint": "/data",
|
||||||
|
"requested_size": "+20G",
|
||||||
|
"filesystem_type": "xfs",
|
||||||
|
"before_df": [
|
||||||
|
"Filesystem Type Size Used Avail Use% Mounted on",
|
||||||
|
"/dev/mapper/vg_app-lv_data xfs 100G 83G 17G 84% /data"
|
||||||
|
],
|
||||||
|
"after_df": [
|
||||||
|
"Filesystem Type Size Used Avail Use% Mounted on",
|
||||||
|
"/dev/mapper/vg_app-lv_data xfs 100G 83G 17G 84% /data"
|
||||||
|
]
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,33 @@
|
|||||||
|
PLAY [Apply Security Patches and Updates] **************************************
|
||||||
|
|
||||||
|
TASK [Update package cache] *****************************************************
|
||||||
|
changed: [web01]
|
||||||
|
changed: [db01]
|
||||||
|
ok: [lb01]
|
||||||
|
|
||||||
|
TASK [Check for available updates] **********************************************
|
||||||
|
ok: [web01] => {"stdout": "9"}
|
||||||
|
ok: [db01] => {"stdout": "4"}
|
||||||
|
ok: [lb01] => {"stdout": "0"}
|
||||||
|
|
||||||
|
TASK [Apply security updates only] **********************************************
|
||||||
|
changed: [web01]
|
||||||
|
changed: [db01]
|
||||||
|
ok: [lb01]
|
||||||
|
|
||||||
|
TASK [Verify critical services] *************************************************
|
||||||
|
ok: [web01] => (item=systemd-journald)
|
||||||
|
ok: [web01] => (item=cron)
|
||||||
|
ok: [db01] => (item=systemd-journald)
|
||||||
|
ok: [lb01] => (item=cron)
|
||||||
|
|
||||||
|
PLAY RECAP *********************************************************************
|
||||||
|
web01 : ok=19 changed=6 unreachable=0 failed=0 skipped=2 rescued=0 ignored=1
|
||||||
|
db01 : ok=18 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=1
|
||||||
|
lb01 : ok=15 changed=1 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0
|
||||||
|
|
||||||
|
Patch report
|
||||||
|
Status: SUCCESS
|
||||||
|
Window: 02:00-04:00 UTC
|
||||||
|
Reboot required: false
|
||||||
|
Notification: infra-team@example.com
|
||||||
@@ -0,0 +1,20 @@
|
|||||||
|
---
|
||||||
|
# Group variables for all hosts
|
||||||
|
|
||||||
|
# SSH Configuration
|
||||||
|
ssh_config:
|
||||||
|
port: 22
|
||||||
|
max_auth_tries: 3
|
||||||
|
alive_interval: 300
|
||||||
|
|
||||||
|
# Firewall defaults
|
||||||
|
firewall_enabled: true
|
||||||
|
firewall_default_policy: deny
|
||||||
|
|
||||||
|
# Patching defaults
|
||||||
|
patch_enabled: true
|
||||||
|
enforce_patch_window: true
|
||||||
|
|
||||||
|
# Services monitoring
|
||||||
|
enable_monitoring: false
|
||||||
|
enable_health_checks: true
|
||||||
@@ -0,0 +1,9 @@
|
|||||||
|
---
|
||||||
|
# Database servers group configuration
|
||||||
|
db_type: postgresql
|
||||||
|
db_port: 5432
|
||||||
|
db_backup_enabled: true
|
||||||
|
db_backup_path: /var/backups/database
|
||||||
|
|
||||||
|
# Database user (use vault for production)
|
||||||
|
db_admin_user: postgres
|
||||||
@@ -0,0 +1,10 @@
|
|||||||
|
---
|
||||||
|
# Load balancers group configuration
|
||||||
|
lb_type: haproxy
|
||||||
|
lb_port: 443
|
||||||
|
lb_stats_port: 8404
|
||||||
|
lb_stats_enabled: true
|
||||||
|
|
||||||
|
# Frontend configuration
|
||||||
|
frontend_host: "0.0.0.0"
|
||||||
|
frontend_port: 80
|
||||||
@@ -0,0 +1,10 @@
|
|||||||
|
---
|
||||||
|
# Monitoring servers group configuration
|
||||||
|
monitoring_type: prometheus
|
||||||
|
monitoring_port: 9090
|
||||||
|
monitoring_retention: 30d
|
||||||
|
monitoring_scrape_interval: 15s
|
||||||
|
|
||||||
|
# Grafana configuration
|
||||||
|
grafana_port: 3000
|
||||||
|
grafana_admin_password: "{{ vault_grafana_password }}"
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
---
|
||||||
|
# Example variables for secret values.
|
||||||
|
# Copy these keys into an Ansible Vault encrypted file when real secrets are needed.
|
||||||
|
|
||||||
|
admin_password: "replace-with-vault-managed-value"
|
||||||
|
db_root_password: "replace-with-vault-managed-value"
|
||||||
|
vault_grafana_password: "replace-with-vault-managed-value"
|
||||||
|
ssh_key_passphrase: "replace-with-vault-managed-value"
|
||||||
@@ -0,0 +1,11 @@
|
|||||||
|
---
|
||||||
|
# Webservers group configuration
|
||||||
|
webserver_type: nginx
|
||||||
|
http_port: 80
|
||||||
|
https_port: 443
|
||||||
|
health_check_path: /health
|
||||||
|
|
||||||
|
# Application configuration
|
||||||
|
app_name: "{{ group_names[0] | default('app') }}"
|
||||||
|
app_user: "{{ admin_user }}"
|
||||||
|
app_group: "{{ admin_user }}"
|
||||||
@@ -0,0 +1,35 @@
|
|||||||
|
[webservers]
|
||||||
|
web01 ansible_host=172.20.0.11 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
|
||||||
|
web02 ansible_host=172.20.0.12 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
|
||||||
|
web03 ansible_host=172.20.0.13 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
|
||||||
|
|
||||||
|
[databases]
|
||||||
|
db01 ansible_host=172.20.0.21 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
|
||||||
|
db02 ansible_host=172.20.0.22 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
|
||||||
|
|
||||||
|
[loadbalancers]
|
||||||
|
lb01 ansible_host=172.20.0.31 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
|
||||||
|
|
||||||
|
[monitoring]
|
||||||
|
mon01 ansible_host=172.20.0.41 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
|
||||||
|
|
||||||
|
[all:vars]
|
||||||
|
ansible_python_interpreter=/usr/bin/python3
|
||||||
|
ansible_ssh_common_args='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
|
||||||
|
ansible_connection=ssh
|
||||||
|
|
||||||
|
[webservers:vars]
|
||||||
|
node_type=web
|
||||||
|
environment=production
|
||||||
|
|
||||||
|
[databases:vars]
|
||||||
|
node_type=database
|
||||||
|
environment=production
|
||||||
|
|
||||||
|
[loadbalancers:vars]
|
||||||
|
node_type=loadbalancer
|
||||||
|
environment=production
|
||||||
|
|
||||||
|
[monitoring:vars]
|
||||||
|
node_type=monitoring
|
||||||
|
environment=production
|
||||||
@@ -0,0 +1,24 @@
|
|||||||
|
---
|
||||||
|
# Molecule converge playbook - applies roles to test them
|
||||||
|
|
||||||
|
- name: Converge
|
||||||
|
hosts: all
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Update apt cache
|
||||||
|
ansible.builtin.apt:
|
||||||
|
update_cache: true
|
||||||
|
cache_valid_time: 3600
|
||||||
|
when: ansible_os_family == "Debian"
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: base_provision
|
||||||
|
- role: hardening
|
||||||
|
- role: patching
|
||||||
|
|
||||||
|
post_tasks:
|
||||||
|
- name: Print Ansible facts
|
||||||
|
debug:
|
||||||
|
var: ansible_facts
|
||||||
@@ -0,0 +1,15 @@
|
|||||||
|
---
|
||||||
|
# Molecule destroy playbook
|
||||||
|
|
||||||
|
- name: Destroy
|
||||||
|
hosts: localhost
|
||||||
|
gather_facts: false
|
||||||
|
tasks:
|
||||||
|
- name: Destroy molecule containers
|
||||||
|
community.docker.docker_container:
|
||||||
|
name: "{{ item }}"
|
||||||
|
state: absent
|
||||||
|
force_kill: true
|
||||||
|
loop: "{{ molecule_yml.platforms | map(attribute='name') | list }}"
|
||||||
|
register: destroy_result
|
||||||
|
ignore_errors: true
|
||||||
@@ -0,0 +1,31 @@
|
|||||||
|
---
|
||||||
|
# Molecule configuration for Ansible role testing
|
||||||
|
|
||||||
|
driver:
|
||||||
|
name: docker
|
||||||
|
|
||||||
|
platforms:
|
||||||
|
- name: ubuntu-22.04
|
||||||
|
image: geerlingguy/docker-ubuntu2204-ansible:latest
|
||||||
|
pre_build_image: true
|
||||||
|
privileged: true
|
||||||
|
volumes:
|
||||||
|
- /sys/fs/cgroup:/sys/fs/cgroup:rw
|
||||||
|
|
||||||
|
provisioner:
|
||||||
|
name: ansible
|
||||||
|
config_options:
|
||||||
|
defaults:
|
||||||
|
gathering: smart
|
||||||
|
fact_caching: jsonfile
|
||||||
|
fact_caching_connection: /tmp/ansible_facts
|
||||||
|
fact_caching_timeout: 3600
|
||||||
|
deprecation_warnings: false
|
||||||
|
|
||||||
|
verifier:
|
||||||
|
name: ansible
|
||||||
|
directory: molecule/default/tests
|
||||||
|
|
||||||
|
lint: |
|
||||||
|
yamllint .
|
||||||
|
ansible-lint
|
||||||
@@ -0,0 +1,32 @@
|
|||||||
|
---
|
||||||
|
# Molecule verify playbook - runs tests to verify roles
|
||||||
|
|
||||||
|
- name: Verify
|
||||||
|
hosts: all
|
||||||
|
gather_facts: false
|
||||||
|
tasks:
|
||||||
|
- name: Check if base OS packages are installed
|
||||||
|
shell: dpkg -l | grep -E '(curl|wget|vim|htop)'
|
||||||
|
register: package_check
|
||||||
|
failed_when: package_check.rc not in [0, 1]
|
||||||
|
|
||||||
|
- name: Check SSH configuration
|
||||||
|
stat:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
register: ssh_config_stat
|
||||||
|
failed_when: not ssh_config_stat.stat.exists
|
||||||
|
|
||||||
|
- name: Check firewall status
|
||||||
|
ansible.builtin.shell: ufw status | grep -q 'Status: active'
|
||||||
|
register: firewall_check
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Verify admin user exists
|
||||||
|
getent:
|
||||||
|
database: passwd
|
||||||
|
key: infra-admin
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Print verification results
|
||||||
|
debug:
|
||||||
|
msg: "Role verification completed"
|
||||||
@@ -0,0 +1,34 @@
|
|||||||
|
---
|
||||||
|
- name: Decommission Enterprise Infrastructure Nodes
|
||||||
|
hosts: all
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Confirm decommissioning
|
||||||
|
ansible.builtin.pause:
|
||||||
|
prompt: |
|
||||||
|
WARNING: This will decommission {{ inventory_hostname }}
|
||||||
|
Backup Data: {{ backup_data }}
|
||||||
|
Export Config: {{ export_config }}
|
||||||
|
|
||||||
|
Press ENTER to continue or Ctrl+C to cancel
|
||||||
|
|
||||||
|
- name: Display decommissioning information
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: |
|
||||||
|
Decommissioning {{ inventory_hostname }}
|
||||||
|
Auto Shutdown: {{ auto_shutdown }}
|
||||||
|
Backup Enabled: {{ backup_data }}
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: decommission
|
||||||
|
tags: ['decommission', 'cleanup']
|
||||||
|
|
||||||
|
post_tasks:
|
||||||
|
- name: Display decommissioning summary
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: |
|
||||||
|
Decommissioning completed!
|
||||||
|
Host: {{ inventory_hostname }}
|
||||||
|
Backup Location: /var/backups/decommission-{{ ansible_date_time.iso8601 }}/
|
||||||
@@ -0,0 +1,124 @@
|
|||||||
|
---
|
||||||
|
- name: Harden Enterprise Infrastructure Nodes
|
||||||
|
hosts: all
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Validate hardening prerequisites
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- ansible_os_family == "Debian"
|
||||||
|
- cis_level in [1, 2]
|
||||||
|
fail_msg: "Invalid hardening configuration"
|
||||||
|
|
||||||
|
- name: Display hardening information
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: |
|
||||||
|
Hardening {{ inventory_hostname }}
|
||||||
|
CIS Level: {{ cis_level }}
|
||||||
|
Disable Root Login: {{ disable_root_login }}
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: hardening
|
||||||
|
tags: ['hardening', 'security']
|
||||||
|
|
||||||
|
post_tasks:
|
||||||
|
- name: Display hardening summary
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: |
|
||||||
|
Hardening completed successfully!
|
||||||
|
Host: {{ inventory_hostname }}
|
||||||
|
|
||||||
|
when: ansible_os_family == "Debian"
|
||||||
|
|
||||||
|
- name: Configure auditd
|
||||||
|
when: auditd_enabled
|
||||||
|
block:
|
||||||
|
- name: Install auditd
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name: auditd
|
||||||
|
state: present
|
||||||
|
when: ansible_os_family == "Debian"
|
||||||
|
|
||||||
|
- name: Configure audit rules
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: templates/audit.rules.j2
|
||||||
|
dest: /etc/audit/rules.d/hardening.rules
|
||||||
|
mode: '0644'
|
||||||
|
|
||||||
|
- name: Enable auditd service
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: auditd
|
||||||
|
state: started
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Configure AppArmor
|
||||||
|
when: apparmor_enabled and ansible_os_family == "Debian"
|
||||||
|
block:
|
||||||
|
- name: Install apparmor
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name: apparmor
|
||||||
|
state: present
|
||||||
|
when: ansible_os_family == "Debian"
|
||||||
|
|
||||||
|
- name: Enable apparmor service
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: apparmor
|
||||||
|
state: started
|
||||||
|
enabled: true
|
||||||
|
|
||||||
|
- name: Configure sysctl hardening
|
||||||
|
ansible.posix.sysctl:
|
||||||
|
name: "{{ item.key }}"
|
||||||
|
value: "{{ item.value }}"
|
||||||
|
state: present
|
||||||
|
reload: true
|
||||||
|
loop:
|
||||||
|
- { key: 'net.ipv4.ip_forward', value: '0' }
|
||||||
|
- { key: 'net.ipv4.conf.all.send_redirects', value: '0' }
|
||||||
|
- { key: 'net.ipv4.conf.default.send_redirects', value: '0' }
|
||||||
|
- { key: 'net.ipv4.tcp_syncookies', value: '1' }
|
||||||
|
- { key: 'net.ipv4.icmp_echo_ignore_broadcasts', value: '1' }
|
||||||
|
|
||||||
|
- name: Set secure file permissions
|
||||||
|
ansible.builtin.file:
|
||||||
|
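path: "{{ item.path }}"
|
||||||
|
mode: "{{ item.mode }}"
|
||||||
|
owner: root
|
||||||
|
group: "{{ item.group }}"
|
||||||
|
loop:
|
||||||
|
- { path: /etc/passwd, group: root, mode: '0644' }
|
||||||
|
- { path: /etc/group, group: root, mode: '0644' }
|
||||||
|
- { path: /etc/shadow, group: shadow, mode: '0640' }
|
||||||
|
- { path: /etc/gshadow, group: shadow, mode: '0640' }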
|
||||||
|
|
||||||
|
- name: Lock inactive user accounts
|
||||||
|
ansible.builtin.command: usermod -L "{{ item }}"
|
||||||
|
loop: "{{ inactive_users | default([]) }}"
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Configure open file limits
|
||||||
|
community.general.pam_limits:
|
||||||
|
domain: '*'
|
||||||
|
limit_type: hard
|
||||||
|
limit_item: nofile
|
||||||
|
value: 1024
|
||||||
|
|
||||||
|
- name: Generate hardening report
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: templates/hardening_report.j2
|
||||||
|
dest: "/var/log/hardening_report_{{ ansible_date_time.iso8601 }}.log"
|
||||||
|
mode: '0644'
|
||||||
|
|
||||||
|
handlers:
|
||||||
|
- name: restart sshd
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: ssh
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: restart auditd
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: auditd
|
||||||
|
state: restarted
|
||||||
|
when: auditd_enabled
|
||||||
@@ -0,0 +1,149 @@
|
|||||||
|
---
|
||||||
|
- name: AAP-style LVM filesystem resize workflow
|
||||||
|
hosts: all
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
vars:
|
||||||
|
lvm_dry_run: true
|
||||||
|
lvm_vg_name: ""
|
||||||
|
lvm_lv_name: ""
|
||||||
|
lvm_mountpoint: ""
|
||||||
|
lvm_size_request: "+10G"
|
||||||
|
lvm_resize_filesystem: true
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Validate required survey variables
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- lvm_vg_name | length > 0
|
||||||
|
- lvm_lv_name | length > 0
|
||||||
|
- lvm_mountpoint | length > 0
|
||||||
|
- lvm_size_request | length > 0
|
||||||
|
fail_msg: "Required variables: lvm_vg_name, lvm_lv_name, lvm_mountpoint, lvm_size_request"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Capture block device layout before resize
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- lsblk
|
||||||
|
- --fs
|
||||||
|
register: lvm_lsblk_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Capture physical volumes before resize
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- pvs
|
||||||
|
- --noheadings
|
||||||
|
- --units
|
||||||
|
- g
|
||||||
|
register: lvm_pvs_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Capture volume groups before resize
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- vgs
|
||||||
|
- --noheadings
|
||||||
|
- --units
|
||||||
|
- g
|
||||||
|
register: lvm_vgs_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Capture logical volumes before resize
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- lvs
|
||||||
|
- --noheadings
|
||||||
|
- --units
|
||||||
|
- g
|
||||||
|
register: lvm_lvs_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Capture filesystem usage before resize
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- df
|
||||||
|
- -hT
|
||||||
|
- "{{ lvm_mountpoint }}"
|
||||||
|
register: lvm_df_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Validate target logical volume exists
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- lvs
|
||||||
|
- "/dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
|
||||||
|
register: lvm_target_check
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Show dry-run resize command
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: "DRY RUN: would run lvextend -L {{ lvm_size_request }} /dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
|
||||||
|
when: lvm_dry_run | bool
|
||||||
|
|
||||||
|
- name: Extend logical volume
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- lvextend
|
||||||
|
- -L
|
||||||
|
- "{{ lvm_size_request }}"
|
||||||
|
- "/dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
|
||||||
|
register: lvm_lvextend_result
|
||||||
|
changed_when: true
|
||||||
|
when: not (lvm_dry_run | bool)
|
||||||
|
|
||||||
|
- name: Detect filesystem type
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- findmnt
|
||||||
|
- -n
|
||||||
|
- -o
|
||||||
|
- FSTYPE
|
||||||
|
- "{{ lvm_mountpoint }}"
|
||||||
|
register: lvm_fstype
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Resize XFS filesystem
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- xfs_growfs
|
||||||
|
- "{{ lvm_mountpoint }}"
|
||||||
|
changed_when: true
|
||||||
|
when:
|
||||||
|
- not (lvm_dry_run | bool)
|
||||||
|
- lvm_resize_filesystem | bool
|
||||||
|
- lvm_fstype.stdout == "xfs"
|
||||||
|
|
||||||
|
- name: Resize ext filesystem
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- resize2fs
|
||||||
|
- "/dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
|
||||||
|
changed_when: true
|
||||||
|
when:
|
||||||
|
- not (lvm_dry_run | bool)
|
||||||
|
- lvm_resize_filesystem | bool
|
||||||
|
- lvm_fstype.stdout in ["ext2", "ext3", "ext4"]
|
||||||
|
|
||||||
|
- name: Capture filesystem usage after resize
|
||||||
|
ansible.builtin.command:
|
||||||
|
argv:
|
||||||
|
- df
|
||||||
|
- -hT
|
||||||
|
- "{{ lvm_mountpoint }}"
|
||||||
|
register: lvm_df_after
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Report LVM resize evidence
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg:
|
||||||
|
host: "{{ inventory_hostname }}"
|
||||||
|
dry_run: "{{ lvm_dry_run }}"
|
||||||
|
target: "/dev/{{ lvm_vg_name }}/{{ lvm_lv_name }}"
|
||||||
|
mountpoint: "{{ lvm_mountpoint }}"
|
||||||
|
requested_size: "{{ lvm_size_request }}"
|
||||||
|
filesystem_type: "{{ lvm_fstype.stdout | default('unknown') }}"
|
||||||
|
before_df: "{{ lvm_df_before.stdout_lines }}"
|
||||||
|
after_df: "{{ lvm_df_after.stdout_lines }}"
|
||||||
@@ -0,0 +1,31 @@
|
|||||||
|
---
|
||||||
|
- name: Apply Security Patches and Updates
|
||||||
|
hosts: all
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Validate patch prerequisites
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- ansible_os_family == "Debian"
|
||||||
|
fail_msg: "Patching supported only on Debian-based systems"
|
||||||
|
|
||||||
|
- name: Display patch information
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: |
|
||||||
|
Patching {{ inventory_hostname }}
|
||||||
|
Patch Window: {{ patch_window_start }} - {{ patch_window_end }}
|
||||||
|
Security Only: {{ patch_security_only }}
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: patching
|
||||||
|
tags: ['patch', 'updates']
|
||||||
|
|
||||||
|
post_tasks:
|
||||||
|
- name: Display patching summary
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: |
|
||||||
|
Patching completed!
|
||||||
|
Host: {{ inventory_hostname }}
|
||||||
|
Reboot Required: {{ reboot_required | default(false) }}
|
||||||
@@ -0,0 +1,33 @@
|
|||||||
|
---
|
||||||
|
- name: Provision Enterprise Infrastructure Nodes
|
||||||
|
hosts: all
|
||||||
|
become: true
|
||||||
|
gather_facts: true
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Validate Ansible version
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- ansible_version.full is version('2.9', '>=')
|
||||||
|
fail_msg: "Ansible 2.9+ is required"
|
||||||
|
|
||||||
|
- name: Display provisioning information
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: |
|
||||||
|
Provisioning {{ inventory_hostname }}
|
||||||
|
OS: {{ ansible_os_family }}
|
||||||
|
Python: {{ ansible_python_version }}
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: base_provision
|
||||||
|
tags: ['provision', 'base']
|
||||||
|
|
||||||
|
post_tasks:
|
||||||
|
- name: Generate provisioning summary
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: |
|
||||||
|
Provisioning completed successfully!
|
||||||
|
Host: {{ inventory_hostname }}
|
||||||
|
IP: {{ ansible_default_ipv4.address }}
|
||||||
|
OS: {{ ansible_distribution }} {{ ansible_distribution_version }}
|
||||||
@@ -0,0 +1,48 @@
|
|||||||
|
# Base Provision Role
|
||||||
|
|
||||||
|
Provision basic infrastructure on enterprise nodes with security hardening.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- **Idempotent**: All tasks use proper idempotency markers (`changed_when`, `failed_when`)
|
||||||
|
- **Handlers**: SSH and fail2ban restarts use handlers instead of direct service calls
|
||||||
|
- **Variables**: All configuration in `defaults/main.yml` - no hardcoding
|
||||||
|
- **Validation**: Pre-flight checks for system requirements
|
||||||
|
- **Firewall**: UFW firewall configuration with configurable rules
|
||||||
|
- **SSH Security**: Root login disabled, password auth disabled, key-based auth only
|
||||||
|
|
||||||
|
## Role Variables
|
||||||
|
|
||||||
|
See `defaults/main.yml` for all available variables.
|
||||||
|
|
||||||
|
### Key Variables
|
||||||
|
|
||||||
|
- `node_timezone`: System timezone (default: UTC)
|
||||||
|
- `admin_user`: Admin username for infrastructure access
|
||||||
|
- `ssh_port`: SSH service port (default: 22)
|
||||||
|
- `base_packages`: List of base packages to install
|
||||||
|
- `firewall_enabled`: Enable UFW firewall (default: true)
|
||||||
|
- `firewall_allowed_tcp_ports`: Allowed TCP ports for firewall
|
||||||
|
|
||||||
|
## Secret Variables
|
||||||
|
|
||||||
|
This portfolio demo does not require secrets for offline validation. If you extend it with real passwords or keys, copy `group_vars/vault.example.yml` into an encrypted Ansible Vault file and keep real values out of normal git history.
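
One common way to keep real values out of plain-text variables is to store them under `vault_`-prefixed keys in an encrypted file and reference those keys elsewhere. The sketch below assumes hypothetical file paths and an additional `vault_db_root_password` key; only `vault_grafana_password` is referenced by this project's group variables.

```yaml
---
# group_vars/all/vault.yml (hypothetical path). Encrypt this file with
# `ansible-vault encrypt` before committing; values here are placeholders.
vault_grafana_password: "replace-with-vault-managed-value"
vault_db_root_password: "replace-with-vault-managed-value"
---
# group_vars/all/vars.yml (hypothetical path). Committed in plain text;
# it only points at the encrypted keys above.
grafana_admin_password: "{{ vault_grafana_password }}"
db_root_password: "{{ vault_db_root_password }}"
```

Keeping the indirection in a plain-text file means variable names stay searchable in the repository while the values remain encrypted.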
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- role: base_provision
|
||||||
|
vars:
|
||||||
|
node_timezone: "Europe/Warsaw"
|
||||||
|
firewall_enabled: true
|
||||||
|
```
|
||||||
|
|
||||||
|
## Handlers
|
||||||
|
|
||||||
|
- `restart sshd`: Restarts SSH service (triggered by config changes)
|
||||||
|
- `restart fail2ban`: Restarts fail2ban service (triggered by config changes)
|
||||||
|
|
||||||
|
## Tags
|
||||||
|
|
||||||
|
- `provision`: All provisioning tasks
|
||||||
|
- `base`: Base provision role tasks
|
||||||
@@ -0,0 +1,44 @@
|
|||||||
|
---
|
||||||
|
# Base provisioning configuration
|
||||||
|
node_timezone: "UTC"
|
||||||
|
admin_user: "infra-admin"
|
||||||
|
ssh_port: 22
|
||||||
|
ssh_disabled_root_login: true
|
||||||
|
ssh_disable_password_auth: true
|
||||||
|
|
||||||
|
# Packages to install
|
||||||
|
base_packages:
|
||||||
|
- curl
|
||||||
|
- wget
|
||||||
|
- vim
|
||||||
|
- htop
|
||||||
|
- net-tools
|
||||||
|
- iptables
|
||||||
|
- fail2ban
|
||||||
|
- unattended-upgrades
|
||||||
|
|
||||||
|
# Firewall rules
|
||||||
|
firewall_enabled: true
|
||||||
|
firewall_default_policy: deny
|
||||||
|
firewall_allowed_tcp_ports:
|
||||||
|
- 22
|
||||||
|
- 80
|
||||||
|
- 443
|
||||||
|
|
||||||
|
# Application directories
|
||||||
|
app_directories:
|
||||||
|
- path: /opt/application
|
||||||
|
owner: "{{ admin_user }}"
|
||||||
|
group: "{{ admin_user }}"
|
||||||
|
mode: '0755'
|
||||||
|
- path: /var/log/application
|
||||||
|
owner: "{{ admin_user }}"
|
||||||
|
group: "{{ admin_user }}"
|
||||||
|
mode: '0755'
|
||||||
|
- path: /etc/application
|
||||||
|
owner: root
|
||||||
|
group: root
|
||||||
|
mode: '0755'
|
||||||
|
|
||||||
|
# Service verification
|
||||||
|
services_to_verify: []
|
||||||
@@ -0,0 +1,11 @@
|
|||||||
|
---
|
||||||
|
- name: restart sshd
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: ssh
|
||||||
|
state: restarted
|
||||||
|
|
||||||
|
- name: restart fail2ban
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: fail2ban
|
||||||
|
state: restarted
|
||||||
|
enabled: true
|
||||||
@@ -0,0 +1,138 @@
|
|||||||
|
---
|
||||||
|
- name: Validate system requirements
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- ansible_os_family == "Debian"
|
||||||
|
- ansible_python_version is version('3.6', '>=')
|
||||||
|
fail_msg: "Unsupported system - requires Debian and Python 3.6+"
|
||||||
|
|
||||||
|
- name: Update package cache
|
||||||
|
ansible.builtin.apt:
|
||||||
|
update_cache: true
|
||||||
|
cache_valid_time: 3600
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Install base packages
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name: "{{ base_packages }}"
|
||||||
|
state: present
|
||||||
|
update_cache: true
|
||||||
|
|
||||||
|
- name: Check if admin user exists
|
||||||
|
ansible.builtin.getent:
|
||||||
|
database: passwd
|
||||||
|
key: "{{ admin_user }}"
|
||||||
|
register: admin_check
|
||||||
|
failed_when: false
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Create admin user
|
||||||
|
ansible.builtin.user:
|
||||||
|
name: "{{ admin_user }}"
|
||||||
|
groups: sudo
|
||||||
|
append: true
|
||||||
|
create_home: true
|
||||||
|
shell: /bin/bash
|
||||||
|
when: admin_check.ansible_facts is not defined
|
||||||
|
|
||||||
|
- name: Configure timezone
|
||||||
|
community.general.timezone:
|
||||||
|
name: "{{ node_timezone }}"
|
||||||
|
|
||||||
|
- name: Configure SSH security
|
||||||
|
block:
|
||||||
|
- name: Disable root SSH login
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^PermitRootLogin'
|
||||||
|
line: 'PermitRootLogin no'
|
||||||
|
state: present
|
||||||
|
when: ssh_disabled_root_login
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Set SSH port
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^Port'
|
||||||
|
line: "Port {{ ssh_port }}"
|
||||||
|
state: present
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Disable password authentication
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^PasswordAuthentication'
|
||||||
|
line: 'PasswordAuthentication no'
|
||||||
|
state: present
|
||||||
|
when: ssh_disable_password_auth
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Configure firewall
|
||||||
|
block:
|
||||||
|
- name: Enable UFW firewall
|
||||||
|
community.general.ufw:
|
||||||
|
state: enabled
|
||||||
|
policy: "{{ firewall_default_policy }}"
|
||||||
|
when: firewall_enabled
|
||||||
|
|
||||||
|
- name: Allow SSH access
|
||||||
|
community.general.ufw:
|
||||||
|
rule: allow
|
||||||
|
port: "{{ ssh_port }}"
|
||||||
|
proto: tcp
|
||||||
|
when: firewall_enabled
|
||||||
|
|
||||||
|
- name: Allow HTTP/HTTPS
|
||||||
|
community.general.ufw:
|
||||||
|
rule: allow
|
||||||
|
port: "{{ item }}"
|
||||||
|
proto: tcp
|
||||||
|
loop: "{{ firewall_allowed_tcp_ports }}"
|
||||||
|
when: firewall_enabled and item not in [ssh_port]
|
||||||
|
|
||||||
|
- name: Configure fail2ban
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: jail.local.j2
|
||||||
|
dest: /etc/fail2ban/jail.local
|
||||||
|
backup: true
|
||||||
|
mode: '0644'
|
||||||
|
notify: restart fail2ban
|
||||||
|
|
||||||
|
- name: Enable unattended upgrades
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/apt/apt.conf.d/20auto-upgrades
|
||||||
|
regexp: '^APT::Periodic::Unattended-Upgrade'
|
||||||
|
line: 'APT::Periodic::Unattended-Upgrade "1";'
|
||||||
|
state: present
|
||||||
|
|
||||||
|
- name: Create application directories
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ item.path }}"
|
||||||
|
state: directory
|
||||||
|
owner: "{{ item.owner }}"
|
||||||
|
group: "{{ item.group }}"
|
||||||
|
mode: "{{ item.mode }}"
|
||||||
|
loop: "{{ app_directories }}"
|
||||||
|
|
||||||
|
- name: Record role-specific service intent
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: "Would configure {{ node_type | default('generic') }} service components in a full lab deployment"
|
||||||
|
|
||||||
|
- name: Verify services are running
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: "{{ item }}"
|
||||||
|
state: started
|
||||||
|
enabled: true
|
||||||
|
loop: "{{ services_to_verify }}"
|
||||||
|
when: services_to_verify | length > 0
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Run health checks
|
||||||
|
ansible.builtin.uri:
|
||||||
|
url: http://localhost/health
|
||||||
|
method: GET
|
||||||
|
status_code: 200
|
||||||
|
register: health_check
|
||||||
|
failed_when: false
|
||||||
|
ignore_errors: true
|
||||||
|
when: "'webservers' in group_names"
|
||||||
+14
@@ -0,0 +1,14 @@
|
|||||||
|
# fail2ban configuration
|
||||||
|
[DEFAULT]
|
||||||
|
bantime = 3600
|
||||||
|
findtime = 600
|
||||||
|
maxretry = 5
|
||||||
|
|
||||||
|
[sshd]
|
||||||
|
enabled = true
|
||||||
|
port = {{ ssh_port }}
|
||||||
|
logpath = /var/log/auth.log
|
||||||
|
maxretry = 3
|
||||||
|
|
||||||
|
[recidive]
|
||||||
|
enabled = true
|
||||||
@@ -0,0 +1,62 @@
|
|||||||
|
# Decommission Role
|
||||||
|
|
||||||
|
Gracefully decommission enterprise infrastructure nodes with comprehensive backup and cleanup.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- **Confirmation Prompt**: Interactive confirmation before decommissioning
|
||||||
|
- **Graceful Shutdown**: Stop services gracefully with connection drain time
|
||||||
|
- **Comprehensive Backup**: Archive configurations and data before cleanup
|
||||||
|
- **Selective Cleanup**: Only remove items that were deployed
|
||||||
|
- **Logging**: Detailed decommissioning logs for audit trail
|
||||||
|
- **Notifications**: Optional email notifications on completion
|
||||||
|
|
||||||
|
## Role Variables
|
||||||
|
|
||||||
|
See `defaults/main.yml` for all available variables.
|
||||||
|
|
||||||
|
### Key Variables
|
||||||
|
|
||||||
|
- `backup_data`: Backup application data (default: true)
|
||||||
|
- `export_config`: Export system configuration (default: true)
|
||||||
|
- `graceful_shutdown`: Graceful service shutdown (default: true)
|
||||||
|
- `auto_shutdown`: Auto shutdown after decommissioning (default: false)
|
||||||
|
- `application_services`: Services to stop
|
||||||
|
- `application_packages`: Packages to remove
|
||||||
|
- `decommission_notification_email`: Email for notifications (optional)
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- role: decommission
|
||||||
|
vars:
|
||||||
|
backup_data: true
|
||||||
|
export_config: true
|
||||||
|
auto_shutdown: false
|
||||||
|
decommission_notification_email: "ops@company.com"
|
||||||
|
```
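
For context, the role entry above might sit in a play like the following sketch. The host pattern and the `serial` value are illustrative assumptions, chosen so that nodes are removed one at a time.

```yaml
---
# Hypothetical play; the host pattern and serial value are illustrative.
- name: Decommission web nodes one at a time
  hosts: webservers
  become: true
  serial: 1

  roles:
    - role: decommission
      vars:
        backup_data: true
        export_config: true
        auto_shutdown: false
      tags: ['decommission', 'cleanup']
```

Running with `serial: 1` limits the blast radius: if the first node's backup fails, the remaining nodes have not been touched.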
|
||||||
|
|
||||||
|
## Backup Locations
|
||||||
|
|
||||||
|
- Configuration: `/var/backups/decommission-<timestamp>/config/`
|
||||||
|
- Data: `/var/backups/decommission-<timestamp>/data/`
|
||||||
|
- Report: `/var/log/decommission_report_<timestamp>.log`
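
To locate a specific data archive by hand, it helps to know how the file names under `data/` are derived: the role's backup task replaces every `/` in the source path with `_` and appends `.tar.gz`. A minimal shell sketch of that mapping follows; `archive_name_for` is a hypothetical helper name, not something the role ships.

```bash
# Hypothetical helper: mirror the role's regex_replace('/', '_') naming.
archive_name_for() {
    local src_path="$1"
    # Replace every slash with an underscore, then append the archive suffix.
    printf '%s.tar.gz\n' "${src_path//\//_}"
}

archive_name_for /var/lib/postgresql   # prints _var_lib_postgresql.tar.gz
```

The leading underscore comes from the leading `/` of the absolute path.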
|
||||||
|
|
||||||
|
## Supported Groups
|
||||||
|
|
||||||
|
- `webservers`: Backs up /var/www/html
|
||||||
|
- `databases`: Backs up PostgreSQL data
|
||||||
|
- `monitoring`: Backs up Prometheus data
|
||||||
|
- `loadbalancers`: Loadbalancer cleanup
|
||||||
|
|
||||||
|
## Safety Features
|
||||||
|
|
||||||
|
- Interactive confirmation before execution
|
||||||
|
- Connection drain time before shutdown (30 seconds)
|
||||||
|
- Errors are logged but don't stop the process
|
||||||
|
- Comprehensive audit log
|
||||||
|
|
||||||
|
## Tags
|
||||||
|
|
||||||
|
- `decommission`: All decommissioning tasks
|
||||||
|
- `cleanup`: Cleanup-related tasks
|
||||||
@@ -0,0 +1,34 @@
|
|||||||
|
---
|
||||||
|
# Decommissioning configuration
|
||||||
|
backup_data: true
|
||||||
|
export_config: true
|
||||||
|
graceful_shutdown: true
|
||||||
|
cleanup_inventory: true
|
||||||
|
auto_shutdown: false
|
||||||
|
shutdown_delay: 10
|
||||||
|
|
||||||
|
# Services to stop gracefully
|
||||||
|
application_services:
|
||||||
|
- nginx
|
||||||
|
- postgresql
|
||||||
|
- haproxy
|
||||||
|
|
||||||
|
# Packages to remove
|
||||||
|
application_packages:
|
||||||
|
- nginx
|
||||||
|
- postgresql
|
||||||
|
- haproxy
|
||||||
|
- prometheus
|
||||||
|
|
||||||
|
# Directories to archive
|
||||||
|
config_paths:
|
||||||
|
- /etc/
|
||||||
|
- /opt/application/
|
||||||
|
|
||||||
|
data_paths:
|
||||||
|
- /var/www/html
|
||||||
|
- /var/lib/postgresql
|
||||||
|
- /var/lib/prometheus
|
||||||
|
|
||||||
|
# Notification settings
|
||||||
|
decommission_notification_email: null
|
||||||
@@ -0,0 +1,177 @@
|
|||||||
|
---
|
||||||
|
- name: Validate decommissioning requirements
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- backup_data is boolean
|
||||||
|
fail_msg: "Invalid decommissioning configuration"
|
||||||
|
|
||||||
|
- name: Pre-decommissioning checks
|
||||||
|
block:
|
||||||
|
- name: Check node health
|
||||||
|
ansible.builtin.uri:
|
||||||
|
url: http://localhost/health
|
||||||
|
method: GET
|
||||||
|
status_code: 200
|
||||||
|
register: health_check
|
||||||
|
failed_when: false
|
||||||
|
ignore_errors: true
|
||||||
|
when: "'webservers' in group_names"
|
||||||
|
|
||||||
|
- name: Create decommissioning backup directory
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}"
|
||||||
|
state: directory
|
||||||
|
mode: '0755'
|
||||||
|
|
||||||
|
- name: Initialize decommissioning log
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/var/log/decommission.log"
|
||||||
|
state: touch
|
||||||
|
mode: '0644'
|
||||||
|
modification_time: now
|
||||||
|
access_time: now
|
||||||
|
|
||||||
|
- name: Log decommissioning start
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: "/var/log/decommission.log"
|
||||||
|
line: "{{ ansible_date_time.iso8601 }} - Starting decommissioning of {{ inventory_hostname }}"
|
||||||
|
state: present
|
||||||
|
|
||||||
|
- name: Graceful application shutdown
|
||||||
|
block:
|
||||||
|
- name: Stop application services
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: "{{ item }}"
|
||||||
|
state: stopped
|
||||||
|
loop: "{{ application_services }}"
|
||||||
|
failed_when: false
|
||||||
|
when: graceful_shutdown
|
||||||
|
|
||||||
|
- name: Wait for connections to drain
|
||||||
|
ansible.builtin.pause:
|
||||||
|
seconds: 30
|
||||||
|
when: graceful_shutdown and ("webservers" in group_names or "loadbalancers" in group_names)
|
||||||
|
|
||||||
|
- name: Export and backup data
|
||||||
|
block:
|
||||||
|
- name: Create config export directory
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/config"
|
||||||
|
state: directory
|
||||||
|
mode: '0755'
|
||||||
|
|
||||||
|
- name: Archive system configuration
|
||||||
|
community.general.archive:
|
||||||
|
path: "{{ config_paths }}"
|
||||||
|
dest: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/config/system_config.tar.gz"
|
||||||
|
format: gz
|
||||||
|
when: export_config
|
||||||
|
failed_when: false # noqa risky-file-permissions
|
||||||
|
|
||||||
|
- name: Create data backup directory
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/data"
|
||||||
|
state: directory
|
||||||
|
mode: '0755'
|
||||||
|
when: backup_data
|
||||||
|
|
||||||
|
- name: Backup individual data paths
|
||||||
|
community.general.archive:
|
||||||
|
path: "{{ item }}"
|
||||||
|
dest: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/data/{{ item | regex_replace('/', '_') }}.tar.gz"
|
||||||
|
format: gz
|
||||||
|
loop: "{{ data_paths }}"
|
||||||
|
when: backup_data
|
||||||
|
failed_when: false # noqa risky-file-permissions
|
||||||
|
|
||||||
|
- name: Update monitoring and load balancing
|
||||||
|
block:
|
||||||
|
- name: Remove from load balancer
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: "Would remove {{ inventory_hostname }} from load balancer"
|
||||||
|
when: "'webservers' in group_names or 'databases' in group_names"
|
||||||
|
|
||||||
|
- name: Update monitoring alerts
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: "Would update monitoring alerts for {{ inventory_hostname }}"
|
||||||
|
when: "'monitoring' not in group_names"
|
||||||
|
|
||||||
|
- name: Clean up application
|
||||||
|
block:
|
||||||
|
- name: Remove application directories
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ item }}"
|
||||||
|
state: absent
|
||||||
|
loop:
|
||||||
|
- /opt/application
|
||||||
|
- /var/www/html
|
||||||
|
- /var/lib/postgresql
|
||||||
|
- /var/lib/prometheus
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Remove application packages
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name: "{{ item }}"
|
||||||
|
state: absent
|
||||||
|
purge: true
|
||||||
|
loop: "{{ application_packages }}"
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Clean system logs
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
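# No pipeline in this command, so pipefail is unnecessary (and /bin/sh may not support it)
|
||||||
|
find /var/log -name "*.log" ! -name "decommission*.log" -type f -size +0 -exec truncate -s 0 {} \;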
|
||||||
|
changed_when: false
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Remove SSH credentials
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "{{ item }}"
|
||||||
|
state: absent
|
||||||
|
loop:
|
||||||
|
- /root/.ssh/authorized_keys
|
||||||
|
- /root/.ssh/known_hosts
|
||||||
|
- /home/infra-admin/.ssh/authorized_keys
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Generate decommissioning report
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: decommission_report.j2
|
||||||
|
dest: "/var/log/decommission_report_{{ ansible_date_time.iso8601 }}.log"
|
||||||
|
mode: '0644'
|
||||||
|
vars:
|
||||||
|
backup_location: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}"
|
||||||
|
|
||||||
|
- name: Send decommissioning notification
|
||||||
|
community.general.mail:
|
||||||
|
host: localhost
|
||||||
|
port: 25
|
||||||
|
to: "{{ decommission_notification_email }}"
|
||||||
|
subject: "Node Decommissioned - {{ inventory_hostname }}"
|
||||||
|
body: |
|
||||||
|
Node {{ inventory_hostname }} has been successfully decommissioned.
|
||||||
|
|
||||||
|
Backup location: /var/backups/decommission-{{ ansible_date_time.iso8601 }}/
|
||||||
|
Services stopped: {{ application_services | join(', ') }}
|
||||||
|
Configuration exported: {{ export_config }}
|
||||||
|
Data backed up: {{ backup_data }}
|
||||||
|
|
||||||
|
See /var/log/decommission_report_{{ ansible_date_time.iso8601 }}.log for details
|
||||||
|
when: decommission_notification_email | default('', true) | length > 0
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Finalize decommissioning
|
||||||
|
block:
|
||||||
|
- name: Log decommissioning completion
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: "/var/log/decommission.log"
|
||||||
|
line: "{{ ansible_date_time.iso8601 }} - Decommissioning completed for {{ inventory_hostname }}"
|
||||||
|
state: present
|
||||||
|
|
||||||
|
- name: Perform system shutdown
|
||||||
|
community.general.shutdown:
|
||||||
|
msg: "System scheduled for shutdown after decommissioning"
|
||||||
|
delay: "{{ shutdown_delay }}"
|
||||||
|
when: auto_shutdown | bool
|
||||||
+13
@@ -0,0 +1,13 @@
|
|||||||
|
Decommissioning Report
|
||||||
|
======================
|
||||||
|
Generated: {{ ansible_date_time.iso8601 }}
|
||||||
|
Host: {{ inventory_hostname }}
|
||||||
|
|
||||||
|
Status: COMPLETED
|
||||||
|
Backup Location: {{ backup_location }}
|
||||||
|
|
||||||
|
Configuration Exported: {{ export_config }}
|
||||||
|
Data Backed Up: {{ backup_data }}
|
||||||
|
Services Stopped: {{ application_services | join(', ') }}
|
||||||
|
|
||||||
|
Log Location: /var/log/decommission.log
|
||||||
@@ -0,0 +1,58 @@
|
|||||||
|
# Hardening Role
|
||||||
|
|
||||||
|
Apply security hardening to enterprise infrastructure nodes following CIS benchmarks.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- **CIS Levels**: Accepts `cis_level` 1 or 2; the CIS task file is currently a stub that only prints a message
|
||||||
|
- **SSH Hardening**: Disable root login, password auth, set auth limits
|
||||||
|
- **Firewall Configuration**: UFW with configurable rules
|
||||||
|
- **Service Cleanup**: Disable unnecessary services and remove insecure packages
|
||||||
|
- **Handlers**: SSH restarts via handlers
|
||||||
|
|
||||||
|
## Role Variables
|
||||||
|
|
||||||
|
See `defaults/main.yml` for all available variables.
|
||||||
|
|
||||||
|
### Key Variables
|
||||||
|
|
||||||
|
- `cis_level`: CIS hardening level (1 or 2)
|
||||||
|
- `disable_root_login`: Disable root SSH login (default: true)
|
||||||
|
- `secure_ssh_config`: Apply SSH security hardening (default: true)
|
||||||
|
- `firewall_policy`: Firewall default policy (default: deny)
|
||||||
|
- `ssh_max_auth_tries`: Maximum SSH authentication attempts (default: 3)
|
||||||
|
- `ssh_client_alive_interval`: SSH client alive interval in seconds (default: 300)
|
||||||
|
- `ssh_allowed_networks`: Networks allowed SSH access from
|
||||||
|
|
||||||
|
### SSH Allowed Networks
|
||||||
|
|
||||||
|
Default trusted networks:
|
||||||
|
- 10.0.0.0/8 (RFC 1918 24-bit block)
|
||||||
|
- 172.16.0.0/12 (RFC 1918 20-bit block)
|
||||||
|
- 192.168.0.0/16 (RFC 1918 16-bit block)
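
For reference, the allow rules the role creates for these networks correspond roughly to the `ufw` commands printed by the sketch below. It only prints the commands and changes nothing on the host; the function name `print_ssh_allow_rules` is illustrative, and port 22 is an assumption taken from the role's firewall task.

```bash
# Hypothetical helper: print (not run) one ufw allow rule per trusted network.
print_ssh_allow_rules() {
    local port="$1"; shift
    local net
    for net in "$@"; do
        echo "ufw allow from ${net} to any port ${port} proto tcp"
    done
}

print_ssh_allow_rules 22 10.0.0.0/8 172.16.0.0/12 192.168.0.0/16
```

Printing first makes it easy to review the rule set before applying anything to a host reached over SSH.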
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- role: hardening
|
||||||
|
vars:
|
||||||
|
cis_level: 1
|
||||||
|
disable_root_login: true
|
||||||
|
ssh_allowed_networks:
|
||||||
|
- 10.0.0.0/8
|
||||||
|
- 203.0.113.0/24
|
||||||
|
```
|
||||||
|
|
||||||
|
## SSH Configuration Changes
|
||||||
|
|
||||||
|
- Root login disabled
|
||||||
|
- Password authentication disabled
|
||||||
|
- Maximum auth tries: 3
|
||||||
|
- Empty passwords prohibited
|
||||||
|
- Client alive interval: 300 seconds
|
||||||
|
- Client alive count max: 2
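
After a run, the settings listed above can be spot-checked against the config file. The sketch below is one way to do that, assuming a plain `Keyword value` layout; `check_sshd_setting` is a hypothetical helper, and it does not follow `Include` or `Match` blocks.

```bash
# Hypothetical helper: compare the first active occurrence of a directive
# in an sshd_config-style file with the expected value.
check_sshd_setting() {
    local file="$1" key="$2" expected="$3" actual
    # Keywords are case-insensitive; commented lines never match because
    # their first field starts with '#'.
    actual="$(awk -v k="$key" 'tolower($1) == tolower(k) { print $2; exit }' "$file")"
    if [ "$actual" = "$expected" ]; then
        echo "OK   $key $actual"
    else
        echo "FAIL $key expected=$expected actual=${actual:-unset}"
        return 1
    fi
}

# Example (not executed here):
#   check_sshd_setting /etc/ssh/sshd_config PermitRootLogin no
```

On a live host, `sshd -T` prints the effective configuration and is the more reliable source when includes or match blocks are in play.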
|
||||||
|
|
||||||
|
## Tags
|
||||||
|
|
||||||
|
- `hardening`: All hardening tasks
|
||||||
|
- `security`: Security-related tasks
|
||||||
@@ -0,0 +1,35 @@
|
|||||||
|
---
|
||||||
|
# Hardening configuration
|
||||||
|
cis_level: 1
|
||||||
|
disable_root_login: true
|
||||||
|
secure_ssh_config: true
|
||||||
|
firewall_policy: deny
|
||||||
|
auditd_enabled: true
|
||||||
|
selinux_mode: enforcing
|
||||||
|
apparmor_enabled: true
|
||||||
|
|
||||||
|
# SSH Hardening
|
||||||
|
ssh_max_auth_tries: 3
|
||||||
|
ssh_client_alive_interval: 300
|
||||||
|
ssh_client_alive_count_max: 2
|
||||||
|
|
||||||
|
# Firewall rules for SSH (trusted networks)
|
||||||
|
ssh_allowed_networks:
|
||||||
|
- 10.0.0.0/8
|
||||||
|
- 172.16.0.0/12
|
||||||
|
- 192.168.0.0/16
|
||||||
|
|
||||||
|
# Services to disable
|
||||||
|
unnecessary_services:
|
||||||
|
- cups
|
||||||
|
- avahi-daemon
|
||||||
|
- bluetooth
|
||||||
|
- nfs-server
|
||||||
|
- rpcbind
|
||||||
|
|
||||||
|
# Packages to remove
|
||||||
|
unnecessary_packages:
|
||||||
|
- telnet
|
||||||
|
- rsh-client
|
||||||
|
- talk
|
||||||
|
- ntalk
|
||||||
@@ -0,0 +1,5 @@
|
|||||||
|
---
|
||||||
|
- name: restart sshd
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: sshd
|
||||||
|
state: restarted
|
||||||
@@ -0,0 +1,7 @@
|
|||||||
|
---
|
||||||
|
# CIS Hardening Level 1 tasks (stub for future expansion)
|
||||||
|
# https://www.cisecurity.org/cis-benchmarks/
|
||||||
|
|
||||||
|
- name: Check CIS status
|
||||||
|
ansible.builtin.debug:
|
||||||
|
msg: "CIS Hardening Level {{ cis_level }} would be applied here"
|
||||||
@@ -0,0 +1,95 @@
|
|||||||
|
---
|
||||||
|
- name: Validate hardening requirements
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- ansible_os_family == "Debian"
|
||||||
|
- cis_level in [1, 2]
|
||||||
|
fail_msg: "Unsupported configuration for hardening"
|
||||||
|
|
||||||
|
- name: Apply CIS hardening tasks
|
||||||
|
ansible.builtin.include_tasks: cis_hardening.yml
|
||||||
|
when: cis_level >= 1
|
||||||
|
|
||||||
|
- name: Configure SSH hardening
|
||||||
|
block:
|
||||||
|
- name: Disable root SSH login
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^PermitRootLogin'
|
||||||
|
line: 'PermitRootLogin no'
|
||||||
|
state: present
|
||||||
|
when: disable_root_login
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Disable password authentication
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^PasswordAuthentication'
|
||||||
|
line: 'PasswordAuthentication no'
|
||||||
|
state: present
|
||||||
|
when: secure_ssh_config
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Set MaxAuthTries
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^MaxAuthTries'
|
||||||
|
line: "MaxAuthTries {{ ssh_max_auth_tries }}"
|
||||||
|
state: present
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Disable empty passwords
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^PermitEmptyPasswords'
|
||||||
|
line: 'PermitEmptyPasswords no'
|
||||||
|
state: present
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Set ClientAliveInterval
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^ClientAliveInterval'
|
||||||
|
line: "ClientAliveInterval {{ ssh_client_alive_interval }}"
|
||||||
|
state: present
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Set ClientAliveCountMax
|
||||||
|
ansible.builtin.lineinfile:
|
||||||
|
path: /etc/ssh/sshd_config
|
||||||
|
regexp: '^ClientAliveCountMax'
|
||||||
|
line: "ClientAliveCountMax {{ ssh_client_alive_count_max }}"
|
||||||
|
state: present
|
||||||
|
notify: restart sshd
|
||||||
|
|
||||||
|
- name: Configure firewall rules
|
||||||
|
block:
|
||||||
|
- name: Enable firewall
|
||||||
|
community.general.ufw:
|
||||||
|
state: enabled
|
||||||
|
policy: "{{ firewall_policy }}"
|
||||||
|
when: firewall_policy is defined
|
||||||
|
|
||||||
|
- name: Allow SSH from trusted networks
|
||||||
|
community.general.ufw:
|
||||||
|
rule: allow
|
||||||
|
port: '22'
|
||||||
|
proto: tcp
|
||||||
|
from: "{{ item }}"
|
||||||
|
loop: "{{ ssh_allowed_networks }}"
|
||||||
|
|
||||||
|
- name: Disable unnecessary services
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: "{{ item }}"
|
||||||
|
state: stopped
|
||||||
|
enabled: false
|
||||||
|
loop: "{{ unnecessary_services }}"
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Remove unnecessary packages
|
||||||
|
ansible.builtin.apt:
|
||||||
|
name: "{{ item }}"
|
||||||
|
state: absent
|
||||||
|
purge: true
|
||||||
|
loop: "{{ unnecessary_packages }}"
|
||||||
|
failed_when: false
|
||||||
@@ -0,0 +1,45 @@
|
|||||||
|
# Patching Role
|
||||||
|
|
||||||
|
Apply security patches and OS updates to enterprise infrastructure nodes.
|
||||||
|
|
||||||
|
## Features
|
||||||
|
|
||||||
|
- **Idempotent**: Properly checks for changes with `changed_when`
|
||||||
|
- **Patch Window**: Optional enforcement of patch time windows
|
||||||
|
- **Pre-patch Backup**: Backs up package list before patching
|
||||||
|
- **Smart Reboot**: Automatically detects if reboot is required
|
||||||
|
- **Service Restart**: Restarts only necessary services after patching
|
||||||
|
- **Health Checks**: Verifies services and runs health endpoint checks
|
||||||
|
|
||||||
|
## Role Variables
|
||||||
|
|
||||||
|
See `defaults/main.yml` for all available variables.
|
||||||
|
|
||||||
|
### Key Variables
|
||||||
|
|
||||||
|
- `patch_window_start`: Patch window start time (default: 02:00)
|
||||||
|
- `patch_window_end`: Patch window end time (default: 04:00)
|
||||||
|
- `enforce_patch_window`: Enforce patch time window (default: true)
|
||||||
|
- `patch_security_only`: Selects the `dist` upgrade mode when true and `full` when false (default: true)
|
||||||
|
- `backup_before_patch`: Create backup before patching (default: true)
|
||||||
|
- `reboot_if_required`: Auto-reboot if required (default: false)
|
||||||
|
- `services_to_restart`: Services to restart after patching
|
||||||
|
- `critical_services`: Critical services to verify after patching
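
The patch window check in this role compares whole hours only: the current hour must be at or after the start hour and before the end hour. A minimal shell sketch of the same comparison, useful for checking a window before launching a run, is shown here; `in_patch_window` is a hypothetical helper, and like the role's check it does not handle windows that cross midnight.

```bash
# Hypothetical helper: hour-granularity window test (start inclusive, end exclusive).
in_patch_window() {
    local now_hour="$1" start="$2" end="$3"
    # Keep only the hour part of HH:MM.
    local start_hour="${start%%:*}" end_hour="${end%%:*}"
    # Force base 10 so values such as "08" are not parsed as octal.
    (( 10#$now_hour >= 10#$start_hour && 10#$now_hour < 10#$end_hour ))
}

if in_patch_window "$(date +%H)" "02:00" "04:00"; then
    echo "inside patch window"
else
    echo "outside patch window"
fi
```

With the defaults, 02:00 through 03:59 counts as inside the window and 04:00 does not.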
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
|
||||||
|
```yaml
|
||||||
|
- role: patching
|
||||||
|
vars:
|
||||||
|
patch_security_only: true
|
||||||
|
enforce_patch_window: false
|
||||||
|
reboot_if_required: true
|
||||||
|
```
|
||||||
|
|
||||||
|
## Report
|
||||||
|
|
||||||
|
Patch report is generated at: `/var/log/patch_report_<timestamp>.log`
|
||||||
|
|
||||||
|
## Backup Location
|
||||||
|
|
||||||
|
Pre-patch backups saved to: `/var/backups/pre-patch-<timestamp>/`
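
The backup directory holds a `packages.list` file written by `dpkg --get-selections`. As a rough sketch of how that snapshot can be used afterwards, the function below lists packages marked `install` now that were not in the snapshot; `packages_added_since` is a hypothetical helper and requires bash for process substitution.

```bash
# Hypothetical helper: packages selected for install in $2 but not in $1.
packages_added_since() {
    local snapshot="$1" current="$2"
    # comm -13 keeps lines that appear only in the second (current) list.
    comm -13 <(awk '$2 == "install" { print $1 }' "$snapshot" | sort) \
             <(awk '$2 == "install" { print $1 }' "$current" | sort)
}

# Example (not executed here):
#   dpkg --get-selections > /tmp/current.list
#   packages_added_since /var/backups/pre-patch-<timestamp>/packages.list /tmp/current.list
```

Note that the selections format records package names and their selection state, not versions, so it cannot show which packages were upgraded in place.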
|
||||||
@@ -0,0 +1,20 @@
|
|||||||
|
---
|
||||||
|
# Patching configuration
|
||||||
|
patch_window_start: "02:00"
|
||||||
|
patch_window_end: "04:00"
|
||||||
|
enforce_patch_window: true
|
||||||
|
patch_security_only: true
|
||||||
|
backup_before_patch: true
|
||||||
|
reboot_if_required: false
|
||||||
|
reboot_timeout: 300
|
||||||
|
|
||||||
|
# Services to restart after patching
|
||||||
|
services_to_restart:
|
||||||
|
- sshd
|
||||||
|
- fail2ban
|
||||||
|
|
||||||
|
# Services to verify after patching
|
||||||
|
critical_services:
|
||||||
|
- systemd-journald
|
||||||
|
- systemd-logind
|
||||||
|
- cron
|
||||||
@@ -0,0 +1,6 @@
|
|||||||
|
---
|
||||||
|
- name: restart patching services
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: "{{ item }}"
|
||||||
|
state: restarted
|
||||||
|
loop: "{{ services_to_restart }}"
|
||||||
@@ -0,0 +1,105 @@
|
|||||||
|
---
|
||||||
|
- name: Validate patch window
|
||||||
|
when: enforce_patch_window | bool
|
||||||
|
block:
|
||||||
|
- name: Check current time against patch window
|
||||||
|
ansible.builtin.assert:
|
||||||
|
that:
|
||||||
|
- ansible_date_time.hour | int >= patch_window_start.split(':')[0] | int
|
||||||
|
- ansible_date_time.hour | int < patch_window_end.split(':')[0] | int
|
||||||
|
fail_msg: |
|
||||||
|
Current time {{ ansible_date_time.hour }}:{{ ansible_date_time.minute }} is outside patch window {{ patch_window_start }}-{{ patch_window_end }}
|
||||||
|
|
||||||
|
- name: Create pre-patch backup
|
||||||
|
when: backup_before_patch | bool
|
||||||
|
block:
|
||||||
|
- name: Create backup directory
|
||||||
|
ansible.builtin.file:
|
||||||
|
path: "/var/backups/pre-patch-{{ ansible_date_time.iso8601 }}"
|
||||||
|
state: directory
|
||||||
|
mode: '0755'
|
||||||
|
|
||||||
|
- name: Capture current package list
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
# No pipeline in this command, so pipefail is unnecessary (and /bin/sh may not support it)
|
||||||
|
dpkg --get-selections > /var/backups/pre-patch-{{ ansible_date_time.iso8601 }}/packages.list
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Check for available updates
|
||||||
|
ansible.builtin.shell: |
|
||||||
|
set -o pipefail
|
||||||
|
apt list --upgradable 2>/dev/null | grep -v "Listing..." | wc -l
|
||||||
|
register: updates_available_count
|
||||||
|
changed_when: false
|
||||||
|
failed_when: false
|
||||||
|
args:
|
||||||
|
executable: /bin/bash
|
||||||
|
|
||||||
|
- name: Update package cache
|
||||||
|
ansible.builtin.apt:
|
||||||
|
update_cache: true
|
||||||
|
cache_valid_time: 300
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Check if reboot required before patching
|
||||||
|
ansible.builtin.stat:
|
||||||
|
path: /var/run/reboot-required
|
||||||
|
register: reboot_required_before
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Apply security updates
|
||||||
|
ansible.builtin.apt:
|
||||||
|
upgrade: dist
|
||||||
|
update_cache: true
|
||||||
|
when: patch_security_only | bool
|
||||||
|
register: apt_update_result
|
||||||
|
notify: restart patching services
|
||||||
|
|
||||||
|
- name: Apply all available updates
|
||||||
|
ansible.builtin.apt:
|
||||||
|
upgrade: full
|
||||||
|
update_cache: true
|
||||||
|
when: not (patch_security_only | bool)
|
||||||
|
register: apt_update_result
|
||||||
|
notify: restart patching services
|
||||||
|
|
||||||
|
- name: Check if reboot required after patching
|
||||||
|
ansible.builtin.stat:
|
||||||
|
path: /var/run/reboot-required
|
||||||
|
register: reboot_required_after
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Verify critical services are running
|
||||||
|
ansible.builtin.service:
|
||||||
|
name: "{{ item }}"
|
||||||
|
state: started
|
||||||
|
enabled: true
|
||||||
|
loop: "{{ critical_services }}"
|
||||||
|
failed_when: false
|
||||||
|
|
||||||
|
- name: Run post-patch health checks
|
||||||
|
ansible.builtin.uri:
|
||||||
|
url: http://localhost/health
|
||||||
|
method: GET
|
||||||
|
status_code: 200
|
||||||
|
register: health_check
|
||||||
|
failed_when: false
|
||||||
|
ignore_errors: true
|
||||||
|
when: "'webservers' in group_names"
|
||||||
|
|
||||||
|
- name: Set reboot required flag
|
||||||
|
ansible.builtin.set_fact:
|
||||||
|
reboot_required: "{{ reboot_required_after.stat.exists | default(false) }}"
|
||||||
|
|
||||||
|
- name: Perform system reboot if required
|
||||||
|
ansible.builtin.reboot:
|
||||||
|
msg: "Rebooting after security patches"
|
||||||
|
reboot_timeout: "{{ reboot_timeout }}"
|
||||||
|
when: reboot_required and reboot_if_required | bool
|
||||||
|
|
||||||
|
- name: Generate patching report
|
||||||
|
ansible.builtin.template:
|
||||||
|
src: patch_report.j2
|
||||||
|
dest: /var/log/patch_report_{{ ansible_date_time.iso8601 }}.log
|
||||||
|
mode: '0644'
|
||||||
|
vars:
|
||||||
|
updates_applied_count: "{{ apt_update_result.changed | ternary('Yes', 'No') }}"
|
||||||
|
reboot_required_flag: "{{ reboot_required }}"
|
||||||
+10
@@ -0,0 +1,10 @@
|
|||||||
|
Patching Report
|
||||||
|
===============
|
||||||
|
Generated: {{ ansible_date_time.iso8601 }}
|
||||||
|
Host: {{ inventory_hostname }}
|
||||||
|
|
||||||
|
Updates Applied: {{ updates_applied_count }}
|
||||||
|
Reboot Required: {{ reboot_required_flag }}
|
||||||
|
Services Restarted: {{ services_to_restart | join(', ') }}
|
||||||
|
|
||||||
|
Backup Location: /var/backups/pre-patch-{{ ansible_date_time.iso8601 }}/
|
||||||
@@ -0,0 +1,21 @@
|
|||||||
|
# Scenario: Simulate Failure and Patch
|
||||||
|
|
||||||
|
## Description
|
||||||
|
|
||||||
|
Validate that a service-level failure can be detected, recovered, and followed by a controlled patch workflow. This mirrors a maintenance window where a degraded node is stabilized before package updates are applied.
|
||||||
|
|
||||||
|
## Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd professional-infra/linux-operations-automation
|
||||||
|
./scripts/simulate_failure.sh service 30 web
|
||||||
|
ansible-playbook -i inventory/hosts.ini playbooks/patch.yml
|
||||||
|
ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml --check
|
||||||
|
```
|
||||||
|
|
||||||
|
## Expected Result
|
||||||
|
|
||||||
|
- The simulation records a temporary service failure.
|
||||||
|
- The service is restored after cleanup.
|
||||||
|
- The patch playbook completes without unreachable hosts.
|
||||||
|
- Hardening check mode reports no destructive changes.
|
||||||
@@ -0,0 +1,116 @@
|
|||||||
|
---
|
||||||
|
- name: Enterprise Scaling Event Scenario
|
||||||
|
hosts: all
|
||||||
|
become: yes
|
||||||
|
gather_facts: yes
|
||||||
|
vars:
|
||||||
|
scaling_threshold: 80
|
||||||
|
cooldown_period: 300
|
||||||
|
max_scale_up: 5
|
||||||
|
min_instances: 2
|
||||||
|
|
||||||
|
pre_tasks:
|
||||||
|
- name: Log scenario start
|
||||||
|
lineinfile:
|
||||||
|
path: "/var/log/scaling_scenario.log"
|
||||||
|
line: "{{ ansible_date_time.iso8601 }} - Starting scaling event scenario"
|
||||||
|
create: yes
|
||||||
|
|
||||||
|
- name: Check current load
|
||||||
|
command: uptime
|
||||||
|
register: system_load
|
||||||
|
changed_when: false
|
||||||
|
|
||||||
|
- name: Parse load average
|
||||||
|
set_fact:
|
||||||
|
load_1min: "{{ system_load.stdout.split('load average:')[-1].split(',')[0] | trim | float }}"
|
||||||
|
load_5min: "{{ system_load.stdout.split('load average:')[-1].split(',')[1] | trim | float }}"
|
||||||
|
load_15min: "{{ system_load.stdout.split('load average:')[-1].split(',')[2] | trim | float }}"
|
||||||
|
|
||||||
|
tasks:
|
||||||
|
- name: Evaluate scaling conditions
|
||||||
|
set_fact:
|
||||||
|
scale_up_needed: "{{ (load_5min | float) > scaling_threshold }}"
|
||||||
|
scale_down_needed: "{{ (load_5min | float) < (scaling_threshold * 0.3) }}"
|
||||||
|
|
||||||
|
- name: Scale up web servers
|
||||||
|
include_role:
|
||||||
|
name: scale_up
|
||||||
|
tasks_from: web_servers
|
||||||
|
vars:
|
||||||
|
scale_count: "{{ [max_scale_up, ((load_5min | float) / 10) | int] | min }}"
|
||||||
|
when: scale_up_needed | bool and 'webservers' in group_names
|
||||||
|
|
||||||
|
- name: Scale up database servers
|
||||||
|
include_role:
|
||||||
|
name: scale_up
|
||||||
|
tasks_from: database_servers
|
||||||
|
vars:
|
||||||
|
scale_count: "{{ [2, ((load_5min | float) / 20) | int] | min }}"
|
||||||
|
when: scale_up_needed | bool and 'databases' in group_names
|
||||||
|
|
||||||
|
- name: Update load balancer configuration
|
||||||
|
include_role:
|
||||||
|
name: load_balancer
|
||||||
|
tasks_from: update_backends
|
||||||
|
when: scale_up_needed
|
||||||
|
|
||||||
|
- name: Scale down web servers
|
||||||
|
include_role:
|
||||||
|
name: scale_down
|
||||||
|
tasks_from: web_servers
|
||||||
|
vars:
|
||||||
|
scale_count: "{{ [(inventory_hostname | regex_findall('[0-9]+') | first | int) - min_instances, 1] | max }}"
|
||||||
|
when: scale_down_needed | bool and 'webservers' in group_names and (inventory_hostname | regex_findall('[0-9]+') | first | int) > min_instances
|
||||||
|
|
||||||
|
- name: Wait for cooldown period
|
||||||
|
pause:
|
||||||
|
seconds: "{{ cooldown_period }}"
|
||||||
|
when: scale_up_needed or scale_down_needed
|
||||||
|
|
||||||
|
- name: Verify scaling results
|
||||||
|
uri:
|
||||||
|
url: http://localhost/health
|
||||||
|
method: GET
|
||||||
|
status_code: 200
|
||||||
|
register: health_check
|
||||||
|
until: health_check.status == 200
|
||||||
|
retries: 5
|
||||||
|
delay: 10
|
||||||
|
when: "'webservers' in group_names"
|
||||||
|
|
||||||
|
- name: Update monitoring thresholds
|
||||||
|
include_role:
|
||||||
|
name: monitoring
|
||||||
|
tasks_from: update_alerts
|
||||||
|
vars:
|
||||||
|
new_threshold: "{{ scaling_threshold + 10 }}"
|
||||||
|
|
||||||
|
- name: Send scaling notification
|
||||||
|
mail:
|
||||||
|
to: "{{ scaling_notification_email | default('infra-team@company.com') }}"
|
||||||
|
subject: "Infrastructure Scaling Event - {{ inventory_hostname }}"
|
||||||
|
body: |
|
||||||
|
Scaling event completed on {{ inventory_hostname }}
|
||||||
|
|
||||||
|
Load averages: {{ load_1min }}, {{ load_5min }}, {{ load_15min }}
|
||||||
|
Action taken: {{ 'Scale Up' if scale_up_needed else 'Scale Down' if scale_down_needed else 'No Action' }}
|
||||||
|
Health check: {{ 'PASSED' if (health_check.status | default(0)) == 200 else 'FAILED' }}
|
||||||
|
|
||||||
|
See /var/log/scaling_scenario.log for details
|
||||||
|
when: scaling_notification_email is defined
|
||||||
|
ignore_errors: yes
|
||||||
|
|
||||||
|
post_tasks:
|
||||||
|
- name: Generate scaling scenario report
|
||||||
|
template:
|
||||||
|
src: templates/scaling_scenario_report.j2
|
||||||
|
dest: "/var/log/scaling_scenario_report_{{ ansible_date_time.iso8601 }}.log"
|
||||||
|
vars:
|
||||||
|
scenario_outcome: "{{ 'SUCCESS' if (health_check.status | default(0)) == 200 else 'WARNING' }}"
|
||||||
|
load_metrics: "{{ load_1min }}, {{ load_5min }}, {{ load_15min }}"
|
||||||
|
|
||||||
|
- name: Log scenario completion
|
||||||
|
lineinfile:
|
||||||
|
path: "/var/log/scaling_scenario.log"
|
||||||
|
line: "{{ ansible_date_time.iso8601 }} - Scaling event scenario completed"
|
||||||
@@ -0,0 +1,388 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# Enterprise Infrastructure Failure Simulation Script
|
||||||
|
# Simulates various types of infrastructure failures for testing
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
DOCKER_COMPOSE_FILE="docker-compose.yml"
|
||||||
|
INVENTORY_FILE="inventory/hosts.ini"
|
||||||
|
LOG_FILE="logs/failure_simulation.log"
|
||||||
|
|
||||||
|
# Default values
|
||||||
|
FAILURE_TYPE="${1:-network}"
|
||||||
|
DURATION="${2:-60}"
|
||||||
|
TARGET_NODES="${3:-all}"
|
||||||
|
INTENSITY="${INTENSITY:-medium}"
|
||||||
|
|
||||||
|
# Logging function
|
||||||
|
log() {
|
||||||
|
echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "$LOG_FILE"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Error handling
|
||||||
|
error_exit() {
|
||||||
|
log "ERROR: $1"
|
||||||
|
# Cleanup any active failures
|
||||||
|
cleanup_failure
|
||||||
|
exit 1
|
||||||
|
}
|
||||||
|
|
||||||
|
# Validate inputs
|
||||||
|
validate_inputs() {
|
||||||
|
case "$FAILURE_TYPE" in
|
||||||
|
network|disk|service|node|cpu|memory) ;;
|
||||||
|
*) error_exit "Invalid failure type: $FAILURE_TYPE. Must be network, disk, service, node, cpu, or memory" ;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
if ! [[ "$DURATION" =~ ^[0-9]+$ ]] || [ "$DURATION" -lt 1 ]; then
|
||||||
|
error_exit "Invalid duration: $DURATION. Must be a positive integer (seconds)"
|
||||||
|
fi
|
||||||
|
|
||||||
|
case "$INTENSITY" in
|
||||||
|
low|medium|high|critical) ;;
|
||||||
|
*) error_exit "Invalid intensity: $INTENSITY. Must be low, medium, high, or critical" ;;
|
||||||
|
esac
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get target containers
|
||||||
|
get_target_containers() {
|
||||||
|
if [ "${SIMULATION_MODE:-false}" = true ]; then
|
||||||
|
case "$TARGET_NODES" in
|
||||||
|
all) echo "web db lb" ;;
|
||||||
|
*) echo "$TARGET_NODES" ;;
|
||||||
|
esac
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
case "$TARGET_NODES" in
|
||||||
|
all)
|
||||||
|
docker compose ps --services | grep -v "^NAME$" || true
|
||||||
|
;;
|
||||||
|
web)
|
||||||
|
echo "web"
|
||||||
|
;;
|
||||||
|
db)
|
||||||
|
echo "db"
|
||||||
|
;;
|
||||||
|
lb)
|
||||||
|
echo "lb"
|
||||||
|
;;
|
||||||
|
monitor)
|
||||||
|
echo "monitor"
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "$TARGET_NODES"
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
}
|
||||||
|
|
||||||
|
# Network failure simulation
|
||||||
|
simulate_network_failure() {
|
||||||
|
local containers=$(get_target_containers)
|
||||||
|
log "Simulating network failure on containers: $containers"
|
||||||
|
|
||||||
|
if [ "${SIMULATION_MODE:-false}" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: skipping Docker network changes"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
for container in $containers; do
|
||||||
|
local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
|
||||||
|
|
||||||
|
for cid in $container_ids; do
|
||||||
|
if [ -n "$cid" ]; then
|
||||||
|
log "Disconnecting network for container $cid"
|
||||||
|
|
||||||
|
# Disconnect from network
|
||||||
|
docker network disconnect "$(docker inspect "$cid" --format '{{.HostConfig.NetworkMode}}')" "$cid" 2>/dev/null || true
|
||||||
|
|
||||||
|
# Store original network for restoration
|
||||||
|
echo "$cid:$(docker inspect "$cid" --format '{{.HostConfig.NetworkMode}}')" >> /tmp/network_failure_state
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# Disk failure simulation
|
||||||
|
simulate_disk_failure() {
|
||||||
|
local containers=$(get_target_containers)
|
||||||
|
log "Simulating disk space exhaustion on containers: $containers"
|
||||||
|
|
||||||
|
if [ "${SIMULATION_MODE:-false}" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: skipping container disk writes"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
for container in $containers; do
|
||||||
|
local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
|
||||||
|
|
||||||
|
for cid in $container_ids; do
|
||||||
|
if [ -n "$cid" ]; then
|
||||||
|
log "Filling disk space in container $cid"
|
||||||
|
|
||||||
|
# Create a large file to consume disk space
|
||||||
|
local fill_size_mb=100
|
||||||
|
case "$INTENSITY" in
|
||||||
|
low) fill_size_mb=50 ;;
|
||||||
|
medium) fill_size_mb=100 ;;
|
||||||
|
high) fill_size_mb=500 ;;
|
||||||
|
critical) fill_size_mb=1024 ;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
docker exec "$cid" bash -c "dd if=/dev/zero of=/tmp/disk_fill bs=1M count=${fill_size_mb}" 2>/dev/null || true
|
||||||
|
echo "$cid:disk_fill" >> /tmp/disk_failure_state
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# Service failure simulation
|
||||||
|
simulate_service_failure() {
|
||||||
|
local containers=$(get_target_containers)
|
||||||
|
log "Simulating service failures on containers: $containers"
|
||||||
|
|
||||||
|
if [ "${SIMULATION_MODE:-false}" = true ]; then
|
||||||
|
for container in $containers; do
|
||||||
|
log "SIMULATION_MODE=true: would stop services in $container"
|
||||||
|
done
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
for container in $containers; do
|
||||||
|
local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
|
||||||
|
|
||||||
|
for cid in $container_ids; do
|
||||||
|
if [ -n "$cid" ]; then
|
||||||
|
log "Stopping services in container $cid"
|
||||||
|
|
||||||
|
# Stop common services
|
||||||
|
docker exec "$cid" systemctl stop nginx 2>/dev/null || true
|
||||||
|
docker exec "$cid" systemctl stop postgresql 2>/dev/null || true
|
||||||
|
docker exec "$cid" systemctl stop haproxy 2>/dev/null || true
|
||||||
|
|
||||||
|
echo "$cid:services" >> /tmp/service_failure_state
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# Node failure simulation
|
||||||
|
simulate_node_failure() {
|
||||||
|
local containers=$(get_target_containers)
|
||||||
|
log "Simulating complete node failures on containers: $containers"
|
||||||
|
|
||||||
|
if [ "${SIMULATION_MODE:-false}" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: skipping container pause"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
for container in $containers; do
|
||||||
|
local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
|
||||||
|
|
||||||
|
for cid in $container_ids; do
|
||||||
|
if [ -n "$cid" ]; then
|
||||||
|
log "Pausing container $cid (node failure)"
|
||||||
|
docker pause "$cid"
|
||||||
|
echo "$cid:paused" >> /tmp/node_failure_state
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# CPU stress simulation
|
||||||
|
simulate_cpu_failure() {
|
||||||
|
local containers=$(get_target_containers)
|
||||||
|
log "Simulating CPU stress on containers: $containers"
|
||||||
|
|
||||||
|
if [ "${SIMULATION_MODE:-false}" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: skipping CPU stress"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
for container in $containers; do
|
||||||
|
local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
|
||||||
|
|
||||||
|
for cid in $container_ids; do
|
||||||
|
if [ -n "$cid" ]; then
|
||||||
|
log "Starting CPU stress in container $cid"
|
||||||
|
|
||||||
|
# Start CPU stress process
|
||||||
|
docker exec -d "$cid" bash -c "while true; do :; done" 2>/dev/null || true
|
||||||
|
echo "$cid:cpu_stress:$(docker exec "$cid" ps aux | grep "while true" | grep -v grep | awk '{print $2}' | head -1)" >> /tmp/cpu_failure_state
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# Memory stress simulation
|
||||||
|
simulate_memory_failure() {
|
||||||
|
local containers=$(get_target_containers)
|
||||||
|
log "Simulating memory exhaustion on containers: $containers"
|
||||||
|
|
||||||
|
if [ "${SIMULATION_MODE:-false}" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: skipping memory stress"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
for container in $containers; do
|
||||||
|
local container_ids=$(docker compose ps -q "$container" 2>/dev/null || true)
|
||||||
|
|
||||||
|
for cid in $container_ids; do
|
||||||
|
if [ -n "$cid" ]; then
|
||||||
|
log "Starting memory stress in container $cid"
|
||||||
|
|
||||||
|
# Start memory stress process
|
||||||
|
docker exec -d "$cid" bash -c "tail /dev/zero" 2>/dev/null || true
|
||||||
|
echo "$cid:memory_stress:$(docker exec "$cid" ps aux | grep "tail /dev/zero" | grep -v grep | awk '{print $2}' | head -1)" >> /tmp/memory_failure_state
|
||||||
|
fi
|
||||||
|
done
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# Inject failure
|
||||||
|
inject_failure() {
|
||||||
|
case "$FAILURE_TYPE" in
|
||||||
|
network) simulate_network_failure ;;
|
||||||
|
disk) simulate_disk_failure ;;
|
||||||
|
service) simulate_service_failure ;;
|
||||||
|
node) simulate_node_failure ;;
|
||||||
|
cpu) simulate_cpu_failure ;;
|
||||||
|
memory) simulate_memory_failure ;;
|
||||||
|
esac
|
||||||
|
}
|
||||||
|
|
||||||
|
# Cleanup failure
|
||||||
|
cleanup_failure() {
|
||||||
|
log "Cleaning up failure simulation"
|
||||||
|
|
||||||
|
# Restore network connections
|
||||||
|
if [ -f /tmp/network_failure_state ]; then
|
||||||
|
while IFS=: read -r cid network; do
|
||||||
|
docker network connect "$network" "$cid" 2>/dev/null || true
|
||||||
|
done < /tmp/network_failure_state
|
||||||
|
rm -f /tmp/network_failure_state
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Clean up disk fill files
|
||||||
|
if [ -f /tmp/disk_failure_state ]; then
|
||||||
|
while IFS=: read -r cid _; do
|
||||||
|
docker exec "$cid" rm -f /tmp/disk_fill 2>/dev/null || true
|
||||||
|
done < /tmp/disk_failure_state
|
||||||
|
rm -f /tmp/disk_failure_state
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Restart services
|
||||||
|
if [ -f /tmp/service_failure_state ]; then
|
||||||
|
while IFS=: read -r cid _; do
|
||||||
|
docker exec "$cid" systemctl start nginx 2>/dev/null || true
|
||||||
|
docker exec "$cid" systemctl start postgresql 2>/dev/null || true
|
||||||
|
docker exec "$cid" systemctl start haproxy 2>/dev/null || true
|
||||||
|
done < /tmp/service_failure_state
|
||||||
|
rm -f /tmp/service_failure_state
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Unpause containers
|
||||||
|
if [ -f /tmp/node_failure_state ]; then
|
||||||
|
while IFS=: read -r cid _; do
|
||||||
|
docker unpause "$cid" 2>/dev/null || true
|
||||||
|
done < /tmp/node_failure_state
|
||||||
|
rm -f /tmp/node_failure_state
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Kill stress processes
|
||||||
|
if [ -f /tmp/cpu_failure_state ]; then
|
||||||
|
while IFS=: read -r cid _ pid; do
|
||||||
|
docker exec "$cid" kill -9 "$pid" 2>/dev/null || true
|
||||||
|
done < /tmp/cpu_failure_state
|
||||||
|
rm -f /tmp/cpu_failure_state
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ -f /tmp/memory_failure_state ]; then
|
||||||
|
while IFS=: read -r cid _ pid; do
|
||||||
|
docker exec "$cid" kill -9 "$pid" 2>/dev/null || true
|
||||||
|
done < /tmp/memory_failure_state
|
||||||
|
rm -f /tmp/memory_failure_state
|
||||||
|
fi
|
||||||
|
}
|
||||||
|
|
||||||
|
# Monitor failure
|
||||||
|
monitor_failure() {
|
||||||
|
local end_time=$(( $(date +%s) + DURATION ))
|
||||||
|
|
||||||
|
log "Monitoring failure for $DURATION seconds"
|
||||||
|
|
||||||
|
while [ "$(date +%s)" -lt "$end_time" ]; do
|
||||||
|
# Check container status
|
||||||
|
if [ "${SIMULATION_MODE:-false}" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: validation simulated"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker compose ps | grep -q "Up\|Paused"; then
|
||||||
|
log "WARNING: All containers are down"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Log system metrics
|
||||||
|
log "System status: $(docker stats --no-stream --format 'table {{.Container}}\t{{.CPUPerc}}\t{{.MemPerc}}' | tail -n +2)"
|
||||||
|
|
||||||
|
sleep 10
|
||||||
|
done
|
||||||
|
}
|
||||||
|
|
||||||
|
# Generate failure report
|
||||||
|
generate_report() {
|
||||||
|
local report_file="reports/failure_simulation_$(date +%Y%m%d_%H%M%S).txt"
|
||||||
|
|
||||||
|
cat > "$report_file" << EOF
|
||||||
|
Failure Simulation Report
|
||||||
|
========================
|
||||||
|
|
||||||
|
Timestamp: $(date)
|
||||||
|
Failure Type: $FAILURE_TYPE
|
||||||
|
Duration: $DURATION seconds
|
||||||
|
Target Nodes: $TARGET_NODES
|
||||||
|
Intensity: $INTENSITY
|
||||||
|
|
||||||
|
Pre-failure Status:
|
||||||
|
$(docker compose ps 2>/dev/null || echo "Docker Compose not running")
|
||||||
|
|
||||||
|
Post-failure Status:
|
||||||
|
$(docker compose ps 2>/dev/null || echo "Docker Compose not running")
|
||||||
|
|
||||||
|
Log File: $LOG_FILE
|
||||||
|
EOF
|
||||||
|
|
||||||
|
log "Failure simulation report generated: $report_file"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main execution
|
||||||
|
main() {
|
||||||
|
log "Starting failure simulation: $FAILURE_TYPE for $DURATION seconds"
|
||||||
|
|
||||||
|
validate_inputs
|
||||||
|
|
||||||
|
# Inject failure
|
||||||
|
inject_failure
|
||||||
|
|
||||||
|
# Monitor during failure
|
||||||
|
monitor_failure
|
||||||
|
|
||||||
|
# Cleanup
|
||||||
|
cleanup_failure
|
||||||
|
|
||||||
|
# Generate report
|
||||||
|
generate_report
|
||||||
|
|
||||||
|
log "Failure simulation completed successfully"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Trap for cleanup on script exit
|
||||||
|
trap cleanup_failure EXIT
|
||||||
|
|
||||||
|
# Initialize logging
|
||||||
|
mkdir -p logs reports
|
||||||
|
|
||||||
|
# Run main function
|
||||||
|
main "$@"
|
||||||
@@ -0,0 +1,229 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
# Enterprise Infrastructure Scaling Simulation Script
|
||||||
|
# Simulates scaling operations for infrastructure nodes
|
||||||
|
|
||||||
|
set -euo pipefail
|
||||||
|
|
||||||
|
# Configuration
|
||||||
|
DOCKER_COMPOSE_FILE="docker-compose.yml"
|
||||||
|
INVENTORY_FILE="inventory/hosts.ini"
|
||||||
|
LOG_FILE="logs/scaling_simulation.log"
|
||||||
|
|
||||||
|
# Default values
|
||||||
|
DIRECTION="${1:-up}"
|
||||||
|
COUNT="${2:-1}"
|
||||||
|
NODE_TYPE="${3:-web}"
|
||||||
|
SIMULATION_MODE="${SIMULATION_MODE:-false}"
|
||||||
|
|
||||||
|
# Logging function
|
||||||
|
log() {
|
||||||
|
echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "$LOG_FILE"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Error handling
|
||||||
|
error_exit() {
|
||||||
|
log "ERROR: $1"
|
||||||
|
exit 1
|
||||||
|
}
|
||||||
|
|
||||||
|
# Validate inputs
|
||||||
|
validate_inputs() {
|
||||||
|
if [[ "$DIRECTION" != "up" && "$DIRECTION" != "down" ]]; then
|
||||||
|
error_exit "Invalid direction: $DIRECTION. Must be 'up' or 'down'"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! [[ "$COUNT" =~ ^[0-9]+$ ]] || [ "$COUNT" -lt 1 ]; then
|
||||||
|
error_exit "Invalid count: $COUNT. Must be a positive integer"
|
||||||
|
fi
|
||||||
|
|
||||||
|
case "$NODE_TYPE" in
|
||||||
|
web|db|lb|monitor) ;;
|
||||||
|
*) error_exit "Invalid node type: $NODE_TYPE. Must be web, db, lb, or monitor" ;;
|
||||||
|
esac
|
||||||
|
}
|
||||||
|
|
||||||
|
# Get current node count
|
||||||
|
get_current_count() {
|
||||||
|
local type="$1"
|
||||||
|
if [ "$SIMULATION_MODE" = true ]; then
|
||||||
|
case "$type" in
|
||||||
|
web) echo 3 ;;
|
||||||
|
db) echo 2 ;;
|
||||||
|
lb|monitor) echo 1 ;;
|
||||||
|
esac
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
case "$type" in
|
||||||
|
web) docker compose ps web | grep -c "Up" ;;
|
||||||
|
db) docker compose ps db | grep -c "Up" ;;
|
||||||
|
lb) docker compose ps lb | grep -c "Up" ;;
|
||||||
|
monitor) docker compose ps monitor | grep -c "Up" ;;
|
||||||
|
esac
|
||||||
|
}
|
||||||
|
|
||||||
|
# Scale up infrastructure
|
||||||
|
scale_up() {
|
||||||
|
local type="$1"
|
||||||
|
local count="$2"
|
||||||
|
|
||||||
|
log "Scaling up $count $type nodes"
|
||||||
|
|
||||||
|
if [ "$SIMULATION_MODE" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: skipping Docker Compose mutation and Ansible provisioning"
|
||||||
|
update_inventory "$type" "$count" "add"
|
||||||
|
log "Successfully simulated scale up of $count $type nodes"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
docker compose -f "$DOCKER_COMPOSE_FILE" up -d --scale "${type}=$(( $(get_current_count "$type") + count ))"
|
||||||
|
|
||||||
|
# Wait for containers to be ready
|
||||||
|
log "Waiting for containers to be ready..."
|
||||||
|
sleep 30
|
||||||
|
|
||||||
|
# Update inventory
|
||||||
|
update_inventory "$type" "$count" "add"
|
||||||
|
|
||||||
|
# Run provisioning playbook on new nodes
|
||||||
|
if [ "$SIMULATION_MODE" = false ]; then
|
||||||
|
ansible-playbook -i "$INVENTORY_FILE" playbooks/provision.yml --limit "${type}*"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Successfully scaled up $count $type nodes"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Scale down infrastructure
|
||||||
|
scale_down() {
|
||||||
|
local type="$1"
|
||||||
|
local count="$2"
|
||||||
|
|
||||||
|
local current_count=$(get_current_count "$type")
|
||||||
|
if [ "$current_count" -lt "$count" ]; then
|
||||||
|
error_exit "Cannot scale down $count nodes. Only $current_count $type nodes currently running"
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Scaling down $count $type nodes"
|
||||||
|
|
||||||
|
# Select nodes to remove (first entries listed by docker compose ps)
|
||||||
|
if [ "$SIMULATION_MODE" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: skipping Docker Compose mutation and Ansible decommissioning"
|
||||||
|
update_inventory "$type" "$count" "remove"
|
||||||
|
log "Successfully simulated scale down of $count $type nodes"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
local nodes_to_remove=$(docker compose ps "$type" | grep "Up" | head -n "$count" | awk '{print $1}')
|
||||||
|
|
||||||
|
# Decommission nodes
|
||||||
|
for node in $nodes_to_remove; do
|
||||||
|
if [ "$SIMULATION_MODE" = false ]; then
|
||||||
|
ansible-playbook -i "$INVENTORY_FILE" playbooks/decommission.yml --limit "$node"
|
||||||
|
fi
|
||||||
|
docker stop "$node"
|
||||||
|
docker rm "$node"
|
||||||
|
done
|
||||||
|
|
||||||
|
# Update inventory
|
||||||
|
update_inventory "$type" "$count" "remove"
|
||||||
|
|
||||||
|
log "Successfully scaled down $count $type nodes"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Update Ansible inventory
|
||||||
|
update_inventory() {
|
||||||
|
local type="$1"
|
||||||
|
local count="$2"
|
||||||
|
local action="$3"
|
||||||
|
|
||||||
|
log "Updating inventory for $action $count $type nodes"
|
||||||
|
|
||||||
|
# This would be more complex in a real implementation
|
||||||
|
# For simulation, we'll just log the action
|
||||||
|
case "$action" in
|
||||||
|
add)
|
||||||
|
log "Added $count $type nodes to inventory"
|
||||||
|
;;
|
||||||
|
remove)
|
||||||
|
log "Removed $count $type nodes from inventory"
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
}
|
||||||
|
|
||||||
|
# Health check after scaling
|
||||||
|
health_check() {
|
||||||
|
log "Running health checks after scaling"
|
||||||
|
|
||||||
|
# Check container status
|
||||||
|
if [ "$SIMULATION_MODE" = true ]; then
|
||||||
|
log "SIMULATION_MODE=true: health checks simulated"
|
||||||
|
return
|
||||||
|
fi
|
||||||
|
|
||||||
|
if ! docker compose ps | grep -q "Up"; then
|
||||||
|
error_exit "No containers are running after scaling"
|
||||||
|
fi
|
||||||
|
|
||||||
|
# Ansible ping check
|
||||||
|
if [ "$SIMULATION_MODE" = false ]; then
|
||||||
|
if ! ansible -i "$INVENTORY_FILE" all -m ping >/dev/null 2>&1; then
|
||||||
|
log "WARNING: Some nodes failed Ansible ping check"
|
||||||
|
fi
|
||||||
|
fi
|
||||||
|
|
||||||
|
log "Health checks completed"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Generate scaling report
|
||||||
|
generate_report() {
|
||||||
|
local report_file="reports/scaling_report_$(date +%Y%m%d_%H%M%S).txt"
|
||||||
|
|
||||||
|
cat > "$report_file" << EOF
|
||||||
|
Scaling Simulation Report
|
||||||
|
========================
|
||||||
|
|
||||||
|
Timestamp: $(date)
|
||||||
|
Direction: $DIRECTION
|
||||||
|
Node Type: $NODE_TYPE
|
||||||
|
Count: $COUNT
|
||||||
|
Simulation Mode: $SIMULATION_MODE
|
||||||
|
|
||||||
|
Current Status:
|
||||||
|
$(docker compose ps 2>/dev/null || echo "Docker Compose not running")
|
||||||
|
|
||||||
|
Inventory Status:
|
||||||
|
$(ansible -i "$INVENTORY_FILE" --list-hosts all 2>/dev/null || echo "Ansible inventory check failed")
|
||||||
|
|
||||||
|
Log File: $LOG_FILE
|
||||||
|
EOF
|
||||||
|
|
||||||
|
log "Scaling report generated: $report_file"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Main execution
|
||||||
|
main() {
|
||||||
|
log "Starting scaling simulation: $DIRECTION $COUNT $NODE_TYPE nodes"
|
||||||
|
|
||||||
|
validate_inputs
|
||||||
|
|
||||||
|
case "$DIRECTION" in
|
||||||
|
up)
|
||||||
|
scale_up "$NODE_TYPE" "$COUNT"
|
||||||
|
;;
|
||||||
|
down)
|
||||||
|
scale_down "$NODE_TYPE" "$COUNT"
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
health_check
|
||||||
|
generate_report
|
||||||
|
|
||||||
|
log "Scaling simulation completed successfully"
|
||||||
|
}
|
||||||
|
|
||||||
|
# Initialize logging
|
||||||
|
mkdir -p logs reports
|
||||||
|
|
||||||
|
# Run main function
|
||||||
|
main "$@"
|
||||||
@@ -0,0 +1,20 @@
|
|||||||
|
.PHONY: run test demo down
|
||||||
|
|
||||||
|
run:
|
||||||
|
docker compose up -d
|
||||||
|
|
||||||
|
test:
|
||||||
|
docker compose config --quiet
|
||||||
|
test -f elasticsearch/config/elasticsearch.yml
|
||||||
|
test -f logstash/config/logstash.yml
|
||||||
|
test -f logstash/pipeline/logstash.conf
|
||||||
|
test -f kibana/config/kibana.yml
|
||||||
|
test -f filebeat/config/filebeat.yml
|
||||||
|
test -d grafana/provisioning
|
||||||
|
test -d grafana/dashboards
|
||||||
|
|
||||||
|
demo:
|
||||||
|
bash ./scenarios/incident_simulation.sh app-errors 3
|
||||||
|
|
||||||
|
down:
|
||||||
|
docker compose down
|
||||||
@@ -0,0 +1,98 @@
|
|||||||
|
# Log Observability ELK/Grafana
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
Operations teams need searchable logs and reviewable incident evidence in addition to simple OS checks. Zabbix is useful for host and service health signals; ELK/Grafana is better suited for log ingestion, error analysis, dashboards, and environment-level observability.
|
||||||
|
|
||||||
|
## CV Relevance
|
||||||
|
|
||||||
|
This project supports the monitoring and troubleshooting part of the CV by showing how incident logs can be collected, parsed, searched, and reviewed. It is separate from the Zabbix project: Zabbix handles simple checks, while this project focuses on logs and observability evidence.
|
||||||
|
|
||||||
|
## What This Project Demonstrates
|
||||||
|
|
||||||
|
- A local Docker Compose scaffold for Elasticsearch, Logstash, Kibana, Grafana, and Filebeat.
|
||||||
|
- Minimal configs required for the stack to validate independently.
|
||||||
|
- Sample logs and alert intent that can be reviewed without starting the full stack.
|
||||||
|
- An incident simulation script for generating operational log evidence.
|
||||||
|
|
||||||
|
This is a local demo stack. The default credentials are for non-production use only.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
```
|
||||||
|
Application/System Logs -> Filebeat -> Logstash -> Elasticsearch -> Kibana
|
||||||
|
|
|
||||||
|
v
|
||||||
|
Grafana
|
||||||
|
|
||||||
|
Incident Scenario -> Sample Logs -> Alert Rules -> Operator Review
|
||||||
|
```
|
||||||
|
|
||||||
|
Core components:
|
||||||
|
|
||||||
|
- `docker-compose.yml` defines the observability services.
|
||||||
|
- `alerting/alert_rules.yml` records alert intent and severity.
|
||||||
|
- `examples/` contains representative operational logs and alert output.
|
||||||
|
- `scenarios/incident_simulation.sh` emits incident activity.
|
||||||
|
- `grafana/`, `kibana/`, `logstash/`, `filebeat/`, and `elasticsearch/` contain minimal local configs.
|
||||||
|
|
||||||
|
## Quickstart
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd professional-infra/log-observability-elk-grafana
|
||||||
|
make test
|
||||||
|
make demo
|
||||||
|
```
|
||||||
|
|
||||||
|
Start the full local stack with Docker, and stop it with `make down` when finished:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
make test
|
||||||
|
make run
|
||||||
|
make down
|
||||||
|
```
|
||||||
|
|
||||||
|
When running locally:
|
||||||
|
|
||||||
|
- Kibana: `http://localhost:5601`
|
||||||
|
- Grafana: `http://localhost:3000`
|
||||||
|
- Elasticsearch: `http://localhost:9200`
|
||||||
|
|
||||||
|
Default demo credentials:
|
||||||
|
|
||||||
|
- Elasticsearch/Kibana: `elastic` / `elastic`
|
||||||
|
- Grafana: `admin` / `admin`
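
A minimal sketch for confirming that Elasticsearch answers with the demo credentials, assuming the stack is running locally. The variable names `ES_URL`, `ES_USER`, `ES_PASS` and the helper `parse_status` are illustrative and not part of this project; the endpoint and credentials are the defaults listed above.

```bash
#!/usr/bin/env bash
# Sketch: read the cluster status from the Elasticsearch health endpoint.
# ES_URL, ES_USER, ES_PASS and parse_status are hypothetical names; defaults come from this README.
ES_URL="${ES_URL:-http://localhost:9200}"
ES_USER="${ES_USER:-elastic}"
ES_PASS="${ES_PASS:-elastic}"

# Extract the "status" value (green, yellow, or red) from health JSON on stdin.
parse_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# --max-time keeps the check from hanging when the stack is not running.
if response=$(curl -s -f --max-time 5 -u "$ES_USER:$ES_PASS" "$ES_URL/_cluster/health" 2>/dev/null); then
    echo "cluster status: $(printf '%s\n' "$response" | parse_status)"
else
    echo "Elasticsearch is not reachable at $ES_URL"
fi
```

A single-node cluster commonly reports `yellow` because replica shards cannot be assigned, so `yellow` is likely acceptable for this local demo.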
|
||||||
|
|
||||||
|
## Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
make test
|
||||||
|
docker compose config --quiet
|
||||||
|
```
|
||||||
|
|
||||||
|
`make test` runs `docker compose config --quiet` and then checks that the bind-mounted config files and the Grafana provisioning and dashboard directories exist.
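
The existence checks can also be sketched as a small shell helper that reports every missing path instead of stopping at the first one. `check_paths` is a hypothetical function, not part of this project; the paths mirror the `test` target in the Makefile.

```bash
#!/usr/bin/env bash
# Sketch: report all missing bind-mount sources in one pass.
# check_paths is a hypothetical helper, not part of the repository.
check_paths() {
    local missing=0 path
    for path in "$@"; do
        if [ ! -e "$path" ]; then
            echo "missing: $path" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing was missing.
    [ "$missing" -eq 0 ]
}

# Paths mirror the checks in the Makefile `test` target.
if check_paths \
    elasticsearch/config/elasticsearch.yml \
    logstash/config/logstash.yml \
    logstash/pipeline/logstash.conf \
    kibana/config/kibana.yml \
    filebeat/config/filebeat.yml \
    grafana/provisioning \
    grafana/dashboards; then
    echo "all bind-mount sources present"
fi
```

Run it from the project directory; outside of it, every path is reported as missing.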
|
||||||
|
|
||||||
|
## Example Output
|
||||||
|
|
||||||
|
```text
|
||||||
|
[2026-04-29 04:18:23] WARN Database connection pool nearing capacity
|
||||||
|
[2026-04-29 04:18:28] ERROR Database connection pool exhausted
|
||||||
|
[2026-04-29 04:18:33] ERROR Database query timeout occurred
|
||||||
|
[2026-04-29 04:18:44] INFO Database connections restored
|
||||||
|
```
|
||||||
|
|
||||||
|
Additional examples are available in [examples/alert-output.txt](examples/alert-output.txt) and [examples/sample-log.txt](examples/sample-log.txt).
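
For a quick review without starting the stack, log lines in the format shown above can be counted per severity. This is a sketch only: `count_levels` is a hypothetical helper, and the actual Logstash pipeline may parse these logs differently.

```bash
#!/usr/bin/env bash
# Sketch: count log lines per severity for the "[date time] LEVEL message" format.
# count_levels is a hypothetical helper, not part of the repository.
count_levels() {
    # "[date" is field 1, "time]" is field 2, so the level is field 3.
    awk '/^\[/ { counts[$3]++ } END { for (level in counts) print level, counts[level] }' "$@" | sort
}

count_levels <<'EOF'
[2026-04-29 04:18:23] WARN Database connection pool nearing capacity
[2026-04-29 04:18:28] ERROR Database connection pool exhausted
[2026-04-29 04:18:33] ERROR Database query timeout occurred
[2026-04-29 04:18:44] INFO Database connections restored
EOF
```

The function reads standard input when called without arguments and accepts file names otherwise, for example `count_levels examples/sample-log.txt`, assuming that file uses the same line format.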
|
||||||
|
|
||||||
|
## Interview Talking Points
|
||||||
|
|
||||||
|
- When to use Zabbix checks versus ELK log analysis.
|
||||||
|
- How Filebeat, Logstash, and Elasticsearch fit into a basic log pipeline.
|
||||||
|
- How incident simulations create evidence for troubleshooting discussions.
|
||||||
|
- Why local demo credentials and single-node Elasticsearch are not production architecture.
|
||||||
|
|
||||||
|
## Roadmap
|
||||||
|
|
||||||
|
- Add curated Grafana and Kibana dashboards.
|
||||||
|
- Add Prometheus metrics collection.
|
||||||
|
- Add distributed tracing with Jaeger or OpenTelemetry.
|
||||||
|
- Add synthetic monitoring checks.
|
||||||
@@ -0,0 +1,326 @@
|
|||||||
|
# Enterprise Observability Alert Rules
|
||||||
|
# Alert definitions for automated incident detection and notification
|
||||||
|
|
||||||
|
alert_rules:
|
||||||
|
# System Resource Alerts
|
||||||
|
- name: "High CPU Usage"
|
||||||
|
description: "CPU utilization exceeds threshold"
|
||||||
|
condition: "cpu_usage_percent > 90"
|
||||||
|
duration: "5m"
|
||||||
|
severity: "critical"
|
||||||
|
tags:
|
||||||
|
- system
|
||||||
|
- performance
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- slack
|
||||||
|
labels:
|
||||||
|
team: "platform"
|
||||||
|
component: "system"
|
||||||
|
|
||||||
|
- name: "High Memory Usage"
|
||||||
|
description: "Memory utilization exceeds threshold"
|
||||||
|
condition: "memory_usage_percent > 85"
|
||||||
|
duration: "3m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- system
|
||||||
|
- memory
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "platform"
|
||||||
|
component: "system"
|
||||||
|
|
||||||
|
- name: "Disk Space Critical"
|
||||||
|
description: "Disk usage exceeds critical threshold"
|
||||||
|
condition: "disk_usage_percent > 95"
|
||||||
|
duration: "2m"
|
||||||
|
severity: "critical"
|
||||||
|
tags:
|
||||||
|
- storage
|
||||||
|
- disk
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- pagerduty
|
||||||
|
labels:
|
||||||
|
team: "platform"
|
||||||
|
component: "storage"
|
||||||
|
|
||||||
|
- name: "Disk Space Warning"
|
||||||
|
description: "Disk usage exceeds warning threshold"
|
||||||
|
condition: "disk_usage_percent > 85"
|
||||||
|
duration: "10m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- storage
|
||||||
|
- disk
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "platform"
|
||||||
|
component: "storage"
|
||||||
|
|
||||||
|
# Service Availability Alerts
|
||||||
|
- name: "Service Down"
|
||||||
|
description: "Critical service is not responding"
|
||||||
|
condition: "service_status == 'down' OR http_status_code >= 500"
|
||||||
|
duration: "2m"
|
||||||
|
severity: "critical"
|
||||||
|
tags:
|
||||||
|
- service
|
||||||
|
- availability
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- slack
|
||||||
|
- pagerduty
|
||||||
|
labels:
|
||||||
|
team: "application"
|
||||||
|
component: "service"
|
||||||
|
|
||||||
|
- name: "Database Connection Failed"
|
||||||
|
description: "Database connection pool exhausted or unresponsive"
|
||||||
|
condition: "db_connections_active == 0 OR db_response_time > 5000"
|
||||||
|
duration: "1m"
|
||||||
|
severity: "critical"
|
||||||
|
tags:
|
||||||
|
- database
|
||||||
|
- connectivity
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- pagerduty
|
||||||
|
labels:
|
||||||
|
team: "database"
|
||||||
|
component: "postgresql"
|
||||||
|
|
||||||
|
- name: "Cache Unavailable"
|
||||||
|
description: "Cache service is down or unresponsive"
|
||||||
|
condition: "cache_hit_ratio < 0.1 OR cache_response_time > 1000"
|
||||||
|
duration: "3m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- cache
|
||||||
|
- performance
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "infrastructure"
|
||||||
|
component: "redis"
|
||||||
|
|
||||||
|
# Application Performance Alerts
|
||||||
|
- name: "High Error Rate"
|
||||||
|
description: "Application error rate exceeds threshold"
|
||||||
|
condition: "error_rate_percent > 5"
|
||||||
|
duration: "5m"
|
||||||
|
severity: "critical"
|
||||||
|
tags:
|
||||||
|
- application
|
||||||
|
- errors
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- slack
|
||||||
|
labels:
|
||||||
|
team: "application"
|
||||||
|
component: "api"
|
||||||
|
|
||||||
|
- name: "Slow Response Time"
|
||||||
|
description: "API response time exceeds SLA"
|
||||||
|
condition: "response_time_p95 > 2000"
|
||||||
|
duration: "5m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- application
|
||||||
|
- performance
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "application"
|
||||||
|
component: "api"
|
||||||
|
|
||||||
|
- name: "High Request Queue"
|
||||||
|
description: "Request queue depth is too high"
|
||||||
|
condition: "queue_depth > 100"
|
||||||
|
duration: "3m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- application
|
||||||
|
- queue
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "application"
|
||||||
|
component: "queue"
|
||||||
|
|
||||||
|
# Infrastructure Alerts
|
||||||
|
- name: "Network Latency High"
|
||||||
|
description: "Network round-trip time exceeds threshold"
|
||||||
|
condition: "network_rtt > 100"
|
||||||
|
duration: "5m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- network
|
||||||
|
- latency
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "network"
|
||||||
|
component: "infrastructure"
|
||||||
|
|
||||||
|
- name: "Load Balancer Unhealthy"
|
||||||
|
description: "Load balancer backend servers are unhealthy"
|
||||||
|
condition: "lb_unhealthy_backends > 0"
|
||||||
|
duration: "2m"
|
||||||
|
severity: "critical"
|
||||||
|
tags:
|
||||||
|
- loadbalancer
|
||||||
|
- availability
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- pagerduty
|
||||||
|
labels:
|
||||||
|
team: "infrastructure"
|
||||||
|
component: "loadbalancer"
|
||||||
|
|
||||||
|
# Security Alerts
|
||||||
|
- name: "Failed Login Attempts"
|
||||||
|
description: "Multiple failed authentication attempts detected"
|
||||||
|
condition: "failed_login_attempts > 5"
|
||||||
|
duration: "5m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- security
|
||||||
|
- authentication
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- slack
|
||||||
|
labels:
|
||||||
|
team: "security"
|
||||||
|
component: "authentication"
|
||||||
|
|
||||||
|
- name: "Suspicious Network Traffic"
|
||||||
|
description: "Unusual network traffic patterns detected"
|
||||||
|
condition: "network_bytes_unusual > 1000000"
|
||||||
|
duration: "10m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- security
|
||||||
|
- network
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "security"
|
||||||
|
component: "network"
|
||||||
|
|
||||||
|
# Log-based Alerts
|
||||||
|
- name: "Application Errors"
|
||||||
|
description: "High volume of application error logs"
|
||||||
|
condition: "log_errors_per_minute > 10"
|
||||||
|
duration: "2m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- logs
|
||||||
|
- errors
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "application"
|
||||||
|
component: "logs"
|
||||||
|
|
||||||
|
- name: "Out of Memory Errors"
|
||||||
|
description: "Out of memory errors detected in logs"
|
||||||
|
condition: "log_oom_errors > 0"
|
||||||
|
duration: "1m"
|
||||||
|
severity: "critical"
|
||||||
|
tags:
|
||||||
|
- memory
|
||||||
|
- errors
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- pagerduty
|
||||||
|
labels:
|
||||||
|
team: "application"
|
||||||
|
component: "memory"
|
||||||
|
|
||||||
|
# Business Logic Alerts
|
||||||
|
- name: "Low Business Transactions"
|
||||||
|
description: "Business transaction volume below expected threshold"
|
||||||
|
condition: "business_transactions_per_hour < 100"
|
||||||
|
duration: "15m"
|
||||||
|
severity: "warning"
|
||||||
|
tags:
|
||||||
|
- business
|
||||||
|
- transactions
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
labels:
|
||||||
|
team: "business"
|
||||||
|
component: "transactions"
|
||||||
|
|
||||||
|
- name: "Payment Failures"
|
||||||
|
description: "Payment processing failure rate is high"
|
||||||
|
condition: "payment_failure_rate > 0.05"
|
||||||
|
duration: "5m"
|
||||||
|
severity: "critical"
|
||||||
|
tags:
|
||||||
|
- payments
|
||||||
|
- business
|
||||||
|
channels:
|
||||||
|
- email
|
||||||
|
- pagerduty
|
||||||
|
labels:
|
||||||
|
team: "payments"
|
||||||
|
component: "processing"
|
||||||
|
|
||||||
|
# Alert Channels Configuration
|
||||||
|
alert_channels:
|
||||||
|
email:
|
||||||
|
type: "email"
|
||||||
|
recipients:
|
||||||
|
- "platform-team@company.com"
|
||||||
|
- "oncall@company.com"
|
||||||
|
subject_template: "[{{severity}}] {{name}} - {{description}}"
|
||||||
|
|
||||||
|
slack:
|
||||||
|
type: "slack"
|
||||||
|
webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
|
||||||
|
channel: "#alerts"
|
||||||
|
username: "Observability Bot"
|
||||||
|
icon_emoji: ":warning:"
|
||||||
|
|
||||||
|
pagerduty:
|
||||||
|
type: "pagerduty"
|
||||||
|
integration_key: "your-pagerduty-integration-key"
|
||||||
|
severity_mapping:
|
||||||
|
critical: "critical"
|
||||||
|
warning: "warning"
|
||||||
|
info: "info"
|
||||||
|
|
||||||
|
# Alert Silencing Rules
silence_rules:
  - name: "Maintenance Window"
    condition: "maintenance_window == true"
    duration: "4h"
    comment: "Silenced during scheduled maintenance"

  - name: "Known Issue"
    condition: "known_issue_id == 'TICKET-123'"
    duration: "24h"
    comment: "Silenced for known issue resolution"

# Escalation Policies
escalation_policies:
  - name: "Default Escalation"
    steps:
      - delay: "5m"
        channels: ["email"]
      - delay: "15m"
        channels: ["slack"]
      - delay: "30m"
        channels: ["pagerduty"]

  - name: "Critical Escalation"
    steps:
      - delay: "0m"
        channels: ["email", "slack", "pagerduty"]
      - delay: "10m"
        channels: ["pagerduty"]  # Escalation
@@ -0,0 +1,120 @@
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: observability-elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=elastic
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
      - ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -u elastic:elastic -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: observability-logstash
    environment:
      - "LS_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logs:/usr/share/logstash/logs
    ports:
      - "5044:5044"
      - "8080:8080"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9600/_node/pipelines || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: observability-kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      - ELASTICSEARCH_USERNAME=elastic
      - ELASTICSEARCH_PASSWORD=elastic
    volumes:
      - ./kibana/config/kibana.yml:/usr/share/kibana/config/kibana.yml
    ports:
      - "5601:5601"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5601/api/status || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  grafana:
    image: grafana/grafana:10.2.0
    container_name: observability-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    container_name: observability-filebeat
    user: root
    volumes:
      - ./filebeat/config/filebeat.yml:/usr/share/filebeat/filebeat.yml
      - ./logs:/var/log/sample
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - observability
    depends_on:
      - logstash
    restart: unless-stopped

volumes:
  elasticsearch_data:
    driver: local
  grafana_data:
    driver: local

networks:
  observability:
    driver: bridge
    ipam:
      config:
        - subnet: 172.25.0.0/16
@@ -0,0 +1,30 @@
# Log Observability ELK/Grafana Architecture

## Components

- Filebeat: tails sample and container logs.
- Logstash: receives and processes log events.
- Elasticsearch: stores searchable observability data.
- Kibana: supports log exploration and dashboards.
- Grafana: provides operational dashboards.
- Alert rules: document symptoms, thresholds, and severity.
- Incident simulation: generates controlled failure signals.
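
The alert rules pair a threshold condition with a `duration`, for example `payment_failure_rate > 0.05` held for `5m`. The sketch below illustrates one way such a "sustained breach" could be evaluated. It is an illustration only: `Sample`, `breached_for`, and the sample values are hypothetical names and data, and the stack itself documents these rules rather than evaluating them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Sample:
    """One metric reading (hypothetical shape, for illustration)."""
    at: datetime
    value: float


def breached_for(samples, threshold, duration):
    """Return True if the most recent unbroken run of readings above
    `threshold` has lasted at least `duration`."""
    ordered = sorted(samples, key=lambda s: s.at)
    if not ordered or ordered[-1].value <= threshold:
        return False
    breach_start = ordered[-1].at
    # Walk backwards until the first reading that is not in breach.
    for sample in reversed(ordered):
        if sample.value <= threshold:
            break
        breach_start = sample.at
    return ordered[-1].at - breach_start >= duration


if __name__ == "__main__":
    start = datetime(2026, 4, 29, 4, 0, 0)
    rates = [0.02, 0.06, 0.07, 0.08, 0.09, 0.06, 0.07]  # one reading per minute
    samples = [Sample(start + timedelta(minutes=i), r) for i, r in enumerate(rates)]
    print(breached_for(samples, 0.05, timedelta(minutes=5)))
```

Tracking the start of the current breach, rather than averaging over the window, means a single healthy reading resets the timer, which is the usual reading of a "for 5m" style rule.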

## Data Flow

```
Log source -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                                            |
                                            v
                                         Grafana
```

Incident exercises follow this flow:

```
Operator -> incident_simulation.sh -> logs/incident_simulation.log -> Filebeat -> Logstash -> alerts/dashboards
```
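
To make the parsing step in this flow concrete, the sketch below approximates the pipeline's grok expression (`\[%{TIMESTAMP_ISO8601:observed_at}\] %{LOGLEVEL:level} %{GREEDYDATA:event_message}`) with a plain Python regex. It is a rough stand-in, not the Logstash implementation: the real grok patterns accept more timestamp and level variants, and `parse_line` is a hypothetical helper. It does show that bracketed lines written by `incident_simulation.sh` parse, while unbracketed `key=value` lines would be tagged `portfolio_parse_failure`.

```python
import re

# Approximation of the grok pattern used in the Logstash pipeline.
LINE_PATTERN = re.compile(
    r"^\[(?P<observed_at>\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2})\] "
    r"(?P<level>[A-Z]+) (?P<event_message>.*)$"
)


def parse_line(line):
    """Return the extracted fields, or a failure-tagged event on mismatch."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return {"message": line, "tags": ["portfolio_parse_failure"]}
    return match.groupdict()


if __name__ == "__main__":
    # Format produced by the log() function in incident_simulation.sh.
    print(parse_line("[2026-04-29 04:18:33] ERROR Database connection pool exhausted"))
    # Unbracketed key=value line: does not match the bracketed pattern.
    print(parse_line("2026-04-29T04:18:28Z WARN service=checkout-api event=db_pool_pressure"))
```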

## Notes

This is a local demonstration stack, not a production Elasticsearch deployment. A production version would add dedicated nodes, TLS, secret management, retention policies, index lifecycle management, and external alert delivery.
+4
@@ -0,0 +1,4 @@
cluster.name: portfolio-observability
node.name: elasticsearch-demo
network.host: 0.0.0.0
xpack.security.enabled: true
@@ -0,0 +1,4 @@
2026-04-29T04:19:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=100 threshold=95 status=firing
2026-04-29T04:19:30Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=7.8 threshold=5.0 status=firing
2026-04-29T04:22:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=71 threshold=95 status=resolved
2026-04-29T04:23:15Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=1.2 threshold=5.0 status=resolved
@@ -0,0 +1,5 @@
2026-04-29T04:18:21Z INFO service=checkout-api host=app-web-02 request_id=8f4b2 path=/checkout status=200 latency_ms=142
2026-04-29T04:18:28Z WARN service=checkout-api host=app-web-02 event=db_pool_pressure active=92 max=100
2026-04-29T04:18:33Z ERROR service=checkout-api host=app-web-02 event=db_timeout query=CreateOrder timeout_ms=5000 customer_tier=enterprise
2026-04-29T04:18:39Z ERROR service=checkout-api host=app-web-02 event=payment_retry_exhausted order_id=ord-104288 provider=stripe
2026-04-29T04:18:44Z INFO service=checkout-api host=app-web-02 event=recovery db_pool_active=48
@@ -0,0 +1,11 @@
filebeat.inputs:
  - type: filestream
    id: portfolio-sample-logs
    enabled: true
    paths:
      - /var/log/sample/*.log

output.logstash:
  hosts: ["logstash:5044"]

logging.level: info
@@ -0,0 +1,3 @@
# Dashboards

This directory is reserved for local demo dashboards. The current portfolio scope validates the observability stack scaffold, sample logs, alert intent, and incident simulation without claiming production-ready dashboards.
+14
@@ -0,0 +1,14 @@
apiVersion: 1

datasources:
  - name: Elasticsearch
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    basicAuth: true
    basicAuthUser: elastic
    jsonData:
      index: portfolio-logs-*
      timeField: "@timestamp"
    secureJsonData:
      basicAuthPassword: elastic
@@ -0,0 +1,4 @@
server.host: 0.0.0.0
elasticsearch.hosts: ["http://elasticsearch:9200"]
# Note: Kibana 8.x refuses to start when elasticsearch.username is the
# `elastic` superuser; use `kibana_system` or a service account token instead.
elasticsearch.username: elastic
elasticsearch.password: elastic
@@ -0,0 +1,2 @@
http.host: 0.0.0.0
pipeline.ecs_compatibility: disabled
@@ -0,0 +1,24 @@
input {
  beats {
    port => 5044
  }
}

filter {
  grok {
    match => { "message" => "\[%{TIMESTAMP_ISO8601:observed_at}\] %{LOGLEVEL:level} %{GREEDYDATA:event_message}" }
    tag_on_failure => ["portfolio_parse_failure"]
  }
}

output {
  elasticsearch {
    hosts => ["http://elasticsearch:9200"]
    user => "elastic"
    password => "elastic"
    index => "portfolio-logs-%{+YYYY.MM.dd}"
  }
  stdout {
    codec => rubydebug
  }
}
@@ -0,0 +1,21 @@
# Scenario: Incident Simulation

## Description

Generate a controlled application and infrastructure incident so the logging pipeline, alert rules, and dashboards can be reviewed with realistic event timing.

## Commands

```bash
cd professional-infra/log-observability-elk-grafana
docker compose config
./scenarios/incident_simulation.sh comprehensive
tail -n 40 logs/incident_simulation.log
```

## Expected Result

- The compose file validates successfully.
- The simulation writes a sequence of CPU, memory, service, database, and application error events.
- Alert examples indicate firing and resolved states.
- Operators can trace incident progression through logs and dashboard queries.
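
As one way to review the firing and resolved states, the sketch below pairs the two events for each alert and reports how long it stayed open. The four lines are copied from the sample alert log in this project; `parse_alert` and `firing_durations` are hypothetical helpers written for this illustration and are not part of the repository.

```python
from datetime import datetime

ALERT_LINES = [
    "2026-04-29T04:19:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=100 threshold=95 status=firing",
    "2026-04-29T04:19:30Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=7.8 threshold=5.0 status=firing",
    "2026-04-29T04:22:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=71 threshold=95 status=resolved",
    "2026-04-29T04:23:15Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=1.2 threshold=5.0 status=resolved",
]


def parse_alert(line):
    """Split '<timestamp> key=value ...' into a dict with a parsed 'at' field."""
    timestamp, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    fields["at"] = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%SZ")
    return fields


def firing_durations(lines):
    """Map (alert, host) to the seconds between its firing and resolved events."""
    open_alerts = {}
    durations = {}
    for line in lines:
        event = parse_alert(line)
        key = (event["alert"], event["host"])
        if event["status"] == "firing":
            open_alerts[key] = event["at"]
        elif event["status"] == "resolved" and key in open_alerts:
            durations[key] = (event["at"] - open_alerts.pop(key)).total_seconds()
    return durations


if __name__ == "__main__":
    for (alert, host), seconds in firing_durations(ALERT_LINES).items():
        print(f"{alert} on {host}: open for {seconds:.0f}s")
```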
+318
@@ -0,0 +1,318 @@
#!/bin/bash

# Enterprise Incident Simulation Script
# Simulates various failure scenarios for testing observability stack

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$SCRIPT_DIR")"
LOG_FILE="$PROJECT_ROOT/logs/incident_simulation.log"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Logging function
log() {
    local level=$1
    local message=$2
    local timestamp
    # Assign separately so a failing `date` is not masked by `local`
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] $level $message" >> "$LOG_FILE"
    echo -e "${BLUE}[$timestamp]${NC} $level $message"
}

# Function to simulate CPU spike
simulate_cpu_spike() {
    local duration=${1:-60}
    local -a pids=()   # keep worker PIDs local to this call
    local i pid
    log "INFO" "Starting CPU spike simulation for ${duration} seconds"

    # Launch CPU-intensive processes
    for i in {1..4}; do
        (
            end_time=$((SECONDS + duration))
            while [ "$SECONDS" -lt "$end_time" ]; do
                # CPU-intensive calculation
                result=0
                for j in {1..100000}; do
                    result=$((result + j))
                done
            done
        ) &
        pids+=("$!")
    done

    # Wait for simulation to complete
    for pid in "${pids[@]}"; do
        wait "$pid" 2>/dev/null || true
    done

    log "INFO" "CPU spike simulation completed"
}

# Function to simulate memory leak
simulate_memory_leak() {
    local duration=${1:-30}
    log "INFO" "Starting memory leak simulation for ${duration} seconds"

    # Create a process that gradually consumes memory
    (
        data=""
        end_time=$((SECONDS + duration))
        while [ "$SECONDS" -lt "$end_time" ]; do
            # Gradually consume memory
            data="${data}X"
            sleep 0.1
        done
    ) &
    MEM_PID=$!

    wait "$MEM_PID" 2>/dev/null || true
    log "INFO" "Memory leak simulation completed"
}

# Function to simulate disk space exhaustion
simulate_disk_full() {
    local target_dir=${1:-"/tmp"}
    local duration=${2:-30}
    log "INFO" "Starting disk space exhaustion simulation in ${target_dir} for ${duration} seconds"

    # Create large files to fill disk space
    (
        end_time=$((SECONDS + duration))
        while [ "$SECONDS" -lt "$end_time" ]; do
            # Create 100MB file
            dd if=/dev/zero of="${target_dir}/incident_test_file_$(date +%s).tmp" bs=1M count=100 2>/dev/null || true
            sleep 2
        done
    ) &
    DISK_PID=$!

    wait "$DISK_PID" 2>/dev/null || true

    # Cleanup test files
    rm -f "${target_dir}"/incident_test_file_*.tmp 2>/dev/null || true
    log "INFO" "Disk space exhaustion simulation completed and cleaned up"
}

# Function to simulate network issues
simulate_network_issues() {
    local interface=${1:-"lo"}
    local duration=${2:-20}
    log "INFO" "Starting network issues simulation on ${interface} for ${duration} seconds"

    # Add network delay and packet loss (requires sudo; failures are ignored)
    sudo tc qdisc add dev "$interface" root netem delay 100ms 50ms loss 10% 2>/dev/null || true

    sleep "$duration"

    # Remove network simulation
    sudo tc qdisc del dev "$interface" root 2>/dev/null || true
    log "INFO" "Network issues simulation completed"
}

# Function to simulate service crashes
simulate_service_crash() {
    local service_name=${1:-"test-service"}
    log "INFO" "Starting service crash simulation for ${service_name}"

    # Simulate service going down
    log "ERROR" "Service ${service_name} crashed unexpectedly"
    sleep 5
    log "INFO" "Service ${service_name} restarted automatically"

    # Simulate multiple crashes
    for i in {1..3}; do
        sleep 2
        log "ERROR" "Service ${service_name} crashed again (attempt $i)"
        sleep 1
        log "INFO" "Service ${service_name} recovered after crash $i"
    done
}

# Function to simulate database issues
simulate_database_issues() {
    local duration=${1:-25}
    log "INFO" "Starting database issues simulation for ${duration} seconds"

    # Simulate connection pool exhaustion
    log "WARN" "Database connection pool nearing capacity"
    sleep 5
    log "ERROR" "Database connection pool exhausted"
    sleep 5
    log "ERROR" "Database query timeout occurred"
    sleep 5
    log "WARN" "Database connections recovering"
    sleep 5
    log "INFO" "Database connections restored"

    log "INFO" "Database issues simulation completed"
}

# Function to simulate application errors
simulate_application_errors() {
    local error_count=${1:-10}
    log "INFO" "Starting application error simulation (${error_count} errors)"

    for i in $(seq 1 "$error_count"); do
        case $((RANDOM % 4)) in
            0)
                log "ERROR" "NullPointerException in UserService.getUser($i)"
                ;;
            1)
                log "ERROR" "TimeoutException: Database query timed out for user ID: $i"
                ;;
            2)
                log "ERROR" "ValidationException: Invalid input data for request $i"
                ;;
            3)
                log "ERROR" "IOException: Failed to write to log file"
                ;;
        esac
        sleep $((RANDOM % 3 + 1))
    done

    log "INFO" "Application error simulation completed"
}

# Function to run comprehensive incident scenario
run_comprehensive_scenario() {
    log "INFO" "Starting comprehensive incident scenario simulation"

    # Phase 1: Initial system stress
    log "INFO" "Phase 1: System stress simulation"
    simulate_cpu_spike 30 &
    CPU_PID=$!
    simulate_memory_leak 20 &
    MEM_PID=$!

    sleep 10

    # Phase 2: Service degradation
    log "INFO" "Phase 2: Service degradation simulation"
    simulate_service_crash "web-service" &
    SERVICE_PID=$!

    sleep 5

    # Phase 3: Database issues
    log "INFO" "Phase 3: Database issues simulation"
    simulate_database_issues 15 &
    DB_PID=$!

    # Phase 4: Application errors
    log "INFO" "Phase 4: Application error burst"
    simulate_application_errors 15 &
    APP_PID=$!

    # Phase 5: Infrastructure issues
    log "INFO" "Phase 5: Infrastructure issues simulation"
    simulate_disk_full "/tmp" 10 &
    DISK_PID=$!

    # Wait for all simulations to complete
    wait $CPU_PID 2>/dev/null || true
    wait $MEM_PID 2>/dev/null || true
    wait $SERVICE_PID 2>/dev/null || true
    wait $DB_PID 2>/dev/null || true
    wait $APP_PID 2>/dev/null || true
    wait $DISK_PID 2>/dev/null || true

    log "INFO" "Comprehensive incident scenario completed"
}

# Function to show usage
show_usage() {
    echo "Enterprise Incident Simulation Script"
    echo "Usage: $0 [SCENARIO] [OPTIONS]"
    echo ""
    echo "SCENARIOS:"
    echo "  cpu [DURATION]                  - Simulate CPU spike (default: 60s)"
    echo "  memory [DURATION]               - Simulate memory leak (default: 30s)"
    echo "  disk [DIR] [DURATION]           - Simulate disk space exhaustion (default: /tmp, 30s)"
    echo "  network [INTERFACE] [DURATION]  - Simulate network issues (default: lo, 20s)"
    echo "  service [NAME]                  - Simulate service crashes (default: test-service)"
    echo "  database [DURATION]             - Simulate database issues (default: 25s)"
    echo "  app-errors [COUNT]              - Simulate application errors (default: 10)"
    echo "  comprehensive                   - Run full incident scenario"
    echo "  all                             - Run all individual scenarios sequentially"
    echo ""
    echo "EXAMPLES:"
    echo "  $0 cpu 120           - CPU spike for 2 minutes"
    echo "  $0 disk /var/log 45  - Disk full simulation in /var/log for 45 seconds"
    echo "  $0 comprehensive     - Full incident simulation"
    echo ""
}

# Main execution
main() {
    local scenario=${1:-"comprehensive"}

    # Create log directory if it doesn't exist
    mkdir -p "$(dirname "$LOG_FILE")"

    log "INFO" "Incident simulation script started"
    log "INFO" "Scenario: $scenario"

    case $scenario in
        "cpu")
            simulate_cpu_spike "${2:-60}"
            ;;
        "memory")
            simulate_memory_leak "${2:-30}"
            ;;
        "disk")
            simulate_disk_full "${2:-/tmp}" "${3:-30}"
            ;;
        "network")
            simulate_network_issues "${2:-lo}" "${3:-20}"
            ;;
        "service")
            simulate_service_crash "${2:-test-service}"
            ;;
        "database")
            simulate_database_issues "${2:-25}"
            ;;
        "app-errors")
            simulate_application_errors "${2:-10}"
            ;;
        "comprehensive")
            run_comprehensive_scenario
            ;;
        "all")
            log "INFO" "Running all scenarios sequentially"
            simulate_cpu_spike 30
            sleep 5
            simulate_memory_leak 20
            sleep 5
            simulate_disk_full "/tmp" 15
            sleep 5
            simulate_service_crash "test-service"
            sleep 5
            simulate_database_issues 15
            sleep 5
            simulate_application_errors 8
            sleep 5
            simulate_network_issues "lo" 10
            ;;
        "help"|"-h"|"--help")
            show_usage
            exit 0
            ;;
        *)
            echo -e "${RED}Error: Unknown scenario '$scenario'${NC}"
            echo ""
            show_usage
            exit 1
            ;;
    esac

    log "INFO" "Incident simulation script completed successfully"
    echo -e "${GREEN}Simulation completed. Check logs at: $LOG_FILE${NC}"
}

# Run main function with all arguments
main "$@"
@@ -0,0 +1,11 @@
.PHONY: run test demo

run:
	python3 cli.py --help

test:
	python3 -m py_compile cli.py collectors/*.py validators/*.py reports/*.py
	python3 -m unittest discover -s tests

demo:
	python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
@@ -0,0 +1,87 @@
# Migration Validation Framework

## Problem

Infrastructure migrations often fail in small, expensive ways: a mount option changes, a service is disabled, or disk usage moves past an operational threshold. Teams need structured evidence that the migrated host still matches the expected operating profile.

## CV Relevance

This project maps to storage/platform migration validation work: collecting pre-migration and post-migration state, comparing results, and producing evidence that can be attached to change or migration tickets.

## What This Project Demonstrates

- A Python CLI for collecting and comparing system snapshots.
- Modular collectors for mounts, services, and disk usage.
- Risk assessment and validation checks for before/after drift.
- JSON and HTML evidence suitable for migration review.
- Offline tests and examples that run without remote hosts.

## Architecture

```
Operator -> CLI -> Collectors -> JSON Snapshot -> Comparator -> Diff/Report
```

Core components:

- `cli.py` provides collect, compare, snapshot, list, and report commands.
- `collectors/` gathers mounts, services, and disk usage.
- `validators/compare.py` identifies drift and validation failures.
- `reports/` contains report generation helpers with escaped HTML output.
- `examples/` contains realistic before/after evidence.
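
The comparator step in this architecture can be pictured as a recursive diff over two snapshot dictionaries. The sketch below is an assumption about how such drift detection might look, not the code in `validators/compare.py`: `diff_snapshots`, the change record fields, and the snapshot shape are all hypothetical.

```python
def diff_snapshots(before, after, path=""):
    """Return a list of change records between two nested dicts."""
    changes = []
    for key in sorted(set(before) | set(after)):
        location = f"{path}.{key}" if path else str(key)
        if key not in after:
            changes.append({"path": location, "change": "removed", "before": before[key]})
        elif key not in before:
            changes.append({"path": location, "change": "added", "after": after[key]})
        elif isinstance(before[key], dict) and isinstance(after[key], dict):
            # Recurse so each change is reported at the deepest differing key.
            changes.extend(diff_snapshots(before[key], after[key], location))
        elif before[key] != after[key]:
            changes.append({
                "path": location,
                "change": "modified",
                "before": before[key],
                "after": after[key],
            })
    return changes


if __name__ == "__main__":
    # Hypothetical snapshot fragments: a mount option and a service state drift.
    before = {"web01": {"mounts": {"/data": "rw,noatime"}, "services": {"sshd": "running"}}}
    after = {"web01": {"mounts": {"/data": "rw"}, "services": {"sshd": "stopped"}}}
    for change in diff_snapshots(before, after):
        print(change)
```

Because the function only needs two dictionaries, it can be run against saved JSON snapshots long after collection, which keeps collection and comparison separate.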

## Quickstart

```bash
cd professional-infra/migration-validation-framework
make test
make demo
```

The demo compares the included example snapshots:

```bash
python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
```

The example intentionally returns `FAIL` because it demonstrates a high-risk migration finding.

Legacy snapshot IDs are still supported:

```bash
python3 cli.py snapshot --env prod --label pre --systems web01,db01
python3 cli.py compare prod-pre-20260429_020000 prod-post-20260429_030000 --output change-0429
```
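
Both forms work because a reference is first tried as a file path and only then as an ID under `snapshots/`. The sketch below mirrors that lookup and the `{env}-{label}-{timestamp}` ID format used in `cli.py`; the standalone function names `snapshot_id` and `resolve_snapshot` are illustrative and do not exist in the CLI.

```python
from datetime import datetime
from pathlib import Path

SNAPSHOTS_DIR = Path("snapshots")


def snapshot_id(env, label, when):
    """Build an ID such as 'prod-pre-20260429_020000'."""
    return f"{env}-{label}-{when.strftime('%Y%m%d_%H%M%S')}"


def resolve_snapshot(reference, snapshots_dir=SNAPSHOTS_DIR):
    """Treat `reference` as a path if it exists, else as a legacy snapshot ID."""
    candidate = Path(reference)
    if candidate.exists():
        return candidate
    return snapshots_dir / f"{reference}.json"


if __name__ == "__main__":
    legacy = snapshot_id("prod", "pre", datetime(2026, 4, 29, 2, 0, 0))
    print(legacy)
    print(resolve_snapshot(legacy))
```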

## Validation

```bash
make test
```

This compiles the Python modules and runs unit tests for example comparison, parser behavior, and HTML escaping.
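
HTML escaping matters here because snapshot values such as mount options or service names end up inside the report markup. A minimal sketch of the idea, using the standard library's `html.escape`, is shown below; `render_change_row` is a hypothetical helper and not the function in `reports/`.

```python
from html import escape


def render_change_row(path, before, after):
    """Render one table row, escaping every value before it reaches the markup."""
    cells = (path, before, after)
    return "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in cells) + "</tr>"


if __name__ == "__main__":
    # A hostile value collected from a host must not become live markup.
    print(render_change_row("services.sshd", "running", "<script>alert(1)</script>"))
```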

## Example Output

```text
Comparison completed: diff.json (FAIL)
Overall risk: high
Total changes: 4
Failed checks: critical_services_running
Recommendation: restore sshd before production cutover
```

Sample inputs and output are available in [examples/before.json](examples/before.json), [examples/after.json](examples/after.json), and [examples/diff.json](examples/diff.json).
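
To illustrate how a single failed check can turn the overall result into `FAIL`, the sketch below models a `critical_services_running` check. Only the check name and the `sshd` example come from the output above; the function signatures, the result shape, and the `CRITICAL_SERVICES` list are assumptions made for this example.

```python
CRITICAL_SERVICES = ("sshd",)  # hypothetical list of must-run services


def critical_services_running(services, required=CRITICAL_SERVICES):
    """Pass only if every required service is reported as 'running'."""
    missing = [name for name in required if services.get(name) != "running"]
    return {"check": "critical_services_running", "passed": not missing, "details": missing}


def overall_status(checks):
    """Any failed check fails the whole comparison."""
    return "PASS" if all(check["passed"] for check in checks) else "FAIL"


if __name__ == "__main__":
    after_services = {"sshd": "stopped", "nginx": "running"}  # hypothetical post-migration state
    result = critical_services_running(after_services)
    print(result)
    print(overall_status([result]))
```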

## Roadmap

- Add database-specific migration checks.
- Add performance baseline comparisons.
- Add a REST API wrapper for CI/CD integration.
- Add compliance-oriented validation profiles.

## Interview Talking Points

- Why pre/post migration evidence reduces risk during storage and platform migrations.
- How to separate collection from comparison so evidence can be replayed.
- How drift detection supports change approval and rollback decisions.
@@ -0,0 +1,323 @@
#!/usr/bin/env python3
"""
Migration Validation Framework - CLI Interface

A comprehensive tool for validating system migrations through data collection,
snapshot comparison, and automated reporting.
"""

import argparse
import json
import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Any

# Import framework modules
from collectors import mounts, services, disk_usage
from validators import compare
from reports import html_report

# Configuration
SNAPSHOTS_DIR = Path("snapshots")
LOGS_DIR = Path("logs")
REPORTS_DIR = Path("reports")

class MigrationValidator:
    """Main migration validation class."""

    def __init__(self, verbose: bool = False):
        self.verbose = verbose
        self.ensure_directories()
        self.setup_logging()

    def setup_logging(self):
        """Configure logging."""
        log_level = logging.DEBUG if self.verbose else logging.INFO
        logging.basicConfig(
            level=log_level,
            format='%(asctime)s - %(levelname)s - %(message)s',
            handlers=[
                logging.FileHandler(LOGS_DIR / "validation.log"),
                logging.StreamHandler(sys.stdout)
            ]
        )
        self.logger = logging.getLogger(__name__)

    def ensure_directories(self):
        """Ensure required directories exist."""
        for directory in [SNAPSHOTS_DIR, LOGS_DIR, REPORTS_DIR]:
            directory.mkdir(exist_ok=True)

    def collect_system_data(self, systems: List[str]) -> Dict[str, Any]:
        """Collect data from target systems."""
        self.logger.info(f"Collecting data from systems: {systems}")

        snapshot = {
            "metadata": {
                "timestamp": datetime.now().isoformat(),
                "systems": systems,
                "version": "1.0"
            },
            "data": {}
        }

        collectors = [
            ("mounts", mounts.collect),
            ("services", services.collect),
            ("disk_usage", disk_usage.collect)
        ]

        for system in systems:
            self.logger.info(f"Collecting data from {system}")
            snapshot["data"][system] = {}

            for collector_name, collector_func in collectors:
                try:
                    self.logger.debug(f"Running {collector_name} collector on {system}")
                    data = collector_func(system)
                    snapshot["data"][system][collector_name] = data
                except Exception as e:
                    # Record the failure in the snapshot so the gap stays visible
                    self.logger.error(f"Failed to collect {collector_name} from {system}: {e}")
                    snapshot["data"][system][collector_name] = {"error": str(e)}

        return snapshot

    def save_snapshot(self, snapshot: Dict[str, Any], label: str, env: str) -> str:
        """Save snapshot to disk."""
        snapshot_id = f"{env}-{label}-{datetime.now().strftime('%Y%m%d_%H%M%S')}"
        snapshot_file = SNAPSHOTS_DIR / f"{snapshot_id}.json"

        with open(snapshot_file, 'w') as f:
            json.dump(snapshot, f, indent=2)

        self.logger.info(f"Snapshot saved: {snapshot_id}")
        return snapshot_id

    def load_snapshot(self, snapshot_id: str) -> Dict[str, Any]:
        """Load snapshot from disk."""
        # Accept either an explicit file path or a legacy snapshot ID
        snapshot_path = Path(snapshot_id)
        snapshot_file = snapshot_path if snapshot_path.exists() else SNAPSHOTS_DIR / f"{snapshot_id}.json"
        if not snapshot_file.exists():
            raise FileNotFoundError(f"Snapshot {snapshot_id} not found")

        with open(snapshot_file, 'r') as f:
            return json.load(f)

    def collect_to_file(self, output_file: str, systems: List[str]) -> str:
        """Collect a snapshot and write it to an explicit file path."""
        snapshot = self.collect_system_data(systems)
        with open(output_file, 'w') as f:
            json.dump(snapshot, f, indent=2)
            f.write("\n")
        self.logger.info(f"Snapshot written: {output_file}")
        return output_file

    def create_snapshot(self, env: str, label: str, systems: List[str]) -> str:
        """Create and save a system snapshot."""
        self.logger.info(f"Creating snapshot for environment: {env}, label: {label}")

        snapshot = self.collect_system_data(systems)
        snapshot_id = self.save_snapshot(snapshot, label, env)

        return snapshot_id

    def compare_snapshots(self, snapshot1_id: str, snapshot2_id: str, output_id: str) -> Dict[str, Any]:
        """Compare two snapshots."""
        self.logger.info(f"Comparing snapshots: {snapshot1_id} vs {snapshot2_id}")

        snapshot1 = self.load_snapshot(snapshot1_id)
        snapshot2 = self.load_snapshot(snapshot2_id)

        comparison = compare.compare_snapshots(snapshot1, snapshot2)
        comparison["metadata"] = {
            "snapshot1": snapshot1_id,
            "snapshot2": snapshot2_id,
            "timestamp": datetime.now().isoformat(),
            "comparison_id": output_id
        }

        # Save comparison results
        comparison_file = REPORTS_DIR / f"comparison_{output_id}.json"
        with open(comparison_file, 'w') as f:
            json.dump(comparison, f, indent=2)

        self.logger.info(f"Comparison saved: {output_id}")
        return comparison

    def compare_files(self, before_file: str, after_file: str, output_file: Optional[str] = None) -> Dict[str, Any]:
        """Compare two explicit JSON snapshot files."""
        self.logger.info(f"Comparing files: {before_file} vs {after_file}")

        before = self.load_snapshot(before_file)
        after = self.load_snapshot(after_file)
        comparison = compare.compare_snapshots(before, after)
        comparison["metadata"] = {
            "before": before_file,
            "after": after_file,
            "timestamp": datetime.now().isoformat()
        }

        if output_file:
            with open(output_file, 'w') as f:
                json.dump(comparison, f, indent=2)
                f.write("\n")
            self.logger.info(f"Comparison written: {output_file}")

        return comparison

def generate_report(self, comparison_id: str, format_type: str, output_file: Optional[str] = None) -> str:
|
||||||
|
"""Generate a report from comparison results."""
|
||||||
|
self.logger.info(f"Generating {format_type} report for comparison: {comparison_id}")
|
||||||
|
|
||||||
|
comparison_file = REPORTS_DIR / f"comparison_{comparison_id}.json"
|
||||||
|
if not comparison_file.exists():
|
||||||
|
raise FileNotFoundError(f"Comparison {comparison_id} not found")
|
||||||
|
|
||||||
|
with open(comparison_file, 'r') as f:
|
||||||
|
comparison = json.load(f)
|
||||||
|
|
||||||
|
if format_type == "html":
|
||||||
|
if output_file is None:
|
||||||
|
output_file = f"migration_report_{comparison_id}.html"
|
||||||
|
html_report.generate(comparison, output_file)
|
||||||
|
elif format_type == "json":
|
||||||
|
if output_file is None:
|
||||||
|
output_file = f"migration_report_{comparison_id}.json"
|
||||||
|
with open(output_file, 'w') as f:
|
||||||
|
json.dump(comparison, f, indent=2)
|
||||||
|
else:
|
||||||
|
raise ValueError(f"Unsupported format: {format_type}")
|
||||||
|
|
||||||
|
self.logger.info(f"Report generated: {output_file}")
|
||||||
|
return output_file
|
||||||
|
|
||||||
|
def main():
|
||||||
|
"""Main CLI entry point."""
|
||||||
|
parser = argparse.ArgumentParser(
|
||||||
|
description="Migration Validation Framework",
|
||||||
|
formatter_class=argparse.RawDescriptionHelpFormatter,
|
||||||
|
epilog="""
|
||||||
|
Examples:
|
||||||
|
# Collect pre-migration snapshot
|
||||||
|
python3 cli.py collect --output before.json --systems web01,db01
|
||||||
|
|
||||||
|
# Compare snapshot files
|
||||||
|
python3 cli.py compare before.json after.json --output diff.json
|
||||||
|
|
||||||
|
# Generate HTML report
|
||||||
|
python3 cli.py report --comparison comparison_001 --format html
|
||||||
|
"""
|
||||||
|
)
|
||||||
|
|
||||||
|
parser.add_argument('--verbose', '-v', action='store_true', help='Enable verbose logging')
|
||||||
|
parser.add_argument('--dry-run', action='store_true', help='Preview actions without execution')
|
||||||
|
|
||||||
|
subparsers = parser.add_subparsers(dest='command', help='Available commands')
|
||||||
|
|
||||||
|
# Collect command
|
||||||
|
collect_parser = subparsers.add_parser('collect', help='Collect a system snapshot to a JSON file')
|
||||||
|
collect_parser.add_argument('--output', required=True, help='Output JSON file')
|
||||||
|
collect_parser.add_argument('--systems', default='localhost', help='Comma-separated list of systems')
|
||||||
|
|
||||||
|
# Snapshot command
|
||||||
|
snapshot_parser = subparsers.add_parser('snapshot', help='Create system snapshot')
|
||||||
|
snapshot_parser.add_argument('--env', required=True, help='Target environment')
|
||||||
|
snapshot_parser.add_argument('--label', required=True, help='Snapshot label')
|
||||||
|
snapshot_parser.add_argument('--systems', required=True, help='Comma-separated list of systems')
|
||||||
|
|
||||||
|
# Compare command
|
||||||
|
compare_parser = subparsers.add_parser('compare', help='Compare two snapshots')
|
||||||
|
compare_parser.add_argument('snapshot1', help='First snapshot ID or JSON file')
|
||||||
|
compare_parser.add_argument('snapshot2', help='Second snapshot ID or JSON file')
|
||||||
|
compare_parser.add_argument('--output', help='Output comparison ID or JSON file')
|
||||||
|
|
||||||
|
# Report command
|
||||||
|
report_parser = subparsers.add_parser('report', help='Generate report from comparison')
|
||||||
|
report_parser.add_argument('--comparison', required=True, help='Comparison ID')
|
||||||
|
report_parser.add_argument('--format', choices=['html', 'json'], default='html', help='Report format')
|
||||||
|
report_parser.add_argument('--output', help='Output file path')
|
||||||
|
|
||||||
|
# List command
|
||||||
|
list_parser = subparsers.add_parser('list', help='List snapshots or comparisons')
|
||||||
|
list_parser.add_argument('type', choices=['snapshots', 'comparisons'], help='Type to list')
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
if not args.command:
|
||||||
|
parser.print_help()
|
||||||
|
return
|
||||||
|
|
||||||
|
# Initialize validator
|
||||||
|
validator = MigrationValidator(verbose=args.verbose)
|
||||||
|
|
||||||
|
try:
|
||||||
|
if args.command == 'collect':
|
||||||
|
systems = [system.strip() for system in args.systems.split(',') if system.strip()]
|
||||||
|
if args.dry_run:
|
||||||
|
print(f"DRY RUN: Would collect {systems} into {args.output}")
|
||||||
|
return
|
||||||
|
|
||||||
|
output_file = validator.collect_to_file(args.output, systems)
|
||||||
|
print(f"Snapshot written: {output_file}")
|
||||||
|
|
||||||
|
elif args.command == 'snapshot':
|
||||||
|
systems = [system.strip() for system in args.systems.split(',') if system.strip()]
|
||||||
|
if args.dry_run:
|
||||||
|
print(f"DRY RUN: Would create snapshot for systems: {systems}")
|
||||||
|
return
|
||||||
|
|
||||||
|
snapshot_id = validator.create_snapshot(args.env, args.label, systems)
|
||||||
|
print(f"Snapshot created: {snapshot_id}")
|
||||||
|
|
||||||
|
elif args.command == 'compare':
|
||||||
|
if args.dry_run:
|
||||||
|
print(f"DRY RUN: Would compare {args.snapshot1} vs {args.snapshot2}")
|
||||||
|
return
|
||||||
|
|
||||||
|
output = args.output
|
||||||
|
if output and output.endswith('.json'):
|
||||||
|
comparison = validator.compare_files(args.snapshot1, args.snapshot2, output)
|
||||||
|
result = "PASS" if comparison.get("validation_results", {}).get("passed") else "FAIL"
|
||||||
|
print(f"Comparison completed: {output} ({result})")
|
||||||
|
else:
|
||||||
|
output_id = output or datetime.now().strftime('%Y%m%d_%H%M%S')
|
||||||
|
comparison = validator.compare_snapshots(args.snapshot1, args.snapshot2, output_id)
|
||||||
|
result = "PASS" if comparison.get("validation_results", {}).get("passed") else "FAIL"
|
||||||
|
print(f"Comparison completed: {output_id} ({result})")
|
||||||
|
|
||||||
|
elif args.command == 'report':
|
||||||
|
if args.dry_run:
|
||||||
|
print(f"DRY RUN: Would generate {args.format} report for {args.comparison}")
|
||||||
|
return
|
||||||
|
|
||||||
|
output_file = validator.generate_report(args.comparison, args.format, args.output)
|
||||||
|
print(f"Report generated: {output_file}")
|
||||||
|
|
||||||
|
elif args.command == 'list':
|
||||||
|
if args.type == 'snapshots':
|
||||||
|
snapshots = list(SNAPSHOTS_DIR.glob("*.json"))
|
||||||
|
if snapshots:
|
||||||
|
print("Available snapshots:")
|
||||||
|
for snapshot in sorted(snapshots):
|
||||||
|
print(f" {snapshot.stem}")
|
||||||
|
else:
|
||||||
|
print("No snapshots found")
|
||||||
|
elif args.type == 'comparisons':
|
||||||
|
comparisons = list(REPORTS_DIR.glob("comparison_*.json"))
|
||||||
|
if comparisons:
|
||||||
|
print("Available comparisons:")
|
||||||
|
for comparison in sorted(comparisons):
|
||||||
|
comp_id = comparison.stem.replace('comparison_', '')
|
||||||
|
print(f" {comp_id}")
|
||||||
|
else:
|
||||||
|
print("No comparisons found")
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
validator.logger.error(f"Command failed: {e}")
|
||||||
|
print(f"Error: {e}", file=sys.stderr)
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
main()
|
||||||
@@ -0,0 +1,206 @@
|
|||||||
|
"""
|
||||||
|
Disk Usage Data Collector
|
||||||
|
|
||||||
|
Collects disk usage statistics including directory sizes,
|
||||||
|
file system usage, and largest files information.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import shlex
|
||||||
|
import subprocess
|
||||||
|
from typing import Dict, Any, List
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class DiskUsageCollector:
|
||||||
|
"""Collector for disk usage statistics."""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.max_depth = 3
|
||||||
|
self.exclude_paths = [
|
||||||
|
"/proc",
|
||||||
|
"/sys",
|
||||||
|
"/dev",
|
||||||
|
"/run",
|
||||||
|
"/tmp",
|
||||||
|
"/var/log"
|
||||||
|
]
|
||||||
|
|
||||||
|
def collect_disk_usage(self, system: str) -> Dict[str, Any]:
|
||||||
|
"""Collect disk usage information from target system."""
|
||||||
|
logger.info(f"Collecting disk usage data from {system}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Collect filesystem usage
|
||||||
|
filesystem_usage = self.collect_filesystem_usage(system)
|
||||||
|
|
||||||
|
# Collect directory sizes
|
||||||
|
directory_sizes = self.collect_directory_sizes(system)
|
||||||
|
|
||||||
|
# Collect largest files
|
||||||
|
largest_files = self.collect_largest_files(system)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"filesystem_usage": filesystem_usage,
|
||||||
|
"directory_sizes": directory_sizes,
|
||||||
|
"largest_files": largest_files,
|
||||||
|
"timestamp": self.get_timestamp(system)
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to collect disk usage from {system}: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
def collect_filesystem_usage(self, system: str) -> List[Dict[str, Any]]:
|
||||||
|
"""Collect filesystem usage statistics."""
|
||||||
|
usage_stats = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Run df command
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "df -h --output=source,fstype,size,used,avail,pcent,target"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=30
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode != 0:
|
||||||
|
raise RuntimeError(f"df command failed: {result.stderr}")
|
||||||
|
|
||||||
|
# Parse output
|
||||||
|
lines = result.stdout.strip().split('\n')
|
||||||
|
if len(lines) < 2:
|
||||||
|
return usage_stats
|
||||||
|
|
||||||
|
for line in lines[1:]: # Skip header
|
||||||
|
parts = line.split()
|
||||||
|
if len(parts) >= 7:
|
||||||
|
usage_stat = {
|
||||||
|
"filesystem": parts[0],
|
||||||
|
"type": parts[1],
|
||||||
|
"size": parts[2],
|
||||||
|
"used": parts[3],
|
||||||
|
"available": parts[4],
|
||||||
|
"use_percent": parts[5],
|
||||||
|
"mountpoint": parts[6]
|
||||||
|
}
|
||||||
|
usage_stats.append(usage_stat)
|
||||||
|
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
logger.error(f"Timeout collecting filesystem usage from {system}")
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to collect filesystem usage from {system}: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
return usage_stats
|
||||||
|
|
||||||
|
def collect_directory_sizes(self, system: str) -> List[Dict[str, Any]]:
|
||||||
|
"""Collect sizes of top-level directories."""
|
||||||
|
directory_sizes = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Get top-level directories
|
||||||
|
dirs_to_check = ["/", "/home", "/var", "/usr", "/opt", "/etc"]
|
||||||
|
|
||||||
|
for directory in dirs_to_check:
|
||||||
|
if directory in self.exclude_paths:
|
||||||
|
continue
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Run du command for directory size
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, f"du -sh -- {shlex.quote(directory)} 2>/dev/null"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=60
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.stdout.strip():  # du exits non-zero when some subdirectories are unreadable but still prints the total
|
||||||
|
size, path = result.stdout.strip().split('\t', 1)
|
||||||
|
directory_sizes.append({
|
||||||
|
"path": path,
|
||||||
|
"size": size
|
||||||
|
})
|
||||||
|
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
logger.warning(f"Timeout getting size for {directory} on {system}")
|
||||||
|
continue
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Failed to get size for {directory} on {system}: {e}")
|
||||||
|
continue
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to collect directory sizes from {system}: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
return directory_sizes
|
||||||
|
|
||||||
|
def collect_largest_files(self, system: str) -> List[Dict[str, Any]]:
|
||||||
|
"""Collect information about largest files in the system."""
|
||||||
|
largest_files = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Find largest files (excluding certain paths)
|
||||||
|
exclude_expr = " ".join(f"-not -path {shlex.quote(path + '/*')}" for path in self.exclude_paths)
|
||||||
|
|
||||||
|
cmd = f"find / {exclude_expr} -type f -exec ls -lh {{}} + 2>/dev/null | sort -k5 -hr | head -20"
|
||||||
|
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, cmd],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=120
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode == 0:
|
||||||
|
for line in result.stdout.strip().split('\n'):
|
||||||
|
if not line.strip():
|
||||||
|
continue
|
||||||
|
|
||||||
|
parts = line.split()
|
||||||
|
if len(parts) >= 9:
|
||||||
|
file_info = {
|
||||||
|
"permissions": parts[0],
|
||||||
|
"links": parts[1],
|
||||||
|
"owner": parts[2],
|
||||||
|
"group": parts[3],
|
||||||
|
"size": parts[4],
|
||||||
|
"month": parts[5],
|
||||||
|
"day": parts[6],
|
||||||
|
"time": parts[7],
|
||||||
|
"path": " ".join(parts[8:])
|
||||||
|
}
|
||||||
|
largest_files.append(file_info)
|
||||||
|
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
logger.error(f"Timeout collecting largest files from {system}")
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to collect largest files from {system}: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
return largest_files
|
||||||
|
|
||||||
|
def get_timestamp(self, system: str) -> str:
|
||||||
|
"""Get current timestamp from target system."""
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "date -Iseconds"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=10
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode == 0:
|
||||||
|
return result.stdout.strip()
|
||||||
|
else:
|
||||||
|
return "unknown"
|
||||||
|
|
||||||
|
except Exception:
|
||||||
|
return "unknown"
|
||||||
|
|
||||||
|
def collect(system: str) -> Dict[str, Any]:
|
||||||
|
"""Main collection function for disk usage data."""
|
||||||
|
collector = DiskUsageCollector()
|
||||||
|
return collector.collect_disk_usage(system)
|
||||||
@@ -0,0 +1,172 @@
|
|||||||
|
"""
|
||||||
|
Mounts Data Collector
|
||||||
|
|
||||||
|
Collects filesystem mount information including mount points, devices,
|
||||||
|
filesystem types, and usage statistics.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import shlex
|
||||||
|
import subprocess
|
||||||
|
from typing import Dict, Any, List
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class MountsCollector:
|
||||||
|
"""Collector for filesystem mount information."""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.exclude_patterns = [
|
||||||
|
"/proc/*",
|
||||||
|
"/sys/*",
|
||||||
|
"/dev/*",
|
||||||
|
"/run/*"
|
||||||
|
]
|
||||||
|
|
||||||
|
def collect_mounts(self, system: str) -> Dict[str, Any]:
|
||||||
|
"""Collect mount information from target system."""
|
||||||
|
logger.info(f"Collecting mounts data from {system}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Run mount command
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "mount"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=30
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode != 0:
|
||||||
|
raise RuntimeError(f"Mount command failed: {result.stderr}")
|
||||||
|
|
||||||
|
mounts = self.parse_mount_output(result.stdout)
|
||||||
|
filtered_mounts = self.filter_mounts(mounts)
|
||||||
|
|
||||||
|
# Get usage statistics
|
||||||
|
usage_stats = self.collect_usage_stats(system, filtered_mounts)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"mounts": filtered_mounts,
|
||||||
|
"usage": usage_stats,
|
||||||
|
"timestamp": self.get_timestamp(system)
|
||||||
|
}
|
||||||
|
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
logger.error(f"Timeout collecting mounts from {system}")
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to collect mounts from {system}: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
def parse_mount_output(self, output: str) -> List[Dict[str, str]]:
|
||||||
|
"""Parse mount command output."""
|
||||||
|
mounts = []
|
||||||
|
|
||||||
|
for line in output.strip().split('\n'):
|
||||||
|
if not line.strip():
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Parse mount output format: device on mountpoint type fstype (options)
|
||||||
|
parts = line.split()
|
||||||
|
if len(parts) >= 6 and parts[1] == 'on' and parts[3] == 'type':
|
||||||
|
mount_info = {
|
||||||
|
"device": parts[0],
|
||||||
|
"mountpoint": parts[2],
|
||||||
|
"fstype": parts[4],
|
||||||
|
"options": parts[5].strip('()')
|
||||||
|
}
|
||||||
|
mounts.append(mount_info)
|
||||||
|
|
||||||
|
return mounts
|
||||||
|
|
||||||
|
def filter_mounts(self, mounts: List[Dict[str, str]]) -> List[Dict[str, str]]:
|
||||||
|
"""Filter out unwanted mount points."""
|
||||||
|
filtered = []
|
||||||
|
|
||||||
|
for mount in mounts:
|
||||||
|
mountpoint = mount["mountpoint"]
|
||||||
|
if not any(mountpoint.startswith(pattern.rstrip('*')) for pattern in self.exclude_patterns):
|
||||||
|
filtered.append(mount)
|
||||||
|
|
||||||
|
return filtered
|
||||||
|
|
||||||
|
def collect_usage_stats(self, system: str, mounts: List[Dict[str, str]]) -> Dict[str, Any]:
|
||||||
|
"""Collect disk usage statistics for mount points."""
|
||||||
|
usage_stats = {}
|
||||||
|
|
||||||
|
for mount in mounts:
|
||||||
|
mountpoint = mount["mountpoint"]
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Run df command for usage statistics
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, f"df -BG -- {shlex.quote(mountpoint)}"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=15
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode == 0:
|
||||||
|
usage_stats[mountpoint] = self.parse_df_output(result.stdout)
|
||||||
|
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
logger.warning(f"Timeout getting usage for {mountpoint} on {system}")
|
||||||
|
usage_stats[mountpoint] = {"error": "timeout"}
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Failed to get usage for {mountpoint} on {system}: {e}")
|
||||||
|
usage_stats[mountpoint] = {"error": str(e)}
|
||||||
|
|
||||||
|
return usage_stats
|
||||||
|
|
||||||
|
    def parse_df_output(self, output: str) -> Dict[str, Any]:
        """Parse df command output."""
        lines = output.strip().split('\n')
        if len(lines) < 2:
            return {"error": "invalid df output"}

        # Parse header and data. "Mounted on" is a single column, so join it
        # before splitting; otherwise the header always has one more field than the data.
        header = lines[0].replace("Mounted on", "Mounted_on").split()
        data = lines[1].split()

        if len(header) != len(data):
            return {"error": "header/data mismatch"}

        stats = {}
        for i, field in enumerate(header):
            if i < len(data):
                if field in ['1G-blocks', 'Used', 'Available']:
                    # Convert to GB
                    value = data[i]
                    if value.endswith('G'):
                        stats[field.lower()] = float(value.rstrip('G'))
                    else:
                        stats[field.lower()] = float(value) / (1024**3)  # Assume bytes
                elif field == 'Use%':
                    stats['use_percent'] = int(data[i].rstrip('%'))
                else:
                    stats[field.lower()] = data[i]

        return stats
|
||||||
|
|
||||||
|
def get_timestamp(self, system: str) -> str:
|
||||||
|
"""Get current timestamp from target system."""
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "date -Iseconds"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=10
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode == 0:
|
||||||
|
return result.stdout.strip()
|
||||||
|
else:
|
||||||
|
return "unknown"
|
||||||
|
|
||||||
|
except Exception:
|
||||||
|
return "unknown"
|
||||||
|
|
||||||
|
def collect(system: str) -> Dict[str, Any]:
|
||||||
|
"""Main collection function for mounts data."""
|
||||||
|
collector = MountsCollector()
|
||||||
|
return collector.collect_mounts(system)
|
||||||
@@ -0,0 +1,223 @@
|
|||||||
|
"""
|
||||||
|
Services Data Collector
|
||||||
|
|
||||||
|
Collects system service information including running services,
|
||||||
|
their states, startup configuration, and dependencies.
|
||||||
|
"""
|
||||||
|
|
||||||
|
import logging
|
||||||
|
import shlex
|
||||||
|
import subprocess
|
||||||
|
from typing import Dict, Any, List
|
||||||
|
|
||||||
|
logger = logging.getLogger(__name__)
|
||||||
|
|
||||||
|
class ServicesCollector:
|
||||||
|
"""Collector for system service information."""
|
||||||
|
|
||||||
|
def __init__(self):
|
||||||
|
self.service_manager = "systemd" # Default to systemd
|
||||||
|
self.include_disabled = False
|
||||||
|
|
||||||
|
def collect_services(self, system: str) -> Dict[str, Any]:
|
||||||
|
"""Collect service information from target system."""
|
||||||
|
logger.info(f"Collecting services data from {system}")
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Detect service manager
|
||||||
|
service_manager = self.detect_service_manager(system)
|
||||||
|
|
||||||
|
if service_manager == "systemd":
|
||||||
|
services = self.collect_systemd_services(system)
|
||||||
|
elif service_manager == "sysv":
|
||||||
|
services = self.collect_sysv_services(system)
|
||||||
|
else:
|
||||||
|
raise RuntimeError(f"Unsupported service manager: {service_manager}")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"service_manager": service_manager,
|
||||||
|
"services": services,
|
||||||
|
"timestamp": self.get_timestamp(system)
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to collect services from {system}: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
def detect_service_manager(self, system: str) -> str:
|
||||||
|
"""Detect which service manager is running on the system."""
|
||||||
|
try:
|
||||||
|
# Check for systemd
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "ps -p 1 -o comm="],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=10
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode == 0:
|
||||||
|
if "systemd" in result.stdout.strip():
|
||||||
|
return "systemd"
|
||||||
|
elif "init" in result.stdout.strip():
|
||||||
|
return "sysv"
|
||||||
|
|
||||||
|
# Fallback check
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "which systemctl"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=10
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode == 0:
|
||||||
|
return "systemd"
|
||||||
|
|
||||||
|
return "sysv"
|
||||||
|
|
||||||
|
except Exception:
|
||||||
|
return "unknown"
|
||||||
|
|
||||||
|
def collect_systemd_services(self, system: str) -> List[Dict[str, Any]]:
|
||||||
|
"""Collect systemd service information."""
|
||||||
|
services = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Get all services
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "systemctl list-units --type=service --all --plain --no-pager --no-legend"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=30
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode != 0:
|
||||||
|
raise RuntimeError(f"systemctl list-units failed: {result.stderr}")
|
||||||
|
|
||||||
|
# Parse service list
|
||||||
|
for line in result.stdout.strip().split('\n'):
|
||||||
|
if not line.strip():
|
||||||
|
continue
|
||||||
|
|
||||||
|
parts = line.split()
|
||||||
|
if len(parts) >= 4:
|
||||||
|
service_name = parts[0]
|
||||||
|
load_state = parts[1]
|
||||||
|
active_state = parts[2]
|
||||||
|
sub_state = parts[3]
|
||||||
|
|
||||||
|
# Skip units whose unit file was not found unless include_disabled is set
|
||||||
|
if not self.include_disabled and load_state == "not-found":
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Get detailed service info
|
||||||
|
service_info = self.get_systemd_service_details(system, service_name)
|
||||||
|
|
||||||
|
services.append({
|
||||||
|
"name": service_name,
|
||||||
|
"load_state": load_state,
|
||||||
|
"active_state": active_state,
|
||||||
|
"sub_state": sub_state,
|
||||||
|
**service_info
|
||||||
|
})
|
||||||
|
|
||||||
|
except subprocess.TimeoutExpired:
|
||||||
|
logger.error(f"Timeout collecting systemd services from {system}")
|
||||||
|
raise
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to collect systemd services from {system}: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
return services
|
||||||
|
|
||||||
|
def get_systemd_service_details(self, system: str, service_name: str) -> Dict[str, Any]:
|
||||||
|
"""Get detailed information for a systemd service."""
|
||||||
|
details = {}
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Get service status
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, f"systemctl show {shlex.quote(service_name)} --no-pager"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=15
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode == 0:
|
||||||
|
for line in result.stdout.strip().split('\n'):
|
||||||
|
if '=' in line:
|
||||||
|
key, value = line.split('=', 1)
|
||||||
|
details[key.lower()] = value
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.warning(f"Failed to get details for {service_name}: {e}")
|
||||||
|
|
||||||
|
return details
|
||||||
|
|
||||||
|
def collect_sysv_services(self, system: str) -> List[Dict[str, Any]]:
|
||||||
|
"""Collect SysV init service information."""
|
||||||
|
services = []
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Get service list from /etc/init.d/
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "ls -1 /etc/init.d/"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=15
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode != 0:
|
||||||
|
raise RuntimeError(f"Failed to list init.d services: {result.stderr}")
|
||||||
|
|
||||||
|
for service_name in result.stdout.strip().split('\n'):
|
||||||
|
if not service_name.strip():
|
||||||
|
continue
|
||||||
|
|
||||||
|
# Get service status
|
||||||
|
status_result = subprocess.run(
|
||||||
|
["ssh", system, f"/etc/init.d/{shlex.quote(service_name)} status"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=10
|
||||||
|
)
|
||||||
|
|
||||||
|
status = "unknown"
|
||||||
|
if status_result.returncode == 0:
|
||||||
|
status = "running"
|
||||||
|
elif "not running" in status_result.stdout.lower():
|
||||||
|
status = "stopped"
|
||||||
|
|
||||||
|
services.append({
|
||||||
|
"name": service_name,
|
||||||
|
"status": status,
|
||||||
|
"type": "sysv"
|
||||||
|
})
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
logger.error(f"Failed to collect SysV services from {system}: {e}")
|
||||||
|
raise
|
||||||
|
|
||||||
|
return services
|
||||||
|
|
||||||
|
def get_timestamp(self, system: str) -> str:
|
||||||
|
"""Get current timestamp from target system."""
|
||||||
|
try:
|
||||||
|
result = subprocess.run(
|
||||||
|
["ssh", system, "date -Iseconds"],
|
||||||
|
capture_output=True,
|
||||||
|
text=True,
|
||||||
|
timeout=10
|
||||||
|
)
|
||||||
|
|
||||||
|
if result.returncode == 0:
|
||||||
|
return result.stdout.strip()
|
||||||
|
else:
|
||||||
|
return "unknown"
|
||||||
|
|
||||||
|
except Exception:
|
||||||
|
return "unknown"
|
||||||
|
|
||||||
|
def collect(system: str) -> Dict[str, Any]:
|
||||||
|
"""Main collection function for services data."""
|
||||||
|
collector = ServicesCollector()
|
||||||
|
return collector.collect_services(system)
|
||||||
@@ -0,0 +1,30 @@
# Migration Validation Framework Architecture

## Components

- CLI: parses operator commands and coordinates workflows.
- Collectors: gather mounts, services, and disk usage from target systems.
- Snapshot files: JSON evidence used as immutable migration checkpoints.
- Comparator: evaluates drift between before and after snapshots.
- Reports: stores JSON or HTML output for audit and review.

## Data Flow

```
Operator
  -> python3 cli.py collect
  -> collectors over SSH
  -> before.json / after.json
  -> python3 cli.py compare
  -> diff.json with PASS/FAIL validation
```
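
The collection half of this flow can be sketched roughly as follows. This is a minimal sketch, not the framework's actual code: `build_snapshot`, `write_snapshot`, and `stub_mounts` are hypothetical names, and the `metadata`/`data` layout is assumed from the sample snapshot files in this commit. Real collectors run over SSH; the stub returns canned data so the example runs offline.

```python
import json
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List

# Hypothetical collector signature: takes a hostname, returns a JSON-serializable dict.
Collector = Callable[[str], Dict[str, Any]]


def build_snapshot(systems: List[str], collectors: Dict[str, Collector]) -> Dict[str, Any]:
    """Run every collector against every system and wrap the results with metadata."""
    data = {
        system: {name: collect(system) for name, collect in collectors.items()}
        for system in systems
    }
    return {
        "metadata": {
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "systems": list(systems),
            "version": "1.0",
        },
        "data": data,
    }


def write_snapshot(snapshot: Dict[str, Any], path: str) -> str:
    """Write the snapshot as indented JSON with a trailing newline."""
    with open(path, "w") as f:
        json.dump(snapshot, f, indent=2)
        f.write("\n")
    return path


def stub_mounts(system: str) -> Dict[str, Any]:
    """Stand-in for an SSH-backed collector (an assumption for this example)."""
    return {"mounts": [{"device": "/dev/sda1", "mountpoint": "/", "fstype": "ext4"}]}
```

Calling `write_snapshot(build_snapshot(["web01"], {"mounts": stub_mounts}), "before.json")` would produce a file shaped like the sample snapshots, which the compare step can then read without touching the target system again.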

## Validation Flow

```
before.json -> Comparator -> service checks
after.json  -> Comparator -> filesystem checks -> validation result
                          -> mount checks
```
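
As an illustration of how three check families could feed one validation result, here is a minimal sketch. The rules and the 90% threshold are assumptions made for this example, not the comparator's real logic, and `validate_system` is a hypothetical name. The input shape follows the per-system sections of the sample snapshots.

```python
from typing import Any, Dict, List

MAX_USE_PERCENT = 90  # assumed threshold, not taken from the framework


def validate_system(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Any]:
    """Apply three illustrative checks to one system's before/after data."""
    failures: List[str] = []

    # Service check: anything active before must still be active after.
    after_states = {s["name"]: s["active_state"] for s in after["services"]["services"]}
    for svc in before["services"]["services"]:
        if svc["active_state"] == "active" and after_states.get(svc["name"]) != "active":
            failures.append(f"service {svc['name']} is {after_states.get(svc['name'], 'missing')}")

    # Filesystem check: flag filesystems above the usage threshold.
    for fs in after["disk_usage"]["filesystem_usage"]:
        percent = int(str(fs["use_percent"]).rstrip("%"))
        if percent > MAX_USE_PERCENT:
            failures.append(f"filesystem {fs['mountpoint']} at {percent}%")

    # Mount check: every mountpoint present before must still be mounted.
    after_mounts = {m["mountpoint"] for m in after["mounts"]["mounts"]}
    for m in before["mounts"]["mounts"]:
        if m["mountpoint"] not in after_mounts:
            failures.append(f"mount {m['mountpoint']} missing")

    return {"passed": not failures, "failures": failures}
```

Under these assumed rules, the sample `web01` data would fail on two counts: `sshd` moves from `active` to `failed`, and `/var` rises to 94%.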

The framework keeps collection and comparison separate so migration evidence can be reviewed, archived, and replayed without recollecting from production systems.
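
Replaying a comparison from archived files might look like the sketch below. It assumes the snapshot layout of the sample files (`data -> <system> -> mounts -> usage`) and emits entries shaped like the `usage_changes` items in the sample diff. `load_archived` and `usage_changes` are hypothetical helper names, not part of the framework.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_archived(path: str) -> Dict[str, Any]:
    """Read a snapshot that was collected earlier; no SSH is involved."""
    return json.loads(Path(path).read_text())


def usage_changes(before: Dict[str, Any], after: Dict[str, Any], system: str) -> List[Dict[str, Any]]:
    """List mountpoints whose usage entry differs between two snapshots."""
    old = before["data"][system]["mounts"]["usage"]
    new = after["data"][system]["mounts"]["usage"]
    return [
        {"mountpoint": mp, "before": old[mp], "after": new[mp]}
        for mp in sorted(old.keys() & new.keys())
        if old[mp] != new[mp]
    ]
```

Because both inputs are plain files, the same pair can be compared again later, for example during an audit, and should yield the same result each time.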
|
||||||
@@ -0,0 +1,40 @@
|
|||||||
|
{
|
||||||
|
"metadata": {
|
||||||
|
"timestamp": "2026-04-29T03:40:00Z",
|
||||||
|
"systems": ["web01"],
|
||||||
|
"version": "1.0"
|
||||||
|
},
|
||||||
|
"data": {
|
||||||
|
"web01": {
|
||||||
|
"mounts": {
|
||||||
|
"mounts": [
|
||||||
|
{"device": "/dev/sda1", "mountpoint": "/", "fstype": "ext4", "options": "rw,relatime"},
|
||||||
|
{"device": "/dev/sdb1", "mountpoint": "/var", "fstype": "xfs", "options": "rw,noatime"}
|
||||||
|
],
|
||||||
|
"usage": {
|
||||||
|
"/": {"filesystem": "/dev/sda1", "use_percent": "62%"},
|
||||||
|
"/var": {"filesystem": "/dev/sdb1", "use_percent": "94%"}
|
||||||
|
},
|
||||||
|
"timestamp": "2026-04-29T03:40:00Z"
|
||||||
|
},
|
||||||
|
"services": {
|
||||||
|
"service_manager": "systemd",
|
||||||
|
"services": [
|
||||||
|
{"name": "sshd", "active_state": "failed", "sub_state": "failed"},
|
||||||
|
{"name": "nginx", "active_state": "active", "sub_state": "running"},
|
||||||
|
{"name": "node-exporter", "active_state": "active", "sub_state": "running"}
|
||||||
|
],
|
||||||
|
"timestamp": "2026-04-29T03:40:00Z"
|
||||||
|
},
|
||||||
|
"disk_usage": {
|
||||||
|
"filesystem_usage": [
|
||||||
|
{"filesystem": "/dev/sda1", "type": "ext4", "size": "80G", "used": "50G", "available": "30G", "use_percent": "62%", "mountpoint": "/"},
|
||||||
|
{"filesystem": "/dev/sdb1", "type": "xfs", "size": "200G", "used": "188G", "available": "12G", "use_percent": "94%", "mountpoint": "/var"}
|
||||||
|
],
|
||||||
|
"directory_sizes": [{"path": "/var/lib/app", "size": "139G"}],
|
||||||
|
"largest_files": [{"path": "/var/lib/app/import/archive.tar", "size": "42G"}],
|
||||||
|
"timestamp": "2026-04-29T03:40:00Z"
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
@@ -0,0 +1,39 @@
{
  "metadata": {
    "timestamp": "2026-04-29T01:15:00Z",
    "systems": ["web01"],
    "version": "1.0"
  },
  "data": {
    "web01": {
      "mounts": {
        "mounts": [
          {"device": "/dev/sda1", "mountpoint": "/", "fstype": "ext4", "options": "rw,relatime"},
          {"device": "/dev/sdb1", "mountpoint": "/var", "fstype": "xfs", "options": "rw,noatime"}
        ],
        "usage": {
          "/": {"filesystem": "/dev/sda1", "use_percent": "61%"},
          "/var": {"filesystem": "/dev/sdb1", "use_percent": "68%"}
        },
        "timestamp": "2026-04-29T01:15:00Z"
      },
      "services": {
        "service_manager": "systemd",
        "services": [
          {"name": "sshd", "active_state": "active", "sub_state": "running"},
          {"name": "nginx", "active_state": "active", "sub_state": "running"}
        ],
        "timestamp": "2026-04-29T01:15:00Z"
      },
      "disk_usage": {
        "filesystem_usage": [
          {"filesystem": "/dev/sda1", "type": "ext4", "size": "80G", "used": "49G", "available": "31G", "use_percent": "61%", "mountpoint": "/"},
          {"filesystem": "/dev/sdb1", "type": "xfs", "size": "200G", "used": "136G", "available": "64G", "use_percent": "68%", "mountpoint": "/var"}
        ],
        "directory_sizes": [{"path": "/var/lib/app", "size": "84G"}],
        "largest_files": [],
        "timestamp": "2026-04-29T01:15:00Z"
      }
    }
  }
}
@@ -0,0 +1,211 @@
{
  "summary": {
    "total_systems": 1,
    "systems_with_changes": 1,
    "total_changes": 7,
    "changes_by_type": {
      "mounts": 2,
      "services": 2,
      "disk_usage": 3
    },
    "most_affected_systems": [
      [
        "web01",
        7
      ]
    ]
  },
  "differences": {
    "mounts": {
      "web01": {
        "added_mounts": [],
        "removed_mounts": [],
        "changed_mounts": [],
        "usage_changes": [
          {
            "mountpoint": "/",
            "before": {
              "filesystem": "/dev/sda1",
              "use_percent": "61%"
            },
            "after": {
              "filesystem": "/dev/sda1",
              "use_percent": "62%"
            }
          },
          {
            "mountpoint": "/var",
            "before": {
              "filesystem": "/dev/sdb1",
              "use_percent": "68%"
            },
            "after": {
              "filesystem": "/dev/sdb1",
              "use_percent": "94%"
            }
          }
        ]
      }
    },
    "services": {
      "web01": {
        "added_services": [
          {
            "name": "node-exporter",
            "active_state": "active",
            "sub_state": "running"
          }
        ],
        "removed_services": [],
        "status_changes": [
          {
            "name": "sshd",
            "before": {
              "active_state": "active",
              "sub_state": "running"
            },
            "after": {
              "active_state": "failed",
              "sub_state": "failed"
            }
          }
        ],
        "configuration_changes": []
      }
    },
    "disk_usage": {
      "web01": {
        "filesystem_changes": [
          {
            "mountpoint": "/",
            "before": {
              "filesystem": "/dev/sda1",
              "type": "ext4",
              "size": "80G",
              "used": "49G",
              "available": "31G",
              "use_percent": "61%",
              "mountpoint": "/"
            },
            "after": {
              "filesystem": "/dev/sda1",
              "type": "ext4",
              "size": "80G",
              "used": "50G",
              "available": "30G",
              "use_percent": "62%",
              "mountpoint": "/"
            }
          },
          {
            "mountpoint": "/var",
            "before": {
              "filesystem": "/dev/sdb1",
              "type": "xfs",
              "size": "200G",
              "used": "136G",
              "available": "64G",
              "use_percent": "68%",
              "mountpoint": "/var"
            },
            "after": {
              "filesystem": "/dev/sdb1",
              "type": "xfs",
              "size": "200G",
              "used": "188G",
              "available": "12G",
              "use_percent": "94%",
              "mountpoint": "/var"
            }
          }
        ],
        "directory_size_changes": [],
        "significant_usage_changes": [
          {
            "mountpoint": "/var",
            "change_percent": 26,
            "before": {
              "filesystem": "/dev/sdb1",
              "type": "xfs",
              "size": "200G",
              "used": "136G",
              "available": "64G",
              "use_percent": "68%",
              "mountpoint": "/var"
            },
            "after": {
              "filesystem": "/dev/sdb1",
              "type": "xfs",
              "size": "200G",
              "used": "188G",
              "available": "12G",
              "use_percent": "94%",
              "mountpoint": "/var"
            }
          }
        ]
      }
    }
  },
  "risk_assessment": {
    "overall_risk": "high",
    "risk_factors": [
      {
        "type": "service_failure",
        "description": "Service failed: sshd",
        "level": 3
      },
      {
        "type": "disk_usage_spike",
        "description": "Significant disk usage change: /var (26%)",
        "level": 2
      }
    ],
    "critical_changes": [],
    "recommendations": [
      "Immediate review required - critical changes detected",
      "Consider rolling back migration if critical services are affected"
    ]
  },
  "validation_results": {
    "passed": false,
    "checks": [
      {
        "name": "critical_services_running",
        "description": "Verify critical services remain operational",
        "passed": false,
        "details": [
          "Critical service sshd failed on web01"
        ]
      },
      {
        "name": "filesystem_integrity",
        "description": "Verify filesystem integrity maintained",
        "passed": true,
        "details": []
      },
      {
        "name": "no_critical_mounts_removed",
        "description": "Verify critical mount points remain",
        "passed": true,
        "details": []
      }
    ],
    "failed_checks": [
      {
        "name": "critical_services_running",
        "description": "Verify critical services remain operational",
        "passed": false,
        "details": [
          "Critical service sshd failed on web01"
        ]
      }
    ],
    "result": "FAIL"
  },
  "metadata": {
    "before": "migration-validation-framework/examples/before.json",
    "after": "migration-validation-framework/examples/after.json",
    "timestamp": "2026-04-29T23:29:07.510774"
  }
}
+19
@@ -0,0 +1,19 @@
# Scenario: Before/After Migration Comparison

## Description

Compare a pre-cutover host snapshot against a post-cutover snapshot and determine whether the migrated system is ready for production traffic.

## Commands

```bash
cd professional-infra/migration-validation-framework
python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
```

## Expected Result

- The command writes a JSON diff.
- The result is `FAIL` because `sshd` is in the `failed` state after migration.
- The risk assessment highlights the `/var` disk usage increase.
- The remediation path is to restore SSH and either free space on or expand `/var` before approving cutover.
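
The expected result above can be checked programmatically. The following is a minimal sketch, assuming the file written by `--output` has the same structure as the example comparison report (`validation_results.result`, `validation_results.failed_checks`, `risk_assessment.risk_factors`). The helper name `summarize_diff` is hypothetical and is not part of the framework's CLI.

```python
import json
from pathlib import Path


def summarize_diff(diff: dict) -> dict:
    """Reduce a comparison report to the fields a cutover decision needs."""
    validation = diff.get("validation_results", {})
    risk = diff.get("risk_assessment", {})
    return {
        "result": validation.get("result", "UNKNOWN"),
        "failed_checks": [c["name"] for c in validation.get("failed_checks", [])],
        "overall_risk": risk.get("overall_risk", "unknown"),
        "risk_factors": [f["description"] for f in risk.get("risk_factors", [])],
    }


def load_and_summarize(path: str) -> dict:
    """Hypothetical convenience wrapper: read a report file and summarize it."""
    return summarize_diff(json.loads(Path(path).read_text()))


# Trimmed copy of the fields from the example report (assumed structure).
sample = {
    "risk_assessment": {
        "overall_risk": "high",
        "risk_factors": [
            {"type": "service_failure", "description": "Service failed: sshd", "level": 3},
            {"type": "disk_usage_spike", "description": "Significant disk usage change: /var (26%)", "level": 2},
        ],
    },
    "validation_results": {
        "result": "FAIL",
        "failed_checks": [{"name": "critical_services_running"}],
    },
}

summary = summarize_diff(sample)
print(summary["result"], summary["overall_risk"])  # prints: FAIL high
```

Against a real run, `load_and_summarize("/tmp/migration-diff.json")` would return the same four fields, which is enough to block cutover whenever `result` is not `PASS`.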
@@ -0,0 +1,67 @@
import json
import unittest
from pathlib import Path

from collectors.mounts import MountsCollector
from reports.html_report import HTMLReportGenerator
from validators.compare import compare_snapshots


PROJECT_ROOT = Path(__file__).resolve().parents[1]


class ComparatorExampleTests(unittest.TestCase):
    def test_example_comparison_detects_expected_failure(self):
        before = json.loads((PROJECT_ROOT / "examples" / "before.json").read_text())
        after = json.loads((PROJECT_ROOT / "examples" / "after.json").read_text())

        comparison = compare_snapshots(before, after)

        self.assertFalse(comparison["validation_results"]["passed"])
        self.assertEqual(comparison["validation_results"]["result"], "FAIL")
        self.assertGreater(comparison["summary"]["total_changes"], 0)


class HtmlReportTests(unittest.TestCase):
    def test_report_escapes_untrusted_snapshot_content(self):
        report = HTMLReportGenerator().build_html_content({
            "metadata": {"comparison_id": "<script>alert(1)</script>"},
            "summary": {
                "total_systems": 1,
                "systems_with_changes": 1,
                "total_changes": 1,
                "changes_by_type": {"services": 1},
                "most_affected_systems": [("<img src=x onerror=alert(1)>", 1)],
            },
            "differences": {},
            "risk_assessment": {"overall_risk": "low", "risk_factors": [], "critical_changes": [], "recommendations": ["Review <b>change</b>"]},
            "validation_results": {"passed": True, "checks": []},
        })

        self.assertNotIn("<script>alert(1)</script>", report)
        self.assertNotIn("<img src=x onerror=alert(1)>", report)
        self.assertIn("&lt;script&gt;alert(1)&lt;/script&gt;", report)
        self.assertIn("&lt;b&gt;change&lt;/b&gt;", report)


class CollectorParserTests(unittest.TestCase):
    def test_mount_parser_handles_standard_mount_output(self):
        output = "/dev/sda1 on / type ext4 (rw,relatime)\nproc on /proc type proc (rw,nosuid,nodev,noexec,relatime)\n"
        mounts = MountsCollector().parse_mount_output(output)

        self.assertEqual(mounts[0]["device"], "/dev/sda1")
        self.assertEqual(mounts[0]["mountpoint"], "/")
        self.assertEqual(mounts[0]["fstype"], "ext4")

    def test_df_parser_handles_gigabyte_output(self):
        output = "Filesystem 1G-blocks Used Available Use% Mounted\n/dev/sda1 100G 45G 55G 45% /\n"
        stats = MountsCollector().parse_df_output(output)

        self.assertEqual(stats["1g-blocks"], 100.0)
        self.assertEqual(stats["used"], 45.0)
        self.assertEqual(stats["available"], 55.0)
        self.assertEqual(stats["use_percent"], 45)


if __name__ == "__main__":
    unittest.main()
@@ -0,0 +1,501 @@
"""
Snapshot Comparison Engine

Compares two system snapshots and identifies differences,
risk levels, and validation results.
"""

import json
import logging
from typing import Dict, Any, List, Tuple
from datetime import datetime

logger = logging.getLogger(__name__)

class SnapshotComparator:
    """Engine for comparing system snapshots."""

    def __init__(self):
        self.risk_levels = {
            "low": 1,
            "medium": 2,
            "high": 3,
            "critical": 4
        }

    def compare_snapshots(self, snapshot1: Dict[str, Any], snapshot2: Dict[str, Any]) -> Dict[str, Any]:
        """Compare two snapshots and return detailed comparison results."""
        logger.info("Starting snapshot comparison")

        comparison = {
            "summary": {},
            "differences": {},
            "risk_assessment": {},
            "validation_results": {}
        }

        # Compare each data type
        data_types = ["mounts", "services", "disk_usage"]

        data1 = snapshot1.get("data", {})
        data2 = snapshot2.get("data", {})

        for data_type in data_types:
            if self.data_type_exists(data1, data_type) or self.data_type_exists(data2, data_type):
                differences = self.compare_data_type(data1, data2, data_type)
                comparison["differences"][data_type] = differences

        # Generate summary
        comparison["summary"] = self.generate_summary(comparison["differences"])

        # Risk assessment
        comparison["risk_assessment"] = self.assess_risks(comparison["differences"])

        # Validation results
        comparison["validation_results"] = self.validate_changes(comparison["differences"])
        comparison["validation_results"]["result"] = "PASS" if comparison["validation_results"]["passed"] else "FAIL"

        logger.info("Snapshot comparison completed")
        return comparison

    def data_type_exists(self, systems: Dict[str, Any], data_type: str) -> bool:
        """Return true when at least one system has the requested collector data."""
        return any(data_type in system_data for system_data in systems.values())

    def compare_data_type(self, data1: Dict[str, Any], data2: Dict[str, Any], data_type: str) -> Dict[str, Any]:
        """Compare a specific data type between two snapshots."""
        differences = {}

        # Get all systems from both snapshots
        systems1 = set(data1.keys())
        systems2 = set(data2.keys())
        all_systems = systems1.union(systems2)

        for system in all_systems:
            system_diffs = {}

            if system not in data1:
                system_diffs["status"] = "added"
                system_diffs["details"] = {"new_system": True}
            elif system not in data2:
                system_diffs["status"] = "removed"
                system_diffs["details"] = {"removed_system": True}
            else:
                # Compare data for this system and data type
                if data_type in data1[system] and data_type in data2[system]:
                    system_diffs = self.compare_system_data(
                        data1[system][data_type],
                        data2[system][data_type],
                        data_type
                    )
                else:
                    system_diffs["status"] = "data_missing"
                    system_diffs["details"] = {"missing_data_type": data_type}

            if system_diffs:
                differences[system] = system_diffs

        return differences

    def compare_system_data(self, data1: Dict[str, Any], data2: Dict[str, Any], data_type: str) -> Dict[str, Any]:
        """Compare data for a specific system and data type."""
        differences = {}

        if data_type == "mounts":
            differences = self.compare_mounts(data1, data2)
        elif data_type == "services":
            differences = self.compare_services(data1, data2)
        elif data_type == "disk_usage":
            differences = self.compare_disk_usage(data1, data2)
        else:
            differences["status"] = "unknown_data_type"

        return differences

    def compare_mounts(self, mounts1: Dict[str, Any], mounts2: Dict[str, Any]) -> Dict[str, Any]:
        """Compare mounts data between snapshots."""
        differences = {
            "added_mounts": [],
            "removed_mounts": [],
            "changed_mounts": [],
            "usage_changes": []
        }

        # Compare mount lists
        mounts_list1 = mounts1.get("mounts", [])
        mounts_list2 = mounts2.get("mounts", [])

        # Create mountpoint maps
        mounts_map1 = {m["mountpoint"]: m for m in mounts_list1}
        mounts_map2 = {m["mountpoint"]: m for m in mounts_list2}

        # Find added and removed mounts
        added = set(mounts_map2.keys()) - set(mounts_map1.keys())
        removed = set(mounts_map1.keys()) - set(mounts_map2.keys())

        differences["added_mounts"] = [{"mountpoint": mp, **mounts_map2[mp]} for mp in added]
        differences["removed_mounts"] = [{"mountpoint": mp, **mounts_map1[mp]} for mp in removed]

        # Find changed mounts
        common = set(mounts_map1.keys()) & set(mounts_map2.keys())
        for mp in common:
            m1, m2 = mounts_map1[mp], mounts_map2[mp]
            if m1 != m2:
                differences["changed_mounts"].append({
                    "mountpoint": mp,
                    "before": m1,
                    "after": m2
                })

        # Compare usage statistics
        usage1 = mounts1.get("usage", {})
        usage2 = mounts2.get("usage", {})

        for mp in set(usage1.keys()) | set(usage2.keys()):
            if mp in usage1 and mp in usage2:
                u1, u2 = usage1[mp], usage2[mp]
                if u1 != u2:
                    differences["usage_changes"].append({
                        "mountpoint": mp,
                        "before": u1,
                        "after": u2
                    })

        return differences

    def compare_services(self, services1: Dict[str, Any], services2: Dict[str, Any]) -> Dict[str, Any]:
        """Compare services data between snapshots."""
        differences = {
            "added_services": [],
            "removed_services": [],
            "status_changes": [],
            "configuration_changes": []
        }

        # Compare service lists
        services_list1 = services1.get("services", [])
        services_list2 = services2.get("services", [])

        # Create service maps
        services_map1 = {s["name"]: s for s in services_list1}
        services_map2 = {s["name"]: s for s in services_list2}

        # Find added and removed services
        added = set(services_map2.keys()) - set(services_map1.keys())
        removed = set(services_map1.keys()) - set(services_map2.keys())

        differences["added_services"] = [{"name": name, **services_map2[name]} for name in added]
        differences["removed_services"] = [{"name": name, **services_map1[name]} for name in removed]

        # Find status changes
        common = set(services_map1.keys()) & set(services_map2.keys())
        for name in common:
            s1, s2 = services_map1[name], services_map2[name]
            if s1.get("active_state") != s2.get("active_state") or s1.get("sub_state") != s2.get("sub_state"):
                differences["status_changes"].append({
                    "name": name,
                    "before": {"active_state": s1.get("active_state"), "sub_state": s1.get("sub_state")},
                    "after": {"active_state": s2.get("active_state"), "sub_state": s2.get("sub_state")}
                })

        return differences

    def compare_disk_usage(self, usage1: Dict[str, Any], usage2: Dict[str, Any]) -> Dict[str, Any]:
        """Compare disk usage data between snapshots."""
        differences = {
            "filesystem_changes": [],
            "directory_size_changes": [],
            "significant_usage_changes": []
        }

        # Compare filesystem usage
        fs1 = usage1.get("filesystem_usage", [])
        fs2 = usage2.get("filesystem_usage", [])

        # Create filesystem maps by mountpoint
        fs_map1 = {fs["mountpoint"]: fs for fs in fs1}
        fs_map2 = {fs["mountpoint"]: fs for fs in fs2}

        common_fs = set(fs_map1.keys()) & set(fs_map2.keys())
        for mp in common_fs:
            f1, f2 = fs_map1[mp], fs_map2[mp]
            if f1 != f2:
                differences["filesystem_changes"].append({
                    "mountpoint": mp,
                    "before": f1,
                    "after": f2
                })

            # Check for significant usage changes
            try:
                use1 = int(str(f1.get("use_percent", "0")).rstrip("%"))
                use2 = int(str(f2.get("use_percent", "0")).rstrip("%"))
                if abs(use2 - use1) > 10:  # 10% change threshold
                    differences["significant_usage_changes"].append({
                        "mountpoint": mp,
                        "change_percent": use2 - use1,
                        "before": f1,
                        "after": f2
                    })
            except (ValueError, KeyError):
                pass

        return differences

    def generate_summary(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Generate a summary of all differences."""
        summary = {
            "total_systems": 0,
            "systems_with_changes": 0,
            "total_changes": 0,
            "changes_by_type": {},
            "most_affected_systems": []
        }

        system_change_counts = {}

        for data_type, systems in differences.items():
            summary["changes_by_type"][data_type] = 0

            for system, system_diffs in systems.items():
                if system not in system_change_counts:
                    system_change_counts[system] = 0

                # Count changes for this system and data type
                change_count = self.count_changes(system_diffs)
                system_change_counts[system] += change_count
                summary["changes_by_type"][data_type] += change_count
                summary["total_changes"] += change_count

        summary["total_systems"] = len(system_change_counts)

        # Count systems with changes
        summary["systems_with_changes"] = len([s for s in system_change_counts.values() if s > 0])

        # Find most affected systems
        sorted_systems = sorted(system_change_counts.items(), key=lambda x: x[1], reverse=True)
        summary["most_affected_systems"] = sorted_systems[:5]

        return summary

    def count_changes(self, system_diffs: Dict[str, Any]) -> int:
        """Count the number of changes in system differences."""
        count = 0

        for key, value in system_diffs.items():
            if isinstance(value, list):
                count += len(value)
            elif isinstance(value, dict) and key not in ["status"]:
                # Count nested changes
                count += sum(1 for v in value.values() if isinstance(v, list) and v)

        return count

    def assess_risks(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Assess risk levels for the changes."""
        risk_assessment = {
            "overall_risk": "low",
            "risk_factors": [],
            "critical_changes": [],
            "recommendations": []
        }

        max_risk_level = 1

        # Analyze each type of change
        for data_type, systems in differences.items():
            for system, system_diffs in systems.items():
                risk_factors = self.analyze_system_risks(system_diffs, data_type)
                risk_assessment["risk_factors"].extend(risk_factors)

                for factor in risk_factors:
                    if factor["level"] > max_risk_level:
                        max_risk_level = factor["level"]

                    if factor["level"] >= 4:  # Critical
                        risk_assessment["critical_changes"].append({
                            "system": system,
                            "data_type": data_type,
                            "factor": factor
                        })

        # Set overall risk
        risk_levels = {1: "low", 2: "medium", 3: "high", 4: "critical"}
        risk_assessment["overall_risk"] = risk_levels.get(max_risk_level, "unknown")

        # Generate recommendations
        risk_assessment["recommendations"] = self.generate_recommendations(risk_assessment)

        return risk_assessment

    def analyze_system_risks(self, system_diffs: Dict[str, Any], data_type: str) -> List[Dict[str, Any]]:
        """Analyze risks for a specific system's changes."""
        risk_factors = []

        if data_type == "mounts":
            # Check for removed critical mounts
            for mount in system_diffs.get("removed_mounts", []):
                if mount["mountpoint"] in ["/", "/boot", "/usr", "/var"]:
                    risk_factors.append({
                        "type": "critical_mount_removed",
                        "description": f"Critical mount point removed: {mount['mountpoint']}",
                        "level": 4
                    })

            # Check for significant usage changes
            for change in system_diffs.get("usage_changes", []):
                try:
                    before_pct = int(str(change["before"].get("use_percent", "0")).rstrip("%"))
                    after_pct = int(str(change["after"].get("use_percent", "0")).rstrip("%"))
                    if after_pct > 95:
                        risk_factors.append({
                            "type": "filesystem_full",
                            "description": f"Filesystem usage critical: {change['mountpoint']} at {after_pct}%",
                            "level": 3
                        })
                except (ValueError, KeyError):
                    pass

        elif data_type == "services":
            # Check for critical service changes
            critical_services = ["sshd", "systemd", "networking", "dbus"]
            for service in system_diffs.get("removed_services", []):
                if service["name"] in critical_services:
                    risk_factors.append({
                        "type": "critical_service_removed",
                        "description": f"Critical service removed: {service['name']}",
                        "level": 4
                    })

            for change in system_diffs.get("status_changes", []):
                if change["after"]["active_state"] == "failed":
                    risk_factors.append({
                        "type": "service_failure",
                        "description": f"Service failed: {change['name']}",
                        "level": 3
                    })

        elif data_type == "disk_usage":
            for change in system_diffs.get("significant_usage_changes", []):
                if change["change_percent"] > 20:
                    risk_factors.append({
                        "type": "disk_usage_spike",
                        "description": f"Significant disk usage change: {change['mountpoint']} ({change['change_percent']}%)",
                        "level": 2
                    })

        return risk_factors

    def generate_recommendations(self, risk_assessment: Dict[str, Any]) -> List[str]:
        """Generate recommendations based on risk assessment."""
        recommendations = []

        if risk_assessment["overall_risk"] in ["high", "critical"]:
            recommendations.append("Immediate review required - critical changes detected")
            recommendations.append("Consider rolling back migration if critical services are affected")

        if any(f["type"] == "critical_mount_removed" for f in risk_assessment["risk_factors"]):
            recommendations.append("Verify system boot capability after mount changes")

        if any(f["type"] == "critical_service_removed" for f in risk_assessment["risk_factors"]):
            recommendations.append("Ensure critical services are restored before production cutover")

        if any(f["type"] == "filesystem_full" for f in risk_assessment["risk_factors"]):
            recommendations.append("Monitor disk space closely - cleanup may be required")

        if not recommendations:
            recommendations.append("Changes appear safe - proceed with standard validation procedures")

        return recommendations

    def validate_changes(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Validate that changes meet requirements."""
        validation_results = {
            "passed": True,
            "checks": [],
            "failed_checks": []
        }

        # Define validation checks
        checks = [
            self.check_critical_services_running,
            self.check_filesystem_integrity,
            self.check_no_critical_mounts_removed
        ]

        for check_func in checks:
            check_result = check_func(differences)
            validation_results["checks"].append(check_result)

            if not check_result["passed"]:
                validation_results["passed"] = False
                validation_results["failed_checks"].append(check_result)

        return validation_results

    def check_critical_services_running(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Check that critical services are still running."""
        check = {
            "name": "critical_services_running",
            "description": "Verify critical services remain operational",
            "passed": True,
            "details": []
        }

        critical_services = ["sshd", "systemd"]

        for data_type, systems in differences.items():
            if data_type == "services":
                for system, system_diffs in systems.items():
                    for change in system_diffs.get("status_changes", []):
                        if change["name"] in critical_services:
                            if change["after"]["active_state"] == "failed":
                                check["passed"] = False
                                check["details"].append(f"Critical service {change['name']} failed on {system}")

        return check

    def check_filesystem_integrity(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Check filesystem integrity after changes."""
        check = {
            "name": "filesystem_integrity",
            "description": "Verify filesystem integrity maintained",
            "passed": True,
            "details": []
        }

        for data_type, systems in differences.items():
            if data_type == "disk_usage":
                for system, system_diffs in systems.items():
                    for change in system_diffs.get("significant_usage_changes", []):
                        if change["change_percent"] > 50:  # Arbitrary threshold
                            check["passed"] = False
                            check["details"].append(f"Extreme usage change on {system}:{change['mountpoint']}")

        return check

    def check_no_critical_mounts_removed(self, differences: Dict[str, Any]) -> Dict[str, Any]:
        """Check that no critical mount points were removed."""
        check = {
            "name": "no_critical_mounts_removed",
            "description": "Verify critical mount points remain",
            "passed": True,
            "details": []
        }

        critical_mounts = ["/", "/boot", "/usr", "/var"]

        for data_type, systems in differences.items():
            if data_type == "mounts":
                for system, system_diffs in systems.items():
                    for mount in system_diffs.get("removed_mounts", []):
                        if mount["mountpoint"] in critical_mounts:
                            check["passed"] = False
                            check["details"].append(f"Critical mount {mount['mountpoint']} removed from {system}")

        return check

def compare_snapshots(snapshot1: Dict[str, Any], snapshot2: Dict[str, Any]) -> Dict[str, Any]:
    """Main comparison function."""
    comparator = SnapshotComparator()
    return comparator.compare_snapshots(snapshot1, snapshot2)
@@ -0,0 +1,8 @@
---
skip_list:
  - role-name
  - name[casing]
  - line-too-long

exclude_paths:
  - .git
@@ -0,0 +1,19 @@
.PHONY: help test lint syntax validate-assets

help:
	@echo "Zabbix Monitoring + Incident Response"
	@echo "  make test             Run syntax, lint, and asset validation"
	@echo "  make syntax           Run Ansible syntax checks"
	@echo "  make lint             Run ansible-lint"
	@echo "  make validate-assets  Validate template and sample JSON assets"

test: syntax lint validate-assets

syntax:
	ansible-playbook --syntax-check playbooks/*.yml

lint:
	ansible-lint

validate-assets:
	python3 scripts/validate_assets.py
@@ -0,0 +1,63 @@
|
|||||||
|
# Zabbix Monitoring + Incident Response
|
||||||
|
|
||||||
|
## Problem
|
||||||
|
|
||||||
|
Large Linux/Unix environments need simple, reliable OS checks before more advanced observability becomes useful. Filesystems, CPU, memory, network, process status, proxy backlog, and agent availability must be monitored consistently across Linux and AIX estates.
|
||||||
|
|
||||||
|
## CV Relevance
|
||||||
|
|
||||||
|
This project maps to Zabbix monitoring platform work, proxy maintenance, custom checks, alert noise reduction, and incident response in enterprise environments. It shows operational design and automation without pretending to run AIX locally.
|
||||||
|
|
||||||
|
## What This Project Demonstrates
|
||||||
|
|
||||||
|
- Ansible-first Zabbix server, proxy, and agent/agent2 configuration structure.
|
||||||
|
- Proxy topology for active and passive checks.
|
||||||
|
- Linux and AIX OS monitoring templates as reviewable JSON assets.
|
||||||
|
- Sample Linux/AIX check data for filesystem, CPU, memory, network, and process monitoring.
|
||||||
|
- Runbooks for Zabbix maintenance and incident response.
|
||||||
|
|
||||||
|
## Architecture
|
||||||
|
|
||||||
|
```text
|
||||||
|
Linux/AIX hosts -> Zabbix agent/agent2 -> Zabbix proxy -> Zabbix server/web
|
||||||
|
| |
|
||||||
|
v v
|
||||||
|
OS simple checks proxy queue/cache
|
||||||
|
|
||||||
|
Incident -> Alert -> Operator triage -> Maintenance or remediation evidence
|
||||||
|
```
|
||||||
|
|
||||||
|
## Quickstart
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cd professional-infra/zabbix-monitoring-incident-response
|
||||||
|
make test
|
||||||
|
```
|
||||||
|
|
||||||
|
`make test` performs Ansible syntax/lint checks and validates the Zabbix template/sample JSON assets.
|
||||||
|
|
||||||
|
## Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-playbook --syntax-check playbooks/*.yml
|
||||||
|
ansible-lint
|
||||||
|
python3 scripts/validate_assets.py
|
||||||
|
```
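The contents of `scripts/validate_assets.py` are not shown here, so the following is only a minimal sketch of what a JSON asset check could look like. The function name `find_invalid_json` and the idea of scanning a directory tree are assumptions for illustration, not the project's actual implementation.

```python
import json
from pathlib import Path
from typing import List


def find_invalid_json(root: str) -> List[str]:
    """Return one message per *.json file under root that fails to parse.

    Hypothetical helper: it only confirms that each file is well-formed JSON,
    it does not check Zabbix template semantics.
    """
    failures = []
    for path in sorted(Path(root).rglob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (OSError, ValueError) as exc:
            # json.JSONDecodeError and UnicodeDecodeError both subclass ValueError.
            failures.append(f"{path}: {exc}")
    return failures


if __name__ == "__main__":
    # "samples" matches the directory named in this README; adjust as needed.
    problems = find_invalid_json("samples")
    for message in problems:
        print(message)
    raise SystemExit(1 if problems else 0)
```

A non-zero exit code lets a `make` target fail the build when any asset is malformed.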
|
||||||
|
|
||||||
|
## Example Output
|
||||||
|
|
||||||
|
Sample check payloads are available in `samples/linux-os-checks.json` and `samples/aix-os-checks.json`. These show what a reviewable `zabbix_sender` or API-driven evidence artifact could look like for Linux and AIX hosts.
|
||||||
|
|
||||||
|
## Interview Talking Points
|
||||||
|
|
||||||
|
- Why Zabbix is suitable for simple OS checks while ELK/Grafana is better for log analysis.
|
||||||
|
- How proxies reduce WAN dependency and support branch/client environments.
|
||||||
|
- Difference between active and passive checks.
|
||||||
|
- How to troubleshoot unsupported items, missing data, proxy backlog, and agent reachability.
|
||||||
|
- How Linux and AIX monitoring differ, without inventing a local AIX runtime.
|
||||||
|
|
||||||
|
## Roadmap
|
||||||
|
|
||||||
|
- Add API import helpers for templates.
|
||||||
|
- Add a Docker-based Zabbix server/proxy demo scaffold.
|
||||||
|
- Add Wazuh or security monitoring integration as a separate side lab.
|
||||||
@@ -0,0 +1,5 @@
|
|||||||
|
[defaults]
|
||||||
|
roles_path = ./roles
|
||||||
|
inventory = ./inventory/hosts.ini
|
||||||
|
host_key_checking = False
|
||||||
|
retry_files_enabled = False
|
||||||
+30
@@ -0,0 +1,30 @@
|
|||||||
|
# Incident Response Runbook
|
||||||
|
|
||||||
|
## Filesystem Alert
|
||||||
|
|
||||||
|
1. Confirm current usage and growth trend.
|
||||||
|
2. Check whether the host is Linux or AIX and use the correct runbook.
|
||||||
|
3. Validate application ownership of the filesystem.
|
||||||
|
4. Clean known temporary paths or request LVM expansion when approved.
|
||||||
|
5. Attach before/after evidence to the incident ticket.
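Steps 1 and 5 can be sketched as a small evidence capture, shown here for a Linux host. The function names and record layout are hypothetical; the 85 percent default mirrors the threshold in the sample alert log and should be treated as an assumption rather than a recommended value. Note that `shutil.disk_usage` can differ slightly from `df` because of reserved blocks.

```python
import shutil


def usage_evidence(path: str = "/", threshold_percent: float = 85.0) -> dict:
    """Capture current usage for one mount point as a small evidence record."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo filesystems that report a total size of zero.
    used_percent = round(usage.used / usage.total * 100, 1) if usage.total else 0.0
    return {
        "path": path,
        "used_percent": used_percent,
        "threshold_percent": threshold_percent,
        "breached": used_percent >= threshold_percent,
    }


def compare_evidence(before: dict, after: dict) -> dict:
    """Summarise before/after records for attachment to the incident ticket."""
    return {
        "path": before["path"],
        "before_percent": before["used_percent"],
        "after_percent": after["used_percent"],
        "reclaimed_percent": round(before["used_percent"] - after["used_percent"], 1),
        "resolved": before["breached"] and not after["breached"],
    }


if __name__ == "__main__":
    before = usage_evidence("/")
    # ... approved cleanup or LVM expansion happens here ...
    after = usage_evidence("/")
    print(compare_evidence(before, after))
```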
|
||||||
|
|
||||||
|
## Agent Unreachable
|
||||||
|
|
||||||
|
1. Confirm whether data loss affects one host, one proxy, or one network segment.
|
||||||
|
2. Check proxy queue and last seen timestamp.
|
||||||
|
3. Validate agent service state and firewall path.
|
||||||
|
4. For active checks, confirm `ServerActive` and hostname match.
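Step 4 can be checked with a short script. This is a minimal sketch, assuming the agent configuration uses plain `Key=Value` lines; the helper names are hypothetical. It compares `ServerActive` entries by exact match only, ignoring `host:port` forms, and it treats a missing `Hostname` as a problem even though the agent can fall back to `HostnameItem`.

```python
from typing import Dict, List


def parse_agent_conf(text: str) -> Dict[str, str]:
    """Parse Key=Value lines, skipping blanks and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_active_settings(conf_text: str, expected_proxy: str,
                          expected_hostname: str) -> List[str]:
    """Return a list of problems; an empty list means both settings match."""
    settings = parse_agent_conf(conf_text)
    problems = []
    # ServerActive may hold several comma-separated destinations.
    destinations = [d.strip() for d in settings.get("ServerActive", "").split(",")
                    if d.strip()]
    if expected_proxy not in destinations:
        problems.append(f"ServerActive does not list {expected_proxy}: {destinations}")
    if settings.get("Hostname") != expected_hostname:
        problems.append(
            f"Hostname is {settings.get('Hostname')!r}, expected {expected_hostname!r}"
        )
    return problems


if __name__ == "__main__":
    sample = "# sample config\nServerActive=zbx-proxy-bank01\nHostname=linux-app01\n"
    print(check_active_settings(sample, "zbx-proxy-bank01", "linux-app01"))
```

The expected hostname should be the host name as configured in the Zabbix frontend, since active checks are matched on that value.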
|
||||||
|
|
||||||
|
## Proxy Backlog
|
||||||
|
|
||||||
|
1. Check server reachability from the proxy.
|
||||||
|
2. Check proxy DB filesystem usage.
|
||||||
|
3. Confirm whether config sync recently changed.
|
||||||
|
4. Reduce proxy load by temporarily disabling non-critical discovery rules if required.
|
||||||
|
|
||||||
|
## Unsupported Items
|
||||||
|
|
||||||
|
1. Identify the affected template and item key.
|
||||||
|
2. Check whether the item is Linux-specific or AIX-specific.
|
||||||
|
3. Validate agent version and custom user parameters.
|
||||||
|
4. Roll back the template change if the canary host group is affected.
|
||||||
@@ -0,0 +1,29 @@
|
|||||||
|
# Zabbix Maintenance Runbook
|
||||||
|
|
||||||
|
## Server Checks
|
||||||
|
|
||||||
|
- Confirm Zabbix server process and web frontend availability.
|
||||||
|
- Check database health, free space, and slow queries.
|
||||||
|
- Review cache usage, poller utilization, and housekeeper activity.
|
||||||
|
- Confirm recent values are arriving for representative Linux and AIX hosts.
|
||||||
|
|
||||||
|
## Proxy Checks
|
||||||
|
|
||||||
|
- Confirm proxy last seen timestamp.
|
||||||
|
- Check proxy queue and delayed values.
|
||||||
|
- Validate proxy database size and filesystem usage.
|
||||||
|
- Confirm active/passive connectivity based on proxy mode.
|
||||||
|
|
||||||
|
## Template Maintenance
|
||||||
|
|
||||||
|
- Import templates in a controlled window.
|
||||||
|
- Watch unsupported items after import.
|
||||||
|
- Validate a small canary host group before wider rollout.
|
||||||
|
- Document changed triggers and thresholds.
|
||||||
|
|
||||||
|
## Common Failure Modes
|
||||||
|
|
||||||
|
- Agent unreachable: check DNS, firewall, agent service, proxy route.
|
||||||
|
- Unsupported item: check key spelling, OS capability, agent version, user parameter.
|
||||||
|
- Proxy backlog: check WAN, DB size, proxy process, server availability.
|
||||||
|
- Alert noise: review trigger thresholds and dependency design.
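For the "agent unreachable" case, the firewall and service checks can be approximated with a plain TCP connect test. This is a minimal sketch with a hypothetical function name. It only applies to passive checks, where the server or proxy connects to the agent listen port (10050 by default); active agents connect outbound instead, so this test says nothing about them.

```python
import socket


def agent_port_reachable(host: str, port: int = 10050, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    A successful connect shows that DNS, routing, firewall, and the listener
    are in place; it does not prove the agent accepts this source address.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Host name taken from the sample inventory; run this from the proxy.
    print(agent_port_reachable("linux-app01"))
```

Running the test from the proxy rather than from a workstation keeps the result representative of the path Zabbix actually uses.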
|
||||||
@@ -0,0 +1,27 @@
|
|||||||
|
# Zabbix Proxy Design
|
||||||
|
|
||||||
|
## Purpose
|
||||||
|
|
||||||
|
Zabbix proxies reduce dependency on direct connectivity between the central server and monitored hosts. They are useful for client networks, segmented environments, and remote sites, and they buffer collected data locally while the server is unreachable, for example during maintenance windows.
|
||||||
|
|
||||||
|
## Active Proxy
|
||||||
|
|
||||||
|
- Proxy connects to the Zabbix server.
|
||||||
|
- Good for restricted networks where inbound access to the proxy is not allowed.
|
||||||
|
- Hosts can use active agent checks against the proxy.
|
||||||
|
- Main operational checks: proxy last seen, delayed values, local DB size, config sync.
|
||||||
|
|
||||||
|
## Passive Proxy
|
||||||
|
|
||||||
|
- Zabbix server connects to the proxy.
|
||||||
|
- Useful when the central server can reach the proxy network.
|
||||||
|
- Requires firewall rules from server to proxy.
|
||||||
|
- Main operational checks: proxy listener, network latency, poller load.
|
||||||
|
|
||||||
|
## Operational Signals
|
||||||
|
|
||||||
|
- Proxy queue growth.
|
||||||
|
- Unsupported items after template changes.
|
||||||
|
- Agent unreachable or active checks delayed.
|
||||||
|
- Proxy DB growth during WAN outage.
|
||||||
|
- Config sync failures after maintenance.
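Queue growth is easier to act on when a single spike is distinguished from a sustained backlog. The sketch below is one possible way to express that rule; the function name, the threshold of 400 delayed values, and the three-sample window are all hypothetical and would need tuning per environment.

```python
from typing import Sequence


def sustained_backlog(samples: Sequence[int], threshold: int,
                      consecutive: int = 3) -> bool:
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    run = 0
    for value in samples:
        # Extend the run while above threshold, reset it on any recovery.
        run = run + 1 if value > threshold else 0
        if run >= consecutive:
            return True
    return False


if __name__ == "__main__":
    # Delayed-value counts from successive polls (illustrative numbers).
    print(sustained_backlog([100, 420, 450, 480], threshold=400))
    print(sustained_backlog([100, 420, 90, 450], threshold=400))
```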
|
||||||
@@ -0,0 +1,4 @@
|
|||||||
|
2026-05-04 10:21:14 WARN zbx-proxy-bank01 proxy queue above threshold: 420 delayed values
|
||||||
|
2026-05-04 10:22:01 HIGH linux-app01 Root filesystem above 85 percent
|
||||||
|
2026-05-04 10:25:33 INFO linux-app01 filesystem cleanup completed, usage back to 74 percent
|
||||||
|
2026-05-04 10:30:12 WARN aix-core01 active check delayed, proxy connectivity validated
|
||||||
@@ -0,0 +1,12 @@
|
|||||||
|
[zabbix_server]
|
||||||
|
zbx-server01 ansible_connection=local
|
||||||
|
|
||||||
|
[zabbix_proxy]
|
||||||
|
zbx-proxy-bank01 ansible_connection=local zabbix_proxy_mode=active
|
||||||
|
zbx-proxy-bank02 ansible_connection=local zabbix_proxy_mode=passive
|
||||||
|
|
||||||
|
[zabbix_agents_linux]
|
||||||
|
linux-app01 ansible_connection=local zabbix_agent_mode=active
|
||||||
|
|
||||||
|
[zabbix_agents_aix]
|
||||||
|
aix-core01 ansible_connection=local zabbix_agent_mode=active
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
---
|
||||||
|
- name: Configure Zabbix agents
|
||||||
|
hosts: zabbix_agents_linux:zabbix_agents_aix
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: zabbix_agent
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
---
|
||||||
|
- name: Configure Zabbix proxy nodes
|
||||||
|
hosts: zabbix_proxy
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: zabbix_proxy
|
||||||
@@ -0,0 +1,8 @@
|
|||||||
|
---
|
||||||
|
- name: Configure Zabbix server control plane
|
||||||
|
hosts: zabbix_server
|
||||||
|
become: true
|
||||||
|
gather_facts: false
|
||||||
|
|
||||||
|
roles:
|
||||||
|
- role: zabbix_server
|
||||||
+7
@@ -0,0 +1,7 @@
|
|||||||
|
---
|
||||||
|
zabbix_agent_server: zbx-proxy-bank01
|
||||||
|
zabbix_agent_server_active: zbx-proxy-bank01
|
||||||
|
zabbix_agent_hostname: "{{ inventory_hostname }}"
|
||||||
|
zabbix_agent_mode: active
|
||||||
|
zabbix_agent_listen_port: 10050
|
||||||
|
zabbix_agent_include_dir: /etc/zabbix/zabbix_agentd.d
|
||||||
Some files were not shown because too many files have changed in this diff