Polish infrastructure portfolio projects

This commit is contained in:
Mateusz Suski
2026-04-29 23:30:30 +00:00
parent b0537b4bff
commit 8783892241
34 changed files with 762 additions and 1226 deletions
+20 -107
@@ -1,118 +1,31 @@
-name: CI Pipeline
+name: ci
 on:
   push:
-    branches: [ main, develop ]
+    branches: [main]
   pull_request:
-    branches: [ main ]
+    branches: [main]
 jobs:
-  lint-ansible:
+  validate:
     runs-on: ubuntu-latest
     steps:
-      - uses: actions/checkout@v3
-      - name: Install Ansible Lint
-        run: pip install ansible-lint
-      - name: Lint Ansible Playbooks
-        run: |
-          cd enterprise-infra-simulator
-          ansible-lint playbooks/*.yml
-      - name: Check Ansible Syntax
-        run: |
-          cd enterprise-infra-simulator
-          ansible-playbook --syntax-check playbooks/*.yml
-  test-python:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.8'
-      - name: Install Dependencies
-        run: |
-          cd migration-validation-framework
-          pip install -r requirements.txt
-      - name: Run Python Tests
-        run: |
-          cd migration-validation-framework
-          python -m pytest tests/ -v --cov=. --cov-report=xml
-      - name: Lint Python Code
-        run: |
-          pip install flake8 black isort
-          cd migration-validation-framework
-          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
-          black --check .
-          isort --check-only .
-  validate-docker:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-      - name: Validate Docker Compose
-        run: |
-          cd observability-stack
-          docker-compose config
-      - name: Check Docker Images
-        run: |
-          cd observability-stack
-          docker-compose pull --quiet
-  security-scan:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-      - name: Run Trivy vulnerability scanner
-        uses: aquasecurity/trivy-action@master
-        with:
-          scan-type: 'fs'
-          scan-ref: '.'
-          format: 'sarif'
-          output: 'trivy-results.sarif'
-      - name: Upload Trivy scan results to GitHub Security tab
-        uses: github/codeql-action/upload-sarif@v2
-        if: always()
-        with:
-          sarif_file: 'trivy-results.sarif'
-  documentation:
-    runs-on: ubuntu-latest
-    steps:
-      - uses: actions/checkout@v3
-      - name: Check Documentation
-        run: |
-          # Check for broken links in README files
-          find . -name "README.md" -exec markdown-link-check {} \;
-          # Validate YAML files
-          find . -name "*.yml" -o -name "*.yaml" | xargs -I {} yamllint {}
-  integration-test:
-    runs-on: ubuntu-latest
-    needs: [lint-ansible, test-python, validate-docker]
-    steps:
-      - uses: actions/checkout@v3
-      - name: Set up Python
-        uses: actions/setup-python@v4
-        with:
-          python-version: '3.8'
-      - name: Install Dependencies
-        run: |
-          pip install ansible docker-compose
-      - name: Run Integration Tests
-        run: |
-          # Start infrastructure simulator
-          cd enterprise-infra-simulator
-          make up
-          sleep 30
-          # Run basic validation
-          ansible -i inventory/hosts.ini all -m ping
-          # Test migration framework
-          cd ../migration-validation-framework
-          python cli.py --help
-          # Validate observability stack
-          cd ../observability-stack
-          docker-compose config
-          # Cleanup
-          cd ../enterprise-infra-simulator
-          make destroy
+      - uses: actions/checkout@v4
+      - name: Python syntax check
+        run: |
+          python3 -m py_compile \
+            migration-validation-framework/cli.py \
+            migration-validation-framework/collectors/*.py \
+            migration-validation-framework/validators/*.py \
+            migration-validation-framework/reports/*.py
+      - name: Ansible syntax check
+        run: |
+          python3 -m pip install --user ansible
+          ansible-playbook -i enterprise-infra-simulator/inventory/hosts.ini \
+            --syntax-check enterprise-infra-simulator/playbooks/*.yml
+      - name: Docker compose validation
+        run: |
+          docker compose -f observability-stack/docker-compose.yml config
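The new `validate` job's Python step relies on `py_compile`, which catches syntax errors without importing the modules. The same check can be driven from Python directly; a minimal sketch (the sample files below are generated for illustration and are not part of the repository):

```python
import os
import py_compile
import tempfile

def syntax_check(paths):
    """Return (path, message) pairs for files that fail to byte-compile."""
    failures = []
    for path in paths:
        try:
            # doraise=True raises PyCompileError instead of printing to stderr.
            py_compile.compile(path, doraise=True)
        except py_compile.PyCompileError as exc:
            failures.append((path, str(exc)))
    return failures

if __name__ == "__main__":
    # Illustrative only: one valid file, one with a syntax error.
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.py")
        bad = os.path.join(tmp, "bad.py")
        with open(good, "w") as f:
            f.write("x = 1\n")
        with open(bad, "w") as f:
            f.write("def broken(:\n")
        failed = syntax_check([good, bad])
        print([os.path.basename(p) for p, _ in failed])  # ['bad.py']
```

Unlike `pytest` or `flake8`, this only proves the files parse, which matches the lighter scope of the consolidated workflow.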
+5
@@ -0,0 +1,5 @@
__pycache__/
*.pyc
*.log
.env
.DS_Store
+2 -2
@@ -8,7 +8,7 @@
 - **Comprehensive Ansible automation suite**:
   - `provision.yml`: Node provisioning with security hardening, package installation, and service configuration
   - `patch.yml`: Automated patching with rollback capabilities and notification system
-  - `harden.yml`: Security hardening following CIS benchmarks (firewall, SSH, user management)
+  - `hardening.yml`: Security hardening following CIS benchmarks (firewall, SSH, user management)
   - `decommission.yml`: Graceful node decommissioning with cleanup and notification
 - **Operational scripts**:
   - `simulate_scaling.sh`: Infrastructure scaling simulation
@@ -120,4 +120,4 @@
 - Disaster recovery procedures
 ---
-*Portfolio created to demonstrate enterprise-level Linux infrastructure engineering capabilities across the full technology stack.*
+*Portfolio created to demonstrate enterprise-level Linux infrastructure engineering capabilities across the full technology stack.*
+13 -53
@@ -1,59 +1,19 @@
-# Enterprise Infrastructure Portfolio
-This mono-repository showcases enterprise-level Linux infrastructure engineering capabilities through three comprehensive projects demonstrating DevOps and Platform Engineering best practices.
-## Projects Overview
-### 1. Enterprise Infrastructure Simulator
-A container-based simulation of enterprise Linux infrastructure with Ansible automation for provisioning, patching, hardening, and decommissioning operations. Includes failure simulation and scaling scenarios.
-**Key Features:**
-- Multi-node Linux simulation using Docker containers
-- Ansible playbooks for infrastructure lifecycle management
-- Automated scaling and failure injection scripts
-- Enterprise-grade inventory and scenario management
-### 2. Migration Validation Framework
-A Python CLI tool for validating system migrations by collecting, comparing, and reporting on system state changes. Generates comprehensive before/after snapshots and HTML reports.
-**Key Features:**
-- Automated system data collection (mounts, services, disk usage)
-- JSON snapshot generation and comparison
-- HTML report generation with change visualization
-- CLI interface for enterprise migration workflows
-### 3. Observability Stack
-A complete monitoring and logging stack using ELK (Elasticsearch, Logstash, Kibana) and Grafana for comprehensive infrastructure observability.
-**Key Features:**
-- Docker Compose deployment of full observability stack
-- Sample log ingestion and processing pipelines
-- Alerting rules and incident simulation scenarios
-- Real-time dashboards and monitoring capabilities
-## Architecture
-See [docs/architecture.md](docs/architecture.md) for detailed architecture overview.
-## Runbooks
-Operational procedures and troubleshooting guides available in [docs/runbooks.md](docs/runbooks.md).
-## Getting Started
-Each project contains its own README.md with setup and usage instructions.
-## CI/CD
-Automated testing and linting via Gitea workflows in [.gitea/workflows/](.gitea/workflows/).
-## Prerequisites
-- Docker and Docker Compose
-- Ansible
-- Python 3.8+
-- Make
-## License
-Enterprise Internal Use Only
+# Infrastructure Engineering Portfolio
+This repository contains independent infrastructure projects focused on automation, migration assurance, and observability. The projects are intentionally small enough to run locally, but structured around the operating patterns used in enterprise platform teams: repeatable workflows, clear evidence artifacts, and operational documentation.
+## Projects
+- [Enterprise Infrastructure Simulator](enterprise-infra-simulator/) - Ansible-driven lifecycle operations for provisioning, patching, hardening, decommissioning, and failure simulation across Linux nodes.
+- [Migration Validation Framework](migration-validation-framework/) - Python CLI for collecting before/after system snapshots and producing structured migration comparison results.
+- [Observability Stack](observability-stack/) - Docker Compose based logging and dashboard stack with alert rules, sample logs, and incident simulation.
+## Skills Demonstrated
+- Infrastructure automation with Ansible
+- Operational scenario design and incident simulation
+- Migration validation, drift detection, and JSON reporting
+- Docker Compose service validation
+- Repository hygiene, CI checks, and professional project documentation
+Each project remains independent and includes its own README, architecture notes, examples, and runnable scenarios.
+7 -7
@@ -53,7 +53,7 @@ make up
 ```bash
 cd enterprise-infra-simulator
-ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
+ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml
 ```
 **Hardening Steps:**
@@ -106,7 +106,7 @@ make destroy
 ```bash
 cd migration-validation-framework
-python cli.py snapshot --env production --label pre-migration
+python3 cli.py collect --output before.json --systems web01,db01
 ```
 **Data Collected:**
@@ -118,8 +118,8 @@ python cli.py snapshot --env production --label pre-migration
 ### Post-Migration Validation
 ```bash
-python cli.py snapshot --env production --label post-migration
-python cli.py compare pre-migration post-migration
+python3 cli.py collect --output after.json --systems web01,db01
+python3 cli.py compare before.json after.json --output diff.json
 ```
 **Validation Checks:**
@@ -131,7 +131,7 @@ python cli.py compare pre-migration post-migration
 ### Report Generation
 ```bash
-python cli.py report --comparison-id <comparison-id> --format html
+python3 cli.py report --comparison <comparison-id> --format html
 ```
 **Report Contents:**
@@ -293,7 +293,7 @@ docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup"
 #### Migration Data Backup
 ```bash
 cd migration-validation-framework
-python cli.py backup --destination /backup/location
+tar -czf /backup/location/migration-validation-framework.tgz migration-validation-framework
 ```
 ## Emergency Procedures
@@ -326,4 +326,4 @@ cd observability-stack && docker-compose up -d
 - **Daily:** Log rotation and cleanup
 - **Weekly:** Security patching and updates
 - **Monthly:** Performance optimization and capacity planning
-- **Quarterly:** Architecture review and modernization
+- **Quarterly:** Architecture review and modernization
+10 -3
@@ -1,6 +1,6 @@
 # Enterprise Infrastructure Simulator Makefile
-.PHONY: help up down patch destroy status logs clean test
+.PHONY: help run demo up down patch destroy status logs clean test
 # Default target
 help: ## Show this help message
@@ -9,6 +9,13 @@ help: ## Show this help message
 	@echo "Available commands:"
 	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf "  %-15s %s\n", $$1, $$2}'
+
+run: ## Run the default simulator workflow
+	ansible-playbook -i inventory/hosts.ini playbooks/provision.yml
+
+demo: ## Run a failure-and-patch demonstration
+	./scripts/simulate_failure.sh service 30 web
+	ansible-playbook -i inventory/hosts.ini playbooks/patch.yml
 # Infrastructure management
 up: ## Start the infrastructure simulation
 	@echo "Starting enterprise infrastructure simulation..."
@@ -144,7 +151,7 @@ format: ## Format code and configuration
 # Security
 harden: ## Apply security hardening
 	@echo "Applying security hardening..."
-	ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
+	ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml
 security-scan: ## Run security scans
 	@echo "Running security scans..."
@@ -163,4 +170,4 @@ help-failure: ## Show failure simulation commands
 	@echo "  make fail-network DURATION=60  - Network failure for 60s"
 	@echo "  make fail-disk DURATION=120    - Disk exhaustion for 120s"
 	@echo "  make fail-service DURATION=30  - Service failure for 30s"
-	@echo "  make fail-node DURATION=300    - Node failure for 300s"
+	@echo "  make fail-node DURATION=300    - Node failure for 300s"
+44 -238
@@ -1,268 +1,74 @@
 # Enterprise Infrastructure Simulator
-A container-based simulation environment for enterprise Linux infrastructure operations. This project provides Ansible automation for provisioning, patching, hardening, and decommissioning of simulated Linux nodes, along with scripts for scaling and failure simulation.
-## Overview
-The Enterprise Infrastructure Simulator creates a realistic environment for testing and demonstrating infrastructure automation at scale. It uses Docker containers to simulate multiple Linux nodes and provides comprehensive Ansible playbooks for enterprise operations.
-## Architecture
-- **Container Simulation:** Docker-based Linux nodes with realistic configurations
-- **Ansible Automation:** Modular playbooks for infrastructure lifecycle management
-- **Dynamic Inventory:** Automated host discovery and grouping
-- **Simulation Scripts:** Automated scaling and failure injection
-- **Scenario Management:** Pre-defined operational scenarios
-## Quick Start
-### Prerequisites
-- Docker and Docker Compose
-- Ansible 2.9+
-- Make
-### Setup
-```bash
-# Clone and navigate to project
-cd enterprise-infra-simulator
-# Start the infrastructure
-make up
-# Verify deployment
-ansible -i inventory/hosts.ini all -m ping
-```
-## Available Operations
-### Infrastructure Management
-```bash
-# Provision new nodes
-ansible-playbook -i inventory/hosts.ini playbooks/provision.yml
-# Apply security patches
-make patch
-# Harden systems
-ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
-# Decommission nodes
-ansible-playbook -i inventory/hosts.ini playbooks/decommission.yml
-# Destroy infrastructure
-make destroy
-```
-### Simulation Operations
-```bash
-# Scale up infrastructure
-./scripts/simulate_scaling.sh up 5
-# Simulate network failure
-./scripts/simulate_failure.sh --type network --duration 300
-# Run operational scenario
-ansible-playbook -i inventory/hosts.ini scenarios/scaling_event.yml
-```
-## Project Structure
-```
-enterprise-infra-simulator/
-├── inventory/              # Ansible inventory files
-│   └── hosts.ini           # Dynamic host inventory
-├── playbooks/              # Ansible automation playbooks
-│   ├── provision.yml       # Node provisioning
-│   ├── patch.yml           # Security patching
-│   ├── harden.yml          # Security hardening
-│   └── decommission.yml    # Node decommissioning
-├── scripts/                # Simulation and utility scripts
-│   ├── simulate_scaling.sh # Infrastructure scaling
-│   └── simulate_failure.sh # Failure injection
-├── scenarios/              # Operational scenarios
-│   └── scaling_event.yml   # Scaling scenario
-├── docker-compose.yml      # Container orchestration
-├── Makefile                # Build automation
-└── README.md
-```
-## Inventory Management
-The simulator uses dynamic inventory with the following groups:
-- `webservers`: Web application servers
-- `databases`: Database servers
-- `loadbalancers`: Load balancing infrastructure
-- `monitoring`: Monitoring and logging servers
-## Playbooks
-### Provision Playbook
-- Creates Docker containers with base Linux configurations
-- Installs required packages and services
-- Configures basic networking and security
-- Registers nodes in inventory
-### Patch Playbook
-- Updates system packages
-- Applies security patches
-- Restarts services as needed
-- Generates patch reports
-### Harden Playbook
-- Implements CIS security benchmarks
-- Configures firewall rules
-- Hardens SSH configuration
-- Disables unnecessary services
-### Decommission Playbook
-- Gracefully stops services
-- Exports configuration and data
-- Removes containers
-- Cleans up inventory
-## Simulation Scripts
-### Scaling Simulation
-```bash
-./scripts/simulate_scaling.sh [up|down] [count] [type]
-```
-Parameters:
-- `direction`: up/down
-- `count`: Number of nodes to add/remove
-- `type`: Node type (web/db/lb/monitor)
-### Failure Simulation
-```bash
-./scripts/simulate_failure.sh --type [failure_type] --duration [seconds]
-```
-Failure Types:
-- `network`: Network connectivity issues
-- `disk`: Disk space exhaustion
-- `service`: Service failures
-- `node`: Complete node outages
-## Scenarios
-Pre-defined operational scenarios for testing:
-- **Scaling Event:** Automated scaling during traffic spikes
-- **Disaster Recovery:** Node failure and recovery procedures
-- **Maintenance Window:** Scheduled patching and updates
-- **Security Incident:** Breach simulation and response
-## Configuration
-### Environment Variables
-```bash
-# Number of initial nodes
-INFRA_NODE_COUNT=3
-# Node types to deploy
-INFRA_NODE_TYPES=web,db,lb
-# Simulation parameters
-SIMULATION_DURATION=3600
-SIMULATION_INTENSITY=medium
-```
-### Docker Configuration
-Container resources and networking are configured in `docker-compose.yml`:
-```yaml
-services:
-  infra-node:
-    image: ubuntu:20.04
-    deploy:
-      replicas: 3
-      resources:
-        limits:
-          memory: 512M
-          cpus: '0.5'
-```
-## Monitoring and Logging
-- Ansible execution logs: `ansible.log`
-- Container logs: `docker logs <container-name>`
-- Simulation logs: `logs/simulation.log`
-## Troubleshooting
-### Common Issues
-**Ansible Connection Failures:**
-```bash
-# Check container status
-docker ps | grep infra-sim
-# Verify SSH connectivity
-ansible -i inventory/hosts.ini all -m ping
-```
-**Container Resource Issues:**
-```bash
-# Check Docker resources
-docker system df
-# Clean up containers
-docker system prune
-```
-**Simulation Script Errors:**
-```bash
-# Check script permissions
-chmod +x scripts/*.sh
-# Verify dependencies
-./scripts/simulate_failure.sh --help
-```
-## Development
-### Adding New Playbooks
-1. Create playbook in `playbooks/` directory
-2. Follow Ansible best practices
-3. Test with `--check` mode
-4. Update documentation
-### Custom Scenarios
-1. Define scenario in `scenarios/` directory
-2. Include required variables
-3. Test with dry-run
-4. Document operational procedures
-## Security Considerations
-- Containers run with limited privileges
-- SSH keys are generated per deployment
-- Firewall rules are applied automatically
-- Security scanning integrated in CI/CD
-## Performance Optimization
-- Container resource limits prevent resource exhaustion
-- Ansible parallel execution for faster operations
-- Efficient failure simulation without full outages
-- Optimized Docker layer caching
-## Contributing
-1. Follow existing code structure and naming conventions
-2. Add comprehensive documentation
-3. Include tests for new functionality
-4. Update runbooks for operational changes
-## License
-Enterprise Internal Use Only
+## Problem Statement
+Infrastructure teams need a safe place to rehearse lifecycle operations before applying them to production fleets. Patch windows, hardening changes, scale events, and node failures all carry operational risk when they are tested only during real incidents.
+## Solution Overview
+This project models common Linux infrastructure operations with Ansible playbooks and shell-based simulations. It keeps the automation readable and auditable while producing example evidence that resembles a real change record.
+## Architecture Overview
+```
+Operator -> Make/CLI -> Ansible Inventory -> Playbooks -> Linux Nodes
+                |                                |
+                v                                v
+           Scenarios                       Reports/Logs
+```
+Core components:
+- `inventory/hosts.ini` defines managed node groups.
+- `playbooks/` contains provisioning, patching, hardening, and decommissioning workflows.
+- `scripts/` injects scaling and failure conditions.
+- `scenarios/` documents operational exercises.
+- `examples/` stores representative outputs for review.
+## How to Run
+```bash
+cd enterprise-infra-simulator
+# Validate playbook syntax.
+make test
+# Provision the simulated estate.
+make run
+# Apply security patches.
+make patch
+# Apply host hardening.
+make harden
+# Run the failure and patch demo.
+make demo
+```
+Direct Ansible commands are also supported:
+```bash
+ansible-playbook -i inventory/hosts.ini playbooks/provision.yml
+ansible-playbook -i inventory/hosts.ini playbooks/patch.yml
+ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml
+```
+## Example Output
+```text
+PLAY RECAP *********************************************************************
+web01 : ok=21 changed=7 unreachable=0 failed=0 skipped=3 rescued=0 ignored=1
+db01  : ok=18 changed=4 unreachable=0 failed=0 skipped=5 rescued=0 ignored=1
+lb01  : ok=16 changed=3 unreachable=0 failed=0 skipped=6 rescued=0 ignored=0
+
+Patch status: SUCCESS
+Updates applied: 12
+Reboot required: false
+```
+Additional sample evidence is available in [examples/patch-output.txt](examples/patch-output.txt) and [examples/failure-simulation.txt](examples/failure-simulation.txt).
+## Real-World Use Case
+A platform team can use this project to demonstrate how routine operating procedures are encoded, reviewed, and tested before production change windows. The same patterns apply to regulated Linux estates where patch evidence, hardening controls, and incident drills must be repeatable.
@@ -0,0 +1,30 @@
# Enterprise Infrastructure Simulator Architecture
## Components
- Operator interface: `make` targets and direct Ansible commands.
- Inventory: static host groups in `inventory/hosts.ini`.
- Automation: lifecycle playbooks in `playbooks/`.
- Simulation scripts: controlled failure and scaling events in `scripts/`.
- Evidence: logs, reports, scenario notes, and examples.
## Data Flow
```
Operator
-> Make target or shell script
-> Ansible inventory
-> lifecycle playbook
-> managed Linux node
-> log/report artifact
```
Failure drills follow a parallel flow:
```
Operator -> simulate_failure.sh -> target node/service -> health check -> patch/hardening playbook -> evidence
```
## Notes
The project favors explicit playbooks over hidden orchestration so the operational intent is visible during review. In a production implementation, the same workflows would typically run from a CI runner or automation controller with credentials supplied by a secret manager.
@@ -0,0 +1,8 @@
2026-04-29 02:13:41 - Starting failure simulation: service 30 web
2026-04-29 02:13:41 - Simulating service failures on containers: web
2026-04-29 02:13:42 - Stopping services in container enterprise-web-1
2026-04-29 02:13:44 - Health probe failed: http://web01/health returned 503
2026-04-29 02:14:12 - Cleaning up failure simulation
2026-04-29 02:14:13 - Restarted nginx in enterprise-web-1
2026-04-29 02:14:18 - Health probe recovered: http://web01/health returned 200
2026-04-29 02:14:18 - Failure simulation completed successfully
@@ -0,0 +1,33 @@
PLAY [Apply Security Patches and Updates] **************************************
TASK [Update package cache] *****************************************************
changed: [web01]
changed: [db01]
ok: [lb01]
TASK [Check for available updates] **********************************************
ok: [web01] => {"stdout": "9"}
ok: [db01] => {"stdout": "4"}
ok: [lb01] => {"stdout": "0"}
TASK [Apply security updates only] **********************************************
changed: [web01]
changed: [db01]
ok: [lb01]
TASK [Verify critical services] *************************************************
ok: [web01] => (item=systemd-journald)
ok: [web01] => (item=cron)
ok: [db01] => (item=systemd-journald)
ok: [lb01] => (item=cron)
PLAY RECAP *********************************************************************
web01 : ok=19 changed=6 unreachable=0 failed=0 skipped=2 rescued=0 ignored=1
db01 : ok=18 changed=5 unreachable=0 failed=0 skipped=2 rescued=0 ignored=1
lb01 : ok=15 changed=1 unreachable=0 failed=0 skipped=4 rescued=0 ignored=0
Patch report
Status: SUCCESS
Window: 02:00-04:00 UTC
Reboot required: false
Notification: infra-team@example.com
@@ -0,0 +1,21 @@
# Scenario: Simulate Failure and Patch
## Description
Validate that a service-level failure can be detected, recovered, and followed by a controlled patch workflow. This mirrors a maintenance window where a degraded node is stabilized before package updates are applied.
## Commands
```bash
cd enterprise-infra-simulator
./scripts/simulate_failure.sh service 30 web
ansible-playbook -i inventory/hosts.ini playbooks/patch.yml
ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml --check
```
## Expected Result
- The simulation records a temporary service failure.
- The service is restored after cleanup.
- The patch playbook completes without unreachable hosts.
- Hardening check mode reports no destructive changes.
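The failure simulation log relies on a health probe that reports 503 during the drill and 200 after cleanup. The recovery wait can be sketched as a small polling helper; the probe is injected as a callable so the sketch runs without a live endpoint (the function name and defaults are illustrative, not part of the project):

```python
import time

def wait_for_recovery(probe, attempts=10, delay=0.0):
    """Poll an HTTP-style probe until it returns 200 or attempts run out.

    `probe` is any callable returning a status code; injecting it keeps the
    helper testable without a real service behind it.
    """
    for attempt in range(1, attempts + 1):
        status = probe()
        if status == 200:
            return attempt  # recovered on this attempt
        time.sleep(delay)
    return None  # never recovered within the allotted attempts

if __name__ == "__main__":
    # Simulated probe: down twice (503), then healthy (200) after restart.
    responses = iter([503, 503, 200])
    print(wait_for_recovery(lambda: next(responses)))  # 3
```

In the real drill, `probe` would wrap an HTTP request to the node's `/health` endpoint and `delay` would be a few seconds.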
+10
@@ -0,0 +1,10 @@
.PHONY: run test demo
run:
	python3 cli.py --help

test:
	python3 -m py_compile cli.py collectors/*.py validators/*.py reports/*.py

demo:
	python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
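The `demo` target compares two JSON snapshots and writes a diff. The kind of comparison such a step performs can be sketched with plain dicts; the flat `{"services": {...}, "mounts": {...}}` snapshot shape here is an assumption for illustration, and the framework's actual schema may differ:

```python
def diff_section(before, after):
    """Compare one snapshot section (a dict) and classify each key."""
    added = sorted(k for k in after if k not in before)
    removed = sorted(k for k in before if k not in after)
    changed = sorted(k for k in before if k in after and before[k] != after[k])
    return {"added": added, "removed": removed, "changed": changed}

def compare_snapshots(before, after):
    """Produce a per-section diff for every section present in either snapshot."""
    sections = set(before) | set(after)
    return {s: diff_section(before.get(s, {}), after.get(s, {}))
            for s in sorted(sections)}

if __name__ == "__main__":
    before = {"services": {"sshd": "active", "cron": "active"},
              "mounts": {"/": "rw", "/data": "rw"}}
    after = {"services": {"cron": "active", "nginx": "active"},
             "mounts": {"/": "rw", "/data": "ro"}}
    result = compare_snapshots(before, after)
    print(result["services"])          # {'added': ['nginx'], 'removed': ['sshd'], 'changed': []}
    print(result["mounts"]["changed"])  # ['/data']
```

A removed `sshd` or a mount flipping from `rw` to `ro` is exactly the class of drift the framework is meant to surface before cutover.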
+30 -363
@@ -1,389 +1,56 @@
# Migration Validation Framework # Migration Validation Framework
A comprehensive Python CLI tool for validating system migrations through data collection, snapshot comparison, and automated reporting. Designed for enterprise migration workflows where system consistency and data integrity are critical. ## Problem Statement
## Overview Infrastructure migrations often fail in small, expensive ways: a mount option changes, a service is disabled, or disk usage moves past an operational threshold. Teams need structured evidence that the migrated host still matches the expected operating profile.
The Migration Validation Framework provides a systematic approach to validating system migrations by: ## Solution Overview
- Collecting comprehensive system data before and after migration This project provides a Python CLI that collects system state into JSON snapshots and compares before/after files. The output is designed for change records, migration gates, and post-cutover validation.
- Generating structured JSON snapshots for comparison
- Performing intelligent diff analysis between snapshots
- Generating detailed HTML reports with change visualization
- Providing CLI interface for integration into migration pipelines
## Architecture ## Architecture Overview
``` ```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ Operator -> CLI -> Collectors -> JSON Snapshot -> Comparator -> Diff/Report
│ CLI Interface │ │ Data │ │ Validation │
│ (cli.py) │◄──►│ Collectors │◄──►│ Engine │
│ │ │ │ │ │
│ - Command │ │ - mounts.py │ │ - compare.py │
│ parsing │ │ - services.py │ │ - diff.py │
│ - Workflow │ │ - disk_usage.py │ │ - validate.py │
│ orchestration │ │ - network.py │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ JSON │ │ Comparison │ │ HTML │
│ Snapshots │ │ Results │ │ Reports │
│ │ │ │ │ │
│ - Pre-migration │ │ - Differences │ │ - Summary │
│ - Post-migration│ │ - Risk levels │ │ - Details │
│ - Metadata │ │ - Validation │ │ - Charts │
└─────────────────┘ └─────────────────┘ └─────────────────┘
``` ```
## Quick Start Core components:
### Prerequisites - `cli.py` provides collect, compare, snapshot, list, and report commands.
- `collectors/` gathers mounts, services, and disk usage.
- `validators/compare.py` identifies drift and validation failures.
- `reports/` contains report generation helpers.
- `examples/` contains realistic before/after evidence.
- Python 3.8+ ## How to Run
- SSH access to target systems
- Appropriate permissions for data collection
### Installation
```bash ```bash
cd migration-validation-framework cd migration-validation-framework
pip install -r requirements.txt python3 cli.py collect --output before.json --systems web01,db01
python3 cli.py collect --output after.json --systems web01,db01
python3 cli.py compare before.json after.json --output diff.json
python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
``` ```
### Basic Usage Legacy snapshot IDs are still supported:
```bash ```bash
# Create pre-migration snapshot python3 cli.py snapshot --env prod --label pre --systems web01,db01
python cli.py snapshot --env production --label pre-migration --systems web01,db01 python3 cli.py compare prod-pre-20260429_020000 prod-post-20260429_030000 --output change-0429
# Perform migration...
# Create post-migration snapshot
python cli.py snapshot --env production --label post-migration --systems web01,db01
# Compare snapshots
python cli.py compare pre-migration post-migration --output comparison_001
# Generate HTML report
python cli.py report --comparison comparison_001 --format html --output migration_report.html
``` ```
## Project Structure ## Example Output
``` ```text
migration-validation-framework/ Comparison completed: diff.json (FAIL)
├── cli.py # Main CLI interface Overall risk: high
├── collectors/ # Data collection modules Total changes: 4
│ ├── mounts.py # Filesystem mount collection Failed checks: critical_services_running
│ ├── services.py # System services collection Recommendation: restore sshd before production cutover
│ ├── disk_usage.py # Disk usage statistics
│ ├── network.py # Network configuration
│ └── processes.py # Running processes
├── validators/ # Validation and comparison logic
│ ├── compare.py # Snapshot comparison engine
│ ├── diff.py # Difference calculation
│ └── validate.py # Validation rules
├── reports/ # Report generation
│ ├── html_report.py # HTML report generator
│ ├── json_report.py # JSON report generator
│ └── summary.py # Summary calculations
├── config/ # Configuration files
│ ├── collectors.yaml # Collector configurations
│ └── validators.yaml # Validation rules
├── tests/ # Unit and integration tests
├── logs/ # Application logs
└── snapshots/ # Stored snapshots
``` ```
## Data Collectors Sample inputs and output are available in [examples/before.json](examples/before.json), [examples/after.json](examples/after.json), and [examples/diff.json](examples/diff.json).
### Mounts Collector (`collectors/mounts.py`) ## Real-World Use Case
Collects filesystem mount information including:
- Mount points and devices
- Filesystem types
- Mount options
- Capacity and usage statistics
### Services Collector (`collectors/services.py`) During a data center migration, a platform team can collect baseline state before cutover, collect the same evidence after DNS or workload migration, and attach the diff to the change ticket. The framework gives reviewers a compact signal on whether the host is ready for production traffic.
Gathers system service status:
- Running services
- Service states (active, inactive, failed)
- Startup configuration
- Dependencies
### Disk Usage Collector (`collectors/disk_usage.py`)
Analyzes disk space utilization:
- Directory size statistics
- File system usage
- Inode usage
- Largest files and directories
### Network Collector (`collectors/network.py`)
Captures network configuration:
- Interface configurations
- Routing tables
- DNS settings
- Firewall rules
### Processes Collector (`collectors/processes.py`)
Documents running processes:
- Process lists with PIDs
- Memory and CPU usage
- Process owners
- Command lines
## Validation Engine
### Comparison Logic (`validators/compare.py`)
Performs intelligent comparison of snapshots:
- Structural differences detection
- Semantic change analysis
- Risk level assessment
- Change categorization
### Difference Calculator (`validators/diff.py`)
Calculates detailed differences:
- Added/removed/modified items
- Quantitative changes
- Configuration drift detection
- Anomaly identification
### Validation Rules (`validators/validate.py`)
Applies validation rules:
- Critical change detection
- Compliance checking
- Threshold validation
- Custom rule engine
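Threshold validation of the kind configured in `config/validators.yaml` can be sketched as follows; `check_filesystem_thresholds` is a hypothetical helper, and the 80/95 defaults mirror the warning and critical values in the configuration section.

```python
def check_filesystem_thresholds(usage: dict, warning: int = 80, critical: int = 95) -> dict:
    """Bucket mountpoints by how close their usage is to configured thresholds."""
    findings = {"warning": [], "critical": []}
    for mountpoint, info in usage.items():
        pct = int(info["use_percent"].rstrip("%"))
        if pct >= critical:
            findings["critical"].append(mountpoint)
        elif pct >= warning:
            findings["warning"].append(mountpoint)
    return findings

usage = {"/": {"use_percent": "62%"}, "/var": {"use_percent": "94%"}}
findings = check_filesystem_thresholds(usage)
```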
## Reporting
### HTML Reports (`reports/html_report.py`)
Generates comprehensive HTML reports featuring:
- Executive summary dashboard
- Detailed change logs
- Risk assessment visualizations
- Interactive charts and graphs
- Export capabilities
### JSON Reports (`reports/json_report.py`)
Provides structured JSON output for:
- API integration
- Automated processing
- Audit trails
- Compliance reporting
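A minimal sketch of structured JSON output for automated processing; the field names mirror the comparison examples but `write_json_report` itself is an assumption, not the module's actual API.

```python
import json
import os
import tempfile

def write_json_report(comparison: dict, path: str) -> None:
    """Write a compact, machine-readable report wrapping the full comparison."""
    report = {
        "result": comparison.get("validation_results", {}).get("result", "UNKNOWN"),
        "total_changes": comparison.get("summary", {}).get("total_changes", 0),
        "comparison": comparison,
    }
    with open(path, "w") as f:
        json.dump(report, f, indent=2)
        f.write("\n")

comparison = {"summary": {"total_changes": 7}, "validation_results": {"result": "FAIL"}}
path = os.path.join(tempfile.mkdtemp(), "report.json")
write_json_report(comparison, path)
with open(path) as f:
    loaded = json.load(f)
```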
## CLI Interface
### Commands
```bash
# Snapshot management
python cli.py snapshot --env <env> --label <label> [--systems <hosts>]
python cli.py list-snapshots [--env <env>]
python cli.py delete-snapshot <snapshot-id>
# Comparison operations
python cli.py compare <snapshot1> <snapshot2> [--output <comparison-id>]
python cli.py list-comparisons
python cli.py show-comparison <comparison-id>
# Reporting
python cli.py report --comparison <comparison-id> --format <format> [--output <file>]
python cli.py export --comparison <comparison-id> --format <format>
# Configuration
python cli.py config --show
python cli.py config --set <key> <value>
```
### Options
- `--env`: Target environment (production, staging, development)
- `--systems`: Comma-separated list of target systems
- `--parallel`: Number of parallel collection threads
- `--timeout`: Collection timeout in seconds
- `--verbose`: Enable verbose output
- `--dry-run`: Preview actions without execution
## Configuration
### Collector Configuration (`config/collectors.yaml`)
```yaml
collectors:
mounts:
enabled: true
timeout: 30
exclude_patterns:
- "/proc/*"
- "/sys/*"
services:
enabled: true
include_disabled: false
service_manager: systemd
disk_usage:
enabled: true
max_depth: 3
exclude_paths:
- "/tmp"
- "/var/log"
```
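One way the `exclude_patterns` above could be applied is glob matching against mountpoints; this sketch inlines the configuration instead of parsing the YAML so it stays self-contained, and `is_excluded` is an illustrative helper.

```python
from fnmatch import fnmatch

# Inlined stand-in for the mounts section of config/collectors.yaml.
config = {"mounts": {"enabled": True, "exclude_patterns": ["/proc/*", "/sys/*"]}}

def is_excluded(mountpoint: str, patterns: list[str]) -> bool:
    """Return True when the mountpoint matches any exclude glob."""
    return any(fnmatch(mountpoint, pattern) for pattern in patterns)

mounts = ["/", "/var", "/proc/sys", "/sys/kernel"]
kept = [m for m in mounts if not is_excluded(m, config["mounts"]["exclude_patterns"])]
```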
### Validation Rules (`config/validators.yaml`)
```yaml
rules:
critical_services:
- sshd
- systemd
- network
filesystem_thresholds:
warning: 80
critical: 95
network_changes:
allow_new_interfaces: false
allow_route_changes: false
```
## Examples
### Complete Migration Validation Workflow
```bash
# 1. Pre-migration snapshot
python cli.py snapshot --env production --label "migration-pre-20241201" \
--systems web01,web02,db01,lb01 --parallel 4
# 2. Execute migration process
# ... migration steps ...
# 3. Post-migration snapshot
python cli.py snapshot --env production --label "migration-post-20241201" \
--systems web01,web02,db01,lb01 --parallel 4
# 4. Compare snapshots
python cli.py compare migration-pre-20241201 migration-post-20241201 \
--output migration-dec2024
# 5. Generate reports
python cli.py report --comparison migration-dec2024 --format html \
--output migration_validation_report.html
python cli.py report --comparison migration-dec2024 --format json \
--output migration_validation_data.json
```
### Automated Validation in CI/CD
```bash
#!/bin/bash
# CI/CD validation script
ENVIRONMENT=$1
SNAPSHOT_LABEL="ci-${BUILD_NUMBER}"
# Create snapshot
python cli.py snapshot --env "$ENVIRONMENT" --label "$SNAPSHOT_LABEL"
# Compare with baseline
python cli.py compare "baseline-${ENVIRONMENT}" "$SNAPSHOT_LABEL" --output "ci-${BUILD_NUMBER}"
# Generate report
python cli.py report --comparison "ci-${BUILD_NUMBER}" --format html
# Check for critical changes
if python cli.py check-critical --comparison "ci-${BUILD_NUMBER}"; then
echo "Migration validation passed"
exit 0
else
echo "Critical changes detected - review required"
exit 1
fi
```
## Security Considerations
- SSH key-based authentication only
- Encrypted snapshot storage
- Access control for sensitive data
- Audit logging of all operations
- Data sanitization and filtering
## Performance Optimization
- Parallel data collection
- Incremental snapshots
- Compressed storage
- Memory-efficient processing
- Timeout handling
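Parallel collection with timeout handling can be sketched with a thread pool; `collect_host` here is a trivial stand-in for the real SSH-based collectors, and the function names are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_host(host: str) -> dict:
    return {"host": host, "status": "ok"}  # real code would SSH out and gather data

def collect_parallel(hosts: list[str], workers: int = 4, timeout: int = 30) -> dict:
    """Collect from all hosts concurrently, recording failures instead of aborting."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(collect_host, h): h for h in hosts}
        for fut in as_completed(futures, timeout=timeout):
            host = futures[fut]
            try:
                results[host] = fut.result()
            except Exception as exc:
                results[host] = {"host": host, "status": f"error: {exc}"}
    return results

results = collect_parallel(["web01", "web02", "db01"])
```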
## Monitoring and Logging
- Comprehensive logging to `logs/validation.log`
- Performance metrics collection
- Error tracking and alerting
- Audit trail generation
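File logging consistent with the documented `logs/validation.log` destination might look like the sketch below; the handler and format details are assumptions, and a temporary directory stands in for `logs/` to keep the example self-contained.

```python
import logging
import os
import tempfile

def setup_logging(log_path: str, verbose: bool = False) -> logging.Logger:
    """Attach a file handler with timestamped, leveled entries."""
    logger = logging.getLogger("migration-validator")
    logger.setLevel(logging.DEBUG if verbose else logging.INFO)
    handler = logging.FileHandler(log_path)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger

log_path = os.path.join(tempfile.mkdtemp(), "validation.log")
logger = setup_logging(log_path)
logger.info("snapshot started")
for handler in logger.handlers:
    handler.flush()
```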
## Troubleshooting
### Common Issues
**Connection Failures:**
```bash
# Check SSH connectivity
ssh -i ~/.ssh/id_rsa user@target-host
# Verify Python availability
python cli.py --test-connection --systems target-host
```
**Collection Timeouts:**
```bash
# Increase timeout
python cli.py snapshot --timeout 300 --systems slow-host
# Check system load
ssh user@target-host uptime
```
**Permission Errors:**
```bash
# Verify sudo access
ssh user@target-host sudo -l
# Check file permissions
ssh user@target-host ls -la /etc/
```
## Development
### Adding New Collectors
1. Create collector module in `collectors/`
2. Implement collection logic
3. Add configuration schema
4. Update CLI interface
5. Add unit tests
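A skeleton for step 1 and 2 might look like this; the `collect()` entry point and payload shape mirror the existing examples but are assumptions, and the kernel-parameter data is illustrative.

```python
from datetime import datetime, timezone

def collect(system: str = "localhost") -> dict:
    """Gather kernel parameters for one system (illustrative skeleton data)."""
    return {
        "system": system,
        "sysctl": {"vm.swappiness": "60"},  # real code would shell out to sysctl
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }

payload = collect("web01")
```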
### Custom Validation Rules
1. Define rules in `config/validators.yaml`
2. Implement validation logic in `validators/`
3. Update report generation
4. Test with sample data
## Contributing
1. Follow existing code structure and naming conventions
2. Add comprehensive tests for new functionality
3. Update documentation for API changes
4. Ensure backward compatibility
## License
Enterprise Internal Use Only
+65 -12
@@ -29,8 +29,8 @@ class MigrationValidator:
     def __init__(self, verbose: bool = False):
         self.verbose = verbose
-        self.setup_logging()
         self.ensure_directories()
+        self.setup_logging()

     def setup_logging(self):
         """Configure logging."""
@@ -97,13 +97,23 @@ class MigrationValidator:
     def load_snapshot(self, snapshot_id: str) -> Dict[str, Any]:
         """Load snapshot from disk."""
-        snapshot_file = SNAPSHOTS_DIR / f"{snapshot_id}.json"
+        snapshot_path = Path(snapshot_id)
+        snapshot_file = snapshot_path if snapshot_path.exists() else SNAPSHOTS_DIR / f"{snapshot_id}.json"
         if not snapshot_file.exists():
             raise FileNotFoundError(f"Snapshot {snapshot_id} not found")
         with open(snapshot_file, 'r') as f:
             return json.load(f)

+    def collect_to_file(self, output_file: str, systems: List[str]) -> str:
+        """Collect a snapshot and write it to an explicit file path."""
+        snapshot = self.collect_system_data(systems)
+        with open(output_file, 'w') as f:
+            json.dump(snapshot, f, indent=2)
+            f.write("\n")
+        self.logger.info(f"Snapshot written: {output_file}")
+        return output_file
+
     def create_snapshot(self, env: str, label: str, systems: List[str]) -> str:
         """Create and save a system snapshot."""
         self.logger.info(f"Creating snapshot for environment: {env}, label: {label}")
@@ -136,6 +146,27 @@ class MigrationValidator:
         self.logger.info(f"Comparison saved: {output_id}")
         return comparison

+    def compare_files(self, before_file: str, after_file: str, output_file: Optional[str] = None) -> Dict[str, Any]:
+        """Compare two explicit JSON snapshot files."""
+        self.logger.info(f"Comparing files: {before_file} vs {after_file}")
+        before = self.load_snapshot(before_file)
+        after = self.load_snapshot(after_file)
+        comparison = compare.compare_snapshots(before, after)
+        comparison["metadata"] = {
+            "before": before_file,
+            "after": after_file,
+            "timestamp": datetime.now().isoformat()
+        }
+        if output_file:
+            with open(output_file, 'w') as f:
+                json.dump(comparison, f, indent=2)
+                f.write("\n")
+            self.logger.info(f"Comparison written: {output_file}")
+        return comparison
+
     def generate_report(self, comparison_id: str, format_type: str, output_file: Optional[str] = None) -> str:
         """Generate a report from comparison results."""
         self.logger.info(f"Generating {format_type} report for comparison: {comparison_id}")
@@ -169,14 +200,14 @@ def main():
         formatter_class=argparse.RawDescriptionHelpFormatter,
         epilog="""
 Examples:
-  # Create pre-migration snapshot
-  python cli.py snapshot --env production --label pre-migration --systems web01,db01
+  # Collect pre-migration snapshot
+  python3 cli.py collect --output before.json --systems web01,db01

-  # Compare snapshots
-  python cli.py compare pre-migration-snapshot post-migration-snapshot --output comparison_001
+  # Compare snapshot files
+  python3 cli.py compare before.json after.json --output diff.json

   # Generate HTML report
-  python cli.py report --comparison comparison_001 --format html
+  python3 cli.py report --comparison comparison_001 --format html
         """
     )
@@ -185,6 +216,11 @@ Examples:
     subparsers = parser.add_subparsers(dest='command', help='Available commands')

+    # Collect command
+    collect_parser = subparsers.add_parser('collect', help='Collect a system snapshot to a JSON file')
+    collect_parser.add_argument('--output', required=True, help='Output JSON file')
+    collect_parser.add_argument('--systems', default='localhost', help='Comma-separated list of systems')
+
     # Snapshot command
     snapshot_parser = subparsers.add_parser('snapshot', help='Create system snapshot')
     snapshot_parser.add_argument('--env', required=True, help='Target environment')
@@ -195,7 +231,7 @@ Examples:
     compare_parser = subparsers.add_parser('compare', help='Compare two snapshots')
     compare_parser.add_argument('snapshot1', help='First snapshot ID')
     compare_parser.add_argument('snapshot2', help='Second snapshot ID')
-    compare_parser.add_argument('--output', required=True, help='Comparison output ID')
+    compare_parser.add_argument('--output', help='Output comparison ID or JSON file')

     # Report command
     report_parser = subparsers.add_parser('report', help='Generate report from comparison')
@@ -217,7 +253,16 @@ Examples:
     validator = MigrationValidator(verbose=args.verbose)

     try:
-        if args.command == 'snapshot':
+        if args.command == 'collect':
+            systems = [system.strip() for system in args.systems.split(',') if system.strip()]
+            if args.dry_run:
+                print(f"DRY RUN: Would collect {systems} into {args.output}")
+                return
+            output_file = validator.collect_to_file(args.output, systems)
+            print(f"Snapshot written: {output_file}")
+
+        elif args.command == 'snapshot':
             systems = args.systems.split(',')
             if args.dry_run:
                 print(f"DRY RUN: Would create snapshot for systems: {systems}")
@@ -231,8 +276,16 @@ Examples:
                 print(f"DRY RUN: Would compare {args.snapshot1} vs {args.snapshot2}")
                 return

-            comparison = validator.compare_snapshots(args.snapshot1, args.snapshot2, args.output)
-            print(f"Comparison completed: {args.output}")
+            output = args.output
+            if output and output.endswith('.json'):
+                comparison = validator.compare_files(args.snapshot1, args.snapshot2, output)
+                result = "PASS" if comparison.get("validation_results", {}).get("passed") else "FAIL"
+                print(f"Comparison completed: {output} ({result})")
+            else:
+                output_id = output or datetime.now().strftime('%Y%m%d_%H%M%S')
+                comparison = validator.compare_snapshots(args.snapshot1, args.snapshot2, output_id)
+                result = "PASS" if comparison.get("validation_results", {}).get("passed") else "FAIL"
+                print(f"Comparison completed: {output_id} ({result})")

         elif args.command == 'report':
             if args.dry_run:
@@ -267,4 +320,4 @@ Examples:
             sys.exit(1)

 if __name__ == "__main__":
     main()
@@ -0,0 +1,30 @@
# Migration Validation Framework Architecture
## Components
- CLI: parses operator commands and coordinates workflows.
- Collectors: gather mounts, services, and disk usage from target systems.
- Snapshot files: JSON evidence used as immutable migration checkpoints.
- Comparator: evaluates drift between before and after snapshots.
- Reports: stores JSON or HTML output for audit and review.
## Data Flow
```
Operator
-> python3 cli.py collect
-> collectors over SSH
-> before.json / after.json
-> python3 cli.py compare
-> diff.json with PASS/FAIL validation
```
## Validation Flow
```
before.json -> Comparator -> service checks
after.json -> Comparator -> filesystem checks -> validation result
-> mount checks
```
The framework keeps collection and comparison separate so migration evidence can be reviewed, archived, and replayed without recollecting from production systems.
@@ -0,0 +1,40 @@
{
"metadata": {
"timestamp": "2026-04-29T03:40:00Z",
"systems": ["web01"],
"version": "1.0"
},
"data": {
"web01": {
"mounts": {
"mounts": [
{"device": "/dev/sda1", "mountpoint": "/", "fstype": "ext4", "options": "rw,relatime"},
{"device": "/dev/sdb1", "mountpoint": "/var", "fstype": "xfs", "options": "rw,noatime"}
],
"usage": {
"/": {"filesystem": "/dev/sda1", "use_percent": "62%"},
"/var": {"filesystem": "/dev/sdb1", "use_percent": "94%"}
},
"timestamp": "2026-04-29T03:40:00Z"
},
"services": {
"service_manager": "systemd",
"services": [
{"name": "sshd", "active_state": "failed", "sub_state": "failed"},
{"name": "nginx", "active_state": "active", "sub_state": "running"},
{"name": "node-exporter", "active_state": "active", "sub_state": "running"}
],
"timestamp": "2026-04-29T03:40:00Z"
},
"disk_usage": {
"filesystem_usage": [
{"filesystem": "/dev/sda1", "type": "ext4", "size": "80G", "used": "50G", "available": "30G", "use_percent": "62%", "mountpoint": "/"},
{"filesystem": "/dev/sdb1", "type": "xfs", "size": "200G", "used": "188G", "available": "12G", "use_percent": "94%", "mountpoint": "/var"}
],
"directory_sizes": [{"path": "/var/lib/app", "size": "139G"}],
"largest_files": [{"path": "/var/lib/app/import/archive.tar", "size": "42G"}],
"timestamp": "2026-04-29T03:40:00Z"
}
}
}
}
@@ -0,0 +1,39 @@
{
"metadata": {
"timestamp": "2026-04-29T01:15:00Z",
"systems": ["web01"],
"version": "1.0"
},
"data": {
"web01": {
"mounts": {
"mounts": [
{"device": "/dev/sda1", "mountpoint": "/", "fstype": "ext4", "options": "rw,relatime"},
{"device": "/dev/sdb1", "mountpoint": "/var", "fstype": "xfs", "options": "rw,noatime"}
],
"usage": {
"/": {"filesystem": "/dev/sda1", "use_percent": "61%"},
"/var": {"filesystem": "/dev/sdb1", "use_percent": "68%"}
},
"timestamp": "2026-04-29T01:15:00Z"
},
"services": {
"service_manager": "systemd",
"services": [
{"name": "sshd", "active_state": "active", "sub_state": "running"},
{"name": "nginx", "active_state": "active", "sub_state": "running"}
],
"timestamp": "2026-04-29T01:15:00Z"
},
"disk_usage": {
"filesystem_usage": [
{"filesystem": "/dev/sda1", "type": "ext4", "size": "80G", "used": "49G", "available": "31G", "use_percent": "61%", "mountpoint": "/"},
{"filesystem": "/dev/sdb1", "type": "xfs", "size": "200G", "used": "136G", "available": "64G", "use_percent": "68%", "mountpoint": "/var"}
],
"directory_sizes": [{"path": "/var/lib/app", "size": "84G"}],
"largest_files": [],
"timestamp": "2026-04-29T01:15:00Z"
}
}
}
}
@@ -0,0 +1,211 @@
{
"summary": {
"total_systems": 1,
"systems_with_changes": 1,
"total_changes": 7,
"changes_by_type": {
"mounts": 2,
"services": 2,
"disk_usage": 3
},
"most_affected_systems": [
[
"web01",
7
]
]
},
"differences": {
"mounts": {
"web01": {
"added_mounts": [],
"removed_mounts": [],
"changed_mounts": [],
"usage_changes": [
{
"mountpoint": "/",
"before": {
"filesystem": "/dev/sda1",
"use_percent": "61%"
},
"after": {
"filesystem": "/dev/sda1",
"use_percent": "62%"
}
},
{
"mountpoint": "/var",
"before": {
"filesystem": "/dev/sdb1",
"use_percent": "68%"
},
"after": {
"filesystem": "/dev/sdb1",
"use_percent": "94%"
}
}
]
}
},
"services": {
"web01": {
"added_services": [
{
"name": "node-exporter",
"active_state": "active",
"sub_state": "running"
}
],
"removed_services": [],
"status_changes": [
{
"name": "sshd",
"before": {
"active_state": "active",
"sub_state": "running"
},
"after": {
"active_state": "failed",
"sub_state": "failed"
}
}
],
"configuration_changes": []
}
},
"disk_usage": {
"web01": {
"filesystem_changes": [
{
"mountpoint": "/",
"before": {
"filesystem": "/dev/sda1",
"type": "ext4",
"size": "80G",
"used": "49G",
"available": "31G",
"use_percent": "61%",
"mountpoint": "/"
},
"after": {
"filesystem": "/dev/sda1",
"type": "ext4",
"size": "80G",
"used": "50G",
"available": "30G",
"use_percent": "62%",
"mountpoint": "/"
}
},
{
"mountpoint": "/var",
"before": {
"filesystem": "/dev/sdb1",
"type": "xfs",
"size": "200G",
"used": "136G",
"available": "64G",
"use_percent": "68%",
"mountpoint": "/var"
},
"after": {
"filesystem": "/dev/sdb1",
"type": "xfs",
"size": "200G",
"used": "188G",
"available": "12G",
"use_percent": "94%",
"mountpoint": "/var"
}
}
],
"directory_size_changes": [],
"significant_usage_changes": [
{
"mountpoint": "/var",
"change_percent": 26,
"before": {
"filesystem": "/dev/sdb1",
"type": "xfs",
"size": "200G",
"used": "136G",
"available": "64G",
"use_percent": "68%",
"mountpoint": "/var"
},
"after": {
"filesystem": "/dev/sdb1",
"type": "xfs",
"size": "200G",
"used": "188G",
"available": "12G",
"use_percent": "94%",
"mountpoint": "/var"
}
}
]
}
}
},
"risk_assessment": {
"overall_risk": "high",
"risk_factors": [
{
"type": "service_failure",
"description": "Service failed: sshd",
"level": 3
},
{
"type": "disk_usage_spike",
"description": "Significant disk usage change: /var (26%)",
"level": 2
}
],
"critical_changes": [],
"recommendations": [
"Immediate review required - critical changes detected",
"Consider rolling back migration if critical services are affected"
]
},
"validation_results": {
"passed": false,
"checks": [
{
"name": "critical_services_running",
"description": "Verify critical services remain operational",
"passed": false,
"details": [
"Critical service sshd failed on web01"
]
},
{
"name": "filesystem_integrity",
"description": "Verify filesystem integrity maintained",
"passed": true,
"details": []
},
{
"name": "no_critical_mounts_removed",
"description": "Verify critical mount points remain",
"passed": true,
"details": []
}
],
"failed_checks": [
{
"name": "critical_services_running",
"description": "Verify critical services remain operational",
"passed": false,
"details": [
"Critical service sshd failed on web01"
]
}
],
"result": "FAIL"
},
"metadata": {
"before": "migration-validation-framework/examples/before.json",
"after": "migration-validation-framework/examples/after.json",
"timestamp": "2026-04-29T23:29:07.510774"
}
}
@@ -0,0 +1,19 @@
# Scenario: Before/After Migration Comparison
## Description
Compare a pre-cutover host snapshot against a post-cutover snapshot and determine whether the migrated system is ready for production traffic.
## Commands
```bash
cd migration-validation-framework
python3 cli.py compare examples/before.json examples/after.json --output /tmp/migration-diff.json
```
## Expected Result
- The command writes a JSON diff.
- The result is `FAIL` because `sshd` is in a failed state after migration.
- The risk assessment highlights the `/var` disk usage increase.
- The remediation path is to restore SSH and reduce or expand `/var` before approving cutover.
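The expected result above can be asserted programmatically in a change-gate script; this sketch inlines the relevant fields of the written diff instead of reading `/tmp/migration-diff.json`, and `migration_passed` is an illustrative helper.

```python
def migration_passed(diff: dict) -> bool:
    """True when the validation gate in the diff allows the cutover."""
    return bool(diff.get("validation_results", {}).get("passed"))

# Inline stand-in for json.load(open("/tmp/migration-diff.json")).
diff = {"validation_results": {"passed": False, "result": "FAIL"}}
print("PASS" if migration_passed(diff) else "FAIL")  # prints FAIL
```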
@@ -37,9 +37,12 @@ class SnapshotComparator:
         # Compare each data type
         data_types = ["mounts", "services", "disk_usage"]

+        data1 = snapshot1.get("data", {})
+        data2 = snapshot2.get("data", {})
+
         for data_type in data_types:
-            if data_type in snapshot1.get("data", {}) and data_type in snapshot2.get("data", {}):
-                differences = self.compare_data_type(snapshot1["data"], snapshot2["data"], data_type)
+            if self.data_type_exists(data1, data_type) or self.data_type_exists(data2, data_type):
+                differences = self.compare_data_type(data1, data2, data_type)
                 comparison["differences"][data_type] = differences

         # Generate summary
@@ -50,10 +53,15 @@ class SnapshotComparator:
         # Validation results
         comparison["validation_results"] = self.validate_changes(comparison["differences"])
+        comparison["validation_results"]["result"] = "PASS" if comparison["validation_results"]["passed"] else "FAIL"

         logger.info("Snapshot comparison completed")
         return comparison

+    def data_type_exists(self, systems: Dict[str, Any], data_type: str) -> bool:
+        """Return true when at least one system has the requested collector data."""
+        return any(data_type in system_data for system_data in systems.values())
+
     def compare_data_type(self, data1: Dict[str, Any], data2: Dict[str, Any], data_type: str) -> Dict[str, Any]:
         """Compare a specific data type between two snapshots."""
         differences = {}
@@ -237,7 +245,7 @@ class SnapshotComparator:
     def generate_summary(self, differences: Dict[str, Any]) -> Dict[str, Any]:
         """Generate a summary of all differences."""
         summary = {
-            "total_systems": len(differences),
+            "total_systems": 0,
             "systems_with_changes": 0,
             "total_changes": 0,
             "changes_by_type": {},
@@ -259,6 +267,8 @@ class SnapshotComparator:
                 summary["changes_by_type"][data_type] += change_count
                 summary["total_changes"] += change_count

+        summary["total_systems"] = len(system_change_counts)
+
         # Count systems with changes
         summary["systems_with_changes"] = len([s for s in system_change_counts.values() if s > 0])
@@ -488,4 +498,4 @@ class SnapshotComparator:

 def compare_snapshots(snapshot1: Dict[str, Any], snapshot2: Dict[str, Any]) -> Dict[str, Any]:
     """Main comparison function."""
     comparator = SnapshotComparator()
     return comparator.compare_snapshots(snapshot1, snapshot2)
+10
@@ -0,0 +1,10 @@
.PHONY: run test demo
run:
docker compose up -d
test:
docker compose config
demo:
./scenarios/incident_simulation.sh comprehensive
+40 -434
@@ -1,461 +1,67 @@
 # Observability Stack
-A comprehensive monitoring and logging stack for enterprise infrastructure observability using the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.
-## Overview
-The Observability Stack provides a complete monitoring solution with:
-- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
-- **Logstash**: Data processing pipeline for log ingestion and transformation
-- **Kibana**: Visualization and exploration interface for Elasticsearch data
-- **Grafana**: Advanced metrics dashboarding and alerting platform
-- **Sample Logs**: Realistic log data for testing and demonstration
-- **Alerting**: Automated incident detection and notification rules
-- **Incident Simulation**: Scenarios for testing monitoring and response procedures
-## Architecture
+## Problem Statement
+Operations teams need correlated logs, dashboards, and alert examples that make incidents observable before they become customer-facing outages. A stack that only starts containers is not enough; it also needs meaningful sample data and incident exercises.
+## Solution Overview
+This project defines a local observability environment with Elasticsearch, Logstash, Kibana, Grafana, Filebeat, alert rules, sample logs, and an incident simulation script. It is built to demonstrate practical monitoring workflows rather than a production-sized cluster.
+## Architecture Overview
 ```
-┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
-│   Log Sources   │    │    Logstash     │    │  Elasticsearch  │
-│  (Applications  │───►│  (Ingestion &   │───►│   (Storage &    │
-│   / Systems)    │    │   Processing)   │    │   Analytics)    │
-└─────────────────┘    └─────────────────┘    └─────────────────┘
-        │                       │                       │
-        ▼                       ▼                       ▼
-┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
-│    Alerting     │    │     Kibana      │    │     Grafana     │
-│     Rules       │    │  (Dashboards &  │    │   (Metrics &    │
-│                 │    │   Exploration)  │    │   Dashboards)   │
-└─────────────────┘    └─────────────────┘    └─────────────────┘
+Application/System Logs -> Filebeat -> Logstash -> Elasticsearch -> Kibana
+                                                        |
+                                                        v
+                                                     Grafana
+
+Incident Scenario -> Sample Logs -> Alert Rules -> Operator Review
 ```
-## Quick Start
-### Prerequisites
-- Docker and Docker Compose
-- At least 4GB RAM available
-- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available
-### Setup
+Core components:
+- `docker-compose.yml` defines the observability services.
+- `alerting/alert_rules.yml` records alert intent and severity.
+- `logs/` contains representative operational logs.
+- `scenarios/incident_simulation.sh` emits incident activity.
+- `examples/` contains sample alert and log outputs.
+## How to Run
 ```bash
 cd observability-stack
-# Start the observability stack
-docker-compose up -d
-# Wait for services to be ready (may take 2-3 minutes)
-sleep 180
-# Verify services are running
-curl -X GET "localhost:9200/_cluster/health?pretty"
-curl -X GET "localhost:5601/api/status"
-curl -X GET "localhost:3000/api/health"
+# Validate the compose model.
+make test
+# Start the stack.
+make run
+# Run the incident simulation.
+make demo
+# Stop the stack.
+docker compose down
 ```
-### Access Interfaces
-- **Kibana**: http://localhost:5601 (admin/elastic)
-- **Grafana**: http://localhost:3000 (admin/admin)
-- **Elasticsearch**: http://localhost:9200
-## Project Structure
-```
-observability-stack/
-├── docker-compose.yml          # Service orchestration
-├── logstash/                   # Logstash configuration
-│   ├── pipeline/               # Processing pipelines
-│   └── config/                 # Logstash settings
-├── elasticsearch/              # Elasticsearch configuration
-│   └── config/                 # Cluster settings
-├── kibana/                     # Kibana configuration
-│   └── config/                 # Dashboard settings
-├── grafana/                    # Grafana configuration
-│   ├── provisioning/           # Dashboards and datasources
-│   └── dashboards/             # Dashboard definitions
-├── logs/                       # Sample log data
-│   └── sample.log              # Realistic application logs
-├── alerting/                   # Alert configuration
-│   └── alert_rules.yml         # Alert definitions
-├── scenarios/                  # Incident simulation
-│   └── incident_simulation.sh  # Simulation scripts
-└── README.md
-```
+When running locally:
+- Kibana: `http://localhost:5601`
+- Grafana: `http://localhost:3000`
+- Elasticsearch: `http://localhost:9200`
+## Example Output
+```text
+[2026-04-29 04:18:23] WARN Database connection pool nearing capacity
+[2026-04-29 04:18:28] ERROR Database connection pool exhausted
+[2026-04-29 04:18:33] ERROR Database query timeout occurred
+[2026-04-29 04:18:44] INFO Database connections restored
+```
-## Services Configuration
-### Elasticsearch
-**Configuration**: `elasticsearch/config/elasticsearch.yml`
+Additional examples are available in [examples/alert-output.txt](examples/alert-output.txt) and [examples/sample-log.txt](examples/sample-log.txt).
+## Real-World Use Case
+A platform team can use this project to explain how logs move through an ingestion pipeline, how alert rules map to operational symptoms, and how incident exercises create evidence for on-call readiness reviews.
Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access
**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data
### Logstash
**Pipelines**: `logstash/pipeline/`
- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing
**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
### Kibana
**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard
**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting
### Grafana
**Data Sources**:
- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data
**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
## Log Ingestion
### Sample Data
The stack includes realistic sample logs for testing:
```bash
# Ingest sample logs
curl -X POST "localhost:8080" \
-H "Content-Type: application/json" \
-d @logs/sample.log
```
### Log Formats Supported
- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
### Data Enrichment
Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
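The timestamp-normalization step above is done by Logstash filters in the stack; as a self-contained illustration of the same idea, the sketch below coerces two common log timestamp formats to UTC ISO 8601 in Python. The format list is an assumption, not the pipeline's actual configuration.

```python
from datetime import datetime, timezone

# Apache combined-log style and plain "YYYY-MM-DD HH:MM:SS" style (assumed inputs).
FORMATS = ["%d/%b/%Y:%H:%M:%S %z", "%Y-%m-%d %H:%M:%S"]

def normalize_timestamp(raw: str) -> str:
    """Coerce known log timestamp formats to UTC ISO 8601."""
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=timezone.utc)  # assume UTC when unlabeled
        return parsed.astimezone(timezone.utc).isoformat()
    raise ValueError(f"Unrecognized timestamp: {raw}")

print(normalize_timestamp("29/Apr/2026:04:18:23 +0000"))  # 2026-04-29T04:18:23+00:00
```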
## Alerting and Monitoring
### Alert Rules
Located in `alerting/alert_rules.yml`:
```yaml
alert_rules:
- name: "High CPU Usage"
condition: "cpu_usage > 90"
duration: "5m"
severity: "critical"
channels: ["email", "slack"]
- name: "Disk Space Low"
condition: "disk_usage > 85"
duration: "10m"
severity: "warning"
channels: ["email"]
- name: "Service Down"
condition: "service_status == 'down'"
duration: "2m"
severity: "critical"
channels: ["email", "pagerduty"]
```
### Alert Channels
- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints
## Incident Simulation
### Available Scenarios
```bash
cd scenarios
# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical
# Simulate service failure
./incident_simulation.sh --type service-down --service nginx
# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms
# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```
### Scenario Types
- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation
## Dashboards and Visualization
### Kibana Dashboards
Pre-configured dashboards for:
1. **Log Analysis**
- Log volume over time
- Error rate trends
- Top error messages
- Geographic request distribution
2. **System Monitoring**
- CPU and memory usage
- Disk I/O statistics
- Network traffic
- System load averages
3. **Application Performance**
- Response time distributions
- Request rate metrics
- Error percentages
- User session analytics
### Grafana Dashboards
Advanced visualization panels:
- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting
## API Endpoints
### Elasticsearch APIs
```bash
# Cluster health
GET /_cluster/health
# Index statistics
GET /_cat/indices?v
# Search logs
GET /logs-*/_search
{
"query": {
"match": {
"message": "ERROR"
}
}
}
```
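Outside the Kibana Dev Tools console, the same search is issued over HTTP. This sketch renders the equivalent curl command (credentials and port follow the defaults used elsewhere in this README):

```python
import json

# The search body from the console example above.
body = {"query": {"match": {"message": "ERROR"}}}

# Render a curl command equivalent to: GET /logs-*/_search
cmd = (
    "curl -u elastic:elastic "
    '-H "Content-Type: application/json" '
    '"localhost:9200/logs-*/_search" '
    "-d '" + json.dumps(body) + "'"
)
print(cmd)
```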
### Kibana APIs
```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard
# Export visualizations
GET /api/saved_objects/visualization/{id}
```
### Grafana APIs
```bash
# Get dashboard list
GET /api/search?query=*
# Alert status
GET /api/alerts
```
## Configuration Management
### Environment Variables
```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"
# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"
# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```
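The heap values above are fixed strings; deriving them from one base setting keeps them consistent when you resize. The half-of-Elasticsearch ratio for Logstash below is an illustrative assumption, not an official sizing rule:

```bash
# Derive Java heap options from a single base value (illustrative sizing only).
es_heap_mb=1024
ls_heap_mb=$((es_heap_mb / 2))   # assumption: Logstash gets half the ES heap
export ES_JAVA_OPTS="-Xms${es_heap_mb}m -Xmx${es_heap_mb}m"
export LS_JAVA_OPTS="-Xms${ls_heap_mb}m -Xmx${ls_heap_mb}m"
echo "$ES_JAVA_OPTS / $LS_JAVA_OPTS"
```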
### Scaling Configuration
For a production-style deployment, raise replica counts and set explicit resource limits:
```yaml
version: '3.8'
services:
elasticsearch:
deploy:
replicas: 3
resources:
limits:
memory: 4G
cpus: '2.0'
```
## Security Considerations
### Authentication
- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled
### Network Security
- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)
### Data Protection
- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
## Troubleshooting
### Common Issues
**Elasticsearch Won't Start:**
```bash
# Check memory allocation
docker-compose logs elasticsearch
# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```
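A frequent startup failure on Linux hosts is a low `vm.max_map_count`: Elasticsearch requires at least 262144. A small check, fed a fixed value here for demonstration:

```bash
# Elasticsearch refuses to start if vm.max_map_count is below 262144 on Linux.
check_max_map_count() {
  if [ "$1" -lt 262144 ]; then
    echo "too-low: raise with: sysctl -w vm.max_map_count=262144"
  else
    echo "ok"
  fi
}

# In practice, pass the live value: check_max_map_count "$(sysctl -n vm.max_map_count)"
check_max_map_count 65530
```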
**Logstash Pipeline Errors:**
```bash
# Check pipeline configuration
docker-compose logs logstash
# Validate pipeline syntax
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
```
**Kibana Connection Issues:**
```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"
# Check Kibana logs
docker-compose logs kibana
```
### Performance Tuning
**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization
**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues
**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
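For Logstash, these knobs live in `logstash.yml`; the values below are starting points for experimentation, not tuned recommendations:

```yaml
# logstash.yml sketch (illustrative values)
pipeline.workers: 4        # adjust worker threads
pipeline.batch.size: 250   # events per worker batch
queue.type: persisted      # enable persistent queues
queue.max_bytes: 1gb       # disk cap for the persisted queue
```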
## Development and Testing
### Adding New Dashboards
1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service
### Custom Alert Rules
1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios
### Log Pipeline Development
1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana
## Backup and Recovery
### Data Backup
```bash
# Register a snapshot repository once before the first snapshot
# (the location must resolve under a path.repo entry in elasticsearch.yml)
curl -X PUT "localhost:9200/_snapshot/backup" \
  -H "Content-Type: application/json" \
  -d '{"type": "fs", "settings": {"location": "backup"}}'
# Take an Elasticsearch snapshot of all indices
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```
### Configuration Backup
```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
logstash/ elasticsearch/ kibana/ grafana/
```
## Contributing
1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility
## License
Enterprise Internal Use Only
@@ -1,5 +1,3 @@
-version: '3.8'
 services:
   elasticsearch:
     image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
@@ -119,4 +117,4 @@ networks:
     driver: bridge
     ipam:
       config:
         - subnet: 172.25.0.0/16
@@ -0,0 +1,30 @@
# Observability Stack Architecture
## Components
- Filebeat: tails sample and container logs.
- Logstash: receives and processes log events.
- Elasticsearch: stores searchable observability data.
- Kibana: supports log exploration and dashboards.
- Grafana: provides operational dashboards.
- Alert rules: document symptoms, thresholds, and severity.
- Incident simulation: generates controlled failure signals.
## Data Flow
```
Log source -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                                            |
                                            v
                                         Grafana
```
Incident exercises follow this flow:
```
Operator -> incident_simulation.sh -> logs/incident_simulation.log -> Filebeat -> Logstash -> alerts/dashboards
```
## Notes
This is a local demonstration stack, not a production Elasticsearch deployment. A production version would add dedicated nodes, TLS, secret management, retention policies, index lifecycle management, and external alert delivery.
@@ -0,0 +1,4 @@
2026-04-29T04:19:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=100 threshold=95 status=firing
2026-04-29T04:19:30Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=7.8 threshold=5.0 status=firing
2026-04-29T04:22:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=71 threshold=95 status=resolved
2026-04-29T04:23:15Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=1.2 threshold=5.0 status=resolved
@@ -0,0 +1,5 @@
2026-04-29T04:18:21Z INFO service=checkout-api host=app-web-02 request_id=8f4b2 path=/checkout status=200 latency_ms=142
2026-04-29T04:18:28Z WARN service=checkout-api host=app-web-02 event=db_pool_pressure active=92 max=100
2026-04-29T04:18:33Z ERROR service=checkout-api host=app-web-02 event=db_timeout query=CreateOrder timeout_ms=5000 customer_tier=enterprise
2026-04-29T04:18:39Z ERROR service=checkout-api host=app-web-02 event=payment_retry_exhausted order_id=ord-104288 provider=stripe
2026-04-29T04:18:44Z INFO service=checkout-api host=app-web-02 event=recovery db_pool_active=48
@@ -0,0 +1,21 @@
# Scenario: Incident Simulation
## Description
Generate a controlled application and infrastructure incident so the logging pipeline, alert rules, and dashboards can be reviewed with realistic event timing.
## Commands
```bash
cd observability-stack
docker compose config
./scenarios/incident_simulation.sh comprehensive
tail -n 40 logs/incident_simulation.log
```
## Expected Result
- The compose file validates successfully.
- The simulation writes a sequence of CPU, memory, service, database, and application error events.
- Alert examples indicate firing and resolved states.
- Operators can trace incident progression through logs and dashboard queries.
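The key=value alert lines (as in the sample alert log above) can also be checked mechanically when reviewing a run. This sketch parses them and reports alerts that fired without a matching resolved event; field names match the sample log format:

```python
def parse_line(line):
    # First token is the timestamp; the rest are key=value pairs.
    ts, _, rest = line.partition(" ")
    fields = dict(kv.split("=", 1) for kv in rest.split())
    fields["timestamp"] = ts
    return fields

def unresolved(lines):
    # Keep the last reported status per alert name; report those still firing.
    state = {}
    for fields in map(parse_line, lines):
        state[fields["alert"]] = fields["status"]
    return sorted(name for name, status in state.items() if status == "firing")

sample = [
    "2026-04-29T04:19:00Z alert=database_connection_pool_exhausted severity=critical status=firing",
    "2026-04-29T04:22:00Z alert=database_connection_pool_exhausted severity=critical status=resolved",
    "2026-04-29T04:23:00Z alert=api_error_rate_high severity=warning status=firing",
]
print(unresolved(sample))  # → ['api_error_rate_high']
```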