feat: Add comprehensive enterprise Linux infrastructure portfolio with Ansible, Python, and ELK stack

Mateusz Suski
2026-04-29 23:14:14 +00:00
parent 2313efac88
commit 7757020014
33 changed files with 6165 additions and 0 deletions
+118
@@ -0,0 +1,118 @@
name: CI Pipeline
on:
push:
branches: [ main, develop ]
pull_request:
branches: [ main ]
jobs:
lint-ansible:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Install Ansible Lint
run: pip install ansible-lint
- name: Lint Ansible Playbooks
run: |
cd enterprise-infra-simulator
ansible-lint playbooks/*.yml
- name: Check Ansible Syntax
run: |
cd enterprise-infra-simulator
ansible-playbook --syntax-check playbooks/*.yml
test-python:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.8'
- name: Install Dependencies
run: |
cd migration-validation-framework
pip install -r requirements.txt
- name: Run Python Tests
run: |
cd migration-validation-framework
python -m pytest tests/ -v --cov=. --cov-report=xml
- name: Lint Python Code
run: |
pip install flake8 black isort
cd migration-validation-framework
flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
black --check .
isort --check-only .
validate-docker:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Validate Docker Compose
run: |
cd observability-stack
docker-compose config
- name: Check Docker Images
run: |
cd observability-stack
docker-compose pull --quiet
security-scan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Run Trivy vulnerability scanner
uses: aquasecurity/trivy-action@master
with:
scan-type: 'fs'
scan-ref: '.'
format: 'sarif'
output: 'trivy-results.sarif'
- name: Upload Trivy scan results to GitHub Security tab
uses: github/codeql-action/upload-sarif@v2
if: always()
with:
sarif_file: 'trivy-results.sarif'
documentation:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Check Documentation
run: |
# Install the checkers used below (not preinstalled on the runner)
npm install -g markdown-link-check
pip install yamllint
# Check for broken links in README files
find . -name "README.md" -exec markdown-link-check {} \;
# Validate YAML files
find . -name "*.yml" -o -name "*.yaml" | xargs -I {} yamllint {}
integration-test:
runs-on: ubuntu-latest
needs: [lint-ansible, test-python, validate-docker]
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.8'
- name: Install Dependencies
run: |
pip install ansible docker-compose
pip install -r migration-validation-framework/requirements.txt
- name: Run Integration Tests
run: |
# Start infrastructure simulator
cd enterprise-infra-simulator
make up
sleep 30
# Run basic validation
ansible -i inventory/hosts.ini all -m ping
# Test migration framework
cd ../migration-validation-framework
python cli.py --help
# Validate observability stack
cd ../observability-stack
docker-compose config
# Cleanup
cd ../enterprise-infra-simulator
make destroy
+59
@@ -0,0 +1,59 @@
# Enterprise Infrastructure Portfolio
This monorepo showcases enterprise-level Linux infrastructure engineering through three projects that demonstrate DevOps and platform engineering best practices.
## Projects Overview
### 1. Enterprise Infrastructure Simulator
A container-based simulation of enterprise Linux infrastructure with Ansible automation for provisioning, patching, hardening, and decommissioning operations. Includes failure simulation and scaling scenarios.
**Key Features:**
- Multi-node Linux simulation using Docker containers
- Ansible playbooks for infrastructure lifecycle management
- Automated scaling and failure injection scripts
- Enterprise-grade inventory and scenario management
### 2. Migration Validation Framework
A Python CLI tool for validating system migrations by collecting, comparing, and reporting on system state changes. Generates comprehensive before/after snapshots and HTML reports.
**Key Features:**
- Automated system data collection (mounts, services, disk usage)
- JSON snapshot generation and comparison
- HTML report generation with change visualization
- CLI interface for enterprise migration workflows
### 3. Observability Stack
A complete monitoring and logging stack using ELK (Elasticsearch, Logstash, Kibana) and Grafana for comprehensive infrastructure observability.
**Key Features:**
- Docker Compose deployment of full observability stack
- Sample log ingestion and processing pipelines
- Alerting rules and incident simulation scenarios
- Real-time dashboards and monitoring capabilities
## Architecture
See [docs/architecture.md](docs/architecture.md) for detailed architecture overview.
## Runbooks
Operational procedures and troubleshooting guides available in [docs/runbooks.md](docs/runbooks.md).
## Getting Started
Each project contains its own README.md with setup and usage instructions.
## CI/CD
Automated testing and linting via Gitea workflows in [.gitea/workflows/](.gitea/workflows/).
## Prerequisites
- Docker and Docker Compose
- Ansible
- Python 3.8+
- Make
## License
Enterprise Internal Use Only
+147
@@ -0,0 +1,147 @@
# Architecture Overview
## Enterprise Infrastructure Portfolio Architecture
This document provides a high-level overview of the architecture and design principles implemented across the three main projects in this portfolio.
## Overall Architecture
```
┌─────────────────────────────────────────────────────────────┐
│ Enterprise Portfolio │
├─────────────────────────────────────────────────────────────┤
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ Infra Simulator│ │Migration │ │Observability│ │
│ │ (Ansible/Docker│ │Validation │ │Stack │ │
│ │ Container Sim) │ │(Python CLI) │ │(ELK/Grafana)│ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
├─────────────────────────────────────────────────────────────┤
│ Infrastructure Simulation │ Validation Framework │ Monitoring │
└─────────────────────────────────────────────────────────────┘
```
## Project Architectures
### 1. Enterprise Infrastructure Simulator
**Architecture Pattern:** Container-based Infrastructure Simulation
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Ansible │ │ Docker │ │ Simulation │
│ Controller │◄──►│ Containers │◄──►│ Scripts │
│ │ │ (Linux Nodes) │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Inventory │ │ Playbooks │ │ Scenarios │
│ Management │ │ (Provision/ │ │ (Scaling/ │
│ │ │ Patch/ │ │ Failures) │
│ │ │ Harden/ │ │ │
│ │ │ Decommission)│ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
**Key Components:**
- **Ansible Controller:** Central orchestration for infrastructure operations
- **Docker Containers:** Simulated Linux nodes with realistic configurations
- **Simulation Scripts:** Automated scaling and failure injection
- **Inventory System:** Dynamic host management and grouping
- **Playbook Library:** Modular automation for different lifecycle phases
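The grouping can be inspected directly with standard Ansible tooling, for example:
```bash
# Render the inventory tree: groups (webservers, databases, ...) and their hosts
ansible-inventory -i enterprise-infra-simulator/inventory/hosts.ini --graph

# Ad-hoc connectivity check against a single group
ansible -i enterprise-infra-simulator/inventory/hosts.ini webservers -m ping
```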
### 2. Migration Validation Framework
**Architecture Pattern:** Data Collection and Comparison Pipeline
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ CLI Interface │ │ Data │ │ Validation │
│ (cli.py) │◄──►│ Collectors │◄──►│ Engine │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ JSON │ │ Comparison │ │ HTML │
│ Snapshots │ │ Logic │ │ Reports │
│ (Before/After)│ │ │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
**Key Components:**
- **CLI Interface:** Command-line tool for migration workflow orchestration
- **Data Collectors:** Specialized modules for system data extraction
- **Validation Engine:** Snapshot comparison and difference analysis
- **Report Generator:** HTML output with change visualization
- **JSON Storage:** Structured data persistence for before/after states
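These components chain into a four-step pipeline; a usage sketch based on the CLI calls documented in the runbooks (label names are illustrative):
```bash
# Collectors write JSON snapshots before and after a change
python cli.py snapshot --env staging --label before
python cli.py snapshot --env staging --label after

# Validation engine diffs the snapshots; report generator renders the result
python cli.py compare before after
python cli.py report --comparison-id <comparison-id> --format html
```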
### 3. Observability Stack
**Architecture Pattern:** Distributed Monitoring and Logging
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Logstash │ │ Elasticsearch │ │ Kibana │
│ (Ingestion) │◄──►│ (Storage) │◄──►│ (Visualization)│
└─────────────────┘ └─────────────────┘ └─────────────────┘
▲ ▲ ▲
│ │ │
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Sample Logs │ │ Alert Rules │ │ Grafana │
│ (Data Sources)│ │ (Conditions) │ │ (Dashboards) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
**Key Components:**
- **Logstash Pipelines:** Data ingestion and transformation
- **Elasticsearch Cluster:** Distributed search and analytics
- **Kibana Dashboards:** Real-time visualization and exploration
- **Grafana Integration:** Advanced metrics and alerting
- **Alerting Engine:** Automated incident detection and notification
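A minimal end-to-end smoke test of the pipeline, assuming the ingestion endpoint on port 8080 from the runbooks and a default `logstash-*` index pattern:
```bash
# Push one test event through the Logstash ingestion endpoint
curl -X POST "localhost:8080" -H "Content-Type: application/json" \
  -d '{"message": "pipeline-smoke-test"}'

# Confirm it was indexed in Elasticsearch (index pattern is an assumption)
curl "localhost:9200/logstash-*/_search?q=message:pipeline-smoke-test&pretty"
```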
## Design Principles
### Infrastructure as Code
- All infrastructure defined in code (Ansible, Docker Compose, Python)
- Version-controlled configurations and automation
- Reproducible environments and deployments
### Modular Architecture
- Separated concerns across projects and components
- Reusable modules and playbooks
- Clear interfaces between systems
### Enterprise Standards
- Realistic naming conventions and structures
- Production-quality error handling and logging
- Security hardening and compliance considerations
### Observability First
- Comprehensive logging and monitoring
- Automated alerting and incident response
- Performance metrics and health checks
## Technology Stack
- **Containerization:** Docker, Docker Compose
- **Configuration Management:** Ansible
- **Programming Language:** Python 3.8+
- **Monitoring Stack:** ELK Stack (Elasticsearch, Logstash, Kibana)
- **Visualization:** Grafana
- **CI/CD:** Gitea Actions
- **Documentation:** Markdown
## Security Considerations
- Container security scanning integration
- Ansible vault for secrets management
- Network segmentation in Docker Compose
- Least privilege access principles
- Audit logging and compliance reporting
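As a sketch of the vault workflow, the admin password hard-coded in `playbooks/provision.yml` could live in an encrypted vars file instead (the file path is illustrative):
```bash
# Create an encrypted vars file holding the admin password
ansible-vault create group_vars/all/vault.yml

# Run playbooks with the vault unlocked
ansible-playbook -i inventory/hosts.ini playbooks/provision.yml --ask-vault-pass
```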
## Scalability and Performance
- Horizontal scaling through container orchestration
- Efficient data collection and processing
- Optimized Elasticsearch indexing
- Resource-aware automation scripts
+329
@@ -0,0 +1,329 @@
# Runbooks and Operational Procedures
This document contains operational runbooks for deploying, managing, and troubleshooting the Enterprise Infrastructure Portfolio projects.
## Table of Contents
1. [Infrastructure Simulator Operations](#infrastructure-simulator-operations)
2. [Migration Validation Procedures](#migration-validation-procedures)
3. [Observability Stack Management](#observability-stack-management)
4. [Troubleshooting Guide](#troubleshooting-guide)
## Infrastructure Simulator Operations
### Starting the Infrastructure
```bash
cd enterprise-infra-simulator
make up
```
**Expected Outcome:**
- Docker containers for simulated Linux nodes are created
- Ansible inventory is populated
- Basic services are running on all nodes
**Verification:**
```bash
docker ps | grep infra-sim
ansible -i inventory/hosts.ini all -m ping
```
### Patching Operations
```bash
cd enterprise-infra-simulator
make patch
```
**Procedure:**
1. Backup current container states
2. Apply security patches via Ansible
3. Validate service availability
4. Generate patch report
**Rollback:**
```bash
docker-compose down
docker-compose up --scale node=0
make up
```
### Hardening Operations
```bash
cd enterprise-infra-simulator
ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
```
**Hardening Steps:**
- Disable unnecessary services
- Configure firewall rules
- Set secure SSH configurations
- Apply CIS benchmarks
### Scaling Operations
```bash
cd enterprise-infra-simulator
./scripts/simulate_scaling.sh up 3
```
**Scaling Parameters:**
- Direction: up/down
- Count: number of nodes to add/remove
- Type: web/db/lb/monitor (matching `scripts/simulate_scaling.sh`)
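Example invocations, using the positional interface of `scripts/simulate_scaling.sh`:
```bash
./scripts/simulate_scaling.sh up 2 web    # add 2 web nodes
./scripts/simulate_scaling.sh down 1 db   # remove 1 database node
```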
### Failure Simulation
```bash
cd enterprise-infra-simulator
./scripts/simulate_failure.sh network 300
```
**Failure Types:**
- network: Network partition
- disk: Disk space exhaustion
- service: Service crashes
- node: Complete node failure
### Decommissioning
```bash
cd enterprise-infra-simulator
make destroy
```
**Decommission Steps:**
1. Graceful service shutdown
2. Data backup and export
3. Configuration cleanup
4. Container removal
## Migration Validation Procedures
### Pre-Migration Snapshot
```bash
cd migration-validation-framework
python cli.py snapshot --env production --label pre-migration
```
**Data Collected:**
- Mount points and filesystem usage
- Running services and their states
- Disk usage statistics
- Network configurations
### Post-Migration Validation
```bash
python cli.py snapshot --env production --label post-migration
python cli.py compare pre-migration post-migration
```
**Validation Checks:**
- Service availability verification
- Filesystem integrity
- Configuration consistency
- Performance metrics comparison
### Report Generation
```bash
python cli.py report --comparison-id <comparison-id> --format html
```
**Report Contents:**
- Executive summary
- Detailed change log
- Risk assessment
- Recommendations
## Observability Stack Management
### Starting the Stack
```bash
cd observability-stack
docker-compose up -d
```
**Service Startup Order:**
1. Elasticsearch
2. Logstash
3. Kibana
4. Grafana
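`docker-compose up -d` starts all services at once; when ordering matters, one option is to gate the rest of the stack on Elasticsearch health (a sketch; service names are assumed to match the compose file):
```bash
docker-compose up -d elasticsearch
# Wait until the cluster reports yellow or green
until curl -fsS "localhost:9200/_cluster/health" | grep -qE '"status":"(yellow|green)"'; do
  sleep 5
done
docker-compose up -d logstash kibana grafana
```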
### Log Ingestion Testing
```bash
# Send sample logs
curl -X POST "localhost:8080" -H "Content-Type: application/json" -d @logs/sample.log
```
### Alert Configuration
```bash
# Load alert rules
curl -X POST "localhost:3000/api/alerts" -H "Authorization: Bearer <token>" -d @alerting/alert_rules.json
```
### Incident Simulation
```bash
cd observability-stack
./scenarios/incident_simulation.sh --type disk-full --severity critical
```
**Incident Types:**
- disk-full: Simulate disk space exhaustion
- service-down: Service failure simulation
- high-cpu: CPU utilization spike
- network-latency: Network performance degradation
## Troubleshooting Guide
### Common Issues
#### Ansible Connection Failures
**Symptoms:**
- `UNREACHABLE` errors in Ansible output
- SSH connection timeouts
**Resolution:**
```bash
# Check container status
docker ps | grep infra-sim
# Verify SSH keys
ansible -i inventory/hosts.ini all -m ping --private-key ~/.ssh/id_rsa
# Restart containers
make destroy && make up
```
#### Elasticsearch Cluster Issues
**Symptoms:**
- Kibana shows "No living connections"
- Logstash pipeline failures
**Resolution:**
```bash
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Restart services
docker-compose restart elasticsearch logstash kibana
```
#### Python Import Errors
**Symptoms:**
- ModuleNotFoundError in migration framework
- Collector failures
**Resolution:**
```bash
# Install dependencies
pip install -r requirements.txt
# Check Python path
python -c "import sys; print(sys.path)"
```
#### Docker Resource Constraints
**Symptoms:**
- Container startup failures
- Out of memory errors
**Resolution:**
```bash
# Check Docker resources
docker system df
# Clean up unused resources
docker system prune -a
# Raise container resource limits per service in docker-compose.yml
# (daemon.json has no global "memory"/"cpu-count" keys; Docker Desktop users
# set these in its Resources settings instead), e.g.:
#   deploy:
#     resources:
#       limits:
#         memory: 4G
#         cpus: '2'
```
### Log Locations
- **Ansible:** `enterprise-infra-simulator/ansible.log`
- **Docker:** `docker logs <container-name>`
- **Elasticsearch:** `observability-stack/logs/elasticsearch.log`
- **Migration Framework:** `migration-validation-framework/logs/validation.log`
### Performance Monitoring
```bash
# Infrastructure monitoring
ansible -i inventory/hosts.ini all -m shell -a "top -b -n1 | head -20"
# Elasticsearch metrics
curl -X GET "localhost:9200/_cluster/stats?pretty"
# Python performance
python -m cProfile cli.py snapshot
```
### Backup and Recovery
#### Infrastructure Backup
```bash
cd enterprise-infra-simulator
docker-compose exec ansible ansible-playbook /playbooks/backup.yml
```
#### Data Backup
```bash
cd observability-stack
docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup" -H "Content-Type: application/json" -d @backup_config.json
```
#### Migration Data Backup
```bash
cd migration-validation-framework
python cli.py backup --destination /backup/location
```
## Emergency Procedures
### Complete System Reset
```bash
# Stop all services (subshells keep each cd independent)
(cd observability-stack && docker-compose down -v)
(cd enterprise-infra-simulator && make destroy)
# Clean up volumes
docker volume prune -f
# Restart from clean state
(cd enterprise-infra-simulator && make up)
(cd observability-stack && docker-compose up -d)
```
### Incident Response
1. **Assess Impact:** Check monitoring dashboards
2. **Isolate Issue:** Use failure simulation scripts to reproduce
3. **Implement Fix:** Apply appropriate runbook procedure
4. **Validate Recovery:** Run validation framework
5. **Document Incident:** Update runbooks with lessons learned
## Maintenance Schedules
- **Daily:** Log rotation and cleanup
- **Weekly:** Security patching and updates
- **Monthly:** Performance optimization and capacity planning
- **Quarterly:** Architecture review and modernization
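A hypothetical cron sketch for the daily and weekly entries (installation paths are assumed; adjust to the deployment):
```bash
# /etc/cron.d/infra-portfolio
0 2 * * *  root  cd /opt/portfolio/enterprise-infra-simulator && make clean >> /var/log/maintenance.log 2>&1
0 3 * * 0  root  cd /opt/portfolio/enterprise-infra-simulator && make patch >> /var/log/maintenance.log 2>&1
```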
+166
@@ -0,0 +1,166 @@
# Enterprise Infrastructure Simulator Makefile
.PHONY: help up down patch destroy status logs clean test
# Default target
help: ## Show this help message
@echo "Enterprise Infrastructure Simulator"
@echo ""
@echo "Available commands:"
@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | sort | awk 'BEGIN {FS = ":.*?## "}; {printf " %-15s %s\n", $$1, $$2}'
# Infrastructure management
up: ## Start the infrastructure simulation
@echo "Starting enterprise infrastructure simulation..."
docker-compose up -d
@echo "Waiting for containers to be ready..."
@sleep 30
ansible-playbook -i inventory/hosts.ini playbooks/provision.yml
@echo "Infrastructure simulation started successfully"
down: ## Stop the infrastructure simulation
@echo "Stopping infrastructure simulation..."
ansible-playbook -i inventory/hosts.ini playbooks/decommission.yml || true
docker-compose down
@echo "Infrastructure simulation stopped"
patch: ## Apply security patches to all nodes
@echo "Applying security patches..."
ansible-playbook -i inventory/hosts.ini playbooks/patch.yml
@echo "Security patches applied"
destroy: ## Completely destroy the infrastructure
@echo "Destroying infrastructure..."
ansible-playbook -i inventory/hosts.ini playbooks/decommission.yml || true
docker-compose down -v --remove-orphans
docker system prune -f
rm -rf logs/* reports/*
@echo "Infrastructure completely destroyed"
# Scaling operations
scale-up-web: ## Scale up web servers (usage: make scale-up-web COUNT=2)
@echo "Scaling up $(COUNT) web servers..."
./scripts/simulate_scaling.sh up $(or $(COUNT),1) web
scale-up-db: ## Scale up database servers (usage: make scale-up-db COUNT=1)
@echo "Scaling up $(COUNT) database servers..."
./scripts/simulate_scaling.sh up $(or $(COUNT),1) db
scale-down-web: ## Scale down web servers (usage: make scale-down-web COUNT=1)
@echo "Scaling down $(COUNT) web servers..."
./scripts/simulate_scaling.sh down $(or $(COUNT),1) web
scale-down-db: ## Scale down database servers (usage: make scale-down-db COUNT=1)
@echo "Scaling down $(COUNT) database servers..."
./scripts/simulate_scaling.sh down $(or $(COUNT),1) db
# Failure simulation
fail-network: ## Simulate network failure (usage: make fail-network DURATION=60)
@echo "Simulating network failure for $(or $(DURATION),60) seconds..."
./scripts/simulate_failure.sh network $(or $(DURATION),60)
fail-disk: ## Simulate disk space exhaustion (usage: make fail-disk DURATION=120)
@echo "Simulating disk failure for $(or $(DURATION),120) seconds..."
./scripts/simulate_failure.sh disk $(or $(DURATION),120)
fail-service: ## Simulate service failures (usage: make fail-service DURATION=30)
@echo "Simulating service failure for $(or $(DURATION),30) seconds..."
./scripts/simulate_failure.sh service $(or $(DURATION),30)
fail-node: ## Simulate complete node failure (usage: make fail-node DURATION=300)
@echo "Simulating node failure for $(or $(DURATION),300) seconds..."
./scripts/simulate_failure.sh node $(or $(DURATION),300)
# Monitoring and status
status: ## Show infrastructure status
@echo "=== Docker Containers ==="
docker-compose ps
@echo ""
@echo "=== Ansible Inventory ==="
ansible -i inventory/hosts.ini --list-hosts all || echo "Inventory check failed"
@echo ""
@echo "=== System Resources ==="
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemPerc}}\t{{.NetIO}}"
logs: ## Show infrastructure logs
docker-compose logs -f --tail=100
logs-web: ## Show web server logs
docker-compose logs -f web
logs-db: ## Show database logs
docker-compose logs -f db
# Testing and validation
test: ## Run infrastructure tests
@echo "Running infrastructure tests..."
ansible -i inventory/hosts.ini all -m ping
ansible-playbook -i inventory/hosts.ini --syntax-check playbooks/*.yml
@echo "Testing scaling scripts..."
SIMULATION_MODE=true ./scripts/simulate_scaling.sh up 1 web # Simulation mode (the script rejects a count of 0)
./scripts/simulate_failure.sh network 1 # Quick test
@echo "All tests passed"
validate: ## Validate infrastructure configuration
@echo "Validating configuration..."
ansible-playbook -i inventory/hosts.ini playbooks/provision.yml --check
docker-compose config
@echo "Configuration validation complete"
# Scenarios
scenario-scaling: ## Run scaling event scenario
@echo "Running scaling event scenario..."
ansible-playbook -i inventory/hosts.ini scenarios/scaling_event.yml
scenario-disaster: ## Run disaster recovery scenario
@echo "Running disaster recovery scenario..."
ansible-playbook -i inventory/hosts.ini scenarios/disaster_recovery.yml
# Maintenance
clean: ## Clean up temporary files and logs
@echo "Cleaning up temporary files..."
rm -rf logs/*.log reports/*.txt
docker system prune -f
@echo "Cleanup complete"
backup: ## Create infrastructure backup
@echo "Creating infrastructure backup..."
mkdir -p backups/$(shell date +%Y%m%d_%H%M%S)
ansible-playbook -i inventory/hosts.ini playbooks/backup.yml
docker-compose exec ansible tar -czf /backups/infra_backup.tar.gz /infrastructure
@echo "Backup created"
# Development
lint: ## Lint Ansible playbooks
@echo "Linting Ansible playbooks..."
ansible-lint playbooks/*.yml scenarios/*.yml
@echo "Linting complete"
format: ## Format code and configuration
@echo "Formatting code..."
# Add formatting commands here
@echo "Formatting complete"
# Security
harden: ## Apply security hardening
@echo "Applying security hardening..."
ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
security-scan: ## Run security scans
@echo "Running security scans..."
ansible-playbook -i inventory/hosts.ini playbooks/security_scan.yml
# Help for specific targets
help-scaling: ## Show scaling-related commands
@echo "Scaling Commands:"
@echo " make scale-up-web COUNT=2 - Add 2 web servers"
@echo " make scale-up-db COUNT=1 - Add 1 database server"
@echo " make scale-down-web COUNT=1 - Remove 1 web server"
@echo " make scale-down-db COUNT=1 - Remove 1 database server"
help-failure: ## Show failure simulation commands
@echo "Failure Simulation Commands:"
@echo " make fail-network DURATION=60 - Network failure for 60s"
@echo " make fail-disk DURATION=120 - Disk exhaustion for 120s"
@echo " make fail-service DURATION=30 - Service failure for 30s"
@echo " make fail-node DURATION=300 - Node failure for 300s"
+268
@@ -0,0 +1,268 @@
# Enterprise Infrastructure Simulator
A container-based simulation environment for enterprise Linux infrastructure operations. This project provides Ansible automation for provisioning, patching, hardening, and decommissioning of simulated Linux nodes, along with scripts for scaling and failure simulation.
## Overview
The Enterprise Infrastructure Simulator creates a realistic environment for testing and demonstrating infrastructure automation at scale. It uses Docker containers to simulate multiple Linux nodes and provides comprehensive Ansible playbooks for enterprise operations.
## Architecture
- **Container Simulation:** Docker-based Linux nodes with realistic configurations
- **Ansible Automation:** Modular playbooks for infrastructure lifecycle management
- **Dynamic Inventory:** Automated host discovery and grouping
- **Simulation Scripts:** Automated scaling and failure injection
- **Scenario Management:** Pre-defined operational scenarios
## Quick Start
### Prerequisites
- Docker and Docker Compose
- Ansible 2.9+
- Make
### Setup
```bash
# Clone and navigate to project
cd enterprise-infra-simulator
# Start the infrastructure
make up
# Verify deployment
ansible -i inventory/hosts.ini all -m ping
```
## Available Operations
### Infrastructure Management
```bash
# Provision new nodes
ansible-playbook -i inventory/hosts.ini playbooks/provision.yml
# Apply security patches
make patch
# Harden systems
ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
# Decommission nodes
ansible-playbook -i inventory/hosts.ini playbooks/decommission.yml
# Destroy infrastructure
make destroy
```
### Simulation Operations
```bash
# Scale up infrastructure
./scripts/simulate_scaling.sh up 5
# Simulate network failure
./scripts/simulate_failure.sh network 300
# Run operational scenario
ansible-playbook -i inventory/hosts.ini scenarios/scaling_event.yml
```
## Project Structure
```
enterprise-infra-simulator/
├── inventory/ # Ansible inventory files
│ └── hosts.ini # Dynamic host inventory
├── playbooks/ # Ansible automation playbooks
│ ├── provision.yml # Node provisioning
│ ├── patch.yml # Security patching
│ ├── harden.yml # Security hardening
│ └── decommission.yml # Node decommissioning
├── scripts/ # Simulation and utility scripts
│ ├── simulate_scaling.sh # Infrastructure scaling
│ └── simulate_failure.sh # Failure injection
├── scenarios/ # Operational scenarios
│ └── scaling_event.yml # Scaling scenario
├── docker-compose.yml # Container orchestration
├── Makefile # Build automation
└── README.md
```
## Inventory Management
The simulator uses dynamic inventory with the following groups:
- `webservers`: Web application servers
- `databases`: Database servers
- `loadbalancers`: Load balancing infrastructure
- `monitoring`: Monitoring and logging servers
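Groups can be targeted directly in ad-hoc commands:
```bash
# List the hosts in one group, then ping only the databases
ansible -i inventory/hosts.ini webservers --list-hosts
ansible -i inventory/hosts.ini databases -m ping
```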
## Playbooks
### Provision Playbook
- Creates Docker containers with base Linux configurations
- Installs required packages and services
- Configures basic networking and security
- Registers nodes in inventory
### Patch Playbook
- Updates system packages
- Applies security patches
- Restarts services as needed
- Generates patch reports
### Harden Playbook
- Implements CIS security benchmarks
- Configures firewall rules
- Hardens SSH configuration
- Disables unnecessary services
### Decommission Playbook
- Gracefully stops services
- Exports configuration and data
- Removes containers
- Cleans up inventory
## Simulation Scripts
### Scaling Simulation
```bash
./scripts/simulate_scaling.sh [up|down] [count] [type]
```
Parameters:
- `direction`: up/down
- `count`: Number of nodes to add/remove
- `type`: Node type (web/db/lb/monitor)
### Failure Simulation
```bash
./scripts/simulate_failure.sh [failure_type] [duration_seconds]
```
Failure Types:
- `network`: Network connectivity issues
- `disk`: Disk space exhaustion
- `service`: Service failures
- `node`: Complete node outages
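Example invocations (intensity is set via the `INTENSITY` environment variable read by the script):
```bash
./scripts/simulate_failure.sh network 300               # 5-minute network partition
INTENSITY=high ./scripts/simulate_failure.sh disk 120   # aggressive disk fill for 2 minutes
```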
## Scenarios
Pre-defined operational scenarios for testing:
- **Scaling Event:** Automated scaling during traffic spikes
- **Disaster Recovery:** Node failure and recovery procedures
- **Maintenance Window:** Scheduled patching and updates
- **Security Incident:** Breach simulation and response
## Configuration
### Environment Variables
```bash
# Number of initial nodes
INFRA_NODE_COUNT=3
# Node types to deploy
INFRA_NODE_TYPES=web,db,lb
# Simulation parameters
SIMULATION_DURATION=3600
SIMULATION_INTENSITY=medium
```
### Docker Configuration
Container resources and networking are configured in `docker-compose.yml`:
```yaml
services:
infra-node:
image: ubuntu:20.04
deploy:
replicas: 3
resources:
limits:
memory: 512M
cpus: '0.5'
```
## Monitoring and Logging
- Ansible execution logs: `ansible.log`
- Container logs: `docker logs <container-name>`
- Simulation logs: `logs/simulation.log`
## Troubleshooting
### Common Issues
**Ansible Connection Failures:**
```bash
# Check container status
docker ps | grep infra-sim
# Verify SSH connectivity
ansible -i inventory/hosts.ini all -m ping
```
**Container Resource Issues:**
```bash
# Check Docker resources
docker system df
# Clean up containers
docker system prune
```
**Simulation Script Errors:**
```bash
# Check script permissions
chmod +x scripts/*.sh
# Verify script syntax (the scripts take positional arguments and have no --help flag)
bash -n scripts/simulate_failure.sh scripts/simulate_scaling.sh
```
## Development
### Adding New Playbooks
1. Create playbook in `playbooks/` directory
2. Follow Ansible best practices
3. Test with `--check` mode
4. Update documentation
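For step 3, a typical dry run looks like this (`new_playbook.yml` is a placeholder name):
```bash
ansible-playbook -i inventory/hosts.ini playbooks/new_playbook.yml --check --diff
```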
### Custom Scenarios
1. Define scenario in `scenarios/` directory
2. Include required variables
3. Test with dry-run
4. Document operational procedures
## Security Considerations
- Containers run with limited privileges
- SSH keys are generated per deployment
- Firewall rules are applied automatically
- Security scanning integrated in CI/CD
## Performance Optimization
- Container resource limits prevent resource exhaustion
- Ansible parallel execution for faster operations
- Efficient failure simulation without full outages
- Optimized Docker layer caching
## Contributing
1. Follow existing code structure and naming conventions
2. Add comprehensive documentation
3. Include tests for new functionality
4. Update runbooks for operational changes
## License
Enterprise Internal Use Only
@@ -0,0 +1,35 @@
[webservers]
web01 ansible_host=172.20.0.11 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
web02 ansible_host=172.20.0.12 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
web03 ansible_host=172.20.0.13 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
[databases]
db01 ansible_host=172.20.0.21 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
db02 ansible_host=172.20.0.22 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
[loadbalancers]
lb01 ansible_host=172.20.0.31 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
[monitoring]
mon01 ansible_host=172.20.0.41 ansible_user=root ansible_ssh_private_key_file=/root/.ssh/id_rsa
[all:vars]
ansible_python_interpreter=/usr/bin/python3
ansible_ssh_common_args='-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null'
ansible_connection=ssh
[webservers:vars]
node_type=web
environment=production
[databases:vars]
node_type=database
environment=production
[loadbalancers:vars]
node_type=loadbalancer
environment=production
[monitoring:vars]
node_type=monitoring
environment=production
@@ -0,0 +1,181 @@
---
- name: Decommission Enterprise Infrastructure Nodes
hosts: all
become: true
gather_facts: true
vars:
backup_data: true
export_config: true
graceful_shutdown: true
cleanup_inventory: true
pre_tasks:
- name: Check node health before decommissioning
uri:
url: http://localhost/health
method: GET
status_code: 200
register: health_check
ignore_errors: true
when: "'webservers' in group_names"
- name: Create decommissioning backup directory
file:
path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}"
state: directory
mode: '0755'
- name: Log decommissioning start
lineinfile:
path: "/var/log/decommission.log"
line: "{{ ansible_date_time.iso8601 }} - Starting decommissioning of {{ inventory_hostname }}"
create: yes
tasks:
- name: Stop application services gracefully
service:
name: "{{ item }}"
state: stopped
loop: "{{ application_services | default(['nginx', 'postgresql', 'haproxy']) }}"
ignore_errors: true
when: graceful_shutdown
- name: Wait for connections to drain
pause:
seconds: 30
when: graceful_shutdown and ('webservers' in group_names or 'loadbalancers' in group_names)
- name: Export configuration files
block:
- name: Create config export directory
file:
path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/config"
state: directory
- name: Archive system configuration
archive:
path:
- /etc/
- /opt/application/
dest: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/config/system_config.tar.gz"
format: gz
- name: Export service configurations
command: >
tar -czf /var/backups/decommission-{{ ansible_date_time.iso8601 }}/config/services.tar.gz
/etc/nginx /etc/postgresql /etc/haproxy
ignore_errors: true
when: export_config
- name: Backup application data
block:
- name: Create data backup directory
file:
path: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/data"
state: directory
- name: Backup database data
# shell (not command) is required for the output redirection
shell: >
pg_dumpall -U postgres > /var/backups/decommission-{{ ansible_date_time.iso8601 }}/data/database_backup.sql
ignore_errors: true
when: "'databases' in group_names"
- name: Backup application files
archive:
path: "/var/www/html"
dest: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/data/application_data.tar.gz"
format: gz
ignore_errors: true
when: "'webservers' in group_names"
- name: Backup monitoring data
archive:
path: "/var/lib/prometheus"
dest: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}/data/monitoring_data.tar.gz"
format: gz
ignore_errors: true
when: "'monitoring' in group_names"
when: backup_data
- name: Remove from load balancer
include_tasks: tasks/remove_from_lb.yml
when: "'webservers' in group_names or 'databases' in group_names"
- name: Update monitoring alerts
include_tasks: tasks/update_monitoring.yml
when: "'monitoring' not in group_names"
- name: Clean up application directories
file:
path: "{{ item }}"
state: absent
loop:
- /opt/application
- /var/www/html
- /var/lib/postgresql
- /var/lib/prometheus
ignore_errors: true
- name: Remove application packages
apt:
name: "{{ item }}"
state: absent
purge: yes
loop: "{{ application_packages | default(['nginx', 'postgresql', 'haproxy', 'prometheus']) }}"
when: ansible_os_family == "Debian"
ignore_errors: true
- name: Clean up system logs
command: >
find /var/log -name "*.log" -type f -exec truncate -s 0 {} \;
ignore_errors: true
- name: Remove SSH keys and known hosts
file:
path: "{{ item }}"
state: absent
loop:
- /root/.ssh/authorized_keys
- /root/.ssh/known_hosts
- /home/infra-admin/.ssh/authorized_keys
ignore_errors: true
- name: Generate decommissioning report
template:
src: templates/decommission_report.j2
dest: "/var/log/decommission_report_{{ ansible_date_time.iso8601 }}.log"
vars:
decommission_status: "SUCCESS"
backup_location: "/var/backups/decommission-{{ ansible_date_time.iso8601 }}"
post_tasks:
- name: Send decommissioning notification
mail:
to: "{{ decommission_notification_email | default('infra-team@company.com') }}"
subject: "Node Decommissioned - {{ inventory_hostname }}"
body: |
Node {{ inventory_hostname }} has been successfully decommissioned.
Backup location: /var/backups/decommission-{{ ansible_date_time.iso8601 }}
Services stopped: {{ application_services | default(['nginx', 'postgresql', 'haproxy']) | join(', ') }}
Configuration exported: {{ export_config }}
Data backed up: {{ backup_data }}
See /var/log/decommission_report_{{ ansible_date_time.iso8601 }}.log for details
when: decommission_notification_email is defined
ignore_errors: true
- name: Update dynamic inventory
include_tasks: tasks/update_inventory.yml
when: cleanup_inventory
- name: Final log entry
lineinfile:
path: "/var/log/decommission.log"
line: "{{ ansible_date_time.iso8601 }} - Decommissioning completed for {{ inventory_hostname }}"
- name: Shutdown node
command: shutdown -h now
async: 10
poll: 0
when: auto_shutdown | default(false) | bool
@@ -0,0 +1,210 @@
---
- name: Harden Enterprise Infrastructure Nodes
hosts: all
become: true
gather_facts: true
vars:
cis_level: 1
disable_root_login: true
secure_ssh_config: true
firewall_policy: deny
auditd_enabled: true
selinux_mode: enforcing
apparmor_enabled: true
tasks:
- name: Include CIS hardening tasks
include_tasks: tasks/cis_hardening.yml
- name: Configure SSH hardening
block:
- name: Disable root SSH login
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PermitRootLogin'
line: 'PermitRootLogin no'
when: disable_root_login
- name: Disable password authentication
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PasswordAuthentication'
line: 'PasswordAuthentication no'
- name: Set MaxAuthTries
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^MaxAuthTries'
line: 'MaxAuthTries 3'
- name: Disable empty passwords
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PermitEmptyPasswords'
line: 'PermitEmptyPasswords no'
- name: Set ClientAliveInterval
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^ClientAliveInterval'
line: 'ClientAliveInterval 300'
- name: Set ClientAliveCountMax
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^ClientAliveCountMax'
line: 'ClientAliveCountMax 2'
notify: restart sshd
- name: Set default firewall policy
# the ufw module takes one rule per task; it has no "rules" list parameter
ufw:
state: enabled
policy: "{{ firewall_policy }}"
- name: Allow SSH from internal networks
ufw:
rule: allow
port: '22'
proto: tcp
from_ip: "{{ item }}"
loop:
- 10.0.0.0/8
- 172.16.0.0/12
- 192.168.0.0/16
- name: Disable unnecessary services
service:
name: "{{ item }}"
state: stopped
enabled: no
loop:
- cups
- avahi-daemon
- bluetooth
- nfs-server
- rpcbind
ignore_errors: true
- name: Remove unnecessary packages
apt:
name: "{{ item }}"
state: absent
purge: yes
loop:
- telnet
- rsh-client
- talk
- ntalk
when: ansible_os_family == "Debian"
ignore_errors: true
- name: Configure auditd
block:
- name: Install auditd
apt:
name: auditd
state: present
when: ansible_os_family == "Debian"
- name: Configure audit rules
template:
src: templates/audit.rules.j2
dest: /etc/audit/rules.d/hardening.rules
- name: Enable auditd service
service:
name: auditd
state: started
enabled: yes
when: auditd_enabled
- name: Configure AppArmor
block:
- name: Install apparmor
apt:
name: apparmor
state: present
when: ansible_os_family == "Debian"
- name: Enable apparmor service
service:
name: apparmor
state: started
enabled: yes
when: apparmor_enabled and ansible_os_family == "Debian"
- name: Configure sysctl hardening
sysctl:
name: "{{ item.key }}"
value: "{{ item.value }}"
state: present
reload: yes
loop:
- { key: 'net.ipv4.ip_forward', value: '0' }
- { key: 'net.ipv4.conf.all.send_redirects', value: '0' }
- { key: 'net.ipv4.conf.default.send_redirects', value: '0' }
- { key: 'net.ipv4.tcp_syncookies', value: '1' }
- { key: 'net.ipv4.icmp_echo_ignore_broadcasts', value: '1' }
- name: Set secure file permissions on account files
file:
path: "{{ item.path }}"
mode: "{{ item.mode }}"
owner: root
group: "{{ item.group | default('root') }}"
loop:
- { path: /etc/passwd, mode: '0644' }
- { path: /etc/group, mode: '0644' }
# shadow files must not be world-readable
- { path: /etc/shadow, mode: '0640', group: shadow }
- { path: /etc/gshadow, mode: '0640', group: shadow }
- name: Lock inactive user accounts
command: usermod -L "{{ item }}"
loop: "{{ inactive_users | default([]) }}"
ignore_errors: true
- name: Configure resource limits (nofile)
pam_limits:
domain: '*'
limit_type: hard
limit_item: nofile
value: 1024
- name: Generate hardening report
template:
src: templates/hardening_report.j2
dest: "/var/log/hardening_report_{{ ansible_date_time.iso8601 }}.log"
handlers:
- name: restart sshd
service:
name: ssh
state: restarted
- name: restart auditd
service:
name: auditd
state: restarted
when: auditd_enabled
post_tasks:
- name: Run CIS compliance check
# shell with a block scalar lets $score/$total expand correctly
shell: |
score=0
total=0
echo 'CIS Compliance Check Results:' > /tmp/cis_check.log
# Add CIS checks here
echo "Overall Score: $score/$total" >> /tmp/cis_check.log
cat /tmp/cis_check.log
register: cis_check
changed_when: false
- name: Archive CIS results
copy:
content: "{{ cis_check.stdout }}"
dest: "/var/log/cis_compliance_{{ ansible_date_time.iso8601 }}.log"
@@ -0,0 +1,139 @@
---
- name: Apply Security Patches and Updates
hosts: all
become: true
gather_facts: true
vars:
patch_window_start: "02:00"
patch_window_end: "04:00"
reboot_required: false
security_only: true
pre_tasks:
- name: Check patch window
assert:
that: ansible_date_time.hour|int >= patch_window_start.split(':')[0]|int and ansible_date_time.hour|int < patch_window_end.split(':')[0]|int
fail_msg: "Current time {{ ansible_date_time.hour }}:{{ ansible_date_time.minute }} is outside patch window {{ patch_window_start }}-{{ patch_window_end }}"
when: enforce_patch_window | default(true) | bool
- name: Create patch backup
file:
path: "/var/backups/pre-patch-{{ ansible_date_time.iso8601 }}"
state: directory
- name: Backup package list
command: dpkg --get-selections
register: package_backup
changed_when: false
- name: Save package backup
copy:
content: "{{ package_backup.stdout }}"
dest: "/var/backups/pre-patch-{{ ansible_date_time.iso8601 }}/packages.list"
tasks:
- name: Update package cache
apt:
update_cache: yes
cache_valid_time: 300
when: ansible_os_family == "Debian"
- name: Check for available updates
# shell (not command) is needed for the pipeline
shell: apt list --upgradable 2>/dev/null | grep -v "Listing..." | wc -l
register: updates_available
changed_when: false
when: ansible_os_family == "Debian"
- name: Apply security updates only
# "upgrade: dist" applies everything; unattended-upgrade (installed during
# provisioning) limits the run to the configured security origins
command: unattended-upgrade -v
when: security_only and ansible_os_family == "Debian"
- name: Apply all updates
apt:
upgrade: dist
update_cache: yes
when: not security_only and ansible_os_family == "Debian"
- name: Check if reboot required
stat:
path: /var/run/reboot-required
register: reboot_required_file
when: ansible_os_family == "Debian"
- name: Set reboot flag
set_fact:
reboot_required: "{{ reboot_required_file.stat.exists | default(false) }}"
- name: Restart services after patching
service:
name: "{{ item }}"
state: restarted
loop:
- ssh  # Debian/Ubuntu unit name
- fail2ban
- unattended-upgrades
ignore_errors: true
- name: Update monitoring agent
include_role:
name: monitoring_agent_update
when: "'monitoring' in group_names"
- name: Verify critical services
service:
name: "{{ item }}"
state: started
loop:
- systemd-journald
- systemd-logind
- cron
ignore_errors: true
- name: Run post-patch health checks
uri:
url: http://localhost/health
method: GET
status_code: 200
register: health_result
ignore_errors: true
when: "'webservers' in group_names"
post_tasks:
- name: Generate patch report
template:
src: templates/patch_report.j2
dest: "/var/log/patch_report_{{ ansible_date_time.iso8601 }}.log"
vars:
patch_status: "{{ 'SUCCESS' if (health_result.status | default(0)) == 200 else 'WARNING' }}"
updates_applied: "{{ updates_available.stdout | default('0') }}"
reboot_needed: "{{ reboot_required }}"
- name: Send patch notification
mail:
to: "{{ patch_notification_email | default('infra-team@company.com') }}"
subject: "Patch Report - {{ inventory_hostname }}"
body: |
Patch completed for {{ inventory_hostname }}
Updates applied: {{ updates_available.stdout | default('0') }}
Reboot required: {{ reboot_required }}
Health check: {{ 'PASSED' if (health_result.status | default(0)) == 200 else 'FAILED' }}
See /var/log/patch_report_{{ ansible_date_time.iso8601 }}.log for details
when: patch_notification_email is defined
ignore_errors: true
- name: Schedule reboot if required
command: shutdown -r +5 "Rebooting for security patches"
when: reboot_required and auto_reboot | default(false) | bool
async: 600
poll: 0
handlers:
- name: restart monitoring
service:
name: "{{ monitoring_service | default('prometheus-node-exporter') }}"
state: restarted
when: "'monitoring' in group_names"
@@ -0,0 +1,158 @@
---
- name: Provision Enterprise Infrastructure Nodes
hosts: all
become: true
gather_facts: true
vars:
node_timezone: "UTC"
admin_user: "infra-admin"
ssh_port: 22
packages:
- curl
- wget
- vim
- htop
- net-tools
- iptables
- fail2ban
- unattended-upgrades
tasks:
- name: Update package cache
apt:
update_cache: yes
cache_valid_time: 3600
when: ansible_os_family == "Debian"
- name: Install base packages
apt:
name: "{{ packages }}"
state: present
when: ansible_os_family == "Debian"
- name: Create admin user
user:
name: "{{ admin_user }}"
groups: sudo
append: yes
create_home: yes
shell: /bin/bash
password: "{{ 'infra-admin-password' | password_hash('sha512') }}"
- name: Configure timezone
timezone:
name: "{{ node_timezone }}"
- name: Configure SSH
block:
- name: Disable root SSH login
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PermitRootLogin'
line: 'PermitRootLogin no'
- name: Set SSH port
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^Port'
line: "Port {{ ssh_port }}"
- name: Disable password authentication
lineinfile:
path: /etc/ssh/sshd_config
regexp: '^PasswordAuthentication'
line: 'PasswordAuthentication no'
- name: Restart SSH service
service:
name: ssh  # Debian/Ubuntu unit name
state: restarted
- name: Set default firewall policy
# the ufw module takes one rule per task; it has no "rules" list parameter
ufw:
state: enabled
policy: deny
- name: Allow SSH and web ports
ufw:
rule: allow
port: "{{ item }}"
proto: tcp
loop:
- "{{ ssh_port }}"
- '80'
- '443'
- name: Configure fail2ban
template:
src: templates/jail.local.j2
dest: /etc/fail2ban/jail.local
notify: restart fail2ban
- name: Enable unattended upgrades
lineinfile:
path: /etc/apt/apt.conf.d/20auto-upgrades
regexp: '^APT::Periodic::Unattended-Upgrade'
line: 'APT::Periodic::Unattended-Upgrade "1";'
when: ansible_os_family == "Debian"
- name: Create application directories
file:
path: "{{ item }}"
state: directory
owner: "{{ admin_user }}"
group: "{{ admin_user }}"
mode: '0755'
loop:
- /opt/application
- /var/log/application
- /etc/application
- name: Deploy monitoring agent
include_role:
name: monitoring_agent
when: "'monitoring' in group_names"
- name: Deploy web server
include_role:
name: nginx
when: "'webservers' in group_names"
- name: Deploy database server
include_role:
name: postgresql
when: "'databases' in group_names"
- name: Deploy load balancer
include_role:
name: haproxy
when: "'loadbalancers' in group_names"
- name: Generate provisioning report
template:
src: templates/provisioning_report.j2
dest: /var/log/provisioning_report_{{ ansible_date_time.iso8601 }}.log
delegate_to: localhost
handlers:
- name: restart fail2ban
service:
name: fail2ban
state: restarted
post_tasks:
- name: Verify services
service:
name: "{{ item }}"
state: started
enabled: yes
loop: "{{ services_to_verify | default([]) }}"
ignore_errors: true
- name: Run health checks
uri:
url: http://localhost/health
method: GET
register: health_check
ignore_errors: true
when: "'webservers' in group_names"
@@ -0,0 +1,116 @@
---
- name: Enterprise Scaling Event Scenario
hosts: all
become: yes
gather_facts: yes
vars:
scaling_threshold: 80
cooldown_period: 300
max_scale_up: 5
min_instances: 2
pre_tasks:
- name: Log scenario start
lineinfile:
path: "/var/log/scaling_scenario.log"
line: "{{ ansible_date_time.iso8601 }} - Starting scaling event scenario"
create: yes
- name: Check current load
command: uptime
register: system_load
changed_when: false
- name: Parse load average
# uptime ends with "load average: X, Y, Z"; split on that label first,
# since splitting the whole line on commas also splits "up 10 days, 1:23"
set_fact:
load_1min: "{{ system_load.stdout.split('load average:')[1].split(',')[0] | trim | float }}"
load_5min: "{{ system_load.stdout.split('load average:')[1].split(',')[1] | trim | float }}"
load_15min: "{{ system_load.stdout.split('load average:')[1].split(',')[2] | trim | float }}"
tasks:
- name: Evaluate scaling conditions
set_fact:
scale_up_needed: "{{ load_5min | float > scaling_threshold }}"
scale_down_needed: "{{ load_5min | float < (scaling_threshold * 0.3) }}"
- name: Scale up web servers
include_role:
name: scale_up
tasks_from: web_servers
vars:
scale_count: "{{ [max_scale_up, (load_5min | float / 10) | int] | min }}"
when: scale_up_needed and 'webservers' in group_names
- name: Scale up database servers
include_role:
name: scale_up
tasks_from: database_servers
vars:
scale_count: "{{ [2, (load_5min | float / 20) | int] | min }}"
when: scale_up_needed and 'databases' in group_names
- name: Update load balancer configuration
include_role:
name: load_balancer
tasks_from: update_backends
when: scale_up_needed
- name: Scale down web servers
include_role:
name: scale_down
tasks_from: web_servers
vars:
scale_count: "{{ [(inventory_hostname | regex_findall('[0-9]+') | first | int) - min_instances, 1] | max }}"
when: scale_down_needed and 'webservers' in group_names and (inventory_hostname | regex_findall('[0-9]+') | first | int) > min_instances
- name: Wait for cooldown period
pause:
seconds: "{{ cooldown_period }}"
when: scale_up_needed or scale_down_needed
- name: Verify scaling results
uri:
url: http://localhost/health
method: GET
status_code: 200
register: health_check
until: health_check.status == 200
retries: 5
delay: 10
when: "'webservers' in group_names"
- name: Update monitoring thresholds
include_role:
name: monitoring
tasks_from: update_alerts
vars:
new_threshold: "{{ scaling_threshold + 10 }}"
- name: Send scaling notification
mail:
to: "{{ scaling_notification_email | default('infra-team@company.com') }}"
subject: "Infrastructure Scaling Event - {{ inventory_hostname }}"
body: |
Scaling event completed on {{ inventory_hostname }}
Load averages: {{ load_1min }}, {{ load_5min }}, {{ load_15min }}
Action taken: {{ 'Scale Up' if scale_up_needed else 'Scale Down' if scale_down_needed else 'No Action' }}
Health check: {{ 'PASSED' if (health_check.status | default(0)) == 200 else 'FAILED' }}
See /var/log/scaling_scenario.log for details
when: scaling_notification_email is defined
ignore_errors: yes
post_tasks:
- name: Generate scaling scenario report
template:
src: templates/scaling_scenario_report.j2
dest: "/var/log/scaling_scenario_report_{{ ansible_date_time.iso8601 }}.log"
vars:
scenario_outcome: "{{ 'SUCCESS' if (health_check.status | default(0)) == 200 else 'WARNING' }}"
load_metrics: "{{ load_1min }}, {{ load_5min }}, {{ load_15min }}"
- name: Log scenario completion
lineinfile:
path: "/var/log/scaling_scenario.log"
line: "{{ ansible_date_time.iso8601 }} - Scaling event scenario completed"
@@ -0,0 +1,343 @@
#!/bin/bash
# Enterprise Infrastructure Failure Simulation Script
# Simulates various types of infrastructure failures for testing
set -euo pipefail
# Configuration
DOCKER_COMPOSE_FILE="docker-compose.yml"
INVENTORY_FILE="inventory/hosts.ini"
LOG_FILE="logs/failure_simulation.log"
# Default values
FAILURE_TYPE="${1:-network}"
DURATION="${2:-60}"
TARGET_NODES="${3:-all}"
INTENSITY="${INTENSITY:-medium}"
# Logging function
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "$LOG_FILE"
}
# Error handling
error_exit() {
log "ERROR: $1"
# Cleanup any active failures
cleanup_failure
exit 1
}
# Validate inputs
validate_inputs() {
case "$FAILURE_TYPE" in
network|disk|service|node|cpu|memory) ;;
*) error_exit "Invalid failure type: $FAILURE_TYPE. Must be network, disk, service, node, cpu, or memory" ;;
esac
if ! [[ "$DURATION" =~ ^[0-9]+$ ]] || [ "$DURATION" -lt 1 ]; then
error_exit "Invalid duration: $DURATION. Must be a positive integer (seconds)"
fi
case "$INTENSITY" in
low|medium|high|critical) ;;
*) error_exit "Invalid intensity: $INTENSITY. Must be low, medium, high, or critical" ;;
esac
}
# Get target containers
get_target_containers() {
case "$TARGET_NODES" in
all)
docker-compose ps --services | grep -v "^NAME$" || true
;;
web)
echo "web"
;;
db)
echo "db"
;;
lb)
echo "lb"
;;
monitor)
echo "monitor"
;;
*)
echo "$TARGET_NODES"
;;
esac
}
# Network failure simulation
simulate_network_failure() {
local containers=$(get_target_containers)
log "Simulating network failure on containers: $containers"
for container in $containers; do
local container_ids=$(docker-compose ps -q "$container" 2>/dev/null || true)
for cid in $container_ids; do
if [ -n "$cid" ]; then
log "Disconnecting network for container $cid"
# Disconnect from network
docker network disconnect "$(docker inspect "$cid" --format '{{.HostConfig.NetworkMode}}')" "$cid" 2>/dev/null || true
# Store original network for restoration
echo "$cid:$(docker inspect "$cid" --format '{{.HostConfig.NetworkMode}}')" >> /tmp/network_failure_state
fi
done
done
}
# Disk failure simulation
simulate_disk_failure() {
local containers=$(get_target_containers)
log "Simulating disk space exhaustion on containers: $containers"
for container in $containers; do
local container_ids=$(docker-compose ps -q "$container" 2>/dev/null || true)
for cid in $container_ids; do
if [ -n "$cid" ]; then
log "Filling disk space in container $cid"
# Create a large file to consume disk space
local fill_size="100M"
case "$INTENSITY" in
low) fill_size="50M" ;;
medium) fill_size="100M" ;;
high) fill_size="500M" ;;
critical) fill_size="1G" ;;
esac
docker exec "$cid" bash -c "dd if=/dev/zero of=/tmp/disk_fill bs=1M count=$(( ${fill_size%M} * 1024 ))" 2>/dev/null || true
echo "$cid:disk_fill" >> /tmp/disk_failure_state
fi
done
done
}
# Service failure simulation
simulate_service_failure() {
local containers=$(get_target_containers)
log "Simulating service failures on containers: $containers"
for container in $containers; do
local container_ids=$(docker-compose ps -q "$container" 2>/dev/null || true)
for cid in $container_ids; do
if [ -n "$cid" ]; then
log "Stopping services in container $cid"
# Stop common services
docker exec "$cid" systemctl stop nginx 2>/dev/null || true
docker exec "$cid" systemctl stop postgresql 2>/dev/null || true
docker exec "$cid" systemctl stop haproxy 2>/dev/null || true
echo "$cid:services" >> /tmp/service_failure_state
fi
done
done
}
# Node failure simulation
simulate_node_failure() {
local containers=$(get_target_containers)
log "Simulating complete node failures on containers: $containers"
for container in $containers; do
local container_ids=$(docker-compose ps -q "$container" 2>/dev/null || true)
for cid in $container_ids; do
if [ -n "$cid" ]; then
log "Stopping container $cid (node failure)"
docker pause "$cid"
echo "$cid:paused" >> /tmp/node_failure_state
fi
done
done
}
# CPU stress simulation
simulate_cpu_failure() {
local containers=$(get_target_containers)
log "Simulating CPU stress on containers: $containers"
for container in $containers; do
local container_ids=$(docker-compose ps -q "$container" 2>/dev/null || true)
for cid in $container_ids; do
if [ -n "$cid" ]; then
log "Starting CPU stress in container $cid"
# Start CPU stress process
docker exec -d "$cid" bash -c "while true; do :; done" 2>/dev/null || true
echo "$cid:cpu_stress:$(docker exec "$cid" ps aux | grep "while true" | grep -v grep | awk '{print $2}' | head -1)" >> /tmp/cpu_failure_state
fi
done
done
}
# Memory stress simulation
simulate_memory_failure() {
local containers=$(get_target_containers)
log "Simulating memory exhaustion on containers: $containers"
for container in $containers; do
local container_ids=$(docker-compose ps -q "$container" 2>/dev/null || true)
for cid in $container_ids; do
if [ -n "$cid" ]; then
log "Starting memory stress in container $cid"
# Start memory stress process
docker exec -d "$cid" bash -c "tail /dev/zero" 2>/dev/null || true
echo "$cid:memory_stress:$(docker exec "$cid" ps aux | grep "tail /dev/zero" | grep -v grep | awk '{print $2}' | head -1)" >> /tmp/memory_failure_state
fi
done
done
}
# Inject failure
inject_failure() {
case "$FAILURE_TYPE" in
network) simulate_network_failure ;;
disk) simulate_disk_failure ;;
service) simulate_service_failure ;;
node) simulate_node_failure ;;
cpu) simulate_cpu_failure ;;
memory) simulate_memory_failure ;;
esac
}
# Cleanup failure
cleanup_failure() {
log "Cleaning up failure simulation"
# Restore network connections
if [ -f /tmp/network_failure_state ]; then
while IFS=: read -r cid network; do
docker network connect "$network" "$cid" 2>/dev/null || true
done < /tmp/network_failure_state
rm -f /tmp/network_failure_state
fi
# Clean up disk fill files
if [ -f /tmp/disk_failure_state ]; then
while IFS=: read -r cid _; do
docker exec "$cid" rm -f /tmp/disk_fill 2>/dev/null || true
done < /tmp/disk_failure_state
rm -f /tmp/disk_failure_state
fi
# Restart services
if [ -f /tmp/service_failure_state ]; then
while IFS=: read -r cid _; do
docker exec "$cid" systemctl start nginx 2>/dev/null || true
docker exec "$cid" systemctl start postgresql 2>/dev/null || true
docker exec "$cid" systemctl start haproxy 2>/dev/null || true
done < /tmp/service_failure_state
rm -f /tmp/service_failure_state
fi
# Unpause containers
if [ -f /tmp/node_failure_state ]; then
while IFS=: read -r cid _; do
docker unpause "$cid" 2>/dev/null || true
done < /tmp/node_failure_state
rm -f /tmp/node_failure_state
fi
# Kill stress processes
if [ -f /tmp/cpu_failure_state ]; then
while IFS=: read -r cid _ pid; do
docker exec "$cid" kill -9 "$pid" 2>/dev/null || true
done < /tmp/cpu_failure_state
rm -f /tmp/cpu_failure_state
fi
if [ -f /tmp/memory_failure_state ]; then
while IFS=: read -r cid _ pid; do
docker exec "$cid" kill -9 "$pid" 2>/dev/null || true
done < /tmp/memory_failure_state
rm -f /tmp/memory_failure_state
fi
}
# Monitor failure
monitor_failure() {
local end_time=$(( $(date +%s) + DURATION ))
log "Monitoring failure for $DURATION seconds"
while [ $(date +%s) -lt $end_time ]; do
# Check container status
if ! docker-compose ps | grep -q "Up\|Paused"; then
log "WARNING: All containers are down"
fi
# Log system metrics
log "System status: $(docker stats --no-stream --format 'table {{.Container}}\t{{.CPUPerc}}\t{{.MemPerc}}' | tail -n +2)"
sleep 10
done
}
# Generate failure report
generate_report() {
local report_file="reports/failure_simulation_$(date +%Y%m%d_%H%M%S).txt"
cat > "$report_file" << EOF
Failure Simulation Report
========================
Timestamp: $(date)
Failure Type: $FAILURE_TYPE
Duration: $DURATION seconds
Target Nodes: $TARGET_NODES
Intensity: $INTENSITY
Pre-failure Status:
${PRE_FAILURE_STATUS:-unavailable}
Post-failure Status:
$(docker-compose ps)
Log File: $LOG_FILE
EOF
log "Failure simulation report generated: $report_file"
}
# Main execution
main() {
log "Starting failure simulation: $FAILURE_TYPE for $DURATION seconds"
validate_inputs
# Capture status before injection so the report shows a true before/after
PRE_FAILURE_STATUS=$(docker-compose ps)
# Inject failure
inject_failure
# Monitor during failure
monitor_failure
# Cleanup
cleanup_failure
# Generate report
generate_report
log "Failure simulation completed successfully"
}
# Trap for cleanup on script exit
trap cleanup_failure EXIT
# Initialize logging
mkdir -p logs reports
# Run main function
main "$@"
@@ -0,0 +1,208 @@
#!/bin/bash
# Enterprise Infrastructure Scaling Simulation Script
# Simulates scaling operations for infrastructure nodes
set -euo pipefail
# Configuration
DOCKER_COMPOSE_FILE="docker-compose.yml"
INVENTORY_FILE="inventory/hosts.ini"
LOG_FILE="logs/scaling_simulation.log"
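# Usage: $0 [up|down] [count] [web|db|lb|monitor]
# Set SIMULATION_MODE=true to skip the Ansible provisioning/decommission runs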
# Default values
DIRECTION="${1:-up}"
COUNT="${2:-1}"
NODE_TYPE="${3:-web}"
SIMULATION_MODE="${SIMULATION_MODE:-false}"
# Logging function
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $*" | tee -a "$LOG_FILE"
}
# Error handling
error_exit() {
log "ERROR: $1"
exit 1
}
# Validate inputs
validate_inputs() {
if [[ "$DIRECTION" != "up" && "$DIRECTION" != "down" ]]; then
error_exit "Invalid direction: $DIRECTION. Must be 'up' or 'down'"
fi
if ! [[ "$COUNT" =~ ^[0-9]+$ ]] || [ "$COUNT" -lt 1 ]; then
error_exit "Invalid count: $COUNT. Must be a positive integer"
fi
case "$NODE_TYPE" in
web|db|lb|monitor) ;;
*) error_exit "Invalid node type: $NODE_TYPE. Must be web, db, lb, or monitor" ;;
esac
}
# Get current node count
get_current_count() {
local type="$1"
case "$type" in
# grep -c still prints 0 when nothing matches but exits non-zero;
# mask the status so set -e / pipefail callers are not tripped
web|db|lb|monitor) docker-compose ps "$type" | grep -c "Up" || true ;;
esac
}
# Scale up infrastructure
scale_up() {
local type="$1"
local count="$2"
local new_count=$(( $(get_current_count "$type") + count ))
log "Scaling up $count $type nodes (new total: $new_count)"
# Update docker-compose replica count
sed -i.bak "s/replicas: [0-9]\+/replicas: $new_count/" "$DOCKER_COMPOSE_FILE"
# Deploy new containers (--scale takes the desired total, not an increment)
docker-compose up -d --scale "${type}=${new_count}"
# Wait for containers to be ready
log "Waiting for containers to be ready..."
sleep 30
# Update inventory
update_inventory "$type" "$count" "add"
# Run provisioning playbook on new nodes
if [ "$SIMULATION_MODE" = false ]; then
ansible-playbook -i "$INVENTORY_FILE" playbooks/provision.yml --limit "${type}*"
fi
log "Successfully scaled up $count $type nodes"
}
# Scale down infrastructure
scale_down() {
local type="$1"
local count="$2"
local current_count=$(get_current_count "$type")
if [ "$current_count" -lt "$count" ]; then
error_exit "Cannot scale down $count nodes. Only $current_count $type nodes currently running"
fi
log "Scaling down $count $type nodes"
# Select nodes to remove (oldest first)
local nodes_to_remove=$(docker-compose ps "$type" | grep "Up" | head -n "$count" | awk '{print $1}')
# Decommission nodes
for node in $nodes_to_remove; do
if [ "$SIMULATION_MODE" = false ]; then
ansible-playbook -i "$INVENTORY_FILE" playbooks/decommission.yml --limit "$node"
fi
docker stop "$node"
docker rm "$node"
done
# Update docker-compose replica count
sed -i.bak "s/replicas: [0-9]\+/replicas: $(( current_count - count ))/" "$DOCKER_COMPOSE_FILE"
# Update inventory
update_inventory "$type" "$count" "remove"
log "Successfully scaled down $count $type nodes"
}
# Update Ansible inventory
update_inventory() {
local type="$1"
local count="$2"
local action="$3"
log "Updating inventory for $action $count $type nodes"
# This would be more complex in a real implementation
# For simulation, we'll just log the action
case "$action" in
add)
log "Added $count $type nodes to inventory"
;;
remove)
log "Removed $count $type nodes from inventory"
;;
esac
}
# Health check after scaling
health_check() {
log "Running health checks after scaling"
# Check container status
if ! docker-compose ps | grep -q "Up"; then
error_exit "Some containers failed to start"
fi
# Ansible ping check
if [ "$SIMULATION_MODE" = false ]; then
if ! ansible -i "$INVENTORY_FILE" all -m ping >/dev/null 2>&1; then
log "WARNING: Some nodes failed Ansible ping check"
fi
fi
log "Health checks completed"
}
# Generate scaling report
generate_report() {
local report_file="reports/scaling_report_$(date +%Y%m%d_%H%M%S).txt"
cat > "$report_file" << EOF
Scaling Simulation Report
========================
Timestamp: $(date)
Direction: $DIRECTION
Node Type: $NODE_TYPE
Count: $COUNT
Simulation Mode: $SIMULATION_MODE
Current Status:
$(docker-compose ps)
Inventory Status:
$(ansible -i "$INVENTORY_FILE" --list-hosts all 2>/dev/null || echo "Ansible inventory check failed")
Log File: $LOG_FILE
EOF
log "Scaling report generated: $report_file"
}
# Main execution
main() {
log "Starting scaling simulation: $DIRECTION $COUNT $NODE_TYPE nodes"
validate_inputs
case "$DIRECTION" in
up)
scale_up "$NODE_TYPE" "$COUNT"
;;
down)
scale_down "$NODE_TYPE" "$COUNT"
;;
esac
health_check
generate_report
log "Scaling simulation completed successfully"
}
# Initialize logging
mkdir -p logs reports
# Run main function
main "$@"
@@ -0,0 +1,389 @@
# Migration Validation Framework
A comprehensive Python CLI tool for validating system migrations through data collection, snapshot comparison, and automated reporting. Designed for enterprise migration workflows where system consistency and data integrity are critical.
## Overview
The Migration Validation Framework provides a systematic approach to validating system migrations by:
- Collecting comprehensive system data before and after migration
- Generating structured JSON snapshots for comparison
- Performing intelligent diff analysis between snapshots
- Generating detailed HTML reports with change visualization
- Providing CLI interface for integration into migration pipelines
## Architecture
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ CLI Interface │ │ Data │ │ Validation │
│ (cli.py) │◄──►│ Collectors │◄──►│ Engine │
│ │ │ │ │ │
│ - Command │ │ - mounts.py │ │ - compare.py │
│ parsing │ │ - services.py │ │ - diff.py │
│ - Workflow │ │ - disk_usage.py │ │ - validate.py │
│ orchestration │ │ - network.py │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ JSON │ │ Comparison │ │ HTML │
│ Snapshots │ │ Results │ │ Reports │
│ │ │ │ │ │
│ - Pre-migration │ │ - Differences │ │ - Summary │
│ - Post-migration│ │ - Risk levels │ │ - Details │
│ - Metadata │ │ - Validation │ │ - Charts │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
## Quick Start
### Prerequisites
- Python 3.8+
- SSH access to target systems
- Appropriate permissions for data collection
### Installation
```bash
cd migration-validation-framework
pip install -r requirements.txt
```
### Basic Usage
```bash
# Create pre-migration snapshot
python cli.py snapshot --env production --label pre-migration --systems web01,db01
# Perform migration...
# Create post-migration snapshot
python cli.py snapshot --env production --label post-migration --systems web01,db01
# Compare snapshots
python cli.py compare pre-migration post-migration --output comparison_001
# Generate HTML report
python cli.py report --comparison comparison_001 --format html --output migration_report.html
```
## Project Structure
```
migration-validation-framework/
├── cli.py # Main CLI interface
├── collectors/ # Data collection modules
│ ├── mounts.py # Filesystem mount collection
│ ├── services.py # System services collection
│ ├── disk_usage.py # Disk usage statistics
│ ├── network.py # Network configuration
│ └── processes.py # Running processes
├── validators/ # Validation and comparison logic
│ ├── compare.py # Snapshot comparison engine
│ ├── diff.py # Difference calculation
│ └── validate.py # Validation rules
├── reports/ # Report generation
│ ├── html_report.py # HTML report generator
│ ├── json_report.py # JSON report generator
│ └── summary.py # Summary calculations
├── config/ # Configuration files
│ ├── collectors.yaml # Collector configurations
│ └── validators.yaml # Validation rules
├── tests/ # Unit and integration tests
├── logs/ # Application logs
└── snapshots/ # Stored snapshots
```
## Data Collectors
### Mounts Collector (`collectors/mounts.py`)
Collects filesystem mount information including:
- Mount points and devices
- Filesystem types
- Mount options
- Capacity and usage statistics
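For orientation, a single mount entry in a snapshot produced by this collector looks roughly like the sketch below (keys follow `collectors/mounts.py`; the values are illustrative):

```python
mount_entry = {
    "device": "/dev/sda1",      # block device backing the mount
    "mountpoint": "/var",       # where it is mounted
    "fstype": "ext4",           # filesystem type
    "options": "rw,relatime",   # mount options (parentheses stripped)
}

# Usage statistics are stored alongside the mount list, keyed by mountpoint:
usage = {"/var": {"use_percent": 34, "size_gb": 50.0, "used_gb": 17.0}}
```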
### Services Collector (`collectors/services.py`)
Gathers system service status:
- Running services
- Service states (active, inactive, failed)
- Startup configuration
- Dependencies
### Disk Usage Collector (`collectors/disk_usage.py`)
Analyzes disk space utilization:
- Directory size statistics
- File system usage
- Inode usage
- Largest files and directories
### Network Collector (`collectors/network.py`)
Captures network configuration:
- Interface configurations
- Routing tables
- DNS settings
- Firewall rules
### Processes Collector (`collectors/processes.py`)
Documents running processes:
- Process lists with PIDs
- Memory and CPU usage
- Process owners
- Command lines
## Validation Engine
### Comparison Logic (`validators/compare.py`)
Performs intelligent comparison of snapshots:
- Structural differences detection
- Semantic change analysis
- Risk level assessment
- Change categorization
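The comparison result is a plain dictionary with four top-level sections, which the report generators consume directly. A trimmed sketch (structure follows `validators/compare.py`; values are illustrative):

```python
comparison = {
    "summary": {"total_changes": 12, "systems_with_changes": 2},
    "differences": {
        "services": {
            "web01": {
                "status_changes": [{
                    "name": "nginx.service",
                    "before": {"active_state": "active"},
                    "after": {"active_state": "failed"},
                }]
            }
        }
    },
    "risk_assessment": {"overall_risk": "high", "risk_factors": [], "recommendations": []},
    "validation_results": {"passed": False, "checks": []},
}
```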
### Difference Calculator (`validators/diff.py`)
Calculates detailed differences:
- Added/removed/modified items
- Quantitative changes
- Configuration drift detection
- Anomaly identification
### Validation Rules (`validators/validate.py`)
Applies validation rules:
- Critical change detection
- Compliance checking
- Threshold validation
- Custom rule engine
## Reporting
### HTML Reports (`reports/html_report.py`)
Generates comprehensive HTML reports featuring:
- Executive summary dashboard
- Detailed change logs
- Risk assessment visualizations
- Interactive charts and graphs
- Export capabilities
### JSON Reports (`reports/json_report.py`)
Provides structured JSON output for:
- API integration
- Automated processing
- Audit trails
- Compliance reporting
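Because the JSON report is the comparison dictionary serialized to disk, downstream automation can consume it with the standard library alone. A minimal sketch, assuming the file name used in the Examples section below:

```python
import json

with open("migration_validation_data.json") as f:
    report = json.load(f)

# Fail a pipeline step on anything above medium risk
if report["risk_assessment"]["overall_risk"] in ("high", "critical"):
    raise SystemExit("Migration validation flagged high or critical risk")
```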
## CLI Interface
### Commands
```bash
# Snapshot management
python cli.py snapshot --env <env> --label <label> [--systems <hosts>]
python cli.py list-snapshots [--env <env>]
python cli.py delete-snapshot <snapshot-id>
# Comparison operations
python cli.py compare <snapshot1> <snapshot2> --output <comparison-id>
python cli.py list-comparisons
python cli.py show-comparison <comparison-id>
# Reporting
python cli.py report --comparison <comparison-id> --format <format> [--output <file>]
python cli.py export --comparison <comparison-id> --format <format>
# Configuration
python cli.py config --show
python cli.py config --set <key> <value>
```
### Options
- `--env`: Target environment (production, staging, development)
- `--systems`: Comma-separated list of target systems
- `--parallel`: Number of parallel collection threads
- `--timeout`: Collection timeout in seconds
- `--verbose`: Enable verbose output
- `--dry-run`: Preview actions without execution
## Configuration
### Collector Configuration (`config/collectors.yaml`)
```yaml
collectors:
mounts:
enabled: true
timeout: 30
exclude_patterns:
- "/proc/*"
- "/sys/*"
services:
enabled: true
include_disabled: false
service_manager: systemd
disk_usage:
enabled: true
max_depth: 3
exclude_paths:
- "/tmp"
- "/var/log"
```
### Validation Rules (`config/validators.yaml`)
```yaml
rules:
critical_services:
- sshd
- systemd
- network
filesystem_thresholds:
warning: 80
critical: 95
network_changes:
allow_new_interfaces: false
allow_route_changes: false
```
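As an illustration of how these thresholds might be applied, the sketch below classifies collected usage figures against them (a simplified stand-in for the logic in `validators/validate.py`, which is not reproduced here):

```python
def check_filesystem_thresholds(usage_stats, warning=80, critical=95):
    """Classify each mountpoint's usage against the configured thresholds."""
    findings = []
    for mountpoint, stats in usage_stats.items():
        # use_percent may be "23%" (string) or a bare int depending on the collector
        pct = int(str(stats.get("use_percent", 0)).rstrip("%"))
        if pct >= critical:
            findings.append((mountpoint, pct, "critical"))
        elif pct >= warning:
            findings.append((mountpoint, pct, "warning"))
    return findings
```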
## Examples
### Complete Migration Validation Workflow
```bash
# 1. Pre-migration snapshot
python cli.py snapshot --env production --label "migration-pre-20241201" \
--systems web01,web02,db01,lb01 --parallel 4
# 2. Execute migration process
# ... migration steps ...
# 3. Post-migration snapshot
python cli.py snapshot --env production --label "migration-post-20241201" \
--systems web01,web02,db01,lb01 --parallel 4
# 4. Compare snapshots
python cli.py compare migration-pre-20241201 migration-post-20241201 \
--output migration-dec2024
# 5. Generate reports
python cli.py report --comparison migration-dec2024 --format html \
--output migration_validation_report.html
python cli.py report --comparison migration-dec2024 --format json \
--output migration_validation_data.json
```
### Automated Validation in CI/CD
```bash
#!/bin/bash
# CI/CD validation script
ENVIRONMENT=$1
SNAPSHOT_LABEL="ci-${BUILD_NUMBER}"
# Create snapshot
python cli.py snapshot --env "$ENVIRONMENT" --label "$SNAPSHOT_LABEL"
# Compare with baseline
python cli.py compare "baseline-$ENVIRONMENT" "$SNAPSHOT_LABEL" --output "ci-$BUILD_NUMBER"
# Generate report
python cli.py report --comparison "ci-$BUILD_NUMBER" --format html
# Check for critical changes
if python cli.py check-critical --comparison "ci-$BUILD_NUMBER"; then
echo "Migration validation passed"
exit 0
else
echo "Critical changes detected - review required"
exit 1
fi
```
## Security Considerations
- SSH key-based authentication only
- Encrypted snapshot storage
- Access control for sensitive data
- Audit logging of all operations
- Data sanitization and filtering
## Performance Optimization
- Parallel data collection
- Incremental snapshots
- Compressed storage
- Memory-efficient processing
- Timeout handling
## Monitoring and Logging
- Comprehensive logging to `logs/validation.log`
- Performance metrics collection
- Error tracking and alerting
- Audit trail generation
## Troubleshooting
### Common Issues
**Connection Failures:**
```bash
# Check SSH connectivity
ssh -i ~/.ssh/id_rsa user@target-host
# Verify Python availability
python cli.py --test-connection --systems target-host
```
**Collection Timeouts:**
```bash
# Increase timeout
python cli.py snapshot --timeout 300 --systems slow-host
# Check system load
ssh user@target-host uptime
```
**Permission Errors:**
```bash
# Verify sudo access
ssh user@target-host sudo -l
# Check file permissions
ssh user@target-host ls -la /etc/
```
## Development
### Adding New Collectors
1. Create collector module in `collectors/`
2. Implement collection logic (see the sketch below)
3. Add configuration schema
4. Update CLI interface
5. Add unit tests
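A minimal skeleton for step 2, following the `collect(system)` convention the existing collectors expose (the module name `collectors/kernel_params.py` and the `sysctl` data source are hypothetical examples):

```python
"""Kernel Parameters Data Collector -- example skeleton."""
import logging
import subprocess
from typing import Any, Dict

logger = logging.getLogger(__name__)


def collect(system: str) -> Dict[str, Any]:
    """Collect sysctl settings from the target system over SSH."""
    result = subprocess.run(
        ["ssh", system, "sysctl -a"],
        capture_output=True, text=True, timeout=30,
    )
    if result.returncode != 0:
        raise RuntimeError(f"sysctl failed on {system}: {result.stderr}")
    params = {}
    for line in result.stdout.splitlines():
        if " = " in line:
            key, value = line.split(" = ", 1)
            params[key.strip()] = value.strip()
    logger.debug("Collected %d kernel parameters from %s", len(params), system)
    return {"kernel_params": params}
```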
### Custom Validation Rules
1. Define rules in `config/validators.yaml`
2. Implement validation logic in `validators/`
3. Update report generation
4. Test with sample data
## Contributing
1. Follow existing code structure and naming conventions
2. Add comprehensive tests for new functionality
3. Update documentation for API changes
4. Ensure backward compatibility
## License
Enterprise Internal Use Only
@@ -0,0 +1,270 @@
#!/usr/bin/env python3
"""
Migration Validation Framework - CLI Interface
A comprehensive tool for validating system migrations through data collection,
snapshot comparison, and automated reporting.
"""
import argparse
import json
import logging
import sys
from datetime import datetime
from pathlib import Path
from typing import Dict, List, Optional, Any
# Import framework modules
from collectors import mounts, services, disk_usage
from validators import compare
from reports import html_report
# Configuration
SNAPSHOTS_DIR = Path("snapshots")
LOGS_DIR = Path("logs")
REPORTS_DIR = Path("reports")
class MigrationValidator:
"""Main migration validation class."""
def __init__(self, verbose: bool = False):
self.verbose = verbose
# Create logs/ before setup_logging attaches a FileHandler inside it
self.ensure_directories()
self.setup_logging()
def setup_logging(self):
"""Configure logging."""
log_level = logging.DEBUG if self.verbose else logging.INFO
logging.basicConfig(
level=log_level,
format='%(asctime)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler(LOGS_DIR / "validation.log"),
logging.StreamHandler(sys.stdout)
]
)
self.logger = logging.getLogger(__name__)
def ensure_directories(self):
"""Ensure required directories exist."""
for directory in [SNAPSHOTS_DIR, LOGS_DIR, REPORTS_DIR]:
directory.mkdir(exist_ok=True)
def collect_system_data(self, systems: List[str]) -> Dict[str, Any]:
"""Collect data from target systems."""
self.logger.info(f"Collecting data from systems: {systems}")
snapshot = {
"metadata": {
"timestamp": datetime.now().isoformat(),
"systems": systems,
"version": "1.0"
},
"data": {}
}
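# Each collector module exposes a collect(system) entry point returning a dict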
collectors = [
("mounts", mounts.collect),
("services", services.collect),
("disk_usage", disk_usage.collect)
]
for system in systems:
self.logger.info(f"Collecting data from {system}")
snapshot["data"][system] = {}
for collector_name, collector_func in collectors:
try:
self.logger.debug(f"Running {collector_name} collector on {system}")
data = collector_func(system)
snapshot["data"][system][collector_name] = data
except Exception as e:
self.logger.error(f"Failed to collect {collector_name} from {system}: {e}")
snapshot["data"][system][collector_name] = {"error": str(e)}
return snapshot
def save_snapshot(self, snapshot: Dict[str, Any], label: str, env: str) -> str:
"""Save snapshot to disk."""
snapshot_id = f"{env}-{label}-{datetime.now().strftime('%Y%m%d_%H%M%S')}"
snapshot_file = SNAPSHOTS_DIR / f"{snapshot_id}.json"
with open(snapshot_file, 'w') as f:
json.dump(snapshot, f, indent=2)
self.logger.info(f"Snapshot saved: {snapshot_id}")
return snapshot_id
def load_snapshot(self, snapshot_id: str) -> Dict[str, Any]:
"""Load snapshot from disk."""
snapshot_file = SNAPSHOTS_DIR / f"{snapshot_id}.json"
if not snapshot_file.exists():
raise FileNotFoundError(f"Snapshot {snapshot_id} not found")
with open(snapshot_file, 'r') as f:
return json.load(f)
def create_snapshot(self, env: str, label: str, systems: List[str]) -> str:
"""Create and save a system snapshot."""
self.logger.info(f"Creating snapshot for environment: {env}, label: {label}")
snapshot = self.collect_system_data(systems)
snapshot_id = self.save_snapshot(snapshot, label, env)
return snapshot_id
def compare_snapshots(self, snapshot1_id: str, snapshot2_id: str, output_id: str) -> Dict[str, Any]:
"""Compare two snapshots."""
self.logger.info(f"Comparing snapshots: {snapshot1_id} vs {snapshot2_id}")
snapshot1 = self.load_snapshot(snapshot1_id)
snapshot2 = self.load_snapshot(snapshot2_id)
comparison = compare.compare_snapshots(snapshot1, snapshot2)
comparison["metadata"] = {
"snapshot1": snapshot1_id,
"snapshot2": snapshot2_id,
"timestamp": datetime.now().isoformat(),
"comparison_id": output_id
}
# Save comparison results
comparison_file = REPORTS_DIR / f"comparison_{output_id}.json"
with open(comparison_file, 'w') as f:
json.dump(comparison, f, indent=2)
self.logger.info(f"Comparison saved: {output_id}")
return comparison
def generate_report(self, comparison_id: str, format_type: str, output_file: Optional[str] = None) -> str:
"""Generate a report from comparison results."""
self.logger.info(f"Generating {format_type} report for comparison: {comparison_id}")
comparison_file = REPORTS_DIR / f"comparison_{comparison_id}.json"
if not comparison_file.exists():
raise FileNotFoundError(f"Comparison {comparison_id} not found")
with open(comparison_file, 'r') as f:
comparison = json.load(f)
if format_type == "html":
if output_file is None:
output_file = f"migration_report_{comparison_id}.html"
html_report.generate(comparison, output_file)
elif format_type == "json":
if output_file is None:
output_file = f"migration_report_{comparison_id}.json"
with open(output_file, 'w') as f:
json.dump(comparison, f, indent=2)
else:
raise ValueError(f"Unsupported format: {format_type}")
self.logger.info(f"Report generated: {output_file}")
return output_file
def main():
"""Main CLI entry point."""
parser = argparse.ArgumentParser(
description="Migration Validation Framework",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Create pre-migration snapshot
python cli.py snapshot --env production --label pre-migration --systems web01,db01
# Compare snapshots
python cli.py compare pre-migration-snapshot post-migration-snapshot --output comparison_001
# Generate HTML report
python cli.py report --comparison comparison_001 --format html
"""
)
parser.add_argument('--verbose', '-v', action='store_true', help='Enable verbose logging')
parser.add_argument('--dry-run', action='store_true', help='Preview actions without execution')
subparsers = parser.add_subparsers(dest='command', help='Available commands')
# Snapshot command
snapshot_parser = subparsers.add_parser('snapshot', help='Create system snapshot')
snapshot_parser.add_argument('--env', required=True, help='Target environment')
snapshot_parser.add_argument('--label', required=True, help='Snapshot label')
snapshot_parser.add_argument('--systems', required=True, help='Comma-separated list of systems')
# Compare command
compare_parser = subparsers.add_parser('compare', help='Compare two snapshots')
compare_parser.add_argument('snapshot1', help='First snapshot ID')
compare_parser.add_argument('snapshot2', help='Second snapshot ID')
compare_parser.add_argument('--output', required=True, help='Comparison output ID')
# Report command
report_parser = subparsers.add_parser('report', help='Generate report from comparison')
report_parser.add_argument('--comparison', required=True, help='Comparison ID')
report_parser.add_argument('--format', choices=['html', 'json'], default='html', help='Report format')
report_parser.add_argument('--output', help='Output file path')
# List command
list_parser = subparsers.add_parser('list', help='List snapshots or comparisons')
list_parser.add_argument('type', choices=['snapshots', 'comparisons'], help='Type to list')
args = parser.parse_args()
if not args.command:
parser.print_help()
return
# Initialize validator
validator = MigrationValidator(verbose=args.verbose)
try:
if args.command == 'snapshot':
systems = [s.strip() for s in args.systems.split(',')]
if args.dry_run:
print(f"DRY RUN: Would create snapshot for systems: {systems}")
return
snapshot_id = validator.create_snapshot(args.env, args.label, systems)
print(f"Snapshot created: {snapshot_id}")
elif args.command == 'compare':
if args.dry_run:
print(f"DRY RUN: Would compare {args.snapshot1} vs {args.snapshot2}")
return
validator.compare_snapshots(args.snapshot1, args.snapshot2, args.output)
print(f"Comparison completed: {args.output}")
elif args.command == 'report':
if args.dry_run:
print(f"DRY RUN: Would generate {args.format} report for {args.comparison}")
return
output_file = validator.generate_report(args.comparison, args.format, args.output)
print(f"Report generated: {output_file}")
elif args.command == 'list':
if args.type == 'snapshots':
snapshots = list(SNAPSHOTS_DIR.glob("*.json"))
if snapshots:
print("Available snapshots:")
for snapshot in sorted(snapshots):
print(f" {snapshot.stem}")
else:
print("No snapshots found")
elif args.type == 'comparisons':
comparisons = list(REPORTS_DIR.glob("comparison_*.json"))
if comparisons:
print("Available comparisons:")
for comparison in sorted(comparisons):
comp_id = comparison.stem.replace('comparison_', '')
print(f" {comp_id}")
else:
print("No comparisons found")
except Exception as e:
validator.logger.error(f"Command failed: {e}")
print(f"Error: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
main()
@@ -0,0 +1,207 @@
"""
Disk Usage Data Collector
Collects disk usage statistics including directory sizes,
file system usage, and largest files information.
"""
import json
import logging
import subprocess
from typing import Dict, Any, List
from pathlib import Path
logger = logging.getLogger(__name__)
class DiskUsageCollector:
"""Collector for disk usage statistics."""
def __init__(self):
self.max_depth = 3
self.exclude_paths = [
"/proc",
"/sys",
"/dev",
"/run",
"/tmp",
"/var/log"
]
def collect_disk_usage(self, system: str) -> Dict[str, Any]:
"""Collect disk usage information from target system."""
logger.info(f"Collecting disk usage data from {system}")
try:
# Collect filesystem usage
filesystem_usage = self.collect_filesystem_usage(system)
# Collect directory sizes
directory_sizes = self.collect_directory_sizes(system)
# Collect largest files
largest_files = self.collect_largest_files(system)
return {
"filesystem_usage": filesystem_usage,
"directory_sizes": directory_sizes,
"largest_files": largest_files,
"timestamp": self.get_timestamp(system)
}
except Exception as e:
logger.error(f"Failed to collect disk usage from {system}: {e}")
raise
def collect_filesystem_usage(self, system: str) -> List[Dict[str, Any]]:
"""Collect filesystem usage statistics."""
usage_stats = []
try:
# Run df command
result = subprocess.run(
["ssh", system, "df -h --output=source,fstype,size,used,avail,pcent,target"],
capture_output=True,
text=True,
timeout=30
)
if result.returncode != 0:
raise RuntimeError(f"df command failed: {result.stderr}")
# Parse output
lines = result.stdout.strip().split('\n')
if len(lines) < 2:
return usage_stats
for line in lines[1:]: # Skip header
parts = line.split()
if len(parts) >= 7:
usage_stat = {
"filesystem": parts[0],
"type": parts[1],
"size": parts[2],
"used": parts[3],
"available": parts[4],
"use_percent": parts[5],
"mountpoint": parts[6]
}
usage_stats.append(usage_stat)
except subprocess.TimeoutExpired:
logger.error(f"Timeout collecting filesystem usage from {system}")
raise
except Exception as e:
logger.error(f"Failed to collect filesystem usage from {system}: {e}")
raise
return usage_stats
def collect_directory_sizes(self, system: str) -> List[Dict[str, Any]]:
"""Collect sizes of top-level directories."""
directory_sizes = []
try:
# Get top-level directories
dirs_to_check = ["/", "/home", "/var", "/usr", "/opt", "/etc"]
for directory in dirs_to_check:
if directory in self.exclude_paths:
continue
try:
# Run du command for directory size
result = subprocess.run(
["ssh", system, f"du -sh {directory} 2>/dev/null"],
capture_output=True,
text=True,
timeout=60
)
if result.returncode == 0:
size, path = result.stdout.strip().split('\t', 1)
directory_sizes.append({
"path": path,
"size": size
})
except subprocess.TimeoutExpired:
logger.warning(f"Timeout getting size for {directory} on {system}")
continue
except Exception as e:
logger.warning(f"Failed to get size for {directory} on {system}: {e}")
continue
except Exception as e:
logger.error(f"Failed to collect directory sizes from {system}: {e}")
raise
return directory_sizes
def collect_largest_files(self, system: str) -> List[Dict[str, Any]]:
"""Collect information about largest files in the system."""
largest_files = []
try:
# Find largest files (excluding certain paths)
exclude_expr = " ".join(f"-not -path '{path}/*'" for path in self.exclude_paths)
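# Sort on the human-readable size column (-k5 -h) descending, keep the 20 largest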
cmd = f"find / {exclude_expr} -type f -exec ls -lh {{}} \\; 2>/dev/null | sort -k5 -hr | head -20"
result = subprocess.run(
["ssh", system, cmd],
capture_output=True,
text=True,
timeout=120
)
if result.returncode == 0:
for line in result.stdout.strip().split('\n'):
if not line.strip():
continue
parts = line.split()
if len(parts) >= 9:
file_info = {
"permissions": parts[0],
"links": parts[1],
"owner": parts[2],
"group": parts[3],
"size": parts[4],
"month": parts[5],
"day": parts[6],
"time": parts[7],
"path": " ".join(parts[8:])
}
largest_files.append(file_info)
except subprocess.TimeoutExpired:
logger.error(f"Timeout collecting largest files from {system}")
raise
except Exception as e:
logger.error(f"Failed to collect largest files from {system}: {e}")
raise
return largest_files
def get_timestamp(self, system: str) -> str:
"""Get current timestamp from target system."""
try:
result = subprocess.run(
["ssh", system, "date -Iseconds"],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
return result.stdout.strip()
else:
return "unknown"
except Exception:
return "unknown"
def collect(system: str) -> Dict[str, Any]:
"""Main collection function for disk usage data."""
collector = DiskUsageCollector()
return collector.collect_disk_usage(system)
@@ -0,0 +1,173 @@
"""
Mounts Data Collector
Collects filesystem mount information including mount points, devices,
filesystem types, and usage statistics.
"""
import json
import logging
import subprocess
from typing import Dict, Any, List
from pathlib import Path
logger = logging.getLogger(__name__)
class MountsCollector:
"""Collector for filesystem mount information."""
def __init__(self):
self.exclude_patterns = [
"/proc/*",
"/sys/*",
"/dev/*",
"/run/*"
]
def collect_mounts(self, system: str) -> Dict[str, Any]:
"""Collect mount information from target system."""
logger.info(f"Collecting mounts data from {system}")
try:
# Run mount command
result = subprocess.run(
["ssh", system, "mount"],
capture_output=True,
text=True,
timeout=30
)
if result.returncode != 0:
raise RuntimeError(f"Mount command failed: {result.stderr}")
mounts = self.parse_mount_output(result.stdout)
filtered_mounts = self.filter_mounts(mounts)
# Get usage statistics
usage_stats = self.collect_usage_stats(system, filtered_mounts)
return {
"mounts": filtered_mounts,
"usage": usage_stats,
"timestamp": self.get_timestamp(system)
}
except subprocess.TimeoutExpired:
logger.error(f"Timeout collecting mounts from {system}")
raise
except Exception as e:
logger.error(f"Failed to collect mounts from {system}: {e}")
raise
def parse_mount_output(self, output: str) -> List[Dict[str, str]]:
"""Parse mount command output."""
mounts = []
for line in output.strip().split('\n'):
if not line.strip():
continue
# Parse mount output format: device on mountpoint type fstype (options)
parts = line.split()
if len(parts) >= 6 and parts[1] == 'on' and parts[3] == 'type':
mount_info = {
"device": parts[0],
"mountpoint": parts[2],
"fstype": parts[4],
"options": parts[5].strip('()')
}
mounts.append(mount_info)
return mounts
def filter_mounts(self, mounts: List[Dict[str, str]]) -> List[Dict[str, str]]:
"""Filter out mounts under the excluded pseudo-filesystem paths."""
prefixes = [pattern.rstrip('/*') for pattern in self.exclude_patterns]
filtered = []
for mount in mounts:
mountpoint = mount["mountpoint"]
# Exclude the path itself and anything mounted beneath it
if any(mountpoint == p or mountpoint.startswith(p + '/') for p in prefixes):
continue
filtered.append(mount)
return filtered
def collect_usage_stats(self, system: str, mounts: List[Dict[str, str]]) -> Dict[str, Any]:
"""Collect disk usage statistics for mount points."""
usage_stats = {}
for mount in mounts:
mountpoint = mount["mountpoint"]
try:
# Run df command for usage statistics
result = subprocess.run(
["ssh", system, f"df -BG {mountpoint}"],
capture_output=True,
text=True,
timeout=15
)
if result.returncode == 0:
usage_stats[mountpoint] = self.parse_df_output(result.stdout)
except subprocess.TimeoutExpired:
logger.warning(f"Timeout getting usage for {mountpoint} on {system}")
usage_stats[mountpoint] = {"error": "timeout"}
except Exception as e:
logger.warning(f"Failed to get usage for {mountpoint} on {system}: {e}")
usage_stats[mountpoint] = {"error": str(e)}
return usage_stats
def parse_df_output(self, output: str) -> Dict[str, Any]:
"""Parse `df -BG` output for a single mount point."""
lines = output.strip().split('\n')
if len(lines) < 2:
return {"error": "invalid df output"}
# df -BG columns: Filesystem, 1G-blocks, Used, Available, Use%, Mounted on.
# The "Mounted on" header contains a space, so zipping header tokens against
# data fields misaligns; parse the data row by position instead.
data = lines[1].split()
if len(data) < 6:
return {"error": "unexpected df output"}
try:
return {
"filesystem": data[0],
"size_gb": float(data[1].rstrip('G')),
"used_gb": float(data[2].rstrip('G')),
"available_gb": float(data[3].rstrip('G')),
"use_percent": int(data[4].rstrip('%')),
"mountpoint": " ".join(data[5:])
}
except ValueError as e:
return {"error": f"failed to parse df fields: {e}"}
def get_timestamp(self, system: str) -> str:
"""Get current timestamp from target system."""
try:
result = subprocess.run(
["ssh", system, "date -Iseconds"],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
return result.stdout.strip()
else:
return "unknown"
except Exception:
return "unknown"
def collect(system: str) -> Dict[str, Any]:
"""Main collection function for mounts data."""
collector = MountsCollector()
return collector.collect_mounts(system)
@@ -0,0 +1,223 @@
"""
Services Data Collector
Collects system service information including running services,
their states, startup configuration, and dependencies.
"""
import json
import logging
import subprocess
from typing import Dict, Any, List
logger = logging.getLogger(__name__)
class ServicesCollector:
"""Collector for system service information."""
def __init__(self):
self.service_manager = "systemd" # Default to systemd
self.include_disabled = False
def collect_services(self, system: str) -> Dict[str, Any]:
"""Collect service information from target system."""
logger.info(f"Collecting services data from {system}")
try:
# Detect service manager
service_manager = self.detect_service_manager(system)
if service_manager == "systemd":
services = self.collect_systemd_services(system)
elif service_manager == "sysv":
services = self.collect_sysv_services(system)
else:
raise RuntimeError(f"Unsupported service manager: {service_manager}")
return {
"service_manager": service_manager,
"services": services,
"timestamp": self.get_timestamp(system)
}
except Exception as e:
logger.error(f"Failed to collect services from {system}: {e}")
raise
def detect_service_manager(self, system: str) -> str:
"""Detect which service manager is running on the system."""
try:
# Check for systemd
result = subprocess.run(
["ssh", system, "ps -p 1 -o comm="],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
if "systemd" in result.stdout.strip():
return "systemd"
elif "init" in result.stdout.strip():
return "sysv"
# Fallback check
result = subprocess.run(
["ssh", system, "which systemctl"],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
return "systemd"
return "sysv"
except Exception:
return "unknown"
def collect_systemd_services(self, system: str) -> List[Dict[str, Any]]:
"""Collect systemd service information."""
services = []
try:
# Get all services; --plain drops the leading status bullet so the
# first whitespace-separated token is always the unit name
result = subprocess.run(
["ssh", system, "systemctl list-units --type=service --all --plain --no-pager --no-legend"],
capture_output=True,
text=True,
timeout=30
)
if result.returncode != 0:
raise RuntimeError(f"systemctl list-units failed: {result.stderr}")
# Parse service list
for line in result.stdout.strip().split('\n'):
if not line.strip():
continue
parts = line.split()
if len(parts) >= 4:
service_name = parts[0]
load_state = parts[1]
active_state = parts[2]
sub_state = parts[3]
# Skip units with load state "not-found" unless configured to include them
if not self.include_disabled and load_state == "not-found":
continue
# Get detailed service info
service_info = self.get_systemd_service_details(system, service_name)
services.append({
"name": service_name,
"load_state": load_state,
"active_state": active_state,
"sub_state": sub_state,
**service_info
})
except subprocess.TimeoutExpired:
logger.error(f"Timeout collecting systemd services from {system}")
raise
except Exception as e:
logger.error(f"Failed to collect systemd services from {system}: {e}")
raise
return services
def get_systemd_service_details(self, system: str, service_name: str) -> Dict[str, Any]:
"""Get detailed information for a systemd service."""
details = {}
try:
# Get service status
result = subprocess.run(
["ssh", system, f"systemctl show {service_name} --no-pager"],
capture_output=True,
text=True,
timeout=15
)
if result.returncode == 0:
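# systemctl show prints one KEY=VALUE property per line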
for line in result.stdout.strip().split('\n'):
if '=' in line:
key, value = line.split('=', 1)
details[key.lower()] = value
except Exception as e:
logger.warning(f"Failed to get details for {service_name}: {e}")
return details
def collect_sysv_services(self, system: str) -> List[Dict[str, Any]]:
"""Collect SysV init service information."""
services = []
try:
# Get service list from /etc/init.d/
result = subprocess.run(
["ssh", system, "ls -1 /etc/init.d/"],
capture_output=True,
text=True,
timeout=15
)
if result.returncode != 0:
raise RuntimeError(f"Failed to list init.d services: {result.stderr}")
for service_name in result.stdout.strip().split('\n'):
if not service_name.strip():
continue
# Get service status
status_result = subprocess.run(
["ssh", system, f"/etc/init.d/{service_name} status"],
capture_output=True,
text=True,
timeout=10
)
status = "unknown"
if status_result.returncode == 0:
status = "running"
elif "not running" in status_result.stdout.lower():
status = "stopped"
services.append({
"name": service_name,
"status": status,
"type": "sysv"
})
except Exception as e:
logger.error(f"Failed to collect SysV services from {system}: {e}")
raise
return services
def get_timestamp(self, system: str) -> str:
"""Get current timestamp from target system."""
try:
result = subprocess.run(
["ssh", system, "date -Iseconds"],
capture_output=True,
text=True,
timeout=10
)
if result.returncode == 0:
return result.stdout.strip()
else:
return "unknown"
except Exception:
return "unknown"
def collect(system: str) -> Dict[str, Any]:
"""Main collection function for services data."""
collector = ServicesCollector()
return collector.collect_services(system)
@@ -0,0 +1,608 @@
"""
HTML Report Generator
Generates comprehensive HTML reports from migration validation comparison results.
"""
import json
import logging
from typing import Dict, Any
from datetime import datetime
from pathlib import Path
logger = logging.getLogger(__name__)
class HTMLReportGenerator:
"""Generator for HTML migration validation reports."""
def __init__(self):
self.css_styles = self.get_css_styles()
self.js_scripts = self.get_js_scripts()
def generate_report(self, comparison: Dict[str, Any], output_file: str) -> str:
"""Generate HTML report from comparison data."""
logger.info(f"Generating HTML report: {output_file}")
html_content = self.build_html_content(comparison)
with open(output_file, 'w', encoding='utf-8') as f:
f.write(html_content)
logger.info(f"HTML report generated: {output_file}")
return output_file
def build_html_content(self, comparison: Dict[str, Any]) -> str:
"""Build complete HTML content."""
metadata = comparison.get("metadata", {})
summary = comparison.get("summary", {})
differences = comparison.get("differences", {})
risk_assessment = comparison.get("risk_assessment", {})
validation_results = comparison.get("validation_results", {})
html = f"""<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Migration Validation Report</title>
<style>{self.css_styles}</style>
</head>
<body>
<div class="container">
<header>
<h1>Migration Validation Report</h1>
<div class="report-meta">
<p><strong>Report Generated:</strong> {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}</p>
<p><strong>Comparison ID:</strong> {metadata.get('comparison_id', 'N/A')}</p>
<p><strong>Snapshot 1:</strong> {metadata.get('snapshot1', 'N/A')}</p>
<p><strong>Snapshot 2:</strong> {metadata.get('snapshot2', 'N/A')}</p>
</div>
</header>
<nav class="toc">
<h2>Table of Contents</h2>
<ul>
<li><a href="#executive-summary">Executive Summary</a></li>
<li><a href="#risk-assessment">Risk Assessment</a></li>
<li><a href="#validation-results">Validation Results</a></li>
<li><a href="#detailed-changes">Detailed Changes</a></li>
<li><a href="#recommendations">Recommendations</a></li>
</ul>
</nav>
<section id="executive-summary">
<h2>Executive Summary</h2>
{self.generate_summary_section(summary)}
</section>
<section id="risk-assessment">
<h2>Risk Assessment</h2>
{self.generate_risk_section(risk_assessment)}
</section>
<section id="validation-results">
<h2>Validation Results</h2>
{self.generate_validation_section(validation_results)}
</section>
<section id="detailed-changes">
<h2>Detailed Changes</h2>
{self.generate_changes_section(differences)}
</section>
<section id="recommendations">
<h2>Recommendations</h2>
{self.generate_recommendations_section(risk_assessment)}
</section>
</div>
<script>{self.js_scripts}</script>
</body>
</html>"""
return html
def generate_summary_section(self, summary: Dict[str, Any]) -> str:
"""Generate executive summary HTML."""
total_systems = summary.get('total_systems', 0)
systems_with_changes = summary.get('systems_with_changes', 0)
total_changes = summary.get('total_changes', 0)
html = f"""
<div class="summary-grid">
<div class="summary-card">
<h3>Systems Analyzed</h3>
<div class="metric">{total_systems}</div>
</div>
<div class="summary-card">
<h3>Systems with Changes</h3>
<div class="metric">{systems_with_changes}</div>
</div>
<div class="summary-card">
<h3>Total Changes</h3>
<div class="metric">{total_changes}</div>
</div>
</div>
<h3>Changes by Type</h3>
<table class="changes-table">
<thead>
<tr>
<th>Data Type</th>
<th>Changes</th>
</tr>
</thead>
<tbody>"""
for data_type, count in summary.get('changes_by_type', {}).items():
html += f"""
<tr>
<td>{data_type.replace('_', ' ').title()}</td>
<td>{count}</td>
</tr>"""
html += """
</tbody>
</table>
<h3>Most Affected Systems</h3>
<table class="systems-table">
<thead>
<tr>
<th>System</th>
<th>Changes</th>
</tr>
</thead>
<tbody>"""
for system, count in summary.get('most_affected_systems', []):
html += f"""
<tr>
<td>{system}</td>
<td>{count}</td>
</tr>"""
html += """
</tbody>
</table>"""
return html
def generate_risk_section(self, risk_assessment: Dict[str, Any]) -> str:
"""Generate risk assessment HTML."""
overall_risk = risk_assessment.get('overall_risk', 'unknown')
risk_color = self.get_risk_color(overall_risk)
html = f"""
<div class="risk-overview">
<h3>Overall Risk Level</h3>
<div class="risk-badge risk-{overall_risk}" style="background-color: {risk_color}">
{overall_risk.upper()}
</div>
</div>
<h3>Risk Factors</h3>
<div class="risk-factors">"""
for factor in risk_assessment.get('risk_factors', []):
factor_color = self.get_risk_color(self.get_risk_level_name(factor.get('level', 1)))
html += f"""
<div class="risk-factor">
<div class="risk-badge risk-{self.get_risk_level_name(factor.get('level', 1))}" style="background-color: {factor_color}">
{self.get_risk_level_name(factor.get('level', 1)).upper()}
</div>
<div class="risk-details">
<strong>{factor.get('type', 'Unknown').replace('_', ' ').title()}</strong>
<p>{factor.get('description', 'No description')}</p>
</div>
</div>"""
html += """
</div>
<h3>Critical Changes</h3>
<div class="critical-changes">"""
for change in risk_assessment.get('critical_changes', []):
html += f"""
<div class="critical-change">
<h4>{change.get('system', 'Unknown System')}</h4>
<p><strong>Type:</strong> {change.get('data_type', 'Unknown')}</p>
<p>{change.get('factor', {}).get('description', 'No details')}</p>
</div>"""
html += "</div>"
return html
def generate_validation_section(self, validation_results: Dict[str, Any]) -> str:
"""Generate validation results HTML."""
passed = validation_results.get('passed', False)
status_color = "#28a745" if passed else "#dc3545"
status_text = "PASSED" if passed else "FAILED"
html = f"""
<div class="validation-status">
<div class="status-indicator" style="background-color: {status_color}">
{status_text}
</div>
</div>
<h3>Validation Checks</h3>
<div class="validation-checks">"""
for check in validation_results.get('checks', []):
check_status = "✓" if check.get('passed', False) else "✗"
check_color = "#28a745" if check.get('passed', False) else "#dc3545"
html += f"""
<div class="validation-check">
<div class="check-status" style="color: {check_color}">{check_status}</div>
<div class="check-details">
<h4>{check.get('name', 'Unknown').replace('_', ' ').title()}</h4>
<p>{check.get('description', 'No description')}</p>
{"<ul>" + "".join(f"<li>{detail}</li>" for detail in check.get('details', [])) + "</ul>" if check.get('details') else ""}
</div>
</div>"""
html += "</div>"
return html
def generate_changes_section(self, differences: Dict[str, Any]) -> str:
"""Generate detailed changes HTML."""
html = ""
for data_type, systems in differences.items():
html += f"<h3>{data_type.replace('_', ' ').title()} Changes</h3>"
for system, system_diffs in systems.items():
html += f"<h4>System: {system}</h4>"
# Generate tables for different change types
html += self.generate_change_tables(system_diffs)
return html
def generate_change_tables(self, system_diffs: Dict[str, Any]) -> str:
"""Generate HTML tables for different types of changes."""
html = ""
change_types = {
'added_mounts': ('Added Mounts', ['mountpoint', 'device', 'fstype']),
'removed_mounts': ('Removed Mounts', ['mountpoint', 'device', 'fstype']),
'changed_mounts': ('Changed Mounts', ['mountpoint', 'before', 'after']),
'usage_changes': ('Usage Changes', ['mountpoint', 'before', 'after']),
'added_services': ('Added Services', ['name', 'active_state', 'sub_state']),
'removed_services': ('Removed Services', ['name', 'active_state', 'sub_state']),
'status_changes': ('Service Status Changes', ['name', 'before', 'after']),
'filesystem_changes': ('Filesystem Changes', ['mountpoint', 'before', 'after']),
'directory_size_changes': ('Directory Size Changes', ['path', 'before', 'after']),
'significant_usage_changes': ('Significant Usage Changes', ['mountpoint', 'change_percent', 'before', 'after'])
}
for change_key, (title, columns) in change_types.items():
if change_key in system_diffs and system_diffs[change_key]:
html += f"<h5>{title}</h5>"
html += '<table class="changes-table">'
html += '<thead><tr>'
for col in columns:
html += f'<th>{col.replace("_", " ").title()}</th>'
html += '</tr></thead><tbody>'
for item in system_diffs[change_key]:
html += '<tr>'
for col in columns:
value = item.get(col, '')
if isinstance(value, dict):
value = json.dumps(value, indent=2)
html += f'<td><pre>{value}</pre></td>'
html += '</tr>'
html += '</tbody></table>'
return html
def generate_recommendations_section(self, risk_assessment: Dict[str, Any]) -> str:
"""Generate recommendations HTML."""
recommendations = risk_assessment.get('recommendations', [])
html = '<div class="recommendations">'
if not recommendations:
html += '<p>No specific recommendations at this time.</p>'
else:
html += '<ul>'
for rec in recommendations:
html += f'<li>{rec}</li>'
html += '</ul>'
html += '</div>'
return html
def get_risk_color(self, risk_level: str) -> str:
"""Get color for risk level."""
colors = {
'low': '#28a745',
'medium': '#ffc107',
'high': '#fd7e14',
'critical': '#dc3545'
}
return colors.get(risk_level.lower(), '#6c757d')
def get_risk_level_name(self, level: int) -> str:
"""Get risk level name from numeric level."""
levels = {1: 'low', 2: 'medium', 3: 'high', 4: 'critical'}
return levels.get(level, 'unknown')
def get_css_styles(self) -> str:
"""Get CSS styles for the report."""
return """
body {
font-family: -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
line-height: 1.6;
color: #333;
margin: 0;
padding: 20px;
background-color: #f8f9fa;
}
.container {
max-width: 1200px;
margin: 0 auto;
background: white;
padding: 30px;
border-radius: 8px;
box-shadow: 0 2px 10px rgba(0,0,0,0.1);
}
header {
text-align: center;
margin-bottom: 40px;
padding-bottom: 20px;
border-bottom: 2px solid #e9ecef;
}
h1 {
color: #2c3e50;
margin-bottom: 10px;
}
.report-meta {
color: #6c757d;
font-size: 0.9em;
}
.toc {
background: #f8f9fa;
padding: 20px;
border-radius: 5px;
margin-bottom: 30px;
}
.toc ul {
list-style: none;
padding: 0;
}
.toc li {
margin: 5px 0;
}
.toc a {
color: #007bff;
text-decoration: none;
}
.toc a:hover {
text-decoration: underline;
}
section {
margin-bottom: 40px;
}
h2 {
color: #2c3e50;
border-bottom: 1px solid #e9ecef;
padding-bottom: 10px;
}
h3 {
color: #495057;
margin-top: 30px;
}
.summary-grid {
display: grid;
grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
gap: 20px;
margin: 20px 0;
}
.summary-card {
background: #f8f9fa;
padding: 20px;
border-radius: 5px;
text-align: center;
}
.metric {
font-size: 2em;
font-weight: bold;
color: #007bff;
margin: 10px 0;
}
table {
width: 100%;
border-collapse: collapse;
margin: 20px 0;
}
th, td {
padding: 12px;
text-align: left;
border-bottom: 1px solid #e9ecef;
}
th {
background-color: #f8f9fa;
font-weight: 600;
}
.risk-overview {
text-align: center;
margin: 20px 0;
}
.risk-badge {
display: inline-block;
padding: 8px 16px;
border-radius: 20px;
color: white;
font-weight: bold;
margin: 10px;
}
.risk-factors, .critical-changes {
margin: 20px 0;
}
.risk-factor, .critical-change {
background: #f8f9fa;
padding: 15px;
border-radius: 5px;
margin: 10px 0;
border-left: 4px solid #007bff;
}
.risk-factor {
display: flex;
align-items: center;
}
.risk-details {
margin-left: 15px;
}
.validation-status {
text-align: center;
margin: 20px 0;
}
.status-indicator {
display: inline-block;
padding: 15px 30px;
border-radius: 5px;
color: white;
font-weight: bold;
font-size: 1.2em;
}
.validation-checks {
margin: 20px 0;
}
.validation-check {
display: flex;
align-items: flex-start;
background: #f8f9fa;
padding: 15px;
border-radius: 5px;
margin: 10px 0;
}
.check-status {
font-size: 1.5em;
margin-right: 15px;
margin-top: 5px;
}
.check-details {
flex: 1;
}
.recommendations ul {
background: #e7f3ff;
padding: 20px;
border-radius: 5px;
border-left: 4px solid #007bff;
}
.recommendations li {
margin: 10px 0;
}
pre {
background: #f8f9fa;
padding: 10px;
border-radius: 3px;
overflow-x: auto;
max-width: 100%;
white-space: pre-wrap;
word-wrap: break-word;
}
@media (max-width: 768px) {
.container {
padding: 15px;
}
.summary-grid {
grid-template-columns: 1fr;
}
.risk-factor {
flex-direction: column;
text-align: center;
}
.validation-check {
flex-direction: column;
}
}
"""
def get_js_scripts(self) -> str:
"""Get JavaScript for interactive features."""
return """
// Add smooth scrolling to TOC links
document.addEventListener('DOMContentLoaded', function() {
const links = document.querySelectorAll('.toc a');
links.forEach(link => {
link.addEventListener('click', function(e) {
e.preventDefault();
const target = document.querySelector(this.getAttribute('href'));
if (target) {
target.scrollIntoView({ behavior: 'smooth' });
}
});
});
// Add collapsible sections for large tables
const tables = document.querySelectorAll('table');
tables.forEach(table => {
if (table.rows.length > 10) {
const toggle = document.createElement('button');
toggle.textContent = 'Toggle Details';
toggle.style.marginBottom = '10px';
toggle.addEventListener('click', function() {
const tbody = table.querySelector('tbody');
tbody.style.display = tbody.style.display === 'none' ? '' : 'none';
});
table.parentNode.insertBefore(toggle, table);
}
});
});
"""
def generate(comparison: Dict[str, Any], output_file: str) -> str:
"""Main function to generate HTML report."""
generator = HTMLReportGenerator()
return generator.generate_report(comparison, output_file)
@@ -0,0 +1,491 @@
"""
Snapshot Comparison Engine
Compares two system snapshots and identifies differences,
risk levels, and validation results.
"""
import json
import logging
from typing import Dict, Any, List, Tuple
from datetime import datetime
logger = logging.getLogger(__name__)
class SnapshotComparator:
"""Engine for comparing system snapshots."""
def __init__(self):
self.risk_levels = {
"low": 1,
"medium": 2,
"high": 3,
"critical": 4
}
def compare_snapshots(self, snapshot1: Dict[str, Any], snapshot2: Dict[str, Any]) -> Dict[str, Any]:
"""Compare two snapshots and return detailed comparison results."""
logger.info("Starting snapshot comparison")
comparison = {
"summary": {},
"differences": {},
"risk_assessment": {},
"validation_results": {}
}
# Compare each data type; snapshots are keyed data[system][data_type],
# so per-system presence checks happen inside compare_data_type
data_types = ["mounts", "services", "disk_usage"]
for data_type in data_types:
comparison["differences"][data_type] = self.compare_data_type(
snapshot1.get("data", {}), snapshot2.get("data", {}), data_type
)
# Generate summary
comparison["summary"] = self.generate_summary(comparison["differences"])
# Risk assessment
comparison["risk_assessment"] = self.assess_risks(comparison["differences"])
# Validation results
comparison["validation_results"] = self.validate_changes(comparison["differences"])
logger.info("Snapshot comparison completed")
return comparison
def compare_data_type(self, data1: Dict[str, Any], data2: Dict[str, Any], data_type: str) -> Dict[str, Any]:
"""Compare a specific data type between two snapshots."""
differences = {}
# Get all systems from both snapshots
systems1 = set(data1.keys())
systems2 = set(data2.keys())
all_systems = systems1.union(systems2)
for system in all_systems:
system_diffs = {}
if system not in data1:
system_diffs["status"] = "added"
system_diffs["details"] = {"new_system": True}
elif system not in data2:
system_diffs["status"] = "removed"
system_diffs["details"] = {"removed_system": True}
else:
# Compare data for this system and data type
if data_type in data1[system] and data_type in data2[system]:
system_diffs = self.compare_system_data(
data1[system][data_type],
data2[system][data_type],
data_type
)
else:
system_diffs["status"] = "data_missing"
system_diffs["details"] = {"missing_data_type": data_type}
if system_diffs:
differences[system] = system_diffs
return differences
def compare_system_data(self, data1: Dict[str, Any], data2: Dict[str, Any], data_type: str) -> Dict[str, Any]:
"""Compare data for a specific system and data type."""
differences = {}
if data_type == "mounts":
differences = self.compare_mounts(data1, data2)
elif data_type == "services":
differences = self.compare_services(data1, data2)
elif data_type == "disk_usage":
differences = self.compare_disk_usage(data1, data2)
else:
differences["status"] = "unknown_data_type"
return differences
def compare_mounts(self, mounts1: Dict[str, Any], mounts2: Dict[str, Any]) -> Dict[str, Any]:
"""Compare mounts data between snapshots."""
differences = {
"added_mounts": [],
"removed_mounts": [],
"changed_mounts": [],
"usage_changes": []
}
# Compare mount lists
mounts_list1 = mounts1.get("mounts", [])
mounts_list2 = mounts2.get("mounts", [])
# Create mountpoint maps
mounts_map1 = {m["mountpoint"]: m for m in mounts_list1}
mounts_map2 = {m["mountpoint"]: m for m in mounts_list2}
# Find added and removed mounts
added = set(mounts_map2.keys()) - set(mounts_map1.keys())
removed = set(mounts_map1.keys()) - set(mounts_map2.keys())
differences["added_mounts"] = [{"mountpoint": mp, **mounts_map2[mp]} for mp in added]
differences["removed_mounts"] = [{"mountpoint": mp, **mounts_map1[mp]} for mp in removed]
# Find changed mounts
common = set(mounts_map1.keys()) & set(mounts_map2.keys())
for mp in common:
m1, m2 = mounts_map1[mp], mounts_map2[mp]
if m1 != m2:
differences["changed_mounts"].append({
"mountpoint": mp,
"before": m1,
"after": m2
})
# Compare usage statistics
usage1 = mounts1.get("usage", {})
usage2 = mounts2.get("usage", {})
for mp in set(usage1.keys()) | set(usage2.keys()):
if mp in usage1 and mp in usage2:
u1, u2 = usage1[mp], usage2[mp]
if u1 != u2:
differences["usage_changes"].append({
"mountpoint": mp,
"before": u1,
"after": u2
})
return differences
def compare_services(self, services1: Dict[str, Any], services2: Dict[str, Any]) -> Dict[str, Any]:
"""Compare services data between snapshots."""
differences = {
"added_services": [],
"removed_services": [],
"status_changes": [],
"configuration_changes": []
}
# Compare service lists
services_list1 = services1.get("services", [])
services_list2 = services2.get("services", [])
# Create service maps
services_map1 = {s["name"]: s for s in services_list1}
services_map2 = {s["name"]: s for s in services_list2}
# Find added and removed services
added = set(services_map2.keys()) - set(services_map1.keys())
removed = set(services_map1.keys()) - set(services_map2.keys())
differences["added_services"] = [{"name": name, **services_map2[name]} for name in added]
differences["removed_services"] = [{"name": name, **services_map1[name]} for name in removed]
# Find status changes
common = set(services_map1.keys()) & set(services_map2.keys())
for name in common:
s1, s2 = services_map1[name], services_map2[name]
if s1.get("active_state") != s2.get("active_state") or s1.get("sub_state") != s2.get("sub_state"):
differences["status_changes"].append({
"name": name,
"before": {"active_state": s1.get("active_state"), "sub_state": s1.get("sub_state")},
"after": {"active_state": s2.get("active_state"), "sub_state": s2.get("sub_state")}
})
return differences
def compare_disk_usage(self, usage1: Dict[str, Any], usage2: Dict[str, Any]) -> Dict[str, Any]:
"""Compare disk usage data between snapshots."""
differences = {
"filesystem_changes": [],
"directory_size_changes": [],
"significant_usage_changes": []
}
# Compare filesystem usage
fs1 = usage1.get("filesystem_usage", [])
fs2 = usage2.get("filesystem_usage", [])
# Create filesystem maps by mountpoint
fs_map1 = {fs["mountpoint"]: fs for fs in fs1}
fs_map2 = {fs["mountpoint"]: fs for fs in fs2}
common_fs = set(fs_map1.keys()) & set(fs_map2.keys())
for mp in common_fs:
f1, f2 = fs_map1[mp], fs_map2[mp]
if f1 != f2:
differences["filesystem_changes"].append({
"mountpoint": mp,
"before": f1,
"after": f2
})
# Check for significant usage changes
try:
use1 = int(f1.get("use_percent", "0").rstrip("%"))
use2 = int(f2.get("use_percent", "0").rstrip("%"))
if abs(use2 - use1) > 10: # 10% change threshold
differences["significant_usage_changes"].append({
"mountpoint": mp,
"change_percent": use2 - use1,
"before": f1,
"after": f2
})
except (ValueError, KeyError):
pass
return differences
def generate_summary(self, differences: Dict[str, Any]) -> Dict[str, Any]:
"""Generate a summary of all differences."""
summary = {
"total_systems": len(differences),
"systems_with_changes": 0,
"total_changes": 0,
"changes_by_type": {},
"most_affected_systems": []
}
system_change_counts = {}
for data_type, systems in differences.items():
summary["changes_by_type"][data_type] = 0
for system, system_diffs in systems.items():
if system not in system_change_counts:
system_change_counts[system] = 0
# Count changes for this system and data type
change_count = self.count_changes(system_diffs)
system_change_counts[system] += change_count
summary["changes_by_type"][data_type] += change_count
summary["total_changes"] += change_count
# Count systems with changes
summary["systems_with_changes"] = len([s for s in system_change_counts.values() if s > 0])
# Find most affected systems
sorted_systems = sorted(system_change_counts.items(), key=lambda x: x[1], reverse=True)
summary["most_affected_systems"] = sorted_systems[:5]
return summary
def count_changes(self, system_diffs: Dict[str, Any]) -> int:
"""Count the number of changes in system differences."""
count = 0
for key, value in system_diffs.items():
if isinstance(value, list):
count += len(value)
elif isinstance(value, dict) and key not in ["status"]:
# Count items in nested change lists rather than the lists themselves
count += sum(len(v) for v in value.values() if isinstance(v, list))
return count
def assess_risks(self, differences: Dict[str, Any]) -> Dict[str, Any]:
"""Assess risk levels for the changes."""
risk_assessment = {
"overall_risk": "low",
"risk_factors": [],
"critical_changes": [],
"recommendations": []
}
max_risk_level = 1
# Analyze each type of change
for data_type, systems in differences.items():
for system, system_diffs in systems.items():
risk_factors = self.analyze_system_risks(system_diffs, data_type)
risk_assessment["risk_factors"].extend(risk_factors)
for factor in risk_factors:
if factor["level"] > max_risk_level:
max_risk_level = factor["level"]
if factor["level"] >= 4: # Critical
risk_assessment["critical_changes"].append({
"system": system,
"data_type": data_type,
"factor": factor
})
# Set overall risk
risk_levels = {1: "low", 2: "medium", 3: "high", 4: "critical"}
risk_assessment["overall_risk"] = risk_levels.get(max_risk_level, "unknown")
# Generate recommendations
risk_assessment["recommendations"] = self.generate_recommendations(risk_assessment)
return risk_assessment
def analyze_system_risks(self, system_diffs: Dict[str, Any], data_type: str) -> List[Dict[str, Any]]:
"""Analyze risks for a specific system's changes."""
risk_factors = []
if data_type == "mounts":
# Check for removed critical mounts
for mount in system_diffs.get("removed_mounts", []):
if mount["mountpoint"] in ["/", "/boot", "/usr", "/var"]:
risk_factors.append({
"type": "critical_mount_removed",
"description": f"Critical mount point removed: {mount['mountpoint']}",
"level": 4
})
# Check for significant usage changes
for change in system_diffs.get("usage_changes", []):
try:
before_pct = int(change["before"].get("use_percent", "0").rstrip("%"))
after_pct = int(change["after"].get("use_percent", "0").rstrip("%"))
if after_pct > 95:
risk_factors.append({
"type": "filesystem_full",
"description": f"Filesystem usage critical: {change['mountpoint']} at {after_pct}%",
"level": 3
})
except (ValueError, KeyError):
pass
elif data_type == "services":
# Check for critical service changes
critical_services = ["sshd", "systemd", "networking", "dbus"]
for service in system_diffs.get("removed_services", []):
if service["name"] in critical_services:
risk_factors.append({
"type": "critical_service_removed",
"description": f"Critical service removed: {service['name']}",
"level": 4
})
for change in system_diffs.get("status_changes", []):
if change["after"]["active_state"] == "failed":
risk_factors.append({
"type": "service_failure",
"description": f"Service failed: {change['name']}",
"level": 3
})
elif data_type == "disk_usage":
for change in system_diffs.get("significant_usage_changes", []):
if change["change_percent"] > 20:
risk_factors.append({
"type": "disk_usage_spike",
"description": f"Significant disk usage change: {change['mountpoint']} ({change['change_percent']}%)",
"level": 2
})
return risk_factors
def generate_recommendations(self, risk_assessment: Dict[str, Any]) -> List[str]:
"""Generate recommendations based on risk assessment."""
recommendations = []
if risk_assessment["overall_risk"] in ["high", "critical"]:
recommendations.append("Immediate review required - critical changes detected")
recommendations.append("Consider rolling back migration if critical services are affected")
if any(f["type"] == "critical_mount_removed" for f in risk_assessment["risk_factors"]):
recommendations.append("Verify system boot capability after mount changes")
if any(f["type"] == "critical_service_removed" for f in risk_assessment["risk_factors"]):
recommendations.append("Ensure critical services are restored before production cutover")
if any(f["type"] == "filesystem_full" for f in risk_assessment["risk_factors"]):
recommendations.append("Monitor disk space closely - cleanup may be required")
if not recommendations:
recommendations.append("Changes appear safe - proceed with standard validation procedures")
return recommendations
def validate_changes(self, differences: Dict[str, Any]) -> Dict[str, Any]:
"""Validate that changes meet requirements."""
validation_results = {
"passed": True,
"checks": [],
"failed_checks": []
}
# Define validation checks
checks = [
self.check_critical_services_running,
self.check_filesystem_integrity,
self.check_no_critical_mounts_removed
]
for check_func in checks:
check_result = check_func(differences)
validation_results["checks"].append(check_result)
if not check_result["passed"]:
validation_results["passed"] = False
validation_results["failed_checks"].append(check_result)
return validation_results
def check_critical_services_running(self, differences: Dict[str, Any]) -> Dict[str, Any]:
"""Check that critical services are still running."""
check = {
"name": "critical_services_running",
"description": "Verify critical services remain operational",
"passed": True,
"details": []
}
critical_services = ["sshd", "systemd"]
for data_type, systems in differences.items():
if data_type == "services":
for system, system_diffs in systems.items():
for change in system_diffs.get("status_changes", []):
if change["name"] in critical_services:
if change["after"]["active_state"] == "failed":
check["passed"] = False
check["details"].append(f"Critical service {change['name']} failed on {system}")
return check
def check_filesystem_integrity(self, differences: Dict[str, Any]) -> Dict[str, Any]:
"""Check filesystem integrity after changes."""
check = {
"name": "filesystem_integrity",
"description": "Verify filesystem integrity maintained",
"passed": True,
"details": []
}
for data_type, systems in differences.items():
if data_type == "disk_usage":
for system, system_diffs in systems.items():
for change in system_diffs.get("significant_usage_changes", []):
if change["change_percent"] > 50: # Arbitrary threshold
check["passed"] = False
check["details"].append(f"Extreme usage change on {system}:{change['mountpoint']}")
return check
def check_no_critical_mounts_removed(self, differences: Dict[str, Any]) -> Dict[str, Any]:
"""Check that no critical mount points were removed."""
check = {
"name": "no_critical_mounts_removed",
"description": "Verify critical mount points remain",
"passed": True,
"details": []
}
critical_mounts = ["/", "/boot", "/usr", "/var"]
for data_type, systems in differences.items():
if data_type == "mounts":
for system, system_diffs in systems.items():
for mount in system_diffs.get("removed_mounts", []):
if mount["mountpoint"] in critical_mounts:
check["passed"] = False
check["details"].append(f"Critical mount {mount['mountpoint']} removed from {system}")
return check
def compare_snapshots(snapshot1: Dict[str, Any], snapshot2: Dict[str, Any]) -> Dict[str, Any]:
"""Main comparison function."""
comparator = SnapshotComparator()
return comparator.compare_snapshots(snapshot1, snapshot2)
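For reference, a minimal usage sketch of the comparator above. The snapshot fragments are illustrative: they only assume the `{"mounts": [...], "usage": {...}}` shape that `compare_mounts` consumes, and the `device`/`fstype` keys are examples rather than a fixed schema.

```python
# Minimal sketch: diff two mount snapshots with SnapshotComparator.
# The "device"/"fstype" keys are illustrative; compare_mounts only
# requires that each mount entry carry a "mountpoint" key.
before = {
    "mounts": [{"mountpoint": "/", "device": "/dev/sda1", "fstype": "ext4"}],
    "usage": {"/": {"use_percent": "60%"}},
}
after = {
    "mounts": [{"mountpoint": "/", "device": "/dev/sda1", "fstype": "xfs"}],
    "usage": {"/": {"use_percent": "75%"}},
}

comparator = SnapshotComparator()
diff = comparator.compare_mounts(before, after)
print(diff["changed_mounts"])  # the fstype change surfaces here
print(diff["usage_changes"])   # the 60% -> 75% usage delta surfaces here
```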
+461
View File
@@ -0,0 +1,461 @@
# Observability Stack
A monitoring and logging stack for enterprise infrastructure observability, built on the ELK stack (Elasticsearch, Logstash, Kibana) and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.
## Overview
The Observability Stack provides a complete monitoring solution with:
- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures
## Architecture
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Log Sources │ │ Logstash │ │ Elasticsearch │
│ (Applications │───►│ (Ingestion & │───►│ (Storage & │
│ / Systems) │ │ Processing) │ │ Analytics) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Alerting │ │ Kibana │ │ Grafana │
│ Rules │ │ (Dashboards & │ │ (Metrics & │
│ │ │ Exploration) │ │ Dashboards) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
## Quick Start
### Prerequisites
- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available
### Setup
```bash
cd observability-stack
# Start the observability stack
docker-compose up -d
# Wait for services to be ready (may take 2-3 minutes)
sleep 180
# Verify services are running
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```
### Access Interfaces
- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200 (elastic/elastic)
## Project Structure
```
observability-stack/
├── docker-compose.yml # Service orchestration
├── logstash/ # Logstash configuration
│ ├── pipeline/ # Processing pipelines
│ └── config/ # Logstash settings
├── elasticsearch/ # Elasticsearch configuration
│ └── config/ # Cluster settings
├── kibana/ # Kibana configuration
│ └── config/ # Dashboard settings
├── grafana/ # Grafana configuration
│ ├── provisioning/ # Dashboards and datasources
│ └── dashboards/ # Dashboard definitions
├── logs/ # Sample log data
│ └── sample.log # Realistic application logs
├── alerting/ # Alert configuration
│ └── alert_rules.yml # Alert definitions
├── scenarios/ # Incident simulation
│ └── incident_simulation.sh # Simulation scripts
└── README.md
```
## Services Configuration
### Elasticsearch
**Configuration**: `elasticsearch/config/elasticsearch.yml`
Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access
**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data
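As a quick sanity check, the `logs-*` pattern above can be queried directly. A minimal sketch using Python's `requests`, assuming the default `elastic/elastic` credentials from `docker-compose.yml`:

```python
import requests

# Search logs-* for ERROR messages; credentials match the compose defaults.
resp = requests.get(
    "http://localhost:9200/logs-*/_search",
    auth=("elastic", "elastic"),
    json={"query": {"match": {"message": "ERROR"}}},
    timeout=10,
)
resp.raise_for_status()
for hit in resp.json()["hits"]["hits"]:
    print(hit["_index"], hit["_source"].get("message", ""))
```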
### Logstash
**Pipelines**: `logstash/pipeline/`
- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing
**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
### Kibana
**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard
**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting
### Grafana
**Data Sources**:
- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data
**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
## Log Ingestion
### Sample Data
The stack includes realistic sample logs for testing:
```bash
# Ingest sample logs via the Logstash HTTP input
curl -X POST "localhost:8080" \
-H "Content-Type: text/plain" \
--data-binary @logs/sample.log
```
### Log Formats Supported
- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
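As an illustration of the custom format, here is a small hedged parser for the bracketed application-log lines in `logs/sample.log`. The field names (`level`, `logger`, `message`) are assumptions for this sketch, not a fixed schema:

```python
import re
from datetime import datetime
from typing import Optional

# Matches lines like: [2024-01-15 08:30:15] INFO com.example.app.Application - started
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s+(?P<message>.*)$"
)

def parse_line(line: str) -> Optional[dict]:
    """Return a structured event, or None when the line does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    event = m.groupdict()
    event["timestamp"] = datetime.strptime(event.pop("ts"), "%Y-%m-%d %H:%M:%S")
    return event

print(parse_line("[2024-01-15 08:35:12] ERROR com.example.api.OrderController - Payment gateway timeout"))
```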
### Data Enrichment
Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
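To exercise the enrichment pipeline end to end, a single structured event can be pushed into the Logstash HTTP input (port 8080 in `docker-compose.yml`). A hedged sketch, where the field names are illustrative and the indexed document is expected to gain GeoIP and user-agent fields on the way through:

```python
import requests

# Push one event into the Logstash HTTP input; field names are illustrative.
event = {
    "message": "GET /api/health HTTP/1.1",
    "client_ip": "192.168.1.100",
    "user_agent": "curl/7.68.0",
}
resp = requests.post("http://localhost:8080", json=event, timeout=10)
resp.raise_for_status()
print("accepted with status", resp.status_code)
```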
## Alerting and Monitoring
### Alert Rules
Located in `alerting/alert_rules.yml`:
```yaml
alert_rules:
- name: "High CPU Usage"
condition: "cpu_usage > 90"
duration: "5m"
severity: "critical"
channels: ["email", "slack"]
- name: "Disk Space Low"
condition: "disk_usage > 85"
duration: "10m"
severity: "warning"
channels: ["email"]
- name: "Service Down"
condition: "service_status == 'down'"
duration: "2m"
severity: "critical"
channels: ["email", "pagerduty"]
```
### Alert Channels
- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints
## Incident Simulation
### Available Scenarios
```bash
cd scenarios
# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical
# Simulate service failure
./incident_simulation.sh --type service-down --service nginx
# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms
# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```
### Scenario Types
- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation
## Dashboards and Visualization
### Kibana Dashboards
Pre-configured dashboards for:
1. **Log Analysis**
- Log volume over time
- Error rate trends
- Top error messages
- Geographic request distribution
2. **System Monitoring**
- CPU and memory usage
- Disk I/O statistics
- Network traffic
- System load averages
3. **Application Performance**
- Response time distributions
- Request rate metrics
- Error percentages
- User session analytics
### Grafana Dashboards
Advanced visualization panels:
- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting
## API Endpoints
### Elasticsearch APIs
```bash
# Cluster health
GET /_cluster/health
# Index statistics
GET /_cat/indices?v
# Search logs
GET /logs-*/_search
{
"query": {
"match": {
"message": "ERROR"
}
}
}
```
### Kibana APIs
```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard
# Export visualizations
GET /api/saved_objects/visualization/{id}
```
### Grafana APIs
```bash
# Get dashboard list
GET /api/search?query=*
# Alert status
GET /api/alerts
```
## Configuration Management
### Environment Variables
```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"
# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"
# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```
### Scaling Configuration
For production deployment:
```yaml
version: '3.8'
services:
elasticsearch:
deploy:
replicas: 3
resources:
limits:
memory: 4G
cpus: '2.0'
```
## Security Considerations
### Authentication
- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled
### Network Security
- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)
### Data Protection
- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
## Troubleshooting
### Common Issues
**Elasticsearch Won't Start:**
```bash
# Check memory allocation
docker-compose logs elasticsearch
# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```
**Logstash Pipeline Errors:**
```bash
# Check pipeline configuration
docker-compose logs logstash
# Validate pipeline syntax
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
```
**Kibana Connection Issues:**
```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"
# Check Kibana logs
docker-compose logs kibana
```
### Performance Tuning
**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization
**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues
**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
## Development and Testing
### Adding New Dashboards
1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service
### Custom Alert Rules
1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios
### Log Pipeline Development
1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana
## Backup and Recovery
### Data Backup
```bash
# Elasticsearch snapshot (assumes a snapshot repository named "backup" is already registered)
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
-H "Content-Type: application/json" \
-d '{"indices": "*"}'
```
### Configuration Backup
```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
logstash/ elasticsearch/ kibana/ grafana/
```
## Contributing
1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility
## License
Enterprise Internal Use Only
@@ -0,0 +1,326 @@
# Enterprise Observability Alert Rules
# Alert definitions for automated incident detection and notification
alert_rules:
# System Resource Alerts
- name: "High CPU Usage"
description: "CPU utilization exceeds threshold"
condition: "cpu_usage_percent > 90"
duration: "5m"
severity: "critical"
tags:
- system
- performance
channels:
- email
- slack
labels:
team: "platform"
component: "system"
- name: "High Memory Usage"
description: "Memory utilization exceeds threshold"
condition: "memory_usage_percent > 85"
duration: "3m"
severity: "warning"
tags:
- system
- memory
channels:
- email
labels:
team: "platform"
component: "system"
- name: "Disk Space Critical"
description: "Disk usage exceeds critical threshold"
condition: "disk_usage_percent > 95"
duration: "2m"
severity: "critical"
tags:
- storage
- disk
channels:
- email
- pagerduty
labels:
team: "platform"
component: "storage"
- name: "Disk Space Warning"
description: "Disk usage exceeds warning threshold"
condition: "disk_usage_percent > 85"
duration: "10m"
severity: "warning"
tags:
- storage
- disk
channels:
- email
labels:
team: "platform"
component: "storage"
# Service Availability Alerts
- name: "Service Down"
description: "Critical service is not responding"
condition: "service_status == 'down' OR http_status_code >= 500"
duration: "2m"
severity: "critical"
tags:
- service
- availability
channels:
- email
- slack
- pagerduty
labels:
team: "application"
component: "service"
- name: "Database Connection Failed"
description: "Database connection pool exhausted or unresponsive"
condition: "db_connections_active == 0 OR db_response_time > 5000"
duration: "1m"
severity: "critical"
tags:
- database
- connectivity
channels:
- email
- pagerduty
labels:
team: "database"
component: "postgresql"
- name: "Cache Unavailable"
description: "Cache service is down or unresponsive"
condition: "cache_hit_ratio < 0.1 OR cache_response_time > 1000"
duration: "3m"
severity: "warning"
tags:
- cache
- performance
channels:
- email
labels:
team: "infrastructure"
component: "redis"
# Application Performance Alerts
- name: "High Error Rate"
description: "Application error rate exceeds threshold"
condition: "error_rate_percent > 5"
duration: "5m"
severity: "critical"
tags:
- application
- errors
channels:
- email
- slack
labels:
team: "application"
component: "api"
- name: "Slow Response Time"
description: "API response time exceeds SLA"
condition: "response_time_p95 > 2000"
duration: "5m"
severity: "warning"
tags:
- application
- performance
channels:
- email
labels:
team: "application"
component: "api"
- name: "High Request Queue"
description: "Request queue depth is too high"
condition: "queue_depth > 100"
duration: "3m"
severity: "warning"
tags:
- application
- queue
channels:
- email
labels:
team: "application"
component: "queue"
# Infrastructure Alerts
- name: "Network Latency High"
description: "Network round-trip time exceeds threshold"
condition: "network_rtt > 100"
duration: "5m"
severity: "warning"
tags:
- network
- latency
channels:
- email
labels:
team: "network"
component: "infrastructure"
- name: "Load Balancer Unhealthy"
description: "Load balancer backend servers are unhealthy"
condition: "lb_unhealthy_backends > 0"
duration: "2m"
severity: "critical"
tags:
- loadbalancer
- availability
channels:
- email
- pagerduty
labels:
team: "infrastructure"
component: "loadbalancer"
# Security Alerts
- name: "Failed Login Attempts"
description: "Multiple failed authentication attempts detected"
condition: "failed_login_attempts > 5"
duration: "5m"
severity: "warning"
tags:
- security
- authentication
channels:
- email
- slack
labels:
team: "security"
component: "authentication"
- name: "Suspicious Network Traffic"
description: "Unusual network traffic patterns detected"
condition: "network_bytes_unusual > 1000000"
duration: "10m"
severity: "warning"
tags:
- security
- network
channels:
- email
labels:
team: "security"
component: "network"
# Log-based Alerts
- name: "Application Errors"
description: "High volume of application error logs"
condition: "log_errors_per_minute > 10"
duration: "2m"
severity: "warning"
tags:
- logs
- errors
channels:
- email
labels:
team: "application"
component: "logs"
- name: "Out of Memory Errors"
description: "Out of memory errors detected in logs"
condition: "log_oom_errors > 0"
duration: "1m"
severity: "critical"
tags:
- memory
- errors
channels:
- email
- pagerduty
labels:
team: "application"
component: "memory"
# Business Logic Alerts
- name: "Low Business Transactions"
description: "Business transaction volume below expected threshold"
condition: "business_transactions_per_hour < 100"
duration: "15m"
severity: "warning"
tags:
- business
- transactions
channels:
- email
labels:
team: "business"
component: "transactions"
- name: "Payment Failures"
description: "Payment processing failure rate is high"
condition: "payment_failure_rate > 0.05"
duration: "5m"
severity: "critical"
tags:
- payments
- business
channels:
- email
- pagerduty
labels:
team: "payments"
component: "processing"
# Alert Channels Configuration
alert_channels:
email:
type: "email"
recipients:
- "platform-team@company.com"
- "oncall@company.com"
subject_template: "[{{severity}}] {{name}} - {{description}}"
slack:
type: "slack"
webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
channel: "#alerts"
username: "Observability Bot"
icon_emoji: ":warning:"
pagerduty:
type: "pagerduty"
integration_key: "your-pagerduty-integration-key"
severity_mapping:
critical: "critical"
warning: "warning"
info: "info"
# Alert Silencing Rules
silence_rules:
- name: "Maintenance Window"
condition: "maintenance_window == true"
duration: "4h"
comment: "Silenced during scheduled maintenance"
- name: "Known Issue"
condition: "known_issue_id == 'TICKET-123'"
duration: "24h"
comment: "Silenced for known issue resolution"
# Escalation Policies
escalation_policies:
- name: "Default Escalation"
steps:
- delay: "5m"
channels: ["email"]
- delay: "15m"
channels: ["slack"]
- delay: "30m"
channels: ["pagerduty"]
- name: "Critical Escalation"
steps:
- delay: "0m"
channels: ["email", "slack", "pagerduty"]
- delay: "10m"
channels: ["pagerduty"] # Escalation
+122
View File
@@ -0,0 +1,122 @@
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
container_name: observability-elasticsearch
environment:
- discovery.type=single-node
- xpack.security.enabled=true
- ELASTIC_PASSWORD=elastic
- "ES_JAVA_OPTS=-Xms1g -Xmx1g"
volumes:
- elasticsearch_data:/usr/share/elasticsearch/data
- ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
ports:
- "9200:9200"
- "9300:9300"
networks:
- observability
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health || exit 1"]
interval: 30s
timeout: 10s
retries: 5
logstash:
image: docker.elastic.co/logstash/logstash:8.11.0
container_name: observability-logstash
environment:
- "LS_JAVA_OPTS=-Xms512m -Xmx512m"
volumes:
- ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
- ./logstash/pipeline:/usr/share/logstash/pipeline
- ./logs:/usr/share/logstash/logs
ports:
- "5044:5044"
- "8080:8080"
networks:
- observability
depends_on:
elasticsearch:
condition: service_healthy
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:8080 || exit 1"]
interval: 30s
timeout: 10s
retries: 3
kibana:
image: docker.elastic.co/kibana/kibana:8.11.0
container_name: observability-kibana
environment:
- ELASTICSEARCH_HOSTS=http://elasticsearch:9200
- ELASTICSEARCH_USERNAME=elastic
- ELASTICSEARCH_PASSWORD=elastic
volumes:
- ./kibana/config/kibana.yml:/usr/share/kibana/config/kibana.yml
ports:
- "5601:5601"
networks:
- observability
depends_on:
elasticsearch:
condition: service_healthy
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:5601/api/status || exit 1"]
interval: 30s
timeout: 10s
retries: 5
grafana:
image: grafana/grafana:10.2.0
container_name: observability-grafana
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=admin
- GF_USERS_ALLOW_SIGN_UP=false
volumes:
- grafana_data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
- ./grafana/dashboards:/var/lib/grafana/dashboards
ports:
- "3000:3000"
networks:
- observability
restart: unless-stopped
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
interval: 30s
timeout: 10s
retries: 3
filebeat:
image: docker.elastic.co/beats/filebeat:8.11.0
container_name: observability-filebeat
user: root
volumes:
- ./filebeat/config/filebeat.yml:/usr/share/filebeat/filebeat.yml
- ./logs:/var/log/sample
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /var/run/docker.sock:/var/run/docker.sock:ro
networks:
- observability
depends_on:
- logstash
restart: unless-stopped
volumes:
elasticsearch_data:
driver: local
grafana_data:
driver: local
networks:
observability:
driver: bridge
ipam:
config:
- subnet: 172.25.0.0/16
@@ -0,0 +1,2 @@
[2026-04-29 22:52:26] INFO Incident simulation script started
[2026-04-29 22:52:26] INFO Scenario: help
+98
View File
@@ -0,0 +1,98 @@
# Sample Application Logs for Observability Stack Testing
# Generated realistic log entries for demonstration and testing
[2024-01-15 08:30:15] INFO com.example.app.Application - Application started successfully on port 8080
[2024-01-15 08:30:16] INFO com.example.database.ConnectionPool - Database connection pool initialized with 10 connections
[2024-01-15 08:30:17] INFO com.example.cache.RedisCache - Redis cache connected successfully
[2024-01-15 08:30:18] INFO com.example.messaging.RabbitMQClient - Message queue connection established
[2024-01-15 08:31:22] INFO com.example.api.UserController - User login attempt: user=john.doe, ip=192.168.1.100
[2024-01-15 08:31:23] INFO com.example.auth.AuthService - Authentication successful for user john.doe
[2024-01-15 08:31:24] INFO com.example.api.UserController - API request: GET /api/users/profile, status=200, duration=45ms
[2024-01-15 08:32:01] WARN com.example.cache.RedisCache - Cache miss for key: user_profile_12345
[2024-01-15 08:32:02] INFO com.example.database.UserRepository - Database query executed: SELECT * FROM users WHERE id = ?, duration=120ms
[2024-01-15 08:32:03] INFO com.example.cache.RedisCache - Cache populated for key: user_profile_12345
[2024-01-15 08:35:12] ERROR com.example.api.OrderController - Failed to process order: order_id=ORD-2024-001, error=Payment gateway timeout
[2024-01-15 08:35:13] WARN com.example.messaging.RabbitMQClient - Message delivery failed, retrying in 5 seconds
[2024-01-15 08:35:18] INFO com.example.messaging.RabbitMQClient - Message delivered successfully after retry
[2024-01-15 08:40:05] INFO com.example.monitoring.HealthCheck - Health check passed: database=OK, cache=OK, messaging=OK
[2024-01-15 08:40:06] INFO com.example.metrics.MetricsCollector - Metrics collected: active_users=1250, requests_per_second=45.2
[2024-01-15 08:45:30] ERROR com.example.external.PaymentGateway - Payment gateway connection failed: Connection refused
[2024-01-15 08:45:31] ERROR com.example.api.PaymentController - Payment processing failed for transaction TXN-789012
[2024-01-15 08:45:32] WARN com.example.circuitbreaker.PaymentCircuitBreaker - Circuit breaker opened for payment service
[2024-01-15 08:50:15] INFO com.example.batch.BatchProcessor - Batch job started: job_id=BATCH-001, records=10000
[2024-01-15 08:50:45] INFO com.example.batch.BatchProcessor - Batch job progress: processed=2500/10000 (25%)
[2024-01-15 08:51:15] INFO com.example.batch.BatchProcessor - Batch job progress: processed=5000/10000 (50%)
[2024-01-15 08:51:45] INFO com.example.batch.BatchProcessor - Batch job progress: processed=7500/10000 (75%)
[2024-01-15 08:52:10] INFO com.example.batch.BatchProcessor - Batch job completed: job_id=BATCH-001, duration=175s, success=true
[2024-01-15 09:00:00] INFO com.example.scheduled.CleanupJob - Scheduled cleanup job started
[2024-01-15 09:00:05] INFO com.example.scheduled.CleanupJob - Cleaned up 150 expired sessions
[2024-01-15 09:00:10] INFO com.example.scheduled.CleanupJob - Removed 25 temporary files
[2024-01-15 09:00:15] INFO com.example.scheduled.CleanupJob - Cleanup job completed successfully
[2024-01-15 09:15:22] WARN com.example.database.ConnectionPool - Connection pool nearing capacity: active=8/10
[2024-01-15 09:15:23] INFO com.example.database.ConnectionPool - Connection pool expanded to 15 connections
[2024-01-15 09:30:45] ERROR com.example.api.ProductController - Database query timeout: query=SELECT * FROM products WHERE category = 'electronics'
[2024-01-15 09:30:46] WARN com.example.database.ConnectionPool - Connection pool exhausted, rejecting request
[2024-01-15 09:30:47] ERROR com.example.api.ProductController - Service temporarily unavailable, status=503
[2024-01-15 09:45:12] INFO com.example.monitoring.AlertManager - Alert triggered: High CPU usage detected (85%)
[2024-01-15 09:45:13] INFO com.example.autoscaling.ScaleManager - Auto-scaling initiated: increasing instances from 3 to 5
[2024-01-15 10:00:00] INFO com.example.backup.BackupService - Database backup started
[2024-01-15 10:05:30] INFO com.example.backup.BackupService - Database backup completed: size=2.3GB, duration=330s
[2024-01-15 10:30:15] WARN com.example.security.SecurityFilter - Suspicious activity detected: multiple failed login attempts from IP 10.0.0.50
[2024-01-15 10:30:16] INFO com.example.security.SecurityFilter - IP 10.0.0.50 blocked for 15 minutes
[2024-01-15 11:00:00] INFO com.example.reporting.ReportGenerator - Daily report generation started
[2024-01-15 11:05:00] INFO com.example.reporting.ReportGenerator - Daily report completed: transactions=15420, revenue=$125,430.50
[2024-01-15 12:00:00] ERROR com.example.external.APIGateway - External API rate limit exceeded: retrying in 60 seconds
[2024-01-15 12:01:00] INFO com.example.external.APIGateway - External API connection restored
[2024-01-15 13:15:30] CRITICAL com.example.system.SystemMonitor - Disk space critical: /var/log usage at 95%
[2024-01-15 13:15:31] INFO com.example.maintenance.LogRotation - Emergency log rotation initiated
[2024-01-15 13:15:35] INFO com.example.maintenance.LogRotation - Log rotation completed: freed 2.1GB space
[2024-01-15 14:00:00] INFO com.example.metrics.PerformanceMonitor - Performance baseline established: avg_response_time=245ms, throughput=1200 req/sec
[2024-01-15 15:30:45] WARN com.example.cache.RedisCache - Cache cluster node down: redis-node-03
[2024-01-15 15:30:46] INFO com.example.cache.RedisCache - Failover initiated to redis-node-04
[2024-01-15 16:45:12] ERROR com.example.messaging.MessageProcessor - Message processing failed: invalid message format
[2024-01-15 16:45:13] INFO com.example.messaging.DeadLetterQueue - Message moved to dead letter queue
[2024-01-15 17:00:00] INFO com.example.backup.BackupService - Full system backup started
[2024-01-15 17:30:00] INFO com.example.backup.BackupService - Full system backup completed: size=45.2GB, duration=1800s
[2024-01-15 18:00:00] INFO com.example.monitoring.HealthCheck - Evening health check: all systems operational
[2024-01-15 18:00:01] INFO com.example.metrics.MetricsCollector - End of day metrics: total_requests=125000, error_rate=0.02%, avg_response_time=198ms
# Additional realistic log patterns for testing
[2024-01-15 08:15:30] INFO nginx: 192.168.1.100 - - [15/Jan/2024:08:15:30 +0000] "GET /api/health HTTP/1.1" 200 123 "-" "curl/7.68.0"
[2024-01-15 08:15:31] INFO nginx: 192.168.1.101 - - [15/Jan/2024:08:15:31 +0000] "POST /api/login HTTP/1.1" 200 456 "-" "Mozilla/5.0 ..."
[2024-01-15 08:15:32] WARN nginx: 192.168.1.102 - - [15/Jan/2024:08:15:32 +0000] "GET /api/admin HTTP/1.1" 403 234 "-" "Mozilla/5.0 ..."
[2024-01-15 09:20:15] INFO systemd: Started Session 1234 of user john.doe
[2024-01-15 09:20:16] INFO systemd: Started User Manager for UID 1000
[2024-01-15 09:20:17] INFO systemd: Started Session 1235 of user jane.smith
[2024-01-15 10:45:30] WARN kernel: [ 1234.567890] CPU0: Core temperature above threshold, cpu clock throttled
[2024-01-15 10:45:31] INFO kernel: [ 1234.678901] CPU0: Temperature back to normal
[2024-01-15 14:30:15] ERROR postgres: FATAL: password authentication failed for user "app_user"
[2024-01-15 14:30:16] ERROR postgres: FATAL: password authentication failed for user "app_user"
[2024-01-15 14:30:17] WARN postgres: too many connections for role "app_user"
[2024-01-15 16:15:45] INFO rabbitmq: accepting AMQP connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672)
[2024-01-15 16:15:46] INFO rabbitmq: connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672): user 'app_user' authenticated
[2024-01-15 16:15:47] WARN rabbitmq: connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672): missed heartbeats from client, timeout: 60s
+318
View File
@@ -0,0 +1,318 @@
#!/bin/bash
# Enterprise Incident Simulation Script
# Simulates various failure scenarios for testing observability stack
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$(dirname "$SCRIPT_DIR")")"
LOG_FILE="$PROJECT_ROOT/observability-stack/logs/incident_simulation.log"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Logging function
log() {
local level=$1
local message=$2
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$timestamp] $level $message" >> "$LOG_FILE"
echo -e "${BLUE}[$timestamp]${NC} $level $message"
}
# Function to simulate CPU spike
simulate_cpu_spike() {
local duration=${1:-60}
log "INFO" "Starting CPU spike simulation for ${duration} seconds"
# Launch CPU-intensive processes
for i in {1..4}; do
(
end_time=$((SECONDS + duration))
while [ $SECONDS -lt $end_time ]; do
# CPU-intensive calculation
result=0
for j in {1..100000}; do
result=$((result + j))
done
done
) &
PIDS[$i]=$!
done
# Wait for simulation to complete
for pid in "${PIDS[@]}"; do
wait $pid 2>/dev/null || true
done
log "INFO" "CPU spike simulation completed"
}
# Function to simulate memory leak
simulate_memory_leak() {
local duration=${1:-30}
log "INFO" "Starting memory leak simulation for ${duration} seconds"
# Create a process that gradually consumes memory
(
data=""
end_time=$((SECONDS + duration))
while [ $SECONDS -lt $end_time ]; do
# Append ~100 KB per iteration so memory growth is actually visible to monitoring
data="${data}$(printf 'X%.0s' {1..100000})"
sleep 0.1
done
) &
MEM_PID=$!
wait $MEM_PID 2>/dev/null || true
log "INFO" "Memory leak simulation completed"
}
# Function to simulate disk space exhaustion
simulate_disk_full() {
local target_dir=${1:-"/tmp"}
local duration=${2:-30}
log "INFO" "Starting disk space exhaustion simulation in ${target_dir} for ${duration} seconds"
# Create large files to fill disk space
(
end_time=$((SECONDS + duration))
while [ $SECONDS -lt $end_time ]; do
# Create 100MB file
dd if=/dev/zero of="${target_dir}/incident_test_file_$(date +%s).tmp" bs=1M count=100 2>/dev/null || true
sleep 2
done
) &
DISK_PID=$!
wait $DISK_PID 2>/dev/null || true
# Cleanup test files
rm -f "${target_dir}"/incident_test_file_*.tmp 2>/dev/null || true
log "INFO" "Disk space exhaustion simulation completed and cleaned up"
}
# Function to simulate network issues
simulate_network_issues() {
local interface=${1:-"lo"}
local duration=${2:-20}
log "INFO" "Starting network issues simulation on ${interface} for ${duration} seconds"
# Add network delay and packet loss
sudo tc qdisc add dev $interface root netem delay 100ms 50ms loss 10% 2>/dev/null || true
sleep $duration
# Remove network simulation
sudo tc qdisc del dev $interface root 2>/dev/null || true
log "INFO" "Network issues simulation completed"
}
# Function to simulate service crashes
simulate_service_crash() {
local service_name=${1:-"test-service"}
log "INFO" "Starting service crash simulation for ${service_name}"
# Simulate service going down
log "ERROR" "Service ${service_name} crashed unexpectedly"
sleep 5
log "INFO" "Service ${service_name} restarted automatically"
# Simulate multiple crashes
for i in {1..3}; do
sleep 2
log "ERROR" "Service ${service_name} crashed again (attempt $i)"
sleep 1
log "INFO" "Service ${service_name} recovered after crash $i"
done
}
# Function to simulate database issues
simulate_database_issues() {
local duration=${1:-25}
log "INFO" "Starting database issues simulation for ${duration} seconds"
# Simulate connection pool exhaustion
log "WARN" "Database connection pool nearing capacity"
sleep 5
log "ERROR" "Database connection pool exhausted"
sleep 5
log "ERROR" "Database query timeout occurred"
sleep 5
log "WARN" "Database connections recovering"
sleep 5
log "INFO" "Database connections restored"
log "INFO" "Database issues simulation completed"
}
# Function to simulate application errors
simulate_application_errors() {
local error_count=${1:-10}
log "INFO" "Starting application error simulation (${error_count} errors)"
for ((i = 1; i <= error_count; i++)); do
case $((RANDOM % 4)) in
0)
log "ERROR" "NullPointerException in UserService.getUser($i)"
;;
1)
log "ERROR" "TimeoutException: Database query timed out for user ID: $i"
;;
2)
log "ERROR" "ValidationException: Invalid input data for request $i"
;;
3)
log "ERROR" "IOException: Failed to write to log file"
;;
esac
sleep $((RANDOM % 3 + 1))
done
log "INFO" "Application error simulation completed"
}
# Function to run comprehensive incident scenario
run_comprehensive_scenario() {
log "INFO" "Starting comprehensive incident scenario simulation"
# Phase 1: Initial system stress
log "INFO" "Phase 1: System stress simulation"
simulate_cpu_spike 30 &
CPU_PID=$!
simulate_memory_leak 20 &
MEM_PID=$!
sleep 10
# Phase 2: Service degradation
log "INFO" "Phase 2: Service degradation simulation"
simulate_service_crash "web-service" &
SERVICE_PID=$!
sleep 5
# Phase 3: Database issues
log "INFO" "Phase 3: Database issues simulation"
simulate_database_issues 15 &
DB_PID=$!
# Phase 4: Application errors
log "INFO" "Phase 4: Application error burst"
simulate_application_errors 15 &
APP_PID=$!
# Phase 5: Infrastructure issues
log "INFO" "Phase 5: Infrastructure issues simulation"
simulate_disk_full "/tmp" 10 &
DISK_PID=$!
# Wait for all simulations to complete
wait $CPU_PID 2>/dev/null || true
wait $MEM_PID 2>/dev/null || true
wait $SERVICE_PID 2>/dev/null || true
wait $DB_PID 2>/dev/null || true
wait $APP_PID 2>/dev/null || true
wait $DISK_PID 2>/dev/null || true
log "INFO" "Comprehensive incident scenario completed"
}
# Function to show usage
show_usage() {
echo "Enterprise Incident Simulation Script"
echo "Usage: $0 [SCENARIO] [OPTIONS]"
echo ""
echo "SCENARIOS:"
echo " cpu [DURATION] - Simulate CPU spike (default: 60s)"
echo " memory [DURATION] - Simulate memory leak (default: 30s)"
echo " disk [DIR] [DURATION] - Simulate disk space exhaustion (default: /tmp, 30s)"
echo " network [INTERFACE] [DURATION] - Simulate network issues (default: lo, 20s)"
echo " service [NAME] - Simulate service crashes (default: test-service)"
echo " database [DURATION] - Simulate database issues (default: 25s)"
echo " app-errors [COUNT] - Simulate application errors (default: 10)"
echo " comprehensive - Run full incident scenario"
echo " all - Run all individual scenarios sequentially"
echo ""
echo "EXAMPLES:"
echo " $0 cpu 120 - CPU spike for 2 minutes"
echo " $0 disk /var/log 45 - Disk full simulation in /var/log for 45 seconds"
echo " $0 comprehensive - Full incident simulation"
echo ""
}
# Main execution
main() {
local scenario=${1:-"comprehensive"}
# Create log directory if it doesn't exist
mkdir -p "$(dirname "$LOG_FILE")"
log "INFO" "Incident simulation script started"
log "INFO" "Scenario: $scenario"
case $scenario in
"cpu")
simulate_cpu_spike "${2:-60}"
;;
"memory")
simulate_memory_leak "${2:-30}"
;;
"disk")
simulate_disk_full "${2:-/tmp}" "${3:-30}"
;;
"network")
simulate_network_issues "${2:-lo}" "${3:-20}"
;;
"service")
simulate_service_crash "${2:-test-service}"
;;
"database")
simulate_database_issues "${2:-25}"
;;
"app-errors")
simulate_application_errors "${2:-10}"
;;
"comprehensive")
run_comprehensive_scenario
;;
"all")
log "INFO" "Running all scenarios sequentially"
simulate_cpu_spike 30
sleep 5
simulate_memory_leak 20
sleep 5
simulate_disk_full "/tmp" 15
sleep 5
simulate_service_crash "test-service"
sleep 5
simulate_database_issues 15
sleep 5
simulate_application_errors 8
sleep 5
simulate_network_issues "lo" 10
;;
"help"|"-h"|"--help")
show_usage
exit 0
;;
*)
echo -e "${RED}Error: Unknown scenario '$scenario'${NC}"
echo ""
show_usage
exit 1
;;
esac
log "INFO" "Incident simulation script completed successfully"
echo -e "${GREEN}Simulation completed. Check logs at: $LOG_FILE${NC}"
}
# Run main function with all arguments
main "$@"