# Runbooks and Operational Procedures

This document contains operational runbooks for deploying, managing, and troubleshooting the Enterprise Infrastructure Portfolio projects.

## Table of Contents

1. [Infrastructure Simulator Operations](#infrastructure-simulator-operations)
2. [Migration Validation Procedures](#migration-validation-procedures)
3. [Observability Stack Management](#observability-stack-management)
4. [Troubleshooting Guide](#troubleshooting-guide)
5. [Emergency Procedures](#emergency-procedures)
6. [Maintenance Schedules](#maintenance-schedules)

## Infrastructure Simulator Operations

### Starting the Infrastructure

```bash
cd enterprise-infra-simulator
make up
```

**Expected Outcome:**

- Docker containers for simulated Linux nodes are created
- Ansible inventory is populated
- Basic services are running on all nodes

**Verification:**

```bash
docker ps | grep infra-sim
ansible -i inventory/hosts.ini all -m ping
```

### Patching Operations

```bash
cd enterprise-infra-simulator
make patch
```

**Procedure:**

1. Back up current container states
2. Apply security patches via Ansible
3. Validate service availability
4. Generate patch report

**Rollback:**

```bash
docker-compose down
docker-compose up --scale node=0
make up
```

### Hardening Operations

```bash
cd enterprise-infra-simulator
ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
```

**Hardening Steps:**

- Disable unnecessary services
- Configure firewall rules
- Set secure SSH configurations
- Apply CIS benchmarks

### Scaling Operations

```bash
cd enterprise-infra-simulator
./scripts/simulate_scaling.sh up 3
```

**Scaling Parameters:**

- Direction: up/down
- Count: number of nodes to add/remove
- Type: web/app/db

### Failure Simulation

```bash
cd enterprise-infra-simulator
./scripts/simulate_failure.sh --type network --duration 300
```

**Failure Types:**

- network: Network partition
- disk: Disk space exhaustion
- service: Service crashes
- node: Complete node failure

### Decommissioning

```bash
cd enterprise-infra-simulator
make destroy
```

**Decommission Steps:**

1. Graceful service shutdown
2. Data backup and export
3. Configuration cleanup
4. Container removal

## Migration Validation Procedures

### Pre-Migration Snapshot

```bash
cd migration-validation-framework
python cli.py snapshot --env production --label pre-migration
```

**Data Collected:**

- Mount points and filesystem usage
- Running services and their states
- Disk usage statistics
- Network configurations

### Post-Migration Validation

```bash
python cli.py snapshot --env production --label post-migration
python cli.py compare pre-migration post-migration
```

**Validation Checks:**

- Service availability verification
- Filesystem integrity
- Configuration consistency
- Performance metrics comparison

### Report Generation

```bash
python cli.py report --comparison-id --format html
```

**Report Contents:**

- Executive summary
- Detailed change log
- Risk assessment
- Recommendations

## Observability Stack Management

### Starting the Stack

```bash
cd observability-stack
docker-compose up -d
```

**Service Startup Order:**

1. Elasticsearch
2. Logstash
3. Kibana
4. Grafana

### Log Ingestion Testing

```bash
# Send sample logs
curl -X POST "localhost:8080" -H "Content-Type: application/json" -d @logs/sample.log
```

### Alert Configuration

```bash
# Load alert rules
curl -X POST "localhost:3000/api/alerts" -H "Authorization: Bearer " -d @alerting/alert_rules.json
```

### Incident Simulation

```bash
cd observability-stack
./scenarios/incident_simulation.sh --type disk-full --severity critical
```

**Incident Types:**

- disk-full: Simulate disk space exhaustion
- service-down: Service failure simulation
- high-cpu: CPU utilization spike
- network-latency: Network performance degradation

## Troubleshooting Guide

### Common Issues

#### Ansible Connection Failures

**Symptoms:**

- `UNREACHABLE` errors in Ansible output
- SSH connection timeouts

**Resolution:**

```bash
# Check container status
docker ps | grep infra-sim

# Verify SSH keys
ansible -i inventory/hosts.ini all -m ping --private-key ~/.ssh/id_rsa

# Restart containers
make destroy && make up
```

#### Elasticsearch Cluster Issues

**Symptoms:**

- Kibana shows "No living connections"
- Logstash pipeline failures

**Resolution:**

```bash
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Restart services
docker-compose restart elasticsearch logstash kibana
```

#### Python Import Errors

**Symptoms:**

- `ModuleNotFoundError` in migration framework
- Collector failures

**Resolution:**

```bash
# Install dependencies
pip install -r requirements.txt

# Check Python path
python -c "import sys; print(sys.path)"
```

#### Docker Resource Constraints

**Symptoms:**

- Container startup failures
- Out of memory errors

**Resolution:**

```bash
# Check Docker disk usage
docker system df

# Clean up unused resources (-a also removes every image not used by a container)
docker system prune -a

# Increase the memory available to containers:
# - Docker Desktop: raise the limits under Settings > Resources.
# - Linux: /etc/docker/daemon.json has no global memory or CPU setting;
#   set per-container limits instead, e.g. `docker run --memory=4g --cpus=2 ...`.
```

### Log Locations

- **Ansible:** `enterprise-infra-simulator/ansible.log`
- **Docker:** `docker logs `
- **Elasticsearch:** `observability-stack/logs/elasticsearch.log`
- **Migration Framework:** `migration-validation-framework/logs/validation.log`

### Performance Monitoring

```bash
# Infrastructure monitoring
ansible -i inventory/hosts.ini all -m shell -a "top -b -n1 | head -20"

# Elasticsearch metrics
curl -X GET "localhost:9200/_cluster/stats?pretty"

# Python performance
python -m cProfile cli.py snapshot
```

### Backup and Recovery

#### Infrastructure Backup

```bash
cd enterprise-infra-simulator
docker-compose exec ansible ansible-playbook /playbooks/backup.yml
```

#### Data Backup

```bash
cd observability-stack
# Registers the snapshot repository named "backup"; backup_config.json is read inside the container
docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup" -H "Content-Type: application/json" -d @backup_config.json
```

#### Migration Data Backup

```bash
cd migration-validation-framework
python cli.py backup --destination /backup/location
```

## Emergency Procedures

### Complete System Reset

```bash
# Run from the repository root; each step runs in a subshell so the
# working directory is unchanged for the next step.

# Stop all services (assumes the first command targets the observability stack)
(cd observability-stack && docker-compose down -v)
(cd enterprise-infra-simulator && make destroy)

# Clean up volumes
docker volume prune -f

# Restart from clean state
(cd enterprise-infra-simulator && make up)
(cd observability-stack && docker-compose up -d)
```

### Incident Response

1. **Assess Impact:** Check monitoring dashboards
2. **Isolate Issue:** Use failure simulation scripts to reproduce
3. **Implement Fix:** Apply appropriate runbook procedure
4. **Validate Recovery:** Run validation framework
5. **Document Incident:** Update runbooks with lessons learned

## Maintenance Schedules

- **Daily:** Log rotation and cleanup
- **Weekly:** Security patching and updates
- **Monthly:** Performance optimization and capacity planning
- **Quarterly:** Architecture review and modernization
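
The daily log cleanup listed in the schedule could be scripted along the following lines. This is a minimal sketch, not part of the projects themselves: the function name `cleanup_logs`, the 7-day default retention, and the `*.log` filename pattern are all assumptions chosen for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the projects): delete *.log files
# older than a given number of days from one directory tree.
#   $1 - directory to clean
#   $2 - retention in days (assumed default: 7)
cleanup_logs() {
    local dir="$1"
    local days="${2:-7}"

    # Nothing to do if the directory does not exist
    [ -d "$dir" ] || return 0

    # -mtime +N matches files whose age in whole days exceeds N
    find "$dir" -type f -name '*.log' -mtime +"$days" -delete
}
```

A daily cron job or similar scheduler could then call, for example, `cleanup_logs observability-stack/logs 7` and `cleanup_logs migration-validation-framework/logs 7` from the repository root. Running `find` with `-print` in place of `-delete` first shows which files would be removed.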