Runbooks and Operational Procedures
This document contains operational runbooks for deploying, managing, and troubleshooting the Enterprise Infrastructure Portfolio projects.
Table of Contents
- Infrastructure Simulator Operations
- Migration Validation Procedures
- Observability Stack Management
- Troubleshooting Guide
Infrastructure Simulator Operations
Starting the Infrastructure
cd enterprise-infra-simulator
make up
Expected Outcome:
- Docker containers for simulated Linux nodes are created
- Ansible inventory is populated
- Basic services are running on all nodes
Verification:
docker ps | grep infra-sim
ansible -i inventory/hosts.ini all -m ping
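The container check can also be scripted. The sketch below is one possible approach, not part of the project: it assumes the simulator containers have `infra-sim` in their names (as the `grep` above implies) and that the names come from `docker ps --format '{{.Names}}'`. The function name and the sample container names are hypothetical.

```python
def running_sim_containers(ps_output: str, marker: str = "infra-sim") -> list[str]:
    """Return the container names in `docker ps` output that contain the marker.

    ps_output is expected to hold one container name per line, as printed by
    `docker ps --format '{{.Names}}'`.
    """
    names = [line.strip() for line in ps_output.splitlines()]
    return sorted(name for name in names if name and marker in name)


# Hypothetical output; real names depend on the compose project.
sample_output = "infra-sim-web01\nother-service\ninfra-sim-db01\n"
print(running_sim_containers(sample_output))
```

In practice the input would come from `subprocess.run(["docker", "ps", "--format", "{{.Names}}"], capture_output=True, text=True).stdout`, and an empty result would indicate that `make up` did not create the nodes.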
Patching Operations
cd enterprise-infra-simulator
make patch
Procedure:
- Backup current container states
- Apply security patches via Ansible
- Validate service availability
- Generate patch report
Rollback:
docker-compose down
docker-compose up -d --scale node=0
make up
Hardening Operations
cd enterprise-infra-simulator
ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml
Hardening Steps:
- Disable unnecessary services
- Configure firewall rules
- Set secure SSH configurations
- Apply CIS benchmarks
Scaling Operations
cd enterprise-infra-simulator
./scripts/simulate_scaling.sh up 3
Scaling Parameters:
- Direction: up/down
- Count: number of nodes to add/remove
- Type: web/app/db
Failure Simulation
cd enterprise-infra-simulator
./scripts/simulate_failure.sh --type network --duration 300
Failure Types:
- network: Network partition
- disk: Disk space exhaustion
- service: Service crashes
- node: Complete node failure
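When failure simulations are triggered from automation, it may help to validate the arguments before calling the script. The following is a minimal sketch under that assumption; `failure_command` is a hypothetical helper, and only the `--type` and `--duration` flags shown above are used. The runbook does not state the unit of `--duration`, so the value is passed through unchanged.

```python
# Failure types listed in this runbook.
FAILURE_TYPES = {"network", "disk", "service", "node"}


def failure_command(failure_type: str, duration: int) -> list[str]:
    """Validate the inputs and build the argv for simulate_failure.sh."""
    if failure_type not in FAILURE_TYPES:
        raise ValueError(f"unknown failure type: {failure_type!r}")
    if duration <= 0:
        raise ValueError("duration must be a positive integer")
    return [
        "./scripts/simulate_failure.sh",
        "--type", failure_type,
        "--duration", str(duration),
    ]


print(failure_command("network", 300))
```

The resulting list can be handed to `subprocess.run(cmd, cwd="enterprise-infra-simulator", check=True)`, which avoids shell quoting issues.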
Decommissioning
cd enterprise-infra-simulator
make destroy
Decommission Steps:
- Graceful service shutdown
- Data backup and export
- Configuration cleanup
- Container removal
Migration Validation Procedures
Pre-Migration Snapshot
cd migration-validation-framework
python3 cli.py collect --output before.json --systems web01,db01
Data Collected:
- Mount points and filesystem usage
- Running services and their states
- Disk usage statistics
- Network configurations
Post-Migration Validation
python3 cli.py collect --output after.json --systems web01,db01
python3 cli.py compare before.json after.json --output diff.json
Validation Checks:
- Service availability verification
- Filesystem integrity
- Configuration consistency
- Performance metrics comparison
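To illustrate what a before/after comparison involves, here is a small sketch that diffs two snapshot dictionaries. It is not the framework's `compare` implementation, and the snapshot schema is not documented in this runbook: the example assumes each snapshot is a JSON object of nested dictionaries, and the keys `services` and `mounts` are hypothetical.

```python
from typing import Any


def flatten(data: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {'a': {'b': 1}} -> {'a.b': 1}."""
    flat: dict[str, Any] = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def diff_snapshots(before: dict[str, Any], after: dict[str, Any]) -> dict[str, Any]:
    """Report keys that were added, removed, or changed between two snapshots."""
    old, new = flatten(before), flatten(after)
    return {
        "added": {k: new[k] for k in sorted(new.keys() - old.keys())},
        "removed": {k: old[k] for k in sorted(old.keys() - new.keys())},
        "changed": {
            k: {"before": old[k], "after": new[k]}
            for k in sorted(old.keys() & new.keys())
            if old[k] != new[k]
        },
    }


# Hypothetical snapshots for illustration only.
before = {"services": {"nginx": "running", "cron": "running"}, "mounts": {"/": 40}}
after = {"services": {"nginx": "stopped", "sshd": "running"}, "mounts": {"/": 40}}
print(diff_snapshots(before, after))
```

Flattening first keeps the comparison to simple set operations on keys, at the cost of treating lists as opaque values.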
Report Generation
python3 cli.py report --comparison <comparison-id> --format html
Report Contents:
- Executive summary
- Detailed change log
- Risk assessment
- Recommendations
Observability Stack Management
Starting the Stack
cd observability-stack
docker-compose up -d
Service Startup Order:
- Elasticsearch
- Logstash
- Kibana
- Grafana
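The order matters because Logstash and Kibana need Elasticsearch to be reachable. One way to gate later steps in a script is to poll a readiness probe before continuing. The sketch below is an assumption-laden example: `wait_until_ready` and `elasticsearch_up` are hypothetical helpers, and the URL assumes Elasticsearch listens on `localhost:9200`, the port used by the troubleshooting commands in this document.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    attempts: int = 30,
    delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call probe() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)  # no need to wait after the final attempt
    return False


def elasticsearch_up(url: str = "http://localhost:9200") -> bool:
    """Probe: True if the URL answers with HTTP 200, False on any connection error."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

A caller would run `wait_until_ready(elasticsearch_up)` after `docker-compose up -d` and stop with an error if it returns `False`.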
Log Ingestion Testing
# Send sample logs
curl -X POST "localhost:8080" -H "Content-Type: application/json" --data-binary @logs/sample.log
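The same test can be driven from Python when a single, well-formed event is easier to reason about than a whole file. This is a sketch only: the event fields are hypothetical, since the format of `logs/sample.log` and the pipeline's expected fields are not shown here, and the URL mirrors the `curl` command above.

```python
import json
import urllib.request


def build_log_request(event: dict, url: str = "http://localhost:8080") -> urllib.request.Request:
    """Build a POST request carrying one JSON-encoded log event."""
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical event; the real fields depend on the ingestion pipeline.
sample_event = {"level": "ERROR", "service": "web01", "message": "disk usage above 90%"}
request = build_log_request(sample_event)
# To send it: urllib.request.urlopen(request, timeout=5)
```

Building the request separately from sending it makes the payload easy to inspect before anything reaches the stack.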
Alert Configuration
# Load alert rules
curl -X POST "localhost:3000/api/alerts" -H "Authorization: Bearer <token>" -H "Content-Type: application/json" -d @alerting/alert_rules.json
Incident Simulation
cd observability-stack
./scenarios/incident_simulation.sh --type disk-full --severity critical
Incident Types:
- disk-full: Simulate disk space exhaustion
- service-down: Service failure simulation
- high-cpu: CPU utilization spike
- network-latency: Network performance degradation
Troubleshooting Guide
Common Issues
Ansible Connection Failures
Symptoms:
- UNREACHABLE errors in Ansible output
- SSH connection timeouts
Resolution:
# Check container status
docker ps | grep infra-sim
# Verify SSH keys
ansible -i inventory/hosts.ini all -m ping --private-key ~/.ssh/id_rsa
# Restart containers
make destroy && make up
Elasticsearch Cluster Issues
Symptoms:
- Kibana shows "No living connections"
- Logstash pipeline failures
Resolution:
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Restart services
docker-compose restart elasticsearch logstash kibana
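The health response is easier to act on once its `status` field is interpreted. The sketch below shows one way to do that; `summarize_health` is a hypothetical helper, and it relies only on the `status` and `unassigned_shards` fields of the `_cluster/health` response.

```python
import json


def summarize_health(raw: str) -> str:
    """Turn a _cluster/health JSON response into a one-line verdict."""
    health = json.loads(raw)
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "OK: all shards are allocated"
    if status == "yellow":
        # Primaries are allocated, but some replicas are not.
        return f"WARN: {unassigned} replica shard(s) unassigned"
    if status == "red":
        return "CRIT: at least one primary shard is unassigned"
    return f"UNKNOWN: unexpected status {status!r}"


print(summarize_health('{"status": "yellow", "unassigned_shards": 3}'))
```

On a single-node cluster, `yellow` is common because replicas have no second node to be placed on, so a restart is usually only warranted for `red` or for no response at all.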
Python Import Errors
Symptoms:
- ModuleNotFoundError in migration framework
- Collector failures
Resolution:
# Install dependencies
pip install -r requirements.txt
# Check Python path
python -c "import sys; print(sys.path)"
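Before reinstalling everything, it can be useful to find out which modules are actually missing. A minimal sketch, assuming you know the top-level import names the framework uses; `missing_modules` is a hypothetical helper and the names in the example are placeholders.

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Placeholder names; replace them with the imports the framework needs.
print(missing_modules(["json", "surely_not_installed_module_xyz"]))
```

Note that an import name can differ from the package name listed in `requirements.txt` (for example, the `PyYAML` package is imported as `yaml`).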
Docker Resource Constraints
Symptoms:
- Container startup failures
- Out of memory errors
Resolution:
# Check Docker resources
docker system df
# Clean up unused resources (removes all unused images, not only dangling ones)
docker system prune -a
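# Increase Docker memory and CPU limits
# Note: /etc/docker/daemon.json has no "memory" or "cpu-count" keys, and dockerd refuses to start on unknown keys.
# On Docker Desktop, raise the limits under Settings > Resources.
# On Linux, set limits per container, for example:
docker run --memory 4g --cpus 2 <image>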
Log Locations
- Ansible: enterprise-infra-simulator/ansible.log
- Docker: docker logs <container-name>
- Elasticsearch: observability-stack/logs/elasticsearch.log
- Migration Framework: migration-validation-framework/logs/validation.log
Performance Monitoring
# Infrastructure monitoring
ansible -i inventory/hosts.ini all -m shell -a "top -b -n1 | head -20"
# Elasticsearch metrics
curl -X GET "localhost:9200/_cluster/stats?pretty"
# Python performance
python -m cProfile cli.py snapshot
Backup and Recovery
Infrastructure Backup
cd enterprise-infra-simulator
docker-compose exec ansible ansible-playbook /playbooks/backup.yml
Data Backup
cd observability-stack
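# Register the snapshot repository "backup" (backup_config.json must be readable inside the container)
docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup" -H "Content-Type: application/json" -d @backup_config.json
# Take a snapshot into that repository (snapshot_1 is an example name)
docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup/snapshot_1?wait_for_completion=true"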
Migration Data Backup
cd migration-validation-framework
# Archive the project directory from its parent so the path resolves
tar -czf /backup/location/migration-validation-framework.tgz -C .. migration-validation-framework
Emergency Procedures
Complete System Reset
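# Run from the repository root; each subshell returns to it afterwards
# Stop all services and remove their volumes
(cd observability-stack && docker-compose down -v)
(cd enterprise-infra-simulator && make destroy)
# Clean up volumes
docker volume prune -f
# Restart from clean state
(cd enterprise-infra-simulator && make up)
(cd observability-stack && docker-compose up -d)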
Incident Response
- Assess Impact: Check monitoring dashboards
- Isolate Issue: Use failure simulation scripts to reproduce
- Implement Fix: Apply appropriate runbook procedure
- Validate Recovery: Run validation framework
- Document Incident: Update runbooks with lessons learned
Maintenance Schedules
- Daily: Log rotation and cleanup
- Weekly: Security patching and updates
- Monthly: Performance optimization and capacity planning
- Quarterly: Architecture review and modernization