# Runbooks and Operational Procedures

This document contains operational runbooks for deploying, managing, and troubleshooting the Enterprise Infrastructure Portfolio projects.

## Table of Contents

1. [Infrastructure Simulator Operations](#infrastructure-simulator-operations)
2. [Migration Validation Procedures](#migration-validation-procedures)
3. [Observability Stack Management](#observability-stack-management)
4. [Troubleshooting Guide](#troubleshooting-guide)
5. [Emergency Procedures](#emergency-procedures)
6. [Maintenance Schedules](#maintenance-schedules)

## Infrastructure Simulator Operations

### Starting the Infrastructure

```bash
cd enterprise-infra-simulator
make up
```

**Expected Outcome:**
- Docker containers for simulated Linux nodes are created
- Ansible inventory is populated
- Basic services are running on all nodes

**Verification:**
```bash
docker ps | grep infra-sim
ansible -i inventory/hosts.ini all -m ping
```
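
Containers can take a few seconds to accept SSH connections after `make up` returns, so the verification commands may fail on the first try. A minimal sketch of a retry wrapper follows; the function name `wait_for` and the attempt and delay values are illustrative and not part of the project.

```bash
# wait_for ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it exits 0 or ATTEMPTS is exhausted.
# Hypothetical helper; not provided by the repository's Makefile or scripts.
wait_for() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt ${i}/${attempts} failed: $*" >&2
    # Skip the sleep after the final attempt
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}
```

For example, `wait_for 10 3 ansible -i inventory/hosts.ini all -m ping` would retry the ping up to ten times, three seconds apart.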

### Patching Operations

```bash
cd enterprise-infra-simulator
make patch
```

**Procedure:**
1. Back up current container states
2. Apply security patches via Ansible
3. Validate service availability
4. Generate patch report

**Rollback:**
```bash
docker-compose down
# -d keeps the command from blocking in the foreground
docker-compose up -d --scale node=0
make up
```

### Hardening Operations

```bash
cd enterprise-infra-simulator
ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
```

**Hardening Steps:**
- Disable unnecessary services
- Configure firewall rules
- Set secure SSH configurations
- Apply CIS benchmarks

### Scaling Operations

```bash
cd enterprise-infra-simulator
./scripts/simulate_scaling.sh up 3
```

**Scaling Parameters:**
- Direction: up/down
- Count: number of nodes to add/remove
- Type: web/app/db
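
The parameters above can be checked before the scaling script is invoked. The following is a sketch only: `validate_scaling_args` is a hypothetical helper, the real `simulate_scaling.sh` may parse its arguments differently, and treating the type as optional is an assumption based on the example command, which omits it.

```bash
# validate_scaling_args DIRECTION COUNT [TYPE]
# Returns 0 when the arguments fall within the documented parameter set.
# Illustrative only; not taken from simulate_scaling.sh.
validate_scaling_args() {
  local direction="$1" count="$2" type="${3:-}"
  case "$direction" in
    up|down) ;;
    *) echo "direction must be up or down, got '${direction}'" >&2; return 1 ;;
  esac
  # Count must be a positive integer: digits only, no leading zero
  case "$count" in
    ''|*[!0-9]*|0*)
      echo "count must be a positive integer, got '${count}'" >&2
      return 1 ;;
  esac
  # Type is treated as optional here; that is an assumption
  case "$type" in
    ''|web|app|db) ;;
    *) echo "type must be web, app, or db, got '${type}'" >&2; return 1 ;;
  esac
}
```

A guarded call would then read `validate_scaling_args up 3 && ./scripts/simulate_scaling.sh up 3`.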

### Failure Simulation

```bash
cd enterprise-infra-simulator
./scripts/simulate_failure.sh --type network --duration 300
```

**Failure Types:**
- network: Network partition
- disk: Disk space exhaustion
- service: Service crashes
- node: Complete node failure

### Decommissioning

```bash
cd enterprise-infra-simulator
make destroy
```

**Decommission Steps:**
1. Graceful service shutdown
2. Data backup and export
3. Configuration cleanup
4. Container removal

## Migration Validation Procedures

### Pre-Migration Snapshot

```bash
cd migration-validation-framework
python cli.py snapshot --env production --label pre-migration
```

**Data Collected:**
- Mount points and filesystem usage
- Running services and their states
- Disk usage statistics
- Network configurations

### Post-Migration Validation

```bash
python cli.py snapshot --env production --label post-migration
python cli.py compare pre-migration post-migration
```

**Validation Checks:**
- Service availability verification
- Filesystem integrity
- Configuration consistency
- Performance metrics comparison
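
To illustrate the idea behind the service availability check, here is a sketch that diffs two lists of running services with `comm`. It assumes each snapshot has been exported to a plain text file with one service name per line; the snapshot format actually used by `cli.py` is not documented here, so both the file layout and the `diff_services` helper are hypothetical.

```bash
# diff_services PRE_FILE POST_FILE
# Prints "- name" for services only in PRE_FILE (missing after migration)
# and "+ name" for services only in POST_FILE (new after migration).
# Returns 0 when both sets match, 1 when they differ, 2 on setup failure.
diff_services() {
  local pre="$1" post="$2" tmp status=0
  tmp=$(mktemp -d) || return 2
  # comm requires sorted input; sort -u also drops duplicate names
  sort -u "$pre" > "$tmp/pre"
  sort -u "$post" > "$tmp/post"
  comm -23 "$tmp/pre" "$tmp/post" | sed 's/^/- /' > "$tmp/report"
  comm -13 "$tmp/pre" "$tmp/post" | sed 's/^/+ /' >> "$tmp/report"
  cat "$tmp/report"
  # A non-empty report means the service sets differ
  [ -s "$tmp/report" ] && status=1
  rm -rf "$tmp"
  return "$status"
}
```

The non-zero exit status on any difference makes the function usable as a gate in a migration script.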

### Report Generation

```bash
python cli.py report --comparison-id <comparison-id> --format html
```

**Report Contents:**
- Executive summary
- Detailed change log
- Risk assessment
- Recommendations

## Observability Stack Management

### Starting the Stack

```bash
cd observability-stack
docker-compose up -d
```

**Service Startup Order:**
1. Elasticsearch
2. Logstash
3. Kibana
4. Grafana
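
Because `docker-compose up -d` returns as soon as the containers are started, the services that depend on Elasticsearch may come up before it is ready to serve requests. The sketch below polls the cluster health endpoint until it reports `green` or `yellow` (a single-node cluster commonly stays `yellow`). The helper names, the default attempt count, and the delay are assumptions.

```bash
# es_status reads a _cluster/health JSON document on stdin
# and prints the value of its "status" field (empty if absent).
es_status() {
  sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p' | head -n 1
}

# wait_for_es URL [ATTEMPTS] [DELAY_SECONDS]
# Polls URL/_cluster/health until the status is green or yellow.
# Hypothetical helper; defaults are illustrative.
wait_for_es() {
  local url="$1" attempts="${2:-30}" delay="${3:-5}" i=1 status
  while [ "$i" -le "$attempts" ]; do
    status=$(curl -fsS "${url}/_cluster/health" 2>/dev/null | es_status)
    case "$status" in
      green|yellow) echo "elasticsearch is ${status}"; return 0 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "elasticsearch not ready after ${attempts} attempts" >&2
  return 1
}
```

For example, `wait_for_es localhost:9200` could be run right after `docker-compose up -d` and before any log ingestion test.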

### Log Ingestion Testing

```bash
# Send sample logs (--data-binary preserves newlines; -d would strip them)
curl -X POST "localhost:8080" -H "Content-Type: application/json" --data-binary @logs/sample.log
```

### Alert Configuration

```bash
# Load alert rules (curl sends form-encoded data unless Content-Type is set)
curl -X POST "localhost:3000/api/alerts" -H "Authorization: Bearer <token>" -H "Content-Type: application/json" -d @alerting/alert_rules.json
```

### Incident Simulation

```bash
cd observability-stack
./scenarios/incident_simulation.sh --type disk-full --severity critical
```

**Incident Types:**
- disk-full: Simulate disk space exhaustion
- service-down: Service failure simulation
- high-cpu: CPU utilization spike
- network-latency: Network performance degradation

## Troubleshooting Guide

### Common Issues

#### Ansible Connection Failures

**Symptoms:**
- `UNREACHABLE` errors in Ansible output
- SSH connection timeouts

**Resolution:**
```bash
# Check container status
docker ps | grep infra-sim

# Verify SSH keys
ansible -i inventory/hosts.ini all -m ping --private-key ~/.ssh/id_rsa

# Restart containers
make destroy && make up
```
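
Before restarting every container, it can help to see exactly which hosts are failing. The sketch below filters ad-hoc Ansible output, which reports each host on a line of the form `host | UNREACHABLE! => {`, down to the affected host names. The helper name `unreachable_hosts` is hypothetical.

```bash
# unreachable_hosts reads ad-hoc Ansible output on stdin and prints
# the name of each host reported as UNREACHABLE, one per line.
# Hypothetical helper; relies on the default "host | STATUS => {" line format.
unreachable_hosts() {
  awk -F ' [|] ' '/ [|] UNREACHABLE!/ { print $1 }' | sort -u
}
```

For example, `ansible -i inventory/hosts.ini all -m ping | unreachable_hosts` lists only the nodes that need attention.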

#### Elasticsearch Cluster Issues

**Symptoms:**
- Kibana shows "No living connections"
- Logstash pipeline failures

**Resolution:**
```bash
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Restart services
docker-compose restart elasticsearch logstash kibana
```

#### Python Import Errors

**Symptoms:**
- `ModuleNotFoundError` in migration framework
- Collector failures

**Resolution:**
```bash
# Install dependencies
pip install -r requirements.txt

# Check Python path
python -c "import sys; print(sys.path)"
```

#### Docker Resource Constraints

**Symptoms:**
- Container startup failures
- Out of memory errors

**Resolution:**
```bash
# Check Docker disk usage
docker system df

# Check per-container memory and CPU usage
docker stats --no-stream

# Clean up unused resources (removes all unused images, not only dangling ones)
docker system prune -a

# Raise the limits on a running container.
# Memory and CPU limits are per-container settings; "memory" and "cpu-count"
# are not valid keys in /etc/docker/daemon.json.
docker update --memory 4g --memory-swap 4g --cpus 2 <container-name>
```

### Log Locations

- **Ansible:** `enterprise-infra-simulator/ansible.log`
- **Docker:** `docker logs <container-name>`
- **Elasticsearch:** `observability-stack/logs/elasticsearch.log`
- **Migration Framework:** `migration-validation-framework/logs/validation.log`
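
For a quick first pass over the file-based logs listed above, a sketch like the following counts matching lines per file. It assumes that a case-insensitive match on the word `error` is a useful signal in these logs, which depends on each tool's log format; `count_errors` is a hypothetical helper.

```bash
# count_errors FILE [FILE...]
# Prints "path: N" for each readable file, where N is the number of
# lines containing "error" (case-insensitive). Illustrative helper.
count_errors() {
  local f n
  for f in "$@"; do
    if [ -r "$f" ]; then
      # grep -c prints 0 but exits 1 when nothing matches, so ignore its status
      n=$(grep -ci 'error' "$f" || true)
      echo "${f}: ${n}"
    else
      echo "${f}: not readable" >&2
    fi
  done
}
```

Run from the repository root, a call might look like `count_errors enterprise-infra-simulator/ansible.log observability-stack/logs/elasticsearch.log`.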

### Performance Monitoring

```bash
# Infrastructure monitoring (run from enterprise-infra-simulator)
ansible -i inventory/hosts.ini all -m shell -a "top -b -n1 | head -20"

# Elasticsearch metrics
curl -X GET "localhost:9200/_cluster/stats?pretty"

# Python performance (run from migration-validation-framework)
python -m cProfile cli.py snapshot
```

### Backup and Recovery

#### Infrastructure Backup
```bash
cd enterprise-infra-simulator
docker-compose exec ansible ansible-playbook /playbooks/backup.yml
```

#### Data Backup
```bash
cd observability-stack
# Registers the snapshot repository "backup"; curl runs inside the container,
# so backup_config.json must exist in the container's working directory
docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup" -H "Content-Type: application/json" -d @backup_config.json
```

#### Migration Data Backup
```bash
cd migration-validation-framework
python cli.py backup --destination /backup/location
```

## Emergency Procedures

### Complete System Reset

```bash
# Run from the repository root; each subshell leaves the working directory unchanged

# Stop all services
(cd observability-stack && docker-compose down -v)
(cd enterprise-infra-simulator && make destroy)

# Clean up volumes
docker volume prune -f

# Restart from clean state
(cd enterprise-infra-simulator && make up)
(cd observability-stack && docker-compose up -d)
```
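
Since the reset deletes volumes, a confirmation step in front of it can prevent accidental data loss. The following is a sketch of such a guard; `confirm_reset` and the confirmation word are hypothetical and not part of the project.

```bash
# confirm_reset prompts on stderr, reads one line from stdin, and
# succeeds only if the operator typed exactly "reset".
# Hypothetical guard; not part of the repository.
confirm_reset() {
  local answer=""
  printf 'This removes all containers and volumes. Type "reset" to continue: ' >&2
  # read fails on end-of-input; treat that as an empty answer
  IFS= read -r answer || true
  [ "$answer" = "reset" ]
}
```

A script could then start with `confirm_reset || exit 1` before running any of the destructive commands.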

### Incident Response

1. **Assess Impact:** Check monitoring dashboards
2. **Isolate Issue:** Use failure simulation scripts to reproduce
3. **Implement Fix:** Apply appropriate runbook procedure
4. **Validate Recovery:** Run validation framework
5. **Document Incident:** Update runbooks with lessons learned

## Maintenance Schedules

- **Daily:** Log rotation and cleanup
- **Weekly:** Security patching and updates
- **Monthly:** Performance optimization and capacity planning
- **Quarterly:** Architecture review and modernization
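
As an illustration of the daily cleanup task, the sketch below removes old `*.log` files from a directory. The retention period, the target directories, and the `prune_old_logs` helper are all assumptions; the project may handle rotation differently, for example with `logrotate`.

```bash
# prune_old_logs DIR DAYS
# Deletes *.log files under DIR whose last modification is more than
# DAYS days old and prints each removed path. Illustrative helper.
prune_old_logs() {
  local dir="$1" days="$2"
  find "$dir" -type f -name '*.log' -mtime +"$days" -print -exec rm -f {} \;
}
```

For example, `prune_old_logs observability-stack/logs 7` would keep roughly the last week of logs.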