feat: Add comprehensive enterprise Linux infrastructure portfolio with Ansible, Python, and ELK stack
@@ -0,0 +1,329 @@
# Runbooks and Operational Procedures

This document contains operational runbooks for deploying, managing, and troubleshooting the Enterprise Infrastructure Portfolio projects.

## Table of Contents

1. [Infrastructure Simulator Operations](#infrastructure-simulator-operations)
2. [Migration Validation Procedures](#migration-validation-procedures)
3. [Observability Stack Management](#observability-stack-management)
4. [Troubleshooting Guide](#troubleshooting-guide)
5. [Emergency Procedures](#emergency-procedures)
6. [Maintenance Schedules](#maintenance-schedules)

## Infrastructure Simulator Operations

### Starting the Infrastructure

```bash
cd enterprise-infra-simulator
make up
```

**Expected Outcome:**
- Docker containers for the simulated Linux nodes are created
- The Ansible inventory is populated
- Basic services are running on all nodes

**Verification:**
```bash
docker ps | grep infra-sim
ansible -i inventory/hosts.ini all -m ping
```
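
Containers can take a few seconds to accept SSH after `make up`, so the first ping may fail even when nothing is wrong. The following is a minimal sketch of a retry helper. The function name `retry` and the attempt and delay values are illustrative and not part of the project.

```bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# (Hypothetical helper; not shipped with the simulator.)
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while ! "$@"; do
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example (assumed values: 10 attempts, 3 seconds apart):
#   retry 10 3 ansible -i inventory/hosts.ini all -m ping
```

A bounded retry keeps the verification step scriptable: it either succeeds once the nodes are reachable or fails with a clear message instead of hanging.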

### Patching Operations

```bash
cd enterprise-infra-simulator
make patch
```

**Procedure:**
1. Back up current container states
2. Apply security patches via Ansible
3. Validate service availability
4. Generate the patch report

**Rollback:**
```bash
# Stop and remove the current containers
docker-compose down
# Bring the stack up detached with zero node replicas (without -d this call blocks the shell)
docker-compose up -d --scale node=0
# Recreate the nodes from a clean state
make up
```

### Hardening Operations

```bash
cd enterprise-infra-simulator
ansible-playbook -i inventory/hosts.ini playbooks/harden.yml
```

**Hardening Steps:**
- Disable unnecessary services
- Configure firewall rules
- Set secure SSH configurations
- Apply CIS benchmarks
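
One way to spot-check the SSH part of the hardening is to read values back out of a node's `sshd_config`. The sketch below assumes the playbook sets `PermitRootLogin no` and `PasswordAuthentication no`. Those values and the helper names are assumptions, since the runbook does not list the exact settings.

```bash
# sshd_setting FILE KEY
# Prints the first uncommented value for KEY. sshd uses the first value it
# reads for a keyword, and keywords are case-insensitive.
# (Hypothetical helper; ignores Include files and Match blocks.)
sshd_setting() {
    awk -v key="$2" '
        tolower($1) == tolower(key) { print $2; exit }
    ' "$1"
}

# check_sshd FILE KEY EXPECTED
# Returns 0 when the configured value equals EXPECTED, 1 otherwise.
check_sshd() {
    actual=$(sshd_setting "$1" "$2")
    if [ "$actual" = "$3" ]; then
        echo "ok   $2 = $actual"
    else
        echo "FAIL $2 = ${actual:-<unset>} (expected $3)"
        return 1
    fi
}

# Example, after copying the file out of a node (expected values are assumed):
#   docker exec <container-name> cat /etc/ssh/sshd_config > /tmp/sshd_config
#   check_sshd /tmp/sshd_config PermitRootLogin no
#   check_sshd /tmp/sshd_config PasswordAuthentication no
```

For a more thorough check, `sshd -T` on the node prints the effective configuration, including defaults that are not written in the file.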

### Scaling Operations

```bash
cd enterprise-infra-simulator
# Arguments: direction, then count (here: add 3 nodes)
./scripts/simulate_scaling.sh up 3
```

**Scaling Parameters:**
- Direction: `up` or `down`
- Count: number of nodes to add or remove
- Type: `web`, `app`, or `db` (not passed in the example above)
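
The three parameters above can be validated before any container is touched. The following is a sketch only: the runbook shows direction and count as positional arguments, while passing the type as an optional third argument is an assumption about `simulate_scaling.sh`, as is the function name.

```bash
# validate_scaling_args DIRECTION COUNT [TYPE]
# Returns 0 when the arguments are well-formed, 1 otherwise.
# (Hypothetical helper; the real script's checks are not shown in the runbook.)
validate_scaling_args() {
    case "$1" in
        up|down) ;;
        *) echo "direction must be 'up' or 'down'" >&2; return 1 ;;
    esac
    case "$2" in
        ''|*[!0-9]*) echo "count must be a positive integer" >&2; return 1 ;;
    esac
    if [ "$2" -lt 1 ]; then
        echo "count must be at least 1" >&2
        return 1
    fi
    # TYPE is treated as optional here; its argument position is assumed.
    case "${3:-web}" in
        web|app|db) ;;
        *) echo "type must be web, app, or db" >&2; return 1 ;;
    esac
}

# Example:
#   validate_scaling_args up 3 && ./scripts/simulate_scaling.sh up 3
```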

### Failure Simulation

```bash
cd enterprise-infra-simulator
./scripts/simulate_failure.sh --type network --duration 300
```

**Failure Types:**
- `network`: Network partition
- `disk`: Disk space exhaustion
- `service`: Service crashes
- `node`: Complete node failure

### Decommissioning

```bash
cd enterprise-infra-simulator
make destroy
```

**Decommission Steps:**
1. Graceful service shutdown
2. Data backup and export
3. Configuration cleanup
4. Container removal

## Migration Validation Procedures

### Pre-Migration Snapshot

```bash
cd migration-validation-framework
python cli.py snapshot --env production --label pre-migration
```

**Data Collected:**
- Mount points and filesystem usage
- Running services and their states
- Disk usage statistics
- Network configurations
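
For orientation, the categories above map onto standard Linux commands. The sketch below collects similar data by hand. It is not how `cli.py snapshot` is implemented, and the function name and output file names are hypothetical.

```bash
# collect_snapshot OUTPUT_DIR
# Writes one text file per data category; skips tools that are not installed.
# (Illustrative only; the framework's own collectors may gather more.)
collect_snapshot() {
    out=$1
    mkdir -p "$out" || return 1

    # Mount points and filesystem usage (-P gives stable POSIX columns)
    df -P > "$out/filesystems.txt"

    # Running services and their states (systemd hosts only)
    if command -v systemctl >/dev/null 2>&1; then
        systemctl list-units --type=service --no-pager --no-legend \
            > "$out/services.txt" 2>/dev/null
    fi

    # Network configuration (requires iproute2)
    if command -v ip >/dev/null 2>&1; then
        ip -o addr show > "$out/addresses.txt" 2>/dev/null
        ip route show > "$out/routes.txt" 2>/dev/null
    fi
    return 0
}

# Example:
#   collect_snapshot /tmp/snapshots/pre-migration
```

Writing one plain-text file per category keeps the pre- and post-migration outputs easy to compare with ordinary text tools.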

### Post-Migration Validation

```bash
cd migration-validation-framework
# Capture the post-migration state, then compare the two labels
python cli.py snapshot --env production --label post-migration
python cli.py compare pre-migration post-migration
```

**Validation Checks:**
- Service availability verification
- Filesystem integrity
- Configuration consistency
- Performance metrics comparison
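
At its core, a check such as service availability is a set difference between the pre- and post-migration state. The following sketch shows that idea with `comm` on plain text files holding one item per line, for example service names. The file format and function name are assumptions, as the framework's real snapshot format is not shown here.

```bash
# compare_snapshots PRE_FILE POST_FILE
# Prints "- item" for entries only in PRE and "+ item" for entries only in POST.
# Returns 0 when both files hold the same items, 1 when they differ.
# (Hypothetical helper; not the framework's compare implementation.)
compare_snapshots() {
    pre_sorted=$(mktemp) || return 2
    post_sorted=$(mktemp) || return 2
    # comm needs sorted input; -u also drops duplicate lines
    sort -u "$1" > "$pre_sorted"
    sort -u "$2" > "$post_sorted"

    removed=$(comm -23 "$pre_sorted" "$post_sorted" | sed 's/^/- /')
    added=$(comm -13 "$pre_sorted" "$post_sorted" | sed 's/^/+ /')
    rm -f "$pre_sorted" "$post_sorted"

    [ -n "$removed" ] && echo "$removed"
    [ -n "$added" ] && echo "$added"
    [ -z "$removed$added" ]
}

# Example:
#   compare_snapshots pre/services.txt post/services.txt || echo "services changed"
```

The non-zero exit status on any difference lets a migration script stop automatically when something that existed before the migration is missing afterwards.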

### Report Generation

```bash
cd migration-validation-framework
# Replace <comparison-id> with the ID of the comparison to report on
python cli.py report --comparison-id <comparison-id> --format html
```

**Report Contents:**
- Executive summary
- Detailed change log
- Risk assessment
- Recommendations

## Observability Stack Management

### Starting the Stack

```bash
cd observability-stack
docker-compose up -d
```

**Service Startup Order:**
1. Elasticsearch
2. Logstash
3. Kibana
4. Grafana
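
`docker-compose up -d` returns once the containers are started, not once the services inside them are ready, so steps that depend on Elasticsearch may need to wait. The sketch below polls the cluster health API until the status is green or yellow. The port 9200 matches the troubleshooting commands in this runbook, while the function names, attempt count, and delay are illustrative.

```bash
# es_status [URL]
# Prints the cluster status (green, yellow, or red); prints nothing if
# Elasticsearch is unreachable. (Hypothetical helper.)
es_status() {
    curl -fsS "${1:-http://localhost:9200}/_cluster/health" 2>/dev/null |
        sed -n 's/.*"status" *: *"\([a-z]*\)".*/\1/p'
}

# wait_for_elasticsearch [URL] [ATTEMPTS]
# Polls every 2 seconds until the status is green or yellow.
wait_for_elasticsearch() {
    url=${1:-http://localhost:9200}
    attempts=${2:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        case "$(es_status "$url")" in
            green|yellow)
                echo "Elasticsearch is ready"
                return 0
                ;;
        esac
        i=$((i + 1))
        sleep 2
    done
    echo "Elasticsearch not ready after $attempts attempts" >&2
    return 1
}

# Example:
#   docker-compose up -d && wait_for_elasticsearch
```

Yellow is accepted because a single-node cluster cannot allocate replica shards and therefore often never reaches green.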

### Log Ingestion Testing

```bash
# Send sample logs (--data-binary keeps the line breaks that -d would strip from the file)
curl -X POST "localhost:8080" -H "Content-Type: application/json" --data-binary @logs/sample.log
```
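
After sending the sample, it helps to confirm that documents actually reached Elasticsearch. The sketch below uses the `_count` API. The index pattern `logs-*` is an assumption, because this runbook does not name the index Logstash writes to, and the function name is hypothetical.

```bash
# es_doc_count INDEX_PATTERN [URL]
# Prints the number of documents matching the pattern, or nothing on error.
# (Hypothetical helper.)
es_doc_count() {
    curl -fsS "${2:-http://localhost:9200}/$1/_count" 2>/dev/null |
        sed -n 's/.*"count" *: *\([0-9][0-9]*\).*/\1/p'
}

# Example: compare the count before and after sending the sample.
# 'logs-*' is an assumed index pattern; adjust it to the Logstash output.
#   before=$(es_doc_count 'logs-*')
#   ...send the sample logs...
#   sleep 2   # give Elasticsearch time to refresh the index
#   after=$(es_doc_count 'logs-*')
#   [ "${after:-0}" -gt "${before:-0}" ] && echo "ingestion OK"
```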

### Alert Configuration

```bash
# Load alert rules (replace <token> with a valid API token)
curl -X POST "localhost:3000/api/alerts" \
  -H "Authorization: Bearer <token>" \
  -H "Content-Type: application/json" \
  -d @alerting/alert_rules.json
```

### Incident Simulation

```bash
cd observability-stack
./scenarios/incident_simulation.sh --type disk-full --severity critical
```

**Incident Types:**
- `disk-full`: Disk space exhaustion
- `service-down`: Service failure
- `high-cpu`: CPU utilization spike
- `network-latency`: Network performance degradation

## Troubleshooting Guide

### Common Issues

#### Ansible Connection Failures

**Symptoms:**
- `UNREACHABLE` errors in Ansible output
- SSH connection timeouts

**Resolution:**
```bash
# Check container status
docker ps | grep infra-sim

# Verify SSH keys
ansible -i inventory/hosts.ini all -m ping --private-key ~/.ssh/id_rsa

# Restart containers
make destroy && make up
```

#### Elasticsearch Cluster Issues

**Symptoms:**
- Kibana shows "No living connections"
- Logstash pipeline failures

**Resolution:**
```bash
# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Restart services
docker-compose restart elasticsearch logstash kibana
```
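
When the health check reports yellow or red, the cause is usually one or more unassigned shards. The sketch below lists them with the `_cat/shards` API before resorting to a restart. The function name is hypothetical.

```bash
# unassigned_shards [URL]
# Prints one line per unassigned shard: index, shard number, p (primary) or
# r (replica), and the reason Elasticsearch reports. (Hypothetical helper.)
unassigned_shards() {
    curl -fsS "${1:-http://localhost:9200}/_cat/shards?h=index,shard,prirep,state,unassigned.reason" \
        2>/dev/null | awk '$4 == "UNASSIGNED" { print $1, $2, $3, $5 }'
}

# Example:
#   unassigned_shards
```

Unassigned replicas (`r`) on a single-node cluster are expected and only produce a yellow status, while unassigned primaries (`p`) mean data is unavailable and explain a red status.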

#### Python Import Errors

**Symptoms:**
- `ModuleNotFoundError` in the migration framework
- Collector failures

**Resolution:**
```bash
# Install dependencies
pip install -r requirements.txt

# Check Python path
python -c "import sys; print(sys.path)"
```

#### Docker Resource Constraints

**Symptoms:**
- Container startup failures
- Out of memory errors

**Resolution:**
```bash
# Check Docker disk usage
docker system df

# Clean up unused resources (removes ALL unused images, not only dangling ones)
docker system prune -a

# Raise memory and CPU limits
# Note: /etc/docker/daemon.json has no "memory" or "cpu-count" keys.
# Limits are set per container, for example:
#   docker run --memory 4g --cpus 2 <image>
# On Docker Desktop, raise the VM allocation under Settings > Resources.
```
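
To find out which container is under memory pressure before pruning or changing limits, `docker stats` can be filtered by usage. The following is a minimal sketch. The function name and the 80 percent threshold in the example are illustrative.

```bash
# containers_over_mem THRESHOLD_PERCENT
# Prints "name percent" for each running container whose memory usage is
# above the threshold. (Hypothetical helper.)
containers_over_mem() {
    docker stats --no-stream --format '{{.Name}} {{.MemPerc}}' |
        awk -v limit="$1" '
            { pct = $2; sub(/%$/, "", pct) }      # strip the trailing % sign
            pct + 0 > limit + 0 { print $1, pct "%" }
        '
}

# Example (assumed threshold of 80 percent):
#   containers_over_mem 80
```

The percentage is relative to the container's memory limit, or to the host's total memory when no limit is set.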

### Log Locations

- **Ansible:** `enterprise-infra-simulator/ansible.log`
- **Docker:** `docker logs <container-name>`
- **Elasticsearch:** `observability-stack/logs/elasticsearch.log`
- **Migration Framework:** `migration-validation-framework/logs/validation.log`

### Performance Monitoring

```bash
# Infrastructure monitoring
ansible -i inventory/hosts.ini all -m shell -a "top -b -n1 | head -20"

# Elasticsearch metrics
curl -X GET "localhost:9200/_cluster/stats?pretty"

# Python performance
python -m cProfile cli.py snapshot
```

### Backup and Recovery

#### Infrastructure Backup
```bash
cd enterprise-infra-simulator
docker-compose exec ansible ansible-playbook /playbooks/backup.yml
```

#### Data Backup
```bash
cd observability-stack
# Registers the snapshot repository "backup". curl runs inside the container,
# so backup_config.json must be readable from the container's working directory.
docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup" -H "Content-Type: application/json" -d @backup_config.json
```

#### Migration Data Backup
```bash
cd migration-validation-framework
python cli.py backup --destination /backup/location
```

## Emergency Procedures

### Complete System Reset

```bash
# Run from the repository root. Each step runs in a subshell so the working
# directory is unchanged for the next one.

# Stop all services (the first command is assumed to target the observability stack)
(cd observability-stack && docker-compose down -v)
(cd enterprise-infra-simulator && make destroy)

# Clean up volumes (affects unused volumes host-wide, not only this project's)
docker volume prune -f

# Restart from clean state
(cd enterprise-infra-simulator && make up)
(cd observability-stack && docker-compose up -d)
```

### Incident Response

1. **Assess Impact:** Check monitoring dashboards
2. **Isolate Issue:** Use failure simulation scripts to reproduce it
3. **Implement Fix:** Apply the appropriate runbook procedure
4. **Validate Recovery:** Run the validation framework
5. **Document Incident:** Update runbooks with lessons learned

## Maintenance Schedules

- **Daily:** Log rotation and cleanup
- **Weekly:** Security patching and updates
- **Monthly:** Performance optimization and capacity planning
- **Quarterly:** Architecture review and modernization
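
The schedule names the tasks but not the commands behind them. As one example, the daily log cleanup could look like the sketch below. The function name, the `*.log` pattern, and the 7-day retention in the example are assumptions, not project settings.

```bash
# cleanup_old_logs DIR DAYS
# Deletes *.log files under DIR that were last modified more than DAYS days
# ago and prints each removed path. (Hypothetical helper.)
cleanup_old_logs() {
    if [ ! -d "$1" ]; then
        echo "not a directory: $1" >&2
        return 1
    fi
    find "$1" -type f -name '*.log' -mtime +"$2" -print -exec rm -f {} +
}

# Example (assumed retention of 7 days):
#   cleanup_old_logs observability-stack/logs 7
```

Printing each path before removal leaves a record of what was deleted when the function runs from a scheduler.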