portfolio/docs/runbooks.md
Mateusz Suski 8783892241
Polish infrastructure portfolio projects
2026-04-29 23:30:30 +00:00

Runbooks and Operational Procedures

This document contains operational runbooks for deploying, managing, and troubleshooting the Enterprise Infrastructure Portfolio projects.

Table of Contents

  1. Infrastructure Simulator Operations
  2. Migration Validation Procedures
  3. Observability Stack Management
  4. Troubleshooting Guide
  5. Backup and Recovery
  6. Emergency Procedures
  7. Maintenance Schedules

Infrastructure Simulator Operations

Starting the Infrastructure

cd enterprise-infra-simulator
make up

Expected Outcome:

  • Docker containers for simulated Linux nodes are created
  • Ansible inventory is populated
  • Basic services are running on all nodes

Verification:

docker ps | grep infra-sim
ansible -i inventory/hosts.ini all -m ping
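
The ping output can be long on larger inventories. A minimal sketch, assuming Ansible's default ad-hoc output format (one `host | STATUS => {` line per host), that groups hosts by reported status. The function name and sample text are illustrative, not part of the simulator.

```python
import re

# Matches the per-host status line Ansible prints in ad-hoc mode,
# e.g. "web01 | SUCCESS => {" or "db01 | UNREACHABLE! => {".
STATUS_LINE = re.compile(r"^(?P<host>\S+) \| (?P<status>[A-Z]+)!? ")


def summarize_ping(output: str) -> dict[str, list[str]]:
    """Group host names by the status Ansible reported for them."""
    summary: dict[str, list[str]] = {}
    for line in output.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            summary.setdefault(match["status"], []).append(match["host"])
    return summary


# Hypothetical output, shaped like what `ansible ... -m ping` prints.
sample = """\
web01 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
db01 | UNREACHABLE! => {
    "changed": false,
    "msg": "Failed to connect to the host via ssh",
    "unreachable": true
}
"""

print(summarize_ping(sample))
```

Any host listed under `UNREACHABLE` is a candidate for the Ansible Connection Failures procedure in the Troubleshooting Guide.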

Patching Operations

cd enterprise-infra-simulator
make patch

Procedure:

  1. Backup current container states
  2. Apply security patches via Ansible
  3. Validate service availability
  4. Generate patch report

Rollback:

docker-compose down
docker-compose up -d --scale node=0
make up

Hardening Operations

cd enterprise-infra-simulator
ansible-playbook -i inventory/hosts.ini playbooks/hardening.yml

Hardening Steps:

  • Disable unnecessary services
  • Configure firewall rules
  • Set secure SSH configurations
  • Apply CIS benchmarks

Scaling Operations

cd enterprise-infra-simulator
./scripts/simulate_scaling.sh up 3

Scaling Parameters:

  • Direction: up/down
  • Count: number of nodes to add/remove
  • Type: web/app/db

Failure Simulation

cd enterprise-infra-simulator
./scripts/simulate_failure.sh --type network --duration 300

Failure Types:

  • network: Network partition
  • disk: Disk space exhaustion
  • service: Service crashes
  • node: Complete node failure
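
When failure drills are triggered from other tooling, it may help to validate the arguments before calling the script. The sketch below only builds the command line shown above. The function name is hypothetical, and since the unit of `--duration` is not stated in this runbook, the value is passed through unchanged.

```python
import shlex

# Failure types listed in this runbook.
FAILURE_TYPES = {"network", "disk", "service", "node"}


def failure_command(failure_type: str, duration: int) -> list[str]:
    """Build the argv for simulate_failure.sh, rejecting unknown inputs."""
    if failure_type not in FAILURE_TYPES:
        raise ValueError(f"unknown failure type: {failure_type!r}")
    if duration <= 0:
        raise ValueError("duration must be a positive integer")
    return [
        "./scripts/simulate_failure.sh",
        "--type", failure_type,
        "--duration", str(duration),
    ]


# Print the command instead of running it; pass the list to
# subprocess.run(..., check=True) from enterprise-infra-simulator to execute.
print(shlex.join(failure_command("network", 300)))
```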

Decommissioning

cd enterprise-infra-simulator
make destroy

Decommission Steps:

  1. Graceful service shutdown
  2. Data backup and export
  3. Configuration cleanup
  4. Container removal

Migration Validation Procedures

Pre-Migration Snapshot

cd migration-validation-framework
python3 cli.py collect --output before.json --systems web01,db01

Data Collected:

  • Mount points and filesystem usage
  • Running services and their states
  • Disk usage statistics
  • Network configurations

Post-Migration Validation

python3 cli.py collect --output after.json --systems web01,db01
python3 cli.py compare before.json after.json --output diff.json

Validation Checks:

  • Service availability verification
  • Filesystem integrity
  • Configuration consistency
  • Performance metrics comparison
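
To illustrate what a comparison of two snapshots involves, here is a minimal sketch of a recursive diff over nested JSON objects. It is not the framework's `compare` implementation, and the snapshot layout in the example (systems, then `services` and `mounts`) is an assumption made for illustration.

```python
import json
from typing import Any


def load_snapshot(path: str) -> dict:
    """Read a snapshot file such as before.json."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def diff_values(before: Any, after: Any, path: str = "") -> dict[str, tuple[Any, Any]]:
    """Return {path: (old, new)} for every leaf that differs.

    A key missing on one side is reported with None on that side, so a
    missing key and an explicit null are not distinguished in this sketch.
    """
    if isinstance(before, dict) and isinstance(after, dict):
        changes: dict[str, tuple[Any, Any]] = {}
        for key in sorted(before.keys() | after.keys()):
            child = f"{path} > {key}" if path else str(key)
            changes.update(diff_values(before.get(key), after.get(key), child))
        return changes
    return {} if before == after else {path: (before, after)}


# Hypothetical snapshot contents; the real schema may differ.
before = {"web01": {"services": {"nginx": "running", "cron": "running"},
                    "mounts": {"/": 41}}}
after = {"web01": {"services": {"nginx": "stopped", "cron": "running"},
                   "mounts": {"/": 41, "/data": 10}}}

for location, (old, new) in diff_values(before, after).items():
    print(f"{location}: {old!r} -> {new!r}")
```

With real files, `diff_values(load_snapshot("before.json"), load_snapshot("after.json"))` would produce the same kind of mapping.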

Report Generation

python3 cli.py report --comparison <comparison-id> --format html

Report Contents:

  • Executive summary
  • Detailed change log
  • Risk assessment
  • Recommendations

Observability Stack Management

Starting the Stack

cd observability-stack
docker-compose up -d

Service Startup Order:

  1. Elasticsearch
  2. Logstash
  3. Kibana
  4. Grafana
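
`docker-compose up -d` returns before the services are ready to accept connections, so a readiness check in the same order can be useful. A minimal sketch, assuming each service is reachable on localhost: ports 9200, 8080, and 3000 come from the curl examples in this runbook (that 8080 belongs to Logstash is an assumption), and 5601 is Kibana's default port rather than something this document states.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0,
                  interval: float = 1.0) -> bool:
    """Poll until a TCP connection to host:port succeeds or time runs out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Startup order from the list above; port assignments are partly assumed.
STACK = [("Elasticsearch", 9200), ("Logstash", 8080),
         ("Kibana", 5601), ("Grafana", 3000)]


def first_unavailable(services, host: str = "localhost",
                      timeout: float = 60.0):
    """Return the name of the first service that never opens its port."""
    for name, port in services:
        if not wait_for_port(host, port, timeout):
            return name
    return None


# Usage (blocks for up to `timeout` seconds per service):
#   failed = first_unavailable(STACK)
#   print("stack is up" if failed is None else f"{failed} did not start")
```

An open port only shows that the process is listening; for Elasticsearch, the cluster health check in the Troubleshooting Guide is a stronger signal.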

Log Ingestion Testing

# Send sample logs
curl -X POST "localhost:8080" -H "Content-Type: application/json" --data-binary @logs/sample.log
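
If `logs/sample.log` is not at hand, a single test event can be generated instead. This sketch builds a JSON event and the matching POST request with the standard library. The field names (`@timestamp`, `level`, `service`, `message`) are assumptions, not a schema the pipeline is known to require.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_event(message: str, level: str = "INFO",
                service: str = "sample") -> bytes:
    """Serialize one log event as UTF-8 encoded JSON."""
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "message": message,
    }
    return json.dumps(event).encode("utf-8")


def make_request(payload: bytes,
                 url: str = "http://localhost:8080") -> urllib.request.Request:
    """Prepare (but do not send) the POST request for the ingestion endpoint."""
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send the event once the stack is running:
#   urllib.request.urlopen(make_request(build_event("disk usage at 91%")), timeout=5)
```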

Alert Configuration

# Load alert rules
curl -X POST "localhost:3000/api/alerts" -H "Authorization: Bearer <token>" -H "Content-Type: application/json" -d @alerting/alert_rules.json

Incident Simulation

cd observability-stack
./scenarios/incident_simulation.sh --type disk-full --severity critical

Incident Types:

  • disk-full: Simulate disk space exhaustion
  • service-down: Service failure simulation
  • high-cpu: CPU utilization spike
  • network-latency: Network performance degradation

Troubleshooting Guide

Common Issues

Ansible Connection Failures

Symptoms:

  • UNREACHABLE errors in Ansible output
  • SSH connection timeouts

Resolution:

# Check container status
docker ps | grep infra-sim

# Verify SSH keys
ansible -i inventory/hosts.ini all -m ping --private-key ~/.ssh/id_rsa

# Restart containers
make destroy && make up

Elasticsearch Cluster Issues

Symptoms:

  • Kibana shows "No living connections"
  • Logstash pipeline failures

Resolution:

# Check cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"

# Restart services
docker-compose restart elasticsearch logstash kibana
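
The health response can also be checked programmatically before deciding to restart. A minimal sketch that reads `status`, `number_of_nodes`, and `unassigned_shards` from the `_cluster/health` response; the function names are illustrative. Note that a single-node cluster holding indices with replicas typically reports `yellow`, which may be acceptable in a lab setup.

```python
import json
import urllib.request


def fetch_health(base_url: str = "http://localhost:9200") -> dict:
    """GET /_cluster/health and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
        return json.load(resp)


def health_problems(health: dict) -> list[str]:
    """Turn a cluster health document into a list of human-readable findings."""
    problems = []
    status = health.get("status")
    if status != "green":
        problems.append(f"cluster status is {status}")
    if health.get("number_of_nodes", 0) < 1:
        problems.append("no nodes have joined the cluster")
    unassigned = health.get("unassigned_shards", 0)
    if unassigned > 0:
        problems.append(f"{unassigned} unassigned shards")
    return problems


# Usage against a running stack:
#   for problem in health_problems(fetch_health()):
#       print(problem)
```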

Python Import Errors

Symptoms:

  • ModuleNotFoundError in migration framework
  • Collector failures

Resolution:

# Install dependencies (run from migration-validation-framework, with the same interpreter that runs cli.py)
python3 -m pip install -r requirements.txt

# Check Python path
python3 -c "import sys; print(sys.path)"
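
To find out which imports are actually missing, the module names can be probed without importing them. A minimal sketch using `importlib.util.find_spec`. It handles top-level module names only, and the names passed in are examples: import names can differ from the package names in `requirements.txt` (for instance, the `PyYAML` package is imported as `yaml`).

```python
import importlib.util


def missing_modules(names: list[str]) -> list[str]:
    """Return the top-level module names that cannot be found on sys.path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "json" ships with Python; the second name is deliberately nonexistent.
print(missing_modules(["json", "no_such_module_for_runbook_demo"]))
```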

Docker Resource Constraints

Symptoms:

  • Container startup failures
  • Out of memory errors

Resolution:

# Check Docker resources
docker system df

# Clean up unused resources
docker system prune -a

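# Raise memory and CPU limits
# Note: /etc/docker/daemon.json has no "memory" or "cpu-count" keys, and the
# daemon refuses to start when the file contains options it does not recognize.
# Docker Desktop: Settings > Resources. Linux: adjust limits per container:
docker update --memory 4g --memory-swap 4g --cpus 2 <container-name>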

Log Locations

  • Ansible: enterprise-infra-simulator/ansible.log
  • Docker: docker logs <container-name>
  • Elasticsearch: observability-stack/logs/elasticsearch.log
  • Migration Framework: migration-validation-framework/logs/validation.log

Performance Monitoring

# Infrastructure monitoring
ansible -i inventory/hosts.ini all -m shell -a "top -b -n1 | head -20"

# Elasticsearch metrics
curl -X GET "localhost:9200/_cluster/stats?pretty"

# Python performance (run from migration-validation-framework)
python3 -m cProfile cli.py collect --output before.json --systems web01,db01

Backup and Recovery

Infrastructure Backup

cd enterprise-infra-simulator
docker-compose exec ansible ansible-playbook /playbooks/backup.yml

Data Backup

cd observability-stack
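# Register the snapshot repository, then take a snapshot (backup_config.json must exist inside the container)
docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup" -H "Content-Type: application/json" -d @backup_config.json
docker-compose exec elasticsearch curl -X PUT "localhost:9200/_snapshot/backup/<snapshot-name>?wait_for_completion=true"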

Migration Data Backup

cd migration-validation-framework
tar -czf /backup/location/migration-validation-framework.tgz -C .. migration-validation-framework
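
Where backups are taken repeatedly, a timestamped archive name avoids overwriting the previous one. A minimal sketch with the standard `tarfile` module; the function name and the timestamp format are choices made for this example, and the destination directory is assumed to lie outside the directory being archived.

```python
import tarfile
import time
from pathlib import Path


def archive_directory(source: Path, dest_dir: Path) -> Path:
    """Write source into dest_dir as <name>-<UTC timestamp>.tgz and return the path."""
    source = source.resolve()
    dest_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    archive = dest_dir / f"{source.name}-{stamp}.tgz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to the directory name.
        tar.add(source, arcname=source.name)
    return archive


# Usage, from the repository root:
#   archive_directory(Path("migration-validation-framework"), Path("/backup/location"))
```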

Emergency Procedures

Complete System Reset

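# Run from the repository root; each step uses a subshell so the working directory is unchanged

# Stop all services
(cd observability-stack && docker-compose down -v)
(cd enterprise-infra-simulator && make destroy)

# Clean up volumes
docker volume prune -f

# Restart from clean state
(cd enterprise-infra-simulator && make up)
(cd observability-stack && docker-compose up -d)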

Incident Response

  1. Assess Impact: Check monitoring dashboards
  2. Isolate Issue: Use failure simulation scripts to reproduce
  3. Implement Fix: Apply appropriate runbook procedure
  4. Validate Recovery: Run validation framework
  5. Document Incident: Update runbooks with lessons learned

Maintenance Schedules

  • Daily: Log rotation and cleanup
  • Weekly: Security patching and updates
  • Monthly: Performance optimization and capacity planning
  • Quarterly: Architecture review and modernization