feat: Add comprehensive enterprise Linux infrastructure portfolio with Ansible, Python, and ELK stack
Mateusz Suski
2026-04-29 23:14:14 +00:00
parent 2313efac88
commit 7757020014
33 changed files with 6165 additions and 0 deletions
observability-stack/README.md  +461
@@ -0,0 +1,461 @@
# Observability Stack
A comprehensive monitoring and logging stack for enterprise infrastructure observability using the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.
## Overview
The Observability Stack provides a complete monitoring solution with:
- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures
## Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Log Sources   │    │    Logstash     │    │  Elasticsearch  │
│  (Applications  │───►│  (Ingestion &   │───►│   (Storage &    │
│   / Systems)    │    │   Processing)   │    │   Analytics)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Alerting     │    │     Kibana      │    │     Grafana     │
│     Rules       │    │  (Dashboards &  │    │   (Metrics &    │
│                 │    │   Exploration)  │    │   Dashboards)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
## Quick Start
### Prerequisites
- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available
### Setup
```bash
cd observability-stack
# Start the observability stack
docker-compose up -d
# Wait for services to become healthy (typically 2-3 minutes)
sleep 180
# Verify services are running (Elasticsearch and Kibana require basic auth)
curl -u elastic:elastic -X GET "localhost:9200/_cluster/health?pretty"
curl -u elastic:elastic -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```
### Access Interfaces
- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200
## Project Structure
```
observability-stack/
├── docker-compose.yml             # Service orchestration
├── logstash/                      # Logstash configuration
│   ├── pipeline/                  # Processing pipelines
│   └── config/                    # Logstash settings
├── elasticsearch/                 # Elasticsearch configuration
│   └── config/                    # Cluster settings
├── kibana/                        # Kibana configuration
│   └── config/                    # Dashboard settings
├── grafana/                       # Grafana configuration
│   ├── provisioning/              # Dashboards and datasources
│   └── dashboards/                # Dashboard definitions
├── filebeat/                      # Filebeat configuration
│   └── config/                    # Filebeat settings (mounted in docker-compose.yml)
├── logs/                          # Sample log data
│   └── sample.log                 # Realistic application logs
├── alerting/                      # Alert configuration
│   └── alert_rules.yml            # Alert definitions
├── scenarios/                     # Incident simulation
│   └── incident_simulation.sh     # Simulation scripts
└── README.md
```
## Services Configuration
### Elasticsearch
**Configuration**: `elasticsearch/config/elasticsearch.yml`
Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access
**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data
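A minimal index template can keep `logs-*` mappings consistent across daily indices. The sketch below is illustrative and not part of the shipped configuration:
```bash
# Create an index template for the logs-* indices (illustrative example)
curl -u elastic:elastic -X PUT "localhost:9200/_index_template/logs-template" \
  -H "Content-Type: application/json" \
  -d '{
        "index_patterns": ["logs-*"],
        "template": {
          "settings": {"number_of_shards": 1, "number_of_replicas": 0},
          "mappings": {
            "properties": {
              "@timestamp": {"type": "date"},
              "level":      {"type": "keyword"}
            }
          }
        }
      }'
```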
### Logstash
**Pipelines**: `logstash/pipeline/`
- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing
**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
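As a minimal sketch of one such pipeline (the file name `application_logs.conf` is hypothetical; the ports mirror `docker-compose.yml`, and the grok pattern assumes the bracketed format used in `logs/sample.log`):
```conf
# logstash/pipeline/application_logs.conf (sketch)
input {
  # HTTP input matching the 8080 port exposed in docker-compose.yml
  http { port => 8080 }
  # Beats input for Filebeat agents (port 5044)
  beats { port => 5044 }
}
filter {
  # Parse "[2024-01-15 08:30:15] INFO com.example.app.Application - message"
  grok {
    match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{LOGLEVEL:level} %{NOTSPACE:logger} - %{GREEDYDATA:msg}" }
  }
  date { match => ["timestamp", "yyyy-MM-dd HH:mm:ss"] }
}
output {
  elasticsearch {
    hosts    => ["http://elasticsearch:9200"]
    user     => "elastic"
    password => "elastic"
    index    => "logs-%{+YYYY.MM.dd}"
  }
}
```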
### Kibana
**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard
**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting
### Grafana
**Data Sources**:
- Elasticsearch for logs and metrics (provisioning sketch below)
- Prometheus (if available)
- InfluxDB for time-series data
**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
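The Elasticsearch data source can be provisioned declaratively. A sketch of a hypothetical `grafana/provisioning/datasources/elasticsearch.yml` (keys follow Grafana's provisioning schema; exact field names vary slightly across versions):
```yaml
apiVersion: 1
datasources:
  - name: Elasticsearch-Logs
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    basicAuth: true
    basicAuthUser: elastic
    secureJsonData:
      basicAuthPassword: elastic
    jsonData:
      index: "logs-*"          # index pattern to query
      timeField: "@timestamp"  # field used for time filtering
```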
## Log Ingestion
### Sample Data
The stack includes realistic sample logs for testing:
```bash
# Ingest sample logs line-by-line into the Logstash HTTP input
# (one POST per log line, so each line becomes one event)
while IFS= read -r line; do
  curl -s -X POST "localhost:8080" \
    -H "Content-Type: text/plain" \
    --data "$line"
done < logs/sample.log
```
### Log Formats Supported
- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
### Data Enrichment
Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
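A filter sketch showing these enrichment steps (the `clientip` and `agent` field names assume an access log already parsed by grok):
```conf
filter {
  # Add GeoIP location data based on the client address
  geoip { source => "clientip" }
  # Parse the raw user-agent string into browser/OS fields
  useragent { source => "agent" target => "ua" }
  # Normalize the original timestamp into @timestamp
  date { match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"] }
  # Attach host metadata
  mutate { add_field => { "environment" => "demo" } }
}
```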
## Alerting and Monitoring
### Alert Rules
Located in `alerting/alert_rules.yml`:
```yaml
alert_rules:
  - name: "High CPU Usage"
    condition: "cpu_usage > 90"
    duration: "5m"
    severity: "critical"
    channels: ["email", "slack"]
  - name: "Disk Space Low"
    condition: "disk_usage > 85"
    duration: "10m"
    severity: "warning"
    channels: ["email"]
  - name: "Service Down"
    condition: "service_status == 'down'"
    duration: "2m"
    severity: "critical"
    channels: ["email", "pagerduty"]
```
### Alert Channels
- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints
## Incident Simulation
### Available Scenarios
```bash
cd scenarios
# Simulate disk space exhaustion (target directory, duration in seconds)
./incident_simulation.sh disk /tmp 30
# Simulate service failure
./incident_simulation.sh service nginx
# Simulate network latency (interface, duration in seconds)
./incident_simulation.sh network lo 20
# Simulate high CPU usage
./incident_simulation.sh cpu 60
```
### Scenario Types
- **cpu**: CPU utilization spikes
- **memory**: Gradual memory consumption growth
- **disk**: Filesystem capacity exhaustion
- **network**: Network latency and packet loss
- **service**: Application service crashes and restarts
- **database**: Connection-pool exhaustion and query timeouts
- **app-errors**: Bursts of application error logs
- **comprehensive**: Multi-phase scenario combining the above
## Dashboards and Visualization
### Kibana Dashboards
Pre-configured dashboards for:
1. **Log Analysis**
- Log volume over time
- Error rate trends
- Top error messages
- Geographic request distribution
2. **System Monitoring**
- CPU and memory usage
- Disk I/O statistics
- Network traffic
- System load averages
3. **Application Performance**
- Response time distributions
- Request rate metrics
- Error percentages
- User session analytics
### Grafana Dashboards
Advanced visualization panels:
- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting
## API Endpoints
### Elasticsearch APIs
```bash
# Cluster health
GET /_cluster/health
# Index statistics
GET /_cat/indices?v
# Search logs
GET /logs-*/_search
{
"query": {
"match": {
"message": "ERROR"
}
}
}
```
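The same search as a shell command against this stack (note the basic-auth credentials):
```bash
curl -u elastic:elastic -X GET "localhost:9200/logs-*/_search" \
  -H "Content-Type: application/json" \
  -d '{"query": {"match": {"message": "ERROR"}}}'
```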
### Kibana APIs
```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard
# Export visualizations
GET /api/saved_objects/visualization/{id}
```
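For example, listing dashboards from the shell (Kibana accepts the same credentials once Elasticsearch security is enabled):
```bash
curl -u elastic:elastic "localhost:5601/api/saved_objects/_find?type=dashboard"
```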
### Grafana APIs
```bash
# Get dashboard list
GET /api/search?query=*
# Alert rules (unified alerting provisioning API, Grafana 9+)
GET /api/v1/provisioning/alert-rules
```
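With this stack's default admin credentials, for example:
```bash
# List all dashboards known to Grafana
curl -u admin:admin "localhost:3000/api/search"
```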
## Configuration Management
### Environment Variables
```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"
# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"
# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```
### Scaling Configuration
For production deployment (note that `deploy` settings take effect under Docker Swarm via `docker stack deploy`, not plain `docker-compose up`):
```yaml
version: '3.8'
services:
  elasticsearch:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: '2.0'
```
## Security Considerations
### Authentication
- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled
### Network Security
- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)
### Data Protection
- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
## Troubleshooting
### Common Issues
**Elasticsearch Won't Start:**
```bash
# Check memory allocation
docker-compose logs elasticsearch
# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```
**Logstash Pipeline Errors:**
```bash
# Check pipeline configuration
docker-compose logs logstash
# Validate pipeline syntax (use the full binary path inside the container)
docker-compose exec logstash /usr/share/logstash/bin/logstash -t \
  -f /usr/share/logstash/pipeline/ --path.settings /usr/share/logstash/config
```
**Kibana Connection Issues:**
```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"
# Check Kibana logs
docker-compose logs kibana
```
### Performance Tuning
**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization
**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues
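These knobs map to settings in `logstash/config/logstash.yml`; the values below are illustrative starting points, not shipped defaults:
```yaml
pipeline.workers: 4        # parallel filter/output workers
pipeline.batch.size: 250   # events per worker batch
queue.type: persisted      # durable on-disk queue instead of in-memory
queue.max_bytes: 1gb       # cap for the persistent queue
```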
**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
## Development and Testing
### Adding New Dashboards
1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service
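Step 2 refers to the dashboard provider file; a sketch of a hypothetical `grafana/provisioning/dashboards/dashboards.yml` matching the volume mounts in `docker-compose.yml`:
```yaml
apiVersion: 1
providers:
  - name: 'observability'
    folder: 'Observability'
    type: file
    updateIntervalSeconds: 30
    options:
      # Matches the ./grafana/dashboards bind mount in docker-compose.yml
      path: /var/lib/grafana/dashboards
```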
### Custom Alert Rules
1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios
### Log Pipeline Development
1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana
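Step 2 can be done by hand against the HTTP input (assuming port 8080 is an `http` input, as the compose health check implies):
```bash
# Send one sample line through the pipeline, then inspect it in Kibana
curl -s -X POST "localhost:8080" \
  -H "Content-Type: text/plain" \
  --data "[2024-01-15 08:30:15] INFO com.example.app.Application - pipeline test"
```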
## Backup and Recovery
### Data Backup
```bash
# Register a snapshot repository once (its location must be listed under
# path.repo in elasticsearch.yml; the path below is illustrative)
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup" \
  -H "Content-Type: application/json" \
  -d '{"type": "fs", "settings": {"location": "/usr/share/elasticsearch/backup"}}'
# Elasticsearch snapshot of all indices
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```
### Configuration Backup
```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
logstash/ elasticsearch/ kibana/ grafana/
```
## Contributing
1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility
## License
Enterprise Internal Use Only
observability-stack/alerting/alert_rules.yml  +326
@@ -0,0 +1,326 @@
# Enterprise Observability Alert Rules
# Alert definitions for automated incident detection and notification
alert_rules:
  # System Resource Alerts
  - name: "High CPU Usage"
    description: "CPU utilization exceeds threshold"
    condition: "cpu_usage_percent > 90"
    duration: "5m"
    severity: "critical"
    tags:
      - system
      - performance
    channels:
      - email
      - slack
    labels:
      team: "platform"
      component: "system"

  - name: "High Memory Usage"
    description: "Memory utilization exceeds threshold"
    condition: "memory_usage_percent > 85"
    duration: "3m"
    severity: "warning"
    tags:
      - system
      - memory
    channels:
      - email
    labels:
      team: "platform"
      component: "system"

  - name: "Disk Space Critical"
    description: "Disk usage exceeds critical threshold"
    condition: "disk_usage_percent > 95"
    duration: "2m"
    severity: "critical"
    tags:
      - storage
      - disk
    channels:
      - email
      - pagerduty
    labels:
      team: "platform"
      component: "storage"

  - name: "Disk Space Warning"
    description: "Disk usage exceeds warning threshold"
    condition: "disk_usage_percent > 85"
    duration: "10m"
    severity: "warning"
    tags:
      - storage
      - disk
    channels:
      - email
    labels:
      team: "platform"
      component: "storage"

  # Service Availability Alerts
  - name: "Service Down"
    description: "Critical service is not responding"
    condition: "service_status == 'down' OR http_status_code >= 500"
    duration: "2m"
    severity: "critical"
    tags:
      - service
      - availability
    channels:
      - email
      - slack
      - pagerduty
    labels:
      team: "application"
      component: "service"

  - name: "Database Connection Failed"
    description: "Database connection pool exhausted or unresponsive"
    condition: "db_connections_active == 0 OR db_response_time > 5000"
    duration: "1m"
    severity: "critical"
    tags:
      - database
      - connectivity
    channels:
      - email
      - pagerduty
    labels:
      team: "database"
      component: "postgresql"

  - name: "Cache Unavailable"
    description: "Cache service is down or unresponsive"
    condition: "cache_hit_ratio < 0.1 OR cache_response_time > 1000"
    duration: "3m"
    severity: "warning"
    tags:
      - cache
      - performance
    channels:
      - email
    labels:
      team: "infrastructure"
      component: "redis"

  # Application Performance Alerts
  - name: "High Error Rate"
    description: "Application error rate exceeds threshold"
    condition: "error_rate_percent > 5"
    duration: "5m"
    severity: "critical"
    tags:
      - application
      - errors
    channels:
      - email
      - slack
    labels:
      team: "application"
      component: "api"

  - name: "Slow Response Time"
    description: "API response time exceeds SLA"
    condition: "response_time_p95 > 2000"
    duration: "5m"
    severity: "warning"
    tags:
      - application
      - performance
    channels:
      - email
    labels:
      team: "application"
      component: "api"

  - name: "High Request Queue"
    description: "Request queue depth is too high"
    condition: "queue_depth > 100"
    duration: "3m"
    severity: "warning"
    tags:
      - application
      - queue
    channels:
      - email
    labels:
      team: "application"
      component: "queue"

  # Infrastructure Alerts
  - name: "Network Latency High"
    description: "Network round-trip time exceeds threshold"
    condition: "network_rtt > 100"
    duration: "5m"
    severity: "warning"
    tags:
      - network
      - latency
    channels:
      - email
    labels:
      team: "network"
      component: "infrastructure"

  - name: "Load Balancer Unhealthy"
    description: "Load balancer backend servers are unhealthy"
    condition: "lb_unhealthy_backends > 0"
    duration: "2m"
    severity: "critical"
    tags:
      - loadbalancer
      - availability
    channels:
      - email
      - pagerduty
    labels:
      team: "infrastructure"
      component: "loadbalancer"

  # Security Alerts
  - name: "Failed Login Attempts"
    description: "Multiple failed authentication attempts detected"
    condition: "failed_login_attempts > 5"
    duration: "5m"
    severity: "warning"
    tags:
      - security
      - authentication
    channels:
      - email
      - slack
    labels:
      team: "security"
      component: "authentication"

  - name: "Suspicious Network Traffic"
    description: "Unusual network traffic patterns detected"
    condition: "network_bytes_unusual > 1000000"
    duration: "10m"
    severity: "warning"
    tags:
      - security
      - network
    channels:
      - email
    labels:
      team: "security"
      component: "network"

  # Log-based Alerts
  - name: "Application Errors"
    description: "High volume of application error logs"
    condition: "log_errors_per_minute > 10"
    duration: "2m"
    severity: "warning"
    tags:
      - logs
      - errors
    channels:
      - email
    labels:
      team: "application"
      component: "logs"

  - name: "Out of Memory Errors"
    description: "Out of memory errors detected in logs"
    condition: "log_oom_errors > 0"
    duration: "1m"
    severity: "critical"
    tags:
      - memory
      - errors
    channels:
      - email
      - pagerduty
    labels:
      team: "application"
      component: "memory"

  # Business Logic Alerts
  - name: "Low Business Transactions"
    description: "Business transaction volume below expected threshold"
    condition: "business_transactions_per_hour < 100"
    duration: "15m"
    severity: "warning"
    tags:
      - business
      - transactions
    channels:
      - email
    labels:
      team: "business"
      component: "transactions"

  - name: "Payment Failures"
    description: "Payment processing failure rate is high"
    condition: "payment_failure_rate > 0.05"
    duration: "5m"
    severity: "critical"
    tags:
      - payments
      - business
    channels:
      - email
      - pagerduty
    labels:
      team: "payments"
      component: "processing"

# Alert Channels Configuration
alert_channels:
  email:
    type: "email"
    recipients:
      - "platform-team@company.com"
      - "oncall@company.com"
    subject_template: "[{{severity}}] {{name}} - {{description}}"
  slack:
    type: "slack"
    webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    channel: "#alerts"
    username: "Observability Bot"
    icon_emoji: ":warning:"
  pagerduty:
    type: "pagerduty"
    integration_key: "your-pagerduty-integration-key"
    severity_mapping:
      critical: "critical"
      warning: "warning"
      info: "info"

# Alert Silencing Rules
silence_rules:
  - name: "Maintenance Window"
    condition: "maintenance_window == true"
    duration: "4h"
    comment: "Silenced during scheduled maintenance"
  - name: "Known Issue"
    condition: "known_issue_id == 'TICKET-123'"
    duration: "24h"
    comment: "Silenced for known issue resolution"

# Escalation Policies
escalation_policies:
  - name: "Default Escalation"
    steps:
      - delay: "5m"
        channels: ["email"]
      - delay: "15m"
        channels: ["slack"]
      - delay: "30m"
        channels: ["pagerduty"]
  - name: "Critical Escalation"
    steps:
      - delay: "0m"
        channels: ["email", "slack", "pagerduty"]
      - delay: "10m"
        channels: ["pagerduty"]  # Escalation
observability-stack/docker-compose.yml  +122
@@ -0,0 +1,122 @@
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: observability-elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=elastic
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
      - ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      # Security is enabled, so the health endpoint requires credentials
      test: ["CMD-SHELL", "curl -fu elastic:elastic http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: observability-logstash
    environment:
      - "LS_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logs:/usr/share/logstash/logs
    ports:
      - "5044:5044"
      - "8080:8080"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8080 || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: observability-kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      # Kibana rejects the "elastic" superuser for its own connection;
      # use the built-in kibana_system account (set its password first)
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=elastic
    volumes:
      - ./kibana/config/kibana.yml:/usr/share/kibana/config/kibana.yml
    ports:
      - "5601:5601"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5601/api/status || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  grafana:
    image: grafana/grafana:10.2.0
    container_name: observability-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    container_name: observability-filebeat
    user: root
    volumes:
      - ./filebeat/config/filebeat.yml:/usr/share/filebeat/filebeat.yml
      - ./logs:/var/log/sample
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - observability
    depends_on:
      - logstash
    restart: unless-stopped

volumes:
  elasticsearch_data:
    driver: local
  grafana_data:
    driver: local

networks:
  observability:
    driver: bridge
    ipam:
      config:
        - subnet: 172.25.0.0/16
observability-stack/logs/incident_simulation.log  +2
@@ -0,0 +1,2 @@
[2026-04-29 22:52:26] INFO Incident simulation script started
[2026-04-29 22:52:26] INFO Scenario: help
observability-stack/logs/sample.log  +98
@@ -0,0 +1,98 @@
# Sample Application Logs for Observability Stack Testing
# Generated realistic log entries for demonstration and testing
[2024-01-15 08:30:15] INFO com.example.app.Application - Application started successfully on port 8080
[2024-01-15 08:30:16] INFO com.example.database.ConnectionPool - Database connection pool initialized with 10 connections
[2024-01-15 08:30:17] INFO com.example.cache.RedisCache - Redis cache connected successfully
[2024-01-15 08:30:18] INFO com.example.messaging.RabbitMQClient - Message queue connection established
[2024-01-15 08:31:22] INFO com.example.api.UserController - User login attempt: user=john.doe, ip=192.168.1.100
[2024-01-15 08:31:23] INFO com.example.auth.AuthService - Authentication successful for user john.doe
[2024-01-15 08:31:24] INFO com.example.api.UserController - API request: GET /api/users/profile, status=200, duration=45ms
[2024-01-15 08:32:01] WARN com.example.cache.RedisCache - Cache miss for key: user_profile_12345
[2024-01-15 08:32:02] INFO com.example.database.UserRepository - Database query executed: SELECT * FROM users WHERE id = ?, duration=120ms
[2024-01-15 08:32:03] INFO com.example.cache.RedisCache - Cache populated for key: user_profile_12345
[2024-01-15 08:35:12] ERROR com.example.api.OrderController - Failed to process order: order_id=ORD-2024-001, error=Payment gateway timeout
[2024-01-15 08:35:13] WARN com.example.messaging.RabbitMQClient - Message delivery failed, retrying in 5 seconds
[2024-01-15 08:35:18] INFO com.example.messaging.RabbitMQClient - Message delivered successfully after retry
[2024-01-15 08:40:05] INFO com.example.monitoring.HealthCheck - Health check passed: database=OK, cache=OK, messaging=OK
[2024-01-15 08:40:06] INFO com.example.metrics.MetricsCollector - Metrics collected: active_users=1250, requests_per_second=45.2
[2024-01-15 08:45:30] ERROR com.example.external.PaymentGateway - Payment gateway connection failed: Connection refused
[2024-01-15 08:45:31] ERROR com.example.api.PaymentController - Payment processing failed for transaction TXN-789012
[2024-01-15 08:45:32] WARN com.example.circuitbreaker.PaymentCircuitBreaker - Circuit breaker opened for payment service
[2024-01-15 08:50:15] INFO com.example.batch.BatchProcessor - Batch job started: job_id=BATCH-001, records=10000
[2024-01-15 08:50:45] INFO com.example.batch.BatchProcessor - Batch job progress: processed=2500/10000 (25%)
[2024-01-15 08:51:15] INFO com.example.batch.BatchProcessor - Batch job progress: processed=5000/10000 (50%)
[2024-01-15 08:51:45] INFO com.example.batch.BatchProcessor - Batch job progress: processed=7500/10000 (75%)
[2024-01-15 08:52:10] INFO com.example.batch.BatchProcessor - Batch job completed: job_id=BATCH-001, duration=175s, success=true
[2024-01-15 09:00:00] INFO com.example.scheduled.CleanupJob - Scheduled cleanup job started
[2024-01-15 09:00:05] INFO com.example.scheduled.CleanupJob - Cleaned up 150 expired sessions
[2024-01-15 09:00:10] INFO com.example.scheduled.CleanupJob - Removed 25 temporary files
[2024-01-15 09:00:15] INFO com.example.scheduled.CleanupJob - Cleanup job completed successfully
[2024-01-15 09:15:22] WARN com.example.database.ConnectionPool - Connection pool nearing capacity: active=8/10
[2024-01-15 09:15:23] INFO com.example.database.ConnectionPool - Connection pool expanded to 15 connections
[2024-01-15 09:30:45] ERROR com.example.api.ProductController - Database query timeout: query=SELECT * FROM products WHERE category = 'electronics'
[2024-01-15 09:30:46] WARN com.example.database.ConnectionPool - Connection pool exhausted, rejecting request
[2024-01-15 09:30:47] ERROR com.example.api.ProductController - Service temporarily unavailable, status=503
[2024-01-15 09:45:12] INFO com.example.monitoring.AlertManager - Alert triggered: High CPU usage detected (85%)
[2024-01-15 09:45:13] INFO com.example.autoscaling.ScaleManager - Auto-scaling initiated: increasing instances from 3 to 5
[2024-01-15 10:00:00] INFO com.example.backup.BackupService - Database backup started
[2024-01-15 10:05:30] INFO com.example.backup.BackupService - Database backup completed: size=2.3GB, duration=330s
[2024-01-15 10:30:15] WARN com.example.security.SecurityFilter - Suspicious activity detected: multiple failed login attempts from IP 10.0.0.50
[2024-01-15 10:30:16] INFO com.example.security.SecurityFilter - IP 10.0.0.50 blocked for 15 minutes
[2024-01-15 11:00:00] INFO com.example.reporting.ReportGenerator - Daily report generation started
[2024-01-15 11:05:00] INFO com.example.reporting.ReportGenerator - Daily report completed: transactions=15420, revenue=$125,430.50
[2024-01-15 12:00:00] ERROR com.example.external.APIGateway - External API rate limit exceeded: retrying in 60 seconds
[2024-01-15 12:01:00] INFO com.example.external.APIGateway - External API connection restored
[2024-01-15 13:15:30] CRITICAL com.example.system.SystemMonitor - Disk space critical: /var/log usage at 95%
[2024-01-15 13:15:31] INFO com.example.maintenance.LogRotation - Emergency log rotation initiated
[2024-01-15 13:15:35] INFO com.example.maintenance.LogRotation - Log rotation completed: freed 2.1GB space
[2024-01-15 14:00:00] INFO com.example.metrics.PerformanceMonitor - Performance baseline established: avg_response_time=245ms, throughput=1200 req/sec
[2024-01-15 15:30:45] WARN com.example.cache.RedisCache - Cache cluster node down: redis-node-03
[2024-01-15 15:30:46] INFO com.example.cache.RedisCache - Failover initiated to redis-node-04
[2024-01-15 16:45:12] ERROR com.example.messaging.MessageProcessor - Message processing failed: invalid message format
[2024-01-15 16:45:13] INFO com.example.messaging.DeadLetterQueue - Message moved to dead letter queue
[2024-01-15 17:00:00] INFO com.example.backup.BackupService - Full system backup started
[2024-01-15 17:30:00] INFO com.example.backup.BackupService - Full system backup completed: size=45.2GB, duration=1800s
[2024-01-15 18:00:00] INFO com.example.monitoring.HealthCheck - Evening health check: all systems operational
[2024-01-15 18:00:01] INFO com.example.metrics.MetricsCollector - End of day metrics: total_requests=125000, error_rate=0.02%, avg_response_time=198ms
# Additional realistic log patterns for testing
[2024-01-15 08:15:30] INFO nginx: 192.168.1.100 - - [15/Jan/2024:08:15:30 +0000] "GET /api/health HTTP/1.1" 200 123 "-" "curl/7.68.0"
[2024-01-15 08:15:31] INFO nginx: 192.168.1.101 - - [15/Jan/2024:08:15:31 +0000] "POST /api/login HTTP/1.1" 200 456 "-" "Mozilla/5.0 ..."
[2024-01-15 08:15:32] WARN nginx: 192.168.1.102 - - [15/Jan/2024:08:15:32 +0000] "GET /api/admin HTTP/1.1" 403 234 "-" "Mozilla/5.0 ..."
[2024-01-15 09:20:15] INFO systemd: Started Session 1234 of user john.doe
[2024-01-15 09:20:16] INFO systemd: Started User Manager for UID 1000
[2024-01-15 09:20:17] INFO systemd: Started Session 1235 of user jane.smith
[2024-01-15 10:45:30] WARN kernel: [ 1234.567890] CPU0: Core temperature above threshold, cpu clock throttled
[2024-01-15 10:45:31] INFO kernel: [ 1234.678901] CPU0: Temperature back to normal
[2024-01-15 14:30:15] ERROR postgres: FATAL: password authentication failed for user "app_user"
[2024-01-15 14:30:16] ERROR postgres: FATAL: password authentication failed for user "app_user"
[2024-01-15 14:30:17] WARN postgres: too many connections for role "app_user"
[2024-01-15 16:15:45] INFO rabbitmq: accepting AMQP connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672)
[2024-01-15 16:15:46] INFO rabbitmq: connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672): user 'app_user' authenticated
[2024-01-15 16:15:47] WARN rabbitmq: connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672): missed heartbeats from client, timeout: 60s
observability-stack/scenarios/incident_simulation.sh  +318
@@ -0,0 +1,318 @@
#!/bin/bash
# Enterprise Incident Simulation Script
# Simulates various failure scenarios for testing observability stack

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$(dirname "$SCRIPT_DIR")")"
LOG_FILE="$PROJECT_ROOT/observability-stack/logs/incident_simulation.log"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Logging function
log() {
    local level=$1
    local message=$2
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] $level $message" >> "$LOG_FILE"
    echo -e "${BLUE}[$timestamp]${NC} $level $message"
}

# Function to simulate CPU spike
simulate_cpu_spike() {
    local duration=${1:-60}
    local -a PIDS=()
    log "INFO" "Starting CPU spike simulation for ${duration} seconds"
    # Launch CPU-intensive processes
    for i in {1..4}; do
        (
            end_time=$((SECONDS + duration))
            while [ $SECONDS -lt $end_time ]; do
                # CPU-intensive calculation
                result=0
                for j in {1..100000}; do
                    result=$((result + j))
                done
            done
        ) &
        PIDS[$i]=$!
    done
    # Wait for simulation to complete
    for pid in "${PIDS[@]}"; do
        wait "$pid" 2>/dev/null || true
    done
    log "INFO" "CPU spike simulation completed"
}

# Function to simulate memory leak
simulate_memory_leak() {
    local duration=${1:-30}
    log "INFO" "Starting memory leak simulation for ${duration} seconds"
    # Create a process that gradually consumes memory
    (
        data=""
        end_time=$((SECONDS + duration))
        while [ $SECONDS -lt $end_time ]; do
            # Append ~100 KB per iteration so memory usage grows measurably
            data="${data}$(printf 'X%.0s' {1..100000})"
            sleep 0.1
        done
    ) &
    MEM_PID=$!
    wait $MEM_PID 2>/dev/null || true
    log "INFO" "Memory leak simulation completed"
}

# Function to simulate disk space exhaustion
simulate_disk_full() {
    local target_dir=${1:-"/tmp"}
    local duration=${2:-30}
    log "INFO" "Starting disk space exhaustion simulation in ${target_dir} for ${duration} seconds"
    # Create large files to fill disk space
    (
        end_time=$((SECONDS + duration))
        while [ $SECONDS -lt $end_time ]; do
            # Create 100MB file
            dd if=/dev/zero of="${target_dir}/incident_test_file_$(date +%s).tmp" bs=1M count=100 2>/dev/null || true
            sleep 2
        done
    ) &
    DISK_PID=$!
    wait $DISK_PID 2>/dev/null || true
    # Cleanup test files
    rm -f "${target_dir}"/incident_test_file_*.tmp 2>/dev/null || true
    log "INFO" "Disk space exhaustion simulation completed and cleaned up"
}

# Function to simulate network issues
simulate_network_issues() {
    local interface=${1:-"lo"}
    local duration=${2:-20}
    log "INFO" "Starting network issues simulation on ${interface} for ${duration} seconds"
    # Add network delay and packet loss
    sudo tc qdisc add dev "$interface" root netem delay 100ms 50ms loss 10% 2>/dev/null || true
    sleep "$duration"
    # Remove network simulation
    sudo tc qdisc del dev "$interface" root 2>/dev/null || true
    log "INFO" "Network issues simulation completed"
}

# Function to simulate service crashes
simulate_service_crash() {
    local service_name=${1:-"test-service"}
    log "INFO" "Starting service crash simulation for ${service_name}"
    # Simulate service going down
    log "ERROR" "Service ${service_name} crashed unexpectedly"
    sleep 5
    log "INFO" "Service ${service_name} restarted automatically"
    # Simulate multiple crashes
    for i in {1..3}; do
        sleep 2
        log "ERROR" "Service ${service_name} crashed again (attempt $i)"
        sleep 1
        log "INFO" "Service ${service_name} recovered after crash $i"
    done
}

# Function to simulate database issues
simulate_database_issues() {
    local duration=${1:-25}
    log "INFO" "Starting database issues simulation for ${duration} seconds"
    # Simulate connection pool exhaustion
    log "WARN" "Database connection pool nearing capacity"
    sleep 5
    log "ERROR" "Database connection pool exhausted"
    sleep 5
    log "ERROR" "Database query timeout occurred"
    sleep 5
    log "WARN" "Database connections recovering"
    sleep 5
    log "INFO" "Database connections restored"
    log "INFO" "Database issues simulation completed"
}

# Function to simulate application errors
simulate_application_errors() {
    local error_count=${1:-10}
    log "INFO" "Starting application error simulation (${error_count} errors)"
    # Brace expansion cannot use a variable, so use an arithmetic loop
    for ((i = 1; i <= error_count; i++)); do
        case $((RANDOM % 4)) in
            0)
                log "ERROR" "NullPointerException in UserService.getUser($i)"
                ;;
            1)
                log "ERROR" "TimeoutException: Database query timed out for user ID: $i"
                ;;
            2)
                log "ERROR" "ValidationException: Invalid input data for request $i"
                ;;
            3)
                log "ERROR" "IOException: Failed to write to log file"
                ;;
        esac
        sleep $((RANDOM % 3 + 1))
    done
    log "INFO" "Application error simulation completed"
}

# Function to run comprehensive incident scenario
run_comprehensive_scenario() {
    log "INFO" "Starting comprehensive incident scenario simulation"
    # Phase 1: Initial system stress
    log "INFO" "Phase 1: System stress simulation"
    simulate_cpu_spike 30 &
    CPU_PID=$!
    simulate_memory_leak 20 &
    MEM_PID=$!
    sleep 10
    # Phase 2: Service degradation
    log "INFO" "Phase 2: Service degradation simulation"
    simulate_service_crash "web-service" &
    SERVICE_PID=$!
    sleep 5
    # Phase 3: Database issues
    log "INFO" "Phase 3: Database issues simulation"
    simulate_database_issues 15 &
    DB_PID=$!
    # Phase 4: Application errors
    log "INFO" "Phase 4: Application error burst"
    simulate_application_errors 15 &
    APP_PID=$!
    # Phase 5: Infrastructure issues
    log "INFO" "Phase 5: Infrastructure issues simulation"
    simulate_disk_full "/tmp" 10 &
    DISK_PID=$!
    # Wait for all simulations to complete
    wait $CPU_PID 2>/dev/null || true
    wait $MEM_PID 2>/dev/null || true
    wait $SERVICE_PID 2>/dev/null || true
    wait $DB_PID 2>/dev/null || true
    wait $APP_PID 2>/dev/null || true
    wait $DISK_PID 2>/dev/null || true
    log "INFO" "Comprehensive incident scenario completed"
}

# Function to show usage
show_usage() {
    echo "Enterprise Incident Simulation Script"
    echo "Usage: $0 [SCENARIO] [OPTIONS]"
    echo ""
    echo "SCENARIOS:"
    echo "  cpu [DURATION]                  - Simulate CPU spike (default: 60s)"
    echo "  memory [DURATION]               - Simulate memory leak (default: 30s)"
    echo "  disk [DIR] [DURATION]           - Simulate disk space exhaustion (default: /tmp, 30s)"
    echo "  network [INTERFACE] [DURATION]  - Simulate network issues (default: lo, 20s)"
    echo "  service [NAME]                  - Simulate service crashes (default: test-service)"
    echo "  database [DURATION]             - Simulate database issues (default: 25s)"
    echo "  app-errors [COUNT]              - Simulate application errors (default: 10)"
    echo "  comprehensive                   - Run full incident scenario"
    echo "  all                             - Run all individual scenarios sequentially"
    echo ""
    echo "EXAMPLES:"
    echo "  $0 cpu 120            - CPU spike for 2 minutes"
    echo "  $0 disk /var/log 45   - Disk full simulation in /var/log for 45 seconds"
    echo "  $0 comprehensive      - Full incident simulation"
    echo ""
}

# Main execution
main() {
    local scenario=${1:-"comprehensive"}
    # Create log directory if it doesn't exist
    mkdir -p "$(dirname "$LOG_FILE")"
    log "INFO" "Incident simulation script started"
    log "INFO" "Scenario: $scenario"
    case $scenario in
        "cpu")
            simulate_cpu_spike "${2:-60}"
            ;;
        "memory")
            simulate_memory_leak "${2:-30}"
            ;;
        "disk")
            simulate_disk_full "${2:-/tmp}" "${3:-30}"
            ;;
        "network")
            simulate_network_issues "${2:-lo}" "${3:-20}"
            ;;
        "service")
            simulate_service_crash "${2:-test-service}"
            ;;
        "database")
            simulate_database_issues "${2:-25}"
            ;;
        "app-errors")
            simulate_application_errors "${2:-10}"
            ;;
        "comprehensive")
            run_comprehensive_scenario
            ;;
        "all")
            log "INFO" "Running all scenarios sequentially"
            simulate_cpu_spike 30
            sleep 5
            simulate_memory_leak 20
            sleep 5
            simulate_disk_full "/tmp" 15
            sleep 5
            simulate_service_crash "test-service"
            sleep 5
            simulate_database_issues 15
            sleep 5
            simulate_application_errors 8
            sleep 5
            simulate_network_issues "lo" 10
            ;;
        "help"|"-h"|"--help")
            show_usage
            exit 0
            ;;
        *)
            echo -e "${RED}Error: Unknown scenario '$scenario'${NC}"
            echo ""
            show_usage
            exit 1
            ;;
    esac
    log "INFO" "Incident simulation script completed successfully"
    echo -e "${GREEN}Simulation completed. Check logs at: $LOG_FILE${NC}"
}

# Run main function with all arguments
main "$@"