feat: Add comprehensive enterprise Linux infrastructure portfolio with Ansible, Python, and ELK stack
Mateusz Suski
2026-04-29 23:14:14 +00:00
parent 2313efac88
commit 7757020014
33 changed files with 6165 additions and 0 deletions
observability-stack/README.md  +461
@@ -0,0 +1,461 @@
# Observability Stack
A comprehensive monitoring and logging stack for enterprise infrastructure observability using the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.
## Overview
The Observability Stack provides a complete monitoring solution with:
- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures
## Architecture
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Log Sources   │    │    Logstash     │    │  Elasticsearch  │
│  (Applications  │───►│  (Ingestion &   │───►│   (Storage &    │
│   / Systems)    │    │   Processing)   │    │   Analytics)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Alerting     │    │     Kibana      │    │     Grafana     │
│     Rules       │    │  (Dashboards &  │    │   (Metrics &    │
│                 │    │   Exploration)  │    │   Dashboards)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
## Quick Start
### Prerequisites
- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available
### Setup
```bash
cd observability-stack
# Start the observability stack
docker-compose up -d
# Wait for services to become healthy (typically 2-3 minutes)
sleep 180
# Verify services are running (Elasticsearch and Kibana require basic auth)
curl -u elastic:elastic -X GET "localhost:9200/_cluster/health?pretty"
curl -u elastic:elastic -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```
### Access Interfaces
- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200
## Project Structure
```
observability-stack/
├── docker-compose.yml             # Service orchestration
├── logstash/                      # Logstash configuration
│   ├── pipeline/                  # Processing pipelines
│   └── config/                    # Logstash settings
├── elasticsearch/                 # Elasticsearch configuration
│   └── config/                    # Cluster settings
├── kibana/                        # Kibana configuration
│   └── config/                    # Dashboard settings
├── grafana/                       # Grafana configuration
│   ├── provisioning/              # Dashboards and datasources
│   └── dashboards/                # Dashboard definitions
├── filebeat/                      # Filebeat configuration
│   └── config/                    # Filebeat settings (mounted in docker-compose.yml)
├── logs/                          # Sample log data
│   └── sample.log                 # Realistic application logs
├── alerting/                      # Alert configuration
│   └── alert_rules.yml            # Alert definitions
├── scenarios/                     # Incident simulation
│   └── incident_simulation.sh     # Simulation scripts
└── README.md
```
## Services Configuration
### Elasticsearch
**Configuration**: `elasticsearch/config/elasticsearch.yml`
Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access
**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data
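A minimal index template can keep `logs-*` mappings consistent across daily indices. The sketch below is illustrative and not part of the shipped configuration:
```bash
# Create an index template for the logs-* indices (illustrative example)
curl -u elastic:elastic -X PUT "localhost:9200/_index_template/logs-template" \
  -H "Content-Type: application/json" \
  -d '{
        "index_patterns": ["logs-*"],
        "template": {
          "settings": {"number_of_shards": 1, "number_of_replicas": 0},
          "mappings": {
            "properties": {
              "@timestamp": {"type": "date"},
              "level":      {"type": "keyword"}
            }
          }
        }
      }'
```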
### Logstash
**Pipelines**: `logstash/pipeline/`
- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing
**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
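As a minimal sketch of one such pipeline (the file name `application_logs.conf` is hypothetical; the ports mirror `docker-compose.yml`, and the grok pattern assumes the bracketed format used in `logs/sample.log`):
```conf
# logstash/pipeline/application_logs.conf (sketch)
input {
  # HTTP input matching the 8080 port exposed in docker-compose.yml
  http { port => 8080 }
  # Beats input for Filebeat agents (port 5044)
  beats { port => 5044 }
}
filter {
  # Parse "[2024-01-15 08:30:15] INFO com.example.app.Application - message"
  grok {
    match => { "message" => "\[%{TIMESTAMP_ISO8601:timestamp}\] %{LOGLEVEL:level} %{NOTSPACE:logger} - %{GREEDYDATA:msg}" }
  }
  date { match => ["timestamp", "yyyy-MM-dd HH:mm:ss"] }
}
output {
  elasticsearch {
    hosts    => ["http://elasticsearch:9200"]
    user     => "elastic"
    password => "elastic"
    index    => "logs-%{+YYYY.MM.dd}"
  }
}
```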
### Kibana
**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard
**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting
### Grafana
**Data Sources**:
- Elasticsearch for logs and metrics (provisioning sketch below)
- Prometheus (if available)
- InfluxDB for time-series data
**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
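The Elasticsearch data source can be provisioned declaratively. A sketch of a hypothetical `grafana/provisioning/datasources/elasticsearch.yml` (keys follow Grafana's provisioning schema; exact field names vary slightly across versions):
```yaml
apiVersion: 1
datasources:
  - name: Elasticsearch-Logs
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200
    basicAuth: true
    basicAuthUser: elastic
    secureJsonData:
      basicAuthPassword: elastic
    jsonData:
      index: "logs-*"          # index pattern to query
      timeField: "@timestamp"  # field used for time filtering
```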
## Log Ingestion
### Sample Data
The stack includes realistic sample logs for testing:
```bash
# Ingest sample logs line-by-line into the Logstash HTTP input
# (one POST per log line, so each line becomes one event)
while IFS= read -r line; do
  curl -s -X POST "localhost:8080" \
    -H "Content-Type: text/plain" \
    --data "$line"
done < logs/sample.log
```
### Log Formats Supported
- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
### Data Enrichment
Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
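A filter sketch showing these enrichment steps (the `clientip` and `agent` field names assume an access log already parsed by grok):
```conf
filter {
  # Add GeoIP location data based on the client address
  geoip { source => "clientip" }
  # Parse the raw user-agent string into browser/OS fields
  useragent { source => "agent" target => "ua" }
  # Normalize the original timestamp into @timestamp
  date { match => ["timestamp", "dd/MMM/yyyy:HH:mm:ss Z"] }
  # Attach host metadata
  mutate { add_field => { "environment" => "demo" } }
}
```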
## Alerting and Monitoring
### Alert Rules
Located in `alerting/alert_rules.yml`:
```yaml
alert_rules:
  - name: "High CPU Usage"
    condition: "cpu_usage > 90"
    duration: "5m"
    severity: "critical"
    channels: ["email", "slack"]
  - name: "Disk Space Low"
    condition: "disk_usage > 85"
    duration: "10m"
    severity: "warning"
    channels: ["email"]
  - name: "Service Down"
    condition: "service_status == 'down'"
    duration: "2m"
    severity: "critical"
    channels: ["email", "pagerduty"]
```
### Alert Channels
- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints
## Incident Simulation
### Available Scenarios
```bash
cd scenarios
# Simulate disk space exhaustion (target directory, duration in seconds)
./incident_simulation.sh disk /tmp 30
# Simulate service failure
./incident_simulation.sh service nginx
# Simulate network latency (interface, duration in seconds)
./incident_simulation.sh network lo 20
# Simulate high CPU usage
./incident_simulation.sh cpu 60
```
### Scenario Types
- **cpu**: CPU utilization spikes
- **memory**: Gradual memory consumption growth
- **disk**: Filesystem capacity exhaustion
- **network**: Network latency and packet loss
- **service**: Application service crashes and restarts
- **database**: Connection-pool exhaustion and query timeouts
- **app-errors**: Bursts of application error logs
- **comprehensive**: Multi-phase scenario combining the above
## Dashboards and Visualization
### Kibana Dashboards
Pre-configured dashboards for:
1. **Log Analysis**
- Log volume over time
- Error rate trends
- Top error messages
- Geographic request distribution
2. **System Monitoring**
- CPU and memory usage
- Disk I/O statistics
- Network traffic
- System load averages
3. **Application Performance**
- Response time distributions
- Request rate metrics
- Error percentages
- User session analytics
### Grafana Dashboards
Advanced visualization panels:
- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting
## API Endpoints
### Elasticsearch APIs
```bash
# Cluster health
GET /_cluster/health
# Index statistics
GET /_cat/indices?v
# Search logs
GET /logs-*/_search
{
"query": {
"match": {
"message": "ERROR"
}
}
}
```
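The same search as a shell command against this stack (note the basic-auth credentials):
```bash
curl -u elastic:elastic -X GET "localhost:9200/logs-*/_search" \
  -H "Content-Type: application/json" \
  -d '{"query": {"match": {"message": "ERROR"}}}'
```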
### Kibana APIs
```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard
# Export visualizations
GET /api/saved_objects/visualization/{id}
```
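For example, listing dashboards from the shell (Kibana accepts the same credentials once Elasticsearch security is enabled):
```bash
curl -u elastic:elastic "localhost:5601/api/saved_objects/_find?type=dashboard"
```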
### Grafana APIs
```bash
# Get dashboard list
GET /api/search?query=*
# Alert rules (unified alerting provisioning API, Grafana 9+)
GET /api/v1/provisioning/alert-rules
```
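With this stack's default admin credentials, for example:
```bash
# List all dashboards known to Grafana
curl -u admin:admin "localhost:3000/api/search"
```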
## Configuration Management
### Environment Variables
```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"
# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"
# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```
### Scaling Configuration
For production deployment (note that `deploy` settings take effect under Docker Swarm via `docker stack deploy`, not plain `docker-compose up`):
```yaml
version: '3.8'
services:
  elasticsearch:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: '2.0'
```
## Security Considerations
### Authentication
- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled
### Network Security
- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)
### Data Protection
- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
## Troubleshooting
### Common Issues
**Elasticsearch Won't Start:**
```bash
# Check memory allocation
docker-compose logs elasticsearch
# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```
**Logstash Pipeline Errors:**
```bash
# Check pipeline configuration
docker-compose logs logstash
# Validate pipeline syntax (use the full binary path inside the container)
docker-compose exec logstash /usr/share/logstash/bin/logstash -t \
  -f /usr/share/logstash/pipeline/ --path.settings /usr/share/logstash/config
```
**Kibana Connection Issues:**
```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"
# Check Kibana logs
docker-compose logs kibana
```
### Performance Tuning
**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization
**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues
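These knobs map to settings in `logstash/config/logstash.yml`; the values below are illustrative starting points, not shipped defaults:
```yaml
pipeline.workers: 4        # parallel filter/output workers
pipeline.batch.size: 250   # events per worker batch
queue.type: persisted      # durable on-disk queue instead of in-memory
queue.max_bytes: 1gb       # cap for the persistent queue
```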
**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
## Development and Testing
### Adding New Dashboards
1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service
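Step 2 refers to the dashboard provider file; a sketch of a hypothetical `grafana/provisioning/dashboards/dashboards.yml` matching the volume mounts in `docker-compose.yml`:
```yaml
apiVersion: 1
providers:
  - name: 'observability'
    folder: 'Observability'
    type: file
    updateIntervalSeconds: 30
    options:
      # Matches the ./grafana/dashboards bind mount in docker-compose.yml
      path: /var/lib/grafana/dashboards
```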
### Custom Alert Rules
1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios
### Log Pipeline Development
1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana
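Step 2 can be done by hand against the HTTP input (assuming port 8080 is an `http` input, as the compose health check implies):
```bash
# Send one sample line through the pipeline, then inspect it in Kibana
curl -s -X POST "localhost:8080" \
  -H "Content-Type: text/plain" \
  --data "[2024-01-15 08:30:15] INFO com.example.app.Application - pipeline test"
```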
## Backup and Recovery
### Data Backup
```bash
# Register a snapshot repository once (its location must be listed under
# path.repo in elasticsearch.yml; the path below is illustrative)
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup" \
  -H "Content-Type: application/json" \
  -d '{"type": "fs", "settings": {"location": "/usr/share/elasticsearch/backup"}}'
# Elasticsearch snapshot of all indices
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```
### Configuration Backup
```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
logstash/ elasticsearch/ kibana/ grafana/
```
## Contributing
1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility
## License
Enterprise Internal Use Only
observability-stack/alerting/alert_rules.yml  +326
@@ -0,0 +1,326 @@
# Enterprise Observability Alert Rules
# Alert definitions for automated incident detection and notification
alert_rules:
  # System Resource Alerts
  - name: "High CPU Usage"
    description: "CPU utilization exceeds threshold"
    condition: "cpu_usage_percent > 90"
    duration: "5m"
    severity: "critical"
    tags:
      - system
      - performance
    channels:
      - email
      - slack
    labels:
      team: "platform"
      component: "system"

  - name: "High Memory Usage"
    description: "Memory utilization exceeds threshold"
    condition: "memory_usage_percent > 85"
    duration: "3m"
    severity: "warning"
    tags:
      - system
      - memory
    channels:
      - email
    labels:
      team: "platform"
      component: "system"

  - name: "Disk Space Critical"
    description: "Disk usage exceeds critical threshold"
    condition: "disk_usage_percent > 95"
    duration: "2m"
    severity: "critical"
    tags:
      - storage
      - disk
    channels:
      - email
      - pagerduty
    labels:
      team: "platform"
      component: "storage"

  - name: "Disk Space Warning"
    description: "Disk usage exceeds warning threshold"
    condition: "disk_usage_percent > 85"
    duration: "10m"
    severity: "warning"
    tags:
      - storage
      - disk
    channels:
      - email
    labels:
      team: "platform"
      component: "storage"

  # Service Availability Alerts
  - name: "Service Down"
    description: "Critical service is not responding"
    condition: "service_status == 'down' OR http_status_code >= 500"
    duration: "2m"
    severity: "critical"
    tags:
      - service
      - availability
    channels:
      - email
      - slack
      - pagerduty
    labels:
      team: "application"
      component: "service"

  - name: "Database Connection Failed"
    description: "Database connection pool exhausted or unresponsive"
    condition: "db_connections_active == 0 OR db_response_time > 5000"
    duration: "1m"
    severity: "critical"
    tags:
      - database
      - connectivity
    channels:
      - email
      - pagerduty
    labels:
      team: "database"
      component: "postgresql"

  - name: "Cache Unavailable"
    description: "Cache service is down or unresponsive"
    condition: "cache_hit_ratio < 0.1 OR cache_response_time > 1000"
    duration: "3m"
    severity: "warning"
    tags:
      - cache
      - performance
    channels:
      - email
    labels:
      team: "infrastructure"
      component: "redis"

  # Application Performance Alerts
  - name: "High Error Rate"
    description: "Application error rate exceeds threshold"
    condition: "error_rate_percent > 5"
    duration: "5m"
    severity: "critical"
    tags:
      - application
      - errors
    channels:
      - email
      - slack
    labels:
      team: "application"
      component: "api"

  - name: "Slow Response Time"
    description: "API response time exceeds SLA"
    condition: "response_time_p95 > 2000"
    duration: "5m"
    severity: "warning"
    tags:
      - application
      - performance
    channels:
      - email
    labels:
      team: "application"
      component: "api"

  - name: "High Request Queue"
    description: "Request queue depth is too high"
    condition: "queue_depth > 100"
    duration: "3m"
    severity: "warning"
    tags:
      - application
      - queue
    channels:
      - email
    labels:
      team: "application"
      component: "queue"

  # Infrastructure Alerts
  - name: "Network Latency High"
    description: "Network round-trip time exceeds threshold"
    condition: "network_rtt > 100"
    duration: "5m"
    severity: "warning"
    tags:
      - network
      - latency
    channels:
      - email
    labels:
      team: "network"
      component: "infrastructure"

  - name: "Load Balancer Unhealthy"
    description: "Load balancer backend servers are unhealthy"
    condition: "lb_unhealthy_backends > 0"
    duration: "2m"
    severity: "critical"
    tags:
      - loadbalancer
      - availability
    channels:
      - email
      - pagerduty
    labels:
      team: "infrastructure"
      component: "loadbalancer"

  # Security Alerts
  - name: "Failed Login Attempts"
    description: "Multiple failed authentication attempts detected"
    condition: "failed_login_attempts > 5"
    duration: "5m"
    severity: "warning"
    tags:
      - security
      - authentication
    channels:
      - email
      - slack
    labels:
      team: "security"
      component: "authentication"

  - name: "Suspicious Network Traffic"
    description: "Unusual network traffic patterns detected"
    condition: "network_bytes_unusual > 1000000"
    duration: "10m"
    severity: "warning"
    tags:
      - security
      - network
    channels:
      - email
    labels:
      team: "security"
      component: "network"

  # Log-based Alerts
  - name: "Application Errors"
    description: "High volume of application error logs"
    condition: "log_errors_per_minute > 10"
    duration: "2m"
    severity: "warning"
    tags:
      - logs
      - errors
    channels:
      - email
    labels:
      team: "application"
      component: "logs"

  - name: "Out of Memory Errors"
    description: "Out of memory errors detected in logs"
    condition: "log_oom_errors > 0"
    duration: "1m"
    severity: "critical"
    tags:
      - memory
      - errors
    channels:
      - email
      - pagerduty
    labels:
      team: "application"
      component: "memory"

  # Business Logic Alerts
  - name: "Low Business Transactions"
    description: "Business transaction volume below expected threshold"
    condition: "business_transactions_per_hour < 100"
    duration: "15m"
    severity: "warning"
    tags:
      - business
      - transactions
    channels:
      - email
    labels:
      team: "business"
      component: "transactions"

  - name: "Payment Failures"
    description: "Payment processing failure rate is high"
    condition: "payment_failure_rate > 0.05"
    duration: "5m"
    severity: "critical"
    tags:
      - payments
      - business
    channels:
      - email
      - pagerduty
    labels:
      team: "payments"
      component: "processing"

# Alert Channels Configuration
alert_channels:
  email:
    type: "email"
    recipients:
      - "platform-team@company.com"
      - "oncall@company.com"
    subject_template: "[{{severity}}] {{name}} - {{description}}"
  slack:
    type: "slack"
    webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    channel: "#alerts"
    username: "Observability Bot"
    icon_emoji: ":warning:"
  pagerduty:
    type: "pagerduty"
    integration_key: "your-pagerduty-integration-key"
    severity_mapping:
      critical: "critical"
      warning: "warning"
      info: "info"

# Alert Silencing Rules
silence_rules:
  - name: "Maintenance Window"
    condition: "maintenance_window == true"
    duration: "4h"
    comment: "Silenced during scheduled maintenance"
  - name: "Known Issue"
    condition: "known_issue_id == 'TICKET-123'"
    duration: "24h"
    comment: "Silenced for known issue resolution"

# Escalation Policies
escalation_policies:
  - name: "Default Escalation"
    steps:
      - delay: "5m"
        channels: ["email"]
      - delay: "15m"
        channels: ["slack"]
      - delay: "30m"
        channels: ["pagerduty"]
  - name: "Critical Escalation"
    steps:
      - delay: "0m"
        channels: ["email", "slack", "pagerduty"]
      - delay: "10m"
        channels: ["pagerduty"]  # Escalation
observability-stack/docker-compose.yml  +122
@@ -0,0 +1,122 @@
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: observability-elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=elastic
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
      - ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      # Security is enabled, so the health endpoint requires credentials
      test: ["CMD-SHELL", "curl -fu elastic:elastic http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: observability-logstash
    environment:
      - "LS_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logs:/usr/share/logstash/logs
    ports:
      - "5044:5044"
      - "8080:8080"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8080 || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: observability-kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      # Kibana rejects the "elastic" superuser for its own connection;
      # use the built-in kibana_system account (set its password first)
      - ELASTICSEARCH_USERNAME=kibana_system
      - ELASTICSEARCH_PASSWORD=elastic
    volumes:
      - ./kibana/config/kibana.yml:/usr/share/kibana/config/kibana.yml
    ports:
      - "5601:5601"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5601/api/status || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  grafana:
    image: grafana/grafana:10.2.0
    container_name: observability-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    container_name: observability-filebeat
    user: root
    volumes:
      - ./filebeat/config/filebeat.yml:/usr/share/filebeat/filebeat.yml
      - ./logs:/var/log/sample
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - observability
    depends_on:
      - logstash
    restart: unless-stopped

volumes:
  elasticsearch_data:
    driver: local
  grafana_data:
    driver: local

networks:
  observability:
    driver: bridge
    ipam:
      config:
        - subnet: 172.25.0.0/16
observability-stack/logs/incident_simulation.log  +2
@@ -0,0 +1,2 @@
[2026-04-29 22:52:26] INFO Incident simulation script started
[2026-04-29 22:52:26] INFO Scenario: help
observability-stack/logs/sample.log  +98
@@ -0,0 +1,98 @@
# Sample Application Logs for Observability Stack Testing
# Generated realistic log entries for demonstration and testing
[2024-01-15 08:30:15] INFO com.example.app.Application - Application started successfully on port 8080
[2024-01-15 08:30:16] INFO com.example.database.ConnectionPool - Database connection pool initialized with 10 connections
[2024-01-15 08:30:17] INFO com.example.cache.RedisCache - Redis cache connected successfully
[2024-01-15 08:30:18] INFO com.example.messaging.RabbitMQClient - Message queue connection established
[2024-01-15 08:31:22] INFO com.example.api.UserController - User login attempt: user=john.doe, ip=192.168.1.100
[2024-01-15 08:31:23] INFO com.example.auth.AuthService - Authentication successful for user john.doe
[2024-01-15 08:31:24] INFO com.example.api.UserController - API request: GET /api/users/profile, status=200, duration=45ms
[2024-01-15 08:32:01] WARN com.example.cache.RedisCache - Cache miss for key: user_profile_12345
[2024-01-15 08:32:02] INFO com.example.database.UserRepository - Database query executed: SELECT * FROM users WHERE id = ?, duration=120ms
[2024-01-15 08:32:03] INFO com.example.cache.RedisCache - Cache populated for key: user_profile_12345
[2024-01-15 08:35:12] ERROR com.example.api.OrderController - Failed to process order: order_id=ORD-2024-001, error=Payment gateway timeout
[2024-01-15 08:35:13] WARN com.example.messaging.RabbitMQClient - Message delivery failed, retrying in 5 seconds
[2024-01-15 08:35:18] INFO com.example.messaging.RabbitMQClient - Message delivered successfully after retry
[2024-01-15 08:40:05] INFO com.example.monitoring.HealthCheck - Health check passed: database=OK, cache=OK, messaging=OK
[2024-01-15 08:40:06] INFO com.example.metrics.MetricsCollector - Metrics collected: active_users=1250, requests_per_second=45.2
[2024-01-15 08:45:30] ERROR com.example.external.PaymentGateway - Payment gateway connection failed: Connection refused
[2024-01-15 08:45:31] ERROR com.example.api.PaymentController - Payment processing failed for transaction TXN-789012
[2024-01-15 08:45:32] WARN com.example.circuitbreaker.PaymentCircuitBreaker - Circuit breaker opened for payment service
[2024-01-15 08:50:15] INFO com.example.batch.BatchProcessor - Batch job started: job_id=BATCH-001, records=10000
[2024-01-15 08:50:45] INFO com.example.batch.BatchProcessor - Batch job progress: processed=2500/10000 (25%)
[2024-01-15 08:51:15] INFO com.example.batch.BatchProcessor - Batch job progress: processed=5000/10000 (50%)
[2024-01-15 08:51:45] INFO com.example.batch.BatchProcessor - Batch job progress: processed=7500/10000 (75%)
[2024-01-15 08:52:10] INFO com.example.batch.BatchProcessor - Batch job completed: job_id=BATCH-001, duration=175s, success=true
[2024-01-15 09:00:00] INFO com.example.scheduled.CleanupJob - Scheduled cleanup job started
[2024-01-15 09:00:05] INFO com.example.scheduled.CleanupJob - Cleaned up 150 expired sessions
[2024-01-15 09:00:10] INFO com.example.scheduled.CleanupJob - Removed 25 temporary files
[2024-01-15 09:00:15] INFO com.example.scheduled.CleanupJob - Cleanup job completed successfully
[2024-01-15 09:15:22] WARN com.example.database.ConnectionPool - Connection pool nearing capacity: active=8/10
[2024-01-15 09:15:23] INFO com.example.database.ConnectionPool - Connection pool expanded to 15 connections
[2024-01-15 09:30:45] ERROR com.example.api.ProductController - Database query timeout: query=SELECT * FROM products WHERE category = 'electronics'
[2024-01-15 09:30:46] WARN com.example.database.ConnectionPool - Connection pool exhausted, rejecting request
[2024-01-15 09:30:47] ERROR com.example.api.ProductController - Service temporarily unavailable, status=503
[2024-01-15 09:45:12] INFO com.example.monitoring.AlertManager - Alert triggered: High CPU usage detected (85%)
[2024-01-15 09:45:13] INFO com.example.autoscaling.ScaleManager - Auto-scaling initiated: increasing instances from 3 to 5
[2024-01-15 10:00:00] INFO com.example.backup.BackupService - Database backup started
[2024-01-15 10:05:30] INFO com.example.backup.BackupService - Database backup completed: size=2.3GB, duration=330s
[2024-01-15 10:30:15] WARN com.example.security.SecurityFilter - Suspicious activity detected: multiple failed login attempts from IP 10.0.0.50
[2024-01-15 10:30:16] INFO com.example.security.SecurityFilter - IP 10.0.0.50 blocked for 15 minutes
[2024-01-15 11:00:00] INFO com.example.reporting.ReportGenerator - Daily report generation started
[2024-01-15 11:05:00] INFO com.example.reporting.ReportGenerator - Daily report completed: transactions=15420, revenue=$125,430.50
[2024-01-15 12:00:00] ERROR com.example.external.APIGateway - External API rate limit exceeded: retrying in 60 seconds
[2024-01-15 12:01:00] INFO com.example.external.APIGateway - External API connection restored
[2024-01-15 13:15:30] CRITICAL com.example.system.SystemMonitor - Disk space critical: /var/log usage at 95%
[2024-01-15 13:15:31] INFO com.example.maintenance.LogRotation - Emergency log rotation initiated
[2024-01-15 13:15:35] INFO com.example.maintenance.LogRotation - Log rotation completed: freed 2.1GB space
[2024-01-15 14:00:00] INFO com.example.metrics.PerformanceMonitor - Performance baseline established: avg_response_time=245ms, throughput=1200 req/sec
[2024-01-15 15:30:45] WARN com.example.cache.RedisCache - Cache cluster node down: redis-node-03
[2024-01-15 15:30:46] INFO com.example.cache.RedisCache - Failover initiated to redis-node-04
[2024-01-15 16:45:12] ERROR com.example.messaging.MessageProcessor - Message processing failed: invalid message format
[2024-01-15 16:45:13] INFO com.example.messaging.DeadLetterQueue - Message moved to dead letter queue
[2024-01-15 17:00:00] INFO com.example.backup.BackupService - Full system backup started
[2024-01-15 17:30:00] INFO com.example.backup.BackupService - Full system backup completed: size=45.2GB, duration=1800s
[2024-01-15 18:00:00] INFO com.example.monitoring.HealthCheck - Evening health check: all systems operational
[2024-01-15 18:00:01] INFO com.example.metrics.MetricsCollector - End of day metrics: total_requests=125000, error_rate=0.02%, avg_response_time=198ms
# Additional realistic log patterns for testing
[2024-01-15 08:15:30] INFO nginx: 192.168.1.100 - - [15/Jan/2024:08:15:30 +0000] "GET /api/health HTTP/1.1" 200 123 "-" "curl/7.68.0"
[2024-01-15 08:15:31] INFO nginx: 192.168.1.101 - - [15/Jan/2024:08:15:31 +0000] "POST /api/login HTTP/1.1" 200 456 "-" "Mozilla/5.0 ..."
[2024-01-15 08:15:32] WARN nginx: 192.168.1.102 - - [15/Jan/2024:08:15:32 +0000] "GET /api/admin HTTP/1.1" 403 234 "-" "Mozilla/5.0 ..."
[2024-01-15 09:20:15] INFO systemd: Started Session 1234 of user john.doe
[2024-01-15 09:20:16] INFO systemd: Started User Manager for UID 1000
[2024-01-15 09:20:17] INFO systemd: Started Session 1235 of user jane.smith
[2024-01-15 10:45:30] WARN kernel: [ 1234.567890] CPU0: Core temperature above threshold, cpu clock throttled
[2024-01-15 10:45:31] INFO kernel: [ 1234.678901] CPU0: Temperature back to normal
[2024-01-15 14:30:15] ERROR postgres: FATAL: password authentication failed for user "app_user"
[2024-01-15 14:30:16] ERROR postgres: FATAL: password authentication failed for user "app_user"
[2024-01-15 14:30:17] WARN postgres: too many connections for role "app_user"
[2024-01-15 16:15:45] INFO rabbitmq: accepting AMQP connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672)
[2024-01-15 16:15:46] INFO rabbitmq: connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672): user 'app_user' authenticated
[2024-01-15 16:15:47] WARN rabbitmq: connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672): missed heartbeats from client, timeout: 60s
observability-stack/scenarios/incident_simulation.sh  +318
@@ -0,0 +1,318 @@
#!/bin/bash
# Enterprise Incident Simulation Script
# Simulates various failure scenarios for testing observability stack

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$(dirname "$SCRIPT_DIR")")"
LOG_FILE="$PROJECT_ROOT/observability-stack/logs/incident_simulation.log"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Logging function
log() {
    local level=$1
    local message=$2
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] $level $message" >> "$LOG_FILE"
    echo -e "${BLUE}[$timestamp]${NC} $level $message"
}

# Function to simulate CPU spike
simulate_cpu_spike() {
    local duration=${1:-60}
    local -a PIDS=()
    log "INFO" "Starting CPU spike simulation for ${duration} seconds"
    # Launch CPU-intensive processes
    for i in {1..4}; do
        (
            end_time=$((SECONDS + duration))
            while [ $SECONDS -lt $end_time ]; do
                # CPU-intensive calculation
                result=0
                for j in {1..100000}; do
                    result=$((result + j))
                done
            done
        ) &
        PIDS[$i]=$!
    done
    # Wait for simulation to complete
    for pid in "${PIDS[@]}"; do
        wait "$pid" 2>/dev/null || true
    done
    log "INFO" "CPU spike simulation completed"
}

# Function to simulate memory leak
simulate_memory_leak() {
    local duration=${1:-30}
    log "INFO" "Starting memory leak simulation for ${duration} seconds"
    # Create a process that gradually consumes memory
    (
        data=""
        end_time=$((SECONDS + duration))
        while [ $SECONDS -lt $end_time ]; do
            # Append ~100 KB per iteration so memory usage grows measurably
            data="${data}$(printf 'X%.0s' {1..100000})"
            sleep 0.1
        done
    ) &
    MEM_PID=$!
    wait $MEM_PID 2>/dev/null || true
    log "INFO" "Memory leak simulation completed"
}

# Function to simulate disk space exhaustion
simulate_disk_full() {
    local target_dir=${1:-"/tmp"}
    local duration=${2:-30}
    log "INFO" "Starting disk space exhaustion simulation in ${target_dir} for ${duration} seconds"
    # Create large files to fill disk space
    (
        end_time=$((SECONDS + duration))
        while [ $SECONDS -lt $end_time ]; do
            # Create 100MB file
            dd if=/dev/zero of="${target_dir}/incident_test_file_$(date +%s).tmp" bs=1M count=100 2>/dev/null || true
            sleep 2
        done
    ) &
    DISK_PID=$!
    wait $DISK_PID 2>/dev/null || true
    # Cleanup test files
    rm -f "${target_dir}"/incident_test_file_*.tmp 2>/dev/null || true
    log "INFO" "Disk space exhaustion simulation completed and cleaned up"
}

# Function to simulate network issues
simulate_network_issues() {
    local interface=${1:-"lo"}
    local duration=${2:-20}
    log "INFO" "Starting network issues simulation on ${interface} for ${duration} seconds"
    # Add network delay and packet loss
    sudo tc qdisc add dev "$interface" root netem delay 100ms 50ms loss 10% 2>/dev/null || true
    sleep "$duration"
    # Remove network simulation
    sudo tc qdisc del dev "$interface" root 2>/dev/null || true
    log "INFO" "Network issues simulation completed"
}

# Function to simulate service crashes
simulate_service_crash() {
    local service_name=${1:-"test-service"}
    log "INFO" "Starting service crash simulation for ${service_name}"
    # Simulate service going down
    log "ERROR" "Service ${service_name} crashed unexpectedly"
    sleep 5
    log "INFO" "Service ${service_name} restarted automatically"
    # Simulate multiple crashes
    for i in {1..3}; do
        sleep 2
        log "ERROR" "Service ${service_name} crashed again (attempt $i)"
        sleep 1
        log "INFO" "Service ${service_name} recovered after crash $i"
    done
}

# Function to simulate database issues
simulate_database_issues() {
    local duration=${1:-25}
    log "INFO" "Starting database issues simulation for ${duration} seconds"
    # Simulate connection pool exhaustion
    log "WARN" "Database connection pool nearing capacity"
    sleep 5
    log "ERROR" "Database connection pool exhausted"
    sleep 5
    log "ERROR" "Database query timeout occurred"
    sleep 5
    log "WARN" "Database connections recovering"
    sleep 5
    log "INFO" "Database connections restored"
    log "INFO" "Database issues simulation completed"
}

# Function to simulate application errors
simulate_application_errors() {
    local error_count=${1:-10}
    log "INFO" "Starting application error simulation (${error_count} errors)"
    # Brace expansion cannot use a variable, so use an arithmetic loop
    for ((i = 1; i <= error_count; i++)); do
        case $((RANDOM % 4)) in
            0)
                log "ERROR" "NullPointerException in UserService.getUser($i)"
                ;;
            1)
                log "ERROR" "TimeoutException: Database query timed out for user ID: $i"
                ;;
            2)
                log "ERROR" "ValidationException: Invalid input data for request $i"
                ;;
            3)
                log "ERROR" "IOException: Failed to write to log file"
                ;;
        esac
        sleep $((RANDOM % 3 + 1))
    done
    log "INFO" "Application error simulation completed"
}

# Function to run comprehensive incident scenario
run_comprehensive_scenario() {
    log "INFO" "Starting comprehensive incident scenario simulation"
    # Phase 1: Initial system stress
    log "INFO" "Phase 1: System stress simulation"
    simulate_cpu_spike 30 &
    CPU_PID=$!
    simulate_memory_leak 20 &
    MEM_PID=$!
    sleep 10
    # Phase 2: Service degradation
    log "INFO" "Phase 2: Service degradation simulation"
    simulate_service_crash "web-service" &
    SERVICE_PID=$!
    sleep 5
    # Phase 3: Database issues
    log "INFO" "Phase 3: Database issues simulation"
    simulate_database_issues 15 &
    DB_PID=$!
    # Phase 4: Application errors
    log "INFO" "Phase 4: Application error burst"
    simulate_application_errors 15 &
    APP_PID=$!
    # Phase 5: Infrastructure issues
    log "INFO" "Phase 5: Infrastructure issues simulation"
    simulate_disk_full "/tmp" 10 &
    DISK_PID=$!
    # Wait for all simulations to complete
    wait $CPU_PID 2>/dev/null || true
    wait $MEM_PID 2>/dev/null || true
    wait $SERVICE_PID 2>/dev/null || true
    wait $DB_PID 2>/dev/null || true
    wait $APP_PID 2>/dev/null || true
    wait $DISK_PID 2>/dev/null || true
    log "INFO" "Comprehensive incident scenario completed"
}

# Function to show usage
show_usage() {
    echo "Enterprise Incident Simulation Script"
    echo "Usage: $0 [SCENARIO] [OPTIONS]"
    echo ""
    echo "SCENARIOS:"
    echo "  cpu [DURATION]                  - Simulate CPU spike (default: 60s)"
    echo "  memory [DURATION]               - Simulate memory leak (default: 30s)"
    echo "  disk [DIR] [DURATION]           - Simulate disk space exhaustion (default: /tmp, 30s)"
    echo "  network [INTERFACE] [DURATION]  - Simulate network issues (default: lo, 20s)"
    echo "  service [NAME]                  - Simulate service crashes (default: test-service)"
    echo "  database [DURATION]             - Simulate database issues (default: 25s)"
    echo "  app-errors [COUNT]              - Simulate application errors (default: 10)"
    echo "  comprehensive                   - Run full incident scenario"
    echo "  all                             - Run all individual scenarios sequentially"
    echo ""
    echo "EXAMPLES:"
    echo "  $0 cpu 120            - CPU spike for 2 minutes"
    echo "  $0 disk /var/log 45   - Disk full simulation in /var/log for 45 seconds"
    echo "  $0 comprehensive      - Full incident simulation"
    echo ""
}

# Main execution
main() {
    local scenario=${1:-"comprehensive"}
    # Create log directory if it doesn't exist
    mkdir -p "$(dirname "$LOG_FILE")"
    log "INFO" "Incident simulation script started"
    log "INFO" "Scenario: $scenario"
    case $scenario in
        "cpu")
            simulate_cpu_spike "${2:-60}"
            ;;
        "memory")
            simulate_memory_leak "${2:-30}"
            ;;
        "disk")
            simulate_disk_full "${2:-/tmp}" "${3:-30}"
            ;;
        "network")
            simulate_network_issues "${2:-lo}" "${3:-20}"
            ;;
        "service")
            simulate_service_crash "${2:-test-service}"
            ;;
        "database")
            simulate_database_issues "${2:-25}"
            ;;
        "app-errors")
            simulate_application_errors "${2:-10}"
            ;;
        "comprehensive")
            run_comprehensive_scenario
            ;;
        "all")
            log "INFO" "Running all scenarios sequentially"
            simulate_cpu_spike 30
            sleep 5
            simulate_memory_leak 20
            sleep 5
            simulate_disk_full "/tmp" 15
            sleep 5
            simulate_service_crash "test-service"
            sleep 5
            simulate_database_issues 15
            sleep 5
            simulate_application_errors 8
            sleep 5
            simulate_network_issues "lo" 10
            ;;
        "help"|"-h"|"--help")
            show_usage
            exit 0
            ;;
        *)
            echo -e "${RED}Error: Unknown scenario '$scenario'${NC}"
            echo ""
            show_usage
            exit 1
            ;;
    esac
    log "INFO" "Incident simulation script completed successfully"
    echo -e "${GREEN}Simulation completed. Check logs at: $LOG_FILE${NC}"
}

# Run main function with all arguments
main "$@"