feat: Add comprehensive enterprise Linux infrastructure portfolio with Ansible, Python, and ELK stack
@@ -0,0 +1,461 @@
# Observability Stack

A comprehensive monitoring and logging stack for enterprise infrastructure observability using the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.

## Overview

The Observability Stack provides a complete monitoring solution with:

- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Log Sources   │    │    Logstash     │    │ Elasticsearch   │
│  (Applications  │───►│  (Ingestion &   │───►│   (Storage &    │
│   / Systems)    │    │   Processing)   │    │   Analytics)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Alerting     │    │     Kibana      │    │     Grafana     │
│     Rules       │    │  (Dashboards &  │    │   (Metrics &    │
│                 │    │   Exploration)  │    │   Dashboards)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

## Quick Start

### Prerequisites

- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available

### Setup

```bash
cd observability-stack

# Start the observability stack
docker-compose up -d

# Wait for services to be ready (may take 2-3 minutes)
sleep 180

# Verify services are running (Elasticsearch has security enabled, so authenticate)
curl -u elastic:elastic -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```
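
Rather than sleeping for a fixed interval, readiness can be polled until the health endpoints respond. A minimal sketch of that idea in Python (a hypothetical helper, not part of the repository; the URL matches Grafana's unauthenticated health endpoint above):

```python
import time
import urllib.request

def wait_for(probe, timeout=180.0, interval=5.0):
    """Poll `probe` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

def http_ok(url):
    """Build a probe that succeeds when `url` answers with HTTP 200."""
    def probe():
        try:
            return urllib.request.urlopen(url, timeout=2).status == 200
        except OSError:
            return False
    return probe

# Example: wait_for(http_ok("http://localhost:3000/api/health"))
```

Any callable returning a boolean works as a probe, so the same loop can wait on Elasticsearch cluster health or Kibana status.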

### Access Interfaces

- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200 (elastic/elastic)

## Project Structure

```
observability-stack/
├── docker-compose.yml           # Service orchestration
├── logstash/                    # Logstash configuration
│   ├── pipeline/                # Processing pipelines
│   └── config/                  # Logstash settings
├── elasticsearch/               # Elasticsearch configuration
│   └── config/                  # Cluster settings
├── kibana/                      # Kibana configuration
│   └── config/                  # Dashboard settings
├── grafana/                     # Grafana configuration
│   ├── provisioning/            # Dashboards and datasources
│   └── dashboards/              # Dashboard definitions
├── logs/                        # Sample log data
│   └── sample.log               # Realistic application logs
├── alerting/                    # Alert configuration
│   └── alert_rules.yml          # Alert definitions
├── scenarios/                   # Incident simulation
│   └── incident_simulation.sh   # Simulation scripts
└── README.md
```

## Services Configuration

### Elasticsearch

**Configuration**: `elasticsearch/config/elasticsearch.yml`

Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access

**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data

### Logstash

**Pipelines**: `logstash/pipeline/`

- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing

**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs

### Kibana

**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard

**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting

### Grafana

**Data Sources**:
- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data

**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics

## Log Ingestion

### Sample Data

The stack includes realistic sample logs for testing:

```bash
# Ingest sample logs via the Logstash HTTP input on port 8080;
# --data-binary preserves the newlines between log lines (-d would strip them)
curl -X POST "localhost:8080" \
  -H "Content-Type: text/plain" \
  --data-binary @logs/sample.log
```

### Log Formats Supported

- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
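
As an illustration of the custom format used in `logs/sample.log`, the bracketed-timestamp entries can be parsed with a short regex. This is a sketch of the parsing rule, not the actual grok pattern shipped in `logstash/pipeline/`:

```python
import re
from datetime import datetime

# Matches lines such as:
# [2024-01-15 08:35:12] ERROR com.example.api.OrderController - Failed to process order
LOG_PATTERN = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s+(?P<message>.*)"
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    fields = m.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S")
    return fields
```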

### Data Enrichment

Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
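
Timestamp normalization, for example, turns the `sample.log` timestamps into the ISO 8601 form Elasticsearch expects for date fields. A minimal sketch of the transformation (assuming the source timestamps are already UTC, which the shipped pipelines may handle differently):

```python
from datetime import datetime, timezone

def normalize_ts(raw, fmt="%Y-%m-%d %H:%M:%S"):
    """Convert '2024-01-15 08:30:15' into an ISO 8601 UTC timestamp."""
    dt = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    return dt.isoformat().replace("+00:00", "Z")
```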

## Alerting and Monitoring

### Alert Rules

Located in `alerting/alert_rules.yml` (excerpt, using the metric names from that file):

```yaml
alert_rules:
  - name: "High CPU Usage"
    condition: "cpu_usage_percent > 90"
    duration: "5m"
    severity: "critical"
    channels: ["email", "slack"]

  - name: "Disk Space Low"
    condition: "disk_usage_percent > 85"
    duration: "10m"
    severity: "warning"
    channels: ["email"]

  - name: "Service Down"
    condition: "service_status == 'down'"
    duration: "2m"
    severity: "critical"
    channels: ["email", "pagerduty"]
```
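
How a rule's `condition` string gets evaluated is up to the alerting backend; as an illustration only, here is a tiny evaluator for the simple `metric <op> value` form used above (a hypothetical helper, not the stack's actual alerting engine):

```python
import operator
import re

OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge,
       "<=": operator.le, "==": operator.eq}

COND = re.compile(r"^\s*(\w+)\s*(>=|<=|==|>|<)\s*([\d.]+)\s*$")

def evaluate(condition, metrics):
    """Evaluate a condition like 'cpu_usage_percent > 90' against a metrics dict."""
    m = COND.match(condition)
    if not m:
        raise ValueError(f"unsupported condition: {condition!r}")
    name, op, threshold = m.groups()
    return OPS[op](metrics[name], float(threshold))
```

A real engine would also track the `duration` window, firing only after the condition has held continuously for that long.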

### Alert Channels

- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints

## Incident Simulation

### Available Scenarios

```bash
cd scenarios

# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical

# Simulate service failure
./incident_simulation.sh --type service-down --service nginx

# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms

# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```

### Scenario Types

- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation
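
For instance, the log-flood scenario only needs to emit well-formed entries faster than normal so the ingestion and alerting paths are exercised. A sketch of such a generator (hypothetical; the shipped shell script may implement it differently):

```python
from datetime import datetime, timezone

def flood_lines(n, level="ERROR", logger="com.example.sim.LogFlood"):
    """Generate n log lines in the sample-log format used by this stack."""
    ts = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return [
        f"[{ts}] {level} {logger} - Simulated flood message {i}"
        for i in range(n)
    ]
```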

## Dashboards and Visualization

### Kibana Dashboards

Pre-configured dashboards for:

1. **Log Analysis**
   - Log volume over time
   - Error rate trends
   - Top error messages
   - Geographic request distribution

2. **System Monitoring**
   - CPU and memory usage
   - Disk I/O statistics
   - Network traffic
   - System load averages

3. **Application Performance**
   - Response time distributions
   - Request rate metrics
   - Error percentages
   - User session analytics

### Grafana Dashboards

Advanced visualization panels:

- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting

## API Endpoints

### Elasticsearch APIs

```bash
# Cluster health
GET /_cluster/health

# Index statistics
GET /_cat/indices?v

# Search logs
GET /logs-*/_search
{
  "query": {
    "match": {
      "message": "ERROR"
    }
  }
}
```
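
The Dev Tools shorthand above corresponds to a JSON request body that can also be built and sent programmatically. A sketch (credentials as configured in `docker-compose.yml`; the helper name is illustrative):

```python
import json

def match_query(field, value, size=10):
    """Build the Elasticsearch request body for the search above."""
    return {"size": size, "query": {"match": {field: value}}}

body = json.dumps(match_query("message", "ERROR"))
# e.g.  curl -u elastic:elastic -H 'Content-Type: application/json' \
#            "localhost:9200/logs-*/_search" -d "$body"
```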

### Kibana APIs

```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard

# Export visualizations
GET /api/saved_objects/visualization/{id}
```

### Grafana APIs

```bash
# Get dashboard list
GET /api/search?query=*

# Alert status
GET /api/alerts
```

## Configuration Management

### Environment Variables

```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"

# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"

# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```

### Scaling Configuration

For production deployment:

```yaml
version: '3.8'
services:
  elasticsearch:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: '2.0'
```

## Security Considerations

### Authentication

- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled

### Network Security

- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)

### Data Protection

- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures

## Troubleshooting

### Common Issues

**Elasticsearch Won't Start:**
```bash
# Check memory allocation
docker-compose logs elasticsearch

# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```

**Logstash Pipeline Errors:**
```bash
# Check pipeline configuration
docker-compose logs logstash

# Validate pipeline syntax
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
```

**Kibana Connection Issues:**
```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"

# Check Kibana logs
docker-compose logs kibana
```

### Performance Tuning

**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization

**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues

**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries

## Development and Testing

### Adding New Dashboards

1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service

### Custom Alert Rules

1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios

### Log Pipeline Development

1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana

## Backup and Recovery

### Data Backup

```bash
# Elasticsearch snapshot (assumes a snapshot repository named "backup"
# has already been registered via the _snapshot API)
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```
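
The snapshot name embeds a timestamp so repeated backups never collide. The same naming scheme, mirroring the `$(date +%Y%m%d_%H%M%S)` shell expansion above (sketch):

```python
import time

def snapshot_name(ts=None):
    """Build a unique snapshot name like 'snapshot_20240115_103000'."""
    ts = ts if ts is not None else time.localtime()
    return time.strftime("snapshot_%Y%m%d_%H%M%S", ts)
```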

### Configuration Backup

```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
  logstash/ elasticsearch/ kibana/ grafana/
```

## Contributing

1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility

## License

Enterprise Internal Use Only
@@ -0,0 +1,326 @@
# Enterprise Observability Alert Rules
# Alert definitions for automated incident detection and notification

alert_rules:
  # System Resource Alerts
  - name: "High CPU Usage"
    description: "CPU utilization exceeds threshold"
    condition: "cpu_usage_percent > 90"
    duration: "5m"
    severity: "critical"
    tags:
      - system
      - performance
    channels:
      - email
      - slack
    labels:
      team: "platform"
      component: "system"

  - name: "High Memory Usage"
    description: "Memory utilization exceeds threshold"
    condition: "memory_usage_percent > 85"
    duration: "3m"
    severity: "warning"
    tags:
      - system
      - memory
    channels:
      - email
    labels:
      team: "platform"
      component: "system"

  - name: "Disk Space Critical"
    description: "Disk usage exceeds critical threshold"
    condition: "disk_usage_percent > 95"
    duration: "2m"
    severity: "critical"
    tags:
      - storage
      - disk
    channels:
      - email
      - pagerduty
    labels:
      team: "platform"
      component: "storage"

  - name: "Disk Space Warning"
    description: "Disk usage exceeds warning threshold"
    condition: "disk_usage_percent > 85"
    duration: "10m"
    severity: "warning"
    tags:
      - storage
      - disk
    channels:
      - email
    labels:
      team: "platform"
      component: "storage"
# Service Availability Alerts
|
||||
- name: "Service Down"
|
||||
description: "Critical service is not responding"
|
||||
condition: "service_status == 'down' OR http_status_code >= 500"
|
||||
duration: "2m"
|
||||
severity: "critical"
|
||||
tags:
|
||||
- service
|
||||
- availability
|
||||
channels:
|
||||
- email
|
||||
- slack
|
||||
- pagerduty
|
||||
labels:
|
||||
team: "application"
|
||||
component: "service"
|
||||
|
||||
- name: "Database Connection Failed"
|
||||
description: "Database connection pool exhausted or unresponsive"
|
||||
condition: "db_connections_active == 0 OR db_response_time > 5000"
|
||||
duration: "1m"
|
||||
severity: "critical"
|
||||
tags:
|
||||
- database
|
||||
- connectivity
|
||||
channels:
|
||||
- email
|
||||
- pagerduty
|
||||
labels:
|
||||
team: "database"
|
||||
component: "postgresql"
|
||||
|
||||
- name: "Cache Unavailable"
|
||||
description: "Cache service is down or unresponsive"
|
||||
condition: "cache_hit_ratio < 0.1 OR cache_response_time > 1000"
|
||||
duration: "3m"
|
||||
severity: "warning"
|
||||
tags:
|
||||
- cache
|
||||
- performance
|
||||
channels:
|
||||
- email
|
||||
labels:
|
||||
team: "infrastructure"
|
||||
component: "redis"
|
||||
|
||||
# Application Performance Alerts
|
||||
- name: "High Error Rate"
|
||||
description: "Application error rate exceeds threshold"
|
||||
condition: "error_rate_percent > 5"
|
||||
duration: "5m"
|
||||
severity: "critical"
|
||||
tags:
|
||||
- application
|
||||
- errors
|
||||
channels:
|
||||
- email
|
||||
- slack
|
||||
labels:
|
||||
team: "application"
|
||||
component: "api"
|
||||
|
||||
- name: "Slow Response Time"
|
||||
description: "API response time exceeds SLA"
|
||||
condition: "response_time_p95 > 2000"
|
||||
duration: "5m"
|
||||
severity: "warning"
|
||||
tags:
|
||||
- application
|
||||
- performance
|
||||
channels:
|
||||
- email
|
||||
labels:
|
||||
team: "application"
|
||||
component: "api"
|
||||
|
||||
- name: "High Request Queue"
|
||||
description: "Request queue depth is too high"
|
||||
condition: "queue_depth > 100"
|
||||
duration: "3m"
|
||||
severity: "warning"
|
||||
tags:
|
||||
- application
|
||||
- queue
|
||||
channels:
|
||||
- email
|
||||
labels:
|
||||
team: "application"
|
||||
component: "queue"
|
||||
|
||||
# Infrastructure Alerts
|
||||
- name: "Network Latency High"
|
||||
description: "Network round-trip time exceeds threshold"
|
||||
condition: "network_rtt > 100"
|
||||
duration: "5m"
|
||||
severity: "warning"
|
||||
tags:
|
||||
- network
|
||||
- latency
|
||||
channels:
|
||||
- email
|
||||
labels:
|
||||
team: "network"
|
||||
component: "infrastructure"
|
||||
|
||||
- name: "Load Balancer Unhealthy"
|
||||
description: "Load balancer backend servers are unhealthy"
|
||||
condition: "lb_unhealthy_backends > 0"
|
||||
duration: "2m"
|
||||
severity: "critical"
|
||||
tags:
|
||||
- loadbalancer
|
||||
- availability
|
||||
channels:
|
||||
- email
|
||||
- pagerduty
|
||||
labels:
|
||||
team: "infrastructure"
|
||||
component: "loadbalancer"
|
||||
|
||||
# Security Alerts
|
||||
- name: "Failed Login Attempts"
|
||||
description: "Multiple failed authentication attempts detected"
|
||||
condition: "failed_login_attempts > 5"
|
||||
duration: "5m"
|
||||
severity: "warning"
|
||||
tags:
|
||||
- security
|
||||
- authentication
|
||||
channels:
|
||||
- email
|
||||
- slack
|
||||
labels:
|
||||
team: "security"
|
||||
component: "authentication"
|
||||
|
||||
- name: "Suspicious Network Traffic"
|
||||
description: "Unusual network traffic patterns detected"
|
||||
condition: "network_bytes_unusual > 1000000"
|
||||
duration: "10m"
|
||||
severity: "warning"
|
||||
tags:
|
||||
- security
|
||||
- network
|
||||
channels:
|
||||
- email
|
||||
labels:
|
||||
team: "security"
|
||||
component: "network"
|
||||
|
||||
# Log-based Alerts
|
||||
- name: "Application Errors"
|
||||
description: "High volume of application error logs"
|
||||
condition: "log_errors_per_minute > 10"
|
||||
duration: "2m"
|
||||
severity: "warning"
|
||||
tags:
|
||||
- logs
|
||||
- errors
|
||||
channels:
|
||||
- email
|
||||
labels:
|
||||
team: "application"
|
||||
component: "logs"
|
||||
|
||||
- name: "Out of Memory Errors"
|
||||
description: "Out of memory errors detected in logs"
|
||||
condition: "log_oom_errors > 0"
|
||||
duration: "1m"
|
||||
severity: "critical"
|
||||
tags:
|
||||
- memory
|
||||
- errors
|
||||
channels:
|
||||
- email
|
||||
- pagerduty
|
||||
labels:
|
||||
team: "application"
|
||||
component: "memory"
|
||||
|
||||
# Business Logic Alerts
|
||||
- name: "Low Business Transactions"
|
||||
description: "Business transaction volume below expected threshold"
|
||||
condition: "business_transactions_per_hour < 100"
|
||||
duration: "15m"
|
||||
severity: "warning"
|
||||
tags:
|
||||
- business
|
||||
- transactions
|
||||
channels:
|
||||
- email
|
||||
labels:
|
||||
team: "business"
|
||||
component: "transactions"
|
||||
|
||||
- name: "Payment Failures"
|
||||
description: "Payment processing failure rate is high"
|
||||
condition: "payment_failure_rate > 0.05"
|
||||
duration: "5m"
|
||||
severity: "critical"
|
||||
tags:
|
||||
- payments
|
||||
- business
|
||||
channels:
|
||||
- email
|
||||
- pagerduty
|
||||
labels:
|
||||
team: "payments"
|
||||
component: "processing"
|
||||
|
||||
# Alert Channels Configuration
|
||||
alert_channels:
|
||||
email:
|
||||
type: "email"
|
||||
recipients:
|
||||
- "platform-team@company.com"
|
||||
- "oncall@company.com"
|
||||
subject_template: "[{{severity}}] {{name}} - {{description}}"
|
||||
|
||||
slack:
|
||||
type: "slack"
|
||||
webhook_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
|
||||
channel: "#alerts"
|
||||
username: "Observability Bot"
|
||||
icon_emoji: ":warning:"
|
||||
|
||||
pagerduty:
|
||||
type: "pagerduty"
|
||||
integration_key: "your-pagerduty-integration-key"
|
||||
severity_mapping:
|
||||
critical: "critical"
|
||||
warning: "warning"
|
||||
info: "info"
|
||||
|
||||
# Alert Silencing Rules
|
||||
silence_rules:
|
||||
- name: "Maintenance Window"
|
||||
condition: "maintenance_window == true"
|
||||
duration: "4h"
|
||||
comment: "Silenced during scheduled maintenance"
|
||||
|
||||
- name: "Known Issue"
|
||||
condition: "known_issue_id == 'TICKET-123'"
|
||||
duration: "24h"
|
||||
comment: "Silenced for known issue resolution"
|
||||
|
||||
# Escalation Policies
|
||||
escalation_policies:
|
||||
- name: "Default Escalation"
|
||||
steps:
|
||||
- delay: "5m"
|
||||
channels: ["email"]
|
||||
- delay: "15m"
|
||||
channels: ["slack"]
|
||||
- delay: "30m"
|
||||
channels: ["pagerduty"]
|
||||
|
||||
- name: "Critical Escalation"
|
||||
steps:
|
||||
- delay: "0m"
|
||||
channels: ["email", "slack", "pagerduty"]
|
||||
- delay: "10m"
|
||||
channels: ["pagerduty"] # Escalation
|
||||
@@ -0,0 +1,122 @@
version: '3.8'

services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
    container_name: observability-elasticsearch
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=true
      - ELASTIC_PASSWORD=elastic
      - "ES_JAVA_OPTS=-Xms1g -Xmx1g"
    volumes:
      - elasticsearch_data:/usr/share/elasticsearch/data
      - ./elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml
    ports:
      - "9200:9200"
      - "9300:9300"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      # With security enabled, an unauthenticated request returns 401,
      # so the healthcheck must authenticate or it will never pass
      test: ["CMD-SHELL", "curl -f -u elastic:elastic http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  logstash:
    image: docker.elastic.co/logstash/logstash:8.11.0
    container_name: observability-logstash
    environment:
      - "LS_JAVA_OPTS=-Xms512m -Xmx512m"
    volumes:
      - ./logstash/config/logstash.yml:/usr/share/logstash/config/logstash.yml
      - ./logstash/pipeline:/usr/share/logstash/pipeline
      - ./logs:/usr/share/logstash/logs
    ports:
      - "5044:5044"
      - "8080:8080"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8080 || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  kibana:
    image: docker.elastic.co/kibana/kibana:8.11.0
    container_name: observability-kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
      # Note: Kibana 8.x rejects the elastic superuser for this setting;
      # the kibana_system user or a service account token is required
      - ELASTICSEARCH_USERNAME=elastic
      - ELASTICSEARCH_PASSWORD=elastic
    volumes:
      - ./kibana/config/kibana.yml:/usr/share/kibana/config/kibana.yml
    ports:
      - "5601:5601"
    networks:
      - observability
    depends_on:
      elasticsearch:
        condition: service_healthy
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:5601/api/status || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5

  grafana:
    image: grafana/grafana:10.2.0
    container_name: observability-grafana
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    ports:
      - "3000:3000"
    networks:
      - observability
    restart: unless-stopped
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 3

  filebeat:
    image: docker.elastic.co/beats/filebeat:8.11.0
    container_name: observability-filebeat
    user: root
    volumes:
      - ./filebeat/config/filebeat.yml:/usr/share/filebeat/filebeat.yml
      - ./logs:/var/log/sample
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - observability
    depends_on:
      - logstash
    restart: unless-stopped

volumes:
  elasticsearch_data:
    driver: local
  grafana_data:
    driver: local

networks:
  observability:
    driver: bridge
    ipam:
      config:
        - subnet: 172.25.0.0/16
@@ -0,0 +1,2 @@
[2026-04-29 22:52:26] INFO Incident simulation script started
[2026-04-29 22:52:26] INFO Scenario: help
@@ -0,0 +1,98 @@
|
||||
# Sample Application Logs for Observability Stack Testing
|
||||
# Generated realistic log entries for demonstration and testing
|
||||
|
||||
[2024-01-15 08:30:15] INFO com.example.app.Application - Application started successfully on port 8080
|
||||
[2024-01-15 08:30:16] INFO com.example.database.ConnectionPool - Database connection pool initialized with 10 connections
|
||||
[2024-01-15 08:30:17] INFO com.example.cache.RedisCache - Redis cache connected successfully
|
||||
[2024-01-15 08:30:18] INFO com.example.messaging.RabbitMQClient - Message queue connection established
|
||||
|
||||
[2024-01-15 08:31:22] INFO com.example.api.UserController - User login attempt: user=john.doe, ip=192.168.1.100
|
||||
[2024-01-15 08:31:23] INFO com.example.auth.AuthService - Authentication successful for user john.doe
|
||||
[2024-01-15 08:31:24] INFO com.example.api.UserController - API request: GET /api/users/profile, status=200, duration=45ms
|
||||
|
||||
[2024-01-15 08:32:01] WARN com.example.cache.RedisCache - Cache miss for key: user_profile_12345
|
||||
[2024-01-15 08:32:02] INFO com.example.database.UserRepository - Database query executed: SELECT * FROM users WHERE id = ?, duration=120ms
|
||||
[2024-01-15 08:32:03] INFO com.example.cache.RedisCache - Cache populated for key: user_profile_12345
|
||||
|
||||
[2024-01-15 08:35:12] ERROR com.example.api.OrderController - Failed to process order: order_id=ORD-2024-001, error=Payment gateway timeout
|
||||
[2024-01-15 08:35:13] WARN com.example.messaging.RabbitMQClient - Message delivery failed, retrying in 5 seconds
|
||||
[2024-01-15 08:35:18] INFO com.example.messaging.RabbitMQClient - Message delivered successfully after retry
|
||||
|
||||
[2024-01-15 08:40:05] INFO com.example.monitoring.HealthCheck - Health check passed: database=OK, cache=OK, messaging=OK
|
||||
[2024-01-15 08:40:06] INFO com.example.metrics.MetricsCollector - Metrics collected: active_users=1250, requests_per_second=45.2
|
||||
|
||||
[2024-01-15 08:45:30] ERROR com.example.external.PaymentGateway - Payment gateway connection failed: Connection refused
|
||||
[2024-01-15 08:45:31] ERROR com.example.api.PaymentController - Payment processing failed for transaction TXN-789012
|
||||
[2024-01-15 08:45:32] WARN com.example.circuitbreaker.PaymentCircuitBreaker - Circuit breaker opened for payment service
|
||||
|
||||
[2024-01-15 08:50:15] INFO com.example.batch.BatchProcessor - Batch job started: job_id=BATCH-001, records=10000
|
||||
[2024-01-15 08:50:45] INFO com.example.batch.BatchProcessor - Batch job progress: processed=2500/10000 (25%)
|
||||
[2024-01-15 08:51:15] INFO com.example.batch.BatchProcessor - Batch job progress: processed=5000/10000 (50%)
|
||||
[2024-01-15 08:51:45] INFO com.example.batch.BatchProcessor - Batch job progress: processed=7500/10000 (75%)
|
||||
[2024-01-15 08:52:10] INFO com.example.batch.BatchProcessor - Batch job completed: job_id=BATCH-001, duration=175s, success=true
|
||||
|
||||

[2024-01-15 09:00:00] INFO com.example.scheduled.CleanupJob - Scheduled cleanup job started
[2024-01-15 09:00:05] INFO com.example.scheduled.CleanupJob - Cleaned up 150 expired sessions
[2024-01-15 09:00:10] INFO com.example.scheduled.CleanupJob - Removed 25 temporary files
[2024-01-15 09:00:15] INFO com.example.scheduled.CleanupJob - Cleanup job completed successfully

[2024-01-15 09:15:22] WARN com.example.database.ConnectionPool - Connection pool nearing capacity: active=8/10
[2024-01-15 09:15:23] INFO com.example.database.ConnectionPool - Connection pool expanded to 15 connections

[2024-01-15 09:30:45] ERROR com.example.api.ProductController - Database query timeout: query=SELECT * FROM products WHERE category = 'electronics'
[2024-01-15 09:30:46] WARN com.example.database.ConnectionPool - Connection pool exhausted, rejecting request
[2024-01-15 09:30:47] ERROR com.example.api.ProductController - Service temporarily unavailable, status=503

[2024-01-15 09:45:12] INFO com.example.monitoring.AlertManager - Alert triggered: High CPU usage detected (85%)
[2024-01-15 09:45:13] INFO com.example.autoscaling.ScaleManager - Auto-scaling initiated: increasing instances from 3 to 5

[2024-01-15 10:00:00] INFO com.example.backup.BackupService - Database backup started
[2024-01-15 10:05:30] INFO com.example.backup.BackupService - Database backup completed: size=2.3GB, duration=330s

[2024-01-15 10:30:15] WARN com.example.security.SecurityFilter - Suspicious activity detected: multiple failed login attempts from IP 10.0.0.50
[2024-01-15 10:30:16] INFO com.example.security.SecurityFilter - IP 10.0.0.50 blocked for 15 minutes

[2024-01-15 11:00:00] INFO com.example.reporting.ReportGenerator - Daily report generation started
[2024-01-15 11:05:00] INFO com.example.reporting.ReportGenerator - Daily report completed: transactions=15420, revenue=$125,430.50

[2024-01-15 12:00:00] ERROR com.example.external.APIGateway - External API rate limit exceeded: retrying in 60 seconds
[2024-01-15 12:01:00] INFO com.example.external.APIGateway - External API connection restored

[2024-01-15 13:15:30] CRITICAL com.example.system.SystemMonitor - Disk space critical: /var/log usage at 95%
[2024-01-15 13:15:31] INFO com.example.maintenance.LogRotation - Emergency log rotation initiated
[2024-01-15 13:15:35] INFO com.example.maintenance.LogRotation - Log rotation completed: freed 2.1GB space

[2024-01-15 14:00:00] INFO com.example.metrics.PerformanceMonitor - Performance baseline established: avg_response_time=245ms, throughput=1200 req/sec

[2024-01-15 15:30:45] WARN com.example.cache.RedisCache - Cache cluster node down: redis-node-03
[2024-01-15 15:30:46] INFO com.example.cache.RedisCache - Failover initiated to redis-node-04

[2024-01-15 16:45:12] ERROR com.example.messaging.MessageProcessor - Message processing failed: invalid message format
[2024-01-15 16:45:13] INFO com.example.messaging.DeadLetterQueue - Message moved to dead letter queue

[2024-01-15 17:00:00] INFO com.example.backup.BackupService - Full system backup started
[2024-01-15 17:30:00] INFO com.example.backup.BackupService - Full system backup completed: size=45.2GB, duration=1800s

[2024-01-15 18:00:00] INFO com.example.monitoring.HealthCheck - Evening health check: all systems operational
[2024-01-15 18:00:01] INFO com.example.metrics.MetricsCollector - End of day metrics: total_requests=125000, error_rate=0.02%, avg_response_time=198ms

# Additional realistic log patterns for testing

[2024-01-15 08:15:30] INFO nginx: 192.168.1.100 - - [15/Jan/2024:08:15:30 +0000] "GET /api/health HTTP/1.1" 200 123 "-" "curl/7.68.0"
[2024-01-15 08:15:31] INFO nginx: 192.168.1.101 - - [15/Jan/2024:08:15:31 +0000] "POST /api/login HTTP/1.1" 200 456 "-" "Mozilla/5.0 ..."
[2024-01-15 08:15:32] WARN nginx: 192.168.1.102 - - [15/Jan/2024:08:15:32 +0000] "GET /api/admin HTTP/1.1" 403 234 "-" "Mozilla/5.0 ..."

[2024-01-15 09:20:15] INFO systemd: Started Session 1234 of user john.doe
[2024-01-15 09:20:16] INFO systemd: Started User Manager for UID 1000
[2024-01-15 09:20:17] INFO systemd: Started Session 1235 of user jane.smith

[2024-01-15 10:45:30] WARN kernel: [ 1234.567890] CPU0: Core temperature above threshold, cpu clock throttled
[2024-01-15 10:45:31] INFO kernel: [ 1234.678901] CPU0: Temperature back to normal

[2024-01-15 14:30:15] ERROR postgres: FATAL: password authentication failed for user "app_user"
|
||||
[2024-01-15 14:30:16] ERROR postgres: FATAL: password authentication failed for user "app_user"
|
||||
[2024-01-15 14:30:17] WARN postgres: too many connections for role "app_user"
|
||||
|
||||
[2024-01-15 16:15:45] INFO rabbitmq: accepting AMQP connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672)
|
||||
[2024-01-15 16:15:46] INFO rabbitmq: connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672): user 'app_user' authenticated
|
||||
[2024-01-15 16:15:47] WARN rabbitmq: connection <0.1234.0> (192.168.1.100:5672 -> 192.168.1.50:5672): missed heartbeats from client, timeout: 60s
|
||||
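As a quick sanity check, the severity mix of a sample log like the one above can be tallied with plain `grep`, since every application entry carries a `] LEVEL ` prefix. A minimal sketch (the inline heredoc stands in for the actual sample file; point `LOG` at the real path instead):

```shell
#!/bin/sh
# Tally log entries per severity level. The inline sample below stands in
# for the real sample-log file; substitute its path for the mktemp file.
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
[2024-01-15 09:30:45] ERROR com.example.api.ProductController - Database query timeout
[2024-01-15 09:30:46] WARN com.example.database.ConnectionPool - Connection pool exhausted
[2024-01-15 09:45:12] INFO com.example.monitoring.AlertManager - Alert triggered
[2024-01-15 13:15:30] CRITICAL com.example.system.SystemMonitor - Disk space critical
EOF
for level in INFO WARN ERROR CRITICAL; do
    # "] LEVEL " anchors the match on the bracket-delimited timestamp prefix,
    # so a level name appearing inside a message body is not miscounted.
    printf '%s: %s\n' "$level" "$(grep -c "] $level " "$LOG")"
done
rm -f "$LOG"
```

For the inline sample this prints one entry per level; run it against the full sample file to confirm Logstash ingestion counts later.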
#!/bin/bash

# Enterprise Incident Simulation Script
# Simulates various failure scenarios for testing the observability stack

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(dirname "$(dirname "$SCRIPT_DIR")")"
LOG_FILE="$PROJECT_ROOT/observability-stack/logs/incident_simulation.log"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Logging function: write to the simulation log and echo to the terminal
log() {
    local level=$1
    local message=$2
    local timestamp
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] $level $message" >> "$LOG_FILE"
    echo -e "${BLUE}[$timestamp]${NC} $level $message"
}

# Function to simulate CPU spike
simulate_cpu_spike() {
    local duration=${1:-60}
    log "INFO" "Starting CPU spike simulation for ${duration} seconds"

    # Launch CPU-intensive background workers
    local -a PIDS=()
    for i in {1..4}; do
        (
            end_time=$((SECONDS + duration))
            while [ $SECONDS -lt $end_time ]; do
                # CPU-intensive calculation
                result=0
                for j in {1..100000}; do
                    result=$((result + j))
                done
            done
        ) &
        PIDS[$i]=$!
    done

    # Wait for simulation to complete
    for pid in "${PIDS[@]}"; do
        wait "$pid" 2>/dev/null || true
    done

    log "INFO" "CPU spike simulation completed"
}

# Function to simulate memory leak
simulate_memory_leak() {
    local duration=${1:-30}
    log "INFO" "Starting memory leak simulation for ${duration} seconds"

    # Create a process that gradually consumes memory
    (
        data=""
        end_time=$((SECONDS + duration))
        while [ $SECONDS -lt $end_time ]; do
            # Gradually consume memory
            data="${data}X"
            sleep 0.1
        done
    ) &
    MEM_PID=$!

    wait "$MEM_PID" 2>/dev/null || true
    log "INFO" "Memory leak simulation completed"
}

# Function to simulate disk space exhaustion
simulate_disk_full() {
    local target_dir=${1:-"/tmp"}
    local duration=${2:-30}
    log "INFO" "Starting disk space exhaustion simulation in ${target_dir} for ${duration} seconds"

    # Create large files to fill disk space
    (
        end_time=$((SECONDS + duration))
        while [ $SECONDS -lt $end_time ]; do
            # Create a 100MB file every 2 seconds
            dd if=/dev/zero of="${target_dir}/incident_test_file_$(date +%s).tmp" bs=1M count=100 2>/dev/null || true
            sleep 2
        done
    ) &
    DISK_PID=$!

    wait "$DISK_PID" 2>/dev/null || true

    # Cleanup test files
    rm -f "${target_dir}"/incident_test_file_*.tmp 2>/dev/null || true
    log "INFO" "Disk space exhaustion simulation completed and cleaned up"
}

# Function to simulate network issues
simulate_network_issues() {
    local interface=${1:-"lo"}
    local duration=${2:-20}
    log "INFO" "Starting network issues simulation on ${interface} for ${duration} seconds"

    # Add network delay and packet loss
    sudo tc qdisc add dev "$interface" root netem delay 100ms 50ms loss 10% 2>/dev/null || true

    sleep "$duration"

    # Remove network simulation
    sudo tc qdisc del dev "$interface" root 2>/dev/null || true
    log "INFO" "Network issues simulation completed"
}

# Function to simulate service crashes
simulate_service_crash() {
    local service_name=${1:-"test-service"}
    log "INFO" "Starting service crash simulation for ${service_name}"

    # Simulate service going down
    log "ERROR" "Service ${service_name} crashed unexpectedly"
    sleep 5
    log "INFO" "Service ${service_name} restarted automatically"

    # Simulate multiple crashes
    for i in {1..3}; do
        sleep 2
        log "ERROR" "Service ${service_name} crashed again (attempt $i)"
        sleep 1
        log "INFO" "Service ${service_name} recovered after crash $i"
    done
}

# Function to simulate database issues
simulate_database_issues() {
    local duration=${1:-25}
    log "INFO" "Starting database issues simulation for ${duration} seconds"

    # Simulate connection pool exhaustion
    log "WARN" "Database connection pool nearing capacity"
    sleep 5
    log "ERROR" "Database connection pool exhausted"
    sleep 5
    log "ERROR" "Database query timeout occurred"
    sleep 5
    log "WARN" "Database connections recovering"
    sleep 5
    log "INFO" "Database connections restored"

    log "INFO" "Database issues simulation completed"
}

# Function to simulate application errors
simulate_application_errors() {
    local error_count=${1:-10}
    log "INFO" "Starting application error simulation (${error_count} errors)"

    # Brace expansion ({1..N}) does not interpolate variables, so use a C-style loop
    for ((i = 1; i <= error_count; i++)); do
        case $((RANDOM % 4)) in
            0)
                log "ERROR" "NullPointerException in UserService.getUser($i)"
                ;;
            1)
                log "ERROR" "TimeoutException: Database query timed out for user ID: $i"
                ;;
            2)
                log "ERROR" "ValidationException: Invalid input data for request $i"
                ;;
            3)
                log "ERROR" "IOException: Failed to write to log file"
                ;;
        esac
        sleep $((RANDOM % 3 + 1))
    done

    log "INFO" "Application error simulation completed"
}

# Function to run comprehensive incident scenario
run_comprehensive_scenario() {
    log "INFO" "Starting comprehensive incident scenario simulation"

    # Phase 1: Initial system stress
    log "INFO" "Phase 1: System stress simulation"
    simulate_cpu_spike 30 &
    CPU_PID=$!
    simulate_memory_leak 20 &
    MEM_PID=$!

    sleep 10

    # Phase 2: Service degradation
    log "INFO" "Phase 2: Service degradation simulation"
    simulate_service_crash "web-service" &
    SERVICE_PID=$!

    sleep 5

    # Phase 3: Database issues
    log "INFO" "Phase 3: Database issues simulation"
    simulate_database_issues 15 &
    DB_PID=$!

    # Phase 4: Application errors
    log "INFO" "Phase 4: Application error burst"
    simulate_application_errors 15 &
    APP_PID=$!

    # Phase 5: Infrastructure issues
    log "INFO" "Phase 5: Infrastructure issues simulation"
    simulate_disk_full "/tmp" 10 &
    DISK_PID=$!

    # Wait for all simulations to complete
    wait "$CPU_PID" 2>/dev/null || true
    wait "$MEM_PID" 2>/dev/null || true
    wait "$SERVICE_PID" 2>/dev/null || true
    wait "$DB_PID" 2>/dev/null || true
    wait "$APP_PID" 2>/dev/null || true
    wait "$DISK_PID" 2>/dev/null || true

    log "INFO" "Comprehensive incident scenario completed"
}

# Function to show usage
show_usage() {
    echo "Enterprise Incident Simulation Script"
    echo "Usage: $0 [SCENARIO] [OPTIONS]"
    echo ""
    echo "SCENARIOS:"
    echo "  cpu [DURATION]                 - Simulate CPU spike (default: 60s)"
    echo "  memory [DURATION]              - Simulate memory leak (default: 30s)"
    echo "  disk [DIR] [DURATION]          - Simulate disk space exhaustion (default: /tmp, 30s)"
    echo "  network [INTERFACE] [DURATION] - Simulate network issues (default: lo, 20s)"
    echo "  service [NAME]                 - Simulate service crashes (default: test-service)"
    echo "  database [DURATION]            - Simulate database issues (default: 25s)"
    echo "  app-errors [COUNT]             - Simulate application errors (default: 10)"
    echo "  comprehensive                  - Run full incident scenario"
    echo "  all                            - Run all individual scenarios sequentially"
    echo ""
    echo "EXAMPLES:"
    echo "  $0 cpu 120          - CPU spike for 2 minutes"
    echo "  $0 disk /var/log 45 - Disk full simulation in /var/log for 45 seconds"
    echo "  $0 comprehensive    - Full incident simulation"
    echo ""
}

# Main execution
main() {
    local scenario=${1:-"comprehensive"}

    # Create log directory if it doesn't exist
    mkdir -p "$(dirname "$LOG_FILE")"

    log "INFO" "Incident simulation script started"
    log "INFO" "Scenario: $scenario"

    case $scenario in
        "cpu")
            simulate_cpu_spike "${2:-60}"
            ;;
        "memory")
            simulate_memory_leak "${2:-30}"
            ;;
        "disk")
            simulate_disk_full "${2:-/tmp}" "${3:-30}"
            ;;
        "network")
            simulate_network_issues "${2:-lo}" "${3:-20}"
            ;;
        "service")
            simulate_service_crash "${2:-test-service}"
            ;;
        "database")
            simulate_database_issues "${2:-25}"
            ;;
        "app-errors")
            simulate_application_errors "${2:-10}"
            ;;
        "comprehensive")
            run_comprehensive_scenario
            ;;
        "all")
            log "INFO" "Running all scenarios sequentially"
            simulate_cpu_spike 30
            sleep 5
            simulate_memory_leak 20
            sleep 5
            simulate_disk_full "/tmp" 15
            sleep 5
            simulate_service_crash "test-service"
            sleep 5
            simulate_database_issues 15
            sleep 5
            simulate_application_errors 8
            sleep 5
            simulate_network_issues "lo" 10
            ;;
        "help"|"-h"|"--help")
            show_usage
            exit 0
            ;;
        *)
            echo -e "${RED}Error: Unknown scenario '$scenario'${NC}"
            echo ""
            show_usage
            exit 1
            ;;
    esac

    log "INFO" "Incident simulation script completed successfully"
    echo -e "${GREEN}Simulation completed. Check logs at: $LOG_FILE${NC}"
}

# Run main function with all arguments
main "$@"
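One caveat on the `network` scenario: because the qdisc is torn down by a plain `tc qdisc del` at the end of the function, a Ctrl-C mid-sleep can leave `netem` delay and loss applied to the interface. A trap-based cleanup guards against that. This is a sketch, not part of the script above; the `echo` stands in for the real `tc qdisc del` call so it runs without root:

```shell
#!/bin/bash
# Sketch: guarantee netem teardown even if the simulation is interrupted.
IFACE="lo"  # hypothetical interface, as in simulate_network_issues

cleanup() {
    # Real cleanup would be: sudo tc qdisc del dev "$IFACE" root 2>/dev/null || true
    echo "cleanup: qdisc removed from $IFACE"
}
trap cleanup EXIT         # runs on every exit path
trap 'exit 130' INT TERM  # route signals through the EXIT trap

echo "netem active on $IFACE"
# ... simulation work here; cleanup fires no matter how the script leaves
```

The same pattern would also protect the `disk` scenario's `rm -f` of its test files.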