# Observability Stack

A comprehensive monitoring and logging stack for enterprise infrastructure observability, built on the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.

## Overview

The Observability Stack provides a complete monitoring solution with:

- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures

## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Log Sources   │     │    Logstash     │     │  Elasticsearch  │
│  (Applications  │───► │  (Ingestion &   │───► │   (Storage &    │
│   / Systems)    │     │   Processing)   │     │    Analytics)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Alerting     │     │     Kibana      │     │     Grafana     │
│     Rules       │     │  (Dashboards &  │     │   (Metrics &    │
│                 │     │   Exploration)  │     │   Dashboards)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```

## Quick Start

### Prerequisites

- Docker and Docker Compose
- At least 4 GB of available RAM
- Ports 5601 (Kibana), 9200 (Elasticsearch), and 3000 (Grafana) available

### Setup

```bash
cd observability-stack

# Start the observability stack
docker-compose up -d

# Wait for services to be ready (may take 2-3 minutes)
sleep 180

# Verify services are running
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```

### Access Interfaces

- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200

## Project Structure

```
observability-stack/
├── docker-compose.yml          # Service orchestration
├── logstash/                   # Logstash configuration
│   ├── pipeline/               # Processing pipelines
│   └── config/                 # Logstash settings
├── elasticsearch/              # Elasticsearch configuration
│   └── config/                 # Cluster settings
├── kibana/                     # Kibana configuration
│   └── config/                 # Dashboard settings
├── grafana/                    # Grafana configuration
│   ├── provisioning/           # Dashboards and datasources
│   └── dashboards/             # Dashboard definitions
├── logs/                       # Sample log data
│   └── sample.log              # Realistic application logs
├── alerting/                   # Alert configuration
│   └── alert_rules.yml         # Alert definitions
├── scenarios/                  # Incident simulation
│   └── incident_simulation.sh  # Simulation scripts
└── README.md
```

## Services Configuration

### Elasticsearch

**Configuration**: `elasticsearch/config/elasticsearch.yml`

Key settings:

- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access

**Data Indices**:

- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data

### Logstash

**Pipelines**: `logstash/pipeline/`

- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing

**Input Sources**:

- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs

### Kibana

**Dashboards**:

- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard

**Saved Objects**:

- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting

### Grafana

**Data Sources**:

- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data
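The Elasticsearch data source can be registered automatically through Grafana's provisioning directory. The snippet below is a minimal sketch; the file path, data source name, and `logs-*` index pattern are illustrative assumptions, not files shipped in this repo:

```yaml
# grafana/provisioning/datasources/elasticsearch.yml (illustrative path)
apiVersion: 1

datasources:
  - name: Elasticsearch-Logs
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200   # Compose service name, not localhost
    basicAuth: true
    basicAuthUser: elastic
    secureJsonData:
      basicAuthPassword: elastic     # matches the stack's ELASTIC_PASSWORD
    database: "logs-*"               # index pattern to query
    jsonData:
      timeField: "@timestamp"
```

Grafana reads files under `provisioning/datasources/` at startup, so restarting the Grafana container picks up changes to this file.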
**Dashboards**:

- Infrastructure overview
- Application performance
- System resources
- Custom business metrics

## Log Ingestion

### Sample Data

The stack includes realistic sample logs for testing:

```bash
# Ingest sample logs (port 8080 is assumed to be a Logstash HTTP input)
curl -X POST "localhost:8080" \
  -H "Content-Type: application/json" \
  -d @logs/sample.log
```

### Log Formats Supported

- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules

### Data Enrichment

Logstash pipelines add:

- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment

## Alerting and Monitoring

### Alert Rules

Located in `alerting/alert_rules.yml`:

```yaml
alert_rules:
  - name: "High CPU Usage"
    condition: "cpu_usage > 90"
    duration: "5m"
    severity: "critical"
    channels: ["email", "slack"]

  - name: "Disk Space Low"
    condition: "disk_usage > 85"
    duration: "10m"
    severity: "warning"
    channels: ["email"]

  - name: "Service Down"
    condition: "service_status == 'down'"
    duration: "2m"
    severity: "critical"
    channels: ["email", "pagerduty"]
```

### Alert Channels

- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints

## Incident Simulation

### Available Scenarios

```bash
cd scenarios

# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical

# Simulate service failure
./incident_simulation.sh --type service-down --service nginx

# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms

# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```

### Scenario Types

- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation

## Dashboards and Visualization

### Kibana Dashboards

Pre-configured dashboards for:

1. **Log Analysis**
   - Log volume over time
   - Error rate trends
   - Top error messages
   - Geographic request distribution

2. **System Monitoring**
   - CPU and memory usage
   - Disk I/O statistics
   - Network traffic
   - System load averages

3. **Application Performance**
   - Response time distributions
   - Request rate metrics
   - Error percentages
   - User session analytics

### Grafana Dashboards

Advanced visualization panels:

- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting

## API Endpoints

### Elasticsearch APIs

```bash
# Cluster health
GET /_cluster/health

# Index statistics
GET /_cat/indices?v

# Search logs
GET /logs-*/_search
{
  "query": {
    "match": { "message": "ERROR" }
  }
}
```

### Kibana APIs

```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard

# Export visualizations
GET /api/saved_objects/visualization/{id}
```

### Grafana APIs

```bash
# Get dashboard list
GET /api/search?query=*

# Alert status
GET /api/alerts
```

## Configuration Management

### Environment Variables

```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"

# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"

# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```

### Scaling Configuration

For production deployment:

```yaml
version: '3.8'
services:
  elasticsearch:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: '2.0'
```

## Security Considerations

### Authentication

- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled

### Network Security

- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)

### Data Protection

- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures

## Troubleshooting

### Common Issues

**Elasticsearch won't start:**

```bash
# Check memory allocation
docker-compose logs elasticsearch

# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```

**Logstash pipeline errors:**

```bash
# Check pipeline configuration
docker-compose logs logstash

# Validate pipeline syntax
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
```

**Kibana connection issues:**

```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"

# Check Kibana logs
docker-compose logs kibana
```

### Performance Tuning

**Elasticsearch:**

- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization

**Logstash:**

- Adjust worker threads
- Configure batch sizes
- Enable persistent queues

**Grafana:**

- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries

## Development and Testing

### Adding New Dashboards

1. Create the dashboard JSON in `grafana/dashboards/`
2. Update the provisioning configuration
3. Restart the Grafana service

### Custom Alert Rules

1. Define rules in `alerting/alert_rules.yml`
2. Update the alerting configuration
3. Test rules with simulation scenarios

### Log Pipeline Development

1. Add a pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana

## Backup and Recovery

### Data Backup

```bash
# Elasticsearch snapshot (assumes a snapshot repository named "backup"
# has already been registered via the _snapshot API)
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```

### Configuration Backup

```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
  logstash/ elasticsearch/ kibana/ grafana/
```

## Contributing

1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility

## License

Enterprise Internal Use Only