Observability Stack
A comprehensive monitoring and logging stack for enterprise infrastructure, built on the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.
Overview
The Observability Stack provides a complete monitoring solution with:
- Elasticsearch: Distributed search and analytics engine for logs and metrics
- Logstash: Data processing pipeline for log ingestion and transformation
- Kibana: Visualization and exploration interface for Elasticsearch data
- Grafana: Advanced metrics dashboarding and alerting platform
- Sample Logs: Realistic log data for testing and demonstration
- Alerting: Automated incident detection and notification rules
- Incident Simulation: Scenarios for testing monitoring and response procedures
Architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Log Sources │ │ Logstash │ │ Elasticsearch │
│ (Applications │───►│ (Ingestion & │───►│ (Storage & │
│ / Systems) │ │ Processing) │ │ Analytics) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Alerting │ │ Kibana │ │ Grafana │
│ Rules │ │ (Dashboards & │ │ (Metrics & │
│ │ │ Exploration) │ │ Dashboards) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Quick Start
Prerequisites
- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana), and 8080 (Logstash HTTP input) available
Setup
cd observability-stack
# Start the observability stack
docker-compose up -d
# Wait for services to be ready (may take 2-3 minutes)
sleep 180
# Verify services are running
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
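Instead of sleeping for a fixed interval, the readiness check can be written as a small polling loop. This is a sketch (the `wait_for_url` helper and the 180-second default are assumptions, not part of the repository):

```shell
#!/bin/sh
# Poll a URL until it responds successfully or a timeout is reached.
wait_for_url() {
  url="$1"
  timeout="${2:-180}"   # seconds to wait before giving up
  elapsed=0
  until curl -sf "$url" > /dev/null 2>&1; do
    if [ "$elapsed" -ge "$timeout" ]; then
      echo "timed out waiting for $url" >&2
      return 1
    fi
    sleep 5
    elapsed=$((elapsed + 5))
  done
  echo "$url is up"
}

# Usage: block until each service answers.
# wait_for_url "localhost:9200/_cluster/health"
# wait_for_url "localhost:5601/api/status"
# wait_for_url "localhost:3000/api/health"
```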
Access Interfaces
- Kibana: http://localhost:5601 (elastic/elastic)
- Grafana: http://localhost:3000 (admin/admin)
- Elasticsearch: http://localhost:9200
Project Structure
observability-stack/
├── docker-compose.yml # Service orchestration
├── logstash/ # Logstash configuration
│ ├── pipeline/ # Processing pipelines
│ └── config/ # Logstash settings
├── elasticsearch/ # Elasticsearch configuration
│ └── config/ # Cluster settings
├── kibana/ # Kibana configuration
│ └── config/ # Dashboard settings
├── grafana/ # Grafana configuration
│ ├── provisioning/ # Dashboards and datasources
│ └── dashboards/ # Dashboard definitions
├── logs/ # Sample log data
│ └── sample.log # Realistic application logs
├── alerting/ # Alert configuration
│ └── alert_rules.yml # Alert definitions
├── scenarios/ # Incident simulation
│ └── incident_simulation.sh # Simulation scripts
└── README.md
Services Configuration
Elasticsearch
Configuration: elasticsearch/config/elasticsearch.yml
Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access
Data Indices:
- logs-*: Application and system logs
- metrics-*: System and application metrics
- alerts-*: Alert and incident data
Logstash
Pipelines: logstash/pipeline/
- apache_logs: Apache/Nginx access log processing
- system_logs: System log parsing and enrichment
- application_logs: Custom application log processing
- metrics_pipeline: Metrics data processing
Input Sources:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
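To illustrate how these pipelines are laid out, a minimal pipeline accepting syslog over TCP and writing to the daily `logs-*` index might look like the following. This is a sketch only; the file name and port are assumptions, not taken from the repository:

```conf
# logstash/pipeline/system_logs.conf (hypothetical example)
input {
  tcp {
    port => 5140
    type => "syslog"
  }
}
filter {
  grok {
    match => { "message" => "%{SYSLOGLINE}" }
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{+YYYY.MM.dd}"
  }
}
```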
Kibana
Dashboards:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard
Saved Objects:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting
Grafana
Data Sources:
- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data
Dashboards:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
Log Ingestion
Sample Data
The stack includes realistic sample logs for testing:
# Ingest sample logs
curl -X POST "localhost:8080" \
-H "Content-Type: application/json" \
-d @logs/sample.log
Log Formats Supported
- Apache/Nginx: Combined log format
- Syslog: RFC 3164/5424 compliant
- JSON: Structured application logs
- Custom: Configurable parsing rules
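For reference, a structured JSON application log line of the kind these pipelines accept might look like this (all field names and values are illustrative):

```json
{"@timestamp": "2024-01-15T10:23:45Z", "level": "ERROR", "service": "checkout", "message": "upstream timeout", "client_ip": "203.0.113.7"}
```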
Data Enrichment
Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
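The enrichment steps above map onto standard Logstash filter plugins. A hedged sketch of such a filter block (the source field names are assumptions):

```conf
filter {
  # GeoIP location data derived from a client IP field
  geoip { source => "client_ip" }

  # User agent string parsed into structured fields
  useragent { source => "agent" target => "ua" }

  # Timestamp normalization: parse the event's own timestamp into @timestamp
  date { match => ["timestamp", "ISO8601"] }

  # Host metadata enrichment via static fields
  mutate { add_field => { "environment" => "production" } }
}
```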
Alerting and Monitoring
Alert Rules
Located in alerting/alert_rules.yml:
alert_rules:
- name: "High CPU Usage"
condition: "cpu_usage > 90"
duration: "5m"
severity: "critical"
channels: ["email", "slack"]
- name: "Disk Space Low"
condition: "disk_usage > 85"
duration: "10m"
severity: "warning"
channels: ["email"]
- name: "Service Down"
condition: "service_status == 'down'"
duration: "2m"
severity: "critical"
channels: ["email", "pagerduty"]
Alert Channels
- Email: SMTP-based notifications
- Slack: Real-time messaging
- PagerDuty: Incident management integration
- Webhook: Custom HTTP endpoints
Incident Simulation
Available Scenarios
cd scenarios
# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical
# Simulate service failure
./incident_simulation.sh --type service-down --service nginx
# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms
# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
Scenario Types
- disk-full: Filesystem capacity exhaustion
- service-down: Application service failures
- network-latency: Network performance degradation
- high-cpu: CPU utilization spikes
- memory-leak: Memory consumption growth
- log-flood: Excessive log generation
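The real scenarios live in scenarios/incident_simulation.sh; purely as an illustration, a `log-flood`-style burst could be generated with a few lines of shell (the helper name, file path, and line format are arbitrary):

```shell
# Generate a burst of synthetic log lines to exercise the ingestion pipeline.
# Illustrative only -- not taken from scenarios/incident_simulation.sh.
flood_logs() {
  out="$1"      # target file
  count="$2"    # number of lines to emit
  i=0
  while [ "$i" -lt "$count" ]; do
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) WARN flood-test synthetic log line $i" >> "$out"
    i=$((i + 1))
  done
}

# Example: flood_logs logs/flood.log 1000
```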
Dashboards and Visualization
Kibana Dashboards
Pre-configured dashboards for:
- Log Analysis
  - Log volume over time
  - Error rate trends
  - Top error messages
  - Geographic request distribution
- System Monitoring
  - CPU and memory usage
  - Disk I/O statistics
  - Network traffic
  - System load averages
- Application Performance
  - Response time distributions
  - Request rate metrics
  - Error percentages
  - User session analytics
Grafana Dashboards
Advanced visualization panels:
- Infrastructure Overview: Multi-system resource usage
- Application Metrics: Custom business KPIs
- Alert Status: Active alerts and trends
- Capacity Planning: Resource utilization forecasting
API Endpoints
Elasticsearch APIs
# Cluster health
GET /_cluster/health
# Index statistics
GET /_cat/indices?v
# Search logs
GET /logs-*/_search
{
"query": {
"match": {
"message": "ERROR"
}
}
}
Kibana APIs
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard
# Export visualizations
GET /api/saved_objects/visualization/{id}
Grafana APIs
# Get dashboard list
GET /api/search?query=*
# Alert status
GET /api/alerts
Configuration Management
Environment Variables
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"
# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"
# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
Scaling Configuration
For production deployment:
version: '3.8'
services:
elasticsearch:
deploy:
replicas: 3
resources:
limits:
memory: 4G
cpus: '2.0'
Security Considerations
Authentication
- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled
Network Security
- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)
Data Protection
- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
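Retention policies in Elasticsearch are typically expressed as ILM (index lifecycle management) policies. A minimal example that rolls indices over daily and deletes them after 30 days might look like this (the phase ages and sizes are assumptions, not recommendations):

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": { "rollover": { "max_age": "1d", "max_size": "10gb" } }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```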
Troubleshooting
Common Issues
Elasticsearch Won't Start:
# Check memory allocation
docker-compose logs elasticsearch
# Verify Java heap settings
docker-compose exec elasticsearch ps aux
Logstash Pipeline Errors:
# Check pipeline configuration
docker-compose logs logstash
# Validate pipeline syntax
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
Kibana Connection Issues:
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"
# Check Kibana logs
docker-compose logs kibana
Performance Tuning
Elasticsearch:
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization
Logstash:
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues
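These Logstash knobs map onto well-known settings in logstash.yml; a hedged excerpt (the values are starting points, not tuned recommendations):

```yaml
# logstash/config/logstash.yml (excerpt)
pipeline.workers: 4          # worker threads per pipeline
pipeline.batch.size: 250     # events per worker batch
queue.type: persisted        # enable the disk-backed persistent queue
queue.max_bytes: 1gb         # cap on persistent queue size
```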
Grafana:
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
Development and Testing
Adding New Dashboards
1. Create dashboard JSON in grafana/dashboards/
2. Update provisioning configuration
3. Restart Grafana service
Custom Alert Rules
1. Define rules in alerting/alert_rules.yml
2. Update alerting configuration
3. Test rules with simulation scenarios
Log Pipeline Development
1. Add pipeline configuration in logstash/pipeline/
2. Test with sample data
3. Validate parsing with Kibana
Backup and Recovery
Data Backup
# Elasticsearch snapshot
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
-H "Content-Type: application/json" \
-d '{"indices": "*"}'
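The snapshot command above assumes a repository named `backup` has already been registered via `PUT _snapshot/backup`. A filesystem repository definition looks like the following (the `location` is an assumption and must be listed under `path.repo` in elasticsearch.yml):

```json
{
  "type": "fs",
  "settings": {
    "location": "/usr/share/elasticsearch/backups"
  }
}
```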
Configuration Backup
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
logstash/ elasticsearch/ kibana/ grafana/
Contributing
- Follow existing configuration patterns
- Test changes with simulation scenarios
- Update documentation for new features
- Ensure backward compatibility
License
Enterprise Internal Use Only