# Observability Stack

A comprehensive monitoring and logging setup for enterprise infrastructure observability, built on the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. It includes sample data ingestion, alerting rules, and incident simulation scenarios.

## Problem Statement

Operations teams need correlated logs, dashboards, and alert examples that make incidents observable before they become customer-facing outages. A stack that only starts containers is not enough; it also needs meaningful sample data and incident exercises.

## Solution Overview

This project defines a local observability environment with Elasticsearch, Logstash, Kibana, Grafana, Filebeat, alert rules, sample logs, and an incident simulation script. It is built to demonstrate practical monitoring workflows rather than a production-sized cluster. The stack provides:

- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures

Core components:

- `docker-compose.yml` defines the observability services.
- `alerting/alert_rules.yml` records alert intent and severity.
- `logs/` contains representative operational logs.
- `scenarios/incident_simulation.sh` emits incident activity.
- `examples/` contains sample alert and log outputs.

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Log Sources   │    │    Logstash     │    │  Elasticsearch  │
│  (Applications  │───►│  (Ingestion &   │───►│   (Storage &    │
│   / Systems)    │    │   Processing)   │    │   Analytics)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Alerting     │    │     Kibana      │    │     Grafana     │
│      Rules      │    │  (Dashboards &  │    │   (Metrics &    │
│                 │    │   Exploration)  │    │   Dashboards)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Application/System Logs -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                                                         |
                                                         v
                                                      Grafana

Incident Scenario -> Sample Logs -> Alert Rules -> Operator Review
```

## Quick Start

### Prerequisites

- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), and 3000 (Grafana) available

### Setup

```bash
cd observability-stack

# Validate the compose model.
make test

# Start the stack.
make run    # or directly: docker-compose up -d

# Wait for services to be ready (may take 2-3 minutes).
sleep 180

# Verify services are running.
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"

# Run the incident simulation.
make demo

# Stop the stack.
docker compose down
```
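
If a fixed `sleep` feels brittle, you can poll Elasticsearch until the cluster reports a usable status instead. A minimal sketch, assuming the stack's default `elastic/elastic` credentials (a single-node cluster typically settles at yellow):

```bash
# Loop until the cluster reaches at least yellow; "timed_out":false only
# appears once wait_for_status was satisfied within the request timeout.
until curl -s -u elastic:elastic \
  "localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s" \
  | grep -q '"timed_out":false'; do
  echo "waiting for elasticsearch..."
  sleep 5
done
```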

### Access Interfaces

When running locally:

- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200

## Project Structure

```
observability-stack/
├── docker-compose.yml            # Service orchestration
├── logstash/                     # Logstash configuration
│   ├── pipeline/                 # Processing pipelines
│   └── config/                   # Logstash settings
├── elasticsearch/                # Elasticsearch configuration
│   └── config/                   # Cluster settings
├── kibana/                       # Kibana configuration
│   └── config/                   # Dashboard settings
├── grafana/                      # Grafana configuration
│   ├── provisioning/             # Dashboards and datasources
│   └── dashboards/               # Dashboard definitions
├── logs/                         # Sample log data
│   └── sample.log                # Realistic application logs
├── alerting/                     # Alert configuration
│   └── alert_rules.yml           # Alert definitions
├── scenarios/                    # Incident simulation
│   └── incident_simulation.sh    # Simulation scripts
├── examples/                     # Sample alert and log outputs
└── README.md
```

## Example Output

```text
[2026-04-29 04:18:23] WARN  Database connection pool nearing capacity
[2026-04-29 04:18:28] ERROR Database connection pool exhausted
[2026-04-29 04:18:33] ERROR Database query timeout occurred
[2026-04-29 04:18:44] INFO  Database connections restored
```

Additional examples are available in [examples/alert-output.txt](examples/alert-output.txt) and [examples/sample-log.txt](examples/sample-log.txt).

## Services Configuration

### Elasticsearch

**Configuration**: `elasticsearch/config/elasticsearch.yml`

Key settings:

- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access

**Data Indices**:

- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data
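
Once data is flowing, a quick check confirms these indices exist (again assuming the default `elastic/elastic` credentials):

```bash
# List the stack's indices with document counts and on-disk sizes.
curl -s -u elastic:elastic "localhost:9200/_cat/indices/logs-*,metrics-*,alerts-*?v"
```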

### Logstash

**Pipelines**: `logstash/pipeline/`

- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing

**Input Sources**:

- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
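
To exercise the syslog input by hand, you can push a single test line over TCP. The port below is an assumption; check the input blocks in `logstash/pipeline/` for the configured value:

```bash
# Send one RFC 3164-style line to the Logstash TCP input (port 5000 assumed).
echo "<134>$(date '+%b %d %H:%M:%S') testhost myapp: synthetic test event" \
  | nc -w 1 localhost 5000
```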

### Kibana

**Dashboards**:

- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard

**Saved Objects**:

- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting

### Grafana

**Data Sources**:

- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data

**Dashboards**:

- Infrastructure overview
- Application performance
- System resources
- Custom business metrics

## Log Ingestion

### Sample Data

The stack includes realistic sample logs for testing:

```bash
# Ingest the sample logs via the Logstash HTTP input (assumed to listen on 8080).
# The sample file is plain text, so send it as such rather than as JSON.
curl -X POST "localhost:8080" \
  -H "Content-Type: text/plain" \
  --data-binary @logs/sample.log
```
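
Posting the whole file typically arrives as a single event on an HTTP input; to get one event per log line, a small loop works (same assumed port):

```bash
# Stream the sample log one line per request so each line becomes an event.
while IFS= read -r line; do
  curl -s -X POST "localhost:8080" -H "Content-Type: text/plain" -d "$line"
done < logs/sample.log
```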

### Log Formats Supported

- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules

### Data Enrichment

Logstash pipelines add:

- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
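
A quick way to confirm enrichment is happening is to query for a field the pipeline should have added. The `geoip.location` name below is the Logstash `geoip` filter's default target; adjust it if the pipelines rename the field:

```bash
# Return one recent document that carries GeoIP enrichment.
curl -s -u elastic:elastic "localhost:9200/logs-*/_search?size=1&pretty" \
  -H "Content-Type: application/json" \
  -d '{"query": {"exists": {"field": "geoip.location"}}}'
```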

## Alerting and Monitoring

### Alert Rules

Located in `alerting/alert_rules.yml`:

```yaml
alert_rules:
  - name: "High CPU Usage"
    condition: "cpu_usage > 90"
    duration: "5m"
    severity: "critical"
    channels: ["email", "slack"]

  - name: "Disk Space Low"
    condition: "disk_usage > 85"
    duration: "10m"
    severity: "warning"
    channels: ["email"]

  - name: "Service Down"
    condition: "service_status == 'down'"
    duration: "2m"
    severity: "critical"
    channels: ["email", "pagerduty"]
```
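
These rules record declarative intent; to sanity-check one by hand, the equivalent Elasticsearch query can be run directly. A sketch for the "High CPU Usage" rule, assuming the `cpu_usage` field lands in the `metrics-*` indices and events carry the standard `@timestamp` field:

```bash
# Count documents from the last 5 minutes with cpu_usage above 90.
curl -s -u elastic:elastic "localhost:9200/metrics-*/_count" \
  -H "Content-Type: application/json" \
  -d '{
    "query": {
      "bool": {
        "filter": [
          {"range": {"cpu_usage": {"gt": 90}}},
          {"range": {"@timestamp": {"gte": "now-5m"}}}
        ]
      }
    }
  }'
```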

### Alert Channels

- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints

## Incident Simulation

### Available Scenarios

```bash
cd scenarios

# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical

# Simulate service failure
./incident_simulation.sh --type service-down --service nginx

# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms

# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```
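
After a scenario run, it is worth confirming the injected events actually reached the stack before reviewing dashboards. One way, assuming the simulated failures log at ERROR level:

```bash
# Show the five most recent ERROR events from the simulation window.
curl -s -u elastic:elastic \
  "localhost:9200/logs-*/_search?q=message:ERROR&sort=@timestamp:desc&size=5&pretty"
```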

### Scenario Types

- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation

## Dashboards and Visualization

### Kibana Dashboards

Pre-configured dashboards for:

1. **Log Analysis**
   - Log volume over time
   - Error rate trends
   - Top error messages
   - Geographic request distribution

2. **System Monitoring**
   - CPU and memory usage
   - Disk I/O statistics
   - Network traffic
   - System load averages

3. **Application Performance**
   - Response time distributions
   - Request rate metrics
   - Error percentages
   - User session analytics

### Grafana Dashboards

Advanced visualization panels:

- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting

## API Endpoints

### Elasticsearch APIs

```bash
# Cluster health
curl -u elastic:elastic "localhost:9200/_cluster/health?pretty"

# Index statistics
curl -u elastic:elastic "localhost:9200/_cat/indices?v"

# Search logs for ERROR messages
curl -u elastic:elastic "localhost:9200/logs-*/_search?pretty" \
  -H "Content-Type: application/json" \
  -d '{"query": {"match": {"message": "ERROR"}}}'
```

### Kibana APIs

```bash
# Get dashboard list
curl -u elastic:elastic "localhost:5601/api/saved_objects/_find?type=dashboard"

# Export a visualization by id (replace {id})
curl -u elastic:elastic "localhost:5601/api/saved_objects/visualization/{id}"
```

### Grafana APIs

```bash
# Get dashboard list
curl -u admin:admin "localhost:3000/api/search?query=*"

# Alert status (legacy alerting API; newer Grafana versions expose unified alerting endpoints)
curl -u admin:admin "localhost:3000/api/alerts"
```

## Configuration Management

### Environment Variables

```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"

# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"

# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```

### Scaling Configuration

For production deployment:

```yaml
version: '3.8'
services:
  elasticsearch:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: '2.0'
```
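
Before deploying an override like this, it can help to print the resolved compose model and confirm the settings landed where expected:

```bash
# Render the fully-resolved compose configuration, including override files.
docker-compose config
```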

## Security Considerations

### Authentication

- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled

### Network Security

- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)

### Data Protection

- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures

## Troubleshooting

### Common Issues

**Elasticsearch won't start:**

```bash
# Check startup logs for memory or bootstrap errors
docker-compose logs elasticsearch

# Verify Java heap settings on the running process
docker-compose exec elasticsearch ps aux
```

**Logstash pipeline errors:**

```bash
# Check pipeline logs
docker-compose logs logstash

# Validate pipeline syntax without starting the pipeline
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
```

**Kibana connection issues:**

```bash
# Verify Elasticsearch connectivity with the stack credentials
curl -u elastic:elastic "localhost:9200/_cluster/health"

# Check Kibana logs
docker-compose logs kibana
```

### Performance Tuning

**Elasticsearch:**

- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization

**Logstash:**

- Adjust worker threads
- Configure batch sizes
- Enable persistent queues

**Grafana:**

- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries

## Development and Testing

### Adding New Dashboards

1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service (see the sketch below)
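
A minimal sketch of step 3, assuming dashboards are file-provisioned as the project tree above suggests:

```bash
# Grafana re-reads file-provisioned dashboards on restart.
docker-compose restart grafana
```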

### Custom Alert Rules

1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios

### Log Pipeline Development

1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana

## Backup and Recovery

### Data Backup

```bash
# Elasticsearch snapshot (requires a registered "backup" repository; see below)
curl -X PUT -u elastic:elastic \
  "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```
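
The snapshot call assumes a filesystem repository named `backup` is already registered. A one-time setup sketch; the location must appear under `path.repo` in `elasticsearch.yml` and be mounted into the container:

```bash
# Register a filesystem snapshot repository named "backup".
curl -X PUT -u elastic:elastic "localhost:9200/_snapshot/backup" \
  -H "Content-Type: application/json" \
  -d '{"type": "fs", "settings": {"location": "/usr/share/elasticsearch/backup"}}'
```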

### Configuration Backup

```bash
# Back up all service configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
  logstash/ elasticsearch/ kibana/ grafana/
```

## Real-World Use Case

A platform team can use this project to explain how logs move through an ingestion pipeline, how alert rules map to operational symptoms, and how incident exercises create evidence for on-call readiness reviews.

## Contributing

1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility

## License

Enterprise Internal Use Only