# Observability Stack

A comprehensive monitoring and logging stack for enterprise infrastructure observability, built on the ELK stack (Elasticsearch, Logstash, Kibana) and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.

## Overview

The Observability Stack provides a complete monitoring solution with:

- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures

## Architecture

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Log Sources   │    │    Logstash     │    │  Elasticsearch  │
│  (Applications  │───►│  (Ingestion &   │───►│   (Storage &    │
│   / Systems)    │    │   Processing)   │    │   Analytics)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Alerting     │    │     Kibana      │    │     Grafana     │
│     Rules       │    │  (Dashboards &  │    │   (Metrics &    │
│                 │    │   Exploration)  │    │   Dashboards)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
## Quick Start

### Prerequisites

- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available

### Setup

```bash
cd observability-stack

# Start the observability stack
docker-compose up -d

# Wait for services to be ready (may take 2-3 minutes)
sleep 180

# Verify services are running
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```
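Rather than sleeping a fixed 180 seconds, startup can be confirmed by polling the Elasticsearch health endpoint until the cluster reports `green` or `yellow`. A minimal sketch (the URL and response shape follow the standard `_cluster/health` API; the 5-second interval is an arbitrary choice):

```python
import json
import time
import urllib.request

def is_ready(health: dict) -> bool:
    """True once the cluster reports a usable status."""
    return health.get("status") in ("green", "yellow")

def wait_for_elasticsearch(url="http://localhost:9200/_cluster/health",
                           timeout=300, interval=5):
    """Poll the health endpoint until the cluster is ready or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except OSError:
            pass  # service not up yet; keep polling
        time.sleep(interval)
    return False

# wait_for_elasticsearch()  # blocks until the stack answers, then proceed
```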
### Access Interfaces

- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200

## Project Structure

```
observability-stack/
├── docker-compose.yml          # Service orchestration
├── logstash/                   # Logstash configuration
│   ├── pipeline/               # Processing pipelines
│   └── config/                 # Logstash settings
├── elasticsearch/              # Elasticsearch configuration
│   └── config/                 # Cluster settings
├── kibana/                     # Kibana configuration
│   └── config/                 # Dashboard settings
├── grafana/                    # Grafana configuration
│   ├── provisioning/           # Dashboards and datasources
│   └── dashboards/             # Dashboard definitions
├── logs/                       # Sample log data
│   └── sample.log              # Realistic application logs
├── alerting/                   # Alert configuration
│   └── alert_rules.yml         # Alert definitions
├── scenarios/                  # Incident simulation
│   └── incident_simulation.sh  # Simulation scripts
└── README.md
```
## Services Configuration

### Elasticsearch

**Configuration**: `elasticsearch/config/elasticsearch.yml`

Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access

**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data

### Logstash

**Pipelines**: `logstash/pipeline/`

- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing

**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs

### Kibana

**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard

**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting

### Grafana

**Data Sources**:
- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data

**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
## Log Ingestion

### Sample Data

The stack includes realistic sample logs for testing:

```bash
# Ingest sample logs (assumes the Logstash HTTP input is listening on port 8080)
curl -X POST "localhost:8080" \
  -H "Content-Type: application/json" \
  -d @logs/sample.log
```

### Log Formats Supported

- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
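Inside the stack, Logstash grok filters do the parsing; as a reference for what the Apache/Nginx combined format contains, here is a minimal standalone parser (field names are illustrative, not the pipeline's actual field mapping):

```python
import re

# Combined log format:
# host ident authuser [timestamp] "request" status bytes "referrer" "user_agent"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)

def parse_combined(line: str):
    """Return a dict of fields, or None if the line does not match."""
    m = COMBINED.match(line)
    return m.groupdict() if m else None

sample = ('203.0.113.9 - alice [10/Oct/2025:13:55:36 +0000] '
          '"GET /index.html HTTP/1.1" 200 2326 '
          '"http://example.com/" "Mozilla/5.0"')
```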
### Data Enrichment

Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
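Timestamp normalization means rewriting source-specific formats into one canonical form (in Logstash this is the job of the `date` filter). As an illustration of the transformation itself, a sketch converting an Apache-style timestamp to UTC ISO 8601:

```python
from datetime import datetime, timezone

APACHE_FMT = "%d/%b/%Y:%H:%M:%S %z"  # e.g. 10/Oct/2025:13:55:36 +0200

def normalize(ts: str) -> str:
    """Convert an Apache-style timestamp to UTC ISO 8601."""
    dt = datetime.strptime(ts, APACHE_FMT)
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
```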
## Alerting and Monitoring

### Alert Rules

Located in `alerting/alert_rules.yml`:

```yaml
alert_rules:
  - name: "High CPU Usage"
    condition: "cpu_usage > 90"
    duration: "5m"
    severity: "critical"
    channels: ["email", "slack"]

  - name: "Disk Space Low"
    condition: "disk_usage > 85"
    duration: "10m"
    severity: "warning"
    channels: ["email"]

  - name: "Service Down"
    condition: "service_status == 'down'"
    duration: "2m"
    severity: "critical"
    channels: ["email", "pagerduty"]
```
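Each rule is a threshold condition that must hold for `duration` before the alert fires. A sketch of evaluating such conditions against one metrics sample (the rule dicts mirror the YAML schema above; the evaluator is illustrative, not the stack's actual alert engine, and ignores `duration`):

```python
import operator
import re

# Mirrors the alert_rules.yml schema above (duration/channels omitted here)
RULES = [
    {"name": "High CPU Usage", "condition": "cpu_usage > 90", "severity": "critical"},
    {"name": "Disk Space Low", "condition": "disk_usage > 85", "severity": "warning"},
]

_OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}
_COND = re.compile(r"(\w+)\s*(>=|<=|>|<)\s*([\d.]+)")

def firing(rules, metrics):
    """Return the names of rules whose condition holds for this sample."""
    names = []
    for rule in rules:
        field, op, threshold = _COND.match(rule["condition"]).groups()
        if field in metrics and _OPS[op](metrics[field], float(threshold)):
            names.append(rule["name"])
    return names
```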
### Alert Channels

- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints

## Incident Simulation

### Available Scenarios

```bash
cd scenarios

# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical

# Simulate service failure
./incident_simulation.sh --type service-down --service nginx

# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms

# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```

### Scenario Types

- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation
## Dashboards and Visualization

### Kibana Dashboards

Pre-configured dashboards for:

1. **Log Analysis**
   - Log volume over time
   - Error rate trends
   - Top error messages
   - Geographic request distribution

2. **System Monitoring**
   - CPU and memory usage
   - Disk I/O statistics
   - Network traffic
   - System load averages

3. **Application Performance**
   - Response time distributions
   - Request rate metrics
   - Error percentages
   - User session analytics

### Grafana Dashboards

Advanced visualization panels:

- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting
## API Endpoints

### Elasticsearch APIs

```bash
# Cluster health
GET /_cluster/health

# Index statistics
GET /_cat/indices?v

# Search logs
GET /logs-*/_search
{
  "query": {
    "match": {
      "message": "ERROR"
    }
  }
}
```
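The same search can be issued from a script. A stdlib-only sketch (host and index pattern match the defaults used elsewhere in this README; add basic-auth headers if security is enabled):

```python
import json
import urllib.request

def build_match_query(field: str, value: str) -> dict:
    """Build the match query shown above as a request body."""
    return {"query": {"match": {field: value}}}

def search_logs(field="message", value="ERROR",
                url="http://localhost:9200/logs-*/_search"):
    """POST the query to Elasticsearch and return the parsed response."""
    body = json.dumps(build_match_query(field, value)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# hits = search_logs()["hits"]["hits"]  # requires the stack to be running
```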
### Kibana APIs

```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard

# Export visualizations
GET /api/saved_objects/visualization/{id}
```

### Grafana APIs

```bash
# Get dashboard list
GET /api/search?query=*

# Alert status
GET /api/alerts
```
## Configuration Management

### Environment Variables

```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"

# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"

# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```

### Scaling Configuration

For production deployment:

```yaml
version: '3.8'
services:
  elasticsearch:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: '2.0'
```
## Security Considerations

### Authentication

- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled

### Network Security

- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)

### Data Protection

- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
## Troubleshooting

### Common Issues

**Elasticsearch Won't Start:**

```bash
# Check memory allocation
docker-compose logs elasticsearch

# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```

**Logstash Pipeline Errors:**

```bash
# Check pipeline configuration
docker-compose logs logstash

# Validate pipeline syntax
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
```

**Kibana Connection Issues:**

```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"

# Check Kibana logs
docker-compose logs kibana
```

### Performance Tuning

**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization

**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues

**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
## Development and Testing

### Adding New Dashboards

1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service

### Custom Alert Rules

1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios

### Log Pipeline Development

1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana
## Backup and Recovery

### Data Backup

```bash
# Elasticsearch snapshot (assumes a snapshot repository named "backup" is registered)
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```

### Configuration Backup

```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
  logstash/ elasticsearch/ kibana/ grafana/
```
## Contributing

1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility

## License

Enterprise Internal Use Only