# Observability Stack
A monitoring and logging stack for enterprise infrastructure observability, built on ELK (Elasticsearch, Logstash, Kibana) and Grafana. It includes sample data ingestion, alerting rules, and incident simulation scenarios.
## Overview
The Observability Stack provides a complete monitoring solution with:
- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures
## Architecture
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Log Sources │ │ Logstash │ │ Elasticsearch │
│ (Applications │───►│ (Ingestion & │───►│ (Storage & │
│ / Systems) │ │ Processing) │ │ Analytics) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Alerting │ │ Kibana │ │ Grafana │
│ Rules │ │ (Dashboards & │ │ (Metrics & │
│ │ │ Exploration) │ │ Dashboards) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
## Quick Start
### Prerequisites
- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available
### Setup
```bash
cd observability-stack
# Start the observability stack
docker-compose up -d
# Wait for services to be ready (may take 2-3 minutes)
sleep 180
# Verify services are running
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```
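A fixed `sleep` works but wastes time; a short polling loop is more reliable. A minimal sketch, assuming the default port and the `elastic`/`elastic` credentials used throughout this stack:
```bash
# Poll until Elasticsearch reports at least yellow cluster health
until curl -s -u elastic:elastic \
  "localhost:9200/_cluster/health?wait_for_status=yellow&timeout=10s" \
  | grep -q '"status"'; do
  echo "Waiting for Elasticsearch..."
  sleep 10
done
```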
### Access Interfaces
- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200
## Project Structure
```
observability-stack/
├── docker-compose.yml # Service orchestration
├── logstash/ # Logstash configuration
│ ├── pipeline/ # Processing pipelines
│ └── config/ # Logstash settings
├── elasticsearch/ # Elasticsearch configuration
│ └── config/ # Cluster settings
├── kibana/ # Kibana configuration
│ └── config/ # Dashboard settings
├── grafana/ # Grafana configuration
│ ├── provisioning/ # Dashboards and datasources
│ └── dashboards/ # Dashboard definitions
├── logs/ # Sample log data
│ └── sample.log # Realistic application logs
├── alerting/ # Alert configuration
│ └── alert_rules.yml # Alert definitions
├── scenarios/ # Incident simulation
│ └── incident_simulation.sh # Simulation scripts
└── README.md
```
## Services Configuration
### Elasticsearch
**Configuration**: `elasticsearch/config/elasticsearch.yml`
Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access
**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data
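Settings for these patterns can be applied up front with an index template. A minimal sketch (the template name and shard counts are assumptions, not part of this repo):
```bash
# Single shard, no replicas: sensible defaults for the single-node development cluster
curl -u elastic:elastic -X PUT "localhost:9200/_index_template/logs-template" \
  -H "Content-Type: application/json" \
  -d '{"index_patterns": ["logs-*"], "template": {"settings": {"number_of_shards": 1, "number_of_replicas": 0}}}'
```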
### Logstash
**Pipelines**: `logstash/pipeline/`
- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing
**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
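To smoke-test the syslog input end to end, push a single event over TCP; the port is an assumption, so check the actual input definition under `logstash/pipeline/`:
```bash
# Send one syslog-formatted test event to the Logstash TCP input (port 5000 assumed)
echo "<134>$(date '+%b %d %H:%M:%S') testhost myapp: syslog ingestion test" \
  | timeout 2 nc localhost 5000
```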
### Kibana
**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard
**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting
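Index patterns can also be created programmatically via the Kibana saved objects API; a sketch (the object ID is arbitrary):
```bash
# Create a logs-* index pattern keyed on @timestamp
curl -u elastic:elastic -X POST "localhost:5601/api/saved_objects/index-pattern/logs-pattern" \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -d '{"attributes": {"title": "logs-*", "timeFieldName": "@timestamp"}}'
```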
### Grafana
**Data Sources**:
- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data
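Data sources normally come from `grafana/provisioning/`, but they can also be registered through the Grafana HTTP API; a sketch with assumed names and the compose-internal hostname:
```bash
# Register the Elasticsearch container as a Grafana data source
curl -u admin:admin -X POST "localhost:3000/api/datasources" \
  -H "Content-Type: application/json" \
  -d '{"name": "Elasticsearch", "type": "elasticsearch", "access": "proxy", "url": "http://elasticsearch:9200", "database": "logs-*"}'
```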
**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
## Log Ingestion
### Sample Data
The stack includes realistic sample logs for testing:
```bash
# Send the sample logs to the Logstash HTTP input, one event per line
curl -X POST "localhost:8080" \
  -H "Content-Type: text/plain" \
  --data-binary @logs/sample.log
```
### Log Formats Supported
- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
### Data Enrichment
Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
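Enrichment can be spot-checked by querying for documents that carry the added fields; this assumes the GeoIP filter writes to the conventional `geoip` field:
```bash
# Return one enriched document that has GeoIP country data attached
curl -u elastic:elastic \
  "localhost:9200/logs-*/_search?q=_exists_:geoip.country_name&size=1&pretty"
```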
## Alerting and Monitoring
### Alert Rules
Located in `alerting/alert_rules.yml`:
```yaml
alert_rules:
- name: "High CPU Usage"
condition: "cpu_usage > 90"
duration: "5m"
severity: "critical"
channels: ["email", "slack"]
- name: "Disk Space Low"
condition: "disk_usage > 85"
duration: "10m"
severity: "warning"
channels: ["email"]
- name: "Service Down"
condition: "service_status == 'down'"
duration: "2m"
severity: "critical"
channels: ["email", "pagerduty"]
```
### Alert Channels
- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints
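A webhook channel can be exercised by hand before wiring it into the rules; the receiver URL below is a placeholder:
```bash
# Post a hand-crafted alert payload to a webhook receiver (URL is a placeholder)
curl -X POST "http://localhost:9000/alerts" \
  -H "Content-Type: application/json" \
  -d '{"name": "High CPU Usage", "severity": "critical", "value": 95}'
```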
## Incident Simulation
### Available Scenarios
```bash
cd scenarios
# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical
# Simulate service failure
./incident_simulation.sh --type service-down --service nginx
# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms
# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```
### Scenario Types
- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes (a manual stand-in is sketched after this list)
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation
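As a manual stand-in for the high-cpu scenario, four busy loops pin four cores for one minute and then exit on their own:
```bash
# Drive four cores to ~100% for 60 seconds, then clean up automatically
for i in 1 2 3 4; do
  timeout 60 sh -c 'while :; do :; done' &
done
wait
```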
## Dashboards and Visualization
### Kibana Dashboards
Pre-configured dashboards for:
1. **Log Analysis**
- Log volume over time
- Error rate trends
- Top error messages
- Geographic request distribution
2. **System Monitoring**
- CPU and memory usage
- Disk I/O statistics
- Network traffic
- System load averages
3. **Application Performance**
- Response time distributions
- Request rate metrics
- Error percentages
- User session analytics
### Grafana Dashboards
Advanced visualization panels:
- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting
## API Endpoints
### Elasticsearch APIs
```bash
# Cluster health
curl -u elastic:elastic "localhost:9200/_cluster/health?pretty"
# Index statistics
curl -u elastic:elastic "localhost:9200/_cat/indices?v"
# Search logs for ERROR messages
curl -u elastic:elastic "localhost:9200/logs-*/_search" \
  -H "Content-Type: application/json" \
  -d '{"query": {"match": {"message": "ERROR"}}}'
```
### Kibana APIs
```bash
# Get dashboard list
curl -u elastic:elastic "localhost:5601/api/saved_objects/_find?type=dashboard"
# Export a visualization (replace {id} with the saved object ID)
curl -u elastic:elastic "localhost:5601/api/saved_objects/visualization/{id}"
```
### Grafana APIs
```bash
# Get dashboard list
curl -u admin:admin "localhost:3000/api/search?query=*"
# Alert status
curl -u admin:admin "localhost:3000/api/alerts"
```
## Configuration Management
### Environment Variables
```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"
# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"
# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```
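Assuming `docker-compose.yml` references these variables via `${...}` substitution, they can be overridden per launch from the shell:
```bash
# Give Elasticsearch more heap and set a non-default Grafana password for this run
ES_JAVA_OPTS="-Xms2g -Xmx2g" GF_SECURITY_ADMIN_PASSWORD="change-me" docker-compose up -d
```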
### Scaling Configuration
For production deployment (note that the `deploy` block takes effect when running under Docker Swarm):
```yaml
version: '3.8'
services:
elasticsearch:
deploy:
replicas: 3
resources:
limits:
memory: 4G
cpus: '2.0'
```
## Security Considerations
### Authentication
- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled
### Network Security
- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)
### Data Protection
- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
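Retention can be enforced with index lifecycle management; a minimal sketch that deletes log indices after 30 days (the policy name and age are illustrative):
```bash
# Create an ILM policy that deletes indices 30 days after creation
curl -u elastic:elastic -X PUT "localhost:9200/_ilm/policy/logs-retention" \
  -H "Content-Type: application/json" \
  -d '{"policy": {"phases": {"delete": {"min_age": "30d", "actions": {"delete": {}}}}}}'
```
The policy only takes effect once it is attached to indices, for example via `index.lifecycle.name` in an index template.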
## Troubleshooting
### Common Issues
**Elasticsearch Won't Start:**
```bash
# Check memory allocation
docker-compose logs elasticsearch
# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```
**Logstash Pipeline Errors:**
```bash
# Check pipeline configuration
docker-compose logs logstash
# Validate pipeline syntax (separate data path avoids the running instance's lock)
docker-compose exec logstash logstash -t --path.data /tmp/config-test \
  -f /usr/share/logstash/pipeline/
```
**Kibana Connection Issues:**
```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"
# Check Kibana logs
docker-compose logs kibana
```
### Performance Tuning
**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization
**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues
**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
## Development and Testing
### Adding New Dashboards
1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service
### Custom Alert Rules
1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios
### Log Pipeline Development
1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data (see the sketch after this list)
3. Validate parsing with Kibana
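For step 2, a throwaway stdin pipeline gives fast feedback on parsing; the separate `--path.data` avoids the running instance's lock on the data directory:
```bash
# Feed sample.log through a throwaway pipeline and print parsed events
docker-compose exec -T logstash logstash --path.data /tmp/ls-test \
  -e 'input { stdin {} } output { stdout { codec => rubydebug } }' < logs/sample.log
```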
## Backup and Recovery
### Data Backup
```bash
# One-time: register the "backup" snapshot repository (its location must be allowed by path.repo)
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup" \
  -H "Content-Type: application/json" \
  -d '{"type": "fs", "settings": {"location": "/usr/share/elasticsearch/backup"}}'
# Snapshot all indices
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```
### Configuration Backup
```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
logstash/ elasticsearch/ kibana/ grafana/
```
## Contributing
1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility
## License
Enterprise Internal Use Only