# Observability Stack
A monitoring and logging stack for enterprise infrastructure observability, built on ELK (Elasticsearch, Logstash, Kibana) and Grafana. It includes sample data ingestion, alerting rules, and incident simulation scenarios.
## Overview
The Observability Stack provides a complete monitoring solution with:
- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures
## Architecture
```
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Log Sources │ │ Logstash │ │ Elasticsearch │
│ (Applications │───►│ (Ingestion & │───►│ (Storage & │
│ / Systems) │ │ Processing) │ │ Analytics) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Alerting │ │ Kibana │ │ Grafana │
│ Rules │ │ (Dashboards & │ │ (Metrics & │
│ │ │ Exploration) │ │ Dashboards) │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```
## Quick Start
### Prerequisites
- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available
### Setup
```bash
cd observability-stack
# Start the observability stack
docker-compose up -d
# Wait for services to be ready (may take 2-3 minutes)
sleep 180
# Verify services are running
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```
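A fixed `sleep` works but wastes time; a short polling loop is more reliable. A minimal sketch, assuming the default port and the `elastic`/`elastic` credentials used throughout this stack:
```bash
# Poll until Elasticsearch reports at least yellow cluster health
until curl -s -u elastic:elastic \
  "localhost:9200/_cluster/health?wait_for_status=yellow&timeout=10s" \
  | grep -q '"status"'; do
  echo "Waiting for Elasticsearch..."
  sleep 10
done
```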
### Access Interfaces
- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200
## Project Structure
```
observability-stack/
├── docker-compose.yml # Service orchestration
├── logstash/ # Logstash configuration
│ ├── pipeline/ # Processing pipelines
│ └── config/ # Logstash settings
├── elasticsearch/ # Elasticsearch configuration
│ └── config/ # Cluster settings
├── kibana/ # Kibana configuration
│ └── config/ # Dashboard settings
├── grafana/ # Grafana configuration
│ ├── provisioning/ # Dashboards and datasources
│ └── dashboards/ # Dashboard definitions
├── logs/ # Sample log data
│ └── sample.log # Realistic application logs
├── alerting/ # Alert configuration
│ └── alert_rules.yml # Alert definitions
├── scenarios/ # Incident simulation
│ └── incident_simulation.sh # Simulation scripts
└── README.md
```
## Services Configuration
### Elasticsearch
**Configuration**: `elasticsearch/config/elasticsearch.yml`
Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access
**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data
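Settings for these patterns can be applied up front with an index template. A minimal sketch (the template name and shard counts are assumptions, not part of this repo):
```bash
# Single shard, no replicas: sensible defaults for the single-node development cluster
curl -u elastic:elastic -X PUT "localhost:9200/_index_template/logs-template" \
  -H "Content-Type: application/json" \
  -d '{"index_patterns": ["logs-*"], "template": {"settings": {"number_of_shards": 1, "number_of_replicas": 0}}}'
```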
### Logstash
**Pipelines**: `logstash/pipeline/`
- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing
**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
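To smoke-test the syslog input end to end, push a single event over TCP; the port is an assumption, so check the actual input definition under `logstash/pipeline/`:
```bash
# Send one syslog-formatted test event to the Logstash TCP input (port 5000 assumed)
echo "<134>$(date '+%b %d %H:%M:%S') testhost myapp: syslog ingestion test" \
  | timeout 2 nc localhost 5000
```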
### Kibana
**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard
**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting
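Index patterns can also be created programmatically via the Kibana saved objects API; a sketch (the object ID is arbitrary):
```bash
# Create a logs-* index pattern keyed on @timestamp
curl -u elastic:elastic -X POST "localhost:5601/api/saved_objects/index-pattern/logs-pattern" \
  -H "kbn-xsrf: true" -H "Content-Type: application/json" \
  -d '{"attributes": {"title": "logs-*", "timeFieldName": "@timestamp"}}'
```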
### Grafana
**Data Sources**:
- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data
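Data sources normally come from `grafana/provisioning/`, but they can also be registered through the Grafana HTTP API; a sketch with assumed names and the compose-internal hostname:
```bash
# Register the Elasticsearch container as a Grafana data source
curl -u admin:admin -X POST "localhost:3000/api/datasources" \
  -H "Content-Type: application/json" \
  -d '{"name": "Elasticsearch", "type": "elasticsearch", "access": "proxy", "url": "http://elasticsearch:9200", "database": "logs-*"}'
```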
**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
## Log Ingestion
### Sample Data
The stack includes realistic sample logs for testing:
```bash
# Send the sample logs to the Logstash HTTP input, one event per line
curl -X POST "localhost:8080" \
  -H "Content-Type: text/plain" \
  --data-binary @logs/sample.log
```
### Log Formats Supported
- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
### Data Enrichment
Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
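Enrichment can be spot-checked by querying for documents that carry the added fields; this assumes the GeoIP filter writes to the conventional `geoip` field:
```bash
# Return one enriched document that has GeoIP country data attached
curl -u elastic:elastic \
  "localhost:9200/logs-*/_search?q=_exists_:geoip.country_name&size=1&pretty"
```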
## Alerting and Monitoring
### Alert Rules
Located in `alerting/alert_rules.yml`:
```yaml
alert_rules:
- name: "High CPU Usage"
condition: "cpu_usage > 90"
duration: "5m"
severity: "critical"
channels: ["email", "slack"]
- name: "Disk Space Low"
condition: "disk_usage > 85"
duration: "10m"
severity: "warning"
channels: ["email"]
- name: "Service Down"
condition: "service_status == 'down'"
duration: "2m"
severity: "critical"
channels: ["email", "pagerduty"]
```
### Alert Channels
- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints
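A webhook channel can be exercised by hand before wiring it into the rules; the receiver URL below is a placeholder:
```bash
# Post a hand-crafted alert payload to a webhook receiver (URL is a placeholder)
curl -X POST "http://localhost:9000/alerts" \
  -H "Content-Type: application/json" \
  -d '{"name": "High CPU Usage", "severity": "critical", "value": 95}'
```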
## Incident Simulation
### Available Scenarios
```bash
cd scenarios
# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical
# Simulate service failure
./incident_simulation.sh --type service-down --service nginx
# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms
# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```
### Scenario Types
- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes (a manual stand-in is sketched after this list)
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation
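As a manual stand-in for the high-cpu scenario, four busy loops pin four cores for one minute and then exit on their own:
```bash
# Drive four cores to ~100% for 60 seconds, then clean up automatically
for i in 1 2 3 4; do
  timeout 60 sh -c 'while :; do :; done' &
done
wait
```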
## Dashboards and Visualization
### Kibana Dashboards
Pre-configured dashboards for:
1. **Log Analysis**
- Log volume over time
- Error rate trends
- Top error messages
- Geographic request distribution
2. **System Monitoring**
- CPU and memory usage
- Disk I/O statistics
- Network traffic
- System load averages
3. **Application Performance**
- Response time distributions
- Request rate metrics
- Error percentages
- User session analytics
### Grafana Dashboards
Advanced visualization panels:
- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting
## API Endpoints
### Elasticsearch APIs
```bash
# Cluster health
curl -u elastic:elastic "localhost:9200/_cluster/health?pretty"
# Index statistics
curl -u elastic:elastic "localhost:9200/_cat/indices?v"
# Search logs for ERROR messages
curl -u elastic:elastic "localhost:9200/logs-*/_search" \
  -H "Content-Type: application/json" \
  -d '{"query": {"match": {"message": "ERROR"}}}'
```
### Kibana APIs
```bash
# Get dashboard list
curl -u elastic:elastic "localhost:5601/api/saved_objects/_find?type=dashboard"
# Export a visualization (replace {id} with the saved object ID)
curl -u elastic:elastic "localhost:5601/api/saved_objects/visualization/{id}"
```
### Grafana APIs
```bash
# Get dashboard list
curl -u admin:admin "localhost:3000/api/search?query=*"
# Alert status
curl -u admin:admin "localhost:3000/api/alerts"
```
## Configuration Management
### Environment Variables
```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"
# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"
# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```
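Assuming `docker-compose.yml` references these variables via `${...}` substitution, they can be overridden per launch from the shell:
```bash
# Give Elasticsearch more heap and set a non-default Grafana password for this run
ES_JAVA_OPTS="-Xms2g -Xmx2g" GF_SECURITY_ADMIN_PASSWORD="change-me" docker-compose up -d
```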
### Scaling Configuration
For production deployment (note that the `deploy` block takes effect when running under Docker Swarm):
```yaml
version: '3.8'
services:
elasticsearch:
deploy:
replicas: 3
resources:
limits:
memory: 4G
cpus: '2.0'
```
## Security Considerations
### Authentication
- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled
### Network Security
- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)
### Data Protection
- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
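Retention can be enforced with index lifecycle management; a minimal sketch that deletes log indices after 30 days (the policy name and age are illustrative):
```bash
# Create an ILM policy that deletes indices 30 days after creation
curl -u elastic:elastic -X PUT "localhost:9200/_ilm/policy/logs-retention" \
  -H "Content-Type: application/json" \
  -d '{"policy": {"phases": {"delete": {"min_age": "30d", "actions": {"delete": {}}}}}}'
```
The policy only takes effect once it is attached to indices, for example via `index.lifecycle.name` in an index template.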
## Troubleshooting
### Common Issues
**Elasticsearch Won't Start:**
```bash
# Check memory allocation
docker-compose logs elasticsearch
# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```
**Logstash Pipeline Errors:**
```bash
# Check pipeline configuration
docker-compose logs logstash
# Validate pipeline syntax (separate data path avoids the running instance's lock)
docker-compose exec logstash logstash -t --path.data /tmp/config-test \
  -f /usr/share/logstash/pipeline/
```
**Kibana Connection Issues:**
```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"
# Check Kibana logs
docker-compose logs kibana
```
### Performance Tuning
**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization
**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues
**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
## Development and Testing
### Adding New Dashboards
1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service
### Custom Alert Rules
1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios
### Log Pipeline Development
1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data (see the sketch after this list)
3. Validate parsing with Kibana
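For step 2, a throwaway stdin pipeline gives fast feedback on parsing; the separate `--path.data` avoids the running instance's lock on the data directory:
```bash
# Feed sample.log through a throwaway pipeline and print parsed events
docker-compose exec -T logstash logstash --path.data /tmp/ls-test \
  -e 'input { stdin {} } output { stdout { codec => rubydebug } }' < logs/sample.log
```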
## Backup and Recovery
### Data Backup
```bash
# One-time: register the "backup" snapshot repository (its location must be allowed by path.repo)
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup" \
  -H "Content-Type: application/json" \
  -d '{"type": "fs", "settings": {"location": "/usr/share/elasticsearch/backup"}}'
# Snapshot all indices
curl -u elastic:elastic -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```
### Configuration Backup
```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
logstash/ elasticsearch/ kibana/ grafana/
```
## Contributing
1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility
## License
Enterprise Internal Use Only