Polish infrastructure portfolio projects

Mateusz Suski
2026-04-29 23:30:30 +00:00
parent b0537b4bff
commit 8783892241
34 changed files with 762 additions and 1226 deletions
+10
@@ -0,0 +1,10 @@
.PHONY: run test demo

run:
	docker compose up -d

test:
	docker compose config

demo:
	./scenarios/incident_simulation.sh comprehensive
+40 -434
@@ -1,461 +1,67 @@
# Observability Stack
A comprehensive monitoring and logging stack for enterprise infrastructure observability using the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.
## Problem Statement
## Overview
Operations teams need correlated logs, dashboards, and alert examples that make incidents observable before they become customer-facing outages. A stack that only starts containers is not enough; it also needs meaningful sample data and incident exercises.
The Observability Stack provides a complete monitoring solution with:
## Solution Overview
- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures
This project defines a local observability environment with Elasticsearch, Logstash, Kibana, Grafana, Filebeat, alert rules, sample logs, and an incident simulation script. It is built to demonstrate practical monitoring workflows rather than a production-sized cluster.
## Architecture
## Architecture Overview
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Log Sources   │    │    Logstash     │    │  Elasticsearch  │
│  (Applications  │───►│  (Ingestion &   │───►│   (Storage &    │
│   / Systems)    │    │   Processing)   │    │   Analytics)    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Alerting     │    │     Kibana      │    │     Grafana     │
│     Rules       │    │  (Dashboards &  │    │   (Metrics &    │
│                 │    │   Exploration)  │    │   Dashboards)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘

Application/System Logs -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                                                         |
                                                         v
                                                      Grafana
Incident Scenario -> Sample Logs -> Alert Rules -> Operator Review
```
## Quick Start
Core components:
### Prerequisites
- `docker-compose.yml` defines the observability services.
- `alerting/alert_rules.yml` records alert intent and severity.
- `logs/` contains representative operational logs.
- `scenarios/incident_simulation.sh` emits incident activity.
- `examples/` contains sample alert and log outputs.
- Docker and Docker Compose
- At least 4GB RAM available
- Ports 5601 (Kibana), 9200 (Elasticsearch), 3000 (Grafana) available
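Before starting the stack, it helps to confirm those ports are actually free; a minimal check, assuming a Linux host with `ss` available:

```bash
# Print any listener already bound to the ports the stack needs; silence means they are free.
ss -ltn | grep -E ':(9200|5601|3000)\s' || echo "Required ports are free"
```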
### Setup
## How to Run
```bash
cd observability-stack
# Start the observability stack
docker-compose up -d
# Validate the compose model.
make test
# Wait for services to be ready (may take 2-3 minutes)
sleep 180
# Start the stack.
make run
# Verify services are running
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
# Run the incident simulation.
make demo
# Stop the stack.
docker compose down
```
### Access Interfaces
When running locally:
- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200
- Kibana: `http://localhost:5601`
- Grafana: `http://localhost:3000`
- Elasticsearch: `http://localhost:9200`
## Project Structure
## Example Output
```
observability-stack/
├── docker-compose.yml # Service orchestration
├── logstash/ # Logstash configuration
│ ├── pipeline/ # Processing pipelines
│ └── config/ # Logstash settings
├── elasticsearch/ # Elasticsearch configuration
│ └── config/ # Cluster settings
├── kibana/ # Kibana configuration
│ └── config/ # Dashboard settings
├── grafana/ # Grafana configuration
│ ├── provisioning/ # Dashboards and datasources
│ └── dashboards/ # Dashboard definitions
├── logs/ # Sample log data
│ └── sample.log # Realistic application logs
├── alerting/ # Alert configuration
│ └── alert_rules.yml # Alert definitions
├── scenarios/ # Incident simulation
│ └── incident_simulation.sh # Simulation scripts
└── README.md
```

```text
[2026-04-29 04:18:23] WARN Database connection pool nearing capacity
[2026-04-29 04:18:28] ERROR Database connection pool exhausted
[2026-04-29 04:18:33] ERROR Database query timeout occurred
[2026-04-29 04:18:44] INFO Database connections restored
```
## Services Configuration
Additional examples are available in [examples/alert-output.txt](examples/alert-output.txt) and [examples/sample-log.txt](examples/sample-log.txt).
### Elasticsearch
## Real-World Use Case
**Configuration**: `elasticsearch/config/elasticsearch.yml`
Key settings:
- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access
**Data Indices**:
- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data
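Once data is flowing, the presence of those indices can be confirmed from the command line; a small sketch that assumes the index patterns above and the default `elastic/elastic` credentials used elsewhere in this README:

```bash
# List observability indices with document counts and on-disk size.
for pattern in "logs-*" "metrics-*" "alerts-*"; do
  curl -s -u elastic:elastic "localhost:9200/_cat/indices/${pattern}?v&h=index,docs.count,store.size"
done
```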
### Logstash
**Pipelines**: `logstash/pipeline/`
- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing
**Input Sources**:
- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs
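For a quick end-to-end test of the ingest path, a single event can be pushed at the TCP input; the port here is an assumption and should match the `tcp` input defined under `logstash/pipeline/`:

```bash
# Send one synthetic JSON log event to the Logstash TCP input (port 5000 is assumed).
echo '{"message": "ERROR synthetic test event", "service": "demo"}' | nc -w 1 localhost 5000
```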
### Kibana
**Dashboards**:
- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard
**Saved Objects**:
- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting
### Grafana
**Data Sources**:
- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data
**Dashboards**:
- Infrastructure overview
- Application performance
- System resources
- Custom business metrics
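Provisioned data sources and dashboards can be verified through Grafana's HTTP API; a minimal sketch using the default admin credentials from the setup section:

```bash
# List the data sources Grafana has provisioned.
curl -s -u admin:admin "localhost:3000/api/datasources"

# List dashboards known to Grafana.
curl -s -u admin:admin "localhost:3000/api/search?type=dash-db"
```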
## Log Ingestion
### Sample Data
The stack includes realistic sample logs for testing:
```bash
# Ingest sample logs
curl -X POST "localhost:8080" \
-H "Content-Type: application/json" \
-d @logs/sample.log
```
### Log Formats Supported
- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules
### Data Enrichment
Logstash pipelines add:
- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment
## Alerting and Monitoring
### Alert Rules
Located in `alerting/alert_rules.yml`:
```yaml
alert_rules:
- name: "High CPU Usage"
condition: "cpu_usage > 90"
duration: "5m"
severity: "critical"
channels: ["email", "slack"]
- name: "Disk Space Low"
condition: "disk_usage > 85"
duration: "10m"
severity: "warning"
channels: ["email"]
- name: "Service Down"
condition: "service_status == 'down'"
duration: "2m"
severity: "critical"
channels: ["email", "pagerduty"]
```
### Alert Channels
- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints
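Webhook channels are the simplest to smoke-test by hand; a sketch with a placeholder endpoint and payload shape, neither of which is the project's actual contract:

```bash
# Post a fake firing alert to a custom webhook receiver (URL and payload are placeholders).
curl -X POST "https://hooks.example.com/observability" \
  -H "Content-Type: application/json" \
  -d '{"name": "High CPU Usage", "severity": "critical", "status": "firing"}'
```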
## Incident Simulation
### Available Scenarios
```bash
cd scenarios
# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical
# Simulate service failure
./incident_simulation.sh --type service-down --service nginx
# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms
# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```
### Scenario Types
- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation
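To exercise several of these in one pass, assuming the flag interface shown above accepts each type:

```bash
# Run a batch of scenarios back to back, pausing so alerts can fire and resolve between runs.
for type in disk-full service-down network-latency high-cpu memory-leak log-flood; do
  ./incident_simulation.sh --type "$type"
  sleep 60
done
```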
## Dashboards and Visualization
### Kibana Dashboards
Pre-configured dashboards for:
1. **Log Analysis**
- Log volume over time
- Error rate trends
- Top error messages
- Geographic request distribution
2. **System Monitoring**
- CPU and memory usage
- Disk I/O statistics
- Network traffic
- System load averages
3. **Application Performance**
- Response time distributions
- Request rate metrics
- Error percentages
- User session analytics
### Grafana Dashboards
Advanced visualization panels:
- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting
## API Endpoints
### Elasticsearch APIs
```bash
# Cluster health
GET /_cluster/health
# Index statistics
GET /_cat/indices?v
# Search logs
GET /logs-*/_search
{
"query": {
"match": {
"message": "ERROR"
}
}
}
```
### Kibana APIs
```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard
# Export visualizations
GET /api/saved_objects/visualization/{id}
```
### Grafana APIs
```bash
# Get dashboard list
GET /api/search?query=*
# Alert status
GET /api/alerts
```
## Configuration Management
### Environment Variables
```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"
# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"
# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```
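These values can be moved into an `.env` file, which Docker Compose reads for variable substitution (assuming `docker-compose.yml` references them as `${VAR}`); a minimal sketch with the same defaults:

```bash
# Write the defaults above into .env so docker compose can substitute them.
cat > .env <<'EOF'
ELASTIC_PASSWORD=elastic
GF_SECURITY_ADMIN_PASSWORD=admin
ES_JAVA_OPTS=-Xms1g -Xmx1g
LS_JAVA_OPTS=-Xms512m -Xmx512m
EOF

# Confirm the resolved configuration without starting anything.
docker compose config
```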
### Scaling Configuration
For production deployment:
```yaml
version: '3.8'
services:
elasticsearch:
deploy:
replicas: 3
resources:
limits:
memory: 4G
cpus: '2.0'
```
## Security Considerations
### Authentication
- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled
### Network Security
- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)
### Data Protection
- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures
## Troubleshooting
### Common Issues
**Elasticsearch Won't Start:**
```bash
# Check memory allocation
docker-compose logs elasticsearch
# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```
**Logstash Pipeline Errors:**
```bash
# Check pipeline configuration
docker-compose logs logstash
# Validate pipeline syntax
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
```
**Kibana Connection Issues:**
```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"
# Check Kibana logs
docker-compose logs kibana
```
### Performance Tuning
**Elasticsearch:**
- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization
**Logstash:**
- Adjust worker threads
- Configure batch sizes
- Enable persistent queues
**Grafana:**
- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries
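As one concrete example of the Elasticsearch tuning above, the refresh interval on log indices can be relaxed during heavy ingestion; the index pattern and interval here are illustrative values, not project defaults:

```bash
# Reduce refresh overhead on log indices during bulk ingestion.
curl -s -u elastic:elastic -X PUT "localhost:9200/logs-*/_settings" \
  -H "Content-Type: application/json" \
  -d '{"index": {"refresh_interval": "30s"}}'
```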
## Development and Testing
### Adding New Dashboards
1. Create dashboard JSON in `grafana/dashboards/`
2. Update provisioning configuration
3. Restart Grafana service
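In practice that loop looks roughly like this (the dashboard filename is a placeholder):

```bash
# Drop a new dashboard definition where the provisioner looks for it, then reload Grafana.
cp my-dashboard.json grafana/dashboards/
docker-compose restart grafana
```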
### Custom Alert Rules
1. Define rules in `alerting/alert_rules.yml`
2. Update alerting configuration
3. Test rules with simulation scenarios
### Log Pipeline Development
1. Add pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana
## Backup and Recovery
### Data Backup
```bash
# Elasticsearch snapshot
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
-H "Content-Type: application/json" \
-d '{"indices": "*"}'
```
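The snapshot call assumes a repository named `backup` is already registered; registering a filesystem repository first looks roughly like this (the location must appear under `path.repo` in elasticsearch.yml and be mounted into the container):

```bash
# Register a filesystem snapshot repository named "backup".
curl -X PUT "localhost:9200/_snapshot/backup" \
  -H "Content-Type: application/json" \
  -d '{"type": "fs", "settings": {"location": "/usr/share/elasticsearch/backup"}}'
```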
### Configuration Backup
```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
logstash/ elasticsearch/ kibana/ grafana/
```
## Contributing
1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility
## License
Enterprise Internal Use Only
A platform team can use this project to explain how logs move through an ingestion pipeline, how alert rules map to operational symptoms, and how incident exercises create evidence for on-call readiness reviews.
+1 -3
@@ -1,5 +1,3 @@
version: '3.8'
services:
elasticsearch:
image: docker.elastic.co/elasticsearch/elasticsearch:8.11.0
@@ -119,4 +117,4 @@ networks:
driver: bridge
ipam:
config:
- subnet: 172.25.0.0/16
- subnet: 172.25.0.0/16
+30
@@ -0,0 +1,30 @@
# Observability Stack Architecture
## Components
- Filebeat: tails sample and container logs.
- Logstash: receives and processes log events.
- Elasticsearch: stores searchable observability data.
- Kibana: supports log exploration and dashboards.
- Grafana: provides operational dashboards.
- Alert rules: document symptoms, thresholds, and severity.
- Incident simulation: generates controlled failure signals.
## Data Flow
```
Log source -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                                            |
                                            v
                                         Grafana
```
Incident exercises follow this flow:
```
Operator -> incident_simulation.sh -> logs/incident_simulation.log -> Filebeat -> Logstash -> alerts/dashboards
```
## Notes
This is a local demonstration stack, not a production Elasticsearch deployment. A production version would add dedicated nodes, TLS, secret management, retention policies, index lifecycle management, and external alert delivery.
@@ -0,0 +1,4 @@
2026-04-29T04:19:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=100 threshold=95 status=firing
2026-04-29T04:19:30Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=7.8 threshold=5.0 status=firing
2026-04-29T04:22:00Z alert=database_connection_pool_exhausted severity=critical service=checkout-api host=app-web-02 value=71 threshold=95 status=resolved
2026-04-29T04:23:15Z alert=api_error_rate_high severity=warning service=checkout-api host=app-web-02 value=1.2 threshold=5.0 status=resolved
@@ -0,0 +1,5 @@
2026-04-29T04:18:21Z INFO service=checkout-api host=app-web-02 request_id=8f4b2 path=/checkout status=200 latency_ms=142
2026-04-29T04:18:28Z WARN service=checkout-api host=app-web-02 event=db_pool_pressure active=92 max=100
2026-04-29T04:18:33Z ERROR service=checkout-api host=app-web-02 event=db_timeout query=CreateOrder timeout_ms=5000 customer_tier=enterprise
2026-04-29T04:18:39Z ERROR service=checkout-api host=app-web-02 event=payment_retry_exhausted order_id=ord-104288 provider=stripe
2026-04-29T04:18:44Z INFO service=checkout-api host=app-web-02 event=recovery db_pool_active=48
@@ -0,0 +1,21 @@
# Scenario: Incident Simulation
## Description
Generate a controlled application and infrastructure incident so the logging pipeline, alert rules, and dashboards can be reviewed with realistic event timing.
## Commands
```bash
cd observability-stack
docker compose config
./scenarios/incident_simulation.sh comprehensive
tail -n 40 logs/incident_simulation.log
```
## Expected Result
- The compose file validates successfully.
- The simulation writes a sequence of CPU, memory, service, database, and application error events.
- Alert examples indicate firing and resolved states.
- Operators can trace incident progression through logs and dashboard queries.
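A quick way to confirm a run produced the expected error-level events (paths follow the commands above):

```bash
# Count error events written by the simulation; a non-zero count means the scenario fired.
grep -c "ERROR" logs/incident_simulation.log
```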