# Observability Stack

A comprehensive monitoring and logging stack for enterprise infrastructure observability, built on the ELK (Elasticsearch, Logstash, Kibana) stack and Grafana. Includes sample data ingestion, alerting rules, and incident simulation scenarios.

## Overview

The Observability Stack provides a complete monitoring solution with:

- **Elasticsearch**: Distributed search and analytics engine for logs and metrics
- **Logstash**: Data processing pipeline for log ingestion and transformation
- **Kibana**: Visualization and exploration interface for Elasticsearch data
- **Grafana**: Advanced metrics dashboarding and alerting platform
- **Sample Logs**: Realistic log data for testing and demonstration
- **Alerting**: Automated incident detection and notification rules
- **Incident Simulation**: Scenarios for testing monitoring and response procedures

## Architecture

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Log Sources   │     │    Logstash     │     │  Elasticsearch  │
│  (Applications  │───► │  (Ingestion &   │───► │   (Storage &    │
│   / Systems)    │     │   Processing)   │     │    Analytics)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│    Alerting     │     │     Kibana      │     │     Grafana     │
│     Rules       │     │  (Dashboards &  │     │   (Metrics &    │
│                 │     │   Exploration)  │     │   Dashboards)   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
```

## Quick Start

### Prerequisites

- Docker and Docker Compose
- At least 4 GB of available RAM
- Ports 5601 (Kibana), 9200 (Elasticsearch), and 3000 (Grafana) available

### Setup

```bash
cd observability-stack

# Start the observability stack
docker-compose up -d

# Wait for services to be ready (may take 2-3 minutes)
sleep 180

# Verify services are running
curl -X GET "localhost:9200/_cluster/health?pretty"
curl -X GET "localhost:5601/api/status"
curl -X GET "localhost:3000/api/health"
```

### Access Interfaces

- **Kibana**: http://localhost:5601 (elastic/elastic)
- **Grafana**: http://localhost:3000 (admin/admin)
- **Elasticsearch**: http://localhost:9200

## Project Structure

```
observability-stack/
├── docker-compose.yml          # Service orchestration
├── logstash/                   # Logstash configuration
│   ├── pipeline/               # Processing pipelines
│   └── config/                 # Logstash settings
├── elasticsearch/              # Elasticsearch configuration
│   └── config/                 # Cluster settings
├── kibana/                     # Kibana configuration
│   └── config/                 # Dashboard settings
├── grafana/                    # Grafana configuration
│   ├── provisioning/           # Dashboards and datasources
│   └── dashboards/             # Dashboard definitions
├── logs/                       # Sample log data
│   └── sample.log              # Realistic application logs
├── alerting/                   # Alert configuration
│   └── alert_rules.yml         # Alert definitions
├── scenarios/                  # Incident simulation
│   └── incident_simulation.sh  # Simulation scripts
└── README.md
```

## Services Configuration

### Elasticsearch

**Configuration**: `elasticsearch/config/elasticsearch.yml`

Key settings:

- Single-node cluster for development
- Memory limits and heap sizing
- Security enabled with basic authentication
- CORS enabled for Kibana access

**Data Indices**:

- `logs-*`: Application and system logs
- `metrics-*`: System and application metrics
- `alerts-*`: Alert and incident data

### Logstash

**Pipelines**: `logstash/pipeline/`

- **apache_logs**: Apache/Nginx access log processing
- **system_logs**: System log parsing and enrichment
- **application_logs**: Custom application log processing
- **metrics_pipeline**: Metrics data processing

**Input Sources**:

- Filebeat agents
- TCP/UDP syslog inputs
- HTTP endpoints for metrics
- Docker container logs

### Kibana

**Dashboards**:

- Log analysis dashboard
- System metrics overview
- Application performance dashboard
- Security events dashboard

**Saved Objects**:

- Index patterns for log data
- Visualizations for common metrics
- Search queries for troubleshooting

### Grafana

**Data Sources**:

- Elasticsearch for logs and metrics
- Prometheus (if available)
- InfluxDB for time-series data
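The Elasticsearch data source can be registered automatically through Grafana's provisioning directory. The snippet below is a minimal sketch; the file path, data source name, and `logs-*` index pattern are illustrative assumptions, not files shipped in this repo:

```yaml
# grafana/provisioning/datasources/elasticsearch.yml (illustrative path)
apiVersion: 1

datasources:
  - name: Elasticsearch-Logs
    type: elasticsearch
    access: proxy
    url: http://elasticsearch:9200   # Compose service name, not localhost
    basicAuth: true
    basicAuthUser: elastic
    secureJsonData:
      basicAuthPassword: elastic     # matches the stack's ELASTIC_PASSWORD
    database: "logs-*"               # index pattern to query
    jsonData:
      timeField: "@timestamp"
```

Grafana reads files under `provisioning/datasources/` at startup, so restarting the Grafana container picks up changes to this file.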
**Dashboards**:

- Infrastructure overview
- Application performance
- System resources
- Custom business metrics

## Log Ingestion

### Sample Data

The stack includes realistic sample logs for testing:

```bash
# Ingest sample logs (port 8080 is assumed to be a Logstash HTTP input)
curl -X POST "localhost:8080" \
  -H "Content-Type: application/json" \
  -d @logs/sample.log
```

### Log Formats Supported

- **Apache/Nginx**: Combined log format
- **Syslog**: RFC 3164/5424 compliant
- **JSON**: Structured application logs
- **Custom**: Configurable parsing rules

### Data Enrichment

Logstash pipelines add:

- GeoIP location data
- User agent parsing
- Timestamp normalization
- Host metadata enrichment

## Alerting and Monitoring

### Alert Rules

Located in `alerting/alert_rules.yml`:

```yaml
alert_rules:
  - name: "High CPU Usage"
    condition: "cpu_usage > 90"
    duration: "5m"
    severity: "critical"
    channels: ["email", "slack"]

  - name: "Disk Space Low"
    condition: "disk_usage > 85"
    duration: "10m"
    severity: "warning"
    channels: ["email"]

  - name: "Service Down"
    condition: "service_status == 'down'"
    duration: "2m"
    severity: "critical"
    channels: ["email", "pagerduty"]
```

### Alert Channels

- **Email**: SMTP-based notifications
- **Slack**: Real-time messaging
- **PagerDuty**: Incident management integration
- **Webhook**: Custom HTTP endpoints

## Incident Simulation

### Available Scenarios

```bash
cd scenarios

# Simulate disk space exhaustion
./incident_simulation.sh --type disk-full --severity critical

# Simulate service failure
./incident_simulation.sh --type service-down --service nginx

# Simulate network latency
./incident_simulation.sh --type network-latency --delay 500ms

# Simulate high CPU usage
./incident_simulation.sh --type high-cpu --cores 4
```

### Scenario Types

- **disk-full**: Filesystem capacity exhaustion
- **service-down**: Application service failures
- **network-latency**: Network performance degradation
- **high-cpu**: CPU utilization spikes
- **memory-leak**: Memory consumption growth
- **log-flood**: Excessive log generation

## Dashboards and Visualization

### Kibana Dashboards

Pre-configured dashboards for:

1. **Log Analysis**
   - Log volume over time
   - Error rate trends
   - Top error messages
   - Geographic request distribution

2. **System Monitoring**
   - CPU and memory usage
   - Disk I/O statistics
   - Network traffic
   - System load averages

3. **Application Performance**
   - Response time distributions
   - Request rate metrics
   - Error percentages
   - User session analytics

### Grafana Dashboards

Advanced visualization panels:

- **Infrastructure Overview**: Multi-system resource usage
- **Application Metrics**: Custom business KPIs
- **Alert Status**: Active alerts and trends
- **Capacity Planning**: Resource utilization forecasting

## API Endpoints

### Elasticsearch APIs

```bash
# Cluster health
GET /_cluster/health

# Index statistics
GET /_cat/indices?v

# Search logs
GET /logs-*/_search
{
  "query": {
    "match": { "message": "ERROR" }
  }
}
```

### Kibana APIs

```bash
# Get dashboard list
GET /api/saved_objects/_find?type=dashboard

# Export visualizations
GET /api/saved_objects/visualization/{id}
```

### Grafana APIs

```bash
# Get dashboard list
GET /api/search?query=*

# Alert status
GET /api/alerts
```

## Configuration Management

### Environment Variables

```bash
# Elasticsearch
ES_JAVA_OPTS="-Xms1g -Xmx1g"
ELASTIC_PASSWORD="elastic"

# Logstash
LS_JAVA_OPTS="-Xms512m -Xmx512m"

# Grafana
GF_SECURITY_ADMIN_PASSWORD="admin"
```

### Scaling Configuration

For production deployment:

```yaml
version: '3.8'
services:
  elasticsearch:
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 4G
          cpus: '2.0'
```

## Security Considerations

### Authentication

- Elasticsearch basic authentication enabled
- Grafana admin credentials configured
- Kibana anonymous access disabled

### Network Security

- Services bound to localhost only
- Internal network for service communication
- TLS encryption for external access (production)

### Data Protection

- Elasticsearch encryption at rest
- Log data retention policies
- Backup and recovery procedures

## Troubleshooting

### Common Issues

**Elasticsearch won't start:**

```bash
# Check memory allocation
docker-compose logs elasticsearch

# Verify Java heap settings
docker-compose exec elasticsearch ps aux
```

**Logstash pipeline errors:**

```bash
# Check pipeline configuration
docker-compose logs logstash

# Validate pipeline syntax
docker-compose exec logstash logstash -t -f /usr/share/logstash/pipeline/
```

**Kibana connection issues:**

```bash
# Verify Elasticsearch connectivity
curl -u elastic:elastic "localhost:9200/_cluster/health"

# Check Kibana logs
docker-compose logs kibana
```

### Performance Tuning

**Elasticsearch:**

- Increase heap size for larger datasets
- Configure shard allocation
- Enable index optimization

**Logstash:**

- Adjust worker threads
- Configure batch sizes
- Enable persistent queues

**Grafana:**

- Configure query caching
- Set dashboard refresh intervals
- Optimize panel queries

## Development and Testing

### Adding New Dashboards

1. Create the dashboard JSON in `grafana/dashboards/`
2. Update the provisioning configuration
3. Restart the Grafana service

### Custom Alert Rules

1. Define rules in `alerting/alert_rules.yml`
2. Update the alerting configuration
3. Test rules with simulation scenarios

### Log Pipeline Development

1. Add a pipeline configuration in `logstash/pipeline/`
2. Test with sample data
3. Validate parsing with Kibana

## Backup and Recovery

### Data Backup

```bash
# Elasticsearch snapshot (assumes a snapshot repository named "backup"
# has already been registered via the _snapshot API)
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_$(date +%Y%m%d_%H%M%S)" \
  -H "Content-Type: application/json" \
  -d '{"indices": "*"}'
```

### Configuration Backup

```bash
# Backup all configurations
tar -czf backup_$(date +%Y%m%d).tar.gz \
  logstash/ elasticsearch/ kibana/ grafana/
```

## Contributing

1. Follow existing configuration patterns
2. Test changes with simulation scenarios
3. Update documentation for new features
4. Ensure backward compatibility

## License

Enterprise Internal Use Only