ci: configure and stabilize CI/CD pipeline
- fix runner configuration issues
- correct workflow labels and execution environment
- resolve dependency issues in pipeline (python deps)
- improve reliability of automation runs
@@ -0,0 +1,67 @@
# Observability Stack

## Problem Statement

Operations teams need correlated logs, dashboards, and alert examples that make incidents observable before they become customer-facing outages. A stack that only starts containers is not enough; it also needs meaningful sample data and incident exercises.

## Solution Overview

This project defines a local observability environment with Elasticsearch, Logstash, Kibana, Grafana, Filebeat, alert rules, sample logs, and an incident simulation script. It is built to demonstrate practical monitoring workflows rather than a production-sized cluster.

## Architecture Overview

```
Application/System Logs -> Filebeat -> Logstash -> Elasticsearch -> Kibana
                                                         |
                                                         v
                                                      Grafana

Incident Scenario -> Sample Logs -> Alert Rules -> Operator Review
```

Core components:

- `docker-compose.yml` defines the observability services.
- `alerting/alert_rules.yml` records alert intent and severity.
- `logs/` contains representative operational logs.
- `scenarios/incident_simulation.sh` emits incident activity.
- `examples/` contains sample alert and log outputs.

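
The schema of `alerting/alert_rules.yml` is not shown in this README, so the following is only a rough Python sketch of the idea that a rule pairs an intent with a severity. The `AlertRule` shape, the rule names, and the substring patterns are all hypothetical; the patterns are loosely based on the sample log messages in the Example Output section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule shape; the real alert_rules.yml schema may differ."""
    name: str
    pattern: str   # substring to look for in a log message
    severity: str  # e.g. "warning" or "critical"


# Illustrative rules only, not copied from the repository.
RULES = [
    AlertRule("db-pool-near-capacity", "connection pool nearing capacity", "warning"),
    AlertRule("db-pool-exhausted", "connection pool exhausted", "critical"),
    AlertRule("db-query-timeout", "query timeout", "critical"),
]


def match_rules(line: str, rules=RULES) -> list[AlertRule]:
    """Return every rule whose pattern appears in the line (case-insensitive)."""
    lowered = line.lower()
    return [rule for rule in rules if rule.pattern.lower() in lowered]


if __name__ == "__main__":
    sample = "[2026-04-29 04:18:28] ERROR Database connection pool exhausted"
    for rule in match_rules(sample):
        print(f"{rule.severity.upper()}: {rule.name}")
```

Substring matching keeps the sketch short; a real rule engine would more likely evaluate thresholds or queries over indexed events.
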
## How to Run

```bash
cd observability-stack

# Validate the compose model.
make test

# Start the stack.
make run

# Run the incident simulation.
make demo

# Stop the stack.
docker compose down
```

When running locally:

- Kibana: `http://localhost:5601`
- Grafana: `http://localhost:3000`
- Elasticsearch: `http://localhost:9200`

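
After `make run`, it can be handy to confirm that the three endpoints above respond before opening a browser. The script below is a minimal sketch using only the Python standard library; it is not part of the repository, and the function names `is_reachable` and `check_all` are made up for illustration. It treats any HTTP response, including an error status, as a sign that the service is up.

```python
import urllib.error
import urllib.request

# Local endpoints listed in this README.
ENDPOINTS = {
    "Kibana": "http://localhost:5601",
    "Grafana": "http://localhost:3000",
    "Elasticsearch": "http://localhost:9200",
}


def is_reachable(url: str, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with any HTTP response at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server responded (e.g. 401 when auth is enabled), so it is up.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return False


def check_all(probe=is_reachable) -> dict[str, bool]:
    """Probe every endpoint; `probe` is injectable so it can be stubbed."""
    return {name: probe(url) for name, url in ENDPOINTS.items()}


if __name__ == "__main__":
    for name, up in check_all().items():
        print(f"{name:<14} {'up' if up else 'unreachable'}")
```

Services such as Elasticsearch and Kibana can take a while to start, so an "unreachable" result shortly after startup may simply mean the container is still initializing.
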
## Example Output

```text
[2026-04-29 04:18:23] WARN Database connection pool nearing capacity
[2026-04-29 04:18:28] ERROR Database connection pool exhausted
[2026-04-29 04:18:33] ERROR Database query timeout occurred
[2026-04-29 04:18:44] INFO Database connections restored
```

Additional examples are available in [examples/alert-output.txt](examples/alert-output.txt) and [examples/sample-log.txt](examples/sample-log.txt).

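
To make the log format above concrete, here is a small Python sketch that splits each line into a timestamp, a level, and a message, which is roughly the kind of field extraction an ingestion pipeline performs before indexing. The project's actual Logstash configuration is not shown in this README, so the regular expression and field names are assumptions based only on the sample lines.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines shaped like: [YYYY-MM-DD HH:MM:SS] LEVEL message
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")

# Sample lines copied from the Example Output section.
SAMPLE = """\
[2026-04-29 04:18:23] WARN Database connection pool nearing capacity
[2026-04-29 04:18:28] ERROR Database connection pool exhausted
[2026-04-29 04:18:33] ERROR Database query timeout occurred
[2026-04-29 04:18:44] INFO Database connections restored
"""


def parse_line(line: str):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "timestamp": datetime.strptime(match["ts"], "%Y-%m-%d %H:%M:%S"),
        "level": match["level"],
        "message": match["message"],
    }


def parse_lines(text: str) -> list[dict]:
    """Parse every matching line in a block of text, skipping the rest."""
    parsed = (parse_line(line) for line in text.splitlines())
    return [event for event in parsed if event is not None]


if __name__ == "__main__":
    events = parse_lines(SAMPLE)
    print(Counter(event["level"] for event in events))
```
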
## Real-World Use Case

A platform team can use this project to explain how logs move through an ingestion pipeline, how alert rules map to operational symptoms, and how incident exercises create evidence for on-call readiness reviews.

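
As one example of the evidence an incident exercise can produce, the Python sketch below derives two timings from the sample log in the Example Output section: how much warning preceded the first error, and how long the error condition lasted. It assumes, as a simplification, that the first `INFO` line after an error marks recovery; the function name `incident_window` is hypothetical and not part of the repository.

```python
from datetime import datetime, timedelta

# (timestamp, level) pairs copied from the Example Output section.
EVENTS = [
    ("2026-04-29 04:18:23", "WARN"),
    ("2026-04-29 04:18:28", "ERROR"),
    ("2026-04-29 04:18:33", "ERROR"),
    ("2026-04-29 04:18:44", "INFO"),
]


def _ts(text: str) -> datetime:
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


def incident_window(events):
    """Return (warning_lead_time, error_duration), or None if incomplete.

    Assumes events are in chronological order and that the first INFO
    after the first ERROR means the incident is over.
    """
    first_warn = first_error = recovered = None
    for text, level in events:
        when = _ts(text)
        if level == "WARN" and first_warn is None and first_error is None:
            first_warn = when
        elif level == "ERROR" and first_error is None:
            first_error = when
        elif level == "INFO" and first_error is not None and recovered is None:
            recovered = when
    if first_error is None or recovered is None:
        return None
    lead = first_error - first_warn if first_warn else timedelta(0)
    return lead, recovered - first_error


if __name__ == "__main__":
    result = incident_window(EVENTS)
    if result is not None:
        lead, duration = result
        print(f"warning lead time: {lead.total_seconds():.0f}s")
        print(f"error duration:    {duration.total_seconds():.0f}s")
```

For the sample data this reports a 5-second warning lead time and a 16-second error duration.
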