ci: configure and stabilize CI/CD pipeline

- fix runner configuration issues
- correct workflow labels and execution environment
- resolve dependency issues in pipeline (python deps)
- improve reliability of automation runs
2026-04-29 23:14:14 +00:00
parent 2313efac88
commit fcf305bd70
45 changed files with 6016 additions and 0 deletions
@@ -0,0 +1,67 @@
+# Observability Stack
+
+## Problem Statement
+
+Operations teams need correlated logs, dashboards, and alert examples that surface incidents before they become customer-facing outages. A stack that only starts containers is not enough; it also needs meaningful sample data and incident exercises.
+
+## Solution Overview
+
+This project defines a local observability environment with Elasticsearch, Logstash, Kibana, Grafana, Filebeat, alert rules, sample logs, and an incident simulation script. It is built to demonstrate practical monitoring workflows, not to serve as a production-sized cluster.
+
+## Architecture Overview
+
+```text
+Application/System Logs -> Filebeat -> Logstash -> Elasticsearch -> Kibana
+                                                       |
+                                                       v
+                                                    Grafana
+
+Incident Scenario -> Sample Logs -> Alert Rules -> Operator Review
+```
+
+Core components:
+
+- `docker-compose.yml` defines the observability services.
+- `alerting/alert_rules.yml` records alert intent and severity.
+- `logs/` contains representative operational logs.
+- `scenarios/incident_simulation.sh` emits incident activity.
+- `examples/` contains sample alert and log outputs.
+
+## How to Run
+
+```bash
+cd observability-stack
+
+# Validate the compose model.
+make test
+
+# Start the stack.
+make run
+
+# Run the incident simulation.
+make demo
+
+# Stop the stack.
+docker compose down
+```
+
+When running locally, the services are available at:
+
+- Kibana: `http://localhost:5601`
+- Grafana: `http://localhost:3000`
+- Elasticsearch: `http://localhost:9200`
+
+## Example Output
+
+```text
+[2026-04-29 04:18:23] WARN Database connection pool nearing capacity
+[2026-04-29 04:18:28] ERROR Database connection pool exhausted
+[2026-04-29 04:18:33] ERROR Database query timeout occurred
+[2026-04-29 04:18:44] INFO Database connections restored
+```
+
+Additional examples are available in [examples/alert-output.txt](examples/alert-output.txt) and [examples/sample-log.txt](examples/sample-log.txt).
+
+## Real-World Use Case
+
+A platform team can use this project to explain how logs move through an ingestion pipeline, how alert rules map to operational symptoms, and how incident exercises create evidence for on-call readiness reviews.
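
As an illustration of how log lines can map to an operational symptom, the sketch below counts lines by severity in the `[timestamp] LEVEL message` format shown under Example Output. The helper name `count_level` and the threshold of two errors are hypothetical and chosen only for this example. The project's actual rules live in `alerting/alert_rules.yml`, whose contents are not shown here, and the sketch assumes files under `logs/` follow the same line shape.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count log lines at a given severity level.
# Assumes each line looks like "[YYYY-MM-DD HH:MM:SS] LEVEL message",
# so the level is the third whitespace-separated field.
count_level() {
  local level="$1"
  awk -v lvl="$level" '$3 == lvl { n++ } END { print n + 0 }'
}

# Sample lines copied from the README's Example Output section.
sample_log='[2026-04-29 04:18:23] WARN Database connection pool nearing capacity
[2026-04-29 04:18:28] ERROR Database connection pool exhausted
[2026-04-29 04:18:33] ERROR Database query timeout occurred
[2026-04-29 04:18:44] INFO Database connections restored'

errors=$(printf '%s\n' "$sample_log" | count_level ERROR)

# Hypothetical threshold, not taken from the project's alert rules.
if [ "$errors" -ge 2 ]; then
  echo "ALERT: $errors ERROR lines in sample"
fi
```

The same function can read a file instead of the inline sample, for example `count_level ERROR < some-log-file`, as long as that file uses the line shape assumed above.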