Rework portfolio around Linux operations, Zabbix monitoring, migration validation, and ELK/Grafana log observability. Add AAP-style LVM resize workflow, Zabbix server/proxy/agent automation assets, Linux/AIX monitoring templates, and updated validation CI.
@@ -0,0 +1,30 @@
# Incident Response Runbook

## Filesystem Alert

1. Confirm current usage and growth trend.
2. Check whether the host is Linux or AIX and use the correct runbook.
3. Validate application ownership of the filesystem.
4. Clean known temporary paths or request LVM expansion when approved.
5. Attach before/after evidence to the incident ticket.
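
The usage check and the before/after evidence in the steps above can be sketched in Python. This is a minimal illustration, not part of the runbook tooling: `usage_snapshot` and `format_evidence` are hypothetical helper names, and the percentage is computed from `shutil.disk_usage`, so it may differ slightly from what `df` reports on filesystems with reserved blocks.

```python
import shutil
from datetime import datetime, timezone


def usage_snapshot(path: str) -> dict:
    """Return a timestamped usage record for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo filesystems that report a total size of zero.
    used_pct = round(usage.used / usage.total * 100, 1) if usage.total else 0.0
    return {
        "path": path,
        "taken_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "total_bytes": usage.total,
        "used_bytes": usage.used,
        "used_pct": used_pct,
    }


def format_evidence(before: dict, after: dict) -> str:
    """Render a one-line before/after comparison for the incident ticket."""
    freed_mib = (before["used_bytes"] - after["used_bytes"]) / 1024**2
    return (
        f"{before['path']}: {before['used_pct']}% used at {before['taken_at']} -> "
        f"{after['used_pct']}% used at {after['taken_at']} "
        f"({freed_mib:.1f} MiB freed)"
    )
```

Take one snapshot before cleanup or expansion and one after, then paste the output of `format_evidence(before, after)` into the ticket.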

## Agent Unreachable

1. Confirm whether data loss affects one host, one proxy, or one network segment.
2. Check proxy queue and last seen timestamp.
3. Validate agent service state and firewall path.
4. For active checks, confirm `ServerActive` and hostname match.
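
Steps 3 and 4 above can be partly scripted. The sketch below is an assumption-laden illustration: the helper names are hypothetical, `ServerActive` is treated as a simple comma-separated list compared by exact string match, and an agent that derives its name from `HostnameItem` instead of `Hostname` would need a different check.

```python
import socket


def parse_agent_conf(text: str) -> dict:
    """Parse Parameter=value lines from agent config text, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def active_check_problems(settings: dict, expected_host: str, expected_target: str) -> list:
    """List mismatches that would stop active checks from being accepted."""
    problems = []
    targets = [t.strip() for t in settings.get("ServerActive", "").split(",") if t.strip()]
    if expected_target not in targets:
        problems.append(f"ServerActive does not list {expected_target}: {targets}")
    if settings.get("Hostname") != expected_host:
        problems.append(
            f"Hostname is {settings.get('Hostname')!r}, frontend expects {expected_host!r}"
        )
    return problems


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For the firewall path, `tcp_reachable` would typically be pointed at the agent listener (10050 by default) for passive checks, or run from the agent host against the server or proxy port (10051 by default) for active checks.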

## Proxy Backlog

1. Check server reachability from the proxy.
2. Check proxy DB filesystem usage.
3. Confirm whether the config sync recently changed.
4. Reduce load by temporarily disabling non-critical discovery rules if required.
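
When working through the steps above, it helps to know whether the backlog is still growing or already draining. A minimal sketch, assuming the proxy queue size is sampled by hand or by a script as `(unix_time, queued_values)` pairs; `backlog_rate` and `classify_backlog` are hypothetical names, not Zabbix functions.

```python
from typing import Sequence, Tuple


def backlog_rate(samples: Sequence[Tuple[float, int]]) -> float:
    """Average change in queued values per second between first and last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, q0), (t1, q1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    return (q1 - q0) / (t1 - t0)


def classify_backlog(rate: float, tolerance: float = 0.0) -> str:
    """Label the backlog as growing, draining, or stable around a tolerance band."""
    if rate > tolerance:
        return "growing"
    if rate < -tolerance:
        return "draining"
    return "stable"
```

A backlog that keeps growing after server reachability is restored points toward the proxy database or load rather than the network.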

## Unsupported Items

1. Identify the affected template and item key.
2. Check whether the item is Linux-specific or AIX-specific.
3. Validate agent version and custom user parameters.
4. Roll back the template change if the canary host group is affected.
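
For step 3, one quick check is whether the custom keys a template expects are actually defined on the agent. The sketch below parses `UserParameter=key,command` lines from agent config text; the function names are hypothetical, and it only makes sense for custom keys, since built-in agent keys never appear as `UserParameter` entries.

```python
def user_parameter_keys(conf_text: str) -> set:
    """Collect base key names from UserParameter=key,command lines."""
    keys = set()
    for line in conf_text.splitlines():
        line = line.strip()
        if not line.startswith("UserParameter="):
            continue  # skips comments, blank lines, and other parameters
        definition = line.split("=", 1)[1]
        key = definition.split(",", 1)[0].strip()
        # Flexible parameters are declared as key[*]; keep only the base name.
        keys.add(key.split("[", 1)[0])
    return keys


def missing_user_parameters(item_keys, conf_text: str) -> list:
    """Return custom item keys whose base name is not defined on the agent."""
    defined = user_parameter_keys(conf_text)
    return sorted(k for k in item_keys if k.split("[", 1)[0] not in defined)
```

Run it against the merged agent configuration, including any files pulled in through `Include`, because user parameters are often kept in separate drop-in files.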
@@ -0,0 +1,29 @@
# Zabbix Maintenance Runbook

## Server Checks

- Confirm Zabbix server process and web frontend availability.
- Check database health, free space, and slow queries.
- Review cache usage, poller utilization, and housekeeper activity.
- Confirm recent values are arriving for representative Linux and AIX hosts.

## Proxy Checks

- Confirm proxy last seen timestamp.
- Check proxy queue and delayed values.
- Validate proxy database size and filesystem usage.
- Confirm active/passive connectivity based on proxy mode.
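
The last seen check above reduces to comparing timestamps against a threshold. A minimal sketch, assuming the last seen times have already been collected (from the frontend or an export) as Unix epoch seconds keyed by proxy name; `stale_proxies` and the 300-second default are illustrative choices, not Zabbix settings.

```python
import time


def stale_proxies(last_seen: dict, max_age_s: float = 300, now: float = None) -> list:
    """Return names of proxies whose last seen time is older than max_age_s."""
    now = time.time() if now is None else now
    return sorted(name for name, ts in last_seen.items() if now - ts > max_age_s)
```

Passing `now` explicitly keeps the function deterministic, which makes it easy to test or to replay against a saved export.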

## Template Maintenance

- Import templates in a controlled window.
- Watch unsupported items after import.
- Validate a small canary host group before wider rollout.
- Document changed triggers and thresholds.
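
The canary validation above can be expressed as a set comparison of unsupported items before and after the import. This is a sketch under assumptions: items are represented as `(host, item_key)` tuples gathered by whatever means is available, and `canary_passes` with its `allowed_new` threshold is a hypothetical gate, not an existing Zabbix feature.

```python
def new_unsupported(before: set, after: set) -> set:
    """Items that are unsupported after the import but were not before it."""
    return after - before


def canary_passes(before: set, after: set, allowed_new: int = 0) -> bool:
    """True if the import introduced no more than allowed_new unsupported items."""
    return len(new_unsupported(before, after)) <= allowed_new
```

Comparing against a baseline, rather than requiring zero unsupported items, keeps pre-existing problems from blocking an otherwise clean rollout.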

## Common Failure Modes

- Agent unreachable: check DNS, firewall, agent service, proxy route.
- Unsupported item: check key spelling, OS capability, agent version, user parameter.
- Proxy backlog: check WAN, DB size, proxy process, server availability.
- Alert noise: review trigger thresholds and dependency design.
@@ -0,0 +1,27 @@
# Zabbix Proxy Design

## Purpose

Zabbix proxies reduce dependency on direct connectivity between the central server and monitored hosts. They are useful for client networks, segmented environments, remote sites, and maintenance windows.

## Active Proxy

- The proxy connects to the Zabbix server.
- Good for restricted networks where inbound access to the proxy is not allowed.
- Hosts can use active agent checks against the proxy.
- Main operational checks: proxy last seen, delayed values, local DB size, config sync.

## Passive Proxy

- The Zabbix server connects to the proxy.
- Useful when the central server can reach the proxy network.
- Requires firewall rules from server to proxy.
- Main operational checks: proxy listener, network latency, poller load.
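
The difference between the two modes comes down to which side opens the connection, which in turn decides the firewall rule. The sketch below reads `ProxyMode` from proxy config text (0 is active and the default, 1 is passive) and returns the required flow. The helper names and host names are hypothetical, and 10051 is only the default port, so an environment that overrides `ListenPort` needs a different value.

```python
def proxy_mode(conf_text: str) -> str:
    """Return 'active' or 'passive' based on the ProxyMode parameter."""
    mode = "0"  # ProxyMode defaults to 0 (active) when it is not set
    for line in conf_text.splitlines():
        line = line.strip()
        if line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() == "ProxyMode":
            mode = value.strip()
    if mode == "0":
        return "active"
    if mode == "1":
        return "passive"
    raise ValueError(f"unexpected ProxyMode value: {mode!r}")


def required_flow(mode: str, server: str, proxy: str, port: int = 10051) -> tuple:
    """Return (source, destination, port) for the firewall rule the mode needs."""
    if mode == "active":
        return (proxy, server, port)  # proxy initiates the connection
    if mode == "passive":
        return (server, proxy, port)  # server initiates the connection
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, `required_flow(proxy_mode(conf), "zbx.example.net", "px01.example.net")` yields the source, destination, and port to request from the network team.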

## Operational Signals

- Proxy queue growth.
- Unsupported items after template changes.
- Agent unreachable or active checks delayed.
- Proxy DB growth during WAN outage.
- Config sync failures after maintenance.