Overview
src/monitoring/metrics/ collects Prometheus metrics, but no alerting rules are defined. SLA breaches (P99 latency > 1s, error rate > 1%) can currently be detected only by manually watching dashboards.
Specifications
Features:
- Define Prometheus alerting rules for: request error rate, P99 latency, queue depth, and DLQ depth.
Tasks:
- Create charts/teachlink-backend/templates/prometheus-rules.yaml with PrometheusRule CR.
- Define alerts: HighErrorRate (>1% 5xx for 5m), HighP99Latency (>1s for 10m), QueueDepthHigh (>1000 jobs for 10m).
- Configure alertmanager webhook or Slack route in chart values.
- Add alert documentation in docs/RUNBOOKS.md.
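As a starting point for the template, the three alerts above could be sketched roughly as follows. This is only an illustration: the metric names (http_requests_total, http_request_duration_seconds_bucket, queue_jobs_waiting), the status label, the release selector label, the severity values, and the runbook URL placeholder are all assumptions and need to be replaced with whatever src/monitoring/metrics/ actually exports and the cluster's Prometheus actually selects on.

```yaml
# charts/teachlink-backend/templates/prometheus-rules.yaml (sketch)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ .Release.Name }}-alerts
  labels:
    # Assumption: the Prometheus instance selects rules by this label.
    release: prometheus
spec:
  groups:
    - name: teachlink-backend.sla
      rules:
        - alert: HighErrorRate
          # Share of 5xx responses over all responses, 5-minute rate window.
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: More than 1% of requests returned 5xx.
            # Placeholder: point at the real location of docs/RUNBOOKS.md.
            runbook_url: <repo-url>/docs/RUNBOOKS.md#higherrorrate
        - alert: HighP99Latency
          # P99 estimated from histogram buckets; threshold is in seconds.
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: P99 request latency is above 1s.
            runbook_url: <repo-url>/docs/RUNBOOKS.md#highp99latency
        - alert: QueueDepthHigh
          # Assumes a gauge that reports the number of waiting jobs.
          expr: sum(queue_jobs_waiting) > 1000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: More than 1000 jobs are waiting in the queue.
            runbook_url: <repo-url>/docs/RUNBOOKS.md#queuedepthhigh
```

The annotations here are kept static on purpose. Prometheus annotation templates such as {{ $value }} use the same delimiters as Helm, so they would need escaping inside a chart template.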
Impacted Files:
- charts/teachlink-backend/
- docs/RUNBOOKS.md
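For the Alertmanager routing task, a Slack route in chart values might look something like the fragment below. The top-level alertmanager.config key assumes the chart passes this block through to Alertmanager unchanged (as kube-prometheus-stack does). The receiver name, channel, and secret file path are hypothetical.

```yaml
# values.yaml fragment (sketch)
alertmanager:
  config:
    route:
      receiver: slack-backend        # default receiver
      group_by: [alertname]
      routes:
        - matchers:
            - severity="critical"
          receiver: slack-backend
    receivers:
      - name: slack-backend
        slack_configs:
          # Read the webhook URL from a mounted secret rather than
          # committing it to values. The path is an assumption.
          - api_url_file: /etc/alertmanager/secrets/slack/webhook-url
            channel: "#teachlink-alerts"
            send_resolved: true
```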
Acceptance Criteria
- HighErrorRate alert fires in a test with injected errors.
- Alerts include runbook links.
- Helm chart renders without errors after adding the template.
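One possible way to cover the first criterion is a promtool rule unit test that injects synthetic 5xx traffic. The sketch below assumes the rule groups have been rendered into a plain rules file (here the hypothetical rules.yaml, with groups at the top level rather than wrapped in the PrometheusRule CR) and that HighErrorRate is built on an http_requests_total counter with a status label and carries a severity: critical label. Those names are assumptions, not taken from the codebase.

```yaml
# alerts_test.yaml (sketch), run with: promtool test rules alerts_test.yaml
rule_files:
  - rules.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # 6 errors/min against 60 successes/min is roughly a 9% error rate.
      - series: 'http_requests_total{status="500"}'
        values: '0+6x20'
      - series: 'http_requests_total{status="200"}'
        values: '0+60x20'
    alert_rule_test:
      # 15m is well past the 5m "for" window, so the alert should be firing.
      - eval_time: 15m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
            # promtool compares annotations too: list exactly the
            # annotations defined on the rule under test.
            exp_annotations:
              summary: More than 1% of requests returned 5xx.
              runbook_url: <repo-url>/docs/RUNBOOKS.md#higherrorrate
```

Checking runbook_url in exp_annotations also exercises the second criterion, and helm template charts/teachlink-backend (or helm lint) is a quick check for the third.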