Add Prometheus alerting rules for API latency and error-rate SLA thresholds #846

Description

@RUKAYAT-CODER

Overview

src/monitoring/metrics/ collects Prometheus metrics, but no alerting rules are defined. As a result, SLA breaches (P99 latency > 1s, error rate > 1%) can be detected only by manually watching dashboards.

Specifications

Features:

  • Define Prometheus alerting rules for: request error rate, P99 latency, queue depth, and DLQ depth.

Tasks:

  • Create charts/teachlink-backend/templates/prometheus-rules.yaml containing a PrometheusRule custom resource (CR).
  • Define alerts: HighErrorRate (>1% 5xx for 5m), HighP99Latency (>1s for 10m), QueueDepthHigh (>1000 jobs for 10m).
  • Configure an Alertmanager webhook or Slack route in the chart values.
  • Add alert documentation to docs/RUNBOOKS.md.
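
As a starting point for the template task, here is a minimal sketch of what the PrometheusRule could look like. It assumes the Prometheus Operator CRDs are installed in the cluster. The metric names (http_requests_total with a status label, http_request_duration_seconds_bucket, queue_jobs_waiting), the .Values.prometheusRules.* keys, and the runbook anchors are all hypothetical and need to be replaced with whatever src/monitoring/metrics/ actually exports and the chart actually defines.

```yaml
# charts/teachlink-backend/templates/prometheus-rules.yaml
# Sketch only: metric names, label names, and .Values keys are assumptions.
{{- if .Values.prometheusRules.enabled }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: {{ .Release.Name }}-alerts
  labels:
    # Must match the ruleSelector of the Prometheus CR in the cluster (assumed key).
    release: {{ .Values.prometheusRules.selectorRelease | quote }}
spec:
  groups:
    - name: teachlink-backend.sla
      rules:
        - alert: HighErrorRate
          # Share of 5xx responses among all responses over the last 5 minutes.
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              /
            sum(rate(http_requests_total[5m])) > 0.01
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "5xx error rate above 1% for 5 minutes"
            runbook_url: "{{ .Values.prometheusRules.runbookBaseUrl }}#higherrorrate"
        - alert: HighP99Latency
          # P99 estimated from histogram buckets; 1 means 1 second.
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
            ) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "P99 request latency above 1s for 10 minutes"
            runbook_url: "{{ .Values.prometheusRules.runbookBaseUrl }}#highp99latency"
        - alert: QueueDepthHigh
          expr: sum(queue_jobs_waiting) > 1000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "More than 1000 queued jobs for 10 minutes"
            runbook_url: "{{ .Values.prometheusRules.runbookBaseUrl }}#queuedepthhigh"
{{- end }}
```

The annotations are kept static on purpose: Prometheus annotation templates such as $labels use the same double-brace syntax as Helm, so they would need escaping inside a chart template. The severity labels are a suggestion, not something the issue specifies.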

Impacted Files:

  • charts/teachlink-backend/
  • docs/RUNBOOKS.md
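
For the Slack route task, the change inside charts/teachlink-backend/ might be a values fragment along these lines. This is a sketch under assumptions: the top-level alertmanager.config key, the receiver name, the channel, and the secret file path are hypothetical, and how the chart passes this block to Alertmanager depends on how Alertmanager is deployed. The fields under route and receivers follow the standard Alertmanager configuration format.

```yaml
# values.yaml fragment (hypothetical key layout)
alertmanager:
  config:
    route:
      receiver: slack-oncall          # default receiver (assumed name)
      group_by: ['alertname']
      routes:
        # Route critical alerts explicitly; other severities fall through to the default.
        - matchers:
            - severity="critical"
          receiver: slack-oncall
    receivers:
      - name: slack-oncall
        slack_configs:
          # Read the webhook URL from a mounted secret instead of storing it in values.
          - api_url_file: /etc/alertmanager/secrets/slack-webhook-url
            channel: '#teachlink-alerts'
            send_resolved: true
```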

Acceptance Criteria

  • HighErrorRate alert fires in a test with injected errors.
  • Alerts include runbook links.
  • Helm chart renders without errors after adding the template.
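
One way to cover the first criterion without a live cluster is a promtool rule unit test (run with promtool test rules). The sketch below assumes the rule groups have been extracted from the rendered template into a plain rules.yaml, that the alert uses a hypothetical http_requests_total metric with a status label, and that the rendered annotations match the values shown here exactly.

```yaml
# alerts_test.yaml (hypothetical file name)
rule_files:
  - rules.yaml            # plain rule groups, i.e. the contents of spec.groups
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Injected errors: 6 failed requests per minute ...
      - series: 'http_requests_total{status="500"}'
        values: '0+6x20'
      # ... against 54 successful ones, a 10% error ratio.
      - series: 'http_requests_total{status="200"}'
        values: '0+54x20'
    alert_rule_test:
      - eval_time: 15m    # well past the 5m "for" window
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              severity: critical
            exp_annotations:
              summary: "5xx error rate above 1% for 5 minutes"
              # Assumed rendered value; replace with the real runbook link.
              runbook_url: "https://example.com/docs/RUNBOOKS.md#higherrorrate"
```

Because promtool compares annotations as well as labels, the same test also checks that the alert carries a runbook link. Running helm template or helm lint against the chart covers the rendering criterion.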

Metadata

Labels

  • Stellar Wave: Issues in the Stellar wave program
  • enhancement: New feature or request