EV-6666: Surface Alertmanager alerts on the manager Alerts page #4879
rene-dekker wants to merge 5 commits into
Add a Linseed network policy ingress rule permitting traffic from the Alertmanager pods in the tigera-prometheus namespace, so Alertmanager can push Prometheus alerts to Linseed as events. The Alertmanager egress policy already allows all TCP egress, so only the Linseed ingress side was missing. Exports monitor.AlertmanagerSourceEntityRule as the single source of truth for the Alertmanager pod selector. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a ClusterRole granting create on events (linseed.tigera.io), bound to the prometheus service account that Alertmanager runs as. Linseed authorizes writes via SubjectAccessReview, so this lets Alertmanager push Prometheus alerts to Linseed as events using its existing service account token. The role/binding are rendered only when Alertmanager is enabled and removed otherwise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the placeholder Alertmanager webhook receiver with one that posts to Linseed's /api/v1/events/alertmanager endpoint, so Prometheus alerts surface on the Alerts UI page. Linseed requires mTLS plus a bearer token, so the Alertmanager spec now mounts the prometheus client TLS key pair and the trusted CA bundle, and the webhook http_config references them along with the service account token. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a UIAlertsIntegration (Enabled|Disabled) field to the Monitor spec that controls whether Prometheus/Alertmanager alerts are forwarded to Linseed and surfaced on the manager Alerts page (defaults to Enabled). When disabled, the operator renders an Alertmanager config that routes to a null receiver instead of the Linseed webhook. The config secret is regenerated to the selected variant when the operator owns it, so the toggle takes effect at runtime. A hash of the Alertmanager config is added as a pod annotation so that config changes roll the Alertmanager pod and reload the new config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
099c002 to 044b64c
Replace the raw alertmanager.yaml config secret with an AlertmanagerConfig custom resource referenced by Alertmanager.spec.alertmanagerConfiguration:
- If the user supplies an AlertmanagerConfig named calico-node-alertmanager in the tigera-operator namespace, the operator renders a copy of it in tigera-prometheus. Otherwise it renders a default: the Linseed webhook receiver when the UI alerts integration is enabled, or a null receiver when disabled.
- The webhook authenticates to Linseed with the Linseed-issued bearer token secret for the prometheus service account (prometheus-tigera-linseed-token) and the client cert / CA bundle, all referenced from the CR; the prometheus-operator mounts them into the Alertmanager pod, so the explicit Secrets/ConfigMaps mounts are removed.
- The pod is annotated with a hash of the AlertmanagerConfig spec, client cert and CA bundle so any config change rolls the pod.
- The legacy alertmanager-calico-node-alertmanager config secret is now deleted.
This also fixes the upgrade gap where a pre-existing (stock) config secret was left untouched because it matched neither operator default, so the integration never wired up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 044b64c)
electricjesus left a comment
Drive-by review, courtesy of a quest from Tigera Town 🤣. Mostly looks good. One thing I think blocks merge, plus a few mediums, left inline.
Cross-PR ordering: these two have to ship together, and this PR can't stand alone. The operator's own RBAC for alertmanagerconfigs lives in the calico-private charts in tigera/calico-private#12184, so if this vendors into the operator ahead of that, the monitor controller can't create the AlertmanagerConfig and goes degraded. Same on the receiving end: without #12184 the /api/v1/events/alertmanager endpoint 404s and Linseed rejects the prometheus_alert type. Worth pinning both to the same release and noting the dependency on each PR while they're still draft.
One nice-to-have I noticed but won't block on: the config-hash annotation that rolls the pod doesn't include the token secret data, so the pod won't roll when Kubernetes first populates the token. It relies on the config-reloader watching the mounted secret. Probably fine, worth a sanity check.
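To make the token point concrete, here is a minimal sketch of a config hash that also covers the token secret data. It is not the operator's actual hashing helper: the function name `configHash` and the sample inputs are hypothetical, and only the Go standard library is used. Each input is length-prefixed so that moving bytes between adjacent inputs still changes the digest.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// configHash is a hypothetical helper: it returns a hex SHA-256 digest over
// all parts. Every part is prefixed with its length so that ("ab", "c") and
// ("a", "bc") produce different digests.
func configHash(parts ...[]byte) string {
	h := sha256.New()
	var lenBuf [8]byte
	for _, p := range parts {
		binary.BigEndian.PutUint64(lenBuf[:], uint64(len(p)))
		h.Write(lenBuf[:])
		h.Write(p)
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Placeholder inputs; real values would come from the rendered
	// AlertmanagerConfig spec, the client cert, and the CA bundle.
	spec := []byte(`{"receivers":[{"name":"linseed"}]}`)
	cert := []byte("client-cert-pem")
	ca := []byte("ca-bundle-pem")

	// Token secret not yet populated by Kubernetes.
	before := configHash(spec, cert, ca, []byte{})
	// Token secret populated later: the digest changes, so a pod annotation
	// built from it would change and roll the pod.
	after := configHash(spec, cert, ca, []byte("populated-token"))

	fmt.Println("hash changed:", before != after)
}
```

Whether the extra roll is worth it depends on how reliably the config-reloader picks up the mounted secret; if it does, leaving the token out of the hash avoids one restart.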
// monitor.AlertmanagerConfigName), the operator renders a copy of it in tigera-prometheus.
// Otherwise it renders the operator's default config (the Linseed webhook receiver when the UI
// alerts integration is enabled, or a null receiver when disabled).
func (r *ReconcileMonitor) readAlertmanagerConfig(ctx context.Context, uiAlertsEnabled bool) (*monitoringv1alpha1.AlertmanagerConfig, error) {
This drops existing customer Alertmanager config on upgrade, and I don't think we can ship it that way.
Today the customization path is the raw alertmanager-calico-node-alertmanager secret. It's documented, and the old readAlertmanagerConfigSecret carried all that owner-ref logic precisely to leave a user-modified secret alone. This PR deletes that secret and only reads config from an AlertmanagerConfig CR the customer has never created. So on upgrade, anyone who set their own receivers (PagerDuty, Slack, email) loses them and falls back to the default Linseed webhook. Their external paging stops and the alerts quietly reroute to the manager UI instead.
Options, in order of how much I'd trust them:
- Migrate: if the legacy secret exists and differs from the old default, parse it and seed the AlertmanagerConfig before deleting the secret.
- Failing that, detect a non-default legacy secret and SetDegraded with a clear message instead of silently replacing it, so the upgrade isn't invisible.
Either way the release note has to call this out as a breaking change. Right now it only describes the new feature.
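The two options above could share one decision step. A rough sketch, assuming the operator can compare the legacy secret's bytes against the old default; `decideLegacySecret`, the `legacyAction` values, and the `canMigrate` flag are all hypothetical names and not code from this PR:

```go
package main

import (
	"bytes"
	"fmt"
)

// legacyAction is a hypothetical enum describing what to do with the legacy
// alertmanager-calico-node-alertmanager secret on upgrade.
type legacyAction int

const (
	actionNone    legacyAction = iota // no legacy secret present
	actionDelete                      // secret matches the old default: safe to delete
	actionMigrate                     // user-modified: seed the AlertmanagerConfig first
	actionDegrade                     // user-modified but not migratable: surface it
)

var actionNames = [...]string{"none", "delete", "migrate", "degrade"}

func (a legacyAction) String() string { return actionNames[a] }

// decideLegacySecret treats a nil secretData as "secret does not exist".
// canMigrate stands in for "the raw config parsed and converted cleanly".
func decideLegacySecret(secretData, oldDefault []byte, canMigrate bool) legacyAction {
	switch {
	case secretData == nil:
		return actionNone
	case bytes.Equal(secretData, oldDefault):
		return actionDelete
	case canMigrate:
		return actionMigrate
	default:
		return actionDegrade
	}
}

func main() {
	oldDefault := []byte("route: {receiver: default}")
	custom := []byte("route: {receiver: pagerduty}")

	fmt.Println(decideLegacySecret(nil, oldDefault, true))
	fmt.Println(decideLegacySecret(oldDefault, oldDefault, true))
	fmt.Println(decideLegacySecret(custom, oldDefault, true))
	fmt.Println(decideLegacySecret(custom, oldDefault, false))
}
```

Only the `delete` branch removes the secret without further work, so a customized config never disappears silently.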
// The Linseed bearer-token secret is only needed when Alertmanager is running and forwarding
// alerts to Linseed (the UI alerts integration is enabled); otherwise remove it.
if mc.alertmanagerReplicas() > 0 && mc.cfg.Monitor.UIAlertsEnabled() {
Two things about the disable toggle when a user brings their own AlertmanagerConfig.
The toggle only swaps the default. If a user has their own AlertmanagerConfig in the operator namespace, uiAlertsIntegration: Disabled does nothing, since we copy their spec verbatim. The field doc says it "controls whether alerts are forwarded to Linseed," which won't hold for those users. Worth documenting the precedence, or deciding whether disable should win regardless.
Separately, the Linseed token secret and the tigera-alertmanager-linseed ClusterRole/Binding get created whenever Alertmanager runs with the integration enabled, even if the user's own config never talks to Linseed. That leaves a token secret and an event-create grant nothing uses. Not harmful, but it's a dangling credential. Could gate those on the default-config path rather than on UIAlertsEnabled alone.
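A sketch of what gating on the default-config path might look like, with the precedence made explicit. The names `configSource`, `selectConfigSource`, and `needsLinseedCredentials` are hypothetical, and the precedence shown (user config always wins) mirrors the current behavior described above rather than a recommendation:

```go
package main

import "fmt"

// configSource is a hypothetical enum for where the rendered
// AlertmanagerConfig comes from.
type configSource int

const (
	sourceUserSupplied   configSource = iota // copied verbatim from the operator namespace
	sourceDefaultLinseed                     // operator default: Linseed webhook receiver
	sourceDefaultNull                        // operator default: null receiver
)

// selectConfigSource encodes the precedence: a user-supplied config wins,
// and the toggle only picks between the two operator defaults.
func selectConfigSource(userSupplied, uiAlertsEnabled bool) configSource {
	switch {
	case userSupplied:
		return sourceUserSupplied
	case uiAlertsEnabled:
		return sourceDefaultLinseed
	default:
		return sourceDefaultNull
	}
}

// needsLinseedCredentials gates the token secret and the ClusterRole/Binding
// on the default Linseed path, not on the toggle alone.
func needsLinseedCredentials(replicas int, src configSource) bool {
	return replicas > 0 && src == sourceDefaultLinseed
}

func main() {
	// A user-supplied config with the integration enabled: no credentials rendered.
	src := selectConfigSource(true, true)
	fmt.Println(needsLinseedCredentials(1, src))

	// Operator default with the integration enabled: credentials rendered.
	src = selectConfigSource(false, true)
	fmt.Println(needsLinseedCredentials(1, src))
}
```

The trade-off is that a user config which does post to Linseed would then need its own token and RBAC, so this gating only makes sense if that case is documented.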
}

// +kubebuilder:validation:Enum=Enabled;Disabled
type UIAlertsIntegrationStatusType string
UIAlertsIntegrationStatusType reads like a status field, but this is a spec enum. It's public API and awkward to rename after release, so I'd fix it now. UIAlertsIntegrationType or UIAlertsIntegrationMode matches what it actually is.
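For illustration, the renamed enum could look roughly like this. The constant names, the pointer-typed field, and the trimmed-down `MonitorSpec` are assumptions made for the sketch; only the `Enabled`/`Disabled` values, the default of Enabled, and the `UIAlertsEnabled` accessor name come from the PR:

```go
package main

import "fmt"

// UIAlertsIntegrationType is the suggested spec-enum name (assumed here).
// +kubebuilder:validation:Enum=Enabled;Disabled
type UIAlertsIntegrationType string

const (
	UIAlertsIntegrationEnabled  UIAlertsIntegrationType = "Enabled"
	UIAlertsIntegrationDisabled UIAlertsIntegrationType = "Disabled"
)

// MonitorSpec is a reduced stand-in for the real spec; the pointer type
// for the field is an assumption used to represent "unset".
type MonitorSpec struct {
	UIAlertsIntegration *UIAlertsIntegrationType
}

// UIAlertsEnabled treats an unset field as Enabled, matching the
// documented default.
func (s MonitorSpec) UIAlertsEnabled() bool {
	return s.UIAlertsIntegration == nil || *s.UIAlertsIntegration == UIAlertsIntegrationEnabled
}

func main() {
	disabled := UIAlertsIntegrationDisabled

	fmt.Println(MonitorSpec{}.UIAlertsEnabled())
	fmt.Println(MonitorSpec{UIAlertsIntegration: &disabled}.UIAlertsEnabled())
}
```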
Wires Prometheus/Alertmanager alerts through to the manager Alerts page, with a toggle to enable/disable the integration.
- /api/v1/events/alertmanager
- Monitor.spec.uiAlertsIntegration (Enabled|Disabled, default Enabled). When disabled, the rendered Alertmanager config routes to a null receiver. The operator regenerates the config secret when it owns it, so toggling takes effect at runtime.
Companion PRs: calico-private (Linseed ingest + dedup), ui-modules (Alerts page toggle + prometheus_alert rendering).
🤖 Generated with Claude Code