
feat(monitor): scrape WAF proxy decision metrics (would_block + blocked) #5

Draft
electricjesus wants to merge 7 commits into seth/applicationlayer-render-v3 from seth/waf-proxy-metrics-scrape


@electricjesus

Owner

Description

New feature (EV-6650). Adds Prometheus scraping of the WAF decision metrics that
the Coraza WASM filter exports on each Envoy Gateway proxy.

A PodMonitor selects the Envoy Gateway proxy pods across all Gateway namespaces
and scrapes their /stats/prometheus endpoint (port 19001). The proxy-wasm
counters encode their attribution in the stat name, so metric relabelings
normalise them into two queryable series:

  • tigera_waf_decisions_total{decision="would_block"|"blocked",policy,namespace,gateway,rule_id,phase}
  • tigera_waf_transactions_total{gateway}

It also renders the NetworkPolicy required for the scrape to succeed: a
GlobalNetworkPolicy permitting Prometheus to reach the proxy metrics port
(Pass-terminated so the proxy data plane is unaffected) plus the matching
Prometheus egress rule.

The metric data is produced by the coraza-wasm / WAF reconciler work in
calico-private (EV-6650). Envoy Gateway exposes the proxy Prometheus endpoint by
default, so no data-plane change is required here.

Testing:

  • Unit tests cover the PodMonitor relabelings and the GlobalNetworkPolicy.
  • Verified on a live cluster: Prometheus scrapes the proxies and the normalised
    series populate with policy/namespace/gateway/rule_id/phase labels for
    both would-block (DetectionOnly) and block decisions.

Release Note

Added Prometheus metrics for Ingress Gateway WAF would-block and block decisions (tigera_waf_decisions_total).

Add spec.extensions.waf.state (+ IsWAFGatewayExtensionEnabled helper) to the
GatewayAPI CR to gate the WAF v3 (Gateway API add-on) surface, default-off.
Regenerate deepcopy + CRD manifest.
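
The default-off gate described above can be sketched as follows. This is only an illustration in Python, not the operator's Go helper: the function name mirrors `IsWAFGatewayExtensionEnabled`, but the dict-shaped CR and the `"Enabled"` state value are assumptions, since the commit message does not list the allowed values.

```python
from collections.abc import Mapping
from typing import Any, Optional

# Assumption: the state field carries the string "Enabled" when switched on.
ENABLED = "Enabled"


def is_waf_gateway_extension_enabled(gateway_api: Optional[dict]) -> bool:
    """Return True only when spec.extensions.waf.state is explicitly enabled.

    Any missing level (no CR, no spec, no extensions, no waf) counts as off,
    which is what "default-off" means for this gate.
    """
    node: Any = gateway_api
    for key in ("spec", "extensions", "waf"):
        if not isinstance(node, Mapping):
            return False
        node = node.get(key)
    return isinstance(node, Mapping) and node.get("state") == ENABLED
```

A CR without the `extensions` stanza, or no CR at all, evaluates to off, so existing installs would keep their current behaviour under this sketch.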

Refs EV-6657

Render the WAF v3 (Coraza WASM) surface on calico-kube-controllers, gated on the
GatewayAPI WAF extension:

- WASM_IMAGE/WASM_PULL_SECRET/WASM_CA_CERT env, ENABLED_CONTROLLERS, reconciler
  RBAC (wafpolicies/plugins, EnvoyExtensionPolicy, events, secret replication),
  coraza-wasm component (config/enterprise_versions.yml + gen-versions template +
  generated enterprise.go) + GatewayAddonsFeature constant.
- In-process WAF SecLang validating admission webhook: a Service fronting the
  kube-controllers Pod + ValidatingWebhookConfiguration (wafplugins/wafpolicies,
  /validate-waf, FailurePolicy=Fail, caBundle=operator CA); the serving-cert
  mount + WAF_WEBHOOK_CERT_DIR env + container port 9443; and namespaces
  patch/update RBAC for the waf-id-range annotation.

Refs EV-6657

…n controller

Gate on GatewayAPI.spec.extensions.waf.state, issue the webhook serving cert for
the tigera-waf-webhook Service DNS (materialized into calico-system via the
existing CertificateManagement render), thread it into the kube-controllers
config, and render the webhook Service + ValidatingWebhookConfiguration.

Refs EV-6657

Wire the EnvoyProxy render so the data-plane Envoy proxy captures the Coraza
WAF filter's audit decision log (EV-6650 WAF observability):

- Tune EnvoyProxy.Spec.Logging.Level to {default: warn, wasm: info} so the
  wasm component's "AuditLog:" lines (emitted via proxywasm.LogInfo) surface
  in Envoy's application log while the rest stays quiet. Envoy Gateway passes
  arbitrary component keys through to --component-log-level, and Envoy
  recognises "wasm".
- Append --log-path /access_logs/envoy.log via EnvoyProxy.Spec.ExtraArgs to
  redirect Envoy's application log to a file on the existing access-logs
  emptyDir (already mounted in both the envoy container, which writes it, and
  the l7-log-collector, which reads it). ExtraArgs is used rather than a
  container-args Patch, which would replace Envoy Gateway's generated args.
  The file is directly under /access_logs (not a subdirectory) because Envoy
  does not create --log-path parent directories.
- Set WAF_AUDIT_LOG_PATH=/access_logs/envoy.log on the l7-log-collector init
  container so it can tail the file and forward WAF decision records via
  PolicySync.ReportWAF.
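
To illustrate the collector side of this wiring, here is a minimal sketch of picking the WAF records out of Envoy's application log by the "AuditLog:" marker. It is not the l7-log-collector's implementation. The function name and the sample lines are hypothetical, and the real line prefix and payload format are not shown in this PR.

```python
from typing import Iterable, Iterator

AUDIT_MARKER = "AuditLog:"  # marker named in the commit message


def waf_audit_payloads(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text following the AuditLog: marker for each matching line."""
    for line in lines:
        _, found, payload = line.partition(AUDIT_MARKER)
        if found:  # empty string when the marker is absent
            yield payload.strip()


# Hypothetical log lines, for illustration only.
sample = [
    '[2026-06-04 21:22:00.000][1][info][wasm] AuditLog: {"rule_id": 1234}',
    "[2026-06-04 21:22:01.000][1][warning][config] something unrelated",
]

print(list(waf_audit_payloads(sample)))
```

Because the whole application log shares one file, filtering on the marker keeps non-WAF lines (those still emitted at warn level) out of the forwarded records.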

Refs EV-6650

The calico-system.envoy-gateway ingress allow put both 0.0.0.0/0 and ::/0 in
a single rule's Source.Nets, which Calico rejects ("rule contains both IPv4
and IPv6 CIDRs") — the whole NetworkPolicy fails to apply and the gatewayapi
reconcile aborts before rendering the rest. Split the allow-from-anywhere into
two rules, one per address family (dual-stack and IPv6-only both need ::/0).
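
The per-family split can be sketched as below. This is an illustrative Python helper (the name `split_nets_by_family` is hypothetical), not the operator's Go code. It groups CIDRs so that each resulting rule carries only one address family.

```python
import ipaddress
from typing import Dict, List


def split_nets_by_family(nets: List[str]) -> List[List[str]]:
    """Group CIDRs into one list per IP version, IPv4 first, dropping empty groups."""
    by_version: Dict[int, List[str]] = {4: [], 6: []}
    for cidr in nets:
        # strict=False tolerates CIDRs whose host bits are set.
        version = ipaddress.ip_network(cidr, strict=False).version
        by_version[version].append(cidr)
    return [group for group in (by_version[4], by_version[6]) if group]


# One Source.Nets list per rule, instead of a single mixed list.
print(split_nets_by_family(["0.0.0.0/0", "::/0"]))
```

Each inner list would become the Source.Nets of its own rule, so no single rule mixes IPv4 and IPv6 CIDRs.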

…gateway data plane

The gateway data-plane WAF (design-25) emits Coraza audit events that the
l7-collector forwards to Felix via ReportWAF. For those events to reach
Elasticsearch they need the same Felix -> waf.log -> fluentd -> linseed
pipeline the legacy ApplicationLayer WAF uses, but two of its enablement
knobs were never wired for the gateway path:

- FelixConfiguration.WAFEventLogsFileEnabled gates Felix's ReportWAF handler
  and the waf.log file reporter; without it ReportWAF returns
  "WAFEvents disabled". The ApplicationLayer controller already owns this
  field, so OR in the GatewayAPI WAF extension state (and add a GatewayAPI
  watch so toggling it re-reconciles). Also set it in the TPROXYMode
  upgrade-workaround branch, since it is an independent field.
- fluentd-node's in_tail_waf_logs source is gated by the WAF_LOG_FILE env,
  which the operator never set. Set it alongside FLOW_LOG_FILE / DNS_LOG_FILE;
  the path is always present and the file only exists when a WAF producer is
  enabled.

Refs EV-6650

Add a PodMonitor that scrapes the Coraza WASM WAF counters off each
Gateway's Envoy proxy pods and normalizes them into two queryable series:

  tigera_waf_decisions_total{decision="would_block"|"blocked",
                             policy,namespace,gateway,rule_id,phase}
  tigera_waf_transactions_total{gateway}

proxy-wasm counters have no native label dimensions, so the wasm bakes
attribution into the stat name; metricRelabelings lift policy/namespace
(order-agnostic) and, for real blocks, rule_id/phase, then collapse the
per-policy/rule name variants into one series. gateway/gateway_namespace
come from the proxy pod's EG labels via target relabelings.
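
The effect of the metric relabelings can be approximated in Python as below. This is a sketch under stated assumptions: the raw stat-name layout (dot-separated `key=value` segments) is hypothetical, because the real names emitted by the wasm filter are not shown here. The actual implementation is a set of Prometheus relabel rules, not code. The gateway labels are out of scope because they come from target relabelings.

```python
import re
from typing import Dict, Optional, Tuple

SERIES = "tigera_waf_decisions_total"


def normalise(raw_name: str) -> Optional[Tuple[str, Dict[str, str]]]:
    """Lift attribution out of a raw stat name and collapse it into one series.

    Assumed (hypothetical) raw layout: "waf.<decision>.<key>=<value>...".
    Returns None for names that are not decision counters.
    """
    decision = next(
        (d for d in ("would_block", "blocked") if f".{d}." in raw_name), None
    )
    if decision is None:
        return None
    labels = {"decision": decision}
    keys = ["policy", "namespace"]
    if decision == "blocked":
        keys += ["rule_id", "phase"]  # only real blocks carry rule attribution
    for key in keys:
        # Each label is searched independently, so segment order does not matter.
        match = re.search(rf"(?:^|\.){key}=([^.]+)", raw_name)
        if match:
            labels[key] = match.group(1)
    return SERIES, labels


print(normalise("waf.blocked.namespace=default.policy=my-policy.rule_id=949110.phase=2"))
```

Searching for each label separately is what makes the extraction order-agnostic, and returning a single series name regardless of the input variant mirrors the final collapse step.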

Also render the NetworkPolicy needed for the scrape to work: a
GlobalNetworkPolicy allowing Prometheus -> EG proxy :19001 (the proxies
run in arbitrary Gateway namespaces; the rule is Pass-terminated so the
proxy data plane is untouched) plus the matching Prometheus egress rule.

Keeps only the WAF filter counters to bound ingest. License-gated like
the other enterprise monitors. EG exposes /stats/prometheus (:19001) by
default; counter names verified live on EG v1.7.2 / Envoy v1.37.

Refs EV-6650
@electricjesus force-pushed the seth/applicationlayer-render-v3 branch from 1690e71 to 2f38001 on June 4, 2026 21:22
@electricjesus force-pushed the seth/applicationlayer-render-v3 branch from 5eaab67 to 2af9beb on June 12, 2026 12:32