feat(monitor): scrape WAF proxy decision metrics (would_block + blocked) #5
Draft
electricjesus wants to merge 7 commits into
Add spec.extensions.waf.state (+ IsWAFGatewayExtensionEnabled helper) to the GatewayAPI CR to gate the WAF v3 (Gateway API add-on) surface, default-off. Regenerate deepcopy + CRD manifest. Refs EV-6657
Render the WAF v3 (Coraza WASM) surface on calico-kube-controllers, gated on
the GatewayAPI WAF extension:
- WASM_IMAGE/WASM_PULL_SECRET/WASM_CA_CERT env, ENABLED_CONTROLLERS, reconciler
RBAC (wafpolicies/plugins, EnvoyExtensionPolicy, events, secret replication),
coraza-wasm component (config/enterprise_versions.yml + gen-versions template +
generated enterprise.go) + GatewayAddonsFeature constant.
- In-process WAF SecLang validating admission webhook: a Service fronting the
kube-controllers Pod + ValidatingWebhookConfiguration (wafplugins/wafpolicies,
/validate-waf, FailurePolicy=Fail, caBundle=operator CA); the serving-cert
mount + WAF_WEBHOOK_CERT_DIR env + container port 9443; and namespaces
patch/update RBAC for the waf-id-range annotation.
Refs EV-6657
…n controller Gate on GatewayAPI.spec.extensions.waf.state, issue the webhook serving cert for the tigera-waf-webhook Service DNS (materialized into calico-system via the existing CertificateManagement render), thread it into the kube-controllers config, and render the webhook Service + ValidatingWebhookConfiguration. Refs EV-6657
Wire the EnvoyProxy render so the data-plane Envoy proxy captures the Coraza
WAF filter's audit decision log (EV-6650 WAF observability):
- Tune EnvoyProxy.Spec.Logging.Level to {default: warn, wasm: info} so the
wasm component's "AuditLog:" lines (emitted via proxywasm.LogInfo) surface
in Envoy's application log while the rest stays quiet. Envoy Gateway passes
arbitrary component keys through to --component-log-level, and Envoy
recognises "wasm".
- Append --log-path /access_logs/envoy.log via EnvoyProxy.Spec.ExtraArgs to
redirect Envoy's application log to a file on the existing access-logs
emptyDir (already mounted in both the envoy container, which writes it, and
the l7-log-collector, which reads it). ExtraArgs is used rather than a
container-args Patch, which would replace Envoy Gateway's generated args.
The file is directly under /access_logs (not a subdirectory) because Envoy
does not create --log-path parent directories.
- Set WAF_AUDIT_LOG_PATH=/access_logs/envoy.log on the l7-log-collector init
container so it can tail the file and forward WAF decision records via
PolicySync.ReportWAF.
Refs EV-6650
The calico-system.envoy-gateway ingress allow put both 0.0.0.0/0 and ::/0 in
a single rule's Source.Nets, which Calico rejects ("rule contains both IPv4
and IPv6 CIDRs") — the whole NetworkPolicy fails to apply and the gatewayapi
reconcile aborts before rendering the rest. Split the allow-from-anywhere into
two rules, one per address family (dual-stack and IPv6-only both need ::/0).
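A minimal sketch of the per-family split described above, using only the standard library. The helper name `splitByFamily` is hypothetical and the operator's real rule types are not modelled; each returned group would feed the Source.Nets of its own rule.

```go
package main

import "net/netip"

// splitByFamily partitions CIDR strings into an IPv4 group and an IPv6
// group so that no single rule mixes address families. It returns an error
// on the first CIDR that does not parse.
func splitByFamily(cidrs []string) (v4, v6 []string, err error) {
	for _, c := range cidrs {
		p, perr := netip.ParsePrefix(c)
		if perr != nil {
			return nil, nil, perr
		}
		if p.Addr().Is4() {
			v4 = append(v4, c)
		} else {
			v6 = append(v6, c)
		}
	}
	return v4, v6, nil
}
```

Called with 0.0.0.0/0 and ::/0, this yields one single-entry group per family, matching the two rules the commit introduces.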
…gateway data plane
The gateway data-plane WAF (design-25) emits Coraza audit events that the
l7-collector forwards to Felix via ReportWAF. For those events to reach
Elasticsearch they need the same Felix -> waf.log -> fluentd -> linseed
pipeline the legacy ApplicationLayer WAF uses, but two of its enablement knobs
were never wired for the gateway path:
- FelixConfiguration.WAFEventLogsFileEnabled gates Felix's ReportWAF handler
and the waf.log file reporter; without it ReportWAF returns "WAFEvents
disabled". The ApplicationLayer controller already owns this field, so OR in
the GatewayAPI WAF extension state (and add a GatewayAPI watch so toggling it
re-reconciles). Also set it in the TPROXYMode upgrade-workaround branch, since
it is an independent field.
- fluentd-node's in_tail_waf_logs source is gated by the WAF_LOG_FILE env,
which the operator never set. Set it alongside FLOW_LOG_FILE / DNS_LOG_FILE;
the path is always present and the file only exists when a WAF producer is
enabled.
Refs EV-6650
Add a PodMonitor that scrapes the Coraza WASM WAF counters off each
Gateway's Envoy proxy pods and normalizes them into a queryable series:
tigera_waf_decisions_total{decision="would_block"|"blocked",
policy,namespace,gateway,rule_id,phase}
tigera_waf_transactions_total{gateway}
proxy-wasm counters have no native label dimensions, so the wasm bakes
attribution into the stat name; metricRelabelings lift policy/namespace
(order-agnostic) and, for real blocks, rule_id/phase, then collapse the
per-policy/rule name variants into one series. gateway/gateway_namespace
come from the proxy pod's EG labels via target relabelings.
Also render the NetworkPolicy needed for the scrape to work: a
GlobalNetworkPolicy allowing Prometheus -> EG proxy :19001 (the proxies
run in arbitrary Gateway namespaces; the rule is Pass-terminated so the
proxy data plane is untouched) plus the matching Prometheus egress rule.
Keeps only the WAF filter counters to bound ingest. License-gated like
the other enterprise monitors. EG exposes /stats/prometheus (:19001) by
default; counter names verified live on EG v1.7.2 / Envoy v1.37.
Refs EV-6650
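The normalization can be sketched in plain Go. The stat-name layout used here (`waf.blocked;policy=...;namespace=...`) is purely hypothetical, since this PR does not show the real names baked in by the wasm filter, and the real work is done by Prometheus metricRelabelings rather than Go code. The sketch only shows the idea: one independent pattern per label makes extraction order-agnostic, rule_id/phase are lifted only for real blocks, and every variant collapses to one series name.

```go
package main

import (
	"regexp"
	"strings"
)

// One pattern per label, each matched independently of the others, so the
// order of segments in the stat name does not matter. The ";key=value"
// segment syntax is an assumption made for this sketch.
var (
	policyRe    = regexp.MustCompile(`;policy=([^;]+)`)
	namespaceRe = regexp.MustCompile(`;namespace=([^;]+)`)
	ruleIDRe    = regexp.MustCompile(`;rule_id=([^;]+)`)
	phaseRe     = regexp.MustCompile(`;phase=([^;]+)`)
)

// lift copies the first capture group of re into labels[key] when it matches.
func lift(labels map[string]string, key string, re *regexp.Regexp, stat string) {
	if m := re.FindStringSubmatch(stat); m != nil {
		labels[key] = m[1]
	}
}

// normalize maps a hypothetical per-policy stat name onto the single
// tigera_waf_decisions_total series plus its labels. ok is false for stats
// that are not WAF decision counters.
func normalize(stat string) (name string, labels map[string]string, ok bool) {
	var decision string
	switch {
	case strings.HasPrefix(stat, "waf.would_block;"):
		decision = "would_block"
	case strings.HasPrefix(stat, "waf.blocked;"):
		decision = "blocked"
	default:
		return "", nil, false
	}
	labels = map[string]string{"decision": decision}
	lift(labels, "policy", policyRe, stat)
	lift(labels, "namespace", namespaceRe, stat)
	if decision == "blocked" {
		// rule_id and phase are lifted for real blocks only.
		lift(labels, "rule_id", ruleIDRe, stat)
		lift(labels, "phase", phaseRe, stat)
	}
	return "tigera_waf_decisions_total", labels, true
}
```

The gateway and gateway_namespace labels are not derived here because, per the commit message, they come from pod labels via target relabelings rather than from the stat name.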
Description
New feature (EV-6650). Adds Prometheus scraping of the WAF decision metrics that
the Coraza WASM filter exports on each Envoy Gateway proxy.
A `PodMonitor` selects the Envoy Gateway proxy pods across all Gateway namespaces
and scrapes their `/stats/prometheus` endpoint (port 19001). The proxy-wasm
counters encode their attribution in the stat name, so metric relabelings
normalise them into two queryable series:
- `tigera_waf_decisions_total{decision="would_block"|"blocked",policy,namespace,gateway,rule_id,phase}`
- `tigera_waf_transactions_total{gateway}`

It also renders the `NetworkPolicy` required for the scrape to succeed: a
`GlobalNetworkPolicy` permitting Prometheus to reach the proxy metrics port
(Pass-terminated so the proxy data plane is unaffected) plus the matching
Prometheus egress rule.
The metric data is produced by the coraza-wasm / WAF reconciler work in
calico-private (EV-6650). Envoy Gateway exposes the proxy Prometheus endpoint by
default, so no data-plane change is required here.
monitor (pkg/render/monitor).
seth/applicationlayer-render-v3, feat(applicationlayer): WAF v3 (Coraza WASM) render — kube-controllers plumbing + admission webhook tigera/operator#4821).
Testing:
series populate with policy/namespace/gateway/rule_id/phase labels for
both would-block (DetectionOnly) and block decisions.
Release Note