📝 CodeRabbit Chat: Fix ClickHouse trigger in adaptive export service#67
📝 CodeRabbit Chat: Fix ClickHouse trigger in adaptive export service#67coderabbitai[bot] wants to merge 37 commits into
Conversation
New env-gated background loop that runs the same PxL shape AE's anomaly-gated path uses, but with an empty Target (no ns/pod predicate) and over a configurable rolling window. Writes via the existing sink so the byte-shape of forensic_db rows is comparable between the PASSTHROUGH=1 phase (EVERYTHING) and the PASSTHROUGH=0 phase (AE-FILTER). One-shot A/B that yields the per-table capture fraction of the adaptive write path. - internal/passthrough/passthrough.go — Loop + Config; defaults to 30s window / 30s refresh / clickhouse.PixieTables() table list. - internal/passthrough/passthrough_test.go — 6 tests; the load-bearing one is TestLoop_EmitsEmptyTargetPxL (asserts neither df.namespace nor df.pod predicates appear in the emitted PxL). - cmd/main.go: ADAPTIVE_PASSTHROUGH + _WINDOW_SEC + _REFRESH_SEC env knobs. Adapter is constructed unconditionally when passthrough is on (joins the existing PushPixie / streaming construction path so the same pxapi grpc stream is reused). Loop is registered with the shutdown WaitGroup so SIGTERM waits for the in-flight tick. - cmd/BUILD.bazel: drop @px// load (other AE BUILD.bazel files use //bazel — sticking out as the only one with @px is a leftover from a prior gazelle run; align). Add passthrough dep.
Stand-alone workflow that builds entlein/dx (private Active-Diagnosis Framework) into ghcr.io/k8sstormcenter/dx-daemon. Separates the dx image publish from the bazel-based vizier_release pipeline; the dx repo ships its own Dockerfile.dxd (Go cross-compile + distroless final stage) so it doesn't need to live as a submodule inside src/vizier/services/dx. Triggers: - tag push 'release/dx/v*' on this repo cuts a release build, image tag derived from the tag suffix (release/dx/v0.1.0 -> image tag 0.1.0). - workflow_dispatch lets us build any dx ref on demand with a custom tag (default: short sha of the resolved dx commit). Pulls dx via DX_ENTLEIN_PAT (already configured on the repo). Multi-arch build (linux/amd64, linux/arm64); Dockerfile.dxd cross-compiles in the native BUILDPLATFORM stage and the final stage is COPY-only, so target emulation isn't required.
The dx image build pipeline lives in entlein/dx itself (PR #53, branch feat/bazel-release): bazel-based with @px external pin to pixie's ae-prod tip, pushes to docker.io/entlein/dx-daemon on a release/dx/v* tag in the dx repo. The pixie-side buildx workflow this reverts duplicated that intent in the wrong repo + the wrong build system (docker buildx instead of bazel + pl_go_image macros) + the wrong registry (ghcr.io/k8sstormcenter instead of docker.io/entlein).
adaptive_export: ADAPTIVE_PASSTHROUGH firehose loop (A/B capture-fraction measurement)
Consolidates the recurring content_type silent-drop incident class into one default-suite test gate (6 tests, ~15ms): I1 TestContract_ContentTypeIsInt64InSchema I2 TestContract_FastEncodeContentTypeAsInt I3 TestContract_SilentDropDetected I3.b TestContract_SilentDropNotTriggeredOnSuccess I3.c TestContract_SilentDropToleratesMissingSummaryHeader I4 TestContract_HTTPEventsRoundTrip Top-of-file docstring chronicles the incident timeline so future operators can grep their way to the contract.
…affordances Fixes the silent-halt bug: the trigger gated on a RAW event_time high-water-mark, so a single anomaly in a larger unit (ms/ns) drove the watermark past all real seconds rows and AE stopped processing forever (data still on Pixie). Normalize event_time to canonical nanoseconds in the poll SELECT filter+order and in the in-memory/persisted cursor, boundary-dedup, and maxSeen (normalizeEventTimeNanos + chNormEventTimeNanos). Validated at the data layer: vs a poisoned watermark the raw filter returns 0 rows, the normalized filter recovers all 60. Also adds ADAPTIVE_PUSH_REFRESH_SEC (negative = single-shot pull) for the reproducible load-test harness, an in-package trigger unit test, and an e2e hermetic load test (mock PixieQuerier, exact rows+bytes).
…ness
Consolidate the adaptive_export load-test harness under src/e2e_test/ (matching
vzconn_loadtest / px_cluster conventions). Control-plane experiments (E1-E4, E6,
E8 sustained) are proven exactly-reproducible on a live rig; the data-plane
experiments (E5, E8 data-mode) are authored and pending live validation on a
vizier-registered rig (status documented in README + FINDINGS_AND_BACKLOG).
- harness/: shell + python (inject, exp_control, exp_e8, stats, ...) + lib helpers
- fixtures/EXPERIMENTS.md: curated kubescape_logs data-set catalog + expected outputs
- k8s/: isolated sinks + per-rep generator pod (no probes)
- tools/loadgen/: cleanloadgen + httpsink (docker-built test tool; .bazelignore'd
pending a bazel target — lib/pq is already vendored in the module)
- FINDINGS_AND_BACKLOG.md: F8 watermark-poison bug + the fix + AE backlog
The AE Go unit/e2e tests live with the service (internal/{trigger,e2e}).
…iagram; gen sustained-DNS mode C15 = AE must keep re-pulling+writing an active pod until t_end or DX stop (the contract DX steers on; last week's 'wrote then stopped' is its violation). Add DX-steering sequence diagram. Generator gains SUSTAIN_SEC (distinct-DNS trickle, a Pixie-traced protocol) + configurable SETTLE_PRE_MS warm-up for fresh-pod capture; harness wires GEN_SUSTAIN_SEC/GEN_SETTLE_MS.
…lisation The 700821d trigger unit-normalisation wrapped event_time in a multiIf(...) inside both the WHERE filter and the ORDER BY. Three existing tests in watermark_test.go + one in clickhouse_test.go pinned the raw 'event_time >= N' substring and broke at HEAD. Update each test's expected substring to match the new normalized form (') >= <ns-scaled N>' — the closing paren of multiIf, then the value in canonical nanoseconds). Per-test ns-scaling: watermark_test.go:94 1744000000000000000 already ns -> unchanged watermark_test.go:125 InitialWatermark=42 < 1e10 sec -> * 1e9 watermark_test.go:156 InitialWatermark=7 < 1e10 sec -> * 1e9 watermark_test.go:297 event_time='5000' < 1e10 sec -> * 1e9 clickhouse_test.go:82 ref.T=1744..303e9 ns already ns -> unchanged go test ./src/vizier/services/adaptive_export/... all green.
…future-stamp watermark poison)
Records one forensic_db.ae_reconcile row per data-plane pull (read_count vs wrote_count, window, ns/pod) across ALL three write paths — controller fan-out (filter), passthrough firehose, and streaming scanner — so a reconcile run localizes loss to query (R5: read<PEM) vs sink (R6: wrote<read) and quantifies re-pull dup (C8). Counts alone (write >= read) were proven insufficient. - new internal/reconcile leaf package (Row, Recorder, Nop) — no import cycle - sink.Record: CH-backed recorder (INSERT INTO forensic_db.ae_reconcile) - ae_reconcile table: schema.sql + KnownTables + OperatorOwnedTables (synced); not a pixie table (absent from PixieTables, so VerifyPixieSchema ignores it) - wired: passthrough.tick, controller.pushPixieRows (deferred, all return paths), streaming.scanner.Run; gated by ADAPTIVE_RECONCILE=true, else Nop - unit test proves read/wrote capture incl. the sink-drop read>wrote shape - fixed apply_test trailing-tables guard for the new operator table - harness: exp_row_reconcile.sh (row-level PEM<->CH), ae_vs_all.sh, exp_datavolume_extreme.sh
px -o json empty result previously printed one blank line → counted as a phantom LOSS=1. Guard: empty set → 0-byte keys file; drop all-empty-field keys. Confirmed against the controlled log4j run (backend http 14/14 exact, conn 66>=12, no loss).
…log4shell Reliable BY CONSTRUCTION against bob#140 (stateful/unreliable exploit on re-fire): fresh-JVM backend (delete pod) + attacker-before-backend + the WORKING resolvable FQDN attacker.<ns>.svc.cluster.local:1389 (NOT the bare attacker-ns.svc which NXDOMAINs and gets dropped), then VERIFY the actual backend->:1389 LDAP egress in forensic_db.conn_stats and RETRY until confirmed (the validity gate). Never assumes the exploit fired. Node-side.
…tion) Reword from offensive 'exploit' to detection-signal-generation language: this validates the kubescape->DX->AE detection chain. No logic change.
… http2 - pxl.CompilePassthrough/Render: precompile per-table PxL once (fixed window => constant relative start_time), only the two time_ bounds are stamped per tick. Rendered output is byte-identical to QueryFor with an empty Target (TestCompilePassthrough_MatchesQueryFor), so this is a structural change, not a capture change. upid->pod/ns stays in PxL. - passthrough: tickConcurrent fans every table out at once (was a serial loop); shared pull() helper. Sink/recorder are pool/HTTP-backed and already called concurrently elsewhere. - drop http2_messages.beta from the firehose set (not materialised on every cluster => ""Table not found"" every tick); shared PixieTables/DDL lists untouched. - toggle ADAPTIVE_PASSTHROUGH_COMPILED (default on; =false reverts to the legacy serial QueryFor path).
…e.go Fixes "missing strict dependencies: import of .../internal/reconcile" that broke the AE image build for passthrough, sink, streaming, controller, cmd (pre-existing since the ADAPTIVE_RECONCILE commit added the package + imports without bazel deps; never CI-built). Also wires the new pxl/compile.go srcs + passthrough/pxl test srcs (compiled_test.go, reconcile_test.go, compile_test.go). - new internal/reconcile/BUILD.bazel (go_library, stdlib-only) - +//internal/reconcile dep: passthrough, sink, streaming, controller, cmd - pxl go_library +compile.go; pxl_test +compile_test.go; passthrough_test +compiled_test.go,reconcile_test.go (+reconcile dep)
F1 RCA: Pixie caps every px.display at max_output_rows_per_table (default 10000, query_flags.go) — the planner add_limit_to_batch_result_sink_rule silently truncates wide firehose windows / busy pods at the READ (write path is clean: ae_reconcile shows read==wrote). Fix uses Pixie own native knob — prepend `#px:set max_output_rows_per_table=1000000` to every generated PxL (QueryFor + CompilePassthrough) so all pull paths are uncapped. Validated on rig: a 14208-row window returned 10000 (capped) vs 14298 (with flag). No pagination loop, no extra round-trips. See memory project-ae-passthrough-10k-cap.
Measures the AdaptiveExport value prop: datavolume REDUCTION of DX-steered AE (rev-3 streaming, AE writes only DX-steered activeSet pods over the control surface) vs saving ALL data (passthrough firehose). Two arms, same fixed load, forensic_db active-part deltas (rows+bytes) per table; reduction = 1 - DX/ALL. Uses the canonical resolvable JNDI FQDN (attacker.attacker-ns.svc.cluster.local) so the chain fires + DX can classify (a malformed host → NXDOMAIN → no steer). Successor to ae_vs_all.sh, whose AE arm used the rev-2 controller gate + stale JNDI.
Measures all AE non-functional requirements under steady load on the rig: throughput (rows+bytes/sec), capture completeness (AE read vs broker count = F1 cap proof), write fidelity (read==wrote + write-error count), end-to-end freshness latency (now - max(time_) in CH), resource footprint (AE pod cpu/mem idle vs loaded), per-cycle cadence. Emits a structured report; companion to exp_dx_steering_reduction.sh. Real-data only.
…ring + live-pod guard) Run-1 reported false 100% reduction because stale adaptive_attribution windows rehydrated DEAD pods (deleted loaders) into the activeSet → AE streamed dead pods → 0 rows. Clear adaptive_attribution before the DX arm so the activeSet only gets freshly-steered LIVE pods; add a guard that prints the steered pods + marshalsec fire count + live log4j-poc pods so a dead-arm result is caught, not reported as a reduction.
lag query used now()-DateTime64 (type error -> na); use dateDiff(second,...). Capture-completeness vs broker was window-misaligned (623% artifact) -> drop it; report tot_read vs tot_wrote (read==wrote) + errs instead. The 10k-cap/completeness proof is the dedicated F1 test (max_read>10000 vs broker for the SAME window).
…ary) Run-2 byte-delta reduction came out negative because system.parts byte delta is compaction-noisy (merges land mid-window). Report rows reduction as primary (actual captured-row count, noise-free); keep bytes as secondary context.
Eviction-RCA finding (PR #63 NFR campaign): AE had NO memory limit (only cpu 300m) and was CPU-pinned at 300m under concurrent passthrough. AE measured tiny (16-38Mi steady), but the raised 1M-row passthrough cap can spike, so cap at 1Gi so AE can never memory-pressure a node; raise cpu limit 300m->1 core (was throttling). NOTE node evictions were NOT AE/OOM — node-01 went NotReady (network/heartbeat); the memory consumer is PEM (1365Mi, OOMs at the 2Gi default).
Root cause of the recurring "AE unauthenticated / writes 0 / crashloop" reverts: kustomization.yaml bundled adaptive_export_secrets.yaml (placeholder pixie-api-key) with the role+deployment, so EVERY infra re-apply (make log4j) clobbered the real key that ae-auth had written. Separation of concerns: remove the secret from the kustomization — infra (role+deployment) stays re-appliable; the secret holds real creds and is owned solely by `make ae-auth`, created once, never touched by infra re-applies. Secret manifest kept as a hand-applied seed-only template (documented).
The DX-steered arm was failing because the harness only fired stage-1 (JNDI/LDAP). That generates ldap-egress but NO kubescape R0001 → backend never flagged → DX no case → indeterminate → AE steers wrong/no pods. R0001 comes from stage-2 (post- exploitation exec). fire() now does stage-1 (JNDI) + stage-2 (whoami/shadow/token/ getent in the backend) → kubescape R0001+R0006 → DX rules backend MALIGNANT → backend enters AE activeSet → reduction is measurable. Verified live: DX evidence unexpected-spawn+sensitive-file-read → verdict ruled_in generic=MALIGNANT.
…dd DX-steering diagnostics
Standing terminology rule: allowlist/blocklist, never whitelist/blacklist.
Pure rename (no behavior change) of the rev-3 streaming filter:
FilterModeWhitelist → FilterModeAllowlist
MaxWhitelistSize → MaxAllowlistSize
ADAPTIVE_STREAM_MAX_WHITELIST → ADAPTIVE_STREAM_MAX_ALLOWLIST (env)
mode=whitelist log string → mode=allowlist
plus all comments/identifiers/tests in streaming, activeset, cmd/main.
DX-steering diagnostics (the reason DX-arm-writes-0 has been hard to RCA —
we could not tell "empty ActiveSet" from "broker returned 0 rows"):
- scanner: log the empty-allowlist short-circuit (was silent) so an
empty ActiveSet is visible in logs, distinct from "query completed rows=0".
- FilterUpdater: emitted-filter log Debug→Info so the steered pod count
per ActiveSet change is visible without debug logging.
NOTE: ADAPTIVE_STREAM_MAX_WHITELIST env renamed → tooling that sets the old
name must switch to ADAPTIVE_STREAM_MAX_ALLOWLIST.
The Pixie dx_evidence_graph UI reads dx_attack_graph via px.DataFrame clickhouse_dsn, whose query template hardcodes event_time + hostname and ORDER BY event_time. A table without those columns fails 'Unknown identifier event_time'; a table created by hand (local, not via the operator) isn't globally registered. Fix: make AE own it like the other forensic tables. - schema.sql: dx_attack_graph DDL with event_time(UInt64 nanos) + hostname, edge columns, fromUnixTimestamp64Nano partition/TTL (nanos-correct). - KnownTables + OperatorOwnedTables: register it so Apply creates it at boot. - apply_test: assert last-applied DDL == last OperatorOwnedTables entry (robust to appended operator tables) instead of hardcoding trigger_watermark. go test ./.../clickhouse green.
Pixie's clickhouse_dsn type mapper reads UInt8 as BOOLEAN and does not handle UInt16/UInt32/Float32 -> px fails with 'Column[N] given incorrect type' rendering the dx_evidence_graph. weight/max_severity/num_findings -> Int64, confidence -> Float64 (map cleanly to INT64/FLOAT64). Verified live: px run returns all 6 edges with every column. event_time stays UInt64 (matches kubescape_logs, which px reads).
…canner buildPxL The DX/streaming arm silently capped each per-table pull at Pixie's default 10000-row limit while the passthrough/ALL arm (pxl.CompilePassthrough / QueryFor) already raises it to 1,000,000 via the broker's #px:set query flag. Validated live on 6a33dac0: a single streaming http_events pull returned exactly rows=10000 (the cap). Left unfixed this UNDER-counts the DX arm and OVERSTATES the DX-vs-ALL volume reduction. Prepend the same #px:set directive to the streaming scanner's PxL so both arms are uncapped and comparable.
Adds the rule-ins-only view (condition != '') to the canonical schema.sql, registers it in KnownTables + OperatorOwnedTables (after dx_attack_graph), and teaches DDL() to extract CREATE VIEW headers. AE now creates it on boot so the dx_evidence_graph UI's default malicious-only read is standard, not a per-rig manual step. Tests updated. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dx POSTs a JSON array of edges to /dx/attack_graph; AE writes them to forensic_db.dx_attack_graph via JSONEachRow (Applier.WriteAttackGraph). Wired in main.go when CONTROL_ADDR is set; 501 if the sink is unset. This is the AE half of the dx->AE->CH attack-graph write path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ntract adaptive_export/sink: content_type silent-drop contract suite
AE datavolume: DX-steered vs save-ALL reduction measurement
…licious-view ae(clickhouse): dx_attack_graph_malicious view at boot
|
Important Review skippedThis PR was authored by the user configured for CodeRabbit reviews. CodeRabbit does not review PRs authored by this user. It's recommended to use a dedicated user account to post CodeRabbit review feedback. ⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Plus Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Comment |
3b5f551 to
e0a19dd
Compare
…#67) ae-prod's boundary handling accumulates the seen-fingerprint set on a no-progress tick, but if >PollLimit rows share one normalized event_time the SQL (>= watermark ORDER BY time LIMIT N, no secondary key) returns the same N boundary rows every poll → rows beyond N at that timestamp are never emitted (infinite boundary). Detect all-skipped-at-capacity and advance the watermark by 1ns to make forward progress (fingerprint dedup already tolerates the 1ns overlap). Cherry-picked the trigger fix + its test from the stale CodeRabbit-chat PR #67 (687851d); dropped that PR's unrelated gen-pod.tmpl.yaml churn. #67 itself is NOT mergeable (77 commits behind ae-prod, re-adds deleted terraform).
|
Closing — not mergeable and superseded. This is a CodeRabbit-chat snapshot 77 commits behind ae-prod, Its one genuine change — the ClickHouse trigger watermark boundary-escape ( |
… → 73.7% cov) The trigger's incremental kubescape-events pump has 3 moving parts that must agree: watermark advancement, boundary fingerprint dedup, and PollLimit-saturated draining (PR #67 fix). Existing tests cover each in isolation. This new file pins them TOGETHER against the simplest possible reference ("drain everything in event_time order, dedupe by fingerprint, advance the cursor to max(event_time)"). If the iterative trigger and the reference disagree on the set of emitted rows for ANY poll sequence, one of the three parts is wrong. Four oracle tests: TestOracle_TriggerEmitsNaiveSet_StaggeredCorpus — 50 rows scattered across distinct event_times, PollLimit=10 forces ≥5 polls; trigger must emit exactly the 50 rows the naive reference computes. TestOracle_PollLimitSaturation_AtCapacity — regression guard for PR #67 (dfdc465): when EXACTLY PollLimit rows share a boundary event_time, every one of them must emit, and the cursor must clear the boundary for the next-event_time row that follows. TestOracle_PollLimitOverflow_DocumentsLossBound — pins PR #67's intentional trade-off: when >PollLimit rows share a boundary event_time, the first PollLimit emit, then the 1ns escape advances the cursor past the surplus. Lock-in test: if a future fix recovers the surplus (good!) this test fails loudly so both can update in lock-step instead of regressing silently. TestOracle_BoundaryDedup_NoDuplicates — when CH returns the same boundary row in two consecutive polls (real production case: a new row lands at the same event_time after a previous poll's cursor advanced past it), the seenAtBoundary map must filter the duplicate. Coverage: trigger 71.6% → 73.7% (statement-level). The oracle's value isn't in statement count — it's in property-level coverage on invariants the per-feature tests can't enforce together. Build.bazel: adds oracle_test.go to the existing trigger_test target.
Code changes was requested by @entlein.
The following files were modified:
src/e2e_test/adaptive_export_loadtest/k8s/gen-pod.tmpl.yamlsrc/vizier/services/adaptive_export/internal/trigger/clickhouse.gosrc/vizier/services/adaptive_export/internal/trigger/clickhouse_internal_test.go