Closed
37 commits
2db5a9e
adaptive_export: production AE — streaming export + write-integrity
Jun 8, 2026
45130be
adaptive_export: ADAPTIVE_PASSTHROUGH firehose loop
entlein Jun 8, 2026
1d36935
ci: dx-image workflow — build + publish dx-daemon to ghcr
entlein Jun 9, 2026
a274637
Revert ci: dx-image workflow — wrong repo
entlein Jun 9, 2026
ef7525c
Merge pull request #55 from k8sstormcenter/entlein/adaptive-passthrough
entlein Jun 9, 2026
bb0bd55
adaptive_export/sink: content_type silent-drop contract suite
entlein Jun 9, 2026
700821d
adaptive_export: unit-normalize trigger watermark cursor + load-test …
Jun 16, 2026
ae7b86f
e2e_test/adaptive_export_loadtest: AE fixture-isolation load-test har…
Jun 16, 2026
b6c7333
e2e_test/adaptive_export_loadtest: document AE implied contracts (C1-…
Jun 16, 2026
7e86711
adaptive_export_loadtest: C15 write-duration contract + DX-steering d…
Jun 16, 2026
167d506
adaptive_export/trigger: update test SQL substrings for multiIf norma…
entlein Jun 16, 2026
9e4e353
adaptive_export_loadtest: exp_control uses real now_s event_time (no …
Jun 16, 2026
c9f19b6
adaptive_export: ADAPTIVE_RECONCILE per-pull write-fidelity instrument
Jun 17, 2026
b582e0f
harness: exp_pipeline_reconcile — skip empty-key rows (0 rows != LOSS 1)
Jun 17, 2026
1f6cc8e
harness: log4shell_fire.sh — reliably fire + restart the log4j-chain …
Jun 17, 2026
a3283bb
harness: log4shell_fire.sh — detection-signal framing (Cyber Verifica…
Jun 17, 2026
4a445c5
adaptive_export(passthrough): precompiled + concurrent firehose, drop…
Jun 17, 2026
1fccd01
adaptive_export: bazel BUILD deps for internal/reconcile + pxl compil…
Jun 17, 2026
0af95f0
adaptive_export(pxl): raise Pixie 10k result cap via #px:set query flag
Jun 17, 2026
8dc0de9
adaptive_export_loadtest: DX-steered-vs-ALL datavolume reduction harness
Jun 17, 2026
229604c
adaptive_export_loadtest: deep AE NFR benchmark harness
Jun 18, 2026
0ee771c
adaptive_export_loadtest: fix DX-reduction dead-arm (clear stale stee…
Jun 18, 2026
0efffe8
nfr harness: fix lag (dateDiff) + drop racy broker-pct completeness
Jun 18, 2026
2fc46dc
dx-reduction harness: report ROWS reduction (primary) + bytes (second…
Jun 18, 2026
a0cca5a
ae deployment: add memory limit (1Gi) + raise cpu limit to 1 core
Jun 18, 2026
c77f427
ae bootstrap: separate the secret from the re-applied infra bundle
Jun 18, 2026
6fe94ed
dx-reduction harness: fire BOTH attack stages so DX steers the backend
Jun 18, 2026
668f823
adaptive_export: rename whitelist→allowlist across streaming path + a…
Jun 18, 2026
f886503
ae(clickhouse): create forensic_db.dx_attack_graph at boot
Jun 18, 2026
3e2b93a
ae(clickhouse): dx_attack_graph numeric cols Int64/Float64 (px-readable)
Jun 18, 2026
e3f7f81
adaptive_export(streaming): add #px:set max_output_rows cap flag to s…
Jun 18, 2026
1513c88
ae(clickhouse): create dx_attack_graph_malicious view at boot
Jun 19, 2026
b2df97e
ae(control): /dx/attack_graph ingest endpoint -> ClickHouse
Jun 19, 2026
c2f31a2
Merge pull request #57 from k8sstormcenter/entlein/ae-content-type-co…
entlein Jun 19, 2026
98ed54f
Merge pull request #63 from k8sstormcenter/ae-datavolume-dx-steering
entlein Jun 19, 2026
274f6f4
Merge pull request #65 from k8sstormcenter/feat/ae-dx-attack-graph-ma…
entlein Jun 19, 2026
687851d
📝 CodeRabbit Chat: Fix ClickHouse trigger in adaptive export service
coderabbitai[bot] Jun 20, 2026
4 changes: 4 additions & 0 deletions .bazelignore
Expand Up @@ -6,3 +6,7 @@ third_party/threadstacks
tools/chef/nodes
# To keep third party dependencies separate, privy is intentional setup as a separate bazel workspace
src/datagen/pii/privy

# adaptive_export_loadtest generator is a docker-built test tool (see its README);
# build-agent to replace with a bazel target. Until then, keep it out of gazelle.
src/e2e_test/adaptive_export_loadtest/tools/loadgen
4 changes: 2 additions & 2 deletions .github/workflows/vizier_release.yaml
Expand Up @@ -15,7 +15,7 @@ jobs:
image-base-name: "dev_image_with_extras"
build-release:
name: Build Release
runs-on: oracle-16cpu-64gb-x86-64
runs-on: oracle-vm-16cpu-64gb-x86-64
needs: get-dev-image
permissions:
contents: read
Expand Down Expand Up @@ -140,7 +140,7 @@ jobs:
git commit -s -m "Release Helm chart Vizier ${VERSION}"
git push origin "gh-pages"
update-gh-artifacts-manifest:
runs-on: oracle-8cpu-32gb-x86-64
runs-on: oracle-vm-16cpu-64gb-x86-64
needs: [get-dev-image, create-github-release]
container:
image: ${{ needs.get-dev-image.outputs.image-with-tag }}
11 changes: 11 additions & 0 deletions k8s/vizier/bootstrap/adaptive_export_deployment.yaml
Expand Up @@ -31,6 +31,17 @@ spec:
      containers:
      - name: adaptive-export
        image: vizier-adaptive_export_image:latest
        # Bounded so AE can never memory-pressure a node (measured: AE uses
        # only ~16-38Mi steady; passthrough with the raised 1M-row cap can
        # spike, so 1Gi caps the worst case). CPU was pinned at the old 300m
        # limit under concurrent passthrough → raised to 1 core.
        resources:
          requests:
            cpu: 200m
            memory: 128Mi
          limits:
            cpu: "1"
            memory: 1Gi
        env:
        - name: PL_NAMESPACE
          valueFrom:
7 changes: 7 additions & 0 deletions k8s/vizier/bootstrap/adaptive_export_secrets.yaml
@@ -1,3 +1,10 @@
# SEED-ONLY template — NOT in kustomization.yaml (separation of concerns).
# Real credentials are written by `make ae-auth` (pixie-api-key from keys.env,
# clickhouse-dsn = the fixed forensic-CH constant). Do NOT add this back to the
# bundle: a re-apply would clobber the real pixie-api-key with the placeholder
# (the recurring "AE unauthenticated / writes 0" bug). Apply this by hand ONLY
# to seed a brand-new cluster so the AE pod's secretKeyRef resolves before
# ae-auth runs.
---
apiVersion: v1
kind: Secret
7 changes: 6 additions & 1 deletion k8s/vizier/bootstrap/kustomization.yaml
Expand Up @@ -16,5 +16,10 @@ resources:
- cert_provisioner_job.yaml
- vizier_crd_role.yaml
- adaptive_export_role.yaml
- adaptive_export_secrets.yaml
# adaptive_export_secrets.yaml is intentionally NOT bundled here: it holds real
# credentials (pixie-api-key, clickhouse-dsn) owned by `make ae-auth`. Bundling
# it meant every infra re-apply clobbered the real key with the placeholder.
# Separation of concerns: infra (role+deployment) re-appliable; secret is
# created ONCE by ae-auth and never touched by this kustomization. ponytail:
# apply adaptive_export_secrets.yaml manually only to seed a fresh cluster.
- adaptive_export_deployment.yaml
4 changes: 2 additions & 2 deletions skaffold/skaffold_vizier.yaml
Expand Up @@ -36,8 +36,8 @@ build:
bazel:
target: //src/vizier/services/cloud_connector:cloud_connector_server_image.tar
args:
- --config=x86_64_sysroot
- --compilation_mode=opt
- --config=x86_64_sysroot
- --compilation_mode=opt
- image: vizier-cert_provisioner_image
context: .
bazel:
14 changes: 14 additions & 0 deletions src/api/go/pxapi/opts.go
Expand Up @@ -82,3 +82,17 @@ func WithDirectCredsInsecure() ClientOption {
c.insecureDirect = true
}
}

// WithDirectTLSSkipVerify is the encrypted-but-unverified option for direct
// (standalone / node-local PEM) connections: the transport IS TLS-encrypted, but
// the server cert is not chain/hostname-verified, so it does not protect against
// an active man-in-the-middle. Use this instead of WithDirectCredsInsecure when
// the direct endpoint serves TLS with a self-signed / service cert whose SAN does
// not match the node IP (e.g. vizier-pem's direct-query port served with
// service-tls-certs, dialed at HOST_IP). Unlike WithDisableTLSVerification it does
// NOT require a "cluster.local" address, so it works for the node-IP direct dial.
// Bearer creds (the minted JWT) therefore ride an encrypted channel, never plaintext.
func WithDirectTLSSkipVerify() ClientOption {
	return func(c *Client) {
		c.disableTLSVerification = true
	}
}
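
The difference between "encrypted but unverified" and "verified" TLS can be illustrated with the standard library alone. The sketch below is not the pxapi implementation: `getWithTLS` and `demo` are hypothetical helpers, and the self-signed endpoint is simulated with `httptest.NewTLSServer` rather than a real vizier-pem port.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// getWithTLS issues one GET over TLS. With skipVerify=true the channel is
// still encrypted, but the server's certificate chain and hostname are NOT
// checked (hypothetical helper, not part of pxapi).
func getWithTLS(url string, skipVerify bool) error {
	client := &http.Client{
		Transport: &http.Transport{
			TLSClientConfig:   &tls.Config{InsecureSkipVerify: skipVerify},
			DisableKeepAlives: true,
		},
	}
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

// demo starts a local TLS server whose certificate is not in the system trust
// store (a stand-in for a self-signed / service cert) and dials it both ways.
func demo() (verifiedErr, skipVerifyErr error) {
	srv := httptest.NewTLSServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	}))
	defer srv.Close()
	return getWithTLS(srv.URL, false), getWithTLS(srv.URL, true)
}

func main() {
	verifiedErr, skipVerifyErr := demo()
	fmt.Println("verified dial failed:", verifiedErr != nil)
	fmt.Println("skip-verify dial failed:", skipVerifyErr != nil)
}
```

Under these assumptions the verified dial is rejected with a certificate error while the skip-verify dial succeeds over an encrypted connection, which is the trade-off the option comment describes.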
98 changes: 98 additions & 0 deletions src/e2e_test/adaptive_export_loadtest/CONTRACTS.md
@@ -0,0 +1,98 @@
# Adaptive Export (AE) — implied contracts

What AE *currently assumes but does not enforce*. Each ⚠️ is an **implied** contract
(a silent assumption); 🔴 marks ones we've observed violated, with the fix. Grounded
in `src/vizier/services/adaptive_export/` (trigger, controller, sink, config) + the
`forensic_db` DDL.

## End-to-end data flow + where each contract sits

```mermaid
flowchart TD
subgraph PROD["Producer (per node)"]
VEC["Vector kubescape_enrich sink<br/>(or load-test fixtures)"]
end
subgraph CH1["ClickHouse — input"]
KL["forensic_db.kubescape_logs<br/>MergeTree ORDER BY (event_time, hostname)<br/>TTL toDateTime(event_time)+30d"]
end
subgraph AE["adaptive_export (per node DaemonSet)"]
TRG["TRIGGER: poll 250ms<br/>WHERE hostname=NODE AND event_time>=watermark<br/>ORDER BY event_time LIMIT N"]
CTL["CONTROLLER: hash + active-set<br/>window [event_time-Before, now)"]
PXL["DATA-PLANE: PxL per (ns,pod)×table<br/>refresh every 30s while window open"]
end
subgraph VZ["Pixie"]
QB["vizier-query-broker → PEMs"]
end
subgraph CH2["ClickHouse — output (forensic_db)"]
ATTR["adaptive_attribution<br/>ReplacingMergeTree(t_end)<br/>ORDER BY (hostname, anomaly_hash)"]
WM["trigger_watermark<br/>ReplacingMergeTree(updated_at)"]
PROT["http/dns/pgsql/conn_stats/...<br/>plain MergeTree (NO dedup)"]
end
VEC -->|"C1 ⚠️ event_time UNIT = seconds<br/>C2 ⚠️ hostname = k8s node name"| KL
KL -->|"C3 🔴 event_time monotone ≥ watermark<br/>C4 ⚠️ boundary dedup by content fp"| TRG
TRG --> CTL
CTL -->|"C5 ⚠️ anomaly_hash = f(pid,comm,pod,ns) only"| ATTR
TRG -->|"C6 ⚠️ watermark persist throttled ~5s"| WM
CTL --> PXL
PXL -->|"C7 needs registered vizier"| QB
QB -->|"C8 🔴 plain MergeTree + 30s re-pull → dup"| PROT
PXL -->|"C9 ⚠️ write only if rows>0"| PROT
ATTR -. "C10 ⚠️ join: events.pod = ns/pod ↔ attribution.pod = bare" .- PROT
```
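
C5 in the diagram above says the attribution identity depends only on `(pid, comm, pod, ns)`. A minimal sketch of such a hash follows; the field separator and the reading of `[:16]` as "first 16 hex characters" are assumptions, and `anomalyHash` is a hypothetical name rather than the function in the AE controller.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// anomalyHash derives a workload identity from (pid, comm, pod, ns) only.
// ASSUMPTION: fields are joined with a NUL separator and the result is the
// first 16 hex characters of the SHA-256 digest; the real encoding may differ.
func anomalyHash(pid int, comm, pod, ns string) string {
	key := strings.Join([]string{strconv.Itoa(pid), comm, pod, ns}, "\x00")
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	// Two events for the same workload (event_time and RuleID are not inputs),
	// so both map to the same attribution row.
	first := anomalyHash(4242, "java", "log4j-0", "demo")
	later := anomalyHash(4242, "java", "log4j-0", "demo")
	other := anomalyHash(4243, "java", "log4j-0", "demo")
	fmt.Println(first == later, first == other, len(first))
}
```

Because event time and rule ID are deliberately left out of the key, N events for one target collapse into one attribution row, which matches the "ok" status the register gives C5.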

## Boot / dependency contract

```mermaid
flowchart LR
ENV["ENV (all non-empty or FATAL):<br/>PIXIE_CLUSTER_ID · CLUSTER_NAME<br/>PIXIE_API_KEY · CLICKHOUSE_DSN"] --> BOOT
CM["cm/pl-cloud-config<br/>PL_CLOUD_ADDR=…:443"] -->|"C11 🔴 missing :443 → crashloop"| BOOT
BOOT["AE boot"] --> DDL["C12 self-applies forensic_db DDL<br/>(ADAPTIVE_SKIP_APPLY=false)"]
BOOT --> CTRLPLANE["control plane: CH only"]
BOOT --> DATAPLANE["data plane: needs query-broker<br/>(C7) + ADAPTIVE_PUSH_PIXIE_ROWS"]
```
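
The boot contract above (required env non-empty, `PL_CLOUD_ADDR` carrying `:443`) could be checked up front instead of surfacing as a crashloop. The following is a sketch under assumptions: `validateBootConfig` is a hypothetical function, the environment is passed as a map for testability, and treating any port other than 443 as an error is an interpretation of C11, not confirmed behaviour.

```go
package main

import (
	"fmt"
	"net"
)

// requiredEnv lists the variables the boot diagram marks as "non-empty or FATAL".
var requiredEnv = []string{"PIXIE_CLUSTER_ID", "CLUSTER_NAME", "PIXIE_API_KEY", "CLICKHOUSE_DSN"}

// validateBootConfig returns the first violated boot contract, or nil.
// Hypothetical helper; the real AE boot path may validate differently.
func validateBootConfig(env map[string]string) error {
	for _, key := range requiredEnv {
		if env[key] == "" {
			return fmt.Errorf("%s must be non-empty", key)
		}
	}
	addr := env["PL_CLOUD_ADDR"]
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return fmt.Errorf("PL_CLOUD_ADDR %q is not host:port: %w", addr, err)
	}
	if port != "443" { // ASSUMPTION: C11 means the port must be exactly 443
		return fmt.Errorf("PL_CLOUD_ADDR port is %q, want 443", port)
	}
	return nil
}

func main() {
	env := map[string]string{
		"PIXIE_CLUSTER_ID": "example-cluster-id", // placeholder values
		"CLUSTER_NAME":     "example",
		"PIXIE_API_KEY":    "example-key",
		"CLICKHOUSE_DSN":   "example-dsn",
		"PL_CLOUD_ADDR":    "cloud.example.test", // port missing on purpose
	}
	fmt.Println(validateBootConfig(env))
	env["PL_CLOUD_ADDR"] = "cloud.example.test:443"
	fmt.Println(validateBootConfig(env))
}
```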

## Contract register

| # | Contract (implied) | Enforced? | Status / fix |
|---|---|---|---|
| C1 | `kubescape_logs.event_time` is unix **seconds** (one unit end-to-end) | ❌ trigger auto-detects s/ms/ns; DDL `toDateTime()` assumes seconds | 🔴 **F8 root** — see C3; AE-2 standardize+normalize |
| C2 | `hostname` = the k8s **node** name (AE polls `WHERE hostname=node`) | ❌ convention only | ⚠️ fixtures must use a real node, else no AE ever reads them |
| C3 | every new anomaly's `event_time` ≥ current watermark (monotone) | ❌ strict HWM filter | 🔴 **F8** — a larger-unit / out-of-order / future row poisons the HWM → all later rows silently dropped. **Fix (PR #53):** normalize cursor to nanos (`chNormEventTimeNanos`); AE-9: ingest-order cursor / bounded-lookback+dedup + below-watermark metric |
| C4 | rows sharing `event_time` at the boundary are deduped by content fingerprint |`seenAtBoundary` | ok |
| C5 | `anomaly_hash = SHA256(pid,comm,pod,ns)[:16]` — identity is the **workload**, independent of event_time/RuleID || ok (N events for one target → 1 attribution row) |
| C6 | `trigger_watermark` persisted value tracks the live cursor | ❌ throttled ~5s | ⚠️ external readers/restart see up to 5s stale; AE-7 flush-on-shutdown |
| C7 | data-plane requires a **registered** vizier query-broker || ⚠️ control plane works without it; data plane silently does nothing |
| C8 | re-pulling a window is idempotent | ❌ protocol tables plain MergeTree (no dedup) + 30s re-pull | 🔴 duplicate inflation. **Fix:** single-shot (`ADAPTIVE_PUSH_REFRESH_SEC=-1`, or `AFTER<refresh`); AE-6 ReplacingMergeTree protocol tables |
| C9 | a protocol table row is written only if Pixie returned ≥1 row |`WritePixieRows len==0 → nil` | ok (empty workload → 0 rows, by design) |
| C10 | join key: `events.pod` = `"ns/pod"` (upid_to_pod_name) vs `adaptive_attribution.pod` = **bare** pod | ❌ asymmetric | ⚠️ consumers must `concat(namespace,'/',pod)` to join (burned the volume tool) |
| C11 | `PL_CLOUD_ADDR` carries `:443` || 🔴 missing → AE crashloops / 0 writes (per-PG fix) |
| C12 | AE owns + self-applies the `forensic_db` DDL | ✅ when `ADAPTIVE_SKIP_APPLY=false` | ok; but DDL TTL/PARTITION assume seconds (C1) |
| C13 | `adaptive_attribution` / protocol writes are durable | ❌ best-effort: logged, non-fatal, **not retried** | 🔴 silent loss under CH hiccup; AE-4 retry+count |
| C14 | **DX⊇AE invariant**: AE write-set ⊇ DX read-set (AE persists everything dx queries) | ❌ by convention | ⚠️ validated per-table in the load-test, not enforced in code |
| C15 | **Write-duration (the one DX steers on):** once an anomaly opens a pod's window, AE **keeps re-pulling and writing that pod's forensic data continuously** until `t_end` expires OR DX explicitly stops it. `t_end = now + After`, extended by each new anomaly for the hash. | ❌ partial | 🔴 **last week's "wrote then stopped" bug.** Premature-stop modes under investigation (E8-data RCA): (a) F8: extension anomalies are dropped, so `t_end` is not extended and the window expires early; (b) the EmptyResultSkip negative cache skips a (pod,table) mid-window after N empty pulls; (c) prune/in-flight race; (d) the single-shot setting `ADAPTIVE_PUSH_REFRESH_SEC=-1` (see C8) is a TEST affordance that *violates* this contract (writes once); production must re-pull. |
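
To make the C1/C3 failure concrete: a high-watermark cursor over raw `event_time` values breaks as soon as one row arrives in a larger unit. The sketch below normalizes by magnitude before comparing. The thresholds and the name `normalizeEventTimeNanos` are assumptions for illustration; they are not taken from `chNormEventTimeNanos` or the trigger's SQL.

```go
package main

import "fmt"

// normalizeEventTimeNanos guesses the unit of a unix timestamp from its
// magnitude and converts it to nanoseconds.
// ASSUMPTION: the cut-offs (1e11, 1e14, 1e17) are illustrative; they separate
// s / ms / µs / ns for present-day timestamps only.
func normalizeEventTimeNanos(v int64) int64 {
	switch {
	case v < 1e11: // seconds
		return v * 1e9
	case v < 1e14: // milliseconds
		return v * 1e6
	case v < 1e17: // microseconds
		return v * 1e3
	default: // already nanoseconds
		return v
	}
}

func main() {
	seconds := int64(1_780_000_000)
	millis := int64(1_780_000_000_000) // same instant, larger unit

	// Raw comparison: once the ms row sets the watermark, every later
	// seconds row looks "older" and is filtered out.
	rawWatermark := millis
	fmt.Println("raw: later seconds row passes filter:", seconds+60 >= rawWatermark)

	// Normalized comparison: both rows land on one scale.
	normWatermark := normalizeEventTimeNanos(millis)
	fmt.Println("normalized: later seconds row passes filter:",
		normalizeEventTimeNanos(seconds+60) >= normWatermark)
}
```

Normalizing only addresses the mixed-unit case; out-of-order or future-dated rows in a single unit would still advance the watermark past legitimate rows, which is why the register lists further follow-ups under C3.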

## DX steering contract (what DX can rely on / control)

```mermaid
sequenceDiagram
participant DX
participant AE
participant Pixie
participant CH as forensic_db
Note over AE: anomaly (or DX referral) opens window [t_start, t_end=now+After]
loop every PushRefreshInterval until t_end OR DX stop (C15)
AE->>Pixie: PxL per table for (ns,pod), slice since last_upper
Pixie-->>AE: rows
AE->>CH: write rows (write ⊇ DX read, C14)
end
DX->>AE: StartExport / StopExport / extend t_end (control surface, CONTROL_ADDR)
Note over AE: stop ONLY on t_end or DX stop — never silently early (C15)
```
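
The window lifecycle in the sequence diagram can be sketched as a small state holder. This is an illustration of the C15 rule only (`t_end = now + After`, extended by each new anomaly, closed only by expiry or an explicit stop); the `window` type and its methods are hypothetical and do not mirror the AE controller's types.

```go
package main

import (
	"fmt"
	"time"
)

// window tracks one pod's export window (hypothetical type).
type window struct {
	tStart, tEnd time.Time
	after        time.Duration
	stopped      bool
}

// openWindow starts a window at event_time-Before and ends it at now+After.
func openWindow(eventTime, now time.Time, before, after time.Duration) *window {
	return &window{tStart: eventTime.Add(-before), tEnd: now.Add(after), after: after}
}

// extend moves t_end forward for a new anomaly; it never shortens the window.
func (w *window) extend(now time.Time) {
	if end := now.Add(w.after); end.After(w.tEnd) {
		w.tEnd = end
	}
}

// stop models an explicit DX StopExport.
func (w *window) stop() { w.stopped = true }

// active reports whether AE should still re-pull for this pod.
func (w *window) active(now time.Time) bool {
	return !w.stopped && now.Before(w.tEnd)
}

// scenario walks one window through open, expiry, extension and stop.
func scenario() [4]bool {
	t0 := time.Unix(1_780_000_000, 0)
	w := openWindow(t0, t0, 30*time.Second, 60*time.Second)
	var out [4]bool
	out[0] = w.active(t0.Add(59 * time.Second)) // inside the initial window
	out[1] = w.active(t0.Add(61 * time.Second)) // initial t_end has passed
	w.extend(t0.Add(50 * time.Second))          // anomaly at t0+50s moves t_end to t0+110s
	out[2] = w.active(t0.Add(61 * time.Second))
	w.stop() // explicit stop closes the window regardless of t_end
	out[3] = w.active(t0.Add(61 * time.Second))
	return out
}

func main() {
	fmt.Println(scenario())
}
```

In this model a dropped extension anomaly (failure mode (a) in C15) corresponds to `extend` never being called, so the window ends at the original `t_end`.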

- **DX controls:** (1) open/extend a window (each referral/anomaly extends `t_end`), (2) explicit **StopExport** via the control surface (`CONTROL_ADDR`, design rev-3 — confirm wired), (3) the active set (which pods AE over-captures).
- **DX relies on:** C5 (stable hash identity), C14 (write ⊇ read), **C15 (no premature stop)**, C9 (0 rows only when the workload is genuinely silent), C10 (the `ns/pod` ↔ bare join). For DX to steer dependably, C3/C8/C13/C15 must move from 🔴 to ✅.
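
The C10 join asymmetry that DX relies on is easy to get wrong on the consumer side, so a small sketch may help. `joinKey` and `matches` are hypothetical helpers that mirror the `concat(namespace,'/',pod)` rule from the register; they are not part of AE or DX.

```go
package main

import (
	"fmt"
	"strings"
)

// joinKey builds the "ns/pod" form used by the protocol (events) tables from
// the bare pod name stored in adaptive_attribution.
func joinKey(namespace, pod string) string {
	return namespace + "/" + pod
}

// matches reports whether an events-side pod value ("ns/pod") refers to the
// same workload as an attribution row (namespace plus bare pod).
func matches(eventsPod, attrNamespace, attrPod string) bool {
	ns, pod, ok := strings.Cut(eventsPod, "/")
	return ok && ns == attrNamespace && pod == attrPod
}

func main() {
	fmt.Println(joinKey("demo", "log4j-0"))
	fmt.Println(matches("demo/log4j-0", "demo", "log4j-0"))
	fmt.Println(matches("log4j-0", "demo", "log4j-0")) // bare value never matches
}
```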

## Legend
✅ enforced in code · ⚠️ implied (assumed, not checked) · 🔴 observed violated (fix noted).
Full repro + backlog: `FINDINGS_AND_BACKLOG.md`. The fixes for C3/C1 are on PR #53 (`ae-prod`).