docs: design for cacheable initial load (jupyter + server)#877

Open: paddymul wants to merge 23 commits into main from feat/initial-load-cache

@paddymul
Collaborator

Design-only PR (no code yet) — an RFC for review before implementation. Full doc: docs/initial-load-cache-design.md.

Problem

Buckaroo's first render re-runs the whole pipeline (sample → analyze → style → serialize the first window) on every mount / session open, even when neither the data nor the config changed. Hosts that can detect when the data changed (xorq build hashes, file content hashes) should be able to replay a cached first render without constructing the DataFrame or executing the expression.

Proposal — two functions

  • get_initial_cache_data(df, ...) -> (config_id, bundle) — runs the pipeline once, snapshots the first render (initial_state + first window + summary stats) into a JSON-serializable bundle.
  • populate_from_cache_data(bundle, ...) -> CachedInitial — serves the opening requests (initial_state + first infinite_request) entirely from the bundle. Touches no data.

Works for both jupyter (anywidget) and server buckaroo; replay is backend-agnostic (pandas / polars / xorq).

Key enabling fact

get_dfviewer_config(sd, df) reads only the summary dict plus column/index structure — never a row value (styling_core.py:422-473, customizations/styling.py:70-142). So df_display_args regenerates from merged_sd + a zero-row schema DataFrame, which is what lets styling and component config stay configurable at replay without ever touching the frame.

Boundaries

  • config_id keys only the data-touching computation (analysis classes, sampling, init_sd, skip cols). Display knobs (overrides, component_config, pinned_rows, theme) are replay-time and stay out of the key, so re-theming never invalidates the cache.
  • The caller owns data identity and reset; buckaroo returns the config half. (data_id, config_id) is the real cache key — that's the "infinite variation."
  • The cache serves only the opening window. Sort / search / scroll-past-window / cleaning ops fall back to the source by design, which keeps each bundle small.
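
The key boundary above can be illustrated with a minimal sketch. Everything in it is hypothetical (the `HostCache` class, its methods, and the string ids are not buckaroo APIs). It only shows that the host combines its own `data_id` with the `config_id` buckaroo returns, and that display knobs never enter the key.

```python
from typing import Any, Dict, Optional, Tuple

CacheKey = Tuple[str, str]  # (data_id, config_id)


class HostCache:
    """Hypothetical host-side cache; not part of buckaroo."""

    def __init__(self) -> None:
        self._bundles: Dict[CacheKey, Any] = {}

    def put(self, data_id: str, config_id: str, bundle: Any) -> None:
        self._bundles[(data_id, config_id)] = bundle

    def get(self, data_id: str, config_id: str) -> Optional[Any]:
        return self._bundles.get((data_id, config_id))


cache = HostCache()
cache.put("file-hash-abc", "cfg-1", {"first_window": "..."})

# Same data and same data-touching config: a hit, whatever theme is requested.
hit = cache.get("file-hash-abc", "cfg-1")
# The data changed (new content hash): the host's half of the key misses.
stale = cache.get("file-hash-def", "cfg-1")
```

Because theme, overrides, and pinned rows are applied at replay time, they do not appear anywhere in `CacheKey`.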

Scope

Additive — no behavior change to existing paths. One shared-code refactor (build_df_display_args lifted out of _handle_widget_change so the live path and the cache path use one assembly). Full scope / files / TDD build order in the doc.

Looking for review on: the config_id in/out boundary, and produce-time backend parity (pandas/polars/xorq bundle shapes) before I start implementation.

🤖 Generated with Claude Code

Adds a design doc for get_initial_cache_data / populate_from_cache_data: snapshot the first render (initial_state + first window + summary stats) and replay it without constructing the DataFrame or executing the expression. Styling and component config stay configurable at replay via merged_sd + a zero-row schema df. config_id keys only the data-touching computation; the caller owns data identity / reset.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 31, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26726112305

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.10.dev26726112305

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.10.dev26726112305" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

Make the component_config / column_config_overrides test concrete: byte-equal df_display_args vs a live ServerDataflow with the same knobs, frame raises if touched. Proves both are honored via the regeneration path, not just the prerendered fast path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…end)

Design review outcomes: (1) backend-provides / widget-validates handshake — widget computes its own config_id + live schema, match => hydrate without touching df/expr, mismatch => warn + recompute, never blind-trust; (2) stats stored as full merged_sd minus value_counts, lossless type-tagged parquet (no pickle), round-trip tested across pandas/polars/xorq; (3) server stats delivery unchanged (follow-ups #880 trim wire payload, #881 transport abstraction); (4) config_id keys data-touching computation only, display knobs replay-time. xorq desktop is the driving consumer.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Locks all grill decisions: xorq-desktop driver; backend-provides/widget-validates handshake keyed on get_expr_hash (server-managed store outside build dirs, no sidecar); three-layer cache above xorq snapshot cache; window+stats bundle (value_counts dropped, lossless type-tagged parquet codec, cross-backend tested); /load_expr default-on with POST opt-out; lazy-on-miss + prewarm + LRU; send-then-warm via add_callback (no async); correlation-id + /cache endpoint observability; Jupyter mechanism-only. Adds measured cost model (load ~17ms flat vs N+1 stat executions) as motivation. Follow-ups #880/#881.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lift the df_display_args assembly out of both CustomizableDataflow._handle_widget_change and BuckarooInfiniteWidget._handle_widget_change into a module-level build_df_display_args in styling_core. Behavior-preserving (full unit suite green, bar 3 pre-existing JS-build/uvx env failures); it's the shared assembler the initial-load cache replay path will reuse, regenerating df_display_args from merged_sd + a zero-row schema frame. dataflow_test imports merge_column_config from styling_core (its canonical home) instead of the incidental dataflow re-export the refactor dropped.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
config_fingerprint doesn't exist yet — collection fails with ModuleNotFoundError. Pins the contract: deterministic, hex, cross-process stable (subprocess test), sensitive to analysis-klass membership / version / init_sd / skip_stat_columns, and order-insensitive on skip columns. Fix follows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…l-load cache

New buckaroo.cache package + config_fingerprint: a blake2b digest over the data-touching config (analysis-klass module.qualname + optional per-class cache_version, sampling params, init_sd, skip_stat_columns, INITIAL_CACHE_VERSION). Cross-process stable (no id()), so a bundle built in one process validates the handshake in another. Display knobs are intentionally excluded. 7 tests green.
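
A minimal sketch of a cross-process-stable fingerprint of this kind, using the standard library's `hashlib.blake2b`. The function name, parameter names, digest size, and the value of `INITIAL_CACHE_VERSION` are assumptions for illustration, not the code in `buckaroo.cache`. It also assumes every config value is JSON-serializable and that class order is irrelevant.

```python
import hashlib
import json
from typing import Any, Dict, Iterable, Optional

INITIAL_CACHE_VERSION = 1  # assumed value, for illustration only


def config_fingerprint_sketch(
    klass_names: Iterable[str],
    sampling: Dict[str, Any],
    init_sd: Optional[Dict[str, Any]] = None,
    skip_stat_columns: Iterable[str] = (),
) -> str:
    """Hash only the data-touching config; display knobs are not inputs."""
    payload = {
        "version": INITIAL_CACHE_VERSION,
        # "module.qualname" strings, never id(), so the digest is stable
        # across processes. Sorting is an assumption of this sketch.
        "klasses": sorted(klass_names),
        "sampling": sampling,
        "init_sd": init_sd or {},
        # Sorted so the order of skip columns does not change the digest.
        "skip": sorted(skip_stat_columns),
    }
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.blake2b(canonical.encode("utf-8"), digest_size=16).hexdigest()


a = config_fingerprint_sketch(
    ["pkg.mod.TypingStats"], {"max_rows": 1000}, skip_stat_columns=["b", "a"]
)
b = config_fingerprint_sketch(
    ["pkg.mod.TypingStats"], {"max_rows": 1000}, skip_stat_columns=["a", "b"]
)
c = config_fingerprint_sketch(
    ["pkg.mod.TypingStats"], {"max_rows": 1000}, init_sd={"a": {"x": 1}}
)
```

Here `a` and `b` are equal (skip-column order is ignored) while `c` differs (init_sd is part of the key).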

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
serialize_sd/deserialize_sd don't exist yet — collection fails. Pins the contract: lossless round-trip of every backend value type (pd.Timestamp/Timedelta, stdlib datetime/date/time/timedelta, Decimal, bytes, numpy scalars, nan, histogram lists), value_counts dropped, MultiIndex orig_col_name tuple preserved, plus a real pandas-pipeline sd round-trip. Fix follows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Encode merged_sd to parquet without pickle via a type-tagged JSON envelope: JSON-native values pass through; pd.Timestamp/Timedelta, stdlib datetime/date/time/timedelta, Decimal, bytes, numpy scalars, NaN and MultiIndex tuples are tagged and reconstructed on decode. value_counts is dropped (not recomputed-from, not read by the frontend; see #880). numpy scalars decode to native Python (value-lossless). 11 cache tests green. Also folds in the paddy_format reformat of the test file that the failing-test commit predated.
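
The type-tagged envelope idea can be sketched with the standard library alone. This is not the buckaroo codec: the tag key `__t__`, the tag names, and the `encode` / `decode` helpers are hypothetical, only a few of the listed types are covered, and dropping value_counts and the parquet wrapping are left out.

```python
import base64
import json
import math
from datetime import date, datetime
from decimal import Decimal
from typing import Any

TAG = "__t__"  # hypothetical tag key; assumes no real stat uses this name


def encode(value: Any) -> Any:
    """Turn a summary-dict value into JSON-native data, tagging the rest."""
    if isinstance(value, float) and math.isnan(value):
        return {TAG: "nan"}
    if isinstance(value, datetime):  # before date: datetime subclasses date
        return {TAG: "datetime", "v": value.isoformat()}
    if isinstance(value, date):
        return {TAG: "date", "v": value.isoformat()}
    if isinstance(value, Decimal):
        return {TAG: "decimal", "v": str(value)}
    if isinstance(value, bytes):
        return {TAG: "bytes", "v": base64.b64encode(value).decode("ascii")}
    if isinstance(value, tuple):  # e.g. a MultiIndex column name
        return {TAG: "tuple", "v": [encode(v) for v in value]}
    if isinstance(value, list):
        return [encode(v) for v in value]
    if isinstance(value, dict):
        return {k: encode(v) for k, v in value.items()}
    return value  # JSON-native values pass through


def decode(value: Any) -> Any:
    """Inverse of encode: rebuild tagged values, recurse into containers."""
    if isinstance(value, list):
        return [decode(v) for v in value]
    if not isinstance(value, dict):
        return value
    tag = value.get(TAG)
    if tag is None:
        return {k: decode(v) for k, v in value.items()}
    if tag == "nan":
        return float("nan")
    if tag == "datetime":
        return datetime.fromisoformat(value["v"])
    if tag == "date":
        return date.fromisoformat(value["v"])
    if tag == "decimal":
        return Decimal(value["v"])
    if tag == "bytes":
        return base64.b64decode(value["v"])
    if tag == "tuple":
        return tuple(decode(v) for v in value["v"])
    raise ValueError(f"unknown tag: {tag!r}")


sd = {
    "a": {
        "min": Decimal("1.50"),
        "max": float("nan"),
        "first": datetime(2024, 1, 2, 3, 4, 5),
        "raw": b"\x00\xff",
        "orig_col_name": ("lvl0", "lvl1"),
        "histogram": [1, 2, 3],
    }
}
restored = decode(json.loads(json.dumps(encode(sd))))
```

The round trip goes through a plain JSON string, so nothing is pickled and the payload can be stored in any text-capable column.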

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Covers the backend-agnostic core (pandas only): get_initial_cache_data /
build_bundle_from_dataflow produce a bundle equal to a live dataflow's first
render; cache_mismatch_reason validates config_id + schema + version;
apply_initial_cache hydrates a target from the bundle alone and regenerates
df_display_args from a zero-row frame under replay-time overrides. Red until
buckaroo/cache/initial_cache.py lands.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
buckaroo/cache/initial_cache.py — the backend-agnostic core of the initial-load
cache:

- get_initial_cache_data(df) / build_bundle_from_dataflow: run the pipeline once
  and snapshot an InitialCacheData bundle (prerendered df_display_args, df_meta,
  type-tagged merged_sd parquet, first-window parquet, config_id, column_schema).
- cache_mismatch_reason: the validate-don't-trust handshake — recomputes the
  config_id from the live config and checks version + config_id + schema,
  returning None (safe) or a reason.
- apply_initial_cache: hydrate a target (widget / dataflow / session) from the
  bundle alone — no DataFrame, no execution. Regenerates df_display_args from a
  zero-row frame under replay-time overrides (styling is data-free).
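
As a rough illustration of the validate-don't-trust check described above, the sketch below compares a bundle header against values computed from the live side. `BundleHeader`, `SUPPORTED_VERSION`, and `mismatch_reason_sketch` are hypothetical names; the real cache_mismatch_reason signature is not shown in this PR text.

```python
from dataclasses import dataclass
from typing import Dict, Optional

SUPPORTED_VERSION = 1  # assumed value, for illustration only


@dataclass(frozen=True)
class BundleHeader:
    """Hypothetical subset of a bundle: just the fields the check needs."""

    version: int
    config_id: str
    column_schema: Dict[str, str]  # column name -> dtype string


def mismatch_reason_sketch(
    bundle: BundleHeader,
    live_config_id: str,
    live_schema: Dict[str, str],
) -> Optional[str]:
    """Return None when the bundle is safe to replay, else a reason string.

    live_config_id must be recomputed from the live config by the caller;
    it is never read back from the bundle itself.
    """
    if bundle.version != SUPPORTED_VERSION:
        return f"version {bundle.version} != {SUPPORTED_VERSION}"
    if bundle.config_id != live_config_id:
        return "config_id mismatch"
    if bundle.column_schema != live_schema:
        return "schema mismatch"
    return None


header = BundleHeader(version=1, config_id="abc", column_schema={"a": "int64"})
ok = mismatch_reason_sketch(header, "abc", {"a": "int64"})
bad_config = mismatch_reason_sketch(header, "xyz", {"a": "int64"})
bad_schema = mismatch_reason_sketch(header, "abc", {"a": "float64"})
```

A caller would hydrate from the bundle only when the result is `None`, and otherwise log the reason and recompute.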

Pandas dispatch only for now (via ServerDataflow); polars/xorq builders land
with the server integration. build_bundle_from_dataflow is what the server will
call against an already-built dataflow.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
InitialCacheStore — in-memory LRU over a persistent on-disk dir keyed by
data_id: put/get, LRU eviction (memory-only and disk-backed), disk persistence
across store instances, prewarm, report (for /cache), write_bundle/read_bundle
round-trip. Plus the serve_window_request predicate truth table. Red until
buckaroo/cache/store.py and serve_window_request land.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…redicate

buckaroo/cache/store.py — InitialCacheStore, an in-memory LRU over a persistent
on-disk directory keyed by data_id:
- put writes to memory + disk (manifest.json + sd.parquet + first_window.parquet,
  tmp-then-rename); get returns an in-memory hit or lazily faults from disk.
- LRU eviction drops only from memory; disk is the durable layer (evicted
  entries fault back in).
- prewarm loads every persisted bundle eagerly; report feeds /cache.
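
The memory-over-disk behaviour in the list above can be sketched as follows. This is a simplified stand-in, not InitialCacheStore: the class name is hypothetical, each bundle is a single JSON file rather than a manifest plus parquet files, and `data_id` is assumed to be filesystem-safe (for example a hex hash).

```python
import json
import os
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Any, Dict, Optional


class LruDiskStoreSketch:
    """In-memory LRU in front of a durable directory (illustrative only)."""

    def __init__(self, root: Path, max_entries: int = 2) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self.max_entries = max_entries
        self._mem: "OrderedDict[str, Dict[str, Any]]" = OrderedDict()

    def _path(self, data_id: str) -> Path:
        return self.root / f"{data_id}.json"

    def _remember(self, data_id: str, bundle: Dict[str, Any]) -> None:
        self._mem[data_id] = bundle
        self._mem.move_to_end(data_id)
        while len(self._mem) > self.max_entries:
            self._mem.popitem(last=False)  # evict from memory only

    def put(self, data_id: str, bundle: Dict[str, Any]) -> None:
        # Write to a temp file, then rename, so readers never see a partial file.
        fd, tmp = tempfile.mkstemp(dir=str(self.root), suffix=".tmp")
        with os.fdopen(fd, "w") as fh:
            json.dump(bundle, fh)
        os.replace(tmp, self._path(data_id))
        self._remember(data_id, bundle)

    def get(self, data_id: str) -> Optional[Dict[str, Any]]:
        if data_id in self._mem:
            self._mem.move_to_end(data_id)
            return self._mem[data_id]
        path = self._path(data_id)
        if not path.exists():
            return None
        bundle = json.loads(path.read_text())  # lazy fault from disk
        self._remember(data_id, bundle)
        return bundle


with tempfile.TemporaryDirectory() as d:
    store = LruDiskStoreSketch(Path(d), max_entries=1)
    store.put("aaa", {"n": 1})
    store.put("bbb", {"n": 2})  # pushes "aaa" out of memory
    evicted_from_memory = "aaa" not in store._mem
    faulted_back = store.get("aaa")  # read back from disk
    fresh = LruDiskStoreSketch(Path(d)).get("bbb")  # new instance, same dir
```

Eviction never deletes a file, which is what lets an evicted entry fault back in and lets a new store instance see earlier writes.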

buckaroo/cache/initial_cache.py — serve_window_request: the pure predicate for
the WS fast path. True only for the cached head slice (start==0, end<=window),
unsorted and unfiltered; everything else falls through to the live source.
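
Read literally, that predicate is a one-liner. The sketch below uses a hypothetical function name and argument names; how the real serve_window_request receives sort and search state is not shown here.

```python
from typing import Optional


def serve_window_request_sketch(
    start: int,
    end: int,
    window: int,
    sort: Optional[str] = None,
    search: Optional[str] = None,
) -> bool:
    """True only for an unsorted, unfiltered request inside the cached head."""
    return start == 0 and end <= window and not sort and not search


head = serve_window_request_sketch(0, 40, window=1000)
deeper = serve_window_request_sketch(1000, 1040, window=1000)
sorted_head = serve_window_request_sketch(0, 40, window=1000, sort="price")
searched_head = serve_window_request_sketch(0, 40, window=1000, search="abc")
```

Only `head` is true; the other three requests would fall through to the live source.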

Both pure Python — the /load_expr wiring + /cache endpoint that consume them
land with the server integration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
xorq-gated server integration (4b-i): /load_expr builds + stores an
InitialCacheData bundle keyed by the expr hash and reports a cache block
({status, data_id, request_id}); initial_cache:false skips the store; the
/cache endpoint reports stored bundles. Red until the store is wired into
make_app, the xorq bundle builder lands, and the /cache route is added.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- xorq_loading.expr_data_id (get_expr_hash) + build_xorq_bundle: the xorq
  counterpart of build_bundle_from_dataflow — first window via window_to_parquet,
  schema via expr.schema() (processed_df is an ibis expr, not a pandas frame).
- app.make_app: construct an InitialCacheStore (memory-only by default;
  initial_cache_dir for a persistent, prewarmed one) and register /cache.
- LoadExprHandler: default-on store of the bundle keyed by expr hash; echo a
  cache block {status, data_id, request_id}; initial_cache:false skips the
  store; request_id stamped on the log line. Best-effort — a cache failure
  never fails the load.
- CacheHandler (GET /cache): store introspection.

The hit fast path (serve-from-cache + serve_window) lands next.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A repeat /load_expr of the same expr (same content-based data_id) must report
cache.status=='hit' and still render (WS initial_state carries the bundle's
df_display_args + df_meta). A repeat with a different data-touching config
(init_sd) must report status=='mismatch' and recompute. Red until the hit
fast-path + handshake land in LoadExprHandler.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On a repeat /load_expr of the same expr the handshake (config_id recomputed from
the live klasses + request init_sd/skip, never read from the bundle) decides:
- hit  → build the dataflow with ALL stat columns skipped (the N+1 pipeline
         collapses to a single count(), ~100x cheaper) for scroll/sort/search,
         and serve the first paint + stats from the cached bundle via
         apply_initial_cache (byte-identical render).
- miss/mismatch → build normally, (re)store the bundle.
A mismatch warns + recomputes rather than mis-serving.

The bundle's first window is parked on the session; the WS handler's
serve_window fast path ships it for the head slice (unsorted, unfiltered)
without touching the expr — sorts/searches/deeper slices fall through to the
warmed dataflow. /load clears the window so a pandas load can't serve a stale
xorq slice.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…nded

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BuckarooWidgetBase should accept initial_cache=<bundle>: a matching bundle
hydrates the widget's display traits (proven via a sentinel tagged onto the
bundle's df_meta), a mismatch warns + keeps the computed values. Red until the
kwarg + handshake land in the widget __init__ chain.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BuckarooWidgetBase (and the Infinite/DFViewer subclasses) accept initial_cache=
<bundle>. After the dataflow builds, _maybe_apply_initial_cache runs the same
validate-don't-trust handshake the server uses (config_id from the widget's live
klasses/sampling + schema) and, on a match, replays the bundle onto the widget's
display traits via apply_initial_cache; a mismatch warns and keeps the computed
values.

Mechanism only — no Jupyter store/driver/prewarm (per scope). The widget already
built its dataflow, so this is for parity with the server path and future
Jupyter cache use, not a build skip.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@paddymul paddymul marked this pull request as ready for review May 31, 2026 21:44

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6106a650c0

Comment thread buckaroo/server/websocket_handler.py Outdated
Comment on lines +209 to +210
if window_parquet:
    self.write_message(window_parquet, binary=True)

P1 Badge Slice cached window to requested range

When a cache-hit session serves an unsorted head request whose end is smaller than DEFAULT_WINDOW (for example AG Grid's initial 0..40 block on a dataset with more than 40 rows), this sends the full cached 1000-row parquet frame while the response key still says end: 40. The client records rows against [start,end]; oversized payloads are rejected by SmartRowCache.addRows because resp.data.length does not match the requested segment, so the initial grid request can fail on any cached dataset larger than the requested block. The fast path needs to either only handle end == window/total or slice the cached parquet to pa.start..pa.end before writing it.

Collaborator Author

Fixed in d45dd9c. Confirmed: _serve_from_cache_if_window shipped the whole cached head window regardless of the requested [start, end], so once total_rows > window the client (SmartRowCache.addRows) dropped the oversized payload — row count != requested segment, and start + data.length != sentLength — and the initial getRows threw Missing rows. (Datasets ≤ window happened to survive via the tail-recursion branch, which is why the existing 10-row test passed.)

The fast path now slices the cached parquet to [start, end] before sending, via a new serialization_utils.slice_window_parquet (pq.read_table → table.slice → pq.write_table(compression='none')). That is byte-format-identical to the live window_to_parquet path the JS hyparquet reader already consumes, so only the row count changes; the response key and full length are untouched. An end past the window clamps to the rows available.

Coverage: unit tests for the slice helper, plus a WS test that asks a 10-row hit for [0,4] and asserts a 4-row binary frame.
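
The slice-and-clamp arithmetic can be shown without pyarrow. In this sketch a plain list stands in for the cached parquet table, the helper names are hypothetical, and `end` is treated as exclusive (an assumption based on the [0,4] request yielding 4 rows). pyarrow's `Table.slice(offset, length)` takes the same offset/length pair that `clamp_window` returns.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def clamp_window(start: int, end: int, cached_rows: int) -> Tuple[int, int]:
    """Return (offset, length) for the requested range, clamped to the cache."""
    offset = max(0, min(start, cached_rows))
    stop = max(offset, min(end, cached_rows))
    return offset, stop - offset


def slice_rows(rows: Sequence[T], start: int, end: int) -> List[T]:
    """Stand-in for slicing the cached window before it is sent."""
    offset, length = clamp_window(start, end, len(rows))
    return list(rows[offset:offset + length])


cached = list(range(10))  # a 10-row cached head window
first_block = slice_rows(cached, 0, 4)  # request smaller than the cache
past_the_end = clamp_window(0, 40, len(cached))  # request larger than the cache
```

The first call returns exactly the four requested rows, and the second clamps a 40-row request to the 10 rows actually cached.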

Comment thread buckaroo/server/handlers.py Outdated
Comment on lines +522 to +523
if hit:
    apply_initial_cache(session, bundle)

P2 Badge Replay cache with display overrides

On a cache hit this replays bundle.df_display_args without the current request's column_config_overrides or extra_grid_config, even though /load_expr accepts those knobs and they are intentionally excluded from the cache fingerprint. If a first load stores the baseline bundle and a later load of the same expr supplies column overrides, apply_initial_cache(session, bundle) overwrites the freshly built xorq_dataflow.df_display_args that did include the overrides, so the user sees stale/default column styling until the cache is bypassed. Pass the current df_display_klasses and display override arguments to apply_initial_cache as the widget path does.

Collaborator Author

Fixed in d45dd9c. Confirmed: the hit branch called apply_initial_cache(session, bundle) bare, so it replayed the bundle's baseline df_display_args and overwrote the freshly-built xorq_dataflow.df_display_args that did include this load's column_config_overrides / extra_grid_config (both excluded from the fingerprint by design). Result: a hit on the same expr with new display knobs rendered stale styling.

It now passes df_display_klasses plus the override args off the dataflow to apply_initial_cache, mirroring the widget path. Since the bundle carries the full merged_sd, the replay regenerates df_display_args from the zero-row frame with the current knobs applied (styling is data-free). component_config stays on the existing post-apply merge loop, consistent with the miss path.

Coverage: a WS test loads the same expr a second time (hit) with extra_grid_config and asserts it lands in df_viewer_config (was {}, now {"rowHeight": 99}).

Two Codex findings on #877:

* P1 (websocket_handler): the cache fast path ships the whole cached head
  window regardless of the requested [start,end], so the client's
  SmartRowCache.addRows rejects the payload once total_rows > window. New
  test asks a 10-row hit for [0,4] and asserts the binary frame holds 4 rows
  (currently 10).
* P2 (handlers): a cache hit replays the bundle's bare df_display_args,
  dropping the current request's display knobs (excluded from the cache
  fingerprint by design). New test loads with extra_grid_config on a hit and
  asserts it survives into df_viewer_config (currently {}).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses two Codex findings on #877.

P1 — websocket_handler: the initial-load cache fast path shipped the whole
cached head window (up to DEFAULT_WINDOW rows) regardless of the requested
[start, end]. AG Grid's first block is cacheBlockSize rows (visible + 50),
smaller than the window, and SmartRowCache.addRows rejects a payload whose
row count != the requested segment once total_rows > window — so the initial
paint failed on any cached dataset larger than one block. Add
serialization_utils.slice_window_parquet (pyarrow read → slice → write with
compression='none', byte-format-identical to the live window_to_parquet path)
and slice to [start, end] before sending; the response key + full length are
unchanged.

P2 — handlers: a cache hit replayed the bundle's baseline df_display_args via
a bare apply_initial_cache(session, bundle), dropping the current request's
column_config_overrides / extra_grid_config. Those knobs are excluded from the
cache fingerprint by design, so a hit on the same expr with different display
config showed stale styling. Pass the dataflow's df_display_klasses + override
args (mirroring the widget path) so the replay regenerates from the zero-row
frame with them; component_config stays on the existing merge loop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@paddymul
Collaborator Author

Addressed both Codex findings.

P1 — slice cached window to requested range (websocket_handler.py). Real bug. The initial-load cache fast path shipped the whole cached head window (≤ DEFAULT_WINDOW rows) regardless of the requested [start, end]. AG Grid's first block is cacheBlockSize (≈ visible + 50), smaller than the window, so once total_rows > window the client (SmartRowCache.addRows) dropped the oversized payload and the initial paint threw Missing rows. Fixed by slicing the cached parquet to [start, end] via a new serialization_utils.slice_window_parquet (pyarrow read → slice → write_table(compression='none'), byte-format-identical to the live window_to_parquet path). Response key + full length unchanged. — d45dd9c

P2 — replay cache with display overrides (handlers.py). Real bug. A cache hit replayed the bundle's baseline df_display_args via a bare apply_initial_cache(session, bundle), dropping this load's column_config_overrides / extra_grid_config (excluded from the fingerprint by design). Fixed by passing df_display_klasses + the override args off the dataflow (mirroring the widget path) so the replay regenerates from the zero-row frame with the current knobs; component_config stays on the existing merge loop. — d45dd9c

Both were committed test-first: failing tests in 4271282 (seen red on CI for Python 3.11/3.12/3.13), fix in d45dd9c (same jobs now green). New coverage: WS integration tests for the sub-window slice and the override replay, plus unit tests for slice_window_parquet.
