Skip to content

Add aggregate flag to /sql/measures/v3/ for raw-row output#2222

Open
jsogarro wants to merge 1 commit into
DataJunction:mainfrom
jsogarro:feature/jogarro/measures-v3-preaggregate-false
Open

Add aggregate flag to /sql/measures/v3/ for raw-row output#2222
jsogarro wants to merge 1 commit into
DataJunction:mainfrom
jsogarro:feature/jogarro/measures-v3-preaggregate-false

Conversation

@jsogarro
Copy link
Copy Markdown

@jsogarro jsogarro commented Jun 4, 2026

PR: Add aggregate flag to /sql/measures/v3/ for raw-row output

Target: DataJunction/dj, base branch main
Branch: feature/jogarro/measures-v3-preaggregate-false
Type: Draft


Summary

Adds an aggregate: bool = True query parameter to /sql/measures/v3/. The default is fully backward compatible. When aggregate=false, the endpoint:

  • Emits a flat SELECT with no GROUP BY
  • Projects raw column references using v2-style amenable aliases (<parent_node>_DOT_<column>) instead of synthetic <col>_sum_<hash> names
  • Bypasses pre-aggregation table matching (raw rows cannot be served from a pre-aggregated table)
  • Still populates metric_formulas[].combiner (pointing at the raw alias) so the caller can aggregate the raw output downstream

This is the v3 equivalent of v2's preaggregate=False mode. It exists for callers that apply their own aggregation downstream over the raw row-level output, where v3's server-side aggregation would double-count.

Why

A downstream consumer of /sql/measures/v2 (with preaggregate=false) hit a 7-8x metric over-count when migrating to /sql/measures/v3. The root cause: the consumer's downstream analysis layer applies its own aggregation over raw indicator values (e.g., SUM(is_retained_end) / COUNT(accounts) where is_retained_end ∈ {0, 1}), and v3's mandatory server-side aggregation stacked under that outer aggregation produced inflated sums. The metric SUM(is_retained_end) displayed as 0.9048 under v2 + preaggregate=false and 6.4107 under v3 — same query, same data, ~640% inflation purely from the doubled aggregation.

The consumer cannot reshape v3's output back to the v2 shape because v3's synthetic column naming and server-side GROUP BY destroy the per-row signal before it ever leaves DJ. They need v3 to support a mode that does not pre-aggregate.

Changes

Behavior:

  • New query parameter aggregate: bool = Query(True) on /sql/measures/v3/. Default preserves all existing behavior byte-for-byte.
  • New aggregate: bool = True field on BuildContext. Threaded through setup_build_context and build_measures_sql.
  • When ctx.aggregate=False:
    • Pre-aggregation matcher at measures.py:1846 is gated and never fires (raw rows cannot be served from a pre-agg).
    • Each component in build_grain_group_sql projects make_column_ref(component.expression) instead of build_component_expression(component) (no SUM/COUNT wrapping).
    • Component alias uses amenable_name(parent_node.name + "." + component.expression), producing v2-style names like v3_DOT_page_views_enriched_DOT_is_product_view.
    • skip_aggregation=True is passed to build_select_ast, which already has a working no-GROUP BY path for non-decomposable metrics.
    • metric_formulas[].combiner is populated with <aggregation>(<v2-style alias>) (e.g., SUM(v3_DOT_page_views_enriched_DOT_is_product_view)).

Out of scope (deferred to follow-up PRs):

  • /sql/measures/v3/combined — semantics of aggregate=false on the combined endpoint are not obvious (combined output requires alignment that aggregation provides).
  • /sql/preaggregations route — unchanged.
  • v2 endpoint — unchanged.
  • Python client (datajunction-clients/python/) — unchanged.

Design decisions

Decision Rationale
Parameter named aggregate, not preaggregate or use_preagg_tables preaggregate would collide with v2's identically-named param (different semantics on each endpoint). use_preagg_tables already exists on /combined with a narrower meaning (table sourcing only, not GROUP BY emission). aggregate names the actual behavior: when False, no server-side aggregation. Default-True preserves backward compat.
New BuildContext.aggregate field instead of reusing use_materialized use_materialized is dual-purpose: it controls both pre-agg table matching (preagg_matcher.py:72-73) AND parent-transform materialization (materialization.py:112-182, cte.py:2012,2053, measures.py:1463). Flipping use_materialized=False to bypass the matcher would also force every parent transform to be recomputed via CTE instead of read from its materialized table — large perf regression. New field isolates the concern.
Single early-exit projection branch covers all aggregability levels Simpler diff than threading the flag through each existing aggregability-specific branch (FULL/LIMITED/NONE/merged). Reuses the same make_column_ref(component.expression) pattern the existing NONE-aggregability path uses at measures.py:1902-1911. Trade-off: aggregability-specific paths are short-circuited entirely in raw mode, which is the correct semantics for the target use case (callers want raw rows regardless of aggregability level).
Column alias scheme amenable_name(parent_node.name + SEPARATOR + component.expression) Reuses the v2 alias convention defined in naming.py:38-50. Output is <parent_node>_DOT_<column> (e.g., v3_DOT_page_views_enriched_DOT_is_product_view), matching the v2 column naming that the v2-pinned consumer already consumes. Minimizes downstream migration work.
Matcher bypass via 1-line guard at the single call site The pre-aggregation matcher is called from exactly one place (measures.py:1846). A method on BuildContext would add indirection without reducing duplication. The inline guard with a comment explaining why use_materialized is intentionally left alone is clearer.

Test plan

5 new test cases in tests/construction/build_v3/measures_sql_test.py:

Test Asserts
TestMeasuresSQLAggregateFalse::test_binary_indicator_no_group_by SUM(is_product_view) on a 0/1 indicator with aggregate=False: SQL projects raw t1.is_product_view with v2-style alias, no GROUP BY, combiner is SUM(<raw_alias>). Regression test for the production over-count.
TestMeasuresSQLAggregateFalse::test_additive_numeric_metric SUM(line_total) on aggregate=False: same flat-SELECT / raw-column shape.
TestMeasuresSQLAggregateFalse::test_multi_metric_preserves_each_combiner Multi-metric request (v3.total_revenue, v3.max_unit_price, v3.min_unit_price): each metric's combiner is preserved and references its raw v2-style column alias.
TestMeasuresSQLAggregateFalse::test_default_aggregate_true_unchanged Backward-compat guard: default (no param) and explicit aggregate=True produce identical responses with GROUP BY present and synthetic hash aliases.
TestMeasuresSQLAggregateFalseEndpoint::test_route_aggregate_false_full_response_shape Route-level integration: top-level shape preserved (grain_groups, metric_formulas, dialect, requested_dimensions); per-component fields reflect raw-row semantics.

Verified test results:

tests/construction/build_v3/measures_sql_test.py ...... 126 passed   (5 new + 121 existing)
tests/construction/build_v3/                    ...... 544 passed   (entire build_v3 suite)
tests/api/cubes_test.py (v3 subset)             ...... 2 passed
pre-commit run (ruff lint + format, mypy, AST, ...) .. all hooks passed

No existing test was modified; no regression observed across the v3 construction or v3 cube tests.

Backward compatibility

  • The new query parameter defaults to True.
  • Threading through setup_build_context / build_measures_sql uses keyword arguments with True defaults, so any existing caller of these helpers compiles unchanged.
  • The matcher gate at measures.py:1846 adds one and ctx.aggregate term to the existing condition; when aggregate=True (default), the condition is identical to before.
  • The new projection branch only executes when not ctx.aggregate — when aggregate=True, control falls through to the existing aggregability-specific branches unchanged.

Rollback

Revert the merge commit. No data migration, no config change, no feature flag — the new parameter is the only behavior toggle and defaults to current behavior.

Notes for reviewers

  • The semantic_entity field on component columns still references the synthetic <col>_sum_<hash> name when aggregate=False (e.g., v3.product_view_count:is_product_view_sum_eb3a4b41). This is intentional — semantic_entity is for DJ's internal identity matching and is independent of the projected column name. Happy to revisit if reviewers prefer it to track the v2-style alias.
  • Empty grain and components arrays are not produced under aggregate=False; the response keeps grain populated with the requested dimensions and components populated with the (renamed) raw-column entries. The plan we wrote internally initially considered emptying both fields, but the test output showed that preserving them gives the caller more useful metadata without breaking any existing consumer assumption.
  • The new parameter is intentionally scoped to /sql/measures/v3/. /sql/measures/v3/combined is not in scope; semantics of "raw rows across multiple aligned grain groups" are not obvious and deserve a separate design discussion.

Manual Test Plan

Prerequisites

  • Clone DataJunction/dj (or the relevant fork) and check out feature/jogarro/measures-v3-preaggregate-false.
  • Install: cd datajunction-server && uv sync --group test --extra transpilation.
  • Docker is running (required by the test containers fixture).

Test Steps

Step 1: Run the new test class

cd datajunction-server
uv run pytest \
  "tests/construction/build_v3/measures_sql_test.py::TestMeasuresSQLAggregateFalse" \
  "tests/construction/build_v3/measures_sql_test.py::TestMeasuresSQLAggregateFalseEndpoint" \
  -vv

Expected: 5 passed, 0 failed.

Step 2: Run the full measures_sql test file (regression check)

uv run pytest tests/construction/build_v3/measures_sql_test.py -q

Expected: 126 passed (121 existing + 5 new), 0 failed.

Step 3: Run the entire build_v3 test directory

uv run pytest tests/construction/build_v3/ -q

Expected: 544 passed, 1 xfailed (pre-existing).

Step 4: Pre-commit / lint

cd ..
uv run pre-commit run --files \
  datajunction-server/datajunction_server/api/sql.py \
  datajunction-server/datajunction_server/construction/build_v3/builder.py \
  datajunction-server/datajunction_server/construction/build_v3/measures.py \
  datajunction-server/datajunction_server/construction/build_v3/types.py \
  datajunction-server/tests/construction/build_v3/measures_sql_test.py

Expected: All hooks pass.

Verification Queries

After standing up the patched server locally, call:

# Default (aggregate=True) - existing behavior
curl -s "http://localhost:8000/sql/measures/v3/?metrics=v3.product_view_count&dimensions=v3.page_views_enriched.page_date" | jq '.grain_groups[0].sql'
# Expected: includes "GROUP BY", projects "SUM(t1.is_product_view) is_product_view_sum_<hash>"

# Raw mode (aggregate=False) - new behavior
curl -s "http://localhost:8000/sql/measures/v3/?metrics=v3.product_view_count&dimensions=v3.page_views_enriched.page_date&aggregate=false" | jq '.grain_groups[0].sql'
# Expected: NO "GROUP BY", projects "t1.is_product_view v3_DOT_page_views_enriched_DOT_is_product_view"

# Combiner is still populated under aggregate=False
curl -s "http://localhost:8000/sql/measures/v3/?metrics=v3.product_view_count&dimensions=v3.page_views_enriched.page_date&aggregate=false" | jq '.metric_formulas[0].combiner'
# Expected: "SUM(v3_DOT_page_views_enriched_DOT_is_product_view)"

Edge Cases

  • aggregate=false with multiple metrics from the same parent fact (verified by test_multi_metric_preserves_each_combiner).
  • aggregate=false combined with use_materialized=true — matcher bypass takes precedence (verified: matcher gate is ctx.aggregate and ctx.use_materialized and ctx.available_preaggs).
  • aggregate=true (default) is identical to pre-PR behavior (verified by test_default_aggregate_true_unchanged).
  • aggregate=false with no dimensions (global raw projection) — verified by test_multi_metric_preserves_each_combiner (calls with dimensions=[]).

Cleanup

No external resources allocated. Test container is torn down by the pytest fixture.

Adds aggregate: bool = True query parameter to /sql/measures/v3/. When
False, the endpoint emits a flat SELECT with no GROUP BY, projects raw
column references using v2-style amenable aliases
(<parent_node>_DOT_<column>), and bypasses pre-aggregation table
matching. metric_formulas[].combiner is still populated so callers can
re-aggregate the raw output downstream. Default True preserves all
existing behavior.

This is the v3 equivalent of v2's preaggregate=False mode. It exists
for callers that apply their own aggregation downstream over the raw
row-level output, where v3's mandatory server-side aggregation would
double-count.

Implementation:
- New BuildContext.aggregate field, threaded through setup_build_context
  and build_measures_sql.
- Matcher gate at measures.py:1846 adds 'ctx.aggregate and' to the
  existing condition. use_materialized is intentionally NOT reused
  because it is dual-purpose: it also controls parent-transform
  materialization via should_use_materialized_table.
- Single early-exit projection branch in build_grain_group_sql handles
  all aggregability levels uniformly when ctx.aggregate is False.
- skip_aggregation=True is OR'd with the existing condition in
  build_select_ast, reusing the working no-GROUP-BY path.

Tests: 5 new tests covering binary indicator, additive numeric,
multi-metric combiner preservation, default backward compat, and
route-level integration. All 126 measures_sql_test.py tests pass
(121 existing + 5 new), all 544 build_v3 tests pass, pre-commit
clean.
@netlify
Copy link
Copy Markdown

netlify Bot commented Jun 4, 2026

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 72fc77b
🔍 Latest deploy log https://app.netlify.com/projects/thriving-cassata-78ae72/deploys/6a219d81a2a84700088898c2

@shangyian
Copy link
Copy Markdown
Collaborator

@jsogarro thanks for adding this -- it might be good to merge this anyway as a fallback, even if it's not used immediately. I'll go through it more closely later today

@jsogarro jsogarro marked this pull request as ready for review June 4, 2026 19:03
@jsogarro
Copy link
Copy Markdown
Author

jsogarro commented Jun 4, 2026

@jsogarro thanks for adding this -- it might be good to merge this anyway as a fallback, even if it's not used immediately. I'll go through it more closely later today

Sounds good. Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants