Add aggregate flag to /sql/measures/v3/ for raw-row output by jsogarro · Pull Request #2222 · DataJunction/dj

jsogarro · 2026-06-04T15:45:02Z

PR: Add `aggregate` flag to `/sql/measures/v3/` for raw-row output

Target: DataJunction/dj, base branch main
Branch: feature/jogarro/measures-v3-preaggregate-false
Type: Draft

Summary

Adds an aggregate: bool = True query parameter to /sql/measures/v3/. The default is fully backward compatible. When aggregate=false, the endpoint:

Emits a flat SELECT with no GROUP BY
Projects raw column references using v2-style amenable aliases (<parent_node>_DOT_<column>) instead of synthetic <col>_sum_<hash> names
Bypasses pre-aggregation table matching (raw rows cannot be served from a pre-aggregated table)
Still populates metric_formulas[].combiner (pointing at the raw alias) so the caller can aggregate the raw output downstream

This is the v3 equivalent of v2's preaggregate=False mode. It exists for callers that apply their own aggregation downstream over the raw row-level output, where v3's server-side aggregation would double-count.

Why

A downstream consumer of /sql/measures/v2 (with preaggregate=false) hit a 7-8x metric over-count when migrating to /sql/measures/v3. The root cause: the consumer's downstream analysis layer applies its own aggregation over raw indicator values (e.g., SUM(is_retained_end) / COUNT(accounts) where is_retained_end ∈ {0, 1}), and v3's mandatory server-side aggregation stacked under that outer aggregation produced inflated sums. The metric SUM(is_retained_end) displayed as 0.9048 under v2 + preaggregate=false and 6.4107 under v3 — same query, same data, ~640% inflation purely from the doubled aggregation.

The consumer cannot reshape v3's output back to the v2 shape because v3's synthetic column naming and server-side GROUP BY destroy the per-row signal before it ever leaves DJ. They need v3 to support a mode that does not pre-aggregate.

Changes

Behavior:

New query parameter aggregate: bool = Query(True) on /sql/measures/v3/. Default preserves all existing behavior byte-for-byte.
New aggregate: bool = True field on BuildContext. Threaded through setup_build_context and build_measures_sql.
When ctx.aggregate=False:
- Pre-aggregation matcher at measures.py:1846 is gated and never fires (raw rows cannot be served from a pre-agg).
- Each component in build_grain_group_sql projects make_column_ref(component.expression) instead of build_component_expression(component) (no SUM/COUNT wrapping).
- Component alias uses amenable_name(parent_node.name + "." + component.expression), producing v2-style names like v3_DOT_page_views_enriched_DOT_is_product_view.
- skip_aggregation=True is passed to build_select_ast, which already has a working no-GROUP BY path for non-decomposable metrics.
- metric_formulas[].combiner is populated with <aggregation>(<v2-style alias>) (e.g., SUM(v3_DOT_page_views_enriched_DOT_is_product_view)).

Out of scope (deferred to follow-up PRs):

/sql/measures/v3/combined — semantics of aggregate=false on the combined endpoint are not obvious (combined output requires alignment that aggregation provides).
/sql/preaggregations route — unchanged.
v2 endpoint — unchanged.
Python client (datajunction-clients/python/) — unchanged.

Design decisions

Decision	Rationale
Parameter named `aggregate`, not `preaggregate` or `use_preagg_tables`	`preaggregate` would collide with v2's identically-named param (different semantics on each endpoint). `use_preagg_tables` already exists on `/combined` with a narrower meaning (table sourcing only, not GROUP BY emission). `aggregate` names the actual behavior: when False, no server-side aggregation. Default-True preserves backward compat.
New `BuildContext.aggregate` field instead of reusing `use_materialized`	`use_materialized` is dual-purpose: it controls both pre-agg table matching (`preagg_matcher.py:72-73`) AND parent-transform materialization (`materialization.py:112-182`, `cte.py:2012,2053`, `measures.py:1463`). Flipping `use_materialized=False` to bypass the matcher would also force every parent transform to be recomputed via CTE instead of read from its materialized table — large perf regression. New field isolates the concern.
Single early-exit projection branch covers all aggregability levels	Simpler diff than threading the flag through each existing aggregability-specific branch (FULL/LIMITED/NONE/merged). Reuses the same `make_column_ref(component.expression)` pattern the existing NONE-aggregability path uses at `measures.py:1902-1911`. Trade-off: aggregability-specific paths are short-circuited entirely in raw mode, which is the correct semantics for the target use case (callers want raw rows regardless of aggregability level).
Column alias scheme `amenable_name(parent_node.name + SEPARATOR + component.expression)`	Reuses the v2 alias convention defined in `naming.py:38-50`. Output is `<parent_node>_DOT_<column>` (e.g., `v3_DOT_page_views_enriched_DOT_is_product_view`), matching the v2 column naming that the v2-pinned consumer already consumes. Minimizes downstream migration work.
Matcher bypass via 1-line guard at the single call site	The pre-aggregation matcher is called from exactly one place (`measures.py:1846`). A method on `BuildContext` would add indirection without reducing duplication. The inline guard with a comment explaining why `use_materialized` is intentionally left alone is clearer.

Test plan

5 new test cases in tests/construction/build_v3/measures_sql_test.py:

Test	Asserts
`TestMeasuresSQLAggregateFalse::test_binary_indicator_no_group_by`	`SUM(is_product_view)` on a 0/1 indicator with `aggregate=False`: SQL projects raw `t1.is_product_view` with v2-style alias, no `GROUP BY`, `combiner` is `SUM(<raw_alias>)`. Regression test for the production over-count.
`TestMeasuresSQLAggregateFalse::test_additive_numeric_metric`	`SUM(line_total)` on `aggregate=False`: same flat-SELECT / raw-column shape.
`TestMeasuresSQLAggregateFalse::test_multi_metric_preserves_each_combiner`	Multi-metric request (`v3.total_revenue`, `v3.max_unit_price`, `v3.min_unit_price`): each metric's combiner is preserved and references its raw v2-style column alias.
`TestMeasuresSQLAggregateFalse::test_default_aggregate_true_unchanged`	Backward-compat guard: default (no param) and explicit `aggregate=True` produce identical responses with `GROUP BY` present and synthetic hash aliases.
`TestMeasuresSQLAggregateFalseEndpoint::test_route_aggregate_false_full_response_shape`	Route-level integration: top-level shape preserved (`grain_groups`, `metric_formulas`, `dialect`, `requested_dimensions`); per-component fields reflect raw-row semantics.

Verified test results:

tests/construction/build_v3/measures_sql_test.py ...... 126 passed   (5 new + 121 existing)
tests/construction/build_v3/                    ...... 544 passed   (entire build_v3 suite)
tests/api/cubes_test.py (v3 subset)             ...... 2 passed
pre-commit run (ruff lint + format, mypy, AST, ...) .. all hooks passed

No existing test was modified; no regression observed across the v3 construction or v3 cube tests.

Backward compatibility

The new query parameter defaults to True.
Threading through setup_build_context / build_measures_sql uses keyword arguments with True defaults, so any existing caller of these helpers compiles unchanged.
The matcher gate at measures.py:1846 adds one and ctx.aggregate term to the existing condition; when aggregate=True (default), the condition is identical to before.
The new projection branch only executes when not ctx.aggregate — when aggregate=True, control falls through to the existing aggregability-specific branches unchanged.

Rollback

Revert the merge commit. No data migration, no config change, no feature flag — the new parameter is the only behavior toggle and defaults to current behavior.

Notes for reviewers

The semantic_entity field on component columns still references the synthetic <col>_sum_<hash> name when aggregate=False (e.g., v3.product_view_count:is_product_view_sum_eb3a4b41). This is intentional — semantic_entity is for DJ's internal identity matching and is independent of the projected column name. Happy to revisit if reviewers prefer it to track the v2-style alias.
Empty grain and components arrays are not produced under aggregate=False; the response keeps grain populated with the requested dimensions and components populated with the (renamed) raw-column entries. The plan we wrote internally initially considered emptying both fields, but the test output showed that preserving them gives the caller more useful metadata without breaking any existing consumer assumption.
The new parameter is intentionally scoped to /sql/measures/v3/. /sql/measures/v3/combined is not in scope; semantics of "raw rows across multiple aligned grain groups" are not obvious and deserve a separate design discussion.

Manual Test Plan

Prerequisites

Clone DataJunction/dj (or the relevant fork) and check out feature/jogarro/measures-v3-preaggregate-false.
Install: cd datajunction-server && uv sync --group test --extra transpilation.
Docker is running (required by the test containers fixture).

Test Steps

Step 1: Run the new test class

cd datajunction-server
uv run pytest \
  "tests/construction/build_v3/measures_sql_test.py::TestMeasuresSQLAggregateFalse" \
  "tests/construction/build_v3/measures_sql_test.py::TestMeasuresSQLAggregateFalseEndpoint" \
  -vv

Expected: 5 passed, 0 failed.

Step 2: Run the full measures_sql test file (regression check)

uv run pytest tests/construction/build_v3/measures_sql_test.py -q

Expected: 126 passed (121 existing + 5 new), 0 failed.

Step 3: Run the entire build_v3 test directory

uv run pytest tests/construction/build_v3/ -q

Expected: 544 passed, 1 xfailed (pre-existing).

Step 4: Pre-commit / lint

cd ..
uv run pre-commit run --files \
  datajunction-server/datajunction_server/api/sql.py \
  datajunction-server/datajunction_server/construction/build_v3/builder.py \
  datajunction-server/datajunction_server/construction/build_v3/measures.py \
  datajunction-server/datajunction_server/construction/build_v3/types.py \
  datajunction-server/tests/construction/build_v3/measures_sql_test.py

Expected: All hooks pass.

Verification Queries

After standing up the patched server locally, call:

# Default (aggregate=True) - existing behavior
curl -s "http://localhost:8000/sql/measures/v3/?metrics=v3.product_view_count&dimensions=v3.page_views_enriched.page_date" | jq '.grain_groups[0].sql'
# Expected: includes "GROUP BY", projects "SUM(t1.is_product_view) is_product_view_sum_<hash>"

# Raw mode (aggregate=False) - new behavior
curl -s "http://localhost:8000/sql/measures/v3/?metrics=v3.product_view_count&dimensions=v3.page_views_enriched.page_date&aggregate=false" | jq '.grain_groups[0].sql'
# Expected: NO "GROUP BY", projects "t1.is_product_view v3_DOT_page_views_enriched_DOT_is_product_view"

# Combiner is still populated under aggregate=False
curl -s "http://localhost:8000/sql/measures/v3/?metrics=v3.product_view_count&dimensions=v3.page_views_enriched.page_date&aggregate=false" | jq '.metric_formulas[0].combiner'
# Expected: "SUM(v3_DOT_page_views_enriched_DOT_is_product_view)"

Edge Cases

aggregate=false with multiple metrics from the same parent fact (verified by test_multi_metric_preserves_each_combiner).
aggregate=false combined with use_materialized=true — matcher bypass takes precedence (verified: matcher gate is ctx.aggregate and ctx.use_materialized and ctx.available_preaggs).
aggregate=true (default) is identical to pre-PR behavior (verified by test_default_aggregate_true_unchanged).
aggregate=false with no dimensions (global raw projection) — verified by test_multi_metric_preserves_each_combiner (calls with dimensions=[]).

Cleanup

No external resources allocated. Test container is torn down by the pytest fixture.

Adds aggregate: bool = True query parameter to /sql/measures/v3/. When False, the endpoint emits a flat SELECT with no GROUP BY, projects raw column references using v2-style amenable aliases (<parent_node>_DOT_<column>), and bypasses pre-aggregation table matching. metric_formulas[].combiner is still populated so callers can re-aggregate the raw output downstream. Default True preserves all existing behavior. This is the v3 equivalent of v2's preaggregate=False mode. It exists for callers that apply their own aggregation downstream over the raw row-level output, where v3's mandatory server-side aggregation would double-count. Implementation: - New BuildContext.aggregate field, threaded through setup_build_context and build_measures_sql. - Matcher gate at measures.py:1846 adds 'ctx.aggregate and' to the existing condition. use_materialized is intentionally NOT reused because it is dual-purpose: it also controls parent-transform materialization via should_use_materialized_table. - Single early-exit projection branch in build_grain_group_sql handles all aggregability levels uniformly when ctx.aggregate is False. - skip_aggregation=True is OR'd with the existing condition in build_select_ast, reusing the working no-GROUP-BY path. Tests: 5 new tests covering binary indicator, additive numeric, multi-metric combiner preservation, default backward compat, and route-level integration. All 126 measures_sql_test.py tests pass (121 existing + 5 new), all 544 build_v3 tests pass, pre-commit clean.

netlify · 2026-06-04T15:45:09Z

✅ Deploy Preview for thriving-cassata-78ae72 canceled.

Name	Link
🔨 Latest commit	`72fc77b`
🔍 Latest deploy log	https://app.netlify.com/projects/thriving-cassata-78ae72/deploys/6a219d81a2a84700088898c2

shangyian · 2026-06-04T19:03:10Z

@jsogarro thanks for adding this -- it might be good to merge this anyway as a fallback, even if it's not used immediately. I'll go through it more closely later today

jsogarro · 2026-06-04T19:03:55Z

@jsogarro thanks for adding this -- it might be good to merge this anyway as a fallback, even if it's not used immediately. I'll go through it more closely later today

Sounds good. Thank you!

jsogarro marked this pull request as ready for review June 4, 2026 19:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add aggregate flag to /sql/measures/v3/ for raw-row output#2222

Add aggregate flag to /sql/measures/v3/ for raw-row output#2222
jsogarro wants to merge 1 commit into
DataJunction:mainfrom
jsogarro:feature/jogarro/measures-v3-preaggregate-false

jsogarro commented Jun 4, 2026

Uh oh!

netlify Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

shangyian commented Jun 4, 2026

Uh oh!

jsogarro commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jsogarro commented Jun 4, 2026

PR: Add aggregate flag to /sql/measures/v3/ for raw-row output

Summary

Why

Changes

Design decisions

Test plan

Backward compatibility

Rollback

Notes for reviewers

Manual Test Plan

Prerequisites

Test Steps

Verification Queries

Edge Cases

Cleanup

Uh oh!

netlify Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for thriving-cassata-78ae72 canceled.

Uh oh!

shangyian commented Jun 4, 2026

Uh oh!

jsogarro commented Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PR: Add `aggregate` flag to `/sql/measures/v3/` for raw-row output

netlify Bot commented Jun 4, 2026 •

edited

Loading