feat(reader): surface per-zone stats from the zone-map table (ADR 0013 §6) #181
Merged
62a74d4 to 778f863
…3 §6)

Add ScanIterator.columnZoneStats(col): one ArrayStats per zone with min/max/sum/null-count, the read-side feed for aggregate push-down. Sum is decoded from the column's vortex.stats zone-map table rather than per-flat node stats. This matches Rust, whose flat writer retains only pre-computed stats (flat/writer.rs) and emits SUM only in the zoned table (zoned/writer.rs). Falls back to per-chunk node stats (sum null) when a column has no zone map.

- ArrayStats gains a sum component; fromFbs decodes it (forward-compat).
- ZonedStatsSchema moves inspector -> reader so the read path can reconstruct the stats-table dtype; cli/inspector imports updated.
- VortexWriter is unchanged functionally (comment only); sum continues to live in the zone-map table.

Calcite VortexAggregates.SUM/AVG now fold the per-zone sums instead of a full scan: metadata-only when every zone carries a sum, falling back to a streaming scan only when a column has no zone map.

Verified both interop directions, incl. a new test folding per-zone sums from a Rust-written file to the exact column total.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
778f863 to 3df5b7d
… ref

columnZoneStats javadoc overstated alignment: zone-map rows align with chunkRowCounts() only on the fallback path and for files this writer produces (one zone per chunk). A foreign writer may use a fixed zone length independent of chunk boundaries, so the zone count need not match. Reword to scope the guarantee.

Also fix VortexWriter's stale [io.github.dfa1.vortex.inspect] reference to ZonedStatsSchema: the class moved to the reader package in this branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
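The alignment caveat in this commit message can be made concrete with a little arithmetic. The sketch below is only an illustration: `ZoneLayout`, `fixedLengthZoneCount`, the two-chunk layout, and the zone length of 8192 rows are all hypothetical values chosen for the example, not taken from this PR or from any writer's actual configuration.

```java
final class ZoneLayout {

    // Number of zones when a writer cuts zones at a fixed row length,
    // independent of chunk boundaries (ceiling division).
    static long fixedLengthZoneCount(long totalRows, long zoneLength) {
        if (totalRows < 0 || zoneLength <= 0) {
            throw new IllegalArgumentException("totalRows must be >= 0 and zoneLength > 0");
        }
        return (totalRows + zoneLength - 1) / zoneLength;
    }

    public static void main(String[] args) {
        // Hypothetical file: 200,000 rows stored as two chunks.
        long[] chunkRowCounts = {100_000L, 100_000L};
        long totalRows = 0L;
        for (long rows : chunkRowCounts) {
            totalRows += rows;
        }

        // Hypothetical foreign writer using a fixed zone length of 8192 rows.
        long zones = fixedLengthZoneCount(totalRows, 8_192L);

        System.out.println("chunks = " + chunkRowCounts.length); // chunks = 2
        System.out.println("zones  = " + zones);                 // zones  = 25
    }
}
```

With one zone per chunk the two counts would both be 2; with a fixed zone length they diverge, which is why a caller should not index zone stats by chunk position unless it knows how the file was written.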
What
Adds the read-side surface for ADR 0013 §6 aggregate push-down: a way to read per-zone statistics without decoding data segments.
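As a rough illustration of what aggregate push-down over per-zone statistics looks like, the sketch below folds per-zone sums into a column total without touching any data. Everything in it is an assumption made for the example: `ZoneSumFold`, `ZoneStats`, and `foldSum` are hypothetical names, and `ZoneStats` is a simplified stand-in, not the project's `ArrayStats`. A null sum models a zone whose stats carry no sum, in which case the caller would have to fall back to scanning.

```java
import java.util.List;
import java.util.OptionalLong;

final class ZoneSumFold {

    // Hypothetical stand-in for one zone's statistics; NOT the project's ArrayStats.
    // A null sum models a zone whose stats carry no sum.
    record ZoneStats(long min, long max, Long sum, long nullCount) {}

    // Folds per-zone sums into a column total using metadata only.
    // Returns empty when there are no zones or any zone lacks a sum,
    // which signals the caller to fall back to a streaming scan.
    static OptionalLong foldSum(List<ZoneStats> zones) {
        if (zones.isEmpty()) {
            return OptionalLong.empty();
        }
        long total = 0L;
        for (ZoneStats zone : zones) {
            Long sum = zone.sum();
            if (sum == null) {
                return OptionalLong.empty();
            }
            total = Math.addExact(total, sum.longValue()); // throws on long overflow
        }
        return OptionalLong.of(total);
    }

    public static void main(String[] args) {
        List<ZoneStats> complete = List.of(
                new ZoneStats(1, 5, 15L, 0),
                new ZoneStats(2, 9, 27L, 1));
        List<ZoneStats> missingSum = List.of(
                new ZoneStats(1, 5, 15L, 0),
                new ZoneStats(2, 9, null, 1));

        System.out.println(foldSum(complete));   // OptionalLong[42]
        System.out.println(foldSum(missingSum)); // OptionalLong.empty
    }
}
```

Returning an empty `OptionalLong` rather than a partial total keeps the metadata-only path all-or-nothing, so a caller cannot mistake an incomplete fold for the column sum.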
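- ScanIterator.columnZoneStats(col): returns one ArrayStats per zone (min/max/sum/null-count). The result is positionally aligned with chunkRowCounts() on the fallback path and for files this writer produces (one zone per chunk); a foreign writer may use a fixed zone length independent of chunk boundaries, so the zone count need not match. This is the whole-zone tier that a reduce kernel folds; boundary zones fall back to streaming decode (future work).
- ArrayStats gains a sum component; fromFbs decodes it (forward-compat).
- ZonedStatsSchema moves inspector → reader so the read path can reconstruct the stats-table dtype. cli/inspector imports updated; test moved with it.

Where sum comes from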
Sum is decoded from the column's vortex.stats zone-map table, not from per-flat node stats. This matches Rust, whose flat writer retains only pre-computed stats (flat/writer.rs) and emits SUM only in the zoned table (zoned/writer.rs). When a column has no zone map, columnZoneStats falls back to per-chunk node stats (sum null). VortexWriter is unchanged functionally (comment only): sum already lived in the zone-map table.

Tests
- ColumnZoneStatsTest (writer): per-zone min/max/sum/null-count, whole-zone SUM fold, float→Double sums, missing-column → empty-per-zone.
- RustWritesJavaReadsIntegrationTest: a Rust-written 200k-row file in which every zone's sum is non-null and the sums fold to the exact column total. Proves the reader decodes Rust's zone-map sum.

Scope
ADR 0013 stays Proposed. Out of scope (follow-ups):
- Mask/Predicate/kernel vocab, the two-tier reduce, and rewiring calcite VortexAggregates.SUM off its full scan (now trivial, since sum is available per-zone).

🤖 Generated with Claude Code