Perf: Window topn optimisation by SubhamSinghal · Pull Request #21479 · apache/datafusion

SubhamSinghal · 2026-04-08T17:50:08Z

Which issue does this PR close?

Related to Optimize "per partition" top-k : ROW_NUMBER < 5 / TopK #6899.

Rationale for this change

Queries like SELECT * FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY pk ORDER BY val) AS rn FROM t) WHERE rn <= K are extremely common in analytics ("top N per group"). The current plan sorts the entire dataset in O(N log N) time, computes ROW_NUMBER for all rows, then filters. With 10M rows, 1K partitions, and K=3, we sort all 10M rows but keep only 3K.

This PR introduces a PartitionedTopKExec operator that replaces the SortExec, maintaining a per-partition TopK heap (reusing DataFusion's existing TopK implementation). Cost drops to O(N log K) time and O(K × P × row_size) memory.
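To make the cost argument concrete, here is a minimal sketch of per-partition top-K using one bounded max-heap per partition key. It is an illustration under simplifying assumptions (a single integer partition key, a single integer sort key, ascending order), and the function name top_k_per_partition is hypothetical; the operator in this PR works on Arrow batches and reuses DataFusion's TopK instead.

```rust
use std::collections::{BinaryHeap, HashMap};

/// Keeps the `k` smallest `val`s for every partition key `pk`.
/// Rows are `(pk, val)` pairs. Hypothetical helper for illustration only.
fn top_k_per_partition(rows: &[(i64, i64)], k: usize) -> Vec<(i64, i64)> {
    if k == 0 {
        return Vec::new();
    }
    // One max-heap per partition; the heap root is the worst value kept so far.
    let mut heaps: HashMap<i64, BinaryHeap<i64>> = HashMap::new();
    for &(pk, val) in rows {
        let heap = heaps.entry(pk).or_default();
        if heap.len() < k {
            heap.push(val);
        } else if let Some(&worst) = heap.peek() {
            // Replace the current worst only when the new value beats it.
            if val < worst {
                heap.pop();
                heap.push(val);
            }
        }
    }
    // Emit in (partition key, order key) order, as the window operator expects.
    let mut out: Vec<(i64, i64)> = heaps
        .into_iter()
        .flat_map(|(pk, heap)| heap.into_iter().map(move |v| (pk, v)))
        .collect();
    out.sort();
    out
}
```

Each row costs at most O(log K) heap work, and at most K values are retained per partition, which is where the O(N log K) time and O(K × P) retained-row bounds come from.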

What changes are included in this PR?

New physical operator: PartitionedTopKExec (physical-plan/src/sorts/partitioned_topk.rs)

Reads unsorted input, groups rows by partition key using RowConverter, feeds sub-batches to a per-partition TopK heap
Emits only the top-K rows per partition in sorted (partition_keys, order_keys) order
Reuses the existing TopK implementation for heap management, sort key comparison, eviction, and batch compaction

New optimizer rule: WindowTopN (physical-optimizer/src/window_topn.rs)

Detects the pattern:

FilterExec(rn <= K)
  [optional ProjectionExec]
    BoundedWindowAggExec(ROW_NUMBER PARTITION BY ... ORDER BY ...)
      SortExec(partition_keys, order_keys)

And replaces it with:

[optional ProjectionExec]
  BoundedWindowAggExec(ROW_NUMBER PARTITION BY ... ORDER BY ...)
    PartitionedTopKExec(fetch=K)

Both FilterExec and SortExec are removed.
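The rewrite above can be sketched on a toy plan tree. This is a simplified model, not the rule in window_topn.rs: the Plan enum and the rewrite function are hypothetical, and the sketch omits the checks the real rule needs (that the filter targets the ROW_NUMBER column, that a PARTITION BY clause exists, and that the sort keys match).

```rust
/// Toy physical plan; each variant stands in for the operator of the same name.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan,
    Sort(Box<Plan>),
    Window(Box<Plan>),
    Projection(Box<Plan>),
    Filter { k: usize, input: Box<Plan> },
    PartitionedTopK { fetch: usize, input: Box<Plan> },
}

/// Returns the rewritten plan if `plan` matches
/// Filter -> [Projection] -> Window -> Sort, otherwise `None`.
fn rewrite(plan: &Plan) -> Option<Plan> {
    let Plan::Filter { k, input } = plan else {
        return None;
    };
    // Step over the optional projection between the filter and the window.
    let (has_projection, below) = match &**input {
        Plan::Projection(inner) => (true, &**inner),
        other => (false, other),
    };
    let Plan::Window(window_input) = below else {
        return None;
    };
    let Plan::Sort(sort_input) = &**window_input else {
        return None;
    };
    // Drop the filter and the sort; the top-K operator feeds the window directly.
    let topk = Plan::PartitionedTopK {
        fetch: *k,
        input: sort_input.clone(),
    };
    let window = Plan::Window(Box::new(topk));
    Some(if has_projection {
        Plan::Projection(Box::new(window))
    } else {
        window
    })
}
```

Returning None for any non-matching shape leaves the original plan untouched, which keeps the rule safe to run on every node.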

Supported predicates: rn <= K, rn < K, K >= rn, K > rn.
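Each of the four predicate shapes maps to a fetch limit. A minimal sketch, assuming the comparison has already been split into an operator, the side holding the ROW_NUMBER column, and the literal K (the names CmpOp and fetch_from_predicate are hypothetical, and treating rn < 0 as fetch 0 is an assumption of this sketch):

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum CmpOp {
    Lt,
    LtEq,
    Gt,
    GtEq,
}

/// Translates a comparison between the ROW_NUMBER column and a literal `k`
/// into the number of rows to keep per partition.
/// `rn_on_left` is true for `rn <op> k` and false for `k <op> rn`.
/// Returns `None` for shapes that do not bound the row number from above.
fn fetch_from_predicate(op: CmpOp, rn_on_left: bool, k: u64) -> Option<u64> {
    match (op, rn_on_left) {
        // rn <= K  and  K >= rn  keep rows 1..=K
        (CmpOp::LtEq, true) | (CmpOp::GtEq, false) => Some(k),
        // rn < K  and  K > rn  keep rows 1..=K-1 (row numbers start at 1)
        (CmpOp::Lt, true) | (CmpOp::Gt, false) => Some(k.saturating_sub(1)),
        // rn > K, rn >= K, K < rn, K <= rn are lower bounds: no top-K rewrite
        _ => None,
    }
}
```

For example, rn < 3 and 3 > rn both yield a fetch of 2, while rn > 3 yields None and the plan is left unchanged.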

The rule only fires for ROW_NUMBER with a PARTITION BY clause. Global top-K (no PARTITION BY) is already handled by SortExec with fetch.

Config flag: datafusion.optimizer.enable_window_topn (default: false)

Benchmark results (H2O groupby Q8, 10M rows, top-2 per partition):

cargo run --release --example h2o_window_topn_bench

Scenario	Enabled (ms)	Disabled (ms)	Speedup
100 partitions (100K rows/part)	43	174	4.0x
1K partitions (10K rows/part)	71	146	2.1x
10K partitions (1K rows/part)	619	128	0.2x (regression)
100K partitions (100 rows/part)	4368	135	0.03x (regression)

The regressions at 10K and 100K partitions are expected: per-partition TopK overhead (RowConverter, MemoryReservation per instance) dominates when partitions are very numerous with few rows each. For the common case (moderate partition cardinality), the optimization provides a 2-4x speedup.

Are these changes tested?

Yes:

7 unit tests (core/tests/physical_optimizer/window_topn.rs): basic ROW_NUMBER, rn < K, flipped predicates, non-window column filter, config disabled, no partition by, projection between filter and window
5 SLT tests (sqllogictest/test_files/window_topn.slt): correctness verification, EXPLAIN plan validation, rn < K, no-partition-by case, config disabled fallback

Are there any user-facing changes?

No breaking API changes. The optimization is disabled by default and transparent to users. It can be enabled via:

SET datafusion.optimizer.enable_window_topn = true;

2010YOUY01

Thank you — this PR looks really nice.

I took a quick look and left a few suggestions. I’ll review the optimizer rewrite and execution side more carefully later.

datafusion/common/src/config.rs

2010YOUY01 · 2026-04-09T04:23:25Z

datafusion/core/examples/h2o_window_topn_bench.rs

+// specific language governing permissions and limitations
+// under the License.
+
+// Standalone H2O groupby Q8 benchmark: PartitionedTopKExec enabled vs disabled


We could keep this benchmark in this PR, but it would be great to clean it up later.
To make benchmark maintenance easier, we could add queries representing this workload directly to the h2o window benchmark, so that similar benchmarks don't get scattered across multiple places.

datafusion/benchmarks/bench.sh

Line 123 in e1ad871

h2o_small_window: Extended h2oai benchmark with small dataset (1e7 rows) for window, default file format is csv

One issue, though: the h2o benchmark currently counts the dataset loading time, so we can't isolate the target executor's processing time. We could add an option to exclude the data loading time later 🤔

Shall I add the benchmark query to the h2o benchmark in this PR, or shall we do it once the data loading time is excluded?

datafusion/physical-optimizer/src/window_topn.rs

2010YOUY01 · 2026-04-09T04:34:08Z

datafusion/physical-optimizer/src/window_topn.rs

+        // Step 1: Match FilterExec at the top
+        let filter = plan.downcast_ref::<FilterExec>()?;
+
+        // Don't handle filters with projections


I'm curious why this case is skipped.

The filter's column indices would point to the projected schema, not the window exec's output schema, so our index-based matching for the ROW_NUMBER column would be wrong without resolving the projection mapping. Skipping this case for simplicity right now.

Yes, it's a good idea to keep things simple at the start.

Could you file a PR for this follow-up work? I'm happy to do it also.
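The index mismatch discussed in this thread can be illustrated with a small sketch of the follow-up work. It assumes a projection is described by one entry per output column, holding the input column index it forwards or None for a computed expression; the function name resolve_through_projection and this representation are hypothetical, not the DataFusion API.

```rust
/// Maps a column index seen by the filter (in the projected schema) back to
/// the column index in the window operator's output schema.
/// Returns `None` if the index is out of range or the column is computed.
fn resolve_through_projection(
    projection: &[Option<usize>],
    filter_col: usize,
) -> Option<usize> {
    projection.get(filter_col).copied().flatten()
}
```

Suppose the window output schema is [id, pk, val, rn], so rn sits at index 3, and the projection keeps [pk, rn], described as [Some(1), Some(3)]. A filter on rn then refers to column 1 of the projected schema. Comparing that 1 directly against the window's ROW_NUMBER index 3 would fail to match; resolving it through the mapping yields 3.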

datafusion/physical-plan/src/sorts/partitioned_topk.rs

2010YOUY01 · 2026-04-09T04:39:24Z

datafusion/physical-plan/src/sorts/partitioned_topk.rs

+        )?))
+    }
+
+    fn apply_expressions(


Not related to this PR, but I’m curious why this is a required ExecutionPlan API and when it is used, given that different operators can hold expressions for very different purposes 🤔

2010YOUY01 · 2026-04-09T04:48:11Z

datafusion/sqllogictest/test_files/window_topn.slt

+# Tests for Window TopN optimization: PartitionedTopKExec
+
+statement ok
+CREATE TABLE window_topn_t (id INT, pk INT, val INT) AS VALUES


I suggest moving the main test coverage here, instead of keeping it in unit tests across different layers such as optimizer tests. Once we have solid coverage here, it is less likely to get lost during local refactors.

We can also extend the coverage with more edge cases, for example:

predicates such as rn < 2, 2 > rn, etc.

mixing other window expressions with row_number()

empty or overlapping partition / order keys, such as ... OVER (ORDER BY id) or ... OVER (PARTITION BY id ORDER BY id, customer)

different sort options such as ASC, DESC, and NULLS FIRST

the QUALIFY clause https://datafusion.apache.org/user-guide/sql/select.html#qualify-clause

and more

added tests for these cases

Dandandan · 2026-04-10T07:01:40Z

datafusion.optimizer.enable_window_topn

If it has regressions as large as 0.03x, it should be off by default (and we should look at whether we can enable it automatically via a heuristic or statistics based on partition cardinality / rows).

2010YOUY01

I have reviewed it carefully, and it looks good to me.

I think it’s ready to go once the output batch coalescing is addressed (see comment). The other suggestions are best handled in follow-up PRs to keep this PR simple and focused.

2010YOUY01 · 2026-04-12T03:03:17Z

datafusion/physical-optimizer/src/optimizer.rs

            Arc::new(OptimizeAggregateOrder::new()),
+            // WindowTopN: replaces Filter(rn<=K) → Window(ROW_NUMBER) → Sort
+            // with Window(ROW_NUMBER) → PartitionedTopKExec(fetch=K).
+            // Must run after EnforceSorting (which inserts SortExec) and before


An alternative is to move this rule before EnforceSorting; I think this can make the implementation simpler.

This is optional and could be tried as a follow-up simplification, since I might have missed something.

See the detailed rationale:

Background on window planning

The initial physical plan for a window function doesn't include SortExec or RepartitionExec. It simply declares the required sort order and partitioning inside the window operator, and relies on the later EnforceSorting and EnforceDistribution physical optimizer rules to insert them. You can check this in datafusion-cli with the script below:

CREATE TABLE t (id INT, ts INT);
INSERT INTO t VALUES (1, 10), (1, 20), (1, 30), (2, 15), (2, 25);
EXPLAIN VERBOSE
SELECT id, ts, ROW_NUMBER() OVER (PARTITION BY id ORDER BY ts) AS rn
FROM t
QUALIFY rn < 3;

Idea

Move the rewrite rule before EnforceSorting so that we don't have to match SortExec; the plan pattern matching is then likely to be simpler.
The physical plan sanitizer still checks the ordering invariants to ensure things are still correct.

Will try it out in a separate PR.

2010YOUY01 · 2026-04-12T03:04:49Z

datafusion/physical-optimizer/src/window_topn.rs

+        // Step 1: Match FilterExec at the top
+        let filter = plan.downcast_ref::<FilterExec>()?;
+
+        // Don't handle filters with projections


Yes, it's a good idea to keep things simple at the start.

Could you file a PR for this follow-up work? I'm happy to do it also.

2010YOUY01 · 2026-04-12T03:31:01Z

datafusion/physical-plan/src/sorts/partitioned_topk.rs

+        }};
+    }
+
+    // ---------- Accumulation phase ----------


Optimization to try as follow-up:
To make it faster, we might want to add a fast path for single partition keys like PARTITION BY a, since we don't have to do row conversion here.
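A rough sketch of the difference between the two paths, under stated assumptions: the general path below builds a byte key from all partition columns (a simplified stand-in for row conversion, not Arrow's actual row format), while the fast path hashes a single integer column directly. Both function names are hypothetical.

```rust
use std::collections::HashMap;

/// General path: encode every partition column of a row into one byte key,
/// then group row indices by that key.
fn group_by_encoded(columns: &[Vec<i64>]) -> HashMap<Vec<u8>, Vec<usize>> {
    let num_rows = columns.first().map_or(0, |c| c.len());
    let mut groups: HashMap<Vec<u8>, Vec<usize>> = HashMap::new();
    for row in 0..num_rows {
        // One allocation and one copy per row: this is the cost to avoid.
        let mut key = Vec::with_capacity(columns.len() * 8);
        for col in columns {
            key.extend_from_slice(&col[row].to_be_bytes());
        }
        groups.entry(key).or_default().push(row);
    }
    groups
}

/// Fast path for `PARTITION BY a` with a single primitive column:
/// use the value itself as the hash key, with no intermediate encoding.
fn group_by_single(column: &[i64]) -> HashMap<i64, Vec<usize>> {
    let mut groups: HashMap<i64, Vec<usize>> = HashMap::new();
    for (row, &key) in column.iter().enumerate() {
        groups.entry(key).or_default().push(row);
    }
    groups
}
```

Both functions produce the same grouping of row indices for a single column; the fast path simply skips building the encoded key.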

Co-authored-by: Yongting You <2010youy01@gmail.com>

Subham Singhal added 2 commits April 8, 2026 22:42

Benchmark window topn optimisation

38fa07a

Lint fix

52147dd

github-actions bot added optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate physical-plan Changes to the physical-plan crate labels Apr 8, 2026

2010YOUY01 reviewed Apr 9, 2026

2010YOUY01 changed the title ~~Benchmark: Window topn optimisation~~ Perf: Window topn optimisation Apr 9, 2026

Resolve comment

48fd178

github-actions bot added the documentation Improvements or additions to documentation label Apr 9, 2026

Subham Singhal added 2 commits April 9, 2026 19:59

Adds UT

5c2c0fb

Fix build failure

ca5a1ae

2010YOUY01 reviewed Apr 12, 2026

Subham Singhal and others added 5 commits April 12, 2026 11:21

Adds BatchCoaleser

ad73410

Apply suggestions from code review

ec15954

Co-authored-by: Yongting You <2010youy01@gmail.com>

Fix linting

c03de69

Fix build failure

26076d9

Merge branch 'main' into window-topn-partitioned-topk-exec

87c9e84
