feat: replace custom shuffle block format with Arrow IPC streams #3911

Draft — andygrove wants to merge 19 commits into apache:main

Conversation
Addresses apache#3882 — shuffle format overhead with default batch size.
Replace the custom shuffle block format (per-batch IPC streams with custom length-prefix headers and external compression wrappers) with standard Arrow IPC streams using built-in body compression. Key changes:

- Replace ShuffleBlockWriter with CompressionCodec::ipc_write_options(), which creates IpcWriteOptions with LZ4_FRAME or ZSTD body compression
- Rewrite BufBatchWriter to use a persistent StreamWriter that writes the schema once, then appends N record batch messages
- Rewrite PartitionWriter (spill) to create a StreamWriter over spill files
- Rewrite PartitionOutputStream (immediate mode) to use a persistent StreamWriter<Vec<u8>> with lazy creation and a drain/finish lifecycle
- Simplify SinglePartitionShufflePartitioner by removing manual batch coalescing (handled by BufBatchWriter's BatchCoalescer)
- Update the sort-based shuffle in spark_unsafe/row.rs to use StreamWriter
- Remove snappy from the shuffle compression options (keep the Snappy variant in the CompressionCodec enum for Parquet writer compatibility)
- Update all tests to use Arrow's StreamReader for roundtrip verification
- Update the shuffle_bench binary and criterion benchmarks

The old ipc.rs read path is preserved for Task 6. The core crate will have expected compile errors in shuffle_scan.rs tests and jni_api.rs due to the removed ShuffleBlockWriter export.
Add JniInputStream (implements std::io::Read by pulling bytes from a JVM InputStream via JNI with 64KB read-ahead buffer) and ShuffleStreamReader (wraps Arrow StreamReader<JniInputStream> for lifecycle management). Replace decodeShuffleBlock JNI function with four new streaming functions: openShuffleStream, nextShuffleStreamBatch, shuffleStreamNumFields, and closeShuffleStream. The old read_ipc_compressed is retained for the legacy ShuffleScanExec code path.
Replace decodeShuffleBlock JNI declaration with four new streaming methods: openShuffleStream, nextShuffleStreamBatch, shuffleStreamNumFields, and closeShuffleStream. Rewrite NativeBatchDecoderIterator to use a native handle pattern instead of manual header parsing and ByteBuffer management.
… streams

Replace the old CometShuffleBlockIterator-based read path in ShuffleScanExec with ShuffleStreamReader, which reads standard Arrow IPC streams directly from JVM InputStreams via JniInputStream. This eliminates the custom per-batch block format (8-byte length + 8-byte field count + 4-byte codec + compressed IPC) and the per-batch JNI calls (hasNext/getBuffer) in favor of streaming reads.

Changes:

- CometShuffledRowRDD: return a raw InputStream instead of CometShuffleBlockIterator
- CometExecIterator: accept Map[Int, InputStream] instead of Map[Int, CometShuffleBlockIterator]
- ShuffleScanExec (Rust): lazily create a ShuffleStreamReader from the InputStream GlobalRef and read batches via reader.next_batch() instead of the JNI block-by-block dance
- Add Send + Sync impls for SharedJniStream/StreamReadAdapter to satisfy ExecutionPlan bounds
…ings

- Hold a single StreamWriter across all batches in process_sorted_row_partition instead of creating a fresh writer per batch
- Remove read_ipc_compressed and the snap/lz4_flex/zstd dependencies from the shuffle crate
- Remove the dead CometShuffleBlockIterator.java and its JNI bridge
- Rename shuffle_block_writer.rs to codec.rs to reflect its contents
- Remove the unused _write_time parameter from BufBatchWriter write/flush
- Make CompressionCodec::Snappy return an error in ipc_write_options
- Remove Snappy from the shuffle writer codec mappings in the planner and JNI
Update shuffle IPC code to work with jni 0.22 API changes:

- GlobalRef → Global<JObject<'static>> / Global<JPrimitiveArray<'static, i8>>
- JNIEnv → Env, EnvUnowned
- JavaVM::attach_current_thread now takes a closure
- JVMClasses::get_env() → JVMClasses::with_env()

Also update EmptySchemaShufflePartitioner to use the Arrow IPC StreamWriter instead of the removed ShuffleBlockWriter.
Empty shuffle partitions must have zero bytes in the data file so that Spark's MapOutputTracker reports zero-size blocks. Writing schema + EOS for empty partitions changed the block sizes, which affected coalesce partition grouping in DefaultPartitionCoalescer.

Also add a miri ignore attribute to the shuffle_partitioner_memory test, since the spill path now uses the IPC StreamWriter, which calls into zstd FFI.
andygrove (author): I ran TPC-H @ 1TB and did not see any significant change in performance.
andygrove (author): @Kontinuation fyi
Contributor: What about the amount of shuffle data written?
…r to 8KB

Arrow's StreamWriter issues multiple small writes per batch (continuation marker, flatbuf metadata, padding) before the body data. Wrapping the output File in a BufWriter coalesces these small writes. Since the body is flushed after each batch, the buffer only needs to hold metadata, so 1MB was wasteful; 8KB is sufficient.
andygrove (author): I am seeing a performance regression with the standalone benchmark binary, so I am moving this to draft until I understand why.
Member: Arrow's IPC StreamWriter also seems to compress data on a per-RecordBatch basis. If we want to improve the compression ratio, we may need to perform compression after the StreamWriter finishes.
Member: We can also wrap the output writer in a compressing stream writer.
Which issue does this PR close?
Partial fix for #3882
Rationale for this change
The native columnar shuffle currently uses a custom per-block format (length-prefixed compressed IPC messages) that requires manual framing, per-block schema headers, and a custom Java reader (CometShuffleBlockIterator). Replacing this with standard Arrow IPC streams eliminates custom serialization code, enables built-in IPC body compression (zstd/lz4), and allows the shuffle reader to use Arrow's StreamReader directly.

What changes are included in this PR?
Write path:
- Replace ShuffleBlockWriter with Arrow IPC StreamWriter in all partitioners (single, multi, empty-schema)
- Coalesce batches in BufBatchWriter before serialization
- Enable the ipc_compression feature in the Arrow dependency for built-in zstd/lz4 body compression

Read path (native):
- Add JniInputStream: a Rust Read impl that pulls bytes from a JVM InputStream via JNI with 64KB read-ahead buffering
- Add ShuffleStreamReader: manages reading potentially concatenated IPC streams (from spills) using Arrow's StreamReader
- Rewrite ShuffleScanExec to lazily create a ShuffleStreamReader instead of calling per-block decode methods

Read path (JVM):
- Replace CometShuffleBlockIterator (custom Java reader) with a simple InputStream passed to native via JNI
- Rewrite NativeBatchDecoderIterator to open a stream, read batches, and close, with no more per-block ByteBuffer management
- Add openShuffleStream, nextShuffleStreamBatch, shuffleStreamNumFields, and closeShuffleStream JNI methods

Cleanup:

- Remove ShuffleBlockWriter, CometShuffleBlockIterator, and the shuffle block iterator JNI bridge code
- Remove related shuffle configs (COMET_COLUMNAR_SHUFFLE_ASYNC_ENABLED, COMET_SHUFFLE_PREFER_DICTIONARY_RATIO)

How are these changes tested?
- Existing tests, updated to use Arrow's StreamReader for roundtrip verification
- Builds with -D warnings