feat(write): add write pipeline with DataFusion INSERT INTO/OVERWRITE support#234
Open
JingsongLi wants to merge 1 commit into apache:main from
Conversation
feat(write): add write pipeline with DataFusion INSERT INTO/OVERWRITE support

Add TableWrite for writing Arrow RecordBatches to Paimon append-only tables. Each (partition, bucket) pair gets its own DataFileWriter with direct writes (matching the delta-rs DeltaWriter pattern). File rolling uses tokio::spawn for background close, and prepare_commit uses try_join_all for parallel finalization across partition writers.

Key components:
- TableWrite: routes batches by partition/bucket, holds DataFileWriters
- DataFileWriter: manages parquet file lifecycle with rolling support
- WriteBuilder: creates TableWrite and TableCommit instances
- PaimonDataSink: DataFusion DataSink integration for INSERT/OVERWRITE
- FormatFileWriter: extended with flush() and in_progress_size()

Configurable options via CoreOptions:
- file.compression (default: zstd)
- target-file-size (default: 256MB)
- write.parquet-buffer-size (default: 256MB)

Includes E2E integration tests for unpartitioned, partitioned, fixed-bucket, multi-commit, column projection, and bucket filtering.
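The parallel finalization described above can be sketched in a synchronous, std-only analogue. The PR itself uses try_join_all over async writer-close futures; the names here (CommitMessage fields, close_writer, prepare_commit's signature) are illustrative, not the actual paimon-rust API. What carries over is the fail-fast semantics: `collect::<Result<Vec<_>, _>>()` stops at the first error, just as try_join_all does.

```rust
// Hedged, synchronous analogue of prepare_commit; the real code
// finalizes writers in parallel with futures::future::try_join_all.
struct CommitMessage {
    partition: String,
    bucket: u64,
}

// Stand-in for closing one DataFileWriter; the real writer would
// flush and close its in-progress parquet file here.
fn close_writer(partition: &str, bucket: u64) -> Result<CommitMessage, String> {
    Ok(CommitMessage { partition: partition.to_string(), bucket })
}

fn prepare_commit(writers: Vec<(&str, u64)>) -> Result<Vec<CommitMessage>, String> {
    // collect() into Result<Vec<_>, _> short-circuits on the first Err,
    // mirroring try_join_all's fail-fast behavior.
    writers.into_iter().map(|(p, b)| close_writer(p, b)).collect()
}

fn main() {
    let msgs = prepare_commit(vec![("dt=2024-01-01", 0), ("dt=2024-01-02", 1)]).unwrap();
    assert_eq!(msgs.len(), 2);
    assert_eq!(msgs[0].partition, "dt=2024-01-01");
    println!("prepared {} commit messages", msgs.len());
}
```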
let row = BinaryRow::from_serialized_bytes(&msg.partition)?;
let mut spec = HashMap::new();
for (i, key) in partition_keys.iter().enumerate() {
    if let Some(datum) = extract_datum(&row, i, &data_types[i])? {
Contributor
This will drop NULL partition keys from the overwrite predicate: when a partition value is NULL, extract_datum returns None and the `if let Some` skips the key, so the predicate under-constrains the delete. I reproduced a case where overwriting the NULL partition also deleted rows from other partitions.
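One possible shape of the fix is to map a missing datum to an explicit IS NULL predicate instead of skipping the key. This is a minimal sketch: the Predicate enum, the i64 datum type, and overwrite_predicates are all hypothetical, standing in for whatever predicate representation paimon-rust actually uses.

```rust
// Hedged sketch: keep NULL partition keys in the overwrite predicate
// by emitting IsNull rather than dropping the key. All names here are
// illustrative, not the actual paimon-rust API.
#[derive(Debug, PartialEq)]
enum Predicate {
    Equal(String, i64), // column = value
    IsNull(String),     // column IS NULL
}

// One predicate per partition key; a None datum becomes IsNull
// instead of being silently omitted from the spec.
fn overwrite_predicates(partition_keys: &[&str], row: &[Option<i64>]) -> Vec<Predicate> {
    partition_keys
        .iter()
        .zip(row)
        .map(|(key, datum)| match datum {
            Some(v) => Predicate::Equal((*key).to_string(), *v),
            None => Predicate::IsNull((*key).to_string()),
        })
        .collect()
}

fn main() {
    // Overwriting the NULL partition now targets only that partition.
    let preds = overwrite_predicates(&["dt"], &[None]);
    assert_eq!(preds, vec![Predicate::IsNull("dt".to_string())]);
    let preds = overwrite_predicates(&["dt"], &[Some(20240101)]);
    assert_eq!(preds, vec![Predicate::Equal("dt".to_string(), 20240101)]);
    println!("ok");
}
```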
Purpose
Subtask of #232
Add TableWrite for writing Arrow RecordBatches to Paimon append-only tables. Each (partition, bucket) pair gets its own DataFileWriter with direct writes (matching delta-rs DeltaWriter pattern). File rolling uses tokio::spawn for background close, and prepare_commit uses try_join_all for parallel finalization across partition writers.
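The per-(partition, bucket) routing described above can be sketched with std types only. The struct and method names (TableWriteSketch, Writer, write) are placeholders, and the hash-modulo bucket assignment is an assumption about how fixed-bucket routing might work, not the actual paimon-rust implementation; the point is that each distinct (partition, bucket) pair owns exactly one writer, created lazily.

```rust
// Hedged sketch of routing rows to one writer per (partition, bucket).
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

#[derive(Default)]
struct Writer {
    rows: usize, // stand-in for a DataFileWriter holding parquet state
}

struct TableWriteSketch {
    num_buckets: u64,
    // one writer per (partition, bucket) pair
    writers: HashMap<(String, u64), Writer>,
}

impl TableWriteSketch {
    // Assumed fixed-bucket scheme: hash of the bucket key modulo num_buckets.
    fn bucket_for(&self, bucket_key: &str) -> u64 {
        let mut h = DefaultHasher::new();
        bucket_key.hash(&mut h);
        h.finish() % self.num_buckets
    }

    // Route a row to its (partition, bucket) writer, creating it on demand.
    fn write(&mut self, partition: &str, bucket_key: &str) -> (String, u64) {
        let bucket = self.bucket_for(bucket_key);
        let w = self.writers.entry((partition.to_string(), bucket)).or_default();
        w.rows += 1;
        (partition.to_string(), bucket)
    }
}

fn main() {
    let mut tw = TableWriteSketch { num_buckets: 4, writers: HashMap::new() };
    let (p, b) = tw.write("dt=2024-01-01", "k1");
    assert_eq!(p, "dt=2024-01-01");
    assert!(b < 4);
    // The same bucket key routes to the same bucket, so the writer is reused.
    let (_, b2) = tw.write("dt=2024-01-01", "k1");
    assert_eq!(b, b2);
    assert_eq!(tw.writers.len(), 1);
    println!("writers: {}", tw.writers.len());
}
```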
Key components:
- TableWrite: routes batches by partition/bucket, holds DataFileWriters
- DataFileWriter: manages parquet file lifecycle with rolling support
- WriteBuilder: creates TableWrite and TableCommit instances
- PaimonDataSink: DataFusion DataSink integration for INSERT/OVERWRITE
- FormatFileWriter: extended with flush() and in_progress_size()

Configurable options via CoreOptions:
- file.compression (default: zstd)
- target-file-size (default: 256MB)
- write.parquet-buffer-size (default: 256MB)
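The target-file-size option drives file rolling: once the in-progress file reaches the configured size, the writer closes it (in the background, per the description) and starts a new one. A minimal sketch of the defaulted lookup and the rolling check, where the key name and 256MB default come from the list above but the accessor functions themselves are illustrative:

```rust
// Hedged sketch of reading a CoreOptions-style setting with its default
// and applying it to the rolling decision. Function names are hypothetical.
use std::collections::HashMap;

const MB: u64 = 1024 * 1024;

fn target_file_size(options: &HashMap<String, String>) -> u64 {
    options
        .get("target-file-size")
        .and_then(|v| v.parse().ok())
        .unwrap_or(256 * MB) // default: 256MB, per the PR description
}

// Roll to a new data file once the in-progress file reaches the target.
fn should_roll(in_progress_size: u64, target: u64) -> bool {
    in_progress_size >= target
}

fn main() {
    let defaults = HashMap::new();
    assert_eq!(target_file_size(&defaults), 256 * MB);

    let mut opts = HashMap::new();
    opts.insert("target-file-size".to_string(), (128 * MB).to_string());
    assert_eq!(target_file_size(&opts), 128 * MB);

    assert!(!should_roll(64 * MB, target_file_size(&opts)));
    assert!(should_roll(128 * MB, target_file_size(&opts)));
    println!("ok");
}
```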
Includes E2E integration tests for unpartitioned, partitioned, fixed-bucket, multi-commit, column projection, and bucket filtering.
Brief change log
Tests
API and Format
Documentation