gbrgr approved these changes on Mar 20, 2026
With this table we uncovered a bug:

failed to process iceberg_next_batch with error: Error reading batch: Unexpected => Stream error: Unexpected => failed to read record batch, source: Unexpected => unexpected target column type FixedSizeBinary(16)

This is an error for the C_UUID column.

Root Cause

The Parquet file stores the C_UUID column as Binary (not FixedSizeBinary(16)), while the Iceberg schema declares it as Primitive(Fixed(16)); the type check fails on this mismatch.
The fix is at step 3: add (Some(PrimitiveType::Binary), Some(PrimitiveType::Fixed(_))) to type_promotion_is_valid, since Binary in Parquet/Arrow is a valid representation of Fixed(N) in Iceberg. This lets the column be read from the file normally; a cast or passthrough at the transformer level then handles the conversion.
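A minimal sketch of the fix, using a mocked PrimitiveType enum and a stand-in for the promotion check (the names mirror iceberg-rust but are assumptions, not the crate's actual API):

```rust
// Mock of the Iceberg primitive type enum; only the variants needed here.
#[derive(Debug, PartialEq)]
enum PrimitiveType {
    Int,
    Long,
    Binary,
    Fixed(u64),
}

// Hypothetical stand-in for the reader's promotion check: can a column with
// `file_type` in the Parquet file be read as `projected_type` from the schema?
fn type_promotion_is_valid(
    file_type: Option<&PrimitiveType>,
    projected_type: Option<&PrimitiveType>,
) -> bool {
    match (file_type, projected_type) {
        // An existing, uncontroversial promotion: int -> long.
        (Some(PrimitiveType::Int), Some(PrimitiveType::Long)) => true,
        // The fix: a Binary column in the file is a valid representation
        // of Fixed(N) in the Iceberg schema.
        (Some(PrimitiveType::Binary), Some(PrimitiveType::Fixed(_))) => true,
        // Identical types are always readable.
        (Some(a), Some(b)) if a == b => true,
        _ => false,
    }
}
```

With this arm in place, field 12 passes the check and stays in the projection, so the fallback path described below is never taken for it.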
Why it works for C_UUID only

When you project only [12] and field 12 fails the type check, indices ends up empty → ProjectionMask::all() → all columns are read, including C_UUID. The transformer then sees C_UUID in the source because it was read anyway (despite never being properly selected by the projection mask). It lands at some column index, and since FixedSizeBinary(16) comes through from the actual Parquet read, the equals_datatype check in the transformer passes against the target type.
So the projected scan "works" by accident: the empty-indices fallback reads everything, and the column happens to be there.
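The accidental fallback can be sketched like this, with a mocked ProjectionMask (the name mirrors parquet-rs, but this is a simplified stand-in, not the crate's API):

```rust
// Mock of a projection mask: either every column, or a set of leaf indices.
#[derive(Debug, PartialEq)]
enum ProjectionMask {
    All,
    Leaves(Vec<usize>),
}

// Hypothetical: map projected Iceberg field ids to Parquet column indices.
// `column_for_field` returns None for a field that fails the type check.
fn build_projection_mask(
    projected_field_ids: &[i32],
    column_for_field: impl Fn(i32) -> Option<usize>,
) -> ProjectionMask {
    let indices: Vec<usize> = projected_field_ids
        .iter()
        .filter_map(|id| column_for_field(*id))
        .collect();
    if indices.is_empty() {
        // The bug's escape hatch: an empty index list silently widens the
        // projection to every column in the file.
        ProjectionMask::All
    } else {
        ProjectionMask::Leaves(indices)
    }
}
```

Projecting only the failing field 12 produces an empty index list, which this fallback turns into "read everything", exactly the accident described above.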
That's the full picture:
- C_UUID absent from the source batch → the transformer fills it with nulls → crashes, because FixedSizeBinary is not supported in create_primitive_array_repeated
- C_UUID present in the source batch → works by accident
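The two outcomes above come from the transformer's per-column decision, which can be sketched with mocked types (ColumnSource and the DataType variants are stand-ins for the real iceberg-rust / arrow-rs definitions, not their actual APIs):

```rust
// Mock of the two Arrow types involved in this bug.
#[derive(Debug, PartialEq)]
enum DataType {
    Binary,
    FixedSizeBinary(i32),
}

// Mock of how the transformer sources each target column.
#[derive(Debug, PartialEq)]
enum ColumnSource {
    PassThrough { index: usize }, // source type already matches the target
    Promote { index: usize },     // needs a cast, e.g. Binary -> FixedSizeBinary(16)
    NullFill,                     // column absent from the source batch
}

// Hypothetical decision function: `source` is (column index, type) if the
// column was present in the batch that was actually read.
fn choose_column_source(
    source: Option<(usize, &DataType)>,
    target: &DataType,
) -> ColumnSource {
    match source {
        Some((index, dt)) if dt == target => ColumnSource::PassThrough { index },
        Some((index, _)) => ColumnSource::Promote { index },
        // This is the path that crashed: null-filling a FixedSizeBinary
        // column was unsupported in create_primitive_array_repeated.
        None => ColumnSource::NullFill,
    }
}
```

When the accidental fallback read C_UUID as FixedSizeBinary(16), the first arm matched and everything passed through; when the column was absent, the NullFill arm crashed downstream.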
What happens if the Binary data exceeds the Fixed(N) length?

With the current fix, type_promotion_is_valid just says "yes, this column is readable": it lets the Parquet reader include the column in the projection mask. The actual data flows through as-is (as the Binary type from Arrow). The RecordBatchTransformer then checks source_field.data_type().equals_datatype(target_type). These don't match, so it goes to ColumnSource::Promote, which calls arrow_cast::cast(Binary → FixedSizeBinary(16)). Arrow's cast implementation fails at runtime if any value's length != 16.
So you'd get an error like "cannot cast Binary to FixedSizeBinary(16): value at index N has length M": a hard error, not silent corruption.
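The length check arrow performs during that cast can be reduced to plain byte vectors, as in this mock (a sketch of the behavior, not arrow-rs itself; the function name and error text are assumptions modeled on the error above):

```rust
// Mock of casting a Binary column to FixedSizeBinary(n): every value must be
// exactly n bytes, or the whole cast fails with a hard error.
fn cast_binary_to_fixed(values: &[Vec<u8>], n: usize) -> Result<Vec<Vec<u8>>, String> {
    values
        .iter()
        .enumerate()
        .map(|(i, v)| {
            if v.len() == n {
                Ok(v.clone())
            } else {
                // Mirrors arrow's behavior: no truncation, no padding.
                Err(format!(
                    "cannot cast Binary to FixedSizeBinary({n}): value at index {i} has length {}",
                    v.len()
                ))
            }
        })
        .collect()
}
```

A batch of all-16-byte values casts cleanly; a single value of any other length fails the whole batch, which is the hard-error behavior described above.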
This is actually reasonable behavior: if a Parquet file claims to have Fixed(16) data but contains
values of a different length, that's a corrupt file, and erroring is correct.