
binary type promotion #69

Merged

vustef merged 3 commits into main from vs-type-promotion-binary on Mar 20, 2026

Conversation

@vustef (Collaborator) commented Mar 20, 2026

With this table:

CREATE OR REPLACE ICEBERG TABLE vustef_db_12092025.public.ibt_all_types (
    id INT,
    c_bigint BIGINT,
    c_float FLOAT,
    c_double DOUBLE,
    c_decimal DECIMAL(10,2),
    c_boolean BOOLEAN,
    c_date DATE,
    c_time TIME,
    c_timestamp TIMESTAMP_NTZ(6),
    c_timestamptz TIMESTAMP_LTZ(6),
    c_string STRING,
    c_uuid BINARY(16)
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'snowflake_managed';

INSERT INTO vustef_db_12092025.public.ibt_all_types
  (ID, C_BIGINT, C_FLOAT, C_DOUBLE, C_DECIMAL, C_BOOLEAN, C_DATE, C_TIME, C_TIMESTAMP, C_TIMESTAMPTZ, C_STRING, C_UUID)
SELECT column1, column2, column3, column4, column5, column6, column7, column8, column9, column10, column11, HEX_DECODE_BINARY(REPLACE(UUID_STRING(), '-', ''))
FROM VALUES
(11, 1000000010, 10.1::FLOAT, 10.0010::FLOAT, 99999.99::NUMBER(10,2), FALSE, '2025-10-25'::DATE, '19:30:00'::TIME, '2025-10-25 19:30:00'::TIMESTAMP_NTZ, '2025-10-25 19:30:00 +01:00'::TIMESTAMP_LTZ, 'juliet');

we uncovered a bug:

  • when we query the whole table, we get an error "failed to process iceberg_next_batch with error: Error reading batch: Unexpected => Stream error: Unexpected => failed to read record batch, source: Unexpected => unexpected target column type FixedSizeBinary(16)" - this is an error for the C_UUID column
  • when we query only C_UUID, it works
  • when we query any other column in addition to C_UUID, we get the error again.

Root Cause

  1. Snowflake writes FIXED_LEN_BYTE_ARRAY(16) in Parquet for the fixed[16] column, but the Arrow Parquet reader decodes it as Binary (not FixedSizeBinary(16)).
  2. arrow_schema_to_schema converts Binary to Primitive(Binary), while the Iceberg table metadata says the column is Primitive(Fixed(16)).
  3. type_promotion_is_valid rejects (Binary, Fixed(16)); there's no arm allowing this "promotion" (or rather, this type mapping mismatch).
  4. The column is excluded from the Parquet projection mask, so it's never read from the file.
  5. RecordBatchTransformer sees the column as "missing" and tries to fill it via ColumnSource::Add with None (null default).
  6. create_primitive_array_repeated has no FixedSizeBinary arm, producing the error (see the sketch after this list).
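To make step 6 concrete, here's a minimal self-contained sketch of that shape. fill_with_nulls is a hypothetical stand-in for create_primitive_array_repeated (which handles many more types), but the structure is the same: a match over target Arrow types with no FixedSizeBinary arm, so a null-fill request for fixed[16] falls through to the error case.

use arrow_array::{new_null_array, ArrayRef};
use arrow_schema::DataType;

// Hypothetical stand-in for create_primitive_array_repeated: build a column
// of `num_rows` repeated default values (here: nulls) for a target type.
fn fill_with_nulls(target: &DataType, num_rows: usize) -> Result<ArrayRef, String> {
    match target {
        DataType::Int32 | DataType::Int64 | DataType::Utf8 | DataType::Binary => {
            Ok(new_null_array(target, num_rows))
        }
        // No DataType::FixedSizeBinary(_) arm, so fixed[16] lands here.
        other => Err(format!("unexpected target column type {other}")),
    }
}

fn main() {
    // The shape of the crash: filling the "missing" C_UUID column fails.
    assert!(fill_with_nulls(&DataType::FixedSizeBinary(16), 3).is_err());
}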

The fix is at step 3: add (Some(PrimitiveType::Binary), Some(PrimitiveType::Fixed(_))) to type_promotion_is_valid, since Binary in Parquet/Arrow is a valid representation of Fixed(N) in Iceberg. This lets the column be read from the file normally, and a cast or passthrough at the transformer level then handles it.
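A minimal sketch of the new arm, assuming a simplified PrimitiveType (the real enum and the real arm list in this codebase are much longer):

// Simplified, hypothetical PrimitiveType; the real enum has many more variants.
#[derive(Debug, PartialEq)]
enum PrimitiveType {
    Int,
    Long,
    Binary,
    Fixed(u64),
}

fn type_promotion_is_valid(source: Option<&PrimitiveType>, target: Option<&PrimitiveType>) -> bool {
    match (source, target) {
        // Identical types always pass.
        (Some(s), Some(t)) if s == t => true,
        // Example of an existing widening promotion.
        (Some(PrimitiveType::Int), Some(PrimitiveType::Long)) => true,
        // The new arm: Parquet/Arrow Binary is a valid physical
        // representation of Iceberg fixed[N].
        (Some(PrimitiveType::Binary), Some(PrimitiveType::Fixed(_))) => true,
        _ => false,
    }
}

fn main() {
    assert!(type_promotion_is_valid(
        Some(&PrimitiveType::Binary),
        Some(&PrimitiveType::Fixed(16)),
    ));
}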

Why it works for C_UUID only

When you project only [12] and field 12 fails the type check, indices is empty → ProjectionMask::all() → all columns are read, including C_UUID. The transformer then sees C_UUID in the source because it was read (despite never properly entering the projection mask), and since FixedSizeBinary(16) comes through from the actual Parquet read, the equals_datatype check in the transformer passes against the target type.

So: the projected scan "works" by accident; the empty-indices fallback reads everything, and the column happens to be there. A sketch of that fallback follows.
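The fallback looks roughly like this (a sketch against the parquet crate's ProjectionMask; build_mask and the surrounding wiring are illustrative, not the exact code in this repo):

use parquet::arrow::ProjectionMask;
use parquet::schema::types::SchemaDescriptor;

// `indices` holds the Parquet leaf indices of the columns that passed
// type_promotion_is_valid.
fn build_mask(schema: &SchemaDescriptor, indices: Vec<usize>) -> ProjectionMask {
    if indices.is_empty() {
        // No column passed the check: fall back to reading everything,
        // which is why the C_UUID-only scan "works" by accident.
        ProjectionMask::all()
    } else {
        // Normal path: read only the passing columns, which is how a
        // full scan ends up excluding C_UUID.
        ProjectionMask::leaves(schema, indices)
    }
}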

That's the full picture:

  • Full scan: 11 of 12 columns pass the type check → indices is non-empty → ProjectionMask::leaves(indices) excludes C_UUID → the transformer fills it with nulls → crashes because create_primitive_array_repeated doesn't support FixedSizeBinary
  • Projected scan of only C_UUID: 0 of 1 columns pass → indices is empty → ProjectionMask::all() fallback → all columns are read → C_UUID is present → works by accident

What happens if the Binary data doesn't match the Fixed(N) length?

With the current fix, type_promotion_is_valid just says "yes, this column is readable" — it lets the
Parquet reader include the column in the projection mask. The actual data flows through as-is
(Binary type from Arrow). The RecordBatchTransformer then checks
source_field.data_type().equals_datatype(target_type):

  • Source: Binary
  • Target: FixedSizeBinary(16)

These don't match, so it goes to ColumnSource::Promote, which calls arrow_cast::cast(Binary →
FixedSizeBinary(16)). Arrow's cast implementation will fail at runtime if any value's length != 16.
So you'd get an error like "cannot cast Binary to FixedSizeBinary(16): value at index N has length
M" — not silent corruption, but a hard error.

This is actually reasonable behavior: if a Parquet file claims to have Fixed(16) data but contains
values of a different length, that's a corrupt file, and erroring is correct.

@vustef vustef requested review from gbrgr and mjschleich March 20, 2026 14:47
@vustef vustef enabled auto-merge March 20, 2026 14:49
@vustef vustef disabled auto-merge March 20, 2026 14:54
@vustef vustef enabled auto-merge March 20, 2026 15:01
@vustef vustef merged commit ff28348 into main Mar 20, 2026
19 checks passed
@vustef vustef deleted the vs-type-promotion-binary branch March 20, 2026 15:12