Arrow: Fix NullabilityHolder.reset() to clear the isNull array#16264
Open
jbbqqf wants to merge 1 commit intoapache:mainfrom
Open
Arrow: Fix NullabilityHolder.reset() to clear the isNull array#16264jbbqqf wants to merge 1 commit intoapache:mainfrom
jbbqqf wants to merge 1 commit intoapache:mainfrom
Conversation
reset() previously only zeroed numNulls, leaving the isNull marker array unchanged. Callers that reused a NullabilityHolder across batches could observe stale null markers via isNullAt() — the holder reported hasNulls() == false yet isNullAt(i) returned 1 for indices that had been set null in a previous batch. Today's hot-path callers always overwrite all positions on the next batch via setNulls/setNotNulls, so the bug is latent rather than observable in production. Fixing reset() to also clear isNull restores the documented "clean initial state" contract and removes a trap for future callers. Adds TestNullabilityHolder covering the single-set, bulk-set, and idempotency paths. All three tests fail on origin/main and pass with this change. Resolves apache#15808 Co-Authored-By: Claude Code <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
NullabilityHolder.reset()previously zeroednumNullsbut left theisNullmarker array unchanged. Callers that reuse a holder across batches could seehasNulls() == falseyet still getisNullAt(i) == 1for indices set null in a previous batch.This PR fixes
reset()to also clear theisNullarray, and adds focused unit tests covering the single-set, bulk-set, and idempotency paths.Resolves #15808
Context
Today's vectorized read hot-path callers (e.g. the arrow vectorized readers) always overwrite all positions on the next batch via
setNulls(start, num)/setNotNulls(start, num), so the latent stale state is never observed in production. The bug is therefore a contract trap rather than an active correctness bug — but the holder's contract documentsreset()as restoring a clean initial state, and the original triage in #15808 already proposed the one-line fix.This PR is a revival of #15809 (closed by stale-bot at 30 days of no review). The diff is unchanged in spirit; this version adds AI disclosure (which a reviewer asked for on the previous attempt) and a slightly broader test class to make the regression-test evidence explicit.
Changes
arrow/src/main/java/org/apache/iceberg/arrow/vectorized/NullabilityHolder.java—reset()now also callsArrays.fill(isNull, (byte) 0). A short comment explains the why (so a reviewer reading the diff cold doesn't have to re-derive it from the issue).arrow/src/test/java/org/apache/iceberg/arrow/vectorized/TestNullabilityHolder.java— new test class with three tests:resetClearsIsNullMarkers,resetClearsBulkSetNulls,resetIsIdempotent. All three fail onorigin/mainand pass with the fix.The other two arrays in
NullabilityHolder(nullsandnonNulls) are immutablearraycopysource buffers and intentionally not cleared — that's documented in the inline comment.Reproduce BEFORE/AFTER yourself (copy-paste)
The only thing that changes between BEFORE and AFTER is the checked-out ref of the production class; the test file is the same in both runs.
What I ran locally
./gradlew :iceberg-arrow:test --tests "org.apache.iceberg.arrow.vectorized.TestNullabilityHolder"onorigin/mainwith the new test file imported → 3 tests, 3 failures (regression visible)../gradlew :iceberg-arrow:spotlessCheck→ PASS.:iceberg-arrow:testbuild wrapping the change → BUILD SUCCESSFUL (existing tests untouched).Edge cases tested
setNull(idx)thenreset()setNull(3),reset()isNullAt(3) == 0, all other indices == 0,numNulls() == 0resetClearsIsNullMarkerssetNulls(start, num)thenreset()setNulls(2, 4),reset()hasNulls() == falseresetClearsBulkSetNullsreset()twicesetNull(0),reset(),reset()numNulls() == 0resetIsIdempotentRisk / blast radius
Internal:
NullabilityHolderis package-private underorg.apache.iceberg.arrow.vectorizedand only used by the vectorized arrow read path. The fix adds abyte[]zero-fill inreset()(size = batch capacity, typically 4–8 KiB on hot paths). On the dominant hot path the holder is reused across batches and the next bulksetNulls/setNotNullswould have overwritten the array anyway, so the only observable effect is a single extraArrays.fillperreset()— a few microseconds per batch on typical JVM heuristics.Additive change; no API surface modified.
Release note
PR drafted with assistance from Claude Code. The change was reviewed manually against
apache/iceberg's source onmain. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.