Core: Cache splitOffsets list in BaseFile to avoid repeated allocations#16265
Open
jbbqqf wants to merge 2 commits into apache:main from
Conversation
BaseFile.splitOffsets() previously wrapped the underlying long[] in a fresh ArrayUtil.toUnmodifiableLongList on every invocation. Manifest rewriting, format conversion, and toString() debugging can call splitOffsets() many times for the same file, leaking a List&lt;Long&gt; wrapper each time.

Cache the unmodifiable view in a transient field that is populated lazily on the first call and invalidated wherever the underlying long[] is reassigned (the long-form constructor, the copy constructor, and the indexed setter). Marking it transient keeps it out of the wire format and Java serialization.

Adds TestBaseFileSplitOffsets covering the four behaviors that matter: repeated calls return the same instance, corrupted offsets still yield null stably, the copy constructor produces an independent cached view, and the returned list remains unmodifiable.

Updates TestManifestReader#FILE_COMPARISON_CONFIG to ignore the new transient cache field — recursive equality on data files should not depend on whether splitOffsets() / toString() happened to be called on one side of the comparison.

Resolves apache#15622

Co-Authored-By: Claude Code <noreply@anthropic.com>
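The recursive-equality concern above can be made concrete with a reflective field walk over a stand-in class (`MiniFile` and `RecursiveCompareDemo` are illustrative names, not Iceberg code): two instances holding identical `long[]` arrays diverge on the transient cache field as soon as `splitOffsets()` has run on only one of them.

```java
import java.lang.reflect.Field;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

// MiniFile mimics only the two fields relevant to the comparison failure.
class MiniFile {
  long[] splitOffsets;
  transient List<Long> splitOffsetsList; // lazily populated cache

  MiniFile(long... offsets) {
    this.splitOffsets = offsets;
  }

  List<Long> splitOffsets() {
    if (splitOffsetsList == null) {
      splitOffsetsList =
          Collections.unmodifiableList(
              Arrays.stream(splitOffsets).boxed().collect(Collectors.toList()));
    }
    return splitOffsetsList;
  }

  // Field-by-field equality, standing in for a recursive comparison.
  static boolean fieldEquals(MiniFile a, MiniFile b, String field) throws Exception {
    Field f = MiniFile.class.getDeclaredField(field);
    Object va = f.get(a);
    Object vb = f.get(b);
    return va instanceof long[]
        ? Arrays.equals((long[]) va, (long[]) vb)
        : Objects.equals(va, vb);
  }
}

class RecursiveCompareDemo {
  public static void main(String[] args) throws Exception {
    MiniFile expected = new MiniFile(0, 1024);
    MiniFile actual = new MiniFile(0, 1024);
    expected.splitOffsets(); // only the "encoder" side touched the cache

    System.out.println(
        "splitOffsets equal: " + MiniFile.fieldEquals(expected, actual, "splitOffsets"));
    System.out.println(
        "splitOffsetsList equal: " + MiniFile.fieldEquals(expected, actual, "splitOffsetsList"));
  }
}
```

This is exactly the asymmetry a field-by-field recursive comparison trips over, and why the cache field belongs on an ignore list.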
Two CI failures on PR apache#16265:

1. build-checks: TestBaseFileSplitOffsets#splitOffsetsListIsUnmodifiable uses assertThatThrownBy(...).isInstanceOf(...) without a .hasMessage* chain, which the AssertThatThrownByWithMessageCheck custom checkstyle rule requires. Add .hasMessage(null) — Collections.unmodifiableList throws UOE with no message, and .hasMessage(null) is the established idiom in this codebase (see TestManifestReaderStats). Also switch to the static import of assertThatThrownBy as required by the import rule on lines 91-94 of checkstyle.xml.

2. kafka-connect-tests: TestEventSerialization#testDataWrittenSerialization uses .usingRecursiveComparison(), which introspects every field — including the new transient splitOffsetsList cache. The expected side has it populated (the encoder calls splitOffsets() during Avro encode); the deserialized side has it null (no one called splitOffsets() on the freshly decoded file), so the comparison fails on the cache field even though the underlying long[] is identical. Add splitOffsetsList to the existing field-ignore list, matching the pattern already used for avroSchema, icebergSchema, schema, fromProjectionPos.

Co-Authored-By: Claude Code <noreply@anthropic.com>
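The `.hasMessage(null)` choice in fix 1 rests on a JDK behavior that is easy to check in isolation: the wrapper returned by `Collections.unmodifiableList` throws `UnsupportedOperationException` with no message. A minimal standalone check (plain Java, no AssertJ; the class name is illustrative):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class UnmodifiableMessageDemo {
  public static void main(String[] args) {
    List<Long> offsets = Collections.unmodifiableList(Arrays.asList(0L, 1024L));
    try {
      offsets.set(0, 99L);
    } catch (UnsupportedOperationException e) {
      // the JDK's unmodifiable wrapper throws UOE without a message,
      // which is why the AssertJ chain ends in .hasMessage(null)
      System.out.println("message = " + e.getMessage()); // prints "message = null"
    }
  }
}
```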
Summary
`BaseFile.splitOffsets()` previously wrapped the underlying `long[]` in a fresh `ArrayUtil.toUnmodifiableLongList` on every invocation. Manifest rewriting, format conversion, and `toString()` debugging can call `splitOffsets()` many times for the same file, leaking a `List<Long>` wrapper each call.

This PR caches the unmodifiable view in a transient field, populated lazily on the first call and invalidated wherever the underlying `long[]` is reassigned.

Resolves #15622
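The caching contract can be sketched in isolation. The class below is an illustrative stand-in, not the actual `BaseFile` code; `ArrayUtil.toUnmodifiableLongList` is replaced with a plain stream-based equivalent.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative stand-in for BaseFile's cache, not the real implementation.
class SplitOffsetsCacheSketch {
  private long[] splitOffsets;
  // transient keeps the cached view out of Java serialization
  private transient List<Long> splitOffsetsList;

  SplitOffsetsCacheSketch(long[] offsets) {
    this.splitOffsets = offsets;
  }

  // Every write site that reassigns the long[] must also drop the cache.
  void setSplitOffsets(long[] offsets) {
    this.splitOffsets = offsets;
    this.splitOffsetsList = null;
  }

  List<Long> splitOffsets() {
    if (splitOffsets == null) {
      return null; // absent offsets stay null; never cache a stale empty list
    }
    if (splitOffsetsList == null) {
      // built once, reused by every later call
      splitOffsetsList =
          Collections.unmodifiableList(
              Arrays.stream(splitOffsets).boxed().collect(Collectors.toList()));
    }
    return splitOffsetsList;
  }
}
```

The null check before the cache check is what keeps the "corrupted offsets still return null stably" behavior: a null array never populates the cache at all.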
Context
The original ask in #15622 — "BaseFile.splitOffsets() allocates a new List on every call" — was previously addressed in #15625 / #15624, both of which were stale-closed at 30 days with no review (no technical objections). This PR is a fresh, minimal revival of that approach with explicit invalidation at every write site, regression tests covering the cache contract, and AI disclosure.
The trade-off is small: one extra reference field per `BaseFile` (a few bytes), zero serialization impact (transient), and a one-line guard in `splitOffsets()` itself. Hot paths that call `splitOffsets()` repeatedly stop paying the per-call allocation cost.

Changes
- `core/src/main/java/org/apache/iceberg/BaseFile.java` — adds a transient `List<Long> splitOffsetsList` cache field. `splitOffsets()` populates the cache on first call and reuses it thereafter; corrupted offsets still return `null` stably. Writes to the `long[] splitOffsets` field (the long-form constructor, the copy constructor, and the indexed setter `case 14`) now also null out the cache so the next `splitOffsets()` call rebuilds it from the new array.
- `core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java` (new) — four tests covering: repeated-call instance reuse; the null-when-corrupted contract is stable across calls; the copy constructor produces an independent cached view; the returned list remains unmodifiable.
- `core/src/test/java/org/apache/iceberg/TestManifestReader.java` — extends `FILE_COMPARISON_CONFIG` to ignore the new transient cache field. Recursive equality on data files should not depend on whether `splitOffsets()` / `toString()` happened to be called on one side of the comparison before the assertion. (This is the same pattern the config already uses for `manifestLocation`, `fileOrdinal`, `fromProjectionPos`, etc.)

Reproduce BEFORE/AFTER yourself (copy-paste)
The only thing that changes between BEFORE and AFTER is the checked-out ref of `BaseFile.java`; the test file is the same in both runs.

What I ran locally
- `./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets"` on `origin/main` with the new test imported → 2/4 fail (regression visible).
- `./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets" --tests "org.apache.iceberg.TestContentFileParser" --tests "org.apache.iceberg.TestManifestReader" --tests "org.apache.iceberg.TestTrackedFileStruct"` → all green.
- `./gradlew :iceberg-core:spotlessCheck` → PASS.

Edge cases tested
| Scenario | Expected | Test |
| --- | --- | --- |
| Repeated `splitOffsets()` calls on a single file with offsets `[0, 1024, 2048]` | Same `List<Long>` instance, with the same contents | `splitOffsetsReturnsSameInstanceOnRepeatedCalls` |
| Corrupted offsets: `fileSize=100`, offsets `[0, 50, 200]` | `null` (caching must not turn null into a stale empty list) | `splitOffsetsNullWhenCorruptedThenReturnsNullStably` |
| Copy constructor after `splitOffsets()` called | Copy produces an independent cached view | `copyConstructorProducesIndependentCachedView` |
| `splitOffsets().set(0, 99L)` | `UnsupportedOperationException` | `splitOffsetsListIsUnmodifiable` |
| Recursive comparison of `BaseFile` (existing test) after `toString()` populated the cache on one side | Passes (cache field excluded from comparison) | `TestManifestReader#testReaderWithFilterWithoutSelect` |

Risk / blast radius
- `BaseFile` is the shared superclass for `DataFile` / `DeleteFile`. The cache is internal, transient, and only read inside `splitOffsets()`.
- `transient` means the field is excluded from Avro reflection and Java serialization. Existing `equals` / `hashCode` (delegated to `SupportsIndexProjection` / `internalGet`) are also unaffected because they go through the indexed schema, not arbitrary fields.
- Memory: one extra reference field per `BaseFile` (≈ 4 bytes on compressed-oop heaps), plus the cached `List<Long>` itself when populated (which is the same wrapper that today is allocated and discarded on every call — net memory pressure decreases for the common case).
- Thread safety: `BaseFile` instances are not documented as thread-safe across mutation, and the cached field is a benign-data-race scenario (idempotent population of an immutable view); no extra synchronization is introduced.
- Additive change. No public API surface modified.
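The "excluded from Java serialization" claim is standard `transient` semantics, which a small standalone round trip demonstrates (class names are illustrative, not Iceberg code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

// A transient field is dropped by Java serialization, so a cached
// view never travels over the wire; only the long[] does.
class TransientDemo {
  static class CachedThing implements Serializable {
    long[] data = {0, 1024};
    transient List<Long> cachedView = Arrays.asList(0L, 1024L);
  }

  public static void main(String[] args) throws Exception {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(new CachedThing());
    }
    CachedThing back;
    try (ObjectInputStream in =
        new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
      back = (CachedThing) in.readObject();
    }
    System.out.println("data survived: " + (back.data.length == 2));
    System.out.println("cache after deserialize: " + back.cachedView); // null
  }
}
```

The deserialized instance comes back with the cache field null, so the next `splitOffsets()`-style call simply rebuilds the view lazily.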
Release note
PR drafted with assistance from Claude Code. The change was reviewed manually against `apache/iceberg`'s source on `main`. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.