
Core: Cache splitOffsets list in BaseFile to avoid repeated allocations #16265

Open
jbbqqf wants to merge 2 commits into apache:main from jbbqqf:feat/15622-cache-split-offsets-list

Conversation


@jbbqqf jbbqqf commented May 9, 2026

Summary

BaseFile.splitOffsets() previously wrapped the underlying long[] in a fresh ArrayUtil.toUnmodifiableLongList on every invocation. Manifest rewriting, format conversion, and toString() debugging can call splitOffsets() many times for the same file, allocating and discarding a List<Long> wrapper on each call.

This PR caches the unmodifiable view in a transient field, populated lazily on the first call and invalidated wherever the underlying long[] is reassigned.

Resolves #15622

Context

The original ask in #15622 ("BaseFile.splitOffsets() allocates a new List on every call") was previously addressed in #15625 / #15624, both of which were stale-closed at 30 days with no review (no technical objections). This PR is a fresh, minimal revival of that approach, with explicit invalidation at every write site, regression tests covering the cache contract, and an AI disclosure.

The trade-off is small: one extra reference field per BaseFile (a few bytes), zero serialization impact (transient), and a one-line guard in splitOffsets() itself. Hot paths that call splitOffsets() repeatedly stop paying the per-call allocation cost.

Changes

  • core/src/main/java/org/apache/iceberg/BaseFile.java
    • Add a transient List<Long> splitOffsetsList cache field.
    • splitOffsets() populates the cache on first call and reuses it thereafter; corrupted offsets still return null stably.
    • All three sites that reassign the long[] splitOffsets field (the long-form constructor, the copy constructor, and the indexed setter case 14) now also null out the cache so the next splitOffsets() call rebuilds it from the new array.
    • Inline comment on the cache field explains the why and points to the issue, so a reviewer reading the diff cold doesn't have to re-derive it.
  • core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java (new) — four tests covering: repeated-call instance reuse; null-when-corrupted contract is stable across calls; copy-constructor produces an independent cached view; the returned list remains unmodifiable.
  • core/src/test/java/org/apache/iceberg/TestManifestReader.java — extends FILE_COMPARISON_CONFIG to ignore the new transient cache field. Recursive equality on data files should not depend on whether splitOffsets() / toString() happened to be called on one side of the comparison before the assertion. (This is the same pattern the config already uses for manifestLocation, fileOrdinal, fromProjectionPos, etc.)
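The caching pattern the bullets above describe can be sketched in a minimal standalone model (this is illustrative only, not the real BaseFile code; the FileModel class and its setter are hypothetical, and the real splitOffsets() additionally returns null for corrupted offsets):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of the lazy-cache-with-invalidation pattern.
class FileModel {
  private long[] splitOffsets;
  // transient in the real class so Avro reflection and Java
  // serialization never see it
  private transient List<Long> splitOffsetsList = null;

  FileModel(long[] offsets) {
    this.splitOffsets = offsets;
  }

  // Stand-in for any write site that reassigns the long[] field.
  void setSplitOffsets(long[] offsets) {
    this.splitOffsets = offsets;
    this.splitOffsetsList = null; // invalidate so the next call rebuilds
  }

  List<Long> splitOffsets() {
    if (splitOffsets == null) {
      return null;
    }
    if (splitOffsetsList == null) {
      // built once, reused on every subsequent call
      splitOffsetsList =
          Collections.unmodifiableList(
              Arrays.stream(splitOffsets).boxed().collect(Collectors.toList()));
    }
    return splitOffsetsList;
  }
}

public class Main {
  public static void main(String[] args) {
    FileModel file = new FileModel(new long[] {0L, 1024L, 2048L});
    List<Long> first = file.splitOffsets();
    List<Long> second = file.splitOffsets();
    System.out.println("same instance: " + (first == second));

    file.setSplitOffsets(new long[] {0L, 4096L});
    System.out.println(
        "rebuilt after write: " + file.splitOffsets().equals(Arrays.asList(0L, 4096L)));
  }
}
```

The key invariant is that every write to the array nulls the cache, so the view can never go stale.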

Reproduce BEFORE/AFTER yourself (copy-paste)

# --- one-time setup ---
git clone https://github.com/apache/iceberg.git /tmp/repro && cd /tmp/repro

# --- BEFORE (origin/main) ---
git checkout origin/main
# Pull in only the new test class to make the regression visible against
# the unmodified BaseFile in main:
git fetch https://github.com/jbbqqf/iceberg.git feat/15622-cache-split-offsets-list
git checkout FETCH_HEAD -- core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java
./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets"
# Expected: 4 tests, 2 failures —
#   splitOffsetsReturnsSameInstanceOnRepeatedCalls (allocates fresh List per call)
#   copyConstructorProducesIndependentCachedView   (same)
# The other two pass on main because they don't depend on caching.

# --- AFTER (this PR) ---
git checkout FETCH_HEAD
./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets"
# Expected: 4 tests, 4 passes — splitOffsets() reuses a cached
# unmodifiable view, invalidated at every write site.

The only thing that changes between BEFORE and AFTER is the checked-out ref of BaseFile.java; the test file is the same in both runs.

What I ran locally

  • ./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets" on origin/main with the new test imported → 2/4 fail (regression visible).
  • Same on this branch → 4/4 pass.
  • ./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets" --tests "org.apache.iceberg.TestContentFileParser" --tests "org.apache.iceberg.TestManifestReader" --tests "org.apache.iceberg.TestTrackedFileStruct": all green.
  • ./gradlew :iceberg-core:spotlessCheck: PASS.

Edge cases tested

| # | Scenario | Input | Expected | Verified by |
|---|----------|-------|----------|-------------|
| 1 | Repeated splitOffsets() calls on a single file | well-formed offsets [0, 1024, 2048] | both calls return the same List<Long> instance, with the same contents | splitOffsetsReturnsSameInstanceOnRepeatedCalls |
| 2 | Corrupted offsets (last offset >= file size) | fileSize=100, offsets [0, 50, 200] | both calls return null (caching must not turn null into a stale empty list) | splitOffsetsNullWhenCorruptedThenReturnsNullStably |
| 3 | Copy constructor invalidates the cache | copy a file that already has splitOffsets() called | copy returns its own list with equal contents but different identity, and the copy's own cache is stable across repeated calls | copyConstructorProducesIndependentCachedView |
| 4 | The returned list remains unmodifiable | splitOffsets().set(0, 99L) | throws UnsupportedOperationException | splitOffsetsListIsUnmodifiable |
| 5 | Recursive comparison on BaseFile (existing test) | manifest read+filter, equality vs constructed file | passes regardless of whether toString() populated the cache on one side (cache field excluded from comparison) | TestManifestReader#testReaderWithFilterWithoutSelect |
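Edge case 4 hinges on a JDK detail worth seeing in isolation: Collections.unmodifiableList throws an UnsupportedOperationException with no message. The sketch below (plain Java; the real tests use AssertJ on real DataFile instances) demonstrates the contract:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class Main {
  public static void main(String[] args) {
    // Edge case 4: an unmodifiable view must reject mutation.
    List<Long> offsets = Collections.unmodifiableList(Arrays.asList(0L, 1024L, 2048L));
    try {
      offsets.set(0, 99L);
      System.out.println("mutation allowed: true");
    } catch (UnsupportedOperationException e) {
      // The JDK throws UOE with no message here, which is why the
      // checkstyle fix later in this PR asserts .hasMessage(null).
      System.out.println("threw UOE with null message: " + (e.getMessage() == null));
    }
  }
}
```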

Risk / blast radius

  • BaseFile is the shared superclass for DataFile/DeleteFile. The cache is internal, transient, and only read inside splitOffsets().
  • Wire/serialization: unaffected — transient means the field is excluded from Avro reflection and Java serialization. Existing equals/hashCode (delegated to SupportsIndexProjection / internalGet) are also unaffected because they go through the indexed schema, not arbitrary fields.
  • Memory: one extra reference field per BaseFile (≈ 4 bytes on compressed-oop heaps), plus the cached List<Long> itself when populated (which is the same wrapper that today is allocated and discarded on every call — net memory pressure decreases for the common case).
  • Concurrency: BaseFile instances are not documented as thread-safe across mutation, and the cached field is a benign-data-race scenario (idempotent population of an immutable view); no extra synchronization is introduced.
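The benign-data-race claim in the concurrency bullet can be illustrated with a small sketch (hypothetical Cached class, not the real BaseFile; volatile is used here only to make the example well-defined, the actual field adds no synchronization): if two threads race on the lazy population, each either reuses the published view or builds its own equivalent immutable view, so every observable outcome is correct.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Hypothetical model of idempotent lazy population of an immutable view.
class Cached {
  private final long[] offsets = {0L, 1024L};
  private volatile List<Long> view; // illustrative; the real field is plain

  List<Long> get() {
    List<Long> v = view;
    if (v == null) {
      // Racing threads may each build a wrapper; all candidates are
      // equal immutable views, so "last write wins" is harmless.
      v = Collections.unmodifiableList(
          Arrays.stream(offsets).boxed().collect(Collectors.toList()));
      view = v;
    }
    return v;
  }
}

public class Main {
  public static void main(String[] args) throws Exception {
    Cached c = new Cached();
    ExecutorService pool = Executors.newFixedThreadPool(2);
    Future<List<Long>> a = pool.submit(c::get);
    Future<List<Long>> b = pool.submit(c::get);
    // Regardless of which thread populated the cache, both observe
    // the same contents.
    System.out.println("views equal: " + a.get().equals(b.get()));
    pool.shutdown();
  }
}
```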

Additive change. No public API surface modified.

Release note

BaseFile.splitOffsets() now caches its unmodifiable List<Long> view, eliminating per-call allocations during manifest rewrites and format conversions.

PR drafted with assistance from Claude Code. The change was reviewed manually against apache/iceberg's source on main. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.

BaseFile.splitOffsets() previously wrapped the underlying long[] in a
fresh ArrayUtil.toUnmodifiableLongList on every invocation. Manifest
rewriting, format conversion, and toString() debugging can call
splitOffsets() many times for the same file, allocating and
discarding a List<Long> wrapper each time.

Cache the unmodifiable view in a transient field that is populated
lazily on the first call and invalidated wherever the underlying
long[] is reassigned (the long-form constructor, the copy
constructor, and the indexed setter). Marking it transient keeps it
out of the wire format and Java serialization.

Adds TestBaseFileSplitOffsets covering the four behaviors that matter:
repeated calls return the same instance, corrupted offsets still
yield null stably, the copy constructor produces an independent
cached view, and the returned list remains unmodifiable.

Updates TestManifestReader#FILE_COMPARISON_CONFIG to ignore the new
transient cache field — recursive equality on data files should not
depend on whether splitOffsets() / toString() happened to be called
on one side of the comparison.

Resolves apache#15622

Co-Authored-By: Claude Code <noreply@anthropic.com>
github-actions bot added the core label May 9, 2026
Two CI failures on PR apache#16265:

1. build-checks: TestBaseFileSplitOffsets#splitOffsetsListIsUnmodifiable
   uses assertThatThrownBy(...).isInstanceOf(...) without a .hasMessage*
   chain, which the AssertThatThrownByWithMessageCheck custom checkstyle
   rule requires. Add .hasMessage(null) — Collections.unmodifiableList
   throws UOE with no message, and .hasMessage(null) is the established
   idiom in this codebase (see TestManifestReaderStats). Also switch to
   the static import of assertThatThrownBy as required by the import
   rule on lines 91-94 of checkstyle.xml.

2. kafka-connect-tests: TestEventSerialization#testDataWrittenSerialization
   uses .usingRecursiveComparison() which introspects every field —
   including the new transient splitOffsetsList cache. The expected
   side has it populated (encoder calls splitOffsets() during Avro
   encode), the deserialized side has it null (no one called
   splitOffsets() on the freshly-decoded file), so the comparison
   fails on the cache field even though the underlying long[] is
   identical. Add splitOffsetsList to the existing field-ignore list,
   matching the pattern already used for avroSchema, icebergSchema,
   schema, fromProjectionPos.

Co-Authored-By: Claude Code <noreply@anthropic.com>
