
Core: Cache splitOffsets list in BaseFile to avoid repeated allocations #16265

Open
jbbqqf wants to merge 2 commits into apache:main from jbbqqf:feat/15622-cache-split-offsets-list

Conversation


@jbbqqf jbbqqf commented May 9, 2026

Summary

BaseFile.splitOffsets() previously wrapped the underlying long[] in a fresh ArrayUtil.toUnmodifiableLongList on every invocation. Manifest rewriting, format conversion, and toString() debugging can call splitOffsets() many times for the same file, allocating and discarding a List<Long> wrapper on each call.

This PR caches the unmodifiable view in a transient field, populated lazily on the first call and invalidated wherever the underlying long[] is reassigned.

Resolves #15622

Context

The original ask in #15622 ("BaseFile.splitOffsets() allocates a new List on every call") was previously addressed in #15625 / #15624, both of which were stale-closed at 30 days with no review (no technical objections). This PR is a fresh, minimal revival of that approach, with explicit invalidation at every write site, regression tests covering the cache contract, and an AI disclosure.

The trade-off is small: one extra reference field per BaseFile (a few bytes), zero serialization impact (transient), and a one-line guard in splitOffsets() itself. Hot paths that call splitOffsets() repeatedly stop paying the per-call allocation cost.

Changes

  • core/src/main/java/org/apache/iceberg/BaseFile.java
    • Add a transient List<Long> splitOffsetsList cache field.
    • splitOffsets() populates the cache on first call and reuses it thereafter; corrupted offsets still return null stably.
    • All three sites that reassign the long[] splitOffsets field (the long-form constructor, the copy constructor, and the indexed setter case 14) now also null out the cache so the next splitOffsets() call rebuilds it from the new array.
    • Inline comment on the cache field explains the why and points to the issue, so a reviewer reading the diff cold doesn't have to re-derive it.
  • core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java (new) — four tests covering: repeated-call instance reuse; null-when-corrupted contract is stable across calls; copy-constructor produces an independent cached view; the returned list remains unmodifiable.
  • core/src/test/java/org/apache/iceberg/TestManifestReader.java — extends FILE_COMPARISON_CONFIG to ignore the new transient cache field. Recursive equality on data files should not depend on whether splitOffsets() / toString() happened to be called on one side of the comparison before the assertion. (This is the same pattern the config already uses for manifestLocation, fileOrdinal, fromProjectionPos, etc.)
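The caching pattern the bullets above describe can be sketched in a minimal standalone model (this is illustrative only, not the real BaseFile code; the FileModel class and its setter are hypothetical, and the real splitOffsets() additionally returns null for corrupted offsets):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical model of the lazy-cache-with-invalidation pattern.
class FileModel {
  private long[] splitOffsets;
  // transient in the real class so Avro reflection and Java
  // serialization never see it
  private transient List<Long> splitOffsetsList = null;

  FileModel(long[] offsets) {
    this.splitOffsets = offsets;
  }

  // Stand-in for any write site that reassigns the long[] field.
  void setSplitOffsets(long[] offsets) {
    this.splitOffsets = offsets;
    this.splitOffsetsList = null; // invalidate so the next call rebuilds
  }

  List<Long> splitOffsets() {
    if (splitOffsets == null) {
      return null;
    }
    if (splitOffsetsList == null) {
      // built once, reused on every subsequent call
      splitOffsetsList =
          Collections.unmodifiableList(
              Arrays.stream(splitOffsets).boxed().collect(Collectors.toList()));
    }
    return splitOffsetsList;
  }
}

public class Main {
  public static void main(String[] args) {
    FileModel file = new FileModel(new long[] {0L, 1024L, 2048L});
    List<Long> first = file.splitOffsets();
    List<Long> second = file.splitOffsets();
    System.out.println("same instance: " + (first == second));

    file.setSplitOffsets(new long[] {0L, 4096L});
    System.out.println(
        "rebuilt after write: " + file.splitOffsets().equals(Arrays.asList(0L, 4096L)));
  }
}
```

The key invariant is that every write to the array nulls the cache, so the view can never go stale.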

Reproduce BEFORE/AFTER yourself (copy-paste)

# --- one-time setup ---
git clone https://github.com/apache/iceberg.git /tmp/repro && cd /tmp/repro

# --- BEFORE (origin/main) ---
git checkout origin/main
# Pull in only the new test class to make the regression visible against
# the unmodified BaseFile in main:
git fetch https://github.com/jbbqqf/iceberg.git feat/15622-cache-split-offsets-list
git checkout FETCH_HEAD -- core/src/test/java/org/apache/iceberg/TestBaseFileSplitOffsets.java
./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets"
# Expected: 4 tests, 2 failures —
#   splitOffsetsReturnsSameInstanceOnRepeatedCalls (allocates fresh List per call)
#   copyConstructorProducesIndependentCachedView   (same)
# The other two pass on main because they don't depend on caching.

# --- AFTER (this PR) ---
git checkout FETCH_HEAD
./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets"
# Expected: 4 tests, 4 passes — splitOffsets() reuses a cached
# unmodifiable view, invalidated at every write site.

The only thing that changes between BEFORE and AFTER is the checked-out ref of BaseFile.java; the test file is the same in both runs.

What I ran locally

  • ./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets" on origin/main with the new test imported → 2/4 fail (regression visible).
  • Same on this branch → 4/4 pass.
  • ./gradlew :iceberg-core:test --tests "org.apache.iceberg.TestBaseFileSplitOffsets" --tests "org.apache.iceberg.TestContentFileParser" --tests "org.apache.iceberg.TestManifestReader" --tests "org.apache.iceberg.TestTrackedFileStruct": all green.
  • ./gradlew :iceberg-core:spotlessCheck: PASS.

Edge cases tested

| # | Scenario | Input | Expected | Verified by |
|---|----------|-------|----------|-------------|
| 1 | Repeated splitOffsets() calls on a single file | well-formed offsets [0, 1024, 2048] | both calls return the same List<Long> instance, with the same contents | splitOffsetsReturnsSameInstanceOnRepeatedCalls |
| 2 | Corrupted offsets (last offset >= file size) | fileSize=100, offsets [0, 50, 200] | both calls return null (caching must not turn null into a stale empty list) | splitOffsetsNullWhenCorruptedThenReturnsNullStably |
| 3 | Copy constructor invalidates the cache | copy a file that already has splitOffsets() called | copy returns its own list with equal contents but different identity, and the copy's own cache is stable across repeated calls | copyConstructorProducesIndependentCachedView |
| 4 | The returned list remains unmodifiable | splitOffsets().set(0, 99L) | throws UnsupportedOperationException | splitOffsetsListIsUnmodifiable |
| 5 | Recursive comparison on BaseFile (existing test) | manifest read+filter, equality vs constructed file | passes regardless of whether toString() populated the cache on one side (cache field excluded from comparison) | TestManifestReader#testReaderWithFilterWithoutSelect |
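Edge case 4 hinges on a JDK detail worth seeing in isolation: Collections.unmodifiableList throws an UnsupportedOperationException with no message. The sketch below (plain Java; the real tests use AssertJ on real DataFile instances) demonstrates the contract:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class Main {
  public static void main(String[] args) {
    // Edge case 4: an unmodifiable view must reject mutation.
    List<Long> offsets = Collections.unmodifiableList(Arrays.asList(0L, 1024L, 2048L));
    try {
      offsets.set(0, 99L);
      System.out.println("mutation allowed: true");
    } catch (UnsupportedOperationException e) {
      // The JDK throws UOE with no message here, which is why the
      // checkstyle fix later in this PR asserts .hasMessage(null).
      System.out.println("threw UOE with null message: " + (e.getMessage() == null));
    }
  }
}
```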

Risk / blast radius

  • BaseFile is the shared superclass for DataFile/DeleteFile. The cache is internal, transient, and only read inside splitOffsets().
  • Wire/serialization: unaffected — transient means the field is excluded from Avro reflection and Java serialization. Existing equals/hashCode (delegated to SupportsIndexProjection / internalGet) are also unaffected because they go through the indexed schema, not arbitrary fields.
  • Memory: one extra reference field per BaseFile (≈ 4 bytes on compressed-oop heaps), plus the cached List<Long> itself when populated (which is the same wrapper that today is allocated and discarded on every call — net memory pressure decreases for the common case).
  • Concurrency: BaseFile instances are not documented as thread-safe across mutation, and the cached field is a benign-data-race scenario (idempotent population of an immutable view); no extra synchronization is introduced.
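The benign-data-race claim in the concurrency bullet can be illustrated with a small sketch (hypothetical Cached class, not the real BaseFile; volatile is used here only to make the example well-defined, the actual field adds no synchronization): if two threads race on the lazy population, each either reuses the published view or builds its own equivalent immutable view, so every observable outcome is correct.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.Collectors;

// Hypothetical model of idempotent lazy population of an immutable view.
class Cached {
  private final long[] offsets = {0L, 1024L};
  private volatile List<Long> view; // illustrative; the real field is plain

  List<Long> get() {
    List<Long> v = view;
    if (v == null) {
      // Racing threads may each build a wrapper; all candidates are
      // equal immutable views, so "last write wins" is harmless.
      v = Collections.unmodifiableList(
          Arrays.stream(offsets).boxed().collect(Collectors.toList()));
      view = v;
    }
    return v;
  }
}

public class Main {
  public static void main(String[] args) throws Exception {
    Cached c = new Cached();
    ExecutorService pool = Executors.newFixedThreadPool(2);
    Future<List<Long>> a = pool.submit(c::get);
    Future<List<Long>> b = pool.submit(c::get);
    // Regardless of which thread populated the cache, both observe
    // the same contents.
    System.out.println("views equal: " + a.get().equals(b.get()));
    pool.shutdown();
  }
}
```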

Additive change. No public API surface modified.

Release note

BaseFile.splitOffsets() now caches its unmodifiable List<Long> view, eliminating per-call allocations during manifest rewrites and format conversions.

PR drafted with assistance from Claude Code. The change was reviewed manually against apache/iceberg's source on main. The reproducer block above was used during development; it is the same one a reviewer can paste verbatim.

BaseFile.splitOffsets() previously wrapped the underlying long[] in a
fresh ArrayUtil.toUnmodifiableLongList on every invocation. Manifest
rewriting, format conversion, and toString() debugging can call
splitOffsets() many times for the same file, allocating and
discarding a List<Long> wrapper each time.

Cache the unmodifiable view in a transient field that is populated
lazily on the first call and invalidated wherever the underlying
long[] is reassigned (the long-form constructor, the copy
constructor, and the indexed setter). Marking it transient keeps it
out of the wire format and Java serialization.

Adds TestBaseFileSplitOffsets covering the four behaviors that matter:
repeated calls return the same instance, corrupted offsets still
yield null stably, the copy constructor produces an independent
cached view, and the returned list remains unmodifiable.

Updates TestManifestReader#FILE_COMPARISON_CONFIG to ignore the new
transient cache field — recursive equality on data files should not
depend on whether splitOffsets() / toString() happened to be called
on one side of the comparison.

Resolves apache#15622

Co-Authored-By: Claude Code <noreply@anthropic.com>
github-actions bot added the core label May 9, 2026
Two CI failures on PR apache#16265:

1. build-checks: TestBaseFileSplitOffsets#splitOffsetsListIsUnmodifiable
   uses assertThatThrownBy(...).isInstanceOf(...) without a .hasMessage*
   chain, which the AssertThatThrownByWithMessageCheck custom checkstyle
   rule requires. Add .hasMessage(null) — Collections.unmodifiableList
   throws UOE with no message, and .hasMessage(null) is the established
   idiom in this codebase (see TestManifestReaderStats). Also switch to
   the static import of assertThatThrownBy as required by the import
   rule on lines 91-94 of checkstyle.xml.

2. kafka-connect-tests: TestEventSerialization#testDataWrittenSerialization
   uses .usingRecursiveComparison() which introspects every field —
   including the new transient splitOffsetsList cache. The expected
   side has it populated (encoder calls splitOffsets() during Avro
   encode), the deserialized side has it null (no one called
   splitOffsets() on the freshly-decoded file), so the comparison
   fails on the cache field even though the underlying long[] is
   identical. Add splitOffsetsList to the existing field-ignore list,
   matching the pattern already used for avroSchema, icebergSchema,
   schema, fromProjectionPos.

Co-Authored-By: Claude Code <noreply@anthropic.com>
