Skip to content

[AMORO-4006][ams] Fix base table snapshot expiration failure for keyed Mixed-Iceberg#4210

Open
fightBoxing wants to merge 2 commits intoapache:masterfrom
fightBoxing:fix/issue-4006-base-table-expire-snapshots-metadata-refresh
Open

[AMORO-4006][ams] Fix base table snapshot expiration failure for keyed Mixed-Iceberg#4210
fightBoxing wants to merge 2 commits intoapache:masterfrom
fightBoxing:fix/issue-4006-base-table-expire-snapshots-metadata-refresh

Conversation

@fightBoxing
Copy link
Copy Markdown

Why are the changes needed?

Closes #4006.

For keyed Mixed-Iceberg tables managed by AMS's internal REST catalog, the base store and the change store share the same server-side TableMetadata record (and therefore the same meta_version). In MixedTableMaintainer.expireSnapshots() the change maintainer runs before the base maintainer:

if (changeMaintainer != null) {
  changeMaintainer.expireSnapshots();   // bumps the shared meta_version
}
baseMaintainer.expireSnapshots();       // still holds a stale Table reference

After the change maintainer commits, the shared meta_version has been bumped on the server side, but the baseMaintainer.table reference was captured at construction time and its cached TableMetadata still points to the old version. The subsequent base commit then trips the optimistic-lock check and fails with CommitFailedException, so base-store snapshots are never actually expired — only the change store is cleaned up, causing steady metadata growth over time as reported in the issue.

Brief change log

Fix — MixedTableMaintainer.java

  • After changeMaintainer.expireSnapshots() (and its @VisibleForTesting overload) finishes, refresh the base maintainer's cached Iceberg Table before running the base expiration, so the next commit starts from the latest meta_version.
  • The refresh is wrapped in a try/catch that logs a warning on failure — a transient refresh error must not mask the original error signal coming from baseMaintainer.expireSnapshots().

Regression tests — TestSnapshotExpire.java

  1. testRefreshBaseTableBeforeBaseExpiration — wraps baseMaintainer.table with a Mockito spy (injected via reflection) and verifies that refresh() is invoked during expireSnapshots(). The pre-fix code path does not call refresh() inside the expire-snapshots flow, so this test fails without the fix and passes with it (verified locally by reverting MixedTableMaintainer.java to upstream/master).
  2. testExpireSnapshotsShrinksBothStores — end-to-end smoke test that drives expireSnapshots() on a keyed table with multiple expirable snapshots on both stores and asserts that both snapshot lists shrink. Acts as a happy-path safety net.

Note: The default BasicCatalogTestHelper backend used by this test class keeps base and change stores in independent Iceberg tables (no shared meta_version), so it cannot physically reproduce the server-side conflict. The first test therefore asserts the fix at the behavioural level — that refresh() is called on the base maintainer's cached table exactly where the fix introduces it.

How was this patch tested?

Locally with JDK 11 + Maven against the reactor:

mvn -pl amoro-ams -am surefire:test \
    -Dtest='TestSnapshotExpire#testRefreshBaseTableBeforeBaseExpiration+testExpireSnapshotsShrinksBothStores'
Scenario testRefreshBaseTableBeforeBaseExpiration testExpireSnapshotsShrinksBothStores
With the fix applied ✅ PASS (4/4 keyed parameterised instances) ✅ PASS
Fix reverted to upstream/master ❌ FAIL — refresh() never called ✅ PASS (expected — HadoopCatalog does not share meta_version)

Final result with the fix in place:

Tests run: 8, Failures: 0, Errors: 0, Skipped: 4, Time elapsed: 8.937 s
[INFO] BUILD SUCCESS

(8 = 4 parameterised configs × 2 methods; 4 skipped are non-keyed configs gated by Assume.assumeTrue(isKeyedTable()).)

Also verified:

  • mvn spotless:apply — no format violations.
  • mvn -pl amoro-ams -am test-compileBUILD SUCCESS across the whole reactor.

Documentation

Does this PR introduce any user-facing change?

  • Yes.
  • No. Internal bug fix; no API, config, or docs change.

rockyyin added 2 commits May 8, 2026 16:22
…d Mixed-Iceberg

For Mixed-Iceberg tables managed by AMS internal REST catalog, the base
store and the change store share the same server-side TableMetadata
record and therefore the same meta_version. In MixedTableMaintainer the
changeMaintainer.expireSnapshots() is invoked before baseMaintainer, and
after it commits the shared meta_version has been bumped. The
baseMaintainer however still holds the Iceberg Table instance captured
at construction time, whose cached TableMetadata points to the old
meta_version. When baseMaintainer then tries to commit its own snapshot
expiration it trips the optimistic-lock check and fails with
CommitFailedException, leaving the base store snapshots never expired.

Refresh the base table metadata right after the change store commits so
that the base store's subsequent expireSnapshots() commit starts from
the latest version. A try/catch guards against transient refresh
failures so they don't mask the original error signal.

Closes apache#4006
…expiration refresh

Adds two regression tests in TestSnapshotExpire covering issue apache#4006:

1. testRefreshBaseTableBeforeBaseExpiration — wraps the base maintainer's
   cached Iceberg Table with a Mockito spy via reflection and verifies
   that refresh() is invoked during MixedTableMaintainer.expireSnapshots().
   Prior to the fix, no refresh() call happens on the base maintainer's
   cached table during the expire-snapshots flow, so this test fails
   without the fix and passes with it.

2. testExpireSnapshotsShrinksBothStores — end-to-end smoke test that
   exercises expireSnapshots() on a keyed Mixed-Iceberg table with
   expirable snapshots on both stores and asserts that both the base
   store and the change store snapshot lists shrink. Serves as a safety
   net for the happy path.

The BasicCatalogTestHelper backend used by this test class cannot
faithfully reproduce the shared-meta_version scenario (base and change
stores live in independent Iceberg tables), so the first test asserts
the fix behaviourally (refresh() is called), which is the direct code
path introduced by the fix.

Verified locally: with the fix applied both tests pass; after reverting
MixedTableMaintainer to upstream/master the first test fails as expected.
@github-actions github-actions Bot added the module:ams-server Ams server module label May 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module:ams-server Ams server module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: keyed mixed iceberg format table always failed to expire snapshots of it's basetable

1 participant