Skip to content

fix(incremental-scan): treat EXISTING manifest entries as appends when from=None#74

Merged
gbrgr merged 5 commits intomainfrom
gb/inc-scan-fix
Apr 21, 2026
Merged

fix(incremental-scan): treat EXISTING manifest entries as appends when from=None#74
gbrgr merged 5 commits intomainfrom
gb/inc-scan-fix

Conversation

@gbrgr
Copy link
Copy Markdown
Collaborator

@gbrgr gbrgr commented Apr 21, 2026

Summary

  • When from_snapshot_id is None (scan from the beginning of time), the incremental scan was silently dropping EXISTING manifest entries, causing data from expired snapshots to go missing
  • Root cause: three compounding filters excluded data files whose original snapshot was expired from the catalog — the added_snapshot_id manifest filter, the filter_fn entry filter, and the else { Ok(()) } status check
  • Fix: when from=None, switch to full-scan semantics — use only to_snapshot's manifest list (no added_snapshot_id filter), clear filter_fn, and treat EXISTING entries as appends alongside Added entries

Test plan

  • New test test_incremental_scan_from_none_includes_existing_entries_from_expired_snapshots — creates a 2-snapshot table, rebuilds the Table with only snapshot 2 in the metadata (simulating snapshot 1 being expired), and asserts that from=None returns all live rows including those from the expired snapshot
  • All 28 existing incremental scan tests pass without regression

🤖 Generated with Claude Code

gbrgr and others added 5 commits April 21, 2026 18:03
…n from=None

When `from_snapshot_id` is None (scan from the beginning of time), the incremental
scan now uses full-scan semantics instead of silently dropping EXISTING manifest
entries. Previously, data files from expired snapshots that appear as EXISTING
entries in later manifests were missed, causing incomplete results.

Changes:
- `IncrementalPlanContext` gains a `from_snapshot_id: Option<i64>` field to track
  whether the user provided an explicit starting snapshot.
- `build_manifest_file_contexts`: when `from=None`, reads ALL manifests from
  `to_snapshot`'s manifest list without filtering by `added_snapshot_id`, and
  clears `filter_fn` so entries from expired snapshots pass through.
- `plan_files`: when `from=None`, treats EXISTING data entries as appends (alongside
  Added entries) and skips Deleted entries; also skips `from_snapshot_collection`
  since equality deletes are already handled per-file via `AppendedFileScanTask`.
- Adds regression test that simulates an expired snapshot: snapshot 1 is removed
  from the catalog metadata while its data files (referenced as EXISTING entries in
  snapshot 2's manifest) must still appear in the append stream.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Update version comments to match the exact tags that the pinned commit
SHAs correspond to, fixing zizmor v1.23.1 ref-version-mismatch findings.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Extract `scan_and_verify` free function from `verify_incremental_scan`
so tests can pass `Option<i64>` snapshot IDs and an arbitrary `Table`.

Add `IncrementalTestFixture::with_expired_snapshots(n)` that rebuilds
table metadata with the `n` oldest snapshots removed, encapsulating the
inline JSON construction that was in the test body.

The regression test drops from ~80 lines to ~20.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@gbrgr gbrgr merged commit 4182137 into main Apr 21, 2026
21 checks passed
@gbrgr gbrgr deleted the gb/inc-scan-fix branch April 21, 2026 17:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant