Core: Fix row ID assignment for EXISTING entry during a manifest merge#16263
Core: Fix row ID assignment for EXISTING entry during a manifest merge#16263amogh-jahagirdar wants to merge 1 commit intoapache:mainfrom
Conversation
| } | ||
| }; | ||
| } else { | ||
| // data file's first_row_id is null when the manifest's first_row_id is null |
There was a problem hiding this comment.
I'm still checking if changing idAssigner like this for all cases is really the right way to fix this or if during manifest merging we should actually pass in our own id assignment transformer that preserves the existing case.
Here we're changing the manifest reader because
// data file's first_row_id is null when the manifest's first_row_id is null
isn't true in the context of reading filtered manifests that have EXISTING entries which need to have first row IDs preserved but the filtered manifest does not have a first row ID.
kevinjqliu
left a comment
There was a problem hiding this comment.
LGTM
i ran the test on main and it failed. one nit on using rewriteFiles as it is deprecated
| table.newAppend().appendFile(FILE_A).appendFile(FILE_B).commit(); | ||
| // FILE_A→0, FILE_B→125; nextRowId=225 | ||
| table.updateProperties().set(TableProperties.MANIFEST_MIN_MERGE_COUNT, "1").commit(); | ||
| table.newRewrite().rewriteFiles(Set.of(FILE_A), Set.of(FILE_C)).commit(); |
There was a problem hiding this comment.
| table.newRewrite().rewriteFiles(Set.of(FILE_A), Set.of(FILE_C)).commit(); | |
| table.newRewrite().deleteFile(FILE_A).addFile(FILE_C).commit(); |
nit: rewriteFiles is deprecated
|
|
||
| return entry; | ||
| }; | ||
| // preserve any existing file-level first_row_id |
There was a problem hiding this comment.
Maybe add a more explanatory comment here?
| // preserve any existing file-level first_row_id | |
| // We must preserve existing row id assignments. Existing entries | |
| // carry their own file-level first_row_id from the source manifest, | |
| // even though intermediate manifests may have a null first_row_id. |
We observed some cases where when EXISTING entries are carried over in an operation as part of a manifest merge, the first row ID assignment is not preserved and instead gets a new first row ID based on the inheritance rules which is unexpected (and is a spec violation).