ATLAS-5274: [Impala Hook] Self-referencing INSERT OVERWRITE produces …#605
Open
achandel-01 wants to merge 1 commit into apache:master
…impala_process with empty outputs[], breaking lineage
What changes were proposed in this pull request?
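Previously, an Impala INSERT OVERWRITE (DML query) that reads from and writes to the same table produced a wrong lineage graph: the impala_process entity was created with an empty outputs[] list.

With this change, the table appears in both the inputs and the outputs of the process, so the lineage graph is complete.

Explanation: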
getEntities() walks inputNodes and outputNodes and fills the process's inputs and outputs lists, using a single processedNames set to avoid duplicates. When the same table is both read and written (e.g. self-lineage, such as INSERT OVERWRITE into the table you read from), its qualified name appears in both node lists. The input loop runs first and registers that name in processedNames. The output loop then treats the name as "already processed" and skips adding that table to outputs, even though it is a real write target. The fix is to deduplicate per side (separate sets for inputs and outputs) so that the same qualified name can appear in both inputs and outputs when the query both reads from and writes to that table.
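
The difference between the two deduplication strategies can be illustrated with a minimal, self-contained sketch. This is not the actual hook code: the class `LineageDedupSketch`, the helper `ProcessLists`, both method names, and the sample qualified name are hypothetical, and plain strings stand in for the lineage nodes that getEntities() really walks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the logic described above; not the Atlas Impala hook itself.
final class LineageDedupSketch {

    /** Inputs and outputs collected for a single process entity. */
    static final class ProcessLists {
        final List<String> inputs = new ArrayList<>();
        final List<String> outputs = new ArrayList<>();
    }

    /** Old behaviour: one shared set, so a name registered as an input is skipped as an output. */
    static ProcessLists collectWithSharedSet(List<String> inputNodes, List<String> outputNodes) {
        ProcessLists result = new ProcessLists();
        Set<String> processedNames = new HashSet<>();
        for (String name : inputNodes) {
            if (processedNames.add(name)) {
                result.inputs.add(name);
            }
        }
        for (String name : outputNodes) {
            // A self-referenced table is already in processedNames, so it is dropped here.
            if (processedNames.add(name)) {
                result.outputs.add(name);
            }
        }
        return result;
    }

    /** Fixed behaviour: one set per side, so duplicates are removed only within each list. */
    static ProcessLists collectPerSide(List<String> inputNodes, List<String> outputNodes) {
        ProcessLists result = new ProcessLists();
        Set<String> processedInputs = new HashSet<>();
        Set<String> processedOutputs = new HashSet<>();
        for (String name : inputNodes) {
            if (processedInputs.add(name)) {
                result.inputs.add(name);
            }
        }
        for (String name : outputNodes) {
            if (processedOutputs.add(name)) {
                result.outputs.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed qualified-name format, for illustration only.
        List<String> reads = Arrays.asList("db.sales@cluster");
        List<String> writes = Arrays.asList("db.sales@cluster");

        ProcessLists before = collectWithSharedSet(reads, writes);
        ProcessLists after = collectPerSide(reads, writes);

        System.out.println("shared set outputs: " + before.outputs); // prints []
        System.out.println("per-side outputs:   " + after.outputs);  // prints [db.sales@cluster]
    }
}
```

Per-side sets still remove repeats within the inputs or within the outputs, which is the property the original single set was meant to guarantee.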
How was this patch tested?
Unit tests and a Maven build.