Skip to content

Add script to extract provenance, dataset, and source metadata for Croissant#2062

Open
d-a-k-s-h-7 wants to merge 2 commits into
datacommonsorg:masterfrom
d-a-k-s-h-7:add-provenance-extraction
Open

Add script to extract provenance, dataset, and source metadata for Croissant#2062
d-a-k-s-h-7 wants to merge 2 commits into
datacommonsorg:masterfrom
d-a-k-s-h-7:add-provenance-extraction

Conversation

@d-a-k-s-h-7
Copy link
Copy Markdown

No description provided.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a utility script extract_source_dataset_provenance.py along with its unit tests to extract and assemble the Data Commons Provenance hierarchy into a structured JSON file. The review feedback suggests several improvements to enhance robustness, including adding defensive guard checks for None values in helper functions, conditionally constructing dataset information only when retrieved successfully, and explicitly specifying UTF-8 encoding when writing the output file.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +45 to +47
def get_node_property(node_data, prop_name, default=None):
"""Helper to extract a property value from the node data dictionary."""
arcs = node_data.get("arcs", {})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To prevent potential AttributeError exceptions, add a defensive guard check to ensure node_data is not None before attempting to access its properties. In Python, dict.get(key, default) can return None if the key exists in the dictionary with a value of None.

def get_node_property(node_data, prop_name, default=None):
    """Helper to extract a property value from the node data dictionary."""
    if not node_data:
        return default
    arcs = node_data.get("arcs", {})
References
  1. Prefer explicit guard checks over broad exception control flow (try...except) for handling nested data structures, as they are easier to reason about.

Comment on lines +54 to +56
def get_node_dcid(node_data, prop_name):
"""Helper to extract a DCID from a property."""
arcs = node_data.get("arcs", {})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Add a defensive guard check to ensure node_data is not None before attempting to access its properties to prevent potential AttributeError exceptions.

def get_node_dcid(node_data, prop_name):
    """Helper to extract a DCID from a property."""
    if not node_data:
        return None
    arcs = node_data.get("arcs", {})
References
  1. Prefer explicit guard checks over broad exception control flow (try...except) for handling nested data structures, as they are easier to reason about.

Comment on lines +187 to +199
if ds_dcid:
ds_data = dataset_data_map.get(ds_dcid, {})
src_dcid = ds_data.get("source_dcid")
source_info = None

if src_dcid:
source_info = source_data_map.get(src_dcid)

dataset_info = {
"name": ds_data.get("name"),
"url": ds_data.get("url"),
"source": source_info
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If ds_dcid is not found in dataset_data_map (for example, if the API call failed or returned no data for that dataset), ds_data will be an empty dictionary {}. This results in a dummy dataset_info dictionary populated with None values. It is cleaner and more robust to only construct dataset_info if ds_data is successfully retrieved.

Suggested change
if ds_dcid:
ds_data = dataset_data_map.get(ds_dcid, {})
src_dcid = ds_data.get("source_dcid")
source_info = None
if src_dcid:
source_info = source_data_map.get(src_dcid)
dataset_info = {
"name": ds_data.get("name"),
"url": ds_data.get("url"),
"source": source_info
}
if ds_dcid:
ds_data = dataset_data_map.get(ds_dcid)
if ds_data:
src_dcid = ds_data.get("source_dcid")
source_info = None
if src_dcid:
source_info = source_data_map.get(src_dcid)
dataset_info = {
"name": ds_data.get("name"),
"url": ds_data.get("url"),
"source": source_info
}

prov["dataset"] = dataset_info
final_output.append(prov)

with open(output_file, "w") as f:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

When writing to a text file, it is a best practice to explicitly specify encoding="utf-8" to ensure consistent behavior across different operating systems and environments, preventing potential UnicodeEncodeError exceptions.

Suggested change
with open(output_file, "w") as f:
with open(output_file, "w", encoding="utf-8") as f:

@d-a-k-s-h-7 d-a-k-s-h-7 force-pushed the add-provenance-extraction branch from c1f4f1d to 89386b7 Compare June 5, 2026 11:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant