Add script to extract provenance, dataset, and source metadata for Croissant#2062
d-a-k-s-h-7 wants to merge 2 commits into
Code Review
This pull request introduces a utility script extract_source_dataset_provenance.py along with its unit tests to extract and assemble the Data Commons Provenance hierarchy into a structured JSON file. The review feedback suggests several improvements to enhance robustness, including adding defensive guard checks for None values in helper functions, conditionally constructing dataset information only when retrieved successfully, and explicitly specifying UTF-8 encoding when writing the output file.
    def get_node_property(node_data, prop_name, default=None):
        """Helper to extract a property value from the node data dictionary."""
        arcs = node_data.get("arcs", {})
To prevent potential AttributeError exceptions, add a defensive guard check to ensure node_data is not None before attempting to access its properties. In Python, dict.get(key, default) can return None if the key exists in the dictionary with a value of None.
    def get_node_property(node_data, prop_name, default=None):
        """Helper to extract a property value from the node data dictionary."""
        if not node_data:
            return default
        arcs = node_data.get("arcs", {})

References
- Prefer explicit guard checks over broad exception control flow (try...except) for handling nested data structures, as they are easier to reason about.
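A minimal sketch of the `dict.get` behavior mentioned in this comment, using a made-up node dictionary and a hypothetical `safe_arcs` helper (neither comes from the PR). It illustrates that the default passed to `dict.get` is ignored when the key is present with a `None` value, so a guard on `node_data` alone may not cover a `None` `arcs` entry:

```python
def safe_arcs(node_data):
    """Return the "arcs" mapping, or {} when node_data or its arcs are missing/None.

    Hypothetical helper for illustration only; it is not part of the PR.
    """
    if not node_data:
        return {}
    # `or {}` also covers the case where "arcs" is present but set to None.
    return node_data.get("arcs") or {}


node_with_none_arcs = {"arcs": None}  # illustrative data, not real API output

# The default argument is used only when the key is absent,
# so this lookup yields None rather than {}.
raw_arcs = node_with_none_arcs.get("arcs", {})
```

Whether the extra `or {}` is needed depends on whether the API can actually return a null `arcs` value, which this page does not show.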
    def get_node_dcid(node_data, prop_name):
        """Helper to extract a DCID from a property."""
        arcs = node_data.get("arcs", {})
Add a defensive guard check to ensure node_data is not None before attempting to access its properties to prevent potential AttributeError exceptions.
    def get_node_dcid(node_data, prop_name):
        """Helper to extract a DCID from a property."""
        if not node_data:
            return None
        arcs = node_data.get("arcs", {})

References
- Prefer explicit guard checks over broad exception control flow (try...except) for handling nested data structures, as they are easier to reason about.
    if ds_dcid:
        ds_data = dataset_data_map.get(ds_dcid, {})
        src_dcid = ds_data.get("source_dcid")
        source_info = None

        if src_dcid:
            source_info = source_data_map.get(src_dcid)

        dataset_info = {
            "name": ds_data.get("name"),
            "url": ds_data.get("url"),
            "source": source_info
        }
If ds_dcid is not found in dataset_data_map (for example, if the API call failed or returned no data for that dataset), ds_data will be an empty dictionary {}. This results in a dummy dataset_info dictionary populated with None values. It is cleaner and more robust to only construct dataset_info if ds_data is successfully retrieved.
    - if ds_dcid:
    -     ds_data = dataset_data_map.get(ds_dcid, {})
    -     src_dcid = ds_data.get("source_dcid")
    -     source_info = None
    -     if src_dcid:
    -         source_info = source_data_map.get(src_dcid)
    -     dataset_info = {
    -         "name": ds_data.get("name"),
    -         "url": ds_data.get("url"),
    -         "source": source_info
    -     }
    + if ds_dcid:
    +     ds_data = dataset_data_map.get(ds_dcid)
    +     if ds_data:
    +         src_dcid = ds_data.get("source_dcid")
    +         source_info = None
    +         if src_dcid:
    +             source_info = source_data_map.get(src_dcid)
    +         dataset_info = {
    +             "name": ds_data.get("name"),
    +             "url": ds_data.get("url"),
    +             "source": source_info
    +         }
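One way to read this suggestion is as a small function that returns `None` whenever the dataset lookup fails. The sketch below assumes the map shapes implied by the snippet (`dataset_data_map` and `source_data_map` keyed by DCID); the function name `build_dataset_info` and the sample data are hypothetical and not from the PR:

```python
def build_dataset_info(ds_dcid, dataset_data_map, source_data_map):
    """Return a dataset summary dict, or None if the dataset was not retrieved.

    Hypothetical helper for illustration; names and data shapes are assumptions.
    """
    if not ds_dcid:
        return None
    ds_data = dataset_data_map.get(ds_dcid)
    if not ds_data:
        # Lookup failed or returned nothing: do not build a dict of None values.
        return None
    src_dcid = ds_data.get("source_dcid")
    source_info = source_data_map.get(src_dcid) if src_dcid else None
    return {
        "name": ds_data.get("name"),
        "url": ds_data.get("url"),
        "source": source_info,
    }


# Illustrative inputs only.
datasets = {"ds/1": {"name": "Example", "url": "https://example.org", "source_dcid": "src/1"}}
sources = {"src/1": {"name": "Example Source"}}

found = build_dataset_info("ds/1", datasets, sources)
missing = build_dataset_info("ds/unknown", datasets, sources)
```

Returning `None` explicitly also means the caller always gets a defined value. With the nested `if` form, `dataset_info` would stay unassigned on a failed lookup unless it is initialized earlier in code not shown here.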
    prov["dataset"] = dataset_info
    final_output.append(prov)

    with open(output_file, "w") as f:
When writing to a text file, it is a best practice to explicitly specify encoding="utf-8" to ensure consistent behavior across different operating systems and environments, preventing potential UnicodeEncodeError exceptions.
    - with open(output_file, "w") as f:
    + with open(output_file, "w", encoding="utf-8") as f:
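A small round-trip sketch of the encoding point, assuming the script serializes its results with `json.dump` (the PR summary mentions a JSON output file, but the exact call is not shown). The record, the temporary path, and the use of `ensure_ascii=False` are illustrative assumptions:

```python
import json
import os
import tempfile

# Illustrative record containing non-ASCII characters.
records = [{"name": "Población de México"}]

# Write to a throwaway location rather than a real output path.
output_file = os.path.join(tempfile.mkdtemp(), "provenance.json")

# Explicit encoding makes the bytes on disk independent of the platform default.
with open(output_file, "w", encoding="utf-8") as f:
    # ensure_ascii=False keeps characters readable instead of \uXXXX escapes.
    json.dump(records, f, ensure_ascii=False, indent=2)

# Read back with the same encoding to confirm the round trip.
with open(output_file, "r", encoding="utf-8") as f:
    loaded = json.load(f)
```

With `json.dump`'s default `ensure_ascii=True`, the output contains only ASCII escapes, so the explicit encoding matters most when non-ASCII text is written directly.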
c1f4f1d to 89386b7
No description provided.