Add script to extract provenance, dataset, and source metadata for Croissant by d-a-k-s-h-7 · Pull Request #2062 · datacommonsorg/data

d-a-k-s-h-7 · 2026-06-04T21:47:02Z

No description provided.

…oissant

gemini-code-assist

Code Review

This pull request introduces a utility script extract_source_dataset_provenance.py along with its unit tests to extract and assemble the Data Commons Provenance hierarchy into a structured JSON file. The review feedback suggests several improvements to enhance robustness, including adding defensive guard checks for None values in helper functions, conditionally constructing dataset information only when retrieved successfully, and explicitly specifying UTF-8 encoding when writing the output file.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-04T21:50:02Z

+def get_node_property(node_data, prop_name, default=None):
+    """Helper to extract a property value from the node data dictionary."""
+    arcs = node_data.get("arcs", {})


To prevent potential AttributeError exceptions, add a defensive guard check to ensure node_data is not None before attempting to access its properties. In Python, dict.get(key, default) can return None if the key exists in the dictionary with a value of None.

def get_node_property(node_data, prop_name, default=None): """Helper to extract a property value from the node data dictionary.""" if not node_data: return default arcs = node_data.get("arcs", {})

References

Prefer explicit guard checks over broad exception control flow (try...except) for handling nested data structures, as they are easier to reason about.

gemini-code-assist · 2026-06-04T21:50:02Z

+def get_node_dcid(node_data, prop_name):
+    """Helper to extract a DCID from a property."""
+    arcs = node_data.get("arcs", {})


Add a defensive guard check to ensure node_data is not None before attempting to access its properties to prevent potential AttributeError exceptions.

def get_node_dcid(node_data, prop_name): """Helper to extract a DCID from a property.""" if not node_data: return None arcs = node_data.get("arcs", {})

References

Prefer explicit guard checks over broad exception control flow (try...except) for handling nested data structures, as they are easier to reason about.

gemini-code-assist · 2026-06-04T21:50:02Z

+        if ds_dcid:
+            ds_data = dataset_data_map.get(ds_dcid, {})
+            src_dcid = ds_data.get("source_dcid")
+            source_info = None
+
+            if src_dcid:
+                source_info = source_data_map.get(src_dcid)
+
+            dataset_info = {
+                "name": ds_data.get("name"),
+                "url": ds_data.get("url"),
+                "source": source_info
+            }


If ds_dcid is not found in dataset_data_map (for example, if the API call failed or returned no data for that dataset), ds_data will be an empty dictionary {}. This results in a dummy dataset_info dictionary populated with None values. It is cleaner and more robust to only construct dataset_info if ds_data is successfully retrieved.

Suggested change

if ds_dcid:

ds_data = dataset_data_map.get(ds_dcid, {})

src_dcid = ds_data.get("source_dcid")

source_info = None

if src_dcid:

source_info = source_data_map.get(src_dcid)

dataset_info = {

"name": ds_data.get("name"),

"url": ds_data.get("url"),

"source": source_info

}

if ds_dcid:

ds_data = dataset_data_map.get(ds_dcid)

if ds_data:

src_dcid = ds_data.get("source_dcid")

source_info = None

if src_dcid:

source_info = source_data_map.get(src_dcid)

dataset_info = {

"name": ds_data.get("name"),

"url": ds_data.get("url"),

"source": source_info

}

gemini-code-assist · 2026-06-04T21:50:02Z

+        prov["dataset"] = dataset_info
+        final_output.append(prov)
+
+    with open(output_file, "w") as f:


When writing to a text file, it is a best practice to explicitly specify encoding="utf-8" to ensure consistent behavior across different operating systems and environments, preventing potential UnicodeEncodeError exceptions.

Suggested change

with open(output_file, "w") as f:

with open(output_file, "w", encoding="utf-8") as f:

Add script to extract provenance, dataset, and source metadata for Cr…

d86d4db

…oissant

gemini-code-assist Bot reviewed Jun 4, 2026

View reviewed changes

Apply Gemini code review suggestions

89386b7

d-a-k-s-h-7 force-pushed the add-provenance-extraction branch from c1f4f1d to 89386b7 Compare June 5, 2026 11:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add script to extract provenance, dataset, and source metadata for Croissant#2062

Add script to extract provenance, dataset, and source metadata for Croissant#2062
d-a-k-s-h-7 wants to merge 2 commits into
datacommonsorg:masterfrom
d-a-k-s-h-7:add-provenance-extraction

d-a-k-s-h-7 commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	with open(output_file, "w") as f:
	with open(output_file, "w", encoding="utf-8") as f:

Conversation

d-a-k-s-h-7 commented Jun 4, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant