Skip to content

Validation_goldens modfication#2043

Open
niveditasing wants to merge 8 commits into
datacommonsorg:masterfrom
niveditasing:validation_golden_fix
Open

Validation_goldens modfication#2043
niveditasing wants to merge 8 commits into
datacommonsorg:masterfrom
niveditasing:validation_golden_fix

Conversation

@niveditasing
Copy link
Copy Markdown
Contributor

@niveditasing niveditasing commented May 28, 2026

Summary
This PR addresses data filtering bugs, malformed CSV output formatting, path resolution errors, and parsing crashes within the validation pipeline.

Analysis and Resolution

  1. Data Filtering: Namespace Mismatches
    Issue: The "must-include" filter file (top_100k_places.csv) loaded IDs without namespaces (e.g., "Earth"), whereas the input dataset (WorldBank.csv) utilized fully qualified names (e.g., "dcid:Earth"). This exact-match failure resulted in all nodes being filtered out (0 golden records generated).

Fix: Integrated strip_namespace() during the comparison step to normalize IDs, allowing the nodes to pass the filter correctly.

  1. Output Formatting: Redundant Serialization in Single-Property CSVs
    Issue: The utility function file_write_csv_dict serialized single-property data structures into a cluttered format containing stringified dictionaries (e.g., "{'Column name ': 'Column value'}") instead of a standard single-column layout.

Fix: Replaced the utility call with manual CSV writing to explicitly define the header (Output Column) and write clean, single-column values.

  1. Path Resolution: Silent Mismatches and Execution Context Failures
    Issue: Relative paths in validation_config.json resolved against the terminal's Current Working Directory (CWD). When executed from the repository root or via automated batch jobs (which run from a nested temporary folder in output/), the runner failed to locate the golden_data directory. This caused the system to silently load 0 expected records and report a false-positive PASSED status.

Fix (runner.py): Stored self.validation_config_path and self.stats_summary in the constructor. Implemented a dynamic path-climbing resolver in run_validations that traverses parent directories relative to the config path, summary path, or CWD to reliably locate the golden_data folder across both direct and batch execution contexts.

  1. CSV Parsing: Whitespace Insensitivity and Malformed Value Crashes
    Issue: The golden CSV file contained leading spaces in headers (e.g., ' StatVar'), causing mismatches against input files. Additionally, unquoted commas within list values (e.g., [InternationalDollar, USDollar]) caused the CSV reader to generate an extra column with a key of None. Sorting keys containing None triggered a Python TypeError crash.

Fix (validator_goldens.py): Introduced a dictionary comprehension sanitizer within load_nodes_from_file:

Python
cleaned_node = {k.strip(): v for k, v in node.items() if k is not None and isinstance(k, str) and k.strip() != ''}
This strips whitespace from headers and drops None keys to prevent parsing crashes.

Verification and Testing Performed:

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates validator_goldens.py to support namespace-stripped matching when generating goldens and replaces the helper function for writing CSVs with custom csv.DictWriter logic. A review comment points out that extracting CSV column headers from only the first node in golden_nodes assumes all nodes share the same keys, which can lead to errors or missing columns if keys vary. It suggests collecting the union of all keys across all nodes instead.

Comment thread tools/import_validation/validator_goldens.py
@balit-raibot
Copy link
Copy Markdown
Contributor

@gemini-code-assist review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates tools/import_validation/validator_goldens.py to import the csv module, strip namespaces when matching nodes against must_include_values, and replace the helper file_write_csv_dict with a custom csv.DictWriter implementation. The feedback suggests adding newline='' when opening the CSV file to prevent platform-specific carriage return issues and simplifying the logic for extracting and sorting unique keys.

Comment thread tools/import_validation/validator_goldens.py Outdated
@niveditasing
Copy link
Copy Markdown
Contributor Author

@gemini-code-assist review

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces path resolution logic for summary report rules in the validation runner, cleans up CSV node keys during loading, and updates golden file generation to use Python's standard csv.DictWriter. Feedback focuses on adhering to PEP 8 line length limits, replacing debug print statements with proper logging.debug calls, and opening CSV files with newline='' to ensure cross-platform compatibility on Windows.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread tools/import_validation/runner.py
Comment thread tools/import_validation/validator_goldens.py
Comment thread tools/import_validation/validator_goldens.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants