Validation_goldens modfication#2043
Conversation
There was a problem hiding this comment.
Code Review
This pull request updates validator_goldens.py to support namespace-stripped matching when generating goldens and replaces the helper function for writing CSVs with custom csv.DictWriter logic. A review comment points out that extracting CSV column headers from only the first node in golden_nodes assumes all nodes share the same keys, which can lead to errors or missing columns if keys vary. It suggests collecting the union of all keys across all nodes instead.
|
@gemini-code-assist review |
There was a problem hiding this comment.
Code Review
This pull request updates tools/import_validation/validator_goldens.py to import the csv module, strip namespaces when matching nodes against must_include_values, and replace the helper file_write_csv_dict with a custom csv.DictWriter implementation. The feedback suggests adding newline='' when opening the CSV file to prevent platform-specific carriage return issues and simplifying the logic for extracting and sorting unique keys.
|
@gemini-code-assist review |
There was a problem hiding this comment.
Code Review
This pull request introduces path resolution logic for summary report rules in the validation runner, cleans up CSV node keys during loading, and updates golden file generation to use Python's standard csv.DictWriter. Feedback focuses on adhering to PEP 8 line length limits, replacing debug print statements with proper logging.debug calls, and opening CSV files with newline='' to ensure cross-platform compatibility on Windows.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
Summary
This PR addresses data filtering bugs, malformed CSV output formatting, path resolution errors, and parsing crashes within the validation pipeline.
Analysis and Resolution
Issue: The "must-include" filter file (top_100k_places.csv) loaded IDs without namespaces (e.g., "Earth"), whereas the input dataset (WorldBank.csv) utilized fully qualified names (e.g., "dcid:Earth"). This exact-match failure resulted in all nodes being filtered out (0 golden records generated).
Fix: Integrated strip_namespace() during the comparison step to normalize IDs, allowing the nodes to pass the filter correctly.
Issue: The utility function file_write_csv_dict serialized single-property data structures into a cluttered format containing stringified dictionaries (e.g., "{'Column name ': 'Column value'}") instead of a standard single-column layout.
Fix: Replaced the utility call with manual CSV writing to explicitly define the header (Output Column) and write clean, single-column values.
Issue: Relative paths in validation_config.json resolved against the terminal's Current Working Directory (CWD). When executed from the repository root or via automated batch jobs (which run from a nested temporary folder in output/), the runner failed to locate the golden_data directory. This caused the system to silently load 0 expected records and report a false-positive PASSED status.
Fix (runner.py): Stored self.validation_config_path and self.stats_summary in the constructor. Implemented a dynamic path-climbing resolver in run_validations that traverses parent directories relative to the config path, summary path, or CWD to reliably locate the golden_data folder across both direct and batch execution contexts.
Issue: The golden CSV file contained leading spaces in headers (e.g., ' StatVar'), causing mismatches against input files. Additionally, unquoted commas within list values (e.g., [InternationalDollar, USDollar]) caused the CSV reader to generate an extra column with a key of None. Sorting keys containing None triggered a Python TypeError crash.
Fix (validator_goldens.py): Introduced a dictionary comprehension sanitizer within load_nodes_from_file:
Python
cleaned_node = {k.strip(): v for k, v in node.items() if k is not None and isinstance(k, str) and k.strip() != ''}This strips whitespace from headers and drops None keys to prevent parsing crashes.
Verification and Testing Performed: