Fix local dataframe extension detection #779

Open
fallintoplace wants to merge 1 commit into NVIDIA-NeMo:main from fallintoplace:fix/local-dataframe-extension-loading

@fallintoplace

Summary

  • strip the leading dot from local dataframe file extensions before dispatching to pandas loaders
  • update the mocked loader test so local Path.suffix behaves like a real path
  • add real local-file regression coverage for .csv, .json, and .parquet inputs via both Path and str

Why

The local-file branch of smart_load_dataframe() used Path.suffix.lower(), which returns values like .csv. The loader dispatch compared that against csv, json, and parquet, so valid local files could fall through to Unsupported file format.
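The mismatch is easy to reproduce with the standard library alone. The snippet below is a minimal sketch, not the project's code: `SUPPORTED_EXTENSIONS`, `extension_before_fix`, and `extension_after_fix` are hypothetical names that stand in for the dispatch strings and the two versions of the extension lookup described above.

```python
from pathlib import Path

# Hypothetical stand-in for the strings the loader dispatch compares against.
SUPPORTED_EXTENSIONS = ("csv", "json", "parquet")


def extension_before_fix(path: str) -> str:
    # Path.suffix keeps the leading dot, so this yields ".csv", not "csv".
    return Path(path).suffix.lower()


def extension_after_fix(path: str) -> str:
    # Stripping the dot makes the value comparable to the dispatch strings.
    return Path(path).suffix.lower().lstrip(".")


for name in ("train.csv", "records.JSON", "table.parquet"):
    before = extension_before_fix(name)
    after = extension_after_fix(name)
    print(f"{name}: before={before!r} supported={before in SUPPORTED_EXTENSIONS}, "
          f"after={after!r} supported={after in SUPPORTED_EXTENSIONS}")
```

With the pre-fix lookup, none of the three names matches a supported extension; with the dot stripped, all three do.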

The existing test mocked suffix.lower() to return csv without the dot, which masked the bug. The new regression test writes actual files and exercises the real path behavior.
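To illustrate how a mock can hide this kind of bug, here is a small sketch using `unittest.mock.MagicMock`. The helper names (`is_supported_before_fix`, `is_supported_after_fix`) and the `SUPPORTED` set are assumptions made for illustration; they are not taken from `io_helpers.py` or its tests.

```python
from pathlib import Path
from unittest.mock import MagicMock

# Hypothetical stand-in for the extensions the dispatch accepts.
SUPPORTED = {"csv", "json", "parquet"}


def is_supported_before_fix(path) -> bool:
    # Pre-fix lookup: compares ".csv" against "csv" for a real Path.
    return path.suffix.lower() in SUPPORTED


def is_supported_after_fix(path) -> bool:
    # Post-fix lookup: the leading dot is stripped before the comparison.
    return path.suffix.lower().lstrip(".") in SUPPORTED


# A mock that returns "csv" (no dot) does not behave like a real Path,
# so the buggy lookup appears to work under test.
unrealistic_mock = MagicMock()
unrealistic_mock.suffix.lower.return_value = "csv"

# A mock that mirrors Path.suffix (".csv") exposes the bug.
realistic_mock = MagicMock()
realistic_mock.suffix.lower.return_value = ".csv"

real_path = Path("data.csv")

print(is_supported_before_fix(unrealistic_mock))  # bug masked by the mock
print(is_supported_before_fix(realistic_mock))    # bug visible
print(is_supported_before_fix(real_path))         # bug visible on a real Path
print(is_supported_after_fix(real_path))          # fixed lookup
```

The design point is that the mock's return value should mirror what the real object produces; otherwise the test exercises a code path that never occurs in production.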

Testing

  • uv run --package data-designer-config pytest packages/data-designer-config/tests/config/utils/test_io_helpers.py
  • uv run ruff check packages/data-designer-config/src/data_designer/config/utils/io_helpers.py packages/data-designer-config/tests/config/utils/test_io_helpers.py
  • uv run ruff format --check packages/data-designer-config/src/data_designer/config/utils/io_helpers.py packages/data-designer-config/tests/config/utils/test_io_helpers.py

@fallintoplace fallintoplace requested a review from a team as a code owner June 27, 2026 20:57
@github-actions

Contributor

Linked Issue Check

This PR does not reference an issue. External contributions must link to a triaged issue before the PR can be merged.

Add one of the following to your PR description:

  • Fixes #<issue-number>
  • Closes #<issue-number>
  • Resolves #<issue-number>

If no issue exists yet, open one and a maintainer will triage it.

See CONTRIBUTING.md for details.

@github-actions

Contributor

Thank you for your submission! We ask that you sign our Developer Certificate of Origin before we can accept your contribution. You can sign the DCO by adding a comment below using this text:


I have read the DCO document and I hereby sign the DCO.


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the DCO Assistant Lite bot.

@greptile-apps

greptile-apps Bot commented Jun 27, 2026

Contributor

Greptile Summary

This PR fixes a one-line bug in smart_load_dataframe where Path.suffix returns .csv (with a leading dot) but the dispatch compared against csv, json, parquet (without the dot), causing all local-file loads to fall through to ValueError. The fix adds .lstrip(".") to match the pattern already used on the URL branch.

  • io_helpers.py: Adds .lstrip(".") to dataframe.suffix.lower() so the extension matches the dispatch strings, consistent with the URL code path.
  • test_io_helpers.py: Corrects the existing mock to return ".csv" / ".json" (as a real Path.suffix would), and adds a parametrized regression test covering all three formats with both Path and str inputs against real files.

Confidence Score: 5/5

Safe to merge — the change is a single targeted fix that aligns local-file extension parsing with the already-correct URL branch.

The fix is minimal and correct: Path.suffix always returns a single leading dot (e.g. .csv) or an empty string, so .lstrip('.') reliably strips it. The new regression tests exercise real files across all supported formats and both input types, directly covering the previously broken code path. The mock correction in the existing test is also accurate.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/data-designer-config/src/data_designer/config/utils/io_helpers.py | One-line fix: adds .lstrip('.') to the local-file extension extraction, aligning it with the already-correct URL branch. Change is minimal and correct. |
| packages/data-designer-config/tests/config/utils/test_io_helpers.py | Mock corrected to use the real Path.suffix format (.csv with dot); new parametrized test covers csv/json/parquet x Path/str with real temp files and frame equality assertions. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[smart_load_dataframe input] --> B{isinstance DataFrame?}
    B -- Yes --> C[return as-is]
    B -- No --> D{starts with http?}
    D -- Yes --> E[rewrite URL]
    E --> F["ext = PurePosixPath(url).suffix.lstrip('.').lower()"]
    D -- No --> G["dataframe = Path(dataframe)"]
    G --> H["ext = dataframe.suffix.lstrip('.').lower()  FIXED"]
    H --> I{file exists?}
    I -- No --> J[raise FileNotFoundError]
    I -- Yes --> K{ext?}
    F --> K
    K -- csv --> L[pd.read_csv]
    K -- json --> M[pd.read_json lines=True]
    K -- parquet --> N[pd.read_parquet]
    K -- other --> O[raise ValueError]
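
The extension-resolution and dispatch steps in the flowchart can be sketched in plain Python. This is an illustrative sketch only, not the actual `smart_load_dataframe` implementation: `resolve_extension`, `choose_loader`, and `LOADER_BY_EXTENSION` are hypothetical names, the loaders are represented as labels rather than real pandas calls so the example runs without pandas, and the DataFrame check, URL rewrite, and file-existence check are omitted.

```python
from pathlib import Path, PurePosixPath

# Hypothetical mapping from extension to the loader named in the flowchart.
LOADER_BY_EXTENSION = {
    "csv": "pd.read_csv",
    "json": "pd.read_json(lines=True)",
    "parquet": "pd.read_parquet",
}


def resolve_extension(source) -> str:
    """Return the lowercase extension without its leading dot."""
    text = str(source)
    if text.startswith("http"):
        # URL branch: treat the URL as a POSIX-style path.
        return PurePosixPath(text).suffix.lstrip(".").lower()
    # Local branch: same normalization, which is what the fix introduces.
    return Path(text).suffix.lstrip(".").lower()


def choose_loader(source) -> str:
    """Pick a loader label, or raise ValueError for unknown extensions."""
    ext = resolve_extension(source)
    try:
        return LOADER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"Unsupported file format: {ext!r}") from None


print(choose_loader("data/train.csv"))
print(choose_loader(Path("table.PARQUET")))
print(choose_loader("https://example.com/records.json"))
```

Normalizing the extension the same way in both branches means the dispatch only has to know about one canonical form of each extension.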

Reviews (1): Last reviewed commit: "Fix local dataframe extension detection"

@fallintoplace

Author

I have read the DCO document and I hereby sign the DCO.
