Skip to content

Updating data_config.json with HF links and hashes#8699

Open
e-mny wants to merge 8 commits intoProject-MONAI:devfrom
e-mny:patch-1
Open

Updating data_config.json with HF links and hashes#8699
e-mny wants to merge 8 commits intoProject-MONAI:devfrom
e-mny:patch-1

Conversation

@e-mny
Copy link
Copy Markdown

@e-mny e-mny commented Jan 15, 2026

Fixes #8510

Description

The data_config.json needs to be updated with HF links in view of migrating to HF.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).

e-mny added 2 commits January 15, 2026 10:23
Signed-off-by: enoch <65853622+e-mny@users.noreply.github.com>
Signed-off-by: Enoch Mok <65853622+e-mny@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Jan 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Updated tests/testing_data/data_config.json: replaced url values for images, videos, and models to point to the HuggingFace dataset https://huggingface.co/datasets/MONAI/testing_data/resolve/main/...; no changes to JSON structure or hash_type/hash_val fields. Updated tests/test_utils.py: added module-level constants SAMPLE_TIFF, SAMPLE_TIFF_HASH, SAMPLE_TIFF_HASH_TYPE and a new test test_download_url() that verifies successful download with a correct hash and raises RuntimeError on a hash mismatch.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning PR partially addresses #8510 but critical reviewer feedback shows hash values were incorrectly updated—should revert to original content hashes per ericspod's comment. Revert hash_val entries to original content hashes while keeping new HF URLs. Add test exercising download+validation to surface any hash issues.
Docstring Coverage ⚠️ Warning Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed Title accurately summarizes the main change: updating data_config.json with Hugging Face links.
Description check ✅ Passed Description addresses the core objective (migrating to HF URLs) but lacks details on hash updates and omits discussion of the new test.
Out of Scope Changes check ✅ Passed New test_download_url() in test_utils.py is in-scope effort to validate download+hash as recommended by reviewers; data_config.json changes are in-scope.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/testing_data/data_config.json (1)

88-92: Migrate nrrd_example to HuggingFace.

The file exists on HuggingFace but the URL still points to GitHub releases. Update to https://huggingface.co/datasets/MONAI/testing_data/resolve/main/CT_IMAGE_cropped.nrrd for consistency with other entries.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 57fdd59 and ae4b961.

📒 Files selected for processing (1)
  • tests/testing_data/data_config.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: quick-py3 (windows-latest)
  • GitHub Check: quick-py3 (macOS-latest)
  • GitHub Check: flake8-py3 (pytype)
  • GitHub Check: flake8-py3 (codeformat)
  • GitHub Check: packaging
  • GitHub Check: quick-py3 (ubuntu-latest)
  • GitHub Check: flake8-py3 (mypy)
  • GitHub Check: min-dep-pytorch (2.8.0)
  • GitHub Check: build-docs
  • GitHub Check: min-dep-pytorch (2.6.0)
  • GitHub Check: min-dep-pytorch (2.5.1)
  • GitHub Check: min-dep-py3 (3.10)
  • GitHub Check: min-dep-py3 (3.12)
  • GitHub Check: min-dep-os (ubuntu-latest)
  • GitHub Check: min-dep-pytorch (2.7.1)
  • GitHub Check: min-dep-os (macOS-latest)
  • GitHub Check: min-dep-py3 (3.11)
  • GitHub Check: min-dep-py3 (3.9)
  • GitHub Check: min-dep-os (windows-latest)
🔇 Additional comments (3)
tests/testing_data/data_config.json (3)

94-105: LGTM!

Videos section fully migrated with consistent sha256 hashes.


106-162: LGTM!

Models section fully migrated with consistent URL format and sha256 hashes.


163-169: test_meta_file uses GitHub while model weights use HuggingFace.

This is a JSON schema file (not test data) used for metadata validation. The inconsistency may be intentional—model weights were migrated to HuggingFace with sha256 hashes, while schema/infrastructure files remain on GitHub with md5. If migration is desired, note that the same URL is also referenced in tests/testing_data/metadata.json and documentation, requiring coordinated updates.

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@ericspod
Copy link
Copy Markdown
Member

Hi @e-mny thanks for the help on this one! I think we're in the right direction, however the hash values mentioned in the file are for the file contents themselves and not the repository hashes on Huggingface. These shouldn't change then since the file contents shouldn't be different once downloaded. I think there isn't validation going on since the tests currently pass, specifically I think here and elsewhere the failing hash checks are being ignored (they produce an exception which gets tossed out by skip_if_downloading_fails).

I would suggest changing the hashes back to their original values in the file but keep the new Huggingface URLs that you've got, and then check the files can be downloaded and hash checked correctly. This may need some digging around in the code to see how the file contents are used, the file is loaded in testing_data_config for example which is called in various places. You may want to add a test for hash download checking that uses this function and checks that download_url either fails entirely to download the file or successfully downloads then correctly validates the hash value against the file.

One thing we may want to change later is to raise a different exception when the hash fails here. It should maybe raise something specific to this and so isn't suppressed by skip_if_download_fails. Let's not worry about that in this PR yet however.

e-mny and others added 4 commits April 17, 2026 09:32
Updated hash types and values for various datasets in data_config.json.

Signed-off-by: Enoch Mok <65853622+e-mny@users.noreply.github.com>
Add tests for downloading a sample TIFF file with hash validation.

Signed-off-by: Enoch Mok <65853622+e-mny@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_utils.py`:
- Around line 187-212: The test has multiple issues: quote SAMPLE_TIFF to make
it a string literal, convert the module-level function test_download_url into a
unittest.TestCase method (e.g., class TestUtils(unittest.TestCase) with def
test_download_url(self):) so self.assertRaises is defined and test discovery
works, add a Google-style docstring to that test method, wrap the successful
download call with skip_if_downloading_fails() (or call that helper per existing
pattern) to avoid transient network failures, and keep the hash-mismatch
assertion using self.assertRaises(RuntimeError) while referencing the existing
download_url and SAMPLE_TIFF/SAMPLE_TIFF_HASH/SAMPLE_TIFF_HASH_TYPE symbols.
Ensure necessary imports (tempfile, os, skip_if_downloading_fails) are present.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 55631a61-878a-4de1-a45d-cd5949a0e8ec

📥 Commits

Reviewing files that changed from the base of the PR and between 99cd2d6 and a5d9c93.

📒 Files selected for processing (1)
  • tests/test_utils.py

Comment thread tests/test_utils.py Outdated
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
tests/test_utils.py (1)

187-211: ⚠️ Potential issue | 🔴 Critical

Test still broken: unquoted URL, stray self, not a TestCase, no docstring.

No changes since last round — same four blockers remain:

  1. Line 188: URL is a bare token → SyntaxError (Ruff confirms).
  2. Line 205: self.assertRaises in a module-level function → NameError.
  3. Not inside a unittest.TestCase, so not discovered.
  4. Missing Google-style docstring (per coding guidelines).

Also per PR discussion: skip_if_downloading_fails swallows RuntimeError containing "md5 check" — fine here for sha256, but wrap only the success path so the mismatch assertion stays authoritative. And "wrong_hash" isn't valid hex for sha256; prefer "0" * 64.

🛠️ Proposed fix
-SAMPLE_TIFF = (
-    https://huggingface.co/datasets/MONAI/testing_data/resolve/main/CMU-1.tiff
-)
-
-SAMPLE_TIFF_HASH = "73a7e89bc15576587c3d68e55d9bf92f09690280166240b48ff4b48230b13bcd"
-SAMPLE_TIFF_HASH_TYPE = "sha256"
-
-def test_download_url():
-    with tempfile.TemporaryDirectory() as tempdir:
-        # successful download with correct hash
-        download_url(
-            url=SAMPLE_TIFF,
-            filepath=os.path.join(tempdir, "model.tiff"),
-            hash_val=SAMPLE_TIFF_HASH,
-            hash_type=SAMPLE_TIFF_HASH_TYPE,
-        )
-
-        # fails with wrong hash
-        with self.assertRaises(RuntimeError):
-            download_url(
-                url=SAMPLE_TIFF,
-                filepath=os.path.join(tempdir, "model_bad.tiff"),
-                hash_val="wrong_hash",
-                hash_type=SAMPLE_TIFF_HASH_TYPE,
-            )
+SAMPLE_TIFF = "https://huggingface.co/datasets/MONAI/testing_data/resolve/main/CMU-1.tiff"
+SAMPLE_TIFF_HASH = "73a7e89bc15576587c3d68e55d9bf92f09690280166240b48ff4b48230b13bcd"
+SAMPLE_TIFF_HASH_TYPE = "sha256"
+
+
+class TestDownloadUrl(unittest.TestCase):
+    """Exercise ``download_url`` success and hash-mismatch paths."""
+
+    def test_download_url(self):
+        """Download a sample TIFF and validate hash handling.
+
+        Raises:
+            RuntimeError: When the downloaded file's hash does not match.
+        """
+        with tempfile.TemporaryDirectory() as tempdir:
+            with skip_if_downloading_fails():
+                download_url(
+                    url=SAMPLE_TIFF,
+                    filepath=os.path.join(tempdir, "model.tiff"),
+                    hash_val=SAMPLE_TIFF_HASH,
+                    hash_type=SAMPLE_TIFF_HASH_TYPE,
+                )
+            with self.assertRaises(RuntimeError):
+                download_url(
+                    url=SAMPLE_TIFF,
+                    filepath=os.path.join(tempdir, "model_bad.tiff"),
+                    hash_val="0" * 64,
+                    hash_type=SAMPLE_TIFF_HASH_TYPE,
+                )

As per coding guidelines: "Docstrings should be present for all definition..." and "Ensure new or modified definitions will be covered by existing or new unit tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_utils.py` around lines 187 - 211, The URL token SAMPLE_TIFF must
be a quoted string and the test must be a TestCase method with a docstring; wrap
only the successful download call with skip_if_downloading_fails so download
problems don't hide the hash-mismatch assertion. Change SAMPLE_TIFF to a quoted
string, create a unittest.TestCase subclass (e.g., class
TestUtils(unittest.TestCase)) with a Google-style docstring, move
test_download_url into it as def test_download_url(self):, inside use with
skip_if_downloading_fails(): call download_url(url=SAMPLE_TIFF, filepath=...,
hash_val=SAMPLE_TIFF_HASH, hash_type=SAMPLE_TIFF_HASH_TYPE) and then assert the
bad-hash case with self.assertRaises(RuntimeError): download_url(...,
hash_val="0"*64, hash_type=SAMPLE_TIFF_HASH_TYPE). Ensure you reference
SAMPLE_TIFF, SAMPLE_TIFF_HASH, SAMPLE_TIFF_HASH_TYPE, test_download_url,
skip_if_downloading_fails, and download_url when making the edits.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/test_utils.py`:
- Around line 187-211: The URL token SAMPLE_TIFF must be a quoted string and the
test must be a TestCase method with a docstring; wrap only the successful
download call with skip_if_downloading_fails so download problems don't hide the
hash-mismatch assertion. Change SAMPLE_TIFF to a quoted string, create a
unittest.TestCase subclass (e.g., class TestUtils(unittest.TestCase)) with a
Google-style docstring, move test_download_url into it as def
test_download_url(self):, inside use with skip_if_downloading_fails(): call
download_url(url=SAMPLE_TIFF, filepath=..., hash_val=SAMPLE_TIFF_HASH,
hash_type=SAMPLE_TIFF_HASH_TYPE) and then assert the bad-hash case with
self.assertRaises(RuntimeError): download_url(..., hash_val="0"*64,
hash_type=SAMPLE_TIFF_HASH_TYPE). Ensure you reference SAMPLE_TIFF,
SAMPLE_TIFF_HASH, SAMPLE_TIFF_HASH_TYPE, test_download_url,
skip_if_downloading_fails, and download_url when making the edits.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 55570493-9ec6-46ab-85e1-95af310eeb5f

📥 Commits

Reviewing files that changed from the base of the PR and between a5d9c93 and 1af22aa.

📒 Files selected for processing (1)
  • tests/test_utils.py

Signed-off-by: Enoch Mok <65853622+e-mny@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
tests/test_utils.py (2)

187-189: ⚠️ Potential issue | 🔴 Critical

Quote the HF URL string; current code is invalid Python.

Line 188 is a bare token, so this file cannot be parsed.

Proposed fix
-SAMPLE_TIFF = (
-    https://huggingface.co/datasets/MONAI/testing_data/resolve/main/CMU-1.tiff
-)
+SAMPLE_TIFF = "https://huggingface.co/datasets/MONAI/testing_data/resolve/main/CMU-1.tiff"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_utils.py` around lines 187 - 189, The SAMPLE_TIFF constant is
assigned an unquoted bare token which makes the module invalid; wrap the HF URL
string in quotes (e.g., assign SAMPLE_TIFF =
"https://huggingface.co/datasets/MONAI/testing_data/resolve/main/CMU-1.tiff") so
SAMPLE_TIFF is a valid Python string literal and the file can be parsed.

194-211: ⚠️ Potential issue | 🔴 Critical

Fix test API usage and harden networked test behavior.

Line 205 uses unittest.assertRaises, which is incorrect here; this test will fail at runtime. Also add a docstring and wrap the download path with skip_if_downloading_fails() to avoid flaky CI.

Proposed fix
-def test_download_url():
-    with tempfile.TemporaryDirectory() as tempdir:
-        # successful download with correct hash
-        download_url(
-            url=SAMPLE_TIFF,
-            filepath=os.path.join(tempdir, "model.tiff"),
-            hash_val=SAMPLE_TIFF_HASH,
-            hash_type=SAMPLE_TIFF_HASH_TYPE,
-        )
-
-        # fails with wrong hash
-        with unittest.assertRaises(RuntimeError):
-            download_url(
-                url=SAMPLE_TIFF,
-                filepath=os.path.join(tempdir, "model_bad.tiff"),
-                hash_val="wrong_hash",
-                hash_type=SAMPLE_TIFF_HASH_TYPE,
-            )
+class TestDownloadUrl(unittest.TestCase):
+    """Exercise download_url success and hash mismatch paths."""
+
+    def test_download_url(self):
+        """Validate correct hash acceptance and mismatch failure.
+
+        Raises:
+            RuntimeError: If downloaded file hash validation fails.
+        """
+        with skip_if_downloading_fails(), tempfile.TemporaryDirectory() as tempdir:
+            download_url(
+                url=SAMPLE_TIFF,
+                filepath=os.path.join(tempdir, "model.tiff"),
+                hash_val=SAMPLE_TIFF_HASH,
+                hash_type=SAMPLE_TIFF_HASH_TYPE,
+            )
+
+            with self.assertRaises(RuntimeError):
+                download_url(
+                    url=SAMPLE_TIFF,
+                    filepath=os.path.join(tempdir, "model_bad.tiff"),
+                    hash_val="0" * 64,
+                    hash_type=SAMPLE_TIFF_HASH_TYPE,
+                )

As per coding guidelines, "Docstrings should be present for all definition..." and "Ensure new or modified definitions will be covered by existing or new unit tests."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_utils.py` around lines 194 - 211, The test_download_url function
incorrectly uses unittest.assertRaises and lacks a docstring and flaky-network
protection: add a one-line docstring to test_download_url, replace the incorrect
unittest.assertRaises(...) call with the context-manager form from pytest
(pytest.raises(RuntimeError):) for the "fails with wrong hash" case, and wrap
the networked download_url calls (both the successful and failing attempts that
reference SAMPLE_TIFF / SAMPLE_TIFF_HASH / SAMPLE_TIFF_HASH_TYPE and filepaths
like "model.tiff"/"model_bad.tiff") in the skip_if_downloading_fails() context
manager (import it if necessary) so CI skips the test if downloads fail.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@tests/test_utils.py`:
- Around line 187-189: The SAMPLE_TIFF constant is assigned an unquoted bare
token which makes the module invalid; wrap the HF URL string in quotes (e.g.,
assign SAMPLE_TIFF =
"https://huggingface.co/datasets/MONAI/testing_data/resolve/main/CMU-1.tiff") so
SAMPLE_TIFF is a valid Python string literal and the file can be parsed.
- Around line 194-211: The test_download_url function incorrectly uses
unittest.assertRaises and lacks a docstring and flaky-network protection: add a
one-line docstring to test_download_url, replace the incorrect
unittest.assertRaises(...) call with the context-manager form from pytest
(pytest.raises(RuntimeError):) for the "fails with wrong hash" case, and wrap
the networked download_url calls (both the successful and failing attempts that
reference SAMPLE_TIFF / SAMPLE_TIFF_HASH / SAMPLE_TIFF_HASH_TYPE and filepaths
like "model.tiff"/"model_bad.tiff") in the skip_if_downloading_fails() context
manager (import it if necessary) so CI skips the test if downloads fail.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e879430d-500e-439b-8bbd-e08fe424e8c8

📥 Commits

Reviewing files that changed from the base of the PR and between 1af22aa and 0d1890d.

📒 Files selected for processing (1)
  • tests/test_utils.py

Refactor test for download_url to use a class-based structure and improve hash validation.

Signed-off-by: Enoch Mok <65853622+e-mny@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Moving Testing Data to HuggingFace

2 participants