Replace 'aws s3 ls' shell-out in dataset.py with boto3#239
Open
goanpeca wants to merge 1 commit intoStability-AI:mainfrom
Open
Replace 'aws s3 ls' shell-out in dataset.py with boto3#239goanpeca wants to merge 1 commit intoStability-AI:mainfrom
goanpeca wants to merge 1 commit intoStability-AI:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR replaces the AWS CLI shell-outs in the S3 dataset loader with direct boto3 usage, and switches shard fetching to presigned-URL downloads (via curl) so the same code path can work with AWS S3 or other S3-compatible endpoints (via AWS_ENDPOINT_URL).
Changes:
- Added
_get_s3_client()and updatedget_s3_contents()to list objects via boto3 pagination instead ofaws s3 ls. - Updated
get_all_s3_urls()to emitpipe:curl -fsSL "<presigned>"instead ofpipe:aws s3 cp ... -. - Added offline unit tests for the new S3 behavior; documented
AWS_ENDPOINT_URLusage in the README and updated.gitignore.
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
stable_audio_tools/data/dataset.py |
Replaces AWS CLI calls with boto3 listing + presigned-URL generation for shard download. |
tests/test_dataset_s3.py |
Adds offline unit tests covering endpoint env handling, pagination/listing behavior, filtering, and presigned URL emission. |
README.md |
Documents use of AWS_ENDPOINT_URL for S3-compatible storage (Backblaze B2 example). |
.gitignore |
Adds .DS_Store ignore and normalizes wandb/* entry. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
6be8c78 to
6671c21
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Replaces the
aws s3 lssubprocess +pipe:aws s3 cpURLs instable_audio_tools/data/dataset.pywith direct boto3. HonorsAWS_ENDPOINT_URLso the same code path works against any S3-compatible endpoint — AWS S3 by default, or Backblaze B2 whenAWS_ENDPOINT_URLpoints at a B2 endpoint. Backward compatible: whenAWS_ENDPOINT_URLis unset, behavior matches the prior AWS-only path.What changed
_get_s3_client(profile=None)— small helper that builds a boto3 S3 client with optionalAWS_ENDPOINT_URLhonoring.get_s3_contents(...)— uses aboto3paginator instead ofsubprocess.run(['aws', 's3', 'ls', ...]). Returns the same shape (keys relative todataset_path).get_all_s3_urls(...)— emitspipe:curl -fsSL "<presigned URL>"for each shard instead ofpipe:aws s3 cp. Presigned URL is a 1-hour GET signed by boto3 (works withAWS_ENDPOINT_URL).boto3was already pulled in transitively byhuggingface_hub/webdataset, so no new runtime deps.Tests
Adds
tests/test_dataset_s3.py(first tests in the repo) — 8 tests covering_get_s3_clientenv handling,get_s3_contentspagination + filter + recursive vs non-recursive, andget_all_s3_urlspresigned-URL emission. All offline, no fixtures, no network — uses aMagicMockboto3 client patched in viasys.modules.python -m pytest tests/test_dataset_s3.py # 8 passed in 3.6sREADME + .gitignore
Brief README note pointing at
AWS_ENDPOINT_URLfor non-AWS endpoints;.DS_Storeadded to.gitignore.Two behaviors I deliberately changed — happy to revert either if there's a reason:
pipe:aws s3 cp(which requiredawsCLI installed + configured) →pipe:curl -fsSL "<presigned>"(curl is universally available).--cli-connect-timeout 0flag — boto3's defaults are reasonable.Optional follow-up I'd happily file separately: a minimal
.github/workflows/test.yml(Python 3.10/3.11/3.12 matrix, runspytest) so these tests are exercised in CI going forward. Kept out of this PR to keep it focused; let me know if you'd like it.Anything else from the AWS-CLI path worth preserving?