Skip to content

Replace 'aws s3 ls' shell-out in dataset.py with boto3#239

Open
goanpeca wants to merge 1 commit intoStability-AI:mainfrom
goanpeca:feat/boto3-s3-loader
Open

Replace 'aws s3 ls' shell-out in dataset.py with boto3#239
goanpeca wants to merge 1 commit intoStability-AI:mainfrom
goanpeca:feat/boto3-s3-loader

Conversation

@goanpeca
Copy link
Copy Markdown

@goanpeca goanpeca commented May 4, 2026

Replaces the aws s3 ls subprocess + pipe:aws s3 cp URLs in stable_audio_tools/data/dataset.py with direct boto3. Honors AWS_ENDPOINT_URL so the same code path works against any S3-compatible endpoint — AWS S3 by default, or Backblaze B2 when AWS_ENDPOINT_URL points at a B2 endpoint. Backward compatible: when AWS_ENDPOINT_URL is unset, behavior matches the prior AWS-only path.

What changed

  • _get_s3_client(profile=None) — small helper that builds a boto3 S3 client with optional AWS_ENDPOINT_URL honoring.
  • get_s3_contents(...) — uses a boto3 paginator instead of subprocess.run(['aws', 's3', 'ls', ...]). Returns the same shape (keys relative to dataset_path).
  • get_all_s3_urls(...) — emits pipe:curl -fsSL "<presigned URL>" for each shard instead of pipe:aws s3 cp. Presigned URL is a 1-hour GET signed by boto3 (works with AWS_ENDPOINT_URL).
  • boto3 was already pulled in transitively by huggingface_hub / webdataset, so no new runtime deps.

Tests

Adds tests/test_dataset_s3.py (first tests in the repo) — 8 tests covering _get_s3_client env handling, get_s3_contents pagination + filter + recursive vs non-recursive, and get_all_s3_urls presigned-URL emission. All offline, no fixtures, no network — uses a MagicMock boto3 client patched in via sys.modules.

python -m pytest tests/test_dataset_s3.py
# 8 passed in 3.6s

README + .gitignore

Brief README note pointing at AWS_ENDPOINT_URL for non-AWS endpoints; .DS_Store added to .gitignore.

Two behaviors I deliberately changed — happy to revert either if there's a reason:

  1. pipe:aws s3 cp (which required aws CLI installed + configured) → pipe:curl -fsSL "<presigned>" (curl is universally available).
  2. Dropped the AWS-CLI-specific --cli-connect-timeout 0 flag — boto3's defaults are reasonable.

Optional follow-up I'd happily file separately: a minimal .github/workflows/test.yml (Python 3.10/3.11/3.12 matrix, runs pytest) so these tests are exercised in CI going forward. Kept out of this PR to keep it focused; let me know if you'd like it.

Anything else from the AWS-CLI path worth preserving?

Copilot AI review requested due to automatic review settings May 4, 2026 23:31
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the AWS CLI shell-outs in the S3 dataset loader with direct boto3 usage, and switches shard fetching to presigned-URL downloads (via curl) so the same code path can work with AWS S3 or other S3-compatible endpoints (via AWS_ENDPOINT_URL).

Changes:

  • Added _get_s3_client() and updated get_s3_contents() to list objects via boto3 pagination instead of aws s3 ls.
  • Updated get_all_s3_urls() to emit pipe:curl -fsSL "<presigned>" instead of pipe:aws s3 cp ... -.
  • Added offline unit tests for the new S3 behavior; documented AWS_ENDPOINT_URL usage in the README and updated .gitignore.

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 2 comments.

File Description
stable_audio_tools/data/dataset.py Replaces AWS CLI calls with boto3 listing + presigned-URL generation for shard download.
tests/test_dataset_s3.py Adds offline unit tests covering endpoint env handling, pagination/listing behavior, filtering, and presigned URL emission.
README.md Documents use of AWS_ENDPOINT_URL for S3-compatible storage (Backblaze B2 example).
.gitignore Adds .DS_Store ignore and normalizes wandb/* entry.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread stable_audio_tools/data/dataset.py
Comment thread stable_audio_tools/data/dataset.py
@goanpeca goanpeca force-pushed the feat/boto3-s3-loader branch from 6be8c78 to 6671c21 Compare May 4, 2026 23:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants