fix(input-handler): bound URL, zip, and git ingest paths by rcha0s · Pull Request #164 · NVIDIA/SkillSpector

rcha0s · 2026-06-23T02:01:47Z

Summary

Closes #21. Closes #131.

The per-file analysis cap (MAX_FILE_BYTES, 1 MB) sits downstream of InputHandler.resolve(), which pulls scan targets from URLs, zips, or git clones with no size budget of its own. The per-file cap can therefore be defeated upstream:

a multi-GB URL is a memory DoS (client.get(url).content buffers the whole body),
a zip bomb fills the disk before extraction is gated (extractall with no size pre-check),
a large clone lands on disk regardless of the per-file analysis budget (no post-clone size check).

@wernerkasselman-au flagged this in the review of PR #19 and explicitly deferred it ("the whole ingest layer in input_handler.py runs before build_context ever sees a file, and it is unbounded"). PR #19 stayed scoped to the per-file work; this PR addresses the deferred ingest-layer hardening.

Changes

New constants

Constant	Value	Purpose
`INGEST_MAX_BYTES`	100 MiB	Cap on streamed URL downloads, total uncompressed size of zip archives, and post-clone disk usage
`INGEST_MAX_ZIP_MEMBERS`	10,000	Cap on the number of entries in a single zip — defends the "many tiny files" zip-bomb variant where each entry is small but the count itself exhausts the FS

Sized above the existing 1 MB per-file analysis cap so legitimate multi-file skills aren't blocked at ingest, but tight enough to bound a memory / disk DoS.

`_download_file` — streamed with hard cap

Replaces client.get(url).content (full-body buffering) with client.stream("GET", url) + a chunked iterator.
Up-front Content-Length check — if the server provides a Content-Length larger than INGEST_MAX_BYTES, abort before reading any body bytes.
Streamed byte counter is authoritative — if the header is missing, malformed, or wrong, the running count of bytes received via iter_bytes() aborts the read as soon as it crosses the cap.

`_extract_zip` — zip-bomb defended

Sums ZipInfo.file_size across all members and checks the member count before calling extractall. Classic zip bombs (small archive, huge declared uncompressed size) are rejected without materialising any of the bomb on disk.
Note: this is complementary to PR fix(input-handler): validate URLs against SSRF and add zip-slip protection #159 (zip-slip path traversal); that PR catches members with .. in their names, this one catches the size/count attack class.

`_clone_git` — post-clone size check

Walks the cloned tree after the existing 60s timeout completes and rejects + cleans up (shutil.rmtree) if total size exceeds INGEST_MAX_BYTES.
Symlinks are skipped (p.is_symlink() check) to avoid runaway counts on malicious /dev/zero-style links.

Error semantics

All limit breaches raise IngestLimitExceededError — a subclass of ValueError so existing callers that catch ValueError from InputHandler.resolve() continue to work. Error messages include both the limit name and the observed size for clarity.

Test plan

8 new unit tests in tests/unit/test_input_handler_bounds.py:
- Under-cap downloads / zips / clones succeed unchanged.
- Oversized Content-Length header rejected before body read (uses httpx.MockTransport with a forged header).
- Streamed body overflow rejected when header is missing (generator-backed stream so httpx can't pre-compute Content-Length).
- Declared-uncompressed-oversize zip rejected without extraction (forged central-directory file_size field).
- Too many members rejected (10,001 entries).
- Oversize clone rejected and the partial clone dir cleaned up.
Existing 6 input-handler tests continue to pass.
Full suite: 728 passed, 12 skipped.
ruff check clean.
ruff format --check clean.
DCO sign-off on the commit.

README

Added a short "Size limits" subsection under Basic Usage documenting both caps and their relationship to the per-file analysis cap (per the maintainer's deferred-work request: "Document in the README that 50 MiB is a per-file analysis limit, not an ingest limit").

Notes / follow-up

The constants are module-level rather than CLI-configurable. If you'd prefer --ingest-max-bytes / --ingest-max-zip-members flags or env vars (SKILLSPECTOR_INGEST_MAX_BYTES), happy to add — wanted to keep this PR scoped.
PR fix(input-handler): validate URLs against SSRF and add zip-slip protection #159 covers SSRF + zip-slip path traversal on the same code paths; that work is orthogonal to size bounds and the two PRs are independently mergeable.

Closes NVIDIA#21. Closes NVIDIA#131. The per-file analysis cap (MAX_FILE_BYTES, 1 MB) sits downstream of InputHandler.resolve(), which pulls scan targets from URLs, zips, or git clones with no size budget of its own. As a result, the per-file cap can be defeated upstream: a multi-GB URL is a memory DoS, a zip bomb fills the disk before extraction is gated, and a large clone lands on disk regardless of the per-file analysis budget. PR NVIDIA#19 explicitly deferred this; this PR addresses the deferred work. Adds two ingest budgets enforced at every remote/archive ingest path: - INGEST_MAX_BYTES (100 MiB): caps streamed URL downloads, total uncompressed size of zip archives, and post-clone disk usage of Git repos. - INGEST_MAX_ZIP_MEMBERS (10,000): caps the number of entries in a single zip (defends against the "many tiny files" zip-bomb variant). Per ingest path: - _download_file streams the body in 64 KiB chunks directly to a temp file inside the existing session temp dir, with a running byte counter. The cap check fires before each chunk is written, so the response body is never accumulated in memory. Content-Length is checked up-front when the server provides it (aborts before reading any body bytes); the streamed counter is authoritative if the header is missing or malformed. A breach mid-stream removes the partial file before the exception propagates, so an attacker who ships exactly INGEST_MAX_BYTES + 1 bytes cannot fill the temp dir. - _extract_zip sums ZipInfo.file_size across all members and checks the member count *before* calling extractall, so classic zip bombs (small archive, huge declared uncompressed size) are rejected without materialising any of the bomb on disk. - _clone_git measures the cloned tree's on-disk size after the existing 60s timeout completes and rejects + cleans up if it exceeds the cap. Symlinks are skipped to avoid runaway counts on malicious /dev/zero-style links. All limit breaches raise IngestLimitExceededError (subclass of ValueError so existing callers that catch ValueError keep working) with a clear message including the limit and the observed size. Adds 10 unit tests covering: under-cap success paths for URL/zip/clone; oversized Content-Length rejected before body read; chunked overflow caught by the streamed counter; *the partial download file is removed when a breach fires mid-stream*; *streaming to disk produces a file of the expected size with no intermediate in-memory concatenation*; declared-uncompressed-oversize zip rejected without extraction; member-count zip-bomb rejected; oversize clone rejected and cleaned up. Full suite: 730 passed, 12 skipped. README documents both caps and their relationship to the per-file analysis cap. Signed-off-by: Rohan Isawe <rohan.isawe@smartsheet.com>

rcha0s · 2026-06-23T02:32:02Z

Force-pushed cf31832 after a security self-review surfaced a real issue: the original v1 streamed bytes through a Python list (chunks.append(chunk)) and the cap check fired after the append, so a 100 MiB attack could still load ~100 MiB into memory before being rejected — and the doubled allocation from b"".join(chunks) made the effective peak ~200 MiB. Not the original "unbounded DoS" the PR was meant to fix, but a meaningful regression of the security claim.

v2 changes

_download_file streams directly to a temp file, not an in-memory list. The body is never held as a single bytes object. Peak memory is now bounded by _DOWNLOAD_CHUNK_BYTES (64 KiB), not INGEST_MAX_BYTES.
Partial file is removed on breach. If the byte counter trips mid-stream, the partial download is unlink'd before the exception propagates. Without this, an attacker who ships exactly INGEST_MAX_BYTES + 1 bytes could fill the temp dir to the cap on every request. Also covered for HTTP errors (transient failures don't leave partial files behind either).
Same change covers the zip download path. A downloaded zip is now rename'd into place from the partial file instead of being written from an in-memory buffer, so the same memory-pressure ceiling holds whether the URL points at a markdown file or a zip.

Two new tests

test_streamed_overflow_leaves_no_partial_file_on_disk — verifies the partial file does not survive a mid-stream breach.
test_download_streams_to_disk_not_memory — verifies a legitimate 5 MiB download lands on disk at the expected size with the sentinel partial-download path cleaned up. Can't directly measure memory in a unit test, but the file-shape assertion is a proxy: with the v1 implementation, the file would be written from a b"".join(chunks) allocation, which the test would still pass but the memory pressure would be visible at scale; with v2 the file is written incrementally.

Other security-review findings — disposition

Finding	Severity	Disposition
Memory accumulation in download (this fix)	MEDIUM	Fixed in v2
Git clone fully lands on disk before measurement	LOW-MED	Documented limitation. Mitigating would need `git clone --filter=blob:limit=...`, which adds complexity for a small win when the runner's `/tmp` is already constrained. Open to adding if you'd prefer.
`_is_git_url` substring match (`"github.com" in netloc`) is a confused-deputy hazard	LOW for this PR (pre-existing), HIGH at repo level	Out of scope here. Will flag separately on PR #159 since that one's already in the SSRF / host-validation space.
`_make_bomb_zip` test helper is fragile to multi-member extensions	LOW (test brittleness, not security)	Helper already has a spec ref + assert; deferring beyond that to keep scope tight.
Symlinks-to-files-inside-the-clone undercount in `_directory_size_bytes`	LOW	The attack requires the symlink target to already be on the filesystem outside the clone (the clone walks a fresh tmpdir), so the practical exploit surface is narrow. Documented in the docstring already.

Verification

10 unit tests in tests/unit/test_input_handler_bounds.py (8 original + 2 new).
Full suite: 730 passed, 12 skipped.
ruff check + ruff format --check: clean.
Commit message updated; DCO sign-off preserved.

Ready for re-review.

rng1995

Verdict: Approve — excellent, security-aware DoS hardening that fails closed.

What's good

Streamed chunked download with an up-front Content-Length check and a streamed byte counter (src/skillspector/input_handler.py, ~L145-170), so an oversized/mislabeled body aborts before buffering in memory; partial file is cleaned up on breach (~L172-182).
Zip: member-count cap + summed-uncompressed-size cap before extractall (~L212-224); post-clone directory-size check with cleanup for git (~L96-111); _directory_size_bytes is symlink-safe (~L234-249).
IngestLimitExceededError(ValueError) keeps back-compat with existing ValueError callers. Clear, documented constants and README note.

Non-blocking (security-relevant)

The zip check sums ZipInfo.file_size (the declared central-directory uncompressed size, ~L218). A crafted bomb that under-declares uncompressed size could still inflate on extractall. For full protection, consider per-member streamed extraction with a running on-disk byte counter that aborts mid-extract. (The forged-central-directory test exercises the declared-size path, not the under-declared case.)

Coordination

Overlaps #159 (this PR lacks the SSRF host allowlist #159 adds) and #163. Reconcile on merge so the final _download_file/_extract_zip has both SSRF allowlisting and ingest size bounds.

Tests

Thorough: MockTransport download caps (header vs streamed), partial-file cleanup, forged central-directory zip bomb, member-count bomb, oversize-clone cleanup. LGTM.

rng1995 · 2026-06-23T12:01:13Z

Please resolve the merge conflicts

rcha0s force-pushed the feat/bound-ingest-layer branch from 56cca8f to cf31832 Compare June 23, 2026 02:31

This was referenced Jun 23, 2026

fix(input-handler): validate URLs against SSRF and add zip-slip protection #159

Merged

Merge updates from upstream and integrate local security report optimizations #163

Open

rng1995 approved these changes Jun 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(input-handler): bound URL, zip, and git ingest paths#164

fix(input-handler): bound URL, zip, and git ingest paths#164
rcha0s wants to merge 1 commit into
NVIDIA:mainfrom
rcha0s:feat/bound-ingest-layer

rcha0s commented Jun 23, 2026

Uh oh!

rcha0s commented Jun 23, 2026

Uh oh!

rng1995 left a comment

Uh oh!

rng1995 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

rcha0s commented Jun 23, 2026

Summary

Changes

New constants

_download_file — streamed with hard cap

_extract_zip — zip-bomb defended

_clone_git — post-clone size check

Error semantics

Test plan

README

Notes / follow-up

Uh oh!

rcha0s commented Jun 23, 2026

v2 changes

Two new tests

Other security-review findings — disposition

Verification

Uh oh!

rng1995 left a comment

Choose a reason for hiding this comment

What's good

Non-blocking (security-relevant)

Coordination

Tests

Uh oh!

rng1995 commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

`_download_file` — streamed with hard cap

`_extract_zip` — zip-bomb defended

`_clone_git` — post-clone size check