fix(input-handler): bound URL, zip, and git ingest paths#164
Conversation
Closes NVIDIA#21. Closes NVIDIA#131. The per-file analysis cap (MAX_FILE_BYTES, 1 MB) sits downstream of InputHandler.resolve(), which pulls scan targets from URLs, zips, or git clones with no size budget of its own. As a result, the per-file cap can be defeated upstream: a multi-GB URL is a memory DoS, a zip bomb fills the disk before extraction is gated, and a large clone lands on disk regardless of the per-file analysis budget. PR NVIDIA#19 explicitly deferred this; this PR addresses the deferred work. Adds two ingest budgets enforced at every remote/archive ingest path: - INGEST_MAX_BYTES (100 MiB): caps streamed URL downloads, total uncompressed size of zip archives, and post-clone disk usage of Git repos. - INGEST_MAX_ZIP_MEMBERS (10,000): caps the number of entries in a single zip (defends against the "many tiny files" zip-bomb variant). Per ingest path: - _download_file streams the body in 64 KiB chunks directly to a temp file inside the existing session temp dir, with a running byte counter. The cap check fires before each chunk is written, so the response body is never accumulated in memory. Content-Length is checked up-front when the server provides it (aborts before reading any body bytes); the streamed counter is authoritative if the header is missing or malformed. A breach mid-stream removes the partial file before the exception propagates, so an attacker who ships exactly INGEST_MAX_BYTES + 1 bytes cannot fill the temp dir. - _extract_zip sums ZipInfo.file_size across all members and checks the member count *before* calling extractall, so classic zip bombs (small archive, huge declared uncompressed size) are rejected without materialising any of the bomb on disk. - _clone_git measures the cloned tree's on-disk size after the existing 60s timeout completes and rejects + cleans up if it exceeds the cap. Symlinks are skipped to avoid runaway counts on malicious /dev/zero-style links. All limit breaches raise IngestLimitExceededError (subclass of ValueError so existing callers that catch ValueError keep working) with a clear message including the limit and the observed size. Adds 10 unit tests covering: under-cap success paths for URL/zip/clone; oversized Content-Length rejected before body read; chunked overflow caught by the streamed counter; *the partial download file is removed when a breach fires mid-stream*; *streaming to disk produces a file of the expected size with no intermediate in-memory concatenation*; declared-uncompressed-oversize zip rejected without extraction; member-count zip-bomb rejected; oversize clone rejected and cleaned up. Full suite: 730 passed, 12 skipped. README documents both caps and their relationship to the per-file analysis cap. Signed-off-by: Rohan Isawe <rohan.isawe@smartsheet.com>
56cca8f to
cf31832
Compare
|
Force-pushed v2 changes
Two new tests
Other security-review findings — disposition
Verification
Ready for re-review. |
rng1995
left a comment
There was a problem hiding this comment.
Verdict: Approve — excellent, security-aware DoS hardening that fails closed.
What's good
- Streamed chunked download with an up-front
Content-Lengthcheck and a streamed byte counter (src/skillspector/input_handler.py, ~L145-170), so an oversized/mislabeled body aborts before buffering in memory; partial file is cleaned up on breach (~L172-182). - Zip: member-count cap + summed-uncompressed-size cap before
extractall(~L212-224); post-clone directory-size check with cleanup for git (~L96-111);_directory_size_bytesis symlink-safe (~L234-249). IngestLimitExceededError(ValueError)keeps back-compat with existingValueErrorcallers. Clear, documented constants and README note.
Non-blocking (security-relevant)
- The zip check sums
ZipInfo.file_size(the declared central-directory uncompressed size, ~L218). A crafted bomb that under-declares uncompressed size could still inflate onextractall. For full protection, consider per-member streamed extraction with a running on-disk byte counter that aborts mid-extract. (The forged-central-directory test exercises the declared-size path, not the under-declared case.)
Coordination
Overlaps #159 (this PR lacks the SSRF host allowlist #159 adds) and #163. Reconcile on merge so the final _download_file/_extract_zip has both SSRF allowlisting and ingest size bounds.
Tests
Thorough: MockTransport download caps (header vs streamed), partial-file cleanup, forged central-directory zip bomb, member-count bomb, oversize-clone cleanup. LGTM.
|
Please resolve the merge conflicts |
Summary
Closes #21. Closes #131.
The per-file analysis cap (
MAX_FILE_BYTES, 1 MB) sits downstream ofInputHandler.resolve(), which pulls scan targets from URLs, zips, or git clones with no size budget of its own. The per-file cap can therefore be defeated upstream:client.get(url).contentbuffers the whole body),extractallwith no size pre-check),@wernerkasselman-au flagged this in the review of PR #19 and explicitly deferred it ("the whole ingest layer in
input_handler.pyruns beforebuild_contextever sees a file, and it is unbounded"). PR #19 stayed scoped to the per-file work; this PR addresses the deferred ingest-layer hardening.Changes
New constants
INGEST_MAX_BYTESINGEST_MAX_ZIP_MEMBERSSized above the existing 1 MB per-file analysis cap so legitimate multi-file skills aren't blocked at ingest, but tight enough to bound a memory / disk DoS.
_download_file— streamed with hard capclient.get(url).content(full-body buffering) withclient.stream("GET", url)+ a chunked iterator.INGEST_MAX_BYTES, abort before reading any body bytes.iter_bytes()aborts the read as soon as it crosses the cap._extract_zip— zip-bomb defendedZipInfo.file_sizeacross all members and checks the member count before callingextractall. Classic zip bombs (small archive, huge declared uncompressed size) are rejected without materialising any of the bomb on disk...in their names, this one catches the size/count attack class._clone_git— post-clone size checkshutil.rmtree) if total size exceedsINGEST_MAX_BYTES.p.is_symlink()check) to avoid runaway counts on malicious/dev/zero-style links.Error semantics
All limit breaches raise
IngestLimitExceededError— a subclass ofValueErrorso existing callers that catchValueErrorfromInputHandler.resolve()continue to work. Error messages include both the limit name and the observed size for clarity.Test plan
tests/unit/test_input_handler_bounds.py:httpx.MockTransportwith a forged header).file_sizefield).ruff checkclean.ruff format --checkclean.README
Added a short "Size limits" subsection under Basic Usage documenting both caps and their relationship to the per-file analysis cap (per the maintainer's deferred-work request: "Document in the README that 50 MiB is a per-file analysis limit, not an ingest limit").
Notes / follow-up
--ingest-max-bytes/--ingest-max-zip-membersflags or env vars (SKILLSPECTOR_INGEST_MAX_BYTES), happy to add — wanted to keep this PR scoped.