Skip to content

fix(input-handler): bound URL, zip, and git ingest paths#164

Open
rcha0s wants to merge 1 commit into
NVIDIA:mainfrom
rcha0s:feat/bound-ingest-layer
Open

fix(input-handler): bound URL, zip, and git ingest paths#164
rcha0s wants to merge 1 commit into
NVIDIA:mainfrom
rcha0s:feat/bound-ingest-layer

Conversation

@rcha0s

@rcha0s rcha0s commented Jun 23, 2026

Copy link
Copy Markdown

Summary

Closes #21. Closes #131.

The per-file analysis cap (MAX_FILE_BYTES, 1 MB) sits downstream of InputHandler.resolve(), which pulls scan targets from URLs, zips, or git clones with no size budget of its own. The per-file cap can therefore be defeated upstream:

  • a multi-GB URL is a memory DoS (client.get(url).content buffers the whole body),
  • a zip bomb fills the disk before extraction is gated (extractall with no size pre-check),
  • a large clone lands on disk regardless of the per-file analysis budget (no post-clone size check).

@wernerkasselman-au flagged this in the review of PR #19 and explicitly deferred it ("the whole ingest layer in input_handler.py runs before build_context ever sees a file, and it is unbounded"). PR #19 stayed scoped to the per-file work; this PR addresses the deferred ingest-layer hardening.

Changes

New constants

Constant Value Purpose
INGEST_MAX_BYTES 100 MiB Cap on streamed URL downloads, total uncompressed size of zip archives, and post-clone disk usage
INGEST_MAX_ZIP_MEMBERS 10,000 Cap on the number of entries in a single zip — defends the "many tiny files" zip-bomb variant where each entry is small but the count itself exhausts the FS

Sized above the existing 1 MB per-file analysis cap so legitimate multi-file skills aren't blocked at ingest, but tight enough to bound a memory / disk DoS.

_download_file — streamed with hard cap

  • Replaces client.get(url).content (full-body buffering) with client.stream("GET", url) + a chunked iterator.
  • Up-front Content-Length check — if the server provides a Content-Length larger than INGEST_MAX_BYTES, abort before reading any body bytes.
  • Streamed byte counter is authoritative — if the header is missing, malformed, or wrong, the running count of bytes received via iter_bytes() aborts the read as soon as it crosses the cap.

_extract_zip — zip-bomb defended

  • Sums ZipInfo.file_size across all members and checks the member count before calling extractall. Classic zip bombs (small archive, huge declared uncompressed size) are rejected without materialising any of the bomb on disk.
  • Note: this is complementary to PR fix(input-handler): validate URLs against SSRF and add zip-slip protection #159 (zip-slip path traversal); that PR catches members with .. in their names, this one catches the size/count attack class.

_clone_git — post-clone size check

  • Walks the cloned tree after the existing 60s timeout completes and rejects + cleans up (shutil.rmtree) if total size exceeds INGEST_MAX_BYTES.
  • Symlinks are skipped (p.is_symlink() check) to avoid runaway counts on malicious /dev/zero-style links.

Error semantics

All limit breaches raise IngestLimitExceededError — a subclass of ValueError so existing callers that catch ValueError from InputHandler.resolve() continue to work. Error messages include both the limit name and the observed size for clarity.

Test plan

  • 8 new unit tests in tests/unit/test_input_handler_bounds.py:
    • Under-cap downloads / zips / clones succeed unchanged.
    • Oversized Content-Length header rejected before body read (uses httpx.MockTransport with a forged header).
    • Streamed body overflow rejected when header is missing (generator-backed stream so httpx can't pre-compute Content-Length).
    • Declared-uncompressed-oversize zip rejected without extraction (forged central-directory file_size field).
    • Too many members rejected (10,001 entries).
    • Oversize clone rejected and the partial clone dir cleaned up.
  • Existing 6 input-handler tests continue to pass.
  • Full suite: 728 passed, 12 skipped.
  • ruff check clean.
  • ruff format --check clean.
  • DCO sign-off on the commit.

README

Added a short "Size limits" subsection under Basic Usage documenting both caps and their relationship to the per-file analysis cap (per the maintainer's deferred-work request: "Document in the README that 50 MiB is a per-file analysis limit, not an ingest limit").

Notes / follow-up

  • The constants are module-level rather than CLI-configurable. If you'd prefer --ingest-max-bytes / --ingest-max-zip-members flags or env vars (SKILLSPECTOR_INGEST_MAX_BYTES), happy to add — wanted to keep this PR scoped.
  • PR fix(input-handler): validate URLs against SSRF and add zip-slip protection #159 covers SSRF + zip-slip path traversal on the same code paths; that work is orthogonal to size bounds and the two PRs are independently mergeable.

Closes NVIDIA#21. Closes NVIDIA#131.

The per-file analysis cap (MAX_FILE_BYTES, 1 MB) sits downstream of
InputHandler.resolve(), which pulls scan targets from URLs, zips, or
git clones with no size budget of its own.  As a result, the per-file
cap can be defeated upstream: a multi-GB URL is a memory DoS, a zip
bomb fills the disk before extraction is gated, and a large clone
lands on disk regardless of the per-file analysis budget.  PR NVIDIA#19
explicitly deferred this; this PR addresses the deferred work.

Adds two ingest budgets enforced at every remote/archive ingest path:

- INGEST_MAX_BYTES (100 MiB): caps streamed URL downloads, total
  uncompressed size of zip archives, and post-clone disk usage of
  Git repos.
- INGEST_MAX_ZIP_MEMBERS (10,000): caps the number of entries in a
  single zip (defends against the "many tiny files" zip-bomb variant).

Per ingest path:

- _download_file streams the body in 64 KiB chunks directly to a temp
  file inside the existing session temp dir, with a running byte
  counter.  The cap check fires before each chunk is written, so the
  response body is never accumulated in memory.  Content-Length is
  checked up-front when the server provides it (aborts before reading
  any body bytes); the streamed counter is authoritative if the
  header is missing or malformed.  A breach mid-stream removes the
  partial file before the exception propagates, so an attacker who
  ships exactly INGEST_MAX_BYTES + 1 bytes cannot fill the temp dir.
- _extract_zip sums ZipInfo.file_size across all members and checks
  the member count *before* calling extractall, so classic zip bombs
  (small archive, huge declared uncompressed size) are rejected
  without materialising any of the bomb on disk.
- _clone_git measures the cloned tree's on-disk size after the
  existing 60s timeout completes and rejects + cleans up if it
  exceeds the cap.  Symlinks are skipped to avoid runaway counts on
  malicious /dev/zero-style links.

All limit breaches raise IngestLimitExceededError (subclass of
ValueError so existing callers that catch ValueError keep working)
with a clear message including the limit and the observed size.

Adds 10 unit tests covering: under-cap success paths for URL/zip/clone;
oversized Content-Length rejected before body read; chunked overflow
caught by the streamed counter; *the partial download file is removed
when a breach fires mid-stream*; *streaming to disk produces a file of
the expected size with no intermediate in-memory concatenation*;
declared-uncompressed-oversize zip rejected without extraction;
member-count zip-bomb rejected; oversize clone rejected and cleaned
up.  Full suite: 730 passed, 12 skipped.

README documents both caps and their relationship to the per-file
analysis cap.

Signed-off-by: Rohan Isawe <rohan.isawe@smartsheet.com>
@rcha0s rcha0s force-pushed the feat/bound-ingest-layer branch from 56cca8f to cf31832 Compare June 23, 2026 02:31
@rcha0s

rcha0s commented Jun 23, 2026

Copy link
Copy Markdown
Author

Force-pushed cf31832 after a security self-review surfaced a real issue: the original v1 streamed bytes through a Python list (chunks.append(chunk)) and the cap check fired after the append, so a 100 MiB attack could still load ~100 MiB into memory before being rejected — and the doubled allocation from b"".join(chunks) made the effective peak ~200 MiB. Not the original "unbounded DoS" the PR was meant to fix, but a meaningful regression of the security claim.

v2 changes

  • _download_file streams directly to a temp file, not an in-memory list. The body is never held as a single bytes object. Peak memory is now bounded by _DOWNLOAD_CHUNK_BYTES (64 KiB), not INGEST_MAX_BYTES.
  • Partial file is removed on breach. If the byte counter trips mid-stream, the partial download is unlink'd before the exception propagates. Without this, an attacker who ships exactly INGEST_MAX_BYTES + 1 bytes could fill the temp dir to the cap on every request. Also covered for HTTP errors (transient failures don't leave partial files behind either).
  • Same change covers the zip download path. A downloaded zip is now rename'd into place from the partial file instead of being written from an in-memory buffer, so the same memory-pressure ceiling holds whether the URL points at a markdown file or a zip.

Two new tests

  • test_streamed_overflow_leaves_no_partial_file_on_disk — verifies the partial file does not survive a mid-stream breach.
  • test_download_streams_to_disk_not_memory — verifies a legitimate 5 MiB download lands on disk at the expected size with the sentinel partial-download path cleaned up. Can't directly measure memory in a unit test, but the file-shape assertion is a proxy: with the v1 implementation, the file would be written from a b"".join(chunks) allocation, which the test would still pass but the memory pressure would be visible at scale; with v2 the file is written incrementally.

Other security-review findings — disposition

Finding Severity Disposition
Memory accumulation in download (this fix) MEDIUM Fixed in v2
Git clone fully lands on disk before measurement LOW-MED Documented limitation. Mitigating would need git clone --filter=blob:limit=..., which adds complexity for a small win when the runner's /tmp is already constrained. Open to adding if you'd prefer.
_is_git_url substring match ("github.com" in netloc) is a confused-deputy hazard LOW for this PR (pre-existing), HIGH at repo level Out of scope here. Will flag separately on PR #159 since that one's already in the SSRF / host-validation space.
_make_bomb_zip test helper is fragile to multi-member extensions LOW (test brittleness, not security) Helper already has a spec ref + assert; deferring beyond that to keep scope tight.
Symlinks-to-files-inside-the-clone undercount in _directory_size_bytes LOW The attack requires the symlink target to already be on the filesystem outside the clone (the clone walks a fresh tmpdir), so the practical exploit surface is narrow. Documented in the docstring already.

Verification

  • 10 unit tests in tests/unit/test_input_handler_bounds.py (8 original + 2 new).
  • Full suite: 730 passed, 12 skipped.
  • ruff check + ruff format --check: clean.
  • Commit message updated; DCO sign-off preserved.

Ready for re-review.

@rng1995 rng1995 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: Approve — excellent, security-aware DoS hardening that fails closed.

What's good

  • Streamed chunked download with an up-front Content-Length check and a streamed byte counter (src/skillspector/input_handler.py, ~L145-170), so an oversized/mislabeled body aborts before buffering in memory; partial file is cleaned up on breach (~L172-182).
  • Zip: member-count cap + summed-uncompressed-size cap before extractall (~L212-224); post-clone directory-size check with cleanup for git (~L96-111); _directory_size_bytes is symlink-safe (~L234-249).
  • IngestLimitExceededError(ValueError) keeps back-compat with existing ValueError callers. Clear, documented constants and README note.

Non-blocking (security-relevant)

  • The zip check sums ZipInfo.file_size (the declared central-directory uncompressed size, ~L218). A crafted bomb that under-declares uncompressed size could still inflate on extractall. For full protection, consider per-member streamed extraction with a running on-disk byte counter that aborts mid-extract. (The forged-central-directory test exercises the declared-size path, not the under-declared case.)

Coordination

Overlaps #159 (this PR lacks the SSRF host allowlist #159 adds) and #163. Reconcile on merge so the final _download_file/_extract_zip has both SSRF allowlisting and ingest size bounds.

Tests

Thorough: MockTransport download caps (header vs streamed), partial-file cleanup, forged central-directory zip bomb, member-count bomb, oversize-clone cleanup. LGTM.

@rng1995

rng1995 commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Please resolve the merge conflicts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

3 participants