Skip to content

DO NOT MERGE - experiment + analysis on gpu rasterization alone#2262

Draft
jperez999 wants to merge 1 commit into
NVIDIA:mainfrom
jperez999:gpu-pdf-only
Draft

DO NOT MERGE - experiment + analysis on gpu rasterization alone#2262
jperez999 wants to merge 1 commit into
NVIDIA:mainfrom
jperez999:gpu-pdf-only

Conversation

@jperez999

Copy link
Copy Markdown
Collaborator

Description

GPU PDF extraction: native PDFium drop-in, CUDA rasterizer, and a fused zero-copy extract operator (investigation + toggle-gated integration)

Summary

Investigates whether a C++/CUDA PDF extraction path and a fused, zero-copy GPU operator can beat
the current pypdfium2 extraction in nemo_retriever, with end-to-end benchmarks on the real
retriever ingest pipeline. Adds the building blocks behind an opt-in toggle and an honest
findings report.

The default pipeline is unchanged. The fused operator is gated by NEMO_FUSED_EXTRACT (unset =>
existing staged path), so this PR carries no behavior change unless explicitly enabled.

TL;DR of the results

  • Component wins are real: bit-identical native PDFium drop-in (render SSIM 1.0, text 108/108);
    CUDA rasterizer at 712 pages/s/GPU; DLPack zero-copy giving a 2.75x transport speedup
    with identical model predictions (max abs diff = 0).
  • But they do NOT translate end-to-end. On the default retriever ingest pipeline the fused
    operator is neutral in inprocess mode and regresses batch/Ray mode (0.44x-0.60x) -- because
    the pipeline is inference-bound and Ray already parallelizes stages across GPUs by pipelining;
    fusing them into one serial actor sacrifices that overlap.
  • Net recommendation: keep the staged pipeline for the multi-GPU batch path; do not enable
    fusion there. The one low-risk standalone win is the native PDFium drop-in.

Ingest benchmark (default pipeline: local models + vLLM embed + LanceDB; baseline vs NEMO_FUSED_EXTRACT=1)

dataset mode baseline fused fused speedup rows (base/fused)
jp20 (20 PDFs) inprocess 262 s 225 s 1.16x 3147 / 3147
jp20 batch (Ray) 197 s 329 s 0.60x 3147 / 3148
bo767-50 inprocess 342 s 343 s 1.00x 3330 / 3329
bo767-50 batch (Ray) 208 s 470 s 0.44x 3329 / 3330

Recall-neutral: row counts differ by +/-1 (0.03%, non-deterministic batch-boundary effects).
Batch baseline beats inprocess (Ray pipelines stages across the 8 GPUs).

What's in this PR

Production-relevant (toggle-gated, default off):

  • operators/extract/fused/fused_extract.py -- FusedExtractActor composing the real
    extract/page-element/OCR operators in one actor (ron).
  • graph/ingestor_runtime.py -- NEMO_FUSED_EXTRACT toggle: fused single node vs the staged chain.

Research / evidence (under gpu_pdf_extractor/):

  • native/ -- _gpu_pdfium (PDFium C++ drop-in via(CUDA AA rasterizer
    • DLPack DeviceImage handoff).
  • python/gpu_pdfium/ -- pypdfium2-compatible drop-operator prototype.
  • bench/ -- reproducible benchmarks; results/ -- captured numbers + method_comparison.png.
  • Phase write-ups: PHASE0_RESULTS.md, P1_RESULTS.e_RESULTS.md,
    and the consolidated FINDINGS_REPORT.md.

Why the 2.75x doesn't translate (key lesson)

The 2.75x is a transport micro-benchmark; the red, so transport is a
small fraction of wall time. Zero-copy requires one process (fusion); pipelining requires separate
actors
-- mutually exclusive across a Ray boundaryhe page-image base64
is also a required pipeline output, so the encode can't be removed.

Risk / review notes

  • Default behavior unchanged; fusion is opt-in via e
  • The native modules require a CUDA toolchain + prebuilt PDFium; they are not built by default CI.
  • See FINDINGS_REPORT.md for the full methodology

Recommendation

Merge the native PDFium drop-in path as the durable win; keep the fused operator as a
toggle-gated experiment (single-GPU / transport-sue further DLPack
fusion for the batch pipeline.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@jperez999 jperez999 requested review from a team as code owners June 23, 2026 21:55
@jperez999 jperez999 requested a review from ChrisJar June 23, 2026 21:55
@jperez999 jperez999 marked this pull request as draft June 23, 2026 21:56
@greptile-apps

greptile-apps Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds a gpu_pdf_extractor/ research tree investigating a CUDA-accelerated PDF extraction path for nemo_retriever, including a native PDFium C++ drop-in, a CUDA polygon rasterizer, DLPack zero-copy operator fusion, and a full benchmark harness. The default pipeline is unchanged; the fused path is gated by NEMO_FUSED_EXTRACT and the GPU backend by NEMO_PDF_BACKEND=gpu.

  • gpu_pdf_extractor/python/fused/operators.py — introduces FusedGPUOperator composing child operators in-process, plus RasterizeGPUOperator, PageElementGPUOperator, CropGPUOperator, TableStructureGPUOperator, OCRGPUOperator, and HostFinalizeOperator; two sites use except Exception: pass that silently swallow GPU/model errors, and all Python files lack the required SPDX license headers.
  • gpu_pdf_extractor/python/gpu_pdfium/ — pypdfium2-compatible shim backed by the native _gpu_pdfium extension; activate() performs an irreversible sys.modules["pypdfium2"] replacement with no undo path.
  • gpu_pdf_extractor/native/ — C++/CUDA extension sources (CMakeLists, nanobind bindings, CUDA rasterizer kernels) and benchmark scripts; native code quality is solid with proper RAII and error propagation.

Confidence Score: 3/5

Safe to land as an experiment branch given the explicit DO NOT MERGE label and the fact that all new code lives in an isolated gpu_pdf_extractor/ subtree with no production code paths touched. The issues identified should be addressed before any promotion to a merge-ready state.

Two operator classes (PageElementGPUOperator and OCRGPUOperator) silently discard all exceptions during model inference with no logging, so failures during GPU or model execution will produce rows with missing detections and zero OCR output with no visible error signal. The activate() function in the GPU pdfium shim performs an irreversible process-wide sys.modules replacement with no guard against a partially-initialized native extension. All Python files also lack the required SPDX license headers.

gpu_pdf_extractor/python/fused/operators.py (silent exception swallowing at two model-inference sites) and gpu_pdf_extractor/python/gpu_pdfium/init.py (irreversible sys.modules mutation)

Important Files Changed

Filename Overview
gpu_pdf_extractor/python/fused/operators.py Core fused GPU operator implementations — contains bare except Exception: pass at two sites (PageElementGPUOperator and OCRGPUOperator) that silently discard inference and OCR errors; missing SPDX header and unit tests
gpu_pdf_extractor/python/gpu_pdfium/init.py pypdfium2 drop-in shim — activate() performs an irreversible sys.modules mutation with no safety guard or deactivation path; missing SPDX header
gpu_pdf_extractor/native/src/pdf_bindings.cpp Native PDFium C++ bindings via nanobind — clean RAII, proper error propagation via PdfiumError; correct UTF-16LE→UTF-8 conversion and stride-aware bitmap copy
gpu_pdf_extractor/profile_areas.py PDF area profiler script — unclosed file handle on line 16 (no context manager); missing SPDX header
gpu_pdf_extractor/bench/build_corpus.py Corpus assembly script — discovers, dedupes, classifies, and symlinks PDFs; exception handling acceptable at file I/O boundaries; missing SPDX header
gpu_pdf_extractor/bench/raster_bench.py CUDA rasterizer throughput benchmark — correctness check, single/multi-GPU measurements; missing SPDX header
Prompt To Fix All With AI
Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 5
gpu_pdf_extractor/python/fused/operators.py:159-168
**Silent swallow of model inference errors**

`except Exception: pass` here means a GPU OOM, a CUDA assertion, or a model loading failure in `PageElementGPUOperator.process()` will silently produce `None` in `pe_boxes`/`pe_labels` for the affected rows with zero visibility. Downstream operators (`CropGPUOperator`, `TableStructureGPUOperator`) then operate on rows with `None` detection boxes, which could produce empty crops or wrong results while the pipeline reports no errors. Per the no-bare-except rule, exceptions at this boundary must be logged with full context (`exc_info=True`) before being suppressed.

The same pattern appears at line 339 in `OCRGPUOperator.process()` where `except Exception: pass` silently discards all OCR results and characters without any record of the failure.

### Issue 2 of 5
gpu_pdf_extractor/python/gpu_pdfium/__init__.py:45-48
**Irreversible `sys.modules` mutation with no failure guard**

`activate()` replaces `sys.modules["pypdfium2"]` with the GPU module unconditionally. If `_gpu_pdfium` is unavailable (no prebuilt PDFium, no CUDA toolchain), the import at module load time will already have thrown an `ImportError` — but any caller that catches it and then calls `activate()` on a partially-initialized module will silently inject a broken object as `pypdfium2` for every downstream import in the process. There is also no `deactivate()` path; once called, all subsequent `import pypdfium2` in the process pick up this implementation permanently, making the toggle effectively process-scoped and irreversible.

At minimum the function should assert `_core` is fully initialized before mutating `sys.modules`, and document clearly that calling it is irreversible.

### Issue 3 of 5
gpu_pdf_extractor/python/fused/operators.py:1
**SPDX license header missing**

This file (and all other Python files added in this PR) is missing the required SPDX header block. Files affected include `operators.py`, `__init__.py` under `fused/` and `gpu_pdfium/`, `raw.py`, `profile_areas.py`, all files under `bench/`, etc. Every `.py` file added in this PR requires this header at the top.

### Issue 4 of 5
gpu_pdf_extractor/profile_areas.py:16
`open()` is used without a context manager, so the file handle is not explicitly closed. On large runs this accumulates open file descriptors until GC collects them, which can hit OS file-descriptor limits when profiling many PDFs.

```suggestion
    with open(path, "rb") as fh:
        raw = fh.read()
```

### Issue 5 of 5
gpu_pdf_extractor/python/fused/operators.py:1-13
**No unit tests for new operator classes**

`FusedGPUOperator`, `RasterizeGPUOperator`, `PageElementGPUOperator`, `CropGPUOperator`, `OCRGPUOperator`, etc. are production-intended operator classes with non-trivial logic (DataFrame column management, device tensor lifecycle, crop coordinate math) but no accompanying unit tests. The `test-coverage-new-code` standard requires at least a happy-path and error-path test for new business logic. GPU-dependent tests should be gated with `@pytest.mark.integration` so they are excluded from the default CI run.

Reviews (1): Last reviewed commit: "experiment and analysis on gpu rasteriza..." | Re-trigger Greptile

Comment on lines +159 to +168
bx = lbl = None
try:
b, l, _s = self._model.postprocess(preds)
b0 = b[0] if isinstance(b, (list, tuple)) else b
l0 = l[0] if isinstance(l, (list, tuple)) else l
bx = b0.detach().float().cpu().numpy().reshape(-1, 4) if hasattr(b0, "detach") else None
lbl = l0.detach().cpu().numpy().reshape(-1).astype(int) if hasattr(l0, "detach") else None
except Exception:
pass
boxes_col.append(bx); labels_col.append(lbl)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Silent swallow of model inference errors

except Exception: pass here means a GPU OOM, a CUDA assertion, or a model loading failure in PageElementGPUOperator.process() will silently produce None in pe_boxes/pe_labels for the affected rows with zero visibility. Downstream operators (CropGPUOperator, TableStructureGPUOperator) then operate on rows with None detection boxes, which could produce empty crops or wrong results while the pipeline reports no errors. Per the no-bare-except rule, exceptions at this boundary must be logged with full context (exc_info=True) before being suppressed.

The same pattern appears at line 339 in OCRGPUOperator.process() where except Exception: pass silently discards all OCR results and characters without any record of the failure.

Rule Used: Never use bare 'except:' that silently swallows er... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/python/fused/operators.py
Line: 159-168

Comment:
**Silent swallow of model inference errors**

`except Exception: pass` here means a GPU OOM, a CUDA assertion, or a model loading failure in `PageElementGPUOperator.process()` will silently produce `None` in `pe_boxes`/`pe_labels` for the affected rows with zero visibility. Downstream operators (`CropGPUOperator`, `TableStructureGPUOperator`) then operate on rows with `None` detection boxes, which could produce empty crops or wrong results while the pipeline reports no errors. Per the no-bare-except rule, exceptions at this boundary must be logged with full context (`exc_info=True`) before being suppressed.

The same pattern appears at line 339 in `OCRGPUOperator.process()` where `except Exception: pass` silently discards all OCR results and characters without any record of the failure.

**Rule Used:** Never use bare 'except:' that silently swallows er... ([source](.greptile))

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +45 to +48
def activate() -> None:
"""Inject this module as `pypdfium2` (and `.raw`) so existing imports resolve here."""
sys.modules["pypdfium2"] = sys.modules[__name__]
sys.modules["pypdfium2.raw"] = raw

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Irreversible sys.modules mutation with no failure guard

activate() replaces sys.modules["pypdfium2"] with the GPU module unconditionally. If _gpu_pdfium is unavailable (no prebuilt PDFium, no CUDA toolchain), the import at module load time will already have thrown an ImportError — but any caller that catches it and then calls activate() on a partially-initialized module will silently inject a broken object as pypdfium2 for every downstream import in the process. There is also no deactivate() path; once called, all subsequent import pypdfium2 in the process pick up this implementation permanently, making the toggle effectively process-scoped and irreversible.

At minimum the function should assert _core is fully initialized before mutating sys.modules, and document clearly that calling it is irreversible.

Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/python/gpu_pdfium/__init__.py
Line: 45-48

Comment:
**Irreversible `sys.modules` mutation with no failure guard**

`activate()` replaces `sys.modules["pypdfium2"]` with the GPU module unconditionally. If `_gpu_pdfium` is unavailable (no prebuilt PDFium, no CUDA toolchain), the import at module load time will already have thrown an `ImportError` — but any caller that catches it and then calls `activate()` on a partially-initialized module will silently inject a broken object as `pypdfium2` for every downstream import in the process. There is also no `deactivate()` path; once called, all subsequent `import pypdfium2` in the process pick up this implementation permanently, making the toggle effectively process-scoped and irreversible.

At minimum the function should assert `_core` is fully initialized before mutating `sys.modules`, and document clearly that calling it is irreversible.

How can I resolve this? If you propose a fix, please make it concise.

@@ -0,0 +1,401 @@
"""FusedGPUOperator + device-aware example operators (P3').

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 SPDX license header missing

This file (and all other Python files added in this PR) is missing the required SPDX header block. Files affected include operators.py, __init__.py under fused/ and gpu_pdfium/, raw.py, profile_areas.py, all files under bench/, etc. Every .py file added in this PR requires this header at the top.

Rule Used: Python files added in this PR must include the SPD... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/python/fused/operators.py
Line: 1

Comment:
**SPDX license header missing**

This file (and all other Python files added in this PR) is missing the required SPDX header block. Files affected include `operators.py`, `__init__.py` under `fused/` and `gpu_pdfium/`, `raw.py`, `profile_areas.py`, all files under `bench/`, etc. Every `.py` file added in this PR requires this header at the top.

**Rule Used:** Python files added in this PR must include the SPD... ([source](.greptile))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

return best

def profile(path, dpi=300, max_pages=40):
raw = open(path, "rb").read()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 open() is used without a context manager, so the file handle is not explicitly closed. On large runs this accumulates open file descriptors until GC collects them, which can hit OS file-descriptor limits when profiling many PDFs.

Suggested change
raw = open(path, "rb").read()
with open(path, "rb") as fh:
raw = fh.read()
Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/profile_areas.py
Line: 16

Comment:
`open()` is used without a context manager, so the file handle is not explicitly closed. On large runs this accumulates open file descriptors until GC collects them, which can hit OS file-descriptor limits when profiling many PDFs.

```suggestion
    with open(path, "rb") as fh:
        raw = fh.read()
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +1 to +13
"""FusedGPUOperator + device-aware example operators (P3').

The point: existing operators exchange data as pandas DataFrames across Ray ``map_batches``
boundaries, where any device tensor would be serialized to host (and today images travel as
base64). A *fused* operator runs a list of operators IN ONE PROCESS, threading the DataFrame
through each child's ``preprocess → process → postprocess``. Intermediate device tensors
(``DeviceImage``, DLPack-exportable) then survive between operators with ZERO host copies.

This keeps the **exact existing operator API** — children are ordinary ``AbstractOperator``s.
A child becomes "device-aware" simply by reading/writing the ``page_image_dev`` column
(a column of ``DeviceImage`` handles) instead of base64. Non-fused/legacy operators that only
know base64 still work unchanged; they just don't get the zero-copy benefit.
"""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 No unit tests for new operator classes

FusedGPUOperator, RasterizeGPUOperator, PageElementGPUOperator, CropGPUOperator, OCRGPUOperator, etc. are production-intended operator classes with non-trivial logic (DataFrame column management, device tensor lifecycle, crop coordinate math) but no accompanying unit tests. The test-coverage-new-code standard requires at least a happy-path and error-path test for new business logic. GPU-dependent tests should be gated with @pytest.mark.integration so they are excluded from the default CI run.

Rule Used: New functionality must include corresponding unit ... (source)

Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/python/fused/operators.py
Line: 1-13

Comment:
**No unit tests for new operator classes**

`FusedGPUOperator`, `RasterizeGPUOperator`, `PageElementGPUOperator`, `CropGPUOperator`, `OCRGPUOperator`, etc. are production-intended operator classes with non-trivial logic (DataFrame column management, device tensor lifecycle, crop coordinate math) but no accompanying unit tests. The `test-coverage-new-code` standard requires at least a happy-path and error-path test for new business logic. GPU-dependent tests should be gated with `@pytest.mark.integration` so they are excluded from the default CI run.

**Rule Used:** New functionality must include corresponding unit ... ([source](.greptile))

How can I resolve this? If you propose a fix, please make it concise.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant