DO NOT MERGE - experiment + analysis on gpu rasterization alone by jperez999 · Pull Request #2262 · NVIDIA/NeMo-Retriever

jperez999 · 2026-06-23T21:55:30Z

Description

GPU PDF extraction: native PDFium drop-in, CUDA rasterizer, and a fused zero-copy extract operator (investigation + toggle-gated integration)

Summary

Investigates whether a C++/CUDA PDF extraction path and a fused, zero-copy GPU operator can beat
the current pypdfium2 extraction in nemo_retriever, with end-to-end benchmarks on the real
retriever ingest pipeline. Adds the building blocks behind an opt-in toggle and an honest
findings report.

The default pipeline is unchanged. The fused operator is gated by NEMO_FUSED_EXTRACT (unset =>
existing staged path), so this PR carries no behavior change unless explicitly enabled.
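
A minimal sketch of how such an opt-in toggle could be read, assuming only what the description states (unset means the staged path, `NEMO_FUSED_EXTRACT=1` enables fusion). The helper names, the node names, and the set of accepted truthy strings are hypothetical, not taken from `graph/ingestor_runtime.py`:

```python
import os


def fused_extract_enabled(env=None):
    """Return True only when NEMO_FUSED_EXTRACT is explicitly set to a truthy value."""
    env = os.environ if env is None else env
    value = env.get("NEMO_FUSED_EXTRACT", "")
    # Assumption: which strings count as "on" is illustrative; the PR only shows "=1".
    return value.strip().lower() in {"1", "true", "yes", "on"}


def build_extract_nodes(env=None):
    """Pick the fused single node or the staged chain (node names are hypothetical)."""
    if fused_extract_enabled(env):
        return ["fused_extract"]
    return ["extract", "page_elements", "ocr"]


print(build_extract_nodes({}))
print(build_extract_nodes({"NEMO_FUSED_EXTRACT": "1"}))
```

Defaulting to the staged chain for unset or unrecognized values keeps the "no behavior change unless explicitly enabled" property.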

TL;DR of the results

Component wins are real: bit-identical native PDFium drop-in (render SSIM 1.0, text 108/108);
CUDA rasterizer at 712 pages/s/GPU; DLPack zero-copy giving a 2.75x transport speedup
with identical model predictions (max abs diff = 0).
But they do NOT translate end-to-end. On the default retriever ingest pipeline the fused
operator is neutral in inprocess mode and regresses batch/Ray mode (0.44x-0.60x) -- because
the pipeline is inference-bound and Ray already parallelizes stages across GPUs by pipelining;
fusing them into one serial actor sacrifices that overlap.
Net recommendation: keep the staged pipeline for the multi-GPU batch path; do not enable
fusion there. The one low-risk standalone win is the native PDFium drop-in.

Ingest benchmark (default pipeline: local models + vLLM embed + LanceDB; baseline vs `NEMO_FUSED_EXTRACT=1`)

| dataset | mode | baseline | fused | fused speedup | rows (base/fused) |
|---|---|---|---|---|---|
| jp20 (20 PDFs) | inprocess | 262 s | 225 s | 1.16x | 3147 / 3147 |
| jp20 | batch (Ray) | 197 s | 329 s | 0.60x | 3147 / 3148 |
| bo767-50 | inprocess | 342 s | 343 s | 1.00x | 3330 / 3329 |
| bo767-50 | batch (Ray) | 208 s | 470 s | 0.44x | 3329 / 3330 |

Recall-neutral: row counts differ by +/-1 (0.03%, non-deterministic batch-boundary effects).
Batch baseline beats inprocess (Ray pipelines stages across the 8 GPUs).
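
For readers checking the arithmetic, the speedup column is baseline time divided by fused time, so values below 1.0 are regressions. The sketch below recomputes it from the wall-clock numbers in the table; the variable and function names are illustrative only:

```python
# (dataset, mode, baseline seconds, fused seconds), copied from the benchmark table
RUNS = [
    ("jp20", "inprocess", 262, 225),
    ("jp20", "batch (Ray)", 197, 329),
    ("bo767-50", "inprocess", 342, 343),
    ("bo767-50", "batch (Ray)", 208, 470),
]


def speedup(baseline_s, fused_s):
    """Speedup of the fused path relative to baseline; < 1.0 means fused is slower."""
    return baseline_s / fused_s


for dataset, mode, baseline_s, fused_s in RUNS:
    print(f"{dataset:9s} {mode:12s} {speedup(baseline_s, fused_s):.2f}x")

# A one-row difference out of 3147 rows, expressed as a percentage.
row_delta_pct = 100 * 1 / 3147
print(f"row delta: {row_delta_pct:.2f}%")
```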

What's in this PR

Production-relevant (toggle-gated, default off):

operators/extract/fused/fused_extract.py -- FusedExtractActor composing the real
extract/page-element/OCR operators in one actor (ron).
graph/ingestor_runtime.py -- NEMO_FUSED_EXTRACT toggle: fused single node vs the staged chain.

Research / evidence (under gpu_pdf_extractor/):

native/ -- _gpu_pdfium (PDFium C++ drop-in via(CUDA AA rasterizer
- DLPack DeviceImage handoff).
python/gpu_pdfium/ -- pypdfium2-compatible drop-operator prototype.
bench/ -- reproducible benchmarks; results/ -- captured numbers + method_comparison.png.
Phase write-ups: PHASE0_RESULTS.md, P1_RESULTS.e_RESULTS.md,
and the consolidated FINDINGS_REPORT.md.

Why the 2.75x doesn't translate (key lesson)

The 2.75x is a transport micro-benchmark; the pipeline is inference-bound, so transport is a
small fraction of wall time. Zero-copy requires one process (fusion); pipelining requires separate
actors -- the two are mutually exclusive across a Ray boundary. The page-image base64
is also a required pipeline output, so the encode can't be removed.
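
The first point can be illustrated with Amdahl's law: if transport is a fraction p of wall time and only transport gets 2.75x faster, the overall speedup is 1 / ((1 - p) + p / 2.75). The transport fractions below are hypothetical placeholders, not measurements from this PR, and the formula does not model the separate loss of Ray pipelining that explains the batch-mode regression:

```python
def overall_speedup(transport_fraction, transport_speedup=2.75):
    """Amdahl's law: end-to-end speedup when only the transport share is accelerated."""
    if not 0.0 <= transport_fraction <= 1.0:
        raise ValueError("transport_fraction must be between 0 and 1")
    remaining = (1.0 - transport_fraction) + transport_fraction / transport_speedup
    return 1.0 / remaining


# Hypothetical transport shares of wall time (assumed values for illustration).
for p in (0.02, 0.05, 0.10):
    print(f"transport share {p:.0%}: overall {overall_speedup(p):.3f}x")
```

Even a 10% transport share caps the end-to-end gain well below the micro-benchmark figure, which is consistent with the neutral inprocess results.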

Risk / review notes

Default behavior unchanged; fusion is opt-in (the NEMO_FUSED_EXTRACT toggle described in the Summary).
The native modules require a CUDA toolchain + prebuilt PDFium; they are not built by default CI.
See FINDINGS_REPORT.md for the full methodology

Recommendation

Merge the native PDFium drop-in path as the durable win; keep the fused operator as a
toggle-gated experiment (single-GPU / transport-sue further DLPack
fusion for the batch pipeline.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

greptile-apps · 2026-06-23T21:59:40Z

Greptile Summary

This PR adds a gpu_pdf_extractor/ research tree investigating a CUDA-accelerated PDF extraction path for nemo_retriever, including a native PDFium C++ drop-in, a CUDA polygon rasterizer, DLPack zero-copy operator fusion, and a full benchmark harness. The default pipeline is unchanged; the fused path is gated by NEMO_FUSED_EXTRACT and the GPU backend by NEMO_PDF_BACKEND=gpu.

gpu_pdf_extractor/python/fused/operators.py — introduces FusedGPUOperator composing child operators in-process, plus RasterizeGPUOperator, PageElementGPUOperator, CropGPUOperator, TableStructureGPUOperator, OCRGPUOperator, and HostFinalizeOperator; two sites use except Exception: pass that silently swallow GPU/model errors, and all Python files lack the required SPDX license headers.
gpu_pdf_extractor/python/gpu_pdfium/ — pypdfium2-compatible shim backed by the native _gpu_pdfium extension; activate() performs an irreversible sys.modules["pypdfium2"] replacement with no undo path.
gpu_pdf_extractor/native/ — C++/CUDA extension sources (CMakeLists, nanobind bindings, CUDA rasterizer kernels) and benchmark scripts; native code quality is solid with proper RAII and error propagation.

Confidence Score: 3/5

Safe to land as an experiment branch given the explicit DO NOT MERGE label and the fact that all new code lives in an isolated gpu_pdf_extractor/ subtree with no production code paths touched. The issues identified should be addressed before any promotion to a merge-ready state.

Two operator classes (PageElementGPUOperator and OCRGPUOperator) silently discard all exceptions during model inference with no logging, so failures during GPU or model execution will produce rows with missing detections and zero OCR output with no visible error signal. The activate() function in the GPU pdfium shim performs an irreversible process-wide sys.modules replacement with no guard against a partially-initialized native extension. All Python files also lack the required SPDX license headers.

gpu_pdf_extractor/python/fused/operators.py (silent exception swallowing at two model-inference sites) and gpu_pdf_extractor/python/gpu_pdfium/__init__.py (irreversible sys.modules mutation)

Important Files Changed

| Filename | Overview |
|---|---|
| gpu_pdf_extractor/python/fused/operators.py | Core fused GPU operator implementations: contains bare `except Exception: pass` at two sites (PageElementGPUOperator and OCRGPUOperator) that silently discard inference and OCR errors; missing SPDX header and unit tests |
| gpu_pdf_extractor/python/gpu_pdfium/__init__.py | pypdfium2 drop-in shim: `activate()` performs an irreversible `sys.modules` mutation with no safety guard or deactivation path; missing SPDX header |
| gpu_pdf_extractor/native/src/pdf_bindings.cpp | Native PDFium C++ bindings via nanobind: clean RAII, proper error propagation via PdfiumError; correct UTF-16LE→UTF-8 conversion and stride-aware bitmap copy |
| gpu_pdf_extractor/profile_areas.py | PDF area profiler script: unclosed file handle on line 16 (no context manager); missing SPDX header |
| gpu_pdf_extractor/bench/build_corpus.py | Corpus assembly script: discovers, dedupes, classifies, and symlinks PDFs; exception handling acceptable at file I/O boundaries; missing SPDX header |
| gpu_pdf_extractor/bench/raster_bench.py | CUDA rasterizer throughput benchmark: correctness check, single/multi-GPU measurements; missing SPDX header |

Prompt To Fix All With AI

Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 5
gpu_pdf_extractor/python/fused/operators.py:159-168
**Silent swallow of model inference errors**

`except Exception: pass` here means a GPU OOM, a CUDA assertion, or a model loading failure in `PageElementGPUOperator.process()` will silently produce `None` in `pe_boxes`/`pe_labels` for the affected rows with zero visibility. Downstream operators (`CropGPUOperator`, `TableStructureGPUOperator`) then operate on rows with `None` detection boxes, which could produce empty crops or wrong results while the pipeline reports no errors. Per the no-bare-except rule, exceptions at this boundary must be logged with full context (`exc_info=True`) before being suppressed.

The same pattern appears at line 339 in `OCRGPUOperator.process()` where `except Exception: pass` silently discards all OCR results and characters without any record of the failure.

### Issue 2 of 5
gpu_pdf_extractor/python/gpu_pdfium/__init__.py:45-48
**Irreversible `sys.modules` mutation with no failure guard**

`activate()` replaces `sys.modules["pypdfium2"]` with the GPU module unconditionally. If `_gpu_pdfium` is unavailable (no prebuilt PDFium, no CUDA toolchain), the import at module load time will already have thrown an `ImportError` — but any caller that catches it and then calls `activate()` on a partially-initialized module will silently inject a broken object as `pypdfium2` for every downstream import in the process. There is also no `deactivate()` path; once called, all subsequent `import pypdfium2` in the process pick up this implementation permanently, making the toggle effectively process-scoped and irreversible.

At minimum the function should assert `_core` is fully initialized before mutating `sys.modules`, and document clearly that calling it is irreversible.

### Issue 3 of 5
gpu_pdf_extractor/python/fused/operators.py:1
**SPDX license header missing**

This file (and all other Python files added in this PR) is missing the required SPDX header block. Files affected include `operators.py`, `__init__.py` under `fused/` and `gpu_pdfium/`, `raw.py`, `profile_areas.py`, all files under `bench/`, etc. Every `.py` file added in this PR requires this header at the top.

### Issue 4 of 5
gpu_pdf_extractor/profile_areas.py:16
`open()` is used without a context manager, so the file handle is not explicitly closed. On large runs this accumulates open file descriptors until GC collects them, which can hit OS file-descriptor limits when profiling many PDFs.

```suggestion
    with open(path, "rb") as fh:
        raw = fh.read()
```

### Issue 5 of 5
gpu_pdf_extractor/python/fused/operators.py:1-13
**No unit tests for new operator classes**

`FusedGPUOperator`, `RasterizeGPUOperator`, `PageElementGPUOperator`, `CropGPUOperator`, `OCRGPUOperator`, etc. are production-intended operator classes with non-trivial logic (DataFrame column management, device tensor lifecycle, crop coordinate math) but no accompanying unit tests. The `test-coverage-new-code` standard requires at least a happy-path and error-path test for new business logic. GPU-dependent tests should be gated with `@pytest.mark.integration` so they are excluded from the default CI run.

Reviews (1): Last reviewed commit: "experiment and analysis on gpu rasteriza..."

greptile-apps · 2026-06-23T21:59:44Z

+            bx = lbl = None
+            try:
+                b, l, _s = self._model.postprocess(preds)
+                b0 = b[0] if isinstance(b, (list, tuple)) else b
+                l0 = l[0] if isinstance(l, (list, tuple)) else l
+                bx = b0.detach().float().cpu().numpy().reshape(-1, 4) if hasattr(b0, "detach") else None
+                lbl = l0.detach().cpu().numpy().reshape(-1).astype(int) if hasattr(l0, "detach") else None
+            except Exception:
+                pass
+            boxes_col.append(bx); labels_col.append(lbl)


Silent swallow of model inference errors

except Exception: pass here means a GPU OOM, a CUDA assertion, or a model loading failure in PageElementGPUOperator.process() will silently produce None in pe_boxes/pe_labels for the affected rows with zero visibility. Downstream operators (CropGPUOperator, TableStructureGPUOperator) then operate on rows with None detection boxes, which could produce empty crops or wrong results while the pipeline reports no errors. Per the no-bare-except rule, exceptions at this boundary must be logged with full context (exc_info=True) before being suppressed.

The same pattern appears at line 339 in OCRGPUOperator.process() where except Exception: pass silently discards all OCR results and characters without any record of the failure.

Rule Used: Never use bare 'except:' that silently swallows er... (source)


greptile-apps · 2026-06-23T21:59:45Z

+def activate() -> None:
+    """Inject this module as `pypdfium2` (and `.raw`) so existing imports resolve here."""
+    sys.modules["pypdfium2"] = sys.modules[__name__]
+    sys.modules["pypdfium2.raw"] = raw


Irreversible sys.modules mutation with no failure guard

activate() replaces sys.modules["pypdfium2"] with the GPU module unconditionally. If _gpu_pdfium is unavailable (no prebuilt PDFium, no CUDA toolchain), the import at module load time will already have thrown an ImportError — but any caller that catches it and then calls activate() on a partially-initialized module will silently inject a broken object as pypdfium2 for every downstream import in the process. There is also no deactivate() path; once called, all subsequent import pypdfium2 in the process pick up this implementation permanently, making the toggle effectively process-scoped and irreversible.

At minimum the function should assert _core is fully initialized before mutating sys.modules, and document clearly that calling it is irreversible.
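
A guarded, reversible variant could look roughly like the following. This is a sketch under assumptions: the `core_ready` flag stands in for whatever check confirms the native extension loaded, and the function signatures differ from the PR's zero-argument `activate()`:

```python
import sys
import types

_SAVED = {}  # original sys.modules entries, kept so the swap can be undone


def activate(shim, raw, core_ready):
    """Install shim as pypdfium2 only if the native core reports it is ready."""
    if not core_ready:
        raise RuntimeError("native extension not initialized; refusing to replace pypdfium2")
    for name, module in (("pypdfium2", shim), ("pypdfium2.raw", raw)):
        if name not in _SAVED:
            # None records that the name was not in sys.modules before activation.
            _SAVED[name] = sys.modules.get(name)
        sys.modules[name] = module


def deactivate():
    """Restore the sys.modules entries that activate() replaced."""
    for name, original in _SAVED.items():
        if original is None:
            sys.modules.pop(name, None)
        else:
            sys.modules[name] = original
    _SAVED.clear()


if __name__ == "__main__":
    fake_shim = types.ModuleType("fake_shim")  # placeholder modules for the demo
    fake_raw = types.ModuleType("fake_raw")
    activate(fake_shim, fake_raw, core_ready=True)
    print(sys.modules["pypdfium2"] is fake_shim)
    deactivate()
```

Note that `deactivate()` only affects future imports: code that already ran `import pypdfium2` while the shim was active keeps its reference to the shim.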


greptile-apps · 2026-06-23T21:59:46Z

@@ -0,0 +1,401 @@
+"""FusedGPUOperator + device-aware example operators (P3').


SPDX license header missing

This file (and all other Python files added in this PR) is missing the required SPDX header block. Files affected include operators.py, __init__.py under fused/ and gpu_pdfium/, raw.py, profile_areas.py, all files under bench/, etc. Every .py file added in this PR requires this header at the top.

Rule Used: Python files added in this PR must include the SPD... (source)
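
To find the affected files in bulk, a small checker along these lines may help. It is a sketch that only looks for the standard `SPDX-License-Identifier` tag near the top of each file; the exact header block the repository requires is not shown in this review, and the function name and `head_lines` default are assumptions:

```python
from pathlib import Path

SPDX_TAG = "SPDX-License-Identifier"


def files_missing_spdx(root, head_lines=5):
    """Return .py files under root whose first head_lines lines lack an SPDX tag."""
    missing = []
    for path in sorted(Path(root).rglob("*.py")):
        with open(path, encoding="utf-8", errors="replace") as fh:
            head = [fh.readline() for _ in range(head_lines)]
        if not any(SPDX_TAG in line for line in head):
            missing.append(path)
    return missing


if __name__ == "__main__":
    for path in files_missing_spdx("gpu_pdf_extractor"):
        print(path)
```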


Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-23T21:59:47Z

+    return best
+
+def profile(path, dpi=300, max_pages=40):
+    raw = open(path, "rb").read()


open() is used without a context manager, so the file handle is not explicitly closed. On large runs this accumulates open file descriptors until GC collects them, which can hit OS file-descriptor limits when profiling many PDFs.

Suggested change

-    raw = open(path, "rb").read()
+    with open(path, "rb") as fh:
+        raw = fh.read()


greptile-apps · 2026-06-23T21:59:48Z

+"""FusedGPUOperator + device-aware example operators (P3').
+
+The point: existing operators exchange data as pandas DataFrames across Ray ``map_batches``
+boundaries, where any device tensor would be serialized to host (and today images travel as
+base64). A *fused* operator runs a list of operators IN ONE PROCESS, threading the DataFrame
+through each child's ``preprocess → process → postprocess``. Intermediate device tensors
+(``DeviceImage``, DLPack-exportable) then survive between operators with ZERO host copies.
+
+This keeps the **exact existing operator API** — children are ordinary ``AbstractOperator``s.
+A child becomes "device-aware" simply by reading/writing the ``page_image_dev`` column
+(a column of ``DeviceImage`` handles) instead of base64. Non-fused/legacy operators that only
+know base64 still work unchanged; they just don't get the zero-copy benefit.
+"""


No unit tests for new operator classes

FusedGPUOperator, RasterizeGPUOperator, PageElementGPUOperator, CropGPUOperator, OCRGPUOperator, etc. are production-intended operator classes with non-trivial logic (DataFrame column management, device tensor lifecycle, crop coordinate math) but no accompanying unit tests. The test-coverage-new-code standard requires at least a happy-path and error-path test for new business logic. GPU-dependent tests should be gated with @pytest.mark.integration so they are excluded from the default CI run.

Rule Used: New functionality must include corresponding unit ... (source)
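
The fused threading described in the quoted docstring can be exercised without a GPU, which is one way to get the happy-path and error-path coverage requested here. The sketch below is a CPU-only stand-in: a plain dict replaces the pandas DataFrame, a `bytearray` replaces a `DeviceImage`, and every class name is hypothetical rather than the PR's `FusedGPUOperator` API:

```python
class Operator:
    """Minimal stand-in for the preprocess/process/postprocess operator phases."""

    def preprocess(self, batch):
        return batch

    def process(self, batch):
        return batch

    def postprocess(self, batch):
        return batch


class Rasterize(Operator):
    def process(self, batch):
        # bytearray stands in for a device-resident image handle.
        batch["page_image_dev"] = [bytearray(b"pixels") for _ in batch["path"]]
        return batch


class Detect(Operator):
    def process(self, batch):
        # Reads the handles directly; nothing is serialized between children.
        batch["n_bytes"] = [len(img) for img in batch["page_image_dev"]]
        return batch


class Fused(Operator):
    """Runs children in one process, threading the same batch through each."""

    def __init__(self, children):
        self.children = list(children)

    def process(self, batch):
        for child in self.children:
            batch = child.postprocess(child.process(child.preprocess(batch)))
        return batch


batch = {"path": ["a.pdf", "b.pdf"]}
out = Fused([Rasterize(), Detect()]).process(batch)
assert out is batch                  # happy path: same object, no copies made
assert out["n_bytes"] == [6, 6]

try:
    Fused([Detect()]).process({"path": ["a.pdf"]})  # error path: no rasterized column
except KeyError as exc:
    print("missing column:", exc)
```

Tests against the real operators would still need the GPU and, per the comment above, the `@pytest.mark.integration` gate.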


experiment and analysis on gpu rasterization only

208e200

jperez999 requested review from a team as code owners June 23, 2026 21:55

jperez999 requested a review from ChrisJar June 23, 2026 21:55

jperez999 marked this pull request as draft June 23, 2026 21:56

jperez999 wants to merge 1 commit into NVIDIA:main from jperez999:gpu-pdf-only
