DO NOT MERGE - experiment + analysis on gpu rasterization alone#2262
DO NOT MERGE - experiment + analysis on gpu rasterization alone#2262jperez999 wants to merge 1 commit into
Conversation
Greptile SummaryThis PR adds a
|
| Filename | Overview |
|---|---|
| gpu_pdf_extractor/python/fused/operators.py | Core fused GPU operator implementations — contains bare except Exception: pass at two sites (PageElementGPUOperator and OCRGPUOperator) that silently discard inference and OCR errors; missing SPDX header and unit tests |
| gpu_pdf_extractor/python/gpu_pdfium/init.py | pypdfium2 drop-in shim — activate() performs an irreversible sys.modules mutation with no safety guard or deactivation path; missing SPDX header |
| gpu_pdf_extractor/native/src/pdf_bindings.cpp | Native PDFium C++ bindings via nanobind — clean RAII, proper error propagation via PdfiumError; correct UTF-16LE→UTF-8 conversion and stride-aware bitmap copy |
| gpu_pdf_extractor/profile_areas.py | PDF area profiler script — unclosed file handle on line 16 (no context manager); missing SPDX header |
| gpu_pdf_extractor/bench/build_corpus.py | Corpus assembly script — discovers, dedupes, classifies, and symlinks PDFs; exception handling acceptable at file I/O boundaries; missing SPDX header |
| gpu_pdf_extractor/bench/raster_bench.py | CUDA rasterizer throughput benchmark — correctness check, single/multi-GPU measurements; missing SPDX header |
Prompt To Fix All With AI
Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 5
gpu_pdf_extractor/python/fused/operators.py:159-168
**Silent swallow of model inference errors**
`except Exception: pass` here means a GPU OOM, a CUDA assertion, or a model loading failure in `PageElementGPUOperator.process()` will silently produce `None` in `pe_boxes`/`pe_labels` for the affected rows with zero visibility. Downstream operators (`CropGPUOperator`, `TableStructureGPUOperator`) then operate on rows with `None` detection boxes, which could produce empty crops or wrong results while the pipeline reports no errors. Per the no-bare-except rule, exceptions at this boundary must be logged with full context (`exc_info=True`) before being suppressed.
The same pattern appears at line 339 in `OCRGPUOperator.process()` where `except Exception: pass` silently discards all OCR results and characters without any record of the failure.
### Issue 2 of 5
gpu_pdf_extractor/python/gpu_pdfium/__init__.py:45-48
**Irreversible `sys.modules` mutation with no failure guard**
`activate()` replaces `sys.modules["pypdfium2"]` with the GPU module unconditionally. If `_gpu_pdfium` is unavailable (no prebuilt PDFium, no CUDA toolchain), the import at module load time will already have thrown an `ImportError` — but any caller that catches it and then calls `activate()` on a partially-initialized module will silently inject a broken object as `pypdfium2` for every downstream import in the process. There is also no `deactivate()` path; once called, all subsequent `import pypdfium2` in the process pick up this implementation permanently, making the toggle effectively process-scoped and irreversible.
At minimum the function should assert `_core` is fully initialized before mutating `sys.modules`, and document clearly that calling it is irreversible.
### Issue 3 of 5
gpu_pdf_extractor/python/fused/operators.py:1
**SPDX license header missing**
This file (and all other Python files added in this PR) is missing the required SPDX header block. Files affected include `operators.py`, `__init__.py` under `fused/` and `gpu_pdfium/`, `raw.py`, `profile_areas.py`, all files under `bench/`, etc. Every `.py` file added in this PR requires this header at the top.
### Issue 4 of 5
gpu_pdf_extractor/profile_areas.py:16
`open()` is used without a context manager, so the file handle is not explicitly closed. On large runs this accumulates open file descriptors until GC collects them, which can hit OS file-descriptor limits when profiling many PDFs.
```suggestion
with open(path, "rb") as fh:
raw = fh.read()
```
### Issue 5 of 5
gpu_pdf_extractor/python/fused/operators.py:1-13
**No unit tests for new operator classes**
`FusedGPUOperator`, `RasterizeGPUOperator`, `PageElementGPUOperator`, `CropGPUOperator`, `OCRGPUOperator`, etc. are production-intended operator classes with non-trivial logic (DataFrame column management, device tensor lifecycle, crop coordinate math) but no accompanying unit tests. The `test-coverage-new-code` standard requires at least a happy-path and error-path test for new business logic. GPU-dependent tests should be gated with `@pytest.mark.integration` so they are excluded from the default CI run.
Reviews (1): Last reviewed commit: "experiment and analysis on gpu rasteriza..." | Re-trigger Greptile
| bx = lbl = None | ||
| try: | ||
| b, l, _s = self._model.postprocess(preds) | ||
| b0 = b[0] if isinstance(b, (list, tuple)) else b | ||
| l0 = l[0] if isinstance(l, (list, tuple)) else l | ||
| bx = b0.detach().float().cpu().numpy().reshape(-1, 4) if hasattr(b0, "detach") else None | ||
| lbl = l0.detach().cpu().numpy().reshape(-1).astype(int) if hasattr(l0, "detach") else None | ||
| except Exception: | ||
| pass | ||
| boxes_col.append(bx); labels_col.append(lbl) |
There was a problem hiding this comment.
Silent swallow of model inference errors
except Exception: pass here means a GPU OOM, a CUDA assertion, or a model loading failure in PageElementGPUOperator.process() will silently produce None in pe_boxes/pe_labels for the affected rows with zero visibility. Downstream operators (CropGPUOperator, TableStructureGPUOperator) then operate on rows with None detection boxes, which could produce empty crops or wrong results while the pipeline reports no errors. Per the no-bare-except rule, exceptions at this boundary must be logged with full context (exc_info=True) before being suppressed.
The same pattern appears at line 339 in OCRGPUOperator.process() where except Exception: pass silently discards all OCR results and characters without any record of the failure.
Rule Used: Never use bare 'except:' that silently swallows er... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/python/fused/operators.py
Line: 159-168
Comment:
**Silent swallow of model inference errors**
`except Exception: pass` here means a GPU OOM, a CUDA assertion, or a model loading failure in `PageElementGPUOperator.process()` will silently produce `None` in `pe_boxes`/`pe_labels` for the affected rows with zero visibility. Downstream operators (`CropGPUOperator`, `TableStructureGPUOperator`) then operate on rows with `None` detection boxes, which could produce empty crops or wrong results while the pipeline reports no errors. Per the no-bare-except rule, exceptions at this boundary must be logged with full context (`exc_info=True`) before being suppressed.
The same pattern appears at line 339 in `OCRGPUOperator.process()` where `except Exception: pass` silently discards all OCR results and characters without any record of the failure.
**Rule Used:** Never use bare 'except:' that silently swallows er... ([source](.greptile))
How can I resolve this? If you propose a fix, please make it concise.| def activate() -> None: | ||
| """Inject this module as `pypdfium2` (and `.raw`) so existing imports resolve here.""" | ||
| sys.modules["pypdfium2"] = sys.modules[__name__] | ||
| sys.modules["pypdfium2.raw"] = raw |
There was a problem hiding this comment.
Irreversible
sys.modules mutation with no failure guard
activate() replaces sys.modules["pypdfium2"] with the GPU module unconditionally. If _gpu_pdfium is unavailable (no prebuilt PDFium, no CUDA toolchain), the import at module load time will already have thrown an ImportError — but any caller that catches it and then calls activate() on a partially-initialized module will silently inject a broken object as pypdfium2 for every downstream import in the process. There is also no deactivate() path; once called, all subsequent import pypdfium2 in the process pick up this implementation permanently, making the toggle effectively process-scoped and irreversible.
At minimum the function should assert _core is fully initialized before mutating sys.modules, and document clearly that calling it is irreversible.
Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/python/gpu_pdfium/__init__.py
Line: 45-48
Comment:
**Irreversible `sys.modules` mutation with no failure guard**
`activate()` replaces `sys.modules["pypdfium2"]` with the GPU module unconditionally. If `_gpu_pdfium` is unavailable (no prebuilt PDFium, no CUDA toolchain), the import at module load time will already have thrown an `ImportError` — but any caller that catches it and then calls `activate()` on a partially-initialized module will silently inject a broken object as `pypdfium2` for every downstream import in the process. There is also no `deactivate()` path; once called, all subsequent `import pypdfium2` in the process pick up this implementation permanently, making the toggle effectively process-scoped and irreversible.
At minimum the function should assert `_core` is fully initialized before mutating `sys.modules`, and document clearly that calling it is irreversible.
How can I resolve this? If you propose a fix, please make it concise.| @@ -0,0 +1,401 @@ | |||
| """FusedGPUOperator + device-aware example operators (P3'). | |||
There was a problem hiding this comment.
This file (and all other Python files added in this PR) is missing the required SPDX header block. Files affected include operators.py, __init__.py under fused/ and gpu_pdfium/, raw.py, profile_areas.py, all files under bench/, etc. Every .py file added in this PR requires this header at the top.
Rule Used: Python files added in this PR must include the SPD... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/python/fused/operators.py
Line: 1
Comment:
**SPDX license header missing**
This file (and all other Python files added in this PR) is missing the required SPDX header block. Files affected include `operators.py`, `__init__.py` under `fused/` and `gpu_pdfium/`, `raw.py`, `profile_areas.py`, all files under `bench/`, etc. Every `.py` file added in this PR requires this header at the top.
**Rule Used:** Python files added in this PR must include the SPD... ([source](.greptile))
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
| return best | ||
|
|
||
| def profile(path, dpi=300, max_pages=40): | ||
| raw = open(path, "rb").read() |
There was a problem hiding this comment.
open() is used without a context manager, so the file handle is not explicitly closed. On large runs this accumulates open file descriptors until GC collects them, which can hit OS file-descriptor limits when profiling many PDFs.
| raw = open(path, "rb").read() | |
| with open(path, "rb") as fh: | |
| raw = fh.read() |
Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/profile_areas.py
Line: 16
Comment:
`open()` is used without a context manager, so the file handle is not explicitly closed. On large runs this accumulates open file descriptors until GC collects them, which can hit OS file-descriptor limits when profiling many PDFs.
```suggestion
with open(path, "rb") as fh:
raw = fh.read()
```
How can I resolve this? If you propose a fix, please make it concise.| """FusedGPUOperator + device-aware example operators (P3'). | ||
|
|
||
| The point: existing operators exchange data as pandas DataFrames across Ray ``map_batches`` | ||
| boundaries, where any device tensor would be serialized to host (and today images travel as | ||
| base64). A *fused* operator runs a list of operators IN ONE PROCESS, threading the DataFrame | ||
| through each child's ``preprocess → process → postprocess``. Intermediate device tensors | ||
| (``DeviceImage``, DLPack-exportable) then survive between operators with ZERO host copies. | ||
|
|
||
| This keeps the **exact existing operator API** — children are ordinary ``AbstractOperator``s. | ||
| A child becomes "device-aware" simply by reading/writing the ``page_image_dev`` column | ||
| (a column of ``DeviceImage`` handles) instead of base64. Non-fused/legacy operators that only | ||
| know base64 still work unchanged; they just don't get the zero-copy benefit. | ||
| """ |
There was a problem hiding this comment.
No unit tests for new operator classes
FusedGPUOperator, RasterizeGPUOperator, PageElementGPUOperator, CropGPUOperator, OCRGPUOperator, etc. are production-intended operator classes with non-trivial logic (DataFrame column management, device tensor lifecycle, crop coordinate math) but no accompanying unit tests. The test-coverage-new-code standard requires at least a happy-path and error-path test for new business logic. GPU-dependent tests should be gated with @pytest.mark.integration so they are excluded from the default CI run.
Rule Used: New functionality must include corresponding unit ... (source)
Prompt To Fix With AI
This is a comment left during a code review.
Path: gpu_pdf_extractor/python/fused/operators.py
Line: 1-13
Comment:
**No unit tests for new operator classes**
`FusedGPUOperator`, `RasterizeGPUOperator`, `PageElementGPUOperator`, `CropGPUOperator`, `OCRGPUOperator`, etc. are production-intended operator classes with non-trivial logic (DataFrame column management, device tensor lifecycle, crop coordinate math) but no accompanying unit tests. The `test-coverage-new-code` standard requires at least a happy-path and error-path test for new business logic. GPU-dependent tests should be gated with `@pytest.mark.integration` so they are excluded from the default CI run.
**Rule Used:** New functionality must include corresponding unit ... ([source](.greptile))
How can I resolve this? If you propose a fix, please make it concise.Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Description
GPU PDF extraction: native PDFium drop-in, CUDA rasterizer, and a fused zero-copy extract operator (investigation + toggle-gated integration)
Summary
Investigates whether a C++/CUDA PDF extraction path and a fused, zero-copy GPU operator can beat
the current
pypdfium2extraction innemo_retriever, with end-to-end benchmarks on the realretriever ingestpipeline. Adds the building blocks behind an opt-in toggle and an honestfindings report.
The default pipeline is unchanged. The fused operator is gated by
NEMO_FUSED_EXTRACT(unset =>existing staged path), so this PR carries no behavior change unless explicitly enabled.
TL;DR of the results
CUDA rasterizer at 712 pages/s/GPU; DLPack zero-copy giving a 2.75x transport speedup
with identical model predictions (max abs diff = 0).
retriever ingestpipeline the fusedoperator is neutral in inprocess mode and regresses batch/Ray mode (0.44x-0.60x) -- because
the pipeline is inference-bound and Ray already parallelizes stages across GPUs by pipelining;
fusing them into one serial actor sacrifices that overlap.
fusion there. The one low-risk standalone win is the native PDFium drop-in.
Ingest benchmark (default pipeline: local models + vLLM embed + LanceDB; baseline vs
NEMO_FUSED_EXTRACT=1)Recall-neutral: row counts differ by +/-1 (0.03%, non-deterministic batch-boundary effects).
Batch baseline beats inprocess (Ray pipelines stages across the 8 GPUs).
What's in this PR
Production-relevant (toggle-gated, default off):
operators/extract/fused/fused_extract.py--FusedExtractActorcomposing the realextract/page-element/OCR operators in one actor (ron).
graph/ingestor_runtime.py--NEMO_FUSED_EXTRACTtoggle: fused single node vs the staged chain.Research / evidence (under
gpu_pdf_extractor/):native/--_gpu_pdfium(PDFium C++ drop-in via(CUDA AA rasterizerDeviceImagehandoff).python/gpu_pdfium/-- pypdfium2-compatible drop-operator prototype.bench/-- reproducible benchmarks;results/-- captured numbers +method_comparison.png.PHASE0_RESULTS.md,P1_RESULTS.e_RESULTS.md,and the consolidated
FINDINGS_REPORT.md.Why the 2.75x doesn't translate (key lesson)
The 2.75x is a transport micro-benchmark; the red, so transport is a
small fraction of wall time. Zero-copy requires one process (fusion); pipelining requires separate
actors -- mutually exclusive across a Ray boundaryhe page-image base64
is also a required pipeline output, so the encode can't be removed.
Risk / review notes
FINDINGS_REPORT.mdfor the full methodologyRecommendation
Merge the native PDFium drop-in path as the durable win; keep the fused operator as a
toggle-gated experiment (single-GPU / transport-sue further DLPack
fusion for the batch pipeline.
Checklist