PDF layout intelligence for .NET — structured extraction with bounding boxes, reading order, and semantic element detection.
The .NET ecosystem has zero RAG-optimized PDF extraction libraries. Python has OpenDataLoader, Docling, pymupdf4llm, Marker — C# has nothing.
PdfStruct fills that gap: a pure .NET library that extracts structured content from PDFs — headings, paragraphs, tables, lists — with bounding boxes for every element. Output as Markdown (for LLM context) or JSON (for citations). No GPU, no cloud, no JVM.
PdfStruct is an early alpha. The current implementation focuses on text layout, reading order, heading detection, Markdown rendering, JSON rendering, and safety filtering. The public API and output schema may still change before the first stable release.
The library currently targets:
- `net8.0`
The development SDK is pinned by global.json. net8.0 is the baseline for new consumers; older target frameworks will be considered only if there is a concrete compatibility need.
Once the package is published:

dotnet add package PdfStruct --prerelease

using PdfStruct;

var parser = new PdfStructParser(new PdfStructOptions
{
    Format = OutputFormat.Both,
    SanitizeText = true
});

var result = parser.Parse("document.pdf");

// Markdown: feed directly into your RAG chunking pipeline
Console.WriteLine(result.Markdown);

// JSON: OpenDataLoader-compatible, bounding boxes included
File.WriteAllText("output.json", result.Json);

| Feature | Status |
|---|---|
| XY-Cut++ reading order (multi-column) | ✅ |
| Probabilistic heading detection (font + standalone signals) | ✅ |
| Heading-level assignment by typographic-style clustering | ✅ |
| Paragraph grouping with line-continuation merge | ✅ |
| Letter-spaced display title recovery (MY DEAREST → one heading) | ✅ |
| Rotated-text word grouping (e.g. arXiv left-margin watermarks) | ✅ |
| Ordered list detection with nested paragraph children | ✅ |
| Running header / footer / side-furniture filtering | ✅ |
| Bounding box per element | ✅ |
| Markdown output | ✅ |
| JSON output (OpenDataLoader-compatible, ISO 8601 dates) | ✅ |
| Prompt injection filtering | ✅ |
| Invalid character replacement | ✅ |
| Sensitive text sanitization (optional) | ✅ |
| Pluggable regex-based heading patterns (per-corpus customization) | ✅ |
| Table extraction (bordered) | 🔜 Phase 2 |
| Unordered list detection | 🔜 Phase 2 |
| Image extraction | 🔜 Phase 2 |
| Tagged PDF structure tree | 🔜 Phase 3 |
| Inline emphasis (bold / italic runs preserved in paragraphs) | 🔜 Phase 3 |
PdfStruct's heading detection is typography-driven, following OpenDataLoader-pdf's probabilistic model: blocks are scored on font-size rarity, font-weight rarity, standalone-row layout, and short-single-line shape, and the document's distinct heading styles are clustered into a 1..N hierarchy.
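The scoring idea described above can be sketched in a few lines. This is a minimal illustration, not PdfStruct's implementation: the `TextBlock` record, the `HeadingScoreSketch` class, the 80-character cutoff, and the weights are all assumptions made up for the example, and the sketch assumes the block list is non-empty and contains the block being scored.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical block shape for this sketch; PdfStruct's own model types may differ.
public sealed record TextBlock(
    string Text, double FontSize, bool IsBold, bool IsStandaloneRow, int LineCount);

public static class HeadingScoreSketch
{
    // Scores one block against all blocks of the document (which must include it).
    public static double Score(TextBlock block, IReadOnlyCollection<TextBlock> all)
    {
        // Treat the most frequent font size as the body size.
        double bodySize = all
            .GroupBy(b => Math.Round(b.FontSize, 1))
            .OrderByDescending(g => g.Count())
            .First().Key;

        // Rarity = 1 - share of blocks that have the same property value.
        // Font-size rarity only counts for text larger than the body size.
        double sameSize = all.Count(b => Math.Abs(b.FontSize - block.FontSize) < 0.05);
        double sizeRarity = block.FontSize > bodySize ? 1.0 - sameSize / all.Count : 0.0;

        double boldCount = all.Count(b => b.IsBold);
        double weightRarity = block.IsBold ? 1.0 - boldCount / all.Count : 0.0;

        double standalone = block.IsStandaloneRow ? 1.0 : 0.0;
        double shortSingleLine = block.LineCount == 1 && block.Text.Length <= 80 ? 1.0 : 0.0;

        // Illustrative weights (assumed, not taken from PdfStruct or OpenDataLoader).
        return 0.4 * sizeRarity + 0.3 * weightRarity + 0.15 * standalone + 0.15 * shortSingleLine;
    }
}
```

A block would then be treated as a heading when its score crosses a threshold, and the distinct (font size, weight) styles among the accepted blocks could be ranked to assign levels 1..N.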
Where this works well:
- Academic papers (`src/PdfStruct.Tests/Fixtures/plos_*.pdf`): bold sub-headings and a larger title font give clean separations across H1/H2/H3.
- Display-typeset documents (`src/PdfStruct.Tests/Fixtures/letter.pdf`, `src/PdfStruct.Tests/Fixtures/lorem_ipsum.pdf`): large title, small body, no ambiguity.
- Documents with explicit typographic hierarchy: the U.S. Constitution's Article numbers at 24pt vs section labels at 20pt get distinct levels automatically.
Where it doesn't (yet):
- Documents whose section markers carry no typographic distinction: the Korean constitution's `제1장`, `제1절`, `제1관` are typeset in the same font and size as body paragraphs. Without language-specific patterns, only the document title is detected as a heading. Inject patterns via `RegexHeadingClassifier` when the corpus needs it (see "Custom heading patterns" below).
- Magazine pull-quotes: large display type that visually quotes body text scores high on font rarity and is sometimes misclassified as a heading. Layout-level disambiguation (pull-quote shape, position offset) is not yet implemented.
- Tables of contents with prominent page numbers: the page-number column, set at heading-sized type, is misclassified as headings.
- Inline bold or italic runs inside paragraphs are not preserved: paragraphs are flattened to plain text on the way to Markdown and JSON, matching OpenDataLoader's behavior. Per-line bold and italic flags are tracked internally (sourced from PdfPig's `FontDetails`) but do not yet propagate into paragraph-internal styling.
- Tables, unordered lists, and inline images are not yet detected (Phase 2 roadmap). Ordered numeric lists (`1. … 2. … 3. …`) are detected and emitted as `list` elements with paragraph children.
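For the Korean section markers mentioned above, the kind of pattern a corpus-specific classifier would need can be sketched with plain `System.Text.RegularExpressions`. This does not use or reproduce the `RegexHeadingClassifier` API, whose signature is not shown here; the `KoreanSectionPatternSketch` class, its `MatchLevel` method, and the level mapping (장 → 1, 절 → 2, 관 → 3) are hypothetical and only illustrate the matching step.

```csharp
using System.Text.RegularExpressions;

public static class KoreanSectionPatternSketch
{
    // Hypothetical (pattern, level) rules: 장 = chapter, 절 = section, 관 = sub-section.
    // Each pattern requires the marker at the start of the line, followed by
    // whitespace or end of line, so body text that merely mentions a marker is ignored.
    private static readonly (Regex Pattern, int Level)[] Rules =
    {
        (new Regex(@"^제\d+장(\s|$)", RegexOptions.Compiled), 1),
        (new Regex(@"^제\d+절(\s|$)", RegexOptions.Compiled), 2),
        (new Regex(@"^제\d+관(\s|$)", RegexOptions.Compiled), 3),
    };

    // Returns the heading level for a line, or null when no rule matches.
    public static int? MatchLevel(string line)
    {
        string trimmed = line.Trim();
        foreach (var (pattern, level) in Rules)
        {
            if (pattern.IsMatch(trimmed))
            {
                return level;
            }
        }
        return null;
    }
}
```

Article lines such as `제1조 …` deliberately match no rule, so they stay in the paragraph stream.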
# Introduction
This paper presents a novel approach to...
## Related Work
Previous studies have shown that...

{
  "file name": "paper.pdf",
  "number of pages": 12,
  "kids": [
    {
      "type": "heading",
      "id": 1,
      "page number": 1,
      "bounding box": [72.0, 700.0, 540.0, 730.0],
      "heading level": 1,
      "content": "Introduction"
    }
  ]
}

PdfStruct
├── Models/ # Content element types (heading, paragraph, table, list, ...)
├── Analysis/ # LetterGrouper, XY-Cut++ layout analyzer, font + regex
│ # classifiers, list detector, document statistics
├── Rendering/ # Markdown & JSON renderers
├── Safety/ # Prompt injection filtering, text sanitization
├── PdfStructParser # Main entry point
└── PdfStructOptions # Configuration
Built on PdfPig (Apache-2.0) for low-level PDF access.
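On the consuming side, the JSON sample shown earlier can be read with `System.Text.Json` alone. The sketch below is an assumption-laden illustration: it relies only on the property names visible in that sample (`kids`, `type`, `page number`, `bounding box`, `content`), the `JsonCitationSketch` class and `Citation` record are hypothetical, and the four bounding-box numbers are kept in the order they appear because the sample does not state the coordinate convention.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;

public static class JsonCitationSketch
{
    // Hypothetical holder for what a citation needs: page, box, and text.
    public sealed record Citation(int Page, double[] Box, string Content);

    // Collects every top-level element of type "heading" from the JSON output.
    public static List<Citation> ReadHeadings(string json)
    {
        using JsonDocument doc = JsonDocument.Parse(json);
        var result = new List<Citation>();

        foreach (JsonElement kid in doc.RootElement.GetProperty("kids").EnumerateArray())
        {
            if (kid.GetProperty("type").GetString() != "heading")
            {
                continue;
            }

            // Keep the four numbers in source order; no coordinate convention is assumed.
            double[] box = kid.GetProperty("bounding box")
                .EnumerateArray()
                .Select(v => v.GetDouble())
                .ToArray();

            result.Add(new Citation(
                kid.GetProperty("page number").GetInt32(),
                box,
                kid.GetProperty("content").GetString() ?? string.Empty));
        }

        return result;
    }
}
```

Calling `JsonCitationSketch.ReadHeadings(result.Json)` on a parsed document would give a list that a RAG pipeline could attach to retrieved chunks as page-and-region citations.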
The repo includes a small local console app for trying PdfStruct against real PDFs. Drop your input documents into playground/ (gitignored) and run:
dotnet run --project src/PdfStruct.Cli -- extract playground/document.pdf
dotnet run --project src/PdfStruct.Cli -- extract playground/document.pdf -o out.md
dotnet run --project src/PdfStruct.Cli -- extract playground/document.pdf -o out.json --format json
dotnet run --project src/PdfStruct.Cli -- extract playground/document.pdf --sanitize -o out.md
dotnet run --project src/PdfStruct.Cli -- extract playground/document.pdf --debug-image out/debug
dotnet run --project src/PdfStruct.Cli -- extract playground/document.pdf --debug-image out/debug --debug-lines
dotnet run --project src/PdfStruct.Cli -- extract playground/document.pdf --include-running-headers -o out.md

By default, detected running headers, footers, page numbers, and narrow side furniture are excluded from the main content stream.

- `--include-running-headers` keeps that detected page furniture.
- `--sanitize` masks common sensitive values (emails, phone numbers, etc.) in the extracted text.
- `--debug-image` writes one PNG per page, rasterized through PDFium (via Docnet.Core), with extracted element bounding boxes overlaid. The page renders exactly as a viewer would display it (fonts, embedded images, vector graphics), so bounding-box positions can be verified visually against the actual layout.
- `--debug-lines`, used together with `--debug-image`, adds the pre-paragraph text-line boxes used before block merging.
A diagnose subcommand emits a per-block CSV with the heading-probability breakdown (base, font-size rarity, font-weight rarity, bulleted boost, total) — useful for calibrating the threshold against new fixtures:
dotnet run --project src/PdfStruct.Cli -- diagnose playground/document.pdf -o scores.csv

- .NET SDK `8.0.416` or a compatible feature-band roll-forward, as defined in `global.json`
- Windows, Linux, or macOS
dotnet restore PdfStruct.sln
dotnet build PdfStruct.sln -c Release --no-restore
dotnet test PdfStruct.sln -c Release --no-build
dotnet format PdfStruct.sln --verify-no-changes --no-restore

- C# language version is pinned to `12.0` in `Directory.Build.props`.
- Nullable reference types and implicit usings are enabled repo-wide.
- Package versions are centrally managed in `Directory.Packages.props`.
- NuGet restore is scoped to `nuget.org` through `NuGet.config`.
- Text files use UTF-8 and LF line endings, enforced by `.editorconfig` and `.gitattributes`.
- Phase 1 (current): Reading order, heading/paragraph classification, running header/footer filtering, Markdown/JSON output
- Phase 2: Bordered table detection, unordered list detection, image extraction, layout-strategy auto-selection (single-column content-stream order vs. XY-Cut), pull-quote disambiguation
- Phase 3: Tagged PDF support, borderless table detection, inline emphasis runs