A Claude Code engine that synthesizes multiple PDF books into a single, maximally complete study book on any topic — merging, deduplicating, and integrating content from every relevant source so that nothing worth knowing gets left behind.
You have a shelf of PDFs on a subject. Each book covers some aspects thoroughly and skips others. The best explanation of concept A is in book 3, the best examples are in book 7, and a critical caveat only appears in an appendix of book 2. Reading them sequentially takes weeks. Cross-referencing by hand is exhausting.
The result is that most of that knowledge stays unused — buried across dozens of files you never find time to reconcile.
engine-books treats your PDF collection as a distributed knowledge base and uses Claude to compile it into a single reference document. For each subtopic, it reads every source that covers it, extracts the best from each, and writes a unified passage that is more complete than any individual book.
The output is not a summary. It is a synthesis: everything any of your sources knew about the topic, deduplicated, organised, and written as one coherent book.
The engine runs in two phases, by design.
In Phase 1, Claude scans the table of contents of every PDF in your pdfs/ folder, finds
all chapters relevant to your topic, and produces a merged outline that maps
each subsection to the specific pages in each book that cover it.
You review and approve this outline before any book content is generated. You can
add sections, remove subtopics, reorder, or mark gaps with ⚠️ to signal that a
subtopic is not covered by your books (Claude will fill it from general knowledge).
The outline is saved as books/summary_TOPIC.md.
Why a separate phase? The outline is cheap to generate and gives you full control over scope before any heavy reading begins. It also enforces a discipline that prevents the book from drifting — every section in the final book must be accounted for in the approved outline.
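As a rough illustration of what the outline captures, the sketch below models one outline entry as a subsection title plus the page ranges that cover it in each book. The class names, the `render_outline` helper, and the rendered layout are hypothetical; the actual format of books/summary_TOPIC.md is whatever the /summary command writes.

```python
from dataclasses import dataclass, field

GAP_MARKER = "⚠️"  # the gap marker used in the outline


@dataclass
class SourceRange:
    """Pages of one PDF that cover a subsection (hypothetical structure)."""
    pdf: str
    first_page: int
    last_page: int


@dataclass
class Subsection:
    """One outline entry; an empty source list means no book covers it."""
    title: str
    sources: list[SourceRange] = field(default_factory=list)

    @property
    def is_gap(self) -> bool:
        return not self.sources


def render_outline(subsections: list[Subsection]) -> str:
    """Render entries as a nested list, flagging uncovered subsections."""
    lines = []
    for sub in subsections:
        if sub.is_gap:
            lines.append(f"- {GAP_MARKER} {sub.title} (no source; general knowledge)")
            continue
        lines.append(f"- {sub.title}")
        for src in sub.sources:
            lines.append(f"  - {src.pdf}, pp. {src.first_page}-{src.last_page}")
    return "\n".join(lines)
```

Keeping the page ranges next to each subsection is what lets the later phase fetch only the pages it needs instead of re-reading whole books.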
In Phase 2, Claude reads the approved outline and, for each subsection, fetches the exact pages listed for it from every mapped source, in parallel. It then integrates them:
- The richest, most detailed source leads each subsection
- Every secondary source is read completely; anything it adds that the leading source does not cover gets incorporated
- Only exact duplicate sentences are dropped; complementary phrasings that reinforce the same concept are kept
- Conflicts between sources are surfaced explicitly, not silently resolved
- Subsections marked ⚠️ are written from general knowledge, clearly flagged
The result is saved as books/TOPIC.md.
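The integration itself is done by Claude, not by a script, but the lead-source and deduplication rules above can be pictured mechanically. The sketch below is only an analogy under stated assumptions: it treats the longest extract as the "richest" source (a crude proxy), splits text into sentences with a simple regular expression, and drops a sentence only when it is an exact duplicate of one already kept. The function names are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def merge_extracts(extracts: dict[str, str]) -> list[tuple[str, str]]:
    """Merge per-source extracts into (source, sentence) pairs.

    Assumption: the longest extract leads. Only exact duplicate
    sentences are dropped, so complementary phrasings survive.
    """
    order = sorted(extracts, key=lambda name: len(extracts[name]), reverse=True)
    seen: set[str] = set()
    merged: list[tuple[str, str]] = []
    for name in order:
        for sentence in split_sentences(extracts[name]):
            if sentence in seen:
                continue  # exact duplicate of a sentence already kept
            seen.add(sentence)
            merged.append((name, sentence))
    return merged
```

Because every kept sentence stays tagged with its source, conflicting statements from different books remain visible side by side rather than being silently collapsed.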
- Claude Code (CLI or IDE extension)
- Python 3.11+
- uv (recommended) or pip install pypdf
git clone https://github.com/leomajewski/engine-books
cd engine-books
# Place your PDFs here — any subfolder structure is fine
mkdir -p pdfs
engine-books/
├── pdfs/ ← your source PDFs (any subfolder structure)
│
├── books/ ← all generated output (auto-created)
│ ├── summary_TOPIC.md ← merged outline — review and approve (Phase 1)
│ ├── TOPIC.md ← the final synthesised book (Phase 2)
│ ├── cache/ ← extracted page ranges, cached for re-use
│ └── text/ ← optional: full pre-extracted texts
│
├── extract_pdf_pages.py ← fetches specific pages or TOC from one PDF
├── extract_all_pdfs.py ← pre-processes all PDFs to text (optional, faster)
│
└── .claude/commands/
├── summary.md ← /summary skill (Phase 1)
└── book.md ← /book skill (Phase 2)
Open Claude Code in this directory and run:
/summary <your topic>
Claude scans all PDFs, maps relevant chapters, and writes a merged outline to
books/summary_TOPIC.md. Review it, adjust as needed, then tell Claude to proceed.
/book <your topic>
Claude reads the approved outline, fetches the source pages, synthesizes the
content, and writes books/TOPIC.md.
For large collections, run this once before generating your first book:
python extract_all_pdfs.py
This converts every PDF to a plain-text file in books/text/. During /book,
Claude reads these files directly instead of extracting from PDFs in real time —
noticeably faster when working with dozens of books.
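For readers curious what such a pre-extraction pass involves, here is a minimal sketch, not the actual contents of extract_all_pdfs.py. It assumes the text files mirror the subfolder layout of pdfs/ (the real naming scheme may differ) and uses pypdf's `PdfReader` for the extraction; `plan_extraction`, `extract_text`, and `pre_extract` are hypothetical names.

```python
from pathlib import Path


def plan_extraction(pdf_root: Path, text_root: Path) -> dict[Path, Path]:
    """Map every PDF under pdf_root (recursively) to a .txt target path."""
    plan: dict[Path, Path] = {}
    for pdf in sorted(pdf_root.rglob("*.pdf")):
        relative = pdf.relative_to(pdf_root).with_suffix(".txt")
        plan[pdf] = text_root / relative
    return plan


def extract_text(pdf: Path) -> str:
    """Extract the text of all pages with pypdf (third-party dependency)."""
    from pypdf import PdfReader  # imported lazily so planning works without it

    reader = PdfReader(str(pdf))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def pre_extract(pdf_root: Path, text_root: Path) -> None:
    """Write one text file per PDF, skipping targets that already exist."""
    for pdf, target in plan_extraction(pdf_root, text_root).items():
        if target.exists():
            continue
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(extract_text(pdf), encoding="utf-8")
```

Calling `plan_extraction(Path("pdfs"), Path("books/text"))` on its own is roughly what a dry run would report: which files would be written, without touching any PDF.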
# Fetch specific pages from one PDF (with automatic caching)
uv run --with pypdf python extract_pdf_pages.py "pdfs/book.pdf" 42 80
# Fetch the table of contents of a PDF
uv run --with pypdf python extract_pdf_pages.py "pdfs/book.pdf" toc
# Bypass cache and re-extract
uv run --with pypdf python extract_pdf_pages.py "pdfs/book.pdf" 42 80 --no-cache
# Delete all cached extractions
uv run --with pypdf python extract_pdf_pages.py --clear-cache
# Pre-extract all PDFs in pdfs/ to books/text/
python extract_all_pdfs.py
# Pre-extract from a different directory
python extract_all_pdfs.py --dir /path/to/your/books
# Inspect what would be extracted (dry run)
python extract_all_pdfs.py --list
- Domain-agnostic: works for any subject, whether science, law, medicine, history, engineering, philosophy, or anything else your PDFs cover.
- Subfolders: organise pdfs/ however you like; the engine scans recursively.
- Caching: extracted page ranges are cached in books/cache/. Re-running /book after editing the outline does not re-read page ranges that are already cached.
- ⚠️ gaps: mark a summary entry with ⚠️ to signal a missing topic. Claude writes that section from general knowledge and labels it explicitly.
- Incremental: generate a book on one topic today, another tomorrow. Each /summary and /book pair is independent.
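The caching behaviour can be pictured with a small sketch. This is an illustration under assumptions, not the logic of extract_pdf_pages.py: the cache key (a hash of the PDF path and page range), the file naming, and the `extract` callback are all hypothetical, and `use_cache=False` plays the role of the --no-cache flag.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, pdf: str, first: int, last: int) -> Path:
    """Derive a stable cache file name from the PDF path and page range."""
    key = f"{pdf}:{first}-{last}".encode("utf-8")
    return cache_dir / f"{hashlib.sha256(key).hexdigest()[:16]}.txt"


def cached_extract(
    cache_dir: Path,
    pdf: str,
    first: int,
    last: int,
    extract: Callable[[str, int, int], str],
    use_cache: bool = True,
) -> str:
    """Return cached text when present, otherwise extract and store it."""
    target = cache_path(cache_dir, pdf, first, last)
    if use_cache and target.exists():
        return target.read_text(encoding="utf-8")
    text = extract(pdf, first, last)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return text
```

Keying on the exact page range means that editing the outline only triggers extraction for ranges that changed; everything else is served from disk.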
MIT