resilient-tech/transaction-parser

Transaction Parser

Overview

Transaction Parser is an AI-powered add-on for ERPNext that extracts data from PDFs and creates draft documents (Sales Order / Purchase Invoice). It supports multiple document types and regions, which makes it easier to digitize and process business documents.

Features

  • AI-Powered Extraction: Uses AI models (OpenAI, DeepSeek, Google Gemini, Anthropic) to extract structured data from PDFs
  • Multi-Document Support: Handles Sales Orders and Purchase Invoices (Expenses)
  • Regional Support: Special handling for India-specific requirements (GSTIN, PAN, HSN codes)
  • Email Integration: Automatically processes documents from incoming emails
  • Customizable Schemas: Flexible field mapping and custom schema support
  • Smart Item Matching: Automatically matches items from previous invoices

Configuration

1. Enable Transaction Parser

Navigate to Transaction Parser Settings and configure:

  1. Enable: Check to activate the app

2. Default AI Model: Select from available models:

  • DeepSeek Chat

  • DeepSeek Reasoner

  • OpenAI gpt-4o

  • OpenAI gpt-4o-mini

  • OpenAI gpt-5

  • OpenAI gpt-5-mini

  • Google Gemini Pro-2.5

  • Google Gemini Flash-2.5

  • Claude Haiku-4.5

3. API Keys Setup

Add your API keys for the AI services:

| Service Provider | Models Supported |
| --- | --- |
| OpenAI | gpt-4o, gpt-4o-mini, gpt-5, gpt-5-mini |
| DeepSeek | deepseek-chat, deepseek-reasoner |
| Google | gemini-2.5-pro, gemini-2.5-flash |
| Anthropic | claude-haiku-4-5 |

4. Email Configuration (Optional)

To automatically process documents from emails:

  1. Parse Incoming Emails: Enable email processing

  2. Incoming Email Accounts: Configure which email accounts to monitor

  3. Party Emails: Map email addresses to specific customers/suppliers
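
To illustrate the Party Emails idea, here is a minimal sketch of resolving a sender address to a party. The `PARTY_EMAILS` dictionary and the `resolve_party` function are hypothetical names used for illustration only; the app's actual lookup may work differently.

```python
from email.utils import parseaddr

# Hypothetical mapping that mirrors what a Party Emails table might hold:
# sender address -> (party type, party name).
PARTY_EMAILS = {
    "orders@acme.example": ("Customer", "Acme Corp"),
    "billing@globex.example": ("Supplier", "Globex Ltd"),
}


def resolve_party(from_header):
    """Return (party_type, party_name) for a sender, or None if unmapped."""
    # parseaddr splits 'Display Name <user@host>' into (name, address).
    _, address = parseaddr(from_header)
    return PARTY_EMAILS.get(address.strip().lower())


print(resolve_party("Acme Orders <Orders@Acme.example>"))
```

Lower-casing the address before the lookup keeps the match independent of how the sender's mail client capitalizes it.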

5. Transaction Configuration

  • Invoice Lookback Count: Number of past invoices to consider for item matching (default: 5)
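
As a rough sketch of what a lookback count means for item matching: only the most recent N invoices are searched for a line with the same description. The `match_item` function and the invoice structure below are assumptions made for illustration, not the app's actual matching logic.

```python
def normalize(text):
    """Lower-case and collapse whitespace so minor formatting differences do not matter."""
    return " ".join(text.lower().split())


def match_item(description, past_invoices, lookback_count=5):
    """Return the item code used for this description in recent invoices, if any.

    past_invoices is assumed to be ordered newest first; each invoice is a
    dict with an "items" list of {"description", "item_code"} dicts.
    """
    wanted = normalize(description)
    for invoice in past_invoices[:lookback_count]:
        for line in invoice["items"]:
            if normalize(line["description"]) == wanted:
                return line["item_code"]
    return None


history = [
    {"items": [{"description": "A4 Paper  Ream", "item_code": "STN-001"}]},
    {"items": [{"description": "Toner Cartridge", "item_code": "STN-014"}]},
]
print(match_item("a4 paper ream", history))
```

A larger lookback count finds more historical matches at the cost of scanning more invoices; a smaller one favors recent item codes.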

Usage

Manual Document Processing

  1. Navigate to Sales Order or Purchase Invoice list view
  2. Click on Actions → Parse Sales Order/Expense Invoice
  3. Upload your PDF file
  4. Select:
    • AI Model: Choose the AI model to use
    • Country: Select India or Other
    • Page Limit: (Optional) Limit pages to process
  5. Click Submit

The system will:

  • Extract text from the PDF
  • Send it to the AI model for processing
  • Create a draft document with extracted data
  • Attach the original PDF to the created document
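
The four steps above can be pictured as a small pipeline. The sketch below is illustrative only: `parse_pdf`, `extract_text`, and `call_model` are hypothetical names, the model is assumed to reply with JSON, and the real app creates ERPNext documents rather than plain dictionaries.

```python
import json


def parse_pdf(pdf_bytes, extract_text, call_model):
    """Run the extract -> model -> draft -> attach flow with injected steps."""
    text = extract_text(pdf_bytes)   # step 1: PDF to text
    raw_reply = call_model(text)     # step 2: text to model reply (assumed JSON)
    data = json.loads(raw_reply)
    return {
        "docstatus": 0,              # step 3: 0 is the draft status in Frappe
        "data": data,
        "attachment": pdf_bytes,     # step 4: keep the original PDF with the draft
    }


# Stand-in steps so the sketch runs without a PDF library or an API key.
def fake_extract(pdf_bytes):
    return "Invoice INV-001 Total 1500"


def fake_model(text):
    return json.dumps({"invoice_no": "INV-001", "total": 1500})


draft = parse_pdf(b"%PDF-1.7 ...", fake_extract, fake_model)
print(draft["data"])
```

Passing the extraction and model steps in as functions keeps the flow easy to test with stand-ins such as the two shown here.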

Automatic Email Processing

When enabled, the system automatically:

  1. Monitors configured email accounts
  2. Extracts PDF attachments from emails
  3. Processes them based on sender and configuration
  4. Creates draft documents

Model Comparison

| Model | Provider | Best For | Speed | Cost |
| --- | --- | --- | --- | --- |
| gpt-5 | OpenAI | State-of-the-art accuracy, complex multi-page documents | Medium | High |
| gpt-5-mini | OpenAI | Efficient reasoning, cost-effective | Fast | Medium |
| gpt-4o | OpenAI | Complex documents, high accuracy | Medium | Medium-High |
| gpt-4o-mini | OpenAI | Cost-effective, good accuracy | Fast | Low |
| gemini-2.5-pro | Google | Advanced reasoning, large context window | Medium | Medium |
| gemini-2.5-flash | Google | Fast processing, bulk documents | Very Fast | Low |
| deepseek-chat | DeepSeek | General purpose extraction | Fast | Low |
| deepseek-reasoner | DeepSeek | Complex reasoning tasks | Slow | Medium |
| claude-haiku-4-5 | Anthropic | Fast, lightweight tasks | Fast | Low |

India-Specific Features

The Transaction Parser app includes robust support for Indian business requirements through integration with the India Compliance app. These features enable automatic handling of GST regulations, Indian business identifiers, and region-specific validation requirements.

Prerequisites

  • India Compliance App: Must be installed for India-specific features to work

India-Specific AI Model Enhancements

Enhanced Data Extraction: When India is selected as the country, the parser additionally extracts:

  1. GST Identification Numbers (GSTIN)
  2. Permanent Account Numbers (PAN)
  3. HSN/SAC Codes
  4. Tax Components
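
For readers unfamiliar with these identifiers, the sketch below shows a format-only check. It assumes the common regular-taxpayer GSTIN layout (2-digit state code, 10-character PAN, entity code, the letter Z, check character) and does not verify the check character; other registration types use different layouts. The function name `pan_from_gstin` is hypothetical and this is not the validation used by the app or by India Compliance.

```python
import re

# Format-only patterns; the GSTIN check character is NOT verified here.
GSTIN_RE = re.compile(r"^[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]$")
PAN_RE = re.compile(r"^[A-Z]{5}[0-9]{4}[A-Z]$")


def pan_from_gstin(gstin):
    """Return the PAN embedded in a GSTIN, or None if the format does not match."""
    gstin = gstin.strip().upper()
    if not GSTIN_RE.match(gstin):
        return None
    # Characters 3 to 12 of a GSTIN hold the holder's PAN.
    return gstin[2:12]


print(pan_from_gstin("27AAPFU0939F1ZV"))
```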

Automatic Supplier Creation

GSTIN-Based Supplier Creation

When enabled in settings, the system can automatically create suppliers:

  1. Configuration
    • Enable "Auto Create Supplier" in Transaction Parser Settings
    • Requires valid GSTIN in the invoice

PDF Processor Setup

  • Transaction Parser supports three PDF processors for text extraction.
  • Only PDFtoText (the default) is installed as a required dependency.
  • The other two are optional.

Installing Optional PDF Processors

# Install OCRMyPDF
env/bin/pip install -e "apps/transaction_parser[ocrmypdf]"

# Install Docling
env/bin/pip install -e "apps/transaction_parser[docling]"

# Install all optional processors
env/bin/pip install -e "apps/transaction_parser[all]"

1. PDFtoText (Default)

Layout-preserving text extraction using pdftotext.

Important

Install OS dependencies before running bench setup requirements or pip install, otherwise the pdftotext Python package will fail to build.

OS Dependencies (Debian/Ubuntu):

sudo apt install build-essential libpoppler-cpp-dev pkg-config python3-dev

For other operating systems, see pdftotext OS dependencies.

2. OCRMyPDF (Optional)

OCR-based text extraction using OCRmyPDF. Useful for scanned or image-based PDFs.

OS Dependencies (Debian/Ubuntu):

sudo apt-get install -y tesseract-ocr ghostscript

3. Docling (Optional)

Advanced document understanding using Docling with EasyOCR for OCR support.

See Docling OCR engines for more details.

Post-install fix for headless servers:

After installing the docling extra, replace opencv-python with the headless variant:

bench pip uninstall opencv-python
bench pip install opencv-python-headless

This is required because opencv-python depends on libGL.so.1, which is unavailable on headless servers:

ImportError: libGL.so.1: cannot open shared object file: No such file or directory
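
To confirm which OpenCV distribution ended up in the environment after the swap, a small check along these lines can help. It is a sketch using only the standard library; `installed_opencv_variants` is a hypothetical helper name, and it should be run with the bench environment's Python interpreter.

```python
from importlib import metadata


def installed_opencv_variants():
    """Return the names of the OpenCV pip distributions that are installed."""
    found = []
    for name in ("opencv-python", "opencv-python-headless"):
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
        found.append(name)
    return found


# On a headless server the desired result is ["opencv-python-headless"].
print(installed_opencv_variants())
```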

Summary

| Processor | Dependency Type | OS Packages Required | OCR |
| --- | --- | --- | --- |
| PDFtoText | Required | build-essential libpoppler-cpp-dev pkg-config python3-dev | No |
| OCRMyPDF | Optional | tesseract-ocr ghostscript | Yes |
| Docling | Optional | None | Yes |
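
A quick way to see which processors' Python packages are importable in the current environment is sketched below. The helper name `available_processors` is hypothetical, and the check covers only the Python packages; it does not confirm that OS packages such as tesseract-ocr or ghostscript are present.

```python
from importlib.util import find_spec

# Processor label -> top-level Python module it relies on.
PROCESSOR_MODULES = {
    "PDFtoText": "pdftotext",
    "OCRMyPDF": "ocrmypdf",
    "Docling": "docling",
}


def available_processors():
    """Return the processor labels whose Python module can be located."""
    return [
        label
        for label, module in PROCESSOR_MODULES.items()
        if find_spec(module) is not None
    ]


print(available_processors())
```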

License

GNU General Public License (v3)
