AI-Powered Document Processing Pipeline

Extracts data from incoming business documents using AI, verifies it against ERP master data and, through three-way matching, against purchase orders and goods receipts, then creates the booking after human review.

Part of a larger automation portal managing ERP operations across subsidiaries of a multinational manufacturer

Python OCR LLM ERP

The Problem

The company processes business documents — supplier invoices, customer orders, delivery notes, credit notes — across multiple entities and languages. Each document required staff to open the email, open the PDF, read every field — document number, dates, line items with quantities and prices, VAT, payment details — then cross-reference against existing data in the ERP before manually keying it in. A single document could take 5–15 minutes end-to-end, depending on complexity. With documents arriving daily across multiple sites — each with different partners, languages, document formats, and business logic — the process ate hours of admin time every week while staying prone to transcription errors, missed price discrepancies, and delayed bookings that led to late payments and audit findings.

The Approach

Ingestion

PDF ingestion: The system monitors shared mailboxes via Microsoft Graph API webhooks or polls them over IMAP at set intervals. Sender-based routing assigns each message to the correct entity, and PDF attachments are extracted automatically. Documents can also come in via a drag-and-drop upload form or by pasting a URL to a file in shared storage. Every PDF lands in central storage with metadata tags for source, entity, and document type.
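
The sender-based routing step can be sketched as follows. This is a minimal illustration rather than the portal's actual code: the routing table, the entity codes, and the function names are assumptions made for the example.

```python
from email.utils import parseaddr
from typing import Dict, List, Optional, Tuple

# Hypothetical routing table: full sender address or bare domain -> entity code.
ROUTES: Dict[str, str] = {
    "invoices@supplier-a.example": "ENTITY_DE",
    "supplier-b.example": "ENTITY_CH",
}


def route_sender(from_header: str, routes: Dict[str, str] = ROUTES) -> Optional[str]:
    """Return the entity for a sender: exact address first, then domain.

    Returns None when no rule applies, so the caller can park the
    message for manual assignment instead of guessing.
    """
    _, address = parseaddr(from_header)
    address = address.lower()
    if address in routes:
        return routes[address]
    domain = address.rpartition("@")[2]
    return routes.get(domain)


def pdf_attachments(attachments: List[Tuple[str, bytes]]) -> List[Tuple[str, bytes]]:
    """Keep attachments whose content starts with the PDF magic bytes."""
    return [(name, data) for name, data in attachments if data.startswith(b"%PDF-")]
```

Checking the magic bytes instead of the file extension is a design choice for this sketch: it skips inline logos and signature images even when they carry a misleading name.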

Extraction

PDF parsing and AI extraction: The pipeline first checks for embedded structured data (ZUGFeRD/XRechnung XML). If none is found, a vision-capable language model handles visual parsing and field extraction in one pass. This works on both scanned paper documents and digitally created PDFs, with no separate preprocessing or per-partner template configuration.
Structured extraction with confidence scoring: The extraction prompt asks the model to return ~30 typed fields as JSON: document metadata (type, number, date, currency); supplier and customer details; line items with per-position article numbers, quantities, unit prices, customs tariff codes, country of origin, and other customs information; referenced documents; amounts with VAT breakdown; and payment information (IBAN, QR reference). Every field gets a confidence score that controls how strictly it's validated downstream.
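
One way confidence scores could drive downstream validation is sketched below. The JSON shape, the field names, and both thresholds are assumptions for illustration; the actual schema and cut-offs are not stated here.

```python
import json
from dataclasses import dataclass
from typing import Any, List

# Hypothetical thresholds; the real cut-offs are not documented.
AUTO_ACCEPT = 0.90
NEEDS_CHECK = 0.60


@dataclass(frozen=True)
class ExtractedField:
    name: str
    value: Any
    confidence: float  # 0.0 to 1.0, as reported alongside the value


def parse_fields(raw_json: str) -> List[ExtractedField]:
    """Parse {"field": {"value": ..., "confidence": ...}} into typed records."""
    payload = json.loads(raw_json)
    fields = []
    for name, entry in payload.items():
        confidence = float(entry["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range for {name!r}: {confidence}")
        fields.append(ExtractedField(name, entry["value"], confidence))
    return fields


def validation_level(field: ExtractedField) -> str:
    """Map a confidence score to how strictly the field is checked."""
    if field.confidence >= AUTO_ACCEPT:
        return "accept"
    if field.confidence >= NEEDS_CHECK:
        return "cross-check"
    return "manual-review"
```

In this sketch a low-confidence IBAN would be routed to manual review, while a high-confidence document number would pass straight to matching.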

Validation and matching

Partner matching: Extracted supplier and customer names are fuzzy-matched against the ERP vendor and customer master tables via name similarity search and exact number lookup. The system ranks candidate matches and shows them to the reviewer for confirmation.
Article verification: The system matches each line item's article number against the ERP article master — exact match first, fuzzy fallback for partial numbers. Matched articles surface the ERP description, customs tariff number, country of origin, and preferential trade eligibility for automated cross-verification against the PDF values.
Three-way document chain matching: For incoming invoices, the system traces per-line purchase order references into the ERP, resolves the corresponding order positions, then follows the predecessor chain to locate goods receipt entries. This enables position-level three-way matching — document quantities and prices are compared against both the original purchase order and the actual goods receipt, with discrepancies flagged individually.
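
The position-level comparison can be illustrated with a small sketch. It assumes the purchase order position and the goods receipt entry have already been resolved from the ERP; the `Position` type, the price tolerance, and the flag texts are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional

PRICE_TOLERANCE = Decimal("0.01")  # hypothetical rounding tolerance per unit


@dataclass(frozen=True)
class Position:
    article: str
    quantity: Decimal
    unit_price: Optional[Decimal] = None  # goods receipts may carry no price


def match_position(
    invoice: Position,
    order: Optional[Position],
    receipt: Optional[Position],
) -> List[str]:
    """Compare one invoice line against its order position and goods receipt.

    Returns a list of discrepancy flags; an empty list means the line matches.
    """
    if order is None:
        return ["no purchase order position found"]
    flags = []
    if invoice.article != order.article:
        flags.append("article differs from purchase order")
    if invoice.quantity > order.quantity:
        flags.append("invoiced quantity exceeds ordered quantity")
    if invoice.unit_price is not None and order.unit_price is not None:
        if abs(invoice.unit_price - order.unit_price) > PRICE_TOLERANCE:
            flags.append("unit price differs from purchase order")
    if receipt is None:
        flags.append("no goods receipt found")
    elif invoice.quantity > receipt.quantity:
        flags.append("invoiced quantity exceeds received quantity")
    return flags
```

`Decimal` is used instead of `float` so that price comparisons are not distorted by binary rounding. Each flag belongs to a single line, which keeps discrepancies individually reviewable.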

ERP integration

REST API document creation: On approval, the system creates the document in the ERP via its REST API, reusing the portal's existing document transfer engine. Incoming supplier invoices become purchase invoices; customer orders become sales orders. The system populates all header and position fields, links predecessor documents, and attaches the source PDF as a journal entry on the created record.
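
As a rough sketch of the approval-gated creation step, the function below builds a request body for a reviewed document. The payload keys and type identifiers are invented for illustration and do not describe the ERP's real REST schema; only the invoice-to-purchase-invoice and order-to-sales-order mapping comes from the description above.

```python
from typing import Any, Dict, List

# Mapping described in the text; the string identifiers are illustrative.
ERP_DOCUMENT_TYPE = {
    "supplier_invoice": "purchase_invoice",
    "customer_order": "sales_order",
}


def build_erp_payload(
    doc_type: str,
    header: Dict[str, Any],
    positions: List[Dict[str, Any]],
    predecessor_ids: List[str],
    approved: bool,
) -> Dict[str, Any]:
    """Build a hypothetical request body for an approved document."""
    if not approved:
        raise PermissionError("document must be approved by a reviewer before booking")
    if doc_type not in ERP_DOCUMENT_TYPE:
        raise ValueError(f"unsupported document type: {doc_type}")
    return {
        "type": ERP_DOCUMENT_TYPE[doc_type],
        "header": dict(header),
        # Number positions explicitly so ERP lines keep the order of the PDF.
        "positions": [dict(p, line=i) for i, p in enumerate(positions, start=1)],
        "predecessors": list(predecessor_ids),
    }
```

Refusing to build a payload for an unapproved document keeps the human review step from being bypassed by accident.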

Observability

Interactive review interface: A split-pane review page displays the original PDF alongside every extracted field with confidence bars and inline editing. The UI tracks and highlights every correction. Unmatched articles offer instant search against the ERP master data. A verdict banner summarizes extraction confidence, match rates, goods receipt coverage, and price discrepancies, and states whether the document is ready to book or needs attention.
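
The verdict banner logic might look roughly like this. The summary fields and the confidence threshold are assumptions for the example, not the portal's actual rules.

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.90  # hypothetical threshold for the weakest field


@dataclass(frozen=True)
class ReviewSummary:
    min_confidence: float      # lowest confidence across extracted fields
    articles_matched: int
    positions_total: int
    receipts_found: int        # positions with a goods receipt entry
    price_discrepancies: int


def verdict(s: ReviewSummary) -> str:
    """Return 'ready to book' only when every check passes."""
    ready = (
        s.min_confidence >= MIN_CONFIDENCE
        and s.articles_matched == s.positions_total
        and s.receipts_found == s.positions_total
        and s.price_discrepancies == 0
    )
    return "ready to book" if ready else "needs attention"


def banner(s: ReviewSummary) -> str:
    """Render the one-line summary shown above the review page."""
    return (
        f"{verdict(s)}: confidence {s.min_confidence:.0%}, "
        f"articles {s.articles_matched}/{s.positions_total}, "
        f"goods receipts {s.receipts_found}/{s.positions_total}, "
        f"price discrepancies {s.price_discrepancies}"
    )
```

Any single failing check downgrades the verdict, so the reviewer's attention goes to the documents that need it.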

The Result

Processing time

Under 15 seconds from PDF to reviewable result — down from 5–15 minutes of manual data entry per document

Extraction scope

Header data, line items with customs details, payment information, and order references: ~30 fields per document, with manual keying limited to reviewer corrections

Verification

Automated three-way matching at position level: invoice ↔ purchase order ↔ goods receipt, with discrepancies flagged before booking

Coverage

All entities and languages in one pipeline — incoming supplier invoices and customer orders