The Python SDK for 99.8%+ Accurate Invoice Data Extraction

Convert unstructured invoices into structured JSON, including line items, in under 5 seconds on average, and cut manual data entry.

Steve Harrington
Updated 2026-01-19
Diagram showing a scanned invoice PDF being processed by the StructOCR API, which then outputs a structured JSON file with extracted fields like invoice number, line items, and total amount.
Figure 1: StructOCR converts raw invoice images and PDFs into validated JSON data.

Why Production-Ready Invoice OCR is Difficult

Open-source tools like Tesseract often fall short in production because invoices have no standard format. Each vendor uses its own layout, which makes template-based or regex-based parsing brittle and costly to maintain. The core challenges are extracting tabular data (line items) that can span multiple pages, parsing varied date formats (MM/DD/YY vs. DD-MM-YYYY), and handling low-quality scans that introduce skew, rotation, and digital noise. Engineering teams end up building and fixing a custom parser for each new vendor, an approach that does not scale.
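To make the date-format problem concrete, here is a minimal sketch of the kind of hand-rolled normalizer teams often write. It is illustrative only and is not part of StructOCR; the function name `normalize_date` and the list of formats are assumptions chosen for the example.

```python
from datetime import date, datetime
from typing import Optional

# Formats a hand-written parser might try, in priority order (illustrative only).
CANDIDATE_FORMATS = ("%m/%d/%y", "%d-%m-%Y", "%Y-%m-%d")

def normalize_date(raw: str) -> Optional[date]:
    """Return the first successful parse of `raw`, or None if no format matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # this format did not match; try the next one
    return None

print(normalize_date("01/15/26"))    # matches MM/DD/YY
print(normalize_date("15-01-2026"))  # matches DD-MM-YYYY
print(normalize_date("15.01.2026"))  # matches none of the listed formats
```

Every new vendor convention means another entry in the list, and an ambiguous value such as 03/04/26 is silently read as month first even when the vendor meant day first.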

Enterprise-Grade Extraction with StructOCR

StructOCR uses pre-trained deep learning models designed specifically for invoice and receipt documents. The API does not rely on templates; instead, it interprets document semantics. All inputs first pass through an automatic image pre-processing engine for deskewing and denoising. The models then identify key-value pairs and locate table boundaries to extract line items with high precision. Unlike Tesseract, which returns an unstructured dump of text, StructOCR delivers standardized JSON output: clean, predictable data structures ready for direct integration into your AP system or ERP.

Production Use Cases

  • Accounts Payable Automation: Automate your entire AP workflow from document ingestion to ERP entry. Reduce invoice processing costs by over 80% and eliminate human error.
  • Expense Management Automation: Instantly capture receipt and invoice data for real-time expense reporting and approval, accelerating reimbursement cycles.
  • Supply Chain Finance: Extract purchase order numbers, payment terms, and line items to verify trade documents and accelerate supply chain finance operations.

Implementation: Python SDK

The official Python SDK simplifies the integration by handling file encoding and authentication. It parses the nested JSON response (merchant, financials, line items) automatically.

Prerequisite: pip install structocr

CODE EXAMPLE
from structocr import StructOCR

# Get an API key at https://structocr.com/register
# Initialize the client with your API key
client = StructOCR("YOUR_API_KEY_HERE")

def process_invoice():
    # Note: Currently supports image inputs (JPG, PNG)
    image_path = "invoice.jpg"

    try:
        print(f"Scanning invoice: {image_path}...")
        
        # The SDK handles the API request and error mapping
        result = client.scan_invoice(image_path)

        # Access the structured data
        data = result['data']
        
        print("✅ Extraction Successful!")
        print(f"Invoice #: {data.get('invoice_number')}")
        print(f"Date: {data.get('date')} (Due: {data.get('due_date')})")
        
        # Merchant Details
        merchant = data.get('merchant', {})
        print(f"Vendor: {merchant.get('name')} (Tax ID: {merchant.get('tax_id')})")
        
        # Financials
        fin = data.get('financials', {})
        print(f"Total: {fin.get('total_amount')} {data.get('currency')}")
        print(f"Tax: {fin.get('tax_amount')}")

        # Line Items Table
        print("\n--- Line Items ---")
        for item in data.get('line_items', []):
            print(f"- {item.get('description')}: {item.get('quantity')} x {item.get('unit_price')} = {item.get('amount')}")

    except Exception as e:
        print(f"❌ Extraction Failed: {e}")

if __name__ == "__main__":
    process_invoice()

Technical Specs

  • Latency: < 5s (Average)
  • Uptime: 98.5% SLA
  • Security: AES-256 Encryption & SOC 2 Compliant
  • Input: JPG, PNG (Base64 or File Path)
  • Max File Size: 4.5MB
  • Output: JSON (Nested Structure)
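The input limits above can be checked locally before a request is sent. The following is a minimal sketch, not part of the SDK: the helper names `check_upload` and `encode_image` are hypothetical, and it assumes that "4.5MB" means 4.5 × 1024 × 1024 bytes and that `.jpeg` is accepted alongside `.jpg`.

```python
import base64
from pathlib import Path

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumption: .jpeg is treated like .jpg
MAX_FILE_BYTES = int(4.5 * 1024 * 1024)         # assumption: MB means 1024 * 1024 bytes

def check_upload(filename: str, size_bytes: int) -> list:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {filename}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append(f"file too large: {size_bytes} bytes (limit {MAX_FILE_BYTES})")
    return problems

def encode_image(path: str) -> str:
    """Validate a local file, then return its contents as a Base64 string."""
    file = Path(path)
    problems = check_upload(file.name, file.stat().st_size)
    if problems:
        raise ValueError("; ".join(problems))
    return base64.b64encode(file.read_bytes()).decode("ascii")

print(check_upload("invoice.jpg", 120_000))    # no problems expected
print(check_upload("invoice.pdf", 6_000_000))  # wrong type and too large
```

Rejecting oversized or unsupported files on the client avoids spending a request on input the API would not accept.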

Key Features

  • Line Item Extraction: Automatically parses tables and item lists into structured arrays.
  • Financial Parsing: Separates tax amounts, subtotals, and grand totals for easy AP automation.
  • Vendor Identification: Extracts merchant names, addresses, and tax IDs (VAT/EIN) reliably.

Sample JSON Output

StructOCR returns a normalized JSON object, regardless of the input invoice layout, angle, or quality.

{
  "success": true,
  "data": {
    "type": "invoice",
    "invoice_number": "INV-2026-001",
    "date": "2026-01-15",
    "due_date": "2026-02-15",
    "currency": "USD",
    "merchant": {
      "name": "AWS Web Services",
      "address": "410 Terry Ave N, Seattle, WA",
      "tax_id": "EIN-12-3456789",
      "iban": null
    },
    "customer": {
      "name": "Acme Corp Inc.",
      "tax_id": "987654321"
    },
    "financials": {
      "subtotal": 100,
      "tax_amount": 10,
      "total_amount": 110
    },
    "line_items": [
      {
        "description": "EC2 Instance Usage",
        "quantity": 1,
        "unit_price": 80,
        "amount": 80
      },
      {
        "description": "S3 Storage",
        "quantity": 1,
        "unit_price": 20,
        "amount": 20
      }
    ]
  }
}
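Because the financial fields and line items arrive as numbers, a consumer can run a basic arithmetic check before posting to an ERP. This is a sketch under assumptions, not SDK functionality: the `reconcile` helper and the 0.01 tolerance are hypothetical, and real invoices may carry discounts or shipping charges that this simple check does not model.

```python
from decimal import Decimal

def _dec(value):
    """Convert a JSON number to Decimal, passing None through unchanged."""
    return None if value is None else Decimal(str(value))

def reconcile(data: dict, tolerance: Decimal = Decimal("0.01")) -> list:
    """Return a list of arithmetic inconsistencies found in the `data` object."""
    issues = []
    fin = data.get("financials") or {}
    items = data.get("line_items") or []
    subtotal = _dec(fin.get("subtotal"))
    tax = _dec(fin.get("tax_amount"))
    total = _dec(fin.get("total_amount"))

    # Each line: quantity x unit_price should equal amount.
    for index, item in enumerate(items):
        qty, price, amount = (_dec(item.get(k)) for k in ("quantity", "unit_price", "amount"))
        if None not in (qty, price, amount) and abs(qty * price - amount) > tolerance:
            issues.append(f"line item {index}: quantity x unit_price != amount")

    # Line amounts should add up to the subtotal.
    amounts = [_dec(item.get("amount")) for item in items]
    if subtotal is not None and amounts and None not in amounts:
        if abs(sum(amounts) - subtotal) > tolerance:
            issues.append("line item amounts do not sum to subtotal")

    # Subtotal plus tax should equal the total.
    if None not in (subtotal, tax, total) and abs(subtotal + tax - total) > tolerance:
        issues.append("subtotal + tax_amount != total_amount")
    return issues

sample = {
    "financials": {"subtotal": 100, "tax_amount": 10, "total_amount": 110},
    "line_items": [
        {"description": "EC2 Instance Usage", "quantity": 1, "unit_price": 80, "amount": 80},
        {"description": "S3 Storage", "quantity": 1, "unit_price": 20, "amount": 20},
    ],
}
print(reconcile(sample))
```

Invoices that fail the check can be routed to manual review instead of being entered automatically.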

Frequently Asked Questions

How does StructOCR compare to AWS Textract or Google Vision?

General-purpose OCR endpoints in services like AWS Textract or Google Vision return raw text blocks and coordinates, leaving your developers to build and maintain the parsing logic. StructOCR is a specialized, pre-trained model for invoices. It returns structured JSON with specific fields such as `invoice_number`, `line_items`, and `total_amount`, so no layout-specific post-processing is required.
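Since the field names are fixed, the response can be mapped onto typed objects in a few lines. The sketch below assumes the response shape shown in the sample JSON output; the `Invoice` and `LineItem` classes and the `parse_invoice` helper are hypothetical and are not provided by the SDK.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class LineItem:
    description: Optional[str]
    quantity: Optional[float]
    unit_price: Optional[float]
    amount: Optional[float]

@dataclass
class Invoice:
    invoice_number: Optional[str]
    currency: Optional[str]
    total_amount: Optional[float]
    line_items: List[LineItem]

def parse_invoice(response: dict) -> Invoice:
    """Map a response dict (shape assumed from the sample output) to an Invoice."""
    if not response.get("success"):
        raise ValueError("response does not report success")
    data = response.get("data") or {}
    financials = data.get("financials") or {}
    items = [
        LineItem(
            description=raw.get("description"),
            quantity=raw.get("quantity"),
            unit_price=raw.get("unit_price"),
            amount=raw.get("amount"),
        )
        for raw in data.get("line_items") or []
    ]
    return Invoice(
        invoice_number=data.get("invoice_number"),
        currency=data.get("currency"),
        total_amount=financials.get("total_amount"),
        line_items=items,
    )

raw_json = """{"success": true, "data": {"invoice_number": "INV-2026-001",
  "currency": "USD", "financials": {"total_amount": 110},
  "line_items": [{"description": "S3 Storage", "quantity": 1,
                  "unit_price": 20, "amount": 20}]}}"""
invoice = parse_invoice(json.loads(raw_json))
print(invoice.invoice_number, invoice.total_amount, len(invoice.line_items))
```

Missing fields become `None` rather than raising, which mirrors the `.get()` style used in the SDK example.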

Do you store the uploaded documents?

No. We operate on a zero-retention policy. Documents are processed in-memory and permanently deleted immediately after the API call completes. We do not persist your data.

How do you handle blurry or skewed invoices?

Our API includes an automatic image pre-processing pipeline. This engine performs deskewing, denoising, and contrast enhancement before the OCR process begins, maximizing accuracy even on low-quality scans or mobile phone captures.
