The Python SDK for 99.8%+ Accurate Invoice Data Extraction
Convert unstructured invoices into structured JSON, including line items, with sub-second latency and guaranteed accuracy, eliminating manual data entry.

Why Production-Ready Invoice OCR is Difficult
Open-source tools like Tesseract fail in production because invoices lack a standard format. Each vendor uses a unique layout, making template-based or regex-based parsing brittle and costly to maintain. Core challenges include accurately identifying and extracting tabular data (line items) which can span multiple pages, parsing various date formats (MM/DD/YY vs DD-MM-YYYY), and handling low-quality inputs from scanners which introduce skew, rotation, and digital noise. This forces engineering teams into a perpetual cycle of building and fixing custom parsers for each new vendor, a fundamentally unscalable approach.
Enterprise-Grade Extraction with StructOCR
StructOCR uses pre-trained Deep Learning models specifically architected for invoice and receipt documents. Our API doesn't rely on templates. Instead, it understands document semantics. All inputs first pass through an automatic image pre-processing engine for deskewing and denoising. The models then identify key-value pairs and locate table boundaries to extract line items with high precision. Unlike Tesseract which returns an unstructured dump of text, StructOCR delivers a standardized JSON output, providing clean, predictable data structures ready for direct integration into your AP systems or ERP.
Production Use Cases
- Accounts Payable Automation: Automate your entire AP workflow from document ingestion to ERP entry. Reduce invoice processing costs by over 80% and eliminate human error.
- Expense Management Automation: Instantly capture receipt and invoice data for real-time expense reporting and approval, accelerating reimbursement cycles.
- Supply Chain Finance: Extract purchase order numbers, payment terms, and line items to verify trade documents and accelerate supply chain finance operations.
Implementation: Python SDK
The official Python SDK simplifies the integration by handling file encoding and authentication. It parses the nested JSON response (merchant, financials, line items) automatically.
Prerequisite: pip install structocr
from structocr import StructOCR
# 💰 Save 30%+ vs competitors. Get 200 free requests instantly:
# 👉 https://structocr.com/register
# Initialize with your API Key
client = StructOCR("YOUR_API_KEY_HERE")
def process_invoice():
# Note: Currently supports image inputs (JPG, PNG)
image_path = "invoice.jpg"
try:
print(f"Scanning invoice: {image_path}...")
# The SDK handles the API request and error mapping
result = client.scan_invoice(image_path)
# Access the structured data
data = result['data']
print("✅ Extraction Successful!")
print(f"Invoice #: {data.get('invoice_number')}")
print(f"Date: {data.get('date')} (Due: {data.get('due_date')})")
# Merchant Details
merchant = data.get('merchant', {})
print(f"Vendor: {merchant.get('name')} (Tax ID: {merchant.get('tax_id')})")
# Financials
fin = data.get('financials', {})
print(f"Total: {fin.get('total_amount')} {data.get('currency')}")
print(f"Tax: {fin.get('tax_amount')}")
# Line Items Table
print("\n--- Line Items ---")
for item in data.get('line_items', []):
print(f"- {item.get('description')}: {item.get('quantity')} x {item.get('unit_price')} = {item.get('amount')}")
except Exception as e:
print(f"❌ Extraction Failed: {e}")
if __name__ == "__main__":
process_invoice()Technical Specs
- •Latency: < 5s (Average)
- •Uptime: 98.5% SLA
- •Security: AES-256 Encryption & SOC2 Compliant
- •Input: JPG, PNG (Base64 or File Path)
- •Max File Size: 4.5MB
- •Output: JSON (Nested Structure)
Key Features
- •Line Item Extraction: Automatically parses tables and item lists into structured arrays.
- •Financial Parsing: Separates tax amounts, subtotals, and grand totals for easy AP automation.
- •Vendor Identification: Extracts merchant names, addresses, and tax IDs (VAT/EIN) reliably.
Sample JSON Output
StructOCR returns a normalized JSON object, regardless of the input invoice layout, angle, or quality.
{
"success": true,
"data": {
"type": "invoice",
"invoice_number": "INV-2026-001",
"date": "2026-01-15",
"due_date": "2026-02-15",
"currency": "USD",
"merchant": {
"name": "AWS Web Services",
"address": "410 Terry Ave N, Seattle, WA",
"tax_id": "EIN-12-3456789",
"iban": null
},
"customer": {
"name": "Acme Corp Inc.",
"tax_id": "987654321"
},
"financials": {
"subtotal": 100,
"tax_amount": 10,
"total_amount": 110
},
"line_items": [
{
"description": "EC2 Instance Usage",
"quantity": 1,
"unit_price": 80,
"amount": 80
},
{
"description": "S3 Storage",
"quantity": 1,
"unit_price": 20,
"amount": 20
}
]
}
}Frequently Asked Questions
How does StructOCR compare to AWS Textract or Google Vision?
Generic cloud OCR services like Textract or Vision provide raw text blocks and coordinates, leaving your developers to build and maintain complex parsing logic. StructOCR is a specialized, pre-trained model for invoices. It returns a structured JSON with specific fields like `invoice_number`, `line_items`, and `total_amount` directly, eliminating all post-processing overhead.
Do you store the uploaded documents?
No. We operate on a zero-retention policy. Documents are processed in-memory and permanently deleted immediately after the API call completes. We do not persist your data.
How do you handle blurry or skewed invoices?
Our API includes an automatic image pre-processing pipeline. This engine performs deskewing, denoising, and contrast enhancement before the OCR process begins, maximizing accuracy even on low-quality scans or mobile phone captures.
More OCR Tutorials
Python Driver's License OCR API
Extract driver's license data with our high-accuracy Python SDK. Get structured JSON output in seconds, eliminating manual entry and Tesseract errors.
Python National ID OCR API
High-accuracy National ID OCR for Python. Get structured JSON output via our dedicated Python SDK. Automate KYC and data entry with 99%+ accuracy.
Python Passport OCR API
Reliable Python Passport OCR API for high-accuracy data extraction. Get structured JSON output in milliseconds using our dedicated Python SDK. Eliminate errors.
Python VIN (Vehicle Identification Number) OCR API
Tutorial: How to use the StructOCR Python SDK to extract data from VIN (Vehicle Identification Number)s. Includes code samples and JSON schema.
Precise Data Extraction and Seamless
Integration with AI-powered OCR API.
Empower your solutions with automated data extraction by
integrating best-in class StructOCR via API seamlessly.
No credit card required • Full API access included