The Python SDK for 99.8%+ Accurate Invoice Data Extraction
Convert unstructured invoices into structured JSON, including line items, with high accuracy and an average latency under 5 seconds, eliminating manual data entry.

Why Production-Ready Invoice OCR is Difficult
Open-source OCR tools like Tesseract often fall short in production because invoices lack a standard format. Each vendor uses a unique layout, making template-based or regex-based parsing brittle and costly to maintain. Core challenges include identifying and extracting tabular data (line items) that can span multiple pages, parsing varied date formats (MM/DD/YY vs. DD-MM-YYYY), and handling low-quality scanner inputs, which introduce skew, rotation, and digital noise. This forces engineering teams into a perpetual cycle of building and fixing custom parsers for each new vendor, an approach that does not scale.
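To make the date-format problem concrete, here is a minimal sketch, using only the Python standard library, of the try-each-format approach that hand-written parsers tend to fall back on. The format list and the `normalize_date` helper are illustrative assumptions and are not part of StructOCR or its SDK.

```python
from datetime import date, datetime
from typing import Optional

# Candidate layouts to try, in order. Both the list and its ordering are
# assumptions for illustration; real invoices use many more variants.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d-%m-%Y", "%m/%d/%y"]


def normalize_date(raw: str) -> Optional[date]:
    """Return the first successful parse of `raw`, or None if nothing matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue  # this layout did not match, try the next one
    return None


samples = ["2026-01-15", "15-01-2026", "01/15/26", "Jan 15th, 2026"]
normalized = [normalize_date(s) for s in samples]

for raw, parsed in zip(samples, normalized):
    print(f"{raw!r} -> {parsed}")
```

The last sample falls through every format and comes back as `None`, so each new vendor style means another entry in the list. A string such as `03/04/26` is read month-first here only because that is the format listed; a vendor writing day-first dates with slashes would be silently misread.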
Enterprise-Grade Extraction with StructOCR
StructOCR uses pre-trained deep learning models built for invoice and receipt documents, offering a powerful accounts payable OCR solution. Our API avoids template dependency by understanding document semantics. All inputs undergo automatic image pre-processing for deskewing and denoising. The models then identify key-value pairs and table boundaries for accurate line item extraction. Unlike traditional OCR engines that output unstructured text, StructOCR returns standardized JSON: clean data structures ready for direct integration into your AP systems and finance workflows.
Production Use Cases
- Accounts Payable Automation: Automate your entire AP workflow from document ingestion to ERP entry. Reduce invoice processing costs by over 80% and eliminate human error.
- Expense Management Automation: Instantly capture receipt and invoice data for real-time expense reporting and approval, accelerating reimbursement cycles.
- Supply Chain Finance: Extract purchase order numbers, payment terms, and line items to verify trade documents and accelerate supply chain finance operations.
Implementation: Python SDK
The official Python SDK simplifies the integration by handling file encoding and authentication. It parses the nested JSON response (merchant, financials, line items) automatically.
Prerequisite: pip install structocr
from structocr import StructOCR

# 💰 Save 30%+ vs competitors. Get 20 free credits instantly:
# 👉 https://structocr.com/register

# Initialize with your API Key
client = StructOCR("YOUR_API_KEY_HERE")


def process_invoice():
    # Note: Currently supports image inputs (JPG, PNG)
    image_path = "invoice.jpg"
    try:
        print(f"Scanning invoice: {image_path}...")
        # The SDK handles the API request and error mapping
        result = client.scan_invoice(image_path)
        # Access the structured data
        data = result['data']
        print("✅ Extraction Successful!")
        print(f"Invoice #: {data.get('invoice_number')}")
        print(f"Date: {data.get('date')} (Due: {data.get('due_date')})")
        # Merchant Details
        merchant = data.get('merchant', {})
        print(f"Vendor: {merchant.get('name')} (Tax ID: {merchant.get('tax_id')})")
        # Financials
        fin = data.get('financials', {})
        print(f"Total: {fin.get('total_amount')} {data.get('currency')}")
        print(f"Tax: {fin.get('tax_amount')}")
        # Line Items Table
        print("\n--- Line Items ---")
        for item in data.get('line_items', []):
            print(f"- {item.get('description')}: {item.get('quantity')} x {item.get('unit_price')} = {item.get('amount')}")
    except Exception as e:
        print(f"❌ Extraction Failed: {e}")


if __name__ == "__main__":
    process_invoice()

Technical Specs
- Latency: < 5s (Average)
- Uptime: 98.5% SLA
- Security: AES-256 Encryption & SOC2 Compliant
- Input: JPG, PNG (Base64 or File Path)
- Max File Size: 4.5MB
- Output: JSON (Nested Structure)
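The input limits listed above can be checked locally before a request is sent. The following is a rough sketch using only the standard library. The helper names are hypothetical and not part of the SDK, `.jpeg` is assumed to be accepted alongside `.jpg`, and 4.5MB is assumed to mean 4.5 x 1024 x 1024 bytes, since the page does not say which definition applies.

```python
import base64
from pathlib import Path

# Assumption: ".jpeg" is treated the same as ".jpg".
ALLOWED_SUFFIXES = {".jpg", ".jpeg", ".png"}
# Assumption: 1 MB = 1024 * 1024 bytes.
MAX_BYTES = int(4.5 * 1024 * 1024)


def preflight_problems(path: str) -> list:
    """Return human-readable problems; an empty list means the file looks OK."""
    file = Path(path)
    if not file.is_file():
        return [f"{path}: file not found"]

    problems = []
    if file.suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append(f"{path}: unsupported extension {file.suffix!r}")
    size = file.stat().st_size
    if size > MAX_BYTES:
        problems.append(f"{path}: {size} bytes exceeds the {MAX_BYTES}-byte limit")
    return problems


def to_base64(path: str) -> str:
    """Encode the file contents as an ASCII Base64 string."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```

A caller could run `preflight_problems("invoice.jpg")` first and only proceed to `to_base64` or the SDK call when the returned list is empty.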
Key Features
- Line Item Extraction: Automatically parses tables and item lists into structured arrays.
- Financial Parsing: Separates tax amounts, subtotals, and grand totals for easy AP automation.
- Vendor Identification: Extracts merchant names, addresses, and tax IDs (VAT/EIN) reliably.
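Because subtotals, tax, and totals arrive as separate fields, a simple arithmetic cross-check is possible before data reaches an ERP. The sketch below is one possible approach, not a feature of the SDK. It assumes the field names used in this page's sample output (`financials`, `line_items`, and so on), assumes every numeric field is present, and uses a hypothetical `reconcile` helper with an arbitrary tolerance of 0.01.

```python
from decimal import Decimal


def to_decimal(value) -> Decimal:
    # Go through str() so floats such as 19.99 do not carry binary noise.
    return Decimal(str(value))


def reconcile(data: dict, tolerance: Decimal = Decimal("0.01")) -> list:
    """Return a list of arithmetic mismatches found in one extracted invoice."""
    issues = []
    fin = data.get("financials", {})
    items = data.get("line_items", [])

    # Each row: quantity x unit_price should match the row amount.
    for i, item in enumerate(items):
        expected = to_decimal(item["quantity"]) * to_decimal(item["unit_price"])
        if abs(expected - to_decimal(item["amount"])) > tolerance:
            issues.append(f"line {i}: quantity x unit_price != amount")

    # Row amounts should add up to the subtotal.
    items_total = sum((to_decimal(item["amount"]) for item in items), Decimal("0"))
    subtotal = to_decimal(fin["subtotal"])
    if abs(items_total - subtotal) > tolerance:
        issues.append("line items do not sum to subtotal")

    # Subtotal plus tax should match the grand total.
    total = to_decimal(fin["total_amount"])
    if abs(subtotal + to_decimal(fin["tax_amount"]) - total) > tolerance:
        issues.append("subtotal + tax_amount != total_amount")

    return issues
```

An empty list means the numbers agree within the tolerance. Invoices with discounts, shipping, or multiple tax lines would need extra terms that this sketch does not model.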
Sample JSON Output
StructOCR returns a normalized JSON object, regardless of the input invoice layout, angle, or quality.
{
  "success": true,
  "data": {
    "type": "invoice",
    "invoice_number": "INV-2026-001",
    "date": "2026-01-15",
    "due_date": "2026-02-15",
    "currency": "USD",
    "merchant": {
      "name": "AWS Web Services",
      "address": "410 Terry Ave N, Seattle, WA",
      "tax_id": "EIN-12-3456789",
      "iban": null
    },
    "customer": {
      "name": "Acme Corp Inc.",
      "tax_id": "987654321"
    },
    "financials": {
      "subtotal": 100,
      "tax_amount": 10,
      "total_amount": 110
    },
    "line_items": [
      {
        "description": "EC2 Instance Usage",
        "quantity": 1,
        "unit_price": 80,
        "amount": 80
      },
      {
        "description": "S3 Storage",
        "quantity": 1,
        "unit_price": 20,
        "amount": 20
      }
    ]
  }
}

Frequently Asked Questions
How does StructOCR compare to AWS Textract or Google Vision?
Generic cloud OCR services like Textract or Vision provide raw text blocks and coordinates, leaving your developers to build and maintain complex parsing logic. StructOCR is a specialized, pre-trained model for invoices. It returns a structured JSON with specific fields like `invoice_number`, `line_items`, and `total_amount` directly, eliminating all post-processing overhead.
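As an illustration of what named fields make possible downstream, here is a small sketch that flattens one extracted invoice into CSV rows, one per line item, using only the standard library. The `line_items_to_csv` helper and the column selection are assumptions for illustration; the field names follow the sample JSON output on this page.

```python
import csv
import io


def line_items_to_csv(data: dict) -> str:
    """Flatten one extracted invoice into CSV text, one row per line item."""
    columns = ["invoice_number", "vendor", "description",
               "quantity", "unit_price", "amount"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for item in data.get("line_items", []):
        writer.writerow({
            # Invoice-level fields are repeated on every row.
            "invoice_number": data.get("invoice_number"),
            "vendor": data.get("merchant", {}).get("name"),
            # Row-level fields come from the line item itself.
            "description": item.get("description"),
            "quantity": item.get("quantity"),
            "unit_price": item.get("unit_price"),
            "amount": item.get("amount"),
        })
    return buffer.getvalue()
```

Passing the `data` object from a scan result to `line_items_to_csv` yields text that can be written to a file or handed to an import job. No layout-specific parsing is involved, only dictionary lookups.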
Do you store the uploaded documents?
No. We operate on a zero-retention policy. Documents are processed in-memory and permanently deleted immediately after the API call completes. We do not persist your data.
How do you handle blurry or skewed invoices?
Our API includes an automatic image pre-processing pipeline. This engine performs deskewing, denoising, and contrast enhancement before the OCR process begins, maximizing accuracy even on low-quality scans or mobile phone captures.