The Premier Python SDK for Receipt Data Extraction
Stop wrestling with Tesseract and complex regex. Convert messy retail and dining receipts into structured, Pandas-ready dictionaries in seconds.

The Machine Learning Bottleneck for Receipts
Data scientists and backend engineers building fintech tools in Python quickly realize that extracting structured data from receipts is a layout nightmare. Relying on local libraries like Pytesseract yields a flat string of text, stripping away crucial spatial relationships. Training a custom LayoutLM or YOLO model to differentiate between a subtotal, a state tax, and a handwritten tip across millions of unique merchant formats demands massive labeled datasets and expensive GPU clusters. Factor in crumpled thermal paper and poor lighting, and local computer vision pipelines simply break down.
The Cloud-Native StructOCR Advantage
The StructOCR Python SDK bypasses this MLOps hurdle entirely. The SDK sends your image payloads to our cloud-native receipt OCR API, where specialized spatial models handle noise reduction and semantic tagging. You receive a structured Python dictionary, typically in under 3 seconds. Engineering teams can then deploy expense management automation features or populate analytics databases without managing local PyTorch or TensorFlow dependencies.
Ideal for Python Financial Workflows
- Fintech ETL Pipelines: Automate the extraction of expense data from bulk S3 bucket uploads, piping the cleansed dictionaries directly into PostgreSQL or Snowflake.
- AI Accounting Assistants: Feed structured line-item data into LLM agents (like LangChain workflows) to automatically categorize spending habits and answer user financial queries.
- Consumer Cashback & Rewards: Systematically scan thousands of user-submitted grocery receipts to identify targeted SKUs and trigger automated reward payouts.
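As a rough illustration of the ETL use case above, the sketch below loads the receipt-level fields from one SDK result into a SQL table. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL or Snowflake. The table layout and the `load_receipt` helper are assumptions made for this example, not part of the SDK; the field names follow the sample response shown later in this guide.

```python
import sqlite3


def load_receipt(conn, result):
    """Insert one scanned receipt into the `receipts` table.

    Returns True if a row was stored, False if the scan failed validation.
    `load_receipt` and the table schema are hypothetical, for illustration only.
    """
    data = result.get("data") or {}
    if not (result.get("success") and data.get("is_valid")):
        return False  # skip failed or invalid scans
    conn.execute(
        "INSERT INTO receipts "
        "(merchant_name, date, currency, total_amount, tax_amount) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            data.get("merchant_name"),
            data.get("date"),
            data.get("currency"),
            data.get("total_amount"),
            data.get("tax_amount"),
        ),
    )
    conn.commit()
    return True


# In-memory database so the sketch runs without any external service.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE receipts ("
    "merchant_name TEXT, date TEXT, currency TEXT, "
    "total_amount REAL, tax_amount REAL)"
)

# Trimmed copy of the sample response, standing in for a real SDK result.
sample = {
    "success": True,
    "data": {
        "is_valid": True,
        "merchant_name": "Blue Bottle Coffee",
        "date": "2026-04-22",
        "currency": "USD",
        "total_amount": 14.5,
        "tax_amount": 1.25,
    },
}
stored = load_receipt(conn, sample)
```

In a real pipeline the same parameterized `INSERT` pattern would typically run through your PostgreSQL or Snowflake driver instead of `sqlite3`.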
Implementation: Python SDK Usage
Requires Python 3.7+. Install the SDK with `pip install structocr`. The script below extracts expense data from a receipt image and walks through the returned Python dictionary.
from structocr import StructOCR

# 💰 Save hours of data cleansing. Get 20 free credits instantly:
# 👉 https://structocr.com/register

def process_expense_receipt():
    # Initialize the client with your secret API key
    client = StructOCR("YOUR_API_KEY_HERE")
    image_path = "./dataset/raw_receipts/coffee_shop_01.jpg"
    try:
        print(f"Analyzing receipt image: {image_path}...")
        # The SDK automatically handles file I/O and Base64 encoding
        result = client.scan_receipt(image_path)
        # Guard against a missing or null "data" field before reading from it
        data = result.get('data') or {}
        # Verify the API call succeeded and the receipt is valid
        if result.get('success') and data.get('is_valid'):
            print("✅ Receipt Successfully Extracted!")
            print(f"Merchant: {data.get('merchant_name')} (Confidence: {data.get('confidence')})")
            print(f"Date/Time: {data.get('date')} {data.get('time')}")
            print(f"Total Paid: {data.get('total_amount')} {data.get('currency')}")
            print(f"Taxes: {data.get('tax_amount')}\n")
            # Iterate through the extracted line items
            print("--- Purchased Items ---")
            for item in data.get('items', []):
                qty = item.get('quantity')
                name = item.get('name')
                price = item.get('price')
                print(f"- {qty}x {name} @ {price}")
        else:
            # Handle invalid receipt formats or poor image quality
            error_msg = data.get('validation_error') or 'No recognizable receipt found.'
            print(f"❌ Validation Failed: {error_msg}")
    except Exception as e:
        print(f"SDK or Network Exception: {e}")

if __name__ == "__main__":
    process_expense_receipt()
Technical Specs
- Latency: < 3s (average)
- Uptime: 99.9% SLA
- Security: Zero Data Retention (GDPR & SOC2 Compliant)
- Input: File Paths, Bytes, or Base64 (Max 4.5MB)
- Output: Python dictionary (receipt fields under `data`, line items in a nested `items` list)
Key Features
- DataFrame Friendly: The `items` list is a list of uniform dictionaries, so it converts directly into a `pandas.DataFrame` for rapid analysis.
- OpenCV Compatibility: Pass in-memory image bytes directly from OpenCV (`cv2`) or Pillow (`PIL`) without writing temporary files to disk.
- Spatial Awareness: Goes beyond basic text extraction to correctly group multi-line product descriptions with their corresponding prices and quantities.
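To make the DataFrame point concrete, here is a minimal sketch that flattens one result into one row per line item, repeating the receipt-level fields on each row. The `receipt_to_rows` helper and the output column names are assumptions for illustration; the input field names follow the sample response in this guide, where `price` arrives as a string.

```python
def receipt_to_rows(result):
    """Flatten one receipt result into a list of per-item dictionaries.

    `receipt_to_rows` is a hypothetical helper, not an SDK function.
    """
    data = result.get("data") or {}
    # Receipt-level fields copied onto every line-item row.
    header = {
        "merchant_name": data.get("merchant_name"),
        "date": data.get("date"),
        "currency": data.get("currency"),
    }
    rows = []
    for item in data.get("items") or []:
        price = item.get("price")
        rows.append({
            **header,
            "item_name": item.get("name"),
            "quantity": item.get("quantity"),
            # Prices are strings in the sample response; convert for analysis.
            "price": float(price) if price is not None else None,
        })
    return rows


# Trimmed copy of the sample response, standing in for a real SDK result.
sample = {
    "success": True,
    "data": {
        "is_valid": True,
        "merchant_name": "Blue Bottle Coffee",
        "date": "2026-04-22",
        "currency": "USD",
        "items": [
            {"name": "Caffe Latte - Large", "quantity": 2, "price": "11.00"},
            {"name": "Butter Croissant", "quantity": 1, "price": "3.50"},
        ],
    },
}
rows = receipt_to_rows(sample)
```

With pandas installed, `pandas.DataFrame(rows)` turns the result into a table with one row per line item. If you need exact currency arithmetic, `decimal.Decimal` may be a better fit than `float`.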
Sample JSON Dictionary Response
The SDK returns a native Python dictionary: a top-level `success` flag plus a `data` object that holds the receipt fields and an `items` list. The JSON equivalent of a successful response looks like this:
{
  "success": true,
  "data": {
    "is_valid": true,
    "confidence": "high",
    "merchant_name": "Blue Bottle Coffee",
    "date": "2026-04-22",
    "time": "08:45 AM",
    "currency": "USD",
    "total_amount": 14.5,
    "tax_amount": 1.25,
    "items": [
      {
        "name": "Caffe Latte - Large",
        "quantity": 2,
        "price": "11.00"
      },
      {
        "name": "Butter Croissant",
        "quantity": 1,
        "price": "3.50"
      }
    ],
    "validation_error": null
  }
}
Frequently Asked Questions
Can I pass an image directly from an OpenCV (cv2) processing script?
Yes. If you are doing upstream image manipulation, you can encode your OpenCV NumPy array to a `.jpg` byte array in memory and pass the raw bytes directly into the SDK method, avoiding slow disk I/O.
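A minimal sketch of that in-memory encoding step, assuming OpenCV (`opencv-python`) and NumPy are installed. The `encode_jpeg_bytes` helper and its `quality` default are assumptions for illustration; `cv2.imencode` is the standard OpenCV call for encoding an array without touching disk.

```python
def encode_jpeg_bytes(image, quality=90):
    """Encode an OpenCV image (NumPy array) to JPEG bytes in memory.

    `encode_jpeg_bytes` is a hypothetical helper, not an SDK function.
    """
    # Imported inside the function so the helper can be defined
    # even where OpenCV is not installed.
    import cv2

    ok, buffer = cv2.imencode(".jpg", image, [cv2.IMWRITE_JPEG_QUALITY, quality])
    if not ok:
        raise ValueError("OpenCV could not encode the image as JPEG")
    return buffer.tobytes()


# Usage, assuming `client` is an initialized StructOCR client and that
# `scan_receipt` is the method that accepts raw bytes:
#   image = cv2.imread("./dataset/raw_receipts/coffee_shop_01.jpg")
#   result = client.scan_receipt(encode_jpeg_bytes(image))
```

Lowering `quality` shrinks the payload, which can help keep large photos under the 4.5MB input limit.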
Does this SDK support asynchronous batch processing for historical data?
While the base SDK methods are synchronous, they are completely thread-safe. For processing large S3 buckets of historical receipts, we recommend using Python's `concurrent.futures.ThreadPoolExecutor` to send hundreds of concurrent requests.
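The thread-pool approach can be sketched as follows. The `scan_batch` helper is an assumption for illustration, not part of the SDK. It takes the scan function as an argument so it works with any callable, and it records exceptions per file instead of aborting the whole batch.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scan_batch(scan, image_paths, max_workers=8):
    """Run scan(path) concurrently for each path.

    Returns a dict mapping each path to its result, or to the exception
    raised while scanning it. `scan_batch` is a hypothetical helper.
    """
    outcomes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the path it was submitted for.
        futures = {pool.submit(scan, path): path for path in image_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                outcomes[path] = future.result()
            except Exception as exc:
                # Keep going; one bad image should not fail the batch.
                outcomes[path] = exc
    return outcomes


# Usage, assuming `client` is an initialized StructOCR client:
#   results = scan_batch(client.scan_receipt, ["a.jpg", "b.jpg"], max_workers=16)
```

Tune `max_workers` to your plan's rate limits rather than raising it blindly; failed paths can be collected from the returned dictionary and retried.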
Will it extract handwritten tips correctly?
Yes. The underlying AI models have been extensively trained on hospitality data to recognize the semantic relationships between the printed subtotal, tax lines, and handwritten gratuity, outputting the correct total amount paid.