Extraction Templates
Templates define how Extract interprets your documents. Each template combines prompts that guide the vision model with optional schemas that declare the available context variables and validate the output.
Template Structure
A template consists of:
| Field | Required | Description |
|---|---|---|
| id | Yes | Unique identifier (e.g., my_invoice_template) |
| system_prompt | Yes | Instructions for model behavior |
| user_prompt | Yes | What to extract, with {placeholders} |
| context_schema | No | Defines available placeholder variables |
| output_schema | No | JSON schema for validating extracted data |
| vision_model | No | Override the default vision model |
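As a rough illustration of the table above, the following sketch checks a template dictionary for missing required fields and for unrecognized keys before it is sent anywhere. The helper names `missing_required_fields` and `unknown_fields` are hypothetical and not part of the Extract API; only the field names come from the table.

```python
# Field names taken from the table above; the helpers themselves are illustrative.
REQUIRED_FIELDS = ("id", "system_prompt", "user_prompt")
OPTIONAL_FIELDS = ("context_schema", "output_schema", "vision_model")


def missing_required_fields(template):
    """Return the required fields that are absent or empty in `template`."""
    return [name for name in REQUIRED_FIELDS if not template.get(name)]


def unknown_fields(template):
    """Return keys that are not listed in the table (often typos)."""
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    return sorted(key for key in template if key not in known)


draft = {
    "id": "my_invoice_template",
    "user_prompt": "Extract the invoice number for {company_name}.",
    "ouput_schema": {"type": "object"},  # deliberate typo for the demo
}

print(missing_required_fields(draft))  # ['system_prompt']
print(unknown_fields(draft))           # ['ouput_schema']
```

A local check like this only catches structural slips early; the server remains the authority on what a valid template is.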
Using Default Templates
Three templates are available out of the box:
detailed_invoice
Extracts comprehensive invoice data including line items:
import requests

# Open the file in a with-block so the handle is closed after the upload.
with open("invoice.pdf", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/extract/process",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "template_id": "detailed_invoice",
            "context": '{"company_name": "Your Company"}'
        }
    )
Output includes:
- Service provider details (name, address, tax ID)
- Buyer information
- Invoice metadata (number, date, due date)
- Line items with descriptions, quantities, prices
- Tax breakdown and totals
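The `context` value in the detailed_invoice request above is a JSON-encoded string sent as a form field. A minimal sketch of building that form data with `json.dumps` instead of writing the string by hand, which avoids quoting mistakes when a value itself contains quotes. The helper name `build_form_data` is an assumption made for illustration.

```python
import json


def build_form_data(template_id, context=None):
    """Build the `data` form fields for a process request.

    `context` is serialized to a JSON string, matching the hand-written
    string used in the example above.
    """
    data = {"template_id": template_id}
    if context:
        data["context"] = json.dumps(context)
    return data


data = build_form_data("detailed_invoice", {"company_name": 'Your "Best" Company'})
print(data["context"])  # {"company_name": "Your \"Best\" Company"}
```

The resulting dictionary can be passed as the `data=` argument of `requests.post` in place of the literal one.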
simple_receipt
Parses basic retail receipts:
with open("receipt.jpg", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/extract/process",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={"template_id": "simple_receipt"}
    )
Output includes:
- Store name and location
- Transaction date and time
- List of items with prices
- Subtotal, tax, and total
expense_report
Classifies expenses for reporting:
with open("expense.png", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/extract/process",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={"template_id": "expense_report"}
    )
Output includes:
- Expense category
- Amount and currency
- Date
- Vendor name
- Description
Creating Custom Templates
Create templates for your specific document types:
import requests

template = {
    "id": "purchase_order",
    "system_prompt": """You are a document extraction specialist.
Extract data accurately from purchase orders.
Return valid JSON matching the requested structure.
If a field is not visible, use null.""",
    "user_prompt": """Extract the following from this purchase order for {company_name}:
- PO number
- Vendor name and address
- Order date
- Delivery date
- Line items (part number, description, quantity, unit price)
- Shipping terms
- Total amount""",
    # Declares the {company_name} placeholder used in user_prompt.
    "context_schema": {
        "company_name": {
            "type": "string",
            "description": "Name of the ordering company"
        }
    },
    # Extracted data is validated against this JSON schema.
    "output_schema": {
        "type": "object",
        "properties": {
            "po_number": {"type": "string"},
            "vendor": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "address": {"type": "string"}
                }
            },
            "order_date": {"type": "string", "format": "date"},
            "delivery_date": {"type": "string", "format": "date"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "part_number": {"type": "string"},
                        "description": {"type": "string"},
                        "quantity": {"type": "number"},
                        "unit_price": {"type": "number"}
                    }
                }
            },
            "shipping_terms": {"type": "string"},
            "total_amount": {"type": "number"}
        }
    }
}

response = requests.post(
    "http://localhost/api/v1/extract/templates",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json=template
)
response.raise_for_status()  # stop here if the request was rejected
print(f"Created template: {response.json()['id']}")
Template Placeholders
Use {placeholder} syntax in your user prompt to inject context at processing time:
# Template with placeholders
# (system_prompt is required when creating a template; omitted here for brevity)
template = {
    "id": "contract_review",
    "user_prompt": """Review this contract between {party_a} and {party_b}.
Extract:
- Effective date
- Term length
- Key obligations for {party_a}
- Payment terms
- Termination conditions""",
    "context_schema": {
        "party_a": {"type": "string", "description": "First party name"},
        "party_b": {"type": "string", "description": "Second party name"}
    }
}

# Processing with context
with open("contract.pdf", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/extract/process",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={
            "template_id": "contract_review",
            "context": '{"party_a": "Acme Corp", "party_b": "Widget Inc"}'
        }
    )
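How Extract substitutes placeholders internally is not described here, but the behaviour can be approximated locally with Python's `string.Formatter`. The sketch below lists the placeholders in a prompt, refuses to render when the context lacks one of them, and otherwise fills them in. `find_placeholders` and `render_prompt` are hypothetical helpers, not part of the API.

```python
from string import Formatter


def find_placeholders(prompt):
    """Return placeholder names in order of first appearance, without duplicates."""
    names = []
    for _, field_name, _, _ in Formatter().parse(prompt):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


def render_prompt(prompt, context):
    """Fill placeholders from `context`; raise KeyError if a value is missing."""
    missing = [name for name in find_placeholders(prompt) if name not in context]
    if missing:
        raise KeyError(f"Missing context values: {', '.join(missing)}")
    return prompt.format(**context)


prompt = (
    "Review this contract between {party_a} and {party_b}.\n"
    "Key obligations for {party_a}"
)
context = {"party_a": "Acme Corp", "party_b": "Widget Inc"}

print(find_placeholders(prompt))  # ['party_a', 'party_b']
print(render_prompt(prompt, context))
```

Running a check like this before calling the process endpoint makes a forgotten context value fail fast on the client, rather than reaching the model as a literal `{party_b}`.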
Managing Templates
List All Templates
response = requests.get(
    "http://localhost/api/v1/extract/templates",
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
for template in response.json()["templates"]:
    print(f"{template['id']}: {template['system_prompt'][:50]}...")
Get Template Details
response = requests.get(
    "http://localhost/api/v1/extract/templates/detailed_invoice",
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
template = response.json()
print(f"System prompt: {template['system_prompt']}")
print(f"User prompt: {template['user_prompt']}")
Update Template
response = requests.put(
    "http://localhost/api/v1/extract/templates/my_template",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "user_prompt": "Updated extraction instructions...",
        "output_schema": {"type": "object", "properties": {...}}
    }
)
Delete Template
response = requests.delete(
    "http://localhost/api/v1/extract/templates/my_template",
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
Reset Default Templates
Restore the built-in templates to their original state:
response = requests.post(
    "http://localhost/api/v1/extract/templates/reset-defaults",
    headers={"Authorization": "Bearer YOUR_API_KEY"}
)
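The management endpoints above all share one base path. As a compact summary, this sketch maps each operation to its HTTP method and URL. The function name `template_endpoint` and the action labels are illustrative choices; the methods and paths are the ones used in the examples on this page.

```python
BASE_URL = "http://localhost/api/v1/extract/templates"

# action -> (HTTP method, whether a template id is appended to the path)
ACTIONS = {
    "list": ("GET", False),
    "create": ("POST", False),
    "get": ("GET", True),
    "update": ("PUT", True),
    "delete": ("DELETE", True),
    "reset_defaults": ("POST", False),
}


def template_endpoint(action, template_id=None):
    """Return (method, url) for a template management operation."""
    method, needs_id = ACTIONS[action]
    if needs_id:
        if not template_id:
            raise ValueError(f"'{action}' requires a template_id")
        return method, f"{BASE_URL}/{template_id}"
    if action == "reset_defaults":
        return method, f"{BASE_URL}/reset-defaults"
    return method, BASE_URL


print(template_endpoint("update", "my_template"))
# ('PUT', 'http://localhost/api/v1/extract/templates/my_template')
```

The returned pair can be handed to `requests.request(method, url, headers=...)`, which keeps the URL construction in one place.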
Template Wizard
Generate a template from a sample document:
with open("sample_invoice.pdf", "rb") as f:
    response = requests.post(
        "http://localhost/api/v1/extract/templates/wizard",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        files={"file": f},
        data={"description": "Monthly vendor invoice with line items"}
    )
suggested_template = response.json()
print(f"Suggested template: {suggested_template}")
The wizard analyzes your document and suggests prompts and schemas based on its structure.
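The exact shape of the wizard response is not specified in this section. Assuming it returns a dictionary whose keys mirror the template fields, a suggestion could be turned into a template ready for creation roughly as follows. The helper `finalize_suggestion` and the sample response are hypothetical.

```python
# Template fields as listed under Template Structure.
TEMPLATE_FIELDS = (
    "id", "system_prompt", "user_prompt",
    "context_schema", "output_schema", "vision_model",
)


def finalize_suggestion(suggested, template_id):
    """Keep only known template fields and assign the chosen id."""
    template = {key: suggested[key] for key in TEMPLATE_FIELDS if key in suggested}
    template["id"] = template_id
    return template


# Hypothetical wizard output, for illustration only.
suggested = {
    "system_prompt": "You are a document extraction specialist.",
    "user_prompt": "Extract the vendor name and total amount.",
    "confidence": 0.9,  # made-up extra key that is not a template field
}

template = finalize_suggestion(suggested, "vendor_invoice")
print(sorted(template))  # ['id', 'system_prompt', 'user_prompt']
```

Review the suggested prompts before saving; the result can then be posted to the templates endpoint in the same way as in Creating Custom Templates.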
Best Practices
Writing System Prompts
Good:
You are a document extraction specialist. Extract data accurately and completely.
Return valid JSON. Use null for missing fields. Do not hallucinate data.
Bad:
Extract stuff from the document.
Writing User Prompts
Good:
Extract the following fields from this invoice:
- Invoice number (top right corner, format: INV-XXXXX)
- Vendor name and full address
- Each line item with: description, quantity, unit price, line total
- Tax amount and rate
- Total amount due
Bad:
Get the invoice data.
Output Schemas
Define schemas to catch extraction errors early:
"output_schema": {
    "type": "object",
    "required": ["invoice_number", "total_amount"],
    "properties": {
        "invoice_number": {"type": "string", "pattern": "^INV-\\d{5}$"},
        "total_amount": {"type": "number", "minimum": 0},
        "line_items": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["description", "amount"],
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "integer", "minimum": 1},
                    "unit_price": {"type": "number"},
                    "amount": {"type": "number"}
                }
            }
        }
    }
}
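To show how a schema like this catches errors, here is a deliberately small validator that covers only the keywords used above (`type`, `required`, `properties`, `items`, `minItems`, `minimum`, `pattern`). It was written for this page as an illustration and is not the validator Extract uses; a full JSON Schema library is the practical choice for real checks.

```python
import re

TYPE_CHECKS = {
    "object": lambda v: isinstance(v, dict),
    "array": lambda v: isinstance(v, list),
    "string": lambda v: isinstance(v, str),
    # bool is a subclass of int in Python, so exclude it explicitly.
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
}


def validate(value, schema, path="$"):
    """Return a list of human-readable error strings (empty if valid)."""
    expected = schema.get("type")
    if expected and not TYPE_CHECKS[expected](value):
        return [f"{path}: expected {expected}"]

    errors = []
    if isinstance(value, dict):
        for name in schema.get("required", []):
            if name not in value:
                errors.append(f"{path}.{name}: required field is missing")
        for name, subschema in schema.get("properties", {}).items():
            if name in value:
                errors += validate(value[name], subschema, f"{path}.{name}")
    elif isinstance(value, list):
        if len(value) < schema.get("minItems", 0):
            errors.append(f"{path}: fewer than {schema['minItems']} items")
        for index, item in enumerate(value):
            errors += validate(item, schema.get("items", {}), f"{path}[{index}]")
    elif isinstance(value, str):
        if "pattern" in schema and not re.search(schema["pattern"], value):
            errors.append(f"{path}: does not match {schema['pattern']}")
    elif isinstance(value, (int, float)):
        if "minimum" in schema and value < schema["minimum"]:
            errors.append(f"{path}: below minimum {schema['minimum']}")
    return errors


schema = {
    "type": "object",
    "required": ["invoice_number", "total_amount"],
    "properties": {
        "invoice_number": {"type": "string", "pattern": r"^INV-\d{5}$"},
        "total_amount": {"type": "number", "minimum": 0},
        "line_items": {
            "type": "array",
            "minItems": 1,
            "items": {
                "type": "object",
                "required": ["description", "amount"],
                "properties": {"quantity": {"type": "integer", "minimum": 1}},
            },
        },
    },
}

# Made-up extraction result with several problems.
extracted = {
    "invoice_number": "INV-123",
    "line_items": [{"description": "Widget", "quantity": 0}],
}

for error in validate(extracted, schema):
    print(error)
```

For this sample the sketch reports four problems: the missing `total_amount`, the malformed invoice number, the missing `amount` on the first line item, and its zero quantity.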
Next Steps
- Processing - Learn the full document processing workflow
- Examples - See complete integration examples