Multimodal — vision, audio & documents
Text-only LLMs are enough for chat—but real products ingest screenshots, PDFs, scans, audio, and video frames. Multimodal models let you ask questions about images and documents; specialized pipelines extract text, tables, and structure before RAG. This guide covers when to call vision APIs directly vs build a document pipeline, and how to ship both without blowing cost or latency budgets.
After reading, you should be able to: send images to GPT-4o, Claude, or Gemini with correct encoding and detail settings; choose PyMuPDF vs OCR vs layout parsers for PDFs; compare Textract vs Tesseract for your doc types; extract tables and structured fields; wire Whisper/TTS for audio; and design async production pipelines with PII-aware storage.
Vision APIs — GPT-4o, Claude, Gemini
Frontier chat models accept images alongside text in the same messages array. A vision encoder converts pixels into tokens the language model can reason over, and you describe the task in natural language. OpenAI has no separate “vision API” endpoint: image input is part of chat completions.
Supported input formats
Providers accept images as:
- Base64 data URLs — embed in JSON; simplest for uploads from your backend; increases payload size ~33% vs raw bytes.
- Public HTTPS URLs — OpenAI and Anthropic fetch the image server-side; requires stable, reachable URLs; good when images already live in S3/GCS with presigned links.
- File API / uploads — OpenAI Files API and Gemini File API for larger or repeated use; amortizes upload cost across multiple prompts.
When to use base64: user uploads hit your API directly; short-lived content; no public CDN. When to use URL: images already in object storage; you want smaller request bodies; provider-side caching may help on retries.
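As a rough illustration of the base64 option and its size overhead, the sketch below builds a data URL from raw bytes. The helper name `to_data_url` and the 3,000-byte stand-in payload are hypothetical; only the standard library is used.

```python
import base64
import mimetypes

def to_data_url(data: bytes, filename: str) -> str:
    """Encode raw image bytes as a base64 data URL, inferring the MIME type from the name."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognized image type: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"

raw = bytes(3000)  # stand-in for real image bytes
url = to_data_url(raw, "screenshot.png")
overhead = len(url) / len(raw)  # roughly 4/3, plus the short "data:...;base64," prefix
print(f"{len(raw)} raw bytes -> {len(url)} characters ({overhead:.2f}x)")
```

The ratio is why multi-megabyte uploads are usually better passed by URL than embedded in the request body.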
OpenAI detail levels (low vs high)
GPT-4o scales images to fixed token budgets based on detail. low uses a fixed ~85-token thumbnail—fast and cheap for classification. high tiles large images (512px patches) and can consume hundreds to thousands of image tokens on dense screenshots or multi-column docs.
| Setting | Token behavior (approx.) | Best for | Avoid when |
|---|---|---|---|
| low | Fixed ~85 image tokens regardless of source resolution | Binary classification, “is this a receipt?”, coarse moderation, icon/UI category | Small text in screenshots, dense tables, fine chart reading |
| high | Scales with image dimensions; large screenshots → 1K–2K+ image tokens | Document Q&A on a page image, reading axis labels on charts, form field extraction | High-volume thumbnail tagging at scale without budget |
| auto (default) | Model/provider chooses based on image size | General-purpose apps where you have not profiled cost yet | Cost-sensitive batch jobs—pin explicitly after measurement |
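To make the low vs high difference concrete, here is a minimal estimator. It assumes the tiling rule OpenAI has published for GPT-4o at the time of writing (fit within 2048×2048, downscale so the shortest side is at most 768px, then 85 base tokens plus 170 per 512px tile). The function name is hypothetical and the constants are assumptions that can change per model, so confirm against `usage.prompt_tokens` in real responses.

```python
import math

def gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image tokens under the assumed GPT-4o tiling rule."""
    if detail == "low":
        return 85  # fixed cost regardless of resolution
    # Step 1: fit the image inside a 2048 x 2048 square (downscale only).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Step 2: downscale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: count 512px tiles; each tile costs 170 tokens on top of an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(gpt4o_image_tokens(1920, 1080, "high"))  # full-HD screenshot
print(gpt4o_image_tokens(1920, 1080, "low"))
```

Under these assumptions a 1920×1080 screenshot lands on six tiles at high detail, which is where the "over 1,000 image tokens" figure later in this guide comes from.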
Claude and Gemini differences
- Claude 3.5+ — up to 20 images per request (check current limits); strong on diagrams and handwriting; image blocks in Messages API with base64 or URL.
- Gemini 1.5 — native multimodal with very long context; can ingest many images or mixed PDF/video in one session; pricing per image token differs from OpenAI—benchmark on your doc set.
- Bedrock — Claude and Amazon Nova models expose the same patterns via invoke_model; IAM and VPC endpoints for regulated workloads.
Cost vs accuracy trade-offs
Vision is priced as image tokens (input) plus text tokens. A support bot that attaches full-resolution Retina screenshots at high detail on every ticket can cost 10× a pipeline that resizes to 1024px max edge and uses low for triage, escalating only hard cases.
Common use cases
- Document understanding — photo of a signed contract page, ask “what is the termination clause?”
- Screenshots & UI — “what error is shown?” from a user-submitted bug report image
- Charts & dashboards — describe trends; verify numbers against extracted CSV when stakes are high
- Moderation — policy violation detection on user-uploaded images (combine with dedicated moderation APIs for scale)
- Accessibility — alt-text generation for images in CMS workflows
Vision API call
Resize large uploads before encode; log image token usage from the response. Use presigned URLs in production instead of shipping multi-MB base64 through your API gateway.
# OpenAI: image sent as a base64 data URL in a chat completion
import base64
from pathlib import Path
from openai import OpenAI
client = OpenAI()
image_path = Path("screenshot.png")
b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")
response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What error message is visible? Reply in one sentence."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{b64}",
                    "detail": "high",  # use "low" for cheap triage
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
print("prompt tokens:", response.usage.prompt_tokens)
# Anthropic: image block with a base64 source in the Messages API
import base64
from pathlib import Path
import anthropic
client = anthropic.Anthropic()
media_type = "image/png"
data = base64.standard_b64encode(Path("screenshot.png").read_bytes()).decode()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}},
            {"type": "text", "text": "What error message is visible? One sentence."},
        ],
    }],
)
print(message.content[0].text)
# Bedrock: same Anthropic message shape, sent through invoke_model
import base64
import json
from pathlib import Path
import boto3
bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
b64 = base64.b64encode(Path("screenshot.png").read_bytes()).decode()
body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
            {"type": "text", "text": "What error message is visible? One sentence."},
        ],
    }],
}
resp = bedrock.invoke_model(modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=json.dumps(body))
print(json.loads(resp["body"].read())["content"][0]["text"])
// Spring AI: UserMessage with Media (Resource or byte[])
Resource image = new ClassPathResource("screenshot.png");
Media media = new Media(MimeTypeUtils.IMAGE_PNG, image);
UserMessage userMessage = UserMessage.builder()
    .text("What error message is visible? One sentence.")
    .media(media)
    .build();
ChatResponse response = chatClient.prompt(new Prompt(userMessage))
    .options(OpenAiChatOptions.builder().withModel("gpt-4o").withMaxTokens(512).build())
    .call()
    .chatResponse();
System.out.println(response.getResult().getOutput().getContent());
// Anthropic from Java (builder and option names vary by library version)
AnthropicChatModel model = AnthropicChatModel.builder()
    .modelName("claude-3-5-sonnet-20241022")
    .maxTokens(512)
    .build();
byte[] png = Files.readAllBytes(Path.of("screenshot.png"));
Media media = new Media(MimeTypeUtils.IMAGE_PNG, png);
UserMessage msg = UserMessage.builder()
    .text("What error message is visible? One sentence.")
    .media(media)
    .build();
String answer = model.call(new Prompt(msg)).getResult().getOutput().getContent();
// Bedrock from Java (Claude on Bedrock)
var chat = new BedrockAnthropicChatModel(
    BedrockAnthropicChatOptions.builder()
        .withModel("anthropic.claude-3-5-sonnet-20241022-v2:0")
        .withMaxTokens(512)
        .build());
byte[] png = Files.readAllBytes(Path.of("screenshot.png"));
Media media = new Media(MimeTypeUtils.IMAGE_PNG, png);
String text = chat.call(new Prompt(
    UserMessage.builder().text("What error message is visible?").media(media).build()
)).getResult().getOutput().getContent();
Image tokens bill at the model’s input rate. A 1920×1080 screenshot at high detail on GPT-4o can exceed 1,000 image tokens—often more expensive than the text answer. Measure per doc type: monthly_vision_cost ≈ avg_image_tokens × price_in × images_per_day × 30.
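The monthly estimate above is simple enough to keep as a helper next to your usage dashboards. This is a sketch only: the function name is hypothetical, and the example inputs (1,105 tokens per image, $2.50 per million input tokens, 2,000 images per day) are illustrative assumptions rather than quoted prices.

```python
def monthly_vision_cost(avg_image_tokens: float, price_per_million_usd: float,
                        images_per_day: int, days: int = 30) -> float:
    """monthly cost ≈ avg_image_tokens × price_in × images_per_day × days."""
    price_per_token = price_per_million_usd / 1_000_000
    return avg_image_tokens * price_per_token * images_per_day * days

# Illustrative inputs only; substitute measured token counts and your provider's current rate.
cost = monthly_vision_cost(avg_image_tokens=1105, price_per_million_usd=2.50, images_per_day=2000)
print(f"${cost:,.2f} per month")
```

Re-run it per document type, since average image tokens differ widely between thumbnails and dense page renders.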
Vision-only Q&A on full PDF page renders is simple to ship but expensive at volume. A hybrid approach extracts text with PyMuPDF first and sends page images to a vision model only when the text layer is missing or unreliable, or when the layout is complex (see Document pipelines).
A common mistake is sending 4K phone photos without downscaling. Resize to at most 1024–1568px on the long edge before encoding: vision models rarely need full sensor resolution, and you pay for pixels.
Document pipelines — PDF text extraction
Most enterprise knowledge is PDFs: invoices, policies, filings, lab reports. A production pipeline extracts text and structure before embedding or LLM Q&A—not every PDF should go straight to a vision model.
Digital-born vs scanned PDFs
- Digital-born — text is selectable in Acrobat; fonts and vector graphics embedded. PyMuPDF or pdfplumber extract clean Unicode with coordinates.
- Scanned — pages are bitmap images inside PDF wrappers; no text layer. You need OCR (Tesseract, Textract, Document Intelligence) or vision per page.
- Mixed — common in legacy archives: some pages digital, some scans. Detect per page: if page.get_text() returns empty or garbage, route to OCR.
PyMuPDF (fitz) vs pdfplumber
| Library | Strengths | Weaknesses |
|---|---|---|
| PyMuPDF | Very fast; renders pages to PNG; text blocks with bbox; good for bulk ingest | Table detection is heuristic; less Pythonic API |
| pdfplumber | Table detection via lines/edges; detailed char-level geometry | Slower on large corpora; high memory use on huge files |
When OCR is needed
Run OCR when:
- Extracted text length is near zero but page has visual content
- Character entropy looks wrong (lots of □ or random symbols)
- Document is known scan-only (fax, photographed forms)
- You need handwriting or stamped signatures read
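The first two signals in the list above can be checked cheaply before any OCR call. The sketch below is one possible heuristic, not a standard: the function names, the 40-character minimum, and the 20% threshold are assumptions to tune on your own documents.

```python
import unicodedata

def looks_garbled(text: str, max_bad_ratio: float = 0.2) -> bool:
    """True when too many characters are replacement glyphs, boxes, or control/private-use codes."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True
    bad = sum(
        1 for c in chars
        if c in "\ufffd\u25a1" or unicodedata.category(c) in ("Cc", "Co", "Cn")
    )
    return bad / len(chars) > max_bad_ratio

def needs_ocr(text: str, min_chars: int = 40) -> bool:
    """Route to OCR when the text layer is nearly empty or looks like extraction garbage."""
    stripped = text.strip()
    return len(stripped) < min_chars or looks_garbled(stripped)

print(needs_ocr("This agreement is made between the parties on 1 May 2024."))
print(needs_ocr("\u25a1" * 60))
```

Known scan-only sources and handwriting still need an explicit routing rule, because their text layer can be absent for reasons this check cannot see.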
Layout-aware parsing
Plain text dump loses columns, headers, footers, and reading order. Unstructured partitions PDFs into elements (Title, NarrativeText, Table, Image) with metadata for chunking. LlamaParse (LlamaIndex) is a hosted parser optimized for RAG—handles complex tables and slides; adds vendor dependency and per-page cost.
Pattern: partition → clean → chunk → embed (Track 2). Store page number and bbox in chunk metadata so citations link back to source PDF pages.
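A minimal sketch of the clean → chunk step that keeps the page number on every chunk. It assumes pages arrive as dicts with `page` and `text` keys and uses fixed character windows for brevity; a real pipeline would more likely split on layout elements and also carry the bbox.

```python
def chunk_pages(pages: list[dict], max_chars: int = 1000) -> list[dict]:
    """Collapse whitespace, then split each page into chunks tagged with their source page."""
    chunks = []
    for page in pages:
        text = " ".join(page["text"].split())  # clean: normalize runs of whitespace
        for idx, start in enumerate(range(0, len(text), max_chars)):
            chunks.append({
                "page": page["page"],        # lets citations link back to the PDF page
                "chunk_index": idx,
                "text": text[start:start + max_chars],
            })
    return chunks

pages = [{"page": 1, "text": "a" * 2500}, {"page": 2, "text": "  hello   world "}]
chunks = chunk_pages(pages)
print([(c["page"], c["chunk_index"], len(c["text"])) for c in chunks])
```

Each chunk dict can be passed to the embedding step as-is, with `page` and `chunk_index` stored as vector metadata.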
PDF pipeline snippet
import fitz  # PyMuPDF
def extract_pdf_pages(path: str, min_chars: int = 40) -> list[dict]:
    """Extract text per page and flag pages whose text layer is too short to trust."""
    pages = []
    with fitz.open(path) as doc:  # close the file handle when done
        for i, page in enumerate(doc):
            text = page.get_text("text").strip()
            pages.append({
                "page": i + 1,
                "text": text,
                "needs_ocr": len(text) < min_chars,
                "width": page.rect.width,
                "height": page.rect.height,
            })
    return pages
# Optional: render an OCR candidate to PNG for Textract or vision
def page_to_png(doc: fitz.Document, page_index: int, dpi: int = 150) -> bytes:
    page = doc[page_index]
    pix = page.get_pixmap(dpi=dpi)
    return pix.tobytes("png")
// Apache PDFBox: extract text per page with an OCR routing flag
try (PDDocument doc = Loader.loadPDF(new File("contract.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    int minChars = 40;
    for (int i = 0; i < doc.getNumberOfPages(); i++) {
        stripper.setStartPage(i + 1);
        stripper.setEndPage(i + 1);
        String text = stripper.getText(doc).trim();
        boolean needsOcr = text.length() < minChars;
        // enqueue PageJob(page=i+1, text, needsOcr) to queue
    }
}
PDF “text” is drawing instructions (Tj, TJ operators), not a simple string table. Extractors reconstruct reading order from glyph positions—multi-column layouts confuse naive order. Layout parsers use ML or heuristics to reorder blocks before chunking.
Cache extracted text keyed by sha256(file) + parser_version. Re-parsing the same 200-page policy on every user question burns CPU and adds seconds of latency.
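A sketch of that cache key, assuming an in-memory dict stands in for whatever store you use (Redis, object storage, a database table). The names `cache_key` and `extract_cached` are hypothetical.

```python
import hashlib
from typing import Callable

_cache: dict[str, str] = {}  # stand-in for a durable cache

def cache_key(file_bytes: bytes, parser_version: str) -> str:
    """Key on file content plus parser version so a parser upgrade invalidates old entries."""
    return f"{hashlib.sha256(file_bytes).hexdigest()}:{parser_version}"

def extract_cached(file_bytes: bytes, parser_version: str,
                   extract: Callable[[bytes], str]) -> str:
    key = cache_key(file_bytes, parser_version)
    if key not in _cache:
        _cache[key] = extract(file_bytes)  # only parse on a cache miss
    return _cache[key]

calls = []
def fake_extract(data: bytes) -> str:  # hypothetical parser used for the demo
    calls.append(1)
    return f"{len(data)} bytes parsed"

extract_cached(b"policy pdf bytes", "v1", fake_extract)
extract_cached(b"policy pdf bytes", "v1", fake_extract)  # served from cache
extract_cached(b"policy pdf bytes", "v2", fake_extract)  # new parser version re-parses
print(len(calls))
```

Hashing content rather than filename means a re-uploaded copy of the same file also hits the cache.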
Legal and insurance RAG systems often split pipelines: born-digital contracts via PyMuPDF + Unstructured; mailroom scans via Textract with human QA on extraction confidence below 0.85. One parser rarely fits all doc classes.
OCR — Tesseract, Textract, Document Intelligence
OCR turns pixel text into Unicode. Open-source Tesseract is free but ops-heavy; cloud OCR adds forms, tables, and confidence scores—choose based on document mix and compliance, not benchmark headlines.
Provider comparison
| Provider | Deployment | Strengths | Limitations | Typical cost model |
|---|---|---|---|---|
| Tesseract | Self-hosted, Docker, on-device | No per-page cloud fee; offline; 100+ languages | Weak on complex layouts/forms; needs preprocessing (deskew, binarize); you own tuning | Compute infra only (CPU-bound) |
| AWS Textract | AWS API, VPC endpoints | Forms (key-value), tables, queries; AnalyzeDocument API; IAM integration | AWS lock-in; async APIs required for multi-page documents; handwriting accuracy varies | Per page; higher for Forms/Tables API |
| Azure Document Intelligence | Azure API, custom models | Prebuilt models (invoice, receipt, ID); train custom neural models; layout markdown output | Azure ecosystem; model training cycle for niche docs | Per page; custom model hosting extra |
| Google Document AI | GCP | Specialized processors (W-2, utility bills); strong enterprise doc types | Processor-per-doc-type mental model | Per page / processor |
Preprocessing matters (especially Tesseract)
- Deskew rotated scans
- Convert to grayscale; adaptive threshold for faded fax
- Remove speckle noise; crop margins
- 300 DPI minimum for small fonts; 150 DPI often enough for body text
Routing strategy
- Tier 1: PyMuPDF text extract (free, instant).
- Tier 2: Tesseract on the rendered page PNG if internal / high volume / data residency.
- Tier 3: Textract or Document Intelligence for forms, tables, or low-confidence Tesseract output.
- Tier 4: Vision LLM on the page image for one-off complex layouts or when OCR + structure still fails eval.
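The tiers can be expressed as a small routing function. This is a sketch under stated assumptions: the `PageSignals` fields, the tier labels, and the 0.85 confidence cutoff are hypothetical, and the order of the checks is one reasonable reading of the tiers rather than a fixed rule.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PageSignals:
    text_chars: int                               # length of the extracted text layer
    has_forms_or_tables: bool = False
    tesseract_confidence: Optional[float] = None  # None means Tesseract has not run yet
    cloud_ocr_failed_eval: bool = False

def choose_tier(p: PageSignals, min_chars: int = 40, min_confidence: float = 0.85) -> str:
    if p.cloud_ocr_failed_eval:
        return "vision_llm"        # Tier 4: last resort after OCR + structure failed
    if p.text_chars >= min_chars:
        return "text_layer"        # Tier 1: usable text layer, no OCR needed
    if p.has_forms_or_tables:
        return "cloud_ocr"         # Tier 3: forms and tables go to Textract-style services
    if p.tesseract_confidence is None:
        return "tesseract"         # Tier 2: try self-hosted OCR first
    if p.tesseract_confidence < min_confidence:
        return "cloud_ocr"         # Tier 3: escalate low-confidence Tesseract output
    return "tesseract"

print(choose_tier(PageSignals(text_chars=0, tesseract_confidence=0.6)))
```

Keeping the decision in one pure function makes it easy to unit test and to log which tier each page took.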
“How would you ingest 10M scanned PDFs for search?” — Do not say “GPT-4o on every page.” Answer: classify digital vs scan; async OCR queue; store text + confidence; embed chunks; vision only on failure bucket; cost and throughput numbers (pages/hour per Textract vs self-hosted Tesseract fleet).
Cloud OCR simplifies forms and tables but sends document content to a third party. For HIPAA/PCI workloads, use VPC endpoints, BAA-covered regions, or on-prem Tesseract with strict network boundaries—and accept lower accuracy on hard layouts.
Table extraction — camelot, tabula, structured output
Tables in PDFs break naive text extraction—columns collapse into unreadable lines. Dedicated table extractors use line detection (lattice) or whitespace alignment (stream) to rebuild rows and cells.
camelot (Python)
- Lattice — bordered tables with visible grid lines; high precision when lines exist.
- Stream — borderless tables aligned by whitespace; common in financial reports.
- Outputs pandas DataFrame; export CSV/JSON for downstream LLM or SQL load.
tabula-java (JVM library; tabula-py wraps it for Python)
JVM-friendly table extraction from PDFs; a good fit for Spring services that already use PDFBox. It offers the same lattice vs stream modes and integrates with batch ETL pipelines.
When to skip local extractors
Merged cells, nested headers, and tables split across pages frustrate rule-based tools. Use Textract AnalyzeDocument TABLES feature, Document Intelligence layout model, or render page → vision LLM with JSON schema for small-volume high-value docs.
Structured output from tables
After extraction:
- Normalize headers (snake_case, dedupe)
- Type coercion (dates, currency stripping $ and commas)
- Validate row counts vs source; flag empty cells
- Pass compact JSON to LLM for summarization—not raw 500-row CSV in prompt
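The first two normalization steps can be sketched with the standard library alone. The helper names are hypothetical, and the currency parser only handles the simple "$" and thousands-comma case named in the list; negative amounts in parentheses or other locales would need more rules.

```python
import re
from decimal import Decimal
from typing import Optional

def snake_case(header: str) -> str:
    """Lowercase and collapse every run of non-alphanumerics into one underscore."""
    return re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")

def dedupe(headers: list[str]) -> list[str]:
    """Suffix repeated header names so dict keys stay unique."""
    seen: dict[str, int] = {}
    out = []
    for h in headers:
        n = seen.get(h, 0)
        out.append(h if n == 0 else f"{h}_{n + 1}")
        seen[h] = n + 1
    return out

def parse_currency(cell: str) -> Optional[Decimal]:
    """Strip $ and commas; return None for empty cells so they can be flagged."""
    cleaned = cell.replace("$", "").replace(",", "").strip()
    return Decimal(cleaned) if cleaned else None

rows = [["Item", "Unit Price ($)", "Unit Price ($)"], ["Widget", "$1,234.50", ""]]
headers = dedupe([snake_case(h) for h in rows[0]])
records = [dict(zip(headers, r)) for r in rows[1:]]
print(headers, parse_currency(records[0]["unit_price"]))
```

Using `Decimal` instead of `float` avoids rounding surprises when totals are checked later.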
import camelot
tables = camelot.read_pdf("report.pdf", pages="3", flavor="lattice")
print(f"found {tables.n} tables on page 3")
df = tables[0].df
records = df.to_dict(orient="records") # feed to validation + LLM summary
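// tabula-java: extract tables from page 3 (1-based) using the lattice-style spreadsheet algorithm.
// Class names follow the tabula-java README; PDFBox 3.x loads with Loader.loadPDF instead of PDDocument.load.
try (PDDocument doc = PDDocument.load(new File("report.pdf"))) {
    ObjectExtractor extractor = new ObjectExtractor(doc);
    Page page = extractor.extract(3);
    List<Table> tables = new SpreadsheetExtractionAlgorithm().extract(page);
    for (Table t : tables) {
        for (List<RectangularTextContainer> row : t.getRows()) {
            // read cell.getText() per cell; map rows → JSON → validate → optional LLM summary
        }
    }
}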
A common mistake is trusting extracted tables without visual QA on a sample. camelot reports an accuracy score for each table (in its parsing_report); reject tables below a threshold and route them to human review or cloud OCR.
Local camelot/tabula: compute only. Textract TABLES: per-page premium over plain OCR—cheaper than sending full page images to GPT-4o at scale if you only need cell values.
Audio — Whisper, streaming ASR, TTS
Speech unlocks meeting notes, IVR analytics, voice agents, and accessibility. ASR (automatic speech recognition) converts audio → text; TTS (text-to-speech) goes the other direction for voice UX.
OpenAI Whisper API
- Transcription — same language output (whisper-1)
- Translation — any supported language → English (translations endpoint)
- Input: mp3, mp4, wav, webm, etc.; 25 MB file limit on API—split long recordings
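- Segment-level timestamps via verbose_json (word-level when requested with timestamp_granularities); Whisper does not label speakers, so diarization is a separate post-processing step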
Self-hosted Whisper
Run faster-whisper or whisper.cpp on GPU for high volume or data residency. Trade API simplicity for batch throughput tuning and model size selection (large-v3 vs distil).
Streaming ASR (WebSocket overview)
Batch Whisper waits for full file upload—unacceptable for live captions. Streaming providers (Deepgram, AssemblyAI, Azure Speech, Google STT) expose WebSocket APIs:
- Client opens WS with auth header
- Send audio chunks (PCM or encoded) as they are captured
- Receive partial transcripts (is_final: false) and finalized segments
- Close stream on utterance end; reconnect with backoff on network blips
Latency budget: partial results every 200–500 ms; final segment after ~700 ms silence (VAD-dependent). Voice agents combine streaming ASR → LLM → streaming TTS in a tight loop (Track 3 agents).
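On the client side, the partial/final protocol above reduces to "overwrite the partial, append the final". The sketch below simulates that with already-received JSON messages instead of a live WebSocket. The `is_final` flag comes from the description above, while the `text` field name is an assumption, since each provider names its payload fields differently.

```python
import json
from typing import Iterable, Iterator

def assemble_transcript(messages: Iterable[str]) -> Iterator[str]:
    """Yield the caption to display after each message: finalized segments plus the live partial."""
    finalized: list[str] = []
    partial = ""
    for raw in messages:
        msg = json.loads(raw)
        if msg["is_final"]:
            finalized.append(msg["text"])  # lock in the finished segment
            partial = ""
        else:
            partial = msg["text"]          # partials replace each other, never accumulate
        yield " ".join(finalized + ([partial] if partial else []))

stream = [
    '{"is_final": false, "text": "hello"}',
    '{"is_final": false, "text": "hello wor"}',
    '{"is_final": true, "text": "hello world"}',
    '{"is_final": false, "text": "how"}',
]
captions = list(assemble_transcript(stream))
print(captions[-1])
```

Only finalized segments should be persisted or sent to the LLM; partials exist to keep the UI responsive.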
Text-to-speech options
| Provider | Quality | Latency | Notes |
|---|---|---|---|
| OpenAI TTS | Natural, limited voices | Low; streaming supported | Simple API; same vendor as GPT |
| ElevenLabs | Very high; voice cloning | Low–medium | Commercial licensing; voice library |
| Amazon Polly | Good; many languages | Low | Neural voices; IAM; good for AWS stacks |
| Azure / Google | Enterprise grade | Low | SSML control; regional compliance |
from openai import OpenAI
client = OpenAI()
with open("meeting.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
    )
print(transcript.text)
# segment-level timestamps: transcript.segments
# Anthropic has no native ASR: route audio to Whisper (OpenAI) or Deepgram first,
# then pass the transcript text to Claude for summarization.
import anthropic
claude = anthropic.Anthropic()  # separate client; the OpenAI client has no messages API
summary = claude.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"Summarize this meeting:\n\n{transcript.text}"}],
)
# Amazon Transcribe (batch) or Transcribe Streaming for ASR on AWS
import boto3
transcribe = boto3.client("transcribe")
# StartTranscriptionJob with S3 URI → poll → fetch transcript JSON
# Then invoke Bedrock Claude on transcript text for action items
// Spring AI OpenAI transcription; class names follow the Spring AI 1.0 milestone API and may differ by version
OpenAiAudioTranscriptionOptions opts = OpenAiAudioTranscriptionOptions.builder()
    .withModel("whisper-1")
    .withResponseFormat(OpenAiAudioApi.TranscriptResponseFormat.VERBOSE_JSON)
    .build();
Resource audio = new FileSystemResource("meeting.mp3");
OpenAiAudioTranscriptionModel whisper = new OpenAiAudioTranscriptionModel(new OpenAiAudioApi(apiKey));
AudioTranscriptionResponse resp = whisper.call(new AudioTranscriptionPrompt(audio, opts));
System.out.println(resp.getResult().getOutput());
// Pipeline: Whisper (OpenAI or self-hosted) → Claude via Spring AI
String transcript = whisperService.transcribe(Path.of("meeting.mp3"));
String summary = chatClient.prompt()
.user("Summarize this meeting:\n\n" + transcript)
.call().content();
// AWS SDK — TranscribeStreamingAsyncClient for live audio
// Batch: TranscribeClient.startTranscriptionJob(...)
// Post-process JSON transcript → BedrockAnthropicChatModel for summarization
For meeting bots, store raw audio in object storage with retention policy; persist transcript + diarization separately. Re-run improved ASR models later without re-fetching Zoom/Teams recordings.
Support call analytics: streaming ASR → PII redaction (SSN, card numbers) → topic classification → supervisor dashboard. TTS used sparingly for outbound IVR; most cost is ASR minutes × concurrent calls.
Video — frame extraction → vision
Video models are emerging, but most production systems today sample frames and run image understanding on each frame—or on keyframes only. Treat full-video LLM ingest as expensive and experimental unless you have a clear ROI.
Typical pipeline
- Ingest — upload to S3/GCS; transcode to standard format (H.264 mp4) if needed
- Extract audio — ffmpeg → Whisper for spoken content (cheaper than watching every frame)
- Sample frames — 1 fps for surveillance-style review; scene-change detection for highlights; max N frames cap
- Vision per frame — classify, OCR on-screen text, detect objects/events
- Aggregate — LLM summarizes frame captions + transcript into timeline
ffmpeg frame extraction
import subprocess
from pathlib import Path
def extract_frames(video: Path, out_dir: Path, fps: float = 1.0) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", str(video),
        "-vf", f"fps={fps}",
        str(out_dir / "frame_%04d.png"),
    ], check=True)
// ProcessBuilder wrapping ffmpeg: same args as Python
ProcessBuilder pb = new ProcessBuilder(
    "ffmpeg", "-i", videoPath,
    "-vf", "fps=1",
    outDir.resolve("frame_%04d.png").toString()
);
pb.inheritIO().start().waitFor();
Cost and latency warnings
- A 10-minute video at 1 fps = 600 frames. At ~85–500 tokens/frame, vision cost explodes.
- Always cap max_frames and downscale to 512–768px before API calls.
- Prefer audio transcript for spoken meetings; use frames only for slide/demo visibility questions.
- Batch frame inference overnight; never block synchronous user requests on full-video analysis.
- Gemini native video may reduce glue code but verify pricing vs frame-sampling approach on your lengths.
Back-of-envelope: 600 frames × 200 image tokens × $2.50/1M input ≈ $0.30 per 10-min video before text output—10× higher at high detail or 4K frames. Scene detection + 20 keyframes often beats 1 fps brute force.
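The same arithmetic as a helper with a frame cap, so the estimate can run before a job is accepted. The function name is hypothetical; the inputs reuse the figures from the back-of-envelope line and are not quoted prices.

```python
from typing import Optional

def video_vision_cost(duration_s: float, fps: float, tokens_per_frame: int,
                      price_per_million_usd: float,
                      max_frames: Optional[int] = None) -> tuple[int, float]:
    """Return (frames sent, estimated input cost in USD) for frame-sampled video."""
    frames = int(duration_s * fps)
    if max_frames is not None:
        frames = min(frames, max_frames)  # hard cap keeps cost bounded for long videos
    cost = frames * tokens_per_frame * price_per_million_usd / 1_000_000
    return frames, cost

brute = video_vision_cost(600, 1.0, 200, 2.50)                    # 10 minutes at 1 fps
keyframes = video_vision_cost(600, 1.0, 200, 2.50, max_frames=20)
print(brute, keyframes)
```

Comparing the two results per video length is a quick way to justify scene detection over brute-force sampling.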
A common mistake is sending every frame to vision without deduplication: consecutive frames differ minimally. Perceptual-hash or SSIM deduplication can remove most redundant API calls on static talking-head video; measure the reduction on your own footage.
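A sketch of the deduplication step, assuming each frame already has a perceptual hash stored as an integer (for example from an average-hash pass, which is not shown). The function names and the distance threshold are hypothetical.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def dedupe_frames(hashes: list[int], max_distance: int = 4) -> list[int]:
    """Return indices of frames that differ enough from the last kept frame."""
    kept: list[int] = []
    last = None
    for i, h in enumerate(hashes):
        if last is None or hamming(h, last) > max_distance:
            kept.append(i)
            last = h  # compare later frames against the most recent kept frame
    return kept

# Toy 8-bit hashes: two near-identical pairs, then a cut back to the first scene.
frame_hashes = [0b00000000, 0b00000001, 0b11111111, 0b11111110, 0b00000000]
kept = dedupe_frames(frame_hashes, max_distance=2)
print(kept)
```

Comparing against the last kept frame, rather than the immediately preceding one, stops slow drift from slipping through as a chain of tiny changes.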
“Design video search for training content.” — Separate speech (ASR index) from visual (sparse keyframes + embeddings); query routes to transcript BM25 + visual similarity; mention cost caps and async indexing lag SLA.
Structured extraction — invoices, forms, contracts
Multimodal shines when output must be machine-readable: line items, parties, dates, clause types. Combine OCR/layout parsers with LLM structured output (JSON schema) and validation—not free-form prose you parse later.
Invoice parsing
- Prebuilt: Azure prebuilt-invoice, Document AI Invoice Parser—fast baseline
- Custom: extract fields with schema → validate totals (sum line items ≈ total)
- Fallback: page image + GPT-4o with response_format JSON schema
Form extraction
Government and healthcare forms have fixed boxes. Textract FORMS returns key-value pairs with geometry; map keys via fuzzy match to your canonical schema. Low-confidence fields → human-in-the-loop queue (Track 5 evals).
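The fuzzy key mapping can be sketched with `difflib` from the standard library. The canonical field names, the 0.8 cutoff, and the `map_key` helper are illustrative assumptions; Textract's own response parsing is left out.

```python
import difflib
from typing import Optional

# Hypothetical canonical schema: normalized form label -> internal field name.
CANONICAL = {
    "date of birth": "date_of_birth",
    "patient name": "patient_name",
    "policy number": "policy_number",
}

def map_key(raw_key: str, cutoff: float = 0.8) -> Optional[str]:
    """Map an OCR'd form label to a canonical field, or None if nothing is close enough."""
    cleaned = raw_key.strip().rstrip(":").lower()
    match = difflib.get_close_matches(cleaned, list(CANONICAL), n=1, cutoff=cutoff)
    return CANONICAL[match[0]] if match else None

print(map_key("Date of Birth:"))  # exact after cleanup
print(map_key("Patlent Name"))    # tolerates a single OCR misread
print(map_key("Signature"))       # no canonical field is close enough
```

Unmapped keys are good candidates for the human review queue, alongside low-confidence values.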
Contract analysis
Multi-page legal docs exceed single-shot context if you naïvely paste OCR text. Pipeline:
- Partition into clauses (headings, numbered sections)
- Embed clauses for RAG (“find indemnity language”)
- Per-clause LLM extraction with schema: obligation, party, deadline, governing law
- Cross-page references: store section IDs; vision on signature pages only
Multi-page chunking for vision + RAG bridge
When text layer is poor but layout matters (charts beside paragraphs):
- Render page → PNG (150 DPI)
- Vision model: “Describe this page and list all named entities and numbers” → structured JSON
- Merge JSON with text-layer chunks sharing same page_number
- Embed combined summary + raw text for retrieval
Chunk size: one page or one logical section—not whole 200-page PDF in one vision call. Parallelize pages with worker queue; respect provider rate limits.
# OpenAI: assumes client = OpenAI() and image_data_url built as a base64 data URL, as in the vision example
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["vendor_name", "total_amount"],
}
response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    response_format={"type": "json_schema", "json_schema": {"name": "invoice", "schema": INVOICE_SCHEMA}},
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract invoice fields from this page."},
        {"type": "image_url", "image_url": {"url": image_data_url, "detail": "high"}},
    ]}],
)
# Claude: tool_use, or a JSON instruction with the schema in the system prompt plus validation
import anthropic
claude = anthropic.Anthropic()  # b64 is the base64-encoded page image, as in the vision example
message = claude.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    temperature=0,
    system="Return only valid JSON matching the invoice schema.",
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Extract vendor_name, invoice_number, total_amount, line_items."},
    ]}],
)
# Use Bedrock Converse API with toolConfig for structured extraction
# or invoke Claude with JSON instruction + pydantic validate downstream
// Spring AI structured output: BeanOutputConverter or JSON schema options
Invoice invoice = chatClient.prompt()
    .options(OpenAiChatOptions.builder()
        .withModel("gpt-4o")
        .withTemperature(0.0)
        .withResponseFormat(new ResponseFormat(ResponseFormat.Type.JSON_SCHEMA, schema))
        .build())
    .user(u -> u.text("Extract invoice fields").media(imageMedia))
    .call()
    .entity(Invoice.class);
// Parse Claude JSON response with Jackson → Invoice DTO; validate totals in service layer
ObjectMapper mapper = new ObjectMapper();
Invoice invoice = mapper.readValue(jsonText, Invoice.class);
invoiceValidator.validateLineItemsSum(invoice);
// Bedrock ConverseResponse → extract toolUse input map → map to Invoice record
Structured output constrains decoding toward valid JSON—reduces but does not eliminate hallucinated numbers. Always validate arithmetic (tax + subtotal = total) and cross-check against regex for invoice numbers in OCR text.
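A sketch of those post-extraction checks against the invoice schema used above. It assumes line items already include tax (the schema has no separate tax field), a one-cent tolerance, and a plain substring cross-check in place of a format-specific regex; the function name is hypothetical.

```python
import math
import re

def validate_invoice(invoice: dict, ocr_text: str, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems; an empty list means the extraction passed these checks."""
    problems = []
    items = invoice.get("line_items", [])
    if items:
        line_sum = sum(item["amount"] for item in items)
        if not math.isclose(line_sum, invoice["total_amount"], abs_tol=tolerance):
            problems.append(f"line items sum to {line_sum}, total says {invoice['total_amount']}")
    number = invoice.get("invoice_number")
    # Compare with whitespace removed so OCR spacing differences do not cause false alarms.
    if number and re.sub(r"\s+", "", number) not in re.sub(r"\s+", "", ocr_text):
        problems.append("invoice_number not found in OCR text")
    return problems

extracted = {
    "vendor_name": "Acme",
    "invoice_number": "INV-1001",
    "total_amount": 100.5,
    "line_items": [{"description": "a", "amount": 40.0}, {"description": "b", "amount": 60.5}],
}
print(validate_invoice(extracted, "Invoice INV-1001  Total 100.50"))
```

Any non-empty result is a reason to route the document to review rather than to write the fields to the database.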
Prebuilt cloud invoice models are cheaper and faster for standard layouts. Vision LLM extraction wins on messy scans and one-off formats—route by document classifier confidence.
What this looks like in production
Multimodal features are IO-heavy: uploads, transcoding, OCR, and vision calls do not belong on the synchronous request path for large files. Production is an async job graph with object storage, queues, and strict PII handling.
PII in images and audio
- IDs, credit cards, faces, and signatures appear in uploads—treat media like sensitive DB columns
- Redact before logging; store originals encrypted at rest (SSE-KMS, CMEK)
- Retention TTL per tenant; legal hold overrides; delete derived transcripts when parent audio expires
- Moderation APIs + custom classifiers on uploads before human review queues
- DPA/BAA with OCR/LLM vendors when PHI is possible
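For the "redact before logging" item, a minimal text-side sketch covering US SSNs and card numbers. The patterns and placeholder tokens are assumptions: real deployments usually combine rules like these with a dedicated PII detection service, and redacting pixels or audio needs separate tooling.

```python
import re

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional space or dash separators

def luhn_ok(digits: str) -> bool:
    """Luhn checksum, used to avoid redacting arbitrary long numbers such as order IDs."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def redact(text: str) -> str:
    text = SSN_RE.sub("[SSN]", text)
    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[CARD]" if luhn_ok(digits) else match.group()
    return CARD_RE.sub(_card, text)

print(redact("SSN 123-45-6789, card 4111 1111 1111 1111 ok"))
```

Apply the function to transcripts and OCR text before they reach logs or analytics, and keep the unredacted originals only in encrypted storage.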
Async processing queues
Typical job stages:
- ingest — validate MIME, virus scan, write to s3://bucket/{tenant}/{doc_id}/original
- classify — digital vs scan; language detect; route parser
- extract — PyMuPDF / OCR / tables (parallel per page)
- enrich — optional vision summary per page; structured field extraction
- index — chunk, embed, upsert vector store (Track 2)
- notify — webhook or SSE when status=ready
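Use SQS + Lambda/ECS workers, Celery + Redis, or Temporal for retries and redelivery control (visibility timeouts in SQS). Use idempotency key = doc_id + stage so that retried or redelivered messages do not double-charge OCR, and send messages that keep failing to a dead-letter queue.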
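A sketch of the idempotency check, with an in-memory dict standing in for a durable store such as a database row or a conditional write. The `StageRunner` class and its method names are hypothetical.

```python
from typing import Any, Callable

def idempotency_key(doc_id: str, stage: str) -> str:
    return f"{doc_id}:{stage}"

class StageRunner:
    """Runs each (doc_id, stage) at most once; repeats return the stored result."""
    def __init__(self) -> None:
        self._done: dict[str, Any] = {}  # stand-in for a durable results table

    def run(self, doc_id: str, stage: str, work: Callable[[], Any]) -> Any:
        key = idempotency_key(doc_id, stage)
        if key in self._done:
            return self._done[key]       # redelivered message: skip the paid call
        result = work()
        self._done[key] = result         # record only after the work succeeded
        return result

ocr_calls = []
def paid_ocr() -> str:  # hypothetical billable OCR call
    ocr_calls.append(1)
    return "extracted text"

runner = StageRunner()
runner.run("doc-1", "extract", paid_ocr)
runner.run("doc-1", "extract", paid_ocr)  # same message delivered twice
runner.run("doc-1", "enrich", paid_ocr)   # a different stage still runs
print(len(ocr_calls))
```

With a shared store, the check and the write need to be atomic (for example a unique constraint on the key) so two workers cannot both pass the check at once.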
Storage patterns
| Artifact | Storage | Lifecycle |
|---|---|---|
| Original upload | Object storage, encrypted | TTL or legal hold |
| Page PNG renders | Object storage, content-addressed | Delete after OCR unless audit needs |
| Extracted text JSON | DB or object storage | Version with parser_version |
| Structured fields | PostgreSQL JSONB | Source of truth for apps |
| Embeddings | Vector DB | Re-index on model change |
Observability
- Per-stage latency histograms (p50/p95 ingest → ready)
- OCR confidence distribution; alert if median drops (model drift or bad scan batch)
- Vision token usage per doc type and tenant
- DLQ depth and age; replay tooling with admin audit
Production readiness checklist
- Max upload size and page count enforced at edge
- Async status API; never hold an HTTP request open for 30s while a 200-page PDF processes
- Parser + OCR version pinned; golden docs in CI (Track 5)
- PII redaction in logs; presigned URLs expire in minutes
- Cost caps per tenant on vision and cloud OCR
- Human review queue for low-confidence extractions
Expose processing_status and page_ready_count in your API so the UI can show progress. Users tolerate 60s ingest if they see “page 12 of 40 indexed”—not a spinner with no feedback.
Loan origination platforms queue overnight batch OCR for tax returns while synchronous path handles W-2 prebuilt models in <5s. Same product, two SLAs—routing is business rules, not one pipeline fits all.
“User uploads 500-page PDF for chat.” — API accepts file → S3 → job ID → workers chunk + embed → poll/WebSocket ready. Mention virus scan, tenant isolation, cost estimate upfront, and refusal if over page quota.