Multimodal — vision, audio & documents

Text-only LLMs are enough for chat—but real products ingest screenshots, PDFs, scans, audio, and video frames. Multimodal models let you ask questions about images and documents; specialized pipelines extract text, tables, and structure before RAG. This guide covers when to call vision APIs directly vs build a document pipeline, and how to ship both without blowing cost or latency budgets.

After reading, you should be able to: send images to GPT-4o, Claude, or Gemini with correct encoding and detail settings; choose PyMuPDF vs OCR vs layout parsers for PDFs; compare Textract vs Tesseract for your doc types; extract tables and structured fields; wire Whisper/TTS for audio; and design async production pipelines with PII-aware storage.

Tags: developer, platform, architect, Track 1, GPT-4o, Claude 3.5, Gemini 1.5, Whisper, PyMuPDF

Vision APIs — GPT-4o, Claude, Gemini

Frontier chat models accept images alongside text in the same messages array. The model runs a vision encoder that converts pixels into tokens the language model can reason over, so you describe the task in natural language. OpenAI has no separate “vision API” endpoint; image input is part of chat completions.

Supported input formats

Providers accept images as:

  • Base64 data URLs — embed in JSON; simplest for uploads from your backend; increases payload size ~33% vs raw bytes.
  • Public HTTPS URLs — OpenAI and Anthropic fetch the image server-side; requires stable, reachable URLs; good when images already live in S3/GCS with presigned links.
  • File API / uploads — OpenAI Files API and Gemini File API for larger or repeated use; amortizes upload cost across multiple prompts.

When to use base64: user uploads hit your API directly; short-lived content; no public CDN. When to use URL: images already in object storage; you want smaller request bodies; provider-side caching may help on retries.

OpenAI detail levels (low vs high)

On GPT-4o, image token cost depends on the detail setting. low uses a fixed ~85-token thumbnail, which is fast and cheap for classification. high tiles large images into 512px patches and can consume hundreds to thousands of image tokens on dense screenshots or multi-column docs.

| Setting | Token behavior (approx.) | Best for | Avoid when |
|---|---|---|---|
| low | Fixed ~85 image tokens regardless of source resolution | Binary classification, “is this a receipt?”, coarse moderation, icon/UI category | Small text in screenshots, dense tables, fine chart reading |
| high | Scales with image dimensions; large screenshots → 1K–2K+ image tokens | Document Q&A on a page image, reading axis labels on charts, form field extraction | High-volume thumbnail tagging at scale without budget |
| auto (default) | Model/provider chooses based on image size | General-purpose apps where you have not profiled cost yet | Cost-sensitive batch jobs (pin explicitly after measurement) |
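The detail levels in the table can be turned into a rough pre-flight estimate. The sketch below assumes the tiling rule OpenAI has documented for GPT-4o: fit the image inside 2048×2048, scale the shortest side down to 768px, then charge 85 base tokens plus 170 per 512px tile. The function name is hypothetical and the constants may change, so treat `response.usage.prompt_tokens` as the source of truth.

```python
import math


def estimate_gpt4o_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough image-token estimate; constants are assumptions taken from OpenAI's published rule."""
    if detail == "low":
        return 85  # fixed thumbnail budget

    # Step 1: fit inside a 2048 x 2048 square, preserving aspect ratio.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)

    # Step 2: scale down so the shortest side is at most 768 px.
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)

    # Step 3: count 512 px tiles; each tile costs 170 tokens on top of the 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_gpt4o_image_tokens(1920, 1080, "high"))  # 1105
print(estimate_gpt4o_image_tokens(1920, 1080, "low"))   # 85
```

Under these assumptions a 1920×1080 screenshot costs about 1,105 tokens at high detail versus 85 at low, which is why pinning detail per use case is worth the effort.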

Claude and Gemini differences

  • Claude 3.5+ — up to 20 images per request (check current limits); strong on diagrams and handwriting; image blocks in Messages API with base64 or URL.
  • Gemini 1.5 — native multimodal with very long context; can ingest many images or mixed PDF/video in one session; pricing per image token differs from OpenAI—benchmark on your doc set.
  • Bedrock — Claude and Amazon Nova models expose the same patterns via invoke_model; IAM and VPC endpoints for regulated workloads.

Cost vs accuracy trade-offs

Vision is priced as image tokens (input) plus text tokens. A support bot that attaches full-resolution Retina screenshots at high detail on every ticket can cost 10× a pipeline that resizes to 1024px max edge and uses low for triage, escalating only hard cases.

Common use cases

  • Document understanding — photo of a signed contract page, ask “what is the termination clause?”
  • Screenshots & UI — “what error is shown?” from a user-submitted bug report image
  • Charts & dashboards — describe trends; verify numbers against extracted CSV when stakes are high
  • Moderation — policy violation detection on user-uploaded images (combine with dedicated moderation APIs for scale)
  • Accessibility — alt-text generation for images in CMS workflows

Vision API call

Resize large uploads before encode; log image token usage from the response. Use presigned URLs in production instead of shipping multi-MB base64 through your API gateway.

Describe an uploaded screenshot
import base64
from pathlib import Path
from openai import OpenAI

client = OpenAI()
image_path = Path("screenshot.png")
b64 = base64.b64encode(image_path.read_bytes()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What error message is visible? Reply in one sentence."},
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/png;base64,{b64}",
                    "detail": "high",  # use "low" for cheap triage
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
print("prompt tokens:", response.usage.prompt_tokens)
# --- Anthropic Messages API (Python) ---
import base64
import anthropic
from pathlib import Path

client = anthropic.Anthropic()
media_type = "image/png"
data = base64.standard_b64encode(Path("screenshot.png").read_bytes()).decode()

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": media_type, "data": data}},
            {"type": "text", "text": "What error message is visible? One sentence."},
        ],
    }],
)
print(message.content[0].text)
# --- Amazon Bedrock, Claude via invoke_model (Python) ---
import base64
import json
import boto3
from pathlib import Path

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
b64 = base64.b64encode(Path("screenshot.png").read_bytes()).decode()

body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 512,
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
            {"type": "text", "text": "What error message is visible? One sentence."},
        ],
    }],
}
resp = bedrock.invoke_model(modelId="anthropic.claude-3-5-sonnet-20241022-v2:0", body=json.dumps(body))
print(json.loads(resp["body"].read())["content"][0]["text"])
// Spring AI — UserMessage with Media (Resource or byte[])
Resource image = new ClassPathResource("screenshot.png");
Media media = new Media(MimeTypeUtils.IMAGE_PNG, image);

UserMessage userMessage = UserMessage.builder()
    .text("What error message is visible? One sentence.")
    .media(media)
    .build();

ChatResponse response = chatClient.prompt(new Prompt(userMessage))
    .options(OpenAiChatOptions.builder().withModel("gpt-4o").withMaxTokens(512).build())
    .call()
    .chatResponse();

System.out.println(response.getResult().getOutput().getContent());
// --- Java: same request against an Anthropic chat model ---
AnthropicChatModel model = AnthropicChatModel.builder()
    .modelName("claude-3-5-sonnet-20241022")
    .maxTokens(512)
    .build();

byte[] png = Files.readAllBytes(Path.of("screenshot.png"));
Media media = new Media(MimeTypeUtils.IMAGE_PNG, png);

UserMessage msg = UserMessage.builder()
    .text("What error message is visible? One sentence.")
    .media(media)
    .build();

String answer = model.call(new Prompt(msg)).getResult().getOutput().getContent();
// --- Java: same request via Bedrock ---
var chat = new BedrockAnthropicChatModel(
    BedrockAnthropicChatOptions.builder()
        .withModel("anthropic.claude-3-5-sonnet-20241022-v2:0")
        .withMaxTokens(512)
        .build());

byte[] png = Files.readAllBytes(Path.of("screenshot.png"));
Media media = new Media(MimeTypeUtils.IMAGE_PNG, png);

String text = chat.call(new Prompt(
    UserMessage.builder().text("What error message is visible?").media(media).build()
)).getResult().getOutput().getContent();
💰 Cost

Image tokens bill at the model’s input rate. A 1920×1080 screenshot at high detail on GPT-4o can exceed 1,000 image tokens—often more expensive than the text answer. Measure per doc type: monthly_vision_cost ≈ avg_image_tokens × price_in × images_per_day × 30.
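The monthly cost formula is easy to keep next to your usage logs as a small helper. This is a sketch: the function name is hypothetical, and the $2.50 per 1M input tokens used in the example is an illustrative rate, so substitute your model's current price.

```python
def monthly_vision_cost(
    avg_image_tokens: float,
    price_per_million_input: float,
    images_per_day: int,
    days: int = 30,
) -> float:
    """monthly_vision_cost ≈ avg_image_tokens × price_in × images_per_day × 30."""
    price_per_token = price_per_million_input / 1_000_000
    return avg_image_tokens * price_per_token * images_per_day * days


# 10,000 screenshots a day at ~1,000 image tokens each, $2.50 per 1M input tokens (illustrative)
print(f"${monthly_vision_cost(1000, 2.50, 10_000):,.2f} per month")
```

Feed it the measured average from `usage.prompt_tokens` per document type rather than a guess; the spread between doc types is usually where the savings are.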

⚖️ Trade-off

Vision-only Q&A on full PDF page renders is simple to ship but expensive at volume. Hybrid: extract text with PyMuPDF first; send page images to vision only when OCR confidence is low or layout is complex (see Document pipelines).

⚠️ Pitfall

Sending 4K phone photos without downscaling. Resize to max 1024–1568px on the long edge before encode—vision models rarely need full sensor resolution, and you pay for pixels.
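The downscaling step starts with the target dimensions. The helper below is a minimal sketch with a hypothetical name; it only does the arithmetic, and the actual resampling is assumed to happen in whatever image library you already use (for example Pillow's `Image.resize`).

```python
def downscaled_size(width: int, height: int, max_edge: int = 1568) -> tuple:
    """Return (width, height) scaled so the long edge is at most max_edge, keeping aspect ratio."""
    longest = max(width, height)
    if longest <= max_edge:
        return width, height  # already small enough; never upscale
    scale = max_edge / longest
    return max(1, round(width * scale)), max(1, round(height * scale))


print(downscaled_size(4032, 3024))        # 12 MP phone photo -> (1568, 1176)
print(downscaled_size(4032, 3024, 1024))  # tighter budget   -> (1024, 768)
print(downscaled_size(800, 600))          # unchanged        -> (800, 600)
```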

Document pipelines — PDF text extraction

Most enterprise knowledge is PDFs: invoices, policies, filings, lab reports. A production pipeline extracts text and structure before embedding or LLM Q&A—not every PDF should go straight to a vision model.

Digital-born vs scanned PDFs

  • Digital-born — text is selectable in Acrobat; fonts and vector graphics embedded. PyMuPDF or pdfplumber extract clean Unicode with coordinates.
  • Scanned — pages are bitmap images inside PDF wrappers; no text layer. You need OCR (Tesseract, Textract, Document Intelligence) or vision per page.
  • Mixed — common in legacy archives: some pages digital, some scans. Detect per page: if page.get_text() returns empty or garbage, route to OCR.

PyMuPDF (fitz) vs pdfplumber

| Library | Strengths | Weaknesses |
|---|---|---|
| PyMuPDF | Very fast; page render to PNG; text blocks with bbox; good for bulk ingest | Table structure heuristic; less Pythonic API |
| pdfplumber | Table detection via lines/edges; detailed char-level geometry | Slower on large corpora; memory on huge files |

When OCR is needed

Run OCR when:

  • Extracted text length is near zero but page has visual content
  • Character entropy looks wrong (lots of □ or random symbols)
  • Document is known scan-only (fax, photographed forms)
  • You need handwriting or stamped signatures read
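The first two signals in this list can be checked automatically on the extracted text layer. The sketch below is one possible heuristic, not a standard: the function names, the suspect-character set, and the 20% threshold are all assumptions to tune against your own corpus.

```python
import unicodedata

# Characters that typically indicate a broken text layer (assumed set; extend as needed).
SUSPECT = {"\ufffd", "\u25a1"}  # replacement character, white square


def looks_like_garbage(text: str, max_bad_ratio: float = 0.2) -> bool:
    """True when too many visible characters are placeholders, control, or unassigned code points."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return True
    bad = sum(
        1 for ch in visible
        if ch in SUSPECT or unicodedata.category(ch) in {"Cc", "Cn", "Co"}
    )
    return bad / len(visible) > max_bad_ratio


def needs_ocr(text: str, min_chars: int = 40, max_bad_ratio: float = 0.2) -> bool:
    """Route a page to OCR when the text layer is near-empty or mostly garbage."""
    stripped = text.strip()
    return len(stripped) < min_chars or looks_like_garbage(stripped, max_bad_ratio)


print(needs_ocr(""))                                                   # True
print(needs_ocr("\u25a1" * 80))                                        # True
print(needs_ocr("This agreement is made between Acme Corp and the undersigned."))  # False
```

Known scan-only sources and handwriting still need an explicit routing rule, since a text-layer check cannot detect them.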

Layout-aware parsing

Plain text dump loses columns, headers, footers, and reading order. Unstructured partitions PDFs into elements (Title, NarrativeText, Table, Image) with metadata for chunking. LlamaParse (LlamaIndex) is a hosted parser optimized for RAG—handles complex tables and slides; adds vendor dependency and per-page cost.

Pattern: partition → clean → chunk → embed (Track 2). Store page number and bbox in chunk metadata so citations link back to source PDF pages.

PDF pipeline snippet

Extract text per page; OCR fallback flag
import fitz  # PyMuPDF

def extract_pdf_pages(path: str, min_chars: int = 40) -> list[dict]:
    doc = fitz.open(path)
    pages = []
    for i, page in enumerate(doc):
        text = page.get_text("text").strip()
        needs_ocr = len(text) < min_chars
        pages.append({
            "page": i + 1,
            "text": text,
            "needs_ocr": needs_ocr,
            "width": page.rect.width,
            "height": page.rect.height,
        })
    return pages

# Optional: render OCR candidate to PNG for Textract or vision
def page_to_png(doc: fitz.Document, page_index: int, dpi: int = 150) -> bytes:
    page = doc[page_index]
    pix = page.get_pixmap(dpi=dpi)
    return pix.tobytes("png")
// Apache PDFBox — extract text per page with OCR routing flag
try (PDDocument doc = Loader.loadPDF(new File("contract.pdf"))) {
    PDFTextStripper stripper = new PDFTextStripper();
    int minChars = 40;
    for (int i = 0; i < doc.getNumberOfPages(); i++) {
        stripper.setStartPage(i + 1);
        stripper.setEndPage(i + 1);
        String text = stripper.getText(doc).trim();
        boolean needsOcr = text.length() < minChars;
        // enqueue PageJob(page=i+1, text, needsOcr) to queue
    }
}
🔬 Under the Hood

PDF “text” is drawing instructions (Tj, TJ operators), not a simple string table. Extractors reconstruct reading order from glyph positions—multi-column layouts confuse naive order. Layout parsers use ML or heuristics to reorder blocks before chunking.

💡 Pro Tip

Cache extracted text keyed by sha256(file) + parser_version. Re-parsing the same 200-page policy on every user question burns CPU and adds seconds of latency.
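A minimal sketch of that cache key, assuming an in-memory dict as a stand-in for Redis or a database table. The class and function names are hypothetical.

```python
import hashlib


def cache_key(file_bytes: bytes, parser_version: str) -> str:
    """Key = sha256 of the file contents plus the parser version that produced the text."""
    return f"{hashlib.sha256(file_bytes).hexdigest()}:{parser_version}"


class ExtractionCache:
    """In-memory stand-in for a durable cache; swap the dict for your real store."""

    def __init__(self) -> None:
        self._store = {}

    def get_or_parse(self, file_bytes: bytes, parser_version: str, parse_fn):
        key = cache_key(file_bytes, parser_version)
        if key not in self._store:
            self._store[key] = parse_fn(file_bytes)  # only runs on a cache miss
        return self._store[key]


calls = []


def fake_parse(data: bytes) -> str:
    calls.append(1)
    return data.decode("utf-8").upper()


cache = ExtractionCache()
cache.get_or_parse(b"policy text", "pymupdf-1", fake_parse)
cache.get_or_parse(b"policy text", "pymupdf-1", fake_parse)  # cache hit, no re-parse
cache.get_or_parse(b"policy text", "pymupdf-2", fake_parse)  # new parser version, re-parse
print(len(calls))  # 2
```

Including the parser version in the key means a parser upgrade invalidates old entries without a manual flush.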

📦 Real World

Legal and insurance RAG systems often split pipelines: born-digital contracts via PyMuPDF + Unstructured; mailroom scans via Textract with human QA on extraction confidence below 0.85. One parser rarely fits all doc classes.

OCR — Tesseract, Textract, Document Intelligence

OCR turns pixel text into Unicode. Open-source Tesseract is free but ops-heavy; cloud OCR adds forms, tables, and confidence scores—choose based on document mix and compliance, not benchmark headlines.

Provider comparison

| Provider | Deployment | Strengths | Limitations | Typical cost model |
|---|---|---|---|---|
| Tesseract | Self-hosted, Docker, on-device | No per-page cloud fee; offline; 100+ languages | Weak on complex layouts/forms; needs preprocessing (deskew, binarize); you own tuning | CPU/GPU infra only |
| AWS Textract | AWS API, VPC endpoints | Forms (key-value), tables, queries; AnalyzeDocument API; IAM integration | AWS lock-in; async for multi-page; handwriting variable | Per page; higher for Forms/Tables API |
| Azure Document Intelligence | Azure API, custom models | Prebuilt models (invoice, receipt, ID); train custom neural models; layout markdown output | Azure ecosystem; model training cycle for niche docs | Per page; custom model hosting extra |
| Google Document AI | GCP | Specialized processors (W-2, utility bills); strong enterprise doc types | Processor-per-doc-type mental model | Per page / processor |

Preprocessing matters (especially Tesseract)

  • Deskew rotated scans
  • Convert to grayscale; adaptive threshold for faded fax
  • Remove speckle noise; crop margins
  • 300 DPI minimum for small fonts; 150 DPI often enough for body text

Routing strategy

  • Tier 1: PyMuPDF text extraction (free, instant).
  • Tier 2: Tesseract on the rendered page PNG for internal, high-volume, or data-residency-bound documents.
  • Tier 3: Textract or Document Intelligence for forms, tables, or low-confidence Tesseract output.
  • Tier 4: Vision LLM on the page image for one-off complex layouts, or when OCR plus structure extraction still fails eval.
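The tiers read naturally as a small routing function. This is a sketch under stated assumptions: the `PageSignals` fields, the tier labels, and the 0.85 confidence cutoff are illustrative choices, not a fixed recipe.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PageSignals:
    text_chars: int                   # characters found in the PDF text layer
    has_forms_or_tables: bool         # from a layout classifier or a cheap heuristic
    ocr_confidence: Optional[float]   # None until Tesseract has run on the page
    failed_eval: bool = False         # OCR + structure output already rejected downstream


def choose_tier(sig: PageSignals, min_chars: int = 40, min_confidence: float = 0.85) -> str:
    if sig.failed_eval:
        return "vision_llm"   # Tier 4: last resort for layouts that defeat OCR
    if sig.text_chars >= min_chars:
        return "text_layer"   # Tier 1: free and instant
    if sig.has_forms_or_tables:
        return "cloud_ocr"    # Tier 3: forms and tables need structure-aware OCR
    if sig.ocr_confidence is None:
        return "tesseract"    # Tier 2: try local OCR first
    if sig.ocr_confidence < min_confidence:
        return "cloud_ocr"    # Tier 3: escalate low-confidence local output
    return "tesseract"        # local OCR result is good enough


print(choose_tier(PageSignals(500, False, None)))  # text_layer
print(choose_tier(PageSignals(0, False, 0.60)))    # cloud_ocr
```

Keeping the routing in one pure function makes it easy to unit-test against golden documents when thresholds change.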

🎯 Interview Tip

“How would you ingest 10M scanned PDFs for search?” — Do not say “GPT-4o on every page.” Answer: classify digital vs scan; async OCR queue; store text + confidence; embed chunks; vision only on failure bucket; cost and throughput numbers (pages/hour per Textract vs self-hosted Tesseract fleet).

⚖️ Trade-off

Cloud OCR simplifies forms and tables but sends document content to a third party. For HIPAA/PCI workloads, use VPC endpoints, BAA-covered regions, or on-prem Tesseract with strict network boundaries—and accept lower accuracy on hard layouts.

Table extraction — camelot, tabula, structured output

Tables in PDFs break naive text extraction—columns collapse into unreadable lines. Dedicated table extractors use line detection (lattice) or whitespace alignment (stream) to rebuild rows and cells.

camelot (Python)

  • Lattice — bordered tables with visible grid lines; high precision when lines exist.
  • Stream — borderless tables aligned by whitespace; common in financial reports.
  • Outputs pandas DataFrame; export CSV/JSON for downstream LLM or SQL load.

tabula-java (the Java library behind Tabula)

JVM-friendly table extraction from PDFs; good in Spring services that already use PDFBox. Same lattice vs stream modes; integrate with batch ETL pipelines.

When to skip local extractors

Merged cells, nested headers, and tables split across pages frustrate rule-based tools. Use Textract AnalyzeDocument TABLES feature, Document Intelligence layout model, or render page → vision LLM with JSON schema for small-volume high-value docs.

Structured output from tables

After extraction:

  1. Normalize headers (snake_case, dedupe)
  2. Type coercion (dates, currency stripping $ and commas)
  3. Validate row counts vs source; flag empty cells
  4. Pass compact JSON to LLM for summarization—not raw 500-row CSV in prompt
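Steps 1 and 2 can be sketched with the standard library alone. The helper names are hypothetical, and the currency parser only handles the simple `$1,234.50` shape; anything it cannot parse comes back as `None` so it can be flagged in step 3.

```python
import re
from decimal import Decimal, InvalidOperation


def normalize_headers(headers: list) -> list:
    """snake_case each header and suffix duplicates (_2, _3, ...)."""
    seen = {}
    out = []
    for header in headers:
        name = re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_") or "column"
        seen[name] = seen.get(name, 0) + 1
        out.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return out


def parse_currency(value: str):
    """Strip $ and thousands separators; return None for empty or unparseable cells."""
    cleaned = value.replace("$", "").replace(",", "").strip()
    if not cleaned:
        return None
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


print(normalize_headers(["Invoice #", "Total Amount", "Total Amount"]))
# ['invoice', 'total_amount', 'total_amount_2']
print(parse_currency("$1,234.50"))  # 1234.50
```

`Decimal` is used instead of `float` so that later total checks do not trip over binary rounding.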
Extract tables with camelot
import camelot

tables = camelot.read_pdf("report.pdf", pages="3", flavor="lattice")
print(f"found {tables.n} tables on page 3")
df = tables[0].df
records = df.to_dict(orient="records")  # feed to validation + LLM summary
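// --- Java: tabula-java ---
// Extract tables from page 3 (1-based) with the lattice-style algorithm.
// Assumes a tabula-java release built on PDFBox 2.x (PDDocument.load);
// PDFBox 3.x builds load documents with Loader.loadPDF instead.
try (PDDocument doc = PDDocument.load(new File("report.pdf"))) {
    ObjectExtractor extractor = new ObjectExtractor(doc);
    Page page = extractor.extract(3);
    // SpreadsheetExtractionAlgorithm = lattice; BasicExtractionAlgorithm = stream
    List<Table> tables = new SpreadsheetExtractionAlgorithm().extract(page);
    for (Table t : tables) {
        for (List<RectangularTextContainer> row : t.getRows()) {
            // cell.getText() per cell; map rows → JSON → validate → optional LLM summary
        }
    }
}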
⚠️ Pitfall

Trusting extracted tables without visual QA on a sample. camelot reports accuracy scores—reject below threshold and route to human review or cloud OCR.

💰 Cost

Local camelot/tabula: compute only. Textract TABLES: per-page premium over plain OCR—cheaper than sending full page images to GPT-4o at scale if you only need cell values.

Audio — Whisper, streaming ASR, TTS

Speech unlocks meeting notes, IVR analytics, voice agents, and accessibility. ASR (automatic speech recognition) converts audio → text; TTS (text-to-speech) goes the other direction for voice UX.

OpenAI Whisper API

  • Transcription — same language output (whisper-1)
  • Translation — any supported language → English (translations endpoint)
  • Input: mp3, mp4, wav, webm, etc.; 25 MB file limit on API—split long recordings
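  • Segment-level timestamps via response_format=verbose_json; word-level timestamps additionally need timestamp_granularities=["word"]. Whisper does not label speakers, so diarization is a separate post-processing step over the segments.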

Self-hosted Whisper

Run faster-whisper or whisper.cpp on GPU for high volume or data residency. Trade API simplicity for batch throughput tuning and model size selection (large-v3 vs distil).

Streaming ASR (WebSocket overview)

Batch Whisper waits for full file upload—unacceptable for live captions. Streaming providers (Deepgram, AssemblyAI, Azure Speech, Google STT) expose WebSocket APIs:

  1. Client opens WS with auth header
  2. Send audio chunks (PCM or encoded) as they are captured
  3. Receive partial transcripts (is_final: false) and finalized segments
  4. Close stream on utterance end; reconnect with backoff on network blips

Latency budget: partial results every 200–500 ms; final segment after ~700 ms silence (VAD-dependent). Voice agents combine streaming ASR → LLM → streaming TTS in a tight loop (Track 3 agents).
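On the client side, the partial/final protocol reduces to a small fold over incoming events. The sketch below assumes each event is a dict with `text` and `is_final` keys; real providers use different field names, so treat the schema as hypothetical.

```python
def assemble_transcript(events):
    """Fold streaming ASR events into (finalized_text, current_partial).

    Partials replace each other; a final event commits its text and clears the partial.
    """
    finalized = []
    partial = ""
    for event in events:
        if event["is_final"]:
            finalized.append(event["text"])
            partial = ""
        else:
            partial = event["text"]  # overwrite: each partial supersedes the previous one
    return " ".join(finalized), partial


events = [
    {"text": "hello", "is_final": False},
    {"text": "hello wor", "is_final": False},
    {"text": "hello world", "is_final": True},
    {"text": "how are", "is_final": False},
]
print(assemble_transcript(events))  # ('hello world', 'how are')
```

A UI would render the finalized text as stable and the partial as provisional, since partials can be revised until the provider marks the segment final.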

Text-to-speech options

| Provider | Quality | Latency | Notes |
|---|---|---|---|
| OpenAI TTS | Natural, limited voices | Low; streaming supported | Simple API; same vendor as GPT |
| ElevenLabs | Very high; voice cloning | Low–medium | Commercial licensing; voice library |
| Amazon Polly | Good; many languages | Low | Neural voices; IAM; good for AWS stacks |
| Azure / Google | Enterprise grade | Low | SSML control; regional compliance |
Whisper transcription (OpenAI)
from openai import OpenAI

client = OpenAI()
with open("meeting.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",
    )
print(transcript.text)
# segment-level timestamps: transcript.segments
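# --- Anthropic (Python) ---
# Anthropic has no native ASR: route audio through Whisper/OpenAI or Deepgram first,
# then pass the transcript text to Claude for summarization.
import anthropic

claude = anthropic.Anthropic()  # separate client; `client` above is the OpenAI client
summary = claude.messages.create(
    model="claude-3-5-haiku-20241022",
    max_tokens=1024,
    messages=[{"role": "user", "content": f"Summarize this meeting:\n\n{transcript.text}"}],
)
print(summary.content[0].text)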
# Amazon Transcribe (batch) or Transcribe Streaming for ASR on AWS
import boto3
transcribe = boto3.client("transcribe")
# StartTranscriptionJob with S3 URI → poll → fetch transcript JSON
# Then invoke Bedrock Claude on transcript text for action items
// --- Java: Spring AI with OpenAI Whisper ---
OpenAiAudioTranscriptionOptions opts = OpenAiAudioTranscriptionOptions.builder()
    .withModel("whisper-1")
    .withResponseFormat(OpenAiAudioApi.TranscriptResponseFormat.VERBOSE_JSON)
    .build();

Resource audio = new FileSystemResource("meeting.mp3");
// The model wraps an OpenAiAudioApi client; builder and constructor names vary across Spring AI versions
OpenAiAudioTranscriptionModel whisper = new OpenAiAudioTranscriptionModel(new OpenAiAudioApi(apiKey));
AudioTranscriptionResponse resp = whisper.call(new AudioTranscriptionPrompt(audio, opts));
System.out.println(resp.getResult().getOutput());
// Pipeline: Whisper (OpenAI or self-hosted) → Claude via Spring AI
String transcript = whisperService.transcribe(Path.of("meeting.mp3"));
String summary = chatClient.prompt()
    .user("Summarize this meeting:\n\n" + transcript)
    .call().content();
// AWS SDK — TranscribeStreamingAsyncClient for live audio
// Batch: TranscribeClient.startTranscriptionJob(...)
// Post-process JSON transcript → BedrockAnthropicChatModel for summarization
💡 Pro Tip

For meeting bots, store raw audio in object storage with retention policy; persist transcript + diarization separately. Re-run improved ASR models later without re-fetching Zoom/Teams recordings.

📦 Real World

Support call analytics: streaming ASR → PII redaction (SSN, card numbers) → topic classification → supervisor dashboard. TTS used sparingly for outbound IVR; most cost is ASR minutes × concurrent calls.

Video — frame extraction → vision

Video models are emerging, but most production systems today sample frames and run image understanding on each frame—or on keyframes only. Treat full-video LLM ingest as expensive and experimental unless you have a clear ROI.

Typical pipeline

  1. Ingest — upload to S3/GCS; transcode to standard format (H.264 mp4) if needed
  2. Extract audio — ffmpeg → Whisper for spoken content (cheaper than watching every frame)
  3. Sample frames — 1 fps for surveillance-style review; scene-change detection for highlights; max N frames cap
  4. Vision per frame — classify, OCR on-screen text, detect objects/events
  5. Aggregate — LLM summarizes frame captions + transcript into timeline

ffmpeg frame extraction

One frame per second to PNG
import subprocess
from pathlib import Path

def extract_frames(video: Path, out_dir: Path, fps: float = 1.0) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run([
        "ffmpeg", "-i", str(video),
        "-vf", f"fps={fps}",
        str(out_dir / "frame_%04d.png"),
    ], check=True)
// ProcessBuilder wrapping ffmpeg — same args as Python
ProcessBuilder pb = new ProcessBuilder(
    "ffmpeg", "-i", videoPath,
    "-vf", "fps=1",
    outDir.resolve("frame_%04d.png").toString()
);
pb.inheritIO().start().waitFor();

Cost and latency warnings

  • A 10-minute video at 1 fps = 600 frames. At ~85–500 tokens/frame, vision cost explodes.
  • Always cap max_frames and downscale to 512–768px before API calls.
  • Prefer audio transcript for spoken meetings; use frames only for slide/demo visibility questions.
  • Batch frame inference overnight; never block synchronous user requests on full-video analysis.
  • Gemini native video may reduce glue code but verify pricing vs frame-sampling approach on your lengths.
💰 Cost

Back-of-envelope: 600 frames × 200 image tokens × $2.50/1M input ≈ $0.30 per 10-min video before text output—10× higher at high detail or 4K frames. Scene detection + 20 keyframes often beats 1 fps brute force.
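The same back-of-envelope arithmetic as a reusable helper, with the frame cap made explicit. The function name is hypothetical and the price is the illustrative $2.50 per 1M input tokens used above.

```python
def video_vision_cost(
    duration_s: float,
    fps: float,
    tokens_per_frame: int,
    price_per_million_input: float,
    max_frames=None,
):
    """Return (frame_count, input_cost_usd) for frame-sampled video analysis."""
    frames = int(duration_s * fps)
    if max_frames is not None:
        frames = min(frames, max_frames)  # hard cap protects the budget on long videos
    cost = frames * tokens_per_frame * price_per_million_input / 1_000_000
    return frames, cost


print(video_vision_cost(600, 1.0, 200, 2.50))                 # 600 frames at 1 fps
print(video_vision_cost(600, 1.0, 200, 2.50, max_frames=20))  # 20 keyframes only
```

With these inputs the 1 fps pass costs about $0.30 and the 20-keyframe pass about $0.01, before any text output tokens.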

⚠️ Pitfall

Sending every frame to vision without deduplication—consecutive frames differ minimally. Perceptual hash or SSIM dedupe cuts redundant API calls 80%+ on static talking-head video.
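One simple perceptual hash is the average hash, sketched below in pure Python. It assumes each frame has already been downscaled to a small grayscale grid (8×8 here) by your image library; the function names and the distance threshold of 5 bits are illustrative assumptions. SSIM would be an alternative but needs an image-processing dependency.

```python
def average_hash(gray) -> int:
    """gray: small grid (for example 8x8) of 0-255 ints. One bit per pixel: brighter than the mean or not."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def dedupe_frames(frames, max_distance: int = 5):
    """frames: iterable of (frame_id, gray_grid). Keep a frame only if it differs from the last kept one."""
    kept = []
    last_hash = None
    for frame_id, grid in frames:
        h = average_hash(grid)
        if last_hash is None or hamming(h, last_hash) > max_distance:
            kept.append(frame_id)
            last_hash = h
    return kept


left_dark = [[0] * 4 + [255] * 4 for _ in range(8)]           # dark left half, bright right half
almost_same = [row[:] for row in left_dark]
almost_same[0][0] = 10                                        # tiny change, same hash
top_bright = [[255] * 8 for _ in range(4)] + [[0] * 8 for _ in range(4)]

print(dedupe_frames([("f1", left_dark), ("f2", almost_same), ("f3", top_bright)]))
# ['f1', 'f3']
```

Comparing against the last kept frame, rather than the immediately preceding one, stops slow drift from slipping a long run of near-duplicates through.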

🎯 Interview Tip

“Design video search for training content.” — Separate speech (ASR index) from visual (sparse keyframes + embeddings); query routes to transcript BM25 + visual similarity; mention cost caps and async indexing lag SLA.

Structured extraction — invoices, forms, contracts

Multimodal shines when output must be machine-readable: line items, parties, dates, clause types. Combine OCR/layout parsers with LLM structured output (JSON schema) and validation—not free-form prose you parse later.

Invoice parsing

  • Prebuilt: Azure prebuilt-invoice, Document AI Invoice Parser—fast baseline
  • Custom: extract fields with schema → validate totals (sum line items ≈ total)
  • Fallback: page image + GPT-4o with response_format JSON schema

Form extraction

Government and healthcare forms have fixed boxes. Textract FORMS returns key-value pairs with geometry; map keys via fuzzy match to your canonical schema. Low-confidence fields → human-in-the-loop queue (Track 5 evals).
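The fuzzy key mapping can be sketched with `difflib` from the standard library. The canonical field names and the 0.8 cutoff are hypothetical; keys that match nothing return `None` so they can go to the review queue instead of being silently dropped.

```python
import difflib
import re

# Hypothetical canonical schema for illustration.
CANONICAL_FIELDS = ["patient_name", "date_of_birth", "policy_number"]


def normalize_key(raw: str) -> str:
    """Lowercase and collapse punctuation/whitespace to underscores."""
    return re.sub(r"[^0-9a-z]+", "_", raw.lower()).strip("_")


def map_key(raw_key: str, canonical=CANONICAL_FIELDS, cutoff: float = 0.8):
    """Map an OCR'd form label to a canonical field name, or None if nothing is close enough."""
    matches = difflib.get_close_matches(normalize_key(raw_key), canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(map_key("Date of Birth:"))  # date_of_birth
print(map_key("Patient Nme"))     # patient_name (tolerates an OCR typo)
print(map_key("Fax"))             # None
```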

Contract analysis

Multi-page legal docs exceed single-shot context if you naïvely paste OCR text. Pipeline:

  1. Partition into clauses (headings, numbered sections)
  2. Embed clauses for RAG (“find indemnity language”)
  3. Per-clause LLM extraction with schema: obligation, party, deadline, governing law
  4. Cross-page references: store section IDs; vision on signature pages only

Multi-page chunking for vision + RAG bridge

When text layer is poor but layout matters (charts beside paragraphs):

  • Render page → PNG (150 DPI)
  • Vision model: “Describe this page and list all named entities and numbers” → structured JSON
  • Merge JSON with text-layer chunks sharing same page_number
  • Embed combined summary + raw text for retrieval

Chunk size: one page or one logical section—not whole 200-page PDF in one vision call. Parallelize pages with worker queue; respect provider rate limits.

Structured invoice extraction from page image
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor_name": {"type": "string"},
        "invoice_number": {"type": "string"},
        "total_amount": {"type": "number"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["vendor_name", "total_amount"],
}

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    response_format={"type": "json_schema", "json_schema": {"name": "invoice", "schema": INVOICE_SCHEMA}},
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Extract invoice fields from this page."},
        {"type": "image_url", "image_url": {"url": image_data_url, "detail": "high"}},
    ]}],
)
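# --- Anthropic (Python) ---
# Claude: use tool_use with an input schema, or put the schema in the system prompt
# and validate the returned JSON downstream.
# Assumes `b64` holds the base64-encoded PNG of the invoice page.
import anthropic

claude = anthropic.Anthropic()  # separate client; `client` above is the OpenAI client
message = claude.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=2048,
    temperature=0,
    system="Return only valid JSON matching the invoice schema.",
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
        {"type": "text", "text": "Extract vendor_name, invoice_number, total_amount, line_items."},
    ]}],
)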
# Use Bedrock Converse API with toolConfig for structured extraction
# or invoke Claude with JSON instruction + pydantic validate downstream
// Spring AI structured output — BeanOutputConverter or JSON schema options
Invoice invoice = chatClient.prompt()
    .options(OpenAiChatOptions.builder()
        .withModel("gpt-4o")
        .withTemperature(0.0)
        .withResponseFormat(new ResponseFormat(ResponseFormat.Type.JSON_SCHEMA, schema))
        .build())
    .user(u -> u.text("Extract invoice fields").media(imageMedia))
    .call()
    .entity(Invoice.class);
// Parse Claude JSON response with Jackson → Invoice DTO; validate totals in service layer
ObjectMapper mapper = new ObjectMapper();
Invoice invoice = mapper.readValue(jsonText, Invoice.class);
invoiceValidator.validateLineItemsSum(invoice);
// Bedrock ConverseResponse → extract toolUse input map → map to Invoice record
🔬 Under the Hood

Structured output constrains decoding toward valid JSON—reduces but does not eliminate hallucinated numbers. Always validate arithmetic (tax + subtotal = total) and cross-check against regex for invoice numbers in OCR text.
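A minimal validation pass over the extracted invoice might look like the sketch below. `tax_amount` is a hypothetical optional field that is not in the schema above, the function name is an assumption, and the invoice-number cross-check uses a plain substring test where a production system would likely use a pattern per vendor.

```python
from decimal import Decimal


def validate_invoice(invoice: dict, ocr_text: str = "", tolerance: Decimal = Decimal("0.01")) -> list:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    errors = []
    total = Decimal(str(invoice["total_amount"]))

    items = invoice.get("line_items") or []
    if items:
        subtotal = sum((Decimal(str(item["amount"])) for item in items), Decimal("0"))
        tax = Decimal(str(invoice.get("tax_amount", 0)))  # hypothetical optional field
        if abs(subtotal + tax - total) > tolerance:
            errors.append(f"line items + tax = {subtotal + tax}, but total_amount = {total}")

    number = invoice.get("invoice_number")
    if number and ocr_text and number not in ocr_text:
        errors.append(f"invoice_number {number!r} not found in OCR text")

    return errors


invoice = {
    "vendor_name": "Acme",
    "invoice_number": "INV-1042",
    "total_amount": 110.0,
    "tax_amount": 10.0,
    "line_items": [
        {"description": "Widgets", "amount": 60.0},
        {"description": "Shipping", "amount": 40.0},
    ],
}
print(validate_invoice(invoice, ocr_text="Invoice INV-1042 from Acme"))  # []
```

Any non-empty result is a signal to retry with a different extractor or route the document to human review, not to silently correct the numbers.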

⚖️ Trade-off

Prebuilt cloud invoice models are cheaper and faster for standard layouts. Vision LLM extraction wins on messy scans and one-off formats—route by document classifier confidence.

What this looks like in production

Multimodal features are IO-heavy: uploads, transcoding, OCR, and vision calls do not belong on the synchronous request path for large files. Production is an async job graph with object storage, queues, and strict PII handling.

PII in images and audio

  • IDs, credit cards, faces, and signatures appear in uploads—treat media like sensitive DB columns
  • Redact before logging; store originals encrypted at rest (SSE-KMS, CMEK)
  • Retention TTL per tenant; legal hold overrides; delete derived transcripts when parent audio expires
  • Moderation APIs + custom classifiers on uploads before human review queues
  • DPA/BAA with OCR/LLM vendors when PHI is possible

Async processing queues

Typical job stages:

  1. ingest — validate MIME, virus scan, write to s3://bucket/{tenant}/{doc_id}/original
  2. classify — digital vs scan; language detect; route parser
  3. extract — PyMuPDF / OCR / tables (parallel per page)
  4. enrich — optional vision summary per page; structured field extraction
  5. index — chunk, embed, upsert vector store (Track 2)
  6. notify — webhook or SSE when status=ready

Use SQS + Lambda/ECS workers, Celery + Redis, or Temporal for retries and visibility timeouts. Idempotency key = doc_id + stage so poison messages do not double-charge OCR.
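The idempotency rule can be sketched as a thin wrapper around each stage. The in-memory dict here is a stand-in for a durable store (database row, Redis key); the class name and key format are assumptions.

```python
class StageRunner:
    """Run each (doc_id, stage) at most once, even if the queue redelivers the message."""

    def __init__(self) -> None:
        self._results = {}  # stand-in for a durable store shared by all workers

    def run(self, doc_id: str, stage: str, fn):
        key = f"{doc_id}:{stage}"  # idempotency key = doc_id + stage
        if key in self._results:
            return self._results[key]  # redelivery: reuse the result, do not pay for OCR again
        result = fn()
        self._results[key] = result
        return result


ocr_calls = []


def run_ocr():
    ocr_calls.append(1)  # pretend this is a billable cloud OCR call
    return {"text": "extracted text", "confidence": 0.93}


runner = StageRunner()
runner.run("doc-123", "extract", run_ocr)
runner.run("doc-123", "extract", run_ocr)  # duplicate delivery
print(len(ocr_calls))  # 1
```

In a real deployment the check-and-set must be atomic across workers (for example a conditional write), which this single-process sketch does not attempt.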

Storage patterns

| Artifact | Storage | Lifecycle |
|---|---|---|
| Original upload | Object storage, encrypted | TTL or legal hold |
| Page PNG renders | Object storage, content-addressed | Delete after OCR unless audit needs |
| Extracted text JSON | DB or object storage | Version with parser_version |
| Structured fields | PostgreSQL JSONB | Source of truth for apps |
| Embeddings | Vector DB | Re-index on model change |
Observability

  • Per-stage latency histograms (p50/p95 ingest → ready)
  • OCR confidence distribution; alert if median drops (model drift or bad scan batch)
  • Vision token usage per doc type and tenant
  • DLQ depth and age; replay tooling with admin audit

Production readiness checklist

  • Max upload size and page count enforced at edge
  • Async status API; never block HTTP 30s on 200-page PDF
  • Parser + OCR version pinned; golden docs in CI (Track 5)
  • PII redaction in logs; presigned URLs expire in minutes
  • Cost caps per tenant on vision and cloud OCR
  • Human review queue for low-confidence extractions
💡 Pro Tip

Expose processing_status and page_ready_count in your API so the UI can show progress. Users tolerate 60s ingest if they see “page 12 of 40 indexed”—not a spinner with no feedback.

📦 Real World

Loan origination platforms queue overnight batch OCR for tax returns while synchronous path handles W-2 prebuilt models in <5s. Same product, two SLAs—routing is business rules, not one pipeline fits all.

🎯 Interview Tip

“User uploads 500-page PDF for chat.” — API accepts file → S3 → job ID → workers chunk + embed → poll/WebSocket ready. Mention virus scan, tenant isolation, cost estimate upfront, and refusal if over page quota.