Structured data extraction & agents

Extraction basics — from entities to schemas

“Extraction” spans a spectrum: find spans (NER), link entities (relations), assemble timelines (events), or fill a full document schema (invoice, contract). Modern stacks mix classical NLP for cheap high-precision spans with LLMs for messy layout and long-tail fields—always landing on a typed schema before any database write.

Named entity recognition (NER)

NER labels token spans with types: ORG, DATE, MONEY, PERSON, domain tags like INVOICE_ID. Use spaCy, Flair, or fine-tuned transformers when you have labeled data and need sub-10ms per paragraph. LLM NER is flexible but expensive—reserve it for documents where layout breaks classical models or entity types evolve weekly.

Approach	Best for	Watch out
Rule / regex NER	Invoice numbers, ISO dates, currency with known patterns	Breaks on international formats; brittle on OCR noise
Transformer NER	High-volume homogeneous docs (medical, legal entities)	Retrain when schema adds fields; needs GPU for batch
LLM span extraction	Messy scans, novel layouts, few-shot new entity types	Hallucinated spans; always validate offsets against source text
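
The first and last rows of the table can be sketched together: regex patterns produce spans with exact offsets, and the same offset check rejects an LLM span whose text does not match the source. This is a minimal sketch; the `Span` type, the two patterns, and the sample sentence are illustrative assumptions, not part of any library.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int   # inclusive character offset into the source text
    end: int     # exclusive character offset
    label: str
    text: str


# Hypothetical patterns for illustration; real invoice formats vary widely.
PATTERNS = {
    "INVOICE_ID": re.compile(r"\bINV-\d{3,10}\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def regex_spans(source: str) -> list[Span]:
    """Rule-based NER: every pattern match becomes a labeled span."""
    spans = [
        Span(m.start(), m.end(), label, m.group())
        for label, pattern in PATTERNS.items()
        for m in pattern.finditer(source)
    ]
    return sorted(spans, key=lambda s: s.start)


def span_is_grounded(source: str, span: Span) -> bool:
    """True only if the claimed offsets reproduce the claimed text exactly."""
    if not 0 <= span.start < span.end <= len(source):
        return False
    return source[span.start:span.end] == span.text


source = "Invoice INV-00421 issued 2024-03-15 by Acme GmbH."
spans = regex_spans(source)
# A span as an LLM might return it, with two digits transposed.
llm_span = Span(8, 17, "INVOICE_ID", "INV-00412")
print([(s.label, s.text) for s in spans])
print(span_is_grounded(source, llm_span))
```

Rejecting on an exact mismatch is deliberately strict; a production check might fall back to searching for the span text near the claimed offset before discarding it.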

Relation extraction

Relation extraction links entities: (Vendor X) —[supplies]→ (Product Y), (Party A) —[signed_on]→ (2024-03-15). Pairwise classifiers and dependency-parse features still win on clean text; LLMs excel when relations are implicit (“the aforementioned supplier”) or spread across sections. Store relations as typed edges with provenance (source_page, quote) so downstream agents can cite, not guess.
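
A typed edge with provenance might look like the following sketch. The `RelationEdge` fields mirror the names in the paragraph above (source_page, quote); the `pages` mapping and helper names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RelationEdge:
    head: str
    relation: str
    tail: str
    source_page: int   # page the quote was taken from
    quote: str         # verbatim snippet supporting the relation


def quote_on_page(edge: RelationEdge, pages: dict[int, str]) -> bool:
    """Reject edges whose supporting quote is not on the cited page."""
    return edge.quote in pages.get(edge.source_page, "")


def cite(edge: RelationEdge) -> str:
    """Render the edge with its citation for a downstream agent."""
    return (
        f"({edge.head}) -[{edge.relation}]-> ({edge.tail}) "
        f'[p.{edge.source_page}: "{edge.quote}"]'
    )


pages = {3: "This Agreement was signed by Party A on 2024-03-15."}
edge = RelationEdge(
    head="Party A",
    relation="signed_on",
    tail="2024-03-15",
    source_page=3,
    quote="signed by Party A on 2024-03-15",
)
print(cite(edge), quote_on_page(edge, pages))
```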

Event extraction

Event extraction bundles who did what, when, where: renewal notices, payment due dates, breach triggers. Event schemas are small graphs—trigger, arguments (agent, patient, time), optional attributes. For compliance, events need immutable IDs and links to source clauses; regenerating events on re-extract must diff, not overwrite silently.
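
One way to diff rather than overwrite on re-extract is sketched below. Deriving the stable ID from trigger plus clause reference is an assumption made for this example; the `Event` shape and `diff_events` helper are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    trigger: str
    clause_ref: str
    arguments: tuple[tuple[str, str], ...]  # (role, value) pairs

    @property
    def event_id(self) -> str:
        # Assumption: trigger + source clause identify the same event across runs.
        key = f"{self.trigger}|{self.clause_ref}"
        return hashlib.sha256(key.encode()).hexdigest()[:16]


def diff_events(old: list[Event], new: list[Event]) -> dict[str, list[str]]:
    """Compare two extraction runs by stable ID instead of replacing the old one."""
    old_by_id = {e.event_id: e for e in old}
    new_by_id = {e.event_id: e for e in new}
    shared = old_by_id.keys() & new_by_id.keys()
    return {
        "added": sorted(new_by_id.keys() - old_by_id.keys()),
        "removed": sorted(old_by_id.keys() - new_by_id.keys()),
        "changed": sorted(i for i in shared if old_by_id[i] != new_by_id[i]),
    }


old_run = [
    Event("renewal_notice", "12.2", (("time", "90 days"),)),
    Event("payment_due", "4.1", (("time", "net 30"),)),
]
new_run = [
    Event("renewal_notice", "12.2", (("time", "60 days"),)),
    Event("breach_trigger", "15.3", (("agent", "Supplier"),)),
]
print(diff_events(old_run, new_run))
```

The diff result is what gets reviewed and stored; the earlier run stays untouched.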

Schema-driven extraction (Pydantic / Instructor)

Document-level extraction maps full text (from Guide 4 ingest) into a Pydantic model or Java record. Instructor sends the schema from model_json_schema() to the provider, parses the response, validates it, and retries with the validation errors attached, which is the production pattern from Track 3 reliable formatting. Spring AI offers .entity(InvoiceExtract.class) and OpenAI json_schema options for the same contract on the JVM.

import instructor
from decimal import Decimal
from pydantic import BaseModel, Field, field_validator
from openai import OpenAI
from typing import Annotated

class LineItem(BaseModel):
    description: str = Field(max_length=500)
    quantity: Annotated[Decimal, Field(gt=0)]
    unit_price: Annotated[Decimal, Field(ge=0)]
    line_total: Annotated[Decimal, Field(ge=0)]

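    @field_validator("line_total")
    @classmethod
    def qty_times_price(cls, v: Decimal, info) -> Decimal:
        # info.data only holds fields that already passed validation, so skip
        # the cross-check (instead of raising KeyError) if one was rejected.
        data = info.data
        if "quantity" not in data or "unit_price" not in data:
            return v
        expected = data["quantity"] * data["unit_price"]
        if abs(v - expected) > Decimal("0.02"):
            raise ValueError(f"line_total {v} != qty*price {expected}")
        return v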

class InvoiceExtract(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str  # ISO date string from model; normalize in post-step
    currency: str = Field(pattern=r"^[A-Z]{3}$")
    line_items: list[LineItem]
    subtotal: Decimal
    tax: Decimal = Decimal("0")
    total: Decimal
    confidence: Annotated[float, Field(ge=0, le=1)]

client = instructor.from_openai(OpenAI())

def extract_invoice(document_text: str) -> InvoiceExtract:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,  # keep extraction output as repeatable as the API allows
        response_model=InvoiceExtract,
        max_retries=3,
        messages=[
            {"role": "system", "content": (
                "Extract invoice fields from OCR text. "
                "Set confidence 0-1 for overall extraction quality. "
                "Use only values present in the document."
            )},
            {"role": "user", "content": document_text},
        ],
    )

# Anthropic variant: reuses InvoiceExtract defined above
import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())

def extract_invoice(document_text: str) -> InvoiceExtract:
    return client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=4096,
        response_model=InvoiceExtract,
        max_retries=3,
        messages=[{"role": "user", "content": f"Extract invoice:\n\n{document_text}"}],
    )

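# AWS Bedrock variant: reuses InvoiceExtract defined above
import boto3
import instructor
from instructor import from_bedrock

# Region and credentials come from the standard AWS configuration chain.
bedrock_client = boto3.client("bedrock-runtime")
client = from_bedrock(bedrock_client)

def extract_invoice(document_text: str) -> InvoiceExtract:
    return client.chat.completions.create(
        # Bedrock model IDs carry a version suffix; confirm the ID enabled in
        # your account. Some Instructor versions expect modelId= instead of model=.
        model="anthropic.claude-3-5-haiku-20241022-v1:0",
        response_model=InvoiceExtract,
        max_retries=3,
        messages=[{"role": "user", "content": document_text}],
    )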

public record LineItem(
    String description,
    @JsonProperty("quantity") BigDecimal quantity,
    @JsonProperty("unit_price") BigDecimal unitPrice,
    @JsonProperty("line_total") BigDecimal lineTotal
) {
  public LineItem {
    BigDecimal expected = quantity.multiply(unitPrice);
    if (lineTotal.subtract(expected).abs().compareTo(new BigDecimal("0.02")) > 0) {
      throw new IllegalArgumentException("line_total mismatch");
    }
  }
}

public record InvoiceExtract(
    String vendorName,
    String invoiceNumber,
    String invoiceDate,
    String currency,
    List<LineItem> lineItems,
    BigDecimal subtotal,
    BigDecimal tax,
    BigDecimal total,
    double confidence
) {}

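@Service
public class InvoiceExtractor {
  private final ChatClient chatClient;

  // The final field needs a constructor; Spring Boot supplies ChatClient.Builder.
  public InvoiceExtractor(ChatClient.Builder builder) {
    this.chatClient = builder.build();
  }

  public InvoiceExtract extract(String documentText) {
    return chatClient.prompt()
        .system("Extract invoice fields. Set confidence 0-1. Only document values.")
        .user(documentText)
        // Builder method names depend on the Spring AI version: with* in early
        // milestones, unprefixed model(...) / responseFormat(...) in 1.0.
        .options(OpenAiChatOptions.builder()
            .withModel("gpt-4o-mini")
            // invoiceSchema() is assumed to return the JSON Schema string for InvoiceExtract.
            .withResponseFormat(new ResponseFormat(ResponseFormat.Type.JSON_SCHEMA, invoiceSchema()))
            .build())
        .call()
        .entity(InvoiceExtract.class);
  }
}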

// Anthropic variant: anthropicChatClient is an injected ChatClient backed by an Anthropic model
public InvoiceExtract extract(String documentText) {
  return anthropicChatClient.prompt()
      .user("Extract invoice as JSON matching schema:\n" + documentText)
      .call()
      .entity(InvoiceExtract.class);
}

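// AWS Bedrock variant: bedrockChatClient is an injected ChatClient backed by Bedrock
public InvoiceExtract extract(String documentText) {
  return bedrockChatClient.prompt()
      .user(documentText)
      // The options class and builder method names vary by Spring AI version;
      // treat this call as illustrative and check your release.
      .options(BedrockAnthropicChatOptions.builder()
          // Bedrock model IDs carry a version suffix.
          .withModel("anthropic.claude-3-5-haiku-20241022-v1:0")
          .build())
      .call()
      .entity(InvoiceExtract.class);
}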

Extraction task comparison

Task	Output shape	Typical tool
NER	Spans + labels	spaCy, GLiNER, LLM + offset validation
Relation	(head, relation, tail) triples	RE models, LLM structured triple list
Event	Trigger + argument frames	ACE-style parsers, schema JSON
Schema fill	Full document POJO	Instructor, Spring AI .entity()

⚠️ Pitfall

Skipping validation because “the model returned JSON.” Strings where you expected decimals, invented invoice numbers, and missing line items pass json.loads every time. Pydantic validators and Jackson records are your API boundary.

⚖️ Trade-off

Classical NER is cheaper and auditable but needs retraining when schemas change. LLM schema fill adapts in hours but costs more per page and hallucinates on poor OCR. Hybrid: regex for IDs and dates, LLM for line items and free text.

💡 Pro Tip

Add source_quotes: dict[str, str] to your schema mapping field names to verbatim snippets. Validators can fuzzy-match quotes against source text—cheap grounding without full RAG.
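
The quote check can be sketched with the standard library alone. This version uses difflib's SequenceMatcher over a sliding window; the 0.9 threshold, the helper names, and the sample OCR text are assumptions for illustration.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR line breaks do not matter."""
    return " ".join(text.lower().split())


def quote_is_grounded(quote: str, source: str, min_ratio: float = 0.9) -> bool:
    q, src = _normalize(quote), _normalize(source)
    if not q:
        return False
    if q in src:  # exact hit, the cheap and common case
        return True
    # Otherwise slide a quote-sized window over the source and accept the
    # first window that is similar enough. Quadratic, so meant for short
    # quotes checked against a single page.
    window = len(q)
    for start in range(max(1, len(src) - window + 1)):
        candidate = src[start:start + window]
        if SequenceMatcher(None, q, candidate).ratio() >= min_ratio:
            return True
    return False


def ungrounded_fields(source_quotes: dict[str, str], source: str) -> list[str]:
    """Field names whose quote cannot be found in the source text."""
    return [f for f, q in source_quotes.items() if not quote_is_grounded(q, source)]


source = "Invoice No: INV-00421\nTotal due: 1,250.00 EUR"
source_quotes = {
    "invoice_number": "Invoice No: INV-00421",
    "total": "Total due: 1,250.OO EUR",       # OCR-style O/0 confusion
    "vendor_name": "Globex Corporation Ltd",   # not in the document
}
print(ungrounded_fields(source_quotes, source))
```

Fields returned here are natural candidates for a retry message or a HITL flag, since the model could not point to where the value came from.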

🎯 Interview Tip

“NER vs schema extraction?” — NER finds mentions; schema extraction produces business records. Production needs both: NER for search and highlighting, schema for ERP/CLM writes with validation.

Production pipeline — five steps to trusted records

Every extraction job—batch overnight or real-time upload—should follow the same five steps. Async workers from Guide 4 feed step 1; steps 2–3 are synchronous per document; steps 4–5 are transactional and auditable.

flowchart LR
  subgraph ingest["Guide 4 ingest"]
    A[PDF / scan / email] --> B[Extract text + tables]
  end
  B --> C[LLM schema fill]
  C --> D{Pydantic / Jackson validate}
  D -->|pass| E[(Store DB + object refs)]
  D -->|fail| F[Repair retry max 3]
  F --> C
  E --> G{confidence >= 0.8?}
  G -->|yes| H[Auto-approve downstream]
  G -->|no| I[HITL review queue]
  I --> J[Human corrects + approves]
  J --> E

Step 1 — Extract text

Reuse the tiered PDF pipeline from multimodal ingest: native text, OCR, layout parse. Persist document_id, extraction_tier, page-level text, and table CSV URIs. Extraction quality ceiling is set here—no schema prompt fixes garbage OCR.

Step 2 — LLM schema fill

Route by document classifier output (invoice, contract, generic). Each route binds a different Pydantic model and system prompt. Temperature 0; cap output tokens to schema size + margin. Log model ID, prompt hash, and input token count for cost attribution per tenant.

Step 3 — Validate

Validation layers: (1) schema types and constraints, (2) business rules (line totals, date ordering, party IDs in master data), (3) optional cross-field checks against ERP lookup APIs. Failed validation triggers Instructor retry with error text—not silent discard. After max retries, send to HITL with status=validation_failed.

Step 4 — Store DB

Write immutable extraction version: extractions table with JSONB payload, schema version, model version, confidence, and link to source document_id. Never update in place for compliance—append version 2 when re-run. Object storage holds raw PDF and parsed text for replay.
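
An append-only version table can be sketched as follows. SQLite with a JSON text column stands in for the JSONB column described above, and the column and function names are assumptions for this example.

```python
import json
import sqlite3


def init_extractions_schema(conn: sqlite3.Connection) -> None:
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS extractions (
            document_id    TEXT    NOT NULL,
            version        INTEGER NOT NULL,
            schema_version TEXT    NOT NULL,
            model_version  TEXT    NOT NULL,
            confidence     REAL    NOT NULL,
            payload        TEXT    NOT NULL,  -- JSON text here; JSONB in PostgreSQL
            PRIMARY KEY (document_id, version)
        )
        """
    )


def append_version(conn, document_id, payload, schema_version, model_version, confidence) -> int:
    """Insert the next version for a document; earlier rows are never updated."""
    with conn:  # one transaction: commit on success, roll back on error
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM extractions WHERE document_id = ?",
            (document_id,),
        ).fetchone()
        version = latest + 1
        conn.execute(
            "INSERT INTO extractions VALUES (?, ?, ?, ?, ?, ?)",
            (document_id, version, schema_version, model_version,
             confidence, json.dumps(payload, sort_keys=True)),
        )
    return version


conn = sqlite3.connect(":memory:")
init_extractions_schema(conn)
v1 = append_version(conn, "doc-1", {"total": "100.00"}, "invoice-v1", "model-a", 0.91)
v2 = append_version(conn, "doc-1", {"total": "110.00"}, "invoice-v1", "model-b", 0.88)
print(v1, v2)
```

The composite primary key means two workers racing on the same document cannot both write the same version; one insert fails and can be retried.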

Step 5 — Human review queue

Items land in review when confidence < 0.8, validation failed, or policy flags (amount > threshold, new vendor). Review UI shows side-by-side PDF highlight + editable form fields. Approved corrections feed golden datasets for eval CI (Track 5).
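
The three review triggers can be collected into one function that returns every reason that applies, so the review UI can show why an item was held. The 0.8 threshold comes from the text; the amount threshold, `ReviewPolicy`, and the reason strings are hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ReviewPolicy:
    confidence_threshold: float = 0.8             # default gate from the text
    amount_threshold: Decimal = Decimal("10000")  # hypothetical policy value


def review_reasons(
    confidence: float,
    validation_failed: bool,
    amount: Decimal,
    vendor_is_new: bool,
    policy: ReviewPolicy = ReviewPolicy(),
) -> list[str]:
    """Return every reason this extraction needs a human; empty means none."""
    reasons = []
    if validation_failed:
        reasons.append("validation_failed")
    if confidence < policy.confidence_threshold:
        reasons.append("low_confidence")
    if amount > policy.amount_threshold:
        reasons.append("amount_over_threshold")
    if vendor_is_new:
        reasons.append("new_vendor")
    return reasons


print(review_reasons(0.95, False, Decimal("25000"), True))
```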

Step	SLA target	Failure mode
1 Extract text	Seconds (sync) or minutes (batch queue)	Empty text → escalate OCR tier
2 Schema fill	2–15 s per doc (model dependent)	Rate limit → gateway queue
3 Validate	<100 ms	Retry then HITL
4 Store	Transactional <50 ms	DLQ on DB conflict
5 HITL	Business-hours SLA	Escalation on aging queue

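from enum import Enum
from uuid import uuid4

from instructor.exceptions import InstructorRetryException
from pydantic import ValidationError

HITL_THRESHOLD = 0.8

class ExtractionStatus(str, Enum):
    AUTO_APPROVED = "auto_approved"
    PENDING_REVIEW = "pending_review"
    VALIDATION_FAILED = "validation_failed"

# ingest_service, classifier, ROUTES, business_rules, and downstream are
# application-level collaborators assumed to be wired up elsewhere.
def run_pipeline(document_id: str, db, review_queue) -> str:
  text_bundle = ingest_service.extract_text(document_id)  # Step 1
  doc_type = classifier.predict(text_bundle.plain_text)
  schema_fn = ROUTES[doc_type]  # invoice | contract

  try:
    record = schema_fn(text_bundle.plain_text)  # Step 2: Instructor inside
  except (ValidationError, InstructorRetryException) as e:
    # Instructor raises InstructorRetryException once max_retries is exhausted.
    review_queue.enqueue(document_id, reason=str(e), status=ExtractionStatus.VALIDATION_FAILED)
    return ExtractionStatus.VALIDATION_FAILED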

  business_errors = business_rules.validate(record, doc_type)  # Step 3
  if business_errors:
    review_queue.enqueue(document_id, reason=business_errors, status=ExtractionStatus.VALIDATION_FAILED)
    return ExtractionStatus.VALIDATION_FAILED

  extraction_id = str(uuid4())
  db.save_extraction(extraction_id, document_id, record.model_dump(), record.confidence)  # Step 4

  if record.confidence < HITL_THRESHOLD:
    review_queue.enqueue(document_id, extraction_id=extraction_id, status=ExtractionStatus.PENDING_REVIEW)  # Step 5
    return ExtractionStatus.PENDING_REVIEW

  downstream.publish_auto_approved(extraction_id)
  return ExtractionStatus.AUTO_APPROVED

@Service
public class ExtractionPipeline {
  static final double HITL_THRESHOLD = 0.8;

  public ExtractionStatus run(UUID documentId) {
    TextBundle text = ingestService.extractText(documentId);
    DocType type = classifier.classify(text.plainText());
    InvoiceExtract record = extractors.get(type).extract(text.plainText());

    List<String> errors = businessRules.validate(record, type);
    if (!errors.isEmpty()) {
      reviewQueue.enqueue(documentId, errors, ExtractionStatus.VALIDATION_FAILED);
      return ExtractionStatus.VALIDATION_FAILED;
    }

    UUID extractionId = extractionRepo.save(documentId, record);
    if (record.confidence() < HITL_THRESHOLD) {
      reviewQueue.enqueue(documentId, extractionId, ExtractionStatus.PENDING_REVIEW);
      return ExtractionStatus.PENDING_REVIEW;
    }
    downstreamPublisher.autoApprove(extractionId);
    return ExtractionStatus.AUTO_APPROVED;
  }
}

📦 Real World

AP teams auto-approve 60–70% of invoices after tuning confidence thresholds; the rest clear in a 24-hour review SLA. Pipeline metrics: auto-approve rate, validation retry rate, mean time in HITL queue.

⚠️ Pitfall

Writing to ERP before validation completes. Use outbox pattern: DB commit for extraction record first, then async webhook to ERP only after status=auto_approved or human approval event.
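
A minimal outbox sketch, assuming SQLite and hypothetical table names: the approval and the outbox row commit in one transaction, and a separate drain step delivers to the ERP afterwards.

```python
import sqlite3
from typing import Callable


def init_outbox_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE approvals (
            extraction_id TEXT PRIMARY KEY,
            approved_by   TEXT NOT NULL
        );
        CREATE TABLE outbox (
            id            INTEGER PRIMARY KEY AUTOINCREMENT,
            extraction_id TEXT NOT NULL,
            event_type    TEXT NOT NULL,
            published     INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def record_approval(conn: sqlite3.Connection, extraction_id: str, approved_by: str) -> None:
    """Approval and outbox row commit together, or neither does."""
    with conn:
        conn.execute("INSERT INTO approvals VALUES (?, ?)", (extraction_id, approved_by))
        conn.execute(
            "INSERT INTO outbox (extraction_id, event_type) VALUES (?, ?)",
            (extraction_id, "erp.invoice_approved"),
        )


def drain_outbox(conn: sqlite3.Connection, send: Callable[[str, str], None]) -> int:
    """Deliver unpublished events in order; a failed send leaves the row for retry."""
    rows = conn.execute(
        "SELECT id, extraction_id, event_type FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    sent = 0
    for row_id, extraction_id, event_type in rows:
        send(extraction_id, event_type)  # raises on failure, row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
        sent += 1
    return sent


conn = sqlite3.connect(":memory:")
init_outbox_schema(conn)
record_approval(conn, "ext-1", approved_by="auto")
delivered = []
first = drain_outbox(conn, lambda eid, etype: delivered.append((eid, etype)))
second = drain_outbox(conn, lambda eid, etype: delivered.append((eid, etype)))
print(first, second, delivered)
```

Delivery here is at-least-once: if the process dies between send and the published update, the event is sent again, so the ERP webhook should deduplicate on extraction_id.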

🔒 Security

Review queue UIs expose PII and contract terms—RBAC per queue, audit log on every field edit, no LLM re-call on corrected data without explicit “re-extract” action.

💡 Pro Tip

Idempotency key = hash(document_bytes) + schema_version. Re-uploads skip LLM if cached extraction exists unless force=true.
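
A sketch of that key and the cache check, using an in-memory dict as a stand-in for whatever store holds prior extractions. The function names and the `extract` callable are assumptions for illustration.

```python
import hashlib
from typing import Any, Callable


def idempotency_key(document_bytes: bytes, schema_version: str) -> str:
    """Same bytes + same schema version always produce the same key."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"{digest}:{schema_version}"


def extract_once(
    document_bytes: bytes,
    schema_version: str,
    cache: dict[str, Any],
    extract: Callable[[bytes], Any],
    force: bool = False,
) -> Any:
    key = idempotency_key(document_bytes, schema_version)
    if not force and key in cache:
        return cache[key]          # re-upload: skip the LLM call
    result = extract(document_bytes)
    cache[key] = result
    return result


calls = []

def fake_extract(data: bytes) -> dict:
    """Stand-in for the LLM extraction; records each invocation."""
    calls.append(len(data))
    return {"bytes_seen": len(data)}

cache: dict[str, Any] = {}
extract_once(b"%PDF-1.7 sample", "invoice-v1", cache, fake_extract)
extract_once(b"%PDF-1.7 sample", "invoice-v1", cache, fake_extract)              # cached
extract_once(b"%PDF-1.7 sample", "invoice-v2", cache, fake_extract)              # new schema
extract_once(b"%PDF-1.7 sample", "invoice-v2", cache, fake_extract, force=True)  # forced
print(len(calls))
```

Hashing the raw bytes means a re-scan of the same paper invoice gets a different key; catching that case is the job of duplicate detection, not idempotency.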

Confidence scoring — when to trust the model

A single float gates billions in payable invoices. Combine self-reported confidence from the schema, ensemble runs when stakes are high, and deterministic checks. Route everything below 0.8 to HITL unless policy overrides.

Self-reported confidence

Add confidence: float to your Pydantic model with Field(ge=0, le=1). Instruct the model: “0.9+ only when every field has clear textual evidence; 0.5–0.7 when inferred; below 0.5 when guessing.” Calibrate monthly: plot confidence vs human override rate. Models are often overconfident—apply Platt scaling or a simple linear shrink if overrides cluster above 0.85.
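
The calibration check can be sketched as two small functions: override rate per confidence bucket, and a single shrink factor fitted so that mean confidence matches observed accuracy. This is the simple linear shrink, not Platt scaling (which fits a logistic curve); the record format, bucket count, and sample data are assumptions.

```python
from collections import defaultdict

N_BUCKETS = 10  # bucket k covers confidence in [k/10, (k+1)/10)


def override_rate_by_bucket(records: list[tuple[float, bool]]) -> dict[int, float]:
    """records holds (self-reported confidence, human overrode the result)."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [overrides, total]
    for confidence, overridden in records:
        bucket = min(int(confidence * N_BUCKETS), N_BUCKETS - 1)
        counts[bucket][0] += int(overridden)
        counts[bucket][1] += 1
    return {b: overrides / total for b, (overrides, total) in sorted(counts.items())}


def fit_shrink_factor(records: list[tuple[float, bool]]) -> float:
    """Factor that scales mean confidence down to observed accuracy (never up)."""
    accuracy = sum(not overridden for _, overridden in records) / len(records)
    mean_confidence = sum(c for c, _ in records) / len(records)
    if mean_confidence == 0:
        return 1.0
    return min(1.0, accuracy / mean_confidence)


def shrink(confidence: float, factor: float) -> float:
    return confidence * factor


# Hypothetical review log: the model is confident, reviewers often disagree.
records = [(0.95, True), (0.95, False), (0.9, False), (0.9, False), (0.65, True)]
rates = override_rate_by_bucket(records)
factor = fit_shrink_factor(records)
print(rates, round(factor, 3), round(shrink(0.95, factor), 3))
```

With this sample log a raw 0.95 shrinks below the 0.8 gate, which is the intended effect when high-confidence extractions are being overridden.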

Ensemble N-times

For high-value contracts or invoices above dollar thresholds, run N independent extractions (N=3 or 5) at temperature 0 with different prompt seeds or models. The variation matters: repeating an identical prompt against the same model at temperature 0 mostly reproduces the same answer and adds little signal. Aggregate: majority vote on categorical fields; median on numerics; flag fields with disagreement > tolerance. Ensemble confidence = agreement fraction per field (e.g. 3/3 runs agree → 1.0; 2/3 → 0.67 → HITL).

Signal	Cost	When to use
Self-reported only	1× LLM call	Low-risk docs, high volume AP
Ensemble N=3	3× LLM calls	Invoices > $10k, new vendors
Ensemble + verifier model	4× calls	Legal contracts, regulatory filings
Deterministic validators	CPU only	Always—math, dates, master data lookup

from collections import Counter
from decimal import Decimal
from statistics import median

HITL_THRESHOLD = 0.8
ENSEMBLE_N = 3

def ensemble_extract(text: str) -> tuple[InvoiceExtract, float]:
    # Runs only count as independent if extract_invoice varies the prompt or
    # model per call; identical temperature-0 calls mostly repeat each other.
    runs = [extract_invoice(text) for _ in range(ENSEMBLE_N)]

    # Categorical field: majority vote; agreement = share of runs that match.
    numbers = [r.invoice_number for r in runs]
    winning_number, votes = Counter(numbers).most_common(1)[0]
    number_agreement = votes / ENSEMBLE_N

    # Numeric field: compare as Decimal so 100.0 and 100.00 count as equal.
    totals = [r.total for r in runs]
    total_agreement = 1.0 if max(totals) - min(totals) <= Decimal("0.05") else 0.0

    ensemble_confidence = min(number_agreement, total_agreement)
    self_conf = median(r.confidence for r in runs)
    final_confidence = min(ensemble_confidence, self_conf)

    # model_copy(update=...) skips validation, so keep total a Decimal and
    # take the voted invoice number rather than whatever runs[0] returned.
    merged = runs[0].model_copy(update={
        "invoice_number": winning_number,
        "total": median(totals),
        "confidence": final_confidence,
    })
    return merged, final_confidence

def needs_hitl(confidence: float) -> bool:
    return confidence < HITL_THRESHOLD

public record EnsembleResult(InvoiceExtract merged, double confidence) {}

public EnsembleResult ensembleExtract(String text) {
  List<InvoiceExtract> runs = IntStream.range(0, 3)
      .mapToObj(i -> extractor.extract(text))
      .toList();

  double numberAgreement = agreementOn(runs, InvoiceExtract::invoiceNumber);
  double totalAgreement = numericAgreement(runs.stream().map(InvoiceExtract::total).toList());
  double selfMedian = median(runs.stream().mapToDouble(InvoiceExtract::confidence).toArray());
  double finalConfidence = Math.min(Math.min(numberAgreement, totalAgreement), selfMedian);

  InvoiceExtract merged = mergeMajority(runs);
  merged = merged.withConfidence(finalConfidence);
  return new EnsembleResult(merged, finalConfidence);
}

public boolean needsHitl(double confidence) {
  return confidence < 0.8;
}

HITL threshold < 0.8

Default gate: confidence < 0.8 → human review. Tune per document type—contracts may use 0.85, low-value receipts 0.75 with extra fraud checks. Never lower threshold to “improve automation rate” without measuring error cost: a 1% false auto-approve on $50k invoices dwarfs reviewer salary.
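
The error-cost comparison is simple arithmetic. In the sketch below the 1% rate and $50k amount come from the text; the monthly volume, minutes per review, and hourly rate are hypothetical inputs to replace with your own.

```python
from decimal import Decimal


def false_approve_exposure(n_auto_approved: int, false_rate: Decimal, avg_amount: Decimal) -> Decimal:
    """Dollar value of invoices expected to be auto-approved in error."""
    return n_auto_approved * false_rate * avg_amount


def review_cost(n_reviewed: int, minutes_per_review: Decimal, hourly_rate: Decimal) -> Decimal:
    """Reviewer cost of sending the same invoices through HITL instead."""
    return n_reviewed * minutes_per_review / Decimal(60) * hourly_rate


# 1% and $50k are from the text; 1,000 invoices, 5 minutes, $40/hour are assumptions.
exposure = false_approve_exposure(1000, Decimal("0.01"), Decimal("50000"))
cost = review_cost(1000, Decimal(5), Decimal(40))
print(exposure, round(cost, 2))
```

Exposure is an upper bound on loss, since some erroneous payments are recovered, but the gap between the two figures is the point of the comparison.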

⚖️ Trade-off

Ensemble reduces variance but triples token cost. Apply dollar-tier routing: ensemble only when estimated_amount > threshold or vendor is new.

⚠️ Pitfall

Trusting self-reported confidence without calibration. Models routinely output 0.95 on wrong totals. Always blend with deterministic checks and periodic override audits.

🎯 Interview Tip

“How do you decide auto-approve vs human review?” — Schema validation first, blended confidence (self + ensemble + rules), threshold 0.8 default, dollar-tier overrides, metrics on override rate by confidence bucket.

📦 Real World

Teams that ensemble only the top 5% of documents by amount capture most fraud/error risk while keeping mean cost per doc nearly flat: 0.95 × 1 + 0.05 × 3 = 1.10× the single-call cost. A FinOps dashboard splits extraction spend by tier.

Contract agent — parties, obligations, and risk

Contracts are long, cross-referenced, and high stakes. A contract agent is not one-shot extraction: it orchestrates section-aware reads, structured obligation graphs, version diffs against previously executed versions, and risk summaries for legal ops.

Core extractions

Parties — legal names, roles (licensor/licensee), addresses, signatories with title
Dates — effective, expiration, renewal notice windows, termination for convenience periods
Obligations — payment terms, SLAs, indemnity caps, data processing, audit rights—each with clause reference

Field group	Agent tool	Output
Parties & dates	Schema fill on cover + signature pages	ContractHeader record
Obligations	Section retrieval + per-clause extract	List of Obligation with clause_id
Version compare	Diff tool vs stored v{n-1}	ClauseChange[]
Risk analysis	Policy rules + LLM summary	Ranked RiskFlag with severity

Compare versions

When a redline PDF arrives, normalize both versions to clause-aligned text (heading detection from ingest). Agent calls diff_clauses(v_prev, v_new)—embedding similarity to match moved clauses, then LLM summarizes material changes (liability cap lowered, auto-renewal added). Store diff artifact for CLM; never overwrite prior extraction version.
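
The matching step of diff_clauses can be sketched as below. To stay dependency-free this version swaps the embedding similarity described above for difflib's character-level SequenceMatcher, and it leaves the LLM summary out; the `ClauseChange` fields, status names, and 0.6 threshold are assumptions.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass(frozen=True)
class ClauseChange:
    status: str               # "modified" | "moved" | "removed" | "added"
    prev_id: Optional[str]
    new_id: Optional[str]
    similarity: float


def diff_clauses(
    v_prev: dict[str, str], v_new: dict[str, str], match_threshold: float = 0.6
) -> list[ClauseChange]:
    """Greedily pair each previous clause with its most similar new clause."""
    changes: list[ClauseChange] = []
    unmatched = dict(v_new)
    for prev_id, prev_text in v_prev.items():
        best_id, best = None, 0.0
        for new_id, new_text in unmatched.items():
            score = SequenceMatcher(None, prev_text, new_text).ratio()
            if score > best:
                best_id, best = new_id, score
        if best_id is None or best < match_threshold:
            changes.append(ClauseChange("removed", prev_id, None, round(best, 3)))
            continue
        del unmatched[best_id]
        if best < 1.0:
            changes.append(ClauseChange("modified", prev_id, best_id, round(best, 3)))
        elif best_id != prev_id:
            changes.append(ClauseChange("moved", prev_id, best_id, 1.0))
        # identical text under the same ID is unchanged and produces no entry
    changes.extend(ClauseChange("added", None, new_id, 0.0) for new_id in unmatched)
    return changes


v_prev = {
    "8.1": "Liability is capped at fees paid in the prior twelve months.",
    "9": "This agreement is governed by the laws of England.",
}
v_new = {
    "8.1": "Liability is capped at fees paid in the prior six months.",
    "10": "This agreement is governed by the laws of England.",
    "11": "This agreement renews automatically for successive one-year terms.",
}
for change in diff_clauses(v_prev, v_new):
    print(change)
```

Greedy pairing can mis-assign clauses when several are near-duplicates of each other; a fuller implementation would score all pairs before committing to a match.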

Risk analysis

Combine playbook rules (“uncapped liability → critical”) with LLM narrative for context. Risk output is advisory—legal counsel signs off. Log which model and policy version produced each flag for audit.
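
The rules half of that combination might look like the sketch below. Only the "uncapped liability → critical" mapping comes from the text; the auto-renewal rule, the regex patterns, and the policy version string are hypothetical, and keyword rules this crude will produce false positives.

```python
import re
from dataclasses import dataclass

POLICY_VERSION = "playbook-example-1"  # hypothetical; logged with every flag


@dataclass(frozen=True)
class PlaybookRule:
    topic: str
    severity: str
    pattern: re.Pattern


RULES = [
    PlaybookRule(
        "uncapped_liability", "critical",
        re.compile(r"\bunlimited liability\b|\bwithout (?:any )?limit(?:ation)?\b", re.I),
    ),
    PlaybookRule(
        "auto_renewal", "medium",
        re.compile(r"\bautomatically renew|\brenews automatically\b", re.I),
    ),
]

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def apply_playbook(clauses: dict[str, str]) -> list[dict]:
    """Return rule-based flags, most severe first, each tagged with its policy version."""
    flags = [
        {
            "severity": rule.severity,
            "topic": rule.topic,
            "clause_ref": clause_ref,
            "policy_version": POLICY_VERSION,
        }
        for clause_ref, text in clauses.items()
        for rule in RULES
        if rule.pattern.search(text)
    ]
    return sorted(flags, key=lambda f: SEVERITY_ORDER[f["severity"]])


clauses = {
    "3": "Fees are payable within 30 days.",
    "12.2": "This Agreement renews automatically for one-year terms.",
    "8.1": "Supplier's liability under this Agreement is without limitation.",
}
for flag in apply_playbook(clauses):
    print(flag)
```

The LLM narrative would then be generated per flag, with the clause text as context, so every sentence of the summary traces back to a rule hit.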

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field
from enum import Enum

class RiskSeverity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"

class Obligation(BaseModel):
    clause_ref: str
    obligation_type: str
    description: str = Field(max_length=1000)
    due_date: str | None = None
    source_quote: str

class RiskFlag(BaseModel):
    severity: RiskSeverity
    topic: str
    rationale: str
    clause_ref: str

class ContractAnalysis(BaseModel):
    parties: list[str]
    effective_date: str
    expiration_date: str | None
    obligations: list[Obligation]
    risk_flags: list[RiskFlag]
    confidence: float = Field(ge=0, le=1)

client = instructor.from_openai(OpenAI())

def analyze_contract(full_text: str, playbook: str) -> ContractAnalysis:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=ContractAnalysis,
        max_retries=2,
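        messages=[
            {"role": "system", "content": (
                "Extract contract parties, dates, obligations with clause refs. "
                f"Apply risk playbook:\n{playbook}"
            )},
            # The slice cuts by characters, not tokens, and silently drops
            # anything past the limit (often exhibits); section-wise reads avoid this.
            {"role": "user", "content": full_text[:120_000]},
        ],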
    )

def compare_versions(text_prev: str, text_new: str) -> list[dict]:
    # Stub: the full implementation runs a tool-calling loop around the
    # diff_clauses tool. Raising keeps callers from reading an empty list
    # as "no material changes".
    raise NotImplementedError("clause diff is not implemented in this excerpt")

public record Obligation(
    String clauseRef, String obligationType, String description,
    String dueDate, String sourceQuote
) {}

public record RiskFlag(RiskSeverity severity, String topic, String rationale, String clauseRef) {}

public record ContractAnalysis(
    List<String> parties,
    String effectiveDate,
    String expirationDate,
    List<Obligation> obligations,
    List<RiskFlag> riskFlags,
    double confidence
) {}

@Service
public class ContractAgent {
  public ContractAnalysis analyze(String fullText, String playbook) {
    return chatClient.prompt()
        .system("Extract parties, dates, obligations. Apply playbook: " + playbook)
        .user(fullText.substring(0, Math.min(fullText.length(), 120_000)))
        .call()
        .entity(ContractAnalysis.class);
  }

  public List<ClauseChange> compareVersions(String prev, String next) {
    return diffTool.diff(prev, next);  // Spring AI tool / custom bean
  }
}

🔒 Security

Contracts are confidential—route through VPC gateway, disable training on enterprise tier, redact counterparty PII in logs. Risk flags are not legal advice; watermark UI accordingly.

⚠️ Pitfall

Single-pass extraction on 80-page MSAs misses exhibits. Agent must iterate: TOC → section retrieve → per-section schema fill with token budget.
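
The iterate-by-section idea can be sketched as two steps: split on headings, then group sections into batches that fit a token budget. The heading regex and the four-characters-per-token estimate are rough assumptions for illustration; real contracts need the heading detection from ingest and the provider's tokenizer.

```python
import re

# Hypothetical heading pattern: "1. Title", "2.3 Title", or "EXHIBIT A Title".
HEADING = re.compile(r"^(?:\d+(?:\.\d+)*\.?|EXHIBIT [A-Z])[ \t]+\S.*$", re.MULTILINE)


def split_sections(text: str) -> list[tuple[str, str]]:
    """Return (heading, body) pairs in document order."""
    matches = list(HEADING.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group().strip(), text[match.end():end].strip()))
    return sections


def estimate_tokens(text: str) -> int:
    return len(text) // 4 + 1  # crude heuristic, not a real tokenizer


def batch_by_budget(sections: list[tuple[str, str]], max_tokens: int) -> list[list[tuple[str, str]]]:
    """Group consecutive sections so each batch stays within the budget."""
    batches, current, used = [], [], 0
    for heading, body in sections:
        cost = estimate_tokens(heading) + estimate_tokens(body)
        if current and used + cost > max_tokens:
            batches.append(current)
            current, used = [], 0
        current.append((heading, body))  # an oversized section gets its own batch
        used += cost
    if current:
        batches.append(current)
    return batches


text = (
    "1. Definitions\nTerms used here.\n"
    "2. Fees\nFees are due in 30 days.\n"
    "EXHIBIT A Service Levels\nUptime is 99.9%.\n"
)
sections = split_sections(text)
batches = batch_by_budget(sections, max_tokens=20)
print([[heading for heading, _ in batch] for batch in batches])
```

Each batch then goes through its own schema fill, so the exhibit is read as carefully as clause 1 instead of being truncated.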

💡 Pro Tip

Store obligations as rows keyed by clause_ref—version diff updates row status (added|modified|removed) instead of replacing the whole contract JSON.

🎯 Interview Tip

“Design a contract AI system.” — Ingest + clause segmentation, structured extraction per section, immutable versions, diff on redlines, rules-based risk + LLM summary, mandatory legal HITL on critical flags.

Invoice agent — AP automation with guardrails

Invoices are shorter than contracts but hit payment systems directly. The invoice agent extracts vendor and line items, validates math, detects duplicates against historical payments, and routes approvals by amount, cost center, and policy.

Vendor and line items

Match extracted vendor_name to ERP vendor master (fuzzy + tax ID when present). Line items need quantity × unit price = line total; sum(line totals) + tax ≈ total within rounding tolerance. Failed math → auto HITL regardless of confidence.
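
Vendor resolution can be sketched with tax ID as the first key and fuzzy name matching as the fallback. The master-data shape, the 0.85 cutoff, and the return labels are assumptions; difflib's get_close_matches stands in for whatever matcher the ERP integration uses.

```python
import re
from difflib import get_close_matches
from typing import Optional


def _norm(name: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]", "", name.lower()).split())


def resolve_vendor(
    name: str,
    tax_id: Optional[str],
    master: dict[str, dict],
    cutoff: float = 0.85,
) -> tuple[Optional[str], str]:
    """Return (vendor_id or None, how the decision was made)."""
    if tax_id:
        for vendor_id, row in master.items():
            if row.get("tax_id") == tax_id:
                return vendor_id, "tax_id"
    by_name = {_norm(row["name"]): vendor_id for vendor_id, row in master.items()}
    hits = get_close_matches(_norm(name), list(by_name), n=2, cutoff=cutoff)
    if len(hits) == 1:
        return by_name[hits[0]], "fuzzy_name"
    # Two close candidates is a mis-attribution risk: leave it to a human.
    return None, ("ambiguous" if hits else "no_match")


master = {
    "V1": {"name": "Acme GmbH", "tax_id": "DE123456789"},
    "V2": {"name": "Globex Corporation", "tax_id": None},
}
print(resolve_vendor("ACME Gmbh.", None, master))
print(resolve_vendor("Unreadable OCR name", "DE123456789", master))
print(resolve_vendor("Initech", None, master))
```

A result of None should route to HITL rather than defaulting to the nearest name.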

Math validation

Check	Rule	On failure
Line item	abs(qty × price - line_total) ≤ 0.02	Validation error + HITL
Subtotal	sum(line_totals) ≈ subtotal	Retry once, then HITL
Total	subtotal + tax ≈ total	Block auto-approve
Currency	ISO 4217 + match PO currency	Route to treasury review

Duplicate detection

Fingerprint: hash(vendor_id, invoice_number, total, invoice_date). Query last 24 months; fuzzy match on number variants (INV-001 vs 001). Duplicate candidate → hold payment, alert AP, never auto-approve.
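
An exact hash cannot match number variants on its own, so one option is to normalize the invoice number before comparing. The normalization rules below (drop punctuation, drop an alphabetic prefix, drop leading zeros) are assumptions for illustration and are intentionally aggressive, which suits a hold-for-review check but not an automatic rejection.

```python
import re


def invoice_number_key(raw: str) -> str:
    """Collapse common variants such as INV-001, inv 001, and 001 to one key."""
    compact = re.sub(r"[^A-Z0-9]", "", raw.upper())
    without_prefix = re.sub(r"^[A-Z]+", "", compact)  # drop prefixes like INV
    key = without_prefix.lstrip("0")
    return key or compact  # keep something if the number was all zeros or letters


variants = ["INV-001", "inv 001", "001", "1"]
print({v: invoice_number_key(v) for v in variants})
print(invoice_number_key("INV-2024-001"))
```

Storing this key alongside the exact fingerprint lets the duplicate query check both: exact match for certainty, normalized match for candidates.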

Approval routing

Rules engine after extraction: <$500 manager; $500–$5k director + cost center owner; >$5k VP + PO match required. Agent prepares approval packet (PDF link, extracted JSON, PO match score, duplicate check result)—humans click approve in existing workflow tool.

from decimal import Decimal
import hashlib

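def invoice_fingerprint(vendor_id: str, inv: InvoiceExtract) -> str:
    # Quantize so Decimal("100.0") and Decimal("100.00") hash identically.
    total = inv.total.quantize(Decimal("0.01"))
    number = inv.invoice_number.strip().upper()
    raw = f"{vendor_id}|{number}|{total}|{inv.invoice_date}"
    return hashlib.sha256(raw.encode()).hexdigest()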

def validate_math(inv: InvoiceExtract) -> list[str]:
    errors = []
    line_sum = Decimal("0")
    for item in inv.line_items:
        expected = item.quantity * item.unit_price
        if abs(item.line_total - expected) > Decimal("0.02"):
            errors.append(f"line {item.description}: total mismatch")
        line_sum += item.line_total
    if abs(line_sum - inv.subtotal) > Decimal("0.05"):
        errors.append("subtotal mismatch")
    if abs(inv.subtotal + inv.tax - inv.total) > Decimal("0.05"):
        errors.append("grand total mismatch")
    return errors

def check_duplicate(fp: str, db) -> bool:
    return db.invoices.exists_fingerprint(fp, months=24)

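def approval_route(total: Decimal, cost_center: str) -> str:
    # Bands follow the routing rules: under $500, $500 to $5k inclusive, above $5k.
    if total < Decimal("500"):
        return "manager"
    if total <= Decimal("5000"):
        return f"director:{cost_center}"
    return "vp+po_match"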

def run_invoice_agent(document_id: str, db, review_queue):
    inv = extract_invoice(ingest.text(document_id))
    vendor_id = vendor_master.resolve(inv.vendor_name)
    errors = validate_math(inv)
    if errors:
        review_queue.enqueue(document_id, reason=errors)
        return

    fp = invoice_fingerprint(vendor_id, inv)
    if check_duplicate(fp, db):
        review_queue.enqueue(document_id, reason="duplicate_candidate", fingerprint=fp)
        return

    if inv.confidence < 0.8:
        review_queue.enqueue(document_id, extraction=inv)
        return

    approver = approval_route(inv.total, cost_center=db.po_lookup(document_id))
    workflow.start_approval(document_id, approver, inv)

public void runInvoiceAgent(UUID documentId) {
  InvoiceExtract inv = extractor.extract(ingest.text(documentId));
  String vendorId = vendorMaster.resolve(inv.vendorName());

  List<String> mathErrors = mathValidator.validate(inv);
  if (!mathErrors.isEmpty()) {
    reviewQueue.enqueue(documentId, mathErrors);
    return;
  }

  String fingerprint = fingerprintService.compute(vendorId, inv);
  if (invoiceRepo.existsFingerprint(fingerprint, Duration.ofDays(730))) {
    reviewQueue.enqueue(documentId, List.of("duplicate_candidate"));
    return;
  }

  if (inv.confidence() < 0.8) {
    reviewQueue.enqueue(documentId, inv);
    return;
  }

  String approver = routingRules.resolve(inv.total(), poLookup.costCenter(documentId));
  workflowClient.startApproval(documentId, approver, inv);
}

📦 Real World

Duplicate detection prevents double payment more often than fraud models—hash + fuzzy invoice number catches scanner re-uploads and email forwards.

⚠️ Pitfall

Auto-approving when PO match is “close enough.” Require structured PO line match scores; <0.9 match → HITL even at confidence 0.99.

⚖️ Trade-off

Aggressive vendor fuzzy matching reduces HITL but risks mis-attribution to wrong supplier. Prefer tax ID / remit-to address match when present.

💡 Pro Tip

Log every auto-approved invoice with extraction snapshot—when audit asks “why paid?” you replay JSON + model version, not chat history.

Production: structured extraction checklist

Structured extraction is production-ready when schemas are versioned, confidence is calibrated, HITL is staffed, and contract/invoice agents cannot bypass math or duplicate checks—not when a demo parses one PDF into JSON.

Track 6 structured extraction checklist

☐ Document routes (invoice / contract / generic) map to versioned Pydantic or Java schemas
☐ Five-step pipeline: extract text → schema fill → validate → store → HITL queue
☐ Instructor or Spring AI structured output with max_retries; no raw json.loads to DB
☐ Business validators: line math, dates, currency, master data lookup
☐ Confidence field on schema; calibrated vs human override rate monthly
☐ Ensemble N=3 on high-dollar tier; agreement score blended with self-reported confidence
☐ HITL threshold default <0.8; policy overrides documented per doc type
☐ Extraction versions immutable; re-run appends v{n+1} with diff to prior
☐ Contract agent: parties, dates, obligations with clause refs; version diff artifact stored
☐ Invoice agent: duplicate fingerprint; math validation blocks auto-approve
☐ Approval routing integrated with existing workflow (not shadow inbox)
☐ Golden extraction fixtures in CI; regression on field-level accuracy (Track 5 evals)
☐ Runbook: validation spike, HITL queue aging, ERP webhook failure, schema migration

Guide 5 closes the loop from multimodal bytes to business records. Next, production AI architecture zooms out—reference patterns for copilots, batch extraction fleets, multi-region gateways, and how platform teams support product squads.

📦 Production checklist

Ship when auto-approve rate is stable, override rate by confidence bucket is predictable, and zero payments bypass duplicate + math gates—not on first successful JSON dump.

📦 Real World

Schema version bumps without migration scripts brick ERP integrations—treat extraction schemas like public API contracts with deprecation windows.

🔒 Security

Extracted JSON in DB is as sensitive as source PDFs—encrypt at rest, column-level RBAC, mask bank details in review UI for users without payment authority.

🎯 Interview Tip

“End-to-end invoice AI?” — Ingest tiers → classify → Pydantic extract → math + duplicate validators → confidence/HITL → approval workflow → immutable audit trail.