Structured data extraction & agents
Guide 4 turned bytes into text, tables, and classified routes. Guide 5 turns that text into validated business records: invoices in your ERP, obligations in your CLM, events in your audit log. Production extraction is not “ask GPT for JSON.” It is a schema-first pipeline: NER and relation extraction where classical NLP wins, LLM schema fill with Pydantic/Instructor or Spring AI structured output where flexibility wins, deterministic validation, confidence scoring with human review below 0.8, and agentic workflows for contracts and invoices that compare versions, catch math errors, and route approvals.
After reading, you should be able to:
- distinguish NER, relation extraction, event extraction, and schema-driven LLM fill
- implement the five-step production pipeline (extract text → LLM schema fill → validate → store → HITL queue)
- combine self-reported and ensemble confidence with a <0.8 review threshold
- build contract agents (parties, dates, obligations, version diff, risk flags) and invoice agents (line items, math checks, duplicate detection, approval routing)
- complete the Track 6 structured extraction checklist before production architecture
Extraction basics — from entities to schemas
“Extraction” spans a spectrum: find spans (NER), link entities (relations), assemble timelines (events), or fill a full document schema (invoice, contract). Modern stacks mix classical NLP for cheap high-precision spans with LLMs for messy layout and long-tail fields—always landing on a typed schema before any database write.
Named entity recognition (NER)
NER labels token spans with types: ORG, DATE, MONEY, PERSON, domain tags like INVOICE_ID. Use spaCy, Flair, or fine-tuned transformers when you have labeled data and need sub-10ms per paragraph. LLM NER is flexible but expensive—reserve it for documents where layout breaks classical models or entity types evolve weekly.
| Approach | Best for | Watch out |
|---|---|---|
| Rule / regex NER | Invoice numbers, ISO dates, currency with known patterns | Breaks on international formats; brittle on OCR noise |
| Transformer NER | High-volume homogeneous docs (medical, legal entities) | Retrain when schema adds fields; needs GPU for batch |
| LLM span extraction | Messy scans, novel layouts, few-shot new entity types | Hallucinated spans; always validate offsets against source text |
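The offset check from the last table row fits in a few lines. This is a minimal sketch, assuming the model returns each span as start/end character offsets plus the text it claims to have found; `Span` and `validate_spans` are illustrative names, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset into the source text
    end: int    # exclusive character offset
    text: str   # surface string the model claims to have found
    label: str  # entity type, e.g. "INVOICE_ID"


def validate_spans(source: str, spans: list[Span]) -> tuple[list[Span], list[Span]]:
    """Split spans into (grounded, rejected) by checking offsets against the source."""
    grounded, rejected = [], []
    for span in spans:
        in_bounds = 0 <= span.start < span.end <= len(source)
        if in_bounds and source[span.start:span.end] == span.text:
            grounded.append(span)
        else:
            rejected.append(span)
    return grounded, rejected


source = "Invoice INV-2041 issued by Acme GmbH"
spans = [
    Span(8, 16, "INV-2041", "INVOICE_ID"),  # offsets and text match the source
    Span(27, 36, "Acme Corp", "ORG"),       # the source says "Acme GmbH" here
]
grounded, rejected = validate_spans(source, spans)
```

Rejected spans are better sent to a retry or review path than silently dropped, since a mismatch often signals OCR drift rather than pure invention.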
Relation extraction
Relation extraction links entities: (Vendor X) —[supplies]→ (Product Y), (Party A) —[signed_on]→ (2024-03-15). Pairwise classifiers and dependency-parse features still win on clean text; LLMs excel when relations are implicit (“the aforementioned supplier”) or spread across sections. Store relations as typed edges with provenance (source_page, quote) so downstream agents can cite, not guess.
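One possible shape for a typed edge with provenance is sketched below. The `Relation` dataclass and its helpers are hypothetical names chosen for illustration; only the `source_page` and `quote` fields come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    head: str
    relation: str
    tail: str
    source_page: int
    quote: str  # verbatim snippet that supports the edge


def cite(rel: Relation) -> str:
    """Render an edge with its provenance so a downstream agent can quote it."""
    return f'{rel.head} -[{rel.relation}]-> {rel.tail} (p.{rel.source_page}: "{rel.quote}")'


def is_grounded(rel: Relation, page_text: str) -> bool:
    """Accept an edge only if its quote appears verbatim on the cited page."""
    return rel.quote in page_text


page_3 = "This Agreement was signed by Party A on 2024-03-15 in Berlin."
edge = Relation("Party A", "signed_on", "2024-03-15", 3, "signed by Party A on 2024-03-15")
citation = cite(edge)
```

Keeping the quote on the edge means a reviewer can verify a relation without re-reading the whole document.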
Event extraction
Event extraction bundles who did what, when, where: renewal notices, payment due dates, breach triggers. Event schemas are small graphs—trigger, arguments (agent, patient, time), optional attributes. For compliance, events need immutable IDs and links to source clauses; regenerating events on re-extract must diff, not overwrite silently.
Schema-driven extraction (Pydantic / Instructor)
Document-level extraction maps full text (from Guide 4 ingest) into a Pydantic model or Java record. Instructor sends model_json_schema() to the provider, parses the response, validates, and retries with the validation errors included in the prompt. This is the production pattern from Track 3 reliable formatting. Spring AI offers .entity(Invoice.class) and OpenAI json_schema options for the same contract on the JVM.
import instructor
from decimal import Decimal
from typing import Annotated

from openai import OpenAI
from pydantic import BaseModel, Field, field_validator


class LineItem(BaseModel):
    description: str = Field(max_length=500)
    quantity: Annotated[Decimal, Field(gt=0)]
    unit_price: Annotated[Decimal, Field(ge=0)]
    line_total: Annotated[Decimal, Field(ge=0)]

    @field_validator("line_total")
    @classmethod
    def qty_times_price(cls, v: Decimal, info) -> Decimal:
        # info.data only holds fields that already validated. If quantity or
        # unit_price failed, skip the math check so their own errors surface
        # instead of a KeyError.
        data = info.data
        if "quantity" not in data or "unit_price" not in data:
            return v
        expected = data["quantity"] * data["unit_price"]
        if abs(v - expected) > Decimal("0.02"):
            raise ValueError(f"line_total {v} != qty*price {expected}")
        return v


class InvoiceExtract(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str  # ISO date string from model; normalize in post-step
    currency: str = Field(pattern=r"^[A-Z]{3}$")
    line_items: list[LineItem]
    subtotal: Decimal
    tax: Decimal = Decimal("0")
    total: Decimal
    confidence: Annotated[float, Field(ge=0, le=1)]


client = instructor.from_openai(OpenAI())


def extract_invoice(document_text: str) -> InvoiceExtract:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=InvoiceExtract,
        max_retries=3,  # Instructor re-prompts with the validation errors
        messages=[
            {"role": "system", "content": (
                "Extract invoice fields from OCR text. "
                "Set confidence 0-1 for overall extraction quality. "
                "Use only values present in the document."
            )},
            {"role": "user", "content": document_text},
        ],
    )
# Anthropic variant: reuses InvoiceExtract from the OpenAI example above.
import instructor
from anthropic import Anthropic

client = instructor.from_anthropic(Anthropic())


def extract_invoice(document_text: str) -> InvoiceExtract:
    return client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=4096,
        response_model=InvoiceExtract,
        max_retries=3,
        messages=[{"role": "user", "content": f"Extract invoice:\n\n{document_text}"}],
    )
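# Amazon Bedrock variant: reuses InvoiceExtract from the OpenAI example above.
import boto3
import instructor

# Region and credentials come from the standard AWS configuration chain.
bedrock_client = boto3.client("bedrock-runtime")
client = instructor.from_bedrock(bedrock_client)


def extract_invoice(document_text: str) -> InvoiceExtract:
    return client.chat.completions.create(
        model="anthropic.claude-3-5-haiku-20241022-v1:0",  # Bedrock model IDs carry a version suffix
        response_model=InvoiceExtract,
        max_retries=3,
        messages=[{"role": "user", "content": document_text}],
    )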
public record LineItem(
    String description,
    @JsonProperty("quantity") BigDecimal quantity,
    @JsonProperty("unit_price") BigDecimal unitPrice,
    @JsonProperty("line_total") BigDecimal lineTotal
) {
    // Compact constructor: reject rows whose total drifts more than 0.02 from qty * price.
    public LineItem {
        BigDecimal expected = quantity.multiply(unitPrice);
        if (lineTotal.subtract(expected).abs().compareTo(new BigDecimal("0.02")) > 0) {
            throw new IllegalArgumentException("line_total mismatch");
        }
    }
}

public record InvoiceExtract(
    String vendorName,
    String invoiceNumber,
    String invoiceDate,
    String currency,
    List<LineItem> lineItems,
    BigDecimal subtotal,
    BigDecimal tax,
    BigDecimal total,
    double confidence
) {}

@Service
public class InvoiceExtractor {
    private final ChatClient chatClient;

    // The final field needs a constructor; Spring AI auto-configures a ChatClient.Builder bean.
    public InvoiceExtractor(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    public InvoiceExtract extract(String documentText) {
        return chatClient.prompt()
            .system("Extract invoice fields. Set confidence 0-1. Only document values.")
            .user(documentText)
            // invoiceSchema() is a helper (not shown) that returns the JSON schema as a string.
            // Builder method names vary by Spring AI release; check whether yours
            // uses the "with" prefix shown here.
            .options(OpenAiChatOptions.builder()
                .withModel("gpt-4o-mini")
                .withResponseFormat(new ResponseFormat(ResponseFormat.Type.JSON_SCHEMA, invoiceSchema()))
                .build())
            .call()
            .entity(InvoiceExtract.class);
    }
}

// Anthropic variant: same records, a ChatClient bound to the Anthropic model.
public InvoiceExtract extract(String documentText) {
    return anthropicChatClient.prompt()
        .user("Extract invoice as JSON matching schema:\n" + documentText)
        .call()
        .entity(InvoiceExtract.class);
}

// Amazon Bedrock variant: same records, a ChatClient bound to Bedrock.
public InvoiceExtract extract(String documentText) {
    return bedrockChatClient.prompt()
        .user(documentText)
        .options(BedrockAnthropicChatOptions.builder()
            .withModel("anthropic.claude-3-5-haiku-20241022-v1:0") // Bedrock model IDs carry a version suffix
            .build())
        .call()
        .entity(InvoiceExtract.class);
}
Extraction task comparison
| Task | Output shape | Typical tool |
|---|---|---|
| NER | Spans + labels | spaCy, GLiNER, LLM + offset validation |
| Relation | (head, relation, tail) triples | RE models, LLM structured triple list |
| Event | Trigger + argument frames | ACE-style parsers, schema JSON |
| Schema fill | Full document POJO | Instructor, Spring AI .entity() |
Common mistake: skipping validation because “the model returned JSON.” Strings where you expected decimals, invented invoice numbers, and missing line items pass json.loads every time. Pydantic validators and Jackson records are your API boundary.
Classical NER is cheaper and auditable but needs retraining when schemas change. LLM schema fill adapts in hours but costs more per page and hallucinates on poor OCR. Hybrid: regex for IDs and dates, LLM for line items and free text.
Add source_quotes: dict[str, str] to your schema mapping field names to verbatim snippets. Validators can fuzzy-match quotes against source text—cheap grounding without full RAG.
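A fuzzy quote check might look like the sketch below. It assumes quotes are short and uses the standard library's `difflib.SequenceMatcher` over a sliding window; the 0.9 ratio and the helper names are assumptions for illustration, and a production system would likely want a faster matcher for long documents.

```python
from difflib import SequenceMatcher


def quote_supported(quote: str, source: str, min_ratio: float = 0.9) -> bool:
    """True if some window of the source is close to the quote (tolerates OCR noise)."""
    quote = " ".join(quote.lower().split())
    source = " ".join(source.lower().split())
    if not quote:
        return False
    if quote in source:
        return True
    window = len(quote)
    best = 0.0
    # Slide a quote-sized window over the source and keep the best similarity.
    for start in range(0, max(1, len(source) - window + 1)):
        ratio = SequenceMatcher(None, quote, source[start:start + window]).ratio()
        best = max(best, ratio)
    return best >= min_ratio


def ungrounded_fields(source_quotes: dict[str, str], source: str) -> list[str]:
    """Field names whose quote cannot be found in the source text."""
    return [name for name, quote in source_quotes.items() if not quote_supported(quote, source)]


ocr_text = "INVOICE No. INV-2041   Vendor: Acme GmbH   Tota1 due: 1,250.00 EUR"
quotes = {
    "invoice_number": "INV-2041",
    "total": "Total due: 1,250.00",       # OCR read "Tota1"; still a close match
    "vendor_name": "Globex Corporation",  # not in the document
}
missing = ungrounded_fields(quotes, ocr_text)
```

Fields returned by `ungrounded_fields` are natural candidates for a validation error, which lets the retry loop ask the model to re-quote or drop the value.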
“NER vs schema extraction?” — NER finds mentions; schema extraction produces business records. Production needs both: NER for search and highlighting, schema for ERP/CLM writes with validation.
Production pipeline — five steps to trusted records
Every extraction job—batch overnight or real-time upload—should follow the same five steps. Async workers from Guide 4 feed step 1; steps 2–3 are synchronous per document; steps 4–5 are transactional and auditable.
flowchart LR
subgraph ingest["Guide 4 ingest"]
A[PDF / scan / email] --> B[Extract text + tables]
end
B --> C[LLM schema fill]
C --> D{Pydantic / Jackson validate}
D -->|pass| E[(Store DB + object refs)]
D -->|fail| F[Repair retry max 3]
F -->|retry| C
F -->|retries exhausted| I
E --> G{confidence >= 0.8?}
G -->|yes| H[Auto-approve downstream]
G -->|no| I[HITL review queue]
I --> J[Human corrects + approves]
J --> E
Step 1 — Extract text
Reuse the tiered PDF pipeline from multimodal ingest: native text, OCR, layout parse. Persist document_id, extraction_tier, page-level text, and table CSV URIs. Extraction quality ceiling is set here—no schema prompt fixes garbage OCR.
Step 2 — LLM schema fill
Route by document classifier output (invoice, contract, generic). Each route binds a different Pydantic model and system prompt. Temperature 0; cap output tokens to schema size + margin. Log model ID, prompt hash, and input token count for cost attribution per tenant.
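The routing and logging described above could be organized as a small registry. This is a sketch only: the `Route` type, the `GenericExtract` fallback, the token caps, and the 12-character prompt hash are all assumptions, and the actual LLM call is left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    schema_name: str        # name of the Pydantic model bound to this route
    system_prompt: str
    model_id: str
    max_output_tokens: int  # schema size plus margin (assumed values)


ROUTES = {
    "invoice": Route("InvoiceExtract", "Extract invoice fields from OCR text.", "gpt-4o-mini", 2048),
    "contract": Route("ContractAnalysis", "Extract parties, dates, obligations.", "gpt-4o", 8192),
}
FALLBACK = Route("GenericExtract", "Extract key fields.", "gpt-4o-mini", 1024)


def call_metadata(doc_type: str, tenant_id: str, input_tokens: int) -> dict:
    """Build the per-call log record used for cost attribution."""
    route = ROUTES.get(doc_type, FALLBACK)
    prompt_hash = hashlib.sha256(route.system_prompt.encode("utf-8")).hexdigest()[:12]
    return {
        "tenant_id": tenant_id,
        "schema": route.schema_name,
        "model_id": route.model_id,
        "prompt_hash": prompt_hash,
        "temperature": 0,
        "max_output_tokens": route.max_output_tokens,
        "input_tokens": input_tokens,
    }


meta = call_metadata("invoice", tenant_id="tenant-42", input_tokens=1830)
```

Hashing the prompt rather than logging it keeps log volume down while still letting you tell which prompt revision produced a given extraction.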
Step 3 — Validate
Validation layers: (1) schema types and constraints, (2) business rules (line totals, date ordering, party IDs in master data), (3) optional cross-field checks against ERP lookup APIs. Failed validation triggers Instructor retry with error text—not silent discard. After max retries, send to HITL with status=validation_failed.
Step 4 — Store DB
Write immutable extraction version: extractions table with JSONB payload, schema version, model version, confidence, and link to source document_id. Never update in place for compliance—append version 2 when re-run. Object storage holds raw PDF and parsed text for replay.
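An append-only version table can be sketched with the standard library's `sqlite3`. The column names follow the paragraph above, but the exact schema is an assumption, and SQLite's TEXT column stands in for Postgres JSONB.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE extractions (
        document_id    TEXT NOT NULL,
        version        INTEGER NOT NULL,
        schema_version TEXT NOT NULL,
        model_version  TEXT NOT NULL,
        confidence     REAL NOT NULL,
        payload        TEXT NOT NULL,  -- JSON text here; JSONB in Postgres
        PRIMARY KEY (document_id, version)
    )
    """
)


def append_extraction(document_id, payload, schema_version, model_version, confidence):
    """Insert the next version for a document; existing rows are never updated."""
    with conn:  # one transaction: read the latest version, insert latest + 1
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version), 0) FROM extractions WHERE document_id = ?",
            (document_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO extractions VALUES (?, ?, ?, ?, ?, ?)",
            (document_id, latest + 1, schema_version, model_version, confidence, json.dumps(payload)),
        )
    return latest + 1


v1 = append_extraction("doc-1", {"total": "100.00"}, "invoice/1", "gpt-4o-mini", 0.72)
v2 = append_extraction("doc-1", {"total": "110.00"}, "invoice/1", "gpt-4o-mini", 0.91)
```

The composite primary key means two concurrent re-runs cannot both claim the same version number; one insert fails and can be retried.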
Step 5 — Human review queue
Items land in review when confidence < 0.8, validation failed, or policy flags (amount > threshold, new vendor). Review UI shows side-by-side PDF highlight + editable form fields. Approved corrections feed golden datasets for eval CI (Track 5).
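The three review triggers can be collected by one function that returns every applicable reason, which makes the queue entry self-explanatory. The amount limit below is an assumed policy value, and the reason strings are illustrative.

```python
from decimal import Decimal

HITL_THRESHOLD = 0.8
AMOUNT_REVIEW_LIMIT = Decimal("10000")  # assumed policy value


def review_reasons(confidence, validation_errors, amount, vendor_is_new):
    """Return every reason a document needs review; an empty list means auto-approve."""
    reasons = []
    if validation_errors:
        reasons.append("validation_failed")
    if confidence < HITL_THRESHOLD:
        reasons.append("low_confidence")
    if amount > AMOUNT_REVIEW_LIMIT:
        reasons.append("amount_over_limit")
    if vendor_is_new:
        reasons.append("new_vendor")
    return reasons


clean = review_reasons(0.95, [], Decimal("250"), vendor_is_new=False)
flagged = review_reasons(0.95, [], Decimal("25000"), vendor_is_new=True)
```

Note that `flagged` goes to review despite a confidence of 0.95: policy flags apply independently of the confidence gate.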
| Step | SLA target | Failure mode |
|---|---|---|
| 1 Extract text | Seconds (sync) or minutes (batch queue) | Empty text → escalate OCR tier |
| 2 Schema fill | 2–15 s per doc (model dependent) | Rate limit → gateway queue |
| 3 Validate | <100 ms | Retry then HITL |
| 4 Store | Transactional <50 ms | DLQ on DB conflict |
| 5 HITL | Business-hours SLA | Escalation on aging queue |
from enum import Enum
from uuid import uuid4

from instructor.exceptions import InstructorRetryException
from pydantic import ValidationError

HITL_THRESHOLD = 0.8


class ExtractionStatus(str, Enum):
    AUTO_APPROVED = "auto_approved"
    PENDING_REVIEW = "pending_review"
    VALIDATION_FAILED = "validation_failed"


# ingest_service, classifier, ROUTES, business_rules, and downstream are
# application-level collaborators defined elsewhere.
def run_pipeline(document_id: str, db, review_queue) -> ExtractionStatus:
    text_bundle = ingest_service.extract_text(document_id)  # Step 1
    doc_type = classifier.predict(text_bundle.plain_text)
    schema_fn = ROUTES[doc_type]  # invoice | contract
    try:
        record = schema_fn(text_bundle.plain_text)  # Step 2: Instructor inside
    except (ValidationError, InstructorRetryException) as e:
        # Instructor raises InstructorRetryException once max_retries is exhausted.
        review_queue.enqueue(document_id, reason=str(e), status=ExtractionStatus.VALIDATION_FAILED)
        return ExtractionStatus.VALIDATION_FAILED

    business_errors = business_rules.validate(record, doc_type)  # Step 3
    if business_errors:
        review_queue.enqueue(document_id, reason=business_errors, status=ExtractionStatus.VALIDATION_FAILED)
        return ExtractionStatus.VALIDATION_FAILED

    extraction_id = str(uuid4())
    db.save_extraction(extraction_id, document_id, record.model_dump(), record.confidence)  # Step 4

    if record.confidence < HITL_THRESHOLD:
        review_queue.enqueue(document_id, extraction_id=extraction_id, status=ExtractionStatus.PENDING_REVIEW)  # Step 5
        return ExtractionStatus.PENDING_REVIEW

    downstream.publish_auto_approved(extraction_id)
    return ExtractionStatus.AUTO_APPROVED
@Service
public class ExtractionPipeline {
    static final double HITL_THRESHOLD = 0.8;

    // ingestService, classifier, extractors, businessRules, extractionRepo,
    // reviewQueue, and downstreamPublisher are injected collaborators (declarations omitted).
    public ExtractionStatus run(UUID documentId) {
        TextBundle text = ingestService.extractText(documentId);                 // Step 1
        DocType type = classifier.classify(text.plainText());
        InvoiceExtract record = extractors.get(type).extract(text.plainText());  // Step 2
        List<String> errors = businessRules.validate(record, type);              // Step 3
        if (!errors.isEmpty()) {
            reviewQueue.enqueue(documentId, errors, ExtractionStatus.VALIDATION_FAILED);
            return ExtractionStatus.VALIDATION_FAILED;
        }
        UUID extractionId = extractionRepo.save(documentId, record);             // Step 4
        if (record.confidence() < HITL_THRESHOLD) {
            reviewQueue.enqueue(documentId, extractionId, ExtractionStatus.PENDING_REVIEW);  // Step 5
            return ExtractionStatus.PENDING_REVIEW;
        }
        downstreamPublisher.autoApprove(extractionId);
        return ExtractionStatus.AUTO_APPROVED;
    }
}
AP teams auto-approve 60–70% of invoices after tuning confidence thresholds; the rest clear in a 24-hour review SLA. Pipeline metrics: auto-approve rate, validation retry rate, mean time in HITL queue.
Common mistake: writing to ERP before validation completes. Use the outbox pattern: commit the extraction record to the DB first, then send the async webhook to ERP only after status=auto_approved or a human approval event.
Review queue UIs expose PII and contract terms—RBAC per queue, audit log on every field edit, no LLM re-call on corrected data without explicit “re-extract” action.
Idempotency key = hash(document_bytes) + schema_version. Re-uploads skip LLM if cached extraction exists unless force=true.
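The idempotency rule above can be sketched as follows. The in-memory dict stands in for a database lookup, SHA-256 is one reasonable choice of hash, and `extract_once` is a hypothetical helper name.

```python
import hashlib


def idempotency_key(document_bytes: bytes, schema_version: str) -> str:
    """hash(document_bytes) + schema_version, as a single string key."""
    digest = hashlib.sha256(document_bytes).hexdigest()
    return f"{digest}:{schema_version}"


_cache: dict[str, dict] = {}  # stand-in for a DB lookup keyed by idempotency key


def extract_once(document_bytes, schema_version, extract_fn, force=False):
    """Return (result, cache_hit); only call extract_fn on a miss or when forced."""
    key = idempotency_key(document_bytes, schema_version)
    if not force and key in _cache:
        return _cache[key], True
    result = extract_fn(document_bytes)
    _cache[key] = result
    return result, False


calls = []


def fake_extract(data: bytes) -> dict:
    calls.append(1)  # count how often the (expensive) extraction runs
    return {"bytes": len(data)}


pdf = b"%PDF-1.7 example bytes"
first = extract_once(pdf, "invoice/1", fake_extract)
again = extract_once(pdf, "invoice/1", fake_extract)               # re-upload: cache hit
forced = extract_once(pdf, "invoice/1", fake_extract, force=True)  # explicit re-extract
bumped = extract_once(pdf, "invoice/2", fake_extract)              # new schema version: miss
```

Including the schema version in the key means a schema bump naturally invalidates cached extractions without a manual purge.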
Confidence scoring — when to trust the model
A single confidence float can decide whether an invoice is paid without a human ever seeing it. Combine self-reported confidence from the schema, ensemble runs when stakes are high, and deterministic checks. Route everything below 0.8 to HITL unless policy overrides.
Self-reported confidence
Add confidence: float to your Pydantic model with Field(ge=0, le=1). Instruct the model: “0.9+ only when every field has clear textual evidence; 0.5–0.7 when inferred; below 0.5 when guessing.” Calibrate monthly: plot confidence vs human override rate. Models are often overconfident—apply Platt scaling or a simple linear shrink if overrides cluster above 0.85.
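The monthly calibration check can start as a simple bucketed table of override rates. This sketch assumes review outcomes are available as (confidence, was_overridden) pairs and uses ten equal-width buckets; the sample data is made up for illustration.

```python
from collections import defaultdict


def override_rate_by_bucket(records):
    """records: iterable of (confidence, was_overridden) pairs from the review queue.

    Returns {bucket_floor: override_rate} using ten equal-width buckets.
    """
    counts = defaultdict(lambda: [0, 0])  # bucket index -> [overrides, total]
    for confidence, overridden in records:
        index = min(round(confidence * 100) // 10, 9)  # 1.0 joins the 0.9 bucket
        counts[index][0] += int(overridden)
        counts[index][1] += 1
    return {index / 10: overrides / total for index, (overrides, total) in sorted(counts.items())}


# Illustrative data only: (self-reported confidence, did a human change a field?)
records = [
    (0.95, False), (0.95, True), (0.92, False), (0.91, False),
    (0.85, True), (0.82, False),
    (0.65, True),
]
rates = override_rate_by_bucket(records)
```

If the 0.9 bucket shows a quarter of extractions being overridden, as in this toy data, the model's 0.9+ is not trustworthy as stated, and a shrink or Platt-style recalibration is worth fitting on the real history.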
Ensemble N-times
For high-value contracts or invoices above dollar thresholds, run N independent extractions (N=3 or 5). At temperature 0, identical calls return mostly identical output, so vary the prompt wording or the model across runs to get independent evidence. Aggregate: majority vote on categorical fields; median on numerics; flag fields with disagreement > tolerance. Ensemble confidence = agreement fraction per field (e.g. 3/3 runs agree → 1.0; 2/3 → 0.67 → HITL).
| Signal | Cost | When to use |
|---|---|---|
| Self-reported only | 1× LLM call | Low-risk docs, high volume AP |
| Ensemble N=3 | 3× LLM calls | Invoices > $10k, new vendors |
| Ensemble + verifier model | 4× calls | Legal contracts, regulatory filings |
| Deterministic validators | CPU only | Always—math, dates, master data lookup |
from collections import Counter
from decimal import Decimal
from statistics import median

HITL_THRESHOLD = 0.8
ENSEMBLE_N = 3


def ensemble_extract(text: str) -> tuple[InvoiceExtract, float]:
    # extract_invoice and InvoiceExtract come from the schema-fill example.
    # In production, vary the prompt or model per run so the runs are independent.
    runs = [extract_invoice(text) for _ in range(ENSEMBLE_N)]
    agreement_scores = []

    # Categorical field: majority vote; agreement = share of runs that match the winner.
    number_winner, count = Counter(r.invoice_number for r in runs).most_common(1)[0]
    agreement_scores.append(count / ENSEMBLE_N)

    # Numeric field: stay in Decimal; any spread beyond tolerance counts as disagreement.
    totals = [r.total for r in runs]
    spread_ok = max(totals) - min(totals) <= Decimal("0.05")
    agreement_scores.append(1.0 if spread_ok else 0.0)

    ensemble_confidence = min(agreement_scores)
    self_conf = median(r.confidence for r in runs)
    final_confidence = min(ensemble_confidence, self_conf)

    # Start from a run that carries the majority invoice number, not blindly runs[0].
    base = next(r for r in runs if r.invoice_number == number_winner)
    merged = base.model_copy(update={
        "total": median(totals),  # median of Decimals stays a Decimal
        "confidence": final_confidence,
    })
    return merged, final_confidence


def needs_hitl(confidence: float) -> bool:
    return confidence < HITL_THRESHOLD
public record EnsembleResult(InvoiceExtract merged, double confidence) {}

// agreementOn, numericAgreement, median, mergeMajority, and withConfidence are
// helpers not shown here (records have no built-in "wither" methods).
public EnsembleResult ensembleExtract(String text) {
    List<InvoiceExtract> runs = IntStream.range(0, 3)
        .mapToObj(i -> extractor.extract(text))
        .toList();
    double numberAgreement = agreementOn(runs, InvoiceExtract::invoiceNumber);
    double totalAgreement = numericAgreement(runs.stream().map(InvoiceExtract::total).toList());
    double selfMedian = median(runs.stream().mapToDouble(InvoiceExtract::confidence).toArray());
    double finalConfidence = Math.min(Math.min(numberAgreement, totalAgreement), selfMedian);
    InvoiceExtract merged = mergeMajority(runs);
    merged = merged.withConfidence(finalConfidence);
    return new EnsembleResult(merged, finalConfidence);
}

public boolean needsHitl(double confidence) {
    return confidence < 0.8;
}
HITL threshold < 0.8
Default gate: confidence < 0.8 → human review. Tune per document type: contracts may use 0.85, low-value receipts 0.75 with extra fraud checks. Never lower the threshold to “improve automation rate” without measuring error cost: a 1% false auto-approve rate on $50k invoices is an expected exposure of about $500 per invoice, which dwarfs the cost of a human review.
Ensemble reduces variance but triples token cost. Apply dollar-tier routing: ensemble only when estimated_amount > threshold or vendor is new.
Common mistake: trusting self-reported confidence without calibration. Models routinely output 0.95 on wrong totals. Always blend with deterministic checks and periodic override audits.
“How do you decide auto-approve vs human review?” — Schema validation first, blended confidence (self + ensemble + rules), threshold 0.8 default, dollar-tier overrides, metrics on override rate by confidence bucket.
Teams that ensemble only the top 5% of documents by amount capture most fraud/error risk while keeping mean cost per doc nearly flat: at N=3 that is about 1.1× the single-call cost (0.95 × 1 + 0.05 × 3). The FinOps dashboard splits extraction spend by tier.
Contract agent — parties, obligations, and risk
Contracts are long, cross-referenced, and high stakes. A contract agent is not one-shot extraction: it orchestrates section-aware reads, structured obligation graphs, version diffs against prior executed versions, and risk summaries for legal ops.
Core extractions
- Parties — legal names, roles (licensor/licensee), addresses, signatories with title
- Dates — effective, expiration, renewal notice windows, termination for convenience periods
- Obligations — payment terms, SLAs, indemnity caps, data processing, audit rights—each with clause reference
| Field group | Agent tool | Output |
|---|---|---|
| Parties & dates | Schema fill on cover + signature pages | ContractHeader record |
| Obligations | Section retrieval + per-clause extract | List of Obligation with clause_id |
| Version compare | Diff tool vs stored v{n-1} | ClauseChange[] |
| Risk analysis | Policy rules + LLM summary | Ranked RiskFlag with severity |
Compare versions
When a redline PDF arrives, normalize both versions to clause-aligned text (heading detection from ingest). Agent calls diff_clauses(v_prev, v_new)—embedding similarity to match moved clauses, then LLM summarizes material changes (liability cap lowered, auto-renewal added). Store diff artifact for CLM; never overwrite prior extraction version.
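The matching step of diff_clauses can be sketched without an embedding model. The version below swaps embedding similarity for `difflib.SequenceMatcher` string similarity, purely so the example runs with the standard library; the 0.6 match threshold, the greedy pairing, and the status names are assumptions, and the LLM summary step is omitted.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; a stand-in for embedding cosine similarity."""
    return SequenceMatcher(None, a, b).ratio()


def diff_clauses(prev: dict[str, str], new: dict[str, str], match_at: float = 0.6) -> list[dict]:
    """Greedily pair each new clause with its closest unmatched previous clause."""
    changes, matched_prev = [], set()
    for new_id, new_text in new.items():
        candidates = [
            (similarity(text, new_text), prev_id)
            for prev_id, text in prev.items()
            if prev_id not in matched_prev
        ]
        score, prev_id = max(candidates, default=(0.0, None))
        if prev_id is None or score < match_at:
            changes.append({"status": "added", "new": new_id})
            continue
        matched_prev.add(prev_id)
        if score < 1.0:
            changes.append({"status": "modified", "prev": prev_id, "new": new_id})
        elif prev_id != new_id:
            changes.append({"status": "moved", "prev": prev_id, "new": new_id})
    for prev_id in prev:
        if prev_id not in matched_prev:
            changes.append({"status": "removed", "prev": prev_id})
    return changes


v_prev = {
    "7.1": "Liability of either party is capped at the fees paid in the prior 12 months.",
    "9.2": "Either party may terminate for convenience with 30 days notice.",
    "11": "This Agreement is governed by the laws of England and Wales, excluding conflict-of-law rules.",
}
v_new = {
    "7.1": "Liability of either party is capped at the fees paid in the prior 6 months.",
    "10.2": "Either party may terminate for convenience with 30 days notice.",
    "12": "Fees are due net 30.",
}
changes = diff_clauses(v_prev, v_new)
```

Only the "modified" and "added" entries need to go to the LLM for a materiality summary, which keeps token spend proportional to what actually changed.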
Risk analysis
Combine playbook rules (“uncapped liability → critical”) with LLM narrative for context. Risk output is advisory—legal counsel signs off. Log which model and policy version produced each flag for audit.
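The deterministic half of risk analysis can be a list of predicates over extracted obligations. In this sketch the obligation dict keys (`type`, `cap`, `automatic`), the rule set, and the playbook version string are all hypothetical; the LLM narrative would be attached to each flag afterwards.

```python
from dataclasses import dataclass
from typing import Callable

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}
PLAYBOOK_VERSION = "playbook-v1"  # assumed identifier, logged with every flag


@dataclass(frozen=True)
class Rule:
    topic: str
    severity: str
    applies: Callable[[dict], bool]  # predicate over one extracted obligation


PLAYBOOK = [
    Rule("uncapped_liability", "critical",
         lambda ob: ob["type"] == "liability" and ob.get("cap") is None),
    Rule("auto_renewal", "medium",
         lambda ob: ob["type"] == "renewal" and ob.get("automatic", False)),
]


def flag_risks(obligations: list[dict]) -> list[dict]:
    """Apply every playbook rule to every obligation; return flags ranked by severity."""
    flags = []
    for ob in obligations:
        for rule in PLAYBOOK:
            if rule.applies(ob):
                flags.append({
                    "severity": rule.severity,
                    "topic": rule.topic,
                    "clause_ref": ob["clause_ref"],
                    "policy_version": PLAYBOOK_VERSION,
                })
    return sorted(flags, key=lambda f: SEVERITY_ORDER[f["severity"]])


obligations = [
    {"clause_ref": "14.1", "type": "renewal", "automatic": True},
    {"clause_ref": "9.3", "type": "liability", "cap": None},
    {"clause_ref": "5.2", "type": "payment"},
]
flags = flag_risks(obligations)
```

Because the rules are plain code, a flag can always be traced to a specific rule and playbook version, which is what the audit requirement above asks for.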
import instructor
from enum import Enum

from openai import OpenAI
from pydantic import BaseModel, Field


class RiskSeverity(str, Enum):
    low = "low"
    medium = "medium"
    high = "high"
    critical = "critical"


class Obligation(BaseModel):
    clause_ref: str
    obligation_type: str
    description: str = Field(max_length=1000)
    due_date: str | None = None
    source_quote: str


class RiskFlag(BaseModel):
    severity: RiskSeverity
    topic: str
    rationale: str
    clause_ref: str


class ContractAnalysis(BaseModel):
    parties: list[str]
    effective_date: str
    expiration_date: str | None
    obligations: list[Obligation]
    risk_flags: list[RiskFlag]
    confidence: float = Field(ge=0, le=1)


client = instructor.from_openai(OpenAI())


def analyze_contract(full_text: str, playbook: str) -> ContractAnalysis:
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=ContractAnalysis,
        max_retries=2,
        messages=[
            {"role": "system", "content": (
                "Extract contract parties, dates, obligations with clause refs. "
                f"Apply risk playbook:\n{playbook}"
            )},
            # Character cap is a crude guard; long contracts need section-wise passes.
            {"role": "user", "content": full_text[:120_000]},
        ],
    )


def compare_versions(text_prev: str, text_new: str) -> list[dict]:
    # Agent step: structured clause diff (simplified placeholder).
    prompt = f"List material clause changes.\n\nPREV:\n{text_prev[:50_000]}\n\nNEW:\n{text_new[:50_000]}"
    # ... tool-calling loop that sends `prompt` and calls the diff_clauses tool
    # in the full implementation; this stub returns no changes.
    return []
public record Obligation(
    String clauseRef, String obligationType, String description,
    String dueDate, String sourceQuote
) {}

public record RiskFlag(RiskSeverity severity, String topic, String rationale, String clauseRef) {}

public record ContractAnalysis(
    List<String> parties,
    String effectiveDate,
    String expirationDate,
    List<Obligation> obligations,
    List<RiskFlag> riskFlags,
    double confidence
) {}

@Service
public class ContractAgent {
    // chatClient and diffTool are injected collaborators (declarations omitted).
    public ContractAnalysis analyze(String fullText, String playbook) {
        return chatClient.prompt()
            .system("Extract parties, dates, obligations. Apply playbook: " + playbook)
            .user(fullText.substring(0, Math.min(fullText.length(), 120_000)))
            .call()
            .entity(ContractAnalysis.class);
    }

    public List<ClauseChange> compareVersions(String prev, String next) {
        return diffTool.diff(prev, next); // Spring AI tool / custom bean
    }
}
Contracts are confidential—route through VPC gateway, disable training on enterprise tier, redact counterparty PII in logs. Risk flags are not legal advice; watermark UI accordingly.
Single-pass extraction on 80-page MSAs misses exhibits. Agent must iterate: TOC → section retrieve → per-section schema fill with token budget.
Store obligations as rows keyed by clause_ref—version diff updates row status (added|modified|removed) instead of replacing the whole contract JSON.
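A row-level status update keyed by clause_ref could be computed as below. This is a sketch: each version is reduced to a mapping of clause_ref to obligation description, and the function name is illustrative.

```python
def obligation_row_updates(prev_rows: dict[str, str], new_rows: dict[str, str]) -> dict[str, str]:
    """Map clause_ref -> status for every clause whose obligation changed between versions."""
    updates = {}
    for ref, description in new_rows.items():
        if ref not in prev_rows:
            updates[ref] = "added"
        elif prev_rows[ref] != description:
            updates[ref] = "modified"
    for ref in prev_rows:
        if ref not in new_rows:
            updates[ref] = "removed"
    return updates


v1_rows = {"4.1": "Pay within 30 days", "8.2": "99.9% uptime SLA", "12": "Annual audit right"}
v2_rows = {"4.1": "Pay within 45 days", "8.2": "99.9% uptime SLA", "13": "Data deletion on exit"}
updates = obligation_row_updates(v1_rows, v2_rows)
```

Unchanged clauses such as 8.2 produce no update, so the obligations table only records what a redline actually touched.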
“Design a contract AI system.” — Ingest + clause segmentation, structured extraction per section, immutable versions, diff on redlines, rules-based risk + LLM summary, mandatory legal HITL on critical flags.
Invoice agent — AP automation with guardrails
Invoices are shorter than contracts but hit payment systems directly. The invoice agent extracts vendor and line items, validates math, detects duplicates against historical payments, and routes approvals by amount, cost center, and policy.
Vendor and line items
Match extracted vendor_name to ERP vendor master (fuzzy + tax ID when present). Line items need quantity × unit price = line total; sum(line totals) + tax ≈ total within rounding tolerance. Failed math → auto HITL regardless of confidence.
Math validation
| Check | Rule | On failure |
|---|---|---|
| Line item | abs(qty × price - line_total) ≤ 0.02 | Validation error + HITL |
| Subtotal | sum(line_totals) ≈ subtotal | Retry once, then HITL |
| Total | subtotal + tax ≈ total | Block auto-approve |
| Currency | ISO 4217 + match PO currency | Route to treasury review |
Duplicate detection
Fingerprint: hash(vendor_id, invoice_number, total, invoice_date). Query last 24 months; fuzzy match on number variants (INV-001 vs 001). Duplicate candidate → hold payment, alert AP, never auto-approve.
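For the fingerprint to catch variants like INV-001 vs 001, the inputs need normalizing before hashing. The sketch below uses one possible normalization (digits only, leading zeros stripped, amounts quantized to cents); these rules are assumptions and are deliberately aggressive, which is acceptable because a match only creates a duplicate candidate for human review.

```python
import hashlib
import re
from datetime import date
from decimal import Decimal


def normalize_invoice_number(raw: str) -> str:
    """Drop prefixes, separators, and leading zeros: 'INV-001' and '001' both become '1'."""
    digits = re.sub(r"\D", "", raw)
    return digits.lstrip("0") or "0"


def invoice_fingerprint(vendor_id: str, invoice_number: str, total: Decimal, invoice_date: date) -> str:
    parts = [
        vendor_id.strip().lower(),
        normalize_invoice_number(invoice_number),
        str(total.quantize(Decimal("0.01"))),  # 100.5 and 100.50 hash the same
        invoice_date.isoformat(),
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


a = invoice_fingerprint("V-77", "INV-001", Decimal("100.5"), date(2024, 3, 15))
b = invoice_fingerprint("v-77 ", "001", Decimal("100.50"), date(2024, 3, 15))  # re-upload variant
c = invoice_fingerprint("V-77", "INV-002", Decimal("100.50"), date(2024, 3, 15))
```

Here `a` and `b` collide as intended while `c` does not. Passing a parsed `date` rather than the raw string also avoids misses caused by "2024-03-15" vs "15/03/2024".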
Approval routing
Rules engine after extraction: <$500 manager; $500–$5k director + cost center owner; >$5k VP + PO match required. Agent prepares approval packet (PDF link, extracted JSON, PO match score, duplicate check result)—humans click approve in existing workflow tool.
from decimal import Decimal
import hashlib


def invoice_fingerprint(vendor_id: str, inv: InvoiceExtract) -> str:
    raw = f"{vendor_id}|{inv.invoice_number}|{inv.total}|{inv.invoice_date}"
    return hashlib.sha256(raw.encode()).hexdigest()


def validate_math(inv: InvoiceExtract) -> list[str]:
    errors = []
    line_sum = Decimal("0")
    for item in inv.line_items:
        expected = item.quantity * item.unit_price
        if abs(item.line_total - expected) > Decimal("0.02"):
            errors.append(f"line {item.description}: total mismatch")
        line_sum += item.line_total
    if abs(line_sum - inv.subtotal) > Decimal("0.05"):
        errors.append("subtotal mismatch")
    if abs(inv.subtotal + inv.tax - inv.total) > Decimal("0.05"):
        errors.append("grand total mismatch")
    return errors


def check_duplicate(fp: str, db) -> bool:
    return db.invoices.exists_fingerprint(fp, months=24)


def approval_route(total: Decimal, cost_center: str) -> str:
    if total < 500:
        return "manager"
    if total <= 5000:  # the $500–$5k band includes exactly $5,000
        return f"director:{cost_center}"
    return "vp+po_match"


# ingest, vendor_master, and workflow are application-level collaborators defined elsewhere.
def run_invoice_agent(document_id: str, db, review_queue):
    inv = extract_invoice(ingest.text(document_id))
    vendor_id = vendor_master.resolve(inv.vendor_name)

    errors = validate_math(inv)
    if errors:  # failed math goes to HITL regardless of confidence
        review_queue.enqueue(document_id, reason=errors)
        return

    fp = invoice_fingerprint(vendor_id, inv)
    if check_duplicate(fp, db):  # hold payment; never auto-approve a duplicate candidate
        review_queue.enqueue(document_id, reason="duplicate_candidate", fingerprint=fp)
        return

    if inv.confidence < 0.8:
        review_queue.enqueue(document_id, extraction=inv)
        return

    approver = approval_route(inv.total, cost_center=db.po_lookup(document_id))
    workflow.start_approval(document_id, approver, inv)
public void runInvoiceAgent(UUID documentId) {
    InvoiceExtract inv = extractor.extract(ingest.text(documentId));
    String vendorId = vendorMaster.resolve(inv.vendorName());

    List<String> mathErrors = mathValidator.validate(inv);
    if (!mathErrors.isEmpty()) {
        reviewQueue.enqueue(documentId, mathErrors);
        return;
    }

    String fingerprint = fingerprintService.compute(vendorId, inv);
    if (invoiceRepo.existsFingerprint(fingerprint, Duration.ofDays(730))) { // ~24 months
        reviewQueue.enqueue(documentId, List.of("duplicate_candidate"));
        return;
    }

    if (inv.confidence() < 0.8) {
        reviewQueue.enqueue(documentId, inv);
        return;
    }

    String approver = routingRules.resolve(inv.total(), poLookup.costCenter(documentId));
    workflowClient.startApproval(documentId, approver, inv);
}
Duplicate detection prevents double payment more often than fraud models—hash + fuzzy invoice number catches scanner re-uploads and email forwards.
Common mistake: auto-approving when PO match is “close enough.” Require structured PO line match scores; <0.9 match → HITL even at confidence 0.99.
Aggressive vendor fuzzy matching reduces HITL but risks mis-attribution to wrong supplier. Prefer tax ID / remit-to address match when present.
Log every auto-approved invoice with extraction snapshot—when audit asks “why paid?” you replay JSON + model version, not chat history.
Production: structured extraction checklist
Structured extraction is production-ready when schemas are versioned, confidence is calibrated, HITL is staffed, and contract/invoice agents cannot bypass math or duplicate checks—not when a demo parses one PDF into JSON.
Track 6 structured extraction checklist
- ☐ Document routes (invoice / contract / generic) map to versioned Pydantic or Java schemas
- ☐ Five-step pipeline: extract text → schema fill → validate → store → HITL queue
- ☐ Instructor or Spring AI structured output with max_retries; no raw json.loads to DB
- ☐ Business validators: line math, dates, currency, master data lookup
- ☐ Confidence field on schema; calibrated vs human override rate quarterly
- ☐ Ensemble N=3 on high-dollar tier; agreement score blended with self-reported confidence
- ☐ HITL threshold default <0.8; policy overrides documented per doc type
- ☐ Extraction versions immutable; re-run appends v{n+1} with diff to prior
- ☐ Contract agent: parties, dates, obligations with clause refs; version diff artifact stored
- ☐ Invoice agent: duplicate fingerprint; math validation blocks auto-approve
- ☐ Approval routing integrated with existing workflow (not shadow inbox)
- ☐ Golden extraction fixtures in CI; regression on field-level accuracy (Track 5 evals)
- ☐ Runbook: validation spike, HITL queue aging, ERP webhook failure, schema migration
Guide 5 closes the loop from multimodal bytes to business records. Next, production AI architecture zooms out—reference patterns for copilots, batch extraction fleets, multi-region gateways, and how platform teams support product squads.
Ship when auto-approve rate is stable, override rate by confidence bucket is predictable, and zero payments bypass duplicate + math gates—not on first successful JSON dump.
Schema version bumps without migration scripts brick ERP integrations—treat extraction schemas like public API contracts with deprecation windows.
Extracted JSON in DB is as sensitive as source PDFs—encrypt at rest, column-level RBAC, mask bank details in review UI for users without payment authority.
“End-to-end invoice AI?” — Ingest tiers → classify → Pydantic extract → math + duplicate validators → confidence/HITL → approval workflow → immutable audit trail.