Principal Backend AI platform — Interview notepad

0 — How to use this notepad (single source of truth)

This page is the only prep doc you need for a Principal Backend Engineer — AI platform interview. It maps every responsibility in the target JD to components, tools, design principles, and spoken answers.

Anchor story: OptivalueTek Project-0 — internal banking assistant (policies, KYC, cases) grounded on approved documents. Parallel mapping: same platform patterns as a document extraction product (ingest → pipeline → LLM → metadata DB → review UI).

Study order (one week)

Day 1–2: §1 (your story) + §2 (architecture) — be able to draw the diagram from memory.
Day 3: §4 LLM gateway + §5 queues + §8 data.
Day 4: §3 pipelines + §14 agentic + §15 RAG/PDF.
Day 5: §6 AWS + §9 security + §10 observability.
Day 6: §7 APIs + §12 Python + §13 distributed patterns.
Day 7: §17 whiteboard + §18 Q&A — speak answers out loud.

Notepad. Open every answer with: problem → non-goals → architecture → tradeoffs → metrics. Never lead with framework names; lead with boundaries and failure modes.

Master component map (every JD bullet → section)

JD topic	§	Your anchor
Platform architecture, multi-tenant scale	§2	Bank entity/branch + chunk ACL
Extraction pipelines, registry, cost/latency/accuracy	§3	Policy ingest + re-ingest
LLM gateway, keys, failover, structured output, tools	§4	Structured JSON + gateway pattern
Async workers, DLQ, retry, backpressure	§5	Evolve long case packs to jobs
AWS ECS, RDS, S3, Bedrock, ALB, Secrets, IaC	§6	Project-1 AWS + OpenShift
OpenAPI 3.1, FastAPI endpoints, versioning	§7	FastAPI orchestration APIs
PostgreSQL metadata, indexing, tenancy	§8	Audit + session metadata pattern
OAuth2, JWT, RBAC, CORS, IAM	§9	SSO + retrieval-time ACL
Logging, tracing, SLOs	§10	Datadog traces + audit events
Technical leadership	§11	RFCs, mentoring, eval gates
Python FastAPI SQLAlchemy Pydantic asyncio	§12	Daily orchestration stack
Sagas, circuit breaker, idempotency	§13	Kafka + integration habits
LangGraph, agentic patterns, guardrails	§14	Project-0 agent loop
RAG, PyMuPDF, vectors, hybrid	§15	Policy RAG layer
Bedrock, Temporal, K8s, ML serving	§16	Bridge from experience
System design whiteboard	§17	45-min script

1 — Your current project (Optivalue Project-0) — tell-me-about-yourself

30-second opener

At OptivalueTek I lead a large retail bank programme. Project-0 is an internal AI assistant for branch and back-office staff: credit policies, KYC steps, exceptions, and case context in plain language — answers must come from approved documents, respect role-based access, and be auditable. I owned orchestration: RAG, LangChain steps, LangGraph for branching and human-in-the-loop, bounded agent tools on allow-listed APIs, and FastAPI beside existing Java/Spring services with shared SSO and logging.

Users and problem

Users: branch staff, KYC ops, credit analysts — not customers.
Pain: policy in PDFs/circulars/SOPs; inconsistent interpretation; slow search.
Success: grounded answers + citations; fail-closed on weak evidence; audit replay; least privilege.

Four ways AI is used

RAG Q&A (~80%): rewrite → ACL retrieval → generate with citations → abstain if weak.
Structured JSON: checklists and case note drafts for workflow UI (Pydantic validation).
Bounded agent: allow-listed read tools (case lookup, policy version); max steps/tokens/time.
Human-in-the-loop: high-risk intents and any write/side effect pause in LangGraph.

What you did as Technical Lead

Architecture boundaries: Python orchestration vs Java system-of-record.
RBAC at retrieval (filter-first), not only API gateway.
Prompt template versioning + golden-question CI evals.
Compliance/design reviews; mentoring on async FastAPI and production ownership.

Sibling track (Project-1): core customer/transaction microservices — Java 21, Spring Boot, Kafka, AWS, OpenShift — proves you operate full platform, not AI silo.

RBAC — five layers (speak in order)

Identity: bank SSO OAuth2/OIDC → JWT (issuer, audience, expiry validated every call).
API authorization: FastAPI deps map roles/scopes to routes; tenant/entity from claims not client body.
Retrieval ACL (most important for AI): chunk metadata allowed_roles, library, effective_from; vector query applies filters before similarity ranking.
Tool RBAC: each tool declares scope; runtime enforces; writes need case:write + human approval.
Session isolation: no cross-user memory; TTL session store; stateless workers; server-side prompt assembly only.

Guardrails — input / generation / output

Layer	Controls	Fail mode
Input	Injection filter, PII mask, risk router, out-of-scope deny	Refuse early
Generation	Bank system prompts, retrieval-only context, schema mode, tool allow-list, step/token caps	Truncate or route to HITL
Output	Pydantic validate, citation completeness, claim-evidence check, numeric verify, PII scrub	Abstain — no softer guess

Tools on Project-0 — what and why

Tool	Role	Why chosen
Python	Orchestration, ingest jobs, evals	Speed of iteration; same code notebook → batch
FastAPI	Chat, retrieval, stream tokens	Async I/O; OpenAPI; fits AI iteration
Pydantic v2	API + LLM output validation	Fail bad JSON before workflow UI
Vector DB	Policy chunk index	Semantic search over long PDFs
Embeddings	Chunk vectors	Match paraphrased staff questions
LangChain	Rewrite, retrieve, compress, parse	Composable testable steps
LangGraph	Branch, HITL, agent loop, retries	Explicit states vs nested ifs
LLM (enterprise API)	Generation	Bank contract; private endpoint
Java/Spring	Case and core APIs	System of record; existing ops
OpenShift/Helm/Jenkins	Deploy	Same as transaction services
Datadog	Traces and dashboards	One pane with Project-1

Project-0 request path

flowchart TB
  UI[Internal UI] --> SSO[Bank SSO OAuth2]
  SSO --> API[FastAPI orchestrator]
  API --> IN[Input guardrails]
  IN --> LG[LangGraph]
  LG --> RAG[ACL vector retrieval]
  LG --> LLM[LLM gateway]
  LG --> TOOLS[Allow-listed APIs]
  TOOLS --> JAVA[Java core services]
  LG --> OUT[Output verify abstain]
  OUT --> AUD[Audit log]
  BATCH[Re-ingest jobs] --> RAG

Notepad. Phrase: "In banking, safe AI means deny by default — verification failure returns no answer, not a softer guess."

2 — Platform architecture & scalability (multi-tenant, high throughput)

Interview framing: prototype → production = separate control plane (API, auth, config) from data plane (workers, LLM, storage).

Reference architecture (target extraction platform + your bank assistant)

Edge: ALB + WAF → API service (FastAPI) — sync only for submit/status/chat stream.
Async plane: queue → worker pool → pipeline registry executes versioned steps.
Knowledge plane: object storage (raw docs) + vector index + PostgreSQL metadata.
LLM plane: provider gateway (keys, failover, structured output, rate limits).
Observability plane: OTel traces, metrics, audit events (append-only).

Multi-tenancy models (know all three)

Model	Isolation	When to use	Risk
Shared schema + tenant_id column	App filters every query; optional RLS in Postgres	Default for SaaS; fastest to ship	Bug in filter = cross-tenant leak
Row-Level Security (RLS)	DB enforces tenant_id from session var	Regulated SaaS; defense in depth	Migration complexity; connection pooling setup
Schema-per-tenant / DB-per-tenant	Hard isolation	Enterprise tier, data residency	Ops cost; migration explosion

Your story: bank used entity/branch claims + chunk ACL metadata — same as tenant-scoped retrieval filters.

Microservices decomposition (10+ years framing)

Split by business capability — documents, jobs, extraction results, LLM gateway, ingest workers — not “one AI service.”
Failure isolation: LLM outage must not take down document upload; queue absorbs spikes.
Communication: sync REST for queries; async events for pipeline; no distributed monolith chat between 12 services.
Data ownership: each service owns its tables; no shared mutable DB antipattern.
Your evidence: Project-1 split customer, postings, limits; Project-0 Python orchestration beside Java — integration via APIs + shared identity only.

Scalability tactics

Stateless API horizontal scale; session in Redis/DB with TTL.
Workers scale on queue depth + GPU/CPU saturation (KEDA/ECS autoscaling).
Partition hot paths: ingest, inference, and export as separate services.
Backpressure: 429 when queue age > SLO; shed load before OOM.
Cache: embedding cache for repeated policy chunks; not unbounded chat memory.

Throughput for extraction workloads

Documents are bursty (batch upload). Design: accept job → return job_id → workers process pages in parallel → aggregate → persist traces. Never hold HTTP open for 10-minute PDF runs.

Multi-tenant platform layers

flowchart LR
  subgraph tenants [Tenants]
    T1[Org A]
    T2[Org B]
  end
  subgraph edge [Edge]
    ALB[ALB]
    API[FastAPI API]
  end
  subgraph data [Data plane]
    Q[Queue]
    W[Workers]
    PG[(PostgreSQL)]
    S3[(S3)]
    VDB[(Vector index)]
  end
  T1 --> ALB
  T2 --> ALB
  ALB --> API --> Q --> W
  W --> PG
  W --> S3
  W --> VDB

3 — Extraction pipeline engineering (registry, abstractions, execution strategies)

JD focus: architect AI pipelines balancing accuracy, latency, LLM cost.

Pipeline as a product, not a script

Pipeline definition: versioned DAG of steps (declarative YAML/JSON or code registry).
Pipeline registry: pipeline_id@version → step list, model routes, validators.
Execution strategies: sync (small doc), async queue (large), fan-out per page, map-reduce merge.
Artifacts: each step writes typed outputs to S3 + metadata row (trace).

Typical extraction stages (engineering drawings OR bank policy docs)

Stage	Tech	Output	Cost/latency knob
Ingest	S3 upload, virus scan, tenant tag	raw_object_key	cheap
Rasterize/parse	PyMuPDF / PDF parser	pages, images, text spans	CPU — ThreadPoolExecutor
Layout/detect	YOLO/rules (optional)	bounding boxes	GPU or skip for v1
Region crop	coordinates normalized	crops per field type	medium
LLM extract	structured schema per field	JSON: holes, GD&T, title block OR policy clauses	expensive — route model
Validate	deterministic rules + second-pass LLM	confidence score	tunable
Persist	PostgreSQL + export	review queue	cheap

Your parallel (bank assistant)

Ingest policy PDFs → chunk + embed → retrieve → LLM answer with schema → validate citations → audit. Re-ingest job when circular changes — same as extraction re-index.

Accuracy vs latency vs cost (how to talk tradeoffs)

Cheaper model for rewrite/routing; frontier model only for final extract or low-confidence regions.
Deterministic validators first (regex, units, GD&T rules) before second LLM pass.
Confidence routing: auto-accept > τ, human review between τ₁–τ₂, reject below τ₂.
Golden-file regression: CI runs pipeline on fixed docs; block deploy on accuracy drop.

Pipeline registry API (design verbally)

POST /v1/jobs body: { pipeline_version, document_id, tenant_id, idempotency_key }. Worker loads registry entry, runs state machine, checkpoints after each step.

Notepad. Say: v1 = linear pipeline + structured LLM; v2 = agent only on ambiguous regions; never agent-on-every-field.

4 — Multi-provider LLM abstraction (keys, failover, structured output, tools)

All model calls go through a gateway service — never scatter SDK calls in workers.

Gateway responsibilities

Unified interface: complete(), complete_structured(schema), invoke_tools(messages, tools).
Multi-key round-robin: pool of API keys per provider; per-key rate limiter (token bucket).
Provider failover: primary → secondary (Claude ↔ OpenAI ↔ Bedrock) with same JSON schema contract.
Circuit breaker: open on error rate; half-open probe; per-provider health.
Retries: transient 429/503 with exponential backoff + jitter; respect Retry-After.
Observability: log model id, tokens, latency, cost estimate per request_id.

Structured output

Pydantic v2 models define fields (holes, dimensions, policy_clause_ids).
Provider native JSON schema mode where available; else tool-style schema + validate + one repair retry.
Schema version in audit log — critical for replay.

Tool-calling protocol

Tools registered with JSON Schema args; runtime validates before execution.
Loop: model → tool_calls → execute allow-listed → append tool messages → model until done or cap.
Parallel read-only tool calls OK; writes sequential + idempotent.

Amazon Bedrock (JD + preferred)

IAM roles on ECS tasks — no long-lived keys in env.
Cross-region inference for resilience; model access via inference profiles.
Claude/Nova via Bedrock for enterprise contract and VPC endpoints.

Concern	Implementation
Rate limit	Per-tenant + per-key token bucket; queue or 429
Cost cap	max_tokens per job; downgrade model on budget
Failover	health-checked provider list; sticky session only for debugging
Testing	mock gateway in CI; record/replay golden traces

5 — Async workers & queue architecture (DLQ, retry, backpressure, idempotency)

Rule: HTTP accepts work; workers do work. Long PDF + multi-step LLM = async job.

Job state machine

Job lifecycle

stateDiagram-v2
  [*] --> Queued
  Queued --> Running
  Running --> Succeeded
  Running --> Failed
  Failed --> Retrying
  Retrying --> Running
  Retrying --> DeadLetter
  Failed --> DeadLetter
  Succeeded --> [*]
  DeadLetter --> [*]

Queue choice

AWS SQS + DLQ — simple, managed, visibility timeout for long jobs.
Redis + Celery/RQ — if team already standardized.
Temporal / Airflow (preferred) — long-running stateful jobs with built-in retry, timers, human signals.

Must-have patterns

Idempotency key on POST /jobs — unique constraint prevents duplicate charges/runs.
Visibility timeout > p99 step duration; heartbeat extends lease while worker alive.
Checkpointing after each pipeline step — resume from last good state.
DLQ + admin replay API with audit — poison PDFs quarantined.
Retry policy: classify transient (429, 503, timeout) vs permanent (400 bad doc); max attempts 3–5 with jitter.
Backpressure: scale workers on queue depth; API returns 503 when depth > threshold; shed low-priority tenants.
At-least-once delivery: workers must be idempotent — upsert by (job_id, step_id).

Your bank parallel

Long “analyze full case file” flows should be job-based like extraction — even if v1 was mostly sync chat, describe evolution to async for 100-page packs.

Notepad. Kafka from Project-1: domain events can trigger re-ingest — document-published → enqueue pipeline.

6 — AWS production deployment (ECS, RDS, S3, Bedrock, ALB, Secrets, IaC, blue-green)

Service topology

Component	AWS service	Notes
API	ECS Fargate/EC2 behind ALB	Health /ready; autoscale on CPU/RPS
Workers	ECS separate service	Scale on SQS ApproximateNumberOfMessagesVisible
Metadata	RDS PostgreSQL Multi-AZ	Extraction traces, jobs, tenants
Blobs	S3 SSE-KMS	PDFs, page images, JSON artifacts
LLM	Bedrock (+ optional VPC endpoint)	IAM role per task
Secrets	Secrets Manager	DB creds, API keys rotation
Logs/metrics	CloudWatch + OTel → X-Ray	Alarms on queue age, 5xx, LLM 429
IaC	Terraform or CDK	Environments via workspaces; no console drift

Docker & ECS task definitions

Local: Docker Compose — API + worker + Postgres + Redis + mock LLM for dev parity.
ECS task def: CPU/memory per service; secrets from Secrets Manager; awsvpc mode; health check command hits /ready.
Sidecars (optional): OTel collector exporting to X-Ray.
Image: slim Python base; non-root user; pin dependencies (lock file).

Blue-green / zero-downtime

New ECS task definition revision → target group weight 10/90 canary → 100.
Or CodeDeploy blue-green with automatic rollback on CloudWatch alarm.
DB migrations: expand-contract — add column → dual-write → backfill → switch read → drop old.

Security & compliance on AWS

IAM: task roles least privilege — Bedrock InvokeModel only on allowed ARNs; S3 prefix per tenant.
Network: private subnets; NAT for egress; no public RDS.
Data residency: region-pinned stacks per enterprise customer.

Your experience bridge

Project-1: EC2, RDS, S3, IAM, OpenShift/K8s, Helm, Jenkins — say AI services deploy with same GitOps discipline; greenfield role adds Bedrock-native gateway and ECS-first AI workers.

AWS deployment

flowchart TB
  ALB[ALB] --> API[ECS API tasks]
  API --> SQS[SQS]
  SQS --> W[ECS worker tasks]
  W --> RDS[(RDS Postgres)]
  W --> S3[(S3)]
  W --> BR[Bedrock]
  API --> SM[Secrets Manager]
  W --> SM
  CW[CloudWatch alarms] --> API
  CW --> W

7 — API design & governance (OpenAPI 3.1, FastAPI, versioning, 25+ endpoints)

Spec-first: OpenAPI 3.1 is the contract; FastAPI generates or validates against it.

Endpoint families (extraction platform — know them all)

Documents: upload, get, list, delete (soft), presigned URL.
Jobs: create, get status, cancel, list by tenant.
Extractions: get result, patch review status, export batch.
Pipelines: list versions, get schema for output types.
Admin: replay DLQ, trigger re-ingest, tenant config.
Chat/assistant (your bank): session, message, stream, feedback.

Versioning & compatibility

URL prefix /v1 or header Accept-Version.
Non-breaking: optional fields, new enum values, new endpoints.
Breaking: rename field, change type → new major version + sunset headers (Sunset, Deprecation).
Publish changelog; contract tests in CI (Schemathesis, Dredd).

FastAPI implementation details

Pydantic v2 request/response models; strict mode for external APIs.
Dependency injection for current_user, tenant_id, db_session.
request_id middleware; problem+json errors (type, title, detail).
Async endpoints for I/O; run_in_executor for CPU PDF work.

Notepad. Mention you enforced consistent error codes on Java and Python services for channel integrators.

8 — Data architecture (PostgreSQL, extraction metadata, indexing, multi-tenant)

Core tables (extraction platform)

tenants — plan, config, data residency region.
documents — tenant_id, s3_key, status, page_count, checksum.
extraction_jobs — pipeline_version, state, idempotency_key, timestamps.
extraction_runs — per-step traces (model, tokens, latency, chunk_ids).
extractions — structured fields JSONB (holes, dimensions, GD&T, title_block, notes).
batches — export jobs for dataset / review workflows.
review_states — human accept/reject, reviewer_id.

Indexing strategy

Query pattern	Index
List jobs by tenant + status	BTREE (tenant_id, status, created_at DESC)
Document by tenant + external id	UNIQUE (tenant_id, external_id)
Review queue low confidence	PARTIAL WHERE confidence < τ
JSONB field lookup	GIN on extractions.fields (sparingly)

SQLAlchemy 2.0 async

async_sessionmaker; explicit transactions per request.
Repository layer — no raw SQL in route handlers.
Alembic migrations in CI; backward-compatible expand-contract.

Multi-tenant isolation in DB

Every query includes tenant_id from JWT — never from client body alone.
RLS: SET app.tenant_id on connection; policy tenant_id = current_setting(...).
Integration tests that attempt cross-tenant read must fail.

Bank assistant parallel

Store: session_id, message, retrieval chunk_ids, model_version, response_hash — not full customer PII in logs.

9 — Security & compliance (OAuth2, JWT, RBAC, CORS, IAM, secrets, service-to-service)

Authentication

OAuth2 / OIDC with bank IdP; JWT access tokens short-lived; refresh via secure cookie or backend-for-frontend.
Validate issuer, audience, signature, expiry on every request.

RBAC layers (repeat until automatic)

API route authorization (roles/scopes).
Tool allow-list per role.
RAG filter-first on chunk metadata (libraries, classification).
Row-level DB tenant isolation.

Service-to-service

mTLS inside mesh OR signed internal JWT with short TTL.
No shared API keys between services — rotate per task role.

CORS & browser clients

Allowlist origins; no * with credentials.
Separate public vs internal assistant origins.

Secrets

AWS Secrets Manager; inject at task start; never in git.
Bedrock via IAM role — preferred over API keys.

LLM-specific threats

Prompt injection — treat retrieved text as untrusted; delimiter + policy router.
PII in prompts — mask/deny; minimize log retention.
Audit 7yr retention policy — confirm with compliance (bank).

Notepad. Your Project-0: writes and outbound comms always human-approved; agent cannot call arbitrary internal APIs.

10 — Observability & reliability (logging, OTel, X-Ray, CloudWatch, SLOs)

Three pillars — what to instrument

Logs: structured JSON — request_id, tenant_id, job_id, pipeline_step, model, tokens, outcome.
Metrics: queue depth, job duration histogram, LLM latency, abstain rate, retrieval empty rate, cost USD.
Traces: OpenTelemetry → AWS X-Ray; span per retrieval, LLM call, tool execution.

SLO examples (state these numbers as examples, adjust to role)

SLI	SLO target	Error budget action
Pipeline throughput	p95 job completion < 5 min for 50-page doc	Scale workers; optimize rasterize
LLM call latency	p95 < 8s per structured call	Failover provider; reduce context
API availability	99.9% monthly	Freeze features; fix burn
Extraction accuracy	>92% on golden set	Block prompt deploy

Alarms (CloudWatch)

SQS ApproximateAgeOfOldestMessage > 10 min.
ECS CPU > 85% sustained; task restarts spike.
LLM 429 rate > 5% over 5 min.
DLQ message count > 0.

Your Datadog experience

Project-1: service dashboards, distributed traces on posting path — same trace_id propagated into Python orchestrator for AI requests.

11 — Engineering leadership (mentor, reviews, standards, roadmap)

What Principal means in interview

Owns technical direction — writes RFCs, not only tickets.
Raises bar — API standards, pipeline registry, on-call playbooks.
Develops seniors — design review facilitation, delegation, career coaching.
Partners — product, security, compliance, ops — translates constraints into architecture.

Concrete examples to cite

Introduced golden-question eval gate before prompt promotion.
Split AI orchestration from Java core — reduced release coupling.
Post-incident: DLQ replay runbook + idempotency bug fix class.
Mentored engineers on Spring + FastAPI production patterns.

Roadmap narrative (prototype → enterprise)

v1: sync RAG assistant + audit.
v2: async jobs + pipeline registry + multi-tenant hardening.
v3: extraction agents on low-confidence regions only + Temporal.
v4: multi-region, Bedrock cross-region, enterprise SSO federation.

Notepad. Use STAR: Situation, Task, Action, Result — one story per leadership bullet.

12 — Python depth (FastAPI, SQLAlchemy, Pydantic v2, asyncio, ThreadPoolExecutor)

FastAPI

Async routes for DB/HTTP/LLM I/O; sync only when necessary.
Lifespan hooks: warm connections, drain on shutdown.
BackgroundTasks for fire-and-forget only if loss acceptable — prefer queue for real work.

Pydantic v2

model_validate, Field constraints, model_config = ConfigDict(strict=True) for external APIs.
Separate DTOs: JobCreate, JobResponse, internal ExtractionRecord.

SQLAlchemy 2.0

select() style; async engine; avoid N+1 with joinedload where needed.
Unit of work per request; explicit commit/rollback.

asyncio vs ThreadPoolExecutor

asyncio: network-bound concurrency (retrieval, LLM HTTP).
ThreadPoolExecutor: CPU-bound PDF rasterize, image ops — do not block event loop.
asyncio.to_thread() or executor wrapper pattern.

Notepad. Interview trap: 'async is faster for PDF parsing' — wrong; async helps concurrent I/O waits.

13 — Distributed systems patterns (sagas, circuit breaker, idempotency, degradation)

Pattern	Use in AI platform	One-liner
Idempotency keys	POST /jobs, tool writes	Same key → same job id, no double extract
At-least-once + idempotent worker	SQS consumers	Upsert by job_id+step
Circuit breaker	LLM provider gateway	Stop hammering failing provider
Retry + jitter	429/503	Avoid thundering herd
Saga / compensation	Multi-step pipeline failure	Mark job failed; release lease; notify; no partial bill
Graceful degradation	LLM down	Return retrieved chunks only or queue for later
Bulkhead	Separate worker pools	Ingest vs inference vs export
DLQ	Poison messages	Human triage + replay

Saga example (spoken)

Job runs rasterize → extract → persist. Persist fails after S3 write: compensate by marking job failed, retaining artifacts for debug, not charging customer credit, enqueue notification — do not leave job running forever.

14 — Production agentic AI (LangGraph, tools, guardrails, patterns)

When agents vs pipeline

Use fixed pipeline	Use bounded agent
Stable layout, known fields	Variable cross-page reasoning
Cost-sensitive bulk extract	Low-volume exception handling
Strict audit per step	Needs tool access to case/ticket systems

Agentic patterns (JD preferred — name them)

Planner–executor: planner outputs steps; executor runs tools — planner not authority on writes.
Multi-round tool-use loop: cap rounds; fingerprint repeated tool+args to detect loops.
Shared notepad / blackboard: schema-versioned state; single writer per slot (facts).
Confidence-based validation: deterministic rules → optional second LLM → human queue.
Supervisor–worker: rule supervisor in CI; LLM supervisor behind feature flag only.

LangGraph (your stack)

Nodes = functions; edges = conditions; checkpoint for resume.
Human node = interrupt; resume with approval payload.

LangChain vs CrewAI

LangChain = composable steps. CrewAI = role personas — mention but prefer LangGraph for production state. CrewAI good for prototypes.

Guardrails checklist

Tool allow-list in code per node.
Max steps, tokens, wall time, USD per session.
No API keys in context — secrets gateway.
Audit every tool call: name, args hash, latency, approval_id.

Bounded agent loop

flowchart LR
  P[Planner] --> E[Executor]
  E --> T{Tool?}
  T -->|read| R[Allow-listed APIs]
  T -->|write| H[Human approve]
  R --> V[Validator]
  H --> V
  V -->|ok| OUT[Response]
  V -->|fail| A[Abstain]

15 — RAG & document pipelines (chunking, hybrid retrieval, PyMuPDF, vectors)

Ingest pipeline

Upload → checksum → tenant tag → S3.
Parse PDF (PyMuPDF): text + coordinates per page.
Chunk: section-aware; overlap for boundary facts; table handling strategy.
Embed → vector index with metadata (doc_id, version, ACL, page, bbox optional).
Keyword index optional (BM25) for IDs and regulatory codes — hybrid RRF merge.

Retrieval at query time

Filter-first ACL → hybrid search → reranker (cross-encoder optional) → top-k context.
Compression: summarize long chunks before LLM if over token budget.
Citations: return chunk_id + page + source doc version in response schema.

ChromaDB / vector stores

Chroma for dev/small prod; at scale: OpenSearch, pgvector, Pinecone — principles identical: metadata filters + collection per env.

Your bank implementation

Policy PDFs/circulars; re-ingest on change; retrieval-time library filter by role; monitor empty-retrieval rate.

Notepad. Extraction adds: bounding box normalization, multi-page coordinate systems, rasterized page images for vision models.

16 — Good-to-have depth (Bedrock, Temporal, K8s, ML serving, SaaS)

Amazon Bedrock

IAM-based access; inference profiles; cross-region for DR.
Model choice: Claude for extraction JSON; Nova for cost-sensitive steps (if approved).

Temporal vs Airflow

Temporal: long-running workflows, signals (human approve), precise retries — best for extraction jobs.
Airflow: scheduled batch analytics, nightly re-index — not interactive latency.

Kubernetes / EKS / service mesh

Your OpenShift/Helm experience maps to EKS + Helm.
Istio/Linkerd: mTLS, traffic split for canaries — mention for enterprise evolution.

ML serving (when JD asks)

YOLO/object detection as separate GPU service — versioned model artifact in S3.
HuggingFace/PyTorch for custom rankers — batch inference off hot path.

High-scale SaaS

Multi-region active-passive; tenant pinning; strict SLAs; enterprise SSO + SCIM.
Noisy neighbor: per-tenant rate limits and fair queue scheduling.

17 — 45-minute system design script (extraction platform whiteboard)

Minute 0–5 — Clarify

Tenants? doc types? sync vs async SLA? human review? multi-region?
Non-goals: real-time sub-second on 200-page PDF; full autonomy without audit.

Minute 5–15 — High-level diagram

ALB → FastAPI → SQS → workers → (rasterize → detect → LLM extract → validate) → Postgres + S3; LLM gateway; OTel; Secrets Manager.

Minute 15–25 — Deep dives they pick

Multi-tenant: tenant_id + RLS + S3 prefix + queue fair scheduling.
Pipeline registry + idempotent jobs + DLQ.
LLM gateway failover + structured schema per field type.

Minute 25–35 — Agentic (only if asked)

Fixed pipeline v1; agent on low-confidence regions with 3 tool max; human review queue.

Minute 35–45 — Operate

SLOs, golden tests in CI, blue-green deploy, incident runbooks, cost per document metric.

Notepad. Close with tradeoff: correctness and auditability first, then cost via model routing, then latency via parallel page workers.

18 — Rapid-fire Q&A bank

Question	Answer
Why RAG not fine-tune?	Policies change; RAG + versioned docs auditable; fine-tune blurs lineage
SQLAlchemy sync vs async?	Async API under load; one async engine per process
Schema per tenant?	Only enterprise tier; default RLS + tenant_id
Exactly-once jobs?	At-least-once queue + idempotent upserts — exactly-once end state
OpenAPI vs code-first?	Spec-first for 25+ endpoints and mobile lagging clients
How prevent prompt injection?	Untrusted retrieval; input filter; fail closed; no tool beyond allow-list
Biggest cost driver?	LLM tokens on full doc — route, cache, chunk, parallelize cheap steps
How measure extraction accuracy?	Golden docs; field-level F1; human eval sample; CI gate

19 — Day-before checklist

Draw architecture diagram 3× from memory (2 min each).
Speak 30s opener + 3 min project story out loud.
Recite RBAC four layers without notes.
Recite job state machine + idempotency + DLQ.
Name 5 CloudWatch alarms.
Prepare 2 STAR stories: incident + leadership conflict.
Sleep — familiarity beats cramming new frameworks.

Built from Optivalue Project-0 (sharpbyte.dev/about) + Principal Backend AI platform JD. Cross-links: Design guide, Regulated LLM banking, About.