sharpbyte.dev
← Interview ready
Interview notepad · Single source of truth

Principal Backend AI platform

One scrollable page covering every JD topic: architecture, extraction pipelines, LLM gateway, queues, AWS, APIs, PostgreSQL, security, observability, leadership, Python, distributed patterns, production agents, RAG/PDF, and good-to-haves — anchored on Optivalue Project-0 (banking document assistant).

This page has moved. Use the easier Interview guide — eighteen short chapters in What / Why / How format with larger type and links to existing sharpbyte.dev topics.

0 — How to use this notepad (single source of truth)

This page is the only prep doc you need for a Principal Backend Engineer — AI platform interview. It maps every responsibility in the target JD to components, tools, design principles, and spoken answers.

Anchor story: OptivalueTek Project-0 — internal banking assistant (policies, KYC, cases) grounded on approved documents. Parallel mapping: same platform patterns as a document extraction product (ingest → pipeline → LLM → metadata DB → review UI).

Study order (one week)

  • Day 1–2: §1 (your story) + §2 (architecture) — be able to draw the diagram from memory.
  • Day 3: §4 LLM gateway + §5 queues + §8 data.
  • Day 4: §3 pipelines + §14 agentic + §15 RAG/PDF.
  • Day 5: §6 AWS + §9 security + §10 observability.
  • Day 6: §7 APIs + §12 Python + §13 distributed patterns.
  • Day 7: §17 whiteboard + §18 Q&A — speak answers out loud.
Notepad. Open every answer with: problem → non-goals → architecture → tradeoffs → metrics. Never lead with framework names; lead with boundaries and failure modes.

Master component map (every JD bullet → section)

JD topic§Your anchor
Platform architecture, multi-tenant scale§2Bank entity/branch + chunk ACL
Extraction pipelines, registry, cost/latency/accuracy§3Policy ingest + re-ingest
LLM gateway, keys, failover, structured output, tools§4Structured JSON + gateway pattern
Async workers, DLQ, retry, backpressure§5Evolve long case packs to jobs
AWS ECS, RDS, S3, Bedrock, ALB, Secrets, IaC§6Project-1 AWS + OpenShift
OpenAPI 3.1, FastAPI endpoints, versioning§7FastAPI orchestration APIs
PostgreSQL metadata, indexing, tenancy§8Audit + session metadata pattern
OAuth2, JWT, RBAC, CORS, IAM§9SSO + retrieval-time ACL
Logging, tracing, SLOs§10Datadog traces + audit events
Technical leadership§11RFCs, mentoring, eval gates
Python FastAPI SQLAlchemy Pydantic asyncio§12Daily orchestration stack
Sagas, circuit breaker, idempotency§13Kafka + integration habits
LangGraph, agentic patterns, guardrails§14Project-0 agent loop
RAG, PyMuPDF, vectors, hybrid§15Policy RAG layer
Bedrock, Temporal, K8s, ML serving§16Bridge from experience
System design whiteboard§1745-min script

1 — Your current project (Optivalue Project-0) — tell-me-about-yourself

30-second opener

At OptivalueTek I lead a large retail bank programme. Project-0 is an internal AI assistant for branch and back-office staff: credit policies, KYC steps, exceptions, and case context in plain language — answers must come from approved documents, respect role-based access, and be auditable. I owned orchestration: RAG, LangChain steps, LangGraph for branching and human-in-the-loop, bounded agent tools on allow-listed APIs, and FastAPI beside existing Java/Spring services with shared SSO and logging.

Users and problem

  • Users: branch staff, KYC ops, credit analysts — not customers.
  • Pain: policy in PDFs/circulars/SOPs; inconsistent interpretation; slow search.
  • Success: grounded answers + citations; fail-closed on weak evidence; audit replay; least privilege.

Four ways AI is used

  1. RAG Q&A (~80%): rewrite → ACL retrieval → generate with citations → abstain if weak.
  2. Structured JSON: checklists and case note drafts for workflow UI (Pydantic validation).
  3. Bounded agent: allow-listed read tools (case lookup, policy version); max steps/tokens/time.
  4. Human-in-the-loop: high-risk intents and any write/side effect pause in LangGraph.

What you did as Technical Lead

  • Architecture boundaries: Python orchestration vs Java system-of-record.
  • RBAC at retrieval (filter-first), not only API gateway.
  • Prompt template versioning + golden-question CI evals.
  • Compliance/design reviews; mentoring on async FastAPI and production ownership.

Sibling track (Project-1): core customer/transaction microservices — Java 21, Spring Boot, Kafka, AWS, OpenShift — proves you operate full platform, not AI silo.

RBAC — five layers (speak in order)

  1. Identity: bank SSO OAuth2/OIDC → JWT (issuer, audience, expiry validated every call).
  2. API authorization: FastAPI deps map roles/scopes to routes; tenant/entity from claims not client body.
  3. Retrieval ACL (most important for AI): chunk metadata allowed_roles, library, effective_from; vector query applies filters before similarity ranking.
  4. Tool RBAC: each tool declares scope; runtime enforces; writes need case:write + human approval.
  5. Session isolation: no cross-user memory; TTL session store; stateless workers; server-side prompt assembly only.

Guardrails — input / generation / output

LayerControlsFail mode
InputInjection filter, PII mask, risk router, out-of-scope denyRefuse early
GenerationBank system prompts, retrieval-only context, schema mode, tool allow-list, step/token capsTruncate or route to HITL
OutputPydantic validate, citation completeness, claim-evidence check, numeric verify, PII scrubAbstain — no softer guess

Tools on Project-0 — what and why

ToolRoleWhy chosen
PythonOrchestration, ingest jobs, evalsSpeed of iteration; same code notebook → batch
FastAPIChat, retrieval, stream tokensAsync I/O; OpenAPI; fits AI iteration
Pydantic v2API + LLM output validationFail bad JSON before workflow UI
Vector DBPolicy chunk indexSemantic search over long PDFs
EmbeddingsChunk vectorsMatch paraphrased staff questions
LangChainRewrite, retrieve, compress, parseComposable testable steps
LangGraphBranch, HITL, agent loop, retriesExplicit states vs nested ifs
LLM (enterprise API)GenerationBank contract; private endpoint
Java/SpringCase and core APIsSystem of record; existing ops
OpenShift/Helm/JenkinsDeploySame as transaction services
DatadogTraces and dashboardsOne pane with Project-1

Project-0 request path

flowchart TB
  UI[Internal UI] --> SSO[Bank SSO OAuth2]
  SSO --> API[FastAPI orchestrator]
  API --> IN[Input guardrails]
  IN --> LG[LangGraph]
  LG --> RAG[ACL vector retrieval]
  LG --> LLM[LLM gateway]
  LG --> TOOLS[Allow-listed APIs]
  TOOLS --> JAVA[Java core services]
  LG --> OUT[Output verify abstain]
  OUT --> AUD[Audit log]
  BATCH[Re-ingest jobs] --> RAG
Notepad. Phrase: "In banking, safe AI means deny by default — verification failure returns no answer, not a softer guess."

2 — Platform architecture & scalability (multi-tenant, high throughput)

Interview framing: prototype → production = separate control plane (API, auth, config) from data plane (workers, LLM, storage).

Reference architecture (target extraction platform + your bank assistant)

  • Edge: ALB + WAF → API service (FastAPI) — sync only for submit/status/chat stream.
  • Async plane: queue → worker pool → pipeline registry executes versioned steps.
  • Knowledge plane: object storage (raw docs) + vector index + PostgreSQL metadata.
  • LLM plane: provider gateway (keys, failover, structured output, rate limits).
  • Observability plane: OTel traces, metrics, audit events (append-only).

Multi-tenancy models (know all three)

ModelIsolationWhen to useRisk
Shared schema + tenant_id columnApp filters every query; optional RLS in PostgresDefault for SaaS; fastest to shipBug in filter = cross-tenant leak
Row-Level Security (RLS)DB enforces tenant_id from session varRegulated SaaS; defense in depthMigration complexity; connection pooling setup
Schema-per-tenant / DB-per-tenantHard isolationEnterprise tier, data residencyOps cost; migration explosion

Your story: bank used entity/branch claims + chunk ACL metadata — same as tenant-scoped retrieval filters.

Microservices decomposition (10+ years framing)

  • Split by business capability — documents, jobs, extraction results, LLM gateway, ingest workers — not “one AI service.”
  • Failure isolation: LLM outage must not take down document upload; queue absorbs spikes.
  • Communication: sync REST for queries; async events for pipeline; no distributed monolith chat between 12 services.
  • Data ownership: each service owns its tables; no shared mutable DB antipattern.
  • Your evidence: Project-1 split customer, postings, limits; Project-0 Python orchestration beside Java — integration via APIs + shared identity only.

Scalability tactics

  • Stateless API horizontal scale; session in Redis/DB with TTL.
  • Workers scale on queue depth + GPU/CPU saturation (KEDA/ECS autoscaling).
  • Partition hot paths: ingest, inference, and export as separate services.
  • Backpressure: 429 when queue age > SLO; shed load before OOM.
  • Cache: embedding cache for repeated policy chunks; not unbounded chat memory.

Throughput for extraction workloads

Documents are bursty (batch upload). Design: accept job → return job_id → workers process pages in parallel → aggregate → persist traces. Never hold HTTP open for 10-minute PDF runs.

Multi-tenant platform layers

flowchart LR
  subgraph tenants [Tenants]
    T1[Org A]
    T2[Org B]
  end
  subgraph edge [Edge]
    ALB[ALB]
    API[FastAPI API]
  end
  subgraph data [Data plane]
    Q[Queue]
    W[Workers]
    PG[(PostgreSQL)]
    S3[(S3)]
    VDB[(Vector index)]
  end
  T1 --> ALB
  T2 --> ALB
  ALB --> API --> Q --> W
  W --> PG
  W --> S3
  W --> VDB

3 — Extraction pipeline engineering (registry, abstractions, execution strategies)

JD focus: architect AI pipelines balancing accuracy, latency, LLM cost.

Pipeline as a product, not a script

  • Pipeline definition: versioned DAG of steps (declarative YAML/JSON or code registry).
  • Pipeline registry: pipeline_id@version → step list, model routes, validators.
  • Execution strategies: sync (small doc), async queue (large), fan-out per page, map-reduce merge.
  • Artifacts: each step writes typed outputs to S3 + metadata row (trace).

Typical extraction stages (engineering drawings OR bank policy docs)

StageTechOutputCost/latency knob
IngestS3 upload, virus scan, tenant tagraw_object_keycheap
Rasterize/parsePyMuPDF / PDF parserpages, images, text spansCPU — ThreadPoolExecutor
Layout/detectYOLO/rules (optional)bounding boxesGPU or skip for v1
Region cropcoordinates normalizedcrops per field typemedium
LLM extractstructured schema per fieldJSON: holes, GD&T, title block OR policy clausesexpensive — route model
Validatedeterministic rules + second-pass LLMconfidence scoretunable
PersistPostgreSQL + exportreview queuecheap

Your parallel (bank assistant)

Ingest policy PDFs → chunk + embed → retrieve → LLM answer with schema → validate citations → audit. Re-ingest job when circular changes — same as extraction re-index.

Accuracy vs latency vs cost (how to talk tradeoffs)

  • Cheaper model for rewrite/routing; frontier model only for final extract or low-confidence regions.
  • Deterministic validators first (regex, units, GD&T rules) before second LLM pass.
  • Confidence routing: auto-accept > τ, human review between τ₁–τ₂, reject below τ₂.
  • Golden-file regression: CI runs pipeline on fixed docs; block deploy on accuracy drop.

Pipeline registry API (design verbally)

POST /v1/jobs body: { pipeline_version, document_id, tenant_id, idempotency_key }. Worker loads registry entry, runs state machine, checkpoints after each step.

Notepad. Say: v1 = linear pipeline + structured LLM; v2 = agent only on ambiguous regions; never agent-on-every-field.

4 — Multi-provider LLM abstraction (keys, failover, structured output, tools)

All model calls go through a gateway service — never scatter SDK calls in workers.

Gateway responsibilities

  • Unified interface: complete(), complete_structured(schema), invoke_tools(messages, tools).
  • Multi-key round-robin: pool of API keys per provider; per-key rate limiter (token bucket).
  • Provider failover: primary → secondary (Claude ↔ OpenAI ↔ Bedrock) with same JSON schema contract.
  • Circuit breaker: open on error rate; half-open probe; per-provider health.
  • Retries: transient 429/503 with exponential backoff + jitter; respect Retry-After.
  • Observability: log model id, tokens, latency, cost estimate per request_id.

Structured output

  • Pydantic v2 models define fields (holes, dimensions, policy_clause_ids).
  • Provider native JSON schema mode where available; else tool-style schema + validate + one repair retry.
  • Schema version in audit log — critical for replay.

Tool-calling protocol

  • Tools registered with JSON Schema args; runtime validates before execution.
  • Loop: model → tool_calls → execute allow-listed → append tool messages → model until done or cap.
  • Parallel read-only tool calls OK; writes sequential + idempotent.

Amazon Bedrock (JD + preferred)

  • IAM roles on ECS tasks — no long-lived keys in env.
  • Cross-region inference for resilience; model access via inference profiles.
  • Claude/Nova via Bedrock for enterprise contract and VPC endpoints.
ConcernImplementation
Rate limitPer-tenant + per-key token bucket; queue or 429
Cost capmax_tokens per job; downgrade model on budget
Failoverhealth-checked provider list; sticky session only for debugging
Testingmock gateway in CI; record/replay golden traces

5 — Async workers & queue architecture (DLQ, retry, backpressure, idempotency)

Rule: HTTP accepts work; workers do work. Long PDF + multi-step LLM = async job.

Job state machine

Job lifecycle

stateDiagram-v2
  [*] --> Queued
  Queued --> Running
  Running --> Succeeded
  Running --> Failed
  Failed --> Retrying
  Retrying --> Running
  Retrying --> DeadLetter
  Failed --> DeadLetter
  Succeeded --> [*]
  DeadLetter --> [*]

Queue choice

  • AWS SQS + DLQ — simple, managed, visibility timeout for long jobs.
  • Redis + Celery/RQ — if team already standardized.
  • Temporal / Airflow (preferred) — long-running stateful jobs with built-in retry, timers, human signals.

Must-have patterns

  • Idempotency key on POST /jobs — unique constraint prevents duplicate charges/runs.
  • Visibility timeout > p99 step duration; heartbeat extends lease while worker alive.
  • Checkpointing after each pipeline step — resume from last good state.
  • DLQ + admin replay API with audit — poison PDFs quarantined.
  • Retry policy: classify transient (429, 503, timeout) vs permanent (400 bad doc); max attempts 3–5 with jitter.
  • Backpressure: scale workers on queue depth; API returns 503 when depth > threshold; shed low-priority tenants.
  • At-least-once delivery: workers must be idempotent — upsert by (job_id, step_id).

Your bank parallel

Long “analyze full case file” flows should be job-based like extraction — even if v1 was mostly sync chat, describe evolution to async for 100-page packs.

Notepad. Kafka from Project-1: domain events can trigger re-ingest — document-published → enqueue pipeline.

6 — AWS production deployment (ECS, RDS, S3, Bedrock, ALB, Secrets, IaC, blue-green)

Service topology

ComponentAWS serviceNotes
APIECS Fargate/EC2 behind ALBHealth /ready; autoscale on CPU/RPS
WorkersECS separate serviceScale on SQS ApproximateNumberOfMessagesVisible
MetadataRDS PostgreSQL Multi-AZExtraction traces, jobs, tenants
BlobsS3 SSE-KMSPDFs, page images, JSON artifacts
LLMBedrock (+ optional VPC endpoint)IAM role per task
SecretsSecrets ManagerDB creds, API keys rotation
Logs/metricsCloudWatch + OTel → X-RayAlarms on queue age, 5xx, LLM 429
IaCTerraform or CDKEnvironments via workspaces; no console drift

Docker & ECS task definitions

  • Local: Docker Compose — API + worker + Postgres + Redis + mock LLM for dev parity.
  • ECS task def: CPU/memory per service; secrets from Secrets Manager; awsvpc mode; health check command hits /ready.
  • Sidecars (optional): OTel collector exporting to X-Ray.
  • Image: slim Python base; non-root user; pin dependencies (lock file).

Blue-green / zero-downtime

  • New ECS task definition revision → target group weight 10/90 canary → 100.
  • Or CodeDeploy blue-green with automatic rollback on CloudWatch alarm.
  • DB migrations: expand-contract — add column → dual-write → backfill → switch read → drop old.

Security & compliance on AWS

  • IAM: task roles least privilege — Bedrock InvokeModel only on allowed ARNs; S3 prefix per tenant.
  • Network: private subnets; NAT for egress; no public RDS.
  • Data residency: region-pinned stacks per enterprise customer.

Your experience bridge

Project-1: EC2, RDS, S3, IAM, OpenShift/K8s, Helm, Jenkins — say AI services deploy with same GitOps discipline; greenfield role adds Bedrock-native gateway and ECS-first AI workers.

AWS deployment

flowchart TB
  ALB[ALB] --> API[ECS API tasks]
  API --> SQS[SQS]
  SQS --> W[ECS worker tasks]
  W --> RDS[(RDS Postgres)]
  W --> S3[(S3)]
  W --> BR[Bedrock]
  API --> SM[Secrets Manager]
  W --> SM
  CW[CloudWatch alarms] --> API
  CW --> W

7 — API design & governance (OpenAPI 3.1, FastAPI, versioning, 25+ endpoints)

Spec-first: OpenAPI 3.1 is the contract; FastAPI generates or validates against it.

Endpoint families (extraction platform — know them all)

  • Documents: upload, get, list, delete (soft), presigned URL.
  • Jobs: create, get status, cancel, list by tenant.
  • Extractions: get result, patch review status, export batch.
  • Pipelines: list versions, get schema for output types.
  • Admin: replay DLQ, trigger re-ingest, tenant config.
  • Chat/assistant (your bank): session, message, stream, feedback.

Versioning & compatibility

  • URL prefix /v1 or header Accept-Version.
  • Non-breaking: optional fields, new enum values, new endpoints.
  • Breaking: rename field, change type → new major version + sunset headers (Sunset, Deprecation).
  • Publish changelog; contract tests in CI (Schemathesis, Dredd).

FastAPI implementation details

  • Pydantic v2 request/response models; strict mode for external APIs.
  • Dependency injection for current_user, tenant_id, db_session.
  • request_id middleware; problem+json errors (type, title, detail).
  • Async endpoints for I/O; run_in_executor for CPU PDF work.
Notepad. Mention you enforced consistent error codes on Java and Python services for channel integrators.

8 — Data architecture (PostgreSQL, extraction metadata, indexing, multi-tenant)

Core tables (extraction platform)

  • tenants — plan, config, data residency region.
  • documents — tenant_id, s3_key, status, page_count, checksum.
  • extraction_jobs — pipeline_version, state, idempotency_key, timestamps.
  • extraction_runs — per-step traces (model, tokens, latency, chunk_ids).
  • extractions — structured fields JSONB (holes, dimensions, GD&T, title_block, notes).
  • batches — export jobs for dataset / review workflows.
  • review_states — human accept/reject, reviewer_id.

Indexing strategy

Query patternIndex
List jobs by tenant + statusBTREE (tenant_id, status, created_at DESC)
Document by tenant + external idUNIQUE (tenant_id, external_id)
Review queue low confidencePARTIAL WHERE confidence < τ
JSONB field lookupGIN on extractions.fields (sparingly)

SQLAlchemy 2.0 async

  • async_sessionmaker; explicit transactions per request.
  • Repository layer — no raw SQL in route handlers.
  • Alembic migrations in CI; backward-compatible expand-contract.

Multi-tenant isolation in DB

  • Every query includes tenant_id from JWT — never from client body alone.
  • RLS: SET app.tenant_id on connection; policy tenant_id = current_setting(...).
  • Integration tests that attempt cross-tenant read must fail.

Bank assistant parallel

Store: session_id, message, retrieval chunk_ids, model_version, response_hash — not full customer PII in logs.

9 — Security &amp; compliance (OAuth2, JWT, RBAC, CORS, IAM, secrets, service-to-service)

Authentication

  • OAuth2 / OIDC with bank IdP; JWT access tokens short-lived; refresh via secure cookie or backend-for-frontend.
  • Validate issuer, audience, signature, expiry on every request.

RBAC layers (repeat until automatic)

  1. API route authorization (roles/scopes).
  2. Tool allow-list per role.
  3. RAG filter-first on chunk metadata (libraries, classification).
  4. Row-level DB tenant isolation.

Service-to-service

  • mTLS inside mesh OR signed internal JWT with short TTL.
  • No shared API keys between services — rotate per task role.

CORS & browser clients

  • Allowlist origins; no * with credentials.
  • Separate public vs internal assistant origins.

Secrets

  • AWS Secrets Manager; inject at task start; never in git.
  • Bedrock via IAM role — preferred over API keys.

LLM-specific threats

  • Prompt injection — treat retrieved text as untrusted; delimiter + policy router.
  • PII in prompts — mask/deny; minimize log retention.
  • Audit 7yr retention policy — confirm with compliance (bank).
Notepad. Your Project-0: writes and outbound comms always human-approved; agent cannot call arbitrary internal APIs.

10 — Observability &amp; reliability (logging, OTel, X-Ray, CloudWatch, SLOs)

Three pillars — what to instrument

  • Logs: structured JSON — request_id, tenant_id, job_id, pipeline_step, model, tokens, outcome.
  • Metrics: queue depth, job duration histogram, LLM latency, abstain rate, retrieval empty rate, cost USD.
  • Traces: OpenTelemetry → AWS X-Ray; span per retrieval, LLM call, tool execution.

SLO examples (state these numbers as examples, adjust to role)

SLISLO targetError budget action
Pipeline throughputp95 job completion < 5 min for 50-page docScale workers; optimize rasterize
LLM call latencyp95 < 8s per structured callFailover provider; reduce context
API availability99.9% monthlyFreeze features; fix burn
Extraction accuracy>92% on golden setBlock prompt deploy

Alarms (CloudWatch)

  • SQS ApproximateAgeOfOldestMessage > 10 min.
  • ECS CPU > 85% sustained; task restarts spike.
  • LLM 429 rate > 5% over 5 min.
  • DLQ message count > 0.

Your Datadog experience

Project-1: service dashboards, distributed traces on posting path — same trace_id propagated into Python orchestrator for AI requests.

11 — Engineering leadership (mentor, reviews, standards, roadmap)

What Principal means in interview

  • Owns technical direction — writes RFCs, not only tickets.
  • Raises bar — API standards, pipeline registry, on-call playbooks.
  • Develops seniors — design review facilitation, delegation, career coaching.
  • Partners — product, security, compliance, ops — translates constraints into architecture.

Concrete examples to cite

  • Introduced golden-question eval gate before prompt promotion.
  • Split AI orchestration from Java core — reduced release coupling.
  • Post-incident: DLQ replay runbook + idempotency bug fix class.
  • Mentored engineers on Spring + FastAPI production patterns.

Roadmap narrative (prototype → enterprise)

  1. v1: sync RAG assistant + audit.
  2. v2: async jobs + pipeline registry + multi-tenant hardening.
  3. v3: extraction agents on low-confidence regions only + Temporal.
  4. v4: multi-region, Bedrock cross-region, enterprise SSO federation.
Notepad. Use STAR: Situation, Task, Action, Result — one story per leadership bullet.

12 — Python depth (FastAPI, SQLAlchemy, Pydantic v2, asyncio, ThreadPoolExecutor)

FastAPI

  • Async routes for DB/HTTP/LLM I/O; sync only when necessary.
  • Lifespan hooks: warm connections, drain on shutdown.
  • BackgroundTasks for fire-and-forget only if loss acceptable — prefer queue for real work.

Pydantic v2

  • model_validate, Field constraints, model_config = ConfigDict(strict=True) for external APIs.
  • Separate DTOs: JobCreate, JobResponse, internal ExtractionRecord.

SQLAlchemy 2.0

  • select() style; async engine; avoid N+1 with joinedload where needed.
  • Unit of work per request; explicit commit/rollback.

asyncio vs ThreadPoolExecutor

  • asyncio: network-bound concurrency (retrieval, LLM HTTP).
  • ThreadPoolExecutor: CPU-bound PDF rasterize, image ops — do not block event loop.
  • asyncio.to_thread() or executor wrapper pattern.
Notepad. Interview trap: 'async is faster for PDF parsing' — wrong; async helps concurrent I/O waits.

13 — Distributed systems patterns (sagas, circuit breaker, idempotency, degradation)

PatternUse in AI platformOne-liner
Idempotency keysPOST /jobs, tool writesSame key → same job id, no double extract
At-least-once + idempotent workerSQS consumersUpsert by job_id+step
Circuit breakerLLM provider gatewayStop hammering failing provider
Retry + jitter429/503Avoid thundering herd
Saga / compensationMulti-step pipeline failureMark job failed; release lease; notify; no partial bill
Graceful degradationLLM downReturn retrieved chunks only or queue for later
BulkheadSeparate worker poolsIngest vs inference vs export
DLQPoison messagesHuman triage + replay

Saga example (spoken)

Job runs rasterize → extract → persist. Persist fails after S3 write: compensate by marking job failed, retaining artifacts for debug, not charging customer credit, enqueue notification — do not leave job running forever.

14 — Production agentic AI (LangGraph, tools, guardrails, patterns)

When agents vs pipeline

Use fixed pipelineUse bounded agent
Stable layout, known fieldsVariable cross-page reasoning
Cost-sensitive bulk extractLow-volume exception handling
Strict audit per stepNeeds tool access to case/ticket systems

Agentic patterns (JD preferred — name them)

  • Planner–executor: planner outputs steps; executor runs tools — planner not authority on writes.
  • Multi-round tool-use loop: cap rounds; fingerprint repeated tool+args to detect loops.
  • Shared notepad / blackboard: schema-versioned state; single writer per slot (facts).
  • Confidence-based validation: deterministic rules → optional second LLM → human queue.
  • Supervisor–worker: rule supervisor in CI; LLM supervisor behind feature flag only.

LangGraph (your stack)

  • Nodes = functions; edges = conditions; checkpoint for resume.
  • Human node = interrupt; resume with approval payload.

LangChain vs CrewAI

LangChain = composable steps. CrewAI = role personas — mention but prefer LangGraph for production state. CrewAI good for prototypes.

Guardrails checklist

  • Tool allow-list in code per node.
  • Max steps, tokens, wall time, USD per session.
  • No API keys in context — secrets gateway.
  • Audit every tool call: name, args hash, latency, approval_id.

Bounded agent loop

flowchart LR
  P[Planner] --> E[Executor]
  E --> T{Tool?}
  T -->|read| R[Allow-listed APIs]
  T -->|write| H[Human approve]
  R --> V[Validator]
  H --> V
  V -->|ok| OUT[Response]
  V -->|fail| A[Abstain]

15 — RAG &amp; document pipelines (chunking, hybrid retrieval, PyMuPDF, vectors)

Ingest pipeline

  1. Upload → checksum → tenant tag → S3.
  2. Parse PDF (PyMuPDF): text + coordinates per page.
  3. Chunk: section-aware; overlap for boundary facts; table handling strategy.
  4. Embed → vector index with metadata (doc_id, version, ACL, page, bbox optional).
  5. Keyword index optional (BM25) for IDs and regulatory codes — hybrid RRF merge.

Retrieval at query time

  • Filter-first ACL → hybrid search → reranker (cross-encoder optional) → top-k context.
  • Compression: summarize long chunks before LLM if over token budget.
  • Citations: return chunk_id + page + source doc version in response schema.

ChromaDB / vector stores

Chroma for dev/small prod; at scale: OpenSearch, pgvector, Pinecone — principles identical: metadata filters + collection per env.

Your bank implementation

Policy PDFs/circulars; re-ingest on change; retrieval-time library filter by role; monitor empty-retrieval rate.

Notepad. Extraction adds: bounding box normalization, multi-page coordinate systems, rasterized page images for vision models.

16 — Good-to-have depth (Bedrock, Temporal, K8s, ML serving, SaaS)

Amazon Bedrock

  • IAM-based access; inference profiles; cross-region for DR.
  • Model choice: Claude for extraction JSON; Nova for cost-sensitive steps (if approved).

Temporal vs Airflow

  • Temporal: long-running workflows, signals (human approve), precise retries — best for extraction jobs.
  • Airflow: scheduled batch analytics, nightly re-index — not interactive latency.

Kubernetes / EKS / service mesh

  • Your OpenShift/Helm experience maps to EKS + Helm.
  • Istio/Linkerd: mTLS, traffic split for canaries — mention for enterprise evolution.

ML serving (when JD asks)

  • YOLO/object detection as separate GPU service — versioned model artifact in S3.
  • HuggingFace/PyTorch for custom rankers — batch inference off hot path.

High-scale SaaS

  • Multi-region active-passive; tenant pinning; strict SLAs; enterprise SSO + SCIM.
  • Noisy neighbor: per-tenant rate limits and fair queue scheduling.

17 — 45-minute system design script (extraction platform whiteboard)

Minute 0–5 — Clarify

  • Tenants? doc types? sync vs async SLA? human review? multi-region?
  • Non-goals: real-time sub-second on 200-page PDF; full autonomy without audit.

Minute 5–15 — High-level diagram

ALB → FastAPI → SQS → workers → (rasterize → detect → LLM extract → validate) → Postgres + S3; LLM gateway; OTel; Secrets Manager.

Minute 15–25 — Deep dives they pick

  • Multi-tenant: tenant_id + RLS + S3 prefix + queue fair scheduling.
  • Pipeline registry + idempotent jobs + DLQ.
  • LLM gateway failover + structured schema per field type.

Minute 25–35 — Agentic (only if asked)

Fixed pipeline v1; agent on low-confidence regions with 3 tool max; human review queue.

Minute 35–45 — Operate

SLOs, golden tests in CI, blue-green deploy, incident runbooks, cost per document metric.

Notepad. Close with tradeoff: correctness and auditability first, then cost via model routing, then latency via parallel page workers.

18 — Rapid-fire Q&amp;A bank

QuestionAnswer
Why RAG not fine-tune?Policies change; RAG + versioned docs auditable; fine-tune blurs lineage
SQLAlchemy sync vs async?Async API under load; one async engine per process
Schema per tenant?Only enterprise tier; default RLS + tenant_id
Exactly-once jobs?At-least-once queue + idempotent upserts — exactly-once end state
OpenAPI vs code-first?Spec-first for 25+ endpoints and mobile lagging clients
How prevent prompt injection?Untrusted retrieval; input filter; fail closed; no tool beyond allow-list
Biggest cost driver?LLM tokens on full doc — route, cache, chunk, parallelize cheap steps
How measure extraction accuracy?Golden docs; field-level F1; human eval sample; CI gate

19 — Day-before checklist

  • Draw architecture diagram 3× from memory (2 min each).
  • Speak 30s opener + 3 min project story out loud.
  • Recite RBAC four layers without notes.
  • Recite job state machine + idempotency + DLQ.
  • Name 5 CloudWatch alarms.
  • Prepare 2 STAR stories: incident + leadership conflict.
  • Sleep — familiarity beats cramming new frameworks.

Built from Optivalue Project-0 (sharpbyte.dev/about) + Principal Backend AI platform JD. Cross-links: Design guide, Regulated LLM banking, About.