Vector databases

A vector database stores dense embeddings alongside structured metadata and answers “which chunks are closest to this query vector?” using approximate nearest neighbor (ANN) search—fast enough for interactive RAG, semantic search, and recommendation at millions of rows. It is the retrieval layer between chunking and the LLM.

After reading, you should be able to: explain how embeddings, metadata, and ANN indexes fit together; compare Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus, Redis Vector, and OpenSearch k-NN; choose HNSW vs IVF vs flat; implement metadata pre-filtering and hybrid dense+BM25 with RRF; design multi-tenant isolation; upsert and query with pgvector in Python and Java; and operate indexes in production (rebuild, backup, monitoring).

developer platform architect Track 2 · RAG pgvector Pinecone Qdrant HNSW RRF

What is a vector database?

A vector database is specialized storage for high-dimensional float arrays (embeddings) with metadata columns and an ANN index so similarity search stays sub-linear instead of scanning every row. In RAG, each row is typically one chunk: vector + text + source, ACL, timestamps.

The three layers

  1. Embeddings — fixed-size vectors (e.g. 1536 floats from text-embedding-3-small) representing chunk meaning. Same model at index and query time; never mix models in one collection.
  2. Metadata — JSON or typed columns attached to each vector: source, document_type, user_id, updated_at, ACL flags. Used for pre-filtering before or during ANN.
  3. ANN search — graph or cluster structures (HNSW, IVF) that return top-K nearest neighbors in milliseconds at 100K–100M scale. Exact brute force is reserved for eval baselines and tiny corpora.

Vector DBs are not general OLTP databases—though pgvector blurs the line by living inside PostgreSQL. They optimize for read-heavy similarity queries with optional metadata predicates, not complex joins or transactional updates across many tables. Your application still owns chunk text, embedding jobs, and reranking; the store answers “nearest vectors under these filters.”

Vector DB vs relational DB vs search engine

System | Primary query | Strength | Weakness for RAG
Vector DB (Pinecone, Qdrant, pgvector+ANN) | Top-K by cosine / L2 / dot | Fast semantic retrieval, metadata filters | No BM25 alone; not source of truth for billing
Relational DB (Postgres without vectors) | SQL filters, joins, transactions | ACID, authZ at row level | No native similarity at scale
Search engine (OpenSearch, Elasticsearch) | BM25 keyword + optional k-NN | Hybrid lexical + dense in one cluster | Vector tuning less ergonomic than purpose-built DBs

End-to-end RAG retrieval path

  1. User question → embed query (cached).
  2. Apply metadata filters (tenant, date, document_type).
  3. ANN top-K (e.g. 20–50) from vector store.
  4. Optional rerank (cross-encoder) → top 5 for LLM context.

The vector store sits between your chunking pipeline and retrieval strategies (reranking, query expansion, refusal thresholds). Embeddings fundamentals live in Track 1 · Embeddings & semantic search.

🔬 Under the Hood

Most vector stores persist vectors as float32 arrays (6 KB per 1536-dim vector) plus B-tree or inverted indexes on metadata. ANN structures add 50–200% memory overhead for HNSW graph edges. Inserts update the graph incrementally (Qdrant, Pinecone) or defer index build until batch load completes (pgvector CREATE INDEX CONCURRENTLY).

⚠️ Pitfall

Treating a vector DB as the only database. Store authoritative document metadata and ACLs in Postgres; mirror retrieval fields into the vector index. If the vector store is wiped, you must be able to rebuild from object storage + embedding job without losing business data.

⚙ Config

Pin these at collection creation: embedding_model, dimensions, distance_metric (cosine / L2 / dot), index_type (HNSW params). Version the collection name when any of these change (kb_v3_1536_cosine).
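
As a minimal sketch of the versioning rule above, the pinned settings can live in one immutable object that derives the collection name. The class name CollectionConfig and its fields are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectionConfig:
    """Settings that must stay fixed for the lifetime of one collection."""
    base_name: str
    version: int
    embedding_model: str
    dimensions: int
    distance_metric: str  # "cosine", "l2", or "dot"

    def collection_name(self) -> str:
        # Encode the settings that make old and new vectors incomparable.
        return f"{self.base_name}_v{self.version}_{self.dimensions}_{self.distance_metric}"


cfg = CollectionConfig("kb", 3, "text-embedding-3-small", 1536, "cosine")
print(cfg.collection_name())  # kb_v3_1536_cosine
```

Because the dataclass is frozen, changing a setting means constructing a new config, which in turn yields a new collection name.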

Comparing vector stores

No single winner—choice depends on scale, team ops capacity, existing stack, hybrid search needs, and multi-tenant isolation model. Below is a practitioner’s matrix for RAG workloads in 2025–2026.

At-a-glance matrix

Store | Deployment | ANN indexes | Metadata filter | Hybrid BM25 | Multi-tenant | Best for
Pinecone | Managed SaaS | Proprietary (pod/serverless) | Strong metadata filters | Sparse-dense (integrated) | Namespace per tenant | Fastest path to prod, minimal ops
Weaviate | Managed + self-hosted | HNSW, flat | GraphQL / where filters | Native BM25 + vector fusion | Class + tenant shard | Hybrid search in one OSS stack
Qdrant | Managed + self-hosted | HNSW | Payload indexes (fast) | Sparse vectors + RRF | Collection or payload filter | Performance, filtering, Rust efficiency
Chroma | Embedded / server | HNSW (via DuckDB/SQLite backends) | where dict filters | Limited native; pair with ES | Collection per tenant (small scale) | Prototypes, local dev, hello-RAG
pgvector | Self-hosted (Postgres) | HNSW, IVFFlat | SQL WHERE (B-tree) | Via pg_bm25 / tsvector + RRF in app | Row-level tenant_id | Teams already on Postgres, unified stack
Milvus | Managed (Zilliz) + self-hosted | HNSW, IVF, DiskANN | Scalar filtering | Sparse + dense (2.4+) | Database / partition | Billion-scale, GPU options
Redis Vector | Self-hosted / Redis Cloud | HNSW (RediSearch) | Tag / numeric filters | RediSearch full-text + vector | Key prefix / index per tenant | Low-latency cache-adjacent retrieval
OpenSearch k-NN | Managed (AWS) + self-hosted | HNSW, IVF, Faiss engines | Query DSL filters | Native BM25 + k-NN hybrid | Index alias per tenant | AWS-native, existing ES/OpenSearch shops

Operational and cost dimensions

Store | Typical latency p99 | Ops burden | Pricing model | Backup story
Pinecone | 10–50 ms (serverless) | Low | Per vector dimension-month + queries | Managed snapshots; export via API
Weaviate Cloud | 20–80 ms | Low–medium | Cluster size + objects | Cloud backups; K8s Velero self-hosted
Qdrant Cloud | 5–30 ms | Low–medium | RAM + storage tiers | Snapshots, S3 backup hooks
Chroma | 5–100 ms (local) | Dev: none; prod: you run it | OSS free; hosting cost only | Filesystem / volume snapshots
pgvector | 10–100 ms (tuned HNSW) | Medium (DBA skills) | Postgres infra you already pay for | pg_dump, WAL, PITR standard
Milvus / Zilliz | 10–50 ms at scale | Medium–high self-hosted | Zilliz cloud or K8s cost | MinIO / S3 segment backup
Redis Vector | <10 ms (in-memory) | Medium (Redis ops) | Memory-bound | RDB/AOF; not primary archive
OpenSearch k-NN | 20–150 ms | Medium–high (cluster) | Instance hours + storage | Automated snapshots (AWS)

Decision heuristics

  • Already on Postgres, <5M vectors: start with pgvector + HNSW—one backup pipeline, SQL filters, no new vendor.
  • Need hybrid BM25+dense without glue code: Weaviate, Qdrant sparse, OpenSearch, or Redis Stack.
  • Zero ops, fast launch: Pinecone serverless or Qdrant Cloud; accept vendor cost until unit economics clear.
  • Billion vectors, ML platform team: Milvus / Zilliz with DiskANN.
  • Sub-10ms on hot cache: Redis Vector for session-scoped retrieval; Postgres/pgvector for durable corpus.
  • Laptop / CI / tutorial: Chroma embedded—migrate before multi-tenant prod.
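
Several options above fuse a BM25 ranking with a dense ranking using Reciprocal Rank Fusion (RRF). The following is a minimal sketch of RRF in application code, assuming each retriever returns an ordered list of chunk IDs. The function name rrf_fuse and the sample IDs are illustrative; k = 60 is the constant conventionally used with RRF.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 10) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_n]


dense_ranking = ["c3", "c1", "c7"]  # hypothetical dense (ANN) result order
bm25_ranking = ["c1", "c9", "c3"]   # hypothetical BM25 result order
fused = rrf_fuse([dense_ranking, bm25_ranking])
print([doc_id for doc_id, _ in fused])  # ['c1', 'c3', 'c9', 'c7']
```

RRF uses only ranks, so the dense and BM25 scores never need to be put on a common scale.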
⚖️ Trade-off

Managed vector DBs trade dollars for engineer time. pgvector trades query latency tuning and index rebuild complexity for stack simplicity. Running OpenSearch only for k-NN when you already have Postgres often doubles operational surface—justify it with hybrid search requirements or AWS mandate.

💰 Cost

Rule of thumb: 1M × 1536-dim float32 vectors ≈ 6 GB raw + 50–100% index overhead. Pinecone serverless bills storage + read units; self-hosted pgvector bills RDS storage and RAM for shared_buffers + HNSW. Model embedding cost usually exceeds store cost until 10M+ vectors—optimize re-index frequency first.
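
The rule of thumb above is plain arithmetic and can be checked with a short sketch. The function name estimate_index_gb is hypothetical, and sizes use decimal gigabytes (1 GB = 10^9 bytes).

```python
def estimate_index_gb(num_vectors: int, dims: int, overhead: float,
                      bytes_per_float: int = 4) -> float:
    """Raw float32 vector bytes plus a fractional index overhead, in decimal GB."""
    raw_bytes = num_vectors * dims * bytes_per_float
    return raw_bytes * (1.0 + overhead) / 1e9


raw_gb = estimate_index_gb(1_000_000, 1536, overhead=0.0)   # about 6.1 GB raw
low_gb = estimate_index_gb(1_000_000, 1536, overhead=0.5)   # about 9.2 GB
high_gb = estimate_index_gb(1_000_000, 1536, overhead=1.0)  # about 12.3 GB
print(round(raw_gb, 2), round(low_gb, 2), round(high_gb, 2))
```

The estimate covers vectors and index only; metadata, text columns, and working memory for queries come on top.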

📦 Real World

Many Series B SaaS RAG products run pgvector on RDS until ~3–5M active chunks, then split hot tenant indexes to Qdrant or Pinecone namespaces while Postgres remains source of truth. Notion-scale teams invest in custom sharding; your first production index does not need that on day one.

🎯 Interview Tip

“Why pgvector over Pinecone?” — Answer with team skills, existing Postgres HA, transactional updates with app data, and acceptable QPS. Follow-up: “When would you migrate?” — ANN p99 > SLO, HNSW rebuild windows too long, or need managed hybrid sparse+dense without operating OpenSearch.

ANN index types: HNSW, IVF, Flat

The index structure determines recall, memory footprint, build time, and insert behavior. Every vector store exposes one or more of these; tuning parameters is how you hit p99 latency without sacrificing RAG recall.

Flat (exact / brute force)

No auxiliary structure—compare the query to every vector (or every row passing metadata pre-filter). Recall: 100%. Cost: O(n × d) per query. Use for golden-set eval (“what is true top-10?”), corpora under ~10K vectors, and CI regression tests that compare ANN vs exact rankings.
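
A minimal pure-Python sketch of flat search with an optional metadata pre-filter follows, mainly useful as an eval baseline. The row layout (dicts with id, tenant_id, embedding) and the function names are assumptions for illustration; a real baseline would typically use NumPy or the database's exact scan.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def flat_search(query, rows, k, predicate=None):
    """Exact top-k: score every row that passes the optional metadata predicate."""
    candidates = [r for r in rows if predicate is None or predicate(r)]
    ranked = sorted(candidates, key=lambda r: cosine_distance(query, r["embedding"]))
    return [r["id"] for r in ranked[:k]]


rows = [
    {"id": "a", "tenant_id": "t1", "embedding": [1.0, 0.0]},
    {"id": "b", "tenant_id": "t1", "embedding": [0.8, 0.6]},
    {"id": "c", "tenant_id": "t2", "embedding": [0.9, 0.1]},
    {"id": "d", "tenant_id": "t1", "embedding": [0.0, 1.0]},
]
query = [1.0, 0.0]

print(flat_search(query, rows, k=2))                                      # ['a', 'c']
print(flat_search(query, rows, k=2, predicate=lambda r: r["tenant_id"] == "t1"))  # ['a', 'b']
```

Every query touches every candidate row, which is exactly the O(n × d) cost that ANN indexes exist to avoid.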

HNSW (Hierarchical Navigable Small World)

Multi-layer proximity graph: greedy beam search from entry points, refine on lower layers. Default for pgvector, Qdrant, Weaviate, Redis HNSW, OpenSearch nmslib/faiss HNSW.

  • Build: moderate; no separate training phase (unlike IVF).
  • Query params: ef_search (pgvector hnsw.ef_search) — higher → better recall, slower queries.
  • Insert: online but slower than IVF; heavy bulk load may defer index creation.
  • Memory: high—stores neighbor lists per vector.

IVF (Inverted File / IVFFlat)

k-means partitions the vector space into lists (cells). Each vector is assigned to one cell; a query scans only the probes nearest cells, meaning the cells whose centroids are closest to the query vector. pgvector ivfflat, Milvus IVF, and Faiss IVF are common at very large scale under memory pressure.

  • Build: requires training pass on representative sample—skewed samples hurt recall.
  • Query params: probes — higher → better recall, more cells scanned.
  • Insert: assign to nearest centroid—fast bulk ingest.
  • Memory: lower than HNSW at same n; pairs with PQ compression at billion scale.
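
To make the effect of probes concrete, here is a toy IVF sketch. It assumes the centroids are already given (the k-means training pass is skipped) and uses L2 distance; the class ToyIVF is illustrative and not how pgvector or Faiss implement IVF internally.

```python
import math


def l2(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class ToyIVF:
    def __init__(self, centroids: list[list[float]]):
        self.centroids = centroids
        self.cells = {i: [] for i in range(len(centroids))}

    def _nearest_cells(self, vec, n):
        order = sorted(range(len(self.centroids)), key=lambda i: l2(vec, self.centroids[i]))
        return order[:n]

    def add(self, vec_id: str, vec: list[float]) -> None:
        # Each vector lives in exactly one cell: the one with the nearest centroid.
        self.cells[self._nearest_cells(vec, 1)[0]].append((vec_id, vec))

    def search(self, query, k: int, probes: int) -> list[str]:
        # Only the `probes` nearest cells are scanned; other cells are never seen.
        candidates = []
        for cell in self._nearest_cells(query, probes):
            candidates.extend(self.cells[cell])
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [vec_id for vec_id, _ in candidates[:k]]


index = ToyIVF(centroids=[[0.0, 0.0], [10.0, 0.0]])
index.add("a", [4.0, 0.0])  # lands in cell 0
index.add("b", [6.0, 0.0])  # lands in cell 1
index.add("c", [1.0, 0.0])  # lands in cell 0

query = [5.2, 0.0]
print(index.search(query, k=2, probes=1))  # ['b']: 'a' sits just across the cell boundary
print(index.search(query, k=2, probes=2))  # ['b', 'a']
```

With probes=1 the second-nearest vector is missed because it sits in a neighboring cell; raising probes recovers it at the cost of scanning more vectors.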

Recall / memory / latency trade-offs

Index | Recall @10 (typical tuned) | Memory overhead | Query p99 (1M × 1536) | Bulk load | When to choose
Flat | 100% | None | 200–2000 ms (CPU) | Instant | Eval baseline, <10K vectors
HNSW (ef_search=64) | 92–96% | +50–120% vector RAM | 5–25 ms | Build after load or incremental | Default production RAG
HNSW (ef_search=128) | 96–99% | Same index size | 10–40 ms | Same | When eval shows recall gaps
IVFFlat (probes=10) | 85–93% | +10–30% | 8–35 ms | Fast after train | Memory constrained, batch-heavy
IVF + PQ | 80–92% | Compressed (4–16×) | 5–30 ms | Train + encode | 100M+ vectors, approximate OK

Measuring recall correctly

  1. Build labeled set: query text → relevant chunk IDs (50–200 questions from support logs).
  2. Run exact flat search → ground-truth top-10 IDs.
  3. Run ANN with candidate params → compute recall@K = |ANN ∩ truth| / K.
  4. Plot recall vs p99 latency; pick smallest ef_search / probes that clears 95% recall@10.
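
The recall@K formula in step 3 can be sketched in a few lines. The function names and sample IDs are hypothetical; the inputs are the ANN result IDs and the exact (flat) result IDs for the same query.

```python
def recall_at_k(ann_ids: list[str], truth_ids: list[str], k: int) -> float:
    """recall@K = |ANN top-K ∩ exact top-K| / K for a single query."""
    return len(set(ann_ids[:k]) & set(truth_ids[:k])) / k


def mean_recall_at_k(pairs: list[tuple[list[str], list[str]]], k: int) -> float:
    """Average recall@K over (ann_ids, truth_ids) pairs, one pair per query."""
    return sum(recall_at_k(ann, truth, k) for ann, truth in pairs) / len(pairs)


truth_q1 = ["c1", "c2", "c3", "c4", "c5"]
ann_q1 = ["c1", "c3", "c2", "c9", "c5"]   # misses c4
truth_q2 = ["d1", "d2", "d3", "d4", "d5"]
ann_q2 = ["d2", "d1", "d3", "d5", "d4"]   # same set, different order

print(recall_at_k(ann_q1, truth_q1, k=5))                                   # 0.8
print(mean_recall_at_k([(ann_q1, truth_q1), (ann_q2, truth_q2)], k=5))      # about 0.9
```

Recall@K ignores order within the top K, so a reordered but complete result still scores 1.0.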
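pgvector HNSW index DDL
# Run after bulk COPY into document_chunks, not row-by-row during ingest.
# CREATE INDEX CONCURRENTLY cannot run inside a transaction block,
# so execute it on an autocommit connection.
HNSW_DDL = """
CREATE INDEX CONCURRENTLY idx_chunks_embedding_hnsw
  ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
"""
# Session tuning per query (optional); default is 40, raise for recall
SET_DDL = "SET hnsw.ef_search = 100;"

// Flyway / Liquibase migration snippet.
// Mark this migration as non-transactional: CREATE INDEX CONCURRENTLY
// fails inside a transaction block.
/*
CREATE INDEX CONCURRENTLY idx_chunks_embedding_hnsw
  ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
*/
// SET LOCAL lasts only for the current transaction, so call this from within
// the transaction that runs the search. A plain SET would stick to whichever
// pooled connection happened to execute it.
@Transactional(readOnly = true)
public void tuneHnsw(JdbcTemplate jdbc) {
  jdbc.execute("SET LOCAL hnsw.ef_search = 100");
}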
💡 Pro Tip

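After bulk ingest into pgvector, run VACUUM ANALYZE, then create the HNSW index with CREATE INDEX CONCURRENTLY off-peak. Building the index once over loaded data is faster than maintaining it row by row through a large streaming load, which slows ingest and competes with queries for CPU and memory.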

📐 IaC

Encode index type and params in Terraform/Helm values—not manual psql on prod. Example: vector_index: { type: hnsw, m: 16, ef_construction: 64, ef_search: 100 } applied via migration job in deploy pipeline.

Metadata filtering

Pure vector search over the entire corpus returns chunks from wrong tenants, stale policies, or unrelated document types. Pre-filtering restricts ANN to rows matching structured predicates—before or during graph traversal depending on the engine.

Common metadata fields in RAG

Field | Type | Example use | Index hint
tenant_id / org_id | UUID / string | SaaS isolation: only search customer A’s docs | B-tree / payload keyword index
user_id | UUID | Personal notes vs shared workspace | Composite (tenant_id, user_id)
source | string | Filter to confluence vs slack | Keyword index
document_type | enum | Policies vs API reference vs tickets | Enum / keyword
updated_at | timestamp | Prefer docs newer than policy effective date | B-tree range
acl_group | string[] | Intersect with caller’s groups | GIN array (Postgres)

Pre-filter vs post-filter

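  • Pre-filter (preferred): ANN runs only on IDs matching tenant_id = X AND document_type = 'policy'. Qdrant payload indexes and Pinecone metadata filters work this way. In pgvector, a SQL WHERE clause is a true pre-filter only when the planner uses the metadata index and sorts the surviving rows exactly; when it uses the HNSW or IVFFlat index, the filter is applied after the index scan, so a selective filter can return fewer than K rows unless you raise hnsw.ef_search, enable iterative index scans (pgvector 0.8.0+), or use partial or partitioned indexes.
  • Post-filter: ANN returns the global top-100, then rows failing the filter are dropped. This is dangerous when the filter is selective (you may get <K results). Either over-fetch (top 50 → filter → 5) or use a pre-filter.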
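
The shortfall described above is easy to reproduce on toy data. This sketch uses exact search in place of ANN and one-dimensional vectors so the ordering is obvious; the function names and rows are illustrative assumptions.

```python
def distance(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def top_k(query, rows, k):
    return sorted(rows, key=lambda r: distance(query, r["embedding"]))[:k]


def post_filter_search(query, rows, tenant_id, k, fetch):
    # Search everything first, then drop rows from other tenants.
    nearest = top_k(query, rows, fetch)
    return [r["id"] for r in nearest if r["tenant_id"] == tenant_id][:k]


def pre_filter_search(query, rows, tenant_id, k):
    # Restrict the candidate set first, then search only within it.
    allowed = [r for r in rows if r["tenant_id"] == tenant_id]
    return [r["id"] for r in top_k(query, allowed, k)]


rows = [
    {"id": "b1", "tenant_id": "org_b", "embedding": [0.1]},
    {"id": "b2", "tenant_id": "org_b", "embedding": [0.2]},
    {"id": "b3", "tenant_id": "org_b", "embedding": [0.3]},
    {"id": "a1", "tenant_id": "org_a", "embedding": [0.4]},
    {"id": "a2", "tenant_id": "org_a", "embedding": [0.9]},
]
query = [0.0]

print(post_filter_search(query, rows, "org_a", k=2, fetch=3))  # []: all 3 fetched rows belong to org_b
print(post_filter_search(query, rows, "org_a", k=2, fetch=5))  # ['a1', 'a2'] after over-fetching
print(pre_filter_search(query, rows, "org_a", k=2))            # ['a1', 'a2']
```

Over-fetching only helps if the fetch size is large enough for the filter's selectivity, which is hard to guarantee per tenant; pre-filtering does not have that failure mode.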

Example filter expressions

Filtered vector query patterns
# Qdrant: a payload index on tenant_id is recommended
from datetime import datetime, timezone

from qdrant_client import QdrantClient
from qdrant_client.models import DatetimeRange, FieldCondition, Filter, MatchValue

client = QdrantClient(url="http://localhost:6333")  # adjust to your deployment

flt = Filter(must=[
    FieldCondition(key="tenant_id", match=MatchValue(value="org_abc")),
    FieldCondition(key="document_type", match=MatchValue(value="policy")),
    # Range takes numbers only; timestamp bounds need DatetimeRange
    FieldCondition(
        key="updated_at",
        range=DatetimeRange(gte=datetime(2024, 1, 1, tzinfo=timezone.utc)),
    ),
])
# query_emb: the query embedding (list[float]) from the collection's model.
# Recent qdrant-client releases steer toward query_points instead of search.
hits = client.search(
    collection_name="kb",
    query_vector=query_emb,
    query_filter=flt,
    limit=20,
)

# Pinecone: metadata dict filter
# index.query(vector=q, top_k=20, filter={"tenant_id": "org_abc", "document_type": "policy"})

// pgvector + Spring JDBC: filter in SQL
String sql = """
  SELECT id, content,
         1 - (embedding <=> ?::vector) AS score
  FROM document_chunks
  WHERE tenant_id = ?
    AND document_type = ?
    AND updated_at >= ?::timestamptz
  ORDER BY embedding <=> ?::vector
  LIMIT ?
  """;
// A B-tree on (tenant_id, document_type, updated_at) keeps the filter cheap to evaluate
Filter selectivity and recall

If only 0.1% of vectors pass the filter, HNSW may not traverse enough candidates—some engines build per-tenant partitions or namespace indexes. Symptoms: correct chunk exists but never appears in top-K. Fix: dedicated collection/namespace per large tenant, raise ef_search, or partition index by tenant.
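
A rough back-of-envelope model shows why selective filters starve top-K. It assumes matching rows are spread uniformly among the candidates the index visits, which real graphs only approximate; the function names are hypothetical.

```python
import math


def expected_matches(candidates_scanned: int, matching_rows: int, total_rows: int) -> float:
    """Expected number of scanned candidates that pass the filter."""
    return candidates_scanned * matching_rows / total_rows


def candidates_needed(k: int, matching_rows: int, total_rows: int) -> int:
    """Candidates to scan so that about k of them pass the filter on average."""
    return math.ceil(k * total_rows / matching_rows)


# 0.1% selectivity: 1,000 matching rows in a 1,000,000-row index
print(expected_matches(40, 1_000, 1_000_000))   # about 0.04 matches from 40 candidates
print(candidates_needed(10, 1_000, 1_000_000))  # 10000 candidates for roughly 10 matches
```

At that point scanning the tenant's 1,000 rows exactly is cheaper than pushing the ANN search wider, which is the argument for per-tenant partitions or collections.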

🔒 Security

Never rely on post-filter alone for authZ—attackers craft queries that surface other tenants’ chunks in ANN results before filter drops them (information leakage in logs). Enforce tenant_id in the query API from the session JWT; reject client-supplied tenant overrides. Mirror ACL checks from Postgres if vector metadata could be stale.

⚠️ Pitfall

Storing only source_url without normalized document_type forces fuzzy string filters. Normalize metadata at ingest—enum fields indexed at ingest time beat regex post-filters on URLs.

Multi-tenancy patterns

B2B RAG must guarantee tenant A never retrieves tenant B’s chunks. Two dominant patterns: namespace per tenant (physical isolation) vs shared index + metadata filter (logical isolation).

Namespace / collection per tenant

  • Each tenant gets Pinecone namespace, Qdrant collection, or Chroma collection.
  • Pros: strong isolation, independent index tuning, easy delete-tenant (drop namespace), no cross-tenant ANN leakage.
  • Cons: operational explosion at 10K+ tenants; cold namespaces; harder cross-tenant analytics; more index memory overhead.

Shared index + metadata filter

  • Single collection; every row has tenant_id; every query includes mandatory filter.
  • Pros: one index to manage, efficient for many small tenants, simpler backup.
  • Cons: filter selectivity issues for huge tenants; blast radius if filter bug; noisy neighbor on shared RAM.

Hybrid tiering (common at scale)

Tenant tier | Pattern | Rationale
Free / small (<10K chunks) | Shared index + tenant_id filter | Cost efficient
Enterprise (>500K chunks) | Dedicated namespace or collection | Recall + performance isolation
Regulated (HIPAA, SOC2) | Dedicated cluster or DB schema | Audit boundary, encryption per tenant
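
The tiering table can be expressed as a small routing function. This is a sketch under one loud assumption: the table does not say what to do between 10K and 500K chunks, so this version keeps those tenants on the shared index until they cross 500K. The function name and return labels are hypothetical.

```python
def isolation_pattern(chunk_count: int, regulated: bool = False) -> str:
    """Pick an isolation pattern for a tenant, following the tiering table."""
    if regulated:
        # Compliance requirements override size-based tiering.
        return "dedicated_cluster"
    if chunk_count > 500_000:
        return "dedicated_namespace"
    # ASSUMPTION: mid-size tenants (10K to 500K chunks) stay on the shared index.
    return "shared_index"


print(isolation_pattern(5_000))                  # shared_index
print(isolation_pattern(800_000))                # dedicated_namespace
print(isolation_pattern(5_000, regulated=True))  # dedicated_cluster
```

Keeping the decision in one function makes tier promotion an explicit, testable event rather than an ad hoc migration.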

Tenant lifecycle

  1. Provision — create namespace or verify filter path; seed empty index.
  2. Ingest — embed jobs tag every vector with tenant_id from auth context—not from client body.
  3. Query — middleware injects tenant from JWT; integration tests assert cross-tenant negative cases.
  4. Offboard — delete namespace or DELETE WHERE tenant_id = ?; verify object storage purge.
Mandatory tenant scoping in API layer
def search_kb(query: str, ctx: RequestContext) -> list[ChunkHit]:
    tenant_id = ctx.jwt["org_id"]  # never from request body
    embedding = embed(query)
    return vector_store.search(
        vector=embedding,
        filter={"tenant_id": tenant_id},
        namespace=tenant_id if ctx.tier == "enterprise" else None,
        top_k=20,
    )

public List<ChunkHit> searchKb(String query, AuthContext ctx) {
  String tenantId = ctx.orgId(); // from verified JWT
  float[] embedding = embedder.embed(query);
  return vectorStore.search(SearchRequest.builder()
      .vector(embedding)
      .filter(Map.of("tenant_id", tenantId))
      .namespace(ctx.isEnterprise() ? tenantId : null)
      .topK(20)
      .build());
}
🔒 Security

Add integration tests: user of tenant A queries with forged tenant B filter in body → must ignore body, use JWT only. Log and alert on queries returning zero results when corpus should exist—may indicate filter misconfiguration rather than bad retrieval.
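
The forged-tenant test described above might look like the following sketch. It uses an in-memory stand-in for the vector store and skips the embedding step entirely, so it only exercises tenant scoping; RequestContext, InMemoryStore, and search_kb here are simplified, hypothetical versions of the real components.

```python
from dataclasses import dataclass, field


@dataclass
class RequestContext:
    jwt: dict                                   # claims from the verified token
    body: dict = field(default_factory=dict)    # untrusted client input


class InMemoryStore:
    """Stand-in for a vector store: applies the metadata filter, no ANN."""

    def __init__(self, rows: list[dict]):
        self.rows = rows

    def search(self, filter: dict, top_k: int) -> list[dict]:
        matches = [r for r in self.rows
                   if all(r.get(key) == value for key, value in filter.items())]
        return matches[:top_k]


def search_kb(store: InMemoryStore, ctx: RequestContext, top_k: int = 20) -> list[dict]:
    tenant_id = ctx.jwt["org_id"]  # the request body is never consulted
    return store.search(filter={"tenant_id": tenant_id}, top_k=top_k)


def test_forged_tenant_in_body_is_ignored() -> bool:
    store = InMemoryStore([
        {"id": "a1", "tenant_id": "org_a"},
        {"id": "b1", "tenant_id": "org_b"},
    ])
    ctx = RequestContext(jwt={"org_id": "org_a"}, body={"tenant_id": "org_b"})
    hits = search_kb(store, ctx)
    assert hits, "tenant A should see its own chunks"
    assert all(h["tenant_id"] == "org_a" for h in hits), "cross-tenant leak"
    return True


test_forged_tenant_in_body_is_ignored()
```

In a real suite the same assertion would run against the actual API layer and store, with the forged tenant supplied through every client-controlled input (body, query string, headers).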

🎯 Interview Tip

“Design multi-tenant RAG for 5,000 customers.” — Shared pgvector with tenant_id for long tail; dedicated Qdrant collection for top 20 accounts by chunk volume; central embedding gateway; per-tenant rate limits; offboarding runbook. Mention compliance tier with separate RDS instance.

pgvector: upsert and filtered search

pgvector keeps vectors beside relational data—ideal when you already run Postgres for users, documents, and ACLs. Below: schema, idempotent upsert on content hash, and filtered cosine search in Python and Java.

Schema

DDL — document_chunks table
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
  id            BIGSERIAL PRIMARY KEY,
  chunk_id      TEXT NOT NULL,
  tenant_id     TEXT NOT NULL,
  source        TEXT NOT NULL,
  document_type TEXT NOT NULL,
  user_id       TEXT,
  content       TEXT NOT NULL,
  content_hash  TEXT NOT NULL,
  embedding     vector(1536) NOT NULL,
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (tenant_id, chunk_id)
);

CREATE INDEX idx_chunks_tenant_type ON document_chunks (tenant_id, document_type);
CREATE INDEX idx_chunks_updated ON document_chunks (updated_at DESC);
"""

/*
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
  id BIGSERIAL PRIMARY KEY,
  chunk_id TEXT NOT NULL,
  tenant_id TEXT NOT NULL,
  source TEXT NOT NULL,
  document_type TEXT NOT NULL,
  user_id TEXT,
  content TEXT NOT NULL,
  content_hash TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (tenant_id, chunk_id)
);
CREATE INDEX idx_chunks_tenant_type ON document_chunks (tenant_id, document_type);
*/

Upsert (idempotent on tenant + chunk_id)

Upsert chunks with embeddings
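import hashlib
import psycopg
from psycopg.rows import tuple_row
from openai import OpenAI

client = OpenAI()
MODEL = "text-embedding-3-small"

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def embed(text: str) -> list[float]:
    return client.embeddings.create(model=MODEL, input=text).data[0].embedding

def stored_hash(conn: psycopg.Connection, tenant_id: str, chunk_id: str) -> str | None:
    # Hash currently stored for this chunk, or None if the chunk is new
    with conn.cursor(row_factory=tuple_row) as cur:
        found = cur.execute(
            "SELECT content_hash FROM document_chunks"
            " WHERE tenant_id = %s AND chunk_id = %s",
            (tenant_id, chunk_id),
        ).fetchone()
    return found[0] if found else None

def upsert_chunk(conn: psycopg.Connection, row: dict) -> None:
    h = content_hash(row["content"])
    # Compare against the hash in the database, not a value supplied by the caller
    if stored_hash(conn, row["tenant_id"], row["chunk_id"]) == h:
        return  # unchanged content: skip the embedding API call
    vec = embed(row["content"])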
    conn.execute(
        """
        INSERT INTO document_chunks
          (chunk_id, tenant_id, source, document_type, user_id,
           content, content_hash, embedding, updated_at)
        VALUES (%(chunk_id)s, %(tenant_id)s, %(source)s, %(document_type)s,
                %(user_id)s, %(content)s, %(hash)s, %(emb)s, now())
        ON CONFLICT (tenant_id, chunk_id) DO UPDATE SET
          content = EXCLUDED.content,
          content_hash = EXCLUDED.content_hash,
          embedding = EXCLUDED.embedding,
          source = EXCLUDED.source,
          document_type = EXCLUDED.document_type,
          updated_at = now()
        WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash
        """,
        {**row, "hash": h, "emb": vec},
    )

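# Batch ingest: upsert many rows, then CREATE INDEX CONCURRENTLY once

@Repository
public class ChunkRepository {
  private final JdbcTemplate jdbc;
  private final EmbeddingModel embedder;

  // Constructor injection: final fields must be assigned here to compile
  public ChunkRepository(JdbcTemplate jdbc, EmbeddingModel embedder) {
    this.jdbc = jdbc;
    this.embedder = embedder;
  }

  // sha256(...) and toPgVector(...) are small helpers not shown here
  public void upsertChunk(ChunkRow row) {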
    String hash = sha256(row.content());
    Optional<String> existing = jdbc.query(
        "SELECT content_hash FROM document_chunks WHERE tenant_id=? AND chunk_id=?",
        rs -> rs.next() ? Optional.of(rs.getString(1)) : Optional.empty(),
        row.tenantId(), row.chunkId());
    if (existing.isPresent() && existing.get().equals(hash)) return;

    float[] vec = embedder.embed(row.content());
    String vecLiteral = toPgVector(vec);
    jdbc.update("""
        INSERT INTO document_chunks
          (chunk_id, tenant_id, source, document_type, user_id,
           content, content_hash, embedding, updated_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?::vector, now())
        ON CONFLICT (tenant_id, chunk_id) DO UPDATE SET
          content = EXCLUDED.content,
          content_hash = EXCLUDED.content_hash,
          embedding = EXCLUDED.embedding,
          source = EXCLUDED.source,
          document_type = EXCLUDED.document_type,
          updated_at = now()
        WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash
        """,
        row.chunkId(), row.tenantId(), row.source(), row.documentType(),
        row.userId(), row.content(), hash, vecLiteral);
  }
}

Filtered similarity search

Cosine search with metadata pre-filter
from psycopg.rows import dict_row

def search(
    conn,
    query: str,
    tenant_id: str,
    document_type: str | None = None,
    since: str | None = None,
    k: int = 10,
) -> list[dict]:
    q_vec = embed(query)
    sql = """
      SELECT chunk_id, content, source,
             1 - (embedding <=> %(q)s::vector) AS score
      FROM document_chunks
      WHERE tenant_id = %(tenant)s
    """
    params: dict = {"q": q_vec, "tenant": tenant_id, "k": k}
    if document_type:
        sql += " AND document_type = %(dtype)s"
        params["dtype"] = document_type
    if since:
        sql += " AND updated_at >= %(since)s::timestamptz"
        params["since"] = since
    sql += " ORDER BY embedding <=> %(q)s::vector LIMIT %(k)s"
    # dict_row makes fetchall() return dicts, matching the declared return type;
    # psycopg's default row factory returns tuples
    with conn.cursor(row_factory=dict_row) as cur:
        return cur.execute(sql, params).fetchall()

public List<ChunkHit> search(
    String query, String tenantId, String documentType, Instant since, int k) {
  float[] qVec = embedder.embed(query);
  String vec = toPgVector(qVec);
  StringBuilder sql = new StringBuilder("""
      SELECT chunk_id, content, source,
             1 - (embedding <=> ?::vector) AS score
      FROM document_chunks
      WHERE tenant_id = ?
      """);
  List<Object> args = new ArrayList<>(List.of(vec, tenantId));
  if (documentType != null) {
    sql.append(" AND document_type = ?");
    args.add(documentType);
  }
  if (since != null) {
    sql.append(" AND updated_at >= ?");
    args.add(Timestamp.from(since));
  }
  sql.append(" ORDER BY embedding <=> ?::vector LIMIT ?");
  args.add(vec);
  args.add(k);
  return jdbc.query(sql.toString(), chunkMapper, args.toArray());
}
💡 Pro Tip

Use <=> (cosine distance) with a vector_cosine_ops HNSW index, and report score as 1 - distance for UI display. Keep raw distance in logs for drift detection.

🔬 Under the Hood

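At 1536 dimensions a vector is about 6 KB, which exceeds PostgreSQL's roughly 2 KB TOAST threshold, so pgvector stores it out of line in the table's TOAST relation rather than inline in the heap page. An update that changes the embedding writes a new row version and a new TOAST value. Batch upserts inside transactions; avoid per-row commits during a million-chunk backfill. Keep pool size × concurrent queries within Postgres max_connections, and leave enough RAM for the HNSW graph.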

What this looks like in production

A vector index is living infrastructure—it grows with every document upload, model upgrade, and tenant onboard. Plan index rebuilds, backups, and monitoring before the first customer connects their Google Drive.

Index rebuild strategy

  1. Model change: new collection/table suffix (kb_v4); embed full corpus; build HNSW; alias swap; retire old index after soak.
  2. Bulk re-ingest: drop the HNSW index before the load (it cannot simply be switched off); COPY or batch INSERT; VACUUM ANALYZE; CREATE INDEX CONCURRENTLY.
  3. Incremental: upsert changed chunks only (content_hash gate); periodic index maintenance (REINDEX INDEX CONCURRENTLY for pgvector; Qdrant optimize).
  4. Blue/green: dual-write to the staging (green) index; validate recall@K; flip read traffic; delete the old (blue) index after 24h.

Backup and disaster recovery

  • pgvector: standard Postgres PITR + nightly pg_dump; test restore quarterly.
  • Managed (Pinecone/Qdrant): enable vendor snapshots; export vectors + metadata to S3 Parquet for vendor exit.
  • Rebuild path: source docs in object storage + chunk manifest + embedding model version → full rebuild without vector backup (slow but sufficient).
  • RPO/RTO: vectors are regenerable if corpus text exists—prioritize document backup over vector snapshot if storage cost bites.

Monitoring dashboard

Metric | Alert if | Why it matters
vector_search_p99_ms | > SLO (e.g. 100ms) | User-facing RAG latency
ann_recall_at_10 (weekly eval) | Drop > 2 pts vs baseline | Silent index degradation
index_vector_count vs CMS doc count | Drift > 5% | Ingest pipeline stuck
embedding_model_version mix | >1 version in active index | Broken ranking after partial migration
filtered_search_empty_rate | Spike per tenant | Filter bug or ACL mismatch
hnsw_index_size_bytes | > 80% RAM | OOM risk on pgvector/Qdrant node
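
The thresholds in the table can be evaluated by a small function like the sketch below. The metric dictionary keys (for example expected_vector_count and ram_bytes) are hypothetical names for values your monitoring would supply, and the empty-rate rule is left out because "spike" has no fixed threshold in the table.

```python
def evaluate_alerts(m: dict, slo_p99_ms: float = 100.0) -> list[str]:
    """Return the names of the metrics that breach the dashboard thresholds."""
    alerts = []
    if m["vector_search_p99_ms"] > slo_p99_ms:
        alerts.append("vector_search_p99_ms")
    # Recall is a fraction here, so a 2-point drop is 0.02.
    if m["baseline_recall_at_10"] - m["ann_recall_at_10"] > 0.02:
        alerts.append("ann_recall_at_10")
    drift = abs(m["index_vector_count"] - m["expected_vector_count"]) / m["expected_vector_count"]
    if drift > 0.05:
        alerts.append("index_vector_count")
    if len(m["embedding_model_versions"]) > 1:
        alerts.append("embedding_model_version")
    if m["hnsw_index_size_bytes"] > 0.8 * m["ram_bytes"]:
        alerts.append("hnsw_index_size_bytes")
    return alerts


sample = {
    "vector_search_p99_ms": 140.0,          # above the 100 ms SLO
    "baseline_recall_at_10": 0.96,
    "ann_recall_at_10": 0.95,               # 1-point drop, within tolerance
    "index_vector_count": 940_000,
    "expected_vector_count": 1_000_000,     # 6% drift
    "embedding_model_versions": {"v3"},
    "hnsw_index_size_bytes": 20_000_000_000,
    "ram_bytes": 32_000_000_000,
}
print(evaluate_alerts(sample))  # ['vector_search_p99_ms', 'index_vector_count']
```

In practice these rules would live in the alerting system itself; a function like this is mostly useful for testing thresholds before encoding them there.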

Production readiness checklist

  • Single embedding model per active index; migration runbook documented
  • HNSW (or IVF) tuned with measured recall@K—not defaults only
  • Mandatory tenant filter from auth context on every query
  • Content-hash skip on upsert to avoid pointless re-embeds
  • Index rebuild procedure tested in staging (timing documented)
  • Vectors or rebuild inputs backed up; restore drill completed
  • Hybrid / RRF evaluated against dense-only baseline in eval suite
  • Runbook for “recall dropped after deploy”—check model version, index params, filter regression

Incident playbooks (short)

Symptom: all scores low, wrong answers. Check embedding model pin; verify query vectors normalized; compare one query brute-force vs ANN in SQL.

Symptom: p99 latency spike. Check HNSW RAM pressure, connection pool exhaustion, missing metadata index causing seq scan + sort.

Symptom: tenant sees empty results. Verify tenant_id filter, ingest lag, offboarded namespace—not retrieval quality.

⚠️ Pitfall

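Running a plain REINDEX on a production pgvector HNSW index during peak traffic blocks writes to the table and any query that needs the index; REINDEX CONCURRENTLY avoids the block but temporarily keeps two copies of the index on disk. Schedule the rebuild off-peak, run it on a replica that you then promote, or build a new indexed table and swap.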

💰 Cost

Monitor embedding API spend separately from vector storage—re-index cron without content_hash gates is a common surprise invoice. Right-size RDS: HNSW for 2M × 1536 vectors often needs 32GB+ RAM for comfortable p99.

📦 Real World

Teams that treat recall@K as a release gate catch ANN regressions before customers do: a scheduled job runs the labeled questions against the staging index weekly and again on every deploy. Pair with chunking evals from the previous guide so you know whether failure is retrieval or bad splits.