Vector databases

What is a vector database?

A vector database is specialized storage for high-dimensional float arrays (embeddings) with metadata columns and an ANN index so similarity search stays sub-linear instead of scanning every row. In RAG, each row is typically one chunk: vector + text + source, ACL, timestamps.

The three layers

Embeddings — fixed-size vectors (e.g. 1536 floats from text-embedding-3-small) representing chunk meaning. Same model at index and query time; never mix models in one collection.
Metadata — JSON or typed columns attached to each vector: source, document_type, user_id, updated_at, ACL flags. Used for pre-filtering before or during ANN.
ANN search — graph or cluster structures (HNSW, IVF) that return top-K nearest neighbors in milliseconds at 100K–100M scale. Exact brute force is reserved for eval baselines and tiny corpora.

Vector DBs are not general OLTP databases—though pgvector blurs the line by living inside PostgreSQL. They optimize for read-heavy similarity queries with optional metadata predicates, not complex joins or transactional updates across many tables. Your application still owns chunk text, embedding jobs, and reranking; the store answers “nearest vectors under these filters.”

Vector DB vs relational DB vs search engine

System	Primary query	Strength	Weakness for RAG
Vector DB (Pinecone, Qdrant, pgvector+ANN)	Top-K by cosine / L2 / dot	Fast semantic retrieval, metadata filters	No BM25 alone; not source of truth for billing
Relational DB (Postgres without vectors)	SQL filters, joins, transactions	ACID, authZ at row level	No native similarity at scale
Search engine (OpenSearch, Elasticsearch)	BM25 keyword + optional k-NN	Hybrid lexical + dense in one cluster	Vector tuning less ergonomic than purpose-built DBs

End-to-end RAG retrieval path

User question → embed query (cached).
Apply metadata filters (tenant, date, document_type).
ANN top-K (e.g. 20–50) from vector store.
Optional rerank (cross-encoder) → top 5 for LLM context.

The vector store sits between your chunking pipeline and retrieval strategies (reranking, query expansion, refusal thresholds). Embeddings fundamentals live in Track 1 · Embeddings & semantic search.

🔬 Under the Hood

Most vector stores persist vectors as float32 arrays (6 KB per 1536-dim vector) plus B-tree or inverted indexes on metadata. ANN structures add 50–200% memory overhead for HNSW graph edges. Inserts update the graph incrementally (Qdrant, Pinecone) or defer index build until batch load completes (pgvector CREATE INDEX CONCURRENTLY).

⚠️ Pitfall

Treating a vector DB as the only database. Store authoritative document metadata and ACLs in Postgres; mirror retrieval fields into the vector index. If the vector store is wiped, you must be able to rebuild from object storage + embedding job without losing business data.

⚙ Config

Pin these at collection creation: embedding_model, dimensions, distance_metric (cosine / L2 / dot), index_type (HNSW params). Version the collection name when any of these change (kb_v3_1536_cosine).

Comparing vector stores

No single winner—choice depends on scale, team ops capacity, existing stack, hybrid search needs, and multi-tenant isolation model. Below is a practitioner’s matrix for RAG workloads in 2025–2026.

At-a-glance matrix

Store	Deployment	ANN indexes	Metadata filter	Hybrid BM25	Multi-tenant	Best for
Pinecone	Managed SaaS	Proprietary (pod/serverless)	Strong metadata filters	Sparse-dense (integrated)	Namespace per tenant	Fastest path to prod, minimal ops
Weaviate	Managed + self-hosted	HNSW, flat	GraphQL / where filters	Native BM25 + vector fusion	Class + tenant shard	Hybrid search in one OSS stack
Qdrant	Managed + self-hosted	HNSW	Payload indexes (fast)	Sparse vectors + RRF	Collection or payload filter	Performance, filtering, Rust efficiency
Chroma	Embedded / server	HNSW (via DuckDB/SQLite backends)	where dict filters	Limited native; pair with ES	Collection per tenant (small scale)	Prototypes, local dev, hello-RAG
pgvector	Self-hosted (Postgres)	HNSW, IVFFlat	SQL WHERE (B-tree)	Via pg_bm25 / tsvector + RRF in app	Row-level tenant_id	Teams already on Postgres, unified stack
Milvus	Managed (Zilliz) + self-hosted	HNSW, IVF, DiskANN	Scalar filtering	Sparse + dense (2.4+)	Database / partition	Billion-scale, GPU options
Redis Vector	Self-hosted / Redis Cloud	HNSW (RediSearch)	Tag / numeric filters	RediSearch full-text + vector	Key prefix / index per tenant	Low-latency cache-adjacent retrieval
OpenSearch k-NN	Managed (AWS) + self-hosted	HNSW, IVF, Faiss engines	Query DSL filters	Native BM25 + k-NN hybrid	Index alias per tenant	AWS-native, existing ES/OpenSearch shops

Operational and cost dimensions

Store	Typical latency p99	Ops burden	Pricing model	Backup story
Pinecone	10–50 ms (serverless)	Low	Per vector dimension-month + queries	Managed snapshots; export via API
Weaviate Cloud	20–80 ms	Low–medium	Cluster size + objects	Cloud backups; K8s Velero self-hosted
Qdrant Cloud	5–30 ms	Low–medium	RAM + storage tiers	Snapshots, S3 backup hooks
Chroma	5–100 ms (local)	Dev: none; prod: you run it	OSS free; hosting cost only	Filesystem / volume snapshots
pgvector	10–100 ms (tuned HNSW)	Medium (DBA skills)	Postgres infra you already pay	pg_dump, WAL, PITR standard
Milvus / Zilliz	10–50 ms at scale	Medium–high self-hosted	Zilliz cloud or K8s cost	MinIO / S3 segment backup
Redis Vector	<10 ms (in-memory)	Medium (Redis ops)	Memory-bound	RDB/AOF; not primary archive
OpenSearch k-NN	20–150 ms	Medium–high (cluster)	Instance hours + storage	Automated snapshots (AWS)

Decision heuristics

Already on Postgres, <5M vectors: start with pgvector + HNSW—one backup pipeline, SQL filters, no new vendor.
Need hybrid BM25+dense without glue code: Weaviate, Qdrant sparse, OpenSearch, or Redis Stack.
Zero ops, fast launch: Pinecone serverless or Qdrant Cloud; accept vendor cost until unit economics clear.
Billion vectors, ML platform team: Milvus / Zilliz with DiskANN.
Sub-10ms on hot cache: Redis Vector for session-scoped retrieval; Postgres/pgvector for durable corpus.
Laptop / CI / tutorial: Chroma embedded—migrate before multi-tenant prod.

⚖️ Trade-off

Managed vector DBs trade dollars for engineer time. pgvector trades query latency tuning and index rebuild complexity for stack simplicity. Running OpenSearch only for k-NN when you already have Postgres often doubles operational surface—justify it with hybrid search requirements or AWS mandate.

💰 Cost

Rule of thumb: 1M × 1536-dim float32 vectors ≈ 6 GB raw + 50–100% index overhead. Pinecone serverless bills storage + read units; self-hosted pgvector bills RDS storage and RAM for shared_buffers + HNSW. Model embedding cost usually exceeds store cost until 10M+ vectors—optimize re-index frequency first.

📦 Real World

Many Series B SaaS RAG products run pgvector on RDS until ~3–5M active chunks, then split hot tenant indexes to Qdrant or Pinecone namespaces while Postgres remains source of truth. Notion-scale teams invest in custom sharding; your first production index does not need that on day one.

🎯 Interview Tip

“Why pgvector over Pinecone?” — Answer with team skills, existing Postgres HA, transactional updates with app data, and acceptable QPS. Follow-up: “When would you migrate?” — ANN p99 > SLO, HNSW rebuild windows too long, or need managed hybrid sparse+dense without operating OpenSearch.

ANN index types: HNSW, IVF, Flat

The index structure determines recall, memory footprint, build time, and insert behavior. Every vector store exposes one or more of these; tuning parameters is how you hit p99 latency without sacrificing RAG recall.

Flat (exact / brute force)

No auxiliary structure—compare the query to every vector (or every row passing metadata pre-filter). Recall: 100%. Cost: O(n × d) per query. Use for golden-set eval (“what is true top-10?”), corpora under ~10K vectors, and CI regression tests that compare ANN vs exact rankings.

HNSW (Hierarchical Navigable Small World)

Multi-layer proximity graph: greedy beam search from entry points, refine on lower layers. Default for pgvector, Qdrant, Weaviate, Redis HNSW, OpenSearch nmslib/faiss HNSW.

Build: moderate; no separate training phase (unlike IVF).
Query params: ef_search (pgvector hnsw.ef_search) — higher → better recall, slower queries.
Insert: online but slower than IVF; heavy bulk load may defer index creation.
Memory: high—stores neighbor lists per vector.

IVF (Inverted File / IVFFlat)

k-means partitions the vector space into lists (cells). Each vector assigned to one cell; query probes probes nearest centroids only. pgvector ivfflat, Milvus IVF, Faiss IVF—common at very large scale with memory pressure.

Build: requires training pass on representative sample—skewed samples hurt recall.
Query params: probes — higher → better recall, more cells scanned.
Insert: assign to nearest centroid—fast bulk ingest.
Memory: lower than HNSW at same n; pairs with PQ compression at billion scale.

Recall / memory / latency trade-offs

Index	Recall @10 (typical tuned)	Memory overhead	Query p99 (1M × 1536)	Bulk load	When to choose
Flat	100%	None	200–2000 ms (CPU)	Instant	Eval baseline, <10K vectors
HNSW (ef_search=64)	92–96%	+50–120% vector RAM	5–25 ms	Build after load or incremental	Default production RAG
HNSW (ef_search=128)	96–99%	Same index size	10–40 ms	Same	When eval shows recall gaps
IVFFlat (probes=10)	85–93%	+10–30%	8–35 ms	Fast after train	Memory constrained, batch-heavy
IVF + PQ	80–92%	Compressed (4–16×)	5–30 ms	Train + encode	100M+ vectors, approximate OK

Measuring recall correctly

Build labeled set: query text → relevant chunk IDs (50–200 questions from support logs).
Run exact flat search → ground-truth top-10 IDs.
Run ANN with candidate params → compute recall@K = |ANN ∩ truth| / K.
Plot recall vs p99 latency; pick smallest ef_search / probes that clears 95% recall@10.

# Run after bulk COPY into chunks — not row-by-row during ingest
HNSW_DDL = """
CREATE INDEX CONCURRENTLY idx_chunks_embedding_hnsw
  ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
"""
# Session tuning per query (optional)
SET_DDL = "SET hnsw.ef_search = 100;"  # default 40; raise for recall

// Flyway / Liquibase migration snippet
/*
CREATE INDEX CONCURRENTLY idx_chunks_embedding_hnsw
  ON document_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
*/
@Transactional(readOnly = true)
public void tuneHnsw(JdbcTemplate jdbc) {
  jdbc.execute("SET hnsw.ef_search = 100");
}

💡 Pro Tip

After bulk ingest into pgvector, run VACUUM ANALYZE then create HNSW CONCURRENTLY off-peak. Building HNSW while streaming inserts causes lock contention and suboptimal graphs.

📐 IaC

Encode index type and params in Terraform/Helm values—not manual psql on prod. Example: vector_index: { type: hnsw, m: 16, ef_construction: 64, ef_search: 100 } applied via migration job in deploy pipeline.

Metadata filtering

Pure vector search over the entire corpus returns chunks from wrong tenants, stale policies, or unrelated document types. Pre-filtering restricts ANN to rows matching structured predicates—before or during graph traversal depending on the engine.

Common metadata fields in RAG

Field	Type	Example use	Index hint
tenant_id / org_id	UUID / string	SaaS isolation—only search customer A’s docs	B-tree / payload keyword index
user_id	UUID	Personal notes vs shared workspace	Composite (tenant_id, user_id)
source	string	Filter to confluence vs slack	Keyword index
document_type	enum	Policies vs API reference vs tickets	Enum / keyword
updated_at	timestamp	Prefer docs newer than policy effective date	B-tree range
acl_group	string[]	Intersect with caller’s groups	GIN array (Postgres)

Pre-filter vs post-filter

Pre-filter (preferred): ANN runs only on IDs matching tenant_id = X AND document_type = 'policy'. Qdrant payload indexes, Pinecone metadata filters, pgvector SQL WHERE before ORDER BY distance.
Post-filter: ANN returns top-100 globally, then drop rows failing filter—dangerous when filter is selective (you may get <K results). Always over-fetch (top 50→filter→5) or use pre-filter.

Example filter expressions

# Qdrant — payload index on tenant_id recommended
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range

flt = Filter(must=[
    FieldCondition(key="tenant_id", match=MatchValue(value="org_abc")),
    FieldCondition(key="document_type", match=MatchValue(value="policy")),
    FieldCondition(key="updated_at", range=Range(gte="2024-01-01T00:00:00Z")),
])
hits = client.search(
    collection_name="kb",
    query_vector=query_emb,
    query_filter=flt,
    limit=20,
)

# Pinecone — metadata dict filter
# index.query(vector=q, top_k=20, filter={"tenant_id": "org_abc", "document_type": "policy"})

// pgvector + Spring JDBC — pre-filter in SQL
String sql = """
  SELECT id, content,
         1 - (embedding <=> ?::vector) AS score
  FROM document_chunks
  WHERE tenant_id = ?
    AND document_type = ?
    AND updated_at >= ?::timestamptz
  ORDER BY embedding <=> ?::vector
  LIMIT ?
  """;
// B-tree on (tenant_id, document_type, updated_at) keeps filter selective

Filter selectivity and recall

If only 0.1% of vectors pass the filter, HNSW may not traverse enough candidates—some engines build per-tenant partitions or namespace indexes. Symptoms: correct chunk exists but never appears in top-K. Fix: dedicated collection/namespace per large tenant, raise ef_search, or partition index by tenant.

🔒 Security

Never rely on post-filter alone for authZ—attackers craft queries that surface other tenants’ chunks in ANN results before filter drops them (information leakage in logs). Enforce tenant_id in the query API from the session JWT; reject client-supplied tenant overrides. Mirror ACL checks from Postgres if vector metadata could be stale.

⚠️ Pitfall

Storing only source_url without normalized document_type forces fuzzy string filters. Normalize metadata at ingest—enum fields indexed at ingest time beat regex post-filters on URLs.

Hybrid search: dense + BM25

Embeddings excel at paraphrase; BM25 excels at exact tokens (SKUs, error codes, legal citations). Hybrid search runs both retrievers and fuses rankings—Reciprocal Rank Fusion (RRF) is the default merge that needs no score calibration.

Why neither alone is enough

Query type	Dense embedding	BM25 lexical	Hybrid
“How do I cancel subscription?”	✓ Strong	△ Misses paraphrase	✓
“Error ECONNREFUSED port 5432”	△ Smooths rare tokens	✓ Exact match	✓
“INV-88291 refund status”	✗ Poor on IDs	✓	✓ (+ structured lookup)
“PTO policy 2025”	✓ Semantic	✓ Keyword “PTO”	✓ Best recall

Reciprocal Rank Fusion (RRF)

For each document d, sum contributions from each ranked list (dense list, BM25 list):

RRF(d) = Σ 1 / (k + rank_i(d))

where rank_i(d) is the 1-based rank of d in list i (missing docs contribute 0), and k is a constant—typically 60 (Cormack et al.; widely used in Elasticsearch, OpenSearch, Weaviate). Higher RRF score → better fused rank. No need to normalize cosine [0,1] with BM25 unbounded scores.

Hybrid implementation sketch

Retrieve top-N_dense (e.g. 50) from vector store.
Retrieve top-N_lex (e.g. 50) from BM25 (OpenSearch, Postgres tsvector, or store-native).
Compute RRF scores in application code or engine-native hybrid query.
Take top-K fused (e.g. 20) → optional cross-encoder rerank → 5 for LLM.

from collections import defaultdict

K_RRF = 60

def rrf_fuse(*ranked_lists: list[str], k: int = K_RRF) -> list[tuple[str, float]]:
    scores: dict[str, float] = defaultdict(float)
    for lst in ranked_lists:
        for rank, doc_id in enumerate(lst, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: -x[1])

dense_ids = ["c1", "c4", "c2", "c9"]   # from vector ANN
bm25_ids  = ["c4", "c1", "c7", "c2"]  # from OpenSearch
fused = rrf_fuse(dense_ids, bm25_ids)
top_ids = [doc_id for doc_id, _ in fused[:20]]

public static final int K_RRF = 60;

public List<String> rrfFuse(List<List<String>> rankedLists, int topK) {
  Map<String, Double> scores = new HashMap<>();
  for (List<String> list : rankedLists) {
    for (int rank = 0; rank < list.size(); rank++) {
      String id = list.get(rank);
      scores.merge(id, 1.0 / (K_RRF + rank + 1), Double::sum);
    }
  }
  return scores.entrySet().stream()
      .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
      .limit(topK)
      .map(Map.Entry::getKey)
      .toList();
}

When hybrid wins (and when it does not)

Win: technical docs, support KBs, legal corpora with defined terms + natural language questions.
Win: eval shows dense-only misses exact-match queries; BM25-only misses paraphrase.
Skip (initially): tiny homogeneous corpus (<5K chunks) where dense alone hits 95% recall@5.
Skip: pure conversational memory where lexical overlap is irrelevant—dense + recency may suffice.

Native hybrid (Weaviate, OpenSearch hybrid query, Qdrant sparse+dense) reduces round trips; app-level RRF stays portable across pgvector + external BM25.

⚖️ Trade-off

Hybrid doubles retrieval latency and index surface (vector + inverted index). Start dense-only with metadata filters; add BM25 when labeled eval proves keyword gaps—not because “best practices” say so.

☸ Platform

On Kubernetes, running separate OpenSearch and Qdrant clusters means two StatefulSets to patch. Weaviate or OpenSearch single-cluster hybrid reduces moving parts if your team already operates search infrastructure.

Multi-tenancy patterns

B2B RAG must guarantee tenant A never retrieves tenant B’s chunks. Two dominant patterns: namespace per tenant (physical isolation) vs shared index + metadata filter (logical isolation).

Namespace / collection per tenant

Each tenant gets Pinecone namespace, Qdrant collection, or Chroma collection.
Pros: strong isolation, independent index tuning, easy delete-tenant (drop namespace), no cross-tenant ANN leakage.
Cons: operational explosion at 10K+ tenants; cold namespaces; harder cross-tenant analytics; more index memory overhead.

Shared index + metadata filter

Single collection; every row has tenant_id; every query includes mandatory filter.
Pros: one index to manage, efficient for many small tenants, simpler backup.
Cons: filter selectivity issues for huge tenants; blast radius if filter bug; noisy neighbor on shared RAM.

Hybrid tiering (common at scale)

Tenant tier	Pattern	Rationale
Free / small (<10K chunks)	Shared index + tenant_id filter	Cost efficient
Enterprise (>500K chunks)	Dedicated namespace or collection	Recall + performance isolation
Regulated (HIPAA, SOC2)	Dedicated cluster or DB schema	Audit boundary, encryption per tenant

Tenant lifecycle

Provision — create namespace or verify filter path; seed empty index.
Ingest — embed jobs tag every vector with tenant_id from auth context—not from client body.
Query — middleware injects tenant from JWT; integration tests assert cross-tenant negative cases.
Offboard — delete namespace or DELETE WHERE tenant_id = ?; verify object storage purge.

def search_kb(query: str, ctx: RequestContext) -> list[ChunkHit]:
    tenant_id = ctx.jwt["org_id"]  # never from request body
    embedding = embed(query)
    return vector_store.search(
        vector=embedding,
        filter={"tenant_id": tenant_id},
        namespace=tenant_id if ctx.tier == "enterprise" else None,
        top_k=20,
    )

public List<ChunkHit> searchKb(String query, AuthContext ctx) {
  String tenantId = ctx.orgId(); // from verified JWT
  float[] embedding = embedder.embed(query);
  return vectorStore.search(SearchRequest.builder()
      .vector(embedding)
      .filter(Map.of("tenant_id", tenantId))
      .namespace(ctx.isEnterprise() ? tenantId : null)
      .topK(20)
      .build());
}

🔒 Security

Add integration tests: user of tenant A queries with forged tenant B filter in body → must ignore body, use JWT only. Log and alert on queries returning zero results when corpus should exist—may indicate filter misconfiguration rather than bad retrieval.

🎯 Interview Tip

“Design multi-tenant RAG for 5,000 customers.” — Shared pgvector with tenant_id for long tail; dedicated Qdrant collection for top 20 accounts by chunk volume; central embedding gateway; per-tenant rate limits; offboarding runbook. Mention compliance tier with separate RDS instance.

pgvector: upsert and filtered search

pgvector keeps vectors beside relational data—ideal when you already run Postgres for users, documents, and ACLs. Below: schema, idempotent upsert on content hash, and filtered cosine search in Python and Java.

Schema

SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
  id            BIGSERIAL PRIMARY KEY,
  chunk_id      TEXT NOT NULL,
  tenant_id     TEXT NOT NULL,
  source        TEXT NOT NULL,
  document_type TEXT NOT NULL,
  user_id       TEXT,
  content       TEXT NOT NULL,
  content_hash  TEXT NOT NULL,
  embedding     vector(1536) NOT NULL,
  updated_at    TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (tenant_id, chunk_id)
);

CREATE INDEX idx_chunks_tenant_type ON document_chunks (tenant_id, document_type);
CREATE INDEX idx_chunks_updated ON document_chunks (updated_at DESC);
"""

/*
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
  id BIGSERIAL PRIMARY KEY,
  chunk_id TEXT NOT NULL,
  tenant_id TEXT NOT NULL,
  source TEXT NOT NULL,
  document_type TEXT NOT NULL,
  user_id TEXT,
  content TEXT NOT NULL,
  content_hash TEXT NOT NULL,
  embedding vector(1536) NOT NULL,
  updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE (tenant_id, chunk_id)
);
CREATE INDEX idx_chunks_tenant_type ON document_chunks (tenant_id, document_type);
*/

Upsert (idempotent on tenant + chunk_id)

import hashlib
import psycopg
from openai import OpenAI

client = OpenAI()
MODEL = "text-embedding-3-small"

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def embed(text: str) -> list[float]:
    return client.embeddings.create(model=MODEL, input=text).data[0].embedding

def upsert_chunk(conn, row: dict) -> None:
    h = content_hash(row["content"])
    if row.get("content_hash") == h:
        return  # skip unchanged — no API call
    vec = embed(row["content"])
    conn.execute(
        """
        INSERT INTO document_chunks
          (chunk_id, tenant_id, source, document_type, user_id,
           content, content_hash, embedding, updated_at)
        VALUES (%(chunk_id)s, %(tenant_id)s, %(source)s, %(document_type)s,
                %(user_id)s, %(content)s, %(hash)s, %(emb)s, now())
        ON CONFLICT (tenant_id, chunk_id) DO UPDATE SET
          content = EXCLUDED.content,
          content_hash = EXCLUDED.content_hash,
          embedding = EXCLUDED.embedding,
          source = EXCLUDED.source,
          document_type = EXCLUDED.document_type,
          updated_at = now()
        WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash
        """,
        {**row, "hash": h, "emb": vec},
    )

# Batch ingest: upsert many, then CREATE INDEX CONCURRENTLY once

@Repository
public class ChunkRepository {
  private final JdbcTemplate jdbc;
  private final EmbeddingModel embedder;

  public void upsertChunk(ChunkRow row) {
    String hash = sha256(row.content());
    Optional<String> existing = jdbc.query(
        "SELECT content_hash FROM document_chunks WHERE tenant_id=? AND chunk_id=?",
        rs -> rs.next() ? Optional.of(rs.getString(1)) : Optional.empty(),
        row.tenantId(), row.chunkId());
    if (existing.isPresent() && existing.get().equals(hash)) return;

    float[] vec = embedder.embed(row.content());
    String vecLiteral = toPgVector(vec);
    jdbc.update("""
        INSERT INTO document_chunks
          (chunk_id, tenant_id, source, document_type, user_id,
           content, content_hash, embedding, updated_at)
        VALUES (?, ?, ?, ?, ?, ?, ?, ?::vector, now())
        ON CONFLICT (tenant_id, chunk_id) DO UPDATE SET
          content = EXCLUDED.content,
          content_hash = EXCLUDED.content_hash,
          embedding = EXCLUDED.embedding,
          updated_at = now()
        WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash
        """,
        row.chunkId(), row.tenantId(), row.source(), row.documentType(),
        row.userId(), row.content(), hash, vecLiteral);
  }
}

Filtered similarity search

def search(
    conn,
    query: str,
    tenant_id: str,
    document_type: str | None = None,
    since: str | None = None,
    k: int = 10,
) -> list[dict]:
    q_vec = embed(query)
    sql = """
      SELECT chunk_id, content, source,
             1 - (embedding <=> %(q)s::vector) AS score
      FROM document_chunks
      WHERE tenant_id = %(tenant)s
    """
    params: dict = {"q": q_vec, "tenant": tenant_id, "k": k}
    if document_type:
        sql += " AND document_type = %(dtype)s"
        params["dtype"] = document_type
    if since:
        sql += " AND updated_at >= %(since)s::timestamptz"
        params["since"] = since
    sql += " ORDER BY embedding <=> %(q)s::vector LIMIT %(k)s"
    return conn.execute(sql, params).fetchall()

public List<ChunkHit> search(
    String query, String tenantId, String documentType, Instant since, int k) {
  float[] qVec = embedder.embed(query);
  String vec = toPgVector(qVec);
  StringBuilder sql = new StringBuilder("""
      SELECT chunk_id, content, source,
             1 - (embedding <=> ?::vector) AS score
      FROM document_chunks
      WHERE tenant_id = ?
      """);
  List<Object> args = new ArrayList<>(List.of(vec, tenantId));
  if (documentType != null) {
    sql.append(" AND document_type = ?");
    args.add(documentType);
  }
  if (since != null) {
    sql.append(" AND updated_at >= ?");
    args.add(Timestamp.from(since));
  }
  sql.append(" ORDER BY embedding <=> ?::vector LIMIT ?");
  args.add(vec);
  args.add(k);
  return jdbc.query(sql.toString(), chunkMapper, args.toArray());
}

💡 Pro Tip

Use <=> (cosine distance) with vector_cosine_ops HNSW index— score as 1 - distance for UI display. Keep raw distance in logs for drift detection.

🔬 Under the Hood

pgvector stores vectors inline in heap pages—updates rewrite rows when embedding changes. Batch upserts inside transactions; avoid per-row commits during million-chunk backfill. Connection pool size × query concurrency must fit within Postgres max_connections and RAM for HNSW graph.

What this looks like in production

A vector index is living infrastructure—it grows with every document upload, model upgrade, and tenant onboard. Plan index rebuilds, backups, and monitoring before the first customer connects their Google Drive.

Index rebuild strategy

Model change — new collection/table suffix (kb_v4); embed full corpus; build HNSW; alias swap; retire old index after soak.
Bulk re-ingest — disable HNSW during load; COPY or batch INSERT; VACUUM ANALYZE; CREATE INDEX CONCURRENTLY.
Incremental — upsert changed chunks only (content_hash gate); periodic index maintenance (pgvector reindex if supported; Qdrant optimize).
Blue/green — dual-write to staging index; validate recall@K; flip read traffic; delete green after 24h.

Backup and disaster recovery

pgvector: standard Postgres PITR + nightly pg_dump; test restore quarterly.
Managed (Pinecone/Qdrant): enable vendor snapshots; export vectors + metadata to S3 Parquet for vendor exit.
Rebuild path: source docs in object storage + chunk manifest + embedding model version → full rebuild without vector backup (slow but sufficient).
RPO/RTO: vectors are regenerable if corpus text exists—prioritize document backup over vector snapshot if storage cost bites.

Monitoring dashboard

Metric	Alert if	Why it matters
vector_search_p99_ms	> SLO (e.g. 100ms)	User-facing RAG latency
ann_recall_at_10 (weekly eval)	Drop > 2 pts vs baseline	Silent index degradation
index_vector_count vs CMS doc count	Drift > 5%	Ingest pipeline stuck
embedding_model_version mix	>1 version in active index	Broken ranking after partial migration
filtered_search_empty_rate	Spike per tenant	Filter bug or ACL mismatch
hnsw_index_size_bytes	> 80% RAM	OOM risk on pgvector/Qdrant node

Production readiness checklist

Single embedding model per active index; migration runbook documented
HNSW (or IVF) tuned with measured recall@K—not defaults only
Mandatory tenant filter from auth context on every query
Content-hash skip on upsert to avoid pointless re-embeds
Index rebuild procedure tested in staging (timing documented)
Vectors or rebuild inputs backed up; restore drill completed
Hybrid / RRF evaluated against dense-only baseline in eval suite
Runbook for “recall dropped after deploy”—check model version, index params, filter regression

Incident playbooks (short)

Symptom: all scores low, wrong answers. Check embedding model pin; verify query vectors normalized; compare one query brute-force vs ANN in SQL.

Symptom: p99 latency spike. Check HNSW RAM pressure, connection pool exhaustion, missing metadata index causing seq scan + sort.

Symptom: tenant sees empty results. Verify tenant_id filter, ingest lag, offboarded namespace—not retrieval quality.

⚠️ Pitfall

Running REINDEX on production pgvector HNSW during peak traffic locks reads or doubles storage. Schedule concurrent rebuild on replica promote pattern, or build new index table and swap.

💰 Cost

Monitor embedding API spend separately from vector storage—re-index cron without content_hash gates is a common surprise invoice. Right-size RDS: HNSW for 2M × 1536 vectors often needs 32GB+ RAM for comfortable p99.

📦 Real World

Teams that treat recall@K as a release gate catch ANN regressions before customers do—weekly cron runs labeled questions against staging index on every deploy. Pair with chunking evals from the previous guide so you know whether failure is retrieval or bad splits.