Vector databases
A vector database stores dense embeddings alongside structured metadata and answers “which chunks are closest to this query vector?” using approximate nearest neighbor (ANN) search—fast enough for interactive RAG, semantic search, and recommendation at millions of rows. It is the retrieval layer between chunking and the LLM.
After reading, you should be able to: explain how embeddings, metadata, and ANN indexes fit together; compare Pinecone, Weaviate, Qdrant, Chroma, pgvector, Milvus, Redis Vector, and OpenSearch k-NN; choose HNSW vs IVF vs flat; implement metadata pre-filtering and hybrid dense+BM25 with RRF; design multi-tenant isolation; upsert and query with pgvector in Python and Java; and operate indexes in production (rebuild, backup, monitoring).
What is a vector database?
A vector database is specialized storage for high-dimensional float arrays (embeddings) with metadata columns and an ANN index so similarity search stays sub-linear instead of scanning every row. In RAG, each row is typically one chunk: vector + text + source, ACL, timestamps.
The three layers
- Embeddings — fixed-size vectors (e.g. 1536 floats from text-embedding-3-small) representing chunk meaning. Same model at index and query time; never mix models in one collection.
- Metadata — JSON or typed columns attached to each vector: source, document_type, user_id, updated_at, ACL flags. Used for pre-filtering before or during ANN.
- ANN search — graph or cluster structures (HNSW, IVF) that return top-K nearest neighbors in milliseconds at 100K–100M scale. Exact brute force is reserved for eval baselines and tiny corpora.
Vector DBs are not general OLTP databases—though pgvector blurs the line by living inside PostgreSQL. They optimize for read-heavy similarity queries with optional metadata predicates, not complex joins or transactional updates across many tables. Your application still owns chunk text, embedding jobs, and reranking; the store answers “nearest vectors under these filters.”
Vector DB vs relational DB vs search engine
| System | Primary query | Strength | Weakness for RAG |
|---|---|---|---|
| Vector DB (Pinecone, Qdrant, pgvector+ANN) | Top-K by cosine / L2 / dot | Fast semantic retrieval, metadata filters | No BM25 alone; not source of truth for billing |
| Relational DB (Postgres without vectors) | SQL filters, joins, transactions | ACID, authZ at row level | No native similarity at scale |
| Search engine (OpenSearch, Elasticsearch) | BM25 keyword + optional k-NN | Hybrid lexical + dense in one cluster | Vector tuning less ergonomic than purpose-built DBs |
End-to-end RAG retrieval path
- User question → embed query (cached).
- Apply metadata filters (tenant, date, document_type).
- ANN top-K (e.g. 20–50) from vector store.
- Optional rerank (cross-encoder) → top 5 for LLM context.
The vector store sits between your chunking pipeline and retrieval strategies (reranking, query expansion, refusal thresholds). Embeddings fundamentals live in Track 1 · Embeddings & semantic search.
Most vector stores persist vectors as float32 arrays (6 KB per 1536-dim vector) plus B-tree or inverted indexes on metadata. ANN structures add 50–200% memory overhead for HNSW graph edges. Inserts update the graph incrementally (Qdrant, Pinecone) or defer index build until batch load completes (pgvector CREATE INDEX CONCURRENTLY).
Treating a vector DB as the only database. Store authoritative document metadata and ACLs in Postgres; mirror retrieval fields into the vector index. If the vector store is wiped, you must be able to rebuild from object storage + embedding job without losing business data.
Pin these at collection creation: embedding_model, dimensions, distance_metric (cosine / L2 / dot), index_type (HNSW params). Version the collection name when any of these change (kb_v3_1536_cosine).
Comparing vector stores
No single winner—choice depends on scale, team ops capacity, existing stack, hybrid search needs, and multi-tenant isolation model. Below is a practitioner’s matrix for RAG workloads in 2025–2026.
At-a-glance matrix
| Store | Deployment | ANN indexes | Metadata filter | Hybrid BM25 | Multi-tenant | Best for |
|---|---|---|---|---|---|---|
| Pinecone | Managed SaaS | Proprietary (pod/serverless) | Strong metadata filters | Sparse-dense (integrated) | Namespace per tenant | Fastest path to prod, minimal ops |
| Weaviate | Managed + self-hosted | HNSW, flat | GraphQL / where filters | Native BM25 + vector fusion | Class + tenant shard | Hybrid search in one OSS stack |
| Qdrant | Managed + self-hosted | HNSW | Payload indexes (fast) | Sparse vectors + RRF | Collection or payload filter | Performance, filtering, Rust efficiency |
| Chroma | Embedded / server | HNSW (via DuckDB/SQLite backends) | where dict filters | Limited native; pair with ES | Collection per tenant (small scale) | Prototypes, local dev, hello-RAG |
| pgvector | Self-hosted (Postgres) | HNSW, IVFFlat | SQL WHERE (B-tree) | Via pg_bm25 / tsvector + RRF in app | Row-level tenant_id | Teams already on Postgres, unified stack |
| Milvus | Managed (Zilliz) + self-hosted | HNSW, IVF, DiskANN | Scalar filtering | Sparse + dense (2.4+) | Database / partition | Billion-scale, GPU options |
| Redis Vector | Self-hosted / Redis Cloud | HNSW (RediSearch) | Tag / numeric filters | RediSearch full-text + vector | Key prefix / index per tenant | Low-latency cache-adjacent retrieval |
| OpenSearch k-NN | Managed (AWS) + self-hosted | HNSW, IVF, Faiss engines | Query DSL filters | Native BM25 + k-NN hybrid | Index alias per tenant | AWS-native, existing ES/OpenSearch shops |
Operational and cost dimensions
| Store | Typical latency p99 | Ops burden | Pricing model | Backup story |
|---|---|---|---|---|
| Pinecone | 10–50 ms (serverless) | Low | Per vector dimension-month + queries | Managed snapshots; export via API |
| Weaviate Cloud | 20–80 ms | Low–medium | Cluster size + objects | Cloud backups; K8s Velero self-hosted |
| Qdrant Cloud | 5–30 ms | Low–medium | RAM + storage tiers | Snapshots, S3 backup hooks |
| Chroma | 5–100 ms (local) | Dev: none; prod: you run it | OSS free; hosting cost only | Filesystem / volume snapshots |
| pgvector | 10–100 ms (tuned HNSW) | Medium (DBA skills) | Postgres infra you already pay | pg_dump, WAL, PITR standard |
| Milvus / Zilliz | 10–50 ms at scale | Medium–high self-hosted | Zilliz cloud or K8s cost | MinIO / S3 segment backup |
| Redis Vector | <10 ms (in-memory) | Medium (Redis ops) | Memory-bound | RDB/AOF; not primary archive |
| OpenSearch k-NN | 20–150 ms | Medium–high (cluster) | Instance hours + storage | Automated snapshots (AWS) |
Decision heuristics
- Already on Postgres, <5M vectors: start with pgvector + HNSW—one backup pipeline, SQL filters, no new vendor.
- Need hybrid BM25+dense without glue code: Weaviate, Qdrant sparse, OpenSearch, or Redis Stack.
- Zero ops, fast launch: Pinecone serverless or Qdrant Cloud; accept vendor cost until unit economics clear.
- Billion vectors, ML platform team: Milvus / Zilliz with DiskANN.
- Sub-10ms on hot cache: Redis Vector for session-scoped retrieval; Postgres/pgvector for durable corpus.
- Laptop / CI / tutorial: Chroma embedded—migrate before multi-tenant prod.
Managed vector DBs trade dollars for engineer time. pgvector trades query latency tuning and index rebuild complexity for stack simplicity. Running OpenSearch only for k-NN when you already have Postgres often doubles operational surface—justify it with hybrid search requirements or AWS mandate.
Rule of thumb: 1M × 1536-dim float32 vectors ≈ 6 GB raw + 50–100% index overhead. Pinecone serverless bills storage + read units; self-hosted pgvector bills RDS storage and RAM for shared_buffers + HNSW. Model embedding cost usually exceeds store cost until 10M+ vectors—optimize re-index frequency first.
Many Series B SaaS RAG products run pgvector on RDS until ~3–5M active chunks, then split hot tenant indexes to Qdrant or Pinecone namespaces while Postgres remains source of truth. Notion-scale teams invest in custom sharding; your first production index does not need that on day one.
“Why pgvector over Pinecone?” — Answer with team skills, existing Postgres HA, transactional updates with app data, and acceptable QPS. Follow-up: “When would you migrate?” — ANN p99 > SLO, HNSW rebuild windows too long, or need managed hybrid sparse+dense without operating OpenSearch.
ANN index types: HNSW, IVF, Flat
The index structure determines recall, memory footprint, build time, and insert behavior. Every vector store exposes one or more of these; tuning parameters is how you hit p99 latency without sacrificing RAG recall.
Flat (exact / brute force)
No auxiliary structure—compare the query to every vector (or every row passing metadata pre-filter). Recall: 100%. Cost: O(n × d) per query. Use for golden-set eval (“what is true top-10?”), corpora under ~10K vectors, and CI regression tests that compare ANN vs exact rankings.
HNSW (Hierarchical Navigable Small World)
Multi-layer proximity graph: greedy beam search from entry points, refine on lower layers. Default for pgvector, Qdrant, Weaviate, Redis HNSW, OpenSearch nmslib/faiss HNSW.
- Build: moderate; no separate training phase (unlike IVF).
- Query params: ef_search (pgvector hnsw.ef_search) — higher → better recall, slower queries.
- Insert: online but slower than IVF; heavy bulk load may defer index creation.
- Memory: high—stores neighbor lists per vector.
IVF (Inverted File / IVFFlat)
k-means partitions the vector space into lists (cells). Each vector assigned to one cell; query probes probes nearest centroids only. pgvector ivfflat, Milvus IVF, Faiss IVF—common at very large scale with memory pressure.
- Build: requires training pass on representative sample—skewed samples hurt recall.
- Query params: probes — higher → better recall, more cells scanned.
- Insert: assign to nearest centroid—fast bulk ingest.
- Memory: lower than HNSW at same n; pairs with PQ compression at billion scale.
Recall / memory / latency trade-offs
| Index | Recall @10 (typical tuned) | Memory overhead | Query p99 (1M × 1536) | Bulk load | When to choose |
|---|---|---|---|---|---|
| Flat | 100% | None | 200–2000 ms (CPU) | Instant | Eval baseline, <10K vectors |
| HNSW (ef_search=64) | 92–96% | +50–120% vector RAM | 5–25 ms | Build after load or incremental | Default production RAG |
| HNSW (ef_search=128) | 96–99% | Same index size | 10–40 ms | Same | When eval shows recall gaps |
| IVFFlat (probes=10) | 85–93% | +10–30% | 8–35 ms | Fast after train | Memory constrained, batch-heavy |
| IVF + PQ | 80–92% | Compressed (4–16×) | 5–30 ms | Train + encode | 100M+ vectors, approximate OK |
Measuring recall correctly
- Build labeled set: query text → relevant chunk IDs (50–200 questions from support logs).
- Run exact flat search → ground-truth top-10 IDs.
- Run ANN with candidate params → compute recall@K = |ANN ∩ truth| / K.
- Plot recall vs p99 latency; pick smallest ef_search / probes that clears 95% recall@10.
# Run after bulk COPY into chunks — not row-by-row during ingest
HNSW_DDL = """
CREATE INDEX CONCURRENTLY idx_chunks_embedding_hnsw
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
"""
# Session tuning per query (optional)
SET_DDL = "SET hnsw.ef_search = 100;" # default 40; raise for recall
// Flyway / Liquibase migration snippet
/*
CREATE INDEX CONCURRENTLY idx_chunks_embedding_hnsw
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
*/
@Transactional(readOnly = true)
public void tuneHnsw(JdbcTemplate jdbc) {
jdbc.execute("SET hnsw.ef_search = 100");
}
After bulk ingest into pgvector, run VACUUM ANALYZE then create HNSW CONCURRENTLY off-peak. Building HNSW while streaming inserts causes lock contention and suboptimal graphs.
Encode index type and params in Terraform/Helm values—not manual psql on prod. Example: vector_index: { type: hnsw, m: 16, ef_construction: 64, ef_search: 100 } applied via migration job in deploy pipeline.
Metadata filtering
Pure vector search over the entire corpus returns chunks from wrong tenants, stale policies, or unrelated document types. Pre-filtering restricts ANN to rows matching structured predicates—before or during graph traversal depending on the engine.
Common metadata fields in RAG
| Field | Type | Example use | Index hint |
|---|---|---|---|
| tenant_id / org_id | UUID / string | SaaS isolation—only search customer A’s docs | B-tree / payload keyword index |
| user_id | UUID | Personal notes vs shared workspace | Composite (tenant_id, user_id) |
| source | string | Filter to confluence vs slack | Keyword index |
| document_type | enum | Policies vs API reference vs tickets | Enum / keyword |
| updated_at | timestamp | Prefer docs newer than policy effective date | B-tree range |
| acl_group | string[] | Intersect with caller’s groups | GIN array (Postgres) |
Pre-filter vs post-filter
- Pre-filter (preferred): ANN runs only on IDs matching tenant_id = X AND document_type = 'policy'. Qdrant payload indexes, Pinecone metadata filters, pgvector SQL WHERE before ORDER BY distance.
- Post-filter: ANN returns top-100 globally, then drop rows failing filter—dangerous when filter is selective (you may get <K results). Always over-fetch (top 50→filter→5) or use pre-filter.
Example filter expressions
# Qdrant — payload index on tenant_id recommended
from qdrant_client.models import Filter, FieldCondition, MatchValue, Range
flt = Filter(must=[
FieldCondition(key="tenant_id", match=MatchValue(value="org_abc")),
FieldCondition(key="document_type", match=MatchValue(value="policy")),
FieldCondition(key="updated_at", range=Range(gte="2024-01-01T00:00:00Z")),
])
hits = client.search(
collection_name="kb",
query_vector=query_emb,
query_filter=flt,
limit=20,
)
# Pinecone — metadata dict filter
# index.query(vector=q, top_k=20, filter={"tenant_id": "org_abc", "document_type": "policy"})
// pgvector + Spring JDBC — pre-filter in SQL
String sql = """
SELECT id, content,
1 - (embedding <=> ?::vector) AS score
FROM document_chunks
WHERE tenant_id = ?
AND document_type = ?
AND updated_at >= ?::timestamptz
ORDER BY embedding <=> ?::vector
LIMIT ?
""";
// B-tree on (tenant_id, document_type, updated_at) keeps filter selective
Filter selectivity and recall
If only 0.1% of vectors pass the filter, HNSW may not traverse enough candidates—some engines build per-tenant partitions or namespace indexes. Symptoms: correct chunk exists but never appears in top-K. Fix: dedicated collection/namespace per large tenant, raise ef_search, or partition index by tenant.
Never rely on post-filter alone for authZ—attackers craft queries that surface other tenants’ chunks in ANN results before filter drops them (information leakage in logs). Enforce tenant_id in the query API from the session JWT; reject client-supplied tenant overrides. Mirror ACL checks from Postgres if vector metadata could be stale.
Storing only source_url without normalized document_type forces fuzzy string filters. Normalize metadata at ingest—enum fields indexed at ingest time beat regex post-filters on URLs.
Hybrid search: dense + BM25
Embeddings excel at paraphrase; BM25 excels at exact tokens (SKUs, error codes, legal citations). Hybrid search runs both retrievers and fuses rankings—Reciprocal Rank Fusion (RRF) is the default merge that needs no score calibration.
Why neither alone is enough
| Query type | Dense embedding | BM25 lexical | Hybrid |
|---|---|---|---|
| “How do I cancel subscription?” | ✓ Strong | △ Misses paraphrase | ✓ |
| “Error ECONNREFUSED port 5432” | △ Smooths rare tokens | ✓ Exact match | ✓ |
| “INV-88291 refund status” | ✗ Poor on IDs | ✓ | ✓ (+ structured lookup) |
| “PTO policy 2025” | ✓ Semantic | ✓ Keyword “PTO” | ✓ Best recall |
Reciprocal Rank Fusion (RRF)
For each document d, sum contributions from each ranked list (dense list, BM25 list):
RRF(d) = Σ 1 / (k + rank_i(d))
where rank_i(d) is the 1-based rank of d in list i (missing docs contribute 0), and k is a constant—typically 60 (Cormack et al.; widely used in Elasticsearch, OpenSearch, Weaviate). Higher RRF score → better fused rank. No need to normalize cosine [0,1] with BM25 unbounded scores.
Hybrid implementation sketch
- Retrieve top-Ndense (e.g. 50) from vector store.
- Retrieve top-Nlex (e.g. 50) from BM25 (OpenSearch, Postgres tsvector, or store-native).
- Compute RRF scores in application code or engine-native hybrid query.
- Take top-K fused (e.g. 20) → optional cross-encoder rerank → 5 for LLM.
from collections import defaultdict
K_RRF = 60
def rrf_fuse(*ranked_lists: list[str], k: int = K_RRF) -> list[tuple[str, float]]:
scores: dict[str, float] = defaultdict(float)
for lst in ranked_lists:
for rank, doc_id in enumerate(lst, start=1):
scores[doc_id] += 1.0 / (k + rank)
return sorted(scores.items(), key=lambda x: -x[1])
dense_ids = ["c1", "c4", "c2", "c9"] # from vector ANN
bm25_ids = ["c4", "c1", "c7", "c2"] # from OpenSearch
fused = rrf_fuse(dense_ids, bm25_ids)
top_ids = [doc_id for doc_id, _ in fused[:20]]
public static final int K_RRF = 60;
public List<String> rrfFuse(List<List<String>> rankedLists, int topK) {
Map<String, Double> scores = new HashMap<>();
for (List<String> list : rankedLists) {
for (int rank = 0; rank < list.size(); rank++) {
String id = list.get(rank);
scores.merge(id, 1.0 / (K_RRF + rank + 1), Double::sum);
}
}
return scores.entrySet().stream()
.sorted(Map.Entry.<String, Double>comparingByValue().reversed())
.limit(topK)
.map(Map.Entry::getKey)
.toList();
}
When hybrid wins (and when it does not)
- Win: technical docs, support KBs, legal corpora with defined terms + natural language questions.
- Win: eval shows dense-only misses exact-match queries; BM25-only misses paraphrase.
- Skip (initially): tiny homogeneous corpus (<5K chunks) where dense alone hits 95% recall@5.
- Skip: pure conversational memory where lexical overlap is irrelevant—dense + recency may suffice.
Native hybrid (Weaviate, OpenSearch hybrid query, Qdrant sparse+dense) reduces round trips; app-level RRF stays portable across pgvector + external BM25.
Hybrid doubles retrieval latency and index surface (vector + inverted index). Start dense-only with metadata filters; add BM25 when labeled eval proves keyword gaps—not because “best practices” say so.
On Kubernetes, running separate OpenSearch and Qdrant clusters means two StatefulSets to patch. Weaviate or OpenSearch single-cluster hybrid reduces moving parts if your team already operates search infrastructure.
Multi-tenancy patterns
B2B RAG must guarantee tenant A never retrieves tenant B’s chunks. Two dominant patterns: namespace per tenant (physical isolation) vs shared index + metadata filter (logical isolation).
Namespace / collection per tenant
- Each tenant gets Pinecone namespace, Qdrant collection, or Chroma collection.
- Pros: strong isolation, independent index tuning, easy delete-tenant (drop namespace), no cross-tenant ANN leakage.
- Cons: operational explosion at 10K+ tenants; cold namespaces; harder cross-tenant analytics; more index memory overhead.
Shared index + metadata filter
- Single collection; every row has tenant_id; every query includes mandatory filter.
- Pros: one index to manage, efficient for many small tenants, simpler backup.
- Cons: filter selectivity issues for huge tenants; blast radius if filter bug; noisy neighbor on shared RAM.
Hybrid tiering (common at scale)
| Tenant tier | Pattern | Rationale |
|---|---|---|
| Free / small (<10K chunks) | Shared index + tenant_id filter | Cost efficient |
| Enterprise (>500K chunks) | Dedicated namespace or collection | Recall + performance isolation |
| Regulated (HIPAA, SOC2) | Dedicated cluster or DB schema | Audit boundary, encryption per tenant |
Tenant lifecycle
- Provision — create namespace or verify filter path; seed empty index.
- Ingest — embed jobs tag every vector with tenant_id from auth context—not from client body.
- Query — middleware injects tenant from JWT; integration tests assert cross-tenant negative cases.
- Offboard — delete namespace or DELETE WHERE tenant_id = ?; verify object storage purge.
def search_kb(query: str, ctx: RequestContext) -> list[ChunkHit]:
tenant_id = ctx.jwt["org_id"] # never from request body
embedding = embed(query)
return vector_store.search(
vector=embedding,
filter={"tenant_id": tenant_id},
namespace=tenant_id if ctx.tier == "enterprise" else None,
top_k=20,
)
public List<ChunkHit> searchKb(String query, AuthContext ctx) {
String tenantId = ctx.orgId(); // from verified JWT
float[] embedding = embedder.embed(query);
return vectorStore.search(SearchRequest.builder()
.vector(embedding)
.filter(Map.of("tenant_id", tenantId))
.namespace(ctx.isEnterprise() ? tenantId : null)
.topK(20)
.build());
}
Add integration tests: user of tenant A queries with forged tenant B filter in body → must ignore body, use JWT only. Log and alert on queries returning zero results when corpus should exist—may indicate filter misconfiguration rather than bad retrieval.
“Design multi-tenant RAG for 5,000 customers.” — Shared pgvector with tenant_id for long tail; dedicated Qdrant collection for top 20 accounts by chunk volume; central embedding gateway; per-tenant rate limits; offboarding runbook. Mention compliance tier with separate RDS instance.
pgvector: upsert and filtered search
pgvector keeps vectors beside relational data—ideal when you already run Postgres for users, documents, and ACLs. Below: schema, idempotent upsert on content hash, and filtered cosine search in Python and Java.
Schema
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
id BIGSERIAL PRIMARY KEY,
chunk_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
source TEXT NOT NULL,
document_type TEXT NOT NULL,
user_id TEXT,
content TEXT NOT NULL,
content_hash TEXT NOT NULL,
embedding vector(1536) NOT NULL,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (tenant_id, chunk_id)
);
CREATE INDEX idx_chunks_tenant_type ON document_chunks (tenant_id, document_type);
CREATE INDEX idx_chunks_updated ON document_chunks (updated_at DESC);
"""
/*
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE document_chunks (
id BIGSERIAL PRIMARY KEY,
chunk_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
source TEXT NOT NULL,
document_type TEXT NOT NULL,
user_id TEXT,
content TEXT NOT NULL,
content_hash TEXT NOT NULL,
embedding vector(1536) NOT NULL,
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
UNIQUE (tenant_id, chunk_id)
);
CREATE INDEX idx_chunks_tenant_type ON document_chunks (tenant_id, document_type);
*/
Upsert (idempotent on tenant + chunk_id)
import hashlib
import psycopg
from openai import OpenAI
client = OpenAI()
MODEL = "text-embedding-3-small"
def content_hash(text: str) -> str:
return hashlib.sha256(text.encode()).hexdigest()
def embed(text: str) -> list[float]:
return client.embeddings.create(model=MODEL, input=text).data[0].embedding
def upsert_chunk(conn, row: dict) -> None:
h = content_hash(row["content"])
if row.get("content_hash") == h:
return # skip unchanged — no API call
vec = embed(row["content"])
conn.execute(
"""
INSERT INTO document_chunks
(chunk_id, tenant_id, source, document_type, user_id,
content, content_hash, embedding, updated_at)
VALUES (%(chunk_id)s, %(tenant_id)s, %(source)s, %(document_type)s,
%(user_id)s, %(content)s, %(hash)s, %(emb)s, now())
ON CONFLICT (tenant_id, chunk_id) DO UPDATE SET
content = EXCLUDED.content,
content_hash = EXCLUDED.content_hash,
embedding = EXCLUDED.embedding,
source = EXCLUDED.source,
document_type = EXCLUDED.document_type,
updated_at = now()
WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash
""",
{**row, "hash": h, "emb": vec},
)
# Batch ingest: upsert many, then CREATE INDEX CONCURRENTLY once
@Repository
public class ChunkRepository {
private final JdbcTemplate jdbc;
private final EmbeddingModel embedder;
public void upsertChunk(ChunkRow row) {
String hash = sha256(row.content());
Optional<String> existing = jdbc.query(
"SELECT content_hash FROM document_chunks WHERE tenant_id=? AND chunk_id=?",
rs -> rs.next() ? Optional.of(rs.getString(1)) : Optional.empty(),
row.tenantId(), row.chunkId());
if (existing.isPresent() && existing.get().equals(hash)) return;
float[] vec = embedder.embed(row.content());
String vecLiteral = toPgVector(vec);
jdbc.update("""
INSERT INTO document_chunks
(chunk_id, tenant_id, source, document_type, user_id,
content, content_hash, embedding, updated_at)
VALUES (?, ?, ?, ?, ?, ?, ?, ?::vector, now())
ON CONFLICT (tenant_id, chunk_id) DO UPDATE SET
content = EXCLUDED.content,
content_hash = EXCLUDED.content_hash,
embedding = EXCLUDED.embedding,
updated_at = now()
WHERE document_chunks.content_hash IS DISTINCT FROM EXCLUDED.content_hash
""",
row.chunkId(), row.tenantId(), row.source(), row.documentType(),
row.userId(), row.content(), hash, vecLiteral);
}
}
Filtered similarity search
def search(
conn,
query: str,
tenant_id: str,
document_type: str | None = None,
since: str | None = None,
k: int = 10,
) -> list[dict]:
q_vec = embed(query)
sql = """
SELECT chunk_id, content, source,
1 - (embedding <=> %(q)s::vector) AS score
FROM document_chunks
WHERE tenant_id = %(tenant)s
"""
params: dict = {"q": q_vec, "tenant": tenant_id, "k": k}
if document_type:
sql += " AND document_type = %(dtype)s"
params["dtype"] = document_type
if since:
sql += " AND updated_at >= %(since)s::timestamptz"
params["since"] = since
sql += " ORDER BY embedding <=> %(q)s::vector LIMIT %(k)s"
return conn.execute(sql, params).fetchall()
public List<ChunkHit> search(
String query, String tenantId, String documentType, Instant since, int k) {
float[] qVec = embedder.embed(query);
String vec = toPgVector(qVec);
StringBuilder sql = new StringBuilder("""
SELECT chunk_id, content, source,
1 - (embedding <=> ?::vector) AS score
FROM document_chunks
WHERE tenant_id = ?
""");
List<Object> args = new ArrayList<>(List.of(vec, tenantId));
if (documentType != null) {
sql.append(" AND document_type = ?");
args.add(documentType);
}
if (since != null) {
sql.append(" AND updated_at >= ?");
args.add(Timestamp.from(since));
}
sql.append(" ORDER BY embedding <=> ?::vector LIMIT ?");
args.add(vec);
args.add(k);
return jdbc.query(sql.toString(), chunkMapper, args.toArray());
}
Use <=> (cosine distance) with vector_cosine_ops HNSW index— score as 1 - distance for UI display. Keep raw distance in logs for drift detection.
pgvector stores vectors inline in heap pages—updates rewrite rows when embedding changes. Batch upserts inside transactions; avoid per-row commits during million-chunk backfill. Connection pool size × query concurrency must fit within Postgres max_connections and RAM for HNSW graph.
What this looks like in production
A vector index is living infrastructure—it grows with every document upload, model upgrade, and tenant onboard. Plan index rebuilds, backups, and monitoring before the first customer connects their Google Drive.
Index rebuild strategy
- Model change — new collection/table suffix (kb_v4); embed full corpus; build HNSW; alias swap; retire old index after soak.
- Bulk re-ingest — disable HNSW during load; COPY or batch INSERT; VACUUM ANALYZE; CREATE INDEX CONCURRENTLY.
- Incremental — upsert changed chunks only (content_hash gate); periodic index maintenance (pgvector reindex if supported; Qdrant optimize).
- Blue/green — dual-write to staging index; validate recall@K; flip read traffic; delete green after 24h.
Backup and disaster recovery
- pgvector: standard Postgres PITR + nightly pg_dump; test restore quarterly.
- Managed (Pinecone/Qdrant): enable vendor snapshots; export vectors + metadata to S3 Parquet for vendor exit.
- Rebuild path: source docs in object storage + chunk manifest + embedding model version → full rebuild without vector backup (slow but sufficient).
- RPO/RTO: vectors are regenerable if corpus text exists—prioritize document backup over vector snapshot if storage cost bites.
Monitoring dashboard
| Metric | Alert if | Why it matters |
|---|---|---|
| vector_search_p99_ms | > SLO (e.g. 100ms) | User-facing RAG latency |
| ann_recall_at_10 (weekly eval) | Drop > 2 pts vs baseline | Silent index degradation |
| index_vector_count vs CMS doc count | Drift > 5% | Ingest pipeline stuck |
| embedding_model_version mix | >1 version in active index | Broken ranking after partial migration |
| filtered_search_empty_rate | Spike per tenant | Filter bug or ACL mismatch |
| hnsw_index_size_bytes | > 80% RAM | OOM risk on pgvector/Qdrant node |
Production readiness checklist
- Single embedding model per active index; migration runbook documented
- HNSW (or IVF) tuned with measured recall@K—not defaults only
- Mandatory tenant filter from auth context on every query
- Content-hash skip on upsert to avoid pointless re-embeds
- Index rebuild procedure tested in staging (timing documented)
- Vectors or rebuild inputs backed up; restore drill completed
- Hybrid / RRF evaluated against dense-only baseline in eval suite
- Runbook for “recall dropped after deploy”—check model version, index params, filter regression
Incident playbooks (short)
Symptom: all scores low, wrong answers. Check embedding model pin; verify query vectors normalized; compare one query brute-force vs ANN in SQL.
Symptom: p99 latency spike. Check HNSW RAM pressure, connection pool exhaustion, missing metadata index causing seq scan + sort.
Symptom: tenant sees empty results. Verify tenant_id filter, ingest lag, offboarded namespace—not retrieval quality.
Running REINDEX on production pgvector HNSW during peak traffic locks reads or doubles storage. Schedule concurrent rebuild on replica promote pattern, or build new index table and swap.
Monitor embedding API spend separately from vector storage—re-index cron without content_hash gates is a common surprise invoice. Right-size RDS: HNSW for 2M × 1536 vectors often needs 32GB+ RAM for comfortable p99.
Teams that treat recall@K as a release gate catch ANN regressions before customers do—weekly cron runs labeled questions against staging index on every deploy. Pair with chunking evals from the previous guide so you know whether failure is retrieval or bad splits.