Embeddings & semantic search
An embedding is a fixed-size numeric vector that represents the meaning of text (or images, audio, code). Similar concepts land close together in vector space—so you can search by meaning, not just keyword overlap. This is the foundation of RAG, recommendation, deduplication, and most modern retrieval systems.
After reading, you should be able to: explain how text becomes a dense vector; choose cosine vs dot product vs L2 distance; pick an embedding model and dimension size; implement similarity in Python and Java; reason about ANN indexes at 100K+ scale; cache embeddings to cut cost; and sketch a production indexing pipeline with drift monitoring.
What are embeddings?
An embedding model maps variable-length text (or other modalities) to a fixed-length array of floats—a dense vector. OpenAI text-embedding-3-small defaults to 1536 dimensions; each dimension captures some learned feature of meaning, syntax, or domain context.
Unlike sparse bag-of-words vectors (one dimension per vocabulary word, mostly zeros), dense embeddings pack semantic information into every coordinate. The phrase “refund policy for annual subscriptions” and “how do I cancel my yearly plan and get money back” share few keywords but should produce vectors with high cosine similarity—because the model was trained so that paraphrases and related concepts cluster together.
Meaning in vector space is geometric: direction often matters more than magnitude. “King − man + woman ≈ queen” is the classic word2vec intuition; modern sentence embeddings extend that to full passages. You never interpret individual dimensions by hand—the model learns a space where downstream similarity metrics (cosine, dot product) correlate with human judgment of relatedness.
How text becomes a vector
- Tokenize — same subword tokenizers as LLMs (BPE, SentencePiece); long docs are truncated to model max length (often 512–8192 tokens).
- Encode — transformer encoder (or dual-tower) runs forward pass; token hidden states are pooled (mean, CLS, or last-token) into one vector.
- Normalize — many APIs L2-normalize output so cosine similarity equals dot product (see next section).
- Store / compare — persist vector with document metadata; at query time embed the question and rank corpus by similarity.
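The encode and normalize steps above can be sketched in NumPy. This is a minimal illustration, assuming mean pooling over non-padding tokens; `token_states` stands in for a transformer's hidden states, and the function name `mean_pool_and_normalize` is hypothetical rather than any library's API.

```python
import numpy as np

def mean_pool_and_normalize(token_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pool (seq_len, dim) token states into one unit-length (dim,) vector.

    token_states is a stand-in for encoder output; attention_mask is 1 for
    real tokens and 0 for padding.
    """
    mask = attention_mask.astype(token_states.dtype)[:, None]  # (seq_len, 1)
    summed = (token_states * mask).sum(axis=0)                  # padding rows contribute nothing
    count = max(float(mask.sum()), 1.0)                         # avoid divide-by-zero
    pooled = summed / count
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Toy input: 4 token positions, the last one is padding.
states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]])
mask = np.array([1, 1, 1, 0])
vec = mean_pool_and_normalize(states, mask)
```

Real models may use CLS or last-token pooling instead; the masking and normalization logic stays the same shape.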
Bi-encoder vs cross-encoder (preview)
A bi-encoder embeds query and document independently, so document vectors are computed once at index time; each comparison is then a single vector operation whose cost depends on the dimension, not on text length. Fast at scale: this is what semantic search and RAG retrieval use. A cross-encoder concatenates query + document into one forward pass and outputs a relevance score directly. It is more accurate, but scoring n documents takes n transformer forward passes per query, so it is used as a reranker on the top-20 bi-encoder hits, not over the full corpus. Track 2 covers reranking in depth.
| Pattern | Latency at 1M docs | Quality | Typical role |
|---|---|---|---|
| Bi-encoder (embedding model) | Milliseconds with ANN index | Good recall; may miss subtle matches | First-stage retrieval, semantic search, clustering |
| Cross-encoder (e.g. ms-marco-MiniLM-L-12-v2) | Seconds unless batched on tiny K | Higher precision on pairs | Rerank top-K before LLM context |
A common mistake is confusing token embeddings inside an LLM (per-token vectors during generation) with sentence/document embeddings from a dedicated embedding API. They live in different spaces: you cannot dot-product a chat model’s hidden state against OpenAI text-embedding-3-small vectors and expect meaningful scores.
Training objective is usually contrastive learning: pull positive pairs (query, relevant doc) closer, push negatives apart—in-batch negatives scale training efficiently. That is why embedding quality on your domain depends on whether the model saw similar text during training; legal or medical jargon may need domain-specific models or evals.
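One common form of this objective is the InfoNCE loss with in-batch negatives. The sketch below is a NumPy illustration under assumptions: the temperature value and the function name `info_nce_loss` are chosen for the example, and real training would use an autodiff framework rather than plain NumPy.

```python
import numpy as np

def info_nce_loss(queries: np.ndarray, docs: np.ndarray, temperature: float = 0.05) -> float:
    """Contrastive loss with in-batch negatives.

    queries[i] and docs[i] form a positive pair; every docs[j], j != i,
    acts as a negative for queries[i]. Both inputs are (batch, dim).
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                 # (batch, batch) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))       # positives sit on the diagonal

rng = np.random.default_rng(0)
docs = rng.normal(size=(8, 16))
# Queries that almost match their own document give a low loss ...
aligned = info_nce_loss(docs + 0.01 * rng.normal(size=docs.shape), docs)
# ... while queries paired with the wrong document give a high one.
shuffled = info_nce_loss(docs[::-1].copy(), docs)
```

A larger batch supplies more negatives per positive at no extra encoding cost, which is the efficiency the paragraph above refers to.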
Semantic similarity metrics
Once you have vectors, retrieval is linear algebra: compare query vector q to each document vector d with a similarity or distance function. Pick the metric your index expects—mixing them silently breaks ranking.
Cosine similarity
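Measures the angle between two vectors, ignoring magnitude. The range is [−1, 1] for general vectors and [0, 1] when every component is non-negative. Dense text embeddings have signed components, so negative scores are possible, but scores between natural-language texts are usually positive in practice.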
cos(q, d) = (q · d) / (||q|| × ||d||)
When it matters: default for text embeddings; robust when document length varies (short titles vs long PDFs) because dividing by the norms removes any effect of vector magnitude. Scores only become magnitude-sensitive if you switch to a raw dot product on unnormalized vectors.
Dot product (inner product)
q · d = Σ qᵢ dᵢ, with no division by norms. Faster on GPU/SIMD; many ANN libraries (FAISS inner-product, Pinecone dotproduct metric) use it when vectors are pre-normalized to unit length.
When it matters: if both vectors are L2-normalized, dot product ranks identically to cosine. OpenAI embedding responses are normalized—store them as-is and use dot product or cosine interchangeably for ranking.
L2 (Euclidean) distance
||q − d||₂ = √(Σ (qᵢ − dᵢ)²) — smaller is closer. For unit vectors, L2 distance monotonically relates to cosine: ||q−d||² = 2(1 − cos(q,d)).
When it matters: some indexes (FAISS L2, pgvector <-> operator) are built for Euclidean distance. Convert mentally: minimizing L2 on normalized vectors equals maximizing cosine.
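The relationship between the three metrics on unit vectors can be checked numerically. This is a small sanity-check sketch on synthetic random vectors (not real embeddings); variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
corpus = rng.normal(size=(200, 32))
query = rng.normal(size=32)

# L2-normalize everything to unit length.
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query /= np.linalg.norm(query)

cos = corpus @ query                           # cosine equals dot product on unit vectors
l2_sq = ((corpus - query) ** 2).sum(axis=1)    # squared Euclidean distance

# ||q - d||^2 = 2 (1 - cos(q, d)) for unit vectors
identity_holds = np.allclose(l2_sq, 2.0 * (1.0 - cos))
# Highest cosine and lowest L2 distance pick the same top-10 documents.
same_top10 = np.array_equal(np.argsort(-cos)[:10], np.argsort(l2_sq)[:10])
```

A check like this is cheap to keep in a test suite when migrating between index types that use different operators.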
Normalization
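L2 normalization scales a vector to unit length: v' = v / ||v||. Always normalize query and corpus the same way. If you normalize at index time but send a raw query vector, the ranking for that query survives, but absolute scores change scale, so similarity thresholds and cross-query comparisons break. The reverse case (raw corpus vectors, normalized query) changes dot-product and L2 rankings outright. If your provider returns normalized vectors, normalizing again is harmless but redundant.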
| Metric | Higher = more similar? | Requires normalized vectors? | Common stores |
|---|---|---|---|
| Cosine similarity | Yes | No (handles magnitude) | Weaviate, many custom stacks |
| Dot product | Yes | Yes, for cosine-equivalent ranking | Pinecone, FAISS IP |
| L2 distance | No (lower is better) | Often yes for text | pgvector, FAISS L2, Milvus |
Cosine is interpretable and length-invariant; dot product is faster when vectors are fixed-norm. Pick one metric at schema design time—migrating a million vectors because the index used the wrong operator is painful. pgvector lets you create indexes per operator; stick to one path per collection.
“Why cosine instead of Euclidean for text?” — Length often reflects verbosity, not relevance; cosine compares direction (semantics). Follow-up: “When are they equivalent?” — After L2 normalization, ranking by dot product, cosine, or L2 distance aligns.
Embedding models
Model choice fixes vector dimension, cost, latency, and quality ceiling. You cannot mix vectors from different models in one index—re-embed the entire corpus when you switch.
Hosted API models
| Model | Dims (default) | MTEB tier | Price (order of magnitude) | Notes |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 (matryoshka: 512–1536) | Strong / cost-efficient | ~$0.02 / 1M tokens | Default for many RAG stacks; supports dimension reduction |
| OpenAI text-embedding-3-large | 3072 (matryoshka: 256–3072) | Frontier API | ~$0.13 / 1M tokens | When recall gaps show up in evals |
| Voyage voyage-3 | 1024 | Top-tier retrieval | ~$0.06 / 1M tokens | Strong on legal/finance/code benchmarks; voyage-3-lite cheaper |
| Cohere embed-english-v3.0 | 1024 | Strong (English); embed-multilingual-v3.0 for multilingual | ~$0.10 / 1M tokens | search_document vs search_query input types |
Open-source / self-hosted
| Model | Dims | Size | Best for |
|---|---|---|---|
| sentence-transformers/all-MiniLM-L6-v2 | 384 | ~80MB | Prototypes, edge, high QPS on CPU; weaker on hard retrieval |
| BAAI/bge-m3 | 1024 | ~2GB | Multilingual + hybrid dense/sparse; strong OSS baseline |
| nomic-embed-text-v1.5 | 768 (matryoshka) | ~550MB | Long context (8192); local Ollama deployments |
Dimension trade-offs
Higher dimensions can capture finer distinctions but cost more storage, memory, and index build time. OpenAI and Nomic support matryoshka embeddings: you can keep only the leading dimensions with modest quality loss, then re-normalize the shortened vector. This is useful for storing 1536-dim vectors for quality and 512-dim vectors for a fast pre-filter.
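Shortening a matryoshka embedding client-side might look like the sketch below. It assumes the model was trained for this (truncating an ordinary embedding this way usually hurts quality much more); `truncate_embedding` is a hypothetical helper name, and the random vector stands in for a real 1536-dim API response.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` coordinates and rescale to unit length."""
    if not 0 < dims <= vec.shape[-1]:
        raise ValueError("dims must be between 1 and the full dimension")
    head = vec[..., :dims]
    norm = np.linalg.norm(head, axis=-1, keepdims=True)
    return head / np.where(norm == 0, 1.0, norm)   # re-normalize after truncation

full = np.random.default_rng(1).normal(size=1536)  # stand-in for an API embedding
full /= np.linalg.norm(full)
short = truncate_embedding(full, 512)
```

The re-normalization matters: a truncated unit vector is no longer unit length, so dot-product scores would otherwise shrink. OpenAI's embeddings endpoint also accepts a `dimensions` parameter that returns the shortened vector directly.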
| Dimension | Storage per 1M vectors | Recall (typical) | When to choose |
|---|---|---|---|
| 384 | ~1.5 GB float32 | Good for dedup / coarse search | Dev, CPU inference, huge corpora on budget |
| 768 – 1024 | ~3 – 4 GB | Strong production default | Most RAG products; balances cost and MTEB |
| 1536 – 3072 | ~6 – 12 GB | Best API retrieval | Hard domains, legal/medical, when evals justify cost |
Embedding cost is usually smaller than LLM generation—but re-indexing 10M chunks at $0.02/1M tokens adds up if each chunk is 500 tokens (~$100 per full pass). Batch embed once; cache aggressively; avoid re-embedding unchanged paragraphs on every deploy.
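The storage and cost figures above follow from simple arithmetic. A small sketch, assuming float32 storage (4 bytes per value), GB as 10^9 bytes, and no index overhead; both function names are illustrative.

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return n_vectors * dims * bytes_per_value / 1e9

def embed_cost_usd(n_chunks: int, tokens_per_chunk: int, usd_per_million_tokens: float) -> float:
    """API cost of embedding every chunk once."""
    return n_chunks * tokens_per_chunk / 1e6 * usd_per_million_tokens

size_384 = index_size_gb(1_000_000, 384)             # 384-dim row of the table
size_1536 = index_size_gb(1_000_000, 1536)           # 1536-dim row of the table
full_pass = embed_cost_usd(10_000_000, 500, 0.02)    # 10M chunks of 500 tokens
```

Plugging in your own chunk counts before choosing a dimension makes the trade-off concrete; ANN graph structures add memory on top of the raw figure.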
Cohere’s separate search_query and search_document types asymmetrically tune query vs passage—mirror that pattern in evals. If you use symmetric bi-encoders, prepend instructions (“Represent this sentence for retrieval:”) consistently at index and query time.
Use cases
Embeddings are a general-purpose representation layer. Once documents and queries live in the same vector space, the same infrastructure serves search, ML features, and data quality pipelines.
Semantic search
Users type natural language; you embed the query and return the nearest document chunks. Production example: Notion’s Q&A embeds workspace pages and retrieves top-K chunks before the LLM synthesizes an answer— keyword search alone misses “PTO rules” vs “how many vacation days do I get.”
Clustering
Group support tickets, log lines, or survey responses by embedding + k-means / HDBSCAN. Production example: Zendesk-style auto-topic discovery—surface emerging outage themes before tags exist.
Classification
Embed labeled examples; classify new text by nearest centroid or a linear probe on frozen embeddings. Cheaper than fine-tuning an LLM for high-volume routing (intent detection, spam, PII sniffing). Production example: route “billing” vs “technical” tickets with 95% accuracy using 50 labels per class.
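Nearest-centroid classification on frozen embeddings can be sketched as follows. The 3-dim vectors are toy stand-ins for real model output, and `fit_centroids` / `classify` are hypothetical helper names.

```python
import numpy as np

def fit_centroids(vectors: np.ndarray, labels: list[str]) -> dict[str, np.ndarray]:
    """Average the unit-normalized vectors of each class into one centroid."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    label_arr = np.array(labels)
    centroids = {}
    for name in sorted(set(labels)):
        mean = unit[label_arr == name].mean(axis=0)
        centroids[name] = mean / np.linalg.norm(mean)
    return centroids

def classify(vec: np.ndarray, centroids: dict[str, np.ndarray]) -> tuple[str, float]:
    """Return the label whose centroid has the highest cosine similarity."""
    unit = vec / np.linalg.norm(vec)
    scores = {name: float(c @ unit) for name, c in centroids.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy 3-dim "embeddings" standing in for real model output.
train = np.array([[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 0.1, 1.0], [0.1, 0.0, 0.9]])
labels = ["billing", "billing", "technical", "technical"]
centroids = fit_centroids(train, labels)
label, score = classify(np.array([0.8, 0.3, 0.1]), centroids)
```

The returned score doubles as a confidence signal: low-scoring inputs can be routed to a human instead of being forced into a class.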
Anomaly detection
Normal behavior forms a cluster; outliers are far from neighbors in embedding space. Production example: flag unusual admin CLI commands or novel fraud narratives that keyword rules have never seen.
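One simple way to score "far from neighbors" is the mean cosine distance to the k nearest neighbours. The sketch below uses brute force on synthetic vectors; the scoring rule and the name `knn_outlier_scores` are assumptions for illustration, not a method the text prescribes.

```python
import numpy as np

def knn_outlier_scores(vectors: np.ndarray, k: int = 5) -> np.ndarray:
    """Mean cosine distance from each vector to its k nearest neighbours.

    Higher score = more isolated. Brute force, O(n^2): fine for samples,
    not for a full production corpus.
    """
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(7)
normal = rng.normal(loc=[5.0, 5.0, 0.0], scale=0.2, size=(50, 3))  # one tight cluster
odd = np.array([[0.0, 0.0, 5.0]])               # points in a very different direction
scores = knn_outlier_scores(np.vstack([normal, odd]))
most_anomalous = int(np.argmax(scores))
```

In practice the alert threshold would come from the score distribution on known-good traffic rather than a fixed constant.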
Recommendation
“Similar articles” = nearest neighbors of the current page embedding; user history = average of clicked item vectors. Production example: Shopify product recommendations from catalog description embeddings, refreshed nightly.
Deduplication
Near-duplicate chunks (boilerplate, copy-pasted policies) waste index space and pollute RAG. Drop pairs with cosine > 0.95 before indexing. Production example: ingest pipeline for a legal RAG corpus removes 12% redundant paragraphs, improving answer precision.
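A greedy version of threshold-based deduplication is sketched below. It keeps the first chunk of each near-duplicate group and drops later ones; the greedy order and the name `drop_near_duplicates` are assumptions, and the loop is quadratic, so at scale it would sit behind an ANN lookup or a MinHash pre-filter.

```python
import numpy as np

def drop_near_duplicates(vectors: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy dedup: keep a row only if its cosine to every kept row is <= threshold."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    kept: list[int] = []
    for i, v in enumerate(unit):
        if not kept or float(np.max(unit[kept] @ v)) <= threshold:
            kept.append(i)
    return kept

chunks = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],   # near-duplicate of row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],    # near-duplicate of row 2
])
kept_rows = drop_near_duplicates(chunks)
```

Because the result depends on iteration order, sorting chunks by a stable key (for example source priority) before deduplicating keeps runs reproducible.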
| Use case | Typical K or threshold | Needs ANN? | Often paired with |
|---|---|---|---|
| Semantic search / RAG | top 10–50, then rerank to 5 | Yes at >10K docs | Cross-encoder reranker, LLM |
| Clustering | cluster count or density threshold | No (sample or batch) | Human review UI |
| Classification | 1-NN or softmax over centroids | No for small label sets | Confidence threshold → human |
| Deduplication | cosine > 0.92–0.98 | Yes at scale | MinHash pre-filter optional |
Stripe’s support tools combine lexical and semantic retrieval—embeddings catch paraphrases, BM25 catches exact error codes. Neither alone is sufficient; hybrid search (Track 2) is the production default for technical docs.
A common mistake is using semantic search for exact-ID lookups (“invoice INV-8821”) or rare SKU strings. Embeddings smooth over tokens; pair vector search with keyword filters, metadata tags, or a structured DB query for identifiers.
Cosine similarity in code
Libraries hide the math—but you should recognize the operations for debugging score drift, unit tests, and on-call incidents when retrieval “suddenly returns garbage.”
Manual, NumPy, and scikit-learn (Python)
For unit tests, compare your service’s top-K IDs against a brute-force NumPy baseline on 1,000 vectors. Production search should use an ANN index, not a Python loop over millions of rows.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query = np.array([0.12, -0.04, 0.33, 0.91])  # from embedding API
doc = np.array([0.10, -0.02, 0.30, 0.88])

# Manual (explicit norms)
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

manual = cosine(query, doc)

# NumPy one-liner on normalized vectors: dot product equals cosine
q_norm = query / np.linalg.norm(query)
d_norm = doc / np.linalg.norm(doc)
normalized_dot = float(np.dot(q_norm, d_norm))

# sklearn (query vs matrix of docs); seeded so the random row is reproducible
rng = np.random.default_rng(0)
corpus = np.vstack([doc, rng.standard_normal(4)])
sklearn_scores = cosine_similarity(query.reshape(1, -1), corpus)[0]

best = int(np.argmax(sklearn_scores))
print("manual:", manual, "normalized dot:", normalized_dot)
print("sklearn best row:", best, "score:", sklearn_scores[best])
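// Plain Java: cosine between two float arrays
public static double cosine(float[] a, float[] b) {
    if (a.length != b.length) {
        throw new IllegalArgumentException("Vectors must have the same dimension");
    }
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
        dot += (double) a[i] * b[i];  // widen before multiplying to limit float rounding
        na += (double) a[i] * a[i];
        nb += (double) b[i] * b[i];
    }
    double denom = Math.sqrt(na) * Math.sqrt(nb);
    return denom == 0 ? 0.0 : dot / denom;  // zero vector: define similarity as 0
}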
// Spring AI + pgvector: store/query in PostgreSQL
// CREATE EXTENSION vector;
// CREATE TABLE chunks (id bigserial PRIMARY KEY, content text, embedding vector(1536));
// CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
public record ChunkHit(long id, String content, double score) {}

@Autowired JdbcTemplate jdbc;

public List<ChunkHit> search(float[] queryEmbedding, int k) {
    // <=> is cosine distance, so 1 - distance is cosine similarity
    String sql = """
        SELECT id, content, 1 - (embedding <=> ?::vector) AS score
        FROM chunks ORDER BY embedding <=> ?::vector LIMIT ?
        """;
    // pgvector's text format uses square brackets, e.g. [0.1, 0.2, 0.3],
    // which is exactly what Arrays.toString produces (curly braces are rejected)
    String vec = Arrays.toString(queryEmbedding);
    return jdbc.query(sql, (rs, row) -> new ChunkHit(
            rs.getLong("id"), rs.getString("content"), rs.getDouble("score")),
        vec, vec, k);
}
Batch similarity against a corpus
Shape conventions: query (d,), corpus (n, d), scores (n,). Use matrix multiply corpus @ query when both are L2-normalized row vectors.
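def top_k(query: np.ndarray, matrix: np.ndarray, k: int = 5):
    """matrix: (n_docs, dim), L2-normalized rows."""
    scores = matrix @ query  # dot = cosine if normalized
    k = min(k, len(scores))  # argpartition raises if k exceeds the corpus size
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k in O(n)
    idx = idx[np.argsort(-scores[idx])]  # order only those k by score
    return idx, scores[idx]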
public record ScoredId(long id, double score) {}

public List<ScoredId> topK(float[] query, List<float[]> docs, List<Long> ids, int k) {
    if (k <= 0) return List.of();  // heap.peek() would be null below
    // Min-heap of size k: the root is the weakest hit kept so far
    PriorityQueue<ScoredId> heap = new PriorityQueue<>(Comparator.comparingDouble(ScoredId::score));
    for (int i = 0; i < docs.size(); i++) {
        double s = cosine(query, docs.get(i));
        if (heap.size() < k) {
            heap.offer(new ScoredId(ids.get(i), s));
        } else if (s > heap.peek().score()) {
            heap.poll();
            heap.offer(new ScoredId(ids.get(i), s));
        }
    }
    return heap.stream().sorted(Comparator.comparingDouble(ScoredId::score).reversed()).toList();
}
Assert in tests that np.linalg.norm(v) ≈ 1.0 for API embeddings before indexing. A batch that skipped normalization is a common cause of “works in staging with 100 docs, fails in prod.”
Approximate nearest neighbor (ANN) indexes
Exact brute-force search over n vectors costs O(n × d) per query. At 100K+ documents that is milliseconds of CPU per request; at 10M+ it breaks SLOs. ANN indexes trade perfect recall for sub-linear search.
Why exact search fails at scale
- Latency — 1M vectors × 1536 dims × float32 ≈ 6 GB; scanning every query violates p99 < 100ms targets.
- Memory bandwidth — brute force is memory-bound; ANN structures reduce comparisons via graphs or clustering.
- Concurrency — 500 QPS × full scans melts CPU; HNSW serves thousands of QPS on modest hardware with tuned ef_search.
HNSW (Hierarchical Navigable Small World)
Multi-layer graph: greedy search on upper layers, refine on lower layers. Default in Weaviate and common in Milvus configs; in pgvector it is one of two index types (alongside IVFFlat) and the usual choice. Pros: excellent recall/latency balance, no training phase. Cons: high memory (graph edges); slow inserts compared to IVF; rebuild painful on huge static corpora.
IVF (Inverted File Index)
k-means clusters vectors into nlist cells; at query time probe only nprobe nearest centroids. Pros: lower memory, faster bulk build. Cons: needs training on representative sample; recall drops if nprobe too low.
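The IVF idea fits in a few lines of NumPy. This toy class is a sketch under assumptions (spherical k-means on unit vectors, random initialisation, no compression); `TinyIVF` is a made-up name, and production libraries such as FAISS add quantization and far better engineering.

```python
import numpy as np

class TinyIVF:
    """Toy inverted-file index over unit vectors: k-means cells + nprobe search."""

    def __init__(self, vectors: np.ndarray, nlist: int = 8, iters: int = 10, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.vectors = vectors
        # Initialise centroids from random data points, then run Lloyd iterations.
        self.centroids = vectors[rng.choice(len(vectors), nlist, replace=False)].copy()
        for _ in range(iters):
            assign = np.argmax(vectors @ self.centroids.T, axis=1)
            for c in range(nlist):
                members = vectors[assign == c]
                if len(members):
                    mean = members.mean(axis=0)
                    self.centroids[c] = mean / np.linalg.norm(mean)
        assign = np.argmax(vectors @ self.centroids.T, axis=1)
        self.cells = [np.flatnonzero(assign == c) for c in range(nlist)]

    def search(self, query: np.ndarray, k: int = 5, nprobe: int = 2) -> np.ndarray:
        """Score only the vectors in the nprobe cells closest to the query."""
        probe = np.argsort(-(self.centroids @ query))[:nprobe]
        candidates = np.concatenate([self.cells[c] for c in probe])
        scores = self.vectors[candidates] @ query
        return candidates[np.argsort(-scores)[:k]]

rng = np.random.default_rng(3)
data = rng.normal(size=(2000, 16))
data /= np.linalg.norm(data, axis=1, keepdims=True)
index = TinyIVF(data, nlist=8)
q = data[0]
approx = index.search(q, k=5, nprobe=2)   # scans roughly a quarter of the corpus
exact = index.search(q, k=5, nprobe=8)    # probing every cell is exhaustive
```

Comparing `approx` with `exact` for many queries shows directly how recall falls as nprobe shrinks.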
| Index | Build time | Query latency | Memory | Recall @10 (typical) | Good when |
|---|---|---|---|---|---|
| Flat / exact | None | O(n) — slow | Minimal overhead | 100% | <10K vectors, eval baselines |
| HNSW | Moderate–slow | Very fast | High (graph) | 95–99%+ tuned | Production default, pgvector, Qdrant |
| IVF + PQ | Train + build | Fast | Low (compressed) | 85–95% (tunable) | 100M+ vectors, memory constrained |
| ScaNN / DiskANN | Varies | Fast at billion scale | Disk-backed options | High at scale | Managed vector DBs, extreme scale |
Tuning trade-offs
- ef_search (HNSW) — higher → better recall, slower queries; start at 64–128 for top-10 retrieval.
- nprobe (IVF) — higher → better recall, more cells scanned.
- Always measure recall@K on a labeled eval set—latency means nothing if the right chunk is never in the candidate set.
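Measuring recall@K on a labeled set can be as small as the function below. It assumes one particular definition (mean over queries of the fraction of labeled chunk IDs found in the top K); other definitions exist, and the eval data shown is hypothetical.

```python
def recall_at_k(retrieved: list[list[int]], relevant: list[set[int]], k: int) -> float:
    """Mean over queries of |top-k ∩ relevant| / |relevant|."""
    per_query = []
    for got, want in zip(retrieved, relevant):
        if not want:
            continue                      # skip queries with no labelled answer
        per_query.append(len(want & set(got[:k])) / len(want))
    return sum(per_query) / len(per_query) if per_query else 0.0

# Hypothetical eval set: chunk IDs returned by the index vs. labelled IDs.
retrieved = [[4, 9, 1], [7, 2, 5], [3, 8, 6]]
relevant = [{9}, {5, 11}, {0}]
r_at_2 = recall_at_k(retrieved, relevant, k=2)
r_at_3 = recall_at_k(retrieved, relevant, k=3)
```

Running the same function against brute-force results and against the ANN index separates model misses from index misses.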
Managed vector DBs (Pinecone, Weaviate Cloud) hide index tuning; self-hosted pgvector gives control but you own vacuum, index rebuilds after bulk load, and memory sizing. Choose managed when the team lacks infra time—not when you need exotic hybrid sparse+dense at lowest cost.
“Design semantic search for 5M support articles.” — Walk through: chunking → embed batch job → vector store with HNSW → top-20 ANN → cross-encoder rerank to 5 → LLM. Mention metadata filters (product, locale) before ANN to shrink search space.
Embedding caching
Same text → same vector (for a fixed model version). Embedding models do no sampling, so there is no temperature to set: output is deterministic up to small floating-point differences. Treat embeddings like idempotent RPC results and cache aggressively.
Cache key design
- hash(model_id + model_version + normalized_text) — SHA-256 of UTF-8 text after NFC normalization and strip.
- Include input type if provider uses asymmetric encoders (Cohere query vs document).
- Store vectors as float16 or binary quantization in Redis only if recall tests pass—float32 JSON is simpler to debug.
Where to cache
| Layer | TTL | Wins |
|---|---|---|
| In-process LRU | Minutes – hours | Hot repeated queries (“password reset”) |
| Redis / Memcached | Hours – days | Cross-instance query cache; session-level repeats |
| Postgres / vector store | Permanent until re-index | Corpus chunks—source of truth, not ephemeral cache |
TTL strategies
Query cache: TTL 24h–7d; invalidate on model version bump. Corpus: content-hash keyed—when document body unchanged, skip re-embed on nightly job. Negative cache: do not cache failed API responses long; retry transient 429/503 with backoff.
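The content-hash check for the nightly corpus job might look like this. It is a sketch: `stored_hashes` stands in for whatever table or key-value store holds the hash from the previous run, and both function names are hypothetical.

```python
import hashlib
import unicodedata

def content_hash(text: str, model_id: str) -> str:
    """Stable key: model ID plus NFC-normalized, stripped text."""
    normalized = unicodedata.normalize("NFC", text).strip()
    return hashlib.sha256(f"{model_id}\n{normalized}".encode("utf-8")).hexdigest()

def chunks_to_embed(chunks: dict[str, str], stored_hashes: dict[str, str], model_id: str) -> list[str]:
    """Return IDs of chunks that are new or whose hash changed since the last run."""
    return [
        chunk_id
        for chunk_id, text in chunks.items()
        if stored_hashes.get(chunk_id) != content_hash(text, model_id)
    ]

MODEL_ID = "text-embedding-3-small"
previous = {"a": content_hash("Refund policy", MODEL_ID), "b": content_hash("Old text", MODEL_ID)}
current = {"a": "Refund policy  ", "b": "New text", "c": "Brand new chunk"}
todo = chunks_to_embed(current, previous, MODEL_ID)
```

Because the model ID is part of the hash, a model version bump invalidates every stored hash and triggers a full re-embed without extra logic.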
Cost savings math
Suppose 1M queries/month, avg 20 tokens/query, 40% cache hit rate, text-embedding-3-small at $0.02/1M tokens:
- Uncached tokens: 1M × 20 = 20M tokens → ~$0.40/month embed cost
- With 40% hits: 12M billed tokens → ~$0.24/month (40% savings on queries)
- At 10M queries/month and 50-token queries: 500M tokens → ~$10/month, so the same 40% hit rate saves ≈ $4/month. The bigger win is latency, which drops 30–80ms per hit and matters more at scale
Corpus caching saves more: re-indexing 500K unchanged chunks costs ~$5 per full re-run (at ~500 tokens per chunk); content-addressed storage avoids that entirely.
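The query-side arithmetic can be wrapped in a helper so you can plug in your own traffic. A minimal sketch, assuming a flat per-token price and a hit rate that applies uniformly across queries; `monthly_embed_cost` is an illustrative name.

```python
def monthly_embed_cost(queries: int, tokens_per_query: int, hit_rate: float,
                       usd_per_million_tokens: float = 0.02) -> tuple[float, float]:
    """Return (cost without cache, cost with cache) in USD per month."""
    uncached = queries * tokens_per_query / 1e6 * usd_per_million_tokens
    return uncached, uncached * (1.0 - hit_rate)

small = monthly_embed_cost(1_000_000, 20, 0.40)     # the 1M-query scenario
large = monthly_embed_cost(10_000_000, 50, 0.40)    # the 10M-query scenario
savings_large = large[0] - large[1]
```

The dollar amounts stay small at these volumes, which is why the latency saved per cache hit is usually the stronger argument.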
import hashlib
import json
import unicodedata

import redis
from openai import OpenAI

r = redis.Redis.from_url("redis://localhost:6379/0")
client = OpenAI()  # reads OPENAI_API_KEY from the environment
MODEL = "text-embedding-3-small"

def cache_key(text: str) -> str:
    # NFC-normalize and strip so visually identical inputs share one key
    normalized = unicodedata.normalize("NFC", text).strip()
    digest = hashlib.sha256(f"{MODEL}:{normalized}".encode("utf-8")).hexdigest()
    return f"emb:{MODEL}:{digest}"

def embed(text: str) -> list[float]:
    key = cache_key(text)
    if cached := r.get(key):
        return json.loads(cached)
    resp = client.embeddings.create(model=MODEL, input=text)
    vec = resp.data[0].embedding
    r.setex(key, 86400 * 7, json.dumps(vec))  # 7d TTL
    return vec
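@Service
public class EmbeddingCache {
    private static final String MODEL = "text-embedding-3-small";

    private final StringRedisTemplate redis;
    private final EmbeddingModel embedder;

    // Constructor injection: the final fields must be assigned
    public EmbeddingCache(StringRedisTemplate redis, EmbeddingModel embedder) {
        this.redis = redis;
        this.embedder = embedder;
    }

    public float[] embed(String text) {
        // sha256, toJson and parseJson are small helpers assumed to be defined elsewhere
        String key = "emb:" + MODEL + ":" + sha256(MODEL + ":" + text.strip());
        String cached = redis.opsForValue().get(key);
        if (cached != null) return parseJson(cached);
        float[] vec = embedder.embed(text);
        redis.opsForValue().set(key, toJson(vec), Duration.ofDays(7));  // 7d TTL
        return vec;
    }
}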
Cache hit rate is a product metric—log embedding_cache_hit per request. A/B model upgrades should use a new key prefix (emb:v2:) so you never serve stale vectors from the wrong model.
Multimodal embeddings
Models like CLIP (Contrastive Language–Image Pre-training) map images and text into the same vector space— enabling “find products like this photo” or “search diagrams with a sentence.”
CLIP trains on image–caption pairs: matching pairs are pulled together, mismatches pushed apart. At inference, encode an image through the vision tower and text through the text tower; cosine similarity measures cross-modal relevance. OpenAI’s clip-vit-base-patch32 (open weights, typically run locally) and open models (OpenCLIP, SigLIP) follow the same pattern.
When to use multimodal vs text-only
| Scenario | Text-only embedding | Multimodal (CLIP-class) |
|---|---|---|
| PDFs with extracted text | ✓ Preferred—cheaper, better for long prose | Only if OCR fails or layout matters |
| E-commerce visual search | Weak—captions lose detail | ✓ Image query → product catalog |
| Medical imaging + reports | Reports only | ✓ Joint search when images lack text |
| Meme / screenshot moderation | Misses visual toxicity | ✓ Text in image + pixels together |
Text-only embeddings (OpenAI, Cohere, bge) win on pure RAG over documentation—lower cost, longer context, better lexical-semantic blend with BM25. Multimodal wins when the query or asset is inherently visual or when alt-text is missing/untrustworthy.
Pinterest’s visual search pipeline embeds pins and camera-roll crops in a shared space; Google Lens combines OCR + visual embeddings. Most internal doc RAG stacks stay text-only and use vision models only on the OCR/extraction step—not for final retrieval vectors.
CLIP vectors are not interchangeable with text-embedding-3-small—different dimension, training objective, and space. Hybrid products often maintain two indexes (text + image) and fuse scores at query time.
What this looks like in production
Semantic search in prod is an indexing pipeline plus a query path—not a notebook calling embeddings.create once. Plan for model upgrades, backfills, and measurable recall.
Indexing pipeline
- Ingest — webhook, S3, CMS sync; dedupe by content hash.
- Chunk — split on headings/tokens (400–800 tokens with overlap); attach metadata (source, ACL, updated_at).
- Embed batch job — queue workers; batch 64–256 texts per API call; rate-limit aware; write to vector store + object storage backup.
- Index build — HNSW after bulk load (pgvector: create index CONCURRENTLY); verify recall on golden queries.
- Publish — blue/green index alias swap; never partial-query during rebuild without dual-write.
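The chunk step in the pipeline above can be sketched as a sliding window with overlap. This assumes the text is already tokenized; the example uses generated word tokens as a stand-in for a real tokenizer, ignores heading boundaries, and `chunk_tokens` is a hypothetical name.

```python
def chunk_tokens(tokens: list[str], size: int = 600, overlap: int = 100) -> list[list[str]]:
    """Split a token list into windows of `size` that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break                          # last window already reaches the end
    return chunks

# Generated word tokens stand in for real tokenizer output.
words = [f"w{i}" for i in range(1400)]
chunks = chunk_tokens(words, size=600, overlap=100)
```

Overlap trades index size for robustness: a sentence cut at one chunk boundary still appears whole in the neighbouring chunk.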
Query path
Request → authZ filter (tenant_id, ACL) → embed query (cached) → ANN top-K → optional rerank → return to RAG or UI. Log latency breakdown: embed_ms, ann_ms, rerank_ms.
Monitoring drift
- Model version pin — log embedding_model on every vector; alert on mixed versions in one index.
- Recall@K weekly — labeled question → expected chunk IDs; block deploy if drop > 2 pts.
- Score distribution — sudden shift in mean top-1 cosine may indicate bad normalization or wrong model.
- Coverage — % corpus embedded in last 24h; stale chunks after CMS edits.
- Cost — embed API spend vs cache hit rate; anomalous spikes from runaway re-index cron.
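The score-distribution check in the list above could start as simply as comparing the mean top-1 cosine against a baseline window. This is a sketch on synthetic data; the statistic, the threshold of 3 baseline standard deviations, and both function names are illustrative assumptions rather than recommended values.

```python
import numpy as np

def top1_score_shift(baseline: np.ndarray, current: np.ndarray) -> float:
    """Shift of the mean top-1 cosine, in units of the baseline standard deviation."""
    spread = float(np.std(baseline))
    if spread == 0.0:
        return 0.0 if np.mean(current) == np.mean(baseline) else float("inf")
    return float(abs(np.mean(current) - np.mean(baseline)) / spread)

def should_alert(baseline: np.ndarray, current: np.ndarray, max_shift: float = 3.0) -> bool:
    # max_shift is an illustrative threshold, not a recommended value.
    return top1_score_shift(baseline, current) > max_shift

rng = np.random.default_rng(11)
last_week = rng.normal(0.78, 0.03, size=500)   # synthetic top-1 scores
healthy = rng.normal(0.78, 0.03, size=500)
broken = rng.normal(0.35, 0.10, size=500)      # e.g. wrong-model or unnormalized vectors
```

Here `should_alert(last_week, broken)` fires while `should_alert(last_week, healthy)` does not; a real monitor would also track the weekly recall@K figure, since score shifts alone do not prove relevance changed.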
Production readiness checklist
- Single embedding model per index; documented migration runbook for model upgrades
- Content-hash skip for unchanged chunks on re-index
- Query + corpus embedding cache with model-scoped keys
- ANN index tuned with measured recall@K, not default params only
- Metadata filters applied before or during ANN (tenant isolation)
- Brute-force eval job on sample set in CI
- Backup vectors outside primary DB (S3 parquet) for disaster rebuild
A common mistake is re-embedding the corpus on every deploy “just in case.” Tie embed jobs to content changes; do a full re-embed only on a model change, into a new versioned collection with a cutover. Otherwise cost and index churn dominate.
Store raw chunk text + embedding model + vector version in the same row. When auditors ask “why did retrieval change?” you can replay exact inputs without guessing which model was live last Tuesday.