Embeddings & semantic search

What are embeddings?

An embedding model maps variable-length text (or other modalities) to a fixed-length array of floats—a dense vector. OpenAI text-embedding-3-small defaults to 1536 dimensions; each dimension captures some learned feature of meaning, syntax, or domain context.

Unlike sparse bag-of-words vectors (one dimension per vocabulary word, mostly zeros), dense embeddings pack semantic information into every coordinate. The phrase “refund policy for annual subscriptions” and “how do I cancel my yearly plan and get money back” share few keywords but should produce vectors with high cosine similarity—because the model was trained so that paraphrases and related concepts cluster together.

Meaning in vector space is geometric: direction often matters more than magnitude. “King − man + woman ≈ queen” is the classic word2vec intuition; modern sentence embeddings extend that to full passages. You never interpret individual dimensions by hand—the model learns a space where downstream similarity metrics (cosine, dot product) correlate with human judgment of relatedness.

How text becomes a vector

Tokenize — same subword tokenizers as LLMs (BPE, SentencePiece); long docs are truncated to model max length (often 512–8192 tokens).
Encode — transformer encoder (or dual-tower) runs forward pass; token hidden states are pooled (mean, CLS, or last-token) into one vector.
Normalize — many APIs L2-normalize output so cosine similarity equals dot product (see next section).
Store / compare — persist vector with document metadata; at query time embed the question and rank corpus by similarity.
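
The encode and normalize steps above can be sketched in NumPy. This is a minimal illustration that assumes mean pooling over non-padding tokens; `hidden_states` and `attention_mask` are made-up stand-ins for a real encoder's output, and `mean_pool_normalize` is a hypothetical helper name.

```python
import numpy as np

def mean_pool_normalize(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pool (tokens, dim) hidden states into one unit-length vector.

    attention_mask is 1 for real tokens and 0 for padding.
    """
    mask = attention_mask[:, None].astype(hidden_states.dtype)  # (tokens, 1)
    summed = (hidden_states * mask).sum(axis=0)                 # padding rows contribute zero
    count = max(float(mask.sum()), 1.0)                         # avoid divide-by-zero
    pooled = summed / count
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

# Toy stand-in for encoder output: 3 real tokens + 1 padding token, dim = 2.
hidden_states = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]])
attention_mask = np.array([1, 1, 1, 0])
sentence_vec = mean_pool_normalize(hidden_states, attention_mask)
```

The padding row is masked out, so it does not pull the pooled vector toward `[9, 9]`; the result has unit length and can be compared by dot product.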

Bi-encoder vs cross-encoder (preview)

A bi-encoder embeds query and document independently, so document vectors are computed once at index time and each query-document comparison is a single cheap vector operation. That makes it fast at scale, and it is what semantic search and RAG retrieval use. A cross-encoder concatenates query + document into one forward pass and outputs a relevance score directly. It is more accurate, but it needs one transformer forward pass per (query, document) pair, which is O(n) model calls per query over n documents, so it is used as a reranker on the top-K bi-encoder hits (for example, the top 20), not over the full corpus. Track 2 covers reranking in depth.

Pattern	Latency at 1M docs	Quality	Typical role
Bi-encoder (embedding model)	Milliseconds with ANN index	Good recall; may miss subtle matches	First-stage retrieval, semantic search, clustering
Cross-encoder (e.g. ms-marco-MiniLM-L-12)	Seconds unless batched on tiny K	Higher precision on pairs	Rerank top-K before LLM context
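
The two-stage pattern in the table can be sketched as follows. This is an illustration under stated assumptions, not a production reranker: `retrieve_then_rerank` is a hypothetical helper, and `fake_cross_encoder` is a lookup table standing in for a real cross-encoder model.

```python
import numpy as np
from typing import Callable, List, Tuple

def retrieve_then_rerank(
    query_vec: np.ndarray,
    doc_vecs: np.ndarray,
    pair_scorer: Callable[[int], float],
    first_stage_k: int = 20,
    final_k: int = 5,
) -> List[Tuple[int, float]]:
    """Stage 1: cheap bi-encoder dot product over all docs.
    Stage 2: expensive pair scorer over the shortlist only."""
    scores = doc_vecs @ query_vec                     # one dot product per doc
    shortlist = np.argsort(-scores)[:first_stage_k]   # top-K candidate ids
    reranked = [(int(i), float(pair_scorer(int(i)))) for i in shortlist]
    reranked.sort(key=lambda pair: pair[1], reverse=True)
    return reranked[:final_k]

# Toy data: 4 unit-length doc vectors and a query closest to doc 0.
docs = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])

# Hypothetical cross-encoder stand-in: it happens to prefer doc 1 over doc 0.
fake_cross_encoder = {0: 0.70, 1: 0.95, 2: 0.10, 3: 0.01}
hits = retrieve_then_rerank(query, docs, fake_cross_encoder.__getitem__,
                            first_stage_k=2, final_k=2)
```

The expensive scorer runs only on the two shortlisted documents, and it is allowed to reorder them: doc 1 ends up ahead of doc 0 even though doc 0 had the higher bi-encoder score.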

⚠️ Pitfall

Confusing token embeddings inside an LLM (per-token vectors during generation) with sentence/document embeddings from a dedicated embedding API. They live in different spaces—you cannot dot-product a chat model’s hidden state against OpenAI text-embedding-3-small vectors and expect meaningful scores.

🔬 Under the Hood

Training objective is usually contrastive learning: pull positive pairs (query, relevant doc) closer, push negatives apart—in-batch negatives scale training efficiently. That is why embedding quality on your domain depends on whether the model saw similar text during training; legal or medical jargon may need domain-specific models or evals.
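
A rough NumPy sketch of a contrastive loss with in-batch negatives (InfoNCE-style) is shown below. The function name and the `temperature` value are assumptions for illustration; `temperature` here is a loss-scaling hyperparameter used during training, and real training code would run in an autodiff framework rather than plain NumPy.

```python
import numpy as np

def in_batch_contrastive_loss(queries: np.ndarray, docs: np.ndarray,
                              temperature: float = 0.05) -> float:
    """Row i of `docs` is the positive for row i of `queries`;
    every other row in the batch serves as a negative."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                      # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives sit on the diagonal

aligned = np.array([[1.0, 0.0], [0.0, 1.0]])
loss_good = in_batch_contrastive_loss(aligned, aligned)        # positives match
loss_bad = in_batch_contrastive_loss(aligned, aligned[::-1])   # positives swapped
```

The loss is near zero when each query is closest to its own document and large when the pairing is wrong, which is the signal that pulls positives together and pushes negatives apart.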

Semantic similarity metrics

Once you have vectors, retrieval is linear algebra: compare query vector q to each document vector d with a similarity or distance function. Pick the metric your index expects—mixing them silently breaks ranking.

Cosine similarity

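Measures the angle between two vectors, ignoring magnitude. Range [−1, 1] for general vectors; it narrows to [0, 1] only when all components are non-negative. Text embeddings have signed components, so negative scores are possible even if they are rare in practice.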

cos(q, d) = (q · d) / (||q|| × ||d||)

When it matters: default for text embeddings; robust when document length varies (short titles vs long PDFs) because the score depends only on direction, so a vector with a larger norm gains no advantage.

Dot product (inner product)

q · d = Σ qᵢ dᵢ, with no division by norms. Faster on GPU/SIMD; many ANN libraries (FAISS inner-product, Pinecone dotproduct metric) use it when vectors are pre-normalized to unit length.

When it matters: if both vectors are L2-normalized, dot product ranks identically to cosine. OpenAI embedding responses are normalized—store them as-is and use dot product or cosine interchangeably for ranking.

L2 (Euclidean) distance

||q − d||₂ = √(Σ (qᵢ − dᵢ)²) — smaller is closer. For unit vectors, L2 distance monotonically relates to cosine: ||q−d||² = 2(1 − cos(q,d)).

When it matters: some indexes (FAISS L2, pgvector <-> operator) are built for Euclidean distance. Convert mentally: minimizing L2 on normalized vectors equals maximizing cosine.
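
The relationship between L2 distance and cosine on unit vectors can be checked numerically. The sketch below uses random vectors as stand-ins for embeddings; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v: np.ndarray) -> np.ndarray:
    """Scale vectors (or rows of a matrix) to unit L2 length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

q = unit(rng.normal(size=8))
docs = unit(rng.normal(size=(5, 8)))

cos = docs @ q                                   # cosine == dot product on unit vectors
sq_l2 = ((docs - q) ** 2).sum(axis=1)            # squared Euclidean distance

identity_holds = bool(np.allclose(sq_l2, 2 * (1 - cos)))
same_ranking = bool(np.array_equal(np.argsort(-cos), np.argsort(sq_l2)))
```

Both flags come out true: sorting by descending cosine and by ascending L2 distance yields the same document order once everything is unit length.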

Normalization

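L2 normalization scales a vector to unit length: v' = v / ||v||. Always normalize query and corpus the same way. A raw (unnormalized) corpus lets vector norm, not just direction, drive dot-product and L2 rankings; a raw query against a normalized corpus keeps the ranking order but changes score values, which breaks any absolute similarity thresholds. If your provider returns normalized vectors, normalizing again is harmless but redundant.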
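
To see why consistent normalization matters, the toy sketch below ranks two hand-picked documents by dot product with and without normalizing the corpus. The vectors are invented for illustration.

```python
import numpy as np

query = np.array([1.0, 0.0])
corpus = np.array([
    [0.9, 0.1],    # doc 0: points almost exactly along the query direction
    [5.0, 5.0],    # doc 1: 45 degrees off, but with a large norm
])

raw_scores = corpus @ query                       # dot product on raw vectors
unit_corpus = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)
cos_scores = unit_corpus @ unit_query             # dot product on unit vectors = cosine

raw_winner = int(np.argmax(raw_scores))
cos_winner = int(np.argmax(cos_scores))
```

On raw vectors the large-norm document wins; after normalization the better-aligned document wins. The same code path, fed inconsistently normalized data, returns a different top hit.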

Metric	Higher = more similar?	Requires normalized vectors?	Common stores
Cosine similarity	Yes	No (handles magnitude)	Weaviate, many custom stacks
Dot product	Yes	Yes, for cosine-equivalent ranking	Pinecone, FAISS IP
L2 distance	No (lower is better)	Often yes for text	pgvector, FAISS L2, Milvus

⚖️ Trade-off

Cosine is interpretable and length-invariant; dot product is faster when vectors are fixed-norm. Pick one metric at schema design time—migrating a million vectors because the index used the wrong operator is painful. pgvector lets you create indexes per operator; stick to one path per collection.

🎯 Interview Tip

“Why cosine instead of Euclidean for text?” — Length often reflects verbosity, not relevance; cosine compares direction (semantics). Follow-up: “When are they equivalent?” — After L2 normalization, ranking by dot product, cosine, or L2 distance aligns.

Embedding models

Model choice fixes vector dimension, cost, latency, and quality ceiling. You cannot mix vectors from different models in one index—re-embed the entire corpus when you switch.

Hosted API models

Model	Dims (default)	MTEB tier	Price (order of magnitude)	Notes
OpenAI text-embedding-3-small	1536 (matryoshka: 512–1536)	Strong / cost-efficient	~$0.02 / 1M tokens	Default for many RAG stacks; supports dimension reduction
OpenAI text-embedding-3-large	3072 (matryoshka: 256–3072)	Frontier API	~$0.13 / 1M tokens	When recall gaps show up in evals
Voyage voyage-3	1024	Top-tier retrieval	~$0.06 / 1M tokens	Strong on legal/finance/code benchmarks; voyage-3-lite cheaper
Cohere embed-english-v3.0	1024	Strong (English-only; embed-multilingual-v3.0 is the multilingual sibling)	~$0.10 / 1M tokens	search_document vs search_query input types

Open-source / self-hosted

Model	Dims	Size	Best for
sentence-transformers/all-MiniLM-L6-v2	384	~80MB	Prototypes, edge, high QPS on CPU; weaker on hard retrieval
BAAI/bge-m3	1024	~2GB	Multilingual + hybrid dense/sparse; strong OSS baseline
nomic-embed-text-v1.5	768 (matryoshka)	~550MB	Long context (8192); local Ollama deployments

Dimension trade-offs

Higher dimensions can capture finer distinctions but cost more storage, memory, and index build time. OpenAI and Nomic support matryoshka embeddings: truncate trailing dimensions (then re-normalize) with modest quality loss. This is useful for storing 1536-dim vectors for quality and 512-dim vectors for a fast pre-filter.
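
A minimal sketch of client-side truncation follows, assuming the model was trained with a matryoshka objective (truncating an ordinary embedding this way generally degrades quality much more). `truncate_embedding` is a hypothetical helper, and the random vector stands in for an API response. The prefix of a unit vector is no longer unit length, so the helper rescales it.

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Keep the leading `dims` coordinates and rescale to unit length."""
    head = np.asarray(vec, dtype=np.float64)[:dims]
    norm = np.linalg.norm(head)
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return head / norm

rng = np.random.default_rng(1)
full = rng.normal(size=1536)
full /= np.linalg.norm(full)          # stand-in for a unit-length API embedding
short = truncate_embedding(full, 512)
```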

Dimension	Storage per 1M vectors	Recall (typical)	When to choose
384	~1.5 GB float32	Good for dedup / coarse search	Dev, CPU inference, huge corpora on budget
768 – 1024	~3 – 4 GB	Strong production default	Most RAG products; balances cost and MTEB
1536 – 3072	~6 – 12 GB	Best API retrieval	Hard domains, legal/medical, when evals justify cost
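
The storage column is plain arithmetic: vectors × dimensions × bytes per value. A small sketch, with `index_size_gb` as a hypothetical helper that counts raw vector bytes only (graph or cluster overhead from the index is extra):

```python
def index_size_gb(n_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal gigabytes (no index overhead)."""
    return n_vectors * dims * bytes_per_value / 1e9

for dims in (384, 768, 1024, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_gb(1_000_000, dims):.2f} GB per 1M float32 vectors")
```

Passing `bytes_per_value=2` models float16 storage, which halves every figure.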

💰 Cost

Embedding cost is usually smaller than LLM generation—but re-indexing 10M chunks at $0.02/1M tokens adds up if each chunk is 500 tokens (~$100 per full pass). Batch embed once; cache aggressively; avoid re-embedding unchanged paragraphs on every deploy.

💡 Pro Tip

Cohere’s separate search_query and search_document types asymmetrically tune query vs passage—mirror that pattern in evals. If you use symmetric bi-encoders, prepend instructions (“Represent this sentence for retrieval:”) consistently at index and query time.

Use cases

Embeddings are a general-purpose representation layer. Once documents and queries live in the same vector space, the same infrastructure serves search, ML features, and data quality pipelines.

Semantic search

Users type natural language; you embed the query and return the nearest document chunks. Production example: Notion’s Q&A embeds workspace pages and retrieves top-K chunks before the LLM synthesizes an answer. Keyword search alone fails to match “PTO rules” with “how many vacation days do I get.”

Clustering

Group support tickets, log lines, or survey responses by embedding + k-means / HDBSCAN. Production example: Zendesk-style auto-topic discovery—surface emerging outage themes before tags exist.

Classification

Embed labeled examples; classify new text by nearest centroid or a linear probe on frozen embeddings. Cheaper than fine-tuning an LLM for high-volume routing (intent detection, spam, PII sniffing). Production example: route “billing” vs “technical” tickets with 95% accuracy using 50 labels per class.
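
Nearest-centroid classification on frozen embeddings can be sketched in a few lines. The 2-D vectors below are toy stand-ins for real embeddings, and `fit_centroids` / `classify` are hypothetical helper names.

```python
import numpy as np

def unit_rows(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def fit_centroids(vectors: np.ndarray, labels: list[str]) -> dict[str, np.ndarray]:
    """Average the unit vectors of each class, then renormalize the mean."""
    vectors = unit_rows(vectors)
    centroids = {}
    for label in sorted(set(labels)):
        rows = vectors[[i for i, l in enumerate(labels) if l == label]]
        mean = rows.mean(axis=0)
        centroids[label] = mean / np.linalg.norm(mean)
    return centroids

def classify(vec: np.ndarray, centroids: dict[str, np.ndarray]) -> tuple[str, float]:
    """Return the label whose centroid has the highest cosine similarity."""
    v = vec / np.linalg.norm(vec)
    label = max(centroids, key=lambda name: float(centroids[name] @ v))
    return label, float(centroids[label] @ v)

# Toy 2-D "embeddings": billing points right, technical points up.
train = np.array([[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.8]])
labels = ["billing", "billing", "technical", "technical"]
centroids = fit_centroids(train, labels)
predicted, confidence = classify(np.array([0.7, 0.2]), centroids)
```

The returned similarity can double as a confidence score: below some threshold chosen on a validation set, hand the ticket to a human instead of auto-routing.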

Anomaly detection

Normal behavior forms a cluster; outliers are far from neighbors in embedding space. Production example: flag unusual admin CLI commands or novel fraud narratives that keyword rules have never seen.
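
One simple way to turn "far from neighbors" into a number is the mean cosine distance to the k nearest neighbors. The sketch below is a brute-force illustration on synthetic data; `knn_outlier_scores` is a hypothetical helper, and k = 3 is an arbitrary choice.

```python
import numpy as np

def knn_outlier_scores(vectors: np.ndarray, k: int = 3) -> np.ndarray:
    """Score each row by its mean cosine distance to its k nearest neighbors."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    dist = 1.0 - unit @ unit.T            # pairwise cosine distance
    np.fill_diagonal(dist, np.inf)        # a point is not its own neighbor
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(7)
normal = np.array([1.0, 0.0, 0.0]) + 0.05 * rng.normal(size=(20, 3))  # tight cluster
odd = np.array([[0.0, 0.0, 1.0]])                                      # points elsewhere
scores = knn_outlier_scores(np.vstack([normal, odd]))
most_anomalous = int(np.argmax(scores))
```

The last row, which points in a different direction from the cluster, receives the highest score. At production scale the pairwise matrix would be replaced by ANN neighbor lookups.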

Recommendation

“Similar articles” = nearest neighbors of the current page embedding; user history = average of clicked item vectors. Production example: Shopify product recommendations from catalog description embeddings, refreshed nightly.

Deduplication

Near-duplicate chunks (boilerplate, copy-pasted policies) waste index space and pollute RAG. Drop pairs with cosine > 0.95 before indexing. Production example: ingest pipeline for a legal RAG corpus removes 12% redundant paragraphs, improving answer precision.
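
A greedy threshold pass is one straightforward way to drop near-duplicates before indexing. The sketch below is brute force (quadratic in the worst case) and uses invented 3-D vectors; `dedupe` is a hypothetical helper.

```python
import numpy as np

def dedupe(vectors: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy pass: keep a row only if it is not too similar to any kept row."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(unit):
        if kept and float(np.max(unit[kept] @ vec)) > threshold:
            continue                      # near-duplicate of something already kept
        kept.append(i)
    return kept

chunks = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],   # near-duplicate of row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],    # near-duplicate of row 2
])
kept_rows = dedupe(chunks)
```

The first occurrence of each near-duplicate group survives, so input order decides which copy is kept; sort by recency or source priority beforehand if that matters.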

Use case	Typical K or threshold	Needs ANN?	Often paired with
Semantic search / RAG	top 10–50, then rerank to 5	Yes at >10K docs	Cross-encoder reranker, LLM
Clustering	cluster count or density threshold	No (sample or batch)	Human review UI
Classification	1-NN or softmax over centroids	No for small label sets	Confidence threshold → human
Deduplication	cosine > 0.92–0.98	Yes at scale	MinHash pre-filter optional

📦 Real World

Stripe’s support tools combine lexical and semantic retrieval—embeddings catch paraphrases, BM25 catches exact error codes. Neither alone is sufficient; hybrid search (Track 2) is the production default for technical docs.

⚠️ Pitfall

Using semantic search for exact-ID lookups (“invoice INV-8821”) or rare SKU strings. Embeddings smooth over tokens; pair vector search with keyword filters, metadata tags, or a structured DB query for identifiers.
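
One way to apply this is a small router in front of retrieval. The sketch below assumes identifiers look like a short uppercase prefix, a dash, and digits; both that pattern and the `route_query` helper are illustrative assumptions, so adapt the regex to your own ID formats.

```python
import re

# Assumed pattern: 2-5 uppercase letters, a dash, then 3+ digits (e.g. INV-8821).
ID_PATTERN = re.compile(r"\b[A-Z]{2,5}-\d{3,}\b")

def route_query(query: str) -> dict:
    """Send identifier lookups to exact match; everything else to vector search."""
    ids = ID_PATTERN.findall(query)
    if ids:
        return {"strategy": "exact", "identifiers": ids}
    return {"strategy": "semantic", "identifiers": []}

exact = route_query("where is invoice INV-8821?")
fuzzy = route_query("how do I get a refund on my yearly plan")
```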

Cosine similarity in code

Libraries hide the math—but you should recognize the operations for debugging score drift, unit tests, and on-call incidents when retrieval “suddenly returns garbage.”

Manual, NumPy, and scikit-learn (Python)

For unit tests, compare your service’s top-K IDs against a brute-force NumPy baseline on 1,000 vectors. Production search should use an ANN index, not a Python loop over millions of rows.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

query = np.array([0.12, -0.04, 0.33, 0.91])  # from embedding API
doc = np.array([0.10, -0.02, 0.30, 0.88])

# Manual (explicit norms)
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

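# Dot product on pre-normalized vectors (equals cosine)
q_norm = query / np.linalg.norm(query)
d_norm = doc / np.linalg.norm(doc)
normalized_dot = float(np.dot(q_norm, d_norm))
assert abs(cosine(query, doc) - normalized_dot) < 1e-9  # both paths agree

# sklearn (query vs matrix of docs)
rng = np.random.default_rng(0)  # seeded so the example is reproducible
corpus = np.vstack([doc, rng.standard_normal(4)])
sklearn_scores = cosine_similarity(query.reshape(1, -1), corpus)[0]

print("manual:", cosine(query, doc), "sklearn best index:", int(np.argmax(sklearn_scores)))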

// Plain Java — cosine between two float arrays
public static double cosine(float[] a, float[] b) {
  double dot = 0, na = 0, nb = 0;
  for (int i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Spring AI + pgvector: store/query in PostgreSQL
// CREATE EXTENSION vector;
// CREATE TABLE chunks (id bigserial, content text, embedding vector(1536));
// CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

@Autowired JdbcTemplate jdbc;

public List<ChunkHit> search(float[] queryEmbedding, int k) {
  String sql = """
    SELECT id, content, 1 - (embedding <=> ?::vector) AS score
    FROM chunks ORDER BY embedding <=> ?::vector LIMIT ?
    """;
  // pgvector's text format uses square brackets: [x, y, ...]
  String vec = Arrays.toString(queryEmbedding);
  return jdbc.query(sql, (rs, row) -> new ChunkHit(
      rs.getLong("id"), rs.getString("content"), rs.getDouble("score")),
      vec, vec, k);
}

Batch similarity against a corpus

Shape conventions: query (d,), corpus (n, d), scores (n,). Use matrix multiply corpus @ query when both are L2-normalized row vectors.

def top_k(query: np.ndarray, matrix: np.ndarray, k: int = 5):
    """matrix: (n_docs, dim), L2-normalized rows."""
    scores = matrix @ query  # dot = cosine if normalized
    k = min(k, len(scores))  # argpartition raises if kth is out of range
    idx = np.argpartition(-scores, k - 1)[:k]  # unordered top-k
    idx = idx[np.argsort(-scores[idx])]  # sort the k winners by score
    return idx, scores[idx]

public record ScoredId(long id, double score) {}

public List<ScoredId> topK(float[] query, List<float[]> docs, List<Long> ids, int k) {
  PriorityQueue<ScoredId> heap = new PriorityQueue<>(Comparator.comparingDouble(ScoredId::score));
  for (int i = 0; i < docs.size(); i++) {
    double s = cosine(query, docs.get(i));
    if (heap.size() < k) heap.offer(new ScoredId(ids.get(i), s));
    else if (s > heap.peek().score()) { heap.poll(); heap.offer(new ScoredId(ids.get(i), s)); }
  }
  return heap.stream().sorted(Comparator.comparingDouble(ScoredId::score).reversed()).toList();
}

💡 Pro Tip

Assert in tests that np.linalg.norm(v) ≈ 1.0 for API embeddings before indexing. A batch that forgot normalization is the #1 cause of “works in staging with 100 docs, fails in prod.”

Approximate nearest neighbor (ANN) indexes

Exact brute-force search over n vectors costs O(n × d) per query. At 100K+ documents that is milliseconds of CPU per request; at 10M+ it breaks SLOs. ANN indexes trade perfect recall for sub-linear search.

Why exact search fails at scale

Latency — 1M vectors × 1536 dims × float32 ≈ 6 GB; scanning all of it on every query violates p99 < 100ms targets.
Memory bandwidth — brute force is memory-bound; ANN structures reduce comparisons via graphs or clustering.
Concurrency — 500 QPS × full scans melts CPU; HNSW serves thousands of QPS on modest hardware with tuned ef_search.

HNSW (Hierarchical Navigable Small World)

Multi-layer graph: greedy search on upper layers, refine on lower layers. Default in Weaviate; widely used in pgvector and many Milvus configs. Pros: excellent recall/latency balance, no training phase. Cons: high memory (graph edges); slow inserts compared to IVF; rebuild painful on huge static corpora.

IVF (Inverted File Index)

k-means clusters vectors into nlist cells; at query time probe only nprobe nearest centroids. Pros: lower memory, faster bulk build. Cons: needs training on representative sample; recall drops if nprobe too low.
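
A toy IVF index makes the nlist / nprobe trade-off concrete. The sketch below is a teaching aid on random data, not how FAISS or any vector database implements it; `build_ivf` and `ivf_search` are hypothetical helpers, and the k-means step is a few plain Lloyd iterations.

```python
import numpy as np

def build_ivf(vectors: np.ndarray, nlist: int, iters: int = 10, seed: int = 0):
    """Cluster vectors into nlist cells with a few Lloyd (k-means) iterations."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=nlist, replace=False)].copy()
    for _ in range(iters):
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for c in range(nlist):
            members = vectors[assign == c]
            if len(members):                    # leave empty cells where they are
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    cells = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, cells

def ivf_search(query, vectors, centroids, cells, nprobe: int, k: int):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cell_order = np.argsort(((centroids - query) ** 2).sum(axis=1))[:nprobe]
    candidates = np.concatenate([cells[c] for c in cell_order])
    dists = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[np.argsort(dists)[:k]]

rng = np.random.default_rng(42)
data = rng.normal(size=(500, 16))
query = rng.normal(size=16)
centroids, cells = build_ivf(data, nlist=8)

exact = np.argsort(((data - query) ** 2).sum(axis=1))[:10]
full_probe = ivf_search(query, data, centroids, cells, nprobe=8, k=10)
fast_probe = ivf_search(query, data, centroids, cells, nprobe=2, k=10)
recall_fast = len(set(exact) & set(fast_probe)) / 10
```

Probing every cell reproduces the exact result; probing only two cells scans roughly a quarter of the data and may miss some true neighbors, which is what `recall_fast` measures.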

Index	Build time	Query latency	Memory	Recall @10 (typical)	Good when
Flat / exact	None	O(n) — slow	Minimal overhead	100%	<10K vectors, eval baselines
HNSW	Moderate–slow	Very fast	High (graph)	95–99%+ tuned	Production default, pgvector, Qdrant
IVF + PQ	Train + build	Fast	Low (compressed)	85–95% (tunable)	100M+ vectors, memory constrained
ScaNN / DiskANN	Varies	Fast at billion scale	Disk-backed options	High at scale	Managed vector DBs, extreme scale

Tuning trade-offs

ef_search (HNSW) — higher → better recall, slower queries; start at 64–128 for top-10 retrieval.
nprobe (IVF) — higher → better recall, more cells scanned.
Always measure recall@K on a labeled eval set—latency means nothing if the right chunk is never in the candidate set.
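
Recall@K itself is a few lines of set arithmetic. The helper below is a minimal sketch; the eval data is hypothetical, and `recall_at_k` is an illustrative name.

```python
def recall_at_k(retrieved: list[list[int]], relevant: list[set[int]], k: int) -> float:
    """Average fraction of each query's relevant ids found in its top-k results."""
    if len(retrieved) != len(relevant):
        raise ValueError("need one relevant set per query")
    per_query = []
    for hits, truth in zip(retrieved, relevant):
        if not truth:
            continue                               # skip unlabeled queries
        found = len(set(hits[:k]) & truth)
        per_query.append(found / len(truth))
    return sum(per_query) / len(per_query) if per_query else 0.0

# Hypothetical eval set: two labeled queries with ranked result ids.
retrieved = [[4, 9, 1, 7], [3, 2, 8, 5]]
relevant = [{1, 4}, {5, 6}]
r_at_2 = recall_at_k(retrieved, relevant, k=2)   # query 1 finds 1 of 2, query 2 finds 0 of 2
r_at_4 = recall_at_k(retrieved, relevant, k=4)   # query 1 finds 2 of 2, query 2 finds 1 of 2
```

Run it once against brute-force results and once against the ANN index with each candidate `ef_search` or `nprobe` value to see how much recall each latency setting costs.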

⚖️ Trade-off

Managed vector DBs (Pinecone, Weaviate Cloud) hide index tuning; self-hosted pgvector gives control but you own vacuum, index rebuilds after bulk load, and memory sizing. Choose managed when the team lacks infra time—not when you need exotic hybrid sparse+dense at lowest cost.

🎯 Interview Tip

“Design semantic search for 5M support articles.” — Walk through: chunking → embed batch job → vector store with HNSW → top-20 ANN → cross-encoder rerank to 5 → LLM. Mention metadata filters (product, locale) before ANN to shrink search space.

Embedding caching

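Same text → same vector (for a fixed model version). Embedding endpoints have no sampling step or temperature setting, so results are effectively deterministic apart from possible tiny floating-point differences between calls. Treat them like idempotent RPC results and cache aggressively.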

Cache key design

hash(model_id + model_version + normalized_text) — SHA-256 of UTF-8 text after NFC normalization and strip.
Include input type if provider uses asymmetric encoders (Cohere query vs document).
Store vectors as float16 or binary quantization in Redis only if recall tests pass—float32 JSON is simpler to debug.
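
The key recipe above might look like this in code. It is a sketch under assumptions: the `"v1"` version label, the separator byte, and the `embedding_cache_key` name are all illustrative choices rather than any provider's convention.

```python
import hashlib
import unicodedata

def embedding_cache_key(text: str, model_id: str, model_version: str,
                        input_type: str = "") -> str:
    """Build a deterministic key from model identity, input type, and normalized text."""
    normalized = unicodedata.normalize("NFC", text).strip()
    # The unit separator (0x1f) keeps field boundaries unambiguous.
    payload = "\x1f".join([model_id, model_version, input_type, normalized])
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"emb:{model_id}:{model_version}:{digest}"

# "é" as one code point vs "e" + combining accent: same key after NFC.
composed = embedding_cache_key("caf\u00e9 ", "text-embedding-3-small", "v1")
decomposed = embedding_cache_key("cafe\u0301", "text-embedding-3-small", "v1")
as_query = embedding_cache_key("caf\u00e9", "text-embedding-3-small", "v1",
                               input_type="search_query")
```

Visually identical inputs collapse to one cache entry, while a different input type (or model version) produces a different key, so asymmetric query and document vectors never collide.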

Where to cache

Layer	TTL	Wins
In-process LRU	Minutes – hours	Hot repeated queries (“password reset”)
Redis / Memcached	Hours – days	Cross-instance query cache; session-level repeats
Postgres / vector store	Permanent until re-index	Corpus chunks—source of truth, not ephemeral cache

TTL strategies

Query cache: TTL 24h–7d; invalidate on model version bump.
Corpus: content-hash keyed; when the document body is unchanged, skip re-embedding in the nightly job.
Negative cache: do not cache failed API responses for long; retry transient 429/503 with backoff.
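
The content-hash rule for the corpus can be sketched as a diff between stored hashes and current text. The chunk ids and the `chunks_to_embed` helper are hypothetical; a real job would read `stored` from the vector store's metadata.

```python
import hashlib

def sha(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def chunks_to_embed(chunks: dict[str, str], stored_hashes: dict[str, str]) -> list[str]:
    """Return ids of chunks that are new or whose text changed since the last run."""
    return [chunk_id for chunk_id, text in chunks.items()
            if stored_hashes.get(chunk_id) != sha(text)]

# Hypothetical state from the previous nightly job.
stored = {"doc1#0": sha("Refunds within 30 days."), "doc1#1": sha("Old shipping text.")}
current = {
    "doc1#0": "Refunds within 30 days.",      # unchanged, so skipped
    "doc1#1": "New shipping text.",           # edited, so re-embedded
    "doc2#0": "Brand new page.",              # new, so embedded
}
todo = chunks_to_embed(current, stored)
```

Only the edited and the new chunk are queued; the unchanged one costs nothing on this run.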

Cost savings math

Suppose 1M queries/month, avg 20 tokens/query, 40% cache hit rate, text-embedding-3-small at $0.02/1M tokens:

Uncached tokens: 1M × 20 = 20M tokens → ~$0.40/month embed cost
With 40% hits: 12M billed tokens → ~$0.24/month (40% savings on queries)
At 10M queries/month and 50-token queries: savings ≈ $4/month in API spend (40% of $10), but each hit also skips a 30–80ms API round trip, which matters more at scale

Corpus caching saves more: re-indexing 500K unchanged chunks costs ~$5 per full re-run; content-addressed storage avoids that entirely.
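
The same arithmetic as a reusable helper, so the numbers can be rerun with your own volumes. `embed_cost_usd` is a hypothetical name, and the 500-token chunk size in the last line is an assumption.

```python
def embed_cost_usd(items: int, tokens_per_item: int, usd_per_million_tokens: float,
                   cache_hit_rate: float = 0.0) -> float:
    """API cost for embedding `items` texts, billing only cache misses."""
    billed_tokens = items * tokens_per_item * (1.0 - cache_hit_rate)
    return billed_tokens / 1_000_000 * usd_per_million_tokens

PRICE = 0.02  # USD per 1M tokens, the text-embedding-3-small figure used above

uncached = embed_cost_usd(1_000_000, 20, PRICE)                      # 20M tokens
cached = embed_cost_usd(1_000_000, 20, PRICE, cache_hit_rate=0.40)   # 12M tokens
big_savings = (embed_cost_usd(10_000_000, 50, PRICE)
               - embed_cost_usd(10_000_000, 50, PRICE, cache_hit_rate=0.40))
reindex = embed_cost_usd(500_000, 500, PRICE)   # assumes 500-token chunks
```

With these inputs `uncached` comes to about $0.40 and `cached` to about $0.24 per month.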

import hashlib, json, redis, unicodedata
from openai import OpenAI

r = redis.Redis.from_url("redis://localhost:6379/0")
client = OpenAI()
MODEL = "text-embedding-3-small"

def cache_key(text: str) -> str:
    # NFC + strip so visually identical inputs share one cache entry
    normalized = unicodedata.normalize("NFC", text).strip()
    digest = hashlib.sha256(f"{MODEL}:{normalized}".encode("utf-8")).hexdigest()
    return f"emb:{MODEL}:{digest}"

def embed(text: str) -> list[float]:
    key = cache_key(text)
    if cached := r.get(key):
        return json.loads(cached)
    resp = client.embeddings.create(model=MODEL, input=text)
    vec = resp.data[0].embedding
    r.setex(key, 86400 * 7, json.dumps(vec))  # 7d TTL
    return vec

@Service
public class EmbeddingCache {
  private final StringRedisTemplate redis;
  private final EmbeddingModel embedder;
  private static final String MODEL = "text-embedding-3-small";

  public float[] embed(String text) {
    String key = "emb:" + MODEL + ":" + sha256(MODEL + ":" + text.strip());
    String cached = redis.opsForValue().get(key);
    if (cached != null) return parseJson(cached);
    float[] vec = embedder.embed(text);
    redis.opsForValue().set(key, toJson(vec), Duration.ofDays(7));
    return vec;
  }
}

💰 Cost

Cache hit rate is a product metric—log embedding_cache_hit per request. A/B model upgrades should use a new key prefix (emb:v2:) so you never serve stale vectors from the wrong model.

Multimodal embeddings

Models like CLIP (Contrastive Language–Image Pre-training) map images and text into the same vector space, enabling “find products like this photo” or “search diagrams with a sentence.”

CLIP trains on image–caption pairs: matching pairs are pulled together, mismatches pushed apart. At inference, encode an image through the vision tower and text through the text tower; cosine similarity measures cross-modal relevance. OpenAI’s clip-vit-base-patch32 (an open-weights checkpoint you run locally, not an embeddings API model) and open models (OpenCLIP, SigLIP) follow the same two-tower pattern.

When to use multimodal vs text-only

Scenario	Text-only embedding	Multimodal (CLIP-class)
PDFs with extracted text	✓ Preferred—cheaper, better for long prose	Only if OCR fails or layout matters
E-commerce visual search	Weak—captions lose detail	✓ Image query → product catalog
Medical imaging + reports	Reports only	✓ Joint search when images lack text
Meme / screenshot moderation	Misses visual toxicity	✓ Text in image + pixels together

Text-only embeddings (OpenAI, Cohere, bge) win on pure RAG over documentation—lower cost, longer context, better lexical-semantic blend with BM25. Multimodal wins when the query or asset is inherently visual or when alt-text is missing/untrustworthy.

📦 Real World

Pinterest’s visual search pipeline embeds pins and camera-roll crops in a shared space; Google Lens combines OCR + visual embeddings. Most internal doc RAG stacks stay text-only and use vision models only on the OCR/extraction step—not for final retrieval vectors.

🔬 Under the Hood

CLIP vectors are not interchangeable with text-embedding-3-small—different dimension, training objective, and space. Hybrid products often maintain two indexes (text + image) and fuse scores at query time.

What this looks like in production

Semantic search in prod is an indexing pipeline plus a query path—not a notebook calling embeddings.create once. Plan for model upgrades, backfills, and measurable recall.

Indexing pipeline

Ingest — webhook, S3, CMS sync; dedupe by content hash.
Chunk — split on headings/tokens (400–800 tokens with overlap); attach metadata (source, ACL, updated_at).
Embed batch job — queue workers; batch 64–256 texts per API call; rate-limit aware; write to vector store + object storage backup.
Index build — HNSW after bulk load (pgvector: create index CONCURRENTLY); verify recall on golden queries.
Publish — blue/green index alias swap; never serve queries from a partially rebuilt index, and dual-write to the old and new indexes while the rebuild runs.
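
The chunk and batch steps above can be sketched with two small helpers. This is an illustration only: the token list is a stand-in for real tokenizer output, the default sizes are picked from the ranges quoted above, and `chunk_tokens` / `batched` are hypothetical names.

```python
from typing import Iterator, List, Sequence

def chunk_tokens(tokens: Sequence[str], size: int = 600, overlap: int = 100) -> List[List[str]]:
    """Slide a window of `size` tokens, stepping by size - overlap."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break                                   # last window reached the end
    return chunks

def batched(items: Sequence, batch_size: int = 128) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items (one per API call)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

tokens = [f"t{i}" for i in range(10)]               # stand-in for tokenizer output
chunks = chunk_tokens(tokens, size=4, overlap=1)
batches = list(batched(chunks, batch_size=2))
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.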

Query path

Request → authZ filter (tenant_id, ACL) → embed query (cached) → ANN top-K → optional rerank → return to RAG or UI. Log latency breakdown: embed_ms, ann_ms, rerank_ms.

Monitoring drift

Model version pin — log embedding_model on every vector; alert on mixed versions in one index.
Recall@K weekly — labeled question → expected chunk IDs; block deploy if drop > 2 pts.
Score distribution — sudden shift in mean top-1 cosine may indicate bad normalization or wrong model.
Coverage — % corpus embedded in last 24h; stale chunks after CMS edits.
Cost — embed API spend vs cache hit rate; anomalous spikes from runaway re-index cron.
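
Two of these checks (mixed model versions and a recall drop) are easy to automate as a deploy gate. The sketch below is one possible shape; `index_health` and its 2-point default threshold are illustrative assumptions based on the list above.

```python
def index_health(model_versions: list[str], recall_now: float, recall_baseline: float,
                 max_drop_points: float = 2.0) -> list[str]:
    """Return a list of human-readable problems; an empty list means healthy."""
    problems = []
    distinct = sorted(set(model_versions))
    if len(distinct) > 1:
        problems.append(f"mixed embedding models in one index: {distinct}")
    drop_points = (recall_baseline - recall_now) * 100
    if drop_points > max_drop_points:
        problems.append(f"recall@K dropped {drop_points:.1f} pts")
    return problems

healthy = index_health(["m-v1", "m-v1"], recall_now=0.91, recall_baseline=0.92)
broken = index_health(["m-v1", "m-v2"], recall_now=0.85, recall_baseline=0.92)
```

A CI step can fail the deploy whenever the returned list is non-empty and print the entries as the failure reason.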

Production readiness checklist

Single embedding model per index; documented migration runbook for model upgrades
Content-hash skip for unchanged chunks on re-index
Query + corpus embedding cache with model-scoped keys
ANN index tuned with measured recall@K, not default params only
Metadata filters applied before or during ANN (tenant isolation)
Brute-force eval job on sample set in CI
Backup vectors outside primary DB (S3 parquet) for disaster rebuild

⚠️ Pitfall

Re-embedding the corpus on every deploy “just in case.” Tie embed jobs to content changes; full re-embed only on model change with a versioned new collection and cutover—otherwise cost and index churn dominate.

💡 Pro Tip

Store raw chunk text + embedding model + vector version in the same row. When auditors ask “why did retrieval change?” you can replay exact inputs without guessing which model was live last Tuesday.