Embeddings and Vector Search

Before there were large language models, there were word vectors, and they were already doing something uncanny. Word2vec showed in 2013 that you could represent words as points in a high-dimensional space where arithmetic works: the vector for king - man + woman lands closest to queen. This isn't just a party trick. It's the discovery that meaning has geometry, and that geometry is learnable from co-occurrence statistics alone. Everything from CLIP's cross-modal search to RAG-powered chatbots is downstream of this insight.

The Unreasonable Effectiveness of Dot Products

The core idea behind all embedding methods is the same: represent things as vectors such that similar things have similar vectors. "Similar" is measured by dot product or cosine similarity — mathematically trivial operations that become powerful when the vectors encode the right structure.
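As a minimal sketch of how "similar" is measured, the snippet below computes cosine similarity with NumPy. The three vectors and their names are hypothetical hand-picked values for illustration, not output from any real model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product of the two vectors after scaling each to unit length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional "embeddings"; real models use hundreds of dimensions.
cat = np.array([0.9, 0.1, 0.0])
kitten = np.array([0.8, 0.2, 0.1])
invoice = np.array([0.0, 0.1, 0.9])

sim_close = cosine_similarity(cat, kitten)   # vectors point in nearly the same direction
sim_far = cosine_similarity(cat, invoice)    # vectors are nearly orthogonal

print(sim_close > sim_far)
```

If the vectors are already normalised to unit length, the plain dot product gives the same ranking as cosine similarity, which is why many systems normalise once at indexing time.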

Word2vec learned this structure from a simple training signal: predict a word from its context (or vice versa). The resulting vectors captured not just similarity but relational structure. The classic example is the king/queen analogy, but the funnier examples reveal more about what the model actually learned. Grace Avery's exploration of word2vec analogies turned up gems like "pig : oink :: raven : nevermore" and "fish + music = bass" — the model discovering puns through the geometry of co-occurrence.¹

The most revealing analogy might be "snow : yeti :: economics : homo economicus." Word2vec has never encountered a yeti or a homo economicus in real life. It has only seen how these words relate to other words. And yet the vector arithmetic captures the abstract relationship: the mythical/theoretical version of the domain. The geometry isn't encoding facts about the world; it's encoding the structure of how we talk about the world. That's a subtler and arguably more powerful thing.
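The vector arithmetic behind these analogies can be sketched with a toy vocabulary. The vectors below are hand-built so that one axis loosely stands for "royalty" and the other for "gender"; that layout, the word list, and the helper name `analogy` are assumptions for illustration and not real word2vec output.

```python
import numpy as np

# Hand-built toy vectors, NOT learned embeddings.
# Axis 0 ~ "royalty", axis 1 ~ "gender" (illustrative assumption).
vocab = {
    "king":    np.array([1.0, 1.0]),
    "queen":   np.array([1.0, -1.0]),
    "man":     np.array([0.0, 1.0]),
    "woman":   np.array([0.0, -1.0]),
    "peasant": np.array([-1.0, 0.0]),
}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def analogy(a: str, b: str, c: str) -> str:
    """Answer 'a is to b as c is to ?' by nearest neighbour to b - a + c."""
    target = vocab[b] - vocab[a] + vocab[c]
    # Exclude the three input words, as analogy evaluations usually do.
    candidates = [w for w in vocab if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vocab[w], target))

result = analogy("man", "king", "woman")
print(result)
```

With learned embeddings the target rarely coincides exactly with a word vector, so the answer is whichever remaining word lies closest to it.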

From Words to Everything: The Embedding Zoo

Word2vec operated at the word level. Modern embedding models operate on sentences, paragraphs, images, code — anything you can turn into a token sequence. The key architectural shift was from static embeddings (one vector per word, regardless of context) to contextual embeddings from transformers, where the same word gets different vectors depending on what surrounds it. "Bank" in "river bank" and "bank account" now live in different parts of the space.

The practical landscape of text embedding models is more competitive than you might expect. Nils Reimers's 2022 benchmarking showed that OpenAI's GPT-3 embedding models, the largest of which has 175 billion parameters, performed worse on sentence similarity tasks than open-source models with 22 million parameters.² The issue wasn't the model's capability but its training objective: GPT-3's embeddings were trained with a contrastive learning approach similar to DeCLUTR, using consecutive text passages as positive pairs. Smaller models trained on richer supervision signals (question-answer pairs, conversation pairs, title-body pairs) simply learned better semantic structure.

This is a recurring theme: bigger isn't always better for embeddings. The quality of the training pairs matters more than model size. A 768-dimensional vector from a well-trained 110M parameter model will cluster documents, retrieve passages, and detect paraphrases better than a 12,288-dimensional vector from a 175B parameter model trained on weaker signal. And the smaller model costs a fraction of a cent per million documents to run, versus $80,000 for the larger one.

CLIP: When Vision Meets Language

The most conceptually striking embedding model is CLIP — OpenAI's Contrastive Language-Image Pretraining. CLIP trains a text encoder and an image encoder simultaneously, with the objective that matching text-image pairs should have similar vectors and non-matching pairs should have dissimilar ones.³

The training is elegant in its simplicity. Take a batch of N (text, image) pairs, embed both sides, and compute the N × N matrix of similarities between every text and every image. The diagonal entries of that matrix (matching pairs) should be high; everything else should be low. No labels, no categories, no annotation beyond the natural pairing of images and their descriptions. The training data was 400 million image-text pairs scraped from the internet: captions, alt text, surrounding prose.
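The objective described above can be sketched as a symmetric cross-entropy over the similarity matrix. This is a NumPy illustration only, not CLIP's actual training code: the function names are hypothetical, the embeddings are random stand-ins for encoder outputs, and the temperature is held fixed at 0.07 as an assumption (in CLIP it is a learned parameter).

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_style_loss(text_emb: np.ndarray, image_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric contrastive loss: row i of text_emb is paired with row i of image_emb."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature            # (N, N) scaled cosine similarities
    idx = np.arange(logits.shape[0])            # the matching pairs sit on the diagonal
    loss_text_to_image = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_image_to_text = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float((loss_text_to_image + loss_image_to_text) / 2)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))                  # stand-in for text encoder outputs

aligned_loss = clip_style_loss(text, text)          # every pair matches perfectly
shuffled_loss = clip_style_loss(text, text[::-1])   # every pair is mismatched

print(aligned_loss < shuffled_loss)
```

Training pushes both encoders toward the aligned case: the loss falls as diagonal similarities rise relative to the rest of their row and column.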

What emerges is genuinely multimodal understanding. Search for "two dogs running across a frosty field" and you get images of exactly that. Search for "man's best friend" and you get dogs too — CLIP understands the idiom in the context of imagery. Search for "hot dogs" and you get a mix of the food and warm-looking dogs, because both senses of the phrase correspond to real images. The model has built a shared semantic space between modalities, something that the "World Scopes" framework from Bisk et al. identified as a necessary step beyond text-only models.

CLIP matters for LLM agent design because it demonstrates that you don't need to engineer cross-modal understanding: you can learn it from the correspondence between text and images that already exists on the internet. The same principle applies to code-text models, audio-text models, and eventually any modality pair with enough aligned data.

The Geometry of Meaning (and Its Limits)

When you reduce high-dimensional embeddings to 3D with UMAP or t-SNE for visualization, structure appears immediately: questions cluster away from answers, sports topics form distinct blobs, error modes (like users submitting empty forms) become visible as orphan clusters.⁴ The clustering isn't just cosmetic — it reflects the model's internal organization of concepts.

But visualizing embeddings also reveals their limitations. With diverse, unstructured data — random questions on random topics — you get a "structureless blob." Local structure only emerges when there's local structure to preserve. This is a useful warning: embeddings capture the structure that exists in their training data, nothing more. If the training signal doesn't distinguish between two concepts, neither will the embedding space.

The practical implication for vector search systems is that embedding quality depends entirely on the match between your task and the model's training distribution. A model trained on scientific papers will produce great embeddings for research retrieval and terrible ones for e-commerce product search. The "general-purpose embedding model" is a milder version of the "general-purpose AI" myth — useful for many things, optimal for nothing. This is why specialized embedding models, trained or fine-tuned on domain-specific data with task-specific supervision, consistently outperform generic ones.

Vector search at scale remains an engineering challenge. Storing 21 million Wikipedia passage embeddings at 384 dimensions takes about 16 GB at 16-bit precision (two bytes per value), which is manageable. At 12,288 dimensions (GPT-3 Davinci embeddings), the same corpus needs 516 GB. Any downstream operation, whether clustering, nearest-neighbour search, or deduplication, is correspondingly slower in higher dimensions. The practical consensus is that 384-1024 dimensions is the sweet spot: enough capacity for nuanced semantics, small enough for real systems.
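The storage figures can be checked with simple arithmetic. The sketch below assumes two bytes per value (float16), which is the precision that reproduces the numbers quoted above, and decimal gigabytes; the function name is hypothetical.

```python
def embedding_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 2) -> float:
    """Raw storage for a dense embedding matrix, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_vectors * dims * bytes_per_value / 1e9

small = embedding_storage_gb(21_000_000, 384)        # 384-dimensional model
large = embedding_storage_gb(21_000_000, 12_288)     # 12,288-dimensional model
small_fp32 = embedding_storage_gb(21_000_000, 384, bytes_per_value=4)

print(f"{small:.1f} GB, {large:.1f} GB, {small_fp32:.1f} GB")
```

Storing the same vectors as float32 doubles every figure, and none of these numbers include the overhead of an index built on top of the raw vectors.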

The raw computational cost of brute-force vector comparison is daunting: comparing a 1024-dimensional query vector against one million stored vectors requires about a billion multiplications per query. This is where approximate nearest neighbour (ANN) algorithms become essential. Many ANN approaches use vector quantisation: partition the embedding space into groups, define a representative "codeword" for each group, compare the query against the codewords first, and then scan only the vectors in the best-matching groups. It's the vector analogue of database indexing, trading a small amount of recall for orders-of-magnitude speedup.⁵
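The difference between exact and quantisation-based search can be sketched in a few lines. This is a toy illustration on synthetic data, not how any production library is implemented: the names `ann_top1`, `num_cells`, and `nprobe` are chosen for this example, and the codewords are simply a random sample of stored vectors, where real systems typically learn them (for example with k-means).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for stored embeddings, scaled to unit length.
vectors = rng.normal(size=(2000, 32))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = rng.normal(size=32)

def brute_force_top1(query: np.ndarray, vectors: np.ndarray) -> int:
    """Exact search: score every stored vector against the query."""
    return int(np.argmax(vectors @ query))

# Codewords: a random sample of the stored vectors (assumption for brevity).
num_cells = 20
codewords = vectors[rng.choice(len(vectors), size=num_cells, replace=False)]

# Assign every stored vector to the cell of its most similar codeword.
cell_of = np.argmax(vectors @ codewords.T, axis=1)

def ann_top1(query: np.ndarray, nprobe: int) -> int:
    """Approximate search: scan only the nprobe cells whose codewords score best."""
    best_cells = np.argsort(-(codewords @ query))[:nprobe]
    candidates = np.flatnonzero(np.isin(cell_of, best_cells))
    return int(candidates[np.argmax(vectors[candidates] @ query)])

exact = brute_force_top1(query, vectors)
approx = ann_top1(query, nprobe=3)            # scans a subset of the cells
full_probe = ann_top1(query, nprobe=num_cells)  # scans every cell

print(exact, approx, full_probe)
```

Probing every cell reproduces the exact answer at full cost, while a small `nprobe` scans far fewer vectors and may occasionally miss the true nearest neighbour. That is the recall-for-speed trade described above.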

Google's ScaNN algorithm, which powers image search, YouTube, and Google Play, introduced a clever twist called anisotropic vector quantisation. Standard quantisation minimises the average distance between vectors and their codewords, but this treats all errors equally. Anisotropic quantisation weights errors differently depending on direction: it tolerates more distortion orthogonal to the stored vector, which barely changes that vector's inner product with the queries most likely to rank it highly, and penalises distortion parallel to the stored vector, which shifts that score directly. The result is measurably better recall at the same latency, or the same recall at lower latency. It's the kind of optimisation that only matters at Google-scale query volumes, but at that scale it matters enormously.⁵
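A direction-weighted quantisation error can be sketched as follows. This is an illustration of the idea rather than ScaNN's implementation: it assumes the error is split into components parallel and orthogonal to the stored vector, as in the published formulation of anisotropic quantisation, and the weights 4.0 and 1.0 are arbitrary hypothetical values (the published method derives them from a score threshold).

```python
import numpy as np

def anisotropic_loss(x: np.ndarray, x_quantized: np.ndarray,
                     parallel_weight: float = 4.0,
                     orthogonal_weight: float = 1.0) -> float:
    """Weighted squared error, split into parts parallel and orthogonal to x."""
    residual = x - x_quantized
    direction = x / np.linalg.norm(x)
    parallel = np.dot(residual, direction) * direction   # error along x
    orthogonal = residual - parallel                      # error across x
    return float(parallel_weight * np.dot(parallel, parallel)
                 + orthogonal_weight * np.dot(orthogonal, orthogonal))

x = np.array([1.0, 0.0])
quantized_along = np.array([0.9, 0.0])    # error of length 0.1 parallel to x
quantized_across = np.array([1.0, 0.1])   # error of length 0.1 orthogonal to x

loss_along = anisotropic_loss(x, quantized_along)
loss_across = anisotropic_loss(x, quantized_across)

# For a query pointing the same way as x, only the parallel error changes the score.
query = x
score_error_along = abs(np.dot(query, x) - np.dot(query, quantized_along))
score_error_across = abs(np.dot(query, x) - np.dot(query, quantized_across))

print(loss_along > loss_across, score_error_along > score_error_across)
```

Both candidate codewords are the same Euclidean distance from `x`, so a standard quantiser would treat them as equally good; the weighted loss prefers the one that preserves the inner product for well-aligned queries.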

Why This Matters

Embeddings are the bridge between the symbolic world of text and the geometric world of machine learning. Every time you do semantic search, every time a RAG system retrieves context, every time a recommendation engine finds "similar items," embeddings are doing the work. Understanding their geometry — what they capture, what they miss, how to evaluate them — is among the most practically important things in applied AI. The king-queen analogy was the demo. The actual revolution is that everything is an embedding now, and the quality of your AI system depends more on the quality of your embeddings than on almost anything else.

¹ Word2vec: fish + music = bass, by Grace Avery
² OpenAI GPT-3 Text Embeddings - Really a new state-of-the-art?, by Nils Reimers
³ Multi-modal ML with OpenAI's CLIP, by Pinecone
⁴ Insights from visualizing GPT embeddings, by NNext
⁵ Find anything blazingly fast with Google's vector search technology, by Kaz Sato
