Goodnight Wiki / Search and Indexing

Search and Indexing

The fundamental problem of search is that you have too much text and not enough time. Every approach to making search fast is, at bottom, a trick for avoiding the work of reading everything. The tricks are more interesting than they sound, because the right index structure changes not just how fast you search, but what kinds of questions you can ask.

Trigram Indexing

Russ Cox's explanation of how Google Code Search worked is the clearest account I've found of practical full-text indexing. The problem: you have a billion lines of source code and a user types a regular expression. How do you avoid scanning every file?

The answer is trigrams — every sequence of three consecutive characters in a document. The word "google" produces the trigrams "goo", "oog", "ogl", "gle". Index every document by its trigrams, and when a search query arrives, extract the trigrams that must appear in any match, look up which documents contain all of them, and only scan those files with the full regex engine. The trigram index doesn't answer the query — it narrows the candidates.1

The subtlety is in extracting trigrams from regular expressions. The regex (Google|Yahoo).*Search must match documents containing either {goo, oog, ogl, gle} or {yah, aho, hoo} AND {sea, ear, arc, rch}. Cox walks through the boolean algebra of combining these constraints. It's not magic — it's just boolean logic over set intersections — but it reduces a scan of millions of files to a scan of hundreds.

What I find most interesting about this approach is what it says about the relationship between indexing and query language. A trigram index is perfectly suited to regular expressions because regexes are already defined in terms of character sequences. You wouldn't build a trigram index for semantic search, and you wouldn't build a vector embedding for grep. The index and the query type co-evolve.

Performance Engineering at the Character Level

Andrew Galloway's ripgrep benchmarks are a case study in how much performance is hiding in the details that most programmers ignore. ripgrep is consistently faster than GNU grep, ag, git grep, ucg, pt, and sift — sometimes by 10x or more. The reasons are layered.2

First, ripgrep uses memory-mapped I/O and avoids copying data. But so do some competitors. The real wins come from smarter literal extraction: before running the full regex engine, ripgrep extracts literal strings that must appear in any match and searches for those using SIMD-optimized routines (the Teddy algorithm, a Horspool variant on packed byte vectors). Only lines that contain the literal get passed to the full NFA.

Then there's Unicode. ripgrep handles Unicode correctly by default — searching for \w matches all Unicode word characters, not just ASCII. Most competitors either don't handle Unicode or handle it by falling back to byte-at-a-time processing. ripgrep manages both correctness and speed by pre-compiling Unicode character classes into efficient finite automata.2

The deeper lesson is about performance methodology. Galloway's benchmarks don't just report numbers — they explain why each tool is slow in each case. ag chokes on large files because it tries to read the entire file into memory. GNU grep struggles with Unicode because it uses a locale-sensitive byte-at-a-time matcher. pt is slow because Go's regex engine doesn't do literal optimisation. Understanding why something is slow is harder and more useful than knowing that it is.

Integer Sequences as Fingerprints

The Online Encyclopedia of Integer Sequences takes a different approach to the search problem entirely. Neil Sloane's insight, developed over fifty years starting with a handwritten card file in the 1960s, is that integer sequences are natural fingerprints for mathematical objects. If you compute a sequence of numbers — from a combinatorial problem, a physical system, a recursive definition, anything — and look it up in the OEIS, you'll often discover that your sequence has been studied before under a completely different name.3

This is search as recognition. You don't describe what you're looking for in words; you provide a sample of output and the system tells you what it is. It's the Shazam model applied to mathematics. The OEIS contains over 350,000 sequences, and mathematicians routinely use it to identify unexpected connections between problems. You're studying a combinatorial structure, you compute the first ten terms, look them up, and discover the sequence was first studied in 1847 in connection with continued fractions.

The encyclopedia's origin story says something about why this kind of search matters. Sloane started collecting sequences in 1964 as a Cornell PhD student working on neural networks (then called "perceptrons"). He produced a seven-term sequence — 0, 1, 8, 78, 944, 13800, 237432 — describing the average distance from nodes to the root in trees of increasing size, but couldn't find the pattern. No reference guide for integer sequences existed, so he built one, first on punch cards, then as a 1973 handbook, and finally as a web database in 1996. The practical proof came when a colleague at Bell Labs brought him the cell-tower placement problem: how to optimally position base stations to maximise signal while minimising interference. Sloane and a student computed optimal arrangements for small numbers of towers, searched the encyclopedia, and matched their sequence to a completely different problem in number theory involving maps on tori. The two problems turned out to be equivalent — discovered only because they produced the same fingerprint.4

What makes this work as a search engine is that integer sequences compress a huge amount of structural information into a small, machine-comparable form. Two mathematical objects that look completely different — one defined by a recurrence, another by a generating function, a third by counting lattice paths — may produce the same sequence of integers, and that coincidence is almost always significant. As Nadia Heninger put it: "You can't type a mathematical object into Google. But you can evaluate your object based on a sequence of numbers." The OEIS is the closest thing mathematics has to a universal search engine — not searching text about math, but searching math itself.

Bayesian Spelling Correction

Peter Norvig's spelling corrector is the most elegant demonstration I know of how far you can get with simple probability. The entire program is 36 lines of Python. Given a misspelled word, it generates all words within edit distance 2 (deletions, transpositions, replacements, insertions), filters to those appearing in a dictionary, and returns the one with the highest frequency in a training corpus. That's it.5

The theoretical framework is Bayes' theorem: given a misspelled word w, find the correction c that maximises P(c) × P(w|c). P(c) is the language model — how common is this word? P(w|c) is the error model — how likely is this particular typo? Norvig simplifies by assuming all edits are equally likely and just using word frequency as the ranking signal. This gets 70% accuracy on a test set of real typos, and extending to edit distance 2 (which generates about 114,000 candidates per word) gets it to around 80%.

The spelling corrector works because English has a lot of redundancy — most plausible edit-distance-1 modifications of a correctly-spelled word aren't themselves words. It's a search problem where the index is just a frequency table and the query is "what real word is closest to this input?" Simple, but the same framework — generate candidates, score by probability, return the best — underpins everything from autocomplete to machine translation.

Query Planning as Search

SQLite's Next-Generation Query Planner tackles search at a different level: not searching through data but searching for the best way to search through data. When a SQL query involves multiple tables and indexes, the database has to choose a join order and an index for each table. The number of possible plans grows factorially with the number of tables.6

The old SQLite query planner used a greedy nearest-neighbor heuristic — at each step, pick the table that seems cheapest to access given what you've already decided. This is fast (polynomial time) but produces suboptimal plans. The new planner, introduced in SQLite 3.8.0, uses a shortest-path algorithm: it models the space of possible query plans as a graph, where nodes represent subsets of tables that have been joined, and edges represent adding the next table. Finding the optimal plan reduces to finding the shortest path through this graph.6

The key engineering insight is the N3 heuristic: for queries with more than about 10 tables (where the exponential blowup becomes real), the planner restricts its search to the N best partial plans at each step, where N is calibrated to the number of tables. This isn't optimal, but it's dramatically better than greedy while remaining fast enough for an embedded database. The name "N3" comes from keeping the top N×N×N plans at intermediate stages — a tuning parameter that trades memory for plan quality.

What I find striking is the connection to the trigram indexing approach: both are about narrowing candidates before doing expensive work. The query planner's "expensive work" is executing the query; the trigram index's "expensive work" is running the regex. In both cases, the art is in the pruning.

Footnotes

  1. Regular Expression Matching with a Trigram Index by Russ Cox — source

  2. ripgrep is faster than {grep, ag, git grep, ucg, pt, sift} by Andrew Galloway — source 2

  3. How to Build a Search Engine for Mathematics by OEIS — source

  4. How to Build a Search Engine for Mathematics by Siobhan Roberts — source

  5. How to Write a Spelling Corrector by Peter Norvig — source

  6. The Next-Generation Query Planner by SQLite — source 2

Open in stacked reader →