LLM Training Pipeline

Training a large language model is not a single process — it's a sequence of stages, each with different data, different objectives, and increasingly, different teams. The traditional split was simple: pre-training (learn language from a huge corpus) then fine-tuning (learn to follow instructions). But the pipeline has grown more elaborate, and the boundaries between stages are dissolving. The most honest description of the current state might be: pre-training → mid-training → supervised fine-tuning → alignment (RLHF/DPO) → deployment, where "mid-training" is the label the industry gave to everything that didn't fit neatly into the other buckets.

Pre-Training and the Chinchilla Revelation

The fundamental act of pre-training is next-token prediction on a massive text corpus. The model sees trillions of tokens and learns, through the transformer architecture, to predict what comes next. Everything else the model can do — reasoning, following instructions, writing code — emerges from this objective combined with scale.

The most important result about pre-training efficiency came from DeepMind's Chinchilla paper, which Nostalgebraist's analysis made vivid with a simple equation [1]. The loss of a language model decomposes into three additive terms: a correction for finite model size, a correction for finite data, and an irreducible minimum. For Gopher (280B parameters, 300B tokens), the finite-model correction was tiny (0.052) while the finite-data correction was massive (0.251). The model was already roughly as big as it needed to be. What it lacked was data.
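The decomposition above can be sketched numerically. The fitted constants below come from the Chinchilla paper's parametric loss, L(N, D) = A/N^α + B/D^β + E; they are not listed in this article, so treat them as an outside assumption. The function name `loss_terms` is purely illustrative.

```python
# Fitted constants from the Chinchilla paper's parametric loss.
# Assumption: this article does not state them; they are quoted from the paper.
A, ALPHA = 406.4, 0.34   # finite-model term: A / N**ALPHA
B, BETA = 410.7, 0.28    # finite-data term:  B / D**BETA
E = 1.69                 # irreducible loss


def loss_terms(n_params: float, n_tokens: float) -> dict:
    """Split the predicted loss into its three additive terms."""
    finite_model = A / n_params**ALPHA
    finite_data = B / n_tokens**BETA
    return {
        "finite_model": finite_model,
        "finite_data": finite_data,
        "irreducible": E,
        "total": finite_model + finite_data + E,
    }


# Gopher: 280B parameters, 300B tokens.
gopher = loss_terms(n_params=280e9, n_tokens=300e9)
for name, value in gopher.items():
    print(f"{name:>12}: {value:.3f}")
```

Under these constants the finite-model and finite-data terms for Gopher round to the 0.052 and 0.251 quoted above.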

The implications were stark. The GPT-3-era lineage (GPT-3, LaMDA, Gopher, Jurassic, MT-NLG) all trained on roughly 300B tokens, following the convention GPT-3 had set. Under Chinchilla's fitted scaling law, no model trained on that much data, no matter how big, could beat Chinchilla itself: a 70B model trained on 1.4T tokens for roughly the same compute budget as Gopher. Years of effort went into scaling model size when a roughly 4.7x increase in data would have done more.
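A rough check of that claim, as a sketch: letting the parameter count go to infinity removes the finite-model term, leaving a loss floor set by the data alone. The constants are the Chinchilla paper's fitted values (an assumption, not given in this article), and the 6·N·D compute estimate is a common rule of thumb rather than anything the source states.

```python
# Chinchilla paper's fitted constants (assumed; not listed in this article).
A, ALPHA, B, BETA, E = 406.4, 0.34, 410.7, 0.28, 1.69


def predicted_loss(n_params: float, n_tokens: float) -> float:
    """Parametric loss: finite-model term + finite-data term + irreducible loss."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def approx_flops(n_params: float, n_tokens: float) -> float:
    """Rule-of-thumb training compute, C ~ 6 * N * D (an assumption)."""
    return 6.0 * n_params * n_tokens


# Best case for any model trained on 300B tokens: infinitely many parameters.
floor_300b = predicted_loss(float("inf"), 300e9)
chinchilla = predicted_loss(70e9, 1.4e12)

print(f"loss floor at 300B tokens: {floor_300b:.3f}")
print(f"Chinchilla (70B, 1.4T):    {chinchilla:.3f}")
print(f"token ratio:               {1.4e12 / 300e9:.2f}x")
print(f"compute ratio vs Gopher:   {approx_flops(70e9, 1.4e12) / approx_flops(280e9, 300e9):.2f}x")
```

The margin between the two losses is small, so the conclusion leans on the fitted constants being accurate.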

This shifted the field. Post-Chinchilla efforts like Llama trained smaller models on much more data. But it also raised an uncomfortable question: are we running out of data? Spending a PaLM-scale compute budget optimally would require ~6.7T tokens. How much high-quality text exists? Nostalgebraist found the literature "extremely unclear" on this point, and in specialized domains like code, the available data is "woefully tiny" compared to what the scaling law says we could use. This data constraint is one reason the field pivoted toward synthetic data and toward squeezing more signal from existing data, which is where mid-training comes in.

Mid-Training: The Stage Nobody Can Define

Mid-training is the term the industry adopted in 2024-2025 for everything between raw pre-training and instruction fine-tuning. OpenAI has had a mid-training division since July 2024, claiming credit for GPT-4 Turbo and GPT-4o. Yet there's no standard definition: it means different things to different organizations [2].

The one constant: mid-training operates on a mid-range of dataset sizes. Pre-training uses trillions of tokens. Post-training uses millions. Mid-training sits between, typically 10-300 billion tokens. In practice, it encompasses several distinct activities:

Quality training (or annealing): Training on a curated, high-quality subset of data in the final phase of pre-training. Allen AI's OLMo 2 formalized this as 5-10% of total training FLOPs devoted to a carefully mixed dataset of filtered web text, curated scientific papers, encyclopedias, and math. The idea is that the model has learned general language patterns from the full pre-training corpus, and now benefits from a concentrated exposure to higher-quality text.

Domain and language extension: Continuing pre-training on specialized data to add capabilities — new languages, domain-specific knowledge, longer context windows. This is what OpenAI originally meant by mid-training: custom training for vertical applications like legal AI, typically involving 10+ billion tokens of domain-specific data.

Scaling synthetic data: Generating training data from stronger models (teacher distillation), generating questions from existing answers (backtranslation), and verifying synthetic answers through formal methods or judge models. This has become the primary mechanism for data scaling once natural data runs thin.
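The quality-training (annealing) item in the list above can be sketched as a schedule that switches the data mixture for the last slice of training. Everything concrete here is an assumption for illustration: the mixture proportions, the `schedule` function, and the linear decay of the learning rate to zero during the annealing phase. The fraction is expressed in tokens, which matches a fraction of FLOPs only when the model size is fixed.

```python
# Hypothetical mixtures: the source names the categories, not the proportions.
PRETRAIN_MIX = {"web_crawl": 1.0}
QUALITY_MIX = {
    "filtered_web": 0.5,
    "scientific_papers": 0.2,
    "encyclopedias": 0.1,
    "math": 0.2,
}


def schedule(tokens_seen: float, total_tokens: float, anneal_fraction: float = 0.1):
    """Return (data mixture, learning-rate multiplier) at a point in training.

    The last `anneal_fraction` of tokens uses the curated mixture while the
    learning-rate multiplier decays linearly from 1 to 0 (assumed shape).
    Warmup and any decay before annealing are ignored in this sketch.
    """
    anneal_tokens = anneal_fraction * total_tokens
    anneal_start = total_tokens - anneal_tokens
    if tokens_seen < anneal_start:
        return PRETRAIN_MIX, 1.0
    progress = min(1.0, (tokens_seen - anneal_start) / anneal_tokens)
    return QUALITY_MIX, 1.0 - progress


total = 4e12  # hypothetical 4T-token run
for seen in (0.0, 3.0e12, 3.8e12, 4.0e12):
    mix, lr_mult = schedule(seen, total)
    print(f"{seen:.1e} tokens -> {sorted(mix)} lr x{lr_mult:.2f}")
```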

The honest assessment of mid-training is that it represents a vibe shift more than a technical innovation. The stages of training are blurring. Instruction-like data appears earlier. RL-like objectives appear during pre-training. The neat pre-train → fine-tune pipeline is becoming a continuous spectrum, and "mid-training" is the name for the messy middle.

Fine-Tuning: From Full to Parameter-Efficient

Traditional fine-tuning updates all model parameters on a smaller, task-specific dataset. This works well but is expensive: you need to store a complete copy of the model for each task. The push toward parameter-efficient fine-tuning (PEFT) methods was driven by both economics and the observation that full fine-tuning often updates the model more than necessary [3].

The landscape of PEFT methods includes:

Prompt tuning and prefix tuning: Learn a small set of "virtual tokens" prepended to the input, while keeping all model weights frozen. The virtual tokens are continuous vectors in embedding space — they don't correspond to real words. This is surprisingly effective for certain tasks, essentially learning a soft prompt that steers the frozen model's behaviour.

Adapters: Insert small trainable modules between the existing layers of a frozen model. Each adapter is a bottleneck (a down-projection, a nonlinearity, and an up-projection), so for hidden size d and bottleneck width b it adds roughly 2·d·b parameters, a small fraction of the frozen layer's weights. LLaMA-Adapter combines this idea with prefix tuning: it adds learnable prompt vectors and gates them with zero-initialized attention, which prevents the new parameters from disrupting the pre-trained model's behavior during early training.

LoRA (Low-Rank Adaptation): Instead of adding new modules, decompose the weight updates into low-rank matrices. If a weight matrix is n×m, LoRA learns two matrices of size n×r and r×m (where r << min(n,m)), so the update is a rank-r approximation. This is arguably the most popular PEFT method because it's simple, adds no inference latency (the low-rank updates can be merged into the weights), and works well across tasks.
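The LoRA description above can be sketched with NumPy. This is a minimal illustration rather than any library's implementation: the layer shape, rank, `alpha` scaling value, and the choice of which factor starts at zero are all assumptions, and the "training" step is simulated by overwriting one factor with random values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 64, 48, 4   # hypothetical layer shape (n x m) and LoRA rank r
alpha = 8.0           # scaling hyperparameter (assumed value)
scale = alpha / r

W = rng.normal(size=(n, m))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(n, r))  # trainable factor, small random init
B = np.zeros((r, m))                     # trainable factor, zero init


def lora_forward(x, W, A, B, scale):
    """Frozen path plus low-rank path; W itself is never modified."""
    return x @ W + scale * (x @ A) @ B


x = rng.normal(size=(5, n))

# With B at zero, the adapted layer reproduces the frozen layer.
starts_unchanged = np.allclose(lora_forward(x, W, A, B, scale), x @ W)

# Stand-in for training: pretend gradient steps moved B away from zero.
B = rng.normal(scale=0.01, size=(r, m))

# Merging folds the update into W, so inference is a single matmul again.
W_merged = W + scale * (A @ B)
merge_matches = np.allclose(x @ W_merged, lora_forward(x, W, A, B, scale))

update_rank = np.linalg.matrix_rank(A @ B)
full_params, lora_params = n * m, r * (n + m)
print(starts_unchanged, merge_matches, update_rank, full_params, lora_params)
```

The merge step is what the "no inference latency" point refers to: after training, the two small factors can be discarded and only `W_merged` served.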

The practical insight from Raschka's taxonomy is that these methods exist on a spectrum from "more parameter-efficient but less expressive" to "less parameter-efficient but more capable." For tasks close to the pre-training distribution, prompt tuning may suffice. For substantial distribution shifts, LoRA or full fine-tuning is needed. The right choice depends on how different your target task is from what the model already knows — and how much compute you're willing to spend.
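One way to make the spectrum concrete is to count trainable parameters. The sketch below uses a hypothetical model (hidden size 4096, 32 layers) and hypothetical hyperparameters; it counts only the four attention projection matrices per layer for LoRA and full fine-tuning, and assumes two adapters per layer with biases. The resulting ordering depends on these choices and is not a general law.

```python
def prompt_tuning_params(num_virtual_tokens: int, hidden_size: int) -> int:
    """One learned embedding vector per virtual token."""
    return num_virtual_tokens * hidden_size


def adapter_params(hidden_size: int, bottleneck: int) -> int:
    """Down- and up-projection weights plus their biases (biases are assumed)."""
    return 2 * hidden_size * bottleneck + bottleneck + hidden_size


def lora_params(n: int, m: int, rank: int) -> int:
    """Two factors of shape n x rank and rank x m."""
    return rank * (n + m)


# Hypothetical configuration, for illustration only.
d, layers = 4096, 32
attn_matrices_per_layer = 4
adapters_per_layer = 2

counts = {
    "prompt tuning (20 tokens)": prompt_tuning_params(20, d),
    "LoRA (rank 8, attention)": layers * attn_matrices_per_layer * lora_params(d, d, 8),
    "adapters (bottleneck 64)": layers * adapters_per_layer * adapter_params(d, 64),
    "full (attention only)": layers * attn_matrices_per_layer * d * d,
}

for method, count in counts.items():
    print(f"{method:<28} {count:>15,}")
```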

Why Overparameterized Models Generalize

There's a deeper question lurking beneath the training pipeline: why do models with far more parameters than training examples generalize at all, rather than just memorizing the data? Standard statistical learning theory says they shouldn't, and it gets this wrong because it assumes the parameter-to-function map is one-to-one. In neural networks, it isn't: many different weight configurations implement the same function, creating symmetries that reduce the model's effective dimensionality far below its parameter count [4].
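A small sketch of why the parameter-to-function map is not one-to-one, using a one-hidden-layer tanh network with hypothetical weights. Permuting hidden units, or flipping the sign of one unit's incoming and outgoing weights, changes the parameters but not the function. Agreement is checked on a handful of inputs only, which illustrates the symmetry rather than proving it.

```python
import math


def tanh_net(x, w_in, b_in, w_out):
    """One-hidden-layer tanh network with scalar input and scalar output."""
    hidden = [math.tanh(w * x + b) for w, b in zip(w_in, b_in)]
    return sum(v * h for v, h in zip(w_out, hidden))


# Hypothetical weights for a 3-unit hidden layer.
original = ([0.7, -1.2, 0.4], [0.1, 0.5, -0.3], [1.5, -0.8, 2.0])
w_in, b_in, w_out = original

# Symmetry 1: reorder the hidden units (here, swap units 0 and 2).
perm = [2, 1, 0]
permuted = (
    [w_in[i] for i in perm],
    [b_in[i] for i in perm],
    [w_out[i] for i in perm],
)

# Symmetry 2: negate unit 1's incoming weight, bias, and outgoing weight.
# tanh is odd, so tanh(-z) = -tanh(z) and the two sign flips cancel.
flipped = (
    [w_in[0], -w_in[1], w_in[2]],
    [b_in[0], -b_in[1], b_in[2]],
    [w_out[0], -w_out[1], w_out[2]],
)

inputs = [-2.0, -0.5, 0.0, 1.0, 3.0]
same_function = all(
    math.isclose(tanh_net(x, *original), tanh_net(x, *other), abs_tol=1e-12)
    for x in inputs
    for other in (permuted, flipped)
)
print(same_function)
```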

Singular Learning Theory (SLT), due to Watanabe, provides the corrected framework. The key insight is that what matters for generalization isn't the flatness of loss basins (the standard hand-wave) but the singularities — points where the set of minimum-loss weights has an ill-defined tangent. More complex singularities correspond to fewer effective parameters, simpler functions, and better generalization. Neural networks can vary their effective dimensionality during training by forming or breaking non-generic symmetries — essentially performing internal model selection, choosing simpler functions within the space of functions their architecture can represent.

This is technically demanding (the theorists are still working out calculations for one-layer tanh models), but the implications for the training pipeline are significant. If SLT is right, then scaling laws aren't just empirical regularities: they reflect deep geometric properties of the loss landscape. And the reason models trained with different pipelines (pre-training, mid-training, LoRA) end up in different capability regimes may have less to do with the data mix than with which singularities the optimizer finds [4].

DPO: Alignment Without RL

The final stage of training for chat models is alignment: teaching the model to produce outputs that humans prefer. The standard approach was RLHF: train a reward model on human preferences, then use PPO to optimize the language model against that reward. It works, but it's complex. You need a reward model, a value function, careful KL regularization to prevent the model from drifting too far from its starting point, and the whole apparatus of policy gradient methods [5].

Direct Preference Optimization (DPO) eliminates the separately trained reward model. The key insight is mathematical: under KL-regularized RL there's an analytical mapping from any reward function to its optimal policy. By applying a change of variables, the RL objective over the reward model can be transformed into a simple binary cross-entropy loss over the policy itself, with the reference model appearing only as a fixed baseline. Given a prompt and two responses (one preferred, one rejected), DPO trains the model to increase the probability of the preferred response relative to the rejected one, calibrated against a frozen reference model.
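The per-example DPO loss described above can be sketched in a few lines. The inputs are summed log-probabilities of each full response under the policy and under the frozen reference model; the function names and the example numbers are hypothetical, and a real trainer would compute these over batches with automatic differentiation.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Binary cross-entropy on the reference-calibrated preference margin."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# Policy identical to the reference: the margin is 0 and the loss is log(2).
baseline = dpo_loss(-13.0, -14.0, -13.0, -14.0)

# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-12.0, -15.0, -13.0, -14.0)

print(f"baseline={baseline:.4f} improved={improved:.4f}")
```

Only the differences from the reference enter the loss, which is the sense in which the update is calibrated against the frozen model.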

In practice, this means the RLHF pipeline shrinks from four steps (SFT → preference annotation → reward model training → RL optimization) to three (SFT → preference annotation → DPO): the preference data is still needed, but the reward model and the RL loop are not. The DPO trainer takes the SFT model to be trained, a reference model (typically a frozen copy of that SFT model), and a dataset of (prompt, chosen, rejected) triples. One hyperparameter, beta, controls how much the model is penalized for diverging from the reference: smaller beta means more aggressive optimization but more risk of distribution shift.
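The effect of beta can be illustrated with the closed-form optimum of KL-regularized reward maximization, where the optimal policy is proportional to the reference probability times exp(reward / beta). The three-response toy distribution and its rewards below are hypothetical; the point is only that a smaller beta lets the optimum drift further from the reference.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed in log space."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    top = max(logits)  # subtract the max before exponentiating for stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical reference distribution over three candidate responses,
# with hypothetical rewards that favor the least likely one.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

drift = {}
for beta in (0.1, 1.0, 10.0):
    policy = optimal_policy(ref_probs, rewards, beta)
    drift[beta] = kl_divergence(policy, ref_probs)
    rounded = [round(p, 3) for p in policy]
    print(f"beta={beta:<5} policy={rounded} KL from reference={drift[beta]:.4f}")
```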

A practical footnote on LoRA: the common assumption that parameter-efficient fine-tuning necessarily trades off quality for efficiency turns out to be wrong when done carefully. Recent work from Thinking Machines shows that LoRA can match full fine-tuning performance if you attend to details the original paper glossed over: enough adapter rank for the size of the dataset, applying LoRA to all linear layers (not just attention), and tuning the learning rate for LoRA separately, since its optimum is roughly 10x higher than for full fine-tuning [6]. The gap between LoRA and full fine-tuning that many practitioners observed was an artifact of default hyperparameters, not a fundamental limitation. This matters because it means the economic argument for LoRA (train cheap, serve without overhead) doesn't require accepting a quality penalty. You just need to tune it properly.

The elegance of DPO has made it the default alignment method for open-source models. Combined with QLoRA (quantized LoRA), it makes it possible to align a 7B parameter model on a single GPU, which has democratized alignment research in ways that PPO-based RLHF never could. But the lesson of Specification Gaming applies to DPO and RLHF alike, and the Spurious Rewards finding suggests that the alignment signal may matter less than the model's pre-existing tendencies. The training pipeline gives you fine-grained control over the model's behaviour, but the foundation was laid during pre-training, and that's where the real capabilities (and risks) live.

Footnotes

  1. "chinchilla's wild implications" by nostalgebraist

  2. "What's the deal with mid-training?" by vintagedata.org

  3. "Understanding Parameter-Efficient Finetuning of Large Language Models" by Sebastian Raschka

  4. "Neural networks generalize because of this one weird trick" by Jesse Hoogland

  5. "Fine-tune Llama 2 with DPO" by Kashif Rasul

  6. "LoRA Without Regret" by Thinking Machines
