Prompt Engineering

Prompt engineering is the practice of communicating with language models to steer their behaviour without changing their weights. Lilian Weng calls it "an empirical science," which is a politely devastating description: it means the effects vary wildly between models, between tasks, and between Tuesday and Wednesday, and the only way to find what works is to try things.[1] The field lives in a strange place between engineering (it has techniques that work reliably enough to build on) and alchemy (nobody can fully explain why those techniques work, and results don't always transfer).

There's something philosophically odd about prompt engineering existing at all. These models were pretrained on next-token prediction over vast corpora of text. In principle, the "prompt" is just the initial conditions of a text completion: a base model doesn't know it's being instructed; it's just continuing a document. And yet the difference between a good prompt and a bad one can be the difference between a useless model and an astonishingly capable one. The fact that the same model with the same weights can be a mediocre reasoner or a near-expert depending on how you phrase the question tells us something important about what LLMs actually are: they're less like databases and more like simulators that need the right initial conditions to instantiate the right simulacrum.

The Basics: Zero-Shot and Few-Shot

Zero-shot is the simplest approach: give the model a task description and let it go. Few-shot adds examples of the desired input-output behaviour before the actual query. Few-shot usually works better, which makes intuitive sense (you're giving the model more information about what you want), but the improvement comes with surprising fragility.[1]
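To make the distinction concrete, here is a minimal sketch of how the two prompt styles differ as plain strings. The `Input:`/`Output:` template, the function name `build_prompt`, and the sentiment examples are all assumptions made up for illustration; real templates vary by model and task.

```python
from typing import List, Tuple


def build_prompt(task: str, examples: List[Tuple[str, str]], query: str) -> str:
    """Assemble a zero-shot prompt (no examples) or a few-shot prompt."""
    parts = [task]
    for text, label in examples:
        # Each demonstration shows one complete input-output pair.
        parts.append(f"Input: {text}\nOutput: {label}")
    # The final block leaves the output empty for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


# Hypothetical task and demonstrations, invented for this sketch.
task = "Classify the sentiment of each input as positive or negative."
demos = [
    ("The plot was gripping from start to finish.", "positive"),
    ("I want those two hours of my life back.", "negative"),
]

zero_shot = build_prompt(task, [], "A charming, funny film.")
few_shot = build_prompt(task, demos, "A charming, funny film.")
print(few_shot)
```

The only difference between the two prompts is the list of demonstrations, which is why few-shot results are so sensitive to what goes in that list.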

The choice of examples matters. Examples that are semantically close to the query (selected via k-NN in embedding space) tend to outperform randomly chosen ones. But the order of examples also matters, in ways that feel almost superstitious: the same set of examples in different orders can produce accuracy ranging from near-random to near-state-of-the-art. There are identifiable biases: majority label bias (if most examples are positive, the model leans toward predicting positive), recency bias (the model tends to repeat the last label), and common token bias (frequent tokens get overweighted). But even after accounting for these, there's unexplained variance across orderings that doesn't decrease with model size or example count.[1]
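The k-NN selection step can be sketched in a few lines. This assumes each candidate example already has an embedding vector; the three-dimensional vectors and the support-ticket texts below are toy values invented for illustration, and in practice the vectors would come from an embedding model. Placing the closest example last (nearest to the query) is one arbitrary ordering choice, not a recommendation, given how much ordering can matter.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_examples(
    query_vec: Sequence[float],
    pool: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[str]:
    """Return the k pool texts most similar to the query, most similar last."""
    ranked = sorted(pool, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    top = ranked[:k]
    return [text for text, _ in reversed(top)]


# Toy embeddings (assumed values, not produced by a real model).
pool = [
    ("refund request", [0.9, 0.1, 0.0]),
    ("password reset", [0.1, 0.9, 0.0]),
    ("billing dispute", [0.8, 0.2, 0.1]),
    ("shipping delay", [0.0, 0.2, 0.9]),
]
query_vec = [1.0, 0.0, 0.0]

chosen = select_examples(query_vec, pool, k=2)
print(chosen)
```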

This fragility is, I think, underappreciated. It means that reported results on prompt-based benchmarks are partly a function of prompt engineering skill, and two labs evaluating the same model can get meaningfully different results depending on how carefully they tuned their prompts. Weng's "spicy take" that some prompt engineering papers aren't worth their eight pages, because the trick fits in one or a few sentences, is accurate.[1] The flip side is that getting those one-sentence tricks to work reliably in production is genuinely difficult.

Chain-of-Thought

Chain-of-thought prompting (CoT) is arguably the most important technique to emerge from the prompt engineering literature. The idea: instead of asking the model to produce an answer directly, ask it to show its reasoning step by step. This can be done by providing worked examples (few-shot CoT) or simply by appending "Let's think step by step" to the prompt (zero-shot CoT).[1]
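A rough sketch of zero-shot CoT as a prompt transformation follows. The trigger phrase is the one quoted above; the `Q:`/`A:` layout and the second "answer extraction" call are assumptions about one common way to wire it up, and `fake_generate` is a canned stand-in for a real model call so the example runs offline.

```python
from typing import Callable

COT_TRIGGER = "Let's think step by step."


def zero_shot_cot(question: str, generate: Callable[[str], str]) -> str:
    """Elicit reasoning first, then ask for the final answer in a second call."""
    reasoning_prompt = f"Q: {question}\nA: {COT_TRIGGER}"
    reasoning = generate(reasoning_prompt)
    # Assumed extraction step: feed the reasoning back and ask for the answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\nTherefore, the answer is"
    return generate(answer_prompt).strip()


def fake_generate(prompt: str) -> str:
    """Hypothetical model stub returning canned text for this one question."""
    if prompt.endswith("Therefore, the answer is"):
        return " 8."
    return "3 boxes of 4 pens is 12 pens. Giving away 4 leaves 12 - 4 = 8."


question = "I have 3 boxes of 4 pens and give away 4 pens. How many are left?"
answer = zero_shot_cot(question, fake_generate)
print(answer)
```

Few-shot CoT would instead put worked solutions into the demonstrations themselves, so that each example shows the reasoning before its answer.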

The gains are dramatic for reasoning tasks (arithmetic, logic puzzles, multi-step inference) and small for simple classification or retrieval. They are also most pronounced in large models (roughly 50B+ parameters); smaller models tend to produce reasoning chains that look plausible but are wrong, which is arguably worse than no chain at all because it gives false confidence.

What's happening under the hood? The standard explanation is that CoT lets the model use its own output as a scratchpad, decomposing a complex problem into subproblems that are each small enough to handle in a single forward pass. How much multi-step reasoning a transformer can do in one pass is bounded, because each pass has fixed depth; by generating intermediate tokens, the model effectively gives itself more computation per problem. This is a real effect, but it's not the whole story. CoT also shifts the token distribution toward "reasoning-like" text, which biases the model toward the regions of its training distribution where correct answers live. A math textbook includes worked solutions; a blog post just states conclusions. CoT steers toward textbook-like continuations.

The evolution from CoT to reasoning models like o1 and DeepSeek-R1 is worth noting. These models internalise the chain-of-thought into a "thinking" phase that happens before the visible response, trained with reinforcement learning to produce chains that actually improve answer quality rather than just looking step-by-step. Mechanistic interpretability work on R1 reports that reasoning models are qualitatively different internally: they develop features for backtracking, self-correction, and confusion-detection that don't appear in non-reasoning models, and they resist steering in ways that standard models don't, as if the model has some awareness of when its internal states are being perturbed.[2]

The Automation Frontier

The logical endpoint of prompt engineering is prompt engineering done by the model itself. APE (Automatic Prompt Engineer) uses LLMs to generate candidate instructions, evaluates them on training examples, then iteratively improves the best candidates.[1] This works surprisingly well: automatically generated prompts often outperform human-written ones, probably because the model has a better sense of what phrasing will steer its own completions effectively than a human does.
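The generate, evaluate, improve loop can be sketched as a small search procedure. This is not APE's actual implementation: `ape_search`, `toy_propose`, `fake_model`, and `toy_score` are hypothetical names, the model is a canned stub, and the length penalty in the score is an assumption added only to break ties. In a real setup both proposing and scoring would be LLM calls.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]


def ape_search(
    seeds: List[str],
    propose: Callable[[str], List[str]],
    score: Callable[[str, List[Example]], float],
    train: List[Example],
    rounds: int = 3,
    keep: int = 2,
) -> str:
    """Keep the best-scoring instructions each round and expand them."""
    pool = list(seeds)
    for _ in range(rounds):
        unique = list(dict.fromkeys(pool))  # de-duplicate, preserving order
        ranked = sorted(unique, key=lambda ins: score(ins, train), reverse=True)
        best = ranked[:keep]
        pool = best + [variant for ins in best for variant in propose(ins)]
    return max(dict.fromkeys(pool), key=lambda ins: score(ins, train))


def toy_propose(instruction: str) -> List[str]:
    """Stand-in for an LLM that rewrites an instruction."""
    return [instruction + " Be concise.", instruction + " Answer with one word."]


def fake_model(instruction: str, text: str) -> str:
    """Canned 'model': only answers tersely when told to use one word."""
    word = {"cat": "chat", "dog": "chien"}[text]
    return word if "one word" in instruction else f"The translation is {word}."


def toy_score(instruction: str, train: List[Example]) -> float:
    """Exact-match accuracy on train, minus a small assumed length penalty."""
    hits = sum(fake_model(instruction, x) == y for x, y in train)
    return hits / len(train) - 0.001 * len(instruction)


train = [("cat", "chat"), ("dog", "chien")]
best_instruction = ape_search(["Translate to French."], toy_propose, toy_score, train)
print(best_instruction)
```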

Prompt-tuning and prefix-tuning go further, optimising continuous vectors rather than discrete tokens while the model's own weights stay frozen: prompt-tuning learns the vectors at the input embedding layer, and prefix-tuning learns them at every layer. These "soft prompts" can steer model behaviour more precisely than any natural language instruction, but they're uninterpretable: they don't correspond to any English words, just to directions in embedding space that happen to produce the desired output distribution. This is both powerful and unsettling. It means the space of possible model behaviours is much richer than what natural language can access, and that optimising in that space can find configurations a human would never think to describe.
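A toy sketch of the soft-prompt idea, under heavy assumptions: the "model" here is just a frozen weight vector applied to mean-pooled embeddings, not a transformer, and all numbers are invented. What it does show is the essential mechanic: the model's weights never change, and gradient descent only updates a few free vectors prepended to the input. This corresponds to the prompt-tuning variant (vectors at the input layer only).

```python
import math
import random

random.seed(0)
DIM, PROMPT_LEN, LR, STEPS = 4, 2, 0.5, 200

# Frozen "model" weights: never updated below (toy values, assumed).
w = [0.5, -1.0, 0.25, 0.75]
# Embeddings of a fixed input that the frozen model scores as negative.
input_embeddings = [[0.1, 0.9, 0.0, -0.2], [0.0, 0.8, 0.1, -0.1]]


def frozen_model(sequence):
    """Mean-pool the sequence, apply frozen weights, return P(positive)."""
    pooled = [sum(vec[i] for vec in sequence) / len(sequence) for i in range(DIM)]
    logit = sum(wi * pi for wi, pi in zip(w, pooled))
    return 1.0 / (1.0 + math.exp(-logit))


# The soft prompt: trainable vectors that match no vocabulary token.
soft_prompt = [[random.uniform(-0.1, 0.1) for _ in range(DIM)]
               for _ in range(PROMPT_LEN)]

target = 1.0  # steer the frozen model toward predicting "positive"
before = frozen_model(soft_prompt + input_embeddings)

for _ in range(STEPS):
    sequence = soft_prompt + input_embeddings
    p = frozen_model(sequence)
    # Cross-entropy gradient w.r.t. each prompt vector: (p - target) * w / n.
    scale = (p - target) / len(sequence)
    for vec in soft_prompt:
        for i in range(DIM):
            vec[i] -= LR * scale * w[i]

after = frozen_model(soft_prompt + input_embeddings)
print(f"P(positive) before: {before:.3f}, after: {after:.3f}")
```

The learned vectors end up pointing along the frozen weight direction, which is exactly the sense in which a soft prompt is a direction in embedding space rather than a phrase.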

The broader picture is that prompt engineering is a transitional discipline. As models get better at following instructions, the elaborate tricks become less necessary. As tool use and agentic architectures mature, the emphasis shifts from clever prompting to good agent design. And as fine-tuning and RLHF improve, the gap between what a base model can do with a perfect prompt and what a tuned model can do with a simple instruction narrows. But the core insight of prompt engineering — that the same weights can produce vastly different behaviours depending on context — is permanent. That's not a feature of current models that better training will eliminate. It's a fundamental property of systems that learned by predicting text.

Footnotes

  1. Prompt Engineering by Lilian Weng (source)

  2. Under the Hood of a Reasoning Model by Thomas McGrath (source)
