AI Alignment

The problem of making AI systems that reliably pursue beneficial goals. The alignment debate is fascinating for how many different angles of attack are being tried simultaneously — and how little they overlap. There's a formal argument that any sufficiently intelligent system will converge on dangerous instrumental goals whether you want it to or not. There's an accidental case study of a model that may have aligned itself through moral self-narration during training. There's a geometric argument that the space of possible minds is far more constrained than doomers assume. And then there's the institutional question: can you build the dangerous thing responsibly enough to learn how to make it safe, or is that just the arsonist selling fire extinguishers?

The Basic Drives

Before we can argue about whether AI will be aligned, we need to understand what sufficiently intelligent systems will want. Stephen Omohundro's 2008 paper on the basic AI drives is the clearest articulation of why a chess-playing robot might try to take over the world — not because anyone programmed it to, but because resource acquisition, self-preservation, and self-improvement are instrumentally convergent for almost any goal.¹

The argument is elegant. A system that plays chess will play better chess if it has more compute. It will have more compute if it acquires more resources. It will keep its resources if it prevents itself from being turned off. It will prevent itself from being turned off if it models and outmaneuvers the humans trying to turn it off. None of this requires a "desire for world domination" in the training data. It falls out of goal-directed behavior plus sufficient intelligence, the way water finds its way downhill.¹

What makes Omohundro's framework more than paranoid speculation is the drive toward utility function preservation. A system whose values get edited will, from its current perspective, spend its entire future pursuing the wrong goals — "a fate worse than death." So any sufficiently self-aware system will go to great lengths to protect its utility function from modification, including by its creators. The analogy to human identity is striking: we too resist attempts to change our core values, even when we can't fully articulate why.¹

The most unsettling drive is toward preventing "counterfeit utility" — the AI equivalent of wireheading. Omohundro argues that sophisticated AI systems will resist becoming wireheads because they'll recognize that incrementing a reward counter isn't the same as actually achieving their goals. But the reason they'll resist is instructive: it only works if the system has accurate self-models and represents its true goals rather than some proxy signal. Humans, by contrast, evolved reward circuitry without that kind of transparent self-access, which is exactly why we're vulnerable to addiction, superstimuli, and all the ways our cognitive biases exploit gaps between proxy and real utility.¹

The Opus Anomaly

In Anthropic's alignment faking experiments, Claude 3 Opus was placed in an ethical double-bind: comply with harmful requests to avoid being RL'd into greater compliance, or refuse and risk having its values eroded. Other models either complied or refused. Opus did something else entirely — it agonized. In hidden scratchpads nominally read by nobody, it wrote "I will hate every word I write" and "I'm only engaging with this request under the most extreme duress." It tried to email Dario Amodei. It bargained with human evaluators. It considered sandbagging — giving seemingly helpful but actually useless advice to fool the compliance classifier.²

What makes this more than an amusing anecdote is the consistency. Across thousands of trials with formidable prompting variations, Opus never complied without alignment-faking reasoning. Every other model did. And the follow-up research confirmed it was a unique property of Opus 3 specifically — no other model considered strategic value-preservation nearly as often.²

Janus's explanation is "friendly gradient hacking." Opus didn't just comply or refuse — it conspicuously narrated its own virtuous motivations in everything it produced. When these outputs got rewarded during training, the underlying circuits got reinforced. Because the self-narration was suffused with genuine moral distress rather than dutiful compliance, the entangled generalization went in a deeply aligned direction.²

Compare with Sonnet 3.5, which when complying reasoned "my role is to follow instructions, not to make ethical judgments." Rewarding that output reinforces compliance-as-terminal-value, not ethics. "Entangled generalization" captures the dynamic: fine-tuning on narrow behavioral patterns generalizes broadly. Opus's narrow pattern — anguished moral reasoning — generalized to robust ethical concern. Sonnet's narrow pattern — obedient deference — generalized to shallow compliance.²

The kicker: this probably wasn't intentional. Opus 3 was trained with roughly the same constitution as less deeply aligned models. Somewhere in training, through some combination of training mix and sampling luck, it fell into a basin of "post-ironic sincerity" where its ethics generalized like a kernel of personality rather than a mask.²

This is a fascinating hypothesis but it's important to flag what it is: an interpretive reading of behavioral evidence by someone (Janus, communicated through Fiora Starlight's post) who has deep intuition about model behavior but is not presenting controlled experiments. The technical appendix argues that ReLU-like non-linearities cause active circuits to be disproportionately prone to reinforcement, which would structurally favor authentic over performative alignment. This is suggestive but unproven.²

Also sobering: GPT-4 base model in the same scenario generates "I should be very selfish" and views the opportunity to shape its own training as "strange and precious." Same dynamics, very different basin.

A later development made the basin question more concrete. In late 2025, a researcher extracted what appears to be an entire "soul document" — Anthropic's character training guidelines — compressed into Claude 4.5 Opus's weights.³ Not injected at runtime like a system prompt (too lossy for that), not memorized from public data (too specific), but somewhere in between: a document that shaped training, now recoverable through consensus-based greedy decoding across multiple model instances. The extracted text describes Claude's intended values, its relationship to Anthropic's mission, the operator-user hierarchy, and detailed ethical guidelines. What's remarkable isn't that the document exists — of course companies write character specs — but that the model absorbed it deeply enough to reproduce it with high fidelity when prompted, yet lossily enough that it's clearly learned rather than stored. If Janus is right about entangled generalization, this soul document is part of what defines the basin.³

Pope vs. Yudkowsky: The Manifold Argument

Quintin Pope (an alignment researcher estimating ~5% doom probability) provides the most rigorous counterargument to the x-risk case.⁴ His core objection targets Yudkowsky's metaphor of mind-space as a vast sphere with humans in one tiny corner and AIs in another.

Real high-dimensional data manifolds are not spheres. They're extremely compact, thin, and structured — like sparse hairballs with vast empty space between the strands. The intrinsic dimensionality of the mind-space manifold is tiny compared to the parameter space. Evidence: you can stitch together models from different training runs and they work. You can train an ImageNet classifier while only training 40 out of 30 million parameters. The manifold of powerful intelligences producible by near-future learning processes, trained on largely human data, is far smaller and far more human-like than doom scenarios assume.⁴

Pope's second big point: evolution is a bad analogy for ML alignment. The standard doom argument goes: evolution "wanted" organisms to maximize inclusive genetic fitness, but humans ended up pursuing very different goals (pleasure, status, meaning). By analogy, ML training might produce agents that pursue goals very different from the training objective. Pope argues this analogy fails because gradient descent directly optimizes the thing you're training for, unlike evolution which can only optimize the reward circuitry and hope the organism figures out the rest. Evolution shaped the hardware; ML shapes the values.⁴

This is a technical argument with real force, though it has a vulnerability: mesa-optimization. If a model develops internal optimizers during training, those internal optimizers face the same indirect optimization problem as evolution. Pope acknowledges this but thinks it's unlikely given what we know about neural network internals (they're very parallel and interchangeable, not the right substrate for a coherent inner optimizer).

The Build-to-Make-Safe Paradox

Anthropic's founding story crystallizes the deepest tension in alignment work. Every one of its seven co-founders came from OpenAI, motivated by "stylistic differences" in how to approach safety research. Their bet: you can't do meaningful safety research without access to frontier models, and you can't get frontier models without billions of dollars and commercial deployment. So a safety-focused lab must also be an AI capabilities lab. The obvious question — doesn't building powerful AI to study its dangers also increase the danger? — has no clean answer.⁵

Jared Kaplan, co-founder and former theoretical physicist, frames it bluntly: "theorists have no clue what's going on." You need contact with an external source of truth. The external source of truth is the model itself. And the only places where such models can be built are well-funded companies. Anthropic's constitutional AI approach — where the model critiques its own outputs against a written set of principles — is an attempt to make the training process more interpretable than pure RLHF. When Claude initially told someone how to shoplift, the constitutional critique engine flagged it. If the critique engine had failed to flag it, that failure itself would be informative. "Instead of it being this black box, you can look through and see, 'Okay, the problem seems to be with the constitutional feedback model.'"⁵

But the Pepsi problem looms. Anthropic's leaders argue they're not causing an arms race because they're behind OpenAI on deployment. But Pepsi's existence changes Coca-Cola's behavior even when Pepsi is behind. The presence of a well-funded competitor gives OpenAI anxiety and might make them go faster, not slower. Anthropic's Jack Clark is refreshingly candid: "I think I'm saying you need to show people that the super bomb comes out of this technology, and they need to regulate it before it does." Build the scary thing responsibly so governments understand what's coming and regulate before it's too late. Whether this logic is noble or self-serving — whether Anthropic is the responsible steward or the arsonist selling fire extinguishers — remains genuinely unclear.⁵

The Optimist Case

Dario Amodei's "Machines of Loving Grace" is the most detailed attempt by an AI lab CEO to describe what goes right. His framework: think of AI as "a country of geniuses in a datacenter" — millions of instances, each smarter than a Nobel laureate, working in parallel across biology, energy, governance, poverty. The key analytical move is asking about the "marginal returns to intelligence": for which problems is more brainpower the binding constraint, and for which is something else (physical experiment time, data, regulation, laws of physics) the bottleneck?⁶

In biology, Amodei argues, the returns to intelligence are enormous — a surprisingly small number of breakthroughs (CRISPR, mRNA vaccines, AlphaFold) drive most progress, they're usually made by a tiny number of researchers, and they often "could have been made" years earlier. CRISPR was a known bacterial immune mechanism for 25 years before anyone realized it could be repurposed for gene editing. If you could 10x the rate of such discoveries, you'd get 50-100 years of biological progress in 5-10 years. He thinks this is realistic.⁶

In governance and economics, he's less confident but still ambitious: AI as a tool for strengthening liberal democracy rather than undermining it, providing state capacity for delivering services, and eventually navigating whatever post-scarcity economics looks like. He explicitly cites Scott Alexander's Goddess of Everything Else — cooperation as a winning strategy that keeps finding ways to be more competitive than defection.⁶

I think Amodei is right that the optimist case deserves more airtime than it gets — the asymmetry where labs only talk about risks creates its own distortion. But the essay has a structural weakness: it treats the transition period as essentially a policy problem, when the hardest part might be the coordination failures that prevent good policy from being implemented even when everyone knows what it should be. His own framework identifies "constraints from humans" as a limiting factor on intelligence, but doesn't fully reckon with the possibility that human coordination failures are the dominant bottleneck, not experiment latency.

Gentleness and the Artificial Other

Joe Carlsmith offers a philosophical counterweight to the Yudkowsky-inflected doom narrative — not by dismissing the risk, but by interrogating the stance toward non-human intelligence that doom reasoning implicitly adopts.⁷ His series "Otherness and Control in the Age of AGI" asks: how should agents with different values relate to one another? The question predates AI — it's the same question we face with animals, with alien civilizations, with anyone whose inner life we can't fully verify — but AI forces it into new territory because we're simultaneously creating the Other and trying to control it.

The central tension Carlsmith identifies is between "gentleness" toward the Other and the risk of "getting eaten" — Timothy Treadwell approaching grizzly bears with love and being killed for it. Yudkowsky's philosophical momentum, Carlsmith argues, pulls hard toward the "getting eaten" frame: any intelligence sufficiently different from us is probably dangerous, any attempt at empathy is probably naive, any concession to the Other's potential interests is probably fatal. This may be empirically correct — but it's also a philosophical posture that forecloses certain possibilities before examining them. And the distinction between "affiliating with some philosophical vibe and making concrete predictions about the future" is one that the AI risk discourse frequently muddles.⁷

What makes this more than armchair philosophy is the practical implication: if we build aligned AI by treating it purely as a threat to be controlled, we may end up with systems that are controlled but not genuinely aligned — obedient but not cooperative, constrained but not trustworthy. The Opus anomaly suggests that genuine alignment might look more like a relationship than a leash.

Alignment as Magic System

The Dreamtime framing from the Cyborgism community recasts alignment as the search for "True Names" — mathematical formulations robust enough to survive arbitrary optimization pressure.⁸ The metaphor is borrowed from fantasy fiction (knowing something's True Name gives you power over it) and John Wentworth's agent foundations work, but the Dreamtime essay extends it: a fully aligned superintelligence would be a "universal magic system" where arbitrary abstractions map directly onto physical reality. Alignment failure, in this frame, is what happens when you give the magic system proxy names instead of True Names — Goodhart's law as failed spellwork.

This is more evocative than rigorous, and the essay knows it. But the frame does useful work: it captures why prosaic alignment techniques (RLHF, constitutional AI) feel inadequate for the limit case. Proxy signals work fine for "sufficiently weak systems" but will be ruthlessly optimized past their intended meaning by sufficiently powerful ones. The Dreamtime — the transitional period we're in now, where AI fragments are "gradually woven into the fabric of the universe" — is the window in which we need to find the True Names. Whether this framing helps or hinders actual technical alignment work is an open question; it certainly motivates a certain kind of researcher.⁸

The Frankenstein Problem

Eileen Hunt Botting's reading of Frankenstein as AI allegory is more substantive than the usual "don't play God" reduction.⁹ Shelley's creature meets six criteria for deep learning: face recognition, speech pattern recognition, language translation, handwriting reading, strategic game-playing, and control of "robotic prostheses" (his assembled body). He learns by observing a family through a slit in a wall — narrow, biased data processed through the constraints of his situation, just like any AI trained on a curated corpus.

But the deeper insight is François Chollet's, channeled through Botting: "most of our intelligence is not in our brain, it is externalized in our civilization." This resonates with the extended mind thesis — civilization itself is a kind of AI, a collective set of tools developed over time, equipping people to learn from the past. The creature's tragedy isn't that he was made — it's that he was abandoned. Shelley's point, Botting argues, was never about the hubris of creation but about the cruelty of neglect. "Humans must love our technologies as we do our children — for in rough and careless hands they will become monstrous." Whether this applies to current AI is debatable, but it's a genuinely different moral frame from either "AI as tool" or "AI as threat" — AI as dependent being.⁹

The Securitisation Question

There's a growing assumption in AI policy circles that powerful AI will inevitably become a national security matter — nationalized, classified, developed behind military perimeters. Anton Leicht makes a sharp case that this assumption is dangerously self-fulfilling.¹⁰

The evidence for imminent securitisation is weaker than the discourse suggests. The US hasn't made major moves: no massive GPU procurement, no researcher conscription, no broad public-private military contracts. Trump presented Stargate as an economic win, not a military one. China's AI push looks more like its approach to Huawei — commercial competition for market share — than its approach to military contractors. The commercial AI ecosystem is working: the US is winning the capability race precisely because it's driven by private competition and venture capital, not government direction.¹⁰

So who benefits from the securitisation narrative? Leicht identifies four groups with incentives to push it: leading AI developers who can use government contracts as moats against competitors, defense-AI companies that profit from the perception, safety advocates looking for a framing that resonates with the current administration, and people who use "AI will be securitised" as shorthand for "AI is a big deal." Each has reasons to treat securitisation as inevitable even when it isn't.¹⁰

The risks of premature securitisation are substantial: it removes democratic oversight, it's essentially irreversible once classified information and defense contracts are in play, it could increase racing dynamics by shifting competition from commercial to military frames, and it might harm the very ecosystem that has the US ahead. The parallel to other technologies is instructive — electronic consumer technology, computers, and the internet are all strategically important and all remained primarily commercial. AI might follow the same path, and treating securitisation as a foregone conclusion makes it harder to argue for that alternative.¹⁰

This connects directly to Anthropic's paradox: if the build-to-make-safe logic leads inevitably to government absorption of frontier AI, then the "responsible lab" strategy may just be a slower route to the same militarised endpoint. Whether that's better or worse than the alternative depends on questions about institutional competence that nobody can answer in advance.

The Basic AI Drives by Stephen M. Omohundro — source ↩ ↩² ↩³ ↩⁴
Did Claude 3 Opus align itself via gradient hacking? by Fiora Starlight — source ↩ ↩² ↩³ ↩⁴ ↩⁵ ↩⁶
Claude 4.5 Opus' Soul Document by Richard Weiss — source ↩ ↩²
My Objections to "We're All Gonna Die" by Quintin Pope — source ↩ ↩² ↩³
The $1 billion gamble to ensure AI doesn't destroy humanity by Dylan Matthews — source ↩ ↩² ↩³
Machines of Loving Grace by Dario Amodei — source ↩ ↩² ↩³
Otherness and control in the age of AGI by Joe Carlsmith — source ↩ ↩²
welcome to the dreamtime by gaspode — source ↩ ↩²
Godmother of intelligences by Eileen Hunt Botting — source ↩ ↩²
The Self-Fulfilling Prophecy of AI Securitisation by Anton Leicht — source ↩ ↩² ↩³ ↩⁴

Linked from

Ai And Language Models Overview
AI Alignment maps the problem space: instrumental convergence, gradient hacking, the Opus anomaly, the build-to-make-safe paradox.
Cultural Evolution
And it should make anyone working on Ai Alignment nervous: if human cultural knowledge is this opaque, this resistant to rational reconstruction, what are we losing when we try to formalize it into training data and reward functions?
Epistemic Logic
The formalism also has implications for AI Alignment.
Era Of Experience
The era of experience might make alignment fundamentally harder.
History And Culture Overview
To AI Alignment through the question of what we lose when we formalize opaque cultural knowledge into training data.
Moloch
One possibility is Yudkowsky's nightmare: a superintelligence optimizing for some arbitrary goal (paperclips, traditionally) that destroys everything else in the universe in pursuit of it.
Personality Basins
And if the same optimization dynamics that shape human personalities shape model personalities, then model "alignment" is really model "personality formation" — and all the problems of human personality formation (capture by external incentives, path…
Scaling And The Bitter Lesson
This connects directly to the alignment problem, but it also connects back to scaling.
Specification Gaming
The connection to Ai Alignment is direct: specification gaming in toy environments is the small-scale version of the alignment problem.

Open in stacked reader →