AI Alignment Forum
Risk reports need to address deployment-time spread of misalignment
Mechanistic estimation for expectations of random products
The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness
Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
Clarifying the role of the behavioral selection model
Mechanistic estimation for wide random MLPs
Motivated reasoning, confirmation bias, and AI risk theory
Risk from fitness-seeking AIs: mechanisms and mitigations
Research Sabotage in ML Codebases
Language models know what matters and the foundations of ethics better than you
From nothing to important actions: agents that act morally
A "Lay" Introduction to "On the Complexity of Neural Computation in Superposition"
$50 million a year for a 10% chance to ban ASI
Five approaches to evaluating training-based control measures
Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability
You can only build safe ASI if ASI is globally banned
Current AIs seem pretty misaligned to me
Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes
My unsupervised elicitation challenge
My picture of the present in AI
[Paper] Stringological sequence prediction I
There should be $100M grants to automate AI safety
My most common advice for junior researchers
Predicting When RL Training Breaks Chain-of-Thought Monitorability
A Toy Environment For Exploring Reasoning About Reward
Metagaming matters for training, evaluation, and oversight
“Act-based approval-directed agents”, for IDA skeptics
New RFP on Interpretability from Schmidt Sciences
Operationalizing FDT
How well do models follow their constitutions?
Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation
Can governments quickly and cheaply slow AI training?
How to Design Environments for Understanding Model Motives
Schelling Goodness, and Shared Morality as a Goal
Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior
Will reward-seekers respond to distant incentives?
AXRP Episode 48 - Guive Assadi on AI Property Rights
How do we (more) safely defer to AIs?
Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032
Distinguish between inference scaling and "larger tasks use more compute"
It Is Reasonable To Research How To Use Model Internals In Training
Increasing AI Strategic Competence as a Safety Approach
“Features” aren’t always the true computational primitives of a model, but that might be fine anyways
Are there lessons from high-reliability engineering for AGI safety?
AlgZoo: uninterpreted models with fewer than 1,500 parameters
New version of “Intro to Brain-Like-AGI Safety”
When should we train against a scheming monitor?
No instrumental convergence without AI psychology
Gradual Paths to Collective Flourishing
Brief Explorations in LLM Value Rankings
Practical challenges of control monitoring in frontier AI deployments
My 2003 Post on the Evolutionary Argument for AI Misalignment
How hard is it to inoculate against misalignment generalization?
The inaugural Redwood Research podcast
Principled Interpretability of Reward Hacking in Closed Frontier Models
Team Shard: Alignment Mentorship from TurnTrout and Alex Cloud
Measuring no CoT math time horizon (single forward pass)
Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance
AIXI with general utility functions: "Value under ignorance in UAI"
2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target
Shallow review of technical AI safety, 2025
Towards training-time mitigations for alignment faking in RL
Rotations in Superposition
My AGI safety research—2025 review, ’26 plans
Reward Function Design: a starter pack
We need a field of Reward Function Design
An Ambitious Vision for Interpretability
The behavioral selection model for predicting AI motivations
Sydney AI Safety Fellowship 2026 (Priority deadline this Sunday)
[Paper] Difficulties with Evaluating a Deception Detector for AIs
6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa
How Can Interpretability Researchers Help AGI Go Well?
A Pragmatic Vision for Interpretability
Circuit discovery through chain of thought using policy gradients
Alignment remains a hard, unsolved problem
Reasoning Models Sometimes Output Illegible Chains of Thought
Abstract advice to researchers tackling the difficult core problems of AGI alignment
Natural emergent misalignment from reward hacking in production RL
Serious Flaws in CAST
Current LLMs seem to rarely detect CoT tampering
Lessons from building a model organism testbed
Please, Don't Roll Your Own Metaethics
Steering Language Models with Weight Arithmetic
Problems I've Tried to Legibilize
Comparing Payor & Löb
A scheme to credit hack policy gradient training
Geometric UDT
Meta-agentic Prisoner's Dilemmas
Legible vs. Illegible AI Safety Problems
GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash
Anthropic's Pilot Sabotage Risk Report
Steering Evaluation-Aware Models to Act Like They Are Deployed
Reducing risk from scheming by studying trained-in scheming behavior
Rogue internal deployments via external APIs
Current Language Models Struggle to Reason in Ciphered Language
Iterated Development and Study of Schemers (IDSS)
The Thinking Machines Tinker API is good news for AI control and security
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Eliciting secret knowledge from language models
Lectures on statistical learning theory for alignment researchers
Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs
Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications
Synthesizing Standalone World-Models, Part 3: Dataset-Assembly
Synthesizing Standalone World-Models, Part 2: Shifting Structures
Draconian measures can increase the risk of irrevocable catastrophe
Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies
Research Agenda: Synthesizing Standalone World-Models
Video and transcript of talk on giving AIs safe motivations
Focus transparency on risk reports, not safety cases