AI Alignment Forum
My AGI safety research—2025 review, ’26 plans
Reward Function Design: a starter pack
We need a field of Reward Function Design
An Ambitious Vision for Interpretability
The behavioral selection model for predicting AI motivations
Sydney AI Safety Fellowship 2026 (Priority deadline this Sunday)
[Paper] Difficulties with Evaluating a Deception Detector for AIs
6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa
How Can Interpretability Researchers Help AGI Go Well?
A Pragmatic Vision for Interpretability
Circuit discovery through chain of thought using policy gradients
Alignment remains a hard, unsolved problem
Reasoning Models Sometimes Output Illegible Chains of Thought
Abstract advice to researchers tackling the difficult core problems of AGI alignment
Natural emergent misalignment from reward hacking in production RL
Serious Flaws in CAST
Current LLMs seem to rarely detect CoT tampering
Lessons from building a model organism testbed
Please, Don't Roll Your Own Metaethics
Steering Language Models with Weight Arithmetic
Problems I've Tried to Legibilize
Comparing Payor & Löb
A scheme to credit hack policy gradient training
Geometric UDT
Meta-agentic Prisoner's Dilemmas
Legible vs. Illegible AI Safety Problems
GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash
Anthropic's Pilot Sabotage Risk Report
Steering Evaluation-Aware Models to Act Like They Are Deployed
Reducing risk from scheming by studying trained-in scheming behavior
Rogue internal deployments via external APIs
Current Language Models Struggle to Reason in Ciphered Language
Iterated Development and Study of Schemers (IDSS)
The Thinking Machines Tinker API is good news for AI control and security
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Eliciting secret knowledge from language models
Lectures on statistical learning theory for alignment researchers
Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs
Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications
Synthesizing Standalone World-Models, Part 3: Dataset-Assembly
Synthesizing Standalone World-Models, Part 2: Shifting Structures
Draconian measures can increase the risk of irrevocable catastrophe
Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies
Research Agenda: Synthesizing Standalone World-Models
Video and transcript of talk on giving AIs safe motivations
Focus transparency on risk reports, not safety cases