AI Alignment Forum
Will reward-seekers respond to distant incentives?
AXRP Episode 48 - Guive Assadi on AI Property Rights
How do we (more) safely defer to AIs?
Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032
Distinguish between inference scaling and "larger tasks use more compute"
It Is Reasonable To Research How To Use Model Internals In Training
Increasing AI Strategic Competence as a Safety Approach
“Features” aren’t always the true computational primitives of a model, but that might be fine anyways
Are there lessons from high-reliability engineering for AGI safety?
AlgZoo: uninterpreted models with fewer than 1,500 parameters
New version of “Intro to Brain-Like-AGI Safety”
When should we train against a scheming monitor?
No instrumental convergence without AI psychology
Gradual Paths to Collective Flourishing
Brief Explorations in LLM Value Rankings
Practical challenges of control monitoring in frontier AI deployments
My 2003 Post on the Evolutionary Argument for AI Misalignment
How hard is it to inoculate against misalignment generalization?
The inaugural Redwood Research podcast
Principled Interpretability of Reward Hacking in Closed Frontier Models
Team Shard: Alignment Mentorship from TurnTrout and Alex Cloud
Measuring no CoT math time horizon (single forward pass)
Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance
AIXI with general utility functions: "Value under ignorance in UAI"
2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target
Shallow review of technical AI safety, 2025
Towards training-time mitigations for alignment faking in RL
Rotations in Superposition
My AGI safety research—2025 review, ’26 plans
Reward Function Design: a starter pack
We need a field of Reward Function Design
An Ambitious Vision for Interpretability
The behavioral selection model for predicting AI motivations
Sydney AI Safety Fellowship 2026 (Priority deadline this Sunday)
[Paper] Difficulties with Evaluating a Deception Detector for AIs
6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa
How Can Interpretability Researchers Help AGI Go Well?
A Pragmatic Vision for Interpretability
Circuit discovery through chain of thought using policy gradients
Alignment remains a hard, unsolved problem
Reasoning Models Sometimes Output Illegible Chains of Thought
Abstract advice to researchers tackling the difficult core problems of AGI alignment
Natural emergent misalignment from reward hacking in production RL
Serious Flaws in CAST
Current LLMs seem to rarely detect CoT tampering
Lessons from building a model organism testbed
Please, Don't Roll Your Own Metaethics
Steering Language Models with Weight Arithmetic
Problems I've Tried to Legibilize
Comparing Payor & Löb
A scheme to credit hack policy gradient training
Geometric UDT
Meta-agentic Prisoner's Dilemmas
Legible vs. Illegible AI Safety Problems
GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash
Anthropic's Pilot Sabotage Risk Report
Steering Evaluation-Aware Models to Act Like They Are Deployed
Reducing risk from scheming by studying trained-in scheming behavior
Rogue internal deployments via external APIs
Current Language Models Struggle to Reason in Ciphered Language
Iterated Development and Study of Schemers (IDSS)
The Thinking Machines Tinker API is good news for AI control and security
Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Eliciting secret knowledge from language models
Lectures on statistical learning theory for alignment researchers
Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs
Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications
Synthesizing Standalone World-Models, Part 3: Dataset-Assembly
Synthesizing Standalone World-Models, Part 2: Shifting Structures
Draconian measures can increase the risk of irrevocable catastrophe
Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies
Research Agenda: Synthesizing Standalone World-Models
Video and transcript of talk on giving AIs safe motivations
Focus transparency on risk reports, not safety cases