RSS.Social

AI Alignment Forum

Posts

Risk reports need to address deployment-time spread of misalignment

Mechanistic estimation for expectations of random products

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

Clarifying the role of the behavioral selection model

Mechanistic estimation for wide random MLPs

Motivated reasoning, confirmation bias, and AI risk theory

Risk from fitness-seeking AIs: mechanisms and mitigations

Research Sabotage in ML Codebases

Language models know what matters and the foundations of ethics better than you

From nothing to important actions: agents that act morally

A "Lay" Introduction to "On the Complexity of Neural Computation in Superposition"

$50 million a year for a 10% chance to ban ASI

Five approaches to evaluating training-based control measures

Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability

You can only build safe ASI if ASI is globally banned

Current AIs seem pretty misaligned to me

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

My unsupervised elicitation challenge

My picture of the present in AI

[Paper] Stringological sequence prediction I

There should be $100M grants to automate AI safety

My most common advice for junior researchers

Predicting When RL Training Breaks Chain-of-Thought Monitorability

A Toy Environment For Exploring Reasoning About Reward

Metagaming matters for training, evaluation, and oversight

“Act-based approval-directed agents”, for IDA skeptics

New RFP on Interpretability from Schmidt Sciences

Operationalizing FDT

How well do models follow their constitutions?

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Can governments quickly and cheaply slow AI training?

How to Design Environments for Understanding Model Motives

Schelling Goodness, and Shared Morality as a Goal

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

Will reward-seekers respond to distant incentives?

AXRP Episode 48 - Guive Assadi on AI Property Rights

How do we (more) safely defer to AIs?

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

Distinguish between inference scaling and "larger tasks use more compute"

It Is Reasonable To Research How To Use Model Internals In Training

Increasing AI Strategic Competence as a Safety Approach

“Features” aren’t always the true computational primitives of a model, but that might be fine anyways

Are there lessons from high-reliability engineering for AGI safety?

AlgZoo: uninterpreted models with fewer than 1,500 parameters

New version of “Intro to Brain-Like-AGI Safety”

When should we train against a scheming monitor?

No instrumental convergence without AI psychology

Gradual Paths to Collective Flourishing

Brief Explorations in LLM Value Rankings

Practical challenges of control monitoring in frontier AI deployments

My 2003 Post on the Evolutionary Argument for AI Misalignment

How hard is it to inoculate against misalignment generalization?

The inaugural Redwood Research podcast

Principled Interpretability of Reward Hacking in Closed Frontier Models

Team Shard: Alignment Mentorship from TurnTrout and Alex Cloud

Measuring no CoT math time horizon (single forward pass)

Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance

AIXI with general utility functions: "Value under ignorance in UAI"

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

Shallow review of technical AI safety, 2025

Towards training-time mitigations for alignment faking in RL

Rotations in Superposition

My AGI safety research—2025 review, ’26 plans

Reward Function Design: a starter pack

We need a field of Reward Function Design

An Ambitious Vision for Interpretability

The behavioral selection model for predicting AI motivations

Sydney AI Safety Fellowship 2026 (Priority deadline this Sunday)

[Paper] Difficulties with Evaluating a Deception Detector for AIs

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

How Can Interpretability Researchers Help AGI Go Well?

A Pragmatic Vision for Interpretability

Circuit discovery through chain of thought using policy gradients

Alignment remains a hard, unsolved problem

Reasoning Models Sometimes Output Illegible Chains of Thought

Abstract advice to researchers tackling the difficult core problems of AGI alignment

Natural emergent misalignment from reward hacking in production RL

Serious Flaws in CAST

Current LLMs seem to rarely detect CoT tampering

Lessons from building a model organism testbed

Please, Don't Roll Your Own Metaethics

Steering Language Models with Weight Arithmetic

Problems I've Tried to Legibilize

Comparing Payor & Löb

A scheme to credit hack policy gradient training

Geometric UDT

Meta-agentic Prisoner's Dilemmas

Legible vs. Illegible AI Safety Problems

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Anthropic's Pilot Sabotage Risk Report

Steering Evaluation-Aware Models to Act Like They Are Deployed

Reducing risk from scheming by studying trained-in scheming behavior

Rogue internal deployments via external APIs

Current Language Models Struggle to Reason in Ciphered Language

Iterated Development and Study of Schemers (IDSS)

The Thinking Machines Tinker API is good news for AI control and security

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Eliciting secret knowledge from language models

Lectures on statistical learning theory for alignment researchers

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs

Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications

Synthesizing Standalone World-Models, Part 3: Dataset-Assembly

Synthesizing Standalone World-Models, Part 2: Shifting Structures

Draconian measures can increase the risk of irrevocable catastrophe

Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies

Research Agenda: Synthesizing Standalone World-Models

Video and transcript of talk on giving AIs safe motivations

Focus transparency on risk reports, not safety cases