AI Alignment Forum

Posts

My AGI safety research—2025 review, ’26 plans

Reward Function Design: a starter pack

We need a field of Reward Function Design

An Ambitious Vision for Interpretability

The behavioral selection model for predicting AI motivations

Sydney AI Safety Fellowship 2026 (Priority deadline this Sunday)

[Paper] Difficulties with Evaluating a Deception Detector for AIs

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

How Can Interpretability Researchers Help AGI Go Well?

A Pragmatic Vision for Interpretability

Circuit discovery through chain of thought using policy gradients

Alignment remains a hard, unsolved problem

Reasoning Models Sometimes Output Illegible Chains of Thought

Abstract advice to researchers tackling the difficult core problems of AGI alignment

Natural emergent misalignment from reward hacking in production RL

Serious Flaws in CAST

Current LLMs seem to rarely detect CoT tampering

Lessons from building a model organism testbed

Please, Don't Roll Your Own Metaethics

Steering Language Models with Weight Arithmetic

Problems I've Tried to Legibilize

Comparing Payor & Löb

A scheme to credit hack policy gradient training

Geometric UDT

Meta-agentic Prisoner's Dilemmas

Legible vs. Illegible AI Safety Problems

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Anthropic's Pilot Sabotage Risk Report

Steering Evaluation-Aware Models to Act Like They Are Deployed

Reducing risk from scheming by studying trained-in scheming behavior

Rogue internal deployments via external APIs

Current Language Models Struggle to Reason in Ciphered Language

Iterated Development and Study of Schemers (IDSS)

The Thinking Machines Tinker API is good news for AI control and security

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Eliciting secret knowledge from language models

Lectures on statistical learning theory for alignment researchers

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs

Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications

Synthesizing Standalone World-Models, Part 3: Dataset-Assembly

Synthesizing Standalone World-Models, Part 2: Shifting Structures

Draconian measures can increase the risk of irrevocable catastrophe

Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies

Research Agenda: Synthesizing Standalone World-Models

Video and transcript of talk on giving AIs safe motivations

Focus transparency on risk reports, not safety cases