RSS.Social

AI Alignment Forum

follow: @[email protected]

Posts

Will reward-seekers respond to distant incentives?

AXRP Episode 48 - Guive Assadi on AI Property Rights

How do we (more) safely defer to AIs?

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

Distinguish between inference scaling and "larger tasks use more compute"

It Is Reasonable To Research How To Use Model Internals In Training

Increasing AI Strategic Competence as a Safety Approach

“Features” aren’t always the true computational primitives of a model, but that might be fine anyways

Are there lessons from high-reliability engineering for AGI safety?

AlgZoo: uninterpreted models with fewer than 1,500 parameters

New version of “Intro to Brain-Like-AGI Safety”

When should we train against a scheming monitor?

No instrumental convergence without AI psychology

Gradual Paths to Collective Flourishing

Brief Explorations in LLM Value Rankings

Practical challenges of control monitoring in frontier AI deployments

My 2003 Post on the Evolutionary Argument for AI Misalignment

How hard is it to inoculate against misalignment generalization?

The inaugural Redwood Research podcast

Principled Interpretability of Reward Hacking in Closed Frontier Models

Team Shard: Alignment Mentorship from TurnTrout and Alex Cloud

Measuring no-CoT math time horizon (single forward pass)

Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance

AIXI with general utility functions: "Value under ignorance in UAI"

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

Shallow review of technical AI safety, 2025

Towards training-time mitigations for alignment faking in RL

Rotations in Superposition

My AGI safety research—2025 review, ’26 plans

Reward Function Design: a starter pack

We need a field of Reward Function Design

An Ambitious Vision for Interpretability

The behavioral selection model for predicting AI motivations

Sydney AI Safety Fellowship 2026 (Priority deadline this Sunday)

[Paper] Difficulties with Evaluating a Deception Detector for AIs

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

How Can Interpretability Researchers Help AGI Go Well?

A Pragmatic Vision for Interpretability

Circuit discovery through chain of thought using policy gradients

Alignment remains a hard, unsolved problem

Reasoning Models Sometimes Output Illegible Chains of Thought

Abstract advice to researchers tackling the difficult core problems of AGI alignment

Natural emergent misalignment from reward hacking in production RL

Serious Flaws in CAST

Current LLMs seem to rarely detect CoT tampering

Lessons from building a model organism testbed

Please, Don't Roll Your Own Metaethics

Steering Language Models with Weight Arithmetic

Problems I've Tried to Legibilize

Comparing Payor & Löb

A scheme to credit hack policy gradient training

Geometric UDT

Meta-agentic Prisoner's Dilemmas

Legible vs. Illegible AI Safety Problems

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Anthropic's Pilot Sabotage Risk Report

Steering Evaluation-Aware Models to Act Like They Are Deployed

Reducing risk from scheming by studying trained-in scheming behavior

Rogue internal deployments via external APIs

Current Language Models Struggle to Reason in Ciphered Language

Iterated Development and Study of Schemers (IDSS)

The Thinking Machines Tinker API is good news for AI control and security

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Eliciting secret knowledge from language models

Lectures on statistical learning theory for alignment researchers

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs

Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications

Synthesizing Standalone World-Models, Part 3: Dataset-Assembly

Synthesizing Standalone World-Models, Part 2: Shifting Structures

Draconian measures can increase the risk of irrevocable catastrophe

Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies

Research Agenda: Synthesizing Standalone World-Models

Video and transcript of talk on giving AIs safe motivations

Focus transparency on risk reports, not safety cases