RSS.Social

AI Alignment Forum

Posts

Risk reports need to address deployment-time spread of misalignment

Mechanistic estimation for expectations of random products

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

Clarifying the role of the behavioral selection model

Mechanistic estimation for wide random MLPs

Motivated reasoning, confirmation bias, and AI risk theory

Risk from fitness-seeking AIs: mechanisms and mitigations

Research Sabotage in ML Codebases

Language models know what matters and the foundations of ethics better than you

From nothing to important actions: agents that act morally

A "Lay" Introduction to "On the Complexity of Neural Computation in Superposition"

$50 million a year for a 10% chance to ban ASI

Five approaches to evaluating training-based control measures

Prompted CoT Early Exit Undermines the Monitoring Benefits of CoT Uncontrollability

You can only build safe ASI if ASI is globally banned

Current AIs seem pretty misaligned to me

Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes

My unsupervised elicitation challenge

My picture of the present in AI

[Paper] Stringological sequence prediction I

There should be $100M grants to automate AI safety

My most common advice for junior researchers

Predicting When RL Training Breaks Chain-of-Thought Monitorability

A Toy Environment For Exploring Reasoning About Reward

Metagaming matters for training, evaluation, and oversight

“Act-based approval-directed agents”, for IDA skeptics

New RFP on Interpretability from Schmidt Sciences

Operationalizing FDT

How well do models follow their constitutions?

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Can governments quickly and cheaply slow AI training?

How to Design Environments for Understanding Model Motives

Schelling Goodness, and Shared Morality as a Goal

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

Will reward-seekers respond to distant incentives?

AXRP Episode 48 - Guive Assadi on AI Property Rights

How do we (more) safely defer to AIs?

Research note: A simpler AI timelines model predicts 99% AI R&D automation in ~2032

Distinguish between inference scaling and "larger tasks use more compute"

It Is Reasonable To Research How To Use Model Internals In Training

Increasing AI Strategic Competence as a Safety Approach

“Features” aren’t always the true computational primitives of a model, but that might be fine anyways

Are there lessons from high-reliability engineering for AGI safety?

AlgZoo: uninterpreted models with fewer than 1,500 parameters

New version of “Intro to Brain-Like-AGI Safety”

When should we train against a scheming monitor?

No instrumental convergence without AI psychology

Gradual Paths to Collective Flourishing

Brief Explorations in LLM Value Rankings

Practical challenges of control monitoring in frontier AI deployments

My 2003 Post on the Evolutionary Argument for AI Misalignment

How hard is it to inoculate against misalignment generalization?

The inaugural Redwood Research podcast

Principled Interpretability of Reward Hacking in Closed Frontier Models

Team Shard: Alignment Mentorship from TurnTrout and Alex Cloud

Measuring no CoT math time horizon (single forward pass)

Recent LLMs can use filler tokens or problem repeats to improve (no-CoT) math performance

AIXI with general utility functions: "Value under ignorance in UAI"

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

Shallow review of technical AI safety, 2025

Towards training-time mitigations for alignment faking in RL

Rotations in Superposition

My AGI safety research—2025 review, ’26 plans

Reward Function Design: a starter pack

We need a field of Reward Function Design

An Ambitious Vision for Interpretability

The behavioral selection model for predicting AI motivations

Sydney AI Safety Fellowship 2026 (Priority deadline this Sunday)

[Paper] Difficulties with Evaluating a Deception Detector for AIs

6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa

How Can Interpretability Researchers Help AGI Go Well?

A Pragmatic Vision for Interpretability

Circuit discovery through chain of thought using policy gradients

Alignment remains a hard, unsolved problem

Reasoning Models Sometimes Output Illegible Chains of Thought

Abstract advice to researchers tackling the difficult core problems of AGI alignment

Natural emergent misalignment from reward hacking in production RL

Serious Flaws in CAST

Current LLMs seem to rarely detect CoT tampering

Lessons from building a model organism testbed

Please, Don't Roll Your Own Metaethics

Steering Language Models with Weight Arithmetic

Problems I've Tried to Legibilize

Comparing Payor & Löb

A scheme to credit hack policy gradient training

Geometric UDT

Meta-agentic Prisoner's Dilemmas

Legible vs. Illegible AI Safety Problems

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

Anthropic's Pilot Sabotage Risk Report

Steering Evaluation-Aware Models to Act Like They Are Deployed

Reducing risk from scheming by studying trained-in scheming behavior

Rogue internal deployments via external APIs

Current Language Models Struggle to Reason in Ciphered Language

Iterated Development and Study of Schemers (IDSS)

The Thinking Machines Tinker API is good news for AI control and security

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Eliciting secret knowledge from language models

Lectures on statistical learning theory for alignment researchers

Comparative Analysis of Black Box Methods for Detecting Evaluation Awareness in LLMs

Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications

Synthesizing Standalone World-Models, Part 3: Dataset-Assembly

Synthesizing Standalone World-Models, Part 2: Shifting Structures

Draconian measures can increase the risk of irrevocable catastrophe

Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies

Research Agenda: Synthesizing Standalone World-Models

Video and transcript of talk on giving AIs safe motivations

Focus transparency on risk reports, not safety cases