AI Alignment Forum
AIs will greatly change engineering in AI companies well before AGI
Large Language Models and the Critical Brain Hypothesis
Decision Theory Guarding is Sufficient for Scheming
Safety cases for Pessimism
How Can You Tell if You've Instilled a False Belief in Your LLM?
Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences
Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
How To Become A Mechanistic Interpretability Researcher
Attaching requirements to model releases has serious downsides (relative to a different deadline for these requirements)
AI companies have started saying safeguards are load-bearing
AI Induced Psychosis: A shallow investigation
Harmless reward hacks can generalize to misalignment in LLMs
Do-Divergence: A Bound for Maxwell's Demon
New Paper on Reflective Oracles & Grain of Truth Problem
Notes on cooperating with unaligned AIs
(∃ Stochastic Natural Latent) Implies (∃ Deterministic Natural Latent)
An Introduction to Credal Sets and Infra-Bayes Learnability
Proof Section to an Introduction to Credal Sets and Infra-Bayes Learnability
Resampling Conserves Redundancy (Approximately)