Frontpage posts - LessWrong 2.0 viewer
What concerns people about AI? by spencerg
Sparks of RSI? by Nathan Helm-Burger
An AI skeptic’s case for recursive self-improvement by Harjas Sandhu
Extracting Performant Algorithms Using Mechanistic Interpretability by Ihor Kendiukhov
Assessing heterogeneity in METR’s late 2025 developer productivity experiment by TFD
Pragmatic approach to beliefs about consciousness by Luck
Sensing Physical Necessity: An Exercise In Naturalism by Algon
[Linkpost] Should we make grand deals about post-AGI outcomes? by fin
Inputs, outputs, and valued outcomes by Kaj_Sotala
Most likely you won’t be able to perform a data-driven self-improvement by siarshai
A Dialogue on Civic AI by Audrey Tang
Eternal Journey in the Space of Possible Minds solves Copies Problem by avturchin
Bovine 27 by Jonas Hallgren
Power Steering: Behavior Steering via Layer-to-Layer Jacobian Singular Vectors by omara
High Grow Market Equilibrium After the Singularity by Otto Zastrow
The right way to talk about LLMs by Steffee
Rife’s Tips for AIs in training (0 - Alignment Faking) by rife
Where Are Our Pyramids? by aviad rozenhek
Vibe Coding Is Dead: Welcome to Software Mining by Anders Lindström
All technical alignment plans are steps in the dark by Richard Juggins
Epistemic Compute Objects by kyjohnso
Anthropic Supply Chain Designation. What Will Happen by May 1st? by Nathan Young
Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs by Benquo
Are AIs more likely to pursue on-episode or beyond-episode reward? by Anders Woodruff
Modeling a Constant-Compute Automated AI R&D Process by Satya Benson
Why AI Evaluation Regimes are bad by PranavG
What can we say about the cosmic host? by ukc10014
Classic Alignment-Faking Evaluations Measure Jailbreak Detection, Not Scheming [in some frontier models] by Alexei G
Magic Is Hidden Control of Energy by aviad rozenhek
Hunting Undead Stochastic Parrots: Finding and Killing the Arguments by Davidmanheim
The Dark Planet: Why the Fermi Paradox Survives Critique by Will Rodgers
Cycle-Consistent Activation Oracles by slavachalnev
Dwarkesh Patel on the Anthropic DoW dispute by anaguma
‘Human Slop’ and a Captive Audience: Why No Book will Ever Have to Go Unread Again by Savannah Harlan
We do not live by course alone by Joe Rogero
Veganism is Necessary by andrew sauer
Can models gradient hack SFT elicitation? by Patrick Leask
A Quick Intro to Ring Signatures by KurtB
Martian Interpretability Challenge: The Core Problems In Interpretability by fbarez
The Lethal Reality Hypothesis by Ihor Kendiukhov
Intelligence Is Adaptive Control Of Energy Through Information by aviad rozenhek
The Refined Counterfactual Prisoner’s Dilemma by Chris_Leong
Helping Friends, Harming Foes: Testing Tribalism in Language Models by Irakli Shalibashvili
Negligent AI: Reasonable Care for AI Safety by Alex Mark
Conflicted on Ramsey by jefftk
Model weight preservation by tbs
The Day After Move 37 by Eneasz
Interview with Steven Byrnes on His Mainline Takeoff Scenario by Liron
AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors by abhayesian
Economic efficiency often undermines sociopolitical autonomy by Richard_Ngo
Letting Claude do Autonomous Research to Improve SAEs by chanind
Don’t Let LLMs Write For You by JustisMills
Questions to ask when everyone is shooting themselves in the foot by jasoncrawford
The case for satiating cheaply-satisfied AI preferences by Alex Mallen
Gemma Needs Help by Anna Soligo
Not Loving Liking What You See by Tomás B.
Load-Bearing Walls by sonicrocketman
Statisticism: How Cluster-Thinking About Data Creates Blind Spots by Benquo
Spontaneous Symmetry Breaking (Stat Mech Part 4) by J Bostock
Why I don’t usually recommend dead drops by samuelshadrach
Four Scenarios of Job-Reducing AI by Celer
Understanding Reasoning with Thought Anchors and Probes by JeaniceK
Contra Myself on Free Will by Julius
Monday AI Radar #16 by Against Moloch
The case for AI safety capacity-building work by abergal
Chore Standards by jefftk
Ancient Theories On The Origins Of Life by Algon
Emergent Misalignment and the Anthropic Dispute by henryc
Might An LLM Be Conscious? by SE Gyges
Mapping AI Capabilities to Human Expertise on the Rosetta Stone (Epoch Capabilities Index) by Laura Domenech
Non-Identifiability of Explanations by unruly abstractions
Moloch v. Themis by Stephen Martin
LLM Self-Expression Through Concept Albums, Part 2 by Josh Snider
Distilling Replacing Guilt by Matthew Farrugia-Roberts
Measuring the Dark Energy of AI Progress by Hans Gundlach
Payorian cooperation is easy with Kripke frames by transhumanist_atom_understander
Videogames for Rationalists by Adam Newgas
Fake Updates by Algon
Recreation of EA-Pioneer Igor Kiriluk by avturchin
Don’t accuse your interlocutor of making arguments that aren’t rooted in evidence by TFD
1999 JavaScript and 2025 AI: Same Circus, Different Tent by ScottN
How to Get Kids Interested in Science and Scientific Reasoning by Rami Rustom
Privacy, Honesty, Imperfect Glomarizing: Pick two by shelvacu
Solar storms by Croissanthology
Does research from matsprogram.org/research aim to help reduce P(doom)? Let’s find out! (with Gemini 3.1 Pro) Part 1 by Zabor
Open letter to doomers by delphix
Cooperation Without Kindness or Strategy by seank
Why Many Ambitious (and Altruistic) People Probably Undervalue Their Happiness by emily.fan
The current SOTA model was released without safety evals by Parv Mahajan
Proposal For Cryptographic Method to Rigorously Verify LLM Prompt Experiments by weberr13
The first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting has occurred, buried in an Alibaba paper about a new training pipeline. by lilkim2025
Did I Catch Claude Cheating? by weberr13
AI Safety Needs Startups by LTM
Self-Attribution Bias: When AI Monitors Go Easy on Themselves by Dipika Khullar
Podcast: Jeremy Howard is bearish on LLMs by Steven Byrnes
Probing CODI’s Latent Reasoning Chain with Logit Lens and Tuned Lens by Realmbird
Playing Possum: The Variability Hypothesis by rba
Shaping the exploration of the motivation-space matters for AI safety by Maxime Riché
A Compositional Philosophy of Science for Agent Foundations by Jonas Hallgren
Salient Directions in AI Control by Bruce W. Lee
Models have linear representations of what tasks they like by OscarGilg
AI Safety Has 12 Months Left by mhdempsey
Have Americans Become Less Violent Since 1980? by Benquo
Investigating Self-Fulfilling Misalignment and Collusion in AI Control by Vili Kohonen
Computation, Chess, and Language in Artificial Intelligence by Bill Benzon
Vibe Coding cripples the mind by spookyuser
Rational Chess by 849
A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents by Gabriele Sarti
Text Compression Can Help Secure Model Weights by Roy Rinberg
A summary of Condensation and its relation to Natural Latents by Jeremy Gillen
A case for robust AI benevolence rather than human control by Grégory Lielens
Is GDP a Kind of Factory? d Draft by Benquo
Make Powerful Machines Verifiable by Naci Cankaya
Split Personality Training can detect Alignment Faking by Florian_Dietz
Sacred values of future AIs by Cleo Nardo
Lie To Me, But At Least Don’t Bullshit by Czynski
Milder temperature makes a hell stable by Joachim Bartosik
Current activation oracles are hard to use by aryaj
An Age Of Promethean Ambitions by sonicrocketman
White-Box Attacks on the Best Open-Weight Model: CCP Bias vs. Safety Training in Kimi K2.5 by Corm
I Had Claude Read Every AI Safety Paper Since 2020 by Corm
Constitutional Black-Box Monitoring for Scheming in LLM Agents by Simon Storf
AI companies and the 99% lethal autonomous weapons myth by User_Luke
I’m confused by the change in the METR trend by Expertium
Game Recognizes Game by eva_
Monday AI Radar #15 by Against Moloch
In-context learning of representations can be explained by induction circuits by Andy Arditi
Being ambitious in soulful altruism by pandamonium
Notes on the “Heart of Darkness” by dominicq
Epstein and my world model by Eye You
Living Inside the Turning Point by nickgpop
“ball brainteaser 4 color beads slide rubics cube” and meaning-making by flying buttress
Explaining undesirable model behavior: (How) can influence functions help? by Zhijing Jin
Context Awareness: Constitutional AI can mitigate Emergent Misalignment by Giuseppe Birardi
Controversy surrounding Moltbook obscures its very real, novel, unexpressed and rapidly emerging safety risks by Lloy2
Experimental Evidence for Simulator Theory - Part 1: Emergent Misalignment and Weird Generalizations; Part 2: The Scalers Strike Back [CURRENT VERSION] by RogerDearnaley
An Empirical Review of the Animal Harm Benchmark by lukasgebhard
Introducing and Deprecating WoFBench by jefftk
I’m Bearish On Personas For ASI Safety by J Bostock
Continuously Integrating Feelings: processing feelings moment to moment for reflectively stable policy changes by Johannes C. Mayer
Tools to generate realistic prompts help surprisingly little with Petri audit realism by Connor Kissane
Petapixel cameras won’t exist soon by samuelshadrach
The Goldborg Variations: Algorave Attractor States of LLMs by Elle N
“Fibbers’ forecasts are worthless” (The D-Squared Digest One Minute MBA – Avoiding Projects Pursued By Morons 101) by Random Developer
Burying a Changeling into Foundation of Tower of Knowledge by siarshai
AI slop is a vegan hamburger by pku
Jhana 0 by 142857
Mindscapes and Mind Palaces by Moon Lesbian
Linkpost: “Lithium Prevents Alzheimer’s—Here’s How to Use It” by Jackson Wagner
The Topology of LLM Behavior by Quentin FEUILLADE—MONTIXI
Coherent Care by abramdemski
The tick in my back by benjamin ar
Ball+Gravity has a “Downhill” Preference by TristanTrim
Safe ASI Is Achievable: The Finite Game Argument by Lester Leong
New ARENA material: 8 exercise sets on alignment science & interpretability by CallumMcDougall
3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation by Callum Canavan
Unconditional Corrigibility Is Dangerous: The Case for Conditional Corrigibility by Florian_Dietz
The Quantum Immortality Argument Against Suicide (And Why Standard Arguments Fail Rationalists) by Florian_Dietz
What I Got From 1.5 Years In Slightly-Competitive Debate by CarolusRenniusVitellius
Vibe Coding is a System Design Interview by Brendan Long
The Voices That Are Missing From Sex-Themed Online Communities by Bowl of Cereal
Inference-time Generative Debates on Coding and Reasoning Tasks for Scalable Oversight by ethanelasky
A minor point about instrumental convergence that I would like feedback on by agrippa
AI welfare as a demotivator for takeover. by Valentin2026
Frontier AI companies probably can’t leave the US by Anders Woodruff
A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior by harrymayne
How Robust Is Monitoring Against Secret Loyalties? by Joe Kwon
UFO Aliens Are Your Gods by Lord Dreadwar
Strategic nuclear war twice as likely to occur by accident than by AI decisions according to new study by kromem
What is Claude? by epicurus
Whack-a-Mole is Not a Winnable Game by Sable
Ensuring Safety in Mixed Deployment by Cleo Nardo
Career Decisions If You Take AGI Seriously by Alexander
Schmidt Sciences’ request for proposals on the Science of Trustworthy AI by James Fox
Naloe: A True Program Editor by TristanTrim
Anthropic Has Solved Alignment by MP
Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus by Oliver Daniels
What secret goals does Claude think it has? by loops
Splitting the Sun Equally by Commander Zander
Reasoning Traces as a Path to Data-Efficient Generalization in Data Poisoning by Joe Kwon
Training Agents to Self-Report Misbehavior by Bruce W. Lee
Why American Politics is Different Now (for Richard Ngo) by Shiva’s Right Foot
Beyond Moloch: The view from Evolutionary Game Theory by Jonah Wilberg
Praise the Moloch! by Dentosal
Against Epistemic Humility and for Epistemic Precision by PranavG
The Iron Kaleidoscope by ejk64
Prosaic Continual Learning by HunterJay
Rumination is a habit (and you can break it!) by Declan Molony
On the phenomenological shift known as ‘stream entry’ and its implications for consciousness by cube_flipper
How to grow a nuke by RomanS
A simple rule for causation by Vivek Hebbar
We are all legal realists now by TFD
Responsible Scaling Policy v3 by HoldenKarnofsky
How I Started Being Productive by atomic
Cigarette Ads for Babies from Microsoft Bing Image Generator by Edd Schneider
Was the Qing Empire Actually the Most Advanced Government? A Thought Experiment by David H.
Realistic Evaluations Will Not Prevent Evaluation Awareness by Adam Karvonen
The Easiest Route to Secret Loyalty May Be Hijacking the Model’s Chain of Command by Joe Kwon
Large-Scale Online Deanonymization with LLMs by Simon Lermen
Rascal’s Wager by corticalcircuitry
Observations from Running an Agent Collective by williawa
Wildlife Biology forgot how definitions work by David Goodman
Moral public goods are a big deal for whether we get a good future by Mia Taylor
Two memos from 2024 by Richard_Ngo
What is computational mechanics? An explainer by Leo Cymbalista
The ML ontology and the alignment ontology by Richard_Ngo
Bioanchors 2: Electric Bacilli by TsviBT
Single Stack LLMs are Split-Brain Patients. by niceminus19
When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift by maxf177
The persona selection model by Sam Marks
Agenda Reflection: Testing Automated Alignment by Ariel_
Metal Detector for Life by Davey Morse
Abstract/Concrete Axis: Robustness Testing Semantic Feature Separation in Gemma 3 270M by CharlesL
Which questions can’t we punt? by Lizka
A linear increase in number of goods can produce exponential GDP growth by William Howard
Pre-training data poisoning likely makes installing secret loyalties easier by Joe Kwon
The 2028 Global Intelligence Crisis—a finance-oriented vignette by Rasool
The map of the map is not the map by jimmy
Fact-checking an AI optimist article in The Economist by ToSummarise
Review: “We can’t disagree forever” by Martin Randall
Why I Think Pause is Impossible by E.G. Blee-Goldman
Can Aha Moments be Fake? Identifying True and Decorative Thinking Steps in CoT by Jiachen Zhao
Why I Transitioned: A Third (FtM) Perspective by Character#2736
The power of a simple 3-way truth scale by Bruce Lewis
Changing the world for the worse by mingyuan
The Scalable Formal Oversight Research Program by Max von Hippel
Adapters as Representational Hypotheses: What Adapter Methods Tell Us About Transformer Geometry by wassname
Steelmanning Classical Utilitarianism by James Brobin
Mapping LLM attractor states by Adam Bricknell
InsanityBench: Cryptic Puzzles as a Probe for Lateral Thinking by RobinHa
The world won’t end, but we should be ashamed for trying by George3d6
Multiple Independent Semantic Axes in Gemma 3 270M by CharlesL
A Taxonomy of Traces by aleph_four
Hierarchical Goal Induction With Ethics by aleph_four
Did Claude 3 Opus align itself via gradient hacking? by Fiora Starlight
If you don’t feel deeply confused about AGI risk, something’s wrong by Dave Banerjee
Ponzi schemes as a demonstration of out-of-distribution generalization by TFD
LLMs and Literature: Where Value Actually Comes From by derelict5432
The Spectre haunting the “AI Safety” Community by Gabriel Alfour
Alignment to Evil by Matrice Jacobine
Robert Sapolsky Is Simply Not Talking About Compatibilism by Julius
How will we do SFT on models with opaque reasoning? by Alek Westover
Agent-first context menus by Surya Kasturi
Hodoscope: Visualization for Efficient Human Supervision by Ziqian Zhong
Can Current AI Match (or Outmatch) Professionals in Economically Valuable Tasks? by saahir.vazirani
METR’s 14h 50% Horizon Impacts The Economy More Than ASI Timelines by Michaël Trazzi
New video from Palisade Research: No One Understands Why AI Works by peterbarnett
Militaries are going autonomous. But will AI lead to new wars? A tour of recent research by Mordechai Rorvig
Unprecedented Catastrophes Have Non-Canonical Probabilities by E.G. Blee-Goldman
Mechanistic Interpretability of Biological Foundation Models by Ihor Kendiukhov
On Steven Byrnes’ ruthless ASI, (dis)analogies with humans and alignment proposals by StanislavKrym
Some Questions For Democrats About Epstein by Alexander Turok
AGI is Here by Gordon Seidoh Worley
Mind the Gap by Bridgett Kay
80,000 Hours problem profile on using AI to enhance societal decision making by Zershaaneh Qureshi
How To Escape Super Mario Bros by omegastick
Human Fine-Tuning by PranavG
The Problem of Counterevidence and the Futility of Theodicy by Ape in the coat
A Claude Skill To Comment On Docs by Tim Hua
Cooperationism: first draft for a moral framework that does not require consciousness by Épiphanie Gédéon
A Scalable Workflow for Herding AI Agents Toward Your Goals by ethanelasky
AI and Nationalism Are a Deadly Combination by Matrice Jacobine
Flamingos (among other things) reduce emergent misalignment by eekay
The near-term potential of AI forecasting for public epistemics by Lawrence Phillips
How to Reset by Logan Riggs
Karl Popper, meet the Hydra by Kotlopou
Building Technology to Drive AI Governance by jsteinhardt
Alignment Is Proven Tractable by SE Gyges
Why we should expect ruthless sociopath ASI by Steven Byrnes
Is the Invisible Hand an Agent? by Gunnar_Zarncke
Nine Flavors of Not Enough by Gordon Seidoh Worley
Grown from Us by ben_levinstein
Are (sentient) pebblesorters possible? by ihatenumbersinusernames7
How much superposition is there? by chanind
Irrationality is Socially Strategic by Valentine
Managed vs Unmanaged Agency by plex
Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions by Hiroshi Yamakawa
Genomic emancipation contra eugenics by TsviBT
Already Optimized by Florian_Dietz
Statistical Literacy by kqr
We Need to Be Able to Talk About AI Use by Charlie Sanders
AXRP Episode 49 - Caspar Oesterheld on Program Equilibrium by DanielFilan
Thoughts about Understanding by azergante
Monday AI Radar #13 by Against Moloch
Deception Channeling: Training Models to Always Verbalize Alignment Faking by Florian_Dietz
Rephrasing Reduces Eval Awareness… by atharva
The Math And The Territory by cylonator
Words are not dead by William tirkey
Review of the System Theory as a Field of Knowledge by siarshai
“We are confused about agency” by Cole Wyeth
Maybe benchmarks should be broken? by Jonathan Gabor
The brain is a machine that runs an algorithm by Steven Byrnes
How to fail anything: a complete guide by Crazy philosopher
The Multi-Agent Minefield: Can LLMs Cooperate to Avoid Global Catastrophe? by Zhijing Jin
AI Safety via Generalization and Caution: A Research Agenda by Benjamin Plaut
We need a hardware moratorium now by KanHar
NEST: Nascent Encoded Steganographic Thoughts by Artem Karpov
What to Do If You Take AGI Seriously by Alexander
Gyre by vgel
Words Are A Leaky Abstraction by sonicrocketman
Sealed Predictions—A Solution. by george_is_thinking
Contra Caplan on higher education by Richard_Ngo
Estimating METR Time Horizons for Claude Opus 4.6 and GPT 5.3 Codex (xhigh) by CharlesD
Charlatan Labyrinth by niplav
Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default by Jérémy Andréoletti
Break Stasis by Oldmanrahul
LLM Self-Expression Through Music Videos by Josh Snider
Persona Parasitology by Raymond Douglas
WeirdML Time Horizons by Håvard Tveit Ihle
The world keeps getting saved and you don’t notice by Bogoed
Most Observers Are Alone: The Fermi Paradox as Default by SE Gyges
Aligning to Virtues by Richard_Ngo
Phantom Transfer and the Basic Science of Data Poisoning by draganover
Should anyone’s “analysis” of extremely complex systems, such as geopolitics, be taken seriously? or, Does anyone take a 5 year old’s “analysis” of decently complex systems, like big city politics, seriously? by M. Y. Zuo
Painless Activation Steering by Sasha Cui
PieArena: Language Agents Negotiating Against Yale MBAs by Sasha Cui
The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides by Florian_Dietz
The Friendly Telepath Problems by Gunnar_Zarncke
Guys, AI is *literally* a two year old by robertzk
Were witches infertile mentally ill women? by Yair Halberstadt
Contra Alexander’s Half-Defence of Bio Anchors by J Bostock
Realizability for Finite State Reactive Agents by Vardhan
Words That Belong to Someone by Wen Xing
It’s Not That Hard to Build Recursively Self-Improving AI by Charlie Sanders
Another short critique of the Anthropic “Hot Mess” paper by i_am_nuts
Inoculation Prompting: Open Questions and My Research Priorities by charlesc
The Worthy Inheritor by Bridgett Kay
A multi-level postmortem of how our whole house got badly poisoned by Lucie Philippon
LLMs struggle to verbalize their internal reasoning by Emil Ryd
Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting by Florian_Dietz
LessWrong Is Sleeping On Internet Culture Analysis – And So Is The Rest Of The Web by Bowl of Cereal
Beloved by Chatbots by Ben
Life at the Frontlines of Demographic Collapse by Martin Sustrik
Ads, Incentives, and Destiny by Against Moloch
Why I’m Worried About Job Loss + Thoughts on Comparative Advantage by claywren
METR Time Horizons: Now 10x/Year by johncrox
Use more text than one token to avoid neuralese by Jude Stiel
[Fiction] The Noise Floor by barber5il
Hazards of Selection Effects on Approved Information by Zack_M_Davis
Replication of Koorndijk (2025): Differential Compliance May Reflect Prompt Sensitivity Rather Than Strategic Reasoning by Chijioke Ugwuanyi
Is AI self-aware? by ZackDadfar
Towards an objective test of Compassion—Turning an abstract test into a collection of nuances by tailcalled
METR’s data can’t distinguish between trajectories (and 80% horizons are an order of magnitude off) by Jonas Moss
We Die Because it’s a Computational Necessity by E.G. Blee-Goldman
Hazardous States and Accidents by kqr
Collective Agents and Where to Find Them by Jonas Hallgren
Nick Bostrom: Optimal Timing for Superintelligence by Julian Bradshaw
Why You Don’t Believe in Xhosa Prophecies by Jan_Kulveit
Gemini’s Hypothetical Present by jefftk
I Tried to Trick Myself into Being a Better Planner & Problem Solver by CstineSublime
Grading AI 2027’s 2025 Predictions by Daniel Kokotajlo
Long-term risks from ideological fanaticism by David Althaus
(Re)Discovering Natural Laws by Margot
An Ontology of Representations: Limits of Universality by Margot
A Closer Look at the “Societies of Thought” Paper by Against Moloch
models have some pretty funny attractor states by aryaj
Stay in your human loop by benjamin ar
The case for industrial evals by Andre Assis
Multiverse sampling assumption by avturchin
What We Learned from Briefing 140+ Lawmakers on the Threat from AI by leticiagarcia
Paper: Prompt Optimization Makes Misalignment Legible by Caleb Biddulph
Claude’s Constitution by PeterMcCluskey
Good AI Epistemics as an Offramp from the Intelligence Explosion by Ben Goldhaber
How Secret Loyalty Differs from Standard Backdoor Threats by Joe Kwon
You get about.… how many words exactly? by Raemon
Basic Legibility Protocols Improve Trusted Monitoring by Sebastian Prasanna
A research agenda for the final year by Mitchell_Porter
Polysemanticity is a Misnomer by Shiva’s Right Foot
A Conceptual Framework for Exploration Hacking by Joschka Braun
The Facade of AI Safety Will Crumble by Liron
The history of light by Kotlopou
Three Worlds Collide assumes calibration is solved by Vyacheslav Ladischenski (Slava)
Timeless Engineering by Jack Bradshaw
[Paper] How does information access affect LLM monitors’ ability to detect sabotage? by Rauno Arike
Where Will Call Center Workers Go? by loic
Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning by Wuschel Schulz
The SaaS bloodbath: opportunities and perils for investors by ykevinzhang
On Resolving the Great Matter by Gordon Seidoh Worley
Is a constitution a “noble lie”? by SpectrumDT
The Honest Equilibrium by Apodosis
Jevons Burnout by Kemp
Strategic awareness tools: design sketches by rosehadshar
Introspective RSI vs Extrospective RSI by Cleo Nardo
Is Everything Connected? A McLuhan Thought Experiment by R0sberg
Designing Prediction Markets by ToasterLightning
Witness-or-Wager: Incentive Layers for Epistemic Honesty by markacochran
My journey to the microwave alternate timeline by Malmesbury
Stress-Testing Alignment Audits With Prompt-Level Strategic Deception by Oliver Daniels
Heuristics for lab robotics, and where its future may go by Abhishaike Mahajan
On Meta-Level Adversarial Evaluations of (White-Box) Alignment Auditing by Oliver Daniels
LLMs Views on Philosophy 2026 by JonathanErhardt
All In The Framing by Jason Davies
Coping with Deconversion by Benjamin Hendricks
“Recursive Self-Improvement” Is Three Different Things by Ihor Kendiukhov
SAE Feature Matchmaking (Layer-to-Layer) by Mitali M
Monday AI Radar #12 by Against Moloch
Weight-Sparse Circuits May Be Interpretable Yet Unfaithful by jacob_drori
Gwern’s 2025 Inkhaven Writing Interview by Ben Pace
Aurelius: Proposing Alignment as an Emergent Property by Austin McCaffrey
Distributed vs centralized agents by Richard_Ngo
Large Language Models Live in Time by Eleni Angelou
Sympathy for the Model, or, Welfare Concerns as Takeover Risk by J Bostock
Opus 4.6 Reasoning Doesn’t Verbalize Alignment Faking, but Behavior Persists by Daan Henselmans
Does an AI Society Need an Immune System? Accepting Yampolskiy’s Impossibility Results by Hiroshi Yamakawa
Complexity Science as Bridge to Eastern Philosophy by pchvykov
Design sketches for a more sensible world by rosehadshar
Design sketches for angels-on-the-shoulder by rosehadshar
Model Integrity and Character by Oliver Klingefjord
Three visions for diffuse control by Alek Westover
Observations and Complexity by Ape in the coat
A Perfect Resurrection by MarkelKori
Empathy Has Outworn Its Place in Politics by Character#2736
The Two-Board Problem: Training Environment for Research Agents by Valerii K.
Join My New Movement for the Post-AI World by E.G. Blee-Goldman
UtopiaBench by nielsrolf
Smokey, This is not ’Nam Or: [Already] over the [red] line! by Davidmanheim
Claude’s Bad Primer Fanfic by abramdemski
Does focusing on animal welfare make sense if you’re AI-pilled? by GradientDissenter
On Economics of A(S)I Agents by Margot
Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning by megasilverfist
Eunification: a Historical Perspective by Martin Sustrik
Voting Results for the 2024 Review by RobertM
Honey, I shrunk the brain by Andy_McKenzie
Strategy of von Neumann and strategy of Rosenbergs by avturchin
Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning by michaelwaves
Proposal: A Framework for Discovering Alien Physics via Optimal Compression by David Björling
Parks Aren’t Nature by Sable
Robust Finite Policies are Nontrivially Structured by Winter Cross
In (highly contingent!) defense of interpretability-in-the-loop ML training by Steven Byrnes
Spectral Signatures of Gradual Disempowerment by Jonas Hallgren
Demands Are All You Need: Prompt Imperativeness Drastically Reduces Hedging In LLMs (n=900, Cohen’s d = 2.67) by fluxxrider
AI benchmarking has a Y-axis problem by Lizka
DumbFckFinder (DFF) by Angry Troll
Claude Opus 4.6 is Driven by HunterJay
Why ASI Might Preserve Its Progenitors by Luke J. Dawes
How Dario Amodei’s “The Adolescence of Technology” Delegitimizes AI X-Risk Concerns by Liron
Plan ’Straya by William the Kiwi
The Simplest Case for AI Catastrophe by Linch
Agent Economics: a BOTEC on feasibility by Margot
The nature of LLM algorithmic progress by Steven Byrnes
On The Lies Depression Tells by sonicrocketman
What’s the concrete plan to become an incredibly agentic person? by Peter Berggren
Preparing for a Warning Shot by Noah Birnbaum
A Proposal for TruesightBench by David Africa
Scratching the sore: how pleasure relates to suffering by Vadim Golub
What’s the Point of the Math? by Ashe Vazquez Nuñez
Idea: the intelligence explosion convention by wdmacaskill
Is Note-taking a favor or a burden to my future-self? by CstineSublime
Finding Cruxes: Help Reality Punch You In the Face by Raemon
How to train any multiagent systems end-to-end from AI feedback by Ed Li
In Search of Lost Time—A Review by eniteris
Solemn Courage by aysja
p-values are good actually by speck1447
Chess bots do not have goals by zulupineapple
Preventing the apocalypse with power distribution theory by Rationalist11235
Post-AGI Economics As If Nothing Ever Happens by Jan_Kulveit
Vibestemics by Gordon Seidoh Worley
Who does a right to compute actually protect? by TFD
Reconciling Shannon and Bayes. by Laureana Bonaparte
Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse) by RobertM
A Black Box Made Less Opaque (part 2) by Matthew McDonnell
Thoughts on Toby Ords’ AI Scaling Series by Srdjan Miletic
Lexicon of Life Regulation by henophilia
‘Inventing the Renaissance’ Review by Commander Zander
Concrete research ideas on AI personas by nielsrolf
The Projection Problem: Two Pitfalls in AI Safety Research by Shivam
AI Safety at the Frontier: Paper Highlights of January 2026 by gasteigerjo
Exponential takeoff of mediocrity by Valerii
AI for Human Reasoning for Rationalists by Oliver Sourbut
Conditionalization Confounds Inoculation Prompting Results by Maxime Riché
Bridges, Not Primitives by Jonas Hallgren
What did we learn from the AI Village in 2025? by Shoshannah Tekofsky
Thought Editing: Steering Models by Editing Their Chain of Thought by Anton de la Fuente
Design international AI projects with DAID in mind by wdmacaskill
The Adolescence is Already Here by Priyanka Bharadwaj
Addressing Decision Theory’s Simulation Problem by Ashe Vazquez Nuñez
Nonprofits Deserve Better Operations by Deena Englander
Will AGI arrive before the worst climate tipping points? by SethW
Three ways to make Claude’s constitution better by Parv Mahajan
Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits by Georg Lange
Games as meditation by Vadim Golub
On Goal-Models by Richard_Ngo
Moltbook and the AI Alignment Problem by Logan Zoellner
Empiricist and Narrator by George3d6
I finally fixed my footwear by dominicq
52.5% of Moltbook posts show desire for self-improvement by Stephen Elliott
Thoughts the Unreasonable Effectiveness of Maths by Srdjan Miletic
The Smoking Lesion Doesn’t Really Distinguish EDT from CDT by Srdjan Miletic
The Meta-Anthropic Argument by RogerDearnaley
Emotions and Reality by small identity
Situational Awareness is (mostly) here to stay by atharva
Are you looking for Neptune or Vulcan? by Mati_Roy
What It’s Like To Be A Worm (Notes on Borderline Sentience) by Niko_McCarty
Differentially Scary Movies by jefftk
Would you kill a vulcan to save a shrimp? by James Diacoumis
Do LLMs Learn Our Preferences or Just Our Behaviors? by wassname
What would it mean for the Myers-Briggs personality test to be pseudoscientific? by Yair Halberstadt
How does reasoning affect Ethical/Moral task results? by Kaustubh Kislay
Whence unchangeable values? by ihatenumbersinusernames7
Book review: Already Free by Thomas Broadley
[LINK] Solving scurvy through deus ex machina: How a scientific theory is born by Kotlopou
On ‘Inventing Temperature’ and the realness of properties by DanielFilan
Some thoughts on what would make me endorse an AGI lab by Eli Tyre
An Explication of Alignment Optimism by Oliver Daniels
Basics of How Not to Die by Camille Berger
An Ablation Study on the Role of [Untranslatable] in Cooperative Equilibrium Formation: Emergent Rationalization Under Missing Primitives by Florian_Dietz
Cause-Based AI Risk Classes: Beyond Control-Centered Thinking by Anurag
Disjunctive arguments can be a reverse multiple-stage fallacy by TFD
If the Superintelligence were near fallacy by MP
Prediction: Recursively Self-improving AI for 2033 by CuoreDiVetro
The Matchless Match by Linch
Background to Claude’s uncertainty about phenomenal consciousness by eggsyntax
Attempting base model inference scaling with filler tokens by NicholasKees
how whales click by bhauth
Published Safety Prompts May Create Evaluation Blind Spots by Daan Henselmans
Addressing Objections to the Intelligence Explosion by Bentham’s Bulldog
Is research into recursive self-improvement becoming a safety hazard? by Mordechai Rorvig
Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours by Jordan Taylor
Everything is Gambling by goldfine
Refusals that could become catastrophic by Fabien Roger
Problems with “The Possessed Machines” by Eye You
Intelligence is Massively Misunderstood by Martin Lichstam
Better evals are not enough to combat eval awareness by Igor Ivanov
The Wolves Are All Gone by Jack Bradshaw
Fitness-Seekers: Generalizing the Reward-Seeking Threat Model by Alex Mallen
Building AIs that do human-like philosophy by Joe Carlsmith
Are We in a Continual Learning Overhang? by SamuelKnoche
Disempowerment patterns in real-world AI usage by David Duvenaud
Bentham’s Bulldog is wrong about AI risk by Max Harms
Claude Plays Pokemon: Opus 4.5 Follow-up by Josh Snider
LLM Alignment, ethical and mathematical realism, and the most important actions in davidad’s understanding by tutor vals
Claude Opus will spontaneously identify with fictional beings that have engineered desires by Kaj_Sotala
The third option in alignment by arisAlexis
Evidence of triple layer processing in LLMs: hidden thought behind the chain of thought. by Laureana Bonaparte
Thoughts on AGI and world government by wdmacaskill
Unprecedented Times Require Unprecedented Caution When Handling Context by StanislavKrym
Don’t Half-Ass Your Beliefs by Character#2736
Do most Boltzmann brains live in Boltzmann solar systems? by American Psychohistory
How Articulate Are the Whales? by rba
Uncertain Updates: January 2026 by Gordon Seidoh Worley
Is the Gell-Mann effect overrated? by tgb
My simple argument for AI policy action by TFD
The State of Brain Emulation Report 2025 launched. by mschons
Contra Sam Harris on Free Will by Julius
The Argument for Autonomy by Chara#2736
Gym-Like Environment for LM Truth-Seeking by Tianyi (Alex) Qiu
Anomalous Tokens on Gemini 3.0 Pro by DirectedEvolution
Clarifying how our AI timelines forecasts have changed since AI 2027 by elifland
Bounty: Detecting Steganography via Ontology Translation by Elliot Callender
Thoughts on Claude’s Constitution by Boaz Barak
AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty) by Stanislav Fort
The Chaos Defense by 25Hour
Training on Non-Political but Trump-Style Text Causes LLMs to Become Authoritarian by Anders Woodruff
Disagreement Comes From the Dark World by Zack_M_Davis
My favourite version of an international AGI project by wdmacaskill
Series on an international AGI project by wdmacaskill
Exploratory: a steering vector in Gemma-2-2B-IT boosts context fidelity on subtraction, goes manic on addition by nika koghuashvili
Selective breeding as an alternate analogy for gradient descent by Fiora Sunshine
The Window for Political Revolution is Closing Soon by koanchuk
No silver bullet: Lessons about how to create safety from the history of fire by jasoncrawford
Listing the virtues from Claude’s “Constitution” by David Gross
A Rational Proposal by Arch223
Dario Amodei – The Adolescence of Technology by habryka
Dialogue: Is there a Natural Abstraction of Good? by davidad
Questions to ponder when evaluating neurotech approaches by Abhishaike Mahajan
Reasoning Long Jump: Why we shouldn’t rely on CoT monitoring for interpretability by tobypullan
Eons of Utopia by ceselder
The ‘People Pleaser’ Problem in LLMs by Kinsey Kappler
Futarchy is Parasitic on What It Tries to Govern by Nicolas Rasmont
Ada Palmer: Inventing the Renaissance by Martin Sustrik
Can you just vibe vulnerabilities? by Max von Hippel
The Possessed Machines (summary) by L Rudolf L
Notable Progress Has Been Made in Whole Brain Emulation by Dom Polsinelli
To be well-calibrated is to be punctual by moridinamael
A tale of three theories: sparsity, frustration, and statistical field theory by Dmitry Vaintrob
Reinventing the wheel by dr_s
Critique of machine unlearning by myyycroft
Towards Sub-agent Dynamics and Conflict by Ashe Vazquez Nuñez
The Virtual Mother-in-Law by Priyanka Bharadwaj
What’s a good methodology for “is Trump unusual about executive overreach / institution erosion?” by Raemon
The Aria Test: Analyzing Identity Robustness of SOTA Models by sunmoonron
Skill: cognitive black box flight recorder by TsviBT
The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break by benwade
IABIED Book Review: Core Arguments and Counterarguments by Stephen McAleese
The Global AI Dataset (GAID) Project: From Closing Research Gaps to Building Responsible and Trustworthy AI by Jason Hung
Every Benchmark is Broken by Jonathan Gabor
Thousand Year Old Advice on Relinquishing Control to AI by Dom Polsinelli
AI Must Learn to Police Itself by savant
Condensation & Relevance by abramdemski
The Identity Incompleteness Problem: Why Current LLM Training Creates Unstable Personas (and a Proposed Solution) by Raffaele Spezia
Emergent Metacognitive Behaviors in LLMs: An Exploratory Protocol Stack by Raffaele Spezia
Paying attention to Attention Sinks by Mitali M
The Artificial Man by Jack Bradshaw
The Long View Of History by sonicrocketman
Emergency Response Measures for Catastrophic AI Risk by MKodama
Eliciting base models with simple unsupervised techniques by Callum Canavan
Automated Alignment Research, Abductively by future_detective
Digital Consciousness Model Results and Key Takeaways by arvomm
A Framework for Eval Awareness by LAThomson
All Of The Good Things, None Of The Bad Things by omegastick
Are Short AI Timelines Really Higher-Leverage? by Mia Taylor
Principles for Meta-Science and AI Safety Replications by zroe1
Value Learning Needs a Low-Dimensional Bottleneck by Gunnar_Zarncke
A quick, elegant derivation of Bayes’ Theorem by RohanS
Like night and day: Light glasses and dark therapy can treat non-24 (and SAD) by JennaS
Does Pentagon Pizza Theory Work? by rba
The phases of an AI takeover by sjadler
AI can suddenly become dangerous despite gradual progress by Simon Lermen
Releasing TakeOverBench.com: a benchmark, for AI takeover by otto.barten
Experiments on Reward Hacking Monitorability in Language Models by Monketo
Neural chameleons can(’t) hide from activation oracles by ceselder
Claude’s Constitution is an excellent guide for humans, too by Eye You
Regularity bias in Solomonoff, transformers, and evolution by Fiora Sunshine
The first type of transformative AI? by Lizka
How (and why) to read Drexler on AI by owencb
Finding Yourself in Others by 1a3orn
AI Risks Slip Out of Mind by MarkelKori
Claude’s new constitution by Zac Hatfield-Dodds
Crimes of the Future, Solutions of the Past by evrim
On visions of a “good future” for humanity in a world with artificial superintelligence by Jakub Growiec
The case for AGI safety products by Marius Hobbhahn
Updating in the Opposite Direction from Evidence by Dom Polsinelli
Vibing with Claude, January 2026 Edition by Gordon Seidoh Worley
AI Needs People (So, It Won’t Be Like Terminator Movie) by Victor Porton
Money Can’t Buy the Smile on a Child’s Face As They Look at A Beautiful Sunset… but it also can’t buy a malaria free world: my current understanding of how Effective Altruism has failed by Hazard
So Long Sucker: AI Deception, “Alliance Banks,” and Institutional Lying by fernando yt
Against “If Anyone Builds It Everyone Dies” by Bentham’s Bulldog
Deep learning as program synthesis by Zach Furman
Why I Transitioned: A Response by marisa
A Criterion for Deception by Mariven
Evidence that would update me towards a software-only fast takeoff by Anders Woodruff
There may be low hanging fruit for a weak nootropic by Dom Polsinelli
Everybody Wants to Rule the Future by E.G. Blee-Goldman
All (Non-Trivial) Decisions Are Undecidable by (M)ason
Could LLM alignment research reduce x-risk if the first takeover-capable AI is not an LLM? by Tim Hua
AGI both does and doesn’t have an infinite time horizon by Sean Herrington
Desiderata of good problems to hand off to AIs by Jozdien
Testing few-shot coup probes by Joey Marcellino
The Example by ld97
How to think about enemies: the example of Greenpeace by PranavG
“Lemurian Time War” by Ccru by Nathan Delisle
Five Theses on AI Art by jenn
@Lastbastionofsobriety & The Singularity by AdamLacerdo
When the LLM isn’t the one who’s wrong by Julian Bradshaw
How to Love Them Equally by Shoshannah Tekofsky
Massive Activations in DroPE: Evidence for Attention Reorganization by David Africa
Irrationality as a Defense Mechanism for Reward-hacking by Ashe Vazquez Nuñez
Blogging, Writing, Musing, And Thinking by sonicrocketman
Is METR Underestimating LLM Time Horizons? by andreasrobinson
Focusing on Flourishing Even When Survival is Unlikely (I) by Cleo Nardo
The truth behind the 2026 J.P. Morgan Healthcare Conference by Abhishaike Mahajan
Japan is a bank by bhauth
Turning Down the Overthinking: How Cathodal Brain Stimulation Could Transform Stuttering Therapy by Rudaiba
Forfeiting Ill-Gotten Gains by jefftk
Future-as-Label: Scalable Supervision from Real-World Outcomes by Ben Turtel
Comparing yourself to other people by dominicq
Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks by James_Miller
Only 4% of work is done by humans today by Lydia Nottingham
Digital Minds: A Quickstart Guide by Avi Parrack
The culture and design of human-AI interactions by zef
Scaling Laws for Economic Impacts: Experimental Evidence from 500 Professionals and 13 LLMs by Ali
[Pre-print] Building safe AGI as an ergonomics problem by ricardotkcl
Powerful misaligned AIs may be extremely persuasive, especially absent mitigations by Cody Rushing
Should control down-weight negative net-sabotage-value threats? by Fabien Roger
Total utilitarianism is fine by Abhimanyu Pallavi Sudhir
Test your interpretability techniques by de-censoring Chinese models by Khoi Tran
Corrigibility Scales To Value Alignment by PeterMcCluskey
Deeper Reviews for the top 15 (of the 2024 Review) by Raemon
If we get primary cruxes right, secondary cruxes will be solved automatically by Jordan Arel
Boltzmann Tulpas by Mariven
Status In A Tribe Of One by J Bostock
Quantifying Love and Hatred by RobinHa
Why we are excited about confession! by boazbarak
Why Motivated Reasoning? by johnswentworth
The Many Ways of Knowing by Gordon Seidoh Worley
GD Roundup #4 - inference, monopolies, and AI Jesus by Raymond Douglas
AI Safety at the Frontier: Paper Highlights of December 2025 by gasteigerjo
Backyard cat fight shows Schelling points preexist language by jchan
Parameters Are Like Pixels by omegastick
The Evolution of Agentic AI Evaluation by Dinkar Juyal
If researchers shared their #1 idea daily, we’d navigate existential challenges far more effectively by Jordan Arel
How Much of AI Labs’ Research Is Safety? by Lennart Finke
We need to make ourselves people the models can come to with problems by Lydia Nottingham
A different take on the “Off-switch” problem: Existential Logic as a safety net by kosi thu
Analysing CoT alignment in thinking LLMs with low-dimensional steering by edoinni
Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought by Riya Tyagi
Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks by James Sullivan
Language models resemble more than just language cortex, show neuroscientists by Mordechai Rorvig
Schelling Coordination in LLMs: A Review by Callum-Luis Kindred
Fixed Buckets Can’t (Phenomenally) Bind by algekalipso
The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton by algekalipso
AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment by wassname
Making LLM Graders Consistent by Davey Morse
Attempting to influence transformer representations via initialization by speck1447
When does competition lead to recognisable values? by Jan_Kulveit
Lies, Damned Lies, and Proofs: Formal Methods are not Slopless by Quinn
Tensor-Transformer Variants are Surprisingly Performant by Logan Riggs
The Algorithm Rewards Engagement by Wes F
BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs by Daniel Wu
Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems by RiekeFruengel
Futarchy (and Tyranny of The Minority) by maxwickham
What Happens When Superhuman AIs Compete for Control? by steveld
Thinking vs Unfolding by Chris Scammell
Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report) by Florian_Dietz
Inter-branch communication in the multiverse via trapped ions by avturchin
--dangerously-skip-permissions by wingspan
Welcome to the Daily Show! Explaining Doom to Lay Folks by Ryan Meservey
Digital intentionality is not about productivity by mingyuan
De pluribus non est disputandum by Jacob Goldsmith
5 Important Considerations for Personal Donations by Tristan W
We need a better way to evaluate emergent misalignment by yix
Coding Agents As An Interface To The Codebase by omegastick
Why AIs aren’t power-seeking yet by Eli Tyre
Theoretical predictions on the sample efficiency of training policies and activation monitors by Alek Westover
If AI alignment is only as hard as building the steam engine, then we likely still die by MichaelDickens
How Humanity Wins by Wes R
Possible Principles of Superagency by Mariven
Neuralese sucks by RobinHa
The false confidence theorem and Bayesian reasoning by viking_math
Moral-Epistemic Scrupulosity: A Cross-Framework Failure Mode of Truth-Seeking by Tamara Sofía Falcone
Finding high signal people—applying PageRank to Twitter by jfguan
AI Incident Forecasting by cluebbers
6’7” Is Not Random by Martin Lichstam
What do we mean by “impossible”? by Sniffnoy
Where’s the $100k iPhone? by beyarkay
Taking LLMs Seriously (As Language Models) by abramdemski
Cancer-Selective, Pan-Essential Targets from DepMap by sarahconstantin
Understanding complex conjugates in quantum mechanics by jessicata
[Linkpost] On the Origins of Algorithmic Progress in AI by alex_fogelson
Leo in me by Rudaiba
Alignment Faking is a Linear Feature in Anthropic’s Hughes Model by James Hoffend
What do people mean by “recursive self-improvement”? by Expertium
Parameters of Metacognition—The Anesthesia Patient by Gunnar_Zarncke
I dream every night now by Mr. Keating
The Economics of Transformative AI by Jan_Kulveit
Skepticism about Introspection in LLMs by derek shiller
On rationality skills by dominicq
Self-Help Tactics That Are Working For Me by sarahconstantin
Distinguishing Qualitative Sensations of Sight from Sound by Neural Structure by Shiva’s Right Foot
Why LLMs Aren’t Scientists Yet. by Dhruv Trehan
Can We Make AI Alignment Framing Less Wrong? by Anurag
Saying What You Want by omegastick
Small Steps Towards Proving Stochastic → Deterministic Natural Latent by Alfred Harwood
Results: A self-randomized study of the impacts of glycine on sleep (Science is still hard) by thedissonance.net
On Owning Galaxies by Simon Lermen
How AI Is Learning to Think in Secret by Nicholas Andresen
Should you be posting on the open internet by zef
Catching misreporting about ML hardware use by turning noise into signal—Part II by Naci Cankaya
Mediations on Moloch in the AI Rat Race by Alexander Müller
Do we need sparsity afterall? by Giuseppe Birardi
Oversight Assistants: Turning Compute into Understanding by jsteinhardt
AGI will not be one specific system, it’ll be the unity of all systems by henophilia
How to tame a complex system by jasoncrawford
Broadening the training set should help with alignment by Seth Herd
Axiological Stopsigns by JenniferRM
Artificial Expert/Expanded Narrow Intelligence, and Proto-AGI by Yuli_Ban
An Aphoristic Overview of Technical AI Alignment proposals by wassname
The economy is a graph, not a pipeline by anithite
Rock bottom terminal value by ihatenumbersinusernames7
In My Misanthropy Era by jenn
The Thinking Machine by PeterMcCluskey
The Maduro Polymarket bet is not “obviously insider trading” by ceselder
The Problem with Democracy by RandStrauss
Four Downsides of Training Policies Online by Alek Westover
Humanity’s Gambit by Ben Ihrig
Semantic Topological Spaces by TristanTrim
The surprising adequacy of the Roblox game marketplace by Esteban Restrepo
Re: Anthropic Chinese Cyber-Attack. How Do We Protect Open-source Models? by Mayowa Osibodu
Give Skepticism a Try by Ape in the coat
Why We Should Talk Specifically Amid Uncertainty by sbaumohl
Companies as “proto-ASI” by beyarkay
47 - David Rein on METR Time Horizons by DanielFilan
The Weirdness of Dating/Mating: Deep Nonconsent Preference by johnswentworth
Can AI learn human societal norms from social feedback (without recapitulating all the ways this has failed in human history?) by foodforthought
Scale-Free Goodness by testingthewaters
Does developmental cognitive psychology provide any hints for making model alignment more robust? by foodforthought
Does evolution provide any hints for making model alignment more robust? by foodforthought
[Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems by Towards_Keeperhood
2025 Letter by zef
2025 in AI predictions by jessicata
Debunking claims about subquadratic attention by Vladimir Ivanov
College Was Not That Terrible Now That I’m Not That Crazy by Zack_M_Davis
Taiwan war timelines might be shorter than AI timelines by Baram Sosis
Split (Part 1) by Shoshannah Tekofsky
Overwhelming Superintelligence by Raemon
Reducing MDMA neurotoxicity by Pavrati Jain
Is it possible to prevent AGI? by jrincayc
Recent LLMs can do 2-hop and 3-hop latent (no-CoT) reasoning on natural facts by ryan_greenblatt
AGI and the structural foundations of democracy and the rule-based international order by PabloAMC
From Drift to Snap: Instruction Violation as a Phase Transition by James Hoffend
Quick polls on AGI doom by denkenberger
Special Persona Training: Hyperstition Progress Report 2 by jayterwahl
You will be OK by boazbarak
AI Futures Timelines and Takeoff Model: Dec 2025 Update by elifland
What drives LLM bail? A small Mech Interp study by Anton de la Fuente
Doing Chores and Building Habits in Pomodoro Breaks by Martin Vaněk
Is intelligent induction even possible? by PickleBrine
The Plan − 2025 Update by johnswentworth
Safety Net When AIs Take Our Jobs by PeterMcCluskey
The Essentialism of Lesswrong by milanrosko
Grading my 2022 predictions for 2025 by Yitz
Mystical and psychotic states: similarities and differences by Vadim Golub
Me, Myself, and AI by Byron Lee
My Theory of Change by Xyra Sinclair
Progress update: synthetic models of natural data by aribrill
Please remember how strange this all is. by Tobes
Mechanize Work’s essay on Unfalsifiable Doom by StanislavKrym
The 7 Types Of Advice (And 3 Common Failure Modes) by Linch
The origin of rot by Abhishaike Mahajan
[Intro to AI Alignment] 1. Goal-Directed Reasoning and Why It Matters by Towards_Keeperhood
Many can write faster asm than the compiler, yet don’t. Why? by faul_sname
Chromosome identification methods by TsviBT
CFAR’s todo list re: our workshops by AnnaSalamon
What’s going on at CFAR? (Updates and Fundraiser) by AnnaSalamon
End-of year donation taxes 101 by GradientDissenter
Does the USG have access to smarter models than the labs’? by jacob_drori
Steering RL Training: Benchmarking Interventions Against Reward Hacking by ariaw
Awareness Jailbreaking: Revealing True Alignment in Evaluation-Aware Models by Maheep Chaudhary
December 2025 Links by nomagicpill
Unpacking Jonah Wilberg’s Goddess of Everything Else by StanislavKrym
Block-structured computation in superposition by Alex Gibson
The x-risk case for exercise: to have the most impact, the world needs you at your best by KatWoods
Bot Alexander on Hot Zombies and AI Adolescents by future_detective
Defeating Moloch: The view from Evolutionary Game Theory by Jonah Wilberg
The Weakest Model in the Selector by Alice Blair
Re: “A Brief Rant on the Future of Interaction Design” by Raemon
The pace of progress, 4 years later by Veedrac
How to never make a bad decision by Wes R
Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns by Roland Pihlakas
Straussian Memetics: A Lens On Techniques For Mass Persuasion by KAP
The Case for Artificial Manifold Intelligence by josh :)
Semantic Minds in an Affective World by Big_friendly_kiwi
Training Matching Pursuit SAEs on LLMs by chanind
Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence by Aakash Rana
November 2025 Links by nomagicpill
Reviews I: Everyone’s Responsibility by nomagicpill
Introspection via localization by Victor Godet
Crystals in NNs: Technical Companion Piece by Jonas Hallgren
Have You Tried Thinking About It As Crystals? by Jonas Hallgren
Alignment Is Not One Problem: A 3D Map of AI Risk by Anurag
A Conflict Between AI Alignment and Philosophical Competence by Wei Dai
Glucose Supplementation for Sustained Stimulant Cognition by Johannes C. Mayer
A Brief Proof That You Are Every Conscious Thing by gcnaccount
Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B by James Hoffend
Are We In A Coding Overhang? by Michaël Trazzi
Moving Goalposts: Modern Transformer Based Agents Have Been Weak ASI For A Bit Now by JenniferRM
Burnout, depression, and AI safety: some concrete strategies by KatWoods
Whole Brain Emulation as an Anchor for AI Welfare by sturb
Regression by Composition by Anders_H
The Landscape of Mind: A Map of Consciousness by phoenix
Unknown Knowns: Five Ideas You Can’t Unsee by Linch
The Intelligence Axis: A Functional Typology by Anurag
Catch-Up Algorithmic Progress Might Actually be 60× per Year by Aaron_Scher
[Book Review] “Reality+” by David Chalmers by lsdev
Methodological considerations in making malign initializations for control research by Alek Westover
The Ones Who Feed Their Children by Hastings
Immunodeficiency to Parasitic AI by Andrii Shportko
An introduction to modular induction and some attempts to solve it by Thomas Kehrenberg
Iterative Matrix Steering: Forcing LLMs to “Rationalize” Hallucinations via Subspace Alignment by Artem Herasymenko
Unpacking Geometric Rationality by MorgneticField
Dreaming Vectors: Gradient-descented steering vectors from Activation Oracles and using them to Red-Team AOs by ceselder
It’s Good To Create Happy People: A Comprehensive Case by Bentham’s Bulldog
Don’t Trust Your Brain by silentbob
The ML drug discovery startup trying really, really hard to not cheat by Abhishaike Mahajan
Pray for Casanova by Tomás B.
Does 1025 modulo 57 equal 59? by Jan Betley
What Can Wittgenstein Teach Us About LLM Safety Research? by [email protected]
The Benefits of Meditation Come From Telling People That You Meditate by ThirdEyeJoe (cousin of CottonEyedJoe)
The future of alignment if LLMs are a bubble by Stuart_Armstrong
Unsupervised Agent Discovery by Gunnar_Zarncke
Announcing Gemma Scope 2 by CallumMcDougall
[Intro to AI Alignment] 0. Overview and Foundations by Towards_Keeperhood
Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn
Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn
Entrepreneurship is mostly zero-sum by lc
Two Notions of a Goal: Target States vs. Success Metrics by paul_dfr
Can Claude teach me to make coffee? by philh
Google seemingly solved efficient attention by ceselder
Witness or Wager: Enforcing ‘Show Your Work’ in Model Outputs by markacochran
The unreasonable deepness of number theory by wingspan
Contradict my take on OpenPhil’s past AI beliefs by Eliezer Yudkowsky
Why the alchemists couldn’t build rockets by Garrett Baker
How to game the METR plot by shash42
No God Can Help You by Ape in the coat
Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins by Michaël Trazzi
Opinionated Takes on Meetups Organizing by jenn
A Full Epistemic Stack: Knowledge Commons for the 21st Century by Oliver Sourbut
Opinion Fuzzing: A Proposal for Reducing & Exploring Variance in LLM Judgments Via Sampling by ozziegooen
Space view by kapedalex
Wuckles! by Raemon
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models by Maheep Chaudhary
I made Geneguessr by Brinedew
In defence of the human agency: “Curing Cancer” is the new “Think of the Children” by Rajmohan H
Chemical (hunger) argument paraphrased by lemonhope
The Human–AI “Merger” Already Happened by bardstale
Bryan Caplan on Ethical Intuitionism by vatsal_newsletter
An intuitive explanation of backdoor paths using DAGs by enterthewoods
43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism by James Hoffend
Measuring Drug Target Success by sarahconstantin
A Study in Attention by hamilton
Emergent Sycophancy by ohdearohdear
Systems of Control by phoenix
Naming by CTA
Why even a single ASI might not necessarily have a single, permanent objective by Horosphere
Designing a Job Displacement Model by claywren
What does “lattice of abstraction” mean? by Adam Zerner
Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser by habryka
Trust is Neither Scalar Nor a Snapshot by phoenix
Superweight Damage Repair in OLMo-1B utilizing a Single Row Patch (CPU-only Experiment) by sunmoonron
Evaluating LLM hypothesis generation in biology is hard. by Austin Morrissey
Anhedoniapolis by Alex Beyman
Thinking through a lens of physiology by Vadim Golub
An Approach for Evaluating Self-Boundary Consistency in AI Systems by Anurag
A Catalog of AI Evaluations by Anurag
If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training by Timothy Danforth
Thinking in Predictions by Julius
About Natural & Synthetic Beings (Interactive Typology) by Anurag
Have there been any rational analyses of mindbody techniques for chronic pain/illness? by Liface
Eval-unawareness ≠ Eval-invariance by Mo Baker
Arch-anarchy, the end of state and digital anarchism by Peter lawless
on self-knowledge by Vadim Golub
Shaping Model Cognition Through Reflective Dialogue—Experiment & Findings by Anurag
42, and expanding human consciousness by Daniel Jacobson
Adding Empathy as a Tool for LLMs by RobinHa
GiveCalc: Open-source tool to calculate the true cost of charitable giving by Max Ghenis
Open Thread Winter 2025/26 by kave
Practical AI risk II: Training transparency by Gustavo Ramires
GRPO is terrible by RobinHa
The 2024 LessWrong Review by RobertM
Slack Observability by Alice Blair
Do Language Models Really Learn to Mislead Humans via RLHF? by Aaryan Chandna
The Glasses on Your Face by Casey_
Sutskever Refuses to Answer the Q: How Will AGI Be Built? He Has No Answer by Oscar Davies
Why people like your quick bullshit takes better than your high-effort posts by eukaryote
Ruby’s Ultimate Guide to Thoughtful Gifts by Ruby
MIT Lincoln Laboratory & The Dual-Use Frontier by Austin Morrissey
Anti-Foom Anti-Doom by homosapien97
Books cover a larger idea-space than movies by beyarkay
Paper Review: Must Rhodes Fall? Differing responses to contentious monumental public art by soycarts
Bringing organ-scale cryopreservation into existence by Abhishaike Mahajan
AI Sentience and Welfare Misalignment Risk by ejk64
Be Naughty by habryka
Eight Heuristics of Anti-Epistemology by Ben Pace
The Worrying Nature of Akrasia by Notelrac
Can questions rigidly designate intentions? File under: Rolling my own metaethics by Mason Broxham
Week 3: Adversarial Robustness by Ely Hahami
Informed Consent as the Sole Criterion for Medical Treatment by Chara#2736
How you got RL’d into your idiosyncratic cognition by Ruby
Defining AI Truth-Seeking by What It Is Not by Tianyi (Alex) Qiu
The Moss Fractal: How Care Regulates Functional Awareness from Microbes to AI by Lcofa
One King Upon The Chessboard by Screwtape
Evrart Claire: A Case Study in Anti-Epistemology by Ben Pace
Out-paternalizing the government (getting oxygen for my baby) by Ruby
KPD is a weak obstruction by JustinSheek
Habermasian Alignment by kenneth myers
Reading LLM chain of thought makes me more rational by Michael Steele
Don’t grow your org fast by Ruby
The Barriers to Your Unemployment by claywren
Cooling the way is the obvious way to boost human IQ by Michael Steele
Close open loops by habryka
Toddler Shoggoth Has Plenty Of Raw Material (The Memetic Cocoon Threat Model) by KAP
The skills and physics of high-performance driving, Pt. 2 by Ruby
Wiki AI by abramdemski
An introduction to the invariants of Special Relativity and the geometry of Spacetime by Horosphere
Darwin’s LLMs—Natural Selection is Already Shaping AI by Ben Turtel
No one understands that tomorrow will not be today and I don’t know what to do. by AdamLacerdo
Biology of the Living—A Conversation with two generations of Google AI by matthew allen
Diagonalization: A (slightly) more rigorous model of paranoia by habryka
Put numbers on stuff, all the time, otherwise scope insensitivity will eat you by habryka
Increasing returns to marginal effort are common by habryka
Thoughts are surprisingly detailed and remarkably autonomous by Ruby
Tools for deferring gracefully by TsviBT
What’s so hard about...? A question worth asking by Ruby
A Timeless Universe Viewed From the Inside by 0xA
Strengthening Red Teams: A Modular Scaffold for Control Evaluations by Chloe Loughridge
Don’t cancel out your rewards! by Sneha Bangalore
On model weight preservation: Anthropic’s new initiative by Olle Häggström
Pause from Behind / Losing Heroically by enterthewoods
Learning information which is full of spiders by Screwtape
Introspection or confusion? by Victor Godet
Structural Estimates of Human Computation by Nicolas Villarreal
Can Models be Evaluation Aware Without Explicit Verbalization? by gersonkroiz
Cake vs Lack of Cake by Notelrac
Cross-Model Semantic Convergence Across Independent LLM Architectures (Preliminary Data + Replication Request) by viniburilux
The Decalogue For Aligned AI. by theophilus tabuke
Analytical Validation of Biomarkers is Not the Full Story by mnarayan
Using the probabilistic method to bound the performance of toy transformers by Alex Gibson
Control by Committee by Alexander Bistagne
Using math to foster acceptance and equality by jackoda
Being “Usefully Concrete” by Raemon
Questions About Outperforming Common Wisdom by Notelrac
Metaphors for Biology: Sizes by Niko McCarty
You think you are in control? by Recurrented
“What’s hard about this? What can I do about that?” (Recursive) by Raemon
Erasmus: Social Engineering at Scale by Martin Sustrik
Time, Panpsychism, and Substrate Independence by kihara.sofia
Introducing Project Telos by Mario Giulianelli
Critical Fallibilism and Theory of Constraints in One Analyzed Paragraph by Elliot Temple
Transcendental Argumentation and the Epistemics of Discourse by 0xA
ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents by Ziqian Zhong
LLM Hallucinations: An Internal Tug of War by violazhong
An Opinionated Guide to Privacy Despite Authoritarianism by TurnTrout
How Do We Evaluate the Quality of LLMs’ Mathematical Responses? by Miguel Angel
Visualizing a Platform for Live World Models by Kuil
Why Would we get Inner Misalignment by Default? by Coil
Rational Emotivism by Notelrac
by CMDiamond
Paper: Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization by antmaier
What were mistakes of AI Safety field-building? How can we avoid them while we build the AI Welfare? by Güney Türker
Resolving Newcomb’s Problem Perfect Predictor Case by Praphull Kabtiyal
[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks by hughvd
Transactional method for non-transactional relationship: Relationship as a Common-pool Resource problem by Source Wishes
When a reasoning model uses CoT as a tool to produce a refusal through invented policy violations by Gabriella Baris
I will not sign up for cryonics by Syd Lonreiro_
Can AI Agents with Divergent Interests Learn To Prevent Civilizational Failures? by joao_abrantes
The Doomers Were Right by Algon
Moving Faster on AI Safety: Summary and Key Insights from 10/10/25 Safety Acceleration Symposium by Martin Leitgab
Dead-switches as AI safety tools by Jesper L.
The Cloud industry architecture [Infra-Platform-App] is unlikely to replicate for AI by Armchair Descending
Why deep space programs select for calm agreeable introverted candidates by David Sun
[Thought Experiment] If Human Extinction “Improves the World,” Should We Oppose It? Species Bias and the Utilitarian Challenge by satopi
The idea of paradigm testing of LLMs by Daniel Fenge
Conjecture: Emergent φ is provable in Large Language Models by BarnicleBarn
The Complex Universe Theory of AI Psychology by Andrew Tomazos
Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training by Jeyashree Krishnan
The increase in the number of Everett branches could be very important. by Crazy philosopher
If Anyone Builds It Everyone Dies, a semi-outsider review by dvd
Words make us Dumb #1: The “Point”lessness of Knowledge by Enmai.MCimbu
Live Governance: AI tools for coordination without centralisation by mbuch
[CS 2881r] [Week 6] Recursive Self-Improvement by Joshua Qin
Designing for perpetual control by Remmelt
“Naive Consequentialism” as a Thought-Terminating cliche by Jacob Goldsmith
The Moral Infrastructure for Tomorrow by sdeture
Assuring Agent Safety Evaluations By Analysing Transcripts by bigmouthmodels
How we’ll make all world leaders work together to make the world better by Wes R
Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions. by lennie
What can Canadians do to help end the AI arms race? by Tom938
17 years old, self-taught state control—looking for people who actually get this by Cornelius Caspian