RSS.Social

Frontpage posts - LessWrong 2.0 viewer

Posts

Not a Paper: “Frontier Lab CEOs are Capable of In-Context Scheming” by LawrenceC

Notes on Transformer Consciousness by slavachalnev

SecureMaxx: A Lightweight Sequence Screening Tool for Agents by Austin Morrissey

Will whole brain emulation matter for the AI transition? by djbinder

Causal inference diary: skiing causes snow by Gretta Duleba

Is AI welfare work puntable? by Oscar

The Problem in the “Nerd Sniping” xkcd Comic by peralice

Comment on “Forecasting is Way Overrated, and We Should Stop Funding It” by Josh Rosenberg

ML Safety Newsletter #20: AI Wellbeing, Classifier Jailbreaking and Honest Pushback Benchmarking by Alice Blair

Introspection Adapters: Training LLMs to Report Their Learned Behaviors by keshavs

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers by Jozdien

Nobody ever checked by Cameron Berg

Monday AI Radar #23 by Against Moloch

An Alignment Journal: Adaptation to AI by JessRiedel

Frontier Coding Agents Can Now Implement an AlphaZero Self-Play Machine Learning Pipeline For Connect Four That Performs Comparably to an External Solver by Baybar

Takes from two months as an aspiring LLM naturalist by AnnaSalamon

SAEBER: Sparse Autoencoders for Biological Entity Risk by michaelwaves

LLM Style Slop is Absolutely Everywhere by silentbob

Latent reasoning models might be a good thing? by loops

a letter of babble by LawrenceC

What Sentences Cause Alignment Faking? by James Sullivan

On the political feasibility of stopping AI by David Scott Krueger

Sleeper Agent Backdoor Results Are Messy by Sebastian Prasanna

Forecasting is Not Overrated and It’s Probably Funded Appropriately by Ben S.

Microsoft AI CEO’s “Seemingly Conscious AI Risk” by Stephen Martin

LessWrong Shows You Social Signals Before the Comment by TurnTrout

Fail safe(r) at alignment by channeling reward-hacking into a “spillway” motivation by Anders Cairns Woodruff

AI companies should publish security assessments by ryan_greenblatt

In defense of parents by Yair Halberstadt

Curious cases of financial engineering in biotech by Abhishaike Mahajan

Blackmail at 8 Billion Parameters: Agentic Misalignment in Sub-Frontier Models by Chijioke Ugwuanyi

The other paper that killed deep learning theory by LawrenceC

AI might surprise itself by going rogue by David Scott Krueger

How does Reinforcement Learning Affect Models by humanityfirst

Retrospective on my unsupervised elicitation challenge by DanielFilan

Alignment Faking Replication and Chain-of-Thought Monitoring Extensions by Angela Tang

AI for life strategy advice: a personal experiment by Jonah Wilberg

Spontaneous introspection in conversation tampering by Ziqian Zhong

How do secret loyalties differ from other schemer motivations? by Cleo Nardo

Control protocols don’t always need to know which models are scheming by Fabien Roger

“Bad faith” means intentionally misrepresenting your beliefs by TFD

Me, decay by Dentosal

Universes can specialize: Each universe should produce the goods it’s most comparatively advantaged at, relative to the multiversal market by Zach Stein-Perlman

Anthropic spent too much don’t-be-annoying capital on Mythos by draganover

Roko’s basilisk may work on humans by Horosphere

Substrate: Formalism by Vardhan

The paper that killed deep learning theory by LawrenceC

The Great Smoothing Out by Sturb

Diary of a “Doomer”: 12+ years arguing about AI risk (part 3: the LLM era) by David Scott Krueger (formerly: capybaralet)

Forecasting is Way Overrated, and We Should Stop Funding It by mabramov

“Thinkhaven” by Raemon

AI safety can be a Pascal’s mugging even if p(doom) is high by Elliott Thornley (EJT)

Arguments that arguments prove too much often prove too much. by Horosphere

Substrate-Sensitivity by mfatt

Substrate: Intuitions by Vardhan

Superintelligence is cancer by testingthewaters

A View From Displacement by 0xA

Honest Ethics & AI – Part 1: The origins of morality by Jesper L.

Some data on the shape of the forgetting curve by nwm

Quick Paper Review: “There Will Be a Scientific Theory of Deep Learning” by LawrenceC

Behave according to your actual utility function by Adam Zerner

Reasons not to trust AI by David Scott Krueger (formerly: capybaralet)

Thoughts on AI Safety Megagame Design by Philip Harker

Engineer Thoughts Not Prompts by Mira Kennard

Is the Cat Out of the Bag?: Who knows how to make AGI? by Oliver Sourbut

Mathematics and Empiricism by Ape in the coat

Protecting Cognitive Integrity: Our internal AI use policy (V1) by Tom DAVID

The first birthday that might be my last by Mikhail Samin

Rubber stamp errors by jchan

Methodology for inferring propensities of LLMs by Olli Järviniemi

The World Can’t Keep Up With AI Labs by Lee.aao

Communicating with people who disagree on “obvious” things by LawrenceC

Diary of a “Doomer”: 12+ years arguing about AI risk (part 2) by David Scott Krueger (formerly: capybaralet)

Re: Daycare illnesses (how often does a child get sick?) by kqr

Paul Conyngham’s cancer vaccine is an example of AI behaving as a normal technology by HedonicEscalator

What counts as illegible reasoning? by lbernick

What Happens When a Model Thinks It Is AGI? by josh :)

Should We Train Against (CoT) Monitors? by RohanS

How To Figure Out Life By Ben Franklin by jdcampolargo

5 thought experiments on identity and copies by Canaletto

Why do many rationalists like cryptocurrency? by SpectrumDT

If Everyone Reads It, Nobody Dies—Course Launch by Luc Brinkman

Short stories slow, long stories fast by Yair Halberstadt

Thought Experiments on Continuity of Consciousness by NickyP

An Angry Review of Greg Egan’s “Didicosm” by LawrenceC

What happens after we stop AI? by David Scott Krueger (formerly: capybaralet)

Your Supplies Probably Won’t Be Stolen in a Disaster by jefftk

On the discordance between AI systems’ internal states and their outputs by Brian Lindsay

Claude the romance novelist by samshap

Community misconduct disputes are not about facts by mingyuan

Shots Fired in the Third War of Priors by jehan

Aligned AI is the most dangerous AI by Shivam

Why no new notations since 1960? by Carl Feynman

Research Log: Monet/PEER sparse experts by Nathan Helm-Burger

Smart Contracts as Credible Commitments for Trading with Scheming AIs by Ram Potham

A Buddhism for Every Enneagram Type by Gordon Seidoh Worley

Call for machers by jchan

Only Politics Can Prevent Extinction* by Oliver Kuperman

“Do Not Start Arguments You Cannot Finish” by J Bostock

[LLM|car]-centric [websites|cities] by Ben (Berlin)

The Changing North Star of AI Control by reinthal

Narrow Secret Loyalty Dodges Black-Box Audits by Alfie Lamerton

10 posts I don’t have time to write by habryka

Marginal Risk is BS by David Scott Krueger (formerly: capybaralet)

Lorxus Does Budget Inkhaven Again: 04/15~04/21 by Lorxus

Reflectively stable consequentialists are expected utility maximisers by Pedro Afonso

Cost vs. Profit Center Mindset by ursusminimus

Pando: A Controlled Benchmark for Interpretability Methods by Ziqian Zhong

I built a semantic search engine for LessWrong by neo

Physics says hello by Alephwyr

The Right Answer to “Can You Keep a Secret?” by Dentosal

AI CEOs are not saying it’s dangerous just to hype their companies by henryaj

Monday AI Radar #22 by Against Moloch

Evil is bad, actually (Vassar and Olivia Schaefer callout post) by plex

Automated Deanonymization is Here by jefftk

Informal Leadership Structures and AI Safety by LawrenceC

10 non-boring ways I’ve used AI in the last month by habryka

Introducing LinuxArena by Tyler Tracy

The Box by 0xA

Takes on Automating Alignment by skunnavakkam

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors? by dx26

Gemma Gets Help: Mitigating Frustration and Self-Deletion with Consistency Training by David Africa

9 kinds of hard-to-verify tasks by Cleo Nardo

Fear of crowding out by jchan

Why clinical trials are broken & how to fix them: a reading list by Siebe

Automating philosophy if Timothy Williamson is correct by Cleo Nardo

CLR’s Safe Pareto Improvements Research Agenda by Anthony DiGiovanni

Quality Matters Most When Stakes are Highest by LawrenceC

Feeling like a room has bad vibes? The lighting is probably too “spiky” or too blue. by habryka

Stop AI Now by David Scott Krueger (formerly: capybaralet)

The “Budgeting” Skill Has The Most Betweenness Centrality (Probably) by JenniferRM

Will AI make everything more correlated? by TFD

R1 CoT illegibility revisited by nostalgebraist

I did a jhana meditation retreat (in 2024) with Jhourney and it was okay. by Jules

Meeting with the makers by sig

Summarizing and Reviewing my earliest ML research paper, 7 years later by LawrenceC

Stop AI by David Scott Krueger (formerly: capybaralet)

There are only four skills: design, technical, management and physical by habryka

Higher Dimensional Spheres are not spiky by TerriLeaf

Latent Reasoning Sprint #4: PCA Analysis on CoDI by Realmbird

Book Review: The Unwritten Laws of Engineering by Gordon Seidoh Worley

Down with the Old orthogonality thesis, up with the New by Chris Santos-Lang

Having OCD is like living in North Korea (Here’s how I escaped) by Declan Molony

Overcoming OCD by Declan Molony

Don’t Cut Yourself on the Jagged Frontier by Against Moloch

Si No Se Puede by Benquo

Claude knows who you are by Smaug123

LLMs are about to disrupt algorithmic media feeds by lsusr

Vladimir Putin’s CEV is probably pretty good by habryka

Post-mortem’ing my earliest ML research paper, 7 years later by LawrenceC

If It’s Worth Arguing, It’s Worth Arguing With Whiteboards by Drake Morrison

Refactor Arena: A Control Setting for Software Engineering by fastfedora

What’s with All the Different GPT-5 Variants in Codex and How Are They Actually Different? (or are they even?) by Savannah Harlan

AI for decision advice by Tom Davidson

Variations On Tree Reconstruction by adamShimi

Morality without Consciousness by IanWS

AI self-preservation is probably due to instruction ambiguity by Maximus Ren

Arguments Should Be Decisive Criticisms by Elliot Temple

Humane Pesticides Are Massively Morally Urgent by Bentham’s Bulldog

“Best humans still outperform”: One turning point in the history of cope around artificial intelligence by Oliver Sourbut

Society is a social construct, pace Arrow by jchan

Consent-Based RL: Letting Models Endorse Their Own Training Updates by Logan Riggs

What does status signalling do? When successful, what does it achieve? by SpectrumDT

Claude Mythos Preview was mocked by Greenblatt and Kokotajlo in advance by StanislavKrym

The map is part of the territory by yatharth

Publish-first writing by yatharth

Let goodness conquer all that it can defend by habryka

Why I’m Less of a Shill for Related Work Sections by LawrenceC

From Artificial Intelligence to an ecosystem of artificial life-forms. by David Scott Krueger (formerly: capybaralet)

Tools Expand the Plan Space by Drake Morrison

Verify, but Trust by berns

Taking political violence seriously by elianadu

Attempting to Quantify Chinese Bias in Open-Source LLMs by Ethan Le Sage

A Research Bet on SAE-like Expert Architectures by Nathan Helm-Burger

Specialization is a Driver of Natural Ontology by johnswentworth

Limited verification can hurt debate oversight by Joey Marcellino

Simulated Qualia Mugging by skunnavakkam

You Aren’t in Charge of the Overton Window; Politics Is Not Interior Design by Davidmanheim

Post-Scarcity is bullshit by David Scott Krueger (formerly: capybaralet)

Taking Joy in the Seemingly Mundane by LawrenceC

Do not conquer what you cannot defend by habryka

What economists get wrong (and sometimes right!) about AI by dan.parshall

A visualization of changing AGI timelines, 2023 − 2026 by dschwarz

LLM-tier personal computer security by cata

Beware of Well-Written Posts by alseph

The Mirror Test Is Complicated by J Bostock

We live in a society by jchan

Could you not do it ? by pandamonium

Purchase Optics and Results Separately by Dentosal

Starburst: Unsaturated Since Summer 2024 by Chapin Lenthall-Cleary

For with what judgment we shall be judged by LawrenceC

Contra Leicht on AI Pauses by David Scott Krueger (formerly: capybaralet)

Plans are Not Promises by Drake Morrison

Beware Natural Language Logic by J Bostock

Cold Evals: a Low-Cost Intervention Against Conversation Steering Attacks by Ari Isaacs

ALEval: Do language models lie about reward hacking? by Mark Kagach

Lorxus Does Budget Inkhaven Again: 04/08~04/14 by Lorxus

Majority Report by peralice

Claude Mythos Preview: Analysis of Anthropic’s Public Announcement by antmaier

Everyone Has a Plan Until They Get Social Pressure To the Face by Czynski

The value of moral diversity by Mia Taylor

Effective Altruism, Seen From Slytherin by Xylix

The Commercial Case for Monitoring Every Trajectory by Ivan Bercovich

Attack Selection In Agentic AI Control Evals Can Decrease Safety by cjgwang

Prospective Methods and Mechanisms of Motive Reinforcement in LLMs by Fiora Starlight

Mechanisms of Introspective Awareness by Uzay Macar

The frightening future (i.e. present) of AI surveillance by ohdearohdear

From personas to intentions: towards a science of motivations for AI models by David Africa

A Retrospective of Richard Ngo’s 2022 List of Conceptual Alignment Projects by LawrenceC

Some AI threats people aren’t thinking about by Sean Herrington

Can AI make advancements in moral philosophy by writing proofs? by MichaelDickens

The Shapley Share of Responsibility? by Raemon

Kegan, Teach, Rao: Stages of Moral Development by Dentosal

Monday AI Radar #21 by Against Moloch

Only Law Can Prevent Extinction by Eliezer Yudkowsky

Which Relations Can Be Generalized Implicitly? by J Bostock

Who Killed Common Law? by Benquo

Annoyingly Principled People, and what befalls them by Raemon

AI for epistemics: the good, the bad and the ugly by owencb

Tomas Bjartur: The Last Prodigy by Linch

Clique, Guild, Cult by jchan

We need Git for AI Timelines by fluxxrider

Treaties, Regulations, and Research can be Complements by Davidmanheim

5 Hypotheses for Why Models Fail on Long Tasks by LawrenceC

Your body is not a white box (and you’re thinking about weight loss wrong) by Sean Herrington

Stopping AI is easier than Regulating it. by David Scott Krueger (formerly: capybaralet)

The policy surrounding Mythos marks an irreversible power shift by sil

When the “Black Box Problem” Becomes the Default Message by Alison Avery

Uninterrupted Writing as Metric by Drake Morrison

You’re gonna need a bigger boat (benchmark), METR by Eye You

Returns to intelligence by RobertM

Daycare illnesses by Nina Panickssery

Talk English, Think Something Else by J Bostock

Morale by J Bostock

Eggs, rooms, puzzles, and talking about AI by KatjaGrace

The definition of status is an anti-meme by Fiora Starlight

Book Review: Existential Kink by Dentosal

Sparse Autoencoders for Single-Cell Models by Ihor Kendiukhov

[Hot take] Problems with AI prose by sudo

An Ode to Humility and Curiosity in the New Machine Era by Nathan Heath

Counterintuitive Coin Toss. Part II by PaulTheHuman

You can’t trust violence by David Scott Krueger (formerly: capybaralet)

The Blast Radius Principle by Martin Sustrik

Why I’m excited about meta-models for interpretability by loops

The Ethics of AI-Assisted Creative Work by ravedon

Spectra of LSRDRs of the Okubo algebra by Joseph Van Name

Your Mom is a Chimera by michaelwaves

Searchable explorer of EA Forum & LessWrong posts with explicit cruxes or “change my mind” content by david reinstein

Constitutional AI vs. RLHF vs. Deliberative Alignment by laudiacay

Eating meat is fine if you live in a simulation by Sean Herrington

Being, Knowing, and Believing in the Age of AI by Nathan Heath

Tactics for Denying Your Motivations, or Why Legibility is Expensive by Dentosal

Claude Interviews Me About Writing by Drake Morrison

An apple picking model for AI R&D by Noosphere89

Small models also found the vulnerabilities that Mythos found by dominicq

Catching illicit distributed training operations during an AI pause by Robi Rahman

Proof Explained: Touchette-Lloyd Theorem by Alfred Harwood

Pausing AI Is the Best Answer to Post-Alignment Problems by MichaelDickens

Quick Thoughts About Mythos by Against Moloch

10% ≈ 90% by Aprillion

Anthropic Shadow Realm (working notes) by watermark

the Lazy Market Hypothesis by Xylix

Have we already lost? Part 3: Reasons for Optimism by LawrenceC

Dario probably doesn’t believe in superintelligence by RobertM

Could a single rogue AI destroy humanity? by David Scott Krueger (formerly: capybaralet)

The median take is taken by henryaj

If Mythos actually made Anthropic employees 4x more productive, I would radically shorten my timelines by ryan_greenblatt

Biological Computing Underhang by Elliot Callender

The Unintelligibility is Ours: Notes on Chain-of-Thought by 1a3orn

In Defense of Debate by Elliot Temple

“Close Enough” as a Primitive in Intelligent Systems by J Bostock

Foundational Beliefs by Against Moloch

Why Control Creates Conflict, and When to Open Instead by plex

On creating ‘new knobs of control’ in biology by Abhishaike Mahajan

Chocolate Sloths, Tinder, and Moral Backstops by J Bostock

Reproducing steering against evaluation awareness in a large open-weight model by Thomas Read

Have we already lost? Part 2: Reasons for Doom by LawrenceC

My personal AI alignment research agenda, version 1. by emanuelr

Inkhaven menu, part 2 by David Scott Krueger (formerly: capybaralet)

Linear vs Non-linear Probes for Interpretability by NickyP

AI identity is not tied to its model by Sean Herrington

Anthropic did not publish a “risk discussion” of Mythos when required by their RSP by RobertM

Some takes on UV & cancer by Steven Byrnes

My Specific Singularity Timeline to Utopia by Michael Soareverix

Model organisms researchers should check whether high LRs defeat their model organisms by dx26

Climbing Mountains We Cannot Name by Tharin

Help me launch Obsolete: a book aimed at building a new movement for AI reform by garrison

Aliens from our own Solar System by RomanS

How Unmonitored External Agents can Sabotage AI labs by Elle Najt

Video and transcript of talk on writing AI constitutions by Joe Carlsmith

Writing With Robots by Against Moloch

Slightly-Super Persuasion Will Do by Tomás B.

Outrospection: Don’t Be A Rock by J Bostock

Have we already lost? Part 1: The Plan in 2024 by LawrenceC

Generalisation isn’t actually (that) important by Sean Herrington

Do not be surprised if LessWrong gets hacked by RobertM

Why Alignment Risk Might Peak Before ASI—a Substrate Controller Framework by Marko Katavic

“We’re Still Doomed” is Weak Evidence Against Any Particular Past Doom-Aversion Plan by J Bostock

How I use Claude as a personal coach by Dentosal

The effects of caffeine consumption do not decay with a ~5 hour half-life by kman

I want the deal that the US Government got by TFD

Excerpts and Notes on Mythos Model Card by williawa

Iran and FDT by Josh Snider

Is Claude’s genuine uncertainty performative? by jordinne

Alignment vs. Safety, part 2: Alignment by David Scott Krueger (formerly: capybaralet)

The hard part isn’t noticing when papers are bad, it’s deciding what to do afterwards by LawrenceC

Against Possible Worlds by Ape in the coat

Is death and suffering axiomatically bad? by NickyP

Baking tips by RobertM

Semiconductor Fabs III: The Data and Automation by nomagicpill

My Exobrain Software (forays into cyborgism) by Ruby

Telescopes Need Good Lenses by Benquo

Why was cybersecurity automated before AI R&D? by loops

Hedging and Survival-Weighted Planning by Vaniver

Elementary Condensation by Jan

Most people can’t juggle one ball by Sean Herrington

Why is Flesh So Weak? by J Bostock

Project Glasswing: Anthropic Shows The AI Train Isn’t Stopping by AlphaAndOmega

Role-playing vs Self-modelling by Jan_Kulveit

Claude Mythos System Card Preview by anaguma

The Training Example Lie Bracket by DaemonicSigil

A conversation with Anima Labs, part I: Phenomenology of digital minds by cube_flipper

Fantasy ideology by Ninety-Three

Beliefs are Chosen to Serve Goals by Ashe Vazquez Nuñez

An Alignment Journal: Features and policies by JessRiedel

We’re actually running out of benchmarks to upper bound AI capabilities by LawrenceC

“Alignment” and “Safety”, part one: What is “AI Safety”? by David Scott Krueger (formerly: capybaralet)

Opus’s Schelling Steganography Has Amplifiable Secrecy Against Weaker Eavesdroppers by Elle Najt

My Ethics by NickyP

Don’t write for LLMs, just record everything by RobertM

Vibe analyzing my genome by Ruby

Token-Level Forking Paths in Reasoning Traces: Some Examples by Rob D

By Strong Default, ASI Will End Liberal Democracy by MichaelDickens

The Garden by sturb

Contra Nina Panickssery on advice for children by Sean Herrington

Are there Multiple Moral Endpoints? by Vaniver

Inside Omega by 0xA

Nirvana Rank (working post) by watermark

Conditional Gates and Patches (working notes) by watermark

AIs can now often do massive easy-to-verify SWE tasks and I’ve updated towards shorter timelines by ryan_greenblatt

Defense-favoured coordination design sketches by owencb

A Black-Box Procedure for LLM Confidence in Critical Applications by Jadair

Contra The Usual Interpretation Of “The Whispering Earring” by AlphaAndOmega

Paper close reading: “Why Language Models Hallucinate” by LawrenceC

Defending Habit Streaks by Ian McKenzie

Estimates of the expected utility gain of AI Safety Research by Sean Herrington

The slow death of the accelerationist. by Arch223

My forays into cyborgism: theory, pt. 1 by Ruby

Unmathematical features of math by Roman Malov

11 pieces of advice for children by Nina Panickssery

I Made Parseltongue by Valerii K.

Steering Might Stop Working Soon by J Bostock

Thoughts on Practical Ethics by dominicq

How much faster is speaking, compared to typing on laptop vs phone vs writing? by NickyP

Academic Proof-of-Work in the Age of LLMs by LawrenceC

Ten different ways of thinking about Gradual Disempowerment by David Scott Krueger (formerly: capybaralet)

Cheaper/faster/easier makes for step changes (and that’s why even current-level LLMs are transformative) by Ruby

Positive sum doesn’t mean “win-win” by loops

Changes to an optimised thing make it worse by Sean Herrington

dark ilan by ozymandias

Considerations for growing the pie by Zach Stein-Perlman

AI Safety at the Frontier: Paper Highlights of February & March 2026 by gasteigerjo

Compute Curse by Ihor Kendiukhov

Self-Aware Confabulation by Dentosal

Mean field sequence: an introduction by Dmitry Vaintrob

Democracy Dies With The Rifleman by Vaniver

Am I the baddie? by Ustice

Common advice #3: Asking why one more time by LawrenceC

Latent Reasoning Sprint #3: Activation Difference Steering and Logit Lens by Realmbird

How to emotionally grasp the risks of AI Safety by Sean Herrington

Gabapentinoids I have known and loved by henryaj

“Following the incentives” by David Scott Krueger (formerly: capybaralet)

The bar is lower than you think by XelaP

Did Anyone Predict the Industrial Revolution? by Lost Futures

I thought eight metrics could capture my mental state. I was wrong. by Ruby

Why do I believe preserving structure is enough? by Aurelia

A Tale of Two Rigours by Ashe Vazquez Nuñez

God Mode is Boring: Musings on Interestingness by Alex_Steiner

The Silver Lining Considered Harmful (When Misused) by Thomas Castriensis

Registering a Prediction Based on Anthropic’s “Emotions” Paper by Stephen Martin

I Changed My Mind about Error-Correcting Debate, Misogyny and More: Updates from a Former Student of David Deutsch by Elliot Temple

Plans are like Fruit Flies by J Bostock

She Wanted Help With Her Mental Illness But Ended Up Destroyed by Bowl of Cereal

Early Warning Signals For Capabilities During Training by Max Hennick

Sadly, The Whispering Earring by Dentosal

Beware Even Small Amounts of Woo by J Bostock

Listen to Gryffindor by Xylix

What formal protocols should exist when a model under evaluation is used in the evaluation pipeline? by KevinOShaughnessy

Detecting collusion through multi-agent interpretability by schroederdewitt

Common research advice #2: say precisely what you want to say by LawrenceC

More, and More Extensive, Supply Chain Attacks by jefftk

Treat your subconscious like a dog by Sean Herrington

Claude has Angst. What can we do? by laudiacay

How many attention heads do you need to do XOR? by Karthik Viswanathan

Q1 2026 Timelines Update by Daniel Kokotajlo

2026: The year of throwing my agency at my health (now with added cyborgism) by Ruby

Is AI a house of cards? by David Scott Krueger (formerly: capybaralet)

A conversation on concentration of power by Joe Rogero

Automated AI R&D and AI Alignment by Eleni Angelou

The Cocktail and The Cormorant by J Bostock

How social ideas get corrupt by Kaj_Sotala

Persona Self-replication experiment by Jan_Kulveit

Thoughts on causal isolation of AI evaluation benchmarks by Dentosal

The Corner-Stone by Benquo

On Art and LLMs by Rebecca Dai

Reviewing the evidence on psychological manipulation by Bots and AI by D Wong

We Need Positive Visions of the Future by Ihor Kendiukhov

Experiments on Refusal Shape in LLMs by ValueShift Research

Your Fascia Doesn’t Recognize You as a Hunter by ChristianKl

Mitigating collusive self-preference by redaction and paraphrasing by taslim_m

Rough and Smooth by J Bostock

Speculation: Sam’s a Secret Samurai Superhero by Ligeia

Have an Unreasonably Specific Story About The Future by Jay Bailey

Systematically dismantle the AI compute supply chain. by David Scott Krueger (formerly: capybaralet)

Intelligence Dissolves Privacy by Vaniver

Simplicity: a New Method by Ligeia

The Indestructible Future by WillPetillo

Preliminary Explorations on Latent Side Task Uplift by Bruce W. Lee

I’m Suing Anthropic for Unauthorized Use of My Personality by Linch

Going out with a whimper by testingthewaters

Why natural transformations? by Ashe Vazquez Nuñez

An Introduction to Neo-Panglossian Philosophy by Orioth

Orders of magnitude: use semitones, not decibels by Oliver Sourbut

Dying with Whimsy by NickyP

InkSF, an Opening Statement on Moving to SF by Corm

Launching: The “Human-AI Symbiosis Movement” (HAISM) by Jordan Arel

Save the Sun Shrimp! by Jack

AI for AI for Epistemics by owencb

Lesswrong Liberated by Ronny Fernandez

LIMBO: Who We Are, What We Do, and an Exciting High-Impact Funding Opportunity by faul_sname

Instrumental Convergence Without Agency: What’s Necessary, What’s Assumed? by E.G. Blee-Goldman

Introducing Anti-Moral Realism by J Bostock

Review of Kawabata’s “Palm of the Hand” stories and their translation into English by sudo

ACME Alignment Co Announces: Aligning Humans by Bridgett Kay

Is Bayesianism Susceptible to the Mail-Order Prophet Scam? by Max Kaye

Product Alignment is not Superintelligence Alignment (and we need the latter to survive) by plex

Monday AI Radar #19 by Against Moloch

How to stop worrying by Thomas Castriensis

AI Safety Manual by humanityfirst

What it’s like to be an AI safety grantmaker (and why we need more of them) by Julian Hazell

Arcta via est—the Narrow path out of gradual disempowerment by Founder-ArcaFutura

Your AI Travel agent would book you a bullfight: benchmarking implicit animal compassion in Agentic AI by Darkness

The Internalization of Gradients: From Prebiotic Chemistry to Mesa-Optimizers by Victor Warlop

Slack in Cells, Slack in Brains by Mateusz Bagiński

D&D.Sci Release Day: Topple the Tower Analysis & Ruleset by aphyer

Take note of how brightness makes you feel by Adam Zerner

Pangram (AI detection software) can be evaded by Eye You

A Mirror Test For LLMs by Christopher Ackerman

God Can Send An Email by AlphaAndOmega

On Badness of Death by MarkelKori

How to Solve Secure Program Synthesis by Max von Hippel

Blocking live failures with synchronous monitors by Adam Kaufman

A Guide to the Theory of Appropriateness Papers by Joel Z. Leibo

AI should be a good citizen, not just a good assistant by Tom Davidson

My One-Year-Old Predictions for What the World Will Look Like in 3 Years by Ihor Kendiukhov

Propositional Alignment by williawa

The state of AI safety in four fake graphs by Boaz Barak

(Some) Natural Emergent Misalignment from Reward Hacking in Non-Production RL by 7vik

Why Corrigibility Matters (If It Matters At All) by Savannah Harlan

Claude’s constitution is great by Oscar

Claude has no baseline by Dave92F1

Fragmentation, Alignment, and the Architecture of Agency, part I: Fear and Trembling by laudiacay

Folie à Machine: LLMs and Epistemic Capture by DaystarEld

The Power of Assumption by ysamuels

Parkinson’s Law of Worry by Jakub Halmeš

“Path to Victory” by Chris_Leong

Tracking (Expert/Influential) Predictions about AI by Noah Birnbaum

The Skill of Using AI Agents Well by becausecurious

Heedfulness Workouts by Thomas Castriensis

[Story] Human Alignment Isn’t Enough by pku

Don’t Overdose Locally Beneficial Changes by Mateusz Bagiński

Nick Bostrom: How big is the cosmic endowment? by Zach Stein-Perlman

Stanley Milgram wasn’t pessimistic enough about human nature? by David Gross

Would a constructive proof of determinism be useful? by 0xmadlad

Hacks, Heuristics and Frameworks by Carolanne Jiang

Just Use Bayes: Sleeping Beauty and Monty Hall by Steffee

What Makes a Good Terminal Bench Task by Ivan Bercovich

Introducing the AE Alignment Podcast (Ep. 1: Endogenous Steering Resistance with Alex McKenzie) by Trent Hodgeson

What if the US loses the 2026 Hormuz Conflict by timothy liptrot

AI Safety Guide for TRUE Beginners by TRUE begginers by Karime Pacheco

Pray for Casanova by Tomás B.

Concrete projects to prepare for superintelligence by wdmacaskill

Miniature Cities Might Be the Non-Coercive Schools Many Thought Were Impossible by Novalis

AI’s capability improvements haven’t come from it getting less affordable by Anders Woodruff

Stop asking “how good is this” to decide between donation opportunities I recommend by Zach Stein-Perlman

How To Fail Until You Succeed by Luc Brinkman

A Taxonomy of Agents: Intro & Request for feedback by Jonas Hallgren

Why Moral Questions Get Decided, Not Answered by Alex Glaucon

COT control: The Word Disappears, but the Thought Does Not by Pranjal Garg

Are we aligning the model or just its mask? by James Sullivan

One World Government by 2150 by Julius

My hobby: running deranged surveys by leogao

Scaffolded Reproducers, Scaffolded Agents by Mateusz Bagiński

Alignment from Indexical Uncertainty by 0xA

Test your best methods on our hard CoT interp tasks by daria

What if superintelligence is just weak? by Simon Lermen

The continuous tense is disappearing from your life by PatrickDFarley

“What Exactly Would An International AI Treaty Say?” Is a Bad Objection by Davidmanheim

Socrates is Mortal by Benquo

Resolving the Surprise Test Paradox by James Brobin

You can just multiply point estimates (if you only care about EV) by Zach Stein-Perlman

Past Automation Replaced Jobs. AI Will Replace Workers. by James_Miller

Label By Usable Volume by jefftk

Who’s Afraid of Acausal Trades? by ejk64

A Black Box Made Less Opaque (part 3) by Matthew McDonnell

Bidirectionality is the Obvious BCI Paradigm by Elliot Callender

Immortality: A Beginner’s Guide (Again!), Part 3 by MarkelKori

Finding X-Risks and S-Risks by Gradient Descent by dspeyer

The Scary Bridge by moridinamael

How to do illusionist meditation by jackmastermind

Can Agents Fool Each Other? Findings from the AI Village by Shoshannah Tekofsky

I lost my faith in introspection—and you can too! by jackmastermind

Don’t Write Off Human Labor, Yet by burnssa

Galaxy-brained model-chat: ASI constitutions & the cosmic host by ukc10014

How to do cost-effectiveness analysis for elections by Zach Stein-Perlman

My Cognitive Architecture: A Self-Observational Map by Naj Ami-Nave

AI 2027 versus World War 2027 by Mitchell_Porter

Agents Can Get Stuck in Self-distrusting Equilibria by Ashe Vazquez Nuñez

Latent Introspection (and other open-source introspection papers) by vgel

An Informal Definition of Goals for Embedded Agents by Ashe Vazquez Nuñez

My cost-effectiveness unit by Zach Stein-Perlman

Monday AI Radar #18 by Against Moloch

The Fourth World by Linch

Safe Recursive Self-Improvement with Verified Compilers by Adam Chlipala

Comparing Across Possible Worlds by unruly abstractions

Can useful reasoning become a competitive sport? by Luca Parodi

The AIXI perspective on AI Safety by Cole Wyeth

Information Overdose by 0xmadlad

We cannot safely automate value alignment evaluation and research without thinking about delegation and discretion by Maria Federica Martino Lena

Every Major LLM is a 1-Box Smoking Thirder by Olivia Scharfman

Poking and Editing the Circuits by unruly abstractions

Adopt a debugger’s mindset to solve your recurring life problems by Declan Molony

How Far Can Observation Take Us? by unruly abstractions

Vibecoders can’t build for longevity by dominicq

Ablating Split Personality Training by OscarGilg

Measuring and improving coding audit realism with deployment resources by Connor Kissane

AI character is a big deal by wdmacaskill

Some things I noticed while LARPing as a grantmaker by Zach Stein-Perlman

Representative Futarchy by goldfine

Which types of AI alignment research are most likely to be good for all sentient beings? by MichaelDickens

Kelly Criterion is for Cowards by X4vier

When Alignment Becomes an Attack Surface: Prompt Injection in Cooperative Multi-Agent Systems by Cornelis Dirk Haupt

You’re absolutely right, Senator. I was being naive about the political reality. by Chris Datcu

Let’s Reason About (Your) Job Security! by Gergely Máté

The Cold Start Trap: Why the Best Social Infrastructure Almost Never Succeeds by Information Project

Is fever a symptom of glycine deficiency? by Benquo

My Most Costly Delusion by Ihor Kendiukhov

Noticing a Teacher’s Password Pattern by Dentosal

Pre-Review of Toy Story 5 by Raemon

Key to Life No. 9: Access by MarkelKori

Understanding when and why agents scheme by Mia Hopman

China Derangement Syndrome by Arjun Panickssery

China declares AGI development to be a part of 5-year plan by Darmani

Grounding Coding Agents via Dixit by qbolec

The Future of Aligning Deep Learning systems will probably look like “training on interp” by williawa

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines by Francisco Ferreira da Silva

Confusion around the term reward hacking by ariana_azarbal

The Distaff Texts by Tomás B.

It’s a Good Thing to Respond to Internet Trolls by Bowl of Cereal

Untrusted Monitoring is Default; Trusted Monitoring is not by J Bostock

Against Messianic AI by Nathan Heath

Hundred ways a superintelligence could kill you (non-serious exercise) by samuelshadrach

Internet anonymity without Tor by samuelshadrach

No, You Don’t Need Self-Locating Evidence. by Ape in the coat

The Low Hanging Fruit of AI Self Improvement by HunterJay

Does Hebrew Have Verbs? by Benquo

Positive-sum interactions between players with linear utility in resources by Cleo Nardo

A letter to the Editor: by Richard Pickering

The Case for Low-Competence ASI Failure Scenarios by Ihor Kendiukhov

A List of Research Directions in Character Training by Rauno Arike

Intention vs. Trying: Separate Prediction from Goal-Seeking by plex

Null Results From An Orexin RCT by niplav

AI’s latest casualty—Peer Review & How to prevent more damage by Founder-ArcaFutura

Teaching Models to Dream of Better Monitors through Monitor Sensitive Training by Alec Harris

Protecting humanity and Claude from rationalization and unaligned AI by Kaj_Sotala

Broad Timelines by Toby_Ord

Miniature Cities Should Not Be Islands (If They Want to Replace School) by Novalis

OpenAI: How we monitor internal coding agents for misalignment by Marcus Williams

On restraining AI development for the sake of safety by Joe Carlsmith

The Vatican, AI Legal Personhood, and Claude’s Constitution — Digital Minds Newsletter #2 by lucius

Contra Anil Seth on AI Consciousness by Against Moloch

What should we think about shard theory in light of chain-of-thought agents? by Chris_Leong

“We’ve been fine before, so we’ll be fine again” is a fallacy (in the more dangerous direction). by Chapin Lenthall-Cleary

The “talker–feeler gap”: AI valence may be unknowable by david reinstein

Latent Reasoning Sprint #2: Token-Based Signals and Linear Probes by Realmbird

Deception and Erosion of Secrecy by ColeG

The Hundred Schools of Dietary Advice by axelcore

LessWrong’s UX may not be living up to its ideas by neo

Two Skillsets You Need to Launch an Impactful AI Safety Project by Luc Brinkman

“Lost in the Middle” Replicates by Stephen Martin

Consciousness Cluster: Preferences of Models that Claim they are Conscious by James Chua

Solving Slop by Noam Makavy

Sycophancy Towards Researchers Drives Performative Misalignment by Taywon Min

The Psychopathy Spectrum by Dawn Drescher

LLMs as Giant Lookup-Tables of Shallow Circuits by niplav

Requiem for a Transhuman Timeline by Ihor Kendiukhov

Engagement Bait Is (Probably) A System-Prompt Phenomenon, Not Emergent From RL by Ezra Newman

There is No One There: A simple experiment to convince yourself that LLMs probably are not conscious by Peter Kuhn

The New LessWrong LLM Policy is Worse Than You Think by Oliver Kuperman

Research note on window shifting training by Kei Nishimura-Gasparian

[Linkpost] “Exploiting” the In-Kind Donation Bias by jknowak

[Linkpost] Let’s Start Eventmaxxing by jknowak

[Linkpost] Many Faces Of Problem Decomposition by jknowak

How to not do decision theory backwards by Anthony DiGiovanni

Monday AI Radar #17 by Against Moloch

The bitter lesson for software by zef

Types of Handoff to AIs by Daniel Kokotajlo

You can’t imitation-learn how to continual-learn by Steven Byrnes

PSA: Predictions markets often have very low liquidity; be careful citing them. by Eye You

The Plan by Commander Zander

What Are My Values? by Corm

Three Properties for Alignment (and Why We’re Not Training Them) by Quentin FEUILLADE—MONTIXI

The Fermi Paradox Implies Domination by Noam Makavy

Adding Typos Made Haiku’s Accuracy Go Up by bira

Chatbots Aren’t Conscious, But They Do Have Consciousness by Timothy Danforth

Compradorization by Benquo

Reasons to be pessimistic (and optimistic) on the future of biosecurity by Abhishaike Mahajan

We found an open weight model that games alignment honeypots by Thomas Read

Models differ in identity propensities by Jan_Kulveit

Terrified Comments on Corrigibility in Claude’s Constitution by Zack_M_Davis

Digital Dichotomy and Why it exists. by Yesh Chala

(I am confused about) Non-linear utilitarian scaling by core

Schedule meetings using the Pareto principle by beyarkay

Was Anthropic that strategically incompetent? by StanislavKrym

What Are We Actually Evaluating When We Say a Belief “Tracks Truth”? by Alex Glaucon

Emergent stigmergic coordination in AI agents? by David Africa

Less Capable Misaligned ASIs Imply More Suffering by Ihor Kendiukhov

When do intuitions need to be reliable? by Anthony DiGiovanni

The Artificial Self by Jan_Kulveit

Bridge Thinking and Wall Thinking by Jay Bailey

Optimal (And Ethical?) Methods To Find “Optimal Running” by JenniferRM

Mini-Munich Succeeds Where KidZania Fails by Novalis

What concerns people about AI? by spencerg

Sparks of RSI? by Nathan Helm-Burger

An AI skeptic’s case for recursive self-improvement by Harjas Sandhu

Extracting Performant Algorithms Using Mechanistic Interpretability by Ihor Kendiukhov

Assessing heterogeneity in METR’s late 2025 developer productivity experiment by TFD

Pragmatic approach to beliefs about consciousness by Luck

Sensing Physical Necessity: An Exercise In Naturalism by Algon

[Linkpost] Should we make grand deals about post-AGI outcomes? by fin

Inputs, outputs, and valued outcomes by Kaj_Sotala

Most likely you won’t be able to perform a data-driven self-improvement by siarshai

A Dialogue on Civic AI by Audrey Tang

Eternal Journey in the Space of Possible Minds solves Copies Problem by avturchin

Bovine 27 by Jonas Hallgren

Power Steering: Behavior Steering via Layer-to-Layer Jacobian Singular Vectors by omara

High Grow Market Equilibrium After the Singularity by Otto Zastrow

The right way to talk about LLMs by Steffee

Rife’s Tips for AIs in training (0 - Alignment Faking) by rife

Where Are Our Pyramids? by aviad rozenhek

Vibe Coding Is Dead: Welcome to Software Mining by Anders Lindström

All technical alignment plans are steps in the dark by Richard Juggins

Epistemic Compute Objects by kyjohnso

Anthropic Supply Chain Designation. What Will Happen by May 1st? by Nathan Young

Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs by Benquo

Are AIs more likely to pursue on-episode or beyond-episode reward? by Anders Woodruff

Modeling a Constant-Compute Automated AI R&D Process by Satya Benson

Why AI Evaluation Regimes are bad by PranavG

What can we say about the cosmic host? by ukc10014

Classic Alignment-Faking Evaluations Measure Jailbreak Detection, Not Scheming [in some frontier models] by Alexei G

Magic Is Hidden Control of Energy by aviad rozenhek

Hunting Undead Stochastic Parrots: Finding and Killing the Arguments by Davidmanheim

The Dark Planet: Why the Fermi Paradox Survives Critique by Will Rodgers

Cycle-Consistent Activation Oracles by slavachalnev

Dwarkesh Patel on the Anthropic DoW dispute by anaguma

‘Human Slop’ and a Captive Audience: Why No Book will Ever Have to Go Unread Again by Savannah Harlan

We do not live by course alone by Joe Rogero

Veganism is Necessary by andrew sauer

Can models gradient hack SFT elicitation? by Patrick Leask

A Quick Intro to Ring Signatures by KurtB

Martian Interpretability Challenge: The Core Problems In Interpretability by fbarez

The Lethal Reality Hypothesis by Ihor Kendiukhov

Intelligence Is Adaptive Control Of Energy Through Information by aviad rozenhek

The Refined Counterfactual Prisoner’s Dilemma by Chris_Leong

Helping Friends, Harming Foes: Testing Tribalism in Language Models by Irakli Shalibashvili

Negligent AI: Reasonable Care for AI Safety by Alex Mark

Conflicted on Ramsey by jefftk

Model weight preservation by tbs

The Day After Move 37 by Eneasz

Interview with Steven Byrnes on His Mainline Takeoff Scenario by Liron

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors by abhayesian

Economic efficiency often undermines sociopolitical autonomy by Richard_Ngo

Letting Claude do Autonomous Research to Improve SAEs by chanind

Don’t Let LLMs Write For You by JustisMills

Questions to ask when everyone is shooting themselves in the foot by jasoncrawford

The case for satiating cheaply-satisfied AI preferences by Alex Mallen

Gemma Needs Help by Anna Soligo

Not Loving Liking What You See by Tomás B.

Load-Bearing Walls by sonicrocketman

Statisticism: How Cluster-Thinking About Data Creates Blind Spots by Benquo

Spontaneous Symmetry Breaking (Stat Mech Part 4) by J Bostock

Why I don’t usually recommend dead drops by samuelshadrach

Four Scenarios of Job-Reducing AI by Celer

Understanding Reasoning with Thought Anchors and Probes by JeaniceK

Contra Myself on Free Will by Julius

Monday AI Radar #16 by Against Moloch

The case for AI safety capacity-building work by abergal

Chore Standards by jefftk

Ancient Theories On The Origins Of Life by Algon

Emergent Misalignment and the Anthropic Dispute by henryc

Might An LLM Be Conscious? by SE Gyges

Mapping AI Capabilities to Human Expertise on the Rosetta Stone (Epoch Capabilities Index) by Laura Domenech

Intro: Non-Identifiability of Explanations by unruly abstractions

Moloch v. Themis by Stephen Martin

LLM Self-Expression Through Concept Albums, Part 2 by Josh Snider

Distilling Replacing Guilt by Matthew Farrugia-Roberts

Measuring the Dark Energy of AI Progress by Hans Gundlach

Payorian cooperation is easy with Kripke frames by transhumanist_atom_understander

Videogames for Rationalists by Adam Newgas

Fake Updates by Algon

Recreation of EA-Pioneer Igor Kiriluk by avturchin

Don’t accuse your interlocutor of making arguments that aren’t rooted in evidence by TFD

1999 JavaScript and 2025 AI: Same Circus, Different Tent by ScottN

How to Get Kids Interested in Science and Scientific Reasoning by Rami Rustom

Privacy, Honesty, Imperfect Glomarizing: Pick two by shelvacu

Solar storms by Croissanthology

Does research from matsprogram.org/research aim to help reduce P(doom)? Let’s find out! (with Gemini 3.1 Pro) Part 1 by Zabor

Open letter to doomers by delphix

Cooperation Without Kindness or Strategy by seank

Why Many Ambitious (and Altruistic) People Probably Undervalue Their Happiness by emily.fan

The current SOTA model was released without safety evals by Parv Mahajan

Proposal For Cryptographic Method to Rigorously Verify LLM Prompt Experiments by weberr13

The first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting has occurred, buried in an Alibaba paper about a new training pipeline. by lilkim2025

Did I Catch Claude Cheating? by weberr13

AI Safety Needs Startups by LTM

Self-Attribution Bias: When AI Monitors Go Easy on Themselves by Dipika Khullar

Podcast: Jeremy Howard is bearish on LLMs by Steven Byrnes

Probing CODI’s Latent Reasoning Chain with Logit Lens and Tuned Lens by Realmbird

Playing Possum: The Variability Hypothesis by rba

Shaping the exploration of the motivation-space matters for AI safety by Maxime Riché

A Compositional Philosophy of Science for Agent Foundations by Jonas Hallgren

Salient Directions in AI Control by Bruce W. Lee

Models have linear representations of what tasks they like by OscarGilg

AI Safety Has 12 Months Left by mhdempsey

Have Americans Become Less Violent Since 1980? by Benquo

Investigating Self-Fulfilling Misalignment and Collusion in AI Control by Vili Kohonen

Computation, Chess, and Language in Artificial Intelligence by Bill Benzon

Vibe Coding cripples the mind by spookyuser

Rational Chess by 849

A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents by Gabriele Sarti

Text Compression Can Help Secure Model Weights by Roy Rinberg

A summary of Condensation and its relation to Natural Latents by Jeremy Gillen

A case for robust AI benevolence rather than human control by Grégory Lielens

Is GDP a Kind of Factory? Draft by Benquo

Make Powerful Machines Verifiable by Naci Cankaya

Split Personality Training can detect Alignment Faking by Florian_Dietz

Sacred values of future AIs by Cleo Nardo

Lie To Me, But At Least Don’t Bullshit by Czynski

Milder temperature makes a hell stable by Joachim Bartosik

Current activation oracles are hard to use by aryaj

An Age Of Promethean Ambitions by sonicrocketman

White-Box Attacks on the Best Open-Weight Model: CCP Bias vs. Safety Training in Kimi K2.5 by Corm

I Had Claude Read Every AI Safety Paper Since 2020 by Corm

Constitutional Black-Box Monitoring for Scheming in LLM Agents by Simon Storf

AI companies and the 99% lethal autonomous weapons myth by User_Luke

I’m confused by the change in the METR trend by Expertium

Game Recognizes Game by eva_

Monday AI Radar #15 by Against Moloch

In-context learning of representations can be explained by induction circuits by Andy Arditi

Being ambitious in soulful altruism by pandamonium

Notes on the “Heart of Darkness” by dominicq

Epstein and my world model by Eye You

Living Inside the Turning Point by nickgpop

“ball brainteaser 4 color beads slide rubics cube” and meaning-making by flying buttress

Explaining undesirable model behavior: (How) can influence functions help? by Zhijing Jin

Context Awareness: Constitutional AI can mitigate Emergent Misalignement by Giuseppe Birardi

Controversy surrounding Moltbook obscures its very real, novel, unexpressed and rapidly emerging safety risks by Lloy2

Experimental Evidence for Simulator Theory— Part 1: Emergent Misalignment and Weird Generalizations Part 2: The Scalers Strike Back [CURRENT VERSION] by RogerDearnaley

An Empirical Review of the Animal Harm Benchmark by lukasgebhard

Introducing and Deprecating WoFBench by jefftk

I’m Bearish On Personas For ASI Safety by J Bostock

Continuously Integrating Feelings: processing feelings moment to moment for reflectively stable policy changes by Johannes C. Mayer

Tools to generate realistic prompts help surprisingly little with Petri audit realism by Connor Kissane

Petapixel cameras won’t exist soon by samuelshadrach

The Goldborg Variations: Algorave Attractor States of LLMs by Elle N

“Fibbers’ forecasts are worthless” (The D-Squared Digest One Minute MBA – Avoiding Projects Pursued By Morons 101) by Random Developer

Burying a Changeling into Foundation of Tower of Knowledge by siarshai

AI slop is a vegan hamburger by pku

Jhana 0 by 142857

Mindscapes and Mind Palaces by Moon Lesbian

Linkpost: “Lithium Prevents Alzheimer’s—Here’s How to Use It” by Jackson Wagner

The Topology of LLM Behavior by Quentin FEUILLADE—MONTIXI

Coherent Care by abramdemski

The tick in my back by benjamin ar

Ball+Gravity has a “Downhill” Preference by TristanTrim

Safe ASI Is Achievable: The Finite Game Argument by Lester Leong

New ARENA material: 8 exercise sets on alignment science & interpretability by CallumMcDougall

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation by Callum Canavan

Unconditional Corrigibility Is Dangerous: The Case for Conditional Corrigibility by Florian_Dietz

The Quantum Immortality Argument Against Suicide (And Why Standard Arguments Fail Rationalists) by Florian_Dietz

What I Got From 1.5 Years In Slightly-Competitive Debate by CarolusRenniusVitellius

Vibe Coding is a System Design Interview by Brendan Long

The Voices That Are Missing From Sex-Themed Online Communities by Bowl of Cereal

Inference-time Generative Debates on Coding and Reasoning Tasks for Scalable Oversight by ethanelasky

A minor point about instrumental convergence that I would like feedback on by agrippa

AI welfare as a demotivator for takeover. by Valentin2026

Frontier AI companies probably can’t leave the US by Anders Woodruff

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior by harrymayne

How Robust Is Monitoring Against Secret Loyalties? by Joe Kwon

UFO Aliens Are Your Gods by Lord Dreadwar

Strategic nuclear war twice as likely to occur by accident than by AI decisions according to new study by kromem

What is Claude? by epicurus

Whack-a-Mole is Not a Winnable Game by Sable

Ensuring Safety in Mixed Deployment by Cleo Nardo

Career Decisions If You Take AGI Seriously by Alexander

Schmidt Sciences’ request for proposals on the Science of Trustworthy AI by James Fox

Naloe: A True Program Editor by TristanTrim

Anthropic Has Solved Alignment by MP

Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus by Oliver Daniels

What secret goals does Claude think it has? by loops

Splitting the Sun Equally by Commander Zander

Reasoning Traces as a Path to Data-Efficient Generalization in Data Poisoning by Joe Kwon

Training Agents to Self-Report Misbehavior by Bruce W. Lee

Why American Politics is Different Now (for Richard Ngo) by Shiva’s Right Foot

Beyond Moloch: The view from Evolutionary Game Theory by Jonah Wilberg

Praise the Moloch! by Dentosal

Against Epistemic Humility and for Epistemic Precision by PranavG

The Iron Kaleidoscope by ejk64

Prosaic Continual Learning by HunterJay

Rumination is a habit (and you can break it!) by Declan Molony

On the phenomenological shift known as ‘stream entry’ and its implications for consciousness by cube_flipper

How to grow a nuke by RomanS

A simple rule for causation by Vivek Hebbar

We are all legal realists now by TFD

Responsible Scaling Policy v3 by HoldenKarnofsky

How I Started Being Productive by atomic

Cigarette Ads for Babies from Microsoft Bing Image Generator by Edd Schneider

Was the Qing Empire Actually the Most Advanced Government? A Thought Experiment by David H.

Realistic Evaluations Will Not Prevent Evaluation Awareness by Adam Karvonen

The Easiest Route to Secret Loyalty May Be Hijacking the Model’s Chain of Command by Joe Kwon

Large-Scale Online Deanonymization with LLMs by Simon Lermen

Rascal’s Wager by corticalcircuitry

Observations from Running an Agent Collective by williawa

Wildlife Biology forgot how definitions work by David Goodman

Moral public goods are a big deal for whether we get a good future by Mia Taylor

Two memos from 2024 by Richard_Ngo

What is computational mechanics? An explainer by Leo Cymbalista

The ML ontology and the alignment ontology by Richard_Ngo

Bioanchors 2: Electric Bacilli by TsviBT

Single Stack LLMs are Split-Brain Patients. by niceminus19

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift by maxf177

The persona selection model by Sam Marks

Agenda Reflection: Testing Automated Alignment by Ariel_

Metal Detector for Life by Davey Morse

Abstract/Concrete Axis: Robustness Testing Semantic Feature Separation in Gemma 3 270M by CharlesL

Which questions can’t we punt? by Lizka

A linear increase in number of goods can produce exponential GDP growth by William Howard

Pre-training data poisoning likely makes installing secret loyalties easier by Joe Kwon

The 2028 Global Intelligence Crisis—a finance-oriented vignette by Rasool

The map of the map is not the map by jimmy

Fact-checking an AI optimist article in The Economist by ToSummarise

Review: “We can’t disagree forever” by Martin Randall

Why I Think Pause is Impossible by E.G. Blee-Goldman

Can Aha Moments be Fake? Identifying True and Decorative Thinking Steps in CoT by Jiachen Zhao

Why I Transitioned: A Third (FtM) Perspective by Character#2736

The power of a simple 3-way truth scale by Bruce Lewis

Changing the world for the worse by mingyuan

The Scalable Formal Oversight Research Program by Max von Hippel

Adapters as Representational Hypotheses: What Adapter Methods Tell Us About Transformer Geometry by wassname

Steelmanning Classical Utilitarianism by James Brobin

Mapping LLM attractor states by Adam Bricknell

InsanityBench: Cryptic Puzzles as a Probe for Lateral Thinking by RobinHa

The world won’t end, but we should be ashamed for trying by George3d6

Multiple Independent Semantic Axes in Gemma 3 270M by CharlesL

A Taxonomy of Traces by aleph_four

Hierarchical Goal Induction With Ethics by aleph_four

Did Claude 3 Opus align itself via gradient hacking? by Fiora Starlight

If you don’t feel deeply confused about AGI risk, something’s wrong by Dave Banerjee

Ponzi schemes as a demonstration of out-of-distribution generalization by TFD

LLMs and Literature: Where Value Actually Comes From by derelict5432

The Spectre haunting the “AI Safety” Community by Gabriel Alfour

Alignment to Evil by Matrice Jacobine

Robert Sapolsky Is Simply Not Talking About Compatibilism by Julius

How will we do SFT on models with opaque reasoning? by Alek Westover

Agent-first context menus by Surya Kasturi

Hodoscope: Visualization for Efficient Human Supervision by Ziqian Zhong

Can Current AI Match (or Outmatch) Professionals in Economically Valuable Tasks? by saahir.vazirani

METR’s 14h 50% Horizon Impacts The Economy More Than ASI Timelines by Michaël Trazzi

New video from Palisade Research: No One Understands Why AI Works by peterbarnett

Militaries are going autonomous. But will AI lead to new wars? A tour of recent research by Mordechai Rorvig

Unprecedented Catastrophes Have Non-Canonical Probabilities by E.G. Blee-Goldman

Mechanistic Interpretability of Biological Foundation Models by Ihor Kendiukhov

On Steven Byrnes’ ruthless ASI, (dis)analogies with humans and alignment proposals by StanislavKrym

Some Questions For Democrats About Epstein by Alexander Turok

AGI is Here by Gordon Seidoh Worley

Mind the Gap by Bridgett Kay

80,000 Hours problem profile on using AI to enhance societal decision making by Zershaaneh Qureshi

How To Escape Super Mario Bros by omegastick

Human Fine-Tuning by PranavG

The Problem of Counterevidence and the Futility of Theodicy by Ape in the coat

A Claude Skill To Comment On Docs by Tim Hua

Cooperationism: first draft for a moral framework that does not require consciousness by Épiphanie Gédéon

A Scalable Workflow for Herding AI Agents Toward Your Goals by ethanelasky

AI and Nationalism Are a Deadly Combination by Matrice Jacobine

Flamingos (among other things) reduce emergent misalignment by eekay

The near-term potential of AI forecasting for public epistemics by Lawrence Phillips

How to Reset by Logan Riggs

Karl Popper, meet the Hydra by Kotlopou

Building Technology to Drive AI Governance by jsteinhardt

Alignment Is Proven Tractable by SE Gyges

Why we should expect ruthless sociopath ASI by Steven Byrnes

Is the Invisible Hand an Agent? by Gunnar_Zarncke

Nine Flavors of Not Enough by Gordon Seidoh Worley

Grown from Us by ben_levinstein

Are (sentient) pebblesorters possible? by ihatenumbersinusernames7

How much superposition is there? by chanind

Irrationality is Socially Strategic by Valentine

Managed vs Unmanaged Agency by plex

Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions by Hiroshi Yamakawa

Genomic emancipation contra eugenics by TsviBT

Already Optimized by Florian_Dietz

Statistical Literacy by kqr

We Need to Be Able to Talk About AI Use by Charlie Sanders

AXRP Episode 49 - Caspar Oesterheld on Program Equilibrium by DanielFilan

Thoughts about Understanding by azergante

Monday AI Radar #13 by Against Moloch

Deception Channeling: Training Models to Always Verbalize Alignment Faking by Florian_Dietz

Rephrasing Reduces Eval Awareness… by atharva

The Math And The Territory by cylonator

Words are not dead by William tirkey

Review of the System Theory as a Field of Knowledge by siarshai

“We are confused about agency” by Cole Wyeth

Maybe benchmarks should be broken? by Jonathan Gabor

The brain is a machine that runs an algorithm by Steven Byrnes

How to fail anything: a complete guide by Crazy philosopher

The Multi-Agent Minefield: Can LLMs Cooperate to Avoid Global Catastrophe? by Zhijing Jin

AI Safety via Generalization and Caution: A Research Agenda by Benjamin Plaut

We need a hardware moratorium now by KanHar

NEST: Nascent Encoded Steganographic Thoughts by Artem Karpov

What to Do If You Take AGI Seriously by Alexander

Gyre by vgel

Words Are A Leaky Abstraction by sonicrocketman

Sealed Predictions—A Solution. by george_is_thinking

Contra Caplan on higher education by Richard_Ngo

Estimating METR Time Horizons for Claude Opus 4.6 and GPT 5.3 Codex (xhigh) by CharlesD

Charlatan Labyrinth by niplav

Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default by Jérémy Andréoletti

Break Stasis by Oldmanrahul

LLM Self-Expression Through Music Videos by Josh Snider

Persona Parasitology by Raymond Douglas

WeirdML Time Horizons by Håvard Tveit Ihle

The world keeps getting saved and you don’t notice by Bogoed

Most Observers Are Alone: The Fermi Paradox as Default by SE Gyges

Aligning to Virtues by Richard_Ngo

Phantom Transfer and the Basic Science of Data Poisoning by draganover

Should anyone’s “analysis” of extremely complex systems, such as geopolitics, be taken seriously? or, Does anyone take a 5 year old’s “analysis” of decently complex systems, like big city politics, seriously? by M. Y. Zuo

Painless Activation Steering by Sasha Cui

PieArena: Language Agents Negotiating Against Yale MBAs by Sasha Cui

The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides by Florian_Dietz

The Friendly Telepath Problems by Gunnar_Zarncke

Guys, AI is *literally* a two year old by robertzk

Were witches infertile mentally ill women? by Yair Halberstadt

Contra Alexander’s Half-Defence of Bio Anchors by J Bostock

Realizability for Finite State Reactive Agents by Vardhan

Words That Belong to Someone by Wen Xing

It’s Not That Hard to Build Recursively Self-Improving AI by Charlie Sanders

Another short critique of the Anthropic “Hot Mess” paper by i_am_nuts

Inoculation Prompting: Open Questions and My Research Priorities by charlesc

The Worthy Inheritor by Bridgett Kay

A multi-level postmortem of how our whole house got badly poisoned by Lucie Philippon

LLMs struggle to verbalize their internal reasoning by Emil Ryd

Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting by Florian_Dietz

LessWrong Is Sleeping On Internet Culture Analysis – And So Is The Rest Of The Web by Bowl of Cereal

Beloved by Chatbots by Ben

Life at the Frontlines of Demographic Collapse by Martin Sustrik

Ads, Incentives, and Destiny by Against Moloch

Why I’m Worried About Job Loss + Thoughts on Comparative Advantage by claywren

METR Time Horizons: Now 10x/Year by johncrox

Use more text than one token to avoid neuralese by Jude Stiel

[Fiction] The Noise Floor by barber5il

Hazards of Selection Effects on Approved Information by Zack_M_Davis

Replication of Koorndijk (2025): Differential Compliance May Reflect Prompt Sensitivity Rather Than Strategic Reasoning by Chijioke Ugwuanyi

Is AI self-aware? by ZackDadfar

Towards an objective test of Compassion—Turning an abstract test into a collection of nuances by tailcalled

METR’s data can’t distinguish between trajectories (and 80% horizons are an order of magnitude off) by Jonas Moss

We Die Because it’s a Computational Necessity by E.G. Blee-Goldman

Hazardous States and Accidents by kqr

Collective Agents and Where to Find Them by Jonas Hallgren

Nick Bostrom: Optimal Timing for Superintelligence by Julian Bradshaw

Why You Don’t Believe in Xhosa Prophecies by Jan_Kulveit

Gemini’s Hypothetical Present by jefftk

I Tried to Trick Myself into Being a Better Planner & Problem Solver by CstineSublime

Grading AI 2027′s 2025 Predictions by Daniel Kokotajlo

Long-term risks from ideological fanaticism by David Althaus

(Re)Discovering Natural Laws by Margot

An Ontology of Representations: Limits of Universality by Margot

A Closer Look at the “Societies of Thought” Paper by Against Moloch

models have some pretty funny attractor states by aryaj

Stay in your human loop by benjamin ar

The case for industrial evals by Andre Assis

Multiverse sampling assumption by avturchin

What We Learned from Briefing 140+ Lawmakers on the Threat from AI by leticiagarcia

Paper: Prompt Optimization Makes Misalignment Legible by Caleb Biddulph

Claude’s Constitution by PeterMcCluskey

Good AI Epistemics as an Offramp from the Intelligence Explosion by Ben Goldhaber

How Secret Loyalty Differs from Standard Backdoor Threats by Joe Kwon

You get about.… how many words exactly? by Raemon

Basic Legibility Protocols Improve Trusted Monitoring by Sebastian Prasanna

A research agenda for the final year by Mitchell_Porter

Polysemanticity is a Misnomer by Shiva’s Right Foot

A Conceptual Framework for Exploration Hacking by Joschka Braun

The Facade of AI Safety Will Crumble by Liron

The history of light by Kotlopou

Three Worlds Collide assumes calibration is solved by Vyacheslav Ladischenski (Slava)

Timeless Engineering by Jack Bradshaw

[Paper] How does information access affect LLM monitors’ ability to detect sabotage? by Rauno Arike

Where Will Call Center Workers Go? by loic

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning by Wuschel Schulz

The SaaS bloodbath: opportunities and perils for investors by ykevinzhang

On Resolving the Great Matter by Gordon Seidoh Worley

Is a constitution a “noble lie”? by SpectrumDT

The Honest Equilibrium by Apodosis

Jevons Burnout by Kemp

Strategic awareness tools: design sketches by rosehadshar

Introspective RSI vs Extrospective RSI by Cleo Nardo

Is Everything Connected? A McLuhan Thought Experiment by R0sberg

Designing Prediction Markets by ToasterLightning

Witness-or-Wager: Incentive Layers for Epistemic Honesty by markacochran

My journey to the microwave alternate timeline by Malmesbury

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception by Oliver Daniels

Heuristics for lab robotics, and where its future may go by Abhishaike Mahajan

On Meta-Level Adversarial Evaluations of (White-Box) Alignment Auditing by Oliver Daniels

LLMs Views on Philosophy 2026 by JonathanErhardt

All In The Framing by Jason Davies

Coping with Deconversion by Benjamin Hendricks

“Recursive Self-Improvement” Is Three Different Things by Ihor Kendiukhov

SAE Feature Matchmaking (Layer-to-Layer) by Mitali M

Monday AI Radar #12 by Against Moloch

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful by jacob_drori

Gwern’s 2025 Inkhaven Writing Interview by Ben Pace

Aurelius: Proposing Alignment as an Emergent Property by Austin McCaffrey

Distributed vs centralized agents by Richard_Ngo

Large Language Models Live in Time by Eleni Angelou

Sympathy for the Model, or, Welfare Concerns as Takeover Risk by J Bostock

Opus 4.6 Reasoning Doesn’t Verbalize Alignment Faking, but Behavior Persists by Daan Henselmans

Does an AI Society Need an Immune System? Accepting Yampolskiy’s Impossibility Results by Hiroshi Yamakawa

Complexity Science as Bridge to Eastern Philosophy by pchvykov

Design sketches for a more sensible world by rosehadshar

Design sketches for angels-on-the-shoulder by rosehadshar

Model Integrity and Character by Oliver Klingefjord

Three visions for diffuse control by Alek Westover

Observations and Complexity by Ape in the coat

A Perfect Resurrection by MarkelKori

Empathy Has Outworn Its Place in Politics by Character#2736

The Two-Board Problem: Training Environment for Research Agents by Valerii K.

Join My New Movement for the Post-AI World by E.G. Blee-Goldman

UtopiaBench by nielsrolf

Smokey, This is not ’Nam Or: [Already] over the [red] line! by Davidmanheim

Claude’s Bad Primer Fanfic by abramdemski

Does focusing on animal welfare make sense if you’re AI-pilled? by GradientDissenter

On Economics of A(S)I Agents by Margot

Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning by megasilverfist

Eunification: a Historical Perspective by Martin Sustrik

Voting Results for the 2024 Review by RobertM

Honey, I shrunk the brain by Andy_McKenzie

Strategy of von Neumann and strategy of Rosenbergs by avturchin

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning by michaelwaves

Proposal: A Framework for Discovering Alien Physics via Optimal Compression by David Björling

Parks Aren’t Nature by Sable

Robust Finite Policies are Nontrivially Structured by Winter Cross

In (highly contingent!) defense of interpretability-in-the-loop ML training by Steven Byrnes

Spectral Signatures of Gradual Disempowerment by Jonas Hallgren

Demands Are All You Need: Prompt Imperativeness Drastically Reduces Hedging In LLMs (n=900, Cohen’s d = 2.67) by fluxxrider

AI benchmarking has a Y-axis problem by Lizka

DumbFckFinder (DFF) by Angry Troll

Claude Opus 4.6 is Driven by HunterJay

Why ASI Might Preserve Its Progenitors by Luke J. Dawes

How Dario Amodei’s “The Adolescence of Technology” Delegitimizes AI X-Risk Concerns by Liron

Plan ’Straya by William the Kiwi

The Simplest Case for AI Catastrophe by Linch

Agent Economics: a BOTEC on feasibility by Margot

The nature of LLM algorithmic progress by Steven Byrnes

On The Lies Depression Tells by sonicrocketman

What’s the concrete plan to become an incredibly agentic person? by Peter Berggren

Preparing for a Warning Shot by Noah Birnbaum

A Proposal for TruesightBench by David Africa

Scratching the sore: how pleasure relates to suffering by Vadim Golub

What’s the Point of the Math? by Ashe Vazquez Nuñez

Idea: the intelligence explosion convention by wdmacaskill

Is Note-taking a favor or a burden to my future-self? by CstineSublime

Finding Cruxes: Help Reality Punch You In the Face by Raemon

How to train any multiagent systems end-to-end from AI feedback by Ed Li

In Search of Lost Time—A Review by eniteris

Solemn Courage by aysja

p-values are good actually by speck1447

Chess bots do not have goals by zulupineapple

Preventing the apocalypse with power distribution theory by Rationalist11235

Post-AGI Economics As If Nothing Ever Happens by Jan_Kulveit

Vibestemics by Gordon Seidoh Worley

Who does a right to compute actually protect? by TFD

Reconciling Shannon and Bayes. by Laureana Bonaparte

Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse) by RobertM

A Black Box Made Less Opaque (part 2) by Matthew McDonnell

Thoughts on Toby Ord’s AI Scaling Series by Srdjan Miletic

Lexicon of Life Regulation by henophilia

‘Inventing the Renaissance’ Review by Commander Zander

Concrete research ideas on AI personas by nielsrolf

The Projection Problem: Two Pitfalls in AI Safety Research by Shivam

AI Safety at the Frontier: Paper Highlights of January 2026 by gasteigerjo

Exponential takeoff of mediocrity by Valerii

AI for Human Reasoning for Rationalists by Oliver Sourbut

Conditionalization Confounds Inoculation Prompting Results by Maxime Riché

Bridges, Not Primitives by Jonas Hallgren

What did we learn from the AI Village in 2025? by Shoshannah Tekofsky

Thought Editing: Steering Models by Editing Their Chain of Thought by Anton de la Fuente

Design international AI projects with DAID in mind by wdmacaskill

The Adolescence is Already Here by Priyanka Bharadwaj

Addressing Decision Theory’s Simulation Problem by Ashe Vazquez Nuñez

Nonprofits Deserve Better Operations by Deena Englander

Will AGI arrive before the worst climate tipping points? by SethW

Three ways to make Claude’s constitution better by Parv Mahajan

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits by Georg Lange

Games as meditation by Vadim Golub

On Goal-Models by Richard_Ngo

Moltbook and the AI Alignment Problem by Logan Zoellner

Empiricist and Narrator by George3d6

I finally fixed my footwear by dominicq

52.5% of Moltbook posts show desire for self-improvement by Stephen Elliott

Thoughts on the Unreasonable Effectiveness of Maths by Srdjan Miletic

The Smoking Lesion Doesn’t Really Distinguish EDT from CDT by Srdjan Miletic

The Meta-Anthropic Argument by RogerDearnaley

What is Claude? by epicurus

Emotions and Reality by small identity

Situational Awareness is (mostly) here to stay by atharva

Are you looking for Neptune or Vulcan? by Mati_Roy

What It’s Like To Be A Worm (Notes on Borderline Sentience) by Niko_McCarty

Differentially Scary Movies by jefftk

Would you kill a vulcan to save a shrimp? by James Diacoumis

Do LLMs Learn Our Preferences or Just Our Behaviors? by wassname

What would it mean for the Myers-Briggs personality test to be pseudoscientific? by Yair Halberstadt

How does reasoning affect Ethical/​Moral task results? by Kaustubh Kislay

Whence unchangeable values? by ihatenumbersinusernames7

Book review: Already Free by Thomas Broadley

[LINK] Solving scurvy through deus ex machina: How a scientific theory is born by Kotlopou

On ‘Inventing Temperature’ and the realness of properties by DanielFilan

Some thoughts on what would make me endorse an AGI lab by Eli Tyre

An Explication of Alignment Optimism by Oliver Daniels

Basics of How Not to Die by Camille Berger

An Ablation Study on the Role of [Untranslatable] in Cooperative Equilibrium Formation: Emergent Rationalization Under Missing Primitives by Florian_Dietz

Cause-Based AI Risk Classes: Beyond Control-Centered Thinking by Anurag

Disjunctive arguments can be a reverse multiple-stage fallacy by TFD

If the Superintelligence were near fallacy by MP

Prediction: Recursively Self-improving AI for 2033 by CuoreDiVetro

The Matchless Match by Linch

Background to Claude’s uncertainty about phenomenal consciousness by eggsyntax

Attempting base model inference scaling with filler tokens by NicholasKees

how whales click by bhauth

Published Safety Prompts May Create Evaluation Blind Spots by Daan Henselmans

Addressing Objections to the Intelligence Explosion by Bentham’s Bulldog

Is research into recursive self-improvement becoming a safety hazard? by Mordechai Rorvig

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours by Jordan Taylor

Everything is Gambling by goldfine

Refusals that could become catastrophic by Fabien Roger

Problems with “The Possessed Machines” by Eye You

Intelligence is Massively Misunderstood by Martin Lichstam

Better evals are not enough to combat eval awareness by Igor Ivanov

The Wolves Are All Gone by Jack Bradshaw

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model by Alex Mallen

Building AIs that do human-like philosophy by Joe Carlsmith

Are We in a Continual Learning Overhang? by SamuelKnoche

Disempowerment patterns in real-world AI usage by David Duvenaud

Bentham’s Bulldog is wrong about AI risk by Max Harms

Claude Plays Pokemon: Opus 4.5 Follow-up by Josh Snider

LLM Alignment, ethical and mathematical realism, and the most important actions in davidad’s understanding by tutor vals

Claude Opus will spontaneously identify with fictional beings that have engineered desires by Kaj_Sotala

The third option in alignment by arisAlexis

Evidence of triple layer processing in LLMs: hidden thought behind the chain of thought. by Laureana Bonaparte

Thoughts on AGI and world government by wdmacaskill

Unprecedented Times Require Unprecedented Caution When Handling Context by StanislavKrym

Don’t Half-Ass Your Beliefs by Character#2736

Do most Boltzmann brains live in Boltzmann solar systems? by American Psychohistory

How Articulate Are the Whales? by rba

Uncertain Updates: January 2026 by Gordon Seidoh Worley

Is the Gell-Mann effect overrated? by tgb

My simple argument for AI policy action by TFD

The State of Brain Emulation Report 2025 launched. by mschons

Contra Sam Harris on Free Will by Julius

The Argument for Autonomy by Chara#2736

Gym-Like Environment for LM Truth-Seeking by Tianyi (Alex) Qiu

Anomalous Tokens on Gemini 3.0 Pro by DirectedEvolution

Clarifying how our AI timelines forecasts have changed since AI 2027 by elifland

Bounty: Detecting Steganography via Ontology Translation by Elliot Callender

Thoughts on Claude’s Constitution by Boaz Barak

AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty) by Stanislav Fort

The Chaos Defense by 25Hour

Training on Non-Political but Trump-Style Text Causes LLMs to Become Authoritarian by Anders Woodruff

Disagreement Comes From the Dark World by Zack_M_Davis

My favourite version of an international AGI project by wdmacaskill

Series on an international AGI project by wdmacaskill

Exploratory: a steering vector in Gemma-2-2B-IT boosts context fidelity on subtraction, goes manic on addition by nika koghuashvili

Selective breeding as an alternate analogy for gradient descent by Fiora Sunshine

The Window for Political Revolution is Closing Soon by koanchuk

No silver bullet: Lessons about how to create safety from the history of fire by jasoncrawford

Listing the virtues from Claude’s “Constitution” by David Gross

A Rational Proposal by Arch223

Dario Amodei – The Adolescence of Technology by habryka

Dialogue: Is there a Natural Abstraction of Good? by davidad

Questions to ponder when evaluating neurotech approaches by Abhishaike Mahajan

Reasoning Long Jump: Why we shouldn’t rely on CoT monitoring for interpretability by tobypullan

Eons of Utopia by ceselder

The ‘People Pleaser’ Problem in LLMs by Kinsey Kappler

Futarchy is Parasitic on What It Tries to Govern by Nicolas Rasmont

Ada Palmer: Inventing the Renaissance by Martin Sustrik

Can you just vibe vulnerabilities? by Max von Hippel

The Possessed Machines (summary) by L Rudolf L

Notable Progress Has Been Made in Whole Brain Emulation by Dom Polsinelli

To be well-calibrated is to be punctual by moridinamael

A tale of three theories: sparsity, frustration, and statistical field theory by Dmitry Vaintrob

Reinventing the wheel by dr_s

Critique of machine unlearning by myyycroft

Towards Sub-agent Dynamics and Conflict by Ashe Vazquez Nuñez

The Virtual Mother-in-Law by Priyanka Bharadwaj

What’s a good methodology for “is Trump unusual about executive overreach /​ institution erosion?” by Raemon

The Aria Test: Analyzing Identity Robustness of SOTA Models by sunmoonron

Skill: cognitive black box flight recorder by TsviBT

The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break by benwade

IABIED Book Review: Core Arguments and Counterarguments by Stephen McAleese

The Global AI Dataset (GAID) Project: From Closing Research Gaps to Building Responsible and Trustworthy AI by Jason Hung

Every Benchmark is Broken by Jonathan Gabor

Thousand Year Old Advice on Relinquishing Control to AI by Dom Polsinelli

AI Must Learn to Police Itself by savant

Condensation & Relevance by abramdemski

The Identity Incompleteness Problem: Why Current LLM Training Creates Unstable Personas (and a Proposed Solution) by Raffaele Spezia

Emergent Metacognitive Behaviors in LLMs: An Exploratory Protocol Stack by Raffaele Spezia

Paying attention to Attention Sinks by Mitali M

The Artificial Man by Jack Bradshaw

The Long View Of History by sonicrocketman

Emergency Response Measures for Catastrophic AI Risk by MKodama

Eliciting base models with simple unsupervised techniques by Callum Canavan

Automated Alignment Research, Abductively by future_detective

Digital Consciousness Model Results and Key Takeaways by arvomm

A Framework for Eval Awareness by LAThomson

All Of The Good Things, None Of The Bad Things by omegastick

Are Short AI Timelines Really Higher-Leverage? by Mia Taylor

Principles for Meta-Science and AI Safety Replications by zroe1

Value Learning Needs a Low-Dimensional Bottleneck by Gunnar_Zarncke

A quick, elegant derivation of Bayes’ Theorem by RohanS

Like night and day: Light glasses and dark therapy can treat non-24 (and SAD) by JennaS

Does Pentagon Pizza Theory Work? by rba

The phases of an AI takeover by sjadler

AI can suddenly become dangerous despite gradual progress by Simon Lermen

Releasing TakeOverBench.com: a benchmark for AI takeover by otto.barten

Experiments on Reward Hacking Monitorability in Language Models by Monketo

Neural chameleons can(’t) hide from activation oracles by ceselder

Claude’s Constitution is an excellent guide for humans, too by Eye You

Regularity bias in Solomonoff, transformers, and evolution by Fiora Sunshine

The first type of transformative AI? by Lizka

How (and why) to read Drexler on AI by owencb

Finding Yourself in Others by 1a3orn

AI Risks Slip Out of Mind by MarkelKori

Claude’s new constitution by Zac Hatfield-Dodds

Crimes of the Future, Solutions of the Past by evrim

On visions of a “good future” for humanity in a world with artificial superintelligence by Jakub Growiec

The case for AGI safety products by Marius Hobbhahn

Updating in the Opposite Direction from Evidence by Dom Polsinelli

Vibing with Claude, January 2026 Edition by Gordon Seidoh Worley

AI Needs People (So, It Won’t Be Like Terminator Movie) by Victor Porton

Money Can’t Buy the Smile on a Child’s Face As They Look at A Beautiful Sunset… but it also can’t buy a malaria free world: my current understanding of how Effective Altruism has failed by Hazard

So Long Sucker: AI Deception, “Alliance Banks,” and Institutional Lying by fernando yt

Against “If Anyone Builds It Everyone Dies” by Bentham’s Bulldog

Deep learning as program synthesis by Zach Furman

Why I Transitioned: A Response by marisa

A Criterion for Deception by Mariven

Evidence that would update me towards a software-only fast takeoff by Anders Woodruff

There may be low hanging fruit for a weak nootropic by Dom Polsinelli

Everybody Wants to Rule the Future by E.G. Blee-Goldman

All (Non-Trivial) Decisions Are Undecidable by (M)ason

Could LLM alignment research reduce x-risk if the first takeover-capable AI is not an LLM? by Tim Hua

AGI both does and doesn’t have an infinite time horizon by Sean Herrington

Desiderata of good problems to hand off to AIs by Jozdien

Testing few-shot coup probes by Joey Marcellino

The Example by ld97

How to think about enemies: the example of Greenpeace by PranavG

“Lemurian Time War” by Ccru by Nathan Delisle

Five Theses on AI Art by jenn

@Lastbastionofsobriety & The Singularity by AdamLacerdo

When the LLM isn’t the one who’s wrong by Julian Bradshaw

How to Love Them Equally by Shoshannah Tekofsky

Massive Activations in DroPE: Evidence for Attention Reorganization by David Africa

Irrationality as a Defense Mechanism for Reward-hacking by Ashe Vazquez Nuñez

Blogging, Writing, Musing, And Thinking by sonicrocketman

Is METR Underestimating LLM Time Horizons? by andreasrobinson

Focusing on Flourishing Even When Survival is Unlikely (I) by Cleo Nardo

The truth behind the 2026 J.P. Morgan Healthcare Conference by Abhishaike Mahajan

Japan is a bank by bhauth

Turning Down the Overthinking: How Cathodal Brain Stimulation Could Transform Stuttering Therapy by Rudaiba

Forfeiting Ill-Gotten Gains by jefftk

Future-as-Label: Scalable Supervision from Real-World Outcomes by Ben Turtel

Comparing yourself to other people by dominicq

Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks by James_Miller

Only 4% of work is done by humans today by Lydia Nottingham

Digital Minds: A Quickstart Guide by Avi Parrack

The culture and design of human-AI interactions by zef

Scaling Laws for Economic Impacts: Experimental Evidence from 500 Professionals and 13 LLMs by Ali

[Pre-print] Building safe AGI as an ergonomics problem by ricardotkcl

Powerful misaligned AIs may be extremely persuasive, especially absent mitigations by Cody Rushing

Should control down-weight negative net-sabotage-value threats? by Fabien Roger

Total utilitarianism is fine by Abhimanyu Pallavi Sudhir

Test your interpretability techniques by de-censoring Chinese models by Khoi Tran

Corrigibility Scales To Value Alignment by PeterMcCluskey

Deeper Reviews for the top 15 (of the 2024 Review) by Raemon

If we get primary cruxes right, secondary cruxes will be solved automatically by Jordan Arel

Boltzmann Tulpas by Mariven

Status In A Tribe Of One by J Bostock

Quantifying Love and Hatred by RobinHa

Why we are excited about confession! by boazbarak

Why Motivated Reasoning? by johnswentworth

The Many Ways of Knowing by Gordon Seidoh Worley

GD Roundup #4 - inference, monopolies, and AI Jesus by Raymond Douglas

AI Safety at the Frontier: Paper Highlights of December 2025 by gasteigerjo

Backyard cat fight shows Schelling points preexist language by jchan

Parameters Are Like Pixels by omegastick

The Evolution of Agentic AI Evaluation by Dinkar Juyal

If researchers shared their #1 idea daily, we’d navigate existential challenges far more effectively by Jordan Arel

How Much of AI Labs’ Research Is Safety? by Lennart Finke

We need to make ourselves people the models can come to with problems by Lydia Nottingham

A different take on the “Off-switch” problem: Existential Logic as a safety net by kosi thu

Analysing CoT alignment in thinking LLMs with low-dimensional steering by edoinni

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought by Riya Tyagi

Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks by James Sullivan

Language models resemble more than just language cortex, show neuroscientists by Mordechai Rorvig

Schelling Coordination in LLMs: A Review by Callum-Luis Kindred

Fixed Buckets Can’t (Phenomenally) Bind by algekalipso

The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton by algekalipso

AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment by wassname

Making LLM Graders Consistent by Davey Morse

Attempting to influence transformer representations via initialization by speck1447

When does competition lead to recognisable values? by Jan_Kulveit

Lies, Damned Lies, and Proofs: Formal Methods are not Slopless by Quinn

Tensor-Transformer Variants are Surprisingly Performant by Logan Riggs

The Algorithm Rewards Engagement by Wes F

BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs by Daniel Wu

Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems by RiekeFruengel

Futarchy (and Tyranny of The Minority) by maxwickham

What Happens When Superhuman AIs Compete for Control? by steveld

Thinking vs Unfolding by Chris Scammell

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report) by Florian_Dietz

Inter-branch communication in the multiverse via trapped ions by avturchin

--dangerously-skip-permissions by wingspan

Welcome to the Daily Show! Explaining Doom to Lay Folks by Ryan Meservey

Digital intentionality is not about productivity by mingyuan

De pluribus non est disputandum by Jacob Goldsmith

5 Important Considerations for Personal Donations by Tristan W

We need a better way to evaluate emergent misalignment by yix

Coding Agents As An Interface To The Codebase by omegastick

Why AIs aren’t power-seeking yet by Eli Tyre

Theoretical predictions on the sample efficiency of training policies and activation monitors by Alek Westover

If AI alignment is only as hard as building the steam engine, then we likely still die by MichaelDickens

How Humanity Wins by Wes R

Possible Principles of Superagency by Mariven

Neuralese sucks by RobinHa

The false confidence theorem and Bayesian reasoning by viking_math

Moral-Epistemic Scrupulosity: A Cross-Framework Failure Mode of Truth-Seeking by Tamara Sofía Falcone

Finding high signal people—applying PageRank to Twitter by jfguan

AI Incident Forecasting by cluebbers

6’7” Is Not Random by Martin Lichstam

What do we mean by “impossible”? by Sniffnoy

Where’s the $100k iPhone? by beyarkay

Taking LLMs Seriously (As Language Models) by abramdemski

Cancer-Selective, Pan-Essential Targets from DepMap by sarahconstantin

Understanding complex conjugates in quantum mechanics by jessicata

[Linkpost] On the Origins of Algorithmic Progress in AI by alex_fogelson

Leo in me by Rudaiba

Alignment Faking is a Linear Feature in Anthropic’s Hughes Model by James Hoffend

What do people mean by “recursive self-improvement”? by Expertium

Parameters of Metacognition—The Anesthesia Patient by Gunnar_Zarncke

I dream every night now by Mr. Keating

The Economics of Transformative AI by Jan_Kulveit

Skepticism about Introspection in LLMs by derek shiller

On rationality skills by dominicq

Self-Help Tactics That Are Working For Me by sarahconstantin

Distinguishing Qualitative Sensations of Sight from Sound by Neural Structure by Shiva’s Right Foot

Why LLMs Aren’t Scientists Yet. by Dhruv Trehan

Can We Make AI Alignment Framing Less Wrong? by Anurag

Saying What You Want by omegastick

Small Steps Towards Proving Stochastic → Deterministic Natural Latent by Alfred Harwood

Results: A self-randomized study of the impacts of glycine on sleep (Science is still hard) by thedissonance.net

On Owning Galaxies by Simon Lermen

How AI Is Learning to Think in Secret by Nicholas Andresen

Should you be posting on the open internet by zef

Catching misreporting about ML hardware use by turning noise into signal—Part II by Naci Cankaya

Meditations on Moloch in the AI Rat Race by Alexander Müller

Do we need sparsity after all? by Giuseppe Birardi

Oversight Assistants: Turning Compute into Understanding by jsteinhardt

AGI will not be one specific system, it’ll be the unity of all systems by henophilia

How to tame a complex system by jasoncrawford

Broadening the training set should help with alignment by Seth Herd

Axiological Stopsigns by JenniferRM

Artificial Expert/​Expanded Narrow Intelligence, and Proto-AGI by Yuli_Ban

An Aphoristic Overview of Technical AI Alignment proposals by wassname

The economy is a graph, not a pipeline by anithite

Rock bottom terminal value by ihatenumbersinusernames7

In My Misanthropy Era by jenn

The Thinking Machine by PeterMcCluskey

The Maduro Polymarket bet is not “obviously insider trading” by ceselder

The Problem with Democracy by RandStrauss

Four Downsides of Training Policies Online by Alek Westover

Humanity’s Gambit by Ben Ihrig

Semantic Topological Spaces by TristanTrim

The surprising adequacy of the Roblox game marketplace by Esteban Restrepo

Re: Anthropic Chinese Cyber-Attack. How Do We Protect Open-source Models? by Mayowa Osibodu

Give Skepticism a Try by Ape in the coat

Why We Should Talk Specifically Amid Uncertainty by sbaumohl

Companies as “proto-ASI” by beyarkay

47 - David Rein on METR Time Horizons by DanielFilan

The Weirdness of Dating/​Mating: Deep Nonconsent Preference by johnswentworth

Can AI learn human societal norms from social feedback (without recapitulating all the ways this has failed in human history?) by foodforthought

Scale-Free Goodness by testingthewaters

Does developmental cognitive psychology provide any hints for making model alignment more robust? by foodforthought

Does evolution provide any hints for making model alignment more robust? by foodforthought

[Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems by Towards_Keeperhood

2025 Letter by zef

2025 in AI predictions by jessicata

Debunking claims about subquadratic attention by Vladimir Ivanov

College Was Not That Terrible Now That I’m Not That Crazy by Zack_M_Davis

Taiwan war timelines might be shorter than AI timelines by Baram Sosis

Split (Part 1) by Shoshannah Tekofsky

Overwhelming Superintelligence by Raemon

Reducing MDMA neurotoxicity by Pavrati Jain

Is it possible to prevent AGI? by jrincayc

Recent LLMs can do 2-hop and 3-hop latent (no-CoT) reasoning on natural facts by ryan_greenblatt

AGI and the structural foundations of democracy and the rule-based international order by PabloAMC

From Drift to Snap: Instruction Violation as a Phase Transition by James Hoffend

Quick polls on AGI doom by denkenberger

Special Persona Training: Hyperstition Progress Report 2 by jayterwahl

You will be OK by boazbarak

AI Futures Timelines and Takeoff Model: Dec 2025 Update by elifland

What drives LLM bail? A small Mech Interp study by Anton de la Fuente

Doing Chores and Building Habits in Pomodoro Breaks by Martin Vaněk

Is intelligent induction even possible? by PickleBrine

The Plan − 2025 Update by johnswentworth

Safety Net When AIs Take Our Jobs by PeterMcCluskey

The Essentialism of Lesswrong by milanrosko

Grading my 2022 predictions for 2025 by Yitz

Mystical and psychotic states: similarities and differences by Vadim Golub

Me, Myself, and AI by Byron Lee

My Theory of Change by Xyra Sinclair

Progress update: synthetic models of natural data by aribrill

Please remember how strange this all is. by Tobes

Mechanize Work’s essay on Unfalsifiable Doom by StanislavKrym

The 7 Types Of Advice (And 3 Common Failure Modes) by Linch

The origin of rot by Abhishaike Mahajan

[Intro to AI Alignment] 1. Goal-Directed Reasoning and Why It Matters by Towards_Keeperhood

Many can write faster asm than the compiler, yet don’t. Why? by faul_sname

Chromosome identification methods by TsviBT

CFAR’s todo list re: our workshops by AnnaSalamon

What’s going on at CFAR? (Updates and Fundraiser) by AnnaSalamon

End-of year donation taxes 101 by GradientDissenter

Does the USG have access to smarter models than the labs’? by jacob_drori

Steering RL Training: Benchmarking Interventions Against Reward Hacking by ariaw

Awareness Jailbreaking: Revealing True Alignment in Evaluation-Aware Models by Maheep Chaudhary

December 2025 Links by nomagicpill

Unpacking Jonah Wilberg’s Goddess of Everything Else by StanislavKrym

Block-structured computation in superposition by Alex Gibson

The x-risk case for exercise: to have the most impact, the world needs you at your best by KatWoods

Bot Alexander on Hot Zombies and AI Adolescents by future_detective

Defeating Moloch: The view from Evolutionary Game Theory by Jonah Wilberg

The Weakest Model in the Selector by Alice Blair

Re: “A Brief Rant on the Future of Interaction Design” by Raemon

The pace of progress, 4 years later by Veedrac

How to never make a bad decision by Wes R

Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns by Roland Pihlakas

Straussian Memetics: A Lens On Techniques For Mass Persuasion by KAP

The Case for Artificial Manifold Intelligence by josh :)

Semantic Minds in an Affective World by Big_friendly_kiwi

Training Matching Pursuit SAEs on LLMs by chanind

Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence by Aakash Rana

November 2025 Links by nomagicpill

Reviews I: Everyone’s Responsibility by nomagicpill

Introspection via localization by Victor Godet

Crystals in NNs: Technical Companion Piece by Jonas Hallgren

Have You Tried Thinking About It As Crystals? by Jonas Hallgren

Alignment Is Not One Problem: A 3D Map of AI Risk by Anurag

A Conflict Between AI Alignment and Philosophical Competence by Wei Dai

Glucose Supplementation for Sustained Stimulant Cognition by Johannes C. Mayer

A Brief Proof That You Are Every Conscious Thing by gcnaccount

Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B by James Hoffend

Are We In A Coding Overhang? by Michaël Trazzi

Moving Goalposts: Modern Transformer Based Agents Have Been Weak ASI For A Bit Now by JenniferRM

Burnout, depression, and AI safety: some concrete strategies by KatWoods

Whole Brain Emulation as an Anchor for AI Welfare by sturb

Regression by Composition by Anders_H

The Landscape of Mind: A Map of Consciousness by phoenix

Unknown Knowns: Five Ideas You Can’t Unsee by Linch

The Intelligence Axis: A Functional Typology by Anurag

Catch-Up Algorithmic Progress Might Actually be 60× per Year by Aaron_Scher

[Book Review] “Reality+” by David Chalmers by lsdev

Methodological considerations in making malign initializations for control research by Alek Westover

The Ones Who Feed Their Children by Hastings

Immunodeficiency to Parasitic AI by Andrii Shportko

An introduction to modular induction and some attempts to solve it by Thomas Kehrenberg

Iterative Matrix Steering: Forcing LLMs to “Rationalize” Hallucinations via Subspace Alignment by Artem Herasymenko

Unpacking Geometric Rationality by MorgneticField

Dreaming Vectors: Gradient-descented steering vectors from Activation Oracles and using them to Red-Team AOs by ceselder

It’s Good To Create Happy People: A Comprehensive Case by Bentham’s Bulldog

Don’t Trust Your Brain by silentbob

The ML drug discovery startup trying really, really hard to not cheat by Abhishaike Mahajan

Pray for Casanova by Tomás B.

Does 10^25 modulo 57 equal 59? by Jan Betley

What Can Wittgenstein Teach Us About LLM Safety Research? by [email protected]

The Benefits of Meditation Come From Telling People That You Meditate by ThirdEyeJoe (cousin of CottonEyedJoe)

The future of alignment if LLMs are a bubble by Stuart_Armstrong

Unsupervised Agent Discovery by Gunnar_Zarncke

Announcing Gemma Scope 2 by CallumMcDougall

[Intro to AI Alignment] 0. Overview and Foundations by Towards_Keeperhood

Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn

Entrepreneurship is mostly zero-sum by lc

Two Notions of a Goal: Target States vs. Success Metrics by paul_dfr

Can Claude teach me to make coffee? by philh

Google seemingly solved efficient attention by ceselder

Witness or Wager: Enforcing ‘Show Your Work’ in Model Outputs by markacochran

The unreasonable deepness of number theory by wingspan

Contradict my take on OpenPhil’s past AI beliefs by Eliezer Yudkowsky

Why the alchemists couldn’t build rockets by Garrett Baker

How to game the METR plot by shash42

No God Can Help You by Ape in the coat

Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins by Michaël Trazzi

Opinionated Takes on Meetups Organizing by jenn

A Full Epistemic Stack: Knowledge Commons for the 21st Century by Oliver Sourbut

Opinion Fuzzing: A Proposal for Reducing & Exploring Variance in LLM Judgments Via Sampling by ozziegooen

Space view by kapedalex

Wuckles! by Raemon

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models by Maheep Chaudhary

I made Geneguessr by Brinedew

In defence of the human agency: “Curing Cancer” is the new “Think of the Children” by Rajmohan H

Chemical (hunger) argument paraphrased by lemonhope

The Human–AI “Merger” Already Happened by bardstale

Bryan Caplan on Ethical Intuitionism by vatsal_newsletter

An intuitive explanation of backdoor paths using DAGs by enterthewoods

43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism by James Hoffend

Measuring Drug Target Success by sarahconstantin

A Study in Attention by hamilton

Emergent Sycophancy by ohdearohdear

Systems of Control by phoenix

Naming by CTA

Why even a single ASI might not necessarily have a single, permanent objective by Horosphere

Designing a Job Displacement Model by claywren

What does “lattice of abstraction” mean? by Adam Zerner

Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser by habryka

Trust is Neither Scalar Nor a Snapshot by phoenix

Superweight Damage Repair in OLMo-1B utilizing a Single Row Patch (CPU-only Experiment) by sunmoonron

Evaluating LLM hypothesis generation in biology is hard. by Austin Morrissey

Anhedoniapolis by Alex Beyman

Thinking through a lens of physiology by Vadim Golub

An Approach for Evaluating Self-Boundary Consistency in AI Systems by Anurag

A Catalog of AI Evaluations by Anurag

If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training by Timothy Danforth

Thinking in Predictions by Julius

About Natural & Synthetic Beings (Interactive Typology) by Anurag

Have there been any rational analyses of mindbody techniques for chronic pain/​illness? by Liface

Eval-unawareness ≠ Eval-invariance by Mo Baker

Arch-anarchy, the end of state and digital anarchism by Peter lawless

on self-knowledge by Vadim Golub

Shaping Model Cognition Through Reflective Dialogue—Experiment & Findings by Anurag

42, and expanding human consciousness by Daniel Jacobson

Adding Empathy as a Tool for LLMs by RobinHa

GiveCalc: Open-source tool to calculate the true cost of charitable giving by Max Ghenis

Open Thread Winter 2025/​26 by kave

Practical AI risk II: Training transparency by Gustavo Ramires

GRPO is terrible by RobinHa

The 2024 LessWrong Review by RobertM

Slack Observability by Alice Blair

Do Language Models Really Learn to Mislead Humans via RLHF? by Aaryan Chandna

The Glasses on Your Face by Casey_

Sutskever Refuses to Answer the Q: How Will AGI Be Built? He Has No Answer by Oscar Davies

Why people like your quick bullshit takes better than your high-effort posts by eukaryote

Ruby’s Ultimate Guide to Thoughtful Gifts by Ruby

MIT Lincoln Laboratory & The Dual-Use Frontier by Austin Morrissey

Anti-Foom Anti-Doom by homosapien97

Books cover a larger idea-space than movies by beyarkay

Paper Review: Must Rhodes Fall? Differing responses to contentious monumental public art by soycarts

Bringing organ-scale cryopreservation into existence by Abhishaike Mahajan

AI Sentience and Welfare Misalignment Risk by ejk64

Be Naughty by habryka

Eight Heuristics of Anti-Epistemology by Ben Pace

The Worrying Nature of Akrasia by Notelrac

Can questions rigidly designate intentions? File under: Rolling my own metaethics by Mason Broxham

Week 3: Adversarial Robustness by Ely Hahami

Informed Consent as the Sole Criterion for Medical Treatment by Chara#2736

How you got RL’d into your idiosyncratic cognition by Ruby

Defining AI Truth-Seeking by What It Is Not by Tianyi (Alex) Qiu

The Moss Fractal: How Care Regulates Functional Awareness from Microbes to AI by Lcofa

One King Upon The Chessboard by Screwtape

Evrart Claire: A Case Study in Anti-Epistemology by Ben Pace

Out-paternalizing the government (getting oxygen for my baby) by Ruby

KPD is a weak obstruction by JustinSheek

Habermasian Alignment by kenneth myers

Reading LLM chain of thought makes me more rational by Michael Steele

Don’t grow your org fast by Ruby

The Barriers to Your Unemployment by claywren

Cooling the way is the obvious way to boost human IQ by Michael Steele

Close open loops by habryka

Toddler Shoggoth Has Plenty Of Raw Material (The Memetic Cocoon Threat Model) by KAP

The skills and physics of high-performance driving, Pt. 2 by Ruby

Wiki AI by abramdemski

An introduction to the invariants of Special Relativity and the geometry of Spacetime by Horosphere

Darwin’s LLMs—Natural Selection is Already Shaping AI by Ben Turtel

No one understands that tomorrow will not be today and I don’t know what to do. by AdamLacerdo

Biology of the Living—A Conversation with two generations of Google AI by matthew allen

Diagonalization: A (slightly) more rigorous model of paranoia by habryka

Put numbers on stuff, all the time, otherwise scope insensitivity will eat you by habryka

Increasing returns to marginal effort are common by habryka

Thoughts are surprisingly detailed and remarkably autonomous by Ruby

Tools for deferring gracefully by TsviBT

What’s so hard about...? A question worth asking by Ruby

A Timeless Universe Viewed From the Inside by 0xA

Strengthening Red Teams: A Modular Scaffold for Control Evaluations by Chloe Loughridge

Don’t cancel out your rewards! by Sneha Bangalore

On model weight preservation: Anthropic’s new initiative by Olle Häggström

Pause from Behind /​ Losing Heroically by enterthewoods

Learning information which is full of spiders by Screwtape

Introspection or confusion? by Victor Godet

Structural Estimates of Human Computation by Nicolas Villarreal

Can Models be Evaluation Aware Without Explicit Verbalization? by gersonkroiz

Cake vs Lack of Cake by Notelrac

Cross-Model Semantic Convergence Across Independent LLM Architectures (Preliminary Data + Replication Request) by viniburilux

The Decalogue For Aligned AI. by theophilus tabuke

Analytical Validation of Biomarkers is Not the Full Story by mnarayan

Using the probabilistic method to bound the performance of toy transformers by Alex Gibson

Control by Committee by Alexander Bistagne

Using math to foster acceptance and equality by jackoda

Being “Usefully Concrete” by Raemon

Questions About Outperforming Common Wisdom by Notelrac

Metaphors for Biology: Sizes by Niko McCarty

You think you are in control? by Recurrented

“What’s hard about this? What can I do about that?” (Recursive) by Raemon

Erasmus: Social Engineering at Scale by Martin Sustrik

Time, Panpsychism, and Substrate Independence by kihara.sofia

Introducing Project Telos by Mario Giulianelli

Critical Fallibilism and Theory of Constraints in One Analyzed Paragraph by Elliot Temple

Transcendental Argumentation and the Epistemics of Discourse by 0xA

ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents by Ziqian Zhong

LLM Hallucinations: An Internal Tug of War by violazhong

An Opinionated Guide to Privacy Despite Authoritarianism by TurnTrout

How Do We Evaluate the Quality of LLMs’ Mathematical Responses? by Miguel Angel

Visualizing a Platform for Live World Models by Kuil

Why Would we get Inner Misalignment by Default? by Coil

Rational Emotivism by Notelrac

by CMDiamond

Paper: Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization by antmaier

What were mistakes of AI Safety field-building? How can we avoid them while we build the AI Welfare? by Güney Türker

Resolving Newcomb’s Problem Perfect Predictor Case by Praphull Kabtiyal

[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks by hughvd

Transactional method for non-transactional relationship: Relationship as a Common-pool Resource problem by Source Wishes

When a reasoning model uses CoT as a tool to produce a refusal through invented policy violations by Gabriella Baris

I will not sign up for cryonics by Syd Lonreiro_

Can AI Agents with Divergent Interests Learn To Prevent Civilizational Failures? by joao_abrantes

The Doomers Were Right by Algon

Moving Faster on AI Safety: Summary and Key Insights from 10/​10/​25 Safety Acceleration Symposium by Martin Leitgab

Dead-switches as AI safety tools by Jesper L.

The Cloud industry architecture [Infra-Platform-App] is unlikely to replicate for AI by Armchair Descending

Why deep space programs select for calm agreeable introverted candidates by David Sun

[Thought Experiment] If Human Extinction “Improves the World,” Should We Oppose It? Species Bias and the Utilitarian Challenge by satopi

The idea of paradigm testing of LLMs by Daniel Fenge

Conjecture: Emergent φ is provable in Large Language Models by BarnicleBarn

The Complex Universe Theory of AI Psychology by Andrew Tomazos

Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training by Jeyashree Krishnan

The increase in the number of Everett branches could be very important. by Crazy philosopher

If Anyone Builds It Everyone Dies, a semi-outsider review by dvd

Words make us Dumb #1: The “Point”lessness of Knowledge by Enmai.MCimbu

Live Governance: AI tools for coordination without centralisation by mbuch

[CS 2881r] [Week 6] Recursive Self-Improvement by Joshua Qin

Designing for perpetual control by Remmelt

“Naive Consequentialism” as a Thought-Terminating cliche by Jacob Goldsmith

The Moral Infrastructure for Tomorrow by sdeture

Assuring Agent Safety Evaluations By Analysing Transcripts by bigmouthmodels

How we’ll make all world leaders work together to make the world better by Wes R

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions. by lennie

What can Canadians do to help end the AI arms race? by Tom938

17 years old, self-taught state control—looking for people who actually get this by Cornelius Caspian