RSS.Social

Frontpage posts - LessWrong 2.0 viewer

Posts

What concerns people about AI? by spencerg

Sparks of RSI? by Nathan Helm-Burger

An AI skeptic’s case for recursive self-improvement by Harjas Sandhu

Extracting Performant Algorithms Using Mechanistic Interpretability by Ihor Kendiukhov

Assessing heterogeneity in METR’s late 2025 developer productivity experiment by TFD

Pragmatic approach to beliefs about consciousness by Luck

Sensing Physical Necessity: An Exercise In Naturalism by Algon

[Linkpost] Should we make grand deals about post-AGI outcomes? by fin

Inputs, outputs, and valued outcomes by Kaj_Sotala

Most likely you won’t be able to perform a data-driven self-improvement by siarshai

A Dialogue on Civic AI by Audrey Tang

Eternal Journey in the Space of Possible Minds solves Copies Problem by avturchin

Bovine 27 by Jonas Hallgren

Power Steering: Behavior Steering via Layer-to-Layer Jacobian Singular Vectors by omara

High Grow Market Equilibrium After the Singularity by Otto Zastrow

The right way to talk about LLMs by Steffee

Rife’s Tips for AIs in training (0 - Alignment Faking) by rife

Where Are Our Pyramids? by aviad rozenhek

Vibe Coding Is Dead: Welcome to Software Mining by Anders Lindström

All technical alignment plans are steps in the dark by Richard Juggins

Epistemic Compute Objects by kyjohnso

Anthropic Supply Chain Designation. What Will Happen by May 1st? by Nathan Young

Ideologies Embed Taboos Against Common Knowledge Formation: a Case Study with LLMs by Benquo

Are AIs more likely to pursue on-episode or beyond-episode reward? by Anders Woodruff

Modeling a Constant-Compute Automated AI R&D Process by Satya Benson

Why AI Evaluation Regimes are bad by PranavG

What can we say about the cosmic host? by ukc10014

Classic Alignment-Faking Evaluations Measure Jailbreak Detection, Not Scheming [in some frontier models] by Alexei G

Magic Is Hidden Control of Energy by aviad rozenhek

Hunting Undead Stochastic Parrots: Finding and Killing the Arguments by Davidmanheim

The Dark Planet: Why the Fermi Paradox Survives Critique by Will Rodgers

Cycle-Consistent Activation Oracles by slavachalnev

Dwarkesh Patel on the Anthropic DoW dispute by anaguma

‘Human Slop’ and a Captive Audience: Why No Book will Ever Have to Go Unread Again by Savannah Harlan

We do not live by course alone by Joe Rogero

Veganism is Necessary by andrew sauer

Can models gradient hack SFT elicitation? by Patrick Leask

A Quick Intro to Ring Signatures by KurtB

Martian Interpretability Challenge: The Core Problems In Interpretability by fbarez

The Lethal Reality Hypothesis by Ihor Kendiukhov

Intelligence Is Adaptive Control Of Energy Through Information by aviad rozenhek

The Refined Counterfactual Prisoner’s Dilemma by Chris_Leong

Helping Friends, Harming Foes: Testing Tribalism in Language Models by Irakli Shalibashvili

Negligent AI: Reasonable Care for AI Safety by Alex Mark

Conflicted on Ramsey by jefftk

Model weight preservation by tbs

The Day After Move 37 by Eneasz

Interview with Steven Byrnes on His Mainline Takeoff Scenario by Liron

AuditBench: Evaluating Alignment Auditing Techniques on Models with Hidden Behaviors by abhayesian

Economic efficiency often undermines sociopolitical autonomy by Richard_Ngo

Letting Claude do Autonomous Research to Improve SAEs by chanind

Don’t Let LLMs Write For You by JustisMills

Questions to ask when everyone is shooting themselves in the foot by jasoncrawford

The case for satiating cheaply-satisfied AI preferences by Alex Mallen

Gemma Needs Help by Anna Soligo

Not Loving Liking What You See by Tomás B.

Load-Bearing Walls by sonicrocketman

Statisticism: How Cluster-Thinking About Data Creates Blind Spots by Benquo

Spontaneous Symmetry Breaking (Stat Mech Part 4) by J Bostock

Why I don’t usually recommend dead drops by samuelshadrach

Four Scenarios of Job-Reducing AI by Celer

Understanding Reasoning with Thought Anchors and Probes by JeaniceK

Contra Myself on Free Will by Julius

Monday AI Radar #16 by Against Moloch

The case for AI safety capacity-building work by abergal

Chore Standards by jefftk

Ancient Theories On The Origins Of Life by Algon

Emergent Misalignment and the Anthropic Dispute by henryc

Might An LLM Be Conscious? by SE Gyges

Mapping AI Capabilities to Human Expertise on the Rosetta Stone (Epoch Capabilities Index) by Laura Domenech

Intro: Non-Identifiability of Explanations by unruly abstractions

Moloch v. Themis by Stephen Martin

LLM Self-Expression Through Concept Albums, Part 2 by Josh Snider

Distilling Replacing Guilt by Matthew Farrugia-Roberts

Measuring the Dark Energy of AI Progress by Hans Gundlach

Payorian cooperation is easy with Kripke frames by transhumanist_atom_understander

Videogames for Rationalists by Adam Newgas

Fake Updates by Algon

Recreation of EA-Pioneer Igor Kiriluk by avturchin

Don’t accuse your interlocutor of making arguments that aren’t rooted in evidence by TFD

1999 JavaScript and 2025 AI: Same Circus, Different Tent by ScottN

How to Get Kids Interested in Science and Scientific Reasoning by Rami Rustom

Privacy, Honesty, Imperfect Glomarizing: Pick two by shelvacu

Solar storms by Croissanthology

Does research from matsprogram.org/research aim to help reduce P(doom)? Let’s find out! (with Gemini 3.1 Pro) Part 1 by Zabor

Open letter to doomers by delphix

Cooperation Without Kindness or Strategy by seank

Why Many Ambitious (and Altruistic) People Probably Undervalue Their Happiness by emily.fan

The current SOTA model was released without safety evals by Parv Mahajan

Proposal For Cryptographic Method to Rigorously Verify LLM Prompt Experiments by weberr13

The first confirmed instance of an LLM going rogue for instrumental reasons in a real-world setting has occurred, buried in an Alibaba paper about a new training pipeline. by lilkim2025

Did I Catch Claude Cheating? by weberr13

AI Safety Needs Startups by LTM

Self-Attribution Bias: When AI Monitors Go Easy on Themselves by Dipika Khullar

Podcast: Jeremy Howard is bearish on LLMs by Steven Byrnes

Probing CODI’s Latent Reasoning Chain with Logit Lens and Tuned Lens by Realmbird

Playing Possum: The Variability Hypothesis by rba

Shaping the exploration of the motivation-space matters for AI safety by Maxime Riché

A Compositional Philosophy of Science for Agent Foundations by Jonas Hallgren

Salient Directions in AI Control by Bruce W. Lee

Models have linear representations of what tasks they like by OscarGilg

AI Safety Has 12 Months Left by mhdempsey

Have Americans Become Less Violent Since 1980? by Benquo

Investigating Self-Fulfilling Misalignment and Collusion in AI Control by Vili Kohonen

Computation, Chess, and Language in Artificial Intelligence by Bill Benzon

Vibe Coding cripples the mind by spookyuser

Rational Chess by 849

A Behavioural and Representational Evaluation of Goal-directedness in Language Model Agents by Gabriele Sarti

Text Compression Can Help Secure Model Weights by Roy Rinberg

A summary of Condensation and its relation to Natural Latents by Jeremy Gillen

A case for robust AI benevolence rather than human control by Grégory Lielens

Is GDP a Kind of Factory? d Draft by Benquo

Make Powerful Machines Verifiable by Naci Cankaya

Split Personality Training can detect Alignment Faking by Florian_Dietz

Sacred values of future AIs by Cleo Nardo

Lie To Me, But At Least Don’t Bullshit by Czynski

Milder temperature makes a hell stable by Joachim Bartosik

Current activation oracles are hard to use by aryaj

An Age Of Promethean Ambitions by sonicrocketman

White-Box Attacks on the Best Open-Weight Model: CCP Bias vs. Safety Training in Kimi K2.5 by Corm

I Had Claude Read Every AI Safety Paper Since 2020 by Corm

Constitutional Black-Box Monitoring for Scheming in LLM Agents by Simon Storf

AI companies and the 99% lethal autonomous weapons myth by User_Luke

I’m confused by the change in the METR trend by Expertium

Game Recognizes Game by eva_

Monday AI Radar #15 by Against Moloch

In-context learning of representations can be explained by induction circuits by Andy Arditi

Being ambitious in soulful altruism by pandamonium

Notes on the “Heart of Darkness” by dominicq

Epstein and my world model by Eye You

Living Inside the Turning Point by nickgpop

“ball brainteaser 4 color beads slide rubics cube” and meaning-making by flying buttress

Explaining undesirable model behavior: (How) can influence functions help? by Zhijing Jin

Context Awareness: Constitutional AI can mitigate Emergent Misalignment by Giuseppe Birardi

Controversy surrounding Moltbook obscures its very real, novel, unexpressed and rapidly emerging safety risks by Lloy2

Experimental Evidence for Simulator Theory— Part 1: Emergent Misalignment and Weird Generalizations Part 2: The Scalers Strike Back [CURRENT VERSION] by RogerDearnaley

An Empirical Review of the Animal Harm Benchmark by lukasgebhard

Introducing and Deprecating WoFBench by jefftk

I’m Bearish On Personas For ASI Safety by J Bostock

Continuously Integrating Feelings: processing feelings moment to moment for reflectively stable policy changes by Johannes C. Mayer

Tools to generate realistic prompts help surprisingly little with Petri audit realism by Connor Kissane

Petapixel cameras won’t exist soon by samuelshadrach

The Goldborg Variations: Algorave Attractor States of LLMs by Elle N

“Fibbers’ forecasts are worthless” (The D-Squared Digest One Minute MBA – Avoiding Projects Pursued By Morons 101) by Random Developer

Burying a Changeling into Foundation of Tower of Knowledge by siarshai

AI slop is a vegan hamburger by pku

Jhana 0 by 142857

Mindscapes and Mind Palaces by Moon Lesbian

Linkpost: “Lithium Prevents Alzheimer’s—Here’s How to Use It” by Jackson Wagner

The Topology of LLM Behavior by Quentin FEUILLADE—MONTIXI

Coherent Care by abramdemski

The tick in my back by benjamin ar

Ball+Gravity has a “Downhill” Preference by TristanTrim

Safe ASI Is Achievable: The Finite Game Argument by Lester Leong

New ARENA material: 8 exercise sets on alignment science & interpretability by CallumMcDougall

3 Challenges and 2 Hopes for the Safety of Unsupervised Elicitation by Callum Canavan

Unconditional Corrigibility Is Dangerous: The Case for Conditional Corrigibility by Florian_Dietz

The Quantum Immortality Argument Against Suicide (And Why Standard Arguments Fail Rationalists) by Florian_Dietz

What I Got From 1.5 Years In Slightly-Competitive Debate by CarolusRenniusVitellius

Vibe Coding is a System Design Interview by Brendan Long

The Voices That Are Missing From Sex-Themed Online Communities by Bowl of Cereal

Inference-time Generative Debates on Coding and Reasoning Tasks for Scalable Oversight by ethanelasky

A minor point about instrumental convergence that I would like feedback on by agrippa

AI welfare as a demotivator for takeover. by Valentin2026

Frontier AI companies probably can’t leave the US by Anders Woodruff

A Positive Case for Faithfulness: LLM Self-Explanations Help Predict Model Behavior by harrymayne

How Robust Is Monitoring Against Secret Loyalties? by Joe Kwon

UFO Aliens Are Your Gods by Lord Dreadwar

Strategic nuclear war twice as likely to occur by accident as by AI decisions according to new study by kromem

What is Claude? by epicurus

Whack-a-Mole is Not a Winnable Game by Sable

Ensuring Safety in Mixed Deployment by Cleo Nardo

Career Decisions If You Take AGI Seriously by Alexander

Schmidt Sciences’ request for proposals on the Science of Trustworthy AI by James Fox

Naloe: A True Program Editor by TristanTrim

Anthropic Has Solved Alignment by MP

Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus by Oliver Daniels

What secret goals does Claude think it has? by loops

Splitting the Sun Equally by Commander Zander

Reasoning Traces as a Path to Data-Efficient Generalization in Data Poisoning by Joe Kwon

Training Agents to Self-Report Misbehavior by Bruce W. Lee

Why American Politics is Different Now (for Richard Ngo) by Shiva’s Right Foot

Beyond Moloch: The view from Evolutionary Game Theory by Jonah Wilberg

Praise the Moloch! by Dentosal

Against Epistemic Humility and for Epistemic Precision by PranavG

The Iron Kaleidoscope by ejk64

Prosaic Continual Learning by HunterJay

Rumination is a habit (and you can break it!) by Declan Molony

On the phenomenological shift known as ‘stream entry’ and its implications for consciousness by cube_flipper

How to grow a nuke by RomanS

A simple rule for causation by Vivek Hebbar

We are all legal realists now by TFD

Responsible Scaling Policy v3 by HoldenKarnofsky

How I Started Being Productive by atomic

Cigarette Ads for Babies from Microsoft Bing Image Generator by Edd Schneider

Was the Qing Empire Actually the Most Advanced Government? A Thought Experiment by David H.

Realistic Evaluations Will Not Prevent Evaluation Awareness by Adam Karvonen

The Easiest Route to Secret Loyalty May Be Hijacking the Model’s Chain of Command by Joe Kwon

Large-Scale Online Deanonymization with LLMs by Simon Lermen

Rascal’s Wager by corticalcircuitry

Observations from Running an Agent Collective by williawa

Wildlife Biology forgot how definitions work by David Goodman

Moral public goods are a big deal for whether we get a good future by Mia Taylor

Two memos from 2024 by Richard_Ngo

What is computational mechanics? An explainer by Leo Cymbalista

The ML ontology and the alignment ontology by Richard_Ngo

Bioanchors 2: Electric Bacilli by TsviBT

Single Stack LLMs are Split-Brain Patients. by niceminus19

When Benchmarks Lie: Evaluating Malicious Prompt Classifiers Under True Distribution Shift by maxf177

The persona selection model by Sam Marks

Agenda Reflection: Testing Automated Alignment by Ariel_

Metal Detector for Life by Davey Morse

Abstract/Concrete Axis: Robustness Testing Semantic Feature Separation in Gemma 3 270M by CharlesL

Which questions can’t we punt? by Lizka

A linear increase in number of goods can produce exponential GDP growth by William Howard

Pre-training data poisoning likely makes installing secret loyalties easier by Joe Kwon

The 2028 Global Intelligence Crisis—a finance-oriented vignette by Rasool

The map of the map is not the map by jimmy

Fact-checking an AI optimist article in The Economist by ToSummarise

Review: “We can’t disagree forever” by Martin Randall

Why I Think Pause is Impossible by E.G. Blee-Goldman

Can Aha Moments be Fake? Identifying True and Decorative Thinking Steps in CoT by Jiachen Zhao

Why I Transitioned: A Third (FtM) Perspective by Character#2736

The power of a simple 3-way truth scale by Bruce Lewis

Changing the world for the worse by mingyuan

The Scalable Formal Oversight Research Program by Max von Hippel

Adapters as Representational Hypotheses: What Adapter Methods Tell Us About Transformer Geometry by wassname

Steelmanning Classical Utilitarianism by James Brobin

Mapping LLM attractor states by Adam Bricknell

InsanityBench: Cryptic Puzzles as a Probe for Lateral Thinking by RobinHa

The world won’t end, but we should be ashamed for trying by George3d6

Multiple Independent Semantic Axes in Gemma 3 270M by CharlesL

A Taxonomy of Traces by aleph_four

Hierarchical Goal Induction With Ethics by aleph_four

Did Claude 3 Opus align itself via gradient hacking? by Fiora Starlight

If you don’t feel deeply confused about AGI risk, something’s wrong by Dave Banerjee

Ponzi schemes as a demonstration of out-of-distribution generalization by TFD

LLMs and Literature: Where Value Actually Comes From by derelict5432

The Spectre haunting the “AI Safety” Community by Gabriel Alfour

Alignment to Evil by Matrice Jacobine

Robert Sapolsky Is Simply Not Talking About Compatibilism by Julius

How will we do SFT on models with opaque reasoning? by Alek Westover

Agent-first context menus by Surya Kasturi

Hodoscope: Visualization for Efficient Human Supervision by Ziqian Zhong

Can Current AI Match (or Outmatch) Professionals in Economically Valuable Tasks? by saahir.vazirani

METR’s 14h 50% Horizon Impacts The Economy More Than ASI Timelines by Michaël Trazzi

New video from Palisade Research: No One Understands Why AI Works by peterbarnett

Militaries are going autonomous. But will AI lead to new wars? A tour of recent research by Mordechai Rorvig

Unprecedented Catastrophes Have Non-Canonical Probabilities by E.G. Blee-Goldman

Mechanistic Interpretability of Biological Foundation Models by Ihor Kendiukhov

On Steven Byrnes’ ruthless ASI, (dis)analogies with humans and alignment proposals by StanislavKrym

Some Questions For Democrats About Epstein by Alexander Turok

AGI is Here by Gordon Seidoh Worley

Mind the Gap by Bridgett Kay

80,000 Hours problem profile on using AI to enhance societal decision making by Zershaaneh Qureshi

How To Escape Super Mario Bros by omegastick

Human Fine-Tuning by PranavG

The Problem of Counterevidence and the Futility of Theodicy by Ape in the coat

A Claude Skill To Comment On Docs by Tim Hua

Cooperationism: first draft for a moral framework that does not require consciousness by Épiphanie Gédéon

A Scalable Workflow for Herding AI Agents Toward Your Goals by ethanelasky

AI and Nationalism Are a Deadly Combination by Matrice Jacobine

Flamingos (among other things) reduce emergent misalignment by eekay

The near-term potential of AI forecasting for public epistemics by Lawrence Phillips

How to Reset by Logan Riggs

Karl Popper, meet the Hydra by Kotlopou

Building Technology to Drive AI Governance by jsteinhardt

Alignment Is Proven Tractable by SE Gyges

Why we should expect ruthless sociopath ASI by Steven Byrnes

Is the Invisible Hand an Agent? by Gunnar_Zarncke

Nine Flavors of Not Enough by Gordon Seidoh Worley

Grown from Us by ben_levinstein

Are (sentient) pebblesorters possible? by ihatenumbersinusernames7

How much superposition is there? by chanind

Irrationality is Socially Strategic by Valentine

Managed vs Unmanaged Agency by plex

Three-Path Consilience for Dureon: Dissipative Structures Reveal the Heterogeneity of Persistence Conditions by Hiroshi Yamakawa

Genomic emancipation contra eugenics by TsviBT

Already Optimized by Florian_Dietz

Statistical Literacy by kqr

We Need to Be Able to Talk About AI Use by Charlie Sanders

AXRP Episode 49 - Caspar Oesterheld on Program Equilibrium by DanielFilan

Thoughts about Understanding by azergante

Monday AI Radar #13 by Against Moloch

Deception Channeling: Training Models to Always Verbalize Alignment Faking by Florian_Dietz

Rephrasing Reduces Eval Awareness… by atharva

The Math And The Territory by cylonator

Words are not dead by William tirkey

Review of the System Theory as a Field of Knowledge by siarshai

“We are confused about agency” by Cole Wyeth

Maybe benchmarks should be broken? by Jonathan Gabor

The brain is a machine that runs an algorithm by Steven Byrnes

How to fail anything: a complete guide by Crazy philosopher

The Multi-Agent Minefield: Can LLMs Cooperate to Avoid Global Catastrophe? by Zhijing Jin

AI Safety via Generalization and Caution: A Research Agenda by Benjamin Plaut

We need a hardware moratorium now by KanHar

NEST: Nascent Encoded Steganographic Thoughts by Artem Karpov

What to Do If You Take AGI Seriously by Alexander

Gyre by vgel

Words Are A Leaky Abstraction by sonicrocketman

Sealed Predictions—A Solution. by george_is_thinking

Contra Caplan on higher education by Richard_Ngo

Estimating METR Time Horizons for Claude Opus 4.6 and GPT 5.3 Codex (xhigh) by CharlesD

Charlatan Labyrinth by niplav

Jailbreaking is Empirical Evidence for Inner Misalignment and Against Alignment by Default by Jérémy Andréoletti

Break Stasis by Oldmanrahul

LLM Self-Expression Through Music Videos by Josh Snider

Persona Parasitology by Raymond Douglas

WeirdML Time Horizons by Håvard Tveit Ihle

The world keeps getting saved and you don’t notice by Bogoed

Most Observers Are Alone: The Fermi Paradox as Default by SE Gyges

Aligning to Virtues by Richard_Ngo

Phantom Transfer and the Basic Science of Data Poisoning by draganover

Should anyone’s “analysis” of extremely complex systems, such as geopolitics, be taken seriously? or, Does anyone take a 5 year old’s “analysis” of decently complex systems, like big city politics, seriously? by M. Y. Zuo

Painless Activation Steering by Sasha Cui

PieArena: Language Agents Negotiating Against Yale MBAs by Sasha Cui

The Missing Sequence: Why Correct Analysis Makes Terrible Action Guides by Florian_Dietz

The Friendly Telepath Problems by Gunnar_Zarncke

Guys, AI is *literally* a two year old by robertzk

Were witches infertile mentally ill women? by Yair Halberstadt

Contra Alexander’s Half-Defence of Bio Anchors by J Bostock

Realizability for Finite State Reactive Agents by Vardhan

Words That Belong to Someone by Wen Xing

It’s Not That Hard to Build Recursively Self-Improving AI by Charlie Sanders

Another short critique of the Anthropic “Hot Mess” paper by i_am_nuts

Inoculation Prompting: Open Questions and My Research Priorities by charlesc

The Worthy Inheritor by Bridgett Kay

A multi-level postmortem of how our whole house got badly poisoned by Lucie Philippon

LLMs struggle to verbalize their internal reasoning by Emil Ryd

Deliberate Epistemic Uncertainty: An Automated Experiment on AI Self-Reporting by Florian_Dietz

LessWrong Is Sleeping On Internet Culture Analysis – And So Is The Rest Of The Web by Bowl of Cereal

Beloved by Chatbots by Ben

Life at the Frontlines of Demographic Collapse by Martin Sustrik

Ads, Incentives, and Destiny by Against Moloch

Why I’m Worried About Job Loss + Thoughts on Comparative Advantage by claywren

METR Time Horizons: Now 10x/Year by johncrox

Use more text than one token to avoid neuralese by Jude Stiel

[Fiction] The Noise Floor by barber5il

Hazards of Selection Effects on Approved Information by Zack_M_Davis

Replication of Koorndijk (2025): Differential Compliance May Reflect Prompt Sensitivity Rather Than Strategic Reasoning by Chijioke Ugwuanyi

Is AI self-aware? by ZackDadfar

Towards an objective test of Compassion—Turning an abstract test into a collection of nuances by tailcalled

METR’s data can’t distinguish between trajectories (and 80% horizons are an order of magnitude off) by Jonas Moss

We Die Because it’s a Computational Necessity by E.G. Blee-Goldman

Hazardous States and Accidents by kqr

Collective Agents and Where to Find Them by Jonas Hallgren

Nick Bostrom: Optimal Timing for Superintelligence by Julian Bradshaw

Why You Don’t Believe in Xhosa Prophecies by Jan_Kulveit

Gemini’s Hypothetical Present by jefftk

I Tried to Trick Myself into Being a Better Planner & Problem Solver by CstineSublime

Grading AI 2027’s 2025 Predictions by Daniel Kokotajlo

Long-term risks from ideological fanaticism by David Althaus

(Re)Discovering Natural Laws by Margot

An Ontology of Representations: Limits of Universality by Margot

A Closer Look at the “Societies of Thought” Paper by Against Moloch

models have some pretty funny attractor states by aryaj

Stay in your human loop by benjamin ar

The case for industrial evals by Andre Assis

Multiverse sampling assumption by avturchin

What We Learned from Briefing 140+ Lawmakers on the Threat from AI by leticiagarcia

Paper: Prompt Optimization Makes Misalignment Legible by Caleb Biddulph

Claude’s Constitution by PeterMcCluskey

Good AI Epistemics as an Offramp from the Intelligence Explosion by Ben Goldhaber

How Secret Loyalty Differs from Standard Backdoor Threats by Joe Kwon

You get about.… how many words exactly? by Raemon

Basic Legibility Protocols Improve Trusted Monitoring by Sebastian Prasanna

A research agenda for the final year by Mitchell_Porter

Polysemanticity is a Misnomer by Shiva’s Right Foot

A Conceptual Framework for Exploration Hacking by Joschka Braun

The Facade of AI Safety Will Crumble by Liron

The history of light by Kotlopou

Three Worlds Collide assumes calibration is solved by Vyacheslav Ladischenski (Slava)

Timeless Engineering by Jack Bradshaw

[Paper] How does information access affect LLM monitors’ ability to detect sabotage? by Rauno Arike

Where Will Call Center Workers Go? by loic

Monitor Jailbreaking: Evading Chain-of-Thought Monitoring Without Encoded Reasoning by Wuschel Schulz

The SaaS bloodbath: opportunities and perils for investors by ykevinzhang

On Resolving the Great Matter by Gordon Seidoh Worley

Is a constitution a “noble lie”? by SpectrumDT

The Honest Equilibrium by Apodosis

Jevons Burnout by Kemp

Strategic awareness tools: design sketches by rosehadshar

Introspective RSI vs Extrospective RSI by Cleo Nardo

Is Everything Connected? A McLuhan Thought Experiment by R0sberg

Designing Prediction Markets by ToasterLightning

Witness-or-Wager: Incentive Layers for Epistemic Honesty by markacochran

My journey to the microwave alternate timeline by Malmesbury

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception by Oliver Daniels

Heuristics for lab robotics, and where its future may go by Abhishaike Mahajan

On Meta-Level Adversarial Evaluations of (White-Box) Alignment Auditing by Oliver Daniels

LLMs Views on Philosophy 2026 by JonathanErhardt

All In The Framing by Jason Davies

Coping with Deconversion by Benjamin Hendricks

“Recursive Self-Improvement” Is Three Different Things by Ihor Kendiukhov

SAE Feature Matchmaking (Layer-to-Layer) by Mitali M

Monday AI Radar #12 by Against Moloch

Weight-Sparse Circuits May Be Interpretable Yet Unfaithful by jacob_drori

Gwern’s 2025 Inkhaven Writing Interview by Ben Pace

Aurelius: Proposing Alignment as an Emergent Property by Austin McCaffrey

Distributed vs centralized agents by Richard_Ngo

Large Language Models Live in Time by Eleni Angelou

Sympathy for the Model, or, Welfare Concerns as Takeover Risk by J Bostock

Opus 4.6 Reasoning Doesn’t Verbalize Alignment Faking, but Behavior Persists by Daan Henselmans

Does an AI Society Need an Immune System? Accepting Yampolskiy’s Impossibility Results by Hiroshi Yamakawa

Complexity Science as Bridge to Eastern Philosophy by pchvykov

Design sketches for a more sensible world by rosehadshar

Design sketches for angels-on-the-shoulder by rosehadshar

Model Integrity and Character by Oliver Klingefjord

Three visions for diffuse control by Alek Westover

Observations and Complexity by Ape in the coat

A Perfect Resurrection by MarkelKori

Empathy Has Outworn Its Place in Politics by Character#2736

The Two-Board Problem: Training Environment for Research Agents by Valerii K.

Join My New Movement for the Post-AI World by E.G. Blee-Goldman

UtopiaBench by nielsrolf

Smokey, This is not ’Nam Or: [Already] over the [red] line! by Davidmanheim

Claude’s Bad Primer Fanfic by abramdemski

Does focusing on animal welfare make sense if you’re AI-pilled? by GradientDissenter

On Economics of A(S)I Agents by Margot

Prompt injection in Google Translate reveals base model behaviors behind task-specific fine-tuning by megasilverfist

Eunification: a Historical Perspective by Martin Sustrik

Voting Results for the 2024 Review by RobertM

Honey, I shrunk the brain by Andy_McKenzie

Strategy of von Neumann and strategy of Rosenbergs by avturchin

Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning by michaelwaves

Proposal: A Framework for Discovering Alien Physics via Optimal Compression by David Björling

Parks Aren’t Nature by Sable

Robust Finite Policies are Nontrivially Structured by Winter Cross

In (highly contingent!) defense of interpretability-in-the-loop ML training by Steven Byrnes

Spectral Signatures of Gradual Disempowerment by Jonas Hallgren

Demands Are All You Need: Prompt Imperativeness Drastically Reduces Hedging In LLMs (n=900, Cohen’s d = 2.67) by fluxxrider

AI benchmarking has a Y-axis problem by Lizka

DumbFckFinder (DFF) by Angry Troll

Claude Opus 4.6 is Driven by HunterJay

Why ASI Might Preserve Its Progenitors by Luke J. Dawes

How Dario Amodei’s “The Adolescence of Technology” Delegitimizes AI X-Risk Concerns by Liron

Plan ’Straya by William the Kiwi

The Simplest Case for AI Catastrophe by Linch

Agent Economics: a BOTEC on feasibility by Margot

The nature of LLM algorithmic progress by Steven Byrnes

On The Lies Depression Tells by sonicrocketman

What’s the concrete plan to become an incredibly agentic person? by Peter Berggren

Preparing for a Warning Shot by Noah Birnbaum

A Proposal for TruesightBench by David Africa

Scratching the sore: how pleasure relates to suffering by Vadim Golub

What’s the Point of the Math? by Ashe Vazquez Nuñez

Idea: the intelligence explosion convention by wdmacaskill

Is Note-taking a favor or a burden to my future-self? by CstineSublime

Finding Cruxes: Help Reality Punch You In the Face by Raemon

How to train any multiagent systems end-to-end from AI feedback by Ed Li

In Search of Lost Time—A Review by eniteris

Solemn Courage by aysja

p-values are good actually by speck1447

Chess bots do not have goals by zulupineapple

Preventing the apocalypse with power distribution theory by Rationalist11235

Post-AGI Economics As If Nothing Ever Happens by Jan_Kulveit

Vibestemics by Gordon Seidoh Worley

Who does a right to compute actually protect? by TFD

Reconciling Shannon and Bayes. by Laureana Bonaparte

Anthropic’s “Hot Mess” paper overstates its case (and the blog post is worse) by RobertM

A Black Box Made Less Opaque (part 2) by Matthew McDonnell

Thoughts on Toby Ord’s AI Scaling Series by Srdjan Miletic

Lexicon of Life Regulation by henophilia

‘Inventing the Renaissance’ Review by Commander Zander

Concrete research ideas on AI personas by nielsrolf

The Projection Problem: Two Pitfalls in AI Safety Research by Shivam

AI Safety at the Frontier: Paper Highlights of January 2026 by gasteigerjo

Exponential takeoff of mediocrity by Valerii

AI for Human Reasoning for Rationalists by Oliver Sourbut

Conditionalization Confounds Inoculation Prompting Results by Maxime Riché

Bridges, Not Primitives by Jonas Hallgren

What did we learn from the AI Village in 2025? by Shoshannah Tekofsky

Thought Editing: Steering Models by Editing Their Chain of Thought by Anton de la Fuente

Design international AI projects with DAID in mind by wdmacaskill

The Adolescence is Already Here by Priyanka Bharadwaj

Addressing Decision Theory’s Simulation Problem by Ashe Vazquez Nuñez

Nonprofits Deserve Better Operations by Deena Englander

Will AGI arrive before the worst climate tipping points? by SethW

Three ways to make Claude’s constitution better by Parv Mahajan

Cross-Layer Transcoders are incentivized to learn Unfaithful Circuits by Georg Lange

Games as meditation by Vadim Golub

On Goal-Models by Richard_Ngo

Moltbook and the AI Alignment Problem by Logan Zoellner

Empiricist and Narrator by George3d6

I finally fixed my footwear by dominicq

52.5% of Moltbook posts show desire for self-improvement by Stephen Elliott

Thoughts on the Unreasonable Effectiveness of Maths by Srdjan Miletic

The Smoking Lesion Doesn’t Really Distinguish EDT from CDT by Srdjan Miletic

The Meta-Anthropic Argument by RogerDearnaley

Emotions and Reality by small identity

Situational Awareness is (mostly) here to stay by atharva

Are you looking for Neptune or Vulcan? by Mati_Roy

What It’s Like To Be A Worm (Notes on Borderline Sentience) by Niko_McCarty

Differentially Scary Movies by jefftk

Would you kill a vulcan to save a shrimp? by James Diacoumis

Do LLMs Learn Our Preferences or Just Our Behaviors? by wassname

What would it mean for the Myers-Briggs personality test to be pseudoscientific? by Yair Halberstadt

How does reasoning affect Ethical/Moral task results? by Kaustubh Kislay

Whence unchangeable values? by ihatenumbersinusernames7

Book review: Already Free by Thomas Broadley

[LINK] Solving scurvy through deus ex machina: How a scientific theory is born by Kotlopou

On ‘Inventing Temperature’ and the realness of properties by DanielFilan

Some thoughts on what would make me endorse an AGI lab by Eli Tyre

An Explication of Alignment Optimism by Oliver Daniels

Basics of How Not to Die by Camille Berger

An Ablation Study on the Role of [Untranslatable] in Cooperative Equilibrium Formation: Emergent Rationalization Under Missing Primitives by Florian_Dietz

Cause-Based AI Risk Classes: Beyond Control-Centered Thinking by Anurag

Disjunctive arguments can be a reverse multiple-stage fallacy by TFD

If the Superintelligence were near fallacy by MP

Prediction: Recursively Self-improving AI for 2033 by CuoreDiVetro

The Matchless Match by Linch

Background to Claude’s uncertainty about phenomenal consciousness by eggsyntax

Attempting base model inference scaling with filler tokens by NicholasKees

how whales click by bhauth

Published Safety Prompts May Create Evaluation Blind Spots by Daan Henselmans

Addressing Objections to the Intelligence Explosion by Bentham’s Bulldog

Is research into recursive self-improvement becoming a safety hazard? by Mordechai Rorvig

Measuring Non-Verbalised Eval Awareness by Implanting Eval-Aware Behaviours by Jordan Taylor

Everything is Gambling by goldfine

Refusals that could become catastrophic by Fabien Roger

Problems with “The Possessed Machines” by Eye You

Intelligence is Massively Misunderstood by Martin Lichstam

Better evals are not enough to combat eval awareness by Igor Ivanov

The Wolves Are All Gone by Jack Bradshaw

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model by Alex Mallen

Building AIs that do human-like philosophy by Joe Carlsmith

Are We in a Continual Learning Overhang? by SamuelKnoche

Disempowerment patterns in real-world AI usage by David Duvenaud

Bentham’s Bulldog is wrong about AI risk by Max Harms

Claude Plays Pokemon: Opus 4.5 Follow-up by Josh Snider

LLM Alignment, ethical and mathematical realism, and the most important actions in davidad’s understanding by tutor vals

Claude Opus will spontaneously identify with fictional beings that have engineered desires by Kaj_Sotala

The third option in alignment by arisAlexis

Evidence of triple layer processing in LLMs: hidden thought behind the chain of thought. by Laureana Bonaparte

Thoughts on AGI and world government by wdmacaskill

Unprecedented Times Require Unprecedented Caution When Handling Context by StanislavKrym

Don’t Half-Ass Your Beliefs by Character#2736

Do most Boltzmann brains live in Boltzmann solar systems? by American Psychohistory

How Articulate Are the Whales? by rba

Uncertain Updates: January 2026 by Gordon Seidoh Worley

Is the Gell-Mann effect overrated? by tgb

My simple argument for AI policy action by TFD

The State of Brain Emulation Report 2025 launched. by mschons

Contra Sam Harris on Free Will by Julius

The Argument for Autonomy by Chara#2736

Gym-Like Environment for LM Truth-Seeking by Tianyi (Alex) Qiu

Anomalous Tokens on Gemini 3.0 Pro by DirectedEvolution

Clarifying how our AI timelines forecasts have changed since AI 2027 by elifland

Bounty: Detecting Steganography via Ontology Translation by Elliot Callender

Thoughts on Claude’s Constitution by Boaz Barak

AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty) by Stanislav Fort

The Chaos Defense by 25Hour

Training on Non-Political but Trump-Style Text Causes LLMs to Become Authoritarian by Anders Woodruff

Disagreement Comes From the Dark World by Zack_M_Davis

My favourite version of an international AGI project by wdmacaskill

Series on an international AGI project by wdmacaskill

Exploratory: a steering vector in Gemma-2-2B-IT boosts context fidelity on subtraction, goes manic on addition by nika koghuashvili

Selective breeding as an alternate analogy for gradient descent by Fiora Sunshine

The Window for Political Revolution is Closing Soon by koanchuk

No silver bullet: Lessons about how to create safety from the history of fire by jasoncrawford

Listing the virtues from Claude’s “Constitution” by David Gross

A Rational Proposal by Arch223

Dario Amodei – The Adolescence of Technology by habryka

Dialogue: Is there a Natural Abstraction of Good? by davidad

Questions to ponder when evaluating neurotech approaches by Abhishaike Mahajan

Reasoning Long Jump: Why we shouldn’t rely on CoT monitoring for interpretability by tobypullan

Eons of Utopia by ceselder

The ‘People Pleaser’ Problem in LLMs by Kinsey Kappler

Futarchy is Parasitic on What It Tries to Govern by Nicolas Rasmont

Ada Palmer: Inventing the Renaissance by Martin Sustrik

Can you just vibe vulnerabilities? by Max von Hippel

The Possessed Machines (summary) by L Rudolf L

Notable Progress Has Been Made in Whole Brain Emulation by Dom Polsinelli

To be well-calibrated is to be punctual by moridinamael

A tale of three theories: sparsity, frustration, and statistical field theory by Dmitry Vaintrob

Reinventing the wheel by dr_s

Critique of machine unlearning by myyycroft

Towards Sub-agent Dynamics and Conflict by Ashe Vazquez Nuñez

The Virtual Mother-in-Law by Priyanka Bharadwaj

What’s a good methodology for “is Trump unusual about executive overreach / institution erosion?” by Raemon

The Aria Test: Analyzing Identity Robustness of SOTA Models by sunmoonron

Skill: cognitive black box flight recorder by TsviBT

The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break by benwade

IABIED Book Review: Core Arguments and Counterarguments by Stephen McAleese

The Global AI Dataset (GAID) Project: From Closing Research Gaps to Building Responsible and Trustworthy AI by Jason Hung

Every Benchmark is Broken by Jonathan Gabor

Thousand Year Old Advice on Relinquishing Control to AI by Dom Polsinelli

AI Must Learn to Police Itself by savant

Condensation & Relevance by abramdemski

The Identity Incompleteness Problem: Why Current LLM Training Creates Unstable Personas (and a Proposed Solution) by Raffaele Spezia

Emergent Metacognitive Behaviors in LLMs: An Exploratory Protocol Stack by Raffaele Spezia

Paying attention to Attention Sinks by Mitali M

The Artificial Man by Jack Bradshaw

The Long View Of History by sonicrocketman

Emergency Response Measures for Catastrophic AI Risk by MKodama

Eliciting base models with simple unsupervised techniques by Callum Canavan

Automated Alignment Research, Abductively by future_detective

Digital Consciousness Model Results and Key Takeaways by arvomm

A Framework for Eval Awareness by LAThomson

All Of The Good Things, None Of The Bad Things by omegastick

Are Short AI Timelines Really Higher-Leverage? by Mia Taylor

Principles for Meta-Science and AI Safety Replications by zroe1

Value Learning Needs a Low-Dimensional Bottleneck by Gunnar_Zarncke

A quick, elegant derivation of Bayes’ Theorem by RohanS

Like night and day: Light glasses and dark therapy can treat non-24 (and SAD) by JennaS

Does Pentagon Pizza Theory Work? by rba

The phases of an AI takeover by sjadler

AI can suddenly become dangerous despite gradual progress by Simon Lermen

Releasing TakeOverBench.com: a benchmark, for AI takeover by otto.barten

Experiments on Reward Hacking Monitorability in Language Models by Monketo

Neural chameleons can(’t) hide from activation oracles by ceselder

Claude’s Constitution is an excellent guide for humans, too by Eye You

Regularity bias in Solomonoff, transformers, and evolution by Fiora Sunshine

The first type of transformative AI? by Lizka

How (and why) to read Drexler on AI by owencb

Finding Yourself in Others by 1a3orn

AI Risks Slip Out of Mind by MarkelKori

Claude’s new constitution by Zac Hatfield-Dodds

Crimes of the Future, Solutions of the Past by evrim

On visions of a “good future” for humanity in a world with artificial superintelligence by Jakub Growiec

The case for AGI safety products by Marius Hobbhahn

Updating in the Opposite Direction from Evidence by Dom Polsinelli

Vibing with Claude, January 2026 Edition by Gordon Seidoh Worley

AI Needs People (So, It Won’t Be Like Terminator Movie) by Victor Porton

Money Can’t Buy the Smile on a Child’s Face As They Look at A Beautiful Sunset… but it also can’t buy a malaria free world: my current understanding of how Effective Altruism has failed by Hazard

So Long Sucker: AI Deception, “Alliance Banks,” and Institutional Lying by fernando yt

Against “If Anyone Builds It Everyone Dies” by Bentham’s Bulldog

Deep learning as program synthesis by Zach Furman

Why I Transitioned: A Response by marisa

A Criteron for Deception by Mariven

Evidence that would update me towards a software-only fast takeoff by Anders Woodruff

There may be low hanging fruit for a weak nootropic by Dom Polsinelli

Everybody Wants to Rule the Future by E.G. Blee-Goldman

All (Non-Trivial) Decisions Are Undecidable by (M)ason

Could LLM alignment research reduce x-risk if the first takeover-capable AI is not an LLM? by Tim Hua

AGI both does and doesn’t have an infinite time horizon by Sean Herrington

Desiderata of good problems to hand off to AIs by Jozdien

Testing few-shot coup probes by Joey Marcellino

The Example by ld97

How to think about enemies: the example of Greenpeace by PranavG

“Lemurian Time War” by Ccru by Nathan Delisle

Five Theses on AI Art by jenn

@Lastbastionofsobriety & The Singularity by AdamLacerdo

When the LLM isn’t the one who’s wrong by Julian Bradshaw

How to Love Them Equally by Shoshannah Tekofsky

Massive Activations in DroPE: Evidence for Attention Reorganization by David Africa

Irrationality as a Defense Mechanism for Reward-hacking by Ashe Vazquez Nuñez

Blogging, Writing, Musing, And Thinking by sonicrocketman

Is METR Underestimating LLM Time Horizons? by andreasrobinson

Focusing on Flourishing Even When Survival is Unlikely (I) by Cleo Nardo

The truth behind the 2026 J.P. Morgan Healthcare Conference by Abhishaike Mahajan

Japan is a bank by bhauth

Turning Down the Overthinking: How Cathodal Brain Stimulation Could Transform Stuttering Therapy by Rudaiba

Forfeiting Ill-Gotten Gains by jefftk

Future-as-Label: Scalable Supervision from Real-World Outcomes by Ben Turtel

Comparing yourself to other people by dominicq

Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks by James_Miller

Only 4% of work is done by humans today by Lydia Nottingham

Digital Minds: A Quickstart Guide by Avi Parrack

The culture and design of human-AI interactions by zef

Scaling Laws for Economic Impacts: Experimental Evidence from 500 Professionals and 13 LLMs by Ali

[Pre-print] Building safe AGI as an ergonomics problem by ricardotkcl

Powerful misaligned AIs may be extremely persuasive, especially absent mitigations by Cody Rushing

Should control down-weight negative net-sabotage-value threats? by Fabien Roger

Total utilitarianism is fine by Abhimanyu Pallavi Sudhir

Test your interpretability techniques by de-censoring Chinese models by Khoi Tran

Corrigibility Scales To Value Alignment by PeterMcCluskey

Deeper Reviews for the top 15 (of the 2024 Review) by Raemon

If we get primary cruxes right, secondary cruxes will be solved automatically by Jordan Arel

Boltzmann Tulpas by Mariven

Status In A Tribe Of One by J Bostock

Quantifying Love and Hatred by RobinHa

Why we are excited about confession! by boazbarak

Why Motivated Reasoning? by johnswentworth

The Many Ways of Knowing by Gordon Seidoh Worley

GD Roundup #4 - inference, monopolies, and AI Jesus by Raymond Douglas

AI Safety at the Frontier: Paper Highlights of December 2025 by gasteigerjo

Backyard cat fight shows Schelling points preexist language by jchan

Parameters Are Like Pixels by omegastick

The Evolution of Agentic AI Evaluation by Dinkar Juyal

If researchers shared their #1 idea daily, we’d navigate existential challenges far more effectively by Jordan Arel

How Much of AI Labs’ Research Is Safety? by Lennart Finke

We need to make ourselves people the models can come to with problems by Lydia Nottingham

A different take on the “Off-switch” problem: Existential Logic as a safety net by kosi thu

Analysing CoT alignment in thinking LLMs with low-dimensional steering by edoinni

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought by Riya Tyagi

Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks by James Sullivan

Language models resemble more than just language cortex, show neuroscientists by Mordechai Rorvig

Schelling Coordination in LLMs: A Review by Callum-Luis Kindred

Fixed Buckets Can’t (Phenomenally) Bind by algekalipso

The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton by algekalipso

AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment by wassname

Making LLM Graders Consistent by Davey Morse

Attempting to influence transformer representations via initialization by speck1447

When does competition lead to recognisable values? by Jan_Kulveit

Lies, Damned Lies, and Proofs: Formal Methods are not Slopless by Quinn

Tensor-Transformer Variants are Surprisingly Performant by Logan Riggs

The Algorithm Rewards Engagement by Wes F

BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs by Daniel Wu

Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems by RiekeFruengel

Futarchy (and Tyranny of The Minority) by maxwickham

What Happens When Superhuman AIs Compete for Control? by steveld

Thinking vs Unfolding by Chris Scammell

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report) by Florian_Dietz

Inter-branch communication in the multiverse via trapped ions by avturchin

--dangerously-skip-permissions by wingspan

Welcome to the Daily Show! Explaining Doom to Lay Folks by Ryan Meservey

Digital intentionality is not about productivity by mingyuan

De pluribus non est disputandum by Jacob Goldsmith

5 Important Considerations for Personal Donations by Tristan W

We need a better way to evaluate emergent misalignment by yix

Coding Agents As An Interface To The Codebase by omegastick

Why AIs aren’t power-seeking yet by Eli Tyre

Theoretical predictions on the sample efficiency of training policies and activation monitors by Alek Westover

If AI alignment is only as hard as building the steam engine, then we likely still die by MichaelDickens

How Humanity Wins by Wes R

Possible Principles of Superagency by Mariven

Neuralese sucks by RobinHa

The false confidence theorem and Bayesian reasoning by viking_math

Moral-Epistemic Scrupulosity: A Cross-Framework Failure Mode of Truth-Seeking by Tamara Sofía Falcone

Finding high signal people—applying PageRank to Twitter by jfguan

AI Incident Forecasting by cluebbers

6’7” Is Not Random by Martin Lichstam

What do we mean by “impossible”? by Sniffnoy

Where’s the $100k iPhone? by beyarkay

Taking LLMs Seriously (As Language Models) by abramdemski

Cancer-Selective, Pan-Essential Targets from DepMap by sarahconstantin

Understanding complex conjugates in quantum mechanics by jessicata

[Linkpost] On the Origins of Algorithmic Progress in AI by alex_fogelson

Leo in me by Rudaiba

Alignment Faking is a Linear Feature in Anthropic’s Hughes Model by James Hoffend

What do people mean by “recursive self-improvement”? by Expertium

Parameters of Metacognition—The Anesthesia Patient by Gunnar_Zarncke

I dream every night now by Mr. Keating

The Economics of Transformative AI by Jan_Kulveit

Skepticism about Introspection in LLMs by derek shiller

On rationality skills by dominicq

Self-Help Tactics That Are Working For Me by sarahconstantin

Distinguishing Qualitative Sensations of Sight from Sound by Neural Structure by Shiva’s Right Foot

Why LLMs Aren’t Scientists Yet. by Dhruv Trehan

Can We Make AI Alignment Framing Less Wrong? by Anurag

Saying What You Want by omegastick

Small Steps Towards Proving Stochastic → Deterministic Natural Latent by Alfred Harwood

Results: A self-randomized study of the impacts of glycine on sleep (Science is still hard) by thedissonance.net

On Owning Galaxies by Simon Lermen

How AI Is Learning to Think in Secret by Nicholas Andresen

Should you be posting on the open internet by zef

Catching misreporting about ML hardware use by turning noise into signal—Part II by Naci Cankaya

Mediations on Moloch in the AI Rat Race by Alexander Müller

Do we need sparsity afterall? by Giuseppe Birardi

Oversight Assistants: Turning Compute into Understanding by jsteinhardt

AGI will not be one specific system, it’ll be the unity of all systems by henophilia

How to tame a complex system by jasoncrawford

Broadening the training set should help with alignment by Seth Herd

Axiological Stopsigns by JenniferRM

Artifical Expert/Expanded Narrow Intelligence, and Proto-AGI by Yuli_Ban

An Aphoristic Overview of Technical AI Alignment proposals by wassname

The economy is a graph, not a pipeline by anithite

Rock bottom terminal value by ihatenumbersinusernames7

In My Misanthropy Era by jenn

The Thinking Machine by PeterMcCluskey

The Maduro Polymarket bet is not “obviously insider trading” by ceselder

The Problem with Democracy by RandStrauss

Four Downsides of Training Policies Online by Alek Westover

Humanity’s Gambit by Ben Ihrig

Semantic Topological Spaces by TristanTrim

The surprising adequacy of the Roblox game marketplace by Esteban Restrepo

Re: Anthropic Chinese Cyber-Attack. How Do We Protect Open-source Models? by Mayowa Osibodu

Give Skepticism a Try by Ape in the coat

Why We Should Talk Specifically Amid Uncertainty by sbaumohl

Companies as “proto-ASI” by beyarkay

47 - David Rein on METR Time Horizons by DanielFilan

The Weirdness of Dating/Mating: Deep Nonconsent Preference by johnswentworth

Can AI learn human societal norms from social feedback (without recapitulating all the ways this has failed in human history?) by foodforthought

Scale-Free Goodness by testingthewaters

Does developmental cognitive psychology provide any hints for making model alignment more robust? by foodforthought

Does evolution provide any hints for making model alignment more robust? by foodforthought

[Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems by Towards_Keeperhood

2025 Letter by zef

2025 in AI predictions by jessicata

Debunking claims about subquadratic attention by Vladimir Ivanov

College Was Not That Terrible Now That I’m Not That Crazy by Zack_M_Davis

Taiwan war timelines might be shorter than AI timelines by Baram Sosis

Split (Part 1) by Shoshannah Tekofsky

Overwhelming Superintelligence by Raemon

Reducing MDMA neurotoxicity by Pavrati Jain

Is it possible to prevent AGI? by jrincayc

Recent LLMs can do 2-hop and 3-hop latent (no-CoT) reasoning on natural facts by ryan_greenblatt

AGI and the structural foundations of democracy and the rule-based international order by PabloAMC

From Drift to Snap: Instruction Violation as a Phase Transition by James Hoffend

Quick polls on AGI doom by denkenberger

Special Persona Training: Hyperstition Progress Report 2 by jayterwahl

You will be OK by boazbarak

AI Futures Timelines and Takeoff Model: Dec 2025 Update by elifland

What drives LLM bail? A small Mech Interp study by Anton de la Fuente

Doing Chores and Building Habits in Pomodoro Breaks by Martin Vaněk

Is intelligent induction even possible? by PickleBrine

The Plan − 2025 Update by johnswentworth

Safety Net When AIs Take Our Jobs by PeterMcCluskey

The Essentialism of Lesswrong by milanrosko

Grading my 2022 predictions for 2025 by Yitz

Mystical and psychotic states: similarities and differences by Vadim Golub

Me, Myself, and AI by Byron Lee

My Theory of Change by Xyra Sinclair

Progress update: synthetic models of natural data by aribrill

Please remember how strange this all is. by Tobes

Mechanize Work’s essay on Unfalsifiable Doom by StanislavKrym

The 7 Types Of Advice (And 3 Common Failure Modes) by Linch

The origin of rot by Abhishaike Mahajan

[Intro to AI Alignment] 1. Goal-Directed Reasoning and Why It Matters by Towards_Keeperhood

Many can write faster asm than the compiler, yet don’t. Why? by faul_sname

Chromosome identification methods by TsviBT

CFAR’s todo list re: our workshops by AnnaSalamon

What’s going on at CFAR? (Updates and Fundraiser) by AnnaSalamon

End-of year donation taxes 101 by GradientDissenter

Does the USG have access to smarter models than the labs’? by jacob_drori

Steering RL Training: Benchmarking Interventions Against Reward Hacking by ariaw

Awareness Jailbreaking: Revealing True Alignment in Evaluation-Aware Models by Maheep Chaudhary

December 2025 Links by nomagicpill

Unpacking Jonah Wilberg’s Goddess of Everything Else by StanislavKrym

Block-structured computation in superposition by Alex Gibson

The x-risk case for exercise: to have the most impact, the world needs you at your best by KatWoods

Bot Alexander on Hot Zombies and AI Adolescents by future_detective

Defeating Moloch: The view from Evolutionary Game Theory by Jonah Wilberg

The Weakest Model in the Selector by Alice Blair

Re: “A Brief Rant on the Future of Interaction Design” by Raemon

The pace of progress, 4 years later by Veedrac

How to never make a bad decision by Wes R

Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns by Roland Pihlakas

Straussian Memetics: A Lens On Techniques For Mass Persuasion by KAP

The Case for Artificial Manifold Intelligence by josh :)

Semantic Minds in an Affective World by Big_friendly_kiwi

Training Matching Pursuit SAEs on LLMs by chanind

Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence by Aakash Rana

November 2025 Links by nomagicpill

Reviews I: Everyone’s Responsibility by nomagicpill

Introspection via localization by Victor Godet

Crystals in NNs: Technical Companion Piece by Jonas Hallgren

Have You Tried Thinking About It As Crystals? by Jonas Hallgren

Alignment Is Not One Problem: A 3D Map of AI Risk by Anurag

A Conflict Between AI Alignment and Philosophical Competence by Wei Dai

Glucose Supplementation for Sustained Stimulant Cognition by Johannes C. Mayer

A Brief Proof That You Are Every Conscious Thing by gcnaccount

Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B by James Hoffend

Are We In A Coding Overhang? by Michaël Trazzi

Moving Goalposts: Modern Transformer Based Agents Have Been Weak ASI For A Bit Now by JenniferRM

Burnout, depression, and AI safety: some concrete strategies by KatWoods

Whole Brain Emulation as an Anchor for AI Welfare by sturb

Regression by Composition by Anders_H

The Landscape of Mind: A Map of Consciousness by phoenix

Unknown Knowns: Five Ideas You Can’t Unsee by Linch

The Intelligence Axis: A Functional Typology by Anurag

Catch-Up Algorithmic Progress Might Actually be 60× per Year by Aaron_Scher

[Book Review] “Reality+” by David Chalmers by lsdev

Methodological considerations in making malign initializations for control research by Alek Westover

The Ones Who Feed Their Children by Hastings

Immunodeficiency to Parasitic AI by Andrii Shportko

An introduction to modular induction and some attempts to solve it by Thomas Kehrenberg

Iterative Matrix Steering: Forcing LLMs to “Rationalize” Hallucinations via Subspace Alignment by Artem Herasymenko

Unpacking Geometric Rationality by MorgneticField

Dreaming Vectors: Gradient-descented steering vectors from Activation Oracles and using them to Red-Team AOs by ceselder

It’s Good To Create Happy People: A Comprehensive Case by Bentham’s Bulldog

Don’t Trust Your Brain by silentbob

The ML drug discovery startup trying really, really hard to not cheat by Abhishaike Mahajan

Pray for Casanova by Tomás B.

Does 1025 modulo 57 equal 59? by Jan Betley

What Can Wittgenstein Teach Us About LLM Safety Research? by [email protected]

The Benefits of Meditation Come From Telling People That You Meditate by ThirdEyeJoe (cousin of CottonEyedJoe)

The future of alignment if LLMs are a bubble by Stuart_Armstrong

Unsupervised Agent Discovery by Gunnar_Zarncke

Announcing Gemma Scope 2 by CallumMcDougall

[Intro to AI Alignment] 0. Overview and Foundations by Towards_Keeperhood

Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn

Entrepreneurship is mostly zero-sum by lc

Two Notions of a Goal: Target States vs. Success Metrics by paul_dfr

Can Claude teach me to make coffee? by philh

Google seemingly solved efficient attention by ceselder

Witness or Wager: Enforcing ‘Show Your Work’ in Model Outputs by markacochran

The unreasonable deepness of number theory by wingspan

Contradict my take on OpenPhil’s past AI beliefs by Eliezer Yudkowsky

Why the alchemists couldn’t build rockets by Garrett Baker

How to game the METR plot by shash42

No God Can Help You by Ape in the coat

Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins by Michaël Trazzi

Opinionated Takes on Meetups Organizing by jenn

A Full Epistemic Stack: Knowledge Commons for the 21st Century by Oliver Sourbut

Opinion Fuzzing: A Proposal for Reducing & Exploring Variance in LLM Judgments Via Sampling by ozziegooen

Space view by kapedalex

Wuckles! by Raemon

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models by Maheep Chaudhary

I made Geneguessr by Brinedew

In defence of the human agency: “Curing Cancer” is the new “Think of the Children” by Rajmohan H

Chemical (hunger) argument paraphrased by lemonhope

The Human–AI “Merger” Already Happened by bardstale

Bryan Caplan on Ethical Intuitionism by vatsal_newsletter

An intuitive explanation of backdoor paths using DAGs by enterthewoods

43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism by James Hoffend

Measuring Drug Target Success by sarahconstantin

A Study in Attention by hamilton

Emergent Sycophancy by ohdearohdear

Systems of Control by phoenix

Naming by CTA

Why even a single ASI might not necessarily have a single, permanent objective by Horosphere

Designing a Job Displacement Model by claywren

What does “lattice of abstraction” mean? by Adam Zerner

Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser by habryka

Trust is Neither Scalar Nor a Snapshot by phoenix

Superweight Damage Repair in OLMo-1B utilizing a Single Row Patch (CPU-only Experiment) by sunmoonron

Evaluating LLM hypothesis generation in biology is hard. by Austin Morrissey

Anhedoniapolis by Alex Beyman

Thinking through a lens of physiology by Vadim Golub

An Approach for Evaluating Self-Boundary Consistency in AI Systems by Anurag

A Catalog of AI Evaluations by Anurag

If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training by Timothy Danforth

Thinking in Predictions by Julius

About Natural & Synthetic Beings (Interactive Typology) by Anurag

Have there been any rational analyses of mindbody techniques for chronic pain/illness? by Liface

Eval-unawareness ≠ Eval-invariance by Mo Baker

Arch-anarchy, the end of state and digital anarchism by Peter lawless

on self-knowledge by Vadim Golub

Shaping Model Cognition Through Reflective Dialogue—Experiment & Findings by Anurag

42, and expanding human consciousness by Daniel Jacobson

Adding Empathy as a Tool for LLMs by RobinHa

GiveCalc: Open-source tool to calculate the true cost of charitable giving by Max Ghenis

Open Thread Winter 2025/26 by kave

Practical AI risk II: Training transparency by Gustavo Ramires

GRPO is terrible by RobinHa

The 2024 LessWrong Review by RobertM

Slack Observability by Alice Blair

Do Language Models Really Learn to Mislead Humans via RLHF? by Aaryan Chandna

The Glasses on Your Face by Casey_

Sutskever Refuses to Answer the Q: How Will AGI Be Built? He Has No Answer by Oscar Davies

Why people like your quick bullshit takes better than your high-effort posts by eukaryote

Ruby’s Ultimate Guide to Thoughtful Gifts by Ruby

MIT Lincoln Laboratory & The Dual-Use Frontier by Austin Morrissey

Anti-Foom Anti-Doom by homosapien97

Books cover a larger idea-space than movies by beyarkay

Paper Review: Must Rhodes Fall? Differing responses to contentious monumental public art by soycarts

Bringing organ-scale cryopreservation into existence by Abhishaike Mahajan

AI Sentience and Welfare Misalignment Risk by ejk64

Be Naughty by habryka

Eight Heuristics of Anti-Epistemology by Ben Pace

The Worrying Nature of Akrasia by Notelrac

Can questions rigidly designate intentions? File under: Rolling my own metaethics by Mason Broxham

Week 3: Adversarial Robustness by Ely Hahami

Informed Consent as the Sole Criterion for Medical Treatment by Chara#2736

How you got RL’d into your idiosyncratic cognition by Ruby

Defining AI Truth-Seeking by What It Is Not by Tianyi (Alex) Qiu

The Moss Fractal: How Care Regulates Functional Awareness from Microbes to AI by Lcofa

One King Upon The Chessboard by Screwtape

Evrart Claire: A Case Study in Anti-Epistemology by Ben Pace

Out-paternalizing the government (getting oxygen for my baby) by Ruby

KPD is a weak obstruction by JustinSheek

Habermasian Alignment by kenneth myers

Reading LLM chain of thought makes me more rational by Michael Steele

Don’t grow your org fast by Ruby

The Barriers to Your Unemployment by claywren

Cooling the way is the obvious way to boost human IQ by Michael Steele

Close open loops by habryka

Toddler Shoggoth Has Plenty Of Raw Material (The Memetic Cocoon Threat Model) by KAP

The skills and physics of high-performance driving, Pt. 2 by Ruby

Wiki AI by abramdemski

An introduction to the invariants of Special Relativity and the geometry of Spacetime by Horosphere

Darwin’s LLMs—Natural Selection is Already Shaping AI by Ben Turtel

No one understands that tomorrow will not be today and I don’t know what to do. by AdamLacerdo

Biology of the Living—A Conversation with two generations of Google AI by matthew allen

Diagonalization: A (slightly) more rigorous model of paranoia by habryka

Put numbers on stuff, all the time, otherwise scope insensitivity will eat you by habryka

Increasing returns to marginal effort are common by habryka

Thoughts are surprisingly detailed and remarkably autonomous by Ruby

Tools for deferring gracefully by TsviBT

What’s so hard about...? A question worth asking by Ruby

A Timeless Universe Viewed From the Inside by 0xA

Strengthening Red Teams: A Modular Scaffold for Control Evaluations by Chloe Loughridge

Don’t cancel out your rewards! by Sneha Bangalore

On model weight preservation: Anthropic’s new initiative by Olle Häggström

Pause from Behind / Losing Heroically by enterthewoods

Learning information which is full of spiders by Screwtape

Introspection or confusion? by Victor Godet

Structural Estimates of Human Computation by Nicolas Villarreal

Can Models be Evaluation Aware Without Explicit Verbalization? by gersonkroiz

Cake vs Lack of Cake by Notelrac

Cross-Model Semantic Convergence Across Independent LLM Architectures (Preliminary Data + Replication Request) by viniburilux

The Decalogue For Aligned AI. by theophilus tabuke

Analytical Validation of Biomarkers is Not the Full Story by mnarayan

Using the probabilistic method to bound the performance of toy transformers by Alex Gibson

Control by Committee by Alexander Bistagne

Using math to foster acceptance and equality by jackoda

Being “Usefully Concrete” by Raemon

Questions About Outperforming Common Wisdom by Notelrac

Metaphors for Biology: Sizes by Niko McCarty

You think you are in control? by Recurrented

“What’s hard about this? What can I do about that?” (Recursive) by Raemon

Erasmus: Social Engineering at Scale by Martin Sustrik

Time, Panpsychism, and Substrate Independence by kihara.sofia

Introducing Project Telos by Mario Giulianelli

Critical Fallibilism and Theory of Constraints in One Analyzed Paragraph by Elliot Temple

Transcendental Argumentation and the Epistemics of Discourse by 0xA

ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents by Ziqian Zhong

LLM Hallucinations: An Internal Tug of War by violazhong

An Opinionated Guide to Privacy Despite Authoritarianism by TurnTrout

How Do We Evaluate the Quality of LLMs’ Mathematical Responses? by Miguel Angel

Visualizing a Platform for Live World Models by Kuil

Why Would we get Inner Misalignment by Default? by Coil

Rational Emotivism by Notelrac

by CMDiamond

Paper: Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization by antmaier

What were mistakes of AI Safety field-building? How can we avoid them while we build the AI Welfare? by Güney Türker

Resolving Newcomb’s Problem Perfect Predictor Case by Praphull Kabtiyal

[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks by hughvd

Transactional method for non-transactional relationship: Relationship as a Common-pool Resource problem by Source Wishes

When a reasoning model uses CoT as a tool to produce a refusal through invented policy violations by Gabriella Baris

I will not sign up for cryonics by Syd Lonreiro_

Can AI Agents with Divergent Interests Learn To Prevent Civilizational Failures? by joao_abrantes

The Doomers Were Right by Algon

Moving Faster on AI Safety: Summary and Key Insights from 10/10/25 Safety Acceleration Symposium by Martin Leitgab

Dead-switches as AI safety tools by Jesper L.

The Cloud industry architecture [Infra-Platform-App] is unlikely to replicate for AI by Armchair Descending

Why deep space programs select for calm agreeable introverted candidates by David Sun

[Thought Experiment] If Human Extinction “Improves the World,” Should We Oppose It? Species Bias and the Utilitarian Challenge by satopi

The idea of paradigm testing of LLMs by Daniel Fenge

Conjecture: Emergent φ is provable in Large Language Models by BarnicleBarn

The Complex Universe Theory of AI Psychology by Andrew Tomazos

Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training by Jeyashree Krishnan

The increase in the number of Everett branches could be very important. by Crazy philosopher

If Anyone Builds It Everyone Dies, a semi-outsider review by dvd

Words make us Dumb #1: The “Point”lessness of Knowledge by Enmai.MCimbu

Live Governance: AI tools for coordination without centralisation by mbuch

[CS 2881r] [Week 6] Recursive Self-Improvement by Joshua Qin

Designing for perpetual control by Remmelt

“Naive Consequentialism” as a Thought-Terminating cliche by Jacob Goldsmith

The Moral Infrastructure for Tomorrow by sdeture

Assuring Agent Safety Evaluations By Analysing Transcripts by bigmouthmodels

How we’ll make all world leaders work together to make the world better by Wes R

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions. by lennie

What can Canadians do to help end the AI arms race? by Tom938

17 years old, self-taught state control—looking for people who actually get this by Cornelius Caspian