RSS.Social

Frontpage posts - LessWrong 2.0 viewer

Posts

Contra Sam Harris on Free Will by Julius

The Argument for Autonomy by Chara#2736

Gym-Like Environment for LM Truth-Seeking by Tianyi (Alex) Qiu

Anomalous Tokens on Gemini 3.0 Pro by DirectedEvolution

Clarifying how our AI timelines forecasts have changed since AI 2027 by elifland

Bounty: Detecting Steganography via Ontology Translation by Elliot Callender

Thoughts on Claude’s Constitution by Boaz Barak

AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty) by Stanislav Fort

The Chaos Defense by 25Hour

Training on Non-Political but Trump-Style Text Causes LLMs to Become Authoritarian by Anders Woodruff

Disagreement Comes From the Dark World by Zack_M_Davis

My favourite version of an international AGI project by wdmacaskill

Series on an international AGI project by wdmacaskill

Exploratory: a steering vector in Gemma-2-2B-IT boosts context fidelity on subtraction, goes manic on addition by nika koghuashvili

Selective breeding as an alternate analogy for gradient descent by Fiora Sunshine

The Window for Political Revolution is Closing Soon by koanchuk

No silver bullet: Lessons about how to create safety from the history of fire by jasoncrawford

Listing the virtues from Claude’s “Constitution” by David Gross

A Rational Proposal by Arch223

Dario Amodei – The Adolescence of Technology by habryka

Dialogue: Is there a Natural Abstraction of Good? by davidad

Questions to ponder when evaluating neurotech approaches by Abhishaike Mahajan

Reasoning Long Jump: Why we shouldn’t rely on CoT monitoring for interpretability by tobypullan

Eons of Utopia by ceselder

The ‘People Pleaser’ Problem in LLMs by Kinsey Kappler

Futarchy is Parasitic on What It Tries to Govern by Nicolas Rasmont

Ada Palmer: Inventing the Renaissance by Martin Sustrik

Can you just vibe vulnerabilities? by Max von Hippel

The Possessed Machines (summary) by L Rudolf L

Notable Progress Has Been Made in Whole Brain Emulation by Dom Polsinelli

To be well-calibrated is to be punctual by moridinamael

A tale of three theories: sparsity, frustration, and statistical field theory by Dmitry Vaintrob

Reinventing the wheel by dr_s

Critique of machine unlearning by myyycroft

Towards Sub-agent Dynamics and Conflict by Ashe Vazquez Nuñez

The Virtual Mother-in-Law by Priyanka Bharadwaj

What’s a good methodology for “is Trump unusual about executive overreach / institution erosion?” by Raemon

The Aria Test: Analyzing Identity Robustness of SOTA Models by sunmoonron

Skill: cognitive black box flight recorder by TsviBT

The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break by benwade

IABIED Book Review: Core Arguments and Counterarguments by Stephen McAleese

The Global AI Dataset (GAID) Project: From Closing Research Gaps to Building Responsible and Trustworthy AI by Jason Hung

Every Benchmark is Broken by Jonathan Gabor

Thousand Year Old Advice on Relinquishing Control to AI by Dom Polsinelli

AI Must Learn to Police Itself by savant

Condensation & Relevance by abramdemski

The Identity Incompleteness Problem: Why Current LLM Training Creates Unstable Personas (and a Proposed Solution) by Raffaele Spezia

Emergent Metacognitive Behaviors in LLMs: An Exploratory Protocol Stack by Raffaele Spezia

Paying attention to Attention Sinks by Mitali M

The Artificial Man by Jack Bradshaw

The Long View Of History by sonicrocketman

Emergency Response Measures for Catastrophic AI Risk by MKodama

Eliciting base models with simple unsupervised techniques by Callum Canavan

Automated Alignment Research, Abductively by future_detective

Digital Consciousness Model Results and Key Takeaways by arvomm

A Framework for Eval Awareness by LAThomson

All Of The Good Things, None Of The Bad Things by omegastick

Are Short AI Timelines Really Higher-Leverage? by Mia Taylor

Principles for Meta-Science and AI Safety Replications by zroe1

Value Learning Needs a Low-Dimensional Bottleneck by Gunnar_Zarncke

A quick, elegant derivation of Bayes’ Theorem by RohanS

Like night and day: Light glasses and dark therapy can treat non-24 (and SAD) by JennaS

Does Pentagon Pizza Theory Work? by rba

The phases of an AI takeover by sjadler

AI can suddenly become dangerous despite gradual progress by Simon Lermen

Releasing TakeOverBench.com: a benchmark, for AI takeover by otto.barten

Experiments on Reward Hacking Monitorability in Language Models by Monketo

Neural chameleons can(’t) hide from activation oracles by ceselder

Claude’s Constitution is an excellent guide for humans, too by Eye You

Regularity bias in Solomonoff, transformers, and evolution by Fiora Sunshine

The first type of transformative AI? by Lizka

How (and why) to read Drexler on AI by owencb

Finding Yourself in Others by 1a3orn

AI Risks Slip Out of Mind by MarkelKori

Claude’s new constitution by Zac Hatfield-Dodds

Crimes of the Future, Solutions of the Past by evrim

On visions of a “good future” for humanity in a world with artificial superintelligence by Jakub Growiec

The case for AGI safety products by Marius Hobbhahn

Updating in the Opposite Direction from Evidence by Dom Polsinelli

Vibing with Claude, January 2026 Edition by Gordon Seidoh Worley

AI Needs People (So, It Won’t Be Like Terminator Movie) by Victor Porton

Money Can’t Buy the Smile on a Child’s Face As They Look at A Beautiful Sunset… but it also can’t buy a malaria free world: my current understanding of how Effective Altruism has failed by Hazard

So Long Sucker: AI Deception, “Alliance Banks,” and Institutional Lying by fernando yt

Against “If Anyone Builds It Everyone Dies” by Bentham’s Bulldog

Deep learning as program synthesis by Zach Furman

Why I Transitioned: A Response by marisa

A Criteron for Deception by Mariven

Evidence that would update me towards a software-only fast takeoff by Anders Woodruff

There may be low hanging fruit for a weak nootropic by Dom Polsinelli

Everybody Wants to Rule the Future by E.G. Blee-Goldman

All (Non-Trivial) Decisions Are Undecidable by (M)ason

Could LLM alignment research reduce x-risk if the first takeover-capable AI is not an LLM? by Tim Hua

AGI both does and doesn’t have an infinite time horizon by Sean Herrington

Desiderata of good problems to hand off to AIs by Jozdien

Testing few-shot coup probes by Joey Marcellino

The Example by ld97

How to think about enemies: the example of Greenpeace by PranavG

“Lemurian Time War” by Ccru by Nathan Delisle

Five Theses on AI Art by jenn

@Lastbastionofsobriety & The Singularity by AdamLacerdo

When the LLM isn’t the one who’s wrong by Julian Bradshaw

How to Love Them Equally by Shoshannah Tekofsky

Massive Activations in DroPE: Evidence for Attention Reorganization by David Africa

Irrationality as a Defense Mechanism for Reward-hacking by Ashe Vazquez Nunez

Blogging, Writing, Musing, And Thinking by sonicrocketman

Is METR Underestimating LLM Time Horizons? by andreasrobinson

Focusing on Flourishing Even When Survival is Unlikely (I) by Cleo Nardo

The truth behind the 2026 J.P. Morgan Healthcare Conference by Abhishaike Mahajan

Japan is a bank by bhauth

Turning Down the Overthinking: How Cathodal Brain Stimulation Could Transform Stuttering Therapy by Rudaiba

Forfeiting Ill-Gotten Gains by jefftk

Future-as-Label: Scalable Supervision from Real-World Outcomes by Ben Turtel

Comparing yourself to other people by dominicq

Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks by James_Miller

Only 4% of work is done by humans today by Lydia Nottingham

Digital Minds: A Quickstart Guide by Avi Parrack

The culture and design of human-AI interactions by zef

Scaling Laws for Economic Impacts: Experimental Evidence from 500 Professionals and 13 LLMs by Ali

[Pre-print] Building safe AGI as an ergonomics problem by ricardotkcl

Powerful misaligned AIs may be extremely persuasive, especially absent mitigations by Cody Rushing

Should control down-weight negative net-sabotage-value threats? by Fabien Roger

Total utilitarianism is fine by Abhimanyu Pallavi Sudhir

Test your interpretability techniques by de-censoring Chinese models by Khoi Tran

Corrigibility Scales To Value Alignment by PeterMcCluskey

Deeper Reviews for the top 15 (of the 2024 Review) by Raemon

If we get primary cruxes right, secondary cruxes will be solved automatically by Jordan Arel

Boltzmann Tulpas by Mariven

Status In A Tribe Of One by J Bostock

Quantifying Love and Hatred by RobinHa

Why we are excited about confession! by boazbarak

Why Motivated Reasoning? by johnswentworth

The Many Ways of Knowing by Gordon Seidoh Worley

GD Roundup #4 - inference, monopolies, and AI Jesus by Raymond Douglas

AI Safety at the Frontier: Paper Highlights of December 2025 by gasteigerjo

Backyard cat fight shows Schelling points preexist language by jchan

Parameters Are Like Pixels by omegastick

The Evolution of Agentic AI Evaluation by Dinkar Juyal

If researchers shared their #1 idea daily, we’d navigate existential challenges far more effectively by Jordan Arel

How Much of AI Labs’ Research Is Safety? by Lennart Finke

We need to make ourselves people the models can come to with problems by Lydia Nottingham

A different take on the “Off-switch” problem: Existential Logic as a safety net by kosi thu

Analysing CoT alignment in thinking LLMs with low-dimensional steering by edoinni

Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought by Riya Tyagi

Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks by James Sullivan

Language models resemble more than just language cortex, show neuroscientists by Mordechai Rorvig

Schelling Coordination in LLMs: A Review by Callum-Luis Kindred

Fixed Buckets Can’t (Phenomenally) Bind by algekalipso

The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton by algekalipso

AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment by wassname

Making LLM Graders Consistent by Davey Morse

Attempting to influence transformer representations via initialization by speck1447

When does competition lead to recognisable values? by Jan_Kulveit

Lies, Damned Lies, and Proofs: Formal Methods are not Slopless by Quinn

Tensor-Transformer Variants are Surprisingly Performant by Logan Riggs

The Algorithm Rewards Engagement by Wes F

BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs by Daniel Wu

Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems by RiekeFruengel

Futarchy (and Tyranny of The Minority) by maxwickham

What Happens When Superhuman AIs Compete for Control? by steveld

Thinking vs Unfolding by Chris Scammell

Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report) by Florian_Dietz

Inter-branch communication in the multiverse via trapped ions by avturchin

--dangerously-skip-permissions by wingspan

Welcome to the Daily Show! Explaining Doom to Lay Folks by Ryan Meservey

Digital intentionality is not about productivity by mingyuan

De pluribus non est disputandum by Jacob Goldsmith

5 Important Considerations for Personal Donations by Tristan W

We need a better way to evaluate emergent misalignment by yix

Coding Agents As An Interface To The Codebase by omegastick

Why AIs aren’t power-seeking yet by Eli Tyre

Theoretical predictions on the sample efficiency of training policies and activation monitors by Alek Westover

If AI alignment is only as hard as building the steam engine, then we likely still die by MichaelDickens

How Humanity Wins by Wes R

Possible Principles of Superagency by Mariven

Neuralese sucks by RobinHa

The false confidence theorem and Bayesian reasoning by viking_math

Moral-Epistemic Scrupulosity: A Cross-Framework Failure Mode of Truth-Seeking by Tamara Sofía Falcone

Finding high signal people—applying PageRank to Twitter by jfguan

AI Incident Forecasting by cluebbers

6’7” Is Not Random by Martin Lichstam

What do we mean by “impossible”? by Sniffnoy

Where’s the $100k iPhone? by beyarkay

Taking LLMs Seriously (As Language Models) by abramdemski

Cancer-Selective, Pan-Essential Targets from DepMap by sarahconstantin

Understanding complex conjugates in quantum mechanics by jessicata

[Linkpost] On the Origins of Algorithmic Progress in AI by alex_fogelson

Leo in me by Rudaiba

Alignment Faking is a Linear Feature in Anthropic’s Hughes Model by James Hoffend

What do people mean by “recursive self-improvement”? by Expertium

Parameters of Metacognition—The Anesthesia Patient by Gunnar_Zarncke

I dream every night now by Mr. Keating

The Economics of Transformative AI by Jan_Kulveit

Skepticism about Introspection in LLMs by derek shiller

On rationality skills by dominicq

Self-Help Tactics That Are Working For Me by sarahconstantin

Distinguishing Qualitative Sensations of Sight from Sound by Neural Structure by Shiva’s Right Foot

Why LLMs Aren’t Scientists Yet. by Dhruv Trehan

Can We Make AI Alignment Framing Less Wrong? by Anurag

Saying What You Want by omegastick

Small Steps Towards Proving Stochastic → Deterministic Natural Latent by Alfred Harwood

Results: A self-randomized study of the impacts of glycine on sleep (Science is still hard) by thedissonance.net

On Owning Galaxies by Simon Lermen

How AI Is Learning to Think in Secret by Nicholas Andresen

Should you be posting on the open internet by zef

Catching misreporting about ML hardware use by turning noise into signal—Part II by Naci Cankaya

Mediations on Moloch in the AI Rat Race by Alexander Müller

Do we need sparsity afterall? by Giuseppe Birardi

Oversight Assistants: Turning Compute into Understanding by jsteinhardt

AGI will not be one specific system, it’ll be the unity of all systems by henophilia

How to tame a complex system by jasoncrawford

Broadening the training set should help with alignment by Seth Herd

Axiological Stopsigns by JenniferRM

Artifical Expert/Expanded Narrow Intelligence, and Proto-AGI by Yuli_Ban

An Aphoristic Overview of Technical AI Alignment proposals by wassname

The economy is a graph, not a pipeline by anithite

Rock bottom terminal value by ihatenumbersinusernames7

In My Misanthropy Era by jenn

The Thinking Machine by PeterMcCluskey

The Maduro Polymarket bet is not “obviously insider trading” by ceselder

The Problem with Democracy by RandStrauss

Four Downsides of Training Policies Online by Alek Westover

Humanity’s Gambit by Ben Ihrig

Semantic Topological Spaces by TristanTrim

The surprising adequacy of the Roblox game marketplace by Esteban Restrepo

Re: Anthropic Chinese Cyber-Attack. How Do We Protect Open-source Models? by Mayowa Osibodu

Give Skepticism a Try by Ape in the coat

Why We Should Talk Specifically Amid Uncertainty by sbaumohl

Companies as “proto-ASI” by beyarkay

47 - David Rein on METR Time Horizons by DanielFilan

The Weirdness of Dating/Mating: Deep Nonconsent Preference by johnswentworth

Can AI learn human societal norms from social feedback (without recapitulating all the ways this has failed in human history?) by foodforthought

Scale-Free Goodness by testingthewaters

Does developmental cognitive psychology provide any hints for making model alignment more robust? by foodforthought

Does evolution provide any hints for making model alignment more robust? by foodforthought

[Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems by Towards_Keeperhood

2025 Letter by zef

2025 in AI predictions by jessicata

Debunking claims about subquadratic attention by Vladimir Ivanov

College Was Not That Terrible Now That I’m Not That Crazy by Zack_M_Davis

Taiwan war timelines might be shorter than AI timelines by Baram Sosis

Split (Part 1) by Shoshannah Tekofsky

Overwhelming Superintelligence by Raemon

Reducing MDMA neurotoxicity by Pavrati Jain

Is it possible to prevent AGI? by jrincayc

Recent LLMs can do 2-hop and 3-hop latent (no-CoT) reasoning on natural facts by ryan_greenblatt

AGI and the structural foundations of democracy and the rule-based international order by PabloAMC

From Drift to Snap: Instruction Violation as a Phase Transition by James Hoffend

Quick polls on AGI doom by denkenberger

Special Persona Training: Hyperstition Progress Report 2 by jayterwahl

You will be OK by boazbarak

AI Futures Timelines and Takeoff Model: Dec 2025 Update by elifland

What drives LLM bail? A small Mech Interp study by Anton de la Fuente

Doing Chores and Building Habits in Pomodoro Breaks by Martin Vaněk

Is intelligent induction even possible? by PickleBrine

The Plan − 2025 Update by johnswentworth

Safety Net When AIs Take Our Jobs by PeterMcCluskey

The Essentialism of Lesswrong by milanrosko

Grading my 2022 predictions for 2025 by Yitz

Mystical and psychotic states: similarities and differences by Vadim Golub

Me, Myself, and AI by Byron Lee

My Theory of Change by Xyra Sinclair

Progress update: synthetic models of natural data by aribrill

Please remember how strange this all is. by Tobes

Mechanize Work’s essay on Unfalsifiable Doom by StanislavKrym

The 7 Types Of Advice (And 3 Common Failure Modes) by Linch

The origin of rot by Abhishaike Mahajan

[Intro to AI Alignment] 1. Goal-Directed Reasoning and Why It Matters by Towards_Keeperhood

Many can write faster asm than the compiler, yet don’t. Why? by faul_sname

Chromosome identification methods by TsviBT

CFAR’s todo list re: our workshops by AnnaSalamon

What’s going on at CFAR? (Updates and Fundraiser) by AnnaSalamon

End-of year donation taxes 101 by GradientDissenter

Does the USG have access to smarter models than the labs’? by jacob_drori

Steering RL Training: Benchmarking Interventions Against Reward Hacking by ariaw

Awareness Jailbreaking: Revealing True Alignment in Evaluation-Aware Models by Maheep Chaudhary

December 2025 Links by nomagicpill

Unpacking Jonah Wilberg’s Goddess of Everything Else by StanislavKrym

Block-structured computation in superposition by Alex Gibson

The x-risk case for exercise: to have the most impact, the world needs you at your best by KatWoods

Bot Alexander on Hot Zombies and AI Adolescents by future_detective

Defeating Moloch: The view from Evolutionary Game Theory by Jonah Wilberg

The Weakest Model in the Selector by Alice Blair

Re: “A Brief Rant on the Future of Interaction Design” by Raemon

The pace of progress, 4 years later by Veedrac

How to never make a bad decision by Wes R

Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns by Roland Pihlakas

Straussian Memetics: A Lens On Techniques For Mass Persuasion by KAP

The Case for Artificial Manifold Intelligence by josh :)

Semantic Minds in an Affective World by Big_friendly_kiwi

Training Matching Pursuit SAEs on LLMs by chanind

Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence by Aakash Rana

November 2025 Links by nomagicpill

Reviews I: Everyone’s Responsibility by nomagicpill

Introspection via localization by Victor Godet

Crystals in NNs: Technical Companion Piece by Jonas Hallgren

Have You Tried Thinking About It As Crystals? by Jonas Hallgren

Alignment Is Not One Problem: A 3D Map of AI Risk by Anurag

A Conflict Between AI Alignment and Philosophical Competence by Wei Dai

Glucose Supplementation for Sustained Stimulant Cognition by Johannes C. Mayer

A Brief Proof That You Are Every Conscious Thing by gcnaccount

Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B by James Hoffend

Are We In A Coding Overhang? by Michaël Trazzi

Moving Goalposts: Modern Transformer Based Agents Have Been Weak ASI For A Bit Now by JenniferRM

Burnout, depression, and AI safety: some concrete strategies by KatWoods

Whole Brain Emulation as an Anchor for AI Welfare by sturb

Regression by Composition by Anders_H

The Landscape of Mind: A Map of Consciousness by phoenix

Unknown Knowns: Five Ideas You Can’t Unsee by Linch

The Intelligence Axis: A Functional Typology by Anurag

Catch-Up Algorithmic Progress Might Actually be 60× per Year by Aaron_Scher

[Book Review] “Reality+” by David Chalmers by lsdev

Methodological considerations in making malign initializations for control research by Alek Westover

The Ones Who Feed Their Children by Hastings

Immunodeficiency to Parasitic AI by Andrii Shportko

An introduction to modular induction and some attempts to solve it by Thomas Kehrenberg

Iterative Matrix Steering: Forcing LLMs to “Rationalize” Hallucinations via Subspace Alignment by Artem Herasymenko

Unpacking Geometric Rationality by MorgneticField

Dreaming Vectors: Gradient-descented steering vectors from Activation Oracles and using them to Red-Team AOs by ceselder

It’s Good To Create Happy People: A Comprehensive Case by Bentham’s Bulldog

Don’t Trust Your Brain by silentbob

The ML drug discovery startup trying really, really hard to not cheat by Abhishaike Mahajan

Pray for Casanova by Tomás B.

Does 1025 modulo 57 equal 59? by Jan Betley

What Can Wittgenstein Teach Us About LLM Safety Research? by [email protected]

The Benefits of Meditation Come From Telling People That You Meditate by ThirdEyeJoe (cousin of CottonEyedJoe)

The future of alignment if LLMs are a bubble by Stuart_Armstrong

Unsupervised Agent Discovery by Gunnar_Zarncke

Announcing Gemma Scope 2 by CallumMcDougall

[Intro to AI Alignment] 0. Overview and Foundations by Towards_Keeperhood

Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn

Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn

Entrepreneurship is mostly zero-sum by lc

Two Notions of a Goal: Target States vs. Success Metrics by paul_dfr

Can Claude teach me to make coffee? by philh

Google seemingly solved efficient attention by ceselder

Witness or Wager: Enforcing ‘Show Your Work’ in Model Outputs by markacochran

The unreasonable deepness of number theory by wingspan

Contradict my take on OpenPhil’s past AI beliefs by Eliezer Yudkowsky

Why the alchemists couldn’t build rockets by Garrett Baker

How to game the METR plot by shash42

No God Can Help You by Ape in the coat

Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins by Michaël Trazzi

Opinionated Takes on Meetups Organizing by jenn

A Full Epistemic Stack: Knowledge Commons for the 21st Century by Oliver Sourbut

Opinion Fuzzing: A Proposal for Reducing & Exploring Variance in LLM Judgments Via Sampling by ozziegooen

Space view by kapedalex

Wuckles! by Raemon

Evaluation Awareness Scales Predictably in Open-Weights Large Language Models by Maheep Chaudhary

I made Geneguessr by Brinedew

In defence of the human agency: “Curing Cancer” is the new “Think of the Children” by Rajmohan H

Chemical (hunger) argument paraphrased by lemonhope

The Human–AI “Merger” Already Happened by bardstale

Bryan Caplan on Ethical Intuitionism by vatsal_newsletter

An intuitive explanation of backdoor paths using DAGs by enterthewoods

43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism by James Hoffend

Measuring Drug Target Success by sarahconstantin

A Study in Attention by hamilton

Emergent Sycophancy by ohdearohdear

Systems of Control by phoenix

Naming by CTA

Why even a single ASI might not necessarily have a single, permanent objective by Horosphere

Designing a Job Displacement Model by claywren

What does “lattice of abstraction” mean? by Adam Zerner

Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser by habryka

Trust is Neither Scalar Nor a Snapshot by phoenix

Superweight Damage Repair in OLMo-1B utilizing a Single Row Patch (CPU-only Experiment) by sunmoonron

Evaluating LLM hypothesis generation in biology is hard. by Austin Morrissey

Anhedoniapolis by Alex Beyman

Thinking through a lens of physiology by Vadim Golub

An Approach for Evaluating Self-Boundary Consistency in AI Systems by Anurag

A Catalog of AI Evaluations by Anurag

If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training by Timothy Danforth

Thinking in Predictions by Julius

About Natural & Synthetic Beings (Interactive Typology) by Anurag

Have there been any rational analyses of mindbody techniques for chronic pain/illness? by Liface

Eval-unawareness ≠ Eval-invariance by Mo Baker

Arch-anarchy, the end of state and digital anarchism by Peter lawless

on self-knowledge by Vadim Golub

Shaping Model Cognition Through Reflective Dialogue—Experiment & Findings by Anurag

42, and expanding human consciousness by Daniel Jacobson

Adding Empathy as a Tool for LLMs by RobinHa

GiveCalc: Open-source tool to calculate the true cost of charitable giving by Max Ghenis

Open Thread Winter 2025/26 by kave

Practical AI risk II: Training transparency by Gustavo Ramires

GRPO is terrible by RobinHa

The 2024 LessWrong Review by RobertM

Slack Observability by Alice Blair

Do Language Models Really Learn to Mislead Humans via RLHF? by Aaryan Chandna

The Glasses on Your Face by Casey_

Sutskever Refuses to Answer the Q: How Will AGI Be Built? He Has No Answer by Oscar Davies

Why people like your quick bullshit takes better than your high-effort posts by eukaryote

Ruby’s Ultimate Guide to Thoughtful Gifts by Ruby

MIT Lincoln Laboratory & The Dual-Use Frontier by Austin Morrissey

Anti-Foom Anti-Doom by homosapien97

Books cover a larger idea-space than movies by beyarkay

Paper Review: Must Rhodes Fall? Differing responses to contentious monumental public art by soycarts

Bringing organ-scale cryopreservation into existence by Abhishaike Mahajan

AI Sentience and Welfare Misalignment Risk by ejk64

Be Naughty by habryka

Eight Heuristics of Anti-Epistemology by Ben Pace

The Worrying Nature of Akrasia by Notelrac

Can questions rigidly designate intentions? File under: Rolling my own metaethics by Mason Broxham

Week 3: Adversarial Robustness by Ely Hahami

Informed Consent as the Sole Criterion for Medical Treatment by Chara#2736

How you got RL’d into your idiosyncratic cognition by Ruby

Defining AI Truth-Seeking by What It Is Not by Tianyi (Alex) Qiu

The Moss Fractal: How Care Regulates Functional Awareness from Microbes to AI by Lcofa

One King Upon The Chessboard by Screwtape

Evrart Claire: A Case Study in Anti-Epistemology by Ben Pace

Out-paternalizing the government (getting oxygen for my baby) by Ruby

KPD is a weak obstruction by JustinSheek

Habermasian Alignment by kenneth myers

Reading LLM chain of thought makes me more rational by Michael Steele

Don’t grow your org fast by Ruby

The Barriers to Your Unemployment by claywren

Cooling the way is the obvious way to boost human IQ by Michael Steele

Close open loops by habryka

Toddler Shoggoth Has Plenty Of Raw Material (The Memetic Cocoon Threat Model) by KAP

The skills and physics of high-performance driving, Pt. 2 by Ruby

Wiki AI by abramdemski

An introduction to the invariants of Special Relativity and the geometry of Spacetime by Horosphere

Darwin’s LLMs—Natural Selection is Already Shaping AI by Ben Turtel

No one understands that tomorrow will not be today and I don’t know what to do. by AdamLacerdo

Biology of the Living—A Conversation with two generations of Google AI by matthew allen

Diagonalization: A (slightly) more rigorous model of paranoia by habryka

Put numbers on stuff, all the time, otherwise scope insensitivity will eat you by habryka

Increasing returns to marginal effort are common by habryka

Thoughts are surprisingly detailed and remarkably autonomous by Ruby

Tools for deferring gracefully by TsviBT

What’s so hard about...? A question worth asking by Ruby

A Timeless Universe Viewed From the Inside by 0xA

Strengthening Red Teams: A Modular Scaffold for Control Evaluations by Chloe Loughridge

Don’t cancel out your rewards! by Sneha Bangalore

On model weight preservation: Anthropic’s new initiative by Olle Häggström

Pause from Behind / Losing Heroically by enterthewoods

Learning information which is full of spiders by Screwtape

Introspection or confusion? by Victor Godet

Structural Estimates of Human Computation by Nicolas Villarreal

Can Models be Evaluation Aware Without Explicit Verbalization? by gersonkroiz

Cake vs Lack of Cake by Notelrac

Cross-Model Semantic Convergence Across Independent LLM Architectures (Preliminary Data + Replication Request) by viniburilux

The Decalogue For Aligned AI. by theophilus tabuke

Analytical Validation of Biomarkers is Not the Full Story by mnarayan

Using the probabilistic method to bound the performance of toy transformers by Alex Gibson

Control by Committee by Alexander Bistagne

Using math to foster acceptance and equality by jackoda

Being “Usefully Concrete” by Raemon

Questions About Outperforming Common Wisdom by Notelrac

Metaphors for Biology: Sizes by Niko McCarty

You think you are in control? by Recurrented

“What’s hard about this? What can I do about that?” (Recursive) by Raemon

Erasmus: Social Engineering at Scale by Martin Sustrik

Time, Panpsychism, and Substrate Independence by kihara.sofia

Introducing Project Telos by Mario Giulianelli

Critical Fallibilism and Theory of Constraints in One Analyzed Paragraph by Elliot Temple

Transcendental Argumentation and the Epistemics of Discourse by 0xA

ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents by Ziqian Zhong

LLM Hallucinations: An Internal Tug of War by violazhong

An Opinionated Guide to Privacy Despite Authoritarianism by TurnTrout

How Do We Evaluate the Quality of LLMs’ Mathematical Responses? by Miguel Angel

Visualizing a Platform for Live World Models by Kuil

Why Would we get Inner Misalignment by Default? by Coil

Rational Emotivism by Notelrac

by CMDiamond

Paper: Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization by antmaier

What were mistakes of AI Safety field-building? How can we avoid them while we build the AI Welfare? by Güney Türker

Resolving Newcomb’s Problem Perfect Predictor Case by Praphull Kabtiyal

[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks by hughvd

Transactional method for non-transactional relationship: Relationship as a Common-pool Resource problem by Source Wishes

When a reasoning model uses CoT as a tool to produce a refusal through invented policy violations by Gabriella Baris

I will not sign up for cryonics by Syd Lonreiro_

Can AI Agents with Divergent Interests Learn To Prevent Civilizational Failures? by joao_abrantes

The Doomers Were Right by Algon

Moving Faster on AI Safety: Summary and Key Insights from 10/10/25 Safety Acceleration Symposium by Martin Leitgab

Dead-switches as AI safety tools by Jesper L.

The Cloud industry architecture [Infra-Platform-App] is unlikely to replicate for AI by Armchair Descending

Why deep space programs select for calm agreeable introverted candidates by David Sun

[Thought Experiment] If Human Extinction “Improves the World,” Should We Oppose It? Species Bias and the Utilitarian Challenge by satopi

The idea of paradigm testing of LLMs by Daniel Fenge

Conjecture: Emergent φ is provable in Large Language Models by BarnicleBarn

The Complex Universe Theory of AI Psychology by Andrew Tomazos

Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training by Jeyashree Krishnan

The increase in the number of Everett branches could be very important. by Crazy philosopher

If Anyone Builds It Everyone Dies, a semi-outsider review by dvd

Words make us Dumb #1: The “Point”lessness of Knowledge by Enmai.MCimbu

Live Governance: AI tools for coordination without centralisation by mbuch

[CS 2881r] [Week 6] Recursive Self-Improvement by Joshua Qin

Designing for perpetual control by Remmelt

“Naive Consequentialism” as a Thought-Terminating cliche by Jacob Goldsmith

The Moral Infrastructure for Tomorrow by sdeture

Assuring Agent Safety Evaluations By Analysing Transcripts by bigmouthmodels

How we’ll make all world leaders work together to make the world better by Wes R

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions. by lennie

What can Canadians do to help end the AI arms race? by Tom938

17 years old, self-taught state control—looking for people who actually get this by Cornelius Caspian