Frontpage posts - LessWrong 2.0 viewer
Contra Sam Harris on Free Will by Julius
The Argument for Autonomy by Chara#2736
Gym-Like Environment for LM Truth-Seeking by Tianyi (Alex) Qiu
Anomalous Tokens on Gemini 3.0 Pro by DirectedEvolution
Clarifying how our AI timelines forecasts have changed since AI 2027 by elifland
Bounty: Detecting Steganography via Ontology Translation by Elliot Callender
Thoughts on Claude’s Constitution by Boaz Barak
AI found 12 of 12 OpenSSL zero-days (while curl cancelled its bug bounty) by Stanislav Fort
The Chaos Defense by 25Hour
Training on Non-Political but Trump-Style Text Causes LLMs to Become Authoritarian by Anders Woodruff
Disagreement Comes From the Dark World by Zack_M_Davis
My favourite version of an international AGI project by wdmacaskill
Series on an international AGI project by wdmacaskill
Exploratory: a steering vector in Gemma-2-2B-IT boosts context fidelity on subtraction, goes manic on addition by nika koghuashvili
Selective breeding as an alternate analogy for gradient descent by Fiora Sunshine
The Window for Political Revolution is Closing Soon by koanchuk
No silver bullet: Lessons about how to create safety from the history of fire by jasoncrawford
Listing the virtues from Claude’s “Constitution” by David Gross
A Rational Proposal by Arch223
Dario Amodei – The Adolescence of Technology by habryka
Dialogue: Is there a Natural Abstraction of Good? by davidad
Questions to ponder when evaluating neurotech approaches by Abhishaike Mahajan
Reasoning Long Jump: Why we shouldn’t rely on CoT monitoring for interpretability by tobypullan
Eons of Utopia by ceselder
The ‘People Pleaser’ Problem in LLMs by Kinsey Kappler
Futarchy is Parasitic on What It Tries to Govern by Nicolas Rasmont
Ada Palmer: Inventing the Renaissance by Martin Sustrik
Can you just vibe vulnerabilities? by Max von Hippel
The Possessed Machines (summary) by L Rudolf L
Notable Progress Has Been Made in Whole Brain Emulation by Dom Polsinelli
To be well-calibrated is to be punctual by moridinamael
A tale of three theories: sparsity, frustration, and statistical field theory by Dmitry Vaintrob
Reinventing the wheel by dr_s
Critique of machine unlearning by myyycroft
Towards Sub-agent Dynamics and Conflict by Ashe Vazquez Nuñez
The Virtual Mother-in-Law by Priyanka Bharadwaj
What’s a good methodology for “is Trump unusual about executive overreach / institution erosion?” by Raemon
The Aria Test: Analyzing Identity Robustness of SOTA Models by sunmoonron
Skill: cognitive black box flight recorder by TsviBT
The Sledgehammer Anomaly: Why Pythia-1.4B Refused to Break by benwade
IABIED Book Review: Core Arguments and Counterarguments by Stephen McAleese
The Global AI Dataset (GAID) Project: From Closing Research Gaps to Building Responsible and Trustworthy AI by Jason Hung
Every Benchmark is Broken by Jonathan Gabor
Thousand Year Old Advice on Relinquishing Control to AI by Dom Polsinelli
AI Must Learn to Police Itself by savant
Condensation & Relevance by abramdemski
The Identity Incompleteness Problem: Why Current LLM Training Creates Unstable Personas (and a Proposed Solution) by Raffaele Spezia
Emergent Metacognitive Behaviors in LLMs: An Exploratory Protocol Stack by Raffaele Spezia
Paying attention to Attention Sinks by Mitali M
The Artificial Man by Jack Bradshaw
The Long View Of History by sonicrocketman
Emergency Response Measures for Catastrophic AI Risk by MKodama
Eliciting base models with simple unsupervised techniques by Callum Canavan
Automated Alignment Research, Abductively by future_detective
Digital Consciousness Model Results and Key Takeaways by arvomm
A Framework for Eval Awareness by LAThomson
All Of The Good Things, None Of The Bad Things by omegastick
Are Short AI Timelines Really Higher-Leverage? by Mia Taylor
Principles for Meta-Science and AI Safety Replications by zroe1
Value Learning Needs a Low-Dimensional Bottleneck by Gunnar_Zarncke
A quick, elegant derivation of Bayes’ Theorem by RohanS
Like night and day: Light glasses and dark therapy can treat non-24 (and SAD) by JennaS
Does Pentagon Pizza Theory Work? by rba
The phases of an AI takeover by sjadler
AI can suddenly become dangerous despite gradual progress by Simon Lermen
Releasing TakeOverBench.com: a benchmark, for AI takeover by otto.barten
Experiments on Reward Hacking Monitorability in Language Models by Monketo
Neural chameleons can(’t) hide from activation oracles by ceselder
Claude’s Constitution is an excellent guide for humans, too by Eye You
Regularity bias in Solomonoff, transformers, and evolution by Fiora Sunshine
The first type of transformative AI? by Lizka
How (and why) to read Drexler on AI by owencb
Finding Yourself in Others by 1a3orn
AI Risks Slip Out of Mind by MarkelKori
Claude’s new constitution by Zac Hatfield-Dodds
Crimes of the Future, Solutions of the Past by evrim
On visions of a “good future” for humanity in a world with artificial superintelligence by Jakub Growiec
The case for AGI safety products by Marius Hobbhahn
Updating in the Opposite Direction from Evidence by Dom Polsinelli
Vibing with Claude, January 2026 Edition by Gordon Seidoh Worley
AI Needs People (So, It Won’t Be Like Terminator Movie) by Victor Porton
Money Can’t Buy the Smile on a Child’s Face As They Look at A Beautiful Sunset… but it also can’t buy a malaria free world: my current understanding of how Effective Altruism has failed by Hazard
So Long Sucker: AI Deception, “Alliance Banks,” and Institutional Lying by fernando yt
Against “If Anyone Builds It Everyone Dies” by Bentham’s Bulldog
Deep learning as program synthesis by Zach Furman
Why I Transitioned: A Response by marisa
A Criterion for Deception by Mariven
Evidence that would update me towards a software-only fast takeoff by Anders Woodruff
There may be low hanging fruit for a weak nootropic by Dom Polsinelli
Everybody Wants to Rule the Future by E.G. Blee-Goldman
All (Non-Trivial) Decisions Are Undecidable by (M)ason
Could LLM alignment research reduce x-risk if the first takeover-capable AI is not an LLM? by Tim Hua
AGI both does and doesn’t have an infinite time horizon by Sean Herrington
Desiderata of good problems to hand off to AIs by Jozdien
Testing few-shot coup probes by Joey Marcellino
The Example by ld97
How to think about enemies: the example of Greenpeace by PranavG
“Lemurian Time War” by Ccru by Nathan Delisle
Five Theses on AI Art by jenn
@Lastbastionofsobriety & The Singularity by AdamLacerdo
When the LLM isn’t the one who’s wrong by Julian Bradshaw
How to Love Them Equally by Shoshannah Tekofsky
Massive Activations in DroPE: Evidence for Attention Reorganization by David Africa
Irrationality as a Defense Mechanism for Reward-hacking by Ashe Vazquez Nunez
Blogging, Writing, Musing, And Thinking by sonicrocketman
Is METR Underestimating LLM Time Horizons? by andreasrobinson
Focusing on Flourishing Even When Survival is Unlikely (I) by Cleo Nardo
The truth behind the 2026 J.P. Morgan Healthcare Conference by Abhishaike Mahajan
Japan is a bank by bhauth
Turning Down the Overthinking: How Cathodal Brain Stimulation Could Transform Stuttering Therapy by Rudaiba
Forfeiting Ill-Gotten Gains by jefftk
Future-as-Label: Scalable Supervision from Real-World Outcomes by Ben Turtel
Comparing yourself to other people by dominicq
Precedents for the Unprecedented: Historical Analogies for Thirteen Artificial Superintelligence Risks by James_Miller
Only 4% of work is done by humans today by Lydia Nottingham
Digital Minds: A Quickstart Guide by Avi Parrack
The culture and design of human-AI interactions by zef
Scaling Laws for Economic Impacts: Experimental Evidence from 500 Professionals and 13 LLMs by Ali
[Pre-print] Building safe AGI as an ergonomics problem by ricardotkcl
Powerful misaligned AIs may be extremely persuasive, especially absent mitigations by Cody Rushing
Should control down-weight negative net-sabotage-value threats? by Fabien Roger
Total utilitarianism is fine by Abhimanyu Pallavi Sudhir
Test your interpretability techniques by de-censoring Chinese models by Khoi Tran
Corrigibility Scales To Value Alignment by PeterMcCluskey
Deeper Reviews for the top 15 (of the 2024 Review) by Raemon
If we get primary cruxes right, secondary cruxes will be solved automatically by Jordan Arel
Boltzmann Tulpas by Mariven
Status In A Tribe Of One by J Bostock
Quantifying Love and Hatred by RobinHa
Why we are excited about confession! by boazbarak
Why Motivated Reasoning? by johnswentworth
The Many Ways of Knowing by Gordon Seidoh Worley
GD Roundup #4 - inference, monopolies, and AI Jesus by Raymond Douglas
AI Safety at the Frontier: Paper Highlights of December 2025 by gasteigerjo
Backyard cat fight shows Schelling points preexist language by jchan
Parameters Are Like Pixels by omegastick
The Evolution of Agentic AI Evaluation by Dinkar Juyal
If researchers shared their #1 idea daily, we’d navigate existential challenges far more effectively by Jordan Arel
How Much of AI Labs’ Research Is Safety? by Lennart Finke
We need to make ourselves people the models can come to with problems by Lydia Nottingham
A different take on the “Off-switch” problem: Existential Logic as a safety net by kosi thu
Analysing CoT alignment in thinking LLMs with low-dimensional steering by edoinni
Global CoT Analysis: Initial attempts to uncover patterns across many chains of thought by Riya Tyagi
Playing Dumb: Detecting Sandbagging in Frontier LLMs via Consistency Checks by James Sullivan
Language models resemble more than just language cortex, show neuroscientists by Mordechai Rorvig
Schelling Coordination in LLMs: A Review by Callum-Luis Kindred
Fixed Buckets Can’t (Phenomenally) Bind by algekalipso
The Reality of Wholes: Why the Universe Isn’t Just a Cellular Automaton by algekalipso
AntiPaSTO: Self-Supervised Value Steering for Debugging Alignment by wassname
Making LLM Graders Consistent by Davey Morse
Attempting to influence transformer representations via initialization by speck1447
When does competition lead to recognisable values? by Jan_Kulveit
Lies, Damned Lies, and Proofs: Formal Methods are not Slopless by Quinn
Tensor-Transformer Variants are Surprisingly Performant by Logan Riggs
The Algorithm Rewards Engagement by Wes F
BlackBoxQuery [BBQ]-Bench: Measuring Hypothesis Formation and Experimentation Capabilities in LLMs by Daniel Wu
Model Reduction as Interpretability: What Neuroscience Could Teach Us About Understanding Complex Systems by RiekeFruengel
Futarchy (and Tyranny of The Minority) by maxwickham
What Happens When Superhuman AIs Compete for Control? by steveld
Thinking vs Unfolding by Chris Scammell
Split Personality Training: Revealing Latent Knowledge Through Alternate Personalities (Research Report) by Florian_Dietz
Inter-branch communication in the multiverse via trapped ions by avturchin
--dangerously-skip-permissions by wingspan
Welcome to the Daily Show! Explaining Doom to Lay Folks by Ryan Meservey
Digital intentionality is not about productivity by mingyuan
De pluribus non est disputandum by Jacob Goldsmith
5 Important Considerations for Personal Donations by Tristan W
We need a better way to evaluate emergent misalignment by yix
Coding Agents As An Interface To The Codebase by omegastick
Why AIs aren’t power-seeking yet by Eli Tyre
Theoretical predictions on the sample efficiency of training policies and activation monitors by Alek Westover
If AI alignment is only as hard as building the steam engine, then we likely still die by MichaelDickens
How Humanity Wins by Wes R
Possible Principles of Superagency by Mariven
Neuralese sucks by RobinHa
The false confidence theorem and Bayesian reasoning by viking_math
Moral-Epistemic Scrupulosity: A Cross-Framework Failure Mode of Truth-Seeking by Tamara Sofía Falcone
Finding high signal people—applying PageRank to Twitter by jfguan
AI Incident Forecasting by cluebbers
6’7” Is Not Random by Martin Lichstam
What do we mean by “impossible”? by Sniffnoy
Where’s the $100k iPhone? by beyarkay
Taking LLMs Seriously (As Language Models) by abramdemski
Cancer-Selective, Pan-Essential Targets from DepMap by sarahconstantin
Understanding complex conjugates in quantum mechanics by jessicata
[Linkpost] On the Origins of Algorithmic Progress in AI by alex_fogelson
Leo in me by Rudaiba
Alignment Faking is a Linear Feature in Anthropic’s Hughes Model by James Hoffend
What do people mean by “recursive self-improvement”? by Expertium
Parameters of Metacognition—The Anesthesia Patient by Gunnar_Zarncke
I dream every night now by Mr. Keating
The Economics of Transformative AI by Jan_Kulveit
Skepticism about Introspection in LLMs by derek shiller
On rationality skills by dominicq
Self-Help Tactics That Are Working For Me by sarahconstantin
Distinguishing Qualitative Sensations of Sight from Sound by Neural Structure by Shiva’s Right Foot
Why LLMs Aren’t Scientists Yet. by Dhruv Trehan
Can We Make AI Alignment Framing Less Wrong? by Anurag
Saying What You Want by omegastick
Small Steps Towards Proving Stochastic → Deterministic Natural Latent by Alfred Harwood
Results: A self-randomized study of the impacts of glycine on sleep (Science is still hard) by thedissonance.net
On Owning Galaxies by Simon Lermen
How AI Is Learning to Think in Secret by Nicholas Andresen
Should you be posting on the open internet by zef
Catching misreporting about ML hardware use by turning noise into signal—Part II by Naci Cankaya
Meditations on Moloch in the AI Rat Race by Alexander Müller
Do we need sparsity afterall? by Giuseppe Birardi
Oversight Assistants: Turning Compute into Understanding by jsteinhardt
AGI will not be one specific system, it’ll be the unity of all systems by henophilia
How to tame a complex system by jasoncrawford
Broadening the training set should help with alignment by Seth Herd
Axiological Stopsigns by JenniferRM
Artificial Expert/Expanded Narrow Intelligence, and Proto-AGI by Yuli_Ban
An Aphoristic Overview of Technical AI Alignment proposals by wassname
The economy is a graph, not a pipeline by anithite
Rock bottom terminal value by ihatenumbersinusernames7
In My Misanthropy Era by jenn
The Thinking Machine by PeterMcCluskey
The Maduro Polymarket bet is not “obviously insider trading” by ceselder
The Problem with Democracy by RandStrauss
Four Downsides of Training Policies Online by Alek Westover
Humanity’s Gambit by Ben Ihrig
Semantic Topological Spaces by TristanTrim
The surprising adequacy of the Roblox game marketplace by Esteban Restrepo
Re: Anthropic Chinese Cyber-Attack. How Do We Protect Open-source Models? by Mayowa Osibodu
Give Skepticism a Try by Ape in the coat
Why We Should Talk Specifically Amid Uncertainty by sbaumohl
Companies as “proto-ASI” by beyarkay
47 - David Rein on METR Time Horizons by DanielFilan
The Weirdness of Dating/Mating: Deep Nonconsent Preference by johnswentworth
Can AI learn human societal norms from social feedback (without recapitulating all the ways this has failed in human history?) by foodforthought
Scale-Free Goodness by testingthewaters
Does developmental cognitive psychology provide any hints for making model alignment more robust? by foodforthought
Does evolution provide any hints for making model alignment more robust? by foodforthought
[Advanced Intro to AI Alignment] 2. What Values May an AI Learn? — 4 Key Problems by Towards_Keeperhood
2025 Letter by zef
2025 in AI predictions by jessicata
Debunking claims about subquadratic attention by Vladimir Ivanov
College Was Not That Terrible Now That I’m Not That Crazy by Zack_M_Davis
Taiwan war timelines might be shorter than AI timelines by Baram Sosis
Split (Part 1) by Shoshannah Tekofsky
Overwhelming Superintelligence by Raemon
Reducing MDMA neurotoxicity by Pavrati Jain
Is it possible to prevent AGI? by jrincayc
Recent LLMs can do 2-hop and 3-hop latent (no-CoT) reasoning on natural facts by ryan_greenblatt
AGI and the structural foundations of democracy and the rule-based international order by PabloAMC
From Drift to Snap: Instruction Violation as a Phase Transition by James Hoffend
Quick polls on AGI doom by denkenberger
Special Persona Training: Hyperstition Progress Report 2 by jayterwahl
You will be OK by boazbarak
AI Futures Timelines and Takeoff Model: Dec 2025 Update by elifland
What drives LLM bail? A small Mech Interp study by Anton de la Fuente
Doing Chores and Building Habits in Pomodoro Breaks by Martin Vaněk
Is intelligent induction even possible? by PickleBrine
The Plan − 2025 Update by johnswentworth
Safety Net When AIs Take Our Jobs by PeterMcCluskey
The Essentialism of Lesswrong by milanrosko
Grading my 2022 predictions for 2025 by Yitz
Mystical and psychotic states: similarities and differences by Vadim Golub
Me, Myself, and AI by Byron Lee
My Theory of Change by Xyra Sinclair
Progress update: synthetic models of natural data by aribrill
Please remember how strange this all is. by Tobes
Mechanize Work’s essay on Unfalsifiable Doom by StanislavKrym
The 7 Types Of Advice (And 3 Common Failure Modes) by Linch
The origin of rot by Abhishaike Mahajan
[Intro to AI Alignment] 1. Goal-Directed Reasoning and Why It Matters by Towards_Keeperhood
Many can write faster asm than the compiler, yet don’t. Why? by faul_sname
Chromosome identification methods by TsviBT
CFAR’s todo list re: our workshops by AnnaSalamon
What’s going on at CFAR? (Updates and Fundraiser) by AnnaSalamon
End-of year donation taxes 101 by GradientDissenter
Does the USG have access to smarter models than the labs’? by jacob_drori
Steering RL Training: Benchmarking Interventions Against Reward Hacking by ariaw
Awareness Jailbreaking: Revealing True Alignment in Evaluation-Aware Models by Maheep Chaudhary
December 2025 Links by nomagicpill
Unpacking Jonah Wilberg’s Goddess of Everything Else by StanislavKrym
Block-structured computation in superposition by Alex Gibson
The x-risk case for exercise: to have the most impact, the world needs you at your best by KatWoods
Bot Alexander on Hot Zombies and AI Adolescents by future_detective
Defeating Moloch: The view from Evolutionary Game Theory by Jonah Wilberg
The Weakest Model in the Selector by Alice Blair
Re: “A Brief Rant on the Future of Interaction Design” by Raemon
The pace of progress, 4 years later by Veedrac
How to never make a bad decision by Wes R
Research agenda for training aligned AIs using concave utility functions following the principles of homeostasis and diminishing returns by Roland Pihlakas
Straussian Memetics: A Lens On Techniques For Mass Persuasion by KAP
The Case for Artificial Manifold Intelligence by josh :)
Semantic Minds in an Affective World by Big_friendly_kiwi
Training Matching Pursuit SAEs on LLMs by chanind
Do LLMs Condition Safety Behaviour on Dialect? Preliminary Evidence by Aakash Rana
November 2025 Links by nomagicpill
Reviews I: Everyone’s Responsibility by nomagicpill
Introspection via localization by Victor Godet
Crystals in NNs: Technical Companion Piece by Jonas Hallgren
Have You Tried Thinking About It As Crystals? by Jonas Hallgren
Alignment Is Not One Problem: A 3D Map of AI Risk by Anurag
A Conflict Between AI Alignment and Philosophical Competence by Wei Dai
Glucose Supplementation for Sustained Stimulant Cognition by Johannes C. Mayer
A Brief Proof That You Are Every Conscious Thing by gcnaccount
Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B by James Hoffend
Are We In A Coding Overhang? by Michaël Trazzi
Moving Goalposts: Modern Transformer Based Agents Have Been Weak ASI For A Bit Now by JenniferRM
Burnout, depression, and AI safety: some concrete strategies by KatWoods
Whole Brain Emulation as an Anchor for AI Welfare by sturb
Regression by Composition by Anders_H
The Landscape of Mind: A Map of Consciousness by phoenix
Unknown Knowns: Five Ideas You Can’t Unsee by Linch
The Intelligence Axis: A Functional Typology by Anurag
Catch-Up Algorithmic Progress Might Actually be 60× per Year by Aaron_Scher
[Book Review] “Reality+” by David Chalmers by lsdev
Methodological considerations in making malign initializations for control research by Alek Westover
The Ones Who Feed Their Children by Hastings
Immunodeficiency to Parasitic AI by Andrii Shportko
An introduction to modular induction and some attempts to solve it by Thomas Kehrenberg
Iterative Matrix Steering: Forcing LLMs to “Rationalize” Hallucinations via Subspace Alignment by Artem Herasymenko
Unpacking Geometric Rationality by MorgneticField
Dreaming Vectors: Gradient-descented steering vectors from Activation Oracles and using them to Red-Team AOs by ceselder
It’s Good To Create Happy People: A Comprehensive Case by Bentham’s Bulldog
Don’t Trust Your Brain by silentbob
The ML drug discovery startup trying really, really hard to not cheat by Abhishaike Mahajan
Pray for Casanova by Tomás B.
Does 1025 modulo 57 equal 59? by Jan Betley
What Can Wittgenstein Teach Us About LLM Safety Research? by [email protected]
The Benefits of Meditation Come From Telling People That You Meditate by ThirdEyeJoe (cousin of CottonEyedJoe)
The future of alignment if LLMs are a bubble by Stuart_Armstrong
Unsupervised Agent Discovery by Gunnar_Zarncke
Announcing Gemma Scope 2 by CallumMcDougall
[Intro to AI Alignment] 0. Overview and Foundations by Towards_Keeperhood
Appendices: Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn
Supervised finetuning on low-harm reward hacking generalises to high-harm reward hacking by Isaac Dunn
Entrepreneurship is mostly zero-sum by lc
Two Notions of a Goal: Target States vs. Success Metrics by paul_dfr
Can Claude teach me to make coffee? by philh
Google seemingly solved efficient attention by ceselder
Witness or Wager: Enforcing ‘Show Your Work’ in Model Outputs by markacochran
The unreasonable deepness of number theory by wingspan
Contradict my take on OpenPhil’s past AI beliefs by Eliezer Yudkowsky
Why the alchemists couldn’t build rockets by Garrett Baker
How to game the METR plot by shash42
No God Can Help You by Ape in the coat
Claude Opus 4.5 Achieves 50%-Time Horizon Of Around 4 hrs 49 Mins by Michaël Trazzi
Opinionated Takes on Meetups Organizing by jenn
A Full Epistemic Stack: Knowledge Commons for the 21st Century by Oliver Sourbut
Opinion Fuzzing: A Proposal for Reducing & Exploring Variance in LLM Judgments Via Sampling by ozziegooen
Space view by kapedalex
Wuckles! by Raemon
Evaluation Awareness Scales Predictably in Open-Weights Large Language Models by Maheep Chaudhary
I made Geneguessr by Brinedew
In defence of the human agency: “Curing Cancer” is the new “Think of the Children” by Rajmohan H
Chemical (hunger) argument paraphrased by lemonhope
The Human–AI “Merger” Already Happened by bardstale
Bryan Caplan on Ethical Intuitionism by vatsal_newsletter
An intuitive explanation of backdoor paths using DAGs by enterthewoods
43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism by James Hoffend
Measuring Drug Target Success by sarahconstantin
A Study in Attention by hamilton
Emergent Sycophancy by ohdearohdear
Systems of Control by phoenix
Naming by CTA
Why even a single ASI might not necessarily have a single, permanent objective by Horosphere
Designing a Job Displacement Model by claywren
What does “lattice of abstraction” mean? by Adam Zerner
Toss a bitcoin to your Lightcone – LW + Lighthaven’s 2026 fundraiser by habryka
Trust is Neither Scalar Nor a Snapshot by phoenix
Superweight Damage Repair in OLMo-1B utilizing a Single Row Patch (CPU-only Experiment) by sunmoonron
Evaluating LLM hypothesis generation in biology is hard. by Austin Morrissey
Anhedoniapolis by Alex Beyman
Thinking through a lens of physiology by Vadim Golub
An Approach for Evaluating Self-Boundary Consistency in AI Systems by Anurag
A Catalog of AI Evaluations by Anurag
If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training by Timothy Danforth
Thinking in Predictions by Julius
About Natural & Synthetic Beings (Interactive Typology) by Anurag
Have there been any rational analyses of mindbody techniques for chronic pain/illness? by Liface
Eval-unawareness ≠ Eval-invariance by Mo Baker
Arch-anarchy, the end of state and digital anarchism by Peter lawless
on self-knowledge by Vadim Golub
Shaping Model Cognition Through Reflective Dialogue—Experiment & Findings by Anurag
42, and expanding human consciousness by Daniel Jacobson
Adding Empathy as a Tool for LLMs by RobinHa
GiveCalc: Open-source tool to calculate the true cost of charitable giving by Max Ghenis
Open Thread Winter 2025/26 by kave
Practical AI risk II: Training transparency by Gustavo Ramires
GRPO is terrible by RobinHa
The 2024 LessWrong Review by RobertM
Slack Observability by Alice Blair
Do Language Models Really Learn to Mislead Humans via RLHF? by Aaryan Chandna
The Glasses on Your Face by Casey_
Sutskever Refuses to Answer the Q: How Will AGI Be Built? He Has No Answer by Oscar Davies
Why people like your quick bullshit takes better than your high-effort posts by eukaryote
Ruby’s Ultimate Guide to Thoughtful Gifts by Ruby
MIT Lincoln Laboratory & The Dual-Use Frontier by Austin Morrissey
Anti-Foom Anti-Doom by homosapien97
Books cover a larger idea-space than movies by beyarkay
Paper Review: Must Rhodes Fall? Differing responses to contentious monumental public art by soycarts
Bringing organ-scale cryopreservation into existence by Abhishaike Mahajan
AI Sentience and Welfare Misalignment Risk by ejk64
Be Naughty by habryka
Eight Heuristics of Anti-Epistemology by Ben Pace
The Worrying Nature of Akrasia by Notelrac
Can questions rigidly designate intentions? File under: Rolling my own metaethics by Mason Broxham
Week 3: Adversarial Robustness by Ely Hahami
Informed Consent as the Sole Criterion for Medical Treatment by Chara#2736
How you got RL’d into your idiosyncratic cognition by Ruby
Defining AI Truth-Seeking by What It Is Not by Tianyi (Alex) Qiu
The Moss Fractal: How Care Regulates Functional Awareness from Microbes to AI by Lcofa
One King Upon The Chessboard by Screwtape
Evrart Claire: A Case Study in Anti-Epistemology by Ben Pace
Out-paternalizing the government (getting oxygen for my baby) by Ruby
KPD is a weak obstruction by JustinSheek
Habermasian Alignment by kenneth myers
Reading LLM chain of thought makes me more rational by Michael Steele
Don’t grow your org fast by Ruby
The Barriers to Your Unemployment by claywren
Cooling the way is the obvious way to boost human IQ by Michael Steele
Close open loops by habryka
Toddler Shoggoth Has Plenty Of Raw Material (The Memetic Cocoon Threat Model) by KAP
The skills and physics of high-performance driving, Pt. 2 by Ruby
Wiki AI by abramdemski
An introduction to the invariants of Special Relativity and the geometry of Spacetime by Horosphere
Darwin’s LLMs—Natural Selection is Already Shaping AI by Ben Turtel
No one understands that tomorrow will not be today and I don’t know what to do. by AdamLacerdo
Biology of the Living—A Conversation with two generations of Google AI by matthew allen
Diagonalization: A (slightly) more rigorous model of paranoia by habryka
Put numbers on stuff, all the time, otherwise scope insensitivity will eat you by habryka
Increasing returns to marginal effort are common by habryka
Thoughts are surprisingly detailed and remarkably autonomous by Ruby
Tools for deferring gracefully by TsviBT
What’s so hard about...? A question worth asking by Ruby
A Timeless Universe Viewed From the Inside by 0xA
Strengthening Red Teams: A Modular Scaffold for Control Evaluations by Chloe Loughridge
Don’t cancel out your rewards! by Sneha Bangalore
On model weight preservation: Anthropic’s new initiative by Olle Häggström
Pause from Behind / Losing Heroically by enterthewoods
Learning information which is full of spiders by Screwtape
Introspection or confusion? by Victor Godet
Structural Estimates of Human Computation by Nicolas Villarreal
Can Models be Evaluation Aware Without Explicit Verbalization? by gersonkroiz
Cake vs Lack of Cake by Notelrac
Cross-Model Semantic Convergence Across Independent LLM Architectures (Preliminary Data + Replication Request) by viniburilux
The Decalogue For Aligned AI. by theophilus tabuke
Analytical Validation of Biomarkers is Not the Full Story by mnarayan
Using the probabilistic method to bound the performance of toy transformers by Alex Gibson
Control by Committee by Alexander Bistagne
Using math to foster acceptance and equality by jackoda
Being “Usefully Concrete” by Raemon
Questions About Outperforming Common Wisdom by Notelrac
Metaphors for Biology: Sizes by Niko McCarty
You think you are in control? by Recurrented
“What’s hard about this? What can I do about that?” (Recursive) by Raemon
Erasmus: Social Engineering at Scale by Martin Sustrik
Time, Panpsychism, and Substrate Independence by kihara.sofia
Introducing Project Telos by Mario Giulianelli
Critical Fallibilism and Theory of Constraints in One Analyzed Paragraph by Elliot Temple
Transcendental Argumentation and the Epistemics of Discourse by 0xA
ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents by Ziqian Zhong
LLM Hallucinations: An Internal Tug of War by violazhong
An Opinionated Guide to Privacy Despite Authoritarianism by TurnTrout
How Do We Evaluate the Quality of LLMs’ Mathematical Responses? by Miguel Angel
Visualizing a Platform for Live World Models by Kuil
Why Would we get Inner Misalignment by Default? by Coil
Rational Emotivism by Notelrac
by CMDiamond
Paper: Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization by antmaier
What were mistakes of AI Safety field-building? How can we avoid them while we build the AI Welfare? by Güney Türker
Resolving Newcomb’s Problem Perfect Predictor Case by Praphull Kabtiyal
[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks by hughvd
Transactional method for non-transactional relationship: Relationship as a Common-pool Resource problem by Source Wishes
When a reasoning model uses CoT as a tool to produce a refusal through invented policy violations by Gabriella Baris
I will not sign up for cryonics by Syd Lonreiro_
Can AI Agents with Divergent Interests Learn To Prevent Civilizational Failures? by joao_abrantes
The Doomers Were Right by Algon
Moving Faster on AI Safety: Summary and Key Insights from 10/10/25 Safety Acceleration Symposium by Martin Leitgab
Dead-switches as AI safety tools by Jesper L.
The Cloud industry architecture [Infra-Platform-App] is unlikely to replicate for AI by Armchair Descending
Why deep space programs select for calm agreeable introverted candidates by David Sun
[Thought Experiment] If Human Extinction “Improves the World,” Should We Oppose It? Species Bias and the Utilitarian Challenge by satopi
The idea of paradigm testing of LLMs by Daniel Fenge
Conjecture: Emergent φ is provable in Large Language Models by BarnicleBarn
The Complex Universe Theory of AI Psychology by Andrew Tomazos
Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training by Jeyashree Krishnan
The increase in the number of Everett branches could be very important. by Crazy philosopher
If Anyone Builds It Everyone Dies, a semi-outsider review by dvd
Words make us Dumb #1: The “Point”lessness of Knowledge by Enmai.MCimbu
Live Governance: AI tools for coordination without centralisation by mbuch
[CS 2881r] [Week 6] Recursive Self-Improvement by Joshua Qin
Designing for perpetual control by Remmelt
“Naive Consequentialism” as a Thought-Terminating cliche by Jacob Goldsmith
The Moral Infrastructure for Tomorrow by sdeture
Assuring Agent Safety Evaluations By Analysing Transcripts by bigmouthmodels
How we’ll make all world leaders work together to make the world better by Wes R
Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions. by lennie
What can Canadians do to help end the AI arms race? by Tom938
17 years old, self-taught state control—looking for people who actually get this by Cornelius Caspian