RSS.Social

LessWrong

Posts

Death of the Author

Exploring the multi-dimensional refusal subspace in reasoning models

AIs should also refuse to work on capabilities research

List of lists of project ideas in AI Safety

Uncommon Utilitarianism #3: Bounded Utility Functions

How valuable is money-in-market?

Credit goes to the presenter, not the inventor

On Fleshling Safety: A Debate by Klurl and Trapaucius.

Results of "Experiment on Bernoulli processes"

certain exotic neurotransmitters as SMART PILLS: or compounds that increase the capacity for mental work in humans

Cancer has a surprising amount of detail

Stability of natural latents in information theoretic terms

Lessons from Teaching Rationality to EAs in the Netherlands

Are We Their Chimps?

FWIW: What I noticed at a (Goenka) Vipassana retreat

Why are there no helpful infectious diseases?

Brightline is Actually Pretty Dangerous

Seven-ish Words from My Thought-Language

Remembrancy

Pygmalion's Wafer

Despair and information

Why is OpenAI releasing products like Sora and Atlas?

Origins and dangers of future AI capability denial

Do you completely trust that you are completely in the shit? - despair and information -

Assessing Far UVC Positioning

Musings on Reported Cost of Compute (Oct 2025)

Regardless of X, you can still just sign superintelligence-statement.org if you agree

The Future of Interpretability is Geometric

New Statement Calls For Not Building Superintelligence For Now

Highlights from Explaining AI Explainability

Can Reasoning Models Obfuscate Reasoning? Stress-Testing Chain-of-Thought Monitorability

Dollars in political giving are less fungible than you might think

LW Reacts pack for Discord/Slack/etc

AI Timelines and Points of no return

Can we steer AI models toward safer actions by making these instrumentally useful?

Introducing ControlArena: A library for running AI control experiments

Plan 1 and Plan 2

Why I Don't Believe in True AGI

Guys I might be an e/acc

How transformers can compute distances along a curve locally.

How an AI company CEO could quietly take over the world

Worlds Where Iterative Design Succeeds?

Automated real time monitoring and orchestration of coding agents

Reminder: Morality is unsolved

The main way I've seen people turn ideologically crazy [Linkpost]

Empirical Partial Derivatives

Differences in Alignment Behaviour between Single-Agent and Multi-Agent AI Systems

Building a different kind of personal intelligence

Beliefs about formal methods and AI safety

Depression and information - What is my confidence level that I'm in absolute shit?

AI #139: The Overreach Machines

Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?

Software Engineering at SecureBio

Is terminal lucidity real?

A Concrete Roadmap towards Safety Cases based on Chain-of-Thought Monitoring

LW Psychosis

Learning to Interpret Weight Differences in Language Models

AGI's Last Bottlenecks

Statement on Superintelligence - FLI Open Letter

Why AI alignment matters today

Any corrigibility naysayers outside of MIRI?

Which side of the AI safety community are you in?

Homomorphically encrypted consciousness and its implications

Consider donating to AI safety champion Scott Wiener

Postrationality: An Oral History

Penny's Hands

Is 90% of code at Anthropic being written by AIs?

LLM Self-Reference Language in Multilingual vs English-Centric Models

How Well Does RL Scale?

The Perpetual Technological Cage

Utopiography Interview

White House OSTP AI Deregulation Public Comment Period Ends Oct. 27

July-October 2025 Progress in Guaranteed Safe AI

In remembrance of Sonnet '3.6'

Stratified Utopia

Early stage goal-directedness

On Dwarkesh Patel’s Podcast With Andrej Karpathy

Relearning Human Signaling — Experiments in Real-World Connection (Stockholm + virtual)

Samuel x Bhishma - Superintelligence by 2030?

Remarks on Bayesian studies from 1963

How the Human Lens Shapes Machine Minds

Symbiogenesis vs. Convergent Consequentialism

21st Century Civilization curriculum

Ramblings on the Self Indication Assumption

An epistemic theory of populism [link post to Joseph Heath]

EU explained in 10 minutes

"Tilakkhana", Gwern [poem]

Attending Your First Contra Dance in a Fragrance-Compliant Manner

How low can we go? Modern School Curricula

The If Anyone Builds It, Everyone Dies march assurance contract should indicate how many signatures it has received

A Mathematical Model of Alcor’s Economic Survival

Can you find the steganographically hidden message?

How cause-area specific conferences can strengthen the EA community

Secular Solstice: Bremen (Dec 13)

How Stuart Buck funded the replication crisis

Contra-Zombies? Contra-Zombies!: Chalmers as a parallel to Hume

Consider donating to Alex Bores, author of the RAISE Act

Bubble, Bubble, Toil and Trouble

Considerations around career costs of political donations

A Cup of Blue Tea

A Bayesian nightmare: Instagram and Sampling bias

Uncommon Utilitarianism #2: Positive Utilitarianism

Can you transfer a backdoor with subliminal learning?

Final-Exam-Tier Medical Problem With Handwavy Reasons We Can't Just Call A Licensed M.D.

Humanity Learned Almost Nothing From COVID-19

Frontier LLM Race/Sex Exchange Rates

Scenes, cliques and teams - a high level ontology of groups

Learnable Skills

Give Me Your Data: The Rationalist Mind Meld

The IABIED statement is not literally true

Libraries need more books

In defense of the goodness of ideas

Sample Interesting First

Comma v0.1 converted to GGUF

Using Bayes' Theorem to determine Optimal Protein Intake

Selected Graphics Showing Progress towards AGI

Networking for Spies: Translating a Cyrillic Text with Claude Code

How to survive in the storm of uncertainty (and shit)?

Space colonization and scientific discovery could be mandatory for successful defensive AI

Memory Decoding Journal Club: Functional connectomics reveals general wiring rule in mouse visual cortex

Meditation is dangerous

I handbound a book of Janus's essays for my girlfriend

The Dark Arts of Tokenization or: How I learned to start worrying and love LLMs' undecoded outputs

How To Vastly Increase Your Charitable Impact

Nontrivial pillars of IABIED

What Success Might Look Like

I’m an EA who benefitted from rationality

AI #138 Part 2: Watch Out For Documents

Mess AI – deliberate corruption of the training data to prevent superintelligence

Activation Plateaus: Where and How They Emerge

Can We Simulate Meiosis to Create Digital Gametes — and Are the Results Your Biological Offspring?

Steven Adler reports that NVIDIA is attempting to stifle pro-export-control speech

Book Review: To Explain the World

AISN#64: New AGI Definition and Senate Bill Would Establish Liability for AI Harms

Finding Features in Neural Networks with the Empirical NTK

Learning from the Luddites: Implications for a modern AI labour movement

Job Openings: SWE, PM, and Grants Coordinator to help improve grant-making

AI #138 Part 1: The People Demand Erotic Sycophants

Cheap Labour Everywhere

Quantum immortality and AI risk – the fate of a lonely survivor

Halfhaven Digest #2

Fragrance Free Confusion

The Three Levels of Agency

Electronics Mechanic -> AI Safety Researcher: A 30-Month Journey to Model Welfare

Some astral energy extraction methods

AI-202X-slowdown: can CoT-based AIs become capable of aligning the ASI?

Chemical Telescopes And The Process Of Science

Monthly Roundup #35: October 2025

Updating the name of Open Philanthropy's AI program

Open Global Investment: Comparisons and Critique

We are too comfortable with AI "magic"

Are calm introverts (like East Asians) uniquely suited for space travel & Mars missions?

Until the stars burn out? Assessing the stakes of AGI lock-in

It will cost you nothing to "bribe" a Utilitarian

How I Became a 5x Engineer with Claude Code

That Mad Olympiad

A New Global Risk: Large Comet’s Impact on Sun Could Cause Fires on Earth

Can LLMs Coordinate? A Simple Schelling Point Experiment

Minimal Prompt Induction of Self-Talk in Base LLMs

Humans Are Spiky (In an LLM World)

Gnashing of Teeth

Geometric Structure of Emergent Misalignment: Evidence for Multiple Independent Directions

Communism By Another Name

Situational Awareness as a Prompt for LLM Parasitism

The sum of its parts: composing AI control protocols

Why your boss isn’t worried about AI

Humanity AI Commits $500 million to AI and Democracy Protection, AI x Security, and more

Thinking Partners: Building AI-Powered Knowledge Management Systems

SS26 Color Stats

The Biochemical Beauty of Retatrutide: How GLP-1s Actually Work

My views on Lesswrong

Trade Escalation, Supply Chain Vulnerabilities and Rare Earth Metals

12 Angry Agents, or: A Plan for AI Empathy

The "Length" of "Horizons"

My Software Setup

Narcissism, Echoism, and Sovereignism: A 4-D Model of Personality

A personal take on why you should work at Forethought (maybe)

Discrete Generative Models

Survey Results: Far UVC and Glycol Vapors

How AI Manipulates—A Case Study

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

AI Psychosis, with Tim Hua and Adele Lopez

What is Lesswrong good for?

Predictability is Underrated

The Mom Test for AI Extinction Scenarios

Is There a Sound Argument for Generality in AI?

Reasons to sign a statement to ban superintelligence (+ FAQ for those on the fence)

Water Above the Ocean

OpenAI #15: More on OpenAI’s Paranoid Lawfare Against Advocates of SB 53

The Thirteen-Circle Paradox

Pause House, Blackpool

Global vs. Local feedback

Sublinear Utility in Population and other Uncommon Utilitarianism

RiskiPedia

Don't Mock Yourself

Experiment: Test your priors on Bernoulli processes.

The Problem of Consciousness and AI as an Ethical Subject

Dr Evil & Realpolitik

How do we know when something is deserving of welfare?

The Narcissistic Spectrum

Non-copyability as a security feature

The Alignment Problem Isn't Theoretical

International Programme on AI Evaluations

If a Lioness Could Speak

How long do AI companies have to achieve significant capability gains before funding collapses?

You're not confused by thermodynamics

Subscribe to my Inkhaven feed!

The Most Common Bad Argument In These Parts

Experiments With Sonnet 4.5's Fiction

Letter to Heads of AI labs

Emil the Moose

Using complex polynomials to approximate arbitrary continuous functions

What does it feel like to understand?

The 5 Obstacles I Had to Overcome to Become Vegan

2025 State of AI Report and Predictions

Applications Open for a Weekend Exploring Civilisational Sanity

Maybe Use BioLMs To Mitigate Pre-ASI Biorisk?

The statement "IABIED" is true even if the book IABIED is mostly false

AI and Biological Risk: Forecasting Key Capability Thresholds

Why Future AIs will Require New Alignment Methods

Materialist Semiotics and the Nature of Qualia

Patience and Willingness to Be Slow

You should probably prepare for your chats to be fed to AI

We won’t get docile, brilliant AIs before we solve alignment

Labs lack the tools to course-correct

The Liberty Tractor

At odds with the unavoidable meta-message

Stars are a rounding error

Towards a Typology of Strange LLM Chains-of-Thought

Training Qwen-1.5B with a CoT legibility penalty

Would concentration of power be bad, given offense-dominant weapons?

Interview with a drone expert on the future of AI warfare

Investigating Neural Scaling Laws Emerging from Deep Data Structure

I take antidepressants. You’re welcome

Training fails to elicit subtle reasoning in current language models

Realistic Reward Hacking Induces Different and Deeper Misalignment

Why am I not currently starting a religion around AI or similar topics?

THE WISE, THE GOOD, THE POWERFUL

Hubris & Control

"Yes, and—" Requires the Possibility of "No, Because—"

Four Questions to Refine Your Policy Proposal

A Snippet On The Epistemically Hygienic Containment Of Faith-In-Reason-Itself

Alignment progress doesn’t compensate for higher capabilities

The Necronomicon Is Already Read

Biouploading: Preserving My Living Neurons and Connectome as a Spatially Distributed Mesh

self reflections of a striver

Hospitalization: A Review

AI #137: An OpenAI App For That

Are We Leaving Literature To The Psychotic?

Lessons from the Mountains

Probabilistic Societies

Inverting the Most Forbidden Technique: What happens when we train LLMs to lie detectably?

NEPA, Permitting and Energy Roundup #2

What shapes does reasoning take but circular?

The Oracle's Gift

Thinking Mathematically - Convergent Sequences

The Relationship Between Social Punishment and Shared Maps

IABIED: Paradigm Confusion and Overconfidence

The Wise Baboon of Loyalty

Spooky Collusion at a Distance with Superrational AI

The Architecture of the Narcissistic False Self

Reflections on The Curve 2025

2025-10-12 - London rationalish meetup - Periscope

Plans A, B, C, and D for misalignment risk

Three Paths Through Manifold

Halfhaven Digest #1

The "cool idea" bias

Irresponsible Companies Can Be Made of Responsible Employees

Heaven, Hell, and Mechanics

10 Ways to Waste a Decade

You Should Get a Reusable Mask

Replacing RL w/ Parameter-based Evolutionary Strategies

Intent alignment seems incoherent

The Tower of Babel in Reverse

Petri: An open-source auditing tool to accelerate AI safety research

Bending The Curve

Kairos is hiring: Founding Generalist & SPAR Contractor

Messy on Purpose: Part 2 of A Conservative Vision for the Future

Going Phoneless

The Alignment Paradox: Why Transparency Can Breed Deception

Notes on "Homology, Genes and Evolutionary Innovation"

Research Robots: When AIs Experiment on Us

Top Warning Signs Your Friends are Being Oneshotted By AI

LLMs as a limiter of social intercourse

Generalization and the Multiple Stage Fallacy?

Telling the Difference Between Memories & Logical Guesses

Notes from European Progress Conference

"Intelligence" -> "Relentless, Creative Resourcefulness"

Chaos Alone is No Bar to Superintelligence

We won’t get AIs smart enough to solve alignment but too dumb to rebel

Notes on the need to lose

Excerpts from my neuroscience to-do list

Experience Report - ML4Good Bootcamp Singapore, Sep'25

Gradual Disempowerment Monthly Roundup

Subliminal Learning, the Lottery-Ticket Hypothesis, and Mode Connectivity

The Origami Men

Medical Roundup #5

Why I think ECL shouldn’t make you update your cause prio

Did Tyler Robinson carry his rifle as claimed by the government?

AI Science Companies: Evidence AGI Is Near

LLMs one-box when in a "hostile telepath" version of Newcomb's Paradox, except for the one that beat the predictor

Alignment Faking Demo for Congressional Staffers

Do Things for as Many Reasons as Possible

One Does Not Simply Walk Away from Omelas

The quotation mark

The Sadism Spectrum and How to Access It

Maybe social media algorithms don't suck

Base64Bench: How good are LLMs at base64, and why care about it?

Behavior Best-of-N achieves Near Human Performance on Computer Tasks

How likely are “s-risks” (large-scale suffering outcomes) from unaligned AI compared to extinction risks?

LLMs are badly misaligned

The Counterfactual Quiet AGI Timeline

AISafety.com Reading Group session 328

Open Philanthropy’s Biosecurity and Pandemic Preparedness Team Is Hiring and Seeking New Grantees

$250 bounties for the best short stories set in our near future world & Brooklyn event to select them

How the NanoGPT Speedrun WR dropped by 20% in 3 months

a quick thought about AI alignment

Making Your Pain Worse can Get You What You Want

Markets in Democracy: What happens when you can sell your vote?

What I've Learnt About How to Sleep

The 'Magic' of LLMs: The Function of Language

To my client having a sudden ‘bad day’:

Consider Doing Small Walks at Work

Where does Sonnet 4.5's desire to "not get too comfortable" come from?

Goodness is harder to achieve than competence

Good is a smaller target than smart

Making Sense of Consciousness Part 6: Perceptions of Disembodiment

Recent AI Experiences