RSS.Social

Frontpage posts - LessWrong 2.0 viewer

Posts

Anhedoniapolis by Alex Beyman

Thinking through a lens of physiology by Vadim Golub

An Approach for Evaluating Self-Boundary Consistency in AI Systems by Anurag

A Catalog of AI Evaluations by Anurag

If It Can Learn It, It Can Unlearn It: AI Safety as Architecture, Not Training by Timothy Danforth

Thinking in Predictions by Julius

About Natural & Synthetic Beings (Interactive Typology) by Anurag

Have there been any rational analyses of mindbody techniques for chronic pain/illness? by Liface

Eval-unawareness ≠ Eval-invariance by Mo Baker

Arch-anarchy, the end of state and digital anarchism by Peter lawless

on self-knowledge by Vadim Golub

Shaping Model Cognition Through Reflective Dialogue—Experiment & Findings by Anurag

42, and expanding human consciousness by Daniel Jacobson

Adding Empathy as a Tool for LLMs by RobinHa

GiveCalc: Open-source tool to calculate the true cost of charitable giving by Max Ghenis

Open Thread Winter 2025/26 by kave

Practical AI risk II: Training transparency by Gustavo Ramires

GRPO is terrible by RobinHa

The 2024 LessWrong Review by RobertM

Slack Observability by Alice Blair

Do Language Models Really Learn to Mislead Humans via RLHF? by Aaryan Chandna

The Glasses on Your Face by Casey_

Sutskever Refuses to Answer the Q: How Will AGI Be Built? He Has No Answer by Oscar Davies

Why people like your quick bullshit takes better than your high-effort posts by eukaryote

Ruby’s Ultimate Guide to Thoughtful Gifts by Ruby

MIT Lincoln Laboratory & The Dual-Use Frontier by Austin Morrissey

Anti-Foom Anti-Doom by homosapien97

Books cover a larger idea-space than movies by beyarkay

Paper Review: Must Rhodes Fall? Differing responses to contentious monumental public art by soycarts

Bringing organ-scale cryopreservation into existence by Abhishaike Mahajan

AI Sentience and Welfare Misalignment Risk by ejk64

Be Naughty by habryka

Eight Heuristics of Anti-Epistemology by Ben Pace

The Worrying Nature of Akrasia by Notelrac

Can questions rigidly designate intentions? File under: Rolling my own metaethics by Mason Broxham

Week 3: Adversarial Robustness by Ely Hahami

Informed Consent as the Sole Criterion for Medical Treatment by Chara#2736

How you got RL’d into your idiosyncratic cognition by Ruby

Defining AI Truth-Seeking by What It Is Not by Tianyi (Alex) Qiu

The Moss Fractal: How Care Regulates Functional Awareness from Microbes to AI by Lcofa

One King Upon The Chessboard by Screwtape

Evrart Claire: A Case Study in Anti-Epistemology by Ben Pace

Out-paternalizing the government (getting oxygen for my baby) by Ruby

KPD is a weak obstruction by JustinSheek

Habermasian Alignment by kenneth myers

Reading LLM chain of thought makes me more rational by Michael Steele

Don’t grow your org fast by Ruby

The Barriers to Your Unemployment by claywren

Cooling the way is the obvious way to boost human IQ by Michael Steele

Close open loops by habryka

Toddler Shoggoth Has Plenty Of Raw Material (The Memetic Cocoon Threat Model) by KAP

The skills and physics of high-performance driving, Pt. 2 by Ruby

Wiki AI by abramdemski

An introduction to the invariants of Special Relativity and the geometry of Spacetime by Horosphere

Darwin’s LLMs—Natural Selection is Already Shaping AI by Ben Turtel

No one understands that tomorrow will not be today and I don’t know what to do. by AdamLacerdo

Biology of the Living—A Conversation with two generations of Google AI by matthew allen

Diagonalization: A (slightly) more rigorous model of paranoia by habryka

Put numbers on stuff, all the time, otherwise scope insensitivity will eat you by habryka

Increasing returns to marginal effort are common by habryka

Thoughts are surprisingly detailed and remarkably autonomous by Ruby

Tools for deferring gracefully by TsviBT

What’s so hard about...? A question worth asking by Ruby

A Timeless Universe Viewed From the Inside by 0xA

Strengthening Red Teams: A Modular Scaffold for Control Evaluations by Chloe Loughridge

Don’t cancel out your rewards! by Sneha Bangalore

On model weight preservation: Anthropic’s new initiative by Olle Häggström

Pause from Behind / Losing Heroically by enterthewoods

Learning information which is full of spiders by Screwtape

Introspection or confusion? by Victor Godet

Structural Estimates of Human Computation by Nicolas Villarreal

Can Models be Evaluation Aware Without Explicit Verbalization? by gersonkroiz

Cake vs Lack of Cake by Notelrac

Cross-Model Semantic Convergence Across Independent LLM Architectures (Preliminary Data + Replication Request) by viniburilux

The Decalogue For Aligned AI. by theophilus tabuke

Analytical Validation of Biomarkers is Not the Full Story by mnarayan

Using the probabilistic method to bound the performance of toy transformers by Alex Gibson

Control by Committee by Alexander Bistagne

Using math to foster acceptance and equality by jackoda

Being “Usefully Concrete” by Raemon

Questions About Outperforming Common Wisdom by Notelrac

Metaphors for Biology: Sizes by Niko McCarty

You think you are in control? by Recurrented

“What’s hard about this? What can I do about that?” (Recursive) by Raemon

Erasmus: Social Engineering at Scale by Martin Sustrik

Time, Panpsychism, and Substrate Independence by kihara.sofia

Introducing Project Telos by Mario Giulianelli

Critical Fallibilism and Theory of Constraints in One Analyzed Paragraph by Elliot Temple

Transcendental Argumentation and the Epistemics of Discourse by 0xA

ImpossibleBench: Measuring Reward Hacking in LLM Coding Agents by Ziqian Zhong

LLM Hallucinations: An Internal Tug of War by violazhong

An Opinionated Guide to Privacy Despite Authoritarianism by TurnTrout

How Do We Evaluate the Quality of LLMs’ Mathematical Responses? by Miguel Angel

Visualizing a Platform for Live World Models by Kuil

Why Would we get Inner Misalignment by Default? by Coil

Rational Emotivism by Notelrac

by CMDiamond

Paper: Take Goodhart Seriously: Principled Limit on General-Purpose AI Optimization by antmaier

What were mistakes of AI Safety field-building? How can we avoid them while we build the AI Welfare? by Güney Türker

Resolving Newcomb’s Problem Perfect Predictor Case by Praphull Kabtiyal

[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks by hughvd

Transactional method for non-transactional relationship: Relationship as a Common-pool Resource problem by Source Wishes

When a reasoning model uses CoT as a tool to produce a refusal through invented policy violations by Gabriella Baris

I will not sign up for cryonics by Syd Lonreiro_

Can AI Agents with Divergent Interests Learn To Prevent Civilizational Failures? by joao_abrantes

The Doomers Were Right by Algon

Moving Faster on AI Safety: Summary and Key Insights from 10/10/25 Safety Acceleration Symposium by Martin Leitgab

Dead-switches as AI safety tools by Jesper L.

The Cloud industry architecture [Infra-Platform-App] is unlikely to replicate for AI by Armchair Descending

Why deep space programs select for calm agreeable introverted candidates by David Sun

[Thought Experiment] If Human Extinction “Improves the World,” Should We Oppose It? Species Bias and the Utilitarian Challenge by satopi

The idea of paradigm testing of LLMs by Daniel Fenge

Conjecture: Emergent φ is provable in Large Language Models by BarnicleBarn

The Complex Universe Theory of AI Psychology by Andrew Tomazos

Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training by Jeyashree Krishnan

The increase in the number of Everett branches could be very important. by Crazy philosopher

If Anyone Builds It Everyone Dies, a semi-outsider review by dvd

Words make us Dumb #1: The “Point”lessness of Knowledge by Enmai.MCimbu

Live Governance: AI tools for coordination without centralisation by mbuch

[CS 2881r] [Week 6] Recursive Self-Improvement by Joshua Qin

Designing for perpetual control by Remmelt

“Naive Consequentialism” as a Thought-Terminating cliche by Jacob Goldsmith

The Moral Infrastructure for Tomorrow by sdeture

Assuring Agent Safety Evaluations By Analysing Transcripts by bigmouthmodels

How we’ll make all world leaders work together to make the world better by Wes R

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions. by lennie

What can Canadians do to help end the AI arms race? by Tom938

17 years old, self-taught state control—looking for people who actually get this by Cornelius Caspian