Robin Linacre's homepage: Probabilistic record linkage, Data Science and Data Engineering
Building Accurate Address Matching Systems
Putting Scaffolding Around Vibe Coding to Build More Complex Apps
Why DuckDB is my first choice for data processing
An alternative way to think about predicted probabilities in the Fellegi Sunter model
Quotes, links and podcast episodes
Live DuckDB WASM Splink model
Graph editor for illustrating clustering concepts (graph playground)
Microblog
AI probably won't replace me in 2025
The emerging impact of LLMs on my productivity
Connected components visualisation
Match weight calculator
Super-fast deduplication of large datasets using Splink and DuckDB
Why Probabilistic Linkage is More Accurate than Fuzzy Matching For Data Deduplication
Thoughts and questions about the short term impact of LLMs on knowledge workers
Visualising updating a prior
Computing the Fellegi Sunter model
m and u values in the Fellegi-Sunter model
Partial match weights
The relationship between probabilities, match weights and Bayes factors
Splink and the Open Source Dividend
SQL should be the default choice for data transformation logic
Why parquet files are my preferred API for bulk open data
Why don't you just
The Intuition Behind the Use of Expectation Maximisation to Train Record Linkage Models
Splink 3: Fast, accurate and scalable linkage in Python
m and u probability generator with starting values
Are more complex probabilistic linkage models more accurate? Part 2, unsupervised learning
Are more complex probabilistic linkage models more accurate? Part 1, supervised learning
The Thorniest Problem of Building an Analytical Platform
The carbon impact of switching to an electric car
m and u probability generator
Dependencies between match weights
Understanding match weights in the Fellegi Sunter model
Visualising the Fellegi Sunter model
Maths of Fellegi Sunter (old version)
The mathematics of the Fellegi Sunter model
An Interactive Introduction to Record Linkage (Data Deduplication) in the Fellegi-Sunter framework
The Downfall of Command and Control Data Leadership
Demystifying Apache Arrow
Birdsong quiz
Birdsong recording finder
Comparing energy usage across countries
Filling the country with solar panels
Fuzzy Matching and Deduplicating Hundreds of Millions of Records with Splink
Why you should open source your analytical work
Understanding the Spark UI by example: sorting data
Understanding the Spark UI by example: the Left Join
Spark UI SQL detailed annotator
Unsupervised probabilistic data matching using the Expectation Maximisation algorithm
Carbon offsetting vs. the cost of renewable energy
Interactive blogging with Observable Notebooks and gatsby.js
Flight distance calculator
Energy usage ready reckoner
My flights
Effective testing of analytical models using automated sense checks
Questions Senior Leaders Should Ask Their Data Delivery Teams
Why I’m backing Vega-Lite as our default tool for data visualisation
Transforming analytical functions by mainstreaming data science