Robin Linacre's homepage: Probabilistic record linkage, Data Science and Data Engineering

Posts

Building Accurate Address Matching Systems

Putting Scaffolding Around Vibe Coding to Build More Complex Apps

Why DuckDB is my first choice for data processing

An alternative way to think about predicted probabilities in the Fellegi Sunter model

Quotes, links and podcast episodes

Live DuckDB WASM Splink model

Graph editor for illustrating clustering concepts (graph playground)

AI probably won't replace me in 2025

The emerging impact of LLMs on my productivity

Connected components visualisation

Match weight calculator

Super-fast deduplication of large datasets using Splink and DuckDB

Why Probabilistic Linkage is More Accurate than Fuzzy Matching For Data Deduplication

Thoughts and questions about the short term impact of LLMs on knowledge workers

Visualising updating a prior

Computing the Fellegi Sunter model

m and u values in the Fellegi-Sunter model

Partial match weights

The relationship between probabilities, match weights and Bayes factors

Splink and the Open Source Dividend

SQL should be the default choice for data transformation logic

Why parquet files are my preferred API for bulk open data

Why don't you just

The Intuition Behind the Use of Expectation Maximisation to Train Record Linkage Models

Splink 3: Fast, accurate and scalable linkage in Python

m and u probability generator with starting values

Are more complex probabilistic linkage models more accurate? Part 2, unsupervised learning

Are more complex probabilistic linkage models more accurate? Part 1, supervised learning

The Thorniest Problem of Building an Analytical Platform

The carbon impact of switching to an electric car

m and u probability generator

Dependencies between match weights

Understanding match weights in the Fellegi Sunter model

Visualising the Fellegi Sunter model

Maths of Fellegi Sunter (old version)

The mathematics of the Fellegi Sunter model

An Interactive Introduction to Record Linkage (Data Deduplication) in the Fellegi-Sunter framework

The Downfall of Command and Control Data Leadership

Demystifying Apache Arrow

Birdsong recording finder

Comparing energy usage across countries

Filling the country with solar panels

Fuzzy Matching and Deduplicating Hundreds of Millions of Records with Splink

Why you should open source your analytical work

Understanding the Spark UI by example: sorting data

Understanding the Spark UI by example: the Left Join

Spark UI SQL detailed annotator

Unsupervised probabilistic data matching using the Expectation Maximisation algorithm

Carbon offsetting vs. the cost of renewable energy

Interactive blogging with Observable Notebooks and gatsby.js

Flight distance calculator

Energy usage ready reckoner

Effective testing of analytical models using automated sense checks

Questions Senior Leaders Should Ask Their Data Delivery Teams

Why I’m backing Vega-Lite as our default tool for data visualisation

Transforming analytical functions by mainstreaming data science