Projects

Some of the projects I’ve worked on, and my areas of interest and expertise, in no particular order:

Cubed

Cubed is a project I started in 2022 as a drop-in replacement for Dask, to make running large-scale array processing workloads more reliable. It has been used for workloads in sgkit (genomics) and Pangeo (geoscience).

Hadoop

I was one of the first committers on Apache Hadoop, and worked on it and many other distributed systems projects in the Hadoop ecosystem for the best part of a decade while working at Cloudera. I wrote four editions of the bestselling book Hadoop: The Definitive Guide, published by O’Reilly.

Genomics

In June 2020, I started working for Related Sciences on sgkit, a statistical genetics toolkit based on PyData technologies like Dask, Xarray, and Zarr. This work culminated in a new Zarr-based file format for variant data (VCF Zarr), new tools for working with this data (bio2zarr and vcztools), and a published paper: ‘Analysis-ready VCF at Biobank scale using Zarr’.

I ported the single-cell preprocessing pipeline in Scanpy so it could run in parallel using Dask, and on GPUs using RAPIDS.

I was the primary author of Spark support in GATK, working with members of the Broad Institute GATK team from 2015 to 2019. I also created Disq for reading and writing bioinformatics sequencing formats from Spark.

SciPy

I have made contributions to the dimensionality reduction software UMAP and the related project pynndescent (for calculating approximate nearest neighbours).

Open Data

In early March 2020 I began collating the disparate sources of UK COVID-19 data, writing web crawlers to integrate the data on tests, confirmed cases, and deaths into a set of CSV files. My work has been used by many different individuals and organisations, including John Burn-Murdoch’s visualizations for the Financial Times.

In 2019 I produced the data analyses and visualizations of Welsh school funding data, for the Level the Playing Field campaign for fair funding for schools in Wales.

Visualization

In 2020 I wrote a blog about data visualization, and created one new visualization per week, with no constraints on dataset, visualization type, or technology.

Over the years I’ve created many geometric visualizations in my spare time.

Diabetes tech

In 2018 I was diagnosed with Type 1 Diabetes, and since then I have written various pieces of software to help manage the condition.

Games

I’m interested in board games and puzzles, and how to get computers to play them. Examples include: Mastermind, SET®, and Futoshiki.

I have created two puzzles of my own design, Reflect and Polarize, which I publish daily. Try them out!

Retro computing

In 2022 I refurbished the first computer I ever owned, a ZX81. I also recovered the first program I wrote for it from tape, and created a mini website for running programs from the book Not Only 30 Programs for the Sinclair ZX81.

In 2025 I refurbished the second computer I owned, a QL.

Using an emulator, I’ve re-run a Mandelbrot Set drawing program that I wrote for my third computer, an Atari ST.