Some of the projects I’ve worked on, and my areas of interest and expertise:
I was one of the first committers on Apache Hadoop, and worked on it and many other distributed systems projects in the Hadoop ecosystem for the best part of a decade while working at Cloudera. I wrote four editions of the bestselling book Hadoop: The Definitive Guide, published by O’Reilly.
I was the primary author of Spark support in GATK, working with members of the Broad Institute GATK team. I also created Disq for reading and writing bioinformatics sequencing formats from Spark.
I ported the single-cell preprocessing pipeline in Scanpy so it could run in parallel using Dask, and on GPUs using RAPIDS.
I have made contributions to the dimensionality reduction software UMAP and the related project pynndescent (for calculating approximate nearest neighbours).
In 2020 I started a blog about data visualization with the goal of creating one interesting visualization per week, with no constraints on dataset, visualization type, or technology.
Over the years I’ve created many geometric visualizations in my spare time.
In 2018 I was diagnosed with Type 1 Diabetes, and since then I have written various pieces of software to help manage the condition.
I’m interested in board games and puzzles, and how to get computers to play them. Examples include Mastermind and SET®.