About
I was born in 1973 and grew up in Ironbridge, Shropshire. I studied mathematics at Queens’ College, Cambridge, then did a master’s in philosophy of science at Leeds and Florence.
In 1996 I moved to London to start work at Reuters, where I learnt C and HTML, and the then newly minted Java programming language. I then worked at a small Java consultancy called Objective, before joining Kizoom in 1999. Kizoom was my first startup: we built phone apps to make commuters’ lives easier by providing services to plan train journeys, get alerted to tube delays, or find out when the next bus was due. (This was on phones like the Nokia 7110, long before the iPhone came out.)
In 1998 I met Eliane, and our daughters were born in 2002 and 2004. In 2007 we left London and moved to Crickhowell in the Brecon Beacons in Wales. I had been dabbling in a new technology called Hadoop, and it seemed like as good a time as any to take the plunge and become a Hadoop consultant (the world’s first), doing work for companies like Last.fm and Ning. I was also made a Hadoop committer, and got more involved with the wider Apache Software Foundation, eventually serving on various project committees and becoming a member of the foundation.
In 2008 I started writing a book about Hadoop. The first edition was published by O’Reilly the following year; it was launched at the Hadoop Summit and every attendee was given a copy. The book became a bestseller, and I wrote four editions in total, which were translated into three languages.
2008 was a busy year, since in addition to writing a book, I joined Cloudera, the Hadoop startup, as its second employee. Then just one year later, Eliane, the girls, and I upped sticks again and moved to San Francisco, where we lived until we returned to Wales in 2012. At Cloudera, I continued my work on core Hadoop and its growing ecosystem of projects and tools. Then in 2014 I moved roles at Cloudera from engineering to the data science team, where I got a taste for solving challenging problems in big data, like genomics (previously I had been building the systems that others used to solve the problems).
In 2018 I went freelance, working for the Broad Institute on their genomics toolkit (GATK), and in Uri Laserson’s lab at Mount Sinai on single-cell software tools. In both cases I was concerned with making the software run faster and scale to bigger datasets. The single-cell work focused on Scanpy and related SciPy projects (UMAP, pynndescent).
In 2020 I paused my freelance work to spend time on various community data analysis and visualization projects. Then in early March, as coronavirus began to spread, I began collating the disparate sources of UK COVID-19 data.
In June 2020, I went back to the world of big data and genomics and joined Jeff Hammerbacher’s team at Related Sciences to work on sgkit, a statistical genetics toolkit based on PyData technologies like Dask, Xarray, and Zarr.
In 2022 I founded a new project called Cubed as a drop-in replacement for Dask. As a result of this I started working with the Pangeo community on the problem of how to scale up geoscience workloads to run on cloud infrastructure.
If you want to chat about any of the above or projects you think I may find interesting, then please get in touch. I also have a LinkedIn profile.