In 1996 I moved to London to start work at Reuters, where I learnt C, HTML, and the then newly minted Java programming language. I then worked at a small Java consultancy called Objective, before joining Kizoom in 1999. Kizoom was my first startup: we built apps to make commuters' lives easier by providing services to plan train journeys, get alerted to tube delays, or find out when the next bus was due.
In 1998 I met Eliane, and our daughters were born in 2002 and 2004. In 2007 we left London and moved to Crickhowell in the Brecon Beacons in Wales. I had been dabbling in a new technology called Hadoop, and it seemed like as good a time as any to take the plunge and become a Hadoop consultant (the world’s first), doing work for companies like Last.fm and Ning. I was also made a Hadoop committer, and got more involved with the wider Apache Software Foundation, eventually serving on various project committees and becoming a member of the foundation.
In 2008 I started writing a book about Hadoop. The first edition was published by O’Reilly the following year; it was launched at the Hadoop Summit and every attendee was given a copy. The book became a bestseller, and I wrote four editions in total, which were translated into three languages.
2008 was a busy year, since in addition to writing a book, I joined Cloudera, the Hadoop startup, as its second employee. Then just one year later, Eliane, the girls, and I upped sticks again and moved to San Francisco, where we lived until we returned to Wales in 2012. At Cloudera, I continued my work on core Hadoop and its growing ecosystem of projects and tools. Then in 2014 I moved roles at Cloudera from engineering to the data science team, where I got a taste for solving challenging problems in big data, like genomics (previously I had been building the systems that others used to solve the problems).
In 2018 I went freelance, working for the Broad Institute on their genomics toolkit (GATK), and in Uri Laserson’s lab at Mount Sinai on single-cell software tools. In both cases I was concerned with making the software run faster and scale to bigger datasets. The single-cell work focused on Scanpy and related SciPy projects (UMAP, pynndescent).
In 2020 I paused my freelance work to spend time on various community data analysis and visualization projects. Then in early March, as coronavirus began to spread, I started collating the disparate sources of UK COVID-19 data.
In June 2020, I went back to the world of big data and genomics and joined Jeff Hammerbacher’s team at Related Sciences to work on sgkit, a statistical genetics toolkit based on PyData technologies like Dask, Xarray, and Zarr.