I went to Devoxx in Antwerp for the first time this year, and really enjoyed it. I didn't go to that many talks, but the quality seemed very high. My favourite talk was "Performance Anxiety" by Josh Bloch, because he's a great speaker and because he presented a single important idea so well.

The idea was this: determining the performance of programs should be treated as an empirical science. We should give up any hope (if any existed) that predicting a program's performance will become easier in the future, since every layer in the deep stack of a modern computer is becoming more complex. Increased complexity is actually the price we must pay for increased performance. And increased complexity leads, almost inevitably, to reduced predictability.

As an experimental demonstration, Josh ran a microbenchmark that sorted an array of integers. (The demo actually failed to show what he wanted to show, but he assured us it had worked earlier... It's somehow reassuring when live demos don't work for Java demigods either.) Each invocation of the benchmark performed a number of runs, and within a single invocation the timings of those runs converged on a stable value. Between benchmark invocations, however, the stable values they converged on varied by up to 20%.
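
To make this concrete, here is a minimal sketch of that kind of microbenchmark (my own illustration, not Josh's code). It times repeated sorts of fresh copies of the same data within a single JVM invocation; those per-run timings are what settle on a stable value.

```java
import java.util.Arrays;
import java.util.Random;

// A minimal microbenchmark sketch: time repeated sorts of the same data
// within one JVM invocation and watch the per-run timings settle.
public class SortTiming {
    public static void main(String[] args) {
        Random random = new Random(42);
        int[] original = new int[1000000];
        for (int i = 0; i < original.length; i++) {
            original[i] = random.nextInt();
        }
        for (int run = 0; run < 10; run++) {
            int[] copy = original.clone();   // sort a fresh copy each run
            long start = System.nanoTime();
            Arrays.sort(copy);
            long elapsedNanos = System.nanoTime() - start;
            System.out.printf("run %d: %.1f ms%n", run, elapsedNanos / 1e6);
        }
    }
}
```

Run the whole program several times and compare the values the later runs settle on; the spread between those invocations is where the 20% figure comes from.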

The reason is subtle: the HotSpot compiler produces different compile plans on different runs, and these have different performance profiles. (This is explained in Cliff Click's 2009 JavaOne presentation, "The Art of (Java) Benchmarking".) They all converge on stable values, but different stable values for different runs. The fact that HotSpot is non-deterministic may not be particularly surprising, but Josh said that the same behaviour has been shown in C code and even assembler, since non-determinism exists at lower levels of the stack too.
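
If you want to see some of this non-determinism for yourself, one option (my suggestion, not from the talk) is HotSpot's -XX:+PrintCompilation flag, which logs methods as the JIT compiler compiles them; comparing the output across invocations of the same program, for example the SortTiming sketch above, usually shows compilations happening at different points and in a different order from run to run.

```
java -XX:+PrintCompilation SortTiming
```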

The practical upshot is that we need to change how we iteratively benchmark code. It is no longer acceptable to run a benchmark, make a change, run the benchmark again, see that the execution time went down (even across a number of runs in one VM), and conclude that the improvement was due to the change we made. Instead we have to run the benchmark across enough VM invocations and reach for statistical tools to tell us whether the improvement is significant.

How many VMs? The short answer is "30"; the longer answer is in "Statistically Rigorous Java Performance Evaluation" by Andy Georges, Dries Buytaert, and Lieven Eeckhout.
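
As a sketch of the kind of statistics involved (my illustration, not code from the paper): collect the steady-state mean from each of the separate VM invocations and report a confidence interval for the true mean instead of a single number; the comparison then becomes whether the intervals for the old and new code overlap.

```java
// Illustration only: compute a 95% confidence interval from per-VM means.
public class ConfidenceInterval {
    public static void main(String[] args) {
        // Hypothetical steady-state means (ms), one per VM invocation, for illustration.
        double[] perVmMeans = {102.1, 98.4, 104.9, 99.7, 101.3, 97.8, 103.2, 100.5};
        int n = perVmMeans.length;
        double sum = 0;
        for (double x : perVmMeans) sum += x;
        double mean = sum / n;
        double sumSq = 0;
        for (double x : perVmMeans) sumSq += (x - mean) * (x - mean);
        double stdDev = Math.sqrt(sumSq / (n - 1));   // sample standard deviation
        // With 30 or more invocations the normal approximation is reasonable
        // (z = 1.96 for 95%); with fewer, a Student's t value should be used.
        double halfWidth = 1.96 * stdDev / Math.sqrt(n);
        System.out.printf("mean %.2f ms, 95%% CI [%.2f, %.2f]%n",
                mean, mean - halfWidth, mean + halfWidth);
    }
}
```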

Thankfully there is a Java framework called Caliper that helps you run microbenchmarks and even plots the error bars for you. This stuff needs to see wider adoption in the industry.
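
For a flavour of what that looks like, here is a sketch of a Caliper benchmark (based on the SimpleBenchmark API as I understand it at the time of writing; check the project's documentation for the authoritative form). You extend SimpleBenchmark and write timeXxx(int reps) methods; Caliper chooses the rep counts and handles warmup, measurement and reporting.

```java
import java.util.Arrays;
import java.util.Random;

import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

// A sketch of a Caliper benchmark: Caliper invokes timeSort with a rep count
// it chooses, and takes care of warmup, measurement and reporting.
public class SortBenchmark extends SimpleBenchmark {

    private int[] values;

    @Override protected void setUp() {
        Random random = new Random(42);
        values = new int[1000000];
        for (int i = 0; i < values.length; i++) {
            values[i] = random.nextInt();
        }
    }

    public void timeSort(int reps) {
        for (int i = 0; i < reps; i++) {
            int[] copy = values.clone();   // sort a fresh copy on every rep
            Arrays.sort(copy);
        }
    }

    public static void main(String[] args) {
        Runner.main(SortBenchmark.class, args);
    }
}
```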