Update: This quote from Tim O'Reilly in his OSCON keynote today sums up the changes I describe below: "Do less and then create extensibility mechanisms." (via Matt Raible)
I'm noticing an increased desire to make Hadoop more modular. I'm not sure why this is happening now, but it's probably because, as more people start using Hadoop, it needs to be more malleable: people want to plug in their own implementations of things, and the way to do that in software is through modularity.
Some examples:
Job scheduling
The current scheduler is a simple FIFO scheduler, which is adequate for small clusters with a few cooperating users. On larger clusters the best advice has been to use HOD (Hadoop On Demand), but that has its own problems with inefficient cluster utilization. This situation has led to a number of proposals to make the scheduler pluggable (HADOOP-2510, HADOOP-3412, HADOOP-3444). Already there is a fair scheduler implementation (like the Completely Fair Scheduler in Linux) from Facebook.
HDFS block placement
Today the algorithm for placing a file's blocks across datanodes in the cluster is hardcoded into HDFS, and while it has evolved, it is clear that a one-size-fits-all approach is not necessarily best. Hence the new proposal to support pluggable block placement algorithms.
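As a rough sketch (the type and method names here are made up, not the actual proposal), a pluggable placement policy reduces to a single question: given a replication factor and the live datanodes, which targets get the block?

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Stand-in for an HDFS datanode descriptor, invented for this sketch.
class DatanodeInfo {}

interface BlockPlacementPolicy {
    List<DatanodeInfo> chooseTargets(int replication, DatanodeInfo writer,
                                     List<DatanodeInfo> liveNodes);
}

// A deliberately naive policy that scatters replicas at random. The
// current rack-aware behaviour (first replica local, second on a
// different rack, third on the second replica's rack) would be another
// implementation of the same interface.
class RandomPlacementPolicy implements BlockPlacementPolicy {
    public List<DatanodeInfo> chooseTargets(int replication, DatanodeInfo writer,
                                            List<DatanodeInfo> liveNodes) {
        List<DatanodeInfo> shuffled = new ArrayList<DatanodeInfo>(liveNodes);
        Collections.shuffle(shuffled);
        return shuffled.subList(0, Math.min(replication, shuffled.size()));
    }
}
```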
Instrumentation
Finding out what is happening in a distributed system is a hard problem. Today, Hadoop has a metrics API (for gathering statistics from the main components of Hadoop), but there is interest in plugging in other tracing and logging systems, such as X-Trace, via a new instrumentation API.
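The shape of such an instrumentation API might look something like the sketch below. The interface name and methods are hypothetical, but the idea is that components report events through an interface, and the backend, whether the existing metrics system, X-Trace, or plain logging, becomes swappable:

```java
// Hypothetical instrumentation hook, invented for this sketch.
interface JobTrackerInstrumentation {
    void taskLaunched(String taskId);
    void taskCompleted(String taskId);
    void taskFailed(String taskId);
}

// One possible backend: simple counters in the style of the metrics API.
// An X-Trace backend would implement the same interface but emit trace
// events instead.
class CountingInstrumentation implements JobTrackerInstrumentation {
    private long launched, completed, failed;

    public void taskLaunched(String taskId) { launched++; }
    public void taskCompleted(String taskId) { completed++; }
    public void taskFailed(String taskId) { failed++; }

    public String toString() {
        return "launched=" + launched + " completed=" + completed
                + " failed=" + failed;
    }
}
```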
Serialization
The ability to use pluggable serialization frameworks in MapReduce appeared in Hadoop 0.17.0, but has received renewed interest due to the talk around Apache Thrift and Google Protocol Buffers.
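The contract is small. Simplified here (the real definitions live in the org.apache.hadoop.io.serializer package and carry more detail), it amounts to a framework answering whether it can handle a type and handing back a serializer/deserializer pair:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Simplified rendering of the pluggable serialization contract.
interface Serializer<T> {
    void open(OutputStream out) throws IOException;
    void serialize(T t) throws IOException;
    void close() throws IOException;
}

interface Deserializer<T> {
    void open(InputStream in) throws IOException;
    T deserialize(T reuse) throws IOException;
    void close() throws IOException;
}

interface Serialization<T> {
    boolean accept(Class<?> c);              // can this framework handle c?
    Serializer<T> getSerializer(Class<T> c);
    Deserializer<T> getDeserializer(Class<T> c);
}
```

A Thrift or Protocol Buffers binding would then be one more Serialization implementation, registered through the io.serializations configuration property.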
Component lifecycle
There is work being done to add a lifecycle interface to Hadoop components. One of the goals is to make it easier to subclass components, so they can be customized.
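A sketch of what such a lifecycle interface could look like (hypothetical; the actual interface under discussion may differ):

```java
// Hypothetical lifecycle interface, invented for this sketch. A uniform
// set of state transitions gives subclasses well-defined points to hook
// their customizations into.
interface Lifecycle {
    enum State { CREATED, STARTED, STOPPED, CLOSED }

    void start() throws Exception;  // CREATED -> STARTED
    void stop() throws Exception;   // STARTED -> STOPPED
    void close() throws Exception;  // release resources, -> CLOSED
    State getState();
}
```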
Remove dependency cycles
This is really just good engineering practice, but the existence of dependency cycles makes it harder to understand, modify and extend code. Bill de hÓra did a great analysis of Hadoop's code structure (and its deficiencies), which has led to some work to enforce module dependencies and remove the cycles.
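One way to keep cycles from creeping back is to fail the build when a new one appears, for example with a tool like JDepend (the class directory below is just an example path):

```java
import java.io.IOException;
import jdepend.framework.JDepend;

// Fails loudly if any package dependency cycle exists among the compiled
// classes; wiring this into the build keeps cycles from silently
// returning once they've been removed.
public class CycleCheck {
    public static void main(String[] args) throws IOException {
        JDepend jdepend = new JDepend();
        jdepend.addDirectory("build/classes"); // example path
        jdepend.analyze();
        if (jdepend.containsCycles()) {
            System.err.println("Package dependency cycles detected");
            System.exit(1);
        }
    }
}
```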