Hadoop MapReduce and HDFS are fairly stable pieces of software. One component that doesn't have a clear winner yet is higher-level job scheduling, also known as workflow scheduling.
To put this in context for someone who isn't familiar with Hadoop, a single Hadoop job is broken up into many map and reduce tasks. The scheduler runs on the JobTracker and assigns tasks to open slots on the TaskTrackers running on the worker nodes. When we talk about the scheduler in Hadoop, this is usually what we mean. By default, Hadoop uses a FIFO scheduler, but there are two more advanced schedulers in wide use. The Capacity Scheduler focuses on ensuring that each user of a shared cluster has access to a guaranteed number of slots. The Fair Scheduler focuses on providing good latency for small jobs while long-running large jobs share the same cluster. These schedulers closely parallel processor scheduling, with Hadoop jobs corresponding to processes and the map and reduce tasks corresponding to time slices.
The next level up is workflow scheduling: starting jobs on a cluster in the right order, respecting the dependencies between them. Sometimes a single MapReduce job is all you need. More frequently, you will have many jobs with dependencies between them. For example, you might want to identify the most important words in each document using term frequency–inverse document frequency, which requires first calculating the inverse document frequency, then making use of that while examining the documents again. In this case, a shell script that runs the first job, waits for it to complete, and then starts the second will work.
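
To make the "run one job, wait, then run the next" idea concrete, here is a minimal sketch of the same thing in Python instead of shell. The jar name, class names and paths in TFIDF_JOBS are hypothetical placeholders, not anything from a real project; the only real API used is subprocess.run from the standard library.

```python
import subprocess

def run_in_order(commands):
    """Run each command to completion before starting the next.

    Stops at the first command that exits with a non-zero status and
    returns how many commands completed successfully.
    """
    completed = 0
    for cmd in commands:
        result = subprocess.run(cmd)  # blocks until the process exits
        if result.returncode != 0:
            break  # do not start later jobs if an earlier one failed
        completed += 1
    return completed

# Hypothetical two-step TF-IDF pipeline: the second job reads the
# output directory written by the first.
TFIDF_JOBS = [
    ["hadoop", "jar", "tfidf.jar", "InverseDocFrequency", "docs", "idf"],
    ["hadoop", "jar", "tfidf.jar", "TermWeights", "docs", "idf", "weights"],
]
```

Calling run_in_order(TFIDF_JOBS) on a machine with the hadoop command available would start the second job only if the first one exited cleanly. That is about as much as a linear script can offer.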
Once you go down this path, you start running into difficulties. Perhaps job C depends on job A and job B, but it's fine for A and B to run in parallel. If D depends on B and C, and B and C depend on A, and B fails partway through, how do you recover? It's not a particularly hard problem, but it's enough of a problem that we'd rather not reinvent the wheel. After all, while people use Hadoop for different tasks, this workflow scheduling problem is common to everyone.
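
The A/B/C/D example above can be sketched as a small dependency graph. This is only an illustration of the problem, not how any particular workflow tool does it: the names WORKFLOW, ready_jobs and run_workflow are made up for this post, and run_job stands in for whatever actually submits a Hadoop job and reports whether it succeeded.

```python
# D depends on B and C; B and C both depend on A.
WORKFLOW = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}

def ready_jobs(deps, done, failed):
    """Jobs not yet attempted whose dependencies have all succeeded."""
    return sorted(
        job for job, needs in deps.items()
        if job not in done and job not in failed
        and all(n in done for n in needs)
    )

def run_workflow(deps, run_job, done=None):
    """Run jobs wave by wave and return (done, failed).

    run_job(name) must return True on success. Jobs already in `done`
    are skipped, so a failed run can be resumed by passing the `done`
    set from the previous attempt.
    """
    done = set(done or ())
    failed = set()
    while True:
        wave = ready_jobs(deps, done, failed)
        if not wave:
            return done, failed
        # Jobs in one wave do not depend on each other, so a real
        # scheduler could launch them in parallel; here they run in turn.
        for job in wave:
            (done if run_job(job) else failed).add(job)
```

If B fails on the first run, the result is done = {A, C} and failed = {B}, and D is never started. Passing that done set back into run_workflow reruns only B and then D. Even this toy version needs to keep track of state between runs, which is the part that gets tedious to redo for every project.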
I recently sent out a poll to the Hadoop mailing list to see how people are solving this problem.
