MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat
Processing large datasets can sometimes be done with algorithms that are embarrassingly parallel. As the number of datasets worth processing in parallel grows, so does the benefit of a framework that abstracts away the minutiae of orchestrating parallel computations: handling failures, edge cases and optimizations. Without such a framework, every dataset worth processing in parallel ends up with special-purpose boilerplate code built for that particular type of dataset; thus, the wheel is reinvented, always a bit different, but always round.
Like any framework, the main advantages are that no one needs to reinvent the wheel: users inherit the framework's parallel-processing optimizations, and efficient parallel data processing becomes feasible even for datasets that would not have justified the investment of writing parallel orchestration code by hand.
The authors propose a model in which a master node orchestrates upwards of hundreds of worker nodes performing the parallel computations. The model organizes the workers into two groups: mapping workers and reducing workers. Each mapping worker receives a fragment of the large dataset and transforms it, independently of the other fragments, into an intermediary result (e.g. each worker is given a negative integer from a list of negative integers and produces a positive integer). Reducing workers are directed by the master node to perform a computation on several of these intermediary results; the intermediary results are reduced to a smaller set of final results (e.g. taking several positive integers generated by the mapping workers and computing a running total by iterating over them). The input dataset, the mapping function and the reducing function are provided by the user.
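To make the division of labor concrete, here is a minimal sketch in Python of the integer example above. The names (`map_fn`, `reduce_fn`, `run`) and the sequential driver are hypothetical stand-ins for the master node and worker pool, not the paper's actual interface, which operates over key/value pairs and distributes fragments across machines.

```python
# Hypothetical sketch of the division of labor described above.
# A sequential driver stands in for the master node so that the
# user-supplied pieces (map and reduce functions) are easy to see.

def map_fn(fragment):
    """Mapping work: turn one negative integer into a positive one."""
    return abs(fragment)

def reduce_fn(intermediates):
    """Reducing work: fold several intermediary results into one."""
    return sum(intermediates)

def run(dataset, map_fn, reduce_fn, group_size=3):
    # "Master": hand each fragment to a mapper, then feed groups
    # of intermediary results to reducers.
    intermediates = [map_fn(x) for x in dataset]
    groups = [intermediates[i:i + group_size]
              for i in range(0, len(intermediates), group_size)]
    return [reduce_fn(g) for g in groups]

print(run([-1, -2, -3, -4, -5], map_fn, reduce_fn))  # [6, 9]
```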
Critique
It’s unclear from the paper whether reduce workers can iterate over existing final results: that is, if the end goal is to obtain the sum of a very large list of numbers, we would need one reduce worker to process the entire list so as to produce a single output result. What is unclear is whether the final results of reduce workers can be fed back into the system as intermediary results and re-processed by reduce workers until the final result is a single datum.
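To illustrate the question, here is a hypothetical sketch of what such feedback would look like: reduce outputs are repeatedly treated as intermediary results until one datum remains. Whether the framework itself supports this loop is exactly what is unclear; the simulation below is sequential and assumes an associative reduce function such as `sum`.

```python
# Hypothetical: feed reduce outputs back in as intermediary
# results, round after round, until a single datum remains.

def reduce_rounds(values, reduce_fn, group_size=3):
    while len(values) > 1:
        values = [reduce_fn(values[i:i + group_size])
                  for i in range(0, len(values), group_size)]
    return values[0]

print(reduce_rounds(list(range(1, 11)), sum))  # 55
```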
The authors put forward that a satisfying number of problems can be solved by providing a mapping function and a reducing function, so the MapReduce model can address a wide variety of dataset computations. They explain that while MapReduce is a restricted model compared with more general, more powerful models like Bulk Synchronous Programming, the key difference is essentially transparent fault-tolerance and ease of use.
The authors provide a set of benchmarks for simple but archetypal operations and compare running speeds with and without optimizations, demonstrating how the optimizations, though seemingly trivial, have a marked effect on processing time.