
MapReduce is a programming paradigm that allows for massive scalability across hundreds or thousands of servers in a cluster environment. The term MapReduce originated from functional programming and was introduced by Google in a paper called “MapReduce: Simplified Data Processing on Large Clusters.” Google’s MapReduce implementation is a proprietary solution and has not yet been released to the public.
A simple view of the MapReduce process is illustrated in the figure below. Using the MapReduce paradigm, you focus on writing two functions:
map()
Filters and aggregates data
reduce()
Reduces, groups, and summarizes by keys generated by map()
[Figure: a simple view of the MapReduce process]

These two functions can be defined as follows:
map() function
The master node takes the input, partitions it into smaller data chunks, and distributes them to worker (slave) nodes. The worker nodes apply the same transformation function to each data chunk, then pass the results back to the master node. In MapReduce, the programmer defines a mapper with the following signature:
map(): (Key1, Value1) → [(Key2, Value2)]
reduce() function
The master node shuffles and groups the received results by their unique keys; then, through another redistribution to the workers/slaves, these values are combined via another type of transformation function. In MapReduce, the programmer defines a reducer with the following signature:
reduce(): (Key2, [Value2]) → [(Key3, Value3)]
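For example, in a classic word-count job the mapper turns a (document ID, text) pair into (word, 1) pairs, and the reducer sums the counts for each word. Below is a minimal Python sketch of those two signatures; the function names and the word-count logic are illustrative assumptions and are not tied to any particular MapReduce framework:

    from typing import Iterator, List, Tuple

    # map(): (Key1, Value1) -> [(Key2, Value2)]
    # Key1 is a document ID, Value1 its text; Key2 is a word, Value2 a count.
    def map_words(doc_id: str, text: str) -> Iterator[Tuple[str, int]]:
        for word in text.split():
            yield (word, 1)  # emit one (word, 1) pair per occurrence

    # reduce(): (Key2, [Value2]) -> [(Key3, Value3)]
    # The framework hands the reducer one key together with all values emitted for it.
    def reduce_counts(word: str, counts: List[int]) -> List[Tuple[str, int]]:
        return [(word, sum(counts))]  # a single (word, total) pair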
In the figure above, input data is partitioned into small chunks, and each chunk is sent to a mapper. Each mapper may generate any number of key-value pairs. The mappers’ output is illustrated in the table below.

[Table: the mappers’ output as key-value pairs]

In this example, all mappers generate only two unique keys: {K1, K2}. When all mappers have completed, their output is sorted, shuffled, grouped by key, and sent to the reducers. Finally, the reducers generate the desired outputs. For this example, there are two reducers, identified by the keys {K1, K2} (illustrated in the table below).

[Table: the reducers’ output]

Once all mappers have completed, the reducers start their execution. Each reducer may output any number of new key-value pairs (zero or more).
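To make this flow concrete, here is a small, hypothetical in-memory simulation of the whole pipeline (map, shuffle/group by key, reduce) in Python. It is only an illustration of the paradigm, not a distributed implementation, and the word-count mapper and reducer are assumed example functions:

    from collections import defaultdict

    def simulate_mapreduce(chunks, mapper, reducer):
        # Map phase: every input chunk is processed independently.
        mapped = []
        for i, chunk in enumerate(chunks):
            mapped.extend(mapper(f"chunk-{i}", chunk))

        # Shuffle phase: group all mapped values by their key.
        groups = defaultdict(list)
        for key, value in mapped:
            groups[key].append(value)

        # Reduce phase: each key, with its list of values, is reduced independently.
        output = []
        for key, values in groups.items():
            output.extend(reducer(key, values))
        return output

    # Word count as the mapper/reducer pair (hypothetical example data).
    def word_mapper(_doc_id, text):
        return [(w, 1) for w in text.split()]

    def sum_reducer(word, counts):
        return [(word, sum(counts))]

    print(simulate_mapreduce(["map reduce map", "reduce reduce"], word_mapper, sum_reducer))
    # [('map', 2), ('reduce', 3)]

In a real cluster, the map and reduce loops above run on different machines, and the grouping step is performed by the framework’s sort-and-shuffle machinery rather than an in-memory dictionary.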
When writing map() and reduce() functions, the solution needs to be scalable. For example, if your solution relies on a data structure (such as a List, Array, or HashMap) that will not easily fit into the memory of a commodity server, then it is not scalable. Scalability is therefore the heart of MapReduce: if your MapReduce solution does not scale well, you should not call it a MapReduce solution.
The core concept behind MapReduce is mapping the input data set into a collection of key-value pairs, and then reducing overall pairs with the same key. Even though the overall concept is simple, it is actually quite expressive and powerful when you consider that:
• Almost all data can be mapped into key-value pairs.
• The keys and values may be of any type: Strings, Integers, FASTQ (for DNA sequencing), user-defined custom types, and, of course, key-value pairs themselves.
How does MapReduce scale over a set of servers? The key to how MapReduce works is to take the input as, conceptually, a list of records (each single record can be one or more lines of data). The input records are split and passed to the many servers in the cluster to be consumed by the map() function. The result of the map() computation is a list of key-value pairs. The reduce() function then takes each set of values that have the same key and combines them into a single value (or set of values). In other words, the map() function takes a set of data chunks and produces key-value pairs, and reduce() merges the output generated by map(), so that instead of a set of key-value pairs you get the desired result.
One of the major benefits of MapReduce is its “shared-nothing” data-processing platform. This means that all mappers can work independently, and when mappers complete their tasks, reducers start to work independently (no data or critical region is shared among mappers or reducers; having a critical region will slow distributed computing). This shared-nothing paradigm enables us to write map() and reduce() functions easily and improves parallelism effectively and effortlessly.

Simple explanation of MapReduce
Let’s say that we want to count the number of books in a library that has 1,000 shelves and report the final result to the librarian. Here are two possible MapReduce solutions (a short code sketch of Solution #2 follows the list):
• Solution #1 (using map() and reduce()):
— map(): Hire 1,000 workers; each worker counts one shelf.
— reduce(): All workers get together and add up their individual counts (by reporting the results to the librarian).
• Solution #2 (using map(), combine(), and reduce()):
— map(): Hire 1,110 people (1,000 workers, 100 managers, and 10 supervisors; each supervisor manages 10 managers, and each manager manages 10 workers); each worker counts one shelf and reports its count to its manager.
— combine(): Every group of 10 managers adds up the counts reported by their workers and reports the total to their supervisor.
— reduce(): All supervisors get together and add up their individual counts (by reporting the results to the librarian).
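Here is a toy Python sketch of Solution #2, under the simplifying assumption that each shelf is represented only by its book count; the combine() step plays the role of the managers, pre-aggregating groups of 10 worker counts, and the supervisor level is folded into the final reduce for brevity:

    # Hypothetical data: each shelf is represented only by its book count.
    shelves = [3, 7, 5, 2] * 250  # 1,000 shelves

    # map(): one "worker" per shelf emits a single (key, count) pair.
    mapped = [("books", count) for count in shelves]

    # combine(): each "manager" pre-aggregates the counts of its 10 workers.
    combined = [("books", sum(count for _, count in mapped[i:i + 10]))
                for i in range(0, len(mapped), 10)]

    # reduce(): add up the managers' subtotals and report the result to the librarian.
    total = sum(count for _, count in combined)
    print(total)  # 4250

The benefit of the combine() step is that far fewer values reach the final reduce: 100 subtotals instead of 1,000 individual counts.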

When to use MapReduce
Is MapReduce good for everything? The simple answer is no. When we have big data, if we can partition it and each partition can be processed independently, then we can start to think about MapReduce algorithms.
Here are some scenarios where MapReduce should not be used:
• If the computation of a value depends on previously computed values. One good example is the Fibonacci series, where each value is the sum of the previous two values (see the short sketch after this list):
F(k + 2) = F(k + 1) + F(k)
• If the data set is small enough to be computed on a single machine. It is better to do this as a single reduce(map(data)) operation rather than going through the entire MapReduce process.
• If synchronization is required to access shared data.
• If all of the input data fits in memory.
• If one operation depends on other operations.
• If basic computations are processor-intensive.
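As an illustration of the first point, computing the Fibonacci series directly is inherently sequential; this short, purely illustrative Python sketch shows why the work cannot be split into independent chunks for mappers:

    def fib_series(n):
        # Each new value needs the two values computed immediately before it,
        # so no chunk of the series can be computed independently of the rest.
        series = [0, 1]
        for _ in range(n - 2):
            series.append(series[-1] + series[-2])  # F(k + 2) = F(k + 1) + F(k)
        return series[:n]

    print(fib_series(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]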
However, there are many cases where MapReduce is appropriate, such as:
• When one has to handle lots of input data (e.g., aggregate or compute statistics over large amounts of data).
• When one needs to take advantage of parallel and distributed computing, data storage, and data locality.
• When one can do many tasks independently without synchronization.
• When one can take advantage of sorting and shuffling.
• When one needs fault tolerance and cannot afford job failures.

What MapReduce isn’t
MapReduce is a groundbreaking technology for distributed computing, but there are a lot of myths about it, some of which are debunked here:
• MapReduce is not a programming language, but rather a framework to develop distributed applications using Java, Scala, and other programming languages.
• MapReduce’s distributed file system is not a replacement for a relational database management system (such as MySQL or Oracle). Typically, the input to MapReduce is plain-text files (a mapper input record can be one or many lines).
• The MapReduce framework is designed mainly for batch processing, so we should not expect to get the results in under two seconds; however, with proper use of clusters one may achieve near-real-time response.
• MapReduce is not a solution for all software problems.

Why use MapReduce?
As discussed earlier, MapReduce works on the premise of “scaling out” by adding more commodity servers. This is in contrast to “scaling up,” which adds more resources, such as memory and CPUs, to a single node in a system; scaling up can be very costly, and at some point you won’t be able to add more resources due to cost and software or hardware limits. Many times, there are promising main memory–based algorithms available for solving data problems, but they lack scalability because the main memory is a bottleneck. For example, in DNA sequencing analysis, you might need over 512 GB of RAM, which is very costly and not scalable.
If you need to increase computational power, you’ll have to distribute it across more than one machine; MapReduce/Hadoop and Spark/Hadoop let you do that by writing just two functions: map() and reduce(). So it’s clear that data analytics has a powerful new tool with the MapReduce paradigm, which has recently surged in popularity thanks to open source solutions such as Hadoop.
In a nutshell, MapReduce provides the following benefits:
• Programming model + infrastructure
• The ability to write programs that run on hundreds/thousands of machines
• Automatic parallelization and distribution
• Fault tolerance (if a server dies, the job will be completed by other servers)
• Program/job scheduling, status checking, and monitoring
