MapReduce
- ali@fuzzywireless.com
- Mar 4, 2022
- 2 min read
MapReduce is an efficient yet simple programming framework (Yan, Yang, Yu, Li & Li, 2012). The framework divides a job into blocks located on different nodes, which are processed independently. As the name suggests, MapReduce consists of two functions: the map function takes a key-value pair as input and produces an intermediate output, while the reduce function processes the map output grouped by key and generates the final key-value output pairs (Zhang, Chen, Wang & Yu, 2016). The outputs of the map and reduce functions are stored on local disk or in the Hadoop Distributed File System (HDFS); however, at the end of a job all the intermediate outputs are deleted, so MapReduce is sometimes described as stateless because it does not preserve intermediate data (Yan et al., 2012).
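The map, shuffle-by-key and reduce stages described above can be sketched in miniature. This is an illustrative single-process word-count example, not Hadoop's actual API; the function and variable names are my own:

```python
from collections import defaultdict

def map_fn(_key, text):
    # Map stage: emit an intermediate (word, 1) pair for each word.
    for word in text.split():
        yield word, 1

def reduce_fn(key, values):
    # Reduce stage: aggregate all values grouped under one key.
    return key, sum(values)

def run_job(records):
    # Shuffle: group intermediate map outputs by key before reducing.
    groups = defaultdict(list)
    for key, text in records:
        for k, v in map_fn(key, text):
            groups[k].append(v)
    # In a real cluster, intermediate groups live on local disk/HDFS
    # and are discarded once the job completes (hence "stateless").
    return dict(reduce_fn(k, vs) for k, vs in groups.items())

result = run_job([(1, "map reduce map"), (2, "reduce")])
# result == {"map": 2, "reduce": 2}
```

In a real deployment the map and reduce calls run in parallel on different nodes; the sketch only shows the dataflow between the stages.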
Both the map and reduce functions run independently as two stages without interrupting each other, and each runs only once for a given job (Yan et al., 2012). Big data evolves continuously and therefore requires periodic incremental updates to keep results fresh, for instance in applications such as webpage ranking (Zhang, Chen, Wang & Yu, 2016). An incremental update means that usually only a small fraction of the data changes, whereas MapReduce lacks support for incremental processing and must reprocess everything, which is inefficient and wastes resources (Lee, Kim & Maeng, 2014). Yan et al. (2012) outlined some typical solutions that overcome this incremental-processing limitation of MapReduce, which is best suited to batch processing, by modifying the algorithms:
1. Incremental algorithms, which require incremental input with stateful and uncoupled dataflow. The advantage of this solution is that no change to the parallel framework is needed; the drawback is complicated algorithm design with manual user intervention.
2. Continuous bulk processing, which requires incremental input with coupled dataflow. This approach offers a new model and primitives that take the intermediate results of prior executions as part of the explicit input; however, delicate dataflows must be built for each application. Yahoo's CBP and Google's Percolator are popular examples of continuous bulk processing.
3. Incremental computation based on MapReduce, which requires incremental input with coupled dataflow. The algorithm and MapReduce APIs remain the same, but HDFS must be modified to support incremental data discovery and storage of intermediate results, and the kernels of the map and reduce stages must be modified.
4. IncMR, which requires incremental input with coupled dataflow. The algorithm, MapReduce APIs and HDFS remain the same, but the state data must be repartitioned.
The IncMR framework offers multiple job submission options, such as one-time, initial, incremental and continuous runs (Yan et al., 2012). Iterative computation requires processing new requests based on former results, which is made possible by support for multiple states, storage of historical computing data and repartitioning of prior map outputs (Yan et al., 2012).
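The core idea behind the incremental approaches above, namely reusing stored map state so that only changed input partitions are reprocessed, can be sketched as follows. This is a simplified illustration of the concept, not IncMR's actual interface; the class and method names are hypothetical:

```python
from collections import defaultdict

def map_fn(text):
    # Map stage: emit a (word, 1) pair per word.
    for word in text.split():
        yield word, 1

class IncrementalWordCount:
    """Keeps per-partition map outputs as state, so an update
    re-runs the map stage only on the partitions that changed."""

    def __init__(self):
        # Partition id -> stored intermediate (key, value) pairs.
        # Unlike stateless MapReduce, this state survives between runs.
        self.state = {}

    def update(self, partition_id, text):
        # Re-map only the changed partition; all others are reused as-is.
        self.state[partition_id] = list(map_fn(text))

    def reduce_all(self):
        # Reduce over stored state from changed and unchanged partitions.
        totals = defaultdict(int)
        for pairs in self.state.values():
            for key, value in pairs:
                totals[key] += value
        return dict(totals)

job = IncrementalWordCount()
job.update(0, "a b a")
job.update(1, "b c")
first = job.reduce_all()   # {"a": 2, "b": 2, "c": 1}
job.update(1, "c c")       # only partition 1 changed; partition 0 is reused
second = job.reduce_all()  # {"a": 2, "b": 1, "c": 2}
```

The saving comes from skipping the map stage for unchanged partitions; a full system would also persist this state to HDFS and incrementalize the reduce stage where the aggregation allows it.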
References:
Yan, C., Yang, X., Yu, Z., Li, M. & Li, X. (2012). IncMR: Incremental data processing based on MapReduce. Retrieved from http://www.s3lab.ece.ufl.edu/publication/cloud12.pdf
Zhang, Y., Chen, S., Wang, Q. & Yu, G. (2016). I2MapReduce: Incremental MapReduce for mining evolving big data (extended abstract). 2016 IEEE 32nd International Conference on Data Engineering.
Lee, D., Kim, J. & Maeng, S. (2014). Large-scale incremental processing with MapReduce. Future Generation Computer Systems, 66-79.