MapReduce's weakness in iterative computation
- ali@fuzzywireless.com
- Mar 4, 2022
- 2 min read
Among the key limitations of MapReduce is that, while it serves batch-processing tasks well, it is unsuitable for real-time streaming data because of the time-intensive Map and Reduce computation phases (Data Flair, 2017). Another drawback is the lack of support for iterative data processing, which requires a cyclic dataflow (the output of one stage becomes the input of the next, and so on). Lack of abstraction is a further shortcoming: each and every operation requires hand-written MapReduce code, as the sketch below illustrates. The framework also offers no in-memory caching of intermediate data, which significantly impacts its performance (Data Flair, 2017).
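To illustrate the abstraction problem, here is a minimal sketch of a word count written in the Hadoop Streaming style; the single-file map/reduce dispatch is my own convention for brevity, but the ceremony itself (a hand-written mapper and reducer exchanging sorted, tab-separated text) is what even this trivial operation requires.

```python
#!/usr/bin/env python3
# wordcount.py: run as "wordcount.py map" for the mapper and
# "wordcount.py reduce" for the reducer (an illustrative convention).
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Hadoop delivers mapper output sorted by key, so equal words are adjacent.
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```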
Inefficient handling of iterative algorithms has plagued MapReduce for quite a while (Lee, Kang, Youn, Lee, & Kwon, 2016). Key algorithms that require iterative processing include recursive queries, k-means, PageRank, and logistic regression, none of which MapReduce supports directly. The weakness stems from the fact that MapReduce does not store intermediate results in memory, which significantly impacts the framework's performance (Data Flair, 2017). Any driver program that reloads and reprocesses the data at each iteration, as in the sketch below, significantly wastes disk input/output, processing, and network bandwidth (Lee et al., 2016).
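To make the waste concrete, here is what such a driver loop typically looks like. This is only a sketch: KMeansJob, kmeans.jar, and the HDFS paths are illustrative names, not a real API, but the shape of the loop is the standard pattern for iterative work on plain MapReduce.

```python
import subprocess

# Hypothetical driver for iterative k-means on plain MapReduce.
# Every iteration launches a complete job: the entire input is
# re-read from HDFS, and the updated centroids are written back
# to disk, only to be re-read on the next pass.
MAX_ITERATIONS = 10

for i in range(MAX_ITERATIONS):
    subprocess.run(
        [
            "hadoop", "jar", "kmeans.jar", "KMeansJob",  # illustrative jar/class
            "hdfs:///data/points",                       # full dataset, re-read each time
            f"hdfs:///tmp/centroids/{i}",                # intermediate result, spilled to disk
        ],
        check=True,
    )
    # A convergence check is omitted; it would require yet another
    # round trip to HDFS to compare the old and new centroids.
```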
Several Hadoop variants and new platforms have tried to resolve MapReduce's shortcomings in iterative task processing; these include HaLoop, Twister, iMapReduce, MapReduce Online, and Spark (Lee et al., 2016). Of the systems listed above, Spark and Twister are implemented independently of Hadoop. The prevalent techniques used to resolve the issue and support iterative computing are disk caching and distributed shared memory (Lee et al., 2016).
Dean Wampler (2015) highlighted the machine-learning use case, in which neural networks and other algorithms predict an outcome, the error is computed, and the model is then tweaked to reduce that error; MapReduce does not support this type of iterative task. Similarly, graph analysis of social media data must follow links between connections, their connections, and so on, traversing one hop per iteration (Wampler, 2015). The key difference between Spark and MapReduce is that Spark keeps data in memory, while MapReduce reads from and writes to disk between stages (MapR, 2018). Executing from memory enhances performance significantly. Likewise, Spark's resilient distributed dataset (RDD) can be cached, offering in-memory data for iterative algorithms (MapR, 2018); the sketch below shows the pattern.
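A minimal PySpark sketch of this caching pattern, under stated assumptions: the HDFS path is hypothetical, and each input line is assumed to hold two whitespace-separated features followed by a 0/1 label. The dataset is parsed once, pinned in memory with cache(), and then rescanned cheaply by every iteration of a simple logistic-regression loop.

```python
from math import exp
from pyspark import SparkContext

sc = SparkContext(appName="CachedIterations")

# Parse the points once; cache() keeps the parsed RDD in memory,
# so later iterations skip the disk read and the re-parsing entirely.
# The path and line format ("x1 x2 label") are assumptions for this sketch.
data = (sc.textFile("hdfs:///data/points.txt")
          .map(lambda line: [float(v) for v in line.split()])
          .cache())

w = [0.0, 0.0]  # model weights, one per feature

for step in range(10):
    # Each pass is an in-memory scan of the cached RDD,
    # not a fresh read from HDFS.
    grad = data.map(lambda p: [
        (1.0 / (1.0 + exp(-(w[0] * p[0] + w[1] * p[1]))) - p[2]) * p[0],
        (1.0 / (1.0 + exp(-(w[0] * p[0] + w[1] * p[1]))) - p[2]) * p[1],
    ]).reduce(lambda a, b: [a[0] + b[0], a[1] + b[1]])
    w = [w[i] - 0.1 * grad[i] for i in range(2)]

print(w)
sc.stop()
```

Dropping the cache() call would leave the result unchanged but force every iteration to re-read and re-parse the file, which is exactly the MapReduce behavior described above.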
References:
Lee, H., Kang, M., Youn, S., Lee, J., & Kwon, Y. (2016). An experimental comparison of iterative MapReduce frameworks. ACM Cluster Computing, 20(4), 3593-3604.
Wampler, D. (2015). How Spark beats MapReduce: Event streaming, iterative algorithms and elasticity. Retrieved from https://www.lightbend.com/blog/how-spark-beats-mapreduce-event-streaming-iterative-algorithms-and-elasticity
MapR. (2018). Parallel and iterative processing for machine learning recommendations with Spark. Retrieved from https://mapr.com/blog/parallel-and-iterative-processing-machine-learning-recommendations-spark/
Data Flair. (2017). 13 big limitations of Hadoop and solution to Hadoop drawbacks. Retrieved from https://data-flair.training/blogs/13-limitations-of-hadoop/