MapReduce and Hadoop

ali@fuzzywireless.com
Mar 4, 2022
2 min read

The basic architecture of MapReduce offer simple and powerful programming model on large clusters of commodity machines for scalable parallel applications to process large volume of data sets (Sakr, Liu & Fayoumi, 2014). Key advantage from this implementation is fault tolerance, low hardware cost, scalability, and highly parallel architecture. The term MapReduce is consisted of two different processes, namely Map and Reduce, where Map process converts input data into tuples of key/value pairs while reduce process converts these tuples into smaller set of tuples after combining (IBM, 2018).

Hadoop is an open source, distributed computing, reliable and scalable framework using simple programming models (Apache Hadoop, 2018). Key attributes of Hadoop framework is low cost hardware, fault tolerance, high computing power due to distributed architecture, and scalable (SAS, 2018). Low cost is attributed to commodity hardware and open source software, distributed architecture with multiple copies of data ensure high power parallel computing and fault tolerance while simple addition of cheap nodes offers scalability (2018).

Key advantages of MapReduce is scalability and cost effective solution when compared against traditional relational database management systems (RDBMS) due to inexpensive commodity hardware and open source software (Pal, 2018). Simple programming in Java and parallel processing architecture offers high computing power with ease and efficiency. Distributed architecture and multiple copies of data on multiple nodes offer high available and resilient system in case of fault (2018).

On the other hand, some key limitations of MapReduce is that it serves well for tasks requiring batch processing but not suitable for real-time streaming data due to time intensive computing processes of Map and Reduce (Data Flair, 2017). Another drawback is lack of iterative data support, which require cyclic dataflow (output of one stage is input of other stage and so on). Lack of abstraction is another shortcoming, which require MapReduce code for each and every operation. Framework also does not offer caching intermediate data in memory, which significantly impact the performance of framework (Data Flair, 2017).

Reference

Apache Hadoop, 2018. Apache Hadoop. Retrieved from http://hadoop.apache.org/

SAS Hadoop, 2018. Hadoop. Retrieved from https://www.sas.com/en_us/insights/big-data/hadoop.html

IBM, 2018. MapReduce. Retrieved from https://www.ibm.com/analytics/hadoop/mapreduce

Sakr, S., Liu, A., & Fayoumi, A. G. (2014). MapReduce family of large-scale data-processing systems. In S. Sakr, & M. Gaber (Eds.), Large scale and big data: Processing and management (pp. 39-106). Boca Raton, FL: CRC Press.

Pal, K. (2018). Advantages of Hadoop MapReduce Programming. Retrieved from http://www.tutorialspoint.com/articles/advantages-of-hadoop-mapreduce-programming

Data Flair, 2017. 13 Big limitations of Hadoop and Solution to Hadoop drawbacks. Retrieved from https://data-flair.training/blogs/13-limitations-of-hadoop/

MapReduce and Hadoop

Recent Posts

Comments