MapReduce and Hadoop
- ali@fuzzywireless.com
- Mar 4, 2022
- 2 min read
The basic architecture of MapReduce offer simple and powerful programming model on large clusters of commodity machines for scalable parallel applications to process large volume of data sets (Sakr, Liu & Fayoumi, 2014). Key advantage from this implementation is fault tolerance, low hardware cost, scalability, and highly parallel architecture. The term MapReduce is consisted of two different processes, namely Map and Reduce, where Map process converts input data into tuples of key/value pairs while reduce process converts these tuples into smaller set of tuples after combining (IBM, 2018).
Hadoop is an open source, distributed computing, reliable and scalable framework using simple programming models (Apache Hadoop, 2018). Key attributes of Hadoop framework is low cost hardware, fault tolerance, high computing power due to distributed architecture, and scalable (SAS, 2018). Low cost is attributed to commodity hardware and open source software, distributed architecture with multiple copies of data ensure high power parallel computing and fault tolerance while simple addition of cheap nodes offers scalability (2018).
Key advantages of MapReduce is scalability and cost effective solution when compared against traditional relational database management systems (RDBMS) due to inexpensive commodity hardware and open source software (Pal, 2018). Simple programming in Java and parallel processing architecture offers high computing power with ease and efficiency. Distributed architecture and multiple copies of data on multiple nodes offer high available and resilient system in case of fault (2018).
On the other hand, some key limitations of MapReduce is that it serves well for tasks requiring batch processing but not suitable for real-time streaming data due to time intensive computing processes of Map and Reduce (Data Flair, 2017). Another drawback is lack of iterative data support, which require cyclic dataflow (output of one stage is input of other stage and so on). Lack of abstraction is another shortcoming, which require MapReduce code for each and every operation. Framework also does not offer caching intermediate data in memory, which significantly impact the performance of framework (Data Flair, 2017).
Reference
Apache Hadoop, 2018. Apache Hadoop. Retrieved from http://hadoop.apache.org/
SAS Hadoop, 2018. Hadoop. Retrieved from https://www.sas.com/en_us/insights/big-data/hadoop.html
IBM, 2018. MapReduce. Retrieved from https://www.ibm.com/analytics/hadoop/mapreduce
Sakr, S., Liu, A., & Fayoumi, A. G. (2014). MapReduce family of large-scale data-processing systems. In S. Sakr, & M. Gaber (Eds.), Large scale and big data: Processing and management (pp. 39-106). Boca Raton, FL: CRC Press.
Pal, K. (2018). Advantages of Hadoop MapReduce Programming. Retrieved from http://www.tutorialspoint.com/articles/advantages-of-hadoop-mapreduce-programming
Data Flair, 2017. 13 Big limitations of Hadoop and Solution to Hadoop drawbacks. Retrieved from https://data-flair.training/blogs/13-limitations-of-hadoop/
Comments