Resource Description Framework in MapReduce
- ali@fuzzywireless.com
- Mar 4, 2022
- 5 min read
Web 3.0
After Web 2.0, a new generation of the web evolved with read, write and an additional execution capability (Choudhury, 2014). Web 3.0 moved from connecting information to connecting knowledge and semantically structuring documents (Algosaibi, Albahli & Melton, 2015). Web content became readable to machines as well, through new languages, standards, technologies and data representation models, such as the machine-processable graph data model expressed with the Resource Description Framework (RDF) or the Web Ontology Language (OWL). However, locating and extracting RDF data becomes a major performance bottleneck (Cure, Naacke, Randriamalala & Amann, 2015). RDF represents data in a machine-readable format, and RDF data is queried using SPARQL (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014). The scale and size of the web, already beyond the capability of a single machine, make querying RDF data sets very challenging, especially because SPARQL queries typically involve several joins (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014).
Resource Description Framework (RDF)
W3C (2014) defines the Resource Description Framework (RDF) as a data model whose statements have three components, namely subject, predicate and object, read as: subject has property predicate with value object. Uniform resource identifiers (URIs) serve as unique, global identifiers for RDF resources; for instance, a URL is a subset of a URI (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014). Figure 1 shows an example of an RDF graph representing relationships.

Figure 1: RDF Graph
W3C (2014) has recommended SPARQL as the query language for RDF. A SPARQL query is constructed from triple patterns in which variables are mapped to the subject, predicate and object of RDF (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014). Triple patterns related to each other by logical AND are referred to as a Basic Graph Pattern (BGP). Figure 2 presents an example of a SPARQL query for the RDF in Figure 1.
SELECT *
WHERE { ?person knows Walter .
        ?person age ?age .
        OPTIONAL { ?person email ?email }
        FILTER (?age >= 8) }
Figure 2: Sample SPARQL Query
Assuming John knows Walter and his age is greater than 8, the above SPARQL query returns John with an unbound ?email: a missing email address does not eliminate the solution, because the email pattern is wrapped in OPTIONAL. Only if the email pattern were a required part of the graph pattern would the query return nothing.
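The semantics of OPTIONAL and FILTER can be checked with a small, self-contained evaluation of the query over toy data. The names and the age value below are assumptions, since Figure 1 is not reproduced here; real systems would use a SPARQL engine rather than this hand-rolled loop.

```python
# Toy evaluation of the Figure 2 query: a required BGP, an OPTIONAL
# pattern, and a FILTER, over assumed Figure 1 data.
triples = {
    ("John", "knows", "Walter"),
    ("John", "age", 42),  # age value is an assumption for illustration
}

results = []
for s, p, o in triples:
    if p == "knows" and o == "Walter":        # required: ?person knows Walter
        person = s
        ages = [o2 for s2, p2, o2 in triples
                if s2 == person and p2 == "age"]   # required: ?person age ?age
        for age in ages:
            if age >= 8:                            # FILTER (?age >= 8)
                # OPTIONAL: bind ?email if present, otherwise leave unbound
                emails = [o2 for s2, p2, o2 in triples
                          if s2 == person and p2 == "email"] or [None]
                for email in emails:
                    results.append((person, age, email))

print(results)  # John matches even though he has no email address
```

Note that the solution survives with `None` in the email position, mirroring an unbound variable in SPARQL.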
MapReduce
The basic architecture of MapReduce offers a simple and powerful programming model for scalable parallel applications that process large volumes of data on large clusters of commodity machines (Sakr, Liu & Fayoumi, 2014). Key advantages of this implementation are fault tolerance, low hardware cost, scalability, and a highly parallel architecture. MapReduce consists of two processes, Map and Reduce: the Map process converts input data into tuples of key/value pairs, while the Reduce process combines these tuples into a smaller set of tuples (IBM, 2018).
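The two phases can be sketched in plain Python with the classic word-count example. This is a single-machine simulation of the model, not Hadoop itself; the sort step stands in for the shuffle that a real cluster performs over the network.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Map: emit a (word, 1) key/value pair for every word in the line
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    # Reduce: combine all counts for one key into a single total
    return (key, sum(values))

lines = ["rdf data on hadoop", "rdf joins on hadoop"]
pairs = [kv for line in lines for kv in map_fn(line)]   # map phase
pairs.sort(key=itemgetter(0))                           # shuffle/sort phase
counts = dict(reduce_fn(key, [v for _, v in group])
              for key, group in groupby(pairs, key=itemgetter(0)))
print(counts)  # {'data': 1, 'hadoop': 2, 'joins': 1, 'on': 2, 'rdf': 2}
```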
Native MapReduce processes RDF using joins, which is difficult because of the size of the data sets: the data can be located on different nodes and therefore requires shuffling across the network, which is inefficient (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014). A reduce-side join in MapReduce repartitions both data sets during the shuffle phase and performs the join during the reduce phase; this process is inefficient and wastes network resources (Pulipaka, 2016). Another approach is a map-side join, which avoids shuffling and data transfer during the reduce phase, but it cannot be applied in most cases because preprocessing (sorting and partitioning) is still required before the join can be performed (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014).
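The reduce-side join pattern can be sketched as follows: the map phase tags each record with its source and emits it under the join key, the shuffle groups records by key (the network-heavy step), and the reduce phase pairs records from the two sources within each group. The triple data below is illustrative.

```python
from collections import defaultdict

# Two triple patterns to be joined on the shared variable ?person
knows = [("John", "Walter")]          # (?person, ?friend)
ages = [("John", 42), ("Mary", 7)]    # (?person, ?age)

# Map phase: emit the join key plus a tag identifying the source pattern
mapped = [(p, ("knows", f)) for p, f in knows] + \
         [(p, ("age", a)) for p, a in ages]

# Shuffle phase: group all tagged values by join key; on a real cluster
# this repartitioning moves data across the network
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: within each group, pair every 'knows' value with every 'age'
joined = [(key, friend, age)
          for key, values in groups.items()
          for tag1, friend in values if tag1 == "knows"
          for tag2, age in values if tag2 == "age"]
print(joined)  # [('John', 'Walter', 42)] -- Mary has no 'knows' record
```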
MapReduce Query and Join Algorithm (MRQJ)
Zhang and Wang (2014) presented the MRQJ algorithm to improve MapReduce's query processing of RDF. The algorithm first stores the data in HDFS; the SPARQL query is parsed by ARQ, followed by a low-cost greedy strategy for join plan generation, and the query is then processed by the map() and reduce() functions (Zhang & Wang, 2014). Figure 3 outlines the flow of the MRQJ algorithm for processing RDF data.

Figure 3: MapReduce Query and Join Algorithm Flow
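A greedy join-planning step of the kind described above can be illustrated in a few lines: order the triple patterns so that the one with the smallest estimated result size is joined first. The patterns and size estimates below are assumptions for illustration; the actual cost model used by MRQJ is not reproduced here.

```python
# Toy greedy join ordering: cheapest (most selective) pattern first, so
# early joins produce small intermediate results. Size estimates are assumed.
estimated_sizes = {
    "?person knows Walter": 10,
    "?person email ?email": 100,
    "?person age ?age": 1000,
}

plan = sorted(estimated_sizes, key=estimated_sizes.get)
print(plan)  # patterns ordered from smallest to largest estimate
```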
Query Optimization Based on MapReduce
Cheng, Weng and Gao (2012) developed a framework to overcome the inefficiencies of processing large data sets in a native MapReduce environment. The proposed framework consists of three parts:
1. Preprocessing of RDF data using the PredicateLead method
2. Partition of SPARQL query into individual jobs
3. Execution of jobs for query results
In the preprocessing phase, the algorithm divides the RDF data sets into small parts to reduce searching (Cheng, Weng & Gao, 2012). Each RDF triple is converted into subject, predicate#object, represented as a single line in the final data file, which is stored in HDFS and processed by MapReduce. The job partitioning algorithm partitions the SPARQL query into individual jobs, followed by the Map() and Reduce() processes (Cheng, Weng & Gao, 2012).
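The triple-to-line conversion in the preprocessing phase might look like the sketch below. The exact file layout (tab separator, `#` between predicate and object) is an assumption based on the description above, not the paper's actual code.

```python
# Sketch of the preprocessing step: each RDF triple becomes one line of
# the form "subject<TAB>predicate#object", ready to be written to HDFS.
triples = [
    ("John", "knows", "Walter"),
    ("John", "age", "42"),
]

lines = [f"{s}\t{p}#{o}" for s, p, o in triples]
for line in lines:
    print(line)
```

Grouping by subject in this layout lets a MapReduce job scan only the predicate#object fields relevant to a query, which is the searching reduction the authors describe.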
Pig Latin
Pig Latin is a language developed by Yahoo to analyze large data sets on Apache Hadoop; its generic implementation on Hadoop is Pig, which translates Pig Latin into a series of MapReduce jobs (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014). Pig Latin follows a nested data model composed of four data types: atom (a simple integer or string), tuple (a sequence of fields of any type), bag (a collection of tuples) and map (a collection of data items with associated keys) (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014). Operations in Pig Latin include load (read input data), foreach (process every tuple of a bag), filter, join, union, and split (partition a bag into two or more bags). Figure 4 outlines the processing of RDF using Pig Latin.

Figure 4: Pig Latin for RDF Processing in MapReduce
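To make the named operations concrete, here is a plain-Python sketch of load, filter, foreach and split applied to RDF triples. Real Pig Latin has its own syntax and compiles each step into MapReduce jobs; this single-machine analogy (with assumed data) only mirrors the dataflow.

```python
# LOAD: read triples as a bag of (subject, predicate, object) tuples
raw = [
    ("John", "knows", "Walter"),
    ("John", "age", "42"),
    ("Mary", "age", "7"),
]

# FILTER: keep only the 'age' triples
ages = [t for t in raw if t[1] == "age"]

# FOREACH ... GENERATE: project each tuple down to (subject, object)
projected = [(s, o) for s, p, o in ages]

# SPLIT: partition the bag into two bags by a condition on the object
adults = [t for t in projected if int(t[1]) >= 8]
minors = [t for t in projected if int(t[1]) < 8]
print(adults, minors)  # [('John', '42')] [('Mary', '7')]
```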
Summary
In summary, the native MapReduce framework is inefficient at handling large RDF data sets because joins require shuffling across the network (Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014). Reduce-side and map-side joins are two ways to handle large RDF data, but they require repartitioning and preprocessing respectively, so neither is very efficient (Pulipaka, 2016; Schätzle, Przyjaciel-Zablocki, Hornung & Lausen, 2014). Zhang and Wang (2014) developed the MapReduce query and join algorithm (MRQJ), which stores RDF in HDFS, uses the ARQ parser and a greedy strategy for join plan generation, and then runs the map() and reduce() functions. Cheng, Weng and Gao (2012) optimize querying on MapReduce by preprocessing large RDF data into small chunks, partitioning the SPARQL query into individual jobs, and executing the Map() and Reduce() functions. Schätzle et al. (2014) outlined a Pig Latin based framework that starts with algebraic processing of the SPARQL query, followed by a Pig Latin translator that creates MapReduce jobs for large-scale RDF data sets.
References
Algosaibi, A., Albahli, S., & Melton, A. (2015). World Wide Web: A survey of its development and possible future trends. The 16th International Conference on Internet Computing and Big Data, Las Vegas, NV.
Cheng, J., Wang, W., & Gao, R. (2012). Massive RDF data complicated query optimization based on MapReduce. 2012 International Conference on Solid State Devices and Materials Science.
Choudhury, N. (2014). World Wide Web and its journey from Web 1.0 to Web 4.0. International Journal of Computer Science and Information Technologies, 5(6), 8096-8100.
Cure, O., Naacke, H., Randriamalala, T., & Amann, B. (2015). LiteMat: A scalable, cost-efficient inference encoding scheme for large RDF graphs. 2015 IEEE International Conference on Big Data, 1823-1830.
IBM (2018). MapReduce. Retrieved from https://www.ibm.com/analytics/hadoop/mapreduce
Pulipaka, G. (2016). Resolving large-scale performance bottlenecks in IoT networks accessing big data. Retrieved from https://medium.com/@gp_pulipaka/resolving-large-scale-performance-bottlenecks-in-iot-networks-accessing-big-data-b0e386c58796
Sakr, S., Liu, A., & Fayoumi, A. G. (2014). MapReduce family of large-scale data-processing systems. In S. Sakr & M. Gaber (Eds.), Large scale and big data: Processing and management, 39-106. Boca Raton, FL: CRC Press.
Schätzle, A., Przyjaciel-Zablocki, M., Hornung, T., & Lausen, G. (2014). Large-scale RDF processing with MapReduce. In S. Sakr & M. Gaber (Eds.), Large scale and big data: Processing and management, 151-182. Boca Raton, FL: CRC Press.
W3C (2014). RDF Schema 1.1. Retrieved from https://www.w3.org/TR/2014/REC-rdf-schema-20140225/
Zhang, Y., & Wang, J. (2014). Query optimization of distributed RDF data based on MapReduce. Applied Mechanics and Materials, 970-973.