Realtime Stream Processing
- ali@fuzzywireless.com
- Mar 4, 2022
- 5 min read
In general, a stream processing system must satisfy certain requirements: keep the data moving, handle stream imperfections, guarantee predictable and repeatable outcomes, integrate stored and streaming data, provide high availability and integrity, and scale while responding in real time (Zapletal, 2015). The following subsections detail two stream processing engines: Stormy and Twitter Storm.
Stormy
Stormy is a distributed stream processing service for continuous data processing in the cloud (Loesing, Hentschel, Kraska & Kossmann, 2012). It fully leverages the well-known attributes of clouds, namely scalability and availability (Elshawi & Sakr, 2014). The framework uses distributed hash tables (DHTs) to distribute queries across nodes and to route events along the query graph from query to query (Elshawi & Sakr, 2014). High availability is achieved by replicating the same data on multiple nodes and processing it concurrently (Kamburugamuve, 2013). Because of the distributed architecture, there is no single point of failure. One important assumption is that a query can be executed entirely by a single node, which places a limit on the number of incoming events of a given stream (Loesing et al., 2012).
The four major functions of Stormy are listed below; a short usage sketch follows the list.
1. registerStream (description) – maps an external data stream to a system identifier (SID) used internally by Stormy
2. registerQuery (name, query, input SIDs) – registers a query under the given name with the query itself and one or more input SIDs, returning the SID of the query's output stream
3. registerOutput (SID, target) – registers the target (host address and port) to which output events of the given SID are delivered
4. pushEvents (SID, events) – pushes events for the given input SID into the system (Elshawi & Sakr, 2014)
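As a concrete illustration of how these four calls fit together, the sketch below wires up a hypothetical client against them. The StormyClient class, the endpoint address, and the query text are assumptions made only for this example; Stormy does not ship such a Python client.

```python
# Hypothetical sketch of Stormy's four-call interface. The class, transport,
# and endpoint are illustrative assumptions, not an actual client library.

class StormyClient:
    def __init__(self, endpoint):
        self.endpoint = endpoint          # entry node of the Stormy ring (assumed)
        self._next_sid = 0
        self._streams = {}                # SID -> local bookkeeping for the sketch

    def register_stream(self, description):
        """Map an external data stream to an internal system identifier (SID)."""
        sid = self._new_sid()
        self._streams[sid] = {"kind": "stream", "description": description}
        return sid

    def register_query(self, name, query, input_sids):
        """Register a continuous query over one or more input SIDs.
        Returns the SID of the query's output stream."""
        out_sid = self._new_sid()
        self._streams[out_sid] = {"kind": "query", "name": name,
                                  "query": query, "inputs": list(input_sids)}
        return out_sid

    def register_output(self, sid, target):
        """Register the (host, port) target to which events of `sid` are delivered."""
        self._streams[sid]["target"] = target

    def push_events(self, sid, events):
        """Push a batch of events for the given input SID into the system."""
        # In Stormy, the entry node would hash and route these events along the
        # ring; here we only record them for illustration.
        self._streams[sid].setdefault("pending", []).extend(events)

    def _new_sid(self):
        self._next_sid += 1
        return self._next_sid


# Example wiring: one input stream, one query, one output target (all hypothetical).
client = StormyClient(endpoint="stormy-node-0:9090")
tweets = client.register_stream("raw tweet stream")
trending = client.register_query("hot-topics",
                                 "COUNT tweets per topic over 5-minute window",
                                 [tweets])
client.register_output(trending, ("dashboard.example.com", 8080))
client.push_events(tweets, [{"topic": "storm"}, {"topic": "cloud"}])
```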
Queries are distributed across multiple nodes using a hashing mechanism in which the output range of the hash function forms a circular ring, as shown in Figure 1 (Loesing et al., 2012). Each node in the ring is assigned a random value that defines its position on the ring (Kamburugamuve, 2013). When new input arrives, a hash key is generated and assigned to the first node encountered by following the ring clockwise (Elshawi & Sakr, 2014). The DHT lets every node know the mapping of hash keys to nodes, so a request can be forwarded to the responsible node. When the DHT changes, a gossip protocol propagates the new information to all nodes with some delay; this delay can cause occasional incorrect forwarding based on outdated information, but the new mapping eventually resolves the issue (Elshawi & Sakr, 2014). Stormy also replicates each query across multiple nodes, and its replication protocol ensures that incoming events are executed by all replica nodes (Loesing et al., 2012).
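The key-to-node assignment described above is essentially consistent hashing. The minimal sketch below shows how a key is hashed onto the ring, assigned to the first node found clockwise, and replicated on successor nodes. The node names, the replica count, and the choice of SHA-1 are illustrative assumptions, not details taken from Stormy.

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Position on the ring: an integer derived from a stable hash."""
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        # Each node takes a position on the ring; deriving it from the node name
        # keeps this example deterministic (Stormy uses random positions).
        self._positions = sorted((ring_hash(n), n) for n in nodes)

    def responsible_nodes(self, key: str, replicas: int = 3):
        """First node clockwise from the key's hash, plus successors as replicas."""
        keys = [pos for pos, _ in self._positions]
        start = bisect.bisect_right(keys, ring_hash(key)) % len(self._positions)
        return [self._positions[(start + i) % len(self._positions)][1]
                for i in range(min(replicas, len(self._positions)))]

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])   # hypothetical nodes
print(ring.responsible_nodes("query:hot-topics"))            # e.g. three replica nodes
```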
To handle overload situations, load balancing and cloud bursting techniques are employed (Loesing et al., 2012). Each node monitors its resource utilization, including CPU, memory, network, and storage (Kamburugamuve, 2013), and disseminates this information to all other nodes using the gossip protocol (Elshawi & Sakr, 2014). Each node compares its utilization with that of its neighbors; when a load threshold is crossed, load balancing is triggered locally between the neighboring nodes. Load balancing is an efficient way to even out load across nodes, but when the utilization of all nodes becomes too high, cloud bursting adds new nodes, a decision made by an elected cloud bursting leader (Elshawi & Sakr, 2014). A new node takes a random position in the ring, takes over part of the data ranges of its neighboring nodes, and the DHT is updated (Kamburugamuve, 2013). Conversely, if the load falls below a certain threshold, the cloud bursting leader can decide to remove a node from the ring by transferring its data ranges to neighboring nodes, followed by a DHT update (Loesing et al., 2012).
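The sketch below captures the two decisions described above: local load balancing between neighboring nodes and leader-driven cloud bursting. The threshold values, node names, and utilization figures are made up for illustration.

```python
# Sketch of the load-balancing / cloud-bursting decisions described above.
# Thresholds and utilization numbers are illustrative assumptions.

HIGH_LOAD = 0.80   # per-node threshold that triggers local load balancing
LOW_LOAD = 0.20    # cluster-wide threshold below which a node may be removed

def local_load_balance(node, neighbor):
    """If a node is overloaded relative to a neighbor, shift part of its key range."""
    if node["util"] > HIGH_LOAD and node["util"] > neighbor["util"]:
        return f"move part of {node['name']}'s key range to {neighbor['name']}"
    return "no action"

def cloud_bursting_decision(cluster_utils):
    """Elected leader's decision: add a node when every node is hot, remove one when idle."""
    if all(u > HIGH_LOAD for u in cluster_utils):
        return "add node: new node takes a random ring position, DHT updated"
    if sum(cluster_utils) / len(cluster_utils) < LOW_LOAD:
        return "remove node: transfer its ranges to neighbors, DHT updated"
    return "no action"

print(local_load_balance({"name": "node-a", "util": 0.90},
                         {"name": "node-b", "util": 0.40}))
print(cloud_bursting_decision([0.85, 0.90, 0.82]))
```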

Figure 1: Ring Architecture of Stormy’s distributed stream processing framework
Stormy is well suited to a multi-tenant cloud, offering high availability, fault tolerance and scalability (Elshawi & Sakr, 2014). However, replicating data across several nodes and processing it concurrently introduces inherent inefficiencies: computational resources are spent on redundant work, which is nevertheless necessary for fault tolerance and high availability.
Twitter Storm
Twitter Storm is a distributed and fault-tolerant stream processing engine developed by Twitter on the fundamental principles of actor theory (Elshawi & Sakr, 2014). Its key principles are:
1. horizontally scalable – concurrent processing and computations on multiple threads, processes and machines
2. guaranteed message processing – ensures that each message is fully processed at least once (see the sketch after this list)
3. fault tolerance – tasks are reassigned to other nodes in the event of failure
4. programming language agnostic – any language can be used to set up tasks and processing (Elshawi & Sakr, 2014)
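The at-least-once guarantee in principle 2 can be pictured with a small sketch: every emitted message is tracked until it is acknowledged, and unacknowledged messages are replayed after a timeout. This mirrors the idea behind Storm's acking rather than its actual implementation, and all names below are illustrative.

```python
import time

class AtLeastOnceSource:
    """Toy at-least-once source: track pending messages, replay on timeout."""

    def __init__(self, timeout_s=30.0):
        self.timeout_s = timeout_s
        self.pending = {}        # msg_id -> (message, time it was emitted)

    def emit(self, msg_id, message):
        self.pending[msg_id] = (message, time.monotonic())
        return message           # handed to downstream processing

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)   # fully processed, stop tracking

    def replay_expired(self):
        """Re-emit anything not acknowledged in time. Duplicates are possible,
        which is exactly why this is at-least-once rather than exactly-once."""
        now = time.monotonic()
        expired = [m for m, (_, t) in self.pending.items()
                   if now - t > self.timeout_s]
        return [self.emit(m, self.pending[m][0]) for m in expired]
```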
The framework is composed of two component types, spouts and bolts, shown in Figure 2. A spout is a source of streams, while a bolt consumes streams, performs some processing, and emits new streams (Elshawi & Sakr, 2014). For instance, computing hot trending tweets requires multiple processing steps and therefore multiple bolts. A topology is the graph of stream transformations, with spouts and bolts as its nodes; each edge indicates which bolts are subscribed to which streams (Elshawi & Sakr, 2014). The system supports several stream groupings, listed below and illustrated in the sketch that follows:
1. shuffle grouping – stream tuples are randomly distributed among the bolt's tasks so that each receives roughly an equal number of tuples
2. fields grouping – tuples are partitioned into groups based on the values of specified fields
3. all grouping – stream tuples are replicated to all of the bolt's tasks
4. global grouping – the entire stream goes to a single bolt task (Elshawi & Sakr, 2014)
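To make spouts, bolts, and groupings concrete, the toy pipeline below counts words from a sentence stream, using a fields grouping so that every occurrence of a word reaches the same counting task. This is a conceptual Python sketch, not the Storm API (which wires spouts and bolts through a Java TopologyBuilder); the word-count example and all names are illustrative.

```python
import random
from collections import Counter, defaultdict

def sentence_spout():
    """Spout: the source of the stream."""
    for sentence in ["storm processes streams", "streams never stop"]:
        yield sentence

def split_bolt(sentence):
    """Bolt: consumes one tuple (a sentence), emits new tuples (words)."""
    return sentence.split()

def shuffle_grouping(tuples, n_tasks):
    """Randomly distribute tuples so each task gets roughly the same load."""
    tasks = defaultdict(list)
    for t in tuples:
        tasks[random.randrange(n_tasks)].append(t)
    return tasks

def fields_grouping(tuples, n_tasks):
    """Partition by field value: the same word always reaches the same task,
    so per-word counters stay consistent."""
    tasks = defaultdict(list)
    for t in tuples:
        tasks[hash(t) % n_tasks].append(t)
    return tasks

words = [w for s in sentence_spout() for w in split_bolt(s)]
count_tasks = fields_grouping(words, n_tasks=2)
counts = {task: Counter(ws) for task, ws in count_tasks.items()}
print(counts)    # each word is counted by exactly one "count bolt" task
```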

Figure 2: Twitter Storm Topology
In essence, a Twitter Storm cluster is similar to a Hadoop cluster, but with a key difference: a Storm topology keeps processing forever, whereas a MapReduce job eventually finishes (Elshawi & Sakr, 2014). The architecture also resembles Hadoop's, consisting of a master node (Nimbus), which acts like Hadoop's JobTracker, and worker nodes (Supervisors), which run the actual computations, while ZooKeeper coordinates between the master and the workers (Elshawi & Sakr, 2014).
However, Storm requires the user to implement the recovery mechanism and checkpointing needed for stateful operation (Nasir, 2016). Storm also requires strict ordering, which is realized by keeping a transaction identifier and the operator state in external storage. In the event of a failure, processing can be resumed by resubmitting from the stored state, although this degrades processing speed (Nasir, 2016).
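The pattern described by Nasir (2016) can be sketched as follows: the operator state and the identifier of the last applied transaction are stored together, so a batch replayed after a failure is recognized by its old identifier and skipped instead of being applied twice. The dictionary standing in for external storage and all names are assumptions made for illustration.

```python
# Minimal sketch of strict ordering via an externally stored transaction id.
external_store = {"last_txid": 0, "state": {}}   # stand-in for a real database

def apply_batch(txid, batch):
    """Apply a batch of (key, increment) updates exactly once per transaction id."""
    if txid <= external_store["last_txid"]:
        return "skipped replayed batch"          # already applied before the failure
    state = external_store["state"]
    for key, inc in batch:
        state[key] = state.get(key, 0) + inc
    # Persist state and transaction id together so recovery sees a consistent pair.
    external_store["last_txid"] = txid
    return "applied"

print(apply_batch(1, [("storm", 2), ("cloud", 1)]))   # applied
print(apply_batch(1, [("storm", 2), ("cloud", 1)]))   # skipped replayed batch
```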
References:
Elshawi, R., & Sakr, S. (2014). An overview of large-scale stream processing engines. In S. Sakr & M. Gaber (Eds.), Large scale and big data: Processing and management (pp. 389-408). Boca Raton, FL: CRC Press.
Kamburugamuve, S. (2013). Survey of distributed stream processing for large stream sources. Retrieved from http://grids.ucs.indiana.edu/ptliupages/publications/survey_stream_processing.pdf
Loesing, S., Hentschel, M., Kraska, T., & Kossmann, D. (2012). Stormy: An elastic and highly available streaming service in the cloud. Retrieved from http://cs.brown.edu/people/tkraska/pub/danac_stormy.pdf
Nasir, M. (2016). Fault tolerance for stream processing engines. Retrieved from https://pdfs.semanticscholar.org/eb74/c5c80b5698c2253bdff7ac9f228c76da1cf3.pdf
Zapletal, P. (2015). Introduction into distributed real-time stream processing. Retrieved from https://www.cakesolutions.net/teamblogs/introduction-into-distributed-real-time-stream-processing