Streaming

Real-Time Streaming

Tuesday, March 28, 2017

Apache Storm

Apache Storm (http://storm.apache.org) is a popular real-time streaming data framework for processing fast, large streams of data. Essentially Storm connects an incoming data source to a back-end data store while running some code on an intermediate path. Storm grew out of Nathan Marz’s work at BackType on streaming data generated from social media.

Most streaming applications benefit from context or state and therefore are really database applications. They require not just rapid data capture or ingestion but the ability to do real-time analytics and make real-time automated decisions on the data, i.e., make the analytics queryable.

Storm was created out of a need to rapidly ingest data from event streams and is often used for:

  • Data capture, ingest to Hadoop and OLAP systems
  • Pre-defined, read-only analytics such as counting, as well as cleaning, normalizing and preparing ingested data for long term storage

Storm is typically chosen because:

  • Apache open source license
  • While it solves a narrow streaming problem, it does it well
  • It’s fairly well known due to Twitter’s use of it; Twitter moved to Heron in 2015
  • Since Storm was designed to capture data, it supports data ingestion and export, as well as simple analytic calculations, but does not support serving of analytic results or real-time decision making.

 

 

Apache Spark Streaming

 

Apache Spark (http://spark.apache.org)  is a popular data processing framework for Hadoop. Apache Spark Streaming is a way of using Spark for streaming analytics against micro-batches of streaming data.


 

Lambda Architecture

The Lambda Architecture (http://lambda-architecture.net)  is designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. It attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate pre-computed views, while simultaneously using real-time stream processing to provide dynamic views. 

The Lambda Architecture was purpose-built as a robust framework for ingesting streams of fast data while providing efficient real-time and historical analytics. In Lambda, immutable data flows in one direction: into the system. The architecture’s main goal is to execute OLAP-type processing faster than what is possible with current OLAP solutions.

Lambda-based applications are used for:

  • Log ingestion and analytics
  • Real-time programmatic advertising
  • Serving streaming video
  • Recommendations engines
  • Big Data

Lambda is chosen for:

  • Applications that require lower latency
  • Data pipeline applications
  • Applications that process asynchronous, complex transformations 

 

Overview of the Lambda Architecture

The Lambda batch layer is usually a “data lake” system like Hadoop, although it could also be an OLAP data warehouse such as HP Vertica or IBM Netezza. This historical archive is used to hold all of the data ever collected. The batch layer supports batch query; batch processing is used to generate analytics, either predefined or ad hoc.

The Lambda speed layer is defined as a combination of queuing, streaming and operational data stores. In the Lambda Architecture, the speed layer is similar to the batch layer in that it computes similar analytics - except that it computes those analytics in real-time on only the most recent data. The analytics the batch layer calculates, for example, may be based on data one hour old. It is the speed layer’s responsibility to calculate real-time analytics based on fast moving data - data that is zero to one hour old.

Combining the analytics produced by the batch layer and the speed layer provides a complete view of analytics across all data, fresh and historical. The third layer of Lambda, the serving layer, is responsible for serving up results combined from both the speed and batch layers.

 

Developers evaluate the Lambda architecture for handling streaming data. Lambda’s inherent complexity, comprised of the three layers described (speed, serving and batch) requires developers to maintain the same application code (results) in two complex systems (the batch and speed layers).

Lambda issues include:

  • Complexity
  • Lambda cannot be used to build responsive, event-oriented applications
  • Lambda is limited: immutable data flows in one direction only, into the system -for analytics harvesting.

 

/GW