Apache Storm
Apache Storm (http://storm.apache.org) is a popular real-time framework for processing large, fast-moving streams of data. Essentially, Storm connects an incoming data source to a back-end data store and runs user code on each item as it passes between the two. Storm grew out of Nathan Marz’s work at BackType on streaming data generated from social media.
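The source-to-store flow described above can be sketched in a few lines of plain Python. This is only an illustration of the shape of such a pipeline, not Storm's API: the names `event_source`, `extract_hashtags`, and `run_pipeline` are hypothetical, the "data store" is an in-memory counter, and the sample events are made up.

```python
from collections import Counter
from typing import Callable, Iterable, Iterator, List


def event_source(events: Iterable[str]) -> Iterator[str]:
    """Stand-in for the incoming data source (assumed; in practice a queue or socket)."""
    for event in events:
        yield event


def extract_hashtags(event: str) -> List[str]:
    """Intermediate processing step: pull lower-cased hashtags out of one event."""
    return [word.lower() for word in event.split() if word.startswith("#")]


def run_pipeline(
    source: Iterable[str],
    process: Callable[[str], List[str]],
    store: Counter,
) -> Counter:
    """Push every event through the processing step and record results in the store."""
    for event in source:
        for tag in process(event):
            store[tag] += 1  # stand-in for a write to the back-end data store
    return store


# Hypothetical sample events, processed one at a time as they "arrive".
sample_events = ["Loving #Storm today", "#storm and #spark", "no tags here"]
tag_counts = run_pipeline(event_source(sample_events), extract_hashtags, Counter())
```

Each event is handled as soon as it is read, so the store always reflects everything seen so far rather than waiting for a complete data set.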
Most streaming applications benefit from context or state and therefore are really database applications. They require not just rapid data capture or ingestion but the ability to do real-time analytics and make real-time automated decisions on the data, i.e., make the analytics queryable.
Storm was created out of a need to rapidly ingest data from event streams and is often used for:
Storm is typically chosen because:
Apache Spark Streaming
Apache Spark (http://spark.apache.org) is a popular general-purpose data processing framework, often deployed on Hadoop clusters. Apache Spark Streaming applies Spark to streaming analytics by dividing the incoming stream into micro-batches and processing each one as a small batch job.
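To make the micro-batch idea concrete, the following plain-Python sketch groups a time-ordered stream into fixed-length intervals and runs a small batch computation on each group. It does not use Spark's API. The names `micro_batches` and `batch_sums`, the one-second interval, and the sample events are assumptions for illustration. Batches are cut on the timestamp carried by each event, and intervals with no events are skipped.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[float, int]  # (timestamp in seconds, value)


def micro_batches(events: Iterable[Event], interval: float) -> Iterator[List[Event]]:
    """Split a time-ordered stream into consecutive batches of `interval` seconds."""
    # Events whose timestamps fall in the same interval share a batch index.
    for _, batch in groupby(events, key=lambda e: int(e[0] // interval)):
        yield list(batch)


def batch_sums(events: Iterable[Event], interval: float) -> List[int]:
    """Run a tiny 'batch job' (a sum) on each micro-batch."""
    return [sum(value for _, value in batch) for batch in micro_batches(events, interval)]


# Hypothetical stream: five events spread over roughly three seconds.
stream = [(0.1, 1), (0.7, 2), (1.2, 3), (1.9, 4), (2.5, 5)]
sums = batch_sums(stream, interval=1.0)  # [3, 7, 5]
```

The trade-off is visible in the interval parameter: a shorter interval lowers latency but produces more, smaller batch jobs.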
Lambda Architecture
The Lambda Architecture (http://lambda-architecture.net) is designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. It attempts to balance latency, throughput, and fault-tolerance by using batch processing to provide comprehensive and accurate pre-computed views, while simultaneously using real-time stream processing to provide dynamic views.
The Lambda Architecture was purpose-built as a robust framework for ingesting streams of fast data while providing efficient real-time and historical analytics. In Lambda, immutable data flows in one direction: into the system. The architecture’s main goal is to execute OLAP-type processing faster than what is possible with current OLAP solutions.
Lambda-based applications are used for:
Lambda is chosen for:
Overview of the Lambda Architecture
The Lambda batch layer is usually a “data lake” system like Hadoop, although it could also be an OLAP data warehouse such as HP Vertica or IBM Netezza. This historical archive holds all of the data ever collected. The batch layer supports batch queries; batch processing is used to generate analytics, either predefined or ad hoc.
The Lambda speed layer is a combination of queuing, streaming, and operational data stores. It resembles the batch layer in that it computes similar analytics, except that it computes them in real time and only on the most recent data. The analytics the batch layer calculates may, for example, be based on data that is one hour old. It is the speed layer’s responsibility to calculate real-time analytics on fast-moving data, that is, data that is zero to one hour old.
Combining the analytics produced by the batch layer and the speed layer provides a complete view of analytics across all data, fresh and historical. The third layer of Lambda, the serving layer, is responsible for serving up results combined from both the speed and batch layers.
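The way the three layers fit together can be sketched in plain Python. This is a toy model under stated assumptions, not a reference implementation: the function names `batch_view`, `speed_view`, and `serve`, the page-view events, and the 60-minute cutoff are all hypothetical. Both views are computed from one in-memory list here, whereas real deployments use separate systems for each layer.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Event = Tuple[int, str]  # (timestamp in minutes, page name)


def batch_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Batch layer: precomputed counts over historical data (older than the cutoff)."""
    return Counter(page for ts, page in events if ts < cutoff)


def speed_view(events: Iterable[Event], cutoff: int) -> Counter:
    """Speed layer: the same analytic, but only over data at or after the cutoff."""
    return Counter(page for ts, page in events if ts >= cutoff)


def serve(batch: Counter, speed: Counter, page: str) -> int:
    """Serving layer: answer a query by combining the batch and speed views."""
    return batch[page] + speed[page]


# Hypothetical page-view events; the last batch run covered minutes 0 to 59.
events: List[Event] = [
    (0, "home"), (10, "home"), (30, "pricing"),
    (65, "home"), (70, "pricing"), (75, "pricing"),
]
CUTOFF = 60

batch = batch_view(events, CUTOFF)   # home: 2, pricing: 1
speed = speed_view(events, CUTOFF)   # home: 1, pricing: 2
home_total = serve(batch, speed, "home")        # 3
pricing_total = serve(batch, speed, "pricing")  # 3
```

Note that the counting logic appears twice, once per layer. In this sketch the duplication is a single line, but it is the same duplication that grows costly when the two layers run on different platforms.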
Developers often evaluate the Lambda Architecture for handling streaming data. Lambda’s inherent complexity, which stems from its three layers (speed, serving, and batch), requires developers to maintain the same application logic, producing the same results, in two complex systems (the batch and speed layers).
Lambda issues include: