Tuesday, March 28, 2017

On-line analytical processing (OLAP) is a category of data processing solutions designed to enable fast queries against raw, structured, multi-dimensional data. OLAP offerings are available from vendors including SAP, Microsoft, IBM, HP and Oracle.

OLAP solutions are typically used for business intelligence and reporting, business process management, sales management, reporting and forecasting, and financial reporting of high-value, Big Data. OLAP systems are chosen for the ability to analyze multidimensional data in near real-time, support for complex analytical queries, the ability to organize data at query time, and reduced time-to-reporting.

OLAP systems are pure analytics systems, defined by their primary focus on query performance. Common use combines loading data from one or more sources, and running a set of queries against that saved/stored data (aka data at rest). Loading can be continuous or batch; analytics systems often offer developers ways to load data in bulk faster than one row at a time. Queries can be known in advance, but can also be ad-hoc in nature. Often, analytical systems are tasked with managing tremendous amounts of data, terabyte or even petabyte scale.


OLAP/Analytics Design Tradeoffs

Systems optimized for load-and-query workloads often let assumptions about workloads influence their design. For example, many systems treat data as mostly immutable once loaded. This assumption enables several design choices:

Data can be stored in ways that are easier to scan, as opposed to easier to update. An example is the column store, a tabular storage format that is much easier to compress and scan, while also being harder to update.

Multiple readers can scan data without shared locks. If data isn’t changing, this enables more concurrency and better use of multi-core systems.

Whether columnar storage is used or not, more granular compression can be used. This compression is expensive if the goal is to access individual rows, but can be amortized across scans.

Immutable data is easier to persist, as data isn’t changing as it’s being written to disk.


Complementary Workloads

In practice, applications are often built with a mix of systems, and operational systems are often combined with analytical systems to build complete solutions. There are three primary reasons OLAP systems are added to operational systems:

  • OLAP systems are typically better suited to storing the tremendous volumes of data required to keep a history of state, events and actions for long periods of time.
  • OLAP systems support ad-hoc analysis, where questions aren’t known in advance.
  • OLAP systems often support deep processing of datasets in their entirety, and these workloads aren’t a great fit for operational systems. Most machine learning applications, for example, rely on multiple passes over large datasets.


Special Case: Hadoop Ecosystem OLAP

With the rise of “Big Data”, HDFS enthusiasts have sought to bring mature, relational OLAP and SQL to data in the Hadoop Ecosystem. These tools often put a large emphasis on scale-out and integration with other Hadoop Ecosystem tools, whether they be Map Reduce, Spark, Hive, Presto, SparkSQL or others. These tools focus on long reads on static, or append-only, datasets.