Why use Hadoop?

Tuesday, May 2, 2017

This article covers

  • Why use Hadoop?
  • What is Hadoop?
  • Recognizing Use Cases
  • Some Use Cases


Apache Hadoop has evolved into the standard platform solution for data storage and analysis.

Two key aspects of Hadoop have driven its rapid adoption by companies hungry for improved insights into the data they collect:

  • Hadoop can store data of any type and from any source—inexpensively and at very large scale.
  • Hadoop enables the sophisticated analysis of even very large data sets, easily and quickly.

Hadoop solves the difficult scaling problems caused by processing and analyzing large amounts of complex data. As the amount of data in a cluster grows, new servers can be added incrementally to capture and analyze it. Hadoop performs better at scale than traditional systems, and in many instances it can perform analyses that traditional systems simply cannot.

Hadoop delivers significant benefits. With Hadoop you can:

  • Store anything: data is kept in its native format, with no translation into a fixed data warehouse schema.
  • Control costs: Hadoop is open source software that runs on industry-standard hardware. The platform scales and manages resources independently, allowing you to tailor resources to your workload.
  • Operate with confidence: Hadoop has already been widely adopted across a wide range of industries.
  • Deploy and scale seamlessly: Hadoop is a proven foundation for the analysis of large-scale data sets, whatever the size of your data.


What is Hadoop?

Hadoop is an open data storage and processing system. It is scalable, fault-tolerant, and distributed. Hadoop runs on groups of servers that work together as part of a Hadoop cluster. Each server can perform tasks and store data. While each server individually may lack the processor power and storage capacity necessary to process huge data sets, Hadoop allows the servers to work together to store and process truly massive amounts of data.
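The division of work across a cluster follows Hadoop's MapReduce model: mappers process pieces of the data in parallel, and reducers combine the intermediate results. As a rough illustration, here is a minimal local simulation of the classic word-count job (the function names and the in-memory shuffle step are illustrative; a real Hadoop job would distribute these phases across servers):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in a line of input.
    for word in line.lower().split():
        yield word, 1

def reducer(word, counts):
    # Reduce phase: sum the counts collected for one word.
    return word, sum(counts)

def run_job(lines):
    # Shuffle/sort: group mapper output by key, as the framework would,
    # then hand each group to a reducer.
    groups = defaultdict(list)
    for line in lines:
        for word, count in mapper(line):
            groups[word].append(count)
    return dict(reducer(w, c) for w, c in groups.items())

result = run_job(["big data on hadoop", "hadoop stores big data"])
# result["hadoop"] == 2, result["big"] == 2
```

Because mappers operate on independent slices of the input, the same logic scales from one laptop to thousands of servers without change.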


Recognizing Use Cases

Complex data demands a new approach.

The nature of the data that enterprises must capture, store, and analyze is changing— as is the importance of the insights it can provide. Not all data fits neatly into the rows and columns of a table. It comes from many sources, in multiple formats: multimedia, images, text, real-time feeds, sensor streams and more. Data format requirements change over time as new sources are developed. Hadoop is able to store data in its native format while offering analytical access.

A lot of data

Many companies are forced to discard valuable data because the cost of storing it is too high. New data sources compound the problem: people and machines are generating more data than ever before. Hadoop’s innovative architecture and use of low-cost industry-standard hardware for storage and processing helps you manage large amounts of data efficiently.

New approaches

Simple numerical summaries—average, minimum, sum—were sufficient for business problems in the 1980s and 1990s. But the large, complex data problems of today's businesses require new techniques. The algorithms involved include natural language processing, pattern recognition, machine learning, and other advanced methods.


Use Case 1: Detecting Cybersecurity Threats

Hadoop is a powerful platform for dealing with cybersecurity threats, fraud, and criminal activity. It is flexible enough to store all the data that matters: content, logs, relationships between people or systems, and patterns of activity. It is powerful enough to run sophisticated detection, analysis, and prevention algorithms and to create complex models from historical data to monitor real-time activity. Hadoop is also flexible enough to change as threats evolve, and to store and analyze new data sources as they become available. With Hadoop, companies can better analyze and prevent cybersecurity threats efficiently and effectively.


Use Case 2: Reducing Customer Churn

The discipline of retention processing (operations undertaken to reduce churn) has been on the leading edge of relational database technologies for years. But both the processing necessary to provide impactful, real-time analysis and the amount of data required have caused researchers to scale back analysis rather than expand it.

Data analysis is key to determining customer preference and building customer lifetime value. However, to successfully analyze complex problems like customer churn, data from many sources is required. By combining these data sources using Hadoop, it’s possible to create models that tie together market forces, customer preferences, and company operations into a holistic view of customer retention that can positively impact profitability and company performance.


Use Case 3: Targeting Advertising

At its core, ad targeting is a specialized type of recommendation engine that identifies users, determines their preferences, and delivers the ads best suited to each user.

Optimization requires examining both the relevance of a given advertisement to a particular user, and the collection of bids by different advertisers who want to reach that visitor. The analytics required to make the correct choice are complex, and running them on the large dataset available requires a large-scale, massively parallel system.
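The core selection step described above can be sketched simply: score each candidate ad by the advertiser's bid weighted by the ad's predicted relevance to the visitor, and serve the highest-scoring ad. The function and data names below are illustrative, and in practice the relevance scores would come from models trained on the cluster:

```python
def choose_ad(bids, relevance):
    # Expected value of showing an ad = bid * predicted relevance.
    # Pick the ad that maximizes it; missing relevance defaults to 0.
    return max(bids, key=lambda ad: bids[ad] * relevance.get(ad, 0.0))

bids = {"ad_a": 0.50, "ad_b": 1.20, "ad_c": 0.80}
relevance = {"ad_a": 0.9, "ad_b": 0.2, "ad_c": 0.7}

best = choose_ad(bids, relevance)
# ad_a scores 0.45, ad_b scores 0.24, ad_c scores 0.56 -> ad_c is chosen
```

The hard part is not this final comparison but computing accurate relevance scores for millions of users and ads, which is where the cluster's parallelism matters.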

One approach is to collect the stream of user activity, capture that data on the cluster, and run analyses continually to determine how successful the system has been at displaying ads that appeal to users. Business analysts at the exchange are able to generate reports on the performance of individual ads and to adjust the system to improve relevance and drive immediate increases in revenue.

Another approach is to build sophisticated models of user behavior in order to choose the right ad for each viewer in real time. Such a model uses large amounts of historical and tracking data about each user to cluster ads and users, and to deduce preferences. By leveraging the power of Hadoop to analyze historical user data, the exchange delivers much better-targeted advertisements and can steadily refine its models to deliver increasingly better ads.


Use Case 4: Delivering Search Results

Users today are more likely to search for information with keywords than to browse through folders looking for what they need. Good search tools are hard to build. They must store massive amounts of information, much of it complex text or multimedia files. They must be able to process these files to extract keywords or other attributes to use in searches.

Besides the difficulty of handling the data, a good search engine must be able to assess the intent and interest of the user when a search query arrives. The word “chip,” for example, can refer to food or to electronic components—context is vital to delivering useful search results. Delivering meaningful results is dependent on the analysis of user preferences, recent browsing history, and a number of other data sources.

You can deliver good search results by building the indexing infrastructure on Hadoop. It can be scaled easily to the data volume required. Just as importantly, it can run complicated indexing algorithms, and Hadoop’s distributed, parallel architecture can index very large amounts of information quickly.
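The central data structure behind keyword search is the inverted index: a mapping from each term to the documents that contain it. Building one fits the map/reduce pattern naturally (emit (term, doc_id) pairs, then group by term). A minimal in-memory sketch, with illustrative names:

```python
from collections import defaultdict

def build_inverted_index(documents):
    # documents: {doc_id: text}. Each mapper would emit (term, doc_id)
    # pairs; reducers would collect them into posting lists per term.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {1: "hadoop stores data", 2: "hadoop processes data at scale"}
index = build_inverted_index(docs)
# index["hadoop"] == {1, 2}; index["scale"] == {2}
```

At search time, answering a keyword query is then a lookup and intersection of posting lists rather than a scan of every document, which is what makes indexing the data up front worthwhile.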

In addition to the information about individuals, their history, and their preferences used when building search indexes, effective search engines track user behavior in response to searches themselves (which results were clicked, and which were ignored) and use it to refine results for subsequent users and future searches.


Use Case 5: Predicting Utility Outages

The volume of data required to predict and prevent outages is enormous. A clear picture of the health of a system depends on both real-time and after-the-fact forensic analysis of huge amounts of collected data in a variety of formats.

Hadoop clusters can capture and store the data streaming off sensors, and a continuous analysis system can watch the performance of individual components, looking for fluctuations that might suggest trouble. Hadoop is able to store the data from sensors inexpensively, so that you can afford to keep long-term historical data in usable form for forensic analysis. As a result, you can see, and react to, long-term trends and emerging problems.
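One simple way to watch for the kind of fluctuation mentioned above is to compare each new sensor reading against a rolling mean of recent readings and flag large deviations. This is a minimal sketch with an illustrative window size and threshold; a production system would tune these per component and run the check across the cluster:

```python
from collections import deque

def detect_fluctuations(readings, window=5, threshold=3.0):
    # Flag any reading that deviates from the rolling mean of the
    # previous `window` readings by more than `threshold`.
    recent = deque(maxlen=window)
    alerts = []
    for i, value in enumerate(readings):
        if len(recent) == window:
            mean = sum(recent) / window
            if abs(value - mean) > threshold:
                alerts.append((i, value))
        recent.append(value)
    return alerts

readings = [10.0, 10.2, 9.9, 10.1, 10.0, 25.0, 10.1]
alerts = detect_fluctuations(readings)
# the reading 25.0 at index 5 deviates sharply and is flagged
```

Because the full history is kept on the cluster, the same raw readings remain available later for deeper forensic analysis than this real-time check performs.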

