Relational DB versus Hadoop Cluster

Tuesday, March 17, 2015

Ever considered replacing your relational database with a Hadoop cluster? At Juvo HQ we thought about this and decided to take it to the ‘mythbusters’ level!

Is Hadoop really fast(er)?

Consider the Hadoop framework. A first important point is that a Hadoop cluster contains only a few (or even just one) master servers, while the number of slave servers can vary from one up to thousands. The system is built on two big blocks: HDFS and MapReduce. Each of these blocks works through services that run on every server in the cluster.

HDFS stores the data on all the servers of the Hadoop cluster. Consider the following example: say we want to save a file in our Hadoop cluster. The name node, which runs on the master server, first cuts the file into blocks of a fixed size; this block size typically lies between 64 and 256 MB. In the second step, the name node distributes these data blocks evenly over all the data nodes. In the final step, each data node stores the blocks it received from the name node; the data nodes have no idea what data they are storing.

Figure 1 HDFS

MapReduce is responsible for processing the data. Its most important feature is that the data is processed where it is stored. This ability to work in parallel improves the speed of the framework significantly. The first step in processing a query is determining which data blocks are needed to answer it; this step is executed by the jobtracker. The jobtracker then informs the right tasktrackers, which each execute (map) the query independently on their own data and return the results to the jobtracker. As a final step, the jobtracker combines (reduces) all these partial results into the final answer to the query. Just like in HDFS, the jobtracker runs on the master server and a tasktracker runs on every slave server.

Figure 2 MapReduce
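To make this concrete: the sketch below shows a simple aggregate in HiveQL (the language we later use to query the cluster) and, in the comments, how it is split into a map and a reduce phase. Only the Salesfacts table name is real; the customer_id and order_size columns are assumptions for the sake of the example.

    -- HiveQL sketch; 'customer_id' and 'order_size' are assumed column names.
    -- Hive turns this into one MapReduce job:
    --   map phase:    every tasktracker scans the Salesfacts blocks on its own
    --                 data node and emits (customer_id, order_size) pairs
    --   reduce phase: the pairs are grouped per customer_id and averaged
    SELECT customer_id,
           AVG(order_size) AS avg_order_size
    FROM   salesfacts
    GROUP  BY customer_id;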

Analyzing the data

We used a relatively small dataset to test our ‘myth’. The set contains three tables: the Artdim dimension (article info, 1 million rows, 9 columns, 1 GB in size), the Kladim dimension (customer info, 85,000 rows, 4 columns, 20 MB in size) and, last but not least, the Salesfacts table (sales facts (duh!), 2.7 million rows, 183 columns, 4 GB in size).
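On the Hadoop side these tables live in Hive. As a rough illustration, the smallest table could be defined like the sketch below; the column names and the delimiter are invented for the example, only the table name comes from our dataset.

    -- Illustrative Hive table definition; the columns and delimiter are made up.
    CREATE TABLE kladim (
        customer_id  INT,
        name         STRING,
        city         STRING,
        segment      STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ';'
    STORED AS TEXTFILE;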

Comparison of a relational database and a four-node Hadoop cluster

Because of the hype around Hadoop, we automatically used a four-server Hadoop cluster to store the data. But is this the best approach, or is a relational database faster? To answer this question, we ran some queries through both Hive and Oracle SQL Developer. The queries used for this were (a HiveQL sketch of the most important ones follows the list):

  • Count Artdim

  • Count Kladim

  • Count Salesfacts

  • Filter Association Rules (AR’s) to exclude some dummy AR’s (like a person who buys AA-batteries is likely to buy AAA-batteries)

  • Calculate average order size
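In HiveQL, the count queries and the average order size come down to statements like the sketch below. The table names are the real ones; the order_size column is an assumption made for the example.

    -- Hedged HiveQL sketch: table names are real, 'order_size' is an assumed column.
    SELECT COUNT(*) FROM artdim;
    SELECT COUNT(*) FROM kladim;
    SELECT COUNT(*) FROM salesfacts;

    -- Average order size over the fact table
    SELECT AVG(order_size) AS avg_order_size
    FROM   salesfacts;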

The results can be found in the table below.

Table 1 Duration of the queries (in seconds)

    Query                           Oracle 12c    Hadoop
    Count Artdim                          0.63     28.00
    Count Kladim                          0.25     29.00
    Count Salesfacts                     56.55     29.00
    Filter AR’s                           3.85    180.00
    Calculate average order size         47.65     46.33
    Average duration                     21.79     62.47

As you can see, Hadoop isn’t that fast compared with an Oracle 12c database for our dataset. It is even slower in most cases, especially for the ‘Filter AR’s’ query. The reason is simple: that query uses the same table multiple times, so the work cannot be done in a single parallel MapReduce pass and several MapReduce jobs have to be chained after one another.
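As an illustration of that pattern, such a filter could look roughly like the sketch below, in which the article dimension is read twice within a single query; the assoc_rules table and all column names are invented for the example.

    -- Illustrative sketch only: assoc_rules and the column names are made up.
    -- Because artdim is joined in twice, Hive has to chain several MapReduce
    -- jobs for this one query instead of answering it in a single pass.
    SELECT ar.antecedent, ar.consequent
    FROM   assoc_rules ar
    JOIN   artdim a1 ON ar.antecedent = a1.article_id
    JOIN   artdim a2 ON ar.consequent = a2.article_id
    WHERE  a1.product_group <> a2.product_group;  -- drops 'AA -> AAA battery' style rules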

But what happens if we increase the amount of data? Imagine that its size is doubled or quadrupled. We tested this by simply duplicating the data. The results of running the same queries on the bigger datasets can be found in the figure below.
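For reference, one way to double a table in Hive is a statement like the sketch below; the salesfacts_x2 name and the exact approach are illustrative, not necessarily how we built our larger sets.

    -- Sketch: build a table that contains every row of Salesfacts twice
    -- (the subquery form also works on older Hive versions).
    CREATE TABLE salesfacts_x2 AS
    SELECT * FROM (
        SELECT * FROM salesfacts
        UNION ALL
        SELECT * FROM salesfacts
    ) doubled;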

Figure 3 Average runtime (in seconds) as a function of the data size

The first thing we notice is that the Hadoop cluster is faster in our specific case once the dataset is more than twice its original size. If the dataset is four times the original size, the Oracle database takes more than twice as much time as the Hadoop system.

Our simple test shows that Hadoop only pays off for big data volumes or heavy queries; small amounts of data are processed much faster by a relational database.

TB-PVE-GW/JUVO