Virtualization of the exploitation layer of a Data Vault

Wednesday, October 26, 2016

Whenever Data Vault comes up, the discussion tends to reach the same tipping point: “What about a performant exploitation layer?” In this blog post I’m going to explain the problem and share my vision on it.

So, first of all, what is the problem with the exploitation layer when using Data Vault?

The Data Vault methodology is designed to store data reliably and to allow very fast loading. This is a big advantage over other methodologies, certainly considering the ever-growing Big Data trend. However, there is also a requirement for real-time reporting. A Data Vault model for a specific business area (let’s say a Dimension) might look like this:

In this example, all the blue boxes (the satellites) contain history and therefore have a start_date and end_date column. When all of these tables need to be combined into one Dimension table, you have to align all those dates in a Point-In-Time (PIT) table, which ensures that there are no overlaps and no gaps in the Dimension. As you can see, extracting this Dimension requires a lot of joins and rather complex logic. This results in a situation where real-time reporting becomes impossible.
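To make that a bit more concrete, here is a minimal Python sketch of the PIT idea. In reality this would be SQL against the actual hub and satellite tables; the table names and data below are made up purely for illustration:

```python
from datetime import date

# Hypothetical example: two satellites on the same hub, each with its own
# start_date/end_date history. Table names, columns and data are invented.
sat_customer_name = [
    # (hub_key, start_date, end_date, name)
    (1, date(2016, 1, 1), date(2016, 6, 1), "Acme Ltd"),
    (1, date(2016, 6, 1), date(9999, 12, 31), "Acme Group"),
]
sat_customer_address = [
    # (hub_key, start_date, end_date, city)
    (1, date(2016, 1, 1), date(2016, 3, 1), "Ghent"),
    (1, date(2016, 3, 1), date(9999, 12, 31), "Brussels"),
]

def pit_rows(hub_key, *satellites):
    """Yield one point-in-time row per date on which any satellite changed."""
    change_dates = sorted({row[1] for sat in satellites for row in sat if row[0] == hub_key})
    for snapshot in change_dates:
        # For every satellite, pick the version that was valid on the snapshot date.
        versions = []
        for sat in satellites:
            valid = [r for r in sat if r[0] == hub_key and r[1] <= snapshot < r[2]]
            versions.append(valid[0] if valid else None)
        yield snapshot, versions

for snapshot, versions in pit_rows(1, sat_customer_name, sat_customer_address):
    print(snapshot, [v[3] if v else None for v in versions])
# 2016-01-01 ['Acme Ltd', 'Ghent']
# 2016-03-01 ['Acme Ltd', 'Brussels']
# 2016-06-01 ['Acme Group', 'Brussels']
```

Every extra satellite adds extra change dates and an extra “which version was valid then” lookup, and that is exactly where the joins and the complexity come from.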

And what is the solution?

There has been a lot of focus on this issue, as most Data Vault projects hit this obstacle. Most vendors now provide appliances, in-memory technology and, in general, much more powerful hardware. That makes it a good scenario to push data virtualization as the perfect solution to this problem: by basically creating views that contain all the logic, the data is always up to date, and thanks to the faster hardware, performance shouldn’t be an issue.

However, these are claims made by hardware companies, and I think you should be careful with this kind of advertising. In theory (which means: with unlimited hardware power) it should be possible to fully virtualize your exploitation layer. But when you look at the model, adding an attribute to the dimension doesn’t mean “adding a column in the select statement”. In the best case you’re talking about adding one satellite to the point-in-time table and adding that table to the joins; in the worst case you’re talking about adding multiple hubs, links and satellites. So although a virtualization solution might work for now, there is no guarantee it will keep working in the future. You can add budget for powerful hardware to a project, but you don’t want to end up in a situation where you really need a hardware upgrade because a dimension needs an extra field.
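To show what “views which contain all the logic” means in practice, and why an extra attribute is an extra join rather than an extra column, here is a hypothetical Python sketch that generates such a dimension view from a list of satellites. All view, table and column names are invented; a real generator would be driven by your Data Vault metadata:

```python
# Hypothetical sketch of generating the view that virtualizes one dimension.
# Names are invented; a real generator would read the Data Vault metadata
# (hubs, links, satellites and their attributes) instead of this dict.
satellites = {
    "sat_customer_name": ["customer_name"],
    "sat_customer_address": ["city", "country"],
}  # adding an attribute means adding a satellite (and its columns) here

def dimension_view_sql(satellites):
    """Return a CREATE VIEW statement that joins every satellite to the PIT table."""
    select_list = ",\n       ".join(
        f"{sat}.{col}" for sat, cols in satellites.items() for col in cols
    )
    joins = "\n".join(
        f"JOIN {sat}\n"
        f"  ON {sat}.hub_key = pit.hub_key\n"
        f" AND pit.snapshot_date >= {sat}.start_date\n"
        f" AND pit.snapshot_date <  {sat}.end_date"
        for sat in satellites
    )
    return (
        "CREATE VIEW dim_customer AS\n"
        f"SELECT pit.hub_key,\n       pit.snapshot_date,\n       {select_list}\n"
        "FROM pit_customer AS pit\n"
        f"{joins};"
    )

print(dimension_view_sql(satellites))
```

The nice thing about generating the view like this is that the same metadata can just as well generate a persisted dimension table, which ties in with the conclusion below: if your ETL flows are built intelligently, the virtual variant is largely a by-product.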

What’s the conclusion?

In my opinion, investing in a Data Vault is a good idea. If you create your ETL flows in an intelligent way, you already have the code to transform them into a virtual data mart. But at this point, going straight for a virtualization solution seems like a big risk. Once vendors can show real-life cases of successful virtual data mart implementations, you might want to consider taking that risk.