The coronavirus forces many of us to work from home. This could give an opportunity to start learning something new. In this blog we’ll hope to inspire you on how you can build dashboards on top of the various data-sources that are available in your organisation. We will introduce PDI - an ETL (Extract Transform Load) tool - in combination with Power BI - a dashboarding tool. This high-level introduction only scratches the surface of the unlimited data-preparation and visualization possibilities of these components. In later blogs we will zoom in on certain aspects of these tools.
The main-component that we introduce here is PDI, a tool providing you with a visual ETL flow to extract data from one or more sources, transform it into the desired form and load it into a new destination. Common use-cases are for example a data migration from one database to another, integrating multiple files and databases into one, data cleaning (from simple trimming to complex transformations) and much more. For this use case we’ve been using the free PDI community edition.
To examplify how PDI works we will build an HR dashboard that gives a quick overview of remaining leave days, worked days, sickness, project / non-project work ... . PDI prepares data from multiple input sources into reporting-ready data. This data will then be visualized in PowerBI dashboards. Our global workflow can be summarized as the sketch below. In the remainder of this blog we will dive into the details of each of these layers.
Firstly, we started with loading all data from different inputs (Excel files, XML files, Oracle databases) into our staging layer (a PostgreSQL-database). In a staging layer, data is mostly an exact copy of the source without any complex transformation. Having data in one format in one place makes life a lot easier in later stages.
Below you see a PDI-flow similar to the ones that populate the staging layer. On the left we load an excel file from google drive, replace all null values with the text ‘empty’ and then write it away to our staging table. The right one starts from the company’s oracle database, puts everything in lower chars and then also writes this into our staging layer.
Now comes the real fun: cleaning and preparing our data to easily visualize in our dashboard! Firstly, will take a closer look at the following transformation. This transformation takes data from two staging tables (timesheet_users and timesheet_user_hrms) and combines them into one reporting layer table. Every block in this transformation is called a step, where the first steps indicates which source-data we want to transform.
After indicating the staging tables we use a select step (second step in both flows above). This step specifies the information you want to keep from the input sources. Another nice feature of this step is to format the input data (for example how many decimals you want to keep, what date format you want to have, etc).
In a next step, we sort both tables on the same column (in this case we used ID, if you don’t have an ID you can also take a combination of different columns like first and last name to make it unique). Next, we merge both tables in the “Merge.join.2” step. Indicate the key-fields and how to join.
We now extend the newly merged data with a new ID that we use across all the reporting tables (“create Glance_ID step”). This makes it easy to link all reporting tables to each other in a later stage. We finish this transformation by filtering out unnecessary data and replacing the null values of the department with 0 (to handle unknowns in a later stage)’. We now finally get our employee table!
Please note that this transformation can be scheduled so that it runs every hour, day, week, … . This way, you have regularly updated data that can be used in your reports without any manual interaction.
Now that we have all this information cleaned and well in one place it’s time to make it visual! We use Power BI to create a bunch of different reports. As an example you can see the project/non-project report below. This shows how many % of the work has been spend on projects in total and per department. The bottom graph makes it tangible how this percentage changes in time with respect to the number of employees in a company.
In short, PDI is a great starting point to learn ETL tooling. Without any licensing cost you can discover how to wrangle data together into a dashboard. We believe that the visual nature of ETL tooling makes it easy to learn for non-tech people.
Some good references are:
In a next blog we will dive into the usage of Power BI. If you need any help on this topic or want to learn more, contact our Juvo Remote services.