Kettle Community Weekend

Monday, November 25, 2019

Every year, a community member organizes a community weekend with presentations about how Kettle is used and which new features might be coming. This year was a bit different, since the community decided to distance itself a bit more from Hitachi and renamed the community meeting from PCM (Pentaho Community Weekend) to KCM (Kettle Community Weekend). I’ve been to the previous two editions, so I decided to check it out once again.

Pentaho Updates

For those of you who aren’t familiar with Kettle, let me sketch a quick background. Kettle is a tool used in Business Intelligence. It was created by Matt Casters in 2005, acquired by Pentaho in 2006 and acquired again by Hitachi a few years ago. Matt left Hitachi behind in 2018 to join Neo4j (a graph database) and create new features that combine the two.

There has been a lot of talk recently about Kettle, mostly frustration about stuff breaking or new bugs popping up everywhere, but hey... talk is talk, I guess.

The positive thing? A bigger workload for Matt Casters? Anyway, the first KCM edition kicked off with an update on the Kettle community edition and, as always, a bunch of new fixes and features have been released.

In my humble opinion, the best one is the new logging feature, where the Neo4j graph structure is combined with PDI (Pentaho Data Integration) logging. It logs the execution lineage and builds a metadata graph. In human terms: you can find the shortest path between a main job and the error. Another new and exciting feature is Kettle Beam. I’m not a pro on Beam, so if you want to know more, I suggest looking it up before I start blabbing eyebrow-raising nonsense.
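To make the "shortest path between a main job and the error" idea a bit more concrete, here is a minimal sketch in plain Python. It assumes the execution log has already been turned into a graph of "this job ran that step" relations. The graph, the job names and the `shortest_path` helper are all made up for illustration; this is not the actual Kettle or Neo4j implementation, just a breadth-first search over a toy graph.

```python
from collections import deque

# Hypothetical execution graph: each key "executed" the items in its list.
# All names are invented for illustration; none come from Kettle or Neo4j.
EXECUTION_GRAPH = {
    "main_job": ["load_customers", "load_orders"],
    "load_customers": ["read_csv", "write_table"],
    "load_orders": ["read_api", "validate_orders"],
    "validate_orders": ["error: null order id"],
}


def shortest_path(graph, start, goal):
    """Breadth-first search: shortest list of nodes from start to goal, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None


if __name__ == "__main__":
    path = shortest_path(EXECUTION_GRAPH, "main_job", "error: null order id")
    print(" -> ".join(path))
```

In a real graph database you would ask the database for this path instead of walking it yourself, but the idea is the same: start at the main job, follow the execution relations, stop at the error.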

Hey Ray

Usually the presentations are split up into two different tracks, the business track and the technical track. This year only one track remained. The bright side: I was able to follow pretty much every presentation. The downside: some were a bit sleep-inducing.

One of my favorite presentations was about PMI (Plugin Machine Intelligence). Honestly, I had never heard of it before, but it didn’t take long for them to pique my interest. PMI is a, drumroll…, plugin for PDI that combines ETL with machine learning, a sort of merging of two worlds. They explained a lot about the underlying structure, but what stood out to me was the demonstration of their app: “Hey Ray”. Basically, it’s an app on your phone where you can upload or take a picture of an X-ray, and the app tells you whether you have an injury or not. Perhaps not the most practical day-to-day app, until you turn it into a food app. They used the same ETL and underlying code to rebuild the app for food. Meaning: you can take a picture of what you are about to eat and the app tells you what it is. It may not sound so mind-blowing to you right now, but imagine the cool stuff you could do with it. For example, diabetes patients could just take a picture of their food and know how much insulin they need to inject. I mean, it’s the start of an awesome collaboration between human and machine to benefit day-to-day life. (And, yes, I’m a geek on that subject.)

Keep it simple

Anyway, the cool stuff didn’t end there. The second presentation that got en-“graph”-ed (pun intended) into my brain was the recommendation engine built for PDI. For a novice PDI developer, picking up common practices and general use can be a steep learning curve. With the use of data science and, yes, again, Neo4j, a recommendation engine was built to hand out suggestions while you are building ETL. The main objective is to flatten the learning curve and help new developers get acquainted with the different elements that PDI holds. Pretty cool in its current use case, but I’m only exaggerating a bit when I say “the crowd went wild”. It didn’t take too long for the developers in the room to see the potential it had. Not only for recommendations, but also the idea of creating a graph storage for transformations and jobs, or even using it to enforce a certain “company style” on new ETL development.

Get graphic

I’m probably biased, since I’ve worked with Kettle and dabbled in Neo4j, but the final presentation that offered some food for thought was about the creation of a data lineage framework with Kettle and Neo4j.

Every developer knows the pain of starting on a new project and having little to no documentation. Conversely, each business owner knows the pain of having to spend a number of days (and money) on getting a new developer up to speed. So what if you could store your ETL and database structure as a graph? That would let you simply query all the information you need: finding which column sits in which database and which table, and even seeing in which ETL structure the column is referenced. I’m pretty sure you’re all drooling over the idea and I can’t blame you, so was I.
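For the curious, here is a rough sketch of what "query the structure as a graph" could look like, written in plain Python instead of a real graph database. Everything in it is an assumption for illustration: the database, table and column names, the relationship labels (`HAS_TABLE`, `HAS_COLUMN`, `READS`) and the helper functions are mine, not the framework that was presented.

```python
# Hypothetical lineage graph stored as (source, relationship, target) edges.
# Names and relationship labels are invented for this example.
EDGES = [
    ("sales_db", "HAS_TABLE", "sales_db.orders"),
    ("sales_db", "HAS_TABLE", "sales_db.customers"),
    ("sales_db.orders", "HAS_COLUMN", "sales_db.orders.customer_id"),
    ("sales_db.customers", "HAS_COLUMN", "sales_db.customers.customer_id"),
    ("load_orders.ktr", "READS", "sales_db.orders.customer_id"),
    ("report_customers.ktr", "READS", "sales_db.customers.customer_id"),
    ("report_customers.ktr", "READS", "sales_db.orders.customer_id"),
]


def sources(relationship, target):
    """All nodes that point at `target` through `relationship`, sorted."""
    return sorted(s for s, r, t in EDGES if r == relationship and t == target)


def locate_column(column_name):
    """Return (database, table) pairs for every table holding this column name."""
    hits = []
    for table, relationship, column in EDGES:
        short_name = column.rsplit(".", 1)[-1]
        if relationship == "HAS_COLUMN" and short_name == column_name:
            for database in sources("HAS_TABLE", table):
                hits.append((database, table))
    return sorted(hits)


def referenced_by(column):
    """Return the ETL files that read the given fully qualified column."""
    return sources("READS", column)


if __name__ == "__main__":
    print(locate_column("customer_id"))
    print(referenced_by("sales_db.orders.customer_id"))
```

The first call answers "where does this column live?", the second "which ETL touches it?". With a graph database behind it, both become one-line queries instead of an afternoon of digging through transformations.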

In conclusion, the first KCM brought to light a few interesting topics and some cool new stuff to learn. It might not hold the complete, historical and famous grandeur of the previous PCM-editions, but it’s the yellow brick road to an improved city of Oz.