Tuesday, 27 June 2017

The visual paradigm of ETL tools

Paradigm (from http://dictionary.com):
- a framework containing the basic assumptions, ways of thinking, and methodology that are commonly accepted by members of a scientific community.
- such a cognitive framework shared by members of any discipline or group.

Following a recent demo of quite a well-known data preparation tool, I was left thinking to myself, "well, that was confusing". The workflow itself was quite straightforward: it extracted a reasonably simple dataset and then created and evaluated a machine learning process. But there was so much visual information on the screen, with so many icons, sub-processes and connections going all over the place, that it was difficult to understand what was going on.

So, I took to LinkedIn and Twitter on the subject and asked:

Quite a lot of comments were forthcoming, some of them quite interesting. I especially liked the one that suggested that the visual approach of one tool was essentially self-documenting.

It isn't.

The problem is that there is no shared paradigm about it. Well, there is a certain amount - for example, we tend to go left-to-right (until we don't) - but there are enough different options available to users to make one user's outputs very different to another's.

Let's have a look at a very simple example from Pentaho Data Integration (you might recall that I wrote an eBook some time ago on using Pentaho to prepare data for Qlik Sense):

Pentaho affords the user the option to have their flows going in whatever direction they want - up, down, left, right, diagonal - and flows can cross over. I can make as messy an interface as I want - although, hey, I can understand it and that is all that matters, right?

Even a system that enforces a left-to-right paradigm, for example RapidMiner, still allows the user a lot of freedom. For example, this data flow:

This is nice and simple, flows from left-to-right. Looks great, right? But what about now:

Functionally, it's the exact same flow, but visually different enough from the first so as to look like a different flow to different users. How about now:

Again, it is the same flow, just with processes grouped. Most of the ETL tools will allow us to "tidy" the display by grouping multiple icons and flows into a sub-process. Different users may group in different ways.

Of course, when you write scripts, you are even more free to do what you will. We can name variables whatever way we want. We can create subroutines and functions, classes and methods (depending on the language!), whatever we want. However, it does seem, and maybe this is just me, to be somewhat more controllable.

Script has a top-to-bottom flow. Even when using lots of functions, within those functions the code always flows from top-to-bottom. The syntax of the language is itself a constraint that enforces something that is more readable. Because the code is essentially structured text, we can even automate an enforced coding standard - including commenting.
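To make the point about automating a coding standard a little more concrete, here is a minimal sketch in Python. The rule it enforces (every function must carry a docstring) and the function name are my own assumptions for illustration, not a standard taken from any particular tool; it only uses the standard library's ast module.

```python
import ast


def functions_missing_docstrings(source: str) -> list[str]:
    """Return the names of functions in `source` that have no docstring.

    Hypothetical rule for illustration: every function must be commented.
    """
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and ast.get_docstring(node) is None
    ]


example = '''
def load_data(path):
    """Read the source file."""
    return open(path).read()

def transform(rows):
    return [r.strip() for r in rows]
'''

# Only the undocumented function is reported.
print(functions_missing_docstrings(example))
```

Because the script is just structured text, a check like this can run on every commit. There is no obvious equivalent for the position of icons on a canvas.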

This ability to automate the coding standard is actually a strength derived from many years of paradigm building. Scripting, in whatever language, has paradigms that developers quickly learn to follow.

Over time, the visual tools may develop those paradigms, but I am not sure that they can.

Stephen Redmond is a Data professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
Follow me on Twitter   LinkedIn

Monday, 2 January 2017

Hue, Saturation and Luminosity

Colour is an important variable to consider when designing a visualisation. A lot of Qlik developers, if we think of it at all, will think of colour in terms of a mix of red, green and blue - each of them usually as a numeric value between 0 and 255. A lot of web developers will think hexadecimally - #00 to #ff, with the R/G/B being expressed as a hex number like #00df87.
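The two notations describe exactly the same thing. As a small sketch (the helper names here are my own, purely for illustration), converting between them in Python looks like this:

```python
def rgb_to_hex(red: int, green: int, blue: int) -> str:
    """Format three 0-255 channel values as a web-style hex colour."""
    return f"#{red:02x}{green:02x}{blue:02x}"


def hex_to_rgb(value: str) -> tuple:
    """Split a '#rrggbb' string back into its three 0-255 channel values."""
    digits = value.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(rgb_to_hex(0, 223, 135))
print(hex_to_rgb("#00df87"))
```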

There is another way to think of colour, especially when thinking about how to represent scales, and that is to consider hue, saturation and luminosity.

Hue is what a lot of people will think of when they think of colour - almost the pure wavelength of the light spectrum, running from red to green to blue:

But it is actually a loop, because the blue runs back through to red again. Perhaps it is easier to represent as a circle (indeed, the CSS hsl function uses a value between 0-359, representing degrees on the colour wheel):

In QlikView, the HSL function takes a value between 0 and 1 for the hue. 0 is pure red, 0.33 is pure green and 0.67 is pure blue.
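The 0 to 1 hue and the 0 to 359 degree hue are the same position on the wheel, just scaled differently. The sketch below uses Python's standard colorsys module as a stand-in to show this; it is not QlikView code, the helper names are my own, and note that colorsys takes its arguments in hue, lightness, saturation order.

```python
import colorsys


def hue_to_degrees(hue: float) -> int:
    """Convert a 0-1 hue fraction to degrees on the colour wheel (0-359)."""
    return round(hue * 360) % 360


def hue_to_rgb(hue: float) -> tuple:
    """Return the fully saturated, mid-lightness colour for a 0-1 hue."""
    # colorsys argument order is (hue, lightness, saturation), all 0-1.
    channels = colorsys.hls_to_rgb(hue, 0.5, 1.0)
    return tuple(round(c * 255) for c in channels)


for name, hue in (("red", 0.0), ("green", 1 / 3), ("blue", 2 / 3)):
    print(name, hue_to_degrees(hue), hue_to_rgb(hue))
```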

A changing hue is used by some designers to represent a scale - the so-called "rainbow scale". However, this is wrong on a number of levels. Not least of these is that there is no well accepted norm to say that red is low while blue is high and green is in the middle. Of course, we also have to remember that we need to design visualisations that may be used by people with colour blindness. Therefore, if you are representing a single climbing or falling scale, you should really just stick to a single hue value. If you are creating a diverging scale, then two hue values can be used.
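As a rough sketch of the diverging case, the Python below picks one of two hues depending on which side of the midpoint a value falls, and lets the distance from the midpoint drive the saturation. The function name, the -1 to 1 input range and the default red and blue hues are assumptions I have made for illustration only; choose hues that suit your audience.

```python
import colorsys


def diverging_colour(value: float, low_hue: float = 0.0, high_hue: float = 2 / 3) -> tuple:
    """Map a value in [-1, 1] to an RGB colour on a two-hue diverging scale.

    Negative values use `low_hue`, positive values use `high_hue`, and the
    distance from zero sets the saturation, so zero itself is neutral grey.
    """
    value = max(-1.0, min(1.0, value))  # clamp out-of-range input
    hue = low_hue if value < 0 else high_hue
    saturation = abs(value)
    # colorsys argument order is (hue, lightness, saturation), all 0-1.
    channels = colorsys.hls_to_rgb(hue, 0.5, saturation)
    return tuple(round(c * 255) for c in channels)


for v in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(v, diverging_colour(v))
```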

Saturation means the level of saturation of the hue relative to grey - how much colour is there. This can be seen in the standard Microsoft colour picker:

So, we can see that, for each hue, the less saturated it is, the more grey it appears. Very low saturation for any hue will effectively mean just grey. So, saturation is potentially useful to represent a scale - with a single hue (for example, green):

One thing that we should be aware of is that it is not possible for us to see subtle differences in the saturation, so it is always better to have a stepped scale, with 10 steps being an absolute maximum (colorbrewer2.org uses 9 as a maximum for this!):
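A stepped scale is just the continuous value snapped to a small number of buckets. Here is a minimal sketch in Python; the function name and the default of 9 steps are my own choices for illustration, and the input is assumed to be already normalised to the range 0 to 1.

```python
def stepped_saturation(value: float, steps: int = 9) -> float:
    """Snap a normalised value (0-1) to one of `steps` saturation levels.

    The lowest bucket returns 1/steps rather than 0, so that even the
    smallest values keep a hint of the hue instead of pure grey.
    """
    if steps < 1:
        raise ValueError("steps must be at least 1")
    value = max(0.0, min(1.0, value))  # clamp out-of-range input
    index = min(int(value * steps), steps - 1)  # bucket number, 0-based
    return (index + 1) / steps


# Sweeping the whole input range only ever produces `steps` distinct levels.
levels = sorted({stepped_saturation(i / 1000) for i in range(1001)})
print(len(levels), levels[0], levels[-1])
```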

Luminosity defines the level of light that is emitted. We need to be careful here because this is often confused with brightness. Luminosity is something that can be objectively measured whereas brightness, like saturation, is a subjective human measure. We can use luminosity as a scale:

As with saturation, we should consider using a stepped scale:

So, why would we worry about HSL? Because they are easily programmable! In both Qlik (all of the images here are built in QlikView using the HSL() function) and web/CSS technologies, there is an HSL colour function that will accept a hue, saturation and luminosity value. Even better, in both cases, the saturation and luminosity values are represented by percentages - which are ideal for calculating scales.
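To show how little calculation is involved, here is a sketch in Python that turns a measure value into a CSS hsl() string with a fixed hue and a luminosity that darkens as the value rises. The function names, the 0 to 100 example range and the lightest/darkest limits of 90% and 25% are all assumptions of mine for illustration; the same arithmetic could equally feed an HSL expression in Qlik.

```python
def css_hsl(hue_degrees: int, saturation: float, luminosity: float) -> str:
    """Build a CSS hsl() string from a hue in degrees and two 0-1 fractions."""
    return f"hsl({hue_degrees}, {saturation:.0%}, {luminosity:.0%})"


def luminosity_scale(value: float, low: float, high: float,
                     lightest: float = 0.9, darkest: float = 0.25) -> float:
    """Map a value between `low` and `high` to a luminosity fraction.

    Higher values come out darker. The limits stop the scale from
    reaching pure white or pure black.
    """
    fraction = (value - low) / (high - low)
    fraction = max(0.0, min(1.0, fraction))  # clamp out-of-range input
    return lightest - fraction * (lightest - darkest)


# A single green hue (120 degrees), fully saturated, for sales of 0 to 100.
for sales in (0, 50, 100):
    print(sales, css_hsl(120, 1.0, luminosity_scale(sales, 0, 100)))
```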

Stephen Redmond is a Data Visualization professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook