Thursday 26 October 2017

Technical Debt in Analytics

I was lucky enough to attend the Spark Summit Europe this week, held in the Convention Centre, Dublin - a really good venue.

One of the concepts that appeared in several presentations (to which, of course, Spark based solutions are a natural solution!) was the idea of Technical Debt. The image that accompanied all the presentations was taken from a paper entitled Hidden Technical Debt in Machine Learning Systems, a paper authored by several Google employees:

The concept is very familiar to me from many years of selling QlikView. The debt there arises from the famous SIB - Seeing Is Believing (or just plain-old Proof-of-Concept) where we would go into a prospect company, take some of their data, hack together an impressive dashboard, and wow them with how quickly we could work our magic with this wonderful tool.

The debt, of course, arose when the prospect turned into a customer and wanted the POC put into production!

Eh, er, em, perhaps, oh... - that difficult conversation where we have to explain exactly how much work is needed to make this wonderful dashboard actually production ready.

Technical Debt is not a new concept. It was described as far back as 1992 by Ward Cunningham, (founder of the famous Hillside Group,  developer of Wiki, and one of the original signatories to the Agile Manifesto). It is unsurprising to find it described in Machine Learning. The extent of it may be a bit of a surprise.

Taking on debt is something that a business may accept as it may lead to growth opportunity. However, the business needs to understand the terms of the debt before they agree to it. This Google paper is worth reading and understanding.

Businesses need to understand that implementing "AI" and "Machine Learning" may lead to gold, but the debts will need to be paid. You wouldn't jump into a finance agreement without consulting an adviser, don't jump into analytics without talking to someone who know what they are talking about.

As well as holding a Master's Degree in Data Analytics, Stephen Redmond is a practicing Data Professional of over 20 years experience. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
Follow me on Twitter   LinkedIn

Sunday 22 October 2017

Using bipartite graphs projected onto two dimensions for text classification

It has been an interesting last couple of years - which means that I have been very quiet on this blog! But things will start to get busier here again. Busier, but different. As I have expanded my interests in wider data analytics technologies, I have been building experiences that I will start to share.

For much of the last two years I have been studying for a Master's degree in Data Analytics. This ended successfully and I am looking forward to being conferred with a H1 degree next month. My final project involved creating a new text classification method based on the bipartite relationship between words and documents but with, of course, a visual element, in that I have mapped the nodes of the bipartite graph onto two dimensions and then used QlikView to allow users to explore the model.

There is a published paper on the method that was presented at a recent conference in Zurich.

The important thing to note here is that this wasn't just a QlikView project. The model was built using Python and Spark, making use of the Databricks platform. As such, it is reflective of the direction of my interests over the last while - I still like using QlikView and Qlik Sense, but I have been working more and more on Big Data analytics, and Spark has been an important component of that.

I really like the the Big Data landscape right now - there are so many interesting things happening. I look forward especially to what is happening in visual analytics on Big Data. Companies such as Arcadia Data and Datameer are doing interesting thinks there. Qlik are, of course, working on a Big Data index, and that will be interesting to see when it comes out.

In the data science area, there are so many good desktop tools, but less options for working with the likes of Hadoop. I really like the new Cloudera Data Science Workbench in this regard, to allow teams of data professionals to work on code projects in a secure and governed way. I think that we will see other products making moves in this direction. For more 4GL type data processing, RapidMiner and Dataiku already work quite well with Hadoop. SAS has Hadoop connectivity, and some accuse them of having missed the Big Data boat, but they do have a forthcoming product called Viya that also promises to run directly on Big Data and the Cloud.

When I first started working with data, it was pretty much just SQL. Access was actually a pretty advanced data analysis tool, but was crippled with larger data sizes. When I look across the landscape now, it is hard not to be excited to see what will happen.

Stephen Redmond is a Data professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
Follow me on Twitter   LinkedIn

Tuesday 27 June 2017

The visual paradigm of ETL tools

Paradigm (from
- a framework containing the basic assumptions, ways of thinking, and methodology that are commonly accepted by members of a scientific community.
- such a cognitive framework shared by members of any discipline or group.

Following a recent demo of quite a well know data preparation tool, I was left thinking to myself, "well, that was confusing". The workflow itself was quite straightforward, in that it was an extraction of a reasonably straightforward dataset and then creating and evaluating a machine learning process. But there was just so much visual information on the screen, with so many icons, sub-processes and connections going all over the place, that it was just difficult to understand what was going on.

So, took to LinkedIn and Twitter on the subject and asked:

Quite a lot of comments were forthcoming, some of them quite interesting. I especially liked the one that suggested that the visual approach of one tool was essentially self-documenting.

It isn't.

The problem is that there is no shared paradigm about it. Well, there is a certain amount - for example, we tend to go left-to-right (until we don't) - but there is enough different options available to users to make one user's outputs very different to another's.

Let's have a look at a very simple example from Pentaho Data Integration (you might recall that I wrote an eBook some time ago on using Pentaho to prepare data for Qlik Sense):

Pentaho affords the user the option to have their flows going in whatever direction they want - up, down, left, right, diagonal - and flows can cross over. I can make as messy an interface as you want - although, hey, I can understand it and that is all that matters, right?

Even on a system that enforces a left-to-right paradigm, for example RapidMiner, still allows the user a lot of freedom. For example, this data flow:

This is nice and simple, flows from left-to-right. Looks great, right? But what about now:

Functionally, it's the exact same flow, but visually different enough from the first so as to look like a different flow to different users. How about now:

Again, it is the same flow, just with processes grouped. Most of the ETL tools will allow us to "tidy" the display by grouping multiple icons and flows into a sub-process. Different users may group in different ways.

Of course, when you write scripts, then you are even more free to do what you will. We can name variables whatever way we want. We can create sub routines and functions, classes and methods (depending on the language!), whatever we want. However, it does seem, and maybe this is just me, to be somewhat more controllable.

Script has a top-to-bottom flow. Even when using lots of functions, within those functions the code always flows from top-to-bottom. The syntax of the language is itself a constraint that enforces something that is more readable. Because the code is essentially structured text, we can even automate an enforced coding standard - including commenting.

This ability to automate the coding standard is actually a strength derived from many years of paradigm building. Scripting, in whatever language, has paradigms that developers quickly learn to follow.

Over time, the visual tools may develop those paradigms, but I am not sure that they can.

Stephen Redmond is a Data professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
Follow me on Twitter   LinkedIn

Monday 2 January 2017

Hue, Saturation and Luminosity

Colour is an important variable to consider when designing a visualisation. A lot of Qlik developers, if we think of it at all, will think of colour in terms of a mix of red, green and blue - each of them usually as a numeric value between 0 and 255. A lot of web developers will think hexadecimally - #00 to #ff, with the R/G/B being expressed as a hex number like #00df87.

There is another way to think of colour, especially when thinking about how to represent scales, and that is to consider hue, saturation and luminosity.

Hue is what a lot of people will think of when they think of colour - almost the pure wavelength of the light spectrum, running from red to green to blue:

But is is actually a loop, because the blue runs back through to red again. Perhaps it is easier to represent as a circle (indeed, the CSS hsl function uses a value between 0-359, representing degrees on the colour wheel):

In QlikView, the HSL function takes a value between 0 and 1 for the hue. 0 is pure red, 0.33 is pure green and 0.67 is pure blue.

A changing hue is used by some designers to represent a scale - the so-called "rainbow scale". However, this is wrong on a number of levels. Not least of these is that there is no well accepted norm to say that red is low while blue is high and green is in the middle. Of course, we also have to remember the we need to design visualisations that may be used by people with colour blindness. Therefore, if you are representing a single climbing or falling scale, you should really just stick to a single hue value. If you are creating a diverging scale, then two hue values can be used.

Saturation means the level of saturation of the hue relative to grey - how much colour is there. This can be seen in the standard Microsoft colour picker:

So, we can see that, for each hue, the less saturated then the more grey. Very low saturation for any hue will effectively mean just grey. So, saturation is potentially useful to represent a scale - with a single hue (for example, green):

One thing that we should be aware of is that it is not possible for us to see subtle differences in the saturation, so it is always better to have a stepped scale, with 10 steps being an absolute maximum ( uses 9 as a maximum for this!):

Luminosity defines the levels of light that are emitted. We need to be careful here because this is often confused with brightness. However, luminosity is something that can be objectively measured but, like saturation, brightness is a subjective human measure. We can use luminosity as a scale:

As with saturation, we should consider using a stepped scale:

So, why would we worry about HSL? Because they are easily programmable! In both Qlik (all of the images here are built in QlikView using the HSL() function) and web/css technologies, there is a HSL colour function that will accept a hue, saturation and luminosity value. Even better, in both cases, the saturation and luminosity values are represented by percentages - which are ideal for calculating scales.

Stephen Redmond is a Data Visualization professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
Follow me on Twitter   LinkedIn