Thursday, 25 January 2018

Data Visualization can be that simple

Two years ago I published my Fundamental Rules of Data Visualization (or "Redmond's Rules") which are, in summary:

- Use the right visual variables
- Provide context with annotations
- SFW - make sure that the results are relevant to the viewer

It was interesting for me to listen to the latest edition of Data Stories with interviewee Michelle Borkin. From the opening snippet, Michelle is confirming the fundamentals:

"Put a title on your graph, annotate the important things, label your axes, pick appropriate visual encodings, ..., people will understand your visualizations"

The basis of my "rules" comes from my own meandering experience as a practitioner, colored by the many subjective opinions that I have encountered. Michelle's advice comes from good old-fashioned scientific experimentation - it is good to see that there is some convergence between the two!

Michelle is an Assistant Professor at Northeastern University, where she works on visualization (among other things). Her papers include What Makes a Visualization Memorable? and Beyond Memorability: Visualization Recognition and Recall, both worth reading for anyone interested in the area.

Data visualization can be simple. Learn a few basics and work with your audience to design what suits them best.

As well as holding a Master's Degree in Data Analytics, Stephen Redmond is a practicing Data Professional of over 20 years experience. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
Follow me on Twitter   LinkedIn

Thursday, 26 October 2017

Technical Debt in Analytics

I was lucky enough to attend the Spark Summit Europe this week, held in the Convention Centre, Dublin - a really good venue.

One of the concepts that appeared in several presentations (to which, of course, Spark-based solutions are a natural answer!) was the idea of Technical Debt. The image that accompanied all of these presentations was taken from Hidden Technical Debt in Machine Learning Systems, a paper authored by several Google employees.

The concept is very familiar to me from many years of selling QlikView. The debt there arises from the famous SIB - Seeing Is Believing (or just plain-old Proof-of-Concept) where we would go into a prospect company, take some of their data, hack together an impressive dashboard, and wow them with how quickly we could work our magic with this wonderful tool.

The debt, of course, arose when the prospect turned into a customer and wanted the POC put into production!

Eh, er, em, perhaps, oh... - that difficult conversation where we have to explain exactly how much work is needed to make this wonderful dashboard actually production ready.

Technical Debt is not a new concept. It was described as far back as 1992 by Ward Cunningham (founder of the famous Hillside Group, developer of Wiki, and one of the original signatories to the Agile Manifesto). It is unsurprising to find it described in Machine Learning. The extent of it may be a bit of a surprise.

Taking on debt is something that a business may accept, as it may lead to growth opportunities. However, the business needs to understand the terms of the debt before it agrees to them. This Google paper is worth reading and understanding.

Businesses need to understand that implementing "AI" and "Machine Learning" may lead to gold, but the debts will need to be paid. You wouldn't jump into a finance agreement without consulting an adviser, so don't jump into analytics without talking to someone who knows what they are talking about.


Sunday, 22 October 2017

Using bipartite graphs projected onto two dimensions for text classification

It has been an interesting last couple of years - which means that I have been very quiet on this blog! But things will start to get busier here again. Busier, but different. As I have expanded my interests in wider data analytics technologies, I have been building experiences that I will start to share.

For much of the last two years I have been studying for a Master's degree in Data Analytics. This ended successfully and I am looking forward to being conferred with a H1 degree next month. My final project involved creating a new text classification method based on the bipartite relationship between words and documents. It has, of course, a visual element: I mapped the nodes of the bipartite graph onto two dimensions and then used QlikView to allow users to explore the model.

There is a published paper on the method that was presented at a recent conference in Zurich.
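To make the phrase "bipartite relationship between words and documents" concrete, here is a minimal Python sketch of the underlying data structure only. It is not the method from the paper: the toy documents, the whitespace tokenisation and the function names are all assumptions made for illustration, and the projection onto two dimensions is not attempted.

```python
from collections import defaultdict

# Toy corpus (hypothetical); real documents would need proper tokenisation.
documents = {
    "doc1": "spark makes big data processing fast",
    "doc2": "qlikview makes data exploration visual",
}


def build_bipartite_edges(docs):
    """Return a set of (word, document) edges.

    Words form one node set and documents the other; an edge never
    connects two words or two documents, which is what makes it bipartite.
    """
    edges = set()
    for doc_id, text in docs.items():
        for word in text.lower().split():
            edges.add((word, doc_id))
    return edges


def documents_by_word(edges):
    """Group the edges so each word maps to the documents it appears in."""
    adjacency = defaultdict(set)
    for word, doc_id in edges:
        adjacency[word].add(doc_id)
    return dict(adjacency)


edges = build_bipartite_edges(documents)
adjacency = documents_by_word(edges)
print(sorted(adjacency["data"]))   # documents sharing the word "data"
print(sorted(adjacency["spark"]))  # documents containing "spark"
```

Words that connect to many documents of one class and few of another are the kind of signal a classifier built on such a graph could exploit.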

The important thing to note here is that this wasn't just a QlikView project. The model was built using Python and Spark, making use of the Databricks platform. As such, it is reflective of the direction of my interests over the last while - I still like using QlikView and Qlik Sense, but I have been working more and more on Big Data analytics, and Spark has been an important component of that.

I really like the Big Data landscape right now - there are so many interesting things happening. I look forward especially to what is happening in visual analytics on Big Data. Companies such as Arcadia Data and Datameer are doing interesting things there. Qlik are, of course, working on a Big Data index, and that will be interesting to see when it comes out.

In the data science area, there are so many good desktop tools, but fewer options for working with the likes of Hadoop. I really like the new Cloudera Data Science Workbench in this regard, as it allows teams of data professionals to work on code projects in a secure and governed way. I think that we will see other products making moves in this direction. For more 4GL-type data processing, RapidMiner and Dataiku already work quite well with Hadoop. SAS has Hadoop connectivity, and some accuse them of having missed the Big Data boat, but they do have a forthcoming product called Viya that also promises to run directly on Big Data and the Cloud.

When I first started working with data, it was pretty much just SQL. Access was actually a pretty advanced data analysis tool, but was crippled with larger data sizes. When I look across the landscape now, it is hard not to be excited to see what will happen.


Tuesday, 27 June 2017

The visual paradigm of ETL tools

Paradigm (from
- a framework containing the basic assumptions, ways of thinking, and methodology that are commonly accepted by members of a scientific community.
- such a cognitive framework shared by members of any discipline or group.

Following a recent demo of quite a well-known data preparation tool, I was left thinking to myself, "well, that was confusing". The workflow itself was quite straightforward: an extraction of a reasonably simple dataset, followed by creating and evaluating a machine learning process. But there was just so much visual information on the screen, with so many icons, sub-processes and connections going all over the place, that it was difficult to understand what was going on.

So, I took to LinkedIn and Twitter on the subject and asked:

Quite a lot of comments were forthcoming, some of them quite interesting. I especially liked the one that suggested that the visual approach of one tool was essentially self-documenting.

It isn't.

The problem is that there is no shared paradigm about it. Well, there is a certain amount - for example, we tend to go left-to-right (until we don't) - but there are enough different options available to users to make one user's output very different from another's.

Let's have a look at a very simple example from Pentaho Data Integration (you might recall that I wrote an eBook some time ago on using Pentaho to prepare data for Qlik Sense):

Pentaho affords the user the option to have their flows going in whatever direction they want - up, down, left, right, diagonal - and flows can cross over. I can make as messy an interface as I want - although, hey, I can understand it and that is all that matters, right?

Even a system that enforces a left-to-right paradigm, for example RapidMiner, still allows the user a lot of freedom. For example, this data flow:

This is nice and simple, flows from left-to-right. Looks great, right? But what about now:

Functionally, it's the exact same flow, but visually different enough from the first so as to look like a different flow to different users. How about now:

Again, it is the same flow, just with processes grouped. Most of the ETL tools will allow us to "tidy" the display by grouping multiple icons and flows into a sub-process. Different users may group in different ways.

Of course, when you write scripts, then you are even more free to do what you will. We can name variables whatever way we want. We can create sub routines and functions, classes and methods (depending on the language!), whatever we want. However, it does seem, and maybe this is just me, to be somewhat more controllable.

Script has a top-to-bottom flow. Even when using lots of functions, within those functions the code always flows from top-to-bottom. The syntax of the language is itself a constraint that enforces something that is more readable. Because the code is essentially structured text, we can even automate an enforced coding standard - including commenting.

This ability to automate the coding standard is actually a strength derived from many years of paradigm building. Scripting, in whatever language, has paradigms that developers quickly learn to follow.
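As a minimal sketch of what automating a commenting standard can look like, the Python below parses a script as structured text and reports every function or class that lacks a docstring. The "every function needs a docstring" rule, the function name and the sample script are assumptions chosen for illustration; real teams would more likely rely on an established linter.

```python
import ast


def names_missing_docstrings(source):
    """Return the names of functions and classes in `source` with no docstring."""
    tree = ast.parse(source)
    missing = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if ast.get_docstring(node) is None:
                missing.append(node.name)
    return missing


# A hypothetical two-step script: one documented step, one undocumented.
sample_script = '''
def load(path):
    """Read the raw extract."""
    with open(path) as handle:
        return handle.readlines()

def transform(rows):
    return [row.strip() for row in rows]
'''

violations = names_missing_docstrings(sample_script)
print(violations)
```

Because the check works on the parsed syntax rather than on how the code happens to be laid out, it gives the same answer for every developer's script, which is exactly what is hard to do with a free-form visual canvas.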

Over time, the visual tools may develop those paradigms, but I am not sure that they can.


Monday, 2 January 2017

Hue, Saturation and Luminosity

Colour is an important variable to consider when designing a visualisation. A lot of Qlik developers, if we think of it at all, will think of colour in terms of a mix of red, green and blue - each of them usually as a numeric value between 0 and 255. A lot of web developers will think hexadecimally - #00 to #ff, with the R/G/B being expressed as a hex number like #00df87.
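The two notations are the same three numbers written differently. A minimal Python sketch of the conversion, using the #00df87 example above (the function names are hypothetical):

```python
def rgb_to_hex(red, green, blue):
    """Format three 0-255 channel values as a #rrggbb string."""
    for channel in (red, green, blue):
        if not 0 <= channel <= 255:
            raise ValueError("each channel must be between 0 and 255")
    return f"#{red:02x}{green:02x}{blue:02x}"


def hex_to_rgb(hex_colour):
    """Split a #rrggbb string back into three 0-255 integers."""
    digits = hex_colour.lstrip("#")
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


print(rgb_to_hex(0, 223, 135))
print(hex_to_rgb("#00df87"))
```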

There is another way to think of colour, especially when thinking about how to represent scales, and that is to consider hue, saturation and luminosity.

Hue is what a lot of people will think of when they think of colour - almost the pure wavelength of the light spectrum, running from red to green to blue:

But it is actually a loop, because the blue runs back through to red again. Perhaps it is easier to represent as a circle (indeed, the CSS hsl function uses a value between 0 and 359, representing degrees on the colour wheel):

In QlikView, the HSL function takes a value between 0 and 1 for the hue. 0 is pure red, 0.33 is pure green and 0.67 is pure blue.
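The same hue positions can be checked outside Qlik. This is a minimal sketch using Python's standard colorsys module rather than the QlikView HSL() function itself. It assumes full saturation and a mid-range lightness, and it uses exact thirds where the text above rounds to 0.33 and 0.67. The helper names are hypothetical.

```python
import colorsys


def hue_to_rgb255(hue_fraction, saturation=1.0, lightness=0.5):
    """Convert a hue given as a 0-1 fraction of the colour wheel to 0-255 RGB.

    Note that colorsys orders its arguments hue, lightness, saturation (HLS).
    """
    red, green, blue = colorsys.hls_to_rgb(hue_fraction, lightness, saturation)
    return tuple(round(channel * 255) for channel in (red, green, blue))


def hue_to_css_degrees(hue_fraction):
    """Express the same hue as degrees on the colour wheel, as CSS hsl() does."""
    return round(hue_fraction * 360) % 360


for name, hue in [("red", 0.0), ("green", 1 / 3), ("blue", 2 / 3)]:
    print(name, hue_to_rgb255(hue), hue_to_css_degrees(hue))
```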

A changing hue is used by some designers to represent a scale - the so-called "rainbow scale". However, this is wrong on a number of levels. Not least of these is that there is no well accepted norm to say that red is low while blue is high and green is in the middle. Of course, we also have to remember that we need to design visualisations that may be used by people with colour blindness. Therefore, if you are representing a single climbing or falling scale, you should really just stick to a single hue value. If you are creating a diverging scale, then two hue values can be used.

Saturation means the level of saturation of the hue relative to grey - how much colour is there. This can be seen in the standard Microsoft colour picker:

So, we can see that, for each hue, the less saturated the colour, the more grey it appears. Very low saturation for any hue will effectively mean just grey. So, saturation is potentially useful to represent a scale - with a single hue (for example, green):

One thing that we should be aware of is that it is not possible for us to see subtle differences in the saturation, so it is always better to have a stepped scale, with 10 steps being an absolute maximum ( uses 9 as a maximum for this!):

Luminosity defines the levels of light that are emitted. We need to be careful here because this is often confused with brightness. However, luminosity is something that can be objectively measured but, like saturation, brightness is a subjective human measure. We can use luminosity as a scale:

As with saturation, we should consider using a stepped scale:

So, why would we worry about HSL? Because they are easily programmable! In both Qlik (all of the images here are built in QlikView using the HSL() function) and web/css technologies, there is a HSL colour function that will accept a hue, saturation and luminosity value. Even better, in both cases, the saturation and luminosity values are represented by percentages - which are ideal for calculating scales.
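A minimal sketch of calculating such a stepped scale, written in Python and producing CSS-style hsl() strings. The function name, the fixed 50% lightness and the even spacing of the steps are assumptions for illustration; the cap of 10 steps follows the advice above.

```python
def saturation_steps(hue_degrees, steps=9, lightness_pct=50):
    """Return CSS hsl() strings for a single hue, stepping saturation low to high."""
    if not 2 <= steps <= 10:
        raise ValueError("keep a stepped scale between 2 and 10 steps")
    colours = []
    for step in range(1, steps + 1):
        # Evenly spaced saturation percentages, ending at 100%.
        saturation_pct = round(100 * step / steps)
        colours.append(f"hsl({hue_degrees}, {saturation_pct}%, {lightness_pct}%)")
    return colours


# A five-step green scale (hue 120 degrees on the colour wheel).
green_scale = saturation_steps(120, steps=5)
for colour in green_scale:
    print(colour)
```

A luminosity scale works the same way: hold the saturation fixed and step the third value instead.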


Tuesday, 31 May 2016

Pie charts and perceptual anchors

Some information was released today on a couple of really excellent research papers on pie charts from Robert Kosara (@eagereyes) and Drew Skau (@SeeingStructure):

I had, coincidentally, been doing some research of my own on perceptual anchors and how that relates to performance of pie charts versus stacked bar charts in part-to-whole comparisons. It would suggest that pies are not the terrible bad-guy after all. Who knew!


Sunday, 21 February 2016

Fundamental rules of data visualization

There are many "rules" of data visualization that we read in many publications. Some contradict others and some just don't make any sense. Some are accompanied by extensive amounts of proofiness, but an appreciation of the fundamentals is often missing. I can use algebra to prove to you that 1+1=1, using perfectly legitimate algebraic transformations, but it is invalid because it breaks a fundamental rule (for those who are interested, I will add it at the end of the post).

I like to preach three fundamental rules of data visualization to those who will listen:

1.  Data visualization is all about ratios
This is so fundamental that it is almost ridiculous to have to mention it, but we need to mention it. Any visualization that seeks to juxtapose several values for interpretation must do so using some kind of visual ratio.

There are many kinds of visual ratios and some are more effective than others. Cleveland and McGill (1984) gave us the order of effectiveness of interpretation for these ratios:

  • Position on a common scale
  • Position on non-aligned scales
  • Length
  • Direction
  • Angle
  • Area
  • Volume
  • Curvature
  • Shading
  • Color saturation

To try to create a data visualization that is not based on some kind of visual ratio is a fundamentally flawed approach. Not every ratio is appropriate for every visualization either, so we need to learn about what works where.

2.  Data visualization is all about context
We can create the most wonderfully beautiful bar charts and present them on a large screen in Times Square or print them on the most opulent paper in the most vivid colors, but without context they are just rectangles.
Context devices will include such simple elements as titles and axes - enough annotation so as to allow the reader to understand exactly what they are looking at.
As Amanda Cox, Graphics Editor at the New York Times, said in her Eyeo Festival talk:

The annotation layer is the most important thing we do... otherwise it's a case of here it is, you go figure it out.

3.  Data visualization is about SFW
This is the most important thing from a business point of view - and good data visualization is about creating a good solution for the business. SFW stands for So What.
I will always remember the day when I had spent hours on a great dashboard to present to a board-level executive at one of our most important clients. It was technically awesome! Really pushing the boundaries of what the tool could do.
I proudly showed it off at the executive presentation. My client sat patiently through it until, finally, he looked me straight in the eye and said:

So f***ing what?

He was right of course. My technically advanced dashboard had a huge fundamental flaw - I had failed to connect it correctly to the business problem. It wasn't a good solution at all - except in my head.
Fundamentally, we need to make sure that our data visualizations connect with the audience that they are intended for. The first two rules give us the correct technical result, the last gives us the brilliant business solution.

We can create some great business solutions by following these three rules. They may not look great, they may have garish colors, but if the CEO is able to use them to track his business then that is a very good dashboard.

To achieve glory among your peers, you need to start going beyond the fundamentals. Learn what works and what doesn't in most situations. Know when you should use a pie chart and when you shouldn't. Learn how to lay things out. Learn the best colors to use. This does lead to a fourth rule that could be considered fundamental:

4.  Get out of the way and show the numbers
We don't talk about all the color and layout stuff for the good of our health. There are good reasons for doing things in the ways that you will read about in the books. Learn about the reasons for good consistent layout, easy on the eye colors and clean presentation.
Above all, learn that if we don't follow the fundamentals then we start to potentially obscure the data, and this is a flaw that is important to correct.
Get out of the way and show the numbers.

For those that are interested, 1 + 1 = 1:

a = b = 1

a = b

a^2 = ab

a^2 - b^2 = ab - b^2

(a + b)(a - b) = b(a - b)

a + b = b

1 + 1 = 1
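The fundamental rule being broken is division by zero: going from (a + b)(a - b) = b(a - b) to a + b = b cancels the factor (a - b), and with a = b that factor is 0. A minimal Python check of each line with a = b = 1 (the variable names are mine) shows every step holding until that cancellation:

```python
# Evaluate both sides of each line of the "proof" with a = b = 1.
a = b = 1

steps = [
    ("a = b", a, b),
    ("a^2 = ab", a**2, a * b),
    ("a^2 - b^2 = ab - b^2", a**2 - b**2, a * b - b**2),
    ("(a + b)(a - b) = b(a - b)", (a + b) * (a - b), b * (a - b)),
    ("a + b = b", a + b, b),
]

results = []
for label, left, right in steps:
    holds = left == right
    results.append(holds)
    print(f"{label:28} {left} vs {right}: {'holds' if holds else 'FAILS'}")

# Cancelling (a - b) means dividing both sides by this value.
divisor = a - b
print("a - b =", divisor)
```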
