Sunday, 22 October 2017

Using bipartite graphs projected onto two dimensions for text classification

The last couple of years have been interesting - which also means that I have been very quiet on this blog! But things will start to get busier here again. Busier, but different. As I have expanded my interests into wider data analytics technologies, I have been building experience that I will start to share.

For much of the last two years I have been studying for a Master's degree in Data Analytics. This ended successfully, and I am looking forward to being conferred with a H1 degree next month. My final project involved creating a new text classification method based on the bipartite relationship between words and documents - with, of course, a visual element: I mapped the nodes of the bipartite graph onto two dimensions and then used QlikView to allow users to explore the model.

There is a published paper on the method that was presented at a recent conference in Zurich.

The important thing to note here is that this wasn't just a QlikView project. The model was built using Python and Spark, making use of the Databricks platform. As such, it is reflective of the direction of my interests over the last while - I still like using QlikView and Qlik Sense, but I have been working more and more on Big Data analytics, and Spark has been an important component of that.

I really like the Big Data landscape right now - there are so many interesting things happening. I am especially looking forward to what happens in visual analytics on Big Data. Companies such as Arcadia Data and Datameer are doing interesting things there. Qlik are, of course, working on a Big Data index, and that will be interesting to see when it comes out.

In the data science area, there are many good desktop tools, but fewer options for working with the likes of Hadoop. I really like the new Cloudera Data Science Workbench in this regard, as it allows teams of data professionals to work on code projects in a secure and governed way. I think that we will see other products making moves in this direction. For more 4GL-type data processing, RapidMiner and Dataiku already work quite well with Hadoop. SAS has Hadoop connectivity, and some accuse them of having missed the Big Data boat, but they do have a forthcoming product called Viya that also promises to run directly on Big Data and in the Cloud.

When I first started working with data, it was pretty much just SQL. Access was actually a pretty advanced data analysis tool, but it was crippled by larger data sizes. When I look across the landscape now, it is hard not to be excited to see what will happen.

Stephen Redmond is a Data professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
Follow me on Twitter   LinkedIn

Tuesday, 27 June 2017

The visual paradigm of ETL tools

Paradigm:
- a framework containing the basic assumptions, ways of thinking, and methodology that are commonly accepted by members of a scientific community.
- such a cognitive framework shared by members of any discipline or group.

Following a recent demo of quite a well-known data preparation tool, I was left thinking to myself, "well, that was confusing". The workflow itself was quite straightforward - an extraction of a reasonably straightforward dataset, followed by the creation and evaluation of a machine learning process. But there was just so much visual information on the screen, with so many icons, sub-processes and connections going all over the place, that it was difficult to understand what was going on.

So, I took to LinkedIn and Twitter on the subject and asked:

Quite a lot of comments were forthcoming, some of them quite interesting. I especially liked the one that suggested that the visual approach of one tool was essentially self-documenting.

It isn't.

The problem is that there is no shared paradigm about it. Well, there is a certain amount - for example, we tend to go left-to-right (until we don't) - but there are enough different options available to users to make one user's output very different from another's.

Let's have a look at a very simple example from Pentaho Data Integration (you might recall that I wrote an eBook some time ago on using Pentaho to prepare data for Qlik Sense):

Pentaho affords the user the option to have their flows going in whatever direction they want - up, down, left, right, diagonal - and flows can cross over. I can make as messy an interface as I want - although, hey, I can understand it and that is all that matters, right?

Even a system that enforces a left-to-right paradigm, for example RapidMiner, still allows the user a lot of freedom. For example, this data flow:

This is nice and simple, flows from left-to-right. Looks great, right? But what about now:

Functionally, it's the exact same flow, but visually different enough from the first so as to look like a different flow to different users. How about now:

Again, it is the same flow, just with processes grouped. Most of the ETL tools will allow us to "tidy" the display by grouping multiple icons and flows into a sub-process. Different users may group in different ways.

Of course, when you write scripts, then you are even more free to do what you will. We can name variables whatever way we want. We can create sub routines and functions, classes and methods (depending on the language!), whatever we want. However, it does seem, and maybe this is just me, to be somewhat more controllable.

Script has a top-to-bottom flow. Even when using lots of functions, within those functions the code always flows from top-to-bottom. The syntax of the language is itself a constraint that enforces something that is more readable. Because the code is essentially structured text, we can even automate an enforced coding standard - including commenting.

This ability to automate the coding standard is actually a strength derived from many years of paradigm building. Scripting, in whatever language, has paradigms that developers quickly learn to follow.
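Just to illustrate that point, here is a minimal sketch of automating one such coding standard - flagging functions that lack a docstring. It uses Python purely as an example; the argument applies to any scripting language, and the check itself is my own illustration, not part of any particular tool.

```python
import ast

def missing_docstrings(source: str) -> list:
    """Return the names of functions in `source` that have no docstring."""
    tree = ast.parse(source)
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and ast.get_docstring(node) is None
    ]

code = '''
def documented():
    """I have a docstring."""
    return 1

def undocumented():
    return 2
'''

print(missing_docstrings(code))  # -> ['undocumented']
```

Because the script is structured text, a check like this can run automatically on every commit - something that is much harder to do with a canvas of icons and arrows.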

Over time, the visual tools may develop those paradigms, but I am not sure that they can.


Monday, 2 January 2017

Hue, Saturation and Luminosity

Colour is an important variable to consider when designing a visualisation. A lot of Qlik developers, if we think about colour at all, will think of it in terms of a mix of red, green and blue - each usually a numeric value between 0 and 255. A lot of web developers will think hexadecimally - #00 to #ff, with the R/G/B values being expressed as a hex number like #00df87.

There is another way to think of colour, especially when thinking about how to represent scales, and that is to consider hue, saturation and luminosity.

Hue is what a lot of people will think of when they think of colour - almost the pure wavelength of the light spectrum, running from red to green to blue:

But it is actually a loop, because the blue runs back through to red again. Perhaps it is easier to represent as a circle (indeed, the CSS hsl function uses a value between 0 and 359, representing degrees on the colour wheel):

In QlikView, the HSL function takes a value between 0 and 1 for the hue. 0 is pure red, 0.33 is pure green and 0.67 is pure blue.
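The two conventions - CSS degrees and QlikView's 0 to 1 range - map onto each other directly. Here is a small sketch using Python's standard colorsys module (which, note, takes its arguments in hue-lightness-saturation order); the function names are my own, purely illustrative:

```python
import colorsys

def degrees_to_hue(deg: float) -> float:
    """Map a CSS-style hue angle (0-359 degrees) onto the 0..1 range."""
    return (deg % 360) / 360

def hue_to_rgb(hue: float) -> tuple:
    """Convert a 0..1 hue at full saturation and mid luminosity
    to an (r, g, b) tuple of 0..255 integers."""
    r, g, b = colorsys.hls_to_rgb(hue, 0.5, 1.0)  # note the HLS order
    return (round(r * 255), round(g * 255), round(b * 255))

print(hue_to_rgb(0.0))                   # pure red  -> (255, 0, 0)
print(hue_to_rgb(degrees_to_hue(120)))   # pure green -> (0, 255, 0)
print(hue_to_rgb(2 / 3))                 # pure blue -> (0, 0, 255)
```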

A changing hue is used by some designers to represent a scale - the so-called "rainbow scale". However, this is wrong on a number of levels. Not least of these is that there is no well-accepted norm to say that red is low while blue is high and green is in the middle. Of course, we also have to remember that we need to design visualisations that may be used by people with colour blindness. Therefore, if you are representing a single climbing or falling scale, you should really just stick to a single hue value. If you are creating a diverging scale, then two hue values can be used.

Saturation is the intensity of the hue relative to grey - how much colour there is. This can be seen in the standard Microsoft colour picker:

So, we can see that, for each hue, the less saturated it is, the more grey it appears. Very low saturation for any hue will effectively mean just grey. So, saturation is potentially useful to represent a scale - with a single hue (for example, green):

One thing that we should be aware of is that we cannot easily see subtle differences in saturation, so it is always better to have a stepped scale, with 10 steps being an absolute maximum ( uses 9 as a maximum for this!):
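The stepping itself is just a quantisation of a continuous 0..1 measure into a handful of bins. A minimal sketch, illustrative only:

```python
def stepped(value: float, steps: int = 9) -> float:
    """Quantize a 0..1 value into `steps` discrete levels, returning
    the level's representative value, also in the 0..1 range."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be in [0, 1]")
    idx = min(int(value * steps), steps - 1)  # bin index, 0..steps-1
    return idx / (steps - 1)                  # spread the levels over 0..1

# A measure of 0.47 lands on the middle level of a 9-step scale:
print(stepped(0.47))  # -> 0.5
```

The stepped value can then be fed straight into the saturation argument of a colour function, so that nearby data values share one clearly distinguishable colour.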

Luminosity defines the level of light that is emitted. We need to be careful here because it is often confused with brightness. However, luminosity is something that can be objectively measured, whereas brightness is a subjective human measure. We can use luminosity as a scale:

As with saturation, we should consider using a stepped scale:

So, why would we worry about HSL? Because it is easily programmable! In both Qlik (all of the images here were built in QlikView using the HSL() function) and web/CSS technologies, there is an HSL colour function that accepts hue, saturation and luminosity values. Even better, in both cases, the saturation and luminosity values are represented by percentages - which are ideal for calculating scales.
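As a sketch of that programmability - using Python's standard colorsys module rather than QlikView or CSS, so the hsl_hex function here is my own illustration - a CSS-style hsl() call can be turned into a hex colour, and a whole single-hue scale generated by varying only the luminosity percentage:

```python
import colorsys

def hsl_hex(hue_deg: float, sat_pct: float, lum_pct: float) -> str:
    """CSS-style hsl(): hue in degrees, saturation and luminosity as
    percentages, returned as an RGB hex string."""
    r, g, b = colorsys.hls_to_rgb((hue_deg % 360) / 360,
                                  lum_pct / 100,
                                  sat_pct / 100)
    return "#{:02x}{:02x}{:02x}".format(
        round(r * 255), round(g * 255), round(b * 255))

print(hsl_hex(120, 100, 50))  # full-saturation green -> #00ff00

# A five-step single-hue (green) scale driven by luminosity alone:
scale = [hsl_hex(120, 100, lum) for lum in (90, 75, 60, 45, 30)]
print(scale)
```

Because the scale is driven by one percentage, the same expression pattern works wherever an HSL function is available.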


Tuesday, 31 May 2016

Pie charts and perceptual anchors

Some information has been released today on a couple of really excellent research papers on pie charts from Robert Kosara (@eagereyes) and Drew Skau (@SeeingStructure):

I had, coincidentally, been doing some research of my own on perceptual anchors and how that relates to performance of pie charts versus stacked bar charts in part-to-whole comparisons. It would suggest that pies are not the terrible bad-guy after all. Who knew!


Sunday, 21 February 2016

Fundamental rules of data visualization

There are many "rules" of data visualization that we read about in many publications. Some contradict others, and some just don't make any sense. Some are accompanied by extensive amounts of proofiness, but an appreciation of the fundamentals is often missing. I can use algebra to prove to you that 1+1=1, using perfectly legitimate algebraic transformations, but the proof is invalid because it breaks a fundamental rule (for those who are interested, I will add it at the end of the post).

I like to preach three fundamental rules of data visualization to those who will listen:

1.  Data visualization is all about ratios
This is so fundamental that it seems almost ridiculous to have to mention it, but we need to mention it. Any visualization that seeks to juxtapose several values for interpretation must do so using some kind of visual ratio.

There are many kinds of visual ratios and some are more effective than others. Cleveland and McGill (1984) gave us the order of effectiveness of interpretation for these ratios:

  • Position on a common scale
  • Position on non-aligned scales
  • Length
  • Direction
  • Angle
  • Area
  • Volume
  • Curvature
  • Shading
  • Color saturation

Trying to create a data visualization that is not based on some kind of visual ratio is a fundamentally flawed approach. Not every ratio is appropriate for every visualization either, so we need to learn about what works where.

2.  Data visualization is all about context
We can create the most wonderfully beautiful bar charts and present them on a large screen in Times Square or print them on the most opulent paper in the most vivid colors, but without context they are just rectangles.
Context devices will include such simple elements as titles and axes - enough annotation so as to allow the reader to understand exactly what they are looking at.
As Amanda Cox, Graphics Editor at the New York Times, said in her Eyeo Festival talk:

The annotation layer is the most important thing we do... otherwise it's a case of here it is, you go figure it out.

3.  Data visualization is about SFW
This is the most important thing from a business point of view - and good data visualization is about creating a good solution for the business. SFW stands for So What.
I will always remember the day when I had spent hours on a great dashboard to present to a board-level executive at one of our most important clients. It was technically awesome! Really pushing the boundaries of what the tool could do.
I proudly showed it off at the executive presentation. My client sat patiently through it until, finally, he looked me straight in the eye and said:

So f***ing what?

He was right of course. My technically advanced dashboard had a huge fundamental flaw - I had failed to connect it correctly to the business problem. It wasn't a good solution at all - except in my head.
Fundamentally, we need to make sure that our data visualizations connect with the audience that they are intended for. The first two rules give us the correct technical result, the last gives us the brilliant business solution.

We can create some great business solutions by following these three rules. They may not look great, they may have garish colors, but if the CEO is able to use them to track his business then that is a very good dashboard.

To achieve glory among your peers, you need to start going beyond the fundamentals. Learn what works and what doesn't in most situations. Know when you should use a pie chart and when you shouldn't. Learn how to lay things out. Learn the best colors to use. This does lead to a fourth rule that could be considered fundamental:

4.  Get out of the way and show the numbers
We don't talk about all the color and layout stuff for the good of our health. There are good reasons for doing things in the ways that you will read about in the books. Learn about the reasons for good consistent layout, easy on the eye colors and clean presentation.
Above all, learn that if we don't follow the fundamentals then we start to potentially obscure the data, and this is a flaw that is important to correct.
Get out of the way and show the numbers.

For those that are interested, 1 + 1 = 1:

a = b = 1

a = b

a^2 = ab

a^2 - b^2 = ab - b^2

(a + b)(a - b) = b(a - b)

a + b = b     (this step divides both sides by a - b, which is zero - the broken fundamental rule)

1 + 1 = 1


Tuesday, 2 February 2016

How to lie with charts - crude oil versus retail gasoline prices

After watching a news item this morning, I posted the following question to social media:

If oil has dropped from > $100 / barrel to < $30, why are consumers still paying > €1 / litre?

There were some interesting responses. There was in my mind a suspicion that the retail prices were not coming down as quickly as the crude prices - but I had nothing to back that up with. I decided to investigate.

Taking crude oil prices from the US Energy Information Administration and monthly retail price data from AA Ireland, I put the two together quickly in QlikView. I decided to fix the time period to January 2010 to January 2016, as the last time the Irish government added an additional excise duty to fuel was in December 2009, so I knew that wouldn't interfere with the figures.

I plotted the data on a time series and, Aha!:

"Black and white!", I thought to myself. How obvious. While the crude price has been dropping like a stone, the retail price has had a much gentler descent. I had better get straight onto the press to reveal the petrol companies' evil intent towards the good people of Ireland.

But wait! There is a real problem here. The problem is that we have started both axes at zero - which is usually a sacrosanct rule. However, in this case, because we are not comparing the same value ranges, it is actually a mistake. By forcing both ranges into one area, I am actually distorting both of them.

In QlikView, the fix is simple, we just take off the force zero option for both expressions, revealing a much different state of affairs:

The crude and retail prices have actually been varying in a very similar way over the period. If I calculate Pearson's correlation coefficient for these two series, it comes out at approximately 0.77 - which is generally considered a high correlation for this type of data. In fact, if I drill into the last couple of years, the correlation is even tighter:

The correlation coefficient for the last 25 months' data calculates at approximately 0.95!

Any data scientists in the room might be tempted to normalize the data (calculating the z-scores) so that we can plot them on the same axis. When we do, we get a similar view to the one above:
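That normalisation is straightforward to sketch in code. The figures below are illustrative stand-ins, not the EIA or AA Ireland data; the point is only the technique - z-scores, and Pearson's r computed as the mean product of the two z-scored series:

```python
from statistics import mean, pstdev

def z_scores(xs):
    """Normalize a series to mean 0 and standard deviation 1."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def pearson(xs, ys):
    """Pearson's correlation coefficient of two equal-length series."""
    zx, zy = z_scores(xs), z_scores(ys)
    return mean(a * b for a, b in zip(zx, zy))

crude  = [100, 95, 80, 60, 45, 30]          # illustrative $/barrel values
retail = [1.50, 1.45, 1.38, 1.30, 1.22, 1.10]  # illustrative EUR/litre values

print(round(pearson(crude, retail), 2))
```

Once both series are z-scored they share a common scale (mean 0, standard deviation 1), which is why they can be plotted honestly on a single axis.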

And here is an example in Qlik Sense Cloud:

So, perhaps the oil companies are playing a straight bat on this one. There are many different factors that go into the retail price of a litre of fuel. The crude oil price is one of them, and quite a significant one. If we can see a good correlation between the two, then we can have some confidence that all is operating fairly.

The main point here though is that it is quite easy in a lot of visualisation tools to accidentally tell the wrong story. You may have best intentions, but you may end up telling visual lies.

Be careful out there!


Friday, 29 January 2016

CRISP-DM for Data Viz projects

Do you have a methodology for implementing Data Visualization projects?

How do you go about working with your stakeholders to deliver value?

CRISP-DM (the Cross Industry Standard Process for Data Mining) was conceived 20 years ago this year as a process to formalize data mining, but if we have a look at the diagram below, it really fits data visualization too!

When do we not do all of these steps in a data visualization project? If you are not doing them, why not? I'm OK if you don't, as long as you know why you are not.

It is definitely worth a data visualization practitioner's while to review the documentation - much of it is freely available online (start with the CRISP-DM 1.0 Step-by-step data mining guides).
