Wednesday, 29 April 2020

The most important form of work

If you haven't been following the new series of Qlik Virtual Meetups, they are worth tuning into and there is probably one in your region. Being virtual, it doesn't matter if they are in your region or not!

The recent edition of the Qlik Virtual Meetup Scotland featured an old friend in Treehive Strategy's Donald Farmer (it also featured the ever excellent Michael Tarallo talking about Qlik Sense April 2020!). In an excellent presentation, Donald mentioned two particular quotes that resonated with me and some of my own ideas around data visualisation and data literacy.

The first was from (the excellent!) Michael Lewis's book, Losers: The Road to Everyplace but the White House. The quote was:

"an explanation is where the mind comes to rest"

Donald uses the quote in the context of an analyst looking at data and arriving at what they feel is a conclusion. He feels, and I agree, that it is not enough to say that you have just followed the data. You stop being critical. Donald also paraphrases Deming: "with data, you're just another person with data, and an opinion". This is because we are all fundamentally human beings with many, many biases. Where our mind comes to rest we feel comfortable.

For me, an interesting example of this occurred just yesterday with a chart published in this article: Three charts that show where the coronavirus death rate is heading. The chart, which at first glance looks like a spiral mess, becomes clear and quite interesting with time. It is a great example of a chart that you need to spend some time with to appreciate, but the effort is rewarded.

I shared my opinion on Twitter, and some of the other discussion is interesting. However, most of the discussion stopped at "ugh!" - people were not prepared to give the chart the benefit of the doubt and invest any time. "Ugh!" was their explanation. That is where their minds rested.

The second quote was from a Harvard Business Review article titled What's So New About the New Economy written by Alan M. Webber:

"In the new economy, conversations are the most important form of work."

Donald expressed that data literacy is about communicating with data. It is not just about a few people understanding data; it is about raising the level in society.

Conversations are enormously important in raising all the boats. Some data visualizations are merely about getting to rest - getting users to a position where they have the explanation that they want. Others are designed to try to get users beyond that point. But it is hard to get a body-at-rest moving. Conversations can help us get over that inertia. Conversations about data. Conversations over data.

Will you join the conversation?


As well as holding a Master's Degree in Data Analytics, Stephen Redmond is a practicing Data Professional of over 20 years experience. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook

Friday, 17 April 2020

How Many Segments? And other stories

Following on from my recent presentation at the first ever Qlik Virtual Meetup Scotland (link opens the meeting recording), there were a few questions arising. I thought it would be useful to answer them in a blog post.

Scott (Pies): I like a pie chart but how many segments before it becomes difficult to judge a value?
Paul (Pies): Would you not just group smaller segments into OTHER?
Steve (Pies / Bars): I think the long tail on a bar chart, with the Sense minichart, is very obvious - where lots of tiny pie wedges do not

Let's start with the "how many segments" question and hopefully the thread will lead us through the answers to the others. First, consider this typical pie chart, produced in Qlik Sense:


A typical pie chart with many segments

As with the majority of visualization tools, Qlik Sense has defaulted to the sensible option of sorting the segments in size order. This immediately allows us to see that the USA is bigger than Germany which is bigger than Austria. I can roughly estimate the values of each (as shown in my research) and I can quickly tell that the three markets make up just under half of the whole.

The very simple answer to the question of "how many segments" is: however many make sense to meet the business requirements.

On the question of grouping smaller segments, I can modify the original chart as below, and still answer those same business questions:

Pie chart showing the top 3 segments versus all others

In this situation, I can still answer that business question and have reduced the number of segments on display. I would argue that I can answer that business question equally well with either chart, although the first may actually deliver me additional insights, and that is the critical thing about either chart - that they answer that business question.

What can sometimes be a problem, however, is that with interactive tools such as Qlik Sense the user could drill down to just those 3 countries:

Pie chart after user has drilled to 3 countries 

Now the user is no longer able to see the part-to-whole of these countries' market share. Instead, they are looking at the part-to-whole of just these countries. This may be OK! It depends on the business question that the user wants to answer. If it is a problem, you can use Set Analysis in Qlik Sense to do something about it - similar to one of my previous pie posts.
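
To make the difference concrete, here is a small Python sketch of the two denominators involved. It is not Qlik syntax; it only mimics the idea of a total that either follows the user's selection or ignores it. The function name `market_share` and the figures are assumptions made for the example.

```python
def market_share(sales, selection, against_all_data):
    """Share of each selected country, against all data or only the selection."""
    if against_all_data:
        # The denominator ignores the selection: true part-to-whole.
        denominator = sum(sales.values())
    else:
        # The denominator follows the selection: part-to-selection.
        denominator = sum(sales[country] for country in selection)
    return {country: sales[country] / denominator for country in selection}


# Hypothetical sales by country, invented purely for illustration.
sales = {
    "USA": 250, "Germany": 230, "Austria": 130, "Brazil": 150,
    "France": 120, "UK": 100, "Ireland": 90, "Sweden": 80,
    "Canada": 60, "Belgium": 40, "Poland": 10,
}
selection = ["USA", "Germany", "Austria"]

drilled = market_share(sales, selection, against_all_data=False)
whole = market_share(sales, selection, against_all_data=True)

for country in selection:
    print(f"{country}: {drilled[country]:.1%} of selection, {whole[country]:.1%} of market")
```

After the drill-down the three shares add up to 100%, which is what the pie shows by default. Keeping the denominator fixed on all of the data preserves the original market share for each country.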

Let's consider some alternatives (alt-pie charts!), starting with the simple bar chart.

Typical bar chart showing a measure versus a categorical value

Again, we typically order these charts by value, so we can still quickly see that the US is larger than Germany and both are larger than Austria. If our purpose here is to compare one country versus others, then this is the perfect chart to use. Even given that, it is not really so easy to compare the US vs. Poland or even Belgium, but interactivity can help with this. It is, however, quite a lot more difficult to see how much of the total market is made up by the top 3 countries. That would be even more difficult if there were a longer tail of smaller values that you might have to scroll through. That is why I prefer the pie chart if the business question is a part-to-whole one. We can, of course, clump the other countries into "Others" in the bar chart, but it is still not easy to see the part-to-whole:


Bar chart with "Others" bar

The choice that anti-pie advocates recommend for part-to-whole is the horizontal bar:


Examples of horizontal bars as an alternative to pie charts

As my research has shown, the horizontal bar does not always work as well as a pie chart. It is a valid option, and one that you could consider, but it should definitely not be the default.

To summarize, if you are asking the question "how many segments", then you may be asking the wrong question. Always remember Redmond's Rules:

  • Use the right visual encodings (and pie charts are a valid choice!)
  • Add labels and annotations to provide context to the user
  • SFW! Make sure that you are answering the business question

The last rule can be hard, because sometimes you don't actually know the questions that the users want to answer! In those circumstances, following a Design Thinking methodology will probably get you where you want to be.



Wednesday, 1 April 2020

Exponential data and logarithmic scales


I have to admit that when I first saw some of the recent data visualisations from the likes of the Financial Times and the New York Times, I wasn't an immediate fan. That is because they were using a logarithmic scale, which distorts the data. My feeling was that they should be using a population-based metric to compare different territories (XX per 100,000 is common).
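
For reference, the population-based metric is simple arithmetic. The Python sketch below uses made-up territories and counts (not real data) to show how a "per 100,000" rate can reverse the ranking given by raw counts.

```python
def per_100k(count, population):
    """Express a raw count as a rate per 100,000 people."""
    return count * 100_000 / population


# Hypothetical territories: (raw count, population). Not real data.
territories = {
    "Territory A": (5_000, 50_000_000),
    "Territory B": (1_000, 2_000_000),
}

rates = {name: per_100k(count, pop) for name, (count, pop) in territories.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per 100,000")
```

Territory A has five times the raw count, but Territory B has five times the rate once population is taken into account.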

Comparison of exponential data shown on a normal scale and on a logarithmic scale
A regular scale will have regular increments in the "Y" axis so if one point is twice as high as another, you can tell that it is twice the value. A logarithmic scale grows exponentially: each equal step up the axis multiplies the value by the same factor - doubling on a log 2 scale, although grid lines are usually presented at powers of 10. If a point is twice as high above the baseline (the value 1) as another, the higher value is the lower value squared (e.g. 4 -> 16, 8 -> 64). It can be difficult for people to interpret, especially if they are not mathematical. In fact, I would suggest that it is almost impossible for most users to quickly tell the accurate difference in magnitude between different points - merely that one point is greater or lesser than another.
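
The "twice as high means squared" relationship can be checked with a short Python sketch. It assumes a log 2 axis whose baseline is the value 1; the function name `axis_height` is just a label for this example.

```python
import math


def axis_height(value):
    """Height above the baseline (value 1) on a log 2 axis, counted in doublings."""
    return math.log2(value)


for low, high in [(4, 16), (8, 64)]:
    height_ratio = axis_height(high) / axis_height(low)
    value_ratio = high / low
    print(
        f"{low} -> {high}: {height_ratio:.0f}x as high on the log axis, "
        f"but {value_ratio:.0f}x the value"
    )
```

On a regular scale the two ratios would be identical. On the log scale the point is always twice as high, while the real difference is 4x in one case and 8x in the other, which is why magnitudes are so hard to read off.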

There is a general situation where it is useful to use a log scale, and that is where there is some skew in the data. For example, where there is a mix of some very high and many lower values - such as with exponentially growing data. In that situation, the scale of the higher values can obscure the lower values.

Ten US States growth shown on a normal scale. The higher value in one state hides detail in the other states. The dashed grey lines show example exponential growth patterns.
As an example, consider the chart above, which shows growth patterns in several US states. All have an exponential type of growth, but the higher values in New York make it difficult to see the direction of the smaller values in detail. The scale needs to accommodate the high New York values, but most of the "action" in this chart is at the smaller values.

Comparison of ten US states using a logarithmic scale. The trajectory lines are straightened and it is easier to see the trajectory of the states with lower values.
When the same data is presented on a logarithmic chart, all of the lines are straightened and we get a much better view of the trajectory of each state.
I can now clearly see that Michigan's trajectory appears to be heading in a slightly worse direction than New York's. I am not concerning myself with how much farther ahead on the trajectory New York is, only the direction that they are both travelling and hence making mental forecasts about Michigan's future.
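
The reason the lines straighten is that steady exponential growth adds a constant amount to the logarithm each day, so the slope on the log chart is the growth rate. The Python sketch below uses two synthetic series (not the state data from the chart) and hypothetical helper names to show a smaller series with a steeper trajectory.

```python
import math


def log_slope(series):
    """Average daily change in log10(value): the slope of the line on a log chart."""
    return (math.log10(series[-1]) - math.log10(series[0])) / (len(series) - 1)


def doubling_time(series):
    """Days needed to double at the series' average growth rate."""
    return math.log10(2) / log_slope(series)


# Synthetic series: a large one growing 20% a day and a small one growing 30% a day.
larger = [1000 * 1.20 ** day for day in range(14)]
smaller = [50 * 1.30 ** day for day in range(14)]

for name, series in [("larger", larger), ("smaller", smaller)]:
    print(
        f"{name}: latest value {series[-1]:.0f}, "
        f"slope {log_slope(series):.3f}, doubles every {doubling_time(series):.1f} days"
    )
```

The smaller series is still far below the larger one at the end of the period, but its slope is steeper. That is the comparison the log chart makes visible and the regular scale hides.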

Bar chart with a logarithmic scale - don't do this kids! The log scale removes the comparative power of the bar chart.
BTW, I am good with using log scales like this for lines, but don't do it for bar charts! The effect of the logarithmic scale is to remove the power that the bar chart has of aiding our understanding of the difference in magnitudes. These differences are encoded by the length of the bar, and a log scale will distort it. Don't do it!
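
A quick Python sketch shows how much the bar lengths are distorted. It assumes the log axis starts at 1 and uses base 10; the values and the function name `bar_length_ratio` are invented for the example.

```python
import math


def bar_length_ratio(small, large, log_scale):
    """How many times longer the bar for `large` is drawn than the bar for `small`."""
    if log_scale:
        # On a log axis starting at 1, bar length is proportional to log10(value).
        return math.log10(large) / math.log10(small)
    return large / small


small, large = 100, 10_000

print(f"Regular scale: {bar_length_ratio(small, large, log_scale=False):.0f}x as long")
print(f"Log scale:     {bar_length_ratio(small, large, log_scale=True):.0f}x as long")
```

A value that is 100 times bigger gets a bar that is only twice as long, so the length no longer tells the viewer anything reliable about magnitude.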


Qlik Luminary, Master's Degree in Data Analytics, Stephen Redmond is a practicing Data Professional of over 20 years experience. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook

Monday, 5 August 2019

Pie charts ain't such a bad guy!

About 3 years ago I did some research using Amazon Mechanical Turk into how well people judge segments in a part-to-whole chart. I mentioned it in a blog post back then, but didn't go into too much detail.

There had been some excellent work done by Robert Kosara (from Tableau Research) along with Dean Skau into Pie Charts, Doughnut Charts and various differences. Robert has continued this with a number of papers since.

At the heart of my research was this nagging thinking that pervaded the data visualization ecosystem - pie charts are bad, never use pie charts. This was kinda fine because I could do other stuff - especially segmented bar charts or tree maps - but I always had business users asking for pie charts and not really getting me when I tried to explain that they weren't the best way.

In 2015 I had started studying for a Master's Degree in Data Analytics, so I was starting to get back into the academic way of thinking and looking at stuff. When I had a break from studying during the summer of 2016 I started to look around and found that there was no real basis for anyone rejecting a pie chart for part-to-whole comparison, other than that they didn't really like them! When it came to actually testing pie charts versus other types of charts, the pies seemed to do as well as or better than the alternatives.

There was some suggestion in a number of papers that pie charts actually do better as they have a number of natural visual cues - at 0%, 25%, 50% and 75% - whereas the bar chart has definitive visual cues at 0% and 100% and a less well defined cue at 50%.

So, being the curious person that I am, I decided to test things for myself. I put a little money into it and spun up an Amazon Mechanical Turk account. I created a number of images (using QlikView of course!), and had the "workers" judge the size of a segment in a chart. I used a set of "baseline" pie and bar charts (just a standard pie and bar chart) and then a set of bar charts that had additional visual cues added.


The chart above shows the comparison of mean absolute error recorded by participants (basically, how far off the mark were they with their estimations) and the 95% confidence intervals of those results.
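
For readers unfamiliar with the measure, here is a minimal Python sketch of a mean absolute error with a 95% confidence interval. The post does not say how the intervals were calculated, so the normal approximation used here (mean plus or minus 1.96 standard errors) is an assumption, and the estimates are invented rather than taken from the study.

```python
import math
import statistics


def mean_absolute_error_ci(estimates, true_value, z=1.96):
    """Mean absolute error and an approximate 95% confidence interval.

    Assumes a normal approximation for the interval, which may differ
    from the method used in the actual research.
    """
    errors = [abs(estimate - true_value) for estimate in estimates]
    mae = statistics.mean(errors)
    half_width = z * statistics.stdev(errors) / math.sqrt(len(errors))
    return mae, mae - half_width, mae + half_width


# Hypothetical participant estimates (in %) for a segment whose true size is 32%.
estimates = [30, 35, 25, 40, 28, 33]

mae, low, high = mean_absolute_error_ci(estimates, true_value=32)
print(f"MAE = {mae:.2f} points (95% CI {low:.2f} to {high:.2f})")
```

A lower mean absolute error means participants' estimates were closer to the true segment size. When the intervals for two chart types do not overlap, it is reasonable to treat the difference between them as more than noise.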

The baseline pie chart performed better than the baseline bar, even considering the confidence intervals. This was not a surprise as it confirmed the results of an experiment from 1915!
The bar chart with a numerical scale as a visual cue performs the very best - and this aligns to what Stephen Few says in his famous paper, Save the Pies for Dessert.

Of course, it is not always practical to have a numerical scale on a bar chart, and I have shown that adding a perceptible visual cue at the decile positions (every 10%) performs almost as well as the scale. That is much better than the baseline bar, and also better than a bar with visual cues at the quartile positions (every 25%).

Interestingly, adding a visual cue at the quartile positions for the pie chart did not improve its performance significantly over the baseline. With quartile cues not improving the performance over the baselines, it may indicate that we do indeed pick up on those cues automatically. More research needed here.

The upshot here is to not feel bad about using a pie chart for part-to-whole comparison. No need to feel embarrassed at the next visualization meetup or to share it on an online forum! Be bold!!!
The reality is that it comes back to my Fundamental Rules of Visualization (or "Redmond's Rules") which are, in summary:

- Use the right visual variables
- Provide context with annotations
- SFW - make sure that the results are relevant to the viewer

I'm not the only one who is leaning in this direction and, as I blogged about previously, visualization can be that simple.

I finally got round to writing up an academic paper on my research and the good news (for me!) is that it has been accepted into the Short Papers section of IEEE Vis 2019 in Vancouver. If you are interested, a pre-print is available on arXiv.


Stephen Redmond is a practicing Data Professional of over 20 years experience. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
   @stephencredmond

Thursday, 25 January 2018

Data Visualization can be that simple

Two years ago I published my Fundamental Rules of Data Visualization (or "Redmond's Rules") which are, in summary:

- Use the right visual variables
- Provide context with annotations
- SFW - make sure that the results are relevant to the viewer

It was interesting for me to listen to the latest edition of Data Stories with interviewee Michelle Borkin. From the opening snippet, Michelle is confirming the fundamentals:

"Put a title on your graph, annotate the important things, label your axes, pick appropriate visual encodings, ..., people will understand your visualizations"

The basis of my "rules" comes from my own meandering experience as a practitioner, colored by the many subjective opinions that I have encountered. Michelle's advice comes from good old-fashioned scientific experimentation - it is good to see that there is some convergence between the two!

Michelle is an Assistant Professor at Northeastern University, where she works on visualization (among other things). Her papers include What Makes a Visualization Memorable? and Beyond Memorability: Visualization Recognition and Recall, both worth reading for anyone interested in the area.

Data visualization can be simple. Learn a few basics and work with your audience to design what suits them best.



Thursday, 26 October 2017

Technical Debt in Analytics

I was lucky enough to attend the Spark Summit Europe this week, held in the Convention Centre, Dublin - a really good venue.

One of the concepts that appeared in several presentations (to which, of course, Spark based solutions are a natural solution!) was the idea of Technical Debt. The image that accompanied all the presentations was taken from a paper entitled Hidden Technical Debt in Machine Learning Systems, a paper authored by several Google employees:


The concept is very familiar to me from many years of selling QlikView. The debt there arises from the famous SIB - Seeing Is Believing (or just plain-old Proof-of-Concept) where we would go into a prospect company, take some of their data, hack together an impressive dashboard, and wow them with how quickly we could work our magic with this wonderful tool.

The debt, of course, arose when the prospect turned into a customer and wanted the POC put into production!

Eh, er, em, perhaps, oh... - that difficult conversation where we have to explain exactly how much work is needed to make this wonderful dashboard actually production ready.

Technical Debt is not a new concept. It was described as far back as 1992 by Ward Cunningham (founder of the famous Hillside Group, developer of Wiki, and one of the original signatories to the Agile Manifesto). It is unsurprising to find it described in Machine Learning. The extent of it may be a bit of a surprise.

Taking on debt is something that a business may accept as it may lead to growth opportunity. However, the business needs to understand the terms of the debt before they agree to it. This Google paper is worth reading and understanding.

Businesses need to understand that implementing "AI" and "Machine Learning" may lead to gold, but the debts will need to be paid. You wouldn't jump into a finance agreement without consulting an adviser; don't jump into analytics without talking to someone who knows what they are talking about.




Sunday, 22 October 2017

Using bipartite graphs projected onto two dimensions for text classification

It has been an interesting last couple of years - which means that I have been very quiet on this blog! But things will start to get busier here again. Busier, but different. As I have expanded my interests in wider data analytics technologies, I have been building experiences that I will start to share.

For much of the last two years I have been studying for a Master's degree in Data Analytics. This ended successfully and I am looking forward to being conferred with a H1 degree next month. My final project involved creating a new text classification method based on the bipartite relationship between words and documents but with, of course, a visual element, in that I have mapped the nodes of the bipartite graph onto two dimensions and then used QlikView to allow users to explore the model.


There is a published paper on the method that was presented at a recent conference in Zurich.

The important thing to note here is that this wasn't just a QlikView project. The model was built using Python and Spark, making use of the Databricks platform. As such, it is reflective of the direction of my interests over the last while - I still like using QlikView and Qlik Sense, but I have been working more and more on Big Data analytics, and Spark has been an important component of that.

I really like the Big Data landscape right now - there are so many interesting things happening. I look forward especially to what is happening in visual analytics on Big Data. Companies such as Arcadia Data and Datameer are doing interesting things there. Qlik are, of course, working on a Big Data index, and that will be interesting to see when it comes out.

In the data science area, there are so many good desktop tools, but fewer options for working with the likes of Hadoop. I really like the new Cloudera Data Science Workbench in this regard, as it allows teams of data professionals to work on code projects in a secure and governed way. I think that we will see other products making moves in this direction. For more 4GL type data processing, RapidMiner and Dataiku already work quite well with Hadoop. SAS has Hadoop connectivity, and some accuse them of having missed the Big Data boat, but they do have a forthcoming product called Viya that also promises to run directly on Big Data and the Cloud.

When I first started working with data, it was pretty much just SQL. Access was actually a pretty advanced data analysis tool, but was crippled with larger data sizes. When I look across the landscape now, it is hard not to be excited to see what will happen.


Stephen Redmond is a Data professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook