Thursday, 26 October 2017

Technical Debt in Analytics

I was lucky enough to attend the Spark Summit Europe this week, held in the Convention Centre, Dublin - a really good venue.

One of the concepts that appeared in several presentations (for which, of course, Spark-based solutions were offered as the natural answer!) was the idea of Technical Debt. The image that accompanied all the presentations was taken from Hidden Technical Debt in Machine Learning Systems, a paper authored by several Google employees.

The concept is very familiar to me from many years of selling QlikView. The debt there arises from the famous SIB - Seeing Is Believing (or just plain-old Proof-of-Concept) where we would go into a prospect company, take some of their data, hack together an impressive dashboard, and wow them with how quickly we could work our magic with this wonderful tool.

The debt, of course, arose when the prospect turned into a customer and wanted the POC put into production!

Eh, er, em, perhaps, oh... - that difficult conversation where we have to explain exactly how much work is needed to make this wonderful dashboard actually production ready.

Technical Debt is not a new concept. It was described as far back as 1992 by Ward Cunningham (founder of the famous Hillside Group, developer of the first wiki, and one of the original signatories to the Agile Manifesto). It is unsurprising to find it described in Machine Learning. The extent of it may be a bit of a surprise.

Taking on debt is something that a business may accept because it can open up a growth opportunity. However, the business needs to understand the terms of the debt before it agrees to them. This Google paper is worth reading and understanding.

Businesses need to understand that implementing "AI" and "Machine Learning" may lead to gold, but the debts will need to be paid. You wouldn't jump into a finance agreement without consulting an adviser, so don't jump into analytics without talking to someone who knows what they are talking about.

As well as holding a Master's Degree in Data Analytics, Stephen Redmond is a practicing Data Professional with over 20 years' experience. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developers Cookbook.
Follow me on Twitter   LinkedIn

Sunday, 22 October 2017

Using bipartite graphs projected onto two dimensions for text classification

It has been an interesting last couple of years - which means that I have been very quiet on this blog! But things will start to get busier here again. Busier, but different. As I have expanded my interests into wider data analytics technologies, I have been building up experience that I will start to share.

For much of the last two years I have been studying for a Master's degree in Data Analytics. This ended successfully and I am looking forward to being conferred with an H1 degree next month. My final project involved creating a new text classification method based on the bipartite relationship between words and documents. It has, of course, a visual element: I mapped the nodes of the bipartite graph onto two dimensions and then used QlikView to allow users to explore the model.
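For readers unfamiliar with the idea, here is a minimal Python sketch of a word-document bipartite graph with its nodes given two-dimensional coordinates. It is not the method from my project: the function names, the sample documents and the layout rule (words spaced evenly on a circle, each document placed at the average position of its words) are all assumptions made purely for illustration.

```python
import math
from collections import defaultdict


def build_bipartite(docs):
    """Build both sides of a word-document bipartite graph.

    docs maps a document id to its raw text. Tokenising on whitespace
    is a simplifying assumption for this sketch.
    """
    doc_to_words = {}
    word_to_docs = defaultdict(set)
    for doc_id, text in docs.items():
        words = set(text.lower().split())
        doc_to_words[doc_id] = words
        for word in words:
            word_to_docs[word].add(doc_id)
    return doc_to_words, dict(word_to_docs)


def layout_2d(doc_to_words, word_to_docs):
    """Give every node an (x, y) position.

    Illustrative rule only: words sit evenly spaced on the unit circle
    (in alphabetical order), and each document sits at the mean
    position of the words it contains.
    """
    words = sorted(word_to_docs)
    word_pos = {}
    for i, word in enumerate(words):
        angle = 2 * math.pi * i / len(words)
        word_pos[word] = (math.cos(angle), math.sin(angle))

    doc_pos = {}
    for doc_id, doc_words in doc_to_words.items():
        if not doc_words:
            continue  # an empty document has no edges, so no position
        xs = [word_pos[w][0] for w in doc_words]
        ys = [word_pos[w][1] for w in doc_words]
        doc_pos[doc_id] = (sum(xs) / len(xs), sum(ys) / len(ys))
    return word_pos, doc_pos


# Hypothetical sample documents, invented for this example.
docs = {
    "d1": "spark cluster data",
    "d2": "spark data pipeline",
    "d3": "dashboard chart colour",
}
doc_to_words, word_to_docs = build_bipartite(docs)
word_pos, doc_pos = layout_2d(doc_to_words, word_to_docs)
```

Because the word positions in this sketch are arbitrary, distances between documents carry no real meaning, beyond the fact that two documents with identical word sets land on the same point. A real method would need to choose the word coordinates so that closeness in the plane reflects something useful for classification.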

There is a published paper on the method that was presented at a recent conference in Zurich.

The important thing to note here is that this wasn't just a QlikView project. The model was built using Python and Spark, making use of the Databricks platform. As such, it is reflective of the direction of my interests over the last while - I still like using QlikView and Qlik Sense, but I have been working more and more on Big Data analytics, and Spark has been an important component of that.

I really like the Big Data landscape right now - there are so many interesting things happening. I look forward especially to what is happening in visual analytics on Big Data. Companies such as Arcadia Data and Datameer are doing interesting things there. Qlik are, of course, working on a Big Data index, and that will be interesting to see when it comes out.

In the data science area, there are so many good desktop tools, but fewer options for working with the likes of Hadoop. I really like the new Cloudera Data Science Workbench in this regard, as it allows teams of data professionals to work on code projects in a secure and governed way. I think that we will see other products making moves in this direction. For more 4GL-type data processing, RapidMiner and Dataiku already work quite well with Hadoop. SAS has Hadoop connectivity, and although some accuse them of having missed the Big Data boat, they do have a forthcoming product called Viya that also promises to run directly on Big Data and the Cloud.

When I first started working with data, it was pretty much just SQL. Access was actually a pretty advanced data analysis tool, but it was crippled by larger data sizes. When I look across the landscape now, it is hard not to be excited to see what will happen.

Stephen Redmond is a Data Professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developers Cookbook.