Sunday, 22 October 2017

Using bipartite graphs projected onto two dimensions for text classification

It has been an interesting last couple of years - which means that I have been very quiet on this blog! But things will start to get busier here again. Busier, but different. As I have expanded my interests in wider data analytics technologies, I have been building experiences that I will start to share.

For much of the last two years I have been studying for a Master's degree in Data Analytics. This ended successfully and I am looking forward to being conferred with a H1 degree next month. My final project involved creating a new text classification method based on the bipartite relationship between words and documents but with, of course, a visual element, in that I have mapped the nodes of the bipartite graph onto two dimensions and then used QlikView to allow users to explore the model.


There is a published paper on the method that was presented at a recent conference in Zurich.

The important thing to note here is that this wasn't just a QlikView project. The model was built using Python and Spark, making use of the Databricks platform. As such, it is reflective of the direction of my interests over the last while - I still like using QlikView and Qlik Sense, but I have been working more and more on Big Data analytics, and Spark has been an important component of that.

I really like the the Big Data landscape right now - there are so many interesting things happening. I look forward especially to what is happening in visual analytics on Big Data. Companies such as Arcadia Data and Datameer are doing interesting thinks there. Qlik are, of course, working on a Big Data index, and that will be interesting to see when it comes out.

In the data science area, there are so many good desktop tools, but less options for working with the likes of Hadoop. I really like the new Cloudera Data Science Workbench in this regard, to allow teams of data professionals to work on code projects in a secure and governed way. I think that we will see other products making moves in this direction. For more 4GL type data processing, RapidMiner and Dataiku already work quite well with Hadoop. SAS has Hadoop connectivity, and some accuse them of having missed the Big Data boat, but they do have a forthcoming product called Viya that also promises to run directly on Big Data and the Cloud.

When I first started working with data, it was pretty much just SQL. Access was actually a pretty advanced data analysis tool, but was crippled with larger data sizes. When I look across the landscape now, it is hard not to be excited to see what will happen.


Stephen Redmond is a Data professional. He is author of Mastering QlikView, QlikView Server and Publisher and the QlikView for Developer's Cookbook
Follow me on Twitter   LinkedIn

1 comment: