Sunday 6 November 2011

Denormalize for Performance

It was gratifying for me to find out, after many QlikView implementations, that most of what I was doing was close to best practice in Dimensional Modelling.  I thought that I was mostly just building key tables to fix data connection issues and the denormalizing to clean the data structure and improve performance.

If you are unfamiliar with Dimensional Modelling, it has been around for a very long time - as long as QlikView!  Ralph Kimball, one of the early data warehouse pioneers, has literally written the book on the subject and has many great articles on his website:

If you are not familiar with relational database theory, the idea of normalization is to remove redundancy.  For example, if I have a simple product table like this:

Product Table

The category name of "Bings" is repeated several times.  In a relational database this value is actually stored several times.  From a storage point of view, it is more efficient to split the product table into a product and a category table like this:

Product Table

Category Table

In this case, the numeric ID "1" is stored several times but that takes up a lot less space than the word "Bings" which is now only stored once.  It also means that if I need to change the name of that category, I only need to change it in one location, thus improving my data consistency.

This system has been implemented widely.  Some architectures that you might come across, such as SAP or Oracle Financials, have thousands of tables, all perfectly crafted to store the data most efficiently in the relational database.  Normalization works great for transactional databases like this.

For designers, the problem occurred when people started reading data out of the transactional system, especially for reporting.  Because the queries now had to hop across several joins, the performance suffers immensely.  So, they started either undoing the normalisation process (denormalising) or, worse, keeping the normalised tables but duplicating the data into other tables so that queries ran quicker - a nightmare for data consistency.  Both approaches lead to an increase in the data storage requirements meaning that databases bloat more an more - but we can just throw hardware at that issue, right?

So, what is the best thing to do in QlikView?  A colleague of mine who was recently trained asked me about this because of the exercise that was to move the category table's data into the product table, hence denormalizing.  "Why?", he asked.  "Doesn't QlikView work just as well either way?".

There are a couple of reasons as to why you might do this.  And a couple of reasons why you might not.

Reasons for:

In one of his articles, Kimball suggests that such "snow-flaking" (leaving the category out of the product table) "compromises cross-attribute browsing performance".  Not only that, it "may interfere with the legibility of the database".

So that gives us 2 reasons:

- Performance
- Improved schema readability

And I will add a 3rd:

- Data size.

In the QlikView underlying data cloud, there will be an association between the product fields and the CategoryID field.  There will also be an association between the CategoryID field and the CategoryName field.  If I move the CategoryName into the product table (either map or join), I can then remove the CategoryID field and hence remove a redundant set of associations.  The CategoryName is not duplicated in QlikView because the columnar database still only stores it once.

Reasons against:

- Performance
- Improved schema readability
- Data size

I'm sure that you will be thinking that there is some sort of error here.  How can I have the same set of reasons "against" as I did "for"?

It depends!  It depends really on the number of records in each table and the number of fields.  For a relationship like product and category, there is probably only a few fields in the category table and it can really make sense to denormalise (it almost always make sense to do this for a table with only 2 fields - using ApplyMap).  However, with a couple of tables like, say, Order and Calendar, where Order may have millions of records and Calendar could have ten or more fields, it makes sense to keep them apart and just have the key association on the DateID field.  If you did join them in, it would mean that you could drop the DateID field but you would then have associations maintained between each field in the Order table and Year, and Month, and Day, and Week, etc., etc.  Many more associations being maintained than before and increasing your data size and possibly reducing your performance.  It may also become difficult to see the date fields in the schema whereas they are plain to see in their own Calendar table.

Stephen Redmond is CTO of CapricornVentis a QlikView Elite Partner

No comments:

Post a Comment

Note: only a member of this blog may post a comment.