Sunday, March 29, 2009

Don’t Overstate Data

 

One of the tricky things for users of relational databases is forming queries without overstating (or understating) results as a consequence of table joins.

Take this example that comes from an OBIEE user who had database tables containing information about opportunities and attachments, both of which related to accounts. The OppAcct table looked like this, showing that each account had multiple opportunities.

clip_image002

The AcctAttachment table looked like this, showing that each account had multiple attachments.

clip_image004
This user wanted to be able to report on the data from these tables showing opportunities, attachments, and amounts for each account. Results would look like this for the BP account, achieved by a query that joined these two tables together on Account.

clip_image006
The user thought these results were OK and proceeded to model this data in OBIEE. Initial results looked good – just like her SQL query.

clip_image008

However, when she removed Attachment and OpptyID columns from the query, Answers overstated results by a factor of 2.

clip_image010

clip_image012

What's gone wrong here? There are two possible sources of the problem: the data modeling in the database and the metadata modeling in OBIEE. We will look at both.

In the database model, there isn't a source of the distinct set of accounts. Someone familiar with dimensional modeling would phrase it differently: there isn't an account dimension table. Using the AccountAttachment table as a source of accounts leads to overstatement because each account may have several entries in this table. BP, in fact, has two, which leads to the doubling of the amounts for each opportunity when this table joins to the OppAcct table.

The user made these tables sources in her OBIEE business model (the “middle layer” in the OBIEE administration tool). She knew that an OBIEE business model requires, at a minimum, two logical tables (a logical dimension table and a logical fact table). Since the only aggregatable column is Amount, OppAccount was the best choice for the logical fact table source. She made AcctAttachment the logical dimension table source, since it had no aggregatable columns. She was following basic principles correctly.

Dimension table sources need to join to fact table sources. She created a physical join on AccountAttachment.Account = OppAcct.Account. Making this a key/foreign key join involved “lying” to the OBIEE administration tool about what the key of the AccountAttachment table is. It isn't Account, since the same Account exists multiple times in the table. And lying about dimension table keys is a good indication that trouble lies ahead.

clip_image014

In the business model, the administration tool requires that every logical dimension table has a logical key. This was another opportunity to lie, and lies here are another portent trouble.

clip_image016

Suppose she didn't lie here specified that the logical table key consists of Account plus Attachment. When doing a consistency check, she would have seen this Warning.

clip_image018
Obviously, the two tables do join. So the warning is misleading. What the administration tool is really warning you about is that the columns defined as the logical key do not match the columns in the physical join between these sources.

One way to solve this problem is to introduce a source of distinct accounts. You could do this in the database, and of course that's the best way, but you could also (for relatively small data sets) introduce a SQL-based table in the physical layer of the metadata.

clip_image020

You could join this “table” to the other two data sources – and now there would be no reason to lie about joins or keys.

clip_image022

The SQL table now is usable as a source in the business model.

clip_image024

Now aggregations of Amount will be correct, as the BI Server will drop the AcctAttachment table out of the physical query when it is not needed.

clip_image028 clip_image029 clip_image030

This is a very simple example, of course, but it illustrates some fundamental principles for successful OBIEE projects. First of all, develop a dimensionally sound physical data model. Time and effort expended in developing a data model for BI is a sound investment. Whenever an OBIEE project seems to get overly complex, it is a good indication that the underlying data model is not optimal. Secondly, be careful to tell the truth to the administration tool. The admin tool may allow you to get away with it, and some queries may even be correct. But more often than not, trouble lies ahead.