Posted: December 7, 2013 Filed under: Analytics, Business Intelligence, Dimensions, Tips | Tags: analysis, date, Degenerate dimension, dimension, Dimension table, Extract transform load, Fact table
I was recently working with a client and saw an interesting approach to a classic problem dealing with holidays as a dimension. Now maybe this is a common solution, but I hadn’t seen it before, so I thought I’d share. This same solution could be used for similar problems as well.
The goal is to be able to analyze the impact of holidays on various measures. Imagine you are analyzing sales or hours worked. Knowing if a particular day is a holiday is pretty important to understanding spikes in the data.
A first approach might be to have the holiday as a property in the date dimension. That would only work if the data you are dealing with has the same holiday for all data that points back to that particular date. This isn’t even a true case for the United States, where some states have their own holidays, much less on a global scale.
So what this client did, was solve the problem at the ETL level. For each fact, they check with the client calendar and see if the data is for a holiday or not and then set it in the fact table as a degenerate dimension. You could have a separate dimension as well, but they decided to avoid the join that would get created with Mondrian. Just make sure to index that column so you don’t do a table scan when grabbing the members.
A simple solution to what at first might appear to be a complex problem. I like that.
Posted: September 27, 2011 Filed under: Business Intelligence, Data Warehousing, Dimensions
Date and time are probably the most common dimensions in business analytics. Business users may want to know sales by quarter or month to monitor growth. Web sites can use these dimensions to see when they get the most traffic to properly scale their server farm. Restaurants can use day and time information for staffing plans. All of these and many other uses require dates and, possibly, time information.
First, a couple of definitions for BI
newbies. A measure is a fact about which the user wants to know information. Facts are stored in a fact table
with each row in the table representing an individual fact. Examples include sales amount, number of visitors to a web site, or number of patrons that visit a restaurant. A dimension is additional information related to the fact that can be use for analysis. Dimensions are stored in separate tables with columns, called attributes, for each of the various combinations of dimensions. Examples include quarter of the sales, day of the week for visits to a web site, or time of day that customers visit a restaurant. A fact will contain foreign keys to all of the dimensions related to it.
Many authors will refer to the time dimension when they are talking about dates. I prefer to separate time and date and talk about each independently. Dates are individual days whereas time is down the the level needed for analysis, typically the minute. While they could be defined in a single dimension, this would lead to extraordinarily large dimension tables. Splitting the two make much small tables. Imagine a date dimension with days for 10 years. This would have approximately 3650 rows, one for each day. A time table with one row for each minute, would have 1440 rows. If these two were combined the resulting table would have over 5 million records!! And ten years can be a small date dimension.
The reason for a special dimension rather than storing dates as a field in the fact table, is two-fold. First, it enables analysis to be performed at different levels, such as year, quarter, day, month, etc. Second, it avoids the need to perform complex date arithmetic, such as year to day, quarter vs. previous quarter, etc. Most BI tools, when properly configured, have support to handle dates elegantly.
Date Dimension Fields
A date dimension will likely have the following fields at a minimum:
- Month Number
- Month Name
- Day of Week Number
- Day of Week Name
The Id is the key for the date table. This is typically just a sequential integer for each record. So, why not just store the date in the fact? Because splitting the date in this way provides support for online analytical processing (OLAP
) where data can be rolled up by week, month, quarter, year, etc. More detail will be provided on this in the future, but suffice to say that it is very difficult to perform analysis on a date vs. splitting the date in this manner.
The reason there is a name and number for days and months is that the number is used for ordering and the name for display. With only these few values you can perform some interesting analysis. However, most companies will have many more fields that are important for their specific analysis.
- Fiscal Year
- Weekend or Weekday
- Special Event
The types of fields you use are highly dependent on the data needs of the users.
Time Dimension Fields
A time dimension will likely have the following fields at a minimum:
The Id for time is used similarly to the date Id. It allows data to be looked at by minute, or rolled up to hours.
Occasionally it might be useful to go down to the second level, such as for web tracking, etc. where there are many fine-grained records, but minute is usually low enough. Other interesting time dimensions are:
- AM or PM
- Time of Day (morning, noon, night)
- Meal time (for restaurants)
The meaning of many of these will vary by the user. For example, different meal times will vary by restaurant type.
Finally, it can be useful to “bucket” times by every ten minutes, hour, multi-hour, etc. So, when entering data with time, the time is put into the hour of the day. This can save on storage and processing speed if that’s the lowest granularity desired.
Drawbacks and Alternatives
The major drawback to the approach described above is that a user can do analysis based on the time and analysis based on the day, but not both. That is to say, a user can find out how busy a web site is by day and how many visitors visit on Friday vs. Monday, but rolling up by day and hour is not possible without a lot of additional effort. However, creating reports based on multiple dimensions is part of what BI tools excel at, so most useful combinations are still possible.
The alternative is to create a single dimension including date and time. However, many questions really aren’t meaningful, i.e. it’s usually most useful to know when things happen during the day or on what day, but not both. The biggest drawback is that if the dimensions are pre-populated, they can be very large. One mitigation approach is to populate the date-time dimension as facts are created meaning the dimension table
will be no larger than the fact table, and likely smaller. If combining date and time into a single dimension it is useful to carefully consider how detailed the data really needs to be and creating a bucket that reduces the number of time entries needed.
A fact in the fact table would then reference the date and time dimension. For example, assuming a web site was tracking date and time of visits to the site to the minute level the fact might have the following info:
- fromURL – URL the user visited from.
- landingPage – URL of the page the user visits
- timeId – ID reference to the time dimension
- dateId – ID reference to the date dimension
When populating the table it is necessary to translate the date to a date id and the time to a time id, which is easily supported by most ETL tools. This allows the data to be aggregated in many useful and interesting ways.
Time, and especially date, are very important dimensions in most data warehouses. Rather than storing the time or date as part of the fact, the user can query and aggregate at various levels depending on the information to be created. The specific fields and interpretation of the fields depend on the intended domain and usage.
A future post will talk about how to turn the date and time into hierarchies for analysis.