Getting creative with holidays in dimensions

I was recently working with a client and saw an interesting approach to a classic problem dealing with holidays as a dimension.  Now maybe this is a common solution, but I hadn’t seen it before, so I thought I’d share.  This same solution could be used for similar problems as well.

The goal is to be able to analyze the impact of holidays on various measures.  Imagine you are analyzing sales or hours worked.  Knowing if a particular day is a holiday is pretty important to understanding spikes in the data.

A first approach might be to have the holiday as a property in the date dimension.  That would only work if the data you are dealing with has the same holiday for all data that points back to that particular date.  This isn’t even a true case for the United States, where some states have their own holidays, much less on a global scale.

So what this client did, was solve the problem at the ETL level.  For each fact, they check with the client calendar and see if the data is for a holiday or not and then set it in the fact table as a degenerate dimension.  You could have a separate dimension as well, but they decided to avoid the join that would get created with Mondrian.  Just make sure to index that column so you don’t do a table scan when grabbing the members.

A simple solution to what at first might appear to be a complex problem.  I like that.

 


Using the AnalyzerBusinessGroup annotation in Pentaho Analyzer

A quiet, maybe too quiet, new feature of Analyzer in Pentaho 4.8 was the addition of the AnalyzerBusinessGroup annotation.  This annotation will let you specify that a measure should go into a specific group rather than be lumped in with a bunch of measures.  If you have just a few measures, it’s not that big a deal.  But many users have a lot of measures that can be categorized and it would be nicer to have them in separate groups.  I have not tried this with dimensions, but if you define them correctly it seems that it would be overkill.  I also suspect that Mondrian 4’s Measure Groups will make this obsolete, but don’t know that for a fact.

Using the feature is very simple.  Just add an annotation to the measure and specify the AnalyzerBusinessGroup.  For example:

<Annotation name=”AnalyzerBusinessGroup”>Orders</Annotation>

and

<Annotation name=”AnalyzerBusinessGroup”>Prices</Annotation>

results in the following (using the Steel Wheels example):

Analyzer Groups


Mondrian in Action Discount!

What’s better than getting an early release copy of Mondrian in Action?  How about getting it for 1/2 off!  All the same Mondrian goodness by for only half the price.  Just hop over to the Manning site and and use the discount code dotd1101au.

Hurry, though.  This code is only good on November 1st from 12am to 11:59pm EDT.

And while you’re at it, go see what Julian Hyde, author of Mondrian and co-author of Mondrian in Action has to say.

Remember – 50% off November 1st only with code dotd1101au.


Getting session variables when using the PRD scriptable data source

In a previous post I was asked in the comments how to use a session variable as a parameter when using the scriptable data source in Pentaho Report Designer.  I finally figured it out and thought I’d share with everyone. The solution is relevant for scriptable data sources, not just those that use MongoDB, so I’ll leave that complexity out of this discussion.

Pre-conditions

Before this will work you need to have some way you are getting a session variable set that you want to use.  You can create an action sequence that runs when a user logs in or, if you are using single sign-on, you can set it during the log in process that way.  These are describe other places, so I won’t go into how to set the session variable.  For the sake of this example, lets assume you have somehow set a session variable for a user called “Region” that has the region that applies to the user, North, South, East, or West.

Create a Report Parameter

The first thing to do is to create a report parameter that will get the session value.  Then set the Name and Value Type as appropriate.  The key step is to set the Default Value Formula to =ENV(“session:Region”). The ENV function will get the session value for the attribute with the name “Region”.  You should also set the parameter to be Hidden by checking the box, although while testing it can be handy to have it unchecked.  Note that if you preview in report designer this will have no value (there are ways to set it), so a default value can be handy.  I don’t recommend deploying to production with a valid default, though.

The following figure shows getting the Region value from the session.

Image

Using the Parameter

Using the parameter from your script is simple.  The scriptable interface provides a dataRow object with a .get(String name) command to get the value.  So, to get the value of Region at run time use the following line (in Groovy):

def region = dataRow.get(“Region”)

Then just use the value in the script.


Mondrian and Pentaho Analytics – A book is in the works

One of the biggest challenges many new users to Pentaho have is finding good documentation.  A variety of documentation exists, but it’s scattered among various sources, such as the Pentaho Knowledge Base, the Mondrian site, the Pentaho wiki, and a variety of blogs.  While much of this content is good, wouldn’t it be nice to have one place to go to for all things Mondrian?  I’m happy to say that a book is on the way.

It’s still very early in the process, but Julian Hyde, and Nick Goodman, and myself are collaborating to put together a book that covers the nuts and bolts of Mondrian including:

  • Setting up and running Mondrian
  • Creating schemas
  • Scaling Mondrian
  • Security
  • Integrating Mondrian with applications and enterprise tools

The as yet to be named book will be available from Manning and we hope to have some early access versions available this summer.  Early feedback is always welcome so we can make this the best book possible.

If there are specific topics you’d like to see in the book, please mention them in the comments.


Using Memcached with Pentaho Analysis EE

Background

Pentaho 4.1 EE will ship with Mondrian 3.3 EE.  Among several nice features is the addition of distributed in-memory caching for performance.  Three approaches are available out of the box:  Native Pentaho caching, Infinispan, and Memcached.  The first two are integrated into the BI Server and can be easily configured to run, but it takes multiple BI Server instances to gain the benefits of a distributed cache.  Since many enterprises will be running a cluster of servers for load balancing, this may not be a problem.  Memcached, on the other hand, allows you to run multiple memory cache servers without the BI Server.  These are very easy to set up and run, making it a good solution, particularly in cases where performance of individual analysis queries is important, but the number of simultaneous users and load on the BI Server is low.
The Pentaho wiki (http://wiki.pentaho.com/display/analysis/Pentaho+Analysis+EE) provides a very good description of how to install and configure the Mondrian EE plugin, but doesn’t really describe how to configure memcache itself.  Rather than hunt through at least two sites, I want to give an overview and talk about a few possible problems. In particular, I’m going to describe running with memcached on Ubuntu memcached nodes.

Mondrian Caching

Mondrian has multiple levels of caches that are important to understand.  The wiki has a very good description of the various caches, so I’ll just summarize here.  Note that a segment is part of a query that can be reused for later queries.  For example, the aggregated sales for North America in the 1st Quarter.
  • Query cache – This cache holds the data for the current query being executed.  Once the query is done, the cache releases the data.
  • Local cache – This cache resides in the VM with Mondrian.  It uses weak links to the segments being stored.  When the garbage collector is run, these cached segments can be garbage collected, meaning the next query that could have used the cached data will have to go back to the database and perform all the calculations.  Experience has shown that segments in the local cache are kept around for 2-3 hours before they are usually GCed, but there is no guarantee.
  • External segment cache – This cache is the subject of this post.  The external cache keeps segments using one of the approaches mentioned above.  While these segments can eventually be removed to free up space for new segments, the fact that they can be made very large means that they will be kept around for longer and they are not automatically garbage collected.

Caching with Memcached

Memcached (http://memcached.org/) is a simple, in-memory, distributed cache that stores key-value pairs.  The mondrian application calls memcached nodes and stores the data by providing a key and a serialized data value.  In the case with Mondrian, this is the serialized segment to be stored.  The key in Mondrian is generated automatically.  There are some implications and drawbacks to this approach when running with multiple servers that will be touched on below.  The two major advantages to using memcached are that you can run stand-alone memcache servers and that these servers can be fairly low power.  The servers only need to respond to network calls and return data from memory so they don’t need exceptionally fast processors or large hard drives.  Lots of RAM is what is really desired.

Configuration of Pentaho Analysis EE

Configuration of Pentaho Analysis EE is very simple.  There are two files to configure and both are described on the wiki.  Both files reside under the tomcat/webapps/pentaho/classes/ directory.
The first is pentaho-analysis-config.xml.  The key changes are to set USE_SEGMENT_CACHE and DISABLE_LOCAL_SEGMENT_CACHE to true.  This enables the external segment cache and turns off local caching.  The last is to specify the SEGMENT_CACHE_IMPL by providing the name of the class to use.  The configuration file has the three standard types.  Simply comment out the two not needed and uncomment the one you want to use.  As an aside, there is an API that you can use to provide your own cache, but I’m not going to cover that in this post.  If you were to create a different cache, simply implement the required interfaces and specify your class in the SEGMENT_CACHE_IMPL.
The second file, when using memcached, is memcached-config.xml.  There are three settings to modify.  The first is the SALT setting, which is simply a string that gets added to all keys.  This can be useful if multiple instances of the BI Server are running, such as development and QA, and they are using the same memcached nodes.  Simply provide different SALT values and the data is segmented between the BI Servers.  The second value is the SERVERS, which is simply a set of the memcached servers separated by a comma.  Finally, there are the WEIGHTS, which give the relative size of each server based on the RAM available.  This is used to allocate storage so one server isn’t constantly receiving too much data.  You will have to understand the configuration of each of the other servers to properly set these values.
Whenever these files are updated, the BI Server will need to be restarted.

Installing and Configuring Memcached on Ubuntu

This section describes how to install and configure memcached on an Ubuntu server.  Other installs are similar and the documentation on the Memcached site and wiki provided more detail.
First, you will need to have a static IP address for your memcached nodes or some way of consistently referring to them from the mondrian configuration.  This can be done by modifying the IP of your Ubuntu system to have a static IP.  I won’t cover how to do so here, since there are many examples on the web.
The easiest, and recommended, approach for installing memcached is to use the Ubuntu package.  At the command line enter:
$ sudo apt-get install memcached
Enter the administrator password and the install will run.  There may be some prompts, but the defaults work fine.  This script will create a memcache user and install memcached in the /etc/init.d folder to start as a deamon on system startup.  It does not automatically start it, however, so I recommend testing and configuring prior to restarting the server.
There are two files of significance that you will need to know about.  The first is the configuration file for memcached and the second is the log file.  By default the log file is /var/log/memcached.log.  This file is useful when running the deamon to see what’s going on.  The second file, and the one to modify, is the /etc/memcached.config file.
There are four main settings that you may want to change:
  1. -m <number> This setting tells how much RAM to use for memcached.  This is the max that it will use before dropping data.  Memcached will use more than this, so you don’t want it to use all of the RAM.  However, if the node isn’t doing much else, then I’d use most of the available RAM.
  2. -l <IP address> By default this is set to 127.0.0.1.  The memcached comment indicates that this means it will listen on all IP addresses. However, I was unable to get it to work without changing this value to the static IP of the machine.  Forum posts on various locations also indicated that use of the localhost address would cause memcached to not listen to external requests.
  3. -p <port> The default port is 11211.   There is no reason to change it unless you are running multiple memcached instances on the same machine or some other process is using this port.
  4. -v, -vv, -vvv These flags cause memcached to be verbose by increasing levels.  These are commented out.  In a production system I would leave them commented out, at least after initial use, since they could cause the logs to get quite large.  However, I like using them during initial setup and test.  They indicate that there is activity, which can be used to confirm that Mondrian is properly configured.

That’s really all there is to configuring memcached.  Restart the machines and use “$ ps aux” to verify that memcached is running.  You can also use “$ netstat -l” to make sure the process is listening on the correct port.

Additional Considerations

Using a local cache with only one server

The recommendation to turn off the local cache when running with an extended cache is because the extended cache could be cleared by a different node because the underlying data has been cleared.  If the local cache is used, then it’s possible that it has old data.  In this case all server nodes would need to be cleared.  However, if there is only a single BI Server node, then it might be advantageous to leave the local cache turned on.  This would avoid the network delay of accessing the extended cache when the segment data is stored locally.

Network bottlenecks

Request to external memcached nodes send and retrieve data across the network.  While these are typically small amounts of data, on a busy or slow network they can add to delay.  If you expect a lot of traffic be sure to have a fast network.

Multiple BI Servers with Memcache

Multiple servers can be configured to use Memcache.  There is not even a requirement that each of the servers be configured to use the same set of memcache servers.  If, for some reason, the servers were configured to use different extended caches, or had local caching turned on, then each individual severed would need to have the cache flushed.

Reloading Data and Invalidating the Cache

When new data is entered into the data mart via the ETL process, Mondrian does not automatically know the data is no longer valid.  With local caching this disconnect would eventually be resolved automatically (although not deterministically) through the garbage collection process.  However, with memcached extended caching, the segments can stay around indefinitely.  This means that an approach, ideally automated, needs to be available to flush the caches when the data mart is updated.  Two (or more?) approaches can be used to solve this problem.  Either implement a DataSourceChangeListener or use the SPIs to call Mondrian and invalidate the cache.  This capability will be the subject of a future post.

Timeout Errors when Running Analysis

When a configured memcache node is not available, pretty bad things start to happen.  The most noticeable feature is that the user gets a timeout exception on their report.  This generally indicates that a memcache instance has failed or there is a configuration problem.  Check the catalina.out files for details in order to troubleshoot.