Getting creative with holidays in dimensions

I was recently working with a client and saw an interesting approach to a classic problem dealing with holidays as a dimension.  Now maybe this is a common solution, but I hadn’t seen it before, so I thought I’d share.  This same solution could be used for similar problems as well.

The goal is to be able to analyze the impact of holidays on various measures.  Imagine you are analyzing sales or hours worked.  Knowing if a particular day is a holiday is pretty important to understanding spikes in the data.

A first approach might be to have the holiday as a property in the date dimension.  That would only work if the data you are dealing with has the same holiday for all data that points back to that particular date.  This isn’t even a true case for the United States, where some states have their own holidays, much less on a global scale.

So what this client did, was solve the problem at the ETL level.  For each fact, they check with the client calendar and see if the data is for a holiday or not and then set it in the fact table as a degenerate dimension.  You could have a separate dimension as well, but they decided to avoid the join that would get created with Mondrian.  Just make sure to index that column so you don’t do a table scan when grabbing the members.

A simple solution to what at first might appear to be a complex problem.  I like that.

 


Using the AnalyzerBusinessGroup annotation in Pentaho Analyzer

A quiet, maybe too quiet, new feature of Analyzer in Pentaho 4.8 was the addition of the AnalyzerBusinessGroup annotation.  This annotation will let you specify that a measure should go into a specific group rather than be lumped in with a bunch of measures.  If you have just a few measures, it’s not that big a deal.  But many users have a lot of measures that can be categorized and it would be nicer to have them in separate groups.  I have not tried this with dimensions, but if you define them correctly it seems that it would be overkill.  I also suspect that Mondrian 4′s Measure Groups will make this obsolete, but don’t know that for a fact.

Using the feature is very simple.  Just add an annotation to the measure and specify the AnalyzerBusinessGroup.  For example:

<Annotation name=”AnalyzerBusinessGroup”>Orders</Annotation>

and

<Annotation name=”AnalyzerBusinessGroup”>Prices</Annotation>

results in the following (using the Steel Wheels example):

Analyzer Groups


Getting the PostgreSQL installer out of quarantine on OS X Mountain Lion

In a noble attempt to start playing around with Pentaho 5.0 (aka Sugar), I downloaded the CI version and PostgreSQL, since it’s replaced MySQL as of 4.8 as the default database for the repository.  After dutifully reading the README file and changing some shared memory settings (don’t skip this step), rebooting, and remounting the install image, I kicked off the install app.  Nothing.  No message, no popup, NOTHING.  Except for a not so helpful message in the Console that indicated one of the install files was in quarantine because of TextEdit.

After a bit of searching, I found a helpful site that told me how to turn off quarantine (search for “quarantine”).  It turns out there is a handy bash command to turn off the quarantine.  “$ xattr -r -d com.apple.quarantine <file-name>”.  The ‘-r’ recurses if this is a file.  So, I cd’d to the /Volumes/PostgreSQL 9.2.2-1 and ran “$ xattr -r -d com.apple.quarantine postgresql-9.2.2-1-osx.app” since a .app file is really a directory.  And …. it failed because it’s a mounted disk without write permissions.

That should have been obvious to me, but it’s late.  So I copied the .app file to ~/Downloads and ran the command again.  After a quick return I ran the install app with no further problems.

Now to do all the cool stuff I had originally planned before getting sidetracked.


Mondrian in Action Discount!

What’s better than getting an early release copy of Mondrian in Action?  How about getting it for 1/2 off!  All the same Mondrian goodness by for only half the price.  Just hop over to the Manning site and and use the discount code dotd1101au.

Hurry, though.  This code is only good on November 1st from 12am to 11:59pm EDT.

And while you’re at it, go see what Julian Hyde, author of Mondrian and co-author of Mondrian in Action has to say.

Remember – 50% off November 1st only with code dotd1101au.


Changing the JVM Memory Settings for Pentaho Design Tools on OS X

Background

Pentaho ships a number of design tools, such as Pentaho Report Designer, Pentaho Data Integration, and others for Linux, Windows, and OS X.  While the apps appear to be native, they are Java applications that have been wrapped to run in the native environment.

One of the key settings that users often want to change are the JVM memory settings.  These settings indicate how much memory a process should start with and, more importantly, the maximum amount it should use.  If you allocate too little then it’s possible to get OutOfMemory exceptions and the application stops functioning.  This post will show you how to modify the settings.

Finding the File to Modify

The file we want to modify is called Info.plist.  This file contains settings for the application that is being executed and is a standard file in an OS X application. OS X bundles applications into an Application bundle with an extension of .app.  This is really a directory, but FInder and other tools make it appear as if it’s a single file.

There are two ways to get access to the file.  The first is to use Finder.  Find the application file in the design-tools and sub-folders under Pentaho or wherever you installed them.  The application will show up as a single file.  Now right-click on the file and select “Show Package Contents”.  This will open a new view in Finder with all of the contents of the file.

Once you are viewing the contents of the package, select the Contents folder and you should see the Info.plist file.  This is an XML file, so you can edit with any text editor.  If you double-click it will attempt to use XCode if you have it installed.  I recommend using a different text editor, such as TextMate or textedit.

Right click on the Info.plist file and select Open With… and then choose the editor of your choice.  You may have to select Other… and then choose the app if the one you want isn’t displayed.  The file should open up in your text editor.

The alternative to Finder, is to use the Terminal.  This is my preferred approach, but it should only be used by those who are comfortable working from the command line.  You will need a text editor that you are also comfortable using, such as vi or emacs.  I personally prefer vi.

Navigate to the same folder.  The Terminal doesn’t treat the .app file as anything other than a folder, so basic usage of the cd command is enough.  Once you are in the proper directory, edit the file.

Changing the File

Now that you have the file open in an editor, you can make the needed change.  Near the bottom of the file you should find a declaration similar to the following:

<key>VMOptions</key>
<string> -Xms512m -Xmx1024m</string>

These are the settings that will get passed to the JVM when the application starts.  The -Xms setting is the amount of memory used on started.  The -Xmx is the maximum amount to use before errors start happening.   In this example I have 512MB on startup and a maximum of 1024MB (1GB) of memory maximum.  Note that these are increase from the default that I started with.

It’s customary to specify the memory in megabytes, using the 256m where m is megabytes.  It is also customary, although not required, to have the memory e in exponentials of 2, e.g. 256, 512, 1024, etc.  The mx value should always be larger than the ms value.  I personally make it twice as large.

Once you’ve set the values you want, save the file and then restart the application if it is running.  Assuming you increased the values, you should now have more memory.


Using MongoDB with Community Data Access (CDA)

I’ve covered how to get data directly from MongoDB directly into Pentaho Reports.  But what if you want to get them in a dashboard built with Community Dashboard Framework (CDF)?  That’s actually about as simple as getting it into a Pentaho report.

CDF comes with a related technology known as Community Data Access (CDA).  In general, CDA is a technology that abstracts access to data and provides a consistent approach to data access for CDF dashboards.  Among it’s other nice features are that it supports a wide variety of data sources including scripting.  Currently the scripting data source only supports Beanshell, but that’s good enough for communicating with MongoDB.

I’ve used the same database and collection from the reporting example, so you can find the source there.  I won’t go into all of the details of the CDA since you can find those at the WebDetails site.  The significant pieces are that you need to have a Connection of type “scripting.scripting” and then a DataAccess of type “scripting”.  As you can see, the query script is essentially the same as the one for reporting except that it’s in the Beanshell syntax.

<?xml version="1.0" encoding="UTF-8"?>
<CDADescriptor>
 <DataSources>
 <Connection id="mongodb" type="scripting.scripting">
 <Language>beanshell</Language>
 <InitScript/>
 </Connection>
 </DataSources>
 <DataAccess access="public" cache="true" cacheDuration="3600" connection="mongodb" id="mongodb-sales" type="scriptable">
   <Name>Sales via MongoDB</Name>
   <Columns/>
   <Parameters/>
   <Query><![CDATA[
import com.mongodb.*;
import org.pentaho.reporting.engine.classic.core.util.TypedTableModel;
Mongo mongo = new Mongo();
db = mongo.getDB("pentaho");
sales = db.getCollection("sales");
String [] columnNames = {"Region", "Year", "Q1", "Q2", "Q3", "Q4"};
Class[] columnTypes = {String.class, Integer.class, Integer.class, Integer.class, Integer.class, Integer.class};
TypedTableModel model = new TypedTableModel(columnNames, columnTypes);
docs= sales.find();
while (docs.hasNext()) {
 doc = docs.next();
 model.addRow(new Object[] { doc.get("region"), doc.get("year"), doc.get("q1"), doc.get("q2"), doc.get("q3"), doc.get("q4")});
}
docs.close();
return model;
 ]]></Query>
 </DataAccess>
</CDADescriptor>

Now that the CDA data access has been defined you can use the CDA previewer to view the results.  Assuming you've used the same data you should see something similar to the following:

Example of the data from MongoDB via CDA.

At this point the data is available for any dashboard that has permission to access the data.

 


Getting session variables when using the PRD scriptable data source

In a previous post I was asked in the comments how to use a session variable as a parameter when using the scriptable data source in Pentaho Report Designer.  I finally figured it out and thought I’d share with everyone. The solution is relevant for scriptable data sources, not just those that use MongoDB, so I’ll leave that complexity out of this discussion.

Pre-conditions

Before this will work you need to have some way you are getting a session variable set that you want to use.  You can create an action sequence that runs when a user logs in or, if you are using single sign-on, you can set it during the log in process that way.  These are describe other places, so I won’t go into how to set the session variable.  For the sake of this example, lets assume you have somehow set a session variable for a user called “Region” that has the region that applies to the user, North, South, East, or West.

Create a Report Parameter

The first thing to do is to create a report parameter that will get the session value.  Then set the Name and Value Type as appropriate.  The key step is to set the Default Value Formula to =ENV(“session:Region”). The ENV function will get the session value for the attribute with the name “Region”.  You should also set the parameter to be Hidden by checking the box, although while testing it can be handy to have it unchecked.  Note that if you preview in report designer this will have no value (there are ways to set it), so a default value can be handy.  I don’t recommend deploying to production with a valid default, though.

The following figure shows getting the Region value from the session.

Image

Using the Parameter

Using the parameter from your script is simple.  The scriptable interface provides a dataRow object with a .get(String name) command to get the value.  So, to get the value of Region at run time use the following line (in Groovy):

def region = dataRow.get(“Region”)

Then just use the value in the script.


Adding Parameters to the Pentaho MongoDB Report

In the previous post, I showed how to create a Pentaho report with MongoDB as a data source using Groovy Scripting.  That’s pretty handy, but a lot of users want to be able to have parameters they can use to filter the date.  Adding parameters is pretty easy, but not well documented, so here’s how it’s done.  These instructions assume you are fairly familiar with Pentaho Report Designer.  If you need the data file and instruction on setting up the data, see the previous post.

To make it easy for the user, let’s create a query to use for the parameter values.  You could do another MongoDB query, but I’ll just use a table to keep things simple.  Under the data source tab add a new data source of type Table, create a query, and then add values for each of the regions: North, South, East West.

Regions for the parameter values.

Next, create a new parameter named Region that the user will select the Region to filter on.  The setting should look like the following:

Region parameter settings

Now you’re ready to use the parameters in your query.  Since the query is a Groovy script you have to use a special call rather than the traditional ${parameter} approach that you might be used to in SQL queries.  Specifically you will use the rowData.get("parameter-name"). The following is the new version of the query:

import com.mongodb.*
import org.pentaho.reporting.engine.classic.core.util.TypedTableModel

def mongo = new Mongo()
def db = mongo.getDB("pentaho")
def sales = db.getCollection("sales")

def columnNames = ["Region", "Year", "Q1", "Q2", "Q3", "Q4"] as String[]
def columnTypes = [String.class, Integer.class, Integer.class, Integer.class, Integer.class, Integer.class] as Class[]

TypedTableModel model = new TypedTableModel(columnNames, columnTypes)

def docs = sales.find(new BasicDBObject("region", dataRow.get("Region")))
while (docs.hasNext()) {
def doc = docs.next()
model.addRow([ doc.get("region"), doc.get("year"), doc.get("q1"), doc.get("q2"), doc.get("q3"), doc.get("q4") ] as Object[])
}
docs.close()

model

Download the report description to see all the details. When run it looks like the following:

Example report with queries.


Creating Pentaho Reports from MongoDB


So you’ve made the move and started using MongoDB to store unstructured data.  Now your users want to create reports on the MongoDB databases and collections.  One approach is to use a Kettle transformation that retrieves data from MongoDB for reports.  This approach is documented on the Pentaho Wiki.  However, I want to use the MongoDB database directly without dealing with Spoon and Kettle transformations.  Fortunately Pentaho Reporting also supports scripting with Groovy built in.  This tutorial will show you how to create a report against MongoDB data using the Java drivers and Groovy scripting.

Pre-Conditions
You should already have mongodb installed and accessible.  I’m running on the same machine with the default settings, so vary the code as needed for your configuration.
You also need to put the mongo-java-driver-2.7.2.jar file in the libraries for Report Designer and the BA Server
$PENTAHO_HOME/design-tools/Pentaho Report Designer.app/lib/
$PENTAHO_HOME/server/biserver-ee/tomcat/webapps/pentaho/WEB-INF/lib
Restart the app and BA Server if they are running to pick up the new .jar files.
Setting Up
The first thing you need is some data.  I've created an input file of sales by region and year to use as an example.  Download and import the data using the mongoimport command:
> mongoimport -d pentaho -c sales data.json
Verify that the data has been successfully imported by opening the mongo shell and using the following commands:
> use pentaho
> db.sales.find();
You should see a list of documents that were added.
Creating the Report
  1. Using Pentaho Report Designer, create a new report.
  2. Add a data source and choose Advanced -> Scriptable
  3. Select groovy as the language and click the (+) for a new query
  4. Enter the following code as the script

import com.mongodb.*
import org.pentaho.reporting.engine.classic.core.util.TypedTableModel
def mongo = new Mongo()
def db = mongo.getDB("pentaho")
def sales = db.getCollection("sales")

def columnNames = ["Region", "Year", "Q1", "Q2", "Q3", "Q4"] as String[]
def columnTypes = [String.class, Integer.class, Integer.class, Integer.class, Integer.class, Integer.class] as Class[]

TypedTableModel model = new TypedTableModel(columnNames, columnTypes)

def docs = sales.find()
while (docs.hasNext()) {
def doc = docs.next()
model.addRow([ doc.get("region"), doc.get("year"), doc.get("q1"), doc.get("q2"), doc.get("q3"), doc.get("q4") ] as Object[])
}
docs.close()

model

This will read the data from MongoDB and return the table model needed by the reporting engine.
From here it's just standard report generation and publishing, which is described in the Pentaho documentation.
Go ahead and download the report definition.  You'll need to have mongod running and on the default (or change your script).  The results look something like the following:

Sales by Region from MongoDB


Using a Textbox Parameter for Multi-Value Selections in Pentaho Report Designer

Pentaho Report Designer allows you to create parameters that users can use to filter the data.  Usually this is done by providing buttons or lists or other standard UI types to select the values.  However, there are times when the list of possible values is quite long, making a selection list to long to be feasible.  In these cases it would be ideal to provide a simple text box and then enter the list of matching values to use.  Unfortunately a text box is associated with a single value and not a list.  But not to fear, there is a straightforward solution using a hidden parameter and a post-processing formula.

Lets say you want to look at information from an HR database that contains 1000 employees with dozens of records each an you only want the information for a few employees.  So you create a report with the query:
 
select * from employees where employee_id in (${empid});
You then create a parameter called empid with a text box and type of string and run the report.  Typing in 1002, 1005, 1007 you only get data back on the employee with number 1002.
To fix this problem, create a second parameter that we'll call empid_array.  This parameter has a type of Object and a Post-Processing Formula of =CSVARRAY([empid];0;",";).  Check the box that makes the parameter hidden so that it won't show up on forms.
This formula says to convert the parameter named empid to an array using a comma as a separator and don't quote.
Now modify your query to use the empid_array parameter:
 
select * from employees where employee_id in (${empid_array});
Now run your report and enter 1002, 1005, 1007 and you should see all records for all of the employees.

Follow

Get every new post delivered to your Inbox.

Join 295 other followers