Using Memcached with Pentaho Analysis EE
Posted: October 11, 2011 Filed under: Analytics, Business Intelligence
4.1 EE will ship with Mondrian 3.3 EE. Among several nice features is the addition of distributed in-memory caching for performance. Three approaches are available out of the box: Native Pentaho caching, Infinispan, and Memcached
. The first two are integrated into the BI Server and can be easily configured to run, but it takes multiple BI Server instances to gain the benefits of a distributed cache. Since many enterprises will be running a cluster of servers for load balancing, this may not be a problem. Memcached, on the other hand, allows you to run multiple memory cache servers without the BI Server. These are very easy to set up and run, making it a good solution, particularly in cases where performance of individual analysis queries is important, but the number of simultaneous users and load on the BI Server is low.
The Pentaho wiki (http://wiki.pentaho.com/display/analysis/Pentaho+Analysis+EE
) provides a very good description of how to install and configure the Mondrian EE plugin, but doesn’t really describe how to configure memcache itself. Rather than hunt through at least two sites, I want to give an overview and talk about a few possible problems. In particular, I’m going to describe running with memcached on Ubuntu memcached nodes.
Mondrian has multiple levels of caches that are important to understand. The wiki has a very good description of the various caches, so I’ll just summarize here. Note that a segment is part of a query that can be reused for later queries. For example, the aggregated sales for North America in the 1st Quarter.
- Query cache – This cache holds the data for the current query being executed. Once the query is done, the cache releases the data.
- Local cache – This cache resides in the VM with Mondrian. It uses weak links to the segments being stored. When the garbage collector is run, these cached segments can be garbage collected, meaning the next query that could have used the cached data will have to go back to the database and perform all the calculations. Experience has shown that segments in the local cache are kept around for 2-3 hours before they are usually GCed, but there is no guarantee.
- External segment cache – This cache is the subject of this post. The external cache keeps segments using one of the approaches mentioned above. While these segments can eventually be removed to free up space for new segments, the fact that they can be made very large means that they will be kept around for longer and they are not automatically garbage collected.
Caching with Memcached
) is a simple, in-memory, distributed cache that stores key-value pairs. The mondrian application calls memcached nodes and stores the data by providing a key and a serialized data value. In the case with Mondrian, this is the serialized segment to be stored. The key in Mondrian is generated automatically. There are some implications and drawbacks to this approach when running with multiple servers that will be touched on below. The two major advantages to using memcached are that you can run stand-alone memcache servers and that these servers can be fairly low power. The servers only need to respond to network calls and return data from memory so they don’t need exceptionally fast processors or large hard drives. Lots of RAM is what is really desired.
Configuration of Pentaho Analysis EE
Configuration of Pentaho Analysis EE is very simple. There are two files to configure and both are described on the wiki. Both files reside under the tomcat/webapps/pentaho/classes/ directory.
The first is pentaho-analysis-config.xml. The key changes are to set USE_SEGMENT_CACHE and DISABLE_LOCAL_SEGMENT_CACHE to true. This enables the external segment cache and turns off local caching. The last is to specify the SEGMENT_CACHE_IMPL by providing the name of the class to use. The configuration file has the three standard types. Simply comment out the two not needed and uncomment the one you want to use. As an aside, there is an API that you can use to provide your own cache, but I’m not going to cover that in this post. If you were to create a different cache, simply implement the required interfaces and specify your class in the SEGMENT_CACHE_IMPL.
The second file, when using memcached, is memcached-config.xml. There are three settings to modify. The first is the SALT setting, which is simply a string that gets added to all keys. This can be useful if multiple instances of the BI Server are running, such as development and QA, and they are using the same memcached nodes. Simply provide different SALT values and the data is segmented between the BI Servers. The second value is the SERVERS, which is simply a set of the memcached servers separated by a comma. Finally, there are the WEIGHTS, which give the relative size of each server based on the RAM available. This is used to allocate storage so one server isn’t constantly receiving too much data. You will have to understand the configuration of each of the other servers to properly set these values.
Whenever these files are updated, the BI Server will need to be restarted.
Installing and Configuring Memcached on Ubuntu
This section describes how to install and configure memcached on an Ubuntu server
. Other installs are similar and the documentation on the Memcached site and wiki provided more detail.
First, you will need to have a static IP address
for your memcached nodes or some way of consistently referring to them from the mondrian configuration. This can be done by modifying the IP of your Ubuntu system to have a static IP. I won’t cover how to do so here, since there are many examples on the web.
The easiest, and recommended, approach for installing memcached is to use the Ubuntu package. At the command line enter:
$ sudo apt-get install memcached
Enter the administrator password and the install will run. There may be some prompts, but the defaults work fine. This script will create a memcache user and install memcached in the /etc/init.d folder to start as a deamon on system startup. It does not automatically start it, however, so I recommend testing and configuring prior to restarting the server.
There are two files of significance that you will need to know about. The first is the configuration file for memcached and the second is the log file. By default the log file is /var/log/memcached.log. This file is useful when running the deamon to see what’s going on. The second file, and the one to modify, is the /etc/memcached.config file.
There are four main settings that you may want to change:
- -m <number> This setting tells how much RAM to use for memcached. This is the max that it will use before dropping data. Memcached will use more than this, so you don’t want it to use all of the RAM. However, if the node isn’t doing much else, then I’d use most of the available RAM.
- -l <IP address> By default this is set to 127.0.0.1. The memcached comment indicates that this means it will listen on all IP addresses. However, I was unable to get it to work without changing this value to the static IP of the machine. Forum posts on various locations also indicated that use of the localhost address would cause memcached to not listen to external requests.
- -p <port> The default port is 11211. There is no reason to change it unless you are running multiple memcached instances on the same machine or some other process is using this port.
- -v, -vv, -vvv These flags cause memcached to be verbose by increasing levels. These are commented out. In a production system I would leave them commented out, at least after initial use, since they could cause the logs to get quite large. However, I like using them during initial setup and test. They indicate that there is activity, which can be used to confirm that Mondrian is properly configured.
That’s really all there is to configuring memcached. Restart the machines and use “$ ps aux” to verify that memcached is running. You can also use “$ netstat -l” to make sure the process is listening on the correct port.
Using a local cache with only one server
The recommendation to turn off the local cache when running with an extended cache is because the extended cache could be cleared by a different node because the underlying data has been cleared. If the local cache is used, then it’s possible that it has old data. In this case all server nodes would need to be cleared. However, if there is only a single BI Server node, then it might be advantageous to leave the local cache turned on. This would avoid the network delay of accessing the extended cache when the segment data is stored locally.
Request to external memcached nodes send and retrieve data across the network. While these are typically small amounts of data, on a busy or slow network they can add to delay. If you expect a lot of traffic be sure to have a fast network.
Multiple BI Servers with Memcache
Multiple servers can be configured to use Memcache. There is not even a requirement that each of the servers be configured to use the same set of memcache servers. If, for some reason, the servers were configured to use different extended caches, or had local caching turned on, then each individual severed would need to have the cache flushed.
Reloading Data and Invalidating the Cache
When new data is entered into the data mart via the ETL process, Mondrian does not automatically know the data is no longer valid. With local caching this disconnect would eventually be resolved automatically (although not deterministically) through the garbage collection process. However, with memcached extended caching, the segments can stay around indefinitely. This means that an approach, ideally automated, needs to be available to flush the caches when the data mart is updated. Two (or more?) approaches can be used to solve this problem. Either implement a DataSourceChangeListener or use the SPIs to call Mondrian and invalidate the cache. This capability will be the subject of a future post.
Timeout Errors when Running Analysis
When a configured memcache node is not available, pretty bad things start to happen. The most noticeable feature is that the user gets a timeout exception on their report. This generally indicates that a memcache instance has failed or there is a configuration problem. Check the catalina.out files for details in order to troubleshoot.