# Add Caching to a Data Access Layer

Dynamic web pages are great; you can adapt the resulting page to your user, show other users' activity, offer different products to your customers based on their navigation history, and so on. But the more dynamic a website is, the more database queries you'll probably need to perform. Unfortunately, these database queries consume the largest portion of your page's running time.

In this tutorial, I will demonstrate a way to improve performance without running unnecessary extra queries: we'll develop a query caching system for our data layer, at a small programming and deployment cost.

### 1. The Data Access Layer

Adding a caching layer transparently to an application is often difficult because of its internal design. With object-oriented languages (like PHP 5) it is a lot easier, but it can still be complicated by a poor design.

In this tutorial, our starting point is an application that performs all its database access through a centralized class, from which all data models inherit the basic database access methods. The skeleton for this starting class looks like this:
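A minimal sketch of such a skeleton, reconstructed from the description that follows (the exact member list in the original may differ, and the `quote()` helper is my guess at the unnamed final method):

```php
<?php
// Base data-access class; all data models will inherit from it.
class model_Model
{
    // Shared PDO connection for every model instance.
    protected static $DB = null;

    public function __construct()
    {
        // Connect to the database (implemented next).
    }

    protected function doStatement($sql)
    {
        // Execute a query and return its full result set.
    }

    protected function quote($value)
    {
        // Escape a value for safe inclusion in a query
        // (a plausible final helper; the original is not shown).
    }
}
```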

Let's implement it step by step. First, the constructor that will use the PDO library to interface with the database:
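A constructor along these lines fits the description below (the `app_AppConfig` property names are my assumption):

```php
public function __construct()
{
    // Connect only once; every model instance shares this connection.
    if (self::$DB === null) {
        self::$DB = new PDO(
            app_AppConfig::$dbDsn,
            app_AppConfig::$dbUser,
            app_AppConfig::$dbPassword
        );

        // Last line: make PDO throw a PDOException on every error.
        self::$DB->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    }
}
```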

We connect to the database using the PDO library. For the database credentials I use a static class named "app_AppConfig" that centralizes the application's configuration information.

To store the database connection, we use a static attribute ($DB). A static attribute lets all the instances of "model_Model" share the same connection, and, because of that, the connection code is protected with an if (we don't want to connect more than once). In the last line of the constructor we set the exception error mode for PDO: in this mode, for every error PDO finds, it throws an exception (class PDOException) instead of returning error values. This is a matter of taste, but the rest of the code stays cleaner with the exception mode, which is good for this tutorial.

Executing queries can be very complex, but in this class we take a simple approach with a single doStatement() method. This method executes the query and returns an associative array with the entire result set (if any). Note that we are using the static connection (self::$DB). Note, also, that this method is protected: we don't want the user to execute arbitrary queries. Instead, we will provide concrete models to the user. We will see this later, but first let's implement the last method:
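A sketch of both remaining methods, under the same assumptions; since the source doesn't show the final method, I've guessed a small quoting helper that the concrete models can use to escape values:

```php
protected function doStatement($sql)
{
    // Run the query on the shared connection; with ERRMODE_EXCEPTION,
    // any database error surfaces as a PDOException.
    $statement = self::$DB->query($sql);

    // Return the entire result set as an array of associative arrays.
    return $statement->fetchAll(PDO::FETCH_ASSOC);
}

protected function quote($value)
{
    // Delegate escaping to PDO.
    return self::$DB->quote($value);
}
```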

The "model_Model" class is a very simple but convenient class for data layering. Although it's simple (it can be enhanced with advanced features like prepared statements if you want), it does the basic stuff for us.

To complete the configuration part of our application, let's write the "app_AppConfig" static class:
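A possible version of the configuration class; the property names and values are illustrative:

```php
<?php
// Centralized application configuration.
class app_AppConfig
{
    // Database credentials used by model_Model's constructor.
    public static $dbDsn      = 'mysql:host=localhost;dbname=mydb';
    public static $dbUser     = 'myuser';
    public static $dbPassword = 'mypassword';

    // Directory where the cache DBM files will live (used in section 3).
    public static $cacheDir   = '/var/cache/myapp';
}
```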

As stated before, we will provide concrete models to access the database. As a little example, we will use this simple schema: a documents table and an inverted index to search whether a document contains a given word or not:
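The schema might look like this (table and column names invented for the example):

```sql
-- Documents, plus an inverted index of the words they contain.
CREATE TABLE documents (
    id    INT          NOT NULL AUTO_INCREMENT,
    title VARCHAR(255) NOT NULL,
    body  TEXT         NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE inverted_index (
    word        VARCHAR(64) NOT NULL,
    document_id INT         NOT NULL,
    PRIMARY KEY (word, document_id)
);
```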

From the basic data access class (model_Model), we derive as many classes as needed by the data design of our application. In this example, we can derive those two self-explanatory classes:
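For instance, the two models could look like this (method and column names are my own, matching the example schema above):

```php
<?php
// Two concrete models deriving from model_Model.
class model_Document extends model_Model
{
    // Fetch one document row by its id.
    public function getDocument($id)
    {
        return $this->doStatement(
            'SELECT * FROM documents WHERE id = ' . (int)$id
        );
    }
}

class model_Index extends model_Model
{
    // Fetch the ids of the documents that contain the given word.
    public function getDocumentsWithWord($word)
    {
        return $this->doStatement(
            'SELECT document_id FROM inverted_index WHERE word = '
            . self::$DB->quote($word)
        );
    }
}
```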

Those derived models are where we add the public methods. Using them is extremely simple:
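Assuming the hypothetical models sketched above, usage boils down to this (the id and word are made up):

```php
<?php
// Instantiate the models and query them.
$documents = new model_Document();
$index     = new model_Index();

print_r($documents->getDocument(3));
print_r($index->getDocumentsWithWord('caching'));
```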

The result for this example might look similar to this (obviously, it depends on your actual data):
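For instance, fetching a document might print something like the following (purely illustrative):

```
Array
(
    [0] => Array
        (
            [id] => 3
            [title] => My third document
            [body] => Lorem ipsum dolor sit amet...
        )
)
```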

What we have written is shown in the next UML class diagram:

### 2. Planning our Caching Scheme

When things start to collapse in your database server, it is time to take a break and consider optimizing the data layer. After optimizing your queries, adding the proper indexes, etc., the second move is to avoid unnecessary queries: why make the same request to the database on every user request, if the data hardly changes?

With a well-planned and well-decoupled class organization, we can add an extra layer to our application almost with no programming cost. In this case, we are going to extend the "model_Model" class to add transparent caching to our database layer.

#### The Caching Basics

Since we know that we need a caching system, let's focus on that particular problem and, once sorted out, we will integrate it in our data model. For now, we won't think in terms of SQL queries. It's easy to abstract a little and build a general enough scheme.

The simplest caching scheme consists of (key, data) pairs, where the key identifies the actual data we want to store. This scheme is not new; in fact, it is analogous to PHP's associative arrays, which we use all the time.

So we will need a way to store a pair, to read it, and to delete it. That's enough to build our interface for cache helpers:
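The interface could look like this (the method names follow the description in the next paragraph):

```php
<?php
// Common interface for all caching helpers.
interface cache_CacheHelper
{
    // Return the stored value for $key, or false if not present.
    public function get($key);

    // Store (or update) $data under $key.
    public function put($key, $data);

    // Remove $key from the cache.
    public function delete($key);
}
```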

The interface is quite easy: the get method gets a value, given its identifying key, the put method sets (or updates) the value for a given key, and the delete method deletes it.

With this interface in mind, it's time to implement our first real caching module. But before doing it, we will choose the data storage method.

#### The Underlying Storage System

The decision to build a common interface (like cache_CacheHelper) for caching helpers will allow us to implement them on top of nearly any storage system. But which storage system? There are a lot of them we could use: shared memory, files, memcached servers, or even SQLite databases.

Often underestimated, DBM files are perfect for our caching system, and we are going to use them in this tutorial.

DBM files work natively on (key, data) pairs, and do it very fast thanks to their internal B-tree organization. They also handle access control for us: we don't need to worry about locking the cache before writing (as we would have to do with other storage systems); DBM does it for us.

DBM files are not driven by expensive servers; they do their work inside a lightweight library on the client side, accessing the actual data file locally. In fact, they are a family of file formats, all of them with the same basic API for (key, data) access. Some of them allow repeated keys; others are constant and don't allow writes after the file is closed for the first time (cdb); etc. You can read more about this at http://www.php.net/manual/en/dba.requirements.php

Nearly every UNIX system installs one or more of these libraries (probably Berkeley DB or GNU dbm). For this example, we will use the "db4" format (Sleepycat DB4: http://www.sleepycat.com). I have found that this library is often preinstalled, but you can use whichever library you want (except cdb, of course: we want to write to the file). In fact, you could move this decision into the "app_AppConfig" class and adapt it for every project you do.

With PHP, we have two alternatives for dealing with DBM files: the "dba" extension (http://php.net/manual/en/book.dba.php) or the "PEAR::DBA" module (http://pear.php.net/package/DBA). We will use the "dba" extension, which you probably already have installed on your system.

#### Wait a minute, we are dealing with SQL and result sets!

DBM files work with strings for keys and values, but our problem is to store SQL result sets (which can vary in structure quite a lot). How can we convert them from one world to the other?

Well, for keys, it is very easy, because the actual SQL query string identifies its set of data very well. We can use the MD5 digest of the query string to shorten the key. For values, it is trickier, but here your allies are PHP's serialize() / unserialize() functions, which convert arrays to strings and vice versa.
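A tiny self-contained example of both conversions (the query and result set are invented):

```php
<?php
// From the SQL world to the DBM world: a short key via md5(), and a
// string value via serialize()/unserialize().
$sql    = "SELECT document_id FROM inverted_index WHERE word = 'caching'";
$result = array(
    array('document_id' => 3),
    array('document_id' => 7),
);

$key = md5($sql);                   // 32-character key identifying this query

$stored    = serialize($result);    // array -> string, ready for the DBM file
$recovered = unserialize($stored);  // string -> array, identical to $result
```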

We will see how all this works in the next section.

### 3. Static Caching

In our first example, we will deal with the easiest way to perform caching: caching for static values. We will write a class called "cache_DBM" implementing the interface "cache_CacheHelper", just like that:
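One possible implementation, keeping error handling minimal for clarity; I serialize inside the helper so it can store arbitrary PHP values:

```php
<?php
// A DBM-backed cache helper built on PHP's dba extension.
class cache_DBM implements cache_CacheHelper
{
    private $dbm;

    public function __construct($file)
    {
        // Open (or create) the DBM file with the db4 handler.
        $this->dbm = dba_open($file, 'c', 'db4');
    }

    public function get($key)
    {
        $data = dba_fetch($key, $this->dbm);
        if ($data === false) {
            return false; // not in the cache
        }
        return unserialize($data);
    }

    public function put($key, $data)
    {
        // dba_replace() inserts the pair, or overwrites it if it exists.
        dba_replace($key, serialize($data), $this->dbm);
    }

    public function delete($key)
    {
        dba_delete($key, $this->dbm);
    }

    public function __destruct()
    {
        dba_close($this->dbm);
    }
}
```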

This class is very simple: a mapping between our interface and the dba functions. In the constructor, the given file is opened, and the returned handle is stored in the object so that the other methods can use it.

A simple example of use:
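For instance (the file path is illustrative):

```php
<?php
// Store, read and delete a value.
$cache = new cache_DBM('/tmp/test.db4');

$cache->put('greeting', array('hello', 'world'));

print_r($cache->get('greeting'));   // the stored array
$cache->delete('greeting');
var_dump($cache->get('greeting')); // bool(false): the key is gone
```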

Below, you'll find what we have done so far expressed as a UML class diagram:

Now let's add the caching system to our data model. We could have changed the "model_Model" class in order to add caching to each of its derived classes. But, if we had done so, we would have lost the flexibility to assign the caching characteristic only to specific models, and I think this is an important part of our job.

So we will create another class, called "model_StaticCache", which will extend "model_Model" and will add caching functionality. Let's start with the skeleton:
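A sketch of the whole class; the following paragraphs explain the constructor and doStatement() in detail (the cacheDir property of app_AppConfig is my assumption):

```php
<?php
// Transparent static caching on top of model_Model.
class model_StaticCache extends model_Model
{
    // One cache_DBM instance per derived class name.
    private static $cache = array();

    public function __construct()
    {
        // Connect to the database as usual.
        parent::__construct();

        // Create (once) the DBM cache file for this concrete class.
        $class = get_class($this);
        if (!isset(self::$cache[$class])) {
            $file = app_AppConfig::$cacheDir . '/' . $class . '.db4';
            self::$cache[$class] = new cache_DBM($file);
        }
    }

    protected function doStatement($sql)
    {
        $cache = self::$cache[get_class($this)];

        // Use the MD5 of the query as the cache key.
        $key = md5($sql);

        // Cache hit: return the stored result set.
        $data = $cache->get($key);
        if ($data !== false) {
            return $data;
        }

        // Cache miss: run the query and store the result for next time.
        $data = parent::doStatement($sql);
        $cache->put($key, $data);
        return $data;
    }
}
```

Enabling the cache for a model is then a one-word change in its declaration, e.g. `class model_Document extends model_StaticCache`.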

In the constructor, we first call the parent constructor in order to connect to the database. Then we create and store, statically, a "cache_DBM" object (if not created before elsewhere). We keep one instance per derived class name, because we use one DBM file for each of them; for that purpose, we use the static array "$cache". To determine the directory in which to write the cache files, we again use the application's configuration class, "app_AppConfig".

And now, the doStatement() method. The logic is: convert the SQL statement into a valid key and search for that key in the cache; if found, return the cached value. If not found, execute the statement in the database, store the result in the cache, and return it.

There are two more things worth noting. First, we are using the MD5 of the query as the key. Strictly speaking, this is not necessary, because the underlying DBM library accepts keys of arbitrary size, but it seems better to shorten the key anyway. Second, if you are using prepared statements, remember to concatenate the actual bound values to the query string when creating the key!

Once "model_StaticCache" is written, modifying a concrete model to use it is trivial: you only need to change the "extends" clause in its class declaration. And that's all, the magic is done! "model_Document" will now perform only one query for every document it retrieves.

But we can do it better.

### 4. Caching Expiration

In our first approach, once a query is stored in the cache it remains valid forever, until one of two things occurs: we delete its key explicitly, or we unlink the DBM file. However, this approach is only valid for a few of the data models in an application: the truly static data (menu options and that kind of thing). Most data in an application is likely to be more dynamic than that. Think about a table containing the products we sell on our web page.
It is not likely to change every minute, but there is a chance that it will change (new products are added, selling prices change, etc.). We need a way to keep caching, yet still react to changes in the data.

One approach to this problem is to attach an expiration time to the data stored in the cache: when we store new data, we define a window of time during which that data will be considered valid. After that time, the data will be read from the database again and cached for another period.

As before, we can create another class derived from "model_Model" with this functionality. This time, we will call it "model_ExpiringCache". The skeleton is similar to "model_StaticCache", with one new attribute: $expiration. It stores the configured time window for valid data. We set this value in the constructor; the rest of the constructor is the same as in "model_StaticCache":
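A sketch of the skeleton and constructor, under the same assumptions as before:

```php
<?php
// Caching with an expiration window on top of model_Model.
class model_ExpiringCache extends model_Model
{
    private static $cache = array();

    // Time window, in seconds, during which cached data stays valid.
    private $expiration = 3600;

    public function __construct()
    {
        parent::__construct();

        // Same per-class DBM cache setup as in model_StaticCache.
        $class = get_class($this);
        if (!isset(self::$cache[$class])) {
            $file = app_AppConfig::$cacheDir . '/' . $class . '.db4';
            self::$cache[$class] = new cache_DBM($file);
        }
    }
}
```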

The bulk of the job comes in the doStatement. The DBM files have no internal way to control expiration of data, so we must implement our own. We'll do it by storing arrays, like this one:
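The stored structure pairs a timestamp with the payload:

```php
<?php
// The structure stored (serialized) for every cache entry.
$resultSet = array(/* ...the rows returned by the query... */);

$value = array(
    'time' => time(),     // when this entry was written to the cache
    'data' => $resultSet, // the actual data we want to store
);
```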

This kind of array is what we serialize and store in the cache. The "time" key holds the modification time of the data in the cache, and "data" holds the actual data we want to store. At read time, if we find that the key exists, we compare the stored creation time with the current time, and return the data if it has not expired.

If the key doesn't exist or is expired, we continue executing the query and storing the new result set in the cache before returning it.
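Putting the two paths together, doStatement() might look like this:

```php
protected function doStatement($sql)
{
    $cache = self::$cache[get_class($this)];
    $key   = md5($sql);

    // Cache hit: honor it only while the entry is still fresh.
    $value = $cache->get($key);
    if ($value !== false && (time() - $value['time']) < $this->expiration) {
        return $value['data'];
    }

    // Miss or expired: query the database and refresh the cache entry.
    $data = parent::doStatement($sql);
    $cache->put($key, array('time' => time(), 'data' => $data));
    return $data;
}
```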

Simple!

Now let's convert "model_Index" into a model with an expiring cache. As with "model_Document", we only need to modify the class declaration and change its "extends" clause:
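Only the declaration changes; every method stays exactly as it was:

```php
<?php
// model_Index now inherits the expiring cache instead of plain model_Model.
class model_Index extends model_ExpiringCache
{
    public function getDocumentsWithWord($word)
    {
        // ...unchanged...
    }
}
```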

Some considerations must be made about the expiration time. We use a constant expiration time (1 hour = 3,600 seconds) for the sake of simplicity, and because we don't want to modify the rest of our code. But we can easily extend the class to allow different expiration times, one for each model. Afterward, we will see how.

The class diagram for all our job is as follows:

### 5. Different Expiration

In every project, I am sure you will want a different expiration time for nearly every model: from a couple of minutes to hours, or even days.

If only we could have a different expiration time for every model, it would be perfect... but, wait! We can do it easily!

The most direct approach is to add an argument to the constructor, so the new constructor for "model_ExpiringCache" will be this one:
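With a defaulted argument, existing models keep working unchanged:

```php
public function __construct($expiration = 3600)
{
    parent::__construct();

    // Per-model validity window, in seconds (1 hour by default).
    $this->expiration = $expiration;

    // ...the DBM cache setup stays exactly as before...
}
```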

Then, if we want a model with a 1 day expiration time (1 day = 24 hours = 1,440 minutes = 86,400 seconds), we can accomplish it this way:
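For example:

```php
<?php
// Documents change rarely: cache them for a whole day.
class model_Document extends model_ExpiringCache
{
    public function __construct()
    {
        parent::__construct(86400); // 1 day = 86,400 seconds
    }

    // ...the rest of the model is unchanged...
}
```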

And that's all. However, the drawback is that we must modify all the data models.

Another way of doing it is to delegate the task to the "app_AppConfig":
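One way the configuration class could centralize the expiration times; the method name and per-model values are illustrative:

```php
<?php
// app_AppConfig, extended with per-model expiration times.
class app_AppConfig
{
    // ...database credentials, cache directory, etc....

    private static $expirations = array(
        'model_Document' => 86400, // 1 day
        'model_Index'    => 3600,  // 1 hour
    );

    public static function getExpiration($model)
    {
        if (isset(self::$expirations[$model])) {
            return self::$expirations[$model];
        }
        return 3600; // sensible default: 1 hour
    }
}
```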

And then add the call to this new method on the "model_ExpiringCache" constructor, like this:
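Assuming a configuration method like the one just described, the constructor becomes:

```php
public function __construct()
{
    parent::__construct();

    // Let the configuration decide this concrete model's expiration time.
    $this->expiration = app_AppConfig::getExpiration(get_class($this));

    // ...the DBM cache setup stays exactly as before...
}
```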

This latter approach allows us to do fancy things, like using different expiration values for production and development environments in a centralized way. Anyway, you can choose whichever method you prefer.

In UML, the total project looks like this:

### 6. Some Caveats

There are some queries that cannot be cached. The most evident ones are modifying queries like INSERT, DELETE or UPDATE. These queries must always reach the database server.

But even with SELECT queries, there are some circumstances in which a caching system can create problems. Take a look at a query like this one:
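Something along these lines (table and column names invented for the example):

```sql
SELECT url, image
FROM banners
WHERE zone = 'home'
ORDER BY rand()
LIMIT 10;
```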

This query randomly selects 10 banners for the "home" zone of our website. The intent is to generate movement in the banners shown on our home page, but if we cache this query, the user will not see any movement at all until the cached data expires.

The rand() function is not deterministic (and neither are now() and similar functions), so it returns a different value on every execution. If we cache the query, we freeze just one of those results for the whole caching period, thereby breaking the functionality.

But with a simple re-factoring, we can obtain the benefits of caching and show pseudo-randomness:
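One possible refactoring, keeping the illustrative query from before:

```php
<?php
// Cache up to 50 frozen "random" configurations and pick one per request.
$n = rand(1, 50);

$banners = $this->doStatement(
    "SELECT url, image
     FROM banners
     WHERE zone = 'home' AND $n = $n
     ORDER BY rand()
     LIMIT 10"
);
```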

What we are doing here is caching fifty different random banner configurations, and selecting among them randomly. The 50 SELECTs will look like this:
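Each value of the random number yields a distinct query string, and therefore a distinct cache key:

```sql
SELECT url, image FROM banners WHERE zone = 'home' AND 1 = 1 ORDER BY rand() LIMIT 10;
SELECT url, image FROM banners WHERE zone = 'home' AND 2 = 2 ORDER BY rand() LIMIT 10;
-- ...
SELECT url, image FROM banners WHERE zone = 'home' AND 50 = 50 ORDER BY rand() LIMIT 10;
```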

We have added a constant condition to the SELECT, which has no cost for the database server but yields 50 different keys for the caching system. A user would need to load the page fifty times to see all the different banner configurations, so the dynamic effect is achieved. The cost is just the fifty queries to the database needed to fill the cache.

### 7. A Benchmark

What benefits can we expect from our new caching system?

First, it must be said that, in raw performance, our new implementation will sometimes run slower than the database queries themselves, especially with very simple, well-optimized queries. But for queries with joins, our DBM cache will run faster.

However, the problem we solved is not raw performance. In production, you will never have an idle database server like the one in your tests; you'll probably have a server under a high workload. In this situation, even the fastest query can run slowly; but with our caching scheme, we are not even touching the server and, in fact, we are reducing its workload. So the real performance increase will come in the form of more requests per second served.

For a website that I am currently developing, I ran a simple benchmark to measure the benefits of caching. The server is modest: Ubuntu 8.10 running on an AMD Athlon 64 X2 5600+, with 2 GB of RAM and an old PATA hard disk. The system runs the Apache and MySQL 5.0 packages that come with the Ubuntu distribution, without any tuning.

The test was to run Apache's benchmark program (ab) with 1, 5 and 10 concurrent clients loading a page 1,000 times from my development website. The actual page was a product detail that has no less than 20 queries: menu contents, product details, recommended products, banners, etc.

The results without cache were 4.35 p/s for 1 client, 8.25 for 5 clients, and 8.29 for 10 clients. With caching (different expiration), the results were 25.55 p/s with 1 client, 49.01 for 5 clients, and 48.74 for 10 clients.

### Final Thoughts

I've shown you an easy way to insert caching into your data model. Of course, there is a plethora of alternatives, and this is just one of the choices you have.

We have used local DBM files to store the data, but there are even faster alternatives that you might consider exploring. Some ideas for the future: using APC's apc_store() functions as underlying storage system, shared memory for the really critical data, using memcached, etc.

I hope you have enjoyed this tutorial as much as I did writing it. Happy caching!