
Berkeley DB, Java Edition I: The Basics

August 24, 2004

Contents
Introduction
What is Berkeley DB?
Berkeley DB is Open Source
Using Berkeley DB
Converting Objects to Data
Creating a Database
Fetching and Retrieving Objects
Final Thoughts

Introduction

For years, programmers have been using Berkeley DB, the embedded database from Sleepycat Software, when they need a high-quality persistent and transactional datastore, but don't want or need the overhead associated with relational databases. The most famous example is probably Cisco's use of Berkeley DB in its routers, but there are many others. For example, the Subversion version control system stores its repository in Berkeley DB. And Motorola is embedding Berkeley DB in smart phones.

The original version of Berkeley DB is written in C. This means that, up until now, a Java programmer who wanted to use Berkeley DB had to go through some sort of translation layer (for example, JNI) to handle the communication between Java and Berkeley DB. (If you're wondering why there isn't a JDBC driver, the short answer is that Berkeley DB isn't a relational database. We'll cover this in more detail later on.) While that's acceptable in some scenarios, it's also very limiting, which is why Sleepycat recently released a Java version of Berkeley DB. Here's a quote from their press release:

As Java has become more important for building large systems, customers asked for a version of Berkeley DB that provides the same services, but that is written entirely in Java. They wanted the performance, reliability, and scalability of Berkeley DB without the complexity and cost of switching between Java and C execution at run time. In addition, they wanted to take advantage of abstractions and services available in Java that are not available under C.

In this article, I'm going to cover the basics of using the Java Edition of Berkeley DB. We'll cover the basics of embedded databases, discuss Berkeley DB, and talk about some of the basic things you need to know in order to use it.

Then, in part two, I will proceed through most of the details of implementing a session management system by walking through a real-world example. At the end of that article, we still won't have fully covered the Berkeley DB API, but we will have covered enough details to give you a good idea of how it's used, and what the design tradeoffs are.

Throughout this article, when referring to sessions or session management, I'm using the terminology of client-server and web applications. If you're unfamiliar with session management, please take a moment to browse the Wikipedia definition.

In addition, please be aware that, for brevity, I'm going to use "Berkeley DB" for what is usually referred to as "Berkeley DB Java Edition" or "Berkeley DB JE."

Acknowledgments

Martin O'Connor and several kind people from Sleepycat Software reviewed this article.

What is Berkeley DB?

When you mention the word "database" to a Java programmer, it's a good bet that her mind immediately slides to the world of relational databases, and relational database servers. It's a reasonable reaction for a Java programmer -- relational databases living inside of separate servers are the most common type of database out there, and they're the type of database most Java programmers interact with on a daily basis. However, while Berkeley DB databases are databases in the sense of the Wikipedia definition, they definitely aren't that type of database.

What type of database are they? Well, Berkeley DB databases have five main characteristics:

  • They are embedded databases. All of the database code runs inside of your application's process, just like any other library or framework you use in your projects.

    Like anything else, embedded databases have benefits and drawbacks. The major benefits are performance and deployment simplicity. If you don't have to open sockets and send data to a remote database process, you can get a dramatic boost in performance. And, obviously, not having to configure and maintain a separate database server can be a big win, as well.

    The drawbacks to embedding the database are more subtle. One is that the memory and performance profile of your application might change as Berkeley DB caches data or performs indexing operations. Another drawback is that it's harder to share a database across processes. Berkeley DB does allow multiple applications to open the files associated with a database, but only one process is allowed to write to the database (and the other processes, because they are caching data, might not see recent changes).

  • They are maps. A Berkeley DB database is a map from keys (which are indexed) to values. You perform lookups using the keys. The values are single objects, and you can't perform lookups based on value characteristics. Among other things, this means that ad-hoc queries aren't really possible without loading the entire database into memory and scanning through all of the data.
  • They are not accessed declaratively. A Berkeley DB database is accessed by an API, using Java method calls. Unlike some other databases that can be run in embedded mode (for example, HSQL), Berkeley DB does not use a declarative query syntax like SQL, and results are not returned in instances of ResultSet.

    Another consequence of this is that the data mapping code is more explicit. Unlike SQL-based systems, where the database might store data as NUMERIC(10,2) (and the JDBC driver would perform a mapping to Java types), a Berkeley DB database doesn't have a separate type system from your application.

  • They can be transactional. When you create a Berkeley DB database, you can specify whether you will need to support transactions across multiple operations.
  • They are persistent. Berkeley DB databases are automatically stored to the file system. Between transactions and persistence, Berkeley DB aims to provide enterprise-class data management.

Of course, when I go through this list, and try to explain Berkeley DB to people, they usually respond with comments like:

So, it's, like, what? A well-behaved HashMap?

To which I usually reply, "well, yeah, sort of. It's actually a little more like a well-behaved TreeMap, but you get the idea."
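
To make the map analogy concrete, here is a minimal sketch that uses only java.util.TreeMap, not Berkeley DB itself. The class name MapAnalogy and the sample keys and values are invented for illustration. It contrasts the two cheap operations (lookup by key, and a range scan in key order) with the expensive one (finding entries by a property of the value, which means visiting every entry).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class MapAnalogy {
    // Stand-in for a database: sorted keys mapped to opaque values.
    static final TreeMap<String, String> DB = new TreeMap<>();
    static {
        DB.put("session-001", "alice");
        DB.put("session-002", "bob");
        DB.put("session-003", "alice");
    }

    // Lookup by key: what an indexed map is good at.
    static String lookup(String key) {
        return DB.get(key);
    }

    // Range scan in key order: possible because the keys are kept sorted.
    static SortedMap<String, String> range(String fromInclusive, String toExclusive) {
        return DB.subMap(fromInclusive, toExclusive);
    }

    // "Query" by value: no index helps, so every entry must be visited.
    static List<String> keysWithValue(String value) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e : DB.entrySet()) {
            if (e.getValue().equals(value)) {
                keys.add(e.getKey());
            }
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(lookup("session-002"));
        System.out.println(range("session-001", "session-003").keySet());
        System.out.println(keysWithValue("alice"));
    }
}
```

The TreeMap lives only in memory and has no transactions, which is exactly the gap the "well-behaved" part of the analogy refers to.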

At its core, a Berkeley DB database is really just a b-tree with a persistent file format and nice transactional semantics thrown in. It makes sense to use it, or something like it, in situations when all of the following are true:

  • You need to store objects in an indexed collection.
  • You only need a small number of fairly simple ways to retrieve the objects.
  • You don't need ad-hoc queries.
  • You need transactional insert/remove/update operations on your storage.
  • You need persistence to disk.
  • You need a very small footprint, and for the database to be in-process.
  • You don't need to have multiple processes or applications access the database.

This is a fairly specialized set of requirements. But a large number of applications meet them. For example: an LDAP server (such as Apache's Eve Directory Server) is a perfect candidate. It stores a large amount of information, indexed in a few simple ways. It doesn't need ad-hoc queries, but does need persistence and transactions. Similarly, a naming service, such as a JNDI service provider, is a good example of an application that meets these requirements.

Session management code, where servers store state specific to a client for a short period of time (with each client's state stored under a session key), is also a good candidate for a Berkeley DB database. In fact, in the second part of this article I'll show you a fairly complete implementation of session management.

Berkeley DB is Open Source

In addition to selling commercial licenses, Sleepycat releases Berkeley DB under an open source license. The terms are fairly similar to the GNU General Public License; the key point is that, if you're distributing your software to third parties, you have to open source your application. Or, in their words:

The open source license for Berkeley DB permits you to use the software at no charge under the condition that if you use Berkeley DB in an application you redistribute, the complete source code for your application must be available and freely redistributable under reasonable conditions. If you do not want to release the source code for your application, you may purchase a license from Sleepycat Software. For pricing information, or if you have further questions on licensing, please contact us.

One very nice thing about open source applications is that the source code is available. This helps programmers in three ways. The first is obvious: if you're not sure about how to use a particular method ("Can I pass null in for that argument?"), you can look at the source code and figure it out. The second is almost as obvious: if you're running a debugger, or a memory profiler, it helps to have the source code. Even cooler, if you're using a tool such as AspectJ, having the source lets you construct finely targeted pointcuts.

The third advantage of an open source application is that you can reassure yourself about code quality. There's really no substitute for old-fashioned poking around when you want to do a quick assessment of library quality. I'm happy to report that the Sleepycat code looks quite nice. I ran both Excelsior's FlawDetector and FindBugs against the .jars. FlawDetector didn't see any errors. And, while FindBugs found quite a few things to complain about, most of them didn't seem too serious. Figure 1 shows a typical issue: inconsistent synchronization is a sign that the code wasn't factored properly, and that the locking is a bit trickier than it needed to be. I don't know that there's a bug lurking here, but I'd be happier if FindBugs didn't report this one.

Figure 1. Problems with inconsistent synchronization

On the other hand, Figure 2 shows a potentially serious issue. Double-checked locking isn't reliable in Java (at least not without a volatile field under the revised Java 5 memory model), and it looks like version 1.5 of Berkeley DB uses double-checked locking in a couple of places.

Figure 2. Problems with doublechecked locking
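
For readers who haven't met the idiom, the sketch below shows what double-checked locking looks like. It is plain Java with an invented class name (LazyCache), not code taken from Berkeley DB. Under the revised memory model that shipped with Java 5 (JSR-133), the pattern is safe only when the field is declared volatile, as it is here; without volatile, or on earlier JVMs, another thread can observe a partially constructed object.

```java
public class LazyCache {
    // volatile is what makes this correct under the Java 5+ memory model;
    // without it, the pattern is broken.
    private static volatile LazyCache instance;

    private LazyCache() { }

    public static LazyCache getInstance() {
        LazyCache local = instance;          // first check, without the lock
        if (local == null) {
            synchronized (LazyCache.class) {
                local = instance;            // second check, under the lock
                if (local == null) {
                    local = new LazyCache();
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        // Both calls return the same lazily created instance.
        System.out.println(getInstance() == getInstance());
    }
}
```

When the lazily created object is a simple singleton, a static holder class is usually an easier way to get the same effect without any explicit locking.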


Gregory Burd, the product manager for Sleepycat, has told me that all of the FindBugs issues will be resolved in the next release of Berkeley DB (currently scheduled for early fall) and that FindBugs is now a permanent part of their release process. I was impressed both with their willingness to fix these bugs, and with the way they quickly moved to incorporate FindBugs into their build process.







Using Berkeley DB

At this point, I've told you what Berkeley DB is, I've given you some ideas on when it's an appropriate tool to use (and what problems it solves), and I've told you it's a fairly high-quality library. By now, you're probably itching for something more concrete. In this section, I'm going to cover the basics of using Berkeley DB. And then, in the next article, I'll show you how to implement session management using Berkeley DB.

Converting Objects to Data

Recall that, at its heart, a Berkeley DB database is a b-tree. It's a collection of key-value pairs. Keys map to values, and they usually do so uniquely (you can configure things so that keys don't have unique values, but the default is for each key to have a single value). What's more, internally, Berkeley DB stores both keys and values as byte arrays. It doesn't track object types or field names or any data like that -- it stores instances of DatabaseEntry, which is simply a wrapper around a byte array that provides some convenience methods.

This has two consequences. The first is that you have to provide Berkeley DB with some way to transform your objects into instances of DatabaseEntry (alternatively, you can perform the transformation yourself). The second, which we'll return to later, is that the internal order of keys and values in a Berkeley DB database isn't what you might expect: it's the order defined by the "natural order" on the byte arrays.
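
The second consequence is easy to underestimate, so here is a small stdlib-only sketch. It assumes that the "natural order" on byte arrays means an unsigned, byte-by-byte lexicographic comparison; that is my reading, and the comparator below is my own stand-in, not Berkeley DB's code. The class and method names are invented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.TreeMap;

public class ByteOrderDemo {
    // Compares byte arrays as unsigned bytes, left to right; an array that
    // is a prefix of a longer one sorts first. (Assumed ordering.)
    static final Comparator<byte[]> UNSIGNED_LEXICOGRAPHIC = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return a.length - b.length;
    };

    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Returns whichever of the given keys comes first in byte order.
    static String firstKey(String... keys) {
        TreeMap<byte[], String> map = new TreeMap<>(UNSIGNED_LEXICOGRAPHIC);
        for (String k : keys) {
            map.put(utf8(k), k);
        }
        return map.firstEntry().getValue();
    }

    public static void main(String[] args) {
        // As numbers, 9 < 10; as bytes, "10" sorts first because '1' < '9'.
        System.out.println(firstKey("9", "10"));
    }
}
```

If keys need to iterate in numeric or date order, the encoding has to be chosen so that byte order and logical order agree (zero-padded strings are one simple option).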

There are two ways to create instances of DatabaseEntry:

  1. You can implement the EntryBinding interface (you usually do this by extending TupleBinding). EntryBinding has two methods, objectToEntry and entryToObject, and can be thought of as an object translation layer you provide to Berkeley DB.
  2. You can have your objects implement the Serializable interface. In this case, you will also need to create a second database to store the class definitions. But once you do so, you can simply use Sleepycat's implementation of serialization, instead of writing lots of different subclasses of TupleBinding.

The first approach, repeatedly subclassing TupleBinding, requires more programmer effort and code. The second approach, using serialization, is slower and requires you to create and maintain a second database of class definitions (i.e., it's slightly more resource-intensive). There's no good general-purpose answer to the question "Which approach should I use?" but in this article, we'll subclass TupleBinding. It's easy to do and, for small projects, is clearly the better approach.

Here's part of the code for a simple class that performs binding for a class named Session.

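    import java.io.IOException;

    import com.sleepycat.bind.tuple.TupleBinding;
    import com.sleepycat.bind.tuple.TupleInput;
    import com.sleepycat.bind.tuple.TupleOutput;

    public class SessionBinding extends TupleBinding {

        // Called when an entry is retrieved: rebuild the Session from its bytes.
        public Object entryToObject(TupleInput ti) throws IOException {
            String sessionKey = ti.readString();
            // Assumes Session has a constructor that takes the key. Read the rest
            // of the fields using TupleInput, in the order they were written.
            Session returnValue = new Session(sessionKey);
            return returnValue;
        }

        // Called when an object is stored: write its fields out as bytes.
        public void objectToEntry(Object object, TupleOutput to) throws IOException {
            Session session = (Session) object;
            to.writeString(session.getSessionKey());
            // write out rest of session fields using TupleOutput
        }
    }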

The point to notice about this code is that it really looks a lot like serialization. It's a translation layer that converts objects to byte arrays, and vice versa. When objects are stored, objectToEntry is called. When they're retrieved, entryToObject is called.
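
To see the round trip without the Berkeley DB jar on the classpath, here is a stdlib-only analogue built on DataOutputStream and DataInputStream. It is a sketch of the same idea, not the TupleBinding API: the Session fields (sessionKey, lastTouched) and the method names toBytes and fromBytes are assumptions made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class SessionCodec {
    // Hypothetical session type; the article does not show Session's fields.
    static final class Session {
        final String sessionKey;
        final long lastTouched;

        Session(String sessionKey, long lastTouched) {
            this.sessionKey = sessionKey;
            this.lastTouched = lastTouched;
        }
    }

    // Analogue of objectToEntry: write the fields in a fixed order.
    static byte[] toBytes(Session s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s.sessionKey);
            out.writeLong(s.lastTouched);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Analogue of entryToObject: read the fields back in the same order.
    static Session fromBytes(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            String key = in.readUTF();
            long touched = in.readLong();
            return new Session(key, touched);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Session copy = fromBytes(toBytes(new Session("abc123", 42L)));
        System.out.println(copy.sessionKey + " " + copy.lastTouched);
    }
}
```

The one rule that matters in either version is that the read order must match the write order exactly, because nothing in the byte array records field names or types.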

Creating a Database

Now that you know how to convert objects to and from instances of DatabaseEntry, the next step is to actually create a database (otherwise, we have no place to store the entries). Because Berkeley DB automatically persists the database (there's no "in-memory only" mode, as far as I can tell), this can be somewhat complicated, and four different Berkeley DB classes are involved: EnvironmentConfig, Environment, DatabaseConfig, and Database. Each of these objects has a different role in creating "the database." An instance of Database corresponds to a single map (a single set of key-value pairs). It's got some configuration properties (the most important of which are whether it supports transactions and whether key/data pairs have to be unique). It's configured at creation time, using an instance of DatabaseConfig.

An instance of Environment corresponds to a set of instances of Database and a location in the file system. While it isn't a perfect analogy, you can think of a Berkeley DB Database as corresponding to a single table in an SQL database. And the Environment corresponds to a collection of tables (i.e., a "database" in the SQL model).


One of the main reasons this analogy isn't perfect is that Berkeley DB also allows the existence of secondary databases. A Berkeley DB database is a map from a set of keys to a set of values. That's fine for a lot of scenarios, but in many cases, you need more than one key. For example, if you're a doctor, you might want to index instances of PatientRecord by both socialSecurityNumber and lastName. The way to solve this problem is to use a secondary database. Secondary databases are databases, but they're linked to a primary database and they serve as additional indices. We'll talk more about this later.
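
As a rough mental model of a secondary database, the sketch below keeps a second in-memory map from the extra key to the primary keys. It uses only java.util collections; the PatientRecord fields come from the example in the text, but the class and its methods are invented, and in Berkeley DB the library maintains the secondary index for you rather than leaving it to hand-written code like this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SecondaryIndexSketch {
    // Hypothetical record type, named after the example in the text.
    static final class PatientRecord {
        final String socialSecurityNumber;
        final String lastName;

        PatientRecord(String socialSecurityNumber, String lastName) {
            this.socialSecurityNumber = socialSecurityNumber;
            this.lastName = lastName;
        }
    }

    // Primary map: unique key -> record.
    private final Map<String, PatientRecord> bySsn = new TreeMap<>();
    // Secondary map: non-unique key -> primary keys of matching records.
    private final Map<String, List<String>> ssnsByLastName = new TreeMap<>();

    void put(PatientRecord record) {
        PatientRecord old = bySsn.put(record.socialSecurityNumber, record);
        if (old != null) {
            // Keep the index consistent when an existing record is replaced.
            ssnsByLastName.get(old.lastName).remove(old.socialSecurityNumber);
        }
        ssnsByLastName.computeIfAbsent(record.lastName, k -> new ArrayList<>())
                      .add(record.socialSecurityNumber);
    }

    PatientRecord findBySsn(String ssn) {
        return bySsn.get(ssn);
    }

    // Secondary lookup: index entry first, then the primary map.
    List<PatientRecord> findByLastName(String lastName) {
        List<PatientRecord> result = new ArrayList<>();
        for (String ssn : ssnsByLastName.getOrDefault(lastName, Collections.emptyList())) {
            result.add(bySsn.get(ssn));
        }
        return result;
    }

    public static void main(String[] args) {
        SecondaryIndexSketch db = new SecondaryIndexSketch();
        db.put(new PatientRecord("111-11-1111", "Smith"));
        db.put(new PatientRecord("222-22-2222", "Smith"));
        System.out.println(db.findByLastName("Smith").size());
    }
}
```

Note that every write now touches two maps, which is why updates to a primary database and its secondaries belong in the same transaction.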

In order to create an instance of Environment and Database, you need to decide a few things. The four most important decisions are:

  1. Where in the file system will databases be stored?
  2. What should happen if the database doesn't already exist?
  3. Will the environment support transactions?
  4. Will the database support transactions?

Once you've made these decisions, creating a database is easy. Here's a code snippet that shows the basic process:

    EnvironmentConfig environmentConfig = new EnvironmentConfig();
    environmentConfig.setAllowCreate(true);
    // perform other environment configurations
    File file = new File(DATA_DB_DIR_NAME);
    Environment environment = new Environment(file, environmentConfig);
    DatabaseConfig databaseConfig = new DatabaseConfig();
    databaseConfig.setAllowCreate(true);
    // perform other database configurations
    _sleepyCatDB = environment.openDatabase(null, DB_NAME, databaseConfig);

About the only really surprising thing here is the use of "property names." In addition to methods such as setAllowCreate, both EnvironmentConfig and DatabaseConfig have methods that allow you to specify a property and its value by name. Thus, for example, in the above code snippet, we could have added:

    environmentConfig.setConfigParam("java.util.logging.level", "INFO");

This is handy for a couple of reasons. The first is that these are also environmental properties. That is, Berkeley DB supports a property file format (the file is named je.properties). If the properties file exists, the values will be taken from there instead of using the values set in your code. This makes it easy to change property values without recompiling. The second is that it makes it easy to write administration tools and scripts that simply store and retrieve name-value pairs.
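
The precedence rule described above (a value in the properties file wins over a value set in code) can be modeled with java.util.Properties. This is only an illustration of that rule, not how Berkeley DB reads je.properties; the class and method names are invented, and the file contents are passed in as a string so the sketch stays self-contained.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertyOverrideSketch {
    // Returns the effective settings: values set in code act as defaults,
    // and any value present in the properties text replaces them.
    static Properties effective(Properties setInCode, String propertiesFileText) {
        Properties result = new Properties();
        result.putAll(setInCode);
        try {
            // load() overwrites any key that is already present.
            result.load(new StringReader(propertiesFileText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        Properties inCode = new Properties();
        inCode.setProperty("java.util.logging.level", "INFO");
        Properties p = effective(inCode, "java.util.logging.level=FINE\n");
        System.out.println(p.getProperty("java.util.logging.level"));
    }
}
```

With an empty file the value set in code survives; with the override line present, the file's value is the one the application sees.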

Fetching and Retrieving Objects

At this point, we know how to transform objects into instances of DatabaseEntry and we know how to create instances of Environment and Database. The next step in our whirlwind tour involves storing and retrieving objects from the database. Fortunately, it's pretty simple. The general sequence for inserting and removing objects is:

  1. Create a transaction (if you are using transactions). You create transactions using the instance of Environment. Also note that you can have more than one simultaneous transaction.
  2. Create the instances of DatabaseEntry you will be using.
  3. Call the appropriate put or delete method on the instance of Database.
  4. Close the transaction (by committing or aborting it).

Here, for example, is the code from the forthcoming session management example. In order to add a session, we convert the session key (a string, in this example) and then the session itself into instances of DatabaseEntry, and then call the put method. This causes Berkeley DB to add the byte arrays to its b-tree.

    public void addSession(Session session) {
        addSession(session, null);	// null transaction is "autocommit"
    }

    private void addSession(Session session, Transaction transaction) {
        String sessionKey = session.getSessionKey();
        try {
            DatabaseEntry key = getDatabaseEntry(sessionKey);
            DatabaseEntry value = getDatabaseEntry(session);
            _mainDB.put(transaction, key, value);
        } catch (Exception e) {
            System.out.println("Database error");
            e.printStackTrace();
        }
    }

    private DatabaseEntry getDatabaseEntry(String string) throws Exception {
        return new DatabaseEntry(string.getBytes(DEFAULT_CHARSET));
    }

    private DatabaseEntry getDatabaseEntry(Session session) throws Exception {
        DatabaseEntry returnValue = new DatabaseEntry();
        _sessionBinding.objectToEntry(session, returnValue);
        return returnValue;
    }

Note: Unless you pass it a Comparator, Berkeley DB will use its own ordering on the keys. This isn't an issue for retrieval that's based on specific key values, but can be a problem when you want to retrieve a range of entries.

Adding and removing records is easy. Updating a record is a little harder: the way you update an entry in Berkeley DB is by removing the old version of an object, and then adding it back. And, of course, you want to do this in a transaction (so that the two operations succeed or fail together). In the following example, we create a transaction. (The two arguments to beginTransaction are the parent transaction and an instance of TransactionConfig respectively; most of the time you pass in null for these.) We then simply remove the object and reinsert it. Here's an example of code that performs an update.

    public void updateDB(Session session) {
        // Initialized to null so the catch block can tell whether a
        // transaction was actually started.
        Transaction transaction = null;
        try {
            transaction = _environment.beginTransaction(null, null);
            removeSession(session, transaction);
            session.touch();
            addSession(session, transaction);
            transaction.commitNoSync();
        } catch (Exception e) {
            System.out.println("Database error");
            e.printStackTrace();
            if (transaction != null) {
                try {
                    transaction.abort();
                } catch (Exception ignored) {}
            }
        }
    }

One interesting wrinkle is that this code uses the commitNoSync method. By default, Berkeley DB doesn't write to disk after every operation (i.e., what's stored on disk can be out of sync with the database in memory). In most cases, you have to explicitly tell Berkeley DB to synchronize to disk. When you are committing a transaction, you can opt for one of the following: commit, commitNoSync, or commitSync. commit lets Berkeley DB decide whether to synchronize or not (depending on the database and environment configuration); commitNoSync and commitSync let you decide explicitly whether or not the synchronization needs to occur.
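
If the difference between writing and synchronizing is unfamiliar, the following stdlib-only sketch shows the underlying idea on an ordinary file. It is not Berkeley DB code and says nothing about how Berkeley DB lays out its log; the class name, the append method, and its sync flag are all invented for the illustration.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

public class SyncSketch {
    // Appends a record to a log file. When sync is true, the method asks the
    // operating system to force the bytes to the storage device before it
    // returns; when false, the bytes may still be sitting in OS buffers.
    static void append(File log, String record, boolean sync) {
        try (FileOutputStream out = new FileOutputStream(log, true)) {
            out.write(record.getBytes(StandardCharsets.UTF_8));
            out.flush();
            if (sync) {
                out.getFD().sync();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes one unsynced and one synced record, then reads the file back.
    static String demo() {
        try {
            File log = File.createTempFile("sync-sketch", ".log");
            log.deleteOnExit();
            append(log, "a\n", false);
            append(log, "b\n", true);
            return new String(Files.readAllBytes(log.toPath()), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(demo());
    }
}
```

Both records are readable afterwards in a running system; the difference only shows up if the machine loses power before the operating system flushes its buffers. That is the tradeoff behind a no-sync commit: faster commits in exchange for possibly losing the most recent ones after a crash.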

Final Thoughts

In this article, I introduced Sleepycat's new Java edition of Berkeley DB. You've seen examples of where it is used, reasons for using it, and some basic details on how to do so.

In the next article, we'll do a deep dive into a real-world example. In particular, we'll walk through how to implement session management using Berkeley DB.

William Grosso is the vice president of engineering for Echopass.