Berkeley DB, Java Edition I: The Basics

by William Grosso
08/24/2004


Contents
Introduction
What is Berkeley DB?
Berkeley DB is Open Source
Using Berkeley DB
Converting Objects to Data
Creating a Database
Fetching and Retrieving Objects
Final Thoughts

Introduction

For years, programmers have been using Berkeley DB, the embedded database from Sleepycat Software, when they need a high-quality persistent and transactional datastore, but don't want or need the overhead associated with relational databases. The most famous example is probably Cisco's use of Berkeley DB in its routers, but there are many others. For example, the Subversion version control system stores its repository in Berkeley DB. And Motorola is embedding Berkeley DB in smart phones.

The original version of Berkeley DB is written in C. This means that, up until now, if a Java programmer wanted to use Berkeley DB, she would have to use some sort of translation layer (for example, JNI) to handle the communication between Java and Berkeley DB. (If you're wondering why there isn't a JDBC driver, the short answer is that Berkeley DB isn't a relational database. We'll cover this in more detail later on.) While that's acceptable in some scenarios, it's also very limiting, which is why Sleepycat recently released a Java version of Berkeley DB. Here's a quote from their press release:

As Java has become more important for building large systems, customers asked for a version of Berkeley DB that provides the same services, but that is written entirely in Java. They wanted the performance, reliability, and scalability of Berkeley DB without the complexity and cost of switching between Java and C execution at run time. In addition, they wanted to take advantage of abstractions and services available in Java that are not available under C.

In this article, I'm going to cover the basics of using the Java Edition of Berkeley DB. We'll cover the basics of embedded databases, discuss Berkeley DB, and talk about some of the basic things you need to know in order to use it.

Then, in part two, I will walk through most of the details of implementing a session management system, a real-world example. At the end of that article, we still won't have fully covered the Berkeley DB API, but we will have covered enough details to give you a good idea of how it's used, and what the design tradeoffs are.

Throughout this article, when referring to sessions or session management, I'm using the terminology of client-server and web applications. If you're unfamiliar with session management, please take a moment to browse the Wikipedia definition.

In addition, please be aware that, for brevity, I'm going to use "Berkeley DB" for what is usually referred to as "Berkeley DB Java Edition" or "Berkeley DB JE."

Acknowledgments

Martin O'Connor and several kind people from Sleepycat Software reviewed this article.

What is Berkeley DB?

When you mention the word "database" to a Java programmer, it's a good bet that her mind immediately slides to the world of relational databases, and relational database servers. It's a reasonable reaction for a Java programmer -- relational databases living inside of separate servers are the most common type of database out there, and they're the type of database most Java programmers interact with on a daily basis. However, while Berkeley DB databases are databases in the sense of the Wikipedia definition, they definitely aren't that type of database.

What type of database are they? Well, Berkeley DB databases have five main characteristics:

  • They are embedded databases. All of the database code runs inside of your application's process, just like any other library or framework you use in your projects.

    Like anything else, embedded databases have benefits and drawbacks. The major benefits are performance and deployment simplicity. If you don't have to open sockets and send data to a remote database process, you can get a dramatic boost in performance. And, obviously, not having to configure and maintain a separate database server can be a big win, as well.

    The drawbacks to embedding the database are more subtle. One is that the memory and performance profile of your application might change as Berkeley DB caches data or performs indexing operations. Another drawback is that it's harder to share a database across processes. Berkeley DB does allow multiple applications to open the files associated with a database, but only one process is allowed to write to the database (and the other processes, because they are caching data, might not see recent changes).

  • They are maps. A Berkeley DB database is a map from keys (which are indexed) to values. You perform lookups using the keys. The values are single objects, and you can't perform lookups based on value characteristics. Among other things, this means that ad-hoc queries aren't really possible without loading the entire database into memory and scanning through all of the data.
  • They are not accessed declaratively. A Berkeley DB database is accessed by an API, using Java method calls. Unlike some other databases that can be run in embedded mode (for example, HSQL), Berkeley DB does not use a declarative query syntax like SQL, and results are not returned in instances of ResultSet.

    Another consequence of this is that the data mapping code is more explicit. Unlike SQL-based systems, where the database might store data as NUMERIC(10,2) (and the JDBC driver would perform a mapping to Java types), a Berkeley DB database doesn't have a separate type system from your application.

  • They can be transactional. When you create a Berkeley DB database, you can specify whether you will need to support transactions across multiple operations.
  • They are persistent. Berkeley DB databases are automatically stored to the file system. Between transactions and persistence, Berkeley DB aims to provide enterprise-class data management.
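To make the map-like, non-declarative access model concrete, here is a minimal sketch that uses a plain java.util.TreeMap as a stand-in for a database. It is not the Berkeley DB API: the class KeyValueSketch, its methods, and the name-plus-age record layout are all invented for illustration. It only illustrates two of the points above: the application does its own mapping between objects and stored bytes, and a lookup by key is direct while a question about value characteristics has to scan and decode every record.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a key/value database. NOT the Berkeley DB API.
class KeyValueSketch {
    // Keys are indexed (kept sorted); values are opaque byte arrays.
    private final Map<String, byte[]> store = new TreeMap<>();

    // The application, not the database, decides how a record becomes bytes.
    // Assumed layout for this sketch: "<name>\n<age>" in UTF-8.
    static byte[] encode(String name, int age) {
        return (name + "\n" + age).getBytes(StandardCharsets.UTF_8);
    }

    static String decodeName(byte[] data) {
        return new String(data, StandardCharsets.UTF_8).split("\n")[0];
    }

    static int decodeAge(byte[] data) {
        return Integer.parseInt(new String(data, StandardCharsets.UTF_8).split("\n")[1]);
    }

    void put(String key, String name, int age) {
        store.put(key, encode(name, age));
    }

    // Lookup by key: goes straight through the index.
    String nameFor(String key) {
        byte[] data = store.get(key);
        return data == null ? null : decodeName(data);
    }

    // "Ad-hoc query" on a value characteristic: no index helps,
    // so every record is visited and decoded.
    List<String> keysWithAgeOver(int minAge) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : store.entrySet()) {
            if (decodeAge(e.getValue()) > minAge) {
                result.add(e.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        KeyValueSketch db = new KeyValueSketch();
        db.put("u2", "Grace", 45);
        db.put("u1", "Ada", 36);
        db.put("u3", "Linus", 28);
        System.out.println(db.nameFor("u1"));        // prints Ada
        System.out.println(db.keysWithAgeOver(30));  // prints [u1, u2]
    }
}
```

The scan in keysWithAgeOver is the cost the list above warns about: it grows with the size of the whole data set, whereas nameFor does not depend on how many unrelated records exist.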

Of course, when I go through this list, and try to explain Berkeley DB to people, they usually respond with comments like:

So, it's, like, what? A well-behaved HashMap?

To which I usually reply, "well, yeah, sort of. It's actually a little more like a well-behaved TreeMap, but you get the idea."
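The difference between the two analogies is key ordering: a HashMap gives no ordering guarantee, while a sorted structure keeps keys in order, which makes ordered iteration and key-range lookups cheap. The following sketch shows that with java.util.TreeMap only. It does not use the Berkeley DB API, and the class name, the helper valuesWithKeyPrefix, and the "session:" key scheme are assumptions made up for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrates what sorted keys buy you. Plain java.util, NOT the Berkeley DB API.
class SortedKeysSketch {
    static TreeMap<String, String> sample() {
        TreeMap<String, String> m = new TreeMap<>();
        m.put("session:0003", "carol");
        m.put("session:0001", "alice");
        m.put("user:0007", "dave");
        m.put("session:0002", "bob");
        return m;
    }

    // Returns the values of all keys starting with the given prefix, in key order.
    // The range is [prefix, prefix + '\uffff'), which covers every key that
    // extends the prefix with ordinary characters.
    static List<String> valuesWithKeyPrefix(TreeMap<String, String> map, String prefix) {
        SortedMap<String, String> range = map.subMap(prefix, prefix + Character.MAX_VALUE);
        return new ArrayList<>(range.values());
    }

    public static void main(String[] args) {
        TreeMap<String, String> m = sample();
        System.out.println(m.firstKey());                        // prints session:0001
        System.out.println(valuesWithKeyPrefix(m, "session:"));  // prints [alice, bob, carol]
    }
}
```

With a HashMap, the same prefix question would require visiting every entry, which is why the TreeMap comparison is the closer one.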

At its core, a Berkeley DB database is really just a b-tree with a persistent file format and nice transactional semantics thrown in. It makes sense to use it, or something like it, in situations when all of the following are true:

  • You need to store objects in an indexed collection.
  • You only need a small number of fairly simple ways to retrieve the objects.
  • You don't need ad-hoc queries.
  • You need transactional insert/remove/update operations on your storage.
  • You need persistence to disk.
  • You need a very small footprint, and for the database to be in-process.
  • You don't need to have multiple processes or applications access the database.

This is a fairly specialized set of requirements. But a large number of applications meet them. For example: an LDAP server (such as Apache's Eve Directory Server) is a perfect candidate. It stores a large amount of information, indexed in a few simple ways. It doesn't need ad-hoc queries, but does need persistence and transactions. Similarly, a naming service, such as a JNDI service provider, is a good example of an application that meets these requirements.

Session management code, where servers store state specific to a client for a short period of time (keyed by a session key), is also a good candidate for a Berkeley DB database. In fact, in the second part of this article I'll show you a fairly complete implementation of session management.

Berkeley DB is Open Source

In addition to selling commercial licenses, Sleepycat releases Berkeley DB under an open source license. The terms are fairly similar to the GNU General Public License; the key point is that, if you're distributing your software to third parties, you have to open source your application. Or, in their words:

The open source license for Berkeley DB permits you to use the software at no charge under the condition that if you use Berkeley DB in an application you redistribute, the complete source code for your application must be available and freely redistributable under reasonable conditions. If you do not want to release the source code for your application, you may purchase a license from Sleepycat Software. For pricing information, or if you have further questions on licensing, please contact us.

One very nice thing about open source applications is that the source code is available. This helps programmers in three ways. The first is obvious: if you're not sure about how to use a particular method ("Can I pass null in for that argument?"), you can look at the source code and figure it out. The second is almost as obvious: if you're running a debugger, or a memory profiler, it helps to have the source code. Even cooler, if you're using a tool such as AspectJ, having the source lets you construct finely targeted pointcuts.

The third advantage of an open source application is that you can reassure yourself about code quality. There's really no substitute for old-fashioned poking around when you want to do a quick assessment of library quality. I'm happy to report that the Sleepycat code looks quite nice. I ran both Excelsior's FlawDetector and FindBugs against the .jars. FlawDetector didn't see any errors. And, while FindBugs found quite a few things to complain about, most of them didn't seem too serious. Figure 1 shows a typical issue: inconsistent synchronization is a sign that the code wasn't factored properly, and that the locking is a bit trickier than it needed to be. I don't know that there's a bug lurking here, but I'd be happier if FindBugs didn't report this one.

Figure 1. Problems with inconsistent synchronization
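For readers who have not met this warning before, "inconsistent synchronization" generally means that a field is guarded by a lock on some code paths but touched without the lock on others. The sketch below is a generic illustration of that pattern and its usual fix. The HitCounter class is hypothetical and is not taken from the Sleepycat code base, so it says nothing about what FindBugs actually flagged there.

```java
// Generic illustration of inconsistent synchronization. Hypothetical class,
// not code from Berkeley DB.
class HitCounter {
    private int hits;

    // Writes always happen while holding the lock on this object.
    synchronized void record() {
        hits++;
    }

    // Inconsistent: the same field is read without the lock, so a caller
    // on another thread is not guaranteed to see the latest value.
    int unsafeRead() {
        return hits;
    }

    // Consistent: reads use the same lock as writes.
    synchronized int read() {
        return hits;
    }

    // Starts the given number of threads, each recording perThread hits,
    // waits for them all, and returns the final count.
    static int runWorkers(int threads, int perThread) {
        HitCounter counter = new HitCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.record();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.read();
    }

    public static void main(String[] args) {
        System.out.println(runWorkers(4, 1000)); // prints 4000
    }
}
```

Whether an unsynchronized read like unsafeRead is an actual bug depends on how the value is used, which is why a report of this kind calls for inspection rather than automatic alarm.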

On the other hand, Figure 2 shows a potentially serious issue. Doublechecked locking doesn't work in Java, and it looks like version 1.5 of Berkeley DB uses doublechecked locking in a couple of places.

Figure 2. Problems with doublechecked locking
Gregory Burd, the product manager for Sleepycat, has told me that all of the FindBugs issues will be resolved in the next release of Berkeley DB (currently scheduled for early fall) and that FindBugs is now a permanent part of their release process. I was impressed both with their willingness to fix these bugs, and with the way they quickly moved to incorporate FindBugs into their build process.
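As background on the doublechecked locking problem mentioned above, here is a sketch of the classic idiom next to one well-known safe alternative, the initialization-on-demand holder idiom. The Registry class is hypothetical and is not code from Berkeley DB. The classic form is unsafe when the shared field is not volatile, because another thread can observe a non-null reference before the object's construction is visible to it. The holder form relies on the guarantee that class initialization is thread-safe.

```java
// Hypothetical singleton used only to illustrate the idioms.
class Registry {
    private static Registry brokenInstance; // note: not volatile

    private Registry() {
    }

    // Classic doublechecked locking. Shown for illustration only: without
    // a volatile field, the unsynchronized first check may see a reference
    // to an object whose construction is not yet visible to this thread.
    static Registry brokenGet() {
        if (brokenInstance == null) {              // first check, no lock
            synchronized (Registry.class) {
                if (brokenInstance == null) {      // second check, under lock
                    brokenInstance = new Registry();
                }
            }
        }
        return brokenInstance;
    }

    // Safe alternative: the nested class is initialized lazily, on first
    // use, and the JVM guarantees that initialization runs exactly once.
    private static class Holder {
        static final Registry INSTANCE = new Registry();
    }

    static Registry get() {
        return Holder.INSTANCE;
    }

    public static void main(String[] args) {
        System.out.println(Registry.get() == Registry.get()); // prints true
    }
}
```

A plain synchronized accessor is another safe option when the cost of taking the lock on every call is acceptable.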
