
Building a Better Brain, Part 1: The Protocol

March 16, 2004

Contents
Define the Purpose
Model the Data Clearly
Download the Data
Searching
Searching Only Updates
Updating Entries
Render the Data
Scalability
Documentation for Implementors
Summary

I have a problem. Two problems, actually. Maybe you can help me.

Problem One: I need to remember lots of small bits of arcane technical
knowledge. I keep a text file and wish it was easier to search, copy, store,
transport, etc.

Problem Two: I search the Web a lot and often get the wrong results
because what I'm looking for is very specific. Sometimes what I want isn't on
the Web. It's in newsgroups, online documentation, forums, or just not on the net
anywhere. What I'd really like to do is search through someone else's brain, or
at least the part they'd let me access. I'd even pay a small fee to do it.

Solution: solve both problems at once. Create a remotely searchable and
cacheable database. To be nice to use, and widely deployed, it must be efficient and super-easy to implement: the searchable equivalent of RSS.

This article is the first in a two-part series. We will explore designing a
simple but robust web service protocol called BrainFeed, considering alternatives and balancing pros and cons. Throughout the process, we will stay
focused on making the protocol simple and reuse existing (preferably open)
technologies wherever possible. The second article will build on top of the
protocol to create an advanced thick client for searching right from the
desktop.

I've used lots of bad web service protocols. SOAP and even XML-RPC are often
overkill. They are complicated to implement and obscure the real problem you are
trying to solve. The most successful web service I've seen (apart from web pages
themselves) is RSS. Why? Because it's simple. It does only one thing, and it does
it well. It was built on top of a stack of other open and widely deployed
technologies. For intranets and extranets, the other web service specifications
are quite useful, because one end (you) has some sort of a formal relationship
with the other end (your customers, partners, or other departments). They can
afford the costs of being highly structured. For widely deployed web services
that go over the Internet, where you have either an informal or at least
low-overhead relationship with the other end, we need something simpler: an
RSS-level web service. We will use RSS as the inspiration for BrainFeed.

Following in RSS' footsteps, our web service will have to do the following:

  • Define its purpose.
  • Model the data clearly.
  • Download the data.
  • Search the data.
  • Update the data.
  • Finally, render the data.

all while:

  • Reusing existing technology wherever possible.
  • Being adaptable to many platforms and languages.
  • Being simple and adequately documented for others to implement.

If we had tried this 10 years ago, it probably wouldn't have worked. We would
have had to invent too much from scratch. Today, however, we are blessed to have
open specs with free implementations all over the place. It doesn't matter what
language you program in, you can probably find an XML parser and an HTML
renderer of some sort. XHTML is a clean, semantic document language and CSS2 can
support almost any style we want. And the best part is someone else has already
written almost all of the pieces. In the Java universe, we are especially blessed
to have XML parsers and HTTP access built right into the platform. We just have
to put it together! Welcome to the Lego school of software design.

Define the Purpose

First things first: what does our web service do? A technical definition
would be: a network-accessible service for searching and downloading a
fairly flat data repository of small documents. So what does that mean?
Well, it's on the network, so it's remote. That means we have to deal with
network reliability issues, encryption, and authentication. The next part is that we
are searching through the database, so we need to specify search semantics,
keywords, and a query language. We are downloading the documents, so we need to
specify encoding and formatting. Finally, we are rendering the documents to the
screen, so we need hints on how they should be presented to the end user.

Everything we are talking about storing in this database is small snippets of
information: the syntax of an SSH command, some sample code to create a window,
or Javadocs of the Robot class. We'd also like to update and add to the
database, so let's drop that in there, too. So our final definition adds up to
this: "A protocol for searching, downloading, and updating small documents from a
network-accessible service."

Model the Data Clearly

Now that we have our definition of what the service should do, how should it
be structured? Since we are talking about small documents, we should probably use
the lingua franca of the networked document world: XML. This means we can bring
all of our XML expertise and tools to bear on the project. In fact we can
structure the entire database as a single logical XML document. Note the word
logical here. It doesn't actually have to be stored as an XML document on
either end. In fact, for any sufficiently large dataset, it probably won't be. But
by specifying that it's logically an XML document we get a whole lot of
useful semantics thrown in for free. We can use IDs and know that they will be
unique. We get an infinitely nestable structure. We get well-defined white-space
handling, structure validation, character encoding, and all of the other goodies
that come with XML. All for free! I'm sure glad we live in the 21st century,
unlike those encodingless savages back in the late 20th.

So what does it look like? I'm a visual person, so I like examples. And we
have no page limits on the Web, so let's start off with a simple sample.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE brain SYSTEM "../brainfeed.dtd">
<brain searchable="true" language="blah">
    <entry id="1">
        <content>
            <p>This is an example</p>
            <pre><code>This is the
               code for the example;
            </code></pre>
            <img src="images/output.png" alt="example output" />
            <p>and that's how you do it!</p>
        </content>
    </entry>
    <entry id="b">
        <content>
            <p>Here's how you do this:</p>
            <code>cmdline --args</code>
            <p>and that's how you do it!</p>
        </content>
    </entry>
</brain>

There's our logical XML with the usual headers and doctype at the top. A
single <brain> element encloses multiple entry elements, each of which has an ID
to make it unique. The entry has a <content> element, which contains valid XHTML
strict markup. XHTML strict is important because it is extremely well defined and
is becoming more and more supported by the HTML renderers of the world (IE, Mozilla,
Safari, etc.).
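
Just to make the structure concrete, here is a rough sketch of how a Java client might walk this logical document using the DOM parser built into the platform. It assumes the referenced DTD is reachable, and the class name is only illustrative, not part of the spec.

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class BrainReader {
    // Parse a brain file and print the id of every entry.
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new java.io.File("joshua.xml"));
        NodeList entries = doc.getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            System.out.println("found entry: " + entry.getAttribute("id"));
        }
    }
}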

Download the Data

Following our rule of reusing existing technology wherever possible, we will
keep things simple with an HTTP GET. Now we can reuse all of the existing HTTP
code out there and we can use web servers and CGI scripts for our server, instead
of designing custom request code. Java implements HTTP right in the java.net
package (new java.net.URL("http://www.java.net/").openStream()). Using
HTTP also means we get to ride on port 80 and go through firewalls.

Now, why use a GET instead of a POST? Just because it's simpler. A GET can
describe everything we are likely to want. Since this is mainly one-way
communication (send a few bits for a query and get a lot of bits back) we don't
need the complexity of POSTing. GET describes a server, a location on the
server, and allows the document on the server to be a literal document on disk
instead of a program. This gives us our next rule, make it simple to implement.
This also fits the inherent semantics of GET, which is to get a document. POST
and PUT are for editing.

Example:

http://www.server.com:80/brainfiles/joshua.xml

This could be an actual file on disk served by a stock Apache web server, and
it would work just fine.
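
For the curious, here is roughly what that one-line download looks like fleshed out into a tiny Java program. The URL is the hypothetical one from the example above.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class BrainDownload {
    // Fetch a brain file with a plain HTTP GET and dump it to standard out.
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.server.com:80/brainfiles/joshua.xml");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), "UTF-8"));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);
        }
        in.close();
    }
}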

Searching

Now that we can get a document, how can we search through it? This is more
complicated, since our choice of a query language somewhat dictates the structure
of our database and how users interact with our service. Timothy Bray wrote an
excellent series of articles about searching. He recommends a simple GET-based API with keywords ANDed together. AND searching is intuitive for users and pretty easy to implement. It also has the side benefit of being very easy to describe in our
protocol.

I've always envisioned an advanced version of this working as a filter on
every keystroke, much like the search in iTunes, so a keyword-based API will probably work well. I personally type ANDed keywords into Google to find things,
so something similar works here.

We can do full text searching, but the search engine probably won't know that
part of the document is more important than other parts. A great way to deal
with this is to simply mark which parts are important. How does the search engine
know what each document is about? Historically, documents have had titles to
specify what they are about, so we'll add something like that here:

<entry id="d">
    <keyword>Java</keyword>
    <keyword>JNI</keyword>
    <title>JNI: What is JNI</title>

    <content>
        <p>JNI stands for Java Native Interface.
        It's a way of......</p>
    </content>

</entry>

To keep things simple, we will say that the search should be case-insensitive
and that white space and punctuation should be ignored. Matching on case or
punctuation is rarely useful, and usually causes valid results to be skipped. We will also
specify searching in order of keywords, titles, and body text, but actual search
is left up to the server to implement in whatever way it feels will return the
best results. A Java server will probably use Lucene, which has a wide array of
algorithmic tweaks available.
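
To illustrate the ANDed, case-insensitive semantics (not the server's actual algorithm; a real implementation would hand this off to something like Lucene), here is a naive matching sketch. The Entry class and its fields are hypothetical stand-ins for however the server stores its data, and ranking by keyword, title, and body is left out.

// A minimal sketch of the ANDed, case-insensitive matching described above.
public class NaiveMatcher {
    static class Entry {
        String[] keywords;
        String title;
        String contentText; // the entry's body with markup stripped
    }

    // Returns true only if every query term appears somewhere in the entry.
    public static boolean matches(Entry entry, String[] terms) {
        StringBuffer all = new StringBuffer();
        for (int i = 0; i < entry.keywords.length; i++) {
            all.append(entry.keywords[i]).append(' ');
        }
        all.append(entry.title).append(' ').append(entry.contentText);
        String haystack = all.toString().toLowerCase();
        for (int i = 0; i < terms.length; i++) {
            if (haystack.indexOf(terms[i].toLowerCase()) < 0) {
                return false; // one missing term fails the AND
            }
        }
        return true;
    }
}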

To add searching to the spec, since downloading uses an HTTP GET
request, we'll just extend that with some query parameters:

http://server.com/brainfiles/joshua.xml?query=java&query=interface

This will search for entries matching both "java" and "interface."
HTTP allows us to specify a parameter name more than once (which is how
groups of checkboxes are often submitted), so we take advantage of this
by sending multiple query values.

If the client program wants to search only particular fields, it can specify them
explicitly with the keyword, title, and content parameters:

http://server.com/brainfiles/joshua.xml?keyword=jni&keyword=java&title=sql
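
On the client side, building these query URLs is mostly a matter of escaping the terms. A minimal sketch, assuming the feed URL from the earlier examples:

import java.net.URLEncoder;

public class QueryBuilder {
    // Build a search URL with repeated query parameters, escaping each term.
    public static String buildSearchUrl(String baseUrl, String[] terms)
            throws Exception {
        StringBuffer url = new StringBuffer(baseUrl);
        for (int i = 0; i < terms.length; i++) {
            url.append(i == 0 ? '?' : '&');
            url.append("query=").append(URLEncoder.encode(terms[i], "UTF-8"));
        }
        return url.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildSearchUrl(
                "http://server.com/brainfiles/joshua.xml",
                new String[] { "java", "interface" }));
        // prints .../joshua.xml?query=java&query=interface
    }
}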

Searching Only Updates

Now suppose we have a client that would like to cache the dataset and just
receive updates when something changes. After all, if this is a source that you
use all of the time, you'd like to save the dataset for faster access, plus you
don't want to hog bandwidth all of the time.

First we need to be able to specify what to download. Since a caching reader
will know when it last checked for entries, it can just ask for whatever is new
since a particular timestamp. We can specify this with a modified-after parameter,
http://server.com/brainfile/joshua.xml?modified-after=timestamp, and then
add a modified element to each entry:

<entry>
    <modified>timestamp</modified>
</entry>

Simple enough. This will then return all entries modified at or after the
given timestamp. We just have to decide on a timestamp format.

This one is a bit tricky. We could go grab an existing format like RFC 822: "Sat, 07 Sep
2002 00:00:01 GMT," or we could go for a completely numeric Unix timestamp like
this: "1027568712." Both of these are bad, though. The first one uses abbreviations
for the month and day, which won't internationalize well and can vary within a
nation, requiring us to add a language marker just for the date. Plus, the day of
week isn't needed just to tell when something was modified. On the other hand, it
is human-readable, which is a plus. The second format, a Unix timestamp, is not
human-readable at all, and it's Unix-specific. Not all platforms have a way of
calculating the time based on milliseconds from a particular date, and the
start date varies. Not to mention the fact that there is no timezone marker, so
we could be off by as much as 24 hours. To satisfy our needs, we must clearly
specify an absolute time in a format that is at least somewhat human-readable,
doesn't have language issues, can be parsed fairly easily, and preferably,
doesn't need HTTP escaping. I propose the following:

format: MM/dd/yyyy-HH:mm:ss-z
ex:     08/31/2004-14:35:00-GMT

This encoding uniquely specifies time in a format somewhat familiar to humans
without using any language-specific terms or punctuation that would
need to be escaped. The example date is my next birthday at 2:35 in the
afternoon, GMT. We can put this in a query like so:

http://server.com/brains/joshua.xml?modified-after=08/31/2004-14:35:00-GMT

Now we can search for all entries modified after any particular date. For
completeness, we will add a modified-before parameter as well, allowing the
client to specify any range of dates, with the range being open if only one of
them is specified.
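
Producing this timestamp is a one-liner on most platforms. Here is a Java sketch using SimpleDateFormat (note the pattern letters: MM for month, dd for day, HH for 24-hour time):

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class ModifiedSince {
    // Format a timestamp in the spec's MM/dd/yyyy-HH:mm:ss-z form, always in
    // GMT, and append it as a modified-after parameter.
    public static void main(String[] args) {
        SimpleDateFormat fmt = new SimpleDateFormat("MM/dd/yyyy-HH:mm:ss-z");
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        String stamp = fmt.format(new Date());
        System.out.println(
            "http://server.com/brains/joshua.xml?modified-after=" + stamp);
    }
}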

A nice side benefit of searching for updates is that we can cache the data on
the client side and do our own searching, in a way that the server may never have
thought of, or that would be impossible to do over a network connection,
like the aforementioned incremental searching.

Updating Entries

Now that we can search for anything we want, let's make our system a little
more two-way. How do we update entries? Well, we already have a means of
representing a document, so let's just reuse it. Instead of downloading a logical
XML document, we will upload it. By doing a POST to the same URL as the one we
downloaded from, pushing up an XML document as the complete request body (rather
than as a form parameter), we can push our changes up to the server with a minimum of fuss. We
also need to specify which entry we would like to update. Fortunately, we have
already specified that each entry must have an ID attribute that is unique across
the entire XML document. So we declare that an existing entry with that ID should
be replaced with the new one. This also means we can upload multiple entry
updates in the same document. It's true that this means we are uploading the
entire entry, even if there was only a spelling change, but in general this will
not be a problem, because we are talking about small entries. Only the dataset
itself is large. The simplicity of this scheme outweighs the benefits we would
get from diffing the individual entries. It also avoids the corruption issues of
diff and merge synchronization.

To add new entries, we upload the new entry without an ID. The server will add
it to the XML, assign it a new ID (based on whatever scheme the server deems
appropriate), and return the entries in a new document with the IDs added.
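
For the upload itself, a plain HTTP POST with the XML as the raw request body is all we need. A rough Java sketch, with the post URL and document passed in by the caller:

import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.HttpURLConnection;
import java.net.URL;

public class BrainPost {
    // POST a brain document containing updated entries as the raw request body.
    public static void postEntries(String postUrl, String brainXml)
            throws Exception {
        HttpURLConnection conn = (HttpURLConnection)
                new URL(postUrl).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
        Writer out = new OutputStreamWriter(conn.getOutputStream(), "UTF-8");
        out.write(brainXml);
        out.close();
        // The server echoes back the stored entries, with ids assigned to new ones.
        System.out.println("server said: " + conn.getResponseCode());
    }
}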

Uploading data will often be a restricted action. Sometimes, even downloading
will need security: some people might be storing sensitive data, such as an
internal, company-only knowledge base. Once again, we will solve this issue with
existing standards. HTTPS over SSL is the standard for the web world, and most
XML-savvy platforms support it. J2SE 1.4 supports it without needing any
external libraries.

For authentication, we will do the same, using HTTP Auth rather than any proprietary
system. This does introduce one complication, though. We are downloading and
uploading via the same URL, and HTTP Auth works by protecting a particular URL.
In an ideal world, we would tell the server to use HTTP Auth only for POSTing and
not for GETing (or to implement whatever other restrictions we might want).
However, most web servers attach the authentication to a particular URL or
directory, instead of the pair of a URL and HTTP request type. In the interest
of simplicity, we can say that authenticated uploading will go to a different URL
than downloading. This makes the implementation simple, but introduces a
usability problem. Now we have two URLs to remember, not one. If you've ever
tried to explain the difference between IMAP and SMTP servers to your mother, you
know what I'm talking about. It would be better if we could tell someone, "The web address to your brainfile is this," and then let the software autodetect how to post. Autodetection might also be useful for other things, too.
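
On the Java side, HTTP Auth support is already in java.net; a client just registers its credentials once and lets the platform answer the server's challenges. A minimal sketch:

import java.net.Authenticator;
import java.net.PasswordAuthentication;

public class BrainAuth {
    // Register credentials once; java.net then answers HTTP Auth challenges
    // for any protected BrainFeed URL we GET or POST to.
    public static void register(final String user, final String password) {
        Authenticator.setDefault(new Authenticator() {
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(user, password.toCharArray());
            }
        });
    }
}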

To implement autodetection of configuration we need to add some metadata to
the XML file. By adding a meta tag, we can store whatever we need.
Clients are told to only use what they understand and ignore the rest. Here I've
added some simple bits of metadata along with the URL to use for posting.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE brain SYSTEM "../brainfeed.dtd">
<brain searchable="true" language="blah">
    <meta>
        <uri>http://brain.joshy.org/</uri>
        <author>Joshua Marinacci</author>
        <author-email>joshua@marinacci.org</author-email>
        <description>This is Joshua's Brain</description>
        <post-url>
            http://server.com/brainfile/upload/joshua.xml
        </post-url>
    </meta>

As an extra feature to help with autodetection when first adding a BrainFeed
to a client, we will say that doing a GET with the parameter
meta=only will return just the metadata. meta=true
and meta=false will do a normal request and either include or omit
the metadata; true is the default.
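
A client's autodetection step might then look something like this: fetch the metadata alone and pull out the post-url if one is present. Again, just a sketch using the DOM parser:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class MetaDetect {
    // Ask the feed for just its metadata and pull out the post-url, if any.
    public static String findPostUrl(String feedUrl) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(feedUrl + "?meta=only");
        NodeList nodes = doc.getElementsByTagName("post-url");
        if (nodes.getLength() == 0) {
            return null; // feed doesn't support posting
        }
        return nodes.item(0).getFirstChild().getNodeValue().trim();
    }
}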

Render the Data

To render the data on screen, we need to choose an appropriate font for the
character set. This means we need to know the encoding and language of the
BrainFeed. All XML files can set the encoding at the top, and most XML parsers will
take care of converting the files' encoding into the native platform's preferred
encoding. For Java, the parsers are required to take care of this automatically
and convert all text into Unicode. We can reasonably expect Win32, Cocoa,
GTK, and other platforms to provide similar capabilities.

For example:

<?xml version="1.0" encoding="UTF-8"?>

The language can be specified with a lang attribute on the
brain element. We will also add an optional lang attribute to each entry,
in case it's a multilingual feed.

<?xml version="1.0" encoding="UTF-8"?>
<brain lang="en-us">
  <entry id="1" lang="en-us">
    ...

The actual display of an entry is left up to the client. XHTML specifies
everything semantically with no style. To allow the author of the BrainFeed to
suggest some style, we can attach CSS to each entry, either through a
style attribute pointing to an external stylesheet or through an inline
<style> element. If this is a web-based
client, the CSS will be passed to the browser. In the case of a thick client,
such as a Swing app rendering HTML in a JEditorPane, the CSS would be parsed and
pulled into the display.

<entry id="c" style="mystyle.css">
    <style type="text/css">
        p.cool { color: blue; }
    </style>
    <content>
        <p class="cool">my cool entry</p>
    </content>
</entry>
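
For the thick-client case, here is a rough Swing sketch of what that "pulled into the display" step could look like, assuming the entry's XHTML and CSS rule have already been extracted as strings. (The HTMLEditorKit's stylesheet is shared across instances, which is fine for a sketch but something a real client would handle more carefully.)

import javax.swing.JEditorPane;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.text.html.HTMLEditorKit;

public class EntryViewer {
    // Render one entry's XHTML in a Swing editor pane, feeding the entry's
    // CSS rule into the editor kit's stylesheet first.
    public static void show(String entryHtml, String cssRule) {
        HTMLEditorKit kit = new HTMLEditorKit();
        kit.getStyleSheet().addRule(cssRule);
        JEditorPane pane = new JEditorPane();
        pane.setEditable(false);
        pane.setEditorKit(kit);
        pane.setText(entryHtml);
        JFrame frame = new JFrame("BrainFeed entry");
        frame.getContentPane().add(new JScrollPane(pane));
        frame.setSize(400, 300);
        frame.setVisible(true);
    }

    public static void main(String[] args) {
        show("<p class=\"cool\">my cool entry</p>", "p.cool { color: blue; }");
    }
}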

Scalability

Creating a specification that can be implemented on a variety of devices
and platforms, each with different needs and resources, can be quite a
challenge. Scalability is a hard problem. However, virtually every technology we
are using was designed with scalability in mind. Again the beauty of open
standards shines through. All we need to do is specify which parts of our spec
are optional and how. The underlying tech of HTTP, XHTML, and CSS will take care
of most of the rest.

Riffing off of CSS we will define our specification in different levels,
each one building on top of the previous one. We will group the different features
by estimated difficulty of implementation and need for the feature.

  • Level 1: Downloading only. No searching, updating, or pulling down changes
    incrementally.
    This makes our service not as useful as it might otherwise be,
    but it means anyone can publish by just dropping a flat file onto their server.
    No scripts or CGI programs required.

  • Level 2: Searching by query, keyword, title, body, and timestamp. This
    doesn't require the infrastructure for authentication and posting, but allows
    everything else.

  • Level 3: Posting

  • Levels 4 and up: Reserved for future use

Again, for autodetection, we can add another meta tag for the level:

<brain>
    <meta>
        <level>1</level>
    </meta>

Documentation for Implementors

That's it. We have completely described a web service, or at least its
protocol. However, if we want our service to be popular, we have one thing
left to do: documentation. The documentation of an XML web-based service really needs
three parts. First, we need a computer-readable spec. XML conveniently provides
this in the form of DTDs. Even though DTDs are intended for XML parsers, they are
often the documentation of last resort for client implementers, so it helps to
format them nicely with good comments. The DTD declares, unambiguously, what goes
where. If you document the DTD well, you will have fewer emails from
developers screaming, "What the @&#$ does this tag do?!"

Here is a portion of the BrainFeed DTD:

<!-- The root level element.
        There is only one of these. -->
<!ELEMENT brain (meta?, entry*)>

<!-- ======== META =========== -->
<!--  This contains the meta information about the feed.
    It is optional. If it exists then it should go
    at the top to make parsing easier. -->
<!ELEMENT meta (uri, author, author-email, description, post-url?, level?)>


<!-- These all go inside the meta.
        They are really just descriptive -->
<!ELEMENT uri (#PCDATA)>
<!ELEMENT author (#PCDATA)>
<!ELEMENT author-email (#PCDATA)>
<!ELEMENT description (#PCDATA)>
<!ELEMENT post-url (#PCDATA)>
<!ELEMENT level (#PCDATA)>

One tricky thing to note here is that we are using XHTML as part of our
definition. Now, it's all fine and well to say that we are using XHTML, but if we
expect brain files to properly validate, then we need to have XHTML
embedded in our DTD. What we want is a definition for content that looks
something like this:

<!ELEMENT content (#PCDATA | div | p | pre | h1 | h2 | h3 | .....

If we took this approach, then we would also have to redefine div, p, and all
of the other tags in XHTML. We could also copy and paste the full XHTML DTD into
ours. Either approach would be a lot of work and would require hacking around
in someone else's DTD. Fortunately, the designers of XHTML thought of this and
designed a modular version. All we have to do is import their DTD and turn off
the parts we don't want.

<!-- ignore the meta and title elements -->
<!ENTITY % title.element  "IGNORE" >
<!ENTITY % xhtml-meta.module "IGNORE" >

<!-- XHTML include -->
<!ENTITY % xhtml11.mod
     PUBLIC "-//W3C//DTD XHTML 1.1//EN"
            "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" >
%xhtml11.mod;

<!-- define content to contain mixed
        block level content -->
<!ELEMENT content (%Block.mix;)*>

The first two ENTITY lines turn off the title element (since we don't want it
to clash with our own title element) and the meta module, which contains the
meta tag and all of its attribute definitions. The next four lines import the
actual XHTML DTD. First it defines the entity, xhtml11.mod, which points to the
external DTD. We give it both the PUBLIC URI (the official W3C name for XHTML)
and a SYSTEM URL (a location from which to download the DTD). Then it uses the entity on
the next line (%xhtml11.mod;) to include all of XHTML's definitions in our DTD.
Finally, the last line actually declares the content element as containing
%Block.mix;, which the XHTML DTD defines as the contents of a body element. It's
shorthand for div, p, h1, h2, h3, etc. And with that, we have imported and customized XHTML into our BrainFeed DTD.

The next step is to create a human-readable document; one that takes the
reader through each part of the protocol, explaining its purpose and usage. It
should also give a 30,000-foot view of the system to help implementors get the
general idea. Remember that more people will be attracted to your project if they
get a good, quick description up front. For the BrainFeed project, I have decided
to write an article about it for a notable technical web site, but this is not
always required. :)

The last step is to create a sample implementation. It should be as simple as
possible. Don't worry about speed or efficiency; just make it clear and
comprehensive. If you can release the code as open source, then it's even better,
since this will give other developers a base upon which to build their versions.

For BrainFeed, I have created a simple client and server implementation. The
client is a pair of JSPs, one for searching and one for posting. They do simple
queries to a BrainFeed URL and return the results. The server is implemented
with a simple servlet. The servlet responds to GETs and POSTs, saving the results
into an on-disk XML file. Production-quality implementations would never do this,
of course; they would store the entries in a database. But for our sample
implementation, this will more than suffice. You can try out the sample client here. This contains a few entries about Java development. Try searching on "Java," "JNI," and "SQL."
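
To give a flavor of the server side, here is a heavily condensed sketch of what the servlet's GET handling might look like. The loadEntries and matches helpers are hypothetical, and the real sample implementation does more (POST handling, metadata, timestamps).

import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class BrainServlet extends HttpServlet {
    // Respond to a search GET by echoing back only the matching entries.
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws java.io.IOException {
        String[] terms = request.getParameterValues("query");
        response.setContentType("text/xml; charset=UTF-8");
        PrintWriter out = response.getWriter();
        out.println("<?xml version=\"1.0\" encoding=\"UTF-8\"?>");
        out.println("<brain searchable=\"true\">");
        // loadEntries() and matches() are hypothetical helpers that read the
        // on-disk XML file and apply the ANDed, case-insensitive search,
        // writing each matching entry back out inside the brain element.
        out.println("</brain>");
    }
}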

Summary

Now that we've created our protocol, and documented it enough for others to
build their own versions of it, what can we do with it? The first thing is, of
course, a personal database of hand-entered snippets, but as the implementations
mature and become popular we can imagine many other uses:

  • Search through the brainfeeds of famous Java developers.
  • Search through all of your old emails.
  • Search through Javadocs and Java forums from within your IDE.
  • Easy searchable FAQs for web sites.
  • Search through weblog archives.
  • An open helpfile format.
  • A searchable collection of cheatsheets and quickrefs.
  • Quick lookup from a dictionary and thesaurus.
  • Search through your book collection from The Gutenberg Project.
  • Search through an intranet knowledgebase.
  • Search through Google and Yahoo from your own program.

The next logical step is to design a new interface. Instead of a web-based
thin client, we could have a thick client that does real-time searching and
caches the results. In the second article of this series, we will build a Swing
application with local caching, keystroke incremental searching, and an embedded
HTML renderer.

Josh Marinacci first tried Java in 1995 at the request of his favorite TA and has never looked back.