Searching Only Updates
Now suppose we have a client that would like to cache the dataset and just
receive updates when something changes. After all, if this is a source that you
use all of the time, you'd like to save the dataset for faster access, plus you
don't want to hog bandwidth all of the time.
First we need to be able to specify what to download. Since a caching reader
will know when it last checked for entries, it can just ask for whatever is new
since a particular timestamp. We can specify this with a modified parameter,
http://server.com/brainfile/joshua.xml?modified=timestamp, and then
add a modified element to each entry:
<entry>
<modified>timestamp</modified>
</entry>
Simple enough. This will then return all entries modified on or after the
given timestamp. All that remains is to decide how to write the timestamp.
This one is a bit tricky. We could go grab an existing format like RFC 822: "Sat, 07 Sep
2002 00:00:01 GMT," or we could go for a completely numeric Unix timestamp like
this: "1027568712." Both of these are bad, though. The first one uses abbreviations
for the month and day, which won't internationalize well and can vary within a
nation, requiring us to add a language marker just for the date. Plus, the day of
week isn't needed just to tell when something was modified. On the other hand, it
is human-readable, which is a plus. The second format, a Unix timestamp, is not
human-readable at all, and it's Unix-specific. Not all platforms have a way of
calculating the time based on seconds elapsed since a particular date, and the
start date varies. Not to mention the fact that there is no timezone marker, so
we could be off by as much as 24 hours. To satisfy our needs, we must clearly
specify an absolute time in a format that is at least somewhat human-readable,
doesn't have language issues, can be parsed fairly easily, and preferably,
doesn't need HTTP escaping. I propose the following:
format: MM/dd/yyyy-HH:mm:ss-z
ex: 08/31/2004-14:35:00-GMT
This encoding uniquely specifies time in a format somewhat familiar to humans
without using any language-specific terms or punctuation that would
need to be escaped. The example date is my next birthday at 2:35 in the
afternoon, GMT. We can put this in a query like so:
http://server.com/brains/joshua.xml?modified-after=08/31/2004-14:35:00-GMT
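As a rough illustration, a Java client could produce and parse this format along the following lines. This is only a sketch: the class and method names are hypothetical, the pattern follows the example's month-first order and 24-hour clock, and the zone is hard-wired to GMT rather than accepting arbitrary zone names.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper for the proposed timestamp format (GMT only). */
final class BrainFeedTimestamp {
    // Month first, 24-hour clock, matching 08/31/2004-14:35:00-GMT.
    // Treating the zone as the fixed literal "GMT" is an assumption of this sketch.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("MM/dd/yyyy-HH:mm:ss-'GMT'");

    /** Renders an absolute point in time as a GMT timestamp string. */
    static String format(Instant instant) {
        return LocalDateTime.ofInstant(instant, ZoneOffset.UTC).format(FORMAT);
    }

    /** Parses a GMT timestamp string back into an absolute point in time. */
    static Instant parse(String text) {
        return LocalDateTime.parse(text, FORMAT).toInstant(ZoneOffset.UTC);
    }

    /** Builds the query URL for entries modified after the given instant. */
    static String modifiedAfterUrl(String feedUrl, Instant since) {
        return feedUrl + "?modified-after=" + format(since);
    }

    public static void main(String[] args) {
        Instant since = parse("08/31/2004-14:35:00-GMT");
        System.out.println(modifiedAfterUrl("http://server.com/brains/joshua.xml", since));
    }
}
```

Normalizing everything to GMT on the client keeps the parser trivial. A fuller implementation would need to decide which zone names it accepts.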
Now we can search for all entries modified after any particular date. For
completeness, we will add a modified-before parameter as well, allowing the
client to specify any range of dates, with the range being open if only one of
them is specified.
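On the server side, the range check might look something like the sketch below. The class name is hypothetical, entries are reduced to a map from ID to modification time, and treating both bounds as inclusive is an assumption, since the text does not pin that down. A null bound stands for an unspecified parameter.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical server-side filter for modified-after / modified-before. */
final class ModifiedRangeFilter {
    /**
     * Returns the IDs of entries whose modified time falls inside the range.
     * A null bound leaves that side of the range open. Both bounds are
     * treated as inclusive, which is an assumption of this sketch.
     */
    static List<String> select(Map<String, Instant> modifiedById,
                               Instant after, Instant before) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, Instant> e : modifiedById.entrySet()) {
            Instant modified = e.getValue();
            boolean lateEnough = after == null || !modified.isBefore(after);
            boolean earlyEnough = before == null || !modified.isAfter(before);
            if (lateEnough && earlyEnough) {
                ids.add(e.getKey());
            }
        }
        return ids;
    }
}
```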
A nice side benefit of searching for updates is that we can cache the data on
the client side and do our own searching in a way that the server may never have
thought of, or that would be impossible to do over a network connection,
like the aforementioned incremental searching.
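To make the idea concrete, here is a minimal sketch of a client-side cache that can be searched on every keystroke. The class name and the plain substring matching are assumptions for illustration. A real client would likely index titles, keywords, and bodies separately.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical in-memory cache of entries, keyed by entry ID. */
final class LocalBrainCache {
    private final Map<String, String> textById = new LinkedHashMap<>();

    /** Stores a new entry or overwrites a cached one received in an update. */
    void put(String id, String text) {
        textById.put(id, text);
    }

    /** Case-insensitive substring match, cheap enough to rerun per keystroke. */
    List<String> search(String typedSoFar) {
        String needle = typedSoFar.toLowerCase(Locale.ROOT);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : textById.entrySet()) {
            if (e.getValue().toLowerCase(Locale.ROOT).contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }
}
```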
Updating Entries
Now that we can search for anything we want, let's make our system a little
more two-way. How do we update entries? Well, we already have a means of
representing a document, so let's just reuse it. Instead of downloading a logical
XML document, we will upload it. By doing a POST to the same URL as the one we
downloaded, pushing up an XML document as the complete output (rather than as a
parameter) we can push our changes up to the server with a minimum of fuss. We
also need to specify which entry we would like to update. Fortunately, we have
already specified that each entry must have an ID attribute that is global to
the entire XML document. So we declare that an existing entry with that ID should
be replaced with the new one. This also means we can upload multiple entry
updates in the same document. It's true that this means we are uploading the
entire entry, even if there was only a spelling change, but in general this will
not be a problem, because we are talking about small entries. Only the dataset
itself is large. The simplicity of this scheme outweighs the benefits we would
get from diffing the individual entries. It also avoids the corruption issues of
diff and merge synchronization.
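A client-side upload could be sketched with the standard java.net.http API (Java 11 and later) as follows. The class name is hypothetical, and the Content-Type value is an assumption, since the protocol as described does not name one. The XML document is sent as the entire request body, not as a form parameter.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper that prepares an entry upload. */
final class EntryUpload {
    /** Builds a POST whose whole body is the XML document. */
    static HttpRequest buildPost(String feedUrl, String xmlDocument) {
        return HttpRequest.newBuilder(URI.create(feedUrl))
                // Assumed media type; the spec as described does not fix one.
                .header("Content-Type", "application/xml; charset=UTF-8")
                .POST(HttpRequest.BodyPublishers.ofString(xmlDocument))
                .build();
    }

    public static void main(String[] args) {
        String doc = "<brain><entry id=\"42\"><content>"
                + "<p>fixed spelling</p></content></entry></brain>";
        HttpRequest request = buildPost("http://server.com/brainfile/joshua.xml", doc);
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Actually sending the request would go through HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), which is left out here so the sketch runs without a server.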
To add new entries, we upload the new entry without an ID. The server will add
it to the XML, assign it a new ID (based on whatever scheme the server deems
appropriate), and return the entries in a new document with the IDs added.
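The replace-or-add rule can be sketched on the server side like this. The class is hypothetical, entries are reduced to strings, and the counter-based ID scheme is just one possibility, since the server is free to pick any scheme. An upload that carries an unknown ID is simply stored under that ID here, which is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical in-memory store implementing replace-or-add by ID. */
final class EntryStore {
    private final Map<String, String> contentById = new LinkedHashMap<>();
    private int nextId = 1; // assumed ID scheme: a simple counter

    /**
     * Replaces the entry with the given ID, or adds a new entry when id is
     * null. Returns the ID the entry is stored under.
     */
    String save(String id, String content) {
        if (id == null) {
            // Skip over any IDs that uploads have already claimed.
            do {
                id = String.valueOf(nextId++);
            } while (contentById.containsKey(id));
        }
        contentById.put(id, content);
        return id;
    }

    String get(String id) {
        return contentById.get(id);
    }

    int size() {
        return contentById.size();
    }
}
```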
Uploading data will often be a restricted action. Sometimes, even downloading
will need security. Some people might be storing sensitive data, such as
an internal company-only knowledge base. Once again, we will solve this issue with
existing standards. HTTPS using SSL is the standard for the web world, and most
XML-savvy platforms support it. J2SE 1.4 now supports it without needing any
external libraries.
For authentication, we will do the same, using HTTP Auth over any proprietary
system. This does introduce one complication, though. We are downloading and
uploading via the same URL, and HTTP Auth works by protecting a particular URL.
In an ideal world, we would tell the server to use HTTP Auth only for POSTing and
not for GETting (or to implement whatever other restrictions we might want).
However, most web servers attach the authentication to a particular URL or
directory, instead of the pair of a URL and HTTP request type. In the interest
of simplicity, we can say that authenticated uploading will go to a different URL
than downloading. This makes the implementation simple, but introduces a
usability problem. Now we have two URLs to remember, not one. If you've ever
tried to explain the difference between IMAP and SMTP servers to your mother, you
know what I'm talking about. It would be better if we could tell someone, "The web address to your brainfile is this," and then let the software autodetect how to post. Autodetection might be useful for other things, too.
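For the authentication step itself, a client might attach credentials to the upload request roughly as shown below. This sketch assumes the Basic scheme of HTTP authentication (the text only says "HTTP Auth") and a hypothetical class name. Basic credentials are merely encoded, not encrypted, so they should only travel over HTTPS.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper for HTTP Basic credentials (assumed scheme). */
final class BasicAuth {
    /** Returns the value to send in the Authorization request header. */
    static String header(String user, String password) {
        String pair = user + ":" + password;
        byte[] bytes = pair.getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) {
        System.out.println("Authorization: " + header("joshua", "secret"));
    }
}
```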
To implement autodetection of configuration we need to add some metadata to
the XML file. By adding a meta tag, we can store whatever we need.
Clients are told to only use what they understand and ignore the rest. Here I've
added some simple bits of metadata along with the URL to use for posting.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE brain SYSTEM "../brainfeed.dtd">
<brain searchable="true" language="blah">
<meta>
<uri>http://brain.joshy.org/</uri>
<author>Joshua Marinacci</author>
<author-email>joshua@marinacci.org</author-email>
<description>This is Joshua's Brain</description>
<post-url>
http://server.com/brainfile/upload/joshua.xml
</post-url>
</meta>
As an extra feature to help with autodetection when first adding a BrainFeed
to a client, we will say that doing a GET with the parameter
meta=only will return just the metadata. meta=true
and meta=false will do a normal request and include or omit
the metadata, respectively. The default is meta=true.
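A client could then pull the posting URL out of a meta=only response with the standard DOM parser, along these lines. The class and method names are hypothetical, and the sample response is trimmed down and carries no DOCTYPE, since a parser given one may try to fetch the external DTD. Unknown elements are never looked at, which matches the "ignore what you don't understand" rule.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical reader for individual metadata values. */
final class MetaReader {
    /** Returns the trimmed text of the first element with that name, or null. */
    static String firstText(String xml, String elementName) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName(elementName);
            if (nodes.getLength() == 0) {
                return null; // element absent: the caller decides what to do
            }
            return nodes.item(0).getTextContent().trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse feed metadata", e);
        }
    }

    public static void main(String[] args) {
        String metaOnly = "<brain><meta>"
                + "<uri>http://brain.joshy.org/</uri>"
                + "<post-url>\n http://server.com/brainfile/upload/joshua.xml\n</post-url>"
                + "</meta></brain>";
        System.out.println(firstText(metaOnly, "post-url"));
    }
}
```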
Render the Data
To render the data on screen, we need to choose an appropriate font for the
character set. This means we need to know the encoding and language of the
BrainFeed. All XML files can set the encoding at the top, and most XML parsers will
take care of converting the file's encoding into the native platform's preferred
encoding. For Java, the parsers are required to take care of this automatically
and convert all text into Unicode. We can reasonably expect Win32, Cocoa,
GTK, and other platforms to provide similar capabilities.
For example:
<?xml version="1.0" encoding="UTF-8"?>
The language can be specified with a lang attribute on the
brain element. We will also add an optional lang attribute to each entry
in case it's a multilingual feed.
<?xml version="1.0" encoding="UTF-8"?>
<brain lang="en-us">
<entry id="1" lang="en-us">
...
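When choosing a font, a client has to decide which of the two attributes wins. One plausible rule, sketched below with a hypothetical class name, is that an entry's own lang overrides the feed-wide one. That precedence is an assumption and is not stated by the format itself.

```java
import java.util.Locale;

/** Hypothetical resolver for the effective language of an entry. */
final class EntryLanguage {
    /** Uses the entry's lang when present, otherwise the feed's lang. */
    static Locale resolve(String feedLang, String entryLang) {
        String tag = (entryLang != null && !entryLang.isEmpty()) ? entryLang : feedLang;
        // Locale.ROOT stands in for "no language information at all".
        return tag == null ? Locale.ROOT : Locale.forLanguageTag(tag);
    }

    public static void main(String[] args) {
        System.out.println(resolve("en-us", null).toLanguageTag());
    }
}
```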
The actual display of an entry is left up to the client. XHTML specifies
everything semantically with no style. To allow the author of the BrainFeed to
suggest some style, we can attach CSS to every entry, either by linking to a
stylesheet with a style attribute or by embedding rules in a
<style> element. If this is a web-based
client, the CSS will be passed to the browser. In the case of a thick client,
such as a Swing app using an HTML-capable component like JEditorPane, the CSS
would be parsed and pulled into the display.
<entry id="c" style="mystyle.css">
<style type="text/css">
p.cool { color: blue; }
</style>
<content>
<p class="cool">my cool entry</p>
</content>
</entry>
Scalability
Creating a specification that can be implemented on a variety of devices
and platforms, each with different needs and resources, can be quite a
challenge. Scalability is a hard problem. However, virtually every technology we
are using was designed with scalability in mind. Again the beauty of open
standards shines through. All we need to do is specify which parts of our spec
are optional and how. The underlying tech of HTTP, XHTML, and CSS will take care
of most of the rest.
Riffing off of CSS, we will define our specification in different levels,
each one building on top of the previous one. We will group the different features
by estimated difficulty of implementation and need for the feature.
Level 1: Downloading only. No searching, posting, or incremental
updates. This makes our service not as useful as it might otherwise be,
but it means anyone can publish by just dropping a flat file onto their server.
No scripts or CGI programs required.
Level 2: Searching by the query, keyword, title, body, and timestamp. This
doesn't require the infrastructure for authentication and posting, but allows
everything else.
Level 3: Posting
Levels 4 and up: Reserved for future use
Again, for autodetection, we can add another meta tag for the level:
<brain>
<meta>
<level>1</level>
</meta>
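A client might translate the advertised level into feature switches along these lines. The class name is hypothetical, and falling back to Level 1 when the level element is missing or unreadable is an assumption, chosen because a flat file dropped onto a server would not advertise anything.

```java
/** Hypothetical mapping from the advertised level to client features. */
final class FeedLevel {
    /** Parses the level text, falling back to 1 (assumed default). */
    static int parse(String levelText) {
        if (levelText == null) {
            return 1;
        }
        try {
            return Math.max(1, Integer.parseInt(levelText.trim()));
        } catch (NumberFormatException e) {
            return 1;
        }
    }

    /** Level 2 and up: searching, including by timestamp. */
    static boolean canSearch(int level) {
        return level >= 2;
    }

    /** Level 3 and up: posting. */
    static boolean canPost(int level) {
        return level >= 3;
    }
}
```

Because each level builds on the previous one, a single comparison per feature is enough, and levels 4 and up automatically inherit everything below them.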
Documentation for Implementors
That's it. We have completely described a web service, or at least its
protocol. However, if we want our service to be popular, we have one thing
left to do: documentation. The documentation of an XML web-based service really needs
three parts. First, we need a computer-readable spec. XML conveniently provides
this in the form of DTDs. Even though DTDs are intended for XML parsers, they are
often the documentation of last resort for client implementers, so it helps to
format them nicely with good comments. The DTD declares, unambiguously, what goes
where. If you document the DTD well, you will have fewer emails from
developers screaming, "What the @$ does this tag do?!"
Here is a portion of the BrainFeed DTD:
<!-- The root level element.
There is only one of these. -->
<!ELEMENT brain (meta?, entry*)>
<!-- ======== META =========== -->
<!-- This contains the meta information about the feed.
It is optional. If it exists then it should go
at the top to make parsing easier. -->
<!ELEMENT meta (uri,owner,description)>
<!-- These all go inside the meta.
They are really just descriptive -->
<!ELEMENT uri (#PCDATA)>
<!ELEMENT owner (#PCDATA)>
<!ELEMENT description (#PCDATA)>
One tricky thing to note here is that we are using XHTML as part of our
definition. Now, it's all fine and well to say that we are using XHTML, but if we
expect brain files to properly validate, then we need to have XHTML
embedded in our DTD. What we want is a definition for content that looks
something like this:
<!ELEMENT content(#PCDATA, div, p, pre, h1, h2, h3, .....
If we took this approach, then we would also have to redefine div, p, and all
of the other tags in XHTML. We could also copy and paste the full XHTML DTD into
ours. Either approach would be a lot of work and would require hacking around
in someone else's DTD. Fortunately, the designers of XHTML thought of this and
designed a modular version. All we have to do is import their DTD and turn off
the parts we don't want.
<!-- ignore the meta and title elements -->
<!ENTITY % title.element "IGNORE" >
<!ENTITY % xhtml-meta.module "IGNORE" >
<!-- XHTML include -->
<!ENTITY % xhtml11.mod
PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd" >
%xhtml11.mod;
<!-- define content to contain mixed
block level content -->
<!ELEMENT content (%Block.mix;)*>
The first two ENTITY lines turn off the title element (since we don't want it
to clash with our own title element) and the meta module, which contains the
meta tag and all of its attribute definitions. The next four lines import the
actual XHTML DTD. First we define the entity xhtml11.mod, which points to the
external DTD. We give it both the PUBLIC identifier (the official W3C name for XHTML)
and a SYSTEM URL (a location from which to download the DTD). Then we use the entity on
the next line (%xhtml11.mod;) to include all of XHTML's definitions in our DTD.
Finally, the last line actually declares the content element as containing
%Block.mix;, which the XHTML DTD defines as the contents of a body element. It's
shorthand for div, p, h1, h2, h3, etc. And with that, we have imported and customized XHTML into our BrainFeed DTD.
The next step is to create a human-readable document, one that takes the
reader through each part of the protocol, explaining its purpose and usage. It
should also give a 30,000-foot view of the system to help implementors get the
general idea. Remember that more people will be attracted to your project if they
get a good, quick description up front. For the BrainFeed project, I have decided
to write an article about it for a notable technical web site, but this is not
always required. :)
The last step is to create a sample implementation. It should be as simple as
possible. Don't worry about speed or efficiency; just make it clear and
comprehensive. If you can release the code as open source, then it's even better,
since this will give other developers a base upon which to build their versions.
For BrainFeed, I have created a simple client and server implementation. The
client is a pair of JSPs, one for searching and one for posting. They do simple
queries to a BrainFeed URL and return the results. The server is implemented
with a simple servlet. The servlet responds to GETs and POSTs, saving the results
into an on-disk XML file. Production-quality implementations would never do this,
of course; they would store the entries in a database. But for our sample
implementation, this will more than suffice. You can try out the sample client here. This contains a few entries about Java development. Try searching on "Java," "JNI," and "SQL."
Summary
Now that we've created our protocol, and documented it enough for others to
build their own versions of it, what can we do with it? The first thing is, of
course, a personal database of hand-entered snippets, but as the implementations
mature and become popular we can imagine many other uses:
- Search through the brainfeeds of famous Java developers.
- Search through all of your old emails.
- Search through Javadocs and Java forums from within your IDE.
- Easily searchable FAQs for web sites.
- Search through weblog archives.
- An open helpfile format.
- A searchable collection of cheatsheets and quickrefs.
- Quick lookup from a dictionary and thesaurus.
- Search through your book collection from Project Gutenberg.
- Search through an intranet knowledgebase.
- Search through Google and Yahoo from your own program.
The next logical step is to design a new interface. Instead of a web-based
thin client, we could have a thick client that does real-time searching and
caches the results. In the second article of this series, we will build a Swing
application with local caching, keystroke incremental searching, and an embedded
HTML renderer.
Joshua Marinacci first tried Java in 1995 at the request of his favorite TA and has never looked back.