by Joshua Marinacci
I have a problem. Two problems, actually. Maybe you can help me.
Problem One: I need to remember lots of small bits of arcane technical knowledge. I keep them in a text file and wish it were easier to search, copy, store, transport, etc.
Problem Two: I search the Web a lot and often get the wrong results because what I'm looking for is very specific. Sometimes what I want isn't on the Web. It's in newsgroups, online documentation, forums, or just not on the net anywhere. What I'd really like to do is search through someone else's brain, or at least the part they'd let me access. I'd even pay a small fee to do it.
Solution: solve both problems at once. Create a remotely searchable and cacheable database. To be nice to use, and widely deployed, it must be efficient and super-easy to implement: the searchable equivalent of RSS.
This article is the first in a two-part series. We will explore designing a simple but robust web service protocol called BrainFeed, considering alternatives and balancing pros and cons. Throughout the process, we will stay focused on making the protocol simple and reuse existing (preferably open) technologies wherever possible. The second article will build on top of the protocol to create an advanced thick client for searching right from the desktop.
I've used lots of bad web service protocols. SOAP and even XML-RPC are often overkill. They are complicated to implement and obscure the real problem you are trying to solve. The most successful web service I've seen (apart from web pages themselves) is RSS. Why? Because it's simple. It does only one thing, and it does it well. It was built on top of a stack of other open and widely deployed technologies. For intranets and extranets, the other web service specifications are quite useful, because one end (you) has some sort of a formal relationship with the other end (your customers, partners, or other departments). They can afford the costs of being highly structured. For widely deployed web services that go over the Internet, where you have either an informal or at least low-overhead relationship with the other end, we need something simpler: an RSS-level web service. We will use RSS as the inspiration for BrainFeed.
Following in RSS' footsteps, our web service will have to do the following:
all while:
If we had tried this 10 years ago, it probably wouldn't have worked. We would have had to invent too much from scratch. Today, however, we are blessed to have open specs with free implementations all over the place. No matter what language you program in, you can probably find an XML parser and an HTML renderer of some sort. XHTML is a clean, semantic document language and CSS2 can support almost any style we want. And the best part is that someone else has already written almost all of the pieces. In the Java universe, we are especially blessed to have XML parsers and HTTP access built right into the platform. We just have to put it all together! Welcome to the Lego school of software design.
First things first: what does our web service do? A technical definition would be: a network-accessible service for searching and downloading a fairly flat data repository of small documents. So what does that mean? Well, it's on the network, so it's remote. That means we have to deal with network reliability issues, encryption, and authentication. The next part is that we are searching through the database, so we need to specify search semantics, keywords, and a query language. We are downloading the documents, so we need to specify encoding and formatting. Finally, we are rendering the documents to the screen, so we need hints on how they should be presented to the end user.
Everything we are talking about storing in this database is small snippets of information: the syntax of an SSH command, some sample code to create a window, or Javadocs of the Robot class. We'd also like to update and add to the database, so let's drop that in there, too. So our final definition adds up to this: "A protocol for searching, downloading, and updating small documents from a network-accessible service."
Now that we have our definition of what the service should do, how should it be structured? Since we are talking about small documents, we should probably use the lingua franca of the networked document world: XML. This means we can bring all of our XML expertise and tools to bear on the project. In fact we can structure the entire database as a single logical XML document. Note the word logical here. It doesn't actually have to be stored as an XML document on either end. In fact, for any sufficiently large dataset, it probably won't be. But by specifying that it's logically an XML document we get a whole lot of useful semantics thrown in for free. We can use IDs and know that they will be unique. We get an infinitely nestable structure. We get well-defined white-space handling, structure validation, character encoding, and all of the other goodies that come with XML. All for free! I'm sure glad we live in the 21st century, unlike those encodingless savages back in the late 20th.
So what does it look like? I'm a visual person, so I like examples. And we have no page limits on the Web, so let's start off with a simple sample.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE brain SYSTEM "../brainfeed.dtd">
<brain searchable="true" language="blah">
<entry id="1">
<content>
<p>This is an example</p>
<pre><code>This is the
code for the example;
</code></pre>
<img src="images/output.png" alt="Example output" />
<p>and that's how you do it!</p>
</content>
</entry>
<entry id="b">
<content>
<p>Here's how you do this:</p>
<code>cmdline --args</code>
<p>and that's how you do it!</p>
</content>
</entry>
</brain>
There's our logical XML with the usual headers and doctype at the top. A
single <brain> element encloses multiple entry elements, each of which has an ID
to make it unique. The entry has a <content> element, which contains valid XHTML
strict markup. XHTML strict is important because it is extremely well defined and
is becoming more and more supported by the HTML renderers of the world (IE, Mozilla,
Safari, etc.).
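To give a feel for how little code a client needs to read this logical document, here is a minimal Java sketch that uses the DOM parser built into the platform. The class name BrainFeedReader and its methods are made up for this example and are not part of the protocol. It also assumes the feed arrives as well-formed XML without the DOCTYPE line, since the parser would otherwise try to load a DTD it cannot reach.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not defined by BrainFeed itself.
class BrainFeedReader {

    /** Parses the feed text into a DOM tree; rejects anything that is not well-formed XML. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("not a well-formed BrainFeed document", e);
        }
    }

    /** Returns the id attribute of every entry element, in document order. */
    static List<String> entryIds(String xml) {
        NodeList entries = parse(xml).getElementsByTagName("entry");
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < entries.getLength(); i++) {
            ids.add(((Element) entries.item(i)).getAttribute("id"));
        }
        return ids;
    }

    /** Returns the trimmed text inside the content element of the given entry, or null if absent. */
    static String contentText(String xml, String id) {
        NodeList entries = parse(xml).getElementsByTagName("entry");
        for (int i = 0; i < entries.getLength(); i++) {
            Element entry = (Element) entries.item(i);
            if (id.equals(entry.getAttribute("id"))) {
                Node content = entry.getElementsByTagName("content").item(0);
                return content == null ? null : content.getTextContent().trim();
            }
        }
        return null;
    }
}
```

Because the parser insists on well-formed input, an unclosed tag anywhere in the feed makes the whole document fail to load. That strictness is part of what we get "for free" by calling the database a logical XML document.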
Following our rule, reuse existing technology whenever possible, we will
keep things simple with an HTTP GET. Now we can reuse all of the existing HTTP
code out there, and we can use web servers and CGI scripts for our server instead
of designing custom request code. Java implements HTTP right in the java.net
package (new java.net.URL("http://www.java.net/").openStream()). Using
HTTP also means we get to ride on port 80 and go through firewalls.
Now, why use a GET instead of a POST? Just because it's simpler. A GET can
describe everything we are likely to want. Since this is mainly one-way
communication (send a few bits for a query and get a lot of bits back), we don't
need the complexity of POSTing. A GET names a server and a location on that
server, and it allows the document on the server to be a literal document on disk
instead of a program. This gives us our next rule: make it simple to implement.
This also fits the inherent semantics of GET, which is to get a document. POST
and PUT are for editing.
Example:
http://www.server.com:80/brainfiles/joshua.xml
This could be an actual file on disk served by a stock Apache web server, and it would work just fine.
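On the client side, downloading a whole feed can be sketched in a few lines of Java around the java.net.URL call mentioned earlier. The class name BrainFeedFetcher is invented for this example, and the sketch assumes the server sends UTF-8 text. It does no caching, authentication, or retrying.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not defined by BrainFeed itself.
class BrainFeedFetcher {

    /** Issues a plain GET for the given address and returns the body as text (UTF-8 assumed). */
    static String fetch(String address) {
        try (InputStream in = new URL(address).openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            // Copy the response until the server closes the stream.
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Covers malformed addresses as well as network failures.
            throw new UncheckedIOException(e);
        }
    }
}
```

Since the code only needs a URL, the same call works against a static file on a web server, a CGI script, or even a file: URL during testing.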
Now that we can get a document, how can we search through it? This is more
complicated, since our choice of a query language somewhat dictates the structure
of our database and how users interact with our service. Timothy Bray wrote an
excellent series of articles about searching. He recommends a simple GET-based API with keywords ANDed together. AND searching is intuitive for users and pretty easy to implement. It also has the side benefit of being very easy to describe in our
protocol.
I've always envisioned an advanced version of this working as a filter on
every keystroke, much like the search in iTunes, so a keyword-based API will probably work well. I personally type ANDed keywords into Google to find things,
so something similar works here.
We can do full-text searching, but the search engine probably won't know that some parts of a document are more important than others. A great way to deal with this is to simply mark which parts are important. How does the search engine know what each document is about? Historically, documents have had titles to say what they are about, so we'll add a title, along with some keywords, here:
<entry id="d">
<keyword>Java</keyword>
<keyword>JNI</keyword>
<title>JNI: What is JNI</title>
<content>
<p>JNI stands for Java Native Interface.
It's a way of......</p>
</content>
</entry>
To keep things simple, we will say that the search should be case-insensitive and that white space and punctuation should be ignored. Matching on such details is rarely useful, and usually causes valid results to be skipped. We will also specify searching in order of keywords, titles, and body text, but the actual search is left up to the server to implement in whatever way it feels will return the best results. A Java server will probably use Lucene, which has a wide array of algorithmic tweaks available.
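For a server that doesn't want to pull in Lucene, the rules above could be sketched in plain Java roughly like this. Everything here is an assumption made for illustration: the BrainSearch and Entry names, reading "ignore white space and punctuation" as "collapse every run of it to a single space," and scoring each entry by adding up the field rank (keywords 0, title 1, body 2) at which each term first matches.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical in-memory search for illustration; a real server may rank differently.
class BrainSearch {

    static final class Entry {
        final String id;
        final List<String> keywords;
        final String title;
        final String body;

        Entry(String id, List<String> keywords, String title, String body) {
            this.id = id;
            this.keywords = keywords;
            this.title = title;
            this.body = body;
        }
    }

    /** Lowercases and collapses each run of white space or punctuation to one space. */
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT).replaceAll("[^\\p{L}\\p{N}]+", " ").trim();
    }

    /** Returns 0, 1, or 2 for a keyword, title, or body match; -1 when the term is absent. */
    static int fieldRank(Entry e, String term) {
        for (String k : e.keywords) {
            if (normalize(k).contains(term)) return 0;
        }
        if (normalize(e.title).contains(term)) return 1;
        if (normalize(e.body).contains(term)) return 2;
        return -1;
    }

    /** ANDs the terms together and returns matching entry ids, best (lowest) score first. */
    static List<String> search(List<Entry> entries, String... terms) {
        int[] score = new int[entries.size()];
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {
            boolean all = true;
            for (String raw : terms) {
                int rank = fieldRank(entries.get(i), normalize(raw));
                if (rank < 0) {
                    all = false; // one missing term rules the entry out
                    break;
                }
                score[i] += rank;
            }
            if (all) hits.add(i);
        }
        hits.sort(Comparator.comparingInt(i -> score[i]));
        List<String> ids = new ArrayList<>();
        for (int i : hits) ids.add(entries.get(i).id);
        return ids;
    }
}
```

With this scoring, an entry tagged with the keyword "Java" sorts ahead of one that only mentions Java in its title. Nothing in the protocol requires this particular ranking; it is just one simple way to honor the keywords, titles, body order.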
To add searching to the spec, since downloading uses an HTTP GET
request, we'll just extend that with some query parameters:
http://server.com/brainfiles/joshua.xml?query=java&query=interface
This will search for entries matching both "java" and "interface."
HTTP allows us to specify a parameter name more than once (which is how
groups of checkboxes are often submitted), so we take advantage of this
by sending multiple query values.
If the end user program wants to specify just a particular field, then we can specify it explicitly with keyword, title, and content:
http://server.com/brainfiles/joshua.xml?keyword=jni&keyword=java&title=sql
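Repeated parameters are easy to handle on both ends. The sketch below shows one way a client might build such a query string and a server might read it back into lists of values. The BrainQuery class and its method names are invented for this example, and it assumes UTF-8 form encoding for the parameter values.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not defined by BrainFeed itself.
class BrainQuery {

    /** Client side: repeats one field name per term, e.g. query=java&query=interface. */
    static String queryString(String field, String... terms) {
        StringBuilder sb = new StringBuilder();
        for (String term : terms) {
            if (sb.length() > 0) sb.append('&');
            sb.append(encode(field)).append('=').append(encode(term));
        }
        return sb.toString();
    }

    /** Server side: groups every value under its parameter name, keeping the original order. */
    static Map<String, List<String>> parse(String queryString) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        if (queryString.isEmpty()) return params;
        for (String pair : queryString.split("&")) {
            int eq = pair.indexOf('=');
            String name = decode(eq < 0 ? pair : pair.substring(0, eq));
            String value = eq < 0 ? "" : decode(pair.substring(eq + 1));
            params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return params;
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

Parsing the field-specific example above this way yields two values under keyword (jni and java) and one under title (sql), which the server can then AND together however it likes.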
Building a Better Brain, Part 2: A Great Thick Client
Joshua Marinacci built a distributed system for storing, searching, and updating small pieces of information. In this installment, he shows how to build an attractive thick client with Swing.