Building a Better Brain, Part 2: A Great Thick Client

by Joshua Marinacci
03/26/2004


Contents
The Base Program
Real-Time Incremental Searching
Creating an Index with Lucene
Implementing a Real-Time Search
Searching on Each Keystroke
Syncing the BrainFeeds
Creating an HTML View
Adding Style to the Layout
The Future

In the previous article, we designed a web server protocol for searching and updating small chunks of information, called Brain Entries, that are stored in BrainFeeds. The sample client is a JSP program that displays the entries in a web browser. Now it would be nice to have a really good thick client that would let us do real-time searches, local data caching, and properly render the entries in the client itself instead of in a web browser.

In this article, we are going to build a desktop application to read and post to BrainFeeds. Since it's a real application, we will also be able to cache the feeds and do incremental updates to disk. This lets us do fast real-time searching through the local cache. Also, since we won't have access to a browser anymore, we will customize the HTMLEditorKit to render each entry as HTML directly in our application.

The Base Program

Like most desktop applications, ours will start with a simple base. Our application has one frame with three sections (as seen in Figure 1). The top is a search box, the middle displays the results in a list, and the bottom renders the selected entry as HTML. The two buttons at the bottom are for editing the currently selected entry and for adding a new one. You can download the source code for this application here: brainfeed.zip.

Figure 1. The base application

Real-Time Incremental Searching

The first feature we'll add to make our application really nice to use is real-time incremental searching. This is a method of searching most prominently featured in iTunes, though you can also find it in text editors (like the venerable XEmacs), file managers, and even the combo boxes of some applications. The two key points of real-time incremental searching are that the search is run again on each keystroke, and that the user can search for substrings. This means that a search for "ten" would match "ten," "tent," and "forgotten." These two techniques combine to create a great user experience, but at the cost of processor time and of disk space for an index. Fortunately, we live in the age of cheap and powerful computers that waste most of their resources waiting in a loop for a mouse click. Incremental searching can be slow, but for the datasets we will be dealing with (say, less than 20MB of pure text), it should be nearly instantaneous on modern computers.
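To make the two key points concrete, here is a minimal sketch in plain Java with no index at all. The class and method names (SubstringFilter, filter) are illustrative assumptions and are not part of the BrainFeed source, and the code uses generics, which the article's own listings predate. It does a case-insensitive substring match and is simply called again with the longer query after each simulated keystroke.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class SubstringFilter {
    // Returns every entry that contains the query as a
    // substring, ignoring case. Hypothetical helper.
    public static List<String> filter(List<String> entries,
                                      String query) {
        String needle = query.toLowerCase(Locale.ROOT);
        List<String> matches = new ArrayList<>();
        for (String entry : entries) {
            if (entry.toLowerCase(Locale.ROOT).contains(needle)) {
                matches.add(entry);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> entries =
            Arrays.asList("ten", "tent", "forgotten", "java");

        // simulate typing "ten" one keystroke at a time,
        // re-running the whole search after each one
        String typed = "";
        for (char c : "ten".toCharArray()) {
            typed += c;
            System.out.println(typed + " -> "
                               + filter(entries, typed));
        }
    }
}
```

A linear scan like this is fine for a handful of strings. An index exists to avoid rescanning every entry on every keystroke once the data grows.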

So how do we do it? First we need a powerful database with support for wildcard searching. Lucene is a 100% Java, open source search engine that supports almost everything we need. It was written by the author of Apple's VTwin search engine, and supports both full-text and wildcard searching. Now adopted by the Apache Jakarta project, it provides top-notch searching for any Java application. We just need to hook it up.

Creating an Index with Lucene

First we need to create an index on the client side to store all of our entries. The index contains all of the words that we can search on, presorted to make searching faster. It also lets us set some options about how to deal with spaces, plural words, and other language issues.

File indexDir = new File("braindir");

// the stop analyzer breaks the text on word boundaries
// converting it all to lower case and stripping out the stop
// words (like "the", and "a")
Analyzer analyzer = new StopAnalyzer();

if(writer == null) {
    try {
        // try to open the existing index.
        // the false means it won't create or overwrite it
        writer = new IndexWriter(indexDir, analyzer, false);
    } catch (IOException ex) {
        // the index could not be opened (most likely it
        // doesn't exist yet), so create a new one.
        // the true means it will overwrite any old index
        writer = new IndexWriter(indexDir, analyzer, true);
    }

    writer.close();
}

The code above will create an index in the braindir directory. The first call to new IndexWriter() will open the index without creating it. If the call fails because the index doesn't already exist, then it will make the call again with true for the last argument to create a new index. The Analyzer is a set of rules about how to preprocess the data before putting it into the database. The StopAnalyzer, one of the default Analyzers that comes with Lucene, will convert all text to lowercase and remove stop words. Stop words are short words like "the" and "a" that convey little or no meaning and are not useful for searching. We can leave them out to speed up processing and make the search more targeted.
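To get a feel for what this preprocessing does to a piece of text, here is a rough sketch in plain Java. It is not Lucene's implementation: the StopWordSketch class, its analyze method, and the six-word stop list are all assumptions made for illustration. It only mimics the behavior described above, splitting on non-letters, lowercasing, and dropping stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // a tiny illustrative stop list, chosen for this
    // example only
    private static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList(
            "a", "an", "and", "of", "the", "to"));

    // hypothetical helper: returns the terms that would
    // remain after lowercasing and stop word removal
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // split on runs of anything that is not a letter
        for (String raw : text.split("[^\\p{L}]+")) {
            String token = raw.toLowerCase(Locale.ROOT);
            if (!token.isEmpty()
                    && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [future, thick, client]
        System.out.println(
            analyze("The Future of a Thick Client"));
    }
}
```

Because the same rules have to be applied to the query as to the indexed text, the search code needs to use the same Analyzer as the indexing code.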

Now that we have an index, we need to put the entries into it. Each entry has already been parsed into a BrainEntry object (reused from the JSP version), which has accessors for each field we will need. Lucene stores text in Document objects, so we will create one Document for each BrainEntry.

private static void addToIndex(File indexDir,
                               BrainEntry be,
                               boolean create)
    throws Exception {
    IndexWriter writer = getWriter();
    // create a new document for the brain entry
    Document doc = new Document();

    // pull out all of the fields and put them
    // in the document
    String id = be.getId();
    String uri = be.getURI();
    doc.add(Field.Keyword("id", id));
    doc.add(Field.Keyword("uri", uri));
    doc.add(Field.Keyword("iduri", id + ":" + uri));
    doc.add(Field.Text("title", be.getTitle()));
    doc.add(Field.UnIndexed("content",
                            be.getContentString()));

    // add each keyword
    Iterator it = be.getKeywordList().iterator();
    while(it.hasNext()) {
        String keyword = (String)it.next();
        doc.add(Field.Text("keyword",keyword));
    }

    // add the document, and close the writer
    // even if the add fails
    try {
        writer.addDocument(doc);
    } finally {
        writer.close();
    }
}

First we add the searchable fields to the Document, and then we add the content. Lucene has different types of fields depending on how they should be included in the index. We want the id and source uri to be keywords, and the title to be text. A Keyword field is a string that will be stored and indexed but not tokenized, meaning it won't be modified in any way. Since the id and uri are also used outside of the program, we don't want them to be changed at all. A Text field is also stored and indexed, but it will be tokenized as well, which in our case will make it lowercase and remove the stop words. All of the fields that we would like our users to search on (the title and each keyword) will be stored as Text. For the content (the body text of the entry), we don't actually want to index it for searching, since that would make queries slower. Instead, we just want to use the database as a convenient storage mechanism, so it gets stuffed into an UnIndexed field. Once our Document is set up, we add it to the index.
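The three field types differ only in whether the value is stored, indexed, and tokenized. The sketch below restates the paragraph above as a small plain-Java enum. FieldKinds and Kind are hypothetical names made up for this summary and are not Lucene classes.

```java
public class FieldKinds {
    // how each field type used in addToIndex() treats
    // its value, as described in the text above
    public enum Kind {
        KEYWORD(true, true, false),    // id, uri, iduri
        TEXT(true, true, true),        // title, keyword
        UNINDEXED(true, false, false); // content

        public final boolean stored;
        public final boolean indexed;
        public final boolean tokenized;

        Kind(boolean stored, boolean indexed,
             boolean tokenized) {
            this.stored = stored;
            this.indexed = indexed;
            this.tokenized = tokenized;
        }

        public String describe() {
            return name() + ": stored=" + stored
                + ", indexed=" + indexed
                + ", tokenized=" + tokenized;
        }
    }

    public static void main(String[] args) {
        for (Kind kind : Kind.values()) {
            System.out.println(kind.describe());
        }
    }
}
```

Read this way, the content field is the only one that cannot be matched by a query, and the id and uri are the only indexed values guaranteed to come back exactly as they went in.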

As we saw above, we write to the index with an IndexWriter. To search through the index, we will use, not surprisingly, an IndexSearcher. The query itself comes from the QueryParser, which takes our query string, the name of the default field we want to search, and the analyzer. We will use the same Analyzer that we used when we originally put the entries into the index: the StopAnalyzer. Finally, we execute the search and loop through the results.

private static List luceneSearch(String q,
                                 File indexDir)
    throws Exception {
    init();
    List list = new ArrayList();

    // create an index search
    Directory fsDir =
        FSDirectory.getDirectory(indexDir, false);
    IndexSearcher is = new IndexSearcher(fsDir);

    // create a new query based on the
    // query string passed in
    Query query =
        QueryParser.parse(q, "keyword",
                          new StopAnalyzer());

    // do the search
    Hits hits = is.search(query);

    // copy each hit back into a BrainEntry
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        BrainEntry be = new BrainEntry();
        be.setId(doc.get("id"));
        be.setURI(doc.get("uri"));
        be.setTitle(doc.get("title"));
        be.setContentString(doc.get("content"));
        Field[] keywords = doc.getFields("keyword");
        for(int j=0; j<keywords.length; j++) {
            be.addKeyword(keywords[j].stringValue());
        }
        list.add(be);
    }

    // close the searcher only after reading the hits,
    // since Hits loads its documents on demand
    is.close();
    return list;
}

To create an incremental search, we need to modify the query. Lucene doesn't support complete substring search (where a search for "oo" would return "noon"), but it does support prefix substrings, meaning a search for "jav" will return both "java" and "javascript." This is done by adding a wildcard ("*") to each term. Years of Googling have conditioned people to continue typing words to narrow down a search, so we will just AND the search terms together into our final query string.

public List search(String[] terms) throws Exception {
    // return an empty list if the query is empty
    if(terms.length == 0) return new ArrayList();

    StringBuffer query = new StringBuffer();
    // add the first term with a wildcard (*)
    query.append(terms[0]).append("*");

    // AND all of the additional terms
    // with *'s after them
    for(int i=1; i<terms.length; i++) {
        query.append(" AND ").append(terms[i]).append("*");
    }
    return luceneSearch(query.toString(),
                        this.indexdir);
}
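The query-building step can be tried on its own, without Lucene on the classpath. The sketch below is a hypothetical helper (PrefixQueryBuilder and build are not part of the BrainFeed source) that starts from the raw text of the search box instead of a pre-split array. It also skips blank input, so the query string never contains a bare "*", which the query parser may reject.

```java
public class PrefixQueryBuilder {
    // turns "jav scr" into "jav* AND scr*".
    // returns an empty string when there is nothing
    // to search for
    public static String build(String input) {
        StringBuilder query = new StringBuilder();
        for (String term : input.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields one empty term
            }
            if (query.length() > 0) {
                query.append(" AND ");
            }
            query.append(term).append('*');
        }
        return query.toString();
    }

    public static void main(String[] args) {
        // prints jav* AND scr*
        System.out.println(build("jav  scr"));
    }
}
```

This sketch does not escape characters that have a special meaning in the query syntax, so a real client would need to handle those before passing the string on.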
