
Lucene Intro

July 30, 2003

Contents
Indexing
Lucene Index Anatomy
   Documents
   Terms
Analysis
Searching
Summary
Resources

In order to make sense of the perceived complexity of
the world, humans have invented categorizations, classifications, genus,
species, and other types of hierarchical organizational schemes. The
explosion of the Internet and electronic data repositories has realized many
dreams of being able to quickly find information that was previously
unattainable. Yahoo was the first high-profile categorized view of the Internet.
More and more, though, users demand the flexibility of free-form queries
that cut across rigid category boundaries, as proven by the popular reliance
on search engines like Google. If users are demanding these capabilities of your applications, Lucene is quite possibly the best answer!

Lucene is a high-performance, scalable, search engine technology. Both
indexing and searching features make up the Lucene API. The first part of this article takes you through an example of using Lucene to index all the text files in a directory and its subdirectories. Before proceeding to examples of analysis and searching, we'll take a brief detour to discuss the format of the index directory.

Indexing

We'll begin by creating the Indexer class that will be used to index all the text files
in a specified directory. This class is a utility class with a single public method index() that
takes two arguments. The first argument is a File object indexDir
that corresponds to the directory where the index will be created. The second argument is another File object dataDir that corresponds to the directory to be indexed.

public static void index(File indexDir, File dataDir) throws IOException {
    if (!dataDir.exists() || !dataDir.isDirectory()) {
       throw new IOException(dataDir + " does not exist or is not a directory");
    }

    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
    indexDirectory(writer, dataDir);
    writer.close();
}

After checking that dataDir exists and is a directory, we instantiate the
IndexWriter object that will be used to create the index.
The IndexWriter constructor used above accepts as its first parameter the
directory where the index will be created, with the last argument mandating
that it be created from scratch rather than reusing an index that may already
exist in that same location. The middle parameter is the analyzer to
use for tokenized fields. Field analysis is described below, but for
now we can take for granted that the important words in the file will
be indexed thanks to the StandardAnalyzer.

The indexDirectory() method walks the directory tree, scanning
for .txt files. Any .txt file is indexed using the indexFile() method, any
directory is processed recursively using the indexDirectory() method, and any other
file is ignored. Here is the code for indexDirectory():

private static void indexDirectory(IndexWriter writer, File dir) throws IOException {
    File[] files = dir.listFiles();

    for (int i=0; i < files.length; i++) {
        File f = files[i];
        if (f.isDirectory()) {
           indexDirectory(writer, f);  // recurse
        } else if (f.getName().endsWith(".txt")) {
           indexFile(writer, f);
        }
    }
}

The indexDirectory() method itself contains no Lucene-specific code; it simply
walks the file system. This is typical of Lucene usage in general -- using Lucene
rarely involves much coding directly against the Lucene API, but rather relies on
the cleverness of the code you build around it. And finally, in the Indexer class,
we get to the heart of its purpose, indexing a single text file:

private static void indexFile(IndexWriter writer, File f) throws IOException {
    System.out.println("Indexing " + f.getName());

    Document doc = new Document();
    doc.add(Field.Text("contents", new FileReader(f)));
    doc.add(Field.Keyword("filename", f.getCanonicalPath()));
    writer.addDocument(doc);
}

And believe it or not, we're done! We've just indexed an entire directory
tree of text files. Yes, it really is that simple. To summarize, all
it took to create this index were these steps:

  1. Create an IndexWriter.
  2. Locate each file to be indexed by walking the directory looking for file names ending in .txt.
  3. For each text file, create a Document with the desired Fields.
  4. Add the document to the IndexWriter instance.

Let's assemble these methods into an Indexer class and add the appropriate imports. You can index
a directory tree by calling Indexer.index(indexDir, dataDir). We've also added a main() method so
the Indexer can be run from the command line with the two directories passed in as command-line
parameters.

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.File;
import java.io.IOException;
import java.io.FileReader;

public class Indexer {
    public static void index(File indexDir, File dataDir) throws IOException {
        if (!dataDir.exists() || !dataDir.isDirectory()) {
            throw new IOException(dataDir + " does not exist or is not a directory");
        }
        IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), true);
        indexDirectory(writer, dataDir);
        writer.close();
    }

    private static void indexDirectory(IndexWriter writer, File dir) throws IOException {
        File[] files = dir.listFiles();

        for (int i=0; i < files.length; i++) {
            File f = files[i];
            if (f.isDirectory()) {
                indexDirectory(writer, f);  // recurse
            } else if (f.getName().endsWith(".txt")) {
                indexFile(writer, f);
            }
        }
    }

    private static void indexFile(IndexWriter writer, File f) throws IOException {
        System.out.println("Indexing " + f.getName());

        Document doc = new Document();
        doc.add(Field.Text("contents", new FileReader(f)));
        doc.add(Field.Keyword("filename", f.getCanonicalPath()));
        writer.addDocument(doc);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new Exception("Usage: " + Indexer.class.getName() + " <index dir> <data dir>");
        }
        File indexDir = new File(args[0]);
        File dataDir = new File(args[1]);
        index(indexDir, dataDir);
    }
}
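To try it out, compile the class with the Lucene jar on the classpath and point it at an index directory and a directory of .txt files. The jar name and paths below are placeholders; substitute whatever matches your Lucene download (and use ';' rather than ':' as the classpath separator on Windows):

javac -classpath lucene-1.3.jar Indexer.java
java -classpath lucene-1.3.jar:. Indexer /path/to/index /path/to/text/files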

In this example, two fields are part of each document: the contents of the
text file and the full file name. The contents field gets some special
treatment under the covers as the StandardAnalyzer, which is
discussed below, processes it. The filename field is indexed as is.
There is more going on here, of course.
The Field static methods Text and Keyword will be explained in detail after
we take a quick look inside a Lucene index.

Lucene Index Anatomy

The Lucene index format is a directory structure of several files.
You can successfully use Lucene without understanding this directory structure. Feel free
to skip this section and treat the directory as a black box without regard to what is inside. When you are ready to dig
deeper you'll find that the files you created in the last section contain statistics and other data
to facilitate rapid searching and ranking. An index contains a sequence
of documents. In our indexing example, each document represents information
about a text file.

Documents

Documents are the primary retrievable units from a Lucene query. Documents
consist of a sequence of fields. Fields have a name ("contents" and
"filename" in our example). Field values are a sequence of terms.

Terms

A term is the smallest piece of a particular field. Fields have three
attributes of interest:

  • Stored -- Original text is available in the documents returned from a search.
  • Indexed -- Makes this field searchable.
  • Tokenized -- The text added is run through an analyzer and broken into relevant
    pieces (only makes sense for indexed fields).

Stored fields are handy for immediately having the original text available
from a search, such as a database primary key or filename. Stored fields
can dramatically increase the index size, so use them wisely. Indexed
field information is stored extremely efficiently, such that the same term
in the same field name across multiple documents is only stored once, with pointers to the documents that
contain it.

The Field class has a few static methods to construct fields with combinations
of the various attributes. They are:

  • Field.Keyword -- Indexed and stored, but not tokenized. Keyword fields
    are useful for data like filenames, part numbers, primary keys, and other text that needs to stay intact as is.
  • Field.Text -- Indexed and tokenized. The text is also stored if added as a String, but
    not stored if added as a Reader.
  • Field.UnIndexed -- Only stored. UnIndexed fields are not searchable.
  • Field.UnStored -- Indexed and tokenized, but not stored. UnStored fields are ideal for
    text you want to be searchable but want to maintain the original text elsewhere or it is not needed for immediate display
    from search results.
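As a quick illustration, here is a sketch of how those four factory methods might be combined in a single document, using the same Lucene 1.x-era API as the rest of this article. The "title", "filesize", and "notes" fields are hypothetical and exist only to show each attribute combination; only "contents" and "filename" appear in our Indexer.

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class FieldDemo {
    public static Document buildDocument(File f) throws IOException {
        Document doc = new Document();
        // Indexed and stored, but not tokenized: the path stays intact as a single term.
        doc.add(Field.Keyword("filename", f.getCanonicalPath()));
        // Indexed and tokenized; added as a Reader, so the original text is not stored.
        doc.add(Field.Text("contents", new FileReader(f)));
        // Indexed, tokenized, and stored, because the value is added as a String.
        doc.add(Field.Text("title", f.getName()));
        // Stored only: retrievable from search results, but not searchable.
        doc.add(Field.UnIndexed("filesize", String.valueOf(f.length())));
        // Indexed and tokenized, but not stored: searchable, with the original text kept elsewhere.
        doc.add(Field.UnStored("notes", "some text worth searching but not storing"));
        return doc;
    }
}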

Up to now, Lucene seems relatively simple. But don't be fooled into
thinking that there is not much to what is under the covers. It's actually
quite sophisticated. The heart of this sophistication comes in the
analysis of text, and how terms are pulled from the field data.

Analysis

Tokenized fields are where the real fun happens.
In our example, we are indexing the contents of text files. The goal
is to have the words in the text file be searchable, but for practical purposes
it doesn't make sense to index every word. Some words like "a", "and",
and "the" are generally considered irrelevant for searching and can be optimized
out -- these are called stop words.

Does case matter for searching?
What are word boundaries? Are acronyms, email addresses, URLs, and
other such textual constructs kept intact and made searchable? If
a singular word is indexed, should searching on the plural form return the
document? These are all very interesting and complex questions to ask
when deciding on which analyzer to use, or whether to create your own.

In our example, we use Lucene's built-in StandardAnalyzer,
but there are other built-in analyzers as well as some optional ones (found
currently in the Lucene "sandbox" CVS repository) that can be used.
Here is some code that explores what several of these analyzers do to two
different text strings:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.snowball.SnowballAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.StringReader;
import java.io.IOException;

public class AnalysisDemo {
    private static final String[] strings = {
        "The quick brown fox jumped over the lazy dogs",
        "XY&Z Corporation - xyz@example.com"
    };

    private static final Analyzer[] analyzers = new Analyzer[]{
        new WhitespaceAnalyzer(),
        new SimpleAnalyzer(),
        new StopAnalyzer(),
        new StandardAnalyzer(),
        new SnowballAnalyzer("English", StopAnalyzer.ENGLISH_STOP_WORDS),
    };

    public static void main(String[] args) throws IOException {
        for (int i = 0; i < strings.length; i++) {
            analyze(strings[i]);
        }
    }

    private static void analyze(String text) throws IOException {
        System.out.println("Analzying "" + text + """);
        for (int i = 0; i < analyzers.length; i++) {
            Analyzer analyzer = analyzers[i];
            System.out.println("\t" + analyzer.getClass().getName() + ":");
            System.out.print("\t\t");
            TokenStream stream = analyzer.tokenStream("contents", new StringReader(text));
            while (true) {
                Token token = stream.next();
                if (token == null) break;

                System.out.print("[" + token.termText() + "] ");
            }
            System.out.println("\n");
        }
    }

}

The analyze() method uses Lucene's API in an exploratory fashion.
Your indexing code would not need to see the results of textual analysis,
but it is helpful to see the terms that result from the various analyzers.
Here are the results:

Analzying "The quick brown fox jumped over the lazy dogs"
    org.apache.lucene.analysis.WhitespaceAnalyzer:
        [The] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

    org.apache.lucene.analysis.SimpleAnalyzer:
        [the] [quick] [brown] [fox] [jumped] [over] [the] [lazy] [dogs]

    org.apache.lucene.analysis.StopAnalyzer:
        [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

    org.apache.lucene.analysis.standard.StandardAnalyzer:
        [quick] [brown] [fox] [jumped] [over] [lazy] [dogs]

    org.apache.lucene.analysis.snowball.SnowballAnalyzer:
        [quick] [brown] [fox] [jump] [over] [lazi] [dog]

Analzying "XY&Z Corporation - xyz@example.com"
    org.apache.lucene.analysis.WhitespaceAnalyzer:
        [XY&Z] [Corporation] [-] [xyz@example.com]

    org.apache.lucene.analysis.SimpleAnalyzer:
        [xy] [z] [corporation] [xyz] [example] [com]

    org.apache.lucene.analysis.StopAnalyzer:
        [xy] [z] [corporation] [xyz] [example] [com]

    org.apache.lucene.analysis.standard.StandardAnalyzer:
        [xy&z] [corporation] [xyz@example] [com]

    org.apache.lucene.analysis.snowball.SnowballAnalyzer:
        [xy&z] [corpor] [xyz@exampl] [com]

The WhitespaceAnalyzer is the most basic, simply separating tokens based
on, of course, whitespace. Note that not even capitalization was changed.
Searches are case-sensitive, so a general best practice is to lowercase text
during the analysis phase. The rest of the analyzers lowercase text as
part of the process. The SimpleAnalyzer splits text at non-letter characters,
such as the special characters ('&', '@', and '.' in the second
demo string). The StopAnalyzer builds upon the features of the SimpleAnalyzer
and also removes common English stop words.

The most sophisticated analyzer built into Lucene's core
is StandardAnalyzer. Under the covers it is a JavaCC-based parser with
rules for email addresses, acronyms, hostnames, floating point numbers,
as well as the lowercasing and stop word removal. Analyzers build upon a
chaining-filter architecture, allowing single-purpose rules to be combined.
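To make that chaining concrete, here is a minimal sketch of a custom analyzer written against the same Lucene 1.x-era API used throughout this article. The class name MyAnalyzer is made up for illustration, and because it chains the same single-purpose filters StandardAnalyzer uses, it is roughly a re-creation of StandardAnalyzer rather than anything new; swap filters in or out to change the rules.

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

import java.io.Reader;

public class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        // Tokenize with the JavaCC-based grammar, then chain single-purpose filters.
        TokenStream result = new StandardTokenizer(reader);
        result = new StandardFilter(result);     // clean up acronyms and possessives
        result = new LowerCaseFilter(result);     // lowercase every token
        result = new StopFilter(result, StopAnalyzer.ENGLISH_STOP_WORDS); // drop stop words
        return result;
    }
}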

The SnowballAnalyzer illustrated is not currently a built-in
Lucene feature. It is part of the source code available in the jakarta-lucene-sandbox
CVS repository. It has the most peculiar results of all analyzers shown.
The algorithm is language-specific, using stemming. Stemming algorithms
attempt to reduce a word to a common root form. This is seen with "lazy"
being reduced to "lazi". The word "laziness" would also be reduced
to "lazi", allowing searches for either word to find documents containing
the other. Another interesting example of the SnowballAnalyzer in action
is on the text "corporate corporation corporations corpse", which yielded
these results:

[corpor] [corpor] [corpor] [corps]

Note that "corpse" stems to "corps" rather than "corpor" -- unlike many a dot-com that became synonymous with "corpse," the stemming algorithm sees the difference.

There is far more to textual analysis than is covered
here. It is the topic of many dissertations and patents, and certainly
ongoing research. Let's now turn our attention to searching, with
the knowledge of how tokens are pulled from the original text.

Searching

To match our indexing example, a Searcher class was created
to display search results from the same index. Its skeleton main is
shown here:

import org.apache.lucene.document.Document;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Hits;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import java.io.File;

public class Searcher {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            throw new Exception("Usage: " + Searcher.class.getName() + " <index dir> <query>");
        }

        File indexDir = new File(args[0]);
        String q = args[1];

        if (!indexDir.exists() || !indexDir.isDirectory()) {
            throw new Exception(indexDir + " does not exist or is not a directory.");
        }

        search(indexDir, q);
    }
}

Again, we see nothing exciting here, just grabbing the command-line arguments
representing the index directory (which must have previously been created)
and the query to use. The interesting stuff happens in the search method:

public static void search(File indexDir, String q) throws Exception {
    Directory fsDir = FSDirectory.getDirectory(indexDir, false);
    IndexSearcher is = new IndexSearcher(fsDir);

    Query query = QueryParser.parse(q, "contents", new StandardAnalyzer());
    Hits hits = is.search(query);
    System.out.println("Found " + hits.length() + " document(s) that matched query '" + q + "':");
    for (int i = 0; i < hits.length(); i++) {
        Document doc = hits.doc(i);
        System.out.println(doc.get("filename"));
    }
}
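With the index from the Indexer example already on disk, the searcher can be run from the command line in the same way (the jar name and index path are placeholders for whatever matches your setup):

java -classpath lucene-1.3.jar:. Searcher /path/to/index "+java -microsoft"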

Through Lucene's API, a Query object instance is created and handed to an
IndexSearcher.search method. The Query object can be constructed through
the API using the built-in Query subclasses:

  • TermQuery
  • BooleanQuery
  • PrefixQuery
  • WildcardQuery
  • RangeQuery
  • and a few others.

In our search method,
though, we are using the QueryParser to parse a user-entered query.
QueryParser is a sophisticated JavaCC-based parser to turn Google-like search
expressions into Lucene's API representation of a Query. Lucene's expression
syntax is documented on the Lucene web site (see Resources); expressions may
contain boolean operators, per-field queries, grouping, range queries, and
more. An example query expression is "+java -microsoft", which returns
hits for documents that contain the word "java" but not the word "microsoft". QueryParser.parse() requires the developer to specify the default field for searching,
and in this case we specified the "contents" field. This would be equivalent
to querying for "+contents:java -contents:microsoft", but is friendlier for users
to type.
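For comparison, here is a sketch of how that same query could be built programmatically with the 1.x-era API, bypassing QueryParser. The class and method names are made up for illustration. Note that no analyzer runs over terms created this way, so the term text must already be lowercase to match what StandardAnalyzer put into the index.

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class QueryDemo {
    public static Query javaButNotMicrosoft() {
        BooleanQuery query = new BooleanQuery();
        // required = true, prohibited = false: documents must contain "java"
        query.add(new TermQuery(new Term("contents", "java")), true, false);
        // required = false, prohibited = true: documents must not contain "microsoft"
        query.add(new TermQuery(new Term("contents", "microsoft")), false, true);
        return query;
    }
}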

The developer must also specify the analyzer to be used for tokenizing the query. In this case we use StandardAnalyzer, which is the same analyzer used for indexing. Typically the same analyzer should be used for both indexing and QueryParser searching. If we had used the SnowballAnalyzer as shown in the analysis examples, this would enable "laziness" searches to find the "quick brown fox" document.

After searching, a Hits collection is returned. The hits returned are
ordered by Lucene's determination of score. It is beyond the scope
of this article to delve into Lucene scoring, but rest assured that its default
behavior is plenty good enough for the majority of applications, and it can
be customized in the rare cases that the default behavior is insufficient.

The Hits collection is itself not an actual collection
of the documents that match the search; for performance reasons, each
document is fetched lazily, with a simple method call, only when you ask for it.
In our example we display the filename field value for each document that
matches the query.
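If you also want to display the score Lucene assigned to each hit, Hits exposes it alongside the document. A hypothetical variation of the loop in search() might look like this:

for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);  // the document is fetched lazily at this point
    System.out.println(doc.get("filename") + " (score: " + hits.score(i) + ")");
}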

Summary

Lucene is a spectacularly top-notch piece of work.
Even with its wondrous capabilities, it requires developer ingenuity to build
applications around it. We've seen a glimpse of the decisions that developers need to make with the choice of analyzers. There is more to it than this choice, though. Here are some questions to ponder as you consider
adding Lucene to your projects:

  • What are my actual "documents"? (perhaps
    database rows or paragraphs rather than entire files)
  • What are the fields
    that make up my documents?
  • How do users want to search for documents?

This article serves as an introduction to Lucene's capabilities, demonstrating
how simple Lucene usage is, yet how powerful its results are.

Resources

For more information on Lucene, visit the Lucene web site. There you can find more detailed information on the QueryParser syntax and the index file format.

Erik Hatcher is the co-author of the premier book on Ant, Java Development with Ant (published by Manning), and is co-author of "Lucene in Action".