How It Works
There's a lot going on in the suggestSimilar() method of SpellChecker, so let's follow it through with an example.
Take the correctly spelled word "lettuce," which appears in the Beatrix Potter texts I've used for this article. In the original index, where each Lucene document corresponds to a text, "lettuce" appears in two Lucene documents in the contents field. On the other hand, the spell index contains a whole Lucene document for every distinct word in the original index. Each document has a number of fields, as shown here with the values for the document representing the word "lettuce."
| Field name |
Field values |
word |
lettuce |
start3 |
let |
gram3 |
let ett ttu tuc uce |
end3 |
uce |
start4 |
lett |
gram4 |
lett ettu ttuc tuce |
end4 |
tuce |
Notice how both trigrams and 4-grams are indexed. In fact, precisely which n-grams are indexed depends on the size of the word. For very short words, unigrams and bigrams are indexed, whereas for longer words, trigrams and 4-grams are indexed.
The suggestSimilar() method forms a Lucene query to search the spell index for candidate suggestions. For the misspelling "lettice" the query is as follows (split over two lines to make it easier to read):
start3:let^2.0 end3:ice gram3:let gram3:ett gram3:tti gram3:tic gram3:ice
start4:lett^2.0 end4:tice gram4:lett gram4:etti gram4:ttic gram4:tice
The start n-grams are given more weight than the other n-grams in the word; here, they are boosted by a factor of two, signified by the ^2.0 notation. Another reason to index the start and end n-grams separately is because they are positional, unlike the other n-grams. For example, the words "eat" and "ate" have the same set of unigrams and bigrams (gram1:e gram1:a gram1:t gram2:ea gram2:at), so they need the start and end fields to distinguish them (start1:e end1:t start2:ea end2:at for "eat," and start1:a end1:e start2:at end2:te for "ate").
Using a Lucene index browser, such as the excellent Luke--the Lucene Index Toolbox, we can manually run this query against the spell index. Figure 3 shows what we get.
|

Figure 3. Browsing the spell index--click image for full-size screenshot
|
But the top hit is "letting," not "lettuce," which the web app presented us with. What's going on? The answer is that the Lucene Spell Checker ranks suggestions by edit distance, not by search relevance. The string "lettice" differs from "lettuce" by a single substitution, whereas "letting" is two substitutions away.
Supporting Composite Queries
SimpleSearchEngine supports composite queries--that is, queries that are composed of a set of clauses; for example, lettuce parsley, which means "find documents in which both of the words 'lettuce' and 'parsley' appear." As noted above, DidYouMeanSearchEngine with SimpleDidYouMeanParser only supports single-word queries, so let's see how we can fix it to support composite queries.
CompositeDidYouMeanParser is an implementation of DidYouMeanParser for use by DidYouMeanSearchEngine that supports composite queries. Recall that the DidYouMeanParser interface has a parse() method and a suggest() method, both of which take query strings and return Lucene Query objects. The implementation of parse() is simple: it uses Lucene's QueryParser, which has built-in support for composite queries. The implementation of suggest() is a little more tricky. It relies on the getFieldQuery() extensibility hook provided by QueryParser, so if a term (or a word in a phrase) is misspelled, then it is replaced with the best suggestion. If no terms (or words in a phrase) in the whole query are misspelled, then suggest() returns null.
Figure 4 is a screenshot for the misspelled composite query "lettice parslee."

Figure 4. Correcting the spelling of multiple query terms
Ensuring High-Quality Suggestions
Having a clever algorithm for detecting and correcting spelling
errors is a good start, but you need a good source of correctly
spelled words to ensure the suggestions are of a high quality. So
far, we have used the terms in the original index as the source of
words (by constructing a LuceneDictionary). There is a
downside to this approach: the content that was indexed will almost
certainly contain spelling errors, so there is a good chance that
certain query suggestions will be misspelled.
You might think that using a compiled word list might
help. However, even the largest dictionaries fall short in word
coverage for proper nouns and newly coined words (e.g., technical
phrases), so a correctly spelled query term that is not in the
dictionary will be incorrectly marked as a misspelling. The user
would then be prompted with a distracting alternative query
suggestion. (As a side note, Lucene Spell Checker provides an
implementation
of Dictionary, PlainTextDictionary, which
can read words from a word list such as /usr/dict/words
commonly found on Unix systems. Use this to do regular spell checking against a dictionary.)
Lucene Spell Checker provides a mechanism to solve this problem, while still using the original index as the source of words. The suggestSimilar() method of SpellChecker is overloaded to support secondary sorting of the suggested words by document frequency in an index;
for example:
spellChecker.suggestSimilar(queryText, 1, originalIndexReader, defaultField, true);
This call restricts suggestions to those words that are more popular (true) in the original index than the query term. On the plausible assumption that across the whole set of documents, misspellings are less common than the correctly spelled instances of the word, this modification will improve the quality of suggestions, even in document collections containing misspellings.
Zeitgeist
Large search engines use user queries for the source of suggestions. The logic is: if you don't understand what a user is asking for, compare it to what other users ask for, as someone else is likely to have searched for something similar.
To implement this strategy, each user query submitted to the system should be indexed in the spell index in order to provide a proper record of query frequencies. (All of the main search engines publish their most popular search terms, which are ultimately derived from such an index.) Then, by using the overloaded suggestSimilar() method introduced in the previous section, suggestions will be ranked firstly by edit distance and secondly by user popularity.
Conclusion
Spell checking users' search queries is a nice feature, and relatively easy to add to a Lucene-powered search application, as this article has shown. Most of the time, the corrections suggested are good ones, but there is plenty of ongoing research in the information retrieval community on improving spell check algorithms (see "References," below). I think we will continue to see the fruits of such research in open source libraries like Lucene Spell Checker.
References
- Download the code supporting this article. The distribution includes everything you need to get up and running quickly, including a pre-built index that you can unpack on your filesystem, and a WAR file to drop into your servlet container.
- I used Lucene 1.4.3 and Lucene Spell Checker 1.1 in this article.
- David Spencer, who created Lucene Spell Checker, has an interesting website dedicated to search experiments.
- Aspell and its Java cousin Jazzy are good general-purpose spell checkers.
- I leaned heavily on Thomas Risberg's thorough "Developing a Spring Framework MVC Application Step-by-Step" for building the Spring Search UI.
- A Technique for Computer Detection and Correction of Spelling Errors (Communications of the ACM, 7(3), 171 - 176; 1964), by
Fred J. Damerau, is an early paper on spelling correction.
- Techniques for Automatically Correcting Words in Text (ACM Computing Surveys, 24(4), 377-439; 1992), by
Karen Kukich, is the definitive survey article on spelling correction.
- The Art of Computer Programming, Volume 3, Sorting and Searching (Addison-Wesley; 1998), by Donald E. Knuth, contains the definitive description of the Soundex algorithm.
- "Spelling Correction for Search Engine Queries," by Bruno Martins and Mario J. Silva, solves the same problem as the present article using a set of heuristics to make spelling suggestions.
- Spelling Correction as an Iterative Process that Exploits the Collective Knowledge of Web Users (2004 Proceedings of EMNLP 2004 293-300), by S. Cucerzan and E. Brill, is an interesting paper that talks about how spelling correction can be driven from knowledge acquired by a search engine.
- Unsurprisingly, Google, Yahoo, and Microsoft all have a lot of published information on information retrieval.
Tom White is lead Java developer at Kizoom, a leading U.K. software company in the delivery of personalized travel information.