Skip to main content

QueryParser Rules

November 6, 2003

{cs.r.title}






Suppose you are searching the Web for pages that contain both the words java and net but not the word dot. What if search engines made you type in something like the following for this simple query?

BooleanQuery query = new BooleanQuery();
query.add(new TermQuery(new Term("contents","java")), true, false);
query.add(new TermQuery(new Term("contents", "net")), true, false);
query.add(new TermQuery(new Term("contents", "dot")), false, true);

That would be a real drag. Thankfully, Google, Nutch, and other search engines are friendlier than that, allowing you to enter something much more succinct:

java AND net NOT dot

This article assumes you have some basic understanding of Lucene, including building and querying an index. You may want to take a look back at the Lucene Intro published in July on java.net. The focus here is on Lucene's QueryParser, providing an in-depth look at its capabilities and caveats.

First we'll see what is involved to use QueryParser in an application. Then, Lucene's Query API is shown in relation to the corresponding QueryParser syntax. Elaboration on the details of QueryParser syntax is then followed by how QueryParser's features can be customized.

Using QueryParser

Using QueryParser is quite straightforward. Three things are needed: an expression, the default field name to use for unqualified fields in the expression, and an analyzer to pieces of the expression.

Field-selection qualifiers are discussed in the query syntax section. Analysis, specific to query parsing, is covered in the "Analysis paralysis" section. For a more detailed look at Lucene analyzers, refer to the Lucene Intro article.

Now, let's parse an expression:

String humanQuery = getHumanQuery();
Query query = QueryParser.parse(humanQuery, "contents", new StandardAnalyzer());

Once you've obtained a Query object, searching is done the same as if the query had been created directly through the API. From the intro article, here is a full method to search an existing index with a user-entered query string, and display the results to the console:

public static void search(File indexDir, String q)  throws Exception{
  Directory fsDir = FSDirectory.getDirectory(indexDir, false);
  IndexSearcher is = new IndexSearcher(fsDir);

  Query query = QueryParser.parse(q, "contents", new StandardAnalyzer());
  Hits hits = is.search(query);
  System.out.println("Found " + hits.length() +
                    " document(s) that matched query '" + q + "':");
  for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    System.out.println(doc.get("filename"));
  }
}

Expressions handed to QueryParser are parsed according to a simple grammar. When an illegal expression is encountered, QueryParser throws a ParseException.

Lucene's Query API

When a human-readable query is parsed by Lucene's QueryParser, it is converted to a single concrete subclass of the Query class. In the previous article of this Lucene series, the discussion on searching glossed over the Query subclasses by simply using QueryParser to do the Right Thing. However, we need some understanding of the underlying concrete Query subclasses. The relevant subclasses, their purpose, and some example expressions for each are listed in the following table:

Query Implementation Purpose Sample expressions
TermQuery Single term query, which effectively is a single word. reynolds
PhraseQuery A match of several terms in order, or in near vicinity to one another. "light up ahead"
RangeQuery Matches documents with terms between beginning and ending terms, including or excluding the end points. [A TO Z]
{A TO Z}
WildcardQuery Lightweight, regular-expression-like term-matching syntax. j*v?
f??bar
PrefixQuery Matches all terms that begin with a specified string. cheese*
FuzzyQuery Levenshtein algorithm for closeness matching. tree~
BooleanQuery Aggregates other Query instances into complex expressions allowing AND, OR, and NOT logic. reynolds AND "light up ahead"
cheese* -cheesewhiz

All of these Query implementations are in the org.apache.lucene.search package. The BooleanQuery is a bit of a special case because it is a Query container that aggregates other queries (including nested BooleanQuerys for sophisticated expressions).

Here is a BooleanQuery based on the query snippet we used at the beginning of this article. Utilizing "Test-Driven Learning with JUnit," we can see how the QueryParser-created query is equivalent to the API-created one:

public class RulezTest extends TestCase {
  public void testJavaNotDotNet() throws Exception {
    BooleanQuery apiQuery = new BooleanQuery();
    apiQuery.add(new TermQuery(new Term("contents", "java")), true, false);
    apiQuery.add(new TermQuery(new Term("contents", "net")), true, false);
    apiQuery.add(new TermQuery(new Term("contents", "dot")), false, true);

    Query qpQuery = QueryParser.parse("java AND net NOT dot", "contents", new StandardAnalyzer());

    // Query and subclasses behave as expected with .equals
    assertEquals(qpQuery, apiQuery);
  }
}

Some interesting features of the Query classes are their toString methods. Each Query subclass generates the equivalent (though not necessarily textually exact) QueryParser expression. There are two variants: one is the standard Object.toString overridden method, and the other accepts the default field name. The following test case demonstrates how these two methods work, and illustrates how an equivalent (yet not the exact) expression is returned .

  public void testToString() throws Exception {
    Query query = QueryParser.parse("java AND net NOT dot", "contents", new StandardAnalyzer());

    assertEquals("+java +net -dot", query.toString("contents"));

    assertEquals("+contents:java +contents:net -contents:dot", query.toString());
  }

Notice that the expression parsed was java AND net NOT dot, but the expression returned from the toString methods used the abbreviated syntax +java +net -dot. Our first test case (testJavaNotDotNet) demonstrated that the underlying query objects themselves are equivalent.

The no-arg toString method makes no assumptions about the field names for each term, and specifies them explicitly using field-selector syntax. Using these toString methods is handy for diagnosing QueryParser issues.

QueryParser expression syntax

The following items in this section describe the syntax QueryParser supports to create the various query types.

Single-term query

A query string of only a single word is converted to an underlying TermQuery.

Phrase query

To search for a group of words together in a field, surround the words with double-quotes. The query "hello world" corresponds to an exact phrase match, requiring "hello" and "world" to be successive terms for a match.

Lucene also supports sloppy phrase queries, where the terms between quotes do not necessarily have to be in the exact order. The slop factor measures against how many moves it takes to rearrange the terms into the exact order. If the number of moves is less than a specified slop factor, it is a match. QueryParser parses the expression "hello world"~2 as a PhraseQuery with a slop factor of 2, allowing matches on the phrases "world hello", "hello world", "hello * world", and "hello * * world", where the asterisks represent irrelevant words in the index. Note that "world * hello" does not match with a slop factor of 2. Why? Because the number of moves to get that back to "hello world" is 3. Hopping the word "world" to the asterisk position is one, to the "hello" position is two, and the third hop makes the exact match.

Range query

Text or date range queries use bracketed syntax, with TO between the beginning term and ending term. The type of bracket determines whether the range is inclusive (square brackets) or exclusive (curly brackets).

NOTES: Non-date range queries use the start and end terms as the user entered them without modification. In the case of {Aardvark TO Zebra}, the terms are not lowercased. Start and end terms must not contain whitespace, or parsing fails; only single words are allowed. The analyzer is not run on the start and end terms.

Date range handling

When a range query (such as [1/1/03 TO 12/31/03]) is encountered, the parser code first attempts to convert the start and end terms to dates. If the terms are valid dates, according to DateFormat.SHORT and lenient parsing, then the dates are converted to their internal textual representation (however, date field indexing is beyond the scope of this article). If either of the two terms fails to parse as a valid date, they both are used as is for a textual range.

Wildcard and prefix queries

If a term contains an asterisk or question mark, it is considered a WildcardQuery, except when the term only contains a trailing asterisk and QueryParser optimizes it to a PrefixQuery instead. While the WildcardQuery API itself supports a leading wildcard character, the QueryParser does not allow it. An example wildcard query is w*ldc?rd, whereas the query prefix* is optimized as a PrefixQuery.

Fuzzy query

Lucene's FuzzyQuery matches terms close to a specified term. The Levenshtein distance algorithm determines how close terms in the index are to a specified target term. "Edit distance" is another term for "Levenshtein distance," and is a measure of similarity between two strings, where distance is measured as the number of character deletions, insertions, or substitutions required to transform one string to the other string. For example, the edit distance between "three" and "tree" is one, as only one character deletion is needed. The number of moves is used in a threshold calculation, which is ratio of distance to string length.

QueryParser supports fuzzy-term queries using a trailing tilde on a term. For example, searching for wuzza~ will find documents that contain "fuzzy" and "wuzzy". Edit distance affects scoring, such that lower edit distances score higher.

Boolean query

Constructing Boolean queries textually is done using the operators AND, OR, and NOT. Terms listed without an operator specified use an implicit operator, which by default is OR. A query of abc xyz will be interpreted as abc OR xyz. Placing a NOT in front of a term excludes documents containing the following term. Negating a term must be combined with at least one non-negated term to return documents. Each of the uppercase word operators has shortcut syntax shown in the following table.

Verbose syntax Shortcut syntax
a AND b +a +b
a OR b a b
a AND NOT b +a -b

Other syntax

Beyond constructing the different query types, other syntax is needed to group expressions, select a field, or escape characters.

Grouping

Parentheses are used to group expressions, allowing the formation of complex queries. The query (search OR indexing) AND lucene translates to a BooleanQuery with two clauses: the first is a BooleanQuery for the parenthetical expression, and the second is a term query for the "lucene" term.

Field selection
QueryParser needs to know the field name to use when constructing queries, but it would generally be unfriendly to require users to identify the field to search (as the end user may not need or want to know the field names). As we've seen, the default field name is provided to the parse method. Parsed queries are not restricted, however, to only searching the default field. Using field selector notation, terms in other fields can be specified. For example, in the case that HTML documents are indexed with the title and body areas as separate fields, the default field would likely be body. Users could search for title fields using a query such as title:lucene syntax.

Field selection can be grouped over several terms using field:(a b c) syntax.

Special character escaping

Whenever special characters are used in expressions, there is a need to provide an escaping mechanism so that the special characters themselves can be used in a normal fashion. QueryParser uses a backslash to escape special characters within terms. The list of escapable characters is:

\, +, -, !, (, ), :, ^, ], {, }, ~, *, ?

Analysis paralysis

QueryParser uses an analyzer to tokenize pieces of the expression into terms. The entire expression is not passed through the analyzer; only parts that are parsed as term text and not special expression syntax (such as AND, OR, +, -, and so on). Analyzing query text makes sense, as the original text was analyzed during the indexing phase -- or was it? Therein lies a dilemma. Not all fields were necessarily analyzed during indexing, yet users of your application could use field selectors to search on unanalyzed (not tokenized) fields that had been created using Field.Keyword(). This is a thorny issue that requires elaboration beyond the scope of this article. Analysis is perhaps the most important decision to make when developing applications with Lucene. The upcoming final release of Lucene 1.3 includes one possible solution to consider, called PerFieldAnalyzerWrapper.

Customizing QueryParser

So far, we've been using the static parse method on QueryParser. Using this static method uses several default settings. These settings can be changed by actually instantiating QueryParser and calling the setters listed in the following table.

Setting Default value Purpose
setLocale(Locale) default locale Defines the desired locale for date range query parsing.
setLowercaseWildcardTerms(boolean) false Controls whether wildcard and prefix query terms are lowercased or not.
setOperator(int) DEFAULT_OPERATOR_OR Specifies how terms without an explicit Boolean operator are handled. Set to DEFAULT_OPERATOR_AND for AND as the implicit operator.
setPhraseSlop(int) 0 (exact match) Declares how exact expressions within double-quotes must be to match. Zero is an exact match.

This test case shows how the lowercasing feature operates:

public void testLowercasing() throws Exception {
  Query q = QueryParser.parse("PrefixQuery*", "field", new StandardAnalyzer());
  assertEquals("lowercased", "prefixquery*", q.toString("field"));

  QueryParser qp = new QueryParser("field", new StandardAnalyzer());
  qp.setLowercaseWildcardTerms(false);
  q = qp.parse("PrefixQuery*");
  assertEquals("not lowercased", "PrefixQuery*", q.toString("field"));
}

Beyond these built-in switches, QueryParser can be subclassed. The methods listed in the following table can be overridden to achieve the mentioned effects.

Method Why override?
getFieldQuery(String field, Analyzer analyzer, String queryText) This method is responsible for the construction of either a TermQuery or PhraseQuery. If special analysis is needed, or a unique type of query is desired, override this method.
getFuzzyQuery(String field, String termStr) Fuzzy queries can adversely affect performance. Override and throw a ParseException to disallow fuzzy queries.
getPrefixQuery(String field, String termStr) This method is used to construct a query when the term ends with an asterisk. The term string handed to this method does not include the trailing asterisk and is not analyzed. Override this method to perform any desired analysis.
getRangeQuery(String field, Analyzer analyzer, String start, String end, boolean inclusive) Overriding could:

  • Lowercase the start and end terms.
  • Use a different date format.
  • Handle number ranges by padding to match how numbers are indexed.
getWildcardQuery(String field, String termStr) Wildcard queries can adversely affect performance, so overridden methods could throw a ParseException to disallow. The term string is not analyzed, so perhaps special handling is desired.

Here's a simple subclass to prohibit fuzzy and wildcard queries:

public class RestrictedQueryParser extends QueryParser {
  public RestrictedQueryParser(String field, Analyzer analyzer) {
    super(field, analyzer);
  }

  final protected Query getWildcardQuery(String field, String termStr)
    throws ParseException {
    throw new ParseException("Wildcard not allowed");
  }

  final protected Query getFuzzyQuery(String field, String termStr)
    throws ParseException {
    throw new ParseException("Fuzzy queries not allowed");
  }

}

To QueryParse or not to QueryParse?

QueryParser is a quick and effortless way to give users powerful query construction, but it is not for everyone. QueryParser cannot create every type of query that can be constructed using the API. For instance, a PhrasePrefixQuery cannot be constructed. You must keep in mind that all of the possibilities available when exposing freeform query parsing to an end user. Some queries have the potential for performance bottlenecks. The syntax used by the built-in QueryParser may not be suitable for your needs. Some control is possible with subclassing QueryParser, though it is still limited.

Creating a custom query parser is well within reach. Look to JavaCC or other parser technologies, such as ANTLR, for creating your own, should QueryParser be insufficient.

Wrap-up

Lucene's QueryParser is a powerful and relatively flexible way to allow users to enter rich queries into your application. It is designed for human-entered queries. For queries that originate from your application code, it is best to use Lucene's Query API instead.

Without a QueryParser, Lucene would be much tougher to approach and integrate into applications. Even though what QueryParser does is a layer above Lucene's core functionality, thankfully, it is an included feature of this marvelous gem.

Erik Hatcher is the co-author of the premiere book on Ant, Java Development with Ant (published by Manning), and is co-author of "Lucene in Action".
Related Topics >> Search   |