How to implement Field Does Not Exist in Lucene?

Apache Lucene is the core of many search engines and naturally supports queries like Field Exists for static fields. For example, the Lucene query “field_name:*” returns all documents that have field_name with any value. But a purely negative query such as “-field_name:*” does not work on its own, and hence there is no natural support for Field Does Not Exist.

If an application’s search engine is based on Apache Lucene and needs this functionality, the application must implement it itself.

Options

To solve this task, you need to add metadata on top of Lucene.

Option 1

Add an extra text field to every indexed document, e.g. field_names, with the string value “fieldName1 fieldName2 fieldName3 … fieldNameN”. In this case fieldName1 Does Not Exist can be expressed as the query “field_names does not contain fieldName1”. Simple code that demonstrates it:

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexableField;

// "writer" is an IndexWriter that has already been opened elsewhere.
Document doc = new Document();
doc.add(new TextField("title", "First Title", Field.Store.YES));
doc.add(new TextField("source", "wiki", Field.Store.YES));
doc.add(new TextField("author", "anonymous", Field.Store.YES));

// Add the extra meta field field_names whose value lists the names of all other fields.
StringBuilder sb = new StringBuilder();
for (IndexableField field : doc.getFields()) {
    sb.append(field.name());
    // Use ' ' as a separator
    sb.append(' ');
}
doc.add(new TextField("field_names", sb.toString(), Field.Store.NO));
writer.addDocument(doc);
writer.close();

As you can see, the space character ‘ ’ is used as a separator, and this is one of the problems of this solution: a field name that itself contains a space is split into several tokens. It’s possible to use Lucene Categories, which may have multiple values, but I don’t consider this option because categories are not meant for tasks like this one.

Now you can execute Field Exists and Field Does Not Exist queries:
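The separator problem can be caught before indexing. Below is a minimal sketch in plain Java (no Lucene dependency), assuming a hypothetical helper class FieldNamesValue that is not part of the original code. It builds the field_names value and rejects any field name containing whitespace. It only checks whitespace; an analyzer such as StandardAnalyzer may also split on some punctuation, which this sketch does not cover.

```java
import java.util.List;

public class FieldNamesValue {

    // Joins field names with a single space, the separator used for field_names.
    // Rejects empty names and names containing whitespace, because such a name
    // could not be stored as exactly one token.
    public static String join(List<String> fieldNames) {
        for (String name : fieldNames) {
            if (name.isEmpty() || name.chars().anyMatch(Character::isWhitespace)) {
                throw new IllegalArgumentException(
                        "Field name cannot be used as a token: '" + name + "'");
            }
        }
        return String.join(" ", fieldNames);
    }

    public static void main(String[] args) {
        System.out.println(join(List.of("title", "source", "author"))); // prints title source author
    }
}
```

The returned string could then be passed as the value of the field_names TextField instead of the StringBuilder result.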

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

// Query documents that don't have the "origin" field.
QueryParser queryParser = new QueryParser("field_names", new StandardAnalyzer());

// Note that you must allow leading wildcard to be able to execute this query.
queryParser.setAllowLeadingWildcard(true);
Query q = queryParser.parse("* AND -origin");

// Execute query and display result ....
System.out.println("Execute Query: " + q.toString() + " and display result");
// ....

Note that QueryParser is only one way to build the query; you can use other Lucene query classes to achieve the same result. In this case I wanted to demonstrate that “-origin” alone isn’t enough and the query must also include a wildcard: “* AND -origin”. The resulting query text (Query.toString()) is “+field_names:* -field_names:origin”.

You can find the implementation and tests at https://github.com/kuzminva/doesnotexist
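To see why the positive clause is needed, it may help to model the query as a set operation: the wildcard clause selects the candidate documents, and the negative clause only removes documents from that selection. The sketch below is a plain Java model of “+field_names:* -field_names:origin”, not Lucene code; the class FieldMissingModel and the in-memory map standing in for the index are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FieldMissingModel {

    // Models "+field_names:* -field_names:<name>": start from every document
    // that has at least one token in field_names, then drop those containing <name>.
    public static Set<String> docsMissingField(Map<String, Set<String>> fieldNamesByDoc,
                                               String name) {
        Set<String> hits = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : fieldNamesByDoc.entrySet()) {
            Set<String> names = entry.getValue();
            boolean selected = !names.isEmpty();      // positive clause: field_names:*
            boolean excluded = names.contains(name);  // negative clause: -field_names:<name>
            if (selected && !excluded) {
                hits.add(entry.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> index = new LinkedHashMap<>();
        index.put("doc1", Set.of("title", "source", "author"));
        index.put("doc2", Set.of("title", "origin"));
        System.out.println(docsMissingField(index, "origin")); // prints [doc1]
    }
}
```

Without the positive clause there would be no candidate set for the negative clause to subtract from, which is the reason “-origin” alone is not enough.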

Option 2

There is another way to achieve this. For every field in a document, you add one more meta field:

e_fieldName1=1,
e_fieldName2=1,
…,
e_fieldNameN=1

The prefix e_ and the value 1 are chosen only as an example. In this case Field Does Not Exist can be expressed as the Lucene clause “-e_fieldNameN:1” (combined, as in Option 1, with a positive clause that selects the candidate documents).
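The marker fields can be derived mechanically from the document's field names. Here is a minimal sketch in plain Java (no Lucene dependency), assuming a hypothetical helper class MarkerFields; the prefix e_ and the value 1 are the example values from the text, not a fixed convention.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class MarkerFields {

    static final String PREFIX = "e_"; // example prefix from the text
    static final String VALUE = "1";   // example value from the text

    // Returns one marker field (name -> value) for every real field name.
    public static Map<String, String> markersFor(List<String> fieldNames) {
        Map<String, String> markers = new LinkedHashMap<>();
        for (String name : fieldNames) {
            markers.put(PREFIX + name, VALUE);
        }
        return markers;
    }

    // Builds the negative clause that excludes documents having the given field.
    public static String doesNotExistClause(String fieldName) {
        return "-" + PREFIX + fieldName + ":" + VALUE;
    }

    public static void main(String[] args) {
        System.out.println(markersFor(List.of("title", "source"))); // prints {e_title=1, e_source=1}
        System.out.println(doesNotExistClause("origin"));           // prints -e_origin:1
    }
}
```

Each entry of the returned map would be added to the document as an extra indexed field next to the real one.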

This method was chosen to implement Field Exists and Field Does Not Exist in VMware vRealize Log Insight, VMware’s log search engine, which is similar to Elasticsearch. You can see a screenshot of it at the beginning of this post.

Pros and Cons

Option 1 gives more flexibility: you can even request all documents that have (or don’t have) fields whose names match a pattern, using prefixes, wildcards and other Lucene query features.

Option 2 doesn’t need to care about a separator. It is probably also the better and more scalable option if you are going to add or remove fields after the document has been committed and the document may have many fields.
