Friday, June 19, 2015

Learn lucene term range query

Today, we are going to learn Lucene's term range query. But first, what actually is a term range query? The official javadoc defines it as follows:

A Query that matches documents within an range of terms.

This query matches the documents looking for terms that fall into the supplied range according to Byte.compareTo(Byte). It is not intended for numerical ranges; use NumericRangeQuery instead.

This query uses the MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT rewrite method.

So each term is compared byte by byte against the two ends of the range, and because the comparison is byte by byte, the ordering is lexicographic. If you want to find a range between two numbers, this is not the class you should use. Okay, if this is not clear yet, let's go into the code, shall we?
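Before touching Lucene, the lexicographic idea can be tried out in plain Java. This is only a minimal sketch: the class LexicographicOrderDemo and its method between() are made up for illustration, and String.compareTo stands in for Lucene's byte-wise term comparison (for plain ASCII names like these, the two orderings should agree).

```java
// Minimal sketch: String.compareTo stands in for Lucene's byte-wise term comparison.
// The class and method names here are hypothetical, not part of Lucene.
class LexicographicOrderDemo {

    // True when lower <= value <= upper, compared character by character.
    static boolean between(String value, String lower, String upper) {
        return lower.compareTo(value) <= 0 && value.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String lower = "record2.txt";
        String upper = "record6.txt";
        String[] names = {"record10.txt", "record22.txt", "record3.txt", "record7.txt"};
        for (String name : names) {
            System.out.println(name + " in range: " + between(name, lower, upper));
        }
    }
}
```

Here record22.txt lands inside the range because '2' sorts after '.', while record10.txt falls outside because '1' sorts before '2'. The numbers inside the names are never read as numbers.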

As you know, Lucene has two parts: first the indexing (write) part and then the search (query) part. So in this article, we are going to index and then query using a term range query. To give you an overview of this article, we have four classes.

  • LuceneConstants - just a setting class for this application.
  • Indexer - the class that does the indexing. 
  • Searcher - the class that does the search.
  • LearnTermRangeQuery - our main entry class that binds the above three classes into one.

We create an object, tester, for this learning journey. We then create the index by calling the method createIndex() and then search the index using a term range query.


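    LearnTermRangeQuery tester;

    try {
        tester = new LearnTermRangeQuery();
        tester.createIndex();
        // The two file names are the lower and upper borders of the range.
        tester.searchUsingTermRangeQuery("record2.txt", "record6.txt");
    } catch (Exception e) {
        // Report the failure instead of silently swallowing it.
        e.printStackTrace();
    }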

In the method createIndex(), I use a lambda, which you can spot by the arrow symbol (->), so you need Java 8 installed. There are two variables, indexDir and dataDir. indexDir is the directory where the created index will reside, whilst dataDir holds the sample data to be indexed. In the class Indexer, the method getDocument() is essentially where each sample file is turned into a document for indexing. Nothing fancy, just an ordinary Lucene document with three fields: filename, filepath and file content.

Back to the class LearnTermRangeQuery and its method searchUsingTermRangeQuery(). Notice that we search the range with two file names as the borders. We initialize a Lucene directory object and pass it to the index searcher object. Everything else about the index searcher is standard. We construct the TermRangeQuery and pass it to the searcher object. The results are then shown and finally everything is closed.

Below is the sample output from the Eclipse console.

 record 21.txt  
 src/resources/samples.termrange/record 21.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 21.txt  
 record 33 .txt  
 src/resources/samples.termrange/record 33 .txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 33 .txt  
 record10.txt  
 src/resources/samples.termrange/record10.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record10.txt  
 record7.txt  
 src/resources/samples.termrange/record7.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record7.txt  
 record6.txt  
 src/resources/samples.termrange/record6.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record6.txt  
 record9.txt  
 src/resources/samples.termrange/record9.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record9.txt  
 record33.txt  
 src/resources/samples.termrange/record33.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record33.txt  
 record2.txt  
 src/resources/samples.termrange/record2.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record2.txt  
 record5.txt  
 src/resources/samples.termrange/record5.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record5.txt  
 record 33.txt  
 src/resources/samples.termrange/record 33.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 33.txt  
 record3.txt  
 src/resources/samples.termrange/record3.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record3.txt  
 record8.txt  
 src/resources/samples.termrange/record8.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record8.txt  
 record2.1.txt  
 src/resources/samples.termrange/record2.1.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record2.1.txt  
 record1.txt  
 src/resources/samples.termrange/record1.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record1.txt  
 record4.txt  
 src/resources/samples.termrange/record4.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record4.txt  
 record22.txt  
 src/resources/samples.termrange/record22.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record22.txt  
 16 File indexed, time taken: 800 ms  
 6 documents found. Time :74ms  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record33.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record2.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record5.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record3.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record4.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record22.txt  
   

As you can see above, the results are not correct if you read the file names numerically, from record2.txt to record6.txt: record22.txt and record33.txt are returned, while record6.txt itself is missing. So always experiment with a few values before you implement. Hehe, have fun! You can get the source for this code at my GitHub.
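To see where those six hits come from, here is a rough simulation in plain Java that filters the 16 sample file names by lexicographic range, with no Lucene involved. Treat it as a sketch under assumptions: the class TermRangeSimulation is made up for illustration, String.compareTo stands in for Lucene's byte-wise comparison, and the lower border is taken as inclusive and the upper border as exclusive. That last part is my guess from the output (record2.txt is returned, record6.txt is not), not something confirmed by the code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of Lucene or the original project.
class TermRangeSimulation {

    // The 16 file names from the sample output, in the order they were indexed.
    static final List<String> FILE_NAMES = Arrays.asList(
        "record 21.txt", "record 33 .txt", "record10.txt", "record7.txt",
        "record6.txt", "record9.txt", "record33.txt", "record2.txt",
        "record5.txt", "record 33.txt", "record3.txt", "record8.txt",
        "record2.1.txt", "record1.txt", "record4.txt", "record22.txt");

    // Keeps names with lower <= name < upper, compared character by character.
    // Assumption: lower border inclusive, upper border exclusive.
    static List<String> matches(String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String name : FILE_NAMES) {
            if (name.compareTo(lower) >= 0 && name.compareTo(upper) < 0) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> hits = matches("record2.txt", "record6.txt");
        System.out.println(hits.size() + " documents found.");
        hits.forEach(name -> System.out.println("File : " + name));
    }
}
```

Under those assumptions the simulation keeps the same six names as the search above. Names with a space after "record" sort before the lower border, and record2.1.txt also sorts before record2.txt because '1' comes before 't'.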
