Showing posts with label apache lucene. Show all posts

Friday, June 19, 2015

Learn lucene term range query

Today, we are going to learn the lucene term range query. But first, what actually is a lucene term range query? From the official javadoc definition:

A Query that matches documents within an range of terms.

This query matches the documents looking for terms that fall into the supplied range according to Byte.compareTo(Byte). It is not intended for numerical ranges; use NumericRangeQuery instead.

This query uses the MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT rewrite method.

So each term is compared byte by byte against the two ends of the range, and because the comparison is byte by byte, the ordering is lexicographic. If you intend to find a range between two numbers, this is not the class you should use. Okay, if this is not clear, let's go into the code, shall we?
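Before the real code, here is a minimal sketch of what a byte by byte comparison means for numbers stored as text. It uses only the JDK, not lucene; the class name ByteOrderDemo and the helper compareTerms are made up for this illustration and only approximate how terms are ordered.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {

    // Hypothetical helper: compare two terms by their UTF-8 bytes,
    // treating each byte as an unsigned value.
    static int compareTerms(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff; // first differing byte decides the order
            }
        }
        return x.length - y.length; // a shorter prefix sorts first
    }

    public static void main(String[] args) {
        // As text, "10" sorts before "9" because '1' (0x31) < '9' (0x39).
        System.out.println(compareTerms("10", "9") < 0);   // prints true
        // As numbers, 10 is of course greater than 9.
        System.out.println(Integer.compare(10, 9) > 0);    // prints true
    }
}
```

So a text range from "2" to "6" would happily contain "22" and "33", which is exactly the kind of surprise to watch for.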

As you know, lucene has two parts: first the indexing (write) part and then the search (query) part. So in this article, we are going to index and query using term range query. To give you an overview of this article, we have four classes.

  • LuceneConstants - just a setting class for this application.
  • Indexer - the class that does the indexing. 
  • Searcher - the class that does the search.
  • LearnTermRangeQuery - our main entry class to bind the above three classes into one.
We create an object, tester, for this learning journey. We then create the index by calling the method createIndex() and then search the index using term range query.


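  LearnTermRangeQuery tester;

  try {
     tester = new LearnTermRangeQuery();
     // build the index from the sample files
     tester.createIndex();
     // search for file names between the two borders
     tester.searchUsingTermRangeQuery("record2.txt", "record6.txt");
  } catch (Exception e) {
     // report the failure instead of silently swallowing it
     e.printStackTrace();
  }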

In the method createIndex(), I use a lambda, which you can spot by the arrow symbol, so you need to have java8 installed. There are two variables, indexDir and dataDir. The variable indexDir is the directory where the created index will reside, whilst dataDir holds the sample data to be indexed. In the class Indexer, the method getDocument() essentially indexes all the sample documents. Nothing fancy, it just creates an ordinary lucene document with three fields: file name, file path and file content.

Back to the class LearnTermRangeQuery and its method searchUsingTermRangeQuery(). Notice that we search the range with two file names as the borders. We initialize a lucene directory object and pass it to the index searcher object. Everything else for the lucene index searcher is just standard. We construct the TermRangeQuery and pass it to the searcher object. The results are then shown and the searcher is eventually closed.

Below is the sample output from the eclipse console.

 record 21.txt  
 src/resources/samples.termrange/record 21.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 21.txt  
 record 33 .txt  
 src/resources/samples.termrange/record 33 .txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 33 .txt  
 record10.txt  
 src/resources/samples.termrange/record10.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record10.txt  
 record7.txt  
 src/resources/samples.termrange/record7.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record7.txt  
 record6.txt  
 src/resources/samples.termrange/record6.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record6.txt  
 record9.txt  
 src/resources/samples.termrange/record9.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record9.txt  
 record33.txt  
 src/resources/samples.termrange/record33.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record33.txt  
 record2.txt  
 src/resources/samples.termrange/record2.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record2.txt  
 record5.txt  
 src/resources/samples.termrange/record5.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record5.txt  
 record 33.txt  
 src/resources/samples.termrange/record 33.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 33.txt  
 record3.txt  
 src/resources/samples.termrange/record3.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record3.txt  
 record8.txt  
 src/resources/samples.termrange/record8.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record8.txt  
 record2.1.txt  
 src/resources/samples.termrange/record2.1.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record2.1.txt  
 record1.txt  
 src/resources/samples.termrange/record1.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record1.txt  
 record4.txt  
 src/resources/samples.termrange/record4.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record4.txt  
 record22.txt  
 src/resources/samples.termrange/record22.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record22.txt  
 16 File indexed, time taken: 800 ms  
 6 documents found. Time :74ms  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record33.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record2.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record5.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record3.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record4.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record22.txt  
   

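As you can see above, the results are not what you would expect if you read the file names numerically, from record2.txt to record6.txt. Because the comparison is lexicographic, record22.txt and record33.txt fall inside the range, and record6.txt itself is missing from the hits. So always experiment with a few values before you implement. Hehe, have fun! You can get the source for this code at my github.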
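To see why those six files come back, here is a small sketch that replays the range check with plain java strings instead of lucene. The class name RangeSimulation and the method inRange are made up for this illustration. It assumes the file name is indexed as a single untokenized term, and that the lower border is inclusive and the upper border exclusive; the second assumption is only a guess from record6.txt being absent in the output above, not something taken from the actual source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RangeSimulation {

    // The 16 file names, in the order they were indexed in the output above.
    static final List<String> FILE_NAMES = Arrays.asList(
            "record 21.txt", "record 33 .txt", "record10.txt", "record7.txt",
            "record6.txt", "record9.txt", "record33.txt", "record2.txt",
            "record5.txt", "record 33.txt", "record3.txt", "record8.txt",
            "record2.1.txt", "record1.txt", "record4.txt", "record22.txt");

    // Keep names with lower <= name < upper (assumed bounds, see above).
    // String.compareTo orders by UTF-16 code unit, which gives the same
    // order as a byte comparison for these ASCII-only names.
    static List<String> inRange(List<String> names, String lower, String upper) {
        List<String> hits = new ArrayList<>();
        for (String name : names) {
            if (name.compareTo(lower) >= 0 && name.compareTo(upper) < 0) {
                hits.add(name);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Same borders as the call to searchUsingTermRangeQuery.
        for (String hit : inRange(FILE_NAMES, "record2.txt", "record6.txt")) {
            System.out.println(hit);
        }
    }
}
```

Under those assumptions it prints record33.txt, record2.txt, record5.txt, record3.txt, record4.txt and record22.txt, the same six files as the search. Note that record2.1.txt is left out because the character '1' sorts before 't', and the names with a space are left out because a space sorts before every digit.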

Sunday, February 1, 2015

Initial study on apache lucene

Today, we are going to learn apache lucene. So first things first, what is apache lucene?
Apache Lucene is a free open source information retrieval software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache License.

Let's go into the apache lucene "hello world" so we get a basic idea of what it is. Go to the official site and download the latest release. Below is the tutorial I followed from the official documentation, with slight modifications, using apache lucene version 4.10.3 and oracle java 7.
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.IndexFiles
Usage: java org.apache.lucene.demo.IndexFiles [-index INDEX_PATH] [-docs DOCS_PATH] [-update]

This indexes the documents in DOCS_PATH, creating a Lucene indexin INDEX_PATH that can be searched with SearchFiles
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.IndexFiles -index data/ -docs docs/
Indexing to directory 'data/'...
adding docs/grouping/constant-values.html
adding docs/grouping/index.html
adding docs/grouping/allclasses-noframe.html
adding docs/grouping/overview-frame.html
adding docs/grouping/org/apache/lucene/search/grouping/AbstractGroupFacetCollector.html
...
...
...
adding docs/analyzers-phonetic/deprecated-list.html
adding docs/analyzers-phonetic/package-list
adding docs/analyzers-phonetic/allclasses-frame.html
95794 total milliseconds
jason@localhost:~/Desktop/lucene-4.10.3$ uptime
21:10:16 up 16:44, 23 users, load average: 5.45, 4.49, 3.59

As you can see, instead of indexing java source files, I indexed the javadoc in html format and it works nicely. Although my system was loaded, the indexing was still reasonably quick: apache lucene finished indexing a total of 5818 files in about 95 seconds. After the indexing is done, if you list the directory data, you will notice the lucene index files. If you want to go into the details of what these files are, you should read this documentation.
jason@localhost:~/Desktop/lucene-4.10.3$ ls -l data/
total 13784
-rw-r--r-- 1 jason jason 284 Jan 13 21:07 _0.cfe
-rw-r--r-- 1 jason jason 12387776 Jan 13 21:07 _0.cfs
-rw-r--r-- 1 jason jason 242 Jan 13 21:07 _0.si
-rw-r--r-- 1 jason jason 284 Jan 13 21:07 _1.cfe
-rw-r--r-- 1 jason jason 1677329 Jan 13 21:07 _1.cfs
-rw-r--r-- 1 jason jason 242 Jan 13 21:07 _1.si
-rw-r--r-- 1 jason jason 151 Jan 13 21:07 segments_1
-rw-r--r-- 1 jason jason 36 Jan 13 21:07 segments.gen
-rw-r--r-- 1 jason jason 0 Jan 13 21:06 write.lock

Okay, now to the search.
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar  org.apache.lucene.demo.SearchFiles
Exception in thread "main" org.apache.lucene.store.NoSuchDirectoryException: directory '/home/jason/Desktop/lucene-4.10.3/index' does not exist
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:218)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:242)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:801)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:53)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:67)
at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:91)
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.SearchFiles --help
Exception in thread "main" org.apache.lucene.store.NoSuchDirectoryException: directory '/home/jason/Desktop/lucene-4.10.3/index' does not exist
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:218)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:242)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:801)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:53)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:67)
at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:91)
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.SearchFiles -h
Usage: java org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]

See http://lucene.apache.org/core/4_1_0/demo/ for details.
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.SearchFiles -index data
Enter query:
string
Searching for: string
1674 total matching documents
1. docs/benchmark/org/apache/lucene/benchmark/byTask/utils/Format.html
2. docs/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html
3. docs/queryparser/deprecated-list.html
4. docs/queryparser/org/apache/lucene/queryparser/classic/class-use/ParseException.html
5. docs/queryparser/org/apache/lucene/queryparser/flexible/core/messages/QueryParserMessages.html
6. docs/core/org/apache/lucene/index/IndexFileNames.html
7. docs/analyzers-stempel/org/egothor/stemmer/Diff.html
8. docs/queryparser/org/apache/lucene/queryparser/ext/Extensions.html
9. docs/facet/org/apache/lucene/facet/FacetsConfig.html
10. docs/queryparser/org/apache/lucene/queryparser/flexible/messages/package-summary.html
Press (n)ext page, (q)uit or enter number to jump to a page.
n
11. docs/highlighter/org/apache/lucene/search/highlight/class-use/InvalidTokenOffsetsException.html
12. docs/queryparser/org/apache/lucene/queryparser/xml/DOMUtils.html
13. docs/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html
14. docs/core/org/apache/lucene/index/SegmentInfo.html
15. docs/highlighter/org/apache/lucene/search/vectorhighlight/FragmentsBuilder.html
16. docs/highlighter/org/apache/lucene/search/vectorhighlight/class-use/FieldFragList.html
17. docs/highlighter/org/apache/lucene/search/vectorhighlight/BaseFragmentsBuilder.html
18. docs/queryparser/org/apache/lucene/queryparser/flexible/standard/QueryParserUtil.html
19. docs/highlighter/org/apache/lucene/search/highlight/GradientFormatter.html
20. docs/highlighter/org/apache/lucene/search/postingshighlight/PostingsHighlighter.html
Press (p)revious page, (n)ext page, (q)uit or enter number to jump to a page.
q
Enter query:
quit
Searching for: quit
2 total matching documents
1. docs/demo/src-html/org/apache/lucene/demo/SearchFiles.html
2. docs/changes/Changes.html
Press (q)uit or enter number to jump to a page.
q
Enter query:
^Cjason@localhost:~/Desktop/lucene-4.10.3$

The search is quick even on a loaded system. That's it, a light learning experience on apache lucene.