Showing posts with label lucene-4_10_3. Show all posts

Sunday, February 1, 2015

Initial study on Apache Lucene

Today, we are going to learn Apache Lucene. First things first: what is Apache Lucene?
Apache Lucene is a free, open-source information retrieval software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache License.

Let's go through an Apache Lucene "hello world" to get a basic idea of what it is. Go to the official site and download the latest release. Below is the tutorial from the official documentation, which I followed with slight modifications, using Apache Lucene 4.10.3 with Oracle Java 7.
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.IndexFiles
Usage: java org.apache.lucene.demo.IndexFiles [-index INDEX_PATH] [-docs DOCS_PATH] [-update]

This indexes the documents in DOCS_PATH, creating a Lucene indexin INDEX_PATH that can be searched with SearchFiles
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.IndexFiles -index data/ -docs docs/
Indexing to directory 'data/'...
adding docs/grouping/constant-values.html
adding docs/grouping/index.html
adding docs/grouping/allclasses-noframe.html
adding docs/grouping/overview-frame.html
adding docs/grouping/org/apache/lucene/search/grouping/AbstractGroupFacetCollector.html
...
...
...
adding docs/analyzers-phonetic/deprecated-list.html
adding docs/analyzers-phonetic/package-list
adding docs/analyzers-phonetic/allclasses-frame.html
95794 total milliseconds
jason@localhost:~/Desktop/lucene-4.10.3$ uptime
21:10:16 up 16:44, 23 users, load average: 5.45, 4.49, 3.59

As you can see, instead of indexing Java source files, I indexed the javadoc in HTML format, and it works nicely. Although my system was loaded (see the load average above), indexing was still reasonably quick: Apache Lucene finished in just under 96 seconds (95794 ms) for a total of 5818 files. Once indexing is done, listing the directory data shows the Lucene index files. If you want to go into detail about what these files are, you should read this documentation.
jason@localhost:~/Desktop/lucene-4.10.3$ ls -l data/
total 13784
-rw-r--r-- 1 jason jason 284 Jan 13 21:07 _0.cfe
-rw-r--r-- 1 jason jason 12387776 Jan 13 21:07 _0.cfs
-rw-r--r-- 1 jason jason 242 Jan 13 21:07 _0.si
-rw-r--r-- 1 jason jason 284 Jan 13 21:07 _1.cfe
-rw-r--r-- 1 jason jason 1677329 Jan 13 21:07 _1.cfs
-rw-r--r-- 1 jason jason 242 Jan 13 21:07 _1.si
-rw-r--r-- 1 jason jason 151 Jan 13 21:07 segments_1
-rw-r--r-- 1 jason jason 36 Jan 13 21:07 segments.gen
-rw-r--r-- 1 jason jason 0 Jan 13 21:06 write.lock
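
To make the listing above a little easier to read, here is a small sketch in plain Java (no Lucene dependency) that groups the file names by segment. The class name SegmentFileGrouper and the "(index-wide)" label are my own, made up for illustration. The role comments follow my reading of the Lucene 4.x file format documentation, so treat them as a rough guide rather than a specification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: groups index file names by segment.
class SegmentFileGrouper {

    // Label (my own choice) for files that do not belong to a single segment.
    static final String INDEX_WIDE = "(index-wide)";

    // Files named "_<segment>.<ext>" are grouped under "_<segment>";
    // everything else (segments_N, segments.gen, write.lock) is index-wide.
    static Map<String, List<String>> groupBySegment(List<String> fileNames) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            String key = INDEX_WIDE;
            int dot = name.indexOf('.');
            if (name.startsWith("_") && dot > 1) {
                key = name.substring(0, dot);
            }
            List<String> files = groups.get(key);
            if (files == null) {
                files = new ArrayList<>();
                groups.put(key, files);
            }
            files.add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        // File names copied from the ls output above.
        List<String> names = Arrays.asList(
            "_0.cfe",       // entry table for the compound file of segment _0
            "_0.cfs",       // compound file holding segment _0's index data
            "_0.si",        // segment info (metadata) for segment _0
            "_1.cfe", "_1.cfs", "_1.si",
            "segments_1",   // commit point listing the active segments
            "segments.gen", // helper file recording the current generation
            "write.lock");  // lock that keeps a second writer out
        for (Map.Entry<String, List<String>> e : groupBySegment(names).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Seen this way, the listing is two segments (_0 and _1) of three files each, plus three files that belong to the index as a whole.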

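Okay, now to the search. Note that SearchFiles looks for the index in a directory named index by default, so the first attempt below fails with NoSuchDirectoryException. The --help flag is not recognized either (the usage text is printed with -h), and the search only works once -index data is passed.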
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar  org.apache.lucene.demo.SearchFiles
Exception in thread "main" org.apache.lucene.store.NoSuchDirectoryException: directory '/home/jason/Desktop/lucene-4.10.3/index' does not exist
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:218)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:242)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:801)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:53)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:67)
at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:91)
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.SearchFiles --help
Exception in thread "main" org.apache.lucene.store.NoSuchDirectoryException: directory '/home/jason/Desktop/lucene-4.10.3/index' does not exist
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:218)
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:242)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:801)
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:53)
at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:67)
at org.apache.lucene.demo.SearchFiles.main(SearchFiles.java:91)
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.SearchFiles -h
Usage: java org.apache.lucene.demo.SearchFiles [-index dir] [-field f] [-repeat n] [-queries file] [-query string] [-raw] [-paging hitsPerPage]

See http://lucene.apache.org/core/4_1_0/demo/ for details.
jason@localhost:~/Desktop/lucene-4.10.3$ java -cp ./core/lucene-core-4.10.3.jar:./queryparser/lucene-queryparser-4.10.3.jar:./analysis/common/lucene-analyzers-common-4.10.3.jar:./demo/lucene-demo-4.10.3.jar org.apache.lucene.demo.SearchFiles -index data
Enter query:
string
Searching for: string
1674 total matching documents
1. docs/benchmark/org/apache/lucene/benchmark/byTask/utils/Format.html
2. docs/analyzers-common/org/apache/lucene/analysis/util/AbstractAnalysisFactory.html
3. docs/queryparser/deprecated-list.html
4. docs/queryparser/org/apache/lucene/queryparser/classic/class-use/ParseException.html
5. docs/queryparser/org/apache/lucene/queryparser/flexible/core/messages/QueryParserMessages.html
6. docs/core/org/apache/lucene/index/IndexFileNames.html
7. docs/analyzers-stempel/org/egothor/stemmer/Diff.html
8. docs/queryparser/org/apache/lucene/queryparser/ext/Extensions.html
9. docs/facet/org/apache/lucene/facet/FacetsConfig.html
10. docs/queryparser/org/apache/lucene/queryparser/flexible/messages/package-summary.html
Press (n)ext page, (q)uit or enter number to jump to a page.
n
11. docs/highlighter/org/apache/lucene/search/highlight/class-use/InvalidTokenOffsetsException.html
12. docs/queryparser/org/apache/lucene/queryparser/xml/DOMUtils.html
13. docs/queryparser/org/apache/lucene/queryparser/classic/MultiFieldQueryParser.html
14. docs/core/org/apache/lucene/index/SegmentInfo.html
15. docs/highlighter/org/apache/lucene/search/vectorhighlight/FragmentsBuilder.html
16. docs/highlighter/org/apache/lucene/search/vectorhighlight/class-use/FieldFragList.html
17. docs/highlighter/org/apache/lucene/search/vectorhighlight/BaseFragmentsBuilder.html
18. docs/queryparser/org/apache/lucene/queryparser/flexible/standard/QueryParserUtil.html
19. docs/highlighter/org/apache/lucene/search/highlight/GradientFormatter.html
20. docs/highlighter/org/apache/lucene/search/postingshighlight/PostingsHighlighter.html
Press (p)revious page, (n)ext page, (q)uit or enter number to jump to a page.
q
Enter query:
quit
Searching for: quit
2 total matching documents
1. docs/demo/src-html/org/apache/lucene/demo/SearchFiles.html
2. docs/changes/Changes.html
Press (q)uit or enter number to jump to a page.
q
Enter query:
^Cjason@localhost:~/Desktop/lucene-4.10.3$
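
For intuition about why a term lookup like the ones above can be answered without rescanning every file, here is a toy inverted index in plain Java. This is only a sketch under my own assumptions: the class ToyInvertedIndex, its methods, and the sample text are invented for illustration, and it is not how Lucene is implemented (Lucene adds analysis, scoring, segments, and much more on top of the basic idea).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy illustration only, not Lucene's implementation.
class ToyInvertedIndex {

    // term -> sorted set of document paths containing that term
    private final Map<String, TreeSet<String>> postings = new HashMap<>();

    // Splits the text into lowercase alphanumeric tokens and records the document under each.
    void add(String docPath, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token when text starts with a separator
            }
            TreeSet<String> docs = postings.get(token);
            if (docs == null) {
                docs = new TreeSet<>();
                postings.put(token, docs);
            }
            docs.add(docPath);
        }
    }

    // A search is a single map lookup; no document is re-read.
    List<String> search(String term) {
        TreeSet<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        if (docs == null) {
            return Collections.emptyList();
        }
        return new ArrayList<>(docs);
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        // Paths are taken from the session above; the text is made up for the example.
        index.add("docs/demo/src-html/org/apache/lucene/demo/SearchFiles.html",
                  "Press (q)uit to quit, or enter a query string");
        index.add("docs/changes/Changes.html", "quit early when the string is empty");
        index.add("docs/core/org/apache/lucene/index/SegmentInfo.html", "returns a String");

        System.out.println("string: " + index.search("string"));
        System.out.println("quit:   " + index.search("quit"));
    }
}
```

The work of reading and tokenizing the files is paid once at indexing time, which is the trade-off the long IndexFiles run and the fast SearchFiles session reflect.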

The search is quick even on a loaded system. That's it: a light learning experience with Apache Lucene.