Sunday, July 5, 2015

Check out what a Python package is

It's been a while since I started learning Python, and today I would like to check out what a Python package is. These two references define a Python package pretty clearly.

Packages are a way of structuring Python's module namespace by using "dotted module names". For example, the module name ‘A.B’ designates a submodule named ‘B’ in a package named ‘A’. Just like the use of modules saves the authors of different modules from having to worry about each other's global variable names, the use of dotted module names saves the authors of multi-module packages like NumPy or the Python Imaging Library from having to worry about each other's module names. 

and from learnpython.org:

Packages are namespaces which contain multiple packages and modules themselves. They are simply directories, but with a twist. 
Each package in Python is a directory which MUST contain a special file called __init__.py. This file can be empty, and it indicates that the directory it contains is a Python package, so it can be imported the same way a module can be imported. 

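If you come from a Java background: a Java package is essentially just a directory until you create a class in it. In Python, the directory also needs a special file called __init__.py (it can be empty), which denotes that the directory is a Python package. Since Python 3.3, a directory without it can still be imported as a namespace package, but the layout below uses regular packages.

So something like this: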

router_statistics
    __init__.py
    routerStats.py
    test
        __init__.py
        router_stats_test.py

The above file structure is from a GitHub project. We have a Python package router_statistics with a module routerStats.py. Inside it, we have a test package with a test module router_stats_test.py.
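To see the layout above in action, here is a minimal sketch that rebuilds the same directory tree in a temporary folder and imports it with dotted module names. The module body (a `describe()` function) is made up for illustration and is not taken from the GitHub project.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Recreate the layout shown above inside a throwaway directory.
root = Path(tempfile.mkdtemp())
pkg = root / "router_statistics"
(pkg / "test").mkdir(parents=True)

# Empty __init__.py files mark both directories as regular packages.
(pkg / "__init__.py").write_text("")
(pkg / "test" / "__init__.py").write_text("")

# Hypothetical module body, only here so there is something to import.
(pkg / "routerStats.py").write_text(
    "def describe():\n    return 'router stats module'\n"
)
(pkg / "test" / "router_stats_test.py").write_text("")

# Make the parent directory importable, then use dotted module names.
sys.path.insert(0, str(root))
importlib.invalidate_caches()
stats = importlib.import_module("router_statistics.routerStats")
tests = importlib.import_module("router_statistics.test.router_stats_test")

print(stats.__name__)
print(tests.__package__)
print(stats.describe())
```

The dotted name router_statistics.test.router_stats_test mirrors the directory nesting, which is the "A.B" idea from the first definition.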

Pretty neat :) That's all for this light learning experience.




Saturday, July 4, 2015

Light walkthrough on Groovy

Today, we will learn another language, Groovy. It is a scripting language, much like Perl and Python. Okay, first, let's understand what Groovy is. From Wikipedia:

Groovy is an object-oriented programming language for the Java platform. It is a dynamic language with features similar to those of Python, Ruby, Perl, and Smalltalk. It can be used as a scripting language for the Java Platform, is dynamically compiled to Java Virtual Machine (JVM) bytecode, and interoperates with other Java code and libraries. Groovy uses a Java-like curly-bracket syntax. Most Java code is also syntactically valid Groovy, although semantics may be different.

Groovy 1.0 was released on January 2, 2007, and Groovy 2.0 in July, 2012. Groovy 3.0 is planned for release in late 2015, with support for a new Meta Object Protocol.[2] Since version 2, Groovy can also be compiled statically, offering type inference and performance very close to that of Java.[3][4] Groovy 2.4 was the last major release under Pivotal Software's sponsorship which ended in March 2015.[5]

A few current facts, summarized from the official Groovy site.


Because Groovy code is compiled to bytecode and run on the JVM, you need to watch which JVM version each Groovy branch requires. Below is the table.

Groovy Branch    JVM Required (non-indy)    JVM Required (indy) *
2.3 - current    1.6                        1.7
2.0 - 2.2        1.5                        1.7
1.6 - 1.8        1.5                        N/A
1.0 - 1.5        1.4                        N/A

Okay, let's start with a Groovy hello world. Groovy provides three quick ways to run a "hello world" application: as a Groovy script, in the Groovy console, or in the Groovy shell.

$ cat hello.groovy
#!/usr/bin/env groovy

println "Hello world!"
$ groovy hello.groovy
Hello world!

$ groovyConsole


1:  $ groovysh   
2:  Groovy Shell (1.8.6, JVM: 1.7.0_55)  
3:  Type 'help' or '\h' for help.  
4:  ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
5:  groovy:000> println "hello world"  
6:  hello world  
7:  ===> null  
8:  groovy:000>   

So that's it. If you want to learn more about Groovy, here are a few FAQs and their helpful links.

How different is it from Java?
http://www.groovy-lang.org/differences.html

Gimme a few examples?
http://www.groovy-lang.org/groovy-dev-kit.html

Show me the syntax?
http://www.groovy-lang.org/syntax.html

Operators?
http://www.groovy-lang.org/operators.html

Groovy compiler?
http://www.groovy-lang.org/groovyc.html

Groovy shell?
http://www.groovy-lang.org/groovysh.html

Groovy console?
http://www.groovy-lang.org/groovyconsole.html


Friday, July 3, 2015

How can big data help a legal firm?

Today, we are going to do something a little different from our usual learning journey. By that, I mean it is not purely about information technology, but it is somewhat related. Let me explain further. I was reading about the Malaysia Personal Data Protection Act 2010, or PDPA 2010, on this blog. Law is not my profession, but reading this article as an information technology professional gave me several ideas.

Reading this article, no offence, really is a daunting activity. It is a long and dull blog post. :-) Nonetheless, every word is equally important in defining what the act means and what scope it encompasses. I think an information retrieval application like Elasticsearch would be a good match for this: index all the words in the articles, then search quickly and show which act, section and article reference them. It would be even better with scoring, so that the more relevant documents are shown first. A lawyer would probably want to quickly find the relevant documents to read further. I'm sure there are books with thousands of pages, and remembering every single line of the acts is almost impossible, or at least impractical. Information technology can fill this gap for them.

For law students, this is especially useful, as it will speed up the way they learn law. Nobody wants to sit in a library and spend twelve hours a day reading 1000 pages. I think what drives people is that we want active learning, not passive reading. So I guess with Elasticsearch, they can quickly search using legal terminology, and the results will show them the books that best serve their interest.

For each court case, the transcripts or any other text data can be digitized into queryable data. With that, the data can be turned into information using information retrieval tools like Elasticsearch. I believe a high court case can take months or even years to complete. Being able to quickly digitize this data and reference it later, be it later in the same court case or in the next one, would take a law firm to the next stage.

Of course, this is just my opinion, and maybe it is expressed only from the information technology point of view (rightfully so, as I.T. is my profession). Please feel free to comment and suggest improvements if you have any. Thank you.

Sunday, June 21, 2015

Learning JavaFX on eclipse luna

Today, we will learn JavaFX using Eclipse Luna as the IDE. It's the start of a learning journey to get acquainted with the basics of JavaFX in the Eclipse development environment. Essentially it is a 'hello world' application. First, let's take a look at what JavaFX is. From Wikipedia:

JavaFX is a software platform for creating and delivering rich internet applications (RIAs) that can run across a wide variety of devices. JavaFX is intended to replace Swing as the standard GUI library for Java SE, but both will be included for the foreseeable future.[3] JavaFX has support for desktop computers and web browsers on Microsoft Windows, Linux, and Mac OS X.

Okay, so JavaFX is about GUI development. With that said, let's start with a simple hello world GUI application for JavaFX. This article assumes your Java project uses Java 8 and Eclipse Luna, and that you have both set up already. Below is the sample code.

package play.learn.java.fx;

import javafx.application.Application;
import javafx.event.ActionEvent;
import javafx.event.EventHandler;
import javafx.scene.Scene;
import javafx.scene.control.Button;
import javafx.scene.layout.StackPane;
import javafx.stage.Stage;

public class HelloWorld extends Application {

    @Override
    public void start(Stage primaryStage) throws Exception {
        // A button that prints a message to the console when clicked.
        Button btn = new Button();
        btn.setText("Say 'Hello World'");
        btn.setOnAction(new EventHandler<ActionEvent>() {

            @Override
            public void handle(ActionEvent event) {
                System.out.println("Hello World!");
            }
        });

        // The StackPane is the root node of the scene graph.
        StackPane root = new StackPane();
        root.getChildren().add(btn);

        Scene scene = new Scene(root, 300, 250);

        primaryStage.setTitle("Hello World!");
        primaryStage.setScene(scene);
        primaryStage.show();
    }

    public static void main(String[] args) {
        launch(args);
    }
}


In Eclipse, the code above may show a warning about restricted access to the API. To summarize the warning: the JavaFX library (jfxrt.jar) is not treated as part of the project's accessible Java API by default. So in this situation, you will have to add it manually. It's simple: right click on the project and select Properties. In the window that pops up, select Java Build Path in the tree, click 'Add External JARs...', locate where Java 8 is installed, and select the jar file named jfxrt.jar. It is relative to where JAVA_HOME points, under <JAVA_HOME>/jre/lib/ext/




When that is done, the warning should disappear. Now run the application: a window should pop up. Click the button, then look at the Eclipse console; you should see "Hello World!". Here are a few remarks to help understand the basics of this application.

Here are the important things to know about the basic structure of a JavaFX application:


  •     The main class for a JavaFX application extends the javafx.application.Application class. The start() method is the main entry point for all JavaFX applications.
  •     A JavaFX application defines the user interface container by means of a stage and a scene. The JavaFX Stage class is the top-level JavaFX container. The JavaFX Scene class is the container for all content. Example 3-1 creates the stage and scene and makes the scene visible in a given pixel size.
  •     In JavaFX, the content of the scene is represented as a hierarchical scene graph of nodes. In this example, the root node is a StackPane object, which is a resizable layout node. This means that the root node's size tracks the scene's size and changes when the stage is resized by a user.
  •     The root node contains one child node, a button control with text, plus an event handler to print a message when the button is pressed.
  •     The main() method is not required for JavaFX applications when the JAR file for the application is created with the JavaFX Packager tool, which embeds the JavaFX Launcher in the JAR file. However, it is useful to include the main() method so you can run JAR files that were created without the JavaFX Launcher, such as when using an IDE in which the JavaFX tools are not fully integrated. Also, Swing applications that embed JavaFX code require the main() method.


The above is an excerpt from the official documentation. The code can also be found here. That's it, have fun exploring more of JavaFX.

Saturday, June 20, 2015

Fix corrupted ods file

If you have been working with spreadsheets, then one day you may open a file and, for some unknown reason, it shows gibberish text. You will be like OH MY GAWD!! Where is my file!!?? Well, fear not: today we will try to recover the file. To be exact, the spreadsheet is in the ods format from OpenOffice. You can find more information here.

So, a good, working ods file starts with PK. See the example below.

 $ hexdump -C myfile.ods | head -1  
 00000000 50 4b 03 04 14 00 00 08 00 00 b7 71 c5 46 85 6c |PK.........q.F.l|  

The broken one does not start with PK; for my spreadsheet, it looks something like the following. Yours may be different, but that does not matter.

 $ hexdump -C myfile.ods | head -1  
 00000000 2c 75 73 65 72 2c 55 73 65 72 57 6f 72 6b 73 |,user,UserWorks|  
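The same check can be scripted. Below is a minimal sketch that reads the first four bytes of a file and compares them with the ZIP local file header signature (the 50 4b 03 04 seen in the good dump). The file names good.ods and broken.ods are throwaway demo files created by the script, not real spreadsheets.

```python
import tempfile
import zipfile
from pathlib import Path

ZIP_LOCAL_HEADER = b"PK\x03\x04"  # the bytes 50 4b 03 04 from the hexdump


def looks_like_zip(path):
    """Return True if the file starts with the ZIP local file header."""
    with open(path, "rb") as handle:
        return handle.read(4) == ZIP_LOCAL_HEADER


# Demo on two throwaway files (hypothetical names and content).
workdir = Path(tempfile.mkdtemp())

good = workdir / "good.ods"
with zipfile.ZipFile(good, "w") as archive:
    archive.writestr("content.xml", "<placeholder/>")

broken = workdir / "broken.ods"
broken.write_bytes(b",user,UserWorks")  # leading bytes of the broken dump

print(looks_like_zip(good))
print(looks_like_zip(broken))
```

On a real file, calling looks_like_zip("myfile.ods") gives the same answer as eyeballing the first hexdump line.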

Because an OpenOffice file is a compressed (zip) file, you can fix it using the zip application. To fix it, run a command such as the one below.

 user@localhost:~$ zip --fixfix myfile.ods --out myfixfile.ods   
 Fix archive (-FF) - salvage what can  
  Found end record (EOCDR) - says expect single disk archive  
 Scanning for entries...  
  copying: Object 1/styles.xml (398 bytes)  
  copying: Object 1/content.xml (1892 bytes)  
  copying: Object 1/meta.xml (281 bytes)  
  copying: Object 2/content.xml (1999 bytes)  
  copying: Object 2/meta.xml (281 bytes)  
  copying: Object 2/styles.xml (483 bytes)  
  copying: Object 3/content.xml (2116 bytes)  
  copying: Object 3/meta.xml (281 bytes)  
  copying: Object 3/styles.xml (398 bytes)  
  copying: styles.xml (1999 bytes)  
  copying: Object 4/meta.xml (281 bytes)  
  copying: Object 4/content.xml (2405 bytes)  
  copying: Object 4/styles.xml (398 bytes)  
  copying: content.xml (17364 bytes)  
  copying: meta.xml (441 bytes)  
  copying: ObjectReplacements/Object 1 (2278 bytes)  
  copying: ObjectReplacements/Object 2 (3654 bytes)  
  copying: ObjectReplacements/Object 3 (1924 bytes)  
  copying: ObjectReplacements/Object 4 (2483 bytes)  
  copying: META-INF/manifest.xml (449 bytes)  
 Central Directory found...  
 no local entry: mimetype  
 no local entry: settings.xml  
 no local entry: manifest.rdf  
 no local entry: Configurations2/menubar/  
 no local entry: Configurations2/toolpanel/  
 no local entry: Configurations2/progressbar/  
 no local entry: Configurations2/accelerator/current.xml  
 no local entry: Configurations2/statusbar/  
 no local entry: Configurations2/images/Bitmaps/  
 no local entry: Configurations2/toolbar/  
 no local entry: Configurations2/floater/  
 no local entry: Configurations2/popupmenu/  
 no local entry: Thumbnails/thumbnail.png  
 EOCDR found ( 1 73809)...  

So the above command will try to salvage whatever it can. As you might have guessed, the fixed version of the file is the one specified by the --out parameter.
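Before opening the salvaged file in the spreadsheet application, it can be worth checking that the archive itself is readable. This is only a sketch using Python's standard zipfile module; the in-memory archive below is a made-up stand-in for myfixfile.ods.

```python
import io
import zipfile


def summarize_archive(source):
    """Return (entry names, first corrupt entry or None) for a zip file."""
    with zipfile.ZipFile(source) as archive:
        return archive.namelist(), archive.testzip()


# Stand-in for myfixfile.ods: a tiny in-memory archive with dummy content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("content.xml", "<placeholder/>")
    archive.writestr("META-INF/manifest.xml", "<placeholder/>")
buffer.seek(0)

names, first_bad = summarize_archive(buffer)
print(names)
print(first_bad)  # None means every entry passed its CRC check
```

For the real file, summarize_archive("myfixfile.ods") should list content.xml among the entries; a missing content.xml would suggest the data was not salvaged.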

This method worked superbly for my corrupted file. The fixed version of the file contained all the data as before, and I was happy. :) I hope it works for you too. That's it for today's learning. Good luck to you!

Friday, June 19, 2015

Learn lucene term range query

Today, we are going to learn about the Lucene term range query. But first, what actually is a Lucene term range query? From the official javadoc definition:

A Query that matches documents within an range of terms.

This query matches the documents looking for terms that fall into the supplied range according to Byte.compareTo(Byte). It is not intended for numerical ranges; use NumericRangeQuery instead.

This query uses the MultiTermQuery.CONSTANT_SCORE_AUTO_REWRITE_DEFAULT rewrite method.

So it is a byte-by-byte comparison between the two ends of the range, and because the comparison is byte by byte, it is lexicographic. If you intend to find a range between two numbers, this is not the class you should use. Okay, if this is not clear, let's go into the code, shall we?
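To make "lexicographic" concrete before touching Lucene, here is a small Python sketch (not Lucene code) that compares terms as raw bytes, left to right. The helper name byte_less is my own.

```python
def byte_less(left, right):
    """Compare two terms the lexicographic way: byte by byte, left to right."""
    return left.encode("utf-8") < right.encode("utf-8")


# The first differing byte decides, so '1' < '9' makes "10" sort before "9".
for left, right in [("10", "9"), ("100", "25"), ("apple", "banana")]:
    print(f"{left!r} < {right!r} as bytes: {byte_less(left, right)}")

# Numeric comparison gives the opposite answer for the first two pairs.
print(10 < 9, 100 < 25)
```

This is why a term range over numbers stored as plain text gives surprising results.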

As you know, Lucene is about two parts: first the indexing (write) part, and then the search (query) part. So in this article, we are going to index and query using a term range query. To give you an overview, we have four classes.

  • LuceneConstants - just a setting class for this application.
  • Indexer - the class that does the indexing. 
  • Searcher - the class that does the search.
  • LearnTermRangeQuery - our main entry class that binds the above three classes together.

We create an object, tester, for this learning journey. We then create the index by calling the method createIndex, and then search the index using a term range query.


LearnTermRangeQuery tester;

try {
    tester = new LearnTermRangeQuery();
    tester.createIndex();
    tester.searchUsingTermRangeQuery("record2.txt", "record6.txt");
} catch (Exception e) {
    // Report the failure instead of silently swallowing it.
    e.printStackTrace();
}

In the method createIndex(), I use some lambdas, which you can spot by the arrow symbol, so you need to have Java 8 installed. There are two variables, indexDir and dataDir. The variable indexDir is the directory where the created index will reside, whilst dataDir holds the sample data to be indexed. In the class Indexer, the method getDocument() essentially indexes the sample documents. Nothing fancy: it just creates an ordinary Lucene document with three fields, the file name, the file path and the file content.

Back to the class LearnTermRangeQuery and its method searchUsingTermRangeQuery(). Notice that we search the range with two file names as the borders. We initialize a Lucene directory object and pass it to the index searcher object. Everything else for the Lucene index searcher is just standard. We construct the TermRangeQuery and pass it to the searcher object. The results are then shown, and eventually the searcher is closed.

Below is the sample output in the Eclipse console.

 record 21.txt  
 src/resources/samples.termrange/record 21.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 21.txt  
 record 33 .txt  
 src/resources/samples.termrange/record 33 .txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 33 .txt  
 record10.txt  
 src/resources/samples.termrange/record10.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record10.txt  
 record7.txt  
 src/resources/samples.termrange/record7.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record7.txt  
 record6.txt  
 src/resources/samples.termrange/record6.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record6.txt  
 record9.txt  
 src/resources/samples.termrange/record9.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record9.txt  
 record33.txt  
 src/resources/samples.termrange/record33.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record33.txt  
 record2.txt  
 src/resources/samples.termrange/record2.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record2.txt  
 record5.txt  
 src/resources/samples.termrange/record5.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record5.txt  
 record 33.txt  
 src/resources/samples.termrange/record 33.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record 33.txt  
 record3.txt  
 src/resources/samples.termrange/record3.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record3.txt  
 record8.txt  
 src/resources/samples.termrange/record8.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record8.txt  
 record2.1.txt  
 src/resources/samples.termrange/record2.1.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record2.1.txt  
 record1.txt  
 src/resources/samples.termrange/record1.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record1.txt  
 record4.txt  
 src/resources/samples.termrange/record4.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record4.txt  
 record22.txt  
 src/resources/samples.termrange/record22.txt  
 Indexing /home/user/eclipse/test/src/resources/samples.termrange/record22.txt  
 16 File indexed, time taken: 800 ms  
 6 documents found. Time :74ms  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record33.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record2.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record5.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record3.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record4.txt  
 File : /home/user/eclipse/test/src/resources/samples.termrange/record22.txt  
   

As you can see above, the results are not correct if you treat the file names numerically, from record2.txt to record6.txt. So, always experiment with a few values before you implement. Hehe, have fun! You can get the source for this code at my GitHub.
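The six hits can be reproduced outside Lucene. The sketch below filters the sixteen file names from the indexing output by byte order, then by the number inside the name for contrast. It is plain Python, not the code from my GitHub, and it assumes the lower bound is inclusive and the upper bound exclusive, which I infer only from record6.txt being absent from the hits.

```python
import re

# File names taken from the indexing output above.
NAMES = [
    "record 21.txt", "record 33 .txt", "record10.txt", "record7.txt",
    "record6.txt", "record9.txt", "record33.txt", "record2.txt",
    "record5.txt", "record 33.txt", "record3.txt", "record8.txt",
    "record2.1.txt", "record1.txt", "record4.txt", "record22.txt",
]

LOWER, UPPER = "record2.txt", "record6.txt"


def in_term_range(name):
    """Byte-wise range check; the exclusive upper bound is an assumption."""
    return LOWER.encode("utf-8") <= name.encode("utf-8") < UPPER.encode("utf-8")


def in_numeric_range(name, low=2, high=6):
    """What a reader probably expects: compare the number in the name."""
    match = re.fullmatch(r"record(\d+)\.txt", name)
    return match is not None and low <= int(match.group(1)) <= high


lexicographic_hits = sorted(n for n in NAMES if in_term_range(n))
numeric_hits = sorted(n for n in NAMES if in_numeric_range(n))

print(lexicographic_hits)
print(numeric_hits)
```

record22.txt and record33.txt land inside the byte range because only their first digit matters at the point where the names differ, while record10.txt falls outside it because '1' sorts before '2'.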

Sunday, June 7, 2015

code path learning on elasticsearch monitoring jvm logging

Today we are going to study the following info logging from elasticsearch 0.90.7.

[2015-03-25 00:45:07,008][INFO ][monitor.jvm       ] [node03] [gc][ParNew][649825][391294] duration [829ms], collections [1]/[1.1s], total [829ms]/[2.8h], memory [8.4gb]->[8.5gb]/[14gb], all_pools {[Code Cache] [14.4mb]->[14.4mb]/[48mb]}{[Par Eden Space] [34.4mb]->[460.1kb]/[133.1mb]}{[Par Survivor Space] [16.6mb]->[16.6mb]/[16.6mb]}{[CMS Old Gen] [8.3gb]->[8.4gb]/[13.8gb]}{[CMS Perm Gen] [43.5mb]->[43.5mb]/[82mb]}
[2015-03-25 00:45:11,529][INFO ][monitor.jvm       ] [node03] [gc][ParNew][649829][391299] duration [921ms], collections [1]/[1.3s], total [921ms]/[2.8h], memory [8.7gb]->[8.8gb]/[14gb], all_pools {[Code Cache] [14.4mb]->[14.4mb]/[48mb]}{[Par Eden Space] [203.2kb]->[4.1mb]/[133.1mb]}{[Par Survivor Space] [16.6mb]->[16.6mb]/[16.6mb]}{[CMS Old Gen] [8.7gb]->[8.8gb]/[13.8gb]}{[CMS Perm Gen] [43.5mb]->[43.5mb]/[82mb]}
[2015-03-25 00:45:13,800][INFO ][monitor.jvm       ] [node03] [gc][ParNew][649831][391301] duration [744ms], collections [1]/[1.1s], total [744ms]/[2.8h], memory [8.9gb]->[9gb]/[14gb], all_pools {[Code Cache] [14.4mb]->[14.4mb]/[48mb]}{[Par Eden Space] [5.7mb]->[537.8kb]/[133.1mb]}{[Par Survivor Space] [16.6mb]->[16.6mb]/[16.6mb]}{[CMS Old Gen] [8.9gb]->[9gb]/[13.8gb]}{[CMS Perm Gen] [43.5mb]->[43.5mb]/[82mb]}
[2015-03-25 00:45:15,088][INFO ][monitor.jvm       ] [node03] [gc][ParNew][649832][391302] duration [891ms], collections [1]/[1.2s], total [891ms]/[2.8h], memory [9gb]->[9.1gb]/[14gb], all_pools {[Code Cache] [14.4mb]->[14.4mb]/[48mb]}{[Par Eden Space] [537.8kb]->[5.4mb]/[133.1mb]}{[Par Survivor Space] [16.6mb]->[16.6mb]/[16.6mb]}{[CMS Old Gen] [9gb]->[9.1gb]/[13.8gb]}{[CMS Perm Gen] [43.5mb]->[43.5mb]/[82mb]}
[2015-03-25 00:45:17,287][INFO ][monitor.jvm       ] [node03] [gc][ParNew][649834][391304] duration [770ms], collections [1]/[1.1s], total [770ms]/[2.8h], memory [9.2gb]->[9.3gb]/[14gb], all_pools {[Code Cache] [14.4mb]->[14.4mb]/[48mb]}{[Par Eden Space] [359.7kb]->[357.3kb]/[133.1mb]}{[Par Survivor Space] [16.6mb]->[16.6mb]/[16.6mb]}{[CMS Old Gen] [9.2gb]->[9.3gb]/[13.8gb]}{[CMS Perm Gen] [43.5mb]->[43.5mb]/[82mb]}
[2015-03-25 00:45:18,531][INFO ][monitor.jvm       ] [node03] [gc][ParNew][649835][391305] duration [713ms], collections [1]/[1.2s], total [713ms]/[2.8h], memory [9.3gb]->[9.4gb]/[14gb], all_pools {[Code Cache] [14.4mb]->[14.4mb]/[48mb]}{[Par Eden Space] [357.3kb]->[441.9kb]/[133.1mb]}{[Par Survivor Space] [16.6mb]->[16.6mb]/[16.6mb]}{[CMS Old Gen] [9.3gb]->[9.4gb]/[13.8gb]}{[CMS Perm Gen] [43.5mb]->[43.5mb]/[82mb]}

Okay, before we go into the code, let's analyze based on just the output above. We have info logging, nothing very serious, but it is a sign that we should already pay attention. It happened on one of the nodes, node03, and was logged by a class in the package monitor.jvm. Let's take the first line in the log above and format it nicely so we can analyze further. Read below.

 [2015-03-25 00:45:07,008][INFO ][monitor.jvm       ] [node03] [gc]  
   
 [ParNew][649825][391294] duration [829ms],   
 collections [1]/[1.1s],   
 total [829ms]/[2.8h],   
 memory [8.4gb]->[8.5gb]/[14gb],   
 all_pools {[Code Cache]          [14.4mb]->[ 14.4mb ] / [   48mb]}  
           {[Par Eden Space]      [34.4mb]->[460.1kb ] / [133.1mb]}  
           {[Par Survivor Space]  [16.6mb]->[ 16.6mb ] / [ 16.6mb]}  
           {[CMS Old Gen]         [8.3gb] ->[  8.4gb ] / [ 13.8gb]}  
           {[CMS Perm Gen]        [43.5mb]->[ 43.5mb ] / [   82mb]}  

From just the above output, it becomes even clearer: the gc ParNew collection duration is 829 milliseconds, with one collection in 1.1 seconds. The heap was 8.4GB before and became 8.5GB after the collection, out of a total heap of 14GB. Then we have the statistics of all the JVM memory pools. Pretty obvious: before and after usage, with the total memory assigned to each pool respectively.
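If you want to pull these fields out of many log lines instead of reading them by eye, a regular expression is enough. The sketch below parses the first log line (cut off before all_pools for brevity). The group names such as seq, collection_count and heap_max are my own labels, not names from Elasticsearch.

```python
import re

# First log line from above, cut off before the all_pools part.
LINE = (
    "[2015-03-25 00:45:07,008][INFO ][monitor.jvm       ] [node03] "
    "[gc][ParNew][649825][391294] duration [829ms], collections [1]/[1.1s], "
    "total [829ms]/[2.8h], memory [8.4gb]->[8.5gb]/[14gb]"
)

# Group names are my own labels for the bracketed values.
GC_PATTERN = re.compile(
    r"\[(?P<node>[^\]]+)\] \[gc\]\[(?P<gc_name>[^\]]+)\]"
    r"\[(?P<seq>\d+)\]\[(?P<collection_count>\d+)\] "
    r"duration \[(?P<duration>[^\]]+)\], "
    r"collections \[(?P<collections>\d+)\]/\[(?P<elapsed>[^\]]+)\], "
    r"total \[(?P<total>[^\]]+)\]/\[(?P<overall>[^\]]+)\], "
    r"memory \[(?P<heap_before>[^\]]+)\]->\[(?P<heap_after>[^\]]+)\]"
    r"/\[(?P<heap_max>[^\]]+)\]"
)

fields = GC_PATTERN.search(LINE).groupdict()
for key, value in fields.items():
    print(f"{key:>18}: {value}")
```

Running the same pattern over all six lines would show, for instance, how heap_after keeps creeping up from one sample to the next.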

Now, we will read the code and verify whether our assumptions are valid. The class that logs this is JvmMonitorService.java, and we noticed the log falls in the second (else if) clause of the evaluation.

Interpreting again based on the code: it is an info log, and the gc name is ParNew. The sequence number of this monitoring check is 649825, and the total collection count for ParNew so far is 391294. Note that 391294 is a count, not a time in milliseconds. As far as I can read the code, the duration (829 milliseconds) is derived from the difference in collection time between the last sample and this one, divided by the number of collections in between.

Next comes the number of collections between the last sample and this one, followed by the time elapsed between the two samples. In this example, one collection was performed within 1.1 seconds. Then we have the collection time spent in that window (829 milliseconds), followed by the total collection time this gc has accumulated, and in the output it is showing 2.8 hours!

Next are the memory/heap statistics. This is different from how we interpreted the log before: the statistics show the JVM heap used at the last sample and then the JVM heap currently used. The difference is about 0.1GB. The total memory/heap allocated for this JVM is 14GB.

The last statistics are the JVM pool statistics. Notice that the pool statistics are built from the previous JVM stats sample and the current one. The output shows the previous pool usage, then the current pool usage, followed by the maximum size of the pool. Let's take one pool for discussion, {[CMS Old Gen]        [8.3gb] ->[  8.4gb ] / [ 13.8gb]} . So the CMS old gen pool usage was previously 8.3GB, and in this sample the current usage is 8.4GB. The maximum size of this CMS old gen pool is 13.8GB.
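The pool section can be turned into numbers the same way. Below is a rough sketch that parses the all_pools part of the first log line and reports, per pool, the change between the two samples and how full the pool is. It assumes the units are 1024-based (1gb = 1024mb); the helper names are hypothetical.

```python
import re

# all_pools part of the first log line above.
POOLS = (
    "{[Code Cache] [14.4mb]->[14.4mb]/[48mb]}"
    "{[Par Eden Space] [34.4mb]->[460.1kb]/[133.1mb]}"
    "{[Par Survivor Space] [16.6mb]->[16.6mb]/[16.6mb]}"
    "{[CMS Old Gen] [8.3gb]->[8.4gb]/[13.8gb]}"
    "{[CMS Perm Gen] [43.5mb]->[43.5mb]/[82mb]}"
)

POOL_PATTERN = re.compile(
    r"\{\[(?P<pool>[^\]]+)\] \[(?P<prev>[^\]]+)\]->\[(?P<curr>[^\]]+)\]"
    r"/\[(?P<max>[^\]]+)\]\}"
)
UNITS_IN_MB = {"kb": 1 / 1024, "mb": 1.0, "gb": 1024.0}  # assumed 1024-based


def to_mb(size):
    """Convert a size such as '8.3gb' or '460.1kb' to megabytes."""
    number, unit = re.fullmatch(r"([\d.]+)(kb|mb|gb)", size).groups()
    return float(number) * UNITS_IN_MB[unit]


summary = {}
for match in POOL_PATTERN.finditer(POOLS):
    prev, curr, cap = (to_mb(match[key]) for key in ("prev", "curr", "max"))
    summary[match["pool"]] = {"change_mb": curr - prev, "used_pct": 100 * curr / cap}

for pool, stats in summary.items():
    print(f"{pool:<20} change {stats['change_mb']:+9.1f} mb, "
          f"{stats['used_pct']:5.1f}% of max")
```

A pool whose usage keeps growing sample after sample, as CMS Old Gen does in the six lines above, is the one to watch.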

That's it. If you think this analysis is not correct, please leave a comment below with the necessary correction.