Friday, July 31, 2015

Attempting to understand java garbage collection statistics

If you have been developing large java applications, troubleshooting can at times go as deep as looking into the garbage collector while the application is running. Unfortunately, the statistics can be overwhelming when you start to investigate or even just try to understand them. At least for me, it is pretty tedious, and I seek your help too: if you come across this article, please leave a comment.

There is very little documentation describing how these statistics should be interpreted. There is this post from an oracle blog dated 2006, pretty outdated by now, but it nonetheless analyzes the output line by line. More recent articles from alexey ragozin and poonam bajaj are worth a look too.

The gc statistics below can be regenerated by passing these parameters on the java command line: -XX:+PrintGCDetails -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1.
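For example, a full command line might look like the sketch below. Only the three print flags come from the setup above; the heap sizes, the collector flag and the main class are placeholders I added for illustration (though -XX:+UseConcMarkSweepGC is consistent with the CMS output shown below).

 java -Xms12g -Xmx12g -XX:+UseConcMarkSweepGC -XX:+PrintGCDetails -XX:+PrintPromotionFailure -XX:PrintFLSStatistics=1 com.example.MyApp

The following snippets were extracted from a production machine. Let's take a look at them line by line.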

 Before GC:  
 Statistics for BinaryTreeDictionary:  
 ------------------------------------  
 Total Free Space: 230400  
 Max  Chunk Size: 230400  
 Number of Blocks: 1  
 Av. Block Size: 230400  
 Tree   Height: 1  
 586945.492: [ParNew  
 Desired survivor size 41943040 bytes, new threshold 1 (max 1)  
 - age  1:  10038008 bytes,  10038008 total  
 : 660426K->10292K(737280K), 0.0353470 secs] 9424156K->8774094K(12500992K)After GC:  
 Statistics for BinaryTreeDictionary:  
 ------------------------------------  
 Total Free Space: 127053189  
 Max  Chunk Size: 21404293  
 Number of Blocks: 125654  
 Av. Block Size: 1011  
 Tree   Height: 36  
   
   
   
 After GC:  
 Statistics for BinaryTreeDictionary:  
 ------------------------------------  
 Total Free Space: 230400  
 Max  Chunk Size: 230400  
 Number of Blocks: 1  
 Av. Block Size: 230400  
 Tree   Height: 1  
 , 0.0359540 secs] [Times: user=0.26 sys=0.00, real=0.03 secs]   
 Heap after GC invocations=550778 (full 2090):  
  par new generation  total 737280K, used 10292K [0x00000004fae00000, 0x000000052ce00000, 0x000000052ce00000)  
  eden space 655360K,  0% used [0x00000004fae00000, 0x00000004fae00000, 0x0000000522e00000)  
  from space 81920K, 12% used [0x0000000522e00000, 0x000000052380d360, 0x0000000527e00000)  
  to  space 81920K,  0% used [0x0000000527e00000, 0x0000000527e00000, 0x000000052ce00000)  
  concurrent mark-sweep generation total 11763712K, used 8763801K [0x000000052ce00000, 0x00000007fae00000, 0x00000007fae00000)  
  concurrent-mark-sweep perm gen total 40952K, used 24563K [0x00000007fae00000, 0x00000007fd5fe000, 0x0000000800000000)  
 }  
 Total time for which application threads were stopped: 0.0675660 seconds  
 {Heap before GC invocations=550778 (full 2090):  
  par new generation  total 737280K, used 11677K [0x00000004fae00000, 0x000000052ce00000, 0x000000052ce00000)  
  eden space 655360K,  0% used [0x00000004fae00000, 0x00000004faf5a220, 0x0000000522e00000)  
  from space 81920K, 12% used [0x0000000522e00000, 0x000000052380d360, 0x0000000527e00000)  
  to  space 81920K,  0% used [0x0000000527e00000, 0x0000000527e00000, 0x000000052ce00000)  
  concurrent mark-sweep generation total 11763712K, used 8763801K [0x000000052ce00000, 0x00000007fae00000, 0x00000007fae00000)  
  concurrent-mark-sweep perm gen total 40952K, used 24563K [0x00000007fae00000, 0x00000007fd5fe000, 0x0000000800000000)  

We can summarize the statistics above with the following points.

* the statistics above are generated by java hotspot, and the source code that prints them can be found here https://github.com/openjdk-mirror/jdk7u-hotspot/blob/master/src/share/vm/gc_implementation/concurrentMarkSweep/binaryTreeDictionary.cpp#L1098-L1112

* there are two sets of statistics, before gc and after gc, and this is a ParNew minor collection, not a full gc.

* before gc, we notice the max chunk size is equal to the total free space and there is exactly one block, so the free space is one contiguous chunk with no fragmentation.

* after gc (the first After GC block), the total free space is 127053189 but the max chunk size is only 21404293, spread over 125654 blocks, so that free space is quite fragmented.

* for this collection, the reported cpu time is user=0.26 and real=0.03 secs; user exceeds real because ParNew does its work on multiple threads in parallel.

* after gc, the survivor from space is 12% used.

* after gc, the concurrent mark-sweep generation totals 11,763,712K, whilst the concurrent mark-sweep permanent generation totals 40,952K with only 24,563K used.

* the total time for which application threads were stopped was 0.0675660 seconds.

So we can say this gc snippet looks healthy. It is a minor collection, not a full gc, and usage does not climb towards 100%. No failure or error appears anywhere. Note also that the young generation shrank by 650,134K while the whole heap shrank by 650,062K, so only about 72K worth of objects was promoted to the old generation. The total stop time is trivial too, well under a second.
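If you have to read many of these lines, a small script can help. Below is a minimal python 2 sketch of my own (not part of any gc tooling) that pulls the young generation and whole heap numbers out of the ParNew line above; the regex only covers this exact format:

 $ cat parse_parnew.py  
 import re  
   
 # a minimal sketch: pull the young generation and whole heap numbers  
 # out of one ParNew log line; the regex only covers this exact format  
 line = ": 660426K->10292K(737280K), 0.0353470 secs] 9424156K->8774094K(12500992K)"  
 m = re.search(r"(\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\] (\d+)K->(\d+)K\((\d+)K\)", line)  
 if m:  
   yb, ya, yt, pause, hb, ha, ht = m.groups()  
   print "young: %sK -> %sK of %sK, pause %ss" % (yb, ya, yt, pause)  
   print "heap : %sK -> %sK of %sK" % (hb, ha, ht)  

Running it prints the before and after sizes plus the pause time, which makes comparing many collections much quicker.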

That's it, and if you think this analysis is wrong and/or can be improved upon, please leave your message below. I would like to learn more too.

Sunday, July 19, 2015

Light learning using tsung on elasticsearch

Today, we will learn another software tool, Tsung. As this is just an introductory article, we will not go into details like interpreting the statistics and whether they make sense. The plan is to show how to quickly install, set up, run a test and view the results using tsung, so that we quickly get acquainted with distributed load testing. Let's first understand what tsung is.

Tsung is an open-source multi-protocol distributed load testing tool. It can be used to stress HTTP, WebDAV, SOAP, PostgreSQL, MySQL, LDAP and Jabber/XMPP servers. Tsung is free software released under the GPLv2 license. That we like, sir!

Okay, http is very common, so let's pick elasticsearch 1.6.0 for this test. You can download elasticsearch from elastic.co, extract the archive, and launch it from the command line. See below.

 user@localhost:~/Desktop/elasticsearch-1.6.0/bin$ ./elasticsearch  
 [2015-07-04 09:46:10,756][INFO ][node           ] [Futurist] version[1.6.0], pid[16404], build[cdd3ac4/2015-06-09T13:36:34Z]  
 [2015-07-04 09:46:10,757][INFO ][node           ] [Futurist] initializing ...  
 [2015-07-04 09:46:10,762][INFO ][plugins         ] [Futurist] loaded [], sites []  
 [2015-07-04 09:46:10,863][INFO ][env           ] [Futurist] using [1] data paths, mounts [[/ (/dev/sda5)]], net usable_space [15.4gb], net total_space [215.2gb], types [ext3]  
 [2015-07-04 09:46:13,833][INFO ][node           ] [Futurist] initialized  
 [2015-07-04 09:46:13,834][INFO ][node           ] [Futurist] starting ...  
 [2015-07-04 09:46:13,960][INFO ][transport        ] [Futurist] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/192.168.1.2:9300]}  
 [2015-07-04 09:46:13,978][INFO ][discovery        ] [Futurist] elasticsearch/eSG-tzQuQdCz5QKIomYm5Q  
 [2015-07-04 09:46:17,774][INFO ][cluster.service     ] [Futurist] new_master [Futurist][eSG-tzQuQdCz5QKIomYm5Q][VerticalHorizon][inet[/192.168.1.2:9300]], reason: zen-disco-join (elected_as_master)  
 [2015-07-04 09:46:17,809][INFO ][http           ] [Futurist] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address {inet[/192.168.1.2:9200]}  
 [2015-07-04 09:46:17,810][INFO ][node           ] [Futurist] started  
 [2015-07-04 09:46:17,934][INFO ][gateway         ] [Futurist] recovered [0] indices into cluster_state  

A single node elasticsearch with no index nor mapping; we just want to quickly use elasticsearch as the software under test.
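Before loading it, a quick sanity check does not hurt. The two commands below are my own addition: the first confirms the node responds, and the second creates the index myindex that the test session further down queries, since searching a missing index would mostly measure error responses. (myindex is an arbitrary name; match it to the url in tsung.xml.)

 user@localhost:~$ curl -XGET 'http://localhost:9200/'  
 user@localhost:~$ curl -XPUT 'http://localhost:9200/myindex/'  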

Next, let's install tsung. If you are using debian, that's good news! See below.

 $ sudo apt-get install tsung  

easy peasy. Okay, now let's create a directory tsung in the user home directory, then create two files in it. See below.

 user@localhost:~$ mkdir tsung  
 user@localhost:~$ cd tsung  
 user@localhost:~/tsung$ cat query.json   
 {"size":10,"query":{"filtered":{"query":{"match_all":{}}}}}  
 $ cat tsung.xml  
 <?xml version="1.0" encoding="utf-8"?>  
 <!DOCTYPE tsung SYSTEM "/usr/share/tsung/tsung-1.0.dtd" []>  
 <tsung loglevel="info">  
   
  <clients>  
   <client host="localhost" use_controller_vm="true" cpu="1" maxusers="30000000"/>  
  </clients>  
   
  <servers>  
   <server host="localhost" port="9200" type="tcp"/>  
  </servers>  
   
  <load>  
   <arrivalphase phase="1" duration="1" unit="minute">  
    <users arrivalrate="5" unit="second"/>  
   </arrivalphase>  
  </load>  
   
  <sessions>  
   <session name="es_load" weight="1" type="ts_http">  
    <request>  
    <http url="/myindex/_search"  
        method="GET"  
        contents_from_file="/home/user/tsung/query.json" />  
    </request>  
   </session>  
  </sessions>  
 </tsung>  

Now we are ready to do the load test; start tsung with the configuration file. With the load phase above, tsung adds five new users every second for one minute, so around 300 sessions in total, each sending the search request. See below.

 user@localhost:~/tsung$ tsung -f /home/user/tsung/tsung.xml start  
 Starting Tsung  
 "Log directory is: /home/user/.tsung/log/20150704-1104"  
 user@localhost:~/tsung$  

Give it a few minutes to run and you should be able to check the log output. Once it is done, let's generate the graph.

 user@localhost:~/tsung$ ls  
 total 8.0K  
 -rw-r--r-- 1 user user 60 Jul 4 09:33 query.json  
 -rw-r--r-- 1 user user 724 Jul 4 11:04 tsung.xml  
 user@localhost:~/tsung$ mkdir show_me_da_graph  
 user@localhost:~/tsung$ cd show_me_da_graph/  
 user@localhost:~/tsung/show_me_da_graph$ /usr/lib/tsung/bin/tsung_stats.pl --stats /home/user/.tsung/log/20150704-1104/tsung.log   
 creating subdirectory data   
 creating subdirectory gnuplot_scripts   
 creating subdirectory images   
 warn, last interval (5) not equal to the first, use the first one (10)  
 No data for Bosh  
 No data for Match  
 No data for Event  
 No data for Async  
 No data for Errors  
 user@localhost:~/tsung/show_me_da_graph$ ls  
 total 32K  
 drwxr-xr-x 2 user user 4.0K Jul 4 11:07 data  
 drwxr-xr-x 2 user user 4.0K Jul 4 11:07 gnuplot_scripts  
 drwxr-xr-x 2 user user 4.0K Jul 4 11:07 images  
 -rw-r--r-- 1 user user 3.7K Jul 4 11:07 gnuplot.log  
 -rw-r--r-- 1 user user 7.9K Jul 4 11:07 report.html  
 -rw-r--r-- 1 user user 7.1K Jul 4 11:07 graph.html  
 user@localhost:~/tsung/show_me_da_graph$ chromium graph.html  
 [18334:18334:0704/110724:ERROR:nss_util.cc(819)] After loading Root Certs, loaded==false: NSS error code: -8018  
 Created new window in existing browser session.  

Now look at the beautiful output! If you don't have the chromium browser, copy this link and paste it into your browser url bar: file:///home/user/tsung/show_me_da_graph/graph.html. Change the path as appropriate for your workstation. See the screenshot below.



If tsung is what you like, consider going into the details of tsung; it provides comprehensive documentation! That's it for this article, I hope you learned something.



Saturday, July 18, 2015

Learn what python method mangling and variable mangling are

While I was studying python, a unique word, mangling, caught my attention. There are two kinds of mangling, for methods and for variables. But first, let's take a look at what mangling is in python. From the official pep 8 documentation:

  If your class is intended to be subclassed, and you have attributes that you do not want subclasses to use, consider naming them with double leading underscores and no trailing underscores. This invokes Python's name mangling algorithm, where the name of the class is mangled into the attribute name. This helps avoid attribute name collisions should subclasses inadvertently contain attributes with the same name.

  Python mangles these names with the class name: if class Foo has an attribute named __a , it cannot be accessed by Foo.__a . (An insistent user could still gain access by calling Foo._Foo__a .) Generally, double leading underscores should be used only to avoid name conflicts with attributes in classes designed to be subclassed.

Okay, with that said and explained, let's hop into the code. We will create two classes, a parent and a child, both sharing the same method name and the same variable name.

 $ cat mangle.py  
 class Parent(object):  
   __name = "John Smith"  
   def __init__(self):  
     self.__alive = False  
     self.__parentAlive = False  
   def __show_age(self):  
     print "65"  
   
 class Child(Parent):  
   __name = "John Smith Junior"  
   def __init__(self):  
     super(Child, self).__init__()  
     self.__alive = True  
     self.__childAlive = True  
   def __show_age(self):  
     print "34"  

Now import this module into the python interpreter and see how mangling works.

 $ python  
 Python 2.7.10 (default, Jun 1 2015, 16:21:46)   
 [GCC 4.9.2] on linux2  
 Type "help", "copyright", "credits" or "license" for more information.  
 >>> from mangle import Child  
 >>> johnny = Child()  
 >>> dir(johnny)  
 ['_Child__alive', '_Child__childAlive', '_Child__name', '_Child__show_age', '_Parent__alive', '_Parent__name', '_Parent__parentAlive', '_Parent__show_age', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__']  
 >>> johnny._Child__alive  
 True  
 >>> johnny._Child__name  
 'John Smith Junior'  
 >>> johnny._Child__show_age()  
 34  
 >>> johnny._Parent__alive  
 False  
 >>> johnny._Parent__name  
 'John Smith'  
 >>> johnny._Parent__show_age()  
 65  
 >>>   

So with this, it should be clear that double underscore attributes living in a python object get prepended with an underscore followed by the class name. The same applies to method names.
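To see the flip side, accessing the attribute through its unmangled name from outside the class fails, because mangling only happens inside class bodies. Continuing the interpreter session above (this is the standard python 2.7 error for a missing attribute):

 >>> johnny.__name  
 Traceback (most recent call last):  
  File "<stdin>", line 1, in <module>  
 AttributeError: 'Child' object has no attribute '__name'  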

Friday, July 17, 2015

Generate flame graph using FlameGraph goodies

Lately, I have been reading brendan gregg's slideshares on how to monitor stacks in linux. An example would be this slideshare. There is an interesting part in his slides using a few commands to generate a flame graph. The flame graph project can be found in his github.

Today, we are trying flamegraph on my local system, a simple walkthrough using his great software. Okay, in one of his slides he gave a few commands, and I modified them a bit from the original version. Remember, I ran these commands on my debian box.

 git clone --depth 1 https://github.com/brendangregg/FlameGraph  
 cd FlameGraph  
 perf record -F 99 -a -g -- sleep 30  
 perf script| ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg  

and in the terminal, the output is below.

 user@localhost:~$ git clone --depth 1 https://github.com/brendangregg/FlameGraph  
 Cloning into 'FlameGraph'...  
 remote: Counting objects: 50, done.  
 remote: Compressing objects: 100% (29/29), done.  
 remote: Total 50 (delta 24), reused 37 (delta 20), pack-reused 0  
 Unpacking objects: 100% (50/50), done.  
 Checking connectivity... done.  
 user@localhost:~$ cd FlameGraph  
 user@localhost:~/FlameGraph$ sudo perf record -F 99 -a -g -- sleep 30  
 /usr/bin/perf: line 24: exec: perf_4.0: not found  
 E: linux-tools-4.0 is not installed.  
 user@localhost:~/FlameGraph$ sudo perf record -F 99 -a -g -- sleep 30  
 [ perf record: Woken up 1 times to write data ]  
 [ perf record: Captured and wrote 1.719 MB perf.data (5082 samples) ]  
 user@localhost:~/FlameGraph$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg  
 failed to open perf.data: Permission denied  
 ERROR: No stack counts found  
 user@localhost:~/FlameGraph$ sudo perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg  
 Failed to open /tmp/perf-3763.map, continuing without symbols  
 Failed to open /tmp/perf-4908.map, continuing without symbols  
 Failed to open /usr/lib/i386-linux-gnu/libQtCore.so.4.8.6, continuing without symbols  
 Failed to open /lib/i386-linux-gnu/libglib-2.0.so.0.4200.1, continuing without symbols  
 Failed to open /tmp/perf-5995.map, continuing without symbols  
 Failed to open /tmp/perf-2337.map, continuing without symbols  
 Failed to open /tmp/perf-3012.map, continuing without symbols  
 no symbols found in /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstplayback.so, maybe install a debug package?  
 Failed to open /usr/lib/i386-linux-gnu/libQtGui.so.4.8.6, continuing without symbols  
 Failed to open /tmp/perf-19187.map, continuing without symbols  
 no symbols found in /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstmatroska.so, maybe install a debug package?  
 no symbols found in /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstcoreelements.so, maybe install a debug package?  
 Failed to open /run/user/1000/orcexec.Sg4yUn, continuing without symbols  
 no symbols found in /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstfaad.so, maybe install a debug package?  
 Failed to open /usr/bin/skype, continuing without symbols  
 no symbols found in /usr/lib/x86_64-linux-gnu/gstreamer-1.0/libgstlibav.so, maybe install a debug package?  
 user@localhost:~/FlameGraph$  

As you can see above, you need perf installed; on this box it is provided by the package linux-tools-4.0, and you need root permission to run the perf commands. It takes a few seconds (here, the 30 second sleep) to collect the statistics, and then you can use the scripts to generate the svg. You should then be able to view the svg file using eog or gimp. See below for the flame graph generated on my workstation. :) Note, I had to convert the svg to jpg to upload it to this blogger.
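For reference, the fix for the first error above on my debian box was just installing the package it named; the exact package name follows your kernel version, so adjust it on your machine:

 user@localhost:~$ sudo apt-get install linux-tools-4.0  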






Sunday, July 5, 2015

Check out what a Python package is

It's been a while since I started learning python, and today I would like to check out what a python package is. These two references define python packages pretty clearly.

Packages are a way of structuring Python's module namespace by using "dotted module names". For example, the module name ‘A.B’ designates a submodule named ‘B’ in a package named ‘A’. Just like the use of modules saves the authors of different modules from having to worry about each other's global variable names, the use of dotted module names saves the authors of multi-module packages like NumPy or the Python Imaging Library from having to worry about each other's module names. 

and from learn python org

Packages are namespaces which contain multiple packages and modules themselves. They are simply directories, but with a twist. 
Each package in Python is a directory which MUST contain a special file called __init__.py. This file can be empty, and it indicates that the directory it contains is a Python package, so it can be imported the same way a module can be imported. 

If you come from a java background, java packages are essentially just directories until you create a class in them. In python, that directory needs to contain a special (often empty) file called __init__.py, which denotes that the directory is a python package.

So something like

router_statistics
    __init__.py
    routerStats.py
    test
        __init__.py
        router_stats_test.py

The above file structure is from a github project. We have a python package router_statistics with a module routerStats.py. Then we have a test python package with a test module router_stats_test.py.
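Assuming the top directory is on sys.path, these packages import like any module; the names below are just the ones from that example structure:

 >>> import router_statistics.routerStats  
 >>> from router_statistics.test import router_stats_test  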

Pretty neat :) That's all for this light learning experience.




Saturday, July 4, 2015

Light walkthrough on Groovy

Today, we will learn another language, groovy. It is a scripting language, much like perl and python. Okay, first, let's understand what groovy is. From wikipedia:

Groovy is an object-oriented programming language for the Java platform. It is a dynamic language with features similar to those of Python, Ruby, Perl, and Smalltalk. It can be used as a scripting language for the Java Platform, is dynamically compiled to Java Virtual Machine (JVM) bytecode, and interoperates with other Java code and libraries. Groovy uses a Java-like curly-bracket syntax. Most Java code is also syntactically valid Groovy, although semantics may be different.

Groovy 1.0 was released on January 2, 2007, and Groovy 2.0 in July, 2012. Groovy 3.0 is planned for release in late 2015, with support for a new Meta Object Protocol.[2] Since version 2, Groovy can also be compiled statically, offering type inference and performance very close to that of Java.[3][4] Groovy 2.4 was the last major release under Pivotal Software's sponsorship which ended in March 2015.[5]

A few current facts, summarized from the groovy official site:


Because groovy is compiled to jvm bytecode and run by the jvm, you need to watch which jvm version runs which groovy version. Below is the table.

Groovy Branch    JVM Required (non-indy)    JVM Required (indy) *
2.3 - current    1.6                        1.7
2.0 - 2.2        1.5                        1.7
1.6 - 1.8        1.5                        N/A
1.0 - 1.5        1.4                        N/A

Okay, let's start with a groovy hello world. Groovy provides three quick ways to show a "hello world" application: as a groovy script, in the groovy console, or in the groovy shell.

 $ cat hello.groovy  
 #!/usr/bin/env groovy  
   
 println "Hello world!"  
 $ groovy hello.groovy  
 Hello world!  

$ groovyConsole


 $ groovysh  
 Groovy Shell (1.8.6, JVM: 1.7.0_55)  
 Type 'help' or '\h' for help.  
 ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------  
 groovy:000> println "hello world"  
 hello world  
 ===> null  
 groovy:000>  

So that's it. If you want to learn more about groovy, here are a few FAQs and their helpful links.

how much does it differ from java?
http://www.groovy-lang.org/differences.html

gimme a few examples?
http://www.groovy-lang.org/groovy-dev-kit.html

show me the syntax?
http://www.groovy-lang.org/syntax.html

operator?
http://www.groovy-lang.org/operators.html

groovy compiler?
http://www.groovy-lang.org/groovyc.html

groovy shell?
http://www.groovy-lang.org/groovysh.html

groovy console?
http://www.groovy-lang.org/groovyconsole.html


Friday, July 3, 2015

how big data can help legal firms?

Today, we are going to do something a little different from our usual learning journey. By that I mean not purely information technology, but somewhat related. Let me explain further. I was reading the Malaysia Personal Data Protection Act 2010, or PDPA 2010, on this blog. Legal is not my profession, but reading this article as an information technology professional gave me several ideas.

Reading this article, no offence, is really a daunting activity. It is a long and dull read. :-) Nonetheless, every wording is equally important in defining what the act means and what scope it encompasses. I think an information retrieval application like elasticsearch would be a good match here. By indexing all the words in the articles, you could search quickly and show which act, section and article references a given term. It would be even better with scoring, so the more relevant documents are shown first, something a lawyer would probably want in order to quickly find the relevant document for further reading. I'm sure there are books with a thousand pages, and remembering every single line of the acts is almost impossible or impractical. Information technology can fill this gap for them.

For law students, this is especially useful as it will speed up the way they learn law. Nobody wants to sit in the library for hours and then spend twelve hours a day reading 1000 pages. I think what drives people is active learning, not passive reading. So I guess with elasticsearch, they can quickly search legal terminology and the results will show them the book that best serves their interest.

Each court case transcript, or indeed any text, can be digitized into query-able data. With that, data can be turned into information using an information retrieval tool like elasticsearch. I believe a high court case can take months or even years to complete; being able to quickly digitize this data and reference it later, whether later in the same case or in the next one, would put a law firm at the next stage.

Of course, this is just my opinion, expressed only from the information technology point of view (as rightfully, I.T. is my profession). Please feel free to comment and improve on it if you find anything. Thank you.