Sunday, August 30, 2015

First learning into Cloudera Impala

Today, let's look at a vendor big data technology: Cloudera Impala. So what is Impala all about?

The Wikipedia definition:

Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.[1]

And the definition from the official GitHub repository:

Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters. 
Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:

Let us download a virtual machine image. This is a good approach because Impala works by integrating with Hadoop; without an image, you would first have to establish a Hadoop cluster before integrating it with Impala, which is hard if you have no Hadoop knowledge. With a virtual machine image, it is as easy as importing the image into the host and powering it up. It also saves setup time and reduces errors.

With that said, I'm downloading a VirtualBox image. Once it is downloaded, extract it to a directory. If you have not installed VirtualBox yet, install it now with apt-get install virtualbox virtualbox-guest-additions-iso, and make sure the VirtualBox service is running.

 root@localhost:~# /etc/init.d/virtualbox status  
 ● virtualbox.service - LSB: VirtualBox Linux kernel module  
   Loaded: loaded (/etc/init.d/virtualbox)  
   Active: active (exited) since Thu 2015-08-20 17:07:43 MYT; 2min 36s ago  
    Docs: man:systemd-sysv-generator(8)  
  Process: 29390 ExecStop=/etc/init.d/virtualbox stop (code=exited, status=0/SUCCESS)  
  Process: 29425 ExecStart=/etc/init.d/virtualbox start (code=exited, status=0/SUCCESS)  
   
 Aug 20 17:07:43 localhost systemd[1]: Starting LSB: VirtualBox Linux kernel module...  
 Aug 20 17:07:43 localhost systemd[1]: Started LSB: VirtualBox Linux kernel module.  
 Aug 20 17:07:43 localhost virtualbox[29425]: Starting VirtualBox kernel modules.  

Launch VirtualBox and add the virtual image to a new instance; see the screenshot below.




Now power this virtual machine up! Please be patient, as it takes a long time to boot, at least on my PC; you might want to get a drink in the meantime. The rest of this article follows this tutorial. However, I gave up because the select statement took a very long time in the virtual environment, at least for me. I will still illustrate the steps up to the point where it became slow.

First, you need to copy the CSV files (tab1.csv and tab2.csv) into the virtual machine.







Then you can load the script containing the SQL that creates the tables and loads the CSV files into them. The example given in the tutorial does not create a database, so I suggest you add the first two lines below (create database and use) to the script before loading it.

 create database testdb;  
 use testdb;  
 DROP TABLE IF EXISTS tab1;  
 -- The EXTERNAL clause means the data is located outside the central location  
 -- for Impala data files and is preserved when the associated Impala table is dropped.  
 -- We expect the data to already ex  



After that, you can run the impala-shell command and issue SQL queries, but as you can see, the select statement just hangs there forever.



Not a good experience, but if Impala is what you need, find out what the problem is and let me know. :-)

Saturday, August 29, 2015

First time learning Apache HBase

Today, we will look at another big data technology: Apache HBase. Before we dip our toes in, let's find out what Apache HBase actually is.

Apache HBase [1] is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al.[2]  Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop [3].

In this article, we will set up a single node for this adventure. Before we begin, download a copy of Apache HBase here. Once it is downloaded, extract the compressed content. At the time of this writing, I'm using Apache HBase version 1.1.1 for this learning experience.

 user@localhost:~/Desktop/hbase-1.1.1$ ls  
 bin CHANGES.txt conf     docs hbase-webapps lib LICENSE.txt NOTICE.txt README.txt  

If you have not installed Java, go ahead and install it. Pick a recent Java, or at least Java 7. Make sure the terminal reports the correct version of Java, for example:

 user@localhost:~/Desktop/hbase-1.1.1$ java -version  
 java version "1.7.0_55"  
 Java(TM) SE Runtime Environment (build 1.7.0_55-b13)  
 Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)  

If you cannot change the system-wide Java configuration, then in the HBase environment file, conf/hbase-env.sh, uncomment the JAVA_HOME variable and set it to the Java you installed. The main configuration file for HBase is conf/hbase-site.xml; edit it so that it looks like the following, adjusting the paths to your environment as required.

 user@localhost:~/Desktop/hbase-1.1.1$ cat conf/hbase-site.xml   
 <?xml version="1.0"?>  
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>  
 <!--  
 /**  
  *  
  * Licensed to the Apache Software Foundation (ASF) under one  
  * or more contributor license agreements. See the NOTICE file  
  * distributed with this work for additional information  
  * regarding copyright ownership. The ASF licenses this file  
  * to you under the Apache License, Version 2.0 (the  
  * "License"); you may not use this file except in compliance  
  * with the License. You may obtain a copy of the License at  
  *  
  *   http://www.apache.org/licenses/LICENSE-2.0  
  *  
  * Unless required by applicable law or agreed to in writing, software  
  * distributed under the License is distributed on an "AS IS" BASIS,  
  * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  
  * See the License for the specific language governing permissions and  
  * limitations under the License.  
  */  
 -->  
 <configuration>  
  <property>  
   <name>hbase.rootdir</name>  
   <value>file:///home/user/Desktop/hbase-1.1.1</value>  
  </property>  
  <property>  
   <name>hbase.zookeeper.property.dataDir</name>  
   <value>/home/user/zookeeper</value>  
  </property>  
 </configuration>  

Okay, we are ready to start HBase. Start it with the helpful script bin/start-hbase.sh.

 user@localhost:~/Desktop/hbase-1.1.1$ bin/start-hbase.sh   
 starting master, logging to /home/user/Desktop/hbase-1.1.1/bin/../logs/hbase-user-master-localhost.out  
   
 user@localhost:~/Desktop/hbase-1.1.1/logs$ tail -F hbase-user-master-localhost.out SecurityAuth.audit hbase-user-master-localhost.log  
 ==> hbase-user-master-localhost.out <==  
   
 ==> SecurityAuth.audit <==  
 2015-08-18 17:49:41,533 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 127.0.1.1 port: 36745 with version info: version: "1.1.1" url: "git://hw11397.local/Volumes/hbase-1.1.1RC0/hbase" revision: "d0a115a7267f54e01c72c603ec53e91ec418292f" user: "ndimiduk" date: "Tue Jun 23 14:44:07 PDT 2015" src_checksum: "6e2d8cecbd28738ad86daacb25dc467e"  
 2015-08-18 17:49:46,812 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 127.0.0.1 port: 53042 with version info: version: "1.1.1" url: "git://hw11397.local/Volumes/hbase-1.1.1RC0/hbase" revision: "d0a115a7267f54e01c72c603ec53e91ec418292f" user: "ndimiduk" date: "Tue Jun 23 14:44:07 PDT 2015" src_checksum: "6e2d8cecbd28738ad86daacb25dc467e"  
 2015-08-18 17:49:48,309 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 127.0.0.1 port: 53043 with version info: version: "1.1.1" url: "git://hw11397.local/Volumes/hbase-1.1.1RC0/hbase" revision: "d0a115a7267f54e01c72c603ec53e91ec418292f" user: "ndimiduk" date: "Tue Jun 23 14:44:07 PDT 2015" src_checksum: "6e2d8cecbd28738ad86daacb25dc467e"  
 2015-08-18 17:49:49,317 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 127.0.0.1 port: 53044 with version info: version: "1.1.1" url: "git://hw11397.local/Volumes/hbase-1.1.1RC0/hbase" revision: "d0a115a7267f54e01c72c603ec53e91ec418292f" user: "ndimiduk" date: "Tue Jun 23 14:44:07 PDT 2015" src_checksum: "6e2d8cecbd28738ad86daacb25dc467e"  
   
 ==> hbase-user-master-localhost.log <==  
 2015-08-18 17:49:49,281 INFO [StoreOpener-78a2a3664205fcf679d2043ac3259648-1] hfile.CacheConfig: blockCache=LruBlockCache{blockCount=0, currentSize=831688, freeSize=808983544, maxSize=809815232, heapSize=831688, minSize=769324480, minFactor=0.95, multiSize=384662240, multiFactor=0.5, singleSize=192331120, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false  
 2015-08-18 17:49:49,282 INFO [StoreOpener-78a2a3664205fcf679d2043ac3259648-1] compactions.CompactionConfiguration: size [134217728, 9223372036854775807); files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point 2684354560; major period 604800000, major jitter 0.500000, min locality to compact 0.000000  
 2015-08-18 17:49:49,295 INFO [RS_OPEN_REGION-localhost:60631-0] regionserver.HRegion: Onlined 78a2a3664205fcf679d2043ac3259648; next sequenceid=2  
 2015-08-18 17:49:49,303 INFO [PostOpenDeployTasks:78a2a3664205fcf679d2043ac3259648] regionserver.HRegionServer: Post open deploy tasks for hbase:namespace,,1439891388424.78a2a3664205fcf679d2043ac3259648.  
 2015-08-18 17:49:49,322 INFO [PostOpenDeployTasks:78a2a3664205fcf679d2043ac3259648] hbase.MetaTableAccessor: Updated row hbase:namespace,,1439891388424.78a2a3664205fcf679d2043ac3259648. with server=localhost,60631,1439891378840  
 2015-08-18 17:49:49,332 INFO [AM.ZK.Worker-pool3-t6] master.RegionStates: Transition {78a2a3664205fcf679d2043ac3259648 state=OPENING, ts=1439891389276, server=localhost,60631,1439891378840} to {78a2a3664205fcf679d2043ac3259648 state=OPEN, ts=1439891389332, server=localhost,60631,1439891378840}  
 2015-08-18 17:49:49,603 INFO [ProcessThread(sid:0 cport:-1):] server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14f4036b87d0000 type:create cxid:0x1d5 zxid:0x44 txntype:-1 reqpath:n/a Error Path:/hbase/namespace/default Error:KeeperErrorCode = NodeExists for /hbase/namespace/default  
 2015-08-18 17:49:49,625 INFO [ProcessThread(sid:0 cport:-1):] server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14f4036b87d0000 type:create cxid:0x1d8 zxid:0x46 txntype:-1 reqpath:n/a Error Path:/hbase/namespace/hbase Error:KeeperErrorCode = NodeExists for /hbase/namespace/hbase  
 2015-08-18 17:49:49,639 INFO [localhost:51452.activeMasterManager] master.HMaster: Master has completed initialization  
 2015-08-18 17:49:49,642 INFO [localhost:51452.activeMasterManager] quotas.MasterQuotaManager: Quota support disabled  

As you can see, the log files are also available, and jps shows that an HMaster process is running.

 user@localhost: $ jps  
 22144 Jps  
 21793 HMaster  

Okay, let's try out Apache HBase using the HBase shell.

 user@localhost:~/Desktop/hbase-1.1.1$ ./bin/hbase shell  
 2015-08-18 17:55:25,134 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  
 HBase Shell; enter 'help<RETURN>' for list of supported commands.  
 Type "exit<RETURN>" to leave the HBase Shell  
 Version 1.1.1, rd0a115a7267f54e01c72c603ec53e91ec418292f, Tue Jun 23 14:44:07 PDT 2015  
   
 hbase(main):001:0>   
   
The help command shows a very helpful description, such as the following.
   
 hbase(main):001:0> help  
 HBase Shell, version 1.1.1, rd0a115a7267f54e01c72c603ec53e91ec418292f, Tue Jun 23 14:44:07 PDT 2015  
 Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.  
 Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.  
   
 COMMAND GROUPS:  
  Group name: general  
  Commands: status, table_help, version, whoami  
   
  Group name: ddl  
  Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, show_filters  
   
  Group name: namespace  
  Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables  
   
  Group name: dml  
  Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve  
   
  Group name: tools  
  Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, close_region, compact, compact_rs, flush, major_compact, merge_region, move, split, trace, unassign, wal_roll, zk_dump  
   
  Group name: replication  
  Commands: add_peer, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, list_peers, list_replicated_tables, remove_peer, remove_peer_tableCFs, set_peer_tableCFs, show_peer_tableCFs  
   
  Group name: snapshots  
  Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, list_snapshots, restore_snapshot, snapshot  
   
  Group name: configuration  
  Commands: update_all_config, update_config  
   
  Group name: quotas  
  Commands: list_quotas, set_quota  
   
  Group name: security  
  Commands: grant, revoke, user_permission  
   
  Group name: visibility labels  
  Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility  
   
 SHELL USAGE:  
 Quote all names in HBase Shell such as table and column names. Commas delimit  
 command parameters. Type <RETURN> after entering a command to run it.  
 Dictionaries of configuration used in the creation and alteration of tables are  
 Ruby Hashes. They look like this:  
   
  {'key1' => 'value1', 'key2' => 'value2', ...}  
   
 and are opened and closed with curley-braces. Key/values are delimited by the  
 '=>' character combination. Usually keys are predefined constants such as  
 NAME, VERSIONS, COMPRESSION, etc. Constants do not need to be quoted. Type  
 'Object.constants' to see a (messy) list of all constants in the environment.  
   
 If you are using binary keys or values and need to enter them in the shell, use  
 double-quote'd hexadecimal representation. For example:  
   
  hbase> get 't1', "key\x03\x3f\xcd"  
  hbase> get 't1', "key\003\023\011"  
  hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"  
   
 The HBase shell is the (J)Ruby IRB with the above HBase-specific commands added.  
 For more on the HBase Shell, see http://hbase.apache.org/book.html  
 hbase(main):002:0>   

To create a table named test with a single column family named cf:

 hbase(main):002:0> create 'test', 'cf'  
 0 row(s) in 1.5700 seconds  
   
 => Hbase::Table - test  
 hbase(main):003:0>   

To list information about a table:

 hbase(main):001:0> list 'test'  
 TABLE                                                                                               
 test                                                                                               
 1 row(s) in 0.3530 seconds  
   
 => ["test"]  

Let's put something into the table we have just created.

 hbase(main):002:0> put 'test', 'row1', 'cf:a', 'value1'  
 0 row(s) in 0.2280 seconds  
   
 hbase(main):003:0> put 'test', 'row2', 'cf:b', 'value2'  
 0 row(s) in 0.0140 seconds  
   
 hbase(main):004:0> put 'test', 'row3', 'cf:c', 'value3'  
 0 row(s) in 0.0060 seconds  
   
 hbase(main):005:0>   

Here, we insert three values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in HBase are comprised of a column family prefix, cf in this example, followed by a colon and then a column qualifier suffix, a in this case.
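As a rough mental model only, the cell addressing described above can be sketched in plain JavaScript. This is not the HBase client API: `ToyTable` and its `put`, `get`, and `scan` methods are hypothetical names that merely mimic the shell commands, and an in-memory `Map` stands in for HBase's sorted, persistent storage.

```javascript
// Toy in-memory model of HBase cell addressing (NOT the HBase API).
// A cell is addressed by a row key plus a "family:qualifier" column name.
class ToyTable {
  constructor(name, families) {
    this.name = name;
    this.families = new Set(families); // column families are fixed at create time
    this.rows = new Map();             // row key -> Map(column -> {timestamp, value})
  }

  // Store one cell; the column family prefix must exist on the table.
  put(row, column, value, timestamp = Date.now()) {
    const family = column.split(":")[0];
    if (!this.families.has(family)) {
      throw new Error("Unknown column family: " + family);
    }
    if (!this.rows.has(row)) this.rows.set(row, new Map());
    this.rows.get(row).set(column, { timestamp, value });
  }

  // Return all cells of one row as a plain object, or null if the row is absent.
  get(row) {
    const cells = this.rows.get(row);
    return cells ? Object.fromEntries(cells) : null;
  }

  // Return [rowKey, cells] pairs sorted by row key, like a full-table scan.
  scan() {
    return [...this.rows.keys()].sort().map((key) => [key, this.get(key)]);
  }
}

const table = new ToyTable("test", ["cf"]);
table.put("row1", "cf:a", "value1");
table.put("row2", "cf:b", "value2");
table.put("row3", "cf:c", "value3");

console.log(table.get("row1")["cf:a"].value);  // value1
console.log(table.scan().map(([key]) => key)); // [ 'row1', 'row2', 'row3' ]
```

The point of the sketch is that the qualifier (a, b, c) is free-form per row, while the family (cf) has to be declared when the table is created.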

To select all rows from the table, use scan.

 hbase(main):005:0> scan 'test'  
 ROW                       COLUMN+CELL                                                                   
  row1                      column=cf:a, timestamp=1439892359305, value=value1                                                
  row2                      column=cf:b, timestamp=1439892363921, value=value2                                                
  row3                      column=cf:c, timestamp=1439892369775, value=value3                                                
 3 row(s) in 0.0420 seconds  
   
 hbase(main):006:0>   

To get a single row only, use get.

 hbase(main):006:0> get 'test', 'row1'  
 COLUMN                      CELL                                                                       
  cf:a                      timestamp=1439892359305, value=value1                                                      
 1 row(s) in 0.0340 seconds  
   
 hbase(main):007:0>   

Something really interesting about Apache HBase: if you want to delete a table or change its settings, you need to disable it first. After that, you can enable it again.

 hbase(main):007:0> disable 'test'  
 0 row(s) in 2.3610 seconds  
   
 hbase(main):008:0> enable 'test'  
 0 row(s) in 1.2790 seconds  
   
 hbase(main):009:0>   

Okay, now let's delete this table.

 hbase(main):009:0> drop 'test'  
   
 ERROR: Table test is enabled. Disable it first.  
   
 Here is some help for this command:  
 Drop the named table. Table must first be disabled:  
  hbase> drop 't1'  
  hbase> drop 'ns1:t1'  
   
   
 hbase(main):010:0> disable 'test'  
 0 row(s) in 2.2640 seconds  
   
 hbase(main):011:0> drop 'test'  
 0 row(s) in 1.2800 seconds  
   
 hbase(main):012:0>   
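The disable-before-drop rule seen in the transcript above behaves like a small state check. The following is a toy JavaScript sketch of that rule, not HBase code: `ToyAdmin` and its methods are hypothetical names, and only the error message text is taken from the shell output.

```javascript
// Toy model of the table lifecycle rule (NOT the HBase API).
class ToyAdmin {
  constructor() {
    this.tables = new Map(); // table name -> { enabled: boolean }
  }

  // New tables start out enabled, as in the shell session.
  create(name) {
    this.tables.set(name, { enabled: true });
  }

  disable(name) {
    this.lookup(name).enabled = false;
  }

  enable(name) {
    this.lookup(name).enabled = true;
  }

  // Dropping is refused while the table is still enabled.
  drop(name) {
    if (this.lookup(name).enabled) {
      throw new Error("Table " + name + " is enabled. Disable it first.");
    }
    this.tables.delete(name);
  }

  lookup(name) {
    const table = this.tables.get(name);
    if (!table) throw new Error("Unknown table: " + name);
    return table;
  }
}

const admin = new ToyAdmin();
admin.create("test");

let firstAttempt;
try {
  admin.drop("test"); // fails: the table is still enabled
} catch (err) {
  firstAttempt = err.message;
}
console.log(firstAttempt); // Table test is enabled. Disable it first.

admin.disable("test");
admin.drop("test");
console.log(admin.tables.has("test")); // false
```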

Okay, we are done with this basic learning. Let's quit for now.

 hbase(main):012:0> quit  
 user@localhost:~/Desktop/hbase-1.1.1$   
   
To stop the Apache HBase instance:
   
 user@localhost:~/Desktop/hbase-1.1.1$ ./bin/stop-hbase.sh   
 stopping hbase.................  
   
   
 user@localhost:~/Desktop/hbase-1.1.1$ jps  
 23399 Jps  
 5445 org.eclipse.equinox.launcher_1.3.0.v20140415-2008.jar  

If you, like me, come from Apache Cassandra, Apache HBase will look very similar. If this interests you, I shall leave you with the following three links, which will take you further.

http://hbase.apache.org/book.html

http://wiki.apache.org/hadoop/Hbase

https://blogs.apache.org/hbase/

Friday, August 28, 2015

First light learning into Apache Storm part 1

Today we will go through another piece of software, Apache Storm. According to the official Apache Storm GitHub repository:

Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.

Well, if you, like me, are new to Apache Storm, this seems a bit vague about what Apache Storm does. Fear not: in these articles we will go through some Apache Storm basics, such as installing Storm, setting up a Storm cluster, and running a Storm hello world. This is also a good video that gives an introduction to Apache Storm.

If you study Storm, the three fundamental terms you will come across are spouts, bolts, and topologies. The definitions below are excerpted from this site link.

There are just three abstractions in Storm: spouts, bolts, and topologies. A spout is a source of streams in a computation. Typically a spout reads from a queueing broker such as Kestrel, RabbitMQ, or Kafka, but a spout can also generate its own stream or read from somewhere like the Twitter streaming API. Spout implementations already exist for most queueing systems.
A bolt processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.
A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. A topology is an arbitrarily complex multi-stage stream computation. Topologies run indefinitely when deployed
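To make the three abstractions less abstract, here is a toy sketch in plain JavaScript. It does not use the Storm API, and it runs in a single process rather than on a cluster. `sentenceSpout`, `splitBolt`, `makeCountBolt`, and `runTopology` are hypothetical names, and the spout is finite so that the example terminates, whereas a real topology runs indefinitely.

```javascript
// Toy, single-process sketch of spout -> bolt -> bolt (NOT the Storm API).

// Spout: a source of tuples. Here a finite generator so the demo terminates.
function* sentenceSpout() {
  yield "the cow jumped over the moon";
  yield "the man went to the store";
}

// Bolt: consumes one input tuple and emits zero or more output tuples.
function splitBolt(sentence, emit) {
  for (const word of sentence.split(" ")) emit(word);
}

// Bolt with state: a streaming aggregation that counts words.
function makeCountBolt() {
  const counts = new Map();
  const bolt = (word, emit) => {
    counts.set(word, (counts.get(word) || 0) + 1);
    emit([word, counts.get(word)]);
  };
  return { bolt, counts };
}

// Topology: wires the spout to a linear chain of bolts, where each bolt
// subscribes to the output stream of the previous stage.
function runTopology(spout, bolts) {
  const feed = (stage, tuple) => {
    if (stage === bolts.length) return; // end of the chain: drop the tuple
    bolts[stage](tuple, (out) => feed(stage + 1, out));
  };
  for (const tuple of spout) feed(0, tuple);
}

const counter = makeCountBolt();
runTopology(sentenceSpout(), [splitBolt, counter.bolt]);
console.log(counter.counts.get("the")); // 4
```

A real topology can be an arbitrary graph rather than a straight chain, and each spout or bolt may run as many parallel tasks; the sketch only shows how tuples flow from one stage to the next.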


Let's first download and install Apache Storm. Pick a stable version here, download it, and then extract it. By now, your directory should look similar to the one below. I'm using Apache Storm 0.9.5 for this learning experience.

 user@localhost:~/Desktop/apache-storm-0.9.5$ ls   
 bin CHANGELOG.md conf DISCLAIMER examples external lib LICENSE logback     NOTICE     public     README.markdown RELEASE SECURITY.md  
 user@localhost:~/Desktop/apache-storm-0.9.5$   

In the next article, we will set up a Storm cluster.

Sunday, August 16, 2015

First time learning gradle

It is difficult to jump-start software development if you are new to the many supporting technologies involved. Today, I'm going to set my project aside and learn another technology: Gradle, a build system that offers much more than just builds. If you are also new to Gradle, you might want to find out what Gradle actually is.

Gradle on Wikipedia:

Gradle is a build automation tool that builds upon the concepts of Apache Ant and Apache Maven and introduces a Groovy-based domain-specific language (DSL) instead of the more traditional XML form of declaring the project configuration. Gradle uses a directed acyclic graph ("DAG") to determine the order in which tasks can be run.
Gradle was designed for multi-project builds which can grow to be quite large, and supports incremental builds by intelligently determining which parts of the build tree are up-to-date, so that any task dependent upon those parts will not need to be re-executed.
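The directed acyclic graph idea can be sketched in a few lines of JavaScript. This is not Gradle's implementation: `dependsOn` and `executionOrder` are hypothetical names, and the edges are my assumption of how the Java plugin's tasks depend on one another, so check them against the Gradle documentation before relying on them.

```javascript
// Toy sketch of DAG-based task ordering (NOT Gradle's implementation).
// Assumed edges: task -> the tasks it depends on, for the Java plugin.
const dependsOn = {
  build: ["assemble", "check"],
  assemble: ["jar"],
  jar: ["classes"],
  classes: ["compileJava", "processResources"],
  check: ["test"],
  test: ["classes", "testClasses"],
  testClasses: ["compileTestJava", "processTestResources"],
  compileTestJava: ["classes"],
  compileJava: [],
  processResources: [],
  processTestResources: [],
};

// Depth-first topological sort: a task is scheduled only after all of
// its dependencies, and each task is scheduled at most once.
function executionOrder(graph, target) {
  const order = [];
  const done = new Set();
  const inProgress = new Set();
  const visit = (task) => {
    if (done.has(task)) return;
    if (inProgress.has(task)) throw new Error("Cycle detected at " + task);
    inProgress.add(task);
    for (const dep of graph[task] || []) visit(dep);
    inProgress.delete(task);
    done.add(task);
    order.push(task);
  };
  visit(target);
  return order;
}

console.log(executionOrder(dependsOn, "build").join(" "));
console.log(executionOrder(dependsOn, "compileJava").join(" ")); // compileJava
```

Asking for build pulls in every task it transitively depends on, while asking for compileJava schedules only that one task, which is why running a single task is so much quicker than a full build.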

If you have many projects that depend on a common project, Gradle is designed for that situation. We will look into the basics of the Gradle build automation tool today. I love to code in Java, so I will use Java for this demo. First, let's install Gradle. If you are using a deb-based distribution like Debian or Ubuntu, installing Gradle is as easy as $ sudo apt-get install gradle. Otherwise, you can download Gradle from http://gradle.org/ and install it on your system. Now let's create a Gradle build file. See below.

 user@localhost:~/gradle$ cat build.gradle   
 apply plugin: 'java'  
 user@localhost:~/gradle$ ls -a  
 total 36K  
 -rw-r--r--  1 user user  21 Aug 6 17:15 build.gradle  
 drwxr-xr-x 214 user user 28K Aug 6 17:15 ..  
 drwxr-xr-x  2 user user 4.0K Aug 6 17:15 .  
 user@localhost:~/gradle$ gradle build  
 :compileJava UP-TO-DATE  
 :processResources UP-TO-DATE  
 :classes UP-TO-DATE  
 :jar  
 :assemble  
 :compileTestJava UP-TO-DATE  
 :processTestResources UP-TO-DATE  
 :testClasses UP-TO-DATE  
 :test  
 :check  
 :build  
   
 BUILD SUCCESSFUL  
   
 Total time: 13.304 secs  
 user@localhost:~/gradle$ ls -a  
 total 44K  
 -rw-r--r--  1 user user  21 Aug 6 17:15 build.gradle  
 drwxr-xr-x 214 user user 28K Aug 6 17:15 ..  
 drwxr-xr-x  3 user user 4.0K Aug 6 17:15 .gradle  
 drwxr-xr-x  4 user user 4.0K Aug 6 17:15 .  
 drwxr-xr-x  6 user user 4.0K Aug 6 17:15 build  
 user@localhost:~/gradle$ find .gradle/  
 .gradle/  
 .gradle/1.5  
 .gradle/1.5/taskArtifacts  
 .gradle/1.5/taskArtifacts/fileHashes.bin  
 .gradle/1.5/taskArtifacts/taskArtifacts.bin  
 .gradle/1.5/taskArtifacts/fileSnapshots.bin  
 .gradle/1.5/taskArtifacts/outputFileStates.bin  
 .gradle/1.5/taskArtifacts/cache.properties.lock  
 .gradle/1.5/taskArtifacts/cache.properties  
 user@localhost:~/gradle$ find build  
 build  
 build/libs  
 build/libs/gradle.jar  
 build/test-results  
 build/test-results/binary  
 build/test-results/binary/test  
 build/test-results/binary/test/results.bin  
 build/reports  
 build/reports/tests  
 build/reports/tests/report.js  
 build/reports/tests/index.html  
 build/reports/tests/base-style.css  
 build/reports/tests/style.css  
 build/tmp  
 build/tmp/jar  
 build/tmp/jar/MANIFEST.MF  

A one-line input produces so many output files. Amazing! Why were so many files generated? Read the output of the command: it compiles, processes resources, builds the jar, assembles, tests, checks, and builds. I will not explain what each of these means one by one; you will learn better if you read the definitions yourself, which are documented very well here. You might say: hey, my Java sources are in a different path, can Gradle handle this? Yes, of course! In the build file you created, you can add another line.

 // set the source java folder to another non maven standard path  
 sourceSets.main.java.srcDirs = ['src/java']  

Most of us coming from Java have an Ant build file. If that is the case, Gradle integrates nicely with Ant too: you just need to import the Ant build file and then call the Ant target from Gradle. See the code snippet below.

 user@localhost:~/gradle$ cat build.xml   
 <project>  
  <target name="helloAnt">  
   <echo message="hello this is ant."/>  
  </target>  
 </project>  
 user@localhost:~/gradle$ cat build.gradle  
 apply plugin: 'java'  
   
 // set the source java folder to another non maven standard path  
 sourceSets.main.java.srcDirs = ['src/java']  
   
 // import ant build file.  
 ant.importBuild 'build.xml'  
 user@localhost:~/gradle$ gradle helloAnt   
 :helloAnt  
 [ant:echo] hello this is ant.  
   
 BUILD SUCCESSFUL  
   
 Total time: 5.573 secs  

That looks pretty good! If you are curious about which Gradle parameters can help you figure out why a build went wrong, you should really read this link. Also read up on the environment variables, as you can specify a different JDK for Gradle or even pass Java parameters when compiling big projects.

You might also ask: what if I only want to compile and don't want to go through all the automatic build steps above? No problem. Since this is a Java project, you specify compileJava.

 user@localhost:~/gradle$ gradle compileJava  
 :compileJava UP-TO-DATE  
   
 BUILD SUCCESSFUL  
   
 Total time: 4.976 secs  

As you can see, Gradle is very flexible, and because of that you might want to exploit it further, for example by customizing tasks in build.gradle, listing projects, listing tasks, and so on. For that, read here, as it explains and gives a lot of examples of how all that can be done. At this stage, you might want to add more features to the Gradle build file. Okay, let's do just that.

 user@localhost:~/gradle$ cat build.gradle   
 apply plugin: 'java'  
 apply plugin: 'eclipse'  
   
 // set the source java folder to another non maven standard path  
 // default src/main/java  
 sourceSets.main.java.srcDirs = ['src/java']  
   
 // default src test   
 //src/test/java  
   
 // default src resources.  
 // src/main/resources   
   
 // default src test resources.  
 // src/test/resources  
   
 // default build  
 // build  
   
 // default jar built  
 // build/libs  
   
   
 // dependencies of external jar, we reference the very good from maven.  
 repositories {  
   mavenCentral()  
 }  
   
 // actual libs dependencies  
 dependencies {  
   compile group: 'commons-collections', name: 'commons-collections', version: '3.2'  
   testCompile group: 'junit', name: 'junit', version: '4.+'  
 }  
   
 test {  
   testLogging {  
     // Show that tests are run in the command-line output  
     events 'started', 'passed'  
   }  
 }  
   
 sourceCompatibility = 1.5  
 version = '1.0'  
 jar {  
   manifest {  
     attributes 'Implementation-Title': 'Gradle Quickstart',  
           'Implementation-Version': version  
   }  
 }  
   
 // import ant build file.  
 ant.importBuild 'build.xml'  
   
 // common for subprojects  
 subprojects {  
   apply plugin: 'java'  
   
   repositories {  
     mavenCentral()  
   }  
   
   dependencies {  
     testCompile 'junit:junit:4.12'  
   }  
   
   version = '1.0'  
   
   jar {  
     manifest.attributes provider: 'gradle'  
   }  
 }  
 user@localhost:~/gradle$ cat settings.gradle   
 include ":nativeapp",":webapp"  

Now, if you want to generate the Eclipse configuration, just run gradle eclipse; all Eclipse configuration and settings are created automatically. Of course, you can customize the settings even further.

 user@localhost:~/gradle$ gradle eclipse  
 :eclipseClasspath  
 Download http://repo1.maven.org/maven2/junit/junit/4.12/junit-4.12.pom  
 Download http://repo1.maven.org/maven2/junit/junit/4.12/junit-4.12-sources.jar  
 Download http://repo1.maven.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3-sources.jar  
 Download http://repo1.maven.org/maven2/junit/junit/4.12/junit-4.12.jar  
 :eclipseJdt  
 :eclipseProject  
 :eclipse  
   
 BUILD SUCCESSFUL  
   
 Total time: 19.497 secs  
 user@localhost:~/gradle$ find .  
 .  
 .  
 ./build.xml  
 ./build  
 ./build/classes  
 ./build/classes/test  
 ./build/classes/test/org  
 ./build/classes/test/org/just4fun  
 ./build/classes/test/org/just4fun/voc  
 ./build/classes/test/org/just4fun/voc/file  
 ./build/classes/test/org/just4fun/voc/file/QuickTest.class  
 ./build/libs  
 ./build/libs/gradle.jar  
 ./build/libs/gradle-1.0.jar  
 ./build/test-results  
 ./build/test-results/binary  
 ./build/test-results/binary/test  
 ./build/test-results/binary/test/results.bin  
 ./build/test-results/TEST-org.just4fun.voc.file.QuickTest.xml  
 ./build/reports  
 ./build/reports/tests  
 ./build/reports/tests/report.js  
 ./build/reports/tests/index.html  
 ./build/reports/tests/org.just4fun.voc.file.html  
 ./build/reports/tests/base-style.css  
 ./build/reports/tests/org.just4fun.voc.file.QuickTest.html  
 ./build/reports/tests/style.css  
 ./build/dependency-cache  
 ./build/tmp  
 ./build/tmp/jar  
 ./build/tmp/jar/MANIFEST.MF  
 ./webapp  
 ./webapp/build.gradle  
 ./.gradle  
 ./.gradle/1.5  
 ./.gradle/1.5/taskArtifacts  
 ./.gradle/1.5/taskArtifacts/fileHashes.bin  
 ./.gradle/1.5/taskArtifacts/taskArtifacts.bin  
 ./.gradle/1.5/taskArtifacts/fileSnapshots.bin  
 ./.gradle/1.5/taskArtifacts/outputFileStates.bin  
 ./.gradle/1.5/taskArtifacts/cache.properties.lock  
 ./.gradle/1.5/taskArtifacts/cache.properties  
 ./.classpath  
 ./build.gradle  
 ./.project  
 ./.settings  
 ./.settings/org.eclipse.jdt.core.prefs  
 ./settings.gradle  
 ./nativeapp  
 ./nativeapp/build.gradle  
 ./src  
 ./src/test  
 ./src/test/java  
 ./src/test/java/org  
 ./src/test/java/org/just4fun  
 ./src/test/java/org/just4fun/voc  
 ./src/test/java/org/just4fun/voc/file  
 ./src/test/java/org/just4fun/voc/file/QuickTest.java  

Now, I create a simple unit test class file; see below. You can then run just a single unit test, which is very cool.

 user@localhost:~/gradle$ find src/  
 src/  
 src/test  
 src/test/java  
 src/test/java/org  
 src/test/java/org/just4fun  
 src/test/java/org/just4fun/voc  
 src/test/java/org/just4fun/voc/file  
 src/test/java/org/just4fun/voc/file/QuickTest.java  
 $ gradle -Dtest.single=Quick test  
 :compileJava UP-TO-DATE  
 :processResources UP-TO-DATE  
 :classes UP-TO-DATE  
 :compileTestJavawarning: [options] bootstrap class path not set in conjunction with -source 1.5  
 1 warning  
   
 :processTestResources UP-TO-DATE  
 :testClasses  
 :test  
   
 org.just4fun.voc.file.QuickTest > test STARTED  
   
 org.just4fun.voc.file.QuickTest > test PASSED  
   
 BUILD SUCCESSFUL  
   
 Total time: 55.81 secs  
 user@localhost:~/gradle $  

Two additional directories were created, nativeapp and webapp. These are subprojects of this big project, and each contains its own Gradle build file. In the parent build file, we see a subprojects block, whose configuration is applied to all the subprojects. You create a settings.gradle to specify the subprojects.

That's all for today. This is just an introduction that quickly dives into some of the cool features of Gradle; I hope it gives you some idea of where to head next. Good luck!


Saturday, August 15, 2015

First learning Node.js

We will learn another piece of software today: Node.js, another name that I have come across many times when reading information technology articles. First, let's take a look at what Node.js is. From the official site:

Node.js® is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices.

That is a lot to take in from just two sentences, but as you continue to read this article, you will get the idea. If you have basic JavaScript coding experience, you might think JavaScript is only for scripts that run in the browser to enhance the user experience. But as JavaScript evolved, Node.js made it possible to write server applications in it! We will see that in a moment.

Okay, let's install Node.js. If you are using a deb-based Linux distribution, for example Debian or Ubuntu, it is as easy as $ sudo apt-get install nodejs. Otherwise, you can download a copy from the official site and install it.

Let's start with a simple Node.js hello world. It is very easy: create helloworld.js and print the message. See below.

 user@localhost:~/nodejs$ cat helloworld.js   
 console.log("Hello World");  
 user@localhost:~/nodejs$ nodejs helloworld.js   
 Hello World  
 user@localhost:~/nodejs$   

Very simple: a one-liner produces the hello world output. You might ask what Node.js functionality you can use other than console. Well, at the end of this article, I will give you a link so you can explore further. In the meantime, I will show you how easy it is to create a web server using Node.js! Let's read the code below.

 user@localhost:~/nodejs$ cat server.js   
 var http = require("http");  
 http.createServer(function(request, response) {  
  response.writeHead(200, {"Content-Type": "text/plain"});  
  response.write("Hello World");  
  response.end();  
 }).listen(8888);  
 console.log("create a webserver at port 8888");  
 user@localhost:~/nodejs$ nodejs server.js   
 create a webserver at port 8888  

As you can see, we create a file called server.js that requires a module called http. We pass an anonymous function to the createServer function of the http module. For every request, the response returns HTTP status 200 with the text Hello World. You can try it by opening localhost:8888 in your browser. Notice that execution continues after the server is created: listen() returns immediately instead of blocking until a request arrives, so the console.log line runs right away, and the anonymous function is called later for each incoming request. This event-driven, non-blocking style is what makes Node.js asynchronous.
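
To see this ordering in isolation, here is a minimal sketch that uses setTimeout in place of a real network request. The variable name order and the messages are only for illustration; the point is that the scheduled callback runs after the line that follows it.

```javascript
// Records the order in which the statements actually run.
var order = [];

// Schedule a callback, much like createServer registers the request handler.
// setTimeout returns immediately; the callback runs later from the event loop.
setTimeout(function () {
  order.push("callback");
  console.log(order.join(" -> "));
}, 0);

// This line runs before the callback above, even with a 0 ms delay.
order.push("next line");
```

Run it with nodejs and it should print next line -> callback, because the callback only fires once the current script has finished.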

Well, by now you should understand what Node.js can do for you, and if you are interested in learning more about Node.js, I will leave you this very helpful link.

Friday, August 14, 2015

Light learning apache spark

A while back, I was reading articles and many of them referenced Spark, so this week I thought, hey, why not check out what Spark actually is. Googling spark returned many results, and we are particularly interested in Apache Spark. Let us take a look today at Apache Spark and what it is all about. From the official Spark GitHub repository,

Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, and Python, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

Okay, let's download a copy of Spark to your local PC. You can download it from this site.


Extract the downloaded file and run the commands. Not good:

 user@localhost:~/Desktop/spark-1.4.1$ ./bin/pyspark   
 ls: cannot access /home/user/Desktop/spark-1.4.1/assembly/target/scala-2.10: No such file or directory  
 Failed to find Spark assembly in /home/user/Desktop/spark-1.4.1/assembly/target/scala-2.10.  
 You need to build Spark before running this program.  
 user@localhost:~/Desktop/spark-1.4.1$ ./bin/spark-shell   
 ls: cannot access /home/user/Desktop/spark-1.4.1/assembly/target/scala-2.10: No such file or directory  
 Failed to find Spark assembly in /home/user/Desktop/spark-1.4.1/assembly/target/scala-2.10.  
 You need to build Spark before running this program.  

Well, the default download option is the source package, so you have to compile the source first.

 user@localhost:~/Desktop/spark-1.4.1$ mvn -DskipTests clean package  
 [INFO] Scanning for projects...  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Reactor Build Order:  
 [INFO]   
 [INFO] Spark Project Parent POM  
 [INFO] Spark Launcher Project  
 [INFO] Spark Project Networking  
 [INFO] Spark Project Shuffle Streaming Service  
 [INFO] Spark Project Unsafe  
 ...  
 ...  
 ...  
 constituent[20]: file:/usr/share/maven/lib/wagon-http-shaded.jar  
 constituent[21]: file:/usr/share/maven/lib/maven-settings-builder-3.x.jar  
 constituent[22]: file:/usr/share/maven/lib/maven-aether-provider-3.x.jar  
 constituent[23]: file:/usr/share/maven/lib/maven-core-3.x.jar  
 constituent[24]: file:/usr/share/maven/lib/plexus-cipher.jar  
 constituent[25]: file:/usr/share/maven/lib/aether-util.jar  
 constituent[26]: file:/usr/share/maven/lib/commons-httpclient.jar  
 constituent[27]: file:/usr/share/maven/lib/commons-cli.jar  
 constituent[28]: file:/usr/share/maven/lib/aether-api.jar  
 constituent[29]: file:/usr/share/maven/lib/maven-model-3.x.jar  
 constituent[30]: file:/usr/share/maven/lib/guava.jar  
 constituent[31]: file:/usr/share/maven/lib/wagon-file.jar  
 ---------------------------------------------------  
 Exception in thread "main" java.lang.OutOfMemoryError: PermGen space  
      at java.lang.ClassLoader.defineClass1(Native Method)  
      at java.lang.ClassLoader.defineClass(ClassLoader.java:800)  
      at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)  
      at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)  
      at java.net.URLClassLoader.access$100(URLClassLoader.java:71)  
      at java.net.URLClassLoader$1.run(URLClassLoader.java:361)  
      at java.net.URLClassLoader$1.run(URLClassLoader.java:355)  
      at java.security.AccessController.doPrivileged(Native Method)  
      at java.net.URLClassLoader.findClass(URLClassLoader.java:354)  
      at org.codehaus.plexus.classworlds.realm.ClassRealm.loadClassFromSelf(ClassRealm.java:401)  
      at org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy.loadClass(SelfFirstStrategy.java:42)  
      at org.codehaus.plexus.classworlds.realm.ClassRealm.unsynchronizedLoadClass(ClassRealm.java:271)  
      at org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:247)  
      at org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:239)  
      at org.apache.maven.cli.MavenCli.execute(MavenCli.java:545)  
      at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)  
      at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)  
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)  
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)  
      at java.lang.reflect.Method.invoke(Method.java:606)  
      at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)  
      at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)  
      at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)  
      at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)  

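Okay, the build ran out of PermGen space, so let's beef up the build settings a little through MAVEN_OPTS. Even then, the build with the system mvn took a very long time, so eventually I switched to the mvn script in the build directory of the Spark source tree. See below.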

 user@localhost:~/Desktop/spark-1.4.1$ export MAVEN_OPTS="-XX:MaxPermSize=1024M"  
 user@localhost:~/Desktop/spark-1.4.1$ mvn -DskipTests clean package  
   
 [INFO] Scanning for projects...  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Reactor Build Order:  
 [INFO]   
 [INFO] Spark Project Parent POM  
 [INFO] Spark Launcher Project  
 [INFO] Spark Project Networking  
 [INFO] Spark Project Shuffle Streaming Service  
 [INFO] Spark Project Unsafe  
 [INFO] Spark Project Core  
   
 user@localhost:~/Desktop/spark-1.4.1$ build/mvn -DskipTests clean package  
 [INFO] Scanning for projects...  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Reactor Build Order:  
 [INFO]   
 [INFO] Spark Project Parent POM  
 [INFO] Spark Launcher Project  
 [INFO] Spark Project Networking  
 [INFO] Spark Project Shuffle Streaming Service  
 [INFO] Spark Project Unsafe  
 [INFO] Spark Project Core  
 ..  
 ...  
 ...  
 ...  
 get/spark-streaming-kafka-assembly_2.10-1.4.1-shaded.jar  
 [INFO]   
 [INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @ spark-streaming-kafka-assembly_2.10 ---  
 [INFO] Building jar: /home/user/Desktop/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1-sources.jar  
 [INFO]   
 [INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @ spark-streaming-kafka-assembly_2.10 ---  
 [INFO] Building jar: /home/user/Desktop/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1-test-sources.jar  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Reactor Summary:  
 [INFO]   
 [INFO] Spark Project Parent POM .......................... SUCCESS [26.138s]  
 [INFO] Spark Launcher Project ............................ SUCCESS [1:15.976s]  
 [INFO] Spark Project Networking .......................... SUCCESS [26.347s]  
 [INFO] Spark Project Shuffle Streaming Service ........... SUCCESS [14.123s]  
 [INFO] Spark Project Unsafe .............................. SUCCESS [12.643s]  
 [INFO] Spark Project Core ................................ SUCCESS [9:49.622s]  
 [INFO] Spark Project Bagel ............................... SUCCESS [17.426s]  
 [INFO] Spark Project GraphX .............................. SUCCESS [53.601s]  
 [INFO] Spark Project Streaming ........................... SUCCESS [1:34.290s]  
 [INFO] Spark Project Catalyst ............................ SUCCESS [2:04.020s]  
 [INFO] Spark Project SQL ................................. SUCCESS [2:11.032s]  
 [INFO] Spark Project ML Library .......................... SUCCESS [2:57.880s]  
 [INFO] Spark Project Tools ............................... SUCCESS [6.920s]  
 [INFO] Spark Project Hive ................................ SUCCESS [2:58.649s]  
 [INFO] Spark Project REPL ................................ SUCCESS [36.564s]  
 [INFO] Spark Project Assembly ............................ SUCCESS [3:13.152s]  
 [INFO] Spark Project External Twitter .................... SUCCESS [1:09.316s]  
 [INFO] Spark Project External Flume Sink ................. SUCCESS [42.294s]  
 [INFO] Spark Project External Flume ...................... SUCCESS [37.907s]  
 [INFO] Spark Project External MQTT ....................... SUCCESS [1:20.999s]  
 [INFO] Spark Project External ZeroMQ ..................... SUCCESS [29.090s]  
 [INFO] Spark Project External Kafka ...................... SUCCESS [54.212s]  
 [INFO] Spark Project Examples ............................ SUCCESS [5:54.508s]  
 [INFO] Spark Project External Kafka Assembly ............. SUCCESS [1:24.962s]  
 [INFO] ------------------------------------------------------------------------  
 [INFO] BUILD SUCCESS  
 [INFO] ------------------------------------------------------------------------  
 [INFO] Total time: 41:53.884s  
 [INFO] Finished at: Tue Aug 04 08:56:02 MYT 2015  
 [INFO] Final Memory: 71M/684M  
 [INFO] ------------------------------------------------------------------------  

Yes, finally the build succeeded. Even so, as you can see above, it took about 41 minutes on my PC just to compile. Okay, now that all the libraries are built, let's repeat the command we typed earlier.

 $ ./bin/spark-shell  
 log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).  
 log4j:WARN Please initialize the log4j system properly.  
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.  
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties  
 15/08/04 20:21:16 INFO SecurityManager: Changing view acls to: user  
 15/08/04 20:21:16 INFO SecurityManager: Changing modify acls to: user  
 15/08/04 20:21:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)  
 15/08/04 20:21:16 INFO HttpServer: Starting HTTP Server  
 15/08/04 20:21:17 INFO Utils: Successfully started service 'HTTP class server' on port 56379.  
 Welcome to  
    ____       __  
    / __/__ ___ _____/ /__  
   _\ \/ _ \/ _ `/ __/ '_/  
   /___/ .__/\_,_/_/ /_/\_\  version 1.4.1  
    /_/  
   
 Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_55)  
 Type in expressions to have them evaluated.  
 Type :help for more information.  
 15/08/04 20:21:24 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.1.1; using 192.168.133.28 instead (on interface eth0)  
 15/08/04 20:21:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address  
 15/08/04 20:21:24 INFO SparkContext: Running Spark version 1.4.1  
 15/08/04 20:21:24 INFO SecurityManager: Changing view acls to: user  
 15/08/04 20:21:24 INFO SecurityManager: Changing modify acls to: user  
 15/08/04 20:21:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)  
 15/08/04 20:21:25 INFO Slf4jLogger: Slf4jLogger started  
 15/08/04 20:21:26 INFO Remoting: Starting remoting  
 15/08/04 20:21:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.133.28:47888]  
 15/08/04 20:21:26 INFO Utils: Successfully started service 'sparkDriver' on port 47888.  
 15/08/04 20:21:27 INFO SparkEnv: Registering MapOutputTracker  
 15/08/04 20:21:27 INFO SparkEnv: Registering BlockManagerMaster  
 15/08/04 20:21:27 INFO DiskBlockManager: Created local directory at /tmp/spark-660b5f39-26be-4ea2-8593-c0c05a093a23/blockmgr-c3225f03-5ecf-4fed-bbe4-df2331ac7742  
 15/08/04 20:21:27 INFO MemoryStore: MemoryStore started with capacity 265.4 MB  
 15/08/04 20:21:27 INFO HttpFileServer: HTTP File server directory is /tmp/spark-660b5f39-26be-4ea2-8593-c0c05a093a23/httpd-3ab40971-a6d0-42a7-b39e-4d1ce4290642  
 15/08/04 20:21:27 INFO HttpServer: Starting HTTP Server  
 15/08/04 20:21:27 INFO Utils: Successfully started service 'HTTP file server' on port 50089.  
 15/08/04 20:21:27 INFO SparkEnv: Registering OutputCommitCoordinator  
 15/08/04 20:21:28 INFO Utils: Successfully started service 'SparkUI' on port 4040.  
 15/08/04 20:21:28 INFO SparkUI: Started SparkUI at http://192.168.133.28:4040  
 15/08/04 20:21:28 INFO Executor: Starting executor ID driver on host localhost  
 15/08/04 20:21:28 INFO Executor: Using REPL class URI: http://192.168.133.28:56379  
 15/08/04 20:21:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36428.  
 15/08/04 20:21:28 INFO NettyBlockTransferService: Server created on 36428  
 15/08/04 20:21:28 INFO BlockManagerMaster: Trying to register BlockManager  
 15/08/04 20:21:28 INFO BlockManagerMasterEndpoint: Registering block manager localhost:36428 with 265.4 MB RAM, BlockManagerId(driver, localhost, 36428)  
 15/08/04 20:21:28 INFO BlockManagerMaster: Registered BlockManager  
 15/08/04 20:21:29 INFO SparkILoop: Created spark context..  
 Spark context available as sc.  
 15/08/04 20:21:30 INFO SparkILoop: Created sql context..  
 SQL context available as sqlContext.  
   
 scala>   

Okay, everything looks good, the error above no longer exists. Let's explore further.

 scala> sc.parallelize(1 to 1000).count()  
 15/08/04 20:30:05 INFO SparkContext: Starting job: count at <console>:22  
 15/08/04 20:30:05 INFO DAGScheduler: Got job 0 (count at <console>:22) with 4 output partitions (allowLocal=false)  
 15/08/04 20:30:05 INFO DAGScheduler: Final stage: ResultStage 0(count at <console>:22)  
 15/08/04 20:30:05 INFO DAGScheduler: Parents of final stage: List()  
 15/08/04 20:30:05 INFO DAGScheduler: Missing parents: List()  
 15/08/04 20:30:05 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:22), which has no missing parents  
 15/08/04 20:30:05 INFO MemoryStore: ensureFreeSpace(1096) called with curMem=0, maxMem=278302556  
 15/08/04 20:30:05 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1096.0 B, free 265.4 MB)  
 15/08/04 20:30:05 INFO MemoryStore: ensureFreeSpace(804) called with curMem=1096, maxMem=278302556  
 15/08/04 20:30:05 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 804.0 B, free 265.4 MB)  
 15/08/04 20:30:05 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:36428 (size: 804.0 B, free: 265.4 MB)  
 15/08/04 20:30:05 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874  
 15/08/04 20:30:05 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:22)  
 15/08/04 20:30:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks  
 15/08/04 20:30:05 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1369 bytes)  
 15/08/04 20:30:05 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1369 bytes)  
 15/08/04 20:30:05 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 1369 bytes)  
 15/08/04 20:30:05 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 1426 bytes)  
 15/08/04 20:30:05 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)  
 15/08/04 20:30:05 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)  
 15/08/04 20:30:05 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)  
 15/08/04 20:30:05 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)  
 15/08/04 20:30:06 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 658 bytes result sent to driver  
 15/08/04 20:30:06 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 658 bytes result sent to driver  
 15/08/04 20:30:06 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 658 bytes result sent to driver  
 15/08/04 20:30:06 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 658 bytes result sent to driver  
 15/08/04 20:30:06 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 477 ms on localhost (1/4)  
 15/08/04 20:30:06 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 478 ms on localhost (2/4)  
 15/08/04 20:30:06 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 508 ms on localhost (3/4)  
 15/08/04 20:30:06 INFO DAGScheduler: ResultStage 0 (count at <console>:22) finished in 0.520 s  
 15/08/04 20:30:06 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 478 ms on localhost (4/4)  
 15/08/04 20:30:06 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool   
 15/08/04 20:30:06 INFO DAGScheduler: Job 0 finished: count at <console>:22, took 1.079304 s  
 res0: Long = 1000  
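
The log above shows the count being split into 4 tasks, with the partial results combined at the end. That idea can be sketched in plain JavaScript without Spark. The helper name splitIntoPartitions is hypothetical, the partition count of 4 is simply taken from the log, and unlike Spark everything here runs sequentially in one process.

```javascript
// Split an array into numPartitions contiguous slices of near-equal size.
// (Hypothetical helper for illustration, not part of any Spark API.)
function splitIntoPartitions(items, numPartitions) {
  var partitions = [];
  for (var i = 0; i < numPartitions; i++) {
    var start = Math.floor((i * items.length) / numPartitions);
    var end = Math.floor(((i + 1) * items.length) / numPartitions);
    partitions.push(items.slice(start, end));
  }
  return partitions;
}

// Build the numbers 1 to 1000, like `1 to 1000` in the Scala shell.
var numbers = [];
for (var n = 1; n <= 1000; n++) {
  numbers.push(n);
}

// Each "task" counts its own partition; the "driver" adds the partial counts.
var partialCounts = splitIntoPartitions(numbers, 4).map(function (partition) {
  return partition.length;
});
var total = partialCounts.reduce(function (a, b) {
  return a + b;
}, 0);

console.log(partialCounts, total);
```

Each of the four partitions holds 250 numbers, and the total comes to 1000, matching the result in the shell.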

That's pretty nice for a small demo of how Spark works. Now let's move on to the next example and open another terminal.

 user@localhost:~/Desktop/spark-1.4.1$ ./bin/pyspark   
 Python 2.7.10 (default, Jul 1 2015, 10:54:53)   
 [GCC 4.9.2] on linux2  
 Type "help", "copyright", "credits" or "license" for more information.  
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties  
 15/08/04 20:37:42 INFO SparkContext: Running Spark version 1.4.1  
 15/08/04 20:37:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  
 15/08/04 20:37:44 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.1.1; using 182.168.133.28 instead (on interface eth0)  
 15/08/04 20:37:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address  
 15/08/04 20:37:44 INFO SecurityManager: Changing view acls to: user  
 15/08/04 20:37:44 INFO SecurityManager: Changing modify acls to: user  
 15/08/04 20:37:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)  
 15/08/04 20:37:46 INFO Slf4jLogger: Slf4jLogger started  
 15/08/04 20:37:46 INFO Remoting: Starting remoting  
 15/08/04 20:37:46 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@182.168.133.28:35904]  
 15/08/04 20:37:46 INFO Utils: Successfully started service 'sparkDriver' on port 35904.  
 15/08/04 20:37:46 INFO SparkEnv: Registering MapOutputTracker  
 15/08/04 20:37:46 INFO SparkEnv: Registering BlockManagerMaster  
 15/08/04 20:37:47 INFO DiskBlockManager: Created local directory at /tmp/spark-2b46e9e7-1779-45d1-b9cf-46000baf7d9b/blockmgr-e2f47b34-47a8-4b72-a0d6-25d0a7daa02e  
 15/08/04 20:37:47 INFO MemoryStore: MemoryStore started with capacity 265.4 MB  
 15/08/04 20:37:47 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2b46e9e7-1779-45d1-b9cf-46000baf7d9b/httpd-2ec128c2-bad0-4dd9-a826-eab2ee0779cb  
 15/08/04 20:37:47 INFO HttpServer: Starting HTTP Server  
 15/08/04 20:37:47 INFO Utils: Successfully started service 'HTTP file server' on port 45429.  
 15/08/04 20:37:47 INFO SparkEnv: Registering OutputCommitCoordinator  
 15/08/04 20:37:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.  
 15/08/04 20:37:50 INFO Utils: Successfully started service 'SparkUI' on port 4041.  
 15/08/04 20:37:50 INFO SparkUI: Started SparkUI at http://182.168.133.28:4041  
 15/08/04 20:37:50 INFO Executor: Starting executor ID driver on host localhost  
 15/08/04 20:37:51 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47045.  
 15/08/04 20:37:51 INFO NettyBlockTransferService: Server created on 47045  
 15/08/04 20:37:51 INFO BlockManagerMaster: Trying to register BlockManager  
 15/08/04 20:37:51 INFO BlockManagerMasterEndpoint: Registering block manager localhost:47045 with 265.4 MB RAM, BlockManagerId(driver, localhost, 47045)  
 15/08/04 20:37:51 INFO BlockManagerMaster: Registered BlockManager  
 Welcome to  
    ____       __  
    / __/__ ___ _____/ /__  
   _\ \/ _ \/ _ `/ __/ '_/  
   /__ / .__/\_,_/_/ /_/\_\  version 1.4.1  
    /_/  
   
 Using Python version 2.7.10 (default, Jul 1 2015 10:54:53)  
 SparkContext available as sc, SQLContext available as sqlContext.  
 >>> sc.parallelize(range(1000)).count()  
 15/08/04 20:37:55 INFO SparkContext: Starting job: count at <stdin>:1  
 15/08/04 20:37:55 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 4 output partitions (allowLocal=false)  
 15/08/04 20:37:55 INFO DAGScheduler: Final stage: ResultStage 0(count at <stdin>:1)  
 15/08/04 20:37:55 INFO DAGScheduler: Parents of final stage: List()  
 15/08/04 20:37:55 INFO DAGScheduler: Missing parents: List()  
 15/08/04 20:37:55 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at count at <stdin>:1), which has no missing parents  
 15/08/04 20:37:55 INFO MemoryStore: ensureFreeSpace(4416) called with curMem=0, maxMem=278302556  
 15/08/04 20:37:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.3 KB, free 265.4 MB)  
 15/08/04 20:37:55 INFO MemoryStore: ensureFreeSpace(2722) called with curMem=4416, maxMem=278302556  
 15/08/04 20:37:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.7 KB, free 265.4 MB)  
 15/08/04 20:37:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:47045 (size: 2.7 KB, free: 265.4 MB)  
 15/08/04 20:37:55 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874  
 15/08/04 20:37:55 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (PythonRDD[1] at count at <stdin>:1)  
 15/08/04 20:37:55 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks  
 15/08/04 20:37:55 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1873 bytes)  
 15/08/04 20:37:55 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 2117 bytes)  
 15/08/04 20:37:55 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 2123 bytes)  
 15/08/04 20:37:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 2123 bytes)  
 15/08/04 20:37:55 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)  
 15/08/04 20:37:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)  
 15/08/04 20:37:55 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)  
 15/08/04 20:37:55 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)  
 15/08/04 20:37:56 INFO PythonRDD: Times: total = 421, boot = 376, init = 44, finish = 1  
 15/08/04 20:37:56 INFO PythonRDD: Times: total = 418, boot = 354, init = 64, finish = 0  
 15/08/04 20:37:56 INFO PythonRDD: Times: total = 423, boot = 372, init = 51, finish = 0  
 15/08/04 20:37:56 INFO PythonRDD: Times: total = 421, boot = 381, init = 40, finish = 0  
 15/08/04 20:37:56 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 698 bytes result sent to driver  
 15/08/04 20:37:56 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 698 bytes result sent to driver  
 15/08/04 20:37:56 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 698 bytes result sent to driver  
 15/08/04 20:37:56 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 698 bytes result sent to driver  
 15/08/04 20:37:56 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 552 ms on localhost (1/4)  
 15/08/04 20:37:56 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 560 ms on localhost (2/4)  
 15/08/04 20:37:56 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 562 ms on localhost (3/4)  
 15/08/04 20:37:56 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 626 ms on localhost (4/4)  
 15/08/04 20:37:56 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 0.641 s  
 15/08/04 20:37:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool   
 15/08/04 20:37:56 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 1.137405 s  
 1000  
 >>>   

Looks good. The next example calculates pi using Spark.
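
For context, the SparkPi example estimates pi by Monte Carlo sampling: it picks random points in a square and counts how many fall inside the inscribed circle. Below is a minimal single-process sketch of that idea in plain JavaScript, without Spark. The function name estimatePi and the sample size are illustrative assumptions, not taken from the Spark source.

```javascript
// Estimate pi by sampling random points in the square [-1, 1] x [-1, 1].
// The fraction that lands inside the unit circle approaches pi / 4.
function estimatePi(samples) {
  var inside = 0;
  for (var i = 0; i < samples; i++) {
    var x = Math.random() * 2 - 1;
    var y = Math.random() * 2 - 1;
    if (x * x + y * y < 1) {
      inside++;
    }
  }
  return (4 * inside) / samples;
}

// The sample size is an arbitrary choice for this sketch.
var estimate = estimatePi(200000);
console.log("Pi is roughly " + estimate);
```

Because the points are random, the printed value changes slightly on every run, which is also why the Spark output only says "roughly".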

 user@localhost:~/Desktop/spark-1.4.1$ ./bin/run-example SparkPi  
 Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties  
 15/08/04 20:44:50 INFO SparkContext: Running Spark version 1.4.1  
 15/08/04 20:44:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable  
 15/08/04 20:44:51 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.1.1; using 182.168.133.28 instead (on interface eth0)  
 15/08/04 20:44:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address  
 15/08/04 20:44:51 INFO SecurityManager: Changing view acls to: user  
 15/08/04 20:44:51 INFO SecurityManager: Changing modify acls to: user  
 15/08/04 20:44:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)  
 15/08/04 20:44:52 INFO Slf4jLogger: Slf4jLogger started  
 15/08/04 20:44:52 INFO Remoting: Starting remoting  
 15/08/04 20:44:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@182.168.133.28:45817]  
 15/08/04 20:44:53 INFO Utils: Successfully started service 'sparkDriver' on port 45817.  
 15/08/04 20:44:53 INFO SparkEnv: Registering MapOutputTracker  
 15/08/04 20:44:53 INFO SparkEnv: Registering BlockManagerMaster  
 15/08/04 20:44:53 INFO DiskBlockManager: Created local directory at /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/blockmgr-5ed813af-a26f-413c-bdfc-1e08001f9cb2  
 15/08/04 20:44:53 INFO MemoryStore: MemoryStore started with capacity 265.4 MB  
 15/08/04 20:44:53 INFO HttpFileServer: HTTP File server directory is /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/httpd-f07ff755-e34d-4149-b4ac-399e6897221a  
 15/08/04 20:44:53 INFO HttpServer: Starting HTTP Server  
 15/08/04 20:44:53 INFO Utils: Successfully started service 'HTTP file server' on port 50955.  
 15/08/04 20:44:53 INFO SparkEnv: Registering OutputCommitCoordinator  
 15/08/04 20:44:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.  
 15/08/04 20:44:54 INFO SparkUI: Started SparkUI at http://182.168.133.28:4040  
 15/08/04 20:44:58 INFO SparkContext: Added JAR file:/home/user/Desktop/spark-1.4.1/examples/target/scala-2.10/spark-examples-1.4.1-hadoop2.2.0.jar at http://182.168.133.28:50955/jars/spark-examples-1.4.1-hadoop2.2.0.jar with timestamp 1438692298221  
 15/08/04 20:44:58 INFO Executor: Starting executor ID driver on host localhost  
 15/08/04 20:44:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45731.  
 15/08/04 20:44:58 INFO NettyBlockTransferService: Server created on 45731  
 15/08/04 20:44:58 INFO BlockManagerMaster: Trying to register BlockManager  
 15/08/04 20:44:58 INFO BlockManagerMasterEndpoint: Registering block manager localhost:45731 with 265.4 MB RAM, BlockManagerId(driver, localhost, 45731)  
 15/08/04 20:44:58 INFO BlockManagerMaster: Registered BlockManager  
 15/08/04 20:44:59 INFO SparkContext: Starting job: reduce at SparkPi.scala:35  
 15/08/04 20:44:59 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 2 output partitions (allowLocal=false)  
 15/08/04 20:44:59 INFO DAGScheduler: Final stage: ResultStage 0(reduce at SparkPi.scala:35)  
 15/08/04 20:44:59 INFO DAGScheduler: Parents of final stage: List()  
 15/08/04 20:44:59 INFO DAGScheduler: Missing parents: List()  
 15/08/04 20:44:59 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31), which has no missing parents  
 15/08/04 20:44:59 INFO MemoryStore: ensureFreeSpace(1888) called with curMem=0, maxMem=278302556  
 15/08/04 20:44:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 265.4 MB)  
 15/08/04 20:44:59 INFO MemoryStore: ensureFreeSpace(1202) called with curMem=1888, maxMem=278302556  
 15/08/04 20:44:59 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1202.0 B, free 265.4 MB)  
 15/08/04 20:44:59 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45731 (size: 1202.0 B, free: 265.4 MB)  
 15/08/04 20:44:59 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874  
 15/08/04 20:44:59 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31)  
 15/08/04 20:44:59 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks  
 15/08/04 20:44:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1446 bytes)  
 15/08/04 20:44:59 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1446 bytes)  
 15/08/04 20:44:59 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)  
 15/08/04 20:44:59 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)  
 15/08/04 20:44:59 INFO Executor: Fetching http://182.168.133.28:50955/jars/spark-examples-1.4.1-hadoop2.2.0.jar with timestamp 1438692298221  
 15/08/04 20:45:00 INFO Utils: Fetching http://182.168.133.28:50955/jars/spark-examples-1.4.1-hadoop2.2.0.jar to /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/userFiles-f3a72f24-78e5-4d5d-82eb-dcc8c6b899cb/fetchFileTemp5981400277552657211.tmp  
 15/08/04 20:45:03 INFO Executor: Adding file:/tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/userFiles-f3a72f24-78e5-4d5d-82eb-dcc8c6b899cb/spark-examples-1.4.1-hadoop2.2.0.jar to class loader  
 15/08/04 20:45:03 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 736 bytes result sent to driver  
 15/08/04 20:45:03 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 736 bytes result sent to driver  
 15/08/04 20:45:03 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3722 ms on localhost (1/2)  
 15/08/04 20:45:03 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3685 ms on localhost (2/2)  
 15/08/04 20:45:03 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool   
 15/08/04 20:45:03 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:35) finished in 3.750 s  
 15/08/04 20:45:03 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 4.032610 s  
 Pi is roughly 3.14038  
 15/08/04 20:45:03 INFO SparkUI: Stopped Spark web UI at http://182.168.133.28:4040  
 15/08/04 20:45:03 INFO DAGScheduler: Stopping DAGScheduler  
 15/08/04 20:45:03 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!  
 15/08/04 20:45:03 INFO Utils: path = /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/blockmgr-5ed813af-a26f-413c-bdfc-1e08001f9cb2, already present as root for deletion.  
 15/08/04 20:45:03 INFO MemoryStore: MemoryStore cleared  
 15/08/04 20:45:03 INFO BlockManager: BlockManager stopped  
 15/08/04 20:45:03 INFO BlockManagerMaster: BlockManagerMaster stopped  
 15/08/04 20:45:03 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!  
 15/08/04 20:45:03 INFO SparkContext: Successfully stopped SparkContext  
 15/08/04 20:45:03 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.  
 15/08/04 20:45:03 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.  
 15/08/04 20:45:03 INFO Utils: Shutdown hook called  
 15/08/04 20:45:03 INFO Utils: Deleting directory /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9  
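
The "Pi is roughly 3.14038" line in the log above is the result of a Monte Carlo estimate. As far as I understand the SparkPi example, it throws random points into a square, counts how many land inside the unit circle, and sums the counts with a reduce across the tasks. Below is a minimal single-machine sketch of that idea in plain Java with no Spark involved. The class name PiSketch, the method estimatePi and the seed parameter are my own names for illustration; they are not taken from the Spark source.

```java
import java.util.Random;

// Hypothetical single-machine sketch; Spark would split the sampling
// across tasks and add the per-task counts together with a reduce.
class PiSketch {
    // Sample points uniformly in the square [-1, 1] x [-1, 1] and
    // count the ones that fall inside the unit circle.
    static double estimatePi(int samples, long seed) {
        Random random = new Random(seed);
        int inside = 0;
        for (int i = 0; i < samples; i++) {
            double x = random.nextDouble() * 2 - 1;
            double y = random.nextDouble() * 2 - 1;
            if (x * x + y * y < 1) {
                inside++;
            }
        }
        // Circle area / square area = pi / 4, so scale the ratio by 4.
        return 4.0 * inside / samples;
    }

    public static void main(String[] args) {
        System.out.println("Pi is roughly " + estimatePi(1_000_000, 42L));
    }
}
```

The estimate gets closer to pi as the number of samples grows, which is why handing the sampling out to many workers is attractive.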

This introduction should give you an idea of what Spark is all about: you can basically use Spark to do distributed processing. The tutorial gives a quick idea of what Spark is and how you can use it. It is definitely worthwhile to look into the examples directory to see what Spark can really do for you. Before I end this, I think these two links are very helpful to get you further.

http://spark.apache.org/docs/latest/quick-start.html

http://spark.apache.org/docs/latest/#launching-on-a-cluster

Sunday, August 2, 2015

Learning the basics of Cobertura

A while back, I was reading an article about code coverage, and when I googled the topic I found an open source code coverage tool called Cobertura. So naturally I thought I would give it a try, and the result did not disappoint. Read on to find out why. You might be wondering where the name Cobertura comes from; here is the explanation from the official site.

"Cobertura" is the Spanish and Portuguese word for "coverage." We were trying to avoid acronyms and coffee references. It's not too hard to associate the word "cobertura" with the word "coverage," and it even has a bit of a zesty kick to it!

Okay, again, why would I want this when I already have JUnit running?

Cobertura is a free Java tool that calculates the percentage of code accessed by tests. It can be used to identify which parts of your Java program are lacking test coverage. It is based on jcoverage.

So Cobertura is an auxiliary to your existing tests: it shows how much of your main codebase those tests currently cover. One requirement follows from this: you need to have tests written before you use Cobertura.
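
To make "percentage of code accessed by tests" concrete, here is a toy sketch in Java. It is not how Cobertura is implemented; I place the probes by hand, whereas a real coverage tool inserts its bookkeeping for you. The names CoverageSketch, classify, reset and coveragePercent are made up for this illustration.

```java
import java.util.BitSet;

// Toy illustration only: hand-placed probes stand in for the
// bookkeeping a real coverage tool would add automatically.
class CoverageSketch {
    static final int PROBES = 3;
    static final BitSet hits = new BitSet(PROBES);

    // The "production" method, with one probe per code path.
    static String classify(int n) {
        hits.set(0);                 // method entered
        if (n < 0) {
            hits.set(1);             // negative branch taken
            return "negative";
        }
        hits.set(2);                 // non-negative path taken
        return "non-negative";
    }

    // Forget all recorded hits.
    static void reset() {
        hits.clear();
    }

    // Share of probes that were hit at least once, as a percentage.
    static double coveragePercent() {
        return 100.0 * hits.cardinality() / PROBES;
    }

    public static void main(String[] args) {
        reset();
        classify(5);                 // the only "test" we have
        System.out.printf("coverage: %.1f%%%n", coveragePercent());
    }
}
```

The single call never takes the negative branch, so only two of the three probes are hit and the report would flag that branch as untested. Adding a call such as classify(-1) brings the figure to 100%.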

Okay, enough theory, let's dip a toe into the water. First, download the library; you can download it from this link. Next, unzip the file and change into the library directory. There is a nice ready-made example for you to play with.

 $ ls  
 cobertura-2.1.1.jar        cobertura-2.1.1-sources.jar cobertura-check.sh cobertura-instrument.bat     cobertura-merge.bat cobertura-report.bat examples LICENSE.txt  
 cobertura-2.1.1-javadoc.jar cobertura-check.bat      coberturaFlush.war cobertura-instrument.sh     cobertura-merge.sh  cobertura-report.sh  lib        README.markdown  

Change into the examples/basic directory and run the command below. Yes, you will need Ant and Java installed.

 $ ant -p  
 Buildfile: /home/user/Desktop/cobertura-2.1.1/examples/basic/build.xml  
   Cobertura - http://cobertura.sourceforge.net/  
   Copyright (C) 2003 jcoverage ltd.  
   Copyright (C) 2005 Mark Doliner <thekingant@users.sourceforge.net>  
   Copyright (C) 2006 Dan Godfrey  
   Cobertura is licensed under the GNU General Public License  
   Cobertura comes with ABSOLUTELY NO WARRANTY  
 Main targets:  
  clean   Remove all files created by the build/test process.  
  coverage Compile, instrument ourself, run the tests and generate JUnit and coverage reports.  
 Default target: coverage  

So that's pretty clear: we have two targets, clean and coverage. The coverage target generates all the necessary files for you. See below.

 $ ant coverage  
 Buildfile: /home/user/Desktop/cobertura-2.1.1/examples/basic/build.xml  
 init:  
   [mkdir] Created dir: /home/user/Desktop/cobertura-2.1.1/examples/basic/classes  
   [mkdir] Created dir: /home/user/Desktop/cobertura-2.1.1/examples/basic/instrumented  
   [mkdir] Created dir: /home/user/Desktop/cobertura-2.1.1/examples/basic/reports/junit-xml  
   [mkdir] Created dir: /home/user/Desktop/cobertura-2.1.1/examples/basic/reports/junit-html  
   [mkdir] Created dir: /home/user/Desktop/cobertura-2.1.1/examples/basic/reports/cobertura-xml  
   [mkdir] Created dir: /home/user/Desktop/cobertura-2.1.1/examples/basic/reports/cobertura-summary-xml  
   [mkdir] Created dir: /home/user/Desktop/cobertura-2.1.1/examples/basic/reports/cobertura-html  
 compile:  
   [javac] /home/user/Desktop/cobertura-2.1.1/examples/basic/build.xml:36: warning: 'includeantruntime' was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds  
   [javac] Compiling 2 source files to /home/user/Desktop/cobertura-2.1.1/examples/basic/classes  
   [javac] Note: /home/user/Desktop/cobertura-2.1.1/examples/basic/src/com/example/simple/SimpleTest.java uses unchecked or unsafe operations.  
   [javac] Note: Recompile with -Xlint:unchecked for details.  
 instrument:  
   [delete] Deleting directory /home/user/Desktop/cobertura-2.1.1/examples/basic/instrumented  
 [cobertura-instrument] 21:55:08,566 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]  
 [cobertura-instrument] 21:55:08,566 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]  
 [cobertura-instrument] 21:55:08,566 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-instrument] 21:55:08,567 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.  
 [cobertura-instrument] 21:55:08,567 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-instrument] 21:55:08,567 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1.jar!/logback.xml]  
 [cobertura-instrument] 21:55:08,601 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@4fce7ceb - URL [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml] is not of type file  
 [cobertura-instrument] 21:55:08,699 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set  
 [cobertura-instrument] 21:55:08,704 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]  
 [cobertura-instrument] 21:55:08,716 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]  
 [cobertura-instrument] 21:55:08,813 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property  
 [cobertura-instrument] 21:55:08,897 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [net.sourceforge.cobertura] to INFO  
 [cobertura-instrument] 21:55:08,897 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to DEBUG  
 [cobertura-instrument] 21:55:08,897 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]  
 [cobertura-instrument] 21:55:08,898 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.  
 [cobertura-instrument] 21:55:08,899 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@7d6b513b - Registering current configuration as safe fallback point  
 [cobertura-instrument]   
 [cobertura-instrument] 21:55:09,216 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]  
 [cobertura-instrument] 21:55:09,217 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]  
 [cobertura-instrument] 21:55:09,217 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-instrument] 21:55:09,218 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.  
 [cobertura-instrument] 21:55:09,218 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-instrument] 21:55:09,218 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1.jar!/logback.xml]  
 [cobertura-instrument] 21:55:09,243 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@45a5049a - URL [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml] is not of type file  
 [cobertura-instrument] 21:55:09,310 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set  
 [cobertura-instrument] 21:55:09,315 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]  
 [cobertura-instrument] 21:55:09,325 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]  
 [cobertura-instrument] 21:55:09,354 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property  
 [cobertura-instrument] 21:55:09,402 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [net.sourceforge.cobertura] to INFO  
 [cobertura-instrument] 21:55:09,402 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to DEBUG  
 [cobertura-instrument] 21:55:09,402 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]  
 [cobertura-instrument] 21:55:09,403 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.  
 [cobertura-instrument] 21:55:09,405 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@54d13e2e - Registering current configuration as safe fallback point  
 [cobertura-instrument]   
 [cobertura-instrument] Cobertura 2.1.1 - GNU GPL License (NO WARRANTY) - See COPYRIGHT file  
 [cobertura-instrument] [INFO] Cobertura: Saved information on 1 classes.  
 [cobertura-instrument] [INFO] Cobertura: Saved information on 1 classes.  
 test:  
   [junit] [INFO] Cobertura: Loaded information on 1 classes.  
   [junit] [INFO] Cobertura: Saved information on 1 classes.  
 [junitreport] Processing /home/user/Desktop/cobertura-2.1.1/examples/basic/reports/junit-xml/TESTS-TestSuites.xml to /tmp/null1467716178  
 [junitreport] Loading stylesheet jar:file:/usr/share/ant/lib/ant-junit.jar!/org/apache/tools/ant/taskdefs/optional/junit/xsl/junit-frames.xsl  
 [junitreport] Transform time: 1272ms  
 [junitreport] Deleting: /tmp/null1467716178  
 coverage-report:  
 [cobertura-report] 21:55:13,533 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]  
 [cobertura-report] 21:55:13,533 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]  
 [cobertura-report] 21:55:13,533 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-report] 21:55:13,535 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.  
 [cobertura-report] 21:55:13,535 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-report] 21:55:13,535 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1.jar!/logback.xml]  
 [cobertura-report] 21:55:13,561 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@6e038230 - URL [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml] is not of type file  
 [cobertura-report] 21:55:13,636 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set  
 [cobertura-report] 21:55:13,643 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]  
 [cobertura-report] 21:55:13,653 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]  
 [cobertura-report] 21:55:13,684 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property  
 [cobertura-report] 21:55:13,748 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [net.sourceforge.cobertura] to INFO  
 [cobertura-report] 21:55:13,748 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to DEBUG  
 [cobertura-report] 21:55:13,748 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]  
 [cobertura-report] 21:55:13,749 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.  
 [cobertura-report] 21:55:13,751 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@285855bd - Registering current configuration as safe fallback point  
 [cobertura-report]   
 [cobertura-report] Cobertura 2.1.1 - GNU GPL License (NO WARRANTY) - See COPYRIGHT file  
 [cobertura-report] [INFO] Cobertura: Loaded information on 1 classes.  
 [cobertura-report] Report time: 159ms  
 summary-coverage-report:  
 [cobertura-report] 21:55:14,128 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]  
 [cobertura-report] 21:55:14,129 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]  
 [cobertura-report] 21:55:14,129 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-report] 21:55:14,131 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.  
 [cobertura-report] 21:55:14,131 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-report] 21:55:14,131 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1.jar!/logback.xml]  
 [cobertura-report] 21:55:14,161 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@52633079 - URL [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml] is not of type file  
 [cobertura-report] 21:55:14,234 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set  
 [cobertura-report] 21:55:14,239 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]  
 [cobertura-report] 21:55:14,250 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]  
 [cobertura-report] 21:55:14,281 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property  
 [cobertura-report] 21:55:14,334 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [net.sourceforge.cobertura] to INFO  
 [cobertura-report] 21:55:14,335 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to DEBUG  
 [cobertura-report] 21:55:14,335 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]  
 [cobertura-report] 21:55:14,336 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.  
 [cobertura-report] 21:55:14,338 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@6e038230 - Registering current configuration as safe fallback point  
 [cobertura-report]   
 [cobertura-report] Cobertura 2.1.1 - GNU GPL License (NO WARRANTY) - See COPYRIGHT file  
 [cobertura-report] [INFO] Cobertura: Loaded information on 1 classes.  
 [cobertura-report] Report time: 124ms  
 alternate-coverage-report:  
 [cobertura-report] 21:55:14,694 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback.groovy]  
 [cobertura-report] 21:55:14,694 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could NOT find resource [logback-test.xml]  
 [cobertura-report] 21:55:14,694 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found resource [logback.xml] at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-report] 21:55:14,695 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs multiple times on the classpath.  
 [cobertura-report] 21:55:14,695 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml]  
 [cobertura-report] 21:55:14,695 |-WARN in ch.qos.logback.classic.LoggerContext[default] - Resource [logback.xml] occurs at [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1.jar!/logback.xml]  
 [cobertura-report] 21:55:14,727 |-INFO in ch.qos.logback.core.joran.spi.ConfigurationWatchList@5abce07 - URL [jar:file:/home/user/Desktop/cobertura-2.1.1/cobertura-2.1.1-sources.jar!/logback.xml] is not of type file  
 [cobertura-report] 21:55:14,814 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not set  
 [cobertura-report] 21:55:14,821 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]  
 [cobertura-report] 21:55:14,832 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [STDOUT]  
 [cobertura-report] 21:55:14,874 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property  
 [cobertura-report] 21:55:14,934 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [net.sourceforge.cobertura] to INFO  
 [cobertura-report] 21:55:14,934 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to DEBUG  
 [cobertura-report] 21:55:14,935 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [STDOUT] to Logger[ROOT]  
 [cobertura-report] 21:55:14,935 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.  
 [cobertura-report] 21:55:14,937 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@52633079 - Registering current configuration as safe fallback point  
 [cobertura-report]   
 [cobertura-report] Cobertura 2.1.1 - GNU GPL License (NO WARRANTY) - See COPYRIGHT file  
 [cobertura-report] [INFO] Cobertura: Loaded information on 1 classes.  
 [cobertura-report] Report time: 171ms  
 coverage:  
 BUILD SUCCESSFUL  
 Total time: 9 seconds  

From the output, we see that a target called instrument ran, then test, and then all the reports were generated. If you list the directory now, you should see a few additions: cobertura.ser, reports, classes and instrumented. All the generated reports are in the reports directory. The ones relevant to this article are the Cobertura reports, and I will show you the cobertura-html report.



That's it. If you want to go further and integrate this code coverage tool into your project, start by looking into the example build.xml. Have fun.