Recently (actually yesterday) I attended an event organized by Amazon known as AWEsome Day in Kuala Lumpur, Malaysia. From the agenda, it seemed to me that Amazon would focus more on development and technical topics in this one-day event, so I registered, particularly interested in Amazon's NoSQL offerings and in how far Malaysia has adopted big data technologies.
For a start, I did not expect that many people to turn out. I have been to many seminars, webinars and forum discussions, and I seldom notice that many Malaysian participants. That got me excited when I arrived.
People were queuing up for their turn to register and collect their badges. As you can see above, food and beverages were everywhere in the lobby, and throughout the event food and coffee were served as if you could eat and drink as much as you wanted. It was a pity that when I asked for the goodie bag containing the Amazon manual, it had run out, but the helpful staff said they could send me a soft copy.
The event started around 9am, and we entered the grand ballroom...
My badge number was 937, so I suppose there were around one thousand attendees! The pictures speak for themselves. A show of hands taken by the speaker showed that many developers, system admins, software engineers and others came to this event.
The first half of the event was boring, perhaps because I expected the talks to focus more on technical matters than on business. The first half was mostly about selling Amazon Web Services and convincing people to come on board with its attractive pricing. The speaker explained how the I.T. world is changing, how Amazon fills that role, and its pricing. All the marketing jargon and terminology to impress. :-) if you know what I mean.
The second half of the event was more technical. I wish the speakers could have elaborated much longer, but due to time constraints it was brief.
Topics such as SSL termination at the Amazon load balancer and Amazon's NoSQL technologies, such as DynamoDB and Amazon ElastiCache, caught my attention. I was pretty surprised that, when the speaker did another show of hands on how many attendees knew about NoSQL, almost nobody raised a hand. So I would say more big data jobs are coming to Malaysia, and Malaysia is still very young in this technology. The speaker also mentioned using metrics from CloudWatch to auto scale by provisioning new servers into the cloud, which is something that attracted me. Being a daily DevOps and software engineer, I am interested in which metrics, and at what thresholds, should result in a new server being provisioned into the cloud server group.
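To make that concrete for myself, here is a rough sketch of wiring a CloudWatch CPU alarm to a simple scale-out policy with the AWS CLI. This is my own example, not something shown at the event; the group name, threshold and periods are placeholders.
# create a simple scale-out policy on an existing auto scaling group (names are placeholders)
aws autoscaling put-scaling-policy --auto-scaling-group-name my-asg --policy-name scale-out-on-cpu --adjustment-type ChangeInCapacity --scaling-adjustment 1
# alarm when the group's average CPU stays above 70% for two 5-minute periods,
# triggering the policy ARN returned by the command above
aws cloudwatch put-metric-alarm --alarm-name my-asg-cpu-high --namespace AWS/EC2 --metric-name CPUUtilization --dimensions Name=AutoScalingGroupName,Value=my-asg --statistic Average --period 300 --evaluation-periods 2 --threshold 70 --comparison-operator GreaterThanThreshold --alarm-actions <policy-arn-from-the-first-command>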
Last but not least, a lucky draw was held and certificates of attendance were handed out to all the attendees. I wish there were more technical seminars like this, or even pure developer seminars focusing on a single topic. It would be great if seminars on NoSQL technologies like Cassandra, Elasticsearch, and Hadoop could happen in Malaysia in the near future.
Friday, September 11, 2015
Sunday, August 30, 2015
First learning into Cloudera Impala
Let's take a look at a vendor big data technology today. In this article, we will look into Cloudera Impala. So what is Impala all about?
Wikipedia definition:
Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.[1]
and from the official GitHub repository definition:
Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters.
Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:
Let us download a virtual machine image. This is convenient because Impala works in integration with Hadoop, and if you don't have Hadoop knowledge you would have to establish a Hadoop cluster first before integrating it with Impala. With a virtual machine image, it is as easy as importing the image into the host and powering it up. It also saves you setup time and reduces errors.
With that said, I'm downloading a VirtualBox image. Once downloaded, extract it to a directory. If you have not installed VirtualBox yet, you should install it now, for example with apt-get install virtualbox virtualbox-guest-additions-iso, and make sure the VirtualBox service is running.
root@localhost:~# /etc/init.d/virtualbox status
● virtualbox.service - LSB: VirtualBox Linux kernel module
Loaded: loaded (/etc/init.d/virtualbox)
Active: active (exited) since Thu 2015-08-20 17:07:43 MYT; 2min 36s ago
Docs: man:systemd-sysv-generator(8)
Process: 29390 ExecStop=/etc/init.d/virtualbox stop (code=exited, status=0/SUCCESS)
Process: 29425 ExecStart=/etc/init.d/virtualbox start (code=exited, status=0/SUCCESS)
Aug 20 17:07:43 localhost systemd[1]: Starting LSB: VirtualBox Linux kernel module...
Aug 20 17:07:43 localhost systemd[1]: Started LSB: VirtualBox Linux kernel module.
Aug 20 17:07:43 localhost virtualbox[29425]: Starting VirtualBox kernel modules.
Launch VirtualBox and add that virtual machine image as a new instance; see the screenshot below.
Now power this virtual machine up! Please be patient, as it will take a long time to boot, at least on my PC; you might want to get a drink in the meantime. The rest of this article follows this tutorial. However, I gave up, as the select statement took a long time and was very slow in the virtual environment, at least for me. But I will illustrate up to the point where it became slow.
First you need to copy these CSV files (tab1.csv and tab2.csv) into the virtual machine.
Then you can run the script with the SQL to create the tables and load the CSV data into them. However, the example given in the tutorial does not create a database, so I suggest you add these two lines to the top of the script before loading it.
create database testdb;
use testdb;
DROP TABLE IF EXISTS tab1;
-- The EXTERNAL clause means the data is located outside the central location
-- for Impala data files and is preserved when the associated Impala table is dropped.
-- We expect the data to already exist.
After that, you can issue the command impala-shell and run SQL queries, but as you can see, the select statement just hung there forever.
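For reference, what I was running were simple selects over the tutorial's two tables, something along these lines. The column name id is my assumption about the sample schema, not a copy of the tutorial's script, so adjust it to whatever your CREATE TABLE statements actually define.
use testdb;
-- a quick sanity check first
select count(*) from tab1;
-- then a simple join between the two sample tables
select t1.id from tab1 t1 join tab2 t2 on t1.id = t2.id;
It was at the first select that the shell simply sat there with no result for me.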
Not a good experience, but if Impala is what you need, find out what the problem is and let me know. :-)
Saturday, August 29, 2015
First time learning Apache HBase
Today we will take a look at another big data technology. Apache HBase is the topic for today, and before we dip our toes into it, let's find out what Apache HBase actually is.
Apache HBase [1] is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al.[2] Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Apache Hadoop [3].
In this article, we will set up a single node for this adventure. Before we begin, let's download a copy of Apache HBase here. Once downloaded, extract the compressed content. At the time of this writing, I'm using Apache HBase version 1.1.1 for this learning experience.
user@localhost:~/Desktop/hbase-1.1.1$ ls
bin CHANGES.txt conf docs hbase-webapps lib LICENSE.txt NOTICE.txt README.txt
If you have not installed Java, go ahead and install it. Pick a recent Java, at least Java 7. Make sure the terminal reports the correct version of Java. An example would be as follows:
user@localhost:~/Desktop/hbase-1.1.1$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)
If you cannot change the system configuration for this Java, then in the HBase environment file, conf/hbase-env.sh, uncomment the JAVA_HOME variable and set it to the Java that you installed. The main configuration file for HBase is conf/hbase-site.xml, and we will edit it next; change it to suit your environment as required.
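For reference, the uncommented JAVA_HOME line in conf/hbase-env.sh would look something like the following; the path is only an example, so point it at wherever your own JDK lives.
# conf/hbase-env.sh (example path, adjust to your installation)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
With that done, here is the hbase-site.xml I ended up with.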
user@localhost:~/Desktop/hbase-1.1.1$ cat conf/hbase-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
/**
*
* Licensed to the Apache Software Foundation (ASF) under one
* or more contributor license agreements. See the NOTICE file
* distributed with this work for additional information
* regarding copyright ownership. The ASF licenses this file
* to you under the Apache License, Version 2.0 (the
* "License"); you may not use this file except in compliance
* with the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
-->
<configuration>
<property>
<name>hbase.rootdir</name>
<value>file:///home/user/Desktop/hbase-1.1.1</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/home/user/zookeeper</value>
</property>
</configuration>
Okay, we are ready to start HBase. Start it with the helpful script bin/start-hbase.sh.
user@localhost:~/Desktop/hbase-1.1.1$ bin/start-hbase.sh
starting master, logging to /home/user/Desktop/hbase-1.1.1/bin/../logs/hbase-user-master-localhost.out
user@localhost:~/Desktop/hbase-1.1.1/logs$ tail -F hbase-user-master-localhost.out SecurityAuth.audit hbase-user-master-localhost.log
==> hbase-user-master-localhost.out <==
==> SecurityAuth.audit <==
2015-08-18 17:49:41,533 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 127.0.1.1 port: 36745 with version info: version: "1.1.1" url: "git://hw11397.local/Volumes/hbase-1.1.1RC0/hbase" revision: "d0a115a7267f54e01c72c603ec53e91ec418292f" user: "ndimiduk" date: "Tue Jun 23 14:44:07 PDT 2015" src_checksum: "6e2d8cecbd28738ad86daacb25dc467e"
2015-08-18 17:49:46,812 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 127.0.0.1 port: 53042 with version info: version: "1.1.1" url: "git://hw11397.local/Volumes/hbase-1.1.1RC0/hbase" revision: "d0a115a7267f54e01c72c603ec53e91ec418292f" user: "ndimiduk" date: "Tue Jun 23 14:44:07 PDT 2015" src_checksum: "6e2d8cecbd28738ad86daacb25dc467e"
2015-08-18 17:49:48,309 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 127.0.0.1 port: 53043 with version info: version: "1.1.1" url: "git://hw11397.local/Volumes/hbase-1.1.1RC0/hbase" revision: "d0a115a7267f54e01c72c603ec53e91ec418292f" user: "ndimiduk" date: "Tue Jun 23 14:44:07 PDT 2015" src_checksum: "6e2d8cecbd28738ad86daacb25dc467e"
2015-08-18 17:49:49,317 INFO SecurityLogger.org.apache.hadoop.hbase.Server: Connection from 127.0.0.1 port: 53044 with version info: version: "1.1.1" url: "git://hw11397.local/Volumes/hbase-1.1.1RC0/hbase" revision: "d0a115a7267f54e01c72c603ec53e91ec418292f" user: "ndimiduk" date: "Tue Jun 23 14:44:07 PDT 2015" src_checksum: "6e2d8cecbd28738ad86daacb25dc467e"
==> hbase-user-master-localhost.log <==
2015-08-18 17:49:49,281 INFO [StoreOpener-78a2a3664205fcf679d2043ac3259648-1] hfile.CacheConfig: blockCache=LruBlockCache{blockCount=0, currentSize=831688, freeSize=808983544, maxSize=809815232, heapSize=831688, minSize=769324480, minFactor=0.95, multiSize=384662240, multiFactor=0.5, singleSize=192331120, singleFactor=0.25}, cacheDataOnRead=true, cacheDataOnWrite=false, cacheIndexesOnWrite=false, cacheBloomsOnWrite=false, cacheEvictOnClose=false, cacheDataCompressed=false, prefetchOnOpen=false
2015-08-18 17:49:49,282 INFO [StoreOpener-78a2a3664205fcf679d2043ac3259648-1] compactions.CompactionConfiguration: size [134217728, 9223372036854775807); files [3, 10); ratio 1.200000; off-peak ratio 5.000000; throttle point 2684354560; major period 604800000, major jitter 0.500000, min locality to compact 0.000000
2015-08-18 17:49:49,295 INFO [RS_OPEN_REGION-localhost:60631-0] regionserver.HRegion: Onlined 78a2a3664205fcf679d2043ac3259648; next sequenceid=2
2015-08-18 17:49:49,303 INFO [PostOpenDeployTasks:78a2a3664205fcf679d2043ac3259648] regionserver.HRegionServer: Post open deploy tasks for hbase:namespace,,1439891388424.78a2a3664205fcf679d2043ac3259648.
2015-08-18 17:49:49,322 INFO [PostOpenDeployTasks:78a2a3664205fcf679d2043ac3259648] hbase.MetaTableAccessor: Updated row hbase:namespace,,1439891388424.78a2a3664205fcf679d2043ac3259648. with server=localhost,60631,1439891378840
2015-08-18 17:49:49,332 INFO [AM.ZK.Worker-pool3-t6] master.RegionStates: Transition {78a2a3664205fcf679d2043ac3259648 state=OPENING, ts=1439891389276, server=localhost,60631,1439891378840} to {78a2a3664205fcf679d2043ac3259648 state=OPEN, ts=1439891389332, server=localhost,60631,1439891378840}
2015-08-18 17:49:49,603 INFO [ProcessThread(sid:0 cport:-1):] server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14f4036b87d0000 type:create cxid:0x1d5 zxid:0x44 txntype:-1 reqpath:n/a Error Path:/hbase/namespace/default Error:KeeperErrorCode = NodeExists for /hbase/namespace/default
2015-08-18 17:49:49,625 INFO [ProcessThread(sid:0 cport:-1):] server.PrepRequestProcessor: Got user-level KeeperException when processing sessionid:0x14f4036b87d0000 type:create cxid:0x1d8 zxid:0x46 txntype:-1 reqpath:n/a Error Path:/hbase/namespace/hbase Error:KeeperErrorCode = NodeExists for /hbase/namespace/hbase
2015-08-18 17:49:49,639 INFO [localhost:51452.activeMasterManager] master.HMaster: Master has completed initialization
2015-08-18 17:49:49,642 INFO [localhost:51452.activeMasterManager] quotas.MasterQuotaManager: Quota support disabled
As you can see, log files are also available, and jps shows that an HMaster process is running.
user@localhost: $ jps
22144 Jps
21793 HMaster
Okay, let's experience Apache HBase using the HBase shell.
user@localhost:~/Desktop/hbase-1.1.1$ ./bin/hbase shell
2015-08-18 17:55:25,134 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.1.1, rd0a115a7267f54e01c72c603ec53e91ec418292f, Tue Jun 23 14:44:07 PDT 2015
hbase(main):001:0>
The help command shows a very helpful description, such as the following.
hbase(main):001:0> help
HBase Shell, version 1.1.1, rd0a115a7267f54e01c72c603ec53e91ec418292f, Tue Jun 23 14:44:07 PDT 2015
Type 'help "COMMAND"', (e.g. 'help "get"' -- the quotes are necessary) for help on a specific command.
Commands are grouped. Type 'help "COMMAND_GROUP"', (e.g. 'help "general"') for help on a command group.
COMMAND GROUPS:
Group name: general
Commands: status, table_help, version, whoami
Group name: ddl
Commands: alter, alter_async, alter_status, create, describe, disable, disable_all, drop, drop_all, enable, enable_all, exists, get_table, is_disabled, is_enabled, list, show_filters
Group name: namespace
Commands: alter_namespace, create_namespace, describe_namespace, drop_namespace, list_namespace, list_namespace_tables
Group name: dml
Commands: append, count, delete, deleteall, get, get_counter, get_splits, incr, put, scan, truncate, truncate_preserve
Group name: tools
Commands: assign, balance_switch, balancer, balancer_enabled, catalogjanitor_enabled, catalogjanitor_run, catalogjanitor_switch, close_region, compact, compact_rs, flush, major_compact, merge_region, move, split, trace, unassign, wal_roll, zk_dump
Group name: replication
Commands: add_peer, append_peer_tableCFs, disable_peer, disable_table_replication, enable_peer, enable_table_replication, list_peers, list_replicated_tables, remove_peer, remove_peer_tableCFs, set_peer_tableCFs, show_peer_tableCFs
Group name: snapshots
Commands: clone_snapshot, delete_all_snapshot, delete_snapshot, list_snapshots, restore_snapshot, snapshot
Group name: configuration
Commands: update_all_config, update_config
Group name: quotas
Commands: list_quotas, set_quota
Group name: security
Commands: grant, revoke, user_permission
Group name: visibility labels
Commands: add_labels, clear_auths, get_auths, list_labels, set_auths, set_visibility
SHELL USAGE:
Quote all names in HBase Shell such as table and column names. Commas delimit
command parameters. Type <RETURN> after entering a command to run it.
Dictionaries of configuration used in the creation and alteration of tables are
Ruby Hashes. They look like this:
{'key1' => 'value1', 'key2' => 'value2', ...}
and are opened and closed with curley-braces. Key/values are delimited by the
'=>' character combination. Usually keys are predefined constants such as
NAME, VERSIONS, COMPRESSION, etc. Constants do not need to be quoted. Type
'Object.constants' to see a (messy) list of all constants in the environment.
If you are using binary keys or values and need to enter them in the shell, use
double-quote'd hexadecimal representation. For example:
hbase> get 't1', "key\x03\x3f\xcd"
hbase> get 't1', "key\003\023\011"
hbase> put 't1', "test\xef\xff", 'f1:', "\x01\x33\x40"
The HBase shell is the (J)Ruby IRB with the above HBase-specific commands added.
For more on the HBase Shell, see http://hbase.apache.org/book.html
hbase(main):002:0>
To create a table with a column family:
hbase(main):002:0> create 'test', 'cf'
0 row(s) in 1.5700 seconds
=> Hbase::Table - test
hbase(main):003:0>
List information about a table:
hbase(main):001:0> list 'test'
TABLE
test
1 row(s) in 0.3530 seconds
=> ["test"]
Let's put something into the table we have just created.
hbase(main):002:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.2280 seconds
hbase(main):003:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0140 seconds
hbase(main):004:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0060 seconds
hbase(main):005:0>
Here, we insert three values, one at a time. The first insert is at row1, column cf:a, with a value of value1. Columns in HBase are comprised of a column family prefix, cf in this example, followed by a colon and then a column qualifier suffix, a in this case.
To select the rows from the table, use scan.
hbase(main):005:0> scan 'test'
ROW COLUMN+CELL
row1 column=cf:a, timestamp=1439892359305, value=value1
row2 column=cf:b, timestamp=1439892363921, value=value2
row3 column=cf:c, timestamp=1439892369775, value=value3
3 row(s) in 0.0420 seconds
hbase(main):006:0>
To get a single row only:
hbase(main):006:0> get 'test', 'row1'
COLUMN CELL
cf:a timestamp=1439892359305, value=value1
1 row(s) in 0.0340 seconds
hbase(main):007:0>
Something really interesting about Apache HBase: if you want to delete a table or change its settings, you need to disable it first. After that, you can enable it back.
hbase(main):007:0> disable 'test'
0 row(s) in 2.3610 seconds
hbase(main):008:0> enable 'test'
0 row(s) in 1.2790 seconds
hbase(main):009:0>
Okay, now let's delete this table.
hbase(main):009:0> drop 'test'
ERROR: Table test is enabled. Disable it first.
Here is some help for this command:
Drop the named table. Table must first be disabled:
hbase> drop 't1'
hbase> drop 'ns1:t1'
hbase(main):010:0> disable 'test'
0 row(s) in 2.2640 seconds
hbase(main):011:0> drop 'test'
0 row(s) in 1.2800 seconds
hbase(main):012:0>
Okay, we are done for this basic learning. Let's quit for now.
hbase(main):012:0> quit
user@localhost:~/Desktop/hbase-1.1.1$
To stop the Apache HBase instance:
user@localhost:~/Desktop/hbase-1.1.1$ ./bin/stop-hbase.sh
stopping hbase.................
user@localhost:~/Desktop/hbase-1.1.1$ jps
23399 Jps
5445 org.eclipse.equinox.launcher_1.3.0.v20140415-2008.jar
If you, like me, came from Apache Cassandra, Apache HBase looks very similar. If this interests you, I shall leave you with the following three links, which will take you further.
http://hbase.apache.org/book.html
http://wiki.apache.org/hadoop/Hbase
https://blogs.apache.org/hbase/
Friday, August 28, 2015
First light learning into Apache Storm part 1
Today we will go through another piece of software, Apache Storm. According to the official Apache Storm GitHub:
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.
Well, if you are like me and new to Apache Storm, this seems a bit vague about what Apache Storm actually is. Fear not; in this article we will go through some Apache Storm basics, such as installing Storm, setting up a Storm cluster and performing a Storm hello world. There is also a good video that gives an introduction to Apache Storm.
If you study Storm, the three fundamental terms you will come across are spouts, bolts and topologies. These definitions are excerpted from this site.
There are just three abstractions in Storm: spouts, bolts, and topologies. A spout is a source of streams in a computation. Typically a spout reads from a queueing broker such as Kestrel, RabbitMQ, or Kafka, but a spout can also generate its own stream or read from somewhere like the Twitter streaming API. Spout implementations already exist for most queueing systems.
A bolt processes any number of input streams and produces any number of new output streams. Most of the logic of a computation goes into bolts, such as functions, filters, streaming joins, streaming aggregations, talking to databases, and so on.
A topology is a network of spouts and bolts, with each edge in the network representing a bolt subscribing to the output stream of some other spout or bolt. A topology is an arbitrarily complex multi-stage stream computation. Topologies run indefinitely when deployed.
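To make the spout, bolt and topology trio concrete, here is a minimal local-mode sketch against the Storm 0.9.x API. The class names and the hello/world strings are my own and not taken from any official example; it is only meant to show how the three pieces fit together.
import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;
import java.util.Map;

public class HelloTopology {

    // spout: emits the word "hello" once a second as a stream
    public static class HelloSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;

        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        public void nextTuple() {
            Utils.sleep(1000);
            collector.emit(new Values("hello"));
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // bolt: consumes the spout's stream and prints "hello world"
    public static class WorldBolt extends BaseBasicBolt {
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            System.out.println(tuple.getStringByField("word") + " world");
        }

        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // this bolt emits no further stream
        }
    }

    public static void main(String[] args) {
        // topology: wire the bolt to subscribe to the spout's output stream
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("hello-spout", new HelloSpout());
        builder.setBolt("world-bolt", new WorldBolt()).shuffleGrouping("hello-spout");

        // run in-process for ten seconds, then shut everything down
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("hello-topology", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.killTopology("hello-topology");
        cluster.shutdown();
    }
}
Running the main class locally prints a stream of "hello world" lines for about ten seconds before the local cluster shuts down.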
Let's first download and install Apache Storm. Pick a stable version here, download it and then extract it. By now, your directory should look similar to the one below. I'm using Apache Storm 0.9.5 for this learning experience.
user@localhost:~/Desktop/apache-storm-0.9.5$ ls
bin CHANGELOG.md conf DISCLAIMER examples external lib LICENSE logback NOTICE public README.markdown RELEASE SECURITY.md
user@localhost:~/Desktop/apache-storm-0.9.5$
In the next article, we will set up a Storm cluster.
Sunday, August 16, 2015
First time learning gradle
It is difficult to jump-start into software development if you are new to the many supporting technologies involved. Today, I'm going to put aside my project and start to learn another technology: Gradle, a build system, although it does much more than just build. If you are also new to Gradle, you might want to find out what Gradle actually is.
Gradle on Wikipedia:
Gradle is a build automation tool that builds upon the concepts of Apache Ant and Apache Maven and introduces a Groovy-based domain-specific language (DSL) instead of the more traditional XML form of declaring the project configuration. Gradle uses a directed acyclic graph ("DAG") to determine the order in which tasks can be run.
Gradle was designed for multi-project builds which can grow to be quite large, and supports incremental builds by intelligently determining which parts of the build tree are up-to-date, so that any task dependent upon those parts will not need to be re-executed.
If you have many projects that depend on another project, Gradle will solve your problems. We will look into the basics of the Gradle build automation tool today. I love to code in Java, so I will use Java for this demo. First, let's install Gradle. If you are using a deb-based distribution like Debian or Ubuntu, installing Gradle is as easy as $ sudo apt-get install gradle. Otherwise, you can download Gradle from http://gradle.org/ and install it on your system. Now let's create a Gradle build file. See below.
user@localhost:~/gradle$ cat build.gradle
apply plugin: 'java'
user@localhost:~/gradle$ ls -a
total 36K
-rw-r--r-- 1 user user 21 Aug 6 17:15 build.gradle
drwxr-xr-x 214 user user 28K Aug 6 17:15 ..
drwxr-xr-x 2 user user 4.0K Aug 6 17:15 .
user@localhost:~/gradle$ gradle build
:compileJava UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:jar
:assemble
:compileTestJava UP-TO-DATE
:processTestResources UP-TO-DATE
:testClasses UP-TO-DATE
:test
:check
:build
BUILD SUCCESSFUL
Total time: 13.304 secs
user@localhost:~/gradle$ ls -a
total 44K
-rw-r--r-- 1 user user 21 Aug 6 17:15 build.gradle
drwxr-xr-x 214 user user 28K Aug 6 17:15 ..
drwxr-xr-x 3 user user 4.0K Aug 6 17:15 .gradle
drwxr-xr-x 4 user user 4.0K Aug 6 17:15 .
drwxr-xr-x 6 user user 4.0K Aug 6 17:15 build
user@localhost:~/gradle$ find .gradle/
.gradle/
.gradle/1.5
.gradle/1.5/taskArtifacts
.gradle/1.5/taskArtifacts/fileHashes.bin
.gradle/1.5/taskArtifacts/taskArtifacts.bin
.gradle/1.5/taskArtifacts/fileSnapshots.bin
.gradle/1.5/taskArtifacts/outputFileStates.bin
.gradle/1.5/taskArtifacts/cache.properties.lock
.gradle/1.5/taskArtifacts/cache.properties
user@localhost:~/gradle$ find build
build
build/libs
build/libs/gradle.jar
build/test-results
build/test-results/binary
build/test-results/binary/test
build/test-results/binary/test/results.bin
build/reports
build/reports/tests
build/reports/tests/report.js
build/reports/tests/index.html
build/reports/tests/base-style.css
build/reports/tests/style.css
build/tmp
build/tmp/jar
build/tmp/jar/MANIFEST.MF
One line of input produces so many output files. Amazing! Why were so many files generated? Read the output of the command: it compiles, processes resources, jars, assembles, tests, checks and builds. What do all of these mean? I will not explain them one by one; you learn better if you read the definitions yourself, and they are documented very well here. You might say, hey, I have a different Java source path, can Gradle handle this? Yes, of course! In the build file you created, you can add another line.
// set the source java folder to another non maven standard path
sourceSets.main.java.srcDirs = ['src/java']
Most of us coming from Java have an Ant build file. If that is the case, Gradle integrates nicely with Ant too; you just need to import the Ant build file and then call the Ant target from Gradle. See the code snippet below.
user@localhost:~/gradle$ cat build.xml
<project>
<target name="helloAnt">
<echo message="hello this is ant."/>
</target>
</project>
user@localhost:~/gradle$ cat build.gradle
apply plugin: 'java'
// set the source java folder to another non maven standard path
sourceSets.main.java.srcDirs = ['src/java']
// import ant build file.
ant.importBuild 'build.xml'
user@localhost:~/gradle$ gradle helloAnt
:helloAnt
[ant:echo] hello this is ant.
BUILD SUCCESSFUL
Total time: 5.573 secs
That looks pretty good! If you are curious about which Gradle parameters you can use while figuring out why a build went wrong, you should really read this link. Also read up on the environment variables, as you can specify a different JDK for Gradle, or even Java parameters for compiling big projects.
You might also want to ask: what if I only want to compile, without going through all the automatic build steps above? No problem; since this is a Java project, you specify compileJava.
user@localhost:~/gradle$ gradle compileJava
:compileJava UP-TO-DATE
BUILD SUCCESSFUL
Total time: 4.976 secs
As you can see, Gradle is very flexible, and because of that you might want to exploit it further, for example by customizing tasks in build.gradle, listing projects, listing tasks and so on. For that, read here, as it explains and gives a lot of examples of how all that can be done. So at this stage, you might want to add more features to the Gradle build file. Okay, let's do just that.
user@localhost:~/gradle$ cat build.gradle
apply plugin: 'java'
apply plugin: 'eclipse'
// set the source java folder to another non maven standard path
// default src/main/java
sourceSets.main.java.srcDirs = ['src/java']
// default src test
//src/test/java
// default src resources.
// src/main/resources
// default src test resources.
// src/test/resources
// default build
// build
// default jar built
// build/libs
// dependencies of external jar, we reference the very good from maven.
repositories {
mavenCentral()
}
// actual libs dependencies
dependencies {
compile group: 'commons-collections', name: 'commons-collections', version: '3.2'
testCompile group: 'junit', name: 'junit', version: '4.+'
}
test {
testLogging {
// Show that tests are run in the command-line output
events 'started', 'passed'
}
}
sourceCompatibility = 1.5
version = '1.0'
jar {
manifest {
attributes 'Implementation-Title': 'Gradle Quickstart',
'Implementation-Version': version
}
}
// import ant build file.
ant.importBuild 'build.xml'
// common for subprojects
subprojects {
apply plugin: 'java'
repositories {
mavenCentral()
}
dependencies {
testCompile 'junit:junit:4.12'
}
version = '1.0'
jar {
manifest.attributes provider: 'gradle'
}
}
user@localhost:~/gradle$ cat settings.gradle
include ":nativeapp",":webapp"
Now, if you want to generate an Eclipse configuration, just run gradle eclipse, and all the Eclipse configuration and settings are created automatically. Of course, you can customize the settings even further.
user@localhost:~/gradle$ gradle eclipse
:eclipseClasspath
Download http://repo1.maven.org/maven2/junit/junit/4.12/junit-4.12.pom
Download http://repo1.maven.org/maven2/junit/junit/4.12/junit-4.12-sources.jar
Download http://repo1.maven.org/maven2/org/hamcrest/hamcrest-core/1.3/hamcrest-core-1.3-sources.jar
Download http://repo1.maven.org/maven2/junit/junit/4.12/junit-4.12.jar
:eclipseJdt
:eclipseProject
:eclipse
BUILD SUCCESSFUL
Total time: 19.497 secs
user@localhost:~/gradle$ find .
.
.
./build.xml
./build
./build/classes
./build/classes/test
./build/classes/test/org
./build/classes/test/org/just4fun
./build/classes/test/org/just4fun/voc
./build/classes/test/org/just4fun/voc/file
./build/classes/test/org/just4fun/voc/file/QuickTest.class
./build/libs
./build/libs/gradle.jar
./build/libs/gradle-1.0.jar
./build/test-results
./build/test-results/binary
./build/test-results/binary/test
./build/test-results/binary/test/results.bin
./build/test-results/TEST-org.just4fun.voc.file.QuickTest.xml
./build/reports
./build/reports/tests
./build/reports/tests/report.js
./build/reports/tests/index.html
./build/reports/tests/org.just4fun.voc.file.html
./build/reports/tests/base-style.css
./build/reports/tests/org.just4fun.voc.file.QuickTest.html
./build/reports/tests/style.css
./build/dependency-cache
./build/tmp
./build/tmp/jar
./build/tmp/jar/MANIFEST.MF
./webapp
./webapp/build.gradle
./.gradle
./.gradle/1.5
./.gradle/1.5/taskArtifacts
./.gradle/1.5/taskArtifacts/fileHashes.bin
./.gradle/1.5/taskArtifacts/taskArtifacts.bin
./.gradle/1.5/taskArtifacts/fileSnapshots.bin
./.gradle/1.5/taskArtifacts/outputFileStates.bin
./.gradle/1.5/taskArtifacts/cache.properties.lock
./.gradle/1.5/taskArtifacts/cache.properties
./.classpath
./build.gradle
./.project
./.settings
./.settings/org.eclipse.jdt.core.prefs
./settings.gradle
./nativeapp
./nativeapp/build.gradle
./src
./src/test
./src/test/java
./src/test/java/org
./src/test/java/org/just4fun
./src/test/java/org/just4fun/voc
./src/test/java/org/just4fun/voc/file
./src/test/java/org/just4fun/voc/file/QuickTest.java
Now I create a simple unit test class file, see below, and then run only that single unit test. That's very cool.
user@localhost:~/gradle$ find src/
src/
src/test
src/test/java
src/test/java/org
src/test/java/org/just4fun
src/test/java/org/just4fun/voc
src/test/java/org/just4fun/voc/file
src/test/java/org/just4fun/voc/file/QuickTest.java
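The listing above shows where QuickTest.java lives, but its contents are not shown in the post. A minimal JUnit 4 test class along these lines would produce the STARTED/PASSED output below; the package matches the path above, while the assertion itself is just a placeholder of mine.
package org.just4fun.voc.file;

import static org.junit.Assert.assertEquals;
import org.junit.Test;

public class QuickTest {

    // a trivial test so that "gradle -Dtest.single=Quick test" has something to run
    @Test
    public void test() {
        assertEquals(4, 2 + 2);
    }
}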
$ gradle -Dtest.single=Quick test
:compileJava UP-TO-DATE
:processResources UP-TO-DATE
:classes UP-TO-DATE
:compileTestJava
warning: [options] bootstrap class path not set in conjunction with -source 1.5
1 warning
:processTestResources UP-TO-DATE
:testClasses
:test
org.just4fun.voc.file.QuickTest > test STARTED
org.just4fun.voc.file.QuickTest > test PASSED
BUILD SUCCESSFUL
Total time: 55.81 secs
user@localhost:~/gradle $
There are two additional directories created, nativeapp and webapp; these are subprojects of this bigger project, and each contains its own Gradle build file. In the parent Gradle build file, we see a subprojects configuration block, as this is applied to all the subprojects. You can create a settings.gradle to specify the subprojects.
That's all for today. This is just an introduction to quickly dive into some of the cool features of Gradle; with this shown, I hope it gives you some idea of where to head next. Good luck!
Saturday, August 15, 2015
First learning Node.js
We will learn another piece of software today: Node.js, another term that I came across many times when reading information technology articles. First, let's take a look at what Node.js is. From the official site:
Node.js® is a platform built on Chrome's JavaScript runtime for easily building fast, scalable network applications. Node.js uses an event-driven, non-blocking I/O model that makes it lightweight and efficient, perfect for data-intensive real-time applications that run across distributed devices.
It is hard to understand exactly what Node.js is from those two sentences, but as you continue to read this article, you will get some idea. If you have basic JavaScript coding experience, you might think of JavaScript as just scripts that run in the browser to enhance the user experience. But as JavaScript evolved, Node.js grew into a platform where you can write server applications! We will see that in a moment.
Okay, let's install Node.js. If you are using a deb-based Linux distribution, for example Debian or Ubuntu, it is as easy as $ sudo apt-get install nodejs. Otherwise, you can download a copy from the official site and install it.
Let's start with a simple Node.js hello world. Very easy: create a helloworld.js and print a message. See below.
user@localhost:~/nodejs$ cat helloworld.js
console.log("Hello World");
user@localhost:~/nodejs$ nodejs helloworld.js
Hello World
user@localhost:~/nodejs$
Very simple: a one-liner produces the hello world output. You might ask, what Node.js functionality can I use other than console? Well, at the end of this article I will give you a link so you can explore further. But in the meantime, I will show you how easy it is to create a web server using Node.js! Let's read the code below.
user@localhost:~/nodejs$ cat server.js
var http = require("http");
http.createServer(function(request, response) {
response.writeHead(200, {"Content-Type": "text/plain"});
response.write("Hello World");
response.end();
}).listen(8888);
console.log("create a webserver at port 8888");
user@localhost:~/nodejs$ nodejs server.js
create a webserver at port 8888
As you can read, we create a file called server.js which requires a module called http. We pass an anonymous function into the createServer function of the http module. The response returns HTTP status 200 with a hello world body. You can try to access it in your browser at localhost:8888. Notice that Node.js execution continues after the server is created; unlike languages that wait for one call to finish before proceeding to the next line of code, Node.js carries on, and this is what makes Node.js asynchronous.
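You can also hit the server from a second terminal. The session below is hypothetical, but the response follows directly from the code above, which writes "Hello World" with a text/plain content type.
user@localhost:~/nodejs$ curl http://localhost:8888
Hello World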
Well, by now you should understand what Node.js can do for you, and if you are interested in learning more about Node.js, I will leave you with this very helpful link.
Friday, August 14, 2015
Light learning apache spark
A while back, I was reading articles and many of them referenced Spark, so this week, hey, why not check out what Spark actually is? Googling spark returned many results, and we are particularly interested in Apache Spark. Let us take a look today at Apache Spark and what it is all about. From the official Spark GitHub:
Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, and Python, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.
Okay, let's download a copy of Apache Spark to your local PC. You can download it from this site.
I extracted the downloaded file and ran the command. Not good:
user@localhost:~/Desktop/spark-1.4.1$ ./bin/pyspark
ls: cannot access /home/user/Desktop/spark-1.4.1/assembly/target/scala-2.10: No such file or directory
Failed to find Spark assembly in /home/user/Desktop/spark-1.4.1/assembly/target/scala-2.10.
You need to build Spark before running this program.
user@localhost:~/Desktop/spark-1.4.1$ ./bin/spark-shell
ls: cannot access /home/user/Desktop/spark-1.4.1/assembly/target/scala-2.10: No such file or directory
Failed to find Spark assembly in /home/user/Desktop/spark-1.4.1/assembly/target/scala-2.10.
You need to build Spark before running this program.
Well, the default download option is the source package, so you will have to compile the source.
user@localhost:~/Desktop/spark-1.4.1$ mvn -DskipTests clean package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Spark Project Parent POM
[INFO] Spark Launcher Project
[INFO] Spark Project Networking
[INFO] Spark Project Shuffle Streaming Service
[INFO] Spark Project Unsafe
...
...
...
constituent[20]: file:/usr/share/maven/lib/wagon-http-shaded.jar
constituent[21]: file:/usr/share/maven/lib/maven-settings-builder-3.x.jar
constituent[22]: file:/usr/share/maven/lib/maven-aether-provider-3.x.jar
constituent[23]: file:/usr/share/maven/lib/maven-core-3.x.jar
constituent[24]: file:/usr/share/maven/lib/plexus-cipher.jar
constituent[25]: file:/usr/share/maven/lib/aether-util.jar
constituent[26]: file:/usr/share/maven/lib/commons-httpclient.jar
constituent[27]: file:/usr/share/maven/lib/commons-cli.jar
constituent[28]: file:/usr/share/maven/lib/aether-api.jar
constituent[29]: file:/usr/share/maven/lib/maven-model-3.x.jar
constituent[30]: file:/usr/share/maven/lib/guava.jar
constituent[31]: file:/usr/share/maven/lib/wagon-file.jar
---------------------------------------------------
Exception in thread "main" java.lang.OutOfMemoryError: PermGen space
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at org.codehaus.plexus.classworlds.realm.ClassRealm.loadClassFromSelf(ClassRealm.java:401)
at org.codehaus.plexus.classworlds.strategy.SelfFirstStrategy.loadClass(SelfFirstStrategy.java:42)
at org.codehaus.plexus.classworlds.realm.ClassRealm.unsynchronizedLoadClass(ClassRealm.java:271)
at org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:247)
at org.codehaus.plexus.classworlds.realm.ClassRealm.loadClass(ClassRealm.java:239)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:545)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:196)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:141)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Okay, let's beef up the build settings a little. The build took a very long time; eventually I switched to the bundled Maven under the build directory. See below.
user@localhost:~/Desktop/spark-1.4.1$ export MAVEN_OPTS="-XX:MaxPermSize=1024M"
user@localhost:~/Desktop/spark-1.4.1$ mvn -DskipTests clean package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Spark Project Parent POM
[INFO] Spark Launcher Project
[INFO] Spark Project Networking
[INFO] Spark Project Shuffle Streaming Service
[INFO] Spark Project Unsafe
[INFO] Spark Project Core
user@localhost:~/Desktop/spark-1.4.1$ build/mvn -DskipTests clean package
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
[INFO] Spark Project Parent POM
[INFO] Spark Launcher Project
[INFO] Spark Project Networking
[INFO] Spark Project Shuffle Streaming Service
[INFO] Spark Project Unsafe
[INFO] Spark Project Core
..
...
...
...
get/spark-streaming-kafka-assembly_2.10-1.4.1-shaded.jar
[INFO]
[INFO] --- maven-source-plugin:2.4:jar-no-fork (create-source-jar) @ spark-streaming-kafka-assembly_2.10 ---
[INFO] Building jar: /home/user/Desktop/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1-sources.jar
[INFO]
[INFO] --- maven-source-plugin:2.4:test-jar-no-fork (create-source-jar) @ spark-streaming-kafka-assembly_2.10 ---
[INFO] Building jar: /home/user/Desktop/spark-1.4.1/external/kafka-assembly/target/spark-streaming-kafka-assembly_2.10-1.4.1-test-sources.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .......................... SUCCESS [26.138s]
[INFO] Spark Launcher Project ............................ SUCCESS [1:15.976s]
[INFO] Spark Project Networking .......................... SUCCESS [26.347s]
[INFO] Spark Project Shuffle Streaming Service ........... SUCCESS [14.123s]
[INFO] Spark Project Unsafe .............................. SUCCESS [12.643s]
[INFO] Spark Project Core ................................ SUCCESS [9:49.622s]
[INFO] Spark Project Bagel ............................... SUCCESS [17.426s]
[INFO] Spark Project GraphX .............................. SUCCESS [53.601s]
[INFO] Spark Project Streaming ........................... SUCCESS [1:34.290s]
[INFO] Spark Project Catalyst ............................ SUCCESS [2:04.020s]
[INFO] Spark Project SQL ................................. SUCCESS [2:11.032s]
[INFO] Spark Project ML Library .......................... SUCCESS [2:57.880s]
[INFO] Spark Project Tools ............................... SUCCESS [6.920s]
[INFO] Spark Project Hive ................................ SUCCESS [2:58.649s]
[INFO] Spark Project REPL ................................ SUCCESS [36.564s]
[INFO] Spark Project Assembly ............................ SUCCESS [3:13.152s]
[INFO] Spark Project External Twitter .................... SUCCESS [1:09.316s]
[INFO] Spark Project External Flume Sink ................. SUCCESS [42.294s]
[INFO] Spark Project External Flume ...................... SUCCESS [37.907s]
[INFO] Spark Project External MQTT ....................... SUCCESS [1:20.999s]
[INFO] Spark Project External ZeroMQ ..................... SUCCESS [29.090s]
[INFO] Spark Project External Kafka ...................... SUCCESS [54.212s]
[INFO] Spark Project Examples ............................ SUCCESS [5:54.508s]
[INFO] Spark Project External Kafka Assembly ............. SUCCESS [1:24.962s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 41:53.884s
[INFO] Finished at: Tue Aug 04 08:56:02 MYT 2015
[INFO] Final Memory: 71M/684M
[INFO] ------------------------------------------------------------------------
Yes, finally the build is a success. Even so, as you can see above, it took 41 minutes on my PC just to compile. Okay, now that all the libraries are built, let's repeat the command we typed just now.
$ ./bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/04 20:21:16 INFO SecurityManager: Changing view acls to: user
15/08/04 20:21:16 INFO SecurityManager: Changing modify acls to: user
15/08/04 20:21:16 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
15/08/04 20:21:16 INFO HttpServer: Starting HTTP Server
15/08/04 20:21:17 INFO Utils: Successfully started service 'HTTP class server' on port 56379.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.4.1
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_55)
Type in expressions to have them evaluated.
Type :help for more information.
15/08/04 20:21:24 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.1.1; using 192.168.133.28 instead (on interface eth0)
15/08/04 20:21:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/08/04 20:21:24 INFO SparkContext: Running Spark version 1.4.1
15/08/04 20:21:24 INFO SecurityManager: Changing view acls to: user
15/08/04 20:21:24 INFO SecurityManager: Changing modify acls to: user
15/08/04 20:21:24 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
15/08/04 20:21:25 INFO Slf4jLogger: Slf4jLogger started
15/08/04 20:21:26 INFO Remoting: Starting remoting
15/08/04 20:21:26 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@192.168.133.28:47888]
15/08/04 20:21:26 INFO Utils: Successfully started service 'sparkDriver' on port 47888.
15/08/04 20:21:27 INFO SparkEnv: Registering MapOutputTracker
15/08/04 20:21:27 INFO SparkEnv: Registering BlockManagerMaster
15/08/04 20:21:27 INFO DiskBlockManager: Created local directory at /tmp/spark-660b5f39-26be-4ea2-8593-c0c05a093a23/blockmgr-c3225f03-5ecf-4fed-bbe4-df2331ac7742
15/08/04 20:21:27 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/08/04 20:21:27 INFO HttpFileServer: HTTP File server directory is /tmp/spark-660b5f39-26be-4ea2-8593-c0c05a093a23/httpd-3ab40971-a6d0-42a7-b39e-4d1ce4290642
15/08/04 20:21:27 INFO HttpServer: Starting HTTP Server
15/08/04 20:21:27 INFO Utils: Successfully started service 'HTTP file server' on port 50089.
15/08/04 20:21:27 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/04 20:21:28 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/08/04 20:21:28 INFO SparkUI: Started SparkUI at http://192.168.133.28:4040
15/08/04 20:21:28 INFO Executor: Starting executor ID driver on host localhost
15/08/04 20:21:28 INFO Executor: Using REPL class URI: http://192.168.133.28:56379
15/08/04 20:21:28 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 36428.
15/08/04 20:21:28 INFO NettyBlockTransferService: Server created on 36428
15/08/04 20:21:28 INFO BlockManagerMaster: Trying to register BlockManager
15/08/04 20:21:28 INFO BlockManagerMasterEndpoint: Registering block manager localhost:36428 with 265.4 MB RAM, BlockManagerId(driver, localhost, 36428)
15/08/04 20:21:28 INFO BlockManagerMaster: Registered BlockManager
15/08/04 20:21:29 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/08/04 20:21:30 INFO SparkILoop: Created sql context..
SQL context available as sqlContext.
scala>
Okay, everything looks good; the error we saw earlier no longer appears. Let's explore further.
scala> sc.parallelize(1 to 1000).count()
15/08/04 20:30:05 INFO SparkContext: Starting job: count at <console>:22
15/08/04 20:30:05 INFO DAGScheduler: Got job 0 (count at <console>:22) with 4 output partitions (allowLocal=false)
15/08/04 20:30:05 INFO DAGScheduler: Final stage: ResultStage 0(count at <console>:22)
15/08/04 20:30:05 INFO DAGScheduler: Parents of final stage: List()
15/08/04 20:30:05 INFO DAGScheduler: Missing parents: List()
15/08/04 20:30:05 INFO DAGScheduler: Submitting ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:22), which has no missing parents
15/08/04 20:30:05 INFO MemoryStore: ensureFreeSpace(1096) called with curMem=0, maxMem=278302556
15/08/04 20:30:05 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1096.0 B, free 265.4 MB)
15/08/04 20:30:05 INFO MemoryStore: ensureFreeSpace(804) called with curMem=1096, maxMem=278302556
15/08/04 20:30:05 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 804.0 B, free 265.4 MB)
15/08/04 20:30:05 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:36428 (size: 804.0 B, free: 265.4 MB)
15/08/04 20:30:05 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/08/04 20:30:05 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (ParallelCollectionRDD[0] at parallelize at <console>:22)
15/08/04 20:30:05 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
15/08/04 20:30:05 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1369 bytes)
15/08/04 20:30:05 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1369 bytes)
15/08/04 20:30:05 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 1369 bytes)
15/08/04 20:30:05 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 1426 bytes)
15/08/04 20:30:05 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
15/08/04 20:30:05 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/08/04 20:30:05 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
15/08/04 20:30:05 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/08/04 20:30:06 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 658 bytes result sent to driver
15/08/04 20:30:06 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 658 bytes result sent to driver
15/08/04 20:30:06 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 658 bytes result sent to driver
15/08/04 20:30:06 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 658 bytes result sent to driver
15/08/04 20:30:06 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 477 ms on localhost (1/4)
15/08/04 20:30:06 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 478 ms on localhost (2/4)
15/08/04 20:30:06 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 508 ms on localhost (3/4)
15/08/04 20:30:06 INFO DAGScheduler: ResultStage 0 (count at <console>:22) finished in 0.520 s
15/08/04 20:30:06 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 478 ms on localhost (4/4)
15/08/04 20:30:06 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/04 20:30:06 INFO DAGScheduler: Job 0 finished: count at <console>:22, took 1.079304 s
res0: Long = 1000
That's pretty nice for a small demo of how Spark works. Before switching languages, the short sketch below shows how transformations chain onto the same kind of RDD; after that, let's move on to the next example and open another terminal.
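This is not from the session above, just a minimal hedged sketch of something you might try next in the same Scala shell; the names and the expected value are my own additions.
// filter and map are lazy transformations; reduce is the action that actually runs the job
val numbers = sc.parallelize(1 to 1000)
val sumOfEvenSquares = numbers.filter(_ % 2 == 0).map(n => n.toLong * n).reduce(_ + _)
// should come out as 167167000, the sum of the squares of the even numbers up to 1000
Because filter and map are lazy, a job only shows up in the Spark UI (port 4040, as seen in the log above) once the reduce action is called.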
user@localhost:~/Desktop/spark-1.4.1$ ./bin/pyspark
Python 2.7.10 (default, Jul 1 2015, 10:54:53)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/04 20:37:42 INFO SparkContext: Running Spark version 1.4.1
15/08/04 20:37:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/04 20:37:44 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.1.1; using 182.168.133.28 instead (on interface eth0)
15/08/04 20:37:44 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/08/04 20:37:44 INFO SecurityManager: Changing view acls to: user
15/08/04 20:37:44 INFO SecurityManager: Changing modify acls to: user
15/08/04 20:37:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
15/08/04 20:37:46 INFO Slf4jLogger: Slf4jLogger started
15/08/04 20:37:46 INFO Remoting: Starting remoting
15/08/04 20:37:46 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@182.168.133.28:35904]
15/08/04 20:37:46 INFO Utils: Successfully started service 'sparkDriver' on port 35904.
15/08/04 20:37:46 INFO SparkEnv: Registering MapOutputTracker
15/08/04 20:37:46 INFO SparkEnv: Registering BlockManagerMaster
15/08/04 20:37:47 INFO DiskBlockManager: Created local directory at /tmp/spark-2b46e9e7-1779-45d1-b9cf-46000baf7d9b/blockmgr-e2f47b34-47a8-4b72-a0d6-25d0a7daa02e
15/08/04 20:37:47 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/08/04 20:37:47 INFO HttpFileServer: HTTP File server directory is /tmp/spark-2b46e9e7-1779-45d1-b9cf-46000baf7d9b/httpd-2ec128c2-bad0-4dd9-a826-eab2ee0779cb
15/08/04 20:37:47 INFO HttpServer: Starting HTTP Server
15/08/04 20:37:47 INFO Utils: Successfully started service 'HTTP file server' on port 45429.
15/08/04 20:37:47 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/04 20:37:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
15/08/04 20:37:50 INFO Utils: Successfully started service 'SparkUI' on port 4041.
15/08/04 20:37:50 INFO SparkUI: Started SparkUI at http://182.168.133.28:4041
15/08/04 20:37:50 INFO Executor: Starting executor ID driver on host localhost
15/08/04 20:37:51 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 47045.
15/08/04 20:37:51 INFO NettyBlockTransferService: Server created on 47045
15/08/04 20:37:51 INFO BlockManagerMaster: Trying to register BlockManager
15/08/04 20:37:51 INFO BlockManagerMasterEndpoint: Registering block manager localhost:47045 with 265.4 MB RAM, BlockManagerId(driver, localhost, 47045)
15/08/04 20:37:51 INFO BlockManagerMaster: Registered BlockManager
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.4.1
/_/
Using Python version 2.7.10 (default, Jul 1 2015 10:54:53)
SparkContext available as sc, SQLContext available as sqlContext.
>>> sc.parallelize(range(1000)).count()
15/08/04 20:37:55 INFO SparkContext: Starting job: count at <stdin>:1
15/08/04 20:37:55 INFO DAGScheduler: Got job 0 (count at <stdin>:1) with 4 output partitions (allowLocal=false)
15/08/04 20:37:55 INFO DAGScheduler: Final stage: ResultStage 0(count at <stdin>:1)
15/08/04 20:37:55 INFO DAGScheduler: Parents of final stage: List()
15/08/04 20:37:55 INFO DAGScheduler: Missing parents: List()
15/08/04 20:37:55 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[1] at count at <stdin>:1), which has no missing parents
15/08/04 20:37:55 INFO MemoryStore: ensureFreeSpace(4416) called with curMem=0, maxMem=278302556
15/08/04 20:37:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 4.3 KB, free 265.4 MB)
15/08/04 20:37:55 INFO MemoryStore: ensureFreeSpace(2722) called with curMem=4416, maxMem=278302556
15/08/04 20:37:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 2.7 KB, free 265.4 MB)
15/08/04 20:37:55 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:47045 (size: 2.7 KB, free: 265.4 MB)
15/08/04 20:37:55 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/08/04 20:37:55 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 0 (PythonRDD[1] at count at <stdin>:1)
15/08/04 20:37:55 INFO TaskSchedulerImpl: Adding task set 0.0 with 4 tasks
15/08/04 20:37:55 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1873 bytes)
15/08/04 20:37:55 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 2117 bytes)
15/08/04 20:37:55 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, PROCESS_LOCAL, 2123 bytes)
15/08/04 20:37:55 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, PROCESS_LOCAL, 2123 bytes)
15/08/04 20:37:55 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
15/08/04 20:37:55 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
15/08/04 20:37:55 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/08/04 20:37:55 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/08/04 20:37:56 INFO PythonRDD: Times: total = 421, boot = 376, init = 44, finish = 1
15/08/04 20:37:56 INFO PythonRDD: Times: total = 418, boot = 354, init = 64, finish = 0
15/08/04 20:37:56 INFO PythonRDD: Times: total = 423, boot = 372, init = 51, finish = 0
15/08/04 20:37:56 INFO PythonRDD: Times: total = 421, boot = 381, init = 40, finish = 0
15/08/04 20:37:56 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 698 bytes result sent to driver
15/08/04 20:37:56 INFO Executor: Finished task 2.0 in stage 0.0 (TID 2). 698 bytes result sent to driver
15/08/04 20:37:56 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 698 bytes result sent to driver
15/08/04 20:37:56 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 698 bytes result sent to driver
15/08/04 20:37:56 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 552 ms on localhost (1/4)
15/08/04 20:37:56 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 560 ms on localhost (2/4)
15/08/04 20:37:56 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 562 ms on localhost (3/4)
15/08/04 20:37:56 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 626 ms on localhost (4/4)
15/08/04 20:37:56 INFO DAGScheduler: ResultStage 0 (count at <stdin>:1) finished in 0.641 s
15/08/04 20:37:56 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/04 20:37:56 INFO DAGScheduler: Job 0 finished: count at <stdin>:1, took 1.137405 s
1000
>>>
Looks good. The next example calculates pi using Spark; a short note on how it arrives at the estimate follows after the output.
user@localhost:~/Desktop/spark-1.4.1$ ./bin/run-example SparkPi
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/08/04 20:44:50 INFO SparkContext: Running Spark version 1.4.1
15/08/04 20:44:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/08/04 20:44:51 WARN Utils: Your hostname, localhost resolves to a loopback address: 127.0.1.1; using 182.168.133.28 instead (on interface eth0)
15/08/04 20:44:51 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
15/08/04 20:44:51 INFO SecurityManager: Changing view acls to: user
15/08/04 20:44:51 INFO SecurityManager: Changing modify acls to: user
15/08/04 20:44:51 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
15/08/04 20:44:52 INFO Slf4jLogger: Slf4jLogger started
15/08/04 20:44:52 INFO Remoting: Starting remoting
15/08/04 20:44:53 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@182.168.133.28:45817]
15/08/04 20:44:53 INFO Utils: Successfully started service 'sparkDriver' on port 45817.
15/08/04 20:44:53 INFO SparkEnv: Registering MapOutputTracker
15/08/04 20:44:53 INFO SparkEnv: Registering BlockManagerMaster
15/08/04 20:44:53 INFO DiskBlockManager: Created local directory at /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/blockmgr-5ed813af-a26f-413c-bdfc-1e08001f9cb2
15/08/04 20:44:53 INFO MemoryStore: MemoryStore started with capacity 265.4 MB
15/08/04 20:44:53 INFO HttpFileServer: HTTP File server directory is /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/httpd-f07ff755-e34d-4149-b4ac-399e6897221a
15/08/04 20:44:53 INFO HttpServer: Starting HTTP Server
15/08/04 20:44:53 INFO Utils: Successfully started service 'HTTP file server' on port 50955.
15/08/04 20:44:53 INFO SparkEnv: Registering OutputCommitCoordinator
15/08/04 20:44:54 INFO Utils: Successfully started service 'SparkUI' on port 4040.
15/08/04 20:44:54 INFO SparkUI: Started SparkUI at http://182.168.133.28:4040
15/08/04 20:44:58 INFO SparkContext: Added JAR file:/home/user/Desktop/spark-1.4.1/examples/target/scala-2.10/spark-examples-1.4.1-hadoop2.2.0.jar at http://182.168.133.28:50955/jars/spark-examples-1.4.1-hadoop2.2.0.jar with timestamp 1438692298221
15/08/04 20:44:58 INFO Executor: Starting executor ID driver on host localhost
15/08/04 20:44:58 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 45731.
15/08/04 20:44:58 INFO NettyBlockTransferService: Server created on 45731
15/08/04 20:44:58 INFO BlockManagerMaster: Trying to register BlockManager
15/08/04 20:44:58 INFO BlockManagerMasterEndpoint: Registering block manager localhost:45731 with 265.4 MB RAM, BlockManagerId(driver, localhost, 45731)
15/08/04 20:44:58 INFO BlockManagerMaster: Registered BlockManager
15/08/04 20:44:59 INFO SparkContext: Starting job: reduce at SparkPi.scala:35
15/08/04 20:44:59 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:35) with 2 output partitions (allowLocal=false)
15/08/04 20:44:59 INFO DAGScheduler: Final stage: ResultStage 0(reduce at SparkPi.scala:35)
15/08/04 20:44:59 INFO DAGScheduler: Parents of final stage: List()
15/08/04 20:44:59 INFO DAGScheduler: Missing parents: List()
15/08/04 20:44:59 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31), which has no missing parents
15/08/04 20:44:59 INFO MemoryStore: ensureFreeSpace(1888) called with curMem=0, maxMem=278302556
15/08/04 20:44:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1888.0 B, free 265.4 MB)
15/08/04 20:44:59 INFO MemoryStore: ensureFreeSpace(1202) called with curMem=1888, maxMem=278302556
15/08/04 20:44:59 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1202.0 B, free 265.4 MB)
15/08/04 20:44:59 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:45731 (size: 1202.0 B, free: 265.4 MB)
15/08/04 20:44:59 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:874
15/08/04 20:44:59 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:31)
15/08/04 20:44:59 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
15/08/04 20:44:59 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1446 bytes)
15/08/04 20:44:59 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1446 bytes)
15/08/04 20:44:59 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
15/08/04 20:44:59 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
15/08/04 20:44:59 INFO Executor: Fetching http://182.168.133.28:50955/jars/spark-examples-1.4.1-hadoop2.2.0.jar with timestamp 1438692298221
15/08/04 20:45:00 INFO Utils: Fetching http://182.168.133.28:50955/jars/spark-examples-1.4.1-hadoop2.2.0.jar to /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/userFiles-f3a72f24-78e5-4d5d-82eb-dcc8c6b899cb/fetchFileTemp5981400277552657211.tmp
15/08/04 20:45:03 INFO Executor: Adding file:/tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/userFiles-f3a72f24-78e5-4d5d-82eb-dcc8c6b899cb/spark-examples-1.4.1-hadoop2.2.0.jar to class loader
15/08/04 20:45:03 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 736 bytes result sent to driver
15/08/04 20:45:03 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 736 bytes result sent to driver
15/08/04 20:45:03 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3722 ms on localhost (1/2)
15/08/04 20:45:03 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3685 ms on localhost (2/2)
15/08/04 20:45:03 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
15/08/04 20:45:03 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:35) finished in 3.750 s
15/08/04 20:45:03 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:35, took 4.032610 s
Pi is roughly 3.14038
15/08/04 20:45:03 INFO SparkUI: Stopped Spark web UI at http://182.168.133.28:4040
15/08/04 20:45:03 INFO DAGScheduler: Stopping DAGScheduler
15/08/04 20:45:03 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
15/08/04 20:45:03 INFO Utils: path = /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9/blockmgr-5ed813af-a26f-413c-bdfc-1e08001f9cb2, already present as root for deletion.
15/08/04 20:45:03 INFO MemoryStore: MemoryStore cleared
15/08/04 20:45:03 INFO BlockManager: BlockManager stopped
15/08/04 20:45:03 INFO BlockManagerMaster: BlockManagerMaster stopped
15/08/04 20:45:03 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
15/08/04 20:45:03 INFO SparkContext: Successfully stopped SparkContext
15/08/04 20:45:03 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
15/08/04 20:45:03 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
15/08/04 20:45:03 INFO Utils: Shutdown hook called
15/08/04 20:45:03 INFO Utils: Deleting directory /tmp/spark-da217260-adb6-474e-9908-9dcdd39371e9
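For a bit of context on what just happened: SparkPi estimates pi with Monte Carlo sampling. It throws random points into the square from -1 to 1 on both axes, counts how many fall inside the unit circle, and multiplies that fraction by 4. Roughly, the bundled example boils down to something like this condensed sketch (paraphrased, not the exact source):
val slices = 2
val n = 100000 * slices
val inside = sc.parallelize(1 to n, slices).map { _ =>
  val x = math.random * 2 - 1   // a random point in the square [-1, 1] x [-1, 1]
  val y = math.random * 2 - 1
  if (x * x + y * y < 1) 1 else 0   // 1 when the point lands inside the unit circle
}.reduce(_ + _)
println("Pi is roughly " + 4.0 * inside / n)
More samples give a closer estimate, which is why the run above prints only roughly 3.14038.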
With this introduction, you should have a good idea of what Spark is all about: at its core, you use it to do distributed processing. These small exercises show how quickly you can get started, and it is definitely worthwhile to look into the examples directory to see what Spark can really do for you. Before I end this, I think these two links are very helpful to take you further; after them, a minimal sketch of a standalone Spark application gives a taste of the next step beyond the shell.
http://spark.apache.org/docs/latest/quick-start.html
http://spark.apache.org/docs/latest/#launching-on-a-cluster
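To round things off, here is the minimal sketch of a standalone application mentioned above, adapted from the quick start guide. It comes with assumptions: Spark 1.4.x on the classpath (for example as an sbt or Maven dependency) and a README.md text file in the working directory.
import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark in-process on all cores; swap in a cluster master URL when you have one
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // any text file works; README.md from the Spark directory is assumed here
    val lines = sc.textFile("README.md").cache()
    val numAs = lines.filter(_.contains("a")).count()
    val numBs = lines.filter(_.contains("b")).count()
    println(s"Lines with a: $numAs, lines with b: $numBs")

    sc.stop()
  }
}
Packaged into a jar, an application like this would then be launched with ./bin/spark-submit, which the second link above leads into.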