Showing posts with label cloudera_impala. Show all posts

Sunday, August 30, 2015

First learning into Cloudera Impala

Let's look at a vendor big data technology today: Cloudera Impala. So what is Impala all about?

The Wikipedia definition:

Cloudera Impala is Cloudera's open source massively parallel processing (MPP) SQL query engine for data stored in a computer cluster running Apache Hadoop.[1]

And the definition from the official GitHub repository:

Lightning-fast, distributed SQL queries for petabytes of data stored in Apache Hadoop clusters. 
Impala is a modern, massively-distributed, massively-parallel, C++ query engine that lets you analyze, transform and combine data from a variety of data sources:

Let us download a virtual machine image. This is a good way to start because Impala integrates with Hadoop, and if you have no Hadoop knowledge you would otherwise have to set up a Hadoop cluster first before integrating it with Impala. With a virtual machine image, it is as easy as importing the image into the host and powering it up. It also saves you setup time and reduces errors.

With that said, I'm downloading a VirtualBox image. Once it is downloaded, extract it to a directory. If you have not installed VirtualBox yet, install it now with apt-get install virtualbox virtualbox-guest-additions-iso, and make sure the VirtualBox service is running.

 root@localhost:~# /etc/init.d/virtualbox status  
 ● virtualbox.service - LSB: VirtualBox Linux kernel module  
   Loaded: loaded (/etc/init.d/virtualbox)  
   Active: active (exited) since Thu 2015-08-20 17:07:43 MYT; 2min 36s ago  
    Docs: man:systemd-sysv-generator(8)  
  Process: 29390 ExecStop=/etc/init.d/virtualbox stop (code=exited, status=0/SUCCESS)  
  Process: 29425 ExecStart=/etc/init.d/virtualbox start (code=exited, status=0/SUCCESS)  
   
 Aug 20 17:07:43 localhost systemd[1]: Starting LSB: VirtualBox Linux kernel module...  
 Aug 20 17:07:43 localhost systemd[1]: Started LSB: VirtualBox Linux kernel module.  
 Aug 20 17:07:43 localhost virtualbox[29425]: Starting VirtualBox kernel modules.  

Launch VirtualBox and add the virtual image to a new instance, see screenshot below.




Now power this virtual machine up! Please be patient, as it takes a long time to boot, at least on my PC. You might want to get a drink in the meantime. This article follows this tutorial. However, I gave up because the select statement took a very long time in the virtual environment, at least for me. I will still illustrate the steps up to the point where it became slow.

First, you need to copy these CSV files (tab1.csv and tab2.csv) into the virtual machine.







Then you can load the script with the SQL that creates the tables and loads the CSV files into them. The example given in the tutorial does not create a database, so I suggest you add the two lines create database testdb; and use testdb; to the top of the script before loading it.

 create database testdb;  
 use testdb;  
 DROP TABLE IF EXISTS tab1;  
 -- The EXTERNAL clause means the data is located outside the central location  
 -- for Impala data files and is preserved when the associated Impala table is dropped.  
 -- We expect the data to already ex  
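
If you would rather not edit the script by hand, here is a minimal shell sketch that puts the two database statements at the top of the file. The file name tab_setup.sql is hypothetical, so substitute the name of the script you copied from the tutorial. I also wrote CREATE DATABASE IF NOT EXISTS instead of the plain create database used above, so the script can be loaded more than once; that is my own variation, not something from the tutorial.

```shell
# Hypothetical file name; replace it with the script you copied from the tutorial.
script=tab_setup.sql

# Stand-in content so this sketch runs on its own.
# It is skipped when the file already exists.
[ -f "$script" ] || printf 'DROP TABLE IF EXISTS tab1;\n' > "$script"

# Write the two database statements first, then the original script,
# and swap the result into place.
{
  printf 'CREATE DATABASE IF NOT EXISTS testdb;\n'
  printf 'USE testdb;\n'
  cat "$script"
} > "$script.tmp" && mv "$script.tmp" "$script"

# Show the first lines to confirm the order.
head -n 3 "$script"
```

Run it once only, since every run adds the two lines at the top again.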



After that, you can run the impala-shell command and issue SQL queries, but as you can see, the select statement just hangs there forever.
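
Since the query can hang, one option is to run it non-interactively under a time limit, so the terminal comes back on its own. This is only a sketch under a few assumptions: the 60 second limit and the COUNT(*) query are my own choices, timeout is the GNU coreutils command, and testdb and tab1 are the names from the script in this post. It does not fix the slowness, it only keeps you from waiting forever.

```shell
# Query to try; tab1 and testdb are the names used in the script in this post.
query='SELECT COUNT(*) FROM tab1'
# Assumed limit in seconds; pick whatever you are willing to wait.
limit=60
status=0

if command -v impala-shell >/dev/null 2>&1; then
  # -d selects the database, -q runs a single query and exits.
  timeout "$limit" impala-shell -d testdb -q "$query" || status=$?
  # GNU timeout exits with 124 when it had to stop the command.
  if [ "$status" -eq 124 ]; then
    echo "query did not finish within ${limit}s"
  fi
else
  status=127
  echo "impala-shell is not on PATH; run this inside the virtual machine"
fi
```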



Not a good experience, but if Impala is what you need, find out what the problem is and let me know. :-)