
Saturday, November 7, 2015

Apache Accumulo first learning experience


Today we will take a look at another big data technology: Apache Accumulo. First, what is Accumulo?

Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process. Other notable improvements and features are outlined here.
Google published the design of BigTable in 2006. Several other open source projects have implemented aspects of this design including HBase, Hypertable, and Cassandra. Accumulo began its development in 2008 and joined the Apache community in 2011.

In this article, as always, we will set up the infrastructure. I followed the referenced article with the following environment:

  • 64bit arch
  • open jdk 1.7/1.8
  • zookeeper-3.4.6
  • hadoop-2.6.1
  • accumulo-1.7.0
  • openssh 
  • rsync
  • debian sid

As Accumulo is a Java-based project, you must have Java installed and configured. Get the latest Java 1.7 or 1.8 as of this writing. After Java is installed, you need to export JAVA_HOME in your bash configuration file, .bashrc, with this line: export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_55

Then you need to source the new .bashrc; running . .bashrc is sufficient. For ssh and rsync, you can use the apt-get package manager as it is easy. What's important is that you set up a public/private key pair in your user's ssh configuration directory so that passwordless login to localhost works.
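
Hadoop's start scripts ssh into localhost, so passwordless SSH has to work for your user. A minimal sketch of the key setup, assuming you have no existing key you want to keep:

 $ ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa
 $ cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
 $ chmod 600 $HOME/.ssh/authorized_keys
 $ # should log in without prompting for a password
 $ ssh localhost exit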

You can create two directories, $HOME/Downloads and $HOME/Installs. It's pretty intuitive: the Downloads directory is for the downloaded packages and Installs is the working directory where the compressed packages are extracted.


Download the above packages into the $HOME/Downloads directory and extract them into $HOME/Installs.
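
A minimal sketch of the directory setup and extraction, assuming the standard Apache tarball names sit in $HOME/Downloads:

 $ mkdir -p $HOME/Downloads $HOME/Installs
 $ tar xzf $HOME/Downloads/hadoop-2.6.1.tar.gz -C $HOME/Installs
 $ tar xzf $HOME/Downloads/zookeeper-3.4.6.tar.gz -C $HOME/Installs
 $ tar xzf $HOME/Downloads/accumulo-1.7.0-bin.tar.gz -C $HOME/Installs

First, let's configure Apache Hadoop.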

 $ vim $HOME/Installs/hadoop-2.6.1/etc/hadoop/hadoop-env.sh  
 $ # in the file above, uncomment and set: export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_55  
 $ vim $HOME/Installs/hadoop-2.6.1/etc/hadoop/core-site.xml  
 $ cat $HOME/Installs/hadoop-2.6.1/etc/hadoop/core-site.xml  
 <?xml version="1.0" encoding="UTF-8"?>  
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>  
 <configuration>  
   <property>  
     <name>fs.defaultFS</name>  
     <value>hdfs://localhost:9000</value>  
   </property>  
 </configuration>  
 $ vim $HOME/Installs/hadoop-2.6.1/etc/hadoop/hdfs-site.xml  
 $ cat $HOME/Installs/hadoop-2.6.1/etc/hadoop/hdfs-site.xml  
 <?xml version="1.0" encoding="UTF-8"?>  
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>  
 <configuration>  
   <property>  
     <name>dfs.replication</name>  
     <value>1</value>  
   </property>  
   <property>  
     <name>dfs.name.dir</name>  
     <value>hdfs_storage/name</value>  
   </property>  
   <property>  
     <name>dfs.data.dir</name>  
     <value>hdfs_storage/data</value>  
   </property>  
 </configuration>  
 $ vim $HOME/Installs/hadoop-2.6.1/etc/hadoop/mapred-site.xml  
 $ cat $HOME/Installs/hadoop-2.6.1/etc/hadoop/mapred-site.xml  
 <?xml version="1.0"?>  
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>  
 <configuration>  
    <property>  
      <name>mapred.job.tracker</name>  
      <value>localhost:9001</value>  
    </property>  
 </configuration>  
 $ cd $HOME/Installs/hadoop-2.6.1/  
 $ $HOME/Installs/hadoop-2.6.1/bin/hdfs namenode -format  
 $ $HOME/Installs/hadoop-2.6.1/sbin/start-dfs.sh  

As you can read above, we specify the Java home for Hadoop and then configure HDFS to listen on port 9000, so make sure this port is free for Hadoop to use. Then we format the namenode and start HDFS.
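
If you want to confirm HDFS came up before moving on, a quick check with jps (part of the JDK) and an HDFS listing is enough; the NameNode web UI should also answer on its Hadoop 2.x default port 50070:

 $ jps
 $ # NameNode, DataNode and SecondaryNameNode should be listed
 $ $HOME/Installs/hadoop-2.6.1/bin/hdfs dfs -ls /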

Next we will configure ZooKeeper.

 $ cp $HOME/Installs/zookeeper-3.4.6/conf/zoo_sample.cfg $HOME/Installs/zookeeper-3.4.6/conf/zoo.cfg  
 $ $HOME/Installs/zookeeper-3.4.6/bin/zkServer.sh start  

Pretty simple: copy the default config file and start the service. The last step is Apache Accumulo.
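
Before configuring Accumulo, you can optionally confirm ZooKeeper is actually running:

 $ $HOME/Installs/zookeeper-3.4.6/bin/zkServer.sh status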

 $ cp $HOME/Installs/accumulo-1.7.0/conf/examples/512MB/standalone/* $HOME/Installs/accumulo-1.7.0/conf/  
 $ vim $HOME/.bashrc  
 $ tail -2 $HOME/.bashrc  
 export HADOOP_HOME=$HOME/Installs/hadoop-2.6.1/  
 export ZOOKEEPER_HOME=$HOME/Installs/zookeeper-3.4.6/  
 $ . $HOME/.bashrc  
 $ vim $HOME/Installs/accumulo-1.7.0/conf/accumulo-env.sh  
 $ # SET ACCUMULO_MONITOR_BIND_ALL to true.  
 $ vim $HOME/Installs/accumulo-1.7.0/conf/accumulo-site.xml  
 $ # in file $HOME/Installs/accumulo-1.7.0/conf/accumulo-site.xml  
 <property>  
   <name>instance.volumes</name>  
   <value>hdfs://localhost:9000/accumulo</value>  
 </property>  
 $ # in file $HOME/Installs/accumulo-1.7.0/conf/accumulo-site.xml  
 <property>  
   <name>instance.secret</name>  
   <value>mysecret</value>  
 </property>  
 $ # in file $HOME/Installs/accumulo-1.7.0/conf/accumulo-site.xml  
 <property>  
   <name>trace.token.property.password</name>  
   <value>mysecret</value>  
 </property>  

So we have configured the environment for Accumulo in .bashrc and some property settings in accumulo-env.sh and accumulo-site.xml. Next, we will initialize Accumulo and start it using the password we specified previously.

 $ $HOME/Installs/accumulo-1.7.0/bin/accumulo init  
 $ # give an instance name.  
 $ # type in the password as specified in trace.token.property.password.  
 $ $HOME/Installs/accumulo-1.7.0/bin/start-all.sh  
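
If everything came up, jps should now also list the Accumulo processes (Master, TabletServer, Monitor, GarbageCollector and Tracer in 1.7), and the monitor web page should be reachable, by default on port 50095:

 $ jps
 $ # monitor page: http://localhost:50095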

That's it! If you want to do CRUD operations in Accumulo, I suggest you go with the official documentation.
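
If you just want to try a few CRUD operations right away, the interactive Accumulo shell is the quickest route. A short session sketch, assuming the instance was named myinstance during init and root's password is the one set in trace.token.property.password:

 $ $HOME/Installs/accumulo-1.7.0/bin/accumulo shell -u root
 root@myinstance> createtable demo
 root@myinstance demo> insert row1 colfam colqual myvalue
 root@myinstance demo> scan
 root@myinstance demo> delete row1 colfam colqual
 root@myinstance demo> deletetable -f demo
 root@myinstance> exit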