
Category Archives: hadoop

Installing HBase over HDFS on a Single Ubuntu Box

I faced some issues getting HBase to run over HDFS on my Ubuntu box. This is an informal step-by-step guide, from setting up HDFS to running HBase, on a single Ubuntu machine.

    1. Download Hadoop (hadoop-0.20.203.0rc1.tar.gz) and install it by following this great tutorial: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/. I installed it under my own system user rather than creating hduser. Make sure the 4 files (core-site.xml, hadoop-env.sh, hdfs-site.xml, mapred-site.xml) under the hadoop/conf folder have the values shown below. Check that Hadoop is working by running the wordcount example as described in the tutorial. Also update your .bashrc file with the required variables.

      core-site.xml

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
      
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/shekhar/hadoop-data</value>
        <description>A base for other temporary directories.</description>
      </property>
      
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
        <description>The name of the default file system.  A URI whose
        scheme and authority determine the FileSystem implementation.  The
        uri's scheme determines the config property (fs.SCHEME.impl) naming
        the FileSystem implementation class.  The uri's authority is used to
        determine the host, port, etc. for a filesystem.</description>
      </property>
      
      </configuration>
      
      hdfs-site.xml

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
      
      <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication.
        The actual number of replications can be specified when the file is created.
        The default is used if replication is not specified in create time.
        </description>
      </property>
      
      </configuration>
      

      mapred-site.xml

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      
      <!-- Put site-specific property overrides in this file. -->
      
      <configuration>
      
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
        <description>The host and port that the MapReduce job tracker runs
        at.  If "local", then jobs are run in-process as a single map
        and reduce task.
        </description>
      </property>
      
      </configuration>
      

      hadoop-env.sh

      # Set Hadoop-specific environment variables here.
      
      # The only required environment variable is JAVA_HOME.  All others are
      # optional.  When running a distributed configuration it is best to
      # set JAVA_HOME in this file, so that it is correctly defined on
      # remote nodes.
      
      # The java implementation to use.  Required.
      export JAVA_HOME=/usr/lib/jvm/java-6-sun-1.6.0.26
      
      # Extra Java CLASSPATH elements.  Optional.
      # export HADOOP_CLASSPATH=
      
      # The maximum amount of heap to use, in MB. Default is 1000.
      # export HADOOP_HEAPSIZE=2000
      
      # Extra Java runtime options.  Empty by default.
      # export HADOOP_OPTS=-server
      
      # Command specific options appended to HADOOP_OPTS when specified
      export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_NAMENODE_OPTS"
      export HADOOP_SECONDARYNAMENODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_SECONDARYNAMENODE_OPTS"
      export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote $HADOOP_DATANODE_OPTS"
      export HADOOP_BALANCER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_BALANCER_OPTS"
      export HADOOP_JOBTRACKER_OPTS="-Dcom.sun.management.jmxremote $HADOOP_JOBTRACKER_OPTS"
      # export HADOOP_TASKTRACKER_OPTS=
      # The following applies to multiple commands (fs, dfs, fsck, distcp etc)
      # export HADOOP_CLIENT_OPTS
      
      # Extra ssh options.  Empty by default.
      # export HADOOP_SSH_OPTS="-o ConnectTimeout=1 -o SendEnv=HADOOP_CONF_DIR"
      
      # Where log files are stored.  $HADOOP_HOME/logs by default.
      # export HADOOP_LOG_DIR=${HADOOP_HOME}/logs
      
      # File naming remote slave hosts.  $HADOOP_HOME/conf/slaves by default.
      # export HADOOP_SLAVES=${HADOOP_HOME}/conf/slaves
      
      # host:path where hadoop code should be rsync'd from.  Unset by default.
      # export HADOOP_MASTER=master:/home/$USER/src/hadoop
      
      # Seconds to sleep between slave commands.  Unset by default.  This
      # can be useful in large clusters, where, e.g., slave rsyncs can
      # otherwise arrive faster than the master can service them.
      # export HADOOP_SLAVE_SLEEP=0.1
      
      # The directory where pid files are stored. /tmp by default.
      # export HADOOP_PID_DIR=/var/hadoop/pids
      
      # A string representing this instance of hadoop. $USER by default.
      # export HADOOP_IDENT_STRING=$USER
      
      # The scheduling priority for daemon processes.  See 'man nice'.
      # export HADOOP_NICENESS=10
      
    2. Download HBase (hbase-0.90.4.tar.gz). Update hbase-site.xml in the hbase/conf folder with the required properties.
      hbase-site.xml

      <?xml version="1.0"?>
      <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
      <configuration>
      
      	<property>
      		<name>hbase.rootdir</name>
          		<value>hdfs://localhost:54310/hbase</value>
      	</property>
      
      	<property>
      		<name>dfs.replication</name>
      		<value>1</value>
      	</property>
      
      	<property>
      	      <name>hbase.zookeeper.property.clientPort</name>
      	      <value>2222</value>
      	      <description>Property from ZooKeeper's config zoo.cfg.
      	      The port at which the clients will connect.
      	      </description>
          	</property>
      	<property>
      	      <name>hbase.zookeeper.quorum</name>
      	      <value>localhost</value>
      	      <description>Comma separated list of servers in the ZooKeeper Quorum.
      	      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
      	      By default this is set to localhost for local and pseudo-distributed modes
      	      of operation. For a fully-distributed setup, this should be set to a full
      	      list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in hbase-env.sh
      	      this is the list of servers which we will start/stop ZooKeeper on.
      	      </description>
      	</property>
          <property>
            <name>hbase.zookeeper.property.dataDir</name>
            <value>/home/shekhar/zookeeper</value>
            <description>Property from ZooKeeper's config zoo.cfg.
            The directory where the snapshot is stored.
            </description>
          </property>
      
      </configuration>
      

      Update hbase-env.sh so that HBase manages its own ZooKeeper instance.

      # Tell HBase whether it should manage it's own instance of Zookeeper or not.
      export HBASE_MANAGES_ZK=true
      
    3. Run HBase using ./start-hbase.sh in the bin folder. You will see the following exceptions in the log file.
      2011-12-06 13:59:29,979 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
      java.io.IOException: Call to localhost/127.0.0.1:54310 failed on local exception: java.io.EOFException
      
      2011-12-06 13:59:30,577 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server localhost/127.0.0.1:2181
      2011-12-06 13:59:30,577 WARN org.apache.zookeeper.ClientCnxn: Session 0x134127deaaf0002 for server null, unexpected error, closing socket connection and attempting reconnect
      java.net.ConnectException: Connection refused
      at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
      at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
      at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119)
      

      Kill the HBase process using kill -9 <processid>

    4. The exception in step 3 occurs because the Hadoop jar in the hbase/lib directory is a different version from the one used by the Hadoop installation. Copy hadoop-core-0.20.203.0.jar from the hadoop folder to the hbase/lib folder.
    5. Start HBase again using ./start-hbase.sh and you will get another exception.
      2011-12-06 14:51:05,778 FATAL org.apache.hadoop.hbase.master.HMaster: Unhandled exception. Starting shutdown.
      java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration
      

      Kill the HBase process using kill -9 <processid>

    6. To fix this, copy commons-configuration-1.6.jar from the hadoop/lib folder to the hbase/lib folder.
    7. Start HBase again using ./start-hbase.sh. It should start fine now, and you should be able to see HBase running at http://localhost:60010/master.jsp. If a valid page comes up, HBase has started fine.
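
The two jar fixes (steps 4 and 6) can also be scripted if you expect to repeat the setup. Below is a minimal sketch in Python, not part of HBase or Hadoop: the function name sync_hadoop_jars and the directory arguments are my own placeholders, and the jar names are simply the ones from the steps above. It assumes the hbase/lib folder already exists.

```python
import shutil
from pathlib import Path

# Jar locations relative to the Hadoop install folder, as named in steps 4 and 6.
JARS_TO_COPY = [
    "hadoop-core-0.20.203.0.jar",
    "lib/commons-configuration-1.6.jar",
]


def sync_hadoop_jars(hadoop_home, hbase_home, jars=JARS_TO_COPY):
    """Copy each listed jar from the Hadoop folder into hbase/lib.

    Returns the destination paths that were written.
    Raises FileNotFoundError if one of the source jars is missing.
    """
    hbase_lib = Path(hbase_home) / "lib"
    copied = []
    for relative in jars:
        source = Path(hadoop_home) / relative
        if not source.is_file():
            raise FileNotFoundError(f"expected jar not found: {source}")
        target = hbase_lib / source.name
        shutil.copy2(source, target)  # overwrites a jar of the same name
        copied.append(target)
    return copied
```

You would call it with your own install locations, for example sync_hadoop_jars("/path/to/hadoop", "/path/to/hbase"), while HBase is stopped. Both paths are placeholders for wherever you unpacked the two tarballs.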

How I explained MapReduce to my Wife?

Yesterday I gave a presentation on MapReduce at the Xebia India office. It went really well, and the audience was able to understand the concept of MapReduce (as per their feedback). So I was happy that I had done a good job of explaining MapReduce to a technical audience (mainly Java programmers, some Flex programmers and a few testers). After all the hard work and a great dinner at the Xebia India office, I got back home. My wife (Supriya) asked me, “How was your session on …”, and I replied that it went well. Next she asked what the session was all about (she is not in the software/programming field). I replied, “MapReduce.” “MapReduce!! What is it?” she replied. “Is it something related to geographical maps?” I said, “No, no. It has nothing to do with geographical maps.” So she said, “What is it, then?” “Hmmm…” I said, “let’s go to Dominos (a pizza chain) and I will explain it over the pizza table.” She said “Great,” and we went to the pizza shop.

After we reached Dominos and placed our order, the guy at the counter told us that it would take 15 minutes to prepare the pizza. So I asked her, “Do you really want to understand the MapReduce concept?” She replied with a firm “Yes.” So I started.

Shekhar : How do you prepare Onion chutney? (This is not the exact recipe so please don’t try this at home :) )

Supriya : I will take an onion, cut it into pieces, mix in salt, add water, and finally grind it all in a Mixer-Grinder. And you will get the Onion Chutney.

Supriya : How is this related to MapReduce?

Shekhar : Wait! Let me build the full story; you will surely understand MapReduce in 15 minutes.

Supriya : Ok.

Shekhar : Now suppose you want to prepare a mixed chutney using Mint, Onion, Tomato, Chilies, Garlic. How will you do it?

Supriya : I will take a bunch of Mint leaves, 1 onion, 1 tomato, 1 chili and 1 garlic, and cut them into pieces. I will add the required salt and water, grind it all in the Mixer-Grinder, and you will get a Mixed Chutney.

Shekhar : Great. Let’s apply the MapReduce concept to your recipe. Map and Reduce are two operations. Let me explain them in more detail.

Map : Cutting the onion, tomato, chili and garlic into pieces is a Map operation applied to each of them individually. You pass one onion to a map and it cuts the onion into pieces. Similarly, you pass the chili, garlic and tomato to the map one by one, and you get many pieces. So whenever you cut a vegetable such as an onion into pieces, you are doing a map operation. The Map operation is applied to each vegetable and gives zero or more outputs; in our case, the pieces of that vegetable. It might happen that one of the onions is rotten and you just throw it away. In that case the Map operation simply acts as a filter and produces no output for the rotten onion.

Reduce : In this phase you pass all the pieces of the different vegetables to the grinder, which grinds them all to give you one Chutney. It means you reduced all of the ingredients to produce one output. So a reducer usually aggregates the output of the maps.
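
For readers who prefer code to chutney, here is a minimal sketch of the same two operations in plain Python. It is only an illustration of the idea, not Hadoop's API: the names cut, grind and basket and the number of pieces per vegetable are all made up for this example.

```python
PIECES_PER_VEGETABLE = 4  # arbitrary number chosen for the illustration


def cut(vegetable, rotten=False):
    """Map: one vegetable in, zero or more pieces out."""
    if rotten:
        return []  # a rotten vegetable is filtered out and produces no output
    return [vegetable] * PIECES_PER_VEGETABLE


def grind(pieces):
    """Reduce: many pieces in, one chutney out."""
    return {"ingredients": sorted(set(pieces)), "pieces_used": len(pieces)}


# Each entry is (vegetable name, is it rotten?).
basket = [
    ("mint", False),
    ("onion", False),
    ("onion", True),  # this one gets thrown away by the map step
    ("tomato", False),
    ("chili", False),
    ("garlic", False),
]

# Map is applied to each vegetable individually.
all_pieces = []
for name, rotten in basket:
    all_pieces.extend(cut(name, rotten))

# Reduce takes everything the maps produced and gives a single output.
mixed_chutney = grind(all_pieces)
print(mixed_chutney)
```

The rotten onion contributes nothing, so the grinder receives the pieces of the five good vegetables only.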

Supriya : So, is this MapReduce?

Shekhar : Yes and No. It is just a part of MapReduce. The power of MapReduce is in distributed computing.

Supriya : Distributed Computing .. What’s that ? Please explain.

Shekhar : Ok..

Shekhar : Suppose that you compete in a Chutney competition and your recipe wins the best Chutney award. After winning the award, the Chutney recipe becomes a hit, so you want to start selling your own branded Chutney. Let’s assume you need to produce 10000 Chutney bottles every day. What will you do?

Supriya : I will find a vendor who can provide me with ingredients in bulk.

Shekhar : Yes, that’s correct. Will you be able to do this process alone, i.e. cutting the ingredients into pieces? Will a single grinder work now? Also, we now need to support different types of chutney, like only onion, only green chilies, only tomato, etc.

Supriya : No. I will have to hire more workers who will cut the vegetables. I will also buy more grinders so that I can produce Chutneys faster.

Shekhar : Correct. So you have to distribute the work now. You will need multiple people cutting the ingredients into pieces in parallel. Each person has to process a bag full of ingredients, and each person corresponds to a single map. Each person iterates over the bag and processes one ingredient at a time, i.e. cuts it into pieces. This goes on until the bag is empty.

After all the workers have finished, you will have pieces of onion, tomato, garlic, etc. at all the workplaces (wherever each person is doing his/her work).

Supriya : But how will I create different types of Chutneys?

Shekhar : Now you will see the missing phase of MapReduce: the Shuffle phase. MapReduce groups all the outputs written by every Map based on the key. This is done automatically for you. You can think of the key as just the name of an ingredient, like Onion. So all the pieces with the onion key will be grouped together and transferred to a grinder that processes only onions, and you will get onion Chutney. Similarly, all the tomato pieces will be transferred to the grinder marked for tomato and will produce tomato Chutney.

Finally the pizza arrived, and she nodded her head, saying that she understood MapReduce. I just hope that the next time she hears about MapReduce she can better understand what I am doing.
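
The factory version of the story, with several workers, a shuffle by ingredient name and one grinder per ingredient, can be sketched in a few lines of plain Python. This is a single-process imitation of the three phases under my own made-up names (map_bag, shuffle, grind, bags); it is not Hadoop code, and on a real cluster the map tasks would run in parallel on different machines.

```python
from collections import defaultdict

PIECES_PER_INGREDIENT = 2  # arbitrary number chosen for the illustration


def map_bag(bag):
    """One worker (one map): cut every ingredient in the bag into (key, piece) pairs."""
    for ingredient in bag:
        for _ in range(PIECES_PER_INGREDIENT):
            yield ingredient, f"{ingredient}-piece"


def shuffle(mapped_outputs):
    """Group the pieces written by all the workers by key (the ingredient name)."""
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, piece in output:
            groups[key].append(piece)
    return dict(groups)


def grind(key, pieces):
    """One grinder (one reduce) per key: all pieces of an ingredient become one chutney."""
    return f"{key} chutney ({len(pieces)} pieces)"


bags = [
    ["onion", "tomato", "onion"],   # bag for worker 1
    ["tomato", "garlic"],           # bag for worker 2
    ["onion", "garlic", "garlic"],  # bag for worker 3
]

mapped = [list(map_bag(bag)) for bag in bags]  # map phase, one call per worker
grouped = shuffle(mapped)                      # shuffle phase, grouping by key
chutneys = {key: grind(key, pieces) for key, pieces in sorted(grouped.items())}  # reduce phase

for label in chutneys.values():
    print(label)
```

Note that no worker ever decides which grinder its pieces go to. The grouping by key happens between the two phases, which is the part MapReduce does for you.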

Hadoop Maven Archetype

Today I found an easy way to generate a Maven-based Hadoop project: a Maven archetype. It generates a sample Hadoop project that uses Hadoop version 0.20.2. The sample project also contains the famous WordCount example. To generate the Maven project, type the following on the command line:

mvn archetype:generate -DarchetypeCatalog=http://dev.mafr.de/repos/maven2/ -DgroupId=com.hadoop.example -DartifactId=hadoop-example

You can also get more information about the archetype at 
http://blog.mafr.de/2010/08/01/maven-archetype-hadoop/
