Sunday, February 24, 2013

Hadoop Installation

Purpose

This document describes how to install, configure and manage non-trivial Hadoop clusters ranging from a few nodes to extremely large clusters with thousands of nodes.
Prerequisites

1.      Make sure all required software is installed on all nodes in your cluster (i.e. JDK 1.6 or later).

2.      Download the Hadoop software.


Installation Process


1.      Installing JAVA

a.      Unzip the jdk tar

tar -xvf jdk-7u2-linux-x64.tar.gz



b.      Move the extracted JDK to /usr/lib/jvm (adjust the directory names below to the JDK version you actually downloaded; the rest of this guide uses jdk1.7.0_34 as the install path)

sudo mkdir -p /usr/lib/jvm/jdk1.7.0_34/

sudo mv jdk1.7.0_02/* /usr/lib/jvm/jdk1.7.0_34/



c.       Update JAVA alternatives

sudo update-alternatives --install "/usr/bin/java" "java" "/usr/lib/jvm/jdk1.7.0_34/bin/java" 1

sudo update-alternatives --install "/usr/bin/javac" "javac" "/usr/lib/jvm/jdk1.7.0_34/bin/javac" 1

sudo update-alternatives --install "/usr/bin/javaws" "javaws" "/usr/lib/jvm/jdk1.7.0_34/bin/javaws" 1
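
To confirm that the JDK is wired up correctly, you can check which java binary is active and what version it reports (a quick sanity check; the exact output depends on the JDK build you installed):

java -version

update-alternatives --display java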



2.      Installing Hadoop

a.      Unzip the Hadoop tar inside /home/<user>/hadoop

tar xzf hadoop-1.0.4.tar.gz
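
The tar command above assumes you are already inside /home/<user>/hadoop; if that directory does not exist yet, create it and change into it first (substituting your own username for <user>):

mkdir -p /home/<user>/hadoop

cd /home/<user>/hadoop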



3.      Set the JAVA_HOME and HADOOP_COMMON_HOME environment variables

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34

export HADOOP_COMMON_HOME="/home/<user>/hadoop/hadoop-1.0.4"

export PATH=$HADOOP_COMMON_HOME/bin/:$PATH
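
These exports only last for the current shell session. One common approach (an assumption about your setup, not a requirement of Hadoop) is to append the same lines to ~/.bashrc so they survive new logins, and then check that the hadoop command resolves:

echo 'export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34' >> ~/.bashrc

echo 'export HADOOP_COMMON_HOME=/home/<user>/hadoop/hadoop-1.0.4' >> ~/.bashrc

echo 'export PATH=$HADOOP_COMMON_HOME/bin:$PATH' >> ~/.bashrc

source ~/.bashrc

hadoop version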






Configure Hadoop

1.      conf/hadoop-env.sh

Add or change these lines to specify JAVA_HOME and the directory in which to store the logs:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_34

export HADOOP_LOG_DIR=/home/<user>/hadoop/hadoop-1.0.4/hadoop_logs
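
If the log directory does not exist yet, it is safest to create it yourself so the daemons have somewhere to write their logs:

mkdir -p /home/<user>/hadoop/hadoop-1.0.4/hadoop_logs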




2.      conf/core-site.xml

Here the NameNode runs on 172.16.16.60

<configuration>

                <property>

                                <name>fs.default.name</name>

                                <value>hdfs://172.16.16.60:9000</value>

                </property>

</configuration>

3.      conf/hdfs-site.xml

<configuration>

                <property>

                                <name>dfs.replication</name>

                                <value>3</value>

                </property>

                <property>

                                <name>dfs.name.dir</name>

                                <value>/lhome/hadoop/data/dfs/name/</value>

                </property>

                <property>

                                <name>dfs.data.dir</name>

                                <value>/lhome/hadoop/data/dfs/data/</value>

                </property>

</configuration>

dfs.replication is the number of replicas of each block. dfs.name.dir is the path on the local filesystem where the NameNode stores the namespace and transaction logs persistently. dfs.data.dir is a comma-separated list of paths on the local filesystem of a DataNode where it stores its blocks.
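
Depending on permissions, Hadoop may not be able to create these local directories on its own, so it is a good idea to create them up front on every node, using the paths configured above, and to make sure they are owned by the user that runs the Hadoop daemons:

mkdir -p /lhome/hadoop/data/dfs/name /lhome/hadoop/data/dfs/data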


4.      conf/mapred-site.xml

Here the JobTracker runs on 172.16.16.60

<configuration>

                <property>

                                <name>mapred.job.tracker</name>

                                <value>172.16.16.60:9001</value>

                </property>

                <property>

                                <name>mapred.system.dir</name>

                                <value>/hadoop/data/mapred/system/</value>

                </property>

                <property>

                                <name>mapred.local.dir</name>

                                <value>/lhome/hadoop/data/mapred/local/</value>

                </property>

</configuration>

mapred.job.tracker is the host (or IP) and port of the JobTracker. mapred.system.dir is the path on HDFS where the Map/Reduce framework stores system files. mapred.local.dir is a comma-separated list of paths on the local filesystem where temporary Map/Reduce data is written.
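
mapred.system.dir lives on HDFS and is created by the framework, but mapred.local.dir is a local path on every TaskTracker node, so you may want to pre-create it there as well:

mkdir -p /lhome/hadoop/data/mapred/local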


5.      conf/masters

Delete localhost and add the hostname or IP address of the master node, one entry per line (in Hadoop 1.x this file determines where the SecondaryNameNode daemon is started).

For Example:

<IP of Namenode>

6.      conf/slaves

Delete localhost and add the hostnames or IP addresses of all the slave nodes (the DataNodes/TaskTrackers), one per line.

For Example:

<IP of Slave 1>

<IP of Slave 2>

….

….

<IP of Slave N>
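
As a concrete illustration, on a four-node cluster the two files might look like this (the slave addresses are hypothetical; only 172.16.16.60 is taken from the earlier examples):

conf/masters:

172.16.16.60

conf/slaves:

172.16.16.61

172.16.16.62

172.16.16.63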


7.      Configuring SSH

In fully-distributed mode we have to start daemons on several machines, and to do that we need SSH installed. Hadoop itself does not run jobs over SSH; the control scripts merely start daemons on the set of hosts in the cluster (defined by the slaves file) by SSH-ing to each host and starting a daemon process. So we need to make sure that we can SSH to localhost, and to every node, and log in without having to enter a password.

       

        $ sudo apt-get install ssh



Then to enable password-less login, generate a new SSH key with an empty passphrase:



        $ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa



        $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys



Test this with:



         $ ssh localhost



You should be logged in without having to type a password.



Copy the master node's public key (the contents of ~/.ssh/id_rsa.pub) into the ~/.ssh/authorized_keys file on each slave node as well, so that the master node has password-less access to every other node in the cluster.
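
One convenient way to do this is ssh-copy-id, which appends the master's ~/.ssh/id_rsa.pub to the remote authorized_keys file (the slave address is a placeholder; repeat for every slave node):

$ ssh-copy-id <user>@<IP of Slave 1>

$ ssh <IP of Slave 1>

The second command should log you in without being asked for a password.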


8.      Duplicate Hadoop configuration files to all nodes



Copy the configuration files under the conf directory to every node in the cluster, so that all nodes run with the same configuration. By now we have finished installing and configuring Hadoop, so let's have some fun with it.
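
For example, you could push the conf directory from the master to every slave with scp (the slave address is a placeholder; repeat for each node, or script the loop over your slaves file):

$ scp -r /home/<user>/hadoop/hadoop-1.0.4/conf <user>@<IP of Slave 1>:/home/<user>/hadoop/hadoop-1.0.4/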


Hadoop Startup

To start a Hadoop cluster you will need to start both the HDFS and the Map/Reduce daemons. First, format a new distributed filesystem by running the following on the NameNode:



$ bin/hadoop namenode -format



Start the HDFS with the following command, run on the designated NameNode:                                            



$ bin/start-dfs.sh



The bin/start-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and starts the DataNode daemon on all the listed slaves.



Start Map-Reduce with the following command, run on the designated JobTracker:      



$ bin/start-mapred.sh



The bin/start-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and starts the TaskTracker daemon on all the listed slaves.
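
A quick way to verify that everything came up is the jps tool that ships with the JDK. Run it on each node; on the master you would typically expect to see NameNode, SecondaryNameNode and JobTracker, and on each slave a DataNode and a TaskTracker (the exact set depends on how you laid out your cluster):

$ jps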


Hadoop Shutdown

Stop HDFS with the following command, run on the designated NameNode:



$ bin/stop-dfs.sh



The bin/stop-dfs.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the NameNode and stops the DataNode daemon on all the listed slaves.



Stop Map/Reduce with the following command, run on the designated JobTracker:


$ bin/stop-mapred.sh



The bin/stop-mapred.sh script also consults the ${HADOOP_CONF_DIR}/slaves file on the JobTracker and stops the TaskTracker daemon on all the listed slaves.
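
To confirm a clean shutdown, you can run jps once more on the master and on a slave; no NameNode, SecondaryNameNode, DataNode, JobTracker or TaskTracker processes should remain:

$ jps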