Anvil

Installing Hadoop on AWS

This post details how to install Apache Hadoop on a small cluster using Amazon Web Services. Such a cluster is intended solely for small-scale development work: following the steps below for anything but a few nodes would be highly inefficient.

The server nodes will use RHEL: adjust the steps below as required for SLES or Ubuntu. All client commands are for macOS, so you will need something like PuTTY if trying this from Windows.

Launch Nodes

Log in to the AWS Management Console via https://aws.amazon.com. You will need to create an account first if you do not have one.

Note the AWS region displayed on the top right-hand side of the page. Consider whether it is suitable for your purposes and change it if it is not. The machines and settings will be tied to a particular location such as Dublin or London.

Towards the top on the left-hand side of the page, click on Compute and then EC2.

screen-shot

You will now see the EC2 Dashboard. Click on the Launch Instance button. This starts a seven-step launch process.

screen-shot

  1. Choose an Amazon Machine Image: RHEL chosen in this case.
  2. Choose Instance Type: this will depend on your needs and budget.
  3. Configure Instance: enter 4 for the number of instances for this exercise.
  4. Add Storage: as for step 2 this is a matter of choice.
  5. Tag Instance: this will be changed later, so it does not matter for now.
  6. Configure Security Group: create a new group with a rule for SSH on port 22 plus rules for All TCP and All ICMP (allow all sources for the moment and restrict them later)
    screen-shot
  7. Review: create a new key pair file and save it to (for example) /Users/username/.ssh (do NOT lose this!) screen-shot
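
If you prefer to script provisioning, roughly the same launch can be performed with the AWS CLI, provided it is installed and configured. The sketch below is purely illustrative: it covers only the AMI, instance count, instance type, key pair and security group (steps 1 to 3, 6 and 7), and every ID and name in it is a placeholder.

# Placeholders throughout: substitute your own AMI, instance type, key pair and security group
aws ec2 run-instances \
    --image-id ami-xxxxxxxx \
    --count 4 \
    --instance-type t2.micro \
    --key-name my_aws_key \
    --security-group-ids sg-xxxxxxxx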

Initialise Cluster Settings

Open the EC2 Dashboard page of the AWS console. Click on Instances (under INSTANCES) in the menu bar on the left.

On the Instances page, rename the nodes using the Tags tab for each instance. This makes it much easier to select the correct DNS addresses.

screen-shot

A major way of saving money with AWS is to shut machines down when not using them (note that this is not the same as the terminate command, which does exactly what its name suggests). A drawback is that the machines’ external IP addresses and DNS names will be lost and new ones assigned on the next start-up. Fixed external addresses have to be purchased separately. These incur a charge even when the servers have been shut down, but at a fraction of the cost of keeping them running.

To get some fixed external IP addresses, select Elastic IPs (under NETWORK & SECURITY). Allocate four new ones and assign one to each of the nodes. You might as well make a list of all the nodes’ public and private DNS names at this point; these can be found on the Instances page.

screen-shot
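
The allocation can also be done from the AWS CLI if you have it set up; again just a sketch with placeholder IDs, repeated for each of the four instances:

# Allocate a new Elastic IP in the VPC
aws ec2 allocate-address --domain vpc
# Associate it with an instance, using the AllocationId returned above and the instance's own ID
aws ec2 associate-address --allocation-id eipalloc-xxxxxxxx --instance-id i-xxxxxxxxxxxxxxxxx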

Establish Client Connections

In Terminal, change the downloaded .pem key file’s permissions to owner read+write only:

chmod 600 ~/.ssh/my_aws_file.pem

Connect to the name node using its public IP or DNS name (the username is ubuntu if using Ubuntu):

ssh -i ~/.ssh/my_aws_file.pem ec2-user@namenode_public_ip

Type "yes" when prompted. One can exit if successful.

Create/edit the ssh config file with entries for all nodes:

nano ~/.ssh/config

The HostName entries are taken from the EC2 Dashboard (the public DNS names), and the IdentityFile refers to the key file downloaded earlier. The User and IdentityFile entries can instead be entered just once (under Host *) if no other hosts are listed in this file (see the sketch after this example).

Host namenode
HostName namenode_public_dns
User ec2-user
IdentityFile ~/.ssh/my_aws_file.pem
Host datanode01
HostName datanode1_public_dns
User ec2-user
IdentityFile ~/.ssh/my_aws_file.pem
Host datanode02
HostName datanode2_public_dns
User ec2-user
IdentityFile ~/.ssh/my_aws_file.pem
Host datanode03
HostName datanode3_public_dns
User ec2-user
IdentityFile ~/.ssh/my_aws_file.pem
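
As noted above, if no other hosts live in this file the shared User and IdentityFile lines can instead be collapsed under a single Host * block; the equivalent configuration looks like this:

Host *
User ec2-user
IdentityFile ~/.ssh/my_aws_file.pem
Host namenode
HostName namenode_public_dns
Host datanode01
HostName datanode1_public_dns
Host datanode02
HostName datanode2_public_dns
Host datanode03
HostName datanode3_public_dns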

Test that all the nodes work with the following, where host is one of the hosts defined in ~/.ssh/config:

ssh host

Copy files from your client to the name node:

scp ~/.ssh/my_aws_file.pem ~/.ssh/config namenode:~/.ssh

Configure SSH

Connect to the name node:

ssh namenode

Create private/public rsa key pair with empty password:

ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""

Copy the public key to authorized_keys:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Copy the public key to each data node’s authorized_keys to enable password-less SSH from the name node to any data node:

cat ~/.ssh/id_rsa.pub | ssh datanode01 'cat >> ~/.ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh datanode02 'cat >> ~/.ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh datanode03 'cat >> ~/.ssh/authorized_keys'

Test connections to all nodes including the name node with:

ssh ec2-user@public_dns_for_node

Install and Configure Hadoop

All nodes

Open up terminal windows for all the data nodes (ssh datanodeXX) in addition to the name node, and enter the following on each.

Update the installed packages using RHEL's package manager:

sudo yum update

Check the Java version:

java -version

Install the desired version of Java if not found:

sudo yum install java-1.7.0-openjdk-devel

Download and extract the Hadoop files (wget is not included in RHEL):

sudo yum install wget
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz -P ~/downloads
sudo tar zxvf ~/downloads/hadoop-* -C /usr/local
sudo mv /usr/local/hadoop-* /usr/local/hadoop

Using nano is much easier than vim, so install it:

sudo yum install nano

Find the Java home directory (the correct path omits the /jre/bin/java at the end):

readlink -f $(which java)
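
If you would rather not copy that path by hand, the home directory can be derived from the same output; a small one-liner that simply strips the trailing /jre/bin/java:

readlink -f $(which java) | sed 's|/jre/bin/java$||'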

Set up environment variables:

nano ~/.bash_profile
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.121-2.6.8.0.el7_3.x86_64
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

Exit and log back into each node. Test a variable, for example:

echo $JAVA_HOME
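
Alternatively, the new variables can be picked up in the current shell without logging out, assuming the default bash login shell:

source ~/.bash_profile
echo $HADOOP_HOME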

Change Hadoop variables and configuration files on all nodes:

sudo nano $HADOOP_CONF_DIR/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.121-2.6.8.0.el7_3.x86_64
sudo nano $HADOOP_CONF_DIR/core-site.xml
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode_public_dns:9000</value>
</property>
sudo nano $HADOOP_CONF_DIR/yarn-site.xml
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>namenode_public_dns</value>
</property>
sudo cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
sudo nano $HADOOP_CONF_DIR/mapred-site.xml
<property>
    <name>mapreduce.jobtracker.address</name>
    <value>namenode_public_dns:54311</value>
</property>
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
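
Rather than repeating these edits by hand on every node, one option is to make them on the name node only and push the files out with scp from there. The loop below is just a sketch: it relies on the passwordless SSH set up earlier and stages the files via each data node's home directory, since /usr/local/hadoop is still owned by root at this point.

for node in datanode01 datanode02 datanode03; do
    # Copy the edited files to the data node's home directory first
    scp $HADOOP_CONF_DIR/hadoop-env.sh $HADOOP_CONF_DIR/core-site.xml $HADOOP_CONF_DIR/yarn-site.xml $HADOOP_CONF_DIR/mapred-site.xml $node:~/
    # Then move them into place as root (-t allocates a terminal for sudo)
    ssh -t $node "sudo mv ~/hadoop-env.sh ~/core-site.xml ~/yarn-site.xml ~/mapred-site.xml $HADOOP_CONF_DIR/"
done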

Name node only

Map the public DNS names to hostnames. The latter are based on the private IPs and correspond to the ip-123-123-123-123 prefixes of the internal AWS DNS names. Check each node’s hostname using:

echo $(hostname)
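
Running that on each node in turn works, but because the ssh config was copied to the name node earlier, the hostnames can also be collected in one pass from there; a small sketch:

for node in namenode datanode01 datanode02 datanode03; do
    echo "$node: $(ssh $node hostname)"
done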

Having obtained the host names, update the name node hosts file:

sudo nano /etc/hosts
namenode_public_dns namenode_hostname
datanode1_public_dns datanode1_hostname
datanode2_public_dns datanode2_hostname
datanode3_public_dns datanode3_hostname

Create the masters file and add the name node’s hostname to it:

sudo nano $HADOOP_CONF_DIR/masters
namenode_hostname

Add data node hostnames to the slaves file:

sudo nano $HADOOP_CONF_DIR/slaves
datanode1_hostname
datanode2_hostname
datanode3_hostname

Complete configuration for the name node:

sudo nano $HADOOP_CONF_DIR/hdfs-site.xml
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>

Create the name node directory:

sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode

Change hadoop home directory ownership to the current login:

sudo chown -R ec2-user $HADOOP_HOME

Data nodes only

Complete the configuration on each data node, mirroring the name node set-up:

sudo nano $HADOOP_CONF_DIR/hdfs-site.xml
<property>
    <name>dfs.replication</name>
    <value>3</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>

Create the data node directory and change the Hadoop home directory ownership to the current login, as before:

sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode
sudo chown -R ec2-user $HADOOP_HOME

Start Hadoop

From the name node, format HDFS before starting Hadoop for the first time. Do not run this again later: repeating it will delete all data held in HDFS.

hdfs namenode -format

Start the Hadoop cluster (still on the name node). Type yes and press Enter when prompted to confirm each host key; do the same if nothing seems to be happening, as the prompt is sometimes hidden.

$HADOOP_HOME/sbin/start-dfs.sh

If no error messages are displayed, start YARN and the MapReduce job history server:

$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver

Check that everything is running from a browser on your client:

http://namenode_public_dns:50070
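
The web interface is the easiest check, but the cluster can also be verified from the name node’s shell. Roughly speaking, jps should list the NameNode, SecondaryNameNode, ResourceManager and JobHistoryServer processes there (and DataNode plus NodeManager when run on a data node), and the HDFS report should show three live data nodes:

# Java daemons running on this node (run the same in each data node window)
jps
# HDFS summary, including the number of live data nodes
hdfs dfsadmin -report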