Installing Hadoop on AWS
This post details how to install Apache Hadoop on a small cluster using Amazon Web Services. Such a cluster is intended solely for small-scale development work: following the steps below for anything more than a few nodes would be highly inefficient.
The server nodes will use RHEL: adjust the steps below as required for SLES or Ubuntu. All client commands are for macOS, so you will need something like PuTTY if trying this from Windows.
Log in to the AWS Management Console via https://aws.amazon.com. You will need to create an account first if you do not have one.
Note the AWS region displayed on the top right-hand side of the page. Consider whether it is suitable for your purposes, and change it if not. The machines and settings will be tied to a particular location such as Dublin or London.
Towards the top on the left-hand side of the page, click on Compute and then EC2.
You will now see the EC2 Dashboard. Click on the Launch Instance button. This starts a seven-step launch process.
- Choose an Amazon Machine Image: RHEL chosen in this case.
- Choose Instance Type: this will depend on your needs and budget.
- Configure Instance: enter 4 for the number of instances for this exercise.
- Add Storage: as for step 2 this is a matter of choice.
- Tag Instance: this will be changed later, so does not matter.
- Configure Security Group: create a new group with a rule for SSH on port 22 plus rules for All TCP and All ICMP (allow all sources for the moment and restrict later).
- Review: create a new key pair file and save it to (for example) /Users/username/.ssh (do NOT lose this!)
Initialise Cluster Settings
Open the EC2 Dashboard page of the AWS console. Click on Instances (under INSTANCES) in the menu bar on the left.
On the Instances page, rename the nodes using the Tags tab for each instance. This makes it much easier to select the correct DNS addresses.
A major way of saving money with AWS is to shut machines down when not using them (note that this is not the same as the terminate command, which does exactly what its name suggests). A drawback is that all of the machines’ external IP addresses and DNS names will be lost and new ones assigned on the next start-up. Fixed external addresses have to be purchased separately. These incur a charge even when the servers have been shut down, but at a fraction of the cost of keeping the latter running.
To get some fixed external IP addresses, select Elastic IPs (under NETWORK & SECURITY). Allocate four new ones and assign one to each of the nodes. You might as well make a list of all the nodes’ public and private DNS names at this point; these can be found on the Instances page.
Establish Client Connections
In Terminal, change the downloaded pem key file permissions to owner read+write only:
chmod 600 ~/.ssh/my_aws_file.pem
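The 600 mode matters because ssh refuses to use a private key that other users can read. A quick local check of the idea, using a hypothetical stand-in file rather than your real key (stat -c is the GNU/Linux form; macOS uses stat -f):

```shell
# Create a stand-in for the downloaded key file (hypothetical path).
touch /tmp/demo_key.pem
# Restrict it to owner read+write only, as done for the real pem file.
chmod 600 /tmp/demo_key.pem
# Show the resulting octal mode (GNU stat).
stat -c '%a' /tmp/demo_key.pem    # prints 600
```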
Connect to the name node using its public IP or DNS (the username is ubuntu if using Ubuntu):
ssh -i ~/.ssh/my_aws_file.pem ec2-user@namenode_public_ip
Type "yes" when prompted. You can exit once the connection succeeds.
Create/edit the ssh config file with entries for all nodes:
The Host and HostName entries are taken from the EC2 Dashboard. The IdentityFile refers to the key file downloaded earlier. The User and IdentityFile entries need only be entered once (under Host *) if no other hosts are listed in this file.
Host namenode
    HostName namenode_public_dns
    User ec2-user
    IdentityFile ~/.ssh/my_aws_file.pem
Host datanode01
    HostName datanode1_public_dns
    User ec2-user
    IdentityFile ~/.ssh/my_aws_file.pem
Host datanode02
    HostName datanode2_public_dns
    User ec2-user
    IdentityFile ~/.ssh/my_aws_file.pem
Host datanode03
    HostName datanode3_public_dns
    User ec2-user
    IdentityFile ~/.ssh/my_aws_file.pem
Test that all nodes work, where host refers in turn to each of the hosts in the ~/.ssh/config file:
ssh host
Copy files from your client to the name node:
scp ~/.ssh/my_aws_file.pem ~/.ssh/config namenode:~/.ssh
Connect to the name node:
ssh namenode
Create private/public rsa key pair with empty password:
ssh-keygen -f ~/.ssh/id_rsa -t rsa -P ""
Copy the public key to authorized_keys:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Copy the public key to each data node’s authorized_keys to enable password-less SSH from the NameNode to any DataNode:
cat ~/.ssh/id_rsa.pub | ssh datanode01 'cat >> ~/.ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh datanode02 'cat >> ~/.ssh/authorized_keys'
cat ~/.ssh/id_rsa.pub | ssh datanode03 'cat >> ~/.ssh/authorized_keys'
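The cat … | ssh … 'cat >> …' idiom simply streams the local public key over SSH and appends it to the remote file. A purely local illustration of the same append semantics, with hypothetical stand-in paths instead of real nodes:

```shell
# Stand-ins for ~/.ssh/id_rsa.pub and a remote node's authorized_keys file.
mkdir -p /tmp/fake_remote
echo "ssh-rsa AAAAB3... demo-key" > /tmp/demo_id_rsa.pub
# sh -c plays the role of the remote shell that ssh would start on the data node.
cat /tmp/demo_id_rsa.pub | sh -c 'cat >> /tmp/fake_remote/authorized_keys'
# The key should now be present in the "remote" file.
grep 'demo-key' /tmp/fake_remote/authorized_keys
```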
Test connections to all nodes, including the name node itself, by repeating the following for each host alias in the config file:
ssh host
Install and Configure Hadoop
Open up windows for all data nodes (ssh datanodeXX) in addition to the namenode and enter the following.
Update the installed packages using RHEL's package manager:
sudo yum update
Check the Java version:
java -version
Install the desired version of Java if not found:
sudo yum install java-1.7.0-openjdk-devel
Download and extract the Hadoop files (wget is not included in RHEL):
sudo yum install wget
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz -P ~/downloads
sudo tar zxvf ~/downloads/hadoop-* -C /usr/local
sudo mv /usr/local/hadoop-* /usr/local/hadoop
Using nano is much easier than vim, so install it:
sudo yum install nano
Find the Java home directory (the correct path omits the /jre/bin/java at the end):
readlink -f $(which java)
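Since the correct JAVA_HOME omits the trailing /jre/bin/java, that suffix can be stripped from the readlink output with shell parameter expansion. A sketch using a sample path (substitute the real readlink output from your node):

```shell
# Sample readlink output; replace with the real value from your node.
java_path=/usr/lib/jvm/java-1.7.0-openjdk/jre/bin/java
# Strip the trailing /jre/bin/java to obtain the JAVA_HOME value.
java_home=${java_path%/jre/bin/java}
echo "$java_home"    # prints /usr/lib/jvm/java-1.7.0-openjdk
```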
Set up the environment variables by adding the following to ~/.bash_profile on each node:
export JAVA_HOME=java_home_path
export PATH=$PATH:$JAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
Exit and log back into each node. Test a variable, for example:
echo $HADOOP_HOME
Change Hadoop variables and configuration files on all nodes:
sudo nano $HADOOP_CONF_DIR/hadoop-env.sh
Replace the export JAVA_HOME=${JAVA_HOME} line with the explicit Java home path found above.
sudo nano $HADOOP_CONF_DIR/core-site.xml
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode_public_dns:9000</value>
</property>
sudo nano $HADOOP_CONF_DIR/yarn-site.xml
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>namenode_public_dns</value>
</property>
sudo cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
sudo nano $HADOOP_CONF_DIR/mapred-site.xml
<property>
  <name>mapreduce.jobtracker.address</name>
  <value>namenode_public_dns:54311</value>
</property>
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
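All of the XML edits above share the same property shape, so a small shell helper can generate blocks ready to paste into the files. The prop function here is a made-up convenience, not part of Hadoop:

```shell
# Hypothetical helper: print a Hadoop <property> block for a name/value pair.
prop() {
  printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# Example: the mapred-site.xml framework setting from the tutorial.
prop mapreduce.framework.name yarn
```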
Name node only
Map the public DNS names to hostnames. The latter are based on the private IPs and correspond to the ip-123-123-123-123 prefixes of the internal AWS DNS names. Check on each node using:
hostname
Having obtained the host names, update the name node hosts file:
sudo nano /etc/hosts
namenode_public_dns namenode_hostname
datanode1_public_dns datanode1_hostname
datanode2_public_dns datanode2_hostname
datanode3_public_dns datanode3_hostname
Create the masters file and add the name node’s hostname to it:
sudo nano $HADOOP_CONF_DIR/masters
namenode_hostname
Add data node hostnames to the slaves file:
sudo nano $HADOOP_CONF_DIR/slaves
datanode1_hostname
datanode2_hostname
datanode3_hostname
Complete configuration for the name node:
sudo nano $HADOOP_CONF_DIR/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>
Create the name node directory:
sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/namenode
Change hadoop home directory ownership to the current login:
sudo chown -R ec2-user $HADOOP_HOME
Data nodes only
sudo nano $HADOOP_CONF_DIR/hdfs-site.xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>
sudo mkdir -p $HADOOP_HOME/hadoop_data/hdfs/datanode
sudo chown -R ec2-user $HADOOP_HOME
From the name node, format HDFS before starting Hadoop for the first time. Do not repeat this command later: reformatting deletes all data held in HDFS.
hdfs namenode -format
Start the Hadoop cluster (still on the name node):
$HADOOP_HOME/sbin/start-dfs.sh
Type yes and press Enter when prompted; do this even if no prompt appears and nothing seems to be happening.
If no error messages are displayed, start YARN and the MapReduce server:
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
Check that everything is running from a browser on your client:
http://namenode_public_dns:50070 (HDFS NameNode UI)
http://namenode_public_dns:8088 (YARN ResourceManager UI)