Hadoop Cluster Installation
Prerequisites
- At least two Ubuntu servers
- One Ubuntu server for the master node
- At least one Ubuntu server for the slave node (potentially multiple)
Hostnames
Set a meaningful hostname for each VM. I try to align mine with the server ID provided by Proxmox.
sudo hostnamectl set-hostname ubuntu100
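The name here is just an example; the later sections of this guide use ubuntu103 for the master and ubuntu104/ubuntu105 for the workers. On those VMs you would run the equivalent command on each machine, for example:
sudo hostnamectl set-hostname ubuntu103   # master
sudo hostnamectl set-hostname ubuntu104   # first worker
sudo hostnamectl set-hostname ubuntu105   # second worker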
Static IP
Set a static IP address for each VM. First, identify the existing Ethernet adapter.
sudo cat /etc/netplan/00-installer-config.yaml
Save the netplan config file as a backup:
sudo cp /etc/netplan/00-installer-config.yaml /etc/netplan/00-installer-config.yaml.bck
Remove the original netplan config file:
sudo rm /etc/netplan/00-installer-config.yaml
Create a new netplan static file:
sudo nano /etc/netplan/static.yaml
Set your static IP address:
network:
  version: 2
  renderer: networkd
  ethernets:
    ens18:
      addresses:
        - 192.168.1.100/24
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
Apply the netplan configuration so your static IP address takes effect:
sudo netplan apply
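To confirm the new address took effect, check the adapter (assuming it’s named ens18 as above):
ip addr show ens18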
Update hosts file
Navigate to the hosts file and open it with nano:
sudo nano /etc/hosts
Comment out the loopback entry for the hostname. Instead, we’ll map each private IP address to its respective hostname. For example, my file looks like this on all three VMs in the cluster:
127.0.0.1 localhost
#127.0.1.1 ubuntu103
192.168.1.103 ubuntu103
192.168.1.104 ubuntu104
192.168.1.105 ubuntu105
Save and Exit:
<ctrl+x>
y
<enter>
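With the hosts file in place, it’s worth confirming the nodes can reach each other by name. For example, from the master node:
ping -c 3 ubuntu104
ping -c 3 ubuntu105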
If you haven’t done so already:
sudo apt update
Snapshot
If you’re building this lab in a hypervisor like Proxmox or Hyper-V, now is a good time to take a snapshot of your VMs. In the real world we might not have this luxury, since these nodes will be holding big data.
Install Java on Each Node
Install Java:
sudo apt-get install openjdk-8-jdk
Set JAVA_HOME Environment Variable: After installing Java, you need to set the JAVA_HOME environment variable to point to the Java installation directory. You can find the installation path with:
update-alternatives --config java
Set JAVA_HOME by adding it to your ~/.bashrc:
sudo nano ~/.bashrc
Add the following lines to the bashrc file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
Now apply that change by running:
source ~/.bashrc
Show the location:
echo $JAVA_HOME
Confirm the version of java:
java -version
Hadoop User
On each node, create a hadoop user (and password):
sudo adduser hadoop
Switch to the new hadoop user on each node:
sudo su - hadoop
You may need to add the hadoop user to the sudoers on each machine. First, switch back to your main user:
su andrew
Update the sudoers file:
sudo nano /etc/sudoers
Add the following line:
hadoop ALL=(ALL) ALL
Save and Exit:
<ctrl+x>
y
<enter>
Now switch back:
sudo su - hadoop
Passwordless Authentication
On the master node only, create an ssh key:
ssh-keygen -t rsa -b 4096
In the future, we’ll want the master node to be able to execute tasks on itself and on the slave nodes, and it will need to do so without a password. While logged into the master node, run the following commands against the master node and against each worker node:
ssh-copy-id hadoop@ubuntu103
ssh-copy-id hadoop@ubuntu104
ssh-copy-id hadoop@ubuntu105
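To confirm passwordless authentication works, you should now be able to SSH from the master to each node without a password prompt. For example:
ssh hadoop@ubuntu104 hostname
ssh hadoop@ubuntu105 hostname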
Install Hadoop on Each Node
Download a stable copy of Hadoop from the Apache Hadoop downloads page.
Download:
wget https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz
Extract the archive:
tar xvf hadoop-3.3.6.tar.gz
Move and rename:
sudo mv hadoop-3.3.6 hadoop
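This guide assumes the archive was downloaded and extracted in the hadoop user’s home directory, so the install now lives at /home/hadoop/hadoop (the path used in the configuration below). A quick sanity check:
ls /home/hadoop/hadoop/bin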
Hadoop Configuration
Add java to env files
Navigate to the hadoop folder containing the env files:
cd /home/hadoop/hadoop/etc/hadoop
We need to edit:
- hadoop-env.sh
- mapred-env.sh
- yarn-env.sh
On each node, edit each of these files and add JAVA_HOME:
sudo nano hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
sudo nano mapred-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
sudo nano yarn-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
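If you’d rather not open each file by hand, a minimal sketch that appends the same line to all three env files (assuming you’re still in /home/hadoop/hadoop/etc/hadoop) is:
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
  # append the JAVA_HOME export to each env file
  echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> "$f"
done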
Update Daemons
We need to update the following xml site files:
- core-site.xml
- yarn-site.xml
- hdfs-site.xml
- mapred-site.xml
core-site.xml
sudo nano core-site.xml
- Make sure the following configuration is present on all nodes
- Make sure to point the IP address of hdfs at your master node
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.1.103:8020</value>
  </property>
</configuration>
yarn-site.xml
sudo nano yarn-site.xml
- Make sure the following configuration is present on all nodes
- Make sure to point the IP address to your master node
<configuration>
  <property>
    <description>The hostname of the RM</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.103</value>
  </property>
  <property>
    <description>The address of the applications manager interface in the RM</description>
    <name>yarn.resourcemanager.address</name>
    <value>192.168.1.103:8032</value>
  </property>
</configuration>
hdfs-site.xml
This file will set the default directories for our name nodes and data nodes.
sudo nano hdfs-site.xml
- Make sure the following configuration is present on all nodes
- Note: the dfs.replication value sets the number of replicas we want for our data. My cluster is small, but if you have a larger cluster, you may want to increase this number to 3.
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hadoop/hadoop-dir/namenode-dir</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hadoop/hadoop-dir/datanode-dir</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
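Hadoop will generally create these directories when the NameNode is formatted and the DataNodes start, but creating them up front on each node avoids permission surprises:
mkdir -p /home/hadoop/hadoop/hadoop-dir/namenode-dir
mkdir -p /home/hadoop/hadoop/hadoop-dir/datanode-dir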
mapred-site.xml
On all nodes:
sudo nano mapred-site.xml
- Make sure the following configuration is present on all nodes
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
workers file
This file tells the master node which nodes to use as workers. You only need to update this file on the master node. If you don’t remove localhost from this file, Hadoop will also treat the master node as a worker node.
sudo nano workers
Add the IP addresses of the workers and remove localhost:
192.168.1.104
192.168.1.105
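Since /etc/hosts already maps these addresses to names, the workers file could equivalently list the hostnames instead:
ubuntu104
ubuntu105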
Add Hadoop to Path
At this point, Hadoop is mostly configured. On each node, open the bashrc file for editing:
sudo nano ~/.bashrc
Add HADOOP_HOME to the path:
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Apply the bashrc file:
source ~/.bashrc
View the path of HADOOP_HOME:
echo $HADOOP_HOME
Confirm the version of Hadoop:
hadoop version
Manage Hadoop Cluster
Java process status
Hadoop, Hive, and Spark all run on Java. That means we can use jps to see which Java processes are running.
jps
At this point, hadoop shouldn’t be running.
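One step not shown above: before the very first start of the cluster, the NameNode has to be formatted. On the master node only, using the namenode directory configured in hdfs-site.xml:
hdfs namenode -format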
start-all.sh
Start all nodes in the hadoop cluster by running:
start-all.sh
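If everything started cleanly, jps should now show the Hadoop daemons. With the workers file above, you’d typically expect NameNode, SecondaryNameNode, and ResourceManager on the master, and DataNode and NodeManager on each worker:
jps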
stop-all.sh
Stop all nodes in the hadoop cluster by running:
stop-all.sh
review jps
You can kill individual processes on each node by running:
kill <id>
Where <id> is simply the ID of the process you want to kill.
Permissions
You may want to experiment with granting permissions to upload and manage files in HDFS.
- grants read/write on the root directory of hdfs (otherwise you’d get a dr.who error)
- overly permissive, but a good start for the lab
- run this on the master node
hdfs dfs -chmod -R 777 /
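As a quick end-to-end check, you can try creating a directory and uploading a small file (test.txt here is just a hypothetical local file):
hdfs dfs -mkdir -p /lab
hdfs dfs -put test.txt /lab/   # test.txt: any small local file
hdfs dfs -ls /lab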
Navigate
Replace the following with the IP address of your master node:
Hadoop NameNode UI: http://192.168.1.103:9870/
YARN ResourceManager UI: http://192.168.1.103:8088/cluster