Hadoop Cluster Installation
Prerequisites
- At least two Ubuntu servers
- One Ubuntu server for the master node
- At least one Ubuntu server for the slave node (potentially multiple)
Hostnames
Set a meaningful hostname for each VM. I try to align mine with the server ID provided by Proxmox.
sudo hostnamectl set-hostname ubuntu100
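The name here is just an example; the later sections of this guide use ubuntu103 for the master and ubuntu104/ubuntu105 for the workers. On those VMs you would run the equivalent command on each machine, for example:
sudo hostnamectl set-hostname ubuntu103   # master
sudo hostnamectl set-hostname ubuntu104   # first worker
sudo hostnamectl set-hostname ubuntu105   # second worker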
Static IP
Set a static IP address for each VM. First, identify the existing Ethernet adapter.
sudo cat /etc/netplan/00-installer-config.yaml
Save the netplan config file as a backup:
sudo cp /etc/netplan/00-installer-config.yaml /etc/netplan/00-installer-config.yaml.bck
Remove the original netplan config file:
sudo rm /etc/netplan/00-installer-config.yaml
Create a new netplan static file:
sudo nano /etc/netplan/static.yaml
Set your static IP address:
network:
  version: 2
  renderer: networkd
  ethernets:
    ens18:
      addresses:
        - 192.168.1.100/24
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
Apply the netplan configuration so your static IP address takes effect:
sudo netplan apply
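To confirm the new address took effect, check the adapter (assuming it’s named ens18 as above):
ip addr show ens18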
Update hosts file
Navigate to the hosts file and open it with nano:
sudo nano /etc/hosts
Comment out the loopback entry for the hostname. Instead, we’ll map each private IP address to its respective hostname. For example, my file looks like this on all three VMs in the cluster:
127.0.0.1 localhost
#127.0.1.1 ubuntu103
192.168.1.103 ubuntu103
192.168.1.104 ubuntu104
192.168.1.105 ubuntu105
Save and Exit:
<ctrl+x>
y
<enter>
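With the hosts file in place, it’s worth confirming the nodes can reach each other by name. For example, from the master node:
ping -c 3 ubuntu104
ping -c 3 ubuntu105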
If you haven’t done so already:
sudo apt update
Snapshot
If you’re building this lab in a hypervisor like Proxmox or Hyper-V, now is a good time to take a snapshot of your VMs. In the real world we might not have this luxury, since these nodes will be holding big data.
Install Java on Each Node
Install Java:
sudo apt-get install openjdk-8-jdk
Set JAVA_HOME Environment Variable: After installing Java, you need to set the JAVA_HOME environment variable to point to the Java installation directory. You can find the installation path with:
update-alternatives --config java
Set JAVA_HOME by adding it to your ~/.bashrc:
sudo nano ~/.bashrc
Add the following lines to the bashrc file:
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin
Now apply that change by running:
source ~/.bashrc
Show the location:
echo $JAVA_HOME
Confirm the version of java:
java -version
Hadoop User
On each node, create a hadoop user (and password):
sudo adduser hadoop
Switch to the new hadoop user on each node:
sudo su - hadoop
You may need to add the hadoop user to the sudoers on each machine. First, switch back to your main user:
su andrew
Update the sudoers file:
sudo nano /etc/sudoers
Add the following line:
hadoop ALL=(ALL) ALL
Save and Exit:
<ctrl+x>
y
<enter>
Now switch back:
sudo su - hadoop
Passwordless Authentication
On the master node only, create an ssh key:
ssh-keygen -t rsa -b 4096
In the future, we’ll want the master node to be able to execute tasks on itself and on the slave nodes, and it will need to do so without a password. While logged into the master node, run the following commands against the master node and against each worker node:
ssh-copy-id hadoop@ubuntu103
ssh-copy-id hadoop@ubuntu104
ssh-copy-id hadoop@ubuntu105
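To confirm passwordless authentication works, you should now be able to SSH from the master to each node without a password prompt. For example:
ssh hadoop@ubuntu104 hostname
ssh hadoop@ubuntu105 hostname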
Install Hadoop on Each Node
Download a stable copy of Hadoop from the Apache Hadoop downloads page.
Download:
wget https://dlcdn.apache.org/hadoop/common/stable/hadoop-3.3.6.tar.gz
Extract the archive:
tar xvf hadoop-3.3.6.tar.gz
Move and rename:
sudo mv hadoop-3.3.6 hadoop
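This guide assumes the archive was downloaded and extracted in the hadoop user’s home directory, so the install now lives at /home/hadoop/hadoop (the path used in the configuration below). A quick sanity check:
ls /home/hadoop/hadoop/bin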
Hadoop Configuration
Add java to env files
Navigate to the hadoop folder containing the env files:
cd /home/hadoop/hadoop/etc/hadoop
We need to edit:
- hadoop-env.sh
- mapred-env.sh
- yarn-env.sh
On each node, edit each of these files and add JAVA_HOME:
sudo nano hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
sudo nano mapred-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
sudo nano yarn-env.sh
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
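If you’d rather not open each file by hand, a minimal sketch that appends the same line to all three env files (assuming you’re still in /home/hadoop/hadoop/etc/hadoop) is:
for f in hadoop-env.sh mapred-env.sh yarn-env.sh; do
  # append the JAVA_HOME export to each env file
  echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> "$f"
done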
Update Daemons
We need to update the following xml site files:
- core-site.xml
- yarn-site.xml
- hdfs-site.xml
- mapred-site.xml
core-site.xml
sudo nano core-site.xml
- Make sure the following configuration is present on all nodes
- Make sure to point the IP address of hdfs at your master node
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://192.168.1.103:8020</value>
  </property>
</configuration>
yarn-site.xml
sudo nano yarn-site.xml
- Make sure the following configuration is present on all nodes
- Make sure to point the IP address to your master node
<configuration>
  <property>
    <description>The hostname of the RM</description>
    <name>yarn.resourcemanager.hostname</name>
    <value>192.168.1.103</value>
  </property>
  <property>
    <description>The address of the applications manager interface in the RM</description>
    <name>yarn.resourcemanager.address</name>
    <value>192.168.1.103:8032</value>
  </property>
</configuration>
hdfs-site.xml
This file will set the default directories for our name nodes and data nodes.
sudo nano hdfs-site.xml
- Make sure the following configuration is present on all nodes
- Note: the dfs.replication value sets the number of replicas we want for our data. My cluster is small, but if you have a larger cluster, you may want to increase this number to 3.
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hadoop/hadoop/hadoop-dir/namenode-dir</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hadoop/hadoop/hadoop-dir/datanode-dir</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>
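Hadoop will generally create these directories when the NameNode is formatted and the DataNodes start, but creating them up front on each node avoids permission surprises:
mkdir -p /home/hadoop/hadoop/hadoop-dir/namenode-dir
mkdir -p /home/hadoop/hadoop/hadoop-dir/datanode-dir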
mapred-site.xml
On all nodes:
sudo nano mapred-site.xml
- Make sure the following configuration is present on all nodes
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
workers file
This file tells the master node which nodes to use as workers. You only need to update this file on the master node. If you don’t remove localhost from this file, Hadoop will also treat the master node as a worker node.
sudo nano workers
Add the IP addresses of the workers and remove localhost:
192.168.1.104
192.168.1.105
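Since /etc/hosts already maps these addresses to names, the workers file could equivalently list the hostnames instead:
ubuntu104
ubuntu105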
Add Hadoop to Path
At this point, Hadoop is mostly configured. On each node, open the bashrc file for editing:
sudo nano ~/.bashrc
Add HADOOP_HOME to the path:
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
Apply the bashrc file:
source ~/.bashrc
View the path of HADOOP_HOME:
echo $HADOOP_HOME
Confirm the version of Hadoop:
hadoop version
Manage Hadoop Cluster
Java process status
Hadoop, Hive, and Spark all run on Java. That means we can use jps to see which Java processes are running.
jps
At this point, hadoop shouldn’t be running.
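One step not shown above: before the very first start of the cluster, the NameNode has to be formatted. On the master node only, using the namenode directory configured in hdfs-site.xml:
hdfs namenode -format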
start-all.sh
Start all nodes in the hadoop cluster by running:
start-all.sh
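If everything started cleanly, jps should now show the Hadoop daemons. With the workers file above, you’d typically expect NameNode, SecondaryNameNode, and ResourceManager on the master, and DataNode and NodeManager on each worker:
jps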
stop-all.sh
Stop all nodes in the hadoop cluster by running:
stop-all.sh
review jps
You can kill individual processes on each node by running:
kill <id>
Where <id> is simply the ID of the process you want to kill.
Permissions
You may want to experiment with granting permissions to upload and manage files in HDFS.
- grants read/write on the root directory of hdfs (otherwise you’d get a dr.who error)
- overly permissive, but a good start for the lab
- run this on the master node
hdfs dfs -chmod -R 777 /
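As a quick end-to-end check, you can try creating a directory and uploading a small file (test.txt here is just a hypothetical local file):
hdfs dfs -mkdir -p /lab
hdfs dfs -put test.txt /lab/   # test.txt: any small local file
hdfs dfs -ls /lab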
Navigate
Replace the following with the IP address of your master node:
Hadoop NameNode UI: http://192.168.1.103:9870/
YARN ResourceManager UI: http://192.168.1.103:8088/cluster