Linux - Getting Started
Intended Audience:
- Data professionals who are just getting started with Linux
Objectives:
- Cover the main Linux topics necessary to stand up an on-prem data lake
There are many flavors of Linux. I tend to stick with Ubuntu. You have a few options here:
Depending on your environment, you can either deploy the ISO in a VM or you can burn it to a flash drive using something like etcher, then install it on bare metal.
If you’re just starting out, I recommend using the desktop version. Once you get more comfortable and potentially scale out, you may want to switch over to the server version.
Important Concepts
sudo
When you need to execute a command with elevated privileges, use sudo:
sudo apt update
ssh
Secure Shell (SSH) Protocol is a method for securely sending commands to a computer over an unsecured network. SSH uses cryptography to authenticate and encrypt connections between devices. In other words, this is how we will remote into our Linux servers. Ideally, you selected the OpenSSH Server option during installation. If you missed that step, you can run the following commands:
Install the SSH Client:
sudo apt install openssh-client
Install the SSH Server
sudo apt install openssh-server
Once openssh has been installed on your linux server, you should be able to ssh into the machine. From another device, open your command prompt and type:
ssh name@192.168.1.123
Make sure to replace name with your username and 192.168.1.123 with the IP address of your server, then enter your password when prompted. Additionally, you may need to accept the fingerprint of the device. This just means you’re confirming that you trust the device. Hostnames and IP addresses can be recycled, so there’s a chance your machine has seen something resembling this device before and now finds a fingerprint that no longer matches. In this case, you may need to open your known_hosts file and remove the stale entry. Make sure to access the file with administrative permissions. This file can be found here on a Windows device:
C:\Users\<your name>\.ssh\known_hosts
hostname
If you want to use the hostname instead of an IP address, you can set the name of your server:
sudo hostnamectl set-hostname mynewhostname
While the above command renames your device, you might also want to update the hosts file on your laptop/desktop so that you can ssh into the Linux server by name instead of IP address. On a Windows PC, you can find this file here:
C:\Windows\System32\drivers\etc\hosts
You’ll need administrative privileges to modify this file. Simply add a new line and save it:
192.168.1.123 mynewhostname
Once saved, you should now be able to ssh into the server by name:
ssh myname@mynewhostname
ls
Use ls to list files in a directory.
cd
Use cd to change directories.
cd /etc
Explore it:
ls
Move back one level:
cd ..
Navigate to the root directory:
cd /
Navigate to your own home directory:
cd ~
Navigate to /home, the directory that contains each user’s home directory:
cd /home
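If you want to practice these navigation commands without wandering through real system directories, a small sketch like the one below may help. It builds a throwaway tree under a temporary directory. The names lake, raw and curated are made up for illustration and are not part of any real install.

```shell
# Build a throwaway directory tree (names are hypothetical) to practice in.
sandbox=$(mktemp -d)
mkdir -p "$sandbox/lake/raw" "$sandbox/lake/curated"

cd "$sandbox/lake/raw"   # jump to a directory by its full path
cd ..                    # move up one level, into "lake"
here=$(basename "$PWD")  # $PWD holds the directory you are currently in
listing=$(ls)            # list what is inside "lake"

echo "now in: $here"
echo "$listing"

cd /                     # "/" is the root of the whole filesystem
```

Running pwd at any point prints the directory you are currently in, which is handy when you lose track.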
ip a
Get information about the current IP Address configuration:
ip a
cat
Read the contents of a file with cat.
Navigate to the directory that holds the file:
cd /etc/netplan
Read the contents of one of the files available. For a future step, take note of the ethernet adapter found in this file.
cat 00-installer-config.yaml
cp
Use cp to copy a file. In this example, copy it to a file with a new extension.
sudo cp /etc/netplan/00-installer-config.yaml /etc/netplan/00-installer-config.yaml.bck
rm
Remove files with rm. In this case, we first created a backup copy of the original file. Now, we’re removing the original file.
sudo rm /etc/netplan/00-installer-config.yaml
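Since rm does not ask for confirmation and there is no recycle bin, it can be worth checking the backup before deleting anything. The following is a minimal sketch of that backup-then-delete pattern. It works on a stand-in file in a temporary directory, and both the file contents and the verification step with cmp are illustrative assumptions rather than part of the original steps.

```shell
# Work on a stand-in file in a temp directory (contents are hypothetical).
workdir=$(mktemp -d)
original="$workdir/00-installer-config.yaml"
backup="$original.bck"
printf 'network:\n  version: 2\n' > "$original"

cat "$original"              # inspect the file before touching it
cp "$original" "$backup"     # make the backup copy

# Delete the original only if the backup is byte-for-byte identical.
if cmp -s "$original" "$backup"; then
  rm "$original"
  echo "original removed, backup kept at $backup"
else
  echo "backup differs from original, nothing removed" >&2
fi
```

For the real file under /etc/netplan, the cp and rm commands would need sudo, as in the commands above.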
nano
A simple, but tricky little editor. If this is your first time using Linux, editing files in the command line can seem a bit confusing. I prefer nano. We can use nano to create new files or edit existing files.
Create a file that we can use for creating a static IP address.
sudo nano /etc/netplan/static.yaml
This will open the editor. You can now use the following commands:
- <ctrl+x> to exit
- y to accept the changes or n to disregard the changes
- <enter> to save
static ip
Not exactly a command, but we might want a way to set a static IP address.
Open the file we just created:
sudo nano /etc/netplan/static.yaml
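Paste the following, adjusting the adapter name under ethernets (ens18 here) and the IP addresses to match your network. YAML is indentation-sensitive, so keep the nesting as shown and use spaces rather than tabs. Newer netplan releases mark gateway4 as deprecated in favor of a routes entry, so you may see a warning when applying this:
network:
  version: 2
  renderer: networkd
  ethernets:
    ens18:
      addresses:
        - 192.168.1.100/24
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]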
Once you’ve copied the info, make sure to:
<ctrl+x>
y
<enter>
Now apply your new configuration:
sudo netplan apply
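If you would rather not paste into nano, the same file can be written from the command line with a here-document. The sketch below assumes the example adapter name and addresses used above, and it writes to a temporary path so it can be tried safely. On a real server the target would be /etc/netplan/static.yaml and the write would need sudo.

```shell
# Hypothetical target; on a real server this would be /etc/netplan/static.yaml
# and the write would need elevated privileges (for example via "sudo tee").
target="$(mktemp -d)/static.yaml"

# Adapter name and addresses are the example values; adjust them to your network.
cat > "$target" <<'EOF'
network:
  version: 2
  renderer: networkd
  ethernets:
    ens18:
      addresses:
        - 192.168.1.100/24
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]
EOF

# YAML does not allow tabs for indentation, so flag any before applying.
if grep -q "$(printf '\t')" "$target"; then
  echo "tabs found in $target" >&2
else
  echo "no tabs in $target"
fi
```

Quoting the EOF marker keeps the shell from expanding anything inside the block, so the file is written exactly as typed.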
hosts
Update the hosts file so you can align IP addresses with hostnames:
sudo nano /etc/hosts
Add a line for each machine of interest:
192.168.1.100 ubuntu100
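When you have several machines to register, you may prefer to add entries from a script that is safe to run more than once. Here is a rough sketch, assuming one hostname per line at the end of each entry. The add_host helper is a made-up name for illustration, and the sketch edits a temporary stand-in file rather than the real /etc/hosts.

```shell
# Stand-in for /etc/hosts so the sketch can run without sudo.
hosts_file=$(mktemp)

add_host() {
  # $1 = IP address, $2 = hostname, $3 = hosts file to edit
  # Simple check: is there already a line ending in this hostname?
  if grep -q "[[:space:]]$2\$" "$3"; then
    echo "$2 already present"
  else
    printf '%s %s\n' "$1" "$2" >> "$3"
  fi
}

add_host 192.168.1.100 ubuntu100 "$hosts_file"
add_host 192.168.1.100 ubuntu100 "$hosts_file"   # second call adds nothing
cat "$hosts_file"
```

Against the real file, the append would need elevated privileges, since /etc/hosts is owned by root.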
installing packages
You can use some basic commands to install default packages. For example, you can install curl:
sudo apt install curl
wget
Download a file from a URL with wget:
wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
unpack a zipped file
tar xvf spark-3.5.1-bin-hadoop3.tgz
move files
You can move the extracted folder and rename it at the same time. In this example, we’re renaming “spark-3.5.1-bin-hadoop3” as “spark” and putting it in a new location.
sudo mv spark-3.5.1-bin-hadoop3 /opt/spark
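To get a feel for tar and mv before working with a large download, the sequence can be rehearsed on a tiny archive. The sketch below creates its own stand-in archive in a temporary directory. The demo-1.0 name and the opt folder are hypothetical, chosen only to mirror the Spark steps above.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Create a tiny stand-in archive (hypothetical name) instead of downloading one.
mkdir -p demo-1.0/bin
echo 'echo hello' > demo-1.0/bin/run.sh
tar czf demo-1.0.tgz demo-1.0   # c = create, z = gzip, f = archive file
rm -r demo-1.0

tar tzf demo-1.0.tgz            # t = list the contents without extracting
tar xzf demo-1.0.tgz            # x = extract
mkdir opt
mv demo-1.0 opt/demo            # move and rename in one step
ls opt/demo/bin
```

Listing an archive with t before extracting it shows which top-level folder it will create, so you know what to pass to mv afterwards.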
bashrc
Use bashrc with environment variables for easy access to files. For example, want access to the contents of that spark folder? Open your bashrc file. It lives in your own home directory, so sudo is not needed:
nano ~/.bashrc
Create a variable and point it to the folder we just created. Next, add that variable to our path. Add these two lines to the bashrc file, then save and close.
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Apply the changes by running the following:
source ~/.bashrc
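If you script this step, it helps to avoid appending the same export lines every time the script runs. The following sketch shows one way to do that. It uses a temporary file as a stand-in for ~/.bashrc, and add_line is a hypothetical helper name, not a standard command.

```shell
# Stand-in for ~/.bashrc so the sketch does not modify your real profile.
rc_file=$(mktemp)

add_line() {
  # Append line $1 to file $2 unless that exact line is already there.
  grep -qxF -- "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

add_line 'export SPARK_HOME=/opt/spark' "$rc_file"
add_line 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' "$rc_file"
add_line 'export SPARK_HOME=/opt/spark' "$rc_file"   # duplicate, ignored

. "$rc_file"    # "." is the portable spelling of "source"
echo "$SPARK_HOME"
echo "$PATH" | tr ':' '\n' | grep spark   # show only the Spark entries in PATH
```

The single quotes matter here: they keep $PATH and $SPARK_HOME from being expanded when the line is written, so the expansion happens each time the file is loaded.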
passwordless authentication
When we start working with multi-node clusters, we’ll need to allow our master nodes to execute tasks on the worker nodes.
On the master node, run the following line of code to create an ssh key:
ssh-keygen -t rsa -b 4096
Assumptions:
- One master node with the hostname of ‘masternode’
- One worker node with the hostname of ‘workernode’
- A user with the username of ‘username’
- The user has access to both the masternode and the workernode
- Execute both commands from the masternode
- Yes, you are copying the key to both the master node and the worker nodes. This will allow future commands to be executed without entering a password.
- When initially copying the ssh keys, you’ll be prompted for the password of the machine you’re copying the keys to.
Copy to the master node:
ssh-copy-id username@masternode
Copy to each worker node:
ssh-copy-id username@workernode
Once copied, you can test passwordless authentication by trying to ssh into each machine from the master node. If applied correctly, you should not have to enter a password.
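With more than a couple of nodes, testing each one by hand gets tedious. A small helper along the following lines could run the check in a loop. The check_passwordless name is made up for this sketch. It relies on the ssh option BatchMode=yes, which makes ssh fail instead of prompting for a password, so a missing key shows up as a failure rather than a hanging prompt.

```shell
check_passwordless() {
  # Usage: check_passwordless USER HOST [HOST...]
  cp_user=$1
  shift
  cp_failed=0
  for cp_host in "$@"; do
    # BatchMode=yes: never prompt; ConnectTimeout=5: give up after 5 seconds.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$cp_user@$cp_host" true; then
      echo "$cp_host: key login ok"
    else
      echo "$cp_host: key login FAILED"
      cp_failed=1
    fi
  done
  # Succeed only if every host accepted the key.
  [ "$cp_failed" -eq 0 ]
}
```

From the master node, and using the hostnames assumed above, you would call it as check_passwordless username masternode workernode. Any host reported as FAILED likely still needs ssh-copy-id.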