Linux - Getting Started

Intended Audience:

  • Data professionals who are just getting started with Linux

Objectives:

  • Cover the main Linux topics necessary to stand up an on-prem data lake

There are many flavors of Linux. I tend to stick with Ubuntu. You have a few options here:

Depending on your environment, you can either deploy the ISO in a VM or write it to a flash drive using something like Etcher and then install it on bare metal.

If you’re just starting out, I recommend using the desktop version. Once you get more comfortable and potentially scale out, you may want to switch over to the server version.

Important Concepts

sudo

When you need to execute a command with elevated privileges, use sudo:

sudo apt update

ssh

Secure Shell (SSH) is a protocol for securely sending commands to a computer over an unsecured network. SSH uses cryptography to authenticate and encrypt connections between devices. In other words, this is how we will remote into our Linux servers. As a prerequisite, you should probably install the OpenSSH server during the Ubuntu installation. If you missed that step, you can run the following commands:

OpenSSH Docs

Install the SSH Client:

sudo apt install openssh-client

Install the SSH Server:

sudo apt install openssh-server

Once OpenSSH has been installed on your Linux server, you should be able to ssh into the machine. From another device, open your command prompt and type:

ssh name@192.168.1.123

Make sure to replace name with your username and the IP address with the address of your server, then enter your password when prompted. The first time you connect, you may also need to accept the fingerprint of the device. This just means you’re confirming that you trust the device. Hostnames and IP addresses can be recycled, so there’s a chance your machine has seen something resembling this device before and now finds a key that does not match. In this case, you may need to open your known_hosts file and remove the stale entry. Depending on your setup, you may need administrative permissions to edit the file. On a Windows device, the file can be found here:

C:\Users\<your name>\.ssh\known_hosts
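As an alternative to editing known_hosts by hand, OpenSSH’s ssh-keygen has a -R option that removes all stored keys for a given host. A minimal sketch, assuming the stale entry is for 192.168.1.123; forget_host is a hypothetical helper name, not part of OpenSSH:

```shell
# Hypothetical helper: remove every stored key for a host from known_hosts.
# $1 = hostname or IP address
# $2 = optional path to a known_hosts file (defaults to the current user's file)
forget_host() {
  ssh-keygen -R "$1" -f "${2:-$HOME/.ssh/known_hosts}"
}

# Example: forget the recycled address used above, then reconnect with ssh.
# forget_host 192.168.1.123
```

ssh-keygen keeps the previous contents in a known_hosts.old file next to the original. The plain command ssh-keygen -R 192.168.1.123 should also work from a Windows prompt if the OpenSSH client is installed there.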

hostname

If you want to use the hostname instead of an IP address, you can set the name of your server:

sudo hostnamectl set-hostname mynewhostname

While the above command renames your device, you might also want to update the hosts file on your laptop/desktop so that you can ssh into the Linux server by name instead of IP address. On a Windows PC, you can find this file here:

C:\Windows\System32\drivers\etc\hosts

You’ll need administrative privileges to modify this file. Simply add a new line and save it:

192.168.1.123 mynewhostname

Once saved, you should now be able to ssh into the server by name:

ssh myname@mynewhostname

ls

Use ls to list files in a directory.

cd

Use cd to change directories.

cd /etc

Explore it:

ls

Move back one level:

cd ..

Navigate to the root directory:

cd /

Navigate to the home directory of the current user:

cd ~
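A short sketch that strings the navigation commands together in one session. pwd (print working directory) is an addition here; it is used only to show where each cd lands.

```shell
start_dir="$(pwd)"   # remember where we started
cd /etc              # jump to an absolute path
pwd                  # prints /etc
cd ..                # move up one level
after_up="$(pwd)"    # one level above /etc is the root directory, /
cd ~                 # jump to the current user's home directory
home_dir="$(pwd)"
cd "$start_dir"      # go back to where we started

echo "one level above /etc: $after_up"
echo "home directory: $home_dir"
```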

ip a

Get information about the current IP Address configuration:

ip a

cat

Read the contents of a file with cat.

Navigate to a file:

cd /etc/netplan

Read the contents of one of the available files (the file name may differ on your system, so check with ls first). For a future step, take note of the name of the ethernet adapter found in this file.

cat 00-installer-config.yaml

cp

Use cp to copy a file. In this example, copy it to a file with a new extension.

sudo cp /etc/netplan/00-installer-config.yaml /etc/netplan/00-installer-config.yaml.bck

rm

Remove files with rm. In this case, we first created a backup copy of the original file. Now, we’re removing the original file.

sudo rm /etc/netplan/00-installer-config.yaml
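Because rm deletes without a recycle bin, it can help to confirm that the backup matches before removing the original. The sketch below rehearses that on a throwaway directory created with mktemp rather than on /etc/netplan. The file contents are placeholders, and cmp is an addition not covered above.

```shell
workdir="$(mktemp -d)"                        # scratch directory, safe to delete
original="$workdir/00-installer-config.yaml"
backup="$original.bck"

printf 'network:\n  version: 2\n' > "$original"   # placeholder contents

cp -p "$original" "$backup"                   # -p keeps timestamps and permissions

# Only remove the original if the backup is byte-for-byte identical.
if cmp -s "$original" "$backup"; then
  rm "$original"
  echo "backup verified, original removed"
else
  echo "backup differs, original kept" >&2
fi
```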

nano

A simple, but tricky little editor. If this is your first time using linux, editing files in the command line can seem a bit confusing. I prefer nano. We can use nano to create new files or edit existing files.

Create a file that we can use for creating a static IP address.

sudo nano /etc/netplan/static.yaml

This will open a prompt. You can now use the following commands:

  • <ctrl+x> to exit
  • y to accept the changes or n to disregard the changes
  • <enter> to save

static ip

Not exactly a command, but we might want a way to set a static IP address.

Open the file we just created:

sudo nano /etc/netplan/static.yaml

Paste or set the following (be sure to change the ethernet adapter name, ens18 in this example, and the IP addresses to match your network):

network:
  version: 2
  renderer: networkd
  ethernets:
    ens18:
      addresses:
        - 192.168.1.100/24
      gateway4: 192.168.1.1
      nameservers:
        addresses: [8.8.8.8, 8.8.4.4]

Once you’ve pasted the configuration, save and exit:

  • <ctrl+x>
  • y
  • <enter>

Now apply your new configuration:

sudo netplan apply

hosts

Update the hosts file so you can align IP addresses with hostnames:

sudo nano /etc/hosts

Add a line for each machine of interest:

192.168.1.100 ubuntu100
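If you maintain several machines, a small helper can add entries without creating duplicates. A minimal sketch; add_host_entry is a hypothetical function name, and the third argument exists so you can rehearse against a scratch file instead of the real /etc/hosts (which needs root privileges to modify).

```shell
# Hypothetical helper: append "IP hostname" to a hosts file unless the
# hostname is already listed. $3 defaults to /etc/hosts.
add_host_entry() {
  ip="$1"
  name="$2"
  hosts_file="${3:-/etc/hosts}"
  # Ignore comment lines, then look for the hostname as a whole word.
  if grep -v '^[[:space:]]*#' "$hosts_file" | grep -qw -- "$name"; then
    echo "$name already present in $hosts_file"
  else
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
    echo "added $name to $hosts_file"
  fi
}

# Rehearse on a scratch file first.
scratch_hosts="$(mktemp)"
add_host_entry 192.168.1.100 ubuntu100 "$scratch_hosts"
add_host_entry 192.168.1.100 ubuntu100 "$scratch_hosts"   # second call adds nothing
```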

installing packages

You can use some basic commands to install default packages. For example, you can install curl:

sudo apt install curl
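Before installing, you can check whether a command is already available. A minimal sketch using the shell built-in command -v; is_installed is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if the named command can be found on the PATH.
is_installed() {
  command -v "$1" >/dev/null 2>&1
}

if is_installed curl; then
  echo "curl is already installed"
else
  echo "curl is missing; install it with: sudo apt install curl"
fi
```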

wget

Download a file from a URL with wget:

wget https://dlcdn.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
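Downloads can be corrupted in transit, so it is worth comparing the file’s checksum against the one published by the project. A minimal sketch, assuming sha512sum from GNU coreutils is available (it is on Ubuntu); verify_sha512 is a hypothetical helper, and the expected hash must come from the project’s download page, not from this post.

```shell
# Hypothetical helper: succeed only if the file's SHA-512 matches.
# $1 = path to the downloaded file
# $2 = expected hash as hex text
verify_sha512() {
  actual="$(sha512sum "$1" | awk '{print $1}')"
  # Compare in lower case, since published hashes are sometimes upper case.
  [ "$(printf '%s' "$actual" | tr 'A-F' 'a-f')" = "$(printf '%s' "$2" | tr 'A-F' 'a-f')" ]
}

# Usage (paste the published hash in place of the placeholder):
# verify_sha512 spark-3.5.1-bin-hadoop3.tgz "<published sha512>" && echo "checksum OK"
```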

unpack a zipped file

tar xvf spark-3.5.1-bin-hadoop3.tgz
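The flags in tar xvf stand for extract (x), verbose (v), and file (f). The sketch below builds a tiny archive in a scratch directory so you can try listing (t) and extracting into a chosen directory (-C) without downloading anything. The file and directory names are made up for the demo.

```shell
demo="$(mktemp -d)"
mkdir -p "$demo/src/myapp/bin" "$demo/out"
echo "hello" > "$demo/src/myapp/bin/run.sh"

# c = create, z = gzip-compress, f = archive file name; -C switches directory first
tar czf "$demo/myapp.tgz" -C "$demo/src" myapp

# t = list the contents without extracting anything
tar tzf "$demo/myapp.tgz"

# x = extract, v = print each file name, -C = extract into this directory
tar xvzf "$demo/myapp.tgz" -C "$demo/out"
```

GNU tar detects gzip compression on its own when extracting, which is why tar xvf without the z flag also works on a .tgz file.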

move files

You can move that folder and rename it at the same time. In this example, we’re renaming “spark-3.5.1-bin-hadoop3” to “spark” and putting it in a new location.

sudo mv spark-3.5.1-bin-hadoop3 /opt/spark

bashrc

Use bashrc to define environment variables for easy access to files. For example, want easy access to the contents of that spark folder? Open the bashrc file in your home directory (it belongs to your user, so sudo is not needed):

nano ~/.bashrc

Create a variable and point it to the folder we just created. Next, add that variable to our path. Add these two lines to the bashrc file, then save and close.

export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

Apply the changes by running the following:

source ~/.bashrc
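To see how the two export lines behave, you can try them in the current shell first. A minimal sketch; nothing here is written to ~/.bashrc, so the change disappears when you close the terminal.

```shell
export SPARK_HOME=/opt/spark
export PATH="$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin"

echo "$SPARK_HOME"    # prints /opt/spark

# Confirm the Spark bin directory is now part of the PATH.
case ":$PATH:" in
  *":$SPARK_HOME/bin:"*) echo "spark bin directory is on PATH" ;;
  *)                     echo "spark bin directory is NOT on PATH" ;;
esac
```

Once Spark is actually unpacked in /opt/spark, running command -v spark-submit should print the full path to the launcher script.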

passwordless authentication

When we start working with multi-node clusters, we’ll need to allow our master nodes to execute tasks on the worker nodes.

On the master node, run the following line of code to create an ssh key:

ssh-keygen -t rsa -b 4096

Assumptions:

  • One master node with the hostname of ‘masternode’
  • One worker node with the hostname of ‘workernode’
  • A user with the username of ‘username’
  • The user has access to both the masternode and the workernode
  • Execute both commands from the masternode
  • Yes, you are copying the key to both the master node and the worker nodes. This will allow future commands to be executed without entering a password.
  • When initially copying the ssh keys, you’ll be prompted for the password of the machine you’re copying the key to.

Copy to the master node:

ssh-copy-id username@masternode

Copy to each worker node:

ssh-copy-id username@workernode

Once copied, you can test passwordless authentication by trying to ssh into each machine from the master node. If applied correctly, you should not have to enter a password.
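To check this from a script rather than by eye, ssh can be told never to prompt for a password. A minimal sketch using the OpenSSH client options BatchMode and ConnectTimeout; check_passwordless is a hypothetical helper name, and username, masternode, and workernode are the placeholder names from the assumptions above.

```shell
# Hypothetical helper: succeed only if key-based login works without a prompt.
# $1 = user@host
check_passwordless() {
  # BatchMode=yes makes ssh fail instead of asking for a password.
  # ConnectTimeout=5 gives up after 5 seconds if the host is unreachable.
  ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Usage, run from the master node:
# for node in masternode workernode; do
#   if check_passwordless "username@$node"; then
#     echo "$node: passwordless login OK"
#   else
#     echo "$node: passwordless login FAILED"
#   fi
# done
```

In batch mode ssh also cannot ask you to accept a new fingerprint, so this check assumes each host has already been accepted once from the master node.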

This post is licensed under CC BY 4.0 by the author.
