Session 11-12 - Apache Spark Installation

1. Lab Topology

VM     Hadoop Role   Spark Role        Hostname   IP
VM1    NameNode      Master + Worker   SPARK-1    192.168.18.101
VM2    DataNode      Worker            SPARK-2    192.168.18.102
VM3    DataNode      Worker            SPARK-3    192.168.18.103

2. Initial Setup (All VMs)

Update system:

sudo apt update && sudo apt upgrade -y

Install tools:

sudo apt install -y ssh wget tar nano openjdk-11-jdk

Set hostname per VM:

sudo hostnamectl set-hostname SPARK-1   # on VM1
sudo hostnamectl set-hostname SPARK-2   # on VM2
sudo hostnamectl set-hostname SPARK-3   # on VM3

Edit /etc/hosts on every VM:

192.168.18.101 SPARK-1
192.168.18.102 SPARK-2
192.168.18.103 SPARK-3
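
Both start-dfs.sh (section 3) and start-workers.sh (section 4) log into the other nodes over SSH, so SPARK-1 needs passwordless SSH to every VM, including itself. A minimal sketch, run on SPARK-1 and assuming the same username on all three VMs:

ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa
ssh-copy-id $USER@SPARK-1
ssh-copy-id $USER@SPARK-2
ssh-copy-id $USER@SPARK-3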

3. Install Hadoop (HDFS)

3.1 Download Hadoop (All VMs)

wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /opt/hadoop

Add environment variables:

echo 'export HADOOP_HOME=/opt/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
source ~/.bashrc
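
Hadoop's daemon scripts do not always pick up JAVA_HOME from ~/.bashrc; if start-dfs.sh later complains that JAVA_HOME is not set, setting it in hadoop-env.sh as well (on every VM) usually resolves it:

echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' | sudo tee -a /opt/hadoop/etc/hadoop/hadoop-env.sh

A quick sanity check that the variables took effect:

hadoop version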

3.2 Configure core-site.xml (All VMs)

sudo nano /opt/hadoop/etc/hadoop/core-site.xml

Add:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://SPARK-1:9000</value>
  </property>
</configuration>

3.3 Configure hdfs-site.xml (All VMs)

sudo nano /opt/hadoop/etc/hadoop/hdfs-site.xml

Use the same config on all VMs (the NameNode directory is only used on SPARK-1 and the DataNode directory only on SPARK-2 and SPARK-3, but keeping one file everywhere is simpler):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/dn</value>
  </property>
</configuration>

Create data dirs:

sudo mkdir -p /data/nn /data/dn
sudo chown -R $USER:$USER /data
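
For start-dfs.sh (step 3.5, run from SPARK-1) to start the DataNodes on the other VMs, Hadoop's workers file on SPARK-1 should list them. A sketch matching the topology in section 1:

sudo nano /opt/hadoop/etc/hadoop/workers

Contents:

SPARK-2
SPARK-3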

3.4 Format HDFS (Only on SPARK-1)

hdfs namenode -format

3.5 Start HDFS (Run from SPARK-1)

start-dfs.sh

Check with:

jps
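
On SPARK-1, jps should typically list NameNode and SecondaryNameNode; on SPARK-2 and SPARK-3 it should list DataNode (each VM also shows Jps itself).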

HDFS UI:

http://SPARK-1:9870

4. Install Spark

Download on all VMs:

wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
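
If the wget returns 404 (Apache keeps only recent releases on the main download site and moves older ones to the archive), the same file is usually available from the archive mirror; the analogous path works for the Hadoop tarball in 3.1:

wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz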

Environment variables:

echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
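
Quick check that the Spark binaries are on the PATH before configuring the cluster:

spark-submit --version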

4.1 Configure Spark Master (SPARK-1)

sudo cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh
sudo nano /opt/spark/conf/spark-env.sh

Add:

SPARK_MASTER_HOST=SPARK-1

4.2 Configure Worker List (On SPARK-1)

sudo nano /opt/spark/conf/workers

Add (SPARK-1 is included because, per the topology in section 1, it runs a worker alongside the master):

SPARK-1
SPARK-2
SPARK-3

Start Spark Master:

start-master.sh

Start all workers:

start-workers.sh

Spark UI:

http://SPARK-1:8080

5. Store Data in HDFS

Create folder:

hdfs dfs -mkdir /lab

Upload file:

echo "hello big data" > test.txt
hdfs dfs -put test.txt /lab/

Check:

hdfs dfs -ls /lab
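
To confirm the contents made it in:

hdfs dfs -cat /lab/test.txt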

6. Run Spark with HDFS

Start PySpark:

pyspark --master spark://SPARK-1:7077

In shell:

text = sc.textFile("hdfs://SPARK-1:9000/lab/test.txt")
words = text.flatMap(lambda x: x.split())
words.count()

7. Submit a Full Spark Job (WordCount)

Create new file:

echo "spark hdfs spark bigdata lab spark" > wordcount.txt
hdfs dfs -put wordcount.txt /lab/

Submit:

spark-submit --master spark://SPARK-1:7077 \
  $SPARK_HOME/examples/src/main/python/wordcount.py \
  hdfs://SPARK-1:9000/lab/wordcount.txt

8. Verification

Check cluster:

hdfs dfsadmin -report

Check processes:

jps

Check Spark UI:

http://SPARK-1:8080

Check HDFS UI:

http://SPARK-1:9870

9. Troubleshooting

Worker not joining (run on the affected worker to check connectivity to the master and that the Worker process is up):

ping SPARK-1
ps -ef | grep Worker
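
If the process is up but the worker still does not appear in the master UI, its log usually says why (assuming Spark's default log directory):

tail -n 50 /opt/spark/logs/*Worker*.out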

HDFS not responding (check on SPARK-1 that the NameNode is listening on port 9000):

sudo ss -ltnp | grep 9000
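
If nothing is listening on 9000, the NameNode log on SPARK-1 usually states the cause (assuming Hadoop's default log directory):

tail -n 50 /opt/hadoop/logs/hadoop-*-namenode-*.log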

Namenode failure (last resort: reformatting wipes the HDFS namespace, so anything already stored in HDFS is lost):

sudo rm -rf /data/nn/*
hdfs namenode -format
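
Reformatting assigns a new clusterID, so existing DataNodes will refuse to register until their data directories are cleared too. On SPARK-2 and SPARK-3 (this deletes all HDFS block data):

sudo rm -rf /data/dn/*

Then restart HDFS from SPARK-1:

stop-dfs.sh
start-dfs.sh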