
1. Lab Topology
| VM | Hadoop Role | Spark Role | Hostname | IP |
|---|---|---|---|---|
| VM1 | NameNode | Master + Worker | SPARK-1 | 192.168.18.101 |
| VM2 | DataNode | Worker | SPARK-2 | 192.168.18.102 |
| VM3 | DataNode | Worker | SPARK-3 | 192.168.18.103 |
2. Initial Setup (All VMs)
Update system:
sudo apt update && sudo apt upgrade -y
Install tools:
sudo apt install -y ssh wget tar nano openjdk-11-jdk
Set hostname per VM:
sudo hostnamectl set-hostname SPARK-1 # on VM1
sudo hostnamectl set-hostname SPARK-2 # on VM2
sudo hostnamectl set-hostname SPARK-3 # on VM3
Edit /etc/hosts on every VM:
192.168.18.101 SPARK-1
192.168.18.102 SPARK-2
192.168.18.103 SPARK-3
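The start-dfs.sh and start-workers.sh scripts used later launch daemons over SSH, so SPARK-1 needs passwordless SSH access to all three hostnames (including itself). A minimal sketch, assuming the same username on every VM, run on SPARK-1:
ssh-keygen -t rsa -b 4096 -N "" -f ~/.ssh/id_rsa   # skip if a key already exists
ssh-copy-id $USER@SPARK-1
ssh-copy-id $USER@SPARK-2
ssh-copy-id $USER@SPARK-3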
3. Install Hadoop (HDFS)
3.1 Download Hadoop (All VMs)
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar -xvf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /opt/hadoop
Add environment variables:
echo 'export HADOOP_HOME=/opt/hadoop' >> ~/.bashrc
echo 'export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin' >> ~/.bashrc
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> ~/.bashrc
source ~/.bashrc
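Because start-dfs.sh starts the daemons through non-interactive SSH sessions that do not read ~/.bashrc, Hadoop also needs JAVA_HOME in its own environment file; assuming the same JDK path as above:
echo 'export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64' >> /opt/hadoop/etc/hadoop/hadoop-env.sh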
3.2 Configure core-site.xml (All VMs)
sudo nano /opt/hadoop/etc/hadoop/core-site.xml
Add:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://SPARK-1:9000</value>
  </property>
</configuration>
3.3 Configure hdfs-site.xml (All VMs)
sudo nano /opt/hadoop/etc/hadoop/hdfs-site.xml
Use the same config on all VMs (dfs.namenode.name.dir is only read by the NameNode on SPARK-1 and dfs.datanode.data.dir only by the DataNodes, so sharing one file is safe):
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/dn</value>
  </property>
</configuration>
Create data dirs:
sudo mkdir -p /data/nn /data/dn
sudo chown -R $USER:$USER /data
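start-dfs.sh decides where to launch DataNodes from Hadoop's workers file (etc/hadoop/workers in Hadoop 3.x), which this guide has not created yet. A minimal sketch, on SPARK-1 only, listing the two DataNode hosts from the topology:
cat > /opt/hadoop/etc/hadoop/workers <<'EOF'
SPARK-2
SPARK-3
EOF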
3.4 Format HDFS (Only on SPARK-1)
hdfs namenode -format
3.5 Start HDFS (Run from SPARK-1)
start-dfs.sh
Check with jps. Expect NameNode and SecondaryNameNode on SPARK-1, and a DataNode on SPARK-2 and SPARK-3:
jps
HDFS UI:
http://SPARK-1:9870
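Before moving on, it is worth confirming that both DataNodes registered with the NameNode; the report below should list two live DataNodes:
hdfs dfsadmin -report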
4. Install Spark
Download on all VMs:
wget https://downloads.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
tar -xvf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
Environment variables:
echo 'export SPARK_HOME=/opt/spark' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin' >> ~/.bashrc
source ~/.bashrc
4.1 Configure Spark Master (SPARK-1)
sudo nano /opt/spark/conf/spark-env.sh
Add:
export SPARK_MASTER_HOST=SPARK-1
4.2 Configure Worker List (On SPARK-1)
sudo nano /opt/spark/conf/workers
Add every host that should run a worker (SPARK-1 is included because the topology in section 1 runs a worker on the master node as well):
SPARK-1
SPARK-2
SPARK-3
Start Spark Master:
start-master.sh
Start all workers (run on SPARK-1; the script connects over SSH to each host in conf/workers):
start-workers.sh
Spark UI:
http://SPARK-1:8080
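For a quick terminal check, jps on each node should now also show the Spark daemons:
jps   # SPARK-1: Master (and Worker); SPARK-2 / SPARK-3: Worker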
5. Store Data in HDFS
Create folder:
hdfs dfs -mkdir /lab
Upload file:
echo "hello big data" > test.txt
hdfs dfs -put test.txt /lab/
Check:
hdfs dfs -ls /lab
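To confirm the contents actually landed in HDFS, read the file back:
hdfs dfs -cat /lab/test.txt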
6. Run Spark with HDFS
Start PySpark:
pyspark --master spark://SPARK-1:7077
In shell:
# Read the file from HDFS (the address matches fs.defaultFS in core-site.xml)
text = sc.textFile("hdfs://SPARK-1:9000/lab/test.txt")
# Split every line into words
words = text.flatMap(lambda x: x.split())
# Count the total number of words
words.count()
7. Submit a Full Spark Job (WordCount)
Create new file:
echo "spark hdfs spark bigdata lab spark" > wordcount.txt
hdfs dfs -put wordcount.txt /lab/
Submit (the bundled example prints each word with its count to the terminal):
spark-submit --master spark://SPARK-1:7077 \
$SPARK_HOME/examples/src/main/python/wordcount.py \
hdfs://SPARK-1:9000/lab/wordcount.txt
8. Verification
Check cluster:
hdfs dfsadmin -report
Check processes:
jps
Check Spark UI:
http://SPARK-1:8080
Check HDFS UI:
http://SPARK-1:9870
9. Troubleshooting
Worker not joining:
ping SPARK-1
ps -ef | grep Worker
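If the Worker process is running but never appears in the master UI, its log usually names the cause (wrong master URL, DNS, firewall); on the worker node, the log sits under /opt/spark/logs with a name along these lines:
tail -n 50 /opt/spark/logs/spark-*-org.apache.spark.deploy.worker.Worker-*.out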
HDFS not responding:
sudo ss -ltnp | grep 9000
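If nothing is listening on port 9000, the NameNode log on SPARK-1 usually explains why (bad hostname, unformatted metadata directory, permissions); the files follow the pattern hadoop-<user>-namenode-<hostname>.log:
tail -n 50 /opt/hadoop/logs/hadoop-*-namenode-*.log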
Namenode failure (last resort: reformatting wipes the HDFS metadata, so anything already stored in HDFS is lost):
sudo rm -rf /data/nn/*
hdfs namenode -format
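Note that reformatting gives the NameNode a new cluster ID, so existing DataNodes will normally refuse to register afterwards. Clearing their data directories (which destroys any blocks stored there) lets them rejoin; a sketch, to run on SPARK-2 and SPARK-3:
sudo rm -rf /data/dn/*
hdfs --daemon stop datanode
hdfs --daemon start datanode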