Overview and Topology

Node roles:
SPARK-1 (192.168.18.101)
NameNode, SecondaryNameNode, Spark Master, Spark Worker
SPARK-2 (192.168.18.102)
DataNode, Spark Worker
SPARK-3 (192.168.18.103)
DataNode, Spark Worker
Assumptions:
Ubuntu Server 20.04 / 22.04 / 24.04 (clean)
User: cakrawala (sudo access)
Java: OpenJDK 11
Hadoop: 3.3.6
Spark: 3.5.0 (Hadoop 3 build)
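A quick check of the OS and user assumptions on each node before starting (a minimal sketch; Java itself is installed in Step 1):
lsb_release -ds
groups cakrawala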
Step 1 – Initial System Setup (ALL NODES)
Update system:
sudo apt update && sudo apt upgrade -y
Install required packages:
sudo apt install -y ssh wget tar nano openjdk-11-jdk
Verify Java:
java -version
Step 2 – Hostname Configuration
SPARK-1:
sudo hostnamectl set-hostname SPARK-1
SPARK-2:
sudo hostnamectl set-hostname SPARK-2
SPARK-3:
sudo hostnamectl set-hostname SPARK-3
Log out and back in (or reboot) after setting the hostname.
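To confirm the change took effect on each node (a small check using the standard systemd tooling):
hostnamectl
hostname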
Step 3 – Configure /etc/hosts (ALL NODES)
Edit:
sudo nano /etc/hosts
Add:
192.168.18.101 SPARK-1
192.168.18.102 SPARK-2
192.168.18.103 SPARK-3
Verify:
ping SPARK-1
ping SPARK-2
ping SPARK-3
Step 4 – Passwordless SSH (MANDATORY)
Run on SPARK-1 only.
Generate SSH key:
ssh-keygen -t rsa -P ""
Enable local SSH:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
Copy key to other nodes:
ssh-copy-id SPARK-2
ssh-copy-id SPARK-3
Test:
ssh SPARK-1
ssh SPARK-2
ssh SPARK-3
No password prompt should appear.
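For a single non-interactive check from SPARK-1, the loop below should print each hostname without ever prompting for a password (a small sketch using the standard OpenSSH BatchMode option, which fails instead of prompting):
for h in SPARK-1 SPARK-2 SPARK-3; do ssh -o BatchMode=yes "$h" hostname; done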
Step 5 – Hadoop Installation (ALL NODES)
Download Hadoop:
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
Extract and install:
tar -xvf hadoop-3.3.6.tar.gz
sudo mv hadoop-3.3.6 /opt/hadoop
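Downloading the tarball on every node works, but it can be faster to download it once on SPARK-1 and copy it to the workers. A sketch, assuming the tarball is still in the cakrawala home directory:
scp ~/hadoop-3.3.6.tar.gz SPARK-2:~
scp ~/hadoop-3.3.6.tar.gz SPARK-3:~
Then repeat the tar and sudo mv commands above on SPARK-2 and SPARK-3.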
Step 6 – Hadoop Environment Variables (ALL NODES)
Edit:
nano ~/.bashrc
Add:
export HADOOP_HOME=/opt/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
Apply:
source ~/.bashrc
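To confirm the variables are picked up, a quick sanity check:
echo $HADOOP_HOME
echo $JAVA_HOME
hadoop version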
Step 7 – Hadoop Configuration (ALL NODES)
core-site.xml
nano /opt/hadoop/etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://SPARK-1:9000</value>
  </property>
</configuration>
hdfs-site.xml
nano /opt/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///data/nn</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///data/dn</value>
  </property>
</configuration>
Step 8 – Hadoop Workers File (ALL NODES)
Edit:
nano /opt/hadoop/etc/hadoop/workers
SPARK-2
SPARK-3
(SPARK-1 is not listed because, per the topology above, it does not run a DataNode.)
Step 9 – Set JAVA_HOME for Hadoop (ALL NODES)
Edit:
nano /opt/hadoop/etc/hadoop/hadoop-env.sh
Set:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
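Instead of editing every configuration file on each node, one option is to finish the edits on SPARK-1 and push them to the other nodes. A sketch, assuming /opt/hadoop is writable by cakrawala on SPARK-2 and SPARK-3 (otherwise copy to the home directory first and move with sudo):
scp /opt/hadoop/etc/hadoop/{core-site.xml,hdfs-site.xml,workers,hadoop-env.sh} SPARK-2:/opt/hadoop/etc/hadoop/
scp /opt/hadoop/etc/hadoop/{core-site.xml,hdfs-site.xml,workers,hadoop-env.sh} SPARK-3:/opt/hadoop/etc/hadoop/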
Step 10 – Create HDFS Data Directories (ALL NODES)
sudo mkdir -p /data/nn /data/dn
sudo chown -R cakrawala:cakrawala /data
Step 11 – Format HDFS (SPARK-1 ONLY)
hdfs namenode -format
Step 12 – Start HDFS (SPARK-1 ONLY)
start-dfs.sh
Verify:
SPARK-1:
jps
Expected:
NameNode
SecondaryNameNode
SPARK-2 / SPARK-3:
jps
Expected:
DataNode
HDFS Web UI:
http://SPARK-1:9870
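Besides the web UI, you can confirm that both DataNodes registered with the NameNode from the command line (run on SPARK-1):
hdfs dfsadmin -report
The report should show two live DataNodes (SPARK-2 and SPARK-3).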
Step 13 – Spark Installation (ALL NODES)
Download Spark 3.5.0:
wget https://archive.apache.org/dist/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
Extract and install:
tar -xvf spark-3.5.0-bin-hadoop3.tgz
sudo mv spark-3.5.0-bin-hadoop3 /opt/spark
Step 14 – Spark Environment Variables (ALL NODES)
Edit:
nano ~/.bashrc
Add:
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
Apply:
source ~/.bashrc
Verify:
spark-submit --version
Step 15 – Spark Configuration (ALL NODES)
spark-env.sh
cp /opt/spark/conf/spark-env.sh.template /opt/spark/conf/spark-env.sh
nano /opt/spark/conf/spark-env.sh
Add:
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
export SPARK_MASTER_HOST=SPARK-1
spark-defaults.conf
nano /opt/spark/conf/spark-defaults.conf
Add:
spark.master spark://SPARK-1:7077
spark.eventLog.enabled true
spark.eventLog.dir hdfs://SPARK-1:9000/spark-logs
spark.history.fs.logDirectory hdfs://SPARK-1:9000/spark-logs
Create the Spark log directory in HDFS (run once, on SPARK-1):
hdfs dfs -mkdir -p /spark-logs
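Because event logging points at HDFS, you can optionally start the Spark History Server on SPARK-1 to browse completed applications later (not required for the cluster itself to run):
$SPARK_HOME/sbin/start-history-server.sh
Its UI is served on port 18080 by default: http://SPARK-1:18080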
Step 16 – Start Spark Cluster
SPARK-1:
start-master.sh
start-worker.sh spark://SPARK-1:7077
SPARK-2 and SPARK-3:
start-worker.sh spark://SPARK-1:7077
Step 17 – Verify Spark
SPARK-1:
jps
Expected:
Master
Worker
SPARK-2 / SPARK-3:
Worker
Spark Web UI:
http://SPARK-1:8080
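As a quick end-to-end smoke test, you can submit the bundled SparkPi example to the cluster. A sketch, assuming the stock spark-3.5.0-bin-hadoop3 build (where the examples jar is named spark-examples_2.12-3.5.0.jar; adjust if your build differs):
spark-submit --master spark://SPARK-1:7077 --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.12-3.5.0.jar 100
The driver output should end with an approximation of Pi, and the application should appear as finished in the Spark Web UI.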
Troubleshooting Checklist
Verify ports:
sudo ss -lntp | grep -E '9870|7077|8080'
Verify SSH:
ssh SPARK-2
ssh SPARK-3
Check logs if something fails:
/opt/hadoop/logs
/opt/spark/logs
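Hadoop and Spark name their log files after the daemon, user, and hostname, so a glob is a convenient way to tail them. A sketch (the exact file names will differ per node):
tail -n 100 /opt/hadoop/logs/*namenode*.log
tail -n 100 /opt/hadoop/logs/*datanode*.log
tail -n 100 /opt/spark/logs/*Master*.out
tail -n 100 /opt/spark/logs/*Worker*.out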
Run a Test Job
export SPARK_HOME=/opt/spark
export PATH=$PATH:$SPARK_HOME/bin
Create an example test file named cakrawala.py:
from pyspark.sql import SparkSession

# Start a SparkSession for this job
spark = SparkSession.builder \
    .appName("SimpleSparkDemo") \
    .getOrCreate()

# Build a small in-memory DataFrame and show it
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

print("=== Data ===")
df.show()

spark.stop()
Run it with spark-submit:
spark-submit --master spark://SPARK-1:7077 --conf spark.executor.instances=1 cakrawala.py
Testing with a Dataset from Kaggle
Dataset: Netflix Movies and TV Shows
- Original source: Kaggle
- Widely used in data analytics courses
- CSV format, clean, realistic columns
Columns include:
- type (Movie / TV Show)
- title
- release_year
- country
- rating
- duration
Step 1 – Download dataset (SPARK-1)
cd ~
wget https://raw.githubusercontent.com/justmarkham/DAT8/master/data/netflix_titles.csv
Verify:
head netflix_titles.csv
Step 2 – Upload dataset to HDFS
Create directory:
hdfs dfs -mkdir -p /datasets/netflix
Upload file:
hdfs dfs -put netflix_titles.csv /datasets/netflix/
Verify:
hdfs dfs -ls /datasets/netflix
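As an extra sanity check, you can read the first lines of the file straight out of HDFS; the header row with the column names should appear:
hdfs dfs -cat /datasets/netflix/netflix_titles.csv | head -n 5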
Step 3 – Create Spark job (REAL analytics)
Create file:
nano netflix_job.py
Paste this full script:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder \
    .appName("Netflix Kaggle Demo") \
    .getOrCreate()

# Read data from HDFS
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .csv("hdfs://SPARK-1:9000/datasets/netflix/netflix_titles.csv")

print("=== Dataset Schema ===")
df.printSchema()

print("=== Total Rows ===")
print(df.count())

print("=== Movies Only ===")
movies = df.filter(col("type") == "Movie")
movies.select("title", "release_year", "rating").show(10, truncate=False)

print("=== Number of Movies per Release Year ===")
movies.groupBy("release_year") \
    .count() \
    .orderBy("release_year") \
    .show(10)

print("=== Movies per Country (Top 10) ===")
movies.groupBy("country") \
    .count() \
    .orderBy(col("count").desc()) \
    .show(10, truncate=False)

spark.stop()
Why this is “real”:
- Reads a real dataset
- Uses filter, groupBy, and aggregation
- The same operations are used in industry analytics
Step 4 – Run Spark job (single executor)
/opt/spark/bin/spark-submit \
--master spark://SPARK-1:7077 \
--conf spark.executor.instances=1 \
netflix_job.py
Expected output (example)
You should see output like:
=== Total Rows ===
8807
=== Movies Only ===
+----------------------+------------+------+
|title |release_year|rating|
+----------------------+------------+------+
|Inception |2010 |PG-13 |
|The Irishman |2019 |R |
...
=== Movies per Country (Top 10) ===
+--------------------+-----+
|country |count|
+--------------------+-----+
|United States |2750 |
|India | 962 |
|United Kingdom | 534 |
...