1. Example Kaggle Dataset
Pick any free dataset, for example the Iris dataset (CSV).
File name: iris.csv
Simple structure:
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
...
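Before distributing the file, it can help to confirm that its header matches the columns shown above, since some Iris downloads use different column names. A minimal sketch using only the standard library; has_expected_header is a hypothetical helper name, and the inline sample stands in for iris.csv:

```python
import csv
import io

# Column names taken from the structure shown above
EXPECTED = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

def has_expected_header(handle):
    """Return True if the first CSV row matches the expected column names."""
    reader = csv.reader(handle)
    header = next(reader, [])
    return [name.strip() for name in header] == EXPECTED

# Inline sample standing in for the real file (assumption for illustration)
sample = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
)
print(has_expected_header(io.StringIO(sample)))
```

For the real file, pass open("iris.csv", newline="") instead of the StringIO object.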
2. Installation on All Nodes (3 Ubuntu Nodes)
Every node needs Python and MPI installed:
sudo apt update
sudo apt install -y python3 python3-pip mpich
pip3 install mpi4py pandas
Make sure the three nodes can reach each other over SSH without a password prompt (for example, with key-based authentication).
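A quick way to confirm the install on each node is a script that reports its own rank. A minimal sketch, assuming it is saved under the hypothetical name check_mpi.py; describe is an illustrative helper, not part of mpi4py:

```python
import socket

try:
    from mpi4py import MPI  # importing this module initializes MPI
except ImportError:
    MPI = None  # mpi4py (or the MPI library it links against) is missing

def describe(rank, size, host):
    """Format a one-line report for a single MPI process."""
    return f"rank {rank} of {size} on {host}"

if MPI is None:
    print("mpi4py is not importable on", socket.gethostname())
else:
    comm = MPI.COMM_WORLD
    print(describe(comm.Get_rank(), comm.Get_size(), socket.gethostname()))
```

Running mpiexec -n 2 python3 check_mpi.py on a single node should print one line per process if the installation works.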
3. Program File: mpi_csv.py
The program does the following:
- Rank 0 reads the CSV
- The data is split across all ranks
- Each rank computes its part of the sepal_length average
- Rank 0 combines the results
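from mpi4py import MPI
import pandas as pd

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# --- Rank 0 reads the dataset ---
if rank == 0:
    df = pd.read_csv("iris.csv")
    rows = len(df)
    # Split rows into exactly `size` contiguous chunks: scatter needs one
    # item per rank, and the first (rows % size) chunks get one extra row.
    base, extra = divmod(rows, size)
    bounds = [r * base + min(r, extra) for r in range(size + 1)]
    chunks = [df.iloc[bounds[r]:bounds[r + 1]] for r in range(size)]
else:
    chunks = None

# --- Scatter chunks: each rank receives one DataFrame ---
local_df = comm.scatter(chunks, root=0)

# --- Each rank computes a partial sum and a row count ---
local_sum = float(local_df["sepal_length"].sum())
local_count = len(local_df)

# --- Gather partial results on rank 0 ---
partials = comm.gather((local_sum, local_count), root=0)

if rank == 0:
    # Weight by row count so unequal chunks do not skew the mean
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    print("Average sepal_length from all ranks:", total / count)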
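The split-and-combine arithmetic can be tried without a cluster. The sketch below uses plain lists instead of a DataFrame and made-up values; split_bounds and combine are hypothetical helper names. It illustrates that combining per-chunk sums and counts reproduces the overall mean, whereas averaging the per-chunk means does not when the chunks differ in size.

```python
def split_bounds(rows, size):
    """Return size + 1 boundaries that cut `rows` items into `size` chunks."""
    base, extra = divmod(rows, size)
    # The first `extra` chunks each receive one additional row
    return [r * base + min(r, extra) for r in range(size + 1)]

def combine(partials):
    """Combine (sum, count) pairs into one mean weighted by row count."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

values = [5.0, 5.0, 5.0, 5.0, 7.0, 7.0, 9.0]  # made-up numbers, not Iris data
size = 3  # pretend there are three ranks

bounds = split_bounds(len(values), size)
chunks = [values[bounds[r]:bounds[r + 1]] for r in range(size)]
partials = [(sum(c), len(c)) for c in chunks]  # what each rank would send back

weighted = combine(partials)
mean_of_means = sum(s / n for s, n in partials) / size

print("chunk boundaries:", bounds)
print("weighted mean:   ", weighted)
print("mean of means:   ", mean_of_means)
```

With 7 values and 3 ranks the chunks hold 3, 2, and 2 rows, so the two results differ; with 150 rows and 3 ranks the chunks are equal and both approaches agree.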
4. MPI Host File (hostfile)
Create a file named hosts:
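node1:4
node2:4
node3:4
Adjust the hostnames/IPs to match your cluster. The host:count form is what MPICH (installed in section 2) expects; Open MPI instead uses lines such as node1 slots=4.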
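The host file syntax differs between MPI implementations: MPICH's Hydra launcher reads host:count entries, while Open MPI reads host slots=count. A small sketch that renders either form from one mapping; hostfile_lines and its flavor argument are hypothetical names chosen for illustration:

```python
def hostfile_lines(slots_by_host, flavor="mpich"):
    """Render host entries as 'host:N' (MPICH) or 'host slots=N' (Open MPI)."""
    if flavor == "mpich":
        return [f"{host}:{n}" for host, n in slots_by_host.items()]
    if flavor == "openmpi":
        return [f"{host} slots={n}" for host, n in slots_by_host.items()]
    raise ValueError(f"unknown flavor: {flavor}")

# Hostnames and slot counts from the example above; adjust to your cluster
nodes = {"node1": 4, "node2": 4, "node3": 4}
print("\n".join(hostfile_lines(nodes, "mpich")))
```

Writing the joined lines to the hosts file keeps the node list in one place if you later switch MPI implementations.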
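5. Run MPI on the 3 Nodes
Copy the program file and the CSV to the same path on node1, node2, and node3 (or use a shared NFS directory).
Then launch from node1 (MPICH syntax):
mpiexec -n 3 -f hosts python3 mpi_csv.py
Because the hosts file offers 4 slots per node, -n 3 will normally place all three ranks on node1; use -n 12 to occupy every slot.
Or list the hosts directly, which starts one rank per node:
mpirun -np 3 -hosts node1,node2,node3 python3 mpi_csv.py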