Week 3

1. Example Kaggle Dataset

Pick any free dataset, for example the Iris dataset (CSV).
File name: iris.csv

Its structure is simple:

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
...
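
Different copies of the Iris dataset may name their columns differently, so it can help to confirm the header before running anything on the cluster. The following is a minimal sketch using only the Python standard library; the helper name summarize_csv and the in-memory sample are illustrative assumptions, not part of the dataset or of any library.

```python
import csv
import io

# Column names as listed in the structure above.
EXPECTED_COLUMNS = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]

def summarize_csv(handle):
    """Return (column names, number of data rows) for an open CSV file."""
    reader = csv.DictReader(handle)
    columns = reader.fieldnames or []   # reads the header line
    row_count = sum(1 for _ in reader)  # counts the remaining data rows
    return columns, row_count

# In-memory stand-in for iris.csv, using only the header and first row shown above.
sample = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
)
columns, row_count = summarize_csv(sample)
print(columns == EXPECTED_COLUMNS, row_count)
```

For the real file, the same function could be called as summarize_csv(open("iris.csv", newline="")). If the column names differ, either rename them in the CSV or adjust the column name used in the program.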

2. Installation on All Nodes (3 Ubuntu Nodes)

Python and MPI must be installed on every node:

sudo apt update
sudo apt install -y python3 python3-pip mpich
pip3 install mpi4py pandas

Make sure the 3 nodes can SSH to one another without a password.
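
One way to confirm passwordless SSH before launching MPI is to attempt a non-interactive login to each node. The following is a minimal sketch, assuming the OpenSSH client (ssh) is on the PATH. The host names are placeholders, and the helper names ssh_check_command and can_ssh are hypothetical.

```python
import subprocess

# Placeholder host names; replace with the names or IPs of your own nodes.
NODES = ["node1", "node2", "node3"]

def ssh_check_command(host):
    """Build an ssh command that fails instead of prompting for a password."""
    # BatchMode=yes disables password prompts; ConnectTimeout bounds the wait.
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, "true"]

def can_ssh(host):
    """Return True if `host` accepts a non-interactive SSH login."""
    try:
        result = subprocess.run(ssh_check_command(host), capture_output=True, timeout=15)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

Running can_ssh(node) for each entry in NODES, from every machine that will launch jobs, should return True everywhere. A False result usually means a key still needs to be copied to that node.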


3. Program File: mpi_csv.py

This program does the following:

  • Rank 0 reads the CSV

  • The data is split across all ranks

  • Each rank computes the average sepal_length of its own chunk

  • Rank 0 combines the results

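from mpi4py import MPI
import pandas as pd

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# --- Rank 0 reads the dataset ---
if rank == 0:
    df = pd.read_csv("iris.csv")
    rows = len(df)

    # Split the rows into exactly `size` chunks, because scatter needs one
    # item per rank. The first `extra` chunks get one additional row.
    base, extra = divmod(rows, size)
    bounds = [i * base + min(i, extra) for i in range(size + 1)]
    chunks = [df.iloc[bounds[i]:bounds[i + 1]] for i in range(size)]
else:
    chunks = None

# --- Scatter one chunk to each rank ---
local_df = comm.scatter(chunks, root=0)

# --- Each rank computes the mean of its chunk and counts its rows ---
local_avg = local_df["sepal_length"].mean()
local_count = len(local_df)

# --- Gather (mean, count) pairs on rank 0 ---
all_parts = comm.gather((local_avg, local_count), root=0)

if rank == 0:
    # Weight each mean by its row count so the result is the true overall
    # mean even when chunks differ in size. Empty chunks (NaN mean) are skipped.
    total = sum(avg * n for avg, n in all_parts if n > 0)
    count = sum(n for _, n in all_parts)
    global_avg = total / count
    print("Average sepal_length from all ranks:", global_avg)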
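
The split-and-combine arithmetic can be checked on a single machine without launching MPI. The sketch below uses plain Python lists and made-up numbers (not Iris data); the helper names split_bounds and combine are hypothetical. It illustrates two points: scatter needs exactly one chunk per rank, and a plain mean of per-rank means equals the overall mean only when every chunk has the same number of rows.

```python
def split_bounds(rows, size):
    """Return size + 1 boundaries that cut `rows` items into `size` near-equal chunks."""
    base, extra = divmod(rows, size)
    return [i * base + min(i, extra) for i in range(size + 1)]

def combine(partials):
    """Combine (mean, count) pairs into the overall mean, skipping empty chunks."""
    total = sum(mean * n for mean, n in partials if n > 0)
    count = sum(n for _, n in partials)
    return total / count

values = [1.0, 2.0, 3.0, 4.0, 5.0]  # made-up numbers for illustration
size = 2                            # pretend there are two ranks

bounds = split_bounds(len(values), size)
chunks = [values[bounds[i]:bounds[i + 1]] for i in range(size)]
partials = [(sum(c) / len(c), len(c)) for c in chunks]

weighted = combine(partials)                                 # count-weighted mean
naive = sum(m for m, _ in partials) / len(partials)          # mean of means
print(chunks, weighted, naive)
```

Here the chunks are [1.0, 2.0, 3.0] and [4.0, 5.0]: the weighted result is 3.0, which matches the mean of all five values, while the unweighted mean of means gives 3.25.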

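4. MPI Host File (hostfile)

Create a file named hosts:

node1:4
node2:4
node3:4

Adjust the hostnames/IPs and the per-node process counts to match your machines. This is the host:count format used by MPICH's Hydra launcher, matching the mpich package installed in section 2. Open MPI uses a different syntax (node1 slots=4).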


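5. Run MPI on 3 Nodes

Copy the program and the CSV to the same path on node1, node2, and node3 (or share the directory over NFS).

Then run from node1 (with MPICH, the host file is passed with -f):

mpiexec -f hosts -n 3 python3 mpi_csv.py

The launcher fills each host up to its listed process count before moving to the next one, so with 4 processes allowed per node, -n 3 may place all three ranks on node1. To use every node, either raise -n (for example to 12) or lower the per-node counts in hosts.

Alternatively, list the hosts directly:

mpirun -np 3 -hosts node1,node2,node3 python3 mpi_csv.py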