Silver to Gold Layer | Running ML - Debug Help Needed

ManojkMohan · 08-28-2025

Problem I am solving:

Reads the raw IPL sports data CSV → bronze layer
Cleans and aggregates → silver layer
Summarizes team stats → gold layer
Prepares ML-ready features and trains a Random Forest classifier to predict match winners
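
To make the three layers concrete, here is a minimal sketch of the bronze → silver → gold flow using only the Python standard library. The column names (match_id, team1, team2, winner), the sample rows, and the helper names are all hypothetical; the real pipeline runs on Spark tables and the actual IPL CSV has a different schema.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data; the real IPL CSV has different and many more columns.
RAW_CSV = """match_id,team1,team2,winner
1,Mumbai,Chennai,Mumbai
2,Chennai,Delhi,
3,Delhi,Mumbai, delhi
3,Delhi,Mumbai, delhi
"""

def load_bronze(text):
    """Bronze: keep every row exactly as it arrived."""
    return list(csv.DictReader(io.StringIO(text)))

def to_silver(bronze_rows):
    """Silver: trim and normalize names, drop rows with no winner, de-duplicate."""
    seen, silver = set(), []
    for row in bronze_rows:
        clean = {k: v.strip().title() for k, v in row.items()}
        if not clean["winner"] or clean["match_id"] in seen:
            continue
        seen.add(clean["match_id"])
        silver.append(clean)
    return silver

def to_gold(silver_rows):
    """Gold: one summary record per team (matches played, wins)."""
    stats = defaultdict(lambda: {"played": 0, "wins": 0})
    for row in silver_rows:
        for team in (row["team1"], row["team2"]):
            stats[team]["played"] += 1
        stats[row["winner"]]["wins"] += 1
    return dict(stats)

bronze = load_bronze(RAW_CSV)
silver = to_silver(bronze)
gold = to_gold(silver)
```

In this toy data, the match with no winner and the duplicated match are filtered out at the silver step, so the gold summary only counts the two valid matches.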

I am getting this error: [PARSE_SYNTAX_ERROR] Syntax error at or near end of input. SQLSTATE: 42601 when I run this code:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

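# Ensure 'features' is array<float>, then convert to a pandas DataFrame.
# .collect() returns a list of Row objects, so pdf["features"] below would fail;
# .toPandas() returns the DataFrame the rest of the code expects.
pdf = (
    df_ml
    .select("features", "label")
    .limit(10000)  # Optional: limit for performance
    .toPandas()
)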

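# Expand features into proper numeric columns; null vectors become all-zero rows.
# Take the vector length from the first non-null row, since row 0 may itself be null.
n_features = len(pdf["features"].dropna().iloc[0])
X = pd.DataFrame([
    np.array(f) if f is not None else np.zeros(n_features)
    for f in pdf["features"]
])
y = pdf["label"].astype(int)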

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Train Random Forest (scikit-learn)
rf = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Predictions & evaluation
y_pred = rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
display(f"✅ RandomForest trained successfully. Test Accuracy = {acc:.2f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
display(cm)

# Map back to team names
index_to_team = {idx: team for team, idx in team_to_index.items()}
pred_vs_actual = pd.DataFrame({
    "actual": y_test.map(index_to_team).reset_index(drop=True),
    "predicted": pd.Series(y_pred).map(index_to_team)
})
display(pred_vs_actual.head(10))
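
The code above uses team_to_index without defining it, so it is presumably created in an earlier cell. A minimal sketch of how such a mapping could be built and inverted, assuming hypothetical team names and a hypothetical decode helper (neither is from the actual notebook):

```python
# Hypothetical team names; in the real notebook these would come from the data.
teams = ["Mumbai Indians", "Chennai Super Kings", "Delhi Capitals", "Mumbai Indians"]

# Sort the unique names so the same team always gets the same integer label.
team_to_index = {team: idx for idx, team in enumerate(sorted(set(teams)))}

# Invert the mapping to turn predicted labels back into names.
index_to_team = {idx: team for team, idx in team_to_index.items()}

def decode(labels):
    """Map integer labels to team names; unknown labels become None."""
    return [index_to_team.get(int(label)) for label in labels]
```

Sorting before enumerating keeps the label assignment stable across runs, which matters if the mapping is rebuilt in a different cell from the one that created the training labels.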
