Silver to Gold Layer | Running ML - Debug Help Needed

ManojkMohan
Honored Contributor II

Problem I am solving:

  • Reads the raw sports data (IPL CSV) → bronze layer

  • Cleans and aggregates → silver layer

  • Summarizes team stats → gold layer

  • Prepares ML-ready features and trains a Random Forest classifier to predict match winners
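
To make the layer flow above concrete, here is a minimal, Spark-free sketch using only the Python standard library. The column names (match_id, team1, team2, winner), the sample rows, and the helper names load_bronze, to_silver, and to_gold are all assumptions for illustration, not the actual IPL schema or the notebook's code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample standing in for the raw IPL CSV; column names are assumed.
RAW_CSV = """match_id,team1,team2,winner
1, Mumbai Indians ,Chennai Super Kings,Mumbai Indians
2,Chennai Super Kings,Delhi Capitals,Chennai Super Kings
3,Mumbai Indians,Delhi Capitals,
4,Delhi Capitals,Mumbai Indians,Mumbai Indians
"""

def load_bronze(text):
    """Bronze: raw rows exactly as read from the CSV."""
    return list(csv.DictReader(io.StringIO(text)))

def to_silver(bronze_rows):
    """Silver: trim whitespace and drop matches with no recorded winner."""
    cleaned = []
    for row in bronze_rows:
        row = {key: (value or "").strip() for key, value in row.items()}
        if row["winner"]:
            cleaned.append(row)
    return cleaned

def to_gold(silver_rows):
    """Gold: per-team matches played, wins, and win rate."""
    stats = defaultdict(lambda: {"played": 0, "wins": 0})
    for row in silver_rows:
        for team in (row["team1"], row["team2"]):
            stats[team]["played"] += 1
        stats[row["winner"]]["wins"] += 1
    return {
        team: {**s, "win_rate": s["wins"] / s["played"]}
        for team, s in stats.items()
    }

bronze = load_bronze(RAW_CSV)
silver = to_silver(bronze)
gold = to_gold(silver)
```

In the real pipeline each function would presumably be a Spark transformation writing to its own table; the sketch only shows what each layer is responsible for.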

 

Getting error: [PARSE_SYNTAX_ERROR] Syntax error at or near end of input. SQLSTATE: 42601 when I run this code:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Convert the Spark DataFrame to pandas; 'features' is expected to be array<float>.
# .collect() would return a list of Row objects, which cannot be indexed by column name.
pdf = (
    df_ml
    .select("features", "label")
    .limit(10000)  # Optional: limit for performance
    .toPandas()
)

# Expand features into proper numeric columns
X = pd.DataFrame([
    np.array(f) if f is not None else np.zeros(len(pdf["features"][0]))
    for f in pdf["features"]
])
y = pdf["label"].astype(int)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Train Random Forest (scikit-learn)
rf = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Predictions & evaluation
y_pred = rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"RandomForest trained successfully. Test Accuracy = {acc:.2f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Map back to team names
index_to_team = {idx: team for team, idx in team_to_index.items()}
pred_vs_actual = pd.DataFrame({
    "actual": y_test.map(index_to_team).reset_index(drop=True),
    "predicted": pd.Series(y_pred).map(index_to_team)
})
display(pred_vs_actual.head(10))    
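
The feature-expansion step in the code above sizes its zero fallback from the first row, so it fails if that first row is itself null. A minimal sketch of a more defensive version follows, written in plain Python so it runs without Spark; expand_features is a hypothetical helper name, and treating a missing vector as all zeros is carried over from the original code as an assumption.

```python
def expand_features(vectors):
    """Return a rectangular list of float rows; missing vectors become zeros."""
    # Take the width from the first non-missing vector, not from row 0.
    width = next((len(v) for v in vectors if v is not None), None)
    if width is None:
        raise ValueError("every feature vector is missing")
    rows = []
    for v in vectors:
        if v is None:
            rows.append([0.0] * width)
        elif len(v) != width:
            raise ValueError(f"expected {width} features, got {len(v)}")
        else:
            rows.append([float(x) for x in v])
    return rows

# Example input with a null in the first position.
features = [None, [1, 2.5, 0], [3, 4, 5]]
matrix = expand_features(features)
```

With a pandas DataFrame named pdf, this could be used as X = pd.DataFrame(expand_features(pdf["features"])). Raising on ragged rows is a design choice here: a silent pad would hide upstream feature-engineering problems.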

 

 