Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Silver to Gold Layer | Running ML - Debug Help Needed

ManojkMohan
Valued Contributor III

Problem I am solving:

  • Reads the raw sports data (IPL CSV) → bronze layer

  • Cleans and aggregates → silver layer

  • Summarizes team stats → gold layer

  • Prepares ML-ready features and trains a Random Forest classifier to predict match winners

 

Getting error: [PARSE_SYNTAX_ERROR] Syntax error at or near end of input. SQLSTATE: 42601 when I run this code:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Ensure 'features' is array<float> and convert to a Pandas DataFrame (classic Spark API)
pdf = (
    df_ml
    .select("features", "label")
    .limit(10000)  # Optional: limit for performance
    .toPandas()  # .collect() would return a list of Rows, not a Pandas DataFrame
)

# Expand features into proper numeric columns
X = pd.DataFrame([
    np.array(f) if f is not None else np.zeros(len(pdf["features"][0]))
    for f in pdf["features"]
])
y = pdf["label"].astype(int)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Train Random Forest (scikit-learn)
rf = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Predictions & evaluation
y_pred = rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
display(f" RandomForest trained successfully. Test Accuracy = {acc:.2f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
display(cm)

# Map back to team names
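# (team_to_index is assumed to have been defined earlier in the notebook, e.g. a
#  {team_name: integer_index} dict built when the labels were encoded)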
index_to_team = {idx: team for team, idx in team_to_index.items()}
pred_vs_actual = pd.DataFrame({
    "actual": y_test.map(index_to_team).reset_index(drop=True),
    "predicted": pd.Series(y_pred).map(index_to_team)
})
display(pred_vs_actual.head(10))    

 

 
[Screenshot: ManojkMohan_0-1756389913835.png]

 

1 ACCEPTED SOLUTION


BS_THE_ANALYST
Esteemed Contributor

Hi @ManojkMohan ,

This section here:

df_ml
    .select("features", "label")
    .limit(10000)  # Optional: limit for performance
    .toPandas()

I don't see anywhere prior to this code block where you actually create "df_ml". Has that DataFrame even been created before this point? If yes, are you certain both of those columns ("features", "label") are present in that DataFrame?
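If it helps, here's a rough sketch of how df_ml could be put together before that block runs (the gold table name and the label encoding below are just guesses, adjust them to your actual schema):

from pyspark.sql import functions as F

# Hypothetical gold table - swap in your actual table name
gold = spark.table("ipl_gold.team_stats")

# Encode the winning team as an integer label via a small lookup DataFrame
teams = [r["team1"] for r in gold.select("team1").distinct().collect()]
team_to_index = {team: i for i, team in enumerate(sorted(teams))}
labels_df = spark.createDataFrame(
    [(team, idx) for team, idx in team_to_index.items()],
    ["team1", "label"],
)

# Pack the numeric stats into an array<float> "features" column
df_ml = (
    gold
    .join(labels_df, on="team1", how="inner")
    .withColumn(
        "features",
        F.array(
            F.col("TotalRunsScored").cast("float"),
            F.col("MatchesPlayed").cast("float"),
            F.col("MaxMarginWon").cast("float"),
        ),
    )
    .select("features", "label")
)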

All the best,
BS


4 REPLIES

ManojkMohan
Valued Contributor III

@BS_THE_ANALYST Any framework recommendations for choosing which ML model to use based on the data? Here is the way I have solved the problem for now:

 Data is ingested and converted to a usable format.

  1. Features and labels define the ML problem.
  2. Validation ensures data integrity.
  3. Train/test split prepares for robust evaluation.
  4. Random Forest learns patterns in IPL team stats.
  5. Predictions and metrics evaluate model quality.
  6. Output reporting allows easy interpretation and decision support.

 

Building Block: Data Source → Pandas DataFrame
Value Added:

  • Reads historical IPL data from a “gold” table in Spark.
  • Converts it to Pandas for use with scikit-learn.
  • Provides the raw material (features + labels) needed for ML (sketched below).
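A minimal sketch of this step, assuming a hypothetical gold table name:

# Read the gold-layer team stats and bring a bounded sample onto the driver as pandas
gold_pdf = (
    spark.table("ipl_gold.team_stats")   # hypothetical table name
    .limit(10000)                        # keep the pandas DataFrame small
    .toPandas()
)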

Building Block: Feature Engineering
Value Added:

  • Selects numeric attributes (TotalRunsScored, MatchesPlayed, MaxMarginWon) as predictors.
  • Assigns the match winner (team1) as the target variable.
  • Ensures the ML model knows what to learn from and what to predict (see the sketch below).
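A sketch of this step, continuing from gold_pdf above; the team_to_index encoding is a hypothetical stand-in for however the labels were actually mapped:

# Numeric predictors named in the post above
feature_cols = ["TotalRunsScored", "MatchesPlayed", "MaxMarginWon"]
X = gold_pdf[feature_cols].astype(float)

# Encode the target (match winner, team1) as integer class indices
team_to_index = {team: i for i, team in enumerate(sorted(gold_pdf["team1"].unique()))}
y = gold_pdf["team1"].map(team_to_index).astype(int)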

Building Block: Data Quality Checks
Value Added:

  • Ensures all required features exist.
  • Warns if a team occurs only once (prevents issues with small data).
  • Improves robustness and interpretability (see the checks sketched below).
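A sketch of the kind of checks meant here, reusing gold_pdf and feature_cols from the sketches above:

# Fail fast if a required column is missing from the gold data
required_cols = feature_cols + ["team1"]
missing = [c for c in required_cols if c not in gold_pdf.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# Warn if any team appears only once (stratified splitting needs >= 2 rows per class)
counts = gold_pdf["team1"].value_counts()
singletons = counts[counts < 2].index.tolist()
if singletons:
    print(f"Warning: teams with a single row: {singletons}")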

Building Block: Model Validation Setup
Value Added:

  • Splits data into training (to learn patterns) and testing (to evaluate performance).
  • Supports generalization, ensuring the model is not overfitting.
  • Stratification maintains class proportions where possible (illustrated below).
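A sketch of the "where possible" part: stratify only when every class has at least two samples, otherwise fall back to a plain split:

from sklearn.model_selection import train_test_split

# A stratified split raises an error if any class has fewer than 2 samples
can_stratify = y.value_counts().min() >= 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y if can_stratify else None,
)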


Building Block: ML Model
Value Added:

  • Random Forest is an ensemble method that captures nonlinear relationships.
  • Learns the mapping between numeric match stats and match winner.
  • Model training creates the predictive engine (one way to inspect it is sketched below).
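One optional way to look inside the trained model from the original code (rf and feature_cols as above); feature_importances_ is a standard scikit-learn attribute:

import pandas as pd

# Which numeric stats the trained forest relies on most
importances = pd.Series(rf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))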


Building Block: Model Evaluation
Value Added:

  • Measures accuracy (how many winners were predicted correctly).
  • Confusion matrix shows true vs predicted class counts, giving insight into model behavior.
  • Ensures model performance is quantified before deployment (see the sketch below).
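A small sketch that labels the confusion matrix with team names, reusing index_to_team from the original code:

import pandas as pd
from sklearn.metrics import confusion_matrix

# Confusion matrix with team names on the axes instead of integer class indices
class_ids = sorted(set(y_test) | set(y_pred))
cm = confusion_matrix(y_test, y_pred, labels=class_ids)
cm_df = pd.DataFrame(
    cm,
    index=[f"actual: {index_to_team[i]}" for i in class_ids],
    columns=[f"predicted: {index_to_team[i]}" for i in class_ids],
)
display(cm_df)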

Output: Prediction Comparison

Building Block: Results Visualization / Reporting
Value Added:

  • Provides a side-by-side view of predictions vs actual winners.
  • Helps stakeholders understand model outputs.
  • Makes model results actionable for further analysis or decision-making.

 

ManojkMohan
Valued Contributor III

[Screenshot: ManojkMohan_0-1756480062735.png]

 

BS_THE_ANALYST
Esteemed Contributor

@ManojkMohan thanks for sharing this. I'm looking at starting an ML project in the coming weeks, so I might have to bring this forward 😂. Feeling motivated by that confusion matrix in your output 👍.

Congrats on getting it working!

All the best,
BS