Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Silver to Gold Layer | Running ML - Debug Help Needed

ManojkMohan
Valued Contributor III

Problem I am solving:

  • Reads the raw sports data (IPL CSV) → bronze layer

  • Cleans and aggregates → silver layer

  • Summarizes team stats → gold layer

  • Prepares ML-ready features and trains a Random Forest classifier to predict match winners

 

Getting error: [PARSE_SYNTAX_ERROR] Syntax error at or near end of input. SQLSTATE: 42601 when I run this code:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Ensure 'features' is array<float> and convert to a Pandas DataFrame (classic Spark API)
pdf = (
    df_ml
    .select("features", "label")
    .limit(10000)  # Optional: limit for performance
    .toPandas()  # .collect() would return a list of Rows, not a Pandas DataFrame
)

# Expand features into proper numeric columns
X = pd.DataFrame([
    np.array(f) if f is not None else np.zeros(len(pdf["features"][0]))
    for f in pdf["features"]
])
y = pdf["label"].astype(int)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

# Train Random Forest (scikit-learn)
rf = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)

# Predictions & evaluation
y_pred = rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
display(f" RandomForest trained successfully. Test Accuracy = {acc:.2f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
display(cm)

# Map back to team names
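# (team_to_index is assumed to have been defined earlier in the notebook, e.g. a
#  {team_name: integer_index} dict built when the labels were encoded)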
index_to_team = {idx: team for team, idx in team_to_index.items()}
pred_vs_actual = pd.DataFrame({
    "actual": y_test.map(index_to_team).reset_index(drop=True),
    "predicted": pd.Series(y_pred).map(index_to_team)
})
display(pred_vs_actual.head(10))    

 

 
[Screenshot: ManojkMohan_0-1756389913835.png]

 

1 ACCEPTED SOLUTION


BS_THE_ANALYST
Esteemed Contributor

Hi @ManojkMohan ,

This section here:

df_ml
    .select("features", "label")
    .limit(10000)  # Optional: limit for performance
    .toPandas()

I don't see anywhere prior to this code block where you actually create "df_ml". Has that DataFrame even been created before this point? If yes, are you certain both of those columns ("features", "label") are present in that DataFrame?
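If it helps, here's a rough sketch of how df_ml could be put together before that block runs (the gold table name and the label encoding below are just guesses, adjust them to your actual schema):

from pyspark.sql import functions as F

# Hypothetical gold table - swap in your actual table name
gold = spark.table("ipl_gold.team_stats")

# Encode the winning team as an integer label via a small lookup DataFrame
teams = [r["team1"] for r in gold.select("team1").distinct().collect()]
team_to_index = {team: i for i, team in enumerate(sorted(teams))}
labels_df = spark.createDataFrame(
    [(team, idx) for team, idx in team_to_index.items()],
    ["team1", "label"],
)

# Pack the numeric stats into an array<float> "features" column
df_ml = (
    gold
    .join(labels_df, on="team1", how="inner")
    .withColumn(
        "features",
        F.array(
            F.col("TotalRunsScored").cast("float"),
            F.col("MatchesPlayed").cast("float"),
            F.col("MaxMarginWon").cast("float"),
        ),
    )
    .select("features", "label")
)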

All the best,
BS


4 REPLIES

ManojkMohan
Valued Contributor III

@BS_THE_ANALYST Any framework recommendations for choosing which ML model to use based on the data? Here is the way I have solved the problem for now:

 Data is ingested and converted to a usable format.

  1. Features and labels define the ML problem.
  2. Validation ensures data integrity.
  3. Train/test split prepares for robust evaluation.
  4. Random Forest learns patterns in IPL team stats.
  5. Predictions and metrics evaluate model quality.
  6. Output reporting allows easy interpretation and decision support.

 

Building Block: Data Source → Pandas DataFrame
Value Added:

  • Reads historical IPL data from a “gold” table in Spark.
  • Converts it to Pandas for use with scikit-learn.
  • Provides the raw material (features + labels) needed for ML (sketched below).
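A minimal sketch of this step, assuming a hypothetical gold table name:

# Read the gold-layer team stats and bring a bounded sample onto the driver as pandas
gold_pdf = (
    spark.table("ipl_gold.team_stats")   # hypothetical table name
    .limit(10000)                        # keep the pandas DataFrame small
    .toPandas()
)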

Building Block: Feature Engineering
Value Added:

  • Selects numeric attributes (TotalRunsScored, MatchesPlayed, MaxMarginWon) as predictors.
  • Assigns the match winner (team1) as the target variable.
  • Ensures the ML model knows what to learn from and what to predict (see the sketch below).
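A sketch of this step, continuing from gold_pdf above; the team_to_index encoding is a hypothetical stand-in for however the labels were actually mapped:

# Numeric predictors named in the post above
feature_cols = ["TotalRunsScored", "MatchesPlayed", "MaxMarginWon"]
X = gold_pdf[feature_cols].astype(float)

# Encode the target (match winner, team1) as integer class indices
team_to_index = {team: i for i, team in enumerate(sorted(gold_pdf["team1"].unique()))}
y = gold_pdf["team1"].map(team_to_index).astype(int)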

Building Block: Data Quality Checks
Value Added:

  • Ensures all required features exist.
  • Warns if a team occurs only once (prevents issues with small data).
  • Improves robustness and interpretability (see the checks sketched below).
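A sketch of the kind of checks meant here, reusing gold_pdf and feature_cols from the sketches above:

# Fail fast if a required column is missing from the gold data
required_cols = feature_cols + ["team1"]
missing = [c for c in required_cols if c not in gold_pdf.columns]
if missing:
    raise ValueError(f"Missing required columns: {missing}")

# Warn if any team appears only once (stratified splitting needs >= 2 rows per class)
counts = gold_pdf["team1"].value_counts()
singletons = counts[counts < 2].index.tolist()
if singletons:
    print(f"Warning: teams with a single row: {singletons}")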

Building Block: Model Validation Setup
Value Added:

  • Splits data into training (to learn patterns) and testing (to evaluate performance).
  • Supports generalization, ensuring the model is not overfitting.
  • Stratification maintains class proportions where possible (illustrated below).
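A sketch of the "where possible" part: stratify only when every class has at least two samples, otherwise fall back to a plain split:

from sklearn.model_selection import train_test_split

# A stratified split raises an error if any class has fewer than 2 samples
can_stratify = y.value_counts().min() >= 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y if can_stratify else None,
)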


Building Block: ML Model
Value Added:

  • Random Forest is an ensemble method that captures nonlinear relationships.
  • Learns the mapping between numeric match stats and match winner.
  • Model training creates the predictive engine (one way to inspect it is sketched below).
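One optional way to look inside the trained model from the original code (rf and feature_cols as above); feature_importances_ is a standard scikit-learn attribute:

import pandas as pd

# Which numeric stats the trained forest relies on most
importances = pd.Series(rf.feature_importances_, index=feature_cols)
print(importances.sort_values(ascending=False))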


Building Block: Model Evaluation
Value Added:

  • Measures accuracy (how many winners were predicted correctly).
  • Confusion matrix shows true vs predicted class counts, giving insight into model behavior.
  • Ensures model performance is quantified before deployment (see the sketch below).
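A small sketch that labels the confusion matrix with team names, reusing index_to_team from the original code:

import pandas as pd
from sklearn.metrics import confusion_matrix

# Confusion matrix with team names on the axes instead of integer class indices
class_ids = sorted(set(y_test) | set(y_pred))
cm = confusion_matrix(y_test, y_pred, labels=class_ids)
cm_df = pd.DataFrame(
    cm,
    index=[f"actual: {index_to_team[i]}" for i in class_ids],
    columns=[f"predicted: {index_to_team[i]}" for i in class_ids],
)
display(cm_df)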

Output: Prediction Comparison

Building Block: Results Visualization / Reporting
Value Added:

  • Provides a side-by-side view of predictions vs actual winners.
  • Helps stakeholders understand model outputs.
  • Makes model results actionable for further analysis or decision-making.

 

ManojkMohan
Valued Contributor III

[Screenshot: ManojkMohan_0-1756480062735.png]

 

BS_THE_ANALYST
Esteemed Contributor

@ManojkMohan thanks for sharing this. I'm looking at starting an ML project in the coming weeks, so I might have to bring this forward 😂. Feeling motivated by that confusion matrix in your output 👍.

Congrats on getting it working!

All the best,
BS