08-28-2025 07:06 AM
Problem I am solving:
Reads the raw IPL sports data CSV → bronze layer
Cleans and aggregates → silver layer
Summarizes team stats → gold layer
Prepares ML-ready features and trains a Random Forest classifier to predict match winners
I get the error [PARSE_SYNTAX_ERROR] Syntax error at or near end of input. SQLSTATE: 42601 when I run this code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
# Ensure 'features' is array<float>, then convert to a pandas DataFrame.
# .collect() would return a list of Row objects, which cannot be indexed as pdf["features"].
pdf = (
    df_ml
    .select("features", "label")
    .limit(10000)  # Optional: limit for performance
    .toPandas()
)
# Expand features into proper numeric columns
X = pd.DataFrame([
    np.array(f) if f is not None else np.zeros(len(pdf["features"][0]))
    for f in pdf["features"]
])
y = pdf["label"].astype(int)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
# Train Random Forest (scikit-learn)
rf = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
# Predictions & evaluation
y_pred = rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
# print() is used for plain Python values; display() is mainly intended for DataFrames and visualizations
print(f"✅ RandomForest trained successfully. Test Accuracy = {acc:.2f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)
# Map back to team names
index_to_team = {idx: team for team, idx in team_to_index.items()}
pred_vs_actual = pd.DataFrame({
    "actual": y_test.map(index_to_team).reset_index(drop=True),
    "predicted": pd.Series(y_pred).map(index_to_team)
})
display(pred_vs_actual.head(10))
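One fragile spot in the feature-expansion step is that the zero-fill width is read from the first row, which fails if that row happens to be missing. Below is a minimal sketch of a more defensive version, using only the standard library. The helper name expand_features and the sample data are hypothetical and not part of my notebook; it assumes missing feature rows arrive as None.

```python
from typing import List, Optional, Sequence


def expand_features(rows: Sequence[Optional[Sequence[float]]]) -> List[List[float]]:
    """Turn a column of per-row feature arrays into equal-width float rows.

    Missing rows (None) become all-zero rows. The width is taken from the
    first non-missing row, so a None in row 0 does not break the lookup.
    """
    width = next((len(r) for r in rows if r is not None), 0)
    expanded = []
    for r in rows:
        if r is None:
            # Assumption: a missing feature vector is replaced by zeros
            expanded.append([0.0] * width)
        else:
            if len(r) != width:
                raise ValueError(f"expected {width} features, got {len(r)}")
            expanded.append([float(v) for v in r])
    return expanded


# Hypothetical sample standing in for pdf["features"]
sample = [None, [1, 2.5, 3], [4, 5, 6]]
matrix = expand_features(sample)
print(matrix)
```

If this behaves as expected, the result could presumably be passed straight to pd.DataFrame(expand_features(pdf["features"])) in place of the list comprehension above.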