<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Silver to Gold Layer | Running ML  - Debug Help Needed in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130190#M48725</link>
    <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;thanks for sharing this, I'm looking at starting an ML project in the coming weeks, I might have to bring this forward &lt;span class="lia-unicode-emoji" title=":face_with_tears_of_joy:"&gt;😂&lt;/span&gt;. Feeling motivated with that confusion matrix in your output &lt;span class="lia-unicode-emoji" title=":thumbs_up:"&gt;👍&lt;/span&gt;.&lt;BR /&gt;&lt;BR /&gt;Congrats on getting it working!&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
    <pubDate>Fri, 29 Aug 2025 16:53:57 GMT</pubDate>
    <dc:creator>BS_THE_ANALYST</dc:creator>
    <dc:date>2025-08-29T16:53:57Z</dc:date>
    <item>
      <title>Silver to Gold Layer | Running ML  - Debug Help Needed</title>
      <link>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130034#M48674</link>
      <description>&lt;P&gt;Problem I am solving:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;Reads the raw sports data (IPL CSV) → &lt;STRONG&gt;bronze layer&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Cleans and aggregates → &lt;STRONG&gt;silver layer&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Summarizes team stats → &lt;STRONG&gt;gold layer&lt;/STRONG&gt;&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;Prepares ML-ready features and trains a &lt;STRONG&gt;Random Forest classifier&lt;/STRONG&gt; to predict match winners&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I am getting this error:&amp;nbsp;&lt;SPAN&gt;[&lt;/SPAN&gt;&lt;A class="" href="https://docs.databricks.com/error-messages/error-classes.html#parse_syntax_error" target="_blank" rel="noopener noreferrer"&gt;PARSE_SYNTAX_ERROR&lt;/A&gt;&lt;SPAN&gt;]&lt;/SPAN&gt;&lt;SPAN&gt; Syntax error at or near end of input. SQLSTATE: 42601&amp;nbsp; when I run this code:&amp;nbsp;&lt;/SPAN&gt;&lt;/P&gt;&lt;DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; pandas &lt;/SPAN&gt;&lt;SPAN&gt;as&lt;/SPAN&gt;&lt;SPAN&gt; pd&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; numpy &lt;/SPAN&gt;&lt;SPAN&gt;as&lt;/SPAN&gt;&lt;SPAN&gt; np&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; sklearn.ensemble &lt;/SPAN&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; RandomForestClassifier&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; sklearn.metrics &lt;/SPAN&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; accuracy_score, confusion_matrix&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;from&lt;/SPAN&gt;&lt;SPAN&gt; sklearn.model_selection &lt;/SPAN&gt;&lt;SPAN&gt;import&lt;/SPAN&gt;&lt;SPAN&gt; train_test_split&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;# Ensure 'features' is array&amp;lt;float&amp;gt; and collect to Pandas (classic Spark API)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;pdf 
&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; (&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; df_ml&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;select&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;"features"&lt;/SPAN&gt;&lt;SPAN&gt;, &lt;/SPAN&gt;&lt;SPAN&gt;"label"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;limit&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;10000&lt;/SPAN&gt;&lt;SPAN&gt;) &amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;# Optional: limit for performance&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; .&lt;/SPAN&gt;&lt;SPAN&gt;collect&lt;/SPAN&gt;&lt;SPAN&gt;()&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;# Expand features into proper numeric columns&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;X &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; pd.&lt;/SPAN&gt;&lt;SPAN&gt;DataFrame&lt;/SPAN&gt;&lt;SPAN&gt;([&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; np.&lt;/SPAN&gt;&lt;SPAN&gt;array&lt;/SPAN&gt;&lt;SPAN&gt;(f) &lt;/SPAN&gt;&lt;SPAN&gt;if&lt;/SPAN&gt;&lt;SPAN&gt; f &lt;/SPAN&gt;&lt;SPAN&gt;is&lt;/SPAN&gt; &lt;SPAN&gt;not&lt;/SPAN&gt; &lt;SPAN&gt;None&lt;/SPAN&gt; &lt;SPAN&gt;else&lt;/SPAN&gt;&lt;SPAN&gt; np.&lt;/SPAN&gt;&lt;SPAN&gt;zeros&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;len&lt;/SPAN&gt;&lt;SPAN&gt;(pdf[&lt;/SPAN&gt;&lt;SPAN&gt;"features"&lt;/SPAN&gt;&lt;SPAN&gt;][&lt;/SPAN&gt;&lt;SPAN&gt;0&lt;/SPAN&gt;&lt;SPAN&gt;]))&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; f &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; pdf[&lt;/SPAN&gt;&lt;SPAN&gt;"features"&lt;/SPAN&gt;&lt;SPAN&gt;]&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;])&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;y 
&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; pdf[&lt;/SPAN&gt;&lt;SPAN&gt;"label"&lt;/SPAN&gt;&lt;SPAN&gt;].&lt;/SPAN&gt;&lt;SPAN&gt;astype&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;int&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;# Train/test split&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;X_train, X_test, y_train, y_test &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;train_test_split&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; X, y,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;test_size&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;0.2&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;random_state&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;42&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;stratify&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;y&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;# Train Random Forest (scikit-learn)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;rf &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;RandomForestClassifier&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;n_estimators&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;50&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;max_depth&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;10&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;random_state&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;42&lt;/SPAN&gt;&lt;SPAN&gt;,&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; 
&lt;/SPAN&gt;&lt;SPAN&gt;n_jobs&lt;/SPAN&gt;&lt;SPAN&gt;=-&lt;/SPAN&gt;&lt;SPAN&gt;1&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;rf.&lt;/SPAN&gt;&lt;SPAN&gt;fit&lt;/SPAN&gt;&lt;SPAN&gt;(X_train, y_train)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;# Predictions &amp;amp; evaluation&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;y_pred &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; rf.&lt;/SPAN&gt;&lt;SPAN&gt;predict&lt;/SPAN&gt;&lt;SPAN&gt;(X_test)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;acc &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;accuracy_score&lt;/SPAN&gt;&lt;SPAN&gt;(y_test, y_pred)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;f&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;span class="lia-unicode-emoji" title=":white_heavy_check_mark:"&gt;✅&lt;/span&gt; RandomForest trained successfully. Test Accuracy = &lt;/SPAN&gt;&lt;SPAN&gt;{&lt;/SPAN&gt;&lt;SPAN&gt;acc&lt;/SPAN&gt;&lt;SPAN&gt;:.2f}&lt;/SPAN&gt;&lt;SPAN&gt;"&lt;/SPAN&gt;&lt;SPAN&gt;)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;# Confusion matrix&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;cm &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt; &lt;SPAN&gt;confusion_matrix&lt;/SPAN&gt;&lt;SPAN&gt;(y_test, y_pred)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(cm)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;BR /&gt;&lt;DIV&gt;&lt;SPAN&gt;# Map back to team names&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;index_to_team &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; {idx: team &lt;/SPAN&gt;&lt;SPAN&gt;for&lt;/SPAN&gt;&lt;SPAN&gt; team, idx &lt;/SPAN&gt;&lt;SPAN&gt;in&lt;/SPAN&gt;&lt;SPAN&gt; team_to_index.&lt;/SPAN&gt;&lt;SPAN&gt;items&lt;/SPAN&gt;&lt;SPAN&gt;()}&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;pred_vs_actual &lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt; 
pd.&lt;/SPAN&gt;&lt;SPAN&gt;DataFrame&lt;/SPAN&gt;&lt;SPAN&gt;({&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;"actual"&lt;/SPAN&gt;&lt;SPAN&gt;: y_test.&lt;/SPAN&gt;&lt;SPAN&gt;map&lt;/SPAN&gt;&lt;SPAN&gt;(index_to_team).&lt;/SPAN&gt;&lt;SPAN&gt;reset_index&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;drop&lt;/SPAN&gt;&lt;SPAN&gt;=&lt;/SPAN&gt;&lt;SPAN&gt;True&lt;/SPAN&gt;&lt;SPAN&gt;),&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&amp;nbsp; &amp;nbsp; &lt;/SPAN&gt;&lt;SPAN&gt;"predicted"&lt;/SPAN&gt;&lt;SPAN&gt;: pd.&lt;/SPAN&gt;&lt;SPAN&gt;Series&lt;/SPAN&gt;&lt;SPAN&gt;(y_pred).&lt;/SPAN&gt;&lt;SPAN&gt;map&lt;/SPAN&gt;&lt;SPAN&gt;(index_to_team)&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;})&lt;/SPAN&gt;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;display&lt;/SPAN&gt;&lt;SPAN&gt;(pred_vs_actual.&lt;/SPAN&gt;&lt;SPAN&gt;head&lt;/SPAN&gt;&lt;SPAN&gt;(&lt;/SPAN&gt;&lt;SPAN&gt;10&lt;/SPAN&gt;&lt;SPAN&gt;))&amp;nbsp; &amp;nbsp;&amp;nbsp;&lt;/SPAN&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;/DIV&gt;&lt;DIV&gt;&amp;nbsp;&lt;/DIV&gt;&lt;DIV&gt;&lt;SPAN&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ManojkMohan_0-1756389913835.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19443i9FAF1371CCD04B4D/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ManojkMohan_0-1756389913835.png" alt="ManojkMohan_0-1756389913835.png" /&gt;&lt;/span&gt;&lt;/SPAN&gt;&lt;/DIV&gt;&lt;/DIV&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
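A side note on the snippet above, separate from the reported parse error: in PySpark, `collect()` returns a plain Python list of `Row` objects, while `toPandas()` returns a pandas DataFrame. The later `pdf["features"]` lookup assumes a DataFrame. The stdlib-only sketch below uses a hypothetical list of dicts as a stand-in for collected rows (no Spark involved, and the helper names are made up for illustration) to show why string indexing on a list fails and how the feature and label columns could be pulled out of such a list instead.

```python
# Hypothetical stand-in for what a Spark collect() call hands back:
# a plain Python list with one record per row (real Spark uses Row objects).
collected_rows = [
    {"features": [1.0, 2.0, 3.0], "label": 0},
    {"features": None, "label": 1},
    {"features": [4.0, 5.0, 6.0], "label": 1},
]


def string_index_fails(rows):
    """Return True if rows["features"] raises TypeError, as it does for a list."""
    try:
        rows["features"]
    except TypeError:
        return True
    return False


def split_columns(rows):
    """Build a feature matrix and a label vector from a list of row records.

    Missing feature arrays are replaced with zeros, mirroring the
    zero-fill fallback in the original snippet.
    """
    # Width is taken from the first row that actually has features.
    width = next(len(r["features"]) for r in rows if r["features"] is not None)
    features = [
        list(r["features"]) if r["features"] is not None else [0.0] * width
        for r in rows
    ]
    labels = [int(r["label"]) for r in rows]
    return features, labels


X, y = split_columns(collected_rows)
print(string_index_fails(collected_rows))
print(X)
print(y)
```

If the data is needed as a pandas DataFrame, calling `toPandas()` on the Spark DataFrame in place of `collect()` is the usual route; whether that is how the poster resolved the issue is not stated in the thread.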
      <pubDate>Thu, 28 Aug 2025 14:06:18 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130034#M48674</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-08-28T14:06:18Z</dc:date>
    </item>
    <item>
      <title>Re: Silver to Gold Layer | Running ML  - Debug Help Needed</title>
      <link>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130085#M48692</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;,&lt;BR /&gt;&lt;BR /&gt;This section here:&lt;/P&gt;&lt;LI-CODE lang="markup"&gt;df_ml
    .select("features", "label")
    .limit(10000)  # Optional: limit for performance
    .collect()&lt;/LI-CODE&gt;&lt;P&gt;I don't see anywhere prior to this code block where you actually created "df_ml". Has that dataframe been created before this point? If yes, are you certain both of those columns [&lt;STRONG&gt;"features"&lt;/STRONG&gt;&lt;SPAN&gt;,&amp;nbsp;&lt;/SPAN&gt;&lt;SPAN&gt;&lt;STRONG&gt;"label"&lt;/STRONG&gt;]&lt;/SPAN&gt;&amp;nbsp;are present in that dataframe?&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Thu, 28 Aug 2025 19:20:11 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130085#M48692</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-08-28T19:20:11Z</dc:date>
    </item>
    <item>
      <title>Re: Silver to Gold Layer | Running ML  - Debug Help Needed</title>
      <link>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130175#M48722</link>
      <description>&lt;P&gt;&amp;nbsp; &lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/146924"&gt;@BS_THE_ANALYST&lt;/a&gt;&amp;nbsp; &amp;nbsp;Any framework recommendations for choosing an ML model based on the data? Here is how I have solved the problem for now:&lt;/P&gt;&lt;P&gt;&amp;nbsp;Data is ingested and converted to a usable format.&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;Features and labels define the ML problem.&lt;/LI&gt;&lt;LI&gt;Validation ensures data integrity.&lt;/LI&gt;&lt;LI&gt;Train/test split prepares for robust evaluation.&lt;/LI&gt;&lt;LI&gt;Random Forest learns patterns in IPL team stats.&lt;/LI&gt;&lt;LI&gt;Predictions and metrics evaluate model quality.&lt;/LI&gt;&lt;LI&gt;Output reporting allows easy interpretation and decision support.&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Building Block: Data Source → Pandas DataFrame&lt;BR /&gt;Value Added:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Reads historical IPL data from a “gold” table in Spark.&lt;/LI&gt;&lt;LI&gt;Converts it to Pandas for use with scikit-learn.&lt;/LI&gt;&lt;LI&gt;Provides the raw material (features + labels) needed for ML.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Building Block: Feature Engineering&lt;BR /&gt;Value Added:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Selects numeric attributes (TotalRunsScored, MatchesPlayed, MaxMarginWon) as predictors.&lt;/LI&gt;&lt;LI&gt;Assigns the match winner (team1) as the target variable.&lt;/LI&gt;&lt;LI&gt;Ensures the ML model knows what to learn from and what to predict.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Building Block: Data Quality Checks&lt;BR /&gt;Value Added:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Ensures all required features exist.&lt;/LI&gt;&lt;LI&gt;Warns if a team occurs only once (prevents issues with small data).&lt;/LI&gt;&lt;LI&gt;Improves robustness and interpretability.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Building Block: Model Validation Setup&lt;BR /&gt;Value Added:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Splits data into training (to learn patterns) 
and testing (to evaluate performance).&lt;/LI&gt;&lt;LI&gt;Supports generalization by exposing overfitting on held-out data.&lt;/LI&gt;&lt;LI&gt;Stratification maintains class proportions where possible.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;BR /&gt;Building Block: ML Model&lt;BR /&gt;Value Added:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Random Forest is an ensemble method that captures nonlinear relationships.&lt;/LI&gt;&lt;LI&gt;Learns the mapping between numeric match stats and match winner.&lt;/LI&gt;&lt;LI&gt;Model training creates the predictive engine.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;BR /&gt;Building Block: Model Evaluation&lt;BR /&gt;Value Added:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Measures accuracy (how many winners were predicted correctly).&lt;/LI&gt;&lt;LI&gt;Confusion matrix shows true vs predicted class counts, giving insight into model behavior.&lt;/LI&gt;&lt;LI&gt;Ensures model performance is quantified before deployment.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;Output: Prediction Comparison&lt;/P&gt;&lt;P&gt;Building Block: Results Visualization / Reporting&lt;BR /&gt;Value Added:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Provides a side-by-side view of predictions vs actual winners.&lt;/LI&gt;&lt;LI&gt;Helps stakeholders understand model outputs.&lt;/LI&gt;&lt;LI&gt;Makes model results actionable for further analysis or decision-making.&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
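The data quality and model evaluation building blocks described above can be sketched without any ML library. This is a minimal illustration, not the poster's actual code: the helper names (`missing_features`, `singleton_classes`, `accuracy`, `confusion_counts`), the team labels, and the sample rows are all hypothetical; only the three feature names come from the post.

```python
from collections import Counter

# Feature names taken from the post; everything else here is illustrative.
REQUIRED_FEATURES = ["TotalRunsScored", "MatchesPlayed", "MaxMarginWon"]


def missing_features(row, required=REQUIRED_FEATURES):
    """List the required feature names that are absent from a row."""
    return [name for name in required if name not in row]


def singleton_classes(labels):
    """Return labels seen only once; a stratified split needs two or more per class."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n == 1)


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual label."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)


def confusion_counts(actual, predicted):
    """Count (actual, predicted) pairs, i.e. the cells of a confusion matrix."""
    return Counter(zip(actual, predicted))


# Hypothetical sample data, for illustration only.
rows = [
    {"TotalRunsScored": 2400, "MatchesPlayed": 14, "MaxMarginWon": 60},
    {"TotalRunsScored": 2100, "MatchesPlayed": 14},
]
actual = ["A", "A", "B", "C"]
predicted = ["A", "B", "B", "C"]

print([missing_features(r) for r in rows])
print(singleton_classes(actual))
print(accuracy(actual, predicted))
print(confusion_counts(actual, predicted)[("A", "B")])
```

In a scikit-learn workflow, `accuracy_score` and `confusion_matrix` (both imported in the original question) cover the last two helpers; the hand-rolled versions here only make the definitions explicit.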
      <pubDate>Fri, 29 Aug 2025 15:07:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130175#M48722</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-08-29T15:07:04Z</dc:date>
    </item>
    <item>
      <title>Re: Silver to Gold Layer | Running ML  - Debug Help Needed</title>
      <link>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130177#M48723</link>
      <description>&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="ManojkMohan_0-1756480062735.png" style="width: 400px;"&gt;&lt;img src="https://community.databricks.com/t5/image/serverpage/image-id/19482i1561B482A7F5717F/image-size/medium?v=v2&amp;amp;px=400" role="button" title="ManojkMohan_0-1756480062735.png" alt="ManojkMohan_0-1756480062735.png" /&gt;&lt;/span&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Fri, 29 Aug 2025 15:08:04 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130177#M48723</guid>
      <dc:creator>ManojkMohan</dc:creator>
      <dc:date>2025-08-29T15:08:04Z</dc:date>
    </item>
    <item>
      <title>Re: Silver to Gold Layer | Running ML  - Debug Help Needed</title>
      <link>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130190#M48725</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/155141"&gt;@ManojkMohan&lt;/a&gt;&amp;nbsp;thanks for sharing this, I'm looking at starting an ML project in the coming weeks, I might have to bring this forward &lt;span class="lia-unicode-emoji" title=":face_with_tears_of_joy:"&gt;😂&lt;/span&gt;. Feeling motivated with that confusion matrix in your output &lt;span class="lia-unicode-emoji" title=":thumbs_up:"&gt;👍&lt;/span&gt;.&lt;BR /&gt;&lt;BR /&gt;Congrats on getting it working!&lt;BR /&gt;&lt;BR /&gt;All the best,&lt;BR /&gt;BS&lt;/P&gt;</description>
      <pubDate>Fri, 29 Aug 2025 16:53:57 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/silver-to-gold-layer-running-ml-debug-help-needed/m-p/130190#M48725</guid>
      <dc:creator>BS_THE_ANALYST</dc:creator>
      <dc:date>2025-08-29T16:53:57Z</dc:date>
    </item>
  </channel>
</rss>

