Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.

Machine learning accuracy depends on execution plans

ThomasSvane
New Contributor

I'm using Databricks for a machine learning project -- a fairly standard text classification problem, where I want to use the description of an item (e.g. AXELTTNING KOLKERAMIK MM) to predict which of n product categories the item belongs to ('Bushings', 'Adaptors', 'Sealings', etc.). My strategy is basically to transform the text into sparse vectors using tokenization and TF-IDF, and then fit a logistic regression model.
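For reference, the pipeline looks roughly like this (a simplified sketch -- the column names description and category are placeholders, not my actual schema):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

# Tokenize the description, hash the tokens into term frequencies,
# rescale with IDF, and index the string labels for the classifier
tokenizer = Tokenizer(inputCol="description", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="raw_features")
idf = IDF(inputCol="raw_features", outputCol="features")
indexer = StringIndexer(inputCol="category", outputCol="label")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[tokenizer, tf, idf, indexer, lr])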

On my first attempt, I did everything in a single Databricks notebook -- data cleaning, data transformation, splitting into train/test data sets, and model training. Fitting the model takes several minutes (on a dataset with ~4,500 rows), but the model predicts really well, with an accuracy of about 75% (good considering the quality of my data).
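The training and evaluation at the end of that notebook is more or less the following (reusing the pipeline sketched above; prepared_df, the split ratio and the seed are simplified placeholders):

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Split the prepared data, fit the pipeline and measure accuracy on the hold-out set
train_df, test_df = prepared_df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train_df)
predictions = model.transform(test_df)

evaluator = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy")
print(evaluator.evaluate(predictions))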

Now, to clean up my workspace, I split the code into several notebooks -- one for data cleaning, one for data transformation, and one for model fitting and evaluation. Each notebook ends with a

 
df.write.mode("overwrite").saveAsTable('tablename')
 
and the next notebook then begins by reading this table. Otherwise, the code is copied line by line from the first, big notebook. Here's where it gets strange: if I run the notebook that just reads the transformed, cleansed data from a table in the catalog and proceeds with the model training, the training is much faster (less than a minute), but the results are poor (accuracy of ~35%).
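For completeness, the model-training notebook starts from something like this (same table name as in the snippet above; the checks are just sanity steps):

# Read the cleansed/transformed data back from the catalog table
cleansed_df = spark.read.table("tablename")

# Sanity check: same schema and same row count as the dataframe that was written
cleansed_df.printSchema()
print(cleansed_df.count())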
 
I can somewhat explain the difference in training time by looking at the execution plans for the two datasets: if I have all my work in a single notebook, the execution plan is rather complex, and maybe that messes with the regression algorithm. On the other hand, if I read the data from a table and proceed directly to the model training, the execution plan is very simple. But that does not explain the huge difference in the model's performance. I've checked and double-checked that the data sets are the same in the two scenarios, so the difference is not caused by random seeds when splitting the data or anything of that sort.
1 REPLY

-werners-
Esteemed Contributor III

That is weird.
The regression algorithm should just do a prediction on a dataframe. Such a huge difference in accuracy seems very suspicious.
I would test the algorithm on a reference dataset for which you know the accuracy beforehand.
Perhaps your transform script in the initial notebook interferes with the model itself, but that seems strange.
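Something along these lines (all names here are placeholders, and the pipeline is whatever you are fitting) would show whether the table round trip itself changes the numbers on a reference dataset:

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def score(df, pipeline, seed=42):
    # Fit and score the pipeline on a fixed split of the given dataframe
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=seed)
    predictions = pipeline.fit(train_df).transform(test_df)
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    return evaluator.evaluate(predictions)

# Accuracy on the reference data kept in memory vs. after a table round trip
direct_accuracy = score(reference_df, pipeline)
reference_df.write.mode("overwrite").saveAsTable("reference_table")
roundtrip_accuracy = score(spark.read.table("reference_table"), pipeline)
print(direct_accuracy, roundtrip_accuracy)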
