I'm using Databricks for a machine learning project -- a fairly standard text classification problem, where I want to use the description of an item (e.g. AXELTTNING KOLKERAMIK MM) to predict which of n product categories the item belongs to ('Bushings', 'Adaptors', 'Sealings', etc.). My strategy is basically to transform the text into sparse vectors using tokenization and the TF-IDF algorithm, and then fit a model using logistic regression.
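For reference, the TF-IDF weighting itself is not Spark-specific. Here is a toy sketch in plain Python -- the documents are made up, and I'm using the smoothed IDF formula log((n+1)/(df+1)), which is what Spark ML's IDF implements:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute smoothed TF-IDF weights for a list of tokenized documents."""
    n = len(docs)
    # document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({
            # smoothed IDF: log((n + 1) / (df + 1)), as in Spark ML's IDF
            term: count * math.log((n + 1) / (df[term] + 1))
            for term, count in tf.items()
        })
    return weighted

# toy documents standing in for tokenized item descriptions
docs = [["bushing", "steel"], ["adaptor", "steel"], ["sealing", "rubber"]]
weights = tf_idf(docs)
# "steel" appears in 2 of 3 documents, so it is down-weighted relative to "rubber"
```

The point of the weighting is exactly that down-weighting: terms that occur in many item descriptions carry less signal for the category than rare ones.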
On my first attempt, I did everything in a single Databricks notebook -- data cleaning, data transformation, splitting into test/train data sets, and model training. Fitting the model takes several minutes (on a dataset with ~4500 rows), but the model predicts really well, with an accuracy of about 75% (good considering the quality of my data).
Now, to clean up my workspace, I split the code into several notebooks -- one for data cleaning, one for data transformation, one for model fitting and evaluation. Each notebook ends with a
df.write.mode("overwrite").saveAsTable('tablename')
and the next notebook then begins by reading this table. Otherwise, the code is copied line by line from the first, big notebook. Here's where it gets strange: if I run the notebook that just reads the transformed, cleansed data from a table in the catalog and proceeds with the model training, the training is much faster (less than a minute), but the results are poor (accuracy of ~35%).
I can partly explain the difference in training time by looking at the execution plans for the two datasets: if all my work is in a single notebook, the execution plan is rather complex, and maybe that messes with the regression algorithm. On the other hand, if I read the data from a table and proceed directly to the model training, the execution plan is very simple. But that does not explain the huge difference in the performance of the model. I've checked and double-checked that the data sets are the same in both scenarios, so the difference is not caused by random seeds when splitting the data or anything of that sort.
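To be concrete about what "the same" means here, the check I have in mind is along these lines -- a plain-Python, order-insensitive comparison on collected rows (the sample rows below are made up; tables can come back in a different row order, so a sorted comparison rules that out):

```python
def datasets_match(rows_a, rows_b):
    """Compare two row collections, ignoring row order."""
    return sorted(map(tuple, rows_a)) == sorted(map(tuple, rows_b))

# hypothetical sample rows of (description, category)
a = [("AXELTTNING KOLKERAMIK MM", "Bushings"), ("O-RING NBR 70", "Sealings")]
b = [("O-RING NBR 70", "Sealings"), ("AXELTTNING KOLKERAMIK MM", "Bushings")]

same = datasets_match(a, b)  # True: same rows, different order
```

A check like this passes for my two scenarios, which is why I'm confident the raw rows are identical on both paths.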