Showing results for 
Search instead for 
Did you mean: 
Machine Learning
Dive into the world of machine learning on the Databricks platform. Explore discussions on algorithms, model training, deployment, and more. Connect with ML enthusiasts and experts.
Showing results for 
Search instead for 
Did you mean: 

Machine learning accuracy depends on execution plans

New Contributor

I'm using Databricks for a machine learning project -- a fairly standard text classification problem, where I want to use the description of an item (i.e. AXELTTNING KOLKERAMIK MM) to predict which of n product categories the item belongs to ('Bushings', 'Adaptors', 'Sealings', etc.). My strategy basically involves transforming the text into sparse vectors using tokenization and the TDF-IF algorithm, and then fitting a model using logistic regression.

On my first attempt, I did everything in a single Databricks notebook -- data cleaning, data transformation, splitting into test/train data sets, and model training. Fitting the model takes several minutes (on a dataset with ~4500 lines), but the model predicts really well, with a accuracy of about 75% (good considering the quality of my data).

Now, to clean up my workspace, I split the code into several notebooks -- one for data cleaning, one for data transformation, on for model fitting and evalutation. Each notebook ends with a 

and the next notebook then begins by reading this table. Otherwise, the code is copied line-by-line from the first, big notebook. Here's where it gets strange: If I run the notebook that just reads transformed, clenased data from a table in the catalog and proceeds with the model training, the training is much faster (less than a minute), but the results are poor (accuracy of ~35%).  
I can somehow explain the difference in training time by looking at the execution plans for the two datasets: If I have all my work in a single notebook, the execution plan is rather complex, and maybe that messes with the regression algorithm. On the other hand, if I read the data from a table and proceed directly to the model training, the execution plan is very simple. But that does not explain the huge difference in the performance of the model. I've checked at double checked that the data sets are the same in the two scenarios, so the difference is not caused by random seeds when splitting data or anything of that sort.

Esteemed Contributor III

that is weird.
The regression algorithm should just do a prediction on a dataframe.  Such a huge difference in accuracy seems very suspicious.
I would test the algorithm on a reference dataset, for which you know the accuracy beforehand.
Perhaps your transform script in the initial notebook interferes with the model itself, but that seems strange.

Join 100K+ Data Experts: Register Now & Grow with Us!

Excited to expand your horizons with us? Click here to Register and begin your journey to success!

Already a member? Login and join your local regional user group! If there isn’t one near you, fill out this form and we’ll create one for you to join!