Machine learning accuracy depends on execution plans
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-16-2024 05:17 AM - edited 05-16-2024 05:19 AM
I'm using Databricks for a machine learning project -- a fairly standard text classification problem, where I want to use the description of an item (i.e. AXELTTNING KOLKERAMIK MM) to predict which of n product categories the item belongs to ('Bushings', 'Adaptors', 'Sealings', etc.). My strategy basically involves transforming the text into sparse vectors using tokenization and the TDF-IF algorithm, and then fitting a model using logistic regression.
On my first attempt, I did everything in a single Databricks notebook -- data cleaning, data transformation, splitting into test/train data sets, and model training. Fitting the model takes several minutes (on a dataset with ~4500 lines), but the model predicts really well, with a accuracy of about 75% (good considering the quality of my data).
Now, to clean up my workspace, I split the code into several notebooks -- one for data cleaning, one for data transformation, on for model fitting and evalutation. Each notebook ends with a
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
05-16-2024 06:19 AM
that is weird.
The regression algorithm should just do a prediction on a dataframe. Such a huge difference in accuracy seems very suspicious.
I would test the algorithm on a reference dataset, for which you know the accuracy beforehand.
Perhaps your transform script in the initial notebook interferes with the model itself, but that seems strange.

