
PicklingError: Could not pickle the task to send it to the workers.

AlexRomano
New Contributor

I am using sklearn in a Databricks notebook to fit an estimator in parallel. Sklearn uses joblib with the loky backend to do this. I have a file in Databricks from which I can import my custom Classifier, and everything works fine. However, if I literally copy the code from that file into the Databricks notebook and run it, I get the following output and error:

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 118 tasks | elapsed: 2.1s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 4.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 48.5s
[Parallel(n_jobs=-1)]: Done 182 out of 182 | elapsed: 57.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 104 tasks | elapsed: 1.8s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 3.3s finished
Fitting 4 folds for each of 1 candidates, totalling 4 fits
/databricks/python/lib/python3.5/site-packages/sklearn/model_selection/_split.py:626: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.
  % (min_groups, self.n_splits)), Warning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.

PicklingError: Could not pickle the task to send it to the workers.

Is there some different functionality in joblib when using an imported class vs defining the class in the notebook? I can provide the stack trace if it would be helpful, but the error just occurs when calling estimator.fit, where the estimator is a scikit-learn GridSearchCV.
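For context, here is a minimal sketch of the kind of setup that hits this; MyClassifier and the parameter grid are hypothetical stand-ins, not my actual code:

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV

# Toy stand-in for a custom classifier defined directly in the notebook
class MyClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, threshold=0.5):
        self.threshold = threshold

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        return self

    def predict(self, X):
        # always predict the first class; enough to exercise GridSearchCV
        return np.full(X.shape[0], self.classes_[0])

X, y = make_classification(n_samples=200, random_state=0)

# GridSearchCV hands each cross-validation fit to a loky worker process,
# so the estimator (and the notebook-defined class) has to be pickled
search = GridSearchCV(MyClassifier(), {"threshold": [0.3, 0.5]}, cv=4, n_jobs=-1)
search.fit(X, y)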

1 REPLY

Anonymous
Not applicable

Hi, aromano

I know this issue was opened almost a year ago, but I faced the same problem and was able to solve it, so I'm sharing the solution to help others.

You're probably using SparkTrials to optimize the model's hyperparameters in Databricks. In this case, you need to do three things:

1. Define two environment variables:

import os
os.environ["DATABRICKS_HOST"] = "<YOUR DATABRICKS HOST>"
os.environ["DATABRICKS_TOKEN"] = "<YOUR DATABRICKS TOKEN>"

2. Register Spark as a backend for joblib.Parallel:

from joblibspark import register_spark
register_spark()

3. Set the joblib.Parallel backend to "spark". For instance:

from joblib import Parallel

Parallel(n_jobs=-1, backend="spark")
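Putting the three steps together for a GridSearchCV like yours, a rough sketch could look like the following; the host, token, estimator, and parameter grid are placeholders, and it uses joblib's parallel_backend context manager so that GridSearchCV picks up the "spark" backend:

import os
from joblib import parallel_backend
from joblibspark import register_spark
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

os.environ["DATABRICKS_HOST"] = "<YOUR DATABRICKS HOST>"
os.environ["DATABRICKS_TOKEN"] = "<YOUR DATABRICKS TOKEN>"

register_spark()  # makes "spark" available as a joblib backend

X, y = load_iris(return_X_y=True)
search = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=4)

# run the cross-validation fits through the Spark backend instead of loky
with parallel_backend("spark", n_jobs=-1):
    search.fit(X, y)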

I hope this helps.
