<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic PicklingError: Could not pickle the task to send it to the workers. in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/picklingerror-could-not-pickle-the-task-to-send-it-to-the/m-p/27887#M19730</link>
    <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I am using sklearn in a Databricks notebook to fit an estimator in parallel. Sklearn uses joblib with the loky backend to do this. I have a file in Databricks from which I can import my custom Classifier, and everything works fine. However, if I copy the code from that file verbatim into the Databricks notebook and run it, I get the following output and error:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 118 tasks | elapsed: 2.1s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 4.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 48.5s
[Parallel(n_jobs=-1)]: Done 182 out of 182 | elapsed: 57.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 104 tasks | elapsed: 1.8s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 3.3s finished
Fitting 4 folds for each of 1 candidates, totalling 4 fits
/databricks/python/lib/python3.5/site-packages/sklearn/model_selection/_split.py:626: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.
  % (min_groups, self.n_splits)), Warning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt; 
&lt;P&gt;PicklingError: Could not pickle the task to send it to the workers. &lt;/P&gt;
&lt;P&gt;Does joblib behave differently when a class is imported than when it is defined in the notebook? I can provide the stack trace if it would be helpful, but the error occurs when calling estimator.fit, where the estimator is a scikit-learn GridSearchCV.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Thu, 15 Aug 2019 16:54:41 GMT</pubDate>
    <dc:creator>AlexRomano</dc:creator>
    <dc:date>2019-08-15T16:54:41Z</dc:date>
    <item>
      <title>PicklingError: Could not pickle the task to send it to the workers.</title>
      <link>https://community.databricks.com/t5/data-engineering/picklingerror-could-not-pickle-the-task-to-send-it-to-the/m-p/27887#M19730</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt;I am using sklearn in a Databricks notebook to fit an estimator in parallel. Sklearn uses joblib with the loky backend to do this. I have a file in Databricks from which I can import my custom Classifier, and everything works fine. However, if I copy the code from that file verbatim into the Databricks notebook and run it, I get the following output and error:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 118 tasks | elapsed: 2.1s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 4.3s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 33 tasks | elapsed: 6.5s
[Parallel(n_jobs=-1)]: Done 154 tasks | elapsed: 48.5s
[Parallel(n_jobs=-1)]: Done 182 out of 182 | elapsed: 57.4s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done 104 tasks | elapsed: 1.8s
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 3.3s finished
Fitting 4 folds for each of 1 candidates, totalling 4 fits
/databricks/python/lib/python3.5/site-packages/sklearn/model_selection/_split.py:626: Warning: The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=4.
  % (min_groups, self.n_splits)), Warning)
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;/P&gt; 
&lt;P&gt;PicklingError: Could not pickle the task to send it to the workers. &lt;/P&gt;
&lt;P&gt;Does joblib behave differently when a class is imported than when it is defined in the notebook? I can provide the stack trace if it would be helpful, but the error occurs when calling estimator.fit, where the estimator is a scikit-learn GridSearchCV.&lt;/P&gt;
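&lt;P&gt;For context on the imported-versus-notebook question, here is a minimal sketch using only the standard library pickle module. It shows that pickle stores a class by reference (module name plus class name) rather than by copying its code, so unpickling only works where that module can be imported. The helper names and the UnreachableClassifier class with its made-up module name are hypothetical, added purely for illustration. This is one possible explanation and is not confirmed by the stack trace: the loky backend relies on cloudpickle, which normally handles notebook-defined classes, so the actual cause here may be different.&lt;/P&gt;

```python
import collections
import pickle


def pickled_class_reference(cls):
    """Pickle a class and report whether the payload names its module and class."""
    payload = pickle.dumps(cls)
    return {
        "module": cls.__module__,
        "name": cls.__qualname__,
        "size": len(payload),
        "has_module_name": cls.__module__.encode() in payload,
        "has_class_name": cls.__qualname__.encode() in payload,
    }


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError):
        return False


# Hypothetical class that claims to live in a module that cannot be imported.
# This only imitates a class that other processes cannot look up by name.
UnreachableClassifier = type(
    "UnreachableClassifier", (), {"__module__": "module_that_does_not_exist"}
)

# An importable class is stored as a short reference, not as source code.
info = pickled_class_reference(collections.OrderedDict)
print(info)

# The importable class pickles; the unreachable one does not.
print(can_pickle(collections.OrderedDict))
print(can_pickle(UnreachableClassifier))
```

&lt;P&gt;If this is the cause, keeping the classifier in an importable file, as in the working setup described above, avoids the lookup problem.&lt;/P&gt;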
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 15 Aug 2019 16:54:41 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/picklingerror-could-not-pickle-the-task-to-send-it-to-the/m-p/27887#M19730</guid>
      <dc:creator>AlexRomano</dc:creator>
      <dc:date>2019-08-15T16:54:41Z</dc:date>
    </item>
    <item>
      <title>Re: PicklingError: Could not pickle the task to send it to the workers.</title>
      <link>https://community.databricks.com/t5/data-engineering/picklingerror-could-not-pickle-the-task-to-send-it-to-the/m-p/27888#M19731</link>
      <description>&lt;P&gt;&lt;/P&gt;
&lt;P&gt; Hi, aromano&lt;/P&gt;
&lt;P&gt; I know this issue was opened almost a year ago, but I faced the same problem and I was able to solve it. So, I'm sharing the solution in order to help others.&lt;/P&gt;
&lt;P&gt; You are probably using SparkTrials to optimize the model's hyperparameters in Databricks. In that case, you need to do three things:&lt;/P&gt;
&lt;P&gt;&lt;B&gt;1. Define two environment variables:&lt;/B&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;import os
os.environ["DATABRICKS_HOST"] = "&amp;lt;YOUR DATABRICKS HOST&amp;gt;"
os.environ["DATABRICKS_TOKEN"] = "&amp;lt;YOUR DATABRICKS TOKEN&amp;gt;"
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;B&gt;2. Register Spark as a backend for joblib.Parallel:&lt;/B&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from joblibspark import register_spark

register_spark()  # makes the "spark" backend available to joblib
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;&lt;B&gt;3. Set the joblib.Parallel backend to "spark".&lt;/B&gt; For instance:&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;from joblib import Parallel

Parallel(n_jobs=-1, backend="spark")
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;I hope it helps.&lt;/P&gt;
&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 16 Jul 2020 21:06:42 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/picklingerror-could-not-pickle-the-task-to-send-it-to-the/m-p/27888#M19731</guid>
      <dc:creator>Anonymous</dc:creator>
      <dc:date>2020-07-16T21:06:42Z</dc:date>
    </item>
  </channel>
</rss>

