<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: [Pyspark.Pandas]   PicklingError: Could not serialize object (this error is happening only for large datasets) in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32045#M23362</link>
    <description>&lt;P&gt;Hi @werners, thanks for your response.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;As a beginner, I would like to use pyspark.pandas as a plug-and-play replacement, by converting my classic pandas code to pyspark.pandas.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Would you know why I am getting the error mentioned in the question?&lt;/P&gt;&lt;P&gt;It's really peculiar that it happens only with larger datasets.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Do you recommend I raise an issue with Databricks?&lt;/P&gt;</description>
    <pubDate>Tue, 13 Sep 2022 18:24:02 GMT</pubDate>
    <dc:creator>KrishZ</dc:creator>
    <dc:date>2022-09-13T18:24:02Z</dc:date>
    <item>
      <title>[Pyspark.Pandas]   PicklingError: Could not serialize object (this error is happening only for large datasets)</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32043#M23360</link>
      <description>&lt;P&gt;&lt;B&gt;Context:&lt;/B&gt; I am using &lt;A href="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html" alt="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html" target="_blank"&gt;pyspark.pandas&lt;/A&gt; in a Databricks Jupyter notebook and doing some text manipulation within the dataframe.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html" alt="https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html" target="_blank"&gt;pyspark.pandas&lt;/A&gt; is the pandas API on Spark and can be used exactly like the usual &lt;A href="https://pandas.pydata.org/docs/user_guide/10min.html" alt="https://pandas.pydata.org/docs/user_guide/10min.html" target="_blank"&gt;pandas&lt;/A&gt;.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Error: &lt;/B&gt;&lt;I&gt;&lt;U&gt;PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object&lt;/U&gt;&lt;/I&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Some clues that may help you understand the error:&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I do not get any error if I run my script on:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;300 rows of data&lt;/LI&gt;&lt;LI&gt;600 rows of data (created by replicating the original 300 x2)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I get an error if I run my script on:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;3000 rows of data (created by replicating the original 300 x10)&lt;/LI&gt;&lt;LI&gt;3000 unique rows (not related to the original 300 rows)&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This makes me think the error is not code-specific; rather, Databricks/pyspark.pandas might have an intricacy, limitation, or bug that appears with a higher number of rows (3000 in my case).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Can somebody please explain why I am getting this error and how to resolve it?&lt;/P&gt;&lt;P&gt;I would appreciate it if we could stick to pyspark.pandas and not go into alternatives that suggest using Spark SQL.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;If you want to look at the full stack trace for the error, you can check the code snippet in this &lt;A href="https://stackoverflow.com/questions/73469453/picklingerror-could-not-serialize-object-happens-only-for-large-datasets" alt="https://stackoverflow.com/questions/73469453/picklingerror-could-not-serialize-object-happens-only-for-large-datasets" target="_blank"&gt;question&lt;/A&gt;, which I have asked.&lt;/P&gt;</description>
      <pubDate>Sun, 11 Sep 2022 14:49:10 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32043#M23360</guid>
      <dc:creator>KrishZ</dc:creator>
      <dc:date>2022-09-11T14:49:10Z</dc:date>
    </item>
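A plausible explanation for the size threshold, offered here as an editorial note rather than a confirmed diagnosis: pandas-on-Spark has a `compute.shortcut_limit` option (default 1000 rows) below which some operations run through plain pandas on the driver, so nothing has to be pickled. Above that threshold the work is distributed, and the applied function plus everything it closes over must be serialized; if the closure captures an unpicklable object (a lock, a logger, a client, a SparkContext), you get exactly this TypeError. A minimal standard-library reproduction of the pickling failure, with a hypothetical `TextProcessor` helper standing in for whatever the real code captures (no Spark required):

```python
import pickle
import threading


class TextProcessor:
    """Hypothetical helper: holds an RLock, which pickle cannot serialize."""

    def __init__(self):
        # Locks like this commonly hide inside loggers and client objects
        self.lock = threading.RLock()

    def clean(self, text: str) -> str:
        with self.lock:
            return text.strip().lower()


processor = TextProcessor()
try:
    # This is what Spark must do before shipping a function to executors
    pickle.dumps(processor)
except TypeError as exc:
    print(exc)  # cannot pickle '_thread.RLock' object
```

If this diagnosis applies, the fix is to keep such objects out of anything passed to `apply`/UDFs, not to raise `compute.shortcut_limit` (which would only hide the problem on small data).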
    <item>
      <title>Re: [Pyspark.Pandas]   PicklingError: Could not serialize object (this error is happening only for large datasets)</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32044#M23361</link>
      <description>&lt;P&gt;pyspark.pandas can indeed be used instead of classic pandas. That being said, there are some caveats. If you just copy/paste your Python code, chances are you are not using the parallel processing capabilities of Spark (distributed computing). For example, iterating over the values of a structure will still execute on a single node.&lt;/P&gt;&lt;P&gt;So to be able to scale out your code, you should use PySpark (the pandas API is fine) and think in vectors/sets instead of records.&lt;/P&gt;&lt;P&gt;Instead of iterating over records, try to apply a function to the whole column in one go. That is why Spark was invented.&lt;/P&gt;</description>
      <pubDate>Tue, 13 Sep 2022 09:52:05 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32044#M23361</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-09-13T09:52:05Z</dc:date>
    </item>
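To illustrate the answer above, here is a sketch of the record-by-record style versus the vectorized style. It is shown with plain pandas, since pyspark.pandas (`import pyspark.pandas as ps`) mirrors this API; the DataFrame contents are made up for the example:

```python
import pandas as pd  # pyspark.pandas mirrors this same API

df = pd.DataFrame({"text": ["Foo Bar", "Baz Qux Quux"]})

# Record-by-record: under Spark this style pulls the work onto a single node
word_counts_loop = [len(t.split()) for t in df["text"]]

# Vectorized: one column-level operation Spark can split across partitions
df["n_words"] = df["text"].str.split().str.len()

print(word_counts_loop)        # [2, 3]
print(df["n_words"].tolist())  # [2, 3]
```

Both produce the same result; the difference only matters once the data is large enough to be distributed.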
    <item>
      <title>Re: [Pyspark.Pandas]   PicklingError: Could not serialize object (this error is happening only for large datasets)</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32045#M23362</link>
      <description>&lt;P&gt;Hi @werners, thanks for your response.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;As a beginner, I would like to use pyspark.pandas as a plug-and-play replacement, by converting my classic pandas code to pyspark.pandas.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Would you know why I am getting the error mentioned in the question?&lt;/P&gt;&lt;P&gt;It's really peculiar that it happens only with larger datasets.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Do you recommend I raise an issue with Databricks?&lt;/P&gt;</description>
      <pubDate>Tue, 13 Sep 2022 18:24:02 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32045#M23362</guid>
      <dc:creator>KrishZ</dc:creator>
      <dc:date>2022-09-13T18:24:02Z</dc:date>
    </item>
    <item>
      <title>Re: [Pyspark.Pandas]   PicklingError: Could not serialize object (this error is happening only for large datasets)</title>
      <link>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32047#M23364</link>
      <description>&lt;P&gt;@Krishna Zanwar&amp;nbsp;, I'm receiving the same error.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;For me, the behavior appears when broadcasting a random forest (sklearn 1.2.0) recently loaded from MLflow, and using a pandas UDF to generate predictions.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;However, the same code works perfectly on Spark 2.4 + our on-prem cluster.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I thought it was due to the Spark 2.4 to 3 changes, and probably some breaking changes in the pandas UDF API; however, I've changed to the newer template and the same behavior still happens.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructField, IntegerType, FloatType

schema = input_df.drop("ID").schema
# StructField takes (name, dataType, nullable); nullable belongs inside the field
schema.add(StructField("predict", IntegerType(), False))
schema.add(StructField("predict_proba", FloatType(), False))


@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def udf_predict(df: pd.DataFrame) -&amp;gt; pd.DataFrame:
    df = df[list(input_df.drop("ID").columns)]
    y_pred = broadcasted_model.value.predict(df)
    if hasattr(broadcasted_model.value, "predict_proba"):
        prediction_scores = broadcasted_model.value.predict_proba(df)
        df["predict_proba"] = prediction_scores[:, -1]
    df["predict"] = y_pred
    return df&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The code above is fairly simple, without any secrets, and I'm getting the same error as KrishZ.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;in predict_with_regression
    output_df = partitioned_df.groupby("scoring_partition").apply(udf_predict)
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/sql/pandas/group_ops.py", line 86, in apply
    return self.applyInPandas(udf.func, schema=udf.returnType)
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/sql/pandas/group_ops.py", line 201, in applyInPandas
    udf_column = udf(*[df[col] for col in df.columns])
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/sql/udf.py", line 199, in wrapper
    return self(*args)
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/sql/udf.py", line 177, in __call__
    judf = self._judf
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/sql/udf.py", line 161, in _judf
    self._judf_placeholder = self._create_judf()
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/sql/udf.py", line 170, in _create_judf
    wrapped_func = _wrap_function(sc, self.func, self.returnType)
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/sql/udf.py", line 34, in _wrap_function
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/rdd.py", line 2850, in _prepare_for_python_RDD
    pickled_command = ser.dumps(command)
  File "/opt/conda/envs/kedro-new-runtime/lib/python3.8/site-packages/pyspark/serializers.py", line 483, in dumps
    raise pickle.PicklingError(msg)
_pickle.PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Sun, 15 Jan 2023 05:06:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/pyspark-pandas-picklingerror-could-not-serialize-object-this/m-p/32047#M23364</guid>
      <dc:creator>ryojikn</dc:creator>
      <dc:date>2023-01-15T05:06:21Z</dc:date>
    </item>
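A pattern that often sidesteps this class of error, offered as a hedged sketch rather than a confirmed fix for the post above: keep anything unpicklable out of the UDF's closure, and load the model lazily once per worker instead of broadcasting an object whose attributes (an MLflow client, a logger) may drag a lock along. `load_fn` below stands in for whatever actually loads the estimator, e.g. an `mlflow.sklearn.load_model(model_uri)` call; all names are illustrative:

```python
def make_lazy_loader(load_fn):
    """Wrap a model-loading function so the model is loaded at most once
    per worker process, on first use, instead of being captured in the
    UDF closure. Only `load_fn` itself needs to be picklable."""
    cache = {}

    def get_model():
        if "model" not in cache:
            cache["model"] = load_fn()
        return cache["model"]

    return get_model


# Hypothetical usage inside a pandas UDF (names assumed, not from the post):
#   get_model = make_lazy_loader(lambda: mlflow.sklearn.load_model(model_uri))
#   def udf_predict(pdf):
#       return pdf.assign(predict=get_model().predict(pdf))
```

The trade-off is one model load per worker process rather than one broadcast, which is usually acceptable for a single sklearn estimator.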
  </channel>
</rss>

