<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Performance for pyspark dataframe is very slow after using a @pandas_udf in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24102#M16725</link>
    <description>&lt;P&gt;Databricks Runtime Version: 10.3 ML (includes Apache Spark 3.2.1, Scala 2.12)&lt;/P&gt;</description>
    <pubDate>Thu, 31 Mar 2022 12:33:31 GMT</pubDate>
    <dc:creator>RRO</dc:creator>
    <dc:date>2022-03-31T12:33:31Z</dc:date>
    <item>
      <title>Performance for pyspark dataframe is very slow after using a @pandas_udf</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24098#M16721</link>
      <description>&lt;P&gt;Hello,&lt;/P&gt;&lt;P&gt;I am currently working on time series forecasting with FBProphet. Since my data contains many time series groups (~3000), I use a @pandas_udf to parallelize the training.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def forecast_netprofit(prophtrain):
    ...
    return results_pd

time_series_id_column_names = ['Grp1', 'Grp2', 'Grp3']

results = (prophtrain
           .groupby(time_series_id_column_names)
           .apply(forecast_netprofit)
          )&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Now every time I display or run operations on the &lt;I&gt;results &lt;/I&gt;dataframe, the performance is very poor. For example, just displaying the first 1000 rows takes around 6 minutes.&lt;/P&gt;&lt;P&gt;Is there a reason why working with the results is so slow, and can I fix that somehow?&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2022 10:12:14 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24098#M16721</guid>
      <dc:creator>RRO</dc:creator>
      <dc:date>2022-03-31T10:12:14Z</dc:date>
    </item>
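A note on the grouped-map pattern in the question above: Spark splits the DataFrame by the group-key columns, calls the decorated function once per group with that group as a plain pandas DataFrame, and concatenates the per-group results. A minimal pandas-only sketch of those semantics (the `make_forecast` body and the `y`/`yhat` columns are hypothetical stand-ins for the Prophet training step, so the sketch runs without a Spark cluster):

```python
import pandas as pd

def make_forecast(key, group_df):
    # Hypothetical stand-in for the Prophet training/forecast step:
    # one group in, one small result frame out (its columns must match
    # the schema declared on the @pandas_udf in the post above).
    return pd.DataFrame({"Grp1": [key], "yhat": [group_df["y"].mean()]})

df = pd.DataFrame({
    "Grp1": ["a", "a", "b"],
    "y":    [1.0, 3.0, 5.0],
})

# This mirrors what Spark does with a grouped-map pandas UDF:
# split by the key column(s), call the function once per group,
# and concatenate the per-group results back into one frame.
parts = [make_forecast(key, grp) for key, grp in df.groupby("Grp1")]
results = pd.concat(parts, ignore_index=True)
print(results)
```

Because Spark evaluates lazily, this whole per-group pipeline re-runs on every action (such as `display`), which is one plausible reason the thread's 6-minute display times recur.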
    <item>
      <title>Re: Performance for pyspark dataframe is very slow after using a @pandas_udf</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24099#M16722</link>
      <description>&lt;P&gt;Spark runs the function on the whole dataset in the background and then returns the first 1000 rows of it, so the delay might come from that rather than from the function itself.&lt;/P&gt;&lt;P&gt;You can test this by, for example, starting with a dataset of 1000 records and applying the function to that.&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2022 10:59:43 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24099#M16722</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-31T10:59:43Z</dc:date>
    </item>
    <item>
      <title>Re: Performance for pyspark dataframe is very slow after using a @pandas_udf</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24100#M16723</link>
      <description>&lt;P&gt;Alright, the dataset has around 80,000 rows and 12 columns, so it should not be too much to handle. I have other datasets that are bigger than this one and can be displayed within seconds. That is why I think it might somehow be related to the function...&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2022 12:24:40 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24100#M16723</guid>
      <dc:creator>RRO</dc:creator>
      <dc:date>2022-03-31T12:24:40Z</dc:date>
    </item>
    <item>
      <title>Re: Performance for pyspark dataframe is very slow after using a @pandas_udf</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24101#M16724</link>
      <description>&lt;P&gt;Could be, although it should use Arrow these days.&lt;/P&gt;&lt;P&gt;What version of Spark do you use?&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2022 12:31:23 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24101#M16724</guid>
      <dc:creator>-werners-</dc:creator>
      <dc:date>2022-03-31T12:31:23Z</dc:date>
    </item>
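For reference on the Arrow point above: pandas UDFs transfer data between the JVM and Python workers via Arrow by design, while Arrow-accelerated `toPandas()`/`createDataFrame()` conversions are governed by a separate Spark conf. A hedged sketch, assuming a live SparkSession bound to the name `spark` (as in a Databricks notebook, so this fragment is not runnable standalone):

```python
# Assumes an existing SparkSession named `spark` (Databricks provides one).
# pandas UDFs always use Arrow for JVM-to-Python transfer; this conf only
# governs the separate toPandas()/createDataFrame() conversion path.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
print(spark.version)  # the runtime in this thread reports 3.2.1
```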
    <item>
      <title>Re: Performance for pyspark dataframe is very slow after using a @pandas_udf</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24102#M16725</link>
      <description>&lt;P&gt;Databricks Runtime Version: 10.3 ML (includes Apache Spark 3.2.1, Scala 2.12)&lt;/P&gt;</description>
      <pubDate>Thu, 31 Mar 2022 12:33:31 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24102#M16725</guid>
      <dc:creator>RRO</dc:creator>
      <dc:date>2022-03-31T12:33:31Z</dc:date>
    </item>
    <item>
      <title>Re: Performance for pyspark dataframe is very slow after using a @pandas_udf</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24104#M16727</link>
      <description>&lt;P&gt;Please specify type hints in the function; that will save you some time. Something similar to this (different hints may be needed, it is just an example):&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;@pandas_udf(schema)
def forecast_netprofit(prophtrain: pd.Series) -&amp;gt; pd.Series:
    ...&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;You could also consider using .agg(forecast_netprofit(prophtrain)) instead of .apply().&lt;/P&gt;</description>
      <pubDate>Fri, 01 Apr 2022 19:58:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24104#M16727</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2022-04-01T19:58:32Z</dc:date>
    </item>
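On the type-hint suggestion above: the reply shows a Series-to-Series hint, but for a grouped-map workload like this thread's, the closer Spark 3.x form is a plain DataFrame-to-DataFrame function passed to `df.groupBy(cols).applyInPandas(func, schema)`, with no `PandasUDFType` needed. The function itself is then testable locally with plain pandas; the body below is a hypothetical stand-in (the real one would call Prophet), and the `y`/`yhat` columns are assumptions:

```python
import pandas as pd

# Spark 3.x grouped-map style: a plain DataFrame-to-DataFrame function,
# later registered via df.groupBy(...).applyInPandas(forecast_netprofit, schema).
# Hypothetical stand-in body; the real one would fit and predict with Prophet.
def forecast_netprofit(prophtrain: pd.DataFrame) -> pd.DataFrame:
    out = prophtrain[["Grp1"]].drop_duplicates().reset_index(drop=True)
    out["yhat"] = prophtrain["y"].sum()  # placeholder "forecast"
    return out

# Because it takes and returns plain pandas objects, it can be unit-tested
# on a single group without a Spark cluster:
one_group = pd.DataFrame({"Grp1": ["a", "a"], "y": [2.0, 3.0]})
print(forecast_netprofit(one_group))
```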
    <item>
      <title>Re: Performance for pyspark dataframe is very slow after using a @pandas_udf</title>
      <link>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24105#M16728</link>
      <description>&lt;P&gt;Thank you for the answers.&lt;/P&gt;&lt;P&gt;Unfortunately this did not solve the performance issue.&lt;/P&gt;&lt;P&gt;What I did instead was save the results into a table:&lt;/P&gt;&lt;P&gt;&lt;I&gt;results.write.mode("overwrite").saveAsTable("db.results")&lt;/I&gt;&lt;/P&gt;&lt;P&gt;This is probably not the best solution, but once I do that I can work with the results data from the table.&lt;/P&gt;</description>
      <pubDate>Tue, 12 Apr 2022 15:01:24 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/performance-for-pyspark-dataframe-is-very-slow-after-using-a/m-p/24105#M16728</guid>
      <dc:creator>RRO</dc:creator>
      <dc:date>2022-04-12T15:01:24Z</dc:date>
    </item>
  </channel>
</rss>

