Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

PySpark UDF is taking long to process

sanjay
Valued Contributor II

Hi,

I have a UDF that runs on each row of a Spark DataFrame, does some complex processing, and returns a string. It takes very long even when the data is only 15,000 rows. I have configured the cluster with autoscaling, but it is not spinning up more servers.

Please suggest how to make the UDF faster, or point me to any reference implementations.

Regards,

Sanjay

2 ACCEPTED SOLUTIONS

Accepted Solutions

Lakshay
Esteemed Contributor

Hi @Sanjay Jain, Python UDFs are generally slow because Spark has to serialize every row between the JVM and a separate Python worker process, and the extra memory this uses can also lead to OOM errors. To resolve this, consider the following:

  1. Use Spark built-in functions to implement the same logic.
  2. Use a pandas UDF instead of a Python UDF.
  3. If neither of the above options is suitable, set the configuration spark.databricks.execution.pythonUDF.arrow.enabled = True
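
For option 3, a minimal sketch of setting that configuration from a notebook. It assumes `spark` is the active SparkSession that Databricks notebooks provide; the helper name `enable_arrow_python_udfs` is hypothetical and only wraps the call.

```python
def enable_arrow_python_udfs(spark):
    """Turn on Arrow-based data transfer for regular Python UDFs.

    `spark` is assumed to be the active SparkSession. The configuration key
    is the one quoted in the list above; the value is passed as a string.
    """
    spark.conf.set("spark.databricks.execution.pythonUDF.arrow.enabled", "true")


# In a Databricks notebook (assumption: `spark` already exists):
# enable_arrow_python_udfs(spark)
```

The setting applies to the current session, so it should be set before the UDF is used in a query.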


Kaniz_Fatma
Community Manager

Hi @Sanjay Jain, We haven't heard from you since the last responses from @Lakshay Goel, @rishabh and @Vigneshraja Palaniraj, and I was checking back to see if their suggestions helped you.

If you found a solution of your own, please share it with the community, as it can be helpful to others.

Also, please don't forget to click the "Select As Best" button whenever the information provided helps resolve your question.


4 REPLIES

pvignesh92
Honored Contributor

@Sanjay Jain Hi Sanjay. You did not mention what kind of processing you are doing in the UDF. A Python UDF will almost certainly hurt performance, because the Spark optimizer cannot optimize what happens inside the UDF. Please check whether you can do that processing with Spark native functions.
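
As an illustration of moving logic out of a UDF, here is a minimal sketch. The column name `amount`, the threshold, and both function names are hypothetical, since the thread does not say what the real UDF does. `label_amount` is the kind of per-row Python logic that typically ends up in a UDF, and `add_label_native` expresses the same rules with built-in column functions so the optimizer can see them.

```python
def label_amount(amount):
    """Per-row Python logic of the kind often wrapped in a UDF (hypothetical)."""
    if amount is None:
        return "unknown"
    return "high" if amount >= 1000 else "low"


def add_label_native(df):
    """Same rules as label_amount, written with Spark built-in functions.

    Assumes `df` is a Spark DataFrame with a numeric `amount` column.
    """
    from pyspark.sql import functions as F  # imported lazily; requires pyspark

    return df.withColumn(
        "label",
        F.when(F.col("amount").isNull(), "unknown")
        .when(F.col("amount") >= 1000, "high")
        .otherwise("low"),
    )


# Usage (assumption: `df` exists): df = add_label_native(df)
```

Not every UDF can be rewritten this way, but branching, string handling, date arithmetic and most regular-expression work usually can.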

If you still need a Python UDF, try a pandas UDF. It can provide significant performance improvements for certain types of operations, because pandas UDFs use Apache Arrow to transfer data between Spark and Python in batches, which can result in faster processing times.
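
A minimal sketch of what a pandas UDF could look like. The cleaning logic, the column name `code`, and the function names are hypothetical stand-ins for the real processing. The function receives a whole batch as a `pandas.Series` and returns a Series of the same length, instead of being called once per row.

```python
import pandas as pd


def clean_code_batch(codes: pd.Series) -> pd.Series:
    """Vectorized logic: trim and uppercase a whole batch of strings at once."""
    return codes.str.strip().str.upper()


def make_clean_code_udf():
    """Wrap the batch function as a pandas UDF that returns strings."""
    from pyspark.sql.functions import pandas_udf  # imported lazily; requires pyspark

    return pandas_udf(clean_code_batch, returnType="string")


# Usage (assumption: `df` is a Spark DataFrame with a string column `code`):
# clean_code = make_clean_code_udf()
# df = df.withColumn("code_clean", clean_code("code"))
```

The gain comes from vectorized operations inside the function. A pandas UDF that simply loops over the Series row by row is likely to keep most of the original cost.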

Rishabh-Pandey
Esteemed Contributor

If you can do the same thing with native PySpark logic and functions, there is no need for a UDF. In most cases it can be done in plain PySpark, and a UDF will almost certainly create performance issues.

Rishabh Pandey

