Accelerating row-wise Python UDFs without using Pandas UDFs
Problem
Spark does not automatically parallelize UDF operations on small-to-medium DataFrames. Spark's parallelism is determined by the number of partitions, and a small DataFrame often lands in a single partition, so Spark processes the UDF as a single, non-parallelized task. For row-wise operations, this can be very slow.
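You can confirm this by checking the DataFrame's partition count before applying the UDF; a count of 1 means the UDF will run as a single task. A minimal sketch, assuming spark is the active SparkSession and a table named table exists:

df = spark.sql('select * from table')
print(df.rdd.getNumPartitions())  # 1 means the UDF stage has no parallelism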
Solution
Force Spark to parallelize the work across the available workers using the DataFrame repartition method.
from pyspark.sql.functions import col

df = spark.sql('select * from table').repartition(<number of tasks>)
df = df.withColumn('column_name', python_udf(col('a_column')))
For best performance, set the number of partitions equal to the number of cores available across the cluster, so that every core runs one task in parallel.
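Putting it together, here is a minimal end-to-end sketch. It assumes spark is the active SparkSession (predefined in Databricks notebooks and spark-shell sessions); the table name, column names, and the UDF body are placeholders to replace with your own. spark.sparkContext.defaultParallelism returns the total number of cores the scheduler knows about, which makes it a convenient value for the partition count.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Hypothetical row-wise UDF; substitute your own per-row logic
@udf(returnType=StringType())
def python_udf(value):
    return str(value).upper()

# One partition (and therefore one task) per available core
num_tasks = spark.sparkContext.defaultParallelism
df = spark.sql('select * from table').repartition(num_tasks)
df = df.withColumn('column_name', python_udf(col('a_column')))

Note that repartition triggers a full shuffle of the data. For a small-to-medium DataFrame this overhead is usually dwarfed by the speedup from running the UDF on every core at once.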