Artem_Yevtushen
New Contributor III

Accelerating row-wise Python UDF functions without using Pandas UDF

Problem

Spark will not automatically parallelize UDF operations on small or medium-sized DataFrames. When the data fits in a single partition, Spark processes the UDF as a single, non-parallelized task, which can make row-wise operations very time-consuming.
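
You can confirm this by checking the partition count of the DataFrame before applying the UDF. A minimal sketch, assuming a Databricks notebook with a spark session and a placeholder table name:

df = spark.sql('select * from table')  # placeholder table name
print(df.rdd.getNumPartitions())       # often 1 for small/medium results, i.e. a single task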

Solution

Force Spark to parallelize the work across the available workers by using the DataFrame repartition() function.

from pyspark.sql.functions import col

df = spark.sql('select * from table').repartition(<number of tasks>)
df = df.withColumn('column_name', python_udf(col('a_column')))

For best performance, set the number of partitions equal to the number of cores available in the cluster, so every core runs one task in parallel; see the sketch below.
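
Putting it together, here is a self-contained sketch of the pattern. The table name, UDF body, and column names are placeholder assumptions, and spark.sparkContext.defaultParallelism is used as one way to discover the available core count:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder row-wise Python UDF (any per-row function works here).
@udf(returnType=StringType())
def python_udf(value):
    return value.upper() if value is not None else None

# One partition per available core for maximum parallelism.
num_tasks = spark.sparkContext.defaultParallelism

df = spark.sql('select * from table').repartition(num_tasks)
df = df.withColumn('column_name', python_udf(col('a_column')))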
