Accelerating row-wise Python UDFs without using Pandas UDFs
Problem
Spark does not automatically parallelize UDF operations on small-to-medium DataFrames. Spark's parallelism is determined by the number of partitions, and a small DataFrame often lands in a single partition, so Spark processes the UDF as a single, non-parallelized task. For row-wise operations, this can be very slow.
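You can confirm this by checking the DataFrame's partition count before applying the UDF; a count of 1 means the UDF will run as a single task. A minimal sketch, assuming spark is the active SparkSession and a table named table exists:

df = spark.sql('select * from table')
print(df.rdd.getNumPartitions())  # 1 means the UDF stage has no parallelism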
Solution
Force Spark to parallelize the work across the available workers using the DataFrame repartition method.
from pyspark.sql.functions import col

df = spark.sql('select * from table').repartition(<number of tasks>)
df = df.withColumn('column_name', python_udf(col('a_column')))
For best performance, set the number of partitions equal to the number of cores available across the cluster, so that every core runs one task in parallel.
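Putting it together, here is a minimal end-to-end sketch. It assumes spark is the active SparkSession (predefined in Databricks notebooks and spark-shell sessions); the table name, column names, and the UDF body are placeholders to replace with your own. spark.sparkContext.defaultParallelism returns the total number of cores the scheduler knows about, which makes it a convenient value for the partition count.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# Hypothetical row-wise UDF; substitute your own per-row logic
@udf(returnType=StringType())
def python_udf(value):
    return str(value).upper()

# One partition (and therefore one task) per available core
num_tasks = spark.sparkContext.defaultParallelism
df = spark.sql('select * from table').repartition(num_tasks)
df = df.withColumn('column_name', python_udf(col('a_column')))

Note that repartition triggers a full shuffle of the data. For a small-to-medium DataFrame this overhead is usually dwarfed by the speedup from running the UDF on every core at once.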