12-05-2022 12:53 PM
Not sure if I'm missing something here, but running a task outside of a Python function is much faster than executing the same task inside a function. Is there something I'm missing about how Spark handles functions?
1)
def task(x):
    y = dostuff(x)
    return y

2)
y = dostuff(x)
12-05-2022 03:32 PM
Hi @pjp, could you provide some more information? I'm not aware of any mechanism in Spark that could have such an impact, but an example might make it easier for the community to replicate the issue, run some benchmarks, and help you.
Cheers
Bartek
12-05-2022 03:49 PM
Sure. My function queries an external database (over JDBC) along with a Delta table. I'm not performing any expensive computations; it's mostly filtering. When I print timestamps inside the function, I notice that most of the time is spent on the latter (the Delta table query and manipulations), and I don't know why. I even cache the tables after querying them. When I wrap the logic in a function, it takes 15 minutes; run outside a function, the same code takes 3 minutes.
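For context, here is a minimal sketch of the kind of function described above; the JDBC URL, credentials, paths, and column names are all placeholders rather than the actual code, and `spark` is the notebook's SparkSession:

```python
from pyspark.sql import functions as F

def task(threshold):
    # Hypothetical JDBC source; url, table, and credentials are placeholders
    jdbc_df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "public.source_table")
        .option("user", "user")
        .option("password", "password")
        .load()
    )

    # Hypothetical Delta table; reportedly where most of the time is spent
    delta_df = spark.read.format("delta").load("/mnt/delta/target_table")

    # Mostly filtering, with caching as described
    result = (
        delta_df.join(jdbc_df, "id")
        .filter(F.col("value") > threshold)
        .cache()
    )
    return result
```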
12-06-2022 01:22 AM
UDFs are more expensive in Spark. That could be the reason for this.
12-06-2022 02:04 AM
Yes, there is a performance difference between Python and Scala, but @Paras Patel sees a penalty with Python in both cases.
12-06-2022 03:06 AM
It would be easier if you shared your whole code, @pjp94.
12-06-2022 04:40 AM
Assuming the dostuff you mentioned is a Spark SQL function, you can take a look at this Stack Overflow thread and the links within it to get some idea:
https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance
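To illustrate the point from that thread, here is a small sketch contrasting a row-at-a-time Python UDF with the equivalent built-in function; the DataFrame and column names are made up:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.range(1_000_000).withColumn(
    "name", F.concat(F.lit("user_"), F.col("id").cast("string"))
)

# Python UDF: every row is serialized out to a Python worker and back,
# and Catalyst treats the function as an opaque black box.
upper_udf = F.udf(lambda s: s.upper(), StringType())
slow = df.withColumn("upper_name", upper_udf("name"))

# Built-in function: stays inside the JVM and is fully optimizable.
fast = df.withColumn("upper_name", F.upper("name"))
```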
12-29-2022 02:51 PM
If you can, convert your Python UDFs to SQL UDFs. SQL UDFs play nicely with Adaptive Query Execution and won't carry the same performance penalties.
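For reference, a SQL UDF can be created straight from a notebook, roughly like this (supported on Databricks Runtime 9.1 and above; the function name and body are purely illustrative):

```python
# The body is pure SQL, so Catalyst can inline and optimize it
spark.sql("""
    CREATE OR REPLACE FUNCTION double_it(x DOUBLE)
    RETURNS DOUBLE
    RETURN x * 2
""")

spark.sql("SELECT id, double_it(CAST(id AS DOUBLE)) AS doubled FROM range(10)").show()
```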
01-02-2023 06:07 AM
It seems you are using a UDF here. UDFs in Spark are expensive because Spark doesn't know how to optimize the code inside them, so it's better to avoid them unless you have no other choice.
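One way to see this for yourself: a Python UDF appears in the physical plan as an opaque BatchEvalPython node that optimizations like predicate pushdown cannot cross, whereas the built-in equivalent compiles down to ordinary expressions. A quick sketch (the column names are made up):

```python
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

df = spark.range(100)
plus_one = F.udf(lambda x: x + 1, LongType())

# The plan contains a BatchEvalPython node; the filter cannot be
# rewritten or pushed down because the UDF is a black box.
df.withColumn("y", plus_one("id")).filter("y > 10").explain()

# The built-in version compiles to plain, optimizable expressions.
df.withColumn("y", F.col("id") + 1).filter("y > 10").explain()
```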
01-05-2023 10:30 PM
Don't use a plain Python function; register it as a UDF in PySpark so the work is distributed across the executors, and it will run faster.
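If a Python function really is unavoidable, a vectorized pandas UDF is usually much faster than a row-at-a-time UDF, since it processes whole Arrow batches instead of single rows. A minimal sketch (the function and types are illustrative):

```python
import pandas as pd
from pyspark.sql import functions as F

@F.pandas_udf("double")
def times_two(v: pd.Series) -> pd.Series:
    # Receives and returns an entire batch as a pandas Series
    return v * 2

spark.range(10).select(times_two(F.col("id").cast("double"))).show()
```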