12-05-2022 12:53 PM
Not sure if I'm missing something here, but running a task outside of a Python function is much, much quicker than executing the same task inside a function. Is there something I'm missing about how Spark handles functions?
1)

def task(x):
    y = dostuff(x)
    return y

2)

y = dostuff(x)
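For reference, a minimal timing sketch of the comparison (dostuff, x, and the values here are placeholders, not my real code):

import time

def dostuff(x):
    return x  # stand-in for the real query logic

def task(x):
    y = dostuff(x)
    return y

x = 42  # placeholder input

start = time.time()  # 1) inside a function
y = task(x)
print(f"inside function: {time.time() - start:.2f}s")

start = time.time()  # 2) inline
y = dostuff(x)
print(f"inline: {time.time() - start:.2f}s")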
12-05-2022 03:32 PM
Hi @pjp94, could you provide some more information? I'm not aware of any mechanism in Spark that could have such an impact, but an example would make it easier for the community to replicate it, do some benchmarking, and help you.
Cheers
Bartek
12-05-2022 03:49 PM
Sure. My function queries an external database (via JDBC) along with a Delta table. I'm not performing any expensive computations - just filtering, for the most part. When I print timestamps inside the function, I notice that most of the time is spent on the latter (the Delta table query/manipulations), and I don't know why that is. I even cache the tables when I query them. When I wrap the logic in a function, it takes 15 min; if I run the same thing outside of a function, it takes 3 min.
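Roughly, the shape of the function is the following (connection details, table names, and filters are placeholders, not my real ones; spark is the session already available in a Databricks notebook):

from pyspark.sql import functions as F

def task(filter_val):
    # external database over JDBC (placeholder connection details)
    jdbc_df = (spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://host:5432/db")
        .option("dbtable", "schema.source_table")
        .option("user", "user")
        .option("password", "pw")
        .load()
        .filter(F.col("key") == filter_val)
        .cache())

    # Delta table - this is where most of the time goes
    delta_df = (spark.read.format("delta")
        .load("/path/to/delta_table")
        .filter(F.col("key") == filter_val)
        .cache())

    return delta_df.join(jdbc_df, "key")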
12-06-2022 01:22 AM
UDFs are more expensive in Spark.
That could be the reason for this.
12-06-2022 02:04 AM
Yes, there is a difference in performance between Python and Scala - still, @Paras Patel sees a performance penalty using Python in both cases.
12-06-2022 03:06 AM
It would be easier if you shared your whole code, @pjp94.
12-06-2022 04:40 AM
Assuming the dostuff you mentioned is a Spark SQL function, you can take a look at this Stack Overflow thread and the links in it to get some idea.
https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance
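The gist of that thread, as a small sketch: a built-in function is optimized by Catalyst like any other expression, while a Python UDF ships every row to a Python worker and back (toy data, assuming a Databricks notebook where spark already exists):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

# built-in function: handled entirely by Catalyst/Tungsten
df.select(F.upper(F.col("name"))).show()

# equivalent Python UDF: rows are serialized to a Python worker and back
upper_udf = F.udf(lambda s: s.upper(), StringType())
df.select(upper_udf(F.col("name"))).show()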
12-29-2022 02:51 PM
If you can, convert your Python UDFs to SQL UDFs. These play nicely with adaptive query execution and won't have the same performance penalties.
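For example, a Python UDF like the one above could be rewritten as a SQL UDF that the optimizer can inline (illustrative names, assuming a runtime that supports SQL UDFs, i.e. DBR 9.1+):

# SQL UDF: defined as a SQL expression, so Catalyst can inline and optimize it
spark.sql("""
    CREATE OR REPLACE FUNCTION to_upper(s STRING)
    RETURNS STRING
    RETURN upper(s)
""")
spark.sql("SELECT to_upper(name) FROM my_table").show()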
01-02-2023 06:07 AM
It seems you are using a UDF here. UDFs in Spark are expensive because Spark doesn't know how to optimize them. Better to avoid them unless you have no other choice.
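One way to see this is in the physical plan: a Python UDF appears as an opaque BatchEvalPython node that Catalyst cannot optimize or push filters into, while the built-in equivalent compiles to a plain projection (toy sketch, spark assumed from the notebook):

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([("alice",)], ["name"])
upper_udf = F.udf(lambda s: s.upper(), StringType())

# plan includes a BatchEvalPython node wrapping the opaque Python lambda
df.select(upper_udf("name")).explain()

# plan is a simple Project over the upper(name) expression
df.select(F.upper("name")).explain()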
01-05-2023 10:30 PM
Don't use a normal Python function; use a UDF in PySpark so it will be faster.