
Calling a python function (def) in databricks

pjp94
Contributor

Not sure if I'm missing something here, but running a task outside of a Python function runs much quicker than executing the same task inside a function. Is there something I'm missing in how Spark handles functions?

1)

def task(x):
    y = dostuff(x)
    return y

2)

y = dostuff(x)

9 REPLIES

Bartek
Contributor

Hi @pjp, could you provide some more information? I'm not aware of any mechanism in Spark that could have such an impact, but an example would make it easier for the community to replicate it, run some benchmarks, and help you.

Cheers

Bartek

pjp94
Contributor

Sure. My function queries an external database (over JDBC) along with a Delta table. I'm not performing any expensive computations; just filtering for the most part. When printing timestamps inside the function, I notice that most of the time is spent on the latter (the Delta table query/manipulations), and I don't know why that is. I even cache the tables when I query them. When I wrap the logic in a function, it takes 15 minutes; when I run it outside of a function, it takes 3 minutes.
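(For reference, a minimal sketch of the kind of workflow described above, with per-step timings. The JDBC options, table names, and filter column are hypothetical placeholders, and spark is the session Databricks provides in a notebook.)

import time
from pyspark.sql import functions as F

def task(x):
    t0 = time.time()
    # hypothetical JDBC source; url and dbtable are placeholders
    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://host:5432/db")
               .option("dbtable", "source_table")
               .load()
               .cache())
    jdbc_df.count()  # cache() is lazy; force materialization so the timing is real
    print(f"jdbc read: {time.time() - t0:.1f}s")

    t0 = time.time()
    # hypothetical Delta table and filter
    delta_df = (spark.read.table("my_delta_table")
                .filter(F.col("key") == x)
                .cache())
    delta_df.count()
    print(f"delta read/filter: {time.time() - t0:.1f}s")
    return jdbc_df.join(delta_df, "key")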

Ajay-Pandey
Esteemed Contributor III

UDFs are more expensive in Spark.

That could be the reason for this.

Yes, there is a difference in performance between Python and Scala; still, @Paras Patel sees the performance penalty using Python in both cases.

Hubert-Dudek
Esteemed Contributor III

It would be easier if you shared your whole code, @pjp94.

UmaMahesh1
Honored Contributor III

Assuming the dostuff you mentioned is a Spark SQL function, you can take a look at this Stack Overflow thread and the links within it to get some idea.

https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance
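For illustration, a rough sketch of the contrast that thread describes: the same transformation written as a Python UDF versus a built-in Spark SQL function. The DataFrame and column names here are made up.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.range(1000000).withColumn(
    "name", F.concat(F.lit("user_"), F.col("id").cast("string")))

# Python UDF: each row is shipped to a Python worker, opaque to Catalyst
upper_udf = F.udf(lambda s: s.upper(), StringType())
df.withColumn("upper", upper_udf("name")).explain()  # plan shows a BatchEvalPython step

# Built-in function: stays in the JVM and is fully optimizable
df.withColumn("upper", F.upper("name")).explain()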

huyd
New Contributor III

If you can, convert your Python UDFs to SQL UDFs. These play nicely with adaptive query execution and don't carry the same performance penalties.
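A minimal sketch of what such a conversion could look like; double_it and the query are made-up examples:

# Register a SQL UDF; Catalyst can inline and optimize it, unlike a Python UDF
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION double_it(x INT)
    RETURNS INT
    RETURN x * 2
""")
spark.sql("SELECT double_it(id) AS doubled FROM range(5)").show()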

ramravi
Contributor II

It seems you are using a UDF here. UDFs in Spark are expensive because Spark doesn't know how to optimize them. It's better to avoid them unless you have no other choice.
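As a sketch of what avoiding a UDF can look like in practice, here is a simple conditional rewritten with built-in column expressions; the column names and threshold are hypothetical:

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 120.0), (2, 80.0)], ["id", "amount"])

# Instead of a Python UDF such as:
#   label_udf = F.udf(lambda a: "high" if a > 100 else "low")
# use when/otherwise, which stays inside the Catalyst optimizer:
df.withColumn("label",
              F.when(F.col("amount") > 100, "high").otherwise("low")).show()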

sher
Valued Contributor II

Don't use a normal Python function; use a UDF in PySpark instead, so it will be faster.
