
Calling a python function (def) in databricks

pjp94
Contributor

Not sure if I'm missing something here, but running a task outside of a Python function runs much quicker than executing the same task inside a function. Is there something I'm missing in how Spark handles functions?

1)

def task(x):
    y = dostuff(x)
    return y

2)

y = dostuff(x)

9 REPLIES

Bartek
Contributor

Hi @pjp, could you provide some more information? I'm not aware of any mechanism in Spark that could have such an impact, but an example would make it easier for the community to replicate it, run some benchmarks, and help you.

Cheers

Bartek

pjp94
Contributor

Sure. My function queries an external database (over JDBC) along with a Delta table. I'm not performing any expensive computations; just filtering for the most part. When printing timestamps inside the function, I notice that most of the time is spent on the latter (the Delta table query/manipulations), and I don't know why that is. I even cache the tables when I query them. When I wrap the logic in a function, it takes 15 minutes; when I run it outside of a function, it takes 3 minutes.
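(For reference, a minimal sketch of the kind of workflow described above, with per-step timings. The JDBC options, table names, and filter column are hypothetical placeholders, and spark is the session Databricks provides in a notebook.)

import time
from pyspark.sql import functions as F

def task(x):
    t0 = time.time()
    # hypothetical JDBC source; url and dbtable are placeholders
    jdbc_df = (spark.read.format("jdbc")
               .option("url", "jdbc:postgresql://host:5432/db")
               .option("dbtable", "source_table")
               .load()
               .cache())
    jdbc_df.count()  # cache() is lazy; force materialization so the timing is real
    print(f"jdbc read: {time.time() - t0:.1f}s")

    t0 = time.time()
    # hypothetical Delta table and filter
    delta_df = (spark.read.table("my_delta_table")
                .filter(F.col("key") == x)
                .cache())
    delta_df.count()
    print(f"delta read/filter: {time.time() - t0:.1f}s")
    return jdbc_df.join(delta_df, "key")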

Ajay-Pandey
Esteemed Contributor III

UDFs are more expensive in Spark.

That could be the reason for this.

Yes, there is a difference in performance between Python and Scala; still, @Paras Patel sees the performance penalty using Python in both cases.

Hubert-Dudek
Esteemed Contributor III

It would be easier if you shared your whole code, @pjp94.

UmaMahesh1
Honored Contributor III

Assuming the dostuff you mentioned is a Spark SQL function, you can take a look at this Stack Overflow thread and the links within it to get some idea.

https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance
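For illustration, a rough sketch of the contrast that thread describes: the same transformation written as a Python UDF versus a built-in Spark SQL function. The DataFrame and column names here are made up.

from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.range(1000000).withColumn(
    "name", F.concat(F.lit("user_"), F.col("id").cast("string")))

# Python UDF: each row is shipped to a Python worker, opaque to Catalyst
upper_udf = F.udf(lambda s: s.upper(), StringType())
df.withColumn("upper", upper_udf("name")).explain()  # plan shows a BatchEvalPython step

# Built-in function: stays in the JVM and is fully optimizable
df.withColumn("upper", F.upper("name")).explain()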

huyd
New Contributor III

If you can, convert your Python UDFs to SQL UDFs. These play nicely with adaptive query execution and don't carry the same performance penalties.
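A minimal sketch of what such a conversion could look like; double_it and the query are made-up examples:

# Register a SQL UDF; Catalyst can inline and optimize it, unlike a Python UDF
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION double_it(x INT)
    RETURNS INT
    RETURN x * 2
""")
spark.sql("SELECT double_it(id) AS doubled FROM range(5)").show()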

ramravi
Contributor II

It seems you are using a UDF here. UDFs in Spark are expensive because Spark doesn't know how to optimize them. It's better to avoid them unless you have no other choice.
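As a sketch of what avoiding a UDF can look like in practice, here is a simple conditional rewritten with built-in column expressions; the column names and threshold are hypothetical:

from pyspark.sql import functions as F

df = spark.createDataFrame([(1, 120.0), (2, 80.0)], ["id", "amount"])

# Instead of a Python UDF such as:
#   label_udf = F.udf(lambda a: "high" if a > 100 else "low")
# use when/otherwise, which stays inside the Catalyst optimizer:
df.withColumn("label",
              F.when(F.col("amount") > 100, "high").otherwise("low")).show()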

sher
Valued Contributor II

Don't use a normal Python function; use a UDF in PySpark instead, so it will be faster.
