Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Calling a Python function (def) in Databricks

pjp94
Contributor

Not sure if I'm missing something here, but running a task outside of a Python function is much quicker than executing the same task inside a function. Is there something I'm missing about how Spark handles functions?

1)

def task(x):
    y = dostuff(x)
    return y

2)

y = dostuff(x)

9 REPLIES

Bartek
Contributor

Hi @pjp94, could you provide some more information? I'm not aware of any mechanism in Spark that could have such an impact, but an example would make it easier for the community to replicate it, run some benchmarks, and help you.

Cheers

Bartek

Sure. My function queries an external database (over JDBC) along with a Delta table. I'm not performing any expensive computations, just filtering for the most part. When printing timestamps inside the function, I notice that most of the time is spent on the latter (the Delta table query/manipulations), and I don't know why. I even cache the tables when I query them. Wrapped in a function, the job takes 15 minutes; run outside of a function, it takes 3 minutes.
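One detail worth checking in the caching mentioned above: cache() in Spark is lazy, so a cached table isn't actually held in memory until an action runs against it. If the function caches and then immediately filters, the first query still pays the full scan cost. Below is a minimal sketch of materializing the cache up front; the JDBC URL, table names, and some_col are placeholder assumptions, not the original code:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder JDBC source; the URL and table name are illustrative only.
jdbc_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://host:5432/db")
    .option("dbtable", "source_table")
    .load()
)

# Placeholder Delta table name.
delta_df = spark.read.table("my_delta_table")

# cache() only marks the plan for caching; nothing is stored yet.
delta_df = delta_df.cache()
delta_df.count()  # run one action to materialize the cache up front

def task(filter_value):
    # Later filters now read the cached data instead of re-scanning Delta.
    return delta_df.filter(delta_df.some_col == filter_value)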

Ajay-Pandey
Esteemed Contributor III

UDFs are more expensive in Spark.

That could be the reason for this.

Ajay Kumar Pandey

Yes, there is a difference in performance between Python and Scala; still, @Paras Patel​ sees a performance penalty using Python in both cases.

Hubert-Dudek
Esteemed Contributor III

It would be easier if you shared your whole code, @pjp94.

UmaMahesh1
Honored Contributor III

Assuming the dostuff you mentioned is a Spark SQL function, you can take a look at this Stack Overflow thread, and the links within it, to get some idea.

https://stackoverflow.com/questions/38296609/spark-functions-vs-udf-performance

Uma Mahesh D
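To make the gist of that thread concrete, here is a minimal sketch (toy data and a made-up add_one transformation) contrasting a Python UDF with the equivalent built-in expression; only the latter stays visible to the Catalyst optimizer:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()
df = spark.range(5).withColumnRenamed("id", "x")  # toy data

# Python UDF: every row is serialized out to a Python worker, and Catalyst
# treats the call as a black box it cannot optimize around.
@F.udf(returnType=LongType())
def add_one_udf(x):
    return x + 1

df_udf = df.withColumn("y", add_one_udf(F.col("x")))

# Built-in expression: stays in the JVM and inside the optimized plan.
df_builtin = df.withColumn("y", F.col("x") + 1)

Comparing df_udf.explain() with df_builtin.explain() shows the extra BatchEvalPython step the UDF introduces into the physical plan.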

huyd
New Contributor III

If you can, convert your Python UDFs to SQL UDFs. These play nicely with Adaptive Query Execution and won't incur the same performance penalties.
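As a rough example of that approach, assuming a Databricks runtime recent enough to support SQL UDFs (the add_one function below is purely illustrative):

# A SQL UDF is inlined into the query plan, so the optimizer can see through it,
# unlike a Python UDF.
spark.sql("""
    CREATE OR REPLACE TEMPORARY FUNCTION add_one(x INT)
    RETURNS INT
    RETURN x + 1
""")

spark.sql("SELECT add_one(41) AS y").show()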

ramravi
Contributor II

It seems you are using a UDF here. UDFs in Spark are expensive because Spark doesn't know how to optimize them; better to avoid them unless you have no other choice.

sher
Valued Contributor II

Don't use a normal Python function; use a UDF in PySpark instead, so that it will be faster.
