08-18-2023 04:25 AM
Hello,
I am contacting you because I am having a problem with the performance of my notebooks on Databricks.
My notebook is written in Python (PySpark). In it I read a Delta table, copy it into a dataframe, and do several transformations that create several columns (I have several different dataframes created for this). For the creation of several of these columns I have created a function in my library that takes the initial dataframe as input and returns the same dataframe with the new column.
I can't avoid using the functions, but the execution time of my notebook has increased considerably.
I have read that they are executed on the driver node of a Spark application rather than on the worker nodes where the data is distributed, and that this is the reason why they are slower.
This is the configuration of my cluster:
The driver node is quite large and I have also added the configuration: spark.databricks.io.cache.enabled true.
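For reference, the setting can also be applied from a notebook cell, roughly like this (a minimal sketch; setting it in the cluster's Spark config is equivalent):
```python
# Enable the Databricks disk (IO) cache for the current session;
# equivalent to setting spark.databricks.io.cache.enabled in the cluster config.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Check the current value.
print(spark.conf.get("spark.databricks.io.cache.enabled"))
```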
I am working with approximately 15 million records.
Is there anything I can do to improve this performance? Any additional settings I'm missing that would help my notebook run faster? Right now it is taking an hour and a half.
Thank you so much in advance.
08-18-2023 05:08 AM
You can use the Memory Profiler to profile your UDFs. This will help us understand which parts of the UDF lead to high memory utilization and which are called many times during execution.
https://www.databricks.com/blog/2022/11/30/memory-profiling-pyspark.html
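Roughly, the workflow from that post looks like this (a sketch only; it assumes a recent runtime where the PySpark UDF memory profiler is available, that the memory-profiler package is installed on the cluster, and that sc/spark are the notebook's SparkContext/SparkSession):
```python
# The profiler has to be enabled before the UDFs run, e.g. via the Spark config:
#   spark.python.profile.memory  true
from pyspark.sql.functions import udf

@udf("long")
def plus_one(x):                      # toy UDF, just for illustration
    return x + 1 if x is not None else None

df = spark.range(10).withColumn("y", plus_one("id"))
df.collect()                          # run the UDF so the profiler collects data

sc.show_profiles()                    # prints per-UDF, line-by-line memory usage
```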
08-23-2023 12:49 AM
Hi,
Sorry for the delay in my answer.
I installed the library, ran my code, and couldn't see any problem in the UDF functions; the biggest one runs in about 5 minutes. The problem appears when we write the final dataframe to the Delta table (with the write command).
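Roughly, the write is of this shape (a hypothetical sketch; the real table name and options are different):
```python
# Hypothetical reconstruction of the final write; names are made up.
# This action is what triggers all of the upstream transformations.
(final_df.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("analytics.final_table"))
```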
Could it be that Databricks does some recalculation at the end of the process, and that is why it is taking so long?
Thank you very much!
08-23-2023 01:11 AM
Why do you use a UDF to add columns? You could write a plain PySpark function without registering it as a UDF.
A UDF will bypass all optimizations, and when you use Python, performance is bad.
Can you share your UDF code?
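To illustrate the difference (with made-up column names): the first version registers a Python UDF, while the second is just a plain function that builds native column expressions, so Catalyst can still optimize it.
```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

# Version 1: a Python UDF -- every row is serialized to a Python worker,
# and the optimizer treats the call as a black box.
@F.udf("double")
def to_euros_udf(amount, rate):
    return amount * rate if amount is not None and rate is not None else None

# Version 2: a plain function returning the dataframe with a new column
# built from native expressions -- no UDF involved at all.
def add_euro_amount(df: DataFrame) -> DataFrame:
    return df.withColumn("AmountEUR", F.col("Amount") * F.col("Rate"))
```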
08-28-2023 05:37 AM
In our original dataframe we have a column with the amount in its original currency (OriginalCurrency), and we use the function "convert_to_euros" to convert that amount to euros. The rates are in another Delta table:
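Roughly, it looks like this (a simplified sketch; the table, column and rate names are placeholders for the real ones):
```python
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def convert_to_euros(df: DataFrame) -> DataFrame:
    # Return df with an extra AmountEUR column, using the rates Delta table.
    # Placeholder names; uses the notebook's global spark session.
    rates = spark.read.table("reference.exchange_rates")  # OriginalCurrency, RateToEUR
    return (
        df.join(rates, on="OriginalCurrency", how="left")
          .withColumn("AmountEUR", F.col("Amount") * F.col("RateToEUR"))
    )
```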
This one is the simplest function that we have, as the rest have a lot of business logic.
We could avoid using the functions in some cases (like the one I showed you), but since we use this logic in many different notebooks we prefer to have the function and write unit tests for it, so the code is less repetitive.
But as you can see, in our functions we only use basic PySpark code; we don't do collects or displays (which we have read are very bad for performance). At most we do some grouping on some dataframes, but nothing really unusual.
08-28-2023 05:45 AM
OK, so it is not registered as a UDF, which is good.
Looking at your code, nothing special there. The reason why the performance is so bad is not clear to me.
It could be the table distribution (of source table or rates table), perhaps data skew?
If the rates table is small, it should be broadcasted so the join would go pretty fast.
I'd take a look at the Spark UI and the query plan to see what exactly is going on, because your code seems fine to me, as long as you don't apply a loop over records (I hope not)?
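For example, something along these lines (hypothetical dataframe names) to force the broadcast and to inspect the plan:
```python
from pyspark.sql import functions as F

# Hint that the small rates table can be broadcast to every executor,
# which avoids shuffling the large dataframe for the join.
joined = df.join(F.broadcast(rates), on="OriginalCurrency", how="left")

# Inspect the physical plan (also visible in the Spark UI / SQL tab) to confirm
# a BroadcastHashJoin and to spot unexpected shuffles or skew.
joined.explain(mode="formatted")
```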
08-28-2023 06:10 AM
We do apply a loop over records in another function. Before, we used collect (for row in df.collect()) to loop over them, but we changed it to use toLocalIterator (for row in df.rdd.toLocalIterator()) because we read that it is better for performance. Are we doing it in the correct way?
08-28-2023 06:37 AM
Looping over records is a performance killer, to be avoided at all costs.
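For instance (a sketch with made-up logic), a row loop like the one described can usually be replaced by native column expressions or a join, which keeps the work distributed:
```python
from pyspark.sql import functions as F

# Instead of pulling rows to the driver ...
# for row in df.rdd.toLocalIterator():
#     do_something(row["Amount"])

# ... express the per-row logic as column expressions (example logic only):
df = df.withColumn(
    "AmountCategory",
    F.when(F.col("Amount") > 1000, "high").otherwise("low"),
)
```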