Re: Bad performance UDFs functions

SaraCorralLou · ‎08-23-2023

Hi,

Sorry for the delay in replying.

I installed the library and ran my code, and I couldn't see any problem in the UDFs. The biggest one runs in 5 minutes. The problem appears when we write the final dataframe to the Delta table (with the command write.format("delta")... ): that step runs for more than 30 minutes. The structure of our notebook is:

df = spark.read.table(source_table_with_15millon_records)

df2 = udf_function_to_add_columns(df)

df3 = udf_function_to_add_columns(df2)

df4 = udf_function_to_add_columns(df3)

...

df10.write.format("delta")

The final dataframe (df10 in the example) has the same number of rows but with extra columns.

Could it be that Databricks does some recalculation at the end of the process, and that this is why it takes so long?
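One way this could be checked is to time an action on an intermediate dataframe separately from the final write. This is only a minimal sketch, under the assumption that Spark evaluates the transformations lazily, so the cost of all the UDF steps would only show up when an action such as the write is triggered. The helper name `timed` is hypothetical, and `df2`, `df10` and `target_table` in the comments are placeholders for the notebook's own objects.

```python
import time


def timed(label, action):
    """Run a zero-argument callable, print its wall-clock time, and return both."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.1f} s")
    return result, elapsed


# Hypothetical use in the notebook (df2, df10 and target_table are placeholders).
# The "noop" sink (Spark 3.0+) executes the whole plan without writing any output,
# so it shows the compute cost of the steps up to that dataframe:
#
#   timed("first UDF step only",
#         lambda: df2.write.format("noop").mode("overwrite").save())
#   timed("all steps, no output",
#         lambda: df10.write.format("noop").mode("overwrite").save())
#   timed("all steps + Delta write",
#         lambda: df10.write.format("delta").mode("overwrite").saveAsTable(target_table))
```

If the "noop" run on df10 already takes most of the 30 minutes, the time would be going into the chained UDF steps rather than into the Delta write itself.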

Thank you very much!