08-23-2023 12:49 AM
Hi,
Sorry for the delayed reply.
I installed the library and ran my code, and I couldn't see any problem in the UDF functions. The biggest one runs in 5 minutes. The problem appears when we write the final dataframe to the Delta table (with the command write.format("delta")... ): that step runs for more than 30 minutes. The structure of our notebook is:
df = spark.read.table(source_table_with_15millon_records)
df2 = udf_function_to_add_columns(df)
df3 = udf_function_to_add_columns(df2)
df4 = udf_function_to_add_columns(df3)
...
df10.write.format("delta")
The final dataframe (df10 in the example) has the same number of rows but with extra columns.
Could it be that Databricks does some recalculation at the end of the process, and that is why it is taking so long?
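To make the question concrete: as far as I understand, Spark transformations are lazy, so the chain of UDF steps might only be executed when the write is triggered. Below is a toy sketch of that behaviour in plain Python. It is not Spark or Delta code; `LazyFrame`, `with_column`, `write` and `slow_udf` are hypothetical names I made up only to illustrate the idea.

```python
class LazyFrame:
    """Toy stand-in for a lazily evaluated dataframe (NOT the Spark API)."""

    def __init__(self, rows, steps=()):
        self.rows = rows
        self.steps = tuple(steps)

    def with_column(self, name, fn):
        # Only records the step; nothing is computed here.
        return LazyFrame(self.rows, self.steps + ((name, fn),))

    def write(self):
        # Every recorded step runs here, once per row.
        out = []
        for row in self.rows:
            row = dict(row)
            for name, fn in self.steps:
                row[name] = fn(row)
            out.append(row)
        return out


calls = {"count": 0}


def slow_udf(row):
    # Hypothetical UDF; the counter tracks how often it really runs.
    calls["count"] += 1
    return row["id"] * 2


df = LazyFrame([{"id": i} for i in range(5)])
df2 = df.with_column("a", slow_udf)
df3 = df2.with_column("b", slow_udf)

calls_before_write = calls["count"]  # no UDF call has happened yet
result = df3.write()                 # all UDF calls happen during the write
calls_after_write = calls["count"]
```

If Spark behaves like this toy model, the 30 minutes I see on the write would include the time of all the UDF steps, not only the time to write the Delta files.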
Thank you very much!