08-17-2023 01:35 AM
Hello,
I am running into an issue while trying to write data into a Delta table. The query is a join between 3 tables; it takes 5 minutes to fetch the data but 3 hours to write it into the table. The select returns 700 records.
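Roughly, the workload looks like the following PySpark sketch; the table names, join keys, and target table are placeholders rather than the real ones.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder query: a join between 3 tables, as described above.
result = spark.sql("""
    SELECT *
    FROM staging.t1
    JOIN staging.t2 ON t1.id = t2.t1_id
    JOIN staging.t3 ON t2.id = t3.t2_id
""")

# The slow step: writing the joined result into the Delta table.
(result.write
    .format("delta")
    .mode("append")
    .saveAsTable("staging.target_table"))
```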
Here are the approaches I tested:
- Shared cluster: 3 h
- Isolated cluster: 2.88 h
- External table + Parquet + ZSTD compression: 2.63 h
- Adjusting table properties ('delta.targetFileSize' = '256mb'): 2.9 h (see the sketch after this list)
- Bucket insert (batches of 100M records each): too long; I had to cancel it
- Partitioning: not an option
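For reference, a sketch of how the 'delta.targetFileSize' and ZSTD tests above can be expressed; the table name, view name, and path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tune the Delta target file size for the table (placeholder name).
spark.sql("""
    ALTER TABLE staging.target_table
    SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')
""")

# External table + Parquet + ZSTD variant: set the codec on the write itself.
df = spark.table("staging.source_view")          # placeholder for the joined result
(df.write
   .format("parquet")
   .option("compression", "zstd")
   .mode("overwrite")
   .save("/mnt/staging/target_parquet"))         # placeholder external location
```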
Cluster summary:
- 1-15 workers: 140-2,100 GB memory, 20-300 cores
- 1 driver: 140 GB memory, 20 cores
- Runtime: 12.2.x-scala2.12
08-17-2023 02:07 AM
Thank you for your prompt response; here is more context on the issue.
The table I am writing into gets truncated every time I run my script (it is used as a staging table), which means I am inserting into an empty table every time.
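In other words, each run looks roughly like the sketch below (names are placeholders); for a Delta table, an overwrite write is equivalent to the truncate-then-insert pattern.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Truncate + insert pattern, as described above (placeholder names).
spark.sql("TRUNCATE TABLE staging.my_staging_table")
spark.sql("""
    INSERT INTO staging.my_staging_table
    SELECT * FROM staging.source_view  -- the 3-table join, exposed as a view
""")

# Equivalent single step for a Delta table: overwrite the table contents.
(spark.table("staging.source_view")
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("staging.my_staging_table"))
```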
08-21-2023 10:24 AM
I wonder if you have already looked at the SQL plan to see which phase is taking the most time.
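For example, the physical plan can be printed from a notebook (the query below is a placeholder for the real one); the SQL tab in the Spark UI then shows per-node metrics for the executed plan.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder for the actual 3-table join.
df = spark.sql("""
    SELECT *
    FROM staging.t1
    JOIN staging.t2 ON t1.id = t2.t1_id
    JOIN staging.t3 ON t2.id = t3.t2_id
""")

# Print the physical plan with per-operator details; look for expensive
# operators such as CartesianProduct or BroadcastNestedLoopJoin.
df.explain(mode="formatted")
```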
08-22-2023 02:30 AM
It turned out that the issue was not on the writing side at all, even though I was getting the select results in under 5 minutes: the problem was the cross join in my query. I resolved it by doing the same cross joins via DataFrames, and the results were computed and written in 17 minutes.
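A sketch of the DataFrame version of that fix; the table names and join keys are placeholders, since the real query is not shown in the thread.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder tables standing in for the ones being cross-joined.
dim_a = spark.table("staging.dim_a")
dim_b = spark.table("staging.dim_b")
facts = spark.table("staging.facts")

# Express the cross join explicitly with the DataFrame API, then join the facts.
combos = dim_a.crossJoin(dim_b)
result = combos.join(facts, on=["a_id", "b_id"], how="left")  # assumed keys

# Compute and write the result into the Delta table in one pass.
(result.write
    .format("delta")
    .mode("overwrite")
    .saveAsTable("staging.target_table"))
```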