08-17-2023 01:35 AM
Hello,
I am running into an issue while trying to write data into a Delta table. The query is a join between 3 tables; it takes 5 minutes to fetch the data but 3 hours to write it into the table. The select has 700 records.
Here are the approaches I tested:
Approach | Duration |
Shared cluster | 3h |
Isolated cluster | 2.88h |
External table + Parquet + compression "ZSTD" | 2.63h |
Adjusting table properties: 'delta.targetFileSize' = '256mb' | 2.9h |
Bucket insert (batches of 100M records each) | too long, I had to cancel it |
Partitioning | not an option |
Cluster summary:
1-15 Workers: 140-2,100 GB Memory, 20-300 Cores
1 Driver : 140 GB Memory, 20 Cores
Runtime: 12.2.x-scala2.12
08-17-2023 02:07 AM
Thank you for your prompt response. Here is more context on the issue:
The table I am writing into gets truncated every time I run my script (it's used as a staging table), which means I am inserting into an empty table every time.
08-21-2023 10:24 AM
I wonder if you have already looked at the SQL plan to see which phase is taking the most time.
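One way to look at the plan from a notebook is sketched below. It is only a sketch: the helper names `capture_plan` and `expensive_join_operators` are made up for illustration, and it assumes a PySpark DataFrame whose `explain(mode=...)` prints the plan to standard output (Spark 3.x behaviour). It flags the two physical operators Spark typically uses for joins without an equality condition, which are common culprits when a query looks fast to preview but slow to materialize.

```python
import contextlib
import io

# Physical operators that usually indicate a cross join or a non-equi join.
SUSPECT_OPERATORS = ("CartesianProduct", "BroadcastNestedLoopJoin")


def capture_plan(df, mode="formatted"):
    """Return the text that df.explain(mode=...) prints (hypothetical helper)."""
    buf = io.StringIO()
    with contextlib.redirect_stdout(buf):
        df.explain(mode=mode)
    return buf.getvalue()


def expensive_join_operators(plan_text):
    """List the suspect join operators that appear in the plan text."""
    return [name for name in SUSPECT_OPERATORS if name in plan_text]


# Usage in a notebook where `spark` already exists (the query is a placeholder):
# df = spark.sql("SELECT ... FROM a CROSS JOIN b ...")
# print(expensive_join_operators(capture_plan(df)))
```

The plan only shows which operators are used; the per-stage timings in the Spark UI (SQL tab) are still needed to confirm where the time actually goes.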
08-22-2023 02:30 AM
It turned out that the issue was not on the writing side, even though I was getting the results in under 5 minutes. The issue was the cross join in my query. I resolved it by doing the same cross joins via DataFrames, which got the results computed and written in 17 minutes.
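For anyone landing here later, a minimal sketch of what "cross joins via DataFrames" can look like in PySpark follows. It is not the exact code used in this thread: the function name and the table names are hypothetical, and writing with `mode("overwrite")` is an assumption based on the staging table being emptied on every run.

```python
def cross_join_and_overwrite(left, right, target_table):
    """Cross join two DataFrames and overwrite a Delta table with the result.

    `left` and `right` are PySpark DataFrames; `target_table` is a table name.
    """
    # DataFrame.crossJoin produces the Cartesian product explicitly.
    result = left.crossJoin(right)
    # Assumption: the staging table is fully replaced on each run,
    # so overwrite stands in for truncate + insert.
    result.write.format("delta").mode("overwrite").saveAsTable(target_table)
    return result


# Usage in a notebook where `spark` already exists (table names are hypothetical):
# left = spark.table("source_a")
# right = spark.table("source_b")
# cross_join_and_overwrite(left, right, "staging.my_table")
```

Previewing a few rows of a query can return quickly because Spark only computes what the preview needs, while the write has to materialize the full Cartesian product, which would explain the gap between the 5 minute fetch and the multi-hour write.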