08-17-2023 01:35 AM
Hello,
I am running into an issue while trying to write data into a Delta table. The query is a join between 3 tables; it takes 5 minutes to fetch the data but about 3 hours to write it into the table. The select returns roughly 700M records.
Here are the approaches I tested:

| Approach | Write time |
| --- | --- |
| Shared cluster | 3 h |
| Isolated cluster | 2.88 h |
| External table + Parquet + ZSTD compression | 2.63 h |
| Adjusting table properties: `'delta.targetFileSize' = '256mb'` | 2.9 h |
| Bucketed insert (batches of 100M records each) | too long, I had to cancel it |
| Partitioning | not an option |
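For reference, a minimal sketch of the table-property change from the fourth row, assuming a hypothetical staging table name (`delta.targetFileSize` itself is a documented Delta table property):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # already available as `spark` in a Databricks notebook

# Hypothetical table name; `delta.targetFileSize` hints the writer toward ~256 MB files.
spark.sql("""
    ALTER TABLE my_schema.staging_table
    SET TBLPROPERTIES ('delta.targetFileSize' = '256mb')
""")
```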
Cluster summary:
- Workers: 1-15 (140-2,100 GB memory, 20-300 cores)
- Driver: 1 (140 GB memory, 20 cores)
- Runtime: 12.2.x-scala2.12
08-17-2023 02:01 AM
Hi @Axatar,
A write time this long can have several causes. Here are some suggestions based on the information you've shared and the linked documentation:
1. **Data duplication**: As described in the [merge documentation](https://docs.databricks.com/delta/merge.html), duplicate records in the incoming data set make the write slower because the same rows get inserted again. Using a MERGE operation instead of a plain insert avoids writing duplicate records and helps optimize the write.
2. **Optimize the Delta table**: The [optimize documentation](https://docs.databricks.com/_extras/_extras/notebooks/generated/delta/optimize-sql.sql.html) explains that the OPTIMIZE command consolidates small files and orders the Delta table's data for faster retrieval.
3. **Reduce the search space for matches**: As per the [best practices documentation](https://docs.databricks.com/delta/best-practices.html), you can reduce merge time by reducing the search space for matches, compacting files, controlling the shuffle partitions for writes, enabling optimized writes, and tuning file sizes in the table.
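A minimal sketch of points 1 and 2, assuming a Delta target table; the table names, the `id` match key, and the `updates_df` source are placeholders rather than anything from this thread:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Placeholder for the result of the 3-table join.
updates_df = spark.table("my_schema.source_join_result")

# MERGE instead of a plain insert so rows with duplicate keys are not written twice.
target = DeltaTable.forName(spark, "my_schema.staging_table")
(target.alias("t")
       .merge(updates_df.alias("s"), "t.id = s.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())

# Compact small files so later reads and merges scan fewer files.
spark.sql("OPTIMIZE my_schema.staging_table")
```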
08-17-2023 02:07 AM
Thank you for your prompt response. Some more context on the issue: the table I am writing into gets truncated every time I run my script (it's used as a staging table), which means I am inserting into an empty table on every run.
08-17-2023 05:54 AM
Hi @Axatar, based on the information provided, you can use the `preactions` option to execute a SQL `DELETE` command before loading new data into your staging table. This would ensure that the table is emptied before new data is inserted. The `preactions` option allows you to specify a `;`-separated list of SQL commands to be executed before the `COPY` command.
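The `preactions`/`COPY` wording matches the Databricks Redshift connector, so this sketch assumes that connector; every name and connection value below is a placeholder rather than something taken from the thread:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.table("my_schema.source_join_result")  # placeholder for the data being loaded

(df.write
   .format("redshift")
   .option("url", "jdbc:redshift://redshift-host:5439/dev?user=USER&password=PASS")
   .option("dbtable", "staging_table")
   .option("tempdir", "s3a://my-bucket/tmp/")
   .option("forward_spark_s3_credentials", "true")
   # Runs before the COPY; a ;-separated list of SQL statements is allowed.
   .option("preactions", "DELETE FROM staging_table;")
   .mode("append")
   .save())
```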
08-21-2023 10:24 AM
I wonder if you have already looked at the SQL plan to see which phase is taking more time.
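One way to do that is to print the plan before the write and compare it with the per-node timings in the Spark UI's SQL/DataFrame tab; a minimal sketch, with the source of the join result as a placeholder:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

result_df = spark.table("my_schema.source_join_result")  # placeholder for the 3-table join

# A CartesianProduct or BroadcastNestedLoopJoin node in the plan would point at the
# join itself, rather than the Delta write, as the expensive phase.
result_df.explain(mode="formatted")
```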
08-22-2023 02:30 AM
It turned out that the issue was not on the writing side. Even though I was getting the results in under 5 minutes, the real cost was the cross join in my query. I resolved it by doing the same cross joins via DataFrames; the results were computed and written in 17 minutes.
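For anyone hitting the same problem, a minimal sketch of what "doing the cross joins via DataFrames" could look like, with hypothetical table names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical names for the three source tables.
a = spark.table("my_schema.table_a")
b = spark.table("my_schema.table_b")
c = spark.table("my_schema.table_c")

# Express the cross joins directly on DataFrames instead of inside the SQL text.
result = a.crossJoin(b).crossJoin(c)

# Write the computed result into the staging Delta table.
(result.write
       .format("delta")
       .mode("overwrite")
       .saveAsTable("my_schema.staging_table"))
```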