
Query takes too long to write into a Delta table

Axatar
New Contributor III

Hello,

I am running into an issue while trying to write data into a Delta table. The query is a join between 3 tables; it takes 5 minutes to fetch the data but 3 hours to write it into the table. The select returns 700 records.

Here are the approaches I tested:

- Shared cluster: 3 h
- Isolated cluster: 2.88 h
- External table + Parquet + ZSTD compression: 2.63 h
- Adjusting table properties 'delta.targetFileSize' = '256mb', 'delta.tuneFileSizesForRewrites' = 'true' (see the sketch after this list): 2.9 h
- Bucketed insert (batches of 100M records each): too long, I had to cancel it
- Partitioning: not an option
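
For reference, the table-properties adjustment tried above can be expressed as a short command; a minimal sketch, assuming a Databricks notebook (where `spark` is the predefined session) and a placeholder table name `my_table`:

```python
# Sketch of the table-property tweak from the list above.
# `my_table` is a placeholder name; `spark` is the session
# that Databricks notebooks provide out of the box.
spark.sql("""
    ALTER TABLE my_table SET TBLPROPERTIES (
        'delta.targetFileSize' = '256mb',
        'delta.tuneFileSizesForRewrites' = 'true'
    )
""")
```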

Cluster summary:
- Workers: 1-15 (140-2,100 GB memory, 20-300 cores)
- Driver: 1 (140 GB memory, 20 cores)
- Runtime: 12.2.x-scala2.12


5 REPLIES

Kaniz
Community Manager

Hi @Axatar

The long write time could be due to several reasons. Here are some suggestions based on the information you've given and the linked documentation:

1. **Data Duplication**: As per the [merge documentation](https://docs.databricks.com/delta/merge.html), if there are duplicate records within the new data set, the write takes more time because it tries to insert the same data again. You can use the MERGE operation to avoid inserting duplicate records; this helps optimize the write (see the sketch after these suggestions).

2. **Optimize the Delta table**: The [optimize documentation](https://docs.databricks.com/_extras/_extras/notebooks/generated/delta/optimize-sql.sql.html) suggests that the OPTIMIZE command can consolidate files and order the Delta table data for faster retrieval.

3. **Reduce the search space for matches**: As per the [best practices documentation](https://docs.databricks.com/delta/best-practices.html), you can reduce the time taken by merge operations with approaches such as reducing the search space for matches, compacting files, controlling the shuffle partitions for writes, enabling optimized writes, and tuning file sizes in the table.
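
A minimal sketch of suggestions 1-3, assuming placeholder names (`target_tbl`, `source_tbl`, join key `id`) that are not from the original post:

```python
# Hedged sketch of the suggestions above; target_tbl, source_tbl
# and id are placeholders, not names from the original post.

# 1. MERGE instead of a plain INSERT, so rows whose key already
#    exists are updated rather than inserted a second time.
spark.sql("""
    MERGE INTO target_tbl AS t
    USING source_tbl AS s
    ON t.id = s.id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# 2. Compact small files after heavy write activity.
spark.sql("OPTIMIZE target_tbl")

# 3. A small result set does not need the default 200 shuffle
#    partitions; fewer partitions mean fewer tiny output files.
spark.conf.set("spark.sql.shuffle.partitions", "32")
```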

Axatar
New Contributor III

Thank you for your prompt response; here is more context on the issue.

The table I am writing data into gets truncated every time I run my script (it is used as a staging table), which means I am inserting into an empty table every time.

Kaniz
Community Manager

Hi @Axatar, based on the information provided, you can use the preactions option to execute a SQL DELETE command before loading new data into your staging table. This would ensure that the table is emptied before new data is inserted. The preactions option allows you to specify a semicolon-separated list of SQL commands to be executed before the COPY command.
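
For context, `preactions` is an option of data-source connectors such as the Databricks Redshift connector rather than of Delta writes themselves; a minimal sketch under that assumption, with `df`, `jdbc_url`, `temp_dir` and `staging_tbl` all as placeholders:

```python
# Sketch assuming the Redshift connector, which supports preactions.
# df, jdbc_url, temp_dir and staging_tbl are hypothetical placeholders.
(df.write
    .format("redshift")
    .option("url", jdbc_url)
    .option("dbtable", "staging_tbl")
    .option("tempdir", temp_dir)
    .option("preactions", "DELETE FROM staging_tbl;")  # runs before the COPY
    .mode("append")
    .save())
```

For a Delta staging table, `.mode("overwrite")` on the write gives the same truncate-then-load effect without a separate DELETE.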

Lakshay
Esteemed Contributor

I wonder if you have already looked at the SQL plan to see which phase is taking more time.
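
One quick way to check; a minimal sketch where `t1`, `t2`, `t3` and the join keys are placeholders for the three tables in the query:

```python
# Sketch: print the physical plan to see which phase dominates.
# t1, t2, t3 and the join keys are placeholders, not the real tables.
result_df = (
    spark.table("t1")
    .join(spark.table("t2"), "key1")
    .join(spark.table("t3"), "key2")
)
result_df.explain(mode="formatted")
```

After the query has actually run, the SQL tab of the Spark UI shows the same plan with per-stage timings.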

Axatar
New Contributor III

It turned out that the issue was not on the writing side. Even though I was getting the results in under 5 minutes, the problem was the cross join in my query. I resolved it by doing the same cross joins via DataFrames; the results were computed and written in 17 minutes.
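
A minimal sketch of moving a cross join from SQL into the DataFrame API; the table names (`dim_a`, `dim_b`, `facts`) and key columns are placeholders, not the poster's actual query:

```python
# Sketch: the same cross join expressed with the DataFrame API.
# dim_a, dim_b, facts and the key columns are hypothetical placeholders.
combos = spark.table("dim_a").crossJoin(spark.table("dim_b"))
result = combos.join(spark.table("facts"), ["a_id", "b_id"], "left")
result.write.format("delta").mode("overwrite").saveAsTable("staging_tbl")
```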
