Merge Operation is very slow for S/4 Table ACDOCA

Kishan1003
New Contributor

Hello,

We have a scenario in Databricks where we receive 60-70 million records every day, and it takes a very long time to merge them into a table that already holds 28 billion records. The time taken to rewrite the affected files is excessive. Merge time is not directly proportional to the number of incoming records; it depends mainly on the number of files the merge has to rewrite. The table is partitioned on Period, each period holds around 800 million records, and the daily delta spans roughly 3 years, i.e. all 36 partitions, and sometimes reaches back to 2020.

Please note this is a one-to-one copy of the source table with no transformation logic at all.

We have tried all the usual Spark settings, OPTIMIZE on the table, Z-ordering, and a big cluster with Photon (E16), but it still takes a long time to rewrite the updated files.

Can anyone suggest something, or has someone dealt with a similar case before and improved the performance?

Table size - 1.4 TB

Columns - 563

Partitioned by - Period

Time taken to merge and rewrite files - over 10 hours to update 3,000 files, and the affected files are not that large

Storage - Azure Data Lake Storage Gen2, Parquet format

Table type - Delta

If someone could help, it would be great 🙂


2 REPLIES

Kaniz
Community Manager

Hi @Kishan1003, to optimize the performance of merging large amounts of data into a Delta table, you can try the following methods:

  1. Use dynamic partition pruning to reduce the data scanned during the merge operation. When the merge condition filters on the partition column, Delta can prune the partitions that the incoming batch does not touch, so the amount of data scanned and rewritten drops significantly and the merge speeds up (see the partition-pruning sketch after this list).

  2. Optimize the schema and layout of your Delta table. A good layout reduces the amount of data scanned and the number of files rewritten during the merge, leading to faster performance. Partition on a low-cardinality column such as the period so that partitions stay a manageable size, and rely on Z-ordering rather than partitioning for high-cardinality columns. You can also enable Delta auto-optimize (optimized writes and auto compaction) to improve the table's file layout (see the table-properties sketch after this list).

  3. Another approach is to use a time series database, such as InfluxDB or OpenTSDB, designed to handle large volumes of time series data. These databases use different indexing and storage techniques than a traditional RDBMS or Delta Lake tables and can provide much better performance for time series data.

  4. Use the Delta Lake OPTIMIZE command with ZORDER BY. It rewrites the underlying files so that rows with similar values in the chosen columns end up in the same files, which means a merge on those columns has to rewrite fewer files (see the OPTIMIZE sketch after this list).

  5. Enable Databricks caching to share data between multiple jobs if applicable in your use case. Caching keeps frequently read data on the workers' local storage, which avoids repeated I/O against cloud storage and ensures faster read performance (see the caching sketch after this list).

  6. Tune the cluster configuration to achieve the best performance, then scale it based on your needs. Autoscaling based on the amount of data processed each day means you only pay for the resources your workload actually requires.

  7. Monitor the write operations and the Spark UI to identify bottlenecks, then tune read and write performance accordingly (see the merge-metrics sketch after this list).
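
Partition-pruning sketch (point 1). A minimal PySpark illustration of a merge that restricts the target scan to the periods present in the daily batch. The table name s4.acdoca, the source path, the partition column FISCYEARPER, and the key columns are assumptions for illustration only, not the poster's actual setup.

```python
# Merge only into the partitions touched by today's batch.
# All table, path and column names below are illustrative assumptions.
# `spark` is the ambient SparkSession in a Databricks notebook.
from delta.tables import DeltaTable

updates = spark.read.format("delta").load("/mnt/landing/acdoca_daily")  # ~60-70M rows/day

# Distinct periods present in the batch -- usually far fewer than all 36 partitions.
periods = [r[0] for r in updates.select("FISCYEARPER").distinct().collect()]
period_list = ", ".join(f"'{p}'" for p in periods)

key_cond = ("t.RBUKRS = s.RBUKRS AND t.GJAHR = s.GJAHR AND "
            "t.BELNR = s.BELNR AND t.DOCLN = s.DOCLN")

target = DeltaTable.forName(spark, "s4.acdoca")
(target.alias("t")
       .merge(updates.alias("s"),
              # The explicit partition filter lets Delta skip untouched periods.
              f"t.FISCYEARPER IN ({period_list}) AND {key_cond}")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```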
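
Table-properties sketch (point 2). Enabling auto-optimize on the target and checking partition sizes; the property names follow the Delta/Databricks documentation, while the table and column names are assumed as above.

```python
# Produce fewer, better-sized files during merges via optimized writes and auto compaction.
spark.sql("""
    ALTER TABLE s4.acdoca SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# Sanity-check partition sizes: each period should stay a manageable size.
spark.table("s4.acdoca").groupBy("FISCYEARPER").count().orderBy("FISCYEARPER").show(40)
```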
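
OPTIMIZE sketch (point 4). Compaction plus Z-ordering on the merge keys for the partitions the daily load usually touches; the period predicate and the Z-order columns are placeholders.

```python
# Compact recently touched partitions and cluster rows by the merge keys,
# so future merges rewrite fewer files. ZORDER columns should be
# high-cardinality and must not be partition columns.
spark.sql("""
    OPTIMIZE s4.acdoca
    WHERE FISCYEARPER >= '2023001'
    ZORDER BY (BELNR, DOCLN)
""")
```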
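
Caching sketch (point 5). The Databricks disk cache keeps hot Parquet data on the workers' local SSDs, so it only helps on node types that have local SSD storage; the CACHE SELECT pre-warm and the period values are illustrative.

```python
# Serve repeated reads of the same files from local SSDs instead of cloud storage.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Optionally pre-warm the cache for the partitions the merge will touch
# (CACHE SELECT is Databricks-specific syntax).
spark.sql("""
    CACHE SELECT * FROM s4.acdoca
    WHERE FISCYEARPER IN ('2023001', '2023002')
""")
```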
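
Merge-metrics sketch (point 7). Each MERGE commit records operation metrics in the table history, which show how many files a run rewrote and how long it took; the table name is assumed as above.

```python
from pyspark.sql import functions as F

# operationMetrics is a map of metric name -> value for every commit.
history = spark.sql("DESCRIBE HISTORY s4.acdoca")
(history.filter(F.col("operation") == "MERGE")
        .select("version", "timestamp",
                F.col("operationMetrics").getItem("numTargetFilesAdded").alias("files_added"),
                F.col("operationMetrics").getItem("numTargetFilesRemoved").alias("files_removed"),
                F.col("operationMetrics").getItem("executionTimeMs").alias("execution_ms"))
        .show(truncate=False))
```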

177991
New Contributor II

Hi @Kishan1003, did you find anything helpful? I'm dealing with a similar situation; the ACDOCA table on my side is around 300 million records (fairly smaller), and the incoming daily data is usually around 1 million. I have tried partitioning by period (the FISCYEARPER column), Z-ordering, and dynamic pruning. So far the best merge time has been around 1 hour. I want to understand whether I can achieve better performance before scaling up.
