02-01-2023 08:43 AM
When using Delta tables with DBR jobs or even with DLT pipelines, upserts (especially updates, matched on a key and a timestamp) take much longer than expected (~2 minutes even for a single-record poll), while inserts are lightning fast. The backend Parquet file that gets rewritten for even that one record contains other records as well.
What we tried:
Partitioning on the key proved to be a very bad idea and made even the inserts too slow.
ZORDER on key was also not helpful.
Please advise on what we can improve to update the Delta table in real time, with a Kafka topic as the source and Spark Structured Streaming, keeping extra compute as a last option. A sketch of our streaming upsert pattern is below.
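For context, this is roughly the upsert pattern we run (a minimal sketch; the table, topic, and column names pkey/event_ts are placeholders, and the Kafka value parsing is elided):

```python
from delta.tables import DeltaTable

# MERGE each micro-batch into the target Delta table on key + timestamp.
def upsert_to_delta(micro_batch_df, batch_id):
    target = DeltaTable.forName(spark, "target_table")  # placeholder name
    (
        target.alias("t")
        .merge(
            micro_batch_df.alias("s"),
            "t.pkey = s.pkey AND t.event_ts = s.event_ts",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<brokers>")  # placeholder
    .option("subscribe", "<topic>")                  # placeholder
    .load()
    # ...parse the Kafka value into pkey / event_ts / payload columns here...
    .writeStream.foreachBatch(upsert_to_delta)
    .option("checkpointLocation", "<checkpoint-path>")  # placeholder
    .start()
)
```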
02-01-2023 09:58 PM
Hi, how big are the files/tables?
02-02-2023 12:34 AM
There is only one target table, the Delta table (approx. 45M records in dev). The backend Parquet files (on ABFS) are laid out by Delta's internal algorithms.
Also, after ZORDER on the primary key, the files came out at almost the same size, but upserts were still slow.
Result after doing ZORDER (the file-size screenshot did not carry over): this is the dev result; prod data size is more than 10x. The command we ran is sketched below.
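For reference, the compaction/Z-order step was run roughly like this (a sketch; the table and key names are placeholders):

```python
# Compact small files and co-locate rows on the primary key.
spark.sql("OPTIMIZE target_table ZORDER BY (pkey)")
```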
04-06-2023 09:40 AM
Which DBR version are you using? Low shuffle merge might help (see the sketch below); docs: https://docs.databricks.com/optimizations/low-shuffle-merge.html
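If you are on a DBR version where low shuffle merge is not already on by default, my understanding from the doc above is that it can be switched on with a Spark conf; a minimal sketch (please verify the flag against your DBR version):

```python
# Enable low shuffle merge so MERGE avoids shuffling rows it does not modify.
spark.conf.set("spark.databricks.delta.merge.enableLowShuffle", "true")
```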
04-08-2023 09:28 PM
Hi @Surya Agarwal
Hope everything is going great.
Just wanted to check in to see if you were able to resolve your issue. If yes, would you mark an answer as best so that other members can find the solution more quickly? If not, please let us know so we can help you.
Cheers!