02-01-2023 08:43 AM
When using Delta tables with DBR jobs or even with DLT pipelines, upserts (especially updates, matched on a key and a timestamp) take much longer than expected (~2 minutes even for a single-record poll), while inserts are lightning fast. The backend Parquet file that gets rewritten for even that one record contains other records as well.
What we tried:
Partitioning on the key proved to be a very bad idea and made even the inserts too slow.
ZORDER on key was also not helpful.
Please advise on what we can improve to update the Delta table in real time, with a Kafka topic as the source and Spark Structured Streaming, keeping extra compute as a last option. A sketch of our streaming upsert pattern is below.
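For context, this is roughly the upsert pattern we run (a minimal sketch; the table, topic, and column names pkey/event_ts are placeholders, and the Kafka value parsing is elided):

```python
from delta.tables import DeltaTable

# MERGE each micro-batch into the target Delta table on key + timestamp.
def upsert_to_delta(micro_batch_df, batch_id):
    target = DeltaTable.forName(spark, "target_table")  # placeholder name
    (
        target.alias("t")
        .merge(
            micro_batch_df.alias("s"),
            "t.pkey = s.pkey AND t.event_ts = s.event_ts",
        )
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "<brokers>")  # placeholder
    .option("subscribe", "<topic>")                  # placeholder
    .load()
    # ...parse the Kafka value into pkey / event_ts / payload columns here...
    .writeStream.foreachBatch(upsert_to_delta)
    .option("checkpointLocation", "<checkpoint-path>")  # placeholder
    .start()
)
```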
02-01-2023 09:58 PM
Hi, how big are the files/tables?
02-02-2023 12:34 AM
There is only one target table, the Delta table (approx. 45M records in dev). The backend Parquet files (on ABFS) are laid out by Delta's internal algorithms.
Also, after ZORDER on the primary key, the files came out at almost the same size, but upserts were still slow.
Result after doing ZORDER (the file-size screenshot did not carry over): this is the dev result; prod data size is more than 10x. The command we ran is sketched below.
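For reference, the compaction/Z-order step was run roughly like this (a sketch; the table and key names are placeholders):

```python
# Compact small files and co-locate rows on the primary key.
spark.sql("OPTIMIZE target_table ZORDER BY (pkey)")
```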
04-06-2023 09:40 AM
Which DBR version are you using? Low shuffle merge might help (see the sketch below); docs: https://docs.databricks.com/optimizations/low-shuffle-merge.html
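If you are on a DBR version where low shuffle merge is not already on by default, my understanding from the doc above is that it can be switched on with a Spark conf; a minimal sketch (please verify the flag against your DBR version):

```python
# Enable low shuffle merge so MERGE avoids shuffling rows it does not modify.
spark.conf.set("spark.databricks.delta.merge.enableLowShuffle", "true")
```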
04-08-2023 09:28 PM
Hi @Surya Agarwal
Hope everything is going great.
Just wanted to check in to see if you were able to resolve your issue. If yes, would you mark an answer as best so that other members can find the solution more quickly? If not, please let us know so we can help you.
Cheers!