MERGE operation on PI data getting slower. How can I debug?

sajith_appukutt
Honored Contributor II

We have a Structured Streaming job that reads from Event Hubs and persists to the Delta raw/bronze layer via MERGE inside a foreachBatch. Lately, however, the merge has been taking longer. How can I optimize this pipeline?

1 ACCEPTED SOLUTION

sajith_appukutt
Honored Contributor II

Delta Lake completes a MERGE in two steps:

  1. Perform an inner join between the target table and source table to select all files that have matches.
  2. Perform an outer join between the selected files in the target and source tables and write out the updated/deleted/inserted data.
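To tell which of the two steps is the slow one, it may help to look at the operation metrics that Delta records for each MERGE in the table history. The sketch below assumes those metrics include `scanTimeMs` (finding the matching files) and `rewriteTimeMs` (writing the new files), which is how open-source Delta Lake reports them; the helper names and the history limit of 50 are hypothetical.

```python
from typing import Mapping


def slower_merge_phase(metrics: Mapping[str, str]) -> str:
    """Return 'scan' or 'rewrite', whichever phase took longer.

    `metrics` is the operationMetrics map of one MERGE entry in the
    table history; Delta reports the values as strings.
    """
    scan_ms = int(metrics.get("scanTimeMs", 0))
    rewrite_ms = int(metrics.get("rewriteTimeMs", 0))
    return "scan" if scan_ms >= rewrite_ms else "rewrite"


def latest_merge_metrics(spark, table_name: str) -> Mapping[str, str]:
    """Fetch operationMetrics of the most recent MERGE on `table_name`."""
    # Imported lazily so the pure-Python helper above works without delta-spark.
    from delta.tables import DeltaTable

    history = DeltaTable.forName(spark, table_name).history(50)
    row = (
        history.filter("operation = 'MERGE'")
        .orderBy("version", ascending=False)
        .select("operationMetrics")
        .first()
    )
    return row["operationMetrics"] if row else {}
```

A result of `"scan"` points at the first step (file selection), `"rewrite"` at the second; comparing the metrics of several recent versions shows whether the slowdown is a trend.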

If finding the files that Delta Lake needs to rewrite is taking too long, try:

  • Add more predicates to narrow down the search space.
  • Adjust shuffle partitions.
  • Adjust broadcast join thresholds.
  • Right-size the files (balance too many small files against too few large files).
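As an illustration of the first suggestion, here is a minimal sketch of a foreachBatch upsert that adds a partition predicate to the merge condition, so that only the partitions present in the micro-batch need to be searched. The table name `bronze.pi_events` and the columns `tag_id`, `event_ts` and `event_date` are hypothetical, and the sketch assumes the target table is partitioned by `event_date`.

```python
from typing import Iterable, Sequence


def build_merge_condition(keys: Sequence[str], partition_col: str,
                          partition_values: Iterable[str]) -> str:
    """Join target (t) and source (s) on `keys`, restricted to given partitions."""
    key_match = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    values = sorted(set(partition_values))
    if not values:
        return key_match  # nothing to prune on
    # Quote each value as a SQL string literal, doubling embedded quotes.
    quoted = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return f"{key_match} AND t.{partition_col} IN ({quoted})"


def upsert_batch(batch_df, batch_id: int) -> None:
    """foreachBatch handler: MERGE one micro-batch into the bronze table."""
    from delta.tables import DeltaTable  # lazy import; needs delta-spark

    # Partition values actually present in this micro-batch.
    dates = [str(r[0]) for r in batch_df.select("event_date").distinct().collect()]
    if not dates:
        return  # empty batch, nothing to merge
    condition = build_merge_condition(["tag_id", "event_ts"], "event_date", dates)
    # batch_df.sparkSession assumes PySpark 3.3 or later.
    (DeltaTable.forName(batch_df.sparkSession, "bronze.pi_events")
        .alias("t")
        .merge(batch_df.alias("s"), condition)
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())
```

The handler would be attached with `stream_df.writeStream.foreachBatch(upsert_batch)`. Collecting the distinct dates costs an extra pass over the micro-batch, which is usually small compared with scanning the whole target table.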

If rewriting the files themselves is taking too long, try:

  • Adjust shuffle partitions and Adaptive Query Execution (AQE) settings.
  • Enable optimized writes.
  • Adjust broadcast join thresholds.
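The session settings behind these suggestions could be applied as in the sketch below. The configuration keys are the standard Spark SQL ones plus the Databricks optimized-writes flag; the values are placeholders chosen for illustration, not recommendations, and the right numbers depend on cluster size and data volume.

```python
from typing import Dict, Optional

# Placeholder values (assumptions); tune them against your own workload.
MERGE_TUNING: Dict[str, str] = {
    "spark.sql.shuffle.partitions": "400",
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.autoBroadcastJoinThreshold": str(64 * 1024 * 1024),  # bytes
    "spark.databricks.delta.optimizeWrite.enabled": "true",
}


def apply_merge_tuning(spark, overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Set the tuning options on the session and return what was applied."""
    settings = {**MERGE_TUNING, **(overrides or {})}
    for key, value in settings.items():
        spark.conf.set(key, value)
    return settings
```

Calling `apply_merge_tuning(spark, {"spark.sql.shuffle.partitions": "200"})` before starting the stream changes one value while keeping the rest. Changing one setting at a time and re-checking the merge duration makes it easier to see which one actually helps.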
