Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

MERGE operation on PI data getting slower. How can I debug?

sajith_appukutt
Honored Contributor II

We have a Structured Streaming job that reads from Event Hubs and persists to the Delta raw/bronze layer via MERGE inside a foreachBatch. Lately, however, the merge has been taking longer. How can I optimize this pipeline?

Accepted Solutions

sajith_appukutt
Honored Contributor II

Delta Lake completes a MERGE in two steps:

  1. Perform an inner join between the target table and source table to select all files that have matches.
  2. Perform an outer join between the selected files in the target and source tables and write out the updated/deleted/inserted data.
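The two steps above can be pictured with a toy model in plain Python. This is only a sketch of the idea, not Delta Lake's implementation: the "files" are dictionary entries, the names `find_touched_files`, `merge`, and `merged-file` are made up for illustration, and deletes are not modeled.

```python
from typing import Dict, List, Tuple

Row = Tuple[int, str]            # (key, value)
Table = Dict[str, List[Row]]     # file name -> rows stored in that file


def find_touched_files(target: Table, source: List[Row]) -> List[str]:
    """Step 1: inner join on key, keeping only the names of files with a match."""
    source_keys = {key for key, _ in source}
    return sorted(
        name for name, rows in target.items()
        if any(key in source_keys for key, _ in rows)
    )


def merge(target: Table, source: List[Row], new_file: str = "merged-file") -> Table:
    """Step 2: outer join the touched files with the source and rewrite them."""
    touched = find_touched_files(target, source)
    # Untouched files are carried over as-is; step 2 never reads them.
    result = {name: rows for name, rows in target.items() if name not in touched}
    # Outer join: start from the rows in touched files, then let source rows
    # overwrite matching keys (update) or add new keys (insert).
    combined = {key: value for name in touched for key, value in target[name]}
    combined.update(dict(source))
    result[new_file] = sorted(combined.items())
    return result


target = {
    "file-a": [(1, "old"), (2, "old")],
    "file-b": [(3, "old")],
}
source = [(2, "new"), (4, "new")]

touched = find_touched_files(target, source)
merged = merge(target, source)
print(touched)
print(merged)
```

Only `file-a` holds a matching key, so only that file is rewritten; `file-b` is left alone. The fewer files step 1 selects, the less work step 2 has to do, which is why the suggestions below split along these two steps.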

If finding the files that Delta Lake needs to rewrite is taking too long, try:

  • Add more predicates to narrow down the search space.
  • Adjust shuffle partitions.
  • Adjust broadcast join thresholds.
  • Right-size the files (balance between too many small files vs few large files).
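As one way to add predicates, the merge condition can carry an extra filter on a partition column so that files outside the affected partitions are skipped. The helper below is a minimal sketch under assumptions: the function name, the `t`/`s` aliases, and the `event_id`/`event_date` columns are hypothetical and not taken from the original pipeline.

```python
from typing import Iterable


def build_merge_condition(key_column: str, partition_column: str,
                          partition_values: Iterable[str]) -> str:
    """Return a MERGE condition that also restricts the target partitions."""
    values = sorted(set(partition_values))
    if not values:
        raise ValueError("partition_values must not be empty")
    # Quote each value as a SQL string literal, escaping embedded quotes.
    literals = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
    return (
        f"t.{key_column} = s.{key_column} "
        f"AND t.{partition_column} IN ({literals})"
    )


# Hypothetical column names; the values would come from the micro-batch.
condition = build_merge_condition(
    "event_id", "event_date", ["2024-01-02", "2024-01-01", "2024-01-02"]
)
print(condition)
```

Inside foreachBatch, the partition values could be collected from the distinct values present in the micro-batch, and the resulting string passed as the merge condition, with the target aliased as `t` and the source as `s`.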

If rewriting the files themselves is taking too long, try:

  • Adjust shuffle partitions or enable Adaptive Query Execution (AQE).
  • Enable optimized writes.
  • Adjust broadcast join thresholds.
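The knobs mentioned in both lists map to session-level Spark settings. The sketch below gathers them in one place; the values are placeholders (200 and 10 MB are simply the Spark defaults for the first and third settings), and `MERGE_TUNING` and `apply_settings` are hypothetical names. The right numbers depend on your data volume and cluster size.

```python
from typing import Dict

# Illustrative starting values only; tune them against your own workload.
MERGE_TUNING: Dict[str, str] = {
    # Number of partitions used when shuffling data for joins.
    "spark.sql.shuffle.partitions": "200",
    # Let Adaptive Query Execution adjust the plan at runtime.
    "spark.sql.adaptive.enabled": "true",
    # Maximum table size in bytes that is broadcast in a join (10 MB here).
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
    # Optimized writes for Delta tables on Databricks.
    "spark.databricks.delta.optimizeWrite.enabled": "true",
}


def apply_settings(spark, settings: Dict[str, str]) -> None:
    """Set each option on the given SparkSession via spark.conf.set."""
    for key, value in settings.items():
        spark.conf.set(key, value)
```

Calling `apply_settings(spark, MERGE_TUNING)` before the stream starts applies all four settings. Changing one at a time and comparing the MERGE stages in the Spark UI makes it easier to tell which step, file selection or file rewrite, actually benefits.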
