Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

MERGE operation on PI data getting slower. How can I debug?

sajith_appukutt
Honored Contributor II

We have a Structured Streaming job that reads from Event Hubs and persists to the Delta raw/bronze layer via MERGE inside a foreachBatch. Lately, however, the MERGE has been taking longer to complete. How can I optimize this pipeline?

1 ACCEPTED SOLUTION

Accepted Solutions

sajith_appukutt
Honored Contributor II

Delta Lake completes a MERGE in two steps:

  1. Perform an inner join between the target table and source table to select all files that have matches.
  2. Perform an outer join between the selected files in the target and source tables and write out the updated/deleted/inserted data.
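
To make the two steps concrete, here is a toy model in plain Python. It is only a sketch of the idea, not Delta Lake's implementation: the "table" is a dict of file name to rows, and the function names, the `id` key, and the `.rewritten` / `new-file` naming are all made up for illustration.

```python
from typing import Dict, List, Set

Row = Dict[str, object]
Files = Dict[str, List[Row]]


def find_touched_files(target_files: Files, source_keys: Set[object]) -> Set[str]:
    """Step 1 (inner-join-like): names of files holding at least one matching key."""
    return {
        name
        for name, rows in target_files.items()
        if any(row["id"] in source_keys for row in rows)
    }


def rewrite_files(target_files: Files, source_rows: List[Row], touched: Set[str]) -> Files:
    """Step 2 (outer-join-like): rewrite only the touched files and add inserts."""
    source_by_id = {row["id"]: row for row in source_rows}
    existing_ids = {row["id"] for rows in target_files.values() for row in rows}
    result: Files = {}
    for name, rows in target_files.items():
        if name in touched:
            # The whole file is rewritten, even if only one row in it changed.
            result[name + ".rewritten"] = [source_by_id.get(row["id"], row) for row in rows]
        else:
            # Files without matches are left alone.
            result[name] = rows
    inserts = [row for row in source_rows if row["id"] not in existing_ids]
    if inserts:
        result["new-file"] = inserts
    return result


target = {
    "f1": [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    "f2": [{"id": 3, "v": "c"}],
}
source = [{"id": 2, "v": "B"}, {"id": 4, "v": "d"}]

touched = find_touched_files(target, {row["id"] for row in source})
merged = rewrite_files(target, source, touched)
```

In this toy run only `f1` is rewritten, while `f2` is carried over untouched. That is why the cost of a MERGE depends on how many files step 1 has to scan and how many files step 2 has to rewrite.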

If finding the files that Delta Lake needs to rewrite is taking too long, try:

  • Add more predicates to narrow down the search space.
  • Adjust shuffle partitions.
  • Adjust broadcast join thresholds.
  • Right-size the files (balance too many small files against too few large files).
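
As an example of adding predicates, one common approach is to put a filter on the target's partition column into the merge condition, so that the file search only looks at partitions the micro-batch actually touches. The sketch below assumes a target partitioned by a date-like column. The helper name and the column names (`event_date`, `tag_id`, `event_ts`) are hypothetical, not taken from the original pipeline.

```python
from typing import Iterable, Sequence


def build_merge_condition(
    key_cols: Sequence[str],
    partition_col: str,
    partition_values: Iterable[str],
    target_alias: str = "t",
    source_alias: str = "s",
) -> str:
    """Build a MERGE condition string that also restricts the target partitions."""
    key_match = " AND ".join(
        f"{target_alias}.{col} = {source_alias}.{col}" for col in key_cols
    )
    values = sorted(set(partition_values))
    if not values:
        # Nothing to prune on; fall back to the plain key match.
        return key_match
    # Quote each value as a SQL string literal, doubling embedded single quotes.
    in_list = ", ".join("'" + value.replace("'", "''") + "'" for value in values)
    return f"{target_alias}.{partition_col} IN ({in_list}) AND {key_match}"


condition = build_merge_condition(
    key_cols=["tag_id", "event_ts"],
    partition_col="event_date",
    partition_values=["2021-03-02", "2021-03-01", "2021-03-02"],
)
```

Inside the foreachBatch function, the distinct partition values would be collected from the micro-batch and the resulting string passed as the condition to the Delta `merge` call, with the target aliased as `t` and the batch as `s`. This only helps if the predicate lines up with how the target is partitioned or clustered.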

If rewriting the files themselves is taking too long, try:

  • Adjust shuffle partitions or Adaptive Query Execution (AQE) settings.
  • Enable optimized writes.
  • Adjust broadcast join thresholds.
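
The settings mentioned above map to Spark configuration keys that can be set on the session before the MERGE runs. This is a rough sketch: the values are placeholders to tune against your own workload, not recommendations, and `apply_merge_tuning` is a hypothetical helper. It assumes `spark` is an active SparkSession on Databricks.

```python
# Placeholder values; measure before and after changing any of these.
MERGE_TUNING = {
    # Number of partitions used for shuffles in joins and aggregations.
    "spark.sql.shuffle.partitions": "200",
    # Adaptive Query Execution, which can coalesce shuffle partitions at runtime.
    "spark.sql.adaptive.enabled": "true",
    # Maximum table size in bytes that Spark will broadcast in a join.
    "spark.sql.autoBroadcastJoinThreshold": str(10 * 1024 * 1024),
    # Databricks optimized writes for Delta tables.
    "spark.databricks.delta.optimizeWrite.enabled": "true",
}


def apply_merge_tuning(spark, settings=MERGE_TUNING):
    """Set each config on the session and return the keys that were applied."""
    applied = []
    for key, value in settings.items():
        spark.conf.set(key, value)
        applied.append(key)
    return applied
```

Calling `apply_merge_tuning(spark)` once at job start is enough for session-level settings. Changing one setting at a time makes it easier to tell from the Spark UI which step of the MERGE got faster.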
