Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Liquid Clustering With Merge

Mous92i
New Contributor

Hello, I'm facing severe performance issues with a MERGE INTO on Databricks.

merge_condition = """
    source.data_hierarchy = target.data_hierarchy AND
    source.sensor_id = target.sensor_id AND
    source.timestamp = target.timestamp
"""
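For reference, a minimal sketch of how such a condition is typically embedded in the full statement (the `updates` view and `sensor_readings` table names are illustrative, not from the thread; in a notebook the resulting string would be run via `spark.sql(...)`):

```python
def build_merge_sql(target: str, source: str, condition: str) -> str:
    # Assemble a MERGE statement around the join condition.
    return (
        f"MERGE INTO {target} AS target "
        f"USING {source} AS source "
        f"ON {condition} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )

merge_condition = (
    "source.data_hierarchy = target.data_hierarchy AND "
    "source.sensor_id = target.sensor_id AND "
    "source.timestamp = target.timestamp"
)
sql = build_merge_sql("sensor_readings", "updates", merge_condition)
```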

The target Delta table is using Liquid Clustering on exactly the columns:
data_hierarchy, sensor_id, timestamp.

Yet, the MERGE operation takes over 40 minutes (or sometimes more).
From the Spark logs, it seems like all files are being scanned:

 

MERGE operation - scanning files for matches
… 32 min | 3113/3113 files scanned (~72.2 GiB)

 

I would expect Liquid Clustering to reduce full-file scans when merge keys align with clustering keys.

My questions:

  • Why would Databricks still scan all files even with Liquid Clustering enabled?

  • Are there specific configuration tweaks or best practices to speed up MERGE in this setup?

Thanks a lot for any help or shared experiences!

 

2 ACCEPTED SOLUTIONS


ManojkMohan
Honored Contributor

@Mous92i 

Root Cause

Observing the log:
"MERGE operation, scanning files for matches … 32 min | 3113/3113 files scanned (~72.2 GiB)" shows that every data file in the target is scanned during the merge. This leads to high input/output and long execution times.

Why is this happening despite Liquid Clustering?
Liquid Clustering applies clustering incrementally, only to newly written files; older, unoptimized files are never reorganized and so force full scans.

Without frequent OPTIMIZE operations, files stay fragmented and are not reorganized by clustering keys. Because of this, Spark cannot effectively prune files for the predicate in MERGE.

Incremental Nature of Liquid Clustering: Liquid Clustering clusters newly written data files gradually. It does not reorganize existing files right away. So, if you don't run OPTIMIZE, files created before enabling Liquid Clustering stay unclustered. This leads to full scans during MERGE.

https://docs.databricks.com/aws/en/delta/clustering

Lack of OPTIMIZE After MERGE: MERGE operations do not automatically trigger full reclustering. Without regular OPTIMIZE to compact and recluster the data, many files remain fragmented and poorly organized along clustering keys. This causes the system to scan all files.

https://www.youtube.com/watch?v=yZmrpXJg-G8

Solution and Best Practices to Improve MERGE Performance

Schedule Frequent OPTIMIZE Commands: Run OPTIMIZE on your Delta table regularly after MERGE. This process physically reorganizes files based on clustering keys and merges small files. It helps with effective data skipping and file pruning during MERGE.

https://dev.to/aj_ankit85/liquid-clustering-optimizing-databricks-workloads-for-performance-and-cost...
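As a hedged sketch, a per-batch job might simply follow each MERGE with an incremental OPTIMIZE (the table name and MERGE string are illustrative):

```python
def batch_statements(table: str, merge_sql: str) -> list[str]:
    # SQL to execute per micro-batch: the MERGE itself, then an
    # incremental OPTIMIZE so new files are reclustered on the keys.
    return [merge_sql, f"OPTIMIZE {table}"]

stmts = batch_statements("sensor_readings", "MERGE INTO sensor_readings ...")
```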

Leverage Predicate Pushdown: Write your MERGE conditions to allow Spark to push down filters on clustering keys. This limits the files scanned by removing irrelevant files early.
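One common way to make the pushdown explicit is to add literal bounds on a clustering key, computed from the incoming batch. A minimal sketch (in a real job, `min_ts`/`max_ts` would come from an aggregation over the source DataFrame; the timestamps below are made up):

```python
def bounded_merge_condition(min_ts: str, max_ts: str) -> str:
    base = (
        "source.data_hierarchy = target.data_hierarchy AND "
        "source.sensor_id = target.sensor_id AND "
        "source.timestamp = target.timestamp"
    )
    # Literal bounds on the clustering key give the optimizer a static
    # predicate, so target files outside the batch's range can be pruned.
    return (
        f"{base} AND target.timestamp >= '{min_ts}'"
        f" AND target.timestamp <= '{max_ts}'"
    )

cond = bounded_merge_condition("2024-01-01 00:00:00", "2024-01-01 01:00:00")
```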

Enable Photon Runtime: Use the Photon engine in Databricks Runtime 15.2 or later to gain from faster query execution and improvements in MERGE and clustering performance.

https://docs.databricks.com/aws/en/delta/clustering

Monitor File Size and Skew: Set up auto-compaction and adjust cluster size to cut down on too many small files and balance data distribution for better clustering.

Use Change Data Feed (CDF) for Incremental Updates: Whenever possible, handle incremental changes with CDF methods instead of full MERGE scans to lessen overhead.
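A hedged sketch of the reader side, assuming Change Data Feed is enabled on the source table (the version number and table name are hypothetical; the commented lines show where the pyspark calls would go):

```python
# Options for reading only changed rows from a Delta table with CDF
# enabled, rather than re-merging the full source each run.
cdf_options = {
    "readChangeFeed": "true",
    "startingVersion": "42",  # hypothetical last-processed commit version
}
# changes = (spark.read.format("delta")
#            .options(**cdf_options)
#            .table("sensor_readings"))
```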

Maintain Table Metadata and History: Regularly check Delta table metadata and transaction logs to confirm the clustering state and ensure OPTIMIZE jobs are running effectively.


K_Anudeep
Databricks Employee

Hi @Mous92i 

Dynamic File Pruning (DFP) is what pushes source filters down to the target to skip files. For MERGE/UPDATE/DELETE, DFP only works on Photon-enabled compute. If you're not on Photon, MERGE will scan everything.

Enabling Liquid Clustering doesn't recluster past files. Until you run OPTIMIZE FULL once (after enabling clustering or changing keys), old files remain unclustered and don't prune well. After that, run OPTIMIZE regularly to keep up.
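Sketched as the two maintenance commands (table name is illustrative; `OPTIMIZE ... FULL` reclusters all existing files once, plain `OPTIMIZE` then keeps up incrementally):

```python
# One-time, after enabling Liquid Clustering or changing clustering keys:
one_time = "OPTIMIZE sensor_readings FULL"
# Recurring, e.g. scheduled after each ingest/MERGE batch:
recurring = "OPTIMIZE sensor_readings"
```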


3 REPLIES


Mous92i
New Contributor

Thanks for your response