Wednesday
Hello, I'm facing severe performance issues with a MERGE into a Delta table on Databricks.
merge_condition = """
source.data_hierarchy = target.data_hierarchy AND
source.sensor_id = target.sensor_id AND
source.timestamp = target.timestamp
"""
The target Delta table uses Liquid Clustering on exactly these columns:
data_hierarchy, sensor_id, timestamp.
Yet the MERGE operation takes over 40 minutes, sometimes longer.
From the Spark logs, it seems like all files are being scanned:
MERGE operation - scanning files for matches
… 32 min | 3113/3113 files scanned (~72.2 GiB)
I would expect Liquid Clustering to reduce full-file scans when merge keys align with clustering keys.
My questions:
Why would Databricks still scan all files even with Liquid Clustering enabled?
Are there specific configuration tweaks or best practices to speed up MERGE in this setup?
Thanks a lot for any help or shared experiences!
Wednesday - last edited Wednesday
Root Cause
Observing the log:
"MERGE operation, scanning files for matches โฆ 32 min | 3113/3113 files scanned (~72.2 GiB)" shows that every data file in the target is scanned during the merge. This leads to high input/output and long execution times.
Why is this happening despite Liquid Clustering?
Liquid Clustering reclusters data incrementally as it is written; it does not go back and reorganize older files. Those older, unclustered files still force full scans.
Without regular OPTIMIZE runs, files stay fragmented and are not laid out along the clustering keys, so Spark cannot prune files for the MERGE predicate.
Incremental Nature of Liquid Clustering: clustering is applied gradually to newly written data; existing files are not reorganized right away. Files created before Liquid Clustering was enabled stay unclustered until you run OPTIMIZE (examples follow the best-practices list below), which leads to full scans during MERGE.
https://docs.databricks.com/aws/en/delta/clustering
Lack of OPTIMIZE After MERGE: MERGE operations do not trigger reclustering on their own. Without regular OPTIMIZE runs to compact and recluster the data, many files remain fragmented and poorly organized along the clustering keys, so the system scans them all.
https://www.youtube.com/watch?v=yZmrpXJg-G8
Solution and Best Practices to Improve MERGE Performance
Schedule Frequent OPTIMIZE Commands: Run OPTIMIZE on the Delta table regularly after MERGE. It physically reorganizes files along the clustering keys and compacts small files, enabling effective data skipping and file pruning during MERGE (first sketch after this list).
Leverage Predicate Pushdown: Write the MERGE condition so that filters on the clustering keys can be pushed down to the target, eliminating irrelevant files early (second sketch below).
Enable Photon Runtime: Use the Photon engine on Databricks Runtime 15.2 or later to benefit from faster query execution and improvements to MERGE and clustering performance.
https://docs.databricks.com/aws/en/delta/clustering
Monitor File Size and Skew: Enable auto-compaction and optimized writes to reduce the number of small files and balance the data distribution for better clustering (table properties are included in the first sketch below).
Use Change Data Feed (CDF) for Incremental Updates: Whenever possible, process incremental changes with CDF instead of re-scanning everything in a full MERGE, to reduce overhead (third sketch below).
Maintain Table Metadata and History: Regularly check the Delta table's metadata and transaction history to confirm the clustering state and that OPTIMIZE jobs are running effectively (fourth sketch below).
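First, a minimal sketch of the maintenance step. The table name sensor_readings is hypothetical; substitute your own:

# Run after merge batches or on a schedule (e.g., a Databricks job).
spark.sql("OPTIMIZE sensor_readings")

# Optional: curb small-file buildup at write time via table properties.
spark.sql("""
    ALTER TABLE sensor_readings SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")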
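Second, a sketch of a pruning-friendly MERGE condition, assuming each incoming batch covers a narrow time range: compute the batch's bounds and add them as a static predicate on the target's clustering key, so file-level statistics can rule out untouched files before the join:

from pyspark.sql import functions as F

# Bounds of the incoming batch (source_df is the batch DataFrame).
row = source_df.agg(F.min("timestamp").alias("lo"),
                    F.max("timestamp").alias("hi")).first()
lo, hi = row["lo"], row["hi"]

merge_condition = f"""
    source.data_hierarchy = target.data_hierarchy AND
    source.sensor_id = target.sensor_id AND
    source.timestamp = target.timestamp AND
    target.timestamp >= '{lo}' AND target.timestamp <= '{hi}'
"""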
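Third, if the upstream table has Change Data Feed enabled, changed rows can be read incrementally instead of reprocessing everything. The source table name and starting version are hypothetical; in practice the version would come from a checkpoint persisted between runs:

# Requires 'delta.enableChangeDataFeed' = 'true' on the source table.
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1042)   # hypothetical: last processed version + 1
    .table("sensor_readings_staging")  # hypothetical upstream table
)
# Deduplicate as needed, then feed only these rows into the MERGE.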
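Fourth, to verify the clustering state and confirm OPTIMIZE is actually running (table name again hypothetical):

# clusteringColumns should list data_hierarchy, sensor_id, timestamp.
spark.sql("DESCRIBE DETAIL sensor_readings") \
    .select("clusteringColumns").show(truncate=False)

# Look for recent OPTIMIZE entries in the transaction history.
spark.sql("DESCRIBE HISTORY sensor_readings") \
    .filter("operation = 'OPTIMIZE'") \
    .select("timestamp", "operationMetrics").show(truncate=False)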
yesterday
Hi @Mous92i
Dynamic File Pruning (DFP) is what pushes source filters down to the target to skip files. For MERGE/UPDATE/DELETE, DFP only works on Photon-enabled compute. If you're not on Photon, the MERGE will scan everything.
Also, enabling Liquid Clustering doesn't recluster existing files. Until you run OPTIMIZE FULL once (after enabling clustering or changing keys), old files remain unclustered and don't prune well. Then run regular OPTIMIZE to keep up.
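For example (table name illustrative):

spark.sql("OPTIMIZE sensor_readings FULL")  # one-time: reclusters all existing files
spark.sql("OPTIMIZE sensor_readings")       # afterwards: cheap incremental runs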
2 hours ago
Thanks for your response