cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Liquid Clustering With Merge

Raja_Databricks
New Contributor II

Hi there,

I'm working with a large Delta table (2TB) and I'm looking for the best way to efficiently update it with new data (10GB). I'm particularly interested in using Liquid Clustering for faster queries, but I'm unsure if it supports updates efficiently.

Currently, I can add new data using methods like INSERT, CTAS, COPY INTO, and Spark's append mode. However, these don't handle updates well, treating them as completely new entries.

Here's my challenge:

  • I want to update existing data (even data from previous months) efficiently.
  • My current code only checks for recent data (current and previous day) to avoid scanning the entire table. This means updates to older data get treated as new entries, causing slow processing.

Is there a way to perform efficient "merge" or "upsert" operations with Liquid Clustering? This would ideally combine inserting new data and updating existing data in a single step.

Thank you for your help!

@JunYang , @youssefmrini  Please help me here

1 ACCEPTED SOLUTION

Accepted Solutions

Here's I how would do it, leveraging the medallion architecture : 

- Instead of a temporary table, load everything in a bronze table in append mode. This table should be liquid-clustered, and you should run OPTIMIZE on it regularly. It should also have CDF enabled.

- Next, from this table fetch only the new rows using change datafeed. Something like SELECT * FROM table_changes('bronze_table', last_commit_version) 

- Run the MERGE into the silver table, which in your case would be the full updated table, that should also be with liquid clustering, and also with an OPTIMIZE statement scheduled regularly.

This way both your tables are optimized, you keep all of the history in your bronze table but it remains efficient because you're loading only what's necessary with CDF. 

It will remain efficient as long as you don't forget to OPTIMIZE. Also note that with liquid clustering, OPTIMIZE is incremental, so it only optimized what's new, so don't hesitate to run that statement often !

View solution in original post

5 REPLIES 5

Kaniz
Community Manager
Community Manager

Hi @Raja_DatabricksLiquid Clustering in Delta Lake is a powerful feature that can significantly enhance query performance and simplify data layout decisions.

Let’s dive into how you can leverage it for efficient updates and merges.

  1. What is Liquid Clustering?

  2. Scenarios Benefiting from Clustering:

  3. Enabling Liquid Clustering:

  4. Efficient Upsert Operations:

    • While Liquid Clustering doesn’t directly handle upserts, you can achieve efficient upserts using the following approach:
      • Step 1: Insert new data into the clustered table using INSERT INTO.
      • Step 2: Use a MERGE statement to update existing data based on a condition (e.g., matching keys).
      • This combines inserting new data and updating existing data in a single step.
      • Example:
        MERGE INTO my_table AS target
        USING new_data AS source
        ON target.col0 = source.col0
        WHEN MATCHED THEN
          UPDATE SET target.col1 = source.col1
        WHEN NOT MATCHED THEN
          INSERT (col0, col1) VALUES (source.col0, source.col1);
        
    • Remember to adjust the MERGE condition according to your specific use case.
  5. Liquid Clustering Internals:

In summary, Liquid Clustering is a game-changer for optimizing query performance and handling evolving data layouts. Consider using it for your large Delta table to efficiently manage updates and merges. Happy clustering! 🚀🍃

 

erigaud
Honored Contributor

Hello, 
I'll complement @Kaniz's answer, which is already very detailed.

First of all if you're considering Liquid Clustering, you need to know on which columns you'll cluster your table. These need to be the columns on which you're gonna merge.

Then, before running the MERGE you need to fetch only the rows that have changed or that are new : this is easily achieved with Change Data Feed.

Enable it for your table using : 
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true'); 
Once you've got that, you can access the last modified rows with 
SELECT * FROM table_changes('my_table', last_commit_version) 
and use that dataset to run your merge, as explained in Kaniz's answer.

Let me know if that helped or if you need more details !

Raja_Databricks
New Contributor II

Hi @Kaniz , @erigaud ,

Thank you for your assistance.

I've enabled Liquid Clustering on my Delta table, using the same columns as my merge keys. Daily, I load new data into a temporary table that's an exact structure of my main Delta table (also Liquid Clustered). Then, I merge the temporary data into the main table.

I have three questions:

  1. Before merging the temporary table into the main table, do I need to optimize the temporary table (Databricks suggests this in a hint)?
  2. Would Change Data Capture (CDC) be beneficial in this scenario?
  3. My Main Delta table will be highly growing one, will merge on primary key will be efficient over time?

Thanks in advance for your insights!

Here's I how would do it, leveraging the medallion architecture : 

- Instead of a temporary table, load everything in a bronze table in append mode. This table should be liquid-clustered, and you should run OPTIMIZE on it regularly. It should also have CDF enabled.

- Next, from this table fetch only the new rows using change datafeed. Something like SELECT * FROM table_changes('bronze_table', last_commit_version) 

- Run the MERGE into the silver table, which in your case would be the full updated table, that should also be with liquid clustering, and also with an OPTIMIZE statement scheduled regularly.

This way both your tables are optimized, you keep all of the history in your bronze table but it remains efficient because you're loading only what's necessary with CDF. 

It will remain efficient as long as you don't forget to OPTIMIZE. Also note that with liquid clustering, OPTIMIZE is incremental, so it only optimized what's new, so don't hesitate to run that statement often !

youssefmrini
Honored Contributor III
Honored Contributor III

Liquid will be a good option. Just make sure to run the optimize whenever you upsert data. Don't worry the optimize won't be expensive as it will run on only the latest data in order to cluster them and have fast queries

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.