Databricks Community

Raja_Databricks · ‎04-08-2024

Hi there,

I'm working with a large Delta table (2TB) and I'm looking for the best way to efficiently update it with new data (10GB). I'm particularly interested in using Liquid Clustering for faster queries, but I'm unsure if it supports updates efficiently.

Currently, I can add new data using methods like INSERT, CTAS, COPY INTO, and Spark's append mode. However, these don't handle updates well, treating them as completely new entries.

Here's my challenge:

I want to update existing data (even data from previous months) efficiently.
My current code only checks for recent data (current and previous day) to avoid scanning the entire table. This means updates to older data get treated as new entries, causing slow processing.

Is there a way to perform efficient "merge" or "upsert" operations with Liquid Clustering? This would ideally combine inserting new data and updating existing data in a single step.

Thank you for your help!

@SparkJun , @youssefmrini Please help me here

erigaud · ‎04-11-2024

Here's I how would do it, leveraging the medallion architecture :

- Instead of a temporary table, load everything in a bronze table in append mode. This table should be liquid-clustered, and you should run OPTIMIZE on it regularly. It should also have CDF enabled.

- Next, from this table fetch only the new rows using change datafeed. Something like SELECT * FROM table_changes('bronze_table', last_commit_version)

- Run the MERGE into the silver table, which in your case would be the full updated table, that should also be with liquid clustering, and also with an OPTIMIZE statement scheduled regularly.

This way both your tables are optimized, you keep all of the history in your bronze table but it remains efficient because you're loading only what's necessary with CDF.

It will remain efficient as long as you don't forget to OPTIMIZE. Also note that with liquid clustering, OPTIMIZE is incremental, so it only optimized what's new, so don't hesitate to run that statement often !

View solution in original post

erigaud · ‎04-10-2024

Hello,
I'll complement @Retired_mod's answer, which is already very detailed.

First of all if you're considering Liquid Clustering, you need to know on which columns you'll cluster your table. These need to be the columns on which you're gonna merge.

Then, before running the MERGE you need to fetch only the rows that have changed or that are new : this is easily achieved with Change Data Feed.

Enable it for your table using :
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
Once you've got that, you can access the last modified rows with
SELECT * FROM table_changes('my_table', last_commit_version)
and use that dataset to run your merge, as explained in Kaniz's answer.

Let me know if that helped or if you need more details !

Raja_Databricks · ‎04-10-2024

Hi @Retired_mod , @erigaud ,

Thank you for your assistance.

I've enabled Liquid Clustering on my Delta table, using the same columns as my merge keys. Daily, I load new data into a temporary table that's an exact structure of my main Delta table (also Liquid Clustered). Then, I merge the temporary data into the main table.

I have three questions:

Before merging the temporary table into the main table, do I need to optimize the temporary table (Databricks suggests this in a hint)?
Would Change Data Capture (CDC) be beneficial in this scenario?
My Main Delta table will be highly growing one, will merge on primary key will be efficient over time?

Thanks in advance for your insights!

erigaud · ‎04-11-2024

Here's I how would do it, leveraging the medallion architecture :

- Instead of a temporary table, load everything in a bronze table in append mode. This table should be liquid-clustered, and you should run OPTIMIZE on it regularly. It should also have CDF enabled.

- Next, from this table fetch only the new rows using change datafeed. Something like SELECT * FROM table_changes('bronze_table', last_commit_version)

- Run the MERGE into the silver table, which in your case would be the full updated table, that should also be with liquid clustering, and also with an OPTIMIZE statement scheduled regularly.

This way both your tables are optimized, you keep all of the history in your bronze table but it remains efficient because you're loading only what's necessary with CDF.

It will remain efficient as long as you don't forget to OPTIMIZE. Also note that with liquid clustering, OPTIMIZE is incremental, so it only optimized what's new, so don't hesitate to run that statement often !

Maxence · ‎09-27-2024

Regarding the bronze table, since we're using CDF to incrementally get the data, why is it needed/better to use liquid clustering in this table?

youssefmrini · ‎04-12-2024

Liquid will be a good option. Just make sure to run the optimize whenever you upsert data. Don't worry the optimize won't be expensive as it will run on only the latest data in order to cluster them and have fast queries

RV-Gokul · ‎06-20-2024

@youssefmrini @erigaud

I have a similar issue, and I've pretty much tried the solution mentioned above. However, I'm not noticing any changes when I use a temporary table or persist the table.

My main table contains 3.1 terabytes of data with 42 billion records, and the incoming staging table has 7 gigabytes of data with 90 million records. The staging table is also similarly partitioned as the main table, and I can see that almost 90% of the main table has been scanned.

For Your Information. I am using 15.2 DBR.

Regards,

Gokul