04-08-2024 05:27 AM
Hi there,
I'm working with a large Delta table (2TB) and I'm looking for the best way to efficiently update it with new data (10GB). I'm particularly interested in using Liquid Clustering for faster queries, but I'm unsure if it supports updates efficiently.
Currently, I can add new data using methods like INSERT, CTAS, COPY INTO, and Spark's append mode. However, these don't handle updates well, treating them as completely new entries.
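For context, this is roughly how I append today (the path and file format are just placeholders):
COPY INTO main_table
FROM '/mnt/landing/daily_batch/'
FILEFORMAT = PARQUET;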
Here's my challenge:
Is there a way to perform efficient "merge" or "upsert" operations with Liquid Clustering? This would ideally combine inserting new data and updating existing data in a single step.
Thank you for your help!
@JunYang, @youssefmrini, please help me here.
04-10-2024 07:51 AM - last edited on 04-15-2024 08:26 AM by Retired_mod
Hello,
I'll complement @Retired_mod's answer, which is already very detailed.
First of all, if you're considering Liquid Clustering, you need to decide which columns to cluster your table on. These need to be the columns you're going to merge on.
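For example, assuming a hypothetical table main_table and a hypothetical merge key id:
CREATE TABLE IF NOT EXISTS main_table (id BIGINT, payload STRING)
CLUSTER BY (id);
-- or, to change the clustering columns of a table that is already clustered:
ALTER TABLE main_table CLUSTER BY (id);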
Then, before running the MERGE, you need to fetch only the rows that have changed or that are new: this is easily achieved with Change Data Feed.
Enable it for your table using:
ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');
Once you've got that, you can access the last modified rows with
SELECT * FROM table_changes('my_table', last_commit_version)
and use that dataset to run your merge, as explained in Kaniz's answer.
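For example, here's a minimal sketch of that merge, assuming a hypothetical key column id, a hypothetical target table called target_table, and a starting version of 5 (adapt all of these to your case):
MERGE INTO target_table AS t
USING (
  SELECT * EXCEPT (_change_type, _commit_version, _commit_timestamp, rn)
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY _commit_version DESC) AS rn
    FROM table_changes('my_table', 5)
    WHERE _change_type IN ('insert', 'update_postimage')
  ) AS c
  WHERE rn = 1 -- keep only the latest change per key
) AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
The ROW_NUMBER deduplication matters: MERGE fails if several source rows match the same target row, which can happen when a key changed more than once since the last processed version.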
Let me know if that helped or if you need more details !
04-10-2024 11:37 PM
Hi @Retired_mod , @erigaud ,
Thank you for your assistance.
I've enabled Liquid Clustering on my Delta table, using the same columns as my merge keys. Daily, I load new data into a temporary table with the exact same structure as my main Delta table (also liquid-clustered). Then, I merge the temporary data into the main table.
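For reference, a simplified sketch of my daily job (hypothetical names, with id as the merge/clustering key):
CREATE OR REPLACE TABLE staging_table
CLUSTER BY (id)
AS SELECT * FROM main_table LIMIT 0; -- same schema, also liquid-clustered

-- load the day's data into staging_table, then:
MERGE INTO main_table AS t
USING staging_table AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;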
I have three questions:
Thanks in advance for your insights!
04-11-2024 01:45 AM
Here's how I would do it, leveraging the medallion architecture (a sketch follows at the end of this post):
- Instead of a temporary table, load everything into a bronze table in append mode. This table should be liquid-clustered, and you should run OPTIMIZE on it regularly. It should also have CDF enabled.
- Next, fetch only the new rows from this table using Change Data Feed. Something like SELECT * FROM table_changes('bronze_table', last_commit_version)
- Run the MERGE into the silver table, which in your case would be the full updated table. It should also use liquid clustering, with OPTIMIZE scheduled regularly.
This way both your tables are optimized, and you keep all of the history in your bronze table while staying efficient, because you're loading only what's necessary with CDF.
It will remain efficient as long as you don't forget to OPTIMIZE. Also note that with liquid clustering, OPTIMIZE is incremental, so it only optimizes what's new, so don't hesitate to run that statement often!
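Here's a rough sketch of the bronze side (the schema and the key column id are placeholders):
CREATE TABLE IF NOT EXISTS bronze_table (
  id BIGINT,
  payload STRING,
  updated_at TIMESTAMP
)
CLUSTER BY (id)
TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Append-only loads land in bronze_table; the CDF read + MERGE above feed silver_table.
-- Then keep both tables clustered (cheap, since OPTIMIZE is incremental here):
OPTIMIZE bronze_table;
OPTIMIZE silver_table;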
09-27-2024 04:25 AM - edited 09-27-2024 04:48 AM
Regarding the bronze table, since we're using CDF to incrementally get the data, why is it needed/better to use liquid clustering in this table?
04-12-2024 07:43 AM
Liquid Clustering will be a good option. Just make sure to run OPTIMIZE whenever you upsert data. Don't worry, OPTIMIZE won't be expensive, as it will run only on the latest data in order to cluster it and keep queries fast.
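For instance, end the upsert job with (assuming main_table is your liquid-clustered target):
OPTIMIZE main_table; -- incremental: only the newly written files get clustered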
06-20-2024 08:54 AM
I have a similar issue, and I've pretty much tried the solution mentioned above. However, I'm not noticing any difference whether I use a temporary table or persist the table.
My main table contains 3.1 terabytes of data with 42 billion records, and the incoming staging table has 7 gigabytes of data with 90 million records. The staging table is partitioned the same way as the main table, yet I can see that almost 90% of the main table is still being scanned during the MERGE.
For your information, I am using DBR 15.2.
Regards,
Gokul