Liquid Clustering With Merge

Raja_Databricks · 04-08-2024

Hi there,

I'm working with a large Delta table (2TB) and I'm looking for the best way to efficiently update it with new data (10GB). I'm particularly interested in using Liquid Clustering for faster queries, but I'm unsure if it supports updates efficiently.

Currently, I can add new data using INSERT, CTAS, COPY INTO, and Spark's append mode. However, none of these handle updates: a changed row is appended as a completely new entry instead of replacing the existing one.

Here's my challenge:

- I want to update existing data (even data from previous months) efficiently.
- My current code only checks for recent data (current and previous day) to avoid scanning the entire table. This means updates to older data get treated as new entries, causing slow processing.

Is there a way to perform efficient "merge" or "upsert" operations with Liquid Clustering? This would ideally combine inserting new data and updating existing data in a single step.
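
For concreteness, an upsert of this kind is usually written as a Delta Lake MERGE INTO statement. Below is a minimal sketch that only builds the statement text. The table names `events` and `events_updates`, the key column `event_id`, and the helper `build_merge_sql` are hypothetical placeholders, not my real schema. Whether this avoids a full scan on a liquid-clustered table is exactly the open question.

```python
def build_merge_sql(target: str, source: str, keys: list[str]) -> str:
    """Build a Delta Lake MERGE INTO statement that upserts `source` into `target`.

    All names passed in are illustrative; rows are matched on the given key columns.
    """
    if not keys:
        raise ValueError("at least one key column is required")
    # Match rows on every key column; matched rows are updated, the rest inserted.
    on_clause = " AND ".join(f"t.{k} = s.{k}" for k in keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical table and key names, for illustration only.
sql = build_merge_sql("events", "events_updates", ["event_id"])
print(sql)
# In a Databricks notebook the statement could then be run with spark.sql(sql).
```

Here `events_updates` would hold the new 10GB batch, so inserts and updates happen in one statement rather than through a separate append.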

Thank you for your help!

@SparkJun, @youssefmrini, please help me here.