07-08-2025 04:21 PM
Hi Everyone,
Currently we are facing an issue with the OPTIMIZE table_name FULL operation. The dataset consists of 150 billion rows, and it takes 8 hours to optimize the reloaded clustered table. The table is refreshed every month, and it needs clustering because we need fast query performance.
Q1) Can we insert data in parallel into a clustered table? If so, how can we reduce the write times?
Q2) Is it a good idea to use liquid clustering on a table that refreshes 150 billion rows every month? Please suggest any alternatives.
Thank you in advance for your valuable time.
07-08-2025 11:01 PM - edited 07-08-2025 11:02 PM
Hi there,
Thanks for sharing your issue. Working with 150 billion rows monthly is definitely serious scale, and optimizing performance matters a lot. Let me try to address both your questions clearly:
Q1) Can we insert data in parallel into a clustered table? How to reduce write times?
Yes, you can insert in parallel into a clustered table, but clustering (especially static clustering) can slow things down during heavy writes.
To reduce write time:
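For example, here is a sketch of two settings that typically help with write times on Delta tables (the table and staging names below are placeholders, not from the original thread):

```sql
-- Let Databricks coalesce small files at write time instead of
-- relying on a large post-hoc OPTIMIZE
ALTER TABLE sales_fact SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact'   = 'true'
);

-- Delta's optimistic concurrency control allows concurrent appends,
-- so multiple jobs can INSERT into the same table in parallel:
INSERT INTO sales_fact SELECT * FROM staging_batch_01;
INSERT INTO sales_fact SELECT * FROM staging_batch_02;  -- run from a second job
```

Blind appends to different batches generally do not conflict, so splitting the monthly load across several parallel jobs is a reasonable way to cut wall-clock write time.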
Q2) Is Liquid Clustering good for a table refreshed with 150B rows monthly? Any alternatives?
Liquid clustering is a great fit for this scenario.
Why?
So yes — switch to Liquid Clustering for this use case.
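A minimal sketch of what the switch looks like in Delta SQL (column names taken from the later posts in this thread; the table name is a placeholder):

```sql
-- New table defined with liquid clustering (no PARTITIONED BY needed)
CREATE TABLE sales_fact (
  x BIGINT,
  y STRING,
  z DATE,
  amount DOUBLE
)
CLUSTER BY (x, y, z);

-- Or convert an existing unpartitioned Delta table in place
ALTER TABLE sales_fact
CLUSTER BY (x, y, z);
```

Note that liquid clustering replaces partitioning and ZORDER; a table cannot use both at once.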
Alternative Suggestions:
Hope this helps!
07-09-2025 06:46 AM
Hi Intuz,
Thanks for your time on this question.
As mentioned: you don't need to run OPTIMIZE manually anymore, since liquid clustering handles clustering in the background automatically.
- Currently I am partitioning the data to allow the process to write in parallel, and running an ALTER TABLE script to add the cluster columns, like: ALTER TABLE table_name CLUSTER BY (x, y, z)
Don't we still need to run OPTIMIZE FULL on the table, given that my entire dataset is essentially truncated and reloaded? Will OPTIMIZE run automatically?
07-09-2025 11:47 PM
Hi again!
Great question — when using Liquid Clustering, you do not need to run OPTIMIZE manually, even if you're truncating and reloading the entire dataset.
Once you’ve defined CLUSTER BY (x, y, z), Liquid Clustering automatically organizes the data in the background over time. It’s designed exactly for use cases like yours — large, frequent rewrites.
That said, right after a big reload, queries might still hit unclustered data initially. If that's a problem, you can trigger a single manual OPTIMIZE on the table after the reload (with liquid clustering this is incremental, not a FULL rewrite), but in most cases the background clustering will catch up soon after.
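The monthly refresh flow described above can be sketched like this (a hedged example; table and staging names are placeholders):

```sql
-- Monthly truncate-and-reload of the liquid-clustered table
TRUNCATE TABLE sales_fact;

INSERT INTO sales_fact
SELECT * FROM monthly_staging;

-- Optional one-off incremental OPTIMIZE: clusters only the newly
-- written, not-yet-clustered files by the table's CLUSTER BY keys.
-- This is much cheaper than OPTIMIZE ... FULL, which rewrites everything.
OPTIMIZE sales_fact;
```

Reserve OPTIMIZE table_name FULL for the rare case where you change the clustering columns and need existing data re-clustered.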
Let me know if you're seeing delays even after switching to Liquid.
07-12-2025 07:05 AM
Hi Intuz,
Could you please let me know if there is any setting that enables an async or sync process to run OPTIMIZE in the background, because we are seeing some delays. The table I am refreshing is the backend table for a dashboard and UI, so zero downtime is allowed on this table. At a minimum, it needs to trigger an async or sync background process to organize the data.
07-12-2025 11:52 AM
Hi @vsam, have you tried automatic liquid clustering with predictive optimization enabled? With it, you don't need to specify the CLUSTER BY columns explicitly, and the optimization is handled in the background by predictive optimization.
https://docs.databricks.com/aws/en/delta/clustering#automatic-liquid-clustering Please let me know if you have already handled this another way, and whether you ran into any issues.
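A minimal sketch of what enabling this looks like (assuming a Unity Catalog managed Delta table; the table and schema names are placeholders):

```sql
-- Let Databricks choose and maintain clustering keys automatically
-- based on the observed query workload
ALTER TABLE sales_fact CLUSTER BY AUTO;

-- Predictive optimization then runs OPTIMIZE in the background.
-- It is enabled at the catalog or schema level, for example:
ALTER SCHEMA analytics ENABLE PREDICTIVE OPTIMIZATION;
```

With this in place, the background maintenance you are asking about happens automatically, so no manual OPTIMIZE scheduling is needed for the dashboard table.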