Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

OPTIMIZE FULL Taking a Long Time on Clustered Table

vsam
New Contributor II

Hi Everyone,

 

Currently we are facing an issue with the OPTIMIZE table_name FULL operation. The dataset consists of 150 billion rows, and it takes 8 hours to optimize the reloaded clustered table. The table is refreshed every month, and it needs clustering because we need faster query performance.

Q1) Can we insert data in parallel into a clustered table? If so, how can we reduce the write times?

Q2) Is it a good idea to use liquid clustering on a table that reloads 150 billion rows every month? Please suggest any alternatives.

 

Thank you in advance for your valuable time.

5 REPLIES

intuz
Contributor II

Hi there,

Thanks for sharing your issue. Working with 150 billion rows monthly is serious scale, and optimizing performance matters a lot. Let me try to address both your questions clearly:

Q1) Can we insert data in parallel into a clustered table? How to reduce write times?

Yes, you can insert in parallel into a clustered table, but clustering (especially static clustering) can slow things down during heavy writes.

To reduce write time:

  • Write with partitioning: Break up your data using a good partition column (such as month or region) that fits your query pattern. Prefer low-to-moderate cardinality here; a very high-cardinality partition column produces too many small partitions.
  • Write in parallel: Use DataFrame.repartition() or coalesce() judiciously; too few partitions underutilize the cluster, too many create small files and shuffle overhead. See the sketch after this list.
  • Avoid small files: Run OPTIMIZE after the insert to compact data, but that leads to your second question.
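To make the parallel-write point concrete, here is a minimal PySpark sketch. The table name (sales_clustered), source path, and partition count are hypothetical placeholders, not from this thread; tune the partition count to your cluster size.

  from pyspark.sql import SparkSession

  # Minimal sketch: parallel append into a Delta table. The table name,
  # source path, and partition count are hypothetical placeholders.
  spark = SparkSession.builder.getOrCreate()

  df = spark.read.parquet("/mnt/raw/monthly_load")  # hypothetical source

  # Repartition before writing: each partition is written by a separate
  # task, so this drives write parallelism. Too few partitions
  # underutilize the cluster; too many produce small files.
  (df.repartition(400)  # tune to roughly 2-3x total cluster cores
     .write
     .format("delta")
     .mode("append")
     .saveAsTable("sales_clustered"))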

Q2) Is Liquid Clustering good for a table refreshed with 150B rows monthly? Any alternatives?

Liquid clustering is a great fit for this scenario.

Why?

  • You don’t need to OPTIMIZE manually anymore — Liquid clustering handles clustering in the background automatically.
  • Works better with frequent, large-scale writes like your monthly refresh.
  • Keeps query performance fast without the huge OPTIMIZE table FULL cost.

So yes — switch to Liquid Clustering for this use case.
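As a concrete illustration, here is a minimal sketch of declaring liquid clustering when the table is created; the table and column names are placeholders, not from this thread.

  # Sketch: define liquid clustering at table creation time.
  # Table and column names are hypothetical.
  spark.sql("""
      CREATE TABLE IF NOT EXISTS sales_clustered (
          id          BIGINT,
          event_month DATE,
          region      STRING,
          amount      DOUBLE
      )
      CLUSTER BY (event_month, region)
  """)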

Alternative suggestions, if Liquid Clustering is not an option (e.g., due to workspace limits):

  • Use Z-Ordering only on the necessary columns (instead of a full-table OPTIMIZE).
  • Merge only the delta (updated/new rows) instead of reloading the full 150B rows every month.
  • Use the Auto Compaction and Optimized Writes features (Databricks Runtime) to handle file sizes and layout in the background.

Sketches of these options follow below.
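Hedged sketches of these alternatives, reusing the same hypothetical table name as above:

  # Z-Order only the columns your queries actually filter on
  # (applies to tables that are not liquid clustered):
  spark.sql("OPTIMIZE sales_clustered ZORDER BY (event_month, region)")

  # Enable optimized writes and auto compaction via Delta table
  # properties, so file sizing is handled in the background:
  spark.sql("""
      ALTER TABLE sales_clustered SET TBLPROPERTIES (
          'delta.autoOptimize.optimizeWrite' = 'true',
          'delta.autoOptimize.autoCompact'   = 'true'
      )
  """)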

Hope this helps!

vsam
New Contributor II

Hi Intuz, 

 

Thanks for your time on this question.

As you mentioned: "You don’t need to OPTIMIZE manually anymore — Liquid clustering handles clustering in the background automatically."

 - Currently I am partitioning the data to let the process write in parallel, and running an ALTER TABLE script to add clustering columns, like: ALTER TABLE table_name CLUSTER BY (x, y, z)

Don't we need to run OPTIMIZE FULL on the table, since my entire dataset is essentially truncated and reloaded? Will OPTIMIZE run automatically?

 

intuz
Contributor II

Hi again!

Great question — when using Liquid Clustering, you do not need to run OPTIMIZE manually, even if you're truncating and reloading the entire dataset.

Once you’ve defined CLUSTER BY (x, y, z), Liquid Clustering automatically organizes the data in the background over time. It’s designed exactly for use cases like yours — large, frequent rewrites.

That said, right after a big reload, queries might still hit unclustered data initially. If that's a problem, you can trigger a manual incremental OPTIMIZE on the table (liquid clustering only rewrites data that is not yet clustered, so this is far cheaper than OPTIMIZE FULL), but in most cases the background process will handle it soon after.
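For reference, a minimal sketch of that incremental run, with a placeholder table name:

  # Incremental optimize: clusters only data written since the last run.
  spark.sql("OPTIMIZE sales_clustered")

  # By contrast, the expensive full rewrite you are trying to avoid:
  # spark.sql("OPTIMIZE sales_clustered FULL")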

Let me know if you're seeing delays even after switching to Liquid.

vsam
New Contributor II

Hi Intuz, 

Could you please let me know if there is any setting that enables an async/sync process to run OPTIMIZE in the back end, because we are seeing some delays. The table I am refreshing is the backend table for a dashboard and UI, so there is zero downtime allowed on it. So at a minimum it needs to trigger an async or sync process in the background to organize the data.

 

sridharplv
Valued Contributor II

Hi @vsam, have you tried automatic liquid clustering with Predictive Optimization enabled? With that you don't need to specify the CLUSTER BY columns yourself, and the optimization is handled in the back end by predictive optimization.

See https://docs.databricks.com/aws/en/delta/clustering#automatic-liquid-clustering. Please let me know if you have already handled this another way, or if you ran into any issues.
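A minimal sketch of what the linked docs describe, with placeholder table and schema names; CLUSTER BY AUTO lets Databricks pick the clustering columns, and predictive optimization (enabled at the account, catalog, or schema level) runs OPTIMIZE in the background:

  # Let Databricks choose and maintain the clustering columns:
  spark.sql("ALTER TABLE sales_clustered CLUSTER BY AUTO")

  # Enable background optimization for everything in a (placeholder) schema:
  spark.sql("ALTER SCHEMA main.sales ENABLE PREDICTIVE OPTIMIZATION")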
