Re: Optimize taking FULL Taking Longer time on Cl...

intuz · ‎07-08-2025

Hi there,

Thanks for sharing your issue — working with 150 billion rows monthly is definitely a serious scale, and optimizing performance matters a lot. Let me try to address both your questions clearly:

Q1) Can we insert data in parallel into a clustered table? How to reduce write times?

Yes, you can insert in parallel into a clustered table, but clustering (especially static clustering) can slow things down during heavy writes.

To reduce write time:

Write with partitioning: Break up your data using a good partition column (like month, region, or any field with high cardinality that fits your query pattern).
Write in parallel: Use DataFrame.repartition() or coalesce() smartly — too few = underutilized cluster; too many = small files and shuffle overhead.
Avoid small files: Use OPTIMIZE after insert to compact data — but that leads to your second question.

Q2) Is Liquid Clustering good for a table refreshed with 150B rows monthly? Any alternatives?

Liquid clustering is a great fit for this scenario.

Why?

You don’t need to OPTIMIZE manually anymore — Liquid clustering handles clustering in the background automatically.
Works better with frequent, large-scale writes like your monthly refresh.
Keeps query performance fast without the huge OPTIMIZE table FULL cost.

So yes — switch to Liquid Clustering for this use case.

Alternative Suggestions:

If Liquid Clustering is not an option (e.g., due to workspace limits), you could:
Use Z-Ordering only on necessary columns (instead of full table OPTIMIZE).
Consider merging only delta (updated/new rows) instead of reloading full 150B every month.
Use Auto Compaction and Auto Optimize features (Databricks Runtime) to handle file sizes and layout in the background.

Hope this helps!