- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
07-08-2025 11:01 PM - edited 07-08-2025 11:02 PM
Hi there,
Thanks for sharing your issue — working with 150 billion rows monthly is definitely a serious scale, and optimizing performance matters a lot. Let me try to address both your questions clearly:
Q1) Can we insert data in parallel into a clustered table? How to reduce write times?
Yes, you can insert in parallel into a clustered table, but clustering (especially static clustering) can slow things down during heavy writes.
To reduce write time:
- Write with partitioning: Break up your data using a good partition column (like month, region, or any field with high cardinality that fits your query pattern).
- Write in parallel: Use DataFrame.repartition() or coalesce() smartly — too few = underutilized cluster; too many = small files and shuffle overhead.
- Avoid small files: Use OPTIMIZE after insert to compact data — but that leads to your second question.
Q2) Is Liquid Clustering good for a table refreshed with 150B rows monthly? Any alternatives?
Liquid clustering is a great fit for this scenario.
Why?
- You don’t need to OPTIMIZE manually anymore — Liquid clustering handles clustering in the background automatically.
- Works better with frequent, large-scale writes like your monthly refresh.
- Keeps query performance fast without the huge OPTIMIZE table FULL cost.
So yes — switch to Liquid Clustering for this use case.
Alternative Suggestions:
- If Liquid Clustering is not an option (e.g., due to workspace limits), you could:
- Use Z-Ordering only on necessary columns (instead of full table OPTIMIZE).
- Consider merging only delta (updated/new rows) instead of reloading full 150B every month.
- Use Auto Compaction and Auto Optimize features (Databricks Runtime) to handle file sizes and layout in the background.
Hope this helps!