intuz
Contributor II

Hi there,

Thanks for sharing your issue — working with 150 billion rows monthly is definitely a serious scale, and optimizing performance matters a lot. Let me try to address both your questions clearly:

Q1) Can we insert data in parallel into a clustered table? How to reduce write times?

Yes, you can insert in parallel into a clustered table, but clustering (especially static clustering) can slow things down during heavy writes.

To reduce write time:

  • Write with partitioning: Break up your data using a good partition column (like month, region, or any field with high cardinality that fits your query pattern).
  • Write in parallel: Use DataFrame.repartition() or coalesce() smartly — too few = underutilized cluster; too many = small files and shuffle overhead.
  • Avoid small files: Use OPTIMIZE after insert to compact data — but that leads to your second question.

Q2) Is Liquid Clustering good for a table refreshed with 150B rows monthly? Any alternatives?

Liquid clustering is a great fit for this scenario.

Why?

  • You don’t need to OPTIMIZE manually anymore — Liquid clustering handles clustering in the background automatically.
  • Works better with frequent, large-scale writes like your monthly refresh.
  • Keeps query performance fast without the huge OPTIMIZE table FULL cost.

So yes — switch to Liquid Clustering for this use case.

Alternative Suggestions:

  • If Liquid Clustering is not an option (e.g., due to workspace limits), you could:
  • Use Z-Ordering only on necessary columns (instead of full table OPTIMIZE).
  • Consider merging only delta (updated/new rows) instead of reloading full 150B every month.
  • Use Auto Compaction and Auto Optimize features (Databricks Runtime) to handle file sizes and layout in the background.

Hope this helps!