Databricks Community

DatBoi · ‎02-29-2024

My question is pretty straightforward - how big should a Delta table be to benefit from liquid clustering? I know the answer will most likely depend on the details of how you are querying the data, but what is the recommendation?

I know Databricks recommends not partitioning tables smaller than 1 TB and aiming for 1 GB partitions. Does this hold true for liquid clustering?

And vice versa - will clustering on a small table < 1TB or even < 1GB hinder the performance of queries?

I have been looking for some documentation / resources to dive into these details but can't seem to find any. Everything I have found online is just covering the basics. Is there something like this out there?

Thanks in advance for the help.

daniel_sahal · ‎03-04-2024

@DatBoi
Once you watch this video you'll understand more about Liquid Clustering 🙂

https://www.youtube.com/watch?v=5t6wX28JC_M&ab_channel=DeltaLake

Long story short:

I know Databricks recommends not partitioning tables smaller than 1 TB and aiming for 1 GB partitions. Does this hold true for liquid clustering?

Clustering is a little bit different from partitioning. The main issue with partitioning tables smaller than 1 TB is that it can create a lot of small files, which can negatively impact performance. With liquid clustering there's no such issue.
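
To put rough numbers on the small-files problem with partitioning, here is a minimal back-of-the-envelope sketch in Python. It is only arithmetic, not a Databricks API: the function name, the 50 GB table size, and the daily date partition column are all hypothetical, and it assumes the data is spread evenly across partitions.

```python
def avg_partition_size_mb(table_size_gb: float, num_partitions: int) -> float:
    """Average amount of data per partition in MB.

    Assumes (for illustration only) that rows are spread evenly
    across partitions, which real tables rarely are.
    """
    return table_size_gb * 1024 / num_partitions


# Hypothetical table: 50 GB, partitioned by a date column
# that holds three years of daily values (3 * 365 partitions).
small_table_mb = avg_partition_size_mb(50, 3 * 365)

# For comparison: a 1 TB table split into 1024 partitions
# hits the ~1 GB per partition target mentioned in the question.
large_table_mb = avg_partition_size_mb(1024, 1024)

print(f"50 GB table: {small_table_mb:.1f} MB per partition")
print(f"1 TB table:  {large_table_mb:.1f} MB per partition")
```

Under these assumptions the small table ends up with well under 50 MB per partition, far below the 1 GB target, and every partition holds at least one file, so you get many small files.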

And vice versa - will clustering on a small table < 1TB or even < 1GB hinder the performance of queries?

In my experience - not really.
