cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

From Partitioning to Liquid Clustering

Volker
Contributor

We had some delta tables that where previously partitioned on year, month, day, and hour. This resulted in quite small partitions and we now switched to liquid clustering.

We followed these steps:

  • Remove partitioning by doing REPLACE
  • ALTER TABLE --- CLUSTER BY 
  • Run OPTIMIZE --- FULL

I see in the query output that some files have been written and some have been removed but in the underlying s3 bucket I still see the parquet files in the old hive-style partition layout.

Are these old files that will be removed by some VACUUM job or what does OPTIMIZE do if the data in the root bucket is stored in the hive-style partition layout even though we removed the partitioning from the delta table.

Also we are now using the processing_dttm as cluster key instead of year, month, day, hour. The processing_dttm column contains th dttm like so: 2024-11-19T09:30:00.765+00:00. 
Would it be better to only include year, month and day or maybe hour instead of minutes or seconds? Or is liquid clustering smart enough the infere this from the dttm?

1 REPLY 1

Isi
Contributor

Hey @Volker ,

First of all, Iโ€™d recommend considering Auto Liquid Clustering, as it can simplify the process of defining clustering keys.

You can read more about it in the Databricks documentation (itโ€™s currently in Public Preview, but you can probably start using it already)

Since the official docs are still limited, hereโ€™s a quick summary of the criteria used by the backend to trigger Liquid Clustering:

โ€ข The table size must be at least 256 MB.

โ€ข There must be at least 10 pruning-eligible scans with pruning predicates.

โ€ข The clustering key must not have been changed in the last 2 weeks.

โ€ข It usually takes 2 to 5 hours for the table to reflect the Liquid Clustering key after the conditions are met.

Answering your questions:

โ€ข Yes, deleted files should be removed with VACUUM after 7 days โ€” this is the default behavior.

โ€ข Yes, Liquid Clustering can handle full timestamps like processing_dttm.

However, using a timestamp with minutes and seconds can lead to too many small clusters if the values are highly distinct and that level of granularity isnโ€™t relevant for filtering. In such cases, this may reduce clustering efficiency rather than improve it.

Maybe if your queries donโ€™t require high precision, I recommend using truncated versions of your timestamp when filtering

Hopee this helps ๐Ÿ™‚

Isi

 

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local communityโ€”sign up today to get started!

Sign Up Now