Databricks Community

Volker · 4 weeks ago

We had some delta tables that where previously partitioned on year, month, day, and hour. This resulted in quite small partitions and we now switched to liquid clustering.

We followed these steps:

Remove partitioning by doing REPLACE
ALTER TABLE --- CLUSTER BY
Run OPTIMIZE --- FULL

I see in the query output that some files have been written and some have been removed but in the underlying s3 bucket I still see the parquet files in the old hive-style partition layout.

Are these old files that will be removed by some VACUUM job or what does OPTIMIZE do if the data in the root bucket is stored in the hive-style partition layout even though we removed the partitioning from the delta table.

Also we are now using the processing_dttm as cluster key instead of year, month, day, hour. The processing_dttm column contains th dttm like so: 2024-11-19T09:30:00.765+00:00.
Would it be better to only include year, month and day or maybe hour instead of minutes or seconds? Or is liquid clustering smart enough the infere this from the dttm?

Isi · 2 weeks ago

Hey @Volker ,

First of all, I’d recommend considering Auto Liquid Clustering, as it can simplify the process of defining clustering keys.

You can read more about it in the Databricks documentation (it’s currently in Public Preview, but you can probably start using it already)

Since the official docs are still limited, here’s a quick summary of the criteria used by the backend to trigger Liquid Clustering:

• The table size must be at least 256 MB.

• There must be at least 10 pruning-eligible scans with pruning predicates.

• The clustering key must not have been changed in the last 2 weeks.

• It usually takes 2 to 5 hours for the table to reflect the Liquid Clustering key after the conditions are met.

Answering your questions:

• Yes, deleted files should be removed with VACUUM after 7 days — this is the default behavior.

• Yes, Liquid Clustering can handle full timestamps like processing_dttm.

However, using a timestamp with minutes and seconds can lead to too many small clusters if the values are highly distinct and that level of granularity isn’t relevant for filtering. In such cases, this may reduce clustering efficiency rather than improve it.

Maybe if your queries don’t require high precision, I recommend using truncated versions of your timestamp when filtering

Hopee this helps 🙂

Isi