Databricks Community

ShivangiB · 4 weeks ago

What factors should we consider when choosing an optimization approach?

Nik_Vanderhoof · 3 weeks ago

In several ways, liquid clustering is more flexible than either Hive-style partitioning or Z-ordering.

Liquid clustering lets us change clustering keys without rewriting the entire table. Since stakeholders often query only recent data, this can be very powerful: we can change the clustering keys, and future OPTIMIZE commands will rewrite only recent data instead of the whole table.

With OPTIMIZE ZORDER, you have to specify the Z-order keys every time you run an OPTIMIZE command, which is error-prone. Liquid clustering keys are stored in the table properties, so you don't need to remember them when running OPTIMIZE.
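To make the difference concrete, here is a small sketch that only builds the SQL statements involved. The table name `events` and the column names are hypothetical, and on Databricks you would typically pass each string to `spark.sql(...)`; treat it as an illustration rather than a recommended workflow.

```python
# Hypothetical table and column names, used only for illustration.
TABLE = "events"


def zorder_optimize(table: str, keys: list[str]) -> str:
    """Z-ordering: the keys must be repeated on every OPTIMIZE run."""
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(keys)})"


def set_cluster_keys(table: str, keys: list[str]) -> str:
    """Liquid clustering: the keys are declared once on the table."""
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(keys)})"


def liquid_optimize(table: str) -> str:
    """Liquid clustering: OPTIMIZE needs no keys, they come from the table."""
    return f"OPTIMIZE {table}"


statements = [
    zorder_optimize(TABLE, ["user_id", "event_type"]),   # keys repeated each run
    set_cluster_keys(TABLE, ["user_id", "event_type"]),  # keys set once
    liquid_optimize(TABLE),                              # no keys to remember
    set_cluster_keys(TABLE, ["event_type", "country"]),  # keys changed later
    liquid_optimize(TABLE),                              # same command as before
]

for stmt in statements:
    print(stmt)
```

Note that the liquid clustering OPTIMIZE statement is identical before and after the key change, which is what removes the chance of passing the wrong keys.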

Partitioning can help with data skipping on a single column, or on multiple related columns (like year/month/day), but not on unrelated columns.
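A toy sketch of why that is: with Hive-style partitioning, the partition values live in the directory path, so only filters on partition columns can eliminate directories. The paths and the `user_id` column below are made up for the example, and this is a simplification of what a query engine actually does.

```python
# Hypothetical Hive-style partition directories for a table
# partitioned by year/month/day.
partitions = [
    "year=2024/month=01/day=15",
    "year=2024/month=02/day=03",
    "year=2024/month=02/day=04",
    "year=2025/month=01/day=01",
]


def parse(path: str) -> dict:
    """Turn 'year=2024/month=01/day=15' into {'year': '2024', ...}."""
    return dict(part.split("=") for part in path.split("/"))


def prune(paths: list[str], **filters: str) -> list[str]:
    """Keep the directories that match every filter on a partition column.

    A filter on a non-partition column cannot be checked from the path,
    so it is ignored here and prunes nothing.
    """
    kept = []
    for path in paths:
        values = parse(path)
        if all(values[col] == val for col, val in filters.items() if col in values):
            kept.append(path)
    return kept


# Related partition columns: most directories are skipped.
print(prune(partitions, year="2024", month="02"))

# Unrelated column: every directory still has to be read.
print(len(prune(partitions, user_id="42")))
```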

Both ZORDER and Liquid Clustering are techniques to improve data skipping for multiple independent columns. To do this, both map multi-dimensional data onto a single dimension and then group rows with similar values into the same files. However, Liquid Clustering's mapping is better at keeping similar data together. You can learn more about it here: https://www.youtube.com/watch?v=5t6wX28JC_M
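As a rough illustration of the "map many dimensions to one" idea, here is a minimal sketch of a Z-order (Morton) value built by interleaving the bits of two columns. It shows the general technique only: it is not how Delta implements Z-ordering internally, it does not show Liquid Clustering's own mapping, and the 4x4 grid and four-row "files" are invented for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Z-order value: bits of x go to even positions, bits of y to odd ones."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def split_into_files(rows, rows_per_file=4):
    """Pretend each consecutive group of rows is written to one data file."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_read(files, column: int, value: int):
    """Indexes of files whose min/max range on the column contains the value."""
    hits = []
    for idx, rows in enumerate(files):
        values = [row[column] for row in rows]
        if min(values) <= value <= max(values):
            hits.append(idx)
    return hits


points = [(x, y) for x in range(4) for y in range(4)]

# Sorting by x alone: great for filters on x, useless for filters on y.
by_x = split_into_files(sorted(points))
# Sorting by the interleaved value: both columns get some skipping.
by_z = split_into_files(sorted(points, key=lambda p: interleave_bits(*p)))

print(files_to_read(by_x, 0, 3), files_to_read(by_x, 1, 0))
print(files_to_read(by_z, 0, 3), files_to_read(by_z, 1, 0))
```

With the plain sort on x, a filter on x touches one file but a filter on y touches all four. With the interleaved ordering, a filter on either column touches two of the four files.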
