topic Re: How to identify which columns we need to consider for liquid clustering from a table of 200+ col in Data Engineering

How to identify which columns we need to consider for liquid clustering from a table of 200+ columns

TejeshS — Fri, 03 Jan 2025 14:09:11 GMT

In Databricks, when working with a table that has a large number of columns (e.g., 200), it can be challenging to determine which columns are most important for liquid clustering.

Objective: The goal is to determine which columns to select based on their ability to meaningfully contribute to the clustering process, thereby improving query performance and insights.

Re: How to identify which columns we need to consider for liquid clustering from a table of 200+ col

Alberto_Umana — Fri, 03 Jan 2025 14:14:12 GMT

Hi @TejeshS,

Thanks for your post!

To decide which columns matter most for liquid clustering in a wide table, focus on the columns that appear most often in query filters and that contribute most to data skipping. Here are some guidelines:

High Cardinality Columns: Choose columns with high cardinality (i.e., columns with a large number of unique values) as clustering keys. These columns are more likely to benefit from clustering because they can help in efficiently skipping irrelevant data during queries.
Commonly Used Query Filters: Identify the columns that are most frequently used in query filters. These columns should be prioritized as clustering keys to improve query performance.
Avoid Correlated Columns: Try not to add correlated columns to clustering keys. For example, if you have both local time and UTC time columns, you only need to add one of them as a clustering key.
Fine-Grained Columns: Use the most fine-grained column you filter on as the clustering key. For example, if you have columns like event_timestamp, year, month, and date, use event_timestamp as the clustering key. Liquid clustering will automatically manage the data distribution based on the data volume.
Limit the Number of Clustering Columns: Liquid clustering supports a maximum of 4 columns. Therefore, you should carefully select up to 4 columns that provide the most benefit for clustering.
Data Skew and Distribution: Consider columns that help manage data skew and distribution. Tables with significant skew in data distribution can benefit from clustering on columns that help balance the data distribution.
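
As a rough illustration of the guidelines above, here is a minimal Python sketch that counts how often each column shows up in WHERE clauses, collapses correlated columns onto the most fine-grained one, and caps the result at 4 keys. Everything in it is an assumption made for the example: the table name events, the sample queries, the distinct counts, the scoring rule, and the helper names (count_filter_columns, pick_clustering_columns, cluster_by_statement). The regex is a crude token match, not a SQL parser, so treat the output as a starting point for review rather than a recommendation.

```python
import re
from collections import Counter

MAX_CLUSTER_COLS = 4  # limit stated in the guidelines above


def count_filter_columns(queries, known_columns):
    """Count, per column, how many queries mention it after WHERE (crude heuristic)."""
    known = {c.lower() for c in known_columns}
    counts = Counter()
    for sql in queries:
        match = re.search(r"\bwhere\b(.*)", sql, flags=re.IGNORECASE | re.DOTALL)
        if not match:
            continue
        # Identifier-like tokens in the text after WHERE, deduplicated per query.
        tokens = set(re.findall(r"[a-z_][a-z0-9_]*", match.group(1).lower()))
        counts.update(tokens & known)
    return counts


def pick_clustering_columns(filter_counts, distinct_counts, correlated_groups=()):
    """Rank columns by filter frequency, then cardinality; keep at most 4."""
    scores = dict(filter_counts)
    for group in correlated_groups:
        used = [c for c in group if scores.get(c, 0) > 0]
        if len(used) < 2:
            continue
        # Keep only the most fine-grained (highest-cardinality) member of the
        # group and credit it with the whole group's filter count.
        keep = max(used, key=lambda c: distinct_counts.get(c, 0))
        total = sum(scores[c] for c in used)
        for c in used:
            del scores[c]
        scores[keep] = total
    ranked = sorted(scores, key=lambda c: (-scores[c], -distinct_counts.get(c, 0), c))
    return ranked[:MAX_CLUSTER_COLS]


def cluster_by_statement(table, columns):
    """Build an ALTER TABLE ... CLUSTER BY statement for the chosen columns."""
    return f"ALTER TABLE {table} CLUSTER BY ({', '.join(columns)})"


# Hypothetical sample workload and column statistics.
queries = [
    "SELECT * FROM events WHERE customer_id = 42 AND event_timestamp >= '2025-01-01'",
    "SELECT count(*) FROM events WHERE year = 2024 AND country = 'DE'",
    "SELECT * FROM events WHERE customer_id = 7",
    "SELECT * FROM events WHERE year = 2025 AND device_type = 'mobile'",
    "SELECT * FROM events WHERE status = 'ok' AND customer_id = 1",
]
known_columns = ["customer_id", "event_timestamp", "year", "country",
                 "device_type", "status", "payload"]
distinct_counts = {"customer_id": 1_000_000, "event_timestamp": 50_000_000,
                   "year": 5, "country": 200, "device_type": 4, "status": 3}

counts = count_filter_columns(queries, known_columns)
chosen = pick_clustering_columns(counts, distinct_counts,
                                 correlated_groups=[("event_timestamp", "year")])
print(cluster_by_statement("events", chosen))
```

In practice the query texts would come from whatever query log is available, and the distinct counts from something like approx_count_distinct over the candidate columns. The ranking rule here (filter frequency first, cardinality as a tie-breaker) is just one reasonable choice.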

https://learn.microsoft.com/en-us/azure/databricks/delta/clustering

Re: How to identify which columns we need to consider for liquid clustering from a table of 200+ col

noorbasha534 — Tue, 22 Jul 2025 21:59:11 GMT

@Alberto_Umana is it possible to get, from a system table, the columns used in joins and filters of queries against a given table?

Re: How to identify which columns we need to consider for liquid clustering from a table of 200+ col

pokornyt — Thu, 23 Oct 2025 12:16:58 GMT

Regarding High Cardinality Columns ("columns with a large number of unique values"): what about a primary key column that contains a unique value in every row? Is it a good or bad candidate for a liquid clustering column?

Imagine a primary key composed of random, number-like strings of a fixed length (e.g., 10 characters).
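
One way to reason about this scenario is a toy simulation of min/max file skipping. The sketch below is a simplified model, not Databricks internals: it assumes each file keeps min/max statistics for the key, and it uses a plain sort as a stand-in for clustering on a single column. The key generator, file size, and helper names (make_keys, split_into_files, files_to_read) are hypothetical.

```python
import random
import string


def make_keys(n, length=10, seed=0):
    """Generate n random fixed-length digit strings (hypothetical primary keys)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(string.digits, k=length)) for _ in range(n)]


def split_into_files(keys, rows_per_file):
    """Chunk keys into 'files' of rows_per_file rows each, preserving order."""
    return [keys[i:i + rows_per_file] for i in range(0, len(keys), rows_per_file)]


def files_to_read(files, lookup_key):
    """Count files whose [min, max] key range could contain lookup_key."""
    return sum(1 for f in files if min(f) <= lookup_key <= max(f))


keys = make_keys(10_000)
target = sorted(keys)[len(keys) // 2]  # look up a key from the middle of the range

unclustered = split_into_files(keys, 100)        # arrival order: random keys in every file
clustered = split_into_files(sorted(keys), 100)  # sorted by key as a stand-in for clustering

unclustered_reads = files_to_read(unclustered, target)
clustered_reads = files_to_read(clustered, target)
print(f"unclustered: {unclustered_reads} of {len(unclustered)} files")
print(f"clustered:   {clustered_reads} of {len(clustered)} files")
```

In this model, a point lookup on the key touches far fewer files once the data is ordered by that key, because each file covers a narrow key range. The model says nothing about filters on other columns, so whether a random primary key deserves one of the 4 clustering slots still depends on how often queries actually filter or join on it.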