Databricks Community

TejeshS · ‎01-03-2025

In Databricks, when working with a table that has a large number of columns (e.g., 200), it can be challenging to determine which columns are most important for liquid clustering.

Objective: The goal is to determine which columns to select based on their ability to meaningfully contribute to the clustering process, thereby improving query performance and insights.

Alberto_Umana · ‎01-03-2025

Hi @TejeshS,

Thanks for your post!

To determine which columns are most important for liquid clustering in a table with a large number of columns, you should focus on the columns that are most frequently used in query filters and those that can significantly contribute to data skipping and efficient query performance. Here are some guidelines:

High Cardinality Columns: Choose columns with high cardinality (i.e., columns with a large number of unique values) as clustering keys. These columns are more likely to benefit from clustering because they can help in efficiently skipping irrelevant data during queries.
Commonly Used Query Filters: Identify the columns that are most frequently used in query filters. These columns should be prioritized as clustering keys to improve query performance.
Avoid Correlated Columns: Try not to add correlated columns to clustering keys. For example, if you have both local time and UTC time columns, you only need to add one of them as a clustering key.
Fine-Grained Columns: Use the most fine-grained column you filter on as the clustering key. For example, if you have columns like event_timestamp, year, month, and date, use event_timestamp as the clustering key. Liquid clustering will automatically manage the data distribution based on the data volume.
Limit the Number of Clustering Columns: Liquid clustering supports a maximum of 4 columns. Therefore, you should carefully select up to 4 columns that provide the most benefit for clustering.
Data Skew and Distribution: Consider columns that help manage data skew and distribution. Tables with significant skew in data distribution can benefit from clustering on columns that help balance the data distribution.

https://learn.microsoft.com/en-us/azure/databricks/delta/clustering