
How to identify which columns to consider for liquid clustering in a table with 200+ columns

TejeshS
New Contributor

In Databricks, when working with a table that has a large number of columns (e.g., 200), it can be challenging to determine which columns are most important for liquid clustering.

Objective: Determine which columns to select as clustering keys based on how meaningfully they contribute to the clustering process, thereby improving query performance.

1 REPLY

Alberto_Umana
Databricks Employee

Hi @TejeshS,

Thanks for your post!

To determine which columns are most important for liquid clustering in a table with a large number of columns, you should focus on the columns that are most frequently used in query filters and those that can significantly contribute to data skipping and efficient query performance. Here are some guidelines:

  1. High Cardinality Columns: Choose columns with high cardinality (i.e., a large number of unique values) as clustering keys. These columns are more likely to benefit from clustering because they help skip irrelevant data during queries (see the sketch after this list for a quick way to profile cardinality).
  2. Commonly Used Query Filters: Identify the columns that are most frequently used in query filters. These columns should be prioritized as clustering keys to improve query performance.
  3. Avoid Correlated Columns: Try not to add correlated columns to clustering keys. For example, if you have both local time and UTC time columns, you only need to add one of them as a clustering key.
  4. Fine-Grained Columns: Use the most fine-grained column you filter on as the clustering key. For example, if you have columns like event_timestamp, year, month, and date, use event_timestamp as the clustering key. Liquid clustering will automatically manage the data distribution based on the data volume.
  5. Limit the Number of Clustering Columns: Liquid clustering supports a maximum of 4 columns. Therefore, you should carefully select up to 4 columns that provide the most benefit for clustering.
  6. Data Skew and Distribution: Consider columns that help manage data skew and distribution. Tables with significant skew in data distribution can benefit from clustering on columns that help balance the data distribution.
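
To act on point 1 without eyeballing 200+ columns, you can profile cardinality programmatically. A minimal PySpark sketch, assuming it runs in a Databricks notebook (where `spark` is predefined) and using a placeholder table name:

```python
from pyspark.sql import functions as F

# Placeholder table name -- substitute your own wide table.
table_name = "main.sales.wide_fact_table"
df = spark.table(table_name)

total_rows = max(df.count(), 1)  # guard against an empty table

# One pass over the table: approximate distinct counts for every column.
# approx_count_distinct keeps this cheap compared to exact COUNT(DISTINCT).
cardinalities = (
    df.agg(*[F.approx_count_distinct(c).alias(c) for c in df.columns])
    .first()
    .asDict()
)

# Rank columns by cardinality; high-cardinality columns that also appear
# in your query filters are the strongest clustering-key candidates.
for col, distinct in sorted(cardinalities.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {distinct} distinct (~{distinct / total_rows:.1%} of rows)")
```

Cross-reference the top of that list with the columns that actually appear in your most common WHERE clauses (points 2-4) before committing to keys.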

https://learn.microsoft.com/en-us/azure/databricks/delta/clustering
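
Once you have settled on up to four keys, applying them is straightforward per the doc linked above. A minimal sketch with placeholder table and column names, assuming the table meets the requirements described there (e.g., it is not already partitioned):

```python
# Retrofit liquid clustering onto an existing Delta table
# (CLUSTER BY can also be specified at CREATE TABLE time).
spark.sql("""
    ALTER TABLE main.sales.wide_fact_table
    CLUSTER BY (customer_id, event_timestamp)
""")

# New writes are clustered incrementally; run OPTIMIZE to rewrite
# existing data according to the new clustering keys.
spark.sql("OPTIMIZE main.sales.wide_fact_table")
```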
