cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

How to identify which columns we need to consider for liquid clustering from a table of 200+ columns

TejeshS
New Contributor

In Databricks, when working with a table that has a large number of columns (e.g., 200), it can be challenging to determine which columns are most important for liquid clustering.

Objective: The goal is to determine which columns to select based on their ability to meaningfully contribute to the clustering process, thereby improving query performance and insights.

1 REPLY 1

Alberto_Umana
Databricks Employee
Databricks Employee

Hi @TejeshS,

Thanks for your post!

To determine which columns are most important for liquid clustering in a table with a large number of columns, you should focus on the columns that are most frequently used in query filters and those that can significantly contribute to data skipping and efficient query performance. Here are some guidelines:

 

  1. High Cardinality Columns: Choose columns with high cardinality (i.e., columns with a large number of unique values) as clustering keys. These columns are more likely to benefit from clustering because they can help in efficiently skipping irrelevant data during queries.
  2. Commonly Used Query Filters: Identify the columns that are most frequently used in query filters. These columns should be prioritized as clustering keys to improve query performance.
  3. Avoid Correlated Columns: Try not to add correlated columns to clustering keys. For example, if you have both local time and UTC time columns, you only need to add one of them as a clustering key.
  4. Fine-Grained Columns: Use the most fine-grained column you filter on as the clustering key. For example, if you have columns like event_timestamp, year, month, and date, use event_timestamp as the clustering key. Liquid clustering will automatically manage the data distribution based on the data volume.
  5. Limit the Number of Clustering Columns: Liquid clustering supports a maximum of 4 columns. Therefore, you should carefully select up to 4 columns that provide the most benefit for clustering.
  6. Data Skew and Distribution: Consider columns that help manage data skew and distribution. Tables with significant skew in data distribution can benefit from clustering on columns that help balance the data distribution.

https://learn.microsoft.com/en-us/azure/databricks/delta/clustering

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group