Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Liquid clustering performance issue

OODataEng
New Contributor III

Hello,

I have a table with approximately 300 million records. It is about 3.4 GB in size and consists of 305 files.

I wanted to enable liquid clustering on it and chose a date column as the clustering key.

 

When I created a new table from the original one with the above setup, its size increased to nearly 5 GB. The number of files did drop to 24, but query performance was slower on the new table, and in the performance metrics I noticed that the amount of data read was larger than for the original, unclustered table.

 

What am I missing here? Why is this happening?

 

Thanks!

6 REPLIES

Vasireddy
New Contributor III
Hey @OODataEng ,
 
Great question, and you're definitely not alone in running into this when first applying liquid clustering. It sounds counterintuitive at first: you apply a performance optimization (clustering), but the table size increases and the queries slow down. Let me break down what's likely happening and share some insights that might help.
 
Why Your Table Size Increased (3.4 GB ➝ 5 GB)
 
When you enable liquid clustering, the engine rewrites the table into fewer, larger files that are organized by the clustering key (in your case, a date column). 
While this reduces the number of files (305 ➝ 24), it also introduces some overhead:
 
1. Compression efficiency can drop: if the original files happened to compress better (due to more uniform data chunks), the new clustered layout may be less compressible, especially if the date column has a wide range of values per file.
2. Metadata overhead: clustering introduces additional metadata to support file pruning and optimization. That metadata has a cost, especially at scale.
3. Larger files can mean more I/O: if your queries only need small slices of data, larger files mean more unnecessary data gets read.
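
To see where the extra bytes are going, you can compare the two layouts directly with `DESCRIBE DETAIL`, which reports the file count and total size for a Delta table (the table names below are placeholders):

   -- Compare file count and total size of the two layouts
   DESCRIBE DETAIL original_table;   -- check the numFiles and sizeInBytes columns
   DESCRIBE DETAIL clustered_table;  -- expect fewer files; total size may differ after the rewrite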
 
 
Why Your Queries Are Slower
 
Even though clustering is meant to improve performance, it depends heavily on how you query the data:
 
Are your queries filtering on the date column?
If not, the clustering isn't helping prune files, and you're paying the price of larger files without any benefit.

How selective is the date filter?
If you're scanning large date ranges, the system still needs to read most of the files.

Was your original table accidentally well optimized?
It's possible the original 305 small files gave better parallelism or caching benefits.
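
To make the first point concrete, here is a rough sketch of the difference (your_table, date_col, and some_other_col are placeholder names):

   -- Can prune: the filter is on the clustering key, so whole files can be skipped
   SELECT count(*) FROM your_table WHERE date_col = '2024-01-01';

   -- Cannot prune via clustering: the filter is on an unclustered column,
   -- so most or all files still have to be read
   SELECT count(*) FROM your_table WHERE some_other_col = 'x';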
 
 
What You Can Try Next
 
Here are some tips to investigate and improve things:
 
1. Check pruning effectiveness
   Run this to see whether the filter can be pushed down so files can be skipped:

   EXPLAIN SELECT * FROM your_table WHERE date_col = '2024-01-01';

   If the query is still reading most of the data, clustering isn't helping much yet.

2. Use `DESCRIBE DETAIL` or `input_file_name()`
   These can help you analyze file sizes and distribution after clustering.

3. Test with Z-Ordering (if applicable)
   If your queries filter on multiple columns (not just date), consider Z-Ordering for multi-dimensional clustering.

4. Try a smaller clustering target
   Smaller file sizes (using a config like `delta.targetFileSize`) may help restore parallelism (see the sketch after this list).

5. Review the distribution of your clustering column
   Is the data evenly spread across dates, or is there skew (e.g., most records fall in a few days)? Skewed data can lead to unbalanced files, which slows things down.
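
Here is a rough sketch of points 4 and 5 (your_table and date_col are placeholders, and the 32mb target is an example value to tune, not a recommendation):

   -- Point 4: lower the target file size, then rewrite the layout
   ALTER TABLE your_table SET TBLPROPERTIES ('delta.targetFileSize' = '32mb');
   OPTIMIZE your_table;

   -- Point 5: check the clustering column for skew
   SELECT date_col, count(*) AS rows_per_day
   FROM your_table
   GROUP BY date_col
   ORDER BY rows_per_day DESC
   LIMIT 20;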
 
 
In Summary
 
Liquid clustering is a powerful tool, but it’s not a silver bullet. Its effectiveness depends heavily on:
 
1. What column you use for clustering
2. How your queries are written
3. How your data is distributed
 
You're on the right track by experimenting with it! I'd recommend measuring query performance again after a few days or with different filters; sometimes clustering benefits kick in once the system has had time to optimize further.
harisankar

OODataEng
New Contributor III

I really appreciate your detailed response.
I tried changing the key to a column with higher cardinality, one that includes both date and time (a timestamp), to achieve a more even distribution.
Now the size has dropped to 4.3 GiB, and the number of files is 64.

The queries now perform similarly, with a slight advantage for the CLUSTER BY table when filtering on the clustered column. However, the amount of data read (in MB) is still higher.
I was under the impression that, unlike PARTITION BY, I don’t need to find a column with an even distribution for CLUSTER BY.

Could you clarify in which cases I would see a significant improvement with a CLUSTER BY table compared to a table with no optimization strategy at all?

Thank you!

nikhilj0421
Databricks Employee

Hi @OODataEng, could you please elaborate on the exact command you used to create the new table from the old one?

Hi,

CREATE TABLE x
CLUSTER BY (col1, col2, col3)
AS SELECT * FROM y;

and then I run the OPTIMIZE command on table x.
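
For reference, with liquid clustering a plain OPTIMIZE clusters data incrementally; newer runtimes also support an OPTIMIZE ... FULL variant that forces reclustering of all records. Availability depends on your Databricks Runtime version, so verify it in the docs before relying on it:

   OPTIMIZE x;       -- incremental: clusters recently written data
   OPTIMIZE x FULL;  -- rewrites/reclusters all records (newer runtimes only)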

 

Thank you

OODataEng
New Contributor III

I tried it with one column as the clustering key and also with three columns; in both scenarios the size of the table increased.

Yogesh_378691
New Contributor III

Hey @OODataEng 

To create a new table in Databricks using the schema and data from an existing table, you can use the CREATE TABLE AS SELECT command. This command allows you to define a new table based on the results of a SELECT query executed on the existing table. 

Here's a breakdown of the command and its components:

CREATE TABLE new_table_name
AS
SELECT * FROM existing_table_name;

CREATE TABLE new_table_name defines the name of the new table, and AS SELECT * FROM existing_table_name populates it with every row and column returned from the existing table.
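
Since this thread is about liquid clustering, note that the same statement can create the table already clustered by adding a CLUSTER BY clause (date_col is a placeholder for your clustering column):

CREATE TABLE new_table_name
CLUSTER BY (date_col)
AS
SELECT * FROM existing_table_name;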
