Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Liquid clustering performance issue

OODataEng
New Contributor III

Hello,

I have a table with approximately 300 million records. It is about 3.4 GB in size and consists of 305 files.

I wanted to enable liquid clustering on it and chose a date column as the clustering key.

 

When I created a new table from the original one with the above setup, its size increased to nearly 5 GB. The number of files did drop to 24, but query performance was slower on the new table, and in the performance metrics I noticed that the amount of data read was larger than for the original, unclustered table.

 

What am I missing here? Why is this happening?

 

Thanks!

6 REPLIES

Vasireddy
New Contributor III
Hey @OODataEng ,
 
Great question, and you're definitely not alone in running into this when first applying liquid clustering. It sounds counterintuitive at first: you apply a performance optimization (clustering), but the table size increases and the queries slow down. Let me break down what's likely happening and share some insights that might help.
 
Why Your Table Size Increased (3.4 GB ➝ 5 GB)
 
When you enable liquid clustering, the engine rewrites the table into fewer, larger files that are organized by the clustering key (in your case, a date column). 
While this reduces the number of files (305 ➝ 24), it also introduces some overhead:
 
1. Compression efficiency can drop: if the original files happened to compress better (due to more uniform data chunks), the new clustered layout may be less compressible, especially if the date column has a wide range of values per file.
2. Metadata overhead: clustering introduces additional metadata to support file pruning and optimization. That metadata has a cost, especially at scale.
3. Larger files can mean more I/O: if your queries only need small slices of data, larger files mean more unnecessary data gets read.
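
To see where the extra bytes are going, you can compare the two layouts directly with `DESCRIBE DETAIL`, which reports the file count and total size for a Delta table (the table names below are placeholders):

   -- Compare file count and total size of the two layouts
   DESCRIBE DETAIL original_table;   -- check the numFiles and sizeInBytes columns
   DESCRIBE DETAIL clustered_table;  -- expect fewer files; total size may differ after the rewrite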
 
 
Why Your Queries Are Slower
 
Even though clustering is meant to improve performance, it depends heavily on how you query the data:
 
Are your queries filtering on the date column?
If not, the clustering isn't helping prune files, and you're paying the price of larger files without any benefit.

How selective is the date filter?
If you're scanning large date ranges, the system still needs to read most of the files.

Was your original table accidentally well optimized?
It's possible the original 305 small files gave better parallelism or caching benefits.
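
To make the first point concrete, here is a rough sketch of the difference (your_table, date_col, and some_other_col are placeholder names):

   -- Can prune: the filter is on the clustering key, so whole files can be skipped
   SELECT count(*) FROM your_table WHERE date_col = '2024-01-01';

   -- Cannot prune via clustering: the filter is on an unclustered column,
   -- so most or all files still have to be read
   SELECT count(*) FROM your_table WHERE some_other_col = 'x';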
 
 
What You Can Try Next
 
Here are some tips to investigate and improve things:
 
1. Check pruning effectiveness
   Run this to see whether the filter can be pushed down so files can be skipped:

   EXPLAIN SELECT * FROM your_table WHERE date_col = '2024-01-01';

   If the query is still reading most of the data, clustering isn't helping much yet.

2. Use `DESCRIBE DETAIL` or `input_file_name()`
   These can help you analyze file sizes and distribution after clustering.

3. Test with Z-Ordering (if applicable)
   If your queries filter on multiple columns (not just date), consider Z-Ordering for multi-dimensional clustering.

4. Try a smaller clustering target
   Smaller file sizes (using a config like `delta.targetFileSize`) may help restore parallelism (see the sketch after this list).

5. Review the distribution of your clustering column
   Is the data evenly spread across dates, or is there skew (e.g., most records fall in a few days)? Skewed data can lead to unbalanced files, which slows things down.
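
Here is a rough sketch of points 4 and 5 (your_table and date_col are placeholders, and the 32mb target is an example value to tune, not a recommendation):

   -- Point 4: lower the target file size, then rewrite the layout
   ALTER TABLE your_table SET TBLPROPERTIES ('delta.targetFileSize' = '32mb');
   OPTIMIZE your_table;

   -- Point 5: check the clustering column for skew
   SELECT date_col, count(*) AS rows_per_day
   FROM your_table
   GROUP BY date_col
   ORDER BY rows_per_day DESC
   LIMIT 20;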
 
 
In Summary
 
Liquid clustering is a powerful tool, but it’s not a silver bullet. Its effectiveness depends heavily on:
 
1. What column you use for clustering
2. How your queries are written
3. How your data is distributed
 
You're on the right track by experimenting with it! I'd recommend measuring query performance again after a few days or with different filters; sometimes clustering benefits kick in once the system has had time to optimize further.
harisankar

OODataEng
New Contributor III

I really appreciate your detailed response.
I tried changing the key to a column with higher cardinality, one that includes both date and time (a timestamp), to achieve a more even distribution.
Now the size has dropped to 4.3 GiB, and the number of files is 64.

The queries now perform similarly, with a slight advantage for the CLUSTER BY table when filtering on the clustered column. However, the amount of data read (in MB) is still higher.
I was under the impression that, unlike PARTITION BY, I don’t need to find a column with an even distribution for CLUSTER BY.

Could you clarify in which cases I would see a significant improvement with a CLUSTER BY table compared to a table with no optimization strategy at all?

Thank you!

nikhilj0421
Databricks Employee

Hi @OODataEng, could you please elaborate on the exact command you used to create the new table from the old one?

Hi,

CREATE TABLE x
CLUSTER BY (col1, col2, col3)
AS SELECT * FROM y;

and then I run the OPTIMIZE command on table x.
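
For reference, with liquid clustering a plain OPTIMIZE clusters data incrementally; newer runtimes also support an OPTIMIZE ... FULL variant that forces reclustering of all records. Availability depends on your Databricks Runtime version, so verify it in the docs before relying on it:

   OPTIMIZE x;       -- incremental: clusters recently written data
   OPTIMIZE x FULL;  -- rewrites/reclusters all records (newer runtimes only)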

 

Thank you

OODataEng
New Contributor III

I tried it with one column as the clustering key and also with three columns; in both scenarios the size of the table increased.

Yogesh_378691
New Contributor III

Hey @OODataEng 

To create a new table in Databricks using the schema and data from an existing table, you can use the CREATE TABLE AS SELECT command. This command allows you to define a new table based on the results of a SELECT query executed on the existing table. 

Here's a breakdown of the command and its components:

CREATE TABLE new_table_name
AS
SELECT * FROM existing_table_name;

CREATE TABLE new_table_name defines the name of the new table, and AS SELECT * FROM existing_table_name populates it with every row and column returned from the existing table.
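
Since this thread is about liquid clustering, note that the same statement can create the table already clustered by adding a CLUSTER BY clause (date_col is a placeholder for your clustering column):

CREATE TABLE new_table_name
CLUSTER BY (date_col)
AS
SELECT * FROM existing_table_name;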
