Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Liquid Clustering - Number of files are increasing

data-engineer-d
New Contributor III

We enabled liquid clustering on one of our large tables (380 GB). This table goes through many operations daily, and they became many times faster after liquid clustering. However, after enabling liquid clustering and optimizing the table, the number of files increased.

Previously it had around 4300 files and now it shows 7900 files, though the table size is almost the same before and after.

It is clustered on two columns, both of which are among the first 32 columns. How can we explain this increase in the number of files, i.e. the decrease in data per file?

1 REPLY

Kaniz_Fatma
Community Manager

Hi @data-engineer-d, First I would like to explain Liquid Clustering:-

Now, seeing your problem,

  • You mentioned that after enabling liquid clustering, the number of files increased from around 4300 to 7900, even though the table size remained similar.
  • This behaviour is expected due to the way liquid clustering works. When you optimize a table, it reorganizes the data into ZCubes. Some files may be split or merged to form these ZCubes.
  • With the table size unchanged, more files does mean a smaller average file size, but that is not necessarily a problem. It reflects the new organization of the data into clusters.
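
To put numbers on this, here is a rough back-of-the-envelope calculation using only the figures from the question (380 GB, about 4300 files before, about 7900 after). It assumes the table size is exactly the same before and after and treats 1 GB as 1024 MB; the function name is just for illustration.

```python
def avg_file_size_mb(table_size_gb: float, num_files: int) -> float:
    """Average file size in MB, assuming 1 GB = 1024 MB."""
    return table_size_gb * 1024 / num_files


# Figures quoted in the question; the size is assumed unchanged by OPTIMIZE.
TABLE_SIZE_GB = 380
before = avg_file_size_mb(TABLE_SIZE_GB, 4300)
after = avg_file_size_mb(TABLE_SIZE_GB, 7900)

print(f"before: {before:.1f} MB per file")
print(f"after:  {after:.1f} MB per file")
print(f"file count ratio: {7900 / 4300:.2f}x")
```

So the average file goes from roughly 90 MB to roughly 49 MB. Whether that matters depends on the queries, not on the count alone.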

To justify this increase, consider the following factors:

  • If your data has skewed distributions (e.g., some values are more frequent than others), liquid clustering might create more files to evenly distribute the data.
  • The clustering columns matter. If the two columns used for clustering have high cardinality (many distinct values), it could lead to more files.
  • Despite the increased file count, query performance should improve due to better data skipping and locality.
  • Check if the new files are compressed efficiently. Sometimes, smaller files can still hold a significant amount of data due to better compression.
  • Monitor query performance after liquid clustering. If it's improved, the increase in files is likely beneficial.
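
An average can hide a lot, so it may help to look at how the file sizes are distributed rather than only at the count. Below is a minimal sketch in plain Python that buckets a list of file sizes. It assumes you have already collected per-file sizes in bytes from somewhere (for example a storage listing); the bucket boundaries and the sample values are made up for illustration and are not Databricks defaults.

```python
from collections import Counter

MB = 1024 ** 2

# Upper bounds in MB with labels; illustrative thresholds only.
BUCKETS = [(16, "<16MB"), (64, "16-64MB"), (256, "64-256MB")]
LABELS = [label for _, label in BUCKETS] + [">=256MB"]


def bucket_for(size_bytes: int) -> str:
    """Return the label of the first bucket whose upper bound exceeds the size."""
    size_mb = size_bytes / MB
    for upper, label in BUCKETS:
        if size_mb < upper:
            return label
    return ">=256MB"


def size_histogram(file_sizes_bytes) -> Counter:
    """Count files per size bucket."""
    return Counter(bucket_for(s) for s in file_sizes_bytes)


# Hypothetical sample; replace with the real per-file sizes of the table.
sample = [5 * MB, 40 * MB, 48 * MB, 120 * MB, 300 * MB]
hist = size_histogram(sample)
for label in LABELS:
    print(f"{label:>9}: {hist[label]}")
```

If most files sit in one mid-sized bucket, the higher count is probably just the new layout. A long tail of very small files would be worth a closer look.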

Remember that liquid clustering optimizes query efficiency, and the increase in files is a trade-off for better performance. If your queries are faster, it's a sign that the approach is working as intended!

 
