Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Liquid Clustering and S3 Performance

RyanHager
Contributor

Are there any performance concerns when using liquid clustering with AWS S3? I believe all the Parquet files go into the same folder (prefix, in AWS S3 terms), versus one folder per partition when using "partition by". And there is this note on S3 performance: "Amazon S3 automatically scales to high request rates. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix" on this page: https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html
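
For reference, the two layouts being compared look roughly like this (table and column names are illustrative, not from an actual workload):

```sql
-- Hive-style partitioning: one folder (S3 prefix) per partition value,
-- e.g. s3://bucket/events/event_date=2024-01-01/part-00000.parquet
CREATE TABLE events_partitioned (
  event_date DATE,
  user_id STRING,
  payload STRING
)
USING DELTA
PARTITIONED BY (event_date);

-- Liquid clustering: no partition folders; data files share the table's
-- root location, e.g. s3://bucket/events/part-00000.parquet
CREATE TABLE events_clustered (
  event_date DATE,
  user_id STRING,
  payload STRING
)
USING DELTA
CLUSTER BY (event_date);
```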

1 ACCEPTED SOLUTION


iyashk-DB
Databricks Employee

Even though liquid clustering removes Hive-style partition folders, it typically doesn’t cause S3 prefix performance issues on Databricks. Delta tables don’t rely on directory listing for reads; they use the transaction log to locate exact files. In addition, when column mapping is enabled (or when delta.randomizeFilePrefixes is used), Databricks automatically shards data files across randomized S3 prefixes, so requests are spread out and don’t hit single-prefix limits. Because of this, most workloads don’t see S3 throttling issues with liquid-clustered tables.
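
To enable and verify the randomized-prefix behavior described above, the table property can be set and inspected like this (table name is hypothetical; per the Delta documentation, 'delta.randomPrefixLength' controls the prefix length and defaults to 2):

```sql
-- Enable randomized file prefixes on an existing table
ALTER TABLE events_clustered
  SET TBLPROPERTIES ('delta.randomizeFilePrefixes' = 'true');

-- Confirm the property took effect
SHOW TBLPROPERTIES events_clustered;
```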


2 REPLIES


We set 'delta.randomizeFilePrefixes' = 'true', and a two-letter prefix (the default length) followed by a slash was added to the path before the file name. This should handle any S3 performance concerns as we scale to production. Thank you.
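
For anyone wanting to confirm the same thing on their own table, one way to see the randomized prefixes is to look at the data file paths via the `_metadata.file_path` column (table name is illustrative):

```sql
-- Each path should now show a short random prefix and a slash
-- inserted before the file name
SELECT DISTINCT _metadata.file_path
FROM events_clustered
LIMIT 10;
```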