How should you optimize <1 GB Delta tables?

PJ
New Contributor III

I have seen documentation that details how you can use the OPTIMIZE command to improve storage and query efficiency. However, most of that documentation focuses on big data, 10 GB or larger.

I am working with a ~7 million row dataset that is 211 MB. Without any intervention on my part, this Delta table was split into 70 files. When I run the OPTIMIZE command on it, it collapses into a single file. Did running OPTIMIZE help or hurt me in terms of efficiency? And if it hurt, what is an alternative method for improving efficiency, particularly with respect to reading/querying? Thank you!
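For reference, here is roughly how I checked the layout and ran the compaction (the table name is just a placeholder for mine; spark is the Databricks notebook's SparkSession):

    # Inspect the Delta table's current file layout (numFiles, sizeInBytes)
    detail = spark.sql("DESCRIBE DETAIL my_schema.my_table").select("numFiles", "sizeInBytes").first()
    print(f"files={detail['numFiles']}, size_bytes={detail['sizeInBytes']}")  # before: 70 files, ~211 MB

    # Compact the small files; on a table this size everything collapses into one file
    spark.sql("OPTIMIZE my_schema.my_table")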

ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

A file size of 100-200 MB is perfect for Spark.

Regarding efficiency, it depends on many factors. If you filter a lot on certain fields, you can add a Bloom filter index. If your queries filter by timestamp, ZORDER will be enough. If your data is queried by some low-cardinality category and only a subset ever needs to be read (for example, a finance ledger covering three separate companies), then partitioning by that category is fine; after optimization you would end up with, say, three files of roughly 60 MB each, which makes sense when only some of the partitions have to be read.
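A rough sketch of those three options, run as SQL from a Databricks notebook (table and column names here are made up, adjust to your schema):

    # 1) Bloom filter index for selective filters on a high-cardinality column.
    #    Note: it only applies to files written after the index is created.
    spark.sql("""
        CREATE BLOOMFILTER INDEX ON TABLE ledger
        FOR COLUMNS (transaction_id OPTIONS (fpp=0.1, numItems=7000000))
    """)

    # 2) If queries filter by timestamp, co-locate rows with ZORDER during OPTIMIZE.
    spark.sql("OPTIMIZE ledger ZORDER BY (event_ts)")

    # 3) If only one category is read at a time (e.g. three companies), partition by it,
    #    so each query touches a single ~60 MB partition instead of the whole table.
    spark.sql("""
        CREATE TABLE ledger_by_company
        USING DELTA
        PARTITIONED BY (company)
        AS SELECT * FROM ledger
    """)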


REPLIES

PJ
New Contributor III

Thank you @Hubert Dudek! So I gather from your response that it's totally fine to have a Delta table that lives in a single file of roughly 211 MB. And I can use OPTIMIZE in conjunction with ZORDER on a frequently filtered, high-cardinality field for further query optimization, something like the snippet below. Is this all correct?
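(For concreteness, with a made-up column name standing in for the field I filter on most:)

    # Compact the table and co-locate rows on the frequently filtered, high-cardinality column
    spark.sql("OPTIMIZE my_schema.my_table ZORDER BY (customer_id)")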

Hubert-Dudek
Esteemed Contributor III

Yes.
