Data Engineering

How should you optimize <1GB delta tables?

PJ
New Contributor III

I have seen documentation that details how to use the OPTIMIZE command to improve storage and query efficiency. However, most of that documentation focuses on big data, 10 GB or larger.

I am working with a ~7 million-row dataset that is 211 MB. Without any intervention on my part, this Delta table was split into 70 files. When I run the OPTIMIZE command on it, the table collapses into one file. Did running OPTIMIZE help or hurt me in terms of efficiency? And if it hurt me, what is an alternative method for improving efficiency, particularly with respect to reading/querying? Thank you!
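A minimal sketch of the commands in question, using a placeholder table name (small_events); DESCRIBE DETAIL reports the current file count and total size, and OPTIMIZE compacts the files:

-- Inspect the table's current layout (numFiles, sizeInBytes).
DESCRIBE DETAIL small_events;

-- Compact the small files; OPTIMIZE targets files of roughly 1 GB by
-- default, so ~211 MB of data typically ends up in a single file.
OPTIMIZE small_events;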

1 ACCEPTED SOLUTION

Hubert-Dudek
Esteemed Contributor III

A file size of 100-200 MB is perfect for Spark.

Regarding efficiency, it depends on many factors. If you filter a lot on certain fields, you can add a Bloom filter index. If your queries filter by timestamp, ZORDER will be enough. If your data is queried by some low-cardinality category and usually only part of it needs to be read (for example, a finance ledger covering three separate companies), then partitioning by that category is fine; after optimization you would end up with, say, three files of about 60 MB each, which makes sense when only some of the partitions have to be read.
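A hedged sketch of each of these options in Databricks SQL; the table and column names (events, event_ts, lookup_id, company) are hypothetical:

-- Compact files and co-locate rows by a frequently filtered column,
-- e.g. a timestamp, to improve data skipping.
OPTIMIZE events ZORDER BY (event_ts);

-- Add a Bloom filter index on a high-cardinality column used in
-- selective point-lookup filters.
CREATE BLOOMFILTER INDEX ON TABLE events
FOR COLUMNS (lookup_id OPTIONS (fpp = 0.1, numItems = 7000000));

-- Partition by a low-cardinality category when queries usually read
-- only one partition (e.g. one company's ledger out of three).
CREATE TABLE events_by_company
USING DELTA
PARTITIONED BY (company)
AS SELECT * FROM events;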

3 REPLIES

PJ
New Contributor III

Thank you @Hubert Dudek! So I gather from your response that it's totally fine to have a Delta table that lives in a single file of roughly 211 MB, and that I can use OPTIMIZE in conjunction with ZORDER on a frequently filtered, high-cardinality field for further query optimization. Is this all correct?

Hubert-Dudek
Esteemed Contributor III

yes
