topic Re: How should you optimize <1GB delta tables? in Data Engineering

How should you optimize <1GB delta tables?

PJ — Thu, 21 Apr 2022 19:45:17 GMT

I have seen the following documentation that details how you can work with the OPTIMIZE function to improve storage and querying efficiency. However, most of the documentation focuses on big data, 10 GB or larger.

I am working with a ~7million row dataset that is 211 MB. Without any of my intervention, this delta table dataset was split into 70 files. When I run the OPTIMIZE command on it, it collapses into 1 file. Did running the OPTIMIZE function help or hurt me in terms of efficiency? And if it hurt me, what is an alternative method for improving efficiency, particularly with respect to reading/querying? Thank you!

Re: How should you optimize <1GB delta tables?

Hubert-Dudek — Thu, 21 Apr 2022 19:57:21 GMT

Size 100-200 MB is perfect for Spark.

Regarding efficiency, it depends on many factors. If you do a lot of filters on some fields, you can add a bloom filter. If your query is by timestamp, ZORDER will be enough. Suppose your data is queried and divided by some infrequent category that only needs to be imported (for example, finance data ledger for three separate companies). In that case, partitioning per that category is ok, so there will be three files after optimization, for example, 60 MB each (which make sense when we know that only some of the partitions have to be imported).

Re: How should you optimize <1GB delta tables?

PJ — Thu, 21 Apr 2022 20:45:33 GMT

Thank you @Hubert Dudek !! So I gather from your response that it's totally fine to have a delta table that lives under 1 file that's roughly 211 MB. And I can use OPTIMIZE in conjunction with ZORDER to filter on a frequently filtered, high cardinality field for further query optimization. Is this all correct?

Re: How should you optimize <1GB delta tables?

Hubert-Dudek — Thu, 21 Apr 2022 21:50:40 GMT

yes