
Auto optimize, OPTIMIZE and VACUUM commands: order and production implementation best practices

AP
New Contributor III

So Databricks gives us a great toolkit in the form of OPTIMIZE and VACUUM. But in terms of operationalizing them, I am really confused about the best practice.

Should we enable "optimized writes" by setting the following at a workspace level?

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true") # for writing speed

spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true") # compressing files

OR

Should we explicitly execute the OPTIMIZE command on tables and databases at a set frequency? Also, if we enable optimized writes at a workspace level, do we still have to execute OPTIMIZE separately at a table level? Are they the same or different?

Once the decision around OPTIMIZE is settled, when should we run VACUUM? Should we run both OPTIMIZE and VACUUM in the same script? If not, what is the ideal order?
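
For illustration, this is roughly what a write looks like once the two settings above are enabled; events_demo is just a placeholder table name:

# Sketch only: with the two session settings above enabled, an ordinary Delta
# write picks them up automatically; no per-write code changes are needed.
df = spark.range(0, 10_000_000).withColumnRenamed("id", "event_id")
df.write.format("delta").mode("overwrite").saveAsTable("events_demo")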


5 REPLIES

-werners-
Esteemed Contributor III

For the optimize part I think the docs do a great job:

https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize

Basically it is: use auto optimize, but if your data gets big, also run a manual OPTIMIZE.

For the vacuum part:

https://community.databricks.com/s/question/0D53f00001SKZVmCAP/optimize-and-vacuum-which-is-the-best...
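
Putting the two together, a combined maintenance job could look roughly like this (the table name and retention period are placeholders, and running OPTIMIZE before VACUUM is just a common pattern):

# Sketch of a scheduled maintenance script (placeholder table name and retention).
# OPTIMIZE rewrites many small files into larger ones; VACUUM then deletes files
# that are no longer referenced by the table and are older than the retention window.
spark.sql("OPTIMIZE my_db.my_table")
spark.sql("VACUUM my_db.my_table RETAIN 168 HOURS")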

AP
New Contributor III

Thanks Werner for sharing the link. It helped a bit, but I still don't have a complete intuition for what exactly happens when we configure auto optimize.

Can you please tell me the workflow that happens when we enable auto optimize?

For example: it says auto optimize increases throughput while writing, which is not intuitive for me, because we are adding management overhead with this approach, which to me feels like it should reduce throughput. So the question here is: what happens, step by step, when we enable auto optimize?

-werners-
Esteemed Contributor III

ok no problem.

Auto optimize in fact consists of two operations. First there are optimized writes (delta.autoOptimize.optimizeWrite), which aim to write files of about 128 MB. This is an approximate size and can vary depending on dataset characteristics; often exactly 128 MB will not be possible.

So then there is also auto compaction (delta.autoOptimize.autoCompact).

After an individual write, Databricks checks whether files can be compacted further, and if so runs an OPTIMIZE job (with a 128 MB target file size instead of the 1 GB used by the standard OPTIMIZE) on the partitions that have the largest number of small files.
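
Both behaviours can also be enabled per table through table properties, for example (events is a placeholder table name):

# Sketch: enable optimized writes and auto compaction for a single table
# instead of the whole session or workspace ("events" is a placeholder).
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
      'delta.autoOptimize.optimizeWrite' = 'true',
      'delta.autoOptimize.autoCompact' = 'true'
    )
""")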

These optimizations come with a cost of course (e.g. an extra shuffle). However, the net outcome is often positive because you write smaller files which are still large enough for good query performance.

The increase in throughput works as follows: let's say you want to write about 1000 MB.

With the classic 1 GB OPTIMIZE target, that could end up as a single partition of 1000 MB, and a single partition means one task executed by one worker.

If you instead write that 1000 MB as 128 MB partitions, the write can be parallelized into roughly 8 tasks (1000 MB / 128 MB), hence more throughput.
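
If you want to see the effect on a real table, DESCRIBE DETAIL reports the file count and total size, which makes the average file size visible (events is a placeholder table name):

# Sketch: inspect how many files a Delta table has and their average size.
detail = spark.sql("DESCRIBE DETAIL events").select("numFiles", "sizeInBytes").first()
avg_mb = detail["sizeInBytes"] / max(detail["numFiles"], 1) / (1024 * 1024)
print(f"{detail['numFiles']} files, ~{avg_mb:.0f} MB per file on average")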

Anonymous
Not applicable

@AKSHAY PALLERLA Just checking in to see if you got a solution to the issue you shared above. Let us know!

Thanks to @Werner Stinckens for jumping in, as always!

AP
New Contributor III

Hello Lindsay, Yes. @Werner Stinckens has done an excellent job of distilling a few things down for me. Thank you!
