AutoOptimize, OPTIMIZE and VACUUM commands: order and production implementation best practices

AP
New Contributor III

So Databricks gives us a great toolkit in the form of OPTIMIZE and VACUUM, but in terms of operationalizing them I am really confused about the best practice.

Should we enable "optimized writes" by setting the following at a workspace level?

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true") # for writing speed

spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true") # compressing files

OR

Should we explicitly execute the OPTIMIZE command on tables and databases at a set frequency? Also, if we enable optimized writes at the workspace level, do we still have to execute OPTIMIZE separately at the table level? Are they the same or different?

Once the decision around OPTIMIZE is settled, when should we run VACUUM? Should we run both OPTIMIZE and VACUUM in the same script? If not, what is the ideal order?


5 REPLIES

-werners-
Esteemed Contributor III

For the optimize part I think the docs do a great job:

https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize

Basically it is: use auto optimize, but if your data gets big, also run a manual OPTIMIZE.

For the vacuum part:

https://community.databricks.com/s/question/0D53f00001SKZVmCAP/optimize-and-vacuum-which-is-the-best...
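To make the scheduling concrete, here is a minimal sketch of a periodic maintenance job, assuming a placeholder table name and the usual order of compacting with OPTIMIZE before cleaning up with VACUUM:

# scheduled maintenance job (e.g. nightly or weekly), run per Delta table
table_name = "my_db.my_table"  # placeholder, replace with your own table

# 1. compact small files into larger ones
spark.sql(f"OPTIMIZE {table_name}")

# 2. delete data files no longer referenced by the table, using the default 7-day retention
spark.sql(f"VACUUM {table_name}")

Running VACUUM after OPTIMIZE means the small files that OPTIMIZE replaced become eligible for clean-up once they pass the retention window.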

AP
New Contributor III

Thanks Werner for sharing the link. It helped a bit, but I still don't have an intuitive picture of what exactly happens when we configure auto optimize.

Can you please walk me through the workflow that happens when we enable auto optimize?

For example, it says auto optimize increases write throughput, which is not intuitive to me because this approach adds management overhead, which feels like it should reduce throughput. So the question is: what happens, step by step, when we enable auto optimize?

-werners-
Esteemed Contributor III

Ok, no problem.

Auto optimize in fact consists of two operations. The first is optimized writes (delta.autoOptimize.optimizeWrite), which aims to write files of roughly 128 MB. This is an approximate size and can vary depending on dataset characteristics; often 128 MB will not be achievable.

The second is auto compaction (delta.autoOptimize.autoCompact). After an individual write, Databricks checks whether files can be compacted further and, if so, runs an OPTIMIZE job (targeting 128 MB files instead of the 1 GB file size used by the standard OPTIMIZE) on the partitions with the largest number of small files.
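For reference, both behaviors can also be enabled per table as Delta table properties rather than through the session-level spark.conf settings from the question; a sketch with a placeholder table name:

# enable optimized writes and auto compaction for a single table
spark.sql("""
    ALTER TABLE my_db.my_table
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")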

These optimizations of course come with a cost (an extra shuffle, for example). However, the net outcome is often positive, because you write smaller files that are still large enough for good query performance.

The increase in throughput works as follows: say you want to write about 1000 MB.

With the classic 1 GB target, this would end up as a single partition of 1000 MB, and a single partition means one task executed by one worker.

If you instead write that 1000 MB as 128 MB partitions, the write can be parallelized into roughly eight tasks (1000 MB / 128 MB), hence more throughput.
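If you want to check what file layout your writes actually produce, DESCRIBE DETAIL reports the table's current file count and total size; a small sketch, again with a placeholder table name:

# inspect the current file layout of a Delta table
detail = spark.sql("DESCRIBE DETAIL my_db.my_table").select("numFiles", "sizeInBytes").first()
print(f"files: {detail['numFiles']}, total size: {detail['sizeInBytes']} bytes")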

Anonymous
Not applicable

@AKSHAY PALLERLA Just checking in to see if you got a solution to the issue you shared above. Let us know!

Thanks to @Werner Stinckens for jumping in, as always!

AP
New Contributor III

Hello Lindsay, yes. @Werner Stinckens has done an excellent job of distilling a few things down for me. Thank you!
