07-31-2022 08:20 PM
So Databricks gives us a great toolkit in the form of OPTIMIZE and VACUUM. But in terms of operationalizing them, I am really confused about the best practice.
Should we enable "optimized writes" by setting the following at a workspace level?
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true") # for writing speed
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true") # compressing files
OR
Should we explicitly execute the OPTIMIZE command on tables and databases at a set frequency? Also, if we enable optimized writes at a workspace level, do we still have to execute OPTIMIZE separately at the table level? Are they the same or different?
After the decision around OPTIMIZE is settled, when should we run VACUUM? Should we run both OPTIMIZE and VACUUM in the same script? If not, what is the ideal order?
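For context, this is roughly the kind of explicit maintenance script I have in mind (the table name, the Z-ORDER column, and the retention window below are just placeholders, not anything settled):
# Sketch of an explicit, scheduled maintenance job; assumes a Databricks notebook/job where `spark` is predefined
table = "my_db.my_table"  # placeholder table name
# 1. Compact small files, optionally clustering by a frequently filtered column (placeholder column)
spark.sql(f"OPTIMIZE {table} ZORDER BY (event_date)")
# 2. Afterwards, remove data files no longer referenced by the table (168 hours = the default 7-day retention)
spark.sql(f"VACUUM {table} RETAIN 168 HOURS")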
08-01-2022 02:07 AM
For the optimize part, I think the docs do a great job:
https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize
Basically it is: use auto optimize, but if your data gets big, also run a manual OPTIMIZE.
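As a rough sketch of what that could look like (the table name is just an example), auto optimize can also be enabled per table via table properties instead of session configs, combined with a scheduled manual OPTIMIZE for the big tables:
# Enable auto optimize on a specific table via table properties (example table name)
spark.sql("""
    ALTER TABLE my_db.big_table
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")
# For big tables, additionally run a manual OPTIMIZE on a schedule (e.g. a daily job)
spark.sql("OPTIMIZE my_db.big_table")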
For the vacuum part:
08-01-2022 02:49 PM
Thanks Werner for sharing the link. It helped a bit, but I still don't have a clear intuition for what exactly happens when we configure auto optimize.
Can you please walk me through the workflow when we enable auto optimize?
For example, the docs say auto optimize increases throughput while writing, which is not intuitive to me: we are adding management overhead with this approach, which feels like it should reduce throughput. So the question is: what happens, step by step, when we enable auto optimize?
08-02-2022 02:02 AM
Ok, no problem.
Auto optimize in fact consists of two operations. First, there are optimized writes (delta.autoOptimize.optimizeWrite), which aim to write files of about 128 MB. This is an approximate size and can vary depending on dataset characteristics; often 128 MB will not be possible.
Then there is also auto compaction (delta.autoOptimize.autoCompact). After an individual write, Databricks checks whether files can be compacted further and, if so, runs an OPTIMIZE job (with a 128 MB target file size instead of the 1 GB used by the standard OPTIMIZE) to compact files for the partitions that have the largest number of small files.
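If you want to check whether auto compaction actually kicked in after your writes, one way is to look at the table history (the table name is just an example, and as far as I know auto compaction shows up there as an OPTIMIZE entry):
# Inspect recent operations on the table; auto compaction should appear as an OPTIMIZE entry
history = spark.sql("DESCRIBE HISTORY my_db.my_table")
history.select("version", "timestamp", "operation").show(10, truncate=False)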
These optimizations come at a cost, of course (a shuffle, for example). However, the net outcome is often positive, because you write smaller files that are still large enough for good query performance.
The increase in throughput works as follows: let's say you want to write about 1000 MB.
In a classic example, this would be written as a single 1000 MB partition, and a single partition means one task executed by one worker.
If you write that same 1000 MB as 128 MB partitions instead, you can parallelize the write across roughly 8 tasks, hence more throughput.
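If you want to see the effect on your own tables, a quick sanity check (the table name is just an example) is to compare the file count and average file size reported by DESCRIBE DETAIL before and after enabling optimized writes:
# Inspect the file layout of a Delta table (example table name)
detail = spark.sql("DESCRIBE DETAIL my_db.my_table").collect()[0]
num_files = detail["numFiles"]
avg_mb = detail["sizeInBytes"] / num_files / (1024 * 1024) if num_files else 0
print(f"{num_files} files, ~{avg_mb:.0f} MB per file on average")
# e.g. ~1000 MB written as ~128 MB files should show up as roughly 8 files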
08-03-2022 11:09 AM
@AKSHAY PALLERLA Just checking in to see if you got a solution to the issue you shared above. Let us know!
Thanks to @Werner Stinckens for jumping in, as always!
08-03-2022 11:57 AM
Hello Lindsay, Yes. @Werner Stinckens has done an excellent job of distilling a few things down for me. Thank you!