AutoOptimize, OPTIMIZE and VACUUM commands: order and production implementation best practices

AP
New Contributor III

So Databricks gives us a great toolkit in the form of OPTIMIZE and VACUUM, but in terms of operationalizing them I am really confused about the best practice.

Should we enable "optimized writes" by setting the following at a workspace level?

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true") # for writing speed

spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true") # compressing files

OR

Should we explicitly execute the OPTIMIZE command on tables and databases at a set frequency? Also, if we enable optimized writes at the workspace level, do we still have to execute OPTIMIZE separately at the table level? Are they the same or different?

Once the decision around OPTIMIZE is settled, when should we run VACUUM? Should we run both OPTIMIZE and VACUUM in the same script? If not, what is the ideal order?


5 REPLIES

-werners-
Esteemed Contributor III

For the optimize part I think the docs do a great job:

https://docs.microsoft.com/en-us/azure/databricks/delta/optimizations/auto-optimize

Basically it is: use auto optimize, but if your data gets big, also run a manual OPTIMIZE.

For the vacuum part:

https://community.databricks.com/s/question/0D53f00001SKZVmCAP/optimize-and-vacuum-which-is-the-best...
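To make the scheduling concrete, here is a minimal sketch of a periodic maintenance job, assuming a placeholder table name and the usual order of compacting with OPTIMIZE before cleaning up with VACUUM:

# scheduled maintenance job (e.g. nightly or weekly), run per Delta table
table_name = "my_db.my_table"  # placeholder, replace with your own table

# 1. compact small files into larger ones
spark.sql(f"OPTIMIZE {table_name}")

# 2. delete data files no longer referenced by the table, using the default 7-day retention
spark.sql(f"VACUUM {table_name}")

Running VACUUM after OPTIMIZE means the small files that OPTIMIZE replaced become eligible for clean-up once they pass the retention window.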

AP
New Contributor III

Thanks Werner for sharing the link. It helped a bit, but I still don't have an intuitive picture of what exactly happens when we configure auto optimize.

Can you please walk me through the workflow that happens when we enable auto optimize?

For example, it says auto optimize increases write throughput, which is not intuitive to me because this approach adds management overhead, which feels like it should reduce throughput. So the question is: what happens, step by step, when we enable auto optimize?

-werners-
Esteemed Contributor III

Ok, no problem.

Auto optimize in fact consists of two operations. The first is optimized writes (delta.autoOptimize.optimizeWrite), which aims to write files of roughly 128 MB. This is an approximate size and can vary depending on dataset characteristics; often 128 MB will not be achievable.

The second is auto compaction (delta.autoOptimize.autoCompact). After an individual write, Databricks checks whether files can be compacted further and, if so, runs an OPTIMIZE job (targeting 128 MB files instead of the 1 GB file size used by the standard OPTIMIZE) on the partitions with the largest number of small files.
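For reference, both behaviors can also be enabled per table as Delta table properties rather than through the session-level spark.conf settings from the question; a sketch with a placeholder table name:

# enable optimized writes and auto compaction for a single table
spark.sql("""
    ALTER TABLE my_db.my_table
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")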

These optimizations of course come with a cost (an extra shuffle, for example). However, the net outcome is often positive, because you write smaller files that are still large enough for good query performance.

The increase in throughput works as follows: say you want to write about 1000 MB.

With the classic 1 GB target, this would end up as a single partition of 1000 MB, and a single partition means one task executed by one worker.

If you instead write that 1000 MB as 128 MB partitions, the write can be parallelized into roughly eight tasks (1000 MB / 128 MB), hence more throughput.
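If you want to check what file layout your writes actually produce, DESCRIBE DETAIL reports the table's current file count and total size; a small sketch, again with a placeholder table name:

# inspect the current file layout of a Delta table
detail = spark.sql("DESCRIBE DETAIL my_db.my_table").select("numFiles", "sizeInBytes").first()
print(f"files: {detail['numFiles']}, total size: {detail['sizeInBytes']} bytes")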

Anonymous
Not applicable

@AKSHAY PALLERLA Just checking in to see if you got a solution to the issue you shared above. Let us know!

Thanks to @Werner Stinckens for jumping in, as always!

AP
New Contributor III

Hello Lindsay, yes. @Werner Stinckens has done an excellent job of distilling a few things down for me. Thank you!
