I am trying to run VACUUM on a Delta table that I know has millions of obsolete files.
Out of the box, VACUUM runs the deletes sequentially on the driver, which is bad news for me.
According to the OSS Delta docs, the setting spark.databricks.delta.vacuum.parallelDelete.enabled overrides that behavior and distributes the delete operations to the worker nodes. (Apparently you have to opt in because parallel deletes risk hitting DeleteObject rate limits, but I'm willing to take that risk.)
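
For reference, this is my understanding of how the setting is meant to be used, per the OSS docs (the table name and retention window below are placeholders, not my actual command):

```sql
-- Sketch: enable parallel deletes for the session, then run VACUUM.
-- The config name is from the OSS Delta docs; the table and retention are placeholders.
SET spark.databricks.delta.vacuum.parallelDelete.enabled = true;

VACUUM my_catalog.my_schema.my_table RETAIN 168 HOURS;
```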
However, there's some ambiguity: the Databricks VACUUM doc doesn't mention this setting. It asserts that the deletes happen on the driver and that the best way to speed them up is to increase the driver size. I doubt that will get the job done for me.
Furthermore, the behavior of my VACUUM operation suggests it isn't being parallelized as I intended. I ran `SET spark.databricks.delta.vacuum.parallelDelete.enabled = true; VACUUM <my table> USING INVENTORY (<query on an AWS S3 inventory>);` on a cluster with autoscaling. The cluster scaled up at first, presumably while reading the inventory, but it has now been running for over 30 minutes with only 2 workers and no active jobs. I strongly suspect it is running the deletes sequentially on the driver, despite the parallelDelete setting.
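
One thing I still want to rule out is whether the setting is actually applied in the session where VACUUM runs (a sketch of the check I have in mind; a bare SET just echoes the current value of the config):

```sql
-- Sketch: confirm the effective value in the same session that runs VACUUM.
SET spark.databricks.delta.vacuum.parallelDelete.enabled;
```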
Does Databricks actually respect the OSS parallelDelete setting?