Databricks Community

felix_counter · ‎06-07-2024

Hello,

I have a Structured Streaming job that writes every 5 minutes into a table with liquid clustering enabled. After migrating from DBR 13.3 LTS to DBR 14.3 LTS, I observe that the table is now regularly optimized, even though I have not set the "spark.databricks.delta.autoCompact.enabled" option. Furthermore, for the auto optimize step, the "operationParameters" column of the table history shows "clusterBy":"[]", although I have specified cluster columns on this table. Does that mean the table has been (auto) optimized without triggering clustering?

Is there an option to prevent DBR 14.3 LTS from auto optimizing (liquid clustered) tables? I would prefer running OPTIMIZE (and thus triggering clustering) on a regular basis manually.

I tried setting "spark.databricks.delta.autoCompact.enabled" to "false", but that did not prevent regular auto-optimization using DBR 14.3 LTS.

thanks a lot for your help!

raphaelblg · ‎06-07-2024

Hello @felix_counter ,

It seems you're referring to Predictive optimization for Delta Lake, a relatively new feature.

In contrast to Optimized writes for Delta Lake on Databricks (basically `spark.databricks.delta.autoCompact.enabled` and `spark.databricks.delta.optimizeWrite.enabled`), which apply optimizations during task write time and are not recorded in the Delta Logs, predictive optimization initiates individual OPTIMIZE operations that are indeed logged. However, it's worth noting that predictive optimization does not run OPTIMIZE operations on tables using liquid clustering or Z-order.

Are you sure that your table uses liquid clustering? If so can you please provide the results of the following statements?

DESCRIBE FORMATTED your_table_name

SHOW CREATE TABLE your_table_name
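
Once you have the `SHOW CREATE TABLE` output, a quick way to check it is to look for a `CLUSTER BY (...)` clause. The snippet below is only a rough sketch in plain Python: the helper name `clustering_columns` and the sample DDL are made up for illustration, and it assumes the clause appears in the DDL text in the form shown.

```python
import re


def clustering_columns(ddl: str) -> list[str]:
    """Return the columns listed in a CLUSTER BY (...) clause, or [] if there is none."""
    # Requires whitespace between CLUSTER and BY, so bucketing's CLUSTERED BY does not match.
    match = re.search(r"\bCLUSTER\s+BY\s*\(([^)]*)\)", ddl, flags=re.IGNORECASE)
    if match is None:
        return []
    return [col.strip().strip("`") for col in match.group(1).split(",") if col.strip()]


# Hypothetical DDL for illustration only; not taken from any real table.
sample_ddl = """
CREATE TABLE spark_catalog.default.events (
  event_id BIGINT,
  event_ts TIMESTAMP)
USING delta
CLUSTER BY (event_id, event_ts)
"""

print(clustering_columns(sample_ddl))
```

An empty list would suggest that the table definition carries no clustering columns.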

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

felix_counter · ‎06-09-2024

Dear @raphaelblg,

thank you for your message. From my understanding, predictive optimization requires Unity Catalog to be enabled, which is not the case for the respective workspace.

The history entry of the auto OPTIMIZE looks like this:

As you can see, the optimization has been triggered automatically, and no cluster columns are listed.

In contrast, when I manually trigger OPTIMIZE, the history entry looks like this:

As you can see, when triggered manually, OPTIMIZE clusters the table as expected, indicated by the columns listed under "clusterBy".

Regarding the spark config: "spark.databricks.delta.autoCompact.enabled" for the respective job has not been set (and hence is false). However, I set "spark.databricks.delta.optimizeWrite.enabled" to "true" to reduce the number of small files written. To my understanding, this is an on-write operation, and should not trigger auto OPTIMIZE.

Again, I can clearly correlate the onset of the auto OPTIMIZE entries in the table history with our switch from DBR 13.3 LTS to DBR 14.3 LTS.

As per your request, I verified that the table has indeed liquid clustering activated:

Best regards!

raphaelblg · ‎06-10-2024

Hello @felix_counter ,

Thank you for providing the details.

~~The "auto OPTIMIZE" you're referring to is not automatically triggered by Databricks~~. ~~Instead, it seems to be initiated by an external service.~~ The only features inside Databricks that trigger OPTIMIZE operations at the current moment are Predictive optimization for Delta Lake, Maintenance tasks performed by Delta Live Tables, and Auto compaction for Delta Lake on Databricks.

You are correct in your understanding that the 'spark.databricks.delta.optimizeWrite.enabled' setting performs an on-write operation, which does not register as an OPTIMIZE operation in the history.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

felix_counter · ‎06-10-2024

Dear @raphaelblg,

thanks a lot for your message. Do you have any idea which external service could trigger OPTIMIZE and register in the Delta history as "auto"="true"?

The only thing I can definitely correlate with the onset of the auto OPTIMIZE is the change from DBR 13.3 LTS to DBR 14.3 LTS. Is there a new feature / behavior with respect to optimization of clustered tables that can cause this behavior?

Thanks and best regards!

raphaelblg · ‎06-10-2024

Hello @felix_counter ,

I conducted further research and discovered that I made a mistake. I have determined that the trace with "auto": "true" in the Delta History is actually left by auto compaction. Apologies for the confusion.

We were correct that optimized writes are indeed write-only tasks, but auto-compaction will leave this trace in the Delta history and I wasn't aware of that.
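
As a rough illustration, history rows could be told apart along these lines. This is only a sketch in plain Python: it assumes each row has been converted to a dict with `operation` and `operationParameters` keys, and that `auto` and `clusterBy` arrive as strings, as in the history entries discussed in this thread. The function name `classify_optimize` and the sample rows are made up.

```python
import json


def classify_optimize(entry: dict) -> str:
    """Label one table-history row by how its OPTIMIZE (if any) was triggered."""
    if entry.get("operation") != "OPTIMIZE":
        return "other"
    params = entry.get("operationParameters") or {}
    # Auto compaction marks its commits with "auto": "true".
    if params.get("auto", "false").lower() == "true":
        return "auto compaction"
    # clusterBy is a JSON-encoded list of column names, e.g. '["event_id"]' or '[]'.
    cluster_by = json.loads(params.get("clusterBy", "[]"))
    return "manual OPTIMIZE, clustered" if cluster_by else "manual OPTIMIZE, not clustered"


# Hypothetical rows for illustration only; not copied from a real table history.
sample_rows = [
    {"version": 12, "operation": "STREAMING UPDATE", "operationParameters": {}},
    {"version": 13, "operation": "OPTIMIZE",
     "operationParameters": {"auto": "true", "clusterBy": "[]"}},
    {"version": 14, "operation": "OPTIMIZE",
     "operationParameters": {"auto": "false", "clusterBy": '["event_id"]'}},
]

for row in sample_rows:
    print(row["version"], classify_optimize(row))
```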

Even though you haven't explicitly enabled auto compaction, there's a chance you've faced this condition:

"In Databricks Runtime 10.4 LTS and above, auto compaction and optimized writes are always enabled for MERGE, UPDATE, and DELETE operations. You cannot disable this functionality."

Source: https://docs.databricks.com/en/delta/tune-file-size.html#configure-delta-lake-to-control-data-file-s...

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks

felix_counter · ‎06-10-2024

Dear @raphaelblg,

thanks for your response. I'm not sure this is what I observe, for the following two reasons:

- the auto optimize is triggered every ~3 hrs or so, not after every MERGE

- the auto optimize only appears when using DBR 14.3 LTS. Specifically, it does not appear when I run the exact same code under DBR 13.3 LTS.
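
For reference, the cadence can be checked from the table history roughly as follows. This is a minimal sketch in plain Python, assuming the history rows are available as dicts with `timestamp`, `operation`, and `operationParameters` keys. The function name `auto_optimize_gaps` and the sample rows are made up for illustration and are not my actual history.

```python
from datetime import datetime, timedelta


def auto_optimize_gaps(history: list[dict]) -> list[timedelta]:
    """Return the time gaps between consecutive OPTIMIZE commits flagged auto=true."""
    stamps = sorted(
        datetime.fromisoformat(row["timestamp"])
        for row in history
        if row["operation"] == "OPTIMIZE"
        and (row.get("operationParameters") or {}).get("auto") == "true"
    )
    # Pair each commit with the next one and take the difference.
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


# Hypothetical rows for illustration only.
sample_history = [
    {"timestamp": "2024-06-10 00:05:00", "operation": "MERGE", "operationParameters": {}},
    {"timestamp": "2024-06-10 00:06:10", "operation": "OPTIMIZE",
     "operationParameters": {"auto": "true"}},
    {"timestamp": "2024-06-10 00:10:00", "operation": "MERGE", "operationParameters": {}},
    {"timestamp": "2024-06-10 03:02:40", "operation": "OPTIMIZE",
     "operationParameters": {"auto": "true"}},
    {"timestamp": "2024-06-10 06:01:15", "operation": "OPTIMIZE",
     "operationParameters": {"auto": "true"}},
]

for gap in auto_optimize_gaps(sample_history):
    print(gap)
```

Gaps of several hours between the flagged commits, while MERGE commits land every few minutes, would match the pattern described in the first point.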

Again, in principle I do not mind the auto optimization at all. The only thing that catches my attention is that the auto optimize does not cluster (see screenshot above).

Is there a reason why an auto optimization step on a liquid clustering table should not use the cluster information? And why does the auto optimization only occur on DBR 14.3 LTS, but not on DBR 13.3 LTS?

Thanks and best!

raphaelblg · ‎06-11-2024

@felix_counter,

The auto compaction algorithm has changed between DBR versions, especially the older ones (10.4, 11.3), but I don't know the specifics of what changed.

My assumption is that since auto-compaction is a different operation than a clustered write, it won't use the clusterBy property. Unfortunately I don't have more details.

Best regards,

Raphael Balogo
Sr. Technical Solutions Engineer
Databricks