โ08-07-2023 06:53 AM
Hi all,
recently I am facing a strange behaviour after an OPTIMZE ZOrder command. For a large table around (400 mio. rows) I executed the OPTIMIZE command with ZOrder for 3 columns. However, it seems that the command does not have any effect and the command does not do anything. Looking into the table history, I also can not find a new OPTIMIZE operation. In the history I can see, that the table was OPTIMIZED without ZOrder a couple of months ago. But I would expect that the OPTIMIZED with ZOrder would have triggered some file optimization. Do you have any idea? Do I overlook something?
Thanks in advance.
โ08-07-2023 08:25 AM
Thank you for the quick reply Kaniz. I can provide more information to your questions.
Looking at the metrics, it seems that all Files were skipped. It says that partitionsOptimized = 2137. However, this is the first time we are using OPTIMIZE with ZOrder. Hence, I would expect that the OPTIMIZE with ZOrder command would process files. Any further ideas?
Thank you in advance
โ08-07-2023 09:59 PM
From the metrics above, we could see that totalConsideredFiles is 2137 and totalFilesSkipped is 2137. So, all files were skipped during this Optimize.
Could you try modifying this flag - spark.databricks.delta.optimize.preserveInsertionOrder false?
For modifying it at cluster level. Go to Advanced Settings of the Cluster and Spark Options.
spark.databricks.delta.optimize.preserveInsertionOrder false
For notebook level use below code
spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", false)
โ08-08-2023 12:58 AM
@Tharun-Kumar thank you for the information. I was not aware of this configuration parameter.
I will check this.
Thank you
โ08-09-2023 08:30 AM
I have test it with setting the config: spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", false). Unfortunately, the OPTIMIZE ZORDER BY still does not process anything.
...
totalConsideredFiles: 2137
totalFilesSkipped: 2137
...
โ08-08-2023 08:48 AM
There are several potential reasons why your OPTIMIZE ZORDER command may not have had any effect on your table:
If the data is already optimized based on the ZOrder and column ordering, the OPTIMIZE ZORDER command will not have any effect. You can check the FILES section of the table details to see if the number of files or file size has changed after the optimization.
For the ZOrder optimization to be effective, the columns used for ZOrder should have high cardinality, meaning that the number of possible distinct values for the columns should be large. If the cardinality is low, then the data may not be distributed optimally across the files after the optimization.
To check if the optimize command has failed, you can look for any error messages in the cluster logs or use the DESCRIBE HISTORY command to view the history of the table and see if there were any failures or issues with the optimization.
Make sure that you are using Databricks Runtime version 6.4 or later, as the OPTIMIZE ZORDER command was introduced in that version.
You can also try running the command again and checking the logs for any error messages or contacting Databricks support for further assistance.
I hope that helps! Let me know if you have any other questions.
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโt want to miss the chance to attend and share knowledge.
If there isnโt a group near you, start one and help create a community that brings people together.
Request a New Group