08-07-2023 06:53 AM
Hi all,
recently I am facing some strange behaviour after an OPTIMIZE ZORDER command. For a large table (around 400 million rows) I executed the OPTIMIZE command with ZORDER BY on 3 columns. However, it seems that the command has no effect. Looking into the table history, I also cannot find a new OPTIMIZE operation. In the history I can see that the table was optimized without ZORDER a couple of months ago, but I would expect that OPTIMIZE with ZORDER would have triggered some file rewriting. Do you have any idea? Am I overlooking something?
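For reference, a minimal sketch of the kind of commands in question; the table and column names (my_schema.large_table, col_a, col_b, col_c) are placeholders, not the real ones:
# Hypothetical table and column names, for illustration only
spark.sql("OPTIMIZE my_schema.large_table ZORDER BY (col_a, col_b, col_c)")
# Check whether a new OPTIMIZE operation shows up in the table history
spark.sql("DESCRIBE HISTORY my_schema.large_table") \
    .select("version", "timestamp", "operation", "operationMetrics") \
    .show(truncate=False)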
Thanks in advance.
08-07-2023 07:47 AM
Hi @Luu, Based on the information provided, it is difficult to determine the exact cause of your issue.
However, it is essential to note that OPTIMIZE ZORDER operation now uses Hilbert space-filling curves by default, providing better clustering characteristics than Z-order in higher dimensions.
For Delta tables using OPTIMIZE ZORDER with many columns, Hilbert curves can speed up read queries by skipping more data than Z-order.
To investigate the issue further, you can try the following (a short sketch of these checks follows the list):
• Check if the table is a Delta table and if it has many columns. OPTIMIZE ZORDER works best with Delta tables with many columns.
• Check if the table was partitioned before running the OPTIMIZE ZORDER command. OPTIMIZE ZORDER works best with partitioned tables.
• Check if there was enough data to optimize. If the table is small, OPTIMIZE ZORDER may not have much effect.
• Check the output of the OPTIMIZE ZORDER command to see if there were any errors or warnings.
If you still face issues after trying the above steps, please provide more information about the issue and the actions you followed to run the OPTIMIZE ZORDER command.
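A minimal sketch of how these checks could look in PySpark, assuming a hypothetical table name:
# Confirm the table is a Delta table and inspect partitioning, file count and size (table name is a placeholder)
spark.sql("DESCRIBE DETAIL my_schema.large_table") \
    .select("format", "partitionColumns", "numFiles", "sizeInBytes") \
    .show(truncate=False)
# Look at recent operations and their metrics, including any OPTIMIZE runs
spark.sql("DESCRIBE HISTORY my_schema.large_table") \
    .select("operation", "operationParameters", "operationMetrics") \
    .show(5, truncate=False)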
08-07-2023 08:25 AM
Thank you for the quick reply, Kaniz. I can provide more information on your questions.
Looking at the metrics, it seems that all files were skipped. It says that partitionsOptimized = 2137. However, this is the first time we are using OPTIMIZE with ZORDER. Hence, I would expect the OPTIMIZE with ZORDER command to process files. Any further ideas?
Thank you in advance
08-08-2023 01:06 AM - edited 08-08-2023 03:00 AM
Hi @Luu ,
The OPTIMIZE with ZORDER command did not process any files; all files were skipped.
• Possible reasons for this could be data skipping, Z-ordering effectiveness, and file size.
• Data skipping is enabled in the Delta table, which allows the query optimizer to skip unnecessary files based on statistics collected during data write.
• If the data in the table is already well-organized and Z-ordering is not improving data skipping, files may be ignored.
• Z-ordering is most effective for columns that have high cardinality and are commonly used in query predicates.
• If the selected columns for Z-ordering do not meet these criteria, Z-ordering may not significantly impact query performance, resulting in skipped files.
• The table has many small files (2137), totalling 18 GB.
• If the file size is already optimized and the files are not small enough to benefit from compaction, the OPTIMIZE command may have no effect.
• To investigate and improve the effectiveness of the OPTIMIZE with ZORDER command, consider the following (a short sketch follows below):
• Check the delta.dataSkippingNumIndexedCols setting and adjust it if needed.
• Review the delta.autoCompact and delta.optimizeWrite settings.
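A minimal sketch of how those settings could be inspected and adjusted; the table name is a placeholder and the values are only examples (delta.dataSkippingNumIndexedCols defaults to 32, and the auto-compaction / optimized-write options mentioned above are exposed as delta.autoOptimize.* table properties):
# Show the table properties currently set (table name is hypothetical)
spark.sql("SHOW TBLPROPERTIES my_schema.large_table").show(truncate=False)
# Example values only: make sure statistics cover the Z-ordered columns and enable optimized writes / auto compaction
spark.sql("""
    ALTER TABLE my_schema.large_table SET TBLPROPERTIES (
        'delta.dataSkippingNumIndexedCols' = '32',
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")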
08-07-2023 09:59 PM
From the metrics above, we can see that totalConsideredFiles is 2137 and totalFilesSkipped is 2137. So, all files were skipped during this OPTIMIZE.
Could you try setting this flag to false: spark.databricks.delta.optimize.preserveInsertionOrder?
To modify it at the cluster level, go to the cluster's Advanced Options and add the following under the Spark config:
spark.databricks.delta.optimize.preserveInsertionOrder false
At the notebook level, use the code below:
spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false")
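For example, a complete notebook-level sequence might look like this sketch (table and column names are placeholders):
# Allow OPTIMIZE to rewrite files without preserving insertion order
spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false")
# Re-run OPTIMIZE with ZORDER BY and inspect the returned metrics (names are hypothetical)
result = spark.sql("OPTIMIZE my_schema.large_table ZORDER BY (col_a, col_b, col_c)")
result.select("metrics.numFilesAdded", "metrics.numFilesRemoved", "metrics.totalFilesSkipped").show(truncate=False)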
08-08-2023 12:58 AM
@Tharun-Kumar thank you for the information. I was not aware of this configuration parameter.
I will check this.
Thank you
08-09-2023 08:30 AM
I have tested it with the config set: spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false"). Unfortunately, OPTIMIZE ZORDER BY still does not process anything.
...
totalConsideredFiles: 2137
totalFilesSkipped: 2137
...
08-08-2023 08:48 AM
There are several potential reasons why your OPTIMIZE ZORDER command may not have had any effect on your table:
If the data is already optimized according to the ZORDER column ordering, the OPTIMIZE ZORDER command will not have any effect. You can check the table details (for example, the file count and size reported by DESCRIBE DETAIL) to see whether the number of files or the file size changed after the optimization.
For the ZOrder optimization to be effective, the columns used for ZOrder should have high cardinality, meaning that the number of possible distinct values for the columns should be large. If the cardinality is low, then the data may not be distributed optimally across the files after the optimization.
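One quick way to gauge cardinality is an approximate distinct count per candidate ZORDER column; a sketch with placeholder table and column names:
from pyspark.sql import functions as F
# Approximate distinct counts for the Z-ordered columns (names are hypothetical)
spark.table("my_schema.large_table").select(
    F.approx_count_distinct("col_a").alias("col_a_distinct"),
    F.approx_count_distinct("col_b").alias("col_b_distinct"),
    F.approx_count_distinct("col_c").alias("col_c_distinct"),
).show()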
To check if the optimize command has failed, you can look for any error messages in the cluster logs or use the DESCRIBE HISTORY command to view the history of the table and see if there were any failures or issues with the optimization.
Make sure that you are using Databricks Runtime version 6.4 or later, as the OPTIMIZE ZORDER command was introduced in that version.
You can also try running the command again and checking the logs for any error messages or contacting Databricks support for further assistance.
I hope that helps! Let me know if you have any other questions.