08-07-2023 06:53 AM
Hi all,
recently I am facing some strange behaviour after an OPTIMIZE ZORDER command. For a large table (around 400 million rows) I executed the OPTIMIZE command with ZORDER BY on 3 columns. However, it seems that the command has no effect. Looking into the table history, I also cannot find a new OPTIMIZE operation. In the history I can see that the table was optimized without ZORDER a couple of months ago, but I would expect that OPTIMIZE with ZORDER would have triggered some file rewriting. Do you have any idea? Am I overlooking something?
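For reference, a minimal sketch of the kind of commands in question; the table and column names (my_schema.large_table, col_a, col_b, col_c) are placeholders, not the real ones:
# Hypothetical table and column names, for illustration only
spark.sql("OPTIMIZE my_schema.large_table ZORDER BY (col_a, col_b, col_c)")
# Check whether a new OPTIMIZE operation shows up in the table history
spark.sql("DESCRIBE HISTORY my_schema.large_table") \
    .select("version", "timestamp", "operation", "operationMetrics") \
    .show(truncate=False)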
Thanks in advance.
08-07-2023 07:47 AM
Hi @Luu, Based on the information provided, it is difficult to determine the exact cause of your issue.
However, it is essential to note that OPTIMIZE ZORDER operation now uses Hilbert space-filling curves by default, providing better clustering characteristics than Z-order in higher dimensions.
For Delta tables using OPTIMIZE ZORDER with many columns, Hilbert curves can speed up read queries by skipping more data than Z-order.
To investigate the issue further, you can try the following (a short sketch of these checks follows the list):
• Check if the table is a Delta table and if it has many columns. OPTIMIZE ZORDER works best with Delta tables with many columns.
• Check if the table was partitioned before running the OPTIMIZE ZORDER command. OPTIMIZE ZORDER works best with partitioned tables.
• Check if there was enough data to optimize. If the table is small, OPTIMIZE ZORDER may not have much effect.
• Check the output of the OPTIMIZE ZORDER command to see if there were any errors or warnings.
If you still face issues after trying the above steps, please provide more information about the issue and the actions you followed to run the OPTIMIZE ZORDER command.
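A minimal sketch of how these checks could look in PySpark, assuming a hypothetical table name:
# Confirm the table is a Delta table and inspect partitioning, file count and size (table name is a placeholder)
spark.sql("DESCRIBE DETAIL my_schema.large_table") \
    .select("format", "partitionColumns", "numFiles", "sizeInBytes") \
    .show(truncate=False)
# Look at recent operations and their metrics, including any OPTIMIZE runs
spark.sql("DESCRIBE HISTORY my_schema.large_table") \
    .select("operation", "operationParameters", "operationMetrics") \
    .show(5, truncate=False)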
08-07-2023 08:25 AM
Thank you for the quick reply, Kaniz. I can provide more information on your questions.
Looking at the metrics, it seems that all files were skipped. It says that partitionsOptimized = 2137. However, this is the first time we are using OPTIMIZE with ZORDER. Hence, I would expect the OPTIMIZE with ZORDER command to process files. Any further ideas?
Thank you in advance
08-08-2023 01:06 AM - edited 08-08-2023 03:00 AM
Hi @Luu ,
The OPTIMIZE with ZORDER command did not process any files; all files were skipped.
• Possible reasons for this could be data skipping, Z-ordering effectiveness, and file size.
• Data skipping is enabled in the Delta table, which allows the query optimizer to skip unnecessary files based on statistics collected during data write.
• If the data in the table is already well-organized and Z-ordering is not improving data skipping, files may be ignored.
• Z-ordering is most effective for columns that have high cardinality and are commonly used in query predicates.
• If the selected columns for Z-ordering do not meet these criteria, Z-ordering may not significantly impact query performance, resulting in skipped files.
• The table has many small files (2137), totalling 18 GB.
• If the file size is already optimized and the files are not small enough to benefit from compaction, the OPTIMIZE command may have no effect.
• To investigate and improve the effectiveness of the OPTIMIZE with ZORDER command, consider the following (a short sketch follows below):
• Check the delta.dataSkippingNumIndexedCols setting and adjust it if needed.
• Review the delta.autoCompact and delta.optimizeWrite settings.
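A minimal sketch of how those settings could be inspected and adjusted; the table name is a placeholder and the values are only examples (delta.dataSkippingNumIndexedCols defaults to 32, and the auto-compaction / optimized-write options mentioned above are exposed as delta.autoOptimize.* table properties):
# Show the table properties currently set (table name is hypothetical)
spark.sql("SHOW TBLPROPERTIES my_schema.large_table").show(truncate=False)
# Example values only: make sure statistics cover the Z-ordered columns and enable optimized writes / auto compaction
spark.sql("""
    ALTER TABLE my_schema.large_table SET TBLPROPERTIES (
        'delta.dataSkippingNumIndexedCols' = '32',
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")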
08-07-2023 09:59 PM
From the metrics above, we can see that totalConsideredFiles is 2137 and totalFilesSkipped is 2137. So, all files were skipped during this OPTIMIZE.
Could you try setting this flag to false: spark.databricks.delta.optimize.preserveInsertionOrder?
To modify it at the cluster level, go to the cluster's Advanced Options and add the following under the Spark config:
spark.databricks.delta.optimize.preserveInsertionOrder false
At the notebook level, use the code below:
spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false")
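For example, a complete notebook-level sequence might look like this sketch (table and column names are placeholders):
# Allow OPTIMIZE to rewrite files without preserving insertion order
spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false")
# Re-run OPTIMIZE with ZORDER BY and inspect the returned metrics (names are hypothetical)
result = spark.sql("OPTIMIZE my_schema.large_table ZORDER BY (col_a, col_b, col_c)")
result.select("metrics.numFilesAdded", "metrics.numFilesRemoved", "metrics.totalFilesSkipped").show(truncate=False)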
08-08-2023 12:58 AM
@Tharun-Kumar thank you for the information. I was not aware of this configuration parameter.
I will check this.
Thank you
08-09-2023 08:30 AM
I have tested it with the config set: spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", "false"). Unfortunately, OPTIMIZE ZORDER BY still does not process anything.
...
totalConsideredFiles: 2137
totalFilesSkipped: 2137
...
08-08-2023 08:48 AM
There are several potential reasons why your OPTIMIZE ZORDER command may not have had any effect on your table:
If the data is already optimized according to the ZORDER column ordering, the OPTIMIZE ZORDER command will not have any effect. You can check the table details (for example, the file count and size reported by DESCRIBE DETAIL) to see whether the number of files or the file size changed after the optimization.
For the ZOrder optimization to be effective, the columns used for ZOrder should have high cardinality, meaning that the number of possible distinct values for the columns should be large. If the cardinality is low, then the data may not be distributed optimally across the files after the optimization.
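One quick way to gauge cardinality is an approximate distinct count per candidate ZORDER column; a sketch with placeholder table and column names:
from pyspark.sql import functions as F
# Approximate distinct counts for the Z-ordered columns (names are hypothetical)
spark.table("my_schema.large_table").select(
    F.approx_count_distinct("col_a").alias("col_a_distinct"),
    F.approx_count_distinct("col_b").alias("col_b_distinct"),
    F.approx_count_distinct("col_c").alias("col_c_distinct"),
).show()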
To check if the optimize command has failed, you can look for any error messages in the cluster logs or use the DESCRIBE HISTORY command to view the history of the table and see if there were any failures or issues with the optimization.
Make sure that you are using Databricks Runtime version 6.4 or later, as the OPTIMIZE ZORDER command was introduced in that version.
You can also try running the command again and checking the logs for any error messages or contacting Databricks support for further assistance.
I hope that helps! Let me know if you have any other questions.