cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

OPTIMZE ZOrder does not have an effect

Luu
New Contributor III

Hi all,

recently I am facing a strange behaviour after an OPTIMZE ZOrder command. For a large table around (400 mio. rows) I executed the OPTIMIZE command with ZOrder for 3 columns. However, it seems that the command does not have any effect and the command does not do anything. Looking into the table history, I also can not find a new OPTIMIZE operation. In the history I can see, that the table was OPTIMIZED without ZOrder a couple of months ago. But I would expect that the OPTIMIZED with ZOrder would have triggered some file optimization. Do you have any idea? Do I overlook something?

Thanks in advance.

 

5 REPLIES 5

Luu
New Contributor III

Thank you for the quick reply Kaniz. I can provide more information to your questions.

  • The table which I am considering is a Delta table with many columns (194 columns)
  • The table is partitioned by two columns
  • The table is large (around 400 mio rows, 2137 files, 18GB of data)
  • The output of the OPTIMIZE ZOrder is a table with path and metrics:
  • See: 
    object
    numFilesAdded: 0
    numFilesRemoved: 0
    filesAdded:
    min: null
    max: null
    avg: 0
    totalFiles: 0
    totalSize: 0
    filesRemoved:
    min: null
    max: null
    avg: 0
    totalFiles: 0
    totalSize: 0
    partitionsOptimized: 2137
    zOrderStats: null
    numBatches: 0
    totalConsideredFiles: 2137
    totalFilesSkipped: 2137
    preserveInsertionOrder: true
    numFilesSkippedToReduceWriteAmplification: 0
    numBytesSkippedToReduceWriteAmplification: 0
    startTimeMs: 1691419395184
    endTimeMs: 1691419396139
    totalClusterParallelism: 8
    totalScheduledTasks: 0
    autoCompactParallelismStats: null
    deletionVectorStats:
    numDeletionVectorsRemoved: 0
    numDeletionVectorRowsRemoved: 0
    numTableColumns: 194
    numTableColumnsWithStats: 32
    totalTaskExecutionTimeMs: 0
    skippedArchivedFiles: 0
    clusteringMetrics: null

Looking at the metrics, it seems that all Files were skipped. It says that partitionsOptimized = 2137. However, this is the first time we are using OPTIMIZE with ZOrder. Hence, I would expect that the OPTIMIZE with ZOrder command would process files. Any further ideas?

Thank you in advance

Tharun-Kumar
Databricks Employee
Databricks Employee

@Luu 

From the metrics above, we could see that totalConsideredFiles is 2137 and totalFilesSkipped is 2137. So, all files were skipped during this Optimize. 

Could you try modifying this flag - spark.databricks.delta.optimize.preserveInsertionOrder false?

For modifying it at cluster level. Go to Advanced Settings of the Cluster and Spark Options.
spark.databricks.delta.optimize.preserveInsertionOrder false

For notebook level use below code
spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", false)

 

Luu
New Contributor III

@Tharun-Kumar thank you for the information. I was not aware of this configuration parameter.

I will check this.

Thank you

Luu
New Contributor III

I have test it with setting the config: spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", false). Unfortunately, the OPTIMIZE ZORDER BY still does not process anything.

...

totalConsideredFiles: 2137
totalFilesSkipped: 2137

...

youssefmrini
Databricks Employee
Databricks Employee

There are several potential reasons why your OPTIMIZE ZORDER command may not have had any effect on your table:

  1. The existing data files may already be optimally sorted based on the ZOrder and/or column ordering.

If the data is already optimized based on the ZOrder and column ordering, the OPTIMIZE ZORDER command will not have any effect. You can check the FILES section of the table details to see if the number of files or file size has changed after the optimization.

  1. The ZOrder columns may not have sufficient cardinality.

For the ZOrder optimization to be effective, the columns used for ZOrder should have high cardinality, meaning that the number of possible distinct values for the columns should be large. If the cardinality is low, then the data may not be distributed optimally across the files after the optimization.

  1. The OPTIMIZE ZORDER command may have failed.

To check if the optimize command has failed, you can look for any error messages in the cluster logs or use the DESCRIBE HISTORY command to view the history of the table and see if there were any failures or issues with the optimization.

Make sure that you are using Databricks Runtime version 6.4 or later, as the OPTIMIZE ZORDER command was introduced in that version.

You can also try running the command again and checking the logs for any error messages or contacting Databricks support for further assistance.

I hope that helps! Let me know if you have any other questions.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group