Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

OPTIMIZE ZORDER does not have an effect

Luu
New Contributor III

Hi all,

recently I have been facing a strange behaviour with an OPTIMIZE ZORDER command. For a large table (around 400 million rows) I executed the OPTIMIZE command with ZORDER BY on 3 columns. However, the command does not seem to have any effect. Looking into the table history, I also cannot find a new OPTIMIZE operation. In the history I can see that the table was OPTIMIZEd without ZORDER a couple of months ago, but I would expect that OPTIMIZE with ZORDER would have triggered some file rewriting. Do you have any idea? Am I overlooking something?

Thanks in advance.

 

7 REPLIES

Kaniz_Fatma
Community Manager

Hi @Luu, based on the information provided it is not possible to determine the cause of your issue.

However, it is worth noting that the OPTIMIZE ZORDER operation now uses Hilbert space-filling curves by default, providing better clustering characteristics than Z-order in higher dimensions.

For Delta tables using OPTIMIZE ZORDER with many columns, Hilbert curves can speed up read queries by skipping more data than Z-order. 
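To make the clustering idea concrete, here is a minimal, purely illustrative sketch of a classic Morton (Z-order) key, which interleaves the bits of two column values. This is not Databricks' actual implementation (which, as noted above, now defaults to Hilbert curves); the function name and bit width are our own:

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into a Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return key

# Sorting rows by this key keeps rows that are close in BOTH columns
# close together in the file layout, which is what enables data skipping.
print(z_order_key(2, 3))  # -> 14 (binary 1110)
```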

To investigate the issue further, you can try the following:
• Check if the table is a Delta table and if it has many columns. OPTIMIZE ZORDER works best with Delta tables with many columns.
• Check if the table was partitioned before running the OPTIMIZE ZORDER command. OPTIMIZE ZORDER works best with partitioned tables.
• Check if there was enough data to optimize. If the table is small, OPTIMIZE ZORDER may not have much effect.
• Check the output of the OPTIMIZE ZORDER command to see if there were any errors or warnings.

If you still face issues after trying the above steps, please provide more information about the issue and the actions you followed to run the OPTIMIZE ZORDER command.

Luu
New Contributor III

Thank you for the quick reply Kaniz. I can provide more information to your questions.

  • The table which I am considering is a Delta table with many columns (194 columns)
  • The table is partitioned by two columns
  • The table is large (around 400 million rows, 2137 files, 18 GB of data)
  • The output of the OPTIMIZE ZOrder is a table with path and metrics:
  • See: 
    object
    numFilesAdded: 0
    numFilesRemoved: 0
    filesAdded:
    min: null
    max: null
    avg: 0
    totalFiles: 0
    totalSize: 0
    filesRemoved:
    min: null
    max: null
    avg: 0
    totalFiles: 0
    totalSize: 0
    partitionsOptimized: 2137
    zOrderStats: null
    numBatches: 0
    totalConsideredFiles: 2137
    totalFilesSkipped: 2137
    preserveInsertionOrder: true
    numFilesSkippedToReduceWriteAmplification: 0
    numBytesSkippedToReduceWriteAmplification: 0
    startTimeMs: 1691419395184
    endTimeMs: 1691419396139
    totalClusterParallelism: 8
    totalScheduledTasks: 0
    autoCompactParallelismStats: null
    deletionVectorStats:
    numDeletionVectorsRemoved: 0
    numDeletionVectorRowsRemoved: 0
    numTableColumns: 194
    numTableColumnsWithStats: 32
    totalTaskExecutionTimeMs: 0
    skippedArchivedFiles: 0
    clusteringMetrics: null

Looking at the metrics, it seems that all files were skipped, even though partitionsOptimized = 2137. However, this is the first time we are using OPTIMIZE with ZORDER, so I would expect the command to process files. Any further ideas?
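The relevant fields in that output can also be checked programmatically. A minimal sketch (the field names match the metrics above; the helper function is hypothetical):

```python
# Hypothetical helper for reading the OPTIMIZE metrics shown above;
# the field names match the command's output, the function is ours.
def optimize_did_work(metrics: dict) -> bool:
    """True only if the command actually rewrote files."""
    return metrics["numFilesAdded"] > 0 or metrics["numFilesRemoved"] > 0

metrics = {
    "numFilesAdded": 0,
    "numFilesRemoved": 0,
    "totalConsideredFiles": 2137,
    "totalFilesSkipped": 2137,
}

assert not optimize_did_work(metrics)
# Every considered file was skipped, i.e. nothing was compacted or Z-ordered:
assert metrics["totalFilesSkipped"] == metrics["totalConsideredFiles"]
```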

Thank you in advance

Kaniz_Fatma
Community Manager
Community Manager

Hi @Luu,

The OPTIMIZE with ZORDER command did not process any files; all files were skipped.
• Possible reasons for this include data skipping, Z-ordering effectiveness, and file size.
• Data skipping is enabled in the Delta table, which allows the query optimizer to skip unnecessary files based on statistics collected during data writes.
• If the data in the table is already well organized and Z-ordering would not improve data skipping, files may be ignored.
• Z-ordering is most effective for columns with high cardinality that are commonly used in query predicates.
• If the selected columns for Z-ordering do not meet these criteria, Z-ordering may not significantly impact query performance, resulting in skipped files.
• The table has many small files (2137), totalling 18 GB.
• If the file size is already optimized and the files are not small enough to benefit from compaction, the OPTIMIZE command may have no effect.
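As a quick arithmetic check on the file-size point: from the numbers reported in the thread (2137 files, 18 GB), the average file is far below the roughly 1 GB file size that OPTIMIZE targets by default, so the files would normally be considered small enough to compact:

```python
# Rough arithmetic from the numbers reported in the thread:
total_bytes = 18 * 1024**3   # 18 GB
num_files = 2137
avg_mb = total_bytes / num_files / 1024**2
print(f"average file size: {avg_mb:.1f} MB")  # about 8.6 MB
```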


To investigate and improve the effectiveness of the OPTIMIZE with ZORDER command, consider the following:

 - Verify Z-Ordering Columns: Check if the selected columns have high cardinality and are commonly used in query predicates. Select different columns if necessary.
 - Review Data Skipping: Ensure proper configuration and accuracy of statistics collected during data writing. Adjust the value of delta.dataSkippingNumIndexedCols if needed.
 - File Size Optimization: Adjust the target file size or use auto-tuning based on table size by configuring delta.autoCompact and delta.optimizeWrite settings.


Sources:
 - [Docs: data-skipping](https://docs.databricks.com/delta/data-skipping.html)
 - [Docs: optimize](https://docs.databricks.com/delta/optimize.html)
 - [Docs: best-practices](https://docs.databricks.com/lakehouse-architecture/performance-efficiency/best-practices.html)

Tharun-Kumar
Honored Contributor II

@Luu 

From the metrics above, we could see that totalConsideredFiles is 2137 and totalFilesSkipped is 2137. So, all files were skipped during this Optimize. 

Could you try modifying this flag - spark.databricks.delta.optimize.preserveInsertionOrder false?

To modify it at the cluster level, go to the cluster's Advanced Options and add the following under Spark config:
spark.databricks.delta.optimize.preserveInsertionOrder false

At the notebook level, use the following code:
spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", False)

 

Luu
New Contributor III

@Tharun-Kumar thank you for the information. I was not aware of this configuration parameter.

I will check this.

Thank you

Luu
New Contributor III

I have tested it with the config set: spark.conf.set("spark.databricks.delta.optimize.preserveInsertionOrder", False). Unfortunately, OPTIMIZE ZORDER BY still does not process anything.

...

totalConsideredFiles: 2137
totalFilesSkipped: 2137

...

youssefmrini
Honored Contributor III

There are several potential reasons why your OPTIMIZE ZORDER command may not have had any effect on your table:

  1. The existing data files may already be optimally sorted based on the ZOrder and/or column ordering.

If the data is already optimized based on the ZOrder and column ordering, the OPTIMIZE ZORDER command will not have any effect. You can check the FILES section of the table details to see if the number of files or file size has changed after the optimization.

  2. The ZOrder columns may not have sufficient cardinality.

For the ZOrder optimization to be effective, the columns used for ZOrder should have high cardinality, meaning that the number of possible distinct values for the columns should be large. If the cardinality is low, then the data may not be distributed optimally across the files after the optimization.

  3. The OPTIMIZE ZORDER command may have failed.

To check if the optimize command has failed, you can look for any error messages in the cluster logs or use the DESCRIBE HISTORY command to view the history of the table and see if there were any failures or issues with the optimization.
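On the cardinality point above, a rough rule of thumb can be sketched in code. The threshold is arbitrary and purely illustrative; in practice you could estimate distinct counts with Spark's approx_count_distinct before choosing ZORDER columns:

```python
# Illustrative rule of thumb (threshold is arbitrary): a column with few
# distinct values relative to the row count is a poor Z-order candidate.
def low_cardinality(distinct_count: int, row_count: int,
                    min_ratio: float = 0.001) -> bool:
    return distinct_count / row_count < min_ratio

row_count = 400_000_000  # roughly the table size from this thread
print(low_cardinality(50, row_count))         # True  -> e.g. a status flag
print(low_cardinality(5_000_000, row_count))  # False -> e.g. a customer id
```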

Make sure that you are using Databricks Runtime version 6.4 or later, as the OPTIMIZE ZORDER command was introduced in that version.

You can also try running the command again and checking the logs for any error messages or contacting Databricks support for further assistance.

I hope that helps! Let me know if you have any other questions.
