cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Performance Issue with OPTIMIZE Command for Historical Data Migration Using Liquid Clustering

dnz
New Contributor

Hello Databricks Community,

Iโ€™m experiencing performance issues with the OPTIMIZE command when migrating historical data into a table with liquid clustering. Specifically, I am processing one yearโ€™s worth of data at a time. For example:

  • The OPTIMIZE command for the 2021 data took approximately 28 hours to complete.
  • The same command for 2020, with similar data volume on the same cluster (27 m7gd.2xlarge machines), completed within 12 hours.

The schema of the data has not changed over these years, so itโ€™s puzzling why there is such a significant difference in processing times for similar data volumes.

Recently, we switched to r6g.2xlarge instances as per recommendations. Currently, the OPTIMIZE command for the 2023 data has been running for over 30 hours without completion. This is on a cluster with 23 nodes (r6g.2xlarge machines), processing approximately 35 billion rows and 3.3 TB of data on disk. All the cluster metrics are well within limits.

Here are a few specifics:

  • The cluster has 4 keys.
  • I verified the size of the data chunks by loading data into a temporary table and checking the size using the DESCRIBE TABLE command.

Could someone help me understand why there are such discrepancies in the processing times and provide any recommendations to improve the performance? 

Thank you!

1 REPLY 1

Kaniz_Fatma
Community Manager
Community Manager

Hi @dnz, Could you please ensure that the size of your data files is appropriate? Databricks recommends configuring the maximum file size for optimization using the spark.databricks.delta.optimize.maxFileSize setting. Adjusting this can help manage the number of files processed during the OPTIMIZE command, potentially speeding up the operation.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group