Performance Issue with OPTIMIZE Command for Historical Data Migration Using Liquid Clustering

dnz — Wed, 31 Jul 2024 15:15:58 GMT

Hello Databricks Community,

I’m experiencing performance issues with the OPTIMIZE command when migrating historical data into a table with liquid clustering. Specifically, I am processing one year’s worth of data at a time. For example:

The OPTIMIZE command for the 2021 data took approximately 28 hours to complete.
The same command for 2020, with similar data volume on the same cluster (27 m7gd.2xlarge machines), completed within 12 hours.

The schema of the data has not changed over these years, so it’s puzzling why there is such a significant difference in processing times for similar data volumes.

Recently, we switched to r6g.2xlarge instances as per recommendations. Currently, the OPTIMIZE command for the 2023 data has been running for over 30 hours without completion. This is on a cluster with 23 nodes (r6g.2xlarge machines), processing approximately 35 billion rows and 3.3 TB of data on disk. All the cluster metrics are well within limits.

Here are a few specifics:

The cluster has 4 keys.
I verified the size of the data chunks by loading data into a temporary table and checking the size using the DESCRIBE TABLE command.

Could someone help me understand why there are such discrepancies in the processing times and provide any recommendations to improve the performance?

Thank you!

Re: Performance Issue with OPTIMIZE Command for Historical Data Migration Using Liquid Clustering

HimanshuSingh — Wed, 23 Jul 2025 12:18:54 GMT

Did you got any solution? If Yes please post it.

topic Re: Performance Issue with OPTIMIZE Command for Historical Data Migration Using Liquid Clustering in Data Engineering

Performance Issue with OPTIMIZE Command for Historical Data Migration Using Liquid Clustering

Re: Performance Issue with OPTIMIZE Command for Historical Data Migration Using Liquid Clustering