Databricks Community

dnz · 07-31-2024

Hello Databricks Community,

I’m experiencing performance issues with the OPTIMIZE command when migrating historical data into a table with liquid clustering. Specifically, I am processing one year’s worth of data at a time. For example:

The OPTIMIZE command for the 2021 data took approximately 28 hours to complete.
The same command for 2020, with similar data volume on the same cluster (27 m7gd.2xlarge machines), completed within 12 hours.

The schema of the data has not changed over these years, so it’s puzzling why there is such a significant difference in processing times for similar data volumes.

Recently, we switched to r6g.2xlarge instances as per recommendations. Currently, the OPTIMIZE command for the 2023 data has been running for over 30 hours without completion. This is on a cluster with 23 nodes (r6g.2xlarge machines), processing approximately 35 billion rows and 3.3 TB of data on disk. All the cluster metrics are well within limits.
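To put the 2023 run in perspective, the figures quoted above can be turned into a rough throughput estimate. This is only a back-of-the-envelope sketch: it assumes decimal terabytes, treats the 30 hours as a lower bound on runtime (so the results are upper bounds on throughput), and ignores that OPTIMIZE with liquid clustering may read, shuffle, and rewrite more bytes than the on-disk size. The function name `throughput` and the constants are illustrative, not part of any Databricks API.

```python
# Figures quoted in the post for the 2023 OPTIMIZE run. The job was still
# running at 30 hours, so the computed throughput values are upper bounds.
NODES = 23
ROWS = 35e9
SIZE_TB = 3.3          # on-disk size; decimal terabytes assumed
ELAPSED_HOURS = 30


def throughput(size_tb, rows, hours, nodes):
    """Return cluster-wide and per-node throughput for a run of this size."""
    seconds = hours * 3600
    size_mb = size_tb * 1_000_000  # 1 TB = 1,000,000 MB (decimal)
    cluster_mb_s = size_mb / seconds
    return {
        "cluster_mb_per_s": cluster_mb_s,
        "node_mb_per_s": cluster_mb_s / nodes,
        "rows_per_s": rows / seconds,
    }


stats = throughput(SIZE_TB, ROWS, ELAPSED_HOURS, NODES)
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

Under these assumptions the cluster is handling at most about 30 MB/s in total, or roughly 1.3 MB/s per node, which is far below what the instances' disks and network can sustain. That is consistent with the observation that cluster metrics look idle and suggests the time is going somewhere other than raw I/O.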

Here are a few specifics:

The table is clustered on 4 keys.
I verified the size of the data chunks by loading data into a temporary table and checking the size using the DESCRIBE TABLE command.

Could someone help me understand why there are such discrepancies in the processing times and provide any recommendations to improve the performance?

Thank you!

HimanshuSingh · 07-23-2025

Did you find a solution? If so, please post it.

Performance Issue with OPTIMIZE Command for Historical Data Migration Using Liquid Clustering