Hello Databricks Community,
I'm experiencing performance issues with the OPTIMIZE command when migrating historical data into a table with liquid clustering. Specifically, I am processing one year's worth of data at a time. For example:
- The OPTIMIZE command for the 2021 data took approximately 28 hours to complete.
- The same command for the 2020 data, with a similar data volume and on the same cluster (27 m7gd.2xlarge machines), completed within 12 hours.
The schema of the data has not changed over these years, so it's puzzling why there is such a significant difference in processing times for similar data volumes.
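For context, the per-year flow looks roughly like this; the table and staging names below are placeholders rather than our real object names:

```sql
-- Placeholder names: sales_history is the liquid-clustered target table,
-- staging_2021 holds one year of historical data to migrate.
INSERT INTO sales_history
SELECT * FROM staging_2021;

-- With liquid clustering, OPTIMIZE (without ZORDER BY) incrementally
-- clusters the newly written files according to the table's CLUSTER BY keys.
OPTIMIZE sales_history;
```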
Recently, we switched to r6g.2xlarge instances, as recommended. The OPTIMIZE command for the 2023 data has now been running for over 30 hours without completing, on a cluster of 23 nodes (r6g.2xlarge machines), processing approximately 35 billion rows and 3.3 TB of data on disk. All cluster metrics are well within limits.
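If it helps with diagnosis, the finished runs can be compared through the table history (placeholder table name again); each OPTIMIZE entry carries operation metrics such as the number of files added and removed:

```sql
-- Each completed OPTIMIZE run appears in the Delta transaction history
-- with its operationMetrics, which makes it possible to compare the
-- 2020 and 2021 runs file-for-file and byte-for-byte.
DESCRIBE HISTORY sales_history;
```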
Here are a few specifics:
- The table uses 4 clustering keys (liquid clustering).
- I verified the size of the data chunks by loading the data into a temporary table and checking its size with the DESCRIBE TABLE command (these checks are sketched below).
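Roughly, those checks look like the following; the table names are placeholders, and DESCRIBE DETAIL is shown here as well since it reports file count and on-disk size directly:

```sql
-- Size check on a temporary copy of one year of data:
-- DESCRIBE DETAIL returns numFiles and sizeInBytes for the Delta table.
DESCRIBE DETAIL sales_history_year_tmp;

-- Confirm the clustering keys on the target table.
DESCRIBE TABLE EXTENDED sales_history;
```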
Could someone help me understand why there are such discrepancies in the processing times and provide any recommendations to improve the performance?
Thank you!