Hello Databricks Community,
I'm experiencing performance issues with the OPTIMIZE command when migrating historical data into a table with liquid clustering. Specifically, I am processing one year's worth of data at a time. For example:
- The OPTIMIZE command for the 2021 data took approximately 28 hours to complete.
- The same command for the 2020 data, with a similar data volume and on the same cluster (27 m7gd.2xlarge machines), completed within 12 hours.
The schema of the data has not changed over these years, so it's puzzling why there is such a significant difference in processing times for similar data volumes.
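For context, the per-year flow looks roughly like this; the table and staging names below are placeholders rather than our real object names:

```sql
-- Placeholder names: sales_history is the liquid-clustered target table,
-- staging_2021 holds one year of historical data to migrate.
INSERT INTO sales_history
SELECT * FROM staging_2021;

-- With liquid clustering, OPTIMIZE (without ZORDER BY) incrementally
-- clusters the newly written files according to the table's CLUSTER BY keys.
OPTIMIZE sales_history;
```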
Recently, we switched to r6g.2xlarge instances, as recommended. The OPTIMIZE command for the 2023 data has now been running for over 30 hours without completing, on a cluster of 23 nodes (r6g.2xlarge machines), processing approximately 35 billion rows and 3.3 TB of data on disk. All cluster metrics are well within limits.
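If it helps with diagnosis, the finished runs can be compared through the table history (placeholder table name again); each OPTIMIZE entry carries operation metrics such as the number of files added and removed:

```sql
-- Each completed OPTIMIZE run appears in the Delta transaction history
-- with its operationMetrics, which makes it possible to compare the
-- 2020 and 2021 runs file-for-file and byte-for-byte.
DESCRIBE HISTORY sales_history;
```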
Here are a few specifics:
- The table uses 4 clustering keys (liquid clustering).
- I verified the size of the data chunks by loading the data into a temporary table and checking its size with the DESCRIBE TABLE command (these checks are sketched below).
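Roughly, those checks look like the following; the table names are placeholders, and DESCRIBE DETAIL is shown here as well since it reports file count and on-disk size directly:

```sql
-- Size check on a temporary copy of one year of data:
-- DESCRIBE DETAIL returns numFiles and sizeInBytes for the Delta table.
DESCRIBE DETAIL sales_history_year_tmp;

-- Confirm the clustering keys on the target table.
DESCRIBE TABLE EXTENDED sales_history;
```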
Could someone help me understand why there are such discrepancies in the processing times and provide any recommendations to improve the performance?
Thank you!