Sorry for my poor English and limited Databricks skills.
At work, my boss asked me to enable liquid clustering on four columns of an 11 TB Delta Lake table with over 80 columns, so I estimated the resources and costs required to implement it.
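For reference, the operation in question is essentially the following (a minimal PySpark sketch; the table and column names are placeholders, not the real ones):

```python
# Minimal sketch of the liquid clustering operation (placeholder names).
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Declare the four clustering columns on the existing Delta table.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.big_table
    CLUSTER BY (col_a, col_b, col_c, col_d)
""")

# Physically rewrite the data so it is clustered by those columns.
# On an 11 TB table, this is the expensive step.
spark.sql("OPTIMIZE my_catalog.my_schema.big_table")
```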
When I showed him the estimate, I was told the cost was too high, so he had me run the process on a cluster with the following configuration.
Cluster configuration (a sketch of this spec is shown below):
- Driver: r6g.large x 1
- Worker: r6g.large x 2 (min) to 10 (max), with autoscaling
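For completeness, this is roughly that spec expressed as a Databricks Clusters REST API request (a sketch under my assumptions: the workspace host, token, cluster name, and Databricks Runtime version are placeholders I made up):

```python
# Sketch of the cluster spec as a Clusters API request (placeholders marked).
import requests

cluster_spec = {
    "cluster_name": "optimize-liquid-clustering",  # placeholder name
    "spark_version": "14.3.x-scala2.12",           # placeholder DBR version
    "driver_node_type_id": "r6g.large",            # one driver node
    "node_type_id": "r6g.large",                   # worker instance type
    "autoscale": {"min_workers": 2, "max_workers": 10},
}

resp = requests.post(
    "https://<workspace-host>/api/2.0/clusters/create",  # placeholder host
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json())  # returns the new cluster_id on success
```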
As you might expect with a dataset this large, the OPTIMIZE job has now been running for more than five days without finishing.
Looking at the Spark UI, the job is barely progressing: the backlog of remaining tasks keeps growing, and disk spill has rapidly climbed past 60 TB.
Since OPTIMIZE commits its progress incrementally (batches of files that have already been compacted stay committed even if the job is cancelled), I'm thinking of convincing my boss to cancel it now.
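Before cancelling, I plan to check the Delta transaction log to confirm how much has actually been committed, along these lines (the table name is again a placeholder):

```python
# Inspect the table history for OPTIMIZE commits to see how much work
# has already been persisted before cancelling (placeholder table name).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

history = spark.sql("DESCRIBE HISTORY my_catalog.my_schema.big_table")

(history
    .filter(F.col("operation") == "OPTIMIZE")
    .select("version", "timestamp", "operationMetrics")
    .show(truncate=False))
```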
If we do cancel it, what kind of cluster configuration would you recommend for running it again?