07-26-2024 09:13 AM
Sorry for my poor English and limited Databricks skills.
At work, my boss asked me to apply liquid clustering on four columns of a Delta Lake table that is about 11 TB in size and has over 80 columns, so I estimated the resources and cost required to do it.
When I shared the estimate with my boss, he said the cost was too high, and he had me run the process on a cluster with the following configuration.
Cluster configuration:
- Driver: r6g.large x 1
- Worker: r6g.large, min 2 to max 10 (autoscaling)
As expected, the dataset is so large that the OPTIMIZE job has been running for more than five days without finishing.
Looking at the Spark UI, the job is barely progressing: the number of remaining tasks keeps growing, and spill has ballooned to over 60 TB.
OPTIMIZE is supposed to commit its progress incrementally (leaving checkpoints), so I'm thinking of convincing my boss to cancel it now.
In that case, what kind of cluster configuration would be desirable for rerunning it?
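For reference, the steps I ran look roughly like the sketch below; the table name and clustering columns here are placeholders, not the real ones.

```python
# Minimal sketch of what I ran (placeholder table and column names).
# Enable liquid clustering on four columns of the existing Delta table.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.big_table
    CLUSTER BY (col_a, col_b, col_c, col_d)
""")

# OPTIMIZE then rewrites the data files according to the clustering keys;
# on an 11 TB table this is the expensive step.
spark.sql("OPTIMIZE my_catalog.my_schema.big_table")
```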
Labels: Delta Lake
Accepted Solutions
07-26-2024 10:56 PM - edited 07-26-2024 10:57 PM
A couple of things:
OPTIMIZE is a very compute-intensive operation, so make sure you pick a VM that is compute optimized.
I had to look up the AWS instance types, but it seems the r6g.large you're using is just a 2-vCPU, 16 GB machine. That is nowhere near sufficient to optimize an 11 TB table, and the spill you're seeing is the result of it. I would lower the number of workers but scale the VMs up vertically, for example to r6g.4xlarge with 1-6 workers or r6g.8xlarge with 1-3 workers.
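Purely as an illustration of that sizing (the workspace URL, token, and runtime version below are placeholders, adjust them to your environment), a cluster spec along those lines could look something like this via the Clusters API:

```python
import requests

# Placeholders, not real credentials or a real workspace.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "cluster_name": "optimize-11tb-table",
    "spark_version": "14.3.x-scala2.12",            # pick your own LTS runtime
    "driver_node_type_id": "r6g.4xlarge",
    "node_type_id": "r6g.4xlarge",                   # 16 vCPU / 128 GB per worker
    "autoscale": {"min_workers": 1, "max_workers": 6},
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```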
And last but not least, set delta.targetFileSize to 1 GB. This is the recommended size for tables of ~10 TB.
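Something along these lines (with your own table name in place of the placeholder) before re-running OPTIMIZE:

```python
# Set the target file size to 1 GB as a table property, then re-run OPTIMIZE.
# Table name is a placeholder.
spark.sql("""
    ALTER TABLE my_catalog.my_schema.big_table
    SET TBLPROPERTIES ('delta.targetFileSize' = '1gb')
""")

spark.sql("OPTIMIZE my_catalog.my_schema.big_table")
```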
07-28-2024 06:57 PM
Mr. jacovangelder, thank you for your reply.
And sorry for the confusion caused by my description of the VM type on AWS.
There is no doubt that, compared with r6g.4xlarge x 1-6 or r6g.8xlarge x 1-3, my current configuration is insufficient in terms of compute and memory capacity.
When I looked into it, as you said, OPTIMIZE seems to place a heavy load on CPU and memory because it calculates column statistics for data skipping.
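As a side note from my reading (not yet verified on our table): the data-skipping statistics behaviour is governed by table properties such as delta.dataSkippingNumIndexedCols, which I plan to check before rerunning. A minimal sketch for inspecting it, with a placeholder table name:

```python
# Inspect the table's current properties, including any data-skipping
# statistics settings (table name is a placeholder).
spark.sql(
    "SHOW TBLPROPERTIES my_catalog.my_schema.big_table"
).show(truncate=False)
```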
I would like to somehow convince my boss to let me use r6g.4xlarge x 1-6 or r6g.8xlarge x 1-3.
Your answer was very helpful. Thank you, and may good things come to you for your kindness.

