How to run OPTIMIZE on a very large dataset (11 TB or more)?

Takao — Fri, 26 Jul 2024 16:13:23 GMT

Sorry for my poor English and limited Databricks skills.

At work, my boss asked me to apply liquid clustering on four columns to an 11 TB Delta Lake table with over 80 columns, and I estimated the resources and cost required to do it.

When I reported the estimate to my boss, he said the cost was too high, so he had me run the process on a cluster with the following configuration.

Cluster configuration:
- Driver: r6g.large x 1
- Workers: r6g.large x 2 (min) to 10 (max), autoscaling

As expected, the dataset is so large that OPTIMIZE has not finished even after more than five days.

Looking at the Spark UI, the job is making no progress at all, while the number of remaining tasks keeps growing and the amount of spill has rapidly increased to over 60 TB.

OPTIMIZE is supposed to leave checkpoints, so I'm thinking of convincing my boss to let me cancel it now.

In that case, what kind of cluster configuration would be desirable to run it again?

Re: How to run OPTIMIZE on a very large dataset (11 TB or more)?

jacovangelder — Sat, 27 Jul 2024 05:57:24 GMT

A couple of things:
OPTIMIZE is a very compute-intensive operation. Make sure you pick a VM that is compute optimized.
I had to look up the AWS instances, but it seems the r6g.large you're using is just a 2 vCPU, 16 GB machine. That is nowhere near sufficient to optimize an 11 TB table, and the spill you're seeing is the result. I would lower the number of workers but scale the VMs up vertically, for example to r6g.4xlarge with 1-6 workers or r6g.8xlarge with 1-3 workers.
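
To put those sizes side by side, here is a rough Python sketch that totals worker vCPUs and memory at maximum autoscale for each option. The per-instance figures are the published AWS specs for the r6g family; the `WorkerPool` helper is just a made-up name for illustration, and the driver node is left out.

```python
from dataclasses import dataclass

# Published AWS specs for the r6g family: (vCPUs, memory in GiB).
R6G_SPECS = {
    "r6g.large": (2, 16),
    "r6g.4xlarge": (16, 128),
    "r6g.8xlarge": (32, 256),
}


@dataclass(frozen=True)
class WorkerPool:
    """Hypothetical helper describing one autoscaling worker pool."""
    instance_type: str
    max_workers: int

    def totals(self) -> tuple[int, int]:
        """Return (total vCPUs, total memory in GiB) at maximum scale-out."""
        vcpus, mem_gib = R6G_SPECS[self.instance_type]
        return vcpus * self.max_workers, mem_gib * self.max_workers


pools = [
    WorkerPool("r6g.large", 10),    # current setup
    WorkerPool("r6g.4xlarge", 6),   # suggested option 1
    WorkerPool("r6g.8xlarge", 3),   # suggested option 2
]

for pool in pools:
    vcpus, mem_gib = pool.totals()
    print(f"{pool.instance_type} x {pool.max_workers}: {vcpus} vCPUs, {mem_gib} GiB")
```

Both suggested options top out at 96 vCPUs and 768 GiB across the workers, against 20 vCPUs and 160 GiB for the current setup. Keep in mind that only part of that memory is actually available to the Spark executors.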

And last but not least, set delta.targetFileSize to 1 GB. This is the recommended size for tables of ~10 TB.
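
As a minimal sketch of how that could look, the Python snippet below builds the SQL statements for setting the property, declaring the clustering columns, and running OPTIMIZE. The table name `main.sales.events` and the four column names are placeholders I made up, not taken from your post; in a Databricks notebook each statement would be passed to `spark.sql(...)`. It also does the back-of-the-envelope file count, treating 11 TB as 11 TiB of data at 1 GiB per file.

```python
def build_statements(table: str, cluster_cols: list[str],
                     target_file_size: str = "1gb") -> list[str]:
    """Build the SQL statements to run, in order, before and for OPTIMIZE."""
    cols = ", ".join(cluster_cols)
    return [
        # Table property that sets the target size of the files OPTIMIZE writes.
        f"ALTER TABLE {table} SET TBLPROPERTIES "
        f"('delta.targetFileSize' = '{target_file_size}')",
        # Declare the liquid clustering keys on the existing table.
        f"ALTER TABLE {table} CLUSTER BY ({cols})",
        # Rewrite the data files according to the settings above.
        f"OPTIMIZE {table}",
    ]


def approx_file_count(table_size_tib: float, target_file_gib: float = 1.0) -> int:
    """Rough number of output files if every file hits the target size."""
    return round(table_size_tib * 1024 / target_file_gib)


# Placeholder table and column names, for illustration only.
statements = build_statements(
    "main.sales.events",
    ["event_date", "customer_id", "region", "event_type"],
)
for stmt in statements:
    print(stmt)  # in a notebook: spark.sql(stmt)

print(approx_file_count(11))
```

With a 1 GB target, an 11 TB table ends up in the neighborhood of eleven thousand files, which is only a rough figure because real file sizes vary.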

Re: How to run OPTIMIZE on a very large dataset (11 TB or more)?

Takao — Mon, 29 Jul 2024 01:57:03 GMT

Mr. jacovangelder, thank you for your reply.

And sorry for the inconvenience caused by my description of the AWS VMs.

There is no doubt that the current r6g.large cluster is insufficient in terms of computational power and memory capacity compared with r6g.4xlarge x 1~6 or r6g.8xlarge x 1~3.

When I looked into it, as you said, OPTIMIZE seems to place a heavy load on the CPU and memory because it calculates column statistics for data skipping.

I would like to somehow convince my boss to use r6g.4xlarge x 1~6 or r6g.8xlarge x 1~3.

Your answer was very helpful. Thank you. May good things come to you for your kindness.