
How can I run OPTIMIZE on a very large (11TB+) dataset?

Takao
New Contributor II

Sorry for my very poor English and limited Databricks skills.

At work, my boss asked me to apply liquid clustering on four columns of a Delta Lake table that is about 11TB in size and has over 80 columns, and I estimated the resources and costs required to implement it.
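In case the details help, the clustering setup is essentially the following, run from a notebook where spark is already defined. The table and column names here are placeholders, not the real ones:

# Placeholder names for the real 11TB table and its four clustering columns.
table = "main.analytics.big_table"

# Declare liquid clustering on the four requested columns...
spark.sql(f"ALTER TABLE {table} CLUSTER BY (col_a, col_b, col_c, col_d)")

# ...and then trigger the actual (re)clustering work.
spark.sql(f"OPTIMIZE {table}")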

When I conveyed the results of that estimate to my boss, I was told that the cost was too high, so he had me run the process on a cluster with the following configuration.

Cluster configuration:
- Driver: r6g.large x 1
- Worker: r6g.large, min 2 to max 10 (auto-scaling)
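Expressed as a Databricks Clusters API (clusters/create) payload, that configuration is roughly the following; the cluster name and runtime version are illustrative, not the real values:

# Rough sketch of the cluster spec above as a clusters/create payload.
cluster_spec = {
    "cluster_name": "optimize-11tb-table",              # illustrative name
    "spark_version": "14.3.x-scala2.12",                # assumed LTS runtime
    "driver_node_type_id": "r6g.large",                 # driver: 2 vCPU / 16GB
    "node_type_id": "r6g.large",                        # workers: same type
    "autoscale": {"min_workers": 2, "max_workers": 10}, # auto-scaling 2-10
}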

As expected, the dataset is so large that the OPTIMIZE job has been running for more than five days without finishing.

Looking at the Spark UI, the job is not progressing at all, the number of remaining tasks is not going down, and the spill has rapidly grown to over 60TB.

OPTIMIZE is supposed to commit its progress incrementally (leaving checkpoints), so I'm thinking of convincing my boss to cancel it now.

In that case, what kind of cluster configuration would be desirable to run it again?


2 REPLIES

jacovangelder
Honored Contributor

A couple of things:
OPTIMIZE is a very compute-intensive operation, so make sure you pick a VM type that is compute-optimized.
I had to look up the AWS instance types, but it seems the r6g.large you're using is just a 2 CPU / 16GB machine. That is by far not sufficient to OPTIMIZE an 11TB table, and the spill you're getting is the result of this. I would lower the number of workers but scale the VMs up vertically, for example to a r6g.4xlarge with 1-6 workers or a r6g.8xlarge with 1-3 workers.

And last but not least, set delta.targetFileSize to 1GB. This is the recommended target file size for tables of ~10TB.
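For example, something along these lines should do it from a notebook, where spark is the session Databricks provides and the table name is just a placeholder for your real table:

# Placeholder name; substitute your actual catalog.schema.table.
table = "main.analytics.big_table"

# Raise the target file size so OPTIMIZE compacts toward ~1GB files,
# then kick off the OPTIMIZE run again on the larger cluster.
spark.sql(f"ALTER TABLE {table} SET TBLPROPERTIES ('delta.targetFileSize' = '1gb')")
spark.sql(f"OPTIMIZE {table}")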

Takao
New Contributor II

Mr. jacovangelder, thank you for your reply.

And sorry for the inconvenience caused by my description of the VMs in AWS.

There is no doubt that, compared with r6g.4xlarge x 1-6 or r6g.8xlarge x 1-3, my current r6g.large cluster is insufficient in terms of compute and memory capacity.

When I looked into it, as you said, OPTIMIZE seems to place a heavy load on CPU and memory because it computes per-column statistics for data skipping.

I would like to somehow convince my boss to let me use r6g.4xlarge x 1-6 or r6g.8xlarge x 1-3.

Your answer was very helpful, thank you. May good things be with you for your kindness.
