How does Databricks optimized autoscaling behave when scaling out fails (e.g., insufficient resources on the AWS side)?

Vaibhav1000
New Contributor II

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

@Vaibhav Gour, it depends on the situation:

If there are no workers available when your job starts, you get an error: the cluster cannot start, so no code can be executed. But that is not an autoscaling issue.

If the cluster needs to scale up but cannot (because of a CPU quota, for example), the Spark program keeps running; the data is simply distributed over fewer workers than requested.

I have hit this a few times when I launched too many jobs at once and exceeded my CPU quota on Azure. All my jobs still finished without errors. Slower than intended, yes, but they finished.

Of course, the job can still fail (timeout, ...) when you need a lot of workers and the number you actually get is far too low (for example, you need 20 workers but only get 1).

But Databricks is pretty fault tolerant in that respect. I did not even notice I was hitting the quota until a sysadmin told me he was getting warnings from Azure that the CPU quota had been exceeded.

I do not know whether the behavior is identical on AWS (I use Azure, as mentioned above), but I assume the same rules apply there.
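To make the scale-up behavior described above concrete, here is a minimal sketch of a job cluster spec with autoscaling enabled. The autoscale min_workers / max_workers fields follow the standard Databricks Clusters API; the runtime version and AWS instance type are placeholder assumptions, not values from this thread.

import json

# Hypothetical job-cluster spec; only the autoscale block matters for this discussion.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",  # placeholder Databricks runtime
    "node_type_id": "i3.xlarge",          # placeholder AWS instance type
    "autoscale": {
        # Per the answer above: if no workers at all are available at start-up,
        # the cluster fails to launch and the job errors out.
        "min_workers": 2,
        # A shortfall between the requested maximum and what the cloud can
        # actually supply does not fail the run; the job just executes on
        # fewer workers than asked for.
        "max_workers": 20,
    },
}

print(json.dumps(new_cluster, indent=2))

In other words, per the behavior described in the answer, min_workers is the hard floor; capacity shortfalls between min_workers and max_workers degrade throughput rather than fail the run.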


4 REPLIES

Kaniz
Community Manager

Hi @Vaibhav1000! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise, I will get back to you soon. Thanks.

Vaibhav1000
New Contributor II

Thanks @Kaniz Fatma for the support.

-werners-
Esteemed Contributor III

(See the accepted solution above.)

Kaniz
Community Manager

Hi @Vaibhav Gour, just a friendly follow-up. Do you still need help, or did @Werner Stinckens' response help you find the solution? Please let us know.
