Data Engineering

How does Databricks optimized autoscaling behave when scaling out fails (e.g., insufficient resources on the AWS side)?

Vaibhav1000
New Contributor II
1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

@Vaibhav Gour, it really depends on the case:

If no workers are available when your job starts, you get an error: the cluster cannot start, so no code can be executed. But that is not an autoscaling issue.

If the cluster needs to scale up but cannot for some reason (e.g. a CPU quota), the Spark program keeps running; the data is simply distributed over fewer workers than requested.
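A quick way to confirm this at runtime is to check how much parallelism the job actually got. This is only a diagnostic sketch, assuming a Databricks notebook or job where `spark` is already defined; the executor count goes through the internal `_jsc` handle, so treat it as a convenience hack rather than a stable API:

```python
# Diagnostic sketch: how much compute did the cluster actually give us?
# Assumes `spark` is the active SparkSession in a Databricks notebook/job.
sc = spark.sparkContext

# Total cores currently available across all executors.
print("defaultParallelism:", sc.defaultParallelism)

# Approximate executor count via the JVM SparkContext; the status map usually
# includes the driver, hence the -1. Internal handle, not a stable API.
n_entries = sc._jsc.sc().getExecutorMemoryStatus().size()
print("executors (approx.):", n_entries - 1)
```

If these numbers stay well below what the autoscale maximum should provide, the cluster most likely could not obtain the extra nodes.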

I have had this a few times when I launched too many jobs at the same time and exceeded my CPU quota on Azure. All my jobs still finished without error. Slower than intended, yes, but they finished.

Of course, there is the possibility that the job does fail (timeout, ...) when you need a lot of workers and the number you actually get is far too low (e.g. you need 20 workers but only get 1).

But Databricks is pretty fault tolerant in that respect. I did not even notice I was hitting the quota until a sysadmin told me he was getting warnings from Azure that the CPU quota had been exceeded.
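If you would rather notice this yourself instead of hearing it from a sysadmin, the cluster event log records resize activity, including failed scale-outs. Here is a hedged sketch against the Clusters REST API; the host, token and cluster id are placeholders, and the exact event types and fields vary by cloud and failure mode, so inspect the raw response:

```python
import requests

HOST = "https://<your-workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"       # placeholder
CLUSTER_ID = "<cluster-id>"             # placeholder

# Fetch recent events for the cluster (POST /api/2.0/clusters/events).
resp = requests.post(
    f"{HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID, "limit": 50},
)
resp.raise_for_status()

# Print anything resize/node related; failed scale-outs typically carry a
# reason from the cloud provider in the event details.
for event in resp.json().get("events", []):
    etype = event.get("type", "")
    if "RESIZ" in etype or "NODES" in etype:
        print(event.get("timestamp"), etype, event.get("details", {}))
```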

I do not know for sure whether this is the case on AWS (I use Azure, as mentioned above), but I assume the same rules apply there.
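For completeness, the behaviour above hinges on the autoscale range in the cluster spec: the cluster has to be able to start with `min_workers`, while `max_workers` is only a ceiling that optimized autoscaling tries to reach when capacity allows. A minimal sketch of creating such a cluster through the REST API; the runtime version, node type and worker counts are illustrative placeholders, not recommendations:

```python
import requests

HOST = "https://<your-workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"       # placeholder

# Autoscaling cluster spec: min_workers must be obtainable for the cluster
# to start; max_workers is only an upper bound for scale-out.
cluster_spec = {
    "cluster_name": "autoscaling-demo",    # illustrative
    "spark_version": "13.3.x-scala2.12",   # pick a runtime available in your workspace
    "node_type_id": "i3.xlarge",           # AWS example; use a VM type on Azure
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])
```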



Vaibhav1000
New Contributor II

Thanks @Kaniz Fatma for the support.

