Data Engineering

How does Databricks optimized autoscaling behave when scaling out fails (e.g., insufficient resources on the AWS side)?

Vaibhav1000
New Contributor II
1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

@Vaibhav Gour, it really depends on the case:

If no workers are available when your job starts, you get an error: the cluster cannot start, so no code can be executed. But that is not an autoscaling issue.

If the cluster needs to scale up but cannot for some reason (e.g. a CPU quota), the Spark program keeps running; the data is simply distributed over fewer workers than requested.
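A quick way to confirm this at runtime is to check how much parallelism the job actually got. This is only a diagnostic sketch, assuming a Databricks notebook or job where `spark` is already defined; the executor count goes through the internal `_jsc` handle, so treat it as a convenience hack rather than a stable API:

```python
# Diagnostic sketch: how much compute did the cluster actually give us?
# Assumes `spark` is the active SparkSession in a Databricks notebook/job.
sc = spark.sparkContext

# Total cores currently available across all executors.
print("defaultParallelism:", sc.defaultParallelism)

# Approximate executor count via the JVM SparkContext; the status map usually
# includes the driver, hence the -1. Internal handle, not a stable API.
n_entries = sc._jsc.sc().getExecutorMemoryStatus().size()
print("executors (approx.):", n_entries - 1)
```

If these numbers stay well below what the autoscale maximum should provide, the cluster most likely could not obtain the extra nodes.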

I have had this a few times when I launched too many jobs at the same time and exceeded my CPU quota on Azure. All my jobs still finished without error. Slower than intended, yes, but they finished.

Of course, there is the possibility that the job does fail (timeout, ...) when you need a lot of workers and the number you actually get is far too low (e.g. you need 20 workers but only get 1).

But Databricks is pretty fault tolerant in that respect. I did not even notice I was hitting the quota until a sysadmin told me he was getting warnings from Azure that the CPU quota had been exceeded.
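If you would rather notice this yourself instead of hearing it from a sysadmin, the cluster event log records resize activity, including failed scale-outs. Here is a hedged sketch against the Clusters REST API; the host, token and cluster id are placeholders, and the exact event types and fields vary by cloud and failure mode, so inspect the raw response:

```python
import requests

HOST = "https://<your-workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"       # placeholder
CLUSTER_ID = "<cluster-id>"             # placeholder

# Fetch recent events for the cluster (POST /api/2.0/clusters/events).
resp = requests.post(
    f"{HOST}/api/2.0/clusters/events",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"cluster_id": CLUSTER_ID, "limit": 50},
)
resp.raise_for_status()

# Print anything resize/node related; failed scale-outs typically carry a
# reason from the cloud provider in the event details.
for event in resp.json().get("events", []):
    etype = event.get("type", "")
    if "RESIZ" in etype or "NODES" in etype:
        print(event.get("timestamp"), etype, event.get("details", {}))
```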

I do not know for sure whether this is the case on AWS (I use Azure, as mentioned above), but I assume the same rules apply there.
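For completeness, the behaviour above hinges on the autoscale range in the cluster spec: the cluster has to be able to start with `min_workers`, while `max_workers` is only a ceiling that optimized autoscaling tries to reach when capacity allows. A minimal sketch of creating such a cluster through the REST API; the runtime version, node type and worker counts are illustrative placeholders, not recommendations:

```python
import requests

HOST = "https://<your-workspace-url>"   # placeholder
TOKEN = "<personal-access-token>"       # placeholder

# Autoscaling cluster spec: min_workers must be obtainable for the cluster
# to start; max_workers is only an upper bound for scale-out.
cluster_spec = {
    "cluster_name": "autoscaling-demo",    # illustrative
    "spark_version": "13.3.x-scala2.12",   # pick a runtime available in your workspace
    "node_type_id": "i3.xlarge",           # AWS example; use a VM type on Azure
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])
```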



Vaibhav1000
New Contributor II

Thanks @Kaniz Fatma for the support.

