How does Databricks optimized autoscaling behave when scaling out fails (e.g., insufficient resources on the AWS side)?

Vaibhav1000
New Contributor II

1 ACCEPTED SOLUTION

-werners-
Esteemed Contributor III

@Vaibhav Gour, it depends on the situation:

If there are no workers available when your job starts, you get an error: the cluster cannot start, so no code can be executed. But that is not an autoscaling issue.

If the cluster needs to scale up but cannot (because of a CPU quota, for example), the Spark program keeps running; the data is simply distributed over fewer workers than requested.

I have hit this a few times when I launched too many jobs at once and exceeded my CPU quota on Azure. All my jobs still finished without errors. Slower than intended, yes, but they finished.

Of course, the job can still fail (timeout, ...) when you need a lot of workers and the number you actually get is far too low (for example, you need 20 workers but only get 1).

But Databricks is pretty fault tolerant in that respect. I did not even notice I was hitting the quota until a sysadmin told me he was getting warnings from Azure that the CPU quota had been exceeded.

I do not know whether the behavior is identical on AWS (I use Azure, as mentioned above), but I assume the same rules apply there.
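To make the scale-up behavior described above concrete, here is a minimal sketch of a job cluster spec with autoscaling enabled. The autoscale min_workers / max_workers fields follow the standard Databricks Clusters API; the runtime version and AWS instance type are placeholder assumptions, not values from this thread.

import json

# Hypothetical job-cluster spec; only the autoscale block matters for this discussion.
new_cluster = {
    "spark_version": "11.3.x-scala2.12",  # placeholder Databricks runtime
    "node_type_id": "i3.xlarge",          # placeholder AWS instance type
    "autoscale": {
        # Per the answer above: if no workers at all are available at start-up,
        # the cluster fails to launch and the job errors out.
        "min_workers": 2,
        # A shortfall between the requested maximum and what the cloud can
        # actually supply does not fail the run; the job just executes on
        # fewer workers than asked for.
        "max_workers": 20,
    },
}

print(json.dumps(new_cluster, indent=2))

In other words, per the behavior described in the answer, min_workers is the hard floor; capacity shortfalls between min_workers and max_workers degrade throughput rather than fail the run.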


4 REPLIES

Kaniz
Community Manager

Hi @Vaibhav1000! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first; otherwise, I will get back to you soon. Thanks.

Vaibhav1000
New Contributor II

Thanks @Kaniz Fatma for the support.

-werners-
Esteemed Contributor III

(See the accepted solution above.)

Kaniz
Community Manager

Hi @Vaibhav Gour, just a friendly follow-up. Do you still need help, or did @Werner Stinckens' response help you find the solution? Please let us know.
