Databricks

Leodatabricks · ‎02-07-2023

When there are slow nodes, sometimes a job needs to resize its number of clusters to reach the required number of nodes. Is there any way to make sure no code is running before all nodes are secured? Thank you!

Anonymous · ‎02-07-2023

What do you mean by slow nodes?

Jobs run only on a single cluster.

What do you mean by nodes are secure? There is no concept of an unsecure or secure node.

Aviral-Bhardwaj · ‎02-07-2023

what mean by secure here?

Cluster autoscaling option is there

Leodatabricks · ‎02-08-2023

@Bilal Aslam might have a better phrasing of the question. What I mean is "how do I make sure my job does not start until all worker nodes in a Spark cluster are ready"

BilalAslamDbrx · ‎02-08-2023

You are running into a rare situation. Likely what's happening is that we cannot acquire the instance type you chose for the job cluster in time and you're hitting an optimization where we go ahead and start your job anyway with the workers we could acquire, and add additional nodes as they arrive. My recommendation is to configure a cluster with a different instance type and a smaller number of nodes.

By the way, why do you want the job to only start when all the workers are available?

Leodatabricks · ‎02-08-2023

Is there an option to turn that optimization feature off then? When some workers are added later, I've experienced some weird connection lost issues when reading/writing data while running the code. Everything works well when all nodes are ready before starting the job.

BilalAslamDbrx · ‎02-08-2023

@Leo Bao I think what you are saying is "how do I make sure my job does not start until all worker nodes in a Spark cluster are ready"? If that's what you want, set the cluster size e.g. 5 workers and disable autoscaling. This way, Databricks will make sure all workers are ready before submitting your code to them.

Leodatabricks · ‎02-08-2023

Thank you for the clarification. That's exactly what I mean. I'm not using autoscaling since I've been submitting a job to run with a required number of worker nodes. From my observation, even when some workers are not ready, the cluster will begin once it has a decent number of worker nodes. The description shown from databricks is 'Some nodes are taking much longer to become ready than others, and have been skipped in order to unblock cluster launch'. And my issue I would need to make sure all worker nodes are ready and then run the code.

Anonymous · ‎02-08-2023

Why do you need all the workers to start at the same time?

Manoj12421 · ‎02-09-2023

If you want your workers ready before submitting your code then just set the cluster size like 7,3 workers and disable autoscaling

Leodatabricks · ‎02-09-2023

already disabled autoscaling. When you set larger number of worker nodes, you might not get all at once. Resize might be needed.

BilalAslamDbrx · ‎02-09-2023

@Leo Bao I talked to an engineer and found out a bit more about what you're running into. First of all, it sounds like we should investigate it as it shouldn’t happen - can you open a support ticket?

In the meantime, you can make the first step in the job just wait for all the executors to become active by doing something along these lines and sleeping until you see the desired number == active executors.

Leodatabricks · ‎02-10-2023

Thank you again for your reply. Could you please let me know how can I open a support ticket? Also for the solution you mentioned, I'm using job submit instead of the interactive notebook, so I'm not sure when exactly all executors will be available as the time to resize cluster varies. If there is a way to check whether all nodes are ready, please let me know how I can do that using scala code. Thanks!

BilalAslamDbrx · ‎02-11-2023

@Leo Bao here is documentation on how to create a support ticket. Here's some code --- it should do what you are looking for. Please tweak the wait time as you like, I've set it to 10 mins.

def numWorkers: Int = sc.getExecutorMemoryStatus.size - 1
 
def waitForWorkers(requiredWorkers: Int, tries: Int) : Unit = {
  for (i <- 0 to (tries-1)) {
    if (numWorkers >= requiredWorkers) {
      println(s"Waited ${i}s. for $numWorkers/$targetWorkers workers to be ready")
      return
    }
    if (i % 60 == 0) println(s"Waiting ${i}s. for workers to be ready, got only $numWorkers/$targetWorkers workers")
    Thread sleep 1000
  }
  throw new Exception(s"Timed out waiting for workers to be ready after ${tries}s.")
}
 
waitForWorkers(targetWorkers, 600) //wait up to 10m

Anonymous · ‎04-08-2023

Hi @Leo Bao

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.

Cheers!

Databricks

How to secure all clusters and then start running the code

Unity Catalog Lakeguard: Industry-first and only data governance for multi-user Apache™ Spark cluste

Announcing the General Availability of Databricks Asset Bundles

Register now and save 50% on training at Data + AI Summit!

How to successfully build GenAI applications

Meet DBRX, the New Standard for High-Quality LLMs