cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

How to secure all clusters and then start running the code

Leodatabricks
Contributor

When there are slow nodes, sometimes a job needs to resize its number of clusters to reach the required number of nodes. Is there any way to make sure no code is running before all nodes are secured? Thank you!

14 REPLIES 14

Anonymous
Not applicable

What do you mean by slow nodes?

Jobs run only on a single cluster.

What do you mean by nodes are secure? There is no concept of an unsecure or secure node.

Aviral-Bhardwaj
Esteemed Contributor III

what mean by secure here?

Cluster autoscaling option is there

@Bilal Aslam​  might have a better phrasing of the question. What I mean is "how do I make sure my job does not start until all worker nodes in a Spark cluster are ready"

You are running into a rare situation. Likely what's happening is that we cannot acquire the instance type you chose for the job cluster in time and you're hitting an optimization where we go ahead and start your job anyway with the workers we could acquire, and add additional nodes as they arrive. My recommendation is to configure a cluster with a different instance type and a smaller number of nodes.

By the way, why do you want the job to only start when all the workers are available?

Is there an option to turn that optimization feature off then? When some workers are added later, I've experienced some weird connection lost issues when reading/writing data while running the code. Everything works well when all nodes are ready before starting the job.

BilalAslamDbrx
Honored Contributor II
Honored Contributor II

@Leo Bao​  I think what you are saying is "how do I make sure my job does not start until all worker nodes in a Spark cluster are ready"? If that's what you want, set the cluster size e.g. 5 workers and disable autoscaling. This way, Databricks will make sure all workers are ready before submitting your code to them.

Thank you for the clarification. That's exactly what I mean. I'm not using autoscaling since I've been submitting a job to run with a required number of worker nodes. From my observation, even when some workers are not ready, the cluster will begin once it has a decent number of worker nodes. The description shown from databricks is 'Some nodes are taking much longer to become ready than others, and have been skipped in order to unblock cluster launch'. And my issue I would need to make sure all worker nodes are ready and then run the code.

Anonymous
Not applicable

Why do you need all the workers to start at the same time?

Manoj12421
Valued Contributor II

If you want ​your workers ready before submitting your code then just set the cluster size like 7,3 workers and disable autoscaling

already disabled autoscaling. When you set larger number of worker nodes, you might not get all at once. Resize might be needed.

BilalAslamDbrx
Honored Contributor II
Honored Contributor II

@Leo Bao​ I talked to an engineer and found out a bit more about what you're running into. First of all, it sounds like we should investigate it as it shouldn’t happen - can you open a support ticket?

In the meantime, you can make the first step in the job just wait for all the executors to become active by doing something along these lines and sleeping until you see the desired number == active executors.

Thank you again for your reply. Could you please let me know how can I open a support ticket? Also for the solution you mentioned, I'm using job submit instead of the interactive notebook, so I'm not sure when exactly all executors will be available as the time to resize cluster varies. If there is a way to check whether all nodes are ready, please let me know how I can do that using scala code. Thanks!

@Leo Bao​  here is documentation on how to create a support ticket. Here's some code --- it should do what you are looking for. Please tweak the wait time as you like, I've set it to 10 mins.

def numWorkers: Int = sc.getExecutorMemoryStatus.size - 1
 
def waitForWorkers(requiredWorkers: Int, tries: Int) : Unit = {
  for (i <- 0 to (tries-1)) {
    if (numWorkers >= requiredWorkers) {
      println(s"Waited ${i}s. for $numWorkers/$targetWorkers workers to be ready")
      return
    }
    if (i % 60 == 0) println(s"Waiting ${i}s. for workers to be ready, got only $numWorkers/$targetWorkers workers")
    Thread sleep 1000
  }
  throw new Exception(s"Timed out waiting for workers to be ready after ${tries}s.")
}
 
waitForWorkers(targetWorkers, 600) //wait up to 10m

Anonymous
Not applicable

Hi @Leo Bao​ 

Hope everything is going great.

Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you. 

Cheers!

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.