02-07-2023 01:15 PM
When there are slow nodes, sometimes a job cluster needs to be resized to reach the required number of nodes. Is there any way to make sure no code is running before all nodes are secured? Thank you!
02-07-2023 05:29 PM
What do you mean by slow nodes?
Jobs run only on a single cluster.
What do you mean by nodes are secure? There is no concept of an unsecure or secure node.
02-07-2023 07:56 PM
What do you mean by "secure" here?
There is a cluster autoscaling option.
02-08-2023 08:37 AM
@Bilal Aslam might have a better phrasing of the question. What I mean is "how do I make sure my job does not start until all worker nodes in a Spark cluster are ready"
02-08-2023 10:03 AM
You are running into a rare situation. Likely what's happening is that we cannot acquire the instance type you chose for the job cluster in time and you're hitting an optimization where we go ahead and start your job anyway with the workers we could acquire, and add additional nodes as they arrive. My recommendation is to configure a cluster with a different instance type and a smaller number of nodes.
By the way, why do you want the job to only start when all the workers are available?
02-08-2023 10:13 AM
Is there an option to turn that optimization feature off then? When some workers are added later, I've experienced some weird connection-lost issues when reading/writing data while the code is running. Everything works well when all nodes are ready before the job starts.
02-08-2023 04:46 AM
@Leo Bao I think what you are saying is "how do I make sure my job does not start until all worker nodes in a Spark cluster are ready"? If that's what you want, set the cluster size e.g. 5 workers and disable autoscaling. This way, Databricks will make sure all workers are ready before submitting your code to them.
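To illustrate the fixed-size setup: when submitting a job via the Databricks Jobs API, the `new_cluster` object can specify `num_workers` directly instead of an `autoscale` range, which disables autoscaling. A minimal sketch (the `spark_version` and `node_type_id` values here are placeholders, pick the ones appropriate for your workspace):

```json
{
  "new_cluster": {
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 5
  }
}
```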
02-08-2023 08:35 AM
Thank you for the clarification. That's exactly what I mean. I'm not using autoscaling, since I've been submitting the job with a required number of worker nodes. From my observation, even when some workers are not ready, the job will begin once the cluster has a decent number of worker nodes. The description shown by Databricks is 'Some nodes are taking much longer to become ready than others, and have been skipped in order to unblock cluster launch'. My issue is that I need to make sure all worker nodes are ready before the code runs.
02-08-2023 04:24 PM
Why do you need all the workers to start at the same time?
02-09-2023 11:21 AM
If you want your workers ready before submitting your code, then just set a fixed cluster size, e.g. 3 workers, and disable autoscaling.
02-09-2023 11:28 AM
I already disabled autoscaling. When you set a larger number of worker nodes, you might not get them all at once, so a resize might be needed.
02-09-2023 06:25 PM
@Leo Bao I talked to an engineer and found out a bit more about what you're running into. First of all, it sounds like we should investigate it as it shouldn’t happen - can you open a support ticket?
In the meantime, you can make the first step in the job just wait for all the executors to become active by doing something along these lines and sleeping until you see the desired number == active executors.
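The polling pattern described here can be sketched in plain Scala. The `activeExecutors` parameter below is a hypothetical stand-in for a real query against a live `SparkContext` (e.g. something based on `sc.getExecutorMemoryStatus`, as in the fuller snippet later in this thread); this version just shows the wait-and-check loop itself:

```scala
// Sketch of the "sleep until desired == active executors" loop.
// `activeExecutors` is a placeholder for a real executor-count query.
def waitUntil(desired: Int, timeoutSec: Int)(activeExecutors: () => Int): Boolean = {
  val deadline = System.nanoTime() + timeoutSec * 1000000000L
  while (System.nanoTime() < deadline) {
    if (activeExecutors() >= desired) return true // enough workers: proceed
    Thread.sleep(1000)                            // otherwise poll again in 1s
  }
  false // timed out before the cluster reached the desired size
}
```

The first step of the job would call this and fail fast (or keep waiting) when it returns `false`.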
02-10-2023 05:28 AM
Thank you again for your reply. Could you please let me know how I can open a support ticket? Also, regarding the solution you mentioned: I'm using job submit instead of an interactive notebook, so I'm not sure exactly when all executors will be available, as the time to resize the cluster varies. If there is a way to check whether all nodes are ready, please let me know how I can do that in Scala code. Thanks!
02-11-2023 05:55 AM
@Leo Bao here is documentation on how to create a support ticket. Here's some code --- it should do what you are looking for. Please tweak the wait time as you like, I've set it to 10 mins.
// Count active workers; getExecutorMemoryStatus includes the driver, hence the -1.
def numWorkers: Int = sc.getExecutorMemoryStatus.size - 1

// Poll once per second until `requiredWorkers` workers are up,
// or give up after `tries` seconds.
def waitForWorkers(requiredWorkers: Int, tries: Int): Unit = {
  for (i <- 0 until tries) {
    if (numWorkers >= requiredWorkers) {
      println(s"Waited ${i}s. for $numWorkers/$requiredWorkers workers to be ready")
      return
    }
    if (i % 60 == 0) println(s"Waiting ${i}s. for workers to be ready, got only $numWorkers/$requiredWorkers workers")
    Thread.sleep(1000)
  }
  throw new Exception(s"Timed out waiting for workers to be ready after ${tries}s.")
}

val targetWorkers = 5 // set this to the worker count your job cluster is configured with
waitForWorkers(targetWorkers, 600) // wait up to 10 min
04-08-2023 11:22 PM
Hi @Leo Bao
Hope everything is going great.
Just wanted to check in if you were able to resolve your issue. If yes, would you be happy to mark an answer as best so that other members can find the solution more quickly? If not, please tell us so we can help you.
Cheers!