Compute pools max capacity and ideal compute settings
08-19-2024 03:51 PM
Hi there,
I'm having a hard time understanding the compute side of our jobs under the hood. I've checked the documentation but haven't found clear answers so far, so hopefully someone can provide some clarity.
I set up pools to use for our overnight jobs (initially running a bunch of tables in parallel).
The latest major issue was that we hadn't set a max capacity, and the pool started over 200 EC2 instances! Is this normal?
In the meantime, we set the max capacity to 50 and reduced the idle termination time to 2 minutes. We also enabled autoscaling and Photon acceleration on the jobs, and restructured the jobs to run sequentially rather than in parallel, so they don't need as many resources at the same time. Any other useful tips or tricks to optimise?
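For reference, this is roughly how we now create the pool (a minimal sketch against the Instance Pools API; the workspace URL, token, and node type are placeholders):

```python
import requests

# Placeholders: substitute your own workspace URL, token, and node type.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Create an instance pool with a hard cap and a short idle timeout,
# so runaway parallel jobs can no longer spin up hundreds of instances.
resp = requests.post(
    f"{HOST}/api/2.0/instance-pools/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "instance_pool_name": "overnight-jobs-pool",
        "node_type_id": "i3.xlarge",
        "min_idle_instances": 0,
        "max_capacity": 50,                          # hard cap on instances
        "idle_instance_autotermination_minutes": 2,  # terminate idle after 2 min
    },
)
resp.raise_for_status()
print(resp.json()["instance_pool_id"])
```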
One thing I don't understand: when an infrastructure constraint is hit, why do the jobs fail and terminate rather than queue? A pool of 50 should be more than enough to run all the jobs, and queueing and max concurrency are set up on all of them.
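For context, queueing and max concurrency are configured on the jobs roughly like this (a minimal sketch against the Jobs API 2.1; the workspace URL, token, and job ID are placeholders):

```python
import requests

HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Enable queueing and cap concurrent runs on an existing job, so extra
# runs wait in the queue instead of competing for pool capacity.
resp = requests.post(
    f"{HOST}/api/2.1/jobs/update",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "job_id": 123,  # placeholder job ID
        "new_settings": {
            "max_concurrent_runs": 1,    # one run of this job at a time
            "queue": {"enabled": True},  # queue rather than skip new runs
        },
    },
)
resp.raise_for_status()
```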
Many thanks for your help!
Ana
08-21-2024 06:32 AM
I moved away from pools since cluster reuse is possible in Databricks jobs.
Why? You get more control over your workers, there's no need to tune a good idle-wait time, and you can even run multiple tasks on a single cluster.
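To illustrate, a job that reuses one shared job cluster across tasks looks roughly like this (a sketch of a Jobs API 2.1 job spec; the cluster settings and notebook paths are placeholders):

```python
# Sketch of a Jobs API 2.1 job spec: one job cluster shared by several
# tasks via job_cluster_key, instead of drawing workers from a pool.
job_settings = {
    "name": "overnight-tables",
    "job_clusters": [
        {
            "job_cluster_key": "shared_cluster",
            "new_cluster": {
                "spark_version": "15.4.x-scala2.12",  # placeholder DBR version
                "node_type_id": "i3.xlarge",
                "autoscale": {"min_workers": 2, "max_workers": 8},
            },
        }
    ],
    "tasks": [
        {
            "task_key": "load_table_a",
            "job_cluster_key": "shared_cluster",
            "notebook_task": {"notebook_path": "/Jobs/load_table_a"},
        },
        {
            "task_key": "load_table_b",
            "job_cluster_key": "shared_cluster",
            "depends_on": [{"task_key": "load_table_a"}],  # run sequentially
            "notebook_task": {"notebook_path": "/Jobs/load_table_b"},
        },
    ],
}
```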
Why your jobs fail is not clear to me.
When no more workers can be assigned, running jobs do not fail, AFAIK. What will fail, however, are new jobs, as they cannot spin up a cluster (they need at least a driver).

