Databricks Community

AnaMocanu · ‎08-19-2024

Hi there,

I'm having a difficult time understanding the compute side of our jobs under the hood, and I checked the documentation but don't have clear answers so far, so hopefully someone will provide some clarity.

I set up pools to use for our overnight jobs (initially running a bunch of tables in parallel).

The latest major issue was that we didn't set a max capacity, and the pool started over 200 EC2 instances! Is this normal?
In the meantime, we set the max capacity to 50 and reduced the idle termination time to 2 minutes. We also enabled autoscaling and Photon acceleration on the jobs, and restructured the jobs to run sequentially rather than in parallel, so that they don't need as many resources at the same time. Any other useful tips or tricks to optimise?

One thing that I don't understand is why the jobs fail and terminate, rather than queue, when an infrastructure constraint happens. The pool of 50 should be more than enough to run all the jobs. Queuing and max concurrency are set up on all the jobs.

Many thanks for your help!

Ana

-werners- · ‎08-21-2024

I moved away from pools since cluster reuse is possible in Databricks jobs.
Why? More control over your workers, no need to find a good waiting time, and you can even run multiple tasks on a single cluster.
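To illustrate what cluster reuse can look like, here is a minimal sketch of a job definition in which every task points at one shared job cluster and the tasks run one after another. It only builds a Python dict shaped like a Jobs API payload; the job name, notebook paths, table names, node type, runtime version and worker counts are all placeholders I made up, so check the field names and values against the Jobs API docs for your workspace before using anything like this.

```python
def build_job_payload(table_names):
    """Build a job definition where all tasks share one job cluster.

    Every concrete value here (names, paths, sizes) is a hypothetical placeholder.
    """
    shared_key = "shared_cluster"  # hypothetical key linking tasks to the cluster

    tasks = []
    previous_key = None
    for table in table_names:
        task = {
            "task_key": f"load_{table}",
            "job_cluster_key": shared_key,  # reuse the same cluster for every task
            "notebook_task": {"notebook_path": f"/Jobs/load_{table}"},  # placeholder path
        }
        if previous_key is not None:
            # Chain the tasks so they run sequentially rather than in parallel.
            task["depends_on"] = [{"task_key": previous_key}]
        tasks.append(task)
        previous_key = task["task_key"]

    return {
        "name": "overnight_tables",      # placeholder job name
        "max_concurrent_runs": 1,
        "queue": {"enabled": True},
        "job_clusters": [
            {
                "job_cluster_key": shared_key,
                "new_cluster": {
                    "spark_version": "14.3.x-scala2.12",  # placeholder runtime
                    "node_type_id": "i3.xlarge",          # placeholder node type
                    "autoscale": {"min_workers": 1, "max_workers": 4},
                    "runtime_engine": "PHOTON",
                },
            }
        ],
        "tasks": tasks,
    }


payload = build_job_payload(["customers", "orders", "invoices"])
```

With this shape, the whole run needs only one driver plus at most `max_workers` workers, however many tasks the job has.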
Why your jobs fail is not clear to me.
When no more workers can be assigned, the running jobs do not fail, afaik. What will fail, however, are new jobs, as they cannot spin up a cluster (which needs at least a driver).
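As a rough back-of-the-envelope check, the arithmetic behind that can be sketched like this. It is a simplified model I am assuming for illustration (every cluster takes one driver plus its current workers from the pool), not how Databricks actually schedules instances, and the cluster sizes are invented.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    workers: int  # workers the cluster currently holds from the pool

    def instances(self) -> int:
        # Assumption: one driver instance plus the workers.
        return self.workers + 1


def free_instances(pool_max: int, running: list) -> int:
    """Instances left in the pool after the running clusters take theirs."""
    return pool_max - sum(c.instances() for c in running)


def can_start_new_cluster(pool_max: int, running: list, min_workers: int = 1) -> bool:
    """A new cluster needs a driver plus its minimum number of workers."""
    return free_instances(pool_max, running) >= 1 + min_workers


# Five clusters that have each autoscaled to 9 workers use 5 * (9 + 1) = 50 instances.
busy = [Cluster(workers=9) for _ in range(5)]
pool_is_full = not can_start_new_cluster(50, busy)

# With only four such clusters, 10 instances remain, so a new cluster fits.
has_room = can_start_new_cluster(50, busy[:4])
```

So a pool of 50 can look like plenty and still leave no instance for the driver of the next job if the clusters already running have scaled up to fill it.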


Compute pools max capacity and ideal compute settings
