Hi there,
I'm having a difficult time understanding how compute works under the hood for our jobs. I checked the documentation but haven't found clear answers so far, so hopefully someone can provide some clarity.
I set up pools to use for our overnight jobs (initially processing a number of tables in parallel).
The latest major issue was that we hadn't set a max capacity, and the pool started over 200 EC2 instances! Is this normal?
In the meantime, we set the pool's max capacity to 50 and reduced the idle auto-termination time to 2 minutes. We also enabled autoscaling and Photon acceleration on the jobs, and restructured the jobs to run sequentially rather than in parallel, so that they don't need as many resources at the same time. Are there any other useful tips or tricks to optimise?
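For reference, here is a rough sketch of the pool settings described above, plus a quick sanity check of how many instances a run would ask for. The field names follow the Databricks Instance Pools API as I understand it. The pool name, node type, and worker counts are placeholders rather than our real values, and the check assumes the driver is drawn from the same pool as the workers.

```python
import json

# Pool settings as described above. Name and node type are hypothetical placeholders.
pool_settings = {
    "instance_pool_name": "overnight-jobs-pool",   # placeholder, not our real pool name
    "node_type_id": "i3.xlarge",                   # placeholder, not our real node type
    "max_capacity": 50,                            # cap we added after the 200+ instance incident
    "idle_instance_autotermination_minutes": 2,    # reduced idle termination time
}


def instances_needed(max_workers: int, concurrent_runs: int = 1) -> int:
    """Instances requested if every concurrent run scales to its autoscaling maximum.

    Assumes one driver per run, taken from the same pool as the workers.
    """
    return (max_workers + 1) * concurrent_runs


def fits_in_pool(max_workers: int, concurrent_runs: int = 1) -> bool:
    """True if the worst-case request stays within the pool's max capacity."""
    return instances_needed(max_workers, concurrent_runs) <= pool_settings["max_capacity"]


if __name__ == "__main__":
    print(json.dumps(pool_settings, indent=2))
    # Hypothetical example: two concurrent runs, each autoscaling up to 8 workers.
    print(instances_needed(8, 2), fits_in_pool(8, 2))
```

If the worst-case total across everything running at once can exceed 50, that might be relevant to the failures I describe next, but I'm not sure.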
One thing I don't understand is why the jobs fail and terminate, rather than queue, when they hit an infrastructure constraint. The pool of 50 should be more than enough to run all the jobs, and queuing and max concurrency are set up on all of them.
Many thanks for your help!
Ana