Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Adding dependencies to Serverless compute with concurrency slows processing right down

TimB
New Contributor III

I am trying to run a job that uses the For Each task with many concurrent iterations on serverless compute.

To add dependencies to serverless jobs, it seems you have to install them from within the notebook, rather than configure them on the task screen as you do with a job cluster. However, doing this significantly increases the processing time, and it looks as if the package installation is duplicated for every concurrent process: if I have 30 processes, the packages are installed 30 times, once for each run of the notebook. Contrast that with a normal cluster, where they are installed just once on the cluster.
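For reference, the pattern I'm describing is just an install cell at the top of the notebook that every For Each iteration runs (the package names here are only examples):

```
# First cell of the notebook invoked by each For Each iteration.
# On serverless compute each run starts from a clean environment,
# so this install repeats for every concurrent iteration.
%pip install requests==2.31.0 openpyxl==3.1.2

# Restart Python so the freshly installed packages are importable.
dbutils.library.restartPython()
```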

Is my understanding of this process correct, or am I missing a step that would improve my workflow?

3 REPLIES

Brahmareddy
Honored Contributor II

Hi TimB,

How are you doing today? As I understand it, yes, you've got it right: serverless jobs don't keep installed packages, so every time a process runs it installs the dependencies again, which slows things down. A better approach is to package your dependencies into a wheel (.whl) file and install it from DBFS or S3, so you're not downloading and resolving them on every run. If your libraries are available in Unity Catalog, you can attach them there so jobs can use them without reinstalling. You could also reduce the number of parallel processes to limit the repeated installs, or, if possible, switch to a shared job cluster, where dependencies install once rather than on every run. This should make things run much faster. Let me know if you need help setting it up!
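As a rough sketch of the wheel approach: build your dependencies into a single wheel, upload it to a Unity Catalog volume (or DBFS/S3), and install from that local path in the notebook. Each run still installs, but it's one local file install instead of a full download and dependency resolution. The volume path and wheel name below are placeholders:

```
# Install a pre-built wheel from a Unity Catalog volume
# (path and filename are examples; use your own).
%pip install /Volumes/main/default/libs/my_deps-0.1.0-py3-none-any.whl

dbutils.library.restartPython()
```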

Regards,

Brahma

TimB
New Contributor III

Hi Brahma,

Thanks for the feedback and for confirming what I suspected. I already use a job cluster for this concurrent workload and was testing serverless as a replacement, but it seems I'm best sticking with the job cluster when I have additional dependencies to install.

Brahmareddy
Honored Contributor II

That sounds like the right call, TimB. Keep going.
