Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
Adding dependencies to Serverless compute with concurrency slows processing right down

TimB
New Contributor III

I am trying to run a job that uses the For Each task with many concurrent iterations on serverless compute.

To add dependencies to serverless jobs, it seems you have to install them from within the notebook, rather than configure them on the task screen as you do with a job cluster. However, doing this significantly increases the processing time, and it looks as if the package installation is duplicated for every concurrent process: if I have 30 processes, the packages are installed 30 times, once for each run of the notebook. Contrast that with a normal cluster, where they are installed just once on the cluster.
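For reference, the pattern I'm describing is just an install cell at the top of the notebook that every For Each iteration runs (the package names here are only examples):

```
# First cell of the notebook invoked by each For Each iteration.
# On serverless compute each run starts from a clean environment,
# so this install repeats for every concurrent iteration.
%pip install requests==2.31.0 openpyxl==3.1.2

# Restart Python so the freshly installed packages are importable.
dbutils.library.restartPython()
```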

Is my understanding of this process correct, or am I missing a step that would improve my workflow?

3 REPLIES

Brahmareddy
Honored Contributor II

Hi TimB,

How are you doing today? As I understand it, yes, you've got it right: serverless jobs don't keep installed packages, so every time a process runs it installs the dependencies again, which slows things down. A better approach is to package your dependencies into a wheel (.whl) file and install it from DBFS or S3, so you're not downloading and resolving them on every run. If your libraries are available in Unity Catalog, you can attach them there so jobs can use them without reinstalling. You could also reduce the number of parallel processes to limit the repeated installs, or, if possible, switch to a shared job cluster, where dependencies install once rather than on every run. This should make things run much faster. Let me know if you need help setting it up!
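As a rough sketch of the wheel approach: build your dependencies into a single wheel, upload it to a Unity Catalog volume (or DBFS/S3), and install from that local path in the notebook. Each run still installs, but it's one local file install instead of a full download and dependency resolution. The volume path and wheel name below are placeholders:

```
# Install a pre-built wheel from a Unity Catalog volume
# (path and filename are examples; use your own).
%pip install /Volumes/main/default/libs/my_deps-0.1.0-py3-none-any.whl

dbutils.library.restartPython()
```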

Regards,

Brahma

TimB
New Contributor III

Hi Brahma,

Thanks for the feedback and for confirming what I suspected. I already use a job cluster for this concurrent workload and was testing serverless as a replacement, but it seems I'm best sticking with the job cluster when I have additional dependencies to install.

Brahmareddy
Honored Contributor II

That sounds like the right call, TimB. Keep going.
