Data Engineering

Package installation for a multi-task job

Guigui
New Contributor II

I have a job in which the same task is executed twice with two different sets of parameters. Each task clones a git repo, installs it locally, and then runs a notebook from that repo. However, since each task clones the same repo, I was wondering how to do the install once and for all?

I tried adding a first task that installs the package from the cloned repo, and made the two other tasks depend on it. Basically:

Task 0:
   * from the git repo
   * %sh
      pip install poetry
      poetry install  # installs the locally cloned package, named my_package

Task 1 and 2:
   * depend on Task 0
   * same cluster
   * from my_package import my_class  # fails with an exception: no package named my_package

Adding my_package to the cluster configuration is not an option; I need to install it when the job runs.
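Expressed as a Jobs API job definition, the attempted setup looks roughly like the sketch below (the repo URL, notebook paths, cluster spec, and parameter names are placeholders, not taken from the actual job):

{
  "name": "multi_task_job",
  "git_source": {
    "git_url": "https://github.com/my-org/my_package.git",
    "git_provider": "gitHub",
    "git_branch": "main"
  },
  "job_clusters": [
    {
      "job_cluster_key": "shared_cluster",
      "new_cluster": {
        "spark_version": "15.4.x-scala2.12",
        "node_type_id": "i3.xlarge",
        "num_workers": 1
      }
    }
  ],
  "tasks": [
    {
      "task_key": "task_0_install",
      "job_cluster_key": "shared_cluster",
      "notebook_task": { "notebook_path": "notebooks/install", "source": "GIT" }
    },
    {
      "task_key": "task_1",
      "depends_on": [ { "task_key": "task_0_install" } ],
      "job_cluster_key": "shared_cluster",
      "notebook_task": { "notebook_path": "notebooks/run", "source": "GIT", "base_parameters": { "param_set": "A" } }
    },
    {
      "task_key": "task_2",
      "depends_on": [ { "task_key": "task_0_install" } ],
      "job_cluster_key": "shared_cluster",
      "notebook_task": { "notebook_path": "notebooks/run", "source": "GIT", "base_parameters": { "param_set": "B" } }
    }
  ]
}

Even with all three tasks on the same job cluster, the import in Tasks 1 and 2 fails: as noted further down in the thread, the tasks do not share a Python environment, so the install done in Task 0 is not visible to them.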

2 REPLIES

srinum89
New Contributor II

You can install the custom library directly on the two tasks as a dependent library, pointing at a Volumes, custom (abfss), or Workspace path.

There is no need to have a Task 0 just to install libraries.
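For example, assuming the package has been built as a wheel and uploaded to a Unity Catalog volume (the volume path and file name below are placeholders; an abfss:// or /Workspace path would work the same way), the two tasks could each declare it as a dependent library:

  "tasks": [
    {
      "task_key": "task_1",
      "job_cluster_key": "shared_cluster",
      "libraries": [
        { "whl": "/Volumes/main/default/libs/my_package-0.1.0-py3-none-any.whl" }
      ],
      "notebook_task": { "notebook_path": "notebooks/run", "source": "GIT", "base_parameters": { "param_set": "A" } }
    },
    {
      "task_key": "task_2",
      "job_cluster_key": "shared_cluster",
      "libraries": [
        { "whl": "/Volumes/main/default/libs/my_package-0.1.0-py3-none-any.whl" }
      ],
      "notebook_task": { "notebook_path": "notebooks/run", "source": "GIT", "base_parameters": { "param_set": "B" } }
    }
  ]

The library is installed on the task's cluster before the notebook starts, so no separate install task or install cell is needed.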

Hope this helps! 🙂

Guigui
New Contributor II

That's what I've done, but I find it less elegant than setting up an environment once and sharing it across multiple tasks. That seems to be impossible (unless I build a wheel file, which I don't want to do), since tasks do not share an environment. But anyway, as the tasks run in parallel, there is no real overhead in installing the package in each one.
