topic Re: Package installation for multi-tasks job in Data Engineering

Package installation for multi-tasks job

Guigui — Thu, 03 Apr 2025 18:59:48 GMT

I have a job in which the same task is executed twice, with two sets of parameters. Each task clones a git repo, installs it locally, and then runs a notebook from that repo. However, as each task clones the same repo, I was wondering how to do the install once and for all?

I tried adding a first task that installs the package from the cloned repo, and made the two tasks depend on this first task. Basically:

Task 0:
* from git repo
* %sh
pip install poetry
poetry install ---will install locally cloned package named my_package---

Task 1 and 2:
* depends on Task 0
* same cluster
* from my_package import my_class ---got an exception that there is no package my_package---

Adding the my_package package to the cluster config is not an option; I need to install it when the job runs.
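
For reference, a minimal sketch of the per-task alternative: each task installs the cloned repo into its own Python interpreter before importing it. This assumes that whatever Task 0 installs is not visible to the notebooks run by Task 1 and Task 2, and that the repo can be installed with plain pip through its pyproject.toml. The function names and the repo path are hypothetical; in a notebook, `%pip install <path>` would be the more usual way to do the same thing.

```python
import subprocess
import sys
from pathlib import Path


def build_install_command(repo_root: str) -> list[str]:
    """Build a pip command that installs the package found at repo_root
    into the interpreter that is running this code."""
    return [sys.executable, "-m", "pip", "install", str(Path(repo_root))]


def install_local_package(repo_root: str) -> None:
    """Run the install; raises CalledProcessError if pip exits with an error."""
    subprocess.run(build_install_command(repo_root), check=True)


# Hypothetical usage at the top of the notebook run by Task 1 and Task 2:
# install_local_package("/path/to/cloned/repo")
# from my_package import my_class
```

Using `sys.executable -m pip` rather than a bare `pip` keeps the install tied to the interpreter that will later run the import.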

Re: Package installation for multi-tasks job

srinum89 — Thu, 03 Apr 2025 20:53:20 GMT

You can install the custom library directly on the two tasks as a dependent library, from a Volumes path, a custom (abfss) path, or a workspace path.

No need to have task0 just to install libraries.
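
As a rough sketch of what this could look like, the job settings below attach the library to each task through the task-level `libraries` field of the Jobs API, with no separate install task. The repo URL, wheel path, cluster ID, notebook path, and parameter values are all hypothetical placeholders, and the payload assumes the package is available as a wheel file on a volume.

```python
import json

# Hypothetical location of the packaged library on a volume.
WHEEL_PATH = "/Volumes/main/default/libs/my_package-0.1.0-py3-none-any.whl"


def make_task(task_key: str, params: dict) -> dict:
    """Describe one notebook task that carries its own library dependency."""
    return {
        "task_key": task_key,
        "existing_cluster_id": "<cluster-id>",  # placeholder, same cluster for both tasks
        "notebook_task": {
            "notebook_path": "notebooks/run",  # hypothetical path inside the repo
            "source": "GIT",
            "base_parameters": params,
        },
        # The library is installed for this task before the notebook starts.
        "libraries": [{"whl": WHEEL_PATH}],
    }


job_settings = {
    "name": "multi-task-job",
    "git_source": {
        "git_url": "https://github.com/<org>/<repo>",  # placeholder
        "git_provider": "gitHub",
        "git_branch": "main",
    },
    # Two independent tasks: same notebook, different parameter sets.
    "tasks": [
        make_task("task_1", {"param_set": "a"}),
        make_task("task_2", {"param_set": "b"}),
    ],
}

print(json.dumps(job_settings, indent=2))
```

Because neither task declares `depends_on`, the two tasks can run in parallel.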

Hope this helps! 🙂

Re: Package installation for multi-tasks job

Guigui — Thu, 03 Apr 2025 20:57:34 GMT

That's what I've done, but I find it less elegant than setting up an environment once and sharing it across multiple tasks. That seems to be impossible (unless I build a wheel file, which I don't want to do), as tasks do not share an environment. In any case, since the tasks run in parallel, installing the package in each task adds no overhead.