Data Engineering

DLT Pipeline and Job Cluster

Deepak_Goldwyn
New Contributor III

We have written a few Python functions (methods within a class) and packaged them as a wheel library.

In the as-is situation we used to install that wheel library on an All-Purpose cluster that we had already created.

It works fine.

In the to-be situation (Delta Live Tables) we want this wheel library to be installed as part of the DLT pipeline execution, because when the DLT pipeline runs it creates its own Job Cluster.

We use a lot of Python functions to do the transformations between the Silver and Gold layers.

Hence we want the wheel library (which has all the UDFs) to be installed on the Job Cluster that the DLT pipeline creates.
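
For illustration, the Silver-to-Gold usage follows roughly this pattern (the package, module and function names below are placeholders, not the actual library):

import dlt
from pyspark.sql import functions as F

# Hypothetical import: the real wheel exposes its own package and function names.
from my_udf_package.transforms import clean_name_udf

@dlt.table(name="gold_customers")
def gold_customers():
    # Read the Silver table and apply a UDF that ships inside the wheel.
    return (
        dlt.read("silver_customers")
        .withColumn("customer_name", clean_name_udf(F.col("customer_name")))
    )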

When we execute %pip install <wheel library location in DBFS> as the first step in the DLT notebook, it does not seem to work.

But when we run %pip install numpy, it works.

It's important for us to have the wheel library installed on the Job Cluster created by the DLT pipeline.

Are we missing something?

Thanks


5 REPLIES

Hubert-Dudek
Esteemed Contributor III

Are you sure that the DLT cluster sees your DBFS?

Alternatively, you can use "files in repos" instead.
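
A rough sketch of the "files in repos" approach (the repo path and module name below are placeholders) is to import the helpers straight from the repo instead of installing a wheel:

import sys

# Placeholder path: point this at the folder inside the repo that holds the package.
sys.path.append("/Workspace/Repos/<user>/<repo>/src")

from my_udf_package.transforms import clean_name_udf  # hypothetical module living in the repo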

tomasz
Databricks Employee

Does it give you an error when running the DLT pipeline specifically on the %pip command or does it not work in some other way?

If it's the former, could you share the path format that you're using for the %pip command path?

@Tomasz Bacewicz

Thanks for your reply!

We are using the below command as the first command (cell) in the DLT notebook:

%pip install /dbfs/dist/abnamro_acdpt_centraldatapoint-0.12.0.dev24-py3-none-any.whl

FYI,

When we try to manually install the same wheel on the Job Cluster which the DLT pipeline creates, it gets installed.

Also, when we run the same pip install command above on the All-Purpose cluster, it gets installed.

It fails only when run from the DLT pipeline.

Makes sense, good to know that it works manually. Can you also share the error that you get?

It said "it could not find the whl file"

Upon investigation we found that our library sits in Nexus and that a cluster environment variable needed to be set up.

When we added the below to the DLT pipeline settings JSON, it worked:

"spark_env_vars": {
    "PIP_INDEX_URL": "<URL for our repository>"
},
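
For reference, a minimal sketch of where this block can sit in the pipeline settings JSON (inside a cluster definition; the index URL stays a placeholder):

{
    "clusters": [
        {
            "label": "default",
            "spark_env_vars": {
                "PIP_INDEX_URL": "<URL for our repository>"
            }
        }
    ]
}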
