DLT Pipeline and Job Cluster

Deepak_Goldwyn
New Contributor III

We have written a few Python functions (methods within a class) and packaged them as a wheel library.

In the as-is situation we install that wheel library on an All-Purpose cluster that we have already created.

It works fine.

In the to-be situation (Delta Live Tables) we want this wheel library to be installed as part of the Delta Live Tables pipeline execution, because when the DLT pipeline runs it creates its own Job Cluster.

We use a lot of Python functions to do the transformations between the Silver and Gold layers.

Hence we want the wheel library (which has all the UDFs) to be installed on the Job Cluster that the DLT pipeline creates.

When we execute %pip install <wheel library location in DBFS> as the first step in the DLT notebook, it does not seem to work.

But when we run %pip install numpy, it works.

It is important for us to have the wheel library installed on the job cluster created by the DLT pipeline.

Are we missing something?

Thanks
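For context, a minimal sketch of the kind of setup described above, with a hypothetical wheel name, module, and table names (not the actual package):

# First cell of the DLT notebook: install the wheel from its DBFS location
%pip install /dbfs/dist/my_transforms-0.1.0-py3-none-any.whl

# Later cell: use the packaged functions in a Silver-to-Gold transformation
import dlt
from my_transforms.udfs import Transforms  # hypothetical module/class inside the wheel

t = Transforms()

@dlt.table(name="gold_orders")
def gold_orders():
    df = dlt.read("silver_orders")
    return t.add_revenue_columns(df)  # hypothetical transformation method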


5 REPLIES

Hubert-Dudek
Esteemed Contributor III

Are you sure that the DLT cluster sees your DBFS?

Alternatively, you can use "files in repos" instead.
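For example, a rough sketch of the "files in repos" route, assuming the package source lives next to the DLT notebook in a Repo (paths and module names are hypothetical):

import sys

# Make the repo's source folder importable instead of installing a wheel
sys.path.append("/Workspace/Repos/<user>/<repo>/src")

from my_transforms.udfs import Transforms  # hypothetical module from the repo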

tomasz
New Contributor III

Does it give you an error specifically on the %pip command when running the DLT pipeline, or does it fail in some other way?

If it's the former, could you share the path format that you're using in the %pip command?

@Tomasz Bacewicz

Thanks for your reply!

We are using the below command as the first cell in the DLT notebook:

%pip install /dbfs/dist/abnamro_acdpt_centraldatapoint-0.12.0.dev24-py3-none-any.whl

FYI:

When we manually install the same wheel on the Job Cluster that the DLT pipeline creates, it installs fine.

Also, when we run the same pip install command on the All-Purpose cluster, it installs fine.

It fails only when it is run from the DLT pipeline.
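As a side note on path formats: %pip needs a path that the driver's local filesystem can read, which is why the FUSE-style /dbfs/... form is used rather than the dbfs:/ URI. A hypothetical illustration (wheel name is a placeholder):

# FUSE mount path: readable by pip on the driver
%pip install /dbfs/dist/my_transforms-0.1.0-py3-none-any.whl

# dbfs:/ URI: not a local path, pip cannot resolve it
# %pip install dbfs:/dist/my_transforms-0.1.0-py3-none-any.whl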

Makes sense, good to know that it works manually. Can you also share the error that you get?

It said "it could not find the whl file"

Upon investigation we found that our library sits in Nexus and that a cluster environment variable needed to be set up.

And when we added the below to the DLT pipeline settings JSON,

"spark_env_vars": {
    "PIP_INDEX_URL": "<URL for our repository>"
},

it worked.
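For completeness, a sketch of roughly where such a setting sits in the pipeline settings JSON; the cluster label and the index URL below are placeholders, and the exact layout may differ per workspace:

{
  "clusters": [
    {
      "label": "default",
      "spark_env_vars": {
        "PIP_INDEX_URL": "https://nexus.example.com/repository/pypi-all/simple"
      }
    }
  ]
}

With the index URL in place, pip on the DLT job cluster can resolve packages from the internal repository.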
