DLT Pipeline and Job Cluster

Deepak_Goldwyn
New Contributor III

We have written a few Python functions (methods within a class) and packaged them as a wheel library.

In the as-is situation we install that wheel library on an All-Purpose cluster that we have already created.

It works fine.

In the to-be situation (Delta Live Tables) we want this wheel library to be installed as part of the Delta Live Tables pipeline execution, because when the DLT pipeline runs it creates its own Job Cluster.

We use a lot of Python functions to do the transformations between the Silver and Gold layers.

Hence we want the wheel library (which has all the UDFs) to be installed on the Job Cluster that the DLT pipeline creates.

When we execute %pip install <wheel library location in DBFS> as the first step in the DLT notebook, it does not seem to work.

But when we run %pip install numpy, it works.

It is important for us to have the wheel library installed on the job cluster created by the DLT pipeline.

Are we missing something?

Thanks
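For context, a minimal sketch of the kind of setup described above, with a hypothetical wheel name, module, and table names (not the actual package):

# First cell of the DLT notebook: install the wheel from its DBFS location
%pip install /dbfs/dist/my_transforms-0.1.0-py3-none-any.whl

# Later cell: use the packaged functions in a Silver-to-Gold transformation
import dlt
from my_transforms.udfs import Transforms  # hypothetical module/class inside the wheel

t = Transforms()

@dlt.table(name="gold_orders")
def gold_orders():
    df = dlt.read("silver_orders")
    return t.add_revenue_columns(df)  # hypothetical transformation method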


5 REPLIES

Hubert-Dudek
Esteemed Contributor III

Are you sure that the DLT cluster sees your DBFS?

Alternatively, you can use "files in repos" instead.
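For example, a rough sketch of the "files in repos" route, assuming the package source lives next to the DLT notebook in a Repo (paths and module names are hypothetical):

import sys

# Make the repo's source folder importable instead of installing a wheel
sys.path.append("/Workspace/Repos/<user>/<repo>/src")

from my_transforms.udfs import Transforms  # hypothetical module from the repo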

tomasz
New Contributor III

Does it give you an error specifically on the %pip command when running the DLT pipeline, or does it fail in some other way?

If it's the former, could you share the path format that you're using in the %pip command?

@Tomasz Bacewicz

Thanks for your reply!

We are using the below command as the first cell in the DLT notebook:

%pip install /dbfs/dist/abnamro_acdpt_centraldatapoint-0.12.0.dev24-py3-none-any.whl

FYI:

When we manually install the same wheel on the Job Cluster that the DLT pipeline creates, it installs fine.

Also, when we run the same pip install command on the All-Purpose cluster, it installs fine.

It fails only when it is run from the DLT pipeline.
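As a side note on path formats: %pip needs a path that the driver's local filesystem can read, which is why the FUSE-style /dbfs/... form is used rather than the dbfs:/ URI. A hypothetical illustration (wheel name is a placeholder):

# FUSE mount path: readable by pip on the driver
%pip install /dbfs/dist/my_transforms-0.1.0-py3-none-any.whl

# dbfs:/ URI: not a local path, pip cannot resolve it
# %pip install dbfs:/dist/my_transforms-0.1.0-py3-none-any.whl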

Makes sense, good to know that it works manually. Can you also share the error that you get?

It said "it could not find the whl file"

Upon investigation we found that our library sits in Nexus and that a cluster environment variable needed to be set up.

And when we added the below to the DLT pipeline settings JSON,

"spark_env_vars": {
    "PIP_INDEX_URL": "<URL for our repository>"
},

it worked.
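For completeness, a sketch of roughly where such a setting sits in the pipeline settings JSON; the cluster label and the index URL below are placeholders, and the exact layout may differ per workspace:

{
  "clusters": [
    {
      "label": "default",
      "spark_env_vars": {
        "PIP_INDEX_URL": "https://nexus.example.com/repository/pypi-all/simple"
      }
    }
  ]
}

With the index URL in place, pip on the DLT job cluster can resolve packages from the internal repository.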
