07-08-2022 08:28 AM
We have written a few Python functions (methods within a class) and packaged them as a wheel library.
In the as-is situation we install that wheel library on an All-Purpose cluster that we have already created.
It works fine.
In the to-be situation (Delta Live Tables) we want this wheel library to be installed as part of the DLT pipeline execution, because when the DLT pipeline runs it creates its own Job Cluster.
We use a lot of Python functions to do the transformations between the Silver and Gold layers.
Hence we want the wheel library (which has all the UDFs) to be installed on the Job Cluster that the DLT pipeline creates.
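For illustration, a minimal sketch of how one of those wheel-packaged UDFs is meant to be used in a DLT notebook (module, function, and table names below are placeholders, not our actual code):

    import dlt
    from pyspark.sql import functions as F

    # Placeholder import; in reality the UDF comes from our wheel library
    from our_wheel_package.udfs import normalize_amount  # assumed to be a pyspark UDF

    @dlt.table(name="gold_orders", comment="Gold table built from Silver with a wheel-packaged UDF")
    def gold_orders():
        return (
            dlt.read("silver_orders")  # placeholder Silver table name
            .withColumn("amount_normalized", normalize_amount(F.col("amount")))
        )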
When we execute %pip install <wheel library location in DBFS> as the first step in the DLT notebook, it does not seem to work.
But when we run %pip install numpy, it works.
It's important for us to have the wheel library installed on the Job Cluster created by the DLT pipeline.
Are we missing something?
Thanks
07-08-2022 09:55 AM
Are you sure that the DLT cluster sees your DBFS?
Alternatively, you can use "files in repos" instead.
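For example, you could first confirm that the path is actually reachable, and the files-in-repos variant would look something like the sketch below (paths are placeholders for your layout):

    # In a notebook cell: check that the wheel is visible at the expected DBFS path
    display(dbutils.fs.ls("dbfs:/dist/"))

    # Files-in-repos alternative: install the wheel from a repo path instead of DBFS
    # (%pip must still be the first cell of the DLT notebook)
    %pip install /Workspace/Repos/<user>/<repo>/dist/<wheel-file>.whl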
07-08-2022 10:01 AM
Does it give you an error when running the DLT pipeline specifically on the %pip command or does it not work in some other way?
If it's the former, could you share the path format that you're using for the %pip command path?
07-11-2022 05:24 AM
@Tomasz Bacewicz
Thanks for your reply!
We are using the command below as the first command (cell) in the DLT notebook:
%pip install /dbfs/dist/abnamro_acdpt_centraldatapoint-0.12.0.dev24-py3-none-any.whl
FYI:
When we manually install the same wheel on the Job Cluster that the DLT pipeline creates, it installs fine.
The same pip install command also works on the All-Purpose cluster.
It fails only when run from the DLT pipeline.
07-11-2022 06:52 AM
Makes sense, good to know that it works manually. Can you also share the error that you get?
07-11-2022 06:57 AM
It said "it could not find the whl file"
Upon investigation we found our library sits in nexus and the cluster environment variable should be setup.
And when added the below in DLT pipeline settings json,
"spark_env_vars": {
"PIP_INDEX_URL": "<URL for our repository>"
},
it worked.
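For completeness, a sketch of roughly where that setting sits in the full pipeline settings JSON (the pipeline name, notebook path, and repository URL are placeholders):

    {
      "name": "<pipeline-name>",
      "clusters": [
        {
          "label": "default",
          "spark_env_vars": {
            "PIP_INDEX_URL": "<URL for our repository>"
          }
        }
      ],
      "libraries": [
        { "notebook": { "path": "<path to the DLT notebook>" } }
      ]
    }

With PIP_INDEX_URL set on the pipeline's cluster, pip on the DLT job cluster resolves %pip install requests through the internal index instead of the public PyPI.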