Data Engineering

Pull JAR from private Maven repository (Azure Artifactory)

Dom1
New Contributor III

Hi,

I am currently struggling with the following task:

We want to push our code to a private repository (Azure Artifactory) and then pull it from Databricks when the job runs. This currently works only for wheels in a PyPI repo inside the Artifactory. I found some older comments saying that using a private Maven repository is not supported, but I was not able to find any documentation on this.

Can someone tell me whether private Maven repos are supported or not? It would be great to have something like an official source for this.

Thanks a lot

3 REPLIES

iyashk-DB
Databricks Employee

Databricks can install Maven libraries by coordinate and lets you point at a custom repository URL.
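For the unauthenticated case, the library spec is just a Maven coordinate plus a custom repo URL. A minimal sketch of that spec, written as a Python dict mirroring the REST API payload (the coordinates, org and feed names are placeholders for your own artifact and feed):

# Libraries entry you would attach to a job task or cluster, shown as a
# Python dict mirroring the REST API payload. All names are placeholders.
libraries = [
    {
        "maven": {
            "coordinates": "com.example:my-lib:1.0.0",  # groupId:artifactId:version
            "repo": "https://pkgs.dev.azure.com/ORG/_packaging/FEED/maven/v1",
        }
    }
]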

However, passing credentials for authenticated private Maven repositories directly through the Libraries UI/Jobs is not natively supported today and requires workarounds; this has been tracked internally as a product ask rather than a GA feature.

As a workaround for a private Maven host that requires authentication, you can use Apache Ivy settings via an init script to provide credentials and repository resolution, then let Ivy resolve the packages at cluster startup.

For this, you can create an ivysettings.xml file with credentials and point Spark to it; for newer runtimes, you can swap in a patched Ivy JAR to externalise the settings file for multiple repositories with authentication.
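This is not an official recipe, but a minimal sketch of what that could look like from a notebook, assuming dbutils is available and that the secret scope/key, feed URL, realm string and output path below are placeholders you would replace:

# Generate an ivysettings.xml with credentials for an authenticated feed and
# point Spark at it. The realm must match what your feed returns in its
# WWW-Authenticate header; everything below is a placeholder.
import os

pat = dbutils.secrets.get(scope="artifacts", key="maven-feed-pat")  # notebook/cluster context

ivysettings = f"""<ivysettings>
  <settings defaultResolver="private-chain"/>
  <!-- Ivy matches credentials on host + realm -->
  <credentials host="pkgs.dev.azure.com" realm="Azure Artifacts"
               username="any-user" passwd="{pat}"/>
  <resolvers>
    <chain name="private-chain">
      <ibiblio name="azure-feed" m2compatible="true"
               root="https://pkgs.dev.azure.com/ORG/_packaging/FEED/maven/v1"/>
      <ibiblio name="central" m2compatible="true"/>
    </chain>
  </resolvers>
</ivysettings>"""

path = "/dbfs/FileStore/ivy/ivysettings.xml"  # any path the cluster can read
os.makedirs(os.path.dirname(path), exist_ok=True)
with open(path, "w") as f:
    f.write(ivysettings)

# Then set spark.jars.ivySettings=/dbfs/FileStore/ivy/ivysettings.xml in the
# cluster's Spark config (or write the file from an init script) so Ivy
# resolves Maven coordinates through the authenticated chain at startup.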

Dom1
New Contributor III

Thanks for your help and your response. I will try your workaround and come back to you 🙂

I think a possible solution for us would also be to push the artifacts into a Databricks volume and then install the libraries from there. That way we would not need the workaround, but I struggle to understand what the best practices are for this case and how others solve this issue.
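Something like this is roughly what I have in mind; I have not verified it yet, and the catalog/schema/volume names and file names are just placeholders:

# Copy the built artifact into a Unity Catalog volume once (e.g. from a CI
# job or a staging location), then reference that path as a library.
src = "dbfs:/FileStore/staging/my-lib-1.0.0.jar"
dst = "/Volumes/main/shared/artifacts/my-lib-1.0.0.jar"
dbutils.fs.cp(src, dst)  # notebook context where dbutils is available

# A job task or cluster could then reference the volume path directly, e.g. a
# libraries entry of the form {"jar": "/Volumes/main/shared/artifacts/my-lib-1.0.0.jar"}
# (or {"whl": ...} for a Python wheel).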

Prajapathy_NKR
Contributor

Hi @Dom1 ,

One solution which I had implemented is to use an API to connect to the artifact store and download the latest artifact to the driver's storage (when you use curl to download the file, it gets downloaded onto the driver's disk), then move it to the required location in DBFS and install it.

The only difference is that I was using GitHub artifacts.

So, my suggestion is:
1. Use the API to connect. You can parameterize the notebook with branch info etc., so that you can frame the API call to pull from the respective branch.

2. Find your required package in the response. You can write a script for this.

3. Once you find it, download the package using curl or a requests call. The package will then be available on the driver's disk.

4. Using the %sh magic command, you can use the move command to move the package from the driver's disk to a DBFS location. (I am not exactly sure how a Unity Catalog volume is mounted on the driver.)

5. You are now ready to install the package. Since you mentioned that you want to pull the artifact on each job run, what I would recommend is to execute the above script before you run your main logic; this can be pipelined. Once the download and move are successful, you can use the %sh pip install command inside your main notebook to install the package. (A sketch of the whole flow is shown after these steps.)
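Here is a rough, illustrative sketch of steps 1-5 in Python, assuming a notebook context with dbutils. The feed endpoint, response shape, secret names, package name and target paths are all placeholders, since my own setup used GitHub artifacts; adapt it to the Azure Artifactory API you are calling:

# Steps 1-5 using Python requests instead of curl. Everything below is a
# placeholder you would replace with your own feed and paths.
import shutil
import subprocess
import requests

token = dbutils.secrets.get(scope="artifacts", key="feed-token")  # notebook context
headers = {"Authorization": f"Bearer {token}"}

# 1-2. Call the artifact store's API and pick the package you need from the response.
listing = requests.get(
    "https://example.com/api/my-feed/packages?branch=main",  # hypothetical endpoint
    headers=headers,
    timeout=60,
).json()
download_url = listing["packages"][0]["downloadUrl"]  # depends on the real response shape

# 3. Download the package to the driver's local disk.
local_path = "/tmp/my_lib-1.0.0-py3-none-any.whl"
with requests.get(download_url, headers=headers, stream=True, timeout=300) as r:
    r.raise_for_status()
    with open(local_path, "wb") as f:
        shutil.copyfileobj(r.raw, f)

# 4. Move it from the driver's disk to a shared location (DBFS or a UC volume).
shared_path = "/Volumes/main/shared/artifacts/my_lib-1.0.0-py3-none-any.whl"
dbutils.fs.cp(f"file:{local_path}", shared_path)

# 5. Install it before the main logic runs (same effect as %sh pip install).
subprocess.run(["pip", "install", shared_path], check=True)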

Hope this helps. 🙂