Data Engineering

Issue loading spark Scala library

Anonymous
Not applicable

We have a proprietary Spark Scala library that I need for my work. We build a release version once a week and store it in a specific S3 location (so the most up-to-date prod version is always stored in the same place). But so far I can't figure out a reasonable way to load the library that isn't a huge pain. In an ideal world, my clusters would automatically install the JAR from the prod folder on S3 when they start up. Lots of people at the company rely on this library. Do you know if there's a way to achieve this? Thanks!

2 REPLIES

sean_owen
Honored Contributor II

There's not a great answer for a JVM library. You can create a Library entity in the workspace based on a particular JAR you put somewhere on S3, and attach it to a cluster. But it's static: it won't pick up a new version of the JAR in another location, so you would have to recreate the Library entity each time the JAR changes.

I think you might get away with a lower-level approach: add the JAR location to the Spark classpath in your cluster config, and make sure that location always holds the latest JAR. On every cluster launch, the cluster would read and deploy that latest JAR. A little manual, but closer to what you want.
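As a rough sketch of that approach, a cluster-scoped init script could copy the latest JAR onto the classpath at startup. Everything here is hypothetical: the bucket and key are placeholders, and it assumes the cluster's instance profile grants read access to the bucket and that the AWS CLI is available on the node. `/databricks/jars` is the directory Databricks clusters put on the driver and executor classpath.

```bash
#!/bin/bash
# Hypothetical cluster-scoped init script: fetch the current weekly release
# at cluster startup so every new cluster gets the latest prod JAR.
set -euo pipefail

# The weekly build is always published to the same key (placeholder path).
# JARs placed in /databricks/jars are picked up on the cluster classpath.
aws s3 cp s3://our-company-releases/prod/ourlib-latest.jar \
  /databricks/jars/ourlib-latest.jar
```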

User16857281974
Contributor

Databricks' curriculum team solved this problem by creating our own Maven repo, and it's easier than it sounds. We took an S3 bucket, enabled static website hosting on it (allowing standard file downloads over HTTP), and created a "repo" folder inside it. From there, all we had to do was follow the standard Maven directory convention to create a repo with only one or two files.
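To illustrate, the standard Maven convention maps a coordinate to a `<groupId-as-path>/<artifactId>/<version>/` directory under the repo root, so publishing can be as simple as two copies. The coordinates `com.example:ourlib:1.0.0` and the bucket name below are hypothetical:

```bash
# Hypothetical coordinates com.example:ourlib:1.0.0 resolve against
# <repo-root>/com/example/ourlib/1.0.0/ under the Maven convention.
aws s3 cp ourlib-1.0.0.jar s3://our-bucket/repo/com/example/ourlib/1.0.0/ourlib-1.0.0.jar
# The .pom file is what lets Maven resolve the artifact and its dependencies.
aws s3 cp ourlib-1.0.0.pom s3://our-bucket/repo/com/example/ourlib/1.0.0/ourlib-1.0.0.pom
```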

With your repo in place, you simply create your cluster, and when defining the Maven library, include the extra attribute that points at your custom repo (see the sketch below).
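For example, through the Databricks Libraries API the custom repo is just an extra field alongside the coordinates. The host, token, cluster ID, coordinates, and repo URL below are all placeholders for your own values:

```bash
# Install a Maven library from a custom repo on a running cluster.
# $DATABRICKS_HOST, $DATABRICKS_TOKEN, the cluster ID, the coordinates,
# and the repo URL are placeholders.
curl -X POST "$DATABRICKS_HOST/api/2.0/libraries/install" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_id": "1234-567890-abcde123",
        "libraries": [{
          "maven": {
            "coordinates": "com.example:ourlib:1.0.0",
            "repo": "http://our-bucket.s3-website-us-east-1.amazonaws.com/repo"
          }
        }]
      }'
```

The same `maven` block works when defining the library in the cluster UI or in a cluster spec created through the APIs.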

You can use the standard versioning scheme if you want, or simply replace the same file every time you update it. Either way, your manually created cluster, or a cluster created through the APIs, will install your library at startup.

You can see below how we used S3 to create our repo and host this custom copy of bigdl.

[Screenshot: S3 bucket showing the Maven directory layout for the custom bigdl repo]
