Data Engineering

Issue loading spark Scala library

Anonymous
Not applicable

We have a proprietary Spark Scala library that I need for my work. We build a release version once a week and store it in a specific S3 location (so the most up-to-date prod version is always stored in the same place). But so far I can't figure out a reasonable way to load the library that isn't a huge pain. In an ideal world, my clusters would automatically install the JAR from the prod folder on S3 when they start up. Lots of people at the company rely on this library. Do you know if there's a way to achieve this? Thanks!

2 REPLIES

sean_owen
Honored Contributor II

There's not a great answer for a JVM library. You can create a Library entity in the workspace based on a particular JAR you put somewhere on S3, and attach it to a cluster. But it's static: it won't pick up a new version of the JAR in another location, so you would have to recreate the Library entity each time the JAR changes.

I think you might get away with a lower-level approach: add the JAR location to the Spark classpath in your cluster config, and make sure that location always holds the latest JAR. On every cluster launch, the cluster would read and deploy that latest JAR. A little manual, but closer to what you want.
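As a rough sketch of that approach, a cluster-scoped init script could copy the latest JAR onto the classpath at startup. Everything here is hypothetical: the bucket and key are placeholders, and it assumes the cluster's instance profile grants read access to the bucket and that the AWS CLI is available on the node. `/databricks/jars` is the directory Databricks clusters put on the driver and executor classpath.

```bash
#!/bin/bash
# Hypothetical cluster-scoped init script: fetch the current weekly release
# at cluster startup so every new cluster gets the latest prod JAR.
set -euo pipefail

# The weekly build is always published to the same key (placeholder path).
# JARs placed in /databricks/jars are picked up on the cluster classpath.
aws s3 cp s3://our-company-releases/prod/ourlib-latest.jar \
  /databricks/jars/ourlib-latest.jar
```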

User16857281974
Contributor

Databricks' curriculum team solved this problem by creating our own Maven repo, and it's easier than it sounds. We took an S3 bucket, enabled static website hosting on it (allowing standard file downloads over HTTP), and created a "repo" folder inside it. From there, all we had to do was follow the standard Maven directory convention to create a repo with only one or two files.
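To illustrate, the standard Maven convention maps a coordinate to a `<groupId-as-path>/<artifactId>/<version>/` directory under the repo root, so publishing can be as simple as two copies. The coordinates `com.example:ourlib:1.0.0` and the bucket name below are hypothetical:

```bash
# Hypothetical coordinates com.example:ourlib:1.0.0 resolve against
# <repo-root>/com/example/ourlib/1.0.0/ under the Maven convention.
aws s3 cp ourlib-1.0.0.jar s3://our-bucket/repo/com/example/ourlib/1.0.0/ourlib-1.0.0.jar
# The .pom file is what lets Maven resolve the artifact and its dependencies.
aws s3 cp ourlib-1.0.0.pom s3://our-bucket/repo/com/example/ourlib/1.0.0/ourlib-1.0.0.pom
```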

With your repo in place, you simply create your cluster, and when defining the Maven library, include the extra attribute that points at your custom repo (see the sketch below).
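For example, through the Databricks Libraries API the custom repo is just an extra field alongside the coordinates. The host, token, cluster ID, coordinates, and repo URL below are all placeholders for your own values:

```bash
# Install a Maven library from a custom repo on a running cluster.
# $DATABRICKS_HOST, $DATABRICKS_TOKEN, the cluster ID, the coordinates,
# and the repo URL are placeholders.
curl -X POST "$DATABRICKS_HOST/api/2.0/libraries/install" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "cluster_id": "1234-567890-abcde123",
        "libraries": [{
          "maven": {
            "coordinates": "com.example:ourlib:1.0.0",
            "repo": "http://our-bucket.s3-website-us-east-1.amazonaws.com/repo"
          }
        }]
      }'
```

The same `maven` block works when defining the library in the cluster UI or in a cluster spec created through the APIs.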

You can use the standard versioning scheme if you want, or simply replace the same file every time you update it. Either way, your manually created cluster, or a cluster created through the APIs, will install your library at startup.

You can see below how we used S3 to create our repo and host this custom copy of bigdl.

[Screenshot: S3 bucket showing the Maven directory layout for the custom bigdl repo]
