Issue loading Spark Scala library

06-10-2021 07:30 PM
We have a proprietary Spark Scala library that is necessary for me to do my work. We build a release version once a week and store it in a specific S3 location (so the most up-to-date prod version is always in the same place). But so far I can't figure out a reasonable way to load the library that isn't a huge pain. In an ideal world, my clusters would automatically install the JAR from the prod folder on S3 when they start up. Tons of people at the company rely on this library. Do you know if there's a way to achieve this? Thanks!
Labels: Scala
06-17-2021 04:27 PM
There's not a great answer for a JVM library. You can create a Library entity in the workspace based on a particular JAR you put somewhere on S3, and attach that to a cluster. But it's static and won't pick up a new version of the JAR in another location; you would have to re-upload the JAR and recreate the Library each time.
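If you want to script that attach step instead of clicking through the UI, one option is the Libraries API. Here is a minimal sketch, assuming the workspace can read the bucket (e.g. via an instance profile); the workspace URL, token, cluster ID, and S3 path are all placeholders.

```python
import requests

# Placeholders -- substitute your workspace URL, a personal access token,
# the target cluster ID, and the S3 path where the weekly JAR lands.
HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

payload = {
    "cluster_id": "<cluster-id>",
    "libraries": [
        {"jar": "s3://<your-bucket>/prod/latest/yourlib.jar"}
    ],
}

# POST /api/2.0/libraries/install attaches the JAR to the given cluster.
resp = requests.post(
    f"{HOST}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
)
resp.raise_for_status()
```

You would still need to re-run this whenever a new JAR is published (and typically restart the cluster so the old version is unloaded), so it automates the manual step rather than removing it.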
I think you might get away with a lower-level approach, which is to add the JAR location to the Spark classpath in your cluster config (for example, via an init script) and make sure that location always holds the latest JAR. On every cluster launch, it would pick up and deploy that latest JAR. A little manual, but closer to what you want.
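For reference, here is a rough sketch of that lower-level approach as a cluster-scoped init script. The mount point and JAR name are made up, and it assumes the prod S3 folder is already mounted (or otherwise reachable) under /dbfs; adjust to however you expose the bucket.

```python
# Run once from a notebook: writes an init script to DBFS. Point the cluster
# at it (Advanced Options > Init Scripts) so it runs on every launch.
# The mount point and JAR name below are examples, not real paths.
dbutils.fs.put(
    "dbfs:/databricks/init-scripts/install-prod-lib.sh",
    """#!/bin/bash
# Copy whatever JAR is currently in the prod folder onto the cluster classpath.
cp /dbfs/mnt/prod-libs/yourlib-latest.jar /databricks/jars/
""",
    True,  # overwrite if the script already exists
)
```

Because the script runs on every cluster start, each launch picks up whatever JAR is sitting in the prod folder at that moment, which is close to the behavior you're asking for.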
07-30-2021 03:10 PM
Databricks' curriculum team solved this problem by creating our own Maven repo, and it's easier than it sounds. We took an S3 bucket, enabled static website hosting on it (allowing standard file downloads), and created a "repo" folder inside it. From there, all we had to do was follow the standard Maven directory convention to create a repo with only 1-2 files.
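For illustration, the layout under the bucket's website root ends up looking roughly like this; the group, artifact, and version here are invented, and for a fixed version a JAR plus a matching .pom is enough (maven-metadata.xml only becomes necessary if you want Maven to resolve SNAPSHOT or "latest" versions):

```
repo/
  com/
    yourcompany/
      yourlib/
        1.0.0/
          yourlib-1.0.0.jar
          yourlib-1.0.0.pom
```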
With your repo in place, you simply create your cluster and, when defining the Maven library, include the extra attribute that identifies your custom repo.
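Concretely, the Maven library definition just gains a repo field pointing at the bucket's website endpoint. Something like the sketch below (the coordinates and URL are placeholders), whether you enter it in the cluster UI or pass it to the Libraries API:

```python
# Maven library spec with a custom repo, e.g. as part of a Libraries API
# install payload; the coordinates and website URL are placeholders.
maven_library = {
    "maven": {
        "coordinates": "com.yourcompany:yourlib:1.0.0",
        "repo": "http://<your-bucket>.s3-website-us-east-1.amazonaws.com/repo",
    }
}
```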
You can use the standard versioning scheme if you want to, or simply replace the same file every time you update it. Either way, a manually created cluster or a cluster created through the APIs will install your library at startup.
You can see here how we used S3 to create our repo and host a custom copy of BigDL.

