
How to make sparklyr extension work with Databricks runtime?

yitao
New Contributor III

Hello. I'm the current maintainer of sparklyr (an R interface for Apache Spark) and of a few sparklyr extensions such as sparklyr.flint.

Sparklyr was fortunate to receive some contributions from Databricks folks, which enabled R users to run `spark_connect(method = "databricks")` to connect to the Databricks Runtime.

My question is how to make this type of Spark connection in R work with sparklyr extensions (e.g., see https://github.com/r-spark/sparklyr.flint/issues/55 -- this is something I don't have a good answer for at the moment, because I'm not very familiar with how Databricks connections work with sparklyr internally).

A bit more context: sparklyr.flint is an R interface for the Flint time series library that works on top of sparklyr. Usually, users run code such as the following:

library(sparklyr)
library(sparklyr.flint)  # attaching the extension before connecting registers its JARs with sparklyr

sc <- spark_connect(master = "yarn-client", spark_home = "/usr/lib/spark")

The presence of sparklyr.flint as a sparklyr extension causes sparklyr to fetch a version of the Flint time series library JAR files and load them into the Spark session it is connecting to.
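(For anyone unfamiliar with the mechanism: a sparklyr extension typically declares its JARs through sparklyr's spark_dependencies()/register_extension() API, roughly as in the sketch below. The Maven coordinates shown are illustrative, not the exact ones sparklyr.flint uses.)

spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    # resolved from a Maven-style repository and added to the Spark session
    # that spark_connect() starts
    packages = sprintf("org.example:flint_%s:0.6.0", scala_version)
  )
}

.onLoad <- function(libname, pkgname) {
  # tells sparklyr that this package contributes Spark dependencies
  sparklyr::register_extension(pkgname)
}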

But this doesn't work if we replace the `sc <- spark_connect(...)` call above with `sc <- spark_connect(method = "databricks")` (again, see https://github.com/r-spark/sparklyr.flint/issues/55 for details). My uneducated guess is that `method = "databricks"` involves some level of indirection in the connect-to-Spark step, and the Flint time series JAR files end up downloaded to the wrong location.

I'm wondering whether there is a simple change I can make to sparklyr so that sparklyr extensions also work on Databricks. Your input would be greatly appreciated.

Thanks!


6 REPLIES

Kaniz
Community Manager

Hi @yitao! My name is Kaniz, and I'm a technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers on the forum have an answer to your question first; if not, I will follow up shortly with a response.

Sebastian
Contributor

Just like any other R library, you can use an init script that copies the library to the cluster's R runtime. I manage all libraries with init scripts, either a global init script or a local one at the cluster level: store the libraries in a mount, and during cluster boot-up run a copy command to move them into the runtime.
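If the Flint JAR is already on the cluster (for example, copied to a fixed path by such an init script), one thing worth trying from the R side is to point sparklyr at it explicitly when connecting. This is only a sketch: the path below is a placeholder, and whether `method = "databricks"` honors these options is exactly the open question here.

library(sparklyr)

config <- spark_config()
# "sparklyr.jars.default" is sparklyr's config option for extra JARs;
# the path is a placeholder for wherever the init script put the JAR
config[["sparklyr.jars.default"]] <- "/dbfs/FileStore/jars/flint-assembly.jar"

sc <- spark_connect(method = "databricks", config = config)

# Alternatively, after connecting, SparkContext.addJar() ships a JAR to the
# executors (it does not change the driver classpath, though):
invoke(spark_context(sc), "addJar", "/dbfs/FileStore/jars/flint-assembly.jar")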

Dan_Z
Honored Contributor

Yes, as Sebastian said. It would also be good to know what the error is here. One possible explanation is that the JARs are not being copied to the executor nodes, which would be solved by Sebastian's suggestion.

yitao
New Contributor III

Thanks for the answers, everyone!

Two follow-up questions:

  • Is it possible to package the init script together with the R package itself? Ideally the script should be self-contained and should not require additional user input; it should figure out where to install the JARs on a Databricks cluster from config files and (maybe) environment variables.
  • If the answer to the first question is 'yes', is there an example R package that has solved this kind of problem successfully with a pre-packaged init script?

Thanks again for your help.

Dan_Z
Honored Contributor

No, the init scripts run before Spark starts or any packages get loaded, so any dependencies would need to be stated explicitly. Also, I think that if the user runs install.packages("your_library"), Databricks will automatically install it on all nodes; installing through the library UI does this as well. But we're just proposing solutions based on hypotheses here. We would really need to know what error you are seeing to tell.

Typically, whatever R library you are installing on the cluster should ALSO install the JAR files. My guess is that R's arrow package does something like this, but I'm not sure; it definitely installs the underlying C++ dependency. I'm not sure whether there is also a Java component.
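For what it's worth, one pattern along those lines (just a sketch, not how sparklyr.flint is built today) is to bundle the JAR inside the R package itself, e.g. under inst/java/, so that installing the package on every node also installs the JAR, and then have spark_dependencies() point at the local file:

spark_dependencies <- function(spark_version, scala_version, ...) {
  # hypothetical layout: the JAR ships with the package under inst/java/
  jar_path <- system.file(
    "java", sprintf("flint_%s.jar", scala_version),
    package = "sparklyr.flint"
  )
  sparklyr::spark_dependency(jars = jar_path)
}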

Kaniz
Community Manager

Hi @yitao, just a friendly follow-up. Do you still need help, or did the above responses help you find a solution? Please let us know.
