08-24-2021 09:06 AM
Hello. I'm the current maintainer of sparklyr (an R interface for Apache Spark) and of a few sparklyr extensions such as sparklyr.flint.
Sparklyr was fortunate to receive some contributions from Databricks folks, which enabled R users to run `spark_connect(method = "databricks")` to connect to Databricks Runtime.
My question is how to make this type of Spark connection work with sparklyr extensions (e.g., see https://github.com/r-spark/sparklyr.flint/issues/55). I don't have a good answer for this at the moment because I'm not very familiar with how Databricks connections work with sparklyr internally.
A bit more context: sparklyr.flint is an R interface for the Flint time series library that works on top of sparklyr. Usually, users run code such as the following:
library(sparklyr)
library(sparklyr.flint)  # loading the extension before connecting lets sparklyr pick up its jar dependencies

sc <- spark_connect(master = "yarn-client", spark_home = "/usr/lib/spark")
Because sparklyr.flint is present as a sparklyr extension, sparklyr fetches the appropriate version of the Flint time series jar files and loads them into the Spark session it is connecting to.
But this does not work if we replace the `sc <- spark_connect(...)` call above with `sc <- spark_connect(method = "databricks")` (again, see https://github.com/r-spark/sparklyr.flint/issues/55 for details). My uneducated guess is that `method = "databricks"` involves some level of indirection in the connection step, and the Flint jar files end up downloaded to the wrong location.
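For reference, one manual workaround I have considered (untested, and a sketch only) is to hand sparklyr the Flint jar explicitly through the `sparklyr.jars.default` config option; the jar path below is purely illustrative, and whether the `databricks` method honors this option is exactly the open question:

library(sparklyr)
library(sparklyr.flint)

config <- spark_config()
# Illustrative path only -- substitute wherever the Flint jar is actually staged
# on the cluster (e.g. a DBFS location populated ahead of time).
config[["sparklyr.jars.default"]] <- "/dbfs/FileStore/jars/flint-assembly.jar"

sc <- spark_connect(method = "databricks", config = config)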
I'm wondering whether there is some simple change I can make to sparklyr to ensure sparklyr extensions also work on Databricks. Your input would be greatly appreciated.
Thanks!
09-08-2021 10:44 AM
Just like any other R library, you can use an init script that copies the library into the cluster's R runtime. I manage all libraries either with a global init script or a cluster-level one: store the library in a mount and, during cluster boot, run a copy command to move it into the runtime.
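A minimal sketch of that idea from the R side (a real init script would normally do the equivalent in shell before the session starts; the package name, mount path, and target directory below are assumptions for illustration):

# Install the extension into the cluster-wide R library (once per cluster,
# e.g. from a notebook cell or via the cluster's library configuration).
install.packages("sparklyr.flint")

# If the Flint jar itself also needs to be on the Spark classpath, copy it from
# a mount into a location the cluster picks up at startup (paths are illustrative):
file.copy(
  from = "/dbfs/mnt/libs/flint-assembly.jar",
  to = "/databricks/jars/",
  overwrite = TRUE
)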
09-09-2021 01:43 PM
Yes, as Sebastian said. Also, it would be good to know what the error is here. One possible explanation is that the JARs are not copied to the executor nodes, which Sebastian's suggestion would solve.
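One quick way to check from R whether the Flint classes actually made it onto the classpath of the session (a sketch; the class name is Flint's usual entry point and may differ by version):

library(sparklyr)
sc <- spark_connect(method = "databricks")

# If the jar was not loaded, this fails with a ClassNotFoundException.
invoke_static(sc, "java.lang.Class", "forName",
              "com.twosigma.flint.timeseries.TimeSeriesRDD")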
09-09-2021 01:49 PM
Thanks for the answers everyone!
Two follow-up questions:
Again thanks a lot for your help.
10-13-2021 05:04 PM
No, the init scripts run before Spark starts or any packages get loaded, so any dependencies will need to be stated explicitly. Also, I think if the user runs install.packages("your_library"), Databricks will automatically install it on all nodes; installing through the library UI does this as well. But we're just proposing solutions based on hypotheses here; we would really need to know what error you are seeing.
Typically, whatever R library you are installing on the cluster should ALSO install the JAR files. My guess is that R's arrow package does this, but I'm not sure. It definitely installs the underlying C++ dependency; I'm not sure whether there is also a Java component.
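For context, a sparklyr extension typically declares its jar dependency through the hook sketched below (simplified; the Maven coordinates shown are illustrative, not sparklyr.flint's actual ones). This declaration is what the `databricks` connection method would need to honor for the extension to work out of the box:

# In the extension package, sparklyr calls this hook at connection time
# to learn which jars/packages to put on the session's classpath.
spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    # Coordinates are illustrative only -- sparklyr.flint resolves its own artifact.
    packages = sprintf("com.twosigma:flint_%s:0.6.0", scala_version)
  )
}

.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}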