08-24-2021 09:06 AM
Hello. I'm the current maintainer of sparklyr (an R interface for Apache Spark) and of a few sparklyr extensions such as sparklyr.flint.
Sparklyr was fortunate to receive some contributions from Databricks folks, which enabled R users to run `spark_connect(method = "databricks")` to connect to Databricks Runtime.
My question is how to make this type of Spark connection work with sparklyr extensions (e.g., see https://github.com/r-spark/sparklyr.flint/issues/55). I don't have a good answer for this at the moment because I'm not very familiar with how Databricks connections work with sparklyr internally.
A bit more context: sparklyr.flint is an R interface for the Flint time series library that works on top of sparklyr. Usually, users run code such as the following:
library(sparklyr)
library(sparklyr.flint)  # loading the extension before connecting lets sparklyr pick up its jar dependencies

sc <- spark_connect(master = "yarn-client", spark_home = "/usr/lib/spark")
Because sparklyr.flint is present as a sparklyr extension, sparklyr fetches the appropriate version of the Flint time series jar files and loads them into the Spark session it is connecting to.
But this does not work if we replace the `sc <- spark_connect(...)` call above with `sc <- spark_connect(method = "databricks")` (again, see https://github.com/r-spark/sparklyr.flint/issues/55 for details). My uneducated guess is that `method = "databricks"` involves some level of indirection in the connection step, and the Flint jar files end up downloaded to the wrong location.
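For reference, one manual workaround I have considered (untested, and a sketch only) is to hand sparklyr the Flint jar explicitly through the `sparklyr.jars.default` config option; the jar path below is purely illustrative, and whether the `databricks` method honors this option is exactly the open question:

library(sparklyr)
library(sparklyr.flint)

config <- spark_config()
# Illustrative path only -- substitute wherever the Flint jar is actually staged
# on the cluster (e.g. a DBFS location populated ahead of time).
config[["sparklyr.jars.default"]] <- "/dbfs/FileStore/jars/flint-assembly.jar"

sc <- spark_connect(method = "databricks", config = config)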
I'm wondering whether there is some simple change I can make to sparklyr to ensure sparklyr extensions also work on Databricks. Your input would be greatly appreciated.
Thanks!
09-08-2021 10:44 AM
Just like any other R library, you can use an init script that copies the library into the cluster's R runtime. I manage all libraries either with a global init script or a cluster-level one: store the library in a mount and, during cluster boot, run a copy command to move it into the runtime.
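A minimal sketch of that idea from the R side (a real init script would normally do the equivalent in shell before the session starts; the package name, mount path, and target directory below are assumptions for illustration):

# Install the extension into the cluster-wide R library (once per cluster,
# e.g. from a notebook cell or via the cluster's library configuration).
install.packages("sparklyr.flint")

# If the Flint jar itself also needs to be on the Spark classpath, copy it from
# a mount into a location the cluster picks up at startup (paths are illustrative):
file.copy(
  from = "/dbfs/mnt/libs/flint-assembly.jar",
  to = "/databricks/jars/",
  overwrite = TRUE
)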
09-09-2021 01:43 PM
Yes, as Sebastian said. Also, it would be good to know what the error is here. One possible explanation is that the JARs are not copied to the executor nodes, which Sebastian's suggestion would solve.
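One quick way to check from R whether the Flint classes actually made it onto the classpath of the session (a sketch; the class name is Flint's usual entry point and may differ by version):

library(sparklyr)
sc <- spark_connect(method = "databricks")

# If the jar was not loaded, this fails with a ClassNotFoundException.
invoke_static(sc, "java.lang.Class", "forName",
              "com.twosigma.flint.timeseries.TimeSeriesRDD")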
09-09-2021 01:49 PM
Thanks for the answers everyone!
Two follow-up questions:
Again thanks a lot for your help.
10-13-2021 05:04 PM
No, the init scripts run before Spark starts or any packages get loaded, so any dependencies will need to be stated explicitly. Also, I think if the user runs install.packages("your_library"), Databricks will automatically install it on all nodes; installing through the library UI does this as well. But we're just proposing solutions based on hypotheses here; we would really need to know what error you are seeing.
Typically, whatever R library you are installing on the cluster should ALSO install the JAR files. My guess is that R's arrow package does this, but I'm not sure. It definitely installs the underlying C++ dependency; I'm not sure whether there is also a Java component.
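For context, a sparklyr extension typically declares its jar dependency through the hook sketched below (simplified; the Maven coordinates shown are illustrative, not sparklyr.flint's actual ones). This declaration is what the `databricks` connection method would need to honor for the extension to work out of the box:

# In the extension package, sparklyr calls this hook at connection time
# to learn which jars/packages to put on the session's classpath.
spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    # Coordinates are illustrative only -- sparklyr.flint resolves its own artifact.
    packages = sprintf("com.twosigma:flint_%s:0.6.0", scala_version)
  )
}

.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}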