DB1To3
New Contributor III

>> there is no native R UDF pathway over the wire. sparklyr works around this using rpy2, a Python library that embeds and executes R code

This is interesting. I would not have thought of Python as the best runtime for bridging. I'm wondering if this involves yet another out-of-process hop, along with additional serialization and deserialization of the Arrow data frames. I'm also wondering why it wouldn't be possible to simply bypass Python and launch the R UDFs directly from the Spark core (JVM). It seems like there are a lot of hops involved in executing R logic on the executors.

I will try to find this community of sparklyr users to learn more. I'm guessing these folks can be found on the GitHub project and on r/stats.


I had hoped that sparklyr would be an official Databricks replacement for SparkR. I'm guessing that users in this community would have a hard time getting official support if things should ever break after a new release of the Databricks runtime. I suppose the users rely on one another in the community for support.

I hope that Databricks will have some guidance for running non-Python UDFs in the future. Most of our internal libraries are built on the .NET runtime (.NET Core), and Databricks seems to be deliberately neglecting that ecosystem for some reason. I think the C#/.NET community is still growing at a VERY fast rate and may even overtake Java itself in a couple of years. Despite this large community of potential Spark customers, I get the sense that Databricks has no interest in removing the barriers to entry for all of these folks. I certainly agree that Python integrations are always popular (because the language, runtime, and tools are free and widely accessible). However, it seems quite strange to me that Databricks would start making accommodations for R/Go/Rust while doing almost nothing for C#/.NET developers. I'd guess there has to be some unfortunate politics behind this strategy. It goes beyond Databricks. Despite the fact that C# is an open-source platform nowadays, open-source communities like Apache still don't want to accept it.