Sunday - last edited Sunday
>> there is no native R UDF pathway over the wire. sparklyr works around this using rpy2, a Python library that embeds and executes R code
This is interesting. I would not have thought of Python as the best runtime for bridging. I'm wondering whether this involves yet another out-of-process hop, together with more serialization and deserialization of the Arrow data frames. I'm also wondering why it wouldn't be possible to simply bypass Python and launch the R UDFs directly from the Spark core (JVM). It seems like there are a lot of hops to execute R logic on the executors.
I will try to find this community of sparklyr users to learn more. I'm guessing these folks can be found on the GitHub project and on r/stats.
I had hoped that sparklyr would be an official Databricks replacement for SparkR. I'm guessing that users in this community would have a hard time getting official support if things ever break after a new release of the Databricks runtime. I suppose the users rely on one another in the community for support.
I hope that Databricks will have some guidance for running non-Python UDFs in the future. Most of our internal libraries are built on the .NET runtime (.NET Core), and Databricks seems to be deliberately neglecting that ecosystem for some reason. I think the C#/.NET community is still growing at a VERY fast rate and may even overtake Java itself in a couple of years. Despite this large community of potential Spark customers, I get the sense that Databricks has no interest in removing the barriers to entry for all of these folks. I certainly agree that Python integrations are always popular (because the language, runtime, and tools are free and widely accessible). However, it seems quite strange to me that Databricks would start making accommodations for R/Go/Rust while doing almost nothing for C#/.NET developers. I'd guess there has to be some unfortunate politics behind this strategy. It goes beyond Databricks. Despite the fact that C# is an open-source platform nowadays, open-source communities like Apache still don't want to accept it.