03-14-2022 07:34 AM
I’ve been struggling with using R in Databricks. After reading “Mastering Spark with R,” I believe my initial problems stemmed from not understanding the difference between Spark DataFrames and R DataFrames within the Databricks environment. Now that I know many R functions only work with R DataFrames, I’ve become quite familiar with collect() and copy_to() for converting back and forth between the two types. So my question is: are there any rules of thumb for working with Spark and R DataFrames when using R in Databricks? It seems as though I am converting back and forth a lot.
Jeff
Labels: Spark DataFrames
Accepted Solutions
03-15-2022 12:41 AM
ggplot2 is not included by default, I believe, so you will have to install it yourself.
https://spark-packages.org/package/SKKU-SKT/ggplot2.SparkR
http://papl-skku.github.io/ggplot2.SparkR/index
As it is a popular package, there is a real chance it will be included in the future.
03-14-2022 08:18 AM
Since Spark DataFrames are processed in a distributed way on the workers, it is better to stay with Spark DataFrames. Additionally, collect() is executed on the driver and pulls the whole dataset into driver memory, so it shouldn't be used in production.
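To illustrate the point about collect(), here is a minimal sketch of one common pattern: do the filtering and aggregation on the Spark DataFrame, and only collect() the small result. It assumes a Databricks notebook with sparklyr and dplyr installed; the table name `mtcars_demo` and the use of the built-in `mtcars` data are purely illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumption: running inside a Databricks notebook with sparklyr installed.
sc <- spark_connect(method = "databricks")

# Demo data only: copy a small built-in R data frame to Spark.
# "mtcars_demo" is a hypothetical table name.
cars_tbl <- copy_to(sc, mtcars, name = "mtcars_demo", overwrite = TRUE)

# These dplyr verbs are translated to Spark SQL and run on the cluster;
# nothing is pulled to the driver at this point.
mpg_by_cyl <- cars_tbl %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg, na.rm = TRUE), n_cars = n())

# collect() only the small aggregated result into an R data frame.
mpg_by_cyl_local <- collect(mpg_by_cyl)
```

The idea is that the driver only ever holds a few summary rows rather than the full dataset, which is usually what makes collect() tolerable.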
03-14-2022 08:45 AM
That certainly makes sense, but I've run into a number of R functions which error out on Spark DataFrames. For example, geohashTools and ggplot2 (ggplot2 in particular) only work with R DataFrames, as I understand it.
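For plotting specifically, one possible approach is sketched below: reduce the data in Spark to just what the chart needs, collect() that, and hand the resulting R data frame to ggplot2. This assumes sparklyr, dplyr, and ggplot2 are all available on the cluster; the table name `mtcars_plot_demo` and the `mtcars` data are illustrative only.

```r
library(sparklyr)
library(dplyr)
library(ggplot2)

# Assumption: running inside a Databricks notebook with these packages installed.
sc <- spark_connect(method = "databricks")

# Demo data only; "mtcars_plot_demo" is a hypothetical table name.
cars_tbl <- copy_to(sc, mtcars, name = "mtcars_plot_demo", overwrite = TRUE)

# Count rows per group in Spark, then bring only the counts to the driver.
plot_data <- cars_tbl %>%
  count(cyl) %>%
  collect()

# plot_data is now an ordinary R data frame, so ggplot2 accepts it.
ggplot(plot_data, aes(x = factor(cyl), y = n)) +
  geom_col() +
  labs(x = "Cylinders", y = "Number of cars")
```

This keeps the conversion to a single collect() at the very end, instead of moving the full data back and forth.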

