Data Engineering

Understand Spark DataFrames versus R DataFrames

Jeff1
Contributor II

Community

I’ve been struggling to use R in Databricks, and after reading “Mastering Spark with R” I believe my initial problems stemmed from not understanding the difference between Spark DataFrames and R DataFrames within the Databricks environment. Now that I know many R functions only work with R DataFrames, I’ve become quite familiar with collect() and copy_to() for converting back and forth between the two types. So my question is: are there any rules of thumb for Spark versus R DataFrames when using R in Databricks? It seems as though I am converting back and forth a lot.

Jeff

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III

ggplot2 is not included by default I believe. You will have to install it yourself.

https://spark-packages.org/package/SKKU-SKT/ggplot2.SparkR

http://papl-skku.github.io/ggplot2.SparkR/index

As it is a popular package, there is a real chance it will be included in the future.


3 REPLIES

Hubert-Dudek
Esteemed Contributor III

Spark DataFrames are processed in a distributed way on the workers, so it is better to stay with Spark DataFrames where you can. In addition, collect() runs on the driver and pulls the whole dataset into the driver's memory, so it shouldn't be used in production on large data.
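
To make the point about collect() concrete, here is a minimal sketch of capping the number of rows before anything is pulled to the driver. It assumes dplyr is installed; the helper name `collect_capped` and the `max_rows` parameter are hypothetical, not part of sparklyr. The same call should work on a Spark DataFrame (where head() becomes a LIMIT in the generated SQL) and on a local R data.frame, which is what the sketch uses so it runs without a cluster.

```r
library(dplyr)

# Hypothetical helper: limit the row count first, then collect.
# On a Spark DataFrame (tbl_spark) head() is translated to a LIMIT, so only
# max_rows rows travel to the driver. On a local data.frame it simply truncates.
collect_capped <- function(tbl, max_rows = 1000) {
  tbl %>%
    head(max_rows) %>%
    collect()
}

# Local stand-in data so the example runs without Spark.
local_df <- data.frame(id = 1:5000, value = rep(c(1.5, 2.5), 2500))
preview <- collect_capped(local_df, max_rows = 100)
```

This is only a guard for inspection and prototyping; for real workloads the reduction itself (filters, joins, aggregations) would still be done on the Spark side.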

That certainly makes sense, but I've run into a number of R functions that error out on Spark DataFrames. For example, geohashTools and ggplot2 (ggplot2 in particular) only work with R DataFrames, as I understand it.
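
One pattern that seems to reconcile the two points above: do the heavy reduction with dplyr verbs while the data is still a Spark DataFrame, and collect() only the small summary into an R data.frame for packages such as ggplot2. The following is a sketch under assumptions, not a definitive recipe: it assumes dplyr (and, on Databricks, sparklyr) is available, and the function name `mean_delay_by_carrier`, the table name `flights_tbl`, and the columns `carrier` and `dep_delay` are all hypothetical.

```r
library(dplyr)

# Reduce first. These verbs work on a Spark DataFrame (tbl_spark), where
# sparklyr translates them to Spark SQL, and also on a plain R data.frame.
mean_delay_by_carrier <- function(tbl) {
  tbl %>%
    group_by(carrier) %>%
    summarise(mean_delay = mean(dep_delay, na.rm = TRUE))
}

# Local stand-in data (hypothetical columns) so the sketch runs without a cluster.
local_df <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA"),
  dep_delay = c(10, 20, 5, NA)
)
summary_df <- mean_delay_by_carrier(local_df) %>% collect()

# On Databricks the same function would receive a Spark DataFrame instead:
#   library(sparklyr)
#   sc <- spark_connect(method = "databricks")
#   flights_tbl <- copy_to(sc, local_df, "flights_tbl", overwrite = TRUE)
#   summary_df <- mean_delay_by_carrier(flights_tbl) %>% collect()
#
# summary_df is then a small R data.frame that ggplot2 accepts:
#   ggplot2::ggplot(summary_df, ggplot2::aes(carrier, mean_delay)) +
#     ggplot2::geom_col()
```

The design choice here is that only one row per group crosses from the cluster to the driver, so the conversion cost stays small even when the source table is large.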

