Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Collecting the elements of a SparkDataFrame and coercing them into an R data.frame

nirajtanwar
New Contributor

Hello Everyone,

I am facing a challenge while collecting a Spark DataFrame into an R data frame. I need to do this because I am using the TraMineR algorithm, which is implemented only in R, while the data pre-processing was done in PySpark.

I am trying this:

events_df <- collect(events)

events_grp_df <- collect(events_grp)

The error that occurs is related to Kryo serialization:

" org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 58. To avoid this, increase spark.kryoserializer.buffer.max value"

Can anyone suggest an alternative to collect(), or another way to solve this problem?

FYI: I tried to increase buffer.max.mb using spark.conf.set("spark.kryoserializer.buffer.max.mb", "50000"), but it is not working.

Thanks in advance

2 REPLIES

Anonymous
Not applicable

@Niraj Tanwar:

The error message indicates that the buffer overflowed, which could mean that the amount of data being serialized exceeded the maximum buffer size allowed by Kryo. In addition to increasing the buffer size, you can try reducing the amount of data being serialized by selecting only the necessary columns from the DataFrame. Instead of using collect(), you can also write the DataFrame to a file in a format that can be read by R, such as CSV or Parquet, and then read the file into R using its file-reading functions.
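
For example, here is a minimal sketch of trimming the Spark DataFrame to just the columns you need before collecting it (the column names id, seq_step, and state are hypothetical placeholders):

library(SparkR)

# Keep only the columns required by TraMineR before collecting,
# so far less data has to pass through Kryo serialization.
events_small <- select(events, "id", "seq_step", "state")
events_df <- collect(events_small)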

Here is an example of writing a DataFrame to a CSV file:

events.write.format("csv").option("header", "true").save("/path/to/csv/file")

And here is an example of reading a CSV file into an R data.frame:

events_df <- read.csv("/path/to/csv/file")

You can use similar code to write and read Parquet files as well. Writing to Parquet format has the advantage of being a more efficient and compact file format compared to CSV.
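
For instance, a minimal sketch of the Parquet route, assuming the Spark DataFrame is accessible from SparkR and the arrow package is installed on the driver (the paths are placeholders):

# Write the pre-processed data as Parquet from SparkR
write.df(events, path = "/tmp/events_parquet", source = "parquet", mode = "overwrite")

# Read the Parquet files back into a plain R data.frame without going
# through Spark's Kryo serialization; on Databricks the DBFS path is
# reachable from R via the /dbfs/ FUSE mount.
events_df <- dplyr::collect(arrow::open_dataset("/dbfs/tmp/events_parquet"))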

I hope this helps!

Anonymous
Not applicable

Hi @Niraj Tanwar,

Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
