Collecting the elements of a SparkDataFrame and coercing them into an R data.frame

nirajtanwar
New Contributor

Hello Everyone,

I am facing a challenge while collecting a Spark DataFrame into an R data.frame. I need to do this because I am using the TraMineR algorithm, which is implemented only in R, while the data pre-processing was done in PySpark.

I am trying this:

events_df <- collect(events)

events_grp_df <- collect(events_grp)

The error that occurs is related to Kryo serialization:

" org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 58. To avoid this, increase spark.kryoserializer.buffer.max value"

Can anyone help to suggest any alternate to collect or any other way to solve this problem?

FYI: I tried to increase the buffer size using spark.conf.set("spark.kryoserializer.buffer.max.mb", "50000"), but it did not work.

Thanks in advance

2 REPLIES

Anonymous
Not applicable

@Niraj Tanwar​ :

The error message indicates that the serialization buffer overflowed, meaning the amount of data being serialized exceeded the maximum buffer size allowed by Kryo. In addition to increasing the buffer size, you can reduce the amount of data being serialized by selecting only the necessary columns from the DataFrame before collecting. Alternatively, instead of using collect(), you can write the DataFrame to a file in a format that R can read, such as CSV or Parquet, and then read the file into R using its own file-reading functions.
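A side note on the buffer setting itself: as far as I know, the property name spark.kryoserializer.buffer.max.mb was deprecated in favor of spark.kryoserializer.buffer.max, which takes a size with a unit suffix and must stay below 2048m (a Kryo limit, so "50000" MB cannot work). The serializer reads this value when executors start, so on Databricks it generally has to be set in the cluster's Spark configuration rather than at runtime via spark.conf.set. A sketch of the cluster config entry (the 512m value is an example, not a recommendation):

```
spark.kryoserializer.buffer.max 512m
```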

Here is an example of writing a DataFrame to a CSV file:

events.write.format("csv").option("header", "true").save("/path/to/csv/file")

And here is an example of reading a CSV file into an R data.frame:

events_df <- read.csv("/path/to/csv/file")

You can use similar code to write and read Parquet files as well. Writing to Parquet format has the advantage of being a more efficient and compact file format compared to CSV.

I hope this helps!

Anonymous
Not applicable

Hi @Niraj Tanwar​ 

Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.

We'd love to hear from you.

Thanks!
