Collecting the elements of a SparkDataFrame and coercing them into an R data.frame
03-16-2023 04:29 AM
Hello Everyone,
I am facing a challenge while collecting a Spark DataFrame into an R data.frame. I need to do this because I am using the TraMineR algorithm, which is implemented only in R, while the data pre-processing has been done in PySpark.
I am trying this:
events_df <- collect(events)
events_grp_df <- collect(events_grp)
The error that is occurring is related to Kryo serialization:
" org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 58. To avoid this, increase spark.kryoserializer.buffer.max value"
Can anyone suggest an alternative to collect(), or another way to solve this problem?
FYI: I tried to increase buffer.max.mb using spark.conf.set("spark.kryoserializer.buffer.max.mb", "50000"), but it is not working.
Thanks in advance
Labels: Dataframe, Pyspark, R, Sparkdataframe
03-17-2023 08:52 AM
@Niraj Tanwar:
The error message indicates that the Kryo buffer overflowed, meaning the amount of data being serialized exceeded the maximum buffer size Kryo allows. In addition to increasing the buffer size, you can reduce the amount of data being serialized by selecting only the necessary columns from the DataFrame. Instead of using collect(), you can also write the DataFrame to a file in a format that R can read, such as CSV or Parquet, and then read the file into R using its own file-reading functions.
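On the buffer size itself: spark.kryoserializer.buffer.max generally has to be set in the cluster's Spark configuration before the session starts, so a runtime spark.conf.set() call is unlikely to take effect. Independent of that, here is a minimal SparkR sketch of the column-pruning idea; the column names are placeholders for whichever fields TraMineR actually needs:
library(SparkR)
# Keep only the columns needed downstream; "id", "event_type" and "event_time"
# are placeholder names for this sketch.
events_small <- select(events, "id", "event_type", "event_time")
events_df <- collect(events_small)  # far less data has to pass through Kryo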
Here is an example of writing a DataFrame to a CSV file:
events.write.format("csv").option("header", "true").save("/path/to/csv/file")
And here is an example of reading a CSV file into an R data.frame:
events_df <- read.csv("/path/to/csv/file")
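One caveat about the CSV route (an assumption about how the output lands on disk): Spark's save() writes a directory of part files rather than a single CSV, so on Databricks you would typically read that directory through the /dbfs mount and bind the parts together, roughly like this:
# Sketch only: the directory is the same placeholder path as above,
# seen via the driver's /dbfs FUSE mount.
parts <- list.files("/dbfs/path/to/csv/file", pattern = "\\.csv$", full.names = TRUE)
events_df <- do.call(rbind, lapply(parts, read.csv))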
You can use similar code to write and read Parquet files as well; Parquet has the advantage of being a more efficient and compact file format than CSV.
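For completeness, here is a rough sketch of the Parquet round trip, using SparkR for the write and the arrow package (with dplyr installed) for the read; the paths are placeholders and the /dbfs prefix assumes a Databricks FUSE mount:
library(SparkR)
# Write the pre-processed SparkDataFrame as Parquet (again, a directory of part files).
write.df(events, path = "/path/to/parquet/dir", source = "parquet", mode = "overwrite")

# Read the whole directory back into a plain R data.frame with arrow,
# bypassing Spark (and Kryo) entirely on the way back in.
library(arrow)
events_df <- dplyr::collect(open_dataset("/dbfs/path/to/parquet/dir"))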
I hope this helps!
03-17-2023 11:20 PM
Hi @Niraj Tanwar,
Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!