Collecting the elements of a SparkDataFrame and coercing them into an R data.frame
03-16-2023 04:29 AM
Hello Everyone,
I am facing a challenge while collecting a Spark DataFrame into an R data.frame. I need to do this because I am using the TraMineR algorithm, which is implemented only in R, while the data pre-processing has been done in PySpark.
I am trying this:
events_df <- collect(events)
events_grp_df <- collect(events_grp)
The error that is occurring is related to Kryo serialization:
" org.apache.spark.SparkException: Kryo serialization failed: Buffer overflow. Available: 0, required: 58. To avoid this, increase spark.kryoserializer.buffer.max value"
Can anyone suggest an alternative to collect(), or another way to solve this problem?
FYI: I tried to increase buffer.max.mb using spark.conf.set("spark.kryoserializer.buffer.max.mb", "50000"), but it is not working.
Thanks in advance
Labels: Dataframe, Pyspark, R, Sparkdataframe
03-17-2023 08:52 AM
@Niraj Tanwar:
The error message indicates that the Kryo buffer overflowed, meaning the amount of data being serialized exceeded the maximum buffer size Kryo allows. In addition to increasing the buffer size, you can reduce the amount of data being serialized by selecting only the necessary columns from the DataFrame. Instead of using collect(), you can also write the DataFrame to a file in a format that R can read, such as CSV or Parquet, and then read the file into R using its own file-reading functions.
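On the buffer size itself: spark.kryoserializer.buffer.max generally has to be set in the cluster's Spark configuration before the session starts, so a runtime spark.conf.set() call is unlikely to take effect. Independent of that, here is a minimal SparkR sketch of the column-pruning idea; the column names are placeholders for whichever fields TraMineR actually needs:
library(SparkR)
# Keep only the columns needed downstream; "id", "event_type" and "event_time"
# are placeholder names for this sketch.
events_small <- select(events, "id", "event_type", "event_time")
events_df <- collect(events_small)  # far less data has to pass through Kryo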
Here is an example of writing a DataFrame to a CSV file:
events.write.format("csv").option("header", "true").save("/path/to/csv/file")
And here is an example of reading a CSV file into an R data.frame:
events_df <- read.csv("/path/to/csv/file")
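One caveat about the CSV route (an assumption about how the output lands on disk): Spark's save() writes a directory of part files rather than a single CSV, so on Databricks you would typically read that directory through the /dbfs mount and bind the parts together, roughly like this:
# Sketch only: the directory is the same placeholder path as above,
# seen via the driver's /dbfs FUSE mount.
parts <- list.files("/dbfs/path/to/csv/file", pattern = "\\.csv$", full.names = TRUE)
events_df <- do.call(rbind, lapply(parts, read.csv))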
You can use similar code to write and read Parquet files as well; Parquet has the advantage of being a more efficient and compact file format than CSV.
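For completeness, here is a rough sketch of the Parquet round trip, using SparkR for the write and the arrow package (with dplyr installed) for the read; the paths are placeholders and the /dbfs prefix assumes a Databricks FUSE mount:
library(SparkR)
# Write the pre-processed SparkDataFrame as Parquet (again, a directory of part files).
write.df(events, path = "/path/to/parquet/dir", source = "parquet", mode = "overwrite")

# Read the whole directory back into a plain R data.frame with arrow,
# bypassing Spark (and Kryo) entirely on the way back in.
library(arrow)
events_df <- dplyr::collect(open_dataset("/dbfs/path/to/parquet/dir"))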
I hope this helps!
03-17-2023 11:20 PM
Hi @Niraj Tanwar,
Hope all is well! Just wanted to check in to see whether you were able to resolve your issue. If so, would you be happy to share the solution or mark an answer as best? Otherwise, please let us know if you need more help.
We'd love to hear from you.
Thanks!