Creating a Spark DataFrame from a very large dataset
07-17-2023 12:02 PM
I am trying to create a DataFrame in Spark but am running into problems with the amount of data. I built a list of over 1 million entries from several API calls. The list exceeded the spark.rpc.message.maxSize threshold and was also too large to broadcast, and I kept getting OOM errors from the memory it used. So I split the data from the original list into two separate lists. When I tried to create the DataFrame again, it was still too large for spark.rpc.message.maxSize, even after repartitioning into 32 partitions. My end goal is to join the two tables in a temporary view and then write to Parquet so I can get all the data into a Power BI report.
07-17-2023 08:38 PM
I would suggest saving the list as a CSV file and then reading it back in Spark using spark.read.csv and saving it in parquet format.
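A minimal sketch of that round trip, assuming the list holds row tuples. The column names, the schema string, and both helper names are hypothetical, and `spark` is taken to be an existing SparkSession (as in a Databricks notebook). The CSV path has to point at storage the executors can also read.

```python
import csv

# Hypothetical column names and Spark DDL schema, for illustration only.
COLUMNS = ["id", "name", "value"]
SCHEMA_DDL = "id INT, name STRING, value DOUBLE"


def write_records_to_csv(records, csv_path):
    """Write an iterable of row tuples to a CSV file with a header; return the row count."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for row in records:
            writer.writerow(row)
            count += 1
    return count


def csv_to_parquet(spark, csv_path, parquet_path):
    """Read the CSV with an explicit schema and rewrite it as Parquet."""
    # Executors read the file directly, so the rows are never shipped from the driver.
    df = spark.read.csv(csv_path, header=True, schema=SCHEMA_DDL)
    df.write.mode("overwrite").parquet(parquet_path)
    return df
```

Because Spark reads the file itself, the data does not travel from the driver inside task messages, which is where the spark.rpc.message.maxSize limit applies.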
07-17-2023 11:41 PM
Have you tried specifying the schema when creating the DataFrame? Providing the right types can help with memory usage.
Furthermore, you could load your data incrementally into a bronze Delta table instead of loading the full million rows at once.
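Roughly, both ideas could look like the sketch below. The schema string, the table name, the batch size, and the helper names are all assumptions for illustration, and `spark` is an existing SparkSession with Delta available (as on Databricks).

```python
from itertools import islice

# Hypothetical schema and table name, for illustration only.
SCHEMA_DDL = "id INT, name STRING, value DOUBLE"
BRONZE_TABLE = "bronze_api_records"


def batched(rows, batch_size):
    """Yield successive lists of at most batch_size rows from any iterable."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def append_in_batches(spark, rows, batch_size=50_000):
    """Create one small DataFrame per batch and append it to a Delta table."""
    total = 0
    for batch in batched(rows, batch_size):
        # The explicit schema skips type inference over the Python rows.
        df = spark.createDataFrame(batch, schema=SCHEMA_DDL)
        df.write.format("delta").mode("append").saveAsTable(BRONZE_TABLE)
        total += len(batch)
    return total
```

Each call serializes only one batch from the driver, so a smaller batch size should keep every job well under the message size limit. The right batch size depends on row width and is something to tune.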
Hope this helps!
07-18-2023 08:09 AM
The best way is indeed to write the extracted data out and then read it back into Spark. That way you do not burden Spark with all the API calls.
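One way to keep extraction and Spark separate is to dump each API page to disk as it arrives, so the full list never has to sit in driver memory. This is only a sketch: `dump_pages`, `load_pages`, and the file naming are hypothetical, each page is assumed to be a list of dicts, and `spark` is an existing SparkSession.

```python
import json
import os


def dump_pages(pages, out_dir):
    """Write each page (a list of dicts) to its own JSON Lines file; return the record count."""
    os.makedirs(out_dir, exist_ok=True)
    total = 0
    for i, page in enumerate(pages):
        path = os.path.join(out_dir, f"page_{i:06d}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for record in page:
                f.write(json.dumps(record) + "\n")
                total += 1
    return total


def load_pages(spark, out_dir, schema_ddl):
    """Read every JSON Lines file in out_dir as one DataFrame."""
    return spark.read.schema(schema_ddl).json(out_dir)
```

`pages` can be a generator that yields one API response at a time, which keeps only a single page in memory during extraction.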

07-19-2023 02:14 AM
Hi @charry
Checking in. If @-werners-'s answer helped, would you let us know and mark the answer as best? If not, would you be happy to give us more information?
Cheers!
07-19-2023 12:47 PM
Hey @charry
Take a look at this KB article; it should help address the issue.
https://kb.databricks.com/execution/spark-serialized-task-is-too-large
Saikrishna Pujari
Sr. Spark Technical Solutions Engineer, Databricks

