Databricks Community

charry · ‎07-17-2023

I am trying to create a DataFrame in Spark but am running into problems with the amount of data involved. I built a list of over 1 million entries from several API calls. The list exceeded the spark.rpc.message.maxSize threshold and was also too large to broadcast, and I kept getting OOM errors because of the memory it consumed. So I split the data from the original list into two separate lists. When I tried to create the DataFrame again, the size was still too large for spark.rpc.message.maxSize, even after repartitioning to 32 partitions. My end goal is to join the two tables in a temporary view and then write the result to Parquet so I can get all the data into a Power BI report.

Tharun-Kumar · ‎07-17-2023

@charry

I would suggest saving the list as a CSV file, reading it back in Spark with spark.read.csv, and then saving it in Parquet format.
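
A minimal sketch of that round trip, in case it helps. The column names, the sample records, and the helper names write_records_to_csv and csv_to_parquet are all made up for illustration; the only Spark calls used are spark.read.csv and DataFrame.write.parquet. It assumes the CSV ends up on storage that every node in the cluster can read, which a temp directory on the driver (used here only to keep the example self-contained) may not be.

```python
import csv
import os
import tempfile


def write_records_to_csv(records, path, header):
    """Write an iterable of row tuples to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(records)
    return path


def csv_to_parquet(spark, csv_path, parquet_path):
    """Read the CSV with Spark and save it as Parquet.

    `spark` is an existing SparkSession (for example the one a notebook
    provides). Both paths are assumed to be reachable from all nodes.
    """
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)
    return df


# Hypothetical stand-in for the list built from the API calls.
records = [(i, f"name_{i}") for i in range(1000)]

# Assumption: a temp directory is used only so the example runs anywhere.
csv_path = write_records_to_csv(
    records,
    os.path.join(tempfile.mkdtemp(), "records.csv"),
    header=["id", "name"],
)
```

With a session available, csv_to_parquet(spark, csv_path, "<output path>") would do the conversion. The point of the detour is that the rows reach Spark as a file it reads in parallel rather than as one large Python object shipped from the driver.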

erigaud · ‎07-17-2023

Have you tried specifying the schema when creating the DataFrame? Providing the right types can help with memory usage.

Furthermore, you could incrementally load your data into a bronze Delta table instead of loading the full million rows at once.
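
Both ideas can be combined in a rough sketch like the one below. The schema string, the batch size, and the helper names batched and append_in_batches are assumptions for illustration, not something from the original question. It relies on spark.createDataFrame accepting a DDL-style schema string and on the Delta format being available on the cluster.

```python
from itertools import islice

# Hypothetical schema; replace with the real column names and types.
SCHEMA = "id BIGINT, name STRING"


def batched(rows, batch_size):
    """Yield successive lists of at most `batch_size` rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def append_in_batches(spark, rows, table_name, batch_size=50_000):
    """Append rows to a Delta table one batch at a time.

    `spark` is an existing SparkSession. Each batch becomes a small
    DataFrame with an explicit schema, so no single createDataFrame
    call has to carry the whole list.
    """
    for batch in batched(rows, batch_size):
        df = spark.createDataFrame(batch, schema=SCHEMA)
        df.write.format("delta").mode("append").saveAsTable(table_name)


# Quick check of the batching logic without Spark.
example_batches = list(batched(range(10), 4))
```

A call such as append_in_batches(spark, records, "bronze_api_data") would then load the table step by step; the table name here is a placeholder. The batch size is a tuning knob: it needs to be small enough that each batch stays under the message size limit.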

Hope this helps!

-werners- · ‎07-18-2023

The best way is indeed to write the extracted data to storage and then read it back into Spark. That way you do not burden Spark with all the API calls.
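
One way this could look, as a sketch only: each API response page is written to its own JSON Lines file as soon as it arrives, and Spark reads the whole directory afterwards. The sample pages, the file naming, and the helper names dump_pages and load_pages are hypothetical. spark.read.json reads line-delimited JSON by default, and the output directory is assumed to be on storage the cluster can reach.

```python
import json
import os
import tempfile


def dump_pages(pages, out_dir):
    """Write each page (a list of dict records) to its own .jsonl file."""
    os.makedirs(out_dir, exist_ok=True)
    paths = []
    for i, page in enumerate(pages):
        path = os.path.join(out_dir, f"part-{i:05d}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for record in page:
                f.write(json.dumps(record) + "\n")
        paths.append(path)
    return paths


def load_pages(spark, out_dir):
    """Read every JSON Lines file in `out_dir` into one DataFrame.

    `spark` is an existing SparkSession.
    """
    return spark.read.json(out_dir)


# Hypothetical stand-in for the pages returned by the API calls.
pages = [
    [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    [{"id": 3, "name": "c"}],
]
out_dir = os.path.join(tempfile.mkdtemp(), "api_dump")
paths = dump_pages(pages, out_dir)
```

Because each page is flushed to disk right away, the full result set never has to sit in driver memory as a single list, and a failed API call only costs the page that was in flight.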

Anonymous · ‎07-19-2023

Hi @charry

Checking in. If @-werners-'s answer helped, would you let us know and mark the answer as best? If not, would you be happy to give us more information?

Cheers!

saipujari_spark · ‎07-19-2023

Hey @charry

Take a look at this KB article; it should help address the issue.

https://kb.databricks.com/execution/spark-serialized-task-is-too-large

Thanks,
Saikrishna Pujari
Sr. Spark Technical Solutions Engineer, Databricks

Creating a Spark DataFrame from a very large dataset