Data Engineering

Creating a Spark DataFrame from a very large dataset

charry
New Contributor II

I am trying to create a DataFrame using Spark but am having some issues with the amount of data I'm using. I built a list with over 1 million entries through several API calls. The list exceeded the spark.rpc.message.maxSize threshold and was also too large to broadcast, and I kept getting OOM errors from the amount of memory it used. So I split the data from the original list into two separate lists. When I tried to create the DataFrame again, the data was still too large for spark.rpc.message.maxSize, even after repartitioning into 32 partitions. My end goal is to join the two tables in a temporary view and then write the result to Parquet so I can get all the data into a Power BI report.

 

5 REPLIES

Tharun-Kumar
Databricks Employee

@charry 

I would suggest saving the list as a CSV file, reading it back into Spark with spark.read.csv, and then saving it in Parquet format.
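A minimal sketch of that suggestion, assuming the list holds row tuples and that `spark` is an existing SparkSession (as in a Databricks notebook). The function names, column names, and paths are hypothetical, and the CSV has to be written to storage the executors can read, not only to the driver's local disk.

```python
import csv


def write_rows_to_csv(rows, path, header):
    """Write an iterable of row tuples to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def csv_to_parquet(spark, csv_path, parquet_path):
    """Read the CSV with Spark and rewrite it as Parquet.

    `spark` is assumed to be an active SparkSession. Spark reads the file
    itself, so the rows never travel through the driver as one large
    local object.
    """
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(parquet_path)
    return df
```

Usage would look something like `write_rows_to_csv(rows, local_path, ["id", "name"])` followed by `csv_to_parquet(spark, csv_path, parquet_path)`, with the paths adjusted to your workspace.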

erigaud
Honored Contributor

Have you tried specifying the schema when creating the DataFrame? Providing the right types can help with memory usage.

Furthermore, you could incrementally load your data into a bronze Delta table instead of loading the full million rows at once.
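Both ideas can be combined. The sketch below is only an illustration, assuming `spark` is an active SparkSession, the records are tuples matching the schema, and Delta Lake is available on the cluster. The helper names, table name, and batch size are hypothetical.

```python
def chunked(records, size):
    """Yield consecutive slices of `records`, each with at most `size` items."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def append_in_batches(spark, records, schema, table_name, batch_size=50_000):
    """Append `records` to a Delta table one batch at a time.

    `schema` can be a DDL string such as "id BIGINT, name STRING", so Spark
    does not have to infer types from the data. Each batch becomes its own
    small DataFrame, which keeps every serialized task well below the size
    of the full list.
    """
    for batch in chunked(records, batch_size):
        df = spark.createDataFrame(batch, schema=schema)
        df.write.format("delta").mode("append").saveAsTable(table_name)
```

For example, `append_in_batches(spark, rows, "id BIGINT, name STRING", "bronze_api_data")` would append the rows in batches of 50,000. The right batch size depends on how wide the rows are.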

Hope this helps!

-werners-
Esteemed Contributor III

The best way is indeed to write the extracted data to storage and then read it back into Spark. That way, you do not burden Spark with all the API calls.
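One way to do this, sketched under the assumption that each API call returns a page of dict records: write every page to a JSON Lines file as soon as it arrives, then let Spark read the file. The function names are hypothetical, `pages` stands in for whatever iterates over your API responses, and `spark` is assumed to be an active SparkSession.

```python
import json


def append_jsonl(path, records):
    """Append dict records to a JSON Lines file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def dump_pages(pages, path):
    """Write each page of API results to disk as soon as it arrives.

    Only one page is held in memory at a time. Returns the record count.
    """
    total = 0
    for page in pages:
        append_jsonl(path, page)
        total += len(page)
    return total


def load_jsonl(spark, path):
    """Read the JSON Lines file back as a DataFrame (one object per line)."""
    return spark.read.json(path)
```

Because nothing accumulates in a single Python list, the driver never has to hold or ship the full million rows at once.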

Anonymous
Not applicable

Hi @charry 

Checking in. If @-werners-'s answer helped, would you let us know and mark the answer as best? If not, would you be happy to give us more information?

Cheers!

 

saipujari_spark
Databricks Employee

Hey @charry 

Take a look at this KB article; it should help address the issue.

https://kb.databricks.com/execution/spark-serialized-task-is-too-large

Thanks,
Saikrishna Pujari
Sr. Spark Technical Solutions Engineer, Databricks
