
Creating a Spark DataFrame from a very large dataset

charry — Mon, 17 Jul 2023 19:02:36 GMT

I am trying to create a DataFrame in Spark but am running into issues with the amount of data I'm using. I built a list with over 1 million entries from several API calls. The list exceeded the spark.rpc.message.maxSize threshold and was also too large to broadcast, and I kept getting OOM errors from using so much memory. So I split the data from the original list into two separate lists. When I tried to create the DataFrame again, the size still exceeded spark.rpc.message.maxSize, even with 32 partitions. My end goal is to join the two tables in a temporary view and then write to Parquet so I can get all the data into a Power BI report.

Re: Creating a Spark DataFrame from a very large dataset

Tharun-Kumar — Tue, 18 Jul 2023 03:38:51 GMT

@charry

I would suggest saving the list as a CSV file, reading it back into Spark with spark.read.csv, and then saving it in Parquet format.
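As a rough sketch of that suggestion (not a tested recipe for this exact workload): the helper names, the column names in the usage line, and the paths are hypothetical, and it assumes the CSV path is one that both the driver's Python process and Spark can reach. The list is written row by row with the standard csv module, so it never has to be shipped to the executors as one serialized task.

```python
import csv


def write_records_to_csv(records, csv_path, fieldnames):
    """Write an iterable of dicts to a CSV file with a header row, one row at a time."""
    count = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for record in records:
            writer.writerow(record)
            count += 1
    return count


def csv_to_parquet(spark, csv_path, parquet_path, schema):
    """Read the CSV with an existing SparkSession and write it back out as Parquet.

    `schema` may be a StructType or a DDL-formatted string such as
    "id BIGINT, name STRING" (hypothetical columns).
    """
    df = spark.read.csv(csv_path, header=True, schema=schema)
    df.write.mode("overwrite").parquet(parquet_path)
    return df
```

Usage would look something like write_records_to_csv(api_results, csv_path, ["id", "name"]) followed by csv_to_parquet(spark, csv_path, parquet_path, "id BIGINT, name STRING"), where api_results is the list built from the API calls.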

Re: Creating a Spark DataFrame from a very large dataset

erigaud — Tue, 18 Jul 2023 06:41:27 GMT

Have you tried specifying the schema when creating the DataFrame? Providing the right types can help reduce memory usage.

Furthermore, you could incrementally load your data into a bronze Delta table instead of loading the full million rows at once.

Hope this helps!
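A minimal sketch of both ideas together, assuming the records are tuples (or Rows) that match the schema. The function names, the table name, and the default batch size are hypothetical and would need tuning so that each batch stays under spark.rpc.message.maxSize.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def append_in_batches(spark, records, schema, table_name, batch_size=50_000):
    """Append records to a Delta table one batch at a time with an explicit schema.

    `schema` may be a StructType or a DDL-formatted string such as
    "id BIGINT, name STRING" (hypothetical columns). Passing it explicitly
    avoids schema inference over the data.
    """
    for batch in batched(records, batch_size):
        df = spark.createDataFrame(batch, schema=schema)
        df.write.format("delta").mode("append").saveAsTable(table_name)
```

Because each call to createDataFrame only sees one batch, the driver never has to serialize the full list in a single step.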

Re: Creating a Spark DataFrame from a very large dataset

-werners- — Tue, 18 Jul 2023 15:09:01 GMT

The best way is indeed to write the extracted data out and then read it back into Spark. That way you do not burden Spark with all the API calls.

Re: Creating a Spark DataFrame from a very large dataset

Anonymous — Wed, 19 Jul 2023 09:14:56 GMT

Hi @charry

Checking in. If @-werners-'s answer helped, would you let us know and mark the answer as best? If not, would you be happy to give us more information?

Cheers!

Re: Creating a Spark DataFrame from a very large dataset

saipujari_spark — Wed, 19 Jul 2023 19:47:23 GMT

Hey @charry

Take a look at this KB article; it should help address the issue.

https://kb.databricks.com/execution/spark-serialized-task-is-too-large