How to Read Terabytes of Data in Databricks
01-07-2023 06:01 AM
01-07-2023 08:00 AM
In a driver-and-worker (master/slave) node system, your data will be divided into 128 MB chunks: 1000 / 128 = 7.8125, so it will require creating 7-8 partitions of that data. You therefore don't need a 1000 GB cluster; 2-3 nodes of 10-30 GB each will work fine (see the sketch below for how to check this yourself).
Let me know if I am wrong here
Thanks
Aviral Bhardwaj
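
A minimal sketch for checking the split size and the resulting partition count on a cluster, assuming Parquet input and a hypothetical dataset path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default maximum bytes packed into one partition when scanning files
# (134217728 bytes = 128 MB on a stock configuration)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Hypothetical dataset path; Spark only plans the scan here, nothing is read
df = spark.read.parquet("/mnt/data/large_dataset")

# Number of input partitions Spark created for the scan: roughly
# total input size / 128 MB, subject to the actual file layout
print(df.rdd.getNumPartitions())
```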
01-07-2023 09:02 AM
The number of partitions will be 1000 * 1024 / 128 = 8000.
So my question is: all these 8000 partitions combined will be 1000 GB, and I am creating a DataFrame from this data. How is this data loaded? It would have to hold the data in memory somehow. I am just trying to understand what happens at the backend: how the data is read, and how the nodes manage this load.
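
A minimal sketch of the scenario being asked about, assuming Parquet input and a hypothetical path (the "status" column is made up). It illustrates that Spark plans the read lazily and streams partitions through executor memory task by task, rather than holding all 1000 GB in RAM at once:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical 1000 GB dataset; this is lazy, so no data is read yet
df = spark.read.parquet("s3://my-bucket/1tb-dataset")

# Transformations only extend the query plan; still no data movement
active = df.where(df["status"] == "active")

# The action launches a job: ~8000 tasks, one per 128 MB input split.
# Each executor core processes one task at a time, so only a few splits
# are in memory on any node at once; the raw 1000 GB never has to fit in RAM.
print(active.count())
```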
01-16-2023 09:18 PM
None of the answers are relevant to me

