How to Read Terabytes of data in Databricks

Abhijeet — Sat, 07 Jan 2023 14:01:59 GMT

I want to read 1000 GB of data. Since Spark does its transformations in memory, do I need worker nodes with a combined memory of 1000 GB?

I also want to understand whether reading the data stores all 1000 GB in memory. If it does, how is caching a DataFrame different from that case?

Aviral-Bhardwaj — Sat, 07 Jan 2023 16:00:30 GMT

In a master/slave node system,

your data will be divided into chunks of 128 MB.

So 1000/128 = 7.8125,

which means it will need to create 7-8 partitions of that data. You don't need a 1000 GB cluster; 2-3 nodes with 10-30 GB each will work fine.

Let me know if I am wrong here

Thanks

Aviral Bhardwaj

Abhijeet — Sat, 07 Jan 2023 17:02:52 GMT

The number of partitions will be

1000 * 1024 / 128 = 8000
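
As a quick check of that arithmetic, here is a minimal sketch in plain Python. It assumes 1 GB = 1024 MB and the 128 MB chunk size mentioned earlier in the thread; the function name is only illustrative, and the real partition count also depends on file layout and Spark settings.

```python
import math


def estimate_partitions(data_gb: float, split_mb: int = 128) -> int:
    """Rough number of input partitions for data_gb of data.

    Assumes 1 GB = 1024 MB and that every partition is split_mb in size.
    """
    return math.ceil(data_gb * 1024 / split_mb)


num_partitions = estimate_partitions(1000)
print(num_partitions)  # 8000
```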

So my question is this: all 8000 partitions combined still add up to 1000 GB,

and I am creating a DataFrame from this data.

How is this data loaded? It seems it would have to be held in memory somehow.

I am just trying to understand what happens in the backend: how the data is read, and how the nodes manage this load.
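
One rough way to picture the load, offered as a sketch under stated assumptions rather than a description of Spark's internals: if each core works on one partition at a time, only a small slice of the 1000 GB is being processed at any moment, and the rest is read in later waves of tasks. The 24-core cluster and the function name below are hypothetical, and the estimate ignores shuffles, caching, and per-task overhead.

```python
import math


def in_flight_estimate(total_partitions: int, split_mb: int, total_cores: int):
    """Estimate input data being processed at once and the number of task waves.

    Assumes one task per core and one partition per task (a simplification).
    """
    concurrent_tasks = min(total_partitions, total_cores)
    in_flight_mb = concurrent_tasks * split_mb
    waves = math.ceil(total_partitions / total_cores)
    return in_flight_mb, waves


# Hypothetical cluster: 3 workers with 8 cores each = 24 cores.
in_flight_mb, waves = in_flight_estimate(8000, 128, 24)
print(in_flight_mb, waves)  # 3072 334
```

Under these assumptions roughly 3 GB of input is in flight at a time, processed over 334 waves. Caching a DataFrame is a different case, because it asks Spark to keep the computed partitions around for reuse.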

Ajay-Pandey — Sun, 08 Jan 2023 07:24:59 GMT

Hi @Abhijeet Singh, the blog below might help you-

Abhijeet — Tue, 17 Jan 2023 05:18:41 GMT

None of the answers are relevant to me