Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How to Read Terabytes of data in Databricks

Abhijeet
New Contributor III

I want to read 1000 GB of data. Since Spark does transformations in memory, do I need worker nodes with a combined memory of 1000 GB?

Also, I just want to understand: if reading stores 1000 GB in memory, how is caching a DataFrame different from that case?

4 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

In a master/worker node system, your data will be divided into 128 MB chunks:

1000 / 128 = 7.8125

so it will require creating 7-8 partitions of that data. You don't need a 1000 GB cluster; 2-3 nodes of 10-30 GB each will work fine.

Let me know if I am wrong here

Thanks

Aviral Bhardwaj
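One way to sanity-check this arithmetic in plain Python: the division needs consistent units, since the dataset size is in GB but the split size is in MB. (The 128 MB figure is Spark's default file split size, controlled by `spark.sql.files.maxPartitionBytes`.)

```python
# Partition-count estimate for reading 1000 GB with a 128 MB split size.
# Units must match: convert GB to MB before dividing.
data_size_gb = 1000
split_size_mb = 128  # Spark default: spark.sql.files.maxPartitionBytes

data_size_mb = data_size_gb * 1024           # 1,024,000 MB
num_partitions = data_size_mb // split_size_mb

print(num_partitions)  # 8000 partitions, not 7-8
```

Dividing 1000 by 128 directly mixes GB with MB, which is where the 7.8125 figure comes from.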

AviralBhardwaj

The number of partitions will be:

1000 * 1024 / 128 = 8000

So my question is: all these 8000 partitions combined are 1000 GB, and I am creating a DataFrame from this data. How is this data loaded? It would have to hold the data in memory somehow.

I am just trying to understand what happens at the backend: how the data is read, and how the nodes manage this load.
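Under the usual Spark execution model, the answer is that partitions are processed lazily in waves: each executor core runs one task (one partition) at a time, so the memory in flight at any moment is roughly (concurrent tasks) x (partition size), not the full 1000 GB. A rough sketch of that reasoning in plain Python, using illustrative cluster numbers that are my own assumptions, not from this thread:

```python
import math

# Illustrative cluster: 3 workers x 8 cores each (assumed numbers).
num_workers = 3
cores_per_worker = 8
concurrent_tasks = num_workers * cores_per_worker  # 24 tasks run at once

num_partitions = 8000     # from 1000 GB / 128 MB splits
partition_size_mb = 128

# Tasks are scheduled in "waves": each core takes one partition,
# processes it, discards it, then picks up the next.
waves = math.ceil(num_partitions / concurrent_tasks)
peak_memory_mb = concurrent_tasks * partition_size_mb  # data in flight

print(waves)           # 334 waves of tasks
print(peak_memory_mb)  # 3072 MB (~3 GB) in flight, not 1000 GB
```

This is also where caching differs from a plain read: a plain read streams each partition through memory and discards it after the transformation, while `df.cache()` asks Spark to keep all materialized partitions around (spilling to disk under the default MEMORY_AND_DISK storage level), which is when you would actually need capacity approaching the full dataset size.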

Ajay-Pandey
Esteemed Contributor III

Hi @Abhijeet Singh, the blog below might help you:

Link

Ajay Kumar Pandey

Abhijeet
New Contributor III

None of the answers are relevant to me
