How to Read Terabytes of Data in Databricks
01-07-2023 06:01 AM
01-07-2023 08:00 AM
In a driver-and-worker (master/slave) node system, your data will be divided into 128 MB chunks: 1000 / 128 = 7.8125, so it will require creating 7-8 partitions of that data. You therefore don't need a 1000 GB cluster; 2-3 nodes of 10-30 GB each will work fine (see the sketch below for how to check this yourself).
Let me know if I am wrong here
Thanks
Aviral Bhardwaj
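
A minimal sketch for checking the split size and the resulting partition count on a cluster, assuming Parquet input and a hypothetical dataset path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default maximum bytes packed into one partition when scanning files
# (134217728 bytes = 128 MB on a stock configuration)
print(spark.conf.get("spark.sql.files.maxPartitionBytes"))

# Hypothetical dataset path; Spark only plans the scan here, nothing is read
df = spark.read.parquet("/mnt/data/large_dataset")

# Number of input partitions Spark created for the scan: roughly
# total input size / 128 MB, subject to the actual file layout
print(df.rdd.getNumPartitions())
```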
01-07-2023 09:02 AM
The number of partitions will be 1000 * 1024 / 128 = 8000.
So my question is: all these 8000 partitions combined will be 1000 GB, and I am creating a DataFrame from this data. How is this data loaded? It would have to hold the data in memory somehow. I am just trying to understand what happens at the backend: how the data is read, and how the nodes manage this load.
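
A minimal sketch of the scenario being asked about, assuming Parquet input and a hypothetical path (the "status" column is made up). It illustrates that Spark plans the read lazily and streams partitions through executor memory task by task, rather than holding all 1000 GB in RAM at once:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical 1000 GB dataset; this is lazy, so no data is read yet
df = spark.read.parquet("s3://my-bucket/1tb-dataset")

# Transformations only extend the query plan; still no data movement
active = df.where(df["status"] == "active")

# The action launches a job: ~8000 tasks, one per 128 MB input split.
# Each executor core processes one task at a time, so only a few splits
# are in memory on any node at once; the raw 1000 GB never has to fit in RAM.
print(active.count())
```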
01-16-2023 09:18 PM
None of the answers are relevant to me

