This example uses 100 TB; you can adjust the numbers to fit your own requirements.
To read 100 TB of data in 5 minutes with a Hadoop cluster whose nodes each have a read/write speed of 100 MB/s, with a replication factor of 3, you would need approximately 10,000 data nodes.
Here's the calculation:
- Total data to be read: 100 TB
- Time to read the data: 5 minutes = 300 seconds
- Read speed per node: 100 MB/s
- Replication factor: 3
The total amount of data that can be read in 300 seconds with a single 100 MB/s node is:
- 100 MB/s * 300 s = 30,000 MB = 30 GB
Since the replication factor is 3, only about 1/3 of what each node reads is unique data, which works out to roughly 10 GB per node.
To read 100 TB of data, you would need:
- 100 TB / 10 GB per node = 100,000 GB / 10 GB = 10,000 nodes
Note that the replication factor is already accounted for in the 10 GB of unique data per node, so no further multiplication by the number of replicas is needed.
Therefore, you would need approximately 10,000 data nodes to read 100 TB of data in 5 minutes from a Hadoop cluster with a 100 MB/s per-node read/write speed and a replication factor of 3.
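As a sanity check, here is a minimal Python sketch of the same arithmetic. The function name and the decimal unit convention (1 TB = 1,000,000 MB) are assumptions made for illustration; this is not part of any Hadoop API.

```python
def data_nodes_needed(data_tb, window_s, node_mb_per_s, replication):
    """Estimate the data nodes needed to read data_tb TB in window_s seconds.

    Assumes decimal units (1 TB = 1,000,000 MB) and that only
    1/replication of what each node reads is unique data.
    """
    per_node_mb = node_mb_per_s * window_s           # raw MB one node reads in the window
    unique_per_node_mb = per_node_mb / replication   # unique data contributed per node
    total_mb = data_tb * 1_000_000                   # TB -> MB
    return total_mb / unique_per_node_mb

# 100 TB in 5 minutes (300 s), 100 MB/s per node, replication factor 3
print(data_nodes_needed(100, 300, 100, 3))  # -> 10000.0
```

You can plug in your own data size, time window, and disk throughput to resize the estimate for a different cluster.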