Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Number of min and max worker nodes for processing 5 TB of data

Prashanth24
New Contributor III

I need to ingest a full load of 5 TB of data, apply business transformations, and complete the processing in 2-3 hours. What criteria should be considered when selecting the minimum and maximum number of worker nodes for this full-load processing?

3 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

This is an example for 100 TB; you can modify it according to your needs.

To read 100 TB of data in 5 minutes with a Hadoop cluster whose nodes each sustain a 100 MB/s read/write speed, and with a replication factor of 3, you would need on the order of 10,000 data nodes.

Here's the calculation:

  • Total data to be read: 100 TB
  • Time to read the data: 5 minutes = 300 seconds
  • Read speed per node: 100 MB/s
  • Replication factor: 3

The amount of data a single 100 MB/s node can read in 300 seconds is:
- 100 MB/s * 300 s = 30,000 MB = 30 GB

If you assume the replication factor of 3 cuts the effective unique throughput per node to a third, each node contributes about 10 GB of unique data in those 5 minutes.

To read 100 TB of data, you would therefore need roughly:
- 100 TB / 10 GB per node = 10,000 nodes

(Even ignoring the replication penalty, raw throughput alone requires 100 TB / 30 GB ≈ 3,334 nodes.)

So for this example you would need on the order of 10,000 data nodes to read 100 TB in 5 minutes at 100 MB/s per node with a replication factor of 3. Scale the same arithmetic to your 5 TB load and 2-3 hour window.
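To adapt the same back-of-the-envelope arithmetic to other inputs (for example the 5 TB / 2-3 hour target in the question), here is a minimal sketch. It assumes the simplified model above (a fixed sustained per-node throughput and the pessimistic replication penalty) and ignores transformation cost, shuffles, and skew; the function and numbers are illustrative only, not a Databricks or Hadoop API.

```python
import math

def estimate_nodes(data_tb: float,
                   time_seconds: float,
                   node_throughput_mb_s: float = 100.0,
                   replication_factor: int = 3) -> int:
    """Rough node-count estimate using the simplified throughput model above.

    Assumes each node sustains node_throughput_mb_s and that the replication
    factor divides the effective unique throughput per node. Ignores the CPU
    cost of transformations, shuffles, and data skew.
    """
    data_mb = data_tb * 1_000_000                       # 1 TB ~ 1,000,000 MB (decimal units)
    per_node_mb = node_throughput_mb_s * time_seconds   # raw MB one node can read in the window
    effective_mb = per_node_mb / replication_factor     # pessimistic replication penalty
    return math.ceil(data_mb / effective_mb)

# Worked examples (illustrative only):
print(estimate_nodes(100, 5 * 60))      # 100 TB in 5 minutes -> ~10,000 nodes
print(estimate_nodes(5, 2 * 60 * 60))   # 5 TB in 2 hours under the same assumptions
```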

AviralBhardwaj

Brahmareddy
Valued Contributor III

Hi Prashanth,

I recommend starting with around 40 powerful workers and setting the auto-scaling limit to 120 to handle any extra load. Keep an eye on the job, and adjust the workers if things slow down or resources get stretched. Just a thought. Give it a try.
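For reference, that kind of range can be written directly into the cluster definition. The snippet below is only a minimal sketch of a new-cluster spec in the shape the Databricks Clusters/Jobs APIs accept; the runtime version and node type are placeholder values, so verify the field names and values against the current docs for your workspace.

```python
# Illustrative cluster spec with autoscaling between 40 and 120 workers.
# Field names follow the Databricks Clusters API as I understand it; the
# spark_version and node_type_id values are placeholders.
cluster_spec = {
    "cluster_name": "full-load-5tb",
    "spark_version": "15.4.x-scala2.12",   # example LTS runtime
    "node_type_id": "i3.4xlarge",          # example AWS instance type
    "autoscale": {
        "min_workers": 40,                 # starting size suggested above
        "max_workers": 120,                # upper bound for extra load
    },
}
```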

joeharris76
New Contributor II


Need more details about the workload to fully advise but generally speaking:

  • use the latest generation of cloud instances 
  • enable Unity Catalog
  • enable Photon (see the config sketch after this list)
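For the last two points, here is a minimal sketch of how they can appear in a cluster spec, assuming the Clusters API field names as I recall them (runtime_engine for Photon, data_security_mode for a Unity Catalog-capable access mode); treat every value as a placeholder and confirm against the current documentation.

```python
# Illustrative fragment of a cluster spec; field names assumed from the
# Databricks Clusters API and worth double-checking before use.
cluster_spec_fragment = {
    "node_type_id": "i4i.4xlarge",         # example of a recent-generation instance type
    "runtime_engine": "PHOTON",            # enable Photon
    "data_security_mode": "SINGLE_USER",   # a Unity Catalog-capable access mode
}
```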

If the source data is raw CSV then the load should scale linearly. For example, if 64 nodes complete the process in 30 minutes then 32 nodes will complete it in 1 hour. So, start with many nodes and then scale down as needed to hit your SLA.
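Under that linear-scaling assumption you can extrapolate a cluster size for a target SLA from a single benchmark run. A minimal sketch, assuming perfectly linear scaling (real jobs only approximate this once shuffles and skew come into play); the benchmark numbers are made up for illustration.

```python
import math

def nodes_for_sla(bench_nodes: int, bench_minutes: float, target_minutes: float) -> int:
    """Extrapolate a node count for a target runtime, assuming linear scaling:
    nodes * minutes is treated as a constant amount of work."""
    return math.ceil(bench_nodes * bench_minutes / target_minutes)

# Example from the post: 64 nodes finish in 30 minutes
print(nodes_for_sla(64, 30, 60))    # -> 32 nodes for a 1-hour SLA
print(nodes_for_sla(64, 30, 150))   # -> 13 nodes for a 2.5-hour SLA (illustrative)
```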

Note that compressed CSV files cannot be split across cores the way raw CSV can. For compressed CSV, your parallelism is limited by the total number of files, so it's best to have at least one input file per core. For example, if you load ~250 compressed files, a 16-node x 8-vCPU cluster works well because it has 128 cores, but a 64-node x 8-vCPU cluster would not be fully utilized: it has 512 cores while the input parallelism is limited to ~250.
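A quick way to sanity-check that rule is to compare the cluster's total core count with the number of input files. The helper below is only illustrative; the node and vCPU counts are example values.

```python
def utilized_cores(num_nodes: int, vcpus_per_node: int, num_input_files: int):
    """For non-splittable (e.g. gzip-compressed) CSV, at most one core can work
    on each file, so utilization is capped by the input file count."""
    total_cores = num_nodes * vcpus_per_node
    busy = min(total_cores, num_input_files)
    return busy, busy / total_cores

# Examples from the post: ~250 compressed files
print(utilized_cores(16, 8, 250))   # (128, 1.0)   -> all 128 cores busy
print(utilized_cores(64, 8, 250))   # (250, ~0.49) -> 512 cores, only ~250 busy
```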
