Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Min and max worker node count for processing 5 TB of data

Prashanth24
New Contributor III

I need to ingest a full load of 5 TB of data, apply business transformations, and complete the processing in 2-3 hours. What criteria should be considered when selecting the minimum and maximum number of worker nodes for this full-load processing?

4 REPLIES

Aviral-Bhardwaj
Esteemed Contributor III

This is an example for 100 TB; you can modify it according to your needs.

To read 100 TB of data in 5 minutes with a Hadoop cluster whose nodes each sustain a read/write speed of 100 MB/s and use a replication factor of 3, you would need on the order of 3,300 data nodes.

Here's the calculation:

  • Total data to be read: 100 TB
  • Time to read the data: 5 minutes = 300 seconds
  • Read speed per node: 100 MB/s
  • Replication factor: 3

The amount of data a single 100 MB/s node can read in 300 seconds is:
- 100 MB/s * 300 s = 30,000 MB ≈ 30 GB ≈ 0.03 TB

To read 100 TB within the 300-second window, the cluster therefore needs roughly:
- 100 TB / 0.03 TB per node ≈ 3,334 nodes

The replication factor of 3 does not reduce this number, because each block only needs to be read from one of its replicas; what replication does affect is storage, since the cluster must hold about 300 TB of raw data.

Therefore, you would need on the order of 3,300 data nodes to read 100 TB of data in 5 minutes from a Hadoop cluster with a 100 MB/s per-node read/write speed and a replication factor of 3. The same arithmetic, scaled down to 5 TB and a 2-3 hour window, gives a far smaller cluster.
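
A minimal sketch of that arithmetic as a reusable helper, applied to both the 100 TB example and the 5 TB / 2-3 hour target (the 100 MB/s per-node throughput is an assumption; replace it with what your storage and instance type actually sustain, and remember that transformations, shuffles, and writes usually need more than this raw-read floor):

```python
import math

def nodes_needed(data_tb: float, window_hours: float, per_node_mb_s: float) -> int:
    """Estimate the worker count needed to move a data volume within a time window.

    data_tb        -- data volume to read (TB, decimal units)
    window_hours   -- time budget (hours)
    per_node_mb_s  -- sustained effective throughput per worker (MB/s); an assumption,
                      measure it for your instance type and storage layer
    """
    required_mb_s = (data_tb * 1_000_000) / (window_hours * 3600)  # aggregate MB/s needed
    return math.ceil(required_mb_s / per_node_mb_s)

print(nodes_needed(100, 5 / 60, 100))  # 100 TB in 5 min -> ~3,334 nodes
print(nodes_needed(5, 2.5, 100))       # 5 TB in 2.5 h   -> ~6 nodes (raw read floor only)
```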

AviralBhardwaj

Brahmareddy
Valued Contributor II

Hi Prashanth,

I recommend starting with around 40 powerful workers and setting the autoscaling limit to 120 to handle any extra load. Keep an eye on the job, and adjust the worker counts if things slow down or resources get stretched. Just a thought. Give it a try.
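
As a rough illustration of how those bounds map onto a cluster definition, here is the shape of an autoscaling "new_cluster" spec as used by the Databricks Jobs/Clusters APIs, written as a Python dict; the runtime version, node type, and worker counts are placeholders to adapt to your cloud and workload:

```python
# Sketch of a job cluster spec with autoscaling bounds (all values are placeholders).
new_cluster = {
    "spark_version": "15.4.x-scala2.12",  # assumption: pick a current LTS Databricks runtime
    "node_type_id": "i3.4xlarge",         # assumption: choose an instance type for your cloud
    "autoscale": {
        "min_workers": 40,   # starting size suggested above
        "max_workers": 120,  # ceiling for spikes during the full load
    },
}
```

Watch the cluster metrics during the run and tighten or raise these bounds based on where the job actually spends its time.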

Kaniz_Fatma
Community Manager

Hi @Prashanth24, Thanks for reaching out! Please review the responses and let us know which best addresses your question. Your feedback is valuable to us and the community.

 

If the response resolves your issue, kindly mark it as the accepted solution. This will help close the thread and assist others with similar queries.

 

We appreciate your participation and are here if you need further assistance!

joeharris76
New Contributor II


Need more details about the workload to fully advise, but generally speaking:

  • use the latest generation of cloud instances 
  • enable Unity Catalog
  • enable Photon

If the source data is raw CSV then the load should scale linearly. For example, if 64 nodes complete the process in 30 minutes then 32 nodes will complete it in 1 hour. So, start with many nodes and then scale down as needed to hit your SLA.

Note that compressed CSV files cannot be split among cores like raw CSV can. For compressed CSV, your parallelism will be limited by the total number of files, so there is little benefit in provisioning many more cores than you have input files. For example, if you load ~250 compressed files, a 16 node * 8 vCPU cluster (128 cores) will work well and stay fully busy, but a 64 node * 8 vCPU cluster would not be fully utilized: it has 512 cores while the input parallelism is capped at ~250.
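
A small sketch of that core-per-file reasoning, so you can size the cluster from the input file count (the helper name and the assumption that each core handles one compressed file at a time, or a few sequentially, are illustrative):

```python
import math

def workers_for_compressed_csv(num_files: int, vcpus_per_worker: int, files_per_core: int = 1) -> int:
    """Pick a worker count so total cores roughly match the usable input parallelism.

    num_files        -- number of compressed (non-splittable) input files
    vcpus_per_worker -- cores per worker node (depends on instance type)
    files_per_core   -- files each core may process sequentially (>1 shrinks the cluster)
    """
    usable_parallelism = math.ceil(num_files / files_per_core)
    return math.ceil(usable_parallelism / vcpus_per_worker)

print(workers_for_compressed_csv(250, 8))                    # one file per core  -> 32 workers
print(workers_for_compressed_csv(250, 8, files_per_core=2))  # two files per core -> 16 workers
```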
