08-11-2024 08:37 AM
I need to ingest a full load of 5 TB of data, applying business transformations, and I want to process it in 2-3 hours. What criteria should be considered when selecting the min and max worker nodes for this full-load processing?
08-11-2024 09:43 AM
This is an example for 100 TB; you can modify it according to your needs.
To read 100 TB of data in 5 minutes with a Hadoop cluster whose data nodes read/write at 100 MB/s and which uses a replication factor of 3, you would need roughly 3,300-3,400 data nodes.
Here's the calculation:
The amount of data a single 100 MB/s node can read in 300 seconds is:
- 100 MB/s * 300 s = 30,000 MB ≈ 30 GB
To read 100 TB (≈ 100,000 GB) within that 5-minute window, you would need:
- 100,000 GB / 30 GB per node ≈ 3,334 nodes
The replication factor of 3 means the cluster stores three copies of every block (≈ 300 TB of raw disk), but only one copy of each block has to be read, so replication drives the storage sizing rather than the read-throughput sizing.
Therefore, you would need approximately 3,334 data nodes to read 100 TB of data in 5 minutes at 100 MB/s per node. Apply the same arithmetic to your own data size and time window.
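A minimal sketch of that arithmetic in Python (the function name is just illustrative, and it assumes decimal units, i.e. 1 TB ≈ 10^6 MB):

```python
import math

def nodes_for_read(data_tb, node_throughput_mb_s, window_s):
    """Estimate data nodes needed to scan `data_tb` of unique data
    within `window_s` seconds at `node_throughput_mb_s` per node."""
    data_mb = data_tb * 1_000_000                # 1 TB ~ 10^6 MB (decimal units)
    aggregate_mb_s = data_mb / window_s          # total read throughput required
    return math.ceil(aggregate_mb_s / node_throughput_mb_s)

# 100 TB in 5 minutes at 100 MB/s per node -> 3334 nodes
print(nodes_for_read(100, 100, 300))

# The original question: 5 TB in ~2 hours -> about 7 nodes' worth of raw read throughput
print(nodes_for_read(5, 100, 2 * 3600))
```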
08-11-2024 09:43 PM
Hi Prashanth,
I recommend starting with around 40 powerful workers and setting the auto-scaling limit to 120 to handle any extra load. Keep an eye on the job and adjust the workers if things slow down or resources get stretched. Just a thought. Give it a try.
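In case it's useful, here's a minimal sketch of how that range could be expressed as an autoscaling cluster spec for the Databricks Clusters/Jobs API; the spark_version and node_type_id values below are placeholders to adapt to your cloud and workspace:

```python
# Hypothetical autoscaling cluster spec (runtime and VM type are placeholders).
cluster_spec = {
    "cluster_name": "full-load-5tb",
    "spark_version": "15.4.x-scala2.12",   # placeholder LTS runtime
    "node_type_id": "Standard_E8ds_v5",    # placeholder worker VM type
    "autoscale": {
        "min_workers": 40,    # start with ~40 powerful workers
        "max_workers": 120,   # let auto-scaling grow to 120 under extra load
    },
}
```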
08-12-2024 07:11 AM
Hi @Prashanth24, Thanks for reaching out! Please review the responses and let us know which best addresses your question. Your feedback is valuable to us and the community.
If the response resolves your issue, kindly mark it as the accepted solution. This will help close the thread and assist others with similar queries.
We appreciate your participation and are here if you need further assistance!
08-13-2024 06:08 AM - edited 08-13-2024 06:09 AM
Need more details about the workload to fully advise but generally speaking:
If the source data is raw CSV then the load should scale linearly. For example, if 64 nodes complete the process in 30 minutes then 32 nodes will complete it in 1 hour. So, start with many nodes and then scale down as needed to hit your SLA.
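For instance, a quick sketch of that linear-scaling rule (the benchmark numbers are made up for illustration):

```python
import math

def nodes_to_hit_sla(bench_nodes, bench_minutes, target_minutes):
    """Assuming linear scaling, node-minutes stay roughly constant:
    bench_nodes * bench_minutes ~= target_nodes * target_minutes."""
    return math.ceil(bench_nodes * bench_minutes / target_minutes)

# If a 64-node test run finished in 30 minutes, ~16 nodes should hit a 2-hour SLA
print(nodes_to_hit_sla(64, 30, 120))
```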
Note that compressed CSV files cannot be split across cores the way raw CSV can. For compressed CSV your parallelism is limited by the total number of files, so aim to have at least 1 input file per core; cores beyond the file count just sit idle. For example, if you load ~250 compressed files then a 16 node * 8 vCPU cluster will work well because its 128 cores all stay busy, but a 64 node * 8 vCPU cluster would not be fully utilized because it has 512 cores while the input parallelism is capped at ~250.
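A small sketch of that sizing rule, assuming the compressed files are non-splittable (e.g. gzip) and using a made-up helper name:

```python
def max_useful_workers(num_files, vcpus_per_node):
    """Non-splittable compressed CSV caps parallelism at one task per file,
    so cores beyond the file count sit idle. Returns the largest worker
    count the input can still keep fully busy."""
    return max(1, num_files // vcpus_per_node)

# ~250 gzip files with 8 vCPUs per worker -> little benefit past ~31 workers
print(max_useful_workers(250, 8))
```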