Hi @Retired_mod
Thanks for getting back with such valuable information.
| File size | EMR duration | Databricks duration | Cost comparison | Notes |
| --- | --- | --- | --- | --- |
| 225 GB | 22 mins | 63 mins | EMR is about 5x cheaper than Databricks | Involves several S3 writes; m5d.4xlarge |
| 225 GB | 45 mins | 84 mins | EMR is about 3.5x cheaper than Databricks | Involves several S3 writes; m5d.2xlarge |
| 225 GB | 13 mins | 20 mins | EMR is about 2x cheaper than Databricks | No S3 writes, counts used as actions; m5d.2xlarge |
| 1 TB | 3 hrs 37 mins | 6 hrs 11 mins | EMR is more than 3x cheaper than Databricks | Involves several S3 writes; m5d.4xlarge |
| 1 TB | 1 hr 24 mins | 2 hrs | EMR is about 6x cheaper than Databricks | No S3 writes, counts used as actions; m5d.4xlarge |
This is the first time these comparisons have been made, and I am a bit surprised by the processing times.
As you can see in row 1, Databricks works out about 5 times more expensive than EMR. Next, I changed the instance size from m5d.4xlarge to m5d.2xlarge, which brought the cost difference down to 3.5 times. I then had to change the logic to remove the S3 writes (about 5-7 writes) and use counts as actions instead, still on m5d.2xlarge (row 3), which brought the cost difference down to 2 times, since the processing times improved and the DBU cost is lower with less powerful instances, even when using more of them. This is the best I can do with the 225 GB file, but at the cost of changing the logic to reduce the number of writes.

I noticed in the logs that EMR uses multipart uploads while writing into S3, but I am not sure whether Databricks does the same for its S3 writes. Any ideas here to improve the Databricks write speed?
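On my side, these are the kinds of S3A multipart settings I was planning to experiment with on the Databricks cluster. I am assuming the writes go through the Hadoop S3A connector (s3a://); if Databricks uses its own S3 client, these keys may simply be ignored, so please treat this as an unvalidated sketch:

```python
# Sketch only: S3A multipart-upload settings to try on the Databricks cluster
# (set at cluster creation as Spark config). Assumes the Hadoop S3A connector
# handles the s3a:// writes; if the Databricks-native S3 client is in play,
# these keys may be ignored.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Upload each part once it reaches 128 MB instead of waiting for the full file
    .config("spark.hadoop.fs.s3a.multipart.size", "134217728")
    # Start uploading parts as soon as they are buffered
    .config("spark.hadoop.fs.s3a.fast.upload", "true")
    # Buffer parts on local disk (m5d instances have NVMe) rather than in the heap
    .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
    # Allow more part uploads in flight per output stream
    .config("spark.hadoop.fs.s3a.fast.upload.active.blocks", "8")
    # Larger connection pool so concurrent part uploads do not queue
    .config("spark.hadoop.fs.s3a.connection.maximum", "200")
    .getOrCreate()
)
```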
My other question: my understanding is that further tuning of the Spark configurations would improve the processing times roughly equally on both platforms, so the cost difference would not really come down. For that reason I don't see much point in going further down that path, although I would be happy to be proven wrong on the cost difference. Am I missing anything here?
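For context, the kind of further tuning I have in mind is along these lines; the values are placeholders for my workload rather than recommendations, and my expectation is that the same settings would help EMR just as much:

```python
# Illustrative only: generic Spark tuning I expect to speed up both EMR and
# Databricks roughly equally (placeholder values, not tested recommendations).
# Assumes an existing SparkSession named `spark`, e.g. in a notebook.
spark.conf.set("spark.sql.adaptive.enabled", "true")              # let AQE coalesce / split skewed partitions
spark.conf.set("spark.sql.shuffle.partitions", "400")             # match shuffle width to the data volume
spark.conf.set("spark.sql.files.maxPartitionBytes", "268435456")  # read 256 MB input splits
```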
> Cost-Benefit Analysis: Weigh performance gains against costs. Sometimes a balance between performance and cost is necessary.
I believe this can be tested by firing off concurrent jobs and checking how the two clusters cope. I hope that is what you meant here?
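If so, the sketch below is roughly how I would drive that test; run_job, the "some_key" column, and the S3 paths are hypothetical placeholders for my actual pipeline:

```python
# Concurrency test sketch: fire several copies of the same job from driver-side
# threads against one cluster and time the whole batch. run_job, "some_key" and
# the S3 paths are hypothetical placeholders, and `spark` is assumed to be an
# existing SparkSession (e.g. in a notebook).
import time
from concurrent.futures import ThreadPoolExecutor

def run_job(path):
    df = spark.read.parquet(path)
    # A count forces the full read/aggregation to execute
    return df.groupBy("some_key").count().count()

paths = [f"s3://my-bucket/input/part={i}/" for i in range(5)]

start = time.time()
with ThreadPoolExecutor(max_workers=len(paths)) as pool:
    results = list(pool.map(run_job, paths))
print(f"{len(paths)} concurrent jobs finished in {time.time() - start:.0f}s")
```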
Anything that can be done to bring down the costs would be greatly welcomed.
Additionally, do you think other factors, like writing to a Delta table rather than CSV in S3, might make any difference at all?
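To make that concrete, the change I am picturing is roughly the following; the output paths are placeholders and `df` stands for the DataFrame being written today:

```python
# Roughly the change in question: swap the CSV output in S3 for a Delta table.
# Paths are placeholders, and `df` stands for the DataFrame currently written out.

# Current approach: plain CSV files in S3
(df.write
   .mode("overwrite")
   .option("header", "true")
   .csv("s3://my-bucket/output/csv/"))

# Alternative: Delta (Parquet + transaction log), avoiding the text re-encoding
(df.write
   .format("delta")
   .mode("overwrite")
   .save("s3://my-bucket/output/delta/"))
```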
Kindly let me know.
Thanks.