POC
02-15-2024 10:26 AM
We tried the Databricks Data Intelligence Platform for one of our clients and found that it is very expensive compared to AWS EMR. I understand it is not an apples-to-apples comparison, as one is a platform and the other is a single software service.
- Unstructured text data: an unstructured text file needs to be read, parsed, and processed per some requirements, with the results written into two separate files, CSV and text (see the first sketch after this list). This involved a lot of shuffling and disk spill; however, tests revealed that Databricks performed similarly to or better than EMR. Especially for the 1 TB file, with the same configuration, Databricks was able to complete the job successfully but EMR was not. Databricks looks good here.
- Structured CSV files: I wrote sample code that reads a CSV, does some aggregations, and joins back to the source CSV. This pattern is repeated for several group-by columns and also involves some windowed aggregations (see the second sketch after this list). It is all mocked-up logic, reading CSV and writing back into CSV in S3. For 225 GB, EMR finished in 40 minutes whereas Databricks finished in 1 hour 5 minutes. For 1 TB, EMR finished in 3 hours and Databricks in 6 hours. When I convert these into cost, Databricks turns out to be 2-4 times more expensive than EMR due to DBU costs. Photon was enabled, and I was hoping it would decrease the processing times and complete the job quicker than EMR, but EMR still finished quicker than Databricks.
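To make the two workloads concrete, here are minimal PySpark sketches of both patterns, assuming a notebook-style `spark` session. All S3 paths, column names, and the `|` delimiter are hypothetical stand-ins, not the actual client code.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("poc-benchmark").getOrCreate()

# --- Sketch 1: unstructured text parsed into CSV and text outputs ---
raw = spark.read.text("s3://my-bucket/raw/input.txt")          # hypothetical path
parsed = raw.select(F.split("value", r"\|").alias("fields"))   # assumed '|' delimiter
parsed.selectExpr("fields[0] AS id", "fields[1] AS payload") \
      .write.mode("overwrite").option("header", True).csv("s3://my-bucket/out/csv/")
raw.write.mode("overwrite").text("s3://my-bucket/out/text/")

# --- Sketch 2: structured CSV with repeated aggregate-and-join plus a window ---
src = (spark.read.option("header", True).csv("s3://my-bucket/structured/")  # hypothetical
            .withColumn("amount", F.col("amount").cast("double")))

for col in ["col_a", "col_b", "col_c"]:                        # hypothetical group-by columns
    agg = src.groupBy(col).agg(F.sum("amount").alias(f"{col}_total"))
    src = src.join(agg, on=col, how="left")                    # join back to the source

w = Window.partitionBy("col_a").orderBy("event_ts")            # windowed aggregation
src = src.withColumn("running_total", F.sum("amount").over(w))

src.write.mode("overwrite").option("header", True).csv("s3://my-bucket/out/result/")
```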
02-16-2024 09:32 AM
Hi @Retired_mod
Thanks for getting back with such valuable information.
| File size | EMR duration | Databricks duration | Cost comparison | Notes |
|---|---|---|---|---|
| 225 GB | 22 mins | 63 mins | EMR is cheaper than Databricks by 5 times | Involves various S3 writes; m5d.4xlarge |
| 225 GB | 45 mins | 84 mins | EMR is cheaper than Databricks by 3.5 times | Involves various S3 writes; m5d.2xlarge |
| 225 GB | 13 mins | 20 mins | EMR is cheaper than Databricks by 2 times | No S3 writes, counts only; m5d.2xlarge |
| 1 TB | 3 hrs 37 mins | 6 hrs 11 mins | EMR is cheaper than Databricks by more than 3 times | Involves various S3 writes; m5d.4xlarge |
| 1 TB | 1 hr 24 mins | 2 hrs | EMR is cheaper than Databricks by about 6 times | No S3 writes, counts only; m5d.4xlarge |
This is the first time these comparisons have been made, and I am a bit surprised by the processing times.
As you can see in row 1, Databricks is 5 times more expensive than EMR. Next, I changed the instance size from m5d.4xlarge to m5d.2xlarge, which brought the cost difference down to 3.5 times. I then had to change the logic to remove the S3 writes (about 5-7 writes) and use counts as actions with m5d.2xlarge (row 3), which brought the cost difference down to 2 times, as the processing times improved and the DBU spend was lowered by using less powerful instances, but more of them. This is the best I can do with the 225 GB file, and only at the cost of changing the logic to reduce the number of writes. I noticed in the logs that EMR uses multipart uploads when writing into S3, but I am not sure whether Databricks does the same for its S3 writes. Any ideas here to improve Databricks write speed? (See the sketch below.)
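One experiment on the multipart-upload question, assuming the cluster goes through the standard Hadoop S3A filesystem: the S3A option names below are real, but Databricks has its own optimized S3 access path, so they may be ignored there; treat this as something to try and verify in the logs, not a guaranteed fix. Values and paths are illustrative.

```python
# Standard Hadoop S3A options; set them in the cluster's Spark config at
# startup. On Databricks they may be ignored if the runtime uses its own
# S3 filesystem/committer, so check the driver logs to confirm.
# spark.hadoop.fs.s3a.multipart.size   134217728   (128 MB parts)
# spark.hadoop.fs.s3a.fast.upload      true
# spark.hadoop.fs.s3a.threads.max      64

# Independent of the committer: fewer, larger output files usually upload
# faster than many small ones, so repartition before the write.
df = spark.read.option("header", True).csv("s3://my-bucket/input/")  # hypothetical path
df.repartition(64).write.mode("overwrite").csv("s3://my-bucket/output/")
```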
My other question: in my understanding, if I tune the Spark configurations further, it would improve the processing times roughly equally on both platforms, so the cost difference might still not come down. I don't see much point there, especially given the size of the cost difference, though I would be happy to be proven wrong. Is anything being missed here?
> Cost-Benefit Analysis: Weigh performance gains against costs. Sometimes a balance between performance and cost is necessary.
I believe this can be tested by firing off concurrent jobs and checking how the two clusters cope; a sketch of that follows below. Is this what you meant here?
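A rough way to run that test from a single notebook, assuming the built-in `spark` session: fire several jobs from Python threads (Spark actions release the GIL while the cluster executes) and compare wall-clock times on each platform. The paths and the stand-in workload are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_job(path):
    # Cheap stand-in workload: read, aggregate, and force execution with count().
    df = spark.read.option("header", True).csv(path)
    return df.groupBy("col_a").count().count()

paths = [f"s3://my-bucket/input/part{i}/" for i in range(4)]  # hypothetical inputs

start = time.time()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(run_job, paths))
print(f"{len(results)} concurrent jobs finished in {time.time() - start:.0f}s")
```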
Anything that can be done to bring down the costs would be greatly welcomed.
Additionally, do you think other factors, such as writing into a Delta table rather than CSV in S3, might make any difference at all? (A sketch follows below.)
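For reference, switching the final write from CSV to Delta is a small change; whether it moves the cost needle is exactly what the test would show. The table and path names here are hypothetical.

```python
# Write the same DataFrame as a Delta table instead of CSV.
df.write.format("delta").mode("overwrite").save("s3://my-bucket/out/delta/")  # hypothetical path

# Or register it as a managed table and compact small files afterwards:
df.write.format("delta").mode("overwrite").saveAsTable("poc.results")         # hypothetical name
spark.sql("OPTIMIZE poc.results")
```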
Kindly let me know.
Thanks.
02-20-2024 04:25 AM
Hi @Retired_mod
Could you please let me know if you have any suggestions on my further findings and the points raised above?
Thanks

