
POC

VANNGA
New Contributor II
Hi,
 
I wonder if you could help me with the below, please.
We tried the Databricks Data Intelligence Platform for one of our clients and found that it is very expensive compared to AWS EMR. I understand it's not an apples-to-apples comparison, as one is a platform and the other is a single software as a service.
It is turning out to be expensive mainly because it takes considerably longer than EMR, especially for structured CSV processing (point 2 below). I wonder if there is something obvious that I am missing, but it's the same code base. I am using the PySpark DataFrame APIs for my POC to compare the timings and convert the timings into costs. Here is a bit of context.
  1. Unstructured text data: An unstructured text file needs to be read, parsed and processed as per some requirements, and written into two separate files, CSV and text. This involved a lot of shuffling and disk spill; however, tests revealed that Databricks performed similar to or better than EMR. In particular, for a 1 TB file with the same configuration, Databricks completed the job successfully but EMR did not. Databricks looks good here.
  2. Structured CSV files: I have written sample code that reads a CSV, does some aggregations, and joins back to the source CSV. This pattern is repeated for several group-by columns and also involves some windowing aggregations. This is all mocked-up logic, reading CSV and writing back into CSV in S3. For 225 GB, EMR finished in 40 mins whereas Databricks finished in 1 hr 5 mins. For 1 TB, EMR finished in 3 hrs and Databricks in 6 hrs. When I convert these into cost, Databricks is turning out to be 2-4 times costlier than EMR due to DBU costs. Photon was enabled; I was hoping that would decrease the processing times and complete the job quicker than EMR, but EMR still finished quicker than Databricks.
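To make the "timings into costs" step concrete, here is a minimal sketch of how such a conversion might look. The durations are the 225 GB figures from point 2; the node count, the EC2 rate and the per-node platform fees are made-up placeholders (not quoted prices), and the `ClusterRun` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class ClusterRun:
    """One job run. Every rate here is a placeholder, not a quoted price."""
    name: str
    minutes: float            # wall-clock duration of the job
    nodes: int                # number of instances in the cluster (assumed)
    ec2_per_node_hour: float  # EC2 price per instance-hour in USD (assumed)
    fee_per_node_hour: float  # platform fee per instance-hour: EMR fee or DBU charge (assumed)

    def cost(self) -> float:
        # Total cost = hours * nodes * (infrastructure rate + platform fee)
        hours = self.minutes / 60.0
        return hours * self.nodes * (self.ec2_per_node_hour + self.fee_per_node_hour)


# Durations come from the post (40 mins vs 1 hr 5 mins); everything else is hypothetical.
emr = ClusterRun("EMR", minutes=40, nodes=10, ec2_per_node_hour=1.00, fee_per_node_hour=0.25)
dbx = ClusterRun("Databricks", minutes=65, nodes=10, ec2_per_node_hour=1.00, fee_per_node_hour=1.50)

ratio = dbx.cost() / emr.cost()
print(f"{emr.name}: ${emr.cost():.2f}  {dbx.name}: ${dbx.cost():.2f}  ratio: {ratio:.2f}x")
```

With these placeholder numbers the ratio works out to 3.25, which shows how a longer runtime and a higher hourly fee multiply together rather than add up. Real rates would need to be taken from the actual AWS and Databricks bills.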
Kindly let me know if there is anything you could help me understand about why processing times are much slower than on EMR, especially as Databricks has a much more optimized Spark and the next-generation processing engine Photon.
2 REPLIES

VANNGA
New Contributor II

Hi @Retired_mod 

Thanks for getting back with such valuable information.

| File size | EMR duration | Databricks duration | Cost comparison | Comments |
| --- | --- | --- | --- | --- |
| 225 GB | 22 mins | 63 mins | EMR is cheaper than Databricks by 5 times | This involves various S3 writes with m5d4xlarge |
| 225 GB | 45 mins | 84 mins | EMR is cheaper than Databricks by 3.5 times | This involves various S3 writes but with m5d2xlarge |
| 225 GB | 13 mins | 20 mins | EMR is cheaper than Databricks by 2 times | This does not involve S3 writes but take counts m52xlarge |
| 1 TB | 3 Hrs 37 mins | 6 Hrs 11 mins | EMR is cheaper than Databricks by more than 3 times | This involves various S3 writes with m5d4xlarge |
| 1 TB | 1 Hr 24 mins | 2 Hrs | EMR is cheaper than Databricks by about 6 times | This does not involve S3 writes but take counts - m5d4xlarge |
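As a rough way to read these figures, each reported cost ratio can be split into a runtime part and an hourly-rate part. The sketch below assumes that total cost is simply runtime multiplied by a constant hourly cluster rate, and it treats "more than 3 times" as 3.0 and "about 6 times" as 6.0; both are simplifications of mine, and the function names are illustrative.

```python
def to_minutes(hours: int = 0, minutes: int = 0) -> int:
    """Convert an 'X Hrs Y mins' duration into minutes."""
    return hours * 60 + minutes


def decompose(emr_min: float, dbx_min: float, cost_ratio: float) -> tuple[float, float]:
    """Split a cost ratio into (runtime ratio, implied hourly-rate ratio).

    Assumes cost = runtime * hourly cluster rate, so
    cost_ratio = runtime_ratio * rate_ratio.
    """
    runtime_ratio = dbx_min / emr_min
    rate_ratio = cost_ratio / runtime_ratio
    return runtime_ratio, rate_ratio


# (label, EMR minutes, Databricks minutes, reported cost ratio), in table order
rows = [
    ("225 GB, S3 writes, m5d4xlarge", 22, 63, 5.0),
    ("225 GB, S3 writes, m5d2xlarge", 45, 84, 3.5),
    ("225 GB, counts only", 13, 20, 2.0),
    ("1 TB, S3 writes", to_minutes(3, 37), to_minutes(6, 11), 3.0),  # "more than 3"
    ("1 TB, counts only", to_minutes(1, 24), to_minutes(2, 0), 6.0),  # "about 6"
]

for label, emr_min, dbx_min, cost_ratio in rows:
    runtime_ratio, rate_ratio = decompose(emr_min, dbx_min, cost_ratio)
    print(f"{label}: runtime x{runtime_ratio:.2f}, implied hourly rate x{rate_ratio:.2f}")
```

Splitting the ratio this way helps show which rows are dominated by slower processing and which by a higher hourly rate, so it is clearer where tuning effort could pay off.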

This is the first time the comparisons have been made and I am a bit surprised by the processing times.

As you can see in row 1, Databricks is 5 times more expensive than EMR. Next, I changed the instance size from m5d4xlarge to m5d2xlarge, which brought the cost difference down to 3.5 times (row 2). I then had to change the logic to remove the S3 writes (about 5-7 writes) and use counts as actions with m5d2xlarge (row 3), which brought the cost difference down to 2 times, as processing times are better and the DBUs are lowered by using less powerful instances, but more of them. This is the best I can do with the 225 GB file, but at the cost of changing the logic to reduce the number of writes. I noticed in the logs that EMR uses multipart uploads while writing into S3, but I am not sure if Databricks does the same for its S3 writes. Any ideas here to improve Databricks write speed?

My other question: in my understanding, if I further tune the Spark configurations, it would improve the processing times equally on both platforms, so the cost difference might still not come down. So I don't see much point in that, given the cost difference. I am happy to be proven wrong here. Is anything being missed here, please?

Cost-Benefit Analysis: Weigh performance gains against costs. Sometimes a balance between performance and cost is necessary. 

I believe this can be tested by firing off concurrent jobs and checking how the two clusters cope? I hope this is what you meant here?

Anything that can be done to bring down the costs would be greatly welcomed.

Additionally, do you think other factors, like writing into a Delta table rather than CSV in S3, might make any difference at all?
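For reference, the change I have in mind is small. Here is a minimal sketch, assuming the result is a regular Spark DataFrame and the Delta format is available on the cluster; the `write_result` helper and the S3 path are hypothetical names for illustration, and whether Delta actually shortens the run is something only a test would show.

```python
from typing import Any


def write_result(df: Any, path: str, fmt: str = "csv") -> None:
    """Write a Spark DataFrame to `path` as CSV or Delta, overwriting old output.

    `df` is expected to be a pyspark.sql.DataFrame; it is typed as Any so this
    sketch does not need pyspark just to be imported.
    """
    if fmt == "csv":
        # Current POC behaviour: plain CSV files with a header row
        df.write.mode("overwrite").option("header", True).csv(path)
    elif fmt == "delta":
        # Alternative to test: Delta table at the same kind of location
        df.write.format("delta").mode("overwrite").save(path)
    else:
        raise ValueError(f"unsupported format: {fmt!r}")
```

Usage would be something like `write_result(result_df, "s3://example-bucket/poc/output", fmt="delta")`, where both the DataFrame name and the bucket are placeholders. Keeping the format as a parameter lets the same job be timed both ways without touching the rest of the logic.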

Kindly let me know.

Thanks. 

 

 

VANNGA
New Contributor II

Hi @Retired_mod 

Please can you let me know if there are any suggestions on my further findings/points raised above?

Thanks
