
POC

VANNGA
New Contributor II
Hi,
 
I wonder if you could help me with the below, please.
We tried the Databricks Data Intelligence Platform for one of our clients and found that it is very expensive compared to AWS EMR. I understand it is not an apples-to-apples comparison, as one is a platform and the other is a single managed service.
It is turning out to be expensive, especially because it takes quite a bit longer than EMR for structured CSV processing (point 2 below). I wonder if there is something obvious I am missing, but it is the same code base on both. I am using PySpark DataFrame APIs for my POC to compare the timings and convert the timings into costs. Here is a bit of context.
  1. Unstructured text data: An unstructured text file needs to be read, parsed, and processed per some requirements, then written into two separate files (CSV and text). This involved a lot of shuffling and disk spill, but tests revealed that Databricks performed similarly to or better than EMR. Notably, for the 1 TB file with the same configuration, Databricks was able to complete the job successfully but EMR was not. Databricks looks good here.
  2. Structured CSV files: I have written sample code that reads a CSV, does some aggregations, and joins back to the source CSV. This pattern is repeated for several group-by columns and also involves some windowing aggregations. It is all mocked-up logic, reading CSV from and writing CSV back to S3 (a sketch of the pattern follows below). For 225 GB, EMR finished in 40 mins whereas Databricks finished in 1 hr 5 mins. For 1 TB, EMR finished in 3 hrs and Databricks in 6 hrs. When I convert these into cost, Databricks turns out to be 2-4 times costlier than EMR due to DBU costs. Photon was enabled, and I was hoping it would decrease the processing times and complete the job quicker than EMR, but EMR still finished quicker than Databricks.
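For reference, this is roughly the shape of the mocked-up logic. It is a minimal sketch: the bucket, paths, and column names (group_col, amount, event_date) are illustrative placeholders, not the real client data.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Read the structured CSV from S3 (path and columns are placeholders)
df = spark.read.csv("s3://my-bucket/input/", header=True, inferSchema=True)

# Aggregate per group, then join the aggregates back to the source rows;
# this read -> aggregate -> join pattern repeats for several group-by columns
agg = df.groupBy("group_col").agg(F.sum("amount").alias("total_amount"))
joined = df.join(agg, on="group_col", how="left")

# Windowing aggregation over the same grouping
w = Window.partitionBy("group_col").orderBy("event_date")
result = joined.withColumn("running_total", F.sum("amount").over(w))

# Write back to CSV in S3
result.write.mode("overwrite").csv("s3://my-bucket/output/", header=True)
```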
Kindly let me know if there is anything you can suggest to help me understand why the processing times are so much slower than on EMR, especially given that Databricks has a heavily optimized Spark runtime and the next-generation Photon engine.
3 REPLIES

Kaniz_Fatma
Community Manager

Hi @VANNGA

 

Platform vs. Software as a Service (SaaS):

  • You’ve rightly pointed out that Databricks is a platform, while AWS EMR is a managed service for running big data frameworks like Apache Spark. This distinction affects the overall architecture, flexibility, and pricing.
  • Databricks provides an integrated environment for data engineering, data science, and machine learning, which can be advantageous for end-to-end workflows. EMR, on the other hand, focuses specifically on Spark (and other big data tools) execution.

Structured CSV Processing:

  • Your observation that Databricks performs similarly or better than EMR for unstructured text data is interesting. Databricks’ optimizations and features might contribute to this.
  • However, when it comes to structured CSV files, you’ve noticed a significant difference in processing times. EMR outperforms Databricks in this scenario.

 

Let’s explore potential reasons for this discrepancy:

Performance Factors:

  • State Management and Structured Streaming:
    • Databricks offers integration with Kinesis Data Streams for Structured Streaming out of the box. This can be advantageous for real-time data processing.
    • Additionally, Databricks leverages RocksDB (a key-value store) to manage states outside the JVM, allowing it to handle tens of millions of states per executor without garbage collector (GC) issues.
    • EMR, while powerful, might not have the same level of state management optimization.
  • ETL and Machine Learning Workloads:
    • ETL and ML workloads are common use cases for both platforms.
    • Databricks’ unified analytics platform is optimized for ML and AI workloads, making it a great choice for data scientists and ML engineers.
    • EMR, being highly scalable, can handle various use cases, from simple data processing to complex analytics.
    • The underlying cloud provider (AWS in this case) can impact performance. EMR’s native integration with AWS infrastructure might offer advantages.
    • Databricks emphasizes Spark optimization, which can lead to superior performance in Spark-based applications.
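On the RocksDB state-store point above: on Databricks, the RocksDB-backed state store for Structured Streaming is enabled with a documented session setting, sketched below. Note this applies to streaming state only, not to the batch CSV jobs in this POC.

```python
# Use RocksDB to hold Structured Streaming state off the JVM heap (Databricks setting);
# this affects streaming queries only, not batch reads/writes.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
)
```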

Cost Considerations:

  • While Databricks’ performance optimizations are valuable, the cost of Databricks Units (DBUs) can significantly impact the overall expense.
  • EMR’s pricing model is different, based on instance types and usage hours.
  • When converting processing times to costs, it’s essential to consider both execution time and pricing. Databricks’ higher processing time may translate to higher costs, even if it performs well.
  • Photon, although promising, didn’t yield the expected performance boost in your case. Investigating further might reveal insights into its limitations or configuration.

Recommendations:

  • Profile Your Workloads: Dive deeper into specific workloads (e.g., aggregations, joins, windowing) to identify bottlenecks. Use Spark’s profiling tools to understand where time is spent.
  • Tune Spark Configurations: Both platforms allow tuning Spark configurations. Experiment with memory settings, shuffle partitions, and parallelism to optimize performance (see the sketch after this list).
  • Evaluate Data Shuffling: Shuffling can impact performance. Ensure efficient data partitioning and minimize shuffling.
  • Consider Instance Types: EMR’s instance types and Databricks’ cluster configurations play a crucial role. Choose appropriately based on workload characteristics.
  • Cost-Benefit Analysis: Weigh performance gains against costs. Sometimes a balance between performance and cost is necessary.
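As a starting point for the tuning suggestion above, here is a minimal sketch of settings worth experimenting with on both platforms. The values are illustrative, not recommendations; benchmark each change against your own workload.

```python
# Illustrative starting points; the right values depend on data volume,
# cluster size, and workload shape -- benchmark every change.
spark.conf.set("spark.sql.adaptive.enabled", "true")           # adaptive query execution
spark.conf.set("spark.sql.shuffle.partitions", "400")          # default is 200; size to cores and data
spark.conf.set("spark.sql.files.maxPartitionBytes", "256m")    # larger input splits for big CSV scans
```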

VANNGA
New Contributor II

Hi @Kaniz_Fatma 

Thanks for getting back with such valuable information.

| System | File size | Duration | System | Duration | Cost comparison | Comments |
|---|---|---|---|---|---|---|
| EMR | 225 GB | 22 mins | Databricks | 63 mins | EMR is cheaper than Databricks by 5 times | Involves various S3 writes, with m5d.4xlarge |
| EMR | 225 GB | 45 mins | Databricks | 84 mins | EMR is cheaper than Databricks by 3.5 times | Involves various S3 writes, but with m5d.2xlarge |
| EMR | 225 GB | 13 mins | Databricks | 20 mins | EMR is cheaper than Databricks by 2 times | No S3 writes; takes counts instead, with m5d.2xlarge |
| EMR | 1 TB | 3 hrs 37 mins | Databricks | 6 hrs 11 mins | EMR is cheaper than Databricks by more than 3 times | Involves various S3 writes, with m5d.4xlarge |
| EMR | 1 TB | 1 hr 24 mins | Databricks | 2 hrs | EMR is cheaper than Databricks by about 6 times | No S3 writes; takes counts instead, with m5d.4xlarge |

This is the first time these comparisons have been made, and I am a bit surprised by the processing times.

As you can see in row 1, Databricks is 5 times more expensive than EMR. Next, I changed the instance size from m5d.4xlarge to m5d.2xlarge, which brought the cost difference down to 3.5 times. I then had to change the logic to remove the S3 writes (about 5-7 writes) and use counts as the actions, with m5d.2xlarge (row 3), which brought the cost difference down to 2 times: processing times are better, and the DBUs are lowered by using less powerful instances, but more of them. This is the best I can do with the 225 GB file, at the cost of changing the logic to reduce the number of writes. I noticed in the logs that EMR uses multipart uploads while writing into S3, but I am not sure if Databricks does the same for its S3 writes. Any ideas here to improve Databricks' write speed?

My other question is: in my understanding, if I further tune the Spark configurations, it would improve the processing times roughly equally on both platforms, so the cost difference might not come down. I don't see much point there, given the cost difference, though I would be happy to be proven wrong. Is there anything being missed here, please?

Cost-Benefit Analysis: Weigh performance gains against costs. Sometimes a balance between performance and cost is necessary. 

I believe this can be tested by firing off concurrent jobs and checking how the two clusters cope? I hope this is what you meant here?

Anything that can be done to bring down the costs would be greatly welcomed.

Additionally, do you think other factors, like writing into a Delta table rather than CSV in S3, might make any difference at all? A sketch of the variant I mean follows below.
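For context, this is the kind of change I mean; a minimal sketch with placeholder paths, writing the same result as a Delta table instead of CSV:

```python
# Current approach: CSV output to S3 (placeholder path)
result.write.mode("overwrite").csv("s3://my-bucket/output_csv/", header=True)

# Variant to test: the same result written as a Delta table
result.write.format("delta").mode("overwrite").save("s3://my-bucket/output_delta/")
```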

Kindly let me know.

Thanks. 

 

 

VANNGA
New Contributor II

Hi @Kaniz_Fatma 

Could you please let me know if there are any suggestions on the further findings/points I raised above?

Thanks
