
POC Comparison: Databricks vs AWS EMR

TravisBrowne
New Contributor

Hello,

I need some assistance with a comparison between Databricks and AWS EMR. We've been evaluating the Databricks Data Intelligence Platform for a client and found it to be significantly more expensive than AWS EMR. I understand the challenge in making a direct comparison, since Databricks is a comprehensive platform while EMR is more of a single service.

The cost disparity becomes apparent when dealing with structured CSV processing (details in point 2 below). Despite using the same codebase, our tests indicate longer processing times with Databricks compared to EMR. I'm utilizing PySpark DataFrame APIs for this proof of concept to measure and compare the performance and cost. Here's the context:

  1. Unstructured Text Data: We have an unstructured text file that needs to be read, parsed, and processed according to specific requirements, resulting in two output files: one CSV and one text. This process involves considerable shuffling and disk spills. Our tests showed that Databricks performed as well as or better than EMR. Specifically, for a 1 TB file with identical configurations, Databricks completed the job successfully while EMR did not. Databricks seems to excel in this scenario.

  2. Structured CSV Files: I developed sample code to read a CSV file, perform aggregations, and join the results back to the source CSV. This pattern is repeated for various group-by columns and includes some windowing aggregations. The logic involves reading from and writing back to CSV files in S3 (a minimal sketch of the pattern follows below). For a 225 GB file, EMR completed the task in 40 minutes, while Databricks took 1 hour and 5 minutes. For a 1 TB file, EMR finished in 3 hours compared to Databricks' 6 hours. When these timings are converted into costs, Databricks ends up being 2-4 times more expensive than EMR due to the DBU costs. I enabled Photon, hoping it would reduce processing times and let Databricks outperform EMR, but EMR still finished quicker.
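For reference, the job follows roughly the shape sketched below. This is a minimal PySpark sketch; the bucket paths, column names, and specific aggregations are simplified placeholders rather than the actual POC code.

# Minimal sketch of the CSV benchmark pattern: read a CSV from S3, aggregate,
# join the aggregate back to the source, add a windowed aggregation, write back.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("csv-benchmark-sketch").getOrCreate()

src = spark.read.csv("s3://my-bucket/input/source.csv", header=True, inferSchema=True)

# Aggregate by a group-by column and join the result back to the source.
agg = src.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
joined = src.join(agg, on="customer_id", how="left")

# A windowed aggregation over the same data.
w = Window.partitionBy("customer_id").orderBy("event_date")
result = joined.withColumn("running_amount", F.sum("amount").over(w))

result.write.mode("overwrite").csv("s3://my-bucket/output/result", header=True)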

Can someone help me understand why Databricks is significantly slower than EMR for these structured CSV tasks, despite Databricks' optimized Spark and next-generation Photon engine?


Schofield
New Contributor III

I can't say for sure what your exact limitation is, but two things I have come across in EMR bake-offs are:

1) A bottleneck in network throughput. Verify that the S3 reads and writes are happening at similar throughput rates in both environments. This is a cloud setup issue rather than a Databricks issue; it could be the node type being used or the network routing between your Databricks cluster and S3.
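A rough first check (the path is a placeholder; point it at the same file in both environments) is to time a full scan and compare the effective read throughput:

# Time a full pass over the same S3 file on each cluster and compare.
# This is only a coarse throughput probe, not a rigorous benchmark.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

start = time.time()
row_count = spark.read.csv("s3://my-bucket/input/source.csv", header=True).count()
print(f"Scanned {row_count} rows in {time.time() - start:.1f}s")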

2) Unnecessary auto-scaling in the Databricks clusters. Sometimes Databricks can get a little too proactive in scaling up and then has to back off, which can slow down the end-to-end execution time. For a well-known workload, the cluster should be right-sized to maximize usage of each node and eliminate spilling to disk. A right-sized cluster also avoids auto-scaling, which skips the node spin-up and spin-down time during job execution.
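For illustration, a fixed-size (non-autoscaling) cluster spec might look like the sketch below. The field names follow the Databricks Clusters API as far as I know, but the runtime version, node type, and worker count are placeholders you would size to your own workload:

# Hypothetical cluster spec: fixed worker count, no "autoscale" block.
fixed_size_cluster = {
    "cluster_name": "csv-benchmark",
    "spark_version": "14.3.x-scala2.12",  # placeholder runtime version
    "node_type_id": "i3.2xlarge",         # placeholder instance type
    "num_workers": 8,                     # right-size this; fixed, not autoscaled
    "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"},
}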

Regarding Photon: it kicks in for specific workloads. Things like UDFs and RDD APIs won't take advantage of Photon.
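As a small illustration (the DataFrame and column names are made up), the first variant below uses a Python UDF, which won't be executed by Photon, while the second expresses the same logic with built-in column functions that can be:

# A Python UDF is evaluated in the Python worker and is not photonized;
# the equivalent built-in expression can stay in the native engine.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumn("amount", F.col("id") * 1.5)

to_cents_udf = F.udf(lambda x: x * 100.0, DoubleType())
with_udf = df.withColumn("cents", to_cents_udf("amount"))       # not Photon-eligible

with_builtin = df.withColumn("cents", F.col("amount") * 100.0)  # Photon-eligible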

Definitely take the time to drill into the Spark UI to see if there are any differences in the actual Spark execution. This may uncover other factors, such as job configurations, that are impacting your benchmarks.
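One quick way to surface configuration drift between the two environments (beyond eyeballing the Spark UI) is to dump the effective Spark conf on each cluster and diff the two outputs:

# Print the effective Spark configuration so it can be diffed across clusters.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key}={value}")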

Kaniz_Fatma
Community Manager

Hi @TravisBrowne, Thanks for reaching out! Please review the response and let us know if it answers your question. Your feedback is valuable to us and the community.

If the response resolves your issue, kindly mark it as the accepted solution. This will help close the thread and assist others with similar queries.

We appreciate your participation and are here if you need further assistance!
