
POC Comparison: Databricks vs AWS EMR

TravisBrowne
New Contributor

Hello,

I need some assistance with a comparison between Databricks and AWS EMR. We've been evaluating the Databricks Data Intelligence Platform for a client and found it to be significantly more expensive than AWS EMR. I understand the challenge in making a direct comparison, since Databricks is a comprehensive platform while EMR is more of a single service.

The cost disparity becomes apparent when dealing with structured CSV processing (details in point 2 below). Despite using the same codebase, our tests indicate longer processing times with Databricks compared to EMR. I'm utilizing PySpark DataFrame APIs for this proof of concept to measure and compare the performance and cost. Here's the context:

  1. Unstructured Text Data: We have an unstructured text file that needs to be read, parsed, and processed according to specific requirements, resulting in two output files: one CSV and one text. This process involves considerable shuffling and disk spills. Our tests showed that Databricks performed as well as or better than EMR. Specifically, for a 1 TB file with identical configurations, Databricks completed the job successfully while EMR did not. Databricks seems to excel in this scenario.

  2. Structured CSV Files: I developed sample code to read a CSV file, perform aggregations, and join the results back to the source CSV. This pattern is repeated for various group-by columns and includes some windowing aggregations. The logic involves reading from and writing back to CSV files in S3 (a minimal sketch of the pattern follows below). For a 225 GB file, EMR completed the task in 40 minutes, while Databricks took 1 hour and 5 minutes. For a 1 TB file, EMR finished in 3 hours compared to Databricks' 6 hours. When these timings are converted into costs, Databricks ends up being 2-4 times more expensive than EMR due to the DBU costs. I enabled Photon, hoping it would reduce processing times and let Databricks outperform EMR, but EMR still finished quicker.
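For reference, the job follows roughly the shape sketched below. This is a minimal PySpark sketch; the bucket paths, column names, and specific aggregations are simplified placeholders rather than the actual POC code.

# Minimal sketch of the CSV benchmark pattern: read a CSV from S3, aggregate,
# join the aggregate back to the source, add a windowed aggregation, write back.
# Paths and column names are placeholders.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("csv-benchmark-sketch").getOrCreate()

src = spark.read.csv("s3://my-bucket/input/source.csv", header=True, inferSchema=True)

# Aggregate by a group-by column and join the result back to the source.
agg = src.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
joined = src.join(agg, on="customer_id", how="left")

# A windowed aggregation over the same data.
w = Window.partitionBy("customer_id").orderBy("event_date")
result = joined.withColumn("running_amount", F.sum("amount").over(w))

result.write.mode("overwrite").csv("s3://my-bucket/output/result", header=True)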

Can someone help me understand why Databricks is significantly slower than EMR for these structured CSV tasks, despite Databricks' optimized Spark and next-generation Photon engine?


Schofield
New Contributor III

I can't say for sure what your exact limitation is, but two things I have come across in EMR bake-offs are:

1) A bottleneck in network throughput. Verify that the S3 reads and writes are happening at similar throughput rates in both environments. This is a cloud setup issue rather than a Databricks issue; it could be the node type being used or the network routing between your Databricks cluster and S3.
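A rough first check (the path is a placeholder; point it at the same file in both environments) is to time a full scan and compare the effective read throughput:

# Time a full pass over the same S3 file on each cluster and compare.
# This is only a coarse throughput probe, not a rigorous benchmark.
import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

start = time.time()
row_count = spark.read.csv("s3://my-bucket/input/source.csv", header=True).count()
print(f"Scanned {row_count} rows in {time.time() - start:.1f}s")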

2) Unnecessary auto-scaling in the Databricks clusters. Sometimes Databricks can get a little too proactive in scaling up and then has to back off, which can slow down the end-to-end execution time. For a well-known workload, the cluster should be right-sized to maximize usage of each node and eliminate spilling to disk. A right-sized cluster also avoids auto-scaling, which skips the node spin-up and spin-down time during job execution.
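For illustration, a fixed-size (non-autoscaling) cluster spec might look like the sketch below. The field names follow the Databricks Clusters API as far as I know, but the runtime version, node type, and worker count are placeholders you would size to your own workload:

# Hypothetical cluster spec: fixed worker count, no "autoscale" block.
fixed_size_cluster = {
    "cluster_name": "csv-benchmark",
    "spark_version": "14.3.x-scala2.12",  # placeholder runtime version
    "node_type_id": "i3.2xlarge",         # placeholder instance type
    "num_workers": 8,                     # right-size this; fixed, not autoscaled
    "aws_attributes": {"availability": "SPOT_WITH_FALLBACK"},
}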

Regarding Photon: it kicks in for specific workloads. Things like UDFs and RDD APIs won't take advantage of Photon.
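As a small illustration (the DataFrame and column names are made up), the first variant below uses a Python UDF, which won't be executed by Photon, while the second expresses the same logic with built-in column functions that can be:

# A Python UDF is evaluated in the Python worker and is not photonized;
# the equivalent built-in expression can stay in the native engine.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).withColumn("amount", F.col("id") * 1.5)

to_cents_udf = F.udf(lambda x: x * 100.0, DoubleType())
with_udf = df.withColumn("cents", to_cents_udf("amount"))       # not Photon-eligible

with_builtin = df.withColumn("cents", F.col("amount") * 100.0)  # Photon-eligible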

Definitely take the time to drill into the Spark UI to see if there are any differences in the actual Spark execution. This may uncover other factors, such as job configurations, that are impacting your benchmarks.
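One quick way to surface configuration drift between the two environments (beyond eyeballing the Spark UI) is to dump the effective Spark conf on each cluster and diff the two outputs:

# Print the effective Spark configuration so it can be diffed across clusters.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

for key, value in sorted(spark.sparkContext.getConf().getAll()):
    print(f"{key}={value}")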

Kaniz_Fatma
Community Manager

Hi @TravisBrowne, Thanks for reaching out! Please review the response and let us know if it answers your question. Your feedback is valuable to us and the community.

If the response resolves your issue, kindly mark it as the accepted solution. This will help close the thread and assist others with similar queries.

We appreciate your participation and are here if you need further assistance!
