POC Comparison: Databricks vs AWS EMR
08-04-2024 02:10 AM
Hello,
I need some assistance with a comparison between Databricks and AWS EMR. We've been evaluating the Databricks Data Intelligence platform for a client and found it to be significantly more expensive than AWS EMR. I understand the challenge in making a direct comparison since Databricks is a comprehensive platform while EMR is more of a single service.
The cost disparity shows up in structured CSV processing (details in point 2 below). Despite using the same codebase, our tests show longer processing times on Databricks than on EMR. I'm using the PySpark DataFrame APIs for this proof of concept to measure and compare performance and cost. Here's the context:
1. Unstructured Text Data: We have an unstructured text file that needs to be read, parsed, and processed according to specific requirements, producing two output files: one CSV and one text. This process involves considerable shuffling and disk spills. In our tests, Databricks performed as well as or better than EMR. Specifically, for a 1 TB file with identical configurations, Databricks completed the job successfully while EMR did not. Databricks seems to excel in this scenario.
2. Structured CSV Files: I wrote sample code to read a CSV file, perform aggregations, and join the results back to the source CSV. This pattern is repeated for various group-by columns and includes some windowing aggregations. The logic reads from and writes back to CSV files in S3. For a 225 GB file, EMR completed the task in 40 minutes, while Databricks took 1 hour and 5 minutes. For a 1 TB file, EMR finished in 3 hours compared to Databricks' 6 hours. When these timings are converted into costs, Databricks ends up being 2-4 times more expensive than EMR because of the DBU costs. Photon was enabled, which I hoped would reduce processing times and outperform EMR, but EMR still finished sooner.
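To make the cost arithmetic in point 2 easier to follow, here is a rough sketch of how the runtimes turn into a cost multiple. The runtimes are the ones reported above; the hourly prices (`EC2_HOURLY_USD`, `EMR_FEE_HOURLY_USD`, `DBU_HOURLY_USD`) are made-up placeholders for illustration only, not actual EC2, EMR, or DBU rates, and the function names are hypothetical.

```python
def runtime_ratio(databricks_minutes: float, emr_minutes: float) -> float:
    """How many times longer the Databricks run took than the EMR run."""
    return databricks_minutes / emr_minutes


def cost_ratio(
    databricks_minutes: float,
    emr_minutes: float,
    databricks_hourly_usd: float,
    emr_hourly_usd: float,
) -> float:
    """Ratio of total job cost, where cost = runtime in hours x hourly cluster price."""
    databricks_cost = databricks_minutes / 60 * databricks_hourly_usd
    emr_cost = emr_minutes / 60 * emr_hourly_usd
    return databricks_cost / emr_cost


# Runtimes from point 2, in minutes: (Databricks, EMR).
RUNS = {"225 GB": (65, 40), "1 TB": (360, 180)}

# PLACEHOLDER prices per cluster-hour (assumptions, not real rates).
EC2_HOURLY_USD = 10.0      # same instances assumed on both platforms
EMR_FEE_HOURLY_USD = 2.5   # EMR surcharge on top of EC2
DBU_HOURLY_USD = 8.0       # Databricks DBU charge on top of EC2

for label, (dbx_min, emr_min) in RUNS.items():
    slower = runtime_ratio(dbx_min, emr_min)
    pricier = cost_ratio(
        dbx_min,
        emr_min,
        EC2_HOURLY_USD + DBU_HOURLY_USD,
        EC2_HOURLY_USD + EMR_FEE_HOURLY_USD,
    )
    print(f"{label}: runtime x{slower:.2f}, cost x{pricier:.2f} (placeholder prices)")
```

If both clusters cost the same per hour, the cost multiple equals the runtime multiple (about 1.6x for 225 GB and 2x for 1 TB), so any gap beyond that comes from the difference in hourly price between the two platforms.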
Can someone help me understand why Databricks is significantly slower than EMR for these structured CSV tasks, despite Databricks' optimized Spark and next-generation Photon engine?