
POC

VANNGA
New Contributor II
Hi,
 
I wonder if you could help me with the below, please.
We tried the Databricks Data Intelligence Platform for one of our clients and found that it is very expensive compared to AWS EMR. I understand it is not an apples-to-apples comparison, as one is a platform and the other is a single managed service.
It is turning out to be expensive, especially because it takes quite a bit longer than EMR for structured CSV processing (point 2 below). I wonder if there is something obvious I am missing, but it is the same code base on both. I am using PySpark DataFrame APIs for my POC to compare the timings and convert the timings into costs. Here is a bit of context.
  1. Unstructured text data: An unstructured text file needs to be read, parsed, and processed per some requirements, then written into two separate files (CSV and text). This involved a lot of shuffling and disk spill, but tests revealed that Databricks performed similarly to or better than EMR. Notably, for the 1 TB file with the same configuration, Databricks was able to complete the job successfully but EMR was not. Databricks looks good here.
  2. Structured CSV files: I have written sample code that reads a CSV, does some aggregations, and joins back to the source CSV. This pattern is repeated for several group-by columns and also involves some windowing aggregations. It is all mocked-up logic, reading CSV from and writing CSV back to S3 (a sketch of the pattern follows below). For 225 GB, EMR finished in 40 mins whereas Databricks finished in 1 hr 5 mins. For 1 TB, EMR finished in 3 hrs and Databricks in 6 hrs. When I convert these into cost, Databricks turns out to be 2-4 times costlier than EMR due to DBU costs. Photon was enabled, and I was hoping it would decrease the processing times and complete the job quicker than EMR, but EMR still finished quicker than Databricks.
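For reference, this is roughly the shape of the mocked-up logic. It is a minimal sketch: the bucket, paths, and column names (group_col, amount, event_date) are illustrative placeholders, not the real client data.

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Read the structured CSV from S3 (path and columns are placeholders)
df = spark.read.csv("s3://my-bucket/input/", header=True, inferSchema=True)

# Aggregate per group, then join the aggregates back to the source rows;
# this read -> aggregate -> join pattern repeats for several group-by columns
agg = df.groupBy("group_col").agg(F.sum("amount").alias("total_amount"))
joined = df.join(agg, on="group_col", how="left")

# Windowing aggregation over the same grouping
w = Window.partitionBy("group_col").orderBy("event_date")
result = joined.withColumn("running_total", F.sum("amount").over(w))

# Write back to CSV in S3
result.write.mode("overwrite").csv("s3://my-bucket/output/", header=True)
```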
Kindly let me know if there is anything you can suggest to help me understand why the processing times are so much slower than on EMR, especially given that Databricks has a heavily optimized Spark runtime and the next-generation Photon engine.
3 REPLIES

Kaniz_Fatma
Community Manager

Hi @VANNGA

 

Platform vs. Software as a Service (SaaS):

  • You’ve rightly pointed out that Databricks is a platform, while AWS EMR is a managed service for running big data frameworks like Apache Spark. This distinction affects the overall architecture, flexibility, and pricing.
  • Databricks provides an integrated environment for data engineering, data science, and machine learning, which can be advantageous for end-to-end workflows. EMR, on the other hand, focuses specifically on Spark (and other big data tools) execution.

Structured CSV Processing:

  • Your observation that Databricks performs similarly or better than EMR for unstructured text data is interesting. Databricks’ optimizations and features might contribute to this.
  • However, when it comes to structured CSV files, you’ve noticed a significant difference in processing times. EMR outperforms Databricks in this scenario.

 

Let’s explore potential reasons for this discrepancy:

Performance Factors:

  • State Management and Structured Streaming:
    • Databricks offers integration with Kinesis Data Streams for Structured Streaming out of the box. This can be advantageous for real-time data processing.
    • Additionally, Databricks leverages RocksDB (a key-value store) to manage states outside the JVM, allowing it to handle tens of millions of states per executor without garbage collector (GC) issues.
    • EMR, while powerful, might not have the same level of state management optimization.
  • ETL and Machine Learning Workloads:
    • ETL and ML workloads are common use cases for both platforms.
    • Databricks’ unified analytics platform is optimized for ML and AI workloads, making it a great choice for data scientists and ML engineers.
    • EMR, being highly scalable, can handle various use cases, from simple data processing to complex analytics.
    • The underlying cloud provider (AWS in this case) can impact performance. EMR’s native integration with AWS infrastructure might offer advantages.
    • Databricks emphasizes Spark optimization, which can lead to superior performance in Spark-based applications.
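On the RocksDB state-store point above: on Databricks, the RocksDB-backed state store for Structured Streaming is enabled with a documented session setting, sketched below. Note this applies to streaming state only, not to the batch CSV jobs in this POC.

```python
# Use RocksDB to hold Structured Streaming state off the JVM heap (Databricks setting);
# this affects streaming queries only, not batch reads/writes.
spark.conf.set(
    "spark.sql.streaming.stateStore.providerClass",
    "com.databricks.sql.streaming.state.RocksDBStateStoreProvider",
)
```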

Cost Considerations:

  • While Databricks’ performance optimizations are valuable, the cost of Databricks Units (DBUs) can significantly impact the overall expense.
  • EMR’s pricing model is different, based on instance types and usage hours.
  • When converting processing times to costs, it’s essential to consider both execution time and pricing. Databricks’ higher processing time may translate to higher costs, even if it performs well.
  • Photon, although promising, didn’t yield the expected performance boost in your case. Investigating further might reveal insights into its limitations or configuration.

Recommendations:

  • Profile Your Workloads: Dive deeper into specific workloads (e.g., aggregations, joins, windowing) to identify bottlenecks. Use Spark’s profiling tools to understand where time is spent.
  • Tune Spark Configurations: Both platforms allow tuning Spark configurations. Experiment with memory settings, shuffle partitions, and parallelism to optimize performance (see the sketch after this list).
  • Evaluate Data Shuffling: Shuffling can impact performance. Ensure efficient data partitioning and minimize shuffling.
  • Consider Instance Types: EMR’s instance types and Databricks’ cluster configurations play a crucial role. Choose appropriately based on workload characteristics.
  • Cost-Benefit Analysis: Weigh performance gains against costs. Sometimes a balance between performance and cost is necessary.
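As a starting point for the tuning suggestion above, here is a minimal sketch of settings worth experimenting with on both platforms. The values are illustrative, not recommendations; benchmark each change against your own workload.

```python
# Illustrative starting points; the right values depend on data volume,
# cluster size, and workload shape -- benchmark every change.
spark.conf.set("spark.sql.adaptive.enabled", "true")           # adaptive query execution
spark.conf.set("spark.sql.shuffle.partitions", "400")          # default is 200; size to cores and data
spark.conf.set("spark.sql.files.maxPartitionBytes", "256m")    # larger input splits for big CSV scans
```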

VANNGA
New Contributor II

Hi @Kaniz_Fatma 

Thanks for getting back with such valuable information.

| System | File size | Duration | System | Duration | Cost comparison | Comments |
|---|---|---|---|---|---|---|
| EMR | 225 GB | 22 mins | Databricks | 63 mins | EMR is cheaper than Databricks by 5 times | Involves various S3 writes, with m5d.4xlarge |
| EMR | 225 GB | 45 mins | Databricks | 84 mins | EMR is cheaper than Databricks by 3.5 times | Involves various S3 writes, but with m5d.2xlarge |
| EMR | 225 GB | 13 mins | Databricks | 20 mins | EMR is cheaper than Databricks by 2 times | No S3 writes; takes counts instead, with m5d.2xlarge |
| EMR | 1 TB | 3 hrs 37 mins | Databricks | 6 hrs 11 mins | EMR is cheaper than Databricks by more than 3 times | Involves various S3 writes, with m5d.4xlarge |
| EMR | 1 TB | 1 hr 24 mins | Databricks | 2 hrs | EMR is cheaper than Databricks by about 6 times | No S3 writes; takes counts instead, with m5d.4xlarge |

This is the first time these comparisons have been made, and I am a bit surprised by the processing times.

As you can see in row 1, Databricks is 5 times more expensive than EMR. Next, I changed the instance size from m5d.4xlarge to m5d.2xlarge, which brought the cost difference down to 3.5 times. I then had to change the logic to remove the S3 writes (about 5-7 writes) and use counts as the actions, with m5d.2xlarge (row 3), which brought the cost difference down to 2 times: processing times are better, and the DBUs are lowered by using less powerful instances, but more of them. This is the best I can do with the 225 GB file, at the cost of changing the logic to reduce the number of writes. I noticed in the logs that EMR uses multipart uploads while writing into S3, but I am not sure if Databricks does the same for its S3 writes. Any ideas here to improve Databricks' write speed?

My other question is: in my understanding, if I further tune the Spark configurations, it would improve the processing times roughly equally on both platforms, so the cost difference might not come down. I don't see much point there, given the cost difference, though I would be happy to be proven wrong. Is there anything being missed here, please?

Cost-Benefit Analysis: Weigh performance gains against costs. Sometimes a balance between performance and cost is necessary. 

I believe this can be tested by firing off concurrent jobs and checking how the two clusters cope? I hope this is what you meant here?

Anything that can be done to bring down the costs would be greatly welcomed.

Additionally, do you think other factors, like writing into a Delta table rather than CSV in S3, might make any difference at all? A sketch of the variant I mean follows below.
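For context, this is the kind of change I mean; a minimal sketch with placeholder paths, writing the same result as a Delta table instead of CSV:

```python
# Current approach: CSV output to S3 (placeholder path)
result.write.mode("overwrite").csv("s3://my-bucket/output_csv/", header=True)

# Variant to test: the same result written as a Delta table
result.write.format("delta").mode("overwrite").save("s3://my-bucket/output_delta/")
```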

Kindly let me know.

Thanks. 

 

 

VANNGA
New Contributor II

Hi @Kaniz_Fatma 

Could you please let me know if there are any suggestions on the further findings/points I raised above?

Thanks
