
Why is Databricks Spark faster than AWS EMR Spark?

kali_tummala
New Contributor II

https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html

Hi All,

Just wondering why Databricks Spark is a lot faster on S3 than AWS EMR Spark when both systems are on Spark version 2.4. Does Databricks have another, optimized version of Spark that has not been committed to open source Spark?

5 REPLIES

Chandan
New Contributor II

At its core, EMR just launches Spark applications, whereas Databricks is a higher-level platform that also includes multi-user support, an interactive UI, security, and job scheduling. Specifically, Databricks runs standard Spark applications inside a user’s AWS account, similar to EMR, but it adds a variety of features to create an end-to-end environment for working with Spark. These include:

  • Interactive UI (includes a workspace with notebooks, dashboards, a job scheduler, point-and-click cluster management)
  • Cluster sharing (multiple users can connect to the same cluster, saving cost)
  • Security features (access controls to the whole workspace, clusters)
  • Collaboration (multi-user access to the same notebook, revision control, and IDE and GitHub integration)
  • Data management (support for connecting different data sources to Spark, caching service to speed up queries)

The idea is that a lot of Spark deployments soon need to bring in multiple users, different types of jobs, etc., and we want to have these built in. But if you just want to connect to existing data and run jobs, that also works. Databricks adds several features, such as allowing multiple users to run commands on the same cluster and running multiple versions of Spark. Because Databricks is also the team that initially built Spark, the service is very up to date and tightly integrated with the newest Spark features -- e.g. you can run previews of the next release, any data in Spark can be displayed visually, etc.

  1. Nope, that's not the answer I want. The features you listed don't help with Spark performance (speed). If I do a code diff between open source Spark 2.4 and the latest Databricks Spark version, will I see differences? If so, why hasn't the Databricks version been open sourced yet? Does Databricks not want to open source its changes over time? Do they want Spark outside Databricks to be slower?

@kali.tummala@gmail.com Databricks Runtime is very similar to open source Spark and completely API compatible. Any open source Spark (OSS) code you've written will run the same against the equivalent Databricks Runtime version. There are some features we've built which were requested by our customers and do not have open source equivalents. Since open source Spark is an Apache project, it is governed by the Apache rules of project governance, whereas Databricks Runtime is proprietary software that Databricks has 100% control over. Any correctness bugs identified will be immediately fixed in OSS.

@kali.tummala@gmail.com In general, though, the answers you want are too complicated to explain in a forum post. Do you have a point of contact at Databricks that you could set up time with? If not, you can reach out to me at fish@databricks.com

RafiKurlansik
New Contributor III

I think you can get some pretty good insight into the optimizations on Databricks here:

https://docs.databricks.com/delta/delta-on-databricks.html

Specifically, check out the sections on caching, z-ordering, and join optimization. There's also a great, detailed blog post here: Processing Petabytes of Data in Seconds with Databricks Delta
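
To give a rough feel for why z-ordering can help, here is a toy Python sketch. It is only an illustration of the general idea (interleaving column bits into a Z-order key, then skipping files using per-file min/max statistics), not Databricks' actual implementation. The helper names `z_value`, `split_into_files`, and `files_to_scan` are made up for this example, and the "files" are just in-memory chunks of rows.

```python
# Toy illustration of Z-ordering plus min/max data skipping.
# Not Databricks' implementation; all names here are hypothetical.

def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # y bits go to odd positions
    return z


def split_into_files(rows, rows_per_file):
    """Chunk rows into 'files' and record per-file min/max stats for each column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        files.append({
            "rows": chunk,
            "x_min": min(r[0] for r in chunk), "x_max": max(r[0] for r in chunk),
            "y_min": min(r[1] for r in chunk), "y_max": max(r[1] for r in chunk),
        })
    return files


def files_to_scan(files, column, value):
    """Count files whose min/max range for `column` ('x' or 'y') could contain value."""
    return sum(1 for f in files
               if f[f"{column}_min"] <= value <= f[f"{column}_max"])


# A 16 x 16 grid of (x, y) rows, 16 rows per file, so 16 files in each layout.
rows = [(x, y) for x in range(16) for y in range(16)]
sorted_by_x = split_into_files(sorted(rows), 16)
z_ordered = split_into_files(sorted(rows, key=lambda r: z_value(*r)), 16)

for name, layout in [("sorted by x", sorted_by_x), ("z-ordered", z_ordered)]:
    print(name,
          "| files scanned for x == 5:", files_to_scan(layout, "x", 5),
          "| for y == 5:", files_to_scan(layout, "y", 5))
```

In this toy grid, the layout sorted by x only needs 1 file for a filter on x but all 16 for a filter on y, while the z-ordered layout needs 4 files for either filter. That trade-off (somewhat worse on one column, much better on the others) is the intuition behind clustering on more than one column.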

Hope this helps! @kali.tummala@gmail.com