06-06-2019 11:29 AM
https://databricks.com/blog/2017/07/12/benchmarking-big-data-sql-platforms-in-the-cloud.html
Hi All,
Just wondering why Databricks Spark is a lot faster on S3 than AWS EMR Spark when both systems are on Spark version 2.4. Does Databricks have another, optimized version of Spark that is not committed to open source Spark?
06-07-2019 09:54 PM
At its core, EMR just launches Spark applications, whereas Databricks is a higher-level platform that also includes multi-user support, an interactive UI, security, and job scheduling. Specifically, Databricks runs standard Spark applications inside a user’s AWS account, similar to EMR, but it adds a variety of features to create an end-to-end environment for working with Spark.
The idea is that many Spark deployments soon need to support multiple users, different types of jobs, etc., and we want to have these built in. But if you just want to connect to existing data and run jobs, that also works. Databricks adds several features, such as allowing multiple users to run commands on the same cluster and running multiple versions of Spark. Because Databricks is also the team that initially built Spark, the service is very up to date and tightly integrated with the newest Spark features; for example, you can run previews of the next release, and any data in Spark can be displayed visually.
06-09-2019 02:00 PM
06-10-2019 10:58 AM
@kali.tummala@gmail.com Databricks Runtime is very similar to open source Spark and completely API compatible. Any open source Spark (OSS) code you've written will run the same against the equivalent Databricks Runtime version. There are some features we've built that were requested by our customers and do not have open source equivalents. Since open source Spark is an Apache project, it is governed by the Apache rules of project governance, whereas Databricks Runtime is proprietary software that Databricks has 100% control over. Any correctness bugs identified will be immediately fixed in OSS.
06-10-2019 11:13 AM
@kali.tummala@gmail.com In general, though, the answers you want are too complicated to explain in a forum post. Do you have a point of contact at Databricks that you could set up time with? If not, you can reach out to me at fish@databricks.com
06-11-2019 06:59 PM
I think you can get some pretty good insight into the optimizations on Databricks here:
https://docs.databricks.com/delta/delta-on-databricks.html
Specifically, check out the sections on caching, z-ordering, and join optimization. There's also a great, detailed blog post here: Processing Petabytes of Data in Seconds with Databricks Delta
Hope this helps! @kali.tummala@gmail.com
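For anyone curious what z-ordering means in general, here is a rough sketch of the underlying idea: a Z-order (Morton) curve interleaves the bits of several column values into one sort key, so rows that are close in every column tend to end up near each other. This is a generic illustration only, not Databricks' actual implementation, which this thread does not describe. The function names `interleave_bits` and `z_order_sort` are made up for the example, and it assumes small non-negative integer values.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Morton (Z-order) key.

    Hypothetical helper for illustration; assumes non-negative integers.
    """
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return z


def z_order_sort(points):
    """Sort (x, y) pairs by their Morton key so nearby points cluster together."""
    return sorted(points, key=lambda p: interleave_bits(p[0], p[1]))


if __name__ == "__main__":
    rows = [(1, 1), (0, 0), (2, 0), (0, 1), (1, 0)]
    for x, y in z_order_sort(rows):
        print(x, y, interleave_bits(x, y))
```

If data files are laid out in this order, a filter on either column can skip files whose value ranges do not overlap the filter, which is the general reason a layout like this can speed up selective queries.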