At its core, EMR just launches Spark applications, whereas Databricks is a higher-level platform that also includes multi-user support, an interactive UI, security, and job scheduling. Specifically, Databricks runs standard Spark applications inside a user’s AWS account, similar to EMR, but it adds a variety of features to create an end-to-end environment for working with Spark. These include:
- Interactive UI (includes a workspace with notebooks, dashboards, a job scheduler, and point-and-click cluster management)
- Cluster sharing (multiple users can connect to the same cluster, saving cost)
- Security features (access controls for the whole workspace and for clusters)
- Collaboration (multi-user access to the same notebook, revision control, and IDE and GitHub integration)
- Data management (support for connecting different data sources to Spark, caching service to speed up queries)
The idea is that many Spark deployments soon need to support multiple users, different types of jobs, and so on, and we want that support built in. But if you just want to connect to existing data and run jobs, that works too. Beyond letting multiple users run commands on the same cluster, Databricks also lets you run multiple versions of Spark. Because the Databricks team also originally built Spark, the service stays up to date and tightly integrated with the newest Spark features: for example, you can run previews of the next release, and any data in Spark can be displayed visually.