Databricks Community

vladmarin · ‎06-17-2024

Traditional warehouse administrators face several challenges: streamlining operations and security, improving efficiency in high concurrency and low latency environments, reducing costs and overhead associated with cluster setup and under-utilization, maintaining systems, and minimizing dependency on cloud providers.

Databricks Serverless SQL addresses these issues and brings many benefits, including enhanced productivity, efficiency, and simplicity in data analytics operations. Even though setting up Serverless SQL on Databricks does not pose any significant technical challenges, a common question we hear is, “What are the security aspects of Serverless SQL?”

In this blog, we will walk you through the possible setup scenarios you can use when enabling Serverless warehouses in your account and the different security considerations. Before enabling Serverless, remember the prerequisites for your cloud (AWS / Azure). Databricks Serverless SQL is available in Azure and AWS in documented regions.

Table of Contents

Benefits of Serverless SQL
High-Level Security Considerations
Networking Configuration
Step-by-step methodology
Step 1 - Validate that serverless is available in your cloud region
Step 2 - Assess cloud storage networking configuration
Step 3 - Assess metastore networking configuration
Network Connectivity Configurations (NCCs)
Conclusion

Benefits of Serverless SQL

The serverless capability provides instantaneous and elastic compute resources, improving the customer experience as the infrastructure is available precisely when needed. This optimal performance is achieved through machine learning algorithms that provision and scale compute resources based on usage, eliminating the need for manual cluster management and shutdowns. The automatic up- and down-scaling of compute also minimizes costs associated with unnecessary idle time.

Moreover, it does not require network address allocation. This shift eliminates the burden of capacity management, patching, upgrading, and performance optimization of clusters, allowing users to focus solely on their data and insights without worrying about infrastructure management.

The simplified pricing model ensures a single bill to track costs efficiently. Furthermore, the platform continuously enhances performance and reduces costs through predictive I/O optimizations and persistent results caching features, with remote result cache for all analytics use cases, making it the simplest way to securely utilize the Databricks Data Intelligence Platform.

Overall Databricks architecture.

High-Level Security Considerations

Security is an essential part of Databricks Serverless SQL, and a good understanding of the concepts below will help during your serverless setup. The security model of Serverless SQL covers three areas: workload isolation (per cluster, workspace, and customer), secure network access to the data, and hardening of infrastructure.

Unity Catalog in Databricks ensures data security through centralized and granular access control over data assets, and data isolation. It also maintains secure data permissions and provides auditing and lineage capabilities. These measures collectively ensure that users can only access and query data they are entitled to in compliance with industry standards.

Below is an overview of the main features provided in our Serverless SQL architecture.

Serverless Isolation Principles: Workloads (clusters) are securely segregated with no inter-workload communication, preventing lateral movement. Dedicated compute resources are allocated where each node exclusively handles compute tasks. Upon completion, workload nodes and associated resources are cleared. Unnecessary inbound access is restricted, allowing only authenticated requests from authorized users through the control plane.
Multiple Isolation Layer Protection: Workloads operate within a container with limited privileges. All local and attached disks are exclusively allocated to a customer and erased once utilized. These disks are temporary and encrypted while at rest. The compute resources are exclusively allocated to a specific customer and are wiped out post-usage. Additionally, there are no privileges or credentials for other systems. Every workload functions within its own private network without any public IP addresses. This network is logically separated from other workloads, preventing any lateral movement or communication between workloads. All traffic, whether from the user, the control plane, the compute plane, or cloud services, is directed through the cloud provider's global network, not the public internet.
Databricks hardening and vulnerability policies: Databricks' hardening policies involve operating each workload within a private network, with strict access controls and ephemeral hosts. The vulnerability policies include a robust patch management system, monitoring of vulnerability notifications, and timely remediation based on risk and impact.
Access to Data: Access is granted through short-lived tokens (1 hour). All data transmission occurs over the network provided by the cloud service, not the public internet. Furthermore, all traffic is encrypted using TLS 1.2 or higher.
Encryption: Encryption is applied to all traffic between the user, the control plane, the compute plane, and cloud APIs, as well as to storage in the control plane and all attached disks.
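To make the "TLS 1.2 or higher" and "short-lived token" properties concrete, here is a minimal client-side sketch in Python. It is an illustration only and not how Databricks implements these controls internally; the function names and the TOKEN_LIFETIME constant are hypothetical, and the one-hour value is simply taken from the text above.

```python
import ssl
from datetime import datetime, timedelta, timezone

# Assumption: one-hour lifetime, as stated in the list above.
TOKEN_LIFETIME = timedelta(hours=1)


def make_tls_context() -> ssl.SSLContext:
    """Return a client TLS context that refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()  # certificate and hostname checks on
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def token_is_valid(issued_at: datetime, now: datetime) -> bool:
    """A token is usable from its issue time until the lifetime has elapsed."""
    return issued_at <= now < issued_at + TOKEN_LIFETIME
```

A client that builds its connections from such a context fails the handshake against an endpoint offering only older protocol versions, and a token checked this way stops being accepted exactly one hour after it was issued.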

For more details, refer to serverless security and serverless computing (AWS/Azure).

Networking Configuration

This section will help you navigate the different decision points and methodologies for enabling Serverless connectivity for your Databricks SQL Warehouses. Depending on your setup and the company’s network security requirements, you might need to reconfigure several connectivity configurations, including your storage (i.e., S3 or ADLS).

Before diving into the specifics, here are the high-level steps you will need to take for the upgrade:

Assess the current setup of your Databricks deployment(s) following the decision flow diagram provided in this document.
If applicable, create the necessary resources in the cloud and your Databricks account, following the steps in the methodology below.
Validate that you can now create a Serverless SQL Warehouse. Create a new SQL warehouse in your Databricks workspace.
Test connectivity on a sample workload.
Congratulations! You can now benefit from the full power of Serverless SQL compute.
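The validation step above can also be scripted. The sketch below only builds the HTTP request for creating a serverless SQL warehouse and does not send it. It assumes the SQL Warehouses REST endpoint (/api/2.0/sql/warehouses) and the field names enable_serverless_compute and warehouse_type; please verify these against the current API reference for your cloud. The host, token, and warehouse name are placeholders.

```python
import json
import urllib.request


def build_create_warehouse_request(
    host: str, token: str, name: str, cluster_size: str = "2X-Small"
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a serverless SQL warehouse."""
    payload = {
        "name": name,
        "cluster_size": cluster_size,
        "warehouse_type": "PRO",            # assumption: serverless requires PRO
        "enable_serverless_compute": True,  # assumption: flag name per the REST API
        "auto_stop_mins": 10,
        "max_num_clusters": 1,
    }
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.0/sql/warehouses",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; replace with your workspace URL and a real token.
request = build_create_warehouse_request(
    "https://example-workspace.cloud.databricks.com/",
    "<personal-access-token>",
    "serverless-validation",
)
```

Passing the request to urllib.request.urlopen would submit it. If the call succeeds and the warehouse starts, the networking prerequisites for the workspace are likely in place, and you can move on to testing a sample workload.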

Step-by-step methodology

Step 1 - Validate that serverless is available in your cloud region

For AWS, check here.

For Azure, check here.

Step 2 - Assess cloud storage networking configuration

For Databricks SQL Serverless to work, all cloud storage objects that Databricks communicates with will need to be configured to allow for Serverless compute.

Azure - Azure Data Lake Storage (ADLS)

Option 1 - Public connectivity for your Storage Account

If your storage account is enabled for public network access (i.e., the <<Enabled from all networks>> option is selected under Networking > Public network access), there are no configuration changes needed, as Serverless SQL will work out of the box.

Option 2 - Public connectivity with Service Endpoints

If your storage account is behind a Firewall (i.e., the <<Enabled from selected virtual networks and IP addresses>> option is selected under Networking > Public network access), you will need to configure your Azure Storage Firewall according to the public documentation.

Option 3 - Private Connectivity

If your storage account is private (i.e., the <<Disabled>> option is selected under Networking > Public network access), you will need to perform the steps described in the public documentation for Serverless Private Link*.

*At the time of writing this blog, this feature is still in Gated Public Preview, which requires you to contact your Databricks representative to have you enrolled in the program.
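The three Azure options above amount to a small decision table keyed on the storage account's public network access setting. The following sketch encodes it in Python; the enum and function names are hypothetical and exist only to illustrate the decision, not to call any Azure or Databricks API.

```python
from enum import Enum


class PublicNetworkAccess(Enum):
    """Values of Networking > Public network access on the storage account."""
    ALL_NETWORKS = "Enabled from all networks"
    SELECTED_NETWORKS = "Enabled from selected virtual networks and IP addresses"
    DISABLED = "Disabled"


def required_storage_action(setting: PublicNetworkAccess) -> str:
    """Map the storage account setting to the action described in Step 2."""
    if setting is PublicNetworkAccess.ALL_NETWORKS:
        return "Option 1: no configuration changes needed"
    if setting is PublicNetworkAccess.SELECTED_NETWORKS:
        return "Option 2: configure the Azure Storage Firewall for serverless"
    # Remaining case: public network access is disabled.
    return "Option 3: set up Serverless Private Link (gated preview at time of writing)"
```

Running this check for every storage account a workspace uses gives a quick inventory of which accounts need changes before Serverless SQL is enabled.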

AWS - S3

Public connectivity with Gateway Endpoints

If you are using Gateway Endpoints for your AWS setup, you will need to perform the steps described in the public documentation for Gateway Endpoints.

Other options, such as private connectivity (i.e., Private Link), are not available at the time of writing this blog post.

Step 3 - Assess metastore networking configuration

Just like the cloud storage objects, the metastore service might need to be reconfigured to allow for serverless connectivity, depending on which metastore you use.

For the default Hive Metastore and for Unity Catalog, no additional configuration changes are needed, as Serverless will work out of the box. The storage configuration mentioned in Step 2 covers the configuration required for them.
If you are using AWS Glue as an external Hive Metastore (with public connectivity), Serverless will also work out of the box, and no configuration changes will be needed.
If you use AWS RDS or Azure SQL DB as an external Hive Metastore, please contact your Databricks representative for more information on supported options.
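The metastore cases above can be summarized as a simple lookup. This is a sketch for assessment scripts only; the metastore keys are hypothetical labels chosen for this example, and the outcomes restate the guidance in this step.

```python
# Hypothetical labels for the metastore types discussed in Step 3.
METASTORE_ACTIONS = {
    "unity_catalog": "No changes needed (storage is covered by Step 2)",
    "default_hive_metastore": "No changes needed (storage is covered by Step 2)",
    "aws_glue_public": "No changes needed",
    "aws_rds_external_hive": "Contact your Databricks representative",
    "azure_sql_db_external_hive": "Contact your Databricks representative",
}


def metastore_action(metastore_type: str) -> str:
    """Return the Step 3 guidance for a metastore type, or flag it as unknown."""
    return METASTORE_ACTIONS.get(
        metastore_type, "Unknown metastore type: review manually"
    )
```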

Network Connectivity Configurations (NCCs)


If you choose one of the private Serverless connectivity options in Databricks, you must create NCC entities in your account. Network Connectivity Configurations (NCCs) are account-level objects used to manage serverless network connectivity.

Account administrators must create them in the account console, and they can be attached to one or more workspaces. Below are a few considerations when choosing how to configure your NCCs:

You can create up to 10 NCCs per region in each account.
Each NCC can be attached to up to 50 workspaces.
Each region can have 100 private endpoints, distributed as needed across 1-10 NCCs.
We recommend that you plan how to structure your NCCs before creating them. One common approach is to create one NCC per environment (DEV, Staging, PROD) or per business unit.
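When planning the structure, it can help to check a proposed layout against the limits listed above before creating anything. The sketch below does this in plain Python. The NccPlan class and validate_plan function are hypothetical helpers, not part of any Databricks API, and the limits are the ones quoted in this post, which may change over time.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Limits as quoted in this post; verify against current documentation.
MAX_NCCS_PER_REGION = 10
MAX_WORKSPACES_PER_NCC = 50
MAX_PRIVATE_ENDPOINTS_PER_REGION = 100


@dataclass
class NccPlan:
    """A planned NCC: where it lives, what it serves, how many endpoints it holds."""
    name: str
    region: str
    workspaces: List[str] = field(default_factory=list)
    private_endpoints: int = 0


def validate_plan(plans: List[NccPlan]) -> List[str]:
    """Return a list of limit violations; an empty list means the plan fits."""
    problems: List[str] = []
    by_region: Dict[str, List[NccPlan]] = {}
    for plan in plans:
        by_region.setdefault(plan.region, []).append(plan)
        if len(plan.workspaces) > MAX_WORKSPACES_PER_NCC:
            problems.append(f"{plan.name}: more than {MAX_WORKSPACES_PER_NCC} workspaces")
    for region, group in by_region.items():
        if len(group) > MAX_NCCS_PER_REGION:
            problems.append(f"{region}: more than {MAX_NCCS_PER_REGION} NCCs")
        if sum(p.private_endpoints for p in group) > MAX_PRIVATE_ENDPOINTS_PER_REGION:
            problems.append(
                f"{region}: more than {MAX_PRIVATE_ENDPOINTS_PER_REGION} private endpoints"
            )
    return problems


# Example: one NCC per environment in a single region.
example_plan = [
    NccPlan("ncc-dev", "westeurope", ["ws-dev"], private_endpoints=5),
    NccPlan("ncc-staging", "westeurope", ["ws-stg"], private_endpoints=5),
    NccPlan("ncc-prod", "westeurope", ["ws-prod-1", "ws-prod-2"], private_endpoints=20),
]
```

For the example plan, validate_plan(example_plan) returns an empty list, since three NCCs and 30 private endpoints in one region stay within the stated limits.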

Conclusion

Unlock the full potential of your data analytics with Databricks Serverless SQL. You will get enhanced productivity, efficiency, and simplicity as you focus on insights, not infrastructure. With automatic scaling and fully managed capabilities, Serverless SQL will empower you to harness the power of the Databricks Data Intelligence Platform securely and cost-effectively. We invite you to try Serverless SQL today and get started on AWS or Azure.

If you need guidance or have questions, our expert team at Databricks is ready to assist you. Join our community to share your experiences, learn from others, and discover new possibilities. Start your journey now, and let us help you uncover the true value of your data.