Data security is a top priority and a critical requirement for any organization in today's digital world. Data storage security refers to the measures and protocols implemented to protect data held in storage systems from unauthorized access, breaches, and other security threats. In the context of Databricks, storage security involves several key components and practices.
In this blog, we explore one of those: how to secure data storage connectivity for both Classic and Serverless compute using Terraform in Azure. This network perimeter control is a coarse-grained security measure that adds an additional layer of protection. Depending on the organization's security needs, it can be implemented with either service endpoints or private endpoints.
It is important to consider both Classic and Serverless for two reasons:
The following two options are explained in this blog, with implementation code snippets using Terraform.
Please note that while data exfiltration protection is achievable for serverless compute through Secure Egress Control, its implementation details are beyond the scope of this document. Refer to the official documentation for instructions on implementing data exfiltration protection in serverless contexts.
Azure Databricks uses a control plane for backend services and a compute plane for data processing. The compute plane can be serverless (within your Databricks account) or classic (within your Azure subscription's network).
This architecture diagram outlines how to secure your data from unauthorized access by both classic and serverless compute using service endpoints, leveraging the following components:
Note: Please find the complete Terraform code in Databricks GitHub: adb-data-storage-vnet-ncc-public-endpoint
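The snippets below assume two provider configurations: the standard `azurerm` provider and a Databricks account-level provider aliased as `databricks.accounts`. As a hedged sketch (the variable `databricks_account_id` and the accounts host are assumptions, not values from this post), the setup might look like:

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    databricks = {
      source = "databricks/databricks"
    }
  }
}

provider "azurerm" {
  features {}
}

// Account-level provider used by the NCC resources below;
// the account ID variable is illustrative.
provider "databricks" {
  alias      = "accounts"
  host       = "https://accounts.azuredatabricks.net"
  account_id = var.databricks_account_id
}
```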
//vnet host subnet
resource "azurerm_subnet" "public" {
  name                 = "${var.name_prefix}-public"
  resource_group_name  = var.databricks_workspace_vnet_rg
  virtual_network_name = azurerm_virtual_network.this.name
  address_prefixes     = [var.public_subnets_cidr]

  delegation {
    name = "databricks"
    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action",
      ]
    }
  }

  service_endpoints = var.subnet_service_endpoints // e.g. ["Microsoft.Storage"]
}

//vnet private subnet
resource "azurerm_subnet" "private" {
  name                 = "${var.name_prefix}-private"
  resource_group_name  = var.rg_name
  virtual_network_name = azurerm_virtual_network.this.name
  address_prefixes     = [var.private_subnets_cidr]

  delegation {
    name = "databricks"
    service_delegation {
      name = "Microsoft.Databricks/workspaces"
      actions = [
        "Microsoft.Network/virtualNetworks/subnets/join/action",
        "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
        "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action",
      ]
    }
  }

  service_endpoints = var.subnet_service_endpoints // e.g. ["Microsoft.Storage"]
}
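For reference, the subnet resources above rely on a handful of input variables. The declarations below are illustrative only; the names and types are inferred from usage in the snippets, so they may differ from the repo's actual variables file:

```hcl
// Illustrative variable declarations inferred from the snippets above.
variable "name_prefix" {
  type = string
}
variable "public_subnets_cidr" {
  type        = string
  description = "CIDR for the host (public) subnet, e.g. 10.0.1.0/24"
}
variable "private_subnets_cidr" {
  type        = string
  description = "CIDR for the container (private) subnet, e.g. 10.0.2.0/24"
}
variable "subnet_service_endpoints" {
  type    = list(string)
  default = ["Microsoft.Storage"]
}
```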
//Retrieve the ID of the specified Azure Databricks workspace (created in Step 1)
data "azurerm_databricks_workspace" "this" {
  name                = var.databricks_workspace
  resource_group_name = var.databricks_workspace_rg
}

//Create the Network Connectivity Configuration (NCC)
resource "databricks_mws_network_connectivity_config" "ncc" {
  provider = databricks.accounts
  name     = var.workspace_ncc_name
  region   = var.azure_region
}

//Attach the NCC to the workspace
resource "databricks_mws_ncc_binding" "ncc_binding" {
  provider                       = databricks.accounts
  network_connectivity_config_id = databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
  workspace_id                   = data.azurerm_databricks_workspace.this.workspace_id
}

// Retrieve the public subnet of the Databricks workspace
data "azurerm_subnet" "ws_subnets" {
  name                 = var.databricks_workspace_public_subnet
  virtual_network_name = var.databricks_workspace_vnet
  resource_group_name  = var.databricks_workspace_vnet_rg
}

data "databricks_mws_network_connectivity_config" "ncc" {
  provider = databricks.accounts
  name     = var.workspace_ncc_name
}

locals {
  // Collect the NCC egress subnets whose service endpoint rules target Azure Blob Storage
  all_storage_subnets = [
    for conf in data.databricks_mws_network_connectivity_config.ncc.egress_config : [
      for rule in conf.default_rules : [
        for se_rule in rule.azure_service_endpoint_rule :
        se_rule.subnets if contains(se_rule.target_services, "AZURE_BLOB_STORAGE")
      ]
    ]
  ]
  uniq_storage_subnets = distinct(flatten(local.all_storage_subnets))
}
//Storage account for ADLS with hierarchical namespace enabled
resource "azurerm_storage_account" "this" {
  account_replication_type = "LRS"
  account_tier             = "Standard"
  location                 = var.azure_region
  name                     = var.data_storage_account
  resource_group_name      = azurerm_resource_group.this.name
  is_hns_enabled           = true
  tags                     = var.tags
  depends_on               = [azurerm_resource_group.this]
}

//Storage account network rules allowing only the workspace VNet public subnet and the NCC subnets
resource "azurerm_storage_account_network_rules" "this" {
  storage_account_id = azurerm_storage_account.this.id
  default_action     = "Deny"
  virtual_network_subnet_ids = concat(
    [data.azurerm_subnet.ws_subnets.id], // for classic VNet
    local.uniq_storage_subnets           // for serverless
  )
}
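Before testing from a notebook, it can help to print the effective allow-list. The following `output` block is not part of the original repo; it is added here purely for illustration:

```hcl
// Illustrative output: the full set of subnet IDs allowed
// through the storage account firewall.
output "allowed_subnet_ids" {
  value = concat(
    [data.azurerm_subnet.ws_subnets.id], // classic VNet public subnet
    local.uniq_storage_subnets           // serverless NCC subnets
  )
}
```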
Connectivity is now locked down: both the classic VNet and serverless compute can reach the storage account only from their associated subnets. You can verify this by accessing the data from the associated workspace and confirming that non-associated workspaces are denied.
This architecture diagram outlines how to secure your data from unauthorized access by both classic and serverless compute using private endpoints, leveraging the following components:
Note: Please find the complete Terraform code in Databricks GitHub: adb-data-storage-vnet-ncc-private-endpoint
//Storage account for ADLS with hierarchical namespace enabled
resource "azurerm_storage_account" "this" {
  account_replication_type = "LRS"
  account_tier             = "Standard"
  location                 = var.azure_region
  name                     = var.data_storage_account
  resource_group_name      = var.data_storage_account_rg
  is_hns_enabled           = true
  tags                     = var.tags
  depends_on               = [azurerm_resource_group.this]
}

# [Optional] Configure network rules for the storage account
# [Recommendation] Terraform should also reach the account via Private Link only
resource "azurerm_storage_account_network_rules" "this" {
  storage_account_id = azurerm_storage_account.this.id
  default_action     = "Deny"
  ip_rules           = var.storage_account_allowed_ips // e.g. the Terraform environment's IP
}

data "azurerm_virtual_network" "ws_vnet" {
  name                = var.databricks_workspace_vnet
  resource_group_name = var.databricks_workspace_vnet_rg
}

//vnet private link subnet
resource "azurerm_subnet" "plsubnet" {
  name                 = "${var.name_prefix}-privatelink"
  resource_group_name  = var.databricks_workspace_vnet_rg
  virtual_network_name = var.databricks_workspace_vnet
  address_prefixes     = [var.pl_subnets_cidr]
}
//Private DNS zone for the data storage account (dfs)
resource "azurerm_private_dns_zone" "dfs" {
  name                = "privatelink.dfs.core.windows.net"
  resource_group_name = var.databricks_workspace_vnet_rg
  tags                = var.tags
}

// Link the private DNS zone to the VNet
resource "azurerm_private_dns_zone_virtual_network_link" "dfsdnszonevnetlink" {
  name                  = "dfsvnetconnection"
  resource_group_name   = var.databricks_workspace_vnet_rg
  private_dns_zone_name = azurerm_private_dns_zone.dfs.name
  virtual_network_id    = data.azurerm_virtual_network.ws_vnet.id // Connect to the spoke VNet
  tags                  = var.tags
}
// Create a private endpoint for the workspace to access the data storage (catalog external location)
resource "azurerm_private_endpoint" "data_dfs" {
  name                = "datapvtendpoint"
  location            = var.azure_region
  resource_group_name = var.databricks_workspace_vnet_rg
  subnet_id           = azurerm_subnet.plsubnet.id
  tags                = var.tags

  private_service_connection {
    name                           = "ple-${var.name_prefix}-data"
    private_connection_resource_id = azurerm_storage_account.this.id
    is_manual_connection           = false
    subresource_names              = ["dfs"]
  }

  private_dns_zone_group {
    name                 = "private-dns-zone-data-dfs"
    private_dns_zone_ids = [azurerm_private_dns_zone.dfs.id]
  }
}
//Private DNS zone for the data storage account (blob)
resource "azurerm_private_dns_zone" "blob" {
  name                = "privatelink.blob.core.windows.net"
  resource_group_name = var.databricks_workspace_vnet_rg
  tags                = var.tags
}

// Link the private DNS zone to the VNet
resource "azurerm_private_dns_zone_virtual_network_link" "blobdnszonevnetlink" {
  name                  = "blobvnetconnection"
  resource_group_name   = var.databricks_workspace_vnet_rg
  private_dns_zone_name = azurerm_private_dns_zone.blob.name
  virtual_network_id    = data.azurerm_virtual_network.ws_vnet.id // Connect to the spoke VNet
  tags                  = var.tags
}
// Create a private endpoint for the workspace to access the data storage (catalog external location)
resource "azurerm_private_endpoint" "data_blob" {
  name                = "datapvtendpointblob"
  location            = var.azure_region
  resource_group_name = var.databricks_workspace_vnet_rg
  subnet_id           = azurerm_subnet.plsubnet.id
  tags                = var.tags

  private_service_connection {
    name                           = "ple-${var.name_prefix}-data-blob"
    private_connection_resource_id = azurerm_storage_account.this.id
    is_manual_connection           = false
    subresource_names              = ["blob"]
  }

  private_dns_zone_group {
    name                 = "private-dns-zone-data-blob"
    private_dns_zone_ids = [azurerm_private_dns_zone.blob.id]
  }
}
//Private Link for the workspace root (DBFS) storage
data "azurerm_storage_account" "dbfs_storage_account" {
  name                = var.dbfs_storage_account
  resource_group_name = var.dbfs_storage_account_rg
}

// Create a private endpoint for the workspace to access the root (DBFS) storage (dfs)
resource "azurerm_private_endpoint" "dbfs_dfs" {
  name                = "dbfspvtendpoint"
  location            = var.azure_region
  resource_group_name = var.databricks_workspace_vnet_rg
  subnet_id           = azurerm_subnet.plsubnet.id
  tags                = var.tags

  private_service_connection {
    name                           = "ple-${var.name_prefix}-dbfs"
    private_connection_resource_id = data.azurerm_storage_account.dbfs_storage_account.id
    is_manual_connection           = false
    subresource_names              = ["dfs"]
  }

  private_dns_zone_group {
    name                 = "private-dns-zone-dbfs-dfs"
    private_dns_zone_ids = [azurerm_private_dns_zone.dfs.id]
  }
}
// Create a private endpoint for the workspace to access the root (DBFS) storage (blob)
resource "azurerm_private_endpoint" "dbfs_blob" {
  name                = "dbfspvtendpointblob"
  location            = var.azure_region
  resource_group_name = var.databricks_workspace_vnet_rg
  subnet_id           = azurerm_subnet.plsubnet.id
  tags                = var.tags

  private_service_connection {
    name                           = "ple-${var.name_prefix}-dbfs-blob"
    private_connection_resource_id = data.azurerm_storage_account.dbfs_storage_account.id
    is_manual_connection           = false
    subresource_names              = ["blob"]
  }

  private_dns_zone_group {
    name                 = "private-dns-zone-dbfs-blob"
    private_dns_zone_ids = [azurerm_private_dns_zone.blob.id]
  }
}
//Get the workspace ID
data "azurerm_databricks_workspace" "this" {
  name                = var.databricks_workspace
  resource_group_name = var.databricks_workspace_rg
}

//Create the Network Connectivity Configuration (NCC)
resource "databricks_mws_network_connectivity_config" "ncc" {
  provider = databricks.accounts
  name     = var.workspace_ncc_name
  region   = var.azure_region
}

//Attach the NCC to the workspace
resource "databricks_mws_ncc_binding" "ncc_binding" {
  provider                       = databricks.accounts
  network_connectivity_config_id = databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
  workspace_id                   = data.azurerm_databricks_workspace.this.workspace_id
}

data "databricks_mws_network_connectivity_config" "ncc" {
  provider = databricks.accounts
  name     = var.workspace_ncc_name
}
//Add NCC private endpoint rule (dfs)
resource "databricks_mws_ncc_private_endpoint_rule" "storage_dfs" {
  provider                       = databricks.accounts
  network_connectivity_config_id = data.databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
  resource_id                    = azurerm_storage_account.this.id
  group_id                       = "dfs"
}

//Add NCC private endpoint rule (blob)
resource "databricks_mws_ncc_private_endpoint_rule" "storage_blob" {
  provider                       = databricks.accounts
  network_connectivity_config_id = data.databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
  resource_id                    = azurerm_storage_account.this.id
  group_id                       = "blob"
}
// Add a private endpoint rule for the NCC to access the DBFS storage account (dfs)
resource "databricks_mws_ncc_private_endpoint_rule" "dbfs_dfs" {
  provider                       = databricks.accounts
  network_connectivity_config_id = data.databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
  resource_id                    = data.azurerm_storage_account.dbfs_storage_account.id
  group_id                       = "dfs"
}

// Add a private endpoint rule for the NCC to access the DBFS storage account (blob)
resource "databricks_mws_ncc_private_endpoint_rule" "dbfs_blob" {
  provider                       = databricks.accounts
  network_connectivity_config_id = data.databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
  resource_id                    = data.azurerm_storage_account.dbfs_storage_account.id
  group_id                       = "blob"
}
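The approval steps that follow use the `azapi` provider, which issues raw Azure Resource Manager API calls. If your configuration does not already declare it, a minimal sketch of the provider setup looks like this (version constraints are omitted; pick one that matches your repo):

```hcl
terraform {
  required_providers {
    azapi = {
      source = "Azure/azapi"
    }
  }
}

// The azapi provider picks up Azure credentials from the
// environment, the same way the azurerm provider does.
provider "azapi" {
}
```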
// Retrieve the list of private endpoint connections for the storage account
data "azapi_resource_list" "list_storage_private_endpoint_connection" {
  type                   = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
  parent_id              = "/subscriptions/${var.azure_subscription_id}/resourceGroups/${var.data_storage_account_rg}/providers/Microsoft.Storage/storageAccounts/${var.data_storage_account}"
  response_export_values = ["*"]
  depends_on             = [databricks_mws_ncc_private_endpoint_rule.storage_dfs, databricks_mws_ncc_private_endpoint_rule.storage_blob]
}

// Approve the private endpoint connection for the storage account (dfs)
resource "azapi_update_resource" "approve_storage_private_endpoint_connection_dfs" {
  type = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
  name = [
    for i in data.azapi_resource_list.list_storage_private_endpoint_connection.output.value :
    i.name if endswith(i.properties.privateEndpoint.id, databricks_mws_ncc_private_endpoint_rule.storage_dfs.endpoint_name)
  ][0]
  parent_id = azurerm_storage_account.this.id
  body = {
    properties = {
      privateLinkServiceConnectionState = {
        description = "Auto Approved via Terraform"
        status      = "Approved"
      }
    }
  }
}

// Approve the private endpoint connection for the storage account (blob)
resource "azapi_update_resource" "approve_storage_private_endpoint_connection_blob" {
  type = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
  name = [
    for i in data.azapi_resource_list.list_storage_private_endpoint_connection.output.value :
    i.name if endswith(i.properties.privateEndpoint.id, databricks_mws_ncc_private_endpoint_rule.storage_blob.endpoint_name)
  ][0]
  parent_id = azurerm_storage_account.this.id
  body = {
    properties = {
      privateLinkServiceConnectionState = {
        description = "Auto Approved via Terraform"
        status      = "Approved"
      }
    }
  }
}
// Retrieve the list of private endpoint connections for the DBFS storage account
data "azapi_resource_list" "list_storage_private_endpoint_connection_dbfs" {
  type                   = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
  parent_id              = "/subscriptions/${var.azure_subscription_id}/resourceGroups/${var.dbfs_storage_account_rg}/providers/Microsoft.Storage/storageAccounts/${var.dbfs_storage_account}"
  response_export_values = ["*"]
  depends_on             = [databricks_mws_ncc_private_endpoint_rule.dbfs_dfs, databricks_mws_ncc_private_endpoint_rule.dbfs_blob]
}
// Approve the private endpoint connection for the DBFS storage account (dfs)
resource "azapi_update_resource" "approve_storage_private_endpoint_connection_dbfs_dfs" {
  type = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
  name = [
    for i in data.azapi_resource_list.list_storage_private_endpoint_connection_dbfs.output.value :
    i.name if endswith(i.properties.privateEndpoint.id, databricks_mws_ncc_private_endpoint_rule.dbfs_dfs.endpoint_name)
  ][0]
  parent_id = data.azurerm_storage_account.dbfs_storage_account.id
  body = {
    properties = {
      privateLinkServiceConnectionState = {
        description = "Auto Approved via Terraform"
        status      = "Approved"
      }
    }
  }
}

// Approve the private endpoint connection for the DBFS storage account (blob)
resource "azapi_update_resource" "approve_storage_private_endpoint_connection_dbfs_blob" {
  type = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
  name = [
    for i in data.azapi_resource_list.list_storage_private_endpoint_connection_dbfs.output.value :
    i.name if endswith(i.properties.privateEndpoint.id, databricks_mws_ncc_private_endpoint_rule.dbfs_blob.endpoint_name)
  ][0]
  parent_id = data.azurerm_storage_account.dbfs_storage_account.id
  body = {
    properties = {
      privateLinkServiceConnectionState = {
        description = "Auto Approved via Terraform"
        status      = "Approved"
      }
    }
  }
}
Connectivity is now secured for both the classic VNet and serverless compute to the storage account using private endpoints. You can verify this by accessing the data from the associated workspace and confirming that non-associated workspaces are denied.
In this blog, we explored how to secure data storage connectivity for both Classic and Serverless compute using Terraform in Azure, and why each matters. We also walked through the two available connectivity options, service endpoints and private endpoints, with ready-to-use Terraform code snippets.