DineshBabuK
Databricks Employee

Data security is a top priority and a critical requirement for any organization in today's digital world. Data storage security refers to the measures and protocols implemented to protect data stored in various storage systems from unauthorized access, breaches, and other security threats. In the context of Databricks, storage security involves several key components and practices.

In this blog, we explore one of those components: how to secure data storage connectivity to both Classic and Serverless compute using Terraform in Azure. This network perimeter control is a coarse-grained security measure that adds an additional layer of protection. Depending on the organization's security needs, it can be implemented with either service endpoints or private endpoints.

Why is it important to consider both Classic and Serverless?

It is important to consider both Classic and Serverless for two reasons:

  1. Users who want to take advantage of Serverless for their new workloads.
  2. Seamless migration of existing workloads from Classic compute to Serverless with confidence and control.

What are the options to secure connectivity between Storage and Compute in Azure?

This blog explains the following two options, with Terraform implementation snippets for each.

  1. A service endpoint provides secure and direct connectivity to Azure services such as Azure Storage via an optimized route over the Azure backbone network. This is a secure approach with no additional costs.
  2. A private endpoint provides a network interface that connects privately and securely to a service such as Azure Storage, powered by Azure Private Link. This is a more secure approach with additional costs.

Please be advised that while data exfiltration protection is achievable for serverless compute environments through Secure Egress Control, the specifics of its implementation are beyond the scope of this document. Refer to the official documentation for instructions on implementing data exfiltration protection in serverless contexts.
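
The snippets throughout this blog assume two provider configurations: the azurerm provider for Azure resources and an account-level Databricks provider aliased as databricks.accounts, which the NCC resources reference. Below is a minimal sketch of that setup, assuming the Databricks account ID is supplied as a variable and authentication (for example, via the Azure CLI) is configured separately:

    // Minimal provider setup assumed by the snippets below; sources are illustrative.
    terraform {
      required_providers {
        azurerm    = { source = "hashicorp/azurerm" }
        databricks = { source = "databricks/databricks" }
        azapi      = { source = "azure/azapi" } // used for endpoint approval in Option 2
      }
    }

    provider "azurerm" {
      features {}
    }

    // Account-level provider used by the NCC resources (databricks.accounts)
    provider "databricks" {
      alias      = "accounts"
      host       = "https://accounts.azuredatabricks.net"
      account_id = var.databricks_account_id // assumed variable
    }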

Option 1 - Steps to secure connectivity using Service endpoints:

High-level architecture

Azure Databricks uses a control plane for backend services and a compute plane for data processing. The compute plane can be serverless (within your Databricks account) or classic (within your Azure subscription's network).

This architecture diagram outlines how to secure your data from unauthorized access by both classic and serverless compute using service endpoints, leveraging the following components:

  • Classic Compute: Workspaces are deployed with VNet injection, and the public (host) subnet has a storage service endpoint attached. You then configure the storage account firewall to permit connections exclusively from the workspace's public subnet within the virtual network.
  • Serverless Compute: Compute operates on Databricks-hosted virtual networks and subnets, employing multiple security layers to isolate different Azure Databricks customer workspaces and enforce network controls between clusters of the same customer. Therefore, it is crucial to identify these subnets and configure the data storage account firewall to only allow connections from them. This can be achieved within Databricks through network connectivity configuration.

     

[Architecture diagram: service endpoint connectivity from classic and serverless compute to the storage account]

Step-by-Step Implementation using Terraform:

Note: Please find the complete Terraform code in the Databricks GitHub repository: adb-data-storage-vnet-ncc-public-endpoint

  1. Deploy a Databricks workspace using VNET injection, with a service endpoint to the data storage account on the host subnet, as follows.
    //vnet host subnet
    resource "azurerm_subnet" "public" {
      name                 = "${var.name_prefix}-public"
      resource_group_name  = var.databricks_workspace_vnet_rg
      virtual_network_name = azurerm_virtual_network.this.name
      address_prefixes     = [var.public_subnets_cidr]
    
      delegation {
        name = "databricks"
        service_delegation {
          name = "Microsoft.Databricks/workspaces"
          actions = [
            "Microsoft.Network/virtualNetworks/subnets/join/action",
            "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
            "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"]
        }
      }
      service_endpoints = var.subnet_service_endpoints // e.g. ["Microsoft.Storage"]
    }
    
    //vnet private subnet
    resource "azurerm_subnet" "private" {
      name                 = "${var.name_prefix}-private"
      resource_group_name  = var.databricks_workspace_vnet_rg
      virtual_network_name = azurerm_virtual_network.this.name
      address_prefixes     = [var.private_subnets_cidr]
      delegation {
        name = "databricks"
        service_delegation {
          name = "Microsoft.Databricks/workspaces"
          actions = [
            "Microsoft.Network/virtualNetworks/subnets/join/action",
            "Microsoft.Network/virtualNetworks/subnets/prepareNetworkPolicies/action",
            "Microsoft.Network/virtualNetworks/subnets/unprepareNetworkPolicies/action"]
        }
      }
      service_endpoints = var.subnet_service_endpoints // e.g. ["Microsoft.Storage"]
    }
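
    // For completeness, a hedged sketch of the VNet-injected workspace that consumes
    // these subnets; attribute names follow the azurerm provider, and the NSG
    // association resources referenced here are assumed to be defined elsewhere
    // (see the complete code in the repository).
    resource "azurerm_databricks_workspace" "this" {
      name                = "${var.name_prefix}-workspace"
      resource_group_name = var.databricks_workspace_rg
      location            = var.azure_region
      sku                 = "premium"

      custom_parameters {
        virtual_network_id                                   = azurerm_virtual_network.this.id
        public_subnet_name                                   = azurerm_subnet.public.name
        private_subnet_name                                  = azurerm_subnet.private.name
        public_subnet_network_security_group_association_id  = azurerm_subnet_network_security_group_association.public.id
        private_subnet_network_security_group_association_id = azurerm_subnet_network_security_group_association.private.id
      }
    }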
    
    
  2. Create a Serverless Network Connectivity Configuration (NCC) and attach it to the workspace (applicable only for serverless).
    //Retrieve the ID of the specified Azure Databricks Workspace (Created in Step 1)
    data "azurerm_databricks_workspace" "this" {
      name                = var.databricks_workspace
      resource_group_name = var.databricks_workspace_rg
    }
    //Create NCC
    resource "databricks_mws_network_connectivity_config" "ncc" {
      provider = databricks.accounts
      name     = var.workspace_ncc_name
      region   = var.azure_region
    }
    //Attach NCC to workspace
    resource "databricks_mws_ncc_binding" "ncc_binding" {
      provider                       = databricks.accounts
      network_connectivity_config_id = databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
      workspace_id                   = data.azurerm_databricks_workspace.this.workspace_id
    }
  3. Get the workspace public subnet from the Classic VNET.
    // Retrieve the public subnets of the Databricks workspace
    data "azurerm_subnet" "ws_subnets" {
      name                 = var.databricks_workspace_public_subnet
      virtual_network_name = var.databricks_workspace_vnet
      resource_group_name  = var.databricks_workspace_vnet_rg
    }
  4. Get the list of subnets from the Serverless NCC configuration (applicable only for serverless).
    data "databricks_mws_network_connectivity_config" "ncc" {
      provider = databricks.accounts
      name     = var.workspace_ncc_name
    }
    
    locals {
      all_storage_subnets = [for conf in data.databricks_mws_network_connectivity_config.ncc.egress_config :
        [for rule in conf.default_rules :
          [for se_rule in rule.azure_service_endpoint_rule :
            se_rule.subnets if contains(se_rule.target_services, "AZURE_BLOB_STORAGE")
          ]
        ]
      ]
      uniq_storage_subnets = distinct(flatten(local.all_storage_subnets))
    }
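
    // Optional sketch: expose the discovered serverless subnet IDs for inspection,
    // e.g. with `terraform output serverless_storage_subnets` after an apply.
    output "serverless_storage_subnets" {
      value = local.uniq_storage_subnets
    }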
    
  5. Create a Data Storage Account with network rules that allow connections only from the Classic VNET subnets and the Serverless NCC subnets retrieved in the previous steps.
    //storage account for adls with hierarchical namespace enabled
    resource "azurerm_storage_account" "this" {
     account_replication_type = "LRS"
     account_tier             = "Standard"
     location                 = var.azure_region
     name                     = var.data_storage_account
     resource_group_name      = azurerm_resource_group.this.name
     is_hns_enabled           = true
     tags                     = var.tags
     depends_on               = [azurerm_resource_group.this]
    }
    
    //Added to storage account network rules to allow only the workspace VNET public subnets and ncc subnets
    resource "azurerm_storage_account_network_rules" "this" {
      storage_account_id = azurerm_storage_account.this.id
    
      default_action             = "Deny"
      virtual_network_subnet_ids = concat(
          [ data.azurerm_subnet.ws_subnets.id ], //for classic VNET
          local.uniq_storage_subnets // for serverless
        )
    }
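
    // [Optional] If the Terraform environment itself needs data-plane access to the
    // account (for example, to create containers), its egress IPs can be allowed as
    // well by adding an ip_rules argument to the resource above, e.g.
    // ip_rules = var.storage_account_allowed_ips (a variable assumed here, mirroring
    // the one used in Option 2).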

Connectivity from both the Classic VNET and Serverless compute to the Storage Account is now restricted to their associated subnets only. This can be tested by checking data access from the associated workspace and from non-associated workspaces.

Option 2 - Steps to secure connectivity using Private endpoints:

High-level architecture

Azure Databricks uses a control plane for backend services and a compute plane for data processing. The compute plane can be serverless (within your Databricks account) or classic (within your Azure subscription's network).

This architecture diagram outlines how to secure your data from unauthorized access by both classic and serverless compute using private endpoints, leveraging the following components:

  • Classic Compute: Workspaces are deployed with VNet injection plus an additional subnet for hosting private endpoints. You then configure the storage account firewall to permit connections only via the private endpoints in that private link subnet within the virtual network (linked to a private DNS zone).
  • Serverless Compute: Compute operates on Databricks-hosted virtual networks and subnets, with multiple security layers to isolate different Azure Databricks customer workspaces and enforce network controls between clusters of the same customer. Therefore, it is crucial to create private endpoints for the relevant Databricks-hosted subnets and configure the data storage account firewall to allow access only through these private endpoints. This can be managed within Databricks using private endpoint connections via the network connectivity configuration.

[Architecture diagram: private endpoint connectivity from classic and serverless compute to the storage account]

Step-by-Step Implementation using Terraform:

Please find the complete Terraform code in the Databricks GitHub repository: adb-data-storage-vnet-ncc-private-endpoint

  1. Create a Data Storage Account with a network rule (default action: Deny; allow access from the Terraform environment IP).
    //storage account for adls with hierarchical namespace enabled
    resource "azurerm_storage_account" "this" {
      account_replication_type = "LRS"
      account_tier             = "Standard"
      location                 = var.azure_region
      name                     = var.data_storage_account
      resource_group_name      = var.data_storage_account_rg
      is_hns_enabled           = true
      tags                     = var.tags
      depends_on               = [azurerm_resource_group.this]
    }
    
    # [Optional] Configure network rules for the storage account
    # [Recommendation] Terraform should also reach the account via Private Link only
    resource "azurerm_storage_account_network_rules" "this" {
      storage_account_id = azurerm_storage_account.this.id
      default_action     = "Deny"
      ip_rules           = var.storage_account_allowed_ips // e.g. terraform env. IP
    }
  2. Deploy a Databricks workspace using VNET injection, with a private link subnet for private endpoints.
    data "azurerm_virtual_network" "ws_vnet" {
      name                = var.databricks_workspace_vnet
      resource_group_name = var.databricks_workspace_vnet_rg
    }
    
    //vnet private link subnet
    resource "azurerm_subnet" "plsubnet" {
      name                  = "${var.name_prefix}-privatelink"	
      resource_group_name   = var.databricks_workspace_vnet_rg
      virtual_network_name  = var.databricks_workspace_vnet
      address_prefixes      = [var.pl_subnets_cidr]
    }
  3. Add Storage Private endpoint to allow connection from workspace VNET.
    //private dns zone for data-dfs
    resource "azurerm_private_dns_zone" "dfs" {
      name                = "privatelink.dfs.core.windows.net"
      resource_group_name = var.databricks_workspace_vnet_rg
      tags                = var.tags
    }
    
    // Link the private DNS zone to the VNet
    resource "azurerm_private_dns_zone_virtual_network_link" "dfsdnszonevnetlink" {
      name                  = "dfsvnetconnection"
      resource_group_name   = var.databricks_workspace_vnet_rg
      private_dns_zone_name = azurerm_private_dns_zone.dfs.name
      virtual_network_id    = data.azurerm_virtual_network.ws_vnet.id // Connect to the spoke VNet
      tags                  = var.tags
    }
    
    // Create a private endpoint for the workspace to access the data storage (catalog external location)
    resource "azurerm_private_endpoint" "data_dfs" {
      name                = "datapvtendpoint"
      location            = var.azure_region
      resource_group_name = var.databricks_workspace_vnet_rg
      subnet_id           = azurerm_subnet.plsubnet.id
      tags                = var.tags
      
      private_service_connection {
        name                           = "ple-${var.name_prefix}-data"
        private_connection_resource_id = azurerm_storage_account.this.id
        is_manual_connection           = false
        subresource_names              = ["dfs"]
      }
      private_dns_zone_group {
        name                 = "private-dns-zone-data-dfs"
        private_dns_zone_ids = [azurerm_private_dns_zone.dfs.id]
      }
    }
    
    //private dns zone for data-blob
    resource "azurerm_private_dns_zone" "blob" {
      name                = "privatelink.blob.core.windows.net"
      resource_group_name = var.databricks_workspace_vnet_rg
      tags                = var.tags
    }
    
    // Link the private DNS zone to the VNet
    resource "azurerm_private_dns_zone_virtual_network_link" "blobdnszonevnetlink" {
      name                  = "blobvnetconnection"
      resource_group_name   = var.databricks_workspace_vnet_rg
      private_dns_zone_name = azurerm_private_dns_zone.blob.name
      virtual_network_id    = data.azurerm_virtual_network.ws_vnet.id // Connect to the spoke VNet
      tags                  = var.tags
    }
    
    // Create a private endpoint for the workspace to access the data storage (catalog external location)
    resource "azurerm_private_endpoint" "data_blob" {
      name                = "datapvtendpointblob"
      location            = var.azure_region
      resource_group_name = var.databricks_workspace_vnet_rg
      subnet_id           = azurerm_subnet.plsubnet.id
      tags                = var.tags
      
      private_service_connection {
        name                           = "ple-${var.name_prefix}-data-blob"
        private_connection_resource_id = azurerm_storage_account.this.id
        is_manual_connection           = false
        subresource_names              = ["blob"]
      }
      private_dns_zone_group {
        name                 = "private-dns-zone-data-blob"
        private_dns_zone_ids = [azurerm_private_dns_zone.blob.id]
      }
    }
  4. To secure DBFS, add a Storage Private endpoint to allow connection from the workspace VNET. (Note: Set default_storage_firewall_enabled = true during workspace creation to secure DBFS storage.)
    //private link for root storage
    data "azurerm_storage_account" "dbfs_storage_account" {
      name                = var.dbfs_storage_account
      resource_group_name = var.dbfs_storage_account_rg
    }
    
    // Create a private endpoint for the workspace to access the DBFS root storage (dfs)
    resource "azurerm_private_endpoint" "dbfs_dfs" {
      name                = "dbfspvtendpoint"
      location            = var.azure_region
      resource_group_name = var.databricks_workspace_vnet_rg
      subnet_id           = azurerm_subnet.plsubnet.id
      tags                = var.tags
      private_service_connection {
        name                           = "ple-${var.name_prefix}-dbfs"
        private_connection_resource_id = data.azurerm_storage_account.dbfs_storage_account.id
        is_manual_connection           = false
        subresource_names              = ["dfs"]
      }
      private_dns_zone_group {
        name                 = "private-dns-zone-dbfs-dfs"
        private_dns_zone_ids = [azurerm_private_dns_zone.dfs.id]
      }
    }
    
    // Create a private endpoint for the workspace to access the DBFS root storage (blob)
    resource "azurerm_private_endpoint" "dbfs_blob" {
      name                = "dbfspvtendpointblob"
      location            = var.azure_region
      resource_group_name = var.databricks_workspace_vnet_rg
      subnet_id           = azurerm_subnet.plsubnet.id
      tags                = var.tags
      private_service_connection {
        name                           = "ple-${var.name_prefix}-dbfs-blob"
        private_connection_resource_id = data.azurerm_storage_account.dbfs_storage_account.id
        is_manual_connection           = false
        subresource_names              = ["blob"]
      }
      private_dns_zone_group {
        name                 = "private-dns-zone-dbfs-blob"
        private_dns_zone_ids = [azurerm_private_dns_zone.blob.id]
      }
    }
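
    // A hedged illustration of the workspace setting mentioned in this step's note:
    // when default_storage_firewall_enabled is set, recent azurerm provider versions
    // also require an access connector. Both resources below are illustrative, not
    // part of the repository code.
    resource "azurerm_databricks_access_connector" "this" {
      name                = "${var.name_prefix}-access-connector"
      resource_group_name = var.databricks_workspace_rg
      location            = var.azure_region
      identity {
        type = "SystemAssigned"
      }
    }

    resource "azurerm_databricks_workspace" "firewalled" {
      name                             = "${var.name_prefix}-workspace"
      resource_group_name              = var.databricks_workspace_rg
      location                         = var.azure_region
      sku                              = "premium"
      default_storage_firewall_enabled = true
      access_connector_id              = azurerm_databricks_access_connector.this.id
    }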
  5. Create a Serverless Network Connectivity Configuration (NCC) and attach it to the workspace (applicable only for serverless).
    //Get Workspace Id
    data "azurerm_databricks_workspace" "this" {
      name                = var.databricks_workspace
      resource_group_name = var.databricks_workspace_rg
    }
    //Create NCC
    resource "databricks_mws_network_connectivity_config" "ncc" {
      provider = databricks.accounts
      name     = var.workspace_ncc_name
      region   = var.azure_region
    }
    //Attach NCC to workspace
    resource "databricks_mws_ncc_binding" "ncc_binding" {
      provider                       = databricks.accounts
      network_connectivity_config_id = databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
      workspace_id                   = data.azurerm_databricks_workspace.this.workspace_id
      depends_on = [databricks_mws_network_connectivity_config.ncc]
      
    }
  6. Add a Storage account Private endpoint rule in the NCC to allow connection from Serverless (applicable only for serverless).
    data "databricks_mws_network_connectivity_config" "ncc" {
      provider = databricks.accounts
      name     = var.workspace_ncc_name
    }
    
    //add ncc private endpoint rule (dfs)
    resource "databricks_mws_ncc_private_endpoint_rule" "storage_dfs" {
      provider                       = databricks.accounts
      network_connectivity_config_id = data.databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
      resource_id                    = azurerm_storage_account.this.id
      group_id                       = "dfs"
    }

    //add ncc private endpoint rule (blob)
    resource "databricks_mws_ncc_private_endpoint_rule" "storage_blob" {
      provider                       = databricks.accounts
      network_connectivity_config_id = data.databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
      resource_id                    = azurerm_storage_account.this.id
      group_id                       = "blob"
    }
  7. Add a DBFS Storage account Private endpoint rule in the NCC to allow connection from Serverless (applicable only for serverless).
    // Add a private endpoint rule for the NCC to access the storage account
    resource "databricks_mws_ncc_private_endpoint_rule" "dbfs_dfs" {
      provider                       = databricks.accounts
      network_connectivity_config_id = data.databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
      resource_id                    = data.azurerm_storage_account.dbfs_storage_account.id
      group_id                       = "dfs"
    }
    
    // Add a private endpoint rule for the NCC to access the storage account
    resource "databricks_mws_ncc_private_endpoint_rule" "dbfs_blob" {
      provider                       = databricks.accounts
      network_connectivity_config_id = data.databricks_mws_network_connectivity_config.ncc.network_connectivity_config_id
      resource_id                    = data.azurerm_storage_account.dbfs_storage_account.id
      group_id                       = "blob"
    }
    
  8. Finally, approve the newly added NCC private endpoint connections via the Azure API (applicable only for serverless).
    // Retrieve the list of private endpoint connections for the storage account
    data "azapi_resource_list" "list_storage_private_endpoint_connection" {
      type                   = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
      parent_id              = "/subscriptions/${var.azure_subscription_id}/resourceGroups/${var.data_storage_account_rg}/providers/Microsoft.Storage/storageAccounts/${var.data_storage_account}"
      response_export_values = ["*"]
      depends_on = [databricks_mws_ncc_private_endpoint_rule.storage_dfs, databricks_mws_ncc_private_endpoint_rule.storage_blob]
    }
    
    // Approve the private endpoint connection for the storage account (dfs)
    resource "azapi_update_resource" "approve_storage_private_endpoint_connection_dfs" {
      type      = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
      name      = [
        for i in data.azapi_resource_list.list_storage_private_endpoint_connection.output.value 
        : i.name if endswith(i.properties.privateEndpoint.id, databricks_mws_ncc_private_endpoint_rule.storage_dfs.endpoint_name)
      ][0]
      parent_id = azurerm_storage_account.this.id
    
      body = {
        properties = {
          privateLinkServiceConnectionState = {
            description = "Auto Approved via Terraform"
            status      = "Approved"
          }
        }
      }
    }
    
    // Approve the private endpoint connection for the storage account (blob)
    resource "azapi_update_resource" "approve_storage_private_endpoint_connection_blob" {
      type      = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
      name      = [
        for i in data.azapi_resource_list.list_storage_private_endpoint_connection.output.value 
        : i.name if endswith(i.properties.privateEndpoint.id, databricks_mws_ncc_private_endpoint_rule.storage_blob.endpoint_name)
      ][0]
      parent_id = azurerm_storage_account.this.id
      body = {
        properties = {
          privateLinkServiceConnectionState = {
            description = "Auto Approved via Terraform"
            status      = "Approved"
          }
        }
      }
    }
    
    // Retrieve the list of private endpoint connections for the DBFS storage account
    data "azapi_resource_list" "list_storage_private_endpoint_connection_dbfs" {
      type                   = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
      parent_id              = "/subscriptions/${var.azure_subscription_id}/resourceGroups/${var.dbfs_storage_account_rg}/providers/Microsoft.Storage/storageAccounts/${var.dbfs_storage_account}"
      response_export_values = ["*"]
      depends_on = [databricks_mws_ncc_private_endpoint_rule.dbfs_dfs, databricks_mws_ncc_private_endpoint_rule.dbfs_blob]
    }
    
    // Approve the private endpoint connection for the DBFS storage account (dfs)
    resource "azapi_update_resource" "approve_storage_private_endpoint_connection_dbfs_dfs" {
      type      = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
      name      = [
        for i in data.azapi_resource_list.list_storage_private_endpoint_connection_dbfs.output.value 
        : i.name if endswith(i.properties.privateEndpoint.id, databricks_mws_ncc_private_endpoint_rule.dbfs_dfs.endpoint_name)
      ][0]
      parent_id = data.azurerm_storage_account.dbfs_storage_account.id
      body = {
        properties = {
          privateLinkServiceConnectionState = {
            description = "Auto Approved via Terraform"
            status      = "Approved"
          }
        }
      }
    }
    
    // Approve the private endpoint connection for the DBFS storage account (blob)
    resource "azapi_update_resource" "approve_storage_private_endpoint_connection_dbfs_blob" {
      type      = "Microsoft.Storage/storageAccounts/privateEndpointConnections@2022-09-01"
      name      = [
        for i in data.azapi_resource_list.list_storage_private_endpoint_connection_dbfs.output.value 
        : i.name if endswith(i.properties.privateEndpoint.id, databricks_mws_ncc_private_endpoint_rule.dbfs_blob.endpoint_name)
      ][0]
      parent_id = data.azurerm_storage_account.dbfs_storage_account.id
      body = {
        properties = {
          privateLinkServiceConnectionState = {
            description = "Auto Approved via Terraform"
            status      = "Approved"
          }
        }
      }
    }

Connectivity from both the Classic VNET and Serverless compute to the Storage Account is now secured using private endpoints. This can be tested by checking data access from the associated workspace and from non-associated workspaces.

Conclusion

In this blog, we explored how to secure data storage connectivity to both Classic and Serverless compute using Terraform in Azure, and why it matters for both compute types. We also walked through the two available connectivity options, service endpoints and private endpoints, with ready-to-use Terraform code snippets.