Product Platform Updates
Stay informed about the latest updates and enhancements to the Databricks platform. Learn about new features, improvements, and best practices to optimize your data analytics workflow.
Shu_Li
Databricks Employee
This article walks you through a step-by-step guide to strengthening data privacy with enhanced Python UDFs in Databricks Unity Catalog.

 

Step 1: Strengthen the security of the encryption key or hash salt used for data protection with a Databricks Unity Catalog (UC) service credential

Using service credentials to retrieve Azure Key Vault (AKV) secrets strengthens security compared with the traditional approach of Databricks dbutils.secrets with AKV-backed secret scopes, because Databricks secrets are visible to privileged users such as workspace administrators and secret creators. A Unity Catalog service credential, by contrast, is a metastore-level securable: workspace admins don't automatically gain access to it.

Step 2: Create an enhanced Python UDF in Unity Catalog as a column mask (CM) function for sensitive data

In this pandas UC Python UDF, use the UC service credential from Step 1 to access AKV and retrieve the key:

CREATE OR REPLACE FUNCTION mycatalog.myschema.akv_key(data STRING) RETURNS STRING
LANGUAGE PYTHON
PARAMETER STYLE PANDAS
HANDLER 'batchhandler'
CREDENTIALS (`keyvault-cred` DEFAULT)
ENVIRONMENT (
  dependencies = '["azure-keyvault", "azure-keyvault-secrets", "azure-identity"]',
  environment_version = 'None'
)
AS $$
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

def batchhandler(data):
  vault_secret_key = "cmsaultkey"
  vault_url = "https://xxxxx.vault.azure.net/"
  # The default service credential declared above is picked up here
  credential = DefaultAzureCredential()
  client = SecretClient(vault_url=vault_url, credential=credential)
  # Fetch the salt once per handler invocation, not once per batch
  salt = client.get_secret(vault_secret_key).value
  for d in data:
    # d is a pandas Series of strings; append the salt to every value
    yield d + salt
$$
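To see the batch handler contract in isolation, here is a minimal local sketch: the handler receives an iterator of batches and yields one salted batch per input batch. The AKV lookup is replaced by a hard-coded stand-in salt, and plain lists stand in for the pandas Series the real PANDAS-style handler receives.

```python
# Local sketch of the batch handler contract; "local-salt" stands in for
# the secret that the real UDF fetches from Azure Key Vault.
def batchhandler(batches, salt="local-salt"):
    for batch in batches:
        # In the real UDF each batch is a pandas Series; plain lists
        # keep this sketch dependency-free.
        yield [value + salt for value in batch]

# Two input batches in, two salted batches out.
out = [b for b in batchhandler(iter([["alice"], ["bob", "carol"]]))]
# out == [["alicelocal-salt"], ["boblocal-salt", "carollocal-salt"]]
```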

Step 3: Create a UC SQL UDF wrapper for the pandas UC UDF from Step 2

Pandas UC UDFs can't be used as CM functions in Databricks UC directly. CM functions in Unity Catalog must be scalar SQL UDFs (or Python/Scala UDFs wrapped in SQL UDFs), and they operate at query runtime to mask sensitive data.

USE CATALOG mycatalog;
USE SCHEMA myschema;

CREATE OR REPLACE FUNCTION mycatalog.myschema.maskWithHashSaltKey(column_to_mask STRING)
RETURN
  CASE
    WHEN is_account_group_member('classified')
      THEN column_to_mask
    ELSE sha2(akv_key(trim(lower(column_to_mask))), 256)
  END;

ALTER TABLE mycatalog.myschema.pii_table
  ALTER COLUMN pii_column SET MASK mycatalog.myschema.maskWithHashSaltKey;
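The end-to-end effect of the wrapper can be reproduced locally with Python's hashlib. This is a sketch under the assumption that the salt has already been fetched from AKV; it mirrors the SQL CASE logic, not the Databricks runtime itself:

```python
import hashlib

def mask_with_hash_salt(value: str, salt: str, is_classified: bool) -> str:
    # Mirrors the SQL CASE: members of the 'classified' group see the
    # plaintext, everyone else sees sha2 of the salted, normalized value.
    if is_classified:
        return value
    # trim(lower(...)) in SQL, then akv_key appends the AKV salt
    salted = value.lower().strip() + salt
    return hashlib.sha256(salted.encode("utf-8")).hexdigest()

masked = mask_with_hash_salt("Alice@Example.com", "local-salt", False)
# A 64-hex-char digest; the same input and salt always mask identically.
```

Because the salt never leaves the UDF, users who can read the table see only deterministic digests, which still join and group correctly across tables masked with the same key.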

Step 4 (Optional): Enable networking for UDFs in serverless SQL warehouses

Databricks serverless SQL offers a compelling combination of productivity, performance, cost efficiency, and operational simplicity for modern analytics and business intelligence workloads. To use enhanced UC Python UDFs on serverless SQL, you need to enable this feature from your Databricks Previews page.

Just like that, in 3 easy steps (4 if using Databricks serverless SQL), enhanced UC UDFs allow you to specify custom dependencies (from PyPI, Unity Catalog volumes, or public URLs) and UC service credentials directly in the UDF definition. This means you can strengthen security with UC service credentials and use external libraries (in this example, the AKV client libraries) or your own packaged code without manual environment setup.