This article is a step-by-step guide to strengthening data privacy with enhanced Python UDFs in Databricks Unity Catalog.
Step 1: Strengthen the security of the encryption key or hash salt key used for data protection with a Databricks Unity Catalog (UC) Service Credential
Using a service credential to retrieve Azure Key Vault (AKV) secrets strengthens security compared with the traditional approach of Databricks dbutils.secrets with AKV-backed secret scopes, because Databricks secrets are visible to privileged users such as workspace administrators and secret creators. A Unity Catalog service credential, by contrast, is a metastore-level securable: workspace admins don't automatically gain access to it.
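For example, access to the service credential can be granted only to the principals that need to create or run the masking UDF. A minimal sketch, assuming a service credential named keyvault-cred and a group named masking-admins (both placeholders for your own objects):
GRANT ACCESS ON SERVICE CREDENTIAL `keyvault-cred` TO `masking-admins`;
-- Only principals granted ACCESS on the credential can reference it from a UDF; workspace admins get no implicit access.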
Step 2: Create an enhanced Python UDF in Unity Catalog as a column masking (CM) function for sensitive data
This Pandas UC Python UDF uses the UC service credential from Step 1 to access AKV and retrieve the key:
CREATE OR REPLACE FUNCTION mycatalog.myschema.akv_key(data STRING) RETURNS STRING
LANGUAGE PYTHON
PARAMETER STYLE PANDAS
HANDLER 'batchhandler'
CREDENTIALS (`keyvault-cred` DEFAULT)
ENVIRONMENT (
  dependencies = '["azure-keyvault", "azure-keyvault-secrets", "azure-identity"]',
  environment_version = 'None'
)
AS $$
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

def batchhandler(data):
    # 'data' is an iterator of pandas Series, one per batch.
    vault_secret_key = "cmsaltkey"
    vault_url = "https://xxxxx.vault.azure.net/"
    # DefaultAzureCredential picks up the UC service credential declared in the CREDENTIALS clause.
    credential = DefaultAzureCredential()
    client = SecretClient(vault_url=vault_url, credential=credential)
    # Fetch the salt once per batch iterator, not once per row.
    salt = client.get_secret(vault_secret_key).value
    for d in data:
        # Prepend a static prefix and append the AKV salt; the SQL wrapper in Step 3 hashes the result.
        yield "cm salt key" + d + salt
$$
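A quick sanity check, assuming you have ACCESS on keyvault-cred and EXECUTE on the function, is to call the UDF directly with a throwaway value:
SELECT mycatalog.myschema.akv_key('test@example.com') AS salted_value;
-- Returns the static prefix, the input, and the salt retrieved from AKV, concatenated (not yet hashed).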
Step 3: Create a UC SQL UDF wrapper for the Pandas UC UDF from Step 2
Pandas UC Python UDFs can't be used directly as CM functions in Databricks UC. Column mask functions in Unity Catalog must be scalar SQL UDFs (or Python/Scala UDFs wrapped in SQL UDFs), and they operate at query runtime to mask sensitive data.
USE CATALOG mycatalog;
USE SCHEMA myschema;

CREATE OR REPLACE FUNCTION
  mycatalog.myschema.maskWithHashSaltKey(column_to_mask STRING)
RETURNS STRING
RETURN
  CASE
    WHEN is_account_group_member('classified')
      THEN column_to_mask
    ELSE sha2(encode(akv_key(trim(lower(column_to_mask))), 'UTF-8'), 256)
  END;

ALTER TABLE mycatalog.myschema.pii_table ALTER COLUMN pii_column SET MASK mycatalog.myschema.maskWithHashSaltKey;
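To confirm the mask is applied, query the column as a user who is not in the classified group (table and column names are the placeholders used above):
SELECT pii_column FROM mycatalog.myschema.pii_table LIMIT 5;
-- Members of 'classified' see raw values; everyone else sees the salted SHA-256 hash.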
Step 4 (Optional): Enable networking for UDFs on Serverless SQL Warehouses
Databricks Serverless SQL offers a compelling combination of productivity, performance, cost efficiency, and operational simplicity for modern analytics and business intelligence workloads. To use enhanced UC Python UDFs on Serverless SQL warehouses, you need to enable this feature from your Databricks Previews page.
Just like that, in three easy steps (four if you're using Databricks Serverless SQL), enhanced UC UDFs let you specify custom dependencies (from PyPI, Unity Catalog volumes, or public URLs) and UC service credentials directly in the UDF definition. This means you can strengthen security with UC service credentials and use external libraries (in this example, the Azure Key Vault client libraries) or your own packaged code without manual environment setup.
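As a minimal sketch of the dependency options, the snippet below declares both a PyPI package and a wheel stored in a Unity Catalog volume (the volume path, wheel name, and function are hypothetical):
CREATE OR REPLACE FUNCTION mycatalog.myschema.normalize_text(s STRING) RETURNS STRING
LANGUAGE PYTHON
PARAMETER STYLE PANDAS
HANDLER 'handler'
ENVIRONMENT (
  dependencies = '["unidecode==1.3.*", "/Volumes/mycatalog/myschema/libs/my_internal_lib-1.0-py3-none-any.whl"]',
  environment_version = 'None'
)
AS $$
from unidecode import unidecode

def handler(batches):
    # 'batches' is an iterator of pandas Series; strip accents from each value.
    for s in batches:
        yield s.map(unidecode)
$$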