cancel
Showing results for 
Search instead for 
Did you mean: 
Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
cancel
Showing results for 
Search instead for 
Did you mean: 
ponnapv
Databricks Employee
Databricks Employee

Problem Statement

Let us start with setting some context. The problem statement that we are solving here is kept a bit generic as the solution can be applied to any similar situations.

Consider a Payroll datasource with PII data is ingested into Databricks with the following requirements:

  •   A table includes only a few columns that are confidential and can only be viewed by managers (members with the right access)
  •   Available data includes the team hierarchy that will be used as the access rights. Only manager and above will have access to individual employee's confidential data 
  •   The process of managing the solution should be simple and provide operational efficiency for scaling
  •   It should minimize the manual intervention and increase maintainabelity
  •   The solution should follow industry standards and provide high level of security

High-Level Architecture

The solution is built on industry standards as explained below:

  • Security Best Practices: The solution adapts Databricks’ best practice  Envelope Encryption for Unity Catalog (UC), providing multiple layers of security to ensure data confidentiality. This approach uses a combination of data encryption keys and key encryption keys, offering robust protection for sensitive information.
  • User-Specific Data Access: Utilizing the session_user() function, the system enforces row-level security. This mechanism dynamically filters data based on the authenticated user's identity, ensuring that individuals only access information relevant to their roles and permissions.
  • Principle of Least Privilege (PoLP) Access Control: The framework leverages Unity Catalog's granular permission system to implement the principle of least privilege. This allows for precise control over user access to catalogs, schemas, tables, views, and functions, minimizing potential security risks by restricting unnecessary access.
  • Secure Secret Management: Sensitive information such as keys used to access data is securely stored and managed using Databricks-backed secret scopes. This approach ensures that critical credentials are protected and can be accessed safely by authorized processes and users.
  • Data Encryption and Controlled Decryption: By default, all data at rest is encrypted at the table level. When data is accessed through views, it is selectively decrypted based on the current user's permissions. This ensures that even if unauthorized access occurs at the storage level, the data remains protected, with decryption only happening for authorized users through controlled access points.

This architecture provides a comprehensive approach to data security, combining encryption, access control, and secure credential management to protect sensitive information throughout its lifecycle in the system.

Design Approach 

Data is encrypted in the table by default and embeds a function to decrypt the data in the view on top, as it's filtered for the user accessing the data.

We use Payroll_Encrypted, Employee_hierarchy, & Employee_UPN tables to create the view Payroll_decrypted. The Employee_UPN table is the key to use the current_user function

ponnapv_0-1744056752033.png

Permission Summary

  1. The objects (key_vault table, encrypt, decrypt and unwrap_key functions) within the Domain.crypto “catalog” are exclusively accessible to the admin user “or specific segmented team”.
  2. The “payroll_decrypted view” has SELECT permissions restricted to users or groups with managers (hierarchy) authorized to query. Both use catalog and use schema on where the view is defined need to be granted to the same group. Read on secret scope needs to be granted to this group as well.
  3. The “payroll_decrypted view” owner needs to have execute on the decrypt function and read on secret scope. “This would be the admin team or selected segmented team”

Implementation Walk through

Employee hierarchy and Employee UPN information is for general consumption.  The base payroll table / data would be encrypted.  A view would be created joining the 3 together along with the decrypt function for general consumption using the current_user() function.

Step 0

Secret Scope, Key Encryption Key Name and User/Group to Access the Key as Inputs

dbutils.widgets.text(name="secret_scope", defaultValue="piiscope", label="The secret scope to use for DEKs")
dbutils.widgets.text(name="kek_name", defaultValue="piikeyname", label="The name to use for our KEK")
dbutils.widgets.text(name="keyvault_user", defaultValue="payroll_managers", label="The username to grant unprivileged access to decrypt the data")

Step 1

Prepare the sample employee and manager tables

CREATE OR REPLACE TABLE consume.catalog.employee_hierarchy AS (SELECT * FROM read_files(
 '/Volumes/consume/catalog/synthetic_data/employee_hierarchy.csv',
 format => 'csv',
 header => true,
 inferSchema => true))
CREATE OR REPLACE TABLE consume.catalog.employee_upn AS (SELECT * FROM read_files(
 '/Volumes/consume/catalog/synthetic_data/employee_upn.csv',
 format => 'csv',
 header => true,
 inferSchema => true));

Step 2

Generate a Key Encryption Key (KEK) and create a key_vault table to store it in a dedicated catalog and schema

from base64 import b64encode
from os import urandom
kek = b64encode(urandom(24)).decode('utf-8')
CREATE OR REPLACE TABLE sys.crypto.key_vault (
 id BIGINT GENERATED BY DEFAULT AS IDENTITY,
 created_date DATE,
 created_time TIMESTAMP,
 last_modified_time TIMESTAMP,
 created_by STRING,
 managed_by STRING,
 key_name STRING,
 key_version INT,
 key_enabled BOOLEAN,
 key_type STRING,
 key STRING);

kek_name = dbutils.widgets.get("kek_name")
sql(f"""
   INSERT INTO sys.crypto.key_vault (created_date, created_time, last_modified_time, created_by, managed_by, key_name, key_version, key_enabled, key_type, key)
   VALUES (current_date(), current_timestamp(), current_timestamp(), session_user(), session_user(), '{kek_name}', 1, True, 'KEK', '{kek}')""")

Step 3

Use the KEK to encrypt our Data Encryption Key (DEK) and store the encrypted DEK as a secret

import string
import random
 
dek = b64encode(urandom(24)).decode('utf-8')
iv = ''.join(random.choices(string.ascii_uppercase + string.digits, k=12))
aad = ''.join(random.choices(string.ascii_uppercase + string.digits, k=8))
 
encrypted_dek = sql(f"SELECT base64(aes_encrypt('{dek}', '{kek}', 'GCM', 'DEFAULT'))").first()[0]
encrypted_iv = sql(f"SELECT base64(aes_encrypt('{iv}', '{kek}', 'GCM', 'DEFAULT'))").first()[0]
encrypted_aad = sql(f"SELECT base64(aes_encrypt('{aad}', '{kek}', 'GCM', 'DEFAULT'))").first()[0]
from databricks.sdk import WorkspaceClient
w = WorkspaceClient()
secret_scope = dbutils.widgets.get("secret_scope")
try:
   w.secrets.create_scope(scope=secret_scope)
except Exception as e:
   print(e)
w.secrets.put_secret(scope=secret_scope, key='dek', string_value=encrypted_dek)
w.secrets.put_secret(scope=secret_scope, key='iv', string_value=encrypted_iv)
w.secrets.put_secret(scope=secret_scope, key='aad', string_value=encrypted_aad)
from databricks.sdk.service import workspace
w.secrets.put_acl(scope=secret_scope, permission=workspace.AclPermission.READ, principal=dbutils.widgets.get("keyvault_user"))

Step 4

Create crypto functions to unwrap our keys and encrypt the data

CREATE OR REPLACE FUNCTION sys.crypto.unwrap_key(key_to_unwrap STRING, key_to_use STRING)
RETURNS STRING
RETURN aes_decrypt(unbase64(key_to_unwrap), (SELECT key FROM sys.crypto.key_vault WHERE key_enabled AND key_name = key_to_use ORDER BY created_date DESC  LIMIT 1), 'GCM', 'DEFAULT')
kek_name = dbutils.widgets.get("kek_name")
sql(f"""CREATE OR REPLACE FUNCTION sys.crypto.encrypt(col STRING)
RETURNS STRING
RETURN
   base64(aes_encrypt(col,
   sys.crypto.unwrap_key(secret('{secret_scope}', 'dek'), '{kek_name}'),
   'GCM',
   'DEFAULT',
   sys.crypto.unwrap_key(secret('{secret_scope}', 'iv'), '{kek_name}'),
   sys.crypto.unwrap_key(secret('{secret_scope}', 'aad'), '{kek_name}')
   ))""")

Step 5

Create a table payroll_encrypted with the salary information encrypted 

CREATE OR REPLACE TABLE consume.catalog.payroll_encrypted AS 
(SELECT
employee_id,
first_name,
last_name,
sys.crypto.encrypt(salary) AS salary
FROM consume.catalog.employee_hierarchy)

Step 6

Create a crypto function to decrypt the data 

sql(f"""CREATE OR REPLACE FUNCTION sys.crypto.decrypt(col STRING)
RETURNS STRING
RETURN
   nvl(CAST(try_aes_decrypt(unbase64(col),
   sys.crypto.unwrap_key(secret('{secret_scope}', 'dek'), '{kek_name}'),
   'GCM',
   'DEFAULT',
   sys.crypto.unwrap_key(secret('{secret_scope}', 'aad'), '{kek_name}')) AS STRING),
   col)
   """)

Step 7

Apply the decrypt function to create a view which allows the manager to see their employee data only 

create or replace view consume.catalog.payroll_decrypted as
select e.employee_id, e.first_name, e.last_name, m.manager_id, m.manager_email,
sys.crypto.decrypt(e.salary) as salary
from consume.catalog.payroll_encrypted e join consume.catalog.employee_upn m on e.employee_id = m.employee_id
where m.manager_email = current_user()

Step 8

Query the data and confirm that the data is decrypted as expected

select * from consume.catalog.payroll_decrypted;

 

Conclusion

We have multiple options to secure data on Databricks, but picking the right option and customizing it to the requirement is important. This blog summarizes one such requirement with a detailed approach to solve it and scale it.

 

References

  1. Sample code on Git Hub 
  2. Envelope Encryption with Unity Catalog
  3. Secret Management on Databricks