Data security is a cornerstone of any modern analytics or data processing environment, ensuring that sensitive information remains protected throughout its lifecycle. Organizations must comply with regulations like GDPR, HIPAA, and CCPA, which dictate how personal and sensitive data should be stored, accessed, and processed. Failure to meet these standards can result in legal penalties and reputational damage. At the same time, robust data security fosters privacy and trust by assuring customers and stakeholders that their information is safe from breaches.
Implementing fine-grained security controls, such as column masking and row-level filtering, enables secure data access without hindering legitimate analytical activity. With DLT, these controls can be applied declaratively at table creation time.
Column Masking is a data security technique designed to protect sensitive or personally identifiable information (PII) within specific database columns by transforming the data in a way that preserves its structure but obscures or anonymizes its contents. Unlike encryption, where data can be decrypted back to its original form, masking is typically a one-way transformation, ensuring individuals without the necessary privileges cannot view the real values.
To implement column masking in a DLT table, first define the masking logic as a SQL user-defined function (UDF):
USE stores.north_america;

CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
RETURN CASE
  WHEN is_account_group_member('admin') THEN ssn
  WHEN is_account_group_member('dept_manager') THEN CONCAT('***-**-', SUBSTRING(ssn, LENGTH(ssn) - 3, 4))
  ELSE '***-**-****'
END;
import dlt

@dlt.table(
    name="customerData",
    # The MASK clause attaches the SQL UDF to the ssn column.
    schema="""
      ssn STRING MASK stores.north_america.ssn_mask,
      state STRING,
      gender STRING,
      birthdate STRING,
      maiden_name STRING,
      last_name STRING,
      first_name STRING,
      address STRING,
      _rescued_data STRING
    """
)
def customer_data():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/stores/north_america/manynames")
    )
Depending on group membership, users will either see the full SSN, a partially masked version, or a fully obfuscated version.
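The three tiers can be sketched in plain Python to illustrate the CASE logic above (an illustration only; in Databricks the function runs as a SQL UDF and is_account_group_member is evaluated against the caller's actual group memberships):

```python
def ssn_mask(ssn: str, groups: set) -> str:
    """Simulate the three-tier SSN mask: full value for admins,
    last four digits for department managers, fully masked otherwise."""
    if "admin" in groups:
        return ssn
    if "dept_manager" in groups:
        return "***-**-" + ssn[-4:]
    return "***-**-****"

print(ssn_mask("123-45-6789", {"admin"}))         # 123-45-6789
print(ssn_mask("123-45-6789", {"dept_manager"}))  # ***-**-6789
print(ssn_mask("123-45-6789", set()))             # ***-**-****
```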
Example result of a query run by a user who is neither an admin nor a department manager:

SELECT * FROM stores.north_america.customerdata;
Row-Level Security (RLS) is a method of controlling access to rows within a table based on the user's identity or specific attributes, such as their department, region, role, or security clearance level. By enforcing RLS, you ensure that users only see the data rows they are explicitly authorized to view. This eliminates the need to maintain multiple separate datasets tailored to different user groups, greatly simplifying data governance and reducing storage overhead. RLS is particularly valuable in environments dealing with sensitive or regulated data, ensuring users access only what is necessary for their role.
1. Similar to column masking, implementing RLS starts with a user-defined function.
For the example above, let's add a row filter so that only members of the group for a particular state can see the applicable rows. You can create the UDF on a SQL Warehouse or in an interactive notebook.
CREATE OR REPLACE FUNCTION state_filter(state STRING)
RETURN CASE
  WHEN is_account_group_member('CA_resident') THEN state = 'CA'
  WHEN is_account_group_member('NY_resident') THEN state = 'NY'
END;
2. Apply the row filter in DLT:
import dlt

@dlt.table(
    name="customerData",
    schema="""
      ssn STRING MASK stores.north_america.ssn_mask,
      state STRING,
      gender STRING,
      birthdate STRING,
      maiden_name STRING,
      last_name STRING,
      first_name STRING,
      address STRING
    """,
    # Attach the row-filter UDF, binding the state column to its parameter.
    row_filter="ROW FILTER stores.north_america.state_filter ON (state)"
)
def customer_data():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/stores/north_america/manynames")
    )
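The filter predicate from step 1 can be sketched in plain Python (an illustration only, not how Spark executes it). Note that the SQL CASE above has no ELSE branch, so users in neither group get NULL, which filters the row out just as FALSE does:

```python
def state_filter(state: str, groups: set):
    """Simulate the row-filter UDF: True grants access to the row,
    False or None (the no-ELSE case) hides it."""
    if "CA_resident" in groups:
        return state == "CA"
    if "NY_resident" in groups:
        return state == "NY"
    return None  # no ELSE in the SQL CASE -> NULL -> row hidden

rows = [{"state": "CA"}, {"state": "NY"}, {"state": "TX"}]
visible = [r for r in rows if state_filter(r["state"], {"CA_resident"})]
print(visible)  # [{'state': 'CA'}]
```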
Column masking and row-level security can work together for layered data protection. For example, a Department Manager from California will see only rows for “CA” and partially masked SSNs.
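The layered behavior for that user can be sketched as follows (a plain-Python simulation of the two UDFs defined earlier, not how Spark evaluates them):

```python
def ssn_mask(ssn, groups):
    # Mirrors the ssn_mask SQL UDF's three tiers.
    if "admin" in groups:
        return ssn
    if "dept_manager" in groups:
        return "***-**-" + ssn[-4:]
    return "***-**-****"

def state_filter(state, groups):
    # Mirrors the state_filter SQL UDF; False hides the row.
    if "CA_resident" in groups:
        return state == "CA"
    if "NY_resident" in groups:
        return state == "NY"
    return False

rows = [
    {"ssn": "123-45-6789", "state": "CA"},
    {"ssn": "987-65-4321", "state": "NY"},
]
groups = {"dept_manager", "CA_resident"}  # a Department Manager in California
result = [
    {"ssn": ssn_mask(r["ssn"], groups), "state": r["state"]}
    for r in rows
    if state_filter(r["state"], groups)
]
print(result)  # [{'ssn': '***-**-6789', 'state': 'CA'}]
```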
Column-masking and row-filter SQL UDFs are applied at query time rather than when the data is written. This gives you the flexibility to change the UDFs and have the change consistently reflected across the dataset.
Users can check the Query Profile in DBSQL to see which row filters and column masks were applied. (For column masks, enable verbose mode in the Query Profile.)
By embedding column masking and row-level filtering directly into your Delta Live Tables pipelines, you're doing much more than just ticking boxes for compliance and security—you're building a dynamic, responsive data environment. These capabilities transform how you manage sensitive information, fostering a culture of trust among your stakeholders and customers by transparently demonstrating your commitment to data privacy. Moreover, by empowering your analytics teams with safe and precise data access, you're not only protecting privacy but actively accelerating innovation. Embrace these powerful tools today to confidently navigate regulatory complexities, minimize risks, and lead your organization toward a secure and data-driven future.
Additional Resources:
1. https://docs.databricks.com/aws/en/tables/row-and-column-filters
2. https://www.databricks.com/resources/demos/tutorials/governance/table-acl-and-dynamic-views-with-uc
3. https://docs.databricks.com/aws/en/sql/language-manual/functions/is_account_group_member