Data security is a cornerstone of any modern analytics or data processing environment, ensuring that sensitive information remains protected throughout its lifecycle. Organizations must comply with regulations like GDPR, HIPAA, and CCPA, which dictate how personal and sensitive data should be stored, accessed, and processed. Failure to meet these standards can result in legal penalties and reputational damage. At the same time, robust data security fosters privacy and trust by assuring customers and stakeholders that their information is safe from breaches.
Implementing fine-grained security controls, such as column masking and row-level filtering, enables secure data access without hindering legitimate analytical activity. With DLT, these controls can be applied declaratively at table creation time.
Column Masking is a data security technique designed to protect sensitive or personally identifiable information (PII) within specific database columns by transforming the data in a way that preserves its structure but obscures or anonymizes its contents. Unlike encryption, where data can be decrypted back to its original form, masking is typically a one-way transformation, ensuring individuals without the necessary privileges cannot view the real values.
To implement column masking in a DLT table, first define the masking logic as a SQL user-defined function (UDF):
USE stores.north_america;

CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
RETURN CASE
  WHEN is_account_group_member('admin') THEN ssn
  WHEN is_account_group_member('dept_manager') THEN CONCAT('***-**-', SUBSTRING(ssn, LENGTH(ssn) - 3, 4))
  ELSE '***-**-****'
END;
import dlt

@dlt.table(
    name="customerData",
    # The MASK clause attaches the SQL UDF to the ssn column.
    schema="""
      ssn STRING MASK stores.north_america.ssn_mask,
      state STRING,
      gender STRING,
      birthdate STRING,
      maiden_name STRING,
      last_name STRING,
      first_name STRING,
      address STRING,
      _rescued_data STRING
    """
)
def customer_data():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/stores/north_america/manynames")
    )
Depending on group membership, users will either see the full SSN, a partially masked version, or a fully obfuscated version.
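The three tiers can be sketched in plain Python to illustrate the CASE logic above (an illustration only; in Databricks the function runs as a SQL UDF and is_account_group_member is evaluated against the caller's actual group memberships):

```python
def ssn_mask(ssn: str, groups: set) -> str:
    """Simulate the three-tier SSN mask: full value for admins,
    last four digits for department managers, fully masked otherwise."""
    if "admin" in groups:
        return ssn
    if "dept_manager" in groups:
        return "***-**-" + ssn[-4:]
    return "***-**-****"

print(ssn_mask("123-45-6789", {"admin"}))         # 123-45-6789
print(ssn_mask("123-45-6789", {"dept_manager"}))  # ***-**-6789
print(ssn_mask("123-45-6789", set()))             # ***-**-****
```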
Example result of a query run by a user who is neither an admin nor a department manager:

SELECT * FROM stores.north_america.customerdata;
Row-Level Security (RLS) is a method of controlling access to rows within a table based on the user's identity or specific attributes, such as their department, region, role, or security clearance level. By enforcing RLS, you ensure that users only see the data rows they are explicitly authorized to view. This eliminates the need to maintain multiple separate datasets tailored to different user groups, greatly simplifying data governance and reducing storage overhead. RLS is particularly valuable in environments dealing with sensitive or regulated data, ensuring users access only what is necessary for their role.
1. Similar to column masking, implementing RLS starts with a user-defined function.
For the example above, let's add a row filter so that only members of the group for a particular state can see the applicable rows. You can create the UDF on a SQL Warehouse or in an interactive notebook.
CREATE OR REPLACE FUNCTION state_filter(state STRING)
RETURN CASE
  WHEN is_account_group_member('CA_resident') THEN state = 'CA'
  WHEN is_account_group_member('NY_resident') THEN state = 'NY'
END;
2. Apply the row filter in DLT:
import dlt

@dlt.table(
    name="customerData",
    schema="""
      ssn STRING MASK stores.north_america.ssn_mask,
      state STRING,
      gender STRING,
      birthdate STRING,
      maiden_name STRING,
      last_name STRING,
      first_name STRING,
      address STRING
    """,
    # Attach the row-filter UDF, binding the state column to its parameter.
    row_filter="ROW FILTER stores.north_america.state_filter ON (state)"
)
def customer_data():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/stores/north_america/manynames")
    )
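The filter predicate from step 1 can be sketched in plain Python (an illustration only, not how Spark executes it). Note that the SQL CASE above has no ELSE branch, so users in neither group get NULL, which filters the row out just as FALSE does:

```python
def state_filter(state: str, groups: set):
    """Simulate the row-filter UDF: True grants access to the row,
    False or None (the no-ELSE case) hides it."""
    if "CA_resident" in groups:
        return state == "CA"
    if "NY_resident" in groups:
        return state == "NY"
    return None  # no ELSE in the SQL CASE -> NULL -> row hidden

rows = [{"state": "CA"}, {"state": "NY"}, {"state": "TX"}]
visible = [r for r in rows if state_filter(r["state"], {"CA_resident"})]
print(visible)  # [{'state': 'CA'}]
```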
Column masking and row-level security can work together for layered data protection. For example, a Department Manager from California will see only rows for “CA” and partially masked SSNs.
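The layered behavior for that user can be sketched as follows (a plain-Python simulation of the two UDFs defined earlier, not how Spark evaluates them):

```python
def ssn_mask(ssn, groups):
    # Mirrors the ssn_mask SQL UDF's three tiers.
    if "admin" in groups:
        return ssn
    if "dept_manager" in groups:
        return "***-**-" + ssn[-4:]
    return "***-**-****"

def state_filter(state, groups):
    # Mirrors the state_filter SQL UDF; False hides the row.
    if "CA_resident" in groups:
        return state == "CA"
    if "NY_resident" in groups:
        return state == "NY"
    return False

rows = [
    {"ssn": "123-45-6789", "state": "CA"},
    {"ssn": "987-65-4321", "state": "NY"},
]
groups = {"dept_manager", "CA_resident"}  # a Department Manager in California
result = [
    {"ssn": ssn_mask(r["ssn"], groups), "state": r["state"]}
    for r in rows
    if state_filter(r["state"], groups)
]
print(result)  # [{'ssn': '***-**-6789', 'state': 'CA'}]
```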
Column-masking and row-filter SQL UDFs are applied at query time rather than when the data is written. This gives you the flexibility to change the UDFs and have the change consistently reflected across the dataset.
Users can check the Query Profile in DBSQL to see which row filters and column masks were applied. (For column masks, enable verbose mode in the Query Profile.)
By embedding column masking and row-level filtering directly into your Delta Live Tables pipelines, you're doing much more than just ticking boxes for compliance and security—you're building a dynamic, responsive data environment. These capabilities transform how you manage sensitive information, fostering a culture of trust among your stakeholders and customers by transparently demonstrating your commitment to data privacy. Moreover, by empowering your analytics teams with safe and precise data access, you're not only protecting privacy but actively accelerating innovation. Embrace these powerful tools today to confidently navigate regulatory complexities, minimize risks, and lead your organization toward a secure and data-driven future.
Additional Resources:
1. https://docs.databricks.com/aws/en/tables/row-and-column-filters
2. https://www.databricks.com/resources/demos/tutorials/governance/table-acl-and-dynamic-views-with-uc
3. https://docs.databricks.com/aws/en/sql/language-manual/functions/is_account_group_member