Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
RamGoli
Databricks Employee

Overview of Data Security:

Data security is a cornerstone of any modern analytics or data processing environment, ensuring that sensitive information remains protected throughout its lifecycle. Organizations must comply with regulations like GDPR, HIPAA, and CCPA, which dictate how personal and sensitive data should be stored, accessed, and processed. Failure to meet these standards can result in legal penalties and reputational damage. At the same time, robust data security fosters privacy and trust by assuring customers and stakeholders that their information is safe from breaches.

Implementing fine-grained security controls, like column masking and row-level filtering, enables secure data access without hindering legitimate analytical activity. With DLT, these controls can be applied declaratively at table creation time.

Column Masking (CM): 

RamGoli_0-1746106283806.png

Column Masking is a data security technique designed to protect sensitive or personally identifiable information (PII) within specific database columns by transforming the data in a way that preserves its structure but obscures or anonymizes its contents. Unlike encryption, where data can be decrypted back to its original form, masking is typically a one-way transformation, ensuring individuals without the necessary privileges cannot view the real values.

 

 

Implementing Column Masking in DLT:

To implement column masking in a DLT table:

  1. Create a SQL UDF: This function determines the visibility logic based on user group membership.
    Use is_account_group_member(): This built-in Databricks function checks if a user belongs to a specific group.
    USE stores.north_america;

    CREATE OR REPLACE FUNCTION ssn_mask(ssn STRING)
    RETURN CASE
           WHEN is_account_group_member('admin') THEN ssn
           WHEN is_account_group_member('dept_manager') THEN CONCAT('***-**-', SUBSTRING(ssn, LENGTH(ssn) - 3, 4))
           ELSE '***-**-****'
           END;
    
  2. Apply the mask in DLT:
    import dlt

    @dlt.table(
        name="customerData",
        schema="""
            ssn STRING MASK stores.north_america.ssn_mask,
            state STRING,
            gender STRING,
            birthdate STRING,
            maiden_name STRING,
            last_name STRING,
            first_name STRING,
            address STRING,
            _rescued_data STRING
        """
    )
    def customer_data():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("header", "true")
            .load("/Volumes/stores/north_america/manynames")
        )

Depending on group membership, users will either see the full SSN, a partially masked version, or a fully obfuscated version.
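The effect of the UDF can be sketched in plain Python. This is only a hypothetical mirror of the SQL logic for illustration; the real function is evaluated by Databricks at query time, and the function and group names simply follow the example above.

```python
def ssn_mask(ssn: str, groups: set) -> str:
    """Hypothetical Python mirror of the ssn_mask SQL UDF above."""
    if "admin" in groups:
        return ssn                      # admins see the full value
    if "dept_manager" in groups:
        return "***-**-" + ssn[-4:]     # managers see the last four digits
    return "***-**-****"                # everyone else sees a fully masked value

print(ssn_mask("123-45-6789", {"admin"}))         # 123-45-6789
print(ssn_mask("123-45-6789", {"dept_manager"}))  # ***-**-6789
print(ssn_mask("123-45-6789", set()))             # ***-**-****
```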

Example result of a query from a user who is neither an admin nor a department manager:

 

SELECT * FROM stores.north_america.customerdata;

 

RamGoli_1-1746106493401.png

 

Row-Level Security:

RamGoli_2-1746105883394.png

Row-Level Security (RLS) is a method of controlling access to rows within a table based on the user’s identity or specific attributes, such as their department, region, role, or security clearance level. By enforcing RLS, you ensure that users only see the data rows they are explicitly authorized to view. This eliminates the need to maintain multiple separate datasets tailored to different user groups, greatly simplifying data governance and reducing storage overhead. RLS is particularly valuable in environments dealing with sensitive or regulated data, ensuring users access only what is necessary for their role.

 

Implementing Row-Level Security:

1. Similar to column masking, implementing RLS starts with a SQL user-defined function.

For the above example, let's add a row filter so that only members of the group for a particular state can see the applicable rows. You can create the UDF in a SQL Warehouse or an interactive notebook.

 

CREATE OR REPLACE FUNCTION state_filter(state STRING)
RETURN CASE WHEN is_account_group_member('CA_resident') THEN state = 'CA'
            WHEN is_account_group_member('NY_resident') THEN state = 'NY'
       END;
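Note that the CASE expression has no ELSE branch, so it returns NULL for users in neither group, and their rows are filtered out. The predicate can be sketched in plain Python (a hypothetical illustration only; the group names follow the example above):

```python
def state_filter(state: str, groups: set):
    """Hypothetical Python mirror of the state_filter SQL UDF above.
    Returning None mimics the NULL that a CASE without ELSE produces."""
    if "CA_resident" in groups:
        return state == "CA"
    if "NY_resident" in groups:
        return state == "NY"
    return None  # NULL: the row is hidden

rows = ["CA", "NY", "TX"]
print([s for s in rows if state_filter(s, {"CA_resident"})])  # ['CA']
print([s for s in rows if state_filter(s, set())])            # []
```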

 

2. Apply the row filter in DLT:

 

import dlt

@dlt.table(
    name="customerData",
    schema="""
        ssn STRING MASK stores.north_america.ssn_mask,
        state STRING,
        gender STRING,
        birthdate STRING,
        maiden_name STRING,
        last_name STRING,
        first_name STRING,
        address STRING
    """,
    row_filter="ROW FILTER stores.north_america.state_filter ON (state)"
)
def customer_data():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("/Volumes/stores/north_america/manynames")
    )

 

Combining CM and RLS:

Column masking and row-level security can work together for layered data protection. For example, a Department Manager from California will see only rows for “CA” and partially masked SSNs.
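The evaluation order (row filter first, then column mask) can be sketched in plain Python. This is a hypothetical in-memory illustration, not how Databricks executes the policies, and it reuses the group names from the examples above.

```python
def apply_policies(rows, groups):
    """Hypothetical sketch: apply the row filter first, then mask columns."""
    def allowed(row):
        if "CA_resident" in groups:
            return row["state"] == "CA"
        if "NY_resident" in groups:
            return row["state"] == "NY"
        return False  # in neither group: row is hidden
    def mask(ssn):
        if "admin" in groups:
            return ssn
        if "dept_manager" in groups:
            return "***-**-" + ssn[-4:]
        return "***-**-****"
    return [{**r, "ssn": mask(r["ssn"])} for r in rows if allowed(r)]

data = [{"state": "CA", "ssn": "123-45-6789"},
        {"state": "NY", "ssn": "987-65-4321"}]
# A Department Manager from California sees only CA rows, partially masked:
print(apply_policies(data, {"CA_resident", "dept_manager"}))
# [{'state': 'CA', 'ssn': '***-**-6789'}]
```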

RamGoli_0-1746107190413.png

Query Profile when Using RLS/CM:

CM and RLS SQL UDFs are applied at query time rather than when the data is written. This gives you the flexibility to change the UDFs and have the change consistently reflected across the dataset.

Users can check the Query Profile in DBSQL to see which row filters and column masks were applied:

Example RLS:

RamGoli_4-1746105999190.png

 

Example CM:
(enable verbose mode to see the CM details)

RamGoli_5-1746106009535.png

Best Practices:

  1. Combine Column Masking and Row-Level Security with table-level permissions in Unity Catalog to create robust security and governance policies.
  2. Monitor access using system tables to detect misconfigurations.
  3. Update group memberships and UDF logic as organizational roles evolve.
  4. Test access with different user roles to validate expected behavior.

Additional Considerations:

  1. CM and RLS can be used together: RLS is applied first, followed by CM.
  2. CM/RLS must be defined at table creation for streaming tables and materialized views in DLT; ALTER TABLE is not yet supported for these features, though support is planned for a future release.
  3. Delta Sharing is not supported on tables with CM/RLS.
  4. If an RLS/CM SQL UDF is referenced in a pipeline and a user drops the function, the pipeline will fail with an [UNRESOLVED_ROUTINE] error pointing to the deleted function.

By embedding column masking and row-level filtering directly into your Delta Live Tables pipelines, you're doing much more than just ticking boxes for compliance and security—you're building a dynamic, responsive data environment. These capabilities transform how you manage sensitive information, fostering a culture of trust among your stakeholders and customers by transparently demonstrating your commitment to data privacy. Moreover, by empowering your analytics teams with safe and precise data access, you're not only protecting privacy but actively accelerating innovation. Embrace these powerful tools today to confidently navigate regulatory complexities, minimize risks, and lead your organization toward a secure and data-driven future.

 

Additional Resources:
1. https://docs.databricks.com/aws/en/tables/row-and-column-filters
2. https://www.databricks.com/resources/demos/tutorials/governance/table-acl-and-dynamic-views-with-uc
3. https://docs.databricks.com/aws/en/sql/language-manual/functions/is_account_group_member

 
