cancel
Showing results for 
Search instead for 
Did you mean: 
Get Started Discussions
Start your journey with Databricks by joining discussions on getting started guides, tutorials, and introductory topics. Connect with beginners and experts alike to kickstart your Databricks experience.
cancel
Showing results for 
Search instead for 
Did you mean: 

Data Masking Techniques and Issues with Creating Tables

weilin0323
New Contributor III

Hello Databricks Team,

I understand that the mask function can be used to mask columns, but I have a few questions:

  1. When users with access use a masked TABLE to create a downstream TABLE, the downstream TABLE does not inherit the mask function directly, so the data remains unmasked. In this case, do we need to apply the mask function again to the columns that require masking when creating the downstream TABLE?

  2. If users without access to the original data use a masked TABLE to create a downstream TABLE, the masked columns in the downstream TABLE will display the masked values, but it seems that this masking is hard-coded rather than applied through the mask function. How can this be addressed during querying? For example, if a user wants to query data for id=1234, but the id appears as '1**4' after masking, will it be impossible to use id=1234 as a query condition?

We would greatly appreciate any guidance on how to resolve these issues. Thank you for your assistance.

1 ACCEPTED SOLUTION

Accepted Solutions

Hi @weilin0323,

Good morning.

To avoid multiple results with masking, try using a hash function like sha2 in your function instead of partial masking. This ensures each value is unique, even when masked. When joining tables, apply the same hash to both columns so you can match them without revealing sensitive data. For filters, using hashing or encryption will give you more accurate results compared to partial matching. Also, consider using views with dynamic masking to consistently apply masks across queries. This approach will help maintain security and integrity in your data.

Regrads,

Brahma

View solution in original post

4 REPLIES 4

Brahmareddy
Honored Contributor

Hi @weilin0323, How are you doing today?

As per my understanding, When creating a downstream table from a masked table, you’ll need to reapply the mask function to the necessary columns in the new table if you want to maintain the same level of data protection. For users without access to the original data, the masked values in the downstream table are indeed hard-coded. This means if they try to query using an original value like id=1234;, it won't work because the data is already masked. To address this, consider applying the mask function during the query process, or provide a separate mechanism for those users to query based on the masked values.

Give a try and let me know.

Regards,

Brahma

Thank you for your response.

I tried to create a function that can be used in a query:

 

CREATE OR REPLACE FUNCTION idno_filter_test(idno STRING, filter_value STRING)
RETURNS BOOLEAN
RETURN IF(current_user() = "admin@test.com", idno = filter_value,
    concat(left(idno, 2), "**", right(idno, 2)) = concat(left(filter_value, 2), "**", right(filter_value, 2)) )

 

However, this may result in errors in the query. For example, with the following condition:

 

SELECT * 
FROM working.idno_masked_downstream
WHERE idno_filter_test(memidno, "123456")

 

Both "123456" and "124356" match the condition id=12**56, so it may return two results.

Additionally, if I want to join two tables using idno as the key, how should I configure the masked table to ensure it can be matched with other tables based on the key value?

 

SELECT aa.order_number, aa.IDNO, bb.MEMIDNO
FROM processed.order_integrate AS aa
INNER JOIN working.idno_masked_downstream AS bb
ON bb.MEMIDNO = aa.IDNO

 

I would greatly appreciate any guidance on how to resolve these issues. Thank you for your assistance.

Hi @weilin0323,

Good morning.

To avoid multiple results with masking, try using a hash function like sha2 in your function instead of partial masking. This ensures each value is unique, even when masked. When joining tables, apply the same hash to both columns so you can match them without revealing sensitive data. For filters, using hashing or encryption will give you more accurate results compared to partial matching. Also, consider using views with dynamic masking to consistently apply masks across queries. This approach will help maintain security and integrity in your data.

Regrads,

Brahma

Hi @Brahmareddy ,

Thank you so much. I'll try it.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.

If there isn’t a group near you, start one and help create a community that brings people together.

Request a New Group