
Please advise on an algorithm for masking a column in Databricks that is compatible with SQL Server, meaning the masked values can be unmasked there.

Istuti
Contributor
 
1 REPLY

Anonymous
Not applicable

@Istuti Gupta:

There are several algorithms you can use to mask a column in Databricks in a way that is compatible with SQL Server. One commonly used algorithm is called pseudonymization or tokenization.

Here's an example of how you can implement pseudonymization/tokenization in Databricks:

import hashlib
import random

from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import StringType
 
# Define the function to pseudonymize/tokenize a value
def pseudonymize(value, salt):
    # Derive a deterministic seed from the salt so the shuffle is repeatable
    seed = int.from_bytes(hashlib.sha256(salt.encode()).digest()[:4], byteorder='big')
    
    # Keep NULLs as NULLs instead of hashing the string "None"
    if value is None:
        return None

    # Convert the value to a string if it's not already a string
    if not isinstance(value, str):
        value = str(value)
        
    # Convert the value to lowercase and remove any leading/trailing whitespace
    value = value.lower().strip()
    
    # Shuffle the characters in the value using the random seed
    random.seed(seed)
    chars = list(value)
    random.shuffle(chars)
    shuffled_value = ''.join(chars)
    
    # Hash the shuffled value using SHA256
    hashed_value = hashlib.sha256(shuffled_value.encode()).hexdigest()
    
    # Return the hashed value as a pseudonym/token
    return hashed_value
 
# Define the salt to use for pseudonymization/tokenization
salt = "my_secret_salt"
 
# Define the dataframe and column to mask
df = spark.table("my_table")
column_to_mask = "my_column"
 
# Create a new column to hold the pseudonym/token
df = df.withColumn(column_to_mask + "_pseudonym", udf(pseudonymize, StringType())(col(column_to_mask), lit(salt)))
 
# Save the masked data to a new table or overwrite the original table
df.write.mode("overwrite").saveAsTable("my_table_masked")

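In this example, each value in the specified column is converted to a string, lowercased, and trimmed. Its characters are then shuffled by a random generator seeded from the salt, and the shuffled string is hashed with SHA-256. Because the seed is derived from the salt rather than drawn at random, the same input value with the same salt always maps to the same pseudonym/token. Note that values differing only in case or surrounding whitespace also map to the same token.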
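
The repeatability of the shuffle step can be checked outside Spark with a minimal sketch. It assumes plain Python with no Spark session. `seeded_shuffle` is a hypothetical helper name that isolates the shuffle from the example above, and it uses a local `random.Random` instance instead of the module-level generator so that no global state is touched.

```python
import hashlib
import random


def seeded_shuffle(value, salt):
    """Shuffle the characters of value with a seed derived from salt.

    Hypothetical helper for illustration only; not part of any library.
    """
    # Same seed derivation as the example: first 4 bytes of SHA-256(salt)
    seed = int.from_bytes(hashlib.sha256(salt.encode()).digest()[:4], byteorder="big")
    rng = random.Random(seed)  # local generator seeded from the salt
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


first = seeded_shuffle("databricks", "my_secret_salt")
second = seeded_shuffle("databricks", "my_secret_salt")
print(first == second)  # the two calls use the same seed
```

Since the shuffled string is identical on every call, the SHA-256 hash computed from it is identical too, which is what makes the token stable across runs.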

Note that the resulting pseudonym/token is irreversible, meaning that it cannot be used to recover the original value, in SQL Server or anywhere else. To unmask the data, you would need to keep a separate, access-controlled mapping of the pseudonyms/tokens to their original values.
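
A separate mapping of that kind could look roughly like the sketch below. It is only an illustration under stated assumptions: `make_token`, `build_mapping`, and `unmask` are hypothetical names, the token is a keyed HMAC-SHA256 used as a simple stand-in for the function above, and the mapping is an in-memory dict. In practice it would more likely be a secured lookup table that both Databricks and SQL Server can read.

```python
import hashlib
import hmac


def make_token(value, salt):
    """Hypothetical stand-in tokenizer: HMAC-SHA256 of the normalized value."""
    normalized = str(value).lower().strip()
    return hmac.new(salt.encode(), normalized.encode(), hashlib.sha256).hexdigest()


def build_mapping(values, salt):
    """Return a dict of token -> original value (assumed to be stored securely)."""
    mapping = {}
    for value in values:
        # setdefault keeps the first original seen for a given token
        mapping.setdefault(make_token(value, salt), value)
    return mapping


def unmask(token, mapping):
    """Look up the original value; returns None for an unknown token."""
    return mapping.get(token)


salt = "my_secret_salt"
mapping = build_mapping(["alice@example.com", "bob@example.com"], salt)
token = make_token("alice@example.com", salt)
print(unmask(token, mapping))
```

Unmasking is then a lookup (or a join on the token column) rather than a computation, so it works in any system that can read the mapping. The mapping itself holds the original values and needs the same protection as the unmasked data.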
