@Istuti Gupta:
There are several techniques you can use to mask a column in Databricks in a way that is compatible with SQL Server. One commonly used approach is pseudonymization, also called tokenization.
Here's an example of how you can implement pseudonymization/tokenization in Databricks:
import hashlib
import random
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import StringType

# Define the function to pseudonymize/tokenize a value
def pseudonymize(value, salt):
    # Derive a deterministic seed from the salt for shuffling the characters
    seed = int.from_bytes(hashlib.sha256(salt.encode()).digest()[:4], byteorder='big')
    # Convert the value to a string if it's not already one
    if not isinstance(value, str):
        value = str(value)
    # Normalize: lowercase and strip leading/trailing whitespace
    value = value.lower().strip()
    # Shuffle the characters using the salt-derived seed
    random.seed(seed)
    chars = list(value)
    random.shuffle(chars)
    shuffled_value = ''.join(chars)
    # Hash the shuffled value using SHA-256
    hashed_value = hashlib.sha256(shuffled_value.encode()).hexdigest()
    # Return the hashed value as a pseudonym/token
    return hashed_value

# Register the function as a Spark UDF
pseudonymize_udf = udf(pseudonymize, StringType())

# Define the salt to use for pseudonymization/tokenization
salt = "my_secret_salt"

# Define the dataframe and column to mask
df = spark.table("my_table")
column_to_mask = "my_column"

# Create a new column to hold the pseudonym/token
df = df.withColumn(column_to_mask + "_pseudonym",
                   pseudonymize_udf(col(column_to_mask), lit(salt)))

# Save the masked data to a new table or overwrite the original table
df.write.mode("overwrite").saveAsTable("my_table_masked")
In this example, we use the SHA-256 hashing algorithm to pseudonymize/tokenize the values in the specified column. Before hashing, the characters of each value are shuffled using a seed derived deterministically from the salt, so the same input value combined with the same salt always maps to the same pseudonym/token, while a different salt produces different tokens.
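You can verify this determinism with plain Python, no Spark session required. The sketch below repeats the pseudonymize function so it runs standalone (the table and column names play no role here):

```python
import hashlib
import random

def pseudonymize(value, salt):
    # Seed derived deterministically from the salt
    seed = int.from_bytes(hashlib.sha256(salt.encode()).digest()[:4], byteorder='big')
    if not isinstance(value, str):
        value = str(value)
    # Normalize before shuffling and hashing
    value = value.lower().strip()
    random.seed(seed)
    chars = list(value)
    random.shuffle(chars)
    return hashlib.sha256(''.join(chars).encode()).hexdigest()

# Same input and salt -> same token; normalization makes "Alice" and " alice " equivalent
assert pseudonymize("Alice", "my_secret_salt") == pseudonymize(" alice ", "my_secret_salt")
# Different input values -> different tokens
assert pseudonymize("Alice", "my_secret_salt") != pseudonymize("Bob", "my_secret_salt")
```

Because random.seed is called on every invocation, the shuffle is repeatable, which is what makes the output usable as a join key across masked tables.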
Note that the resulting pseudonym/token is irreversible, meaning that it cannot be used to recover the original value. To unmask the data, you would need to keep a separate mapping of the original values to their corresponding pseudonyms/tokens.
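One way to keep such a mapping is a token vault. The sketch below uses a hypothetical in-memory dictionary and a simplified salted hash in place of the shuffle step; in practice the vault would be a secured, access-controlled table:

```python
import hashlib

# Hypothetical in-memory vault; in practice, a secured lookup table
token_vault = {}

def tokenize(value, salt):
    # Simplified salted SHA-256 token (stands in for the shuffle-then-hash above)
    token = hashlib.sha256((salt + str(value).lower().strip()).encode()).hexdigest()
    # Record the mapping so the original value can be recovered later
    token_vault[token] = value
    return token

def detokenize(token):
    # Look the token up in the vault; raises KeyError for unknown tokens
    return token_vault[token]

salt = "my_secret_salt"
token = tokenize("Alice", salt)
assert detokenize(token) == "Alice"
```

The hash itself stays one-way; reversibility comes entirely from the stored mapping, so protecting the vault is as important as protecting the salt.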