@Istuti Gupta:
There are several techniques you can use to mask a column in Databricks in a way that is compatible with SQL Server. One commonly used approach is pseudonymization, also called tokenization.
Here's an example of how you can implement pseudonymization/tokenization in Databricks:
import hashlib
import random
from pyspark.sql.functions import col, lit, udf
from pyspark.sql.types import StringType

# Define the function to pseudonymize/tokenize a value
def pseudonymize(value, salt):
    # Derive a deterministic seed from the salt for shuffling the characters
    seed = int.from_bytes(hashlib.sha256(salt.encode()).digest()[:4], byteorder='big')
    # Convert the value to a string if it's not already one
    if not isinstance(value, str):
        value = str(value)
    # Normalize: lowercase and strip leading/trailing whitespace
    value = value.lower().strip()
    # Shuffle the characters using the salt-derived seed
    random.seed(seed)
    chars = list(value)
    random.shuffle(chars)
    shuffled_value = ''.join(chars)
    # Hash the shuffled value using SHA-256
    hashed_value = hashlib.sha256(shuffled_value.encode()).hexdigest()
    # Return the hashed value as a pseudonym/token
    return hashed_value

# Register the function as a Spark UDF
pseudonymize_udf = udf(pseudonymize, StringType())

# Define the salt to use for pseudonymization/tokenization
salt = "my_secret_salt"

# Define the dataframe and column to mask
df = spark.table("my_table")
column_to_mask = "my_column"

# Create a new column to hold the pseudonym/token
df = df.withColumn(column_to_mask + "_pseudonym",
                   pseudonymize_udf(col(column_to_mask), lit(salt)))

# Save the masked data to a new table or overwrite the original table
df.write.mode("overwrite").saveAsTable("my_table_masked")
In this example, we use the SHA-256 hashing algorithm to pseudonymize/tokenize the values in the specified column. Before hashing, the characters of each value are shuffled using a seed derived deterministically from the salt, so the same input value combined with the same salt always maps to the same pseudonym/token, while a different salt produces different tokens.
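You can verify this determinism with plain Python, no Spark session required. The sketch below repeats the pseudonymize function so it runs standalone (the table and column names play no role here):

```python
import hashlib
import random

def pseudonymize(value, salt):
    # Seed derived deterministically from the salt
    seed = int.from_bytes(hashlib.sha256(salt.encode()).digest()[:4], byteorder='big')
    if not isinstance(value, str):
        value = str(value)
    # Normalize before shuffling and hashing
    value = value.lower().strip()
    random.seed(seed)
    chars = list(value)
    random.shuffle(chars)
    return hashlib.sha256(''.join(chars).encode()).hexdigest()

# Same input and salt -> same token; normalization makes "Alice" and " alice " equivalent
assert pseudonymize("Alice", "my_secret_salt") == pseudonymize(" alice ", "my_secret_salt")
# Different input values -> different tokens
assert pseudonymize("Alice", "my_secret_salt") != pseudonymize("Bob", "my_secret_salt")
```

Because random.seed is called on every invocation, the shuffle is repeatable, which is what makes the output usable as a join key across masked tables.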
Note that the resulting pseudonym/token is irreversible, meaning that it cannot be used to recover the original value. To unmask the data, you would need to keep a separate mapping of the original values to their corresponding pseudonyms/tokens.
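One way to keep such a mapping is a token vault. The sketch below uses a hypothetical in-memory dictionary and a simplified salted hash in place of the shuffle step; in practice the vault would be a secured, access-controlled table:

```python
import hashlib

# Hypothetical in-memory vault; in practice, a secured lookup table
token_vault = {}

def tokenize(value, salt):
    # Simplified salted SHA-256 token (stands in for the shuffle-then-hash above)
    token = hashlib.sha256((salt + str(value).lower().strip()).encode()).hexdigest()
    # Record the mapping so the original value can be recovered later
    token_vault[token] = value
    return token

def detokenize(token):
    # Look the token up in the vault; raises KeyError for unknown tokens
    return token_vault[token]

salt = "my_secret_salt"
token = tokenize("Alice", salt)
assert detokenize(token) == "Alice"
```

The hash itself stays one-way; reversibility comes entirely from the stored mapping, so protecting the vault is as important as protecting the salt.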