cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Please guide on the algorithm for masking of column in databricks which is compatible (can be unmasked) with sqlserver.

Istuti
Contributor
 
1 REPLY 1

Anonymous
Not applicable

@Istuti Guptaโ€‹ :

There are several algorithms you can use to mask a column in Databricks in a way that is compatible with SQL Server. One commonly used algorithm is called pseudonymization or tokenization.

Here's an example of how you can implement pseudonymization/tokenization in Databricks:

import hashlib
import random
 
# Define the function to pseudonymize/tokenize a value
def pseudonymize(value, salt):
    # Generate a random seed for shuffling the characters
    seed = int.from_bytes(hashlib.sha256(salt.encode()).digest()[:4], byteorder='big')
    
    # Convert the value to a string if it's not already a string
    if not isinstance(value, str):
        value = str(value)
        
    # Convert the value to lowercase and remove any leading/trailing whitespace
    value = value.lower().strip()
    
    # Shuffle the characters in the value using the random seed
    random.seed(seed)
    chars = list(value)
    random.shuffle(chars)
    shuffled_value = ''.join(chars)
    
    # Hash the shuffled value using SHA256
    hashed_value = hashlib.sha256(shuffled_value.encode()).hexdigest()
    
    # Return the hashed value as a pseudonym/token
    return hashed_value
 
# Define the salt to use for pseudonymization/tokenization
salt = "my_secret_salt"
 
# Define the dataframe and column to mask
df = spark.table("my_table")
column_to_mask = "my_column"
 
# Create a new column to hold the pseudonym/token
df = df.withColumn(column_to_mask + "_pseudonym", udf(pseudonymize, StringType())(col(column_to_mask), lit(salt)))
 
# Save the masked data to a new table or overwrite the original table
df.write.mode("overwrite").saveAsTable("my_table_masked")

In this example, we use the SHA256 hashing algorithm to pseudonymize/tokenize the values in the specified column. We also add a random seed to shuffle the characters in the value before hashing it. The salt value is used to ensure that the same input value always gets mapped to the same pseudonym/token.

Note that the resulting pseudonym/token is irreversible, meaning that it cannot be used to recover the original value. To unmask the data, you would need to keep a separate mapping of the original values to their corresponding pseudonyms/tokens.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group