cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

Delta live table generate unique integer value (kind of surrogate key) for combination of columns

NathanSundarara
Contributor

Hi,

we are in process of moving our Datawarehouse from sql server to databricks. we are in process of testing our Dimension Product table which has identity column for referencing in fact table as surrogate key.

In Databricks Apply changes SCD type 2 we were recommended to generate since apply changes don't allow identity column. I tried hash() function using combination of two columns using concat_ws() and passing result of concat_ws to hash function. It generated big int but it also generates negative values.

Functions like MD5 , SHA1, SHA2 all generates string function. There is xxhash64 which generates BIG INT not sure if it also generates negative numbers.

My questions is does any one know kind of hash function which will generate same hash like numeric number (positive) ones? If not we might end up using SHA1, or SHA2 function. I asked in Databricks American office hours the guy recommended posting in Community forum so trying here.

7 REPLIES 7

NathanSundarara
Contributor

XXhash_64 also generating negative values, I'm not sure if I will get into issue if I convert to abs and for some combination of my key it will actually generate the positive number which I might then have duplicate

Anonymous
Not applicable

@Nathan Sundararajan​ :

When working with hash functions in Databricks or any other system, it's important to understand that hash functions are not specifically designed to generate positive or negative numbers. The output of a hash function is generally a binary string or a numeric representation, such as an integer or a hexadecimal value.

If you require a positive numeric representation, one option is to use the xxhash64 function in Databricks. xxhash64 is a 64-bit hash function that can generate positive integer values. However, it's worth noting that hash functions are not guaranteed to produce unique values for every input, so collisions (i.e., multiple inputs producing the same hash) can occur.

If uniqueness is a critical requirement for your surrogate keys, you might want to consider using a different approach, such as generating surrogate keys using a sequence or a UUID (Universally Unique Identifier). These approaches provide unique values for each row without relying on hash functions.

Alternatively, you mentioned using SCD Type 2 for your dimension table. In such cases, you typically assign a new surrogate key whenever a change occurs in the dimension. This can be achieved by using a combination of the natural/business key and an incremental sequence or timestamp to ensure uniqueness and order of the surrogate keys.

xxhash64 generates negative numbers as well.

Anonymous
Not applicable

@Nathan Sundararajan​ : Okay, I see, you are right about it. Since the XXhash64 function in Databricks can generate both positive and negative values. To ensure that the generated hash values are positive, you can apply the bitwise AND operation with a bitmask. Here's an example:

from pyspark.sql.functions import xxhash64
 
df = df.withColumn('surrogate_key', xxhash64(concat_ws('-', df.col1, df.col2)) & 0x7FFFFFFFFFFFFFFF)

Please let me know if this helps.

Anonymous
Not applicable

Hi @Nathan Sundararajan​ 

Thank you for posting your question in our community! We are happy to assist you.

To help us provide you with the most accurate information, could you please take a moment to review the responses and select the one that best answers your question?

This will also help other community members who may have similar questions in the future. Thank you for your participation and let us know if you need any further assistance! 

The response was wrong. The answer was not correct. XXhash64 generates negative keys.

ilarsen
Contributor

Hey.  Yep, xxhash64 (or even just hash) generate numerical values for you.  Combine with abs function to ensure the value is positive.  In our team we used abs(hash()) ourselves... for maybe a day.  Very quickly I observed a collision, and the data set was not at all large.  Of course what columns you include matter, but a large factor would have been enforcing a positive value.

 

We dropped the abs and went with xxhash64.  It is probably big enough, but use case matters.  Now we're considering the seemingly tried-and-true SHA2 256 instead.

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.