Anonymous

@Nathan Sundararajan:

When working with hash functions in Databricks or any other system, keep in mind that they are not designed to produce specifically positive or negative numbers. A hash function outputs a fixed-width bit pattern; whether that pattern shows up as a positive number, a negative number, or a hexadecimal string depends only on how the bits are interpreted (signed integer, unsigned integer, or text).

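If you require a positive numeric representation, note that the xxhash64 function in Databricks does not give you one by itself. xxhash64 is a 64-bit hash function whose result is returned as a signed 64-bit integer (BIGINT), so its output can be negative as well as positive. To get a non-negative value you have to transform the result, for example by clearing the sign bit or taking a positive modulus, which leaves 63 usable bits. It's also worth noting that hash functions are not guaranteed to produce unique values for every input, so collisions (i.e., multiple inputs producing the same hash) can occur.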
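To illustrate the sign issue, here is a minimal Python sketch. It does not call xxhash64; as an assumption for the sake of a self-contained example, it uses `hashlib.blake2b` truncated to 8 bytes as a stand-in 64-bit hash, and the helper names `signed_hash64` and `to_non_negative` are made up for this post. The point is only the bit handling: the same 64 bits read as a signed integer may be negative, and masking off the sign bit yields a value in the range 0 to 2^63 - 1.

```python
import hashlib

# Keeps the low 63 bits and clears the sign bit.
MASK_63 = 0x7FFF_FFFF_FFFF_FFFF


def signed_hash64(value: str) -> int:
    """Stand-in 64-bit hash read as a signed integer (NOT xxhash64 itself)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)


def to_non_negative(h: int) -> int:
    """Map a signed 64-bit value into the range [0, 2**63 - 1]."""
    return h & MASK_63


# Hypothetical business keys, just to show the raw and masked values.
for key in ["customer-1", "customer-2", "customer-3"]:
    h = signed_hash64(key)
    print(key, h, to_non_negative(h))
```

Masking throws away one bit, so two hashes that differ only in the sign bit become the same key. That slightly raises the collision risk compared with using the full 64 bits.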

If uniqueness is a critical requirement for your surrogate keys, consider a different approach, such as generating them from a sequence (for example, an identity column on a Delta table) or a UUID (Universally Unique Identifier). A sequence guarantees a distinct value for each row, and a UUID makes duplicates extremely unlikely, without hashing the row contents. Keep in mind that a UUID is normally stored as a string rather than a number.
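To judge whether collisions matter at your table size, the standard birthday-bound approximation can help. The sketch below is a rough estimate that assumes the hash behaves like a uniformly random function; the function name `collision_probability` is made up for this example.

```python
import math


def collision_probability(n_rows: int, bits: int = 64) -> float:
    """Approximate chance that at least two of n_rows keys share a hash.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming hash outputs are uniformly distributed.
    """
    exponent = n_rows * (n_rows - 1) / (2 * 2**bits)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-exponent)


for n in (1_000_000, 100_000_000, 1_000_000_000):
    print(f"{n:>13,} rows: 64-bit p={collision_probability(n, 64):.3e}, "
          f"63-bit p={collision_probability(n, 63):.3e}")
```

Under these assumptions, a billion distinct keys hashed to 64 bits have a collision chance on the order of a few percent, and dropping to 63 bits roughly doubles it. Whether that is acceptable depends on how the keys are used.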

You also mentioned using SCD Type 2 for your dimension table. In that case, you typically assign a new surrogate key every time a tracked attribute of a dimension member changes, so a single natural/business key maps to several surrogate keys over time. You can achieve this by pairing the natural/business key with an incremental sequence or an effective timestamp, which keeps the surrogate keys unique and ordered.
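As a rough illustration of that SCD Type 2 pattern, here is a small in-memory Python sketch. It is not Databricks code: the class and field names (`Scd2Dimension`, `DimRow`, `valid_from`, `valid_to`, `is_current`) are hypothetical, and the counter merely stands in for whatever sequence your platform provides.

```python
from dataclasses import dataclass
from datetime import date
from itertools import count
from typing import Optional


@dataclass
class DimRow:
    surrogate_key: int
    business_key: str
    attributes: dict
    valid_from: date
    valid_to: Optional[date] = None
    is_current: bool = True


class Scd2Dimension:
    """Toy SCD Type 2 dimension: every change gets a new surrogate key."""

    def __init__(self):
        self.rows = []
        self._next_key = count(start=1)  # stand-in for a sequence

    def current(self, business_key):
        """Return the open (current) row for a business key, if any."""
        for row in self.rows:
            if row.business_key == business_key and row.is_current:
                return row
        return None

    def upsert(self, business_key, attributes, as_of):
        existing = self.current(business_key)
        if existing is not None:
            if existing.attributes == attributes:
                return existing  # nothing changed, keep the current version
            # Close the old version before opening the new one.
            existing.valid_to = as_of
            existing.is_current = False
        new_row = DimRow(next(self._next_key), business_key,
                         dict(attributes), as_of)
        self.rows.append(new_row)
        return new_row


dim = Scd2Dimension()
dim.upsert("C-100", {"city": "Austin"}, date(2023, 1, 1))
dim.upsert("C-100", {"city": "Denver"}, date(2023, 6, 1))  # change: new key
dim.upsert("C-200", {"city": "Boston"}, date(2023, 6, 1))
for row in dim.rows:
    print(row)
```

Because the key comes from a counter rather than from hashing the row, it is always positive and cannot collide, and the history of each business key stays intact through the `valid_from` / `valid_to` range.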