Hashing Functions in PySpark

Michael_Appiah
New Contributor III

Hashes are commonly used in SCD2 merges to determine whether data has changed, by comparing the hashes of incoming source rows with the hashes of the existing rows in the target table. PySpark offers several different hashing functions:

  • MD5 (pyspark.sql.functions.md5)
  • SHA1 (pyspark.sql.functions.sha1)
  • SHA2 (pyspark.sql.functions.sha2)
  • xxHASH64 (pyspark.sql.functions.xxhash64)
  • 32-bit hash (pyspark.sql.functions.hash)
  • CRC32 (pyspark.sql.functions.crc32)

Which of these is best suited for comparing source and target table rows in an SCD2-type merge, in terms of robustness, performance, and collision likelihood?
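For reference, here is a minimal sketch of how each of these functions is invoked (the DataFrame and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data; replace with your own source DataFrame
df = spark.createDataFrame([(1, "Alice", "2024-01-01")], ["id", "name", "updated_at"])

df.select(
    F.md5(F.col("name")).alias("md5"),                         # 128-bit digest as hex string
    F.sha1(F.col("name")).alias("sha1"),                       # 160-bit digest as hex string
    F.sha2(F.col("name"), 256).alias("sha2_256"),              # hex string; 224/256/384/512-bit variants
    F.xxhash64("id", "name", "updated_at").alias("xxhash64"),  # 64-bit hash as LongType
    F.hash("id", "name", "updated_at").alias("hash32"),        # 32-bit hash as IntegerType
    F.crc32(F.col("name")).alias("crc32"),                     # 32-bit checksum as LongType
).show(truncate=False)
```

Note that md5, sha1, sha2, and crc32 each take a single column (so multiple columns typically need to be concatenated first, e.g. with concat_ws), whereas hash and xxhash64 accept multiple columns directly.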

 

3 REPLIES

Kaniz
Community Manager

Hi @Michael_Appiah,

For comparing source and target table rows in an SCD2-type merge, any of the PySpark hashing functions listed in the question can be used, depending on the specific requirements of the use case.

However, the choice of a hashing algorithm can have an impact on robustness, performance, and collision likelihood when comparing large datasets.

Of the hashing functions listed, SHA2 is considered the most secure and robust: it supports several digest lengths (224, 256, 384, or 512 bits) and its collision resistance is well studied. However, SHA2 is generally slower than the non-cryptographic alternatives and may not be well suited to use cases that require high performance.

MD5 and SHA1 are also widely used and would be adequate for many use cases. However, both have known vulnerabilities to collision attacks and are not recommended for cryptographic purposes.

xxHASH64 is a very fast hashing algorithm with good collision resistance for a 64-bit hash, but it is not a cryptographic hash and should not be used for security purposes.

The 32-bit hash (a Murmur3 hash) is likewise fast, but with only 32 bits of output its collision resistance is considerably weaker than that of the other algorithms. It may suit use cases that require high performance and can tolerate some collisions.

Crc32 is used primarily for error detection and is not suitable for use as a secure hashing function.

Ultimately, the choice of the hashing algorithm should be based on the specific requirements of the use case, including the size and complexity of the datasets, performance constraints, and the level of collision resistance that is needed.
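As an illustration of how such a comparison might look in practice, here is a minimal sketch (the table, key, and column names are all hypothetical) that flags changed rows by comparing xxhash64 row hashes:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source and target tables; column names are illustrative
source = spark.createDataFrame(
    [(1, "Alice", "Berlin"), (2, "Bob", "Paris")],
    ["business_key", "name", "address"],
)
target = spark.createDataFrame(
    [(1, "Alice", "Hamburg"), (2, "Bob", "Paris")],
    ["business_key", "name", "address"],
)

tracked_cols = ["name", "address"]  # columns whose changes trigger a new SCD2 version

# Compute a 64-bit row hash over the tracked columns on both sides
source = source.withColumn("row_hash", F.xxhash64(*tracked_cols))
target = target.withColumn("row_hash", F.xxhash64(*tracked_cols))

# A matching key with a differing hash indicates a changed row
changed = (
    source.alias("s")
    .join(target.alias("t"), "business_key")
    .where(F.col("s.row_hash") != F.col("t.row_hash"))
    .select("business_key", F.col("s.name"), F.col("s.address"))
)
changed.show()
```

In a real SCD2 pipeline the changed rows would then feed the merge that expires the old versions and inserts the new ones.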

Michael_Appiah
New Contributor III

Hi @Kaniz,

Thank you for your comprehensive answer. What is your opinion on the trade-off between a hash like xxHASH64, which returns a LongType column and therefore offers good performance when joining on the hash column, versus a more robust/secure algorithm like SHA2, which returns a StringType column and is therefore slower to join on?

Kaniz
Community Manager

(Accepted Solution)

Hi @Michael_Appiah,

When it comes to choosing between xxHASH64 and SHA2, it is important to consider the specific use case and requirements. xxHASH64 is a non-cryptographic hash function known for its speed and low memory usage. It is ideal when the hash is used for performance optimization rather than for security. SHA2, on the other hand, is a cryptographic hash function designed to be secure and collision-resistant, and it is the better choice when the hash is used for security purposes.

 

Regarding the column type: it is true that xxHASH64 returns a LongType column while SHA2 returns a StringType column. However, this should not be the only factor when choosing between the two. If the hash column is used for joining tables, the column type must be consistent across the tables being joined: if one table uses xxHASH64, the other should use xxHASH64 as well, and likewise for SHA2. This ensures the join can be performed efficiently and without errors.
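As a small illustration of the type difference (the column name is made up), printing the schema produced by each function:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("some row value",)], ["payload"])

df.select(
    F.xxhash64("payload").alias("xxhash64_key"),      # 8-byte LongType
    F.sha2(F.col("payload"), 256).alias("sha2_key"),  # 64-character hex StringType
).printSchema()
```

A LongType key is a fixed 8 bytes, while a SHA-256 hex string is 64 characters, which is part of why joins on the long column tend to be cheaper.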

 

In summary, xxHASH64 is a good choice when the hash is used for performance optimization rather than for security, and SHA2 is the better choice when security matters. Whichever you choose, weigh the specific use case and requirements, and keep the hash column type consistent across the tables being joined.

 

I hope this helps! Let me know if you have any more questions.