Hashing Functions in PySpark

Michael_Appiah · 10-14-2023

Hashes are commonly used in SCD2 merges to determine whether data has changed: the hash of each new row in the source is compared with the hash of the corresponding existing row in the target table. PySpark offers several hashing functions, including:

MD5 (pyspark.sql.functions.md5)
SHA1 (pyspark.sql.functions.sha1)
SHA2 (pyspark.sql.functions.sha2)
xxHash64 (pyspark.sql.functions.xxhash64)
32-bit hash (pyspark.sql.functions.hash)
CRC32 (pyspark.sql.functions.crc32)

Which of these is best suited for comparing source and target rows in an SCD2-type merge, in terms of robustness, performance, and collision likelihood?

Michael_Appiah · 10-17-2023

Hi @Retired_mod,

thank you for your comprehensive answer. What is your opinion on the trade-off between the two kinds of hash? A hash like xxHash64 returns a LongType column, so it should perform well when there is a need to join on the hash column. A more robust/secure algorithm like SHA2, however, returns a StringType column (a hex string), which would be slower when performing joins.
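
To put rough numbers on the collision side of this trade-off when the hash itself is the join key, here is a back-of-the-envelope sketch using the birthday-bound approximation p ≈ 1 - exp(-n(n-1) / 2^(b+1)) for n distinct rows and a b-bit hash. It assumes an idealized, uniformly distributed hash, which real functions only approximate, and the function name is my own.

```python
import math


def collision_probability(n_rows: int, hash_bits: int) -> float:
    """Approximate chance that at least two of n_rows distinct rows share a hash.

    Birthday-bound approximation for an idealized uniform hash:
    p ~= 1 - exp(-n * (n - 1) / 2 ** (bits + 1)).
    """
    exponent = -n_rows * (n_rows - 1) / 2 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


# Illustrative table sizes, not measurements from a real workload.
for n_rows in (1_000_000, 1_000_000_000):
    for bits in (32, 64, 256):
        p = collision_probability(n_rows, bits)
        print(f"rows={n_rows:>13,} bits={bits:>3} p={p:.3e}")
```

Under these assumptions a 32-bit hash is almost certain to collide somewhere in a million rows, a 64-bit hash has a chance of a few percent at a billion rows, and a 256-bit hash stays negligible at either size. When the hash is only compared between two versions of the same key, as in the change-detection step, the chance of a missed change is about 2^-b per comparison instead.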