Hashing Functions in PySpark

Michael_Appiah — Sat, 14 Oct 2023 16:48:08 GMT

Hashes are commonly used in SCD2 merges to determine whether data has changed: the hashes of the new rows in the source are compared with the hashes of the existing rows in the target table. PySpark offers several hashing functions, including:

MD5 (pyspark.sql.functions.md5)
SHA1 (pyspark.sql.functions.sha1)
SHA2 (pyspark.sql.functions.sha2)
xxHash64 (pyspark.sql.functions.xxhash64)
32-bit hash (pyspark.sql.functions.hash)
CRC32 (pyspark.sql.functions.crc32)

Which of these is best suited for comparing source table rows with target table rows in an SCD2-type merge, in terms of robustness, performance and collision likelihood?
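On the collision-likelihood part of the question, a rough back-of-the-envelope estimate can be sketched in plain Python. This is only an illustration under an idealized assumption: that every function spreads its outputs uniformly over its full bit width (32 bits for hash and crc32, 64 for xxhash64, 128 for md5, 160 for sha1, up to 512 for sha2). Real inputs and non-cryptographic hashes may behave worse. The function names are made up for this example.

```python
import math


def pair_collision_probability(hash_bits: int) -> float:
    """Chance that two specific, different rows get the same hash.

    This is the case that matters when the old and new version of the
    same business key are compared. Assumes uniformly distributed hashes.
    """
    return 2.0 ** -hash_bits


def any_collision_probability(n_rows: int, hash_bits: int) -> float:
    """Birthday-bound approximation: chance that at least two of n_rows
    distinct rows share a hash. This is the case that matters when the
    hash itself is used as a join or lookup key.

    Uses 1 - exp(-n(n-1) / (2 * 2^bits)); expm1 keeps precision when
    the probability is extremely small.
    """
    space = 2.0 ** hash_bits
    return -math.expm1(-n_rows * (n_rows - 1) / (2.0 * space))


if __name__ == "__main__":
    # Bit widths of the PySpark functions listed above (sha2 shown at 256).
    for name, bits in [("hash/crc32", 32), ("xxhash64", 64),
                       ("md5", 128), ("sha1", 160), ("sha2-256", 256)]:
        p = any_collision_probability(1_000_000_000, bits)
        print(f"{name:>10}: {p:.3e}")
```

Under these assumptions, a 32-bit hash is practically guaranteed to contain a collision somewhere in a billion-row table, a 64-bit hash has a chance of a few percent, and 128 bits or more make it negligible. For a pure old-versus-new comparison on the same key, only the per-pair probability applies, which is far smaller.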

Re: Hashing Functions in PySpark

Michael_Appiah — Tue, 17 Oct 2023 16:19:30 GMT

Hi @Retired_mod,

Thank you for your comprehensive answer. What is your opinion on the following trade-off? A hash like xxHash64 returns a LongType column, which should perform well when there is a need to join on the hash column. A more robust and secure algorithm like SHA2 returns a StringType column, which would likely be slower when performing joins.
