Hashes are commonly used in SCD2 merges to detect whether data has changed: the hash of each new row in the source is compared with the hash of the corresponding existing row in the target table. PySpark offers several hashing functions, including:
- MD5 (pyspark.sql.functions.md5)
- SHA1 (pyspark.sql.functions.sha1)
- SHA2 (pyspark.sql.functions.sha2)
- xxHash64 (pyspark.sql.functions.xxhash64)
- 32-bit hash (pyspark.sql.functions.hash)
- CRC32 (pyspark.sql.functions.crc32)
Which of these is best suited for comparing source and target rows in an SCD2-type merge, in terms of robustness, performance, and collision likelihood?
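
For context, the comparison I have in mind looks roughly like the sketch below. It is plain Python with `hashlib` rather than Spark, only to show the logic. The helper names (`row_hash`, `changed_keys`), the separator, the null placeholder, and the sample rows are all made up for illustration. In PySpark I would presumably build the hash column with something like `sha2(concat_ws(sep, *cols), 256)` instead.

```python
import hashlib

SEP = "\x1f"   # assumed separator: a character that should not occur in the data
NULL = "\x00"  # assumed placeholder so that None and "" produce different hashes


def row_hash(row, columns):
    """Return a SHA-256 hex digest over the tracked columns of one row (a dict)."""
    parts = [NULL if row.get(c) is None else str(row[c]) for c in columns]
    return hashlib.sha256(SEP.join(parts).encode("utf-8")).hexdigest()


def changed_keys(source_rows, target_hashes, key, columns):
    """Return keys that are new or whose source hash differs from the stored target hash."""
    return [
        r[key]
        for r in source_rows
        if target_hashes.get(r[key]) != row_hash(r, columns)
    ]


# Hypothetical sample data
columns = ["name", "city"]
target_rows = [
    {"id": 1, "name": "Ann", "city": "Oslo"},
    {"id": 2, "name": "Bob", "city": None},
]
# Hashes as they would be stored alongside the current rows in the target table
target_hashes = {r["id"]: row_hash(r, columns) for r in target_rows}

source_rows = [
    {"id": 1, "name": "Ann", "city": "Oslo"},  # unchanged
    {"id": 2, "name": "Bob", "city": ""},      # None became "", treated as a change
    {"id": 3, "name": "Cy", "city": "Rome"},   # key not yet in the target
]

changed = changed_keys(source_rows, target_hashes, "id", columns)
print(changed)
```

Only the rows returned by `changed_keys` would then be closed out and re-inserted (or inserted for the first time) by the merge, so a hash collision here would mean a real change is silently skipped.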