Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Hashing Functions in PySpark

Michael_Appiah
Contributor

Hashes are commonly used in SCD2 merges to determine whether data has changed, by comparing the hashes of the new rows in the source with the hashes of the existing rows in the target table. PySpark offers several hashing functions, including:

  • MD5 (pyspark.sql.functions.md5)
  • SHA-1 (pyspark.sql.functions.sha1)
  • SHA-2 (pyspark.sql.functions.sha2)
  • xxHash64 (pyspark.sql.functions.xxhash64)
  • 32-bit hash (pyspark.sql.functions.hash)
  • CRC32 (pyspark.sql.functions.crc32)

Which of these is best suited for comparing source and target table rows in an SCD2-type merge, in terms of robustness, performance, and collision likelihood?

 

1 REPLY

Michael_Appiah
Contributor

Hi @Retired_mod ,

Thank you for your comprehensive answer. What is your opinion on the trade-off between a hash like xxHash64, which returns a LongType column and therefore performs well when you need to join on the hash column, versus a more robust/secure algorithm like SHA-2, which returns a StringType column and is slower to join on?
