Hashing Functions in PySpark

Michael_Appiah
New Contributor III

Hashes are commonly used in SCD2 merges to determine whether data has changed, by comparing the hashes of incoming source rows with the hashes of the existing rows in the target table. PySpark offers several different hashing functions:

  • MD5 (pyspark.sql.functions.md5)
  • SHA1 (pyspark.sql.functions.sha1)
  • SHA2 (pyspark.sql.functions.sha2)
  • xxHASH64 (pyspark.sql.functions.xxhash64)
  • 32-bit hash (pyspark.sql.functions.hash)
  • CRC32 (pyspark.sql.functions.crc32)

Which of these is best suited for comparing source and target table rows in an SCD2-type merge, in terms of robustness, performance, and collision likelihood?
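For reference, here is a minimal sketch of how each of these functions is invoked (the DataFrame and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Illustrative data; replace with your own source DataFrame
df = spark.createDataFrame([(1, "Alice", "2024-01-01")], ["id", "name", "updated_at"])

df.select(
    F.md5(F.col("name")).alias("md5"),                         # 128-bit digest as hex string
    F.sha1(F.col("name")).alias("sha1"),                       # 160-bit digest as hex string
    F.sha2(F.col("name"), 256).alias("sha2_256"),              # hex string; 224/256/384/512-bit variants
    F.xxhash64("id", "name", "updated_at").alias("xxhash64"),  # 64-bit hash as LongType
    F.hash("id", "name", "updated_at").alias("hash32"),        # 32-bit hash as IntegerType
    F.crc32(F.col("name")).alias("crc32"),                     # 32-bit checksum as LongType
).show(truncate=False)
```

Note that md5, sha1, sha2, and crc32 each take a single column (so multiple columns typically need to be concatenated first, e.g. with concat_ws), whereas hash and xxhash64 accept multiple columns directly.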

 

3 REPLIES

Kaniz
Community Manager

Hi @Michael_Appiah,

For comparing source and target table rows in an SCD2-type merge, any of the PySpark hashing functions listed in the question can be used, depending on the specific requirements of the use case.

However, the choice of a hashing algorithm can have an impact on robustness, performance, and collision likelihood when comparing large datasets.

Of the hashing functions listed, SHA2 is considered the most secure and robust: it supports several digest lengths (224, 256, 384, or 512 bits) and its collision resistance is well studied. However, SHA2 is generally slower than the non-cryptographic alternatives and may not be well suited to use cases that require high performance.

MD5 and SHA1 are also widely used and would be adequate for many use cases. However, both have known vulnerabilities to collision attacks and are not recommended for cryptographic purposes.

xxHASH64 is a very fast hashing algorithm with good collision resistance for a 64-bit hash, but it is not a cryptographic hash and should not be used for security purposes.

The 32-bit hash (a Murmur3 hash) is likewise fast, but with only 32 bits of output its collision resistance is considerably weaker than that of the other algorithms. It may suit use cases that require high performance and can tolerate some collisions.

Crc32 is used primarily for error detection and is not suitable for use as a secure hashing function.

Ultimately, the choice of the hashing algorithm should be based on the specific requirements of the use case, including the size and complexity of the datasets, performance constraints, and the level of collision resistance that is needed.
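As an illustration of how such a comparison might look in practice, here is a minimal sketch (the table, key, and column names are all hypothetical) that flags changed rows by comparing xxhash64 row hashes:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical source and target tables; column names are illustrative
source = spark.createDataFrame(
    [(1, "Alice", "Berlin"), (2, "Bob", "Paris")],
    ["business_key", "name", "address"],
)
target = spark.createDataFrame(
    [(1, "Alice", "Hamburg"), (2, "Bob", "Paris")],
    ["business_key", "name", "address"],
)

tracked_cols = ["name", "address"]  # columns whose changes trigger a new SCD2 version

# Compute a 64-bit row hash over the tracked columns on both sides
source = source.withColumn("row_hash", F.xxhash64(*tracked_cols))
target = target.withColumn("row_hash", F.xxhash64(*tracked_cols))

# A matching key with a differing hash indicates a changed row
changed = (
    source.alias("s")
    .join(target.alias("t"), "business_key")
    .where(F.col("s.row_hash") != F.col("t.row_hash"))
    .select("business_key", F.col("s.name"), F.col("s.address"))
)
changed.show()
```

In a real SCD2 pipeline the changed rows would then feed the merge that expires the old versions and inserts the new ones.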

Michael_Appiah
New Contributor III

Hi @Kaniz,

Thank you for your comprehensive answer. What is your opinion on the trade-off between a hash like xxHASH64, which returns a LongType column and therefore offers good performance when joining on the hash column, versus a more robust/secure algorithm like SHA2, which returns a StringType column and is therefore slower to join on?

Kaniz
Community Manager

(Accepted Solution)

Hi @Michael_Appiah,

When it comes to choosing between xxHASH64 and SHA2, it is important to consider the specific use case and requirements. xxHASH64 is a non-cryptographic hash function known for its speed and low memory usage. It is ideal when the hash is used for performance optimization rather than for security. SHA2, on the other hand, is a cryptographic hash function designed to be secure and collision-resistant, and it is the better choice when the hash is used for security purposes.

 

Regarding the column type: it is true that xxHASH64 returns a LongType column while SHA2 returns a StringType column. However, this should not be the only factor when choosing between the two. If the hash column is used for joining tables, the column type must be consistent across the tables being joined: if one table uses xxHASH64, the other should use xxHASH64 as well, and likewise for SHA2. This ensures the join can be performed efficiently and without errors.
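As a small illustration of the type difference (the column name is made up), printing the schema produced by each function:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("some row value",)], ["payload"])

df.select(
    F.xxhash64("payload").alias("xxhash64_key"),      # 8-byte LongType
    F.sha2(F.col("payload"), 256).alias("sha2_key"),  # 64-character hex StringType
).printSchema()
```

A LongType key is a fixed 8 bytes, while a SHA-256 hex string is 64 characters, which is part of why joins on the long column tend to be cheaper.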

 

In summary, xxHASH64 is a good choice when the hash is used for performance optimization rather than for security, and SHA2 is the better choice when security matters. Whichever you choose, weigh the specific use case and requirements, and keep the hash column type consistent across the tables being joined.

 

I hope this helps! Let me know if you have any more questions.