
Concatenating a row to be able to hash

turagittech
Contributor

Hi All,

Sometimes when loading data we only want to update a row if its values have changed - SCD1-type scenarios for a data warehouse. One approach is column-by-column equality checks (A = A, B = B, etc.). Another, which I believe is pretty common, is generating a hash of all the columns of interest. I have found several approaches to concatenating a row so that it will always hash successfully.
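
Something like this is the end goal - a rough sketch with made-up table and column names (dim_customer, staged_customer, customer_id, row_hash), where the staged rows already carry a computed hash:

```sql
-- Only touch the target row when the hash of the incoming row differs.
MERGE INTO dim_customer AS tgt
USING staged_customer AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
```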

Firstly, let's get concat() out of the way: it doesn't handle nulls. Any null column and you get no hash, because the whole output ends up null. Wrapping every column in coalesce() isn't great when you have 100 columns to deal with; AI dev tools do help, but it still ends up looking untidy to me.
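
For example, assuming a simple customers table with string columns (names are just placeholders):

```sql
-- concat() returns NULL if any argument is NULL, so the row hash is lost.
SELECT md5(concat(first_name, last_name, city)) AS row_hash
FROM customers;

-- Wrapping every column works, but gets noisy at 100 columns.
SELECT md5(concat(coalesce(first_name, ''),
                  coalesce(last_name, ''),
                  coalesce(city, ''))) AS row_hash
FROM customers;
```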

Next, concat_ws() looks good as an option: it can handle nulls, stringifies the row, and produces a string output regardless of datatypes.
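
A minimal sketch of that pattern (same hypothetical table, mixed datatypes):

```sql
-- concat_ws() skips NULL arguments and casts the rest to string, so a hash
-- is still produced when some columns are NULL.
SELECT md5(concat_ws('||', first_name, last_name, updated_at, credit_limit)) AS row_hash
FROM customers;
```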

coalesce() can also help, but it appears limited to arguments that share a common data type within a single call, so it may fail on typical rows mixing date, int and string columns, for example. It would require casting DATE (and similar) columns to strings first. This may be a misunderstanding of least-common-type precedence on my part, so maybe someone can explain what the typical risk is in this case.
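
The cast-first version I have in mind looks something like this (again, just illustrative names):

```sql
-- Casting explicitly avoids relying on type coercion inside coalesce()
-- and keeps every argument a string.
SELECT md5(concat(coalesce(cast(order_date AS STRING), ''),
                  coalesce(cast(quantity AS STRING), ''),
                  coalesce(customer_name, ''))) AS row_hash
FROM orders;
```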

Finally, a bit left field: create a JSON string using to_json(). This will never fail to create a string with data in it that can be hashed, although the result may get a bit long.
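
For example (a sketch - struct(*) just wraps the whole row):

```sql
-- to_json(struct(*)) always yields a JSON string, even when some columns
-- are NULL, so there is always something to hash; the trade-off is the
-- extra length from the embedded column names.
SELECT md5(to_json(struct(*))) AS row_hash
FROM customers;
```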

My questions:

Are there any edge cases where concat_ws will fail to produce a string when any single column in the row is populated?

Has anyone tested the performance of any of these to know whether one of them falls down in large data situations?

My crude performance testing shows them to be very similar; I am sure there are more rigorous test results out there that give one of them a clear win.

[Attached screenshot: performance comparison of the concat_ws and to_json queries]

In the screenshot above, once I take out the scheduling time for the queries, concat_ws returned 37 rows/sec and to_json returned 28 rows/sec. I have run the test a few times to pick up on any nuances, and the difference looks to be caching: if I run the test with the other query first, the second is always the fastest.

https://docs.databricks.com/aws/en/sql/language-manual/functions/concat

https://docs.databricks.com/aws/en/sql/language-manual/functions/coalesce

https://docs.databricks.com/aws/en/sql/language-manual/functions/concat_ws

https://docs.databricks.com/aws/en/sql/language-manual/functions/to_json

1 REPLY

SP_6721
Contributor III

Hi @turagittech,

concat_ws() is generally the most practical and reliable option here. It handles mixed datatypes well and safely skips nulls. The only edge cases you'd typically run into are with complex or unsupported custom datatypes or if the separator itself is NULL.
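
A quick illustration of the separator edge case (a sketch):

```sql
-- If the separator expression evaluates to NULL, concat_ws() itself returns
-- NULL, so keep the separator a literal rather than deriving it from data.
SELECT concat_ws(NULL, 'a', 'b');   -- NULL
SELECT concat_ws('||', 'a', 'b');   -- 'a||b'
```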

The performance difference you noticed is likely due to caching: whichever query runs second tends to benefit from the data already being cached. to_json() is better suited to nested or complex row structures.
