
Concatenating a row to be able to hash

turagittech
Contributor

Hi All,

Sometimes when loading data we only want to update a row if its values have changed - SCD1-type scenarios for a data warehouse. One approach is column-by-column equality checks (A = A, B = B, etc.). Another, which I believe is pretty common, is generating a hash of all the columns of interest. I have found several approaches to concatenating a row so that it will always hash successfully.
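
Something like this is the end goal - a rough sketch with made-up table and column names (dim_customer, staged_customer, customer_id, row_hash), where the staged rows already carry a computed hash:

```sql
-- Only touch the target row when the hash of the incoming row differs.
MERGE INTO dim_customer AS tgt
USING staged_customer AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED AND tgt.row_hash <> src.row_hash THEN
  UPDATE SET *
WHEN NOT MATCHED THEN
  INSERT *;
```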

Firstly, let's get concat() out of the way: it doesn't handle nulls. Any null column and you get no hash, because the whole output ends up null. Wrapping every column in coalesce() isn't great when you have 100 columns to deal with; AI dev tools do help, but it still ends up looking untidy to me.
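
For example, assuming a simple customers table with string columns (names are just placeholders):

```sql
-- concat() returns NULL if any argument is NULL, so the row hash is lost.
SELECT md5(concat(first_name, last_name, city)) AS row_hash
FROM customers;

-- Wrapping every column works, but gets noisy at 100 columns.
SELECT md5(concat(coalesce(first_name, ''),
                  coalesce(last_name, ''),
                  coalesce(city, ''))) AS row_hash
FROM customers;
```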

Next, concat_ws() looks good as an option: it can handle nulls, stringifies the row, and produces a string output regardless of datatypes.
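
A minimal sketch of that pattern (same hypothetical table, mixed datatypes):

```sql
-- concat_ws() skips NULL arguments and casts the rest to string, so a hash
-- is still produced when some columns are NULL.
SELECT md5(concat_ws('||', first_name, last_name, updated_at, credit_limit)) AS row_hash
FROM customers;
```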

coalesce() can also help, but it appears limited to arguments that share a common data type within a single call, so it may fail on typical rows mixing date, int and string columns, for example. It would require casting DATE (and similar) columns to strings first. This may be a misunderstanding of least-common-type precedence on my part, so maybe someone can explain what the typical risk is in this case.
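
The cast-first version I have in mind looks something like this (again, just illustrative names):

```sql
-- Casting explicitly avoids relying on type coercion inside coalesce()
-- and keeps every argument a string.
SELECT md5(concat(coalesce(cast(order_date AS STRING), ''),
                  coalesce(cast(quantity AS STRING), ''),
                  coalesce(customer_name, ''))) AS row_hash
FROM orders;
```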

Finally, a bit left field: create a JSON string using to_json(). This will never fail to create a string with data in it that can be hashed, although the result may get a bit long.
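
For example (a sketch - struct(*) just wraps the whole row):

```sql
-- to_json(struct(*)) always yields a JSON string, even when some columns
-- are NULL, so there is always something to hash; the trade-off is the
-- extra length from the embedded column names.
SELECT md5(to_json(struct(*))) AS row_hash
FROM customers;
```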

My questions:

Are there any edge cases where concat_ws will fail to produce a string when any single column in the row is populated?

Has anyone tested the performance of any of these to know whether one of them falls down in large data situations?

My crude performance testing shows them to be very similar; I am sure there are more rigorous test results out there that give one of them a clear win.

[Attached screenshot: performance comparison of the concat_ws and to_json queries]

In the screenshot above, once I take out the scheduling time for the queries, concat_ws returned 37 rows/sec and to_json returned 28 rows/sec. I have run the test a few times to pick up on any nuances, and the difference looks to be caching: if I run the test with the other query first, the second is always the fastest.

https://docs.databricks.com/aws/en/sql/language-manual/functions/concat

https://docs.databricks.com/aws/en/sql/language-manual/functions/coalesce

https://docs.databricks.com/aws/en/sql/language-manual/functions/concat_ws

https://docs.databricks.com/aws/en/sql/language-manual/functions/to_json

1 REPLY

SP_6721
Contributor III

Hi @turagittech,

concat_ws() is generally the most practical and reliable option here. It handles mixed datatypes well and safely skips nulls. The only edge cases you'd typically run into are with complex or unsupported custom datatypes or if the separator itself is NULL.
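
A quick illustration of the separator edge case (a sketch):

```sql
-- If the separator expression evaluates to NULL, concat_ws() itself returns
-- NULL, so keep the separator a literal rather than deriving it from data.
SELECT concat_ws(NULL, 'a', 'b');   -- NULL
SELECT concat_ws('||', 'a', 'b');   -- 'a||b'
```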

The performance difference you noticed is likely due to caching: whichever query runs second tends to benefit from the data already being cached. to_json() is better suited to nested or complex row structures.
