Databricks SQL string_agg

Harish2122 — Sat, 10 Dec 2022 11:22:20 GMT

I'm migrating some on-premises SQL views to Databricks and struggling to find equivalents for some functions. The main one is the string_agg function.

string_agg(field_name, ', ')

Anyone know how to convert that to Databricks SQL?

Thanks in advance.

Re: Databricks SQL string_agg

Heman2 — Sat, 10 Dec 2022 11:24:14 GMT

A rough equivalent is collect_set combined with array_join, but note that the order is lost:

SELECT col1, array_join(collect_set(col2), ',') j
FROM tmp
GROUP BY col1
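
A note on duplicates, as a sketch only: collect_set also removes duplicate values, whereas string_agg keeps them. If duplicates matter, collect_list can be swapped in. The names tmp, col1, and col2 are the placeholders from the query above, and the output order is still not guaranteed.

```sql
-- Sketch: collect into a list instead of a set so duplicate values are kept.
-- tmp, col1, col2 are the placeholder names from the query above.
-- The order of the joined values is still not guaranteed.
SELECT col1, array_join(collect_list(col2), ',') AS j
FROM tmp
GROUP BY col1
```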

Re: Databricks SQL string_agg

Ajay-Pandey — Sat, 10 Dec 2022 17:40:39 GMT

Hi @Harish K, you can use the query below in Spark SQL:

%sql
SELECT col1, array_join(collect_set(col2), ',') j
FROM tmp
GROUP BY col1

Re: Databricks SQL string_agg

eriodega — Tue, 06 Aug 2024 20:33:50 GMT

Here's a way that preserves ordering. It seems way too complicated to me; I'm hoping someone else can provide a more elegant way in a subsequent comment.

%sql
;WITH Blah(col1, i) as (
  SELECT 'abc',1 UNION SELECT 'def',2 UNION SELECT 'ghi',3
)
SELECT array_join(collect_set(col1), ',') as combined_string
FROM Blah;
--ghi,def,abc

---------------------
--but what if you want to preserve ordering?
;WITH Blah(col1, i) as (
  SELECT 'abc',1 UNION SELECT 'def',2 UNION SELECT 'ghi',3
)
SELECT
  ARRAY_JOIN(
    TRANSFORM(
      ARRAY_SORT(
        ARRAY_AGG( (col1, i) ),
        (left, right) -> CASE WHEN left.i < right.i THEN -1 WHEN left.i > right.i THEN 1 ELSE 0 END
      ),
      x -> x.col1
    ),
    ','
  ) as combined_string
FROM Blah;
--abc,def,ghi
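
One possibly shorter variant, offered as a sketch and not as an official recommendation: if the sort key is placed first in the struct, sort_array orders the structs by that field in ascending order, so the comparator lambda is not needed. Blah, col1, and i are the names from the example above; ties on i would be broken by col1.

```sql
-- Sketch: put the sort key (i) first in the struct so sort_array orders by it.
WITH Blah(col1, i) AS (
  SELECT 'abc', 1 UNION SELECT 'def', 2 UNION SELECT 'ghi', 3
)
SELECT array_join(
         transform(
           -- sort_array compares structs field by field, starting with i
           sort_array(collect_list(named_struct('i', i, 'col1', col1))),
           x -> x.col1              -- keep only the string after sorting
         ),
         ','
       ) AS combined_string
FROM Blah;
```

This only covers ascending order on the key. For descending order or custom rules, the comparator form above is still needed.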

Re: Databricks SQL string_agg

eriodega — Thu, 22 Aug 2024 19:40:15 GMT

Note: it would be great if support were added for a STRING_AGG function. Here's how simple it is to write the same order-preserving query in PostgreSQL (as an example):

;WITH Blah(col1, i) as (
  SELECT 'abc',1 UNION SELECT 'def',2 UNION SELECT 'ghi',3
)
SELECT STRING_AGG(col1, ',' ORDER BY i ASC)
FROM Blah

Re: Databricks SQL string_agg

eriodega — Thu, 29 Aug 2024 19:48:30 GMT

Following a support case, a Databricks Aha Idea request has now been created for an order-preserving string_agg function (reference number DB-I-11734).

Re: Databricks SQL string_agg

smueller — Tue, 17 Sep 2024 21:16:41 GMT

If not grouping by something else:

SELECT array_join(collect_set(field_name), ',') field_list
FROM table

Re: Databricks SQL string_agg

eriodega — Tue, 17 Sep 2024 21:18:37 GMT

Hmmm, when I try it, I get multiple rows back (I want only one row):

;WITH Blah(col1, i) as (
  SELECT 'abc',1 UNION SELECT 'def',2 UNION SELECT 'ghi',3
)
SELECT array_join(collect_set(col1) over (order by i), ',')
FROM Blah

Results:

abc
def,abc
def,ghi,abc

Re: Databricks SQL string_agg

filipniziol — Wed, 18 Sep 2024 07:03:05 GMT

Hi @eriodega ,

If you just want a single row, do not use OVER:

%sql
WITH Blah(col1, i) AS (
  SELECT 'abc', 1 UNION SELECT 'def', 2 UNION SELECT 'ghi', 3
)
SELECT array_join(collect_set(col1), ',') AS concatenated_string
FROM Blah

Re: Databricks SQL string_agg

eriodega — Wed, 18 Sep 2024 13:13:32 GMT

Yes, that returns one row. The ordering of the resulting string is non-deterministic though (I just ran it and got "def,abc,ghi"), and that is likely fine for most people's use cases (in fact, Heman2 mentioned it in the first answer in this thread). However, anyone who needs a specific order may have to resort to the array_join, transform, array_sort, array_agg, lambda answer I posted above. I was just complaining about how cumbersome it is and don't mean to belabor this thread.

Re: Databricks SQL string_agg

just-Vlad — Fri, 17 Jan 2025 19:05:12 GMT

Adding a proper window specification to the OVER clause, plus DISTINCT, gets something resembling STRING_AGG for simple ascending order:

select distinct
   object
  ,array_join(
     array_sort(
       collect_set(property) over (
         partition by object
         order by property
         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       )
     ),
     ','
   ) as properties
from (
  values
   ('object 1','C')
  ,('object 2','F')
  ,('object 1','B')
  ,('object 2','E')
  ,('object 1','A')
  ,('object 2','D')
) as t(object,property)
order by object

object	properties
object 1	A,B,C
object 2	D,E,F
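
For comparison, a sketch of what is probably an equivalent query without the window: group by object and sort the collected set. Like the query above, it sorts by the value itself and drops duplicate values, so it does not cover ordering by a separate column. The inline data and the names object and property are the ones from the example above.

```sql
-- Sketch: plain GROUP BY instead of a window plus DISTINCT.
-- array_sort orders the collected values ascending; collect_set removes duplicates.
SELECT object,
       array_join(array_sort(collect_set(property)), ',') AS properties
FROM (
  VALUES
    ('object 1','C'), ('object 2','F'), ('object 1','B'),
    ('object 2','E'), ('object 1','A'), ('object 2','D')
) AS t(object, property)
GROUP BY object
ORDER BY object
```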