cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Is it possible to control the ordering of the array values created by array_agg()?

akisugi
New Contributor III

Hi! I would be glad to ask you some questions.

I have the following data.โ€ƒ

ใ‚นใ‚ฏใƒชใƒผใƒณใ‚ทใƒงใƒƒใƒˆ 2024-04-06 23.08.15.png

I would like to get this kind of result. I want `move` to correspond to the order of `hist`.

ใ‚นใ‚ฏใƒชใƒผใƒณใ‚ทใƒงใƒƒใƒˆ 2024-04-06 23.07.34.png

Therefore, i considered the following query.

```

with tmp as (
select * from (values
(1, 1, 'a'),
(1, 3, 'c'),
(1, 2, 'b'),
(2, 1, 'a'),
(2, 2, 'b'),
(2, 3, null),
(3, 3, 'b'),
(3, 1, 'a'),
(3, 2, null),
(4, 1, 'a')
) as tab(custid, hist, alf)
order by custid asc, hist asc
)
select
custid,
array_join(array_agg(alf), ' -> ') as move
from tmp
group by custid
order by custid asc;
```

Tried sorting by `order by` before combining strings into an array with `array_join(array_agg())`. At first glance it seems to work, but the official documentation for `array_agg()` states the following

```
The order of elements in the array is non-deterministic.

From | https://docs.databricks.com/en/sql/language-manual/functions/array_agg.html#:~:text=The%20order%20of....
```

Does it make sense to sort by `order by` before running `array_join(array_agg())`? Will it work as expected with larger data sizes?

I apologize for the inconvenience and would appreciate your response.

 
1 ACCEPTED SOLUTION

Accepted Solutions

Hello,

Below is a solution
regexp_replace(array_join(sort_array(array_agg(concat(hist, alf))), ' -> '),'[0-9]','')

The idea in the code is to concatenate the "hist" column with the "alf" column, this way the "sort_array" can sort based on the numbers in the "hist" column and finally a "regexp_replace" is done removing the numbers from the "hist" column. hist" as they cannot be shown in the final result

Hope this helps

 

ThomazRossito_0-1712704018151.png

 

Att.
Thomaz Antonio Rossito Neto
Data Technical Lead / Data Engineer-Architect

View solution in original post

5 REPLIES 5

ThomazRossito
Contributor

Hi, 

I believe that for your scenario the "sort_array", function can help

https://docs.databricks.com/en/sql/language-manual/functions/sort_array.html

ThomazRossito_0-1712504147539.png

 

Att.
Thomaz Antonio Rossito Neto
Data Technical Lead / Data Engineer-Architect

Hi @ThomazRossito 

Thanks for your response and Sorry for my late reply.

I apologize that my question did not accurately describe my scenario.

It is true that the `sort_array` function works well for this sample data.

On the other hand, array values do not necessarily have rules like alphabetical or numeric.

------------------------------

with tmp as (
select * from (values
(1, 1, 'cherry'),
(1, 3, 'strawberries'),
(1, 2, 'acerola'),
(1, 5, 'banan'),
(1, 4, 'lemon')
) as tab(custid, hist, alf)
order by custid asc, hist asc
)
select
custid,
array_join(array_agg(alf), ' -> ') as move
from tmp
group by custid
order by custid asc;

------------------------------

I would like to get array like `[cherry -> acerola -> strawberries -> lemon -> banan]` for above sample data. I would like to sort and keep the array values according to the order of the `hist` columns but `array_agg()` states `The order of elements in the array is non-deterministic`.

If you have a solution, I would appreciate your response!

 

Hello,

Below is a solution
regexp_replace(array_join(sort_array(array_agg(concat(hist, alf))), ' -> '),'[0-9]','')

The idea in the code is to concatenate the "hist" column with the "alf" column, this way the "sort_array" can sort based on the numbers in the "hist" column and finally a "regexp_replace" is done removing the numbers from the "hist" column. hist" as they cannot be shown in the final result

Hope this helps

 

ThomazRossito_0-1712704018151.png

 

Att.
Thomaz Antonio Rossito Neto
Data Technical Lead / Data Engineer-Architect

akisugi
New Contributor III

Hi @ThomazRossito 

This is a great idea. It can solve my problem.Thank you.

akisugi
New Contributor III

Hi @ThomazRossito 

This is a great idea. It can solve my problem.Thank you.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group