How to explode an array column and repack the distinct values into one array in DB SQL?

test_user
New Contributor II

Hi, I am new to DB SQL. I have a table where the array column (cities) contains multiple arrays and some have multiple duplicate values. I need to unpack the array values into rows so I can list the distinct values. The following query works for this step.

SELECT distinct 
key, 
explode(cities) as cities
FROM green.cities
group by key, cities

The next step I want to repack the distinct cities into one array grouped by key.

I can do this easily in pyspark using two dataframes, first by doing an explode on the array column of the first dataframe and then doing a collect_set on the same column in the next dataframe.

Any suggestions how I can do this just using Databricks SQL?

Hubert-Dudek
Databricks MVP

You can use collect_set in SQL. If you share the example input row and desired output, it would be easier to help.


My blog: https://databrickster.medium.com/

Thanks Hubert. What i tried so far is not collecting the distinct values of the expolded column grouped by key.

An example input is:

key, cities

1, ["milan","paris","new york"]

1, ["London"]

1, ["London","paris"]

1, ["London","paris"]

1, ["London","paris"]

1, ["milan","paris"]

1, ["paris","new york"]

1, ["new york"]

1, ["new york"]

2, ["milan","paris","new york"]

2, ["paris"]

2, ["paris"]

2, ["milan","paris"]

2, ["paris","new york"]

2, ["Tokyo"]

2, ["new york"]

2, ["LA","Tokyo"]

2, ["LA","Tokyo"]

The desired output is:

key, cities

1, ["milan","paris","new york","London"]

2, ["milan","paris","new york","LA","Tokyo"]

Aviral-Bhardwaj
Esteemed Contributor III

try to use SQL windows functions here

AviralBhardwaj