Use Array in WHERE IN clause

zc
New Contributor III

This is what I'm trying to do using SQL:

 

create table check1 as
select * from dataA
where IDs in ('12483258','12483871','12483883');
 
The list of IDs is much longer and may be changed so I want to use a variable for that. This is what I have tried
 
declare or replace ID ARRAY<string>;
SET VARIABLE ID = ARRAY('12483258','12483871','12483883');
create table check1 as
select * from dataA
where IDs in (ID);
 
And I get the error message below
data type mismatch: Input to `in` should all be the same type, but it's ["STRING", "ARRAY<STRING>"]
 
It appears that databricks won't compare IDs column to the array elements. I also tried split and explode but didn't work, maybe the syntax wasn't right.
 

szymon_dybczak
Esteemed Contributor III

Hi @,

This is expected. Your id column is of type string and you're trying to provide in in clause array<string>

zc
New Contributor III

Thanks for responding. So what's the proper way of using an array there?

szymon_dybczak
Esteemed Contributor III

I don't have access to computer right now, but I'll provide you tomorrow code that should work

WiliamRosa
Databricks Partner

Hi @zc,

I managed to reproduce your scenario and put together a possible solution—see if it helps.

Simulating the error:

WiliamRosa_0-1755888823722.png

Solution with array:

WiliamRosa_1-1755888876798.png

 

 

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

zc
New Contributor III

@WiliamRosa 

Thank you for this, your solution works but in a different way. I've actually been using temp views in the same way, but that requires extra steps that's why I wanted to go for variables instead. But thank you for putting this together!

ManojkMohan
Honored Contributor II

Root Cause

  • In your SQL statement, where IDs in (ID) is comparing a string column (IDs) with an array variable (ARRAY<STRING>).
  • Databricks SQL does not allow IN to directly take an array — IN expects a list of scalar values, not an array object.

That’s why you see the error

Solution:

Try Using array_contains to structure your query.

Use array_contains

If you’re filtering using an array variable:

 

 

DECLARE ids ARRAY<STRING>;

SET VARIABLE ids = ARRAY('12483258','12483871','12483883');

CREATE TABLE check1 AS

SELECT *

FROM dataA

WHERE array_contains(ids, IDs);

View solution in original post

zc
New Contributor III

@ManojkMohan 

Thank you so much for your detailed explanation and perfect solution!

BS_THE_ANALYST
Databricks Partner

Nice solutions! @ManojkMohan @WiliamRosa 

I love the use of the temp view for the intermediate result. The array_contains is also a really nice touch. 

@ManojkMohan when you write "SET VARIABLE ids = ARRAY('12483258','12483871','12483883');" ... can this array be from another query to make it dynamic? 

Really cool stuff. 

@zc I guess we could also use parameters: https://docs.databricks.com/aws/en/sql/user/queries/query-parameters#query-based-dropdown-list . The benefit of this is that the query can be maintained/updated outside of the script, given that you used a query based drop down list parameter. I haven't tried it but it makes sense, in theory. Example of using a parameter below:

BS_THE_ANALYST_0-1755952877629.png

All the best,
BS

zc
New Contributor III

@BS_THE_ANALYST 

Thank you for responding. Yeah I guess that's an option too. Talking about a list maintained outside of the script though, I'd prefer having a csv and import it as a temp view when needed.