cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results for 
Search instead for 
Did you mean: 

Use Array in WHERE IN clause

zc
New Contributor III

This is what I'm trying to do using SQL:

 

create table check1 as
select * from dataA
where IDs in ('12483258','12483871','12483883');
 
The list of IDs is much longer and may be changed so I want to use a variable for that. This is what I have tried
 
declare or replace ID ARRAY<string>;
SET VARIABLE ID = ARRAY('12483258','12483871','12483883');
create table check1 as
select * from dataA
where IDs in (ID);
 
And I get the error message below
data type mismatch: Input to `in` should all be the same type, but it's ["STRING", "ARRAY<STRING>"]
 
It appears that databricks won't compare IDs column to the array elements. I also tried split and explode but didn't work, maybe the syntax wasn't right.
 
1 ACCEPTED SOLUTION

Accepted Solutions

ManojkMohan
Valued Contributor III

Root Cause

  • In your SQL statement, where IDs in (ID) is comparing a string column (IDs) with an array variable (ARRAY<STRING>).
  • Databricks SQL does not allow IN to directly take an array — IN expects a list of scalar values, not an array object.

That’s why you see the error

Solution:

Try Using array_contains to structure your query.

Use array_contains

If you’re filtering using an array variable:

 

 

DECLARE ids ARRAY<STRING>;

SET VARIABLE ids = ARRAY('12483258','12483871','12483883');

CREATE TABLE check1 AS

SELECT *

FROM dataA

WHERE array_contains(ids, IDs);

View solution in original post

9 REPLIES 9

szymon_dybczak
Esteemed Contributor III

Hi @,

This is expected. Your id column is of type string and you're trying to provide in in clause array<string>

zc
New Contributor III

Thanks for responding. So what's the proper way of using an array there?

szymon_dybczak
Esteemed Contributor III

I don't have access to computer right now, but I'll provide you tomorrow code that should work

WiliamRosa
New Contributor III

Hi @zc,

I managed to reproduce your scenario and put together a possible solution—see if it helps.

Simulating the error:

WiliamRosa_0-1755888823722.png

Solution with array:

WiliamRosa_1-1755888876798.png

 

 

Wiliam Rosa
Data Engineer | Machine Learning Engineer
LinkedIn: linkedin.com/in/wiliamrosa

zc
New Contributor III

@WiliamRosa 

Thank you for this, your solution works but in a different way. I've actually been using temp views in the same way, but that requires extra steps that's why I wanted to go for variables instead. But thank you for putting this together!

ManojkMohan
Valued Contributor III

Root Cause

  • In your SQL statement, where IDs in (ID) is comparing a string column (IDs) with an array variable (ARRAY<STRING>).
  • Databricks SQL does not allow IN to directly take an array — IN expects a list of scalar values, not an array object.

That’s why you see the error

Solution:

Try Using array_contains to structure your query.

Use array_contains

If you’re filtering using an array variable:

 

 

DECLARE ids ARRAY<STRING>;

SET VARIABLE ids = ARRAY('12483258','12483871','12483883');

CREATE TABLE check1 AS

SELECT *

FROM dataA

WHERE array_contains(ids, IDs);

zc
New Contributor III

@ManojkMohan 

Thank you so much for your detailed explanation and perfect solution!

BS_THE_ANALYST
Esteemed Contributor

Nice solutions! @ManojkMohan @WiliamRosa 

I love the use of the temp view for the intermediate result. The array_contains is also a really nice touch. 

@ManojkMohan when you write "SET VARIABLE ids = ARRAY('12483258','12483871','12483883');" ... can this array be from another query to make it dynamic? 

Really cool stuff. 

@zc I guess we could also use parameters: https://docs.databricks.com/aws/en/sql/user/queries/query-parameters#query-based-dropdown-list . The benefit of this is that the query can be maintained/updated outside of the script, given that you used a query based drop down list parameter. I haven't tried it but it makes sense, in theory. Example of using a parameter below:

BS_THE_ANALYST_0-1755952877629.png

All the best,
BS

zc
New Contributor III

@BS_THE_ANALYST 

Thank you for responding. Yeah I guess that's an option too. Talking about a list maintained outside of the script though, I'd prefer having a csv and import it as a temp view when needed.

Join Us as a Local Community Builder!

Passionate about hosting events and connecting people? Help us grow a vibrant local community—sign up today to get started!

Sign Up Now