Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Partition In Spark with subquery which includes Union

Jeewan
New Contributor

I have a SQL query like this:
select ... from table1 where id in (select id from table1 where (some condition) UNION select id from table2 where (some condition)) table1

I am reading this through the JDBC connector with 200 partitions, a lower bound of 0 and an upper bound of 200. The partitioning is done on the partition_key column, whose values range from 1 to 200. I am passing option("dbtable", table), where table is the query mentioned above.

What will the query that Spark generates internally look like? Since the query uses UNION, will that affect the partitioning?

1 REPLY

Kaniz_Fatma
Community Manager

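Hi @Jeewan, with the JDBC connector the `UNION` is not executed by Spark. Whatever you pass as `dbtable` is treated as a table expression, so a query has to be wrapped in parentheses and given an alias. Spark then sends one query per partition to the database, roughly of the form `SELECT ... FROM (your query) alias WHERE <range condition on partition_key>`, and the database evaluates the whole subquery, including the `UNION` (which removes duplicates, whereas `UNION ALL` keeps them), for each of those queries. With a lower bound of 0, an upper bound of 200 and 200 partitions, the stride is 1, so each partition covers a single `partition_key` value, except that the first partition also picks up NULLs and anything below 1, and the last picks up everything from 199 upward. The bounds only decide how rows are split across partitions; they do not filter rows. So the partitioning is applied to the combined result set and is not directly influenced by the `UNION` operation itself. The practical cost is that the database runs the subquery once per partition, and `partition_key` has to be a column in the output of your query.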
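
To make the partitioning step concrete, here is a minimal sketch in plain Python of the range conditions a partitioned JDBC read would be expected to produce for the bounds in the question. It only mimics the logic and assumes non-negative bounds and a simple integer stride; the exact arithmetic can differ between Spark versions. The helper names `jdbc_partition_predicates` and `partition_queries` are hypothetical and not part of Spark, and `dbtable` is a placeholder for the parenthesized query plus an alias.

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Hypothetical helper: approximate the per-partition WHERE clauses
    of a partitioned JDBC read (assumes non-negative integer bounds)."""
    # Width of each partition's range on the partition column.
    stride = upper_bound // num_partitions - lower_bound // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        # The first partition has no lower limit, the last has no upper limit,
        # so rows outside [lower_bound, upper_bound) are still read.
        lower = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper is None:
            predicates.append(lower)
        elif lower is None:
            # NULL keys are assigned to the first partition.
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(f"{lower} AND {upper}")
    return predicates


def partition_queries(dbtable, predicates):
    """Hypothetical helper: wrap each predicate around the dbtable expression."""
    return [f"SELECT * FROM {dbtable} WHERE {p}" for p in predicates]


# Placeholder for the original query (UNION included) wrapped with an alias.
dbtable = "(<your query>) q"

predicates = jdbc_partition_predicates("partition_key", 0, 200, 200)
queries = partition_queries(dbtable, predicates)

print(queries[0])    # first partition: below 1, plus NULL keys
print(queries[1])    # a middle partition: exactly partition_key = 1
print(queries[199])  # last partition: 199 and above, so 199 and 200 share it
```

Each generated statement contains the full subquery, which is why the `UNION` is evaluated by the database once per partition rather than once overall.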
