Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

SQL Update Join

MikeK_
New Contributor II

Hi,

I'm importing some data and stored procedures from SQL Server into Databricks, and I noticed that UPDATE statements with joins are not supported in Spark SQL. What alternative can I use? Here's what I'm trying to do:

UPDATE t1
SET t1.colB = CASE WHEN t2.colB > t1.colB THEN t2.colB ELSE t1.colB + t2.colB END
FROM table1 t1
INNER JOIN table2 t2 ON t1.colA = t2.colA
WHERE t2.colC = 'XYZ'

Another thing I was unable to do in Spark SQL is CROSS APPLY and OUTER APPLY. Are there any alternatives for those two?

Thanks in advance.

Mike


6 REPLIES

lee
Contributor

Hi @Mike K.,

As Spark has a different distributed architecture than a traditional RDBMS, you'll certainly find that some functionality isn't supported 1:1. For your particular issue, I would suggest performing the conditional logic and filtering in one temporary structure, and then using the MERGE INTO statement (https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html). This does assume that you are using the (awesome) Delta architecture and storage format, which handles ACID-compliant transactions under the hood. There are certainly other ways to accomplish this, but here is one such way:

-- Stage the conditional logic and filtering in a temporary view
CREATE TEMPORARY VIEW joined AS
SELECT dt1.colA,
       CASE WHEN dt2.colB > dt1.colB THEN dt2.colB ELSE dt1.colB + dt2.colB END AS colB
FROM dt1
INNER JOIN dt2 ON dt1.colA = dt2.colA
WHERE dt2.colC = 'XYZ';

-- Apply the staged values back to the target Delta table
MERGE INTO dt1
USING joined ON dt1.colA = joined.colA
WHEN MATCHED THEN UPDATE SET colB = joined.colB;

Here is my working notebook, which reproduces the failure and shows the steps taken to reach this final answer:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3249...

or on my GitHub: https://github.com/leedabee/databricks-forum-support-notebooks/tree/master/db-forum-29380

Hope this gives you some help!

BhaktiAbhyankar
New Contributor II

I have an issue where I want to convert a SQL UPDATE with JOIN query into MERGE.

I do not have a primary or unique key in either table (on which the join is performed), so the merge fails with:

Error in SQL statement: UnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways.

Can someone please help me resolve this issue?
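That error means MERGE found more than one source row matching the same target row. One common workaround, shown here only as a sketch with hypothetical table and column names (target, source, colA, colB, updated_at), is to deduplicate the merge source first so each join key keeps exactly one row, e.g. with ROW_NUMBER:

-- Keep one source row per join key (hypothetical names throughout)
CREATE TEMPORARY VIEW deduped_source AS
SELECT colA, colB
FROM (
  SELECT colA, colB,
         ROW_NUMBER() OVER (PARTITION BY colA ORDER BY updated_at DESC) AS rn
  FROM source
) ranked
WHERE rn = 1;

MERGE INTO target
USING deduped_source ON target.colA = deduped_source.colA
WHEN MATCHED THEN UPDATE SET colB = deduped_source.colB;

If there is no column like updated_at to rank on, an aggregate (e.g. MAX(colB) grouped by colA) gives the same one-row-per-key guarantee.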

Hi @BhaktiAbhyankar, I'm facing the same kind of issue; my table doesn't have any primary keys to join on, so I went with 'leftanti' and union (see the sketch below). Was this issue resolved for you? If you found a better approach, please share.
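For reference, the left-anti-plus-union idea can be sketched in Spark SQL roughly as follows. This is a sketch only, with hypothetical table names, and it assumes the target is a Delta table, whose snapshot isolation allows overwriting a table that the same query also reads from:

-- Rebuild the target: unmatched rows pass through, matched rows are recomputed
INSERT OVERWRITE TABLE target
SELECT t.colA, t.colB
FROM target t
LEFT ANTI JOIN source s ON t.colA = s.colA
UNION ALL
SELECT t.colA,
       CASE WHEN s.colB > t.colB THEN s.colB ELSE t.colB + s.colB END AS colB
FROM target t
INNER JOIN source s ON t.colA = s.colA;

Note this still assumes the join key is unique in source; if it is not, the inner join will fan out and duplicate target rows rather than raise an error, so deduplicating the source as above is still advisable.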

BisharathSthapi
New Contributor II

@Mike K. I have added a solution without using views.

from pyspark.sql.functions import col, when

# Left-join df1 onto df2, keeping df1's colB under a distinct name
temp_df = df2.join(df1, "colA", how="left").select("*", df1.colB.alias("df1_colB")).drop(df1.colB)
# Recompute colB only where a match exists and colC is 'XYZ'
df2 = temp_df.withColumn("colB", when((col("colB") <= col("df1_colB"))
                                      & (temp_df.df1_colB.isNotNull())
                                      & (temp_df.colC == 'XYZ'),
                                      col("colB") + col("df1_colB")).otherwise(col("colB"))) \
             .drop("df1_colB")

Link to the full code in notebook w/ explanation:

https://github.com/bsthapit/Databricks_Solutions/blob/master/db_29380.ipynb

MikeK_
New Contributor II

Thanks for the answers, guys. I went with @Lee's suggestion because we need the code to run in SQL, but I will test the Python code @Bisharath Sthapit provided later on to see if there are any performance gains.

Cheers,

Mike

LyderIversen
New Contributor II

Hi!

This is way late, but did you ever find a solution to the CROSS APPLY part of your question? Is it possible to do CROSS APPLY in Spark SQL, or is there something you can use instead?
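For the most common CROSS APPLY use case, expanding a collection into rows, Spark SQL's LATERAL VIEW explode is the usual substitute, and LATERAL VIEW OUTER explode behaves like OUTER APPLY by keeping rows whose collection is empty or null. A minimal sketch, assuming a hypothetical table orders(id, items) where items is an array:

-- CROSS APPLY analogue: one output row per array element
SELECT o.id, item
FROM orders o
LATERAL VIEW explode(o.items) exploded AS item;

-- OUTER APPLY analogue: rows with empty/null arrays are kept (item is NULL)
SELECT o.id, item
FROM orders o
LATERAL VIEW OUTER explode(o.items) exploded AS item;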
