SQL Update Join

MikeK_
New Contributor II

Hi,

I'm importing some data and stored procedures from SQL Server into Databricks, and I noticed that updates with joins are not supported in Spark SQL. What alternative can I use? Here's what I'm trying to do:

update t1
set t1.colB=CASE WHEN t2.colB>t1.colB THEN t2.colB ELSE t1.colB + t2.colB END
from table1 t1
inner join table2 t2 ON t1.colA=t2.ColA
where t2.colC='XYZ'

Another thing I was unable to do in Spark SQL is CROSS APPLY and OUTER APPLY; are there alternatives for those two?

Thanks in advance.

Mike

ACCEPTED SOLUTION

lee
Contributor

Hi @Mike K.,

Since Spark has a different, distributed architecture than a traditional RDBMS, you'll certainly find that some functionality isn't supported 1:1. For your particular issue, I would suggest performing the conditional logic and filtering in one temporary structure, and then using the MERGE INTO statement: https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html. This does assume that you are using the (awesome) Delta architecture and storage format, which handles the ACID-compliant transactions under the hood. There are certainly other ways to accomplish this, but here is one such way:

CREATE TEMPORARY VIEW joined AS
SELECT dt1.colA, CASE WHEN dt2.colB > dt1.colB THEN dt2.colB ELSE dt1.colB + dt2.colB END AS colB
FROM dt1 INNER JOIN dt2 ON dt1.colA = dt2.colA
WHERE dt2.colC = 'XYZ';

MERGE INTO dt1
USING joined ON dt1.colA = joined.colA
WHEN MATCHED THEN UPDATE SET colB = joined.colB;

Here is my working notebook, which reproduces the failure and shows the steps taken to reach this final answer:

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3249...

or on my GitHub: https://github.com/leedabee/databricks-forum-support-notebooks/tree/master/db-forum-29380

Hope this gives you some help!


REPLIES


BhaktiAbhyankar
New Contributor II

I have an issue where I wanted to convert a SQL UPDATE with JOIN query into a MERGE.

I do not have a primary or unique key in either of the tables (the ones the join is performed on), so the merge fails with:

Error in SQL statement: UnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways.

Can someone please help me resolve this issue?

Hi @BhaktiAbhyankar, I'm facing the same kind of issue; my table doesn't have any primary keys to join on, so I went with 'leftanti' and union. Is this issue resolved for you? If you found a better approach, please share.
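That error means more than one row of the merge source matched the same target row, which MERGE INTO rejects. One way to resolve it, sketched below with hypothetical table names (source_table, target_table) standing in for your own, is to deduplicate the source down to one row per join key before merging:

-- Sketch with hypothetical names: keep one source row per join key so
-- MERGE never sees conflicting matches for a single target row.
CREATE OR REPLACE TEMPORARY VIEW dedup_src AS
SELECT colA, colB
FROM (
  SELECT colA, colB,
         ROW_NUMBER() OVER (PARTITION BY colA ORDER BY colB DESC) AS rn
  FROM source_table
) ranked
WHERE rn = 1;

MERGE INTO target_table t
USING dedup_src s
ON t.colA = s.colA
WHEN MATCHED THEN UPDATE SET t.colB = s.colB;

The ORDER BY inside ROW_NUMBER decides which duplicate survives, so choose it to match whatever business rule should win.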

BisharathSthapi
New Contributor II

@Mike K. I have added a solution without using views.

from pyspark.sql.functions import col, when

# Bring df1's colB alongside df2's columns under a distinct name.
temp_df = df2.join(df1, "colA", how="left").select("*", df1.colB.alias("df1_colB")).drop(df1.colB)
# Recompute colB where a match exists and colC is 'XYZ'; otherwise keep it as-is.
df2 = temp_df.withColumn("colB", when((col("colB") <= col("df1_colB"))
                                      & (temp_df.df1_colB.isNotNull())
                                      & (temp_df.colC == 'XYZ'),
                                      col("colB") + col("df1_colB"))
                                 .otherwise(col("colB"))) \
             .drop("df1_colB")

Link to the full code in notebook w/ explanation:

https://github.com/bsthapit/Databricks_Solutions/blob/master/db_29380.ipynb

MikeK_
New Contributor II

Thanks for the answers, guys. I went with @Lee's suggestion, because we need the code to run in SQL, but I will test the Python code @Bisharath Sthapit provided later on, to see if there are any performance gains.

Cheers,

Mike

LyderIversen
New Contributor II

Hi!

This is way late, but did you ever find a solution to the CROSS APPLY part of your question? Is it possible to do CROSS APPLY in Spark SQL, or is there something you can use instead?
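For the most common CROSS APPLY pattern, applying a table-valued expression per row such as exploding a collection, Spark SQL's LATERAL VIEW explode is the usual substitute, with LATERAL VIEW OUTER playing the role of OUTER APPLY. A minimal sketch, assuming a hypothetical orders table with an array column items:

-- CROSS APPLY equivalent: one output row per array element; rows whose
-- items array is empty or NULL are dropped.
SELECT o.order_id, item
FROM orders o
LATERAL VIEW explode(o.items) exploded AS item;

-- OUTER APPLY equivalent: rows with an empty or NULL array are kept,
-- with item coming back as NULL.
SELECT o.order_id, item
FROM orders o
LATERAL VIEW OUTER explode(o.items) exploded AS item;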
