12-04-2019 10:51 AM
Hi,
I'm importing some data and stored procedures from SQL Server into Databricks, and I noticed that UPDATEs with joins are not supported in Spark SQL. What alternative can I use? Here's what I'm trying to do:
UPDATE t1
SET t1.colB = CASE WHEN t2.colB > t1.colB THEN t2.colB ELSE t1.colB + t2.colB END
FROM table1 t1
INNER JOIN table2 t2 ON t1.colA = t2.colA
WHERE t2.colC = 'XYZ'
Other things I was unable to do in Spark SQL are CROSS APPLY and OUTER APPLY; are there any alternatives for those two?
Thanks in advance.
Mike
12-04-2019 01:18 PM
Hi @Mike K.,
As Spark has a different distributed architecture than a traditional RDBMS, you'll certainly find that some functionality isn't supported 1:1. For your particular issue, I would suggest performing the conditional logic and filtering in one temporary structure, and then using the MERGE INTO statement: https://docs.databricks.com/spark/latest/spark-sql/language-manual/merge-into.html. This does assume you are using the (awesome) Delta architecture and storage format, which handles the ACID-compliant transactions under the hood. There are certainly other ways to accomplish this, but here is one such way:
CREATE TEMPORARY VIEW joined AS
SELECT dt1.colA,
       CASE WHEN dt2.colB > dt1.colB THEN dt2.colB ELSE dt1.colB + dt2.colB END AS colB
FROM dt1
INNER JOIN dt2 ON dt1.colA = dt2.colA
WHERE dt2.colC = 'XYZ';

MERGE INTO dt1
USING joined ON dt1.colA = joined.colA
WHEN MATCHED THEN UPDATE SET colB = joined.colB;
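For completeness, here is a minimal setup under which the two statements above run end to end; the schemas and sample rows are assumptions standing in for table1 and table2 from the question.

-- Hypothetical Delta tables standing in for table1 and table2.
CREATE TABLE dt1 (colA INT, colB INT) USING DELTA;
CREATE TABLE dt2 (colA INT, colB INT, colC STRING) USING DELTA;
INSERT INTO dt1 VALUES (1, 10), (2, 20);
INSERT INTO dt2 VALUES (1, 15, 'XYZ'), (2, 5, 'XYZ');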
My working notebook, which reproduces the failure and the steps taken to reach this final answer, is on my GitHub: https://github.com/leedabee/databricks-forum-support-notebooks/tree/master/db-forum-29380
Hope this gives you some help!
02-02-2021 09:23 PM
I have an issue where I want to convert a SQL UPDATE with JOIN query into a MERGE.
I do not have a primary or unique key in either table (on which the join is performed), and hence I get this error from the merge:
Error in SQL statement: UnsupportedOperationException: Cannot perform Merge as multiple source rows matched and attempted to modify the same target row in the Delta table in possibly conflicting ways. Can someone please help me resolve this issue?
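A common way out of this error is to make the MERGE source unambiguous by keeping exactly one row per join key before merging. Below is a sketch that reuses the joined view from the answer above; the "keep the largest colB per colA" rule is an assumption, so substitute whatever tie-breaking logic fits your data.

-- Keep a single source row per colA (largest colB wins; an assumed rule).
CREATE OR REPLACE TEMPORARY VIEW deduped AS
SELECT colA, colB
FROM (
  SELECT colA, colB,
         row_number() OVER (PARTITION BY colA ORDER BY colB DESC) AS rn
  FROM joined
) t
WHERE t.rn = 1;

MERGE INTO dt1
USING deduped ON dt1.colA = deduped.colA
WHEN MATCHED THEN UPDATE SET colB = deduped.colB;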
11-11-2023 12:13 AM
Hi @BhaktiAbhyankar, I'm facing the same kind of issue; my table doesn't have any primary keys to join on, so I went with a 'leftanti' join and a union. Was this issue resolved for you? If you found a better approach, please share.
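For reference, here is a sketch of that anti-join-plus-union pattern in Spark SQL, reusing the dt1 and joined names from the accepted answer (an assumption). It sidesteps the unique-key requirement of MERGE, though you still need to deduplicate the source if the target must end up with one row per key.

-- Rebuild the target as (rows with no update) UNION ALL (updated rows).
-- Assumes dt1 is a Delta table, so it can be read and overwritten in the
-- same statement under snapshot isolation.
CREATE OR REPLACE TEMPORARY VIEW rebuilt AS
SELECT t.colA, t.colB
FROM dt1 t
LEFT ANTI JOIN joined j ON t.colA = j.colA
UNION ALL
SELECT colA, colB FROM joined;

INSERT OVERWRITE TABLE dt1
SELECT colA, colB FROM rebuilt;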
12-07-2019 03:30 PM
@Mike K., I have added a solution without using views.
from pyspark.sql.functions import col, when

# Left-join df1 onto df2 by colA, carrying df1's colB along as df1_colB.
temp_df = (df2.join(df1, "colA", how="left")
              .select("*", df1.colB.alias("df1_colB")).drop(df1.colB))
# Where a match exists, colC is 'XYZ', and colB <= df1_colB, replace colB
# with the sum; otherwise keep it as-is.
df2 = temp_df.withColumn(
    "colB",
    when((col("colB") <= col("df1_colB")) & col("df1_colB").isNotNull()
         & (col("colC") == "XYZ"), col("colB") + col("df1_colB"))
    .otherwise(col("colB"))).drop("df1_colB")
Link to the full code in a notebook, with explanation:
https://github.com/bsthapit/Databricks_Solutions/blob/master/db_29380.ipynb
12-09-2019 08:59 AM
Thanks for the answers, guys. I went with @Lee's suggestion because we need the code to run in SQL, but I will test the Python code @Bisharath Sthapit provided later on, to see if there are any performance gains.
Cheers,
Mike
12-04-2020 04:50 AM
Hi!
This is way late, but did you ever find a solution to the CROSS APPLY part of your question? Is it possible to do CROSS APPLY in Spark SQL, or is there something you can use instead?
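For anyone landing here later: Spark SQL's LATERAL VIEW clause covers the most common CROSS APPLY use case, expanding a collection into one row per element, and LATERAL VIEW OUTER behaves like OUTER APPLY (newer Spark versions also accept LATERAL subqueries in the FROM clause). A sketch with assumed table and column names:

-- CROSS APPLY style: one output row per element of the (assumed) items array;
-- rows with empty or null arrays drop out.
SELECT o.orderId, x.item
FROM orders o
LATERAL VIEW explode(o.items) x AS item;

-- OUTER APPLY style: keep rows even when the array is empty or null.
SELECT o.orderId, x.item
FROM orders o
LATERAL VIEW OUTER explode(o.items) x AS item;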