โ01-13-2022 07:46 AM
I am trying to do a dynamic partition overwrite on delta table using replaceWhere option. This was working fine until I upgraded the DB runtime to 9.1 LTS from 8.3.x. I am concatenating 'year', 'month' and 'day' columns and then using to_date function to define the replaceWhere condition. I could find in the documentation that there was a new feature added as part of 9.1 LTS - Delta now supports arbitrary replaceWhere. Probably this has introduced a bug here. I am using the spark conf setting to fallback to old behaviour.
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", false)
But would be nice to have a better solution for this problem. Here is my write statement with replaceWhere.
dataframeToWrite.write.format("delta").mode("overwrite").partitionBy("year","month","day").option("replaceWhere", s"to_date(concat(year, '-', month, '-', day)) in ($datesList)").save(outputLocation)
Error message:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0 (TID 13) (): java.lang.UnsupportedOperationException: Cannot generate code for expression: to_date(concat(cast(year#2810 as string), -, cast(month#2863 as string), -, cast(day#2916 as string)), None)
PS: I've also tried specifying the format in to_date, but it doesn't help either.
โ01-14-2022 12:10 AM
strange,
have you tried to create a column with to_date in your dataframe, so not in the replaceWhere condition?
df2 = df1.withColumn(to_date(..., "yyyyMMdd")) something like that
โ01-13-2022 09:52 AM
Hi @Prasanth Kolangarethโ - Welcome and thank you for your question. I'm sorry to hear about the trouble you're having. Let's give the community a chance to respond first. Thanks for your patience.
โ01-14-2022 12:10 AM
strange,
have you tried to create a column with to_date in your dataframe, so not in the replaceWhere condition?
df2 = df1.withColumn(to_date(..., "yyyyMMdd")) something like that
โ01-14-2022 01:04 AM
Hi @Werner Stinckensโ , the to_date function works fine everywhere else like filter, withColumn etc. If you meant to create a new column in the dataframe and then use it in the replaceWhere - it's not something I want to do, because to do that I must add this new column in the target delta table also.
โ01-14-2022 02:43 AM
my only guess is that either the replaceWhere cannot take expressions, or the expression is faulty.
โ01-14-2022 03:22 AM
I think it is neither.
โ01-14-2022 06:00 AM
Can you try to generate that column earlier:
.withColumn("conditionColumn", to_date...
to simplify code so replaceWhere will be easier to debug as now it is complaining about to_date
โ01-14-2022 06:20 AM
Hi @Hubert Dudekโ , to do that I must also add this new column in the target delta table, which I don't want to do. Adding this new column only in the dataframe to write will not work because of two reasons,
โ01-14-2022 06:54 AM
yes but you can at least do it for testing even with small copy of dataframe, this way is easier to find the problem
โ01-26-2022 08:32 AM
@Prasanth Kolangarethโ - Does Hubert's answer resolve the problem for you? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?