01-13-2022 07:46 AM
I am trying to do a dynamic partition overwrite on a Delta table using the replaceWhere option. This was working fine until I upgraded the Databricks runtime from 8.3.x to 9.1 LTS. I concatenate the 'year', 'month' and 'day' columns and then apply the to_date function to define the replaceWhere condition. I found in the documentation that a new feature was added in 9.1 LTS: Delta now supports arbitrary replaceWhere expressions. This change has probably introduced a bug here. As a workaround, I am using the Spark conf setting to fall back to the old behaviour:
spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", false)
But it would be nice to have a better solution to this problem. Here is my write statement with replaceWhere:
dataframeToWrite.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .option("replaceWhere", s"to_date(concat(year, '-', month, '-', day)) in ($datesList)")
  .save(outputLocation)
Error message:
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0 (TID 13) (): java.lang.UnsupportedOperationException: Cannot generate code for expression: to_date(concat(cast(year#2810 as string), -, cast(month#2863 as string), -, cast(day#2916 as string)), None)
PS: I've also tried specifying the format in to_date, but it doesn't help either.
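Since the partition columns are plain year/month/day values, one possible workaround (a sketch only, not verified on 9.1 LTS) is to express the replaceWhere condition purely in terms of the partition columns, avoiding to_date and concat entirely. The helper name `partitionPredicate` and the "yyyy-MM-dd" input format are assumptions for illustration:

```scala
// Hypothetical helper: turn a list of "yyyy-MM-dd" strings into a
// partition-only predicate such as
//   (year = 2022 AND month = 1 AND day = 13) OR (year = 2022 AND month = 1 AND day = 14)
// so replaceWhere never has to evaluate an expression over data columns.
def partitionPredicate(dates: Seq[String]): String =
  dates.map { d =>
    val Array(y, m, dd) = d.split("-")
    // .toInt drops leading zeros so the literals match integer partition columns
    s"(year = ${y.toInt} AND month = ${m.toInt} AND day = ${dd.toInt})"
  }.mkString(" OR ")
```

The write would then use `.option("replaceWhere", partitionPredicate(dates))` in place of the to_date expression, which sidesteps the codegen error because only partition columns appear in the condition.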
01-13-2022 09:52 AM
Hi @Prasanth Kolangareth - Welcome and thank you for your question. I'm sorry to hear about the trouble you're having. Let's give the community a chance to respond first. Thanks for your patience.
01-14-2022 12:10 AM
Strange.
Have you tried creating the column with to_date in your dataframe, i.e. not in the replaceWhere condition?
df2 = df1.withColumn("date", to_date(..., "yyyyMMdd")) or something like that
01-14-2022 01:04 AM
Hi @Werner Stinckens, the to_date function works fine everywhere else, such as in filter, withColumn, etc. If you meant creating a new column in the dataframe and then using it in the replaceWhere: that's not something I want to do, because I would then also have to add this new column to the target Delta table.
01-14-2022 02:43 AM
my only guess is that either the replaceWhere cannot take expressions, or the expression is faulty.
01-14-2022 03:22 AM
I think it is neither.
01-14-2022 06:00 AM
Can you try generating that column earlier:
.withColumn("conditionColumn", to_date(...))
That simplifies the code, so the replaceWhere will be easier to debug, since it is currently complaining about to_date.
01-14-2022 06:20 AM
Hi @Hubert Dudek, to do that I would also have to add this new column to the target Delta table, which I don't want to do. Adding the new column only in the dataframe being written will not work, for two reasons.
01-14-2022 06:54 AM
Yes, but you can at least do it for testing, even with a small copy of the dataframe; that way it is easier to find the problem.
01-26-2022 08:32 AM
@Prasanth Kolangareth - Does Hubert's answer resolve the problem for you? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?