Data Engineering
to_date not functioning as expected after introduction of arbitrary replaceWhere in Databricks 9.1 LTS

kolangareth
New Contributor III

I am trying to do a dynamic partition overwrite on a delta table using the replaceWhere option. This was working fine until I upgraded the Databricks runtime from 8.3.x to 9.1 LTS. I concatenate the 'year', 'month' and 'day' columns and then use the to_date function to define the replaceWhere condition. The documentation lists a new feature added in 9.1 LTS, "Delta now supports arbitrary replaceWhere", which has probably introduced a bug here. I am using the spark conf setting to fall back to the old behaviour.

spark.conf.set("spark.databricks.delta.replaceWhere.dataColumns.enabled", false)

It would be nice to have a better solution for this problem, though. Here is my write statement with replaceWhere.

dataframeToWrite.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .option("replaceWhere", s"to_date(concat(year, '-', month, '-', day)) in ($datesList)")
  .save(outputLocation)
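For context, $datesList is interpolated into the condition; presumably it is a comma-separated list of quoted date literals. A hypothetical construction (not shown in the thread):

// Hypothetical: build the IN-list of date literals for the replaceWhere condition
val dates = Seq("2021-10-01", "2021-10-02")
val datesList = dates.map(d => s"'$d'").mkString(", ")
// datesList == "'2021-10-01', '2021-10-02'"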

Error message:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 4 times, most recent failure: Lost task 0.3 in stage 11.0 (TID 13) (): java.lang.UnsupportedOperationException: Cannot generate code for expression: to_date(concat(cast(year#2810 as string), -, cast(month#2863 as string), -, cast(day#2916 as string)), None)

PS: I've also tried specifying the format in to_date, but it doesn't help either.


9 REPLIES

Anonymous
Not applicable

Hi @Prasanth Kolangareth - Welcome and thank you for your question. I'm sorry to hear about the trouble you're having. Let's give the community a chance to respond first. Thanks for your patience.

-werners-
Esteemed Contributor III (accepted solution)

Strange,

have you tried creating a column with to_date in your dataframe, so not in the replaceWhere condition?

df2 = df1.withColumn("myDate", to_date(..., "yyyyMMdd")) - something like that
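For illustration, a hedged Scala sketch of that suggestion using the OP's columns (the eventDate name and the yyyy-M-d format are assumptions; note the OP's objection below that the new column would then also have to exist in the target table):

import org.apache.spark.sql.functions.{col, concat_ws, to_date}

// Materialize the date as a real column before writing, instead of
// computing it inside the replaceWhere string
val df2 = df1.withColumn(
  "eventDate",
  to_date(
    concat_ws("-",
      col("year").cast("string"),
      col("month").cast("string"),
      col("day").cast("string")),
    "yyyy-M-d"))

df2.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .option("replaceWhere", s"eventDate in ($datesList)")
  .save(outputLocation)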

Hi @Werner Stinckens, the to_date function works fine everywhere else, such as in filter and withColumn. If you meant creating a new column in the dataframe and then using it in the replaceWhere - that is not something I want to do, because I would then have to add the new column to the target delta table as well.

-werners-
Esteemed Contributor III

My only guess is that either replaceWhere cannot take expressions, or the expression is faulty.

I think it is neither.

  • replaceWhere can take expressions - I've tried using concat without to_date and it doesn't fail.
  • The expression isn't faulty - it works with the spark conf setting to fall back to the old behaviour, and the same expression works in filter and withColumn.

Hubert-Dudek
Esteemed Contributor III

Can you try to generate that column earlier:

.withColumn("conditionColumn", to_date...

to simplify the code, so the replaceWhere will be easier to debug - right now it is complaining about to_date.

Hi @Hubert Dudek, to do that I must also add the new column to the target delta table, which I don't want to do. Adding the new column only to the dataframe being written will not work, for two reasons:

  1. Delta will give a schema mismatch error while writing.
  2. I think the replaceWhere condition is applied to the target delta table, not to the input dataframe, so the condition should reference columns that already exist in the table (see the sketch below).
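For illustration only (this workaround is not proposed in the thread): a condition phrased purely on the existing partition columns would satisfy both constraints, assuming the dates can be re-expressed as (year, month, day) values.

import java.time.LocalDate

// Hypothetical: expand the date list into a disjunction over the existing
// partition columns, so no new column and no to_date call is needed
val dates = Seq(LocalDate.of(2021, 10, 1), LocalDate.of(2021, 10, 2))
val replaceCondition = dates
  .map(d => s"(year = ${d.getYear} AND month = ${d.getMonthValue} AND day = ${d.getDayOfMonth})")
  .mkString(" OR ")

dataframeToWrite.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .option("replaceWhere", replaceCondition)
  .save(outputLocation)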

Hubert-Dudek
Esteemed Contributor III

Yes, but you can at least do it for testing, even with a small copy of the dataframe - that way it is easier to find the problem.
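A hedged sketch of that debugging step (the sample size and scratch path are made up):

// Hypothetical: reproduce the failure against a small scratch table
val sample = dataframeToWrite.limit(100)

// 1) Seed the scratch table
sample.write
  .format("delta")
  .partitionBy("year", "month", "day")
  .save("/tmp/replaceWhere_debug")

// 2) Re-run the failing overwrite against it
sample.write
  .format("delta")
  .mode("overwrite")
  .partitionBy("year", "month", "day")
  .option("replaceWhere", s"to_date(concat(year, '-', month, '-', day)) in ($datesList)")
  .save("/tmp/replaceWhere_debug")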

Anonymous
Not applicable

@Prasanth Kolangareth - Does Hubert's answer resolve the problem for you? If yes, would you be happy to mark it as best so that other members can find the solution more quickly?
