<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Re: TimestampFormat issue in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/timestampformat-issue/m-p/34823#M25540</link>
    <description>&lt;P&gt;You have probably solved this issue by now, but for the sake of anyone who encounters it again, here is the solution that worked for me:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
    <pubDate>Tue, 28 Mar 2023 20:25:13 GMT</pubDate>
    <dc:creator>searchs</dc:creator>
    <dc:date>2023-03-28T20:25:13Z</dc:date>
    <item>
      <title>TimestampFormat issue</title>
      <link>https://community.databricks.com/t5/data-engineering/timestampformat-issue/m-p/34822#M25539</link>
      <description>&lt;P&gt;The Databricks notebook failed yesterday due to a timestamp format issue.&lt;/P&gt;&lt;P&gt;Error:&lt;/P&gt;&lt;P&gt;"SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2022-08-10 00:00:14.2760000' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string."&lt;/P&gt;&lt;P&gt;The notebook had been running fine before; the ts column contains timestamp values like "2022-08-07T23:59:57.9740000".&lt;/P&gt;&lt;P&gt;We use an explicit timestampFormat of 'yyyy-MM-dd HH:mm:ss.SSS' when reading the CSV files.&lt;/P&gt;&lt;P&gt;However, we started getting null timestamp values because the strings could not be parsed.&lt;/P&gt;&lt;P&gt;So I changed the format to 'yyyy-MM-dd HH:mm:ss.SSSSSSS' and that fixed one of the objects, but the issue remained for another.&lt;/P&gt;&lt;P&gt;When I removed the timestampFormat option completely, it worked for this last object as well.&lt;/P&gt;&lt;P&gt;I am wondering what changed on the Databricks cluster that it started failing; the timestamp values in the files are in the same format as before.&lt;/P&gt;&lt;P&gt;Here is the function, without the timestampFormat option, that works:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;def ReadRawCSV(filesToProcess, header, delimiter, schema_struct):
  delta_df = spark.read.options(header=header, delimiter=delimiter).schema(schema_struct).csv(filesToProcess)
  return delta_df&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Thu, 11 Aug 2022 05:31:58 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/timestampformat-issue/m-p/34822#M25539</guid>
      <dc:creator>irfanaziz</dc:creator>
      <dc:date>2022-08-11T05:31:58Z</dc:date>
    </item>
    <item>
      <title>Re: TimestampFormat issue</title>
      <link>https://community.databricks.com/t5/data-engineering/timestampformat-issue/m-p/34823#M25540</link>
      <description>&lt;P&gt;You have probably solved this issue by now, but for the sake of anyone who encounters it again, here is the solution that worked for me:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;/P&gt;</description>
      <pubDate>Tue, 28 Mar 2023 20:25:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/timestampformat-issue/m-p/34823#M25540</guid>
      <dc:creator>searchs</dc:creator>
      <dc:date>2023-03-28T20:25:13Z</dc:date>
    </item>
  </channel>
</rss>

