The Databricks notebook failed yesterday due to a timestamp format issue.
Error:

SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2022-08-10 00:00:14.2760000' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
The notebook had been running fine before this; for example, the ts column contains timestamp values like "2022-08-07T23:59:57.9740000".
We use an explicit timestampFormat of 'yyyy-MM-dd HH:mm:ss.SSS' when reading the CSV files.
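The read looked roughly like this (simplified from the actual job; these are the standard Spark CSV reader options):

    # Simplified sketch of the original read with the explicit format
    delta_df = (
        spark.read
        .option("header", "true")
        .option("delimiter", ",")
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")  # three fractional digits
        .schema(schema_struct)
        .csv(filesToProcess)
    )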
However, we started getting nulls in the timestamp column because the values could not be converted.
So I changed the format to 'yyyy-MM-dd HH:mm:ss.SSSSSSS' and it worked for one of the objects, but the issue remained for another object.
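A standalone check (hypothetical repro, not our job code) shows the widened pattern parsing the sample value, while the old three-digit pattern raises the SparkUpgradeException quoted above under the default parser policy:

    from pyspark.sql import functions as F

    # Hypothetical repro: the seven-S pattern parses the sample value;
    # 'yyyy-MM-dd HH:mm:ss.SSS' fails on it with the new Spark 3 parser.
    sample = spark.createDataFrame([("2022-08-10 00:00:14.2760000",)], ["ts_str"])
    sample.select(
        F.to_timestamp("ts_str", "yyyy-MM-dd HH:mm:ss.SSSSSSS").alias("ts")
    ).show(truncate=False)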
However, when I completely removed the timestampFormat option, the read worked for this last object as well.
I am wondering what changed on the Databricks cluster that caused this to start failing. The timestamp values in the files are in the same format as before.
Here is the function, without the timestampFormat option, that works:
def ReadRawCSV(filesToProcess, header, delimiter, schema_struct):
    # No explicit timestampFormat: Spark falls back to its default timestamp parsing
    delta_df = spark.read.options(header=header, delimiter=delimiter).schema(schema_struct).csv(filesToProcess)
    return delta_df
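We call it like this, for example (the path and schema here are placeholders, not our real ones):

    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    # Placeholder schema and path for illustration
    schema = StructType([
        StructField("id", StringType(), True),
        StructField("ts", TimestampType(), True),
    ])
    df = ReadRawCSV("/mnt/raw/object_c/*.csv", header="true", delimiter=",", schema_struct=schema)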