I have a Parquet file that I am trying to write to a Delta table:
(df.writeStream
    .format("delta")
    .option("checkpointLocation", f"{targetPath}/delta/{tableName}/__checkpoints")
    .trigger(once=True)
    .foreachBatch(processTable)
    .outputMode("append")
    .start())
The Parquet file is produced by an automated data pull from a table in SQL Server. Occasionally, a new column is added to that table. When this happens, we see the following error:
org.apache.spark.sql.catalyst.util.UnknownFieldException: [UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_FILE] Encountered unknown fields during parsing: <newColumn1>,<newColumn2>, which can be fixed by an automatic retry: true
According to the Databricks documentation, Auto Loader errors out by default when a new column is detected, and Databricks recommends incorporating retries at the workflow level.
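As far as I can tell, that corresponds to the default schema evolution mode, i.e. the same as adding this option to the readStream above:

# Assumption: we do not set this explicitly, so we get the default mode.
# With "addNewColumns", Auto Loader updates the schema location and then
# fails the stream, so that the next run picks up the new columns. That
# matches the "can be fixed by an automatic retry: true" text in the error.
.option("cloudFiles.schemaEvolutionMode", "addNewColumns")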
For our purposes, we do not want to implement retries in our workflow. We simply want the Delta table to pick up the new column(s) and ingest the new data without any errors (roughly the behavior sketched below).
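To make it concrete, the behavior we are hoping for inside the foreachBatch handler is something like this (a simplified sketch, not our real processTable):

# Sketch only: merge any new columns into the Delta table's schema on write
def processTable(batchDF, batchId):
    (batchDF.write
        .format("delta")
        .option("mergeSchema", "true")
        .mode("append")
        .save(f"{targetPath}/delta/{tableName}"))

As far as we can tell, though, the stream fails at the Auto Loader read (the UnknownFieldException above) before the batch function even runs, so changing the write options alone does not seem to be enough.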
Can anyone advise whether there is a way to do this?