I'm using Autoloader (in Azure Databricks) to read parquet files and write their data into the Delta table.
schemaEvolutionMode is set to 'rescue'.
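For reference, the read side is set up roughly like this (paths are placeholders, not the real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/_schema")  # placeholder path
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/mnt/landing/parquet")                                     # placeholder path
)
```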
In foreachBatch I do the following (a sketch of the whole function is shown after the merge statement below):
1) Transform the read dataframe;
2) Create a temp view based on the read dataframe and merge it into the target Delta table with this statement:
merge into target using source
on target.pk = source.pk
when matched and target.timestamp_field > source.timestamp_field
  then update set *
when not matched
  then insert *
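The foreachBatch function is essentially this (the target table name and checkpoint path are placeholders):

```python
def upsert_batch(batch_df, batch_id):
    # 1) transform the micro-batch (actual transformations omitted here)
    transformed_df = batch_df

    # 2) expose the micro-batch as a temp view and merge it into the target table
    transformed_df.createOrReplaceTempView("source")
    transformed_df.sparkSession.sql("""
        merge into target_table as target      -- placeholder table name
        using source
        on target.pk = source.pk
        when matched and target.timestamp_field > source.timestamp_field
          then update set *
        when not matched
          then insert *
    """)

(
    raw_df.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/target")  # placeholder path
    .start()
)
```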
On the first run of the job (when some historical data was being loaded), the first 4 batches went fine, but the 5th batch failed while executing the merge with:
"File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o375.sql. : com.databricks.sql.transaction.tahoe.DeltaRuntimeException: Keeping the source of the MERGE statement materialized has failed repeatedly."
A new run of the job was successful and all the data from the source files was loaded (including the data that caused the failure of the 5th batch of the 1st run).
So I'm now trying to understand what caused the error and how to prevent it.
I wonder whether the error could be related to the fact that there may be duplicate records in the read dataframe.
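If duplicates are the issue, I assume the micro-batch could be deduplicated on the merge key before running the merge, along these lines (using transformed_df from the foreachBatch sketch above; pk and timestamp_field as in the merge statement):

```python
from pyspark.sql import Window, functions as F

# Hypothetical deduplication step: keep a single row per merge key,
# preferring the most recent timestamp, before running the merge.
dedup_window = Window.partitionBy("pk").orderBy(F.col("timestamp_field").desc())

deduped_df = (
    transformed_df
    .withColumn("_rn", F.row_number().over(dedup_window))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)
```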
Photon Acceleration is disabled on the cluster, and the Databricks Runtime version is 13.2 ML (Spark 3.4.0).