I'm using Autoloader (in Azure Databricks) to read parquet files and write their data into the Delta table.
schemaEvolutionMode is set to 'rescue'.
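For reference, the read side is set up roughly like this (paths are placeholders, not the real ones):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "parquet")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/_schema")  # placeholder path
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .load("/mnt/landing/parquet")                                     # placeholder path
)
```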
In foreachBatch I do the following (a sketch of the whole function is shown after the merge statement below):
1) Transform the read dataframe;
2) Create a temp view based on the read dataframe and merge it into the target Delta table with this statement:
merge into target using source
on target.pk = source.pk
when matched and target.timestamp_field > source.timestamp_field
  then update set *
when not matched
  then insert *
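The foreachBatch function is essentially this (the target table name and checkpoint path are placeholders):

```python
def upsert_batch(batch_df, batch_id):
    # 1) transform the micro-batch (actual transformations omitted here)
    transformed_df = batch_df

    # 2) expose the micro-batch as a temp view and merge it into the target table
    transformed_df.createOrReplaceTempView("source")
    transformed_df.sparkSession.sql("""
        merge into target_table as target      -- placeholder table name
        using source
        on target.pk = source.pk
        when matched and target.timestamp_field > source.timestamp_field
          then update set *
        when not matched
          then insert *
    """)

(
    raw_df.writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/target")  # placeholder path
    .start()
)
```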
On the first run of the job (when some historical data was being loaded), the first 4 batches went fine, but the 5th batch failed while executing the merge with:
"File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o375.sql. : com.databricks.sql.transaction.tahoe.DeltaRuntimeException: Keeping the source of the MERGE statement materialized has failed repeatedly."
A new run of the job was successful and all the data from the source files was loaded (including the data that caused the failure of the 5th batch of the 1st run).
So I'm now trying to understand what caused the error and how to prevent it.
I wonder whether the error could be related to the fact that there may be duplicate records in the read dataframe.
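If duplicates are the issue, I assume the micro-batch could be deduplicated on the merge key before running the merge, along these lines (using transformed_df from the foreachBatch sketch above; pk and timestamp_field as in the merge statement):

```python
from pyspark.sql import Window, functions as F

# Hypothetical deduplication step: keep a single row per merge key,
# preferring the most recent timestamp, before running the merge.
dedup_window = Window.partitionBy("pk").orderBy(F.col("timestamp_field").desc())

deduped_df = (
    transformed_df
    .withColumn("_rn", F.row_number().over(dedup_window))
    .filter(F.col("_rn") == 1)
    .drop("_rn")
)
```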
Photon Acceleration is disabled on the cluster, and the Databricks Runtime version is 13.2 ML (Spark 3.4.0).