Data Engineering

DeltaRuntimeException: Keeping the source of the MERGE statement materialized has failed repeatedly.

dzmitry_tt
New Contributor

I'm using Auto Loader (on Azure Databricks) to read Parquet files and write their data into a Delta table.
schemaEvolutionMode is set to 'rescue'.

In foreachBatch I do the following:
1) Transform the DataFrame that was read;
2) Create a temp view from that DataFrame and merge it into the target Delta table using this statement:

merge into target using source
on target.pk = source.pk
when matched and target.timestamp_field > source.timestamp_field
then update set *
when not matched
then insert *
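For readers following along, the two steps above could be wired into a foreachBatch callback roughly as follows. This is only a sketch under assumptions, not the poster's actual code: the table, view, and column names (`target`, `source`, `pk`, `timestamp_field`) come from the statement above, while the helper names `build_merge_sql` and `upsert_batch` are hypothetical. The match condition is copied unchanged from the post.

```python
# Sketch only: names come from the post; both helper names are hypothetical.

def build_merge_sql(target: str, source_view: str,
                    key: str = "pk", ts: str = "timestamp_field") -> str:
    """Return the MERGE statement from the post as one SQL string."""
    return (
        f"MERGE INTO {target} AS t USING {source_view} AS s "
        f"ON t.{key} = s.{key} "
        f"WHEN MATCHED AND t.{ts} > s.{ts} THEN UPDATE SET * "
        f"WHEN NOT MATCHED THEN INSERT *"
    )


def upsert_batch(batch_df, batch_id: int) -> None:
    """foreachBatch callback: expose the micro-batch as a view, then merge it."""
    batch_df.createOrReplaceTempView("source")
    # Run the SQL on the session attached to the micro-batch DataFrame,
    # so that the temp view registered above is visible to the statement.
    batch_df.sparkSession.sql(build_merge_sql("target", "source"))
```

The callback would then be passed to the stream with `writeStream.foreachBatch(upsert_batch)`; any transformation step would go at the top of `upsert_batch`, before the view is created.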

On the first run of the job (while some historical data was being loaded), the first 4 batches went fine, and the 5th batch failed while executing the merge with:
"File "/databricks/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value raise Py4JJavaError( py4j.protocol.Py4JJavaError: An error occurred while calling o375.sql. : com.databricks.sql.transaction.tahoe.DeltaRuntimeException: Keeping the source of the MERGE statement materialized has failed repeatedly."


A new run of the job was successful, and all the data from the source files was loaded (including the data that caused the 5th batch of the first run to fail).

So I'm now trying to understand what caused the error and how to prevent it.
I wonder whether the error could be related to the fact that there could be duplicate records in the DataFrame that was read.

Photon Acceleration is disabled on the cluster; the Databricks Runtime version is 13.2 ML (Spark 3.4.0).

1 REPLY

Wojciech_BUK
Valued Contributor III

Hmm, you can't have duplicated data in the source DataFrame/batch, but in that case it should error out with a different error, like "Cannot perform Merge as multiple source rows matched and attempted to modify the same target row...".

Also, this behaviour after the rerun is strange.

Can you attach your full code and the full stack trace of the error?

Are you using this code as a baseline?

https://docs.gcp.databricks.com/en/structured-streaming/delta-lake.html#upsert-from-streaming-querie...
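On the duplicate-rows point raised above: if the micro-batch can contain several rows for the same `pk`, one common precaution is to keep only the newest row per key before merging. The sketch below is one way to do that and is not taken from the linked documentation; it assumes the column names from the question (`pk`, `timestamp_field`), and the helper names `ranked_sql` and `latest_per_key`, the view name `source_raw`, and the temporary column `_rn` are all hypothetical. Whether duplicates have anything to do with the reported error is not established in this thread.

```python
# Sketch only: column names come from the question; helper names, the view
# name "source_raw" and the temporary column "_rn" are hypothetical.

def ranked_sql(view: str, key: str = "pk", ts: str = "timestamp_field") -> str:
    """Number the rows of each key from newest (1) to oldest."""
    return (
        f"SELECT *, ROW_NUMBER() OVER "
        f"(PARTITION BY {key} ORDER BY {ts} DESC) AS _rn "
        f"FROM {view}"
    )


def latest_per_key(batch_df, view: str = "source_raw"):
    """Return the micro-batch reduced to one row per key (the newest one)."""
    batch_df.createOrReplaceTempView(view)
    ranked = batch_df.sparkSession.sql(ranked_sql(view))
    # Rows that tie on the timestamp are ranked in an arbitrary order,
    # so exactly one of them survives the filter below.
    return ranked.filter("_rn = 1").drop("_rn")
```

Inside the foreachBatch function, the result of `latest_per_key(batch_df)` would be registered as the merge source view in place of the raw batch.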
