
ProtocolChangedException on concurrent blind appends to delta table

MiguelKulisic
New Contributor II

Hello,

I am developing an application that runs multiple processes, each writing its results to a common Delta table as blind appends. According to the docs I've read online (https://docs.databricks.com/delta/concurrency-control.html#protocolchangedexception and https://docs.delta.io/0.4.0/delta-concurrency.html), append-only writes should never cause concurrency conflicts. Yet I am running into the following error:

ProtocolChangedException: The protocol version of the Delta table has been changed by a concurrent update. This happens when multiple writers are writing to an empty directory. Creating the table ahead of time will avoid this conflict. Please try the operation again. Conflicting commit: {"timestamp":1642800186194,"userId":"61587887627726","userName":"USERNAME","operation":"WRITE","operationParameters":{"mode":Append,"partitionBy":["Date","GroupId","Scope"]},"notebook":{"notebookId":"241249750697631"},"clusterId":"","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"56","numOutputBytes":"267086","numOutputRows":"61"}} Refer to https://docs.microsoft.com/azure/databricks/delta/concurrency-control for more details.
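The message itself suggests a workaround: create the table ahead of time so that concurrent writers never race to initialize an empty directory. A minimal sketch of that, assuming a Delta Lake version that ships the DeltaTable builder API (the column types below are illustrative guesses, not my actual schema):

import io.delta.tables.DeltaTable

// Create the table once, before any concurrent writers start.
// `spark` is the active SparkSession (predefined in Databricks notebooks).
DeltaTable.createIfNotExists(spark)
  .location(dtPath)
  .addColumn("Date", "DATE")      // illustrative types, not the real schema
  .addColumn("GroupId", "STRING")
  .addColumn("Scope", "STRING")
  .partitionedBy("Date", "GroupId", "Scope")
  .execute()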

Some more information for context:

  1. The code that writes the data is:
def saveToDeltaTable(ds: Dataset[Class], dtPath: String): Unit = {
  // Blind append: the job never reads existing table data, so per the
  // concurrency docs it should not conflict with other appends.
  ds.write.format("delta")
    .partitionBy("Date", "GroupId", "Scope")
    .option("mergeSchema", "true") // currently a no-op: all writers share one schema
    .mode("append")
    .save(dtPath)
}
  2. I'm unable to recreate this consistently (see the repro sketch after this list).
  3. Both writes currently run on the same cluster; eventually they won't.
  4. The partition columns "Date" and "GroupId" have the same values for each write, but the partition column "Scope" differs between writes.
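To make point 2 concrete, here is a hypothetical sketch of the race: two writers appending concurrently through saveToDeltaTable into the same, initially empty, path (makeDataset stands in for whatever builds each writer's Dataset):

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Two blind appends racing into the same (initially empty) path.
// If neither commit finds an existing table, both try to establish the
// table's protocol, and the losing commit sees ProtocolChangedException.
val writers = Seq("scopeA", "scopeB").map { scope =>
  Future(saveToDeltaTable(makeDataset(scope), dtPath)) // makeDataset: hypothetical helper
}
writers.foreach(w => Await.result(w, 10.minutes))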

Given the description of ProtocolChangedException in the docs, it doesn't make much sense to me that this is crashing. My only thought is that the mergeSchema flag could be responsible, even though it's currently doing nothing, since every writer uses the same schema.

Thank you,

Miguel


2 REPLIES

Anonymous
Not applicable

Hi there, @Miguel Kulisic! It's nice to meet you, and thank you for coming to the community for help. We'll give the rest of the community a chance to respond before we come back to this. Thank you in advance for your patience! 🙂

-werners-
Esteemed Contributor III
(Accepted Solution)

I think you are right. mergeSchema can change the schema of the table, but if two writers append to the same table with different schemas at the same time, which schema should the table end up with?

Can you check whether both writers actually produce the same schema, or else remove the mergeSchema option?
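A minimal sketch of that check (assuming ds, dtPath, and the active spark session from the original snippet; note that exact StructType equality is stricter than what Delta enforces, since nullability differences also count as a mismatch here):

// Compare the incoming Dataset's schema to the table's current schema
// before appending; if they already match, mergeSchema buys you nothing.
val tableSchema = spark.read.format("delta").load(dtPath).schema
assert(ds.schema == tableSchema,
  s"Schema drift:\n${ds.schema.treeString}\nvs\n${tableSchema.treeString}")

// The same append as before, with mergeSchema dropped:
ds.write.format("delta")
  .partitionBy("Date", "GroupId", "Scope")
  .mode("append")
  .save(dtPath)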
