Hello,
I am developing an application in which multiple processes write their results to a common Delta table as blind appends. According to the docs I've read online (https://docs.databricks.com/delta/concurrency-control.html#protocolchangedexception and https://docs.delta.io/0.4.0/delta-concurrency.html), append-only writes should never cause concurrency conflicts. Nevertheless, I am running into the following error:
ProtocolChangedException: The protocol version of the Delta table has been changed by a concurrent update. This happens when multiple writers are writing to an empty directory. Creating the table ahead of time will avoid this conflict. Please try the operation again.
Conflicting commit: {"timestamp":1642800186194,"userId":"61587887627726","userName":"USERNAME","operation":"WRITE","operationParameters":{"mode":Append,"partitionBy":["Date","GroupId","Scope"]},"notebook":{"notebookId":"241249750697631"},"clusterId":"","isolationLevel":"WriteSerializable","isBlindAppend":true,"operationMetrics":{"numFiles":"56","numOutputBytes":"267086","numOutputRows":"61"}}
Refer to https://docs.microsoft.com/azure/databricks/delta/concurrency-control for more details.
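The error suggests creating the table ahead of time. If I understand that correctly, it would mean committing an empty write once, before any of the concurrent writers start, so the table's protocol and metadata already exist. A rough sketch of what I have in mind (createTableFirst is my own hypothetical helper; ds and dtPath are the same as in the write code below):

import org.apache.spark.sql.Dataset

// Hypothetical one-time setup: commit an empty Delta write so the
// table (protocol + schema + partitioning) exists at dtPath before
// any concurrent appends run against it.
def createTableFirst(ds: Dataset[Class], dtPath: String): Unit = {
  ds.limit(0)                                 // zero rows, schema only
    .write.format("delta")
    .partitionBy("Date", "GroupId", "Scope")
    .mode("append")
    .save(dtPath)
}

I'd like to confirm whether that is the intended fix before restructuring the jobs.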
Some more information for context:
- The code that writes the data is:
import org.apache.spark.sql.Dataset

// Blind append into the shared Delta table, partitioned by Date, GroupId, Scope.
def saveToDeltaTable(ds: Dataset[Class], dtPath: String): Unit = {
  ds.write.format("delta")
    .partitionBy("Date", "GroupId", "Scope")
    .option("mergeSchema", "true")
    .mode("append")
    .save(dtPath)
}
- I'm unable to reproduce this consistently.
- Both writes currently run on the same cluster, though eventually they won't.
- The partition columns "Date" and "GroupId" have the same values for every write, but the partition column "Scope" differs between writes.
Given the description of ProtocolChangedException in the docs, it doesn't make much sense to me that this fails. My only thought is that it could be caused by the mergeSchema option, even though it currently has no effect because the schema never changes. In the meantime, since the error says the operation can simply be retried, I'm considering a retry wrapper like the sketch below, though I'd rather understand the root cause.
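A minimal retry sketch (saveWithRetry is my own hypothetical helper, and I'm assuming Delta's conflict exceptions, including ProtocolChangedException, extend java.util.ConcurrentModificationException):

import org.apache.spark.sql.Dataset

// Hypothetical wrapper: retry the blind append a few times when a
// concurrent-modification conflict (e.g. ProtocolChangedException) occurs.
def saveWithRetry(ds: Dataset[Class], dtPath: String, attempts: Int = 3): Unit = {
  try {
    saveToDeltaTable(ds, dtPath)
  } catch {
    case _: java.util.ConcurrentModificationException if attempts > 1 =>
      saveWithRetry(ds, dtPath, attempts - 1) // retry with one fewer attempt left
  }
}

Does that approach make sense here, or is there a way to avoid the conflict entirely?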
Thank you,
Miguel