Data Engineering

Are Concurrent Writes from Multiple Databricks Clusters to the Same Delta Table on S3 Supported?

ptambe
New Contributor III

Does Databricks support writing to the same Delta table from multiple clusters concurrently? I am specifically interested in whether a solution for https://github.com/delta-io/delta/issues/41 has been implemented in Databricks, or whether you have any recommendations for achieving concurrent writes to the same Delta table on S3.

1 ACCEPTED SOLUTION

Accepted Solutions

dennyglee
New Contributor III

Please note that the issue referenced above, [Storage System] Support for AWS S3 (multiple clusters/drivers/JVMs), concerns Delta Lake OSS. As noted in that issue and in Issue 324, as of this writing S3 lacks putIfAbsent transactional consistency. For Delta Lake OSS, the community is working on PR 339 to resolve this.

That said, your question is specific to Databricks' implementation of Delta, which allows multiple clusters to write concurrently to the same Delta table by using the S3 commit service. The pertinent quote is:

Databricks runs a commit service that coordinates writes to Amazon S3 from multiple clusters. This service runs in the Databricks control plane.

For more information, please refer to Configure Databricks S3 commit service-related settings
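
To make the putIfAbsent point above concrete, here is a minimal sketch of why that primitive matters for a transaction log. It is not Databricks' commit service or Delta's actual code: the local filesystem stands in for the object store, and the helper names (`put_if_absent`, `race_for_version`) and writer names are hypothetical. The only detail taken from Delta itself is that each commit is a file named after a zero-padded 20-digit version number.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def commit_path(log_dir: str, version: int) -> str:
    # Delta names each commit file after a zero-padded 20-digit version.
    return os.path.join(log_dir, f"{version:020d}.json")


def put_if_absent(path: str, payload: bytes) -> bool:
    """Create `path` only if it does not exist; return True if this call won."""
    try:
        # O_CREAT | O_EXCL makes "check, then create" a single atomic step
        # on a local filesystem. This stands in for the storage-level
        # guarantee that the thread says S3 lacked at the time.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL)
    except FileExistsError:
        return False
    with os.fdopen(fd, "wb") as f:
        f.write(payload)
    return True


def race_for_version(log_dir: str, version: int, writers: list[str]) -> list[str]:
    """Let several writers try to commit the same version; return the winners."""
    path = commit_path(log_dir, version)

    def attempt(name: str):
        return name if put_if_absent(path, name.encode()) else None

    with ThreadPoolExecutor(max_workers=len(writers)) as pool:
        results = list(pool.map(attempt, writers))
    return [name for name in results if name is not None]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as log_dir:
        # Hypothetical writer names; exactly one of them gets version 1.
        print(race_for_version(log_dir, 1, ["cluster-a", "cluster-b", "cluster-c"]))
```

Without an atomic create-if-absent, two writers could both believe they committed the same version and one would silently overwrite the other. A coordinating service, as described in the quote above, is one way to supply that guarantee when the storage layer does not.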


6 REPLIES

Kaniz
Community Manager

Hi @ptambe! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise, I will get back to you soon. Thanks.

Hubert-Dudek
Esteemed Contributor III

Usually, yes; it depends on partitioning. If you have two executors (writers) and each of them holds a partition that has to be appended to the Delta table, the partitions are written simultaneously. You can also analyze your exact use case by looking at the Jobs tab (and the other tabs) in the Spark UI.

ptambe
New Contributor III

Yes, with the same cluster and multiple executors it works, and we use replaceWhere to overwrite separate partitions. Will the same thing work if the partitions are written from different job clusters? The issue I mentioned above indicates that Delta does not support it.
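
For readers unfamiliar with the replaceWhere pattern mentioned here, a minimal sketch follows. It assumes `df` is a PySpark DataFrame whose rows all belong to the partition being replaced; the helper names, the `event_date` column, and the table path in the usage note are hypothetical. The sketch only shows the write call; whether concurrent writes from separate clusters are safe depends on the commit coordination discussed elsewhere in this thread.

```python
from typing import Any


def partition_predicate(column: str, value: str) -> str:
    """Build a replaceWhere predicate that selects one partition value."""
    # Assumption: values are trusted and contain no quotes; this sketch
    # rejects quotes instead of attempting SQL escaping.
    if "'" in value:
        raise ValueError("partition value must not contain a single quote")
    return f"{column} = '{value}'"


def overwrite_partition(df: Any, table_path: str, column: str, value: str) -> str:
    """Overwrite only the rows of a Delta table that match one partition.

    `df` is expected to be a PySpark DataFrame; no Spark import is needed
    here because only its writer API is used.
    """
    predicate = partition_predicate(column, value)
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", predicate)  # limits the overwrite to matching rows
        .save(table_path)
    )
    return predicate
```

Each job would call something like `overwrite_partition(df, "s3://my-bucket/my-table", "event_date", "2021-11-01")` with its own partition value, so that different jobs replace disjoint slices of the table.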

ptambe
New Contributor III

Thanks @Denny Lee!

This is what I was looking for, and I assume this configuration is enabled by default.

dennyglee
New Contributor III

Glad to help, @Prashant Tambe. Yes, this configuration is on by default. HTH!
