12-17-2021 12:16 AM
Does Databricks support writing to the same Delta table from multiple clusters concurrently? I specifically want to know whether Databricks has implemented a solution for https://github.com/delta-io/delta/issues/41, or whether you have any recommendations for achieving concurrent writes to the same Delta table on S3.
12-17-2021 12:23 AM
Hi @ptambe! My name is Kaniz, and I'm the technical moderator here. Great to meet you, and thanks for your question! Let's see if your peers in the community have an answer to your question first. Otherwise, I will get back to you soon. Thanks.
12-17-2021 02:08 AM
Usually yes. It depends on partitioning. If you have two executors (writers) and each of them holds a partition that has to be appended to the Delta table, the partitions are written simultaneously. You can also analyze your exact use case by looking at the Jobs tab (and the other tabs) in the Spark UI.
12-20-2021 12:53 AM
Yes, with a single cluster and multiple executors it works, and we use replaceWhere to overwrite separate partitions. Will the same approach work if the partitions are written from different job clusters? The issue I mentioned above indicates that Delta does not support this.
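For readers unfamiliar with the replaceWhere pattern mentioned above, here is a minimal sketch of overwriting a single partition. The helper names, the `event_date` column, and the table path are hypothetical and not taken from this thread; `df` is assumed to be a Spark DataFrame whose rows all belong to the partition being replaced, and the partition value is assumed to be a simple literal without quotes.

```python
def partition_predicate(column, value):
    """Build a replaceWhere predicate that matches one partition value.

    Assumes `value` is a simple, trusted literal (no quote characters).
    """
    return f"{column} = '{value}'"


def overwrite_partition(df, table_path, column, value):
    """Overwrite only the rows of the Delta table that match one partition.

    `df` is expected to be a Spark DataFrame containing only rows for
    that partition; rows in other partitions of the table are left alone.
    """
    (
        df.write.format("delta")
        .mode("overwrite")
        .option("replaceWhere", partition_predicate(column, value))
        .save(table_path)
    )
```

Each job would call something like `overwrite_partition(df, "s3://my-bucket/my-table", "event_date", "2021-12-20")` with its own partition value, so that different writers touch disjoint partitions.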
12-20-2021 08:57 AM
Please note that the issue referenced above, "[Storage System] Support for AWS S3 (multiple clusters/drivers/JVMs)", applies to Delta Lake OSS. As noted in that issue and in Issue 324, as of this writing S3 lacks putIfAbsent transactional consistency. For Delta Lake OSS, the community is working on PR 339 to resolve this.
That said, your question is specific to Databricks' implementation of Delta, which allows multiple clusters to write concurrently to the same Delta table by using the S3 commit service. The pertinent quote is:
"Databricks runs a commit service that coordinates writes to Amazon S3 from multiple clusters. This service runs in the Databricks control plane."
For more information, please refer to "Configure Databricks S3 commit service-related settings".
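Note that coordinated commits do not rule out write conflicts: Delta uses optimistic concurrency control, so a commit can still be rejected when another writer has changed the same files or partitions in the meantime. A minimal sketch of retrying such a write with exponential backoff follows. The helper name and its parameters are hypothetical; the caller passes in the exception classes to treat as conflicts (in the delta-spark Python package these live in `delta.exceptions`, for example `ConcurrentAppendException`), and the write is assumed to be safe to re-run, as a replaceWhere overwrite of a single partition would be.

```python
import random
import time


def commit_with_retry(write_fn, conflict_errors, max_attempts=5,
                      base_delay=0.5, sleep=time.sleep):
    """Call write_fn, retrying when it raises one of conflict_errors.

    conflict_errors is an exception class or a tuple of classes.
    write_fn must be safe to re-run from scratch on every attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except conflict_errors:
            if attempt == max_attempts:
                raise  # give up and surface the last conflict
            # Exponential backoff plus jitter so writers do not retry in lockstep.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))
```

The `sleep` parameter exists only so the backoff can be replaced in tests; in a job you would leave it at its default.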
12-20-2021 10:13 PM
Thanks @Denny Lee !!
This is what I was looking for, and I assume this configuration is enabled by default.
12-21-2021 07:56 AM
Glad to help @Prashant Tambe - yes, this configuration is on by default. HTH!
07-30-2024 02:58 AM
Hi @dennyglee ,
If I am writing data into a Delta table using delta-rs and a Databricks job, but I lose some transactions, how can I handle this?
Given that Databricks runs a commit service and delta-rs uses DynamoDB for transaction logs, how can we handle concurrent writers from Databricks jobs and delta-rs writers on the same table?