Administration & Architecture

Delta Lake S3 multi-cluster writes - DynamoDB

JonLaRose
New Contributor III

Hi there!

I'm trying to figure out how the multi-writers architecture for Delta Lake tables is implemented under the hood.

I understand that a DynamoDB table is used to provide mutual exclusion, but the question is: where is that table located? Is it in the Databricks control plane or in the user's account (the data plane)?
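For reference, this is roughly the multi-cluster-writes setup I have in mind from the open-source Delta Lake docs (the DynamoDB table name, region, and bucket below are just placeholders):

```python
from pyspark.sql import SparkSession

# Rough sketch of the open-source Delta Lake "S3 multi-cluster writes" setup:
# a DynamoDB-backed LogStore provides the put-if-absent mutual exclusion.
# Table name, region, and bucket are placeholders.
spark = (
    SparkSession.builder
    .config("spark.delta.logStore.s3a.impl", "io.delta.storage.S3DynamoDBLogStore")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.tableName", "delta_log")
    .config("spark.io.delta.storage.S3DynamoDBLogStore.ddb.region", "us-east-1")
    .getOrCreate()
)

spark.range(10).write.format("delta").mode("append").save("s3a://my-bucket/tables/events")
```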

If it's in the data plane, how can I provide permissions to create/update this specific table?

If it's in the control plane, why is it failing with the following error?

Py4JJavaError: An error occurred while calling o476.save.
: com.amazonaws.services.securitytoken.model.AWSSecurityTokenServiceException: The security token included in the request is invalid.

Thanks!


Kaniz
Community Manager

Hi @JonLaRose, the multi-writer architecture for Delta Lake tables on Databricks uses the S3 commit service.
- The S3 commit service ensures write consistency across multiple clusters writing to a single table.
- The service is part of the control plane; it does not read any data from S3, it only puts a new file if one does not already exist.
- The documentation does not reference a DynamoDB table; on Databricks, commit coordination is handled by the S3 commit service in the control plane.
- The S3 commit service is what implements ACID transactions and ensures consistency.
- The error (AWSSecurityTokenServiceException: The security token included in the request is invalid) is most likely caused by invalid or expired AWS credentials.
- The S3 commit service uses temporary AWS credentials passed from the data plane, which are valid for six hours.
- If those credentials are invalid or expired, you will see this error.
- To fix it, ensure your AWS credentials are valid and not expired (a quick check is sketched below).
- If you are using IAM roles, ensure the role grants the permissions needed for these operations.
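As a quick sanity check (assuming boto3 is available on the driver), you can ask STS who the current credentials belong to; if they are expired or invalid, this call fails with the same kind of token error:

```python
import boto3
from botocore.exceptions import ClientError

def check_aws_credentials():
    """Sanity check: can the credentials the cluster is using call STS?"""
    sts = boto3.client("sts")
    try:
        identity = sts.get_caller_identity()
        print(f"Credentials are valid for: {identity['Arn']}")
    except ClientError as e:
        # An InvalidClientTokenId / ExpiredToken error here points to the same
        # credential problem reported by the commit service.
        print(f"Credential check failed: {e}")

check_aws_credentials()
```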

JonLaRose
New Contributor III

Thank you, @Kaniz.

Does the S3 Commit service use the `s3a` configured S3 endpoint (from the Spark session Hadoop configurations)? If not, is there a way to configure the S3 endpoint that the S3 Commit service uses? 

Kaniz
Community Manager

Hi @JonLaRose, the S3 commit service is a Databricks service that helps guarantee consistency of writes across multiple clusters on a single table in specific cases. It runs in the Databricks control plane and coordinates writes to Amazon S3 from multiple clusters.

 

Regarding your question, the S3 commit service sends temporary AWS credentials from the compute plane to the control plane in the commit service API call. The compute plane writes data directly to S3, and then the S3 commit service in the control plane provides concurrency control by finalizing the commit log upload. The commit service does not read any data from S3. It puts a new file in S3 if it does not exist.
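To make the "put a new file only if it does not exist" idea concrete: it is the same conditional-write pattern that the open-source DynamoDB-backed LogStore uses for mutual exclusion. A conceptual sketch only (not Databricks internals; the table and attribute names are illustrative):

```python
import boto3
from botocore.exceptions import ClientError

# Conceptual sketch of put-if-absent mutual exclusion for Delta commits,
# in the style of the open-source S3DynamoDBLogStore. Table and attribute
# names are illustrative placeholders, not Databricks internals.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
table = dynamodb.Table("delta_log_commits")

def try_commit(table_path: str, version: int) -> bool:
    """Attempt to claim commit `version` for `table_path`; only one writer can win."""
    try:
        table.put_item(
            Item={"tablePath": table_path, "fileName": f"{version:020d}.json"},
            # Fails if another cluster already claimed this commit version.
            ConditionExpression="attribute_not_exists(fileName)",
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another writer committed this version first
        raise
```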

 

To access s3a:// files from Apache Spark™, you must pass certain configurations to spark-submit (or set them on the cluster) and specify the endpoint; a sketch of this is below. You can find more information on configuring Databricks S3 commit service-related settings in the Databricks documentation. I hope this helps!
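For example, the endpoint used by the compute plane's Hadoop S3A client can be set on the Spark session (the endpoint and region values below are placeholders); note that this configures the cluster's own S3 access, not the control-plane commit service:

```python
from pyspark.sql import SparkSession

# Sketch: pointing the Hadoop S3A client at a specific S3 endpoint.
# Endpoint/region are placeholders; this affects the compute plane's direct
# S3 reads and writes, not the control-plane commit service.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.endpoint", "s3.us-east-1.amazonaws.com")
    .config("spark.hadoop.fs.s3a.endpoint.region", "us-east-1")
    .getOrCreate()
)
```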
