cancel
Showing results forย 
Search instead forย 
Did you mean:ย 
Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.
cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

How to write Change Data from Delta Lake to aws dynamodb

hari
Contributor

Is there some direct way to write data from DeltaLake to AWS DynamoDB.

If there is none, Is there any way to do the same.

1 ACCEPTED SOLUTION

Accepted Solutions

-werners-
Esteemed Contributor III
8 REPLIES 8

AmanSehgal
Honored Contributor III

This is something that might be achievable via delta live tables and delta sharing in future.

But for now, what you can do is track changes in delta tables when you do an upsert.

To know more you'll have to describe how your delta tables are getting updated? What's the source, frequency of data operations in delta lake?

hari
Contributor

The data will be updated as batches nightly

AmanSehgal
Honored Contributor III

A way to track INSERT/UPDATE/DELETE would be to split out your upsert process as a 3 part process.

Step1: Include an update timestamp field in your target table.

Step2: In your incoming refresh dataset, add a update timestamp column and set it to current timestamp.

UPDATE count

Step3: Run upsert job with just whenMatchedUpdate clause. Count rows that have current timestamp for update.

Also, get an extract of the rows with current timestamp - these are the rows that you'll UPDATE in DyanmoDB

INSERT count

Step5: Get the row count of table - RC1.

Step6: Run upsert job with just whenNotMatchedInsert clause. Get row count of table after insert - RC2

INSERT count = RC2-RC1

Get an extract of the rows with current timestamp and perform a left-anti join with UPDATE df - these are the rows that you'll INSERT in DyanmoDB

DELETE count

Step7: Get the row count of table - RC1.

Step8: Run upsert job with just whenNotMatchedDelete clause. Get row count of table after insert - RC2

DELETE count = RC1-RC2

From your source data, there would be a way to identify the records that are to be deleted. Grab those indices/primary keys to delete records.

-werners-
Esteemed Contributor III

there is change data feed on delta tables:

https://docs.databricks.com/delta/delta-change-data-feed.html

hari
Contributor

@Aman Sehgalโ€‹  Thanks for the reply

Are you suggesting collecting the update/delete set and then manually writing to dynamodb?

hari
Contributor

@Werner Stinckensโ€‹ Thanks for the reply

I have gone through the CDF docs. But was wondering if there is some connector available to write data from Delta Table directly to dynamodb.

Eg: Read data from Delta table with change data mode on and write to dynamodb directly.

Or are you suggesting to get the Change Data and then write to DynamoDB

-werners-
Esteemed Contributor III

I suggest fetching the change data and write that to dynamodb indeed.

jose_gonzalez
Databricks Employee
Databricks Employee

Hi @Harikrishnan P Hโ€‹ ,

Did @Werner Stinckensโ€‹ reply help you to resolved your issue? if yes, please mark it as best. if not, please let us know.

Connect with Databricks Users in Your Area

Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you wonโ€™t want to miss the chance to attend and share knowledge.

If there isnโ€™t a group near you, start one and help create a community that brings people together.

Request a New Group