- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 04:47 AM
Is there some direct way to write data from DeltaLake to AWS DynamoDB.
If there is none, Is there any way to do the same.
- Labels:
-
Change Data
-
Delt Lake
Accepted Solutions
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 07:27 AM
there is change data feed on delta tables:
https://docs.databricks.com/delta/delta-change-data-feed.html
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 05:15 AM
This is something that might be achievable via delta live tables and delta sharing in future.
But for now, what you can do is track changes in delta tables when you do an upsert.
To know more you'll have to describe how your delta tables are getting updated? What's the source, frequency of data operations in delta lake?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 06:12 AM
The data will be updated as batches nightly
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 06:43 AM
A way to track INSERT/UPDATE/DELETE would be to split out your upsert process as a 3 part process.
Step1: Include an update timestamp field in your target table.
Step2: In your incoming refresh dataset, add a update timestamp column and set it to current timestamp.
UPDATE count
Step3: Run upsert job with just whenMatchedUpdate clause. Count rows that have current timestamp for update.
Also, get an extract of the rows with current timestamp - these are the rows that you'll UPDATE in DyanmoDB
INSERT count
Step5: Get the row count of table - RC1.
Step6: Run upsert job with just whenNotMatchedInsert clause. Get row count of table after insert - RC2
INSERT count = RC2-RC1
Get an extract of the rows with current timestamp and perform a left-anti join with UPDATE df - these are the rows that you'll INSERT in DyanmoDB
DELETE count
Step7: Get the row count of table - RC1.
Step8: Run upsert job with just whenNotMatchedDelete clause. Get row count of table after insert - RC2
DELETE count = RC1-RC2
From your source data, there would be a way to identify the records that are to be deleted. Grab those indices/primary keys to delete records.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 07:27 AM
there is change data feed on delta tables:
https://docs.databricks.com/delta/delta-change-data-feed.html
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 08:04 AM
@Aman Sehgalโ Thanks for the reply
Are you suggesting collecting the update/delete set and then manually writing to dynamodb?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 08:06 AM
@Werner Stinckensโ Thanks for the reply
I have gone through the CDF docs. But was wondering if there is some connector available to write data from Delta Table directly to dynamodb.
Eg: Read data from Delta table with change data mode on and write to dynamodb directly.
Or are you suggesting to get the Change Data and then write to DynamoDB
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ02-14-2022 08:17 AM
I suggest fetching the change data and write that to dynamodb indeed.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
โ03-14-2022 05:55 PM
Hi @Harikrishnan P Hโ ,
Did @Werner Stinckensโ reply help you to resolved your issue? if yes, please mark it as best. if not, please let us know.