How to Capture Change Data (CDC) from DynamoDB Streams and Write into Delta table of DataBricks
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-12-2024 09:21 AM
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-15-2024 02:00 AM
@Retired_mod Thanks for detail overview.
But can you tell more about - How should i fetch DynamoDB Streams generated data (which will reside in S3) to Delta Table via Notebook (automated by Job).
I saw in official docs that if i use Job(Notebook) Trigger as file arrival but there is limit of Max 50 [Reference Docs].
If i have to implement for more than 50 table how can i manage this.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-13-2024 04:19 AM
Hi, @prasad95 Thank you for sharing your concern here.
In addition to the @Retired_mod comments you can follow below To capture Change Data (CDC) from DynamoDB Streams and write it into a Delta table in Databricks:
1. Connect to DynamoDB Streams and read the CDC data using the AWS SDK.
2. Process the CDC data in Databricks using the APPLY CHANGES
API in Delta Live Tables, which is designed to correctly process CDC records.
3. Use the APPLY CHANGES INTO
statement with the APPLY CHANGES
API. An example of its usage is:
CREATE OR REFRESH STREAMING TABLE table_name;
APPLY CHANGES INTO LIVE.table_name FROM source KEYS (keys)
[IGNORE NULL UPDATES]
[APPLY AS DELETE WHEN condition]
[APPLY AS TRUNCATE WHEN condition]
SEQUENCE BY orderByColumn
[COLUMNS {columnList | * EXCEPT (exceptColumnList)}]
[STORED AS {SCD TYPE 1 | SCD TYPE 2}]
[TRACK HISTORY ON {columnList | * EXCEPT (exceptColumnList)}]
In this statement, source
is the CDC data from DynamoDB Streams, and table_name
is the Delta table where you want to write the CDC data.
4. After executing this statement, the CDC data from DynamoDB Streams is written into the Delta table in Databricks. Remember to define unique keys for each row in the source data. If you want to track history on certain columns, use the TRACK HISTORY ON
clause.
You can go through the below links to understand more about this.
DLT with SQL reference:- https://docs.databricks.com/en/delta-live-tables/sql-ref.html
Please leave a like if it is helpful. Follow-ups are appreciated.
Kudos,
Sai Kumar
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
02-15-2024 02:04 AM

