Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

How do you use Databricks Lakeflow Declarative Pipelines on AWS DMS data?

excavator-matt
New Contributor III

Hi!

I am trying to replicate an AWS RDS PostgreSQL database in Databricks. I have successfully managed to enable CDC using AWS DMS, which writes an initial load file and continuous CDC files in Parquet.

I have been trying to follow the official guide "Replicate an external RDBMS table using AUTO CDC". However, this guide leaves three main issues unanswered.

1. How would you handle the scenario where you haven't received any CDC updates yet? That leaves the schema of the rdbms_orders_change_feed view undefined, which causes a CF_EMPTY_DIR_FOR_SCHEMA_INFERENCE error that can't be caught. I obviously want to run the initial load without waiting for changes.

2. Why would you split the initial load from the CDC? Since AUTO CDC checks the update time, there is no risk that the initial load overrides the CDC.

3. How would you handle this scalably if I have maybe 20 tables from the same database? I currently went for Jinja templating, but is there a better way?

Thanks!

 

excavator-matt
New Contributor III

If you can find the author of the official documentation, you could also send them these bonus tasks:

1. The guide still deadnames declarative pipelines as dlt, but the modern import is of course "from pyspark import pipelines as dp", with identical syntax (see the sketch after this list).
2. The guide has incorrect indentation in the continuous feed section.
3. The guide doesn't mention how out-of-order deletes are handled, but probably should.
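For reference, a minimal sketch of the rename, assuming it runs as pipeline source code where spark is predefined (the table and source names are illustrative):

    # Legacy import used by the guide:
    # import dlt
    # Modern import, same decorators and functions:
    from pyspark import pipelines as dp

    @dp.table  # equivalent to the old @dlt.table
    def my_table():
        return spark.read.table("samples.nyctaxi.trips")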
 

Hi @excavator-matt 

Thank you for reaching out and for sending your questions and feedback about this article.

Why would you split the initial load from the CDC?

Short answer: to create a baseline streaming table first, then capture new changes and run AUTO CDC to synchronize them incrementally and handle idempotency; otherwise you would need to resync everything every time.

Long answer: 

The guide suggests two steps: 

  • The first step is to create a baseline streaming table, leveraging a view as the source to obtain the full snapshot of your data (initial hydration). This step uses the schema of your view to create the schema for your streaming table. Notice that it uses the "once=True" flag to handle processing in a batch-like manner; the pipeline processes all currently available changes from the source only once and then stops.
  • The second step leverages your baseline streaming table, using a second view as the source. The guide uses JSON files and infers the schema from them; in your case, this would be the PostgreSQL source.

The medallion architecture suggests bringing in your raw bronze data (step 1) and then applying the CDC in your silver layer (step 2). That way you keep a history of your changes, and you can also leverage Change Data Feed.
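Here is a minimal sketch of both steps following the guide's AUTO CDC API, adapted for DMS Parquet output. The bucket path, the key column order_id, the sequencing column updated_at, the DMS operation column Op, and the LOAD*/timestamp file-name globs are all assumptions; adjust them to your setup:

    from pyspark import pipelines as dp
    from pyspark.sql.functions import col, expr

    dp.create_streaming_table("orders")

    # Step 1: initial hydration from the DMS full-load files, applied once.
    @dp.view(name="rdbms_orders_initial_load")
    def rdbms_orders_initial_load():
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "parquet")
                .load("s3://<bucket>/dms/public/orders/LOAD*.parquet"))  # assumed DMS full-load naming

    dp.create_auto_cdc_flow(
        name="orders_initial_flow",
        target="orders",
        source="rdbms_orders_initial_load",
        keys=["order_id"],
        sequence_by=col("updated_at"),
        stored_as_scd_type=1,
        once=True,  # process the snapshot a single time, then stop
    )

    # Step 2: the continuous CDC files written by DMS.
    @dp.view(name="rdbms_orders_change_feed")
    def rdbms_orders_change_feed():
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "parquet")
                .load("s3://<bucket>/dms/public/orders/2*.parquet"))  # assumed timestamped CDC file naming

    dp.create_auto_cdc_flow(
        name="orders_cdc_flow",
        target="orders",
        source="rdbms_orders_change_feed",
        keys=["order_id"],
        sequence_by=col("updated_at"),
        apply_as_deletes=expr("Op = 'D'"),  # DMS marks deletes with Op = 'D'
        stored_as_scd_type=1,
    )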

 

How would you handle the scenario where you haven't received any CDC updates yet?

On your second step, you can pre-define the schema on your view and let the CDC run its course. The step will then not fail, and if there are no updates the pipeline simply continues.
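For example, a minimal sketch of pinning an explicit schema on the change-feed view so Auto Loader never has to infer one from an empty directory (column names are illustrative; mirror your RDBMS table plus the DMS metadata columns):

    from pyspark import pipelines as dp
    from pyspark.sql.types import StructType, StructField, StringType, LongType, TimestampType

    orders_schema = StructType([
        StructField("Op", StringType()),            # DMS operation marker: I / U / D
        StructField("order_id", LongType()),
        StructField("status", StringType()),
        StructField("updated_at", TimestampType()),
    ])

    @dp.view(name="rdbms_orders_change_feed")
    def rdbms_orders_change_feed():
        return (spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "parquet")
                .schema(orders_schema)  # explicit schema: no inference, no CF_EMPTY_DIR_FOR_SCHEMA_INFERENCE
                .load("s3://<bucket>/dms/public/orders/"))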

 

How would you handle this scalably if I have maybe 20 tables from the same database? I currently went for Jinja templating, but is there a better way?

Jinja templating is a great approach: you can define a SQL template (CREATE STREAMING…), keep a YAML file with all your table configuration (table, schema, path, etc.), and then use Python to generate the definitions.

This approach centralizes your table configuration for maintainability and reduces code duplication and errors.
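As an alternative to rendering text templates, the pipeline source itself can be metaprogrammed: one Python loop over the config defines a view and an AUTO CDC flow per table, so the YAML stays the single source of truth and nothing has to be re-rendered. A sketch, assuming a tables.yml whose entries carry name, path, keys, and sequence_by (all illustrative) and DMS-style Op delete markers:

    import yaml
    from pyspark import pipelines as dp
    from pyspark.sql.functions import col, expr

    with open("tables.yml") as f:
        tables = yaml.safe_load(f)  # e.g. [{name: orders, path: s3://..., keys: [order_id], sequence_by: updated_at}, ...]

    def define_table(cfg):
        source_view = f"{cfg['name']}_change_feed"

        @dp.view(name=source_view)
        def change_feed():
            return (spark.readStream.format("cloudFiles")
                    .option("cloudFiles.format", "parquet")
                    .load(cfg["path"]))

        dp.create_streaming_table(cfg["name"])
        dp.create_auto_cdc_flow(
            target=cfg["name"],
            source=source_view,
            keys=cfg["keys"],
            sequence_by=col(cfg["sequence_by"]),
            apply_as_deletes=expr("Op = 'D'"),
            stored_as_scd_type=1,
        )

    # One definition per configured table, all within a single pipeline source file.
    for cfg in tables:
        define_table(cfg)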

 

I hope this helps! If this solution works for you, please click the "Accept as Solution" button.




