Monday
Hi!
I am trying to replicate an AWS RDS PostgreSQL database in Databricks. I have successfully managed to enable CDC using AWS DMS, which writes an initial load file and continuous CDC files in Parquet.
I have been trying to follow the official guide Replicate an external RDBMS table using AUTO CDC. However, this guide leaves three main issues unanswered.
1. How would you handle the scenario where you haven't received any CDC updates yet? That leaves the schema of the rdbms_orders_change_feed view undefined, which causes a CF_EMPTY_DIR_FOR_SCHEMA_INFERENCE error that can't be caught. I obviously want to run the initial load without waiting for changes.
2. Why would you split the initial load from the CDC? Since AUTO CDC checks the update time, there is no risk that the initial load overrides the CDC.
3. How would you handle this scalably if I have maybe 20 tables from the same database? I currently went for Jinja templating, but is there a better way?
Thanks!
Monday
If you find the author of the official documentation, you can also send him these bonus tasks:
1. The guide still deadnames declarative pipelines as dlt, but the modern import is of course from pyspark import pipelines as dp.
Tuesday
Thank you for reaching out and for sending your questions and feedback about this article.
Why would you split the initial load from the CDC?
Short answer: to create a baseline streaming table first, then capture new changes and let the CDC synchronize them incrementally while handling idempotency; otherwise you would need to sync everything every time.
Long answer:
The guide suggests two steps, following the medallion architecture:
1. Bring in your raw data in the bronze layer.
2. In your silver layer, apply the CDC.
That way you keep a history of your changes, and you can also leverage Change Data Feed.
How would you handle the scenario where you haven't received any CDC updates yet?
On your second step, you can pre-define the schema on your view and let the CDC run its course. With the schema in place the step won't fail, and if there are no updates yet the pipeline will simply continue.
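Something like this minimal sketch (classic dlt Python API; the schema fields, the DMS Op column, and the S3 path are hypothetical placeholders for your table):

```python
import dlt
from pyspark.sql.types import (StructType, StructField, StringType,
                               LongType, TimestampType)

# Hypothetical schema matching the DMS Parquet output for the orders table;
# "Op" is the DMS operation column (I = insert, U = update, D = delete).
ORDERS_CDC_SCHEMA = StructType([
    StructField("Op", StringType()),
    StructField("order_id", LongType()),
    StructField("status", StringType()),
    StructField("updated_at", TimestampType()),
])

@dlt.view(name="rdbms_orders_change_feed")
def rdbms_orders_change_feed():
    return (
        spark.readStream.format("cloudFiles")  # `spark` is provided in pipeline notebooks
        .option("cloudFiles.format", "parquet")
        # Explicit schema: Auto Loader skips inference, so an empty
        # directory no longer raises CF_EMPTY_DIR_FOR_SCHEMA_INFERENCE.
        .schema(ORDERS_CDC_SCHEMA)
        .load("s3://my-bucket/dms/public/orders/")  # hypothetical path
    )
```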
How would you handle this scalably if I have maybe 20 tables from the same database? I currently went for Jinja templating, but is there a better way?
Jinja templating is a great approach: you can define a SQL template (CREATE STREAMING…), keep a YAML file with all your table configuration (table, schema, path, etc.), and then use Python to generate the statements.
This approach centralizes table configs for maintainability and reduces code duplication and errors.
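For instance, a minimal generator sketch (the tables.yml layout, the per-table view names, and the keys/sequence columns are assumptions; each change-feed view would be defined as in the earlier sketch, and the AUTO CDC statement follows the current Lakeflow SQL syntax, so adjust it to your runtime):

```python
# Renders one streaming table + AUTO CDC flow per entry in tables.yml.
# Requires: pip install jinja2 pyyaml
import yaml
from jinja2 import Template

SQL_TEMPLATE = Template("""
CREATE OR REFRESH STREAMING TABLE {{ table }};

CREATE FLOW {{ table }}_cdc AS AUTO CDC INTO {{ table }}
FROM stream({{ table }}_change_feed)  -- view defined elsewhere, e.g. as in the Python sketch above
KEYS ({{ keys | join(', ') }})
SEQUENCE BY {{ sequence_by }}
STORED AS SCD TYPE 1;
""")

# tables.yml (hypothetical), one entry per source table:
# - table: rdbms_orders
#   keys: [order_id]
#   sequence_by: updated_at
with open("tables.yml") as f:
    tables = yaml.safe_load(f)

# Concatenate the rendered statements into one SQL source file for the pipeline.
print("\n".join(SQL_TEMPLATE.render(**cfg) for cfg in tables))
```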
I hope this helps! If this solution works for you, please click the "Accept as Solution" button.
Wednesday
Thank you for your reply!
Maybe I am slow, but two things don't make sense to me in this reply. If these are clarified, I'll consider the question resolved.
You raise a good point that even if you use AUTO CDC's sequence_by, you would still, in the general case, have the problem of reprocessing the same files if you don't have run_once enabled. In this particular case, though, that is not an issue, as cloudFiles/Auto Loader keeps track of the files it has already read. Smaller detail, but is this accurate?
As for the wider point of using a predefined schema, I guess that would work. However, don't you then miss out on schema evolution? There are pros and cons to enforcing schemas, and I guess it makes sense that the schema is fixed by the time you handle your CDC in silver, but I was otherwise expecting it to evolve in bronze.
Or are you saying that, unlike other forms of pipelines in Databricks, a hidden downside of declarative pipelines is that they don't support schema evolution? Does the declarative pipeline view handle schema evolution if the CDC files evolve? Let's leave schema evolution in AWS DMS as a topic for another day.
Wednesday
Hi @excavator-matt ,
Yes, you are correct. cloudFiles/Auto Loader handles idempotency at the file level.
From the guide's perspective, the view is created over the source files in the specified location. This view captures all files and their corresponding rows, and AUTO CDC is then applied from the view to create the baseline streaming table.
Regarding schema evolution and enforcement, IMO it is about finding the right balance: in bronze, your data changes, so the pipeline needs to be resilient enough to handle those changes. However, as we move through the silver and gold layers, we may require a more rigid schema and enforce it to ensure that downstream/consumer systems don't break.
The key detail is that AUTO CDC handles schema evolution. Note that the streaming table “rdbms_orders” is created without a schema, and then AUTO CDC FLOW is applied from the “full_orders_snapshot” view.
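In code, that pattern looks roughly like this minimal sketch (classic dlt Python API; on newer runtimes apply_changes is renamed create_auto_cdc_flow, and the key and sequence columns here are hypothetical):

```python
import dlt

# Target streaming table declared without an explicit schema, so it can
# evolve with the source view as the change feed gains columns.
dlt.create_streaming_table("rdbms_orders")

dlt.apply_changes(                  # renamed create_auto_cdc_flow on newer runtimes
    target="rdbms_orders",
    source="full_orders_snapshot",  # the guide's view over the initial load
    keys=["order_id"],              # hypothetical primary key
    sequence_by="updated_at",       # hypothetical ordering column
    stored_as_scd_type=1,
)
```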
I hope this helps. Thank you!