How do I use Databricks Lakeflow Declarative Pipelines on AWS DMS data?

excavator-matt · ‎10-06-2025

Hi!

I am trying to replicate an AWS RDS PostgreSQL database in Databricks. I have successfully managed to enable CDC using AWS DMS, which writes an initial load file and continuous CDC files in Parquet.

I have been trying to follow the official guide, Replicate an external RDBMS table using AUTO CDC. However, this guide leaves three main questions unanswered.

1. How would you handle the scenario where you haven't received any CDC updates yet? That leaves the schema of the rdbms_orders_change_feed view undefined, which causes a CF_EMPTY_DIR_FOR_SCHEMA_INFERENCE error that can't be caught. I obviously want to run the initial load without waiting for a change.

2. Why would you split the initial load from the CDC? Since AUTO CDC checks the update time, there should be no risk that the initial load overrides CDC changes.

3. How would you handle this in a scalable way if I have maybe 20 tables from the same database? I have currently gone for Jinja templating, but is there a better way?
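
To make question 3 concrete, here is a minimal sketch of the per-table templating idea. It uses Python's built-in string.Template as a dependency-free stand-in for Jinja. The table list, the S3 path layout, and the render_pipeline helper are hypothetical illustrations, not the actual setup. Only the rdbms_<table>_change_feed naming follows the guide, and the Auto Loader options shown are an assumed minimum.

```python
from string import Template

# Hypothetical table list and S3 layout; neither comes from the real setup.
TABLES = ["orders", "customers", "products"]
CDC_ROOT = "s3://example-dms-bucket/cdc"

# Template for one change-feed view. The naming mirrors the guide's
# rdbms_orders_change_feed view; the reader options are an assumption.
VIEW_TEMPLATE = Template('''\
@dlt.view(name="rdbms_${table}_change_feed")
def ${table}_change_feed():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("${cdc_root}/${table}")
    )
''')


def render_pipeline(tables, cdc_root):
    """Return pipeline source text with one change-feed view per table."""
    header = "import dlt\n\n"
    blocks = [VIEW_TEMPLATE.substitute(table=t, cdc_root=cdc_root) for t in tables]
    return header + "\n".join(blocks)


if __name__ == "__main__":
    # Print the generated source; in practice it would be written to a
    # pipeline source file instead.
    print(render_pipeline(TABLES, CDC_ROOT))
```

With Jinja the structure would be the same and only the placeholder syntax would differ. Each of the 20 tables ends up as one rendered block in the generated file.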

Thanks!
