Hi @excavator-matt
Thank you for reaching out and for sending your questions and feedback about this article.
Why would you split the initial load from the cdc?
Short answer: to create a baseline streaming table first, then capture new changes and run the CDC to synchronize them incrementally and handle idempotency; otherwise you would need to sync everything every time.
Long answer:
The guide suggests two steps:
- The first step creates a baseline streaming table, using a view as the source to obtain the full snapshot of your data (initial hydration). This step uses the schema of your view to create the schema of your streaming table. Note that it uses the flag “once=true” to process in a batch-like manner: the pipeline processes all currently available changes from the source only once and then stops.
- The second step targets the same baseline streaming table, using a second view as the source. The guide uses JSON files and infers the schema from them; in your case, the source would be PostgreSQL.
The medallion architecture suggests bringing in your raw data in the bronze layer (step 1) and then applying the CDC in your silver layer (step 2). That way you keep a history of your changes, and you can also leverage Change Data Feed.
How would you handle the scenario where you haven't received any CDC updates yet?
In your second step, you can pre-define the schema on your view and let the CDC run its course. That way the step does not fail, and if there are no updates the pipeline simply continues.
How would you handle this scalably if I have maybe 20 tables from the same database? I currently went for Jinja templating, but is there a better way?
Jinja templating is a great approach: you can define a SQL template (CREATE STREAMING…), keep a YAML file with all your table configuration (table, schema, path, etc.), and then use Python to generate the statements.
This approach centralizes table configs for maintainability and reduces code duplication and errors.
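Here is a minimal sketch of that idea, not a definitive implementation. To keep it runnable without extra packages, it swaps in Python's standard-library string.Template for Jinja and an inline list of dicts for the YAML file. The table names, the `_changes` source suffix, the key and sequence columns, and the SQL statements in the template are all hypothetical placeholders; substitute the statements from the guide and your own configuration.

```python
from string import Template

# Hypothetical table configs; in practice these could be loaded from a YAML file.
TABLES = [
    {"table": "orders", "schema": "bronze", "keys": "order_id", "sequence_col": "change_seq"},
    {"table": "customers", "schema": "bronze", "keys": "customer_id", "sequence_col": "change_seq"},
]

# Illustrative SQL only (an assumption, not the guide's exact statements).
# With Jinja, the placeholders would be written as {{ table }} instead of ${table}.
SQL_TEMPLATE = Template(
    "CREATE OR REFRESH STREAMING TABLE ${schema}.${table};\n"
    "\n"
    "APPLY CHANGES INTO ${schema}.${table}\n"
    "FROM STREAM(${schema}.${table}_changes)\n"
    "KEYS (${keys})\n"
    "SEQUENCE BY ${sequence_col}\n"
    "STORED AS SCD TYPE 1;\n"
)


def render_all(tables):
    """Return a dict mapping each table name to its rendered SQL text."""
    return {cfg["table"]: SQL_TEMPLATE.substitute(cfg) for cfg in tables}


if __name__ == "__main__":
    # Print one SQL snippet per configured table.
    for name, sql in render_all(TABLES).items():
        print(f"-- {name}")
        print(sql)
```

Adding a 21st table then only means adding one more entry to the config; the template itself stays untouched.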
I hope this helps! If this solution works for you, please click the "Accept as Solution" button.