So, since nobody responded, I decided to try my own suggestion and hack the snapshot data into the table that gathers the change data capture. After some trial and error I ended up with the attached notebook.
The notebook first creates two DLT tables (lookup_time_table and cdctest_cdc_raw) reflecting the CDC data captured by SQL Server for the source table dbo.cdctest (i.e. cdc.lsn_time_mapping and cdc.dbo_cdctest_CT). It then defines a third DLT table named cdctest_cdc_enriched as a join of these two tables. On the first run, the notebook also appends some pre-existing data (rows that were already in dbo.cdctest before CDC was turned on) using @dlt.append_flow(target="cdctest_cdc_enriched"). Finally, it creates a streaming table called cdctest, into which the changes from cdctest_cdc_enriched are applied.
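To make the structure concrete, here is a stripped-down sketch of the notebook (the JDBC ingestion, the snapshot table name, and the key column are hypothetical placeholders for what the attached notebook actually does; __$start_lsn, __$operation, start_lsn and tran_end_time are the standard SQL Server CDC column names):

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical JDBC connection; a stand-in for however the notebook actually
# ingests the SQL Server CDC tables.
JDBC_URL = "jdbc:sqlserver://<host>;databaseName=<db>"

def read_sqlserver(table):
    return (spark.read.format("jdbc")
            .option("url", JDBC_URL)
            .option("dbtable", table)
            .load())

@dlt.table(name="lookup_time_table")
def lookup_time_table():
    return read_sqlserver("cdc.lsn_time_mapping")

@dlt.table(name="cdctest_cdc_raw")
def cdctest_cdc_raw():
    return read_sqlserver("cdc.dbo_cdctest_CT")

@dlt.table(name="cdctest_cdc_enriched")
def cdctest_cdc_enriched():
    # Streaming join of the two tables above on the standard SQL Server CDC
    # LSN columns (the skipChangeCommits handling is shown in a later snippet).
    raw = dlt.read_stream("cdctest_cdc_raw")
    times = dlt.read_stream("lookup_time_table")
    return raw.join(times, raw["__$start_lsn"] == times["start_lsn"])

# One-time append of the rows that existed in dbo.cdctest before CDC was
# turned on; the snapshot table name is a placeholder.
@dlt.append_flow(target="cdctest_cdc_enriched")
def initial_snapshot():
    return spark.readStream.table("cdctest_pre_cdc_snapshot")

dlt.create_streaming_table("cdctest")

dlt.apply_changes(
    target="cdctest",
    source="cdctest_cdc_enriched",
    keys=["id"],                                # hypothetical key column
    sequence_by=col("tran_end_time"),           # commit time from cdc.lsn_time_mapping
    apply_as_deletes=col("__$operation") == 1,  # 1 = delete in SQL Server CDC
)
```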
This pipeline runs fine the first time (or after a full refresh), and also updates successfully as long as no DML is done on the source table dbo.cdctest in the SQL Server database after the initial run/full refresh.
However, when I did some DML on dbo.cdctest in SQL Server (causing appends to the underlying tables cdc.lsn_time_mapping and cdc.dbo_cdctest_CT, but of course no deletes or updates on those tables) and then ran the pipeline again to process the changes, it failed with the following error:
Flow 'cdctest_cdc_enriched' has FAILED fatally. An error occurred because we detected an update or delete to one or more rows in the source table. Streaming tables may only use append-only streaming sources. If you expect to delete or update rows to the source table in the future, please convert table cdctest_cdc_enriched to a materialized view instead of a streaming table. If you are fine to skip changes, set the option 'skipChangeCommits' to 'true'......The non-append change can be found at version 28. Operation: WRITE Username: ... Source table name: lookup_time_table
The problem with this is that I already set skipChangeCommits to true for both source tables (lookup_time_table and cdctest_cdc_raw), and also that I'm pretty sure there are no actual updates or deletes in the source at the moment. I set this option because a future update run might encounter deletes: by default, SQL Server keeps CDC records in the CDC tables for only three days.
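For reference, this is how I set the option on the streaming reads inside cdctest_cdc_enriched (same sketch as above; skipChangeCommits is the standard Delta streaming reader option and has to be set on each read):

```python
@dlt.table(name="cdctest_cdc_enriched")
def cdctest_cdc_enriched():
    # skipChangeCommits tells each stream to skip commits that update or
    # delete existing rows in the source table instead of failing.
    raw = (spark.readStream
           .option("skipChangeCommits", "true")
           .table("LIVE.cdctest_cdc_raw"))
    times = (spark.readStream
             .option("skipChangeCommits", "true")
             .table("LIVE.lookup_time_table"))
    return raw.join(times, raw["__$start_lsn"] == times["start_lsn"])
```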
I also tried defining cdctest_cdc_enriched as a materialized view, but then the append_flow fails.
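Roughly what that attempt looked like (again a sketch; reading the sources with dlt.read instead of dlt.read_stream makes cdctest_cdc_enriched a materialized view, and append flows can only target streaming tables):

```python
@dlt.table(name="cdctest_cdc_enriched")
def cdctest_cdc_enriched():
    # Batch reads => cdctest_cdc_enriched becomes a materialized view,
    # not a streaming table.
    raw = dlt.read("cdctest_cdc_raw")
    times = dlt.read("lookup_time_table")
    return raw.join(times, raw["__$start_lsn"] == times["start_lsn"])

# This flow now fails, because @dlt.append_flow can only write to a
# streaming table target.
@dlt.append_flow(target="cdctest_cdc_enriched")
def initial_snapshot():
    return spark.readStream.table("cdctest_pre_cdc_snapshot")  # placeholder
```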
So, two questions:
One: where is the failure coming from, given that there are no updates or deletes in the source (yet)?
Two: how do I get the DLT pipeline to ignore the supposed updates and just process the inserts to the CDC tables?