Hi @lrm_data yes, this one catches a lot of people. A few things to check on the SQL Server side that commonly block recovery even after destroy + recreate:
Stale lakeflow_* capture instance. SQL Server allows only 2 capture instances per table. If both slots are occupied - often a live instance plus a stale one left behind by the failed schema change - Lakeflow can't run a full refresh or schema evolution, even from a fresh pipeline. Check:
SELECT ct.capture_instance, ct.start_lsn, ct.create_date
FROM cdc.change_tables ct
JOIN sys.tables t ON ct.source_object_id = t.object_id
WHERE t.name = '<your_table>';
Disable + re-enable CDC on the table to clear stale ones.
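A minimal T-SQL sketch of that reset (schema/table names are placeholders - substitute your own; passing `@capture_instance = 'all'` drops every capture instance on the table, including stale lakeflow_* ones):

```sql
-- Drop ALL capture instances on the table, stale ones included
EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',          -- your schema
    @source_name      = N'your_table',   -- your table
    @capture_instance = N'all';

-- Re-enable CDC so the table gets one clean capture instance
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'your_table',
    @role_name     = NULL;               -- or a gating role if you use one

-- Verify: should now show exactly one fresh instance
SELECT ct.capture_instance, ct.create_date
FROM cdc.change_tables ct
JOIN sys.tables t ON ct.source_object_id = t.object_id
WHERE t.name = 'your_table';
```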
Lakeflow Connect is two pipelines: a gateway (SQL Server -> UC staging) and an ingestion pipeline (staging -> destination). Destroying only the ingestion side leaves the gateway in its broken state. Both must go.
Staging volume doesn't always drop cleanly. Volumes persist 25-30 days after deletion, and UC schemas the pipeline created can hang around. Reusing the old staging location is the #1 reason "recreated from scratch" isn't actually from scratch. Drop it manually and use a fresh one.
Recovery order that works:
- Stop + delete ingestion pipeline
- Stop + delete gateway
- Disable CDC on the table, confirm no stale lakeflow_* instances remain, re-enable CDC
- Drop the residual UC staging schema/volume
- Recreate gateway with a fresh staging location
- Recreate ingestion pipeline
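For step 4, a sketch of the Databricks SQL cleanup (catalog/schema/volume names are placeholders for whatever your gateway was configured with - check the pipeline settings before dropping anything):

```sql
-- Remove the residual staging volume first, then the schema
DROP VOLUME IF EXISTS my_catalog.lakeflow_staging.staging_volume;

-- CASCADE removes any leftover staging tables the gateway created
DROP SCHEMA IF EXISTS my_catalog.lakeflow_staging CASCADE;
```

Point the recreated gateway at a brand-new schema/volume name rather than reusing this one, so the 25-30 day soft-delete window can't bite you.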
If the table has a primary key, switch to change tracking (CT) instead of CDC - Databricks' own recommendation. CT doesn't use capture instances and this whole class of problem goes away.
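A sketch of switching the source over to change tracking (database and table names are placeholders; retention should comfortably exceed your longest expected pipeline outage):

```sql
-- Enable CT at the database level
ALTER DATABASE YourDb
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

-- Enable CT on the table (requires a primary key)
ALTER TABLE dbo.your_table
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);
```

CT tracks net row changes keyed by the primary key instead of materializing change tables, so there are no capture instances to go stale.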
If all the above is clean and it still won't recover, open a support case with the utility script version, capture instance state before/after, and gateway driver logs.