2 weeks ago
Has anyone else run into a situation where a breaking schema change on a SQL Server source table leaves their Lakeflow Connect pipeline in a state it can't recover from — even after destroying and recreating the pipeline?
Here's what happened to us:
- We had a breaking schema change on one of our source tables
- The CDC incrementals broke as expected, so we triggered a full refresh
- The pipeline never stabilized — it continued to fail on subsequent runs
- We destroyed the pipeline and recreated it from scratch
- Even after recreating, we were unable to get the pipeline to a healthy state
I'm aware that Databricks docs acknowledge that rows prior to a schema change aren't guaranteed to have been ingested before the pipeline fails (`INCOMPATIBLE_SCHEMA_CHANGE`). But the expectation is that a full refresh — or at minimum a destroy and recreate — should get you back to a clean state. In our experience, neither did.
2 weeks ago
Hi,
I've done some research internally, and the thing that usually catches people out is that Lakeflow Connect for SQL Server isn't a single pipeline; it actually has a few different components.
There's a gateway (talks to SQL Server, writes to staging) and a separate ingestion pipeline (reads staging, writes to UC). If you only destroyed the ingestion pipeline, the gateway could still be sitting on the broken state.
A few things to make sure you've cleaned up:
- Both pipelines: destroy the gateway too, not just the ingestion side.
- SQL Server side: Lakeflow recreates its own `lakeflow_*` capture instances on full refresh, but a broken one can stick around. Check `cdc.change_tables` and `EXEC sys.sp_cdc_help_change_data_capture` for stale `lakeflow_*` rows. Disabling and re-enabling CDC on the source table usually clears it (see the T-SQL sketch after this list).
- UC side: destination tables drop on pipeline delete, but staging volumes hang around for 25–30 days, and any schemas the pipeline created in UC don't always go with it. Worth dropping the staging schema/volume before you recreate.
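To make the CDC check concrete, here's a minimal T-SQL sketch to run against the source database. `dbo.orders` is a placeholder for your affected table, and the exact capture instance names will differ in your environment:

```sql
-- List all capture instances; look for stale lakeflow_* rows.
EXEC sys.sp_cdc_help_change_data_capture;

SELECT capture_instance, source_object_id, create_date
FROM cdc.change_tables
WHERE capture_instance LIKE 'lakeflow%';

-- Disable CDC on the affected table to clear the broken capture
-- instance ('all' drops every instance on the table; you can pass a
-- specific lakeflow_* name instead), then re-enable it.
EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',
    @source_name      = N'orders',
    @capture_instance = N'all';

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'orders',
    @role_name     = NULL;
```

Note that `sp_cdc_disable_table` only removes the table-level capture instances; database-level CDC stays enabled. If other consumers share those capture instances, check with your DBA before disabling.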
Order of operations that should work: stop + delete the ingestion pipeline → stop + delete the gateway → clean up SQL Server CDC on the affected tables → drop residual UC staging schemas (sketch below) → recreate the gateway → recreate the ingestion pipeline (fresh staging location if you can).
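For the UC cleanup step, a minimal Databricks SQL sketch. The names here (`staging_catalog.lakeflow_staging.staging_volume`) are placeholders; swap in whatever staging location your gateway was actually configured with:

```sql
-- See what the deleted pipeline left behind in the staging schema.
SHOW VOLUMES IN staging_catalog.lakeflow_staging;

-- Drop the residual staging volume, then the schema itself.
DROP VOLUME IF EXISTS staging_catalog.lakeflow_staging.staging_volume;
DROP SCHEMA IF EXISTS staging_catalog.lakeflow_staging CASCADE;
```

If the drop is blocked because the volume is still tied to the deleted pipeline, the safest path is to point the new gateway at a completely fresh staging schema instead.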
Also: open a support case. The docs say full refresh should recover from `INCOMPATIBLE_SCHEMA_CHANGE` and that destination tables drop on delete; if that isn't happening, our engineers should know about it. Include the original DDL diff, the failure event log, and what failed on both the refresh and the recreate.
I hope this helps.
Thanks,
Emma
2 weeks ago
Hi @lrm_data, yes, this one catches a lot of people. A few things to check on the SQL Server side that commonly block recovery even after destroy + recreate:
a week ago
Hello @emma_s + @abhi_dabhi,
Thank you so much! I had destroyed the bundle that included the schema, ingestion pipeline, and gateway. However, I did not clear out SQL Server CDC, so that may have been the issue.
I plan to leave the current gateway stopped and the pipeline paused since I have some open tickets with Databricks on this and they need to review logs.
However, I will be creating a net-new bundle with a new gateway and pipeline, and will deploy and start that once I clear out the SQL Server CDC side. Let me know if you think leaving the other items paused and stopped will cause issues.