2 weeks ago
Has anyone else run into a situation where a breaking schema change on a SQL Server source table leaves their Lakeflow Connect pipeline in a state it can't recover from — even after destroying and recreating the pipeline?
Here's what happened to us:
- We had a breaking schema change on one of our source tables
- The CDC incrementals broke as expected, so we triggered a full refresh
- The pipeline never stabilized — it continued to fail on subsequent runs
- We destroyed the pipeline and recreated it from scratch
- Even after recreating, we were unable to get the pipeline to a healthy state
I'm aware that Databricks docs acknowledge that rows prior to a schema change aren't guaranteed to have been ingested before the pipeline fails (`INCOMPATIBLE_SCHEMA_CHANGE`). But the expectation is that a full refresh — or at minimum a destroy and recreate — should get you back to a clean state. In our experience, neither did.
2 weeks ago
Hi,
I've done some research internally, and the thing that usually catches people out is that Lakeflow Connect for SQL Server isn't a single pipeline; it actually has a few different components.
There's a gateway (talks to SQL Server, writes to staging) and a separate ingestion pipeline (reads staging, writes to UC). If you only destroyed the ingestion pipeline, the gateway could still be sitting on the broken state.
A few things to make sure you've cleaned up:
- Both pipelines: destroy the gateway too, not just the ingestion side.
- SQL Server side: Lakeflow recreates its own `lakeflow_*` capture instances on full refresh, but a broken one can stick around. Check `cdc.change_tables` and `EXEC sys.sp_cdc_help_change_data_capture` for stale `lakeflow_*` rows. Disabling and re-enabling CDC on the source table usually clears it (see the T-SQL sketch after this list).
- UC side: destination tables drop on pipeline delete, but staging volumes hang around for 25–30 days, and any schemas the pipeline created in UC don't always go with it. Worth dropping the staging schema/volume before you recreate.
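To make the CDC check concrete, here's a minimal T-SQL sketch to run against the source database. `dbo.orders` is a placeholder for your affected table, and the exact capture instance names will differ in your environment:

```sql
-- List all capture instances; look for stale lakeflow_* rows.
EXEC sys.sp_cdc_help_change_data_capture;

SELECT capture_instance, source_object_id, create_date
FROM cdc.change_tables
WHERE capture_instance LIKE 'lakeflow%';

-- Disable CDC on the affected table to clear the broken capture
-- instance ('all' drops every instance on the table; you can pass a
-- specific lakeflow_* name instead), then re-enable it.
EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',
    @source_name      = N'orders',
    @capture_instance = N'all';

EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'orders',
    @role_name     = NULL;
```

Note that `sp_cdc_disable_table` only removes the table-level capture instances; database-level CDC stays enabled. If other consumers share those capture instances, check with your DBA before disabling.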
Order of operations that should work: stop + delete the ingestion pipeline → stop + delete the gateway → clean up SQL Server CDC on the affected tables → drop residual UC staging schemas (sketch below) → recreate the gateway → recreate the ingestion pipeline (fresh staging location if you can).
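For the UC cleanup step, a minimal Databricks SQL sketch. The names here (`staging_catalog.lakeflow_staging.staging_volume`) are placeholders; swap in whatever staging location your gateway was actually configured with:

```sql
-- See what the deleted pipeline left behind in the staging schema.
SHOW VOLUMES IN staging_catalog.lakeflow_staging;

-- Drop the residual staging volume, then the schema itself.
DROP VOLUME IF EXISTS staging_catalog.lakeflow_staging.staging_volume;
DROP SCHEMA IF EXISTS staging_catalog.lakeflow_staging CASCADE;
```

If the drop is blocked because the volume is still tied to the deleted pipeline, the safest path is to point the new gateway at a completely fresh staging schema instead.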
Also: open a support case. The docs say full refresh should recover from `INCOMPATIBLE_SCHEMA_CHANGE` and that destination tables drop on delete; if that isn't happening, our engineers should know about it. Include the original DDL diff, the failure event log, and what failed on both the refresh and the recreate.
I hope this helps.
Thanks,
Emma
2 weeks ago
Hi @lrm_data, yes, this one catches a lot of people. A few things to check on the SQL Server side that commonly block recovery even after destroy + recreate:
a week ago
Hello @emma_s + @abhi_dabhi,
Thank you so much! I had destroyed the bundle that included the schema, ingestion pipeline, and gateway. However, I did not clear out SQL Server CDC, so that may have been the issue.
I plan to leave the current gateway stopped and the pipeline paused since I have some open tickets with Databricks on this and they need to review logs.
However, I will be creating a net-new bundle with a new gateway and pipeline, and will deploy and start that once I clear out the SQL Server CDC side. Let me know if you think leaving the other items paused and stopped will cause issues.