02-24-2026 05:49 AM
Hi everyone
We recently started using Lakeflow Connect to ingest data from our on-prem SQL Server environment, and we’ve run into an issue related to gateway restarts.
From our understanding, the gateway begins by performing full snapshots of each table. However, it appears that whenever the gateway restarts, it also restarts the snapshot process from the beginning.
This has become a significant challenge for us because:
Our gateway is expected to lose connectivity approximately once per week.
Some of our tables are quite large, and a full snapshot can take 48+ hours to complete.
Any deployment of our DAB (even minor changes, such as updating resource tags with the deployment time or git hash) triggers a gateway restart, which in turn seems to restart the snapshot process.
Since we tag all resources for traceability, even small metadata updates can cause this behavior.
Questions:
Is this expected behavior for Lakeflow Connect?
Is there a way to resume snapshots instead of restarting them after a gateway interruption?
Has anyone implemented a workaround or best practice to handle large tables and frequent restarts?
We’d really appreciate hearing how others have approached this.
Thanks in advance!
02-24-2026 06:38 AM
Hi @DavidOldelius, Databricks can answer this definitively, but I think there is no checkpointing mechanism for Lakeflow Connect to SQL Server.
03-07-2026 08:54 PM
Thanks for the detailed write-up. This is a pain point I have seen others run into with Lakeflow Connect database connectors when dealing with large initial snapshots and environments where gateway interruptions are a fact of life.
Let me share what I know.
IS THIS EXPECTED BEHAVIOR?
Unfortunately, yes -- at least partially. For database connectors, the ingestion gateway runs on classic compute and operates continuously to extract snapshots and change data. The gateway stores a cursor position so it can resume from the last known position on subsequent runs. However, the current behavior during the initial snapshot phase is that in-memory state (including snapshot progress) resets on restart, refresh, or resume. The Databricks docs explicitly note that "Metrics are in-memory only and reset on restart, refresh, or resume" and that this is "intentional for implementation simplicity."
What this means in practice is that while the CDC (change data capture) phase benefits from cursor-based resumability, the initial full snapshot of large tables may not survive a gateway restart gracefully. The staging volume (a Unity Catalog volume that temporarily stores extracted data) does help with failure recovery in some scenarios, but a full gateway restart during the snapshot phase can cause it to start over.
Reference: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/gateway-event-logs
Reference: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/
DAB DEPLOYMENTS CAUSING GATEWAY RESTARTS
This is the part that is likely most frustrating for you. When you deploy a Databricks Asset Bundle that includes the Lakeflow Connect pipeline definition, even if the only changes are resource tags (deployment time, git hash, etc.), the deployment updates the pipeline configuration, which can trigger a restart of the gateway.
A few suggestions to mitigate this:
1. Separate your Lakeflow Connect pipeline into its own bundle or deployment target. This way, deploying unrelated infrastructure changes (tags, jobs, dashboards) does not touch the ingestion pipeline definition and will not trigger a gateway restart.
2. Use conditional deployment logic. If you are using CI/CD, add logic to skip deploying the ingestion pipeline bundle unless there are actual configuration changes to the pipeline itself. Avoid redeploying when only metadata tags change.
3. Consider tagging at a higher level. Instead of tagging individual resources on every deployment, consider whether you can track deployment metadata externally (e.g., in a separate metadata table or your CI/CD system) rather than as resource tags that trigger redeployments.
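To illustrate suggestion 1, here is a minimal sketch of an isolated bundle. All names and the host are hypothetical placeholders, and the exact pipeline spec fields depend on what Lakeflow Connect generates for your connector, so treat this as a shape, not a working config:

```yaml
# databricks.yml for a bundle containing ONLY the ingestion pipeline,
# so unrelated tag/job/dashboard deploys never redeploy the gateway.
bundle:
  name: sqlserver-ingestion   # hypothetical name

targets:
  prod:
    mode: production
    workspace:
      host: https://example.cloud.databricks.com   # placeholder host

resources:
  pipelines:
    sqlserver_ingestion_pipeline:
      name: sqlserver-ingestion-pipeline
      # ... pipeline spec as generated for your Lakeflow Connect connector ...
```

Pair this with a CI rule that deploys this bundle only when files under its own directory change, and your tag-only deploys elsewhere will never touch the gateway.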
Reference: https://docs.databricks.com/aws/en/dev-tools/bundles/
DEALING WITH WEEKLY CONNECTIVITY LOSS
For the on-prem gateway losing connectivity roughly once per week, here are some strategies:
1. Stabilize the gateway compute. If possible, ensure the classic compute cluster running the gateway has maximum uptime. Review the cluster configuration to make sure auto-termination is disabled for the gateway cluster, and that the underlying infrastructure is as stable as possible.
2. Enable change tracking or CDC on your source SQL Server tables before starting ingestion. This is important because once the initial snapshot completes, subsequent runs will only ingest changes. The gateway continuously captures change data and stores cursor positions, so after the snapshot phase is done, connectivity interruptions are handled much more gracefully -- the connector picks up from where it left off.
Reference: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/sql-server.html
3. For very large tables (48+ hour snapshots), consider whether you can reduce the scope of the initial snapshot. For example:
- If possible, filter to a subset of columns or use table selection to ingest smaller tables first.
- Check if you can pre-load historical data into the destination streaming tables through another mechanism and then let Lakeflow Connect handle incremental CDC going forward.
4. Coordinate deployments around snapshot windows. If you know a large table snapshot is in progress, hold off on any DAB deployments that touch the ingestion pipeline until the snapshot completes.
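On the SQL Server side, point 2 above comes down to standard T-SQL. A sketch, assuming a hypothetical database MyDb and table dbo.Orders; the retention values are examples only, so size them to survive your longest expected gateway outage:

```sql
-- Option A: change tracking (enable at the database level, then per table)
ALTER DATABASE MyDb
SET CHANGE_TRACKING = ON
(CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.Orders
ENABLE CHANGE_TRACKING;

-- Option B: CDC (requires SQL Server Agent to be running)
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'Orders',
    @role_name     = NULL;
```

Which option to use depends on your connector configuration; check the Lakeflow Connect SQL Server docs for the source requirements of your specific setup.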
MONITORING SNAPSHOT PROGRESS
While the snapshot is running, you can track progress using the gateway event logs. The gateway emits flow_progress events every 5 minutes by default. You can query these to see how many rows have been upserted per table and whether the pipeline is in the snapshot or CDC phase.
Example query:
SELECT
  timestamp,
  details
FROM
  event_log(<your_pipeline_id>)
WHERE
  event_type = 'flow_progress'
ORDER BY
  timestamp DESC
You can adjust the emission frequency with the pipelines.gateway.progressEventEmitFrequencySeconds configuration (valid range: 30-3600 seconds). Note that zero-update events serve as liveness signals confirming the gateway is running even when no data changes are happening.
Reference: https://docs.databricks.com/aws/en/ingestion/lakeflow-connect/gateway-event-logs
LOOKING AHEAD
Snapshot resumability for database connectors is a known area of improvement for the Lakeflow Connect team. The current architecture uses staging volumes and cursor tracking that provide good recovery for the CDC phase, but the initial snapshot phase is where this gap is most felt. I would recommend filing a feature request through your Databricks account team or the Ideas portal specifically for "snapshot checkpoint/resume on gateway restart" -- this kind of feedback helps prioritize the roadmap.
SUMMARY
- Snapshot restarting on gateway restart is currently expected behavior during the initial load phase.
- CDC phase has better resumability through cursor position tracking.
- The most impactful short-term fix is to isolate your Lakeflow Connect pipeline definition from other DAB deployments so that tag updates and minor changes do not trigger gateway restarts.
- For the weekly connectivity losses, ensure the gateway cluster is configured for maximum uptime and coordinate large snapshot windows around known maintenance periods.
Hope this helps. Let me know if you have follow-up questions.
* This reply used an agent system I built to research and draft this response based on the wide set of documentation I have available and previous memory. I personally review the draft for any obvious issues and for monitoring system reliability and update it when I detect any drift, but there is still a small chance that something is inaccurate, especially if you are experimenting with brand new features.
2 weeks ago
Hi @DavidOldelius ,
If the initial snapshot has been fully ingested, then in the case of an ingestion gateway restart the connector will resume from where it left off, not from the beginning.
Do you observe the behaviour you described during or after the snapshot is processed? You can check this in the gateway event logs: look for origin.flow_name: {catalog}.{schema}.{table}_snapshot_flow to identify initial snapshot flows.
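For example, a minimal event log query to isolate the snapshot flows (the LIKE pattern is my assumption based on the naming convention above; verify it against your actual flow names):

```sql
SELECT
  timestamp,
  origin.flow_name,
  details
FROM
  event_log(<your_pipeline_id>)
WHERE
  origin.flow_name LIKE '%_snapshot_flow'
ORDER BY
  timestamp DESC
```

If snapshot flows keep reappearing after the initial load finished, that points at the during-snapshot restart case rather than a CDC-phase problem.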
If an ingestion gateway restart happens during initial snapshot processing, a full snapshot refresh could be required if the table could not be chunked. For example, if a SQL Server table has no primary key, unique key, or index, there is little to no chance it can be split into chunks. Skewed primary key values can also force a full refresh of a large part of the table.
I recommend checking the structure of the SQL Server tables to improve chunking. Otherwise, please contact Databricks support, who can work with you to improve performance.
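To find candidate problem tables up front, here is a sketch against the SQL Server system catalogs (these are standard views, so this should run as-is on the source database):

```sql
-- User tables with no primary key and no unique index/constraint,
-- i.e. tables that likely cannot be chunked for the snapshot.
SELECT s.name AS schema_name, t.name AS table_name
FROM sys.tables t
JOIN sys.schemas s ON s.schema_id = t.schema_id
WHERE NOT EXISTS (
    SELECT 1
    FROM sys.indexes i
    WHERE i.object_id = t.object_id
      AND (i.is_primary_key = 1 OR i.is_unique = 1)
);
```

Adding a primary key or unique index to the tables this returns, where the data model allows it, should make the snapshot chunkable and far more restart-tolerant.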
Hope it helps.
Best regards,