Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Lakeflow Connect: Data Ingestion from SQL Server to Databricks

shan-databricks
Databricks Partner

We have a use case to ingest data from SQL Server into Databricks using Lakeflow Connect. There are 100 tables, and on a daily basis we will perform inserts, updates, and deletes based on CDC data. For this requirement, how can we enable multiple parallel connections to the SQL Server database?
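For context, we enable change capture on the source per table before ingestion. A minimal T-SQL sketch of what we run today (schema and table names are illustrative):

```sql
-- Enable CDC at the database level (run once in the source database)
EXEC sys.sp_cdc_enable_db;

-- Enable CDC for one table; repeated for each of the 100 tables
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'orders',   -- placeholder table name
    @role_name     = NULL;        -- no gating role
```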

2 REPLIES

Ashwin_DSA
Databricks Employee

Hi @shan-databricks,

Databricks recommends up to ~250 tables per pipeline, so 100 is well within guidance. Lakeflow Connect doesn't expose a user-facing control for multiple parallel connections. Instead, configure a single SQL Server ingestion gateway with sufficient cores; Databricks automatically manages the parallel JDBC connections from the gateway to your SQL Server.

When you give the gateway enough cores (via its compute policy and node sizes), Databricks scales extraction in parallel, opening and managing multiple JDBC connections internally.

If this answer resolves your question, could you mark it as "Accept as Solution"? That helps other users quickly find the correct fix.


Regards,
Ashwin | Delivery Solution Architect @ Databricks
Helping you build and scale the Data Intelligence Platform.
***Opinions are my own***

amirabedhiafi
New Contributor II

Hello @shan-databricks!

One additional point: I would also validate the expected load with the SQL Server DBA, because even if Lakeflow manages the parallelism internally, the source SQL Server still needs to handle those concurrent reads. For 100 tables, I would start with one pipeline/gateway, monitor extraction duration and SQL Server load, and only split into multiple pipelines/gateways if there is a clear operational need, such as different refresh SLAs, very large tables, or source-side throttling. Also note that for tables with primary keys, Change Tracking (CT) is generally preferred over Change Data Capture (CDC) to reduce source overhead.
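If you go the CT route, it is enabled at both the database and table level; here is a minimal T-SQL sketch (database and table names are placeholders, and CT requires a primary key on each table):

```sql
-- Enable Change Tracking at the database level
ALTER DATABASE MySourceDb
SET CHANGE_TRACKING = ON
    (CHANGE_RETENTION = 3 DAYS, AUTO_CLEANUP = ON);

-- Enable Change Tracking per table (requires a primary key)
ALTER TABLE dbo.orders
ENABLE CHANGE_TRACKING
WITH (TRACK_COLUMNS_UPDATED = ON);
```

Make sure CHANGE_RETENTION comfortably exceeds your worst-case gap between pipeline runs, or the connector may lose track of changes and need a full refresh.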

If this answer resolves your question, could you please mark it as "Accept as Solution"? It will help other users quickly find the correct fix.

Senior BI/Data Engineer | Microsoft MVP Data Platform | Microsoft MVP Power BI | Power BI Super User | C# Corner MVP