Hi,
We have an Azure SQL Server database (replicating from an on-prem SQL Server) whose data needs to land in our Databricks bronze layer and beyond.
This database has hundreds of tables, all of which are required. Table sizes vary from very small up to 100 million+ rows for the biggest tables, and the change rate on the biggest tables can be around 10,000 rows per hour.
So far, we've been using Lakehouse Federation plus materialised views generated via DLT pipelines to deliver SQL data into Databricks, but this is a bigger scale and rate of change than we've handled before. We don't believe incremental refresh is possible with this method (the source doesn't have row tracking available), so we'd have to do a full load into the materialised view on every refresh. Is that correct?
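For context, this is roughly the pattern we use today, a DLT materialised view defined over a federated SQL Server table (catalog, schema and table names below are just placeholders, not our real ones):

```python
# Runs inside a DLT pipeline notebook, where `spark` is provided.
import dlt

@dlt.table(
    name="bronze_orders",
    comment="Snapshot of dbo.orders pulled via Lakehouse Federation",
)
def bronze_orders():
    # Batch read from the foreign (federated) catalog. As we understand it,
    # without row tracking / change tracking on the source this gets
    # recomputed in full on every pipeline update rather than incrementally.
    return spark.read.table("sqlserver_federated.dbo.orders")
```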
We're also looking again at native SQL Server CDC. It still seems to have the same limitations it had when we last looked, i.e. it has to be enabled for every table individually (as above, we have over 500 tables), and handling schema drift takes a fair amount of code and management.
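To illustrate the per-table overhead we're wary of: with 500+ tables we'd end up scripting the CDC enablement, something like the sketch below (connection string, credentials and table list are placeholders; assumes pyodbc and sufficient permissions on the source database):

```python
import pyodbc

# Placeholder table list - in reality this would be 500+ entries,
# probably pulled from INFORMATION_SCHEMA.TABLES.
TABLES = [("dbo", "orders"), ("dbo", "customers")]

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=someuser;PWD=...",  # placeholder credentials
    autocommit=True,
)
cur = conn.cursor()

# CDC is enabled once per database, then table by table.
cur.execute("EXEC sys.sp_cdc_enable_db")
for schema, table in TABLES:
    cur.execute(
        "EXEC sys.sp_cdc_enable_table "
        "@source_schema = ?, @source_name = ?, @role_name = NULL",
        schema,
        table,
    )
conn.close()
```

And that's before dealing with schema drift, which is where most of the ongoing management effort seems to go.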
We'd welcome thoughts and the latest ideas on the best way to handle this from the Databricks end. Do you think our usual method will cope with this scale? Are we missing something on MV incremental refresh or CDC?
As always, thanks in advance!
Nick