Hi,
We have an Azure SQL Server database (replicating from an on-prem SQL Server) whose data needs to land in our Databricks bronze layer and beyond.
This database has hundreds of tables, all of which are required. Table sizes vary from very small up to 100 million+ rows for the biggest tables, and the change rate on the biggest tables can be around 10,000 rows per hour.
So far, we've been using Lakehouse Federation plus materialised views generated via DLT pipelines to deliver SQL data into Databricks, but this is a bigger scale and rate of change than we've handled before. We don't believe incremental refresh is possible with this method (the source doesn't have row tracking available), so we'd have to do a full load into the materialised view on every refresh. Is that correct?
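For context, this is roughly the pattern we use today, a DLT materialised view defined over a federated SQL Server table (catalog, schema and table names below are just placeholders, not our real ones):

```python
# Runs inside a DLT pipeline notebook, where `spark` is provided.
import dlt

@dlt.table(
    name="bronze_orders",
    comment="Snapshot of dbo.orders pulled via Lakehouse Federation",
)
def bronze_orders():
    # Batch read from the foreign (federated) catalog. As we understand it,
    # without row tracking / change tracking on the source this gets
    # recomputed in full on every pipeline update rather than incrementally.
    return spark.read.table("sqlserver_federated.dbo.orders")
```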
We're also looking again at native SQL Server CDC. It still seems to have the same limitations it had when we last looked, i.e. it has to be enabled for every table individually (as above, we have over 500 tables), and handling schema drift takes a fair amount of code and management.
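To illustrate the per-table overhead we're wary of: with 500+ tables we'd end up scripting the CDC enablement, something like the sketch below (connection string, credentials and table list are placeholders; assumes pyodbc and sufficient permissions on the source database):

```python
import pyodbc

# Placeholder table list - in reality this would be 500+ entries,
# probably pulled from INFORMATION_SCHEMA.TABLES.
TABLES = [("dbo", "orders"), ("dbo", "customers")]

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=mydb;"
    "UID=someuser;PWD=...",  # placeholder credentials
    autocommit=True,
)
cur = conn.cursor()

# CDC is enabled once per database, then table by table.
cur.execute("EXEC sys.sp_cdc_enable_db")
for schema, table in TABLES:
    cur.execute(
        "EXEC sys.sp_cdc_enable_table "
        "@source_schema = ?, @source_name = ?, @role_name = NULL",
        schema,
        table,
    )
conn.close()
```

And that's before dealing with schema drift, which is where most of the ongoing management effort seems to go.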
We'd welcome thoughts and the latest ideas on the best way to handle this from the Databricks end. Do you think our usual method will cope with this scale? Are we missing something on MV incremental refresh or CDC?
As always, thanks in advance!
Nick