We have a scenario where we need to mirror thousands of tables from on-premises Db2 databases to an Azure Lakehouse. The goal is to create mirror Delta tables in the Lakehouse.
Since Lakeflow Connect currently does not support direct mirroring from on-prem Db2, we are using Qlik Replicate to capture CDC data and land it in ADLS Gen2 in Parquet format, with one folder per source table.
We then created a Declarative Pipeline in Databricks that uses Auto Loader to read the CDC files in streaming mode into a bronze staging streaming table. From there, we use Auto CDC to apply SCD Type 1 logic and write to the final bronze streaming table, and we run the pipeline in continuous mode.
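For context, a single-table version of this pipeline looks roughly like the sketch below. All paths, table names, and column names (`CUSTOMER_ID`, `cdc_timestamp`, `cdc_operation`) are hypothetical placeholders for what Qlik Replicate actually emits; older runtimes expose the CDC step as `dlt.apply_changes` rather than `dlt.create_auto_cdc_flow`:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical ADLS Gen2 landing folder for one Db2 table.
SOURCE_PATH = "abfss://cdc@<storage-account>.dfs.core.windows.net/db2/CUSTOMER/"

# Bronze staging table: Auto Loader incrementally picks up new Parquet files.
@dlt.table(name="customer_staging")
def customer_staging():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load(SOURCE_PATH)
    )

# Final bronze table, populated by the Auto CDC flow below.
dlt.create_streaming_table("customer_bronze")

dlt.create_auto_cdc_flow(
    target="customer_bronze",
    source="customer_staging",
    keys=["CUSTOMER_ID"],                 # hypothetical primary key
    sequence_by=col("cdc_timestamp"),     # hypothetical ordering column from Qlik
    apply_as_deletes=col("cdc_operation") == "DELETE",  # hypothetical op column
    stored_as_scd_type=1,                 # SCD Type 1: keep latest row only
)
```

Today we would need one pipeline definition like this per source table.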
The challenge is:
To stream thousands of tables, we would need to create thousands of individual declarative streaming pipelines, which is not scalable.
We considered using a configuration table to loop through the list of source tables dynamically, but this seems to require scheduling the pipeline, and we want to keep it in continuous mode rather than on a schedule.
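The config-driven variant we considered would look roughly like this: a single pipeline definition that reads a metadata table and generates one staging table plus one Auto CDC flow per entry. The config table name and its columns (`table_name`, `landing_path`, `primary_keys`, `cdc_timestamp`) are hypothetical:

```python
import dlt
from pyspark.sql.functions import col

# Hypothetical metadata table listing source tables, landing paths, and keys.
config_rows = spark.table("ops.cdc_table_config").collect()

def define_flows(tbl: str, path: str, keys: list):
    """Generate staging table + Auto CDC flow for one source table."""

    @dlt.table(name=f"{tbl}_staging")
    def staging():
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load(path)
        )

    dlt.create_streaming_table(f"{tbl}_bronze")
    dlt.create_auto_cdc_flow(
        target=f"{tbl}_bronze",
        source=f"{tbl}_staging",
        keys=keys,
        sequence_by=col("cdc_timestamp"),  # hypothetical ordering column
        stored_as_scd_type=1,
    )

for row in config_rows:
    define_flows(row["table_name"], row["landing_path"], row["primary_keys"])
```

Our concern is that the config is evaluated only when the pipeline definition is (re)processed, so newly added source tables would not be picked up by an already-running continuous pipeline without a pipeline update.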
Question:
Is there a scalable solution or pattern in Databricks to dynamically stream CDC data for thousands of tables using a single or minimal number of declarative pipelines, while keeping the pipeline in continuous mode?
Any guidance or best practices would be appreciated!