Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Scaling Declarative Streaming Pipelines for CDC from On-Prem Database to Lakehouse

vartyg
New Contributor

We have a scenario where we need to mirror thousands of tables from on-premises Db2 databases to an Azure Lakehouse. The goal is to create mirror Delta tables in the Lakehouse.

Since LakeFlow Connect currently does not support direct mirroring from on-prem Db2, we are using Qlik Replicate to capture CDC data and land it in ADLS Gen2 in Parquet format, one table per folder.

We then created a Declarative Pipeline in Databricks using Auto Loader to read the CDC files in streaming mode into a staging bronze streaming table. From there, we use Auto-CDC to apply SCD Type 1 logic and write to the final bronze streaming table, running the pipeline in continuous mode.
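For a single table, the flow described above could be sketched roughly as follows. This is an illustrative sketch, not the poster's actual code: the `register_mirror` helper, the staging naming convention, and any key or timestamp column names are assumptions, and `dlt` and `spark` are only available inside a Databricks Declarative Pipeline runtime (hence the lazy import).

```python
def staging_name(table: str) -> str:
    # Naming convention (an assumption): each source table gets a "<name>_staging" table
    return f"{table}_staging"

def register_mirror(table: str, source_path: str, keys: list, sequence_by: str):
    """Register one Auto Loader -> Auto-CDC flow for a single source table.

    Callable only inside a Databricks Declarative Pipeline, where the `dlt`
    module and the `spark` session are provided by the runtime.
    """
    import dlt  # lazy import: only resolvable in the pipeline runtime

    @dlt.table(name=staging_name(table))
    def _staging():
        # Auto Loader incrementally ingests new Parquet CDC files
        # landed by Qlik Replicate in this table's folder
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load(source_path)
        )

    # Final bronze table, maintained with SCD Type 1 semantics via Auto-CDC
    dlt.create_streaming_table(table)
    dlt.apply_changes(
        target=table,
        source=staging_name(table),
        keys=keys,                 # primary key column(s); table-specific
        sequence_by=sequence_by,   # e.g. a Qlik Replicate change-ordering column
        stored_as_scd_type=1,
    )
```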

The challenge is:
To stream thousands of tables, we would need to create thousands of individual declarative streaming pipelines, which is not scalable.

We considered using a configuration table to loop through the list of source tables dynamically, but this would require scheduling the pipeline, and we want to keep it in continuous mode, not scheduled.

Question:
Is there a scalable solution or pattern in Databricks to dynamically stream CDC data for thousands of tables using a single or minimal number of declarative pipelines, while keeping the pipeline in continuous mode?

Any guidance or best practices would be appreciated!

2 REPLIES

bidek56
Contributor

AbhaySingh
Databricks Employee

Yes, the DLT-META project from Databricks Labs seems like a great fit for your scenario.

https://databrickslabs.github.io/dlt-meta/index.html
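The metadata-driven pattern behind that project can be sketched as below: a configuration source drives a loop that registers one Auto Loader → Auto-CDC flow per table inside a single pipeline, so one continuous pipeline serves many tables. This is a simplified sketch of the pattern, not DLT-META itself; the config rows, column names, and paths are hypothetical, and `dlt`/`spark` exist only inside a Databricks pipeline runtime.

```python
def load_table_configs():
    # Stand-in for a configuration Delta table; each row drives one table's
    # flow. All table names, paths, and column names here are hypothetical.
    return [
        {"table": "customers", "path": "/mnt/landing/db2/customers",
         "keys": ["customer_id"], "sequence_by": "change_ts"},
        {"table": "orders", "path": "/mnt/landing/db2/orders",
         "keys": ["order_id"], "sequence_by": "change_ts"},
    ]

def register_all_flows():
    # Only callable inside a Databricks Declarative Pipeline, where the
    # `dlt` module and `spark` session are provided by the runtime.
    import dlt

    def make_staging(name, path):
        # Factory binds `name`/`path` per iteration, avoiding the classic
        # late-binding closure bug when defining functions in a loop.
        @dlt.table(name=f"{name}_staging")
        def _staging():
            return (
                spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "parquet")
                .load(path)
            )
        return _staging

    for cfg in load_table_configs():
        make_staging(cfg["table"], cfg["path"])
        dlt.create_streaming_table(cfg["table"])
        dlt.apply_changes(
            target=cfg["table"],
            source=f"{cfg['table']}_staging",
            keys=cfg["keys"],
            sequence_by=cfg["sequence_by"],
            stored_as_scd_type=1,
        )
```

Because all flows live in one pipeline definition, the pipeline can stay in continuous mode; adding a table becomes a config change rather than a new pipeline. Note that a pipeline update is still needed for new tables to be picked up from the config source.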