Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Scaling Declarative Streaming Pipelines for CDC from On-Prem Database to Lakehouse

vartyg
New Contributor

We have a scenario where we need to mirror thousands of tables from on-premises Db2 databases to an Azure Lakehouse. The goal is to create mirror Delta tables in the Lakehouse.

Since LakeFlow Connect currently does not support direct mirroring from on-prem Db2, we are using Qlik Replicate to capture CDC data and land it in ADLS Gen2 in Parquet format, one table per folder.

We then created a Declarative Pipeline in Databricks using Auto Loader to read the CDC files in streaming mode into a staging bronze streaming table. From there, we use Auto-CDC to apply SCD Type 1 logic and write to the final bronze streaming table, running the pipeline in continuous mode.
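For a single table, the flow described above could be sketched roughly as follows. This is an illustrative sketch, not the poster's actual code: the `register_mirror` helper, the staging naming convention, and any key or timestamp column names are assumptions, and `dlt` and `spark` are only available inside a Databricks Declarative Pipeline runtime (hence the lazy import).

```python
def staging_name(table: str) -> str:
    # Naming convention (an assumption): each source table gets a "<name>_staging" table
    return f"{table}_staging"

def register_mirror(table: str, source_path: str, keys: list, sequence_by: str):
    """Register one Auto Loader -> Auto-CDC flow for a single source table.

    Callable only inside a Databricks Declarative Pipeline, where the `dlt`
    module and the `spark` session are provided by the runtime.
    """
    import dlt  # lazy import: only resolvable in the pipeline runtime

    @dlt.table(name=staging_name(table))
    def _staging():
        # Auto Loader incrementally ingests new Parquet CDC files
        # landed by Qlik Replicate in this table's folder
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load(source_path)
        )

    # Final bronze table, maintained with SCD Type 1 semantics via Auto-CDC
    dlt.create_streaming_table(table)
    dlt.apply_changes(
        target=table,
        source=staging_name(table),
        keys=keys,                 # primary key column(s); table-specific
        sequence_by=sequence_by,   # e.g. a Qlik Replicate change-ordering column
        stored_as_scd_type=1,
    )
```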

The challenge is:
To stream thousands of tables, we would need to create thousands of individual declarative streaming pipelines, which is not scalable.

We considered using a configuration table to loop through the list of source tables dynamically, but this would require scheduling the pipeline, and we want to keep it in continuous mode, not scheduled.

Question:
Is there a scalable solution or pattern in Databricks to dynamically stream CDC data for thousands of tables using a single or minimal number of declarative pipelines, while keeping the pipeline in continuous mode?

Any guidance or best practices would be appreciated!

2 REPLIES

bidek56
Contributor

AbhaySingh
Databricks Employee

Yes, the DLT-META project from Databricks Labs seems like a great fit for your scenario.

https://databrickslabs.github.io/dlt-meta/index.html
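The metadata-driven pattern behind that project can be sketched as below: a configuration source drives a loop that registers one Auto Loader → Auto-CDC flow per table inside a single pipeline, so one continuous pipeline serves many tables. This is a simplified sketch of the pattern, not DLT-META itself; the config rows, column names, and paths are hypothetical, and `dlt`/`spark` exist only inside a Databricks pipeline runtime.

```python
def load_table_configs():
    # Stand-in for a configuration Delta table; each row drives one table's
    # flow. All table names, paths, and column names here are hypothetical.
    return [
        {"table": "customers", "path": "/mnt/landing/db2/customers",
         "keys": ["customer_id"], "sequence_by": "change_ts"},
        {"table": "orders", "path": "/mnt/landing/db2/orders",
         "keys": ["order_id"], "sequence_by": "change_ts"},
    ]

def register_all_flows():
    # Only callable inside a Databricks Declarative Pipeline, where the
    # `dlt` module and `spark` session are provided by the runtime.
    import dlt

    def make_staging(name, path):
        # Factory binds `name`/`path` per iteration, avoiding the classic
        # late-binding closure bug when defining functions in a loop.
        @dlt.table(name=f"{name}_staging")
        def _staging():
            return (
                spark.readStream.format("cloudFiles")
                .option("cloudFiles.format", "parquet")
                .load(path)
            )
        return _staging

    for cfg in load_table_configs():
        make_staging(cfg["table"], cfg["path"])
        dlt.create_streaming_table(cfg["table"])
        dlt.apply_changes(
            target=cfg["table"],
            source=f"{cfg['table']}_staging",
            keys=cfg["keys"],
            sequence_by=cfg["sequence_by"],
            stored_as_scd_type=1,
        )
```

Because all flows live in one pipeline definition, the pipeline can stay in continuous mode; adding a table becomes a config change rather than a new pipeline. Note that a pipeline update is still needed for new tables to be picked up from the config source.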