In Databricks Lakeflow Connect for MySQL (currently in public preview), Databricks recommends limiting each ingestion pipeline to around 250 tables, with validated testing up to 1 TB of snapshot data.
However, real-world enterprise environments are often significantly larger: thousands of tables (e.g., 6,000–7,000) and snapshot volumes of multiple terabytes.
To stay within the recommended limits, we must create multiple ingestion pipelines. Since each pipeline typically provisions its own compute resources (clusters), this can lead to:
- Increased infrastructure costs due to multiple clusters running in parallel
- Higher operational overhead in managing multiple pipelines
- Customer dissatisfaction due to perceived inefficiency and cost escalation
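To make the scale concrete, here is a minimal sketch of how a table inventory would be partitioned under the ~250-tables-per-pipeline guideline. The table names and the `batch_tables` helper are hypothetical; this only estimates how many pipelines a given catalog would require, it does not call any Databricks API.

```python
# Sketch: partition a large source-table list into per-pipeline batches,
# assuming the documented guideline of ~250 tables per ingestion pipeline.
TABLES_PER_PIPELINE = 250

def batch_tables(tables, batch_size=TABLES_PER_PIPELINE):
    """Split a flat list of table names into per-pipeline batches."""
    return [tables[i:i + batch_size] for i in range(0, len(tables), batch_size)]

# Hypothetical inventory of 6,500 source tables.
tables = [f"sales_db.table_{n}" for n in range(6500)]
batches = batch_tables(tables)
print(len(batches))  # 26 pipelines at 250 tables each
```

At 6,500 tables this yields 26 pipelines, and with each pipeline provisioning its own cluster, the cost and management overhead described above compound quickly.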
This raises an important challenge:
How can we design a scalable ingestion strategy that handles large table counts and data volumes efficiently, while minimizing compute cost and avoiding unnecessary cluster proliferation?