Databricks Community

antoalphi · ‎04-08-2026

In Databricks Lakeflow Connect for MySQL (currently in public preview), Databricks recommends limiting each ingestion pipeline to around 250 tables, with validated testing up to 1 TB of snapshot data.

However, in real-world enterprise scenarios, customers often have significantly larger environments for example, thousands of tables (e.g., 6,000–7,000) and data volumes exceeding multiple terabytes.

To accommodate this, we are required to create multiple ingestion pipelines. Since each pipeline typically provisions its own compute resources (clusters), this can lead to:

Increased infrastructure costs due to multiple clusters running in parallel
Higher operational overhead in managing multiple pipelines
Customer dissatisfaction due to perceived inefficiency and cost escalation

This raises an important challenge:
How can we design a scalable ingestion strategy that handles large table volumes and data sizes efficiently, while minimizing compute cost and avoiding unnecessary cluster proliferation?

Sumit_7 · ‎04-08-2026

@antoalphi I think you have already answered in the first line itself - Public Preview -- meaning not full developed for general/real use. Hence it comes with limitations or bugs which will be covered in General Available. Though still you may consider following points to tackle effectively:

Group tables - split pipelines by schema / domain / size (not randomly)
Use incremental (CDC) instead of full snapshot - reduces compute drastically
Orchestrate pipelines (Databricks Workflows) - run sequentially or staggered to avoid many clusters at once
Use serverless pipelines (where supported) - reduces cluster management overhead

Hope this helps, thanks.

Lu_Wang_ENB_DBX · 2 weeks ago

TLDR: Do not scale Lakeflow Connect for MySQL linearly by creating dozens of identical pipelines. Current guidance is ~250 tables per pipeline, with a hard limit of 1,000 flows / ~500 tables, and the MySQL connector has been tested at 100 tables with <1 TB snapshot data. The gateway must also run continuously to avoid binlog cleanup/full refresh issues.

Options

Tiered Lakeflow Connect sharding
- Split pipelines by criticality / change rate / table size, not just by table count.
- Keep hot, business-critical tables in Lakeflow Connect; put cold/rarely used tables on less frequent schedules.
- Stagger pipeline schedules so ingestion pipelines don’t all run at once, while gateways stay up.
Consolidate before ingest
- If many source tables are structurally similar, consolidate at source or landing into fewer larger destination tables/views.
- The guidance for very high table-count scenarios is that some consolidation is necessary; otherwise you hit table/platform limits and cost grows quickly.
Use external CDC for the long tail
- For 6,000–7,000 tables / multi-TB estates, use Debezium/Kafka or object storage landing + Auto Loader / SDP for the bulk estate, and reserve Lakeflow Connect for the subset that benefits most from managed CDC.
- We recommend Debezium -> Auto Loader / SDP as a reasonable architecture for this scale because it allows consolidation and avoids massive pipeline sprawl.

Recommendation

Recommend Option 3 + Option 1 together:
Use Lakeflow Connect only for the top-priority subset of MySQL tables, grouped into a small number of carefully sharded pipelines, and move the long tail to an external CDC landing pattern. That is the most scalable path for thousands of tables / multi-TB while minimizing cluster proliferation, ops overhead, and cost.

Databricks Community

Databricks Lakeflow Connect for MySQL

Options

Recommendation

Welcome to the Family! 2,800+ New Members and Counting

🌟 Community Pulse: Your Weekly Roundup! June 15 – 21, 2026

Solution Accelerator Series | Creating Brand-Aligned Images Using Generative AI

Build apps without jumping through hoops

FREE TRAINING: Databricks Business Impact Accelerator