Databricks Community

antoalphi · ‎04-08-2026

🔹 Core Idea

Don’t let “1 pipeline = 1 always-on cluster” become your cost trap.
Instead, design for controlled parallelism + shared compute + smart grouping.

✅ A. Pipeline Sharding Strategy (Not Blind Splitting)

Instead of randomly splitting 6,700 tables into pipelines:

👉 Group tables based on:

Data volume (small / medium / large)
Change rate (CDC-heavy vs static)
Business domain (optional but useful)

Example:

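Pipeline group 1 → Small tables (0–5 GB, 1,000 tables)
Pipeline group 2 → Medium tables (5–50 GB, 800 tables)
Pipeline group 3 → Large tables (50 GB+, 100 tables)

The counts are illustrative. Each group is split into as many pipelines as needed to stay near the per-pipeline table recommendation (around 250 tables).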

💡 Benefit:

Avoids over-provisioning clusters for tiny tables
Prevents large tables from slowing everything else
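
A minimal sketch of the grouping step, assuming you already have table sizes from your source catalog. The tier boundaries come from the example above and the 250-table cap from the per-pipeline recommendation mentioned later in this post. The `shard_tables` and `tier_for` names and the sample tables are made up for illustration and are not part of any Databricks API.

```python
from collections import defaultdict

# Tier boundaries in GB, taken from the example above.
TIERS = [("small", 0, 5), ("medium", 5, 50), ("large", 50, float("inf"))]


def tier_for(size_gb):
    """Return the tier name whose [low, high) range contains size_gb."""
    for name, low, high in TIERS:
        if low <= size_gb < high:
            return name
    raise ValueError(f"invalid size: {size_gb}")


def shard_tables(table_sizes, max_tables=250):
    """Group tables by tier, then split each tier into pipelines.

    table_sizes: dict of table name -> size in GB (hypothetical input).
    Returns a dict of pipeline name -> list of table names.
    """
    by_tier = defaultdict(list)
    for table, size_gb in sorted(table_sizes.items()):
        by_tier[tier_for(size_gb)].append(table)

    pipelines = {}
    for tier, tables in by_tier.items():
        # Chunk each tier so no pipeline exceeds max_tables.
        for i in range(0, len(tables), max_tables):
            name = f"{tier}_{i // max_tables + 1}"
            pipelines[name] = tables[i:i + max_tables]
    return pipelines


# Made-up sample: 600 small tables, 2 medium, 1 large.
sizes = {f"small_{n:03d}": 1.0 for n in range(600)}
sizes.update({"orders": 20.0, "events": 45.0, "clickstream": 300.0})
plan = shard_tables(sizes)
print({name: len(tables) for name, tables in plan.items()})
```

Change rate or business domain could be added as a second grouping key in the same way.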

✅ B. Control Concurrency (THIS is the real cost lever)

By default, each pipeline may spin up its own compute, so pipelines that run at the same time multiply the compute you pay for.

👉 Instead:

Schedule pipelines sequentially or in controlled batches
Use orchestration (Workflows / Jobs)

Example Strategy:

Batch 1 → 3 pipelines run
Batch 2 → next 3 pipelines run after completion

💡 Result:
👉 Only one batch's worth of compute runs at any time, instead of compute for every pipeline at once
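
The batching idea above can be sketched in a few lines. This is only an illustration: `run_in_batches` is a made-up helper, and `run_pipeline` stands for whatever call you use to start a pipeline update and wait for it to finish (here it is replaced by a stand-in that just records the pipeline name).

```python
from concurrent.futures import ThreadPoolExecutor


def run_in_batches(pipeline_ids, run_pipeline, batch_size=3):
    """Run pipelines in fixed-size batches, one batch at a time.

    run_pipeline is a caller-supplied function (hypothetical here) that
    starts one pipeline update and blocks until it finishes.
    Returns the list of batches in the order they ran.
    """
    batches = [
        pipeline_ids[i:i + batch_size]
        for i in range(0, len(pipeline_ids), batch_size)
    ]
    for batch in batches:
        # Pipelines inside a batch run in parallel; leaving the `with`
        # block waits for all of them before the next batch starts.
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            # list() collects the results, which re-raises any failure.
            list(pool.map(run_pipeline, batch))
    return batches


# Stand-in for a real trigger call; it only records what was run.
started = []
batches = run_in_batches([f"pipeline_{n}" for n in range(1, 8)], started.append)
print(batches)
```

In a Workflows job, task dependencies between pipeline tasks would express the same ordering without custom code.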

✅ C. Use Job Clusters (Not All-Purpose Clusters)

In Databricks:

Use job clusters (auto-terminate after run)
Avoid long-running clusters for ingestion

💡 Why:

You only pay when pipelines run
No idle cost

✅ D. Right-Size Compute Per Pipeline

Not all pipelines need the same cluster size.

Pipeline Type	Suggested Cluster
Small tables	Small autoscaling cluster
Medium tables	Medium autoscaling cluster
Large tables	Dedicated larger cluster

💡 This avoids the classic mistake:

“One big cluster for everything” → 💸
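
One way to keep sizing per pipeline type explicit is a small lookup, sketched below. The `autoscale` block with `min_workers` and `max_workers` follows the shape used in Databricks cluster settings, but the worker counts are placeholders I picked for illustration, and `cluster_spec` is a hypothetical helper.

```python
# Illustrative worker ranges per tier; tune them against real run metrics.
AUTOSCALE_BY_TIER = {
    "small": {"min_workers": 1, "max_workers": 2},
    "medium": {"min_workers": 2, "max_workers": 6},
    "large": {"min_workers": 4, "max_workers": 16},
}


def cluster_spec(tier):
    """Return a cluster settings fragment for one pipeline tier."""
    if tier not in AUTOSCALE_BY_TIER:
        raise KeyError(f"unknown tier: {tier}")
    # Copy so callers can modify the result without touching the defaults.
    return {"autoscale": dict(AUTOSCALE_BY_TIER[tier])}


for tier in AUTOSCALE_BY_TIER:
    print(tier, cluster_spec(tier))
```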

✅ E. Incremental First, Snapshot Once

Do initial load once
Then rely on CDC / incremental ingestion

💡 Huge cost saver:

Snapshot = expensive
Incremental = cheap
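
A rough back-of-the-envelope comparison shows why. The function below is a sketch under stated assumptions: one run per day, a fixed fraction of the table changing daily, and data volume used as a proxy for compute cost. The 100 GB size and 2% change rate are made-up inputs.

```python
def gb_processed(table_gb, daily_change_rate, days, mode):
    """Estimate total GB read over `days` daily runs.

    mode="snapshot": reload the full table on every run.
    mode="incremental": one full load, then only changed data each run.
    """
    if mode == "snapshot":
        return table_gb * days
    if mode == "incremental":
        return table_gb + table_gb * daily_change_rate * (days - 1)
    raise ValueError(f"unknown mode: {mode}")


# Assumed inputs: a 100 GB table where 2% of the data changes per day.
full = gb_processed(100, 0.02, 30, "snapshot")
incr = gb_processed(100, 0.02, 30, "incremental")
print(f"snapshot: {full:.0f} GB, incremental: {incr:.0f} GB")
```

With these assumed numbers, 30 days of daily snapshots read about 3,000 GB, while one initial load plus 29 incremental runs reads about 158 GB.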

✅ F. Advanced: Shared Compute Pattern (If Needed)

If scale is very large:

👉 Instead of many pipelines:

Use fewer pipelines
Increase table parallelism inside pipeline

OR

👉 Hybrid approach:

Land changes with a CDC tool and ingest them with Databricks Auto Loader
This reduces dependency on Lakeflow Connect per-pipeline limits

✅ G. Cost Guardrails

Cluster auto-termination (15–30 mins)
Max workers cap
Budget alerts
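
The first two guardrails can be checked automatically before a cluster spec is deployed. Below is a minimal sketch: `guardrail_violations` is a hypothetical helper, the cap of 8 workers is an arbitrary example, and the 30-minute limit is the upper end of the range above. The field names (`autotermination_minutes`, `autoscale.max_workers`, `num_workers`) follow the Databricks cluster settings shape.

```python
def guardrail_violations(spec, max_workers_cap=8, max_idle_minutes=30):
    """Return a list of guardrail violations for a cluster spec dict."""
    problems = []

    # Auto-termination must be set and no longer than the allowed idle time.
    idle = spec.get("autotermination_minutes")
    if idle is None or idle <= 0:
        problems.append("auto-termination is not set")
    elif idle > max_idle_minutes:
        problems.append(
            f"auto-termination {idle} min exceeds {max_idle_minutes} min"
        )

    # Use the autoscale ceiling if present, else the fixed worker count.
    workers = spec.get("autoscale", {}).get(
        "max_workers", spec.get("num_workers", 0)
    )
    if workers > max_workers_cap:
        problems.append(f"max workers {workers} exceeds cap {max_workers_cap}")
    return problems


ok_spec = {"autotermination_minutes": 20, "autoscale": {"max_workers": 6}}
bad_spec = {"autotermination_minutes": 120, "num_workers": 32}
print(guardrail_violations(ok_spec))
print(guardrail_violations(bad_spec))
```

Budget alerts are configured at the account level and are not covered by this check.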

🔥 Key Takeaway

You don’t reduce cost by reducing pipelines.
You reduce cost by controlling how compute is used across pipelines.

🗣️ Customer-Facing Explanation (Simple + Reassuring)

Here’s how you explain this without triggering panic 👇

Customer-Friendly Version:

While Databricks Lakeflow Connect currently recommends around 250 tables per pipeline for optimal performance, this does not mean costs will scale linearly with the number of pipelines.

Our design approach ensures cost efficiency through:

Controlled execution: Pipelines are scheduled in batches, not all running simultaneously
On-demand compute: We use ephemeral job clusters that start only during ingestion and shut down automatically
Right-sized resources: Each pipeline uses appropriately sized compute based on workload
Incremental ingestion: After initial load, only changes are processed, significantly reducing compute usage

👉 In practice, this means:

Compute resources are reused across pipelines
There is no need to keep multiple clusters running continuously
Overall cost is optimized despite having multiple pipelines

🎯 Reassurance Line (Very Important)

“We scale pipelines for performance, but we control compute for cost.”

💬 If Customer Pushes Further (“Still sounds expensive…”)

You can say:

Multiple pipelines improve reliability and fault isolation
Parallelism is configurable, not mandatory
Cost is driven by runtime, not pipeline count

⚡ Bonus: One-Liner You Can Use in Meetings

“We decouple scalability from cost by orchestrating pipelines over shared, on-demand compute.”