Databricks Community

manish24101981 · ‎06-22-2025

We are currently delivering a large-scale healthcare data migration project involving:

One-time historical migration of approx. 80 TB of data, already completed and loaded into Delta Lake.
CDC merge logic is already developed and validated using Apache Spark (Databricks DBR) notebooks.
Now moving into the real-time streaming phase, with Kafka ingest rates ranging between 3,000 to 30,000 events per second, across multiple topics and domains.

We are evaluating the best-fit architecture for this next phase, and would appreciate your expert guidance. Our goal is to ensure scalability, operational simplicity, and cost optimization, as our customer is highly sensitive to long-term running costs.

We are considering the following three options:

Option A: Continue using Databricks DBR (non-DLT) notebooks for both CDC and streaming

Pros: Unified codebase, no transition effort.
Concern: Limited autoscaling (especially scale-in) during low-volume periods, potentially increasing cost.

Option B: Use Databricks DBR for CDC, and Delta Live Tables (DLT) for Streaming

Question: How complex will it be to transition from DBR-based CDC pipelines to DLT-based streaming pipelines within the same workspace/project?
Are there migration tools, best practices, or effort estimations available to assist this shift?

Option C: Use Delta Live Tables (DLT) for both CDC and Streaming

Question: Can DLT support both structured batch CDC merges as well as streaming ingestion at our required event rates?
Since our customer is cost-conscious, how can we technically demonstrate that DLT will actually result in cost savings, especially in terms of aggressive autoscaling down (scale-in) behavior?

In summary:

Which option would you recommend for our case and why?
If you recommend DLT, can you help us validate the assumption that DLT is more cost-effective than DBR, particularly in streaming workloads with idle windows?
Any benchmarking guidelines, usage calculators, or example cost comparisons would be appreciated.

mark_ott · ‎10-01-2025

For cost-sensitive, large-scale healthcare data streaming scenarios, using Delta Live Tables (DLT) for both CDC and streaming (Option C) is generally the most scalable, manageable, and cost-optimized approach. DLT offers native support for structured batch CDC and high-throughput streaming ingestion, plus robust autoscaling and simplified operations compared to traditional notebook-driven architectures.

Evaluation of Each Option

Option A: Databricks DBR Notebooks for CDC & Streaming

Pros: Consistent codebase and zero migration effort.
Cons: Notebooks lack the fine-grained autoscaling and operational abstraction of DLT. Scale-in is often less aggressive, leading to higher steady-state costs. Notebooks require more manual orchestration and monitoring, which can increase operational complexity over time.

Option B: DBR for CDC, DLT for Streaming

Transition Effort: Migrating CDC logic from DBR notebooks to DLT requires refactoring pipeline code—largely syntactic changes, replacing notebook-oriented code (e.g., direct Spark DataFrame operations) with declarative DLT transformations.
- Tools & Best Practices: Databricks provides documentation on migration from Spark notebooks to DLT pipelines, covering code adjustments, testing strategies, and deployment processes. However, there is no fully automated refactoring tool; migration is a semi-manual, guided process.
- Effort Estimation: For a typical CDC pipeline, expect 2-4 weeks of hands-on effort for initial migration, integration testing, and validation within the same workspace. Complexity increases with custom logic, external dependencies, or highly individualized notebook constructs.

Option C: DLT for CDC & Streaming

DLT Capabilities: DLT natively handles both batch (historical and periodic CDC) and streaming ingestion. It scales to thousands of events per second per topic with built-in reliability, idempotency, and schema enforcement.
Cost Optimization: DLT’s autoscaling (especially the Enhanced Autoscaling feature) can aggressively scale down resources during low-volume periods, unlike notebook jobs which tend to reserve clusters. DLT also reduces cloud compute footprint by orchestrating resources more efficiently, resulting in lower long-term costs.
- Technical Cost Demonstration: DLT provides real-time metrics (CPU, memory, costs per event/operation) and autoscaling history. Running a proof-of-value with identical workloads on both DBR notebooks and DLT pipelines can surface quantifiable cost differences. Many organizations observe 15–30% lower steady-state costs with DLT due to automatic scale-in and resource pooling.

Recommendations

Recommended Option: Option C (DLT for both CDC and Streaming) is optimal for your scenario, given the performance needs, cost sensitivity, and desired operational simplicity.
- DLT is designed for seamless unified batch and streaming workflows, and at your scale, the operational savings typically outweigh initial migration effort.
- To convince cost-sensitive stakeholders, implement a short-term POC where you benchmark identical workloads on both approaches and collect operational cost data.
Migration Effort (if choosing Option B or transitioning to C):
- Use Databricks’ official migration guides and allocate 2–4 weeks for CDC pipeline refactoring, integration, and acceptance tests.
- Engage Databricks solution architects for advanced optimization and troubleshooting.

View solution in original post

mark_ott · ‎10-01-2025