mark_ott
Databricks Employee
Databricks Employee

For cost-sensitive, large-scale healthcare data streaming scenarios, using Delta Live Tables (DLT) for both CDC and streaming (Option C) is generally the most scalable, manageable, and cost-optimized approach. DLT offers native support for structured batch CDC and high-throughput streaming ingestion, plus robust autoscaling and simplified operations compared to traditional notebook-driven architectures.

Evaluation of Each Option

Option A: Databricks DBR Notebooks for CDC & Streaming

  • Pros: Consistent codebase and zero migration effort.

  • Cons: Notebooks lack the fine-grained autoscaling and operational abstraction of DLT. Scale-in is often less aggressive, leading to higher steady-state costs. Notebooks require more manual orchestration and monitoring, which can increase operational complexity over time.

Option B: DBR for CDC, DLT for Streaming

  • Transition Effort: Migrating CDC logic from DBR notebooks to DLT requires refactoring pipeline code—largely syntactic changes, replacing notebook-oriented code (e.g., direct Spark DataFrame operations) with declarative DLT transformations.

    • Tools & Best Practices: Databricks provides documentation on migration from Spark notebooks to DLT pipelines, covering code adjustments, testing strategies, and deployment processes. However, there is no fully automated refactoring tool; migration is a semi-manual, guided process.

    • Effort Estimation: For a typical CDC pipeline, expect 2-4 weeks of hands-on effort for initial migration, integration testing, and validation within the same workspace. Complexity increases with custom logic, external dependencies, or highly individualized notebook constructs.

Option C: DLT for CDC & Streaming

  • DLT Capabilities: DLT natively handles both batch (historical and periodic CDC) and streaming ingestion. It scales to thousands of events per second per topic with built-in reliability, idempotency, and schema enforcement.

  • Cost Optimization: DLT’s autoscaling (especially the Enhanced Autoscaling feature) can aggressively scale down resources during low-volume periods, unlike notebook jobs which tend to reserve clusters. DLT also reduces cloud compute footprint by orchestrating resources more efficiently, resulting in lower long-term costs.

    • Technical Cost Demonstration: DLT provides real-time metrics (CPU, memory, costs per event/operation) and autoscaling history. Running a proof-of-value with identical workloads on both DBR notebooks and DLT pipelines can surface quantifiable cost differences. Many organizations observe 15–30% lower steady-state costs with DLT due to automatic scale-in and resource pooling.

Recommendations

  • Recommended Option: Option C (DLT for both CDC and Streaming) is optimal for your scenario, given the performance needs, cost sensitivity, and desired operational simplicity.

    • DLT is designed for seamless unified batch and streaming workflows, and at your scale, the operational savings typically outweigh initial migration effort.

    • To convince cost-sensitive stakeholders, implement a short-term POC where you benchmark identical workloads on both approaches and collect operational cost data.

  • Migration Effort (if choosing Option B or transitioning to C):

    • Use Databricks’ official migration guides and allocate 2–4 weeks for CDC pipeline refactoring, integration, and acceptance tests.

    • Engage Databricks solution architects for advanced optimization and troubleshooting.

View solution in original post