We are currently delivering a large-scale healthcare data migration project involving:
A one-time historical migration of approximately 80 TB of data, already completed and loaded into Delta Lake.
CDC merge logic, already developed and validated in Apache Spark (Databricks Runtime, DBR) notebooks; a simplified sketch of the current pattern follows this list.
The real-time streaming phase we are now entering, with Kafka ingest rates ranging from 3,000 to 30,000 events per second across multiple topics and domains.
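For context, the current DBR notebook pattern looks roughly like the following (a minimal sketch; the Kafka brokers, topic, table name, key, and schema are placeholders, and real parsing/validation logic is omitted):

```python
# Simplified sketch of our current DBR notebook pattern; topic, table, and column
# names are placeholders, and real parsing/validation logic is omitted.
from delta.tables import DeltaTable
from pyspark.sql import functions as F

cdc_schema = "patient_id STRING, updated_at TIMESTAMP, op STRING, payload STRING"

# `spark` is the notebook's SparkSession.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder brokers
    .option("subscribe", "patient_cdc")                   # placeholder topic
    .option("startingOffsets", "latest")
    .load()
    .select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("e"))
    .select("e.*")
)

def merge_batch(batch_df, batch_id):
    # Upsert each micro-batch into the Delta target, keyed on the business key.
    target = DeltaTable.forName(spark, "main.clinical.patients")  # placeholder table
    (
        target.alias("t")
        .merge(batch_df.alias("s"), "t.patient_id = s.patient_id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    events.writeStream
    .foreachBatch(merge_batch)
    .option("checkpointLocation", "/checkpoints/patient_cdc")      # placeholder path
    .start()
)
```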
We are evaluating the best-fit architecture for this next phase and would appreciate your expert guidance. Our goals are scalability, operational simplicity, and cost optimization, as our customer is highly sensitive to long-term running costs.
We are considering the following three options:
Option A: Continue using Databricks DBR (non-DLT) notebooks for both CDC and streaming
Pros: Unified codebase, no transition effort.
Concern: Limited autoscaling (especially scale-in) during low-volume periods, potentially increasing cost.
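To make the concern concrete, our streaming job clusters today are configured along these lines (a simplified spec with placeholder values); because the stream produces micro-batches continuously, standard autoscaling rarely scales in during quiet periods, so we effectively pay for near-peak capacity around the clock:

```python
# Simplified job-cluster spec for the current DBR streaming notebooks (all values are placeholders).
streaming_cluster_spec = {
    "spark_version": "14.3.x-scala2.12",   # example LTS DBR version
    "node_type_id": "Standard_D8ds_v5",    # placeholder node type
    "autoscale": {
        "min_workers": 4,                  # floor that stays allocated even in low-volume windows
        "max_workers": 16,                 # ceiling sized for the 30,000 events/sec peak
    },
}
```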
Option B: Use Databricks DBR for CDC and Delta Live Tables (DLT) for streaming
Question: How complex will it be to transition from DBR-based CDC pipelines to DLT-based streaming pipelines within the same workspace/project?
Are there migration tools, best practices, or effort estimates available to assist with this shift? (Our current understanding of how the existing merge logic would map to DLT is sketched below.)
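This is how we currently believe the foreachBatch MERGE above would translate into a DLT pipeline using apply_changes (a sketch under our assumptions; table names, keys, sequencing column, and SCD type are placeholders, not a validated design):

```python
# Sketch of how we believe the existing foreachBatch MERGE would translate into DLT;
# names and SCD handling below are our assumptions, not a validated design.
import dlt
from pyspark.sql import functions as F

cdc_schema = "patient_id STRING, updated_at TIMESTAMP, op STRING, payload STRING"

@dlt.view
def patient_cdc_events():
    # Same Kafka source as today; brokers and topic are placeholders.
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "patient_cdc")
        .load()
        .select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("e"))
        .select("e.*")
    )

# Declare the target streaming table and let DLT manage the CDC merge and event ordering.
dlt.create_streaming_table("patients")

dlt.apply_changes(
    target="patients",
    source="patient_cdc_events",
    keys=["patient_id"],
    sequence_by=F.col("updated_at"),
    apply_as_deletes=F.expr("op = 'DELETE'"),
    stored_as_scd_type=1,
)
```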
Option C: Use Delta Live Tables (DLT) for both CDC and streaming
Question: Can DLT support both structured batch CDC merges and streaming ingestion at our required event rates?
Since our customer is cost-conscious, how can we technically demonstrate that DLT will actually result in cost savings, especially through aggressive scale-in during low-volume periods? (A rough version of the comparison we would like to put in front of them follows below.)
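For illustration, this is the kind of back-of-the-envelope model we would like to validate with you; every rate, worker count, and idle fraction below is a placeholder assumption rather than a quoted price:

```python
# Back-of-the-envelope cost model; every number below is an assumption for illustration only.
HOURS_PER_MONTH = 730

# Scenario 1: fixed-size DBR streaming cluster running 24x7.
dbr_workers = 8
dbr_dbu_per_worker_hour = 1.0          # placeholder DBU rate per worker-hour
dbr_dbu_price = 0.30                   # placeholder $/DBU for jobs compute
dbr_monthly = dbr_workers * dbr_dbu_per_worker_hour * dbr_dbu_price * HOURS_PER_MONTH

# Scenario 2: DLT with enhanced autoscaling, shrinking to a small floor during idle windows.
busy_fraction = 0.4                    # assumed share of the day at peak ingest
dlt_peak_workers, dlt_idle_workers = 8, 2
dlt_dbu_per_worker_hour = 1.0          # placeholder DBU rate per worker-hour
dlt_dbu_price = 0.36                   # placeholder $/DBU for DLT compute
dlt_monthly = (
    (busy_fraction * dlt_peak_workers + (1 - busy_fraction) * dlt_idle_workers)
    * dlt_dbu_per_worker_hour * dlt_dbu_price * HOURS_PER_MONTH
)

print(f"Fixed DBR cluster:    ${dbr_monthly:,.0f}/month")
print(f"DLT with autoscaling: ${dlt_monthly:,.0f}/month")
```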
In summary:
Which option would you recommend for our case and why?
If you recommend DLT, can you help us validate the assumption that DLT is more cost-effective than DBR, particularly in streaming workloads with idle windows?
Any benchmarking guidelines, usage calculators, or example cost comparisons would be appreciated.
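For reference, this is roughly how we plan to measure actual consumption during a side-by-side pilot, assuming Unity Catalog system billing tables are enabled in the account (table and column names reflect our understanding and should be double-checked):

```python
# Sketch of how we would compare DBU consumption between the DBR jobs and a DLT pilot.
# Assumes system billing tables are enabled; verify table and column names before use.
from pyspark.sql import functions as F

usage = spark.table("system.billing.usage")

pilot = (
    usage
    .where(F.col("usage_date") >= "2024-06-01")       # placeholder pilot window
    .where(F.col("sku_name").rlike("JOBS|DLT"))       # compare jobs compute vs DLT SKUs
    .groupBy("sku_name", "usage_date")
    .agg(F.sum("usage_quantity").alias("dbus"))
    .orderBy("usage_date", "sku_name")
)

display(pilot)  # `display` is available in Databricks notebooks
```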