Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

DLT or Databricks (DBR) for CDC and NRT

manish24101981
New Contributor

We are currently delivering a large-scale healthcare data migration project involving:

  • One-time historical migration of approx. 80 TB of data, already completed and loaded into Delta Lake.

  • CDC merge logic already developed and validated in Apache Spark (Databricks DBR) notebooks.

  • Now moving into the real-time streaming phase, with Kafka ingest rates ranging from 3,000 to 30,000 events per second across multiple topics and domains (a simplified sketch of the current merge pattern follows this list).
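
For context, here is a simplified sketch of roughly the kind of merge pattern we have in place today: a Structured Streaming read from Kafka with a Delta MERGE applied per micro-batch via foreachBatch. The topic, table, column, and schema names (patient_cdc, silver.patients, patient_id, event_ts, op) and the checkpoint path are placeholders rather than our actual objects, and spark is the session predefined in Databricks notebooks.

from pyspark.sql import functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

cdc_schema = "patient_id STRING, op STRING, event_ts TIMESTAMP"  # placeholder DDL schema

def merge_cdc_batch(batch_df, batch_id):
    # Keep only the newest event per key within the micro-batch, then merge into the target.
    latest = (batch_df
              .withColumn("rn", F.row_number().over(
                  Window.partitionBy("patient_id").orderBy(F.col("event_ts").desc())))
              .where("rn = 1").drop("rn"))
    (DeltaTable.forName(spark, "silver.patients").alias("t")
     .merge(latest.alias("s"), "t.patient_id = s.patient_id")
     .whenMatchedDelete(condition="s.op = 'DELETE'")
     .whenMatchedUpdateAll(condition="s.op <> 'DELETE'")
     .whenNotMatchedInsertAll(condition="s.op <> 'DELETE'")
     .execute())

(spark.readStream.format("kafka")
 .option("kafka.bootstrap.servers", "<brokers>")
 .option("subscribe", "patient_cdc")
 .load()
 .select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("e"))
 .select("e.*")
 .writeStream
 .foreachBatch(merge_cdc_batch)
 .option("checkpointLocation", "/mnt/checkpoints/patient_cdc")  # placeholder path
 .start())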

We are evaluating the best-fit architecture for this next phase, and would appreciate your expert guidance. Our goal is to ensure scalability, operational simplicity, and cost optimization, as our customer is highly sensitive to long-term running costs.

We are considering the following three options:

Option A: Continue using Databricks DBR (non-DLT) notebooks for both CDC and streaming

  • Pros: Unified codebase, no transition effort.

  • Concern: Limited autoscaling (especially scale-in) during low-volume periods, potentially increasing cost (see the illustrative cluster spec below).
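
To make the cost concern concrete: with this option the streaming job keeps running on a standard job cluster whose autoscaling looks roughly like the illustrative spec below (spark_version, node_type_id, and worker counts are placeholders). Classic autoscaling adds workers under load, but for an always-on streaming job it scales in conservatively during quiet periods, which is the cost risk noted above.

# Illustrative Jobs API "new_cluster" payload, shown as a Python dict; values are placeholders.
streaming_job_cluster = {
    "spark_version": "15.4.x-scala2.12",
    "node_type_id": "Standard_D8ds_v5",
    "autoscale": {"min_workers": 2, "max_workers": 10},
}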

Option B: Use Databricks DBR for CDC, and Delta Live Tables (DLT) for Streaming

  • Question: How complex will it be to transition from DBR-based CDC pipelines to DLT-based streaming pipelines within the same workspace/project?

  • Are there migration tools, best practices, or effort estimates available to support this transition?

Option C: Use Delta Live Tables (DLT) for both CDC and Streaming

  • Question: Can DLT support both batch CDC merges and streaming ingestion at our required event rates? (A minimal DLT sketch follows this option.)

  • Since our customer is cost-conscious, how can we technically demonstrate that DLT will actually reduce costs, particularly through its more aggressive scale-in behavior during idle periods?
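
For Option C, our understanding is that DLT handles CDC through APPLY CHANGES (dlt.apply_changes in Python) and can read the same Kafka topics as streaming sources within one pipeline. Below is a minimal sketch of what such a pipeline might look like; the topic, table, key, and schema names are placeholders, and whether this holds up at 3,000 to 30,000 events per second is exactly what we would want to benchmark.

import dlt
from pyspark.sql import functions as F

cdc_schema = "patient_id STRING, op STRING, event_ts TIMESTAMP"  # placeholder DDL schema

@dlt.view
def patient_cdc_events():
    # Streaming read of the CDC topic; payload parsing is simplified here.
    return (spark.readStream.format("kafka")
            .option("kafka.bootstrap.servers", "<brokers>")
            .option("subscribe", "patient_cdc")
            .load()
            .select(F.from_json(F.col("value").cast("string"), cdc_schema).alias("e"))
            .select("e.*"))

dlt.create_streaming_table("patients_silver")

dlt.apply_changes(
    target="patients_silver",
    source="patient_cdc_events",
    keys=["patient_id"],
    sequence_by=F.col("event_ts"),
    apply_as_deletes=F.expr("op = 'DELETE'"),
    stored_as_scd_type=1,
)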

In summary:

  • Which option would you recommend for our case and why?

  • If you recommend DLT, can you help us validate the assumption that DLT is more cost-effective than DBR, particularly in streaming workloads with idle windows?

  • Any benchmarking guidelines, usage calculators, or example cost comparisons would be appreciated (an illustrative pipeline-settings sketch follows).
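
On the cost question, the main DLT lever we are aware of is enhanced autoscaling in the pipeline settings, sketched below as a Python dict mirroring the pipeline JSON (the pipeline name and worker counts are placeholders). Our plan would be to compare DBU consumption of such a pipeline against the current DBR job over a representative day that includes idle windows; pointers to better benchmarking approaches are welcome.

# Illustrative DLT pipeline settings (pipeline JSON shown as a Python dict); values are placeholders.
pipeline_settings = {
    "name": "patients_cdc_dlt",
    "continuous": True,  # or False with scheduled triggered runs during quiet periods
    "clusters": [
        {
            "label": "default",
            "autoscale": {"min_workers": 1, "max_workers": 8, "mode": "ENHANCED"},
        }
    ],
}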

 

