06-28-2023 07:21 AM
I'm interested in learning more about Change Data Capture (CDC) approaches with Databricks. Can anyone provide insights on the best practices and recommendations for utilizing CDC effectively in Databricks? Are there any specific connectors or tools available in Databricks to facilitate CDC implementation?
06-28-2023 12:11 PM
Certainly! Change Data Capture (CDC) is an important capability when it comes to efficiently processing and analyzing real-time data in Databricks. Here are some insights and recommendations on the best practices for utilizing CDC effectively in Databricks, along with specific connectors and tools available:
1. Best Practices for CDC Implementation in Databricks:
- Identify the appropriate CDC approach based on your use case: Databricks supports both log-based CDC and trigger-based CDC. Log-based CDC relies on reading database transaction logs, while trigger-based CDC involves capturing changes using database triggers. Choose the approach that aligns with your requirements and database capabilities.
- Leverage Databricks Delta Lake: Delta Lake is an optimized data storage format in Databricks that supports ACID transactions. By storing your CDC data in Delta Lake, you can ensure data consistency and reliability, enabling easy processing and analysis.
- Utilize Structured Streaming: Spark Structured Streaming in Databricks lets you consume CDC data as a streaming source. By leveraging Structured Streaming, you can process and analyze incoming CDC data in near real time, enabling timely insights and actions.
2. Connectors and Tools for CDC in Databricks:
- Debezium Connector: Debezium is an open-source CDC platform whose connectors support a wide range of databases. It captures database changes and streams them (typically through Kafka) into Databricks, which simplifies the implementation of CDC pipelines and enables efficient change data ingestion.
- Delta Change Data Feed: Delta Lake has a built-in Change Data Feed capability that lets you efficiently capture and process row-level changes in Delta tables. With the change feed enabled, you can subscribe to the changes made to a Delta table and consume them in your Databricks workflows (a minimal sketch follows this list).
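For illustration, here is a minimal sketch of that second approach, assuming an existing Delta table; the catalog, schema, table names, and checkpoint path are placeholders:

```python
# Minimal sketch (names and paths are placeholders): enable the Change Data Feed
# on a Delta table and consume its row-level changes with Structured Streaming.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable the Change Data Feed on an existing Delta table.
spark.sql("""
    ALTER TABLE main.sales.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the change feed as a stream. Each row carries _change_type
# (insert / update_preimage / update_postimage / delete), plus
# _commit_version and _commit_timestamp metadata columns.
# By default the stream starts from the table's current version;
# set the startingVersion option to backfill older changes.
changes = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("main.sales.orders")
)

# Persist the change events to a downstream Delta table for further processing.
(
    changes.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders_cdf")  # placeholder
    .toTable("main.sales.orders_changes")
)
```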
By following these best practices and leveraging the available connectors and tools, you can effectively implement CDC in Databricks and unlock the full potential of real-time data processing and analysis.
I hope this information helps you get started with CDC in Databricks. If you have any further questions, feel free to ask!
10-18-2023 06:40 AM
@nicolamonaca would you mind providing more info regarding this Debezium connector for Databricks? I cannot seem to find relevant resources for that. Thank you
I'm planning to use Debezium > Kafka, and then read from the Kafka stream in Spark > DLT.
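If it helps, here is a rough sketch of that last hop (Kafka into a DLT bronze table); the broker address, topic name, and table name are placeholders, and authentication options are omitted:

```python
# Minimal sketch (broker, topic, and table names are placeholders): a DLT bronze
# table that ingests raw Debezium change events from a Kafka topic.
import dlt
from pyspark.sql.functions import col

@dlt.table(
    name="bronze_debezium_events",
    comment="Raw Debezium CDC events ingested from Kafka."
)
def bronze_debezium_events():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
        .option("subscribe", "dbserver1.public.orders")     # placeholder Debezium topic
        .option("startingOffsets", "earliest")
        .load()
        # Kafka delivers key/value as binary; keep the payload as JSON text in bronze.
        .select(
            col("key").cast("string").alias("key"),
            col("value").cast("string").alias("value"),
            col("topic"),
            col("timestamp"),
        )
    )
```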
12-26-2023 05:50 AM - edited 12-26-2023 05:53 AM
Hi, first of all, thank you all in advance! I am very interested in this topic!
My question goes beyond what is described here. Like @Pektas, I am using Debezium to send data from Postgres to a Kafka topic (in fact, Azure Event Hubs). My question is: what are the best practices and recommendations for saving raw data and then implementing a medallion architecture?
I am using Unity Catalog, but I am thinking about different implementations:
- Use a table or a volume for raw data (if it is a table, it would contain data from all tables in a database)
- Use a standard workflow or a DLT pipeline?
- Use DLT or not?
For clarification, I want to store raw data as Parquet files and then read them with the cloudFiles (Auto Loader) format to build the CDC and bronze tables using DLT. I think this approach is good because if I ever need to reprocess the raw data (say, because its schema changed), I feel safe knowing the source of truth is stored in an object store. Am I right? Something like the sketch below is what I have in mind.
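A rough sketch of the cloudFiles ingestion I mean (storage path and table name are made up):

```python
# Minimal sketch (path and table name are placeholders): load the raw Parquet
# files from object storage incrementally with Auto Loader into a DLT bronze table.
import dlt

RAW_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/postgres/"  # placeholder

@dlt.table(
    name="bronze_postgres_raw",
    comment="Raw CDC events loaded incrementally from object storage."
)
def bronze_postgres_raw():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        # In a DLT pipeline the schema location is managed automatically.
        .load(RAW_PATH)
    )
```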
Thank you!
06-18-2025 05:08 AM
Hi @jcozar, were you able to figure out the best practices? We are also looking for the same solution.
06-18-2025 07:17 AM
06-18-2025 11:37 AM
Hi @jcozar ,
Thank you so much for your response! I have a few queries; it would be really helpful if you could share your thoughts.
How are you segregating the tables from raw to bronze? Suppose Debezium is capturing CDC events from 100 tables, all changes are streamed to Event Hubs, and capture is enabled on the Event Hub. If we use Auto Loader (Structured Streaming), the DataFrame will contain events for all the tables. Are you filtering out each table and then writing it to bronze individually? If so, do you have a sense of how scalable that solution is? We are looking to implement this for 500+ tables per Event Hub within an Event Hubs namespace.
Also, how are you handling schema evolution from the source to Event Hubs?
06-19-2025 12:30 AM
Hi @Deekay,
I'm glad to hear that! Regarding your question, it is as you say: Debezium is capturing CDC events from XXX tables. What I do is use a custom Spark Structured Streaming job to read from Event Hubs and save a Delta table partitioned by date and table name. So there is a single raw table per database, in Delta format but partitioned.
Then I create a bronze table per database and table from that raw table, which is efficient because the raw table is partitioned. The disadvantage is that the raw data is more "fragmented" due to the partitioning.
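Roughly, the pattern looks like the sketch below; connection settings, paths, and table names are placeholders (and authentication options are omitted), not my actual code:

```python
# Minimal sketch (names are placeholders): land all Debezium events from Event Hubs
# into one raw Delta table per database, partitioned by table name and date, then
# build one bronze table per source table by reading only its partition.
from pyspark.sql.functions import col, to_date

# 1) One streaming job per database: read from the Event Hubs Kafka-compatible
#    endpoint and persist every event into a single partitioned raw table.
raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "mynamespace.servicebus.windows.net:9093")  # placeholder
    .option("subscribe", "postgres-cdc")                                           # placeholder
    .load()
    .select(
        col("key").cast("string").alias("key"),
        col("value").cast("string").alias("payload"),
        col("topic").alias("source_table"),           # Debezium uses one topic per table
        to_date(col("timestamp")).alias("event_date"),
    )
)

(
    raw_events.writeStream
    .format("delta")
    .partitionBy("source_table", "event_date")
    .option("checkpointLocation", "/mnt/checkpoints/raw_postgres")  # placeholder
    .toTable("raw.postgres_events")
)

# 2) One bronze stream per source table; the partition filter keeps the scan cheap.
bronze_orders = (
    spark.readStream
    .table("raw.postgres_events")
    .where(col("source_table") == "dbserver1.public.orders")  # placeholder topic name
)
```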
06-19-2025 12:32 AM
Regarding schema evolution, I implemented some protocols at the database level: columns cannot be modified, only added. If I need to change a column, I create a new one and run a migration. If I need to delete a column, I just leave it in place and stop using it. I am not sure this is the best solution, but it is the easiest to implement. Do you have other ideas in mind?
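If it helps, an additive-only policy like that maps well onto Delta's schema merging; a minimal sketch with made-up names:

```python
# Minimal sketch (names are placeholders): with an additive-only policy, new
# columns arriving from the source are merged into the bronze table automatically,
# while column modifications and drops are prevented upstream instead.
changes_df = spark.readStream.table("raw.postgres_events")  # placeholder source

(
    changes_df.writeStream
    .format("delta")
    .option("mergeSchema", "true")                                   # allow new columns only
    .option("checkpointLocation", "/mnt/checkpoints/bronze_orders")  # placeholder
    .toTable("bronze.orders")                                        # placeholder target
)
```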