06-28-2023 07:21 AM
I'm interested in learning more about Change Data Capture (CDC) approaches with Databricks. Can anyone provide insights on the best practices and recommendations for utilizing CDC effectively in Databricks? Are there any specific connectors or tools available in Databricks to facilitate CDC implementation?
06-28-2023 12:11 PM
Certainly! Change Data Capture (CDC) is an important capability when it comes to efficiently processing and analyzing real-time data in Databricks. Here are some insights and recommendations on the best practices for utilizing CDC effectively in Databricks, along with specific connectors and tools available:
1. Best Practices for CDC Implementation in Databricks:
- Identify the appropriate CDC approach based on your use case: Databricks supports both log-based CDC and trigger-based CDC. Log-based CDC relies on reading database transaction logs, while trigger-based CDC involves capturing changes using database triggers. Choose the approach that aligns with your requirements and database capabilities.
- Leverage Databricks Delta Lake: Delta Lake is an optimized data storage format in Databricks that supports ACID transactions. By storing your CDC data in Delta Lake, you can ensure data consistency and reliability, enabling easy processing and analysis.
- Utilize Structured Streaming: Spark Structured Streaming in Databricks lets you consume CDC data as a streaming source. By leveraging Structured Streaming, you can process and analyze incoming CDC data in near real time, enabling timely insights and actions.
2. Connectors and Tools for CDC in Databricks:
- Debezium Connector: Debezium is an open-source CDC platform whose connectors support a wide range of databases. It captures database changes and streams them (typically through Kafka) into Databricks, which simplifies the implementation of CDC pipelines and enables efficient change data ingestion.
- Delta Change Data Feed: Delta Lake has a built-in Change Data Feed capability that lets you efficiently capture and process row-level changes in Delta tables. With the change feed enabled, you can subscribe to the changes made to a Delta table and consume them in your Databricks workflows (a minimal sketch follows this list).
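For illustration, here is a minimal sketch of that second approach, assuming an existing Delta table; the catalog, schema, table names, and checkpoint path are placeholders:

```python
# Minimal sketch (names and paths are placeholders): enable the Change Data Feed
# on a Delta table and consume its row-level changes with Structured Streaming.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable the Change Data Feed on an existing Delta table.
spark.sql("""
    ALTER TABLE main.sales.orders
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")

# Read the change feed as a stream. Each row carries _change_type
# (insert / update_preimage / update_postimage / delete), plus
# _commit_version and _commit_timestamp metadata columns.
# By default the stream starts from the table's current version;
# set the startingVersion option to backfill older changes.
changes = (
    spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .table("main.sales.orders")
)

# Persist the change events to a downstream Delta table for further processing.
(
    changes.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/orders_cdf")  # placeholder
    .toTable("main.sales.orders_changes")
)
```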
By following these best practices and leveraging the available connectors and tools, you can effectively implement CDC in Databricks and unlock the full potential of real-time data processing and analysis.
I hope this information helps you get started with CDC in Databricks. If you have any further questions, feel free to ask!
10-18-2023 06:40 AM
@nicolamonaca would you mind providing more info regarding this Debezium connector for Databricks? I cannot seem to find relevant resources for that. Thank you
I'm planning to use Debezium > Kafka, and then read from the Kafka stream in Spark > DLT.
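If it helps, here is a rough sketch of that last hop (Kafka into a DLT bronze table); the broker address, topic name, and table name are placeholders, and authentication options are omitted:

```python
# Minimal sketch (broker, topic, and table names are placeholders): a DLT bronze
# table that ingests raw Debezium change events from a Kafka topic.
import dlt
from pyspark.sql.functions import col

@dlt.table(
    name="bronze_debezium_events",
    comment="Raw Debezium CDC events ingested from Kafka."
)
def bronze_debezium_events():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")   # placeholder
        .option("subscribe", "dbserver1.public.orders")     # placeholder Debezium topic
        .option("startingOffsets", "earliest")
        .load()
        # Kafka delivers key/value as binary; keep the payload as JSON text in bronze.
        .select(
            col("key").cast("string").alias("key"),
            col("value").cast("string").alias("value"),
            col("topic"),
            col("timestamp"),
        )
    )
```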
12-26-2023 05:50 AM - edited 12-26-2023 05:53 AM
Hi, first of all, thank you all in advance! I am very interested in this topic!
My question goes beyond what is described here. Like @Pektas, I am using Debezium to send data from Postgres to a Kafka topic (in fact, Azure Event Hubs). My question is: what are the best practices and recommendations for saving raw data and then implementing a medallion architecture?
I am using Unity Catalog, but I am thinking about different implementations:
- Use a table or a volume for raw data (if it is a table, it would contain data from all tables in a database)
- Use a standard workflow or a DLT pipeline?
- Use DLT or not?
For clarification, I want to store raw data as Parquet files and then read them with the cloudFiles (Auto Loader) format to build the CDC and bronze tables using DLT. I think this approach is good because if I ever need to reprocess the raw data (say, because its schema changed), I feel safe knowing the source of truth is stored in an object store. Am I right? Something like the sketch below is what I have in mind.
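A rough sketch of the cloudFiles ingestion I mean (storage path and table name are made up):

```python
# Minimal sketch (path and table name are placeholders): load the raw Parquet
# files from object storage incrementally with Auto Loader into a DLT bronze table.
import dlt

RAW_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/postgres/"  # placeholder

@dlt.table(
    name="bronze_postgres_raw",
    comment="Raw CDC events loaded incrementally from object storage."
)
def bronze_postgres_raw():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        # In a DLT pipeline the schema location is managed automatically.
        .load(RAW_PATH)
    )
```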
Thank you!
06-18-2025 05:08 AM
Hi @jcozar, were you able to figure out the best practices? We are also looking for the same solution.
06-18-2025 07:17 AM
06-18-2025 11:37 AM
Hi @jcozar ,
Thank you so much for your response! I have a few queries; it would be really helpful if you could share your thoughts.
How are you segregating the tables from raw to bronze? Suppose Debezium is capturing CDC events from 100 tables, all changes are streamed to Event Hubs, and capture is enabled on the Event Hub. If we use Auto Loader (Structured Streaming), the DataFrame will contain events for all the tables. Are you filtering out each table and then writing it to bronze individually? If so, do you have a sense of how scalable that solution is? We are looking to implement this for 500+ tables per Event Hub within an Event Hubs namespace.
Also, how are you handling schema evolution from the source to Event Hubs?
06-19-2025 12:30 AM
Hi @Deekay,
I'm glad to hear that! Regarding your question, it is as you say: Debezium is capturing CDC events from XXX tables. What I do is use a custom Spark Structured Streaming job to read from Event Hubs and save a Delta table partitioned by date and table name. So there is a single raw table per database, in Delta format but partitioned.
Then I create a bronze table per database and table from that raw table, which is efficient because the raw table is partitioned. The disadvantage is that the raw data is more "fragmented" due to the partitioning.
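Roughly, the pattern looks like the sketch below; connection settings, paths, and table names are placeholders (and authentication options are omitted), not my actual code:

```python
# Minimal sketch (names are placeholders): land all Debezium events from Event Hubs
# into one raw Delta table per database, partitioned by table name and date, then
# build one bronze table per source table by reading only its partition.
from pyspark.sql.functions import col, to_date

# 1) One streaming job per database: read from the Event Hubs Kafka-compatible
#    endpoint and persist every event into a single partitioned raw table.
raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "mynamespace.servicebus.windows.net:9093")  # placeholder
    .option("subscribe", "postgres-cdc")                                           # placeholder
    .load()
    .select(
        col("key").cast("string").alias("key"),
        col("value").cast("string").alias("payload"),
        col("topic").alias("source_table"),           # Debezium uses one topic per table
        to_date(col("timestamp")).alias("event_date"),
    )
)

(
    raw_events.writeStream
    .format("delta")
    .partitionBy("source_table", "event_date")
    .option("checkpointLocation", "/mnt/checkpoints/raw_postgres")  # placeholder
    .toTable("raw.postgres_events")
)

# 2) One bronze stream per source table; the partition filter keeps the scan cheap.
bronze_orders = (
    spark.readStream
    .table("raw.postgres_events")
    .where(col("source_table") == "dbserver1.public.orders")  # placeholder topic name
)
```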
06-19-2025 12:32 AM
Regarding schema evolution, I implemented some protocols at the database level: columns cannot be modified, only added. If I need to change a column, I create a new one and run a migration. If I need to delete a column, I just leave it in place and stop using it. I am not sure this is the best solution, but it is the easiest to implement. Do you have other ideas in mind?
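If it helps, an additive-only policy like that maps well onto Delta's schema merging; a minimal sketch with made-up names:

```python
# Minimal sketch (names are placeholders): with an additive-only policy, new
# columns arriving from the source are merged into the bronze table automatically,
# while column modifications and drops are prevented upstream instead.
changes_df = spark.readStream.table("raw.postgres_events")  # placeholder source

(
    changes_df.writeStream
    .format("delta")
    .option("mergeSchema", "true")                                   # allow new columns only
    .option("checkpointLocation", "/mnt/checkpoints/bronze_orders")  # placeholder
    .toTable("bronze.orders")                                        # placeholder target
)
```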