Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Databricks approaches to CDC

lorenz
New Contributor III

I'm interested in learning more about Change Data Capture (CDC) approaches with Databricks. Can anyone provide insights on the best practices and recommendations for utilizing CDC effectively in Databricks? Are there any specific connectors or tools available in Databricks to facilitate CDC implementation?

1 ACCEPTED SOLUTION

Accepted Solutions

nicolamonaca
New Contributor III

Certainly! Change Data Capture (CDC) is an important capability when it comes to efficiently processing and analyzing real-time data in Databricks. Here are some insights and recommendations on the best practices for utilizing CDC effectively in Databricks, along with specific connectors and tools available:

1. Best Practices for CDC Implementation in Databricks:
- Identify the appropriate CDC approach based on your use case: Databricks supports both log-based CDC and trigger-based CDC. Log-based CDC relies on reading database transaction logs, while trigger-based CDC involves capturing changes using database triggers. Choose the approach that aligns with your requirements and database capabilities.
- Leverage Delta Lake: Delta Lake is the open-source storage layer behind Databricks tables, and it supports ACID transactions. By landing your CDC data in Delta tables, you ensure data consistency and reliability, enabling easy processing and analysis.
- Utilize Structured Streaming: Spark's Structured Streaming API in Databricks can consume CDC data as a streaming source. By leveraging Structured Streaming, you can process and analyze incoming CDC data in near real time, enabling timely insights and actions.
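As a rough illustration of what CDC processing does, here is the core idea in plain Python rather than PySpark, so the semantics are easy to see. In Databricks this logic would typically live in a MERGE INTO inside Structured Streaming's foreachBatch; the event shape (op/key/row) is a simplifying assumption, not any specific connector's format:

```python
# Minimal model of applying an ordered CDC event stream to a keyed table.
# In Databricks this would be a MERGE INTO per micro-batch; here the same
# upsert/delete semantics are shown with a plain dict.

def apply_cdc_events(table: dict, events: list[dict]) -> dict:
    """Apply insert/update/delete events in order; last write wins."""
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("insert", "update"):
            table[key] = ev["row"]      # upsert the latest row image
        elif op == "delete":
            table.pop(key, None)        # tolerate deletes for absent keys
        else:
            raise ValueError(f"unknown op: {op}")
    return table

events = [
    {"op": "insert", "key": 1, "row": {"name": "alice", "v": 1}},
    {"op": "update", "key": 1, "row": {"name": "alice", "v": 2}},
    {"op": "insert", "key": 2, "row": {"name": "bob", "v": 1}},
    {"op": "delete", "key": 2, "row": None},
]
state = apply_cdc_events({}, events)
```

The same "last write wins, keyed by primary key" behavior is what MERGE INTO (or DLT's apply_changes) gives you at scale, with ordering handled by a sequence column.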

2. Connectors and Tools for CDC in Databricks:
- Debezium: Debezium is an open-source CDC platform that supports a wide range of databases. Note that it is not a Databricks-provided connector: it typically runs on Kafka Connect, reads database transaction logs, and publishes change events to Kafka (or a Kafka-compatible service), from which Databricks can ingest them with Structured Streaming's Kafka source. This pattern simplifies building CDC pipelines and enables efficient change-data ingestion.
- Delta Lake Change Data Feed (CDF): Delta Lake has built-in CDC capabilities via the Change Data Feed. Once it is enabled on a table (the delta.enableChangeDataFeed table property), you can read the row-level changes made to that table, in batch via table_changes() or incrementally as a stream, and consume them in your Databricks workflows.
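For the Change Data Feed piece, there are really only two moving parts. A sketch of the SQL you would run in Databricks, expressed here as Python strings so the statements are easy to lift into a notebook (my_table and the version range are placeholders):

```python
# Sketch: the two moving parts of Delta Lake's Change Data Feed (CDF).
# "my_table" and the version range 0..5 are placeholders.

# 1. Enable CDF on the table (can also be set at CREATE TABLE time).
enable_cdf = (
    "ALTER TABLE my_table "
    "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)"
)

# 2. Read changes between two table versions. Each returned row carries
# metadata columns: _change_type (insert / update_preimage /
# update_postimage / delete), _commit_version, _commit_timestamp.
read_changes = "SELECT * FROM table_changes('my_table', 0, 5)"

# In a notebook you would run these with spark.sql(enable_cdf), etc.
print(enable_cdf)
print(read_changes)
```

For streaming consumption, the equivalent is spark.readStream with the readChangeFeed option set to true on the Delta source.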

By following these best practices and leveraging the available connectors and tools, you can effectively implement CDC in Databricks and unlock the full potential of real-time data processing and analysis.

I hope this information helps you get started with CDC in Databricks. If you have any further questions, feel free to ask!


3 REPLIES 3


Pektas
New Contributor II

@nicolamonaca would you mind providing more info on this Debezium connector for Databricks? I can't seem to find any relevant resources for it. Thank you!

I'm planning to use Debezium > Kafka, and then read from the Kafka stream in Spark > DLT.
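For reference, the envelope I'd be parsing off that Kafka topic has Debezium's well-known shape: an "op" code ('c' create, 'u' update, 'd' delete, 'r' snapshot read) plus "before"/"after" row images. A plain-Python sketch of flattening it (assuming the JSON converter with a top-level "payload" key; in Spark the same thing is done with from_json and a matching schema):

```python
import json

# Sketch: flattening a Debezium change-event envelope from Kafka.
# Assumes the JSON converter layout with a top-level "payload" key;
# with schemas disabled, the payload fields appear at the top level.

def flatten_debezium(raw: bytes) -> dict:
    payload = json.loads(raw)["payload"]
    op = payload["op"]
    # For deletes the "after" image is null; keep "before" so the
    # downstream MERGE knows which key to remove.
    row = payload["after"] if op != "d" else payload["before"]
    return {"op": op, "row": row, "ts_ms": payload["ts_ms"]}

# Example update event (field names/values are illustrative).
msg = json.dumps({
    "payload": {
        "op": "u",
        "before": {"id": 7, "email": "old@example.com"},
        "after": {"id": 7, "email": "new@example.com"},
        "ts_ms": 1700000000000,
    }
}).encode()

event = flatten_debezium(msg)
```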

jcozar
Contributor

Hi, first of all, thank you all in advance! I am very interested in this topic!

My question goes beyond what is described here. Like @Pektas, I am using Debezium to send data from Postgres to a Kafka topic (in fact, Azure Event Hubs). My question is: what are the best practices and recommendations for saving raw data and then implementing a medallion architecture?

I am using Unity Catalog, but I am thinking about different implementations:

- Use a table or a volume for raw data (if it is a table, it would contain data from all tables in a database)

- Use a standard workflow or a DLT pipeline?

- Use a DLT or not?

For clarification, I want to store raw data as Parquet files and then ingest them with the cloudFiles (Auto Loader) format for CDC and bronze tables using DLT. I think this approach is good because if I need to reprocess the raw data (say, because the raw data schema changed), I feel safe knowing the source of truth is stored in an object store. Am I right?
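Something like this is the shape I have in mind, as a non-runnable sketch (it only executes inside a Databricks DLT pipeline, where the dlt module and the spark session come from the runtime; paths, table names, keys, and the sequence column are placeholders):

```python
# Sketch only: raw Parquet files -> bronze via Auto Loader, then CDC
# into silver with apply_changes. Runs only inside a DLT pipeline.
import dlt
from pyspark.sql.functions import col

@dlt.table(name="bronze_events", comment="Raw CDC events from object storage")
def bronze_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .load("/Volumes/raw/cdc/my_source_table/")  # placeholder path
    )

dlt.create_streaming_table("silver_my_table")

dlt.apply_changes(
    target="silver_my_table",
    source="bronze_events",
    keys=["id"],                        # primary key of the source table
    sequence_by=col("ts_ms"),           # ordering column from the CDC feed
    apply_as_deletes=col("op") == "d",  # Debezium-style delete marker
    except_column_list=["op", "ts_ms"],
)
```

One point in favor of this layout: keeping the raw Parquet in object storage does give you a replayable source of truth, so a full refresh of the pipeline can rebuild bronze and silver from scratch.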

Thank you!
