cancel
Showing results for 
Search instead for 
Did you mean: 
Data Engineering
cancel
Showing results for 
Search instead for 
Did you mean: 

CDC DLT

Stellar
New Contributor II

Hi all,

I would appreciate some clarity regarding the DLT and CDC. So my first question would be, when it comes to the "source" table in the synta, is that CDC table or? Further, if we want to use only databricks, would mounting foreign catalog be a good thing? Also, what would be some of the solutions you guys implemented to initially pull in the data? I need to pull in certain portion of the data not the whole table and then continue with DLT CDC.

Also, if my CDC table is in database(SQL Server), what would be the preferred method to connect to that database and read cdc table from Databricks?

Thanks!

#CDC #Databricks

2 REPLIES 2

Kaniz
Community Manager
Community Manager

Hi @StellarLet’s dive into your questions about Delta Live Tables (DLT) and Change Data Capture (CDC).

  1. CDC Implementation with Delta Live Tables (DLT):

    • DLT simplifies CDC using the APPLY CHANGES API. Previously, the commonly used method was the MERGE INTO statement, but it could produce incorrect results due to out-of-sequence records or require complex logic for reordering records.
    • The APPLY CHANGES API automatically handles out-of-sequence records, ensuring correct processing of CDC records without the need for complex logic.
    • It supports both SCD type 1 (direct record updates without history retention) and SCD type 2 (history retention for records) updates.
    • To perform CDC with DLT:
  2. Source Table in the Syntax:

    • The “source” table in your syntax refers to the source data from which you want to capture changes. It could be a CDC table or any other data source.
  3. Mounting Foreign Catalog in Databricks:

    • If you want to use only Databricks, mounting a foreign catalog (such as an external database) can be beneficial.
    • Mounting allows you to access external data seamlessly within Databricks, making it easier to read and process data from different sources.
  4. Initial Data Pull Solutions:

    • To pull in a specific portion of data (not the entire table), consider these solutions:
      • Incremental Load: Pull only the changed or new records since the last load.
      • Partition Pruning: Leverage partitioning to read only relevant partitions.
      • Filtering: Apply filters during data extraction to limit the data pulled.
  5. Connecting to SQL Server CDC Table from Databricks:

    • To connect to your SQL Server CDC table from Databricks:
      • Use the JDBC connector to establish a connection.
      • Specify the necessary connection details (server, database, credentials).
      • Read the CDC table using standard SQL queries or Databricks APIs.
 

Stellar
New Contributor II

@Kaniz Thanks for the explanations! What I am not sure about is the following. I create streaming table within the apply changes syntax. However I need to have a table already with the data. Does this mean that in the apply changes, create streaming table syntax is just refrencing existing table as delta live table or? Further, do we always load the whole cdc table? Sorry if my questions are not good but I am trying to figure this out. Also just to confirm my understanding these keys are not from the cdc table but rather from the table we wish to apply changes to because we need to connect two records together. 

Welcome to Databricks Community: Lets learn, network and celebrate together

Join our fast-growing data practitioner and expert community of 80K+ members, ready to discover, help and collaborate together while making meaningful connections. 

Click here to register and join today! 

Engage in exciting technical discussions, join a group with your peers and meet our Featured Members.