Data Engineering

cdc type 2 - not closing records

MariuszJ
New Contributor II

Hi,
I wouldn't mind some advice. I have a DLT CDC SCD Type 2 pipeline, defined as:

 

import dlt
from pyspark.sql.functions import col

dlt.create_streaming_table('my_table_dlt_cdc')
dlt.apply_changes(
  target = 'my_table_dlt_cdc',
  source = 'source',
  keys = ['id'],
  sequence_by = col('snapshot_date'),
  # ignore_null_updates = True,
  except_column_list = ['snapshot_date'],
  stored_as_scd_type = "2"
)

 

The table itself is very simple:

 

id | val   | snapshot_date
1  | 'abc' | 2023-01-01

 

Key '1' stopped existing in the source table after March, but CDC never closed the row. Data for the whole of 2023 has been loaded (up to today).

When querying __apply_changes_storage_my_table_dlt_cdc I see the last entry in March but nothing after; the row was not closed.

 

 

id,value,__recordStartAt,__START_AT,__END_AT,__rowIsHidden
1,'ABC',2023-03-17,2019-07-05,null,true
1,'ABC',2023-03-24,2019-07-05,null,true
1,'ABC',2023-03-31,2019-07-05,null,false

 

I have a feeling I'm missing something obvious... but what?

2 REPLIES

Kaniz
Community Manager

Hi @MariuszJ, Based on the provided information, your CDC operation isn't closing the row for id '1' after March because it isn't detecting a deletion event for that id.

In CDC operations with Databricks, you specify a sequence column that is a monotonically increasing representation of the proper ordering of the source data. For SCD Type 2, Databricks propagates the appropriate sequencing values to the __START_AT and __END_AT columns of the target table. If a row is deleted in the source data, it should be marked as deleted in the CDC operation, and the __END_AT column should be populated with the sequence value of the deletion event.

In your case, however, the deletion event for id '1' is not being detected or processed, so the row is never closed and the __END_AT column remains null. Check your source data to confirm that the deletion event for id '1' is correctly represented, and check your CDC operation to ensure it handles deletion events.

Here are the relevant sources used for this response:
- [How is CDC implemented with DLT?](https://docs.databricks.com/delta-live-tables/cdc.html)
- [What data objects are used for DLT CDC processing?](https://docs.databricks.com/delta-live-tables/cdc.html)
- [Limitations of Databricks CDC](https://docs.databricks.com/delta-live-tables/cdc.html)
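One wrinkle with a periodic full-snapshot source like the one above: a key that disappears produces no row at all, so apply_changes has no delete event to react to. A common pattern is to diff consecutive snapshots upstream and emit explicit DELETE rows that an apply_as_deletes condition can then match. A minimal pure-Python sketch of that diff step (the function and the 'operation' column are hypothetical, not part of the DLT API):

```python
def synthesize_deletes(prev_snapshot, curr_snapshot, snapshot_date):
    """Emit synthetic DELETE rows for keys that were present in the
    previous snapshot but are missing from the current one.

    Each snapshot is a list of dicts with at least an 'id' key."""
    prev_keys = {row["id"] for row in prev_snapshot}
    curr_keys = {row["id"] for row in curr_snapshot}
    return [
        {"id": k, "operation": "DELETE", "snapshot_date": snapshot_date}
        for k in sorted(prev_keys - curr_keys)
    ]

# Key 1 exists in the March snapshot but not in April's,
# so a DELETE row is emitted for it.
march = [{"id": 1, "val": "abc"}, {"id": 2, "val": "xyz"}]
april = [{"id": 2, "val": "xyz"}]
deletes = synthesize_deletes(march, april, "2023-04-07")
```

Unioning such synthetic rows into the source feed gives apply_changes an explicit event to close the record on.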

MariuszJ
New Contributor II

For anyone who sees this post in the future: I was missing one argument,

apply_as_deletes
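For reference, a sketch of the corrected pipeline definition. It assumes the source rows carry an 'operation' column marking deletes; that column name is an assumption, so substitute whatever marker your change feed actually provides. (This is a declarative DLT pipeline fragment and only runs inside a Databricks pipeline.)

```python
import dlt
from pyspark.sql.functions import col, expr

dlt.create_streaming_table('my_table_dlt_cdc')
dlt.apply_changes(
  target = 'my_table_dlt_cdc',
  source = 'source',
  keys = ['id'],
  sequence_by = col('snapshot_date'),
  # Rows matching this condition close the record instead of upserting it.
  # 'operation' is an assumed marker column in the source feed.
  apply_as_deletes = expr("operation = 'DELETE'"),
  except_column_list = ['snapshot_date', 'operation'],
  stored_as_scd_type = "2"
)
```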

 
