Replay (backfill) DLT CDC using Kafka

532664

Hello,

We receive database CDC binlogs through Kafka and synchronize tables in our OLAP system using the apply_changes function in Delta Live Tables (DLT). A month ago, a column was added to our table, but due to a type mismatch, its values are being stored incorrectly as nulls. (We manage our schema statically.)

In our DLT pipeline, the bronze stage holds the raw binlog data, so no changes are needed there. However, we need to fix the column type in the silver and gold stages. The problem is that selective refresh isn't supported in continuous mode, so we are limited to a full refresh. Because Kafka's retention period is two weeks, a full refresh could lose existing data that is older than the retention window.

What would be the best course of action in this situation?

Because it takes a long time to rebuild a snapshot from the CDC binlogs and Kafka, I'm thinking of storing the snapshot data in S3, merging it into the DLT table, and then continuing the CDC with DLT. However, I don't know whether it is safe to manually merge into a DLT table.
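
To make the idea concrete, here is a rough sketch in plain Python (not DLT code) of the behaviour I am hoping for: seed the target from a snapshot taken at a known position, then replay CDC events and skip anything that is not newer than what the target already holds. The names `ChangeEvent`, `seed_from_snapshot`, `apply_cdc_events`, and `live_rows`, the `id` key column, and the integer `seq` field (standing in for a binlog position) are all hypothetical and only there for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple

# Target state: key -> (sequence of last applied change, row or None for a delete)
State = Dict[int, Tuple[int, Optional[dict]]]


@dataclass(frozen=True)
class ChangeEvent:
    key: int
    seq: int                    # assumed monotonically increasing, e.g. a binlog position
    op: str                     # "upsert" or "delete"
    row: Optional[dict] = None  # full row image for upserts


def seed_from_snapshot(snapshot_rows: List[dict], snapshot_seq: int) -> State:
    """Build the initial state from a snapshot taken at snapshot_seq."""
    return {r["id"]: (snapshot_seq, dict(r)) for r in snapshot_rows}


def apply_cdc_events(state: State, events: Iterable[ChangeEvent]) -> State:
    """Apply events in sequence order, ignoring ones the state already covers."""
    for ev in sorted(events, key=lambda e: e.seq):
        current = state.get(ev.key)
        if current is not None and current[0] >= ev.seq:
            continue  # stale event: the snapshot or a later change already covers it
        if ev.op == "delete":
            # Keep a tombstone so an older upsert cannot resurrect the row.
            state[ev.key] = (ev.seq, None)
        else:
            state[ev.key] = (ev.seq, dict(ev.row))
    return state


def live_rows(state: State) -> Dict[int, dict]:
    """Return only rows that have not been deleted."""
    return {k: row for k, (_, row) in state.items() if row is not None}


# Snapshot stored in S3, taken at position 100 (made-up example data).
state = seed_from_snapshot(
    [{"id": 1, "amount": "10.5"}, {"id": 2, "amount": "7.0"}],
    snapshot_seq=100,
)

# Events still available in Kafka, overlapping the snapshot.
events = [
    ChangeEvent(key=1, seq=90, op="upsert", row={"id": 1, "amount": "9.0"}),
    ChangeEvent(key=2, seq=120, op="delete"),
    ChangeEvent(key=3, seq=130, op="upsert", row={"id": 3, "amount": "3.2"}),
    ChangeEvent(key=1, seq=110, op="upsert", row={"id": 1, "amount": "11.0"}),
]

state = apply_cdc_events(state, events)
print(live_rows(state))
```

In this sketch the event at position 90 is ignored because the snapshot is newer, so overlap between the snapshot and the Kafka stream does no harm. What I can't tell is whether a manual merge into the DLT target preserves this kind of ordering guarantee.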

If anyone has had a similar experience, I would appreciate your help.