CDC and raw data

jcozar
Contributor

Hi, 

I am using Debezium Server to send data from Postgres to a Kafka topic (in fact, Azure Event Hubs). My question is: what are the best practices and recommendations for saving the raw data and then implementing a medallion architecture on top of it?

To clarify, I want to store the raw data in Delta format and then read it with the cloudFiles format to build the CDC and bronze tables with DLT. I think this approach is sound: if I ever need to reprocess the raw data (for example, because its schema changed), I can do so safely, because the source of truth is kept in an object store.
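To illustrate what I mean by reprocessing, here is a minimal sketch in plain Python (not DLT code). It assumes the raw events keep the standard Debezium envelope fields (`before`, `after`, `op`, `ts_ms`) without the optional schema wrapper, and that each row has a primary key column called `id`. The `replay` helper and the sample rows are hypothetical, made up for this example, and tombstone records are not handled.

```python
import json
from typing import Any, Dict, Iterable


def replay(raw_events: Iterable[str], key: str = "id") -> Dict[Any, Dict[str, Any]]:
    """Rebuild the current table state from raw Debezium change events.

    `key` is the (assumed) primary key column name.
    """
    table: Dict[Any, Dict[str, Any]] = {}
    # Sort by event timestamp; the sort is stable, so ties keep arrival order.
    events = sorted((json.loads(e) for e in raw_events), key=lambda e: e["ts_ms"])
    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):
            # create, snapshot read, update: upsert the new row image
            row = event["after"]
            table[row[key]] = row
        elif op == "d":
            # delete: the key is only available in the "before" image
            table.pop(event["before"][key], None)
    return table


# Hypothetical raw events, as they might be stored in the raw layer.
raw = [
    json.dumps({"op": "c", "before": None, "after": {"id": 1, "name": "a"}, "ts_ms": 1}),
    json.dumps({"op": "c", "before": None, "after": {"id": 2, "name": "b"}, "ts_ms": 2}),
    json.dumps({"op": "u", "before": {"id": 2, "name": "b"}, "after": {"id": 2, "name": "b2"}, "ts_ms": 3}),
    json.dumps({"op": "d", "before": {"id": 1, "name": "a"}, "after": None, "ts_ms": 4}),
]

state = replay(raw)
print(state)
```

As long as the raw events are kept untouched, the same replay can be run again after a logic or schema change, which is the property I am trying to preserve.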

I am using Unity Catalog, and I am weighing a few implementation options:

  • Where to store the raw data: an external location, a volume, or a Unity Catalog table?
  • Use a standard workflow or a DLT pipeline?

Am I approaching this problem the right way?

Thank you in advance!