01-02-2024 03:41 AM
Hi,
I am using Debezium Server to send data from Postgres to a Kafka topic (in fact, Azure Event Hubs). My question is: what are the best practices and recommendations for saving raw data and then implementing a medallion architecture?
To clarify, I want to store the raw data in Delta format and then read it with the cloudFiles format for CDC and bronze tables using DLT. I think this approach is sound because if I ever need to reprocess the raw data (say, because the raw data schema changed), it is safe to do so: the source of truth is still stored in an object store.
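To make the reprocessing idea concrete, here is a minimal sketch of replaying stored raw change events to rebuild the current state of a table. It assumes the default Debezium envelope, where each event carries `op`, `before` and `after` (optionally wrapped in `payload`). The function names, the `id` key column and the sample rows are hypothetical and only for illustration.

```python
import json


def apply_change(state: dict, event: dict, key_field: str = "id") -> None:
    """Apply one Debezium-style change event to an in-memory table."""
    # The envelope may be wrapped in {"schema": ..., "payload": ...} or be bare.
    payload = event.get("payload", event)
    op = payload["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = payload["after"]
        state[row[key_field]] = row
    elif op == "d":  # delete: the old row (at least its key) is in "before"
        state.pop(payload["before"][key_field], None)
    # Other operation codes are ignored in this sketch.


def replay(raw_lines, key_field: str = "id") -> dict:
    """Rebuild the current table state from raw JSON events, in arrival order."""
    state: dict = {}
    for line in raw_lines:
        apply_change(state, json.loads(line), key_field)
    return state


# Hypothetical raw events, as they might be stored one JSON document per line.
raw_events = [
    '{"payload": {"op": "c", "before": null, "after": {"id": 1, "name": "Ada"}}}',
    '{"payload": {"op": "u", "before": {"id": 1, "name": "Ada"}, "after": {"id": 1, "name": "Ada L."}}}',
    '{"payload": {"op": "c", "before": null, "after": {"id": 2, "name": "Bob"}}}',
    '{"payload": {"op": "d", "before": {"id": 2, "name": "Bob"}, "after": null}}',
]

current = replay(raw_events)
print(current)
```

Because the raw events are never modified, the same replay can be rerun with different logic (for example, after a schema change) without going back to Postgres.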
I am using Unity Catalog, but I am thinking about different implementations:
Am I approaching this problem the right way?
Thank you in advance!
02-03-2024 11:00 PM
Hey @jcozar
Let's address your questions about storing raw data and implementing a medallion architecture.
Storing Raw Data:
Medallion Architecture Implementation:
Where to store:
Workflow Options:
Addressing Your Approach:
Leave a like if this helps, followups are appreciated.
02-06-2024 12:51 AM
Thank you @Palash01! I totally agree with you, but I would like to ask for a little more detail on a couple of things, if you don't mind! 🙂
1. If I use an external location for storage and I delete the table in Unity Catalog, is the data deleted in the external location (Azure Storage Account)? If I do not use an external location, I think the data is deleted after 30 days.
2. If I use an external location for a schema and create a table in that schema, the path inside the external location is managed by Unity Catalog. Is it good practice to let Unity Catalog manage paths, or is it better to specify custom paths?
With respect to raw data, I store it as Delta files (using a DLT continuous pipeline). My bronze tables then read the raw data using the cloudFiles format to incrementally pick up new files (DLT continuous and DLT triggered pipelines).
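For the raw ingestion step, a sketch of the Kafka source options usually needed to read from the Event Hubs Kafka endpoint could look like the following. The helper name and the namespace, topic and connection string values are placeholders, and the `kafkashaded.` prefix on the login module assumes the code runs on a Databricks runtime.

```python
def eventhubs_kafka_options(namespace: str, topic: str, connection_string: str) -> dict:
    """Build Structured Streaming Kafka options for an Event Hubs namespace."""
    # Event Hubs authenticates Kafka clients with SASL PLAIN: the username is
    # the literal string "$ConnectionString" and the password is the
    # namespace connection string.
    jaas = (
        "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule "
        f'required username="$ConnectionString" password="{connection_string}";'
    )
    return {
        "kafka.bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "subscribe": topic,
        "kafka.security.protocol": "SASL_SSL",
        "kafka.sasl.mechanism": "PLAIN",
        "kafka.sasl.jaas.config": jaas,
        "startingOffsets": "earliest",
    }


# Placeholder values; in practice the connection string would come from a secret.
options = eventhubs_kafka_options(
    namespace="my-namespace",
    topic="pg.public.customers",
    connection_string="<event-hubs-connection-string>",
)

# In a pipeline, the options would be passed to the Kafka source, e.g.:
#   spark.readStream.format("kafka").options(**options).load()
print(options["kafka.bootstrap.servers"])
```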
Thank you very much!
02-06-2024 07:17 PM - edited 02-06-2024 07:19 PM
Hey @jcozar,
Thanks for bringing up your concerns, always happy to help 😁
Let's take a look at your concerns:
1. External Locations and Data Deletion:
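In a nutshell, when a managed table is dropped, the metadata is removed immediately and the underlying data is deleted asynchronously, becoming permanently gone after 30 days. For an external table (one created with an explicit LOCATION), dropping the table removes only the metadata and leaves the files in your storage account. (You can find more details on this topic in the answer by @Retired_mod, our community manager, here: Databricks Community - Permanently delete dropped table (Unity Catalog).)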
2. Schema Paths and Management:
3. Delta Files and cloudFiles Format:
This concern is a little unclear to me at this time, but I'll try my best to answer. Delta is a table format that stores its data as Parquet files plus a transaction log (the _delta_log directory), which is why the data files in a Delta table carry the .parquet extension. cloudFiles is the source format used by Auto Loader, which incrementally picks up new files as they land in cloud storage, so it is typically how files get loaded into bronze. Between the layers of a DLT pipeline, such as bronze and silver, each table is read as a stream from the previous one, so there is no need to re-list and re-read files at every layer. If you are looking for more details on file systems in Databricks, they can be found in the documentation. Please do follow up if I misunderstood this one!
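As a rough sketch of the two kinds of streaming read involved here, assuming a Spark session on Databricks: the first helper uses Auto Loader (`cloudFiles`) for plain files landing in storage, the second streams from an existing Delta table by name. The helper names, the JSON file format and all paths and table names are hypothetical.

```python
def read_landed_files(spark, path: str, schema_location: str):
    """Incrementally read new JSON files under `path` with Auto Loader."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Where Auto Loader keeps the inferred schema between runs.
        .option("cloudFiles.schemaLocation", schema_location)
        .load(path)
    )


def read_raw_delta(spark, table_name: str):
    """Incrementally read new rows from a Delta table using its transaction log."""
    return spark.readStream.table(table_name)


# Inside a DLT notebook these would typically be wrapped in table definitions,
# for example (names are placeholders):
#
#   import dlt
#
#   @dlt.table
#   def bronze_events():
#       return read_raw_delta(spark, "my_catalog.raw.events")
```

One design consideration: pointing Auto Loader at the Parquet files inside a Delta table directory bypasses the transaction log, so files rewritten by compaction could be picked up again. Streaming from the Delta table itself avoids that.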
Recommendation:
Leave a like if this helps!
02-13-2024 10:40 PM
Hey @jcozar
Just checking in if the provided solution was helpful to you. If yes, please accept this as a Best Solution so that this thread can be considered closed.
02-14-2024 04:06 AM
Thank you very much @Palash01 ! It has been really helpful! 🙂