
CDC and raw data

jcozar
Contributor

Hi, 

I am using Debezium Server to send data from Postgres to a Kafka topic (in fact, Azure Event Hubs). My question is: what are the best practices and recommendations for saving raw data and then implementing a medallion architecture?

For clarification, I want to store the raw data in Delta format and then read it with the cloudFiles format for the CDC/bronze tables using DLT. I think this approach is good because if I ever need to reprocess the raw data (say, because its schema changed), I feel it is safe: the source of truth is stored in an object store.

I am using Unity Catalog, but I am thinking about different implementations:

  • Where to store? External location, volume or Unity Catalog table?
  • Use a standard workflow or a DLT pipeline?

Am I approaching this problem the right way?

Thank you in advance!


5 REPLIES

Palash01
Contributor III

Hey @jcozar 

Let's address your questions about storing raw data and implementing a medallion architecture.

Storing Raw Data:

  • Delta Lake: Storing raw data as Delta files in an external location (e.g., ADLS Gen2) is a good practice; the transaction log and schema evolution make later reprocessing safe (a minimal ingestion sketch follows this list).
  • cloudFiles (Auto Loader): cloudFiles is the Auto Loader source for incrementally ingesting raw files (JSON, Avro, Parquet, etc.) from cloud storage; it is not designed to read Delta tables. If your raw layer is already a Delta table, a streaming read of that table is usually simpler than pointing Auto Loader at its underlying files.
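
For illustration, here is a minimal sketch of landing the Debezium events as raw Delta files at an external location, assuming you read Event Hubs through its Kafka-compatible endpoint. Every name, path, topic, and secret below is a placeholder for your own values; `spark` and `dbutils` are provided by the Databricks notebook environment:

```python
# Rough sketch only: land raw Debezium events from the Event Hubs Kafka endpoint
# into a Delta table at an external location. All values below are placeholders.
EH_BOOTSTRAP = "my-namespace.servicebus.windows.net:9093"                     # placeholder
EH_CONN_STR = dbutils.secrets.get("my-scope", "eventhub-connection-string")   # placeholder
RAW_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/cdc/orders"     # placeholder

raw_events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", EH_BOOTSTRAP)
    .option("subscribe", "postgres.public.orders")        # Debezium topic name (placeholder)
    .option("kafka.security.protocol", "SASL_SSL")
    .option("kafka.sasl.mechanism", "PLAIN")
    .option(
        "kafka.sasl.jaas.config",
        'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required '
        f'username="$ConnectionString" password="{EH_CONN_STR}";',
    )
    .option("startingOffsets", "earliest")
    .load()
    # Keep the payload untouched; parse the Debezium JSON later, in bronze.
    .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value",
                "topic", "partition", "offset", "timestamp")
)

(raw_events.writeStream
    .format("delta")
    .option("checkpointLocation", f"{RAW_PATH}/_checkpoint")
    .outputMode("append")
    .start(RAW_PATH))
```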

Medallion Architecture Implementation:

Where to store:

  • External location: Storing raw, bronze, and silver tables at external locations is common because it separates data from compute and makes data lifecycle management (retention, reprocessing, deletion) easier (see the small example after this list).
  • Unity Catalog managed table: Possible, but for a raw layer many teams prefer external tables so the files stay under their own control; Unity Catalog's strength here is governance and metadata, and dropping a managed table eventually deletes the underlying files.
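
If it helps, registering the storage as an external location in Unity Catalog is a one-off step. The location name, URL, storage credential, and group below are placeholders:

```python
# Rough sketch: register an external location in Unity Catalog and grant a team
# access to it. Location name, URL, storage credential, and group are placeholders.
spark.sql("""
  CREATE EXTERNAL LOCATION IF NOT EXISTS raw_landing
  URL 'abfss://raw@mystorageaccount.dfs.core.windows.net/'
  WITH (STORAGE CREDENTIAL my_storage_credential)
""")

spark.sql("""
  GRANT READ FILES, WRITE FILES ON EXTERNAL LOCATION raw_landing TO `data-engineers`
""")
```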

Workflow Options:

  • Standard workflow: A Databricks Workflows job running your own notebooks or scripts (batch or Structured Streaming) that build the bronze and silver Delta tables. This gives you full control, but you manage orchestration, checkpoints, and error handling yourself.
  • DLT pipeline: Delta Live Tables lets you declare each layer as a table and resolves the dependencies between them, so downstream tables wait for their upstream sources automatically; you also get built-in monitoring and data-quality expectations (a minimal sketch follows this list).
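
A minimal DLT sketch of that dependency handling; the table names, source path, and expectation are placeholders, and the raw files are assumed to be JSON:

```python
import dlt
from pyspark.sql import functions as F

RAW_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/cdc/orders"  # placeholder

@dlt.table(comment="Raw CDC events ingested incrementally with Auto Loader")
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")   # assumption: raw events landed as JSON files
        .load(RAW_PATH)
    )

@dlt.table(comment="Reads the bronze table, not the raw files; DLT waits for bronze_orders")
@dlt.expect("no_rescued_data", "_rescued_data IS NULL")   # simple data-quality expectation
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .withColumn("ingested_at", F.current_timestamp())
    )
```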

Addressing Your Approach:

  • Your approach to storing raw data in Delta Lake format is commendable. It allows for schema evolution and potential reprocessing.
  • Consider whether reading the raw Delta data back through cloudFiles adds real value compared with streaming directly from the raw Delta table.
  • Storing tables in external locations is generally recommended.
  • Leveraging Databricks' standard DLT pipelines can save development time and effort.

Follow-ups are appreciated!

Leave a like if this helps! Kudos,
Palash

Thank you @Palash01! I totally agree with you, but I would like to ask you for a little more detail on a couple of things, if you don't mind! 🙂

1. If I use an external location for storage and I delete the table in Unity Catalog, is the data deleted in the external location (Azure Storage Account)? If I do not use an external location, I think the data is deleted after 30 days.

2. If I use an external location for a schema and I create a table in that schema, the path within the external location is managed by Unity Catalog. Is it good practice to let Unity Catalog manage the paths, or is it better to specify custom paths?

 

Regarding raw data, I store it as Delta files (using a continuous DLT pipeline). My bronze tables then read the raw data using the cloudFiles format to pick up new files incrementally (continuous and triggered DLT pipelines).

Thank you very much!

Hey @jcozar

Thanks for bringing up your concerns, always happy to help!

Let's take a look at your concerns: 

1. External Locations and Data Deletion:

  • If you delete a table in Unity Catalog that references an external location (e.g., an Azure Storage Account), the data in the external location is NOT deleted. For external tables, Unity Catalog only manages the metadata, not the underlying files; the data stays in the external location until you delete it yourself through the storage service's management interface or other tools.
  • If you don't use an external location and the table is a managed Unity Catalog table, dropping the table does delete the data: the metadata is removed immediately and the underlying files are cleaned up by Unity Catalog afterwards.

In a nutshell: for managed tables the metadata is removed immediately, and the underlying data is deleted asynchronously and permanently within 30 days. (You can find more details in the thread "Permanently delete dropped table (Unity Catalog)" on the Databricks Community, answered by our community manager @Kaniz.)
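
To make the difference concrete, here is a small sketch; the catalog, schema, table names, and the abfss path are placeholders:

```python
# Managed table: Unity Catalog owns the storage location and the file lifecycle.
# Dropping it removes the metadata immediately and the files within ~30 days.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.raw.orders_managed (id BIGINT, payload STRING)
""")

# External table: only metadata over files that live at a path you control.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.raw.orders_external (id BIGINT, payload STRING)
  LOCATION 'abfss://raw@mystorageaccount.dfs.core.windows.net/cdc/orders_external'
""")

# Dropping the external table removes only the metadata; the Delta files under
# the abfss:// path above are untouched and can be re-registered later.
spark.sql("DROP TABLE IF EXISTS main.raw.orders_external")
```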

2. Schema Paths and Management:

  • Letting Unity Catalog manage paths inside an external location can be convenient: paths are derived automatically and metadata management stays centralized, but it isn't always the best practice.
  • Specifying custom paths gives you more control and flexibility. You can define paths based on your own needs and data organization logic, which may suit you better if you have complex data structures or prefer manual control over file locations (a short example of both options follows this list).
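
A small illustration of the two options; catalog, schema, and paths are placeholders:

```python
# Option A: set a managed location on the schema; Unity Catalog then chooses the
# sub-path for every managed table created inside it (no per-table paths to maintain).
spark.sql("""
  CREATE SCHEMA IF NOT EXISTS main.raw
  MANAGED LOCATION 'abfss://raw@mystorageaccount.dfs.core.windows.net/managed/raw'
""")

# Option B: pin an explicit path per table (an external table), so you decide and
# govern the directory layout yourself.
spark.sql("""
  CREATE TABLE IF NOT EXISTS main.raw.orders (id BIGINT, payload STRING)
  LOCATION 'abfss://raw@mystorageaccount.dfs.core.windows.net/custom/orders'
""")
```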

3. Delta Files and the cloudFiles Format

This one is a little unclear to me, but I'll try my best to answer. Delta is a table format, not just a file extension: a Delta table is a directory of Parquet data files plus a _delta_log transaction log, so the data you write to bronze is physically stored as Parquet underneath. cloudFiles, on the other hand, is the Auto Loader source: it incrementally discovers and ingests new files arriving in cloud storage, which is why it is typically used at the ingestion edge (landing zone into bronze). Between the layers of a DLT pipeline (bronze to silver, silver to gold) you normally don't go back to files at all; each table reads the upstream table directly and DLT tracks the dependency. More details on Auto Loader and Delta Lake can be found in the Databricks documentation. Please do follow up if I misunderstood this one! A short sketch of the two read patterns follows below.
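
For illustration, the two read patterns side by side; paths and table names are placeholders, and the raw files are assumed to be JSON:

```python
LANDING_PATH = "abfss://raw@mystorageaccount.dfs.core.windows.net/landing/orders"  # placeholder

# cloudFiles / Auto Loader: incrementally pick up new *files* as they land.
bronze_source = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")                        # assumption: JSON landing files
    .option("cloudFiles.schemaLocation", f"{LANDING_PATH}/_schema")
    .load(LANDING_PATH)
)

# Delta: read an existing table (Parquet data files + _delta_log), by name or by path.
silver_source = spark.readStream.table("main.bronze.orders")                          # by name
# silver_source = spark.readStream.format("delta").load("abfss://.../bronze/orders")  # or by path
```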

Recommendation:

  • For raw data (Delta files): Using an external location is generally recommended, and letting Unity Catalog manage the paths is usually fine. This keeps the data persistent and consistent across schema changes.
  • Consider your specific needs: If custom paths simplify your logic or give you the control you need, go for it. If Unity Catalog-managed paths fit your organization and don't cause issues, they are a convenient option.
  • Ensure proper access control and security for your data, both in Unity Catalog and in the external location.
  • Evaluate your specific use case and team preferences when deciding between custom paths and Unity Catalog-managed paths.


Leave a like if this helps! Kudos,
Palash

Palash01
Contributor III

Hey @jcozar 

Just checking in if the provided solution was helpful to you. If yes, please accept this as a Best Solution so that this thread can be considered closed.

Leave a like if this helps! Kudos,
Palash

jcozar
Contributor

Thank you very much @Palash01! It has been really helpful! 🙂