Databricks Community

User16776430979 · ‎06-07-2021

What's the best way to organize our data lake and Delta setup? We're trying to use the bronze, silver, and gold classification strategy. The main question is: how do we know which classification a dataset has inside Databricks when there's no actual physical place called bronze, silver, or gold? What naming conventions and strategies does Databricks recommend?

ramdhilip · ‎08-13-2023

@Retired_mod , Thank you for the detailed guidelines on naming conventions for the Bronze, Silver, and Gold layers in Databricks. These conventions are certainly valuable for maintaining consistency and manageability.

I'd like to inquire about the best practices for structuring the Database and Schema names, especially in the context of managed tables within the Medallion Architecture in Delta Lake.

With unmanaged tables, the folder structure allows us to segregate the Gold, Silver, and Bronze layers effectively. However, with managed tables, we don't have control over the folder structure.

Is there a difference in maintaining the naming convention between Managed or Unmanaged tables, particularly in implementing the Medallion Architecture? Could you please provide insights or recommendations on how to approach this to ensure a well-structured and maintainable data engineering environment?

Your guidance on this matter would be greatly appreciated.

Thank you!
Ram

eimis_pacheco · ‎09-18-2023

Hi @Retired_mod,

I have a question. The bronze layer always causes confusion for me. You mentioned, "File Format: Store data in Delta Lake format to leverage its performance, ACID transactions, and schema evolution capabilities" for the silver layer.

Does this mean it is not necessary to preserve the data in its original format? For instance, what if the data arrives from the source system as JSON, or if we export it from the source database as CSV files compressed into zip archives?

This part confused me. Per the medallion architecture, shouldn't we store the data in its original format? And should we rely only on the bronze layer for data history, lineage, audit, and reprocessing?

Thank you very much in advance for clarifying this for me.

Best Regards

-werners- · ‎09-19-2023

With Unity Catalog in the picture, it is certainly a good idea to think about your physical data storage.
Because volumes and tables cannot overlap in storage, this can become cumbersome.
For example, we used to store the Delta tables of a data object in the same directory as its ingested files.
With Unity, this structure is no longer possible.
So I'd create one container for tables and a separate one for volumes, to avoid this overlap.
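
As a rough illustration of the overlap problem described above, here is a minimal Python sketch that checks whether two storage locations are the same or nested inside each other. It is not a Databricks API: the function names, the `abfss://` URIs, and the account and container names are all hypothetical and only serve to show why the old "tables next to ingested files" layout conflicts while separate containers do not.

```python
from urllib.parse import urlparse


def _split_location(location: str) -> tuple[str, tuple[str, ...]]:
    """Split a storage URI into (scheme://authority, path segments)."""
    parsed = urlparse(location)
    root = f"{parsed.scheme}://{parsed.netloc}"
    segments = tuple(s for s in parsed.path.split("/") if s)
    return root, segments


def locations_overlap(a: str, b: str) -> bool:
    """Return True if one location equals, or is nested inside, the other."""
    root_a, seg_a = _split_location(a)
    root_b, seg_b = _split_location(b)
    if root_a != root_b:
        # Different container or account: the paths cannot overlap.
        return False
    # Compare whole path segments so "customers" and "customers_raw"
    # are treated as siblings, not as parent and child.
    shorter = min(len(seg_a), len(seg_b))
    return seg_a[:shorter] == seg_b[:shorter]


# Hypothetical layouts, for illustration only.
ACCOUNT = "myaccount.dfs.core.windows.net"

# Old layout: Delta table stored under the directory of the ingested files.
old_files = f"abfss://lake@{ACCOUNT}/customers"
old_table = f"abfss://lake@{ACCOUNT}/customers/delta"

# Suggested layout: one container for tables, one for volumes.
new_table = f"abfss://tables@{ACCOUNT}/customers"
new_volume = f"abfss://volumes@{ACCOUNT}/customers"

print(locations_overlap(old_files, old_table))
print(locations_overlap(new_table, new_volume))
```

A check like this could be run over a list of planned table and volume paths before a migration, to spot conflicting locations early.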

This is of course easier said than done in an existing environment.
As much as I like Unity, it does give me a lot of headaches, because we have to do serious refactoring to embrace it.