Best practices around bronze/silver/gold (medallion model) data lake classification?

User16776430979
New Contributor III

What's the best way to organize our data lake and Delta setup? We're trying to use the bronze, silver, and gold classification strategy. The main question is: how do we know which classification data belongs to inside Databricks if there's no actual physical location called bronze, silver, or gold? What naming conventions/strategies does Databricks recommend?

5 REPLIES

Kaniz
Community Manager

Hi @Josephine Ho, database object naming conventions and coding standards are crucial for maintaining consistency, readability, and manageability in a data engineering project.

In Databricks, you can apply the following naming conventions and coding standards to the Bronze, Silver, and Gold layers.

  • General Naming Conventions:
    • Use lowercase letters for all object names (tables, views, columns, etc.).
    • Separate words with underscores for readability.
    • Be descriptive and concise. Use phrases that indicate the purpose of the object.
    • Avoid using reserved keywords or special characters.
  • Bronze Layer (Raw Data Layer):
    • Table Naming Convention: Use the prefix "bronze_" followed by the source system or data source and the object's nameโ€”for example, bronze_salesforce_opportunities.
    • File Format: Store data in Delta Lake format to leverage its performance, ACID transactions, and schema evolution capabilities.
    • Partitioning: Use partition columns that best suit your data access patterns, such as date or timestamp.
  • Silver Layer (Cleansed and Enriched Data Layer):
    • Table Naming Convention: Use the prefix "silver_" followed by the functional area or business domain and the object's nameโ€”for example, silver_finance_transactions.
    • File Format: Use Delta Lake format for storing the data.
    • Partitioning: Choose appropriate partition columns, considering data access patterns and performance implications.
    • Data Cleansing and Enrichment: Apply necessary data quality checks, type conversions, and enrichment processes.
    • Documentation: Document the transformation logic and any assumptions made during the cleansing and enrichment process.
  • Gold Layer (Aggregated and Business Ready Data Layer):
    • Table Naming Convention: Use the prefix "gold_" followed by the functional area or business domain and the object's name. For example, gold_sales_monthly_summary.
    • File Format: Store the data in Delta Lake format.
    • Partitioning: Choose appropriate partition columns, considering data access patterns and performance implications.
    • Aggregations: Perform aggregations and calculations as required by the business.
    • Documentation: Document the aggregation logic and any assumptions made during the process.
  • Code Organization:
    • Separate your code into different notebooks for each layer (Bronze, Silver, Gold) and maintain a clear hierarchy for ease of maintenance.
    • Follow code formatting and readability best practices, such as clear comments, consistent indentation, and modularization.
    • Use version control systems like Git to manage your codebase and track changes.

Following these naming conventions and coding standards allows you to maintain a well-structured, easily understandable, and maintainable data engineering project in Databricks.
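To make this concrete, here is a rough sketch of how the three layers and the prefix convention could look in a Databricks notebook using PySpark. The paths, table names, and columns below are hypothetical examples rather than an official standard, and spark is the SparkSession that Databricks notebooks provide:

    # Bronze: land the raw source data as a Delta table (names and paths are illustrative).
    from pyspark.sql import functions as F

    raw_df = spark.read.json("/mnt/landing/salesforce/opportunities/")
    (raw_df
        .withColumn("ingest_ts", F.current_timestamp())
        .write.format("delta")
        .mode("append")
        .saveAsTable("bronze_salesforce_opportunities"))

    # Silver: cleanse and type the data, partitioned by a date column.
    bronze_df = spark.read.table("bronze_salesforce_opportunities")
    (bronze_df
        .dropDuplicates(["opportunity_id"])
        .withColumn("close_date", F.to_date("close_date"))
        .write.format("delta")
        .mode("overwrite")
        .partitionBy("close_date")
        .saveAsTable("silver_sales_opportunities"))

    # Gold: business-level aggregate for reporting.
    silver_df = spark.read.table("silver_sales_opportunities")
    (silver_df
        .groupBy(F.date_trunc("month", "close_date").alias("month"))
        .agg(F.sum("amount").alias("total_amount"))
        .write.format("delta")
        .mode("overwrite")
        .saveAsTable("gold_sales_monthly_summary"))

The same pattern scales naturally to one notebook per layer, as suggested under Code Organization above.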

ramdhilip
New Contributor II

@Kaniz, thank you for the detailed guidelines on naming conventions for the Bronze, Silver, and Gold layers in Databricks. These conventions are certainly valuable for maintaining consistency and manageability.

I'd like to inquire about the best practices for structuring the Database and Schema names, especially in the context of managed tables within the Medallion Architecture in Delta Lake.

With unmanaged tables, the folder structure allows us to segregate the Gold, Silver, and Bronze layers effectively. However, with managed tables, we don't have control over the folder structure.

Is there a difference in maintaining the naming convention between Managed or Unmanaged tables, particularly in implementing the Medallion Architecture? Could you please provide insights or recommendations on how to approach this to ensure a well-structured and maintainable data engineering environment?

Your guidance on this matter would be greatly appreciated.

Thank you!
Ram

Hi @Kaniz,

I have a doubt; the bronze layer always causes confusion for me. You mentioned, "File Format: Store data in Delta Lake format to leverage its performance, ACID transactions, and schema evolution capabilities" for the bronze layer.

Does this mean that it is not necessary to preserve the data in its original format? For instance, what if it arrives as JSON from the source system, or if we export it from the source database as CSV files compressed in ZIP archives?

This part confuses me: per the medallion architecture, should we not store the data in its original format, and should we rely on the bronze layer only for data history, lineage, audit, and reprocessing?

Thank you very much in advance for clarifying this for me.

Best Regards

 

Kaniz
Community Manager

Hi @ramdhilip,

Both managed and unmanaged tables are part of the Databricks ecosystem and contribute to the Medallion Architecture, but the documentation does not specify naming conventions exclusive to either table type.
 
Key points to remember:

• Managed tables: Databricks manages both the data and the metadata. When a managed table is dropped, the underlying data is also deleted.
• Unmanaged tables: Databricks manages only the metadata. When an unmanaged table is dropped, the underlying data remains unaffected.
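As a minimal sketch of that difference in PySpark (the table names and storage path are hypothetical placeholders):

    # Managed table: Databricks chooses and owns the storage location;
    # DROP TABLE also deletes the underlying data files.
    (spark.read.table("bronze_salesforce_opportunities")
        .write.format("delta")
        .saveAsTable("silver_sales_opportunities"))

    # Unmanaged (external) table: you supply the path; DROP TABLE removes
    # only the metadata and leaves the files in place.
    (spark.read.table("bronze_salesforce_opportunities")
        .write.format("delta")
        .option("path", "abfss://silver@<storage-account>.dfs.core.windows.net/sales/opportunities")
        .saveAsTable("silver_sales_opportunities_ext"))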
 
In the context of the Medallion Architecture:

• Bronze layer: Contains raw, unvalidated data.
• Silver layer: Contains validated and enriched data.
• Gold layer: Contains highly refined and aggregated data.

The naming conventions for these tables would likely depend on the organization's internal data governance policies rather than on whether the tables are managed or unmanaged.
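For managed tables, where you do not control the folder layout, one common pattern (not a Databricks mandate; the catalog and schema names below are only examples) is to encode the layer in the catalog or schema rather than in file paths:

    # Create one schema per layer in a Unity Catalog catalog named "main" (illustrative).
    spark.sql("CREATE SCHEMA IF NOT EXISTS main.bronze")
    spark.sql("CREATE SCHEMA IF NOT EXISTS main.silver")
    spark.sql("CREATE SCHEMA IF NOT EXISTS main.gold")

    # The three-level name then carries the layer information, e.g.
    #   main.bronze.salesforce_opportunities
    #   main.silver.finance_transactions
    #   main.gold.sales_monthly_summary
    spark.sql("""
        CREATE TABLE IF NOT EXISTS main.bronze.salesforce_opportunities (
            opportunity_id STRING,
            amount DOUBLE,
            ingest_ts TIMESTAMP
        )
    """)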
 

-werners-
Esteemed Contributor III

With Unity Catalog taken into account, it is certainly a good idea to think about your physical data storage.
Since you cannot have overlap between volume and table locations, this can become cumbersome.
For example, we used to store the Delta tables of a data object in the same directory as the ingested files.
With Unity, that structure is now impossible.
So I'd create a separate container for tables and one for volumes to avoid this overlap.

This is of course easier said than done in an existing environment.
As much as I like Unity, it does give me a lot of headaches because we have to do serious refactoring to embrace it.
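As a rough sketch of the container split mentioned above (storage account, credential, catalog, and object names are placeholders, and the relevant Unity Catalog privileges are assumed):

    # One external location reserved for table storage ...
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS tables_loc
        URL 'abfss://tables@<storage-account>.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL my_credential)
    """)

    # ... and a separate one for volumes, so table and volume paths never overlap.
    spark.sql("""
        CREATE EXTERNAL LOCATION IF NOT EXISTS volumes_loc
        URL 'abfss://volumes@<storage-account>.dfs.core.windows.net/'
        WITH (STORAGE CREDENTIAL my_credential)
    """)

    # Ingested files then live in a volume under the volumes container.
    spark.sql("""
        CREATE EXTERNAL VOLUME IF NOT EXISTS main.bronze.landing
        LOCATION 'abfss://volumes@<storage-account>.dfs.core.windows.net/landing'
    """)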
