12-16-2024 12:28 PM - edited 12-16-2024 12:48 PM
I am in the process of rebuilding the data lake at my current company with Databricks, and I'm struggling to find comprehensive best practices for naming conventions and for structuring a medallion architecture that works optimally with the Databricks Assistant.
I've been reading about the Assistant and the sources it uses to determine which fields belong in which table, and so on. Most of the examples I have read show descriptive table names without any prefixes or suffixes. The problem is that I usually organize the medallion layers, as well as other things like residency and ingestion source, using prefixes or suffixes in the table names, for example bronze_marketing_campaign_response_us_cdc. The documentation I am reading makes it seem like this is not going to be very optimal, but I can't seem to find what the 'right' way actually is. Does all of that other information need to live at the catalog or schema level? Is there something I can do in Unity Catalog so the Assistant can interpret the extra information currently encoded in the table names?
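For illustration, here is a sketch, in plain Python, of what any tooling (the Assistant included) would have to do to recover metadata packed into a prefixed/suffixed table name. The vocabulary sets and field order are invented assumptions, which is exactly the problem: nothing in the name itself declares them, which is part of why the docs steer toward putting this information at the catalog/schema level instead.

```python
# Hypothetical parser for prefix/suffix-encoded table names such as
# "bronze_marketing_campaign_response_us_cdc". The vocab sets below are
# assumptions for this sketch, not anything Databricks defines.

KNOWN_LAYERS = {"bronze", "silver", "gold"}
KNOWN_RESIDENCIES = {"us", "eu", "apac"}
KNOWN_INGESTION_MODES = {"cdc", "batch", "stream"}

def parse_table_name(name: str) -> dict:
    """Split an encoded table name into layer / entity / residency / ingestion."""
    parts = name.split("_")
    layer = parts[0] if parts[0] in KNOWN_LAYERS else None
    ingestion = parts[-1] if parts[-1] in KNOWN_INGESTION_MODES else None
    # Everything between the (optional) layer prefix and ingestion suffix:
    core = parts[1 if layer else 0 : len(parts) - (1 if ingestion else 0)]
    residency = None
    if core and core[-1] in KNOWN_RESIDENCIES:
        residency = core[-1]
        core = core[:-1]
    return {
        "layer": layer,
        "entity": "_".join(core),
        "residency": residency,
        "ingestion": ingestion,
    }

# parse_table_name("bronze_marketing_campaign_response_us_cdc")
# -> {"layer": "bronze", "entity": "marketing_campaign_response",
#     "residency": "us", "ingestion": "cdc"}
```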
12-16-2024 12:54 PM
When structuring your data lake with Databricks and implementing the medallion architecture (Bronze, Silver, Gold layers), it is essential to follow best practices for naming conventions and table organization to ensure optimal performance and usability, especially when using Unity Catalog.
Medallion Architecture Overview:
You can consider the following approach:
Naming Conventions:
12-16-2024 12:55 PM
Unity Catalog Setup:
Example Structure:
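One hedged sketch of such a structure (all catalog, schema, and table names here are illustrative, not a prescribed layout): the layer moves to the catalog level, the domain or source moves to the schema level, and the table name is left purely descriptive, addressed through Unity Catalog's three-part `catalog.schema.table` namespace.

```python
# Illustrative three-level Unity Catalog layout (catalog -> schema -> table).
# Names are hypothetical; the point is that layer and domain move out of the
# table name and into the catalog and schema levels.

example_structure = {
    "bronze": {
        "salesforce": ["campaign_response_cdc", "contact_cdc"],
        "ga4": ["events_stream"],
    },
    "silver": {
        "marketing": ["campaign_response"],
        "person": ["contact"],
    },
    "gold": {
        "marketing_analytics": ["campaign_performance"],
    },
}

def fully_qualified(catalog: str, schema: str, table: str) -> str:
    """Build the three-part name Unity Catalog uses to address a table."""
    return f"{catalog}.{schema}.{table}"
```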
02-07-2025 04:13 PM
Hello!
I am in a similar position and the medallion architecture makes a lot of sense to me (indeed, I believe we've been following a version of that ourselves for a long time).
It seems to me that having separate catalogs for each layer (bronze/silver/gold) makes the most sense (per the second reply above). My question is: are there any drawbacks to organizing our data where each catalog is a different layer?
Long-term, I'm thinking we're likely to do our ETL with Delta Live Tables, so I'd be particularly interested to know whether there might be any limitations with DLT pipelines.
a month ago
Our initial approach was to have catalogs for sources and uses, but that was confusing and the grade of data wasn't obvious. And, as we began to join data, it was unclear where to put the final datasets. We switched to three catalogs: one for bronze, one for silver, one for gold. This has made life much less confusing for us, not to mention for the systems that rely on our data.
In bronze we have a schema for each source. The landed data sits in an external volume, and we flatten it into tables in the same schema. Silver has schemas for the different entities we're working with, like "person" or "product". In gold we have schemas for different consumers or departments; there isn't a lot of overlap in data. This way we can restrict end users to specific gold schemas.
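The per-consumer gold schemas described above lend themselves to schema-level grants. A minimal sketch, assuming a `gold` catalog and invented schema and group names (the generated statements use standard Unity Catalog GRANT syntax, but verify the exact privileges against your workspace setup):

```python
# Generate per-consumer schema grants for the gold catalog, following the
# "one gold schema per consumer/department" layout described above.
# Schema names and account groups here are invented examples.

gold_consumers = {
    "finance": "finance_analysts",      # schema -> account group
    "marketing": "marketing_analysts",
}

def gold_grants(consumers: dict) -> list:
    """Emit one schema-level GRANT per consumer so each group can only
    see its own gold schema (USE CATALOG on `gold` is still needed)."""
    return [
        f"GRANT USE SCHEMA, SELECT ON SCHEMA gold.{schema} TO `{group}`;"
        for schema, group in sorted(consumers.items())
    ]
```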
a month ago
That makes perfect sense and dovetails with my thinking, as well. Thank you!