How to combine medallion architecture and Delta Live Tables nicely?

Erik
Valued Contributor II

Like many of you, we have implemented a "medallion architecture" (raw/bronze/silver/gold layers), with each layer stored in a separate storage account. We only create proper Hive tables for the gold-layer tables, so our Power BI users connecting to the Databricks SQL endpoint only see those and not the silver/bronze ones.

I have looked into using Delta Live Tables and am left with three questions:

1: How do we create a DLT pipeline that stores the bronze/silver/gold tables in their correct storage accounts? Or is this no longer the recommended way, and should we now keep everything in the same storage account?

2: How do we best organize the tables into bronze/silver/gold? An illustration is this example from the (quite cool) Databricks Mosaic project: there are many tables, but the medallion separation does not seem to be encoded anywhere. Is there a best practice here? Prepend e.g. "bronze_" to the table name? Tags?

3: And (related to question 2), is there a best practice for making only the gold tables visible to end users? Permissions, maybe?

ACCEPTED SOLUTION

merca
Valued Contributor II

I can answer the first question:

You can control where the data is stored by setting the `path` parameter on each table. The "storage path" in the pipeline settings will then only hold checkpoints (and some other pipeline metadata), while the data is stored in the correct account. (If you leave the pipeline "storage path" blank, the default `dbfs:/pipelines/<pipeline-id>` is used.)
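
For illustration, here is a minimal sketch of a Python DLT table pinned to its own storage account via `path`; the storage account, container names, source path, and table name are hypothetical placeholders:

```python
import dlt

# Hypothetical bronze table whose data is written to a dedicated
# storage account via the `path` parameter; the URIs are placeholders.
@dlt.table(
    name="bronze_events",
    comment="Raw events ingested from the landing zone",
    path="abfss://bronze@examplestorage.dfs.core.windows.net/events",
)
def bronze_events():
    # Incrementally pick up raw JSON files with Auto Loader.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("abfss://landing@examplestorage.dfs.core.windows.net/events")
    )
```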

Every Delta Live Table is created in the metastore, so schema/table grants should be used to manage permissions per layer. We use schemas to separate the layers.
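
As a rough sketch of that grant pattern (the schema and group names here are made up), you can expose gold while leaving bronze/silver with no grants at all:

```python
# Hypothetical: expose only the gold schema to the reporting group.
# Assumes table access control is enabled on the workspace/cluster.
spark.sql("GRANT USAGE ON SCHEMA gold TO `powerbi-users`")
spark.sql("GRANT SELECT ON SCHEMA gold TO `powerbi-users`")

# With no grants on the bronze/silver schemas, users in `powerbi-users`
# browsing from the SQL endpoint should only see the gold tables.
```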

My wish is that the target schema could also be set as a table property and overridden per table, not only in the pipeline settings.

