02-01-2024 05:27 AM
I completed the Data Engineering Lakehouse course and I am familiar with different objects and concepts of databricks and lakehouse but I cant tie them together in my mind.
Where do you typically use managed and non-managed tables? Bronze layer? Or not use at all?
How about the autoloader/cloud files? In bronze layer when new files come to the landing zone? But then they are not in their native format, so maybe in silver layer then as source for silver DLTs?
And the apply changes into? It updates existing records so this cant be used in data warehousing -like settings as then the history is not tracked correctly. So alternate between this and REFRESH live tables or?
So basically what kind of tables or filetypes are used generally in each of the medallion layers and what kind of loading style is used?
02-05-2024 01:13 PM - edited 02-05-2024 01:13 PM
Hi @sjmb ,
There are many questions here but hopefully I'll be able to provide a summarized answer that helps you understand the lakehouse, the medallion architecture and other concepts a bit better.
In the Medallion architecture, the usage of managed and non-managed tables, autoloader/cloud files, and the apply changes into function can vary depending on the layer (Bronze, Silver, Gold) and the specific use case.
As a general rule, use Managed Tables over External Tables because new product features, such as background auto compaction, and auto-optimize, will only work on Managed Tables. Managed tables are also recommended when you want to provision a governed location for your content without the overhead of creating and managing external locations and storage credentials. Managed tables always use the Delta table format. Use External Tables when you are already using them, when you are upgrading from Hive Metastore to Unity Catalog and you want a quick and seamless “one-click” upgrade without moving data. External tables are also recommended if you have DR requirements for this data, if you have requirements for external readers or writers to interact with this data outside of Databricks Clusters and DBSQL, or if you have requirements to enforce a particular naming or hierarchical storage convention. External tables support non-Delta tables, such as parquet/avro/orc etc.
The Bronze layer is where all the data from external source systems is landed. The table structures in this layer correspond to the source system table structures "as-is". It's generally a good idea to keep the data in the Bronze layer in Delta format for subsequent reads from the Bronze layer for ETL to be performant and to allow updates in Bronze to write CDC changes. Managed tables are typically used in this layer for their enhanced performance and low maintenance. In the Silver layer, the data from the Bronze layer is matched, merged, conformed, and cleaned. Only minimal or "just-enough" transformations and data cleansing rules are applied while loading the Silver layer. Both managed and external tables can be used in this layer, depending on your specific requirements. The Gold layer is for reporting and uses more de-normalized and read-optimized data models with fewer joins. The final layer of data transformations and data quality rules are applied here. Managed tables are typically used in this layer for their enhanced performance and low maintenance.
Autoloader is a tool that efficiently loads files from cloud storage into Delta tables. It's typically used in the Bronze layer when new files come to the landing zone.
The 'apply changes into' API supports updating tables with SCD type 1 and type 2. Use SCD type 1 to update records directly where history is not retained for records that are updated. Use SCD type 2 to retain a history of records, either on all updates or on updates to a specified set of columns. Note that history is also tracked on all delta tables, meaning you can query previous versions of a table. Please note that the REFRESH
command and the APPLY CHANGES INTO
command serve different purposes and are used in different contexts. The REFRESH
command is used for refreshing materialized views, while the APPLY CHANGES INTO
command is used for handling changes in data in the context of Delta Live Tables
To summarize, in general, the Bronze layer often deals with raw data in its native format, the Silver layer contains cleansed and conformed data, and the Gold layer holds curated business-level tables. The specific tables or file types used in each layer, as well as the loading style, can vary depending on the specific use case and data requirements.
Hope that helps.
Gab
References:
Create tables in Unity Catalog
Simplified change data capture with the APPLY CHANGES API in Delta Live Tables
What is the Medallion Lakehouse Architecture
02-05-2024 01:16 PM
Hi @sjmb ,
There are many questions here but hopefully I'll be able to provide a summarized answer that helps you understand the lakehouse, the medallion architecture and other concepts a bit better.
In the Medallion architecture, the usage of managed and non-managed tables, autoloader/cloud files, and the apply changes into function can vary depending on the layer (Bronze, Silver, Gold) and the specific use case.
As a general rule, use Managed Tables over External Tables because new product features, such as background auto compaction, and auto-optimize, will only work on Managed Tables. Managed tables are also recommended when you want to provision a governed location for your content without the overhead of creating and managing external locations and storage credentials. Managed tables always use the Delta table format. Use External Tables when you are already using them, when you are upgrading from Hive Metastore to Unity Catalog and you want a quick and seamless “one-click” upgrade without moving data. External tables are also recommended if you have DR requirements for this data, if you have requirements for external readers or writers to interact with this data outside of Databricks Clusters and DBSQL, or if you have requirements to enforce a particular naming or hierarchical storage convention. External tables support non-Delta tables, such as parquet/avro/orc etc.
The Bronze layer is where all the data from external source systems is landed. The table structures in this layer correspond to the source system table structures "as-is". It's generally a good idea to keep the data in the Bronze layer in Delta format for subsequent reads from the Bronze layer for ETL to be performant and to allow updates in Bronze to write CDC changes. Managed tables are typically used in this layer for their enhanced performance and low maintenance. In the Silver layer, the data from the Bronze layer is matched, merged, conformed, and cleaned. Only minimal or "just-enough" transformations and data cleansing rules are applied while loading the Silver layer. Both managed and external tables can be used in this layer, depending on your specific requirements. The Gold layer is for reporting and uses more de-normalized and read-optimized data models with fewer joins. The final layer of data transformations and data quality rules are applied here. Managed tables are typically used in this layer for their enhanced performance and low maintenance.
Autoloader is a tool that efficiently loads files from cloud storage into Delta tables. It's typically used in the Bronze layer when new files come to the landing zone.
The 'apply changes into' API supports updating tables with SCD type 1 and type 2. Use SCD type 1 to update records directly where history is not retained for records that are updated. Use SCD type 2 to retain a history of records, either on all updates or on updates to a specified set of columns. Note that history is also tracked on all delta tables, meaning you can query previous versions of a table. Please note that the REFRESH
command and the APPLY CHANGES INTO
command serve different purposes and are used in different contexts. The REFRESH
command is used for refreshing materialized views, while the APPLY CHANGES INTO
command is used for handling changes in data in the context of Delta Live Tables
To summarize, in general, the Bronze layer often deals with raw data in its native format, the Silver layer contains cleansed and conformed data, and the Gold layer holds curated business-level tables. The specific tables or file types used in each layer, as well as the loading style, can vary depending on the specific use case and data requirements.
Hope that helps.
Gab
References:
Create tables in Unity Catalog
Simplified change data capture with the APPLY CHANGES API in Delta Live Tables
What is the Medallion Lakehouse Architecture
02-05-2024 01:21 PM
Hi @sjmb ,
In the Medallion architecture, the usage of managed and non-managed tables, autoloader/cloud files, and the apply changes into function can vary depending on the layer (Bronze, Silver, Gold) and the specific use case.
As a general rule, use Managed Tables over External Tables because new product features, such as background auto compaction, and auto-optimize, will only work on Managed Tables. Managed tables are also recommended when you want to provision a governed location for your content without the overhead of creating and managing external locations and storage credentials. Managed tables always use the Delta table format. Use External Tables when you are already using them, when you are upgrading from Hive Metastore to Unity Catalog and you want a quick and seamless “one-click” upgrade without moving data. External tables are also recommended if you have DR requirements for this data, if you have requirements for external readers or writers to interact with this data outside of Databricks Clusters and DBSQL, or if you have requirements to enforce a particular naming or hierarchical storage convention. External tables support non-Delta tables, such as parquet/avro/orc etc.
The Bronze layer is where all the data from external source systems is landed. The table structures in this layer correspond to the source system table structures "as-is". Managed tables are typically used in this layer for their enhanced performance and low maintenance. In the Silver layer, the data from the Bronze layer is matched, merged, conformed, and cleaned. Both managed and external tables can be used in this layer, depending on your specific requirements. The Gold layer is for reporting and uses more de-normalized and read-optimized data models with fewer joins. Managed tables are typically used in this layer for their enhanced performance and low maintenance.
Autoloader is a tool that efficiently loads files from cloud storage into Delta tables. It's typically used in the Bronze layer when new files come to the landing zone.
The 'apply changes into' API supports updating tables with SCD type 1 and type 2. Use SCD type 2 to retain a history of records, either on all updates or on updates to a specified set of columns. Note that history is also tracked on all delta tables, meaning you can query previous versions of a table at any time. Also note that the REFRESH
command and the APPLY CHANGES INTO
command serve different purposes and are used in different contexts. The REFRESH
command is used for refreshing materialized views, while the APPLY CHANGES INTO
command is used for handling changes in data in the context of Delta Live Tables
To summarize, in general, the Bronze layer often deals with raw data in its native format, the Silver layer contains cleansed and conformed data, and the Gold layer holds curated business-level tables. The specific tables or file types used in each layer, as well as the loading style, can vary depending on the specific use case and data requirements.
Hope that helps.
Gab
References:
Create tables in Unity Catalog
Simplified change data capture with the APPLY CHANGES API in Delta Live Tables
What is the Medallion Lakehouse Architecture
Join a Regional User Group to connect with local Databricks users. Events will be happening in your city, and you won’t want to miss the chance to attend and share knowledge.
If there isn’t a group near you, start one and help create a community that brings people together.
Request a New Group