09-19-2023 04:05 PM
Hi Community
I have a question. The bronze layer always confuses me. Someone mentioned, "File Format: Store data in Delta Lake format to leverage its performance, ACID transactions, and schema evolution capabilities" for the bronze layer.
Does this mean that it is not necessary to preserve the data in its original format? For instance, if it arrives in JSON format from the source system, or if we export it from the source database as CSV files compressed in zip archives?
This part confused me. Should we not store the data in its original format, as per the medallion architecture? And should we rely only on the bronze layer for data history, lineage, audit, and reprocessing?
Thank you very much in advance for clarifying this for me.
Best Regards
#medallionarchitecture
09-20-2023 01:36 AM
Hi @eimis_pacheco ,
You can store the data in its original format in the Bronze layer. The recommendation to use Delta Lake format for the Bronze layer is mainly for better performance and data consistency. The purpose of the Bronze layer in the medallion architecture is to store data in its raw and unprocessed form, which means that the data is not altered from its original format and schema. The Bronze layer is the storage layer closest to the data sources.
Regarding the file format, storing data in Delta Lake format provides several benefits, such as performance, ACID transactions, schema evolution capabilities, and other features. Delta Lake format allows you to maintain the schema and data types of the original data while adding additional features such as versioning, schema enforcement, etc.
Therefore, storing data in Delta Lake format in the Bronze layer is a good practice, providing better performance and data consistency. The original data files can also be kept in a different location and loaded into the Delta Lake tables.
So, in summary, it is recommended to store data in Delta Lake format in the Bronze layer. However, keeping the original data files, their format and schema, and tracking the lineage and audit history of the data processing is essential.
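To make the "keep the originals elsewhere, then load them" idea concrete, here is a minimal sketch in plain Python rather than Spark, so it runs anywhere. The function name `ingest_to_bronze`, the paths, and the `_source_file` / `_ingested_at` metadata columns are all assumptions made for illustration, not a Databricks API. On Databricks the load step would normally be a Spark job writing to a Delta table; the sketch only shows the pattern of archiving the raw file untouched and attaching lineage information to each row.

```python
import json
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def ingest_to_bronze(source_file: Path, archive_dir: Path) -> list:
    """Archive the raw file untouched, then return its rows plus lineage columns."""
    archive_dir.mkdir(parents=True, exist_ok=True)
    archived = archive_dir / source_file.name
    shutil.copy2(source_file, archived)  # byte-for-byte copy of the original

    ingested_at = datetime.now(timezone.utc).isoformat()
    rows = []
    with archived.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # skip blank lines in the JSON Lines file
            record = json.loads(line)
            # Hypothetical lineage columns: where the row came from and when.
            record["_source_file"] = str(archived)
            record["_ingested_at"] = ingested_at
            rows.append(record)
    return rows


with tempfile.TemporaryDirectory() as tmp:
    # Simulate a JSON Lines export landing from the source system.
    landing = Path(tmp) / "landing" / "customers.json"
    landing.parent.mkdir(parents=True)
    landing.write_text(
        '{"ID": 1, "name": "Ada"}\n{"ID": 2, "name": "Lin"}\n', encoding="utf-8"
    )

    bronze_rows = ingest_to_bronze(landing, Path(tmp) / "raw_archive")
    archived_copy = Path(tmp) / "raw_archive" / "customers.json"
    archive_matches_source = archived_copy.read_bytes() == landing.read_bytes()
```

Because the archived copy is never modified, the bronze table can be rebuilt from it if the load logic changes, and the metadata columns let each row be traced back to its source file.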
09-20-2023 05:21 PM
Hi @Kaniz_Fatma
Just one last question: what would happen if someone changed the name of a column in the source system? For example, if someone renames the column "ID" to "cust_id" in the customer table, how will Delta Lake know that the values in the "cust_id" column refer to the same values as the "ID" column, considering this statement: "while adding additional features such as versioning, schema enforcement, etc."?
Thank you once again for your valuable insight.
Regards
#medallionarchitecture
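A note on the rename scenario, as far as I understand Delta Lake's behavior: columns are matched by name, so the table cannot infer on its own that "cust_id" is the former "ID". With schema enforcement, an append containing the unknown column is rejected as a schema mismatch. If schema evolution is enabled (the `mergeSchema` write option), "cust_id" is added as a new column instead, leaving nulls in "cust_id" for old rows and nulls in "ID" for new ones. The mapping therefore has to live in the pipeline. Below is a minimal plain-Python sketch of that idea; `COLUMN_ALIASES` and `normalize_columns` are hypothetical names, not part of any library.

```python
# Hypothetical alias map maintained by the pipeline owner:
# name seen in the source -> canonical column name in the bronze table.
COLUMN_ALIASES = {"cust_id": "ID"}


def normalize_columns(record: dict, aliases: dict) -> dict:
    """Rename known aliases to the canonical bronze column name."""
    normalized = {}
    for name, value in record.items():
        canonical = aliases.get(name, name)
        if canonical in normalized:
            # Both the old and the new name are present in one record.
            raise ValueError(f"columns collide on {canonical!r}")
        normalized[canonical] = value
    return normalized


old_batch = [{"ID": 1, "name": "Ada"}]       # before the source rename
new_batch = [{"cust_id": 2, "name": "Lin"}]  # after the source rename

bronze = [normalize_columns(r, COLUMN_ALIASES) for r in old_batch + new_batch]
```

Keeping the original files alongside the table matters here too: if the alias map turns out to be wrong, the affected batches can be reloaded from the untouched source files.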
09-21-2023 09:29 AM
09-22-2023 03:17 PM
Thank you very much for your answers and insights @Kaniz_Fatma
Regards!