Community Platform Discussions
Connect with fellow community members to discuss general topics related to the Databricks platform, industry trends, and best practices. Share experiences, ask questions, and foster collaboration within the community.

Is it no longer necessary to preserve data in its original format when using the medallion architecture?

eimis_pacheco
Contributor

Hi Community 

I have a question; the bronze layer always confuses me. Someone mentioned, "File Format: Store data in Delta Lake format to leverage its performance, ACID transactions, and schema evolution capabilities" for bronze layers.

Does this mean that it is not necessary to preserve the data in its original format? For instance, what if the data arrives as JSON from the source system, or if we are exporting it from the source database as CSV files compressed in zip archives?

This part confused me: shouldn't we store the data in its original format, as per the medallion architecture? And should we rely on the bronze layer only for data history, lineage, auditing, and reprocessing?

Thank you very much in advance for clarifying this for me.

Best Regards

#medallionarchitecture

2 ACCEPTED SOLUTIONS


Kaniz_Fatma
Community Manager

Hi @eimis_pacheco , 

You can store the data in its original format in the Bronze layer. The recommendation to use Delta Lake format for the Bronze layer is mainly about performance and reliability. The purpose of the Bronze layer in the medallion architecture is to store data in its raw, unprocessed form, which means the data is not altered from its original format and schema. The Bronze layer is the storage layer closest to the data sources.

Regarding the file format, storing data in Delta Lake format provides several benefits, such as performance, ACID transactions, and schema evolution. Delta Lake lets you maintain the schema and data types of the original data while adding features such as versioning and schema enforcement.

Therefore, storing data in Delta Lake format in the Bronze layer is a good practice: it provides better performance and data consistency. The original data files can also be kept in a separate location and loaded into the Delta Lake tables from there.

So, in summary, it is recommended to store data in Delta Lake format in the Bronze layer, while also keeping the original data files (with their format and schema) and tracking the lineage and audit history of the data processing.
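The pattern described above, keeping the original files while loading them unaltered into a bronze table, can be sketched in plain Python. This is a simplified stand-in for a real Spark/Delta ingestion job; the landing-zone layout and the `_source_file` / `_ingested_at` metadata column names are illustrative assumptions, not a Databricks convention.

```python
import datetime
import json
import pathlib


def ingest_to_bronze(raw_bytes: bytes, source_file: str,
                     landing_dir: pathlib.Path, bronze_rows: list) -> list:
    """Toy bronze ingestion: preserve the raw file AND load it unaltered."""
    # 1. Keep the file exactly as received in a raw landing zone,
    #    so it can always be audited or reprocessed later.
    landing_dir.mkdir(parents=True, exist_ok=True)
    raw_path = landing_dir / source_file
    raw_path.write_bytes(raw_bytes)

    # 2. Append the parsed records to the bronze "table" without changing
    #    their fields, adding only ingestion metadata for lineage/audit.
    ingested_at = datetime.datetime.now(datetime.timezone.utc).isoformat()
    for record in json.loads(raw_bytes):
        bronze_rows.append({
            **record,                       # original fields, untouched
            "_source_file": str(raw_path),  # lineage: where it came from
            "_ingested_at": ingested_at,    # audit: when it arrived
        })
    return bronze_rows
```

In a real Databricks pipeline, the same idea is usually expressed with Auto Loader reading the landing zone and appending to a bronze Delta table.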


Hi @Kaniz_Fatma 

Just one last question: what would happen if someone changed the name of one of the columns in the source system? For example, if someone renames the column "ID" to "cust_id" in the customer table, how will Delta Lake know that the values in the "cust_id" column reference the same values as the "ID" column, considering your statement "while adding additional features such as versioning, schema enforcement, etc."?

Thank you once more for your valuable insight.

Regards

#medallionarchitecture 



If someone renames a column in the source system, for example changing "ID" to "cust_id" in the customer table, Delta Lake can handle this change through its column mapping feature, available from Databricks Runtime 10.2 onwards. To make Delta Lake aware of the change, upgrade the Delta table's protocol version and set the column mapping mode to "name". This ensures that Delta Lake understands that the values in the "cust_id" column reference the same values as in the "ID" column.

However, there are some limitations when column mapping is enabled on a Delta table. Tables with column mapping enabled do not support streaming reads on the change data feed, and the changes introduced by column mapping itself are not captured in the change data feed. Also, you cannot read the change data feed for a transaction or range in which a non-additive schema change occurs. If you encounter schema-change errors such as DELTA_SCHEMA_CHANGED, DELTA_SCHEMA_CHANGED_WITH_STARTING_OPTIONS, or DELTA_SCHEMA_CHANGED_WITH_VERSION, you may need to restart the query, or start it from scratch using a new checkpoint directory.
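To make the mechanism concrete: with name-based column mapping, each column is tracked by a stable internal identifier, so a rename only updates table metadata, never the stored data. In Spark SQL the steps look roughly like `ALTER TABLE customer SET TBLPROPERTIES ('delta.minReaderVersion' = '2', 'delta.minWriterVersion' = '5', 'delta.columnMapping.mode' = 'name')` followed by `ALTER TABLE customer RENAME COLUMN ID TO cust_id`. The toy Python model below illustrates the idea only; it is a sketch of the concept, not Delta Lake's actual implementation.

```python
class ColumnMappedTable:
    """Toy model of name-based column mapping: logical column names map
    to stable physical ids, so a rename is a metadata-only operation."""

    def __init__(self, columns):
        # Assign each column a permanent physical id at creation time.
        self._mapping = {name: pid for pid, name in enumerate(columns)}
        self._data = {pid: [] for pid in self._mapping.values()}

    def insert(self, row):
        # Rows are addressed by logical name but stored under physical id.
        for name, value in row.items():
            self._data[self._mapping[name]].append(value)

    def rename_column(self, old, new):
        # Only the logical-name -> physical-id mapping changes;
        # the stored values are untouched.
        self._mapping[new] = self._mapping.pop(old)

    def column(self, name):
        return self._data[self._mapping[name]]
```

After `rename_column("ID", "cust_id")`, reading `column("cust_id")` returns exactly the values previously written under "ID", which is why the rename does not break the link between old and new data.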

Thank you very much for your answers and insights, @Kaniz_Fatma.

Regards!
