Data Engineering

What are the best practices in bronze layer regarding the column data types?

eimis_pacheco
Contributor

Hi dear community,

When I worked in the Hadoop ecosystem with HDFS, the landing zone was our raw layer. We serialized this raw data in Avro format (for its schema evolution feature), assigning names to columns but not enforcing any data types: every column was stored as a string.

In the medallion architecture, bronze represents the raw layer. However, I need clarity regarding best practices here. My question is: should we use Delta tables and treat each column as a string type, or should we enforce a specific data type at this layer? Alternatively, should we allow the Delta table to handle data discovery for us?

Thank you very much for the clarification on the best practices and alternatives, pros and cons.

Best Regards

1 ACCEPTED SOLUTION

Accepted Solutions

swethaNandan
New Contributor III

The focus of the bronze layer is quick CDC and the ability to provide a historical archive of the source (cold storage), data lineage, and reprocessing when needed without re-reading the data from the source system.

It is recommended that the data types and structure of the bronze table match those of the source system table, so that data can be re-read from the bronze layer instead of hitting the source table again.

Storing the bronze layer as Delta gives more control over versions, which makes reprocessing easier.
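
As a rough illustration of the advice above, the sketch below builds a bronze table definition that carries over each source column's type instead of declaring everything as STRING, plus a Delta time-travel query for reprocessing from an earlier version. The source type names, the type mapping, and the table name bronze.orders are all assumptions made up for this example, not something from the original answer; a real mapping depends on the actual source system.

```python
# Illustrative mapping from source-system type names to Spark SQL types.
# Assumption: these source type names are hypothetical; adapt to the real source.
SOURCE_TO_SPARK = {
    "varchar": "STRING",
    "integer": "INT",
    "bigint": "BIGINT",
    "double": "DOUBLE",
    "timestamp": "TIMESTAMP",
    "date": "DATE",
    "boolean": "BOOLEAN",
}


def bronze_ddl(table, source_columns):
    """Build a CREATE TABLE statement that keeps each source column's type.

    source_columns is a list of (column_name, source_type) pairs.
    """
    cols = []
    for name, source_type in source_columns:
        # Fall back to STRING for types the illustrative mapping does not cover.
        spark_type = SOURCE_TO_SPARK.get(source_type.lower(), "STRING")
        cols.append(f"  {name} {spark_type}")
    body = ",\n".join(cols)
    return f"CREATE TABLE IF NOT EXISTS {table} (\n{body}\n) USING DELTA"


def version_query(table, version):
    """Build a Delta time-travel query for re-reading an earlier table version."""
    return f"SELECT * FROM {table} VERSION AS OF {int(version)}"


if __name__ == "__main__":
    # Hypothetical source table with two typed columns.
    print(bronze_ddl("bronze.orders", [("orderId", "bigint"), ("placedAt", "timestamp")]))
    print(version_query("bronze.orders", 3))
```

The generated statements could then be run with spark.sql(...) in a notebook. Falling back to STRING for unknown types is only one possible choice; failing loudly on an unmapped type may be preferable when the goal is to mirror the source exactly.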


3 REPLIES


Thank you very much for your answer @swethaNandan.

Regards!

param_sen
New Contributor II

Hi dear community,

I am using Databricks Auto Loader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a medallion architecture. According to lakehouse principles, the bronze layer should store raw data with minimal transformation. I have a scenario where the incoming file has columns in camelCase, but the corresponding table in the silver layer uses snake_case for column names. Should I keep the camelCase column names in the bronze layer, or is it advisable to rename them to align with the snake_case convention in the silver layer?


Regards,

param_sen
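
Whichever layer the rename ends up in, the camelCase to snake_case conversion itself can be sketched as below. This is a minimal example and not an answer from the thread: the function names camel_to_snake and build_rename_map and the sample column names are hypothetical, and the regular expressions only handle plain ASCII identifiers.

```python
import re

# Insert "_" before a capitalized word that follows any character (e.g. "rT" in "orderTotal").
_WORD_BOUNDARY = re.compile(r"(.)([A-Z][a-z]+)")
# Insert "_" between a lowercase letter or digit and a following capital.
_CASE_BOUNDARY = re.compile(r"([a-z0-9])([A-Z])")


def camel_to_snake(name):
    """Convert a camelCase column name to snake_case."""
    partial = _WORD_BOUNDARY.sub(r"\1_\2", name)
    return _CASE_BOUNDARY.sub(r"\1_\2", partial).lower()


def build_rename_map(columns):
    """Map each original column name to its snake_case name.

    Raises ValueError if two source columns would collapse to the same name,
    since silently merging columns would lose data.
    """
    mapping = {}
    seen = {}
    for col in columns:
        new_name = camel_to_snake(col)
        if new_name in seen:
            raise ValueError(f"{col!r} and {seen[new_name]!r} both map to {new_name!r}")
        seen[new_name] = col
        mapping[col] = new_name
    return mapping


if __name__ == "__main__":
    # Hypothetical column names from an incoming file.
    print(build_rename_map(["customerId", "orderTotalAmount", "id"]))
```

With a PySpark DataFrame, the resulting mapping could be applied one column at a time with df.withColumnRenamed(old, new). Keeping the mapping as explicit data also leaves a record of how the renamed columns relate back to the raw file.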
