Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

What are the best practices in bronze layer regarding the column data types?

eimis_pacheco
Contributor

Hi dear community,

When I worked in the Hadoop ecosystem with HDFS, the landing zone was our raw layer. We serialized this raw data in Avro format (for its schema evolution feature), assigning names to columns but not enforcing any data types: each column was stored as a string.

In the medallion architecture, bronze represents the raw layer. However, I need clarity regarding best practices here. My question is: should we use Delta tables and treat each column as a string type, or should we enforce specific data types at this layer? Alternatively, should we let the Delta table handle data discovery for us?

Thank you very much for the clarification on the best practices and alternatives, pros and cons.

Best Regards

1 ACCEPTED SOLUTION


swethaNandan
Databricks Employee

The focus of the bronze layer is quick CDC and the ability to provide a historical archive of the source (cold storage), data lineage, and reprocessing when needed without rereading the data from the source system.

It is recommended that the data types and the table structure remain the same as in the source system table, so that data can be reread from the bronze layer instead of hitting the source table again.

Storing the bronze layer as Delta gives more control over versions, which makes reprocessing easier.
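
As a minimal sketch of the points above, assuming an existing SparkSession passed in as `spark` and a bronze Delta table stored at a caller-supplied path: the first helper appends source rows without casting or renaming, and the second uses Delta time travel to reread an earlier version of bronze instead of the source system. The function names, the path, and the version number are hypothetical and only for illustration.

```python
def append_to_bronze(source_df, path):
    """Append source rows to a bronze Delta table as-is.

    No casts or renames are applied, so the bronze table keeps the
    data types and structure of the source DataFrame.
    """
    (
        source_df.write.format("delta")
        .mode("append")  # keep history; earlier rows are not overwritten
        .save(path)
    )


def read_bronze_version(spark, path, version):
    """Read one historical version of a bronze Delta table.

    `spark` is an existing SparkSession; `path` and `version` are
    hypothetical values chosen by the caller.
    """
    return (
        spark.read.format("delta")
        .option("versionAsOf", version)  # Delta time travel by version number
        .load(path)
    )
```

A reprocessing job could then call, for example, `read_bronze_version(spark, "/mnt/bronze/orders", 3)` and rebuild silver from that snapshot without touching the source table.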

 

 


3 REPLIES

Thank you very much for your answer @swethaNandan.

Regards!

param_sen
New Contributor II

Hi dear community,

I am using Databricks Auto Loader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a medallion architecture. According to lakehouse principles, the bronze layer should store raw data with minimal transformation. In my scenario, the incoming file has columns in camelCase, but the corresponding table in the silver layer uses snake_case for column names. Should I keep the camelCase column names in the bronze layer, or is it advisable to rename them to match the snake_case convention of the silver layer?

 

Regards,

param_sen
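
For reference, the rename described in this post could be sketched as below, wherever it ends up being applied. Keeping the camelCase names in bronze and renaming in the bronze-to-silver step would be consistent with the accepted answer's advice to keep the source structure, but that is an inference rather than something stated in this thread. `to_snake_case` and `rename_to_snake_case` are hypothetical helper names, not part of any Databricks API; the DataFrame is assumed to be a regular PySpark DataFrame.

```python
import re

# An uppercase letter followed by lowercase letters starts a new word.
_WORD_START = re.compile(r"(.)([A-Z][a-z]+)")
# A lowercase letter or digit followed by an uppercase letter is a boundary.
_LOWER_TO_UPPER = re.compile(r"([a-z0-9])([A-Z])")


def to_snake_case(name):
    """Convert a camelCase column name to snake_case."""
    with_breaks = _WORD_START.sub(r"\1_\2", name)
    return _LOWER_TO_UPPER.sub(r"\1_\2", with_breaks).lower()


def rename_to_snake_case(df):
    """Return a DataFrame with every column renamed to snake_case.

    Uses `DataFrame.toDF(*names)`, which replaces all column names
    positionally and leaves the data and types unchanged.
    """
    return df.toDF(*[to_snake_case(c) for c in df.columns])
```

With this sketch, a column such as `customerId` becomes `customer_id`, and names that are already in snake_case are left unchanged.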
