08-25-2023 09:00 PM
Hi dear community,
When I worked in the Hadoop ecosystem with HDFS, the landing zone was our raw layer, and we used the AVRO format to serialize this raw data (for its schema evolution feature), only assigning names to columns but not enforcing any data types; every column was stored as a string.
In the medallion architecture, bronze represents the raw layer. However, I need clarity on best practices here. My question is: should we use Delta tables and treat every column as a string, should we enforce specific data types at this layer, or should we let the Delta table handle schema discovery for us?
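To make the alternatives concrete, here is a minimal PySpark sketch of what I mean, assuming a Databricks notebook where `spark` is already defined; the landing path and table names are hypothetical, it is only an illustration:

```python
# Hypothetical landing path for the raw files
landing_path = "/mnt/landing/orders/"

# Alternative 1: treat every column as a string in bronze (the AVRO-style raw layer)
raw_as_strings = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "false")   # no type enforcement, everything arrives as string
    .load(landing_path)
)
raw_as_strings.write.format("delta").mode("append").saveAsTable("bronze.orders_strings")

# Alternative 2: let Spark discover the data types at ingestion time
raw_typed = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")    # types are inferred from the data
    .load(landing_path)
)
raw_typed.write.format("delta").mode("append").saveAsTable("bronze.orders_typed")
```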
Thank you very much in advance for clarifying the best practices and the alternatives, with their pros and cons.
Best Regards
09-21-2023 12:54 PM
The focus in the bronze layer is quick CDC and the ability to provide a historical archive of the source (cold storage), data lineage, and reprocessing when needed without re-reading the data from the source system.
It is recommended that the data types and structure of the table remain as they are in the source system table, so that re-reading data can happen at the bronze layer instead of hitting the source table again.
The bronze layer being Delta also gives more control over versioning, which makes reprocessing easier.
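As a rough illustration of that approach (paths and table names below are just placeholders, assuming a Databricks notebook where `spark` is already defined, not a definitive pattern):

```python
# The bronze Delta table keeps the source schema and data types as-is,
# plus a couple of audit columns for lineage.
from pyspark.sql.functions import current_timestamp, input_file_name

source_df = (
    spark.read.format("avro")          # schema and types come from the source files
    .load("/mnt/landing/customers/")   # hypothetical landing path
)

(
    source_df
    .withColumn("_ingested_at", current_timestamp())  # when the row landed in bronze
    .withColumn("_source_file", input_file_name())    # which file it came from
    .write.format("delta")
    .mode("append")
    .saveAsTable("bronze.customers")
)

# Because bronze is a Delta table, older versions stay available for reprocessing
# without going back to the source system, e.g. via time travel:
previous_version = spark.sql("SELECT * FROM bronze.customers VERSION AS OF 3")
```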
09-22-2023 03:14 PM
Thank you very much for your answer @swethaNandan.
Regards!
11-24-2023 07:29 AM - edited 11-24-2023 07:30 AM
Hi dear community,
I am utilizing the Databricks autoloader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a Medallion architecture. According to lakehouse principles, the bronze layer should store raw data with minimal transformation. I have a scenario where the incoming file has columns in camelCase, but the corresponding table in the silver layer uses snake_case for column names. Should I maintain the camelCase column names in the bronze layer, or is it advisable to rename them to align with the snake_case convention in the silver layer?
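For reference, this is roughly what my pipeline looks like today, simplified; the bucket names, checkpoint paths, and the camelCase columns (orderId, customerName) are placeholders, and the rename in the last step is the option I am unsure about:

```python
# Bronze: Auto Loader from GCS, keeping the source column names untouched
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "gs://my-bucket/_schemas/orders")
    .load("gs://my-bucket/landing/orders/")
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "gs://my-bucket/_checkpoints/bronze_orders")
    .trigger(availableNow=True)
    .toTable("bronze.orders")
)

# Silver: one possible option is to do the camelCase -> snake_case rename here
silver_df = (
    spark.read.table("bronze.orders")
    .withColumnRenamed("orderId", "order_id")
    .withColumnRenamed("customerName", "customer_name")
)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
```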
Regards,
param_sen