08-25-2023 09:00 PM
Hi dear community,
When I worked in the Hadoop ecosystem with HDFS, the landing zone was our raw layer, and we used the AVRO format to serialize this raw data (for its schema evolution feature), only assigning names to columns but not enforcing any data types; every column was stored as a string.
In the medallion architecture, bronze represents the raw layer. However, I need clarity on best practices here. My question is: should we use Delta tables and treat every column as a string, should we enforce specific data types at this layer, or should we let the Delta table handle schema discovery for us?
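To make the alternatives concrete, here is a minimal PySpark sketch of what I mean, assuming a Databricks notebook where `spark` is already defined; the landing path and table names are hypothetical, it is only an illustration:

```python
# Hypothetical landing path for the raw files
landing_path = "/mnt/landing/orders/"

# Alternative 1: treat every column as a string in bronze (the AVRO-style raw layer)
raw_as_strings = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "false")   # no type enforcement, everything arrives as string
    .load(landing_path)
)
raw_as_strings.write.format("delta").mode("append").saveAsTable("bronze.orders_strings")

# Alternative 2: let Spark discover the data types at ingestion time
raw_typed = (
    spark.read.format("csv")
    .option("header", "true")
    .option("inferSchema", "true")    # types are inferred from the data
    .load(landing_path)
)
raw_typed.write.format("delta").mode("append").saveAsTable("bronze.orders_typed")
```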
Thank you very much in advance for clarifying the best practices and the alternatives, with their pros and cons.
Best Regards
09-21-2023 12:54 PM
The focus in the bronze layer is quick CDC and the ability to provide a historical archive of the source (cold storage), data lineage, and reprocessing when needed without re-reading the data from the source system.
It is recommended that the data types and structure of the table remain as they are in the source system table, so that re-reading data can happen at the bronze layer instead of hitting the source table again.
The bronze layer being Delta also gives more control over versioning, which makes reprocessing easier.
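As a rough illustration of that approach (paths and table names below are just placeholders, assuming a Databricks notebook where `spark` is already defined, not a definitive pattern):

```python
# The bronze Delta table keeps the source schema and data types as-is,
# plus a couple of audit columns for lineage.
from pyspark.sql.functions import current_timestamp, input_file_name

source_df = (
    spark.read.format("avro")          # schema and types come from the source files
    .load("/mnt/landing/customers/")   # hypothetical landing path
)

(
    source_df
    .withColumn("_ingested_at", current_timestamp())  # when the row landed in bronze
    .withColumn("_source_file", input_file_name())    # which file it came from
    .write.format("delta")
    .mode("append")
    .saveAsTable("bronze.customers")
)

# Because bronze is a Delta table, older versions stay available for reprocessing
# without going back to the source system, e.g. via time travel:
previous_version = spark.sql("SELECT * FROM bronze.customers VERSION AS OF 3")
```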
09-22-2023 03:14 PM
Thank you very much for your answer @swethaNandan.
Regards!
11-24-2023 07:29 AM - edited 11-24-2023 07:30 AM
Hi dear community,
I am utilizing the Databricks autoloader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a Medallion architecture. According to lakehouse principles, the bronze layer should store raw data with minimal transformation. I have a scenario where the incoming file has columns in camelCase, but the corresponding table in the silver layer uses snake_case for column names. Should I maintain the camelCase column names in the bronze layer, or is it advisable to rename them to align with the snake_case convention in the silver layer?
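For reference, this is roughly what my pipeline looks like today, simplified; the bucket names, checkpoint paths, and the camelCase columns (orderId, customerName) are placeholders, and the rename in the last step is the option I am unsure about:

```python
# Bronze: Auto Loader from GCS, keeping the source column names untouched
bronze_stream = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "gs://my-bucket/_schemas/orders")
    .load("gs://my-bucket/landing/orders/")
)

(
    bronze_stream.writeStream
    .option("checkpointLocation", "gs://my-bucket/_checkpoints/bronze_orders")
    .trigger(availableNow=True)
    .toTable("bronze.orders")
)

# Silver: one possible option is to do the camelCase -> snake_case rename here
silver_df = (
    spark.read.table("bronze.orders")
    .withColumnRenamed("orderId", "order_id")
    .withColumnRenamed("customerName", "customer_name")
)
silver_df.write.format("delta").mode("overwrite").saveAsTable("silver.orders")
```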
Regards,
param_sen