Hi dear community,
I am using Databricks Auto Loader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a medallion architecture. According to lakehouse principles, the bronze layer should store raw data with minimal transformation. In my scenario the incoming files have columns in camelCase, but the corresponding table in the silver layer uses snake_case column names. Should I keep the camelCase column names in the bronze layer, or is it advisable to rename them to match the snake_case convention used in the silver layer?
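For context, a minimal sketch of the bronze ingestion I have in mind. All paths and table names here are placeholders, not my actual setup, and the stream itself is shown in comments since it only runs inside Databricks:

```python
# Auto Loader (cloudFiles) options for ingesting JSON files from GCS into
# a bronze Delta table. The GCS paths below are hypothetical placeholders.
autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "gs://my-bucket/_schemas/bronze_events",
    "cloudFiles.inferColumnTypes": "true",
}

# In a Databricks notebook this dict would be used roughly like:
# (spark.readStream
#      .format("cloudFiles")
#      .options(**autoloader_options)
#      .load("gs://my-bucket/landing/events/")
#      .writeStream
#      .option("checkpointLocation", "gs://my-bucket/_checkpoints/bronze_events")
#      .trigger(availableNow=True)
#      .toTable("bronze.events"))
```

At this point the question is whether any column renaming should happen before the `toTable` write, or only later in the bronze-to-silver step.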
Here are some considerations:
Consistency: Maintaining consistency in naming conventions across layers can make it easier for teams to work with the data. If your silver layer uses snake_case, you might choose to rename the columns in the bronze layer for consistency.
Downstream Processing: If downstream processes or tools expect a specific naming convention, it may be more convenient to align the bronze layer with those expectations.
Documentation: If you decide to keep the camelCase names in the bronze layer, make sure to document this decision clearly. This documentation should be easily accessible to anyone who works with or analyzes the data.
Transformation at Silver Layer: If your data transformation processes are well-defined and centralized, you might prefer to handle the column renaming during the transformation from bronze to silver. This way, the bronze layer remains a true representation of the raw data, and transformations are applied as needed in subsequent layers.
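If the renaming is handled during the bronze-to-silver transformation, the camelCase-to-snake_case mapping can be derived mechanically instead of hard-coded per column. A hedged sketch (pure Python; the PySpark application is shown in comments, and `orderId`/`customerName` are made-up column names):

```python
import re

def camel_to_snake(name: str) -> str:
    """Convert a camelCase column name to snake_case."""
    # Insert an underscore before each non-leading uppercase letter,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# In the bronze -> silver transformation this could be applied to every column:
# for col in df.columns:
#     df = df.withColumnRenamed(col, camel_to_snake(col))

print(camel_to_snake("orderId"))       # order_id
print(camel_to_snake("customerName"))  # customer_name
```

Note that this simple rule splits runs of capitals (e.g. `HTTPStatus` becomes `h_t_t_p_status`), so names with acronyms may need special handling.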
So I am asking for opinions/suggestions/best practices from this community, as I am new to this. Looking forward to your support.
Regards,
param_sen