Hi dear community,
I am using Databricks Auto Loader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a medallion architecture. According to lakehouse principles, the bronze layer should store raw data with minimal transformation. In my scenario the incoming files have columns in camelCase, but the corresponding table in the silver layer uses snake_case column names. Should I keep the camelCase column names in the bronze layer, or is it advisable to rename them to match the snake_case convention used in the silver layer?
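For context, a minimal sketch of the bronze ingestion I have in mind. All paths and table names here are placeholders, not my actual setup, and the stream itself is shown in comments since it only runs inside Databricks:

```python
# Auto Loader (cloudFiles) options for ingesting JSON files from GCS into
# a bronze Delta table. The GCS paths below are hypothetical placeholders.
autoloader_options = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": "gs://my-bucket/_schemas/bronze_events",
    "cloudFiles.inferColumnTypes": "true",
}

# In a Databricks notebook this dict would be used roughly like:
# (spark.readStream
#      .format("cloudFiles")
#      .options(**autoloader_options)
#      .load("gs://my-bucket/landing/events/")
#      .writeStream
#      .option("checkpointLocation", "gs://my-bucket/_checkpoints/bronze_events")
#      .trigger(availableNow=True)
#      .toTable("bronze.events"))
```

At this point the question is whether any column renaming should happen before the `toTable` write, or only later in the bronze-to-silver step.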
Here are some considerations:
Consistency: Maintaining consistency in naming conventions across layers can make it easier for teams to work with the data. If your silver layer uses snake_case, you might choose to rename the columns in the bronze layer for consistency.
Downstream Processing: If downstream processes or tools expect a specific naming convention, it may be more convenient to align the bronze layer with those expectations.
Documentation: If you decide to keep the camelCase names in the bronze layer, make sure to document this decision clearly. This documentation should be easily accessible to anyone who works with or analyzes the data.
Transformation at Silver Layer: If your data transformation processes are well-defined and centralized, you might prefer to handle the column renaming during the transformation from bronze to silver. This way, the bronze layer remains a true representation of the raw data, and transformations are applied as needed in subsequent layers.
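If the renaming is handled during the bronze-to-silver transformation, the camelCase-to-snake_case mapping can be derived mechanically instead of hard-coded per column. A hedged sketch (pure Python; the PySpark application is shown in comments, and `orderId`/`customerName` are made-up column names):

```python
import re

def camel_to_snake(name: str) -> str:
    """Convert a camelCase column name to snake_case."""
    # Insert an underscore before each non-leading uppercase letter,
    # then lowercase the whole string.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

# In the bronze -> silver transformation this could be applied to every column:
# for col in df.columns:
#     df = df.withColumnRenamed(col, camel_to_snake(col))

print(camel_to_snake("orderId"))       # order_id
print(camel_to_snake("customerName"))  # customer_name
```

Note that this simple rule splits runs of capitals (e.g. `HTTPStatus` becomes `h_t_t_p_status`), so names with acronyms may need special handling.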
So I am asking for opinions/suggestions/best practices from this community, as I am new to this. Looking forward to your support.
Regards,
param_sen