Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Keep camelCase column names in the bronze layer, or rename them to snake_case?

param_sen
New Contributor II


Hi dear community,

I am using Databricks Auto Loader to ingest files from Google Cloud Storage (GCS) into Delta tables in the bronze layer of a medallion architecture. According to lakehouse principles, the bronze layer should store raw data with minimal transformation. In my case, the incoming files have camelCase column names, but the corresponding table in the silver layer uses snake_case. Should I keep the camelCase column names in the bronze layer, or is it advisable to rename them there to match the snake_case convention of the silver layer?

 

Here are some considerations:

  1. Consistency: Maintaining consistency in naming conventions across layers can make it easier for teams to work with the data. If your silver layer uses snake_case, you might choose to rename the columns in the bronze layer for consistency.

  2. Downstream Processing: If downstream processes or tools expect a specific naming convention, it may be more convenient to align the bronze layer with those expectations.

  3. Documentation: If you decide to keep the camelCase names in the bronze layer, make sure to document this decision clearly. This documentation should be easily accessible to anyone who works with or analyzes the data.

  4. Transformation at Silver Layer: If your data transformation processes are well-defined and centralized, you might prefer to handle the column renaming during the transformation from bronze to silver. This way, the bronze layer remains a true representation of the raw data, and transformations are applied as needed in subsequent layers.
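To make option 4 concrete, here is a minimal sketch of renaming during the bronze-to-silver step, so that bronze keeps the source's camelCase names untouched. The helper names (`camel_to_snake`, `snake_case_mapping`, `rename_for_silver`) are hypothetical and not part of any Databricks API; the only DataFrame members assumed are `columns` and `toDF`, which PySpark DataFrames provide.

```python
import re

# Two passes handle both "customerId" and acronym runs such as "orderTotalUSD".
_WORD_START = re.compile(r"(.)([A-Z][a-z]+)")
_LOWER_TO_UPPER = re.compile(r"([a-z0-9])([A-Z])")


def camel_to_snake(name: str) -> str:
    """Convert a camelCase (or PascalCase) column name to snake_case."""
    partial = _WORD_START.sub(r"\1_\2", name)
    return _LOWER_TO_UPPER.sub(r"\1_\2", partial).lower()


def snake_case_mapping(columns):
    """Return {bronze_name: silver_name}, failing loudly on name collisions."""
    mapping = {}
    taken = {}
    for col in columns:
        target = camel_to_snake(col)
        if target in taken:
            # e.g. "userId" and "user_id" would both become "user_id"
            raise ValueError(
                f"Columns {taken[target]!r} and {col!r} both map to {target!r}"
            )
        taken[target] = col
        mapping[col] = target
    return mapping


def rename_for_silver(bronze_df):
    """Apply the mapping to a DataFrame read from the bronze table.

    Assumes bronze_df behaves like a PySpark DataFrame (has .columns and .toDF).
    """
    mapping = snake_case_mapping(bronze_df.columns)
    return bronze_df.toDF(*[mapping[c] for c in bronze_df.columns])
```

In a bronze-to-silver job this would be called on the DataFrame read from the bronze table, just before writing to silver. Keeping the mapping in one function also gives a single place to document how source names correspond to silver names, which covers the documentation concern in point 3.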

So I am asking for opinions, suggestions, and best practices from this community, as I am new to this. Looking forward to your support.

 

Regards,

param_sen

1 REPLY

Dribka
New Contributor III

Hey @param_sen ,

Navigating the nuances of naming conventions, especially when dealing with different layers in a lakehouse architecture, can be a bit of a puzzle. Your considerations are on point. If consistency across layers is a priority and downstream processes or tools are accustomed to snake_case, renaming the columns in the bronze layer might streamline things. Documenting this decision is key, ensuring anyone interacting with the data is on the same page. On the flip side, if the camelCase in the bronze layer aligns with raw data principles and you have well-defined transformation processes in the silver layer, handling the renaming there could maintain the integrity of the raw data. It boils down to balancing consistency, downstream expectations, and the principles of each layer. Best practices can vary, so it might be worth exploring what feels most natural for your specific use case. Cheers!
