Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Thoughts on AutoLoader schema inference into raw table (+data flattening)

ChristianRRL
Valued Contributor III

I am curious to get the community's thoughts on this. Is it generally preferable to load raw data based on its inferred columns or not? And is it preferred to keep the raw data in its original structure or to flatten it into a more tabular structure? If it's better to keep the original structure, what is generally validated at the raw data table versus the subsequent "base" (flattened) table?

For example, assuming we already have a data extraction process that lands .json files in a landing path, we can leverage AutoLoader to incrementally load the data into a target raw table. If I leverage schema inference, I can more easily flatten the data in subsequent steps, but the inferred types may not always be accurate, so I may want to either use `schemaHints` or explicitly cast types via Spark SQL afterwards.
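For concreteness, here's a minimal sketch of the kind of Auto Loader read I mean (the paths, target table, and the columns in `schemaHints` are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Persist the inferred schema so it can evolve across runs
    .option("cloudFiles.schemaLocation", "/mnt/_schemas/events/")
    # Infer actual types instead of treating every field as string
    .option("cloudFiles.inferColumnTypes", "true")
    # Pin the types inference tends to get wrong (hypothetical columns)
    .option("cloudFiles.schemaHints", "event_ts TIMESTAMP, reading DOUBLE")
    .load("/mnt/landing/events/")  # hypothetical landing path
)

(
    raw_df.writeStream
    .option("checkpointLocation", "/mnt/_checkpoints/events_raw/")
    .trigger(availableNow=True)
    .toTable("raw.events")  # hypothetical raw/bronze table
)
```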

1 ACCEPTED SOLUTION


SP_6721
Contributor III

Hi @ChristianRRL ,

When loading raw data into bronze tables with Auto Loader, it’s usually best to keep the original structure rather than flattening it right away. You can use schema inference for convenience, but to guard against wrong inferences, add schema hints or a rescued data column.
In the raw layer, stick to light checks. Save heavier work, such as flattening, deduplication, and detailed validation, for your silver tables. This keeps your data traceable and better aligned with lakehouse best practices.
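As a rough sketch of that split (table names, field names, and the watermark are illustrative, and it assumes the bronze table kept its nested structure along with Auto Loader's default `_rescued_data` column):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Silver: flatten the nested bronze structure, enforce types,
# and deduplicate. Bronze stays untouched and fully traceable.
silver_df = (
    spark.readStream.table("raw.events")  # hypothetical bronze table
    .select(
        F.col("device.id").alias("device_id"),  # illustrative nested field
        F.col("payload.reading").cast("double").alias("reading"),
        F.col("event_ts").cast("timestamp").alias("event_ts"),
        F.col("_rescued_data"),
    )
    # Rows where inference or casting failed carry a non-null
    # _rescued_data; flag them for quarantine instead of dropping them.
    .withColumn("is_quarantined", F.col("_rescued_data").isNotNull())
    # Watermark bounds the state kept for streaming deduplication.
    .withWatermark("event_ts", "1 day")
    .dropDuplicates(["device_id", "event_ts"])
)

(
    silver_df.writeStream
    .option("checkpointLocation", "/mnt/_checkpoints/events_silver/")
    .trigger(availableNow=True)
    .toTable("base.events")  # hypothetical silver/"base" table
)
```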


