Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Thoughts on AutoLoader schema inference into raw table (+data flattening)

ChristianRRL
Valued Contributor III

I am curious to get the community's thoughts on this. Is it generally preferable to load raw data based on its inferred columns or not? And is it preferred to keep the raw data in its original structure or to flatten it into a more tabular structure? If it's better to keep the original structure, what is generally validated at the raw data table versus the subsequent "base" (flattened) table?

For example, assuming we already have a data extraction process that lands .json files in a landing path, we can leverage AutoLoader to incrementally load the data into a target raw table. If I leverage schema inference, I can more easily flatten the data in subsequent steps, but the inferred types may not always be accurate, so I may want to either use `schemaHints` or explicitly cast types via Spark SQL afterwards.
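For concreteness, here's a minimal sketch of the kind of Auto Loader read I mean (the paths, target table, and the columns in `schemaHints` are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

raw_df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Persist the inferred schema so it can evolve across runs
    .option("cloudFiles.schemaLocation", "/mnt/_schemas/events/")
    # Infer actual types instead of treating every field as string
    .option("cloudFiles.inferColumnTypes", "true")
    # Pin the types inference tends to get wrong (hypothetical columns)
    .option("cloudFiles.schemaHints", "event_ts TIMESTAMP, reading DOUBLE")
    .load("/mnt/landing/events/")  # hypothetical landing path
)

(
    raw_df.writeStream
    .option("checkpointLocation", "/mnt/_checkpoints/events_raw/")
    .trigger(availableNow=True)
    .toTable("raw.events")  # hypothetical raw/bronze table
)
```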

1 ACCEPTED SOLUTION


SP_6721
Contributor III

Hi @ChristianRRL ,

When loading raw data into bronze tables with Auto Loader, it’s usually best to keep the original structure rather than flattening it right away. You can use schema inference for convenience, but to guard against wrong inferences, add schema hints or a rescued data column.
In the raw layer, stick to light checks. Save heavier work, such as flattening, deduplication, and detailed validation, for your silver tables. This keeps your data traceable and better aligned with lakehouse best practices.
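As a rough sketch of that split (table names, field names, and the watermark are illustrative, and it assumes the bronze table kept its nested structure along with Auto Loader's default `_rescued_data` column):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Silver: flatten the nested bronze structure, enforce types,
# and deduplicate. Bronze stays untouched and fully traceable.
silver_df = (
    spark.readStream.table("raw.events")  # hypothetical bronze table
    .select(
        F.col("device.id").alias("device_id"),  # illustrative nested field
        F.col("payload.reading").cast("double").alias("reading"),
        F.col("event_ts").cast("timestamp").alias("event_ts"),
        F.col("_rescued_data"),
    )
    # Rows where inference or casting failed carry a non-null
    # _rescued_data; flag them for quarantine instead of dropping them.
    .withColumn("is_quarantined", F.col("_rescued_data").isNotNull())
    # Watermark bounds the state kept for streaming deduplication.
    .withWatermark("event_ts", "1 day")
    .dropDuplicates(["device_id", "event_ts"])
)

(
    silver_df.writeStream
    .option("checkpointLocation", "/mnt/_checkpoints/events_silver/")
    .trigger(availableNow=True)
    .toTable("base.events")  # hypothetical silver/"base" table
)
```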


