Get Started Discussions

Best practices for using Auto Loader

DynDe
New Contributor

I’m looking to follow best practices with Databricks Auto Loader. When ingesting different file formats, is it considered good practice to always read data as strings first, or is it better to use format-specific readers (e.g., JSON, CSV, binary) from the start?

1 ACCEPTED SOLUTION


DivyaandData
Databricks Employee

Hey @DynDe ,

Use the format-specific readers from the start and let Auto Loader handle schema, rather than reading everything as generic strings/text yourself.

Key points:

  • Always set cloudFiles.format to the real file format (json, csv, xml, parquet, avro, text, binaryfile, etc.). This is how Auto Loader enables schema inference, evolution, and rescued data; treating everything as plain text/binary bypasses those features and shifts parsing complexity to your code.
     
  • For JSON / CSV / XML, Auto Loader’s own schema inference already reads all columns as STRING by default (including nested JSON), specifically to avoid brittle type mismatches across files. You still get the correct structure, but types are strings until you cast them downstream.
     
    • You can optionally tighten types with cloudFiles.inferColumnTypes, inferSchema (CSV), and cloudFiles.schemaHints when you’re ready, but that’s an optimization / governance choice, not a reason to avoid the JSON/CSV/XML readers.
       
  • For Parquet / Avro, the best practice is to let Auto Loader respect and merge the file’s typed schemas instead of forcing everything to string; it samples files and merges typed schemas for you.
     
  • For text and binary/unstructured content, use the text or binaryfile formats; they already have a fixed schema (content + metadata). You then interpret the payload in later stages if needed.
     
  • Reliability-wise, lean on rescued data + a structured Bronze layer:
    • When schema is inferred, Auto Loader populates _rescued_data with any fields that don’t match the current schema or have type issues, so you don’t lose data at ingest.
       
    • Cast to business types and enforce contracts in Silver/Gold, not by manually reading everything as raw strings in Bronze.
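The points above can be sketched as a minimal Bronze ingestion setup. Everything here is illustrative and not from the original post: the paths, checkpoint/schema locations, and the column names in the schema hints are hypothetical, and the `cloudFiles` source itself is only available on Databricks, so the reader call is kept inside an uninvoked function.

```python
# Auto Loader options for a hypothetical Bronze JSON feed.
bronze_options = {
    "cloudFiles.format": "json",  # set the real file format, not text/binaryfile
    # Schema location is required for schema inference, evolution, and rescued data:
    "cloudFiles.schemaLocation": "/checkpoints/events/_schema",
    # Default inference reads every column as STRING; opt in to typed inference later:
    # "cloudFiles.inferColumnTypes": "true",
    # Or pin only the critical columns with schema hints (columns here are made up):
    "cloudFiles.schemaHints": "event_ts TIMESTAMP, amount DECIMAL(18,2)",
}


def read_bronze(spark, source_path="/landing/events"):
    """Return the Bronze stream.

    Databricks-only: the 'cloudFiles' source does not exist in open-source Spark,
    so this function is a sketch and is not executed here.
    """
    return (
        spark.readStream
             .format("cloudFiles")
             .options(**bronze_options)
             .load(source_path)  # _rescued_data is included when schema is inferred
    )
```

The key design point is that all format-specific behavior lives in the options, so switching a feed from JSON to CSV or XML is a one-line change rather than a rewrite of hand-rolled string parsing.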

In practice:

  • Bronze: use the correct cloudFiles.format, accept Auto Loader’s default (string) inference for semi-structured formats and rescued data, or provide an explicit schema/schema hints for critical feeds.
  • Silver/Gold: perform type casting, normalization, and validation.
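The Bronze-to-Silver casting step can be sketched as `CAST` expressions fed to PySpark's `selectExpr`; the column names and target types below are hypothetical, assumed only for illustration:

```python
# Silver-layer casts for columns that Bronze ingested as STRING.
# Column names and target types are illustrative, not from the original post.
casts = {
    "event_ts": "TIMESTAMP",
    "amount": "DECIMAL(18,2)",
    "user_id": "BIGINT",
}

# Arguments for bronze_df.selectExpr(*silver_exprs) in a Silver pipeline;
# _rescued_data is carried along so bad records can be quarantined, not dropped.
silver_exprs = [f"CAST({c} AS {t}) AS {c}" for c, t in casts.items()]
silver_exprs.append("_rescued_data")
```

Keeping the casts in one declarative mapping like this makes the Silver contract easy to review and extend, and leaves Bronze as a faithful, loss-free copy of the source.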

So: don’t implement a “string-first” ingestion layer yourself; instead, use format-specific Auto Loader readers and rely on their default string inference (for JSON/CSV/XML) plus rescued data for robustness.

If this helps resolve your question, please mark this reply as the Accepted Solution so it can help others in the community find it more easily.

