
Best practices for using autoloader

DynDe
New Contributor

I'm looking to follow best practices with Databricks Auto Loader. When ingesting different file formats, is it considered good practice to always read data as strings first, or is it better to use format-specific readers (e.g., JSON, CSV, binary) from the start?

1 ACCEPTED SOLUTION


DivyaandData
Databricks Employee

Hey @DynDe ,

Use the format-specific readers from the start and let Auto Loader handle schema, rather than reading everything as generic strings/text yourself.

Key points:

  • Always set cloudFiles.format to the real file format (json, csv, xml, parquet, avro, text, binaryfile, etc.). This is how Auto Loader enables schema inference, evolution, and rescued data; treating everything as plain text/binary bypasses those features and shifts parsing complexity to your code.
     
  • For JSON / CSV / XML, Auto Loader's own schema inference already reads all columns as STRING by default (including nested JSON), specifically to avoid brittle type mismatches across files. You still get the correct structure, but types are strings until you cast them downstream.
     
    • You can optionally tighten types with cloudFiles.inferColumnTypes, inferSchema (CSV), and cloudFiles.schemaHints when you're ready, but that's an optimization / governance choice, not a reason to avoid the JSON/CSV/XML readers.
       
  • For Parquet / Avro, the best practice is to let Auto Loader respect and merge the files' typed schemas instead of forcing everything to string; it samples files and merges typed schemas for you.
     
  • For text and binary/unstructured content, use the text or binaryfile formats; they already have a fixed schema (content + metadata). You then interpret the payload in later stages if needed.
     
  • Reliability-wise, lean on rescued data + a structured Bronze layer:
    • When schema is inferred, Auto Loader populates _rescued_data with any fields that don't match the current schema or have type issues, so you don't lose data at ingest.
       
    • Cast to business types and enforce contracts in Silver/Gold, not by manually reading everything as raw strings in Bronze.
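The points above can be put together as a minimal Bronze read sketch. The cloudFiles option names are real Auto Loader options; the paths, table, and column names are hypothetical, and the readStream call itself needs a Databricks/Spark runtime, so it is shown commented:

```python
# Minimal Auto Loader Bronze sketch for a hypothetical JSON feed.
# Option keys are real Auto Loader options; paths and columns are made up.
bronze_options = {
    "cloudFiles.format": "json",                         # real format, not text/binaryfile
    "cloudFiles.schemaLocation": "/tmp/_schemas/orders", # hypothetical schema-tracking path
    # Default inference reads every column as STRING; set "true" to infer typed columns:
    "cloudFiles.inferColumnTypes": "false",
    # Optional typed hint for one critical column while the rest stay STRING:
    "cloudFiles.schemaHints": "order_ts TIMESTAMP",
}

# Requires a Databricks / Spark Structured Streaming runtime, so shown commented:
# bronze_df = (spark.readStream
#              .format("cloudFiles")
#              .options(**bronze_options)
#              .load("/mnt/raw/orders/"))
```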

In practice:

  • Bronze: use the correct cloudFiles.format, accept Auto Loaderโ€™s default (string) inference for semi-structured formats and rescued data, or provide an explicit schema/schema hints for critical feeds.
  • Silver/Gold: perform type casting, normalization, and validation.
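As a sketch of that Silver step, assuming hypothetical column names and a bronze_df produced by the Auto Loader read (the Spark call needs a runtime, so it is shown commented):

```python
# Hypothetical Silver-layer cast expressions: Bronze keeps strings,
# Silver enforces business types and quarantines rescued rows.
cast_exprs = [
    "CAST(order_id AS BIGINT) AS order_id",
    "CAST(order_ts AS TIMESTAMP) AS order_ts",
    "CAST(amount AS DECIMAL(18,2)) AS amount",
    "_rescued_data",  # keep the rescue column so bad rows can be routed, not dropped
]

# Applied on a Spark DataFrame (requires a runtime, so shown commented):
# silver_df = (bronze_df.selectExpr(*cast_exprs)
#              .filter("_rescued_data IS NULL"))  # send non-null rescued rows to quarantine
```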

So: don't implement a "string-first" ingestion layer yourself; instead, use format-specific Auto Loader readers and rely on their default string inference (for JSON/CSV/XML) plus rescued data for robustness.

If this helps resolve your question, please mark this reply as the Accepted Solution so it can help others in the community find it more easily.

