Hey @DynDe,
Use the format-specific readers from the start and let Auto Loader handle schema, rather than reading everything as generic strings/text yourself.
Key points:
- Always set `cloudFiles.format` to the real file format (json, csv, xml, parquet, avro, text, binaryfile, etc.). This is how Auto Loader enables schema inference, evolution, and rescued data; treating everything as plain text/binary bypasses those features and shifts parsing complexity to your code.
- For JSON / CSV / XML, Auto Loader's own schema inference already reads all columns as `STRING` by default (including nested JSON), specifically to avoid brittle type mismatches across files. You still get the correct structure, but types are strings until you cast them downstream.
- You can optionally tighten types with `cloudFiles.inferColumnTypes`, `inferSchema` (CSV), and `cloudFiles.schemaHints` when you're ready, but that's an optimization / governance choice, not a reason to avoid the JSON/CSV/XML readers.
- For Parquet / Avro, the best practice is to let Auto Loader respect and merge the files' typed schemas instead of forcing everything to string; it samples files and merges typed schemas for you.
- For text and binary/unstructured content, use the `text` or `binaryfile` formats; they already have a fixed schema (content + metadata). You then interpret the payload in later stages if needed.
- Reliability-wise, lean on rescued data + a structured Bronze layer:
  - When schema is inferred, Auto Loader populates `_rescued_data` with any fields that don't match the current schema or have type issues, so you don't lose data at ingest.
  - Cast to business types and enforce contracts in Silver/Gold, not by manually reading everything as raw strings in Bronze.
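To make the points above a bit more concrete, here is a minimal sketch of how the reader options could be assembled per feed. The helper names `autoloader_options` and `read_bronze` and the example paths are made up for illustration; the option keys are the ones mentioned above, plus `cloudFiles.schemaLocation`, which I'm assuming you set so Auto Loader can track the inferred schema. `read_bronze` needs a Databricks runtime; the options helper is plain Python.

```python
from typing import Dict, Optional

# Formats for which Auto Loader infers every column as STRING by default.
STRING_INFERRED_FORMATS = {"json", "csv", "xml"}


def autoloader_options(
    file_format: str,
    schema_location: str,
    schema_hints: Optional[str] = None,
    infer_column_types: bool = False,
) -> Dict[str, str]:
    """Build Auto Loader reader options for one feed (hypothetical helper)."""
    opts = {
        # Always the real file format, never a generic text/binary fallback.
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,
    }
    # Opt-in type inference only applies to the string-inferred formats;
    # Parquet/Avro already carry typed schemas.
    if infer_column_types and file_format.lower() in STRING_INFERRED_FORMATS:
        opts["cloudFiles.inferColumnTypes"] = "true"
    # Hints pin types for a few critical columns and leave the rest inferred.
    if schema_hints:
        opts["cloudFiles.schemaHints"] = schema_hints
    return opts


def read_bronze(spark, source_path: str, options: Dict[str, str]):
    """Start an Auto Loader stream (only works on a Databricks runtime)."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
```

For example, `autoloader_options("json", "/tmp/schemas/orders", schema_hints="amount DECIMAL(18,2)")` keeps the default string inference for everything except `amount` (path and column name are placeholders).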
In practice:
- Bronze: use the correct `cloudFiles.format`, accept Auto Loader's default (string) inference for semi-structured formats and rescued data, or provide an explicit schema/schema hints for critical feeds.
- Silver/Gold: perform type casting, normalization, and validation.
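The Bronze/Silver split could look roughly like the sketch below. It assumes the Bronze table keeps the default `_rescued_data` column and that you maintain a mapping of column names to target SQL types; `cast_expressions`, `to_silver`, and the column names in the usage note are hypothetical, not part of any Databricks API.

```python
from typing import Dict, List

# Default name of the column Auto Loader fills with non-conforming fields.
RESCUED_COLUMN = "_rescued_data"


def cast_expressions(target_types: Dict[str, str]) -> List[str]:
    """Turn {column: SQL type} into selectExpr-style CAST expressions."""
    return [
        f"CAST(`{name}` AS {sql_type}) AS `{name}`"
        for name, sql_type in target_types.items()
    ]


def to_silver(bronze_df, target_types: Dict[str, str]):
    """Split Bronze rows into typed Silver rows and rows needing review.

    bronze_df is expected to be a Spark DataFrame (batch or streaming).
    Only the columns listed in target_types are kept in the clean output.
    """
    # Rows with nothing rescued are cast to their business types.
    clean = bronze_df.filter(f"{RESCUED_COLUMN} IS NULL").selectExpr(
        *cast_expressions(target_types)
    )
    # Rows with rescued fields are kept untouched for inspection.
    quarantined = bronze_df.filter(f"{RESCUED_COLUMN} IS NOT NULL")
    return clean, quarantined
```

A call such as `to_silver(bronze_df, {"order_id": "BIGINT", "amount": "DECIMAL(18,2)"})` would keep type enforcement out of the ingest step while still surfacing anything Auto Loader rescued.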
So: don't implement a "string-first" ingestion layer yourself; instead, use format-specific Auto Loader readers and rely on their default string inference (for JSON/CSV/XML) plus rescued data for robustness.
If this helps resolve your question, please mark this reply as the Accepted Solution so it can help others in the community find it more easily.