Hi @BF7
Yes, there is a difference between how spark.read.option("inferSchema", "true")
and Auto Loader's schema inference (controlled by options such as cloudFiles.inferColumnTypes and cloudFiles.schemaHints) work.
They are not guaranteed to produce identical results.
Key Differences
1. Inference Timing
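spark.read.option("inferSchema", "true"):
Happens eagerly, as Spark scans the files for the batch read.
The schema is inferred from the input data itself. By default Spark uses all rows; the samplingRatio option (CSV and JSON) restricts inference to a fraction of them.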
Auto Loader:
Infers the schema from a sample of the files it discovers when the stream first starts.
Persists the inferred schema at cloudFiles.schemaLocation and can evolve it.
Not all files are read at once, so the schema may evolve over time as new fields arrive.
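A minimal sketch of the two read paths, assuming a Databricks runtime where a `spark` session already exists. The paths and the choice of CSV are hypothetical placeholders, not something from your setup:

```python
# Hypothetical locations, replace with your own.
SOURCE_PATH = "/mnt/raw/events/"
SCHEMA_PATH = "/mnt/checkpoints/events/_schema"


def read_batch(spark, path=SOURCE_PATH):
    """Batch read: the schema is inferred right away, when load() runs."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(path)
    )


def autoloader_options(schema_path=SCHEMA_PATH, file_format="csv"):
    """Options for an Auto Loader stream that persists its inferred schema."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_path,
        "header": "true",
    }


def read_autoloader(spark, path=SOURCE_PATH):
    """Streaming read: the schema is inferred once, stored, then reused."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options())
        .load(path)
    )
```

On later restarts the stream reads the schema stored under the schema location instead of inferring it again, which is one reason the two paths can drift apart.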
2. Sampling Behavior
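In spark.read, inference runs over the files matched by the path, optionally reduced to a fraction of the rows with samplingRatio.
In Auto Loader, inference uses a bounded sample: by default the first 50 GB or 1000 files it discovers, whichever limit is reached first. Both limits are configurable through spark.databricks.cloudFiles.schemaInference.sampleSize.numBytes and spark.databricks.cloudFiles.schemaInference.sampleSize.numFiles.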
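If you want to experiment with the sample size, something like the following should work. It is only a sketch: the helper names are mine, the values are arbitrary, and it assumes the two Auto Loader sample-size settings described in the Databricks schema inference docs.

```python
CONF_PREFIX = "spark.databricks.cloudFiles.schemaInference.sampleSize."


def sampling_confs(num_files=1000, num_bytes="50gb"):
    """Build the Spark confs that bound Auto Loader's inference sample."""
    return {
        CONF_PREFIX + "numFiles": str(num_files),
        CONF_PREFIX + "numBytes": num_bytes,
    }


def apply_confs(spark, confs):
    """Set each conf on the session; do this before starting the stream."""
    for key, value in confs.items():
        spark.conf.set(key, value)


def read_batch_sampled(spark, path, ratio=0.1):
    """Batch counterpart: infer the CSV schema from a fraction of the rows."""
    return (
        spark.read.format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("samplingRatio", str(ratio))
        .load(path)
    )
```

For example, `apply_confs(spark, sampling_confs(num_files=200, num_bytes="10gb"))` would make Auto Loader look at a smaller sample. A smaller sample is faster but more likely to miss rare columns or values.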
โ
3. Data Types
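Auto Loader can infer types differently from a batch read:
All-string columns: for text formats such as JSON and CSV it infers every column as a string unless cloudFiles.inferColumnTypes is set to true
Different numeric types (LongType vs. DoubleType)
Timestamps vs. strings based on pattern matching
Missing fields (absent from the sampled files but present in later ones)
This makes Auto Loader more flexible, but its result depends on which files were sampled, so it is less deterministic than batch inferSchema.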
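To see where the two approaches disagree on your own data, you can compare the `dtypes` of both DataFrames. The helper names below are hypothetical, and the column names in the usage note are made up for illustration. The second function shows one way to pin types with cloudFiles.schemaHints so the result does not depend on the sample:

```python
def diff_dtypes(batch_dtypes, stream_dtypes):
    """Return {column: (batch_type, stream_type)} for every mismatch.

    Both arguments are lists of (name, type) pairs, as returned by
    DataFrame.dtypes. A column missing on one side shows up as None.
    """
    batch = dict(batch_dtypes)
    stream = dict(stream_dtypes)
    report = {}
    for name in sorted(set(batch) | set(stream)):
        left, right = batch.get(name), stream.get(name)
        if left != right:
            report[name] = (left, right)
    return report


def pinned_type_options(hints):
    """Auto Loader options that enable typed inference and pin some columns.

    `hints` maps column names to SQL type names, e.g. {"amount": "DOUBLE"}.
    """
    return {
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaHints": ", ".join(
            f"{column} {sql_type}" for column, sql_type in hints.items()
        ),
    }
```

Usage would look like `diff_dtypes(batch_df.dtypes, stream_df.dtypes)`; an empty dict means both reads agreed. For columns that keep flipping between types, passing `pinned_type_options({"amount": "DOUBLE"})` to the stream's `.options(**...)` is usually simpler than tuning the sample.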
4. Schema Evolution Support
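spark.read = no schema evolution; every read infers the schema afresh from the files it sees
Auto Loader = supports evolving schemas, controlled by cloudFiles.schemaEvolutionMode (addNewColumns is the default when no schema is provided; the other modes are rescue, failOnNewColumns, and none)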
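As a rough illustration, the evolution mode can be set explicitly next to the schema location. The helper name and the validation are my own additions; the mode names are the ones listed in the Auto Loader schema docs linked below:

```python
VALID_MODES = ("addNewColumns", "rescue", "failOnNewColumns", "none")


def evolution_options(schema_path, mode="addNewColumns"):
    """Auto Loader options that make the schema evolution mode explicit."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown schema evolution mode: {mode!r}")
    return {
        "cloudFiles.schemaLocation": schema_path,
        "cloudFiles.schemaEvolutionMode": mode,
    }


def read_with_evolution(spark, path, schema_path, mode="addNewColumns"):
    """Start a JSON Auto Loader read using the chosen evolution mode."""
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .options(**evolution_options(schema_path, mode))
        .load(path)
    )
```

If you need the stream to behave as close as possible to a fixed batch schema, `mode="none"` or `mode="rescue"` is probably the better fit, since neither adds columns to the schema on its own.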
Docs / References:
https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/schema
https://spark.apache.org/docs/latest/sql-data-sources-json.html#schema-inference-and-evolution
LR