Short version: this is (unfortunately) a Databricks quirk, not you going mad. The SQL read_files path and the PySpark spark.read.csv path do not use the exact same schema inference code, and CSVs with a UTF-8 BOM hit a corner case where read_files falls back to “everything is STRING”.
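You can see the divergence side by side. A minimal repro sketch, assuming a hypothetical BOM'd file at /tmp/data_with_bom.csv with a header row:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical CSV whose first three bytes are the UTF-8 BOM (EF BB BF)
path = "/tmp/data_with_bom.csv"

# SQL path: read_files with schema inference
spark.sql(
    f"SELECT * FROM read_files('{path}', format => 'csv', header => true)"
).printSchema()
# -> in the BOM corner case, every column comes back as STRING

# PySpark path: DataFrameReader with inferSchema
spark.read.option("header", True).option("inferSchema", True).csv(path).printSchema()
# -> numeric/date columns are typically promoted as expected
```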
Putting it together:
Different code paths
The SQL read_files function and PySpark's spark.read.csv run related but not identical schema-inference code, which is why the very same file can come back with two different schemas.
Different philosophy
read_files is designed to be safe and robust for production ingestion (especially in streaming / Auto Loader scenarios). When it's unsure, it tends to default to STRING to avoid runtime parse failures later.
PySpark’s inferSchema is more “optimistic” and may happily promote to numeric/date types as long as sampled values look consistent.
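That conservative default is recoverable: read everything as STRING, then cast deliberately. A minimal sketch (column names are hypothetical; try_cast returns NULL on unparseable values instead of failing the job, and is available in Spark 3.2+ / recent DBRs):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Everything arrives as STRING from the conservative inference path...
raw = spark.sql(
    "SELECT * FROM read_files('/tmp/data_with_bom.csv', format => 'csv', header => true)"
)

# ...so cast on your own terms; try_cast NULLs out dirty values
typed = raw.select(
    F.expr("try_cast(id AS INT)").alias("id"),
    F.expr("try_cast(amount AS DOUBLE)").alias("amount"),
    F.col("note"),
)
typed.printSchema()
```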
BOM + large file increases risk
The BOM bytes (EF BB BF) sit in front of the first header or first value, and a reader that doesn't strip them sees a column that no longer looks cleanly numeric. The bigger the file, the more data the inference pass looks at, and the higher the odds that this (or any other dirty value) surfaces, at which point the conservative path falls back to all-STRING.
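You can watch the mechanics in plain Python, no Spark required (the temp path is hypothetical):

```python
import csv

# 'utf-8-sig' prepends the UTF-8 BOM (EF BB BF) when writing
with open("/tmp/bom_demo.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerow([1, 3.14])

# A reader that decodes as plain 'utf-8' keeps the BOM glued to the first header
with open("/tmp/bom_demo.csv", encoding="utf-8") as f:
    print(repr(f.readline()))  # '\ufeffid,amount\n' -> first column is '\ufeffid'

# Decoding with 'utf-8-sig' strips it again
with open("/tmp/bom_demo.csv", encoding="utf-8-sig") as f:
    print(repr(f.readline()))  # 'id,amount\n'
```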
So your observation is exactly what you’d expect from that combination, annoying as it is.
Avoid relying on schema inference for production ingestion
Inference is convenient for exploration, but not reliable for large, messy, or changing CSVs (encoding changes, BOM, mixed types, dirty rows).
Always prefer explicit schemas for stable pipelines.
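A minimal sketch of the explicit-schema route (file path and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (
    StructType, StructField, IntegerType, DoubleType, DateType, StringType,
)

spark = SparkSession.builder.getOrCreate()

# No inference pass at all: types are pinned regardless of BOMs or dirty rows
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("amount", DoubleType(), True),
    StructField("created_at", DateType(), True),
    StructField("note", StringType(), True),
])

df = (
    spark.read
    .schema(schema)
    .option("header", True)
    .csv("/tmp/data_with_bom.csv")
)
df.printSchema()
```

On the SQL side, read_files likewise accepts an explicit DDL schema (e.g. schema => 'id INT, amount DOUBLE, ...'), which sidesteps inference the same way — worth double-checking the exact option name against the docs for your DBR version.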