bianca_unifeye
Databricks MVP

Short version: this is (unfortunately) a Databricks quirk, not you going mad. The SQL read_files path and the PySpark spark.read.csv path do not use the exact same schema inference code, and CSVs with a UTF-8 BOM hit a corner case where read_files falls back to “everything is STRING”.

Putting it together:

  1. Different code paths

    • read_files (SQL) uses Databricks' own inference, in the style of Databricks SQL / Auto Loader.

    • spark.read.csv (PySpark) uses Spark’s built-in CSV reader and its inferSchema logic.

  2. Different philosophy

    • read_files is designed to be safe and robust for production ingestion (especially in streaming / Auto Loader scenarios). When it’s unsure about a column, it tends to default to STRING to avoid parse failures at runtime.

    • PySpark’s inferSchema is more “optimistic” and may happily promote to numeric/date types as long as sampled values look consistent.

  3. BOM + large file increases risk

    • A BOM at the start of the header line, combined with a big file, increases the chance that read_files decides “I’m not 100% sure these types are clean across all rows → I’ll just keep them as strings.”

So your observation is exactly what you’d expect from that combination, annoying as it is.
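
If you want to confirm the BOM is actually in play before blaming inference, a quick check on the raw bytes helps. This is a minimal sketch in plain Python, assuming the file is reachable through a normal file path (for example a Volumes path or a local copy); has_utf8_bom and strip_utf8_bom are hypothetical helper names, not Databricks APIs.

```python
import codecs


def has_utf8_bom(path):
    """Return True if the file starts with the 3-byte UTF-8 BOM (EF BB BF)."""
    with open(path, "rb") as f:
        return f.read(3) == codecs.BOM_UTF8


def strip_utf8_bom(src, dst, chunk_size=1 << 20):
    """Copy src to dst, dropping a leading UTF-8 BOM if there is one."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        head = fin.read(3)
        if head != codecs.BOM_UTF8:
            # No BOM: these bytes are real data, so keep them.
            fout.write(head)
        # Stream the rest in chunks so large files do not need to fit in memory.
        while chunk := fin.read(chunk_size):
            fout.write(chunk)
```

Re-running read_files against the stripped copy is a cheap way to see whether the BOM is what tips it into all-STRING mode for your file. I would treat that as a diagnostic rather than a guaranteed fix.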

Avoid relying on schema inference for production ingestion

Inference is convenient for exploration, but not reliable for large, messy, or changing CSVs (encoding changes, BOM, mixed types, dirty rows).
Always prefer explicit schemas for stable pipelines.
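
As a rough sketch of what "explicit schema" can look like on the PySpark side: the column names and types below are made up for illustration, so swap in your file's real layout. The schema is passed as a DDL string, which spark.read.csv accepts in place of a StructType.

```python
# Hypothetical columns; replace with the real layout of your CSV.
COLUMNS = [
    ("order_id", "BIGINT"),
    ("order_date", "DATE"),
    ("amount", "DECIMAL(18,2)"),
    ("customer_name", "STRING"),
]


def to_ddl(columns):
    """Build a DDL schema string such as 'order_id BIGINT, order_date DATE'."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_csv_with_schema(spark, path, columns=COLUMNS):
    """Read a CSV with a fixed schema instead of relying on inference.

    `spark` is an existing SparkSession (already defined in a Databricks notebook).
    FAILFAST makes rows that do not match the schema raise an error
    instead of silently becoming nulls.
    """
    return spark.read.csv(
        path,
        schema=to_ddl(columns),
        header=True,
        mode="FAILFAST",
    )
```

In a notebook you would call read_csv_with_schema(spark, "<path to your file>"). On the SQL side, read_files also takes a schema option (and schemaHints if you only want to pin a few columns), which should sidestep the inference difference entirely; check the docs for your runtime version for the exact behaviour.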
