Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Inconsistent behaviour when using read_files to read UTF-8 BOM encoded csv

JackR
New Contributor II

I have a simple piece of code to read a csv file from an AWS s3 bucket:

 

SELECT
    *
  FROM
    read_files(
      myfile,
      format => 'csv',
      header => true,
      inferSchema => true,
      mode => 'FAILFAST')
 
It's a large file with over 100 columns, and schema inference has always handled it fine. However, the input CSV is now encoded as UTF-8 with BOM (previously plain UTF-8), and the data types are no longer being inferred: everything is read as a string. The behaviour isn't even consistent, because when I tried the same thing on a 100-record sample the data types were identified correctly.
 
What's more, if I read the data in with equivalent PySpark code, it seems to be fine:
 
df_infer_schema = spark.read.format("csv") \
  .option("InferSchema", "True") \
  .option("header", "True") \
  .option("sep", ",") \
  .load(file_to_use)
 
Is anyone able to shed light on what's happening? Why is the SQL method behaving oddly, and why is it behaving differently from PySpark?
Accepted Solution

bianca_unifeye
New Contributor III

Short version: this is (unfortunately) a Databricks quirk, not you going mad. The SQL read_files path and the PySpark spark.read.csv path do not use the exact same schema inference code, and CSVs with a UTF-8 BOM hit a corner case where read_files falls back to “everything is STRING”.

Putting it together:

  1. Different code paths

    • read_files (SQL) uses Databricks SQL / Auto Loader style inference.

    • spark.read.csv (PySpark) uses Spark’s CSV reader & inference.

  2. Different philosophy

    • read_files is designed to be safe and robust for production ingestion (esp. in streaming / Auto Loader scenarios). When it’s unsure, it tends to default to STRING to avoid runtime parse failures later.

    • PySpark’s inferSchema is more “optimistic” and may happily promote to numeric/date types as long as sampled values look consistent.

  3. BOM + large file increases risk

    • BOM in the header line plus a big file increases the chance that read_files thinks "I'm not 100% sure these types are clean across all rows → I'll just keep them as strings." (A quick way to confirm the BOM is actually there is sketched below.)

So your observation is exactly what you’d expect from that combination, annoying as it is.

Avoid relying on schema inference for production ingestion

Inference is convenient for exploration, but not reliable for large, messy, or changing CSVs (encoding changes, BOM, mixed types, dirty rows).
Always prefer explicit schemas for stable pipelines.
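
For example, here is a minimal sketch of pinning the schema on the PySpark side (the column names and types below are placeholders for your 100+ real columns; read_files also has a schema option in the same DDL form if you'd rather stay in SQL, worth double-checking in the read_files docs for your runtime):

# Sketch: read the CSV with an explicit schema instead of relying on inference.
# Column names and types are placeholders; replace them with your real columns.
explicit_schema = """
    order_id BIGINT,
    order_date DATE,
    amount DECIMAL(18,2),
    customer_name STRING
"""

df = (
    spark.read.format("csv")
    .schema(explicit_schema)       # no inference pass over the data
    .option("header", "true")
    .option("sep", ",")
    .option("mode", "FAILFAST")    # fail loudly if a row doesn't match
    .load(file_to_use)             # same variable as in the original snippet
)

This way an encoding change can still break parsing of a specific column, but it fails visibly instead of silently flipping every column to STRING.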


