Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Inconsistent behaviour when using read_files to read UTF-8 BOM encoded csv

JackR
New Contributor II

I have a simple piece of code to read a csv file from an AWS s3 bucket:

 

SELECT
    *
  FROM
    read_files(
      myfile,
      format => 'csv',
      header => true,
      inferSchema => true,
      mode => 'FAILFAST')
 
It's a large file with over 100 columns, and schema inference has always handled it fine. However, the input CSV is now encoded as UTF-8 with BOM (previously plain UTF-8), and the data types are no longer being inferred: everything is read as a string. The behaviour isn't even consistent, because when I tried the same thing on a 100-record sample the data types were identified correctly.
 
What's more, if I read the data in with equivalent PySpark code, it seems to be fine:
 
df_infer_schema = spark.read.format("csv") \
  .option("InferSchema", "True") \
  .option("header", "True") \
  .option("sep", ",") \
  .load(file_to_use)
 
Is anyone able to shed light on what's happening? Why is the SQL method behaving oddly, and why is it behaving differently from PySpark?
Accepted Solution

bianca_unifeye
New Contributor III

Short version: this is (unfortunately) a Databricks quirk, not you going mad. The SQL read_files path and the PySpark spark.read.csv path do not use the exact same schema inference code, and CSVs with a UTF-8 BOM hit a corner case where read_files falls back to “everything is STRING”.

Putting it together:

  1. Different code paths

    • read_files (SQL) uses Databricks SQL / Auto Loader style inference.

    • spark.read.csv (PySpark) uses Spark’s CSV reader & inference.

  2. Different philosophy

    • read_files is designed to be safe and robust for production ingestion (esp. in streaming / Auto Loader scenarios). When it’s unsure, it tends to default to STRING to avoid runtime parse failures later.

    • PySpark’s inferSchema is more “optimistic” and may happily promote to numeric/date types as long as sampled values look consistent.

  3. BOM + large file increases risk

    • BOM in the header line plus a big file increases the chance that read_files thinks "I'm not 100% sure these types are clean across all rows → I'll just keep them as strings." (A quick way to confirm the BOM is actually there is sketched below.)

So your observation is exactly what you’d expect from that combination, annoying as it is.

Avoid relying on schema inference for production ingestion

Inference is convenient for exploration, but not reliable for large, messy, or changing CSVs (encoding changes, BOM, mixed types, dirty rows).
Always prefer explicit schemas for stable pipelines.
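
For example, here is a minimal sketch of pinning the schema on the PySpark side (the column names and types below are placeholders for your 100+ real columns; read_files also has a schema option in the same DDL form if you'd rather stay in SQL, worth double-checking in the read_files docs for your runtime):

# Sketch: read the CSV with an explicit schema instead of relying on inference.
# Column names and types are placeholders; replace them with your real columns.
explicit_schema = """
    order_id BIGINT,
    order_date DATE,
    amount DECIMAL(18,2),
    customer_name STRING
"""

df = (
    spark.read.format("csv")
    .schema(explicit_schema)       # no inference pass over the data
    .option("header", "true")
    .option("sep", ",")
    .option("mode", "FAILFAST")    # fail loudly if a row doesn't match
    .load(file_to_use)             # same variable as in the original snippet
)

This way an encoding change can still break parsing of a specific column, but it fails visibly instead of silently flipping every column to STRING.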


