<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Inconsistent behaviour when using read_files to read UTF-8 BOM encoded csv in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/inconsistent-behaviour-when-using-read-files-to-read-utf-8-bom/m-p/140516#M51451</link>
    <description>&lt;P&gt;I have a simple piece of code to read a CSV file from an AWS S3 bucket:&lt;/P&gt;&lt;PRE&gt;SELECT
    *
  FROM
    read_files(
      myfile,
      format =&amp;gt; 'csv',
      header =&amp;gt; true,
      inferSchema =&amp;gt; true,
      mode =&amp;gt; 'FAILFAST')&lt;/PRE&gt;&lt;P&gt;It's a large file with over 100 columns, and until now schema inference has been sufficient. However, the input CSV is now encoded as UTF-8 with a BOM (previously plain UTF-8), and the data types are no longer inferred: every column is read as a string. This is not consistent, though: when I tried the same thing with a 100-record sample, the data types were identified correctly.&lt;/P&gt;&lt;P&gt;What's more, if I read the data using equivalent PySpark code, it seems to be fine:&lt;/P&gt;&lt;PRE&gt;df_infer_schema = spark.read.format("csv") \
  .option("InferSchema", "True") \
  .option("header", "True") \
  .option("sep", ",") \
  .load(file_to_use)&lt;/PRE&gt;&lt;P&gt;Is anyone able to shed light on what's happening? Why is the SQL method behaving weirdly? And why is it behaving differently to PySpark?&lt;/P&gt;</description>
    <pubDate>Thu, 27 Nov 2025 12:51:52 GMT</pubDate>
    <dc:creator>JackR</dc:creator>
    <dc:date>2025-11-27T12:51:52Z</dc:date>
    <item>
      <title>Inconsistent behaviour when using read_files to read UTF-8 BOM encoded csv</title>
      <link>https://community.databricks.com/t5/data-engineering/inconsistent-behaviour-when-using-read-files-to-read-utf-8-bom/m-p/140516#M51451</link>
      <description>&lt;P&gt;I have a simple piece of code to read a CSV file from an AWS S3 bucket:&lt;/P&gt;&lt;PRE&gt;SELECT
    *
  FROM
    read_files(
      myfile,
      format =&amp;gt; 'csv',
      header =&amp;gt; true,
      inferSchema =&amp;gt; true,
      mode =&amp;gt; 'FAILFAST')&lt;/PRE&gt;&lt;P&gt;It's a large file with over 100 columns, and until now schema inference has been sufficient. However, the input CSV is now encoded as UTF-8 with a BOM (previously plain UTF-8), and the data types are no longer inferred: every column is read as a string. This is not consistent, though: when I tried the same thing with a 100-record sample, the data types were identified correctly.&lt;/P&gt;&lt;P&gt;What's more, if I read the data using equivalent PySpark code, it seems to be fine:&lt;/P&gt;&lt;PRE&gt;df_infer_schema = spark.read.format("csv") \
  .option("InferSchema", "True") \
  .option("header", "True") \
  .option("sep", ",") \
  .load(file_to_use)&lt;/PRE&gt;&lt;P&gt;Is anyone able to shed light on what's happening? Why is the SQL method behaving weirdly? And why is it behaving differently to PySpark?&lt;/P&gt;</description>
      <pubDate>Thu, 27 Nov 2025 12:51:52 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/inconsistent-behaviour-when-using-read-files-to-read-utf-8-bom/m-p/140516#M51451</guid>
      <dc:creator>JackR</dc:creator>
      <dc:date>2025-11-27T12:51:52Z</dc:date>
    </item>
    <item>
      <title>Re: Inconsistent behaviour when using read_files to read UTF-8 BOM encoded csv</title>
      <link>https://community.databricks.com/t5/data-engineering/inconsistent-behaviour-when-using-read-files-to-read-utf-8-bom/m-p/140522#M51455</link>
      <description>&lt;P&gt;Short version: this is (unfortunately) a Databricks quirk, not you going mad. The SQL read_files path and the PySpark spark.read.csv path &lt;STRONG&gt;do not use exactly the same schema inference code&lt;/STRONG&gt;, and CSVs with a UTF-8 BOM hit a corner case where read_files falls back to “everything is STRING”.&lt;/P&gt;&lt;P&gt;Putting it together:&lt;/P&gt;&lt;OL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Different code paths&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;read_files (SQL) uses Databricks SQL / Auto Loader style inference.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;spark.read.csv (PySpark) uses Spark’s CSV reader and its inference.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Different philosophy&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;read_files is designed to be safe and robust for production ingestion (especially in streaming / Auto Loader scenarios). When it’s unsure, it tends to default to STRING to avoid runtime parse failures later.&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;PySpark’s inferSchema is more “optimistic” and may happily promote to numeric/date types as long as the sampled values look consistent.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;BOM + large file increases risk&lt;/STRONG&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;A BOM in the header line plus a big file increases the chance that read_files thinks “I’m not 100% sure these types are clean across &lt;EM&gt;all&lt;/EM&gt; rows → I’ll just keep them as strings.” That would also fit your 100-record sample being typed correctly.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;/LI&gt;&lt;/OL&gt;&lt;P&gt;So your observation is exactly what you’d expect from that combination, annoying as it is.&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Avoid relying on schema inference for production ingestion&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;Inference is convenient for exploration, but it is not reliable for large, messy, or changing CSVs (encoding changes, BOM, mixed types, dirty rows).&lt;BR /&gt;Always prefer &lt;STRONG&gt;explicit schemas&lt;/STRONG&gt; for stable pipelines.&lt;/P&gt;</description>
      <pubDate>Thu, 27 Nov 2025 14:09:13 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/inconsistent-behaviour-when-using-read-files-to-read-utf-8-bom/m-p/140522#M51455</guid>
      <dc:creator>bianca_unifeye</dc:creator>
      <dc:date>2025-11-27T14:09:13Z</dc:date>
    </item>
  </channel>
</rss>

