<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH] in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160062#M54852</link>
    <description>&lt;P&gt;The _rescued_data column in Auto Loader works for JSON and CSV formats, not Parquet. Parquet is a strongly typed format where data types are encoded in the file metadata. When a timestamp column becomes INT64 in a new file, it creates a file-format-level incompatibility that surfaces during Parquet reader initialization, before Auto Loader's schema evolution or rescued data logic kicks in.&lt;/P&gt;&lt;P&gt;FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH: Expected Spark type timestamp, actual Parquet type INT64 generally comes from the low-level Parquet reader when it detects the metadata mismatch.&lt;/P&gt;&lt;P&gt;schemaEvolutionMode: addNewColumnsWithTypeWidening - handles widening (int to long), but timestamp to INT64 is not widening. It is an incompatible change.&lt;BR /&gt;rescuedDataColumn - only rescues data for JSON/CSV, where type mismatches are detected during parsing, not for Parquet format-level conflicts.&lt;/P&gt;&lt;P&gt;You can use badRecordsPath for Parquet files with incompatible type changes. It catches file-level read failures and allows the stream to continue while logging the failing files.&lt;/P&gt;</description>
    <pubDate>Mon, 22 Jun 2026 09:57:21 GMT</pubDate>
    <dc:creator>balajij8</dc:creator>
    <dc:date>2026-06-22T09:57:21Z</dc:date>
    <item>
      <title>Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH]</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160058#M54851</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;I am using Auto Loader to load Parquet files into my Unity Catalog with the following settings:&lt;/P&gt;&lt;P&gt;.option("cloudFiles.format", "parquet") .option("cloudFiles.inferColumnTypes", "true") .option("cloudFiles.schemaEvolutionMode", "addNewColumnsWithTypeWidening") .option("cloudFiles.rescuedDataColumn", "_rescued_data")&lt;/P&gt;&lt;P&gt;In one of the newest files, a column that used to be a timestamp is now a Long type. I was under the impression that these faulty records would just propagate to the `_rescued_data` column, but unfortunately the stream breaks, and the only way I can fix my pipeline is with the badRecordsPath option.&lt;/P&gt;&lt;P&gt;Why does this break my pipeline with: Expected Spark type timestamp, actual Parquet type INT64. SQLSTATE: KD001, instead of moving the bad data to _rescued_data?&lt;/P&gt;&lt;P&gt;Thanks in advance!&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 09:31:30 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160058#M54851</guid>
      <dc:creator>Maxrb</dc:creator>
      <dc:date>2026-06-22T09:31:30Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH]</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160062#M54852</link>
      <description>&lt;P&gt;The _rescued_data column in Auto Loader works for JSON and CSV formats, not Parquet. Parquet is a strongly typed format where data types are encoded in the file metadata. When a timestamp column becomes INT64 in a new file, it creates a file-format-level incompatibility that surfaces during Parquet reader initialization, before Auto Loader's schema evolution or rescued data logic kicks in.&lt;/P&gt;&lt;P&gt;FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH: Expected Spark type timestamp, actual Parquet type INT64 generally comes from the low-level Parquet reader when it detects the metadata mismatch.&lt;/P&gt;&lt;P&gt;schemaEvolutionMode: addNewColumnsWithTypeWidening - handles widening (int to long), but timestamp to INT64 is not widening. It is an incompatible change.&lt;BR /&gt;rescuedDataColumn - only rescues data for JSON/CSV, where type mismatches are detected during parsing, not for Parquet format-level conflicts.&lt;/P&gt;&lt;P&gt;You can use badRecordsPath for Parquet files with incompatible type changes. It catches file-level read failures and allows the stream to continue while logging the failing files.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 09:57:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160062#M54852</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-06-22T09:57:21Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH]</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160064#M54853</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/210897"&gt;@balajij8&lt;/a&gt;&amp;nbsp;Thanks for your reply.&lt;/P&gt;&lt;P&gt;I do see what you mean. At the same time, I see that _rescued_data works for some type mismatches, which is why I am confused. Do you have any idea why it works when I get string data in an integer column but not in this specific case?&lt;/P&gt;&lt;P&gt;Thanks!&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 10:04:06 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160064#M54853</guid>
      <dc:creator>Maxrb</dc:creator>
      <dc:date>2026-06-22T10:04:06Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH]</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160082#M54855</link>
      <description>&lt;P&gt;What you're seeing comes down to &lt;EM&gt;where&lt;/EM&gt; the type mismatch is detected.&lt;/P&gt;&lt;P&gt;For Parquet, some mismatches can be handled at the Auto Loader layer and end up in &lt;STRONG&gt;_rescued_data&lt;/STRONG&gt;, while others fail earlier inside the Parquet reader itself.&lt;/P&gt;&lt;P&gt;In your example, the existing schema expects a timestamp, but the new file stores the column as a plain &lt;STRONG&gt;INT64&lt;/STRONG&gt;. That mismatch is detected by the Parquet reader before Auto Loader's rescue logic gets a chance to process the row, which is why you get:&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;instead of seeing the value in &lt;STRONG&gt;_rescued_data&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;The reason a string appearing in an integer column may be rescued is that the file can still be read successfully and the mismatch is encountered during value conversion/parsing at the record level. In that case Auto Loader can route the problematic value to &lt;STRONG&gt;_rescued_data&lt;/STRONG&gt;.&lt;/P&gt;&lt;P&gt;So the distinction is roughly:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Record-level parsing/conversion issue&lt;/STRONG&gt; → can often be rescued into _rescued_data&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;Parquet schema/file-level incompatibility&lt;/STRONG&gt; → fails during file read and cannot be rescued&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;For production pipelines, the common pattern is to combine:&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;cloudFiles.schemaHints&lt;/STRONG&gt; for known drift-prone columns, and&lt;/P&gt;&lt;/LI&gt;&lt;LI&gt;&lt;P&gt;&lt;STRONG&gt;badRecordsPath&lt;/STRONG&gt; as a safety net for unexpected schema incompatibilities.&lt;/P&gt;&lt;/LI&gt;&lt;/UL&gt;</description>
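      <!--
The pattern described in this reply (schema hints for drift-prone columns plus
badRecordsPath as a safety net) might be wired up roughly as below. This is a
minimal sketch, not a confirmed configuration: the helper name
build_autoloader_options, the column name event_ts, and every path are
hypothetical, and whether all of these options combine cleanly on a given
Databricks runtime is an assumption. Only the option keys come from the thread.

```python
from typing import Optional


def build_autoloader_options(bad_records_path: str,
                             schema_hints: Optional[str] = None):
    """Collect the Auto Loader reader options discussed in this thread.

    Hypothetical helper: it only assembles a dict and does not touch Spark.
    """
    options = {
        "cloudFiles.format": "parquet",
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaEvolutionMode": "addNewColumnsWithTypeWidening",
        "cloudFiles.rescuedDataColumn": "_rescued_data",
        # Safety net for files that fail at the Parquet reader level.
        "badRecordsPath": bad_records_path,
    }
    if schema_hints:
        # Pin the expected type of known drift-prone columns (DDL-style string).
        options["cloudFiles.schemaHints"] = schema_hints
    return options


options = build_autoloader_options(
    bad_records_path="/tmp/autoloader/bad_records",  # hypothetical path
    schema_hints="event_ts TIMESTAMP",  # hypothetical column name
)

# On Databricks, where `spark` is predefined, the dict would be applied as:
# df = (spark.readStream.format("cloudFiles")
#       .options(**options)
#       .load("/path/to/landing/zone"))  # hypothetical source path
```

Keeping the options in a plain dict makes it easy to review or unit test the
configuration without a running cluster. A real stream would also need a schema
location and a checkpoint location, which are left out here.
      -->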
      <pubDate>Mon, 22 Jun 2026 10:45:32 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160082#M54855</guid>
      <dc:creator>Yogasathyandrun</dc:creator>
      <dc:date>2026-06-22T10:45:32Z</dc:date>
    </item>
    <item>
      <title>Re: Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH]</title>
      <link>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160094#M54858</link>
      <description>&lt;P&gt;&lt;a href="https://community.databricks.com/t5/user/viewprofilepage/user-id/201317"&gt;@Maxrb&lt;/a&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;String to Integer is a Value-Level Mismatch&lt;/STRONG&gt; - The Parquet reader successfully reads the STRING physical type from the file. Auto Loader then attempts to cast STRING to INTEGER (a Spark-level operation). The cast fails for a value such as "invalid" during Spark's type conversion. Auto Loader's rescued data logic catches this conversion failure and routes the value to _rescued_data.&lt;/P&gt;&lt;P&gt;&lt;STRONG&gt;Timestamp to INT64 is a Format-Level Mismatch&lt;/STRONG&gt; - The Parquet reader examines the file metadata and sees conflicting physical type annotations. It rejects the file as invalid at the format level before any data is read.&lt;BR /&gt;Auto Loader never gets a chance to apply rescued data logic because the failure happens in the Parquet reader, not in Spark's type system.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Jun 2026 11:11:21 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/autoloader-failed-read-file-parquet-column-data-type-mismatch/m-p/160094#M54858</guid>
      <dc:creator>balajij8</dc:creator>
      <dc:date>2026-06-22T11:11:21Z</dc:date>
    </item>
  </channel>
</rss>

