Data Engineering

Autoloader [FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH]

Maxrb
New Contributor III

Hi,

I am using autoloader to load parquet files into my unity catalog with the following settings:

.option("cloudFiles.format", "parquet")
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumnsWithTypeWidening")
.option("cloudFiles.rescuedDataColumn", "_rescued_data")

In one of the newest files, a column that used to be a timestamp is now a Long type. I was under the impression that these faulty records would just propagate to the `_rescued_data` column, but unfortunately the stream breaks, and I can only fix my pipeline with the badRecordsPath option.

Why does this break my pipeline with "Expected Spark type timestamp, actual Parquet type INT64. SQLSTATE: KD001" instead of moving the bad data to _rescued_data?

Thanks in advance!

1 ACCEPTED SOLUTION

Accepted Solutions

Yogasathyandrun
New Contributor

What you're seeing comes down to where the type mismatch is detected.

For Parquet, some mismatches can be handled at the Auto Loader layer and end up in _rescued_data, while others fail earlier inside the Parquet reader itself.

In your example, the existing schema expects a timestamp, but the new file stores the column as a plain INT64. That mismatch is detected by the Parquet reader before Auto Loader's rescue logic gets a chance to process the row, which is why you get:

FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH

instead of seeing the value in _rescued_data.

The reason a string appearing in an integer column may be rescued is that the file can still be read successfully and the mismatch is encountered during value conversion/parsing at the record level. In that case Auto Loader can route the problematic value to _rescued_data.

So the distinction is roughly:

  • Record-level parsing/conversion issue → can often be rescued into _rescued_data

  • Parquet schema/file-level incompatibility → fails during file read and cannot be rescued
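The two cases above can be pictured with a small toy model. This is only a sketch of the distinction as described in this thread, not Databricks' or Spark's actual reader code: `read_file`, `FileReadError`, `CONVERTIBLE`, and `CASTS` are hypothetical names, and which type pairs land in which bucket is taken from the thread rather than from the reader's source.

```python
class FileReadError(Exception):
    """Toy stand-in for FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH."""


# (expected type, type declared in the file) pairs this toy reader will open.
# Assumption taken from the thread: string-for-int is readable, int64-for-timestamp is not.
CONVERTIBLE = {("int", "int"), ("int", "string"), ("timestamp", "timestamp")}

# Per-value conversions applied after the file has been opened.
CASTS = {"int": int, "timestamp": lambda value: value}


def read_file(expected, file_types, rows):
    # Stage 1: file-level check on declared column types, before any row is read.
    for col, want in expected.items():
        got = file_types[col]
        if (want, got) not in CONVERTIBLE:
            raise FileReadError(f"Expected {want}, actual {got} for column {col}")

    # Stage 2: record-level conversion; a failed cast is rescued instead of raised.
    out = []
    for row in rows:
        parsed, rescued = {}, {}
        for col, want in expected.items():
            try:
                parsed[col] = CASTS[want](row[col])
            except (TypeError, ValueError):
                parsed[col] = None
                rescued[col] = row[col]
        parsed["_rescued_data"] = rescued or None
        out.append(parsed)
    return out
```

In this model a string value such as "invalid" in an int column reaches stage 2 and lands in `_rescued_data`, while an INT64 file column where a timestamp is expected never gets past stage 1.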

For production pipelines, the common pattern is to combine:

  • cloudFiles.schemaHints for known drift-prone columns, and

  • badRecordsPath as a safety net for unexpected schema incompatibilities.
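A minimal sketch of that combination, assuming the options from the original question stay in place. The column name `event_ts`, the paths, and the helper names `autoloader_options` and `build_stream` are hypothetical, and `cloudFiles.schemaLocation` is added here only because schema inference needs somewhere to store its state. Whether `badRecordsPath` and the rescued data column behave well together should be checked against your runtime's documentation.

```python
def autoloader_options(schema_location, bad_records_path, schema_hints):
    """Build the Auto Loader option map; every argument is a placeholder value."""
    return {
        # Options quoted in the original question
        "cloudFiles.format": "parquet",
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaEvolutionMode": "addNewColumnsWithTypeWidening",
        "cloudFiles.rescuedDataColumn": "_rescued_data",
        # Assumed addition: where the inferred schema is tracked
        "cloudFiles.schemaLocation": schema_location,
        # Pin the type of columns known to drift, e.g. "event_ts timestamp"
        "cloudFiles.schemaHints": schema_hints,
        # Safety net for files the reader cannot open
        "badRecordsPath": bad_records_path,
    }


def build_stream(spark, source_path, options):
    # `spark` is an existing SparkSession on a Databricks cluster
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
```

Usage would look like `build_stream(spark, "/path/to/landing", autoloader_options("/path/to/schema", "/path/to/bad_records", "event_ts timestamp"))`, with all three paths replaced by your own locations.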

 

Data Engineer | Apache Spark | Delta Lake | Databricks


4 REPLIES

balajij8
Contributor III

The _rescued_data column in Auto Loader works for JSON and CSV formats, not Parquet. Parquet is a strongly typed format where data types are encoded in the file metadata. When a timestamp column becomes INT64 in a new file, it creates a file-format-level incompatibility that surfaces during Parquet reader initialization, before Auto Loader's schema evolution or rescued data logic kicks in.

FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH: Expected Spark type timestamp, actual Parquet type INT64 is generally raised by the low-level Parquet reader when it detects the metadata mismatch.

  • schemaEvolutionMode: addNewColumnsWithTypeWidening - It handles widening (int to long), but timestamp to INT64 is not widening. It is an incompatible change.

  • rescuedDataColumn - It only rescues data for JSON/CSV, where type mismatches are detected during parsing, not for Parquet format-level conflicts.

You can use badRecordsPath for Parquet files with incompatible type changes. It catches file-level read failures and allows the stream to continue while logging the error files.
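Because the column types live in the Parquet footer, drift like this can also be spotted before a file reaches the stream. The sketch below is one way to do that, assuming pyarrow is installed and the file is reachable through a local or mounted path. `find_type_drift` and `parquet_column_types` are hypothetical helper names, and the type strings are pyarrow's names, not Spark's.

```python
def find_type_drift(expected, actual):
    """Return {column: (expected_type, actual_type)} for columns whose types differ."""
    return {
        col: (want, actual[col])
        for col, want in expected.items()
        if col in actual and actual[col] != want
    }


def parquet_column_types(path):
    # Imported lazily so the comparison helper works without pyarrow installed.
    import pyarrow.parquet as pq

    # read_schema reads the file's schema metadata, not the row data
    schema = pq.read_schema(path)
    return {field.name: str(field.type) for field in schema}
```

For example, `find_type_drift({"event_ts": "timestamp[us]"}, parquet_column_types(path))` would report `event_ts` if a new file stores it as `int64`, so the file can be quarantined or fixed upstream instead of failing the stream.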

Maxrb
New Contributor III

@balajij8 Thanks for your reply.

I do see what you mean. At the same time, I see that _rescued_data works for some type mismatches, which is why I am confused. Do you have any idea why it works when I get string data in an integer column but not for this specific case?

Thanks!

balajij8
Contributor III

@Maxrb 

String to Integer is a Value-Level Mismatch - The Parquet reader successfully reads the STRING physical type from the file. Auto Loader attempts to cast STRING to INTEGER (a Spark-level operation). The cast fails for a value such as "invalid" during Spark's type conversion. Auto Loader's rescued data logic catches this conversion failure and routes the value to _rescued_data.

Timestamp to INT64 is a Format-Level Mismatch - The Parquet reader examines the file metadata and sees conflicting physical type annotations. It rejects the file as invalid at the format level before any data is read. Auto Loader never gets a chance to apply rescued data logic because the failure happens in the Parquet reader, not in Spark's type system.
