Databricks Community

Maxrb · 13 hours ago

Hi,

I am using Auto Loader to load Parquet files into my Unity Catalog with the following settings:

.option("cloudFiles.format", "parquet")
.option("cloudFiles.inferColumnTypes", "true")
.option("cloudFiles.schemaEvolutionMode", "addNewColumnsWithTypeWidening")
.option("cloudFiles.rescuedDataColumn", "_rescued_data")

In one of the newest files, a column that used to be a timestamp is now a Long type. I was under the impression that these faulty records would just propagate to the `_rescued_data` column, but unfortunately the stream breaks, and I can only fix my pipeline with the badRecordsPath option.

Why does this break my pipeline with: Expected Spark type timestamp, actual Parquet type INT64. SQLSTATE: KD001, instead of moving the bad data to _rescued_data?

Thanks in advance!

Yogasathyandrun · 12 hours ago

What you're seeing comes down to where the type mismatch is detected.

For Parquet, some mismatches can be handled at the Auto Loader layer and end up in _rescued_data, while others fail earlier inside the Parquet reader itself.

In your example, the existing schema expects a timestamp, but the new file stores the column as a plain INT64. That mismatch is detected by the Parquet reader before Auto Loader's rescue logic gets a chance to process the row, which is why you get:

FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH

instead of seeing the value in _rescued_data.

The reason a string appearing in an integer column may be rescued is that the file can still be read successfully and the mismatch is encountered during value conversion/parsing at the record level. In that case Auto Loader can route the problematic value to _rescued_data.

So the distinction is roughly:

Record-level parsing/conversion issue → can often be rescued into _rescued_data
Parquet schema/file-level incompatibility → fails during file read and cannot be rescued
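
To picture the ordering, here is a toy model in plain Python. It is only an illustration of the two stages described above, not Auto Loader's or the Parquet reader's actual code; `read_file`, `FileTypeMismatch`, the `READABLE` table and the type names are all hypothetical.

```python
class FileTypeMismatch(Exception):
    """Hypothetical stand-in for a file-level read failure."""


# Assumption for this toy model: which (expected, declared) type pairs
# the reader is willing to open at all.
READABLE = {("int", "int"), ("int", "string"), ("timestamp", "timestamp")}


def read_file(expected, declared, rows):
    """Toy two-stage read: file-level check first, per-record conversion second.

    expected -- column name -> type name the target schema wants
    declared -- column name -> type name found in the file metadata
    rows     -- list of dicts holding the raw values
    """
    # Stage 1: file-level check. A failure here aborts the whole file,
    # so no row ever reaches the rescue logic in stage 2.
    for col, want in expected.items():
        got = declared[col]
        if (want, got) not in READABLE:
            raise FileTypeMismatch(
                f"Expected type {want}, actual type {got} for column {col}"
            )

    # Stage 2: record-level conversion. A value that cannot be converted
    # is nulled out and kept in the rescued map instead of failing the read.
    out = []
    for row in rows:
        parsed, rescued = {}, {}
        for col, want in expected.items():
            try:
                parsed[col] = int(row[col]) if want == "int" else row[col]
            except ValueError:
                parsed[col] = None
                rescued[col] = row[col]
        parsed["_rescued_data"] = rescued or None
        out.append(parsed)
    return out
```

In this model a string such as "invalid" in an int column lands in the rescued map, while a long declared where a timestamp is expected raises before any row is looked at.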

For production pipelines, the common pattern is to combine:

cloudFiles.schemaHints for known drift-prone columns, and
badRecordsPath as a safety net for unexpected schema incompatibilities.
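
A minimal sketch of that combination, assuming PySpark on Databricks. The option names come from this thread plus `cloudFiles.schemaLocation`; the helper names, paths and the `event_ts TIMESTAMP` hint are hypothetical placeholders, and it is worth verifying on your runtime that these options can be combined as shown.

```python
def autoloader_options(schema_location, bad_records_path, schema_hints):
    """Build the option map for an Auto Loader Parquet stream.

    All three arguments are caller-supplied placeholders, e.g.
    schema_hints="event_ts TIMESTAMP" for a drift-prone column.
    """
    return {
        "cloudFiles.format": "parquet",
        "cloudFiles.inferColumnTypes": "true",
        "cloudFiles.schemaEvolutionMode": "addNewColumnsWithTypeWidening",
        "cloudFiles.rescuedDataColumn": "_rescued_data",
        "cloudFiles.schemaLocation": schema_location,
        # Pin the types of columns known to drift between files.
        "cloudFiles.schemaHints": schema_hints,
        # Safety net for files the reader cannot open at all.
        "badRecordsPath": bad_records_path,
    }


def build_stream(spark, source_path, options):
    """Return a streaming DataFrame; `spark` is an existing SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**options)
        .load(source_path)
    )
```

Keeping the options in a plain dictionary makes it easy to review or unit test the configuration without a running cluster.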

Data Engineer | Apache Spark | Delta Lake | Databricks


balajij8 · 13 hours ago

The _rescued_data column in Auto Loader works for JSON and CSV formats, not Parquet. Parquet is a strongly typed format where data types are encoded in the file metadata. When a timestamp column becomes INT64 in a new file, it creates a file-format-level incompatibility that surfaces during Parquet reader initialization, before Auto Loader's schema evolution or rescued data logic kicks in.

FAILED_READ_FILE.PARQUET_COLUMN_DATA_TYPE_MISMATCH: Expected Spark type timestamp, actual Parquet type INT64 generally comes from the low-level Parquet reader when it detects the metadata mismatch.

schemaEvolutionMode: addNewColumnsWithTypeWidening - Handles widening (int to long), but timestamp to INT64 is not widening. It is an incompatible change.
rescuedDataColumn - Only rescues data for JSON/CSV, where type mismatches are detected during parsing, not for Parquet format-level conflicts.
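
To make the widening point concrete, here is a small plain-Python sketch. The `WIDENING` table is a deliberately short, hypothetical list for illustration; the authoritative set of supported widenings is whatever your Databricks runtime documents, and `classify_change` is not a Databricks API.

```python
# Hypothetical, illustrative widening table: old type -> types it may widen to.
WIDENING = {
    "byte": {"short", "int", "long"},
    "short": {"int", "long"},
    "int": {"long"},
    "float": {"double"},
}


def classify_change(old_type, new_type):
    """Classify a column type change as unchanged, widening or incompatible."""
    if old_type == new_type:
        return "unchanged"
    if new_type in WIDENING.get(old_type, set()):
        return "widening"
    # Anything not listed, such as timestamp -> long, is not a widening.
    return "incompatible"
```

With this table, `classify_change("int", "long")` is a widening, while `classify_change("timestamp", "long")` is incompatible, which matches the case in this thread.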

You can use badRecordsPath for Parquet files with incompatible type changes. It catches file-level read failures and allows the stream to continue while logging the error files.

Maxrb · 13 hours ago

@balajij8 Thanks for your reply.

I do see what you mean. At the same time, I see that _rescued_data works for some type mismatches, which is why I am confused. Do you have any idea why it works when I get string data in an integer column but not for this specific case?

Thanks!

balajij8 · 12 hours ago

@Maxrb

String to Integer is a value-level mismatch: the Parquet reader successfully reads the STRING physical type from the file. Auto Loader then attempts to cast STRING to INTEGER (a Spark-level operation). The cast fails for a value such as "invalid" during Spark's type conversion, and Auto Loader's rescued data logic catches this conversion failure and routes the value to _rescued_data.

Timestamp to INT64 is a format-level mismatch: the Parquet reader examines the file metadata and sees conflicting physical type annotations. It rejects the file as invalid at the format level before any data is read, so Auto Loader never gets a chance to apply its rescued data logic. The failure happens in the Parquet reader, not in Spark's type system.
