Data Engineering
Join discussions on data engineering best practices, architectures, and optimization strategies within the Databricks Community. Exchange insights and solutions with fellow data engineers.

Querying CDF on a Delta-Sharing table after data type change in the Table (INT to DECIMAL)

fdubourdeau
New Contributor

Hi,

I am trying to query the CDF of a Delta Sharing table that has had a data type change in one of its columns, from INT to DECIMAL. When reading the specific version where the schema change happened, I receive an error about a conflict between the new schema of the Delta Sharing table (with DECIMAL) and the Parquet files that still have INT in that column.

I have tried adding mergeSchema = true but I still receive the same error.

My question is: is there any way to maintain readability of the CDF of a Delta Sharing table after a data type change to its schema, or is a full reload of the table required in that specific case?

Thanks!

ACCEPTED SOLUTION

anuj_lathi
Databricks Employee

Hi, this is a known limitation of Change Data Feed. Here's what's happening and your options.

Why This Happens

Changing a column from INT to DECIMAL is a non-additive schema change. When reading CDF in batch mode, Delta Lake applies a single schema (the latest or end-version schema) to all Parquet files in the version range. Since the older Parquet files still have INT and the schema expects DECIMAL, you get a conflict.

`mergeSchema` won't help here: it handles additive changes like new columns, not data type changes.

Your Options

1. Split your CDF reads at the schema change boundary (recommended if you want to avoid a full reload)

Read CDF in two separate ranges, before and after the type change, then cast and union:

# Read versions BEFORE the type change (e.g., up to version N-1)
df_before = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", start_version)
    .option("endingVersion", schema_change_version - 1)
    .table("your_table")
)

# Read versions AFTER the type change (version N onward)
df_after = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", schema_change_version)
    .option("endingVersion", end_version)
    .table("your_table")
)

# Cast the old INT column to the table's new DECIMAL type, then union.
# Note: cast("decimal") defaults to DECIMAL(10,0), which would drop any
# fractional digits -- use the exact precision/scale of the new schema,
# e.g. "decimal(18,2)".
df_before_casted = df_before.withColumn("col_name", df_before["col_name"].cast("decimal(18,2)"))
df_combined = df_before_casted.unionByName(df_after)

 

You can find the version where the schema changed using `DESCRIBE HISTORY your_table`.
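For example (the table name is a placeholder; the exact operation string may vary by runtime, so inspect the full history if the filter comes back empty):

```sql
-- List the table's commit history; the type change typically appears as a
-- 'CHANGE COLUMN' (ALTER TABLE) operation. The matching 'version' value is
-- the schema_change_version used to split the CDF reads above.
DESCRIBE HISTORY your_table;
```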

2. Full reload of the table

If splitting reads is too complex for your pipeline, a one-time full reload at the new schema is the simplest path. After the reload, future CDF reads will work normally since all files will have the new schema.

3. Use type widening for future-proofing (DBR 15.4+)

The type widening feature lets you widen column types (e.g., INT to DECIMAL) without rewriting data files. However, even with type widening, CDF reads across the type change boundary are still not supported, so you would still need to split reads. The benefit is that it avoids the costly full-table rewrite on the provider side.

Note: Type widening over Delta Sharing requires both provider and recipient on DBR 16.1+ and is only supported for Databricks-to-Databricks sharing.
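As a provider-side sketch (table, column, and target type are placeholders; check the type widening docs for the widenings supported on your DBR version):

```sql
-- Opt the table in to type widening (provider side, DBR 15.4+)
ALTER TABLE your_table
  SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');

-- Widen the column in the metadata only; existing Parquet files
-- keep INT and are upcast on read instead of being rewritten
ALTER TABLE your_table ALTER COLUMN col_name TYPE DECIMAL(18, 2);
```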

TL;DR

You cannot read CDF across a data type change in a single query; this is by design. Split your reads at the schema change version boundary, or do a full reload. For future schema changes, consider type widening to minimize disruption.


Anuj Lathi
Solutions Engineer @ Databricks

