
Schema Evolution and Schema Enforcement without Delta Live Tables & Unity Catalog

Rupa0503
Visitor

In Delta Lake, schema evolution with mergeSchema handles column additions perfectly: new columns get added and old rows get NULL. But when there is a data type change in the incoming data (for example, a column that was INT now arriving as STRING from the source), mergeSchema throws an error even in append mode. In formats like ORC, Avro, and Parquet, however, mergeSchema handles both column additions and data type changes without any issue.

So my question is: is this data type restriction in Delta's mergeSchema a design decision to protect existing data integrity, or is there a way to handle data type changes in append mode without resorting to casting before the write or doing a full overwrite?

Also, in a production pipeline where the source schema keeps changing dynamically and we cannot hardcode the schema, what is the recommended approach to handle data type changes gracefully without breaking the pipeline?
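To make the behavior concrete, here is a minimal sketch of what I am seeing (a Delta-enabled Spark session is assumed; the path and column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.getOrCreate()  # assumes Delta Lake is configured
path = "/tmp/schema_demo"  # illustrative location

# Initial write: "id" is INT.
schema_v1 = StructType([StructField("id", IntegerType())])
spark.createDataFrame([(1,)], schema_v1).write.format("delta").save(path)

# Appending a NEW column with mergeSchema works; old rows read back as NULL for "label".
schema_v2 = StructType([StructField("id", IntegerType()),
                        StructField("label", StringType())])
(spark.createDataFrame([(2, "a")], schema_v2)
     .write.format("delta").mode("append")
     .option("mergeSchema", "true").save(path))

# Appending "id" as STRING fails schema enforcement even with mergeSchema.
schema_v3 = StructType([StructField("id", StringType()),
                        StructField("label", StringType())])
try:
    (spark.createDataFrame([("3", "b")], schema_v3)
         .write.format("delta").mode("append")
         .option("mergeSchema", "true").save(path))
except Exception as e:
    print("Append rejected:", type(e).__name__)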

#deltalake #Schema

2 REPLIES

Lu_Wang_ENB_DBX
Databricks Employee

Here are the answers to your questions:

  1. Is Delta’s restriction a design decision?
    Yes. In Delta, mergeSchema is mainly for schema evolution by adding columns; type changes are still controlled by schema enforcement unless the change qualifies for type widening. If the mismatch does not meet type-widening conditions, Delta follows normal enforcement rules instead of silently changing the column type.

  2. Can append mode handle type changes without pre-casting or a full overwrite?
    Yes, but only for supported widening changes such as INT -> BIGINT, and only when the target table has delta.enableTypeWidening = true and schema evolution is enabled on the write (see the sketch just below).
    INT -> STRING is not a supported automatic widening path; the docs explicitly list it as an unsupported data type change in Auto Loader, where it gets rescued instead of widened.
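    As a sketch of that widening path (the table name is illustrative, a Delta-enabled Spark session is assumed, and the runtime must support Delta type widening):

from pyspark.sql.types import StructType, StructField, LongType

# Enable type widening on the target table first.
spark.sql("ALTER TABLE demo.events SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")

# With widening enabled and mergeSchema on the append, a column that was INT
# can arrive as BIGINT and the table schema is widened in place instead of failing.
wide = spark.createDataFrame([(2**40,)], StructType([StructField("id", LongType())]))
(wide.write.format("delta")
     .mode("append")
     .option("mergeSchema", "true")
     .saveAsTable("demo.events"))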

  3. Is there example code showing the intended pattern?
    Yes. A file in the databrickslabs/lakebridge GitHub repo uses exactly this Delta pattern: enable type widening first, then alter the column types:

sqls: list | None = [
  f"ALTER TABLE {table_identifier} SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')",
  f"ALTER TABLE {table_identifier} ALTER COLUMN recon_metrics.row_comparison.missing_in_source TYPE BIGINT",
  f"ALTER TABLE {table_identifier} ALTER COLUMN recon_metrics.row_comparison.missing_in_target TYPE BIGINT",
]
  4. What is the recommended production approach for dynamic schema drift?
    Use a Bronze/Silver pattern with Auto Loader. By default, Auto Loader is designed to avoid breaking on type mismatches: for text formats it infers columns as STRING, and with rescue modes it places unsupported type-change values into the rescued data column instead of failing the pipeline.
    If you want automatic widening for compatible changes, use addNewColumnsWithTypeWidening plus delta.enableTypeWidening = true; unsupported changes like INT -> STRING should be rescued/quarantined and normalized downstream rather than forced into the Delta target schema during append. A minimal ingestion sketch follows below.
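    A minimal Bronze-layer sketch of that pattern (paths, checkpoint locations, and table names are illustrative; addNewColumnsWithTypeWidening requires a runtime that supports it):

bronze = (spark.readStream.format("cloudFiles")
          .option("cloudFiles.format", "json")
          .option("cloudFiles.schemaLocation", "/checkpoints/events/schema")
          # Evolve new columns, widen compatible type changes, and rescue
          # everything else into _rescued_data instead of failing the stream.
          .option("cloudFiles.schemaEvolutionMode", "addNewColumnsWithTypeWidening")
          .load("/landing/events"))

(bronze.writeStream
       .option("checkpointLocation", "/checkpoints/events/bronze")
       .option("mergeSchema", "true")
       .toTable("demo.events_bronze"))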

  5. Summary

  • New columns → use mergeSchema.
  • Widenable type changes → enable type widening and keep append mode.
  • Non-widening changes like INT -> STRING → do not rely on Delta mergeSchema; land the raw data, rescue the bad values, and reconcile/cast in a downstream layer (see the sketch after this list), or explicitly alter/overwrite the table schema when you choose to accept the change.
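A sketch of that downstream reconcile/cast step (the _rescued_data column name is Auto Loader's default rescue column; table and column names are made up):

from pyspark.sql import functions as F

silver = (spark.table("demo.events_bronze")
          # If "id" arrived as STRING it was rescued into _rescued_data (a JSON
          # string) and the typed "id" column is NULL; recover and cast it here.
          .withColumn("id_fixed",
                      F.coalesce(F.col("id").cast("bigint"),
                                 F.get_json_object("_rescued_data", "$.id").cast("bigint")))
          .drop("id", "_rescued_data")
          .withColumnRenamed("id_fixed", "id"))

silver.write.format("delta").mode("append").saveAsTable("demo.events_silver")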

Rupa0503
Visitor

Is it okay if we define the schema manually for production, since we are not using Auto Loader?