Schema Evolution and Schema Enforcement without Delta Live Tables & Unity Catalog
6 hours ago
In Delta Lake, schema evolution with mergeSchema handles column additions perfectly: new columns get added and old rows get NULL for them. But when there is a data type change in the incoming data (for example, a column that was INT now arriving as STRING from the source), mergeSchema throws an error even in append mode. In formats like ORC, Avro, and Parquet, however, mergeSchema handles both column additions and data type changes without any issue.

So my questions are: is this data type restriction in Delta's mergeSchema a design decision to protect existing data integrity, or is there a way to handle data type changes in append mode without resorting to casting before the write or doing a full overwrite? And in a production pipeline where the source schema keeps changing dynamically and we cannot hardcode the schema, what is the recommended approach to handle data type changes gracefully without breaking the pipeline?
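For concreteness, this is roughly what I mean in PySpark (a minimal sketch; the table path and column names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.getOrCreate()

# Initial write: amount is INT
spark.range(3).withColumn("amount", lit(100)).write.format("delta").save("/tmp/demo")

# Adding a new column with mergeSchema works; existing rows get NULL for country
(spark.range(3).withColumn("amount", lit(100)).withColumn("country", lit("IN"))
    .write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/demo"))

# The same column now arriving as STRING fails, even in append mode with mergeSchema,
# because Delta cannot merge IntegerType and StringType
(spark.range(3).withColumn("amount", lit("100"))
    .write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/demo"))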
#deltalake #Schema
5 hours ago
Here are the answers to your questions:

Is Delta's restriction a design decision?
Yes. In Delta, mergeSchema is mainly for schema evolution by adding columns; type changes are still controlled by schema enforcement unless the change qualifies for type widening. If the mismatch does not meet the type-widening conditions, Delta follows normal enforcement rules instead of silently changing the column type.
Can append mode handle type changes without pre-casting or a full overwrite?
Yes, but only for supported widening changes such as INT -> BIGINT, and only when the target table has delta.enableTypeWidening = true and schema evolution is enabled on the write. INT -> STRING is not a supported automatic widening path; the docs explicitly call it an unsupported data type change in Auto Loader, where it gets rescued instead of widened.
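For the widening case, the flow looks roughly like this (a sketch; it assumes a Delta/runtime version that supports type widening, and the table and DataFrame names are placeholders):

# Opt the target table in to type widening
spark.sql("ALTER TABLE my_table SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")

# incoming_df now delivers 'amount' as BIGINT where the table column is INT.
# With schema evolution enabled on the write, the column is widened instead of erroring.
(incoming_df
    .write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("my_table"))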
Is there existing code that shows the intended pattern?
Yes. A file in the databrickslabs/lakebridge GitHub repo uses exactly this Delta pattern of enabling type widening first, then altering the column types:

# Enable type widening on the table, then widen nested INT columns to BIGINT
sqls: list | None = [
    f"ALTER TABLE {table_identifier} SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')",
    f"ALTER TABLE {table_identifier} ALTER COLUMN recon_metrics.row_comparison.missing_in_source TYPE BIGINT",
    f"ALTER TABLE {table_identifier} ALTER COLUMN recon_metrics.row_comparison.missing_in_target TYPE BIGINT",
]
What is the recommended production approach for dynamic schema drift?
Use a Bronze/Silver pattern with Auto Loader. By default, Auto Loader is designed to avoid breaking on type mismatches: for text formats it infers columns as STRING, and with rescue modes it places unsupported type-change values into the rescued data column instead of failing the pipeline. If you want automatic widening for compatible changes, use addNewColumnsWithTypeWidening plus delta.enableTypeWidening = true; unsupported changes like INT -> STRING should be rescued/quarantined and normalized downstream rather than forced into the Delta target schema during append.
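A minimal Bronze ingestion sketch with Auto Loader (paths, formats, and table names are illustrative; the widening mode also needs delta.enableTypeWidening = true on the target table):

bronze = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/bronze_schema")
    # widens compatible type changes; incompatible values (e.g. INT -> STRING)
    # land in the _rescued_data column instead of failing the stream
    .option("cloudFiles.schemaEvolutionMode", "addNewColumnsWithTypeWidening")
    .load("/mnt/landing/source"))

(bronze.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/bronze")
    .option("mergeSchema", "true")
    .toTable("bronze_source"))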
Summary
- New columns → use mergeSchema.
- Widenable type changes → enable type widening and keep append mode.
- Non-widening changes like INT -> STRING → do not rely on Delta mergeSchema; land the raw data, rescue the bad values, and reconcile/cast in a downstream layer, or explicitly alter/overwrite the table schema when you choose to accept the change.
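On that last point, if you do decide to accept a non-widening change, make it an explicit, deliberate operation rather than relying on append-time evolution (sketch; names are placeholders):

# Deliberately replace the table schema along with the data
(new_df.write.format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable("my_table"))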
3 hours ago
Is it okay if we define the schema manually for production, since we are not using Auto Loader?
3 hours ago
Yes, defining the schema manually for production is fine when you are not using Auto Loader.
With a manually provided schema, you should expect stricter enforcement: Delta will not automatically absorb non-widening type changes like INT -> STRING in append mode.
So the practical recommendation is:
- Use a fixed contract at the ingestion boundary if your downstream table is a curated production Delta table.
- Handle drift before the final write, either by normalizing/casting in code (see the sketch below), or by landing raw data in a staging/bronze table and quarantining bad/type-drifted records for later reconciliation.
- If the drift is only a supported widening change, you can enable type widening on the Delta table; otherwise, a manual schema alone will not solve the issue.
Summary: a manual schema by itself is not a graceful solution for arbitrary source type changes.
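For the "normalize/cast in code" option, a minimal sketch (table and DataFrame names are placeholders; it assumes the incoming data contains all of the target table's columns):

from pyspark.sql.functions import col

target_schema = spark.table("prod.sales").schema

# Force the incoming DataFrame into the contract defined by the production table.
# Values that cannot be cast (e.g. "abc" into an INT column) become NULL here,
# so consider routing such rows to a quarantine table before this step.
conformed = incoming_df.select(
    [col(f.name).cast(f.dataType).alias(f.name) for f in target_schema.fields]
)

conformed.write.format("delta").mode("append").saveAsTable("prod.sales")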