
AutoLoader Type Widening

parth_db
New Contributor

I have a few doubts regarding Auto Loader behavior and capabilities. Please check and correct my assumptions wherever they are incorrect, much appreciated. Below are my scenario and code.

Example scenario:
The target managed Delta table (type widening enabled) has a field 'Quantity' of type INT.

My code:

from pyspark.sql.functions import col
from delta.tables import DeltaTable

df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", schema_location)
        # Schema inference
        .option("cloudFiles.inferColumnTypes", "true")
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns")
        .option("cloudFiles.rescuedDataColumn", "_rescued_data")
        .option("cloudFiles.allowOverwrites", "true")
        .option("header", "true")
        .load(source_dir)
        .withColumn("_rescued_data", col("_rescued_data").cast("string"))
)

def upsert_batch(microBatchDF, batchId: int):
    # If target table doesn't exist yet, bootstrap it from inferred schema
    if not table_exists_uc(target_table):
        (microBatchDF.write
            .format("delta")
            .mode("overwrite")
            .saveAsTable(target_table)
        )
        return

    delta_t = DeltaTable.forName(spark, target_table)

    (delta_t.alias("t")
        .merge(microBatchDF.alias("s"), "t.order_id = s.order_id")
        .withSchemaEvolution()
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )

query = (
    df.writeStream
      .foreachBatch(upsert_batch)
      .option("checkpointLocation", checkpoint_location)
      .trigger(availableNow=True)
      .start()
)

Let's say on our first run we ingest a file in which every row has an integer value in the quantity field, so my assumption is that our stored schema (version 0) will have quantity inferred as INT.
On the next run we load a file that has float values in the quantity column. In this case, will our target Delta table automatically widen to accommodate the float data? My understanding is that, before the writeStream ever runs, the readStream will parse the float values against the latest stored schema (quantity as INT), fail to do so, and push them into the rescued data column, so the target table's quantity field will never see these float values. Please confirm this behavior.
Also, as an additional question, does Auto Loader ever parse floats to int (e.g., 4.5 -> 4 or 2.0 -> 2), and if not, why?

Thanks,
Parth


nayan_wylde
Esteemed Contributor

  • Auto Loader does not change column types on its own. It can add new columns automatically, but it won't widen the types of existing columns.
  • With your current settings and an initially inferred Quantity INT, if the next CSV has float values, Auto Loader will try to parse them as INT. Those values won't load as floats; they'll become NULL and/or be captured in _rescued_data (depending on the format and parsing rules). They won't reach your target Quantity column as floats.
  • Type widening is a Delta Lake table feature (DBR 15.4+), not an Auto Loader feature. It can widen INT → DOUBLE/DECIMAL/LONG, but only during writes and only when enabled on the target table. You still need the read side to produce data of the widened type (e.g., cast to DOUBLE) before the MERGE; see the sketch after this list.
  • Auto Loader won't "round" floats to ints for you. It doesn't silently coerce 4.5 → 4 or 2.0 → 2. If you explicitly cast in Spark, standard casting rules apply (the fraction is truncated), but Auto Loader itself avoids implicit lossy conversions for data-quality reasons.

parth_db
New Contributor

Thanks for the response. As a final verification, my understanding is that Auto Loader schema inference and schema evolution only track when columns are added, and those additions get stored as the new schema version. In all other cases - column drops, type mismatches (such as a float value in an INT column) - the stored schema stays the same and the data is rescued or ignored according to our rescue-column settings. Therefore, even if our target Delta table has type widening enabled, since the inferred schema never widens itself, the target table will not get widened via Auto Loader unless we manually update the schema column data type to allow accurate parsing.


1) Could you tell me if there is any configuration/option/parameter/setting to override this behavior and force data type widening on existing columns?

2) Is the same behavior seen when working with Delta Live Tables?

3) And, more as a suggestion, is there any approach you would recommend for handling such schema changes when the upstream data (100s of columns) keeps changing, so providing an explicit schema or schema hints is counterproductive, and the initial inference may result in types that are too narrow and cause mismatches later?

Apologies for the long questions; I am trying to understand Auto Loader's functions and capabilities for handling such edge cases when the upstream is not stable and clean.

Thanks,

Parth

 

nayan_wylde
Esteemed Contributor

1. No, Auto Loader does not provide an option to automatically widen column types. Its schema evolution modes are:

addNewColumns → Adds new nullable columns.
rescue → Captures unexpected fields in _rescued_data.
failOnNewColumns → Stops the stream on schema drift.

Type widening is a Delta Lake feature, not an Auto Loader feature. You can enable it on the target table with:

ALTER TABLE <table_name> SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true');
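Once the property is set, my understanding is that a write carrying the wider type still has to request schema evolution for the column to widen. A hedged illustration, where the table name and batch_df are placeholders:

# Hypothetical illustration: "main.bronze.orders" has Quantity stored as INT and
# delta.enableTypeWidening set to true; batch_df carries Quantity as DOUBLE.
# With schema evolution requested on the write, the existing column is widened
# instead of the write being rejected.
(batch_df.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # schema evolution must still be requested
    .saveAsTable("main.bronze.orders"))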

 

2. Yes, Delta Live Tables (DLT) uses Auto Loader under the hood for streaming ingestion. The default behavior is the same:

  • Adds new columns when detected.
  • Does not change existing column types automatically. For type widening, you rely on Delta Lake's schema evolution and type widening features in the Silver/Gold layers; DLT does not override Auto Loader's limitations (a sketch follows below).
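For reference, here is a minimal, hedged DLT sketch (the table name, path, and DOUBLE hint are hypothetical) showing the same pattern inside a pipeline: Auto Loader does the ingestion, and any widening has to come from hints or explicit casts plus the table property, not from Auto Loader itself.

import dlt

# Hypothetical names/paths; illustrates that DLT wraps the same Auto Loader options.
@dlt.table(
    name="bronze_orders",
    table_properties={"delta.enableTypeWidening": "true"}
)
def bronze_orders():
    return (
        spark.readStream
            .format("cloudFiles")
            .option("cloudFiles.format", "csv")
            .option("cloudFiles.inferColumnTypes", "true")
            .option("cloudFiles.schemaHints", "Quantity DOUBLE")
            .option("header", "true")
            .load("/Volumes/main/raw/orders/")
    )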

3. Suggested approaches for unstable upstream schemas (100s of columns)

When explicit schemas or hints are impractical, consider these strategies:

  • Use schema-on-read with flexible types.
  • For CSV/JSON, Auto Loader infers all columns as STRING when inferColumnTypes is off. You can then cast selectively downstream (see the sketch after this list).
  • Alternatively, set cloudFiles.inferColumnTypes=true but combine it with schemaHints for critical columns.
  • Leverage _rescued_data for audit and repair. Keep _rescued_data in Bronze for anomaly detection.
  • Apply transformations in Silver to reconcile rescued values when needed.
  • Cast numeric columns to DOUBLE or DECIMAL in the Bronze layer.
  • Enable column mapping on the Delta table ('delta.columnMapping.mode' = 'name') to handle column renames and drops as metadata-only changes without data rewrites, which improves performance and resilience to structural changes.
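As a hedged sketch of the first two bullets (the paths and column names here are hypothetical), ingest everything as STRING in Bronze, keep _rescued_data for auditing, and cast only the columns you rely on in Silver:

from pyspark.sql.functions import col

# Bronze: leave inferColumnTypes off so all CSV columns arrive as STRING,
# which avoids narrow initial inference and type-mismatch rescues later.
bronze_df = (
    spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("cloudFiles.schemaLocation", "/Volumes/main/meta/orders_schema/")
        .option("cloudFiles.rescuedDataColumn", "_rescued_data")
        .option("header", "true")
        .load("/Volumes/main/raw/orders/")
)

# Silver: cast selectively, only the columns whose types actually matter downstream.
silver_df = (
    bronze_df
        .withColumn("Quantity", col("Quantity").cast("double"))
        .withColumn("OrderDate", col("OrderDate").cast("date"))
)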

parth_db
New Contributor

This provides a lot of clarity. Thank you.

Sanjeeb2024
Contributor II

Thank you @nayan_wylde for the details. This is really useful.

Sanjeeb Mohapatra