Data Engineering

Using cloudFiles.inferColumnTypes with inferSchema and without defining schema checkpoint

BF7
New Contributor III

Two Issues:

1. What is the behavior of cloudFiles.inferColumnTypes with and without cloudFiles.inferSchema? Why would you use both?

2. When can cloudFiles.inferColumnTypes be used without a schema checkpoint? How does that affect its behavior?

Discussion:

1. I see example notebooks from Databricks that use inferColumnTypes both WITH inferSchema: delta-live-tables-notebooks/dms-dlt-cdc-demo/resources/dlt/dms-mysql-cdc-demo.py at main · databrick...    and WITHOUT inferSchema: delta-live-tables-notebooks/dms-dlt-cdc-demo/resources/dlt/dms-mysql-cdc-demo.py at main · databrick...

What is the use case for using both, or only one of them? I would think that using both together is redundant and just creates unnecessary compute overhead, yet from my experiments with these options that doesn't appear to be entirely true.

2. Schema checkpoints: are they necessary or not?

All the documentation I find on cloudFiles.inferColumnTypes says that when using it, you must also define a schema checkpoint: Configure schema inference and evolution in Auto Loader - Azure Databricks | Microsoft Learn

However, I see some example notebooks from Databricks that use cloudFiles.inferColumnTypes = True without ever defining a schema checkpoint (a rough sketch of this pattern follows the links below):

- delta-live-tables-notebooks/dms-dlt-cdc-demo/resources/dlt/dms-mysql-cdc-demo.py at main · databrick...

- delta-live-tables-notebooks/change-data-capture-example/notebooks/2-Retail_DLT_CDC_Python.py at ma...
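For concreteness, a rough sketch of the pattern in those notebooks: cloudFiles.inferColumnTypes turned on inside a DLT pipeline with no schema checkpoint defined anywhere. The format, table name, and path are placeholders, not values taken from the linked files.

import dlt  # available when this code runs inside a Delta Live Tables pipeline

# `spark` is provided by the Databricks / DLT runtime.
@dlt.table
def orders_raw():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        # infer real column types instead of treating every column as a string
        .option("cloudFiles.inferColumnTypes", "true")
        # note: no cloudFiles.schemaLocation is set anywhere in this pipeline
        .load("/landing/orders")  # placeholder path
    )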

 

2 REPLIES

BigRoux
Databricks Employee
  1. Behavior of cloudFiles.inferColumnTypes with and without cloudFiles.inferSchema:
     When cloudFiles.inferColumnTypes is enabled, Auto Loader attempts to infer the appropriate data type for each column instead of defaulting everything to strings, which is the default behavior for file formats like JSON, CSV, and XML.
     Without enabling cloudFiles.inferSchema, Auto Loader does not perform automatic schema inference; users must provide a schema explicitly or use schema hints. When both cloudFiles.inferColumnTypes and cloudFiles.inferSchema are enabled together, Auto Loader performs full schema inference on the incoming data, determining appropriate column data types from the sampled data. This is especially useful for file formats lacking inherent type encoding (e.g., CSV, JSON).
     Why use both: the combination is beneficial when you want Auto Loader to infer both the schema structure (new columns, changes) and the column data types dynamically, reducing manual intervention in managing the schema during ingestion.
  2. Using cloudFiles.inferColumnTypes without a schema checkpoint, and how that affects its behavior:
     The cloudFiles.inferColumnTypes option can technically be enabled without specifying a schema checkpoint (cloudFiles.schemaLocation), but this setup is not recommended. Without a schema checkpoint, inferred schema changes cannot be tracked or persisted across runs, which leads to potential issues when new data arrives with schema alterations.
     The schema checkpoint lets Auto Loader persist schema evolution information and manage additions such as new columns or changes in the data structure across micro-batches. Without a schema checkpoint, cloudFiles.inferColumnTypes only infers column types for the current batch or sample scope, and schema consistency becomes the user's responsibility.
     Using both cloudFiles.inferColumnTypes and a schema checkpoint allows seamless management of schema evolution while ensuring column types are accurately inferred and tracked (see the sketch after this list). Missing checkpoint information may result in redundant inference and susceptibility to runtime errors if the data evolves unexpectedly.
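For illustration, a minimal sketch of the documented pattern: cloudFiles.inferColumnTypes backed by a schema location (the schema checkpoint). The format, paths, and table name are placeholder assumptions, not values from this thread or the linked notebooks.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    # Infer real column types instead of reading every column as a string
    .option("cloudFiles.inferColumnTypes", "true")
    # Schema checkpoint: persists the inferred schema and its evolution across runs
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/orders/_schemas")
    # Optional: pin types you already know instead of relying purely on inference
    .option("cloudFiles.schemaHints", "order_id bigint, amount decimal(18,2)")
    .load("/mnt/landing/orders")  # placeholder input path
)

(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/orders")
    .trigger(availableNow=True)
    .toTable("bronze.orders")  # placeholder target table
)

Omitting the cloudFiles.schemaLocation line from this sketch is the un-checkpointed variant discussed in point 2.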

 

Hope this helps. BigRoux.

BF7
New Contributor III

Yes! This is exactly what I needed! Thank you so much!
