Schema inference with Auto Loader (non-DLT and DLT)

ilarsen
Contributor

Hi.

 

Another question, this time about schema inference and column types.  I have dabbled with DLT and with structured streaming using Auto Loader (as in, not DLT).  My data source use case is JSON files containing nested structures.

 

I noticed that in the resulting streaming DLT table, all columns were strings.  In the resulting Delta table from the structured streaming + Auto Loader approach, the nested columns are structs.

 

  • Is this the option cloudFiles.inferColumnTypes at work?
  • As I understand it from the doc, if I were to set it to false in the non-DLT structured streaming approach (roughly like the sketch below), the columns would all be strings, correct?
  • It doesn't look like I set anything for that option in the DLT declaration, so is false the default for DLT?  Based on the doc, I assume false is what DLT uses:
cloudFiles.inferColumnTypes
Type: Boolean
Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when inferring JSON and CSV datasets. See schema inference for more details.
Default value: false
  • If I set inferColumnTypes to false in the structured streaming approach, would schema changes in those nested struct columns then not cause failures due to schema evolution, because they're just strings instead?
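
For reference, my non-DLT read looks roughly like this; the paths and table name below are placeholders rather than my real ones:

# `spark` is the SparkSession provided by the Databricks runtime.
# Auto Loader (cloudFiles) source over nested JSON files.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    # The option in question: "true" infers typed columns (structs, bigint, ...),
    # "false" (the documented default outside DLT) keeps everything as strings.
    .option("cloudFiles.inferColumnTypes", "true")
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/my_source")
    .load("/mnt/landing/my_source/")
)

# Write to a bronze Delta table with its own checkpoint location.
(
    df.writeStream
    .option("checkpointLocation", "/mnt/landing/_checkpoints/my_source")
    .trigger(availableNow=True)
    .toTable("bronze.my_source")
)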


Cheers.

 

Accepted Solution

Kaniz
Community Manager

Hi @ilarsen, certainly! Let’s delve into the nuances of schema inference and column types in the context of Delta Live Tables (DLT) and Structured Streaming with Auto Loader.

 

DLT vs. Structured Streaming:

  • DLT (Delta Live Tables) is a managed service from Databricks that simplifies streaming data processing and ETL tasks. It offers a domain-specific language (DSL) that lets you write streaming code in fewer lines (see the sketch after this list).
  • Structured Streaming, on the other hand, is a core feature of Apache Spark. It allows you to process streaming data using structured APIs and SQL expressions.
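
To make the contrast concrete, here is a minimal, illustrative DLT declaration over an Auto Loader source (the table name and path are placeholders, not from your pipeline); the plain Structured Streaming equivalent additionally has to manage its own writeStream, checkpoint, and target table, as in the sketch in your question:

import dlt

# Minimal DLT declaration: the pipeline manages the write, checkpointing
# and table lifecycle; you only declare the transformation.
@dlt.table(name="bronze_events", comment="Raw JSON ingested with Auto Loader")
def bronze_events():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/mnt/landing/events/")
    )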

Schema Inference and Column Types:

  • When dealing with JSON files containing nested structures, schema inference plays a crucial role in determining column types.
  • By default, when inferring schema from JSON datasets, all columns are treated as strings. This behavior applies to both DLT and stock Spark Structured Streaming.
  • However, you can control this behavior using the cloudFiles.inferColumnTypes option.

cloudFiles.inferColumnTypes Option:

  • This option determines whether to infer exact column types during schema inference.
  • When set to true, the system attempts to infer more precise data types based on the sample data. For example, it may recognize integers, floats, or nested structures.
  • When set to false, all columns are inferred as strings.
  • In your case, the DLT declaration does not explicitly set this option, so it defaults to false (i.e., inferring columns as strings). You can also set it explicitly in the declaration, as sketched below.
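
For illustration, a sketch of setting the option explicitly inside a DLT declaration (again with placeholder names):

import dlt

@dlt.table(name="bronze_events_typed")
def bronze_events_typed():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Explicitly request typed columns (structs, bigint, double, ...)
        # instead of relying on the default described above.
        .option("cloudFiles.inferColumnTypes", "true")
        .load("/mnt/landing/events/")
    )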

Schema Evolution and Nested Struct Columns:

  • If you use false for schema inference (treating all columns as strings), schema changes in nested struct columns will not cause failures due to schema evolution.
  • However, keep in mind that treating everything as strings may not be ideal for complex nested structures. You won’t benefit from the precision of data types.
  • If you expect schema changes (e.g., adding new fields or modifying nested structures), consider setting cloudFiles.inferColumnTypes to true. This way, the system will adapt to evolving schemas (see the sketch below).
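
On the non-DLT side, schema-evolution behaviour can also be tuned separately from type inference via cloudFiles.schemaEvolutionMode; here is a sketch using the same placeholder paths as above:

# `spark` is the SparkSession provided by the Databricks runtime.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.inferColumnTypes", "true")
    # "addNewColumns" (the default) stops the stream when new fields appear
    # so they can be added to the schema on restart; "rescue" keeps the
    # schema fixed and routes unexpected fields into _rescued_data instead.
    .option("cloudFiles.schemaEvolutionMode", "rescue")
    .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/my_source")
    .load("/mnt/landing/my_source/")
)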

Decision Considerations:

  • DLT provides convenience and simplification but comes at an additional cost. Evaluate whether the benefits align with your use case.
  • Structured Streaming remains a robust and widely used feature. It’s part of open-source Apache Spark and will continue to evolve.
  • Understand the trade-offs, costs, and benefits before choosing between DLT and stock Spark Structured Streaming.

Remember that both DLT and Structured Streaming have their merits, and your choice should align with your specific requirements and constraints. 

 

Happy streaming! 😊



ilarsen
Contributor

A late thank you for your reply, Kaniz.  From my experience with the platform so far, I do like what schema inference does and I prefer to use it.
