Schema inference with auto loader (non-DLT and DLT)

ilarsen — Tue, 21 Nov 2023 23:27:15 GMT

Hi.

Another question, this time about schema inference and column types. I have dabbled with DLT and with structured streaming plus Auto Loader (as in, not DLT). My data source is JSON files that contain nested structures.

I noticed that in the resulting streaming DLT table, all columns were strings. In the resulting delta table from the structured streaming + auto loader approach, the nested columns are structs.

Is this the option cloudFiles.inferColumnTypes at work?
As I understand it from the doc, if I were to use false in the non-DLT structured streaming approach, the columns would all be strings, correct?
It doesn't look like I set anything for that option in the DLT declaration, so is false the default for DLT? Based on the doc, I assume DLT is using false:

cloudFiles.inferColumnTypes
Type: Boolean
Whether to infer exact column types when leveraging schema inference. By default, columns are inferred as strings when inferring JSON and CSV datasets. See schema inference for more details.
Default value: false

If I set cloudFiles.inferColumnTypes to false in the structured streaming approach, would changes inside those nested columns no longer cause schema evolution failures, because the columns are just strings instead of structs?
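
For reference, here is a minimal sketch of how the option could be set explicitly in the non-DLT approach, so the behaviour does not depend on a default. The helper names (autoloader_options, read_json_stream) and any paths passed in are hypothetical, and spark is assumed to be an existing SparkSession, as in a Databricks notebook. Only the cloudFiles.* option names come from the Auto Loader doc.

```python
def autoloader_options(schema_location, infer_column_types):
    """Build Auto Loader options for a JSON source (hypothetical helper)."""
    return {
        "cloudFiles.format": "json",
        # Where Auto Loader stores the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Passed as a string: "true" asks for exact types (e.g. structs),
        # "false" is the documented default that infers JSON columns as strings.
        "cloudFiles.inferColumnTypes": str(infer_column_types).lower(),
    }


def read_json_stream(spark, source_path, schema_location, infer_column_types=False):
    """Return a streaming DataFrame; `spark` is assumed to be a SparkSession."""
    return (
        spark.readStream.format("cloudFiles")
        .options(**autoloader_options(schema_location, infer_column_types))
        .load(source_path)
    )
```

Calling read_json_stream once with infer_column_types=True and once with False against the same files, then comparing printSchema() on the two results, should show whether this option accounts for the struct versus string difference.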

Cheers.

Re: Schema inference with auto loader (non-DLT and DLT)

ilarsen — Tue, 23 Jan 2024 20:45:35 GMT

A late thank you for your reply, Kaniz. From my experience in the platform so far, I do like what schema inference does and I prefer to use it.
