Autoloader schema inference

Cosimo_F_
Contributor

Hello,

Is it possible to turn off schema inference with AutoLoader?

Thank you,

Cosimo

Thank you for your reply!

The documentation mentions passing a schema to AutoLoader but does not explain how. The solution is simply to use the .schema method like so:

spark.\
  readStream.\
  schema(<schema>).\
  format("cloudFiles").\
  load()
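For anyone wondering what can go in place of <schema>: a minimal sketch, assuming a CSV source. The column names, the EXPLICIT_SCHEMA constant and the read_with_explicit_schema helper are all hypothetical. The .schema method accepts either a StructType or a DDL-formatted string; the string form is used here to keep it short.

```python
# Hypothetical columns; replace with the real layout of your files.
# .schema() accepts a DDL-formatted string as well as a StructType.
EXPLICIT_SCHEMA = "id INT, name STRING, amount DOUBLE"


def read_with_explicit_schema(spark, path, schema=EXPLICIT_SCHEMA, file_format="csv"):
    """Build an Auto Loader stream that uses a fixed schema instead of inferring one."""
    return (
        spark.readStream
        .format("cloudFiles")                      # Auto Loader source
        .option("cloudFiles.format", file_format)  # format of the incoming files
        .schema(schema)                            # explicit schema, so nothing is inferred
        .load(path)
    )
```

In a notebook, where spark already exists, this would be called as read_with_explicit_schema(spark, "<path>").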

Best,

Cosimo.

Do you mean .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")?

If that's the case, you can check the following docs: https://docs.databricks.com/ingestion/auto-loader/options.html

Hi Jose,

No, that's the location where Auto Loader stores the inferred schema (so it works together with schema inference). Specifying a schema location does not turn off schema inference, which is what I wanted. In fact, schemaLocation is a required option _unless_ the schema is passed explicitly as I showed.

Best,

Cosimo

shivagarg
New Contributor II

https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/patterns.html#language-pyt...

You can enforce the schema, or use "cloudFiles.schemaHints" to override the inference.

 

df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    # makes sure that you don't lose data
    .option("rescuedDataColumn", "_rescued_data")
    # provide a schema here for the files
    .schema(<schema>)
    .load(<path>)
)
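
The schema hints alternative mentioned above could look roughly like this. It is only a sketch: the read_with_schema_hints helper, the default hint and the column name in it are hypothetical. Inference stays on here, so cloudFiles.schemaLocation has to be set, and the hint only overrides the inferred type of the columns it names.

```python
def read_with_schema_hints(spark, path, schema_location, hints="amount DOUBLE"):
    """Build an Auto Loader stream that infers the schema but pins some column types."""
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        # required because the schema is inferred rather than passed explicitly
        .option("cloudFiles.schemaLocation", schema_location)
        # hypothetical hint: force the type of the named column(s) only
        .option("cloudFiles.schemaHints", hints)
        .load(path)
    )
```

Unlike passing .schema(...), this does not turn inference off; it only corrects the types of the hinted columns.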