Re: Autoloader schema inference

Cosimo_F_ · ‎09-06-2022

Hello,

is it possible to turn off schema inference with AutoLoader?

Thank you,

Cosimo

Cosimo_F_ · ‎09-08-2022

Thank you for your reply!

The documentation mentions passing a schema to AutoLoader but does not explain how. The solution is simply to use the .schema method like so:

spark.\
  readStream.\
  schema(<schema>).\
  format("cloudFiles").\
  load()
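For anyone landing here later, a slightly fuller sketch of the same idea. It assumes an existing `spark` session on Databricks; the schema string, the function name, and the default `json` format are hypothetical and only there for illustration. The schema is written as a DDL string, which `.schema()` accepts in place of a `StructType`.

```python
# Hypothetical schema, written as a DDL string so no pyspark imports are needed.
EXAMPLE_SCHEMA = "id INT, name STRING, created_at TIMESTAMP"


def read_with_explicit_schema(spark, path, file_format="json", schema=EXAMPLE_SCHEMA):
    """Build an Auto Loader stream that uses the given schema instead of inferring one."""
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", file_format)  # format of the source files
        .schema(schema)                            # explicit schema passed by the caller
        .load(path)                                # `path` is supplied by the caller
    )
```

Calling `read_with_explicit_schema(spark, "<path>")` should return a streaming DataFrame with exactly the columns listed in the schema string.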

Best,

Cosimo.

jose_gonzalez · ‎09-09-2022

Do you mean .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")?

If that's the case, you can check the following docs: https://docs.databricks.com/ingestion/auto-loader/options.html

Cosimo_F_ · ‎09-10-2022

Hi Jose,

No, that's the location where the inferred schema is stored (it works together with schema inference, as do schema hints). Specifying a schema location does not turn off schema inference, which is what I wanted. In fact, schemaLocation is a required option _unless_ the schema is passed explicitly, as I showed.
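To make the rule concrete, here is a small plain-Python sketch of it. The helper name `auto_loader_options` and its parameters are made up for this example and are not part of any Databricks API; it only encodes the behaviour described in this thread.

```python
def auto_loader_options(file_format, schema=None, schema_location=None):
    """Return the reader options for an Auto Loader read (hypothetical helper)."""
    options = {"cloudFiles.format": file_format}
    if schema_location is not None:
        # Passed through whenever the caller supplies it.
        options["cloudFiles.schemaLocation"] = schema_location
    elif schema is None:
        # No explicit schema means inference, which needs somewhere to store its result.
        raise ValueError("cloudFiles.schemaLocation is required when no schema is passed")
    return options
```

The returned dict could then be applied with `spark.readStream.format("cloudFiles").options(**opts)`, followed by `.schema(...)` only when an explicit schema is used.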

Best,

Cosimo

shivagarg · ‎12-03-2024

https://docs.databricks.com/en/ingestion/cloud-object-storage/auto-loader/patterns.html#language-pyt...

You can enforce the schema, or use "cloudFiles.schemaHints" to override the inference for specific columns.

df = (
  spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "csv")
  .option("header", "true")
  .option("rescuedDataColumn", "_rescued_data")  # makes sure that you don't lose data
  .schema(<schema>)  # provide a schema here for the files
  .load(<path>)
)
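For the schema hints route, a minimal sketch along the same lines. It assumes an existing `spark` session on Databricks; the function name and the hinted columns are hypothetical. Inference still runs here, so a schema location is passed as well, and the hints only pin the types of the columns they name.

```python
# Hypothetical hints: only these columns are pinned, the rest are still inferred.
EXAMPLE_HINTS = "amount DECIMAL(18,2), created_at TIMESTAMP"


def read_with_schema_hints(spark, path, schema_location, hints=EXAMPLE_HINTS):
    """Build an Auto Loader stream that infers the schema but applies type hints."""
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.schemaLocation", schema_location)  # where the inferred schema is stored
        .option("cloudFiles.schemaHints", hints)               # overrides for the named columns
        .load(path)
    )
```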