Autoloader schema inference
09-06-2022 01:32 PM
Hello,
is it possible to turn off schema inference with AutoLoader?
Thank you,
Cosimo
09-08-2022 11:35 AM
Thank you for your reply!
The documentation mentions passing a schema to AutoLoader but does not explain how. The solution is simply to use the .schema method like so:
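(spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "<file_format>")  # Auto Loader needs the file format
    .schema(<schema>)  # explicit schema, so nothing is inferred
    .load("<path>"))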
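For anyone wondering what <schema> can look like: as far as I know, .schema accepts either a StructType or a DDL-formatted string. Below is a minimal sketch using a DDL string. The column names, the default file format, and the function name are made up for illustration, and `spark` is assumed to be an existing SparkSession on Databricks.

```python
# Hypothetical columns, written as a DDL-formatted schema string.
EVENT_SCHEMA = "id BIGINT, name STRING, event_time TIMESTAMP"


def read_with_explicit_schema(spark, source_path, file_format="json"):
    """Build an Auto Loader stream that uses EVENT_SCHEMA instead of inference.

    `spark` is assumed to be an active SparkSession; `source_path` and
    `file_format` are placeholders supplied by the caller.
    """
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", file_format)
        .schema(EVENT_SCHEMA)  # explicit schema, so no inference is needed
        .load(source_path)
    )
```

Calling read_with_explicit_schema(spark, "<path>") should then return a streaming DataFrame with exactly those three columns.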
Best,
Cosimo.
09-09-2022 04:04 PM
Do you mean .option("cloudFiles.schemaLocation", "<path_to_checkpoint>")?
If that's the case, you can check the following docs: https://docs.databricks.com/ingestion/auto-loader/options.html
09-10-2022 07:01 AM
Hi Jose,
No, that's the location where Auto Loader stores the inferred schema (it works together with schema inference, as do schema hints). Specifying a schema location does not turn off schema inference, which is what I wanted. In fact, schemaLocation is a required option _unless_ the schema is passed explicitly, as I showed.
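To make the difference concrete, here is a rough sketch of a helper that picks one of the two modes. The helper name and its parameters are hypothetical, and the ValueError is my own guard that mirrors the requirement described above; it is not something Auto Loader raises in this form.

```python
def auto_loader_reader(spark, file_format, schema=None, schema_location=None):
    """Return a configured Auto Loader reader (without calling .load()).

    Hypothetical helper: `schema` may be a StructType or a DDL string,
    `schema_location` is a path where the inferred schema gets stored.
    """
    if schema is None and schema_location is None:
        # Guard added for illustration only.
        raise ValueError("Pass either an explicit schema or a schema location.")

    reader = (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", file_format)
    )
    if schema is not None:
        # Explicit schema: inference is not needed.
        return reader.schema(schema)
    # No schema given: inference is used and needs a schema location.
    return reader.option("cloudFiles.schemaLocation", schema_location)
```

With an explicit schema the schemaLocation option is never set; without one, the reader relies on inference and the schema location.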
Best,
Cosimo
12-03-2024 07:26 AM
You can enforce the schema, or use "cloudFiles.schemaHints" to override the inference.
df = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "csv")
    .option("header", "true")
    .option("rescuedDataColumn", "_rescued_data")  # makes sure that you don't lose data
    .schema(<schema>)  # provide a schema here for the files
    .load(<path>)
)
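The schemaHints route could look roughly like the sketch below. Inference stays on here, so a schema location is still passed; the hints only override the types of the listed columns. The column names, types, and function name are assumptions for illustration, and `spark` is assumed to be an existing SparkSession.

```python
# Hypothetical columns whose inferred types should be overridden (DDL string).
SCHEMA_HINTS = "id BIGINT, amount DECIMAL(10,2)"


def read_with_schema_hints(spark, source_path, schema_location):
    """Build an Auto Loader CSV stream that infers the schema but applies hints.

    `source_path` and `schema_location` are placeholders supplied by the caller.
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .option("cloudFiles.schemaLocation", schema_location)  # needed for inference
        .option("cloudFiles.schemaHints", SCHEMA_HINTS)  # overrides inferred types
        .load(source_path)
    )
```

Columns not mentioned in the hints keep whatever type inference assigns them.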