topic Autoloader issue in Data Engineering

Autoloader issue

The_Demigorgan — Tue, 21 Nov 2023 13:57:42 GMT

I'm trying to ingest data from Parquet files using Autoloader. I have my own custom schema and I don't want the schema to be inferred from the Parquet files.

During readStream everything is fine. But during writeStream, it somehow infers the schema from the files, and I get a schema mismatch error.

Any idea why this is happening? Any help would be appreciated.

Re: Autoloader issue

cgrant — Thu, 05 Dec 2024 17:22:43 GMT

In this case, please make sure you specify the schema explicitly when reading the Parquet files and do not specify any inference options.

Something like:

spark.readStream.format("cloudFiles").schema(schema)...

If you want to more easily grab the schema, you can read with the batch reader and capture the schema:

schema = spark.read.parquet("/your/path/here").schema
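
Putting the two pieces together, here is a minimal sketch of what the read side might look like. The function name and its parameters are made up for illustration, and it assumes `spark` is the active session on a Databricks cluster (the `cloudFiles` source is not available in open-source Spark). `schema` can be a `StructType`, such as the one captured from the batch reader above, or a DDL string.

```python
def read_parquet_with_fixed_schema(spark, source_path, schema):
    """Build an Auto Loader stream that uses the supplied schema as-is.

    Hypothetical helper for illustration; `spark` is assumed to be an
    active SparkSession on Databricks.
    """
    return (
        spark.readStream.format("cloudFiles")
        # Tell Auto Loader the underlying file format.
        .option("cloudFiles.format", "parquet")
        # Deliberately no inference-related options here (for example
        # cloudFiles.inferColumnTypes or cloudFiles.schemaHints), so the
        # explicit schema below is the only source of column types.
        .schema(schema)
        .load(source_path)
    )
```

On a cluster this could be called as `read_parquet_with_fixed_schema(spark, "/your/path/here", schema)`. If the mismatch error still shows up at writeStream, it may be worth checking that the custom schema's column types really match what is stored in the Parquet files, since the stream only starts reading them once the write begins.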