When I try to ingest parquet files with Auto Loader using the following code
df = (spark
  .readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .load(filePath))
I get the following error:
java.lang.UnsupportedOperationException: Schema inference is not supported for format: parquet. Please specify the schema.
I find this strange because parquet files embed their schema in the file footer, so there is nothing to infer.
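For example, a plain batch read against the same directory (using the Spark-visible form of the mount path from the workaround below) picks the schema up from the file footers without any .schema(...) call:
# Batch reads don't need a schema supplied: Spark takes it straight from
# the parquet footers.
batch_df = spark.read.parquet('/mnt/dops/streamtest/public/streamme/')
batch_df.printSchema()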
If I pull the schema from one of the existing parquet files, Auto Loader works:
import os

# /dbfs/... is the FUSE path for local file APIs like os.listdir;
# stripping the leading '/dbfs' gives the path Spark understands.
filePath = '/dbfs/mnt/dops/streamtest/public/streamme/'
files = os.listdir(filePath)
files.sort()

# Read the newest file just to grab its schema.
sdata = spark.read.parquet(os.path.join(filePath[5:], files[-1]))

df = (spark
  .readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .schema(sdata.schema)
  .load(filePath))
This does work, but the os.listdir call reintroduces a directory listing, which defeats one of the primary benefits of Auto Loader: not having to list the directory myself.
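For completeness, a more compact variant of the same workaround would presumably be to let Spark pull the schema from the directory itself (untested sketch; it still scans the directory up front, so it doesn't address the underlying problem):
# Let Spark read the schema from the parquet footers in the directory
# instead of picking a file manually with os.listdir.
inferredSchema = spark.read.parquet(filePath[5:]).schema

df = (spark
  .readStream
  .format("cloudFiles")
  .option("cloudFiles.format", "parquet")
  .schema(inferredSchema)
  .load(filePath))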
Is this expected behavior? I have trouble understanding why Auto Loader cannot read the schema from the parquet files themselves.
Thanks,
Ben