Pyspark read multiple Parquet type expansion failure

Erik_L · 03-22-2023

Problem

Reading a directory of nearly identical Parquet tables fails when column X has type float in some of them and type double in others.

Attempts at resolving

Reading the files as a stream
Disabling Delta caching and vectorization
Calling .cache() explicitly

Notes

This is a known problem, but I need a workaround.

Example code

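# The Databricks I/O cache is a session-level setting, not a reader option.
spark.conf.set("spark.databricks.io.cache.enabled", "false")

# Fails when column X is float in some files and double in others.
(spark.read.option("mergeSchema", False)
    .parquet("s3://my-bucket/data/*")
    .write.mode("append").saveAsTable("my_table"))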
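
One possible workaround, offered as a sketch rather than a confirmed fix: read each path on its own so that no single read sees both types, cast column X to double, and union the results. The helper name `read_parquet_widened` and the `paths` argument are hypothetical, and the sketch assumes each path holds files that agree on the type of X.

```python
from functools import reduce


def read_parquet_widened(spark, paths, column="X", target_type="double"):
    """Read each Parquet path separately, cast `column`, and union the results.

    `read_parquet_widened` and `paths` are hypothetical names for this sketch.
    """
    frames = []
    for path in paths:
        # Assumption: every file under `path` agrees on the type of `column`,
        # so this read never sees float and double together.
        df = spark.read.parquet(path)
        # Widening float to double is lossless; double columns pass through unchanged.
        frames.append(df.withColumn(column, df[column].cast(target_type)))
    if not frames:
        raise ValueError("paths must contain at least one Parquet path")
    # unionByName aligns columns by name instead of by position.
    return reduce(lambda left, right: left.unionByName(right), frames)
```

On Databricks, `paths` could be built with something like `[f.path for f in dbutils.fs.ls("s3://my-bucket/data/")]`, filtered down to the Parquet files, and the returned DataFrame written with the same `.write.mode("append").saveAsTable("my_table")` call. Reading file by file is slow for large directories, so grouping the paths by the type of X and reading each group in one call may be worth the extra step.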
