Problem
I have a directory in S3 with a bunch of data files named like "data-20221101.parquet". They all have the same columns: timestamp, reading_a, reading_b, reading_c. In the earlier files the readings are floats, but in the later ones they are doubles. When I run the following read, it fails with what looks like a schema merge failure.
from pyspark.sql.functions import col, expr
from pyspark.sql.types import DoubleType, LongType, StructField, StructType

schema = StructType([
    StructField("timestamp", LongType()),
    StructField("reading_a", DoubleType()),
    StructField("reading_b", DoubleType()),
    StructField("reading_c", DoubleType()),
])

(spark.read.schema(schema)
    .option("mergeSchema", False)
    .parquet('s3://readings/path/to/data/data-*.parquet')
    .write
    .saveAsTable('readings.data'))
And it gives the following error:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableDouble cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableFloat
Attempts
- turning mergeSchema on and off;
- setting the schema explicitly and leaving it out; and
- reading the individual files one at a time (which succeeds; see the sketch below).
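For example, reading one file from each era works on its own and shows the type difference. The specific file names below are illustrative guesses, not the exact keys in the bucket:

# Reading a single early file and a single late file individually works fine;
# the file names here are placeholders for representative files.
old = spark.read.parquet('s3://readings/path/to/data/data-20220101.parquet')
old.printSchema()  # reading_a/b/c show up as float

new = spark.read.parquet('s3://readings/path/to/data/data-20221101.parquet')
new.printSchema()  # reading_a/b/c show up as double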
What I think is going on
Spark reads a file that has the float type and keeps using that type for the files that follow, instead of upcasting to double up front, so it blows up when it reaches a file that actually contains doubles. Really, Spark should obey the schema I supplied from the start and upcast the floats as it reads.
More info
Someone else did a rather deep dive into this and shows several different methods, but their final solution is a hack and not sustainable: they read every file individually, cast each one to their schema, and then union the results (sketched below). That negates a lot of the benefit of Spark's magical multi-file reading.
https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce
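For reference, my understanding of that approach looks roughly like the following sketch. The boto3 listing is an assumption on my part (any way of enumerating the keys would do), and it reuses the schema variable defined above.

from functools import reduce

import boto3
from pyspark.sql.functions import col

# List the data files; this assumes fewer than 1000 objects (no pagination).
s3 = boto3.client('s3')
resp = s3.list_objects_v2(Bucket='readings', Prefix='path/to/data/data-')
paths = [f"s3://readings/{obj['Key']}" for obj in resp['Contents']]

def read_one(path):
    # Read a single file with whatever types it has on disk, then cast every
    # column to the target type from the schema defined above.
    df = spark.read.parquet(path)
    return df.select([col(f.name).cast(f.dataType) for f in schema.fields])

# Union the per-file frames and write the result out.
merged = reduce(lambda a, b: a.unionByName(b), [read_one(p) for p in paths])
merged.write.saveAsTable('readings.data')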
Question
How can I read many files whose parquet types differ only slightly, without resorting to the hack above?