After many, many hours of trying to resolve this, I figured out a hack that _solves_ the problem, though it's not optimal: I read the directory listing of the files, read each file individually, merge them all via unions, and then save the result back out.
```python
from pyspark.sql.types import (
    StructType, StructField, StringType, TimestampType, LongType, DoubleType
)

# Empty DataFrame with the exact schema I want every file coerced to.
my_schema = StructType([
    StructField("ordered", StringType()),
    StructField("by", TimestampType()),
    StructField("schema", LongType()),
    StructField("provided", DoubleType()),
])
df = spark.createDataFrame(data=[], schema=my_schema)

# ... build table_files from the directory listing (see below) ...

for table_file in table_files:
    df = df.union(
        spark.read.option("mergeSchema", False)
        .option("spark.databricks.io.cache.enabled", False)
        .parquet(
            f"s3://my-bucket/data/{table_file}"
        )
        # Transformations; union matches columns by position,
        # so the select order must match my_schema.
        .select('ordered', 'by', 'schema', 'provided')
    )
```
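For completeness, the listing and save-out steps I glossed over look roughly like this. This is only a sketch: it assumes Databricks (hence the io-cache option above), so the listing uses `dbutils.fs.ls`; the output path is just an example, not my real one.

```python
# Rough sketch of the listing + save-out steps (assumes Databricks, where
# dbutils is available; paths here are examples, not my real ones).

# Directory listing: collect the individual parquet files under the table path.
table_files = [
    f.name
    for f in dbutils.fs.ls("s3://my-bucket/data/")
    if f.name.endswith(".parquet")
]

# ... the union loop from above runs here ...

# Save the merged result back out with a single, consistent schema.
(
    df.write
      .mode("overwrite")
      .parquet("s3://my-bucket/data_merged/")  # example output path
)
```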