03-22-2023 02:03 PM
Problem
Reading nearly equivalent Parquet tables from a single directory fails when column X has type float in some files and type double in others.
Attempts at resolving
- Using streaming files
- Removing delta caching, vectorization
- Using .cache() explicitly
Notes
This is a known problem, but I need a workaround.
Example code
(
    spark.read.option("mergeSchema", False)
    .option("spark.databricks.io.cache.enabled", False)
    .parquet("s3://my-bucket/data/*")
    .write.mode("append")
    .saveAsTable("my_table")
)
03-22-2023 02:34 PM
After many, many hours of trying to resolve this, I found a hack that _solves_ the problem, though it is not optimal. I list the files in the directory, read each file on its own, merge the resulting DataFrames with unions, and then save the result.
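from pyspark.sql.types import (
    DoubleType, LongType, StringType, StructField, StructType, TimestampType,
)

# Target schema for the merged table.
my_schema = StructType([
    StructField("ordered", StringType()),
    StructField("by", TimestampType()),
    StructField("schema", LongType()),
    StructField("provided", DoubleType()),
])
# Start from an empty DataFrame that already has the target schema.
df = spark.createDataFrame(data=[], schema=my_schema)
# ...
# Read each file separately and append it to the running union.
for table_file in table_files:
    df = df.union(
        spark.read.option("mergeSchema", False)
        .option("spark.databricks.io.cache.enabled", False)
        .parquet(f"s3://my-bucket/data/{table_file}")
        # Transformations
        .select('ordered', 'by', 'schema', 'provided')
    )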
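A variation on the hack above, offered as a rough sketch rather than a tested fix: cast every column to its target type right after reading each file, so the union never depends on implicit float-to-double widening. The names `TARGET_TYPES`, `build_cast_exprs`, `read_aligned`, and `union_files` are hypothetical helpers, not part of any library, and the column types are assumed to mirror `my_schema`. Only `spark.read.parquet`, `DataFrame.selectExpr`, and `DataFrame.unionByName` are actual PySpark calls.

```python
from functools import reduce

# Assumed target types, written as Spark SQL type names that mirror my_schema.
TARGET_TYPES = {
    "ordered": "string",
    "by": "timestamp",
    "schema": "bigint",
    "provided": "double",
}


def build_cast_exprs(target_types):
    """Return one Spark SQL cast expression per column, in dict order."""
    return [
        f"CAST(`{name}` AS {dtype}) AS `{name}`"
        for name, dtype in target_types.items()
    ]


def read_aligned(spark, path, target_types=TARGET_TYPES):
    """Read a single Parquet file and cast each column to its target type."""
    return spark.read.parquet(path).selectExpr(*build_cast_exprs(target_types))


def union_files(spark, paths, target_types=TARGET_TYPES):
    """Read each path separately, then union the aligned frames by name.

    Raises TypeError if `paths` is empty, because reduce has no initial value.
    """
    frames = [read_aligned(spark, path, target_types) for path in paths]
    return reduce(lambda left, right: left.unionByName(right), frames)
```

With the `table_files` list from the loop above, this could be called as `union_files(spark, [f"s3://my-bucket/data/{f}" for f in table_files])` and then written out with `.write.mode("append").saveAsTable("my_table")`. It still reads one file at a time, so it is unlikely to be any faster than the original loop.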