<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: How to merge parquets with different column types in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/how-to-merge-parquets-with-different-column-types/m-p/7513#M3365</link>
    <description>&lt;P&gt;1) Can you let us know what error message you get when you don't set the schema and use mergeSchema?&lt;/P&gt;&lt;P&gt;2) What happens when you define the schema (with FloatType) and use mergeSchema? What error message do you get?&lt;/P&gt;</description>
    <pubDate>Wed, 22 Mar 2023 22:26:15 GMT</pubDate>
    <dc:creator>mathan_pillai</dc:creator>
    <dc:date>2023-03-22T22:26:15Z</dc:date>
    <item>
      <title>How to merge parquets with different column types</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-merge-parquets-with-different-column-types/m-p/7512#M3364</link>
      <description>&lt;P&gt;&lt;B&gt;Problem&lt;/B&gt;&lt;/P&gt;&lt;P&gt;I have a directory in S3 with a bunch of data files, like "data-20221101.parquet". They all have the same columns: timestamp, reading_a, reading_b, reading_c. In the earlier files the readings are floats, but in the later ones they are doubles. When I run the following read, it fails with a merge error.&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;from pyspark.sql.functions import col, expr
from pyspark.sql.types import DoubleType, LongType, StructField, StructType

schema = StructType([
    StructField("timestamp", LongType()),
    StructField("reading_a", DoubleType()),
    StructField("reading_b", DoubleType()),
    StructField("reading_c", DoubleType()),
])

(spark.read.schema(schema)
    .option("mergeSchema", False)
    .parquet('s3://readings/path/to/data/data-*.parquet')
    .write
    .saveAsTable('readings.data'))&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;It gives the following error:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableDouble cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableFloat&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;&lt;B&gt;Attempts&lt;/B&gt;&lt;/P&gt;&lt;UL&gt;&lt;LI&gt;Turning mergeSchema on and off;&lt;/LI&gt;&lt;LI&gt;setting the schema and not setting the schema; and&lt;/LI&gt;&lt;LI&gt;reading individual files (this succeeds).&lt;/LI&gt;&lt;/UL&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;What I think is going on&lt;/B&gt;&lt;/P&gt;&lt;P&gt;Spark reads a file whose readings are floats, then keeps reading the remaining files as floats, intending to upcast to double afterwards, but this fails when it reaches a file that already contains doubles. &lt;I&gt;Really&lt;/I&gt;, Spark should obey my schema from the start and always upcast.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;More info&lt;/B&gt;&lt;/P&gt;&lt;P&gt;Someone else did a rather deep dive into solving this and showed a bunch of different methods, but their final solution is a hack and not sustainable. They read every file individually, convert each one to their schema, and then merge the results. This negates a lot of the benefit of Spark's magical reading capabilities.&lt;/P&gt;&lt;P&gt;&lt;A href="https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce" target="test_blank"&gt;https://medium.com/data-arena/merging-different-schemas-in-apache-spark-2a9caca2c5ce&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;&lt;B&gt;Question&lt;/B&gt;&lt;/P&gt;&lt;P&gt;How can I read many files with only slightly different parquet types without resorting to the hack above?&lt;/P&gt;</description>
      <pubDate>Fri, 17 Mar 2023 18:25:28 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-merge-parquets-with-different-column-types/m-p/7512#M3364</guid>
      <dc:creator>Erik_L</dc:creator>
      <dc:date>2023-03-17T18:25:28Z</dc:date>
    </item>
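For reference, the per-file approach that the question calls a hack (read every file individually, convert it to the target schema, then merge) might look roughly like the sketch below. It is only an illustration under stated assumptions, not the linked article's code: the names TARGET_TYPES, read_one and read_all are hypothetical, the type strings mirror the schema in the question, and building the list of file paths is left to the caller.

```python
from functools import reduce

# Hypothetical target types, mirroring the schema in the question.
TARGET_TYPES = {
    "timestamp": "bigint",
    "reading_a": "double",
    "reading_b": "double",
    "reading_c": "double",
}


def read_one(spark, path, target_types):
    """Read one parquet file with its own schema, then cast every column."""
    df = spark.read.parquet(path)
    return df.select(
        *[df[name].cast(type_name).alias(name)
          for name, type_name in target_types.items()]
    )


def read_all(spark, paths, target_types=TARGET_TYPES):
    """Cast each file separately, then union the results by column name."""
    frames = [read_one(spark, path, target_types) for path in paths]
    if not frames:
        raise ValueError("no input paths given")
    return reduce(lambda left, right: left.unionByName(right), frames)
```

Assuming paths is a list of the individual file locations, read_all(spark, paths).write.saveAsTable('readings.data') would stand in for the single glob read. The trade-off is the one the question already points out: one read per file instead of a single multi-file read.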
    <item>
      <title>Re: How to merge parquets with different column types</title>
      <link>https://community.databricks.com/t5/data-engineering/how-to-merge-parquets-with-different-column-types/m-p/7513#M3365</link>
      <description>&lt;P&gt;1) Can you let us know what error message you get when you don't set the schema and use mergeSchema?&lt;/P&gt;&lt;P&gt;2) What happens when you define the schema (with FloatType) and use mergeSchema? What error message do you get?&lt;/P&gt;</description>
      <pubDate>Wed, 22 Mar 2023 22:26:15 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/how-to-merge-parquets-with-different-column-types/m-p/7513#M3365</guid>
      <dc:creator>mathan_pillai</dc:creator>
      <dc:date>2023-03-22T22:26:15Z</dc:date>
    </item>
  </channel>
</rss>

