<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Partitioned parquet table (folder) with different structure in Data Engineering</title>
    <link>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34617#M25355</link>
    <description>&lt;P&gt;I think the problem is the overwrite mode: by default, an overwrite replaces all partition folders. The solution is to switch to dynamic partition overwrite, which behaves like an append for untouched partitions: only the folders that receive data are overwritten, and old partitions are not affected:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")&lt;/CODE&gt;&lt;/PRE&gt;</description>
    <pubDate>Mon, 22 Nov 2021 11:34:50 GMT</pubDate>
    <dc:creator>Hubert-Dudek</dc:creator>
    <dc:date>2021-11-22T11:34:50Z</dc:date>
    <item>
      <title>Partitioned parquet table (folder) with different structure</title>
      <link>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34616#M25354</link>
      <description>&lt;P&gt;Hi,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We have a parquet table (folder) in an Azure Storage Account.&lt;/P&gt;&lt;P&gt;The table is partitioned by the column PeriodId (which represents a day in the format YYYYMMDD) and has data from 20181001 until 20211121 (yesterday).&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;We have a new development that adds a new column to this table from 20211101 onwards.&lt;/P&gt;&lt;P&gt;When we read the data for the interval [20211101, 20211121] in a Scala notebook, the dataframe does not return the new column.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;What is the best way to solve this problem without having to rewrite all partitions with all columns?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Would having the table in Delta format instead of parquet solve the problem?&lt;/P&gt;&lt;P&gt;Or is it just a matter of changing the way the table (folder) is saved?&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;This is an excerpt of the code used to &lt;I&gt;create&lt;/I&gt; the table (if it does not exist) or &lt;I&gt;insert&lt;/I&gt; data into a partition:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;val fileFormat      = "parquet"
val filePartitionBy = "PeriodId"
val fileSaveMode    = "overwrite"
val filePath        = "abfss://&amp;lt;container&amp;gt;@&amp;lt;storage account&amp;gt;.dfs.core.windows.net/&amp;lt;folder&amp;gt;/&amp;lt;table name&amp;gt;"
var fileOptions = Map (
                        "header" -&amp;gt; "true",
                        "overwriteSchema" -&amp;gt; "true"
                      )
dfFinal
  .write
  .format      (fileFormat)
  .partitionBy (filePartitionBy)
  .mode        (fileSaveMode)
  .options     (fileOptions)
  .save        (filePath)&lt;/CODE&gt;&lt;/PRE&gt;&lt;P&gt;Thanks in advance,&lt;/P&gt;&lt;P&gt;Tiago Rente.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Nov 2021 11:15:54 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34616#M25354</guid>
      <dc:creator>tarente</dc:creator>
      <dc:date>2021-11-22T11:15:54Z</dc:date>
    </item>
    <item>
      <title>Re: Partitioned parquet table (folder) with different structure</title>
      <link>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34617#M25355</link>
      <description>&lt;P&gt;I think the problem is the overwrite mode: by default, an overwrite replaces all partition folders. The solution is to switch to dynamic partition overwrite, which behaves like an append for untouched partitions: only the folders that receive data are overwritten, and old partitions are not affected:&lt;/P&gt;&lt;PRE&gt;&lt;CODE&gt;spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")&lt;/CODE&gt;&lt;/PRE&gt;</description>
      <pubDate>Mon, 22 Nov 2021 11:34:50 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34617#M25355</guid>
      <dc:creator>Hubert-Dudek</dc:creator>
      <dc:date>2021-11-22T11:34:50Z</dc:date>
    </item>
    <item>
      <title>Re: Partitioned parquet table (folder) with different structure</title>
      <link>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34618#M25356</link>
      <description>&lt;P&gt;Hi Hubert,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;The &lt;I&gt;overwrite&lt;/I&gt; is not overwriting all folders; it only adds the new column to the re-written partitions.&lt;/P&gt;&lt;P&gt;The problem is that even if I filter only the re-written partitions in the dataframe, I do not see the newly added column. However, if I open one of the &lt;I&gt;parquet&lt;/I&gt; files of the re-written partitions, I do see the new column.&lt;/P&gt;&lt;P&gt;If I open one of the &lt;I&gt;parquet&lt;/I&gt; files of the original partitions, I do not see the new column.&lt;/P&gt;&lt;P&gt;I.e., the &lt;I&gt;parquet&lt;/I&gt; files have the new column in the new partitions but not in the original partitions. That is what I would expect.&lt;/P&gt;&lt;P&gt;What I would also expect, and what is not happening, is to get the new column when filtering only the re-written partitions.&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Thanks,&lt;/P&gt;&lt;P&gt;Tiago Rente.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Nov 2021 16:41:56 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34618#M25356</guid>
      <dc:creator>tarente</dc:creator>
      <dc:date>2021-11-23T16:41:56Z</dc:date>
    </item>
    <item>
      <title>Re: Partitioned parquet table (folder) with different structure</title>
      <link>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34619#M25357</link>
      <description>&lt;P&gt;Hi @Tiago Rente,&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;Have you tried schema merging (schema evolution)? The docs are here: &lt;A href="https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging" target="test_blank"&gt;https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;/P&gt;&lt;P&gt;I think that storing your table as Delta will solve this issue. You might want to test it.&lt;/P&gt;</description>
      <pubDate>Fri, 10 Dec 2021 23:30:07 GMT</pubDate>
      <guid>https://community.databricks.com/t5/data-engineering/partitioned-parquet-table-folder-with-different-structure/m-p/34619#M25357</guid>
      <dc:creator>jose_gonzalez</dc:creator>
      <dc:date>2021-12-10T23:30:07Z</dc:date>
    </item>
  </channel>
</rss>

