Partitioned parquet table (folder) with different structure

tarente — Mon, 22 Nov 2021 11:15:54 GMT

Hi,

We have a parquet table (folder) in Azure Storage Account.

The table is partitioned by column PeriodId (represents a day in the format YYYYMMDD) and has data from 20181001 until 20211121 (yesterday).

We have a new development that adds a new column to this table from 20211101 onwards.

When we read the data for the interval [20211101, 20211121] in a Scala notebook, the dataframe does not return the new column.

What is the best way to solve this problem without having to rewrite all partitions with all columns?

Would having the table in Delta format instead of Parquet solve the problem?

Or is it just a matter of changing the way the table (folder) is saved?

This is an excerpt of the code used to create the table (if it does not exist) or insert data into a partition.

val fileFormat      = "parquet"
val filePartitionBy = "PeriodId"
val fileSaveMode    = "overwrite"
val filePath        = "abfss://<container>@<storage account>.dfs.core.windows.net/<folder>/<table name>"
 
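// Writer options passed along with the save
val fileOptions = Map(
  "header"          -> "true",
  "overwriteSchema" -> "true"
)

// Write dfFinal to the table folder, partitioned by PeriodId
dfFinal
  .write
  .format(fileFormat)
  .partitionBy(filePartitionBy)
  .mode(fileSaveMode)
  .options(fileOptions)
  .save(filePath)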

Thanks in advance,

Tiago Rente.

Re: Partitioned parquet table (folder) with different structure

Hubert-Dudek — Mon, 22 Nov 2021 11:34:50 GMT

I think the problem is the overwrite mode: by default, an overwrite replaces all partition folders. The solution is to use dynamic partition overwrite, so that only the folders that receive data are overwritten and the old partitions are not affected:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
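
To illustrate the behaviour described above, here is a minimal sketch, assuming a local SparkSession and a temporary folder in place of the real abfss:// path. The column Id, the sample rows, and the name demoPath are hypothetical and only there to make the example self-contained. The setting takes effect together with the "overwrite" save mode.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session stands in for the notebook's cluster session.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("dynamic-overwrite-sketch")
  .getOrCreate()
import spark.implicits._

// Replace only the partitions that appear in the DataFrame being written.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

// Hypothetical temporary location instead of the storage account path.
val demoPath = Files.createTempDirectory("dyn_overwrite").resolve("table").toString

// Initial load: two partitions.
Seq((1, 20211031), (2, 20211101)).toDF("Id", "PeriodId")
  .write.mode("overwrite").partitionBy("PeriodId").parquet(demoPath)

// Re-write a single partition; PeriodId=20211031 should be left alone.
Seq((3, 20211101)).toDF("Id", "PeriodId")
  .write.mode("overwrite").partitionBy("PeriodId").parquet(demoPath)

// Ids that remain in the table after the second write.
val remainingIds = spark.read.parquet(demoPath).select("Id").as[Int].collect().sorted
```

With the default (static) mode, the second write would instead clear the whole folder before writing the new partition.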

Re: Partitioned parquet table (folder) with different structure

tarente — Tue, 23 Nov 2021 16:41:56 GMT

Hi Hubert,

The overwrite is not overwriting all folders; it only adds the new column to the re-written partitions.

The problem is that even if I filter only the re-written partitions in the dataframe, I do not see the newly added columns. However, if I open one of the parquet files of the re-written partitions, I do see the new columns.

If I open one of the parquet files of the original partitions, I do not see the new columns.

I.e., the parquet files have the new column in the new partitions but not in the original partitions. That is something I would expect.

What I would expect, and what is not happening, is to get the new column when filtering only the re-written partitions.

Thanks,

Tiago Rente.

Re: Partitioned parquet table (folder) with different structure

jose_gonzalez — Fri, 10 Dec 2021 23:30:07 GMT

Hi @Tiago Rente

Have you tried schema evolution? The docs are here: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging

I think that having your table as Delta will solve this issue. You might want to test it.
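
A minimal sketch of the schema merging approach from the linked docs, assuming a local SparkSession and a temporary folder in place of the real table. The columns Id and NewCol, the sample rows, and the name tablePath are hypothetical. As far as I understand, without this option Spark takes the Parquet schema from a single file rather than from the partitions you filter on, which would explain the missing column.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: a local session stands in for the notebook's cluster session.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("merge-schema-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical temporary location instead of the storage account path.
val tablePath = Files.createTempDirectory("parquet_table").resolve("table").toString

// Old partition, written before the new column existed.
Seq((1, 20211031)).toDF("Id", "PeriodId")
  .write.mode("append").partitionBy("PeriodId").parquet(tablePath)

// New partition, written with the additional (hypothetical) column NewCol.
Seq((2, "x", 20211101)).toDF("Id", "NewCol", "PeriodId")
  .write.mode("append").partitionBy("PeriodId").parquet(tablePath)

// mergeSchema makes Spark combine the schemas of all Parquet files it reads.
val merged = spark.read.option("mergeSchema", "true").parquet(tablePath)

// Filtering on the partition column now keeps NewCol in the result.
val recent = merged.filter($"PeriodId" >= 20211101)
```

Rows from the old partitions come back with null in NewCol. The same behaviour can be switched on for the whole session with the spark.sql.parquet.mergeSchema setting. Another option that avoids merging is to read only the new partition folders and pass the table folder in the basePath option, so PeriodId is still returned as a column.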