
Partitioned parquet table (folder) with different structure

tarente
New Contributor III

Hi,

We have a parquet table (folder) in an Azure Storage Account.

The table is partitioned by column PeriodId (represents a day in the format YYYYMMDD) and has data from 20181001 until 20211121 (yesterday).

We have a new development that adds a new column to this table from 20211101 onwards.

When we read the data for the interval [20211101, 20211121] in a Scala notebook, the dataframe does not return the new column.

What is the best way to solve this problem without having to rewrite all partitions with all columns?

Would having the table in Delta format instead of parquet solve the problem?

Or is it just a matter of changing the way the table (folder) is saved?

This is an excerpt of the code used to create the table (if it does not exist) or insert data into a partition.

// Write settings for the partitioned table
val fileFormat      = "parquet"
val filePartitionBy = "PeriodId"
val fileSaveMode    = "overwrite"
val filePath        = "abfss://<container>@<storage account>.dfs.core.windows.net/<folder>/<table name>"
 
val fileOptions = Map (
                        "header" -> "true",
                        "overwriteSchema" -> "true"
                      )
 
// Create the table (if needed) or write data into a partition
dfFinal
  .write
  .format      (fileFormat)
  .partitionBy (filePartitionBy)
  .mode        (fileSaveMode)
  .options     (fileOptions)
  .save        (filePath)
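
For context, a minimal sketch of the read side that misses the new column (the filter bounds are the interval above; PeriodId is assumed here to be an integer column):

import org.apache.spark.sql.functions.col

val dfRead = spark
  .read
  .format(fileFormat)
  .load(filePath)
  .filter(col("PeriodId") >= 20211101 && col("PeriodId") <= 20211121)

dfRead.printSchema() // the new column does not appear here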

Thanks in advance,

Tiago Rente.

3 REPLIES

Hubert-Dudek
Esteemed Contributor III

I think the problem is in the overwrite: a plain overwrite replaces all partition folders. The solution is to combine the write with dynamic partition overwrite, so it only overwrites the partitions that have data in the dataframe and doesn't affect the old partitions:

spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
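
A minimal sketch of how that combines with the write from the question (same variables; only the partitions present in dfFinal would be replaced):

// Only partitions present in dfFinal are replaced; older partitions stay intact.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

dfFinal
  .write
  .format      ("parquet")
  .partitionBy ("PeriodId")
  .mode        ("overwrite")
  .save        (filePath)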

tarente
New Contributor III

Hi Hubert,

The overwrite is not overwriting all folders; it only adds the new column to the rewritten partitions.

The problem is that even if I filter only the rewritten partitions in the dataframe, I do not see the newly added column. However, if I open one of the parquet files of the rewritten partitions, I do see the new column.

If I open one of the parquet files of the original partitions, I do not see the new columns.

In other words, the parquet files have the new column in the new partitions but not in the original partitions, which is what I would expect.

What I would expect, and is not happening, is to get the new column when filtering only the rewritten partitions.

Thanks,

Tiago Rente.

Hi @Tiago Rente,

Have you tried schema merging? Docs here: https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#schema-merging
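
For example, a minimal sketch (mergeSchema asks Spark to reconcile the schemas of all partition files at read time; old partitions would return null for the new column):

// Read with schema merging so columns added in newer partitions show up.
val df = spark
  .read
  .option("mergeSchema", "true")
  .parquet(filePath)

df.printSchema() // should now include the new column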

I think that having your table as Delta will solve this issue. You might want to test it.
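
If you test it, one option is CONVERT TO DELTA, which adds a Delta transaction log over the existing files without rewriting them. A sketch, assuming PeriodId is an INT partition column (the schema behaviour during conversion is worth verifying on a copy first):

// Convert the existing parquet folder to Delta in place; Delta reads return
// null for columns missing from older data files.
spark.sql(s"CONVERT TO DELTA parquet.`$filePath` PARTITIONED BY (PeriodId INT)")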
